# <span style="color:red"> Lecture 23 - Text Data  </span>

<font size = "3">

- Today's lecture covers some basic commands for working with text data

- Text data is increasingly relevant, and large amounts of it are now available.

- Text is messy and unstructured, so it is less straightforward to work with than numerical or categorical data

<font size = "3">

Import necessary libraries

In [1]:
import pandas as pd

<font size = "3">

Import data. 

The file "bills_actions.csv" contains data concerning the 116th United States Congress (January 3, 2019 - January 3, 2021).

In [2]:
bills_actions = pd.read_csv("data_raw/bills_actions.csv")
bills_actions.dtypes

Congress        int64
bill_number     int64
bill_type      object
action         object
main_action    object
category       object
member_id       int64
dtype: object

# <span style="color:red"> I. Basic Text Operations </span>

<font size = "3">

The "category" column tells us what type of data the row corresponds to (amendment, senate bill, etc.)

In [3]:
display(bills_actions["category"])

0         amendment
1         amendment
2         amendment
3       senate bill
4       senate bill
           ...     
3298      amendment
3299      amendment
3300      amendment
3301      amendment
3302      amendment
Name: category, Length: 3303, dtype: object

<font size = "3">

How many categories are there? How often do they appear in the data set?

The first question can be answered with the ``.nunique()`` method of a Pandas Series.

*Both* questions can be answered using the ``.value_counts()`` method.

In [4]:
bills_actions["category"].value_counts()

category
amendment                       1529
house bill                       902
senate bill                      514
house resolution                 234
senate resolution                 60
house joint resolution            22
house concurrent resolution       20
senate concurrent resolution      14
senate joint resolution            8
Name: count, dtype: int64
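
As a minimal sketch of the difference between the two methods, the cell below uses a small made-up Series (`toy_category` is a hypothetical stand-in, not the course data). `.nunique()` returns a single number, while `.value_counts()` returns one count per category.

```python
import pandas as pd

# Hypothetical toy data standing in for the "category" column
toy_category = pd.Series(["amendment", "house bill", "amendment", "senate bill"])

# Number of distinct categories
n_categories = toy_category.nunique()

# Frequency of each category, sorted from most to least common
category_counts = toy_category.value_counts()

print(n_categories)
print(category_counts)
```

The length of the `.value_counts()` result equals the `.nunique()` result, which is why the second method answers both questions.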

<font size = "4">

Subsetting text categories

<font size = "3">

- Suppose we are only interested in bills.

- We will use ``.query()`` to extract the subset of the data corresponding to House bills and Senate bills

- We use the ``.copy()`` method, because we will be adding a column to this dataframe, and we don't want this to affect the original one.

In [5]:
# categories we are interested in
list_categories = ["house bill","senate bill"]

# "in" is used to test whether a word belongs to a list

# remember, "@" lets the query string reference a Python variable defined outside of it
bills = bills_actions.query('category in @list_categories').copy()

# double check that everything worked: compare to above value counts
bills["category"].value_counts()

category
house bill     902
senate bill    514
Name: count, dtype: int64
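
The sketch below illustrates the same subsetting idea on a made-up DataFrame (`toy` and `keep` are hypothetical names). It also shows `.isin()`, an alternative that should select the same rows, and checks that adding a column to the copy leaves the original unchanged.

```python
import pandas as pd

# Hypothetical toy DataFrame (not the course data)
toy = pd.DataFrame({
    "category": ["amendment", "house bill", "senate bill"],
    "bill_number": [1, 2, 3],
})
keep = ["house bill", "senate bill"]

# Two ways to keep the rows whose category is in the list
subset_query = toy.query("category in @keep").copy()
subset_isin = toy[toy["category"].isin(keep)].copy()
same_rows = subset_query.equals(subset_isin)

# Adding a column to the copy does not modify the original DataFrame
subset_query["is_bill"] = True

print(same_rows)
print(toy.columns.tolist())
```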

<font size = "4">

Data manipulation with strings

<font size = "3">

Q: How many bills mention the word "Senator"? 

- The "action" column contains a summary of each bill
- We'll use the ``.str.contains()`` method.
- The method belongs to the collection ``bills["action"].str`` (the Pandas StringMethods collection)

In [6]:
# Create Boolean Series
mentions_senator = bills["action"].str.contains("Senator")

# False = 0, True = 1
# So summing will count how many have the word "Senator"

print(mentions_senator.sum())

453


<font size = "3">

Q: What **percentage** of bills mention the word "Senator"?

Here are two ways we can compute this:

In [7]:
# multiply by 100 to convert to a percent.

print(mentions_senator.sum()/len(mentions_senator) * 100)
print()
print(mentions_senator.mean() * 100)
print()

# round to two digits and add "%" sign
senator_percent = round(mentions_senator.mean() * 100, 2)
print(senator_percent, "%")

31.991525423728813

31.991525423728813

31.99 %
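
Here is a small sketch of the counting and percentage steps on made-up text (`toy_action` is hypothetical). It assumes a column that may contain missing entries, and uses the `na=False` option so that missing entries count as "no match". The `regex=False` option asks for a plain substring search.

```python
import pandas as pd

# Hypothetical toy data; None stands in for a missing entry
toy_action = pd.Series([
    "Reported by Senator Graham",
    "Mr. Hoyer moved to table the measure.",
    None,
    "Motion by Senator McConnell",
])

# Boolean Series: True where the text contains "Senator"
mentions = toy_action.str.contains("Senator", regex=False, na=False)

# Sum counts the True values; mean gives the share of True values
n_mentions = int(mentions.sum())
percent = round(mentions.mean() * 100, 2)

print(n_mentions)
print(percent, "%")
```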


<font size = "3">

- Recall that the ``.replace`` method can be used to replace an entry with a new value.

- However, the *entire* entry is replaced.

- How can we replace the word "Senator" with "Sen." within a string? Using ``bills["action"].replace`` would only let us replace an entire entry.

- Instead, we can use the ``.str.replace`` method.

In [8]:
bills["action_custom"] = bills["action"].str.replace("Senator","Sen.")
print(bills["action_custom"].head(10))

3     Committee on Health, Education, Labor, and Pen...
4     Committee on the Judiciary. Reported by Sen. G...
5     Committee on the Judiciary. Reported by Sen. G...
6     Committee on Commerce, Science, and Transporta...
7     Committee on Veterans' Affairs. Reported by Se...
9     Committee on Homeland Security and Governmenta...
10    Committee on Homeland Security and Governmenta...
12    Committee on Foreign Relations. Reported by Se...
13    Committee on the Judiciary. Reported by Sen. G...
15    Committee on Foreign Relations. Reported by Se...
Name: action_custom, dtype: object
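
To make the contrast concrete, the sketch below applies both methods to a made-up Series (`toy_action` is hypothetical). Only the entry that is exactly "Senator" changes with `.replace`, while `.str.replace` also edits the longer string.

```python
import pandas as pd

# Hypothetical toy data: one entry is exactly "Senator", the other contains it
toy_action = pd.Series(["Senator", "Reported by Senator Graham"])

# .replace only swaps entries that match the whole value
whole = toy_action.replace("Senator", "Sen.")

# .str.replace substitutes inside each string
# regex=False treats "Senator" as plain text rather than a pattern
inside = toy_action.str.replace("Senator", "Sen.", regex=False)

print(whole.tolist())
print(inside.tolist())
```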


<font size = "3">

**Exercise**

- Obtain a new DataFrame called "resolutions" <br>
 which subsets rows where "category" values are either (i) house resolution or (ii) senate resolution

In [9]:
# Write your own code

list_resolution_names = ["house resolution","senate resolution"]
resolutions    = bills_actions.query(" category in @list_resolution_names").copy()

resolutions


Unnamed: 0,Congress,bill_number,bill_type,action,main_action,category,member_id
485,116,123,sres,Committee on Foreign Relations. Reported by Se...,senate committee/subcommittee actions,senate resolution,505
486,116,135,sres,Committee on Foreign Relations. Reported by Se...,senate committee/subcommittee actions,senate resolution,505
487,116,142,sres,Committee on Foreign Relations. Reported by Se...,senate committee/subcommittee actions,senate resolution,505
488,116,152,sres,Committee on Foreign Relations. Reported by Se...,senate committee/subcommittee actions,senate resolution,505
489,116,183,sres,Committee on Foreign Relations. Reported by Se...,senate committee/subcommittee actions,senate resolution,505
...,...,...,...,...,...,...,...
1085,116,603,hres,Mr. Hoyer moved to table the measure.,house floor actions,house resolution,1065
1086,116,603,hres,QUESTION OF THE PRIVILEGES OF THE HOUSE - The ...,house floor actions,house resolution,1560
1087,116,647,hres,Mr. Hoyer moved to table the measure.,house floor actions,house resolution,1065
1088,116,770,hres,Mr. Hoyer moved to table the measure.,house floor actions,house resolution,1065


# <span style="color:red"> II. Searching Strings and Regular Expressions </span>

<font size = "3">

- **Regular expressions** (Regex) are sequences of symbols/characters that express a string pattern that can be searched for within a longer piece of text.

- There is a built-in Python library called "re" that you can use. (import re)

- Pandas DataFrames and Series provide regular expression functionality without needing to import this library.

- First, recall how we can search for an entire string using ``.query()``

In [10]:
senate_bills = bills_actions.query('category == "senate bill"')
amendments = bills_actions.query('category == "amendment"')

<font size = "3">

- We can use the ``str.contains`` method to search for "sub-strings" instead of entire strings.

- Below, we look for entries in the "action" column which contain the phrase "to reconsider"

In [11]:
data_subset = bills_actions.query('action.str.contains("to reconsider")')
display(data_subset)

Unnamed: 0,Congress,bill_number,bill_type,action,main_action,category,member_id
38,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
39,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
40,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
41,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
268,116,2657,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
269,116,2657,s,S.Amdt.1407 Motion by Senator McConnell to rec...,other senate amendment actions,amendment,858
400,116,3985,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
548,116,50,sres,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate resolution,858
823,116,28,hjres,VITIATION OF EARLIER PROCEEDINGS - Mr. Hoyer a...,house floor actions,house joint resolution,1065
1023,116,758,hres,Mr. Nadler moved to table the motion to recons...,house floor actions,house resolution,546


<font size = "3">

We can also create a Boolean Pandas series, and then use that to subset the data

In [12]:
contains_phrase = bills_actions['action'].str.contains('to reconsider')

data_subset = bills_actions[contains_phrase]

display(data_subset)

Unnamed: 0,Congress,bill_number,bill_type,action,main_action,category,member_id
38,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
39,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
40,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
41,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
268,116,2657,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
269,116,2657,s,S.Amdt.1407 Motion by Senator McConnell to rec...,other senate amendment actions,amendment,858
400,116,3985,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
548,116,50,sres,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate resolution,858
823,116,28,hjres,VITIATION OF EARLIER PROCEEDINGS - Mr. Hoyer a...,house floor actions,house joint resolution,1065
1023,116,758,hres,Mr. Nadler moved to table the motion to recons...,house floor actions,house resolution,546


<font size = "3">

Of course, we can combine the commands and do it in one line:

In [13]:
data_subset = bills_actions[bills_actions['action'].str.contains('to reconsider')]
display(data_subset)

Unnamed: 0,Congress,bill_number,bill_type,action,main_action,category,member_id
38,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
39,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
40,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
41,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
268,116,2657,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
269,116,2657,s,S.Amdt.1407 Motion by Senator McConnell to rec...,other senate amendment actions,amendment,858
400,116,3985,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
548,116,50,sres,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate resolution,858
823,116,28,hjres,VITIATION OF EARLIER PROCEEDINGS - Mr. Hoyer a...,house floor actions,house joint resolution,1065
1023,116,758,hres,Mr. Nadler moved to table the motion to recons...,house floor actions,house resolution,546
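
By default, ``str.contains`` interprets its argument as a regular expression, much like ``re.search`` from the built-in "re" library applied to each entry. The sketch below compares the two on made-up strings (`toy_action` and `toy_codes` are hypothetical), and shows why an unescaped period can match more than intended.

```python
import re
import pandas as pd

# Hypothetical toy data
toy_action = pd.Series([
    "Motion by Senator McConnell to reconsider the vote",
    "Mr. Hoyer moved to table the measure.",
])
pattern = "to reconsider"

# Pandas version and the equivalent loop using the "re" library
with_pandas = toy_action.str.contains(pattern).tolist()
with_re = [re.search(pattern, text) is not None for text in toy_action]

# A period in a regex matches any character, so "H.R." also matches "HxRx"
toy_codes = pd.Series(["H.R. 1", "HxRx 1"])
as_regex = toy_codes.str.contains("H.R.").tolist()
as_literal = toy_codes.str.contains("H.R.", regex=False).tolist()

print(with_pandas, with_re)
print(as_regex, as_literal)
```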


<font size = "4">

Search words and wildcards

$\quad$ <img src="figures/wildcards_regex1.png" alt="drawing" width="300"/>

<font size = "3">

Above, we created a DataFrame with amendments using ``.query``:

```python
    amendments = bills_actions.query('category == "amendment"')
```

- In the "action" column, most entries contain the string "Amdt." followed by 4 digits. 

- Below, for each row we return the portion of the string containing:
    - "Amdt"
    - A period ("\\.") (A period without the slash will search for **any** character.)
    - One or more ("+") digit characters ("\d")
    - A non-digit character ("\D"), which is why each match in the output ends with a space

In [14]:
display(amendments["action"])

# r"..." is a raw string: backslashes reach the regex engine unchanged
substrings = amendments["action"].str.findall(r"Amdt\.\d+\D")
display(substrings)

0       S.Amdt.1274 Amendment SA 1274 proposed by Sena...
1       S.Amdt.2698 Amendment SA 2698 proposed by Sena...
2       S.Amdt.2659 Amendment SA 2659 proposed by Sena...
8       S.Amdt.2424 Amendment SA 2424 proposed by Sena...
11      S.Amdt.1275 Amendment SA 1275 proposed by Sena...
                              ...                        
3298    H.Amdt.172 Amendment (A004) offered by Ms. Kus...
3299    H.Amdt.171 Amendment (A003) offered by Ms. Hou...
3300    H.Amdt.170 Amendment (A002) offered by Ms. Oma...
3301    POSTPONED PROCEEDINGS - At the conclusion of d...
3302    H.Amdt.169 Amendment (A001) offered by Mr. Esp...
Name: action, Length: 1529, dtype: object

0       [Amdt.1274 ]
1       [Amdt.2698 ]
2       [Amdt.2659 ]
8       [Amdt.2424 ]
11      [Amdt.1275 ]
            ...     
3298     [Amdt.172 ]
3299     [Amdt.171 ]
3300     [Amdt.170 ]
3301              []
3302     [Amdt.169 ]
Name: action, Length: 1529, dtype: object
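
The sketch below shows the effect of escaping the period, using made-up strings (`toy_action` is hypothetical, and the third entry is invented to trigger the difference). With "\\." only a literal period is accepted after "Amdt". Without the slash, any character is accepted.

```python
import pandas as pd

# Hypothetical toy data; the last entry has a dash instead of a period
toy_action = pd.Series([
    "S.Amdt.1274 Amendment SA 1274",
    "H.Amdt.172 Amendment (A004)",
    "Amdt-99 is not an amendment label",
])

# Escaped period: only a literal "." may follow "Amdt"
escaped = toy_action.str.findall(r"Amdt\.\d+\D").tolist()

# Unescaped period: any single character may follow "Amdt"
unescaped = toy_action.str.findall(r"Amdt.\d+\D").tolist()

print(escaped)
print(unescaped)
```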

In [15]:
display(amendments["action"])

# Four characters of any kind ("....") after the period, then a non-digit ("\D")
substrings = amendments["action"].str.findall(r"Amdt\.....\D")
display(substrings)

0       S.Amdt.1274 Amendment SA 1274 proposed by Sena...
1       S.Amdt.2698 Amendment SA 2698 proposed by Sena...
2       S.Amdt.2659 Amendment SA 2659 proposed by Sena...
8       S.Amdt.2424 Amendment SA 2424 proposed by Sena...
11      S.Amdt.1275 Amendment SA 1275 proposed by Sena...
                              ...                        
3298    H.Amdt.172 Amendment (A004) offered by Ms. Kus...
3299    H.Amdt.171 Amendment (A003) offered by Ms. Hou...
3300    H.Amdt.170 Amendment (A002) offered by Ms. Oma...
3301    POSTPONED PROCEEDINGS - At the conclusion of d...
3302    H.Amdt.169 Amendment (A001) offered by Mr. Esp...
Name: action, Length: 1529, dtype: object

0       [Amdt.1274 ]
1       [Amdt.2698 ]
2       [Amdt.2659 ]
8       [Amdt.2424 ]
11      [Amdt.1275 ]
            ...     
3298    [Amdt.172 A]
3299    [Amdt.171 A]
3300    [Amdt.170 A]
3301              []
3302    [Amdt.169 A]
Name: action, Length: 1529, dtype: object

<font size = "3">

Here are four more examples:

In [16]:
# Get "Amdt", a period, and 1 character of any kind
example1 = amendments["action"].str.findall(r"Amdt\..")

# Get any 1 character before "Amdt", then a period
example2 = amendments["action"].str.findall(r".Amdt\.")

# Get two characters before "Amdt", a period, then 3 characters of any kind
example3 = amendments["action"].str.findall(r"..Amdt\....")

# Get two characters before "dt", a period, then 4 characters of any kind
# (".{2}" means "any character, exactly 2 times")
example4 = amendments["action"].str.findall(r".{2}dt\..{4}")

display(example1)
display(example2)
display(example3)
display(example4)

0       [Amdt.1]
1       [Amdt.2]
2       [Amdt.2]
8       [Amdt.2]
11      [Amdt.1]
          ...   
3298    [Amdt.1]
3299    [Amdt.1]
3300    [Amdt.1]
3301          []
3302    [Amdt.1]
Name: action, Length: 1529, dtype: object

0       [.Amdt.]
1       [.Amdt.]
2       [.Amdt.]
8       [.Amdt.]
11      [.Amdt.]
          ...   
3298    [.Amdt.]
3299    [.Amdt.]
3300    [.Amdt.]
3301          []
3302    [.Amdt.]
Name: action, Length: 1529, dtype: object

0       [S.Amdt.127]
1       [S.Amdt.269]
2       [S.Amdt.265]
8       [S.Amdt.242]
11      [S.Amdt.127]
            ...     
3298    [H.Amdt.172]
3299    [H.Amdt.171]
3300    [H.Amdt.170]
3301              []
3302    [H.Amdt.169]
Name: action, Length: 1529, dtype: object

0       [Amdt.1274]
1       [Amdt.2698]
2       [Amdt.2659]
8       [Amdt.2424]
11      [Amdt.1275]
           ...     
3298    [Amdt.172 ]
3299    [Amdt.171 ]
3300    [Amdt.170 ]
3301             []
3302    [Amdt.169 ]
Name: action, Length: 1529, dtype: object
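
As a small sketch of the "{n}" notation, the cell below uses the built-in "re" library on two made-up strings (`senate_text` and `house_text` are hypothetical). Repeating "." and writing ".{n}" should give the same matches, and "." also accepts a space, which is why three-digit amendment numbers pick up a trailing space.

```python
import re

# Hypothetical example strings modeled on the "action" column
senate_text = "S.Amdt.1274 Amendment SA 1274 proposed"
house_text = "H.Amdt.172 Amendment (A004) offered"

# Repeated dots and the {n} quantifier describe the same pattern
dots = re.findall(r"..Amdt\....", senate_text)
braces = re.findall(r".{2}Amdt\..{3}", senate_text)

# "." accepts any character, including the space after a 3-digit number
four_any = re.findall(r"Amdt\..{4}", house_text)

# "\d" only accepts digits, so a 3-digit number gives no match
four_digits = re.findall(r"Amdt\.\d{4}", house_text)

print(dots, braces)
print(four_any, four_digits)
```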

<font size = "4">

Wildcards + Quantifiers

$\quad$ <img src="figures/wildcards_regex2.png" alt="drawing" width="300"/>

<font size = "3">

These wildcards are best described with examples.


In [17]:
# . = character of any kind
# * = zero or more repetitions of the preceding element
# .* = all consecutive characters of any kind
# .*Amdt = all consecutive characters of any kind before the string "Amdt"
example1 = amendments["action"].str.findall(r".*Amdt")
display(example1)

# \S = non-space character
# \S* = all consecutive non-space characters
# Amdt\S* = "Amdt" plus all consecutive non-space characters after it
example2 = amendments["action"].str.findall(r"Amdt\S*")
display(example2)

# can combine them together:
example3 = amendments["action"].str.findall(r".*Amdt\S*")
display(example3)
display(example3)

0       [S.Amdt]
1       [S.Amdt]
2       [S.Amdt]
8       [S.Amdt]
11      [S.Amdt]
          ...   
3298    [H.Amdt]
3299    [H.Amdt]
3300    [H.Amdt]
3301          []
3302    [H.Amdt]
Name: action, Length: 1529, dtype: object

0       [Amdt.1274]
1       [Amdt.2698]
2       [Amdt.2659]
8       [Amdt.2424]
11      [Amdt.1275]
           ...     
3298     [Amdt.172]
3299     [Amdt.171]
3300     [Amdt.170]
3301             []
3302     [Amdt.169]
Name: action, Length: 1529, dtype: object

0       [S.Amdt.1274]
1       [S.Amdt.2698]
2       [S.Amdt.2659]
8       [S.Amdt.2424]
11      [S.Amdt.1275]
            ...      
3298     [H.Amdt.172]
3299     [H.Amdt.171]
3300     [H.Amdt.170]
3301               []
3302     [H.Amdt.169]
Name: action, Length: 1529, dtype: object
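
One detail worth a quick sketch: ".*" is "greedy", meaning it takes as many characters as it can. The made-up string below (`text` is hypothetical) mentions "Amdt" twice, so ".*Amdt" runs all the way to the last occurrence, while "\S*" stops at spaces.

```python
import re

# Hypothetical string that mentions "Amdt" twice
text = "S.Amdt.1407 Motion to table S.Amdt.1274"

# ".*" grabs as much as possible, so the match ends at the LAST "Amdt"
greedy_any = re.findall(r".*Amdt", text)

# "\S*" cannot cross a space, so each label is matched separately
non_space_before = re.findall(r"\S*Amdt", text)
non_space_after = re.findall(r"Amdt\S*", text)

print(greedy_any)
print(non_space_before)
print(non_space_after)
```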

In [18]:
# Look for:
# "Amdt"
# \. = followed by a period
# \d = followed by a digit
# \d = followed by a digit
# \d? = (optional) followed by a digit
# \d? = (optional) followed by a digit

example4 = amendments["action"].str.findall(r"Amdt\.\d\d\d?\d?")
display(example4)

0       [Amdt.1274]
1       [Amdt.2698]
2       [Amdt.2659]
8       [Amdt.2424]
11      [Amdt.1275]
           ...     
3298     [Amdt.172]
3299     [Amdt.171]
3300     [Amdt.170]
3301             []
3302     [Amdt.169]
Name: action, Length: 1529, dtype: object
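
The pattern above requires two digits and allows up to two more. As a sketch, the same idea can be written with a range quantifier, "\d{2,4}" (between 2 and 4 digits). The strings in `texts` are made up for illustration.

```python
import re

# Hypothetical example strings with 4, 3, and 1 digits after "Amdt."
texts = ["S.Amdt.1274 proposed", "H.Amdt.172 offered", "H.Amdt.7 offered"]

# Two required digits followed by two optional digits
optional = [re.findall(r"Amdt\.\d\d\d?\d?", t) for t in texts]

# Same requirement written as "between 2 and 4 digits"
ranged = [re.findall(r"Amdt\.\d{2,4}", t) for t in texts]

print(optional)
print(ranged)
```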

In [19]:
# Get all consecutive digits after "Amdt" and period
example5 = amendments["action"].str.findall(r"Amdt\.\d*")

display(example5)


0       [Amdt.1274]
1       [Amdt.2698]
2       [Amdt.2659]
8       [Amdt.2424]
11      [Amdt.1275]
           ...     
3298     [Amdt.172]
3299     [Amdt.171]
3300     [Amdt.170]
3301             []
3302     [Amdt.169]
Name: action, Length: 1529, dtype: object
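
Since "*" means "zero or more", the pattern "Amdt\\.\d*" also matches "Amdt." with no digits at all. The sketch below uses made-up strings (`texts` is hypothetical) to compare it with "+", which requires at least one digit.

```python
import re

# Hypothetical example strings; the second has no digits after "Amdt."
texts = ["S.Amdt.1274 proposed", "See Amdt. for details"]

# "*" allows zero digits, so "Amdt." alone is a match
star = [re.findall(r"Amdt\.\d*", t) for t in texts]

# "+" requires at least one digit
plus = [re.findall(r"Amdt\.\d+", t) for t in texts]

print(star)
print(plus)
```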

<font size = "3">

Do all of the entries in the "action" column contain the pattern "Amdt.[numerical digits]"?

The answer is no. We can see that at least 3 of the items do not contain that sequence of characters.

In [20]:
display(example5.iloc[110:120])

593     [Amdt.927]
595    [Amdt.2680]
596    [Amdt.2673]
597    [Amdt.2652]
598             []
599             []
600    [Amdt.2499]
601             []
610     [Amdt.938]
611     [Amdt.883]
Name: action, dtype: object

<font size = "3">

How many rows contain this sequence, and how many don't? 

To answer this, we can use the ``str.len`` method, which here returns the number of matches in each row's list. First, we demonstrate how it works:

In [21]:
num_matches = example5.str.len()
print(example5.iloc[110])
print(num_matches.iloc[110])
print()
print(example5.iloc[114])
print(num_matches.iloc[114])

['Amdt.927']
1

[]
0


<font size = "3">

We can create a Boolean Series telling us if a match was found or not for a given row.

In [22]:
found_match = example5.str.len() > 0
display(example5[found_match])

num_matches = len(example5[found_match])
num_no_matches = len(example5[~found_match]) # "~" means "not"
print(num_matches)
print(num_no_matches)

0       [Amdt.1274]
1       [Amdt.2698]
2       [Amdt.2659]
8       [Amdt.2424]
11      [Amdt.1275]
           ...     
3297     [Amdt.173]
3298     [Amdt.172]
3299     [Amdt.171]
3300     [Amdt.170]
3302     [Amdt.169]
Name: action, Length: 1116, dtype: object

1116
413


In [23]:
# check that we did everything right:

print(num_matches + num_no_matches)
print()
print(len(example5))

1529

1529
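
As an alternative sketch of the same count, a Boolean Series can be summed directly, since True counts as 1 and False as 0. The Series `toy_matches` below is a hypothetical stand-in for the output of ``.str.findall()``.

```python
import pandas as pd

# Hypothetical stand-in for a findall result: one list of matches per row
toy_matches = pd.Series([["Amdt.927"], [], ["Amdt.2680"], []])

# True where at least one match was found
found = toy_matches.str.len() > 0

# Summing a Boolean Series counts the True values
n_found = int(found.sum())
n_missing = int((~found).sum())  # "~" means "not"

print(n_found, n_missing)
print(n_found + n_missing == len(toy_matches))
```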


<font size = "3">

**Exercise**

- Practice using the ```senate_bills``` dataset
- Use ```.str.findall()``` to find the phrase "Senator [Name]"
- [Name] is a placeholder for the Senators' last names: Johnson, Graham, Moran, etc.
- To do so, you will need to search for the following:
    - The string "Senator"
    - A white space
    - All consecutive non-space characters.
- Use Python commands to determine how many rows contain the phrase "Senator [Name]" and how many do not.

In [28]:
# Write your own code

senator_match = senate_bills["action"].str.findall(r"Senator \S*")

num_matches = len(senate_bills[senator_match.str.len() > 0])
num_no_match = len(senate_bills[senator_match.str.len() == 0])

print(num_matches)
print(num_no_match)
print()

# sanity check, these should be equal if we did everything right.
print(num_matches + num_no_match)
print(len(senator_match))

391
123

514
514
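
One thing to keep in mind with "\S*" is that it also captures punctuation attached to the name. The sketch below uses a made-up sentence (`text` is hypothetical) and compares it with "\w+", which only accepts letters, digits, and underscores.

```python
import re

# Hypothetical sentence modeled on the "action" column
text = "Committee on the Judiciary. Reported by Senator Graham, without amendment."

# "\S*" keeps everything up to the next space, including the comma
with_non_space = re.findall(r"Senator \S*", text)

# "\w+" stops at the first character that is not a letter, digit, or underscore
with_word_chars = re.findall(r"Senator \w+", text)

print(with_non_space)
print(with_word_chars)
```

Note that "\w+" would cut off hyphenated or multi-word last names, so which pattern is preferable depends on the names in the data.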
