# <span style="color:red"> Lecture 23 - Text Data  </span>

<font size = "3">

- Today's lecture covers some basic commands for working with text data

- Text data is increasingly relevant: huge amounts of it are now available.

- Text is messy and unstructured, so it is less straightforward to work with than numerical or categorical data

<font size = "3">

Import necessary libraries

In [None]:
import pandas as pd

<font size = "3">

Import data. 

The file "bills_actions.csv" contains data concerning the 116th United States Congress (January 3, 2019 - January 3, 2021).

In [None]:
bills_actions = pd.read_csv("data_raw/bills_actions.csv")
bills_actions.dtypes

# <span style="color:red"> I. Basic Text Operations </span>

<font size = "3">

The "category" column tells us what type of data the row corresponds to (amendment, senate bill, etc.)

In [None]:
display(bills_actions["category"])

<font size = "3">

How many categories are there? How often do they appear in the data set?

An answer to the first question can be found by using the ``.nunique()`` method of the Pandas Series

*Both* questions can be answered using the ``.value_counts()`` method.

In [None]:
bills_actions["category"].value_counts()
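
To see how the two methods relate, here is a minimal sketch on a small made-up Series (``toy_category`` and its values are assumptions for illustration, not taken from the data set). The number of rows returned by ``.value_counts()`` should equal ``.nunique()``.

```python
import pandas as pd

# Toy stand-in for the "category" column (values are made up for illustration)
toy_category = pd.Series(
    ["amendment", "house bill", "senate bill", "house bill", "amendment", "house bill"]
)

# Number of distinct categories
n_categories = toy_category.nunique()

# Frequency of each category, sorted from most to least common
category_counts = toy_category.value_counts()

print(n_categories)
print(category_counts)

# value_counts() has one row per distinct category
print(len(category_counts) == n_categories)
```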

<font size = "4">

Subsetting text categories

<font size = "3">

- Suppose we are only interested in bills.

- We will use ``.query()`` to extract the subset of the data corresponding to House bills and Senate bills

- We use the ``.copy()`` method, because we will be adding a column to this dataframe, and we don't want this to affect the original one.

In [None]:
# categories we are interested in
list_categories = ["house bill","senate bill"]

# "in" tests whether each value in the column belongs to the list

# remember, "@" is the syntax needed to reference Python variables
# defined outside the DataFrame (here, list_categories)
bills = bills_actions.query('category in @list_categories').copy()

# double check that everything worked: compare to above value counts
bills["category"].value_counts()
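
The same idea on a toy DataFrame, as a sketch (``toy``, ``keep`` and the row values are made up for illustration). It also compares ``.query()`` with the ``.isin()`` method, which is another common way to subset by a list, and checks that adding a column to the copy leaves the original untouched.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "category": ["amendment", "house bill", "senate bill", "house resolution"],
    "action": ["a", "b", "c", "d"],
})

keep = ["house bill", "senate bill"]

# Subset with .query(), using "@" to reference the Python list
subset_query = toy.query("category in @keep").copy()

# Equivalent subset with a Boolean mask built by .isin()
subset_isin = toy[toy["category"].isin(keep)].copy()

print(subset_query.equals(subset_isin))

# Adding a column to the copy does not change the original DataFrame
subset_query["flag"] = True
print("flag" in toy.columns)
```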

<font size = "4">

Data manipulation with strings

<font size = "3">

Q: How many bills mention the word "Senator"? 

- The "action" column contains a summary of each bill
- We'll use the ``.contains`` method.
- The method belongs to the collection ``bills["action"].str`` (the Pandas StringMethods collection)

In [None]:
# Create Boolean Series
mentions_senator = bills["action"].str.contains("Senator")

# False = 0, True = 1
# So summing will count how many have the word "Senator"

print(mentions_senator.sum())

<font size = "3">

Q: What **percentage** of bills mention the word "Senator"?

Here are two ways we can compute this:

In [None]:
# multiply by 100 to convert to a percent.

print(mentions_senator.sum()/len(mentions_senator) * 100)
print()
print(mentions_senator.mean() * 100)
print()

# round to two digits and add "%" sign
senator_percent = round(mentions_senator.mean() * 100, 2)
print(senator_percent, "%")
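
Here is a small sketch of the same counting logic on made-up text (``toy_action`` is an assumption for illustration). It also shows that ``.str.contains`` is case-sensitive by default; the ``case=False`` argument relaxes that.

```python
import pandas as pd

# Toy stand-in for the "action" column (text is made up for illustration)
toy_action = pd.Series([
    "Referred to the Committee by Senator Smith.",
    "Passed the House.",
    "senator in lowercase is not matched by default",
    "Motion by Senator Jones agreed to.",
])

# Boolean Series: True where the text contains "Senator"
mentions = toy_action.str.contains("Senator")

count = mentions.sum()         # True counts as 1, False as 0
share = mentions.mean() * 100  # mean of a Boolean Series = fraction of True

# Ignore upper/lower case when searching
mentions_any_case = toy_action.str.contains("Senator", case=False)

print(count)
print(f"{share:.2f}%")
print(mentions_any_case.sum())
```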

<font size = "3">

- Recall that the ``.replace`` method can be used to replace an entry with a new value.

- However, the *entire* entry is replaced.

- How can we replace the word "Senator" with "Sen." within a string? Using ``bills["action"].replace`` would only allow us to replace an entire string.

- Instead, we can use the ``.str.replace`` method.

In [None]:
# regex=False treats "Senator" as plain text rather than a regular expression
bills["action_custom"] = bills["action"].str.replace("Senator", "Sen.", regex=False)
print(bills["action_custom"].head(10))
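
To make the difference concrete, here is a sketch that applies both methods to a made-up Series (``toy_action`` is an assumption for illustration). ``.replace`` only changes entries that are *exactly* "Senator", while ``.str.replace`` changes the word wherever it appears inside a string.

```python
import pandas as pd

# Toy data (made up for illustration)
toy_action = pd.Series(["Senator", "Introduced by Senator Smith"])

# Replaces only entries that are exactly equal to "Senator"
whole_entry = toy_action.replace("Senator", "Sen.")

# Replaces the substring "Senator" inside every entry
within_string = toy_action.str.replace("Senator", "Sen.", regex=False)

print(whole_entry.tolist())
print(within_string.tolist())
```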

<font size = "3">

**Exercise**

- Obtain a new DataFrame called "resolutions" <br>
 which subsets rows where "category" values are either (i) house resolution or (ii) senate resolution

In [None]:
# Write your own code


# <span style="color:red"> II. Searching Strings and Regular Expressions </span>

<font size = "3">

- **Regular expressions** (Regex) are sequences of symbols/characters that express a string pattern that can be searched for within a longer piece of text.

- There is a built-in Python library called "re" that you can use. (import re)

- Pandas provides regular expression functionality through the ``.str`` methods of a Series, without needing to import this library.

- First, recall how we can search for an entire string using ``.query()``

In [None]:
senate_bills = bills_actions.query('category == "senate bill"')
amendments = bills_actions.query('category == "amendment"')

<font size = "3">

- We can use the ``str.contains`` method to search for "sub-strings" instead of entire strings.

- Below, we look for entries in the "action" column which contain the phrase "to reconsider"

In [None]:
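# engine="python" lets .query() evaluate .str methods regardless of the default engine
data_subset = bills_actions.query('action.str.contains("to reconsider")', engine="python")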
display(data_subset)

<font size = "3">

We can also create a Boolean Pandas series, and then use that to subset the data

In [None]:
contains_phrase = bills_actions['action'].str.contains('to reconsider')

data_subset = bills_actions[contains_phrase]

display(data_subset)

<font size = "3">

Of course, we can combine the commands and do it in one line:

In [None]:
data_subset = bills_actions[bills_actions['action'].str.contains('to reconsider')]
display(data_subset)
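
As a quick check that these approaches agree, here is a sketch on a toy DataFrame (``toy`` and its text are made up for illustration). Note that ``.str.contains`` treats its argument as a regular expression by default; passing ``regex=False`` searches for the literal text instead.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "action": [
        "Motion to reconsider laid on the table.",
        "Passed Senate.",
        "Motion to reconsider agreed to.",
    ]
})

# Approach 1: .query() with a string method
via_query = toy.query('action.str.contains("to reconsider")', engine="python")

# Approach 2: Boolean Series used as a mask
mask = toy["action"].str.contains("to reconsider", regex=False)
via_mask = toy[mask]

print(mask.tolist())
print(via_query.equals(via_mask))
```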

<font size = "4">

Search words and wildcards

$\quad$ <img src="figures/wildcards_regex1.png" alt="drawing" width="300"/>

<font size = "3">

Above, we created a DataFrame with amendments using ``.query``:

```python
    amendments = bills_actions.query('category == "amendment"')
```

- In the "action" column, most entries contain the string "Amdt." followed by 4 digits. 

- Below, for each row we return the portion of the string containing:
    - "Amdt"
    - A period ("\\.") (A period without the backslash will search for **any** character.)
    - A digit character ("\d"), one or more occurrences ("+")
    - A non-digit character ("\D")

In [None]:
display(amendments["action"])

# the r"..." prefix (raw string) passes backslashes to the regex unchanged
substrings = amendments["action"].str.findall(r"Amdt\.\d+\D")
display(substrings)

In [None]:
display(amendments["action"])

substrings = amendments["action"].str.findall(r"Amdt\.....\D")
display(substrings)
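
The two patterns above can return different results when the number is not exactly 4 digits long. A minimal sketch on made-up strings (``toy_action`` is an assumption for illustration):

```python
import pandas as pd

# Toy stand-in for the "action" column (text is made up for illustration)
toy_action = pd.Series([
    "S.Amdt.1274 Amendment SA 1274 proposed by Senator Smith.",
    "H.Amdt.25 Amendment agreed to.",
])

# "Amdt", a period, one or more digits, then one non-digit character
digits_then_nondigit = toy_action.str.findall(r"Amdt\.\d+\D")

# "Amdt", a period, any 4 characters, then one non-digit character
four_any_then_nondigit = toy_action.str.findall(r"Amdt\.....\D")

print(digits_then_nondigit.tolist())
print(four_any_then_nondigit.tolist())
```

In the second row the number has only two digits, so the four "any character" wildcards run past the number and into the following word. Note also that the trailing ``\D`` is part of the match, so the space after the number is included in the result.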

<font size = "3">

Here are four more examples:

In [None]:
# Get period + 1 character of any kind after Amdt
example1 = amendments["action"].str.findall(r"Amdt\..")

# Get any character before Amdt + period
example2 = amendments["action"].str.findall(r".Amdt\.")

# Get two characters before Amdt, period, then 3 characters of any kind
example3 = amendments["action"].str.findall(r"..Amdt\....")

# Get two characters before "dt", then period, then 4 characters of any kind
example4 = amendments["action"].str.findall(r".{2}dt\..{4}")

display(example1)
display(example2)
display(example3)
display(example4)
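
To see exactly what each of the four patterns captures, here is a sketch that applies them to a single made-up string (``toy_action`` is an assumption for illustration):

```python
import pandas as pd

# One toy entry (text is made up for illustration)
toy_action = pd.Series(["H.Amdt.1274 agreed to."])

# "Amdt", a period, then 1 character of any kind
toy1 = toy_action.str.findall(r"Amdt\..")

# 1 character of any kind, then "Amdt" and a period
toy2 = toy_action.str.findall(r".Amdt\.")

# 2 characters, "Amdt", a period, then 3 characters
toy3 = toy_action.str.findall(r"..Amdt\....")

# 2 characters, "dt", a period, then 4 characters ({n} repeats the wildcard n times)
toy4 = toy_action.str.findall(r".{2}dt\..{4}")

for result in [toy1, toy2, toy3, toy4]:
    print(result.tolist())
```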

<font size = "4">

Wildcards + Quantifiers

$\quad$ <img src="figures/wildcards_regex2.png" alt="drawing" width="300"/>

<font size = "3">

These wildcards are best explained with examples.


In [None]:
# . = character of any kind
# * = zero or more consecutive occurrences of the preceding symbol
# .* = All consecutive characters of any kind
# .*Amdt = Everything up to and including the last "Amdt" in the string
example1 = amendments["action"].str.findall(".*Amdt")
display(example1)

# \S = non-space character
# * = zero or more consecutive occurrences of the preceding symbol
# \S* = All consecutive non-space characters
# Amdt\S* = "Amdt" plus all consecutive non-space characters after it
example2 = amendments["action"].str.findall(r"Amdt\S*")
display(example2)

# can combine them together:
example3 = amendments["action"].str.findall(r".*Amdt\S*")
display(example3)
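
The ``*`` quantifier is "greedy": it matches as many characters as it can. The sketch below shows what that means for a made-up string in which "Amdt" appears twice (``toy_action`` is an assumption for illustration).

```python
import pandas as pd

# Toy entry with two amendment references (text is made up for illustration)
toy_action = pd.Series(["S.Amdt.10 to S.Amdt.20 agreed to."])

# Greedy: runs up to the LAST "Amdt", so only one match is returned
before = toy_action.str.findall(".*Amdt")

# Stops at the first space after each "Amdt", so there are two matches
after = toy_action.str.findall(r"Amdt\S*")

# Combined: everything from the start through the last amendment number
combined = toy_action.str.findall(r".*Amdt\S*")

print(before.tolist())
print(after.tolist())
print(combined.tolist())
```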

In [None]:
# Look for:
# "Amdt"
# \. = followed by a period
# \d = followed by a digit
# \d = followed by a digit
# \d? = (optional) followed by a digit
# \d? = (optional) followed by a digit

example4 = amendments["action"].str.findall(r"Amdt\.\d\d\d?\d?")
display(example4)

In [None]:
# Get all consecutive digits after "Amdt" and period
example5 = amendments["action"].str.findall(r"Amdt\.\d*")

display(example5)
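
The ``?`` and ``*`` quantifiers behave differently when the number of digits varies. A sketch on made-up strings (``toy_action`` is an assumption for illustration); the ``{2,4}`` form, meaning "between 2 and 4 occurrences", is a shorter way to write the optional-digit pattern.

```python
import pandas as pd

# Toy entries with 1, 3, 5 and 0 digits after "Amdt." (made up for illustration)
toy_action = pd.Series([
    "S.Amdt.5 proposed.",
    "S.Amdt.123 proposed.",
    "S.Amdt.12345 proposed.",
    "Amdt. withdrawn.",
])

# Two required digits, then up to two optional digits
optional = toy_action.str.findall(r"Amdt\.\d\d\d?\d?")

# Zero or more digits: also matches "Amdt." with no digits at all
star = toy_action.str.findall(r"Amdt\.\d*")

# Between 2 and 4 digits: same matches as the optional-digit pattern
braces = toy_action.str.findall(r"Amdt\.\d{2,4}")

print(optional.tolist())
print(star.tolist())
print(braces.tolist() == optional.tolist())
```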


<font size = "3">

**Exercise**

- Practice using the ```senate_bills``` dataset
- Use ```.str.findall()``` to find the phrase "Senator [Name]"
- [Name] is a placeholder for the Senators' last names: Johnson, Graham, Moran, etc.
- To do so, you will need to search for the following:
    - The string "Senator"
    - A white space
    - All consecutive non-space characters.

In [None]:
# Write your own code
