<a href="https://colab.research.google.com/github/muziejus/21F-UP206A/blob/master/src/EDA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Consolidation on _Congressional Record_

by Moacir P. de Sá Pereira

This notebook uses NLTK to tokenize and stem speeches from the 97th through 106th Congresses and stores the results in pandas dataframes. The speeches come from the [Hein Corpus provided by Gentzkow et al.](https://data.stanford.edu/congress_text) and were OCRed by Hein Online.

For each Congress, the notebook reads the corresponding text file, in which every row is one speech, and splits each row into its `speech_id` and `speech`. Each speech is then tokenized and stemmed with tools from NLTK, and stopwords are removed, including the congressional stopwords indicated by Gentzkow et al. The speech IDs and stemmed tokens are collected in a pandas dataframe.

Finally, each dataframe is saved as a parquet file. The per-Congress files are then concatenated and integer-encoded against a shared vocabulary for later processing.
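
The row-splitting step can be sketched without NLTK. The snippet assumes, as the processing loop later in this notebook does, that each file begins with a header row and that every other row has the form `speech_id|speech`. The sample rows and the helper name `split_line` are invented for illustration, and the helper splits at the first pipe only, which is a slight simplification of the notebook's own function.

```python
# Hypothetical sample rows; only the `speech_id|speech` layout follows the notebook.
sample_lines = [
    "speech_id|speech\n",
    "970000001|The Senate will come to order.\n",
    "970000002|I yield the floor.\n",
]

def split_line(line):
    """Split one row into (speech_id, speech) at the first pipe."""
    speech_id, _, speech = line.partition("|")
    return int(speech_id), speech.strip()

# Skip the header row before parsing.
rows = [split_line(line) for line in sample_lines[1:]]
print(rows)
```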

In [1]:
# Import libraries
import pandas as pd
from tqdm.notebook import tqdm
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

In [2]:
# Download NLTK data
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [35]:
# Create stopwords set
congressional_stopwords = set("absent adjourn ask can chairman committee con democrat etc gentleladies gentlelady gentleman gentlemen gentlewoman gentlewomen hereabout hereafter hereat hereby herein hereinafter hereinbefore hereinto hereof hereon hereto heretofore hereunder hereunto hereupon herewith month mr mrs nai nay none now part per pro republican say senator shall sir speak speaker tell thank thereabout thereafter thereagainst thereat therebefore therebeforn thereby therefor therefore therefrom therein thereinafter thereof thereon thereto theretofore thereunder thereunto thereupon therewith therewithal today whereabouts whereafter whereas whereat whereby wherefore wherefrom wherein whereinto whereof whereon whereto whereunder whereupon wherever wherewith wherewithal will yea yes yield".split(" "))
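default_stopwords = set(stopwords.words('english'))
# Note: this rebinds `stopwords` from the NLTK corpus reader to a plain set,
# so re-running this cell without re-importing raises an AttributeError.
stopwords = default_stopwords.union(congressional_stopwords)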
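
As a small illustration of how a combined stopword set behaves, the sketch below filters a token list against the union of two sets. The short `default_sample` set is only a stand-in for NLTK's English list so that the example runs without any downloads, and all variable names here are hypothetical.

```python
# Stand-ins for the two stopword sources (hypothetical, heavily shortened).
default_sample = {"the", "to", "i"}
congressional_sample = {"mr", "speaker", "yield", "gentleman"}
combined = default_sample | congressional_sample  # equivalent to .union()

tokens = ["Mr", "Speaker", "I", "yield", "to", "the",
          "gentleman", "discussing", "the", "budget"]

# Compare lowercased tokens against the set, but keep the original casing.
kept = [t for t in tokens if t.lower() not in combined]
print(kept)  # ['discussing', 'budget']
```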

In [3]:
# Make our filenames
speech_filenames = [f"speeches_{i:03d}.txt" for i in range(97, 107)]
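
Because `range(97, 107)` excludes its upper bound, the comprehension covers Congresses 97 through 106, and the `:03d` format zero-pads each number to three digits. A quick check:

```python
speech_filenames = [f"speeches_{i:03d}.txt" for i in range(97, 107)]

print(len(speech_filenames))  # 10
print(speech_filenames[0])    # speeches_097.txt
print(speech_filenames[-1])   # speeches_106.txt
```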


In [37]:
# Initialize stemmer
stemmer = PorterStemmer()

In [38]:
# Define a function that takes a line from the speech text,
# separates out the speech_id, then tokenizes the speech,
# removes stopwords, and stems the speech,
# returning the speech_id and list of stemmed tokens.
def parse_line_of_speech(line):
  fields = line.split("|")
  speech_id = fields[0]
  # Rejoin the remaining fields in case the speech itself contains a pipe
  speech = " ".join(fields[1:]).strip()
  tokens = word_tokenize(speech)
  # Match stopwords case-insensitively
  tokens = [word for word in tokens if word.lower() not in stopwords]
  tokens = [stemmer.stem(token) for token in tokens]
  return speech_id, tokens


In [45]:
# Iterate over the speech files
for speech_filename in speech_filenames:
  with open(speech_filename, 'r', encoding='latin-1') as f:
    print(f"Processing {speech_filename}")
    lines = f.readlines()
    data = {
        "ids": [],
        "tokens": []
    }
    for line in tqdm(lines[1:]):  # skip the header row
      speech_id, tokens = parse_line_of_speech(line)
      data["ids"].append(int(speech_id))
      data["tokens"].append(tokens)
    df = pd.DataFrame(data)
    df.to_parquet(f"{speech_filename.split('.')[0]}.parquet")

Processing speeches_098.txt


  0%|          | 0/280288 [00:00<?, ?it/s]

Processing speeches_099.txt


  0%|          | 0/281527 [00:00<?, ?it/s]

Processing speeches_100.txt


  0%|          | 0/276161 [00:00<?, ?it/s]

Processing speeches_101.txt


  0%|          | 0/251216 [00:00<?, ?it/s]

Processing speeches_102.txt


  0%|          | 0/243091 [00:00<?, ?it/s]

Processing speeches_103.txt


  0%|          | 0/235973 [00:00<?, ?it/s]

Processing speeches_104.txt


  0%|          | 0/274984 [00:00<?, ?it/s]

Processing speeches_105.txt


  0%|          | 0/209266 [00:00<?, ?it/s]

Processing speeches_106.txt


  0%|          | 0/209647 [00:00<?, ?it/s]
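
The loop above calls `f.readlines()`, which loads a whole file into memory before processing starts. If memory becomes a constraint, the file object can be iterated lazily instead. A minimal sketch, using `io.StringIO` as a stand-in for an open speech file and invented rows:

```python
import io
from itertools import islice

# Stand-in for open(speech_filename, encoding="latin-1"); rows are invented.
f = io.StringIO("speech_id|speech\n1|first speech\n2|second speech\n")

ids = []
# islice(f, 1, None) skips the header row and yields the rest one line at a time.
for line in islice(f, 1, None):
    speech_id, _, _speech = line.partition("|")
    ids.append(int(speech_id))

print(ids)  # [1, 2]
```

The trade-off is that a progress bar wrapped around a lazy iterator cannot show a total, because the number of lines is not known in advance.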

In [4]:
# Read the parquet files back into memory and concatenate them into
# one giant dataframe.
df = pd.concat(
    [pd.read_parquet(f"{speech_filename.split('.')[0]}.parquet")
     for speech_filename in speech_filenames]
)

df.rename(columns={"ids": "speech_id"}, inplace=True)
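
Each per-Congress dataframe carries its own 0-based index, so concatenating them without further options yields repeated index labels. This is harmless as long as rows are not selected by label. If unique labels are wanted, `ignore_index=True` renumbers the rows. A small sketch with invented data:

```python
import pandas as pd

a = pd.DataFrame({"ids": [1, 2], "tokens": [["a"], ["b"]]})
b = pd.DataFrame({"ids": [3], "tokens": [["c"]]})

plain = pd.concat([a, b])                          # keeps each frame's labels
reindexed = pd.concat([a, b], ignore_index=True)   # renumbers from 0

print(list(plain.index))      # [0, 1, 0]
print(list(reindexed.index))  # [0, 1, 2]
```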

In [5]:
# Generate a vocabulary file and integer encode the speeches.
from collections import defaultdict
from tqdm import tqdm

tqdm.pandas()

vocab = defaultdict(lambda: len(vocab))

def tokens_to_integers(tokens):
    return [vocab[token] for token in tokens]

df['integer_tokens'] = df['tokens'].progress_apply(tokens_to_integers)

vocab_parquet = pd.DataFrame.from_dict(vocab, orient='index')
vocab_parquet.to_parquet('vocabulary.parquet')

df.drop(columns=['tokens']).to_parquet('integer_encoded_speeches.parquet')

100%|██████████| 2545551/2545551 [00:45<00:00, 55472.67it/s]
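
The `defaultdict(lambda: len(vocab))` idiom assigns each unseen token the next free integer: on a missing key, the factory runs before the key is inserted, so `len(vocab)` equals the number of tokens seen so far. The toy documents below are invented to show the effect, along with one way to decode the integers again.

```python
from collections import defaultdict

vocab = defaultdict(lambda: len(vocab))

docs = [["tax", "cut"], ["tax", "reform", "cut"]]
encoded = [[vocab[tok] for tok in doc] for doc in docs]
print(encoded)      # [[0, 1], [0, 2, 1]]
print(dict(vocab))  # {'tax': 0, 'cut': 1, 'reform': 2}

# Invert the mapping to decode integer tokens back to strings.
id_to_token = {i: tok for tok, i in vocab.items()}
decoded = [id_to_token[i] for i in encoded[1]]
print(decoded)      # ['tax', 'reform', 'cut']

# Optional: freeze the vocabulary so unknown tokens raise KeyError
# instead of silently receiving a new id.
vocab.default_factory = None
```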


In [None]:
good_words = "share change opportunity legacy challenge control truth moral courage reform prosperity crusade movement children family debate compete active we candid humane pristine provide liberty commitment principle unique duty precious premise care tough listen learn help lead vision success empower citizen activist mobilize conflict light dream freedom peace rights pioneer proud building preserve pro-flag pro-children pro-environment reform workfare strength choice fair protect confident incentive initiative passionate".split(" ")
good_n_grams = ["eliminate good-time in prison", "hard work", "common sense"]

In [None]:
bad_words = "decay failure collapse deeper crisis urgent destructive destroy sick pathetic lie liberal they betray consequences limit shallow traitors sensationalists endanger coercion hypocrisy radical threaten devour waste corruption incompetent impose self-serving greed ideological insecure anti-flag anti-family anti-child anti-jobs pessimistic excuses intolerant stagnation welfare corrupt selfish insensitive mandate taxes spend shame disgrace punish bizarre cynicism cheat steal machine bosses obsolete patronage".split(" ")
bad_n_grams = ["unionized bureaucracy", "compassion is not enough", "permissive attitude", "status quo", "abuse of power", "criminal rights"]
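
The notebook does not show how these lists are used. If the intent is to count matches in the processed speeches, one possible approach is sketched below; the helper `count_ngram` and the sample sentence are hypothetical. Note that the speeches saved above are stemmed and stopword-filtered, so the list entries would need the same treatment before they could match.

```python
def count_ngram(tokens, ngram):
    """Count contiguous occurrences of `ngram` (a list of tokens) in `tokens`."""
    n = len(ngram)
    return sum(tokens[i:i + n] == ngram for i in range(len(tokens) - n + 1))

tokens = "we need common sense and hard work and more common sense".split()

print(count_ngram(tokens, ["common", "sense"]))  # 2
print(count_ngram(tokens, ["hard", "work"]))     # 1
print(count_ngram(tokens, ["status", "quo"]))    # 0
```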