## [Computational Social Science] Project 5: Natural Language Processing

In this project, you will use natural language processing techniques to explore a dataset containing tweets from members of the 116th United States Congress, which met from January 3, 2019 to January 2, 2021. The dataset has also been cleaned to contain information about each legislator. Concretely, you will do the following:

* Preprocess the text of legislators' tweets
* Conduct Exploratory Data Analysis of the text
* Use sentiment analysis to explore differences between legislators' tweets
* Featurize text with manual feature engineering, frequency-based, and vector-based techniques
* Predict legislators' political parties and whether they are a Senator or Representative

You will explore two questions that relate to two central findings in political science and examine how they relate to the text of legislators' tweets. First, political scientists have argued that U.S. politics is currently highly polarized relative to other periods in American history, but also that the polarization is asymmetric. Historically, there were several conservative Democrats (i.e., "blue dog Democrats") and liberal Republicans (i.e., "Rockefeller Republicans"), as measured by popular measurement tools like [DW-NOMINATE](https://en.wikipedia.org/wiki/NOMINATE_(scaling_method)). However, in the last few years, there have been few, if any, examples of any Democrat in Congress being further to the right than any Republican and vice versa. At the same time, scholars have argued that this polarization is mostly a function of the Republican party moving further right than the Democratic party has moved left. **Does this sort of asymmetric polarization show up in how politicians communicate to their constituents through tweets?**

Second, the U.S. Congress is a bicameral legislature, and there has long been debate about partisanship in the Senate versus the House. The House of Representatives is apportioned by population and all members serve two year terms. In the Senate, each state receives two Senators and each Senator serves a term of six years. For a variety of reasons (smaller chamber size, more insulation from the voters, rules and norms like the filibuster, etc.), the Senate has been argued to be the "cooling saucer" of Congress in that it is more bipartisan and moderate than the House. **Does the theory that the Senate is more moderate have support in Senators' tweets?**

**Note**: See the project handout for more details on caveats and the data dictionary.

In [1]:
# Workarounds found online for "ImportError: cannot import name 'triu' from 'scipy.linalg'" (raised when importing gensim)
# Attempt 1: import triu from numpy instead. This did not help.
#from numpy import triu

# Attempt 2: scipy 1.13.1 no longer provides scipy.linalg.triu, so install an older version.
#!pip uninstall scipy verbose
#!pip install scipy==1.12


In [2]:
pip show scipy

Note: you may need to restart the kernel to use updated packages.
Name: scipy
Version: 1.12.0
Summary: Fundamental algorithms for scientific computing in Python
Home-page: https://scipy.org/
Author: 
Author-email: 

In [3]:
#!pip install contractions
import contractions

In [4]:
import scipy
print(scipy.__version__)

1.12.0


In [5]:
#1. pandas and numpy
import pandas as pd
import numpy as np

#some stuff I needed later
import re

In [6]:
#2. punctuation, stop words and English language model
from string import punctuation
from spacy.lang.en.stop_words import STOP_WORDS
import en_core_web_sm
nlp = en_core_web_sm.load()

In [7]:
#3. textblob
from textblob import TextBlob

In [8]:
#4. countvectorizer, tfidfvectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [9]:
#5. gensim -- importing this initially raised "ImportError: cannot import name 'triu' from 'scipy.linalg'".
# Installing scipy 1.12 (see the first cell) fixed that error.
import gensim
from gensim import models

In [10]:
#6. plotting
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline


In [12]:
# load data 
# ----------
congress_tweets = pd.read_csv("data/116th Congressional Tweets and Demographics.csv")
# fill in this line of code with a sufficient number of tweets, depending on your computational resources
#congress_tweets = congress_tweets.sample(...)
congress_tweets.head()

FileNotFoundError: [Errno 2] No such file or directory: 'data/116th Congressional Tweets and Demographics.csv'

## Preprocessing

The first step in working with text data is to preprocess it. Make sure you do the following:

* Remove punctuation and stop words. The `rem_punc_stop()` function we used in lab is provided to you but you should feel free to edit it as necessary for other steps
* Remove tokens that occur frequently in tweets, but may not be helpful for downstream classification. For instance, many tweets contain a flag for retweeting, or share a URL 

As you search online, you might run into solutions that rely on regular expressions. You are free to use these, but you should also be able to preprocess using the techniques we covered in lab. Specifically, we encourage you to use spaCy's token attributes and string methods to do some of this text preprocessing.
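
As a rough illustration of the second bullet, here is a minimal sketch that uses only string methods to drop retweet flags and URLs from a list of tokens. The function name `drop_tweet_noise`, the `NOISE_FLAGS` set, and the example tweet are made up for this sketch and are not part of the lab code. If you work with spaCy tokens instead of plain strings, the `token.like_url` attribute is one alternative to the `startswith` check.

```python
from string import punctuation

# Assumed noise tokens for illustration; adjust to what you see in your own corpus
NOISE_FLAGS = {"rt", "qt", "amp"}

def drop_tweet_noise(tokens):
    """Drop retweet flags, HTML-escape leftovers, and URLs from a list of string tokens."""
    kept = []
    for tok in tokens:
        lowered = tok.lower()
        # URLs are shared in many tweets but carry little signal for classification
        if lowered.startswith(("http", "www.")):
            continue
        # Strip surrounding punctuation so "&amp;" is compared as "amp"
        if lowered.strip(punctuation) in NOISE_FLAGS:
            continue
        kept.append(tok)
    return kept

# Hypothetical tweet text, split on whitespace for simplicity
example = "RT Proud to support this bill https://t.co/abc123 &amp; more".split()
cleaned = drop_tweet_noise(example)
print(cleaned)
```

The filter runs before any stop word removal, so the same function can sit in front of `rem_punc_stop()` or inside it.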

In [None]:
type(congress_tweets)

In [None]:
##subsetting the data for testing
# subset the first 100 rows
congress_tweets_100 = congress_tweets[:100]



In [None]:
## or a random sample of 100? Setting a seed for reproducibility
np.random.seed(10)
ct_samp = congress_tweets.sample(n=100)
ct_samp


In [None]:
## setting up function: original
def rem_punc_stop(text):
    stop_words = STOP_WORDS
    punc = set(punctuation)
    
    punc_free = "".join([ch for ch in text if ch not in punc])
    
    doc = nlp(punc_free)
    
    spacy_words = [token.text for token in doc]
    
    spacy_words = [word for word in spacy_words if not word.startswith('http')]

    no_punc = [word for word in spacy_words if word not in stop_words]
    
    return no_punc

In [None]:
##applying function to relevant column in ct_samp
ct_samp['tokens'] = ct_samp['text'].map(lambda x: rem_punc_stop(x)) # can use apply here 
ct_samp['tokens'] # visualize

In [None]:
for text in ct_samp['text']:
    print(text)

In [None]:
for t in ct_samp['tokens']:
    print(t)

## Making edits to preprocessing here

Removing punctuation and stop words, removing mentions and hashtags, removing emojis (if possible!), removing URLs, and lowercasing. I think removing numbers, abbreviations, and quotation marks is also a good idea. My computer is really slow, so the more I remove the better off I'll be. I guess I'll lemmatize for the same reason, since lemmatizing shrinks the vocabulary.

Not sure what to do about entities. Going to leave them alone for now.

On removing hashtags and mentions: mentions I feel sure of, because while they illustrate sentiment and loyalty, there's no real way to code which way the mention is going (support or recrimination) or what valence each account carries. Hashtags I'm less sure of. A lot of the time they are just filler (#TBT!), but sometimes they do show sentiment. They are also not spaced, so I'm not sure whether spaCy will treat them as words. That could be a huge lemmatization problem. And in general hashtags only repeat sentiments already expressed in the body of the text, unless they are there for comedic effect. So I'm leaning towards getting rid of them. I might regret it, though.
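
If hashtags were kept rather than dropped, the spacing problem could be handled with a heuristic like the one below. This is only a sketch under the assumption that hashtags use CamelCase; the function name `split_hashtag` and the example tags are made up, and an all-lowercase tag such as `#forthepeople` would come back as a single token.

```python
import re

# Heuristic pattern: runs of capitals (acronyms), capitalized or lowercase words, digit runs
_HASHTAG_PARTS = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")

def split_hashtag(tag):
    """Split a CamelCase hashtag into lowercase word pieces."""
    body = tag.lstrip("#")
    return [part.lower() for part in _HASHTAG_PARTS.findall(body)]

print(split_hashtag("#ForThePeople"))
print(split_hashtag("#TBT"))
print(split_hashtag("#COVID19"))
```

The pieces could then be passed through the same stop word and lemmatization steps as the rest of the tweet.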

In [None]:
## First version of the preprocessing function; a more efficient version is defined further below
## Custom dictionary of leftover tokens I keep seeing in the random samples
abbreviations = {
    'qt': '', 
    'rt': '',    
    'amp': '',  
    's': '',  
    'n': ''  
}

##preprocessing function. Not sure if it's best practice to shove everything into this function, but it seems to work so I'm going for it.
def rem_punc_stop_lem(text):
    stop_words = STOP_WORDS
    punc = set(punctuation)

    #Lowercase
    text = text.lower()

    # Expand contractions
    text = contractions.fix(text)

    # Remove fragments like 'l, 't, 's, etc.
    text = re.sub(r"'[a-zA-Z]+", "", text)

    # Remove abbreviations
    for abbr, replacement in abbreviations.items():
        text = re.sub(r'\b' + abbr + r'\b', replacement, text)
    
    # Tokenize 
    doc = nlp(text)
    
    # Extract/lemmatize while filtering out mentions (words that start with '@')
    lem_words = [token.lemma_ for token in doc if not token.text.startswith('@')]
    
    # Reconstruct text again without mentions
    text_no_mentions = " ".join(lem_words)

    # Remove punctuation 
    punc_free = "".join([ch for ch in text_no_mentions if ch not in punc])

    # Remove numbers
    punc_free = re.sub(r'\d+', '', punc_free) 

    #Remove emojis
    punc_free = re.sub(r'[^\x00-\x7F]+', '', punc_free) 

    #getting rid of random spaces and newline html stuff 
    punc_free = re.sub(r'\s+', ' ', punc_free)  # replace multiple spaces/newlines/tabs/etc with single space
    punc_free = punc_free.strip()  # remove leading/trailing 
    
    # Tokenize again
    doc = nlp(punc_free)

    # Extract 
    lem_words = [token.lemma_ for token in doc]

    # Remove words that start with 'http' (URLs)
    lem_words = [word for word in lem_words if not word.startswith('http')]

    # Remove stop words
    no_punc = [word for word in lem_words if word not in stop_words]

    return no_punc

In [None]:
np.random.seed(12)
ct_samp = congress_tweets.sample(n=100)
ct_samp['tokens'] = ct_samp['text'].map(lambda x: rem_punc_stop(x)) 
for t in ct_samp['tokens']:
    print(t)

In [None]:
np.random.seed(4)
ct_samp = congress_tweets.sample(n=100)
ct_samp['tokens_lem'] = ct_samp['text'].map(lambda x: rem_punc_stop_lem(x)) 
for t in ct_samp['tokens_lem']:
    print(t)

In [None]:
np.random.seed(12)
ct_samp = congress_tweets.sample(n=100)
ct_samp['tokens_lem'] = ct_samp['text'].map(lambda x: rem_punc_stop_lem(x)) 
for t in ct_samp['tokens_lem']:
    print(t)

In [None]:
# # did lemmatizing do anything? (commenting out since lemmatizing worked and the comparison var is defunct. )
# subset_tokens = ct_samp[['tokens', 'tokens_lem']]
# print(subset_tokens)

In [None]:
# ##applying function to full data
congress_tweets['tokens'] = congress_tweets['text'].map(lambda x: rem_punc_stop_lem(x))


In [None]:
## The full-data run keeps timing out, so go back to a 100-tweet sample for now
np.random.seed(12)
ct_samp = congress_tweets.sample(n=100)
ct_samp['tokens_lem'] = ct_samp['text'].map(lambda x: rem_punc_stop_lem(x)) 
for t in ct_samp['tokens_lem']:
    print(t)

In [None]:
## The first pass took a really long time, so I asked ChatGPT to make the code more efficient...
import spacy  # needed for spacy.load; the earlier cells only imported en_core_web_sm

## Load spaCy model once (for efficiency)
nlp = spacy.load('en_core_web_sm', disable=["ner", "parser"])  # Disable unnecessary components

# Precompile regex patterns for efficiency
apostrophe_fragments = re.compile(r"'[a-zA-Z]+")  # Match contractions like 's, 't, etc.
non_ascii_chars = re.compile(r"[^\x00-\x7F]+")  # Match emojis and non-ASCII characters
extra_spaces = re.compile(r"\s+")  # Match multiple spaces

# Custom dictionary of words to remove
abbreviations = {'qt', 'rt', 'amp', 's', 'n'}

def rem_punc_stop_lem_efficient(text):
    """Preprocess text by lowercasing, expanding contractions, lemmatizing, 
    and filtering out stop words, punctuation, and unwanted elements."""
    
    # Lowercase text
    text = text.lower()

    # Expand contractions (e.g., "he'd" -> "he would")
    text = contractions.fix(text)

    # Remove apostrophe fragments (e.g., 's, 't, etc.)
    text = apostrophe_fragments.sub("", text)

    # Tokenize and process text in **one pass**
    doc = nlp(text)

    # Extract & lemmatize words while filtering
    lem_words = [
        token.lemma_ for token in doc 
        if not token.is_stop  # Remove stop words
        and not token.is_punct  # Remove punctuation
        and not token.text.startswith('@')  # Remove mentions (@user)
        and not token.text.startswith('http')  # Remove URLs
        and token.text not in abbreviations  # Remove custom words
        and not token.is_digit  # Remove numbers
    ]

    # Remove non-ASCII characters (emojis)
    punc_free = non_ascii_chars.sub("", " ".join(lem_words))

    # Replace multiple spaces and strip leading/trailing whitespace
    cleaned_text = extra_spaces.sub(" ", punc_free).strip()

    return cleaned_text.split()  # Return as a list of tokens


In [None]:
##applying function to full data
congress_tweets['tokens'] = congress_tweets['text'].map(lambda x: rem_punc_stop_lem_efficient(x))


## Exploratory Data Analysis

Use two of the techniques we covered in lab (or other techniques outside of lab!) to explore the text of the tweets. You should construct these visualizations with an eye toward the eventual classification tasks: (1) predicting the legislator's political party based on the text of their tweet, and (2) predicting whether the legislator is a Senator or Representative. As a reminder, in lab we covered word frequencies, word clouds, word/character counts, scattertext, and topic modeling as possible exploration tools. 
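
As one possible starting point for the word-frequency option, the sketch below counts the most common tokens per group with `collections.Counter`. The rows are toy data invented for illustration, not taken from the dataset, and `top_tokens_by_group` is a hypothetical helper name.

```python
from collections import Counter

# Toy (group, tokens) rows for illustration only; these are not real tweets
toy_rows = [
    ("Democrat", ["health", "care", "vote"]),
    ("Democrat", ["health", "climate"]),
    ("Republican", ["tax", "border", "vote"]),
    ("Republican", ["tax", "job"]),
]

def top_tokens_by_group(rows, n=2):
    """Return the n most frequent tokens for each group label."""
    counts = {}
    for group, tokens in rows:
        # One Counter per group, updated with every token list in that group
        counts.setdefault(group, Counter()).update(tokens)
    return {group: counter.most_common(n) for group, counter in counts.items()}

top = top_tokens_by_group(toy_rows, n=1)
print(top)
```

With the real data, the rows could come from zipping a group column with the `tokens` column; the name of the party or chamber column depends on the data dictionary in the handout.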

### EDA 1

In [None]:
... 

### EDA 2

In [None]:
...

## Sentiment Analysis

Next, let's analyze the sentiments contained within the tweets. You may use TextBlob or another library for these tasks. Do the following:

* Choose two legislators, one who you think will be more liberal and one who you think will be more conservative, and analyze their sentiment and/or subjectivity scores per tweet. For instance, you might do two scatterplots that plot each legislator's sentiment against their subjectivity, or two density plots for their sentiments. Do the scores match what you thought?
* Plot two more visualizations like the ones you chose in the first part, but do them to compare (1) Democrats v. Republicans and (2) Senators v. Representatives 

`TextBlob` has already been imported in the top cell.

In [None]:
...

## Featurization

Before going to classification, explore different featurization techniques. Create three dataframes or arrays to represent your text features, specifically:

* Features engineered from your previous analysis. For example, word counts, sentiment scores, topic model etc.
* A term frequency-inverse document frequency matrix. 
* An embedding-based featurization (like a document averaged word2vec)

In the next section, you will experiment with each of these featurization techniques to see which one produces the best classifications.
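
For the tf-idf matrix, one detail worth noting is that the preprocessing above returns lists of tokens rather than raw strings. A minimal sketch, assuming token lists as input, is to pass a callable `analyzer` to `TfidfVectorizer` so that scikit-learn uses the tokens as they are. The three toy documents are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy pre-tokenized documents for illustration only
toy_tokens = [
    ["health", "care", "vote"],
    ["tax", "cut", "vote"],
    ["health", "climate"],
]

# A callable analyzer receives each document unchanged and returns its features,
# so the existing token lists are used without re-tokenizing
tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)
tfidf_matrix = tfidf.fit_transform(toy_tokens)

print(tfidf_matrix.shape)        # (number of documents, vocabulary size)
print(sorted(tfidf.vocabulary_)) # learned vocabulary
```

An alternative is to join each token list back into a string and use the default analyzer. A vectorizer built with a lambda cannot be pickled, which only matters if you plan to save the fitted object.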

In [None]:
...

### Engineered Text Features

In [None]:
# Engineered Features
...

### Bag-of-words or Tf-idf

In [None]:
# Frequency Based featurization
...

### Word Embedding

In [None]:
# Load Word2Vec model from Google; OPTIONAL depending on your computational resources (the file is ~1 GB)
# Also note that this file path assumes that the word vectors are underneath 'data'; you may wish to point to the CSS course repo and change the path
# or move the vector file to the project repo 

#model = gensim.models.KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin.gz', binary = True) 

In [None]:
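# Function to average word embeddings for a document; use examples from lab to apply this function. You can also use other techniques such as PCA and doc2vec instead.
def document_vector(word2vec_model, doc):
    # Keep only the tokens that have a vector in the model's vocabulary
    doc = [word for word in doc if word in word2vec_model]
    # Fall back to a zero vector when none of the tokens are in the vocabulary
    if not doc:
        return np.zeros(word2vec_model.vector_size)
    return np.mean(word2vec_model[doc], axis=0)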
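
The averaging idea can be checked on toy vectors without loading the ~1 GB file. The sketch below uses a plain dictionary of made-up 2-dimensional vectors in place of a real word2vec model; `average_vector` and `toy_vectors` are hypothetical names used only for this illustration.

```python
import numpy as np

# Made-up 2-dimensional "embeddings" standing in for a real word2vec model
toy_vectors = {
    "health": np.array([1.0, 0.0]),
    "care": np.array([0.0, 1.0]),
    "tax": np.array([1.0, 1.0]),
}

def average_vector(vectors, tokens, dim=2):
    """Average the vectors of the tokens that exist in the lookup table."""
    known = [vectors[tok] for tok in tokens if tok in vectors]
    # Tweets with no known tokens get a zero vector instead of raising an error
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

doc_vec = average_vector(toy_vectors, ["health", "care", "unknownword"])
print(doc_vec)
```

Out-of-vocabulary tokens are skipped, so the average is taken over the known tokens only.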

In [None]:
# embedding based featurization
...

## Classification

Either use cross-validation or partition your data with training/validation/test sets for this section. Do the following:

* Choose a supervised learning algorithm such as logistic regression, random forest etc. 
* Train six models. For each of the three dataframes you created in the featurization part, train one model to predict whether the author of the tweet is a Democrat or Republican, and a second model to predict whether the author is a Senator or Representative.
* Report the accuracy and other relevant metrics for each of these six models.
* Choose the featurization technique associated with your best model. Combine those text features with non-text features. Train two more models: (1) A supervised learning algorithm that uses just the non-text features and (2) a supervised learning algorithm that combines text and non-text features. Report accuracy and other relevant metrics. 

If time permits, you are encouraged to use hyperparameter tuning or AutoML techniques like TPOT, but are not explicitly required to do so.
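
As a sketch of the cross-validation option, the example below fits a logistic regression with 5-fold cross-validation. The feature matrix and labels are synthetic and generated only for illustration; in the project, `X` would be one of the three featurizations and `y` the party or chamber label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only: 200 "tweets" with 5 numeric features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Binary label driven mostly by the first feature, plus some noise
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

clf = LogisticRegression(max_iter=1000)
# cv=5 splits the data into five folds and returns one accuracy score per fold
scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")

print(scores)
print(scores.mean())
```

Changing `scoring` to another metric such as `"f1"` reports a different measure from the same folds, which helps when the classes are imbalanced.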

### Train Six Models with Just Text

In [None]:
# six models ([engineered features, frequency-based, embedding] * [democrat/republican, senator/representative])
...

### Two Combined Models

In [None]:
# two models ([best text features + non-text features] * [democrat/republican, senator/representative])
...

## Discussion Questions

1. Why do standard preprocessing techniques need to be further customized to a particular corpus?

**YOUR ANSWER HERE** ...

2. Did you find evidence for the idea that Democrats and Republicans have different sentiments in their tweets? What about Senators and Representatives?

**YOUR ANSWER HERE** ...

3. Why is validating your exploratory and unsupervised learning approaches with a supervised learning algorithm valuable?

**YOUR ANSWER HERE** ...

4. Did text only, non-text only, or text and non-text features together perform the best? What is the intuition behind combining text and non-text features in a supervised learning algorithm?

**YOUR ANSWER HERE** ...