# Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required DB from the shared Dropbox or from Blackboard.

In [35]:
# import packages and libraries
import sqlite3
import nltk
# nltk.download('punkt')
import random
import numpy as np
from collections import Counter, defaultdict

import os
import re
import string
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from sklearn.metrics import confusion_matrix

# Feel free to include your text patterns functions
# from text_functions_solutions import clean_tokenize, get_patterns

In [2]:
# create connection to database
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

## 1. Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" exercise. First, we'll pull in the text 
for each party and prepare it for use in Naive Bayes. 

In [3]:
# Query to pull data and exclude 'Other' party
query = '''
    SELECT text, party
    FROM conventions
    WHERE party != 'Other'
'''

# Execute the query and fetch all the data
query_results = convention_cur.execute(query).fetchall()

# Initialize the list to store results
convention_data = [[row[0], row[1]] for row in query_results]

# Close the database connection
convention_db.close()

# Print only a few records from the convention_data for verification
print(convention_data[:5])  

[['Skip to content The Company Careers Press Freelancers Blog × Services Transcription Captions Foreign Subtitles Translation Freelancers About Contact Login « Return to Transcript Library home  Transcript Categories  All Transcripts 2020 Election Transcripts Classic Speech Transcripts Congressional Testimony & Hearing Transcripts Debate Transcripts Donald Trump Transcripts Entertainment Transcripts Financial Transcripts Interview Transcripts Political Transcripts Press Conference Transcripts Speech Transcripts Sports Transcripts Technology Transcripts Aug 21, 2020 2020 Democratic National Convention (DNC) Night 4 Transcript Rev  ›  Blog  ›  Transcripts  › 2020 Election Transcripts  ›  2020 Democratic National Convention (DNC) Night 4 Transcript Night 4 of the 2020 Democratic National Convention (DNC) on August 20. Read the full transcript of the event here. Transcribe Your Own Content  Try Rev for free  and save time transcribing, captioning, and subtitling.', 'Democratic'], ['I’m her

Let's look at some random entries and see if they look right. 

In [4]:
# Print random entries from database
random.choices(convention_data,k=5)

[['Then this bold man comes down the escalator in New York City. I couldn’t come out and say it right away, but deep down inside, I knew it was going to be the first Republican that I ever voted for.',
  'Republican'],
 ['Our official roll call and the business of our Republican convention was conducted today in Charlotte. We have created a short video to symbolize the excitement for President Trump across all 50 states and territories. Thank you for watching. God bless you and God bless the United States of America.',
  'Republican'],
 ['Kentucky.', 'Republican'],
 ['China would prefer Joe Biden.', 'Republican'],
 ['That was so sweet with the grandkids. Yay. And now we have an official nominee. Onto the next step. Electing Joe Biden and Kamala Harris in November. Make sure you have a plan to vote. Text vote to 30330 to find out how. Now we’re going to talk about a topic that touches all of our lives. Healthcare. The Affordable Care Act was game-changing. This pandemic has revealed jus

It'll be useful for us to have a larger sample size than the 2020 speeches afford, since those speeches tend to be long and contiguous. Let's make a new list-of-lists called `conv_sent_data`. Instead of each first entry in the sublists being an entire speech, make each first entry just a sentence from the speech. Feel free to use NLTK's `sent_tokenize` [function](https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html). 

In [6]:
# Initialize the list to store individual sentences
conv_sent_data = []

# Process each speech in convention_data and split it into sentences using regex
for row in convention_data:
    speech_text, party = row
    sentences = re.split(r'(?<=[.!?]) +', speech_text)
    for sentence in sentences:
        conv_sent_data.append([sentence, party])

# Print only a few records from conv_sent_data for verification
print(conv_sent_data[:5]) 

# I had various troubles using nltk in VS Code and Jupyter, so I used Python's re module instead of sent_tokenize.

[['Skip to content The Company Careers Press Freelancers Blog × Services Transcription Captions Foreign Subtitles Translation Freelancers About Contact Login « Return to Transcript Library home  Transcript Categories  All Transcripts 2020 Election Transcripts Classic Speech Transcripts Congressional Testimony & Hearing Transcripts Debate Transcripts Donald Trump Transcripts Entertainment Transcripts Financial Transcripts Interview Transcripts Political Transcripts Press Conference Transcripts Speech Transcripts Sports Transcripts Technology Transcripts Aug 21, 2020 2020 Democratic National Convention (DNC) Night 4 Transcript Rev  ›  Blog  ›  Transcripts  › 2020 Election Transcripts  ›  2020 Democratic National Convention (DNC) Night 4 Transcript Night 4 of the 2020 Democratic National Convention (DNC) on August 20.', 'Democratic'], ['Read the full transcript of the event here.', 'Democratic'], ['Transcribe Your Own Content  Try Rev for free  and save time transcribing, captioning, and 
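The regex split used above is simple, but it has a known trade-off. The small sketch below applies the same pattern to a made-up sample string (the helper name `split_sentences` and the sample text are hypothetical). It shows that abbreviations such as "Dr." are treated as sentence ends, which is one reason NLTK's `sent_tokenize` may give cleaner results when it is available.

```python
import re

def split_sentences(text):
    """Split on spaces that follow '.', '!' or '?' (same pattern as the cell above)."""
    return [s for s in re.split(r'(?<=[.!?]) +', text) if s]

# Hypothetical sample text, not taken from the convention data
sample = "Thank you. God bless America! Dr. Biden spoke next."
for sentence in split_sentences(sample):
    print(repr(sentence))
```

With this pattern, "Dr." ends up as its own one-word "sentence", so some very short fragments in `conv_sent_data` are to be expected.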

Again, let's look at some random entries. 

In [7]:
# Print random selections
random.choices(conv_sent_data,k=5)

[['30330, that would be the president’s golf score if he didn’t cheat.',
  'Democratic'],
 ['Donald Trump believed in me when I was a teenage golf caddy and he was already one of the wealthiest, most famous people on the entire planet.',
  'Republican'],
 ['While the hurricane was fierce, one of the strongest to make landfall in 150 years, the casualties and damage were far less than thought possible only 24 hours ago.',
  'Republican'],
 ['President Trump is bringing this country back roaring.', 'Republican'],
 ['There are battles that we need to fight and we need to win to secure our future in this country, but there’s one issue that is an existential threat to all of us and that is climate change)',
  'Democratic']]

Now it's time for our final cleaning before modeling. Go through `conv_sent_data` and take the following steps: 

1. Tokenize on whitespace
1. Remove punctuation
1. Remove tokens that fail the `isalpha` test
1. Remove stopwords
1. Casefold to lowercase
1. Join the remaining tokens into a string


In [8]:
# Define stopwords and punctuation
stop_words = set(stopwords.words('english'))
punctuation = string.punctuation

# Initialize the list to store cleaned sentences
clean_conv_sent_data = []

# Process each sentence in conv_sent_data
for idx, sent_party in enumerate(conv_sent_data):
    sentence, party = sent_party
    tokens = sentence.split()
    cleaned_tokens = [
        token.casefold() for token in tokens
        if token.isalpha() and token.casefold() not in stop_words
    ]
    # Join the remaining tokens into a string
    cleaned_sentence = ' '.join(cleaned_tokens)
    if cleaned_sentence:  # Only add if the cleaned sentence is not empty
        clean_conv_sent_data.append((cleaned_sentence, party))

# Print 5 random records from clean_conv_sent_data for verification
print(random.choices(clean_conv_sent_data, k=5)) 

[('every', 'Democratic'), ('common sense goals hopes believe', 'Republican'), ('knowing change vote leaders think make', 'Democratic'), ('imagine like going sleep wondering', 'Democratic'), ('samuel even told life worth', 'Republican')]
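Note that the cleaning cell above applies `isalpha` without first stripping punctuation, so a token such as `change.` is dropped rather than kept as `change`. A minimal sketch of a variant that strips ASCII punctuation first is shown below. The function name `clean_sentence` and the tiny stopword set are assumptions for illustration (the notebook uses NLTK's English stopwords), and running it on the real data would change the feature counts reported later.

```python
import string

# Hypothetical tiny stopword set; the notebook uses nltk's English stopwords
TOY_STOP_WORDS = {"the", "is", "a", "and", "of", "to"}

# Translation table that deletes ASCII punctuation characters
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_sentence(sentence, stop_words=TOY_STOP_WORDS):
    """Tokenize on whitespace, strip punctuation, keep alphabetic
    non-stopword tokens, casefold, and join back into a string."""
    tokens = sentence.split()
    tokens = [t.translate(PUNCT_TABLE) for t in tokens]
    tokens = [t.casefold() for t in tokens if t.isalpha()]
    tokens = [t for t in tokens if t not in stop_words]
    return " ".join(tokens)

print(clean_sentence("Climate change is the existential threat, and 2020 matters!"))
```

`string.punctuation` covers ASCII punctuation only, so tokens containing a curly apostrophe (as in "I’m") would still fail `isalpha` and be dropped under this sketch.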


If that looks good, let's make our function to turn these into features. First we need to build our list of candidate words. I started my exploration at a cutoff of 5. 

In [9]:
# Set the word frequency cutoff
word_cutoff = 5

# Get all tokens from clean_conv_sent_data
tokens = [w for t, p in clean_conv_sent_data for w in t.split()]

# Calculate word frequency distribution
word_dist = nltk.FreqDist(tokens)

# Identify feature words based on the cutoff
feature_words = set()

for word, count in word_dist.items():
    if count > word_cutoff:
        feature_words.add(word)

# Print the number of feature words
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} words as features in the model.")

With a word cutoff of 5, we have 1776 words as features in the model.
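Because the comparison is `count > word_cutoff`, a word needs at least six occurrences to become a feature when the cutoff is 5. The same selection can be sketched with `collections.Counter` on toy data; the helper name `build_feature_words` and the toy sentences are hypothetical.

```python
from collections import Counter

def build_feature_words(cleaned_sentences, cutoff):
    """Return the set of words that occur strictly more than `cutoff` times."""
    counts = Counter(w for s in cleaned_sentences for w in s.split())
    return {w for w, c in counts.items() if c > cutoff}

# Hypothetical cleaned sentences
toy_sentences = ["climate change", "climate vote", "vote climate"]

# 'climate' appears 3 times, 'vote' twice, 'change' once
print(sorted(build_feature_words(toy_sentences, cutoff=1)))
```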


In [10]:
# Define function to extract features from text
def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    
    # Tokenize the text on whitespace and convert it to a set of words
    tokens = set(text.casefold().split())

    # Create a dictionary for the feature words
    ret_dict = {word: True for word in tokens if word in fw}
    
    # Return dictionary
    return(ret_dict)

In [11]:
# Assertion to make sure feature_words is not empty
assert(len(feature_words)>0)

# Assertion to make sure the target words in quotations are identified as features
assert(conv_features("obama was the president",feature_words)==
       {'obama':True,'president':True})

# Assertion to make sure the target words in quotations are identified as features
assert(conv_features("some people in america are citizens",feature_words)==
                     {'people':True,'america':True,"citizens":True})

print("All assertions passed!")

All assertions passed!


Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [12]:
# Create list of tuples with feature dictionaries with corresponding party label
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [13]:
# Set random seed
random.seed(20220507)

# Set random shuffle to ensure that the training and test data are randomly distributed
random.shuffle(featuresets)

# Define test size
test_size = 500

In [14]:
# Define test and train data
test_set, train_set = featuresets[:test_size], featuresets[test_size:]

# Train Naive classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Print classifier accuracy
print(nltk.classify.accuracy(classifier, test_set))

0.502
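One thing worth flagging: the feature-set cell above iterates over `convention_data` (the raw, speech-level text), while `feature_words` was derived from `clean_conv_sent_data`. If sentence-level features were intended, the construction might look like the sketch below. The toy data and toy feature words are hypothetical stand-ins, and the accuracy printed above comes from the original cell, not from this variant.

```python
import random

def conv_features(text, fw):
    """Same logic as the notebook's conv_features: presence of feature words."""
    return {word: True for word in set(text.casefold().split()) if word in fw}

# Hypothetical stand-in for clean_conv_sent_data: (cleaned sentence, party) pairs
toy_clean_data = [
    ("climate change threat", "Democratic"),
    ("law enforcement heroes", "Republican"),
    ("vote climate", "Democratic"),
    ("china enforcement", "Republican"),
]
toy_feature_words = {"climate", "enforcement", "vote", "china"}

# Build features from the cleaned sentences rather than the raw speeches
featuresets = [(conv_features(text, toy_feature_words), party)
               for text, party in toy_clean_data]

# Shuffle, then hold out the first test_size items for testing
random.seed(20220507)
random.shuffle(featuresets)
test_size = 1
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
print(len(train_set), len(test_set))
```

The resulting `train_set` could then be passed to `nltk.NaiveBayesClassifier.train` exactly as in the cell above.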


In [16]:
# Print 25 most informative features used by the classifier.
classifier.show_most_informative_features(25)

Most Informative Features
             enforcement = True           Republ : Democr =     27.5 : 1.0
                   china = True           Republ : Democr =     26.5 : 1.0
                   votes = True           Democr : Republ =     21.6 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                 destroy = True           Republ : Democr =     17.1 : 1.0
                supports = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     15.9 : 1.0
                preserve = True           Republ : Democr =     15.1 : 1.0
                  signed = True           Republ : Democr =     15.1 : 1.0
                freedoms = True           Republ : Democr =     14.0 : 1.0
                 abraham = True           Republ : Democr =     11.9 : 1.0
                 private = True           Republ : Democr =     11.9 : 1.0
                    drug = True           Republ : Democr =     10.9 : 1.0

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

Based on the classifier's most informative features, I can see some clear distinctions in the language used by Republican and Democratic speakers, and I believe these align well with the themes and priorities of each party. For instance, Republicans use firmer words such as "enforcement," "china," "media," "freedoms," "private," and "amendment". This suggests that Republicans frequently emphasize themes like law enforcement, national security, individual freedoms, and media issues. Democrats, on the other hand, tend to use somewhat softer words such as "climate," "votes," "elect," "sanders," and "america". This indicates that climate change is a key issue for Democrats, as expected. The presence of "votes" and "elect" suggests a focus on electoral participation and democracy, which is often a major theme during Democratic conventions and campaigns.

Both parties also emphasize personalities, like "sanders" for Democrats and "abraham" for Republicans. I believe each party invokes well-known and historical figures to strengthen its message. Polarizing themes are also evident in words such as "defund" (Republicans) or "climate" (Democrats). Given the parties' opposing views, this is not a surprise. Finally, the ratio values (e.g., 27.5 : 1.0 for "enforcement") suggest that certain words are strongly indicative of party affiliation. These high ratios imply that some terms are rarely, if ever, used by the opposing party, demonstrating a clear division in the rhetoric or topics discussed by each party.
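To make the ratio values more concrete, here is a simplified sketch of how a likelihood ratio for a single word could be computed from sentence counts. The counts are hypothetical, and the add-0.5 smoothing is an assumption chosen for illustration; NLTK's own estimate may differ in detail.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, alpha=0.5):
    """Ratio of smoothed P(word present | party A) to P(word present | party B).

    alpha is added to each of the two outcomes (present / absent)."""
    p_a = (count_a + alpha) / (total_a + 2 * alpha)
    p_b = (count_b + alpha) / (total_b + 2 * alpha)
    return p_a / p_b

# Hypothetical counts: the word appears in 40 of 1000 party-A sentences
# and in 1 of 1000 party-B sentences.
ratio = likelihood_ratio(40, 1000, 1, 1000)
print(f"{ratio:.1f} : 1.0")
```

A word that one party uses only once or twice can therefore produce a large ratio even if the other party uses it in just a few percent of its sentences, which is worth keeping in mind when reading the table.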

## 2. Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and
is unindexed, so the query takes a minute or two to run on my machine.

In [23]:
# Create connection to database
cong_db = sqlite3.connect("congressional_data.db")

# Configure cursor
cong_cur = cong_db.cursor()

In [24]:
# Query
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

# Results in lists
results = list(results) # Just to store it, since the query is time consuming

In [25]:
# Verify a few results from query
print(results[:3]) 

[('Mo Brooks', 'Republican', b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq'), ('Mo Brooks', 'Republican', b'"Brooks: Senate Democrats Allowing President to Give Americans\xe2\x80\x99 Jobs to Illegals" #securetheborder https://t.co/mZtEaX8xS6'), ('Mo Brooks', 'Republican', b'"NASA on the Square" event this Sat. 11AM \xe2\x80\x93 4PM. Stop by &amp; hear about the incredible work done in #AL05! @DowntownHSV http://t.co/R9zY8WMEpA')]


In [26]:
# Initialize the list to store tweet data
tweet_data = []

# Iterate through the query results
for row in results:
    candidate, party, tweet_text = row
    tweet_data.append([tweet_text, party])

# Print only a few records from tweet_data for verification
print(tweet_data[:5])

[[b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq', 'Republican'], [b'"Brooks: Senate Democrats Allowing President to Give Americans\xe2\x80\x99 Jobs to Illegals" #securetheborder https://t.co/mZtEaX8xS6', 'Republican'], [b'"NASA on the Square" event this Sat. 11AM \xe2\x80\x93 4PM. Stop by &amp; hear about the incredible work done in #AL05! @DowntownHSV http://t.co/R9zY8WMEpA', 'Republican'], [b'"The trouble with Socialism is that eventually you run out of other people\'s money." - Margaret Thatcher https://t.co/X97g7wzQwJ', 'Republican'], [b'"The trouble with socialism is eventually you run out of other people\'s money" \xe2\x80\x93 Thatcher. She\'ll be sorely missed. http://t.co/Z8gBnDQUh8', 'Republican']]


There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [27]:
# Set random seed for reproducibility
random.seed(20201014)

# Take random sample
tweet_data_sample = random.choices(tweet_data,k=10)

In [29]:
# Update clean_text to handle byte strings
def clean_text(text):
    if isinstance(text, bytes):
        text = text.decode('utf-8')
        
    # Tokenize on whitespace and convert to lowercase
    tokens = text.casefold().split()
    # Remove punctuation, non-alphabetic tokens, stopwords
    cleaned_tokens = [
        token for token in tokens
        if token.isalpha() and token not in stop_words
    ]
    # Join tokens back into a single cleaned string
    return ' '.join(cleaned_tokens)

# Loop through each tweet in the sample
for tweet, party in tweet_data_sample:
    cleaned_tweet = clean_text(tweet)
    
    # Convert cleaned tweet to feature dictionary
    tweet_features = conv_features(cleaned_tweet, feature_words)
    
    # Use the classifier to estimate the party
    estimated_party = classifier.classify(tweet_features)
    
    # Print the results
    print(f"Here's our (cleaned) tweet: {cleaned_tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")

Here's our (cleaned) tweet: earlier spoke house floor abt protecting health care women praised work central
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: go
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: trump thinks easy students overwhelmed crushing burden debt pay student loans
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: grateful first rescue volunteers working tirelessly keep people provide putting lives
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: make even greater
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: tie series
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: congrats new gig sd city glad continue
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: really raised

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [33]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

# Initialize the dictionary with counts set to zero
for p in parties :
    for p1 in parties :
        results[p][p1] = 0

# Number of tweets to classify
num_to_score = 10000

# Shuffle tweet_data for random sampling
random.seed(20220507)
random.shuffle(tweet_data)

# Classify tweets and store results
for idx, tp in enumerate(tweet_data):
    tweet, actual_party = tp
    
    # Clean the tweet text
    cleaned_tweet = clean_text(tweet)
    
    # Convert cleaned tweet to feature dictionary
    tweet_features = conv_features(cleaned_tweet, feature_words)
    
    # Estimate the party using the classifier
    estimated_party = classifier.classify(tweet_features)
    
    # Increment the count in the results dictionary
    results[actual_party][estimated_party] += 1
    
    # Stop once idx reaches num_to_score (idx starts at 0, so this scores num_to_score + 1 tweets)
    if idx >= num_to_score:
        break

# Print the results
for actual in parties:
    for estimated in parties:
        print(f"Actual: {actual}, Estimated: {estimated}, Count: {results[actual][estimated]}")

Actual: Republican, Estimated: Republican, Count: 3540
Actual: Republican, Estimated: Democratic, Count: 767
Actual: Democratic, Estimated: Republican, Count: 4564
Actual: Democratic, Estimated: Democratic, Count: 1130


In [36]:
# Extract the counts from the results dictionary
y_true = []
y_pred = []

# Populate the lists with true and predicted labels based on the counts in results
for actual_party in parties:
    for estimated_party in parties:
        count = results[actual_party][estimated_party]
        y_true.extend([actual_party] * count)
        y_pred.extend([estimated_party] * count)

# Create a simple confusion matrix
conf_matrix = confusion_matrix(y_true, y_pred, labels=['Republican', 'Democratic'])

# Print the confusion matrix
print("Confusion Matrix:")
print(conf_matrix)

Confusion Matrix:
[[3540  767]
 [4564 1130]]
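The matrix above (rows are actual party, columns are estimated party) can be turned into accuracy, precision, and recall with a few lines of plain Python. This is a minimal sketch that hard-codes the counts printed above; the variable names are my own.

```python
# Rows: actual party; columns: estimated party (same order as `labels`)
labels = ["Republican", "Democratic"]
conf = [[3540, 767],
        [4564, 1130]]

total = sum(sum(row) for row in conf)
accuracy = (conf[0][0] + conf[1][1]) / total

metrics = {}
for i, label in enumerate(labels):
    true_pos = conf[i][i]
    recall = true_pos / sum(conf[i])                   # share of actual class found
    precision = true_pos / sum(row[i] for row in conf) # share of predictions that are right
    metrics[label] = (precision, recall)

print(f"Accuracy: {accuracy:.1%}")
for label, (precision, recall) in metrics.items():
    print(f"{label}: precision {precision:.1%}, recall {recall:.1%}")
```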


### Reflections

Based on the results, I conclude that the Naive Bayes classifier does no better than random guessing on the tweets: its overall accuracy is 46.7% (4,670 of 10,001 correct), slightly below the 50% expected from a coin flip. The high recall for Republican (82.2%) combined with low precision (43.7%) indicates that the classifier finds most Republican tweets but often mislabels Democratic tweets as Republican. The low recall for Democratic (19.8%) shows that the classifier struggles to correctly identify Democratic tweets.

In essence, it seems that the classifier might be biased towards predicting "Republican" more often, possibly because of an imbalance in the data or because the feature words are more representative of Republican language. It might be worth revisiting the features or trying to balance the dataset to improve the model's performance.