## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

In [1]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict
import pandas as pd
import string

# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns

In [2]:
# Ensure necessary NLTK resources are downloaded
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize the lemmatizer and stop words
lemmatizer = nltk.WordNetLemmatizer()
stop_words = set(nltk.corpus.stopwords.words('english'))

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/mirnaphilip/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/mirnaphilip/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/mirnaphilip/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [3]:
# Connect to the convention database
convention_db_path = '/Users/mirnaphilip/Desktop/Applied Text Minning/m4/Data/2020_Conventions.db'
convention_db = sqlite3.connect(convention_db_path)
convention_cur = convention_db.cursor()

### Part 1: Exploratory Naive Bayes

We'll first build an NB model on the convention data itself, as a way to understand which words distinguish the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text for each party and prepare it for use in Naive Bayes.

In [4]:
# Define a function to clean and tokenize text
def preprocess_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = nltk.word_tokenize(text)
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)

# Part 1: Exploratory Naive Bayes
convention_data = []

# Query to retrieve convention speeches and their associated party
query_results = convention_cur.execute(
    '''
    SELECT text, party
    FROM conventions
    '''
)

# Check if the query results contain data
query_results = query_results.fetchall()
if not query_results:
    data_status = "The database or table is empty."
else:
    for row in query_results:
        cleaned_text = preprocess_text(row[0])
        party = row[1]
        convention_data.append([cleaned_text, party])
    data_status = f"Retrieved {len(query_results)} rows from the database."

# Close the database connection
convention_db.close()

convention_data[:5], data_status


([['skip content company career press freelancer blog × service transcription caption foreign subtitle translation freelancer contact login « return transcript library home transcript category transcript 2020 election transcript classic speech transcript congressional testimony hearing transcript debate transcript donald trump transcript entertainment transcript financial transcript interview transcript political transcript press conference transcript speech transcript sport transcript technology transcript aug 21 2020 2020 democratic national convention dnc night 4 transcript rev › blog › transcript › 2020 election transcript › 2020 democratic national convention dnc night 4 transcript night 4 2020 democratic national convention dnc august 20 read full transcript event transcribe content try rev free save time transcribing captioning subtitling',
   'Democratic'],
  ['’ calling full session 48th quadrennial national convention democratic party order welcome final session historic memo

Let's look at some random entries and see if they look right. 

In [5]:
random.choices(convention_data,k=10)

[['search warrant executed home mark patricia mccloskey', 'Republican'],
 ['never say often loudly enough immigrant refugee revitalize renew america ’ something take granted ’ something cherish fight god bless god bless united state america',
  'Democratic'],
 ['sick 10 day really bad got everything besides cough recovered work month half work local county jail',
  'Republican'],
 ['already built 300 mile border wall adding 10 new mile every single week wall soon complete working beyond wildest expectation joined evening member border patrol union representing country ’ courageous border agent thank much thank brave brave people see country love law enforcement really love respect learned tennessee valley authority laid hundred american worker forced train lower paid foreign replacement promptly removed chairman board talented american worker rehired back providing power georgia alabama tennessee kentucky mississippi north carolina virginia',
  'Republican'],
 ['missouri', 'Republican'

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it. 

In [6]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2188 as features in the model.
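
The comparison in the loop is strict, so a word has to appear more than `word_cutoff` times to be kept. A small sketch on a toy token list (the tokens are invented for illustration) shows how `>` and `>=` differ:

```python
from collections import Counter

# Toy tokens, invented for illustration only.
toy_tokens = ["tax"] * 6 + ["vote"] * 5 + ["rally"] * 2
toy_counts = Counter(toy_tokens)
cutoff = 5

# Strict comparison: keep words seen more than `cutoff` times.
strict = {w for w, c in toy_counts.items() if c > cutoff}
# Inclusive comparison: keep words seen at least `cutoff` times.
inclusive = {w for w, c in toy_counts.items() if c >= cutoff}

print(sorted(strict))
print(sorted(inclusive))
```

A word seen exactly `cutoff` times ("vote" here) is dropped by the strict version and kept by the inclusive one.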


In [7]:
def conv_features(text, fw):
    """Given some text, this returns a dictionary holding the feature words.
    
    Args: 
        * text: a piece of text in a continuous string. Assumes
        text has been cleaned and case folded.
        * fw: the *feature words* that we're considering. A word 
        in `text` must be in fw in order to be returned. This 
        prevents us from considering very rarely occurring words.
    
    Returns: 
        A dictionary with the words in `text` that appear in `fw`. 
        Words are only counted once. 
        If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
        then this would return a dictionary of 
        {'quick' : True,
         'fox' :    True}
    
    """
    
    # Initialize an empty dictionary to hold the feature words
    ret_dict = dict()
    
    # Split the text into words
    words = text.split()
    
    # Iterate over each word in the text
    for word in words:
        # If the word is in the feature words set, add it to the dictionary
        if word in fw:
            ret_dict[word] = True
    
    return ret_dict

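# Example usage with assertions.
# Note: this assignment replaces the corpus vocabulary built in the previous
# cell, so every later cell sees only these five feature words.
feature_words = {'donald', 'president', 'people', 'american', 'america'}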

assert len(feature_words) > 0

assert conv_features("donald is the president", feature_words) == {'donald': True, 'president': True}
assert conv_features("people are american in america", feature_words) == {'america': True, 'american': True, 'people': True}

print("All assertions passed.")


All assertions passed.
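
The check above reassigns `feature_words`, so the vocabulary built from the corpus is lost for the cells that follow. One way to avoid that, sketched here with a hypothetical name `test_feature_words` and a tiny stand-in for the corpus vocabulary, is to keep the hand-picked test set under its own name:

```python
def conv_features_sketch(text, fw):
    """Same idea as conv_features: mark each feature word present in text."""
    return {word: True for word in text.split() if word in fw}

# Stand-in for the corpus-derived vocabulary (the real one is much larger).
corpus_feature_words = {"donald", "president", "people", "border", "health"}

# Hypothetical separate name for the hand-picked test vocabulary.
test_feature_words = {"donald", "president", "people", "american", "america"}

assert conv_features_sketch("donald is the president", test_feature_words) == {
    "donald": True,
    "president": True,
}

# The corpus vocabulary is untouched by the check above.
print(sorted(corpus_feature_words))
```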


In [8]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("people are american in america",feature_words)==
                     {'america':True,'american':True,"people":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [9]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [10]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [11]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.6


In [12]:
classifier.show_most_informative_features(25)

Most Informative Features
                  donald = True           Republ : Democr =      2.5 : 1.0
                american = True           Republ : Democr =      2.4 : 1.0
                 america = True           Republ : Democr =      2.3 : 1.0
               president = True           Republ : Democr =      1.9 : 1.0
                  people = True           Republ : Democr =      1.3 : 1.0
               president = None           Democr : Republ =      1.3 : 1.0
                american = None           Democr : Republ =      1.2 : 1.0
                 america = None           Democr : Republ =      1.2 : 1.0
                  donald = None           Democr : Republ =      1.1 : 1.0
                  people = None           Democr : Republ =      1.1 : 1.0
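
Each line in that listing is a ratio of two smoothed conditional probabilities, P(feature = value | party). The sketch below reproduces the arithmetic with invented counts and add-0.5 smoothing. As far as I know that matches the default estimator of `nltk.NaiveBayesClassifier.train`, but treat it as an assumption rather than a description of NLTK's internals:

```python
def smoothed_prob(count, total, n_values=2, gamma=0.5):
    """P(value | label) with additive smoothing over n_values possible values."""
    return (count + gamma) / (total + gamma * n_values)

# Invented counts: number of speech segments per party containing the word.
rep_with_word, rep_total = 50, 400
dem_with_word, dem_total = 20, 400

p_rep = smoothed_prob(rep_with_word, rep_total)
p_dem = smoothed_prob(dem_with_word, dem_total)

# Likelihood ratio in the same "x : 1.0" form as the listing.
ratio = p_rep / p_dem
print(f"{ratio:.1f} : 1.0")
```

A ratio near 1.0 means the feature value is about equally likely under both parties and carries little information.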


Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

_The most informative features are "donald," "american," "america," "president," and "people." Each is more likely to appear in Republican speech segments than in Democratic ones, with likelihood ratios between 1.3 and 2.5 to 1. That fits the prominence of Donald Trump and national-identity themes at the Republican convention, but the ratios are modest, so none of these words is a strong signal on its own. The odd part is that only five words are listed even though 25 were requested: the assertion cell reassigns `feature_words` to a five-word set, so the classifier was trained on those five presence/absence features rather than on the 2188 words that passed the cutoff. That likely explains the accuracy of 0.6 on the 500-item test set, which may be little better than always guessing the larger class. Training on the full vocabulary is the first thing to try before turning to more sophisticated models._



### Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [13]:
cong_db = sqlite3.connect("/Users/mirnaphilip/Desktop/Applied Text Minning/m4/Data/congressional_data.db")
cong_cur = cong_db.cursor()

In [14]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming

# Close the database connection
cong_db.close()
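
If the slow part is the unindexed join, one option is to add indexes on the join columns before querying. The sketch below runs against a throwaway in-memory database that reuses the table and column names from the query. The index names and sample row are made up, and whether this helps on the real file depends on its size and on having write access to it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Minimal stand-ins for the two tables used in the query.
cur.execute(
    "CREATE TABLE candidate_data "
    "(candidate TEXT, party TEXT, district TEXT, twitter_handle TEXT)"
)
cur.execute(
    "CREATE TABLE tweets "
    "(handle TEXT, candidate TEXT, district TEXT, tweet_text TEXT)"
)
cur.execute(
    "INSERT INTO candidate_data VALUES ('A. Example', 'Democratic', 'XX-01', 'aexample')"
)
cur.execute(
    "INSERT INTO tweets VALUES ('aexample', 'A. Example', 'XX-01', 'hello voters')"
)

# Hypothetical index names; the columns are the ones the join compares.
cur.execute(
    "CREATE INDEX IF NOT EXISTS idx_tweets_join ON tweets (handle, candidate, district)"
)
cur.execute(
    "CREATE INDEX IF NOT EXISTS idx_cand_join "
    "ON candidate_data (twitter_handle, candidate, district)"
)

rows = cur.execute(
    """
    SELECT DISTINCT cd.candidate, cd.party, tw.tweet_text
    FROM candidate_data cd
    INNER JOIN tweets tw ON cd.twitter_handle = tw.handle
        AND cd.candidate = tw.candidate
        AND cd.district = tw.district
    WHERE cd.party IN ('Republican', 'Democratic')
    """
).fetchall()

index_names = [
    r[0] for r in cur.execute("SELECT name FROM sqlite_master WHERE type = 'index'")
]
con.close()
print(rows)
print(index_names)
```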

In [15]:
# Define a function to preprocess the tweet text (clean and tokenize)
def preprocess_text(text):
    if isinstance(text, bytes):
        text = text.decode('utf-8')
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = nltk.word_tokenize(text)
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    return ' '.join(words)

# Process each tweet in the results
tweet_data = []
for candidate, party, tweet_text in results:
    cleaned_text = preprocess_text(tweet_text)
    tweet_data.append([cleaned_text, party])

# Display the first few rows of the processed tweet data
tweet_data[:5]


[['brook join alabama delegation voting flawed funding bill httptco3cwjiwysnq',
  'Republican'],
 ['brook senate democrat allowing president give american ’ job illegals securetheborder httpstcomzteax8xs6',
  'Republican'],
 ['nasa square event sat 11am – 4pm stop amp hear incredible work done al05 downtownhsv httptcor9zy8wmepa',
  'Republican'],
 ['trouble socialism eventually run people money margaret thatcher httpstcox97g7wzqwj',
  'Republican'],
 ['trouble socialism eventually run people money – thatcher shell sorely missed httptcoz8gbndquh8',
  'Republican']]
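
The cleaned tweets above still contain fused links such as `httptco3cwjiwysnq` and curly apostrophes, because `string.punctuation` covers only ASCII punctuation. A possible variant, not what this notebook ran, removes URLs first and then strips every character in a Unicode punctuation category. The name `strip_urls_and_punct` is made up for this sketch:

```python
import re
import unicodedata

# Matches http:// or https:// followed by any run of non-space characters.
URL_RE = re.compile(r"https?://\S+")

def strip_urls_and_punct(text):
    """Lowercase, drop URLs, then drop all Unicode punctuation characters."""
    text = URL_RE.sub(" ", text.lower())
    # Unicode general categories starting with "P" are punctuation,
    # which includes curly quotes and dashes as well as ASCII marks.
    kept = [ch for ch in text if not unicodedata.category(ch).startswith("P")]
    # Collapse the whitespace left behind by the removals.
    return " ".join("".join(kept).split())

print(strip_urls_and_punct("Let’s secure the border! https://t.co/abc123"))
```

Note that `#` and `@` count as punctuation too, so hashtags and mentions lose their marker but keep their text.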

There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [16]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)
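
`random.choices` draws with replacement, so the same tweet can show up more than once in the sample, while `random.sample` draws without replacement. A small sketch on a stand-in list (the seed and list are arbitrary):

```python
import random

items = list(range(5))
rng = random.Random(0)  # local generator so the global seed is untouched

# With replacement: duplicates are possible.
with_replacement = rng.choices(items, k=5)

# Without replacement: every drawn element is distinct.
without_replacement = rng.sample(items, k=5)

print(with_replacement)
print(sorted(without_replacement))
```

With ten draws from a large list of tweets the difference rarely matters, but it does for larger samples.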

In [20]:
for tweet, party in tweet_data_sample:
    features = conv_features(tweet, feature_words)
    estimated_party = classifier.classify(features)
    
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")
    

Here's our (cleaned) tweet: earlier today spoke house floor abt protecting health care woman praised ppmarmonte work central coast httpstcowqgtrzt7vv
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: go tribe rallytogether httpstco0nxutfl9l5
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: apparently trump think easy student overwhelmed crushing burden debt pay student loan trumpbudget httpstcockyqo5t0qh
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: ’ grateful first responder rescue personnel firefighter police volunteer working tirelessly keep people safe provide muchneeded help putting life line httpstcoezpv0vmiz3
Actual party is Republican and our classifier says Democratic.

Here's our (cleaned) tweet: let ’ make even greater kag 🇺🇸 httpstcoy9qozd5l2z
Actual party is Republican and our classifier says Democratic.

Here's our (cleaned) tweet: 1hr cavs ti

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [22]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data):
    tweet, party = tp
    
    # Get the estimated party
    features = conv_features(tweet, feature_words)
    estimated_party = classifier.classify(features)
    
    results[party][estimated_party] += 1
    
    if idx > num_to_score:
        break
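
Because the `idx > num_to_score` test comes after the tally, the loop above scores `num_to_score + 2` tweets rather than exactly `num_to_score`. If an exact count matters, `itertools.islice` is one way to cap the loop. The sketch uses a stand-in list and small numbers:

```python
from itertools import islice

data = list(range(100))  # stand-in for tweet_data
n = 10                   # stand-in for num_to_score

# Same pattern as the loop above: tally first, then test the index.
scored_break = 0
for idx, item in enumerate(data):
    scored_break += 1
    if idx > n:
        break

# islice stops after exactly n items.
scored_islice = sum(1 for _ in islice(data, n))

print(scored_break, scored_islice)
```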

In [23]:
# Display the results
for actual_party in results:
    for estimated_party in results[actual_party]:
        print(f"Actual: {actual_party}, Estimated: {estimated_party}, Count: {results[actual_party][estimated_party]}")

Actual: Republican, Estimated: Republican, Count: 327
Actual: Republican, Estimated: Democratic, Count: 3964
Actual: Democratic, Estimated: Republican, Count: 501
Actual: Democratic, Estimated: Democratic, Count: 5210


### Reflections

_The classifier is heavily biased toward "Democratic": it assigned that label to 9,174 of the 10,002 scored tweets (about 92%), regardless of the actual party. Republican tweets were misclassified as Democratic 3,964 times, versus 501 Democratic tweets misclassified as Republican, so only 327 of 4,291 Republican tweets (about 8%) were labeled correctly. Overall accuracy is (327 + 5,210) / 10,002, or about 55%, which is slightly below the 57% that labeling every tweet Democratic would achieve. A likely cause is that `feature_words` holds only five words at this point, so most tweets contain none of them and the prediction rests on the absence of those words, which leans Democratic in the most-informative-features listing. The convention speeches and the tweets also differ in length and vocabulary, and the cleaning step leaves URLs and hashtags fused into single tokens. Better feature selection and tweet-specific preprocessing are the obvious next steps._
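
The counts printed above are enough to work out accuracy, per-party recall, and a majority-class baseline. A short sketch, with the four counts copied from the output:

```python
# Confusion counts from the output above: first key is actual, second is estimated.
counts = {
    "Republican": {"Republican": 327, "Democratic": 3964},
    "Democratic": {"Republican": 501, "Democratic": 5210},
}

total = sum(sum(row.values()) for row in counts.values())
correct = sum(counts[party][party] for party in counts)
accuracy = correct / total

# Recall: share of each party's tweets that received the right label.
recall = {party: counts[party][party] / sum(counts[party].values()) for party in counts}

# Baseline: accuracy of always predicting the more common actual party.
majority_baseline = max(sum(row.values()) for row in counts.values()) / total

print(f"total={total} accuracy={accuracy:.3f} baseline={majority_baseline:.3f}")
for party, value in recall.items():
    print(f"recall[{party}]={value:.3f}")
```

Comparing accuracy against the majority baseline is a quick check on whether the classifier is learning anything beyond the class balance.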