## Naive Bayes on Political Text
### Ghassan Seba
In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

In [1]:
import sqlite3
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re
import os
import contextlib
import string
import random
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
import sys
import emoji

In [2]:
# Download NLTK data
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('stopwords', quiet=True)
    
# Initialize lemmatizer
lemmatizer = WordNetLemmatizer()

# Text preprocessing functions
def lemmatize(tokens):
    """Lemmatizes tokens."""
    return [lemmatizer.lemmatize(token) for token in tokens]

# Define stopwords to retain
retain_words = {'about'}  

def remove_stop(tokens):
    """Removes stop words except those in the retain_words set."""
    stop_words = set(stopwords.words('english'))
    return [word for word in tokens if word not in stop_words or word in retain_words]
    
def remove_punctuation(text, punct_set=set(string.punctuation)):
    """Removes punctuation."""
    return "".join([ch for ch in text if ch not in punct_set])

def tokenize(text):
    """Tokenizes text."""
    collapse_whitespace = re.compile(r'\s+')
    return collapse_whitespace.split(text)

def prepare(text, pipeline):
    """Applies transformations to text."""
    text = remove_punctuation(text)
    text = text.lower()  
    tokens = tokenize(text)
    for transform in pipeline:
        tokens = transform(tokens)  
    return tokens

def clean_text(text):
    # Convert emojis to their text aliases
    text = emoji.demojize(text)

    # Remove 'b' and quotes at the start of the string
    text = re.sub(r'^b[\'\"]', '', text)

    # Replace byte-encoded characters
    text = text.replace('xe2x80x99', "'")  # Apostrophe
    text = text.replace('xe2x80x9c', '"').replace('xe2x80x9d', '"')  # Double quotes
    text = text.replace('xe2x80x94', '-')  # Em dash
    text = text.replace('xe2x80x93', '-')  # En dash
    text = text.replace('xe2x80xa6', '...')  # Ellipsis
    text = text.replace('xf0x9fx87xba', '🇺🇸')  # Flag emoji
    text = text.replace('xf0x9fx98x82', '😂')  # Laughing emoji
    
    # Remove leftover escaped byte sequences: a backslash, an 'x', and two word characters
    text = re.sub(r'\\x\w{2}', '', text)

    # Remove URLs
    text = re.sub(r'http\S+', '', text)

    return text

# Define transformation pipeline
pipeline = [remove_stop, lemmatize]
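
The order of steps in `prepare` matters: punctuation is stripped before stop words are filtered, so a contraction such as "don't" becomes "dont" and no longer matches a stop list that stores it with an apostrophe. The sketch below illustrates that effect using only the standard library. `toy_stop_words`, `strip_punct`, and `toy_prepare` are hypothetical stand-ins for the NLTK-backed functions above, not part of the notebook.

```python
import string

# Hypothetical stand-in for a stop list that stores contractions
# with their apostrophes.
toy_stop_words = {"i", "the", "don't"}


def strip_punct(text):
    """Drop every ASCII punctuation character, as remove_punctuation does."""
    return "".join(ch for ch in text if ch not in string.punctuation)


def toy_prepare(text, strip_first=True):
    """Lowercase, then apply punctuation stripping and stop-word filtering."""
    text = text.lower()
    if strip_first:
        # Order used in the notebook: strip punctuation, then filter.
        tokens = strip_punct(text).split()
        return [t for t in tokens if t not in toy_stop_words]
    # Alternative order: filter stop words, then strip punctuation.
    tokens = [t for t in text.split() if t not in toy_stop_words]
    return [strip_punct(t) for t in tokens if strip_punct(t)]


print(toy_prepare("I don't like the noise!"))
print(toy_prepare("I don't like the noise!", strip_first=False))
```

With the notebook's order, "dont" survives as a token; with the alternative order it is filtered out. Whether that matters for the classifier is an open question, since the feature cutoff may drop such tokens anyway.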


In [3]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

### Explore the Database

In [4]:
# Get a list of all table names
convention_cur.execute("SELECT name FROM sqlite_schema WHERE type='table' AND name NOT LIKE 'sqlite_%'")
table_names = convention_cur.fetchall()

# Display Table Names
table_names

[('conventions',)]

In [5]:
# Get column names and types for the 'conventions' table
convention_cur.execute("PRAGMA table_info(conventions)")
columns = convention_cur.fetchall()

# Display Table Info
columns

[(0, 'party', 'TEXT', 0, None, 0),
 (1, 'night', 'INTEGER', 0, None, 0),
 (2, 'speaker', 'TEXT', 0, None, 0),
 (3, 'speaker_count', 'INTEGER', 0, None, 0),
 (4, 'time', 'TEXT', 0, None, 0),
 (5, 'text', 'TEXT', 0, None, 0),
 (6, 'text_len', 'TEXT', 0, None, 0),
 (7, 'file', 'TEXT', 0, None, 0)]

## Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text 
for each party and prepare it for use in Naive Bayes.  

### Fill this list up with items that are themselves lists. 
- The first element in the sublist should be the cleaned and tokenized text in a single string.
- The second element should be the party.

In [6]:
# Create a list to hold convention data
convention_data = []

# Query convention text and party info
query_results = convention_cur.execute(
                            '''
                            SELECT text, party
                            FROM conventions
                            ''')

# Process each row using the prepare function
for row in query_results:
    raw_text, party = row
    tokens = prepare(raw_text, pipeline)
    processed_text = " ".join(tokens)
    
    # Add data to list
    convention_data.append([processed_text, party])

# Close database connection
convention_db.close()

# # Display processed data
# for item in convention_data:
#     print(item)

Let's look at some random entries and see if they look right. 

In [7]:
random.choices(convention_data,k=10)

[['first generation american know dangerous socialist agenda mother mercedes special education teacher aguadilla puerto rico father also immigrant came nation pursuit american dream consider duty fight protect dream rioter must allowed destroy city human sex drug trafficker allowed cross border socialist policy destroyed place like cuba venezuela must take root city school want see socialized bidenharris future country take look california place immense wealth immeasurable innovation immaculate environment democrat turned land discarded heroin needle park riot street blackout home president trump’s america light thing don’t dim build thing don’t burn',
  'Republican'],
 ['today beautiful daughter hope thriving twoyear old crystal fast approaching three year recovery dear friend constant inspiration others hold special place heart facing opioid addiction that’s i’m enormously grateful president leadership fighting deadly enemy effort we’re turning tide crisis addiction president trump d

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it. 

In [8]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2247 as features in the model.


In [9]:
def conv_features(text, fw):
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    
    # Empty dictionary for feature words
    feature_dict = {}
    
    # Tokenize text into words
    tokens = text.split()
    
    # Check if tokens are in feature words
    for token in tokens:
        if token in fw:
            feature_dict[token] = True
    
    # Return the dictionary
    return feature_dict


In [10]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("people are american in america",feature_words)==
                     {'america':True,'american':True,"people":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [11]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [12]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [13]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.48


In [14]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           Republ : Democr =     25.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                religion = True           Republ : Democr =     16.1 : 1.0
                  medium = True           Republ : Democr =     14.9 : 1.0
                 liberal = True           Republ : Democr =     14.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0
                     isi = True           Republ : Democr =     13.0 : 1.0
                 patriot = True           Republ : Democr =     13.0 : 1.0
                    flag = True           Republ : Democr =     12.8 : 1.0
                   trade = True           Republ : Democr =     12.7 : 1.0
               greatness = True           Republ : Democr =     12.1 : 1.0

Write a little prose here about what you see in the classifier. Anything odd or interesting?

***One interesting aspect of the classifier's output is the partisan divide in the most informative features. This divide reflects talking points often emphasized by each party, underscoring the model's ability to capture some ideological differences.***
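
Each ratio in the table compares how likely a feature is under one party versus the other. The sketch below is a simplified illustration of that idea: it estimates the smoothed probability that a word is present in a document of each class and divides one by the other. It approximates, and does not reproduce, NLTK's internal computation. The function name `presence_ratio`, the smoothing constant `alpha`, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter


def presence_ratio(docs, word, label_a, label_b, alpha=0.5):
    """Ratio of smoothed P(word present | label_a) to P(word present | label_b).

    docs is a list of (set_of_words, label) pairs. alpha is an additive
    smoothing constant (an assumption here) so that a word unseen in one
    class does not cause a division by zero.
    """
    doc_counts = Counter(label for _, label in docs)
    hit_counts = Counter(label for words, label in docs if word in words)

    def prob(label):
        # Two outcomes (present / absent), hence 2 * alpha in the denominator.
        return (hit_counts[label] + alpha) / (doc_counts[label] + 2 * alpha)

    return prob(label_a) / prob(label_b)


# Invented toy data, not drawn from the convention speeches.
toy_docs = [
    ({"china", "trade"}, "Republican"),
    ({"china", "flag"}, "Republican"),
    ({"climate", "care"}, "Democratic"),
    ({"climate", "china"}, "Democratic"),
]

print(round(presence_ratio(toy_docs, "china", "Republican", "Democratic"), 2))
```

A large ratio therefore says that a word is far more typical of one party's speeches. It does not say how often the word occurs overall, which is one reason a list of striking features can coexist with weak accuracy.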

### My Observations

***The Naive Bayes classifier, with an accuracy of 48%, is performing below the level of random guessing. The most informative features, such as "china" and "climate," do reflect distinct party language patterns. However, the low accuracy indicates that the model is not effectively capturing the differences between the classes. We should consider improvements such as using n-grams, TF-IDF, or a more advanced classifier to enhance performance.***
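
Whether 48% falls below chance depends on the class balance of the test set. With two equally frequent classes, chance is 50%, but always predicting the larger class sets a higher bar when the classes are uneven. The sketch below computes that majority-class baseline. `majority_baseline` is a hypothetical helper and the labels are invented for illustration.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Invented labels for illustration; in the notebook these would come from
# something like [party for _, party in test_set].
toy_labels = ["Republican"] * 6 + ["Democratic"] * 4
print(majority_baseline(toy_labels))
```

Comparing the classifier's accuracy with this number gives a fairer sense of how far below (or above) a trivial strategy it sits.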

## Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and
is unindexed, so the query takes a minute or two to run on my machine.
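
Because the slow query joins on handle, candidate, and district, adding indexes on those columns is one way to try to speed it up. The sketch below shows the idea on an in-memory stand-in database that contains only the relevant columns. The index names are hypothetical, and since creating an index modifies the database file, it would be safer to work on a copy of `congressional_data.db`.

```python
import sqlite3

# In-memory stand-in for congressional_data.db with only the join columns.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE candidate_data (candidate TEXT, district TEXT, "
            "twitter_handle TEXT, party TEXT)")
cur.execute("CREATE TABLE tweets (candidate TEXT, district TEXT, "
            "handle TEXT, tweet_text TEXT)")

# Hypothetical index names; the column lists mirror the join conditions.
cur.execute("CREATE INDEX IF NOT EXISTS idx_tweets_join "
            "ON tweets (handle, candidate, district)")
cur.execute("CREATE INDEX IF NOT EXISTS idx_cand_join "
            "ON candidate_data (twitter_handle, candidate, district)")
conn.commit()

# Confirm the indexes were created.
cur.execute("SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name")
index_names = [row[0] for row in cur.fetchall()]
print(index_names)
conn.close()
```

How much this helps on the real database is untested here; the `DISTINCT` and the `LIKE` filter in the query still require scanning the matching tweet text.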

In [15]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

### Explore the Database

In [16]:
# Get a list of all table names
cong_cur.execute("SELECT name FROM sqlite_schema WHERE type='table' AND name NOT LIKE 'sqlite_%'")
table_names = cong_cur.fetchall()

# Display Table Names
table_names

[('websites',), ('candidate_data',), ('tweets',)]

In [17]:
# Get column names and types for the 'tweets' table
cong_cur.execute("PRAGMA table_info(tweets)")
columns = cong_cur.fetchall()

# Display Table Info
columns

[(0, 'district', 'TEXT', 0, None, 0),
 (1, 'candidate', 'TEXT', 0, None, 0),
 (2, 'pull_time', 'DATETIME', 0, None, 0),
 (3, 'tweet_time', 'DATETIME', 0, None, 0),
 (4, 'handle', 'TEXT', 0, None, 0),
 (5, 'is_retweet', 'INTEGER', 0, None, 0),
 (6, 'tweet_id', 'TEXT', 0, None, 0),
 (7, 'tweet_text', 'TEXT', 0, None, 0),
 (8, 'likes', 'INTEGER', 0, None, 0),
 (9, 'replies', 'INTEGER', 0, None, 0),
 (10, 'retweets', 'INTEGER', 0, None, 0),
 (11, 'tweet_ratio', 'REAL', 0, None, 0)]

In [18]:
# Get column names and types for the 'candidate_data' table
cong_cur.execute("PRAGMA table_info(candidate_data)")
columns = cong_cur.fetchall()

# Display Table Info
columns

[(0, 'index', 'INTEGER', 0, None, 0),
 (1, 'student', 'TEXT', 0, None, 0),
 (2, 'state', 'TEXT', 0, None, 0),
 (3, 'district_num', 'TEXT', 0, None, 0),
 (4, 'formatted_dist_num', 'INTEGER', 0, None, 0),
 (5, 'abbrev', 'TEXT', 0, None, 0),
 (6, 'district', 'TEXT', 0, None, 0),
 (7, 'candidate', 'TEXT', 0, None, 0),
 (8, 'party', 'TEXT', 0, None, 0),
 (9, 'website', 'TEXT', 0, None, 0),
 (10, 'twitter_handle', 'TEXT', 0, None, 0),
 (11, 'incumbent', 'TEXT', 0, None, 0),
 (12, 'age', 'REAL', 0, None, 0),
 (13, 'gender', 'TEXT', 0, None, 0),
 (14, 'marital_status', 'TEXT', 0, None, 0),
 (15, 'white_non_hispanic', 'TEXT', 0, None, 0),
 (16, 'hispanic', 'TEXT', 0, None, 0),
 (17, 'black', 'TEXT', 0, None, 0),
 (18, 'partisian_lean_pvi', 'TEXT', 0, None, 0),
 (19, 'opposed', 'TEXT', 0, None, 0),
 (20, 'pct_urban', 'TEXT', 0, None, 0),
 (21, 'income', 'REAL', 0, None, 0),
 (22, 'region', 'TEXT', 0, None, 0)]

In [19]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming
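
One detail of the query is worth checking: SQLite's `LIKE` ignores ASCII case by default, so `NOT LIKE '%RT%'` also drops any tweet that contains "rt" inside a word such as "support" or "party". The sketch below demonstrates this on an in-memory table with invented rows and contrasts it with the case-sensitive `GLOB` operator anchored at the start of the text. The `'RT *'` pattern is an assumption about how retweets are marked.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE tweets (tweet_text TEXT)")
cur.executemany(
    "INSERT INTO tweets VALUES (?)",
    [("RT @someone: big news",),      # a retweet
     ("I support our veterans",),     # contains 'rt' inside 'support'
     ("Great day at the fair",)],     # no 'rt' at all
)

# LIKE ignores ASCII case by default, so '%RT%' also matches 'support'.
kept_like = [r[0] for r in cur.execute(
    "SELECT tweet_text FROM tweets "
    "WHERE tweet_text NOT LIKE '%RT%' ORDER BY rowid")]

# GLOB is case-sensitive; anchoring at the start targets the 'RT ' prefix.
kept_glob = [r[0] for r in cur.execute(
    "SELECT tweet_text FROM tweets "
    "WHERE tweet_text NOT GLOB 'RT *' ORDER BY rowid")]

print(kept_like)
print(kept_glob)
conn.close()
```

The `tweets` table also has an `is_retweet` column, which may be a more direct filter if it is populated reliably.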

In [20]:
# Process tweet data
tweet_data = []

# Apply text processing and encoding fix
for row in results:
    candidate, party, raw_tweet = row
    raw_tweet = clean_text(str(raw_tweet))
    tokens = prepare(raw_tweet, pipeline)
    processed_tweet = " ".join(tokens)
    
    # Store processed tweet and party
    tweet_data.append([processed_tweet, party])
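
The `b'` prefix and the escaped byte sequences that `clean_text` strips suggest that some tweets arrive as `bytes`, in which case `str(raw_tweet)` returns the printable representation instead of the decoded text. Assuming that is what happens (the source does not confirm it), decoding first would keep apostrophes and emoji intact. `to_text` below is a hypothetical helper and the sample bytes are invented.

```python
def to_text(value):
    """Return a str, decoding UTF-8 bytes instead of taking their repr."""
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return str(value)


raw = b"Don\xe2\x80\x99t miss it \xf0\x9f\x98\x82"
print(str(raw))      # repr-style text with the b'' wrapper and \x escapes
print(to_text(raw))  # decoded text with the real characters
```

With decoded input, `emoji.demojize` would see real emoji characters, and most of the manual byte replacements might become unnecessary.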


There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [21]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [22]:
for tweet, party in tweet_data_sample:
    # Extract features
    tweet_features = conv_features(tweet, feature_words)
    
    # Classify tweet
    estimated_party = classifier.classify(tweet_features)
    
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")
    

Here's our (cleaned) tweet: earlier today spoke house floor abt protecting health care woman praised ppmarmonte work central coast 
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: go tribe rallytogether 
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: apparently trump think easy student overwhelmed crushing burden debt pay student loan trumpbudget 
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: grateful first responder rescue personnel firefighter police volunteer working tirelessly keep people safe provide muchneeded help putting life linenn
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: let make even greater kag 
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: about 1hr cavs tie series 22 im allin216 repbarbaralee scared roadtovictory
Actual party is Democratic and our c

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [23]:
# Dictionary of counts by actual and estimated party
parties = ['Republican', 'Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties:
    for p1 in parties:
        results[p][p1] = 0

num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data):
    tweet, party = tp    
    
    # Extract features and classify
    tweet_features = conv_features(tweet, feature_words)
    estimated_party = classifier.classify(tweet_features)
    
    # Update results
    results[party][estimated_party] += 1
    
    if idx > num_to_score:
        break
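
The `idx > num_to_score` check runs after each tweet has already been scored and uses a strict comparison, so the loop processes `num_to_score + 2` tweets. That matches the 10,002 total in the confusion matrix reported for this run. The sketch below reproduces the pattern on a toy list and shows `itertools.islice` as one way to stop after exactly the requested number; `items` and `limit` are illustrative names.

```python
from itertools import islice

items = list(range(50))
limit = 10

# Pattern used in the loop above: the check runs after the item is scored
# and uses a strict '>', so limit + 2 items are processed.
scored_break = 0
for idx, item in enumerate(items):
    scored_break += 1
    if idx > limit:
        break

# islice stops after exactly `limit` items.
scored_islice = sum(1 for _ in islice(items, limit))

print(scored_break, scored_islice)
```

The two extra tweets make no practical difference to the metrics, but the `islice` form states the intended sample size directly.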


In [24]:
print("Classification Results:\n")
print(f"{' ':<20}{'Predicted Republican':<25}{'Predicted Democratic':<25}")
print("="*70)
print(f"{'Actual Republican':<20}{results['Republican']['Republican']:<25}{results['Republican']['Democratic']:<25}")
print(f"{'Actual Democrat':<20}{results['Democratic']['Republican']:<25}{results['Democratic']['Democratic']:<25}")

Classification Results:

                    Predicted Republican     Predicted Democratic     
Actual Republican   3776                     502                      
Actual Democrat     4897                     827                      


In [25]:
# Rows of `results` are actual parties; columns are predicted parties.
# Democratic class values
dem_TP = results['Democratic']['Democratic']  # actual Democratic, predicted Democratic
dem_FN = results['Democratic']['Republican']  # actual Democratic, predicted Republican

# Republican class values
rep_TP = results['Republican']['Republican']  # actual Republican, predicted Republican
rep_FN = results['Republican']['Democratic']  # actual Republican, predicted Democratic

# Complementary values: one class's misses are the other class's false positives
dem_FP = rep_FN
dem_TN = rep_TP
rep_FP = dem_FN
rep_TN = dem_TP

# Output
print(f"Democratic Class: TP={dem_TP}, FP={dem_FP}, FN={dem_FN}, TN={dem_TN}")
print(f"Republican Class: TP={rep_TP}, FP={rep_FP}, FN={rep_FN}, TN={rep_TN}")


Democratic Class: TP=827, FP=502, FN=4897, TN=3776
Republican Class: TP=3776, FP=4897, FN=502, TN=827


In [26]:
# Function to calculate metrics
def calculate_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    return accuracy, precision, recall, f1

# Calculate metrics for Democratic and Republican classes
dem_accuracy, dem_precision, dem_recall, dem_f1 = calculate_metrics(dem_TP, dem_FP, dem_FN, dem_TN)
rep_accuracy, rep_precision, rep_recall, rep_f1 = calculate_metrics(rep_TP, rep_FP, rep_FN, rep_TN)

# Create a DataFrame to store the metrics
metrics_summary = {
    "Class": ["Democratic", "Republican"],
    "Accuracy": [dem_accuracy, rep_accuracy],
    "Precision": [dem_precision, rep_precision],
    "Recall": [dem_recall, rep_recall],
    "F1 Score": [dem_f1, rep_f1]
}

# Create and display the DataFrame
metrics_df = pd.DataFrame(metrics_summary)

# Round DataFrame to 4 decimal places
metrics_df = metrics_df.round(4)

# Display dataframe
metrics_df

Unnamed: 0,Class,Accuracy,Precision,Recall,F1 Score
0,Democratic,0.4602,0.6223,0.1445,0.2345
1,Republican,0.4602,0.4354,0.8827,0.5831


In [27]:
# Close database connection
cong_db.close()

### Reflections

***The classifier performs unevenly across the two classes. It labels 8,673 of the 10,002 scored tweets as Republican, so recall and F1 score are much higher for Republican tweets than for Democratic ones. One possible reason is the feature selection process, which keeps only words found in a predefined list (`feature_words`). If that list contains more words commonly used in Republican tweets, predictions could lean toward that class. An imbalance between the parties in the text used to build the list and train the model could skew the classifier in the same way. The text preprocessing step, while simplifying the data, could also be removing contextual information that helps distinguish Democratic from Republican tweets, although when I ran the model without removing the stop words, I received the same metrics. Ultimately, more testing is needed to improve the results.***
