```
Muris Saab
ADS509 - Assignment 4.1
University of San Diego
```

## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

In [1]:
import sqlite3
import nltk
import random
import numpy as np
import re
from collections import Counter, defaultdict

# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')

def clean_tokenize(text):
    text = text.lower() 
    text = re.sub(r"http\S+|www\S+|https\S+", '', text, flags=re.MULTILINE)
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)
    words = nltk.word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]
    return words

[nltk_data] Downloading package omw-1.4 to /Users/muriss/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/muriss/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/muriss/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/muriss/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/muriss/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
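To see what the regex steps in `clean_tokenize` do to a tweet-like string, here is a minimal sketch that uses only the standard library. It assumes a plain whitespace split in place of `nltk.word_tokenize` and skips stopword removal, so it is an approximation of the function above rather than a replacement. The helper name `regex_clean` is made up for this illustration.

```python
import re

def regex_clean(text):
    """Apply the lowercase/URL/digit/punctuation steps from clean_tokenize."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+|https\S+", "", text, flags=re.MULTILINE)  # drop URLs
    text = re.sub(r"\d+", "", text)       # drop digits
    text = re.sub(r"[^\w\s]", "", text)   # drop punctuation, including ' and #
    # Assumption: whitespace split stands in for nltk.word_tokenize here.
    return text.split()

print(regex_clean("That's 100% true! https://t.co/abc #SecureTheBorder"))
# ['thats', 'true', 'securetheborder']
```

Note that stripping punctuation fuses contractions ("that's" becomes "thats") and turns hashtags into ordinary tokens, which matches tokens such as `thats` and `securetheborder` in the outputs further down.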


In [2]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()


### Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text 
for each party and prepare it for use in Naive Bayes.  

In [3]:
convention_data = []

# Fill this list up with items that are themselves lists. The 
# first element in the sublist should be the cleaned and tokenized
# text (kept here as a list of tokens). The second element should be the party. 

query_results = convention_cur.execute(
    '''
    SELECT text, party
    FROM conventions                            
    '''
)

for row in query_results :
    text = row[0]  # the raw text
    party = row[1]  # the associated party
    cleaned_text = clean_tokenize(text)
    convention_data.append([cleaned_text, party])
      

Let's look at some random entries and see if they look right. 

In [4]:
random.choices(convention_data,k=10)

[[['hereby',
   'endorsing',
   'joe',
   'biden',
   'next',
   'president',
   'united',
   'states'],
  'Democratic'],
 [['us',
   'knowing',
   'power',
   'kamala',
   'harris',
   'america',
   'us',
   'knowing',
   'power',
   'us',
   'live',
   'people',
   'right',
   'remind',
   'see',
   'hear',
   'matter',
   'going',
   'vice',
   'president',
   'united',
   'states'],
  'Democratic'],
 [['together',
   'nations',
   'europe',
   'experienced',
   'greater',
   'increase',
   'excess',
   'mortality',
   'united',
   'states',
   'think',
   'enacted',
   'largest',
   'package',
   'financial',
   'relief',
   'american',
   'history',
   'thanks',
   'paycheck',
   'protection',
   'program',
   'saved',
   'supported',
   'million',
   'american',
   'jobs',
   'thats',
   'one',
   'reasons',
   'advancing',
   'rapidly',
   'economy',
   'great',
   'job',
   'result',
   'seen',
   'smallest',
   'economic',
   'contraction',
   'major',
   'western',
   'nation

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it. 

In [5]:
word_cutoff = 5

# clean_tokenize already returns a list of tokens, so no .split() is needed here.
tokens = [w for t, p in convention_data for w in t]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2327 as features in the model.
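The loop above keeps a word only when its count is strictly greater than `word_cutoff`. A small sketch on made-up tokens shows how that differs from a `>=` comparison. The token list and the variable names here are purely illustrative.

```python
from collections import Counter

# Hypothetical toy corpus: "america" x6, "jobs" x5, "flag" x2
toy_tokens = ["america"] * 6 + ["jobs"] * 5 + ["flag"] * 2
counts = Counter(toy_tokens)
cutoff = 5

strictly_more = {w for w, c in counts.items() if c > cutoff}   # same test as the loop above
at_least = {w for w, c in counts.items() if c >= cutoff}       # alternative comparison

print(sorted(strictly_more))
print(sorted(at_least))
```

With a cutoff of 5, a word seen exactly five times ("jobs" here) is kept only by the `>=` version.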


In [6]:
def conv_features(text, fw):
    """
    Given some text, this returns a dictionary holding the feature words.

    Args: 
        * text: a piece of text or tokenized text.
        * fw: the *feature words* that we're considering. A word 
        in `text` must be in fw in order to be returned.
    
    Returns: 
        A dictionary with the words in `text` that appear in `fw`. 
        Words are only counted once. 
    """
    
    # If text is a string, tokenize it
    if isinstance(text, str):
        words = nltk.word_tokenize(text)
    else:
        words = text  # If text is already tokenized
    
    # Initialize the lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # Initialize the return dictionary
    ret_dict = dict()
    
    # Iterate over the words in the text and check if they are in the feature words
    for word in words:
        lemma_word = lemmatizer.lemmatize(word)
        if lemma_word in fw:
            ret_dict[lemma_word] = True
    
    return ret_dict


In [7]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("people are american in america",feature_words)==
                     {'america':True,'american':True,"people":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [8]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [9]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [10]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.474
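An accuracy number is easier to read next to a baseline. One simple reference point is the accuracy of always predicting the most common label. The sketch below computes that on made-up labels; `majority_baseline` is a hypothetical helper, not part of NLTK.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical label list, only for illustration
toy_labels = ["Democratic"] * 6 + ["Republican"] * 4
print(majority_baseline(toy_labels))
```

In this notebook it could be called as `majority_baseline([party for _, party in test_set])` to compare against the classifier's score.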


In [11]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           Republ : Democr =     27.1 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                religion = True           Republ : Democr =     16.1 : 1.0
                 liberal = True           Republ : Democr =     14.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0
                  defund = True           Republ : Democr =     13.0 : 1.0
                    flag = True           Republ : Democr =     12.8 : 1.0
                   trade = True           Republ : Democr =     12.7 : 1.0
               greatness = True           Republ : Democr =     12.1 : 1.0
                 abraham = True           Republ : Democr =     11.9 : 1.0
               terrorist = True           Republ : Democr =     11.9 : 1.0

Write a little prose here about what you see in the classifier. Anything odd or interesting?

- **Republican Keywords**:
   Many of the most informative features are strongly associated with the Republican party. Words like "china," "enforcement," "destroy," "liberal," "defund," "flag," and "trade" appear disproportionately in Republican speeches, as evidenced by the high ratios (e.g., "china" is 27.1 times more likely to appear in Republican speeches than Democratic ones). This suggests that the Republican convention focused heavily on issues such as foreign policy (e.g., "china"), law and order ("enforcement," "crime"), national pride ("flag"), and cultural criticism ("liberal," "defund").

- **Democratic Keywords**:
   On the Democratic side, fewer features stand out, but "climate" is a prominent word associated with Democratic speeches. This aligns with the Democratic party’s focus on addressing climate change, which is often a major platform issue for them.

- **Polarity**:
   There is a stark polarity in the words used by the two parties, especially in the context of security and national defense. For example, words like "enforcement," "defund," "defense," and "terrorist" are highly associated with Republican speeches. These reflect themes around law enforcement, military strength, and national security, which are traditionally more emphasized by Republicans.

- **Unusual Words**:
   Some words like "mike" and "abraham" appear as highly informative features for Republicans. These could refer to prominent figures (e.g., Mike Pence or Abraham Lincoln), which may indicate that these figures were commonly referenced during Republican speeches.

- **Potential Bias**:
   It’s also interesting that the classifier seems more capable of identifying features associated with Republican speeches than Democratic ones, given the higher number of informative features for the Republican party. This could suggest that Republican speeches in this dataset are more distinct in their word choice compared to Democratic speeches. Alternatively, it might reflect imbalances in the dataset or differences in how the two parties frame their key issues.
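The ratios in the "Most Informative Features" table compare how likely a feature is under one party versus the other. Here is a rough sketch of that calculation with add-0.5 smoothing. My understanding is that NLTK's default estimator works along these lines, but treat the details as an assumption. The counts are invented for illustration and are not taken from the convention data.

```python
def smoothed_prob(docs_with_word, total_docs, gamma=0.5, bins=2):
    """Smoothed estimate of P(word present | party).

    Assumption: two possible feature values (present / absent), hence bins=2.
    """
    return (docs_with_word + gamma) / (total_docs + gamma * bins)

# Hypothetical counts: a word appears in 40 of 1000 Republican
# segments and in 1 of 1000 Democratic segments.
p_rep = smoothed_prob(40, 1000)
p_dem = smoothed_prob(1, 1000)
ratio = p_rep / p_dem

print(f"Republ : Democr = {ratio:.1f} : 1.0")
```

A word that is rare overall but almost exclusive to one party can therefore get a very large ratio, which is worth keeping in mind when reading the list.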

## Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [12]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [16]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')


results = list(results) # Just to store it, since the query is time consuming

results[:1]

[('Mo Brooks',
  'Republican',
  b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq')]
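One thing worth checking about the `NOT LIKE '%RT%'` filter: SQLite's `LIKE` is case-insensitive for ASCII letters by default, so the pattern matches any text containing "rt" anywhere, not only retweets. The sketch below demonstrates this on a made-up in-memory table and shows a case-sensitive `GLOB` prefix test as one possible alternative. The toy table stores text, whereas the real column comes back as bytes, so behavior on the actual database may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (tweet_text TEXT)")
conn.executemany(
    "INSERT INTO tweets VALUES (?)",
    [
        ("RT @someone: big news",),         # a retweet
        ("I support our party platform",),  # original tweet that contains "rt"
        ("Great day in Alabama",),          # original tweet without "rt"
    ],
)

# Filter used in the notebook query
like_kept = [r[0] for r in conn.execute(
    "SELECT tweet_text FROM tweets WHERE tweet_text NOT LIKE '%RT%'")]

# Alternative (an assumption about what counts as a retweet): prefix "RT @"
glob_kept = [r[0] for r in conn.execute(
    "SELECT tweet_text FROM tweets WHERE tweet_text NOT GLOB 'RT @*'")]

print(like_kept)
print(glob_kept)
conn.close()
```

In the toy data the `LIKE` filter also drops the "support ... party" tweet, while the `GLOB` version removes only the row that starts with `RT @`.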

In [19]:
tweet_data = []

# Now fill up tweet_data with sublists like we did on the convention speeches.
# Note that this may take a bit of time, since we have a lot of tweets.

for row in results:
    name = row[0] 
    party = row[1]  
    tweet = row[2].decode('utf-8') 
    cleaned_text = clean_tokenize(tweet)
    tweet_data.append([cleaned_text, party])

print(tweet_data[:5])


[[['brooks', 'joins', 'alabama', 'delegation', 'voting', 'flawed', 'funding', 'bill'], 'Republican'], [['brooks', 'senate', 'democrats', 'allowing', 'president', 'give', 'americans', 'jobs', 'illegals', 'securetheborder'], 'Republican'], [['nasa', 'square', 'event', 'sat', 'pm', 'stop', 'amp', 'hear', 'incredible', 'work', 'done', 'al', 'downtownhsv'], 'Republican'], [['trouble', 'socialism', 'eventually', 'run', 'peoples', 'money', 'margaret', 'thatcher'], 'Republican'], [['trouble', 'socialism', 'eventually', 'run', 'peoples', 'money', 'thatcher', 'shell', 'sorely', 'missed'], 'Republican']]
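The tweets come back from SQLite as `bytes`, and a strict `.decode('utf-8')` raises `UnicodeDecodeError` if any row holds invalid bytes. A minimal sketch of a more forgiving decode follows. `safe_decode` is a hypothetical helper, and whether the real data contains such rows is not something I have checked.

```python
def safe_decode(raw):
    """Return text for either bytes or str input.

    Invalid UTF-8 bytes are replaced with U+FFFD instead of raising.
    """
    if isinstance(raw, bytes):
        return raw.decode("utf-8", errors="replace")
    return raw

print(safe_decode(b"caf\xc3\xa9"))     # valid UTF-8
print(safe_decode(b"bad \xff byte"))   # invalid byte is replaced
print(safe_decode("already text"))     # str passes through unchanged
```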


There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [22]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)
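A small note on the sampling call: `random.choices` draws with replacement, so the same tweet can show up more than once, while `random.sample` draws without replacement. A quick sketch on a toy list (the population and seed are arbitrary):

```python
import random

random.seed(0)
population = list(range(5))

with_replacement = random.choices(population, k=5)    # duplicates are possible
without_replacement = random.sample(population, k=5)  # every item at most once

print(with_replacement)
print(without_replacement)
```

With ten draws from a very large list of tweets, duplicates are unlikely either way, so this mostly matters for small populations.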

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [24]:
def tweet_features(words):
    return {word: True for word in words}
featuresets = [(tweet_features(tweet), party) for tweet, party in tweet_data]

random.seed(20201014)
random.shuffle(featuresets)

train_set = featuresets[:int(len(featuresets) * 0.8)]
test_set = featuresets[int(len(featuresets) * 0.8):]
classifier = nltk.NaiveBayesClassifier.train(train_set)

for tweet, party in tweet_data_sample:
    features = tweet_features(tweet)
    estimated_party = classifier.classify(features)
    
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")


Here's our (cleaned) tweet: ['liz', 'indiana', 'like', 'line', 'dance', 'amp', 'good', 'time']
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: ['every', 'day', 'sit', 'wait', 'people', 'infected', 'sayfie']
Actual party is Republican and our classifier says Democratic.

Here's our (cleaned) tweet: ['realdonaldtrump', 'promised', 'hed', 'end', 'radical', 'obama', 'jobkilling', 'regulations', 'coal', 'restore', 'american', 'energy', 'independence', 'hes', 'delivering', 'todays', 'action', 'means', 'jobs', 'lower', 'prices', 'return', 'american', 'strength']
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: ['thank', 'yous', 'around', 'reelection', 'time', 'heads']
Actual party is Republican and our classifier says Democratic.

Here's our (cleaned) tweet: ['speed', 'walk', 'run', 'click', 'link', 'register', 'vote', 'deadline', 'tomorrow', 'register', 'make', 'sure', 'voice', 'heard', 'november']
Actual 

In [30]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican', 'Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties:
    for p1 in parties:
        results[p][p1] = 0

num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data):
    tweet, party = tp  
    features = tweet_features(tweet)
    estimated_party = classifier.classify(features)
    
    results[party][estimated_party] += 1
    
    # Note: this check runs after scoring, so num_to_score + 1 tweets are counted.
    if idx >= num_to_score:
        break

In [31]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 3073, 'Democratic': 1277}),
             'Democratic': defaultdict(int,
                         {'Republican': 389, 'Democratic': 5262})})
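The scoring loop above draws from all of `tweet_data`, and the classifier was trained on 80% of the feature sets built from that same list, so most scored tweets were probably seen during training. One way to keep the scored tweets separate is to split by index once and reuse the held-out indices. The sketch below shows the idea on a toy size. `split_indices` is a hypothetical helper, and the seed simply reuses the one from the earlier cell.

```python
import random

def split_indices(n_items, train_fraction=0.8, seed=20201014):
    """Shuffle indices 0..n_items-1 and split them into train and test lists."""
    indices = list(range(n_items))
    rng = random.Random(seed)   # local generator, leaves global random state alone
    rng.shuffle(indices)
    cut = int(n_items * train_fraction)
    return indices[:cut], indices[cut:]

train_idx, test_idx = split_indices(10)
print(train_idx)
print(test_idx)
```

In the notebook, the classifier could be trained on `[featuresets[i] for i in train_idx]` and the confusion counts built only from the items at `test_idx`.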

### Reflections

The results can be interpreted as a **confusion matrix**, showing how well the classifier is performing in terms of classifying tweets as either **Republican** or **Democratic**.

- `'Republican': {'Republican': 3073, 'Democratic': 1277}`:
  - **3073** tweets from actual Republicans were correctly classified as **Republican**.
  - **1277** tweets from actual Republicans were incorrectly classified as **Democratic**.

- `'Democratic': {'Republican': 389, 'Democratic': 5262}`:
  - **5262** tweets from actual Democrats were correctly classified as **Democratic**.
  - **389** tweets from actual Democrats were incorrectly classified as **Republican**.

### Observations:
1. **Accuracy**:
   - The classifier labels most tweets correctly for both parties, for an overall accuracy of \( \frac{3073 + 5262}{10001} \approx 83.3\% \). Because the scored tweets are drawn from the same pool the classifier was trained on, this figure is likely optimistic. Per party, the share of tweets labeled correctly (recall) is:
     - **Republican tweets**: \( \frac{3073}{3073 + 1277} \approx 70.6\% \).
     - **Democratic tweets**: \( \frac{5262}{5262 + 389} \approx 93.1\% \).

2. **Class Imbalance**:
   - The classifier is more accurate at predicting **Democratic** tweets than **Republican** tweets. This could be due to a number of factors:
     - **More distinctive language** in Democratic tweets.
     - **Possible class imbalance**: The scored sample contains more Democratic tweets (5,651) than Republican tweets (4,350). If the training data is similarly skewed, the classifier might be biased toward predicting the Democratic party more often.
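The percentages above can be reproduced directly from the confusion counts. The sketch below hard-codes the numbers from the `results` output and also computes precision, which is not discussed above but follows from the same table.

```python
# Counts copied from the results output: first key is actual, second is estimated.
confusion = {
    "Republican": {"Republican": 3073, "Democratic": 1277},
    "Democratic": {"Republican": 389, "Democratic": 5262},
}
labels = list(confusion)

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[label][label] for label in labels)
accuracy = correct / total

# Recall: of the tweets actually from a party, the share labeled as that party.
recall = {
    label: confusion[label][label] / sum(confusion[label].values())
    for label in labels
}

# Precision: of the tweets labeled as a party, the share actually from that party.
precision = {
    label: confusion[label][label] / sum(confusion[actual][label] for actual in labels)
    for label in labels
}

print(f"total scored: {total}, accuracy: {accuracy:.3f}")
for label in labels:
    print(f"{label}: recall={recall[label]:.3f}, precision={precision[label]:.3f}")
```

Recall is higher for Democratic tweets while precision is higher for Republican ones, which is consistent with the classifier leaning toward the Democratic label.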
