## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details.

In [48]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict

from string import punctuation
from nltk.corpus import stopwords
import re, itertools
import pandas as pd
import pprint

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

In [49]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

### Part 1: Exploratory Naive Bayes

We'll first build a Naive Bayes (NB) model on the convention data itself, as a way to understand which words distinguish the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text for each party and prepare it for use in Naive Bayes.

In [50]:
class TokenCleaner():
    def __init__(self, remove_stopwords=True, return_as_string=True):
        
        # Some punctuation variations
        self.punctuation = set(punctuation) # speeds up comparison
        self.punct_set = self.punctuation - {"#"}
        self.punct_pattern = \
            re.compile("[" + re.escape("".join(self.punct_set)) + "]")

        # Stopwords
        if remove_stopwords:
            self.sw = stopwords.words("english") + ['️','',' ']
        else:
            self.sw = ''
            
        # Two useful regex
        self.whitespace_pattern = re.compile(r"\s+")
        self.hashtag_pattern = re.compile(r"^#[0-9a-zA-Z]+")
        
        self.CleanText_return_format = return_as_string

    def CleanText(self, _text):

        # decode bytes to string if necessary
        if isinstance(_text, str): 
            self.text = _text
        else:
            # this is for the case of tweets which are saved as bytes
            self.text = _text.decode("utf-8") 
      
        self.__RemovePunctuation()
        self.__TokenizeText()
        self.__RemoveStopWords()
        if self.CleanText_return_format:
            return ' '.join(self.tokens)
        else:
            return self.tokens
        
    def __RemovePunctuation(self): 
        """
        Remove every punctuation character except "#", which is
        kept so that hashtags survive cleaning.
        ---------------------------------------------------------
        input: original text
        output: text without punctuation
        """
        # A single regex pass is enough; the earlier character-by-character
        # loop removed exactly the same characters.
        self.text = self.punct_pattern.sub('', self.text)
        
    def __TokenizeText(self):
        """
        Tokenize by splitting the text by white space
        ---------------------------------------------------------
        input: text without punctuation
        output: A list of tokens
        """
        self.tokens = \
            [item for item in self.whitespace_pattern.split(self.text)]
                
    def __RemoveStopWords(self): 
        """
        Lowercase every token, then drop the stopwords
        ---------------------------------------------------------
        input: a list of tokens
        output: a list of lowercase tokens with stopwords removed
        """
        self.tokens = [token.lower() for token in self.tokens]
        
        self.tokens = \
            [token for token in self.tokens if not token in self.sw]
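
The cleaning steps in the class above (strip punctuation but keep `#`, split on whitespace, lowercase, drop stopwords) can be tried in isolation. This is a minimal sketch rather than the class itself: `clean_text` and `TOY_STOPWORDS` are hypothetical names, and the tiny stopword set stands in for NLTK's English list so the example runs without downloading the corpus.

```python
import re
from string import punctuation

# Hypothetical stand-in for NLTK's English stopword list (assumption),
# so the sketch runs without downloading the corpus.
TOY_STOPWORDS = {"the", "is", "a", "to", "for", "and", "of"}

# Every punctuation character except "#", so hashtags survive cleaning.
PUNCT_PATTERN = re.compile(
    "[" + re.escape("".join(set(punctuation) - {"#"})) + "]"
)


def clean_text(text, stopwords=TOY_STOPWORDS):
    """Strip punctuation, split on whitespace, lowercase, drop stopwords."""
    if isinstance(text, bytes):
        # Tweets may be stored as bytes, as noted in the class above.
        text = text.decode("utf-8")
    text = PUNCT_PATTERN.sub("", text)
    tokens = [token.lower() for token in text.split()]
    return " ".join(token for token in tokens if token not in stopwords)


print(clean_text("Vote for the #Future, and vote early!"))
print(clean_text(b"Thanks to the volunteers."))
```

Unlike the class, this sketch uses `str.split()`, which never produces empty tokens, so it does not need empty strings in the stopword list.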
            

In [70]:
convention_data = list()

# fill this list up with items that are themselves lists. 
# The first element in the sublist should be the cleaned 
# and tokenized text in a single string. 
# The second element should be the party. 

sql_query = "SELECT party, text FROM conventions"
query_results = convention_cur.execute(sql_query)

# create an instance of the TokenCleaner class 
# which will be used to clean the text

# Some configuration that can be set are:
# -> removing stop words or not 
# -> returning as a list of tokens or as one string

tc = TokenCleaner(remove_stopwords=True, return_as_string=True)
for party, text in query_results:
    # clean the text and tokenize
    tokens = tc.CleanText(text)
    convention_data.append([tokens, party])    


# show sample of the text and their party 
# a flat list of names gives ordinary columns; a nested list would build a MultiIndex
df = pd.DataFrame(convention_data, columns=['text', 'party'])
df.sample(10)

| | text | party |
|---|---|---|
| 2166 | voted campaigned republicans since reagan year... | Democratic |
| 2413 | could use little help right seems like get one... | Democratic |
| 512 | utah | Republican |
| 681 | got back yeah | Republican |
| 2004 | thank kristin sharing story nation grieves fat... | Democratic |
| 1525 | lived feeling helplessness someone love sick a... | Democratic |
| 1488 | welcome | Democratic |
| 462 | official roll call business republican convent... | Republican |
| 1261 | together ivanka wrote policies made easier sta... | Republican |
| 1607 | good evening presidential election world’s imp... | Democratic |


Let's look at some random entries and see if they look right. 

In [52]:
random.choices(convention_data,k=10)

[['jon honor devotion showing returning citizens forgotten believe person made god purpose continue give americans including former inmates best chance build new life achieve american dream great american dream i’d like ask john richard say words',
  'Republican'],
 ['navy senate liaison used carry bags overseas trips', 'Democratic'],
 ['support defend', 'Republican'],
 ['son scranton claymont wilmington become one consequential vice presidents american history accolade nonetheless rest firmly behind legacy husband father grandfather grateful nation thanks vice president joseph r biden jr lifetime service behalf united states america',
  'Democratic'],
 ['care environment', 'Democratic'],
 ['next', 'Democratic'],
 ['jordan president trump standing right would say today right try',
  'Republican'],
 ['president trump recognized one small ways instill sense normalcy people’s lives bring back entertainment options president went beyond help sports leagues involved figure way overcome chal

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times (the code uses a strict `>` comparison). Here's the code to test that if you want it. 

In [53]:
def get_feature_words(dataset, word_cutoff = 5, print_accuracy = False):

    tokens = [w for t, p in dataset for w in t.split()]
    word_dist = nltk.FreqDist(tokens)

    feature_words = set()

    for word, count in word_dist.items() :
        if count > word_cutoff :
            feature_words.add(word)
    if print_accuracy:       
        print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")
    
    return feature_words
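
To see what the strict comparison does at the boundary, here is a small sketch on a toy corpus. The data and the name `feature_words_strict` are hypothetical; it uses `collections.Counter` in place of `nltk.FreqDist`, which is assumed to give the same counts for this purpose.

```python
from collections import Counter

# Hypothetical toy corpus in the same [text, party] shape as convention_data.
toy_data = [
    ["tax tax tax jobs", "Republican"],
    ["tax jobs climate", "Democratic"],
    ["jobs climate", "Democratic"],
]


def feature_words_strict(dataset, word_cutoff):
    """Keep words whose corpus-wide count is strictly above word_cutoff."""
    counts = Counter(word for text, _party in dataset for word in text.split())
    return {word for word, count in counts.items() if count > word_cutoff}


# "tax" occurs 4 times, "jobs" 3 times, "climate" 2 times.
print(sorted(feature_words_strict(toy_data, word_cutoff=2)))  # ['jobs', 'tax']
print(sorted(feature_words_strict(toy_data, word_cutoff=3)))  # ['tax']
```

A word that occurs exactly `word_cutoff` times is dropped, so a cutoff of 5 keeps words seen six or more times.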

In [54]:
def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
    """
    
    ret_dict = {token:True for token in text.split(' ') if token in fw}
    
    return ret_dict

In [55]:
feature_words_convention = get_feature_words(convention_data, print_accuracy = True)

assert(len(feature_words_convention)>0)
assert(conv_features("donald is the president",feature_words_convention)==
       {'donald':True,'president':True})

assert(conv_features("people are american in america",feature_words_convention)==
                     {'america':True,'american':True,"people":True})

With a word cutoff of 5, we have 2391 as features in the model.


Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [56]:
featuresets_conv = [(conv_features(text,feature_words_convention), party) for (text, party) in convention_data]

In [57]:
random.seed(20220507)
random.shuffle(featuresets_conv)
test_size = 500

In [58]:
test_set, train_set = featuresets_conv[:test_size], featuresets_conv[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.5
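
An accuracy figure like this is easier to judge next to a majority-class baseline, meaning the accuracy you would get by always predicting the most common party. A minimal sketch, using hypothetical labels rather than the convention data; `majority_baseline` is an illustrative helper, not part of NLTK.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    _label, count = counts.most_common(1)[0]
    return count / len(labels)


# Hypothetical test-set labels, not taken from the convention data.
toy_labels = ["Democratic"] * 6 + ["Republican"] * 4
print(majority_baseline(toy_labels))  # 0.6
```

With the real data, the labels could be pulled out with something like `[party for _features, party in test_set]`.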


In [59]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           Republ : Democr =     25.8 : 1.0
                   votes = True           Democr : Republ =     23.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                freedoms = True           Republ : Democr =     18.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                supports = True           Republ : Democr =     17.1 : 1.0
                   crime = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     14.9 : 1.0
                 beliefs = True           Republ : Democr =     13.0 : 1.0
               countries = True           Republ : Democr =     13.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0
                    isis = True           Republ : Democr =     13.0 : 1.0
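
Each ratio in this listing compares how likely a word is to appear in one party's speeches versus the other's. The sketch below shows the idea with hypothetical counts. It assumes add-0.5 smoothing over two outcomes (word present or absent), which I believe is close to NLTK's default estimator, but treat the exact formula as an assumption rather than a description of NLTK's internals.

```python
def smoothed_presence_prob(docs_with_word, docs_in_party, gamma=0.5, bins=2):
    """Smoothed estimate of P(word present | party).

    gamma and bins are assumptions for this sketch: add-gamma smoothing
    over `bins` possible outcomes (present / absent).
    """
    return (docs_with_word + gamma) / (docs_in_party + gamma * bins)


# Hypothetical counts, chosen only for illustration.
p_rep = smoothed_presence_prob(docs_with_word=25, docs_in_party=1000)
p_dem = smoothed_presence_prob(docs_with_word=1, docs_in_party=1500)

# The larger probability goes on top, so the ratio is always >= 1.
ratio = max(p_rep, p_dem) / min(p_rep, p_dem)
print(f"Republ : Democr = {ratio:.1f} : 1.0")
```

Because of the smoothing, a word that never appears for one party still gets a small nonzero probability, so a large ratio can rest on very few occurrences.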

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

_Your observations to come._



### Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll give you the query I used to pull out the tweets. Note that this DB has some big tables and is unindexed, so the query takes a minute or two to run on my machine.

In [60]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [61]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

tweet_results = list(results) # Just to store it, since the query is time consuming

In [62]:
# Now fill up tweet_data with sublists like we did on the convention speeches.
# Note that this may take a bit of time, since we have a lot of tweets.

tweet_data = []
for candidate, party, text in tweet_results:
    tokens = tc.CleanText(text)
    tweet_data.append([tokens, party])    
    
# a flat list of names gives ordinary columns; a nested list would build a MultiIndex
df_tweet = pd.DataFrame(tweet_data, columns=['text', 'party'])
df_tweet.sample(10)

| | text | party |
|---|---|---|
| 303219 | 50 years ago #shirleychisholm became first afr... | Democratic |
| 581429 | #teammcclure great weekend getting talk voters... | Democratic |
| 32428 | today remember victims 911 terrorist attacks p... | Republican |
| 361606 | mental health awareness month must recommit in... | Democratic |
| 100069 | thoughts prayers mccain family history remembe... | Republican |
| 124416 | dont forget send best original pictures cas 14... | Democratic |
| 437373 | 70 years ago today president truman recognized... | Republican |
| 257613 | happy st patricks day | Democratic |
| 283299 | two weeks election day #voteready | Democratic |
| 61638 | insurance companies ralphnorman amp tommypopes... | Democratic |


There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [63]:
random.seed(20201014)
tweet_data_sample = random.choices(tweet_data,k=10)
tweet_data_sample   
              

[['earlier today spoke house floor abt protecting health care women praised ppmarmonte work central coast httpstcowqgtrzt7vv',
  'Democratic'],
 ['go tribe #rallytogether httpstco0nxutfl9l5', 'Democratic'],
 ['apparently trump thinks easy students overwhelmed crushing burden debt pay student loans #trumpbudget httpstcockyqo5t0qh',
  'Democratic'],
 ['we’re grateful first responders rescue personnel firefighters police volunteers working tirelessly keep people safe provide muchneeded help putting lives line httpstcoezpv0vmiz3',
  'Republican'],
 ['let’s make even greater #kag 🇺🇸 httpstcoy9qozd5l2z', 'Republican'],
 ['1hr cavs tie series 22 im #allin216 repbarbaralee scared #roadtovictory',
  'Democratic'],
 ['congrats belliottsd new gig sd city hall glad continue serve… httpstcofkvmw3cqdi',
  'Democratic'],
 ['really close 3500 raised toward match right whoot that’s 7000 nonmath majors room 😂 help us get httpstcotu34c472sd httpstcoqsdqkypsmc',
  'Democratic'],
 ['today comment period po

In [64]:
for tweet, party in tweet_data_sample :
    tokens = conv_features(tweet,feature_words_convention)
    estimated_party = classifier.classify(tokens)
       
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.\n")

Here's our (cleaned) tweet: earlier today spoke house floor abt protecting health care women praised ppmarmonte work central coast httpstcowqgtrzt7vv
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: go tribe #rallytogether httpstco0nxutfl9l5
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: apparently trump thinks easy students overwhelmed crushing burden debt pay student loans #trumpbudget httpstcockyqo5t0qh
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: we’re grateful first responders rescue personnel firefighters police volunteers working tirelessly keep people safe provide muchneeded help putting lives line httpstcoezpv0vmiz3
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: let’s make even greater #kag 🇺🇸 httpstcoy9qozd5l2z
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: 1h

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [65]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated

parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0

num_to_score = 10000
random.shuffle(tweet_data)

# Loop through tweets and predict whether the person tweeting is a Republican or a Democrat,
# using the classifier trained on convention_data

for idx, tp in enumerate(tweet_data) :
    tweet, party = tp    

    tokens = conv_features(tweet,feature_words_convention)
    estimated_party = classifier.classify(tokens)   
     
    results[party][estimated_party] += 1
    
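    # The check runs after scoring and uses ">", so this loop scores
    # num_to_score + 2 tweets (indices 0 through num_to_score + 1).
    if idx > num_to_score : 
        break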

In [66]:
# Of the tweets the model labeled Democratic, what share were actually Democratic?
# This is the precision for the Democratic class, not the true positive rate (recall).
precision_democratic = results['Democratic']['Democratic'] / (results['Republican']['Democratic'] + results['Democratic']['Democratic'])

print(f'Precision for Democratic Party: {precision_democratic:0.2%}','\n\n')

pp = pprint.PrettyPrinter()
pp.pprint(dict(results))

Precision for Democratic Party: 60.87% 


{'Democratic': defaultdict(<class 'int'>,
                           {'Democratic': 907,
                            'Republican': 4817}),
 'Republican': defaultdict(<class 'int'>,
                           {'Democratic': 583,
                            'Republican': 3695})}


### Reflections

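Most of the model's estimates are Republican: it assigns that label to 8,512 of the 10,002 tweets scored (about 85%). The 61% figure above is the share of Democratic predictions that are correct (907 of 1,490), not the overall accuracy. Overall accuracy is about 46% (4,602 of 10,002), slightly worse than the 50% expected from random guessing, and only about 16% of the tweets from Democratic candidates (907 of 5,724) are labeled Democratic. Several changes might improve the model, such as using features that return False (absent words), tuning `word_cutoff`, and changing the train/test split.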
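
The counts printed above are enough to compute accuracy, precision, and recall directly. A minimal sketch, with the counts copied from the printed results into a plain dictionary; the variable names are illustrative.

```python
# Counts copied from the printed results: the outer key is the actual
# party, the inner key is the estimated party.
results = {
    "Democratic": {"Democratic": 907, "Republican": 4817},
    "Republican": {"Democratic": 583, "Republican": 3695},
}

total = sum(sum(row.values()) for row in results.values())
correct = sum(results[party][party] for party in results)
accuracy = correct / total

# Column sum: everything the model labeled Democratic.
predicted_dem = sum(results[actual]["Democratic"] for actual in results)
# Row sum: everything that actually is Democratic.
actual_dem = sum(results["Democratic"].values())

precision_dem = results["Democratic"]["Democratic"] / predicted_dem
recall_dem = results["Democratic"]["Democratic"] / actual_dem
share_predicted_rep = 1 - predicted_dem / total

print(f"Tweets scored:              {total}")
print(f"Accuracy:                   {accuracy:.2%}")
print(f"Precision (Democratic):     {precision_dem:.2%}")
print(f"Recall (Democratic):        {recall_dem:.2%}")
print(f"Share predicted Republican: {share_predicted_rep:.2%}")
```

Precision divides by a column of the table (everything predicted Democratic), while recall divides by a row (everything actually Democratic), which is why the two numbers differ so much here.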
