## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required databases from the shared Dropbox or from Blackboard.

In [74]:
import re
import random
import sqlite3
from collections import Counter, defaultdict
from string import punctuation

import nltk
import numpy as np
import pandas as pd

nltk.download('stopwords', download_dir='Users/justinfarnan_hakkoda/ads_text_mining/M4/tm_nb_conventions_m4/nltk_data')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sw = stopwords.words("english")
stopwords_list = sw  # same English stopword list, under the name used below
# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns

[nltk_data] Downloading package stopwords to Users/justinfarnan_hakkod
[nltk_data]     a/ads_text_mining/M4/tm_nb_conventions_m4/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/justinfarnan_hakkoda/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [16]:
convention_db = sqlite3.connect("/Users/justinfarnan_hakkoda/ads_text_mining/M4/2020_Conventions.db")
convention_cur = convention_db.cursor()

In [17]:
# Execute a query
convention_cur.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(convention_cur.fetchall())

[('conventions',)]


In [24]:
convention_cur.execute("SELECT * FROM conventions;")
rows = convention_cur.fetchall()
df = pd.DataFrame(rows, columns=['party', 'night', 'speaker', 'speaker_count', 'time', 'text', 'text_len', 'file'])

# Print the DataFrame
print(df.head())

        party  night           speaker  speaker_count   time  \
0  Democratic      4           Unknown              1  00:00   
1  Democratic      4         Speaker 1              1  00:33   
2  Democratic      4         Speaker 2              1  00:59   
3  Democratic      4  Kerry Washington              1  01:07   
4  Democratic      4    Bernie Sanders              1  01:18   

                                                text text_len  \
0  Skip to content The Company Careers Press Free...      127   
1  I’m here by calling the full session of the 48...       41   
2  Every four years, we come together to reaffirm...       17   
3  We fight for a more perfect union because we a...       28   
4  We must come together to defeat Donald Trump, ...       22   

                                                file  
0  www_rev_com_blog_transcripts2020-democratic-na...  
1  www_rev_com_blog_transcripts2020-democratic-na...  
2  www_rev_com_blog_transcripts2020-democratic-na...  
3  w

In [59]:
df['party'].value_counts()

party
Democratic    1551
Republican     990
Name: count, dtype: int64

## Part 1: Exploratory Naive Bayes

We'll first build a Naive Bayes model on the convention data itself, as a way to understand which words distinguish the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text for each party and prepare it for use in Naive Bayes.

In [34]:
convention_data = []

# fill this list up with items that are themselves lists. The 
# first element in the sublist should be the cleaned and tokenized
# text in a single string. As part of your cleaning process,
# remove the stopwords from the text. The second element of the sublist
# should be the party. 

query_results = convention_cur.execute(
                            '''
                            SELECT 
                                text,
                                party
                            FROM
                                conventions
                            ''')

for row in query_results :
    text_data = row[0]
    party_data = row[1]
    # Standardize the text by lowercasing the words
    text_lower = text_data.lower()
    # Drop ASCII punctuation (the characters in string.punctuation);
    # curly quotes such as ’ are not in that set and remain in the text
    text_remove_punct = ''.join(char for char in text_lower if char not in punctuation)
    # Tokenize with NLTK's word_tokenize
    tokens = word_tokenize(text_remove_punct)
    # Remove English stopwords
    tokens_no_stopwords = [token for token in tokens if token not in stopwords_list]
    cleaned_text = ' '.join(tokens_no_stopwords)
    convention_data.append([cleaned_text, party_data])
    


In [35]:
convention_data

[['skip content company careers press freelancers blog × services transcription captions foreign subtitles translation freelancers contact login « return transcript library home transcript categories transcripts 2020 election transcripts classic speech transcripts congressional testimony hearing transcripts debate transcripts donald trump transcripts entertainment transcripts financial transcripts interview transcripts political transcripts press conference transcripts speech transcripts sports transcripts technology transcripts aug 21 2020 2020 democratic national convention dnc night 4 transcript rev › blog › transcripts › 2020 election transcripts › 2020 democratic national convention dnc night 4 transcript night 4 2020 democratic national convention dnc august 20 read full transcript event transcribe content try rev free save time transcribing captioning subtitling',
  'Democratic'],
 ['’ calling full session 48th quadrennial national convention democratic party order welcome final

Let's look at some random entries and see if they look right. 

In [36]:
random.choices(convention_data,k=5)

[['’ add ’ going inaudible 015837 see help vote fellow look would',
  'Democratic'],
 ['hello name maximo alvarez live miami florida far state florida ’ 90 mile wide blue strip map divides freedom fear divides past future know past ’ never forget family fled totalitarianism communism first dad spain cuba family ’ ’ run away grace god live american dream greatest blessing ever dad sixth grade education told “ ’ lose place never lucky ” maximo alvarez 020714 ’ speaking today family ’ done abandoning rightfully earned ’ place hide ’ speaking today president trump may always politically correct fact successful businessman average career politician president another family man friend important elected commander chief puts america first keep mind guy running president mostly concerned power yes yes power benefit americans',
  'Republican'],
 ['hey everyone dana white president ufc many know friends president spoke convention four years ago ’ back believe need president trump ’ leadership eve

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it.

In [37]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2332 as features in the model.
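
The cell above keeps a word only when `count > word_cutoff`, that is, strictly more than the cutoff. As a small illustration of how that differs from "at least the cutoff", here is a sketch on made-up tokens (the names `toy_tokens` and `toy_cutoff` are hypothetical and not part of the notebook):

```python
from collections import Counter

# Toy tokens (hypothetical, not drawn from the convention data).
toy_tokens = ["vote"] * 6 + ["climate"] * 5 + ["china"] * 2
toy_cutoff = 5

toy_counts = Counter(toy_tokens)

# Strictly more than the cutoff: a word seen exactly 5 times is excluded.
strict = {w for w, c in toy_counts.items() if c > toy_cutoff}
# At least the cutoff: a word seen exactly 5 times is kept.
inclusive = {w for w, c in toy_counts.items() if c >= toy_cutoff}

print(sorted(strict))     # ['vote']
print(sorted(inclusive))  # ['climate', 'vote']
```

Either rule is reasonable; the choice only shifts which borderline words become features.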


In [52]:
def conv_features(text,fw) :
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    
    ret_dict = dict()
    # Input validation
    if not isinstance(text, str):
        raise TypeError("Text must be a string")
    if not isinstance(fw, (set, list)):
        raise TypeError("Feature words must be a set or list")

    # Iterate over the words and add each feature word to the dictionary
    for word in text.split():
        # Keep the word only if it is a feature word and not already recorded
        if word in fw and word not in ret_dict:
            ret_dict[word] = True

    return ret_dict

In [53]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("some people in america are citizens",feature_words)==
                     {'people':True,'america':True,"citizens":True})
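
For reference, the same presence-only behavior can be sketched as a single dictionary comprehension. This is an illustrative variant under a hypothetical name (`conv_features_compact`), without the input validation, run on the example from the docstring:

```python
def conv_features_compact(text, fw):
    """Hypothetical compact variant: one True entry per distinct feature word."""
    # Repeated words simply overwrite the same key, so each word appears once.
    return {word: True for word in text.split() if word in fw}

example = conv_features_compact("quick quick brown fox", {"quick", "fox", "jumps"})
print(example)  # {'quick': True, 'fox': True}
```

Passing `fw` as a set keeps each membership check fast, which matters when the function is called once per tweet.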

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [55]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]
featuresets

[({'skip': True,
   'content': True,
   'company': True,
   'careers': True,
   'press': True,
   'freelancers': True,
   'blog': True,
   '×': True,
   'services': True,
   'transcription': True,
   'captions': True,
   'foreign': True,
   'subtitles': True,
   'translation': True,
   'contact': True,
   'login': True,
   '«': True,
   'return': True,
   'transcript': True,
   'library': True,
   'home': True,
   'categories': True,
   'transcripts': True,
   '2020': True,
   'election': True,
   'classic': True,
   'speech': True,
   'congressional': True,
   'testimony': True,
   'hearing': True,
   'debate': True,
   'donald': True,
   'trump': True,
   'entertainment': True,
   'financial': True,
   'interview': True,
   'political': True,
   'conference': True,
   'sports': True,
   'technology': True,
   'aug': True,
   'democratic': True,
   'national': True,
   'convention': True,
   'dnc': True,
   'night': True,
   '4': True,
   'rev': True,
   '›': True,
   'august': True,


In [56]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [57]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.496
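
To put the 0.496 figure in context, it helps to compare it with a majority-class baseline. The sketch below uses the party counts printed by `value_counts()` earlier; it assumes the test set has roughly the same party mix as the full data, which a random shuffle makes likely but does not guarantee.

```python
# Party counts copied from the value_counts() output earlier in the notebook.
party_counts = {"Democratic": 1551, "Republican": 990}

total = sum(party_counts.values())
majority_party = max(party_counts, key=party_counts.get)

# Accuracy obtained by always predicting the most common party.
baseline_accuracy = party_counts[majority_party] / total

print(f"Majority class: {majority_party}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

An accuracy below this baseline suggests the model is not yet extracting useful signal from the features.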


In [58]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           Republ : Democr =     27.1 : 1.0
                   votes = True           Democr : Republ =     23.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                freedoms = True           Republ : Democr =     18.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                supports = True           Republ : Democr =     17.1 : 1.0
                   crime = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     15.8 : 1.0
                 beliefs = True           Republ : Democr =     13.0 : 1.0
               countries = True           Republ : Democr =     13.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0
                  defund = True           Republ : Democr =     13.0 : 1.0
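
Each line of this output is a likelihood ratio: how much more often the feature is present under one party than under the other. The sketch below illustrates the idea with made-up counts and simple additive (add-0.5) smoothing. Both the counts and the helper `smoothed_rate` are assumptions for illustration; this is not a reproduction of NLTK's internals.

```python
def smoothed_rate(docs_with_word, total_docs, gamma=0.5, bins=2):
    """P(word present | party) with additive smoothing (illustrative assumption)."""
    return (docs_with_word + gamma) / (total_docs + gamma * bins)

# Hypothetical counts: the word appears in 40 of 800 Republican documents
# and in 2 of 1200 Democratic documents.
p_rep = smoothed_rate(40, 800)
p_dem = smoothed_rate(2, 1200)

# Ratio of the two conditional probabilities, reported as "X : 1.0".
ratio = p_rep / p_dem

print(f"P(word | Republican) = {p_rep:.4f}")
print(f"P(word | Democratic) = {p_dem:.4f}")
print(f"Republican : Democratic = {ratio:.1f} : 1.0")
```

Read this way, the trailing `1.0` is the denominator of the ratio rather than a probability.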

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

The party labels are imbalanced: there are 1,551 Democratic rows and 990 Republican rows, so the class prior leans Democratic. Even so, most of the informative features shown are Republican-leaning (`china`, `enforcement`, `destroy`, `freedoms`, `crime`, `media`, `defund`), while `votes` and `climate` lean Democratic. The `1.0` at the end of each line is not a probability. It is the denominator of a likelihood ratio, so `27.1 : 1.0` means `china` is about 27 times more likely to appear in Republican text than in Democratic text.

The accuracy of 0.496 is lower than I expected. Always guessing the majority party would score roughly 1551 / 2541 ≈ 0.61 on data with these proportions, so the model does worse than that simple baseline. The first row of the data is also website boilerplate ("Skip to content ...") rather than speech text, which adds noise. As a next step, I suggest trying another model, such as logistic regression, to see how it performs. If that model also performs poorly or is skewed by the imbalance, we can rebalance the data to try to increase the accuracy.



## Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [62]:
cong_db = sqlite3.connect("/Users/justinfarnan_hakkoda/ads_text_mining/M4/congressional_data.db")
cong_cur = cong_db.cursor()

In [63]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming

In [88]:
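# NOTE: this cell expects `results` to be the nested dict of counts built in
# the scoring cell further down (the output reflects an earlier run of it).
# On the raw query list from In [63], `.items()` would raise AttributeError.
# Each row is an actual party; the dict holds counts by estimated party.
df_2 = pd.DataFrame(list(results.items()), columns=['Candidate', 'Party'])
# Print the DataFrame
print(df_2)
df_2.head()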

    Candidate                                     Party
0  Republican  {'Republican': 3184, 'Democratic': 1094}
1  Democratic  {'Republican': 4306, 'Democratic': 1418}




In [68]:
results

[('Mo Brooks',
  'Republican',
  b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq'),
 ('Mo Brooks',
  'Republican',
  b'"Brooks: Senate Democrats Allowing President to Give Americans\xe2\x80\x99 Jobs to Illegals" #securetheborder https://t.co/mZtEaX8xS6'),
 ('Mo Brooks',
  'Republican',
  b'"NASA on the Square" event this Sat. 11AM \xe2\x80\x93 4PM. Stop by &amp; hear about the incredible work done in #AL05! @DowntownHSV http://t.co/R9zY8WMEpA'),
 ('Mo Brooks',
  'Republican',
  b'"The trouble with Socialism is that eventually you run out of other people\'s money." - Margaret Thatcher https://t.co/X97g7wzQwJ'),
 ('Mo Brooks',
  'Republican',
  b'"The trouble with socialism is eventually you run out of other people\'s money" \xe2\x80\x93 Thatcher. She\'ll be sorely missed. http://t.co/Z8gBnDQUh8'),
 ('Mo Brooks',
  'Republican',
  b'"U.S. Debt Biggest Threat to National Security" #armyaviation http://t.co/4RWjrehH'),
 ('Mo Brooks',
  'Repu

In [76]:
tweet_data = []

# Now fill up tweet_data with sublists like we did on the convention speeches.
# Note that this may take a bit of time, since we have a lot of tweets.
for row in results :
    candidate, party, tweet = row
    # Decode the raw bytes, then keep only lowercase letters and whitespace.
    # NOTE: the tweet is not lowercased before this filter, so capital letters
    # are stripped as well (e.g. "Brooks Joins" becomes "rooks oins").
    text_remove_punct = re.sub(r'[^a-z\s]', '', tweet.decode('utf-8'))
    # Tokenize with NLTK's word_tokenize
    tokens = word_tokenize(text_remove_punct)
    tokens_no_stopwords = [token for token in tokens if token not in stopwords_list]
    cleaned_text = ' '.join(tokens_no_stopwords)
    tweet_data.append([cleaned_text, party])
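
Because the regex in the loop above runs on text that has not been lowercased, every capital letter is removed along with the punctuation. A minimal sketch of a cleaning step that lowercases first is shown below. The function name `clean_tweet` and the tiny `TOY_STOPWORDS` set are hypothetical stand-ins (the notebook uses NLTK's stopword list and `word_tokenize`), and the sample is the text of the first tweet with its link left off.

```python
import re

# Small stand-in stopword set (assumption); the notebook uses NLTK's English list.
TOY_STOPWORDS = {"in", "against", "the", "is", "of", "to"}

def clean_tweet(raw_tweet, stopword_set=TOY_STOPWORDS):
    """Decode, lowercase, strip non-letters, and drop stopwords."""
    text = raw_tweet.decode("utf-8") if isinstance(raw_tweet, bytes) else raw_tweet
    text = text.lower()                    # lowercase BEFORE filtering on a-z
    text = re.sub(r"[^a-z\s]", "", text)   # keep only letters and whitespace
    tokens = [tok for tok in text.split() if tok not in stopword_set]
    return " ".join(tokens)

sample = b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill"'
print(clean_tweet(sample))
# brooks joins alabama delegation voting flawed funding bill
```

With whole words preserved, more tokens have a chance of matching the feature words learned from the convention speeches. Re-running the scoring with this change would produce different numbers from the ones recorded in this notebook.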


In [77]:
tweet_data

[['rooks oins labama elegation oting gainst lawed unding ill httptcowjsq',
  'Republican'],
 ['rooks enate emocrats llowing resident ive mericans obs llegals securetheborder httpstcomtax',
  'Republican'],
 ['quare event top amp hear incredible work done owntown httptcozp',
  'Republican'],
 ['trouble ocialism eventually run peoples money argaret hatcher httpstcogwzw',
  'Republican'],
 ['trouble socialism eventually run peoples money hatcher hell sorely missed httptcognh',
  'Republican'],
 ['ebt iggest hreat ational ecurity armyaviation httptcojreh', 'Republican'],
 ['jobs amp economic solution merica securing borders protect jobs amp wages mericans httpstcophkz',
  'Republican'],
 ['amp merican hero eremiah enton laid rest today one finest patriots pleasure meet httptcobyrplu',
  'Republican'],
 ['loss gain egislative ounsel eter hite joins hite ouse omestic olicy ouncil httpstcosp httpstconli',
  'Republican'],
 ['orderrisis billion problem million solution httptcokpvy', 'Republica

There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [78]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [89]:

for tweet, party in tweet_data_sample :
    # Estimate the party from the feature words found in the tweet
    estimated_party = classifier.classify(conv_features(tweet, feature_words))
    
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")
    

Here's our (cleaned) tweet: arlier today spoke ouse loor abt protecting health care women praised marmonte work entral oast httpstcoqgz
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: ribe allyogether httpstcout
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: pparently rump thinks easy students overwhelmed crushing burden debt pay student loans rumpudget httpstcockh
Actual party is Democratic and our classifier says Republican.

Here's our (cleaned) tweet: ere grateful first responders rescue personnel firefighters police volunteers working tirelessly keep people safe provide muchneeded help putting lives line httpstcoevvz
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: ets make even reater httpstcoyqoz
Actual party is Republican and our classifier says Republican.

Here's our (cleaned) tweet: e hr cavs tie series eparbaraee scared roadtovictory
Actual party is

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [90]:
# dictionary of counts by actual party and estimated party. 
# first key is actual, second is estimated
parties = ['Republican','Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties :
    for p1 in parties :
        results[p][p1] = 0


num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data) :
    tweet, party = tp    
    # Now do the same thing as above, but we store the results rather
    # than printing them. 
    estimated_party = classifier.classify(conv_features(tweet, feature_words))
    
    results[party][estimated_party] += 1
    
    # The check runs after scoring and uses `>`, so the loop scores
    # num_to_score + 2 tweets (10,002 here) before stopping.
    if idx > num_to_score : 
        break
for actual_party in parties:
    for estimated_party in parties:
        count = results[actual_party][estimated_party]
        print(f"Actual: {actual_party}, Estimated: {estimated_party}, Count: {count}")

Actual: Republican, Estimated: Republican, Count: 3230
Actual: Republican, Estimated: Democratic, Count: 1142
Actual: Democratic, Estimated: Republican, Count: 4203
Actual: Democratic, Estimated: Democratic, Count: 1427


In [91]:
# Calculate accuracy
correct_count = 0
total_count = 0

for actual_party in parties:
    for estimated_party in parties:
        count = results[actual_party][estimated_party]
        if actual_party == estimated_party:
            correct_count += count
        total_count += count

accuracy = correct_count / total_count
print(f"Accuracy: {accuracy:.3f}")

Accuracy: 0.466
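
Overall accuracy hides how differently the two parties are treated. As a rough sketch, per-party recall and precision can be computed from the counts printed by the scoring cell; the `confusion` dictionary below simply copies those numbers so the example runs on its own.

```python
# Counts copied from the scoring output above: confusion[actual][estimated].
confusion = {
    "Republican": {"Republican": 3230, "Democratic": 1142},
    "Democratic": {"Republican": 4203, "Democratic": 1427},
}
parties = list(confusion)

metrics = {}
for party in parties:
    true_pos = confusion[party][party]
    # Recall: share of tweets actually from this party that were labeled correctly.
    actual_total = sum(confusion[party].values())
    # Precision: share of tweets labeled as this party that really belong to it.
    predicted_total = sum(confusion[actual][party] for actual in parties)
    metrics[party] = {
        "recall": true_pos / actual_total,
        "precision": true_pos / predicted_total,
    }

for party, m in metrics.items():
    print(f"{party}: recall={m['recall']:.3f}, precision={m['precision']:.3f}")
```

A large gap between the two recall values indicates that the classifier leans toward one label regardless of the tweet's content.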


### Reflections

_Write a little about what you see in the results_ 

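In these results the model favors the Republican party: it labeled 7,433 of the 10,002 scored tweets (about 74%) as Republican, even though only 4,372 were actually Republican. The Democratic party was heavily underestimated, with just 1,427 of 5,630 Democratic tweets classified correctly. The overall accuracy of 0.466 is below what always guessing Democratic would achieve on this sample (about 0.563).

It is interesting that the model behaves so differently here than on the convention data, since both the convention data and the scored tweets contain more Democratic than Republican rows. Part of the problem is likely the tweet cleaning: capital letters were stripped (for example, "rooks oins labama"), so many tokens cannot match the feature words learned from the speeches. It would be interesting to fix the cleaning and to train on the two datasets combined to see how the classifier performs.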