## Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required DB from the shared Dropbox or from Blackboard.

In [1]:
import sqlite3
import nltk
import random
import numpy as np
from collections import Counter, defaultdict

# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

# Ensure that you have downloaded the stopwords and punkt data
nltk.download('stopwords')
nltk.download('punkt')

# Function to clean & tokenize text
def clean_tokenize(text):
    # Tokenize  text
    tokens = word_tokenize(text)
    # Convert to lower case
    tokens = [word.lower() for word in tokens]
    # Remove punctuation
    tokens = [word for word in tokens if word.isalpha()]
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Return tokens as a single string
    return ' '.join(tokens)

[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>
[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:1000)>


In [2]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

### Part 1: Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" class work. First, pull in the text 
for each party and prepare it for use in Naive Bayes.  

In [3]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

In [4]:
# Connect to database displaying all tables & columns
try:
    # List all tables in the database
    tables_query = "SELECT name FROM sqlite_master WHERE type='table';"
    convention_cur.execute(tables_query)
    tables = convention_cur.fetchall()

    # Check if tables are found
    if not tables:
        print("No tables found in the database.")
    else:
        print("Tables in the database:")
        for table in tables:
            print(table[0])

        # Describe each table & column names
        for table in tables:
            print(f"\nColumns in table {table[0]}:")
            convention_cur.execute(f"PRAGMA table_info({table[0]});")
            columns = convention_cur.fetchall()
            for column in columns:
                print(f"{column[1]} ({column[2]})")

except sqlite3.Error as e:
    print(f"An error occurred: {e}")
finally:
    if convention_db:
        convention_db.close()

Tables in the database:
conventions

Columns in table conventions:
party (TEXT)
night (INTEGER)
speaker (TEXT)
speaker_count (INTEGER)
time (TEXT)
text (TEXT)
text_len (TEXT)
file (TEXT)


In [5]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

convention_data = []

# Fill this list up with items that are themselves lists. The 
# first element in the sublist should be the cleaned and tokenized
# text in a single string. As part of your cleaning process,
# remove the stopwords from the text. The second element of the sublist
# should be the party. 

query_results = convention_cur.execute(
                            '''
                            SELECT text, party 
                            FROM conventions; 
                            ''')

for row in query_results:
    cleaned_text = clean_tokenize(row[0])
    party = row[1]
    convention_data.append([cleaned_text, party])

Let's look at some random entries and see if they look right. 

In [6]:
random.choices(convention_data,k=5)

[['hello pastor jerry young let us pray together almighty god grateful granted nation democratic ideal destiny may fashioned thankful blessed united states america survive climate confusion chaos racism injustice uncertainty helplessness irresponsibility oh lord invoke presence participation throughout life convention prayer enable convention produce vision promote healing hope health nation vision inspire inform inclusive americans vision rekindle us renewed commitment high ideal democracy created equal endowed creator certain inalienable rights among life liberty pursuit happiness',
  'Democratic'],
 ['time sleeping basement joe biden years produce results talk action like many democrats making promises black voters decades captive audience president trump sought earn black vote democratic party leaders went crazy nancy pelosi chuck schumer literally started wearing kente cloths around us capitol pandering enough keep us satisfied',
  'Republican'],
 ['senator harris cares people dou

If that looks good, we now need to make our function to turn these into features. In my solution, I wanted to keep the number of features reasonable, so I only used words that occur more than `word_cutoff` times. Here's the code to test that if you want it. 

In [7]:
word_cutoff = 5

tokens = [w for t, p in convention_data for w in t.split()]

word_dist = nltk.FreqDist(tokens)

feature_words = set()

for word, count in word_dist.items() :
    if count > word_cutoff :
        feature_words.add(word)
        
print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2236 as features in the model.
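
Note that the loop above keeps a word only when `count > word_cutoff`, so a word seen exactly `word_cutoff` times is dropped. A minimal sketch of the difference, using a made-up token list (the `toy_tokens` data and the `select_features` helper are hypothetical and not part of the assignment):

```python
from collections import Counter

def select_features(tokens, cutoff, inclusive=False):
    """Return the set of words whose count passes the cutoff.

    inclusive=False mirrors the notebook's `count > cutoff` test;
    inclusive=True keeps words that occur at least `cutoff` times.
    """
    counts = Counter(tokens)
    if inclusive:
        return {w for w, c in counts.items() if c >= cutoff}
    return {w for w, c in counts.items() if c > cutoff}

# Hypothetical toy data: "vote" occurs 3 times, "tax" twice, "china" once
toy_tokens = ["vote", "tax", "vote", "china", "tax", "vote"]

strict = select_features(toy_tokens, cutoff=2)
loose = select_features(toy_tokens, cutoff=2, inclusive=True)
print(sorted(strict))  # ['vote']
print(sorted(loose))   # ['tax', 'vote']
```

With a small cutoff such as 5, the two rules can give noticeably different feature counts, so it is worth stating which one a reported number refers to.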


In [8]:
def conv_features(text, fw):
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once. 
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
        
    """
    # Initialize an empty dictionary to hold feature words found in the text
    ret_dict = {}
    
    # Tokenize the text
    tokens = text.split()
    
    # Iterate through the tokens
    for token in tokens:
        # If the token is in the feature words set, add it to the dictionary
        if token in fw:
            ret_dict[token] = True
    
    return ret_dict

In [9]:
assert(len(feature_words)>0)
assert(conv_features("donald is the president",feature_words)==
       {'donald':True,'president':True})
assert(conv_features("some people in america are citizens",feature_words)==
                     {'people':True,'america':True,"citizens":True})

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [10]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [11]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [12]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.494


In [13]:
classifier.show_most_informative_features(25)

Most Informative Features
                   china = True           Republ : Democr =     27.1 : 1.0
                   votes = True           Democr : Republ =     23.8 : 1.0
             enforcement = True           Republ : Democr =     21.5 : 1.0
                 destroy = True           Republ : Democr =     19.2 : 1.0
                freedoms = True           Republ : Democr =     18.2 : 1.0
                 climate = True           Democr : Republ =     17.8 : 1.0
                supports = True           Republ : Democr =     17.1 : 1.0
                   crime = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     15.8 : 1.0
                 beliefs = True           Republ : Democr =     13.0 : 1.0
               countries = True           Republ : Democr =     13.0 : 1.0
                 defense = True           Republ : Democr =     13.0 : 1.0
                  defund = True           Republ : Democr =     13.0 : 1.0
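
Each row above compares how likely a feature is under one party versus the other. As a rough sketch of where a ratio like `27.1 : 1.0` could come from, the snippet below estimates P(word present | party) with add-0.5 smoothing and divides the two estimates. The counts are invented for illustration, and the smoothing details are an assumption about NLTK's defaults rather than a copy of its implementation.

```python
def smoothed_prob(docs_with_word, total_docs, gamma=0.5, bins=2):
    """Estimate P(word present | party) with additive smoothing.

    bins=2 because the feature is either present or absent.
    gamma=0.5 is assumed here to approximate NLTK's default estimator.
    """
    return (docs_with_word + gamma) / (total_docs + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Ratio of P(word | party A) to P(word | party B)."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: the word appears in 40 of 1000 Republican
# speeches and in 1 of 1000 Democratic speeches.
ratio = likelihood_ratio(40, 1000, 1, 1000)
print(f"Republ : Democr = {ratio:.1f} : 1.0")
```

Because the denominator can be driven by a single occurrence, large ratios often reflect words that are very rare for one party, which is worth keeping in mind when reading the list.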

Write a little prose here about what you see in the classifier. Anything odd or interesting?

The first thing I noticed was that the Naive Bayes accuracy (0.494) was slightly worse than a coin flip, making the classifier unreliable for distinguishing between the political parties. The concept is still interesting to me, since it could be used to promote a desired candidate by better understanding what supporters prefer. 

### My Observations


Displaying the 25 most informative features alongside their ratios is, in my opinion, one of the strongest assets of this approach. I could see these serving as key words to reference during a political debate, knowing how much more strongly they are associated with one party than with the other. They could also guide marketing and promotional efforts aimed at capturing the most support within a given party. However, the same list could be leveraged in the opposite sense, to win votes from the other side by using buzzwords that appeal to them. Another challenge is the balance, or lack thereof, between the parties. It is possible that one party is more prominent on certain platforms, and that imbalance could contribute to the poor classification performance. 



## Part 2: Classifying Congressional Tweets

In this part we apply the classifier we just built to a set of tweets by people running for Congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [14]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [15]:
# Connect to database displaying all tables & columns
try:
    # List all tables in the database
    tables_query = "SELECT name FROM sqlite_master WHERE type='table';"
    cong_cur.execute(tables_query)
    tables = cong_cur.fetchall()

    # Check if tables are found
    if not tables:
        print("No tables found in the database.")
    else:
        print("Tables in the database:")
        for table in tables:
            print(table[0])

        # Describe each table & column names
        for table in tables:
            print(f"\nColumns in table {table[0]}:")
            cong_cur.execute(f"PRAGMA table_info({table[0]});")
            columns = cong_cur.fetchall()
            for column in columns:
                print(f"{column[1]} ({column[2]})")

except sqlite3.Error as e:
    print(f"An error occurred: {e}")
finally:
    if cong_db:
        cong_db.close()

Tables in the database:
websites
candidate_data
tweets

Columns in table websites:
district (TEXT)
candidate (TEXT)
pull_time (DATETIME)
url (TEXT)
site_text (TEXT)

Columns in table candidate_data:
index (INTEGER)
student (TEXT)
state (TEXT)
district_num (TEXT)
formatted_dist_num (INTEGER)
abbrev (TEXT)
district (TEXT)
candidate (TEXT)
party (TEXT)
website (TEXT)
twitter_handle (TEXT)
incumbent (TEXT)
age (REAL)
gender (TEXT)
marital_status (TEXT)
white_non_hispanic (TEXT)
hispanic (TEXT)
black (TEXT)
partisian_lean_pvi (TEXT)
opposed (TEXT)
pct_urban (TEXT)
income (REAL)
region (TEXT)

Columns in table tweets:
district (TEXT)
candidate (TEXT)
pull_time (DATETIME)
tweet_time (DATETIME)
handle (TEXT)
is_retweet (INTEGER)
tweet_id (TEXT)
tweet_text (TEXT)
likes (INTEGER)
replies (INTEGER)
retweets (INTEGER)
tweet_ratio (REAL)


In [17]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming
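
One thing to keep in mind about the filter above: SQLite's `LIKE` is case-insensitive for ASCII characters by default, so `NOT LIKE '%RT%'` also drops tweets that merely contain the letters "rt" (for example "support" or "party"). The sketch below shows the effect on a throwaway in-memory table. The `demo` table and its three rows are made up, and the `GLOB 'RT *'` alternative is only one possible way to target tweets that start with a retweet marker.

```python
import sqlite3

# Throwaway in-memory table standing in for the real tweets table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (tweet_text TEXT)")
conn.executemany(
    "INSERT INTO demo VALUES (?)",
    [("RT @someone: hello",), ("I support this bill",), ("Vote today",)],
)

# The notebook's filter: case-insensitive substring match on "rt"
like_rows = [r[0] for r in conn.execute(
    "SELECT tweet_text FROM demo WHERE tweet_text NOT LIKE '%RT%' "
    "ORDER BY tweet_text")]

# Alternative: GLOB is case-sensitive, and 'RT *' only matches a leading "RT "
glob_rows = [r[0] for r in conn.execute(
    "SELECT tweet_text FROM demo WHERE tweet_text NOT GLOB 'RT *' "
    "ORDER BY tweet_text")]

conn.close()
print(like_rows)  # ['Vote today']
print(glob_rows)  # ['I support this bill', 'Vote today']
```

The broad filter is not wrong for this assignment, but it does shrink the sample, so the tweet counts reported later reflect it.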

In [18]:
# Function to clean and tokenize text
def clean_tokenize(text):
    # Decode bytes to a UTF-8 string; leave text that is already a str untouched
    if isinstance(text, bytes):
        text = text.decode('utf-8')
    # Tokenize the text
    tokens = word_tokenize(text)
    # Convert to lower case
    tokens = [word.lower() for word in tokens]
    # Remove punctuation
    tokens = [word for word in tokens if word.isalpha()]
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Return tokens as a single string
    return ' '.join(tokens)

In [19]:
tweet_data = []

# Now fill up tweet_data with sublists like we did on the convention speeches.
# Note that this may take a bit of time, since we have a lot of tweets.

for row in results:
    # Clean and tokenize the tweet text
    cleaned_text = clean_tokenize(row[2])
    # Extract the party affiliation
    party = row[1]
    # Append a list containing the cleaned text and party to tweet_data
    tweet_data.append([cleaned_text, party])

# Close the database connection
cong_db.close()

There are a lot of tweets here. Let's take a random sample and see how our classifier does. I'm guessing it won't be too great given the performance on the convention speeches...

In [20]:
random.seed(20201014) # Clever seed number. Bravo

tweet_data_sample = random.choices(tweet_data,k=10)

# Preview output
tweet_data_sample

[['earlier today spoke house floor abt protecting health care women praised ppmarmonte work central coast https',
  'Democratic'],
 ['go tribe rallytogether https', 'Democratic'],
 ['apparently trump thinks easy students overwhelmed crushing burden debt pay student loans trumpbudget https',
  'Democratic'],
 ['grateful first responders rescue personnel firefighters police volunteers working tirelessly keep people safe provide help putting lives line https',
  'Republican'],
 ['let make even greater kag https', 'Republican'],
 ['cavs tie series repbarbaralee scared roadtovictory', 'Democratic'],
 ['congrats belliottsd new gig sd city hall glad continue https',
  'Democratic'],
 ['really close raised toward match right whoot majors room help us get https https',
  'Democratic'],
 ['today comment period potus plan expand offshore drilling opened public days march share oppose proposed program directly trump administration comments made email mail https',
  'Democratic'],
 ['celebrated ics

In [21]:

for tweet, party in tweet_data_sample :
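    # NOTE: the line below is still the template's random placeholder, not the
    # trained model. To use the Part 1 model, replace the right-hand side with
    # classifier.classify(conv_features(tweet, feature_words)).
    estimated_party = random.choice(['Republican', 'Democratic'])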

    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifer says {estimated_party}.")
    print("")
    

Here's our (cleaned) tweet: earlier today spoke house floor abt protecting health care women praised ppmarmonte work central coast https
Actual party is Democratic and our classifer says Republican.

Here's our (cleaned) tweet: go tribe rallytogether https
Actual party is Democratic and our classifer says Democratic.

Here's our (cleaned) tweet: apparently trump thinks easy students overwhelmed crushing burden debt pay student loans trumpbudget https
Actual party is Democratic and our classifer says Republican.

Here's our (cleaned) tweet: grateful first responders rescue personnel firefighters police volunteers working tirelessly keep people safe provide help putting lives line https
Actual party is Republican and our classifer says Democratic.

Here's our (cleaned) tweet: let make even greater kag https
Actual party is Republican and our classifer says Republican.

Here's our (cleaned) tweet: cavs tie series repbarbaralee scared roadtovictory
Actual party is Democratic and our classi
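
The loop above still draws `estimated_party` at random. A minimal sketch of how a trained model could be plugged in follows. `estimate_party` is a hypothetical helper, and `KeywordStub` is a stand-in so the snippet runs on its own. In the notebook you would pass the `classifier` and `feature_words` objects built in Part 1, since NLTK's `NaiveBayesClassifier` exposes the same `classify(featureset)` method.

```python
def conv_features(text, fw):
    """Same idea as the notebook's conv_features: presence flags for feature words."""
    return {token: True for token in text.split() if token in fw}

def estimate_party(tweet, classifier, feature_words):
    """Turn a cleaned tweet into features and ask the classifier for a label."""
    return classifier.classify(conv_features(tweet, feature_words))

class KeywordStub:
    """Hypothetical stand-in for a trained model, used only for this demo."""
    def classify(self, featureset):
        return "Democratic" if "climate" in featureset else "Republican"

demo_words = {"climate", "china", "crime"}
demo_tweets = ["climate action now https", "tough china crime policy"]

for tweet in demo_tweets:
    print(tweet, "->", estimate_party(tweet, KeywordStub(), demo_words))
```

Keeping the feature extraction inside the helper makes it harder to accidentally score tweets with a different feature set than the one used for training.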

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [22]:
# Dictionary of counts by actual party and estimated party.
# The first key is the actual party, and the second is the estimated party.
parties = ['Republican', 'Democratic']
results = defaultdict(lambda: defaultdict(int))

for p in parties:
    for p1 in parties:
        results[p][p1] = 0

num_to_score = 10000
random.shuffle(tweet_data)

for idx, tp in enumerate(tweet_data):
    tweet, party = tp    
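    # Now do the same thing as above, but we store the results rather
    # than printing them. 
    
    # Placeholder: the party is still guessed at random here. To score the
    # Part 1 model, use classifier.classify(conv_features(tweet, feature_words)).
    estimated_party = random.choice(['Republican', 'Democratic'])
    
    results[party][estimated_party] += 1
    
    # The check runs after the update, so num_to_score + 2 tweets are counted
    if idx > num_to_score:
        break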


In [23]:
results

defaultdict(<function __main__.<lambda>()>,
            {'Republican': defaultdict(int,
                         {'Republican': 2119, 'Democratic': 2180}),
             'Democratic': defaultdict(int,
                         {'Republican': 2811, 'Democratic': 2892})})
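
To turn these counts into rates, divide each diagonal entry by its row total. The sketch below hard-codes the counts printed above into a plain dictionary (`counts` is just a copy for illustration, and the two helper functions are hypothetical) and reports per-party and overall accuracy.

```python
# Counts copied from the output above: counts[actual][estimated]
counts = {
    "Republican": {"Republican": 2119, "Democratic": 2180},
    "Democratic": {"Republican": 2811, "Democratic": 2892},
}

def per_party_accuracy(table):
    """Share of each actual party's tweets that received the correct label."""
    return {actual: row[actual] / sum(row.values()) for actual, row in table.items()}

def overall_accuracy(table):
    """Correct predictions divided by all scored tweets."""
    correct = sum(row[actual] for actual, row in table.items())
    total = sum(sum(row.values()) for row in table.values())
    return correct / total

for party, acc in per_party_accuracy(counts).items():
    print(f"{party}: {acc:.3f}")
print(f"Overall: {overall_accuracy(counts):.3f}")
```

Reporting the per-party rates next to the overall figure makes it easier to spot a model that does well on one party only because that party dominates the sample.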

### Reflections

As the name suggests, the `results` variable holds the outcome of the scoring loop: counts of estimated labels for each actual party. Of the 4,299 Republican tweets scored, 2,119 (approximately 49.3%) were labeled correctly, which is slightly worse than a coin flip. Similarly, of the 5,703 Democratic tweets, 2,892 (approximately 50.7%) were labeled correctly. These near-50% rates are what we would expect given that `estimated_party` in the loop is drawn with `random.choice`, so they describe a chance baseline rather than the Naive Bayes model itself. While the concept and theoretical application have potential, the trained classifier needs to be scored on the tweets and its performance improved before this model could be deployed.