# Naive Bayes on Political Text

In this notebook we use Naive Bayes to explore and classify political data. See the `README.md` for full details. You can download the required DB from the shared dropbox or from blackboard

In [12]:
import sqlite3
import nltk
import string
import random
import numpy as np
from collections import Counter, defaultdict

from string import punctuation
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Define stopwords
stop_words = set(stopwords.words('english'))

# Feel free to include your text patterns functions
#from text_functions_solutions import clean_tokenize, get_patterns

In [32]:
convention_db = sqlite3.connect("2020_Conventions.db")
convention_cur = convention_db.cursor()

In [33]:
# Check the database for tables and column names to use later on
# Get the list of tables in the database
try:
    tables = convention_cur.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
    
    if not tables:
        print("No tables found in the database.")
    else:
        # Step 2: Print columns for each table
        for table_name in tables:
            table_name = table_name[0]  # Extract the table name from the tuple
            print(f"Columns in table '{table_name}':")
            
            # Execute PRAGMA command to get column info
            columns = convention_cur.execute(f"PRAGMA table_info({table_name});").fetchall()
            
            if not columns:
                print(f"No columns found in table '{table_name}'.")
            else:
                for column in columns:
                    print(f" - {column[1]} (Type: {column[2]})")  # Column name and type

except sqlite3.Error as e:
    print(f"An error occurred: {e}")

Columns in table 'conventions':
 - party (Type: TEXT)
 - night (Type: INTEGER)
 - speaker (Type: TEXT)
 - speaker_count (Type: INTEGER)
 - time (Type: TEXT)
 - text (Type: TEXT)
 - text_len (Type: TEXT)
 - file (Type: TEXT)


In [34]:
# Validate the data from 2020_Conventions database
# Fetch all records to inspect the data
all_data = convention_cur.execute("SELECT * FROM conventions").fetchall()

# Print the number of records and some sample data
print(f"Total records in conventions table: {len(all_data)}")
print("Sample records from the conventions table:")
for record in all_data[:5]:  # Print first 5 records for inspection
    print(record)

Total records in conventions table: 2541
Sample records from the conventions table:
('Democratic', 4, 'Unknown', 1, '00:00', 'Skip to content The Company Careers Press Freelancers Blog × Services Transcription Captions Foreign Subtitles Translation Freelancers About Contact Login « Return to Transcript Library home  Transcript Categories  All Transcripts 2020 Election Transcripts Classic Speech Transcripts Congressional Testimony & Hearing Transcripts Debate Transcripts Donald Trump Transcripts Entertainment Transcripts Financial Transcripts Interview Transcripts Political Transcripts Press Conference Transcripts Speech Transcripts Sports Transcripts Technology Transcripts Aug 21, 2020 2020 Democratic National Convention (DNC) Night 4 Transcript Rev  ›  Blog  ›  Transcripts  › 2020 Election Transcripts  ›  2020 Democratic National Convention (DNC) Night 4 Transcript Night 4 of the 2020 Democratic National Convention (DNC) on August 20. Read the full transcript of the event here. Transc

## 1. Exploratory Naive Bayes

We'll first build a NB model on the convention data itself, as a way to understand what words distinguish between the two parties. This is analogous to what we did in the "Comparing Groups" exercise. First, we'll pull in the text 
for each party and prepare it for use in Naive Bayes. 

In [37]:
# Initialize the convention_data list
convention_data = []

# SQL query to pull only 2020 data and exclude the "Other" party
query_results = convention_cur.execute(
    '''
    SELECT text, party 
    FROM conventions 
    WHERE party != 'Other'
    '''
)

# Store the results in convention_data
for row in query_results:
    speech_text, party = row  # Unpack the row into speech_text and party
    convention_data.append([speech_text, party])  # Add to the list as a sublist

# Optional: Print the first few entries to verify
print(f"Sample convention_data entries: {convention_data[:5]}")

# Close the database connection
convention_cur.close()
convention_db.close()



Sample convention_data entries: [['Skip to content The Company Careers Press Freelancers Blog × Services Transcription Captions Foreign Subtitles Translation Freelancers About Contact Login « Return to Transcript Library home  Transcript Categories  All Transcripts 2020 Election Transcripts Classic Speech Transcripts Congressional Testimony & Hearing Transcripts Debate Transcripts Donald Trump Transcripts Entertainment Transcripts Financial Transcripts Interview Transcripts Political Transcripts Press Conference Transcripts Speech Transcripts Sports Transcripts Technology Transcripts Aug 21, 2020 2020 Democratic National Convention (DNC) Night 4 Transcript Rev  ›  Blog  ›  Transcripts  › 2020 Election Transcripts  ›  2020 Democratic National Convention (DNC) Night 4 Transcript Night 4 of the 2020 Democratic National Convention (DNC) on August 20. Read the full transcript of the event here. Transcribe Your Own Content  Try Rev for free  and save time transcribing, captioning, and subtit

Let's look at some random entries and see if they look right. 

In [38]:
random.choices(convention_data,k=5)

[['Today I’m proud to join Joe Biden.', 'Democratic'],
 ['Illinois.', 'Republican'],
 ['[foreign language 00:32:04].', 'Democratic'],
 ['Democracy is beautiful.', 'Democratic'],
 ['Welcome back to Milwaukee, Wisconsin, a great city on native land on a Great Lake. It’s the place where I was born and raised right in the heart of 53206 zip code. This is a community that’s been faced with some significant challenges due to historical injustice, but what many don’t see is the joy, the resilience, and opportunity that lies within this community and so many others across America just like it, where hardworking people are fighting to provide for their families and to build a better future. We know that we build a better future for our nation by channeling Wisconsin’s legacy as the birthplace of the labor and the progressive movement, and uniting around a bold inclusive agenda that uplifts every community and the pursuit of a more just future, one that recognizes healthcare as a human right, on

It'll be useful for us to have a large sample size than 2020 affords, since those speeches tend to be long and contiguous. Let's make a new list-of-lists called `conv_sent_data`. Instead of each first entry in the sublists being an entire speech, make each first entry just a sentence from the speech. Feel free to use NLTK's `sent_tokenize` [function](https://www.nltk.org/api/nltk.tokenize.sent_tokenize.html). 

In [39]:
conv_sent_data = []

# Iterate over each speech and party in original list convention_data
for speech, party in convention_data:
    # Tokenize the speech into sentences
    sentences = sent_tokenize(speech)
    
    # Append each sentence and the corresponding party to new conv_sent_data
    for sentence in sentences:
        conv_sent_data.append([sentence, party])

Again, let's look at some random entries. 

In [40]:
random.choices(conv_sent_data,k=5)

[['Our platoon reflected the diversity of our nation, every race, creed, and religion.',
  'Republican'],
 ['God bless our heroes and God bless the United States of America.',
  'Republican'],
 ['We are idealists and dreamers, lovers of adventure.', 'Republican'],
 ['Believe in yourself, in President Trump and individual and personal responsibility?',
  'Republican'],
 ['We want safety in our neighborhoods.', 'Republican']]

Now it's time for our final cleaning before modeling. Go through `conv_sent_data` and take the following steps: 

1. Tokenize on whitespace
1. Remove punctuation
1. Remove tokens that fail the `isalpha` test
1. Remove stopwords
1. Casefold to lowercase
1. Join the remaining tokens into a string


In [None]:
clean_conv_sent_data = [] # list of tuples (sentence, party), with sentence cleaned

for idx, sent_party in enumerate(conv_sent_data) :
    pass # your code here

random.choices(clean_conv_sent_data,k=5)

[('maybe one day catch', 'Republican'),
 ('change within change made last', 'Republican'),
 ('whoo', 'Republican'),
 ('want talk man know', 'Republican'),
 ('lucky', 'Republican')]

In [41]:
clean_conv_sent_data = []  # list of tuples (sentence, party), with sentence cleaned

# Iterate over each sentence and party in conv_sent_data
for idx, (sentence, party) in enumerate(conv_sent_data):
    # Tokenize on whitespace
    tokens = word_tokenize(sentence)

    # Remove punctuation and isalpha tokens
    tokens = [
        token for token in tokens
        if token.isalpha() and token.lower() not in stop_words
    ]

    # Lowercase
    tokens = [token.lower() for token in tokens]

    # Join the remaining tokens into a string
    cleaned_sentence = ' '.join(tokens)

    # Store the cleaned sentence and the party in clean_conv_sent_data
    clean_conv_sent_data.append((cleaned_sentence, party))

In [43]:
# Example to show random samples from the cleaned data (data check)
random_samples = random.choices(clean_conv_sent_data, k=5)
print(random_samples)

[('good evening', 'Democratic'), ('want thank much', 'Republican'), ('know alive today', 'Republican'), ('one first people called joe', 'Democratic'), ('ready', 'Democratic')]


If that looks good, let's make our function to turn these into features. First we need to build our list of candidate words. I started my exploration at a cutoff of 5. 

In [44]:
word_cutoff = 5

# Flatten the cleaned sentences into a list of words
tokens = [w for t, p in clean_conv_sent_data for w in t.split()]

# Calculate the frequency distribution of words
word_dist = nltk.FreqDist(tokens)

# Build the set of feature words based on the cutoff
feature_words = set()

for word, count in word_dist.items():
    if count > word_cutoff:
        feature_words.add(word)

print(f"With a word cutoff of {word_cutoff}, we have {len(feature_words)} as features in the model.")

With a word cutoff of 5, we have 2236 as features in the model.


In [45]:
def conv_features(text, fw):
    """Given some text, this returns a dictionary holding the
       feature words.
       
       Args: 
            * text: a piece of text in a continuous string. Assumes
            text has been cleaned and case folded.
            * fw: the *feature words* that we're considering. A word 
            in `text` must be in fw in order to be returned. This 
            prevents us from considering very rarely occurring words.
        
       Returns: 
            A dictionary with the words in `text` that appear in `fw`. 
            Words are only counted once.
            If `text` were "quick quick brown fox" and `fw` = {'quick','fox','jumps'},
            then this would return a dictionary of 
            {'quick' : True,
             'fox' :    True}
    """
    # Tokenize the text into words
    tokens = text.split()
    
    # Create the dictionary for feature words found in the text
    ret_dict = {word: True for word in tokens if word in fw}
    
    return ret_dict

In [46]:
# Assertions to test the function
assert(len(feature_words) > 0)
assert(conv_features("obama was the president", feature_words) == {'obama': True, 'president': True})
assert(conv_features("some people in america are citizens", feature_words) == {'people': True, 'america': True, 'citizens': True})

# will pass silently if all assertions run correctly, otherwise assertion error will occur

Now we'll build our feature set. Out of curiosity I did a train/test split to see how accurate the classifier was, but we don't strictly need to since this analysis is exploratory. 

In [47]:
featuresets = [(conv_features(text,feature_words), party) for (text, party) in convention_data]

In [48]:
random.seed(20220507)
random.shuffle(featuresets)

test_size = 500

In [49]:
test_set, train_set = featuresets[:test_size], featuresets[test_size:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

0.498


In [50]:
classifier.show_most_informative_features(25)

Most Informative Features
             enforcement = True           Republ : Democr =     27.5 : 1.0
                   votes = True           Democr : Republ =     21.6 : 1.0
                 climate = True           Democr : Republ =     17.3 : 1.0
                 destroy = True           Republ : Democr =     17.1 : 1.0
                supports = True           Republ : Democr =     16.1 : 1.0
                   media = True           Republ : Democr =     15.9 : 1.0
                preserve = True           Republ : Democr =     15.1 : 1.0
                  signed = True           Republ : Democr =     15.1 : 1.0
              appreciate = True           Republ : Democr =     14.0 : 1.0
                freedoms = True           Republ : Democr =     14.0 : 1.0
                 private = True           Republ : Democr =     11.9 : 1.0
                  defund = True           Republ : Democr =     10.9 : 1.0
                    drug = True           Republ : Democr =     10.3 : 1.0

Write a little prose here about what you see in the classifier. Anything odd or interesting?

### My Observations

It is important to note that the accuracy of the Naive Bayes classifier is less than 0.50, 50%. Many of the important features posted by the Naive Bayes classifier are expected when analyzing a political environment. It is interesting to see the specific features that are notoriously stronger in the Republican party versus the Democratic party actually be proven by text. For example, the Republican party sees the largest ratio with "enforcement" which coincides with the strong presence of militant priorities, etc. In contrast, the Democratic party has a high ratio for features such as "climate", which coincides with priorities of social change and environmental issues. 

It is also intriguing to look at how many features are marked as significantly occuring for each party. Out of the 25 of the features listed here, 22 of them appear to occur more frequently in the Republican party speeches. This could suggest potential bias in the collection of data, or just an imbalance in the data as is. 



## Part 2: Classifying Congressional Tweets

In this part we apply the classifer we just built to a set of tweets by people running for congress
in 2018. These tweets are stored in the database `congressional_data.db`. That DB is funky, so I'll
give you the query I used to pull out the tweets. Note that this DB has some big tables and 
is unindexed, so the query takes a minute or two to run on my machine.

In [54]:
cong_db = sqlite3.connect("congressional_data.db")
cong_cur = cong_db.cursor()

In [55]:
results = cong_cur.execute(
        '''
           SELECT DISTINCT 
                  cd.candidate, 
                  cd.party,
                  tw.tweet_text
           FROM candidate_data cd 
           INNER JOIN tweets tw ON cd.twitter_handle = tw.handle 
               AND cd.candidate == tw.candidate 
               AND cd.district == tw.district
           WHERE cd.party in ('Republican','Democratic') 
               AND tw.tweet_text NOT LIKE '%RT%'
        ''')

results = list(results) # Just to store it, since the query is time consuming

In [56]:
# Check the database for tables and column names
try:
    # Step 1: Get the list of tables in the database
    tables = cong_cur.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
    
    if not tables:
        print("No tables found in the database.")
    else:
        # Step 2: Print columns for each table
        for table_name in tables:
            table_name = table_name[0]  # Extract the table name from the tuple
            print(f"Columns in table '{table_name}':")
            
            # Execute PRAGMA command to get column info
            columns = cong_cur.execute(f"PRAGMA table_info({table_name});").fetchall()
            
            if not columns:
                print(f"No columns found in table '{table_name}'.")
            else:
                for column in columns:
                    print(f" - {column[1]} (Type: {column[2]})")  # Column name and type

except sqlite3.Error as e:
    print(f"An error occurred: {e}")
finally:
    # Close the database connection
    cong_db.close()

Columns in table 'websites':
 - district (Type: TEXT)
 - candidate (Type: TEXT)
 - pull_time (Type: DATETIME)
 - url (Type: TEXT)
 - site_text (Type: TEXT)
Columns in table 'candidate_data':
 - index (Type: INTEGER)
 - student (Type: TEXT)
 - state (Type: TEXT)
 - district_num (Type: TEXT)
 - formatted_dist_num (Type: INTEGER)
 - abbrev (Type: TEXT)
 - district (Type: TEXT)
 - candidate (Type: TEXT)
 - party (Type: TEXT)
 - website (Type: TEXT)
 - twitter_handle (Type: TEXT)
 - incumbent (Type: TEXT)
 - age (Type: REAL)
 - gender (Type: TEXT)
 - marital_status (Type: TEXT)
 - white_non_hispanic (Type: TEXT)
 - hispanic (Type: TEXT)
 - black (Type: TEXT)
 - partisian_lean_pvi (Type: TEXT)
 - opposed (Type: TEXT)
 - pct_urban (Type: TEXT)
 - income (Type: REAL)
 - region (Type: TEXT)
Columns in table 'tweets':
 - district (Type: TEXT)
 - candidate (Type: TEXT)
 - pull_time (Type: DATETIME)
 - tweet_time (Type: DATETIME)
 - handle (Type: TEXT)
 - is_retweet (Type: INTEGER)
 - tweet_id (Ty

In [51]:
# Initialize tweet_data list
tweet_data = []

# Fill up tweet_data with sublists (tweet text, party)
for candidate, party, tweet_text in results:
    tweet_data.append([tweet_text, party])

# Optionally print the first few entries to verify
print("Sample tweets data:")
for tweet in tweet_data[:5]:  # Display first 5 tweets
    print(tweet)

# Close the database connection
cong_db.close()


Sample tweets data:
[b'"Brooks Joins Alabama Delegation in Voting Against Flawed Funding Bill" http://t.co/3CwjIWYsNq', 'Republican']
[b'"Brooks: Senate Democrats Allowing President to Give Americans\xe2\x80\x99 Jobs to Illegals" #securetheborder https://t.co/mZtEaX8xS6', 'Republican']
[b'"NASA on the Square" event this Sat. 11AM \xe2\x80\x93 4PM. Stop by &amp; hear about the incredible work done in #AL05! @DowntownHSV http://t.co/R9zY8WMEpA', 'Republican']
[b'"The trouble with Socialism is that eventually you run out of other people\'s money." - Margaret Thatcher https://t.co/X97g7wzQwJ', 'Republican']
[b'"The trouble with socialism is eventually you run out of other people\'s money" \xe2\x80\x93 Thatcher. She\'ll be sorely missed. http://t.co/Z8gBnDQUh8', 'Republican']


There are a lot of tweets here. Let's take a random sample and see how our classifer does. I'm guessing it won't be too great given the performance on the convention speeches...

In [52]:
random.seed(20201014)

tweet_data_sample = random.choices(tweet_data,k=10)

In [57]:
# Loop through the sampled tweets and estimate the party
for tweet, party in tweet_data_sample:
    # Clean and prepare the tweet for classification
    cleaned_tweet = tweet.lower()  # Lowercase the tweet
    features = conv_features(cleaned_tweet, feature_words)  # Extract features using your conv_features function
    
    # Estimate the party using the classifier
    estimated_party = classifier.classify(features)  # Use the classifier to predict the party
    
    print(f"Here's our (cleaned) tweet: {tweet}")
    print(f"Actual party is {party} and our classifier says {estimated_party}.")
    print("")

Here's our (cleaned) tweet: b'Earlier today, I spoke on the House Floor abt protecting health care for women and praised @PPmarmonte for their work on the Central Coast. https://t.co/WqgTRzT7VV'
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: b'Go Tribe! #RallyTogether https://t.co/0NXutFL9L5'
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: b"Apparently, Trump thinks it's just too easy for students overwhelmed by the crushing burden of debt to pay off student loans #TrumpBudget https://t.co/ckYQO5T0Qh"
Actual party is Democratic and our classifier says Democratic.

Here's our (cleaned) tweet: b'We\xe2\x80\x99re grateful for our first responders, our rescue personnel, our firefighters, our police, and volunteers who have been working tirelessly to keep people safe, provide much-needed help, while putting their own lives on the line.\n\nhttps://t.co/eZPv0vMIz3'
Actual party is Republican and our class

Now that we've looked at it some, let's score a bunch and see how we're doing.

In [68]:
# Dictionary of counts by actual party and estimated party
parties = ['Republican', 'Democratic']
results = defaultdict(lambda: defaultdict(int))

# Number of tweets to score
num_to_score = 10000
random.shuffle(tweet_data)

# Loop through tweet data and estimate parties
for idx, tp in enumerate(tweet_data):
    tweet, party = tp

    # Clean and prepare the tweet for classification
    cleaned_tweet = tweet.lower()  # Lowercase the tweet
    features = conv_features(cleaned_tweet, feature_words)  # Extract features

    # Estimate the party using the classifier
    estimated_party = classifier.classify(features)  # Predict the party
    
    # Store the results
    results[party][estimated_party] += 1

    # Break after scoring a certain number of tweets
    if idx >= num_to_score:
        break

# Display the results summary
print("Results summary:")
for actual_party, estimated in results.items():
    print(f"{actual_party}: {dict(estimated)}")

Results summary:
Republican: {'Democratic': 4343}
Democratic: {'Democratic': 5658}


In [71]:
# Calculate accuracy
correct_predictions = 0
total_predictions = 0

for actual_party in results.keys():
    correct_predictions += results[actual_party][actual_party]  # Count correctly classified tweets
    total_predictions += sum(results[actual_party].values())  # Count total predictions for this actual party

# Calculate and print accuracy
accuracy = correct_predictions / total_predictions if total_predictions > 0 else 0
print(f"Accuracy: {accuracy:.2%}")  # Print accuracy as a percentage

Accuracy: 56.82%


### Reflections

The summary above suggests that the classifier is misclassifying the tweets from the Republican party incorrectly. There are 4343 tweets of the 10,000 that have been tested that are wrongfully classified as Democratic. In contrast, the model is performing pretty well in classifying Democratic tweets; classifying 5658 tweets correctly. 

These values suggest that the classifier may be trained on unbalanced data and data biased towards the Democratic party language. To improve the overall performance of the classifier, it would be beneficial to diversify the Republican party tweets or introduce more data that would classify as the Republican party. The Naive Bayes classifier needs to be trained on more Republican party instances. This also presents the opportunity to rebalance the data that the Naive Bayes classifier is trained on.