In this notebook, we use the tweet data assembled as in Notebook 2 - Data Preprocessing, and use this data to train political party classifiers. We use cross validation to evaluate the accuracy of these classifiers.

# Setup

In this section, we repeat the code from Notebook 2 - Data Preprocessing, which sets up the data structures that we use for analysis. The code here is included without explanation, for the documentation for this code, refer back to Notebook 2.

In [1]:
import pandas as pd
import numpy as np

# List of dates for which data was gathered. This is used in the file names.
dates = ["11" + str(day) for day in range(15, 27)]

# Location of the csv files containing tweets
location = "https://raw.githubusercontent.com/lynn0032/CourseProject/main/data_files/"

column_names = ["Index", "Twitter Handle", "Created At", "ID", "Tweet Text", "State", "Branch", "Last Name", "First Name"]
tweets_df = pd.DataFrame(columns = column_names)

# Load tweet data from each day, and combine them in the dataframe tweets_df
for d in dates:
  file_name = "all_tweets_" + d + ".csv"
  day_df = pd.read_csv(location + file_name)
  tweets_df = pd.concat([tweets_df, day_df], axis = 0)

#Drop the extra Index column
tweets_df = tweets_df.drop(columns = ["Index"])

In [2]:
#Remove duplicate tweets
tweets_df = tweets_df.drop_duplicates()

In [3]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
def remove_punctuation(word:str) -> str:
  """Takes a word, and returns the word with characters 
  except for lowercase letters and @ and # removed"""
  new_word = ""
  allowed = "abcdefghijklmnopqrstuvwxyz@#"
  for char in word:
    if char in allowed:
      new_word += char
  return new_word

stop_words = stopwords.words('english')     #stop words from nltk

def get_words(text_str) -> list:
  """Takes a tweet, splits it into words based on spaces,
  removes extra punctuation from the words, and then
  removes stop words"""
  words = [remove_punctuation(w.lower()) for w in text_str.split(" ")]
  filtered_words = [w for w in words if w not in stop_words]
  return [word for word in filtered_words if word != ""]

In [5]:
def retweet(words:list) -> bool:
  """Takes a list of words from a tweet (as returned by get_words), 
  and returns True if 'rt' is one of the words (which means that 
  the tweet is a retweet)"""
  return ('rt' in words)

def tweet_at(words:list) -> list:
  """Takes a list of words from a tweet (as returned by get_words),
  and returns a list of handles that are tweeted at in the tweet"""
  at = []
  for word in words:
    if word[0] == '@':
      at.append(word[1:])
  return at

def hash_tags(words:list) -> list:
  """Takes a list of words from a tweet (as returned by get_words),
  and returns a list of all hashtags included in the tweet"""
  tags = []
  for word in words:
    if word[0] == '#':
      tags.append(word[1:])
  return tags

In [6]:
ps = PorterStemmer()    # Word stemmer from nltk

def word_counts(words:list) -> dict:
  """Takes a list of words from a tweet (as returned by get_words),
  and returns a dictionary of the counts of words from the tweet"""
  counts_dict = {}
  for word in words:
    if word[0] != '@' and word[0] != '#' and word != 'rt' and 'http' not in word:
      word = ps.stem(word)
      if word in counts_dict:
        counts_dict[word] += 1
      else:
        counts_dict[word] = 1
  return counts_dict

In [7]:
def parse_tweet(text:str) -> dict:
  words = get_words(text)
  return {"counts": word_counts(words), "retweet": retweet(words), "at": tweet_at(words), "tags": hash_tags(words)}
  
tweet_dict = {}
for tweet in tweets_df['Tweet Text']:
  tweet = str(tweet)
  tweet_dict[tweet] = parse_tweet(tweet)

In [8]:
handle_parties_df = tweets_df[['Twitter Handle', 'Party']]
handle_parties_df.drop_duplicates()
parties = {}
for index, row in handle_parties_df.iterrows():
  handle = row['Twitter Handle']
  parties[handle] = row['Party']

In [9]:
def handle_word_counts(handle):
  handle_df = tweets_df[tweets_df['Twitter Handle'] == handle]
  all_counts = {}
  for tweet in handle_df['Tweet Text']:
    tweet = str(tweet)
    for word in tweet_dict[tweet]['counts']:
      if word in all_counts:
        all_counts[word] += tweet_dict[tweet]['counts'][word]
      else:
        all_counts[word] = tweet_dict[tweet]['counts'][word]
  return all_counts

In [10]:
handles = list(set(tweets_df['Twitter Handle'].unique()))

handle_counts_dict = {}
for handle in handles:
  handle_counts_dict[handle] = handle_word_counts(handle)

# Training a Political Party Classifier, and Evaluating it's Accuracy

Now, we will explore how accurately we can classify accounts or individual tweets by political party based on only the text of tweets. In order to do this, we divide the dataset into a training set and a testing set. We use the training set to build a topic model for tweets by democrats, then use a maximum likelihood classifier to predict the party of accounts/tweets in the test set. These are compared with their actual party, to evaluate the accuracy of the classifier. We use cross-validation to evaluate the classifier over multiple different training-testing splits.
We begin by building a classifier to classify accounts as democrat or republican. We build a list of handles from the dataset, shuffle this list, and divide it into  𝑘  groups (for  𝑘 -fold cross-validation)

In [11]:
import random

# Number of folds for cross validation
num_folds = 5

# Shuffle list of handles
random.seed(7)         # Added so that results are consistent, delete for a random split
random.shuffle(handles)

# Divide list of handles
num_handles = len(handles)
folds = []
for num in range(num_folds):
  folds.append(handles[round(num / num_folds * num_handles) : round((num + 1) / num_folds * num_handles)])

# Create list of training dataframes, each one omitting one of the folds
all_train_df = [tweets_df for index in range(num_folds)]
for index in range(num_folds):
  for handle in folds[index]:
    all_train_df[index] = all_train_df[index][(all_train_df[index]['Twitter Handle'] != handle)]

For each of the training data frames, we build a topic model for the democrats and republicans based on that dataframe, finding the word frequencies for each party within the training set.

In order to do this, we begin by defining a function party_counts, which takes a dataframe of tweets (one of the training set), and returns a dictionary giving word counts for democrats and a dictionary of word counts for republicans.

In [12]:
def party_counts(train_df):
  """Takes a dataframe of tweets (one of the training sets),
  and returns two dictionaries: one with the word counts for democrats
  in the training set, and one with word counts for republicans."""

  dem_counts = {}
  rep_counts = {}

  for index, row in train_df.iterrows():
    tweet = str(row['Tweet Text'])
    words = tweet_dict[tweet]['counts']
    for word in words:
      if str(row['Party']) == 'D':
        if word in dem_counts:
          dem_counts[word] += words[word]
        else:
          dem_counts[word] = words[word]
      if str(row['Party']) == "R":
        if word in rep_counts:
          rep_counts[word] += words[word]
        else:
          rep_counts[word] = words[word]
  return (dem_counts, rep_counts)

Now, we define a function that adds pseudocounts for unseen words to the topic models, for smoothing. For our purposes, we will take the vocabulary to be all words that appear in the entire dataset of tweets (as parsed by the preprocessing functions). When a classifier encountered an unseen word, we will ignore the word. Since we are focused on comparing frequencies between the two parties, this shouldn't affect the results.

For the pseudocounts, we add one to the count for every word in the vocabulary.

In [13]:
def add_pseudo_counts(word_dict:dict, vocab):
  """Adds pseudo counts to the word_dict, +1 for each word in the vocabulary"""
  for word in vocab:
    if word in word_dict:
      word_dict[word] += 1
    else:
      word_dict[word] = 1

Finally, we normalize the counts by the total of all counts within the dictionary. We compute this result for democrat and republican word counts for each of our training sets, constructed for cross-validation.

In [14]:
def count_to_freq(count_dict:dict) -> dict:
  total_counts = sum(count_dict.values())
  freq_dict = {}
  for word in count_dict:
    freq_dict[word] = count_dict[word] / total_counts
  return freq_dict

vocab = []
for tweet in tweet_dict:
  for word in tweet_dict[tweet]["counts"]:
    vocab.append(word)

word_counts = [party_counts(train_df) for train_df in all_train_df]
for train_set in word_counts:
  for word_dict in train_set:
    add_pseudo_counts(word_dict, vocab)
freq_dicts = [[count_to_freq(word_dict) for word_dict in train_dicts] for train_dicts in word_counts]

Now that the training sets are ready to create a maximum likelihood classifier, and for the test sets, we use the dictionary of word counts for each handle. So, we are ready to define the functions that will be used for the maximum likelihood classifier.

The function log_prob computes the log probability of generating the collection of words given by handle_counts, from the distribution word_dist. This will be applied to the collection of words for a handle in the test set, with the word distribution from the training set for one of the parties, to compute how likely it is that the topic model for that party generated the account's tweets.

The function predict compares the log probabilities, to give a prediction of whether the account is from a democrat or from a republican.

In [15]:
import math

def log_prob(word_dist, handle_counts):
  """Takes a word distribution, and a dictionary of word counts.
  Returns the log probability of generating the collection of words 
  in the dictionary from the given word distribution"""
  result = 0
  for word in handle_counts:
    if word in word_dist:
      result += handle_counts[word] * math.log(word_dist[word])
  return result

def predict(dem_prob, rep_prob):
  """Compares log probability for democrat and for republican, 
  and returns 'D' if democrat is more likely, 'R' if republican
  is more likely"""
  if dem_prob > rep_prob:
    return "D"
  else:
    return "R"

Now, we define a function that evaluates the maximum likelihood classifier on a set of test handles, and prints the accuracy, as well as the counts of misclassification errors. For each misclassification, we also print the handle and the log probabilities. Finally, we return the accuracy.

In [16]:
def evaluate(dem_dist, rep_dist, handles):
  total = 0
  total_correct = 0
  rep_as_dem = 0
  dem_as_rep = 0
  other_mis = 0
  for handle in handles:
    dem_prob = log_prob(dem_dist, handle_counts_dict[handle])
    rep_prob = log_prob(rep_dist, handle_counts_dict[handle])
    prediction = predict(dem_prob, rep_prob)
    actual = parties[handle]

    total += 1
    if actual == prediction:
      total_correct += 1
    else:
      print("Misclassified:", handle, "\n\tDemocrat log prob:", dem_prob, "\n\tRepublican log prob:", rep_prob)
      if actual == "D" and prediction == "R":
        dem_as_rep += 1
      elif actual == "R" and prediction == "D":
        rep_as_dem += 1
      else:
        other_mis +=1
  accuracy = total_correct / total
  print("Accuracy:", accuracy)
  print("Misclassified Dem as Rep:", dem_as_rep)
  print("Misclassified Rep as Dem:", rep_as_dem)
  print("Misclassified other:", other_mis) 
  return accuracy     

Now, we apply the evaluation function to all of folds for cross-validation, printing the results, and computing and printing the average accuracy across all folds.

In [17]:
total_accuracy = 0
for index in range(num_folds):
  test_handles = folds[index]
  dem_dist = freq_dicts[index][0]
  rep_dist = freq_dicts[index][1]
  print("*************************\nResults for fold", index)
  total_accuracy += evaluate(dem_dist, rep_dist, test_handles)

print("*************************\nAverage Accuracy:", total_accuracy / num_folds)

*************************
Results for fold 0
Misclassified: SenSanders 
	Democrat log prob: -16003.70430073335 
	Republican log prob: -15954.211345190217
Misclassified: FrancisRooney 
	Democrat log prob: -2704.842047002437 
	Republican log prob: -2735.5398679225204
Misclassified: MarkAmodeiNV2 
	Democrat log prob: -3740.32921310178 
	Republican log prob: -3780.17318187544
Misclassified: IlhanMN 
	Democrat log prob: -15322.211605446817 
	Republican log prob: -15308.939877402607
Misclassified: RepBryanSteil 
	Democrat log prob: -12450.846729265071 
	Republican log prob: -12355.625535986554
Misclassified: Donald_McEachin 
	Democrat log prob: -5104.202829070854 
	Republican log prob: -5123.625538606095
Accuracy: 0.9439252336448598
Misclassified Dem as Rep: 2
Misclassified Rep as Dem: 3
Misclassified other: 1
*************************
Results for fold 1
Misclassified: BuddforCongress 
	Democrat log prob: -36.559338908084925 
	Republican log prob: -37.37632499725778
Accuracy: 0.9907407407407

We see that our average accuracy for cross-validation is about 95.7%, which seems quite good. We note several interesting appearances in the misclassifications


*   Bernie Sanders (SenSanders) and Angus King (SenAngusKing) are both independents, with party listed as I, and were misclassified. Since this classifier only predicts republican or democrat, this is expected.
*   Lisa Murkowski (lisamurkowski) and Susan Collins (SenatorCollins) are republicans who were misclassified as democrats. These senators have been frequently discussed as voting with democrats in some cases, so it is interesting that their tweets align with the democrat topic models.



The results above rely on grouping all tweets together for an account, and then classifying the account. We could also attempt to classify individual tweets by party, which is significantly more difficult because of how short tweets are. In the code below, we repeat our above work of using a maximum likelihood classifier and evaluate the results, this time for individual tweets.

In [18]:
def tweets_by_handle(handle):
  handle_df = tweets_df[tweets_df['Twitter Handle'] == handle]
  return handle_df[["Tweet Text", "Party"]].dropna()

def evaluate_individual_tweets(dem_dist, rep_dist, handle):
  total = 0
  total_correct = 0
  rep_as_dem = 0
  dem_as_rep = 0
  other_mis = 0
  for handle in handles:
    tweets_with_party = tweets_by_handle(handle)
    for index, row in tweets_with_party.iterrows():
      tweet = str(row["Tweet Text"])
      dem_prob = log_prob(dem_dist, tweet_dict[row["Tweet Text"]]["counts"])
      rep_prob = log_prob(rep_dist, tweet_dict[row["Tweet Text"]]["counts"])
      prediction = predict(dem_prob, rep_prob)
      actual = str(row["Party"])

      total += 1
      if actual == prediction:
        total_correct += 1
      else:
        if actual == "D" and prediction == "R":
          dem_as_rep += 1
        elif actual == "R" and prediction == "D":
          rep_as_dem += 1
        else:
          other_mis +=1
  accuracy = total_correct / total
  print("Accuracy:", accuracy)
  print("Misclassified Dem as Rep:", dem_as_rep)
  print("Misclassified Rep as Dem:", rep_as_dem)
  print("Misclassified other:", other_mis) 
  return accuracy   

total_accuracy = 0
for index in range(num_folds):
  test_handles = folds[index]
  dem_dist = freq_dicts[index][0]
  rep_dist = freq_dicts[index][1]
  print("*************************\nResults for fold", index)
  total_accuracy += evaluate_individual_tweets(dem_dist, rep_dist, test_handles)

print("*************************\nAverage Accuracy:", total_accuracy / num_folds)  

*************************
Results for fold 0
Accuracy: 0.8599527835995279
Misclassified Dem as Rep: 6356
Misclassified Rep as Dem: 4861
Misclassified other: 410
*************************
Results for fold 1
Accuracy: 0.8600009636000097
Misclassified Dem as Rep: 6372
Misclassified Rep as Dem: 4841
Misclassified other: 410
*************************
Results for fold 2
Accuracy: 0.8570860735708608
Misclassified Dem as Rep: 6333
Misclassified Rep as Dem: 5122
Misclassified other: 410
*************************
Results for fold 3
Accuracy: 0.8573269735732697
Misclassified Dem as Rep: 6423
Misclassified Rep as Dem: 5012
Misclassified other: 410
*************************
Results for fold 4
Accuracy: 0.8631085736310857
Misclassified Dem as Rep: 6035
Misclassified Rep as Dem: 4920
Misclassified other: 410
*************************
Average Accuracy: 0.8594950735949507


Here, we see that the average accuracy is about 85.9%. While this is less than the accuracy for classifying an entire account, this is still impressive accuracy considering how short individual tweets are.