In this notebook, we take the tweet data assembled in Notebook 2 - Data Preprocessing and combine it into two dictionaries of word counts: one for tweets by democrats and one for tweets by republicans. We then look at the words each party tweets most frequently, which gives some insight into what each party is tweeting about.

# Setup

In this section, we repeat the code from Notebook 2 - Data Preprocessing, which sets up the data structures that we use for analysis. The code is included here without explanation; for its documentation, refer back to Notebook 2.

In [1]:
import pandas as pd
import numpy as np

# List of dates for which data was gathered. This is used in the file names.
dates = ["11" + str(day) for day in range(15, 27)]

# Location of the csv files containing tweets
location = "https://raw.githubusercontent.com/lynn0032/CourseProject/main/data_files/"

column_names = ["Index", "Twitter Handle", "Created At", "ID", "Tweet Text", "State", "Branch", "Last Name", "First Name"]
tweets_df = pd.DataFrame(columns = column_names)

# Load tweet data from each day, and combine them in the dataframe tweets_df
for d in dates:
  file_name = "all_tweets_" + d + ".csv"
  day_df = pd.read_csv(location + file_name)
  tweets_df = pd.concat([tweets_df, day_df], axis = 0)

#Drop the extra Index column
tweets_df = tweets_df.drop(columns = ["Index"])

In [2]:
#Remove duplicate tweets
tweets_df = tweets_df.drop_duplicates()

In [3]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [4]:
def remove_punctuation(word:str) -> str:
  """Takes a word, and returns the word with characters 
  except for lowercase letters and @ and # removed"""
  new_word = ""
  allowed = "abcdefghijklmnopqrstuvwxyz@#"
  for char in word:
    if char in allowed:
      new_word += char
  return new_word

stop_words = stopwords.words('english')     #stop words from nltk

def get_words(text_str) -> list:
  """Takes a tweet, splits it into words based on spaces,
  removes extra punctuation from the words, and then
  removes stop words"""
  words = [remove_punctuation(w.lower()) for w in text_str.split(" ")]
  filtered_words = [w for w in words if w not in stop_words]
  return [word for word in filtered_words if word != ""]

In [5]:
def retweet(words:list) -> bool:
  """Takes a list of words from a tweet (as returned by get_words), 
  and returns True if 'rt' is one of the words (which means that 
  the tweet is a retweet)"""
  return ('rt' in words)

def tweet_at(words:list) -> list:
  """Takes a list of words from a tweet (as returned by get_words),
  and returns a list of handles that are tweeted at in the tweet"""
  at = []
  for word in words:
    if word[0] == '@':
      at.append(word[1:])
  return at

def hash_tags(words:list) -> list:
  """Takes a list of words from a tweet (as returned by get_words),
  and returns a list of all hashtags included in the tweet"""
  tags = []
  for word in words:
    if word[0] == '#':
      tags.append(word[1:])
  return tags

In [6]:
ps = PorterStemmer()    # Word stemmer from nltk

def word_counts(words:list) -> dict:
  """Takes a list of words from a tweet (as returned by get_words),
  and returns a dictionary of the counts of words from the tweet"""
  counts_dict = {}
  for word in words:
    if word[0] != '@' and word[0] != '#' and word != 'rt' and 'http' not in word:
      word = ps.stem(word)
      if word in counts_dict:
        counts_dict[word] += 1
      else:
        counts_dict[word] = 1
  return counts_dict

In [7]:
def parse_tweet(text:str) -> dict:
  words = get_words(text)
  return {"counts": word_counts(words), "retweet": retweet(words), "at": tweet_at(words), "tags": hash_tags(words)}
  
tweet_dict = {}
for tweet in tweets_df['Tweet Text']:
  tweet = str(tweet)
  tweet_dict[tweet] = parse_tweet(tweet)

In [8]:
handle_parties_df = tweets_df[['Twitter Handle', 'Party']]
# drop_duplicates returns a new dataframe, so the result must be assigned
handle_parties_df = handle_parties_df.drop_duplicates()
parties = {}
for index, row in handle_parties_df.iterrows():
  handle = row['Twitter Handle']
  parties[handle] = row['Party']

In [9]:
def handle_word_counts(handle):
  handle_df = tweets_df[tweets_df['Twitter Handle'] == handle]
  all_counts = {}
  for tweet in handle_df['Tweet Text']:
    tweet = str(tweet)
    for word in tweet_dict[tweet]['counts']:
      if word in all_counts:
        all_counts[word] += tweet_dict[tweet]['counts'][word]
      else:
        all_counts[word] = tweet_dict[tweet]['counts'][word]
  return all_counts

In [10]:
handles = list(tweets_df['Twitter Handle'].unique())    # unique() already removes duplicates

handle_counts_dict = {}
for handle in handles:
  handle_counts_dict[handle] = handle_word_counts(handle)

# Word Frequencies by Party

Now, we are ready to start computing word frequencies by party.

In the next code cell, we build dictionaries with word counts for tweets by democrats and tweets by republicans. This is done by iterating through the tweets in the dataframe tweets_df, looking up each tweet's word counts in the dictionary tweet_dict, and accumulating these counts in the dictionary for the correct party.

Note that we are already iterating through the dataframe tweets_df, so we can determine the party directly from the dataframe, rather than looking it up in the dictionary parties.

In [11]:
dem_counts = {}
rep_counts = {}

for index, row in tweets_df.iterrows():
  tweet = str(row['Tweet Text'])
  words = tweet_dict[tweet]['counts']
  for word in words:
    if str(row['Party']) == 'D':    #if the party for the current tweet is D, accumulate in dem_counts
      if word in dem_counts:
        dem_counts[word] += words[word]
      else:
        dem_counts[word] = words[word]
    if str(row['Party']) == "R":    #if the party for the current tweet is R, accumulate in rep_counts
      if word in rep_counts:
        rep_counts[word] += words[word]
      else:
        rep_counts[word] = words[word]
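
As an aside, the same accumulation pattern can be sketched more compactly with `collections.Counter` from the standard library. This is only an illustration on toy data, not the code used in this notebook: `toy_tweets`, `party_counts`, `toy_dem_counts`, and `toy_rep_counts` are hypothetical names, and the toy list stands in for the rows of tweets_df paired with their entries in tweet_dict.

```python
from collections import Counter

# Hypothetical stand-ins for tweets_df rows and tweet_dict entries:
# each item pairs a party label with that tweet's per-word counts.
toy_tweets = [
    ("D", {"infrastructur": 2, "bill": 1}),
    ("R", {"biden": 1, "bill": 1}),
    ("D", {"bill": 1, "work": 1}),
    ("I", {"bill": 3}),  # neither D nor R, so it is skipped below
]

party_counts = {"D": Counter(), "R": Counter()}
for party, counts in toy_tweets:
    if party in party_counts:
        # Counter.update adds the new counts to the existing ones
        party_counts[party].update(counts)

toy_dem_counts = dict(party_counts["D"])
toy_rep_counts = dict(party_counts["R"])
print(toy_dem_counts)
print(toy_rep_counts)
```

As in the cell above, tweets whose party is neither D nor R contribute to neither dictionary.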

The loop over tweets_df produces two dictionaries, dem_counts and rep_counts, holding raw word counts. In the next block of code, we normalize each dictionary by that party's total number of words, producing word frequencies by political party.

In [12]:
dem_total_words = sum(dem_counts.values())
dem_freq = {}
for word in dem_counts:
  dem_freq[word] = dem_counts[word] / dem_total_words

rep_total_words = sum(rep_counts.values())
rep_freq = {}
for word in rep_counts:
  rep_freq[word] = rep_counts[word] / rep_total_words
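
To make the normalization step concrete, here is a minimal sketch on toy counts. The helper name `normalize` and the toy dictionary are assumptions made for illustration; they are not part of the notebook. Dividing by the total means the frequencies for a party sum to 1, so the two parties can be compared even if one of them tweets more than the other.

```python
def normalize(counts: dict) -> dict:
    """Divide each count by the total, giving relative frequencies."""
    total = sum(counts.values())
    if total == 0:
        # An empty dictionary has no frequencies to report
        return {}
    return {word: n / total for word, n in counts.items()}

# Toy counts: 4 words in total
toy_counts = {"bill": 2, "work": 1, "act": 1}
toy_freq = normalize(toy_counts)
print(toy_freq)
```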

Now, we sort the words in descending order of frequency, in order to find the top ten words used by each political party.

We first find the top ten words used by democrats, along with their frequencies.

In [13]:
dem_pairs = list(dem_freq.items())
dem_pairs.sort(reverse = True, key = lambda x:x[1])
for pair in dem_pairs[:10]:
  print(pair[0], pair[1])

infrastructur 0.011572551424271141
act 0.010243603322370254
today 0.009654941747031607
amp 0.009425274541501756
invest 0.008651541334522549
famili 0.007503205306873293
bipartisan 0.007427392831261498
bill 0.007182117174870394
thank 0.006901165059367858
work 0.006381626623557612


Next, we find the top ten words used by republicans, along with their frequencies.

In [14]:
rep_pairs = list(rep_freq.items())
rep_pairs.sort(reverse = True, key = lambda x:x[1])
for pair in rep_pairs[:10]:
  print(pair[0], pair[1])

biden 0.014706966558444218
american 0.009330431624620106
democrat 0.008068898288192102
amp 0.007853181548846188
today 0.006990314591462532
thank 0.006725447202645397
bill 0.006569803479319864
spend 0.005971803910753343
presid 0.005431146766569913
year 0.0053656125672749515
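
The two cells above repeat the same sort-and-slice steps. One possible way to factor them out is a small helper, sketched here on toy data; the name `top_words` and the toy frequencies are hypothetical and only illustrate the pattern.

```python
def top_words(freq: dict, n: int = 10) -> list:
    """Return the n (word, frequency) pairs with the highest frequency."""
    return sorted(freq.items(), key=lambda pair: pair[1], reverse=True)[:n]

# Toy frequencies, standing in for dem_freq or rep_freq
toy_freq = {"bill": 0.5, "work": 0.2, "act": 0.3}
for word, f in top_words(toy_freq, n=2):
    print(word, f)
```

With the dictionaries built earlier, `top_words(dem_freq)` and `top_words(rep_freq)` would return the same pairs that the two cells print.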


Just from the two top-ten lists, we can make some interesting observations.

The top two most frequent (stemmed) words for democrats are "infrastructur" and "act". This makes sense, since the democrats have been focusing on the Infrastructure Investment and Jobs Act, which was signed into law on 11/15, the first day of the data collection period. Several of the other most frequent words, such as "invest", "bipartisan", and "bill", also seem to connect to this act.

For republicans, the top three most frequent (stemmed) words include "biden" and "democrat", and "presid" appears lower on the list. This suggests a republican focus on the opposing party.
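
To separate the words both parties share from the ones that are specific to each party, the two top-ten lists can be compared directly. The sketch below copies the words from the outputs of cells 13 and 14; the variable names are hypothetical.

```python
# Top-ten words copied from the outputs of cells 13 and 14
dem_top = ["infrastructur", "act", "today", "amp", "invest",
           "famili", "bipartisan", "bill", "thank", "work"]
rep_top = ["biden", "american", "democrat", "amp", "today",
           "thank", "bill", "spend", "presid", "year"]

# Words that appear in both lists, and words unique to each list
shared = sorted(set(dem_top) & set(rep_top))
dem_only = [w for w in dem_top if w not in rep_top]
rep_only = [w for w in rep_top if w not in dem_top]

print("shared:", shared)
print("democrats only:", dem_only)
print("republicans only:", rep_only)
```

Four words are shared ("amp", "bill", "thank", "today"), so the party-specific observations above rest on the remaining six words in each list. Note that remove_punctuation would turn an HTML-escaped ampersand, "&amp;", into "amp", which may explain why that token appears in both lists.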