In this notebook, we load and preprocess the data in preparation for the analysis in later sections.

The tweet data is stored in CSV files (created in the previous notebook, on scraping tweets), with one CSV file for each day of data collection. These files are posted in the GitHub repository for this project. The following block of code loads the data from these files and combines them into a single pandas dataframe.

We then look at a summary of this data. Three parties are represented in the dataset: "D" (Democrat), "R" (Republican), and "I" (Independent). There are 115,982 tweet texts, of which 23,498 are unique; the total includes some duplicate records, which we deal with next.

In [1]:
import pandas as pd
import numpy as np

# List of dates for which data was gathered. This is used in the file names.
dates = ["11" + str(day) for day in range(15, 27)]

# Location of the csv files containing tweets
location = "https://raw.githubusercontent.com/lynn0032/CourseProject/main/data_files/"

column_names = ["Index", "Twitter Handle", "Created At", "ID", "Tweet Text", "State", "Branch", "Last Name", "First Name"]
tweets_df = pd.DataFrame(columns = column_names)

# Load tweet data from each day, and combine them in the dataframe tweets_df
for d in dates:
  file_name = "all_tweets_" + d + ".csv"
  day_df = pd.read_csv(location + file_name)
  tweets_df = pd.concat([tweets_df, day_df], axis = 0)

#Drop the extra Index column
tweets_df = tweets_df.drop(columns = ["Index"])

tweets_df.describe(include = 'all')

Unnamed: 0.1,Twitter Handle,Created At,ID,Tweet Text,State,Branch,Last Name,First Name,Unnamed: 0,Party
count,114069,115982,115982.0,115982,115995,115995,115353,115353,115995.0,115353
unique,536,22551,,23498,488,2,491,340,,3
top,SenBlumenthal,2021-08-01 00:30:42,,RT @POTUS: Join me as I sign the Bipartisan In...,New York 2nd District,U.S. Representative,Johnson,Mike,,D
freq,214,36,,82,642,94595,1284,2996,,58203
mean,,,1.449471e+18,,,,,,4837.255925,
std,,,5.338167e+16,,,,,,2797.926267,
min,,,6.115793e+17,,,,,,0.0,
25%,,,1.458628e+18,,,,,,2416.0,
50%,,,1.460784e+18,,,,,,4833.0,
75%,,,1.461734e+18,,,,,,7249.0,
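
To make the file naming concrete, here is a minimal sketch that lists the file names and URLs requested by the loading loop above. It only builds strings and downloads nothing; the names `file_names` and `urls` are introduced here for illustration and are not used elsewhere in the notebook.

```python
# Same date strings and base URL as in the loading cell above.
dates = ["11" + str(day) for day in range(15, 27)]
location = "https://raw.githubusercontent.com/lynn0032/CourseProject/main/data_files/"

# file_names and urls are illustrative names only.
file_names = ["all_tweets_" + d + ".csv" for d in dates]
urls = [location + name for name in file_names]

print(len(file_names))   # 12 files, one per day of collection
print(file_names[0])     # all_tweets_1115.csv
print(file_names[-1])    # all_tweets_1126.csv
```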


The Twitter API does not allow restricting the collected tweets by date, so in many cases retrieving 18 tweets for each day resulted in collecting tweets from the previous day or earlier. This produces duplicate tweets in the dataset. The following block of code eliminates these duplicates, leaving us with 84,948 distinct tweets. Note that the text is not distinct for all tweets, since politicians frequently retweet other tweets.

In [2]:
#Remove duplicate tweets
tweets_df = tweets_df.drop_duplicates()

tweets_df.describe(include = 'all')

Unnamed: 0.1,Twitter Handle,Created At,ID,Tweet Text,State,Branch,Last Name,First Name,Unnamed: 0,Party
count,83026,84948,84948.0,84948,84952,84952,84310,84310,84952.0,84310
unique,536,22551,,23498,488,2,491,340,,3
top,SenBlumenthal,2021-11-24 00:07:06,,RT @POTUS: Join me as I sign the Bipartisan In...,Louisiana 5th District,U.S. Representative,Johnson,John,,D
freq,214,36,,64,642,68214,870,2062,,45648
mean,,,1.457561e+18,,,,,,4880.892916,
std,,,2.658129e+16,,,,,,2792.262229,
min,,,6.115793e+17,,,,,,0.0,
25%,,,1.459545e+18,,,,,,2509.0,
50%,,,1.461051e+18,,,,,,4907.0,
75%,,,1.461803e+18,,,,,,7286.25,
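
The behavior of drop_duplicates can be checked on a toy dataframe. The sketch below uses made-up rows, not project data. It also shows a variant, `drop_duplicates(subset=["ID"])`, which treats two rows as duplicates whenever their tweet ID matches, even if another column differs. The variant is included only as a possible alternative under that assumption; it is not what the notebook does.

```python
import pandas as pd

# Made-up rows: the last two share an ID but differ in a leftover index column.
toy_df = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "Tweet Text": ["hello", "hello", "RT @a: hi", "RT @a: hi"],
    "Unnamed: 0": [0, 0, 5, 9],
})

full_row = toy_df.drop_duplicates()             # rows must match in every column
by_id = toy_df.drop_duplicates(subset=["ID"])   # rows must match in ID only

print(len(full_row))  # 3
print(len(by_id))     # 2
```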


To tokenize words, remove stop words, and stem words, we use nltk.

In [3]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


The work to preprocess words is divided up into functions. 

The function remove_punctuation takes a word as an argument and returns it with all characters removed except lowercase letters, "@", and "#" (these two are kept because they mark a mention of another account and a hashtag, respectively).

The function get_words takes a tweet, converts it to lowercase, and splits it into words wherever spaces occur. It uses remove_punctuation to strip extra punctuation, then removes stop words using the stop word list from nltk.

In [4]:
def remove_punctuation(word:str) -> str:
  """Takes a word, and returns the word with characters 
  except for lowercase letters and @ and # removed"""
  new_word = ""
  allowed = "abcdefghijklmnopqrstuvwxyz@#"
  for char in word:
    if char in allowed:
      new_word += char
  return new_word

stop_words = set(stopwords.words('english'))     # stop words from nltk, stored as a set for fast lookup

def get_words(text_str) -> list:
  """Takes a tweet, splits it into words based on spaces,
  removes extra punctuation from the words, and then
  removes stop words"""
  words = [remove_punctuation(w.lower()) for w in text_str.split(" ")]
  filtered_words = [w for w in words if w not in stop_words]
  return [word for word in filtered_words if word != ""]
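
As a quick illustration of what these two functions produce, here is a minimal self-contained sketch. It restates remove_punctuation in condensed form and uses a tiny stand-in stop word set (`demo_stop_words`, an assumption made so the example runs without downloading nltk data); the example tweet is made up.

```python
def remove_punctuation(word: str) -> str:
    """Keep only lowercase letters, '@', and '#'."""
    allowed = "abcdefghijklmnopqrstuvwxyz@#"
    return "".join(char for char in word if char in allowed)

# Stand-in stop word set (assumption); the notebook uses nltk's English list.
demo_stop_words = {"me", "as", "i", "the"}

def get_words_demo(text_str: str) -> list:
    """Lowercase, split on spaces, strip punctuation, drop stop words and empties."""
    words = [remove_punctuation(w.lower()) for w in text_str.split(" ")]
    return [w for w in words if w != "" and w not in demo_stop_words]

example = "RT @POTUS: Join me as I sign the #Infrastructure bill! https://t.co/abc123"
print(get_words_demo(example))
# ['rt', '@potus', 'join', 'sign', '#infrastructure', 'bill', 'httpstcoabc']
```

Note that digits are dropped and the link collapses into a single word that still contains 'http'.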

The following functions are used to determine some common aspects of tweets: whether it was a retweet, who was tweeted at (with @handle), and any hashtags (#hashtag) included in the tweet.

The function retweet takes a list of words from a tweet, as returned by the function get_words defined above, and returns True if the tweet is a retweet. It returns False otherwise. This is identified by looking for 'rt' in the list of words from the tweet.

The function tweet_at takes a list of words from a tweet, as returned by the function get_words, and returns a list of handles tweeted at in the tweet. These are identified by looking for '@' at the beginning of words in the tweet.

The function hash_tags takes a list of words from a tweet, as returned by the function get_words, and returns a list of hashtags referenced in the tweet. These are identified by looking for '#' at the beginning of words in the tweet.

In [5]:
def retweet(words:list) -> bool:
  """Takes a list of words from a tweet (as returned by get_words), 
  and returns True if 'rt' is one of the words (which means that 
  the tweet is a retweet)"""
  return ('rt' in words)

def tweet_at(words:list) -> list:
  """Takes a list of words from a tweet (as returned by get_words),
  and returns a list of handles that are tweeted at in the tweet"""
  at = []
  for word in words:
    if word[0] == '@':
      at.append(word[1:])
  return at

def hash_tags(words:list) -> list:
  """Takes a list of words from a tweet (as returned by get_words),
  and returns a list of all hashtags included in the tweet"""
  tags = []
  for word in words:
    if word[0] == '#':
      tags.append(word[1:])
  return tags
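
The sketch below shows what these three helpers return for a made-up word list. The functions are restated in condensed form with list comprehensions and `str.startswith`, which should behave the same as the loops above for the non-empty words that get_words returns.

```python
def retweet(words: list) -> bool:
    """True if the word list contains 'rt'."""
    return "rt" in words

def tweet_at(words: list) -> list:
    """Handles mentioned in the tweet, without the leading '@'."""
    return [word[1:] for word in words if word.startswith("@")]

def hash_tags(words: list) -> list:
    """Hashtags in the tweet, without the leading '#'."""
    return [word[1:] for word in words if word.startswith("#")]

# Made-up word list in the form returned by get_words.
words = ["rt", "@potus", "join", "sign", "#infrastructure", "bill", "httpstcoabc"]
print(retweet(words))    # True
print(tweet_at(words))   # ['potus']
print(hash_tags(words))  # ['infrastructure']
```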

The function word_counts takes a list of words from a tweet, as returned by the function get_words, and returns a dictionary of word counts for the tweet: {word:count}. Words beginning with '@' or '#' are omitted, since mentions and hashtags are stored separately. The word 'rt' is also omitted, since it only indicates a retweet. Links, identified as words containing 'http', are excluded as well. Each remaining word is stemmed with the Porter stemmer before it is counted.

In [6]:
ps = PorterStemmer()    # Word stemmer from nltk

def word_counts(words:list) -> dict:
  """Takes a list of words from a tweet (as returned by get_words),
  and returns a dictionary of the counts of words from the tweet"""
  counts_dict = {}
  for word in words:
    if word[0] != '@' and word[0] != '#' and word != 'rt' and 'http' not in word:
      word = ps.stem(word)
      if word in counts_dict:
        counts_dict[word] += 1
      else:
        counts_dict[word] = 1
  return counts_dict
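
The same filtering and counting can be sketched with `collections.Counter`. In this hypothetical variant (`word_counts_sketch` is not part of the notebook), the stemmer is passed in as a parameter and defaults to leaving words unchanged, so the example runs without nltk; the word list is made up.

```python
from collections import Counter

def word_counts_sketch(words: list, stem=lambda w: w) -> dict:
    """Count words after dropping mentions, hashtags, 'rt', and links.
    `stem` defaults to the identity function (an assumption for this demo)."""
    kept = [
        stem(word)
        for word in words
        if not word.startswith(("@", "#")) and word != "rt" and "http" not in word
    ]
    return dict(Counter(kept))

words = ["rt", "@potus", "join", "sign", "#infrastructure", "bill", "bill", "httpstcoabc"]
print(word_counts_sketch(words))
# {'join': 1, 'sign': 1, 'bill': 2}
```

Inside the notebook, passing `stem=ps.stem` would apply the Porter stemmer as word_counts does.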

The function parse_tweet uses the functions defined above to create a dictionary with information on a tweet. It takes the original text of the tweet, and uses the function get_words to separate the text into individual words. The dictionary for the tweet stores the following information:


*   Under the key "counts", a dictionary with the word counts from the tweet, which is returned by calling the function word_counts on the words from the tweet.
*   Under the key "retweet", a boolean indicating whether or not the tweet is a retweet. This is done using the function retweet defined above.
*   Under the key "at", a list of handles tweeted at in this tweet, as constructed by the function tweet_at defined above.
*   Under the key "tags", a list of hashtags included in this tweet, as constructed by the function hash_tags defined above.

Finally, we call the function parse_tweet on every tweet in the dataset tweets_df, constructing a dictionary tweet_dict where the keys are the text of a tweet, and the corresponding value is the dictionary created by parse_tweet.

For example, if example_tweet is the text of a tweet, we could get the word counts dictionary for the tweet with: tweet_dict[example_tweet]["counts"].



In [7]:
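def parse_tweet(text:str) -> dict:
  """Takes the text of a tweet, and returns a dictionary with its
  word counts, retweet flag, handles tweeted at, and hashtags"""
  words = get_words(text)
  return {"counts": word_counts(words), "retweet": retweet(words), "at": tweet_at(words), "tags": hash_tags(words)}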
  
tweet_dict = {}
for tweet in tweets_df['Tweet Text']:
  tweet = str(tweet)
  tweet_dict[tweet] = parse_tweet(tweet)
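
To show the shape of the resulting data structure, here is a hand-written example of a single entry. The tweet text and all values are illustrative only and were not produced by running the notebook; `tweet_dict_demo` is a hypothetical name.

```python
# Hand-written example of one entry (illustrative values only).
example_tweet = "RT @POTUS: Join me as I sign the #Infrastructure bill!"
tweet_dict_demo = {
    example_tweet: {
        "counts": {"join": 1, "sign": 1, "bill": 1},  # stemmed word -> count
        "retweet": True,                              # the word 'rt' was present
        "at": ["potus"],                              # handles without '@'
        "tags": ["infrastructure"],                   # hashtags without '#'
    }
}

entry = tweet_dict_demo[example_tweet]
print(entry["counts"]["bill"])  # 1
print(entry["retweet"])         # True
```

Because the keys are tweet texts, two records with identical text (for example, the same retweet from two accounts) share a single entry.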

Next, we construct a dictionary storing the political party associated with each Twitter handle, so that we can easily look up these parties.

This is done by taking the Twitter handle and party from each record in the dataframe tweets_df, iterating through these and adding them to the dictionary parties.

In [8]:
# Keep one row per (handle, party) pair; drop_duplicates returns a new dataframe
handle_parties_df = tweets_df[['Twitter Handle', 'Party']].drop_duplicates()
parties = {}
for index, row in handle_parties_df.iterrows():
  handle = row['Twitter Handle']
  parties[handle] = row['Party']
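
A handle-to-party lookup of this kind can also be built without an explicit loop. The sketch below uses a made-up dataframe and `dict(zip(...))`; `parties_demo` and the handle names are hypothetical. Note that drop_duplicates returns a new dataframe rather than modifying the original, so its result has to be assigned.

```python
import pandas as pd

# Made-up records: handle_a appears twice with the same party.
toy_df = pd.DataFrame({
    "Twitter Handle": ["handle_a", "handle_a", "handle_b"],
    "Party": ["D", "D", "R"],
})

# One row per distinct (handle, party) pair.
pairs = toy_df[["Twitter Handle", "Party"]].drop_duplicates()

# Pair up the two columns and build the lookup dictionary.
parties_demo = dict(zip(pairs["Twitter Handle"], pairs["Party"]))
print(parties_demo)  # {'handle_a': 'D', 'handle_b': 'R'}
```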

Next, we build a word count dictionary for each handle. This will be helpful for classifying accounts as Democrat or Republican.

To do this, we define a function that takes a Twitter handle, and returns a dictionary of word counts from all tweets from that handle.

In [9]:
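def handle_word_counts(handle):
  """Takes a Twitter handle, and returns a dictionary of word counts
  summed over all tweets from that handle"""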
  handle_df = tweets_df[tweets_df['Twitter Handle'] == handle]
  all_counts = {}
  for tweet in handle_df['Tweet Text']:
    tweet = str(tweet)
    for word in tweet_dict[tweet]['counts']:
      if word in all_counts:
        all_counts[word] += tweet_dict[tweet]['counts'][word]
      else:
        all_counts[word] = tweet_dict[tweet]['counts'][word]
  return all_counts
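
The merging step, adding per-tweet counts into one running total, can be sketched in isolation with `collections.Counter`. The helper `merge_counts` and the input dictionaries are hypothetical and only illustrate the summation.

```python
from collections import Counter

def merge_counts(count_dicts: list) -> dict:
    """Sum a list of {word: count} dictionaries into a single dictionary."""
    total = Counter()
    for counts in count_dicts:
        total.update(counts)  # adds counts for words already seen
    return dict(total)

# Made-up per-tweet word counts for one handle.
per_tweet = [{"bill": 2, "vote": 1}, {"vote": 3, "senat": 1}]
print(merge_counts(per_tweet))
# {'bill': 2, 'vote': 4, 'senat': 1}
```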

Now, we use this function to compute word counts for each handle. These are stored in a dictionary, with the handle as the key and the dictionary of word counts as the corresponding value.

In [10]:
# Unique handles, skipping records where the handle is missing
handles = list(tweets_df['Twitter Handle'].dropna().unique())

handle_counts_dict = {}
for handle in handles:
  handle_counts_dict[handle] = handle_word_counts(handle)
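
As an example of how a dictionary of this shape can be queried, the sketch below finds the most frequent words for one handle. The handles and counts in `handle_counts_demo` are made up.

```python
from collections import Counter

# Made-up data in the same shape: handle -> {stemmed word: count}.
handle_counts_demo = {
    "handle_a": {"bill": 4, "vote": 2, "infrastructur": 7},
    "handle_b": {"tax": 5, "border": 3},
}

# The two most frequent words for one handle, as (word, count) pairs.
top_two = Counter(handle_counts_demo["handle_a"]).most_common(2)
print(top_two)  # [('infrastructur', 7), ('bill', 4)]
```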

The code in this notebook sets up the data structures needed for our data analysis. These data structures will be used in subsequent notebooks.