In this notebook, we use Latent Dirichlet Allocation (LDA) to find natural topic groupings among Twitter accounts. We then look at how Democrats and Republicans are distributed across these groupings, to see whether LDA grouped the accounts along party lines.

# Setup

In this section, we repeat the code from Notebook 2 - Data Preprocessing, which sets up the data structures that we use for analysis. The code is included here without explanation; for its documentation, refer back to Notebook 2.

In [1]:
import pandas as pd
import numpy as np

# List of dates for which data was gathered. This is used in the file names.
dates = ["11" + str(day) for day in range(15, 27)]

# Location of the csv files containing tweets
location = "https://raw.githubusercontent.com/lynn0032/CourseProject/main/data_files/"

column_names = ["Index", "Twitter Handle", "Created At", "ID", "Tweet Text", "State", "Branch", "Last Name", "First Name"]
tweets_df = pd.DataFrame(columns = column_names)

# Load tweet data from each day, and combine them in the dataframe tweets_df
for d in dates:
  file_name = "all_tweets_" + d + ".csv"
  day_df = pd.read_csv(location + file_name)
  tweets_df = pd.concat([tweets_df, day_df], axis = 0)

#Drop the extra Index column
tweets_df = tweets_df.drop(columns = ["Index"])

In [2]:
#Remove duplicate tweets
tweets_df = tweets_df.drop_duplicates()

In [3]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [4]:
def remove_punctuation(word:str) -> str:
  """Takes a word, and returns the word with characters 
  except for lowercase letters and @ and # removed"""
  new_word = ""
  allowed = "abcdefghijklmnopqrstuvwxyz@#"
  for char in word:
    if char in allowed:
      new_word += char
  return new_word

stop_words = stopwords.words('english')     #stop words from nltk

def get_words(text_str) -> list:
  """Takes a tweet, splits it into words based on spaces,
  removes extra punctuation from the words, and then
  removes stop words"""
  words = [remove_punctuation(w.lower()) for w in text_str.split(" ")]
  filtered_words = [w for w in words if w not in stop_words]
  return [word for word in filtered_words if word != ""]

In [5]:
def retweet(words:list) -> bool:
  """Takes a list of words from a tweet (as returned by get_words), 
  and returns True if 'rt' is one of the words (which means that 
  the tweet is a retweet)"""
  return ('rt' in words)

def tweet_at(words:list) -> list:
  """Takes a list of words from a tweet (as returned by get_words),
  and returns a list of handles that are tweeted at in the tweet"""
  at = []
  for word in words:
    if word[0] == '@':
      at.append(word[1:])
  return at

def hash_tags(words:list) -> list:
  """Takes a list of words from a tweet (as returned by get_words),
  and returns a list of all hashtags included in the tweet"""
  tags = []
  for word in words:
    if word[0] == '#':
      tags.append(word[1:])
  return tags

In [6]:
ps = PorterStemmer()    # Word stemmer from nltk

def word_counts(words:list) -> dict:
  """Takes a list of words from a tweet (as returned by get_words),
  and returns a dictionary of the counts of words from the tweet"""
  counts_dict = {}
  for word in words:
    if word[0] != '@' and word[0] != '#' and word != 'rt' and 'http' not in word:
      word = ps.stem(word)
      if word in counts_dict:
        counts_dict[word] += 1
      else:
        counts_dict[word] = 1
  return counts_dict

In [7]:
def parse_tweet(text:str) -> dict:
  words = get_words(text)
  return {"counts": word_counts(words), "retweet": retweet(words), "at": tweet_at(words), "tags": hash_tags(words)}
  
tweet_dict = {}
for tweet in tweets_df['Tweet Text']:
  tweet = str(tweet)
  tweet_dict[tweet] = parse_tweet(tweet)

In [8]:
# Keep one row per (handle, party) pair; drop_duplicates returns a new
# dataframe, so the result must be assigned
handle_parties_df = tweets_df[['Twitter Handle', 'Party']].drop_duplicates()
parties = {}
for index, row in handle_parties_df.iterrows():
  handle = row['Twitter Handle']
  parties[handle] = row['Party']

In [9]:
def handle_word_counts(handle):
  handle_df = tweets_df[tweets_df['Twitter Handle'] == handle]
  all_counts = {}
  for tweet in handle_df['Tweet Text']:
    tweet = str(tweet)
    for word in tweet_dict[tweet]['counts']:
      if word in all_counts:
        all_counts[word] += tweet_dict[tweet]['counts'][word]
      else:
        all_counts[word] = tweet_dict[tweet]['counts'][word]
  return all_counts

In [10]:
handles = list(set(tweets_df['Twitter Handle'].unique()))

handle_counts_dict = {}
for handle in handles:
  handle_counts_dict[handle] = handle_word_counts(handle)

# Latent Dirichlet Allocation (LDA)

We will build our LDA model using the library gensim. In order to do this, we need to encode words as numbers. This is done in the following code, by creating a set of all words from the dataset of tweets (already tokenized and stemmed), and then creating dictionaries that can be used to look up the id (number) for a word, and vice versa.

In [11]:
# Construct a set containing all words
all_words = set()
for tweet in tweet_dict:
  words = tweet_dict[tweet]['counts'].keys()
  for w in words:
    all_words.add(w)

# Construct the dictionaries id_to_word and word_to_id
# id_to_word can be used to look up the word by its id, with id_to_word[id]
# word_to_id can be used to look up the id by its word, with word_to_id[word]
id_to_word = {}
word_to_id = {}
# word_id is used as the counter to avoid shadowing Python's built-in id()
for word_id, word in enumerate(all_words):
  id_to_word[word_id] = word
  word_to_id[word] = word_id
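
As a small illustration of the two lookup dictionaries, the sketch below builds them for a made-up set of stemmed words. The names `toy_words`, `toy_id_to_word` and `toy_word_to_id` are hypothetical. The words are sorted first (an assumption, not something the notebook does) so that the ids come out the same on every run; iterating directly over a set gives no such guarantee.

```python
# Toy vocabulary (hypothetical words, not taken from the dataset)
toy_words = {"infrastructur", "bill", "veteran"}

# Sorting fixes the order, so each word gets the same id on every run
toy_id_to_word = dict(enumerate(sorted(toy_words)))
toy_word_to_id = {word: word_id for word_id, word in toy_id_to_word.items()}

print(toy_id_to_word)
print(toy_word_to_id)
```

Looking a word up in `toy_word_to_id` and then looking the resulting id up in `toy_id_to_word` returns the original word, which is the property the LDA code relies on.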

Next we build the corpus, which is the set of documents. In this case, each document is the combined word count of all tweets from a handle. So, each document corresponds to one Twitter account.

For gensim's LDA model, the corpus is stored as a list of documents, where each document is a list of tuples. Each tuple consists of the id of a word, and the count of that word in the document.

In [12]:
handles = tweets_df['Twitter Handle'].unique()

doc_handles = {}

corpus = []
doc_num = 0
for handle in handle_counts_dict:
  doc_handles[doc_num] = handle
  doc_num += 1
  doc = []
  for word in handle_counts_dict[handle]:
    doc.append((word_to_id[word], handle_counts_dict[handle][word]))
  corpus.append(doc)
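
To make the corpus format concrete, the following sketch builds a two-document corpus from made-up word counts. All names (`toy_counts`, `toy_vocab`, `toy_word_to_id`, `toy_corpus`) and the counts themselves are hypothetical; only the list-of-tuples layout mirrors the code above.

```python
# Hypothetical word counts for two accounts (not real data)
toy_counts = {
  "handle_a": {"bill": 3, "veteran": 1},
  "handle_b": {"bill": 1, "tax": 2},
}

# Assign an id to every word, in sorted order so the ids are reproducible
toy_vocab = sorted({w for counts in toy_counts.values() for w in counts})
toy_word_to_id = {word: i for i, word in enumerate(toy_vocab)}

# Each document becomes a list of (word id, count) tuples
toy_corpus = [
  [(toy_word_to_id[w], c) for w, c in counts.items()]
  for counts in toy_counts.values()
]
print(toy_corpus)
```

Words that an account never used are simply absent from its document, so each document stays sparse.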

Now we train the LDA model with gensim. We use two topics, which we will then compare with the party of each account.

In [13]:
import gensim

num_topics = 2

lda_model = gensim.models.LdaMulticore(corpus=corpus, id2word=id_to_word, num_topics=num_topics)

The LDA model takes documents, and returns the estimated mixing values for each topic. For instance, this might be the list [(0, .25), (1, .75)], which would indicate that 25% of the document is drawn from topic 0, while 75% of the document is drawn from topic 1. The following function takes these results, and selects and returns the topic with the largest proportion, identifying this as the main topic of the document. For the example given, this would be topic 1.

In [14]:
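def main_topic(topics):
  """Takes a list of (topic, proportion) pairs for a document, and
  returns the topic with the largest proportion"""
  best_topic = None   # stays None only if the list of topics is empty
  best_percent = 0
  for pair in topics:
    if pair[1] > best_percent:
      best_percent = pair[1]
      best_topic = pair[0]
  return best_topic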
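
The example from the text can be checked with a compact equivalent of this function. The version below, named `main_topic_max` here to mark it as an illustration rather than the notebook's own code, uses Python's built-in `max` with a key on the proportion.

```python
def main_topic_max(topics):
  """Return the topic id with the largest proportion from a list of
  (topic id, proportion) pairs."""
  return max(topics, key=lambda pair: pair[1])[0]

# The example from the text: 25% topic 0, 75% topic 1
example = [(0, 0.25), (1, 0.75)]
print(main_topic_max(example))
```

If two topics tie, `max` returns the first one in the list, which is also what the loop-based version does because it only replaces the best topic on a strictly larger proportion.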

In the code below, we iterate through all documents in the corpus. For each document, we find the political party of the account owner, and identify the main topic of the document (as discovered by the LDA model).

For each party, we find the number of documents in each topic discovered by the LDA model, to determine if the split into topics aligns with political parties.

In [15]:
topic_dist = []
dem_topics = [0 for index in range(num_topics)]
rep_topics = [0 for index in range(num_topics)]
for index in range(len(corpus)):
  doc = corpus[index]
  handle = doc_handles[index]
  party = parties[handle]
  topics = lda_model[doc]
  best_topic = main_topic(topics)
  topic_dist.append((handle, party, topics, best_topic))
  if party == "D":
    dem_topics[best_topic] += 1
  if party == "R":
    rep_topics[best_topic] += 1

print("Democrat's topics:", dem_topics)
print("Republican's topics:", rep_topics)

Democrat's topics: [170, 104]
Republican's topics: [124, 137]


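Results vary between runs, because the model is initialized randomly. In one example run (a different run from the output printed above), Topic 0 contained 203 Republicans and 58 Democrats, and Topic 1 contained 216 Democrats and 58 Republicans. In that run, the topics discovered by the LDA model align somewhat with political parties, though not completely. The output printed above shows a much weaker split: Topic 0 contains 170 Democrats and 124 Republicans, and Topic 1 contains 104 Democrats and 137 Republicans.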
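
One way to put a number on how well the topics line up with the parties is to take the counts printed by the cell above ([170, 104] for Democrats and [124, 137] for Republicans) and compute, for each topic, the share of its accounts that belong to each party. The sketch below does this; the choice of this particular measure is an assumption made for illustration, not part of the original analysis.

```python
# Counts copied from the printed output above
dem_topics = [170, 104]
rep_topics = [124, 137]

for topic in range(len(dem_topics)):
  total = dem_topics[topic] + rep_topics[topic]
  dem_share = dem_topics[topic] / total
  rep_share = rep_topics[topic] / total
  print(f"Topic {topic}: {dem_share:.1%} Democrats, {rep_share:.1%} Republicans")

# Crude agreement score: treat the larger party in each topic as that
# topic's party, and count the accounts that fall in their party's topic
matched = sum(max(d, r) for d, r in zip(dem_topics, rep_topics))
agreement = matched / (sum(dem_topics) + sum(rep_topics))
print(f"Agreement with party: {agreement:.1%}")
```

A score near 50% would mean the topics carry almost no information about party, while a score near 100% would mean the topics split the accounts cleanly along party lines.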

Finally, we print the twenty words with the highest weight in each of the topics generated by the LDA model, to see if this provides any insight into what these topics are.

In [16]:
lda_model.print_topics(num_words=20)

[(0,
  '0.009*"today" + 0.008*"amp" + 0.008*"act" + 0.008*"infrastructur" + 0.007*"famili" + 0.007*"thank" + 0.007*"biden" + 0.007*"american" + 0.006*"bill" + 0.006*"join" + 0.006*"hous" + 0.005*"invest" + 0.005*"presid" + 0.005*"veteran" + 0.005*"back" + 0.005*"us" + 0.004*"pass" + 0.004*"work" + 0.004*"year" + 0.004*"build"'),
 (1,
  '0.009*"biden" + 0.009*"amp" + 0.008*"american" + 0.007*"today" + 0.007*"bill" + 0.007*"infrastructur" + 0.006*"bipartisan" + 0.006*"thank" + 0.006*"year" + 0.006*"work" + 0.006*"act" + 0.005*"day" + 0.005*"democrat" + 0.005*"us" + 0.005*"invest" + 0.005*"back" + 0.005*"hous" + 0.004*"im" + 0.004*"great" + 0.004*"tax"')]

Unfortunately, it is difficult to interpret the topics from these lists of words. Both topics give high weight to many of the same words, though with slightly different weights and rankings.
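
The overlap between the two word lists can be checked directly. The sketch below parses the two topic strings printed above with a regular expression and compares the resulting sets of words. The helper name `top_words` is hypothetical, and the strings are copied from the example output rather than read from `lda_model`.

```python
import re

# Topic strings copied from the example output above
topic_0 = (
  '0.009*"today" + 0.008*"amp" + 0.008*"act" + 0.008*"infrastructur" + '
  '0.007*"famili" + 0.007*"thank" + 0.007*"biden" + 0.007*"american" + '
  '0.006*"bill" + 0.006*"join" + 0.006*"hous" + 0.005*"invest" + '
  '0.005*"presid" + 0.005*"veteran" + 0.005*"back" + 0.005*"us" + '
  '0.004*"pass" + 0.004*"work" + 0.004*"year" + 0.004*"build"'
)
topic_1 = (
  '0.009*"biden" + 0.009*"amp" + 0.008*"american" + 0.007*"today" + '
  '0.007*"bill" + 0.007*"infrastructur" + 0.006*"bipartisan" + '
  '0.006*"thank" + 0.006*"year" + 0.006*"work" + 0.006*"act" + '
  '0.005*"day" + 0.005*"democrat" + 0.005*"us" + 0.005*"invest" + '
  '0.005*"back" + 0.005*"hous" + 0.004*"im" + 0.004*"great" + 0.004*"tax"'
)

def top_words(topic_str):
  """Extract the quoted words from a topic string of the form shown above."""
  return set(re.findall(r'"([^"]+)"', topic_str))

words_0 = top_words(topic_0)
words_1 = top_words(topic_1)

print("Shared:", sorted(words_0 & words_1))
print("Only in topic 0:", sorted(words_0 - words_1))
print("Only in topic 1:", sorted(words_1 - words_0))
```

For this example output, 14 of the 20 words appear in both lists, which is consistent with the observation that the two topics are hard to tell apart.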