# Topic Modeling on 'Russian Troll Tweets'

Does topic modeling using LDA make sense on short-form data such as tweets?
[Maybe yes](https://www.researchgate.net/post/What_is_a_good_way_to_perform_topic_modeling_on_short_text). Let's try.

In [1]:
import os
import json
import string
import warnings
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import pyLDAvis
from pyLDAvis import gensim

In [2]:
# display the visualization inline in the notebook
pyLDAvis.enable_notebook()

## Import what we'll work with

In [3]:
data_dir = 'data'
files = os.listdir(data_dir)

In [4]:
# for now, load only the first file; change this to loop over all of them
df = pd.read_csv(f"{data_dir}/{files[0]}")
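
The cell above reads only the first file. A minimal sketch of loading every file into a single DataFrame could look like the following. The helper name `load_tweets` and the assumption that every data file ends in `.csv` are mine and not part of the original notebook.

```python
import os
import pandas as pd

def load_tweets(data_dir):
    """Read every CSV file in data_dir and stack them into one DataFrame."""
    # Assumption: all data files sit directly in data_dir and end in ".csv".
    paths = sorted(
        os.path.join(data_dir, name)
        for name in os.listdir(data_dir)
        if name.endswith(".csv")
    )
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index gives the combined frame one continuous row index.
    return pd.concat(frames, ignore_index=True)
```

With this in place, `df = load_tweets(data_dir)` would replace the single `pd.read_csv` call, at the cost of a much longer preprocessing run.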

## Inspect the data quickly

In [5]:
df.head()

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,post_type,account_type,new_june_2018,retweet,account_category
0,1674084000.0,GAB1ALDANA,People are too toxic. I think I have people po...,United States,English,7/30/2016 20:15,7/30/2016 20:15,3395,2014,2150,RETWEET,Hashtager,0,1,HashtagGamer
1,1674084000.0,GAB1ALDANA,#NowPlaying Don't Shoot (I'm a Man) by @DEVO -...,United States,English,7/30/2016 20:15,7/30/2016 20:15,3395,2014,2146,RETWEET,Hashtager,0,1,HashtagGamer
2,1674084000.0,GAB1ALDANA,the 'I'm the most boring person in the world' ...,United States,English,7/30/2016 20:16,7/30/2016 20:16,3395,2013,2159,RETWEET,Hashtager,0,1,HashtagGamer
3,1674084000.0,GAB1ALDANA,#MyAchillesHeel slippery floors https://t.co/R...,United States,Norwegian,7/30/2016 20:16,7/30/2016 20:16,3395,2013,2160,RETWEET,Hashtager,0,1,HashtagGamer
4,1674084000.0,GAB1ALDANA,#MyAchillesHeel Boring narcissists.....nothing...,United States,English,7/30/2016 20:16,7/30/2016 20:16,3395,2013,2158,RETWEET,Hashtager,0,1,HashtagGamer


In [6]:
df.dtypes

external_author_id    float64
author                 object
content                object
region                 object
language               object
publish_date           object
harvested_date         object
following               int64
followers               int64
updates                 int64
post_type              object
account_type           object
new_june_2018           int64
retweet                 int64
account_category       object
dtype: object

In [7]:
df.describe()

Unnamed: 0,external_author_id,following,followers,updates,new_june_2018,retweet
count,388452.0,388452.0,388452.0,388452.0,388452.0,388452.0
mean,7.200485e+16,2486.855076,6820.914885,7590.11605,0.137391,0.602679
std,2.330445e+17,3804.948594,15269.406368,10951.128592,0.344261,0.489344
min,410005600.0,0.0,0.0,1.0,0.0,0.0
25%,1670448000.0,476.0,474.0,1550.0,0.0,0.0
50%,2439943000.0,1225.0,1043.0,3259.0,0.0,1.0
75%,2787256000.0,2545.0,2272.0,7493.0,0.0,1.0
max,9.81251e+17,34338.0,110155.0,52860.0,1.0,1.0


## Text preprocessing and data cleaning

At first I decided not to clean the tweet text, because punctuation and capitalization might carry meaning. That may not be a good idea, we'll see. If it turns out badly, I can always come back to this point and add some cleaning before rerunning the rest of the analysis.

Bad idea.

Here I am again. :)

In [8]:
def preprocess(text, add_punct=None, add_stopwords=None):
    """Lowercase, tokenize, filter and lemmatize a single tweet."""
    # remove capitalization
    text = text.lower()
    # tokenize
    tokens = word_tokenize(text)
    # remove punctuation
    punct = string.punctuation
    # add further punctuation chars that are common in the texts, if provided
    if add_punct:
        punct += add_punct
    # note: `punct` is a string, so this is a substring test; multi-character
    # tokens such as "..." or "'s" are kept unless they occur inside `punct`
    no_punct = [t for t in tokens if t not in punct]
    # remove stopwords (a set keeps the membership test fast)
    stpw = set(stopwords.words('english'))
    if add_stopwords:
        stpw.update(add_stopwords)
    no_stpw = [t for t in no_punct if t not in stpw]
    # lemmatize remaining tokens
    lem = WordNetLemmatizer()
    lem_tokens = [lem.lemmatize(t) for t in no_stpw]
    return lem_tokens

In [9]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [10]:
tweet_corpus = [preprocess(tweet, add_punct="’“”–''''``—", add_stopwords=["http", "https"])
                for tweet in df.content
                if isinstance(tweet, str)]

In [11]:
tweet_corpus[2:3]

[["'i",
  "'m",
  'boring',
  'person',
  'world',
  'starterpack',
  'pokemongo',
  '//t.co/u8woa1s3j7']]
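
The sample above still contains a URL fragment (`//t.co/u8woa1s3j7`), because the tokenizer splits the `https` scheme from the rest of the link and only the scheme is filtered as a stopword. One possible fix, sketched here under the assumption that links carry no topical information, is to strip URLs before tokenizing. The helper `strip_urls` is hypothetical and not part of the notebook.

```python
import re

# Matches http:// and https:// links up to the next whitespace character.
URL_RE = re.compile(r"https?://\S+")

def strip_urls(text):
    """Remove URLs and collapse the whitespace they leave behind."""
    without_urls = URL_RE.sub(" ", text)
    return " ".join(without_urls.split())

print(strip_urls("#MyAchillesHeel slippery floors https://t.co/R"))
```

Calling `strip_urls(text)` as the first step of `preprocess` would also make the `http` and `https` stopwords unnecessary.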

## Topic modeling with `gensim`

In [12]:
dictionary = Dictionary(tweet_corpus)
gen_corpus = [dictionary.doc2bow(tweet) for tweet in tweet_corpus]
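
For readers unfamiliar with the bag-of-words format: each tweet in `gen_corpus` becomes a list of `(token_id, count)` pairs. The pure-Python sketch below illustrates the idea only. It is not how `gensim` implements `Dictionary` or `doc2bow`, the ids it assigns will differ from gensim's, and the names `build_vocab` and `to_bow` are made up for this example.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Turn a list of tokens into sorted (token_id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

docs = [["trump", "news", "trump"], ["news", "police"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]
print(bows)
```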

In [13]:
warnings.filterwarnings("ignore")
ldamodel = LdaModel(corpus=gen_corpus, num_topics=10, id2word=dictionary)

In [14]:
ldamodel.show_topics()

[(0,
  '0.080*"news" + 0.013*"police" + 0.012*"man" + 0.011*"u" + 0.009*"trump" + 0.008*"\'s" + 0.007*"obama" + 0.007*"state" + 0.007*"new" + 0.007*"politics"'),
 (1,
  '0.016*"pjnet" + 0.016*"trump" + 0.015*"hillary" + 0.014*"clinton" + 0.011*"\'s" + 0.010*"mt" + 0.009*"death" + 0.007*"maga" + 0.007*"medium" + 0.007*"amp"'),
 (2,
  '0.012*"vote" + 0.008*"health" + 0.007*"party" + 0.007*"�" + 0.006*"lie" + 0.006*"truth" + 0.006*"left" + 0.006*"question" + 0.006*"person" + 0.005*"conservative"'),
 (3,
  '0.025*"\'s" + 0.022*"trump" + 0.017*"n\'t" + 0.015*"..." + 0.009*"woman" + 0.009*"rt" + 0.008*"politics" + 0.007*"people" + 0.007*"say" + 0.006*"gun"'),
 (4,
  '0.012*"dead" + 0.011*"help" + 0.010*"show" + 0.010*"campaign" + 0.010*"4" + 0.010*"3" + 0.008*"election" + 0.008*"video" + 0.007*"ka" + 0.007*"boy"'),
 (5,
  '0.011*"law" + 0.010*"suspect" + 0.009*"arrested" + 0.007*"kc" + 0.006*"new" + 0.006*"charge" + 0.006*"sander" + 0.005*"post" + 0.005*"city" + 0.005*"issue"'),
 (6,
  '0.02
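
Several of the topics shown above are dominated by tokens that carry little meaning, such as `'s`, `n't`, `...`, `rt` and `amp`. A rough way to find stopword candidates is to count the most frequent tokens in the corpus. The sketch below uses only the standard library; `top_tokens` is a hypothetical helper, and the toy corpus stands in for `tweet_corpus`.

```python
from collections import Counter

def top_tokens(corpus, n=20):
    """Return the n most frequent tokens across all tokenized documents."""
    counts = Counter()
    for doc in corpus:
        counts.update(doc)
    return counts.most_common(n)

toy_corpus = [["rt", "trump", "news"], ["rt", "news"], ["rt", "police"]]
print(top_tokens(toy_corpus, n=2))  # [('rt', 3), ('news', 2)]
```

Frequent tokens that look like noise can then be passed to `preprocess` through `add_stopwords`.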

## Visualize the resulting topics

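Here we are using `pyLDAvis`. The `gensim` name called below is the `pyLDAvis.gensim` submodule, which has to be imported explicitly (see the imports at the top). Despite its name, it is not the `gensim` package itself. Newer pyLDAvis releases rename this submodule to `pyLDAvis.gensim_models`.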

In [15]:
gensim.prepare(ldamodel, gen_corpus, dictionary)

## Ideas / Todos

* run this analysis on the full corpus (one file takes approximately 8-9 minutes to process)
* cherry-pick some words to add to the stopword list
* adapt the tokenizer to avoid splitting `#` symbols from Twitter hashtags
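
For the last item, a minimal regex-based sketch of a tokenizer that keeps hashtags, @mentions and URLs as single tokens could look like this. The pattern and the name `tweet_tokenize` are assumptions for illustration; they have not been tried on the full corpus. NLTK also ships `nltk.tokenize.TweetTokenizer`, which may be a better fit than a hand-written pattern.

```python
import re

# Alternatives are tried left to right, so URLs and hashtags win over plain words.
TOKEN_RE = re.compile(r"""
    https?://\S+      # URLs kept whole
  | [#@]\w+           # hashtags and @mentions kept whole
  | \w+(?:'\w+)?      # words, optionally with an apostrophe part (e.g. i'm)
  | [^\w\s]           # any other single non-space symbol
""", re.VERBOSE)

def tweet_tokenize(text):
    """Lowercase a tweet and split it into tokens without breaking hashtags."""
    return TOKEN_RE.findall(text.lower())

print(tweet_tokenize("#MyAchillesHeel slippery floors https://t.co/R"))
```

Swapping this in for `word_tokenize` inside `preprocess` would keep `#myachillesheel` as one token instead of a separate `#` and word.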