# Topic Modeling on English-language 'Russian Troll Tweets'

Does topic modeling using LDA make sense on short-form data such as tweets?
[Maybe yes](https://www.researchgate.net/post/What_is_a_good_way_to_perform_topic_modeling_on_short_text). Let's try.

In [21]:
import re
import os
import json
import string
import warnings
import pandas as pd
from nltk.tokenize.casual import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from gensim.corpora import Dictionary
from gensim.models import LdaModel
import pyLDAvis
from pyLDAvis import gensim

In [22]:
# displaying the vis right in our notebook
pyLDAvis.enable_notebook()

## Import what we'll work with

In [23]:
data_dir = 'data'
files = os.listdir(data_dir)

In [24]:
# change to loop over all of them
df = pd.read_csv(f"{data_dir}/{files[0]}")

## Inspect the data quickly

In [25]:
df.head()

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,post_type,account_type,new_june_2018,retweet,account_category
0,1674084000.0,GAB1ALDANA,People are too toxic. I think I have people po...,United States,English,7/30/2016 20:15,7/30/2016 20:15,3395,2014,2150,RETWEET,Hashtager,0,1,HashtagGamer
1,1674084000.0,GAB1ALDANA,#NowPlaying Don't Shoot (I'm a Man) by @DEVO -...,United States,English,7/30/2016 20:15,7/30/2016 20:15,3395,2014,2146,RETWEET,Hashtager,0,1,HashtagGamer
2,1674084000.0,GAB1ALDANA,the 'I'm the most boring person in the world' ...,United States,English,7/30/2016 20:16,7/30/2016 20:16,3395,2013,2159,RETWEET,Hashtager,0,1,HashtagGamer
3,1674084000.0,GAB1ALDANA,#MyAchillesHeel slippery floors https://t.co/R...,United States,Norwegian,7/30/2016 20:16,7/30/2016 20:16,3395,2013,2160,RETWEET,Hashtager,0,1,HashtagGamer
4,1674084000.0,GAB1ALDANA,#MyAchillesHeel Boring narcissists.....nothing...,United States,English,7/30/2016 20:16,7/30/2016 20:16,3395,2013,2158,RETWEET,Hashtager,0,1,HashtagGamer


In [26]:
len(df)

388452

## Filter for English language tweets

We want to look at "Russians covertly posing as English-speakers", therefore we'll only look at tweets that have `English` as a `language`.

Another option would be to search for `region` and select only those coming from `United States`, but IMO the previous filter would be more interesting.

In [27]:
en_df = df[df.language == 'English']

In [28]:
en_df.head()

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,post_type,account_type,new_june_2018,retweet,account_category
0,1674084000.0,GAB1ALDANA,People are too toxic. I think I have people po...,United States,English,7/30/2016 20:15,7/30/2016 20:15,3395,2014,2150,RETWEET,Hashtager,0,1,HashtagGamer
1,1674084000.0,GAB1ALDANA,#NowPlaying Don't Shoot (I'm a Man) by @DEVO -...,United States,English,7/30/2016 20:15,7/30/2016 20:15,3395,2014,2146,RETWEET,Hashtager,0,1,HashtagGamer
2,1674084000.0,GAB1ALDANA,the 'I'm the most boring person in the world' ...,United States,English,7/30/2016 20:16,7/30/2016 20:16,3395,2013,2159,RETWEET,Hashtager,0,1,HashtagGamer
4,1674084000.0,GAB1ALDANA,#MyAchillesHeel Boring narcissists.....nothing...,United States,English,7/30/2016 20:16,7/30/2016 20:16,3395,2013,2158,RETWEET,Hashtager,0,1,HashtagGamer
5,1674084000.0,GAB1ALDANA,Your opinion on Hillary really matters to a no...,United States,English,7/30/2016 20:16,7/30/2016 20:16,3395,2014,2154,RETWEET,Hashtager,0,1,HashtagGamer


In [29]:
len(en_df)

300510

## Inspect the non-English tweets

For good measure, we'll also take a look at what we are discarding.

In [30]:
other_df = df[df.language != 'English']

In [31]:
len(other_df)

87942

In [32]:
other_df.count()

external_author_id    87942
author                87942
content               87942
region                87335
language              87942
publish_date          87942
harvested_date        87942
following             87942
followers             87942
updates               87942
post_type             46797
account_type          87933
new_june_2018         87942
retweet               87942
account_category      87942
dtype: int64

## Text preprocessing and data cleaning

Lesson learned from yesterday, let's only pull out what we really don't want to happen inside the function!

In [33]:
# instantiating our multi-use tokenizer
tknzr = TweetTokenizer()

# creating the punctuation list we want to exclude
punct = string.punctuation
# adding additional common punctuation chars of the texts
add_punct = ""
punct += add_punct
    
# our extended stopwords list
stpw = stopwords.words('english')
add_stopwords = ['http', 'https']
stpw.extend(add_stopwords)

Finally it's time to introduce RegExp to boost our matching :)

In [34]:
# matches two or more alpha characters
# thus it should exclude things such as 's or numbers, or any single-letters floating around
# however, there will still be a match if we run it on e.g. #usa
# play here: https://regexr.com/
regexp = re.compile(r'[a-z]{2,}')

In [42]:
def preprocess(tweet, tokenizer, regexp, punct, stpw):
    # remove capitalization
    tweet = tweet.lower()
    # tokenize
    tokens = tokenizer.tokenize(tweet)
    # remove punctuation ('t' stands for 'token' - we're looping over all tokens)
    no_punct = (t for t in tokens if not t in punct)
    # remove stopwords
    no_stpw = (t for t in no_punct if not t in stpw)
    # remove other strange character-letter-punctuation combinations
    # NOTE: this will filter out things such as emojis and text-based emoticons
    # TODO: adapt the RegExp above to keep matching those!
    no_weirds = (t for t in no_stpw if re.search(regexp, t))
    # lemmatize remaining tokens
    lem = WordNetLemmatizer()
    lem_tokens = [lem.lemmatize(t) for t in no_weirds]
    return lem_tokens

In [43]:
tweet_corpus = [preprocess(tweet, tknzr, regexp, punct, stpw) 
                for tweet in iter(df.content) 
                if type(tweet) == str]

In [44]:
tweet_corpus[2:3]

[["i'm",
  'boring',
  'person',
  'world',
  'starterpack',
  '#pokemongo',
  'https://t.co/u8woa1s3j7']]

## Topic modeling with `gensim`

In [45]:
dictionary = Dictionary(tweet_corpus)
gen_corpus = [dictionary.doc2bow(tweet) for tweet in tweet_corpus]

In [46]:
warnings.filterwarnings("ignore")
ldamodel = LdaModel(corpus=gen_corpus, num_topics=10, id2word=dictionary)

In [47]:
ldamodel.show_topics()

[(0,
  '0.026*"в" + 0.014*"и" + 0.012*"на" + 0.009*"не" + 0.008*"с" + 0.007*"#isis" + 0.005*"что" + 0.005*"read" + 0.005*"vehicle" + 0.004*"а"'),
 (1,
  '0.014*"case" + 0.010*"judge" + 0.009*"missing" + 0.009*"—" + 0.008*"refugee" + 0.007*"»" + 0.007*"«" + 0.007*"6" + 0.006*"12" + 0.005*"manhattan"'),
 (2,
  '0.036*"kansa" + 0.031*"#news" + 0.009*"wichita" + 0.009*"“" + 0.009*"”" + 0.008*"man" + 0.008*"2" + 0.008*"know" + 0.007*"officer" + 0.007*"4"'),
 (3,
  '0.020*"#pjnet" + 0.017*"mt" + 0.009*"change" + 0.008*"girl" + 0.008*"live" + 0.007*"leader" + 0.007*"join" + 0.006*"..." + 0.005*"accused" + 0.005*"speech"'),
 (4,
  '0.050*"#news" + 0.013*"#tcot" + 0.012*"police" + 0.011*"school" + 0.010*"rt" + 0.010*"…" + 0.007*"county" + 0.007*"court" + 0.006*"#pjnet" + 0.006*"city"'),
 (5,
  '0.085*"�" + 0.009*"️" + 0.007*"storm" + 0.007*"cruz" + 0.007*"weekend" + 0.004*"die" + 0.004*"investigate" + 0.004*"truck" + 0.004*"de" + 0.004*"r"'),
 (6,
  '0.013*"state" + 0.011*"found" + 0.009*"week"

## Visualize the resulting topics

Here we are using `pyLDAvis` - the `import` needs to be explicit to work, so it might appear that this is part of the `gensim` package, but instead this is from `pyLDAvis`.

In [48]:
gensim.prepare(ldamodel, gen_corpus, dictionary)

## Ideas / Todos

* run this analysis on the full corpus (one file takes approximately 8-9 minutes to process)
* cherry-pick some words to include into the stopword list
* adapt the tokenizer to avoid splitting `#` symbols from twitter hashtags