# Topic Modeling on English-language 'Russian Troll Tweets'

Does topic modeling using LDA make sense on short-form data such as tweets?
[Maybe yes](https://www.researchgate.net/post/What_is_a_good_way_to_perform_topic_modeling_on_short_text). Let's try.

In [13]:
import os
# suppress the numerous deprecation warnings
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

# text preprocessing
import re
import string
import pandas as pd
from nltk.tokenize.casual import TweetTokenizer
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# topic modeling
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Phrases, phrases
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# visualization
import pyLDAvis
from pyLDAvis import gensim
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
# displaying the vis right in our notebook
pyLDAvis.enable_notebook()

## Import what we'll work with

In [4]:
data_dir = 'data'
files = os.listdir(data_dir)

In [5]:
# change to loop over all of them
%time df = pd.read_csv(f"{data_dir}/{files[0]}")

CPU times: user 2.18 s, sys: 371 ms, total: 2.55 s
Wall time: 2.84 s


## Inspect the data quickly

In [6]:
df.head()

Unnamed: 0,external_author_id,author,content,region,language,publish_date,harvested_date,following,followers,updates,post_type,account_type,new_june_2018,retweet,account_category
0,1674084000.0,GAB1ALDANA,People are too toxic. I think I have people po...,United States,English,7/30/2016 20:15,7/30/2016 20:15,3395,2014,2150,RETWEET,Hashtager,0,1,HashtagGamer
1,1674084000.0,GAB1ALDANA,#NowPlaying Don't Shoot (I'm a Man) by @DEVO -...,United States,English,7/30/2016 20:15,7/30/2016 20:15,3395,2014,2146,RETWEET,Hashtager,0,1,HashtagGamer
2,1674084000.0,GAB1ALDANA,the 'I'm the most boring person in the world' ...,United States,English,7/30/2016 20:16,7/30/2016 20:16,3395,2013,2159,RETWEET,Hashtager,0,1,HashtagGamer
3,1674084000.0,GAB1ALDANA,#MyAchillesHeel slippery floors https://t.co/R...,United States,Norwegian,7/30/2016 20:16,7/30/2016 20:16,3395,2013,2160,RETWEET,Hashtager,0,1,HashtagGamer
4,1674084000.0,GAB1ALDANA,#MyAchillesHeel Boring narcissists.....nothing...,United States,English,7/30/2016 20:16,7/30/2016 20:16,3395,2013,2158,RETWEET,Hashtager,0,1,HashtagGamer


## Filter for English language tweets

We want to look at "Russians covertly posing as English-speakers", therefore we'll only look at tweets that have `English` as a `language`.

Another option would be to search for `region` and select only those coming from `United States`, but IMO the previous filter would be more interesting.

In [7]:
en_df = df[df.language == 'English']

## Text preprocessing and data cleaning

Lesson learned from yesterday, let's only pull out what we really don't want to happen inside the function!

In [8]:
# instantiating our multi-use tokenizer
tknzr = TweetTokenizer()

# creating the punctuation list we want to exclude
punct = string.punctuation
# adding additional common punctuation chars of the texts
add_punct = ""
punct += add_punct
    
# our extended stopwords list
stpw = stopwords.words('english')
add_stopwords = ['http', 'https']
stpw.extend(add_stopwords)

Finally it's time to introduce RegExp to boost our matching :)

In [9]:
# matches two or more alpha characters
# thus it should exclude things such as 's or numbers, or any single-letters floating around
# however, there will still be a match if we run it on e.g. #usa
# play here: https://regexr.com/
regexp = re.compile(r'[a-z]{2,}')

In [10]:
def preprocess(tweet, tokenizer, regexp, punct, stpw):
    # remove capitalization
    tweet = tweet.lower()
    # tokenize
    tokens = tokenizer.tokenize(tweet)
    # remove punctuation ('t' stands for 'token' - we're looping over all tokens)
    no_punct = (t for t in tokens if not t in punct)
    # remove stopwords
    no_stpw = (t for t in no_punct if not t in stpw)
    # remove other strange character-letter-punctuation combinations
    # NOTE: this will filter out things such as emojis and text-based emoticons
    # TODO: adapt the RegExp above to keep matching those!
    no_weirds = (t for t in no_stpw if re.search(regexp, t))
    # lemmatize remaining tokens
    lem = WordNetLemmatizer()
    lem_tokens = [lem.lemmatize(t) for t in no_weirds]
    return lem_tokens

In [11]:
%time tweet_corpus = [preprocess(tweet, tknzr, regexp, punct, stpw) for tweet in iter(df.content) if type(tweet) == str]

CPU times: user 1min 23s, sys: 1.33 s, total: 1min 24s
Wall time: 1min 28s


# Creating Bigrams and Trigrams

Should work with a collection of texts.

In [25]:
# Build the bigram and trigram models
bigram = Phrases(tweet_corpus, min_count=10, threshold=200) # higher threshold fewer phrases.
trigram = Phrases(bigram[tweet_corpus], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = phrases.Phraser(bigram)
trigram_mod = phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[tweet_corpus[0]]])



['people', 'toxic', 'think', 'people', 'poisoning']


In [26]:
def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

In [27]:
# Form Bigrams
%time data_words_bigrams = make_bigrams(tweet_corpus)

CPU times: user 31.3 s, sys: 10.8 s, total: 42.1 s
Wall time: 1min 46s


In [28]:
print(data_words_bigrams[:20])

[['people', 'toxic', 'think', 'people', 'poisoning'], ['#nowplaying', 'shoot', 'man', '@devo', 'https://t.co/9ildvpexkb', 'buy', 'https://t.co/6gkovvcur0'], ['boring', 'person', 'world', 'starterpack', '#pokemongo', 'https://t.co/u8woa1s3j7'], ['#myachillesheel', 'slippery', 'floor', 'https://t.co/r8nqnxnx4l'], ['#myachillesheel', 'boring', 'narcissist', 'nothing', 'wrong', 'narcissism', 'boring', 'dare'], ['opinion', 'hillary', 'really', 'matter', 'non-american', 'https://t.co/tze6denkr0'], ['#myachillesheel', 'lilith', 'frasier'], ['come', 'find', 'u', 'national', 'mall', '#dc', '#pokewalk', 'treat', 'thanks', '@kindsnacks', '@drinkbai', '#pokemon', 'https://t.co/teofzdwm66'], ['#myachillesheel', 'trolling', 'celebrity', 'blocked', 'wil', 'wheathead', 'explain', '#nra', 'caused', "earth's", 'evil'], ['#myachillesheel', 'morbid', 'comedy'], ['#myachillesheel', 'yo', 'momma', 'beyonce', 'costume'], ['#myachillesheel', 'woman', 'heel', 'https://t.co/ycggu01btv'], ['sometimes', 'seeing',

In [29]:
bigrams = []

for word_li in data_words_bigrams:
    for w in word_li:
        if "_" in w:
            bigrams.append(w)

In [30]:
# bigrams[:30]  # old_fashioned

In [31]:
len(set(bigrams))

7669

## Topic modeling with `gensim`

In [35]:
%time dictionary = Dictionary(tweet_corpus)

CPU times: user 9.45 s, sys: 628 ms, total: 10.1 s
Wall time: 10.5 s


In [36]:
%time gen_corpus = [dictionary.doc2bow(tweet) for tweet in tweet_corpus]

CPU times: user 7.91 s, sys: 2.27 s, total: 10.2 s
Wall time: 13.2 s


After timing the different cell execution times with `%time`, it becomes clear that the main time-eater here is building the LDA model. We can't really speed that up without having a good understanding of the internal workings of `gensim`.

In [37]:
warnings.filterwarnings("ignore")
%time ldamodel = LdaModel(corpus=gen_corpus, num_topics=10, id2word=dictionary)

CPU times: user 4min 55s, sys: 15.9 s, total: 5min 10s
Wall time: 5min 25s


In [38]:
ldamodel.show_topics()

[(0,
  '0.015*"#pjnet" + 0.012*"#tcot" + 0.009*"obama" + 0.008*"u" + 0.007*"like" + 0.006*"medium" + 0.006*"help" + 0.006*"people" + 0.006*"say" + 0.005*"report"'),
 (1,
  '0.008*"rt" + 0.008*"killing" + 0.008*"missing" + 0.008*"cruz" + 0.007*"texas" + 0.007*"terror" + 0.006*"return" + 0.006*"die" + 0.006*"driver" + 0.006*"district"'),
 (2,
  '0.042*"#news" + 0.034*"police" + 0.018*"man" + 0.012*"shooting" + 0.009*"school" + 0.009*"officer" + 0.007*"suspect" + 0.007*"shot" + 0.007*"arrested" + 0.007*"topeka"'),
 (3,
  '0.007*"supreme" + 0.007*"sentenced" + 0.007*"congress" + 0.006*"mother" + 0.005*"air" + 0.005*"train" + 0.005*"#job" + 0.005*"oklahoma" + 0.005*"protect" + 0.004*"east"'),
 (4,
  '0.054*"kansa" + 0.035*"#news" + 0.015*"american" + 0.010*"county" + 0.010*"city" + 0.010*"death" + 0.008*"new" + 0.008*"tax" + 0.006*"charged" + 0.006*"missouri"'),
 (5,
  '0.019*"gun" + 0.008*"#isis" + 0.006*"control" + 0.006*"targeted" + 0.006*"freedom" + 0.006*"islam" + 0.006*"#opiceisis" + 

## Visualize the resulting topics

Here we are using `pyLDAvis` - the `import` needs to be explicit to work, so it might appear that this is part of the `gensim` package, but instead this is from `pyLDAvis`.

In [39]:
%time gensim.prepare(ldamodel, gen_corpus, dictionary)

CPU times: user 1min 16s, sys: 5.57 s, total: 1min 21s
Wall time: 2min 37s


## Ideas / Todos

* run this analysis on the full corpus (one file takes approximately 8-9 minutes to process)
* cherry-pick some words to include into the stopword list
* ~~adapt the tokenizer to avoid splitting `#` symbols from twitter hashtags~~
* reduce run-time

## Notes

`u` is probably still in the results because we are running the **lemmatization** (`WordNetLemmatizer`) after filtering - therefore something such as `ur` (which would pass our RegExp criteria, because it's 2 characters) might get stemmed into `u` and remain in the results.