For this project, I have downloaded the tweets from the @realDonaldTrump Twitter account. Clearly it's an influential account, and those 280 characters (formerly 140) have proven to wield far too much power. The aim here is to run an NLP analysis on the tweets to discover their main themes. We will use an unsupervised approach called LDA (Latent Dirichlet Allocation, http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf) to perform topic modelling, using its implementation in gensim.

In [3]:
# imports
import pandas as pd
import spacy
from tqdm import tqdm
from gensim.models.phrases import Phrases , Phraser
from gensim.corpora import Dictionary
from gensim.models.ldamodel import LdaModel
import pyLDAvis
import pyLDAvis.gensim
import seaborn as sns
import nltk
import string
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
punctuations = string.punctuation
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
pd.set_option('display.max_colwidth', -1) #to show full tweets in the cells.

I used the Twitter API to download the tweets from user @realDonaldTrump. tweet_download.py contains the script, which can download tweets from any user. Twitter only allows about the 3,200 most recent tweets of an account to be downloaded. However, I had downloaded another batch some time ago, and I merged the two to end up with about 5,400 tweets. See Twitter_compiler_purge_scaled.py for how I went about this, including how I accounted for the Twitter purge a few months ago, when millions of bot accounts were deleted and the number of likes/retweets on Trump's tweets was affected.
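The merge step can be sketched roughly as follows. This is not the code from Twitter_compiler_purge_scaled.py: it is a minimal sketch that assumes both batches have `id` and `created_at` columns, and it ignores the rescaling of likes/retweets for the purge. The function name `merge_batches` and the toy frames are made up for illustration.

```python
import pandas as pd

def merge_batches(new_batch, old_batch):
    """Stack two downloads, keep one row per tweet id, newest first."""
    merged = pd.concat([new_batch, old_batch], ignore_index=True)
    # tweets present in both downloads share an id; keep the copy from new_batch
    merged = merged.drop_duplicates(subset='id', keep='first')
    return merged.sort_values('created_at', ascending=False).reset_index(drop=True)

# toy data standing in for the two downloads (hypothetical values)
new_batch = pd.DataFrame({
    'id': [3, 2],
    'created_at': pd.to_datetime(['2018-10-30', '2018-10-29']),
    'text': ['c', 'b'],
})
old_batch = pd.DataFrame({
    'id': [2, 1],
    'created_at': pd.to_datetime(['2018-10-29', '2018-10-28']),
    'text': ['b', 'a'],
})
merged = merge_batches(new_batch, old_batch)
print(merged['id'].tolist())
```

Keeping the copy from the newer download means that, for overlapping tweets, the more recent like and retweet counts are the ones that survive.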

In [4]:
#load tweets
tweets_df = pd.read_csv('trump_tweets_6000.csv', parse_dates = ['created_at']) #parse the timestamp column as datetimes
tweets_df.head()

Unnamed: 0,id,created_at,text,retweets,favorites
0,1057254051254013953,2018-10-30 12:53:03,"“If the Fed backs off and starts talking a little more Dovish, I think we’re going to be right back to our 2,800 to 2,900 target range that we’ve had for the S&amp;P 500.” Scott Wren, Wells Fargo.",3854.0,13636.0
1,1057249169507803137,2018-10-30 12:33:39,"The Stock Market is up massively since the Election, but is now taking a little pause - people want to see what happens with the Midterms. If you want your Stocks to go down, I strongly suggest voting Democrat. They like the Venezuela financial model, High Taxes &amp; Open Borders!",7569.0,25503.0
2,1057247021919297536,2018-10-30 12:25:07,"Congressman Kevin Brady of Texas is so popular in his District, and far beyond, that he doesn’t need any help - but I am giving it to him anyway. He is a great guy and the absolute “King” of Cutting Taxes. Highly respected by all, he loves his State &amp; Country. Strong Endorsement!",4214.0,16375.0
3,1057243826899877889,2018-10-30 12:12:25,"Congressman Andy Barr of Kentucky, who just had a great debate with his Nancy Pelosi run opponent, has been a winner for his State. Strong on Crime, the Border, Tax Cuts, Military, Vets and 2nd Amendment, we need Andy in D.C. He has my Strong Endorsement!",4532.0,17532.0
4,1057110242541080577,2018-10-30 03:21:36,".@Erik_Paulsen, @Jason2CD, \r\n@JimHagedornMN and @PeteStauber love our Country and the Great State of Minnesota. They are winners and always get the job done. We need them all in Congress for #MAGA. Border, Military, Vets, 2nd A. Go Vote Minnesota. They have my Strong Endorsement!",6598.0,24168.0


Right away, I notice that some of the tweets contain elements like '&amp;' and '\r\n'. We can clean this up quite simply...

In [25]:
import html
def tweet_cleaner(tweet):
    tweet = html.unescape(tweet)
    clean_tweet = tweet.replace('\r','').replace('\n','').replace('...','')
    return clean_tweet

tweets_df['text'] = tweets_df['text'].apply(tweet_cleaner)
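To see what the cleaner does, here is a small standalone check on a made-up string (not a real tweet). `html.unescape` from the standard library turns entities such as `&amp;` back into the characters they stand for, and the `replace` calls drop carriage returns, newlines and literal '...' sequences.

```python
import html

def tweet_cleaner(tweet):
    # same logic as the cell above, repeated so this snippet runs on its own
    tweet = html.unescape(tweet)
    return tweet.replace('\r', '').replace('\n', '').replace('...', '')

sample = 'High Taxes &amp; Open Borders...\r\n'
cleaned = tweet_cleaner(sample)
print(cleaned)  # High Taxes & Open Borders
```

Note that newlines are removed rather than replaced by a space, so two words separated only by a line break would end up joined.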

Now that we have 'clean' tweets, I extract some simple features from them, such as the number of characters and words. While doing this I realised that in November 2017 Twitter raised the limit from 140 to 280 characters, so these features need to be scaled to be comparable. I therefore created a feature called 'text_len_%', which is the number of characters divided by the length of the longest tweet in the same period (before or after the change), used here as a stand-in for the maximum allowed at the time. I also created a flag for whether the tweet contains a link.

In [26]:
#create new features: word length, char length and % of max length. This takes into account the 280 character extension that Twitter implemented in Nov 2017
def word_length(text):
    return len(text.split())

def char_length(text):
    return len(text)

#create new column flagging whether the text contains a link
tweets_df['contains_html'] = [1 if 'http' in text else 0 for text in tweets_df['text']]

tweets_df['word_len_tweet'] = tweets_df['text'].apply(word_length)
tweets_df['char_len_tweet'] = tweets_df['text'].apply(char_length)

#split data into before extension to 280 chars and after (rows are newest first; 2804 is the hard-coded split point). Create feature % of max length
#.copy() makes independent frames, so adding a column does not trigger pandas' SettingWithCopyWarning
after_ext = tweets_df.iloc[:2804].copy()
before_ext = tweets_df.iloc[2804:].copy()

max_before_ext = max(before_ext['char_len_tweet'])
max_after_ext = max(after_ext['char_len_tweet'])

before_ext['text_len_%'] = before_ext['char_len_tweet']/max_before_ext
after_ext['text_len_%'] = after_ext['char_len_tweet']/max_after_ext
#pd.concat replaces DataFrame.append, which newer pandas versions no longer provide
new_tweets_df = pd.concat([after_ext, before_ext])
new_tweets_df.head()


Unnamed: 0,id,created_at,text,retweets,favorites,contains_html,word_len_tweet,char_len_tweet,text_len_%
0,1057254051254013953,2018-10-30 12:53:03,"“If the Fed backs off and starts talking a little more Dovish, I think we’re going to be right back to our 2,800 to 2,900 target range that we’ve had for the S&P 500.” Scott Wren, Wells Fargo.",3854.0,13636.0,0,38,192,0.617363
1,1057249169507803137,2018-10-30 12:33:39,"The Stock Market is up massively since the Election, but is now taking a little pause - people want to see what happens with the Midterms. If you want your Stocks to go down, I strongly suggest voting Democrat. They like the Venezuela financial model, High Taxes & Open Borders!",7569.0,25503.0,0,50,278,0.893891
2,1057247021919297536,2018-10-30 12:25:07,"Congressman Kevin Brady of Texas is so popular in his District, and far beyond, that he doesn’t need any help - but I am giving it to him anyway. He is a great guy and the absolute “King” of Cutting Taxes. Highly respected by all, he loves his State & Country. Strong Endorsement!",4214.0,16375.0,0,53,280,0.900322
3,1057243826899877889,2018-10-30 12:12:25,"Congressman Andy Barr of Kentucky, who just had a great debate with his Nancy Pelosi run opponent, has been a winner for his State. Strong on Crime, the Border, Tax Cuts, Military, Vets and 2nd Amendment, we need Andy in D.C. He has my Strong Endorsement!",4532.0,17532.0,0,46,255,0.819936
4,1057110242541080577,2018-10-30 03:21:36,".@Erik_Paulsen, @Jason2CD, @JimHagedornMN and @PeteStauber love our Country and the Great State of Minnesota. They are winners and always get the job done. We need them all in Congress for #MAGA. Border, Military, Vets, 2nd A. Go Vote Minnesota. They have my Strong Endorsement!",6598.0,24168.0,0,44,278,0.893891
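The scaling above divides by the longest tweet observed in each period. An alternative, sketched here under assumptions, is to divide by the official limit in force on the date of the tweet. The cutover date of 7 November 2017 and the names `CUTOVER`, `char_limit` and `pct_of_limit` are my assumptions for illustration; check the date before relying on it.

```python
from datetime import datetime

# assumed date of the switch from 140 to 280 characters
CUTOVER = datetime(2017, 11, 7)

def char_limit(created_at):
    """Return the character limit assumed to apply at the given timestamp."""
    return 280 if created_at >= CUTOVER else 140

def pct_of_limit(text, created_at):
    """Length of the text as a fraction of the limit assumed for its date."""
    return len(text) / char_limit(created_at)

# made-up texts: 140 characters after the cutover, 70 characters before it
new_pct = pct_of_limit('x' * 140, datetime(2018, 10, 30, 12, 53, 3))
old_pct = pct_of_limit('x' * 70, datetime(2017, 1, 15, 8, 0, 0))
print(new_pct, old_pct)  # 0.5 0.5
```

The output above implies a longest observed tweet of 311 characters (192 / 0.617363), so raw character counts can exceed the official limit and this variant can produce ratios above 1. Applying it to the DataFrame would also require `created_at` to be parsed as datetimes.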


Ok, here we begin with the actual topic modelling of the tweets. The first step in any NLP project is to pre-process the text: removing stop words and punctuation, converting everything to lower case, and so on. spaCy is a modern NLP package that lets us do all this quite easily. Here is a nice notebook to learn more about spaCy (https://github.com/skipgram/modern-nlp-in-python/blob/master/executable/Modern_NLP_in_Python.ipynb). Their official documentation is good as well!

In [27]:
#preprocess all tweets. Create corpus. Build LDA model. Visualize results
nlp = spacy.load('en_core_web_sm')
def clean_text(text):
    doc = nlp(text, disable=['parser', 'ner'])
    tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
    tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuations]
    tokens = ' '.join(tokens)
    return tokens

In [28]:
new_tweets_df['processed_text'] = new_tweets_df['text'].apply(clean_text)

Here is what the text looks like after it has been lower-cased, lemmatized, and stripped of punctuation and stop words. This greatly reduces the size of our vocabulary and helps ensure that mostly meaningful words remain.

In [29]:
new_tweets_df[['text', 'processed_text']].head()

Unnamed: 0,text,processed_text
0,"“If the Fed backs off and starts talking a little more Dovish, I think we’re going to be right back to our 2,800 to 2,900 target range that we’ve had for the S&P 500.” Scott Wren, Wells Fargo.","fed back start talk little dovish think go right back 2,800 2,900 target range s&p 500 scott wren wells fargo"
1,"The Stock Market is up massively since the Election, but is now taking a little pause - people want to see what happens with the Midterms. If you want your Stocks to go down, I strongly suggest voting Democrat. They like the Venezuela financial model, High Taxes & Open Borders!",stock market massively since election take little pause people want see happen midterms want stock go strongly suggest vote democrat like venezuela financial model high taxes open borders
2,"Congressman Kevin Brady of Texas is so popular in his District, and far beyond, that he doesn’t need any help - but I am giving it to him anyway. He is a great guy and the absolute “King” of Cutting Taxes. Highly respected by all, he loves his State & Country. Strong Endorsement!",congressman kevin brady texas popular district far beyond need help give anyway great guy absolute king cut taxes highly respect love state country strong endorsement
3,"Congressman Andy Barr of Kentucky, who just had a great debate with his Nancy Pelosi run opponent, has been a winner for his State. Strong on Crime, the Border, Tax Cuts, Military, Vets and 2nd Amendment, we need Andy in D.C. He has my Strong Endorsement!",congressman andy barr kentucky great debate nancy pelosi run opponent winner state strong crime border tax cuts military vets 2nd amendment need andy d.c. strong endorsement
4,".@Erik_Paulsen, @Jason2CD, @JimHagedornMN and @PeteStauber love our Country and the Great State of Minnesota. They are winners and always get the job done. We need them all in Congress for #MAGA. Border, Military, Vets, 2nd A. Go Vote Minnesota. They have my Strong Endorsement!",.@erik_paulsen @jason2cd @jimhagedornmn @petestauber love country great state minnesota winner always get job need congress maga border military vets 2nd a. go vote minnesota strong endorsement


Below are a few helper functions that tokenize our text and create bigrams, trigrams and qgrams (four-word phrases). This helps us pick up words that belong together. You want your model to understand 'Failing New York Times' as a single entity as opposed to the separate entities 'Failing', 'New', 'York' and 'Times' (also think: 'Make America Great Again').

In [30]:
def tokenizer_2(text_list):
    '''
    This is the faster tokenizer. Default to this. 
    '''
    sent_stream = []
    translator = str.maketrans('', '', string.punctuation)
    print('tokenizing text...')
    for content in tqdm(text_list):
        sent_tokenize_list = sent_tokenize(content)
        for sent in sent_tokenize_list:
            sent_no_punct = sent.translate(translator)
            sent_no_space = " ".join(sent_no_punct.split())
            word_tokens = word_tokenize(sent_no_space)
            sent_stream.append(word_tokens)
    return sent_stream 

def qgram_creator(sent_stream):
    '''
    accepts a list of tokenized sentences and returns them with frequently co-occurring words merged into single tokens. Comment out the model saving lines if you do not need the Phrases models on disk
    '''
    print('creating trigrams...')
    b_phrases = Phrases(sent_stream)
    b_phrases.save('bigram_model_all')
    bigram = Phraser(b_phrases)
    bigram_text = []
    for sent in tqdm(sent_stream):
        b_text = bigram[sent]
        bigram_text.append(b_text)

    t_phrases = Phrases(bigram_text)
    t_phrases.save('trigram_model_all')
    trigram = Phraser(t_phrases)
    trigram_text = []
    for sent in tqdm(bigram_text):
        t_text = trigram[sent]
        trigram_text.append(t_text)
        
    q_phrases = Phrases(trigram_text)
    q_phrases.save('qgram_model_all')
    qgram = Phraser(q_phrases)
    qgram_text = []
    for sent in tqdm(trigram_text):
        q_text = qgram[sent]
        qgram_text.append(q_text)
    return qgram_text

def text_cleaner(sent_tokens):
    '''
    This removes stopwords from the sentence tokens
    '''
    print('cleaning text...')
    trigram_clean = []
    for text in tqdm(sent_tokens):
        text = [term for term in text if term not in stopwords]
        trigram_clean.append(text)
    return trigram_clean
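For intuition about what Phrases does, the sketch below scores adjacent word pairs with the formula gensim documents for its default scorer: (count(a, b) - min_count) / (count(a) * count(b)) * vocabulary size. Pairs scoring above a threshold are merged into a_b. This is a simplified stand-in, not gensim's implementation: the function name `score_bigrams` is made up, the vocabulary size here counts only single words, and the toy corpus and `min_count=1` are chosen so the numbers are easy to follow (gensim's defaults are min_count=5 and threshold=10.0).

```python
from collections import Counter

def score_bigrams(sentences, min_count=1):
    """Score each adjacent word pair in a list of tokenized sentences."""
    unigrams = Counter(w for sent in sentences for w in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)  # simplification: single words only
    scores = {}
    for (a, b), n_ab in bigrams.items():
        scores[(a, b)] = (n_ab - min_count) / (unigrams[a] * unigrams[b]) * vocab_size
    return scores

# toy corpus (made up): 'stock market' co-occurs three times
toy = [
    ['stock', 'market', 'up'],
    ['stock', 'market', 'down'],
    ['great', 'stock', 'market'],
    ['market', 'great'],
]
scores = score_bigrams(toy)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))  # ('stock', 'market') 0.833
```

Running the phrase detection several times in a row, as qgram_creator does, lets already-merged tokens such as stock_market pair up again, which is how longer phrases form.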

In [31]:
tokenized_text = tokenizer_2(new_tweets_df['processed_text'])

tokenizing text...


100%|██████████| 5403/5403 [00:01<00:00, 4246.97it/s]


In [32]:
tokenized_text[0]

['fed',
 'back',
 'start',
 'talk',
 'little',
 'dovish',
 'think',
 'go',
 'right',
 'back',
 '2800',
 '2900',
 'target',
 'range',
 'sp',
 '500',
 'scott',
 'wren',
 'wells',
 'fargo']

In [33]:
qgrammed_text = qgram_creator(tokenized_text)

creating trigrams...


100%|██████████| 5606/5606 [00:00<00:00, 24437.65it/s]
100%|██████████| 5606/5606 [00:00<00:00, 26123.01it/s]
100%|██████████| 5606/5606 [00:00<00:00, 24849.29it/s]


In [34]:
#clean_trigrams = text_cleaner(trigrammed_text)

We see that stock_market and open_borders have been picked up as single entities by our model. That is good!

In [35]:
qgrammed_text[1]

['stock_market',
 'massively',
 'since_election',
 'take',
 'little',
 'pause',
 'people',
 'want',
 'see',
 'happen',
 'midterms',
 'want',
 'stock',
 'go',
 'strongly',
 'suggest',
 'vote',
 'democrat',
 'like',
 'venezuela',
 'financial',
 'model',
 'high',
 'taxes',
 'open_borders']

We will now create a corpus from all the tweets, feed it into the LDA model and visualise the different topics.

In [36]:
def corpus_creator(clean_text):
    trigram_dict = Dictionary(clean_text)
    print('creating corpus...')
    trigram_dict.filter_extremes(no_below = 10, no_above = 0.3)
    corpus = [trigram_dict.doc2bow(sent) for sent in clean_text]
    return trigram_dict, corpus
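To make the two steps in corpus_creator concrete, here is a plain-Python sketch of the same idea on toy data: drop tokens that appear in too few documents or in too large a share of them, then turn each document into (token_id, count) pairs. It only mirrors what filter_extremes and doc2bow do conceptually. The helper names and the thresholds below are illustrative, and gensim's own id assignment and its keep_n cap are not reproduced.

```python
from collections import Counter

def build_vocab(docs, no_below=2, no_above=0.75):
    """Keep tokens found in at least no_below docs and at most no_above of all docs."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    max_docs = no_above * len(docs)
    kept = sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)
    return {tok: i for i, tok in enumerate(kept)}

def to_bow(doc, vocab):
    """Return sorted (token_id, count) pairs for the tokens present in vocab."""
    counts = Counter(tok for tok in doc if tok in vocab)
    return sorted((vocab[tok], n) for tok, n in counts.items())

# toy documents (made up): 'great' is too common, 'cnn' and 'border' too rare
docs = [
    ['fake_news', 'cnn', 'fake_news', 'great'],
    ['tax_cuts', 'great', 'jobs'],
    ['fake_news', 'great', 'jobs'],
    ['border', 'great', 'tax_cuts'],
]
vocab = build_vocab(docs)
bow = [to_bow(doc, vocab) for doc in docs]
print(vocab)   # {'fake_news': 0, 'jobs': 1, 'tax_cuts': 2}
print(bow[0])  # [(0, 2)]
```

With no_below = 10 and no_above = 0.3 as in corpus_creator, a token has to appear in at least 10 of the tokenized sentences and in no more than 30% of them to survive.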

In [38]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
t_dict, corpus = corpus_creator(qgrammed_text)
print('building lda model')
lda = LdaModel(corpus, num_topics = 6, id2word = t_dict)
pyLDAvis.enable_notebook() 
print('creating lda visualization')
LDAvis_prepared = pyLDAvis.gensim.prepare(lda, corpus, t_dict)
pyLDAvis.display(LDAvis_prepared)

creating corpus...
building lda model
creating lda visualization




Ok, there's the LDA model visualized. We see some separation of topics. Topic 4 seems to be focused on Hillary Clinton's campaign and the FBI investigation into her email server, along with heavyweight campaign material: the wall, tax cuts, repeal and replace. Topic 2 seems to be about Fake News. It contains, predictably, CNN, NBC News, Facebook, biased etc.: Trump's now familiar laments. Topic 6 is somewhat confusing but has VP Pence, NFL, kneel etc., reflecting Trump's and Pence's opposition to kneeling during the national anthem at NFL games. Topic 5 has travel ban, Mexico and Ford bringing back jobs, so it is somewhat foreign-policy related. Note that LDA is randomly initialised, so topic numbers and contents will differ between runs unless random_state is fixed. You can explore the visualization above to see if you can spot some trends. Also play with the number of topics that you want the model to uncover; I noticed that between 4 and 8 gives somewhat sensible results.