
In this exercise we perform topic modeling on `realDonaldTrump_tweets`. Each line in this file is a tweet.

In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import re

**Task 1: Load the data**

Consider each tweet as a document. Load the tweets and strip away symbols and web links. If a tweet becomes an empty string after preprocessing, discard it from the analysis.


In [2]:
file_path = '/dsa/data/all_datasets/linguistic/realDonaldTrump_tweets.txt'

In [3]:
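# load each tweet as a document
with open(file_path, 'r') as f:
    tweets = f.read().splitlines()

# Replace every non-word character with a space. Because `https.*\b` is greedy,
# it also removes everything from a link through the last word on that line.
tweets = [re.sub(r'[^\w]|https.*\b', ' ', t) for t in tweets]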

In [4]:
# list comp to remove any string that is only spaces (empty)

tweets_clean = [tweet for tweet in tweets if tweet.strip()]
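The pattern used in the loading cell drops any text that follows a link, and it leaves `http://` links behind as loose tokens. A minimal sketch of an alternative is shown below, assuming the goal is to remove only the link itself. The helper name `clean_tweet` and the sample strings are hypothetical, and the results recorded in this notebook were not produced with it.

```python
import re

# Match a whole URL first, then any remaining character that is neither a word
# character nor whitespace. Putting the URL alternative first means the link is
# consumed as a unit before its punctuation is considered.
URL_OR_SYMBOL = re.compile(r'https?://\S+|[^\w\s]')

def clean_tweet(text):
    """Hypothetical helper: strip links and symbols, then collapse whitespace."""
    without_noise = URL_OR_SYMBOL.sub(' ', text)
    return ' '.join(without_noise.split())

sample = "Watch https://t.co/abc now!"
print(clean_tweet(sample))
```

Because the helper collapses whitespace, a tweet that contained only a link comes back as an empty string, so the same `if tweet.strip()` filter still applies afterwards.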

**Task 2: Create term frequency matrix for these tweets.**


In [5]:
# Load libraries
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation

In [6]:
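# max_df=100 (an integer) ignores terms that appear in more than 100 tweets
countVectorizer = CountVectorizer(stop_words='english', max_df=100)
termFrequency = countVectorizer.fit_transform(tweets_clean)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
featureNames = countVectorizer.get_feature_names_out()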
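The `max_df` argument behaves differently for integers and floats: an integer is an absolute number of documents, while a float is a proportion of the corpus. The sketch below illustrates this on a made-up three-document corpus (the documents are assumptions for illustration, not tweets from the dataset).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: "great" appears in all three documents.
docs = ["great rally", "great crowd", "great state"]

# Integer max_df: drop terms found in more than 2 documents.
by_count = CountVectorizer(max_df=2).fit(docs)

# Float max_df: drop terms found in more than 50% of the documents.
by_share = CountVectorizer(max_df=0.5).fit(docs)

vocab_by_count = sorted(by_count.vocabulary_)
vocab_by_share = sorted(by_share.vocabulary_)
print(vocab_by_count)
print(vocab_by_share)
```

In both cases "great" is filtered out because it occurs in every document, which is the same effect `max_df=100` has on the most common words in the tweets.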

**Task 3: Apply LDA topic modeling method with 5 topics**

Fit an LDA model with 5 topics on these tweets. 


In [7]:
lda = LatentDirichletAllocation(n_components = 5)
lda.fit(termFrequency)

LatentDirichletAllocation(n_components=5)
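The LDA model above is fitted without a `random_state`, so rerunning the cell can produce different topics, or the same topics in a different order. A small sketch on a made-up corpus (the documents and the helper `fit_lda` are assumptions for illustration) shows that fixing the seed makes the fit repeatable:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the tweets.
docs = [
    "rally tonight ohio",
    "rally tomorrow florida",
    "repeal obamacare healthcare",
    "healthcare repeal house",
]
counts = CountVectorizer().fit_transform(docs)

def fit_lda(seed):
    """Hypothetical helper: fit a 2-topic LDA model with a fixed seed."""
    model = LatentDirichletAllocation(n_components=2, random_state=seed)
    return model.fit(counts).components_

# Two fits with the same seed give the same topic-word matrix.
same_seed_matches = np.allclose(fit_lda(0), fit_lda(0))
print(same_seed_matches)
```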

**Task 4: Print the top 10 words for each of the topics**

In [8]:
for idx, topic in enumerate(lda.components_):
    print("Topic ", idx, " ".join(featureNames[i] for i in topic.argsort()[:-10 - 1:-1]))

Topic  0 makeamericagreatagain nytimes failing cnn way bad healthcare tickets working hard
Topic  1 ohio florida tomorrow state obama wonderful rally movement pennsylvania watch
Topic  2 states like enjoy united foxandfriends tonight interviewed don need bad
Topic  3 draintheswamp americafirst live crookedhillary night bigleaguetruth minister 000 prime world
Topic  4 honor north obamacare house korea watch women white repeal china
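The slice `[:-10 - 1:-1]` in the loop above can be hard to read. `argsort()` returns indices in ascending order of weight, and the slice walks that array backwards from the end, stopping after 10 items. The sketch below uses made-up weights and a top 3 instead of a top 10 to show that it matches the more explicit "reverse, then take the first n" form.

```python
import numpy as np

weights = np.array([0.2, 1.5, 0.7, 3.0, 0.1])  # made-up topic weights
n_top = 3

ascending = weights.argsort()            # indices from smallest to largest weight
top_slice = ascending[:-n_top - 1:-1]    # idiom used in the notebook
top_reversed = ascending[::-1][:n_top]   # equivalent, more explicit form

print(top_slice.tolist())
print(top_reversed.tolist())
```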


**Task 5: Name each of the topics (No right answer)**

After observing the top 10 words in each topic, do these topics make sense to you? Can you name each of the topics?

My take on the topics is:

1) Make America Great Again

2) Policies - Healthcare reform

3) Debates - Florida

4) Fox and Friends (news)

5) Polls

**Task 6: Create a TFIDF matrix**

Create TFIDF matrix for these tweets.

In [9]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(tweets_clean)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

**Task 7: Apply NMF topic modeling with 5 topics**

In [10]:
nmf = NMF(n_components=5, random_state=0)
nmf.fit(tfidf)

NMF(n_components=5, random_state=0)

**Task 8: Print the top 10 words for each of the topics**

In [14]:
for idx, topic in enumerate(nmf.components_):
    print("Topic ", idx, " ".join(tfidf_feature_names[i] for i in topic.argsort()[:-10 - 1:-1]))

Topic  0 great america make safe going honor day people today trumppence16
Topic  1 makeamericagreatagain imwithyou erictrump lets movement lesm 6days rt nfib poll
Topic  2 tickets join tomorrow 00pm available center 6pm tonight 7pm florida
Topic  3 thank maga join americafirst watch florida ohio tomorrow imwithyou new
Topic  4 amp draintheswamp hillary rt clinton trump people time realdonaldtrump media


**Task 9: Perform a comparison between the topics identified by LDA and NMF methods.**

In [13]:
# Using similar method from lab

# lda taking make america great

topic = lda.components_[0]  
no_top_words = 10

weights_lda = {}
for i in topic.argsort()[:-no_top_words - 1:-1]:
    print(featureNames[i], topic[i])
    weights_lda[featureNames[i]] = topic[i]
    
# nmf taking make america great

topic = nmf.components_[0]  
no_top_words = 10

weights_nmf = {}
for i in topic.argsort()[:-no_top_words - 1:-1]:
    weights_nmf[tfidf_feature_names[i]] = topic[i]
weights_nmf

makeamericagreatagain 56.29160711389638
nytimes 50.19595590545257
failing 50.19515419836757
cnn 48.43989549712596
way 47.348555076279155
bad 44.83261225330579
healthcare 39.479150855180585
tickets 38.41291706084292
working 37.322650125029455
hard 37.2693047000536


{'great': 1.7212520085963903,
 'america': 1.7118153536390415,
 'make': 1.6293645647445902,
 'safe': 0.3890961876697245,
 'going': 0.2859811517064995,
 'honor': 0.13590612162466056,
 'day': 0.11683115476915339,
 'people': 0.11013596341593734,
 'today': 0.07597932743364502,
 'trumppence16': 0.0735481618402112}

In [15]:
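import pandas as pd

# name the columns so the LDA and NMF halves of the table are distinguishable
df1 = pd.DataFrame(list(weights_lda.items()), columns=['lda_word', 'lda_weight'])
df2 = pd.DataFrame(list(weights_nmf.items()), columns=['nmf_word', 'nmf_weight'])

df = pd.concat([df1, df2], axis=1)
df

| | lda_word | lda_weight | nmf_word | nmf_weight |
|---|---|---|---|---|
| 0 | makeamericagreatagain | 56.291607 | great | 1.721252 |
| 1 | nytimes | 50.195956 | america | 1.711815 |
| 2 | failing | 50.195154 | make | 1.629365 |
| 3 | cnn | 48.439895 | safe | 0.389096 |
| 4 | way | 47.348555 | going | 0.285981 |
| 5 | bad | 44.832612 | honor | 0.135906 |
| 6 | healthcare | 39.479151 | day | 0.116831 |
| 7 | tickets | 38.412917 | people | 0.110136 |
| 8 | working | 37.32265 | today | 0.075979 |
| 9 | hard | 37.269305 | trumppence16 | 0.073548 |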


Comparing the two models on this topic, the LDA topic combines the `makeamericagreatagain` hashtag with negative tweets about news outlets such as CNN and the NYTimes (`failing`, `bad`). The NMF topic, by contrast, is dominated by `great`, `america`, and `make`, followed by more positive words such as `safe`, `honor`, and `today`.
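The raw weights in the table are not directly comparable: the LDA values are in the tens while the NMF values are below 2. One way to put them on a common footing is to convert each column to shares of its own total. The sketch below does this for the ten weights listed above; the helper `to_shares` is hypothetical, and the shares are relative to the top 10 words only, not to the full vocabulary.

```python
# Top-10 weights copied from the comparison table above.
weights_lda = {
    "makeamericagreatagain": 56.291607, "nytimes": 50.195956,
    "failing": 50.195154, "cnn": 48.439895, "way": 47.348555,
    "bad": 44.832612, "healthcare": 39.479151, "tickets": 38.412917,
    "working": 37.32265, "hard": 37.269305,
}
weights_nmf = {
    "great": 1.721252, "america": 1.711815, "make": 1.629365,
    "safe": 0.389096, "going": 0.285981, "honor": 0.135906,
    "day": 0.116831, "people": 0.110136, "today": 0.075979,
    "trumppence16": 0.073548,
}

def to_shares(weights):
    """Hypothetical helper: rescale weights so they sum to 1."""
    total = sum(weights.values())
    return {word: value / total for word, value in weights.items()}

shares_lda = to_shares(weights_lda)
shares_nmf = to_shares(weights_nmf)

# Combined share of the three heaviest words in each topic.
top3_lda = sum(sorted(shares_lda.values(), reverse=True)[:3])
top3_nmf = sum(sorted(shares_nmf.values(), reverse=True)[:3])
print(round(top3_lda, 2), round(top3_nmf, 2))
```

On these numbers the three heaviest NMF words account for most of the top-10 weight, while the LDA weight is spread much more evenly across its ten words.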

In [16]:
# Second set using the 'drain the swamp' topic

# lda taking drain the swamp

topic = lda.components_[3]  
no_top_words = 10

weights_lda = {}
for i in topic.argsort()[:-no_top_words - 1:-1]:
    print(featureNames[i], topic[i])
    weights_lda[featureNames[i]] = topic[i]
    
# nmf taking drain the swamp

topic = nmf.components_[4]  
no_top_words = 10

weights_nmf = {}
for i in topic.argsort()[:-no_top_words - 1:-1]:
    weights_nmf[tfidf_feature_names[i]] = topic[i]
weights_nmf

draintheswamp 46.82920898712719
americafirst 43.75801102863496
live 39.81839380594453
crookedhillary 30.1959897436008
night 23.382520986693113
bigleaguetruth 22.935965073333687
minister 22.263967571772472
000 21.856838792624796
prime 21.52086711139623
world 19.058042840020725


{'amp': 0.6272633355501145,
 'draintheswamp': 0.6157084913433648,
 'hillary': 0.5150582069385732,
 'rt': 0.49555741764462025,
 'clinton': 0.44502729282695674,
 'trump': 0.40160460826186667,
 'people': 0.322511695799733,
 'time': 0.3055033282778956,
 'realdonaldtrump': 0.29427884485934275,
 'media': 0.28129271417372503}

In [17]:
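# name the columns so the LDA and NMF halves of the table are distinguishable
df1 = pd.DataFrame(list(weights_lda.items()), columns=['lda_word', 'lda_weight'])
df2 = pd.DataFrame(list(weights_nmf.items()), columns=['nmf_word', 'nmf_weight'])

df = pd.concat([df1, df2], axis=1)
df

| | lda_word | lda_weight | nmf_word | nmf_weight |
|---|---|---|---|---|
| 0 | draintheswamp | 46.829209 | amp | 0.627263 |
| 1 | americafirst | 43.758011 | draintheswamp | 0.615708 |
| 2 | live | 39.818394 | hillary | 0.515058 |
| 3 | crookedhillary | 30.19599 | rt | 0.495557 |
| 4 | night | 23.382521 | clinton | 0.445027 |
| 5 | bigleaguetruth | 22.935965 | trump | 0.401605 |
| 6 | minister | 22.263968 | people | 0.322512 |
| 7 | 000 | 21.856839 | time | 0.305503 |
| 8 | prime | 21.520867 | realdonaldtrump | 0.294279 |
| 9 | world | 19.058043 | media | 0.281293 |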


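In this comparison, the LDA topic leans towards `draintheswamp` and `americafirst`, with some Hillary Clinton content (`crookedhillary`) and the `bigleaguetruth` hashtag. The NMF topic is more about Hillary Clinton and Trump himself (`hillary`, `clinton`, `trump`, `realdonaldtrump`) and the contrast between the two, along with the media.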
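A simple way to back up this kind of side-by-side reading is to measure how many top words the two topics share. The sketch below computes the Jaccard overlap for the two word lists printed above; the helper `jaccard` is an illustrative assumption, not part of the original lab.

```python
# Top-10 words copied from the second comparison above.
lda_words = ["draintheswamp", "americafirst", "live", "crookedhillary", "night",
             "bigleaguetruth", "minister", "000", "prime", "world"]
nmf_words = ["amp", "draintheswamp", "hillary", "rt", "clinton",
             "trump", "people", "time", "realdonaldtrump", "media"]

def jaccard(a, b):
    """Hypothetical helper: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

shared = sorted(set(lda_words) & set(nmf_words))
score = jaccard(lda_words, nmf_words)
print(shared, round(score, 3))
```

Only `draintheswamp` appears in both lists, so although both topics are anchored on the same hashtag, the two models surround it with quite different vocabulary.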