# Homework 2
## Social Media Analytics

Clarissa Franklin, Kyle Katzen, Paige McKenzie, Meyappan Subbaiah

### Task A.
We brainstormed a list of major political topics that we thought would be relevant from 1789 to 2018. Our list included eight topics: war, voting, equality, slavery, taxes, economy, transportation, and international relations. We believe these are the major topics that will appear, but we also expected roughly two additional topics that we had not listed to emerge from the speeches. For this reason, we chose to model 10 topics.

### Task B. 
#### Topic modelling with LDA

In [1]:
import pandas as pd
import os
from nltk.corpus import stopwords
from nltk.stem.porter import *
import nltk
import regex as re
import numpy as np
import gensim
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer

In [2]:
stemmer = PorterStemmer()

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()


stop = stopwords.words('english')

        
df = pd.DataFrame(columns=['year', 'month', 'day', 'speaker', 'text'])

In [3]:
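# Filenames have the form "<year>-<month>-<day> <speaker>" plus a 4-character extension.
rows = []
for filename in os.listdir("data"):
    date_part, speaker_part = filename.split(' ', 1)
    with open(os.path.join("data", filename), 'r', encoding="utf8", errors='ignore') as book:
        rows.append(date_part.split('-') + [speaker_part[:-4], book.read()])
# DataFrame.append was removed in pandas 2.0, so build the frame in one step.
df = pd.DataFrame(rows, columns=['year', 'month', 'day', 'speaker', 'text'])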

In [4]:
#test with small
#df=df[0:5]

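# strip everything except word characters, whitespace, apostrophes and plus signs (raw string avoids invalid-escape warnings)
df['text']=df['text'].apply(lambda x: re.sub(r"[^\w\s'+]", '', x))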

In [5]:
df['text']=df['text'].apply(lambda x: [stemmer.stem(lemmatizer.lemmatize(item.lower())) for item in x.split() if item.lower() not in stop])

tf_vect=TfidfVectorizer(max_df=0.75, min_df=0.1)

df['text_joined']=df['text'].str.join(sep=" ")
x = tf_vect.fit_transform(df['text_joined']).todense()
tf_words=tf_vect.get_feature_names_out()  # use get_feature_names() on scikit-learn older than 1.0

real_words=[]
for word in tf_words:
    first=word.split()[0]
    if first.isalpha():
        real_words+=[word]
real_words

#get frequency of words whole corpus
#init=[]
#corp=[i+init for i in df['text']][0]
#freq_dist = nltk.FreqDist(corp)

#also remove words that fail the tf-idf thresholds (words in more than 75% of documents, or in fewer than 10% of documents)
real_word_set = set(real_words)  # set membership is much faster than scanning a list
df['text'] = df['text'].apply(lambda x: [word for word in x if word in real_word_set])

df.head()

Unnamed: 0,year,month,day,speaker,text,text_joined
0,1789,4,30,George Washington,"[fellow, citizen, senat, hous, repres, among, ...",fellow citizen senat hous repres among vicissi...
1,1789,10,3,George Washington,"[duti, acknowledg, provid, god, grate, benefit...",wherea duti nation acknowledg provid almighti ...
2,1790,1,8,George Washington,"[fellow, citizen, senat, hous, repres, embrac,...",fellow citizen senat hous repres embrac great ...
3,1790,12,8,George Washington,"[fellow, citizen, senat, hous, repres, meet, f...",fellow citizen senat hous repres meet feel muc...
4,1790,12,29,George Washington,"[written, speech, sign, hand, speak, desir, at...",presid unit state mouth written speech sign ha...
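
The `max_df=0.75` and `min_df=0.1` settings above act as a document-frequency filter: a term survives only if it appears in at least 10% and at most 75% of the speeches. The following is a minimal pure-Python sketch of that idea on a made-up toy corpus. The function name `filter_by_doc_freq` and the sample documents are illustrative assumptions, not part of our pipeline, and the toy call uses a higher `min_df` so that the effect is visible on four documents.

```python
from collections import Counter

def filter_by_doc_freq(docs, min_df=0.1, max_df=0.75):
    """Return the set of terms whose document frequency lies in [min_df, max_df].

    docs is a list of token lists; frequencies are fractions of documents.
    """
    n_docs = len(docs)
    # Count each term once per document, however often it occurs there.
    doc_counts = Counter(term for doc in docs for term in set(doc))
    return {term for term, count in doc_counts.items()
            if min_df <= count / n_docs <= max_df}

# Toy corpus (hypothetical): "state" appears in every document, so it is dropped.
toy_docs = [
    ["state", "war", "peac"],
    ["state", "tax", "war"],
    ["state", "tax", "bank"],
    ["state", "law", "court"],
]
kept = filter_by_doc_freq(toy_docs, min_df=0.3, max_df=0.75)
print(sorted(kept))
```

Words that occur in only one of the four toy documents fall below the lower threshold, which mirrors how rare stems are removed from the real corpus.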


In [6]:
from gensim import corpora as cp
#create the term dictionary of our corpus, where every unique term is assigned an index
dictionary=cp.Dictionary(df['text'].values)

len(dictionary)
print(dictionary)
#convert the list of documents into a document-term matrix using the dictionary

doc_term_matrix=[dictionary.doc2bow(doc) for doc in df['text'].values]

Dictionary(1703 unique tokens: ['accomplish', 'accordingli', 'acknowledg', 'act', 'actual']...)


In [7]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training the LDA model on the document-term matrix.
ldamodel = Lda(doc_term_matrix, num_topics=10, id2word = dictionary, passes=50,update_every=1)

#print the topics
topics=pd.DataFrame(ldamodel.print_topics(num_topics=10, num_words=6), columns=['topic', 'words_and_weights'])
topics.head()

Unnamed: 0,topic,words_and_weights
0,0,"0.029*""mr"" + 0.023*""think"" + 0.015*""vietnam"" +..."
1,1,"0.009*""problem"" + 0.009*""busi"" + 0.007*""public..."
2,2,"0.014*""upon"" + 0.014*""constitut"" + 0.013*""shal..."
3,3,"0.018*""law"" + 0.011*""busi"" + 0.011*""upon"" + 0...."
4,4,"0.012*""soviet"" + 0.007*""nuclear"" + 0.007*""unio..."


In [8]:
all_topics = ldamodel.get_document_topics(doc_term_matrix)

#### Distribution of Topics

In [9]:
df['topics'] = 0
df.head()

Unnamed: 0,year,month,day,speaker,text,text_joined,topics
0,1789,4,30,George Washington,"[fellow, citizen, senat, hous, repres, among, ...",fellow citizen senat hous repres among vicissi...,0
1,1789,10,3,George Washington,"[duti, acknowledg, provid, god, grate, benefit...",wherea duti nation acknowledg provid almighti ...,0
2,1790,1,8,George Washington,"[fellow, citizen, senat, hous, repres, embrac,...",fellow citizen senat hous repres embrac great ...,0
3,1790,12,8,George Washington,"[fellow, citizen, senat, hous, repres, meet, f...",fellow citizen senat hous repres meet feel muc...,0
4,1790,12,29,George Washington,"[written, speech, sign, hand, speak, desir, at...",presid unit state mouth written speech sign ha...,0


In [10]:
# Keep only the topic ids from each document's (topic_id, probability) pairs
topic_agg = [[topic_id for topic_id, _ in doc] for doc in all_topics]
df['topics'] = topic_agg

In [11]:
df.head()

Unnamed: 0,year,month,day,speaker,text,text_joined,topics
0,1789,4,30,George Washington,"[fellow, citizen, senat, hous, repres, among, ...",fellow citizen senat hous repres among vicissi...,"[2, 5]"
1,1789,10,3,George Washington,"[duti, acknowledg, provid, god, grate, benefit...",wherea duti nation acknowledg provid almighti ...,"[0, 2, 5]"
2,1790,1,8,George Washington,"[fellow, citizen, senat, hous, repres, embrac,...",fellow citizen senat hous repres embrac great ...,"[2, 5]"
3,1790,12,8,George Washington,"[fellow, citizen, senat, hous, repres, meet, f...",fellow citizen senat hous repres meet feel muc...,"[2, 5]"
4,1790,12,29,George Washington,"[written, speech, sign, hand, speak, desir, at...",presid unit state mouth written speech sign ha...,"[2, 3, 5, 8]"


In [12]:
def new_columns(i, lst):
    if i in lst:
        return 1
    else:
        return 0

for j in range(0,10):
    df['Topic_' + str(j)] = df['topics'].apply(lambda x: new_columns(j, x))
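
Cells 10 to 12 turn each document's list of `(topic_id, probability)` pairs into ten 0/1 columns. gensim's `get_document_topics` leaves out topics whose probability falls below its minimum-probability threshold, which is why the lists differ in length. A small stand-alone sketch of the same conversion, using made-up probabilities (the helper name `topic_indicators` and the toy values are assumptions for illustration):

```python
def topic_indicators(doc_topics, num_topics=10):
    """Map [(topic_id, prob), ...] to a 0/1 list of length num_topics."""
    present = {topic_id for topic_id, _ in doc_topics}
    return [1 if t in present else 0 for t in range(num_topics)]

# Hypothetical output for two documents (probabilities are made up).
toy_all_topics = [
    [(2, 0.61), (5, 0.38)],
    [(0, 0.12), (2, 0.50), (5, 0.37)],
]
rows = [topic_indicators(doc) for doc in toy_all_topics]
print(rows[0])  # [0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
```

The indicators discard the probabilities themselves, so a topic that barely clears the threshold counts the same as a dominant one.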

In [13]:
df.head()

Unnamed: 0,year,month,day,speaker,text,text_joined,topics,Topic_0,Topic_1,Topic_2,Topic_3,Topic_4,Topic_5,Topic_6,Topic_7,Topic_8,Topic_9
0,1789,4,30,George Washington,"[fellow, citizen, senat, hous, repres, among, ...",fellow citizen senat hous repres among vicissi...,"[2, 5]",0,0,1,0,0,1,0,0,0,0
1,1789,10,3,George Washington,"[duti, acknowledg, provid, god, grate, benefit...",wherea duti nation acknowledg provid almighti ...,"[0, 2, 5]",1,0,1,0,0,1,0,0,0,0
2,1790,1,8,George Washington,"[fellow, citizen, senat, hous, repres, embrac,...",fellow citizen senat hous repres embrac great ...,"[2, 5]",0,0,1,0,0,1,0,0,0,0
3,1790,12,8,George Washington,"[fellow, citizen, senat, hous, repres, meet, f...",fellow citizen senat hous repres meet feel muc...,"[2, 5]",0,0,1,0,0,1,0,0,0,0
4,1790,12,29,George Washington,"[written, speech, sign, hand, speak, desir, at...",presid unit state mouth written speech sign ha...,"[2, 3, 5, 8]",0,0,1,1,0,1,0,0,1,0


In [14]:
df.to_csv("topics_speech.csv")

Picture below created in R (code attached)

![title](Topics_over_time.png)

### Distribution of words in each topic

In [65]:
word_topic_dist = pd.DataFrame(columns=['Topic','word','weight'])
word_topic_dist.head()

Unnamed: 0,Topic,word,weight


In [66]:
## https://stackoverflow.com/questions/17662916/how-to-print-out-the-full-distribution-of-words-in-an-lda-topic-in-gensim
# Collect one row per (topic, word) pair, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0).
rows = []
for topic_num, word_probs in ldamodel.show_topics(formatted=False, num_words=10):
    print(topic_num)
    for word, weight in word_probs:
        print((word, weight))
        rows.append({"Topic": topic_num, "word": word, "weight": weight})
word_topic_dist = pd.DataFrame(rows, columns=['Topic', 'word', 'weight'])

0
('mr', 0.028679498)
('think', 0.022893343)
('vietnam', 0.014794714)
('go', 0.013468637)
('question', 0.009425681)
('senat', 0.008579963)
('want', 0.008500857)
('say', 0.008394521)
('south', 0.0074093854)
('believ', 0.007248727)
1
('problem', 0.009007612)
('busi', 0.008596857)
('public', 0.0072408332)
('system', 0.0068431376)
('employ', 0.0064181034)
('bank', 0.00630518)
('today', 0.006228642)
('court', 0.0061225607)
('opportun', 0.0059980517)
('progress', 0.005870454)
2
('upon', 0.014276791)
('constitut', 0.0135479905)
('shall', 0.013432109)
('union', 0.00873695)
('public', 0.0075864885)
('law', 0.0073384587)
('duti', 0.0064346483)
('principl', 0.006072928)
('free', 0.0050568026)
('citizen', 0.004862452)
3
('law', 0.01844465)
('busi', 0.010908679)
('upon', 0.010845834)
('congress', 0.009418972)
('labor', 0.0077078864)
('public', 0.0075070737)
('condit', 0.0074172537)
('legisl', 0.0072132046)
('court', 0.0069335136)
('men', 0.0067299176)
4
('soviet', 0.012447268)
('nuclear', 0.0071570

In [67]:
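# Normalize each word's weight by the total weight of the top words in its topic.
# transform keeps the original row order, so the result lines up with word_topic_dist.
test = word_topic_dist.groupby('Topic')['weight'].transform(lambda w: w / w.sum())

In [68]:
word_topic_dist['weighted_numbers'] = test.values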

In [69]:
word_topic_dist.to_csv("topic_word_weights.csv")

Note: This graph was produced in R. The ordering of the words within each panel still needs to be fixed.

![title](word_topic_dist.png)

### Task C
In terms of topics addressed "heavily" in a speech, which 3 former presidents does President Trump share the highest similarity with? How did you arrive at your conclusion?

### Note: still need to compute a cosine similarity or a comparable measure

Picture below created in R (code attached)

![title](Part_C_similarity_Pres.png)
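
One way to make the comparison concrete is to average the `Topic_0` to `Topic_9` indicator columns per speaker and compare the resulting vectors with cosine similarity. The sketch below assumes such per-speaker vectors already exist. The speaker labels and numbers are placeholders, not results from our model, and `cosine_similarity` and `most_similar` are hypothetical helper names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def most_similar(target, others, k=3):
    """Return the k (name, score) pairs most similar to target, best first."""
    scores = [(name, cosine_similarity(target, vec)) for name, vec in others.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Placeholder topic-share vectors (3 topics for brevity; values are made up).
target_vec = [0.8, 0.1, 0.1]
other_vecs = {
    "President A": [0.7, 0.2, 0.1],
    "President B": [0.1, 0.8, 0.1],
    "President C": [0.8, 0.1, 0.1],
    "President D": [0.0, 0.1, 0.9],
}
top3 = most_similar(target_vec, other_vecs, k=3)
print([name for name, _ in top3])
```

Cosine similarity ignores vector length, so a president with many speeches and one with few are compared only on the mix of topics.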

### Task D
In terms of his own speeches, do you see President Trump shifting the emphasis on certain topics over time? Explain your response. 

### Task E
If you do a K-means clustering with the same number of clusters as topics, do you see President Trump's speeches and those of the 3 former presidents you identified in Task C in the same cluster? What was the basis of clustering (e.g., tf-idf, cosine similarity)? Discuss your findings.
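
As a starting point for this task, the sketch below shows the K-means loop itself on toy 2-D vectors. It is a plain-Python illustration under stated assumptions, not our final analysis: the points are made up, the `kmeans` helper is hypothetical, and the centroids start at the first `k` points so the output is deterministic. For the real data, a library implementation such as `sklearn.cluster.KMeans` applied to the tf-idf matrix would be the more practical choice.

```python
def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm with squared Euclidean distance.

    Centroids start at the first k points, so the result is deterministic.
    Returns a cluster label for every point.
    """
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for each point.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy document vectors (made up): two speeches lean on each of two topics.
toy_points = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2], [0.2, 0.8]]
labels = kmeans(toy_points, k=2)
print(labels)  # [0, 1, 0, 1]
```

Whatever vectors are fed in (tf-idf rows or topic shares) define the basis of clustering, so the same speeches can land in different clusters depending on that choice.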

### Task F
Provide a visualization of both clusters (with colors) and cosine scores using MDS. 