>* We learned how to count N-grams (i.e., sequences of N words such as unigrams (N=1) and bigrams (N=2)) by tokenizing the text.
>* But we faced a problem of counting N-grams when there are unnecessary or meaningless tokens after tokenization.
>* Therefore, we needed further processing to remove stopwords (i.e., function words) and punctuation.
>* The last step was to convert the text into lemma form (i.e., the base form of words) to avoid the duplication of the same word with different forms. For instance, 'running' and 'ran' are converted into 'run' because they have the same meaning.
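
>* As a quick refresher, here is a minimal sketch of N-gram counting using only the standard library. The sentence and the helper name `count_ngrams` are made up for illustration and are not the code from last week.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n shifted copies of the token list to form the n-grams
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams)

# Hypothetical example sentence, tokenized by whitespace
tokens = "we support the bill and we support the troops".split()

unigram_counts = count_ngrams(tokens, 1)
bigram_counts = count_ngrams(tokens, 2)

print(unigram_counts.most_common(3))
print(bigram_counts.most_common(3))
```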

>* This week, we will learn topic modeling, an unsupervised learning technique, to extract topics from the text.
>* Topic modeling uses a statistical model to discover the abstract topics that occur in a collection of documents.
>* We will use the Latent Dirichlet Allocation (LDA) model, one of the most popular topic modeling techniques. 

In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

>* We are going to use the `nltk` library for tokenization and stopword removal.
>* We will use the `gensim` library for topic modeling.

>* Let's import the data from week 3.

In [2]:
data=pd.read_csv('../week3/Political-media-DFE.csv', encoding='latin1')

In [3]:
data.columns

Index(['_unit_id', '_golden', '_unit_state', '_trusted_judgments',
       '_last_judgment_at', 'audience', 'audience:confidence', 'bias',
       'bias:confidence', 'message', 'message:confidence', 'orig__golden',
       'audience_gold', 'bias_gold', 'bioid', 'embed', 'id', 'label',
       'message_gold', 'source', 'text'],
      dtype='object')

In [4]:
data.dtypes

_unit_id                 int64
_golden                   bool
_unit_state             object
_trusted_judgments       int64
_last_judgment_at       object
audience                object
audience:confidence    float64
bias                    object
bias:confidence        float64
message                 object
message:confidence     float64
orig__golden           float64
audience_gold          float64
bias_gold              float64
bioid                   object
embed                   object
id                      object
label                   object
message_gold           float64
source                  object
text                    object
dtype: object

>* `dtypes` is used to check the data type of each column.

>* Let's subset the data to keep who posted, where they posted (social media platform), and what they posted.

In [5]:
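content=data[['label', 'source', 'text']].copy() #.copy() makes an independent DataFrame, so adding columns later does not raise SettingWithCopyWarning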

In [6]:
content

Unnamed: 0,label,source,text
0,From: Trey Radel (Representative from Florida),twitter,RT @nowthisnews: Rep. Trey Radel (R- #FL) slam...
1,From: Mitch McConnell (Senator from Kentucky),twitter,VIDEO - #Obamacare: Full of Higher Costs and ...
2,From: Kurt Schrader (Representative from Oregon),twitter,Please join me today in remembering our fallen...
3,From: Michael Crapo (Senator from Idaho),twitter,RT @SenatorLeahy: 1st step toward Senate debat...
4,From: Mark Udall (Senator from Colorado),twitter,.@amazon delivery #drones show need to update ...
...,...,...,...
4995,From: Ted Yoho (Representative from Florida),facebook,I applaud Governor PerryÛªs recent decision t...
4996,From: Ted Yoho (Representative from Florida),facebook,"Today, I voted in favor of H.R. 5016 - Financi..."
4997,From: Ted Yoho (Representative from Florida),facebook,(Taken from posted WOKV interview) Congressm...
4998,From: Ted Yoho (Representative from Florida),facebook,Join me next week for a town hall in Ocala! I'...


>* Let's print out the text data in the first row.

In [7]:
content['text'].iloc[0]

'RT @nowthisnews: Rep. Trey Radel (R- #FL) slams #Obamacare. #politics https://t.co/zvywMG8yIH'

>* If you want to check the text data in the fifth row, you can use `content['text'].iloc[4]` (positions start at 0).

In [8]:
content['text'].iloc[4]

'.@amazon delivery #drones show need to update law to promote #innovation &amp; protect #privacy. My #UAS bill does that: http://t.co/l9ta5SKq6u'

>* Let's clean the data for LDA.

>* The first step is to lowercase the text data.

In [9]:
content['text-lower']=content['text'].str.lower()



In [10]:
content['text'].iloc[0]

'RT @nowthisnews: Rep. Trey Radel (R- #FL) slams #Obamacare. #politics https://t.co/zvywMG8yIH'

In [11]:
content['text-lower'].iloc[0]

'rt @nowthisnews: rep. trey radel (r- #fl) slams #obamacare. #politics https://t.co/zvywmg8yih'

> * We can separate the entire contents into tokens (words, hashtags, mentions, etc.).
> * Separating the contents into tokens is called tokenization.
> * We can use the `word_tokenize` function from the `nltk` library to tokenize the contents.
> * There is also a `TweetTokenizer` class in the `nltk` library that is designed specifically for tweets.
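
> * For comparison, here is a minimal sketch of `TweetTokenizer` on the lowercased first post. It is not used in the rest of this notebook. Unlike `word_tokenize`, which tends to split `@` and `#` away from the word, `TweetTokenizer` keeps mentions, hashtags, and URLs as single tokens.

```python
from nltk.tokenize import TweetTokenizer

# The lowercased text of the first post in the dataset
tweet = "rt @nowthisnews: rep. trey radel (r- #fl) slams #obamacare. #politics https://t.co/zvywmg8yih"

# TweetTokenizer is rule-based and keeps Twitter-specific tokens together
tweet_tokens = TweetTokenizer().tokenize(tweet)
print(tweet_tokens)
```

> * Whether this is preferable depends on the analysis: keeping `#obamacare` as one token preserves the hashtag, but the later `isalnum()` filter would then drop it.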

> * There are two ways to tokenize the contents.
> * One is to call `.apply()` on the lowercased text column. `.apply()` runs a function on every value in the column, so you don't have to write a for loop.
> * The other is to iterate through the lowercased text and tokenize each post yourself.
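
> * The two approaches can be sketched on a tiny made-up DataFrame. As an assumption for illustration, plain whitespace splitting (`simple_tokenize`) stands in for `word_tokenize`, so the sketch runs without any NLTK data.

```python
import pandas as pd

# Hypothetical two-row stand-in for the real data
posts = pd.DataFrame({"text-lower": ["please join me today", "happy veterans day"]})

def simple_tokenize(text):
    # Stand-in tokenizer (assumption): split on whitespace only
    return text.split()

# Way 1: apply the function to every value in the column
posts["via_apply"] = posts["text-lower"].apply(simple_tokenize)

# Way 2: loop over the rows and collect the results yourself
via_loop = []
for _, row in posts.iterrows():
    via_loop.append(simple_tokenize(row["text-lower"]))
posts["via_loop"] = via_loop

# Both columns hold the same token lists
print(posts["via_apply"].tolist() == posts["via_loop"].tolist())
```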

In [12]:
content['tokenized_unigrams']=content['text-lower'].apply(word_tokenize)



In [13]:
iterated_unigrams=[]
for idx, row in content.iterrows():
    iterated_unigrams.append(word_tokenize(row['text-lower']))

In [14]:
content['iterated_unigrams']=iterated_unigrams



> * Both approaches produce one list of tokens per post, and the results are identical. Let's check the first row.

In [15]:
content.loc[0,'iterated_unigrams'] == content.loc[0,'tokenized_unigrams']

True

> * The `nltk` library provides a list of stopwords (function words) that we can remove from the contents.

In [16]:
stop=stopwords.words('english')

In [17]:
stop[1:10] #this slice starts at index 1, so it skips the first stopword and shows the next nine

['me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [18]:
stop[-10:] #use negative index to slice the last 10 stopwords

['shouldn',
 "shouldn't",
 'wasn',
 "wasn't",
 'weren',
 "weren't",
 'won',
 "won't",
 'wouldn',
 "wouldn't"]

>* Because the stopword list is an ordinary Python list, we can add or remove words.
>* Since some of the data was collected from Twitter, we can add Twitter-specific stopwords like `rt`.

In [19]:
stop.append('rt') #add 'rt' to the stopwords list

>* We want to remove the stopwords from the `text-lower` column.

In [20]:
content['stopword']=content['text-lower'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
#The lambda function takes each row of the 'text-lower' column, splits it into a list of words, 
#and then joins the words back together into a string, excluding any words that are in the 'stop' list.
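
>* The same filter can be sketched in plain Python. The names `stop_words` and `remove_stopwords` and the short stopword set are assumptions for illustration; the notebook uses NLTK's full English list. Storing the stopwords in a `set` rather than a list makes each membership check faster, which can matter on larger datasets.

```python
# Hypothetical mini stopword set; the notebook uses NLTK's full English list plus 'rt'
stop_words = {"the", "to", "on", "of", "about", "before", "rt"}

def remove_stopwords(text, stop_words):
    # Split on whitespace, drop stopwords, and join the rest back into a string
    kept = [word for word in text.split() if word not in stop_words]
    return " ".join(kept)

post = "called on the @usdotfra to release info about inspections"
cleaned = remove_stopwords(post, stop_words)
print(cleaned)
```

>* Note that splitting on whitespace leaves punctuation attached, so a token such as `the,` would not match the stopword `the`.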



>* Let's get rid of irrelevant punctuation.

In [21]:
content['stop_tokenized_unigrams']=content['stopword'].apply(word_tokenize)



In [22]:
content['punct_tokenized_unigrams']=content['stop_tokenized_unigrams'].apply(lambda x: [word for word in x if word.isalnum()])
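
>* To see what the `isalnum()` filter keeps and drops, here is a small sketch on a hand-written token list (the tokens are made up for illustration). `str.isalnum()` is true only when every character is a letter or a digit, so the filter removes pure punctuation and also any token that merely contains punctuation.

```python
# Hypothetical tokens of the kind a tokenizer might produce from a post
tokens = ["#", "obamacare", ".", "h.r", "5016", "n't", "t.co/zvywmg8yih", "amp"]

kept = [tok for tok in tokens if tok.isalnum()]
dropped = [tok for tok in tokens if not tok.isalnum()]

print(kept)
print(dropped)
```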



>* The last step is to convert the text data into lemma form.
>* When counting the most frequent words, we saw that the same word with different forms was counted separately. For instance, the past and present tense of the same word were counted as two different words.
>* To avoid this, we will use lemmatization to convert the text data into the base form of words.

In [23]:
from nltk.stem import WordNetLemmatizer
lemmatizer=WordNetLemmatizer()
from nltk.corpus import wordnet

> * NLTK's `WordNetLemmatizer` has a limitation.
> * By default, it lemmatizes every word as a noun, so other parts of speech, such as verbs, are often left unchanged.
> * Therefore, we need to specify the part of speech (POS) for each token.

In [24]:
def nltk_tag_to_wordnet_tag(nltk_tag):
    if nltk_tag.startswith('J'): #ADJECTIVE
        return wordnet.ADJ
    elif nltk_tag.startswith('V'): #VERB
        return wordnet.VERB
    elif nltk_tag.startswith('N'): #NOUN        
        return wordnet.NOUN
    elif nltk_tag.startswith('R'): #ADVERB
        return wordnet.ADV
    else:          
        return None

In [25]:
def lemmatize_sentence(sentence):
    # Tokenize the sentence and find the POS tag for each token
    nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  
    # Tuple of (token, wordnet_tag)
    wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged) 
    lemmatized_sentence = []
    for word, tag in wordnet_tagged:
        if tag is None:
            # If no tag was found, then use the word as is
            lemmatized_sentence.append(word)
        else:        
            # Else use the tag to lemmatize the word
            lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
    return " ".join(lemmatized_sentence)

In [26]:
content['lemma']=content['punct_tokenized_unigrams'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])



In [27]:
content['lemmatizer_str']=content['lemma'].apply(lambda x: lemmatize_sentence(' '.join(x)))



In [28]:
content['lemmatizer_token']=content['lemmatizer_str'].apply(word_tokenize)
#tokenize the corrected lemmatized string



> * Let's compare the text with and without lemmatization.

In [29]:
content.loc[5, 'lemmatizer_str']

'call usdotfra release info inspection casseltonderailment review quality rail'

In [30]:
content.loc[5, 'text-lower']

'called on the @usdotfra to release info about inspections before the #casseltonderailment to review quality of rails. (1/2)'

>* We will import the `gensim` library for topic modeling.

In [31]:
import gensim
from gensim import corpora
from gensim.models.ldamodel import LdaModel

>* To prepare the text data for LDA, we need to create a dictionary and a corpus.
>* A dictionary is a mapping between words and their integer ids.
>* A corpus is a list of lists, where each inner list is the bag of words for a single document, stored as (word id, count) tuples.
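
>* To make the two structures concrete, here is a simplified pure-Python stand-in for what `corpora.Dictionary` and `doc2bow` produce. It is a sketch for illustration, not gensim's implementation, and the two toy documents and the names `token2id` and `to_bow` are assumptions.

```python
from collections import Counter

# Two hypothetical lemmatized posts
docs = [
    ["trey", "radel", "slam", "obamacare", "http"],
    ["interest", "state", "interest", "http"],
]

# Dictionary: give each distinct word an integer id, document by document
token2id = {}
for doc in docs:
    for token in sorted(set(doc)):
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    # Bag of words: (word id, count) tuples sorted by id; word order is discarded
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in docs]
id2token = {idx: token for token, idx in token2id.items()}

print(token2id)
print(toy_corpus[1])
```

>* In the second toy document, 'interest' appears twice, so its tuple has a count of 2, and 'http' reuses the id it received in the first document.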

In [32]:
id2word=corpora.Dictionary(content['lemmatizer_token'])

In [33]:
corpus=[id2word.doc2bow(text) for text in content['lemmatizer_token']]

In [34]:
corpus[0] 
#the first element in the tuple is the word id, 
#and the second element is the frequency of the word in the document

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]

In [35]:
content['lemmatizer_token'].iloc[0]

['nowthisnews', 'trey', 'radel', 'fl', 'slam', 'obamacare', 'politics', 'http']

In [36]:
#An example where a word ('interest') appears twice in one document
content['lemmatizer_token'][91]

['photo',
 'join',
 'world',
 'interest',
 'man',
 'from',
 'america',
 'interest',
 'state',
 'landmines',
 'event',
 'cap',
 'hill',
 'http']

In [37]:
corpus[91]

[(1, 1),
 (18, 1),
 (78, 1),
 (92, 1),
 (360, 1),
 (436, 1),
 (491, 1),
 (732, 1),
 (733, 1),
 (734, 1),
 (735, 2),
 (736, 1),
 (737, 1)]

>* Q. What is the index number of the word 'interest'?

>* The answer is 735. Note that the tuples in `corpus` are ordered by word id, not by the order of tokens in the post.
>* Gensim assigns a unique id to each distinct word in the whole corpus, so a word keeps the same id in every document.

In [38]:
id2word[735]

'interest'

>* Q. What is the word of index number 1?

>* The answer is 'http'. It appears in the first row of `content['lemmatizer_token']`, and gensim numbers the new words of each document in alphabetical order, so 'fl' gets id 0 and 'http' gets id 1.

In [39]:
id2word[1]

'http'

>* Now we build the model with the dictionary and the corpus.
>* There are several parameters to set for the LDA model.
>* The number of topics is one of the most important parameters to set. It will be specified under `num_topics`.
>* Let's start by setting the number of topics to 5.

In [40]:
lda=LdaModel(corpus=corpus, id2word=id2word, num_topics=5)

In [41]:
from pprint import pprint
#This is just to show the topics in a more readable way

In [42]:
pprint(lda.print_topics())

[(0,
  '0.016*"http" + 0.007*"obamacare" + 0.007*"today" + 0.006*"know" + '
  '0.005*"click" + 0.005*"health" + 0.005*"congressional" + 0.005*"house" + '
  '0.005*"take" + 0.005*"care"'),
 (1,
  '0.016*"http" + 0.009*"today" + 0.007*"new" + 0.007*"job" + 0.007*"house" + '
  '0.006*"american" + 0.006*"time" + 0.006*"family" + 0.006*"great" + '
  '0.006*"act"'),
 (2,
  '0.014*"http" + 0.009*"make" + 0.008*"president" + 0.007*"great" + '
  '0.007*"today" + 0.006*"work" + 0.005*"sander" + 0.005*"get" + 0.004*"like" '
  '+ 0.004*"year"'),
 (3,
  '0.014*"veteran" + 0.007*"http" + 0.007*"today" + 0.007*"state" + '
  '0.005*"day" + 0.005*"family" + 0.005*"u" + 0.004*"service" + 0.004*"get" + '
  '0.004*"american"'),
 (4,
  '0.008*"http" + 0.006*"today" + 0.006*"work" + 0.005*"help" + 0.005*"school" '
  '+ 0.005*"day" + 0.005*"service" + 0.005*"executive" + 0.004*"american" + '
  '0.004*"get"')]


>* But you'll find that the topic words will change every time you run the model.

In [43]:
lda=LdaModel(corpus=corpus, id2word=id2word, num_topics=5)

In [44]:
pprint(lda.print_topics())

[(0,
  '0.016*"http" + 0.007*"year" + 0.007*"today" + 0.006*"say" + 0.005*"get" + '
  '0.005*"congress" + 0.005*"office" + 0.005*"house" + 0.005*"american" + '
  '0.004*"president"'),
 (1,
  '0.015*"http" + 0.009*"house" + 0.008*"veteran" + 0.007*"bill" + '
  '0.005*"work" + 0.005*"american" + 0.005*"today" + 0.005*"new" + '
  '0.005*"congress" + 0.005*"tax"'),
 (2,
  '0.011*"work" + 0.010*"make" + 0.010*"http" + 0.008*"great" + 0.007*"family" '
  '+ 0.007*"woman" + 0.005*"business" + 0.005*"american" + 0.005*"year" + '
  '0.005*"day"'),
 (3,
  '0.013*"http" + 0.012*"today" + 0.008*"school" + 0.007*"great" + '
  '0.007*"president" + 0.006*"law" + 0.006*"service" + 0.006*"visit" + '
  '0.005*"new" + 0.005*"high"'),
 (4,
  '0.012*"today" + 0.010*"http" + 0.009*"day" + 0.006*"happy" + 0.006*"state" '
  '+ 0.006*"here" + 0.005*"military" + 0.005*"family" + 0.005*"great" + '
  '0.005*"de"')]


>* This is because LDA is a probabilistic model whose training starts from a random initialization, so the results are not deterministic.
>* To get the same results, you need to set the seed number.
>* The seed number is set under `random_state`.

In [45]:
lda=LdaModel(corpus=corpus, id2word=id2word, num_topics=5, random_state=1)

In [46]:
pprint(lda.print_topics())

[(0,
  '0.011*"http" + 0.008*"day" + 0.008*"work" + 0.008*"time" + 0.007*"act" + '
  '0.007*"government" + 0.007*"health" + 0.007*"care" + 0.007*"vote" + '
  '0.006*"house"'),
 (1,
  '0.011*"http" + 0.009*"today" + 0.008*"great" + 0.007*"state" + 0.006*"job" '
  '+ 0.006*"school" + 0.005*"american" + 0.005*"new" + 0.005*"help" + '
  '0.005*"service"'),
 (2,
  '0.016*"http" + 0.011*"today" + 0.006*"house" + 0.005*"new" + '
  '0.005*"veteran" + 0.005*"year" + 0.005*"american" + 0.004*"help" + '
  '0.004*"great" + 0.004*"need"'),
 (3,
  '0.009*"http" + 0.009*"work" + 0.007*"would" + 0.007*"law" + 0.006*"make" + '
  '0.006*"today" + 0.006*"tax" + 0.005*"keep" + 0.005*"legislation" + '
  '0.004*"day"'),
 (4,
  '0.013*"http" + 0.006*"family" + 0.005*"please" + 0.005*"work" + '
  '0.005*"great" + 0.004*"de" + 0.004*"make" + 0.004*"today" + 0.004*"year" + '
  '0.004*"hall"')]


>* You can also increase the number of topics by changing the `num_topics` parameter.
>* Let's set the number of topics to 10.

In [47]:
lda=LdaModel(corpus=corpus, id2word=id2word, num_topics=10, random_state=1)

In [48]:
pprint(lda.print_topics())

[(0,
  '0.014*"day" + 0.013*"http" + 0.010*"first" + 0.010*"happy" + 0.009*"hall" + '
  '0.009*"town" + 0.007*"point" + 0.007*"ii" + 0.007*"year" + 0.006*"change"'),
 (1,
  '0.017*"http" + 0.010*"great" + 0.008*"affair" + 0.008*"today" + '
  '0.008*"school" + 0.008*"meet" + 0.008*"share" + 0.006*"continue" + '
  '0.006*"american" + 0.006*"park"'),
 (2,
  '0.019*"http" + 0.013*"today" + 0.008*"great" + 0.007*"year" + 0.007*"new" + '
  '0.006*"sander" + 0.006*"visit" + 0.006*"state" + 0.005*"i" + '
  '0.005*"family"'),
 (3,
  '0.010*"http" + 0.008*"must" + 0.008*"governor" + 0.007*"president" + '
  '0.007*"legislation" + 0.007*"work" + 0.006*"today" + 0.006*"mental" + '
  '0.005*"december" + 0.005*"del"'),
 (4,
  '0.017*"http" + 0.009*"family" + 0.007*"click" + 0.007*"immigration" + '
  '0.007*"de" + 0.006*"today" + 0.006*"reform" + 0.006*"great" + 0.006*"night" '
  '+ 0.005*"government"'),
 (5,
  '0.018*"law" + 0.015*"http" + 0.013*"health" + 0.012*"care" + '
  '0.011*"president" + 0.00

>* Q. Do you like the results of the LDA model? What do you find interesting?

>* Finding the optimal number of topics is a challenging task.
>* The topic coherence score and perplexity are two common metrics to evaluate the model.
>* But this is beyond the scope of this course. Come find me if you're interested in learning more about it!

>* The assumption of the LDA model is that each document is a mixture of topics.
>* In this case with social media data, the document is an individual post and the topics are the themes of the post.
>* However, social media posts are often too short to contain a mixture of topics, so the LDA model may not work well.
>* The LDA model is more suitable for long documents like research papers, articles, and books.
>* Therefore, researchers often use another topic model called NMF (Non-negative Matrix Factorization), which is frequently reported to work better on short texts.

>* Let's learn NMF.

In [49]:
from gensim.models.nmf import Nmf

In [50]:
nmf = Nmf(corpus=corpus, id2word=id2word, num_topics=5)

In [51]:
pprint(nmf.print_topics())

[(0,
  '0.069*"http" + 0.009*"student" + 0.008*"day" + 0.007*"happy" + 0.007*"gop" '
  '+ 0.006*"talk" + 0.006*"week" + 0.005*"icymi" + 0.005*"debt" + '
  '0.005*"watch"'),
 (1,
  '0.014*"job" + 0.012*"business" + 0.010*"http" + 0.009*"year" + 0.009*"amp" '
  '+ 0.008*"i" + 0.008*"day" + 0.008*"help" + 0.007*"week" + 0.007*"today"'),
 (2,
  '0.030*"law" + 0.015*"congress" + 0.013*"president" + 0.012*"get" + '
  '0.011*"say" + 0.009*"do" + 0.009*"change" + 0.008*"go" + 0.008*"today" + '
  '0.008*"work"'),
 (3,
  '0.010*"law" + 0.010*"work" + 0.010*"need" + 0.009*"american" + 0.009*"make" '
  '+ 0.007*"act" + 0.007*"bill" + 0.007*"vote" + 0.007*"tax" + 0.007*"family"'),
 (4,
  '0.119*"http" + 0.031*"amp" + 0.011*"bill" + 0.009*"great" + 0.008*"vote" + '
  '0.008*"join" + 0.007*"house" + 0.005*"via" + 0.005*"2" + 0.005*"meet"')]


>* Similar to LDA, NMF returns different topic words each time you run the model.

In [52]:
nmf = Nmf(corpus=corpus, id2word=id2word, num_topics=5)

In [53]:
pprint(nmf.print_topics())

[(0,
  '0.031*"amp" + 0.029*"http" + 0.018*"vote" + 0.012*"today" + 0.009*"bill" + '
  '0.008*"join" + 0.008*"family" + 0.007*"u" + 0.007*"year" + 0.006*"student"'),
 (1,
  '0.020*"house" + 0.018*"http" + 0.016*"new" + 0.012*"president" + '
  '0.011*"job" + 0.007*"here" + 0.007*"read" + 0.006*"support" + 0.006*"obama" '
  '+ 0.005*"work"'),
 (2,
  '0.021*"today" + 0.012*"http" + 0.011*"act" + 0.010*"health" + 0.009*"great" '
  '+ 0.008*"day" + 0.007*"care" + 0.007*"thanks" + 0.006*"state" + '
  '0.006*"service"'),
 (3,
  '0.030*"law" + 0.016*"congress" + 0.010*"say" + 0.010*"get" + 0.010*"work" + '
  '0.010*"make" + 0.009*"do" + 0.009*"people" + 0.008*"change" + 0.008*"go"'),
 (4,
  '0.128*"http" + 0.010*"amp" + 0.009*"business" + 0.008*"gop" + 0.008*"small" '
  '+ 0.006*"american" + 0.006*"tax" + 0.006*"budget" + 0.005*"obamacare" + '
  '0.005*"via"')]


>* To avoid this randomness, you can set the seed number under `random_state`.

In [54]:
nmf=Nmf(corpus=corpus, id2word=id2word, num_topics=5, random_state=5)

In [55]:
pprint(nmf.print_topics())

[(0,
  '0.037*"http" + 0.029*"today" + 0.014*"great" + 0.013*"day" + 0.010*"state" '
  '+ 0.006*"time" + 0.006*"here" + 0.006*"new" + 0.006*"read" + 0.005*"year"'),
 (1,
  '0.024*"http" + 0.021*"job" + 0.017*"american" + 0.016*"get" + 0.014*"work" '
  '+ 0.010*"people" + 0.010*"law" + 0.007*"would" + 0.006*"economy" + '
  '0.006*"rate"'),
 (2,
  '0.038*"law" + 0.021*"congress" + 0.014*"president" + 0.012*"say" + '
  '0.011*"do" + 0.010*"change" + 0.010*"get" + 0.009*"go" + 0.009*"thing" + '
  '0.009*"executive"'),
 (3,
  '0.010*"make" + 0.009*"vote" + 0.007*"house" + 0.007*"work" + 0.007*"tax" + '
  '0.006*"act" + 0.006*"u" + 0.006*"year" + 0.006*"legislation" + '
  '0.006*"need"'),
 (4,
  '0.109*"http" + 0.030*"amp" + 0.009*"hear" + 0.009*"gop" + 0.006*"support" + '
  '0.005*"talk" + 0.005*"floor" + 0.004*"tune" + 0.004*"family" + '
  '0.004*"house"')]


>* You can also increase the number of topics by changing the `num_topics` parameter.

In [56]:
nmf=Nmf(corpus=corpus, id2word=id2word, num_topics=10, random_state=5)

In [57]:
pprint(nmf.print_topics())

[(0,
  '0.020*"bill" + 0.019*"veteran" + 0.018*"today" + 0.008*"service" + '
  '0.008*"va" + 0.008*"honor" + 0.008*"congress" + 0.007*"new" + '
  '0.007*"family" + 0.007*"president"'),
 (1,
  '0.017*"job" + 0.010*"work" + 0.010*"make" + 0.010*"american" + 0.010*"tax" '
  '+ 0.009*"house" + 0.009*"would" + 0.009*"year" + 0.008*"legislation" + '
  '0.008*"business"'),
 (2,
  '0.013*"state" + 0.012*"http" + 0.011*"time" + 0.010*"school" + '
  '0.010*"obamacare" + 0.009*"day" + 0.008*"national" + 0.008*"here" + '
  '0.007*"year" + 0.007*"live"'),
 (3,
  '0.022*"today" + 0.020*"great" + 0.014*"get" + 0.014*"vote" + 0.009*"go" + '
  '0.009*"law" + 0.007*"hear" + 0.007*"talk" + 0.007*"make" + 0.007*"one"'),
 (4,
  '0.049*"law" + 0.021*"congress" + 0.016*"president" + 0.016*"say" + '
  '0.015*"do" + 0.013*"change" + 0.012*"go" + 0.012*"get" + 0.012*"thing" + '
  '0.011*"make"'),
 (5,
  '0.107*"http" + 0.020*"help" + 0.013*"here" + 0.010*"health" + 0.010*"watch" '
  '+ 0.009*"bill" + 0.008*"ame

>* Q. Which model do you like better? LDA or NMF?

>* Topic modeling helps you understand the text data at an overall, abstract level.
>* When you want to know what the text is about, you can use topic modeling to extract the topics.

>* We now move on to a word-level understanding of the text data.
>* Words that frequently appear together are likely to have a strong relationship.
>* With this, we can understand how a word is used in context.
>* Word embedding is a technique to represent words in relation to other words in the text data.
>* To do so, each word is represented as a dense vector in a high-dimensional space.
>* The distance between the vectors represents the relationship between the words.
>* Similar words will be located close to each other in the vector space.
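
>* Closeness between word vectors is usually measured with cosine similarity, which is also the score that gensim's `most_similar` reports. Here is a minimal sketch with made-up 3-dimensional vectors; real embeddings have many more dimensions, and these numbers are assumptions chosen only for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1 means same direction, 0 means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical word vectors (not taken from any trained model)
senate = np.array([0.9, 0.8, 0.1])
congress = np.array([0.8, 0.9, 0.2])
pizza = np.array([0.1, 0.0, 0.9])

sim_close = cosine_similarity(senate, congress)
sim_far = cosine_similarity(senate, pizza)

print(round(sim_close, 3), round(sim_far, 3))
```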

In [58]:
from gensim.models.word2vec import Word2Vec

>* Before we plug in week 3 data, let's learn key functions in the Word2Vec model.

>* Let's download `word2vec-google-news-300` with the `gensim` downloader.
>* This model is pre-trained on Google News data.

In [59]:
import gensim.downloader

In [60]:
vector=gensim.downloader.load('word2vec-google-news-300')
#the size is 1662.8 MB

>* Let's look at the most similar words to the word 'singapore'.

In [61]:
vector.most_similar('singapore', topn=10)

[('malaysia', 0.647721529006958),
 ('hong_kong', 0.6318016052246094),
 ('malaysian', 0.6025472283363342),
 ('australia', 0.597445011138916),
 ('uae', 0.5960760116577148),
 ('india', 0.5947676301002502),
 ('uk', 0.5883470773696899),
 ('chinese', 0.5872371792793274),
 ('usa', 0.583607017993927),
 ('simon', 0.5695350766181946)]

In [62]:
vector.most_similar('korea', topn=10)

[('russia', 0.6269471049308777),
 ('korean', 0.6154767870903015),
 ('koreans', 0.6001532673835754),
 ('seoul', 0.5999401211738586),
 ('africa', 0.5899536609649658),
 ('south_korea', 0.5762559771537781),
 ('japan', 0.5648359060287476),
 ('Koreaâ_€_™', 0.5625478625297546),
 ('chinese', 0.5616578459739685),
 ('germany', 0.5582996606826782)]

In [63]:
vector.most_similar('japan', topn=10)

[('japanese', 0.6607722043991089),
 ('tokyo', 0.6265655755996704),
 ('america', 0.6033485531806946),
 ('europe', 0.5962790250778198),
 ('germany', 0.5782293081283569),
 ('chinese', 0.5763071179389954),
 ('india', 0.5745143294334412),
 ('hawaii', 0.5731386542320251),
 ('usa', 0.5680993795394897),
 ('korea', 0.5648358464241028)]

In [64]:
vector.most_similar('china', topn=10)
#china can be an ambiguous word. It can be a country or a material.

[('dinnerware', 0.6587947607040405),
 ('crockery', 0.6426127552986145),
 ('porcelain', 0.6392654776573181),
 ('crystal_stemware', 0.6264337301254272),
 ('chinaware', 0.6146420240402222),
 ('china_plates', 0.6145730018615723),
 ('silver_flatware', 0.6102818846702576),
 ('flatware', 0.6089655756950378),
 ('bone_china', 0.6068581938743591),
 ('tableware', 0.5923404693603516)]

>* With Word2Vec, we can get a sense of the context in which a word is used.
>* Another function that Word2Vec provides is adding and subtracting word vectors for analogy tasks.

>* For instance, let's say we want to know which word matches the meaning of `king` after subtracting `man` and adding `woman`.
>* Humans know that the answer is `queen`.
>* `king` - `man` = `queen` - `woman`
>* `king` - `man` + `woman` = `queen`
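
>* The arithmetic can be sketched with made-up 2-dimensional vectors, where (as an assumption for illustration) the first dimension stands for "royalty" and the second for "gender". Real embeddings do not have such clean, labeled dimensions.

```python
import numpy as np

# Hypothetical toy embeddings: [royalty, gender]
toy = {
    "king": np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man": np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman
target = toy["king"] - toy["man"] + toy["woman"]

# Rank the remaining words by similarity to the target, skipping the query words
query_words = {"king", "man", "woman"}
candidates = {w: cosine(target, v) for w, v in toy.items() if w not in query_words}
best = max(candidates, key=candidates.get)

print(best, candidates)
```

>* The query words are skipped here on purpose. When you query with a raw vector instead, nothing is skipped, so the input words themselves can show up among the nearest neighbors.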

In [65]:
vector.most_similar(positive=['king', 'woman'], negative=['man'])

[('queen', 0.7118192911148071),
 ('monarch', 0.6189674735069275),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321243286133),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593235015869),
 ('monarchy', 0.5087411403656006)]

>* Alternatively, you can add and subtract the vectors directly instead of using the `positive` and `negative` parameters.

>* `France` - `Paris` = `Italy` - `Rome`
>* `France` - `Paris` + `Rome` = `Italy`

In [66]:
w=vector['France']-vector['Paris']+vector['Rome']
vector.most_similar(np.array([w]))

[('Italy', 0.7115296125411987),
 ('Rome', 0.7092385292053223),
 ('France', 0.5904253721237183),
 ('Sicily', 0.5600441694259644),
 ('Italians', 0.5599856376647949),
 ('Flaminio_Stadium', 0.5327231287956238),
 ('Bambino_Gesu_Hospital', 0.505158007144928),
 ('Italian', 0.4975103735923767),
 ('Spain', 0.49529916048049927),
 ('Antonio_Martino', 0.4828406572341919)]

In [67]:
w=vector['Japan']-vector['Tokyo']+vector['Seoul']
vector.most_similar(np.array([w]))

[('South_Korea', 0.867038905620575),
 ('Korea', 0.8067482709884644),
 ('Seoul', 0.7641376852989197),
 ('South_Korean', 0.7190972566604614),
 ('Korean', 0.6862273216247559),
 ('Japan', 0.645173192024231),
 ('North_Korea', 0.6439769864082336),
 ('Koreans', 0.6212112903594971),
 ('Yonhap', 0.619912326335907),
 ('Pyongyang', 0.6188929677009583)]

>* Now let's train our own Word2Vec model on the week 3 data.
>* `vector_size` is the number of dimensions of each word vector.
>* `window` is the maximum distance between the current and predicted word within a sentence.
>* `min_count` is the minimum number of times a word must occur in the corpus to be kept; rarer words are ignored.

In [74]:
model=Word2Vec([row for row in content['lemmatizer_token']], vector_size=100, min_count=1, window=3)

>* Let's quickly check the most frequent unigrams so that we query the embedding with words that actually appear in the corpus.

In [69]:
from collections import Counter

In [70]:
Counter([item for row in content['punct_tokenized_unigrams'] for item in row]).most_common(10)

[('http', 2162),
 ('today', 784),
 ('house', 435),
 ('amp', 431),
 ('great', 396),
 ('new', 361),
 ('bill', 324),
 ('president', 317),
 ('act', 294),
 ('congress', 289)]

In [71]:
from gensim.models import KeyedVectors
model.wv.most_similar(positive=['congress'], topn=10)

[('amp', 0.9997757077217102),
 ('give', 0.9997571110725403),
 ('get', 0.9997456073760986),
 ('also', 0.9997426867485046),
 ('time', 0.9997410178184509),
 ('say', 0.9997406601905823),
 ('take', 0.9997357726097107),
 ('first', 0.9997175931930542),
 ('must', 0.9997175335884094),
 ('stop', 0.9997149705886841)]

In [72]:
model.wv.most_similar(positive=['president'], topn=10)

[('administration', 0.9995152354240417),
 ('say', 0.9994564056396484),
 ('obamacare', 0.9994329810142517),
 ('plan', 0.999427080154419),
 ('county', 0.9994122385978699),
 ('want', 0.9994081258773804),
 ('amp', 0.9994051456451416),
 ('go', 0.9994004964828491),
 ('show', 0.9993847608566284),
 ('need', 0.9993702173233032)]

In [73]:
model.wv.most_similar(positive=['bill'], topn=10)

[('legislation', 0.9996969103813171),
 ('vote', 0.999688982963562),
 ('bipartisan', 0.9996677041053772),
 ('say', 0.9996665120124817),
 ('support', 0.9996650815010071),
 ('act', 0.999660074710846),
 ('take', 0.9996511340141296),
 ('would', 0.9996363520622253),
 ('amp', 0.999620258808136),
 ('senate', 0.9996134638786316)]