# Topic Modeling

- Discover topics in a text corpus (a collection of documents)

- scikit-learn is a popular Python package that includes an NMF implementation: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
- Gensim is a popular Python module for topic modeling: http://radimrehurek.com/gensim/



## Algorithms

- Non-negative matrix factorization (NMF)

- Latent Dirichlet Allocation (LDA)

# 1. Non-Negative Matrix Factorization (NMF or NNMF)

NMF is a group of algorithms in multivariate analysis and linear algebra in which a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have **no negative elements**.

This non-negativity makes the resulting matrices easier to inspect. 

Also, in applications such as processing of audio spectrograms or muscular activity, 
non-negativity is inherent to the data being considered. 

Since the problem is not exactly solvable in general, it is commonly approximated numerically.

Source:
https://en.wikipedia.org/wiki/Non-negative_matrix_factorization
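
As a minimal sketch of the factorization described above, the following cell approximates a small non-negative matrix V by the product of W and H using scikit-learn's `NMF`. The matrix values, `init="random"`, `random_state=0`, and `max_iter=1000` are illustrative assumptions, not settings used elsewhere in this notebook.

```python
import numpy as np
from sklearn.decomposition import NMF

# A small non-negative matrix V (4 samples x 3 features); the values are made up.
V = np.array([
    [1.0, 0.5, 0.0],
    [2.0, 1.0, 0.0],
    [0.0, 0.2, 1.0],
    [0.0, 0.4, 2.0],
])

# Factorize V into W (4 x 2) and H (2 x 3).
# random_state fixes the random initialization so reruns give the same factors.
model = NMF(n_components=2, init="random", random_state=0, max_iter=1000)
W = model.fit_transform(V)
H = model.components_

print(W.shape, H.shape)
print(np.round(W @ H, 2))  # approximate reconstruction of V
```

Both W and H contain only non-negative entries, and their product only approximates V, which is why the result is checked by looking at the reconstruction rather than expecting exact equality.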

## (1) Open the JSON File and Create a List of 1K Tweets Text, "corpus_contents", For TF-IDF Vectorization

In [1]:
import json
import numpy as np
from pprint import pprint

In [2]:
# Open the file with a context manager so it is closed automatically
with open('tweet_stream_halloween_1000.json') as infile:
    data = json.load(infile)

In [3]:
data[0].keys()

dict_keys(['created_at', 'id', 'id_str', 'text', 'source', 'truncated', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'retweeted_status', 'is_quote_status', 'quote_count', 'reply_count', 'retweet_count', 'favorite_count', 'entities', 'favorited', 'retweeted', 'filter_level', 'lang', 'timestamp_ms'])

In [4]:
data[0]['text']

'RT @xbrinni: Haunting Ground! 🦇\n\nHalloween shoot this year was another one of our favorite PS2 games! Photos to come soon! \n\nDaniella is @k…'

In [5]:
corpus_contents = []

for t in data:
    corpus_contents.append(t['text'])

In [6]:
pprint(corpus_contents)

['RT @xbrinni: Haunting Ground! 🦇\n'
 '\n'
 'Halloween shoot this year was another one of our favorite PS2 games! Photos '
 'to come soon! \n'
 '\n'
 'Daniella is @k…',
 'RT @sarah_schlosser: I don’t think I’ve ever laughed so hard, this guy wins '
 'Halloween https://t.co/glV3sVzatR',
 'RT @partycasino: Whats the most out there costume you saw last night? Tag us '
 'in your most crazy Halloween costume pics for a chance to be f…',
 'RT @NCTsmtown: NCT 127 ‘Regular’ Halloween Costume Ver.\n'
 '\n'
 '#NCT127 #NCT127_Regular\n'
 '#NCT \n'
 '#DancePractice #Halloween https://t.co/t6u7nqFEku',
 'RT @NCTsmtown: NCT DREAM ‘We Go Up’ Halloween Costume Ver.\n'
 '\n'
 '#NCTDREAM #NCTDREAM_WeGoUp\n'
 '#NCT\n'
 '#DancePractice #Halloween https://t.co/TsyAoOIr…',
 'RT @SidemenClothing: 🚨 COMPETITION TIME 🚨\n'
 '\n'
 "It's Halloween which means that it's time for another treat! For your chance "
 'to win this unrelea…',
 'RT @_parkercurry: Happy #Halloween! Can you guess who I am? '
 'https://t.co/

In [7]:
len(corpus_contents)

1000

## (2) Vectorize the Corpus With TfidfVectorizer and Create a List of Unique Words

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create a TfidfVectorizer instance that removes English stop words (stop_words = 'english')
# and ignores terms that appear in fewer than 2 documents (min_df = 2; an integer is a document count, not a percentage).
vectorizer = TfidfVectorizer(stop_words = 'english', min_df = 2)

# Using the TfidfVectorizer instance, tokenize all the strings in the corpus
# and return the document-term TF-IDF matrix, with one row vector per string
doc_term_matrix = vectorizer.fit_transform(corpus_contents)

print(doc_term_matrix.shape) # 1000 documents (tweets) and 887 unique words
print(doc_term_matrix)

(1000, 887)
  (0, 697)	0.38511866470801
  (0, 145)	0.29410374385748145
  (0, 572)	0.38511866470801
  (0, 280)	0.3688500277807765
  (0, 249)	0.31164323763533236
  (0, 870)	0.2778351069302481
  (0, 671)	0.38511866470801
  (0, 321)	0.0704408628923014
  (0, 339)	0.38511866470801
  (0, 634)	0.07008089509576916
  (1, 295)	0.3635621632653926
  (1, 370)	0.07912575631100026
  (1, 852)	0.3635621632653926
  (1, 312)	0.3409615244909181
  (1, 333)	0.3249261160886729
  (1, 434)	0.3635621632653926
  (1, 815)	0.2829108504511952
  (1, 762)	0.28988744825261886
  (1, 202)	0.2636894301374788
  (1, 642)	0.3635621632653926
  (1, 321)	0.06943101685389305
  (1, 634)	0.06907620958547414
  (2, 127)	0.37558505701129974
  (2, 573)	0.36471447314483835
  (2, 163)	0.37558505701129974
  :	:
  (996, 370)	0.12038542494081285
  (997, 167)	0.552078495846353
  (997, 416)	0.439695633558456
  (997, 379)	0.4958870647024045
  (997, 78)	0.4163740824955723
  (997, 330)	0.22152436399370987
  (997, 370)	0.11507878813600508
  (997

In [9]:
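# Create a List of Unique Vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() requires 1.0 or later)
unique_words = vectorizer.get_feature_names_out().tolist()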
print(len(unique_words)) # Number of unique words
print(unique_words[:10])

887
['0tdi2juxbb', '10', '100', '11th', '127', '151029', '181031', '181101', '19', '1e50a214']
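
The `min_df` parameter behaves differently for integers and floats, which is easy to misread. The sketch below uses a made-up four-document corpus (not the tweet data) to show that an integer is an absolute document count while a float is a proportion of documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for illustration only.
docs = [
    "happy halloween party",
    "halloween costume party",
    "halloween pumpkin",
    "spooky costume",
]

# Integer min_df: keep terms that occur in at least 2 documents.
vec_count = TfidfVectorizer(min_df=2)
vec_count.fit(docs)
print(sorted(vec_count.vocabulary_))  # ['costume', 'halloween', 'party']

# Float min_df: keep terms that occur in at least 75% of the documents (3 of 4).
vec_ratio = TfidfVectorizer(min_df=0.75)
vec_ratio.fit(docs)
print(sorted(vec_ratio.vocabulary_))  # ['halloween']
```

To drop terms that appear in fewer than 2% of the documents, the float `min_df = 0.02` would be passed instead of the integer `2`.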


## (3) NMF Decomposition Using the Document-Term Matrix From TfidfVectorizer

Scikit-learn NMF
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html

In [10]:
from sklearn import decomposition

# Set the Desired Number of Topics
num_topics = 5

# Create an NMF Model With the Assigned Number of Topics
# (It Is Named "clf" Here, Although NMF Is a Decomposition Model, Not a Classifier)
clf = decomposition.NMF(n_components = num_topics)

# Fit the NMF Model to the Document-Term TF-IDF Matrix
# and Return the Document-Topic Matrix (Number of Documents x Number of Topics)

doc_top_matrix = clf.fit_transform(doc_term_matrix)

print(doc_top_matrix.shape) # Check the shape of the matrix
print(doc_top_matrix)

(1000, 5)
[[0.         0.         0.         0.00027065 0.05465437]
 [0.         0.         0.         0.         0.08909413]
 [0.         0.         0.         0.05540277 0.0472334 ]
 ...
 [0.0917804  0.         0.         0.         0.03721288]
 [0.         0.         0.         0.         0.06263012]
 [0.00291224 0.         0.0011032  0.00196992 0.08234749]]




> **< name of the fitted model >.components_** returns the topic-term matrix, with shape (number of topics, number of terms).

In [11]:
top_term_matrix = clf.components_

print(top_term_matrix.shape)
print(top_term_matrix)

(5, 887)
[[0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.00269896 0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.01503725 0.00983586 0.00430947 ... 0.00283797 0.00283797 0.00283797]]
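
One way to read the document-topic matrix is to pick the topic with the largest weight in each row. The sketch below uses a small hypothetical array with the same layout as `doc_top_matrix`; its values are illustrative and were not produced by the model above.

```python
import numpy as np

# Hypothetical document-topic weights (3 documents x 5 topics).
doc_topic = np.array([
    [0.00, 0.00, 0.00, 0.0003, 0.0547],
    [0.09, 0.00, 0.00, 0.0000, 0.0372],
    [0.00, 0.00, 0.00, 0.0000, 0.0000],
])

dominant_topic = doc_topic.argmax(axis=1)  # index of the largest weight in each row
has_topic = doc_topic.sum(axis=1) > 0      # False for rows that are all zeros

for i, (topic, ok) in enumerate(zip(dominant_topic, has_topic)):
    # argmax returns 0 for an all-zero row, so report None in that case
    print(i, topic if ok else None)
```

The all-zero check matters because a tweet whose terms were all filtered out (by stop words or `min_df`) has no weight on any topic, and `argmax` would otherwise silently assign it to topic 0.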


## (4) Now Let's Try to See the Constructed Topics

In [12]:
import numpy as np

topic_1 = clf.components_[0]
topic_1[:10]

array([0.        , 0.        , 0.        , 0.00732722, 0.        ,
       0.        , 0.        , 0.        , 0.        , 0.        ])

In [13]:
# We Need the Indices of the Top Keywords for Each Topic
# How to Find Them? Sorting the Weights Directly Would Lose Their Original Indices
# np.argsort Returns the Indices That Would Sort the Array Instead

num_top_words = 5 # The top 5 words of each topic

# Get the Indices of the 5 Largest Weights (From Smallest to Largest)
np.argsort(topic_1)[-num_top_words:]

array([747, 634, 321, 370, 330], dtype=int64)

In [14]:
# We Need the Top 5 Words From Top-Down
np.argsort(topic_1)[-num_top_words:][::-1] # [::-1] reverses the order, so the largest weight comes first

array([330, 370, 321, 634, 747], dtype=int64)

In [15]:
print(topic_1[330], topic_1[370], topic_1[321], topic_1[634], topic_1[747])

2.310898617930497 0.9031976919187378 0.8846703982655016 0.7601853683679839 0.19049809191401104


In [16]:
# We Can Use unique_words
print(unique_words[330], unique_words[370], unique_words[321], unique_words[634], unique_words[747])

happy https halloween rt tedcruz
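
The `np.argsort` steps above are easier to follow on a tiny array. The weights below are made up so that the result can be checked by eye.

```python
import numpy as np

# Made-up weights for five terms (indices 0 to 4).
weights = np.array([0.2, 2.3, 0.0, 0.9, 0.7])

order = np.argsort(weights)  # indices from smallest to largest weight
top3 = order[-3:][::-1]      # last three indices, reversed: largest weight first

print(order)  # [2 0 4 3 1]
print(top3)   # [1 3 4]
```

Index 1 comes first in `top3` because it holds the largest weight (2.3), followed by index 3 (0.9) and index 4 (0.7).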


In [17]:
import numpy as np

topic_words = []
num_top_words = 5 # The top 5 words of each topic

# Go Over Each Component/Topic
for topic in clf.components_:

    # Get the Indices of the 5 Largest Weights (From Smallest to Largest)
    word_idx = np.argsort(topic)[-num_top_words:]
    
    temp_lst = []
    # Let's See the Words Corresponding to the Indices
    for idx in word_idx[::-1]: # To access the largest weights first, reverse the array using [::-1]
        temp_lst.append(unique_words[idx]) # Append a keyword of the topic to temp_lst
        
    topic_words.append(temp_lst) # Append the topic's keyword list to topic_words

In [18]:
from pprint import pprint
pprint(topic_words)

[['happy', 'https', 'halloween', 'rt', 'tedcruz'],
 ['love', '40carze8gt', 'nailed', 'parker', 'michelleobama'],
 ['_jalenrobinson_', '1i5o551jt6', 'won', 'https', 'rt'],
 ['nct', 'ver', 'costume', 'dancepractice', 'nctsmtown'],
 ['https', 'halloween', 'rt', 'like', 'dressed']]


## (5) Summary

In [19]:
import json
import numpy as np
from pprint import pprint
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import decomposition

with open('tweet_stream_halloween_1000.json') as infile:
    data = json.load(infile)

corpus_contents = []

for t in data:
    corpus_contents.append(t['text'])
    
vectorizer = TfidfVectorizer(stop_words = 'english', min_df = 2)
doc_term_matrix = vectorizer.fit_transform(corpus_contents)

unique_words = vectorizer.get_feature_names_out().tolist() # get_feature_names() was removed in scikit-learn 1.2

num_topics = 5

clf = decomposition.NMF(n_components = num_topics)
doc_top_matrix = clf.fit_transform(doc_term_matrix)
top_term_matrix = clf.components_

topic_words = []
num_top_words = 5 

for topic in clf.components_:
    word_idx = np.argsort(topic)[-num_top_words:]
    temp_lst = []
    for idx in word_idx[::-1]: 
        temp_lst.append(unique_words[idx])
    topic_words.append(temp_lst) 
    
pprint(topic_words)

[['happy', 'https', 'halloween', 'rt', 'tedcruz'],
 ['love', '40carze8gt', 'nailed', 'parker', 'michelleobama'],
 ['_jalenrobinson_', '1i5o551jt6', 'won', 'https', 'rt'],
 ['nct', 'ver', 'costume', 'dancepractice', 'nctsmtown'],
 ['https', 'halloween', 'rt', 'like', 'dressed']]
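
The keyword-extraction loop in the summary could also be wrapped in a reusable function. The following is one possible sketch; the function name `top_words_per_topic` and the toy inputs are hypothetical and are not part of the notebook above.

```python
import numpy as np

def top_words_per_topic(components, vocabulary, num_top_words=5):
    """Return the highest-weighted words for each topic, largest weight first."""
    vocabulary = np.asarray(vocabulary)
    # Sort each row ascending, reverse it, then keep the first num_top_words indices.
    top_idx = np.argsort(components, axis=1)[:, ::-1][:, :num_top_words]
    return [vocabulary[row].tolist() for row in top_idx]

# Toy topic-term matrix (2 topics x 4 terms) and vocabulary, invented for illustration.
toy_components = np.array([
    [0.1, 2.0, 0.0, 0.5],
    [0.9, 0.0, 0.3, 0.1],
])
toy_vocabulary = ["candy", "halloween", "nct", "party"]

toy_topics = top_words_per_topic(toy_components, toy_vocabulary, num_top_words=2)
print(toy_topics)  # [['halloween', 'party'], ['candy', 'nct']]
```

With the fitted model from the summary, the equivalent call would be `top_words_per_topic(clf.components_, unique_words)`.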




## (6) Practice: Customizing Stopwords For Topic Modeling With 1K Tweets

In [20]:
from sklearn.feature_extraction import text 

In [21]:
my_additional_stop_word_list = ['rt', 'https']

In [22]:
stop_words = text.ENGLISH_STOP_WORDS.union(my_additional_stop_word_list)

In [23]:
stop_words

frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
           'also',
           'although',
           'always',
           'am',
           'among',
           'amongst',
           'amoungst',
           'amount',
           'an',
           'and',
           'another',
           'any',
           'anyhow',
           'anyone',
           'anything',
           'anyway',
           'anywhere',
           'are',
           'around',
           'as',
           'at',
           'back',
           'be',
           'became',
           'because',
           'become',
           'becomes',
           'becoming',
           'been',
           'before',
           'beforehand',
           'behind',
           'being',
           'below',
           'beside',
           'besides'

In [24]:
import json
import numpy as np
from pprint import pprint
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text 
from sklearn import decomposition

with open('tweet_stream_halloween_1000.json') as infile:
    data = json.load(infile)

corpus_contents = []

for t in data:
    corpus_contents.append(t['text'])

my_additional_stop_word_list = ['rt', 'https']
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(my_additional_stop_word_list)) # recent scikit-learn versions expect a list here, not a frozenset

vectorizer = TfidfVectorizer(stop_words = my_stop_words, min_df = 2)
doc_term_matrix = vectorizer.fit_transform(corpus_contents)

unique_words = vectorizer.get_feature_names_out().tolist() # get_feature_names() was removed in scikit-learn 1.2

num_topics = 5

clf = decomposition.NMF(n_components = num_topics)
doc_top_matrix = clf.fit_transform(doc_term_matrix)
top_term_matrix = clf.components_

topic_words = []
num_top_words = 5 

for topic in clf.components_:
    word_idx = np.argsort(topic)[-num_top_words:]
    temp_lst = []
    for idx in word_idx[::-1]: 
        temp_lst.append(unique_words[idx])
    topic_words.append(temp_lst) 
    
pprint(topic_words)

[['happy', 'halloween', 'tedcruz', 'jigtaimzep', 'amp'],
 ['love', '40carze8gt', 'nailed', 'parker', 'michelleobama'],
 ['1i5o551jt6', '_jalenrobinson_', 'won', 'halloween', 'dressed'],
 ['nct', 'ver', 'costume', 'dancepractice', 'nctsmtown'],
 ['halloween', 'like', 'dressed', 'dress', 'party']]
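
To confirm that the custom stop words are actually dropped, the effect can be checked on a small made-up corpus. This is a sketch under the assumption that a recent scikit-learn release is installed, where `stop_words` should be `'english'`, a list, or `None`; the toy tweets are invented for illustration.

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy tweets, invented for illustration only.
toy_tweets = [
    "RT happy halloween https",
    "RT halloween costume party https",
    "happy costume party",
]

# Extend the built-in English stop words and convert the result to a list.
custom_stop_words = list(text.ENGLISH_STOP_WORDS.union(["rt", "https"]))

toy_vectorizer = TfidfVectorizer(stop_words=custom_stop_words)
toy_vectorizer.fit(toy_tweets)

print(sorted(toy_vectorizer.vocabulary_))  # ['costume', 'halloween', 'happy', 'party']
```

The stop words are written in lowercase because the vectorizer lowercases the text before filtering, so `'RT'` in a tweet is matched by `'rt'` in the list.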


