# compare NMF and LDA models in topic modeling
https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df

## 1. open a testing document (tokenized)
I have given up on building a single topic model over all app descriptions (across all panels) because
the corpus is too large. Instead, I will fit a topic model for each app (with all of its panels combined) and
use k-means to cluster apps by how broad or niche they are.

In [1]:
%matplotlib inline
%run -i '0_paths.py'
%run -i '4_natural_language_processing.py'
initial_panel = '201908'
panels_have_text = ['201912', '202001', '202003', '202004', '202009', '202010', '202011']
DF = combine_tokenized_cols_into_single_col(initial_panel, panels_have_text)
tokenized_cols = ['description_' + item for item in panels_have_text]
doc_sample = DF.loc[DF.index[900], 'combined_panels_description']
print(len(doc_sample))  # number of tokens in the combined column
for i in tokenized_cols:
    print(len(DF.loc[DF.index[900], i]))  # number of tokens in each panel's description
doc_sample



1566
228
228
222
222
222
222
222


['want',
 'head',
 'chef',
 'sushi',
 'restaurant',
 'check',
 'toca',
 'kitchen',
 'sushi',
 'new',
 'app',
 'toca',
 'boca',
 'want',
 'play',
 'food',
 'toca',
 'boca',
 'bring',
 'toca',
 'kitchen',
 'cook',
 'play',
 'prepare',
 'food',
 'hungry',
 'character',
 'pick',
 'ingredient',
 'prepare',
 'way',
 'slice',
 'boil',
 'fry',
 'cook',
 'microwave',
 'mix',
 'wait',
 'hungry',
 'friend',
 'response.\\nwe',
 'create',
 'educational',
 'app',
 'kid',
 'mud',
 'cake',
 'new',
 'face',
 'complete',
 'virtual',
 'makeover',
 'kid',
 'love',
 'character',
 'fearful',
 'reaction',
 'different',
 'meal',
 'giggle',
 'cat',
 'like',
 'salty',
 'fish',
 'boy',
 'eat',
 'potato',
 'green',
 'mix',
 'match',
 'meal',
 'high',
 'score',
 'level',
 'time',
 'limit',
 'fun',
 'not',
 'end',
 'cooking',
 'app.\\n-',
 'cute',
 'character',
 'cook',
 'favorite',
 'food',
 '\\n-',
 '12',
 'different',
 'ingredient',
 'prepare',
 '180',
 'different',
 'way',
 '\\n-',
 'slice',
 'boil',
 'fry',
 '

## 2. NMF (Non-Negative Matrix Factorization) modeling
The goal of NMF is to find two non-negative matrices (W, H) whose product approximates
the non-negative matrix X. This factorization can be used, for example, for
dimensionality reduction, source separation, or topic extraction.
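To make the factorization X ≈ WH concrete, here is a minimal pure-Python sketch using the classic multiplicative update rules. It is an illustration under stated assumptions, not the solver the pipeline below runs (scikit-learn's `NMF` defaults to coordinate descent); the helper names and the toy matrix are hypothetical.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(x, w, h):
    """Frobenius norm of X - WH."""
    wh = matmul(w, h)
    return sum((x[i][j] - wh[i][j]) ** 2
               for i in range(len(x)) for j in range(len(x[0]))) ** 0.5

def nmf(x, n_components, n_iter=200, seed=1, eps=1e-9):
    """Factor non-negative X (n x m) into W (n x k) and H (k x m)."""
    rng = random.Random(seed)
    n, m = len(x), len(x[0])
    # Strictly positive random start, so every entry stays non-negative.
    w = [[rng.random() + 0.1 for _ in range(n_components)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(n_components)]
    for _ in range(n_iter):
        # H <- H * (W^T X) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(n_components)]
        # W <- W * (X H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_components)]
             for i in range(n)]
    return w, h

# Hypothetical document-term matrix: two documents share terms 0-1, two share terms 2-3.
X = [[2, 1, 0, 0],
     [1, 2, 0, 0],
     [0, 0, 3, 1],
     [0, 0, 1, 3]]

W0, H0 = nmf(X, n_components=2, n_iter=0)  # the random starting point
W, H = nmf(X, n_components=2)
print(frobenius_error(X, W0, H0), frobenius_error(X, W, H))
```

In the topic-modeling reading, each row of H is a topic (a weight per term) and each row of W says how strongly a document loads on each topic.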

The goal of using tf-idf instead of the raw frequencies of occurrence of a token
in a given document is to scale down the impact of tokens that occur very frequently
in a given corpus and that are hence empirically
less informative than features that occur in a small fraction of the training corpus.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html

https://shravan-kuchkula.github.io/topic-modeling/#vectorize-the-reviews
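As a rough illustration of the down-weighting described above, the sketch below computes tf-idf by hand with the smoothed idf formula that scikit-learn documents for `smooth_idf=True`, followed by L2 normalisation. The function name and the toy corpus are assumptions made for this example; the pipeline itself uses `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {token: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + count)) + 1 for tok, count in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = {tok: count * idf[tok] for tok, count in tf.items()}
        # L2-normalise each document vector (guard against empty documents).
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        rows.append({tok: v / norm for tok, v in raw.items()})
    return rows

# Hypothetical toy corpus: "app" occurs in every document, "sushi" in only one.
docs = [["app", "sushi", "cook"],
        ["app", "kid", "cook"],
        ["app", "kid", "play"]]
weights = tfidf(docs)
print(weights[0])
```

In the first document every token occurs once, yet "app" ends up with the smallest weight because it appears in all three documents, while "sushi" gets the largest.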

In [2]:
%run -i '4_natural_language_processing.py'
pipe = Pipeline(steps = [('tfidf', TfidfVectorizer()),
                         ('nmf', NMF(n_components=1, init='nndsvd',
                                     random_state=1, alpha=.1, l1_ratio=.5))])
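# Each pipeline step can be accessed by its key (the user-defined name), e.g. pipe['nmf'].
#
# alpha is the constant that multiplies the regularization terms. Set it to zero to have no regularization.
#
# NOTE: n_components=1 is the setting adopted in the conclusion. The cells below index
# components_[1] and list 10 topics, so their outputs must come from a fit with at least
# 10 components (presumably n_components=10).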





In [4]:
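# Fit first: the steps below read the fitted vocabulary and components.
# doc_sample is a list of tokens, so the vectorizer treats each list element as a separate document.
pipe.fit_transform(doc_sample)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on newer versions.
nmf_feature_names = pipe['tfidf'].get_feature_names()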
print(nmf_feature_names)

['12', '130', '180', '215', 'advertising', 'app', 'aspire', 'assure', 'award', 'believe', 'boca', 'boil', 'boy', 'bring', 'cake', 'cat', 'character', 'check', 'chef', 'complete', 'cook', 'cooking', 'country', 'create', 'creative', 'creativity', 'customer', 'cute', 'deliver', 'design', 'different', 'download', 'eat', 'educational', 'empower', 'end', 'ended', 'experience', 'explore', 'face', 'favorite', 'fearful', 'fish', 'food', 'free', 'friend', 'friendly', 'fry', 'fun', 'giggle', 'green', 'head', 'help', 'high', 'hungry', 'imagination', 'include', 'ingredient', 'interface', 'issue', 'kid', 'kitchen', 'learn', 'let', 'level', 'like', 'limit', 'love', 'makeover', 'match', 'matter', 'meal', 'microwave', 'million', 'mix', 'mud', 'music', 'nabout', 'nas', 'nat', 'new', 'not', 'nprivacy', 'ntoca', 'nwe', 'nwhat', 'offer', 'open', 'party', 'perfect', 'perspective', 'pick', 'play', 'playful', 'policy', 'potato', 'power', 'prepare', 'privacy', 'product', 'professional', 'purchase', 'purchases'

In [19]:
# print(nmf_output)
nmf_weights = pipe['nmf'].components_[1]
print(len(nmf_weights))
print(nmf_weights)
#
nmf_weights_indices = pipe['nmf'].components_[1].argsort() # ascending
print(nmf_weights_indices)
print(nmf_weights_indices[-1]) # index of the largest weight
print(nmf_weights[nmf_weights_indices[-1]]) # the largest weight (4.094, for 'app', in the output below)

131
[0.         0.         0.         0.         0.         4.09394365
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.         0.         0.
 0.         0.         0.         0.        

In [20]:
# descending, top 20 indices of biggest 20 weights
nmf_weights_indices = pipe['nmf'].components_[1].argsort()[:-20 - 1:-1]
print(nmf_weights_indices)
words = [nmf_feature_names[key] for key in nmf_weights_indices]
print(words)

[  5 130  40  46  45  44  43  42  41  39  48  38  37  36  35  34  47  49
  32  50]
['app', 'world', 'favorite', 'friendly', 'friend', 'free', 'food', 'fish', 'fearful', 'face', 'fun', 'explore', 'experience', 'ended', 'end', 'empower', 'fry', 'giggle', 'eat', 'green']


In [21]:
# https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df
def get_nmf_topics(pipe, num_topics):
    #the word ids obtained need to be reverse-mapped to the words so we can print the topic names.
    feat_names = pipe['tfidf'].get_feature_names()
    word_dict = {}
    for i in range(num_topics):
        #for each topic, obtain the largest values, and add the words they map to into the dictionary.
        words_ids = pipe['nmf'].components_[i].argsort()[:-20 - 1:-1]
        words = [feat_names[key] for key in words_ids]
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = words
    return pd.DataFrame(word_dict)

In [22]:
nmf_topic_df = get_nmf_topics(pipe, num_topics=10)
nmf_topic_df

| | Topic # 01 | Topic # 02 | Topic # 03 | Topic # 04 | Topic # 05 | Topic # 06 | Topic # 07 | Topic # 08 | Topic # 09 | Topic # 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | kid | app | toca | cook | boca | kitchen | play | want | food | fun |
| 1 | world | world | world | world | nat | world | world | world | world | eat |
| 2 | fry | favorite | favorite | fry | fish | fry | fry | favorite | eat | empower |
| 3 | empower | friendly | friendly | empower | fry | empower | empower | friendly | empower | end |
| 4 | end | friend | friend | end | friendly | end | end | friend | end | ended |
| 5 | ended | free | free | ended | friend | ended | ended | free | ended | experience |
| 6 | experience | food | food | experience | free | experience | experience | food | experience | explore |
| 7 | explore | fish | fish | explore | food | explore | explore | fish | explore | face |
| 8 | face | fearful | fearful | face | fearful | face | face | fearful | face | favorite |
| 9 | favorite | face | face | favorite | giggle | favorite | favorite | face | favorite | fearful |


## 3. LDA (Latent Dirichlet Allocation) modeling
LDA, or Latent Dirichlet Allocation, is a probabilistic model. To obtain cluster assignments,
it uses two probabilities: P(word | topic) and P(topic | document).
Starting from an initial random assignment, these probabilities are recomputed
for each word in each document to decide that word's topic assignment. The procedure is
iterated, recalculating the probabilities each time, until the algorithm converges.

https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df
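The iterative procedure described above can be sketched as a small collapsed Gibbs sampler. This is a teaching sketch under stated assumptions, not what the pipeline below runs: scikit-learn's `LatentDirichletAllocation` uses variational Bayes. The function name, the priors `alpha` and `beta`, and the toy documents are hypothetical, and a fixed number of sweeps stands in for a real convergence check.

```python
import random

def gibbs_lda(docs, n_topics, n_iter=50, alpha=0.1, beta=0.1, seed=1):
    """Collapsed Gibbs sampling for LDA on a list of tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({tok for doc in docs for tok in doc})
    # Initial random topic assignment for every token occurrence.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * n_topics for _ in docs]      # counts behind P(topic | document)
    topic_word = [dict() for _ in range(n_topics)]  # counts behind P(word | topic)
    topic_total = [0] * n_topics
    for d, doc in enumerate(docs):
        for i, tok in enumerate(doc):
            k = z[d][i]
            doc_topic[d][k] += 1
            topic_word[k][tok] = topic_word[k].get(tok, 0) + 1
            topic_total[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, tok in enumerate(doc):
                k = z[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][tok] -= 1
                topic_total[k] -= 1
                # Unnormalised P(topic | document) * P(word | topic) for each topic.
                scores = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(tok, 0) + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                # Resample the topic in proportion to those scores.
                k = rng.choices(range(n_topics), weights=scores)[0]
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][tok] = topic_word[k].get(tok, 0) + 1
                topic_total[k] += 1
    return z, doc_topic, topic_word

# Hypothetical toy corpus: two cooking documents and two policy documents.
docs = [["sushi", "chef", "cook", "sushi"],
        ["cook", "fry", "boil", "cook"],
        ["privacy", "policy", "purchase"],
        ["policy", "privacy", "policy"]]
z, doc_topic, topic_word = gibbs_lda(docs, n_topics=2)
for counts in doc_topic:
    print(counts)  # number of tokens assigned to each topic, per document
```

Each printed row counts how many of a document's tokens ended up in each topic; dividing a row by its sum gives an estimate of P(topic | document).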

In [23]:
%run -i '4_natural_language_processing.py'
pipe = Pipeline(steps = [('count', CountVectorizer()),
                         ('lda', LatentDirichletAllocation(n_components=10, max_iter=100,
                                    learning_method='online', random_state=1,
                                    batch_size=128, evaluate_every=-1, n_jobs=-1))])
# Each pipeline step can be accessed by its key (the user-defined name), e.g. pipe['lda'].
lda_output = pipe.fit_transform(doc_sample)





In [24]:
lda_feature_names = pipe['count'].get_feature_names()
print(lda_feature_names)

['12', '130', '180', '215', 'advertising', 'app', 'aspire', 'assure', 'award', 'believe', 'boca', 'boil', 'boy', 'bring', 'cake', 'cat', 'character', 'check', 'chef', 'complete', 'cook', 'cooking', 'country', 'create', 'creative', 'creativity', 'customer', 'cute', 'deliver', 'design', 'different', 'download', 'eat', 'educational', 'empower', 'end', 'ended', 'experience', 'explore', 'face', 'favorite', 'fearful', 'fish', 'food', 'free', 'friend', 'friendly', 'fry', 'fun', 'giggle', 'green', 'head', 'help', 'high', 'hungry', 'imagination', 'include', 'ingredient', 'interface', 'issue', 'kid', 'kitchen', 'learn', 'let', 'level', 'like', 'limit', 'love', 'makeover', 'match', 'matter', 'meal', 'microwave', 'million', 'mix', 'mud', 'music', 'nabout', 'nas', 'nat', 'new', 'not', 'nprivacy', 'ntoca', 'nwe', 'nwhat', 'offer', 'open', 'party', 'perfect', 'perspective', 'pick', 'play', 'playful', 'policy', 'potato', 'power', 'prepare', 'privacy', 'product', 'professional', 'purchase', 'purchases'

In [25]:
# print(lda_output)
lda_weights = pipe['lda'].components_[1]
print(len(lda_weights))
print(lda_weights)
#
lda_weights_indices = pipe['lda'].components_[1].argsort() # ascending
print(lda_weights_indices)
print(lda_weights_indices[-1]) # the largest weight index
print(lda_weights[lda_weights_indices[-1]]) # 106.06 printed the largest weight

131
[ 0.1         0.1         0.1         0.1         0.1         0.1
  0.1         6.69366489  0.1         0.1        42.75411162 13.21930135
  0.1         0.1         0.1         0.1         0.1         0.1
  0.1         0.1         0.1         0.1         0.1         0.1
  0.1         0.1         0.1         0.1         0.1         0.1
  0.1         0.1         0.1         0.1         0.1         0.1
  0.1         0.1         0.1         0.1         0.1         6.65031313
  0.1        26.32000348  0.1         0.1         0.1         0.1
  0.1         0.1         6.66280169  0.1         0.1         0.1
  0.1         0.1         0.1         0.1         0.1         0.1
  0.1         0.1         0.1         0.1         0.1         0.1
  0.1         0.1         0.1         0.1         0.1         0.1
  0.1         0.1         0.1         0.1         0.1         0.1
  0.1         0.10001954  0.1         0.1         0.1         0.1
  0.1         0.1         0.1         0.1         6.668988

In [26]:
# descending, top 20 indices of biggest 20 weights
lda_weights_indices = pipe['lda'].components_[1].argsort()[:-20 - 1:-1] # descending order the largest to 20th largest
print(lda_weights_indices)
words = [lda_feature_names[key] for key in lda_weights_indices]
print(words)

[123  10  92  43  11   7 130  88  50 111  41  79  20 100   6 110  96  17
 108  38]
['toca', 'boca', 'play', 'food', 'boil', 'assure', 'world', 'party', 'green', 'salty', 'fearful', 'nat', 'cook', 'professional', 'aspire', 'safe', 'power', 'check', 'roleplay', 'explore']
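The LDA weights printed above are not probabilities. The scikit-learn documentation describes `components_` as pseudocounts that become a per-topic word distribution once each row is divided by its sum. The repeated 0.1 entries are presumably the default topic-word prior (1 / n_components with 10 topics) for words that received no assignments in that topic. A minimal sketch of the normalisation, with a hypothetical helper name and made-up values shaped like the printed ones:

```python
def topic_word_distribution(components):
    """Divide each topic's row of pseudocounts by the row sum."""
    result = []
    for row in components:
        total = sum(row)
        result.append([value / total for value in row])
    return result

# Hypothetical 2-topic, 4-word matrix; the values only mimic the printed weights.
components = [
    [0.1, 42.75, 13.22, 0.1],
    [6.69, 0.1, 0.1, 26.32],
]
probs = topic_word_distribution(components)
for row in probs:
    print([round(p, 3) for p in row])
```

Normalising does not change the ranking of words within a topic, so the argsort-based top-word lists stay the same; it only makes weights comparable across topics.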


In [27]:
# https://medium.com/ml2vec/topic-modeling-is-an-unsupervised-learning-approach-to-clustering-documents-to-discover-topics-fdfbf30e27df
def get_lda_topics(pipe, num_topics):
    #the word ids obtained need to be reverse-mapped to the words so we can print the topic names.
    feat_names = pipe['count'].get_feature_names()
    word_dict = {}
    for i in range(num_topics):
        #for each topic, obtain the largest values, and add the words they map to into the dictionary.
        words_ids = pipe['lda'].components_[i].argsort()[:-20 - 1:-1]
        words = [feat_names[key] for key in words_ids]
        word_dict['Topic # ' + '{:02d}'.format(i+1)] = words
    return pd.DataFrame(word_dict)

In [28]:
lda_topic_df = get_lda_topics(pipe, num_topics=10)
lda_topic_df


| | Topic # 01 | Topic # 02 | Topic # 03 | Topic # 04 | Topic # 05 | Topic # 06 | Topic # 07 | Topic # 08 | Topic # 09 | Topic # 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | kid | toca | design | want | fun | chef | app | cook | kitchen | prepare |
| 1 | privacy | boca | eat | different | learn | policy | high | score | product | imagination |
| 2 | time | play | fry | cooking | way | read | character | let | like | slice |
| 3 | ingredient | food | meal | wait | hungry | experience | new | limit | offer | microwave |
| 4 | open | boil | country | issue | matter | win | empower | mix | 180 | 130 |
| 5 | safe | assure | work | include | ended | download | customer | sushi | reaction | stressful |
| 6 | power | world | million | rule | seriously | creative | complete | perspective | makeover | deliver |
| 7 | explore | party | award | boy | 215 | spark | friend | creativity | pick | explore |
| 8 | roleplay | green | playful | giggle | match | music | kid | friendly | nat | power |
| 9 | aspire | salty | believe | create | educational | rest | power | end | purchases | open |


Please refer to the Excel file saved as 'lda_output_compare_with_nmf_output'. After eyeballing both outputs, I concluded that
NMF's topic words make more sense. Surprisingly, and contrary to what I had always assumed, each NMF topic does not
represent a single distinct theme; rather, topics 1 through 10 all look much the same.
Only the first word differs from topic to topic, while the words ranked after it are nearly identical across
topics. This is reasonable because argsort is applied within each component's (each topic's) weight vector, and, as the NMF weights printed in section 2 show, almost every weight is zero, so the ranking beyond the top word merely orders tied zeros.
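The effect of the `argsort()[:-20 - 1:-1]` slice on a weight vector that is almost all zeros can be reproduced in a few lines. This sketch uses a stable pure-Python argsort and a hypothetical 10-element weight vector. NumPy's default sort is not guaranteed to be stable, so the order it gives to tied zeros may differ from the one here, but the point is the same: everything after the top word is a tie.

```python
def argsort(values):
    """Indices that sort `values` ascending (stable, so ties keep index order)."""
    return sorted(range(len(values)), key=values.__getitem__)

def top_k(values, k):
    # Same slice as the notebook: walk the ascending order backwards for k steps.
    return argsort(values)[:-k - 1:-1]

# Hypothetical weight vector shaped like the NMF output:
# one positive weight (index 5) and zeros everywhere else.
weights = [0.0] * 10
weights[5] = 4.09

top = top_k(weights, 4)
print(top)                        # [5, 9, 8, 7]
print([weights[i] for i in top])  # [4.09, 0.0, 0.0, 0.0]
```

Only the first index carries information. The remaining positions are filled by zero-weight words in whatever order the sort leaves them, which would explain why the lower rows of the NMF topic table repeat across topics.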

So for each app's description, I will use 1 topic and 10 words, just to speed up the process.
Topics 2 through 10 seem to be mere replicates of topic 1.

