# Topic Modelling with NMF
#### A really trivial exercise to warm up with NMF


#### Set up
- For simplicity, we prepare the corpus as a list of documents.
- Ideally, you should preprocess the input text first: remove stop words, perform lemmatization, etc.
- Vectorize the input text using TF-IDF.


In [1]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
import numpy as np

documents = ["the cat eat rice", 
             "secret message emailed by agent cat", 
             "today must go shopping for cat food"]
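tf_vectorizer = TfidfVectorizer(stop_words='english')
tf_vectorized_documents = tf_vectorizer.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() (available since 1.0)
tf_feature_names = tf_vectorizer.get_feature_names_out().tolist()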



#### See the extracted bag-of-words vocabulary


In [2]:
tf_feature_names

['agent',
 'cat',
 'eat',
 'emailed',
 'food',
 'message',
 'rice',
 'secret',
 'shopping',
 'today']

### Questions:
1. What are the stop words in the corpus?
2. Are the stop words in the feature list?
3. If not, how were the stop words removed?

#### See the document-term matrix

In [3]:
print(tf_vectorized_documents)

  (0, 6)	0.652490884512534
  (0, 2)	0.652490884512534
  (0, 1)	0.3853716274664007
  (1, 0)	0.479527938028855
  (1, 3)	0.479527938028855
  (1, 5)	0.479527938028855
  (1, 7)	0.479527938028855
  (1, 1)	0.2832169249871526
  (2, 4)	0.546454011634009
  (2, 8)	0.546454011634009
  (2, 9)	0.546454011634009
  (2, 1)	0.3227445421804912
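
Each entry above is printed as `(document index, term index)` followed by the TF-IDF weight. To see where the weights come from, the sketch below recomputes them by hand, assuming the default `TfidfVectorizer` settings (smoothed IDF, raw term counts, L2 row normalization). The variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = ["the cat eat rice",
             "secret message emailed by agent cat",
             "today must go shopping for cat food"]

# Dense view of the matrix produced by scikit-learn (3 documents x 10 terms).
tfidf = TfidfVectorizer(stop_words="english").fit_transform(documents).toarray()

# Manual recomputation from raw term counts.
counts = CountVectorizer(stop_words="english").fit_transform(documents).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)  # L2-normalize each row

np.set_printoptions(precision=4, suppress=True)
print(tfidf)
print("Manual computation matches:", np.allclose(manual, tfidf))
```

Because 'cat' appears in every document, its IDF is the lowest, which is why it carries the smallest weight in each row.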


#### Fit the NMF model and transform the vectorized documents
- To align with the convention used in the note, we will use A (or V), W and H.
- NMF approximates A by the product of two non-negative matrices: W * H ≈ A

In [4]:
nmf_model = NMF(n_components=5, init='random', random_state=0)
A = tf_vectorized_documents
W = nmf_model.fit_transform(A)
H = nmf_model.components_
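
To see how closely W * H approximates A, the product can be compared with the original matrix. The following is a minimal, self-contained sketch that repeats the set-up above; the names `reconstruction` and `error` are illustrative.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["the cat eat rice",
             "secret message emailed by agent cat",
             "today must go shopping for cat food"]

A = TfidfVectorizer(stop_words="english").fit_transform(documents)
nmf_model = NMF(n_components=5, init="random", random_state=0)
W = nmf_model.fit_transform(A)
H = nmf_model.components_

# Rebuild an approximation of A from the two factors.
reconstruction = W @ H

# Frobenius norm of the difference between A and W @ H.
error = np.linalg.norm(A.toarray() - reconstruction)
print("Reconstruction error:", error)
# The fitted model stores the same quantity; the two should agree closely.
print("Reported by the model:", nmf_model.reconstruction_err_)
```

A small error relative to the norm of A indicates that the chosen number of topics is enough to represent the corpus.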


#### View the NMF model parameters
- There are many parameters to tune. Here we are only interested in the number of topics (n_components).

In [5]:
nmf_model

NMF(alpha=0.0, beta_loss='frobenius', init='random', l1_ratio=0.0, max_iter=200,
    n_components=5, random_state=0, shuffle=False, solver='cd', tol=0.0001,
    verbose=0)

* Explore contents of W

In [6]:
np.set_printoptions(precision=4)     # print NumPy arrays such as W and H with 4 digits of precision

In [7]:
W.shape

(3, 5)

In [8]:
print(W)

[[0.0000e+00 1.3293e+00 0.0000e+00 0.0000e+00 9.5947e-02]
 [7.9892e-01 0.0000e+00 1.4538e-08 2.8014e-01 0.0000e+00]
 [0.0000e+00 1.4770e-04 5.2578e-01 5.7715e-06 0.0000e+00]]


* Explore contents of H

In [9]:
H.shape

(5, 10)

In [10]:
print(H)

[[0.4642 0.3434 0.     0.4074 0.     0.4814 0.     0.4702 0.     0.    ]
 [0.     0.2899 0.3642 0.     0.     0.     0.3837 0.     0.     0.    ]
 [0.     0.6138 0.     0.     1.0393 0.     0.     0.     1.0393 1.0393]
 [0.388  0.0317 0.     0.55   0.     0.3389 0.     0.3708 0.     0.    ]
 [0.     0.     1.7546 0.     0.     0.     1.4841 0.     0.     0.    ]]


### Questions:
1. How many documents are there in the original corpus?
2. What is the number of terms that are extracted?
3. In W, how many rows (inner lists) are there in the array? What do they refer to?
4. In H, how many rows (inner lists) are there in the array? What do they refer to?
5. Fill in the blank: Row 3 (index 2) of W refers to the document _______________________
6. Fill in the blank: The predominant topic related to Row 3 is topic ___________________
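
To check your answers to questions 5 and 6, one option is to take the column with the largest weight in each row of W. This is a sketch only; `dominant_topics` is an illustrative name, and topic indices are arbitrary labels that depend on the initialization.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["the cat eat rice",
             "secret message emailed by agent cat",
             "today must go shopping for cat food"]

A = TfidfVectorizer(stop_words="english").fit_transform(documents)
nmf_model = NMF(n_components=5, init="random", random_state=0)
W = nmf_model.fit_transform(A)

# For each document (row of W), find the topic (column) with the largest weight.
dominant_topics = W.argmax(axis=1)
for doc_idx, topic_idx in enumerate(dominant_topics):
    print(f"Row {doc_idx + 1} (index {doc_idx}): '{documents[doc_idx]}' -> topic {topic_idx}")
```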

### Exercises:
1. Change the number of topics (e.g. set n_components = 6) and run the notebook.
2. Describe (in terms of dimensions) how W and H have changed.
3. Add two more sentences to documents. Describe the changes in W, H and A.
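
A small helper can make these experiments quicker to run. The sketch below is one possible approach: `factor_shapes` is a hypothetical helper, and the two extra sentences are made up for illustration.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["the cat eat rice",
             "secret message emailed by agent cat",
             "today must go shopping for cat food"]

def factor_shapes(docs, n_components):
    """Fit TF-IDF and NMF on docs; return the shapes of A, W and H."""
    A = TfidfVectorizer(stop_words="english").fit_transform(docs)
    model = NMF(n_components=n_components, init="random", random_state=0)
    W = model.fit_transform(A)
    H = model.components_
    return A.shape, W.shape, H.shape

# Exercises 1 and 2: vary the number of topics.
for k in (5, 6):
    print(f"n_components={k}:", factor_shapes(documents, k))

# Exercise 3: add two more (made-up) sentences to the corpus.
extended_documents = documents + ["the dog chased the cat",
                                  "agent bought rice and food"]
print("5 documents:", factor_shapes(extended_documents, 5))
```

Comparing the printed tuples shows which dimensions depend on the number of documents, the vocabulary size, and the number of topics.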

#### Reference:
- https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
