### Topic Modelling with NMF

Acknowledgement: This tutorial is adapted from Derek Greene (http://derekgreene.com/)

Topic modelling aims to automatically discover the hidden thematic structure in a large corpus of text documents. One approach for topic modelling is to apply *matrix factorisation* methods, such as *Non-negative Matrix Factorisation (NMF)*. In this notebook we look at how to apply NMF using the *scikit-learn* library in Python.

### Applying NMF

First, let's load the TF-IDF normalised document-term matrix and list of terms that we stored earlier using *Joblib*.

You may need to:
1. Identify the location and name of the TF-IDF file that you saved in the previous exercise
2. Insert the file name as the argument to joblib.load()

In [1]:
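# joblib is a standalone package; the old "from sklearn.externals import joblib"
# import no longer works in recent scikit-learn releases
import joblib
# A is the TF-IDF document-term matrix
# terms is the list of feature names
#(A,terms,snippets, raw_documents) = joblib.load( "REPLACE WITH YOUR FILE" )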


### TO DO:
- provide joblib.load with the name of the file that contains the TF-IDF matrix

In [2]:
(A,terms,snippets, raw_documents) = joblib.load( "articles-tfidf_pk.pkl" )
print( "Loaded %d X %d document-term matrix" % (A.shape[0], A.shape[1]) )
print ("Loaded %d unique terms" % len(terms))

Loaded 4551 X 10285 document-term matrix
Loaded 10285 unique terms


The key input parameter to NMF is the number of topics to generate, *k*. For the moment, we will pre-specify a guessed value for demonstration purposes.

In [3]:
k = 10

Another choice for NMF revolves around initialisation. Most commonly, NMF uses random initialisation to populate the values in the factors *W* and *H*. Depending on the random seed that you use, you may get different results on the same dataset. Using SVD-based initialisation instead (the *nndsvd* option used below) gives more consistent results from run to run.

In [4]:
# create the model
from sklearn import decomposition
model = decomposition.NMF( init="nndsvd", n_components=k ) 
# apply the model and extract the two factor matrices
W = model.fit_transform( A )
H = model.components_
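As a rough illustration of what the two factors represent, the sketch below runs the same kind of model on a tiny made-up matrix and checks that the product of the factors approximates the input. The values in `A_toy` are invented for illustration and are not taken from the article corpus; `random_state=0` and `max_iter=500` are assumptions added here only to keep the toy run repeatable.

```python
import numpy as np
from sklearn import decomposition

# hypothetical 6-document x 5-term matrix with two obvious groups of terms
A_toy = np.array([
    [1.0, 0.9, 0.0, 0.0, 0.1],
    [0.8, 1.0, 0.1, 0.0, 0.0],
    [0.9, 0.7, 0.0, 0.1, 0.0],
    [0.0, 0.1, 1.0, 0.8, 0.9],
    [0.1, 0.0, 0.9, 1.0, 0.7],
    [0.0, 0.0, 0.8, 0.9, 1.0],
])

# same construction as above, but with k=2 and a fixed seed (assumption)
toy_model = decomposition.NMF(init="nndsvd", n_components=2, random_state=0, max_iter=500)
W_toy = toy_model.fit_transform(A_toy)
H_toy = toy_model.components_

print(W_toy.shape, H_toy.shape)

# how closely does W x H reproduce the original matrix?
rel_error = np.linalg.norm(A_toy - W_toy @ H_toy) / np.linalg.norm(A_toy)
print("Relative reconstruction error: %.3f" % rel_error)
```

Both factors contain only non-negative values, which is what makes the rows of *H* and the columns of *W* readable as additive topic weights.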

### Examining the Output

NMF produces two factor matrices as its output: *W* and *H*.

The *W* factor contains the document membership weights relative to each of the *k* topics. Each row corresponds to a single document, and each column corresponds to a topic.

In [5]:
W.shape

(4551, 10)
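One way to read the rows of *W* is to pick out the highest-weighted topic for each document. The sketch below does this on a small hypothetical matrix (`W_example` is made up for illustration); the same `np.argmax` call could be applied to the real `W`.

```python
import numpy as np

# hypothetical membership weights: 4 documents x 3 topics
W_example = np.array([
    [0.60, 0.05, 0.00],
    [0.00, 0.40, 0.10],
    [0.02, 0.00, 0.30],
    [0.25, 0.20, 0.00],
])

# index of the highest-weighted topic in each row (document)
dominant_topic = np.argmax(W_example, axis=1)
for doc_index, topic_index in enumerate(dominant_topic):
    weight = W_example[doc_index, topic_index]
    print("Document %d -> topic %d (weight %.2f)" % (doc_index, topic_index, weight))
```

Note that a document can have non-zero weights for several topics, so the single dominant topic is only a summary of the row.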

###  Knowledge check
1. What does the H factor contain?
2. Find its shape.

In [6]:
# your code
H.shape

(10, 10285)
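Given the shape above, each row of *H* lines up with one topic and each column with one entry in the term list. A minimal sketch of ranking the terms for a topic follows; `terms_example`, `H_example` and the helper `top_terms` are hypothetical names introduced here for illustration.

```python
import numpy as np

# hypothetical vocabulary and topic-term weights: 2 topics x 5 terms
terms_example = ["bank", "election", "goal", "market", "vote"]
H_example = np.array([
    [0.9, 0.0, 0.1, 0.8, 0.0],
    [0.0, 0.7, 0.0, 0.1, 0.9],
])

def top_terms(terms, H, topic_index, top=3):
    # sort term indices by descending weight for the chosen topic
    top_indices = np.argsort(H[topic_index, :])[::-1][:top]
    return [terms[i] for i in top_indices]

for topic_index in range(H_example.shape[0]):
    ranked = top_terms(terms_example, H_example, topic_index)
    print("Topic %d: %s" % (topic_index, ", ".join(ranked)))
```

With the real model, the same helper could be called as `top_terms(terms, H, topic_index, top=10)` to get a short descriptor for each topic.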

###  Question

1. Based on the loaded documents and the NMF model that was applied to them:
(a) the original file has __4551__ documents
(b) in total, there are __10285__ terms
(c) W contains __4551__ documents in the rows, and the weights for __10__ topics in the columns

### Exporting the Results

If we want to keep this topic model for later use, we can save it using *joblib*:

In [7]:
joblib.dump((W,H,terms,snippets), "articles-model-nmf-k%02d.pkl" % k) 

['articles-model-nmf-k10.pkl']
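To use the saved model later, the tuple can be read back with `joblib.load` and unpacked in the same order it was dumped. The sketch below shows the round trip on small hypothetical stand-ins written to a temporary directory; the variable names and the `toy-model-nmf` file name are assumptions for illustration only.

```python
import os
import tempfile

import joblib
import numpy as np

# hypothetical stand-ins for the W, H, terms and snippets saved above
W_small = np.array([[0.5, 0.0], [0.0, 0.7]])
H_small = np.array([[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]])
terms_small = ["alpha", "beta", "gamma"]
snippets_small = ["first document", "second document"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy-model-nmf-k%02d.pkl" % W_small.shape[1])
    joblib.dump((W_small, H_small, terms_small, snippets_small), path)
    # unpack in the same order the tuple was saved
    (W_loaded, H_loaded, terms_loaded, snippets_loaded) = joblib.load(path)

print(W_loaded.shape, H_loaded.shape, len(terms_loaded))
```

For the model saved in this notebook, the equivalent call would be `joblib.load("articles-model-nmf-k10.pkl")`.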


#### Reference:

https://github.com/derekgreene/topic-model-tutorial