### Topic Modelling with NMF

Acknowledgement: This tutorial is adapted from a tutorial by Derek Greene (http://derekgreene.com/)

Topic modelling aims to automatically discover the hidden thematic structure in a large corpus of text documents. One approach for topic modelling is to apply *matrix factorisation* methods, such as *Non-negative Matrix Factorisation (NMF)*. In this notebook we look at how to apply NMF using the *scikit-learn* library in Python.
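
As a sketch of what the factorisation means, and assuming the squared Frobenius-norm objective that scikit-learn's NMF uses by default, an *n* × *m* document-term matrix *A* is approximated by two non-negative factors for *k* topics (the symbols *n* and *m* are introduced here for illustration):

$$
A \approx WH, \qquad \min_{W \ge 0,\; H \ge 0} \tfrac{1}{2}\lVert A - WH \rVert_F^2, \qquad W \in \mathbb{R}^{n \times k},\; H \in \mathbb{R}^{k \times m}
$$

Row *i* of *W* holds the weight of each topic in document *i*, and row *j* of *H* holds the weight of each term in topic *j*.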

### Applying NMF

First, let's load the TF-IDF normalised document-term matrix and the list of terms that we stored earlier using *Joblib*.

You may need to:
1. Identify the location and name of the TF-IDF file that you saved in the previous exercise.
2. Insert the file name as the argument to `joblib.load()`.

In [1]:
# sklearn.externals.joblib has been removed from scikit-learn; import joblib directly
import joblib
import numpy as np

# A is the TF-IDF document-term matrix
# terms is the list of feature names
# (A, terms, snippets, raw_documents) = joblib.load("REPLACE WITH YOUR FILE")
(A, terms, snippets, raw_documents) = joblib.load("articles-tfidf_pk.pkl")
print("Loaded %d X %d document-term matrix" % (A.shape[0], A.shape[1]))
print("Loaded %d unique terms" % len(terms))

Loaded 4551 X 10285 document-term matrix
Loaded 10285 unique terms


The key input parameter to NMF is the number of topics to generate, *k*. For the moment, we will pre-specify a guessed value for demonstration purposes.

In [8]:
k = 30

Another choice for NMF concerns initialisation. Most commonly, the factors W and H are populated with random values, so different random seeds can give different results on the same dataset. SVD-based initialisation (here, `init="nndsvd"`) instead starts from a decomposition of the data itself, which provides more reliable results.
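
To see the effect of the seed, here is a minimal sketch on a small synthetic matrix (not the article data). The matrix `X`, the helper `factorise` and the choice of 3 components are assumptions made for illustration only.

```python
import numpy as np
from sklearn.decomposition import NMF

# Small synthetic non-negative matrix standing in for the TF-IDF matrix
rng = np.random.default_rng(0)
X = rng.random((20, 12))

def factorise(init, seed):
    # hypothetical helper: fit NMF with the given initialisation and seed
    model = NMF(n_components=3, init=init, random_state=seed, max_iter=1000)
    W = model.fit_transform(X)
    return W, model.reconstruction_err_

# Random initialisation with two different seeds
W_r1, err_r1 = factorise("random", 1)
W_r2, err_r2 = factorise("random", 2)

# SVD-based initialisation, run twice with the same settings
W_s1, err_s1 = factorise("nndsvd", 0)
W_s2, err_s2 = factorise("nndsvd", 0)

print("random init, seeds 1 and 2 give the same W:", np.allclose(W_r1, W_r2))
print("nndsvd init, repeated run gives the same W:", np.allclose(W_s1, W_s2))
print("reconstruction errors:", err_r1, err_r2, err_s1)
```

With random initialisation the two factor matrices generally differ, even if the reconstruction errors are similar, because the topics can come out in a different order or at a different scale.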

In [9]:
# create the model
from sklearn import decomposition
model = decomposition.NMF( init="nndsvd", n_components=k ) 
# apply the model and extract the two factor matrices
W = model.fit_transform( A )
H = model.components_
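
To make the roles of the two factors concrete, the following sketch repeats the same steps on a tiny hand-made matrix. The values in `A_toy` and all `_toy` names are hypothetical. W has one row per document and one column per topic, while H has one row per topic and one column per term.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy stand-in for the TF-IDF matrix: 6 documents x 5 terms (made-up values)
A_toy = np.array([
    [0.9, 0.8, 0.0, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0, 0.0],
    [0.7, 0.6, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.8, 0.7],
    [0.1, 0.0, 0.8, 0.9, 0.8],
    [0.0, 0.0, 0.7, 0.8, 0.9],
])

model_toy = NMF(n_components=2, init="nndsvd", max_iter=500)
W_toy = model_toy.fit_transform(A_toy)   # document-topic weights
H_toy = model_toy.components_            # topic-term weights

print("W shape:", W_toy.shape)           # (documents, topics)
print("H shape:", H_toy.shape)           # (topics, terms)

# Topic with the largest weight for each document
print("dominant topic per document:", W_toy.argmax(axis=1))

# Frobenius norm of the residual A - WH: how well the factors reconstruct A
err_toy = np.linalg.norm(A_toy - W_toy @ H_toy)
print("reconstruction error:", err_toy)
```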

### Most Relevant Documents

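To interpret the topics, we look at both the top-ranked terms and the top-ranked documents for each topic. We define two functions to produce these rankings: one returns the top terms for a topic (its descriptor), and the other returns the snippets of its top-ranked documents.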

In [15]:
import numpy as np
def get_descriptor( terms, H, topic_index, top ):
    # reverse sort the values to sort the indices
    top_indices = np.argsort( H[topic_index,:] )[::-1]
    # now get the terms corresponding to the top-ranked indices
    top_terms = []
    for term_index in top_indices[0:top]:
        top_terms.append( terms[term_index] )
    return top_terms

In [10]:
def get_top_snippets( all_snippets, W, topic_index, top ):
    # reverse sort the values to sort the indices
    top_indices = np.argsort( W[:,topic_index] )[::-1]
    # now get the snippets corresponding to the top-ranked indices
    top_snippets = []
    for doc_index in top_indices[0:top]:
        top_snippets.append( all_snippets[doc_index] )
    return top_snippets
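
Both functions rely on the same ranking idiom: `np.argsort` returns indices in ascending order of weight, and `[::-1]` reverses them so that the largest weight comes first. A minimal sketch with made-up weights (the `_toy` names and values are hypothetical):

```python
import numpy as np

# 4 documents x 2 topics, made-up weights
W_toy = np.array([
    [0.1, 0.9],
    [0.7, 0.0],
    [0.4, 0.2],
    [0.0, 0.5],
])
snippets_toy = ["doc A", "doc B", "doc C", "doc D"]

# Column 0 holds each document's weight for topic 0
weights = W_toy[:, 0]

# Ascending argsort, then reversed to give descending order of weight
order = np.argsort(weights)[::-1]

# Keep the snippets of the two highest-weighted documents
top2 = [snippets_toy[i] for i in order[:2]]
print(top2)  # ['doc B', 'doc C']
```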

First, we list the top 10 terms for each topic:

In [16]:
descriptors = []
for topic_index in range(k):
    descriptors.append( get_descriptor( terms, H, topic_index, 10 ) )
    str_descriptor = ", ".join( descriptors[topic_index] )
    print("Topic %02d: %s" % ( topic_index+1, str_descriptor ) )

Topic 01: brexit, government, parliament, uk, minister, theresa, hammond, article, trade, 50
Topic 02: trump, donald, president, republican, campaign, america, presidential, white, election, american
Topic 03: film, films, movie, star, director, hollywood, actor, drama, festival, cinema
Topic 04: united, mourinho, manchester, rooney, gaal, ibrahimovic, van, mata, pogba, rashford
Topic 05: bank, banks, rbs, banking, deutsche, shares, lloyds, customers, hsbc, financial
Topic 06: nhs, care, patients, hospital, health, social, services, healthcare, hospitals, patient
Topic 07: album, music, band, pop, song, songs, rock, bowie, sound, guitar
Topic 08: facebook, internet, twitter, online, users, google, media, social, company, content
Topic 09: labour, party, corbyn, leader, jeremy, mps, ukip, voters, election, leadership
Topic 10: mental, health, people, children, services, depression, illness, problems, young, support
Topic 11: growth, economy, rates, markets, prices, pound, inflation, rat

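Next, we look at the top-ranked documents for a single topic. For instance, for the topic at index 26, the top 10 documents are:

In [19]:
# snippets contains only the first 100 characters of each document.
# raw_documents contains the full text of each document.
# The next line retrieves the top 10 snippets for the topic at index 26 (Topic 27).
# You can replace snippets with raw_documents to see the full text.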

topic_snippets = get_top_snippets( snippets, W, 26, 10 )

for i, snippet in enumerate(topic_snippets):
    print("%02d. %s" % ( (i+1), snippet ) )

01. Win (home) tickets to Newcastle United v West Brom The has teamed up with Barclays, proud sponsors o
02. Win (home) tickets to Leicester City v West Brom in the Premier League This competition is now close
03. Win (home) tickets to Manchester United v Watford in the Premier League This competition is now clos
04. Win (home) tickets to Watford v Newcastle United The has teamed up with Barclays, proud sponsors of 
05. Win (home) tickets to Aston Villa v Leicester City The has teamed up with Barclays, proud sponsors o
06. Win (home) tickets to Norwich City v West Ham in the Premier League The has teamed up with Barclays,
07. Win (home) tickets to Newcastle United v Sunderland in the Premier League The has teamed up with Bar
08. Win (home) tickets to Liverpool v Stoke City in the Premier League The has teamed up with Barclays, 
09. Barclays again warns on investment bank results Profits in Barclays investment banking arm will not 
10. Barclays agrees to hand over internal documents to 

Similarly, for the second topic:

In [12]:
topic_snippets = get_top_snippets( snippets, W, 1, 30 )
for i, snippet in enumerate(topic_snippets):
    print("%02d. %s" % ( (i+1), snippet ) )

01. Donald Trump at the White House: Obama reports 'excellent conversation' – as it happened Are you adj
02. What will President Donald Trump do? Predicting his policy agenda Donald Trump has been short on pol
03. Trump to visit White House as Obama calls for unity Trump to visit White House amid calls for unity 
04. Trump uses speech to defend defunct brands – and brandish thick slabs of steak Presidential candidat
05. Decoding the mystery of athletes who support Donald Trump When Latrell Sprewell joined Twitter a mon
06. A timeline of Donald Trump's alleged sexual misconduct: who, when and what At least 24 women have ac
07. Beware a boring Donald Trump. He’s more dangerous than a maverick one Donald Trump’s arrival in the 
08. A lifetime of misogyny catches up with Trump Back in the spring, Jill Harth didn’t want to talk. Nei
09. 'I was looking at the next president of the United States': the verdict on Trump's speech Lucia Grav
10. Trump cancels Chicago rally amid violence and chaos

### Knowledge check:
1. Which topic suggests technology and internet companies?
2. Then retrieve the top 5 articles related to that topic.

In [None]:
# Your code here


In [13]:
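# ANSWER: Topic 08 (index 7) covers technology and internet companies.
# The call below retrieves the top 10 snippets; pass 5 as the last argument for only the top 5.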
topic_snippets = get_top_snippets( snippets, W, 7, 10 )
for i, snippet in enumerate(topic_snippets):
    print("%02d. %s" % ( (i+1), snippet ) )

01. Twitter: 140 characters in search of a buyer Why doesn’t anyone want to buy Twitter? After the compa
02. Facebook’s satellite went up in smoke, but its developing world land grab goes on A rocket crashing 
03. Facebook lures Africa with free internet - but what is the hidden cost? Facebook has signed up almos
04. What are four of the top social media networks doing to protect children? According to recent report
05. Zuckerberg has given Facebook investors all they need. He wants one thing in return: control Mark Zu
06. Connecting everyone to internet 'would add $6.7tn to global economy' Bringing internet access to the
07. From births to melons: perks and pitfalls of Facebook’s live video revolution Last week, a man in Ca
08. How much are you worth to Facebook? Facebook has set new records for both the number of users it has
09. You may hate Donald Trump. But do you want Facebook to rig the election against him? While the prosp
10. Google, Facebook and Microsoft race to get 1 billio

### Exporting the Results

If we want to keep this topic model for later use, we can save it using *joblib*:

In [14]:
joblib.dump((W,H,terms,snippets), "articles-model-nmf-k%02d.pkl" % k) 

['articles-model-nmf-k30.pkl']
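
Loading the model back mirrors the earlier `joblib.load` call: the tuple is unpacked in the same order in which it was dumped. The sketch below round-trips a small made-up model through a temporary file; the `_demo` names, values and file name are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np

# Made-up stand-ins for the factors, terms and snippets
W_demo = np.array([[0.2, 0.0], [0.0, 0.5]])
H_demo = np.array([[0.3, 0.0, 0.1], [0.0, 0.4, 0.2]])
terms_demo = ["bank", "film", "music"]
snippets_demo = ["first snippet", "second snippet"]

# Save to a temporary directory so that no real model file is overwritten
path = os.path.join(tempfile.mkdtemp(), "demo-model-nmf-k02.pkl")
joblib.dump((W_demo, H_demo, terms_demo, snippets_demo), path)

# Unpack in the same order as the tuple that was dumped
(W_loaded, H_loaded, terms_loaded, snippets_loaded) = joblib.load(path)
print("Loaded %d X %d document-topic matrix" % W_loaded.shape)
print("Loaded %d terms" % len(terms_loaded))
```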


## Exercises:
Make a copy of this notebook and perform the following exercises:
1. Change the model to derive 13 topics.
2. Change the program to look at 10 terms per topic.
3. Change the program to display the 8 most relevant documents for topic #3.
4. For all the topics, display the top 5 documents.





#### Reference:

https://github.com/derekgreene/topic-model-tutorial