# Topic Modeling on test_prepared_scored


Topic models are statistical models that aim to discover the 'hidden' thematic structure in a collection of documents, i.e. to identify possible topics in our corpus. Topic modeling is iterative by nature: the right number of topics is rarely known in advance and has to be found by trying several values and comparing the results.

This notebook is organised as follows:

* [Setup and dataset loading](#setup)
* [Text Processing:](#text_process) Before feeding the data to a machine learning model, we need to convert it into numerical features.
* [Topics Extraction Models:](#mod) We present two different models from the sklearn library: NMF and LDA.
* [Topics Visualisation with pyLDAvis](#viz)
* [Topics Clustering:](#clust)  We try to understand how topics relate to each other.
* [Further steps](#next)

## Setup and dataset loading <a id="setup" /> 

First of all, let's load the libraries that we'll use.

**This notebook requires the installation of the [pyLDAvis](https://pyldavis.readthedocs.io/en/latest/readme.html#installation) package.**
[See here for help with installing python packages.](https://www.dataiku.com/learn/guide/code/python/install-python-packages.html)

In [0]:
%pylab inline
import warnings                         # Disable some warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)
import dataiku
from dataiku import pandasutils as pdu
import pandas as pd,  seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text 

from sklearn.decomposition import LatentDirichletAllocation,NMF
import pyLDAvis.sklearn
pyLDAvis.enable_notebook()

In [0]:
dataset_limit = 10000


We now load the dataset and identify possible text columns.

In [0]:
# Take a handle on the dataset
mydataset = dataiku.Dataset("test_prepared_scored")

# Load the first lines.
# You can also load random samples, limit yourself to some columns, or only load
# data matching some filters.
#
# Please refer to the Dataiku Python API documentation for more information
df = mydataset.get_dataframe(limit = dataset_limit)

df_orig = df.copy()

# Get the column names
numerical_columns = list(df.select_dtypes(include=[np.number]).columns)
categorical_columns = list(df.select_dtypes(include=[object]).columns)
date_columns = list(df.select_dtypes(include=['<M8[ns]']).columns)

# Print a quick summary of what we just loaded
print("Loaded dataset")
print("   Rows: %s" % df.shape[0])
print("   Columns: %s (%s num, %s cat, %s date)" % (df.shape[1],
                                                    len(numerical_columns), len(categorical_columns),
                                                    len(date_columns)))

By default, we assume that the text from which we want to extract topics is the first of the categorical columns.

In [0]:
raw_text_col = categorical_columns[0]

# Uncomment this if you want to take manual control over which variable is the text of interest
#print(df.columns)
#raw_text_col = "text_normalized"

raw_text = df[raw_text_col]
# Issue a warning if data contains NaNs
if raw_text.isnull().any():
    print('\x1b[33mWARNING: Your text contains NaNs\x1b[0m')
    print('Please take care of them: the CountVectorizer will not be able to fit your data if it contains empty values.')
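
The warning above leaves the clean-up to the reader. As a minimal sketch, here are two common ways to handle missing values before vectorizing, shown on a toy Series (`toy_text` is a hypothetical stand-in for `raw_text`). Dropping rows changes the alignment with `df`, whereas filling with empty strings keeps it; which one is preferable depends on your use case.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for raw_text, with one missing value
toy_text = pd.Series(["first document", np.nan, "third document"])

# Option 1: drop the missing rows and rebuild a 0..n-1 index
dropped = toy_text.dropna().reset_index(drop=True)

# Option 2: keep every row and replace missing values with an empty string
filled = toy_text.fillna("")

print(len(dropped), filled.isnull().any())
```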

**To test this notebook on example data, uncomment the following cell**, which loads the 20newsgroups dataset:

In [0]:
#Example on the 20newsgroups
#from sklearn.datasets import fetch_20newsgroups
#dataset = fetch_20newsgroups(shuffle=True, random_state=1,remove=('headers', 'footers', 'quotes'))
#raw_text = dataset.data

## Text Processing <a id="text_process" /> 

We cannot directly feed the text to the Topics Extraction Algorithms. We first need to process the text in order to get numerical vectors. We achieve this by applying either a CountVectorizer(), which counts how often each word occurs in each document, or a TfidfVectorizer(), which additionally down-weights words that appear in many documents. For more information on those techniques, please refer to this [sklearn documentation](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html).   
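
To make the difference between the two vectorizers concrete, here is a minimal sketch on a hypothetical three-document corpus (`toy_corpus` and the other names are illustrative, not part of this notebook). Both vectorizers are used with their default settings, so they produce matrices of the same shape; only the values differ.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, for illustration only
toy_corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats make good pets",
]

# Raw term counts: one row per document, one column per vocabulary word
count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_corpus)

# TF-IDF weights: same layout, but frequent-everywhere words get lower weights
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(toy_corpus)

# "the" appears twice in the first document
the_count_doc0 = counts[0, count_vec.vocabulary_["the"]]

# With the default settings, each TF-IDF row is scaled to unit Euclidean length
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())

print(counts.shape, tfidf.shape, the_count_doc0)
```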

As with any text mining task, we first need to remove stop words that provide no useful information about topics. *sklearn* provides a default stop words list for English, but we can always add custom stop words to it: <a id="stop_words" />

In [0]:
custom_stop_words = []
#custom_stop_words = [u'did', u'good', u'right', u'said', u'does', u'way',u'edu', u'com', u'mail', u'thanks', u'post', u'address', u'university', u'email', u'soon', u'article',u'people', u'god', u'don', u'think', u'just', u'like', u'know', u'time', u'believe', u'say',u'don', u'just', u'think', u'probably', u'use', u'like', u'look', u'stuff', u'really', u'make', u'isn']

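# Convert to a list: recent scikit-learn versions reject a frozenset for the stop_words parameter
stop_words = list(text.ENGLISH_STOP_WORDS.union(custom_stop_words))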

### CountVectorizer() on the text data <a id="tfidf" /> 

We first initialise a CountVectorizer() object and then apply the fit_transform method to the text.


In [0]:
cnt_vectorizer = CountVectorizer(strip_accents = 'unicode',stop_words = stop_words,lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b', max_df = 0.85, min_df = 2)

text_cnt = cnt_vectorizer.fit_transform(raw_text)

print(text_cnt.shape)

### TfidfVectorizer() on the text data <a id="tfidf" /> 

We first initialise a TfidfVectorizer() object and then apply the fit_transform method to the text.

In [0]:
tfidf_vectorizer = TfidfVectorizer(strip_accents = 'unicode',stop_words = stop_words,lowercase = True,
                                token_pattern = r'\b[a-zA-Z]{3,}\b', max_df = 0.75, min_df = 0.02)

text_tfidf = tfidf_vectorizer.fit_transform(raw_text)

print(text_tfidf.shape)
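
The vectorizers above combine three filters: `token_pattern` keeps only alphabetic tokens of at least three letters, `min_df` drops rare words and `max_df` drops words that occur in too many documents. An integer value is read as an absolute number of documents and a float as a proportion of documents. The sketch below illustrates this on a hypothetical four-document corpus (`toy_corpus` is an assumption made for the example).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus; "forum" appears in every document
toy_corpus = [
    "forum post: GPU prices in 2023 are up",
    "forum post: GPU drivers and GPU prices",
    "forum reply: new drivers fix the old bug",
    "forum reply: prices are up again",
]

vec = CountVectorizer(
    lowercase=True,
    token_pattern=r'\b[a-zA-Z]{3,}\b',  # drops "2023", "in" and "up"
    min_df=2,                           # drops words seen in fewer than 2 documents
    max_df=0.85,                        # drops words seen in more than 85% of documents
)
vec.fit(toy_corpus)

# Words that survive all three filters
kept = sorted(vec.vocabulary_)
print(kept)
```

Here "forum" is removed by `max_df` because it occurs in all four documents, while words such as "bug" or "again" are removed by `min_df` because they occur only once.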

In the following, we will apply the topics extraction to `text_tfidf`.

## Topics Extraction Models <a id="mod" /> 

There are two very popular models for topic modelling, both available in the sklearn library: 

* [NMF (Non-negative Matrix Factorization)](https://en.wikipedia.org/wiki/Non-negative_matrix_factorization),
* [LDA (Latent Dirichlet Allocation)](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

These two topic modeling algorithms infer topics from a collection of texts by viewing each document as a mixture of various topics. The main parameter we need to choose is the number of desired topics `n_topics`.  
It is recommended to try different values for `n_topics` in order to find the most insightful topics. To help with that, we show different analyses below (most frequent words per topic and heatmaps).

In [0]:
n_topics= 10

Use the following line for LDA:

In [0]:
topics_model = LatentDirichletAllocation(n_topics, random_state=0)

Uncomment the following line to try NMF instead.

In [0]:
#topics_model = NMF(n_topics, random_state=0)

In [0]:
topics_model.fit(text_tfidf)
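
To illustrate what "a document as a mixture of topics" means in practice, here is a minimal sketch on a hypothetical toy corpus. Both models produce a document-topic matrix (returned by `fit_transform`) and a topic-word matrix (stored in `components_`). The names `toy_corpus` and `n_toy_topics` are assumptions made for the example, and the `n_components` keyword assumes a reasonably recent scikit-learn version.

```python
import numpy as np
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus mixing two themes
toy_corpus = [
    "car engine oil brakes",
    "engine brakes wheels car",
    "church prayer sunday faith",
    "faith church sunday service",
    "car wheels oil engine",
    "prayer faith service church",
]
toy_counts = CountVectorizer().fit_transform(toy_corpus)
n_docs, n_words = toy_counts.shape
n_toy_topics = 2

# LDA: each row of the document-topic matrix is a probability distribution
lda = LatentDirichletAllocation(n_components=n_toy_topics, random_state=0)
doc_topic_lda = lda.fit_transform(toy_counts)

# NMF: non-negative weights, not normalized to sum to 1
nmf = NMF(n_components=n_toy_topics, random_state=0, max_iter=500)
doc_topic_nmf = nmf.fit_transform(toy_counts)

print(doc_topic_lda.shape, lda.components_.shape)
```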

### Most Frequent Words per Topic
An important way to assess the validity of our topic modelling is to look directly at the most frequent words in each topic.

In [0]:
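n_top_words = 10
# get_feature_names() was replaced by get_feature_names_out() in recent scikit-learn versions
if hasattr(tfidf_vectorizer, "get_feature_names_out"):
    feature_names = tfidf_vectorizer.get_feature_names_out()
else:
    feature_names = tfidf_vectorizer.get_feature_names()

def get_top_words_topic(topic_idx):
    '''Print the n_top_words highest-weighted words of the given topic'''
    topic = topics_model.components_[topic_idx]
    # argsort() is ascending, so the reversed slice yields the largest weights first
    print([feature_names[i] for i in topic.argsort()[:-n_top_words - 1:-1]])

for topic_idx in range(len(topics_model.components_)):
    print("Topic #%d:" % topic_idx)
    get_top_words_topic(topic_idx)
    print("")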

Pay attention to the words listed: if some of them are very common, you may want to go back to the [definition of custom stop words](#stop_words).
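
The slice `[:-n_top_words - 1:-1]` used in the cell above can look cryptic. The following sketch, on a hypothetical weight vector, shows how it picks the indices of the largest weights in decreasing order (`toy_weights` and `toy_words` are made up for the example).

```python
import numpy as np

# Hypothetical topic weights, one per vocabulary word
toy_words = ["apple", "bird", "car", "dog", "egg"]
toy_weights = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
n_top = 3

# argsort() returns indices from smallest to largest weight;
# the reversed slice walks back from the end and keeps n_top of them
top_idx = toy_weights.argsort()[:-n_top - 1:-1]
top_words = [toy_words[i] for i in top_idx]

print(top_words)
```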

#### Naming the topics

Thanks to the above analysis, we can try to name each topic:

In [0]:
dict_topic_name = {i: "topic_"+str(i) for i in range(n_topics)}
#dict_topic_name = my_dict_topic_name # Define your own name mapping here and uncomment this line!

For the 20newsgroups dataset, if you used the [suggested custom stop words](#stop_words), we suggest these 10 topic names:

In [0]:
#dict_topic_name = {0: "Posting", 1: "Driving", 2: "OS (Windows)", 3: "Past", 4: "Games", 5: "Sales", 6: "Misc", 7: "Christianity", 8: "Personal information", 9: "Government/Justice"}

### Topics Heatmaps

Another way to better understand the topics found is to look at heatmaps of the document-topic and topic-word matrices. They show the distribution of topics over the collection of documents and the distribution of words over the topics.  
We start with the topic-word heatmap, where the darker the color, the more representative the word is of the topic:

In [0]:
word_model = pd.DataFrame(topics_model.components_.T)
word_model.index = feature_names
word_model.columns.name = 'topic'
word_model['norm'] = word_model.abs().max(axis=1) # largest absolute weight of each word across topics
word_model = word_model.sort_values(by='norm',ascending=False) # sort the words by that largest weight
word_model.rename(columns = dict_topic_name, inplace = True) #naming topic
 
del word_model['norm']

plt.figure(figsize=(9,8))
sns.heatmap(word_model[:10]) 
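
The cell above ranks words by their largest weight across topics before plotting the first ten. Here is a minimal sketch of that ranking step on a hypothetical 2-topic, 4-word matrix, without the plotting (`toy_components` and `toy_words` are illustrative values, not notebook outputs).

```python
import numpy as np
import pandas as pd

# Hypothetical topic-word weights: one row per topic, one column per word
toy_components = np.array([
    [0.9, 0.1, 0.2, 0.0],   # topic_0
    [0.1, 0.8, 0.3, 0.1],   # topic_1
])
toy_words = ["engine", "church", "people", "zebra"]

# Transpose so that each row is a word and each column a topic
toy_word_model = pd.DataFrame(toy_components.T, index=toy_words,
                              columns=["topic_0", "topic_1"])

# Rank words by their largest weight in any topic
strength = toy_word_model.abs().max(axis=1)
ranked = toy_word_model.loc[strength.sort_values(ascending=False).index]

print(ranked)
```

Words that score low in every topic, such as "zebra" here, end up at the bottom and are left out of the heatmap.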

We now display the document-topic heatmap:

In [0]:
# retrieve the document-topic matrix
document_model = pd.DataFrame(topics_model.transform(text_tfidf))
document_model.columns.name = 'topic'
document_model.rename(columns = dict_topic_name, inplace = True) #naming topics

plt.figure(figsize=(9,8))
sns.heatmap(document_model.sort_index()[:10]) #we limit here to the first 10 texts

### Topic distribution over the corpus  
We can look at how the topics are represented in the collection of documents.

In [0]:
topics_proportion = document_model.sum()/document_model.sum().sum()
topics_proportion.plot(kind = "bar")
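
The proportion computed above is a "soft" share: every document contributes to every topic according to its weights. As a sketch, the example below reproduces that computation on a hypothetical document-topic matrix and compares it with a "hard" count of dominant topics. The second view is an addition for illustration and is not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical document-topic matrix: one row per document
toy_document_model = pd.DataFrame(
    [[0.7, 0.2, 0.1],
     [0.6, 0.3, 0.1],
     [0.1, 0.3, 0.6],
     [0.2, 0.5, 0.3]],
    columns=["topic_0", "topic_1", "topic_2"],
)

# Soft share: total weight of each topic divided by the total weight
toy_proportion = toy_document_model.sum() / toy_document_model.sum().sum()

# Hard count: number of documents for which each topic has the largest weight
dominant_counts = toy_document_model.idxmax(axis=1).value_counts()

print(toy_proportion)
print(dominant_counts)
```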

For each topic, we can look at the documents that are most representative of it:

In [0]:
def top_documents_topics(topic_name, n_doc = 3, excerpt = True):
    '''This prints the n_doc documents most representative of topic_name'''
    
    document_index = list(document_model[topic_name].sort_values(ascending = False).index)[:n_doc]
    for order, i in enumerate(document_index):
        print("Text for the {}-th most representative document for topic {}:\n".format(order + 1,topic_name))
        if excerpt:
            print(raw_text[i][:1000])
        else:
            print(raw_text[i])
        print("\n******\n")

You can try this to get excerpts from the 3 most representative texts of a topic. With the default names, use "topic_0"; on the 20newsgroups dataset with the suggested names, use for instance "Driving" instead:

In [0]:
top_documents_topics("topic_0")

## Topics Visualization with pyLDAvis <a id="viz" />

Thanks to the pyLDAvis package, we can easily visualise and interpret the topics that have been fitted to our corpus of text data.

In [0]:
pyLDAvis.sklearn.prepare(topics_model, text_tfidf, tfidf_vectorizer)

## Topics Clustering <a id="clust" />

Once we have fitted topics on the text data, we can try to understand how they relate to one another by running a hierarchical clustering on the topics. We propose two methods: the first is based on a correlation table between topics, the second on a contingency table.

In [0]:
# correlation matrix between topics
cor_matrix = np.corrcoef(document_model.iloc[:,:n_topics].values,rowvar=0)

#Renaming of the index and columns
cor_matrix = pd.DataFrame(cor_matrix)
cor_matrix.rename(index = dict_topic_name, inplace = True)
cor_matrix.rename(columns= dict_topic_name, inplace = True)

sns.clustermap(cor_matrix, cmap="bone")

In [0]:
# contingency table on the binarized document-topic matrix
document_bin_topic = (document_model.iloc[:,:n_topics] > 0.25).astype(int)
contingency_matrix = np.dot(document_bin_topic.T.values, document_bin_topic.values )

#Renaming of the index and columns
contingency_matrix = pd.DataFrame(contingency_matrix)
contingency_matrix.rename(index = dict_topic_name, inplace = True)
contingency_matrix.rename(columns= dict_topic_name, inplace = True)

sns.clustermap(contingency_matrix)
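
To see what the two tables contain before they are clustered, here is a minimal sketch on a hypothetical document-topic matrix (`toy_doc_topic` is made up for the example, and plotting is left out). The correlation table measures whether two topics tend to be strong in the same documents; the contingency table counts the documents in which both topics exceed the 0.25 threshold.

```python
import numpy as np

# Hypothetical document-topic matrix: 5 documents, 3 topics
toy_doc_topic = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
])

# Correlation between topics: rowvar=False treats columns (topics) as variables
toy_cor = np.corrcoef(toy_doc_topic, rowvar=False)

# Contingency table: entry (i, j) counts documents where topics i and j are both above 0.25
toy_bin = (toy_doc_topic > 0.25).astype(int)
toy_contingency = np.dot(toy_bin.T, toy_bin)

print(toy_contingency)
```

The diagonal of the contingency table gives the number of documents in which each topic passes the threshold on its own.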

## Further steps <a id="next" />

Topics extraction is a vast subject and a notebook can only show so much. There is still much we could do. Here are some ideas:  


#### 1. Discard documents from noise topics
The following helper function takes as arguments the names of the topics whose documents we wish to discard.

In [0]:
def remove_doc(*topic_name):
    '''This returns the texts whose main topic is not one of the given topic names'''
    doc_max_topic = document_model.idxmax(axis = 1)
    print("Removing documents whose main topic is in {}".format(topic_name))
    doc_max_topic_filtered = doc_max_topic[~doc_max_topic.isin(topic_name)]
    return [raw_text[i] for i in doc_max_topic_filtered.index.tolist()]

#E.g.: to remove documents whose main topic is topic_0 or topic_2, we would simply call remove_doc("topic_0","topic_2")

For the 20newsgroups dataset, try this to remove the texts whose main topic is "Misc":

In [0]:
#raw_text_filtered = remove_doc("Misc")
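
The filtering logic of `remove_doc` can be checked on a small example. The sketch below applies the same steps (main topic via `idxmax`, then exclusion via `isin`) to hypothetical texts and weights. All names and values are made up for the illustration.

```python
import pandas as pd

# Hypothetical texts and their document-topic weights
toy_texts = ["car engine review", "random chatter",
             "church service times", "more chatter"]
toy_document_model = pd.DataFrame(
    [[0.8, 0.1, 0.1],
     [0.2, 0.2, 0.6],
     [0.1, 0.7, 0.2],
     [0.3, 0.3, 0.4]],
    columns=["Driving", "Christianity", "Misc"],
)

# Main topic of each document: the column holding the largest weight
main_topic = toy_document_model.idxmax(axis=1)

# Keep only the documents whose main topic is not in the discard list
keep = ~main_topic.isin(["Misc"])
kept_texts = [toy_texts[i] for i in main_topic[keep].index]

print(kept_texts)
```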

#### 2. Scoring the topic model on new text
Finally, we can score new text with our topic model as follows.

In [0]:
new_text = raw_text[:3] #Change this to the new text you'd like to score !

tfidf_new_text = tfidf_vectorizer.transform(new_text)
result = pd.DataFrame(topics_model.transform(tfidf_new_text), columns = [dict_topic_name[i] for i in range(n_topics)])
sns.heatmap(result)
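
One point worth stressing when scoring new text is that it must go through the vectorizer that was already fitted (`transform`, not `fit_transform`), so that the columns match those seen by the model. Here is a self-contained sketch on a hypothetical toy corpus, using NMF. The corpus, the variable names and the `n_components` keyword are assumptions made for the example.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training corpus
train_corpus = [
    "the car engine needs a new oil filter",
    "my car brakes and engine make noise",
    "the church service starts on sunday morning",
    "sunday prayer at the church was peaceful",
]
toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(train_corpus)
toy_model = NMF(n_components=2, random_state=0, max_iter=500).fit(toy_tfidf)

# New documents: the second one contains only words unseen during fitting
new_docs = ["the engine of my car is noisy", "zzz qqq"]

# Reuse the fitted vocabulary; unseen words are silently ignored
new_tfidf = toy_vectorizer.transform(new_docs)
scores = toy_model.transform(new_tfidf)

print(scores.shape, new_tfidf[1].nnz)
```

A document made up only of unknown words gets an all-zero feature row, so its topic scores carry no information and should not be interpreted.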