This notebook will be collected automatically at **6pm on Monday** from the `/home/data_scientist/assignments/Week9` directory on the course JupyterHub server. If you work on this assignment on the course JupyterHub server, just save your work; the instructors will pull your notebook automatically after the deadline. If you work on this assignment locally, note that submission is only possible via JupyterHub: you must place the notebook file in the correct directory, with the correct file name, before the deadline.

1. Make sure everything runs as expected. First, restart the kernel (in the menubar, select `Kernel` → `Restart`) and then run all cells (in the menubar, select `Cell` → `Run All`).
2. Make sure you fill in every place that says `YOUR CODE HERE`. Do not write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed by the autograder.
3. Do not change the file path or the file name of this notebook.
4. Make sure that you save your work (in the menubar, select `File` → `Save and Checkpoint`).

# Problem 9.2. NLP: Topic Modeling.

In this problem, we explore the concept of topic modeling.

In [None]:
import numpy as np
import pandas as pd

from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils import check_random_state
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

from gensim.matutils import Sparse2Corpus
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel

from nose.tools import assert_equal, assert_is_instance, assert_true
from numpy.testing import assert_array_equal, assert_array_almost_equal

We use the twenty newsgroups data set.

In [None]:
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(
    data_home='/home/data_scientist/data/textdm', 
    subset='train',
    shuffle=True,
    random_state=check_random_state(0),
    remove=('headers', 'footers', 'quotes')
    )

test = fetch_20newsgroups(
    data_home='/home/data_scientist/data/textdm', 
    subset='test',
    shuffle=True,
    random_state=check_random_state(0),
    remove=('headers', 'footers', 'quotes')
    )

## Document term matrix

- Use TfidfVectorizer to create a document term matrix for both `train['data']` and `test['data']`.
- Use English stop words.
- Use unigrams and bigrams.
- Ignore terms that have a document frequency strictly lower than 2.
- Build a vocabulary that only considers the top 20,000 features ordered by term frequency across the corpus.
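
As a rough illustration of how these settings interact, here is a minimal sketch on a toy corpus. The documents and the `toy_*` names are hypothetical and are not part of the assignment; the vectorizer is fitted on the training documents only and then reused for the test documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (hypothetical, for illustration only).
toy_train = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "a dog sat on the log",
]
toy_test = ["the dog chased the cat"]

vectorizer = TfidfVectorizer(
    stop_words='english',   # drop common English words such as "the" and "on"
    ngram_range=(1, 2),     # unigrams and bigrams
    min_df=2,               # ignore terms that appear in fewer than 2 documents
    max_features=20000,     # upper bound on the vocabulary size
)

# Learn the vocabulary and IDF weights from the training documents,
# then apply the same vocabulary to the test documents.
toy_train_matrix = vectorizer.fit_transform(toy_train)
toy_test_matrix = vectorizer.transform(toy_test)

print(sorted(vectorizer.vocabulary_))   # terms that survive the filters
print(toy_train_matrix.shape, toy_test_matrix.shape)
```

In this toy corpus every bigram occurs in only one document, so `min_df=2` removes all of them and only the unigrams shared by at least two documents remain.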

In [None]:
def get_document_term_matrix(train_data, test_data):
    '''
    Uses TfidfVectorizer to create a document term matrix for "train_data" and "test_data".
    
    Parameters
    ----------
    train_data: A list of strings
    test_data: A list of strings
    
    Returns
    -------
    A 3-tuple of (model, train_matrix, test_matrix).
    model: A TfidfVectorizer instance
    train_matrix: A scipy.sparse.csr_matrix
    test_matrix: A scipy.sparse.csr_matrix
    '''
    
    # YOUR CODE HERE
    
    return model, train_matrix, test_matrix

In [None]:
cv, train_data, test_data = get_document_term_matrix(train['data'], test['data'])

In [None]:
assert_is_instance(cv, TfidfVectorizer)
assert_is_instance(train_data, csr_matrix)
assert_is_instance(test_data, csr_matrix)
assert_equal(cv.stop_words, 'english')
assert_equal(cv.ngram_range, (1, 2))
assert_equal(cv.min_df, 2)
assert_equal(cv.max_features, 20000)
assert_equal(train_data.data.size, 680499)
assert_array_almost_equal(
    train_data.data[:5],
    [0.04590546,  0.05614672,  0.05849851,  0.05614672,  0.06487626]
    )
assert_equal(test_data.data.size, 415292)
assert_array_almost_equal(
    test_data.data[:5],
    [0.16046961,  0.3429567 ,  0.2124038 ,  0.28698678,  0.22300288]
    )

## Non-negative matrix factorization

- Apply non-negative matrix factorization (NMF) to compute topics in `train_data`.
- Use 60 topics.
- Normalize the transformed data to have unit probability, so that the topic weights of each document sum to one.
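
The following sketch shows the general pattern on a tiny matrix. It assumes that "unit probability" means each row of topic weights sums to one (L1 normalization); the matrix values, the choice of 2 components, `max_iter=500`, and the `toy_*` names are made up for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.preprocessing import normalize

# Small non-negative matrix standing in for a document term matrix
# (hypothetical values, for illustration only).
toy_matrix = np.array([
    [1.0, 0.5, 0.0, 0.0],
    [0.8, 0.7, 0.1, 0.0],
    [0.0, 0.1, 0.9, 1.0],
    [0.0, 0.0, 0.6, 0.8],
])

# Factor the matrix into document-topic weights and topic-term weights.
toy_nmf = NMF(n_components=2, max_iter=500, random_state=0)
toy_weights = toy_nmf.fit_transform(toy_matrix)

# Rescale each row so that its topic weights sum to one
# (assumption: this is what "unit probability" refers to).
toy_weights_norm = normalize(toy_weights, norm='l1', axis=1)

print(toy_weights_norm.sum(axis=1))
```

Rows whose topic weights are all zero stay all zero after normalization rather than summing to one.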

In [None]:
def apply_nmf(data, random_state):
    '''
    Applies non-negative matrix factorization (NMF) to compute topics.
    
    Parameters
    ----------
    data: A csr_matrix
    random_state: A RandomState instance for NMF
    
    Returns
    -------
    A tuple of (nmf, transformed_data)
    nmf: An sklearn.decomposition.NMF instance
    transformed_data: A numpy.ndarray
    '''
    
    # YOUR CODE HERE
    
    return nmf, transformed_data

In [None]:
nmf, td_norm = apply_nmf(train_data, random_state=check_random_state(0))

# We use a DataFrame to simplify the collecting of the data for display.
df = pd.DataFrame(td_norm)
df.fillna(value=0, inplace=True)
df['label'] = pd.Series(train['target'])

df_label = df.groupby('label').mean()
df_label['names'] = pd.Series(train['target_names'], dtype="category")

# We display only the final 5 columns for brevity (there are 61 columns).
print(df_label.iloc[:, -5:])

In [None]:
assert_is_instance(nmf, NMF)
assert_is_instance(td_norm, np.ndarray)
assert_equal(nmf.n_components, 60)
assert_equal(nmf.max_iter, 200)
assert_equal(td_norm.shape, (11314, 60))
assert_array_almost_equal(
    td_norm[0, :5],
    [0.        ,  0.00421649,  0.        ,  0.120597  ,  0.00141566]
    )
assert_array_almost_equal(
    td_norm[-1, -5:],
    [ 0.05955216,  0.        ,  0.00094186,  0.        ,  0.06290102]
    )

## Topic-based Classification

- Train a Random Forest classifier on the topics in the training data sample of the twenty newsgroup data set.
- Use default parameters for the random forest classifier. Don't forget to set the `random_state` parameter.
- Use the previously fitted NMF model to compute the topics for the test data, then predict class labels from these topics.
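
The overall flow might look like the sketch below, which uses random synthetic data rather than the newsgroup data. All `toy_*` names, the array shapes, and the labeling rule are hypothetical; the point is only that the test data is transformed with the NMF model fitted on the training data before it reaches the classifier.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.ensemble import RandomForestClassifier

# Synthetic non-negative features and binary labels (illustration only).
rng = np.random.RandomState(0)
toy_X_train = rng.rand(40, 6)
toy_y_train = (toy_X_train[:, 0] > 0.5).astype(int)
toy_X_test = rng.rand(10, 6)

# Fit the topic model on the training data and get training topics.
toy_nmf = NMF(n_components=3, max_iter=500, random_state=0)
toy_topics_train = toy_nmf.fit_transform(toy_X_train)

# Train the classifier on the topics, not on the raw features.
toy_clf = RandomForestClassifier(random_state=0)
toy_clf.fit(toy_topics_train, toy_y_train)

# Reuse the fitted NMF model (transform, not fit_transform) for the test data.
toy_topics_test = toy_nmf.transform(toy_X_test)
toy_y_pred = toy_clf.predict(toy_topics_test)

print(toy_y_pred)
```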

In [None]:
def classify_topics(nmf, X_train, y_train, X_test, random_state):
    '''
    Trains a Random Forest classifier on training topics and predicts labels for test data.
    
    Parameters
    ----------
    nmf: An sklearn.NMF model.
    X_train: A numpy array.
    y_train: A numpy array.
    X_test: A scipy csr_matrix.
    random_state: A RandomState instance for Random Forest Classifier.
    
    Returns
    -------
    A tuple of (clf, y_pred)
    clf: A RandomForestClassifier instance.
    y_pred: A numpy array.
    '''
    
    # YOUR CODE HERE
    
    return clf, y_pred

The resulting classification report is shown to demonstrate the quality of this classification method.

In [None]:
clf, ts_preds = classify_topics(
    nmf, nmf.transform(train_data), train['target'], test_data, check_random_state(0)
    )
print(classification_report(test['target'], ts_preds, target_names=test['target_names']))

In [None]:
assert_is_instance(clf, RandomForestClassifier)
assert_is_instance(ts_preds, np.ndarray)
assert_equal(len(ts_preds), len(test['target']))
assert_array_equal(ts_preds[:5], [8, 1, 16, 15, 6])
assert_array_equal(ts_preds[-5:], [7, 9, 3, 1, 9])

## Topic Modeling with Gensim

- Use the gensim library to perform topic modeling of the twenty newsgroup data. First, transform the sparse matrix into a gensim corpus, and then construct a vocabulary dictionary. Finally, create a Latent Dirichlet Allocation (LDA) model with 20 topics for the newsgroup text, and return the 5 most significant words for each topic.
- You should specify three parameters in `LdaModel()`: `corpus`, `id2word`, and `num_topics`. Use default values for all other parameters. Ignore any warnings about `passes` or `iterations`.

In [None]:
def get_topics(cv, train_data):
    '''
    Uses gensim to perform topic modeling.
    
    Parameters
    ----------
    cv: A TfidfVectorizer instance.
    train_data: A scipy csr_matrix.
    
    Returns
    -------
    A list of (topic, score) tuples, one per topic. Each topic is a list of
    (weight, term) pairs for the 5 most significant terms; score is a float.
    '''
    
    # YOUR CODE HERE
    
    return topics

In [None]:
topics = get_topics(cv, train_data)

for idx, (lst, val) in enumerate(topics):
    print('Topic {0}'.format(idx))
    print(35*('-'))
    for i, z in lst:
        print('    {0:20s}: {1:5.4f}'.format(z, i))
    print(35*('-'))

In [None]:
assert_is_instance(topics, list)
assert_equal(len(topics), 20)

for topic, score in topics:
    assert_is_instance(topic, list)
    assert_is_instance(score, float)
    assert_equal(len(topic), 5)
    for v, k in topic:
        assert_is_instance(k, str)
        assert_is_instance(v, float)