This notebook will be collected automatically at **6pm on Monday** from the `/home/data_scientist/assignments/Week7` directory on the course JupyterHub server. If you work on this assignment on the course JupyterHub server, just make sure that you save your work; instructors will pull your notebooks automatically after the deadline. If you work on this assignment locally, note that the only way to submit is via JupyterHub: you must place the notebook file in the correct directory with the correct file name before the deadline.

1. Make sure everything runs as expected. First, restart the kernel (in the menubar, select `Kernel` → `Restart`) and then run all cells (in the menubar, select `Cell` → `Run All`).
2. Make sure you fill in every place that says `YOUR CODE HERE`. Do not write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed by the autograder.
3. Do not change the file path or the file name of this notebook.
4. Make sure that you save your work (in the menubar, select `File` → `Save and Checkpoint`).

# Problem 7.3. Mining.

In this problem, we use the Reuters corpus to perform text mining tasks, such as n-grams, stemming, and clustering.

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy as sp
import re
import requests
import json
import string

from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils import check_random_state
from sklearn.cluster import KMeans

import nltk
from nltk.corpus import reuters
from nltk.stem.porter import PorterStemmer

from nose.tools import (
    assert_equal,
    assert_is_instance,
    assert_almost_equal,
    assert_true
)
from numpy.testing import assert_array_equal

We will analyze the NLTK Reuters corpus. See the [NLTK docs](http://www.nltk.org/book/ch02.html#reuters-corpus) for more information.

In [Problem 7.2](https://github.com/UI-DataScience/info490-sp16/blob/master/Week7/assignments/w7p2.ipynb), we created the training and test sets from the Reuters corpus. I saved `X_train`, `X_test`, `y_train`, and `y_test` from Problem 7.2 as JSON files.

```python
import json
with open('reuters_X_train.json', 'w') as f:
    json.dump(X_train, f)
```

The JSON files are available in `/home/data_scientist/data/misc`.

In [None]:
def load_reuters(name):
    fpath = '/home/data_scientist/data/misc/reuters_{}.json'.format(name)
    with open(fpath) as f:
        reuters = json.load(f)
    return reuters

X_train, X_test, y_train, y_test = map(load_reuters, ['X_train', 'X_test', 'y_train', 'y_test'])

## n-grams

- Use unigrams, bigrams, and trigrams,
- Build a pipeline by using `TfidfVectorizer` and `LinearSVC`,
- Name the first step `tf` and the second step `svc`,
- Use default parameters for both `TfidfVectorizer` and `LinearSVC` wherever this list does not say otherwise,
- Use English stop words,
- Set a minimum document frequency so that a term must appear in at least three documents, and
- Set a maximum document frequency of 75%, so that any term occurring in more than 75% of all documents is ignored.
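
To see how the n-gram range and the document-frequency cutoffs interact, here is a minimal sketch on a hypothetical four-document toy corpus (not the Reuters data). The names `toy_docs`, `toy_labels`, and `toy_clf` are made up for illustration, and the sketch uses bigrams with `min_df=2` rather than the values required above, so it is not a drop-in answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; every document contains the word "prices".
toy_docs = [
    "wheat prices rise",
    "wheat prices fall",
    "oil prices rise",
    "oil prices fall",
]
toy_labels = ["grain", "grain", "crude", "crude"]

toy_clf = Pipeline([
    # Unigrams and bigrams; keep terms found in at least 2 documents
    # and in at most 75% of all documents.
    ("tf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2),
                           min_df=2, max_df=0.75)),
    ("svc", LinearSVC(random_state=0)),
])
toy_clf.fit(toy_docs, toy_labels)

# "prices" occurs in 100% of the documents, so max_df removes the unigram,
# while bigrams such as "wheat prices" (2 of 4 documents) survive.
vocab = sorted(toy_clf.named_steps["tf"].vocabulary_)
print(vocab)

pred = toy_clf.predict(["wheat prices"])
print(pred)
```

Note that the cutoffs apply to each n-gram separately: a bigram can stay in the vocabulary even when one of its component words is pruned.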

In [None]:
def ngram(X_train, y_train, X_test, random_state):
    '''
    Creates a document term matrix and uses an SVM classifier to make document classifications.
    Uses unigrams, bigrams, and trigrams.
    
    Parameters
    ----------
    X_train: A list of strings.
    y_train: A list of strings.
    X_test: A list of strings.
    random_state: A np.random.RandomState instance.
    
    Returns
    -------
    A tuple of (clf, y_pred)
    clf: A Pipeline instance.
    y_pred: A numpy array.
    '''
    
    # YOUR CODE HERE
    
    return clf, y_pred

In [None]:
clf1, y_pred1 = ngram(X_train, y_train, X_test, random_state=check_random_state(0))
score1 = accuracy_score(y_test, y_pred1)
print("SVC prediction accuracy = {0:5.1f}%".format(100.0 * score1))

In [None]:
assert_is_instance(clf1, Pipeline)
assert_is_instance(y_pred1, np.ndarray)
tf1 = clf1.named_steps['tf']
assert_is_instance(tf1, TfidfVectorizer)
assert_is_instance(clf1.named_steps['svc'], LinearSVC)
assert_equal(tf1.stop_words, 'english')
assert_equal(tf1.ngram_range, (1, 3))
assert_equal(tf1.max_df, 0.75)
assert_equal(tf1.min_df, 3)
assert_equal(len(y_pred1), len(y_test))
assert_array_equal(y_pred1[:5], ['trade', 'grain', 'crude', 'bop', 'palm-oil'])
assert_array_equal(y_pred1[-5:], ['acq', 'dlr', 'ship', 'ipi', 'gold'])
assert_almost_equal(score1, 0.90195428949983436)

## Stemming

- Use the `tokenize` function in the following code cell to incorporate the Porter Stemmer into the classification pipeline,
- Use unigrams, bigrams, and trigrams,
- Build a pipeline by using `TfidfVectorizer` and `LinearSVC`,
- Name the first step `tf` and the second step `svc`,
- Use default parameters for both `TfidfVectorizer` and `LinearSVC` wherever this list does not say otherwise,
- Use English stop words,
- Set a minimum document frequency so that a term must appear in at least two documents, and
- Set a maximum document frequency of 50%, so that any term occurring in more than 50% of all documents is ignored.
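
As a rough illustration of how a custom tokenizer plugs into `TfidfVectorizer`, the sketch below passes a callable through the `tokenizer` argument. The function `toy_tokenize` is a hypothetical stand-in that only strips a trailing "s"; it is not the Porter stemmer and not the `tokenize` function provided in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def toy_tokenize(text):
    # Hypothetical stand-in for a real stemmer: split on whitespace and
    # strip one trailing "s" so that plural and singular forms collapse.
    return [w[:-1] if w.endswith("s") else w for w in text.split()]

# The vectorizer lowercases each document first, then calls the tokenizer.
# scikit-learn may warn that `token_pattern` is unused when a tokenizer is given.
toy_tf = TfidfVectorizer(tokenizer=toy_tokenize)
toy_tf.fit(["Ships sail", "ship sails"])

# Both documents reduce to the same two stems.
stem_vocab = sorted(toy_tf.vocabulary_)
print(stem_vocab)
```

Because stemming merges word variants into one feature, the vocabulary shrinks, which is why the document-frequency cutoffs for this part can differ from the n-gram part.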

In [None]:
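def tokenize(text):
    # Split the text into word tokens and drop punctuation tokens.
    tokens = nltk.word_tokenize(text)
    tokens = [token for token in tokens if token not in string.punctuation]

    # Reduce each token to its Porter stem; build a list rather than a lazy map object.
    stemmer = PorterStemmer()
    stems = [stemmer.stem(token) for token in tokens]
    
    return stems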

In [None]:
def stem(X_train, y_train, X_test, random_state):
    '''
    Creates a document term matrix and uses an SVM classifier to make document classifications.
    Uses the Porter stemmer.
    
    Parameters
    ----------
    X_train: A list of strings.
    y_train: A list of strings.
    X_test: A list of strings.
    random_state: A np.random.RandomState instance.
    
    Returns
    -------
    A tuple of (clf, y_pred)
    clf: A Pipeline instance.
    y_pred: A numpy array.
    '''
    
    # YOUR CODE HERE
    
    return clf, y_pred

In [None]:
clf2, y_pred2 = stem(X_train, y_train, X_test, random_state=check_random_state(0))
score2 = accuracy_score(y_test, y_pred2)
print("SVC prediction accuracy = {0:5.1f}%".format(100.0 * score2))

In [None]:
assert_is_instance(clf2, Pipeline)
assert_is_instance(y_pred2, np.ndarray)
tf2 = clf2.named_steps['tf']
assert_is_instance(tf2, TfidfVectorizer)
assert_is_instance(clf2.named_steps['svc'], LinearSVC)
assert_equal(tf2.stop_words, 'english')
assert_equal(tf2.ngram_range, (1, 3))
assert_equal(tf2.max_df, 0.5)
assert_equal(tf2.min_df, 2)
assert_equal(len(y_pred2), len(y_test))
assert_array_equal(y_pred2[:5], ['trade', 'grain', 'crude', 'bop', 'palm-oil'])
assert_array_equal(y_pred2[-5:], ['acq', 'dlr', 'ship', 'ipi', 'gold'])
assert_almost_equal(score2, 0.89234845975488575)

## Clustering Analysis

- Build a pipeline by using `TfidfVectorizer` and `KMeans`,
- Name the first step `tf` and the second step `km`,
- Use default parameters for both `TfidfVectorizer` and `KMeans`,
- Use unigrams only,
- Use English stop words, and
- Set the number of clusters equal to `true_k`.
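
The same pipeline pattern works for unsupervised models. The sketch below clusters a hypothetical four-document toy corpus into two groups; `toy_docs` and `toy_km` are illustrative names, and `n_init=10` is set explicitly only to keep the toy result stable across scikit-learn versions, which is an assumption of this sketch rather than a requirement of the problem.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with two clearly separated topics.
toy_docs = [
    "wheat corn wheat",
    "corn wheat corn",
    "oil gas oil",
    "gas oil gas",
]

toy_km = Pipeline([
    ("tf", TfidfVectorizer()),
    ("km", KMeans(n_clusters=2, n_init=10, random_state=0)),
])
# No labels are passed: KMeans only sees the TF-IDF matrix.
toy_km.fit(toy_docs)

# Integer cluster label assigned to each training document.
cluster_ids = toy_km.named_steps["km"].labels_
print(cluster_ids)
```

The integer labels are arbitrary: which topic ends up as cluster 0 depends on the initialization, so only the grouping itself is meaningful.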

In [None]:
def cluster(X_train, X_test, true_k, random_state):
    '''
    Applies clustering analysis to a feature matrix.
    
    Parameters
    ----------
    X_train: A list of strings.
    X_test: A list of strings.
    true_k: An int. The number of clusters.
    random_state: A np.random.RandomState instance.
    
    Returns
    -------
    A Pipeline instance.
    '''    
    # YOUR CODE HERE
    
    return clf

In [None]:
clf3 = cluster(X_train, X_test, true_k=len(reuters.categories()), random_state=check_random_state(0))

In [None]:
assert_is_instance(clf3, Pipeline)
tf3 = clf3.named_steps['tf']
assert_is_instance(tf3, TfidfVectorizer)
km3 = clf3.named_steps['km']
assert_is_instance(km3, KMeans)
assert_equal(tf3.stop_words, 'english')
assert_equal(tf3.ngram_range, (1, 1))
assert_equal(tf3.max_df, 1.0)
assert_equal(tf3.min_df, 1)
assert_equal(km3.n_clusters, len(reuters.categories()))

- Write a function that identifies the most frequently used words in a cluster.

The `cluster` parameter specifies the cluster label. Since we are not using training labels, the cluster labels returned by `KMeans` will be integers.

If you set `top_tokens=3`, the function should return a list of top 3 tokens in the specified cluster; similarly, if you set `top_tokens=5`, it should return a list of top 5 tokens.
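
One common way to rank the words of a cluster is to sort the vocabulary by the weights in that cluster's centroid. The sketch below shows only the sorting step, on a hand-made centroid matrix and vocabulary; `centers`, `terms`, and `top_terms` are hypothetical names. With a fitted `KMeans`, the centroid rows would come from its `cluster_centers_` attribute and the vocabulary from the vectorizer.

```python
import numpy as np

# Hypothetical centroid matrix: 2 clusters (rows) by 4 vocabulary terms (columns).
centers = np.array([
    [0.1, 0.7, 0.0, 0.4],
    [0.5, 0.0, 0.9, 0.2],
])
# Hypothetical vocabulary, in the same column order as `centers`.
terms = np.array(["bank", "oil", "rate", "ship"])

def top_terms(centers, terms, cluster, n):
    # argsort is ascending, so reverse it to get the largest weights first.
    order = np.argsort(centers[cluster])[::-1]
    # Convert NumPy strings to plain Python str objects.
    return [str(t) for t in terms[order[:n]]]

print(top_terms(centers, terms, 0, 2))  # ['oil', 'ship']
print(top_terms(centers, terms, 1, 3))  # ['rate', 'bank', 'ship']
```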

In [None]:
def get_top_tokens(km, tf, cluster, top_tokens):
    '''
    Identifies the most frequently used words in the specified cluster.
    
    Parameters
    ----------
    km: A KMeans instance.
    tf: A TfidfVectorizer instance.
    cluster: An int. The label of the cluster to inspect.
    top_tokens: An int. How many tokens do you want?
    
    Returns
    -------
    A list of strings.
    '''
    
    # YOUR CODE HERE
    
    return tokens

In [None]:
km = clf3.named_steps['km']
tf = clf3.named_steps['tf']

for i in range(len(reuters.categories())):
    print("Cluster {0}: {1}".format(i, ' '.join(get_top_tokens(km, tf, i, 5))))

In [None]:
token0 = get_top_tokens(km, tf, 0, 3)
assert_is_instance(token0, list)
assert_true(all(isinstance(t, str) for t in token0))
assert_equal(token0, ['units', 'housing', 'starts'])

token1 = get_top_tokens(km, tf, 89, 5)
assert_is_instance(token1, list)
assert_true(all(isinstance(t, str) for t in token1))
assert_equal(token1, ['qtly', 'cts', 'div', 'dividend', 'record'])