This notebook will be collected automatically at **6pm on Monday** from the `/home/data_scientist/assignments/Week7` directory on the course JupyterHub server. If you work on this assignment on the course JupyterHub server, just make sure that you save your work; instructors will pull your notebooks automatically after the deadline. If you work on this assignment locally, note that the only way to submit is via JupyterHub: you must place the notebook file in the correct directory with the correct file name before the deadline.

1. Make sure everything runs as expected. First, restart the kernel (in the menubar, select `Kernel` → `Restart`) and then run all cells (in the menubar, select `Cell` → `Run All`).
2. Make sure you fill in every place that says `YOUR CODE HERE`. Do not write your answer anywhere other than where it says `YOUR CODE HERE`. Anything you write elsewhere will be removed by the autograder.
3. Do not change the file path or the file name of this notebook.
4. Make sure that you save your work (in the menubar, select `File` → `Save and Checkpoint`).

# Problem 7.2. Text Classification.

In this problem, we perform text classification tasks by using the NLTK and scikit-learn machine learning libraries.

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy as sp
import re
import requests

from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.utils import check_random_state

import nltk

from nose.tools import (
    assert_equal,
    assert_is_instance,
    assert_almost_equal,
    assert_true
)
from numpy.testing import assert_array_equal

We will analyze the NLTK Reuters corpus. See the [NLTK docs](http://www.nltk.org/book/ch02.html#reuters-corpus) for more information.

In [None]:
from nltk.corpus import reuters

## Categories

Before delving into text data mining, let's first explore the Reuters data set.

- Write a function that, given an NLTK corpus object, returns its categories.

In [None]:
def get_categories(corpus):
    '''
    Finds categories of an NLTK corpus.
    
    Parameters
    ----------
    corpus: An NLTK corpus object.
    
    Returns
    -------
    A list of strings.
    '''
    
    # YOUR CODE HERE
    
    return categories

In [None]:
all_categories = get_categories(reuters)
print(all_categories)

In [None]:
assert_is_instance(all_categories, list)
for c in all_categories:
    assert_is_instance(c, str)
assert_equal(len(all_categories), 90)
assert_equal(sorted(all_categories)[:5], ['acq', 'alum', 'barley', 'bop', 'carcass'])
assert_equal(sorted(all_categories)[-5:], ['veg-oil', 'wheat', 'wpi', 'yen', 'zinc'])

## fileids

- Find all `fileids` of the Reuters corpus.

In [None]:
def get_fileids(corpus):
    '''
    Finds all fileids of an NLTK corpus.
    
    Parameters
    ----------
    corpus: An NLTK corpus object.
    
    Returns
    -------
    A list of strings.
    '''
    
    # YOUR CODE HERE
    
    return fileids

In [None]:
fileids = get_fileids(reuters)
print(fileids[:5], '...', fileids[-5:])

In [None]:
assert_is_instance(fileids, list)
for f in fileids:
    assert_is_instance(f, str)
assert_equal(len(fileids), 10788)
assert_equal(
    sorted(fileids)[:5],
    ['test/14826', 'test/14828', 'test/14829', 'test/14832', 'test/14833']
    )
assert_equal(
    sorted(fileids)[-5:], 
    ['training/999', 'training/9992', 'training/9993', 'training/9994', 'training/9995']
    )

## List of categories

The corpus contains 10,788 documents (`fileids`) which have been classified into 90 topics (`categories`). We will use those 10,788 documents to train machine learning models and try to predict which topic each document belongs to.

- Find categories for each element of `fileids`.

Note that categories in the Reuters corpus overlap, because a news story often covers multiple topics. If a document has more than one category, use **only the first category** (in alphabetical order).

Since we are using only one category for each `fileid`, the result from `get_categories_from_fileids()` should have the same length as the `fileids` list.
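
The "first category in alphabetical order" rule can be illustrated on a toy mapping. This is only a minimal sketch, not the graded solution: `toy_labels` and `first_category` are hypothetical names, and the labels are made up rather than read from the Reuters corpus.

```python
# Hypothetical stand-in for the labels of three documents (not real Reuters data).
toy_labels = {
    'test/1': ['trade'],
    'test/2': ['wheat', 'grain', 'corn'],
    'training/3': ['yen', 'dlr'],
}

def first_category(labels):
    """Return the alphabetically first label of one document."""
    return min(labels)

# One label per document, in the same order as the file IDs.
single_labels = [first_category(toy_labels[fid]) for fid in toy_labels]
print(single_labels)  # ['trade', 'corn', 'dlr']
```

Because exactly one label is kept per document, the output list has the same length as the list of file IDs.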

In [None]:
def get_categories_from_fileids(corpus, fileids):
    '''
    Finds categories for each element of "fileids".
    
    Parameters
    ----------
    corpus: An NLTK corpus.
    fileids: A list of strings.
    
    Returns
    -------
    A list of strings.
    '''
    
    # YOUR CODE HERE
    
    return result

In [None]:
categories = get_categories_from_fileids(reuters, fileids)
print(categories[:5], '...', categories[-5:])

In [None]:
assert_is_instance(categories, list)
assert_true(all(isinstance(c, str) for c in categories))
assert_equal(len(categories), len(fileids))
assert_equal(
    categories[:5],
    ['trade', 'grain', 'crude', 'corn', 'palm-oil']
    )
assert_equal(
    categories[-5:],
    ['interest', 'earn', 'earn', 'earn', 'earn']
    )

## Training and test sets

The Reuters data set has already been grouped into a training set and a test set. See `fileids`.

- To create a training set, iterate through `fileids` and find all `fileids` that start with "train". `X_train` should be a list of **raw data** strings, which you can obtain by using the `raw()` method. `y_train` is a list of categories. Repeat for all `fileids` that start with "test". In the end, `train_test_split()` should return a 4-tuple of `X_train`, `X_test`, `y_train`, and `y_test`, each of which is a list of strings.
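
The prefix-based split can be sketched on toy data as follows. Everything here is an assumption for illustration: `split_by_prefix`, `toy_fileids`, `toy_texts`, and `toy_categories` are hypothetical, and the texts come from a plain dictionary; with the real corpus, the text lookup would go through the `raw()` method mentioned above.

```python
# Hypothetical file IDs, raw texts, and labels (not real Reuters data).
toy_fileids = ['test/1', 'training/2', 'training/3', 'test/4']
toy_texts = {
    'test/1': 'EXPORTERS FEAR TRADE DAMAGE',
    'training/2': 'COCOA REVIEW',
    'training/3': 'TERMINAL SYSTEMS ACQUISITION',
    'test/4': 'GRAIN HARVEST REPORT',
}
toy_categories = ['trade', 'cocoa', 'acq', 'grain']

def split_by_prefix(fileids, texts, labels):
    """Partition texts and labels by the prefix of each file ID."""
    X_train, X_test, y_train, y_test = [], [], [], []
    for fid, label in zip(fileids, labels):
        if fid.startswith('train'):
            X_train.append(texts[fid])
            y_train.append(label)
        elif fid.startswith('test'):
            X_test.append(texts[fid])
            y_test.append(label)
    return X_train, X_test, y_train, y_test

X_tr, X_te, y_tr, y_te = split_by_prefix(toy_fileids, toy_texts, toy_categories)
print(y_tr)  # ['cocoa', 'acq']
print(y_te)  # ['trade', 'grain']
```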

In [None]:
def train_test_split(corpus, fileids, categories):
    '''
    Creates a training set and a test set from the NLTK Reuters corpus.
    
    Parameters
    ----------
    corpus: An NLTK corpus.
    fileids: A list of strings.
    categories: A list of strings.
    
    Returns
    -------
    A 4-tuple (X_train, X_test, y_train, y_test)
    All four elements in the tuple are lists of strings.
    '''
    
    # YOUR CODE HERE
    
    return X_train, X_test, y_train, y_test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(reuters, fileids, categories)

In [None]:
assert_is_instance(X_train, list)
assert_is_instance(X_test, list)
assert_is_instance(y_train, list)
assert_is_instance(y_test, list)

assert_true(all(isinstance(elem, str) for elem in X_train))
assert_true(all(isinstance(elem, str) for elem in X_test))
assert_true(all(isinstance(elem, str) for elem in y_train))
assert_true(all(isinstance(elem, str) for elem in y_test))

assert_equal(len(X_train), 7769)
assert_equal(len(X_test), 3019)
assert_equal(len(X_train), len(y_train))
assert_equal(len(X_test), len(y_test))

assert_equal(X_train[0][:20], 'BAHIA COCOA REVIEW\n ')
assert_equal(y_train[0], 'cocoa')
assert_equal(X_test[0][:20], 'ASIAN EXPORTERS FEAR')
assert_equal(y_test[0], 'trade')

assert_equal(X_train[1][:20], 'COMPUTER TERMINAL SY')
assert_equal(y_train[1], 'acq')
assert_equal(X_test[1][:20], 'CHINA DAILY SAYS VER')
assert_equal(y_test[1], 'grain')

assert_equal(X_train[2][:20], 'N.Z. TRADING BANK DE')
assert_equal(y_train[2], 'money-supply')
assert_equal(X_test[2][:20], 'JAPAN TO REVISE LONG')
assert_equal(y_test[2], 'crude')

## SVC (no pipeline, no stop words)

- Use `CountVectorizer` to create a document term matrix, and apply `LinearSVC` algorithm to classify which topic each news document belongs to. Do not use pipeline (yet). Do not use stop words (yet). Use default parameters for both `CountVectorizer` and `LinearSVC`.
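
As a rough illustration of the two-step workflow (vectorize, then classify), here is a sketch on a made-up four-document corpus. The documents, labels, and variable names are assumptions for illustration only, and an integer seed is used for `random_state` for simplicity.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Made-up training documents and labels (not Reuters data).
docs_train = [
    'wheat harvest grain',
    'grain exports wheat',
    'bank rates interest',
    'interest rates rise',
]
labels_train = ['grain', 'grain', 'interest', 'interest']
docs_test = ['wheat grain prices']

# Learn the vocabulary from the training documents and count term occurrences.
toy_cv = CountVectorizer()
dtm_train = toy_cv.fit_transform(docs_train)

# Reuse the fitted vocabulary for the test documents; unseen words are ignored.
dtm_test = toy_cv.transform(docs_test)

# Fit a linear SVM on the document term matrix and predict the test labels.
toy_svc = LinearSVC(random_state=0)
toy_svc.fit(dtm_train, labels_train)
toy_pred = toy_svc.predict(dtm_test)

print(dtm_train.shape)  # (4, 8): 4 documents, 8 distinct terms
print(toy_pred)
```

Note that the vectorizer is fitted on the training documents only; the test documents are transformed with the already fitted vocabulary.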

In [None]:
def cv_svc(X_train, y_train, X_test, random_state):
    '''
    Creates a document term matrix and uses SVM classifier to make document classifications.
    
    Parameters
    ----------
    X_train: A list of strings.
    y_train: A list of strings.
    X_test: A list of strings.
    random_state: A np.random.RandomState instance.
    
    Returns
    -------
    A tuple of (cv, svc, y_pred)
    cv: A CountVectorizer instance.
    svc: A LinearSVC instance.
    y_pred: A numpy array.
    '''
    
    # YOUR CODE HERE
    
    return cv, svc, y_pred

In [None]:
cv1, svc1, y_pred1 = cv_svc(X_train, y_train, X_test, random_state=check_random_state(0))
score1 = accuracy_score(y_pred1, y_test)
print("SVC prediction accuracy = {0:3.1f}%".format(100.0 * score1))

In [None]:
assert_is_instance(cv1, CountVectorizer)
assert_is_instance(svc1, LinearSVC)
assert_is_instance(y_pred1, np.ndarray)
assert_equal(cv1.stop_words, None)
assert_equal(len(y_pred1), len(y_test))
assert_array_equal(y_pred1[:5], ['trade', 'grain', 'crude', 'corn', 'palm-oil'])
assert_array_equal(y_pred1[-5:], ['acq', 'dlr', 'earn', 'ipi', 'gold'])
assert_almost_equal(score1, 0.88274263000993702)

## SVC (Pipeline, no stop words)

- Build a pipeline by using `CountVectorizer` and `LinearSVC`. Name the first step `cv` and the second step `svc`. Do not use stop words (yet). Use default parameters for both `CountVectorizer` and `LinearSVC`.
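
A pipeline chains the vectorizer and the classifier so that a single `fit()` and `predict()` call handles both steps. The following sketch uses made-up documents and hypothetical variable names (`toy_clf`, `toy_pred`); it only illustrates how named steps are declared and retrieved.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up documents and labels (not Reuters data).
docs_train = [
    'wheat harvest grain',
    'grain exports wheat',
    'bank rates interest',
    'interest rates rise',
]
labels_train = ['grain', 'grain', 'interest', 'interest']
docs_test = ['wheat grain prices']

# Each step is a (name, estimator) pair; raw strings go in, labels come out.
toy_clf = Pipeline([
    ('cv', CountVectorizer()),
    ('svc', LinearSVC(random_state=0)),
])
toy_clf.fit(docs_train, labels_train)
toy_pred = toy_clf.predict(docs_test)

# Steps can be retrieved by the names given above.
print(type(toy_clf.named_steps['cv']).__name__)   # CountVectorizer
print(type(toy_clf.named_steps['svc']).__name__)  # LinearSVC
```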

In [None]:
def cv_svc_pipe(X_train, y_train, X_test, random_state):
    '''
    Creates a document term matrix and uses SVM classifier to make document classifications.
    
    Parameters
    ----------
    X_train: A list of strings.
    y_train: A list of strings.
    X_test: A list of strings.
    random_state: A np.random.RandomState instance.
    
    Returns
    -------
    A tuple of (clf, y_pred)
    clf: A Pipeline instance.
    y_pred: A numpy array.
    '''
    
    # YOUR CODE HERE
    
    return clf, predicted

In [None]:
clf2, y_pred2 = cv_svc_pipe(X_train, y_train, X_test, random_state=check_random_state(0))
score2 = accuracy_score(y_pred2, y_test)
print("SVC prediction accuracy = {0:3.1f}%".format(100.0 * score2))

In [None]:
assert_is_instance(clf2, Pipeline)
assert_is_instance(y_pred2, np.ndarray)
cv2 = clf2.named_steps['cv']
assert_is_instance(cv2, CountVectorizer)
assert_is_instance(clf2.named_steps['svc'], LinearSVC)
assert_equal(cv2.stop_words, None)
assert_equal(len(y_pred2), len(y_test))
assert_array_equal(y_pred1, y_pred2)
assert_almost_equal(score1, score2)

## SVC (Pipeline and stop words)

- Build a pipeline by using `CountVectorizer` and `LinearSVC`. Name the first step `cv` and the second step `svc`. Use English stop words. Use default parameters for both `CountVectorizer` and `LinearSVC`.
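
To see what the stop word option changes, the sketch below compares the vocabulary learned with and without `stop_words='english'`. The two sentences and the variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences (not Reuters data).
docs = ['the price of wheat is rising', 'the bank raised the rate']

# Default settings keep every token of two or more characters.
cv_default = CountVectorizer()
cv_default.fit(docs)

# With the built-in English list, common function words are dropped.
cv_stop = CountVectorizer(stop_words='english')
cv_stop.fit(docs)

vocab_default = sorted(cv_default.vocabulary_)
vocab_stop = sorted(cv_stop.vocabulary_)
print(vocab_default)
print(vocab_stop)
```

Words such as "the" appear only in the first vocabulary, so the filtered document term matrix has fewer columns.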

In [None]:
def cv_svc_pipe_sw(X_train, y_train, X_test, random_state):
    '''
    Creates a document term matrix and uses SVM classifier to make document classifications.
    Uses English stop words.
    
    Parameters
    ----------
    X_train: A list of strings.
    y_train: A list of strings.
    X_test: A list of strings.
    random_state: A np.random.RandomState instance.
    
    Returns
    -------
    A tuple of (clf, y_pred)
    clf: A Pipeline instance.
    y_pred: A numpy array.
    '''

    # YOUR CODE HERE
    
    return clf, predicted

In [None]:
clf3, y_pred3 = cv_svc_pipe_sw(X_train, y_train, X_test, random_state=check_random_state(0))
score3 = accuracy_score(y_pred3, y_test)
print("SVC prediction accuracy = {0:3.1f}%".format(100.0 * score3))

In [None]:
assert_is_instance(clf3, Pipeline)
assert_is_instance(y_pred3, np.ndarray)
cv3 = clf3.named_steps['cv']
assert_is_instance(cv3, CountVectorizer)
assert_is_instance(clf3.named_steps['svc'], LinearSVC)
assert_equal(cv3.stop_words, 'english')
assert_equal(len(y_pred3), len(y_test))
assert_array_equal(y_pred3[:5], ['trade', 'grain', 'crude', 'corn', 'palm-oil'])
assert_array_equal(y_pred3[-5:], ['acq', 'dlr', 'earn', 'ipi', 'gold'])
assert_almost_equal(score3, 0.87777409738323953)

## Pipeline of TF-IDF and SVM with stop words

- Build a pipeline by using `TfidfVectorizer` and `LinearSVC`. Name the first step `tf` and the second step `svc`. Use English stop words. Use default parameters for both `TfidfVectorizer` and `LinearSVC`.
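
Unlike raw counts, TF-IDF down-weights terms that occur in many documents and, with default settings, scales each document vector to unit length. The sketch below checks both properties on three made-up sentences; the variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sentences: "prices" occurs in every document, "wheat" in only one.
docs = ['wheat prices rise', 'oil prices rise', 'gold prices fall']

toy_tf = TfidfVectorizer(stop_words='english')
weights = toy_tf.fit_transform(docs)

# Inspect the weights of the first document.
row = weights[0].toarray().ravel()
w_wheat = row[toy_tf.vocabulary_['wheat']]
w_prices = row[toy_tf.vocabulary_['prices']]

print(w_wheat > w_prices)                    # True: the rarer term weighs more
print(np.isclose(np.linalg.norm(row), 1.0))  # True: rows are L2-normalized
```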

In [None]:
def tfidf_svc(X_train, y_train, X_test, random_state):
    '''
    Creates a document term matrix and uses SVM classifier to make document classifications.
    Uses English stop words.
    
    Parameters
    ----------
    X_train: A list of strings.
    y_train: A list of strings.
    X_test: A list of strings.
    random_state: A np.random.RandomState instance.
    
    Returns
    -------
    A tuple of (clf, y_pred)
    clf: A Pipeline instance.
    y_pred: A numpy array.
    '''
    
    # YOUR CODE HERE
    
    return clf, predicted

In [None]:
clf4, y_pred4 = tfidf_svc(X_train, y_train, X_test, random_state=check_random_state(0))
score4 = accuracy_score(y_pred4, y_test)
print("SVC prediction accuracy = {0:5.1f}%".format(100.0 * score4))

In [None]:
assert_is_instance(clf4, Pipeline)
assert_is_instance(y_pred4, np.ndarray)
tf4 = clf4.named_steps['tf']
assert_is_instance(tf4, TfidfVectorizer)
assert_is_instance(clf4.named_steps['svc'], LinearSVC)
assert_equal(tf4.stop_words, 'english')
assert_equal(len(y_pred4), len(y_test))
assert_array_equal(y_pred4[:5], ['trade', 'grain', 'crude', 'bop', 'palm-oil'])
assert_array_equal(y_pred4[-5:], ['acq', 'dlr', 'crude', 'ipi', 'gold'])
assert_almost_equal(score4, 0.89831069890692283)