# Module 5 Assignment

A few things you should keep in mind when working on assignments:

1. Make sure you fill in every place that says `YOUR CODE HERE`. Do **not** write your answer anywhere else. Anything you write elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run All_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

In [10]:
import pandas as pd
import numpy as np
import random
import nltk
import helper

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

from nose.tools import assert_equal, assert_is_instance
from nltk.corpus import reuters
from operator import itemgetter


For the following problems we will use the NLTK Reuters corpus, which contains 10,788 documents classified into 90 topics.

In [2]:
#Let's take a look at the categories
print(reuters.categories())

['acq', 'alum', 'barley', 'bop', 'carcass', 'castor-oil', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'cotton', 'cotton-oil', 'cpi', 'cpu', 'crude', 'dfl', 'dlr', 'dmk', 'earn', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-oil', 'heat', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-oil', 'livestock', 'lumber', 'meal-feed', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-oil', 'palmkernel', 'pet-chem', 'platinum', 'potato', 'propane', 'rand', 'rape-oil', 'rapeseed', 'reserves', 'retail', 'rice', 'rubber', 'rye', 'ship', 'silver', 'sorghum', 'soy-meal', 'soy-oil', 'soybean', 'strategic-metal', 'sugar', 'sun-meal', 'sun-oil', 'sunseed', 'tea', 'tin', 'trade', 'veg-oil', 'wheat', 'wpi', 'yen', 'zinc']


# Problem 1: Find the number of times a particular token occurs

Write a function called $\texttt{token_counter}$ which takes a token and an NLTK corpus and returns the number of times that token occurs in the corpus.
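
As a rough illustration of what counting a token means, here is a minimal sketch on a toy token list. The list `tokens` is a hypothetical stand-in for the output of a corpus's `words()` method, not Reuters data. Note that matching is exact, so `'to'` and `'To'` are counted separately.

```python
from collections import Counter

# Hypothetical stand-in for corpus.words(): any sequence of token strings.
tokens = ["ship", "to", "port", "To", "ship", "to", "to"]

# One lookup: a linear scan over the tokens, the same idea as list.count().
count_to = sum(1 for t in tokens if t == "to")

# Many lookups: build the frequency table once, then index into it.
freq = Counter(tokens)

print(count_to)        # occurrences of the exact token 'to'
print(freq["ship"])    # occurrences of 'ship'
print(freq["cargo"])   # a token that never occurs counts as 0
```

If many different tokens need to be counted, building the `Counter` once avoids rescanning the whole corpus for each token.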

In [3]:
def token_counter(token,corpora):
    '''
    Inputs
    -------
    
    token: the token to search for
    corpora: an NLTK corpus reader (must provide a words() method)
    
    Returns
    --------
    
    count: the number of times the token occurs in the corpora
    '''
    ###BEGIN SOLUTION###
    count = corpora.words().count(token)
    ###END SOLUTION###
    
    return count

In [4]:
num_to = token_counter('to',reuters)
assert_equal(num_to,34035)
num_ship = token_counter('ship',reuters)
assert_equal(num_ship,116)

In [5]:
cv = CountVectorizer(analyzer='word', lowercase=True)

# Problem 2: Use CountVectorizer to retrieve the corpus vocabulary

Write a function called $\texttt{most_common_words}$ which uses CountVectorizer to return the vocabulary of the corpus, that is, the mapping from each term to its feature index.
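
The `vocabulary_` attribute of a fitted CountVectorizer maps each term to its column index in the document-term matrix, not to its frequency. The sketch below uses three made-up sentences (an assumption for illustration, not Reuters text) to show the difference and one way to recover counts from the matrix.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for this illustration.
docs = ["the ship sailed", "the ship sank", "gold prices rose"]

cv = CountVectorizer(analyzer="word", lowercase=True)
X = cv.fit_transform(docs)

# term -> column index (columns are assigned in sorted term order)
vocab = cv.vocabulary_

# Summing each column of the document-term matrix gives the term frequency.
column_totals = np.asarray(X.sum(axis=0)).ravel()
term_counts = {term: int(column_totals[idx]) for term, idx in vocab.items()}

print(vocab["ship"])        # column index of 'ship'
print(term_counts["ship"])  # how often 'ship' occurs across all documents
```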

In [6]:
def most_common_words(corpora):
    '''
    Inputs
    -------

    corpora: an NLTK corpus reader (must provide a words() method)
    
    Returns
    --------
    
    vocabulary: a Python dictionary mapping each term to its feature index
    '''
    ###BEGIN SOLUTION###
    cv = CountVectorizer(analyzer='word', lowercase=True)
    cv.fit_transform(corpora.words())
    vocabulary = cv.vocabulary_
    ###END SOLUTION###
    
    return vocabulary

In [7]:
vocabulary = most_common_words(reuters)
assert_is_instance(vocabulary,dict)
assert_equal(vocabulary['asian'],3384)

# Problem 3: Reading and Preprocessing Data
Read in the [badges dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/badges/badges.info) (it is stored in the same directory under the name 'badges.data') using whichever method you like (pandas, numpy, built-in Python modules). *HINT: I recommend using pandas to read in this **fixed width** dataset.*

The first column holds the labels we want to predict (the + or - sign); the second column holds the names. If you do this step correctly, the first 5 rows of your data will be identical to the rows below:
```
[['+', 'Naoki Abe'],
 ['-', 'Myriam Abramson'],
 ['+', 'David W. Aha'],
 ['+', 'Kamal M. Ali'],
 ['-', 'Eric Allender']]
```
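
A minimal sketch of reading fixed-width text with pandas, using three of the sample rows above as an in-memory file. The column boundaries in `colspecs` are an assumption based on those sample rows (label in the first character, name from the third character on); the real file may let pandas infer them.

```python
import io
import pandas as pd

# In-memory stand-in for 'badges.data', built from the sample rows above.
raw = (
    "+ Naoki Abe\n"
    "- Myriam Abramson\n"
    "+ David W. Aha\n"
)

# colspecs are half-open [start, end) character ranges (assumed layout).
df = pd.read_fwf(
    io.StringIO(raw),
    colspecs=[(0, 1), (2, 40)],
    names=["label", "name"],
)

rows = df.values.tolist()
print(rows)
```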

Depending on how you read in the data, you may need to reformat it so that it is in an acceptable format for the [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function in scikit-learn.

Next, split the data: set the random_state parameter to 0, and set the appropriate parameter so that 80% of your dataset is set aside as training data.

Now encode your labels using [LabelEncoder](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder).

Next, create a [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) model from scikit-learn using the default values.

For your training features, learn the vocabulary and idf and return the document-term matrix. For your testing set, **only** transform the documents to a document-term matrix.

Assign your encoded labels to train_y and test_y for the training and testing labels, respectively.
Assign your document-term matrices to train_X and test_X for the training and testing sets, respectively.
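
The fit-on-train, transform-on-test pattern described above can be sketched on the five sample rows. The 3/2 split used here is an arbitrary illustration, not the 80/20 split the problem asks for, and the variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# First three sample rows as "training", last two as "testing" (toy split).
toy_train_names = ["Naoki Abe", "Myriam Abramson", "David W. Aha"]
toy_train_labels = ["+", "-", "+"]
toy_test_names = ["Kamal M. Ali", "Eric Allender"]
toy_test_labels = ["+", "-"]

# Learn the label mapping once, then reuse it for the test labels.
le = LabelEncoder()
toy_train_y = le.fit_transform(toy_train_labels)
toy_test_y = le.transform(toy_test_labels)

# Learn vocabulary and idf on the training names only.
tfidf = TfidfVectorizer()
toy_train_X = tfidf.fit_transform(toy_train_names)
# transform() reuses the training vocabulary; unseen terms are ignored.
toy_test_X = tfidf.transform(toy_test_names)

print(le.classes_, toy_train_y, toy_test_y)
print(toy_train_X.shape, toy_test_X.shape)
```

Both matrices have the same number of columns because the test set is projected onto the training vocabulary. In this toy split the test names share no terms with the training names, so their rows are entirely zero.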


In [8]:
###BEGIN SOLUTION
df = pd.read_fwf('badges.data', names=('class', 'name'))
df.dropna(inplace=True)
df.head(5).values

train, test = train_test_split(df, random_state=0, train_size=.8)

tfidf = TfidfVectorizer()
le = LabelEncoder()
train_y = le.fit_transform(train['class'])
train_X = tfidf.fit_transform(train['name'])
# Reuse the encoder and vectorizer fitted on the training split.
test_y = le.transform(test['class'])
test_X = tfidf.transform(test['name'])
###END SOLUTION

In [12]:
assert_equal(train_y.tolist(), [0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0])

assert_equal(test_y.tolist(), [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0])


# Problem 4: Fit a Naive Bayes Multinomial Model to the Badges dataset
Use scikit-learn's implementation of a [naive Bayes classifier for multinomial models](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html) to fit a model to the Badges dataset. Lastly, get predictions for the testing set and name your predictions *_pred*.

In [13]:
###BEGIN SOLUTION
model = MultinomialNB()
model.fit(train_X, train_y)
_pred = model.predict(test_X)
###END SOLUTION

In [14]:
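# Note: classification_report expects (y_true, y_pred). The predictions are
# passed first here, so in the output below precision and recall are swapped
# and the support column counts predicted labels.
print(metrics.classification_report(_pred, test_y))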

             precision    recall  f1-score   support

          0       1.00      0.81      0.90        58
          1       0.08      1.00      0.15         1

avg / total       0.98      0.81      0.88        59
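
The report above is produced with the predictions as the first argument. scikit-learn's metric functions are documented as taking `(y_true, y_pred)`, and precision and recall trade places when the two are swapped. A small sketch on made-up labels (not the badges data) shows the effect:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels for illustration only.
toy_true = [1, 0, 0, 1, 1, 0]
toy_pred = [1, 0, 0, 0, 0, 0]

# Correct order: (y_true, y_pred)
prec = precision_score(toy_true, toy_pred)  # 1 of 1 predicted positives is right
rec = recall_score(toy_true, toy_pred)      # 1 of 3 true positives is found

# Swapped order: precision now equals the recall computed above.
prec_swapped = precision_score(toy_pred, toy_true)

print(prec, rec, prec_swapped)
```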



In [15]:
assert_equal(_pred.tolist(), [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])


# Problem 5: Using N-Grams to classify badges 
Use scikit-learn's [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html). Fit it on the training names and transform them into the training document-term matrix.
Transform the testing names into the testing document-term matrix.

Next, create another Multinomial Naive Bayes model.
Fit the model on the training document-term matrix and training labels.
Then predict the testing labels from the testing document-term matrix. Store the predictions as *p*.


In [16]:
###BEGIN SOLUTION
cv = CountVectorizer()
tk_func = cv.build_analyzer()

x1 = cv.fit_transform(train['name'])
x2 = cv.transform(test['name'])

model2 = MultinomialNB()
# Fit and predict with the new model rather than reusing `model` from Problem 4.
model2.fit(x1, train_y)
p = model2.predict(x2)
###END SOLUTION

In [17]:
print(metrics.classification_report(test_y, p))

             precision    recall  f1-score   support

          0       0.81      0.98      0.88        47
          1       0.50      0.08      0.14        12

avg / total       0.74      0.80      0.73        59



In [18]:
assert_equal(p.tolist(), [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0])

As before, we will use the NLTK Reuters corpus, which contains 10,788 documents classified into 90 topics. Each document has a file id (returned by the $\texttt{fileids()}$ method) that indicates whether it is a training or testing document. An example is below.

In [19]:
#Let's look at the file id for the first document
reuters.fileids()[0]

'test/14826'

# Problem 6: Write a function that splits the reuters data into testing and training sets

Write a function called $\texttt{splitter}$ which takes in a corpus and returns a list of file ids for the training documents and a list of file ids for the testing documents.
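
Since the split is encoded in the file id prefix, the idea can be sketched on a plain list of ids. Only `'test/14826'` comes from the output above; the other ids and the helper name `split_ids` are illustrative assumptions.

```python
# Toy list of file ids; only 'test/14826' is taken from the notebook output.
toy_fileids = ["test/14826", "test/14828", "training/1", "training/10"]


def split_ids(ids):
    """Partition file ids into (training, testing) by their prefix."""
    train = [i for i in ids if i.startswith("training/")]
    test = [i for i in ids if i.startswith("test/")]
    return train, test


toy_train_ids, toy_test_ids = split_ids(toy_fileids)
print(toy_train_ids)
print(toy_test_ids)
```

Matching on the prefix with `startswith` is slightly stricter than a substring test, which would also match an id that merely contains "train" or "test" somewhere else.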

In [20]:
def splitter(corpora):
    """

    Inputs
    ----------

    corpora: an NLTK corpus
    
    Returns
    -------

    training_ids: a list of training document file ids
    testing_ids: a list of testing document file ids
    """

    ###BEGIN SOLUTION###
    # Use the corpus passed in, not the global `reuters`, for both lists.
    training_ids = [i for i in corpora.fileids() if "train" in i]
    testing_ids = [i for i in corpora.fileids() if "test" in i]
    ###END SOLUTION###
    
    return training_ids, testing_ids

In [21]:
training_ids, testing_ids = splitter(reuters)
assert_equal(len(training_ids),7769)
assert_equal(len(testing_ids),3019)


We will use the function below, along with the training and testing ids that we found, to get the full training and testing datasets.


In [22]:
def get_categories_from_fileids(corpus, fileids):
    """
    Finds categories for each element of 'fileids'.
    
    Parameters
    ----------
    corpus: An NLTK corpus.
    fileids: A list of strings.
    
    Returns
    -------
    A list of strings.
    """
    
    result = [sorted(corpus.categories(fileids=f))[0] for f in fileids]
    
    return result
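
A document can carry several categories, and the function above keeps only the alphabetically first one, which turns the task into single-label classification. A minimal sketch of that reduction, with a hypothetical dictionary standing in for `corpus.categories(fileids=f)`:

```python
# Hypothetical file-id -> categories mapping, standing in for the corpus.
toy_categories = {
    "training/1": ["cocoa"],
    "training/10": ["wheat", "grain", "corn"],
}

# Keep the alphabetically first category of each document.
toy_labels = [sorted(cats)[0] for cats in toy_categories.values()]
print(toy_labels)
```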

In [23]:
X_train = [reuters.raw(fileids=fileid) for fileid in training_ids][:1000]
y_train = get_categories_from_fileids(reuters, training_ids)[:1000]


X_test = [reuters.raw(fileids=fileid) for fileid in testing_ids]
y_test = get_categories_from_fileids(reuters, testing_ids)

# Problem 7: Fit a naive bayes classifier using different n-grams

Write a function called $\texttt{naivebayes_categories}$ which takes in the training data, the training labels, a lower n-gram bound, and an upper n-gram bound, and returns a fitted model. Use English stop words in the CountVectorizer.
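
To make the effect of the n-gram range and stop words concrete, here is a small sketch with four invented sentences and two invented labels (assumptions for illustration, not Reuters data). It builds a pipeline of the kind described and inspects which features the vectorizer produced.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy data.
toy_docs = [
    "The price of gold rose",
    "Gold output rose",
    "The wheat harvest fell",
    "Wheat exports fell",
]
toy_topics = ["gold", "gold", "wheat", "wheat"]

# Unigrams and bigrams, with English stop words removed before n-grams form.
toy_clf = Pipeline([
    ("cv", CountVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("nb", MultinomialNB()),
])
toy_clf.fit(toy_docs, toy_topics)

toy_vocab = toy_clf.named_steps["cv"].vocabulary_
print(sorted(toy_vocab))
print(toy_clf.predict(["gold price rose"]))
```

Because stop words are dropped before n-grams are built, "price of gold" yields the bigram "price gold" rather than "price of" or "of gold".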

In [24]:
def naivebayes_categories(X_train, y_train, lower, upper):
    """

    Inputs
    ----------
    X_train: The training data
    y_train: The training labels
    lower: the lower bound of the n-gram range
    upper: the upper bound of the n-gram range
    
    Returns
    -------
    clf: the fitted model
    
    """
    ###BEGIN SOLUTION###
    tools = [('cv', CountVectorizer()), ('nb', MultinomialNB())]
    clf = Pipeline(tools)
    clf.set_params(cv__stop_words='english',cv__ngram_range=(lower,upper))

    clf = clf.fit(X_train, y_train)
    ###END SOLUTION###
    
    return clf

In [25]:
model = naivebayes_categories(X_train,y_train,1,2)
predictions = model.predict(X_test)
assert_is_instance(model,Pipeline)
assert_equal(model.classes_[0],'acq')
assert_equal(predictions[0],'trade')

Let's see how accurate the model is.

In [26]:
accuracy_score(y_test, predictions)

0.71745611129513087