## Part 2: Training your own ML Model

<a href="https://colab.research.google.com/github/peckjon/hosting-ml-as-microservice/blob/master/part2/train_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Download corpora

We'll continue using the `movie_reviews` corpus to train our model. The `stopwords` corpus contains a [set of standard stopwords](https://gist.github.com/sebleier/554280) we'll want to remove from the input, and `punkt` is used for tokenization in the [.words()](https://www.nltk.org/api/nltk.corpus.html#corpus-reader-functions) method of the corpus reader.

In [54]:
import nltk

nltk.download('movie_reviews')
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\pbrad\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\pbrad\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\pbrad\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Define feature extractor and bag-of-words converter

Given a list of (already tokenized) words, we need a function to extract just the ones we care about: those not found in the list of English stopwords or standard punctuation.

We also need a way to easily turn a list of words into a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model), pairing each word with the count of its occurrences.

In [55]:
from nltk.corpus import stopwords
from string import punctuation

stopwords_eng = stopwords.words('english')

def extract_features(words):
    return [w for w in words if w not in stopwords_eng and w not in punctuation]

def bag_of_words(words):
    bag = {}
    for w in words:
        bag[w] = bag.get(w,0)+1
    return bag
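
To see what these two helpers produce, here is a minimal, self-contained sketch on a hand-written token list. The tiny `demo_stopwords` set and the `extract_features_demo` name are assumptions made for illustration (the notebook itself filters against NLTK's full English list), and `collections.Counter` is used as a standard-library equivalent of `bag_of_words`.

```python
from collections import Counter
from string import punctuation

# Hypothetical miniature stopword list; the notebook uses stopwords.words('english').
demo_stopwords = {"the", "is", "a", "and", "it"}

def extract_features_demo(words):
    # Same filter as extract_features, but against the small demo list.
    return [w for w in words if w not in demo_stopwords and w not in punctuation]

tokens = ["the", "movie", "is", "great", ",", "great", "fun", "!"]
features = extract_features_demo(tokens)

# Counter builds the same word -> count mapping as bag_of_words.
bag = dict(Counter(features))

print(features)  # ['movie', 'great', 'great', 'fun']
print(bag)       # {'movie': 1, 'great': 2, 'fun': 1}
```

Note that word order is lost in the bag: only which words appear, and how often, survives.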

### Ingest, clean, and convert the positive and negative reviews

For both the positive ("pos") and negative ("neg") sets of reviews, extract the features and convert to bag of words. From these, we construct a list of tuples known as a "featureset": the first part of each tuple is the bag of words for that review, and the second is its label ("pos"/"neg").

Note that `movie_reviews.words(fileid)` provides a tokenized list of words. If we wanted the un-tokenized text, we would use `movie_reviews.raw(fileid)` instead, then tokenize it using our preferred tokenizer (e.g. [nltk.tokenize.word_tokenize](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktLanguageVars.word_tokenize)).

In [56]:
from nltk.corpus import movie_reviews

reviews_pos = []
reviews_neg = []
for fileid in movie_reviews.fileids('pos'):
    words = extract_features(movie_reviews.words(fileid))
    reviews_pos.append((bag_of_words(words), 'pos'))
for fileid in movie_reviews.fileids('neg'):
    words = extract_features(movie_reviews.words(fileid))
    reviews_neg.append((bag_of_words(words), 'neg'))

In [57]:
reviews_pos[123][1]

'pos'

### Split reviews into training and test sets
We need to break up each group of reviews into a training set (about 80%) and a test set (the remaining 20%). In case there's some meaningful order to the reviews (e.g. the first 800 are from one group of reviewers, the next 200 are from another), we shuffle the sets first to ensure we aren't introducing additional bias. Note that this means our accuracy will not be exactly the same on every run; if you wish to see consistent results on each run, you can stabilize the shuffle by calling [random.seed(n)](https://www.geeksforgeeks.org/random-seed-in-python/) first.

In [58]:
from random import shuffle

split_pct = .80

def split_set(review_set):
    split = int(len(review_set)*split_pct)
    return (review_set[:split], review_set[split:])

shuffle(reviews_pos)
shuffle(reviews_neg)

pos_train, pos_test = split_set(reviews_pos)
neg_train, neg_test = split_set(reviews_neg)

train_set = pos_train+neg_train
test_set = pos_test+neg_test
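
If you want the same split on every run, one option is to shuffle with a seeded generator. The sketch below is an illustration on placeholder data, not part of the original notebook: `split_set_demo` and `seeded_split` are hypothetical helper names, and the seed value 42 is arbitrary.

```python
import random

def split_set_demo(items, split_pct=0.80):
    # Same slicing logic as split_set, with the ratio passed explicitly.
    split = int(len(items) * split_pct)
    return items[:split], items[split:]

def seeded_split(items, seed):
    # Shuffle a copy with a dedicated generator so the caller's list is untouched.
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return split_set_demo(shuffled)

reviews = [f"review_{i}" for i in range(10)]  # placeholder data
train_a, test_a = seeded_split(reviews, seed=42)
train_b, test_b = seeded_split(reviews, seed=42)

print(len(train_a), len(test_a))                # 8 2
print(train_a == train_b and test_a == test_b)  # True
```

Using `random.Random(seed)` instead of the module-level `random.seed(n)` keeps the reproducibility local to this one shuffle, which can be convenient when other cells also draw random numbers.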

In [59]:
test_set[1]

({'andy': 6,
  'leaves': 1,
  'cowboy': 2,
  'camp': 2,
  'mother': 1,
  'holds': 1,
  'yard': 3,
  'sale': 2,
  'scrounges': 1,
  'room': 1,
  'old': 3,
  'toys': 6,
  'one': 4,
  'wheezy': 3,
  'penguin': 1,
  'broken': 2,
  'squeaker': 1,
  'woody': 19,
  'tom': 2,
  'hanks': 2,
  'saddles': 1,
  'dog': 2,
  'rides': 1,
  'rescue': 1,
  'succeeds': 1,
  'mission': 1,
  'make': 1,
  'back': 1,
  'house': 1,
  'al': 7,
  'unscrupulous': 1,
  'owner': 2,
  'toy': 16,
  'barn': 3,
  'recognizes': 1,
  'rare': 1,
  'collector': 1,
  'item': 1,
  'steals': 1,
  'buzz': 10,
  'lightyear': 1,
  'tim': 1,
  'allen': 1,
  'leads': 2,
  'hamm': 1,
  'john': 1,
  'ratzenberger': 1,
  'mr': 1,
  'potato': 1,
  'head': 1,
  'rickles': 1,
  'slinky': 1,
  'jim': 1,
  'varney': 1,
  'rex': 1,
  'wallace': 1,
  'shawn': 1,
  'city': 1,
  'find': 1,
  'friend': 1,
  'meanwhile': 1,
  'discovers': 1,
  'reason': 1,
  'kidnapped': 1,
  'collected': 1,
  'every': 2,
  'piece': 1,
  'merchandising': 2,
 

### Train the model

Now that our data is ready, the training step itself is quite simple if we use the [NaiveBayesClassifier](https://www.nltk.org/api/nltk.classify.html#module-nltk.classify.naivebayes) provided by NLTK.

If you are used to methods such as `model.fit(x,y)` which take two parameters -- the data and the labels -- it may be confusing that `NaiveBayesClassifier.train` takes just one argument. This is because the labels are already embedded in `train_set`: each element in the set is a bag of words paired with a 'pos' or 'neg' value.

In [60]:
from nltk.classify import NaiveBayesClassifier

model = NaiveBayesClassifier.train(train_set)
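
For comparison, here is a small sketch of how a featureset in this shape could be pulled apart into separate "data" and "labels" sequences, should you ever need to feed it to a `fit(x, y)`-style API. The `toy_featureset` values are made up for illustration.

```python
# Toy featureset in the same shape as train_set: (bag_of_words, label) tuples.
toy_featureset = [
    ({"great": 2, "fun": 1}, "pos"),
    ({"boring": 1, "slow": 2}, "neg"),
    ({"charming": 1}, "pos"),
]

# zip(*...) separates the pairs into parallel sequences of features and labels.
bags, labels = zip(*toy_featureset)

print(bags[1])  # {'boring': 1, 'slow': 2}
print(labels)   # ('pos', 'neg', 'pos')
```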

### Check model accuracy

NLTK's built-in [accuracy](https://www.nltk.org/api/nltk.classify.html#module-nltk.classify.util) utility runs our `test_set` through the model and compares the labels returned by the model to the labels in the test set, producing an overall accuracy percentage. Not too impressive, right? We need to improve.

In [61]:
from nltk.classify.util import accuracy

print(100 * accuracy(model, test_set))

67.5
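
Conceptually, accuracy is just the fraction of test examples whose predicted label matches the true label. The sketch below spells that out by hand; `accuracy_by_hand` and the rule-based `toy_classify` are hypothetical stand-ins (for NLTK's utility and for `model.classify`, respectively), and the four test examples are invented.

```python
def accuracy_by_hand(classify, labeled_examples):
    # Fraction of examples whose predicted label matches the gold label.
    correct = sum(1 for features, label in labeled_examples
                  if classify(features) == label)
    return correct / len(labeled_examples)

def toy_classify(bag):
    # Hypothetical rule-based stand-in for model.classify.
    return "pos" if bag.get("great", 0) > bag.get("boring", 0) else "neg"

toy_test_set = [
    ({"great": 2}, "pos"),
    ({"boring": 1}, "neg"),
    ({"great": 1, "boring": 3}, "neg"),
    ({"fine": 1}, "pos"),  # misclassified: no "great", so the rule says "neg"
]

print(100 * accuracy_by_hand(toy_classify, toy_test_set))  # 75.0
```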


### Save the model
Our trained model will be cleared from memory when this notebook is closed. So that we can use it again later, save the model as a file using the [pickle](https://docs.python.org/3/library/pickle.html) serializer.

In [13]:
import pickle

# The with-block closes the file even if pickling raises an error
with open('sa_classifier.pickle', 'wb') as model_file:
    pickle.dump(model, model_file)
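
To use the saved model later, it has to be read back with `pickle.load`. Here is a minimal round-trip sketch; it uses a plain dictionary as a placeholder for the trained model and writes to a temporary directory, both of which are assumptions made so the example runs on its own.

```python
import os
import pickle
import tempfile

# Placeholder object; in the notebook this would be the trained `model`.
stand_in_model = {"labels": ["neg", "pos"], "note": "placeholder"}

path = os.path.join(tempfile.mkdtemp(), "sa_classifier.pickle")

# Pickle files are binary, hence the 'wb' and 'rb' modes.
with open(path, "wb") as f:
    pickle.dump(stand_in_model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == stand_in_model)  # True
```

Keep in mind that unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.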

## Trying a different model

In [62]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.stem import LancasterStemmer
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn import svm
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
import time

### Turn data set into dataframe

In [63]:
from nltk.corpus import movie_reviews as mr

reviews = []
for fileid in mr.fileids():
    tag, filename = fileid.split('/')
    reviews.append((mr.raw(fileid), tag))

df = pd.DataFrame(reviews, columns=['review', 'sentiment'])
df = df.sample(frac=1).reset_index(drop=True)

In [64]:
df.describe()

| | review | sentiment |
| --- | --- | --- |
| count | 2000 | 2000 |
| unique | 2000 | 2 |
| top | for those of us who weren't yet born when the ... | pos |
| freq | 1 | 1000 |


In [65]:
df['sentiment'].value_counts()

pos    1000
neg    1000
Name: sentiment, dtype: int64

The code above creates a shuffled dataframe of all the movie reviews with two columns: one holds the review text, the other its sentiment label. The next step is to clean the reviews so that they can be used to train a better model.

### Text Cleaning


In [66]:
tokenizer=ToktokTokenizer()
stopword_list=nltk.corpus.stopwords.words('english')
nltk.download('wordnet')  # WordNetLemmatizer needs the WordNet corpus
lemmatizer = WordNetLemmatizer()
import re

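def remove_special_characters(text, remove_digits=False):
    # Keep ASCII letters and whitespace; digits are kept unless remove_digits is True
    pattern = r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    text = re.sub(pattern, '', text)
    return text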

def simple_lemmer(text):
    lemmatizer = WordNetLemmatizer()
    word_list = nltk.word_tokenize(text)
    lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
    return lemmatized_output

def remove_stopwords(text):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    filtered_tokens = [token for token in tokens if token not in stopword_list]
    filtered_text = ' '.join(filtered_tokens) 
    return filtered_text

df['review'] = df['review'].apply(remove_special_characters)
df['review'] = df['review'].apply(simple_lemmer)
df['review'] = df['review'].apply(remove_stopwords)

In [67]:
df.iloc[69]['review']

'following review contains harsh language expect clicked title cast kristen holly smith danica sheridan alex boling michael dotson sonya hensley janet krajeski sabrina lu dionysius burbano calvin grant jeff b harmon written directed jeff b harmon running time 97 minute thought losing make vomity inside blatz balinski danica sheridan lament fact lesbian lover april kristen holly smith ha received telegram exfiance isle lesbos incredibly offensive musical comedy april pfferpot smith resident small town bumfuck arkansas get married high school sweetheart football hero dick dickson michael dotson april get extreme cold foot run home stick gun mouth pull trigger instead killing magically transported mirror isle lesbos alternate dimension lesbian rule men allowed except lance homosexual toilet cleanerslave april love new home friend dick parent ready give mr pfferpot director jeff b harmon janet krajeski decide need medical help enlist aid dr sigmoid colon also jeff b harmon claim cure homos
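
The effect of the cleaning steps is easier to see on a single short sentence. The sketch below is a simplified, standard-library-only approximation: it uses a hypothetical five-word stopword set, splits on whitespace instead of using `ToktokTokenizer`, and skips lemmatization entirely.

```python
import re

# Hypothetical miniature stopword list; the notebook uses NLTK's English list.
demo_stopwords = {"the", "is", "a", "it", "and"}

def clean_demo(text):
    # 1. Drop everything except ASCII letters, digits, and whitespace.
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # 2. Split on whitespace (a stand-in for ToktokTokenizer) and drop stopwords.
    tokens = [t for t in text.split() if t not in demo_stopwords]
    return " ".join(tokens)

cleaned = clean_demo("the ex-fiance's telegram is 97 minutes late, and it shows!")
print(cleaned)  # exfiances telegram 97 minutes late shows
```

Because punctuation is deleted rather than replaced with a space, hyphenated or slash-separated words are fused together, which is why tokens such as `exfiance` and `cleanerslave` appear in the cleaned review above.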

### Create the test/train split

In [68]:
# Create Training Set
train_reviews=df.review[:1600]
train_sentiments=df.sentiment[:1600]

# Create Test Set
test_reviews=df.review[1600:]
test_sentiments=df.sentiment[1600:]

# Make sure things are the same
print(train_reviews.shape,train_sentiments.shape)
print(test_reviews.shape,test_sentiments.shape)

(1600,) (1600,)
(400,) (400,)


Time to turn those reviews into vectors!

In [69]:
vectorizer = TfidfVectorizer(min_df = 5,
                             max_df = 0.8,
                             sublinear_tf = True,
                             use_idf = True)

train_vectors = vectorizer.fit_transform(train_reviews)
test_vectors = vectorizer.transform(test_reviews)
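
In the vectorizer settings above, `min_df = 5` drops terms that appear in fewer than 5 reviews, `max_df = 0.8` drops terms that appear in more than 80% of them, and `sublinear_tf = True` replaces each raw count with `1 + ln(count)`. To give a feel for the weighting itself, here is a pure-Python sketch of TF-IDF on a three-document toy corpus. It follows the smoothed IDF formula and L2 normalization that, to our understanding, scikit-learn applies by default; it omits the `min_df`/`max_df` filtering, and `tfidf_sketch` is a hypothetical helper, not scikit-learn's implementation.

```python
import math
from collections import Counter

def tfidf_sketch(docs):
    # docs: list of token lists. Returns one {term: weight} dict per document.
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency: rarer terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Sublinear term frequency: 1 + ln(count) dampens repeated words.
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        # L2-normalize so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

docs = [
    "great great fun movie".split(),
    "boring slow movie".split(),
    "great cast boring plot".split(),
]
vectors = tfidf_sketch(docs)
print(vectors[0])
```

In the first document, "fun" ends up with a higher weight than "movie" even though both occur once, because "fun" appears in only one document while "movie" appears in two.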

In [74]:
classifier_linear = svm.SVC(kernel='linear')
t0 = time.time()
classifier_linear.fit(train_vectors, train_sentiments)
t1 = time.time()
prediction_linear = classifier_linear.predict(test_vectors)
t2 = time.time()
time_linear_train = t1-t0
time_linear_predict = t2-t1

print("Training time: %fs; Prediction time: %fs" % (time_linear_train, time_linear_predict))

report = classification_report(test_sentiments, prediction_linear, output_dict=True)
print('positive: ', report['pos'])

print('negative: ', report['neg'])

Training time: 4.902939s; Prediction time: 1.011080s
positive:  {'precision': 0.9024390243902439, 'recall': 0.9203980099502488, 'f1-score': 0.9113300492610837, 'support': 201}
negative:  {'precision': 0.9179487179487179, 'recall': 0.8994974874371859, 'f1-score': 0.9086294416243655, 'support': 199}
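
The precision, recall, and F1 figures can be reproduced from four raw counts. The counts in the sketch below are inferred from the report above (they are the only whole numbers consistent with the printed supports and ratios), and `prf` is a hypothetical helper written for this illustration.

```python
def prf(tp, fp, fn):
    # Precision: share of predictions for the target class that are right.
    precision = tp / (tp + fp)
    # Recall: share of actual members of the target class that were found.
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Inferred counts: 201 true 'pos' reviews (185 found), 199 true 'neg' reviews (179 found).
tp, fn = 185, 16  # 'pos' reviews predicted 'pos' / predicted 'neg'
tn, fp = 179, 20  # 'neg' reviews predicted 'neg' / predicted 'pos'

print("positive:", prf(tp, fp, fn))
print("negative:", prf(tn, fn, fp))  # roles swap when 'neg' is the target class
print("accuracy:", (tp + tn) / (tp + fp + fn + tn))  # 0.91
```

That works out to 364 of 400 test reviews classified correctly, i.e. 91% accuracy on this particular split, compared with 67.5% for the Naive Bayes model earlier.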


### Save the model (Colab version)

Google Colab doesn't provide direct access to files saved during a notebook session, so we need to save the model in [Google Drive](https://drive.google.com) instead. The first time you run this, it will ask for permission to access your Google Drive. Follow the instructions, then wait a few minutes and look for a new folder called "Colab Output" in [Drive](https://drive.google.com). Note that Colab does not always sync to Drive immediately, so check the file update times and re-run this cell if it doesn't look like you have the most recent version of your file.

In [None]:
import sys
if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/gdrive')
    !mkdir -p '/content/gdrive/My Drive/Colab Output'
    # The with-block flushes and closes the file before Drive is unmounted
    with open('/content/gdrive/My Drive/Colab Output/sa_classifier.pickle','wb') as model_file:
        pickle.dump(model, model_file)
    print('Model saved in /content/gdrive/My Drive/Colab Output')
    !ls '/content/gdrive/My Drive/Colab Output'
    drive.flush_and_unmount()
    print('Re-run this cell if you cannot find it in https://drive.google.com')