## Part 2: Training your own ML Model

<a href="https://colab.research.google.com/github/peckjon/hosting-ml-as-microservice/blob/master/part2/train_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Download corpora

We'll continue using the `movie_reviews` corpus to train our model. The `stopwords` corpus contains a [set of standard stopwords](https://gist.github.com/sebleier/554280) that we'll want to remove from the input, and `punkt` is used for tokenization in the [.words()](https://www.nltk.org/api/nltk.corpus.html#corpus-reader-functions) method of the corpus reader.

In [39]:
from nltk import download
import nltk

download('movie_reviews')
download('punkt')
download('stopwords')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/arnaudcharlier/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/arnaudcharlier/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/arnaudcharlier/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

### Define feature extractor and bag-of-words converter

Given a list of (already tokenized) words, we need a function to extract just the ones we care about: those not found in the list of English stopwords or standard punctuation.

We also need a way to easily turn a list of words into a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model), pairing each word with the count of its occurrences.

In [40]:
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from string import punctuation

stopwords_eng = stopwords.words('english')
manual_stopwords = [',', '.', '?', '/', '(', ')', "'", '*', '://', '-', '[', ']', '’', ':', '**', "\\", '"',
                    's', ';', "like", '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '&#', '!', '...', '=', 'e',
                    'r', '&', ').', 'g', 'films','3000','secondly','bible']

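# Build the stemmer once instead of on every call
stemmer = SnowballStemmer("english")

def wordStemmer(wordrow):
    # Reduce a single word to its stem (e.g. "running" -> "run")
    return stemmer.stem(wordrow)

def extract_features(words):
    # Keep tokens that are not stopwords, punctuation, or manually excluded
    return [w for w in words if w not in stopwords_eng and w not in punctuation and w not in manual_stopwords]

def bag_of_words(words):
    # Map each word to its number of occurrences. Keys are the unstemmed
    # words; key on wordStemmer(w) instead to count stems.
    bag = {}
    for w in words:
        bag[w] = bag.get(w,0)+1
    return bag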
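
As a quick illustration of the two helpers above, here is a minimal standalone sketch of the same idea on a toy sentence. The tiny `toy_stopwords` set and the sample tokens are made up for the example (they stand in for NLTK's stopword list), and `collections.Counter` is swapped in for the manual counting loop.

```python
from collections import Counter
from string import punctuation

# Hypothetical miniature stopword set, standing in for stopwords.words('english').
toy_stopwords = {"the", "was", "and", "a"}

def toy_extract_features(words):
    # Keep only tokens that are neither stopwords nor punctuation characters.
    return [w for w in words if w not in toy_stopwords and w not in punctuation]

def toy_bag_of_words(words):
    # Counter builds the same word -> count mapping as the manual loop.
    return dict(Counter(words))

tokens = ["the", "plot", "was", "bad", ",", "and", "the", "acting", "was", "bad", "."]
toy_bag = toy_bag_of_words(toy_extract_features(tokens))
print(toy_bag)  # {'plot': 1, 'bad': 2, 'acting': 1}
```

Only the content words survive the filter, and "bad" is counted twice because it appears twice in the input.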

### Ingest, clean, and convert the positive and negative reviews

For both the positive ("pos") and negative ("neg") sets of reviews, extract the features and convert to bag of words. From these, we construct a list of tuples known as a "featureset": the first part of each tuple is the bag of words for that review, and the second is its label ("pos"/"neg").

Note that `movie_reviews.words(fileid)` provides a tokenized list of words. If we wanted the un-tokenized text, we would use `movie_reviews.raw(fileid)` instead, then tokenize it using our preferred tokenizer (e.g. [nltk.tokenize.word_tokenize](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktLanguageVars.word_tokenize)).

In [41]:
from nltk.corpus import movie_reviews

reviews_pos = []
reviews_neg = []
# Build one (bag_of_words, label) tuple per review file
for fileid in movie_reviews.fileids('pos'):
    words = extract_features(movie_reviews.words(fileid))
    reviews_pos.append((bag_of_words(words), 'pos'))
for fileid in movie_reviews.fileids('neg'):
    words = extract_features(movie_reviews.words(fileid))
    reviews_neg.append((bag_of_words(words), 'neg'))
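
To make the shape of a featureset concrete, the following sketch builds one by hand from two invented token lists (they are not taken from the corpus) and inspects the first entry. The helper name `count_words` is hypothetical and simply mirrors the counting done by `bag_of_words`.

```python
# Two invented, already-cleaned token lists standing in for real reviews.
toy_reviews = {
    "pos": ["wonderful", "cast", "wonderful", "story"],
    "neg": ["boring", "plot"],
}

def count_words(words):
    # Same word -> count mapping that bag_of_words produces.
    bag = {}
    for w in words:
        bag[w] = bag.get(w, 0) + 1
    return bag

# A featureset is a list of (bag_of_words, label) tuples.
toy_featureset = [(count_words(words), label) for label, words in toy_reviews.items()]

features, label = toy_featureset[0]
print(features, label)  # {'wonderful': 2, 'cast': 1, 'story': 1} pos
```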
    
    
    


### Split reviews into training and test sets

We need to break up each group of reviews into a training set (about 80%) and a test set (the remaining 20%). In case there's some meaningful order to the reviews (e.g. the first 800 are from one group of reviewers, the next 200 are from another), we shuffle the sets first to ensure we aren't introducing additional bias. Note that this means our accuracy will not be exactly the same on every run; if you wish to see consistent results on each run, you can stabilize the shuffle by calling [random.seed(n)](https://www.geeksforgeeks.org/random-seed-in-python/) first.

In [42]:
from random import shuffle

split_pct = .80

def split_set(review_set):
    split = int(len(review_set)*split_pct)
    return (review_set[:split], review_set[split:])

shuffle(reviews_pos)
shuffle(reviews_neg)

pos_train, pos_test = split_set(reviews_pos)
neg_train, neg_test = split_set(reviews_neg)

train_set = pos_train+neg_train
test_set = pos_test+neg_test
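
If reproducible results matter, one possible approach is sketched below. It uses a dedicated `random.Random(seed)` instance rather than the global `random.seed(n)` call, which gives the same repeatability without touching global state. The helper names `split_items` and `seeded_split`, the seed value, and the placeholder data are all assumptions made for the example.

```python
import random

def split_items(items, pct=0.80):
    # Same slicing logic as split_set: the first pct of the list, then the rest.
    cut = int(len(items) * pct)
    return items[:cut], items[cut:]

def seeded_split(items, seed):
    # Seed a private generator so the shuffle order is repeatable.
    rng = random.Random(seed)
    shuffled = list(items)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    return split_items(shuffled)

toy_items = list(range(10))  # placeholder data, not real reviews
train_a, test_a = seeded_split(toy_items, seed=42)
train_b, test_b = seeded_split(toy_items, seed=42)

print(len(train_a), len(test_a))                # 8 2
print(train_a == train_b and test_a == test_b)  # True
```

Two calls with the same seed produce the same split, so any accuracy computed from it stays stable between runs.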


### Train the model

Now that our data is ready, the training step itself is quite simple if we use the [NaiveBayesClassifier](https://www.nltk.org/api/nltk.classify.html#module-nltk.classify.naivebayes) provided by NLTK.

If you are used to methods such as `model.fit(x,y)` which take two parameters -- the data and the labels -- it may be confusing that `NaiveBayesClassifier.train` takes just one argument. This is because the labels are already embedded in `train_set`: each element in the set is a bag of words paired with a 'pos' or 'neg' label.

In [43]:
from nltk.classify import NaiveBayesClassifier

model = NaiveBayesClassifier.train(train_set)
# show_most_informative_features prints its table directly and returns None
model.show_most_informative_features(100)

Most Informative Features
                     bad = 4                 neg : pos    =     17.0 : 1.0
                  stupid = 2                 neg : pos    =     15.7 : 1.0
             outstanding = 1                 pos : neg    =     13.6 : 1.0
                   blend = 1                 pos : neg    =     11.7 : 1.0
               marvelous = 1                 pos : neg    =     11.7 : 1.0
             uninvolving = 1                 neg : pos    =     11.0 : 1.0
              apparently = 2                 neg : pos    =     10.3 : 1.0
              astounding = 1                 pos : neg    =     10.3 : 1.0
            construction = 1                 pos : neg    =      9.7 : 1.0
               wonderful = 2                 pos : neg    =      9.7 : 1.0
                  avoids = 1                 pos : neg    =      9.7 : 1.0
             beautifully = 1                 pos : neg    =      9.6 : 1.0
                  boring = 2                 neg : pos    =      9.4 : 1.0
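
To see how the labels are embedded in the training data, the sketch below takes an invented miniature set in the same (bag of words, label) format and separates it into the two parallel sequences that a `fit(x, y)` style API would expect. The names `toy_train_set`, `toy_x`, and `toy_y` are hypothetical.

```python
# Invented miniature training set in the (bag_of_words, label) format.
toy_train_set = [
    ({"wonderful": 2, "story": 1}, "pos"),
    ({"boring": 1, "plot": 1}, "neg"),
    ({"outstanding": 1}, "pos"),
]

# zip(*pairs) transposes the list of pairs into two parallel tuples:
# one holding the feature dictionaries, the other holding the labels.
toy_x, toy_y = zip(*toy_train_set)

print(toy_y)       # ('pos', 'neg', 'pos')
print(len(toy_x))  # 3
```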

### Check model accuracy

NLTK's built-in [accuracy](https://www.nltk.org/api/nltk.classify.html#module-nltk.classify.util) utility runs our `test_set` through the model and compares the labels returned by the model to the labels in the test set, producing an overall accuracy that we multiply by 100 to show as a percentage. Not too impressive, right? We need to improve.

In [44]:
from nltk.classify.util import accuracy

print(100 * accuracy(model, test_set))

71.75
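
Conceptually, the accuracy figure is the fraction of test items whose predicted label matches the stored label. The sketch below reproduces that calculation by hand; `toy_classify` is a deliberately naive, hypothetical stand-in for the trained model, and the four test items are invented.

```python
def toy_classify(features):
    # Hypothetical stand-in for a trained model:
    # predicts 'neg' if the word "bad" appears, otherwise 'pos'.
    return "neg" if "bad" in features else "pos"

toy_test_set = [
    ({"bad": 2, "plot": 1}, "neg"),
    ({"great": 1}, "pos"),
    ({"dull": 1}, "neg"),            # misclassified: "bad" is absent
    ({"fun": 1, "cast": 1}, "pos"),
]

def manual_accuracy(classify, labelled):
    # Fraction of items whose predicted label equals the stored label.
    correct = sum(1 for features, label in labelled if classify(features) == label)
    return correct / len(labelled)

toy_accuracy = manual_accuracy(toy_classify, toy_test_set)
print(100 * toy_accuracy)  # 75.0
```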


### Save the model

Our trained model will be cleared from memory when this notebook is closed. So that we can use it again later, save the model as a file using the [pickle](https://docs.python.org/3/library/pickle.html) serializer.

In [45]:
import pickle

# The with-statement closes the file even if pickle.dump raises
with open('sa_classifier.pickle', 'wb') as model_file:
    pickle.dump(model, model_file)
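
To use the saved model later, it has to be read back with `pickle.load`. The sketch below round-trips a placeholder dictionary (standing in for the trained classifier) through a temporary directory, so it can run without the model or any existing file; the placeholder object is an assumption made for the example.

```python
import os
import pickle
import tempfile

# Placeholder object standing in for the trained classifier.
toy_model = {"labels": ["pos", "neg"], "note": "placeholder, not a real model"}

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sa_classifier.pickle")

    with open(path, "wb") as model_file:
        pickle.dump(toy_model, model_file)

    # Only unpickle files you trust: loading can execute arbitrary code.
    with open(path, "rb") as model_file:
        restored = pickle.load(model_file)

print(restored == toy_model)  # True
```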

### Save the model (Colab version)

Google Colab doesn't provide direct access to files saved during a notebook session, so we need to save it in [Google Drive](https://drive.google.com) instead. The first time you run this, it will ask for permission to access your Google Drive. Follow the instructions, then wait a few minutes and look for a new folder called "Colab Output" in [Drive](https://drive.google.com). Note that Colab does not always sync to Drive immediately, so check the file update times and re-run this cell if it doesn't look like you have the most recent version of your file.

In [46]:
import sys
if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/gdrive')
    !mkdir -p '/content/gdrive/My Drive/Colab Output'
    # The with-statement flushes and closes the file before Drive is unmounted
    with open('/content/gdrive/My Drive/Colab Output/sa_classifier.pickle', 'wb') as model_file:
        pickle.dump(model, model_file)
    print('Model saved in /content/gdrive/My Drive/Colab Output')
    !ls '/content/gdrive/My Drive/Colab Output'
    drive.flush_and_unmount()
    print('Re-run this cell if you cannot find it in https://drive.google.com')