## Part 2: Training your own ML Model

### Prepare the environment

We'll continue using the `movie_reviews` corpus to train our model. The `stopwords` corpus contains a [set of standard stopwords](https://gist.github.com/sebleier/554280) we'll want to remove from the input, and `punkt` is used for tokenization in the [.words()](https://www.nltk.org/api/nltk.corpus.html#corpus-reader-functions) method of the corpus reader.

Finally, `string.punctuation` is a string containing the most common punctuation symbols you can find in a text. It is useful for processing that targets only those symbols, such as removing them.

In [80]:
from nltk import download

download('movie_reviews')
download('stopwords')
download('punkt')

from nltk.corpus import stopwords
stopwords_eng = stopwords.words("english")

from string import punctuation

[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/ozge/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/ozge/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/ozge/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


### Define feature extractor and bag-of-words converter

Given a list of (already tokenized) words, we need a function to extract just the ones we care about: those not found in the list of English stopwords or standard punctuation.

We also need a way to easily turn a list of words into a [bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model), pairing each word with the count of its occurrences.

In [81]:
def extract_features(words):
    return [w for w in words if w not in stopwords_eng and w not in punctuation]

def bag_of_words(words):
    bag = {}
    for w in words:
        bag[w] = bag.get(w,0)+1
    return bag
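
To make the two helpers concrete, here is a minimal sketch that runs them on a hand-made token list. The `toy_stopwords` set is an assumption made for illustration (the notebook itself uses NLTK's full English list), and the final check shows that `collections.Counter` would produce the same counts.

```python
from collections import Counter
from string import punctuation

# Hypothetical stand-in for stopwords.words("english"),
# so this sketch runs without downloading NLTK data.
toy_stopwords = {"the", "was", "and", "a"}

def extract_features(words):
    # Keep tokens that are neither stopwords nor single punctuation characters.
    return [w for w in words if w not in toy_stopwords and w not in punctuation]

def bag_of_words(words):
    # Map each word to the number of times it occurs.
    bag = {}
    for w in words:
        bag[w] = bag.get(w, 0) + 1
    return bag

tokens = ["the", "plot", "was", "bad", ",", "and", "the", "acting", "was", "bad", "."]
features = extract_features(tokens)
bag = bag_of_words(features)

print(features)  # ['plot', 'bad', 'acting', 'bad']
print(bag)       # {'plot': 1, 'bad': 2, 'acting': 1}

# Counter builds the same mapping in one call.
assert bag == dict(Counter(features))
```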

### Text processing utils

The following functions encapsulate operations that we will repeat across multiple implementations. Extracting them into independent functions reduces duplication and makes the code easier to read.

In [82]:
import re

def is_useful_word(word):
    return (word not in stopwords_eng) and (word not in punctuation)

def remove_punctuation(text):
    return re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
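
The sketch below exercises both helpers on a short string. It assumes a tiny `toy_stopwords` set in place of the NLTK list. It also illustrates one subtlety: `word not in punctuation` is a substring test against a string, so a multi-character token such as `"--"` is still treated as useful. The `is_punctuation_only` helper is a hypothetical stricter check, not part of the notebook.

```python
import re
from string import punctuation

# Hypothetical stand-in for stopwords.words("english").
toy_stopwords = {"the", "is", "a"}

def is_useful_word(word):
    return (word not in toy_stopwords) and (word not in punctuation)

def remove_punctuation(text):
    # Replace anything that is not a letter, digit, or whitespace with a space.
    return re.sub(r'[^a-zA-Z0-9\s]', ' ', text)

def is_punctuation_only(token):
    # Hypothetical stricter helper: True when every character is punctuation.
    return len(token) > 0 and all(ch in punctuation for ch in token)

cleaned = remove_punctuation("Well-made, isn't it?")
print(cleaned.split())            # ['Well', 'made', 'isn', 't', 'it']

print(is_useful_word("!"))        # False: a single punctuation character
print(is_useful_word("--"))       # True: "--" is not a substring of string.punctuation
print(is_punctuation_only("--"))  # True
```

Note that `remove_punctuation` also splits contractions ("isn't" becomes "isn" and "t"), which may or may not be what you want for sentiment features.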

## Ideas

The following two cells implement two text processing techniques that you can use to get a more accurate sentiment classification model. Use them as a reference on how to implement improvements like this, and explore more possibilities to get the best result you can!

And as a final recommendation: these techniques are not exclusive, meaning that you can combine multiple ones in a single model! The accuracy will not always improve, but it's worth trying it and seeing the results!

In [7]:
# IMPLEMENTATION: use spaCy lemmatizer
# !pip install spacy
# !python -m spacy download en_core_web_sm

import spacy 

nlp = spacy.load("en_core_web_sm") 

def extract_features(document):
    return [str(w.lemma_) for w in nlp(document) if is_useful_word(w.text)]

# Example:
print(extract_features("Hello world, corpuses calling!"))

['hello', 'world', 'corpuse', 'call']


In [9]:
# IMPLEMENTATION: use n-grams

import re
# from nltk.util import ngrams
from nltk.util import everygrams

def extract_features(document):
    document = document.lower()
    document = re.sub(r'[^a-zA-Z0-9\s]', ' ', document)
    words = [w for w in document.split(" ") if w != "" and is_useful_word(w)]
    return ['_'.join(ngram) for ngram in list(everygrams(words, max_len=3))]

# Example
print(extract_features("Hello world, corpuses calling!"))

['hello', 'hello_world', 'hello_world_corpuses', 'world', 'world_corpuses', 'world_corpuses_calling', 'corpuses', 'corpuses_calling', 'calling']
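
To show what the n-gram step produces without relying on NLTK, here is a rough pure-Python sketch. The `every_ngram` helper is hypothetical and only approximates `everygrams`; its ordering (grouped by start position) was chosen to match the output printed above, and NLTK's own ordering may differ between versions.

```python
def every_ngram(words, max_len=3):
    # Hypothetical helper: collect every n-gram of length 1..max_len,
    # grouped by the position at which the n-gram starts.
    grams = []
    for start in range(len(words)):
        for n in range(1, max_len + 1):
            if start + n <= len(words):
                grams.append(tuple(words[start:start + n]))
    return grams

words = ["hello", "world", "corpuses", "calling"]
features = ['_'.join(gram) for gram in every_ngram(words)]
print(features)
# ['hello', 'hello_world', 'hello_world_corpuses', 'world', 'world_corpuses',
#  'world_corpuses_calling', 'corpuses', 'corpuses_calling', 'calling']
```

Joining with `_` turns each n-gram into a single string, so it can be used as a key in the bag of words just like a single word.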


In [83]:
# IMPLEMENTATION: use spaCy lemmatizer and n-grams combined

# !pip install spacy
# !python -m spacy download en_core_web_sm

import spacy 
import re
from nltk.util import everygrams

nlp = spacy.load("en_core_web_sm") 

def extract_features(document):
    # Your implementation goes here!
    document = document.lower()
    document = re.sub(r'[^a-zA-Z0-9\s]', ' ', document)
    
    words=[str(w.lemma_) for w in nlp(document) if is_useful_word(w.text)]
    words = [w for w in words if " " not in w]    
    return ['_'.join(ngram) for ngram in list(everygrams(words, max_len=3))]
    

# Example
print(extract_features("Hello world, corpuses calling!"))

['hello', 'hello_world', 'hello_world_corpus', 'world', 'world_corpus', 'world_corpus_call', 'corpus', 'corpus_call', 'call']


### Ingest, clean, and convert the positive and negative reviews

For both the positive ("pos") and negative ("neg") sets of reviews, extract the features and convert to bag of words. From these, we construct a list of tuples known as a "featureset": the first part of each tuple is the bag of words for that review, and the second is its label ("pos"/"neg").

Note that `movie_reviews.words(fileid)` provides a tokenized list of words. The cell below calls `movie_reviews.raw(fileid)` instead to get the un-tokenized text, because the `extract_features` implementations from the Ideas section take a raw document string and tokenize it themselves. If you work with raw text in your own pipeline, tokenize it using your preferred tokenizer (e.g. [nltk.tokenize.word_tokenize](https://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.punkt.PunktLanguageVars.word_tokenize)).

In [84]:
from nltk.corpus import movie_reviews

reviews_pos = []
reviews_neg = []
for fileid in movie_reviews.fileids('pos'):
    words = extract_features(movie_reviews.raw(fileid))
    reviews_pos.append((bag_of_words(words), 'pos'))
for fileid in movie_reviews.fileids('neg'):
    words = extract_features(movie_reviews.raw(fileid))
    reviews_neg.append((bag_of_words(words), 'neg'))
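
For reference, the shape of a featureset can be sketched on toy data. The `toy_reviews` texts and the whitespace-splitting `toy_features` function below are assumptions for illustration only; the notebook builds its features with `extract_features` and `bag_of_words`.

```python
from collections import Counter

# Hypothetical miniature corpus: label -> list of review texts.
toy_reviews = {
    "pos": ["great fun great cast"],
    "neg": ["dull plot", "bad bad acting"],
}

def toy_features(text):
    # Simplified stand-in for extract_features + bag_of_words.
    return dict(Counter(text.split()))

# A featureset is a list of (bag_of_words, label) tuples.
featureset = [
    (toy_features(text), label)
    for label, texts in toy_reviews.items()
    for text in texts
]

for features, label in featureset:
    print(label, features)
# pos {'great': 2, 'fun': 1, 'cast': 1}
# neg {'dull': 1, 'plot': 1}
# neg {'bad': 2, 'acting': 1}
```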

### Split reviews into training and test sets
We need to break up each group of reviews into a training set (about 80%) and a test set (the remaining 20%). In case there's some meaningful order to the reviews (e.g. the first 800 are from one group of reviewers, the next 200 are from another), we shuffle the sets first to ensure we aren't introducing additional bias. Note that this means our accuracy will not be exactly the same on every run.

In [85]:
from random import shuffle

split_pct = .80

def split_set(review_set):
    split = int(len(review_set)*split_pct)
    return (review_set[:split], review_set[split:])

shuffle(reviews_pos)
shuffle(reviews_neg)

pos_train, pos_test = split_set(reviews_pos)
neg_train, neg_test = split_set(reviews_neg)

train_set = pos_train+neg_train
test_set = pos_test+neg_test
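
If you want the same split on every run (for example, to compare two feature extractors fairly), one option is to shuffle with a seeded random generator. This is a sketch, not what the notebook does: the `seeded_split` helper, the seed value, and the toy data are all assumptions.

```python
from random import Random

def split_set(review_set, split_pct=0.80):
    split = int(len(review_set) * split_pct)
    return review_set[:split], review_set[split:]

def seeded_split(review_set, seed=42):
    # Hypothetical helper: shuffle a copy with a private, seeded generator
    # so the split is repeatable and the original list is left untouched.
    shuffled = list(review_set)
    Random(seed).shuffle(shuffled)
    return split_set(shuffled)

# Toy featureset standing in for reviews_pos.
toy_reviews = [({"word": i}, "pos") for i in range(10)]

train_a, test_a = seeded_split(toy_reviews)
train_b, test_b = seeded_split(toy_reviews)

print(len(train_a), len(test_a))  # 8 2
print(train_a == train_b)         # True: same seed, same split
```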

### Train the model

Now that our data is ready, the training step itself is quite simple if we use the [NaiveBayesClassifier](https://www.nltk.org/api/nltk.classify.html#module-nltk.classify.naivebayes) provided by NLTK.

In [86]:
from nltk.classify import NaiveBayesClassifier

model = NaiveBayesClassifier.train(train_set)

### Check model accuracy

NLTK's built-in [accuracy](https://www.nltk.org/api/nltk.classify.html#module-nltk.classify.util) utility can run our test_set through the model and compare the labels returned by the model to the labels in the test set, producing an overall % accuracy. Not too impressive, right? We need to improve.

In [87]:
from nltk.classify.util import accuracy

print(100 * accuracy(model, test_set))
model.show_most_informative_features(20)

80.0
Most Informative Features
               marvelous = 1                 pos : neg    =     15.0 : 1.0
                   waste = 2                 neg : pos    =     14.3 : 1.0
               wonderful = 2                 pos : neg    =     13.0 : 1.0
                     bad = 5                 neg : pos    =     12.3 : 1.0
          extremely_well = 1                 pos : neg    =     12.3 : 1.0
                  symbol = 1                 pos : neg    =     12.3 : 1.0
               ludicrous = 1                 neg : pos    =     12.2 : 1.0
                  murphy = 1                 pos : neg    =     11.0 : 1.0
                    plod = 1                 neg : pos    =     11.0 : 1.0
               bad_movie = 2                 neg : pos    =     10.3 : 1.0
           steven_seagal = 1                 neg : pos    =     10.3 : 1.0
                   thumb = 1                 neg : pos    =     10.3 : 1.0
                 immerse = 1                 pos : neg    =     10.3 
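
Conceptually, the accuracy figure is just the fraction of test items whose predicted label matches the true label. The sketch below shows that calculation with a hypothetical keyword-based stand-in for the trained model; `KeywordClassifier`, `simple_accuracy`, and the toy test set are assumptions for illustration, not NLTK code.

```python
class KeywordClassifier:
    # Hypothetical stand-in for a trained model: predicts "neg"
    # whenever the word "bad" appears in the bag of words.
    def classify(self, features):
        return "neg" if "bad" in features else "pos"

def simple_accuracy(classifier, labeled_featuresets):
    # Fraction of items where the predicted label equals the gold label.
    correct = sum(
        1 for features, label in labeled_featuresets
        if classifier.classify(features) == label
    )
    return correct / len(labeled_featuresets)

toy_test_set = [
    ({"bad": 2, "plot": 1}, "neg"),
    ({"wonderful": 1}, "pos"),
    ({"dull": 1}, "neg"),          # misclassified: no "bad" keyword
    ({"marvelous": 1, "cast": 1}, "pos"),
]

print(100 * simple_accuracy(KeywordClassifier(), toy_test_set))  # 75.0
```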

### Save the model
Our trained model will be cleared from memory when this notebook is closed. So that we can use it again later, save the model as a file using the [pickle](https://docs.python.org/3/library/pickle.html) serializer.

In [88]:
import pickle

# The with-block closes the file even if pickling raises an error.
with open("sa_classifier.pickle", "wb") as model_file:
    pickle.dump(model, model_file)
print("saved")

saved
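
To use the saved model later, it has to be read back with `pickle.load`. The round trip can be sketched as follows; a plain dictionary stands in for the trained classifier and a temporary directory is used so the sketch does not overwrite your real `sa_classifier.pickle` (both are assumptions for illustration).

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the trained model object.
toy_model = {"labels": ["pos", "neg"]}

# Write to a temporary directory rather than the working directory.
path = os.path.join(tempfile.mkdtemp(), "sa_classifier.pickle")

with open(path, "wb") as model_file:
    pickle.dump(toy_model, model_file)

with open(path, "rb") as model_file:
    restored = pickle.load(model_file)

print(restored == toy_model)  # True
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.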


### Save the model (Colab version)

Google Colab doesn't provide direct access to files saved during a notebook session, so we need to save the model in [Google Drive](https://drive.google.com) instead. The first time you run this, it will ask for permission to access your Google Drive. Follow the instructions, then wait a few minutes and look for a new folder called "Colab Output" in [Drive](https://drive.google.com). Note that Colab does not always sync to Drive immediately, so check the file update times and re-run this cell if it doesn't look like you have the most recent version of your file.

In [None]:
import sys
if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/gdrive')
    !mkdir -p '/content/gdrive/My Drive/Colab Output'
    # The with-block flushes and closes the file before Drive is unmounted.
    with open('/content/gdrive/My Drive/Colab Output/sa_classifier.pickle', "wb") as model_file:
        pickle.dump(model, model_file)
    print('Model saved in /content/gdrive/My Drive/Colab Output')
    !ls '/content/gdrive/My Drive/Colab Output'
    drive.flush_and_unmount()
    print('Re-run this cell if you cannot find it in https://drive.google.com')