## Homework 5: NLTK and Machine Learning Pipelines

In this final homework assignment, you'll be bringing together ideas from Natural Language Processing as well as
Machine Learning Pipelines.  This assignment uses materials adapted from [Benjamin Bengfort](https://bbengfort.github.io/tutorials/2016/05/19/text-classification-nltk-sckit-learn.html).

We are going back to the object-oriented Python work you did in previous courses to create a preprocessor
that performs NLP on text within a machine learning pipeline.

First, we are going to review code that you should find reusable and helpful moving forward.  There's a lot of setup 
for this homework assignment.

In [48]:
import string

from nltk.corpus import stopwords as sw
from nltk.corpus import wordnet as wn
from nltk import wordpunct_tokenize
from nltk import WordNetLemmatizer
from nltk import sent_tokenize
from nltk import pos_tag

import re

from sklearn.base import BaseEstimator, TransformerMixin


class NLTKPreprocessor(BaseEstimator, TransformerMixin):

    def __init__(self, stopwords=None, punct=None,
                 lower=True, strip=True):
        self.lower      = lower
        self.strip      = strip
        self.stopwords  = stopwords or set(sw.words('english'))
        self.punct      = punct or set(string.punctuation)
        self.lemmatizer = WordNetLemmatizer()

    def fit(self, X, y=None):
        return self

    def inverse_transform(self, X):
        return [" ".join(doc) for doc in X]

    def transform(self, X):
        return [
            list(self.tokenize(doc)) for doc in X
        ]

    def tokenize(self, document):
        # Compile a regex that matches HTML tags such as <br />
        no_html = re.compile('<.*?>')

        # Break the document into sentences
        for sent in sent_tokenize(document):

            # Strip the HTML tags out of the sentence
            sent = re.sub(no_html, '', sent)
            
            # Break the sentence into part of speech tagged tokens
            for token, tag in pos_tag(wordpunct_tokenize(sent)):
                # Apply preprocessing to the token
                token = token.lower() if self.lower else token
                token = token.strip() if self.strip else token
                token = token.strip('_') if self.strip else token
                token = token.strip('*') if self.strip else token

                # If stopword, ignore token and continue
                if token in self.stopwords:
                    continue

                # If punctuation, ignore token and continue
                if all(char in self.punct for char in token):
                    continue

                # Lemmatize the token and yield
                lemma = self.lemmatize(token, tag)
                yield lemma

    def lemmatize(self, token, tag):
        tag = {
            'N': wn.NOUN,
            'V': wn.VERB,
            'R': wn.ADV,
            'J': wn.ADJ
        }.get(tag[0], wn.NOUN)

        return self.lemmatizer.lemmatize(token, tag)

The function below takes text and returns it unmodified.  We need it in the next section.

In [49]:
def identity_tokenizer(text):
    return text
# just takes text and then spits it back out again
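
To see why the identity function is needed: `NLTKPreprocessor.transform` returns each document as a list of tokens, while `TfidfVectorizer` expects raw strings by default and tokenizes them itself. The sketch below assumes scikit-learn is installed and uses made-up token lists in place of the preprocessor's real output. `token_pattern=None` is an addition here, included only because recent versions warn about an unused pattern when a tokenizer is supplied.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def identity_tokenizer(text):
    # The documents are already token lists, so pass them through untouched
    return text


# Hypothetical output of NLTKPreprocessor: one list of lemmas per document
docs = [["great", "movie", "fun"], ["bad", "movie", "waste"]]

vectorizer = TfidfVectorizer(
    tokenizer=identity_tokenizer,
    preprocessor=None,
    lowercase=False,     # lowercasing would call .lower() on a list and fail
    token_pattern=None,  # not used when a tokenizer is supplied
)
matrix = vectorizer.fit_transform(docs)

# One column per distinct token, one row per document
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(matrix.shape)
```

Each token becomes a feature exactly as the preprocessor produced it, which is what lets the lemmatization and stopword filtering carry through to the classifier.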

In [50]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report as clsr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split as tts
import time
import pickle


def build_and_evaluate(X, y,
    classifier=SGDClassifier, outpath=None, verbose=True):

    def build(classifier, X, y=None):
        """
        Inner build function that builds a single model.
        """
        if isinstance(classifier, type):
            classifier = classifier()

        model = Pipeline([
            ('preprocessor', NLTKPreprocessor()),
            ('vectorizer', TfidfVectorizer(
                tokenizer=identity_tokenizer, # note that this will fail unless you use the identity_tokenizer
                preprocessor=None, lowercase=False
            )),
            ('classifier', classifier),
        ])

        model.fit(X, y)
        return model

    # Label encode the targets
    labels = LabelEncoder()
    y = labels.fit_transform(y)

    # Begin evaluation
    if verbose: print("Building for evaluation")
    X_train, X_test, y_train, y_test = tts(X, y, test_size=0.2)
    start_time = time.time()

    model = build(classifier, X_train, y_train)

    if verbose:
        print("Evaluation model fit in {:0.3f} seconds".format(time.time() - start_time))
        print("Classification Report:\n")

    y_pred = model.predict(X_test)
    print(clsr(y_test, y_pred, target_names=labels.classes_))

    if verbose:
        print("Building complete model and saving ...")
    start_time = time.time()
    model = build(classifier, X, y)
    model.labels_ = labels

    if verbose:
        print("Complete model fit in {:0.3f} seconds".format(time.time() - start_time))

    if outpath:
        with open(outpath, 'wb') as f:
            pickle.dump(model, f)

        print("Model written out to {}".format(outpath))
        
    # Accuracy of the evaluation model on the held-out test split
    # (the model returned below has been refit on all of the data)
    score = accuracy_score(y_test, y_pred)

    return model, score

Now that we've got everything set up for our pipelines, we can load some data.  Here we're going to use the Movie Reviews
corpus from the NLTK package.

In [51]:
from nltk.corpus import movie_reviews as reviews

X = [reviews.raw(fileid) for fileid in reviews.fileids()]
y = [reviews.categories(fileid)[0] for fileid in reviews.fileids()]
print("There are {} reviews".format(len(y)))


There are 2000 reviews


In [52]:
reviews.raw('neg/cv000_29416.txt')

'plot : two teen couples go to a church party , drink and then drive . \nthey get into an accident . \none of the guys dies , but his girlfriend continues to see him in her life , and has nightmares . \nwhat\'s the deal ? \nwatch the movie and " sorta " find out . . . \ncritique : a mind-fuck movie for the teen generation that touches on a very cool idea , but presents it in a very bad package . \nwhich is what makes this review an even harder one to write , since i generally applaud films which attempt to break the mold , mess with your head and such ( lost highway & memento ) , but there are good and bad ways of making all types of films , and these folks just didn\'t snag this one correctly . \nthey seem to have taken this pretty neat concept , but executed it terribly . \nso what are the problems with the movie ? \nwell , its main problem is that it\'s simply too jumbled . \nit starts off " normal " but then downshifts into this " fantasy " world in which you , as an audience membe

In [6]:
# we can take a closer look at the structure of 'reviews'

In [7]:
PATH = "movie_reviews_model.pickle"
model, score = build_and_evaluate(X, y, classifier=SGDClassifier, outpath=PATH)

Building for evaluation
Evaluation model fit in 122.839 seconds
Classification Report:

              precision    recall  f1-score   support

         neg       0.87      0.85      0.86       195
         pos       0.86      0.88      0.87       205

    accuracy                           0.86       400
   macro avg       0.87      0.86      0.86       400
weighted avg       0.87      0.86      0.86       400

Building complete model and saving ...
Complete model fit in 144.305 seconds
Model written out to movie_reviews_model.pickle


As you can see, building a model takes a considerable amount of time (and resources), so we're going to use the
"pickled" version of the model so we don't have to recreate it.

In [8]:
with open(PATH, 'rb') as f:
    model = pickle.load(f)

yhat = model.predict([
    "This is the worst movie I have ever seen!",
    "The movie was great action packed and full of adventure!",
    "Wow!",
    "This was the best and the worst at the same time!"
])


print(yhat)
print(model.labels_.inverse_transform(yhat))

[0 1 0 0]
['neg' 'pos' 'neg' 'neg']
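
The integers in the first line are the label-encoded classes, and `inverse_transform` maps them back to the original strings. As a rough plain-Python sketch of that mapping (a stand-in written for illustration, not scikit-learn's implementation, and assuming the classes are numbered in sorted order as `LabelEncoder` does):

```python
def fit_label_mapping(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    return classes, {label: index for index, label in enumerate(classes)}


# Hypothetical training labels; only the distinct values matter
classes, to_index = fit_label_mapping(["neg", "pos", "pos", "neg"])

# Forward direction: strings to integers
encoded = [to_index[label] for label in ["neg", "pos", "neg", "neg"]]

# Reverse direction: integers back to strings
decoded = [classes[index] for index in [0, 1, 0, 0]]

print(encoded)
print(decoded)
```

Because `'neg'` sorts before `'pos'`, a prediction of `0` means negative and `1` means positive.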


Finally, we can take a look to see which words are most highly associated with each sentiment:

In [53]:
from operator import itemgetter
def show_most_informative_features(model, text=None, n=20):
    # Extract the vectorizer and the classifier from the pipeline
    vectorizer = model.named_steps['vectorizer']
    classifier = model.named_steps['classifier']

    # Check to make sure that we can perform this computation
    if not hasattr(classifier, 'coef_'):
        raise TypeError(
            "Cannot compute most informative features on {}.".format(
                classifier.__class__.__name__
            )
        )

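    if text is not None:
        # Compute the TF-IDF weights for the text by running it through the
        # preprocessing and vectorizing steps (the classifier has no transform)
        tokens = model.named_steps['preprocessor'].transform([text])
        tvec = vectorizer.transform(tokens).toarray()
    else:
        # Otherwise simply use the coefficients
        tvec = classifier.coef_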

    # Zip the feature names with the coefs and sort
    coefs = sorted(
        zip(tvec[0], vectorizer.get_feature_names()),
        key=itemgetter(0), reverse=True
    )

    # Get the top n and bottom n coef, name pairs
    topn  = zip(coefs[:n], coefs[:-(n+1):-1])

    # Create the output string to return
    output = []

    # If text, add the predicted value to the output.
    if text is not None:
        output.append("\"{}\"".format(text))
        output.append(
            "Classified as: {}".format(model.predict([text]))
        )
        output.append("")

    # Create two columns with most negative and most positive features.
    for (cp, fnp), (cn, fnn) in topn:
        output.append(
            "{:0.4f}{: >15}    {:0.4f}{: >15}".format(
                cp, fnp, cn, fnn
            )
        )

    return "\n".join(output)

In [10]:
print(show_most_informative_features(model))

2.5886            fun    -4.6676            bad
2.4877          great    -2.5842  unfortunately
2.1383    performance    -2.5095          waste
2.1004            see    -2.3932        suppose
1.8559          quite    -2.3638        nothing
1.8054         matrix    -2.3588           plot
1.7293           trek    -2.3563        attempt
1.6320      memorable    -1.9190           poor
1.6062       terrific    -1.8976          awful
1.6001      different    -1.8972         stupid
1.5508       bulworth    -1.8945           look
1.5395     especially    -1.8386         boring
1.5044            job    -1.8356     ridiculous
1.4976        portray    -1.7776          guess
1.4914           also    -1.7350           even
1.4857      hilarious    -1.7069          could
1.4487      enjoyable    -1.6332         script
1.4457              7    -1.6270      carpenter
1.4427        overall    -1.6086         anyway
1.4414           true    -1.5399           dull
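
In this listing the left column holds the largest positive coefficients (features that push a review toward `pos`) and the right column holds the most negative ones (features that push toward `neg`). The pairing comes from sorting once and slicing from both ends. A small sketch with invented coefficients, not values from the fitted model:

```python
from operator import itemgetter

# Hypothetical (coefficient, feature) pairs, made up for illustration
pairs = [
    (2.5, "fun"),
    (-4.6, "bad"),
    (0.1, "film"),
    (1.6, "terrific"),
    (-2.5, "waste"),
]


def top_and_bottom(pairs, n):
    # Sort from the largest coefficient to the smallest
    ranked = sorted(pairs, key=itemgetter(0), reverse=True)
    most_positive = ranked[:n]
    # Walk backwards from the end to get the n smallest, most negative first
    most_negative = ranked[:-(n + 1):-1]
    return list(zip(most_positive, most_negative))


rows = top_and_bottom(pairs, 2)
for (cp, fnp), (cn, fnn) in rows:
    print("{:0.4f}{: >15}    {:0.4f}{: >15}".format(cp, fnp, cn, fnn))
```

Features near zero, such as `film` in this toy list, never appear in either column, which is why the output is a useful summary of what the model relies on.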


## Your challenge:
Build a sentiment classifier for the IMDB Dataset, which is available in the data/ directory.  Please note that
the IMDB Dataset consists of 50000 rows, so it's probably best to do most of your work on a sample of the
original dataset.  In the code below we use a sample size of 1000.  That's probably fine to start with but your final submission should be based on a sample of at least 5000.

You should attempt to improve the default classifier shown above by trying to get a higher accuracy score.  For example, you might want to try one of the other classifiers from the list shown in class 22.  Another way to improve your pipeline is to spend more time
building a better text preprocessor (e.g. you can see some reviews contain HTML, which you might decide to strip out).  Another thing you might want to do is to look more closely at the stopword list.
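
As one possible starting point for the HTML suggestion, here is a minimal sketch using only the standard library. The helper name `strip_html` and the sample review are made up for illustration. Unlike a plain deletion, it replaces each tag with a space so that words on either side of a `<br />` do not fuse together. A regex like this is a rough heuristic rather than a real HTML parser.

```python
import re

# Non-greedy match for anything between angle brackets, e.g. <br /> or <i>
HTML_TAG = re.compile(r"<.*?>")


def strip_html(text):
    # Replace each tag with a space so neighbouring words do not fuse,
    # then collapse the repeated whitespace that leaves behind
    without_tags = HTML_TAG.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


sample = "A great film.<br /><br />I would <i>happily</i> watch it again."
cleaned = strip_html(sample)
print(cleaned)
```

Running the cleanup on the whole document before sentence splitting, rather than on each sentence afterwards, may also help the sentence tokenizer, since tags would no longer sit between sentences.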

Please note that if you resample the dataset you will get slightly different accuracy values.  The values should not fluctuate wildly, so don't get too concerned about their absolute value.  What we're looking for is an improvement from the baseline and evidence that you tried a variety of approaches to improving the classifier.  We're also looking for evidence that you can manipulate text data into a machine learning pipeline and correctly interpret the results.

You should include code and interpretation of your results in this notebook.   If you tried many different approaches and ultimately chose one over the others, please include that in your write-up.  You do not need to include code for analyses that you discarded.

You should be able to plug the new data into the old pipeline code to get started (another handy thing about pipelines) and then start experimenting with improving the code!

In [54]:
import pandas as pd

In [55]:
m = pd.read_csv("data/imdb-dataset-of-50k-movie-reviews.zip")
# Let's do most of our work on a smaller sample of the 50000 rows
m = m.sample(5000)

In [56]:
m.head()

                                                  review  sentiment
11617  Gee, what a heck of a movie!... I said I wante...   negative
7482   Ok, first of all, I am a huge zombie movie fan...   negative
34701  I'll be short and to the point. This movie was...   negative
27802  I think that COMPLETE SAVAGES is a very good T...   positive
9828   At the beginning we get to see the start of a ...   negative


In [57]:
# INSERT YOUR CODE AND INTERPRETATION IN MULTIPLE CELLS BELOW THIS ONE

In [58]:
X = m.review
y = m.sentiment
X.head()

11617    Gee, what a heck of a movie!... I said I wante...
7482     Ok, first of all, I am a huge zombie movie fan...
34701    I'll be short and to the point. This movie was...
27802    I think that COMPLETE SAVAGES is a very good T...
9828     At the beginning we get to see the start of a ...
Name: review, dtype: object

In [59]:
from sklearn.metrics import accuracy_score, log_loss
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC, NuSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

PATH = "kelsey_pickle.pickle"

classifiers = [
    KNeighborsClassifier(3),
    SVC(kernel="rbf", C=0.025, probability=True),
    NuSVC(nu=0.1,probability=True),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    AdaBoostClassifier(),
    GradientBoostingClassifier(),
    SGDClassifier() # adding in the default that was set above, so we can compare to that and see if we've improved
    ]

best_score = None

for classifier in classifiers:

    print('classification type: ' + str(classifier))
    # No outpath is passed here: only the most accurate model so far is written to PATH below
    model, score = build_and_evaluate(X, y, classifier=classifier)
    if best_score is None or score > best_score:
        best_score = score
        with open(PATH, 'wb') as f:
            pickle.dump(model, f)
    print('accuracy score: {}%'.format(score*100))
    print('----------------------------------------------------------------')

classification type: KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=3, p=2,
                     weights='uniform')
Building for evaluation
Evaluation model fit in 94.521 seconds
Classification Report:

              precision    recall  f1-score   support

    negative       0.75      0.68      0.72       531
    positive       0.68      0.75      0.71       469

    accuracy                           0.71      1000
   macro avg       0.71      0.72      0.71      1000
weighted avg       0.72      0.71      0.71      1000

Building complete model and saving ...
Complete model fit in 117.909 seconds
accuracy score: 71.3%
----------------------------------------------------------------
classification type: SVC(C=0.025, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='rbf', max_iter=-1, probability=True, random_state=No



Evaluation model fit in 266.905 seconds
Classification Report:





              precision    recall  f1-score   support

    negative       0.00      0.00      0.00       530
    positive       0.47      1.00      0.64       470

    accuracy                           0.47      1000
   macro avg       0.23      0.50      0.32      1000
weighted avg       0.22      0.47      0.30      1000

Building complete model and saving ...




Complete model fit in 388.591 seconds
accuracy score: 47.0%
----------------------------------------------------------------
classification type: NuSVC(cache_size=200, class_weight=None, coef0=0.0,
      decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
      kernel='rbf', max_iter=-1, nu=0.1, probability=True, random_state=None,
      shrinking=True, tol=0.001, verbose=False)
Building for evaluation




Evaluation model fit in 110.010 seconds
Classification Report:

              precision    recall  f1-score   support

    negative       0.19      0.17      0.18       497
    positive       0.24      0.26      0.25       503

    accuracy                           0.22      1000
   macro avg       0.21      0.22      0.21      1000
weighted avg       0.21      0.22      0.21      1000

Building complete model and saving ...




Complete model fit in 143.488 seconds
accuracy score: 21.6%
----------------------------------------------------------------
classification type: DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
Building for evaluation
Evaluation model fit in 97.285 seconds
Classification Report:

              precision    recall  f1-score   support

    negative       0.69      0.70      0.69       498
    positive       0.70      0.68      0.69       502

    accuracy                           0.69      1000
   macro avg       0.69      0.69      0.69      1000
weighted avg       0.69      0.69      0.69      1000

Building complete model and saving 



Evaluation model fit in 100.976 seconds
Classification Report:

              precision    recall  f1-score   support

    negative       0.71      0.82      0.76       497
    positive       0.79      0.67      0.72       503

    accuracy                           0.74      1000
   macro avg       0.75      0.75      0.74      1000
weighted avg       0.75      0.74      0.74      1000

Building complete model and saving ...
Complete model fit in 127.800 seconds
accuracy score: 74.5%
----------------------------------------------------------------
classification type: AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=50, random_state=None)
Building for evaluation
Evaluation model fit in 98.990 seconds
Classification Report:

              precision    recall  f1-score   support

    negative       0.81      0.82      0.81       518
    positive       0.80      0.79      0.80       482

    accuracy                          

KeyboardInterrupt: 

Goal: improve the classifier as much as possible.

The work is assessed on the degree to which the code has been changed to try to improve it. Changes can be made in three places:
- the NLTK preprocessor
- the pipeline
- the choice of classifier

Preprocessing ideas: remove HTML with a regex, and get rid of all double quotes and punctuation.

- coherent description of final results
- improvement to preprocessor
    - took out HTML tags using a regex
- evidence of multiple classifiers attempted
    - tried 7 classifiers in addition to the default one above
    - make improvements on the highest-performing classifier
- appropriate use of documentation within code
    - explained the code with comments
- appropriate use of visualizations

In [15]:
# You should include the output from the following code in your notebook:
with open(PATH, 'rb') as f:
    model = pickle.load(f)

yhat = model.predict([
    "This is the worst movie I have ever seen!",
    "The movie was great action packed and full of adventure!",
    "Wow!",
    "This was the best and the worst at the same time!"
])


print(yhat)
print(model.labels_.inverse_transform(yhat))

FileNotFoundError: [Errno 2] No such file or directory: 'PATH_TO_YOUR_IMDB_MODEL'