Adapted from *Python Machine Learning 2nd Edition* by [Sebastian Raschka](https://sebastianraschka.com), Packt Publishing Ltd. 2017

Code Repository: https://github.com/rasbt/python-machine-learning-book-2nd-edition

Code License: [MIT License](https://github.com/rasbt/python-machine-learning-book-2nd-edition/blob/master/LICENSE.txt)

## Applying Machine Learning To Sentiment Analysis

The IMDb movie review dataset can be downloaded from http://ai.stanford.edu/~amaas/data/sentiment/

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv('movie_data.csv', encoding='utf-8')
df.shape

(50000, 2)

### Cleaning text data

We need to clean up the data by removing unwanted characters.

In [3]:
df.loc[0, 'review']

"at a Saturday matinee in my home town. I went with an older friend (he was about 12) and my mom let me go because she thought the film would be OK (it's rated G). I was assaulted by loud music, STRANGE images, no plot and a stubborn refusal to make ANY sense. We left halfway through because we were bored, frustrated and our ears hurt. <br /><br />I saw it 22 years later in a revival theatre. My opinion had changed--it's even WORSE! Basically everything I hated about it was still there and the film was VERY 60s...and has dated badly. I got all the little in-jokes...too bad they weren't funny. The constant shifts in tone got quickly annoying and there's absolutely nothing to get a firm grip on. Some people will love this. I found it frustrating...by the end of the film I felt like throwing something heavy at the screen.<br /><br />Also, all the Monkees songs in this movie SUCK (and I DO like them).<br /><br />For ex-hippies only...or if you're stoned. I give this a 1."

We will first remove all HTML markup. The emoticons are then extracted and set aside, since they are useful for sentiment analysis. Next, all non-word characters (including punctuation) are removed and the text is converted to lowercase. Finally, the emoticons are appended to the end of the cleaned text, with the nose character `-` removed for consistency.

In [4]:
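import re

def preprocessor(text):
    # Strip HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons such as :) ;-( =D before punctuation is removed
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)',
                           text)
    # Lowercase, replace runs of non-word characters with a space,
    # then append the emoticons without the '-' nose
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text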

In [5]:
preprocessor("</a>This :) is :( a test :-)!")

'this is a test :) :( :)'
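The two regular expressions in `preprocessor` can be looked at in isolation. The sketch below is only an illustration: `strip_html` and `find_emoticons` are hypothetical helper names that do not appear in the original notebook, and the sample sentence is made up.

```python
import re

# Same emoticon pattern as in preprocessor: eyes (: ; =), optional nose (-),
# then a mouth chosen from ) ( D P.
EMOTICON_RE = re.compile(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)')

def strip_html(text):
    """Remove anything that looks like an HTML tag."""
    return re.sub(r'<[^>]*>', '', text)

def find_emoticons(text):
    """Return emoticons in order of appearance, with the '-' nose removed."""
    return [e.replace('-', '') for e in EMOTICON_RE.findall(text)]

sample = "Loved it :-) <br />but the ending =( was weak ;D"
print(strip_html(sample))      # Loved it :-) but the ending =( was weak ;D
print(find_emoticons(sample))  # [':)', '=(', ';D']
```

Note that the pattern only matches an uppercase `D` or `P`, which is why `preprocessor` collects the emoticons before the text is lowercased.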

Apply the preprocessor to the movie review data.

In [6]:
df['review'] = df['review'].apply(preprocessor)
df.head(5)

| | review | sentiment |
|---|---|---|
| 0 | at a saturday matinee in my home town i went w... | 0 |
| 1 | i love this movie it is the first film master ... | 1 |
| 2 | in the voice over which begins the film hughie... | 1 |
| 3 | spoiler alert the point is though that i didn... | 0 |
| 4 | this is an excellent film no it s not mel gibs... | 1 |


### Processing documents into tokens

One simple way to tokenize documents is to split them into individual words at whitespace characters. Another useful technique is word stemming, which transforms a word into its root form. The Porter stemmer, developed by Martin Porter in 1979, is one of the earliest and most widely used stemming algorithms. A stem does not have to be a dictionary word, as the `thu` in the output below shows.

In [7]:
# Import the Porter stemmer from the Natural Language Toolkit (NLTK)
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
def tokenizer(text):
    return text.split()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

In [8]:
tokenizer('runners like running and thus they run')

['runners', 'like', 'running', 'and', 'thus', 'they', 'run']

In [9]:
tokenizer_porter('runners like running and thus they run')

['runner', 'like', 'run', 'and', 'thu', 'they', 'run']
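The output above reduces `thus` to `thu`. This follows from the first rule group of Porter's algorithm (step 1a in the original paper), which rewrites plural-looking suffixes without checking a dictionary. The sketch below implements only that single step, to show where the stray `thu` comes from. `porter_step_1a` is a hypothetical helper, not an NLTK function, and the full algorithm (as well as NLTK's variant of it) applies several further steps.

```python
def porter_step_1a(word):
    """Apply step 1a of the original Porter algorithm to a lowercase word."""
    if word.endswith('sses'):
        return word[:-2]   # caresses -> caress
    if word.endswith('ies'):
        return word[:-2]   # ponies -> poni
    if word.endswith('ss'):
        return word        # caress -> caress (unchanged)
    if word.endswith('s'):
        return word[:-1]   # cats -> cat, thus -> thu
    return word

for w in ['caresses', 'ponies', 'caress', 'cats', 'thus']:
    print(w, '->', porter_step_1a(w))
```

Because the rule looks only at the suffix, any word ending in a single `s` loses it, whether or not it is a plural.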

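We will also apply another useful technique called stop-word removal. Stop words are words such as *a*, *and*, and *is* that are extremely common in all kinds of text and therefore carry little information for telling one class of document from another.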

In [10]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/karen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
from nltk.corpus import stopwords

stop = stopwords.words('english')
[w for w in tokenizer_porter('a runner likes running and runs a lot')
 if w not in stop]

['runner', 'like', 'run', 'run', 'lot']
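The filtering step itself does not depend on NLTK. As a minimal sketch, assuming a tiny hand-written stop list (the `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical, and NLTK's English list is much longer), the same idea looks like this:

```python
# Hypothetical mini stop list for illustration only.
STOP_WORDS = frozenset({'a', 'an', 'and', 'the', 'is', 'it', 'of', 'to'})

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word collection."""
    return [t for t in tokens if t not in stop_words]

tokens = 'a runner likes running and runs a lot'.split()
print(remove_stop_words(tokens))
# ['runner', 'likes', 'running', 'runs', 'lot']
```

`stopwords.words('english')` returns a list, so wrapping it in `set(...)` is a cheap way to get constant-time membership tests when many reviews are filtered.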

### Training a logistic regression model for document classification

Prepare the training set and the test set: the first 40,000 reviews are used for training and the remaining 10,000 for testing.

In [12]:
X_train = df.iloc[:40000, 0].values
y_train = df.iloc[:40000, 1].values
X_test = df.iloc[40000:50000, 0].values
y_test = df.iloc[40000:50000, 1].values
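A positional split like this is only representative if the rows are in random order. The source does not say how `movie_data.csv` was created, so it may be worth checking the class balance of both slices before training. The sketch below uses made-up toy labels, and `class_balance` is a hypothetical helper.

```python
import random
from collections import Counter

def class_balance(labels):
    """Return the fraction of each class label, keyed by label."""
    counts = Counter(labels)
    return {cls: counts[cls] / len(labels) for cls in sorted(counts)}

# Hypothetical toy labels sorted by class: the worst case for a positional split.
sorted_labels = [0] * 50 + [1] * 50
print(class_balance(sorted_labels[:80]))  # {0: 0.625, 1: 0.375}
print(class_balance(sorted_labels[80:]))  # {1: 1.0}

# Shuffling with a fixed seed before slicing gives both parts a mix of classes.
shuffled = sorted_labels[:]
random.Random(0).shuffle(shuffled)
print(class_balance(shuffled[:80]))
print(class_balance(shuffled[80:]))
```

With the real data, the same check could be run as `class_balance(y_train.tolist())` and `class_balance(y_test.tolist())`.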

In [13]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

# Convert raw text to TF-IDF features, then fit a logistic regression on them
tfidf = TfidfVectorizer()
clf = LogisticRegression(random_state=42)

lr_tfidf_pipeline = Pipeline([('vect', tfidf),
                              ('clf', clf)])
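To make the `vect` step less of a black box, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector. The `tfidf_vectors` helper and the three toy documents are made up for illustration, and the whitespace tokenization is a simplification of the vectorizer's own token pattern.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized tf-idf vectors) for a list of strings."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / norm if norm else 0.0 for x in raw])
    return vocab, vectors

docs = ['the movie was great', 'the movie was awful', 'the plot was thin']
vocab, vectors = tfidf_vectors(docs)
for term, weight in zip(vocab, vectors[0]):
    print(f'{term:>6}: {weight:.3f}')
```

In the first document, `great` (which appears in one document only) receives a larger weight than `movie` (two documents), which in turn outweighs `the` (all three documents).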

In [14]:
lr_tfidf_pipeline.fit(X_train, y_train)

Pipeline(steps=[('vect', TfidfVectorizer()),
                ('clf', LogisticRegression(random_state=42))])

In [15]:
lr_tfidf_pipeline.predict(X_test)

array([0, 0, 0, ..., 0, 0, 0])

In [16]:
print('Test Accuracy: %.3f' % lr_tfidf_pipeline.score(X_test, y_test))

Test Accuracy: 0.899


Try the model's prediction on a single example:

In [19]:
label = {0: 'negative', 1: 'positive'}
example = ['The movie is so so']

print('Prediction: %s\nProbability: %.2f%%' %
      (label[lr_tfidf_pipeline.predict(example)[0]],
       np.max(lr_tfidf_pipeline.predict_proba(example)) * 100))

Prediction: negative
Probability: 52.47%
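Because `np.max` picks the larger of the two class probabilities, the reported value can never fall below 50% for a binary model, so 52.47% means the model is close to undecided. For a binary logistic regression with default settings, the positive-class probability is the sigmoid of the linear decision score. The sketch below works backwards from the printed probability to the score that would produce it. All function names are hypothetical, and the score is derived from the output above rather than read from the fitted model.

```python
import math

def sigmoid(z):
    """Map a real-valued decision score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: the score that yields probability p."""
    return math.log(p / (1.0 - p))

def predict_with_confidence(score):
    """Return (predicted class, probability of that class) for a score."""
    p_positive = sigmoid(score)
    probs = [1.0 - p_positive, p_positive]
    predicted = 1 if score > 0 else 0
    return predicted, max(probs)

# A 52.47% "negative" prediction means P(positive) = 1 - 0.5247.
score = logit(1.0 - 0.5247)
print(score)                           # slightly below zero
print(predict_with_confidence(score))  # class 0 with probability close to 0.5247
```

A score of exactly zero would correspond to a 50/50 split, so a neutral phrase like "so so" landing just below zero is plausible.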


### Pickle the model

In [18]:
import pickle

# Serialize the fitted pipeline; the with block closes the file automatically
with open('model.pkl', 'wb') as file_pickle:
    pickle.dump(lr_tfidf_pipeline, file_pickle)

In [None]:
# Load the pipeline back from disk
with open('model.pkl', 'rb') as new_model_file:
    new_model = pickle.load(new_model_file)

In [None]:
print('Prediction: %s\nProbability: %.2f%%' %
      (label[new_model.predict(example)[0]],
       np.max(new_model.predict_proba(example)) * 100))

Prediction: negative
Probability: 52.47%
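The save-and-restore round trip can be tried without a trained model. The sketch below uses a plain dictionary as a stand-in for the pipeline and a temporary directory so that no file is left behind. `save_object`, `load_object`, and `stand_in_model` are hypothetical names introduced for this illustration.

```python
import os
import pickle
import tempfile

def save_object(obj, path):
    """Pickle obj to path; the with block closes the file even on error."""
    with open(path, 'wb') as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_object(path):
    """Load a pickled object from path. Only use this on trusted files."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Stand-in for the fitted pipeline (an assumption made for illustration).
stand_in_model = {'vocabulary': ['good', 'bad'], 'weights': [1.5, -2.0]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')
    save_object(stand_in_model, path)
    restored = load_object(path)

print(restored == stand_in_model)  # True: same content, but a new object
```

Two caveats apply when this is done with a real model: unpickling can execute arbitrary code, so only files from a trusted source should be loaded, and a pickled scikit-learn estimator is best loaded with the same scikit-learn version that saved it.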
