### Load Data
The data comes from Kaggle's IMDB dataset of 50K movie reviews:
https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [1]:
import pandas as pd
import os

MOVIE_PATH = "reviews_dataset"

def load_review_data(movie_review_path=MOVIE_PATH):
    csv_path = os.path.join(movie_review_path,"IMDB Dataset.csv")
    return pd.read_csv(csv_path)


In [2]:
movie_reviews = load_review_data()
movie_reviews.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [3]:
movie_reviews['review']

0        One of the other reviewers has mentioned that ...
1        A wonderful little production. <br /><br />The...
2        I thought this was a wonderful way to spend ti...
3        Basically there's a family where a little boy ...
4        Petter Mattei's "Love in the Time of Money" is...
                               ...                        
49995    I thought this movie did a down right good job...
49996    Bad plot, bad dialogue, bad acting, idiotic di...
49997    I am a Catholic taught in parochial elementary...
49998    I'm going to have to disagree with the previou...
49999    No one expects the Star Trek movies to be high...
Name: review, Length: 50000, dtype: object

In [4]:
movie_reviews['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [5]:
movie_reviews['review'][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

### Data Cleaning 

Based on the first two reviews, we can apply some basic cleaning steps to the data set:
* remove HTML tags
* remove punctuation
* remove numerical values
* convert all text to lower case
* tokenize the text
* remove stop words


In [6]:
# round 1: strip HTML tags from a single piece of text
import re
import string

def remove_html_tag(text):
    # the non-greedy pattern removes each <...> tag, e.g. <br />
    text = re.sub(r'<.*?>', '', text)
    return text

# alias used with .apply() below
clean_html = remove_html_tag

In [7]:
no_tag_reviews = pd.DataFrame(movie_reviews['review'].apply(clean_html))
no_tag_reviews['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wo

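In the output above, sentences are fused together where `<br /><br />` used to be (for example "with me.The first"), because each tag is replaced with an empty string. A minimal sketch of one possible alternative follows: replace each tag with a space and then collapse repeated whitespace. The function name `remove_html_tag_spaced` is hypothetical and not part of the notebook.

```python
import re

def remove_html_tag_spaced(text):
    """Replace each HTML tag with a space, then collapse repeated whitespace."""
    text = re.sub(r'<.*?>', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "exactly what happened with me.<br /><br />The first thing"
print(remove_html_tag_spaced(sample))
# exactly what happened with me. The first thing
```

Swapping this in would change the cleaned text shown in the later cells, so the outputs below would need to be regenerated.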
In [8]:
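# round 2: lower-case the text, then remove punctuation and numbers

def remove_punct_num(text):
    text = text.lower()
    # drop anything inside square brackets
    text = re.sub(r'\[.*?\]', '', text)
    # drop every punctuation character
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # drop any word that contains a digit
    text = re.sub(r'\w*\d\w*', '', text)
    return text

# alias used with .apply() below
clean_punct_num = remove_punct_num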

In [9]:
movie_reviews['clean_reviews'] = no_tag_reviews['review'].apply(clean_punct_num)
print(movie_reviews['clean_reviews'][0])

one of the other reviewers has mentioned that after watching just  oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pictures

In [10]:
# round 3: remove stop words 
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 

nltk.download("stopwords")
nltk.download('punkt')
stop_words = set(stopwords.words('english')) 


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kaixuanchin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kaixuanchin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [11]:
print(stop_words)

{'some', 'below', "shan't", "don't", 'its', 'wouldn', 'to', 'of', 'all', 're', "she's", 'mightn', 'before', 'above', 'that', "weren't", 'why', 'in', 'not', 'my', 'you', "mustn't", 'they', 'off', 'very', "you're", 'been', 'i', 'his', 'had', 'couldn', 'further', 'or', 'up', 'most', 'yourselves', 'into', 'have', 'as', 'yours', 'doesn', 'hadn', 'doing', 'over', 'about', 'weren', 'just', 'won', "isn't", 'now', 'any', 'ma', 'after', 'when', "didn't", 'these', 'each', "should've", 'between', 'd', 'who', 'them', "couldn't", 'their', "wasn't", 't', 'll', 'own', 'aren', 'hers', 'o', "you'll", 'did', 'whom', 've', 'such', 'mustn', 'those', 'being', 'do', "won't", 'more', 'at', 'by', 'until', 'out', 'ours', 'for', 'your', 'are', "it's", 'too', 'so', 'myself', 'it', 'haven', 'herself', 'through', 'while', 'theirs', 'needn', 'him', 'against', "hasn't", 'was', "needn't", 'shan', 'no', 'than', 'hasn', 's', 'because', 'this', 'down', "aren't", 'were', 'will', 'a', 'if', 'shouldn', 'himself', "haven't",

In [12]:
def reduce_stop_word(text):
    # tokenize, then keep only the tokens that are not stop words
    word_tokens = word_tokenize(text)
    return [word for word in word_tokens if word not in stop_words]

# alias used with .apply() below
clean_stop_words = reduce_stop_word

In [13]:
movie_reviews['review_clean_tokenize'] = movie_reviews['clean_reviews'].apply(clean_stop_words)
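The stop-word set printed above contains negations such as 'not' and 'no', so they are removed from every review. For sentiment classification these words may carry signal. The sketch below shows one way to keep them. It uses a small hand-copied subset of the stop words so that it runs without NLTK downloads; `NEGATIONS` and `filter_tokens` are hypothetical names, not part of the notebook.

```python
# A small hand-copied subset of the stop-word set printed above (assumption:
# in the notebook you would start from the full NLTK `stop_words` set instead).
stop_words_subset = {"the", "is", "a", "this", "not", "no", "was"}

# Hypothetical list of negations to keep; tune it for your own data.
NEGATIONS = {"not", "no"}
sentiment_stop_words = stop_words_subset - NEGATIONS

def filter_tokens(tokens, stop_set):
    """Return the tokens that are not in the given stop-word set."""
    return [tok for tok in tokens if tok not in stop_set]

tokens = "this movie was not good".split()
print(filter_tokens(tokens, stop_words_subset))     # ['movie', 'good']
print(filter_tokens(tokens, sentiment_stop_words))  # ['movie', 'not', 'good']
```

Whether keeping negations helps is something to check against the F1 score, not a given.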

In [14]:
movie_reviews.head()

| | review | sentiment | clean_reviews | review_clean_tokenize |
|---|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | one of the other reviewers has mentioned that ... | [one, reviewers, mentioned, watching, oz, epis... |
| 1 | A wonderful little production. <br /><br />The... | positive | a wonderful little production the filming tech... | [wonderful, little, production, filming, techn... |
| 2 | I thought this was a wonderful way to spend ti... | positive | i thought this was a wonderful way to spend ti... | [thought, wonderful, way, spend, time, hot, su... |
| 3 | Basically there's a family where a little boy ... | negative | basically theres a family where a little boy j... | [basically, theres, family, little, boy, jake,... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | petter matteis love in the time of money is a ... | [petter, matteis, love, time, money, visually,... |


For now, we will settle for the cleaning steps above. We can always come back to this step and check whether stemming improves the predictions.
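If you do come back to try stemming, a minimal sketch could look like the following. It assumes NLTK's `PorterStemmer` (which needs no extra downloads) and applies it to an already tokenized review; `stem_tokens` is a hypothetical helper name.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens):
    """Stem each token with NLTK's Porter stemmer."""
    return [stemmer.stem(tok) for tok in tokens]

# Different inflections of the same word collapse to one stem.
print(stem_tokens(["watching", "watched", "running"]))
# ['watch', 'watch', 'run']
```

In the notebook this could be applied to the `review_clean_tokenize` column with `.apply(stem_tokens)`, after which the vectorizer would need to be refit on the stemmed text.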

In [15]:
# encode the sentiment labels as integers (negative -> 0, positive -> 1)
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
movie_reviews['sentiment'] = encoder.fit_transform(movie_reviews['sentiment'])
movie_reviews

| | review | sentiment | clean_reviews | review_clean_tokenize |
|---|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 | one of the other reviewers has mentioned that ... | [one, reviewers, mentioned, watching, oz, epis... |
| 1 | A wonderful little production. <br /><br />The... | 1 | a wonderful little production the filming tech... | [wonderful, little, production, filming, techn... |
| 2 | I thought this was a wonderful way to spend ti... | 1 | i thought this was a wonderful way to spend ti... | [thought, wonderful, way, spend, time, hot, su... |
| 3 | Basically there's a family where a little boy ... | 0 | basically theres a family where a little boy j... | [basically, theres, family, little, boy, jake,... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 | petter matteis love in the time of money is a ... | [petter, matteis, love, time, money, visually,... |
| ... | ... | ... | ... | ... |
| 49995 | I thought this movie did a down right good job... | 1 | i thought this movie did a down right good job... | [thought, movie, right, good, job, wasnt, crea... |
| 49996 | Bad plot, bad dialogue, bad acting, idiotic di... | 0 | bad plot bad dialogue bad acting idiotic direc... | [bad, plot, bad, dialogue, bad, acting, idioti... |
| 49997 | I am a Catholic taught in parochial elementary... | 0 | i am a catholic taught in parochial elementary... | [catholic, taught, parochial, elementary, scho... |
| 49998 | I'm going to have to disagree with the previou... | 0 | im going to have to disagree with the previous... | [im, going, disagree, previous, comment, side,... |
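`LabelEncoder` assigns integer codes in the sorted order of the labels, which is why 'negative' became 0 and 'positive' became 1 in the table above. The short sketch below, on a three-label toy list, shows one way to read the mapping back from the fitted encoder.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["positive", "negative", "positive"])

# classes_ is sorted, and each label's position is its integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)         # {'negative': 0, 'positive': 1}
print(codes.tolist())  # [1, 0, 1]
```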


### Splitting the Data

In [24]:
from sklearn.model_selection import train_test_split

lg_train_set, lg_test_set = train_test_split(movie_reviews, test_size=0.2, random_state=42)
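# the full dataset is too large for the classifiers used here, so we work with a smaller subset:
# the 10,000-row test split is split again into 8,000 training rows and 2,000 test rows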
sm_train_set , sm_test_set = train_test_split(lg_test_set,test_size = 0.2 , random_state=42)
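Note that the second split is taken from `lg_test_set`, not from the full data, so the small sets contain 8,000 and 2,000 reviews. The sketch below reproduces only the sizes, using a list of integers as a stand-in for the 50,000 rows.

```python
from sklearn.model_selection import train_test_split

rows = list(range(50_000))  # stand-in for the 50,000 reviews

lg_train, lg_test = train_test_split(rows, test_size=0.2, random_state=42)
sm_train, sm_test = train_test_split(lg_test, test_size=0.2, random_state=42)

print(len(lg_train), len(lg_test))  # 40000 10000
print(len(sm_train), len(sm_test))  # 8000 2000
```

These sizes match the row counts of the document-term matrices built later.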

In [65]:
train_y = sm_train_set['sentiment']
test_y = sm_test_set['sentiment']

train_y = pd.DataFrame(train_y)
test_y = pd.DataFrame(test_y)

test_y.head(9)

| | sentiment |
|---|---|
| 25056 | 0 |
| 30334 | 0 |
| 17962 | 0 |
| 39588 | 0 |
| 34107 | 0 |
| 7096 | 1 |
| 15594 | 0 |
| 38331 | 0 |
| 48168 | 1 |


### Organizing Data 

After cleaning the data, we can organize it into two forms:
* corpus = a collection of texts (here, the tokenized reviews)
* document-term matrix (DTM) = the frequency of each term in each document of the collection
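To make the document-term matrix concrete, here is a small pure-Python sketch on three toy documents. It only illustrates the idea; the notebook itself uses scikit-learn's `CountVectorizer`, which produces a sparse matrix instead of nested lists.

```python
from collections import Counter

docs = ["good movie", "bad movie", "good good plot"]

# Vocabulary: every distinct term, in sorted order (one column per term).
vocabulary = sorted({term for doc in docs for term in doc.split()})

# One row per document; each cell holds how often the term occurs in it.
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[term] for term in vocabulary])

print(vocabulary)  # ['bad', 'good', 'movie', 'plot']
for row in dtm:
    print(row)
# [0, 1, 1, 0]
# [1, 0, 1, 0]
# [0, 2, 0, 1]
```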


##### Corpus

During data cleaning, we already converted each review into a list of words.

In [26]:
movie_reviews.head()

| | review | sentiment | clean_reviews | review_clean_tokenize |
|---|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | 1 | one of the other reviewers has mentioned that ... | [one, reviewers, mentioned, watching, oz, epis... |
| 1 | A wonderful little production. <br /><br />The... | 1 | a wonderful little production the filming tech... | [wonderful, little, production, filming, techn... |
| 2 | I thought this was a wonderful way to spend ti... | 1 | i thought this was a wonderful way to spend ti... | [thought, wonderful, way, spend, time, hot, su... |
| 3 | Basically there's a family where a little boy ... | 0 | basically theres a family where a little boy j... | [basically, theres, family, little, boy, jake,... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | 1 | petter matteis love in the time of money is a ... | [petter, matteis, love, time, money, visually,... |


In [27]:
# Series.to_pickle is a pandas method, so no separate pickle import is needed
sm_train_set['review_clean_tokenize'].to_pickle("train_corpus.pkl")
sm_test_set['review_clean_tokenize'].to_pickle("test_corpus.pkl")

##### Document-Term Matrix (DTM)

We use `CountVectorizer` from scikit-learn; another option is `TfidfVectorizer`.
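As a rough illustration of the alternative, the sketch below runs `TfidfVectorizer` with default settings on two toy documents. Instead of raw counts it produces weights that are lower for terms appearing in many documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "bad movie"]

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

# vocabulary_ maps each term to its column index.
good_col = tfidf.vocabulary_["good"]
movie_col = tfidf.vocabulary_["movie"]

# "movie" occurs in both documents, so it is weighted lower than "good".
print(weights[0, good_col] > weights[0, movie_col])  # True
```

It exposes the same `fit_transform` and `transform` methods, so it could replace `CountVectorizer` in the cell below without other changes.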

In [28]:
from sklearn.feature_extraction.text import CountVectorizer

# only the 'clean_reviews' column is vectorized, so the other columns do not need to be dropped
vectorizer = CountVectorizer(stop_words='english')
# fit_transform learns the vocabulary from the training reviews
train_x_vectors = vectorizer.fit_transform(sm_train_set['clean_reviews'])
# use transform (not fit_transform) on the test set so that it shares the same feature columns
test_x_vectors = vectorizer.transform(sm_test_set['clean_reviews'])
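The difference between `fit_transform` and `transform` can be seen on a toy example. In this sketch the variable names are illustrative only: the fitted vectorizer keeps its four training columns when transforming new text, while refitting on the new text builds a different vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["good movie", "bad plot"]
test_docs = ["good unseen word"]

vec = CountVectorizer()
train_x = vec.fit_transform(train_docs)  # learns the vocabulary
test_x = vec.transform(test_docs)        # reuses it; unseen terms are dropped

print(train_x.shape, test_x.shape)  # (2, 4) (1, 4)

# Refitting on the test set would build a different vocabulary.
refit_x = CountVectorizer().fit_transform(test_docs)
print(refit_x.shape)                # (1, 3)
```

A classifier trained on four columns cannot score a three-column matrix, which is why the test set must go through `transform`.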

In [61]:
test_x_vectors

<2000x70459 sparse matrix of type '<class 'numpy.int64'>'
	with 162349 stored elements in Compressed Sparse Row format>

In [30]:
train_x_vectors

<8000x70459 sparse matrix of type '<class 'numpy.int64'>'
	with 712909 stored elements in Compressed Sparse Row format>

### Classification

Here we use only two classifiers: an SVM and a decision tree.

##### Linear SVM

In [31]:
from sklearn.svm import SVC

clf_svm = SVC(kernel='linear')

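# pass the labels as a 1-D Series; a one-column DataFrame triggers a DataConversionWarning
clf_svm.fit(train_x_vectors, train_y['sentiment'])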

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [67]:
clf_svm.predict(test_x_vectors[5])  # matches the true label of test row 5 (index 7096)

array([1])

##### Decision Tree

In [68]:
from sklearn.tree import DecisionTreeClassifier

clf_dec = DecisionTreeClassifier()
clf_dec.fit(train_x_vectors, train_y)



DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [69]:
clf_dec.predict(test_x_vectors[5])

array([1])

### Evaluation 

A common way to measure the performance of a model is the F1 score.
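The F1 score for a class is the harmonic mean of its precision and recall, F1 = 2 * P * R / (P + R). The sketch below computes it by hand on six made-up labels; `f1_for_class` is a hypothetical helper, and in the notebook scikit-learn's `f1_score` does this work.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]

# Each class has 2 correct hits, 1 false positive and 1 false negative,
# so precision = recall = 2/3 and F1 = 2/3 for both.
print(f1_for_class(y_true, y_pred, positive=1))
print(f1_for_class(y_true, y_pred, positive=0))
```

Calling `f1_score` with `average=None`, as in the cells below, returns one such value per class.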

In [70]:
from sklearn.metrics import f1_score

f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=[1,0])


array([0.84810127, 0.83967112])

In [71]:
f1_score(test_y, clf_dec.predict(test_x_vectors), average=None, labels=[1,0])


array([0.7133758 , 0.70137825])

Based on the F1 scores, the linear SVM did better on this small portion of the dataset: it scores about 0.85 on the positive class and 0.84 on the negative class, roughly 0.13 to 0.14 higher than the decision tree (0.71 and 0.70).

There are several ways to keep improving the F1 score:
1. Fine-tune the model with `GridSearchCV`.
2. Normalize the text with stemming or lemmatization.
3. Clean the data further: some words were not split correctly and ended up fused together (two words in one).
4. Add high-frequency words to the collection of stop words.
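For the last idea in the list, one possible way to find candidate stop words is to count in how many reviews each word appears. This is a sketch on three toy token lists; the threshold and the name `extra_stop_words` are assumptions for illustration.

```python
from collections import Counter

# Stand-in for movie_reviews['review_clean_tokenize'] (lists of tokens).
tokenized_reviews = [
    ["movie", "great", "acting"],
    ["movie", "bad", "plot"],
    ["film", "movie", "great"],
]

# Count in how many reviews each word appears (document frequency).
doc_freq = Counter(word for tokens in tokenized_reviews for word in set(tokens))

# Hypothetical threshold: words present in every review carry little signal.
threshold = len(tokenized_reviews)
extra_stop_words = {word for word, n in doc_freq.items() if n >= threshold}
print(extra_stop_words)  # {'movie'}
```

On the real data a lower threshold, such as a fraction of the reviews, is probably more useful. `CountVectorizer` also has a `max_df` parameter that drops terms above a given document frequency.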

In this case, we only fine-tune the model.

### Fine Tune Our Model


In [None]:
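from sklearn.model_selection import GridSearchCV

# baseline: accuracy of the untuned linear SVM on the test set
# (`clf` is not defined yet at this point, so we score `clf_svm`)
print(clf_svm.score(test_x_vectors, test_y))

parameters = {'kernel': ('linear', 'rbf'), 'C': (1,4,8,16,32)}

svc = SVC()
clf = GridSearchCV(svc, parameters, cv=5)
# pass the labels as a 1-D Series to avoid the DataConversionWarning
clf.fit(train_x_vectors, train_y['sentiment'])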

  y = column_or_1d(y, warn=True)


In [None]:
print(clf.score(test_x_vectors, test_y))
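After fitting, a `GridSearchCV` object exposes which parameter combination won and how it scored. The sketch below shows this on a small synthetic dataset so that it runs quickly. The data, the reduced parameter grid, `cv=3` and `scoring="f1"` are all assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data so the sketch runs quickly.
X, y = make_classification(n_samples=120, n_features=10, random_state=42)

parameters = {"kernel": ("linear", "rbf"), "C": (1, 4)}
search = GridSearchCV(SVC(), parameters, cv=3, scoring="f1")
search.fit(X, y)

print(search.best_params_)           # best kernel/C combination found
print(round(search.best_score_, 3))  # mean cross-validated F1 for it
```

By default the search refits the best model on all the training data, so calling `score` or `predict` on the fitted search object uses that model.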