### Load Data
The dataset comes from Kaggle:
https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

In [1]:
import pandas as pd
import os

MOVIE_PATH = "reviews_dataset"

def load_review_data(movie_review_path=MOVIE_PATH):
    csv_path = os.path.join(movie_review_path,"IMDB Dataset.csv")
    return pd.read_csv(csv_path)


In [2]:
movie_reviews = load_review_data()
movie_reviews.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [3]:
movie_reviews['review']

0        One of the other reviewers has mentioned that ...
1        A wonderful little production. <br /><br />The...
2        I thought this was a wonderful way to spend ti...
3        Basically there's a family where a little boy ...
4        Petter Mattei's "Love in the Time of Money" is...
                               ...                        
49995    I thought this movie did a down right good job...
49996    Bad plot, bad dialogue, bad acting, idiotic di...
49997    I am a Catholic taught in parochial elementary...
49998    I'm going to have to disagree with the previou...
49999    No one expects the Star Trek movies to be high...
Name: review, Length: 50000, dtype: object

In [4]:
movie_reviews['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

In [5]:
movie_reviews['review'][1]

'A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only "has got all the polari" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams\' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master\'s of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional \'dream\' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell\'s murals decorating every surface) are terribly well d

### Data Cleaning 

Based on the first two reviews, we can apply some basic cleaning to the data set:
* remove HTML tags
* remove punctuation
* remove numerical values
* convert all text to lower case
* tokenize the text
* remove stopwords


In [6]:
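# round 1: HTML tag cleaning (takes a single text string)
import re
import string

def remove_html_tag(text):
    # non-greedy match, so each <...> tag is removed separately
    return re.sub(r'<.*?>', '', text)

# alias kept so that later cells can keep calling clean_html
clean_html = remove_html_tag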

In [7]:
no_tag_reviews = pd.DataFrame(movie_reviews['review'].apply(clean_html))
no_tag_reviews['review'][0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wo

In [8]:
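# round 2: lower-case the text, then remove bracketed text, punctuation, and words containing digits

def remove_punct_num(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                               # text inside square brackets
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)   # punctuation
    text = re.sub(r'\w*\d\w*', '', text)                              # words containing digits
    return text

# alias kept so that later cells can keep calling clean_punct_num
clean_punct_num = remove_punct_num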

In [9]:
movie_reviews['clean_reviews'] = no_tag_reviews['review'].apply(clean_punct_num)
print(movie_reviews['clean_reviews'][0])

one of the other reviewers has mentioned that after watching just  oz episode youll be hooked they are right as this is exactly what happened with methe first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the wordit is called oz as that is the nickname given to the oswald maximum security state penitentary it focuses mainly on emerald city an experimental section of the prison where all the cells have glass fronts and face inwards so privacy is not high on the agenda em city is home to manyaryans muslims gangstas latinos christians italians irish and moreso scuffles death stares dodgy dealings and shady agreements are never far awayi would say the main appeal of the show is due to the fact that it goes where other shows wouldnt dare forget pretty pictures
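The output above contains fused words such as `methe` and `wordit`, because tags and punctuation that sat between two sentences were deleted without leaving a space. One possible adjustment, sketched here as an assumption rather than as what this notebook does, is to substitute a space and then collapse repeated whitespace. The function name `clean_text_keep_spaces` and the sample string are hypothetical.

```python
import re
import string

def clean_text_keep_spaces(text):
    """Hypothetical variant of the cleaning steps that avoids fusing words."""
    text = re.sub(r'<.*?>', ' ', text)                                  # HTML tags -> space
    text = text.lower()
    text = re.sub('[%s]' % re.escape(string.punctuation), ' ', text)    # punctuation -> space
    text = re.sub(r'\w*\d\w*', ' ', text)                               # words containing digits -> space
    return re.sub(r'\s+', ' ', text).strip()                            # collapse whitespace

sample = "happened with me.<br /><br />The first thing, after just 1 episode..."
cleaned = clean_text_keep_spaces(sample)
print(cleaned)
```

The trade-off is that contractions are split as well (`you'll` becomes `you ll` instead of `youll`), so whether this helps depends on how the tokens are used later.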

In [10]:
# round 3: remove stop words 
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 

nltk.download("stopwords")
nltk.download('punkt')
stop_words = set(stopwords.words('english')) 


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kaixuanchin/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kaixuanchin/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [11]:
print(stop_words)

{'you', 'or', "don't", 'during', 'mustn', 'will', 've', 'needn', 'll', 'nor', 'my', 'does', 'there', 'can', 'weren', 'is', 'no', 'yours', 'shan', "mightn't", 'theirs', 'had', 'only', 'their', 'some', 'of', 'wouldn', 'by', 'am', 'when', 'all', 'not', "mustn't", 'between', 'they', 'but', "that'll", 'being', 'mightn', 'just', 'have', 'an', 'o', "wasn't", 'then', "couldn't", 'from', "didn't", "she's", 'isn', 'm', 'very', 'other', 'couldn', 'ours', 'those', 'than', 'what', 'she', 'has', 'in', 'wasn', 'its', 'over', 'your', 'we', 'under', 'about', 'how', 'at', 'a', 'out', 'itself', 'did', 'above', 'more', "aren't", 'y', "shouldn't", 'shouldn', 'such', 'hers', "you'll", 'who', "shan't", "you've", 'don', "isn't", 'these', "needn't", 'should', 'so', "weren't", 'was', 'into', 're', "should've", 'aren', 'him', 'why', "doesn't", 'ma', 'yourselves', "you're", 'which', 'own', 'this', 'because', 'won', 'do', 'been', "hadn't", 'having', 'the', 'while', 'with', 'any', 'few', 'haven', 'doing', 'ourselve

In [12]:
def reduce_stop_word(text):
    word_tokens = word_tokenize(text)
    text =[word for word in word_tokens if word not in stop_words]
    return text

clean_stop_words = lambda x: reduce_stop_word(x)

In [13]:
movie_reviews['review_clean_tokenize'] = movie_reviews['clean_reviews'].apply(clean_stop_words)

In [14]:
movie_reviews.head()

Unnamed: 0,review,sentiment,clean_reviews,review_clean_tokenize
0,One of the other reviewers has mentioned that ...,positive,one of the other reviewers has mentioned that ...,"[one, reviewers, mentioned, watching, oz, epis..."
1,A wonderful little production. <br /><br />The...,positive,a wonderful little production the filming tech...,"[wonderful, little, production, filming, techn..."
2,I thought this was a wonderful way to spend ti...,positive,i thought this was a wonderful way to spend ti...,"[thought, wonderful, way, spend, time, hot, su..."
3,Basically there's a family where a little boy ...,negative,basically theres a family where a little boy j...,"[basically, theres, family, little, boy, jake,..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,petter matteis love in the time of money is a ...,"[petter, matteis, love, time, money, visually,..."


For now, we will settle for the cleaning steps above. We can always come back to this step later and check whether stemming improves the predictions.
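A minimal sketch of what a stemming step could look like, assuming NLTK's `PorterStemmer` (which needs no extra download). The helper name `stem_tokens` and the sample tokens are hypothetical; the notebook itself does not apply stemming.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(tokens):
    """Hypothetical helper: reduce each token to its Porter stem."""
    return [stemmer.stem(token) for token in tokens]

tokens = ["watching", "hooked", "scenes"]
stemmed = stem_tokens(tokens)
print(stemmed)
```

Applied to the data frame, this could be run on the `review_clean_tokenize` column with `.apply(stem_tokens)`, and the classifiers retrained to compare results.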

### Organizing Data 

After cleaning the data, we can organize it into two forms:
* corpus = a collection of texts (here, each review as a list of words)
* document-term matrix (DTM) = the frequency of each term in each document of the collection
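To make the DTM idea concrete, here is a small sketch that builds one by hand with `collections.Counter`. The two-document toy corpus is an assumption for illustration and is not taken from the dataset.

```python
from collections import Counter

docs = ["good movie good plot", "bad movie"]   # toy corpus, not from the dataset

# Vocabulary: every distinct word across the corpus, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary term
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[term] for term in vocab])

print(vocab)   # ['bad', 'good', 'movie', 'plot']
print(dtm)     # [[0, 2, 1, 1], [1, 0, 1, 0]]
```

Each row is a document and each column is a term, so most entries are zero once the vocabulary grows large.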


##### Corpus

During data cleaning, we converted each review into a list of words.

In [15]:
movie_reviews.head()

Unnamed: 0,review,sentiment,clean_reviews,review_clean_tokenize
0,One of the other reviewers has mentioned that ...,positive,one of the other reviewers has mentioned that ...,"[one, reviewers, mentioned, watching, oz, epis..."
1,A wonderful little production. <br /><br />The...,positive,a wonderful little production the filming tech...,"[wonderful, little, production, filming, techn..."
2,I thought this was a wonderful way to spend ti...,positive,i thought this was a wonderful way to spend ti...,"[thought, wonderful, way, spend, time, hot, su..."
3,Basically there's a family where a little boy ...,negative,basically theres a family where a little boy j...,"[basically, theres, family, little, boy, jake,..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,petter matteis love in the time of money is a ...,"[petter, matteis, love, time, money, visually,..."


In [16]:
import pickle

movie_reviews['review_clean_tokenize'].to_pickle("corpus.pkl")

##### Document-Term Matrix (DTM)

We use `CountVectorizer` from scikit-learn; another option is `TfidfVectorizer`.
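A short sketch of how the two vectorizers differ, using a toy corpus that is an assumption for illustration: `CountVectorizer` stores raw term counts, while `TfidfVectorizer` down-weights terms that appear in many documents and normalizes each row.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["good movie good plot", "bad movie"]   # toy corpus, not from the dataset

count_matrix = CountVectorizer().fit_transform(docs)   # raw counts
tfidf_matrix = TfidfVectorizer().fit_transform(docs)   # TF-IDF weights, L2-normalized rows

print(count_matrix.toarray())
print(tfidf_matrix.toarray().round(2))
```

In the second document, `bad` receives a larger TF-IDF weight than `movie`, because `movie` occurs in both documents and is therefore less informative.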

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')
data_cv = cv.fit_transform(movie_reviews['clean_reviews'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = movie_reviews['clean_reviews'].index
data_dtm 


Unnamed: 0,aa,aaa,aaaaaaaaaaaahhhhhhhhhhhhhh,aaaaaaaargh,aaaaaaah,aaaaaaahhhhhhggg,aaaaagh,aaaaah,aaaaargh,aaaaarrrrrrgggggghhhhhh,...,überwoman,ünel,ünfaithful,üvegtigris,üzümcü,ýs,þorleifsson,þór,יגאל,כרמון
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
49995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
49996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
49997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
49998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
#data_dtm.to_pickle("dtm.pkl")

### Split data ( train and test set ) 


In [19]:
# the data frame holds both the raw reviews and the cleaned reviews;
# we will use the cleaned reviews rather than the raw ones.


# make a copy of the data frame
movie_reviews_copy = movie_reviews.copy()

movie_reviews_copy = movie_reviews_copy.drop(['review','review_clean_tokenize'], axis=1)

movie_reviews_copy.head()

Unnamed: 0,sentiment,clean_reviews
0,positive,one of the other reviewers has mentioned that ...
1,positive,a wonderful little production the filming tech...
2,positive,i thought this was a wonderful way to spend ti...
3,negative,basically theres a family where a little boy j...
4,positive,petter matteis love in the time of money is a ...


In [20]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(movie_reviews_copy,test_size=0.2,random_state=42)


train_set_x = train_set.drop(["sentiment"],axis=1)
train_set_y = train_set.drop(["clean_reviews"],axis=1)

test_set_x = test_set.drop(["sentiment"],axis=1)
test_set_y = test_set.drop(["clean_reviews"],axis=1)

### Vectorize the reviews



In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


# data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names())
# data_dtm.index = movie_reviews['clean_reviews'].index
# data_dtm 

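vectorizer = TfidfVectorizer(stop_words='english')
# fit the vocabulary and IDF weights on the training set only
tokenize_train_x = vectorizer.fit_transform(train_set_x['clean_reviews'])
# reuse the fitted vectorizer so the test set gets the same columns
tokenize_test_x = vectorizer.transform(test_set_x['clean_reviews'])

# print out the first training review (positional index, since the split shuffles the row labels)
print(train_set_x['clean_reviews'].iloc[0])
# print out the TF-IDF vector of that review
print(tokenize_train_x[0].toarray())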
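A classifier needs training and test matrices with identical columns. A minimal sketch, on toy data that is an assumption for illustration, of fitting the vectorizer once on the training text and calling `transform` on the test text:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["good movie", "bad plot"]   # toy training text
test_docs = ["good plot twist"]           # "twist" never appears in training

toy_vectorizer = TfidfVectorizer()
toy_train_x = toy_vectorizer.fit_transform(train_docs)   # learns vocabulary and IDF weights
toy_test_x = toy_vectorizer.transform(test_docs)         # reuses them; unseen words are ignored

print(toy_train_x.shape, toy_test_x.shape)
```

Both matrices end up with one column per training-vocabulary term, so a model fitted on the first can score rows of the second.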

### Clean up labels 

In our dataset, the sentiment labels are strings (object dtype). For classification, we convert them to integer values using `LabelEncoder`.

In [None]:
type(train_set_y['sentiment'].iloc[0])

In [None]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
train_set_y_label= encoder.fit_transform(train_set_y['sentiment'])

train_set_y_label[2]
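A small sketch, on toy labels that are an assumption for illustration, of how `LabelEncoder` assigns integers: classes are sorted, so `negative` maps to 0 and `positive` to 1, and `inverse_transform` recovers the original strings.

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["positive", "negative", "positive"]   # toy labels, not from the dataset

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

print(list(toy_encoder.classes_))                         # ['negative', 'positive']
print(toy_encoded.tolist())                               # [1, 0, 1]
print(list(toy_encoder.inverse_transform(toy_encoded)))   # original strings
```

The same fitted encoder should be reused with `transform` on the test labels so that both sets share one mapping.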

In [None]:
# the labels are now 1 and 0 instead of str (with OneHotEncoder we would have to concatenate the result with the original data).
# we can also use a pipeline to transform the data
train_set_y_label
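A minimal sketch of the pipeline idea, assuming scikit-learn's `Pipeline` and a tiny toy training set that is not from the dataset. The step names `tfidf` and `svm` and the variable names are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

toy_texts = [
    "wonderful great film",
    "great acting wonderful",
    "terrible boring film",
    "boring terrible plot",
]
toy_targets = [1, 1, 0, 0]   # 1 = positive, 0 = negative

# The pipeline fits the vectorizer and the classifier together,
# and applies the fitted vectorizer automatically at prediction time.
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC(kernel="linear")),
])
text_clf.fit(toy_texts, toy_targets)

prediction = text_clf.predict(["wonderful great plot"])
print(prediction)
```

Bundling the steps this way removes the need to call `transform` on the test text separately, since `predict` takes raw strings.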


### Classification

Linear SVM

In [None]:
from sklearn.svm import SVC

clf_svm = SVC(kernel="linear")

clf_svm.fit(tokenize_train_x , train_set_y_label)

clf_svm.predict(tokenize_test_x[0])

Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier 

tree_clf = DecisionTreeClassifier()
tree_clf.fit(tokenize_train_x , train_set_y_label)

tree_clf.predict(tokenize_test_x[0])