Importing and describing data

In [13]:
#importing dataset as dataframe using pandas

import pandas as pd
data = pd.read_csv('IMDB_Dataset.csv')

In [14]:
#what the dataset looks like
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [15]:
#number of reviews in each sentiment class
data['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [4]:
#summary
data.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


Text Processing

In [5]:
import nltk

#separate by words
from nltk.tokenize.toktok import ToktokTokenizer
tokenizer = ToktokTokenizer()

In [8]:
#remove stop words
from nltk.corpus import stopwords
stopwords_list = set(stopwords.words('english'))

def remove_stopwords(text):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    filtered_tokens = [token for token in tokens if token.lower() not in stopwords_list]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text


data['review']=data['review'].apply(remove_stopwords)

data.head()

Unnamed: 0,review,sentiment
0,One reviewers mentioned watching 1 Oz episode ...,positive
1,wonderful little production. <br / ><br / >The...,positive
2,thought wonderful way spend time hot summer we...,positive
3,Basically ' family little boy ( Jake ) thinks ...,negative
4,"Petter Mattei ' "" Love Time Money "" visually s...",positive


In [10]:
#removing noise
import re
def remove_special_chars(text):
    text=text.lower()
    #keep only letters, digits and whitespace (the range A-z would also match [ \ ] ^ _ `)
    text=re.sub(r'[^a-zA-Z0-9\s]','',text)
    return text

data['review'] = data['review'].apply(remove_special_chars)

data.head()

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production br br the filmin...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically family little boy jake thinks zo...,negative
4,petter mattei love time money visually stun...,positive
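
The `<br />` tags in the raw reviews survive this step as stray `br` tokens (see row 1 above). A minimal sketch of one way to avoid that is to strip anything that looks like an HTML tag before removing special characters. The helper name `strip_html` and the sample string are assumptions made for illustration. The last lines also show that a character range written `A-z` matches `[`, `]` and `_`, so those characters are not removed.

```python
import re

def strip_html(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def remove_special_chars(text):
    """Lowercase, then drop everything except letters, digits and whitespace."""
    return re.sub(r"[^a-z0-9\s]", "", text.lower())

sample = "A wonderful production. <br /><br />The filming [sic]"
cleaned = remove_special_chars(strip_html(sample))
cleaned = " ".join(cleaned.split())  # collapse repeated whitespace
print(cleaned)
# prints: a wonderful production the filming sic

# The A-z range spans ASCII 65-122, which includes [ \ ] ^ _ and backtick.
leaky = re.sub(r"[^a-zA-z0-9\s]", "", "[sic]_")
print(leaky)
# prints: [sic]_
```

In the notebook's flow, the tag stripping would need to run before tokenization and stop-word removal so that the tags are still intact when the pattern is applied.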


In [12]:
#text stemming
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()  #create the stemmer once and reuse it for every review

def nltk_stemmer(text):
    #stem each whitespace-separated word and rejoin
    text= ' '.join([stemmer.stem(word) for word in text.split()])
    return text

data['review'] = data['review'].apply(nltk_stemmer)

data.head()

Unnamed: 0,review,sentiment
0,one review mention watch 1 oz episod hook righ...,positive
1,wonder littl product br br the film techniqu u...,positive
2,thought wonder way spend time hot summer weeke...,positive
3,basic famili littl boy jake think zombi closet...,negative
4,petter mattei love time money visual stun film...,positive


Bag of Words Model

The Bag of Words model converts text into a numerical representation so that it can be used to train models.
Each review becomes a vector over the vocabulary of word tokens found in the data; in the pipeline below, TfidfVectorizer weights those token counts with TF-IDF.
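
To make the idea concrete, here is a minimal sketch of a plain bag-of-words encoding using only the standard library. The three short documents are made-up examples in the style of the stemmed reviews, and `to_count_vector` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

# Made-up example documents (already lowercased and stemmed).
docs = [
    "wonder littl product",
    "wonder way spend time",
    "littl boy think zombi",
]

# Vocabulary: every distinct token across all documents, in sorted order.
vocab = sorted({token for doc in docs for token in doc.split()})

def to_count_vector(doc):
    """Count how often each vocabulary token occurs in doc."""
    counts = Counter(doc.split())
    return [counts[token] for token in vocab]

vectors = [to_count_vector(doc) for doc in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Each document becomes a fixed-length list of counts, one position per vocabulary token; word order is discarded, which is where the "bag" in the name comes from.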

In [35]:
#split into 80% train and 20% test set

train_data = data.sample(frac=0.8, random_state=25)
test_data = data.drop(train_data.index)

print(train_data.shape)
print(test_data.shape)

(40000, 2)
(10000, 2)


In [45]:
#split data into train and test sets
from sklearn.model_selection import train_test_split as tts

x = data.iloc[0:,0].values
y = data.iloc[0:,1].values

x_train, x_test, y_train, y_test = tts(x, y, test_size=0.2, random_state=5555)


In [59]:
#build model
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()

#importing Naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

classifier = MultinomialNB()

model=Pipeline([('vectorizer',tf),('classifier',classifier)])

model.fit(x_train, y_train)

Pipeline(steps=[('vectorizer', TfidfVectorizer()),
                ('classifier', MultinomialNB())])
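
For intuition about what the vectorizer step produces, the following is a rough standard-library sketch of TF-IDF weighting. It assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document vector. It is a simplification: it splits on whitespace, whereas `TfidfVectorizer` has its own tokenization rules. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, vectors) with smoothed, L2-normalized TF-IDF weights."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard empty docs
        vectors.append([v / norm for v in raw])
    return vocab, vectors

vocab, vectors = tfidf_vectors(["good film", "bad film"])
print(vocab)
print(vectors[0])
```

In the first document, "film" appears in both documents and so receives a lower weight than "good", which appears in only one.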

Model Testing and Results

In [60]:
#model prediction
y_pred = model.predict(x_test)

In [66]:
#model results

from sklearn.metrics import accuracy_score, confusion_matrix
#accuracy_score expects the true labels first, then the predictions
accuracy_score(y_test,y_pred)


0.8639

In [62]:
#confusion matrix

cmatrix =confusion_matrix(y_test,y_pred)
print(cmatrix)

[[4429  578]
 [ 783 4210]]
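
The accuracy, precision and recall figures can be recomputed by hand from this matrix. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, with string labels in sorted order, so index 0 is `negative` and index 1 is `positive`. The helper names are hypothetical.

```python
# Confusion matrix from the output above: rows = true class, columns = predicted.
cmatrix = [[4429, 578],
           [783, 4210]]
labels = ["negative", "positive"]  # sorted label order

total = sum(sum(row) for row in cmatrix)
accuracy = sum(cmatrix[i][i] for i in range(len(labels))) / total

def precision(i):
    """Correct predictions of class i divided by all predictions of class i."""
    predicted_i = sum(row[i] for row in cmatrix)
    return cmatrix[i][i] / predicted_i

def recall(i):
    """Correct predictions of class i divided by all true members of class i."""
    return cmatrix[i][i] / sum(cmatrix[i])

print(f"accuracy: {accuracy:.4f}")
for i, label in enumerate(labels):
    print(f"{label}: precision={precision(i):.2f} recall={recall(i):.2f}")
```

The row sums (5007 and 4993) are the number of true reviews in each class in the test set.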


In [65]:
#recall and precision

from sklearn.metrics import classification_report
#labels are sorted alphabetically, so 'negative' comes before 'positive'
nbmodel_results= classification_report(y_test,y_pred,target_names=['Negative','Positive'])
print(nbmodel_results)

              precision    recall  f1-score   support

    Negative       0.85      0.88      0.87      5007
    Positive       0.88      0.84      0.86      4993

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000



As a final note, an accuracy of about 86% leaves room for improvement. Some changes could be made to the data preprocessing; in particular, the stemming step could be improved.