Importing all the neccessary libraries and the the cleaned dataset

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("cleaned_imdb_dataset.csv")

Getting the first 35000 records to train the model because training the model with 50000 records is resource intensive

In [2]:
df_new = df.iloc[:35000]

In [3]:

df_new

Unnamed: 0.1,Unnamed: 0,review,sentiment
0,0,one reviewer mentioned watching 1 oz episode y...,1
1,1,wonderful little production br br filming tech...,1
2,2,thought wonderful way spend time hot summer we...,1
3,3,basically there family little boy jake think t...,0
4,4,petter matteis love time money visually stunni...,1
...,...,...,...
34995,34995,awful awful awful show real world issue dealt ...,0
34996,34996,like action movie softspot b flick bad dialogu...,0
34997,34997,begin nice note falter quickly let expectation...,0
34998,34998,aardman next pixar aardman animation prof anim...,1


Splitting the 25000 records into train set and test set, the train set is used to train the model and the test set is used to see the models accuracy and whether the model is underfitting or overfitting

In [4]:
X_train,X_test,y_train,y_test = train_test_split(df_new['review'],df_new['sentiment'],test_size=0.20)

Here we are using two methods to turn words into vectors , those are TFIDF and bag of words, each method have its strengths and weaknesses and the most most suitable method will be used based on model accuracy

In [5]:
bow = CountVectorizer()
X_train_bow = bow.fit_transform(X_train)
X_test_bow = bow.transform(X_test)

In [6]:
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

In this a naive bayes model will be trained and the accuracy of the model will be tested by giving the model to predict the output for the test data. Keep in mind that two naive bayes models are trained to see whether TFIDF or bow gives the best accuracy

In [7]:
nb_model_tfidf = MultinomialNB().fit(X_train_tfidf, y_train)
nb_model_bow   = MultinomialNB().fit(X_train_bow, y_train)


confusion_matrix,accuracy_score,classification_report are imported to see the performance of the models

In [8]:
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report

Predicting the output for the test data 

In [9]:
y_pred_bow = nb_model_bow.predict(X_test_bow)
y_pred_tfidf = nb_model_tfidf.predict(X_test_tfidf)

The prediction is then used to calculate the accuracy score of the bow vectorised model and tfidf vectorised model

In [10]:
print("BOW accuracy score:",accuracy_score(y_test,y_pred_bow))

BOW accuracy score: 0.8601428571428571


In [11]:
print("tfidf accuracy score:",accuracy_score(y_test,y_pred_tfidf))

tfidf accuracy score: 0.8684285714285714


Based on the above the TFIDF vectorised model has higher accuracy than the bow vectorised model therefore we can use the tfidf model

Experimenting with our ml model to see whether it's performing well with unknown data

In [12]:
text = ["Best movie ever"]
text_number = tfidf.transform(text)
output = nb_model_bow.predict(text_number)

In [13]:
print(output)

[1]


Training a logistic regression model to see whether it's better than the naive bayes model with both the tfidf and bow vectorised test data

In [14]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(solver='liblinear', random_state=42, max_iter=1000)
lg_tfidf = model.fit(X_train_tfidf,y_train)
lg_bow = model.fit(X_train_bow,y_train)

In [15]:
lg_pred_bow = lg_bow.predict(X_test_bow)
lg_pred_tfidf = lg_tfidf.predict(X_test_tfidf)

In [16]:
print("BOW accuracy score for logistic regression:",accuracy_score(y_test,lg_pred_bow))
print("tfidf accuracy score for logistic regression:",accuracy_score(y_test,lg_pred_tfidf))

BOW accuracy score for logistic regression: 0.8814285714285715
tfidf accuracy score for logistic regression: 0.87


Based on the above the bow version of the logistic regression model has more accuracy than the tfidf version of the naive bayes model therefore we'll use the logistic regression model for out reasearch

In [17]:
text = ["This movie is a piece of shit"]
text_number = bow.transform(text)
output = lg_bow.predict(text_number)
output

array([0])

In [18]:
import pickle

with open("sentiment_analysis_ml_model.pkl", "wb") as f:
    pickle.dump(lg_bow, f)

In [19]:
with open("BOW_vectoriser.pkl", "wb") as f:
    pickle.dump(bow, f)