**Sentiment Analysis of IMDB Movie Reviews**

**Problem Statement:**

The goal is to classify IMDB movie reviews as positive or negative based on their text, using different classification models.

In [2]:
!pip install nltk

Collecting nltk
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting click (from nltk)
  Using cached click-8.1.3-py3-none-any.whl (96 kB)
Installing collected packages: click, nltk
Successfully installed click-8.1.3 nltk-3.8.1


In [3]:
import nltk
nltk.download('wordnet2022') #wordnet2022
nltk.download('stopwords')

[nltk_data] Downloading package wordnet2022 to
[nltk_data]     C:\Users\97252\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet2022 is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\97252\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

**Import necessary libraries**

In [9]:
#Load the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from wordcloud import WordCloud,STOPWORDS
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize,sent_tokenize
from bs4 import BeautifulSoup
import spacy
import re,string,unicodedata
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.stem import LancasterStemmer,WordNetLemmatizer
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from textblob import TextBlob
from textblob import Word
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.model_selection import train_test_split

import os
#print(os.listdir("../input"))
import warnings
warnings.filterwarnings('ignore')


**Import the training dataset**

In [11]:
#importing the training data
imdb_data=pd.read_csv('C:/Users/97252/Desktop/pytorch-project/train.csv')
print(imdb_data.shape)
imdb_data.head(10)

(25000, 2)


Unnamed: 0,text,sentiment
0,For a movie that gets no respect there sure ar...,0
1,Bizarre horror movie filled with famous faces ...,0
2,"A solid, if unremarkable film. Matthau, as Ein...",0
3,It's a strange feeling to sit alone in a theat...,0
4,"You probably all already know this by now, but...",0
5,I saw the movie with two grown children. Altho...,0
6,You're using the IMDb. You've given some heft...,0
7,This was a good film with a powerful message o...,0
8,"Made after QUARTET was, TRIO continued the qua...",0
9,"For a mature man, to admit that he shed a tear...",0


**Exploratory data analysis**

In [12]:
#Summary of the dataset
imdb_data.describe()

Unnamed: 0,sentiment
count,25000.0
mean,0.5
std,0.50001
min,0.0
25%,0.0
50%,0.5
75%,1.0
max,1.0


**Sentiment count**

In [13]:
#sentiment count
imdb_data['sentiment'].value_counts()

sentiment
0    12500
1    12500
Name: count, dtype: int64

We can see that the dataset is balanced.

**Splitting the training dataset**

In [14]:
X = imdb_data['text']
y = imdb_data['sentiment']
train_reviews, test_reviews, train_sentiments, test_sentiments = train_test_split(
    X, y, random_state=104, test_size=0.2, shuffle=True)

print(train_reviews.shape,train_sentiments.shape)
print(test_reviews.shape,test_sentiments.shape)

(20000,) (20000,)
(5000,) (5000,)

**Text normalization**

In [15]:
#Tokenization of text
tokenizer=ToktokTokenizer()
#Setting English stopwords
stopword_list=nltk.corpus.stopwords.words('english')

**Removing html strips and noise text**

In [16]:
#Removing the html strips
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

#Removing the square brackets
def remove_between_square_brackets(text):
    #Raw string avoids invalid escape sequence warnings for \[ and \]
    return re.sub(r'\[[^]]*\]', '', text)

#Removing the noisy text
def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text
#Apply function on review column
imdb_data['text']=imdb_data['text'].apply(denoise_text)

**Removing special characters**

In [17]:
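#Define function for removing special characters
def remove_special_characters(text, remove_digits=False):
    #Keep letters and whitespace; digits are kept unless remove_digits is True
    pattern=r'[^a-zA-Z\s]' if remove_digits else r'[^a-zA-Z0-9\s]'
    text=re.sub(pattern,'',text)
    return text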
#Apply function on review column
imdb_data['text']=imdb_data['text'].apply(remove_special_characters)
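The exact character class matters in this step. The range `A-z` covers ASCII codes 65 to 122, which also includes the characters ``[ \ ] ^ _ ` ``, so a pattern written with `A-z` leaves them in the text. A small sketch on a made-up sample string (not taken from the dataset) comparing the two ranges:

```python
import re

def remove_special_characters(text):
    # Keep only ASCII letters, digits and whitespace.
    return re.sub(r'[^a-zA-Z0-9\s]', '', text)

sample = "Great_movie! 10/10 [spoiler] ^_^"

# The A-z range also matches [ \ ] ^ _ ` so these characters survive.
loose = re.sub(r'[^a-zA-z0-9\s]', '', sample)
# The A-Z range removes them as intended.
strict = remove_special_characters(sample)

print(repr(loose))
print(repr(strict))
```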

**Text stemming**

In [18]:
#Stemming the text
def simple_stemmer(text):
    #PorterStemmer is imported from nltk.stem.porter above
    ps=PorterStemmer()
    text= ' '.join([ps.stem(word) for word in text.split()])
    return text
#Apply function on review column
imdb_data['text']=imdb_data['text'].apply(simple_stemmer)

**Removing stopwords**

In [19]:
#set stopwords to english
stop=set(stopwords.words('english'))
print(stop)

#removing the stopwords
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    #Membership tests use the set `stop` defined above (faster than a list)
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stop]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stop]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
#Apply function on review column
imdb_data['text']=imdb_data['text'].apply(remove_stopwords)

{'other', 'hers', 'isn', 'their', 'which', 'doesn', "you'd", 'the', 'who', 'them', 'being', "it's", 'only', 'am', 'y', 'while', 'than', 'needn', 'ours', "doesn't", 'from', 'can', 'both', "didn't", 'now', "needn't", 'shouldn', "weren't", 'over', "hadn't", 'such', 'so', 'yourselves', 'then', 'should', 'd', 'there', 'its', 'where', 'further', 'it', 'he', 'you', 'will', "isn't", 'off', 'or', 'more', 'above', "mightn't", 'your', 'after', 'hasn', 'between', 'no', "don't", "should've", 'why', "wasn't", 'ourselves', 'an', "haven't", 'to', 'with', 'couldn', 'i', 'll', 'they', 'a', 'didn', 'once', 'does', 'any', 'had', 'below', 'has', 'and', 'o', "wouldn't", 'myself', 'themselves', 'up', 'under', 'out', 'me', 'yourself', 'what', 'be', 'been', 'each', "couldn't", 'my', "you're", "mustn't", 'hadn', 'that', "she's", 'are', 'doing', 'shan', "aren't", 'against', 'yours', 're', 'herself', 'did', 'on', 'mightn', 'these', "hasn't", 'by', 'him', 'do', 'because', 'very', "shouldn't", 'as', 'again', 'their
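The `is_lower_case` flag controls whether tokens are lowercased before the lookup. A minimal sketch of that behaviour, assuming a tiny hand-written stop list (`STOP` is hypothetical and much shorter than the nltk list) and a plain whitespace split instead of `ToktokTokenizer`:

```python
# Hypothetical mini stop list; the notebook uses nltk's English list instead.
STOP = {"the", "is", "a", "this", "of"}

def remove_stopwords(text, is_lower_case=False):
    tokens = text.split()
    if is_lower_case:
        # Caller promises the text is already lowercase: compare tokens as-is.
        kept = [t for t in tokens if t not in STOP]
    else:
        # Lowercase each token for the lookup, but keep its original casing.
        kept = [t for t in tokens if t.lower() not in STOP]
    return " ".join(kept)

sample = "This is A masterpiece of the genre"
default = remove_stopwords(sample)
case_sensitive = remove_stopwords(sample, is_lower_case=True)

print(default)
print(case_sensitive)
```

With the default setting the capitalized stopwords are removed as well. With `is_lower_case=True` on mixed-case text they slip through, so that flag is only safe on text that has been lowercased beforehand.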

**Normalized train reviews**

In [20]:
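#train_reviews was split off before the cleaning steps above were applied
#to imdb_data['text'], so it still holds the raw review text (see the output)
norm_train_reviews= train_reviews
norm_train_reviews[0]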

'For a movie that gets no respect there sure are a lot of memorable quotes listed for this gem. Imagine a movie where Joe Piscopo is actually funny! Maureen Stapleton is a scene stealer. The Moroni character is an absolute scream. Watch for Alan "The Skipper" Hale jr. as a police Sgt.'

**Normalized test reviews**

In [21]:
norm_test_reviews=test_reviews
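In this notebook the split is made before the cleaning functions are applied to `imdb_data['text']`, so `train_reviews` and `test_reviews` keep the raw text (the output of `norm_train_reviews[0]` above still contains punctuation and capital letters). A minimal sketch of one way to normalize the splits themselves, assuming a hypothetical `clean` helper that stands in for the cleaning functions defined earlier:

```python
import re
import pandas as pd

def clean(text):
    # Hypothetical stand-in for the notebook's cleaning functions.
    text = re.sub(r'<[^>]+>', ' ', text)          # drop HTML tags
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)    # drop punctuation
    return ' '.join(text.lower().split())         # lowercase, squeeze spaces

df = pd.DataFrame({'text': ['Great <br />film!', 'Dull plot...', 'Loved it.'],
                   'sentiment': [1, 0, 1]})

# Split first; list-based indexing returns copies, like train_test_split.
train = df['text'].iloc[[0, 1]]
test = df['text'].iloc[[2]]

# Cleaning the DataFrame column afterwards does not touch the earlier copies.
df['text'] = df['text'].apply(clean)
untouched = train.iloc[0]

# Cleaning each split explicitly gives normalized input for the vectorizers.
norm_train = train.apply(clean)
norm_test = test.apply(clean)

print(untouched)
print(norm_train.tolist(), norm_test.tolist())
```

An alternative is to run the cleaning cells on `imdb_data['text']` first and call `train_test_split` afterwards.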

**Bag of words model**

It converts text documents into numerical vectors of token counts (a bag of words).

In [22]:
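#Count vectorizer for bag of words
#Note: an integer max_df=1 keeps only n-grams that occur in at most one document;
#a float max_df=1.0 would keep every n-gram
cv=CountVectorizer(min_df=0,max_df=1,binary=False,ngram_range=(1,3))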
#transformed train reviews
cv_train_reviews=cv.fit_transform(norm_train_reviews)
#transformed test reviews
cv_test_reviews=cv.transform(norm_test_reviews)

print('BOW_cv_train:',cv_train_reviews.shape)
print('BOW_cv_test:',cv_test_reviews.shape)
#vocab=cv.get_feature_names_out() #to get the feature names

BOW_cv_train: (20000, 3578773)
BOW_cv_test: (5000, 3578773)
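In scikit-learn, `max_df` is read as an absolute document count when it is an integer and as a proportion of documents when it is a float. A small sketch on a made-up three-document corpus showing the difference between `max_df=1` and `max_df=1.0`:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good movie", "bad movie", "good plot"]

# Integer max_df: terms found in more than 1 document are dropped.
cv_int = CountVectorizer(max_df=1)
cv_int.fit(docs)

# Float max_df: a proportion of documents, so 1.0 keeps every term.
cv_float = CountVectorizer(max_df=1.0)
cv_float.fit(docs)

rare_only = sorted(cv_int.vocabulary_)
all_terms = sorted(cv_float.vocabulary_)

print(rare_only)
print(all_terms)
```

With the integer setting, the words shared between reviews ("good", "movie") disappear from the vocabulary. This may be one reason the accuracy reported later in the notebook is modest, although that is not verified here.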


**Term Frequency-Inverse Document Frequency model (TFIDF)**

It converts text documents into a matrix of TF-IDF features.

In [23]:
#Tfidf vectorizer
#Note: an integer max_df=1 keeps only n-grams that occur in at most one document
tv=TfidfVectorizer(min_df=0,max_df=1,use_idf=True,ngram_range=(1,3))
#transformed train reviews
tv_train_reviews=tv.fit_transform(norm_train_reviews)
#transformed test reviews
tv_test_reviews=tv.transform(norm_test_reviews)
print('Tfidf_train:',tv_train_reviews.shape)
print('Tfidf_test:',tv_test_reviews.shape)

Tfidf_train: (20000, 3578773)
Tfidf_test: (5000, 3578773)
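With its default settings (`smooth_idf=True`), scikit-learn computes the inverse document frequency of a term as ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term. A short sketch on a made-up corpus comparing that formula with the values learned by `TfidfVectorizer` (the document frequencies in `doc_freq` are counted by hand for this toy corpus):

```python
import math
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["good movie", "bad movie", "good plot"]

tv = TfidfVectorizer(use_idf=True)  # smooth_idf=True and norm='l2' by default
tv.fit(docs)

n_docs = len(docs)
# Number of documents that contain each term, counted by hand.
doc_freq = {"bad": 1, "good": 2, "movie": 2, "plot": 1}

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
manual_idf = {t: math.log((1 + n_docs) / (1 + df)) + 1
              for t, df in doc_freq.items()}
# idf_ is indexed by the column assigned to each term in vocabulary_.
learned_idf = {t: float(tv.idf_[i]) for t, i in tv.vocabulary_.items()}

for term in sorted(manual_idf):
    print(term, round(manual_idf[term], 4), round(learned_idf[term], 4))
```

Rare terms ("bad", "plot") receive a larger weight than terms that appear in several documents.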


**Labeling the sentiment data**

In [24]:
#labeling the sentiment data
lb=LabelBinarizer()
#transformed sentiment data
sentiment_data=lb.fit_transform(imdb_data['sentiment'])
print(sentiment_data.shape)

(25000, 1)


**Split the sentiment data**

In [25]:
#The sentiment labels were already split by train_test_split, so they are only printed here
print(train_sentiments)
print(test_sentiments)

22081    1
12045    0
8173     0
18290    1
21564    1
        ..
21631    1
6310     0
17113    1
22209    1
8261     0
Name: sentiment, Length: 20000, dtype: int64
598      0
10039    0
21590    1
13701    1
1823     0
        ..
13401    1
13771    1
10164    0
22505    1
4305     0
Name: sentiment, Length: 5000, dtype: int64


**Modelling the dataset**

Let us build a logistic regression model for both the bag of words and TF-IDF features.

In [26]:
#training the model
lr=LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)
#Fitting the model for Bag of words
lr_bow=lr.fit(cv_train_reviews,train_sentiments)
print(lr_bow)
#Fitting the model for tfidf features
#Note: fit() returns the estimator itself, so lr_bow and lr_tfidf are the same object
#and this second call overwrites the bag of words fit
lr_tfidf=lr.fit(tv_train_reviews,train_sentiments)
print(lr_tfidf)

LogisticRegression(C=1, max_iter=500, random_state=42)
LogisticRegression(C=1, max_iter=500, random_state=42)
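Because `fit()` returns the estimator itself, fitting one `LogisticRegression` object twice leaves a single model trained on whatever was passed last. A minimal sketch, assuming a toy four-review corpus (not the IMDB data), of keeping one estimator per feature set:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["good film", "great film", "bad film", "awful plot"]
labels = [1, 1, 0, 0]

X_bow = CountVectorizer().fit_transform(docs)
# Bigrams are added here only so that the two feature spaces differ in size.
X_tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)

# One shared estimator: the second fit replaces the first.
shared = LogisticRegression(max_iter=500)
first = shared.fit(X_bow, labels)
second = shared.fit(X_tfidf, labels)
same_object = first is second

# One estimator per feature set keeps both fits available.
lr_bow = LogisticRegression(max_iter=500).fit(X_bow, labels)
lr_tfidf = LogisticRegression(max_iter=500).fit(X_tfidf, labels)

print(same_object)
print(lr_bow.coef_.shape, lr_tfidf.coef_.shape)
```

In the notebook both vectorizers produce the same number of columns (3578773), which is why predicting on bag of words features with a model fitted on TF-IDF features does not raise an error.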


**Logistic regression model performance on test dataset**

In [27]:
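#Predicting sentiments from the bag of words features
#Note: lr was last fitted on the tfidf features, so both predictions come from that fit
lr_bow_predict=lr.predict(cv_test_reviews)
print(lr_bow_predict)
#Predicting sentiments from the tfidf features
lr_tfidf_predict=lr.predict(tv_test_reviews)
print(lr_tfidf_predict)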

[0 0 1 ... 1 0 0]
[0 0 1 ... 1 0 0]


**Accuracy of the model**

In [28]:
#Accuracy score for bag of words
lr_bow_score=accuracy_score(test_sentiments,lr_bow_predict)
print("lr_bow_score :",lr_bow_score)
#Accuracy score for tfidf features
lr_tfidf_score=accuracy_score(test_sentiments,lr_tfidf_predict)
print("lr_tfidf_score :",lr_tfidf_score)

lr_bow_score : 0.775
lr_tfidf_score : 0.7676


**Print the classification report**

In [29]:
#Classification report for bag of words 
lr_bow_report=classification_report(test_sentiments,lr_bow_predict,target_names=['Positive','Negative'])
print(lr_bow_report)

#Classification report for tfidf features
lr_tfidf_report=classification_report(test_sentiments,lr_tfidf_predict,target_names=['Positive','Negative'])
print(lr_tfidf_report)

              precision    recall  f1-score   support

    Positive       0.78      0.77      0.78      2533
    Negative       0.77      0.78      0.77      2467

    accuracy                           0.78      5000
   macro avg       0.78      0.78      0.77      5000
weighted avg       0.78      0.78      0.78      5000

              precision    recall  f1-score   support

    Positive       0.80      0.72      0.76      2533
    Negative       0.74      0.82      0.78      2467

    accuracy                           0.77      5000
   macro avg       0.77      0.77      0.77      5000
weighted avg       0.77      0.77      0.77      5000



**Confusion matrix**

In [30]:
#confusion matrix for bag of words
cm_bow=confusion_matrix(test_sentiments,lr_bow_predict,labels=[1,0])
print(cm_bow)
#confusion matrix for tfidf features
cm_tfidf=confusion_matrix(test_sentiments,lr_tfidf_predict,labels=[1,0])
print(cm_tfidf)

[[1935  532]
 [ 593 1940]]
[[2026  441]
 [ 721 1812]]
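The accuracy and the scores for label 1 can be recomputed by hand from the confusion matrices above. With `labels=[1,0]`, rows are the true labels and columns the predicted labels, both in the order 1, 0. A short sketch of the arithmetic (`summarize` is a helper introduced here for illustration):

```python
def summarize(cm):
    # Rows: true labels, columns: predictions, both ordered as labels=[1, 0].
    (tp, fn), (fp, tn) = cm
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,     # correct predictions / all samples
        "precision_1": tp / (tp + fp),     # of predicted 1, how many are 1
        "recall_1": tp / (tp + fn),        # of true 1, how many were found
    }

bow = summarize([[1935, 532], [593, 1940]])
tfidf = summarize([[2026, 441], [721, 1812]])

print(bow)
print(tfidf)
```

The accuracies match the scores printed earlier (0.775 and 0.7676), and the rounded precision and recall for label 1 match the second row of each classification report.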


In [31]:
print('Enter q to stop')
while True:
    user_input = input("Enter sentence to be evaluated:")
    if user_input == 'q':
        break
    else:
        #predict() needs a 2D feature matrix; a raw string raises the ValueError below
        prediction = lr.predict(user_input)
        print(prediction)




Enter q to stop


ValueError: Expected 2D array, got scalar array instead:
array=this is a great movie.
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
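The ValueError above arises because `predict()` receives a raw string instead of a feature matrix. The sentence first has to pass through the same fitted vectorizer that produced the training features. A minimal sketch, assuming a toy corpus and a hypothetical helper `predict_sentiment` (the label meaning 1 = positive is also an assumption of this toy example):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_docs = ["good film", "great film", "bad film", "awful plot"]
labels = [1, 1, 0, 0]  # toy labels: 1 = positive, 0 = negative

vectorizer = TfidfVectorizer()
model = LogisticRegression(max_iter=500)
model.fit(vectorizer.fit_transform(train_docs), labels)

def predict_sentiment(sentence):
    # transform() expects an iterable of documents, so wrap the string in a
    # list; the result is a 1 x n_features sparse matrix that predict() accepts.
    features = vectorizer.transform([sentence])
    return int(model.predict(features)[0])

result = predict_sentiment("this is a great movie")
print(result)
```

In the notebook, the equivalent call would use the fitted `tv` (or `cv`) together with the model trained on the matching features. Any text cleaning applied to the training reviews should be applied to the input sentence as well.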

In [35]:
cv_test_reviews[0][0]

<1x3578773 sparse matrix of type '<class 'numpy.int64'>'
	with 14 stored elements in Compressed Sparse Row format>

**Conclusion:**
* We can observe that both the logistic regression and multinomial naive Bayes models perform well compared to linear support vector machines.
* We can still improve the accuracy of the models by preprocessing the data and by using lexicon models like TextBlob.