Project Title: Sentiment Analysis of IMDB Movie Review Dataset

The main aim of the project is to analyze the movie review data set to understand the sentiment or emotion expressed in each review, and to build a classification model that can accurately classify the sentiment of a given text into a predefined category such as positive or negative.
Sentiment analysis can be used to solve various business problems such as sentiment classification, understanding customer feedback, monitoring social media, researching market trends, and financial, political and social analysis.

The main tasks to be performed in this project are:
1. Analyze the data to understand what it represents and how it can be transformed to meet the end goals of the project.
2. Preprocess the data by removing irrelevant and special characters using a combination of regex and NLTK.
3. Tokenize the text, remove stop words, and lemmatize to reduce words to their base forms.
4. Use TF-IDF to convert the tokenized text into numerical features that the machine learning model can understand.
5. Select, train, and evaluate models.
6. Deploy the model.
    

In [1]:
#Importing necessary libraries
import pandas as pd
import string
import seaborn as sns
import plotly as plt
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from collections import Counter
import warnings
warnings.filterwarnings("ignore")



In [2]:
df = pd.read_csv("IMDB Dataset.csv")

In [3]:
print(f"Total number of elements in the dataframe: {df.size}")

Total number of elements in the dataframe: 100000


In [4]:
print(f"The shape of the dataframe: {df.shape}")

The shape of the dataframe: (50000, 2)


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [6]:
df.isna().sum()

review       0
sentiment    0
dtype: int64

In [7]:
df["sentiment"].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [8]:
df.sample(50)

Unnamed: 0,review,sentiment
18809,<br /><br />There is STAR TREK canon -- lots o...,positive
16762,Scott's collection of 80's icons cannot save t...,negative
15413,Greetings again from the darkness. Director Al...,positive
15146,The story of Ned Kelly has been enshrouded in ...,positive
3612,I normally do not take the time to make commen...,negative
48278,"Elvis Presley plays a ""half-breed"" Native Amer...",negative
37752,The sounds in the movie were so mundane and ri...,negative
47311,The Patriot (nothing to do with the Mel Gibson...,negative
18070,"To a certain extent, I actually liked this fil...",positive
32044,Uggh! I really wasn't that impressed by this f...,positive


The initial analysis of the data reveals the following details:

1. The data consists of 50k movie reviews and their respective sentiment labels.
2. The data is balanced, with 25k reviews for each of the two sentiments.
3. There is no missing data in the data set.
4. The data contains some noise, such as HTML line-break tags (<br />), which needs to be cleaned.

Overall, the data looks good enough for further analysis.

Cleaning and Structuring of the data.

In [9]:
#converting the sentiments and reviews to lower case
df["sentiment"] = df["sentiment"].str.lower()

In [10]:
df["review"] = df["review"].apply(lambda x: x.lower())

In [11]:
#checking a sample of the data to make sure the text is converted into lower case
df["review"].sample(50)

49915    what an utter disappointment. forget this abys...
32925    i think that saying this film has too many is ...
21658    the movie is not as funny as the director's pr...
15397    even if you're a huge sandler fan, please don'...
33537    the brain (or head) that wouldn't die is one o...
43346    cb4 was awful, but it may have given cundieff ...
10746    this film features two of my favorite guilty p...
13409    another movie from swedish hillbilly country, ...
13422    ms aparna sen, the maker of mr & mrs iyer, dir...
9074     stanley stupid (tom arnold) and his wife, joan...
25206    steve carell once again stars in a light roman...
34683    i happened upon this flick on a rainy sunday, ...
373      owen loves his mamma...only he'd love her bett...
6858     i somehow missed this movie when it came out a...
25614    you know you're in for something different whe...
44317    i cant describe how terrible this movie is. am...
31831    i originally reviewed this film on amazon abou.

In [12]:
#using regex to remove all characters that are not letters, digits or whitespace
import re
pattern = r'[^a-zA-Z0-9\s]'
df["review"] = df["review"].apply(lambda x: re.sub(pattern, "", x))

In [13]:
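#removing punctuation using regex (a no-op at this point, since the previous pattern has already removed every non-alphanumeric character)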
punctuation_pattern = r'[^\w\s]'
df["review"] = df["review"].apply(lambda x: re.sub(punctuation_pattern, "",x))

In [14]:
#removing numbers
number_pattern =  r'\d+'
df["review"] = df["review"].apply(lambda x: re.sub(number_pattern, "", x))

In [15]:
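#remove html tags (note: this matches nothing here, because "<" and ">" were already stripped above, so tag names such as "br" remain in the text)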
html_pattern = r'<[^>]+>'
df["review"] = df["review"].apply(lambda x: re.sub(html_pattern, "",x))

In [16]:
df["review"].sample(20)

40339    words fail me this film was extremely difficul...
4720     a message movie but a rather good one outstand...
5936     this may be the worst show ive ever seen aside...
29315    the acting in this movie was superb as an amat...
21695    not as well known as the english american germ...
42262    ill start with what i likedbr br i really like...
24236    welcome to collinwood is one of the most delig...
18596    this is one for the golden turkey book its ano...
21749    antitrust could have been a great vehicle for ...
14783    one of the best movies for all ages you will n...
21272    awful film terrible acting cheesy totally unre...
3988     dude really where have you guys been the past ...
29524    also titled the magical castle this one is a s...
44235    when alfred hitchcock made strangers on a trai...
48708    this is the essence of the early eighties the ...
13846    near the closing stages of baby mama one of th...
18236    i wont say the show is all bad because there a.
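
The sample above still contains stray "br" tokens (for example "likedbr br"), because the angle brackets were stripped before the HTML pattern ran. A minimal sketch of one possible reordering, assuming tags are replaced by a space before any other characters are removed. The helper name clean_review and the compiled pattern names are hypothetical and not part of the notebook:

```python
import re

HTML_TAG = re.compile(r"<[^>]+>")      # same tag pattern as in the notebook
NON_LETTER = re.compile(r"[^a-z\s]")   # anything except lowercase letters and whitespace
WHITESPACE = re.compile(r"\s+")

def clean_review(text):
    """Lowercase, drop HTML tags first, then drop non-letters (hypothetical helper)."""
    text = text.lower()
    text = HTML_TAG.sub(" ", text)     # tags become a space so neighbouring words stay apart
    text = NON_LETTER.sub("", text)    # removes punctuation and digits in one pass
    return WHITESPACE.sub(" ", text).strip()

cleaned_example = clean_review("I'll start with what I liked.<br /><br />I really like it 100%!")
print(cleaned_example)  # ill start with what i liked i really like it
```

Applying such a function with df["review"].apply(clean_review) in place of the separate regex cells would change the downstream word counts, so the outputs recorded in this notebook would not be reproduced exactly.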

In [17]:
#removing stopwords using nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\isunn\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [18]:
def remove_stopwords(text):
    stop_words = set(stopwords.words("english"))
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

In [19]:
df["review"] = df["review"].apply(remove_stopwords)
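
NLTK's English stop-word list includes negations such as "not" and "no", which can carry sentiment. The following is a sketch of a variant that keeps a chosen set of words. The small STOP_WORDS set is only a stand-in for set(stopwords.words("english")), and the KEEP set and the function name are assumptions made for illustration:

```python
# Stand-in for set(stopwords.words("english")); only a few entries are listed here
STOP_WORDS = {"the", "is", "was", "this", "a", "not", "no"}
# Hypothetical choice of words to keep because they can flip the sentiment
KEEP = {"not", "no"}

def remove_stopwords_keep_negations(text, stop_words=STOP_WORDS, keep=KEEP):
    """Drop stop words except those listed in keep."""
    effective = stop_words - keep
    return " ".join(w for w in text.split() if w.lower() not in effective)

filtered_example = remove_stopwords_keep_negations("this movie was not good")
print(filtered_example)  # movie not good
```

Whether keeping negations improves accuracy on this data set is not tested here.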

In [20]:
df["review"].sample(50)

21099    never heard film prior coming across perusing ...
48178    first positives excellent job depicting urban ...
12671    generally loved carry movies one actually pret...
30265    another bad spanish picture baaaad save photog...
38657    br br excellent think back long time find film...
11577    third collaboration karloff lugosi sees move a...
1658     yes know movie meant comedy humor top theme pe...
14373    awful lot wrong picture beginning script obvio...
24370    consuming human pork chop properly digesting f...
4814     decided watch movie noted scariest movie ever ...
18220    dialogue pretty dreadful plot really inspired ...
10843    normally try avoid scifi movies much isnt genr...
25018    really dont understand movie aimed absurdity m...
24320    soiler fake whole thing fake ghosts zombies al...
13874    friend mine bought cheaply decided give birthd...
1569     okay really tried tap called silly surreal hum...
25475    curiosity patience finally see controversial f.

In [21]:
#df2 = df.copy()

In [22]:
#tokenizing the text 
df["review"] = df["review"].apply(word_tokenize)

In [23]:
df["review"].head()

0    [one, reviewers, mentioned, watching, oz, epis...
1    [wonderful, little, production, br, br, filmin...
2    [thought, wonderful, way, spend, time, hot, su...
3    [basically, theres, family, little, boy, jake,...
4    [petter, matteis, love, time, money, visually,...
Name: review, dtype: object

In [24]:
#function to count most freq and least freq words
def word_counter(df, column_name,most_common_count=20,least_common_count=20):
    all_words = [word for sublist in df[column_name] for word in sublist]
    word_frequency = Counter(all_words)
    most_common_words = word_frequency.most_common(most_common_count)
    least_common_words = word_frequency.most_common()[-least_common_count:][::-1]
    return most_common_words,least_common_words

In [25]:
most_common,least_common = word_counter(df, "review")


In [26]:
print(f"The 20 most common words are: {most_common}")


The 20 most common words are: [('br', 114890), ('movie', 83523), ('film', 74459), ('one', 51028), ('like', 38992), ('good', 28570), ('even', 24576), ('would', 24024), ('time', 23269), ('really', 22951), ('see', 22535), ('story', 22097), ('much', 18947), ('well', 18798), ('get', 18205), ('great', 17821), ('also', 17818), ('bad', 17719), ('people', 17538), ('first', 17155)]


In [27]:
print(f"The 20 least common words are: {least_common}")

The 20 least common words are: [('yosemitebr', 1), ('studentsthe', 1), ('horriblecatwoman', 1), ('clatter', 1), ('frenchonly', 1), ('philandererbr', 1), ('effortful', 1), ('ohsohard', 1), ('ashknenazi', 1), ('jossi', 1), ('wasamwill', 1), ('emiles', 1), ('burtolucci', 1), ('censorial', 1), ('angelyne', 1), ('nolin', 1), ('subjectivebut', 1), ('satireis', 1), ('dkman', 1), ('mottos', 1)]


In [28]:
from nltk.stem import WordNetLemmatizer

In [29]:
lemmatizer = WordNetLemmatizer()

In [30]:
def lemmatize_tokens(tokens):
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmatized_tokens

In [31]:
df["review"] = df["review"].apply(lemmatize_tokens)

In [32]:
df["review"].head()

0    [one, reviewer, mentioned, watching, oz, episo...
1    [wonderful, little, production, br, br, filmin...
2    [thought, wonderful, way, spend, time, hot, su...
3    [basically, there, family, little, boy, jake, ...
4    [petter, matteis, love, time, money, visually,...
Name: review, dtype: object

In [33]:
#removing all occurrences of the word "br"
def remove_br(tokens):
    return [token for token in tokens if token!= "br"]

In [34]:
df["review"] = df["review"].apply(remove_br)

In [35]:
df["review"].head()

0    [one, reviewer, mentioned, watching, oz, episo...
1    [wonderful, little, production, filming, techn...
2    [thought, wonderful, way, spend, time, hot, su...
3    [basically, there, family, little, boy, jake, ...
4    [petter, matteis, love, time, money, visually,...
Name: review, dtype: object

In [36]:
df.head()

Unnamed: 0,review,sentiment
0,"[one, reviewer, mentioned, watching, oz, episo...",positive
1,"[wonderful, little, production, filming, techn...",positive
2,"[thought, wonderful, way, spend, time, hot, su...",positive
3,"[basically, there, family, little, boy, jake, ...",negative
4,"[petter, matteis, love, time, money, visually,...",positive


In [37]:
df["review"].head()

0    [one, reviewer, mentioned, watching, oz, episo...
1    [wonderful, little, production, filming, techn...
2    [thought, wonderful, way, spend, time, hot, su...
3    [basically, there, family, little, boy, jake, ...
4    [petter, matteis, love, time, money, visually,...
Name: review, dtype: object

In [38]:
df['review'] = df['review'].apply(lambda word_list: ' '.join(word_list))

The data is now preprocessed and ready to be converted into TF-IDF features and fed into the machine learning model. The initial model is a Naive Bayes model.

In [39]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

In [40]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=42)

# Creating a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Creating the TF-IDF representation of the training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Creating the TF-IDF representation of the testing data 
X_test_tfidf = tfidf_vectorizer.transform(X_test)
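
For intuition about the weights produced above, here is a small pure-Python sketch of TF-IDF using smoothed IDF, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which matches the documented default behaviour of TfidfVectorizer. It is a simplification: tokens are split on whitespace, whereas the vectorizer applies its own token pattern, and the function name tfidf_sketch is hypothetical.

```python
import math
from collections import Counter

def tfidf_sketch(docs):
    """Return one {term: weight} dict per document (illustrative only)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

toy_vectors = tfidf_sketch(["great movie great cast", "bad movie"])
# "movie" occurs in both documents, so it gets a lower weight than "bad"
print(toy_vectors[1])
```

Fitting on the training split only, as the cell above does, keeps the document frequencies free of information from the test reviews.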


In [41]:
# Training the Naive Bayes classifier model on the vectorized training data
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(X_train_tfidf, y_train)

# Making predictions on the testing set
y_pred = naive_bayes_classifier.predict(X_test_tfidf)

# Evaluating the model performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Print classification report
print(classification_report(y_test, y_pred))




Accuracy: 0.87
              precision    recall  f1-score   support

    negative       0.85      0.88      0.87      4961
    positive       0.88      0.85      0.87      5039

    accuracy                           0.87     10000
   macro avg       0.87      0.87      0.87     10000
weighted avg       0.87      0.87      0.87     10000



The initial Naive Bayes model performed well, with an accuracy of 87% on the test set. Precision and recall are balanced across the two classes, ranging from 0.85 to 0.88.
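
As a reminder of how the per-class figures in the report are defined, the sketch below computes precision, recall and F1 for one label from raw predictions. The toy labels are made up for illustration and are not taken from the data set, and per_class_metrics is a hypothetical helper name:

```python
def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

toy_true = ["positive", "positive", "negative", "negative"]
toy_pred = ["positive", "negative", "negative", "negative"]
toy_precision, toy_recall, toy_f1 = per_class_metrics(toy_true, toy_pred, "positive")
print(toy_precision, toy_recall, round(toy_f1, 2))  # 1.0 0.5 0.67
```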

In [42]:
# Testing the model on new data
new_review = "This movie is bad! I hated it."

# Convert the new review to its TF-IDF representation
# (note: the cleaning steps applied to the training reviews are not repeated here)
new_review_tfidf = tfidf_vectorizer.transform([new_review])

# Predict the sentiment of the new review
predicted_sentiment = naive_bayes_classifier.predict(new_review_tfidf)

print(f"Predicted Sentiment: {predicted_sentiment[0]}")

Predicted Sentiment: negative
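
The vectorizer was fitted on cleaned, stop-word-free, lemmatized text, while the new review above is passed in raw. A minimal sketch of a helper that applies comparable steps before vectorizing. The function name, the tiny stop-word set and the identity lemmatize default are stand-ins for illustration; in the notebook they would be the NLTK stop-word list and lemmatizer.lemmatize:

```python
import re

def preprocess_for_inference(text, stop_words, lemmatize=lambda token: token):
    """Apply cleaning comparable to the training pipeline (hypothetical helper)."""
    text = re.sub(r"<[^>]+>", " ", text.lower())   # drop HTML tags
    text = re.sub(r"[^a-z\s]", "", text)           # drop punctuation and digits
    tokens = [lemmatize(t) for t in text.split() if t not in stop_words]
    return " ".join(tokens)

demo_stop_words = {"this", "is", "i", "it"}        # stand-in for the NLTK list
prepared_review = preprocess_for_inference("This movie is bad! I hated it.", demo_stop_words)
print(prepared_review)  # movie bad hated
```

The prepared string could then be passed as tfidf_vectorizer.transform([prepared_review]) so that new text goes through the same kind of cleaning as the training reviews.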


In [43]:
#Building an SVM model
from sklearn.svm import SVC
svm_model = SVC(kernel="linear")
svm_model.fit(X_train_tfidf, y_train)

SVC(kernel='linear')

In [44]:
y_pred = svm_model.predict(X_test_tfidf)

In [45]:
#Evaluating the model performance
accuracy = accuracy_score(y_test, y_pred)
# stored under a separate name so the imported classification_report function is not overwritten
svm_report = classification_report(y_test, y_pred)

print(f"Accuracy of the SVM: {accuracy}")
print(f"Classification report of the SVM:\n {svm_report}")

Accuracy of the SVM: 0.8967
Classification report of the SVM:
               precision    recall  f1-score   support

    negative       0.91      0.88      0.89      4961
    positive       0.89      0.91      0.90      5039

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000

