Project Title: Sentiment Analysis of IMDB Movie Review Dataset

The main aim of the project is to analyze the movie review data set to understand the sentiment or emotion expressed in each review, and to build a classification model that can accurately classify the sentiment of a given text into a predefined category such as positive or negative.
Sentiment analysis can be applied to various business problems, such as sentiment classification, understanding customer feedback, monitoring social media, researching market trends, and financial, political, and social analysis.

The main tasks to be performed in this project are:
1. Analyze the data to understand what it represents and how it can be transformed to meet the end goals of the project.
2. Preprocess the data by removing irrelevant and special characters using a combination of regex and NLTK.
3. Tokenize the text, remove stop words, and lemmatize the tokens to reduce words to their base forms.
4. Use TF-IDF to convert the tokenized text into numerical features that the machine learning model can understand.
5. Select, train, and evaluate the model.
6. Deploy the model.
    

In [1]:
#Importing necessary libraries
import pandas as pd
import string
import seaborn as sns
import plotly as plt
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
from collections import Counter
import warnings
warnings.filterwarnings("ignore")



In [2]:
df = pd.read_csv("IMDB Dataset.csv")

In [3]:
print(f"Total number of elements in the dataframe: {df.size}")

Total number of elements in the dataframe: 100000


In [4]:
print(f"The shape of the dataframe: {df.shape}")

The shape of the dataframe: (50000, 2)


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [6]:
df.isna().sum()

review       0
sentiment    0
dtype: int64

In [7]:
df["sentiment"].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [8]:
df.sample(50)

Unnamed: 0,review,sentiment
17606,Anyone who has said that it's better than Host...,negative
26427,I saw this film at the 3rd Adelaide Internatio...,positive
47633,Although Kurt Russell was and is probably the ...,positive
34234,"When it comes to the erotic genre, I'm lucky t...",positive
44919,This movie was so poorly acted. What was with ...,negative
25838,Tarzan and Jane are living happily in the jung...,positive
8025,Who won the best actress Oscar for 1933? It sh...,positive
31891,"After watching Awake,I led to a conclusion:dir...",negative
24572,Are we really making 'video nasties' again? In...,positive
46661,Blazing saddles! It's a fight between two estr...,negative


The initial analysis of the data reveals the following details:

1. The data consists of 50k movie reviews and their respective sentiments.
2. The data is balanced, with 25k reviews for each of the two sentiments.
3. There is no missing data in the data set.
4. The data contains some noise, such as HTML tags like <br />, which needs to be cleaned.

Overall, the data looks good enough for further analysis.

Cleaning and Structuring of the data.

In [9]:
#converting the sentiments and reviews to lower case
df["sentiment"] = df["sentiment"].str.lower()

In [10]:
df["review"] = df["review"].apply(lambda x: x.lower())

In [11]:
#checking a sample of the data to make sure the text is converted into lower case
df["review"].sample(50)

3777     i went to see this movie with my 17 y.o. daugh...
40187    don't bother. a little prosciutto could go a l...
19032    the story itself is routine: a boy runs away f...
23249    i am sligthly biased because i appear in this ...
27323    i cannot comprehend how this picture was allow...
39697    fully deserving its prestigious hollywood awar...
27827    i saw this film at the ny gay & lesbian film f...
46147    donald sutherland, an american paleontologist ...
24571    being s club seven, the film already boosts an...
38116    carson daly has to be the only late night talk...
23020    i originally caught this back in 1996 in its o...
23722    this was a fantastic episode. i saw a clip fro...
7488     i put this second version of "the man who knew...
6403     **spoilers** with the title of the film having...
7287     nice way to relax. i am packing my suitcase no...
35134    your average garden variety psychotic nutcase ...
1163     well, where do i start...<br /><br />as one of.

In [12]:
#using regex to remove all characters that are not letters, digits, or whitespace
import re
pattern = r'[^a-zA-Z0-9\s]'
df["review"] = df["review"].apply(lambda x: re.sub(pattern, "", x))

In [13]:
#removing all punctuation using regex (In [12] has already stripped every non-alphanumeric character, so this is a safety net)
punctuation_pattern = r'[^\w\s]'
df["review"] = df["review"].apply(lambda x: re.sub(punctuation_pattern, "",x))

In [14]:
#removing numbers
number_pattern =  r'\d+'
df["review"] = df["review"].apply(lambda x: re.sub(number_pattern, "", x))

In [15]:
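#remove html tags (the angle brackets were already stripped in In [12], so this pattern no longer matches and leftover tokens such as "br" remain)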
html_pattern = r'<[^>]+>'
df["review"] = df["review"].apply(lambda x: re.sub(html_pattern, "",x))
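The order of the cleaning steps matters: once the angle brackets are gone, a tag pattern has nothing left to match. The following is a minimal sketch, not the notebook's own code, of one way to order the same patterns so that tags are removed first. The function name `clean_review` is hypothetical, and replacing each tag with a space (rather than an empty string) is an assumption made here so that the words on either side of a tag are not fused together.

```python
import re

# The same three patterns used in the cells above, compiled once.
HTML_TAG = re.compile(r"<[^>]+>")
NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]")
DIGITS = re.compile(r"\d+")

def clean_review(text):
    """Lower-case a review, then strip HTML tags, punctuation, and digits."""
    text = text.lower()
    text = HTML_TAG.sub(" ", text)   # tags first, while "<" and ">" still exist
    text = NON_ALNUM.sub("", text)   # then everything except letters, digits, whitespace
    text = DIGITS.sub("", text)      # then the digits
    return " ".join(text.split())    # collapse repeated whitespace

print(clean_review("Well, where do I start...<br /><br />As one of 10 fans!"))
# prints: well where do i start as one of fans
```

With this ordering no "br" tokens are produced, so a later step that filters them out would not be needed.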

In [16]:
df["review"].sample(20)

14113    is it a coincidence that orca was made two yea...
45284    sunday would not be sunday without an action m...
34590    an elite american military team which of cours...
47408    to say this film is simply a demonisation of c...
22276    devil hunter gained notoriety for the fact tha...
43972    film is designed to affect the audience and th...
7024     star rating  the works  just misses the mark  ...
9091     this film is really only bill mahers interpret...
17603    the kissing bandit was the third and final fil...
18812    if archie bunker was armed he may well have be...
14525    this movie was not only disappointing to the h...
834      i do regret that i have bought this series i e...
26866    i like the show but comeon writers get some ac...
49850    this show is painful to watch br br it is obvi...
16872    oh my god what the hell happened here im not g...
40888    unwatchable you cant even make it past the fir...
2727     honestly i was expecting to hate this one and .

In [17]:
#removing stopwords using nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\isunn\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [18]:
#building the stop word set once instead of on every call
stop_words = set(stopwords.words("english"))

def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word.lower() not in stop_words]
    return " ".join(filtered_words)

In [19]:
df["review"] = df["review"].apply(remove_stopwords)

In [20]:
df["review"].sample(50)

6915     huge amy adams fan many years also big fan mus...
2861     china syndrome could released better time twel...
28001    movie made years end civil warmost likely anti...
40052    much love ellen barkin really underrated boo m...
43494    cuba gooding jr secret service agent blames as...
23752    anthony wong plays loka husband whose wife ser...
26650    shame see interesting story diluted standard v...
7437     summary line mens wet dream ideal woman seriou...
19349    feel movie amazing adam sandlers performance i...
18105    childless couple brooke adams jeff hayenga go ...
9504     well odds exact right moment redneck amateursc...
28817    lucky enough attend screening stockholm elegan...
36118    pier paolo pasolini peepeepee prefer call due ...
99       mario fan long remember fond memories playing ...
27866    paperhouse thrillerhorror sick bed fever year ...
30713    true geekgirls dream high tech high drama smar...
42202    director probably still early learning stages .

In [21]:
#df2 = df.copy()

In [22]:
#tokenizing the text (word_tokenize relies on NLTK's Punkt tokenizer data being installed)
df["review"] = df["review"].apply(word_tokenize)

In [23]:
df["review"].head()

0    [one, reviewers, mentioned, watching, oz, epis...
1    [wonderful, little, production, br, br, filmin...
2    [thought, wonderful, way, spend, time, hot, su...
3    [basically, theres, family, little, boy, jake,...
4    [petter, matteis, love, time, money, visually,...
Name: review, dtype: object

In [24]:
#function to count the most frequent and least frequent words
def word_counter(df, column_name,most_common_count=20,least_common_count=20):
    all_words = [word for sublist in df[column_name] for word in sublist]
    word_frequency = Counter(all_words)
    most_common_words = word_frequency.most_common(most_common_count)
    least_common_words = word_frequency.most_common()[-least_common_count:][::-1]
    return most_common_words,least_common_words

In [25]:
most_common,least_common = word_counter(df, "review")


In [26]:
print(f"The 20 most common words are: {most_common}")


The 20 most common words are: [('br', 114890), ('movie', 83523), ('film', 74459), ('one', 51028), ('like', 38992), ('good', 28570), ('even', 24576), ('would', 24024), ('time', 23269), ('really', 22951), ('see', 22535), ('story', 22097), ('much', 18947), ('well', 18798), ('get', 18205), ('great', 17821), ('also', 17818), ('bad', 17719), ('people', 17538), ('first', 17155)]


In [27]:
print(f"The 20 least common words are: {least_common}")

The 20 least common words are: [('yosemitebr', 1), ('studentsthe', 1), ('horriblecatwoman', 1), ('clatter', 1), ('frenchonly', 1), ('philandererbr', 1), ('effortful', 1), ('ohsohard', 1), ('ashknenazi', 1), ('jossi', 1), ('wasamwill', 1), ('emiles', 1), ('burtolucci', 1), ('censorial', 1), ('angelyne', 1), ('nolin', 1), ('subjectivebut', 1), ('satireis', 1), ('dkman', 1), ('mottos', 1)]


In [28]:
from nltk.stem import WordNetLemmatizer

In [29]:
lemmatizer = WordNetLemmatizer()

In [30]:
def lemmatize_tokens(tokens):
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmatized_tokens

In [31]:
df["review"] = df["review"].apply(lemmatize_tokens)

In [32]:
df["review"].head()

0    [one, reviewer, mentioned, watching, oz, episo...
1    [wonderful, little, production, br, br, filmin...
2    [thought, wonderful, way, spend, time, hot, su...
3    [basically, there, family, little, boy, jake, ...
4    [petter, matteis, love, time, money, visually,...
Name: review, dtype: object

In [33]:
#removing all occurrences of the token "br" (left over from <br /> tags)
def remove_br(tokens):
    return [token for token in tokens if token!= "br"]

In [34]:
df["review"] = df["review"].apply(remove_br)

In [35]:
df["review"].head()

0    [one, reviewer, mentioned, watching, oz, episo...
1    [wonderful, little, production, filming, techn...
2    [thought, wonderful, way, spend, time, hot, su...
3    [basically, there, family, little, boy, jake, ...
4    [petter, matteis, love, time, money, visually,...
Name: review, dtype: object

In [36]:
df.head()

Unnamed: 0,review,sentiment
0,"[one, reviewer, mentioned, watching, oz, episo...",positive
1,"[wonderful, little, production, filming, techn...",positive
2,"[thought, wonderful, way, spend, time, hot, su...",positive
3,"[basically, there, family, little, boy, jake, ...",negative
4,"[petter, matteis, love, time, money, visually,...",positive


In [37]:
df["review"].head()

0    [one, reviewer, mentioned, watching, oz, episo...
1    [wonderful, little, production, filming, techn...
2    [thought, wonderful, way, spend, time, hot, su...
3    [basically, there, family, little, boy, jake, ...
4    [petter, matteis, love, time, money, visually,...
Name: review, dtype: object

In [38]:
df['review'] = df['review'].apply(lambda word_list: ' '.join(word_list))

In [39]:
df.head()

Unnamed: 0,review,sentiment
0,one reviewer mentioned watching oz episode you...,positive
1,wonderful little production filming technique ...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically there family little boy jake think t...,negative
4,petter matteis love time money visually stunni...,positive


The data is now preprocessed and ready to be fed into the machine learning model. The initial model is a Naive Bayes model.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

In [None]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2, random_state=42)

# Creating a TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Creating the TF-IDF representation of the training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Creating the TF-IDF representation of the testing data 
X_test_tfidf = tfidf_vectorizer.transform(X_test)
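To make the TF-IDF step less of a black box, here is a small pure-Python sketch of the weighting. It assumes the smoothed IDF formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each document vector, which is how scikit-learn documents the default behaviour of TfidfVectorizer. The function name `tfidf` and the toy corpus are illustrative only, and the sketch splits on whitespace rather than using the vectorizer's own tokenizer, so it is not a drop-in replacement.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {
        term: math.log((1 + n_docs) / (1 + n_containing)) + 1
        for term, n_containing in doc_freq.items()
    }
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {term: count * idf[term] for term, count in counts.items()}
        # L2 norm of the raw weights; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({term: w / norm for term, w in raw.items()})
    return vectors

# Toy corpus for illustration, not taken from the dataset.
docs = ["great movie great cast", "bad movie", "great story"]
weights = tfidf(docs)
print(weights[1])
```

In the second toy document, "bad" receives a higher weight than "movie" because "movie" also appears in another document and is therefore less distinctive.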


In [None]:
# Training the Naive Bayes classifier model on the vectorized training data
naive_bayes_classifier = MultinomialNB()
naive_bayes_classifier.fit(X_train_tfidf, y_train)

# Making predictions on the testing set
y_pred = naive_bayes_classifier.predict(X_test_tfidf)

# Evaluating the model performance
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

# Print classification report
print(classification_report(y_test, y_pred))
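For intuition about what a multinomial Naive Bayes classifier computes, the sketch below fits one by hand on a toy corpus. It is a simplified illustration under stated assumptions: it works on raw word counts rather than the TF-IDF weights used above, it uses Laplace smoothing with alpha = 1.0 (the default value of MultinomialNB's alpha parameter), and the names `train_nb` and `predict_nb` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on word counts with Laplace smoothing."""
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc.split())
    vocab = {term for counts in term_counts.values() for term in counts}
    model = {}
    for label, n_class_docs in class_docs.items():
        # Smoothed denominator: all word occurrences in the class plus alpha per vocabulary term.
        total = sum(term_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n_class_docs / len(labels))
        log_likelihood = {
            term: math.log((term_counts[label][term] + alpha) / total)
            for term in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        scores[label] = log_prior + sum(
            log_likelihood[term] for term in doc.split() if term in log_likelihood
        )
    return max(scores, key=scores.get)

# Toy training data for illustration, not taken from the dataset.
train_docs = ["great film loved", "wonderful great story", "bad film hated", "boring bad plot"]
train_labels = ["positive", "positive", "negative", "negative"]
model = train_nb(train_docs, train_labels)
print(predict_nb(model, "bad boring film"))
# prints: negative
```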




The initial Naive Bayes model performed quite well, with an accuracy of 87% on the test data set. The precision and recall values in the classification report are also quite good.
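As a reminder of what these metrics measure, the following sketch computes accuracy, precision, and recall by hand for a binary problem. The helper name `binary_metrics` and the five toy labels are made up for illustration and are not results from this notebook.

```python
def binary_metrics(y_true, y_pred, positive="positive"):
    """Return (accuracy, precision, recall) with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Toy labels for illustration only.
y_true_toy = ["positive", "positive", "negative", "negative", "positive"]
y_pred_toy = ["positive", "negative", "negative", "positive", "positive"]
accuracy_toy, precision_toy, recall_toy = binary_metrics(y_true_toy, y_pred_toy)
print(f"accuracy={accuracy_toy:.2f} precision={precision_toy:.2f} recall={recall_toy:.2f}")
# prints: accuracy=0.60 precision=0.67 recall=0.67
```

Because the two classes are balanced in this data set, accuracy is a reasonable headline number; precision and recall show whether the errors are concentrated in one class.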

In [None]:
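# Testing the model on new data
new_review = "This movie is bad! I hated it."

# Convert the new review to its TF-IDF representation. The raw text is passed in
# without the cleaning steps applied to the training data; by default the
# vectorizer only lower-cases and tokenizes it.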
new_review_tfidf = tfidf_vectorizer.transform([new_review])

# Predict the sentiment of the new review
predicted_sentiment = naive_bayes_classifier.predict(new_review_tfidf)

print(f"Predicted Sentiment: {predicted_sentiment[0]}")
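A model generally behaves most predictably when new text goes through the same preprocessing as the training data. The sketch below shows one possible way to do that for a single review. It is an assumption-laden illustration: `preprocess_new_review` is a hypothetical helper, the small `STOP_WORDS` set stands in for NLTK's English stop word list so that the example runs on its own, and lemmatization is left out for brevity.

```python
import re

# Stand-in for NLTK's English stop word list (hypothetical subset for illustration).
STOP_WORDS = {"this", "is", "i", "it", "the", "a", "and"}

def preprocess_new_review(text):
    """Apply the notebook's lower-casing, character filtering, and stop word removal."""
    text = text.lower()
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # same pattern as In [12]
    text = re.sub(r"\d+", "", text)             # same pattern as In [14]
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

cleaned = preprocess_new_review("This movie is bad! I hated it.")
print(cleaned)
# prints: movie bad hated
```

The cleaned string could then be passed to tfidf_vectorizer.transform in place of the raw review.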