# **Sentiment Analysis of IMDb Movie Reviews**

**Introduction**

This project classifies movie reviews as positive or negative using Natural Language Processing (NLP) and machine learning. The goal is a sentiment analysis model that automatically predicts the sentiment of reviews from the IMDb Movie Reviews dataset.

**Techniques Used:**

1. Preprocessing: Lowercasing, Tokenization, Stopword Removal, Lemmatization
2. Feature Extraction: TF-IDF Vectorizer
3. Model: Logistic Regression

TF-IDF features combined with a Logistic Regression classifier give a simple approach to gauging audience sentiment, which can support movie recommendation systems, trend analysis, and the interpretation of audience feedback.

In [None]:
import numpy as np
import pandas as pd

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

**--> Download NLTK data**

In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

**--> Load Data**

In [None]:
df=pd.read_csv("/content/drive/MyDrive/FSDS @Kodi Senapati/Colab files/NLP/Datasets/IMDB Dataset.csv")

df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


In [None]:
df.shape

(50000, 2)

**--> Text Preprocessing**

In [None]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase the review and split it into word tokens
    tokens = word_tokenize(text.lower())
    # Keep purely alphabetic tokens (drops punctuation and numbers)
    tokens = [word for word in tokens if word.isalpha()]
    # Remove English stopwords
    tokens = [word for word in tokens if word not in stop_words]
    # Reduce each remaining word to its lemma
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

df['Cleaned_Review'] = df['review'].apply(preprocess)

In [None]:
df.head()

|   | review | sentiment | Cleaned_Review |
|---|--------|-----------|----------------|
| 0 | One of the other reviewers has mentioned that ... | positive | one reviewer mentioned watching oz episode hoo... |
| 1 | A wonderful little production. <br /><br />The... | positive | wonderful little production br br filming tech... |
| 2 | I thought this was a wonderful way to spend ti... | positive | thought wonderful way spend time hot summer we... |
| 3 | Basically there's a family where a little boy ... | negative | basically family little boy jake think zombie ... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | petter mattei love time money visually stunnin... |


In [None]:
df.shape

(50000, 3)
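
The cleaned output above still contains `br` tokens (row 1 reads "wonderful little production br br filming"). The `<br />` tags in the raw reviews are split by the tokenizer, and `br` passes the alphabetic filter. One possible remedy, offered as an assumption rather than as part of the original notebook, is to strip tags before tokenizing. The helper name `strip_html` and the regular expression below are illustrative choices.

```python
import re

# Hypothetical helper (not in the original notebook): matches anything
# that looks like an HTML tag, e.g. <br /> or <i>.
_TAG_RE = re.compile(r"<[^>]+>")

def strip_html(text):
    """Replace each HTML-like tag with a space so adjacent words stay separate."""
    return _TAG_RE.sub(" ", text)

sample = "A wonderful little production. <br /><br />The filming"
print(strip_html(sample).split())
```

Calling `strip_html` at the start of `preprocess` would change the cleaned text, so the model would need to be retrained and re-evaluated before comparing results.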

**--> Check for data imbalance**

In [None]:
sentiment_count=df['sentiment'].value_counts(normalize=True)*100

print(sentiment_count)

sentiment
positive    50.0
negative    50.0
Name: proportion, dtype: float64


**--> Feature Extraction**

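*TF-IDF Vectorization:* This step converts the cleaned text into numerical features. Each word's weight grows with its frequency within a review and shrinks with the number of reviews that contain it, so words that occur throughout the dataset carry less importance than words specific to a few reviews.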
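
To illustrate how TF-IDF assigns importance to words, the sketch below computes the weights by hand for a three-document toy corpus invented for this example. It assumes the smoothed IDF that scikit-learn's `TfidfVectorizer` applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization. The function names are hypothetical.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration, not taken from the IMDb data).
docs = ["great movie great cast", "boring movie", "great soundtrack"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def idf(term):
    # Smoothed inverse document frequency (scikit-learn's default form).
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(doc_tokens):
    # Raw term count times IDF, then scale the vector to unit L2 length.
    counts = Counter(doc_tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {term: value / norm for term, value in raw.items()}

weights = tfidf(tokenized[1])  # the document "boring movie"
print(weights)
```

In the second document, "boring" appears in one document and "movie" in two, so "boring" receives the larger weight even though both occur once.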



In [None]:
vectorize=TfidfVectorizer()

X=vectorize.fit_transform(df['Cleaned_Review'])
y=df['sentiment']

**--> Model Building & Prediction**

In [None]:
# Train-Test-Split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)


# Model Building
model=LogisticRegression()
model.fit(X_train, y_train)

y_pred=model.predict(X_test)
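
In the cells above, the TF-IDF vectorizer is fitted on all 50,000 reviews before the train/test split, so vocabulary and document frequencies from the test reviews feed into the features the model is trained on. A common way to avoid this is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline. The sketch below shows the pattern on a tiny made-up corpus. The texts, labels, and the `max_iter` value are assumptions for illustration, and the effect on the reported accuracy is not measured here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in data (hypothetical, not the IMDb reviews).
texts = [
    "great film loved it", "wonderful acting great story",
    "terrible plot boring", "awful boring waste",
    "loved wonderful story", "terrible awful acting",
    "great wonderful film", "boring terrible film",
]
labels = ["positive", "positive", "negative", "negative",
          "positive", "negative", "positive", "negative"]

# Split the raw text first, so the test reviews stay unseen.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then the classifier.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

# At prediction time the pipeline reuses the vocabulary learned from training.
preds = pipe.predict(X_test)
print(list(zip(y_test, preds)))
```

With this pattern, new reviews can be passed to `pipe.predict` as cleaned strings, without a separate `transform` call.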

In [None]:
# Print Actual vs. Predicted Sentiments
comparison_df = pd.DataFrame({'Actual': y_test.values, 'Predicted': y_pred})

# Display sample results
print(comparison_df.head(10))

     Actual Predicted
0  positive  negative
1  positive  positive
2  negative  negative
3  positive  positive
4  negative  negative
5  positive  positive
6  positive  positive
7  positive  negative
8  negative  negative
9  negative  negative


**--> Model Evaluation**

In [None]:
# Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\n Classification Report: \n", classification_report(y_test, y_pred))


Accuracy: 0.8945

 Classification Report: 
               precision    recall  f1-score   support

    negative       0.91      0.88      0.89      4961
    positive       0.88      0.91      0.90      5039

    accuracy                           0.89     10000
   macro avg       0.89      0.89      0.89     10000
weighted avg       0.89      0.89      0.89     10000
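
To make the precision and recall columns concrete, the sketch below recomputes both by hand for the ten actual/predicted pairs printed in the sample comparison earlier. It covers only those ten rows, not the full 10,000-review test set, so its numbers differ from the report. The function name `precision_recall` is hypothetical.

```python
# The ten (actual, predicted) pairs from the sample comparison output.
pairs = [
    ("positive", "negative"),
    ("positive", "positive"),
    ("negative", "negative"),
    ("positive", "positive"),
    ("negative", "negative"),
    ("positive", "positive"),
    ("positive", "positive"),
    ("positive", "negative"),
    ("negative", "negative"),
    ("negative", "negative"),
]

def precision_recall(pairs, label):
    """Return (precision, recall) for one class label."""
    tp = sum(1 for actual, pred in pairs if actual == label and pred == label)
    fp = sum(1 for actual, pred in pairs if actual != label and pred == label)
    fn = sum(1 for actual, pred in pairs if actual == label and pred != label)
    # Precision: of everything predicted as `label`, how much was right?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of everything that truly is `label`, how much was found?
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for label in ("negative", "positive"):
    p, r = precision_recall(pairs, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

In these ten rows, both errors are positive reviews predicted as negative, which lowers recall for the positive class and precision for the negative class.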



## **Predicting Sentiment for New Reviews**

In [None]:
review_1="The movie was so good and thoroughly enjoyed full movie"

clean_review_1=preprocess(review_1)

# Transform the input using the already fitted TF-IDF Vectorizer
review_vectorized=vectorize.transform([clean_review_1])

#Predict using the trained model
prediction=model.predict(review_vectorized)

print("Predicted Sentiment:", prediction[0])

Predicted Sentiment: positive


In [None]:
review_2="This movie was so boring I almost fell asleep. A complete waste of time."

clean_review_2=preprocess(review_2)

# Transform the input using the already fitted TF-IDF Vectorizer
review_vectorized=vectorize.transform([clean_review_2])

# Predict using the trained model
prediction=model.predict(review_vectorized)

print("Predicted Sentiment:", prediction[0])

Predicted Sentiment: negative
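
The two cells above repeat the same three steps: clean, vectorize, predict. One way to avoid the repetition is a small wrapper function. The sketch below is an assumption about how that could look, and the name `predict_sentiment` is hypothetical. So that it runs on its own, it uses two deliberately trivial stand-in classes in place of the fitted vectorizer and model.

```python
def predict_sentiment(text, preprocess, vectorizer, model):
    """Clean one review, vectorize it, and return the predicted label."""
    cleaned = preprocess(text)
    features = vectorizer.transform([cleaned])
    return model.predict(features)[0]


# Stand-ins for demonstration only; they mimic the transform/predict
# interface but are NOT the notebook's TF-IDF vectorizer or classifier.
class KeywordVectorizer:
    def transform(self, docs):
        # One feature per document: how often the word "boring" appears.
        return [[doc.split().count("boring")] for doc in docs]


class ThresholdModel:
    def predict(self, rows):
        return ["negative" if row[0] > 0 else "positive" for row in rows]


result = predict_sentiment(
    "This movie was so boring", str.lower, KeywordVectorizer(), ThresholdModel()
)
print("Predicted Sentiment:", result)
```

In the notebook, the equivalent call would be `predict_sentiment(review_1, preprocess, vectorize, model)`, passing the objects defined in the earlier cells.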
