# Sentiment Analysis on Movie Reviews
### CodeClause Internship – Entry Level Project

**Objective:**  
To analyze movie reviews and classify them as positive or negative using Natural Language Processing and Machine Learning techniques.


In [1]:
import nltk
import pandas as pd
import string

from nltk.corpus import movie_reviews, stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [2]:
nltk.download('movie_reviews')
nltk.download('stopwords')
nltk.download('wordnet')



[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Pritam\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\movie_reviews.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Pritam\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Pritam\AppData\Roaming\nltk_data...


True

In [3]:
documents=[]
labels=[]

## Load the Dataset

In [4]:

for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        documents.append(movie_reviews.raw(fileid))
        labels.append(category)

print(len(documents), len(labels))

2000 2000


In [5]:
df = pd.DataFrame({
    'review': documents,
    'sentiment': labels
})

df.head()

| | review | sentiment |
|---|---|---|
| 0 | plot : two teen couples go to a church party ,... | neg |
| 1 | the happy bastard's quick movie review \ndamn ... | neg |
| 2 | it is movies like these that make a jaded movi... | neg |
| 3 | " quest for camelot " is warner bros . ' firs... | neg |
| 4 | synopsis : a mentally unstable man undergoing ... | neg |


In [6]:
df['sentiment'].value_counts()

sentiment
neg    1000
pos    1000
Name: count, dtype: int64

In [8]:
# Load English stop words (common words like 'is', 'the', 'and')
stop_words = set(stopwords.words('english'))

# Initialize lemmatizer to convert words to their base form
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Convert text to lowercase for consistency
    text = text.lower()
    
    # Remove punctuation symbols from text
    text = text.translate(str.maketrans('', '', string.punctuation))
    
    # Split text into individual words
    words = text.split()
    
    # Remove stopwords and apply lemmatization
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    
    # Join words back into a cleaned sentence
    return " ".join(words)
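
The punctuation step deletes characters rather than replacing them with spaces, so hyphenated words are fused into one token (which is why `featurelength` shows up in the cleaned reviews). A minimal sketch of that behaviour, using only the standard library; `strip_punctuation` is a hypothetical helper name introduced here for illustration:

```python
import string

# Same translation table as in preprocess_text:
# every ASCII punctuation character is deleted, not replaced by a space.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Hypothetical helper isolating the punctuation-removal step."""
    return text.translate(PUNCT_TABLE)

for sample in ["first feature-length film", "bastard's quick review"]:
    print(repr(sample), "->", repr(strip_punctuation(sample)))
```

If keeping the two halves of a hyphenated word as separate tokens is preferred, one option is to map punctuation to spaces before splitting; that would be a change to the preprocessing, not what the notebook does.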


In [10]:
# Apply processing
df['clean_review'] = df['review'].apply(preprocess_text)
df.head()


| | review | sentiment | clean_review |
|---|---|---|---|
| 0 | plot : two teen couples go to a church party ,... | neg | plot two teen couple go church party drink dri... |
| 1 | the happy bastard's quick movie review \ndamn ... | neg | happy bastard quick movie review damn y2k bug ... |
| 2 | it is movies like these that make a jaded movi... | neg | movie like make jaded movie viewer thankful in... |
| 3 | " quest for camelot " is warner bros . ' firs... | neg | quest camelot warner bros first featurelength ... |
| 4 | synopsis : a mentally unstable man undergoing ... | neg | synopsis mentally unstable man undergoing psyc... |


In [11]:
# applying train-test split
X = df['clean_review']
y = df['sentiment']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
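
The split above is random but not stratified, so the test set ends up slightly off balance (199 `neg` vs 201 `pos` in the classification report). As an optional variation, `train_test_split` accepts a `stratify` argument that preserves the class ratio. A small sketch on stand-in data; `texts` and `labels` here are placeholders, not the actual reviews:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Placeholder data mimicking the corpus shape: 1000 'neg' and 1000 'pos' items.
labels = ['neg'] * 1000 + ['pos'] * 1000
texts = [f"review {i}" for i in range(2000)]

# stratify=labels keeps the 50/50 class ratio in both train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

test_counts = Counter(y_te)
print(test_counts)
```

Using this in the notebook would change which reviews land in the test set, so the reported scores would differ from the ones shown here.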


In [12]:
# Vectorization
tfidf = TfidfVectorizer(max_features=5000)

X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
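
The vectorizer is fitted on the training reviews only and then reused on the test reviews, so the test set contributes neither vocabulary nor IDF weights. A toy sketch of that fit/transform split; the three training sentences and the new sentence are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["great movie great cast", "boring plot bad acting", "great plot"]
new_docs = ["great acting but terrible pacing"]

vec = TfidfVectorizer()
X_train_toy = vec.fit_transform(train_docs)  # learns vocabulary and IDF weights
X_new_toy = vec.transform(new_docs)          # reuses them; unseen words are dropped

print(sorted(vec.vocabulary_))
print(X_train_toy.shape, X_new_toy.shape)
print("non-zero features in new doc:", X_new_toy.nnz)
```

Words such as "terrible" that never appeared during fitting get no column, which is why both matrices share the same number of features.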


In [13]:
# training model
model = LogisticRegression()
model.fit(X_train_tfidf, y_train)


In [14]:
# prediction and accuracy
y_pred = model.predict(X_test_tfidf)

accuracy = accuracy_score(y_test, y_pred)
accuracy


0.8275

In [15]:
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

         neg       0.83      0.82      0.83       199
         pos       0.83      0.83      0.83       201

    accuracy                           0.83       400
   macro avg       0.83      0.83      0.83       400
weighted avg       0.83      0.83      0.83       400



In [16]:
confusion_matrix(y_test, y_pred)


array([[164,  35],
       [ 34, 167]])
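
The accuracy and the per-class scores in the classification report can be recomputed directly from this matrix. In scikit-learn's convention, rows are true classes and columns are predicted classes, in sorted label order (`neg`, `pos`). A short check using NumPy:

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true class, columns: predicted class, label order ['neg', 'pos'].
cm = np.array([[164, 35],
               [34, 167]])

accuracy = np.trace(cm) / cm.sum()         # correct predictions / all predictions
recall = np.diag(cm) / cm.sum(axis=1)      # correct / all true members of the class
precision = np.diag(cm) / cm.sum(axis=0)   # correct / all predicted as the class

print(f"accuracy: {accuracy:.4f}")
for name, p, r in zip(['neg', 'pos'], precision, recall):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

Out of 400 test reviews, 331 sit on the diagonal, which gives the 0.8275 accuracy reported earlier.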

In [17]:
# testing custom review
def predict_sentiment(text):
    text = preprocess_text(text)
    vec = tfidf.transform([text])
    return model.predict(vec)[0]

predict_sentiment("This movie was fantastic and full of emotions")


'pos'
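
`predict_sentiment` returns only the label. When a confidence estimate is also useful, `predict_proba` gives per-class probabilities. The sketch below is an illustration on a tiny made-up dataset, not the model trained above: it bundles the vectorizer and classifier with `make_pipeline` and skips the `preprocess_text` step for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy data, only to keep the example self-contained.
toy_reviews = [
    "a wonderful and moving film",
    "fantastic acting and a great story",
    "dull plot and terrible acting",
    "a boring waste of time",
]
toy_labels = ["pos", "pos", "neg", "neg"]

# The pipeline applies TF-IDF and then Logistic Regression in one object.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(toy_reviews, toy_labels)

proba = clf.predict_proba(["this movie was fantastic"])[0]
for label, p in zip(clf.classes_, proba):
    print(f"{label}: {p:.2f}")
```

With only four training sentences the probabilities are not meaningful in themselves; the point is the shape of the API.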

## Conclusion
This project implemented a sentiment analysis model using NLP preprocessing, TF-IDF features, and Logistic Regression. On the held-out test set of 400 reviews, the model reached 82.75% accuracy, classifying 331 reviews correctly as positive or negative.

## Model Performance
- Accuracy: 82.75% (331 of 400 test reviews)
- Precision and recall are balanced across both classes (0.82 to 0.83 each)
- Logistic Regression performed well on TF-IDF features limited to the 5,000 most frequent terms

## Tools Used
- Python
- Pandas
- NLTK
- Scikit-learn

## Learning Outcomes
- Text preprocessing using NLP
- Feature extraction using TF-IDF
- Sentiment classification using machine learning
