# Sentiment Analysis of IMDB Movie Reviews

This project performs **binary sentiment classification** (positive vs. negative)
on IMDB movie reviews using classical Natural Language Processing (NLP) techniques:
TF-IDF features fed to a logistic regression classifier.

**Key steps:**
- Importing the required libraries
- Loading the dataset
- Creating a pandas DataFrame
- Text preprocessing
- Feature extraction using TF-IDF
- Splitting the data into train and test sets
- Model training
- Performance evaluation
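
The steps listed above can be sketched end to end on a toy corpus. This is a minimal illustration, not the notebook's actual pipeline: the eight sentences and every variable name are hypothetical stand-ins, and the sketch splits the data before fitting the vectorizer (one common way to keep test-set vocabulary out of training), which differs from the order in the list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus standing in for the real reviews
texts = [
    "a wonderful and moving film",
    "great acting and a clever plot",
    "i loved every minute of it",
    "a brilliant and funny script",
    "a dull and boring mess",
    "terrible acting and a weak plot",
    "i hated every minute of it",
    "a tedious and awful script",
]
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

# Hold out 25% of the samples, keeping the class ratio in both splits
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# Learn the vocabulary and IDF weights from the training texts only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Fit a linear classifier on the TF-IDF vectors and score the held-out texts
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Held-out accuracy on the toy corpus: {accuracy:.2f}")
```

With so few samples the accuracy value itself is not meaningful; the point is only the order of the calls and which objects are fitted on which split.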

## 1. Importing Required Libraries

In [3]:
# ====================
#    Data handling
# ====================
import pandas as pd
import numpy as np

# ====================
#    Visualization
# ====================
# Plotting library for charts and graphs
import matplotlib.pyplot as plt 
# Statistical data visualization built on matplotlib
import seaborn as sns                    

# ====================
# NLP & preprocessing
# ====================
# Core NLP library
import nltk   
# Common words to remove (e.g., "the", "is")
from nltk.corpus import stopwords  
# Splits text into individual words (tokens)
from nltk.tokenize import word_tokenize  

# ====================
#   Machine Learning
# ====================
from sklearn.model_selection import train_test_split
# Converts text into numerical feature vectors using TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report


In [4]:
# Download the NLTK resources used below: tokenizer models (punkt, punkt_tab),
# the stopword list, and the movie_reviews corpus
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('movie_reviews')

[nltk_data] Downloading package punkt to /Users/mahsa/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /Users/mahsa/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/mahsa/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/mahsa/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True
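
The log above shows that every package was already present, so the download calls only re-check them. A possible alternative is to look each resource up locally first and download only what is missing. The sketch below assumes the standard resource paths under `nltk_data` (`tokenizers/...` and `corpora/...`); `RESOURCES`, `missing_resources`, and `ensure_resources` are hypothetical names introduced here for illustration.

```python
import nltk

# Package name -> path that nltk.data.find() looks up (assumed standard layout)
RESOURCES = {
    "punkt": "tokenizers/punkt",
    "punkt_tab": "tokenizers/punkt_tab",
    "stopwords": "corpora/stopwords",
    "movie_reviews": "corpora/movie_reviews",
}


def missing_resources(resources=RESOURCES):
    """Return the package names that cannot be found locally."""
    missing = []
    for package, path in resources.items():
        try:
            nltk.data.find(path)
        except LookupError:
            # nltk.data.find raises LookupError when the resource is absent
            missing.append(package)
    return missing


def ensure_resources(resources=RESOURCES):
    """Download only the packages that are not installed yet."""
    for package in missing_resources(resources):
        nltk.download(package, quiet=True)
```

Calling `ensure_resources()` once near the top of the notebook would then stay silent on machines where everything is already installed.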

## 2. Loading IMDB Dataset from NLTK

In [5]:
# ===============================
# Load IMDB movie reviews dataset
# ===============================
# Note: NLTK's corpus reader returns each review already tokenized
# (movie_reviews.words(fileid) yields a list of word tokens)
from nltk.corpus import movie_reviews

## 3. Creating a pandas DataFrame

In [6]:
import random
# Create a list of (review_text, sentiment_label)
# 'pos' = positive review, 'neg' = negative review
documents = []
# Loop through each sentiment category ('pos', 'neg'), then through each review file
# belonging to that category. Each review is provided as a list of tokenized words,
# which are joined into a single string to reconstruct the full review text.
# The resulting text is paired with its corresponding sentiment label.
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        review_text = ' '.join(movie_reviews.words(fileid))
        documents.append((review_text, category))

# Shuffle so the reviews are not grouped by label
# (the shuffle is unseeded, so the row order differs between runs)
random.shuffle(documents)

# Convert the dataset into a pandas DataFrame
df = pd.DataFrame(documents, columns=['review', 'label'])

# Preview the dataset
df.head()

|   | review | label |
|---|--------|-------|
| 0 | > from writer and director darren stein comes ... | neg |
| 1 | it ' s hard not to recommend " the others . " ... | pos |
| 2 | the comet - disaster flick is a disaster alrig... | neg |
| 3 | after enduring mariah carey ' s film debut , g... | neg |
| 4 | the swooping shots across darkened rooftops su... | neg |
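
Because the cell above calls `random.shuffle` without a seed, the preview rows will change from run to run. One way to get a repeatable order, and to check the class balance at the same time, is sketched below. It runs on a small stand-in frame (`df_demo`, with invented rows) rather than on the real `df`, and `random_state=42` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame, with the same two columns
df_demo = pd.DataFrame(
    {
        "review": ["loved it", "hated it", "a fine film", "a dull film"],
        "label": ["pos", "neg", "pos", "neg"],
    }
)

# Shuffle all rows with a fixed seed so the order is the same on every run
shuffled = df_demo.sample(frac=1, random_state=42).reset_index(drop=True)

# Count reviews per label to confirm that the classes are balanced
label_counts = shuffled["label"].value_counts()
print(label_counts)
```

The same two calls should apply unchanged to `df`, where `value_counts()` reports how many reviews carry each label.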
