## Sentiment Classification Tutorial: Classical vs. Deep Learning Models (IMDB Reviews)

This tutorial adapts the text classification framework to a binary sentiment analysis task using the IMDB Movie Reviews dataset. We will classify movie reviews into two categories: 'positive' or 'negative'.

We will compare the performance of:

* **Bag-of-Words (BoW) Classifier using TF-IDF features.**

* **Gated Recurrent Unit (GRU) from PyTorch.**

* **Bi-directional GRU (Bi-GRU) from PyTorch.**


#### Dataset: IMDB-Dataset.csv (loaded from the uploaded file)

#### Target: Binary Classification (2 classes: Positive, Negative)

### In this notebook, we focus on the Bag-of-Words classifier.

### 1. **Setup and Data Loading**

We import the necessary libraries, load the IMDB dataset, and convert the categorical sentiment labels ('positive', 'negative') into numerical form (1, 0).

In [None]:
# Core data science and NLP libraries
import numpy as np
import pandas as pd
import re
import os

# Scikit-learn for classical ML
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

from google.colab import drive
drive.mount('/content/drive')


# Set a random seed for reproducibility
SEED = 42
np.random.seed(SEED)

# --- Load the IMDB Dataset ---
FILE_PATH = '/content/drive/MyDrive/Colab Notebooks/IMDB-Dataset.csv'
df = pd.read_csv(FILE_PATH)


# 1. Label Encoding: Convert 'positive' to 1 and 'negative' to 0
# The 'sentiment' column is the raw target string
df['target'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

X = df['review'].values
y = df['target'].values

# 2. Split Data into Training and Testing Sets
# We use a standard 80/20 split for training and testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=SEED,
    stratify=y  # Preserve the class proportions of y in both splits
)

target_names = ['negative', 'positive']  # List index matches the numeric label: 0 -> negative, 1 -> positive

print(f"\nTraining Samples: {len(X_train)}")
print(f"Testing Samples: {len(X_test)}")
print(f"Total Classes: {len(target_names)}")
print(f"Classes: {target_names}")
print("-" * 50)
print(f"Example Data Point (Class: {target_names[y_train[0]]}):\n{X_train[0][:300]}...\n")



Training Samples: 40000
Testing Samples: 10000
Total Classes: 2
Classes: ['negative', 'positive']
--------------------------------------------------
Example Data Point (Class: positive):
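
The effect of `stratify=y` in the split above is easier to see on a deliberately imbalanced label vector. The following is a minimal sketch on synthetic data (the names `X_toy` and `y_toy` are made up for illustration and are not part of the IMDB pipeline); it only checks that the share of positive labels is the same in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 80 negatives and 20 positives (not the IMDB data)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Same call pattern as in the cell above: 80/20 split, fixed seed, stratified
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.2,
    random_state=42,
    stratify=y_toy
)

# Fraction of positive labels in each split
train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()

print(f"Positive share in train: {train_pos_rate:.2f}")
print(f"Positive share in test:  {test_pos_rate:.2f}")
```

Without `stratify`, the two shares would generally differ from run to run; for the balanced IMDB labels the difference would be small, but stratifying removes it entirely.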



### 2. **Classifier 1: Bag-of-Words (BoW) Baseline**

We will reuse the TF-IDF feature extraction pipeline combined with Logistic Regression. This shows how well a classical linear model performs when it sees only weighted counts of words and word pairs, with no model of longer-range word order or context.
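
To illustrate what ignoring word order means in practice, here is a small sketch on two made-up sentences (they are not taken from the dataset). With unigrams only, the two sentences contain exactly the same words and therefore receive identical TF-IDF vectors; adding bigrams, as the pipeline below does with `ngram_range=(1, 2)`, recovers some local order. Stopword removal and `min_df` are left out here to keep the toy example small.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Two hypothetical reviews with the same words in a different order
docs = [
    "the movie was good not bad",
    "the movie was bad not good",
]

# Unigrams only: word order is discarded entirely
unigram_vec = TfidfVectorizer(ngram_range=(1, 1))
U = unigram_vec.fit_transform(docs).toarray()
same_unigram = np.allclose(U[0], U[1])

# Unigrams + bigrams: adjacent word pairs carry some order information
bigram_vec = TfidfVectorizer(ngram_range=(1, 2))
B = bigram_vec.fit_transform(docs).toarray()
same_bigram = np.allclose(B[0], B[1])

print(f"Identical vectors with unigrams only: {same_unigram}")
print(f"Identical vectors with bigrams added: {same_bigram}")
```

A linear classifier cannot tell the two sentences apart from the unigram features alone, which is the limitation the recurrent models in this comparison are meant to address.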

In [None]:
# %%
# --- Bag-of-Words Model (TF-IDF + Logistic Regression) ---

print("Starting BoW (TF-IDF) Classification...")

# 1. Feature Engineering: TfidfVectorizer
# min_df=5: Ignore terms that appear in fewer than 5 documents.
# stop_words='english': Remove common English stopwords.
tfidf_vectorizer = TfidfVectorizer(
    min_df=5,
    stop_words='english',
    ngram_range=(1, 2) # Also include 2-word combinations (bigrams)
)

# 2. Classifier: Logistic Regression
log_reg_classifier = LogisticRegression(
    solver='lbfgs',
    random_state=SEED,
    max_iter=1000
)

# 3. Build a pipeline: chain vectorization and classification
bow_pipeline = Pipeline([
    ('tfidf', tfidf_vectorizer),
    ('clf', log_reg_classifier)
])

# Training
bow_pipeline.fit(X_train, y_train)

# Prediction
y_pred_bow = bow_pipeline.predict(X_test)

# Evaluation
accuracy_bow = accuracy_score(y_test, y_pred_bow)
f1_bow = f1_score(y_test, y_pred_bow, average='weighted')

print("\n--- BoW Classifier Performance ---")
print(f"Test Accuracy: {accuracy_bow:.4f}")
print(f"Weighted F1-Score: {f1_bow:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred_bow, target_names=target_names))

# Store results for final comparison
results = {'BoW (TF-IDF)': {'Accuracy': accuracy_bow, 'F1-Score': f1_bow}}
# %%


Starting BoW (TF-IDF) Classification...

--- BoW Classifier Performance ---
Test Accuracy: 0.9013
Weighted F1-Score: 0.9013

Classification Report:
              precision    recall  f1-score   support

    negative       0.91      0.89      0.90      5000
    positive       0.89      0.91      0.90      5000

    accuracy                           0.90     10000
   macro avg       0.90      0.90      0.90     10000
weighted avg       0.90      0.90      0.90     10000
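
In the report above, the macro and weighted averages coincide because both classes have the same support (5000 test reviews each). As a rough sketch of how `average='weighted'` is computed, the snippet below reproduces it by hand on a small, made-up and deliberately imbalanced label vector (the arrays `y_true_toy` and `y_pred_toy` are hypothetical, not model output).

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical labels and predictions: 6 negatives, 4 positives
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Per-class F1 scores, in label order [0, 1]
per_class_f1 = f1_score(y_true_toy, y_pred_toy, average=None)

# Support = number of true samples per class
support = np.bincount(y_true_toy)

# Weighted F1: per-class F1 averaged with the supports as weights
manual_weighted = np.sum(per_class_f1 * support) / support.sum()

# Macro F1: plain unweighted mean of the per-class scores
manual_macro = per_class_f1.mean()

sk_weighted = f1_score(y_true_toy, y_pred_toy, average='weighted')
sk_macro = f1_score(y_true_toy, y_pred_toy, average='macro')

print(f"Weighted F1 (manual vs sklearn): {manual_weighted:.4f} vs {sk_weighted:.4f}")
print(f"Macro F1    (manual vs sklearn): {manual_macro:.4f} vs {sk_macro:.4f}")
```

With unequal supports the two averages diverge, so on an imbalanced dataset the choice between them would matter more than it does here.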

