# Assignment 3 - Movie Review Sentiment Prediction

## Assignment Description
In this assignment, your task is to predict the sentiment of movie reviews. You will be provided with a training dataset and a test dataset. The labels for the test dataset remain hidden, and you submit predictions for them. The assignment is run on the Kaggle platform.

## Instructions
1. Use the following link to join the competition: [Kaggle Competition](https://www.kaggle.com/t/bdfe59603f7a49819e1888c5b35cadb5)
2. After joining the competition, go to the Code tab and create a New Notebook.
3. Use this notebook to submit model predictions; the score of each submission appears on the leaderboard.
4. Refer to the peer review rubric below for the minimum requirements the notebook must meet.
5. The deadline for the assignment is **Dec 11, 2025**.

## Peer Review Rubrics Checklist
- [x] Identify data types of different columns
- [x] Present descriptive statistics of numerical columns
- [x] Identify and handle the missing values
- [x] Identify and handle duplicates
- [x] Identify and handle outliers
- [x] Present at least three visualizations and provide insights
- [x] Scale Numerical features and Encode Categorical features
- [x] Model Building (at least 7 models)
- [x] Hyperparameter Tuning on any 3 of the models
- [x] Comparison of model performances

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler, LabelEncoder, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier, VotingClassifier
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from collections import Counter

# Download NLTK data
try:
    nltk.data.find('corpora/stopwords')
    nltk.data.find('corpora/wordnet')
    nltk.data.find('sentiment/vader_lexicon')
except LookupError:
    nltk.download('stopwords')
    nltk.download('wordnet')
    nltk.download('omw-1.4')
    nltk.download('vader_lexicon')

# Set plot style
sns.set(style="whitegrid")

In [None]:
# Load Data
# Note: Adjust the paths if running locally or on Kaggle
try:
    train = pd.read_csv('/kaggle/input/mlp-term-3-2025-kaggle-assignment-3/train.csv')
    test = pd.read_csv('/kaggle/input/mlp-term-3-2025-kaggle-assignment-3/test.csv')
    submission = pd.read_csv('/kaggle/input/mlp-term-3-2025-kaggle-assignment-3/sample_submission.csv')
    print("Data loaded from Kaggle directory.")
except FileNotFoundError:
    # Fallback for local testing if files are in the same directory
    try:
        train = pd.read_csv('train.csv')
        test = pd.read_csv('test.csv')
        submission = pd.read_csv('sample_submission.csv')
        print("Data loaded from local directory.")
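    except FileNotFoundError as err:
        # Stop here: every later cell needs train, test and submission to exist
        raise FileNotFoundError("Data files not found. Please ensure train.csv, test.csv and sample_submission.csv are available.") from err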

## Rubric Step 1: Identify Data Types
We will examine the data types of the columns in the training dataset.

In [None]:
print("Training Data Info:")
train.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

**Observation:** `train.info()` lists every column with its non-null count and data type (e.g., `int64`, `float64`, `object`).

## Rubric Step 2: Descriptive Statistics
We will look at the descriptive statistics (mean, median, min, max, etc.) for the numerical columns.

In [None]:
print("Descriptive Statistics:")
print(train.describe())

## Rubric Step 3: Identify and Handle Missing Values
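We will check for missing values, impute the numerical columns with the training median, drop training rows that lack a phrase or a label, and fill missing test phrases with an empty string.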

In [None]:
print("Missing values before handling:")
print(train.isnull().sum())

# Define numerical columns
num_cols = ['feature_1', 'feature_2', 'feature_3']

# Impute numerical columns with median
imputer = SimpleImputer(strategy='median')
train[num_cols] = imputer.fit_transform(train[num_cols])
test[num_cols] = imputer.transform(test[num_cols])

# Drop rows with missing target ('sentiment') or text ('phrase') in train
train.dropna(subset=['phrase', 'sentiment'], inplace=True)

# Fill missing phrases in test with empty string
test['phrase'] = test['phrase'].fillna('')

print("\nMissing values after handling:")
print(train.isnull().sum())
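
The imputation step can be checked on a toy example. The sketch below uses made-up values (not the competition data) to show that `SimpleImputer` learns the median from the rows passed to `fit_transform` and reuses that same value in `transform`, so the test set never influences the statistic.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for train/test; the values are made up.
toy_train = pd.DataFrame({"feature_1": [1.0, 3.0, np.nan, 100.0]})
toy_test = pd.DataFrame({"feature_1": [np.nan, 2.0]})

toy_imputer = SimpleImputer(strategy="median")
# fit_transform learns the median from the training rows only (NaN is ignored).
toy_train[["feature_1"]] = toy_imputer.fit_transform(toy_train[["feature_1"]])
# transform reuses the training median for the test rows.
toy_test[["feature_1"]] = toy_imputer.transform(toy_test[["feature_1"]])

train_median = toy_imputer.statistics_[0]
print("Median learned from train:", train_median)
print("Imputed test column:", toy_test["feature_1"].tolist())
```

The median is less sensitive than the mean to the extreme value `100.0`, which is one common reason to prefer it when outliers are kept.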

## Rubric Step 4: Identify and Handle Duplicates
We will check for and remove duplicate rows to ensure data quality.

In [None]:
initial_len = len(train)
train.drop_duplicates(inplace=True)
print(f"Dropped {initial_len - len(train)} duplicate rows.")
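
One caveat worth checking: `drop_duplicates()` compares whole rows, so if `id` is unique per row (an assumption about this dataset), no row can ever be flagged. A minimal sketch on made-up data of comparing on every column except the identifier:

```python
import pandas as pd

# Made-up rows; ids 1 and 2 carry the same phrase and label.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "phrase": ["great film", "great film", "dull plot", "great film"],
    "sentiment": [1, 1, 0, 0],
})

# With the unique id included, no two rows are identical.
full_row_dupes = int(toy.duplicated().sum())

# Illustrative choice: ignore the identifier when comparing rows.
content_cols = [c for c in toy.columns if c != "id"]
content_dupes = int(toy.duplicated(subset=content_cols).sum())

deduped = toy.drop_duplicates(subset=content_cols, keep="first")
print(full_row_dupes, content_dupes, len(deduped))
```

Row 4 is kept because its label differs from rows 1 and 2; whether such conflicting labels should also be removed is a separate decision.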

## Rubric Step 5: Identify and Handle Outliers
We will identify outliers using the IQR method but will retain them as they might contain valuable information for sentiment analysis.

In [None]:
print("Outlier Detection (IQR Method):")
for col in num_cols:
    Q1 = train[col].quantile(0.25)
    Q3 = train[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = train[(train[col] < lower_bound) | (train[col] > upper_bound)]
    print(f"Outliers in {col}: {len(outliers)}")

print("\nDecision: We are RETAINING outliers to preserve valuable information for the sentiment analysis task.")
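
To make the IQR arithmetic concrete, here is a small worked example in plain Python on made-up numbers. It also shows clipping to the bounds as one possible alternative to retaining or dropping outliers; that alternative is an illustration only, not something this notebook applies.

```python
from statistics import quantiles

values = [1, 2, 3, 4, 5, 100]  # made-up numbers, not from the dataset

# method="inclusive" interpolates linearly between data points.
q1, _, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside [lower_bound, upper_bound] are flagged as outliers.
outliers = [v for v in values if v < lower_bound or v > upper_bound]

# Optional alternative: pull flagged values back to the nearest bound.
clipped = [min(max(v, lower_bound), upper_bound) for v in values]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"Bounds: [{lower_bound}, {upper_bound}], outliers: {outliers}")
```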

## Rubric Step 6: Visualizations
We will present at least three visualizations to gain insights into the data.

In [None]:
# Visualization 1: Target Distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='sentiment', data=train)
plt.title('Distribution of Sentiment')
plt.xlabel('Sentiment')
plt.ylabel('Count')
plt.show()

# Visualization 2: Top 20 Frequent Words
all_words = ' '.join(train['phrase'].astype(str)).split()
word_freq = Counter(all_words).most_common(20)
words_df = pd.DataFrame(word_freq, columns=['Word', 'Frequency'])

plt.figure(figsize=(12, 6))
sns.barplot(x='Frequency', y='Word', data=words_df)
plt.title('Top 20 Frequent Words')
plt.show()

# Visualization 3: Correlation Heatmap of Numerical Features
plt.figure(figsize=(8, 6))
sns.heatmap(train[num_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix of Numerical Features')
plt.show()
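
Word counts taken on raw phrases tend to be dominated by function words. The sketch below, on made-up phrases and with a small hand-written stopword set (an assumption for illustration; NLTK's `stopwords` corpus imported above could be used instead), shows how filtering changes the ranking.

```python
from collections import Counter

# Made-up phrases standing in for train['phrase'].
phrases = [
    "the film is a delight",
    "the plot is thin",
    "a delight from start to finish",
]
# Tiny illustrative stopword set, not a complete list.
toy_stopwords = {"the", "is", "a", "from", "to"}

tokens = " ".join(phrases).split()
raw_counts = Counter(tokens)
content_counts = Counter(t for t in tokens if t not in toy_stopwords)

print("Raw top 3:", raw_counts.most_common(3))
print("Filtered top 3:", content_counts.most_common(3))
```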

## Rubric Step 7: Scale Numerical Features and Encode Categorical Features
The text column is encoded numerically with TF-IDF rather than with a categorical encoder. We will define a preprocessing pipeline that:
1. Cleans the text data.
2. Extracts features using TF-IDF (Word and Character n-grams).
3. Extracts sentiment scores using VADER.
4. Scales numerical features using MinMaxScaler.
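
A note on step 4: `MinMaxScaler` learns the minimum and maximum from the data it is fitted on, so values outside that range are mapped outside [0, 1], and `MultinomialNB` raises an error on negative inputs. The sketch below uses made-up numbers to show the effect and one possible guard, the `clip=True` option (available in scikit-learn 0.24 and later).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up column: the scaler only sees values between 0 and 10 during fit.
train_col = np.array([[0.0], [10.0]])
new_col = np.array([[-5.0], [5.0], [20.0]])

plain_scaler = MinMaxScaler().fit(train_col)
clip_scaler = MinMaxScaler(clip=True).fit(train_col)

# Without clipping, unseen extremes land outside [0, 1].
plain_out = plain_scaler.transform(new_col).ravel()
# With clip=True, transformed values are cut back to the [0, 1] range.
clipped_out = clip_scaler.transform(new_col).ravel()

print("Without clip:", plain_out)
print("With clip:   ", clipped_out)
```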

In [None]:
# Text Cleaning Function
# Create the lemmatizer once instead of on every call
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Lowercase
    text = str(text).lower()
    # Remove everything except letters and whitespace (digits, punctuation)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenize on whitespace
    words = text.split()
    # Lemmatize (stopwords are kept so that negations like 'not' survive)
    words = [lemmatizer.lemmatize(word) for word in words]
    return " ".join(words)

# Apply cleaning
train['cleaned_phrase'] = train['phrase'].apply(clean_text)
test['cleaned_phrase'] = test['phrase'].apply(clean_text)

# Custom Transformer for VADER Sentiment
class VaderSentimentEstimator(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.sia = SentimentIntensityAnalyzer()

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        scores = [self.sia.polarity_scores(str(text))['compound'] for text in X]
        return np.array(scores).reshape(-1, 1)

# Define Transformers
text_word_transformer = TfidfVectorizer(max_features=10000, ngram_range=(1, 3), min_df=2, max_df=0.9, sublinear_tf=True)
text_char_transformer = TfidfVectorizer(max_features=10000, analyzer='char', ngram_range=(2, 4), min_df=2, max_df=0.9, sublinear_tf=True)

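# clip=True (scikit-learn >= 0.24) keeps scaled values inside [0, 1] even when
# validation or test values fall outside the range seen during fit;
# MultinomialNB raises an error on negative inputs
vader_transformer = Pipeline(steps=[
    ('vader', VaderSentimentEstimator()),
    ('scaler', MinMaxScaler(clip=True))
])

numeric_transformer = Pipeline(steps=[
    ('scaler', MinMaxScaler(clip=True))
])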

# Column Transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('text_word', text_word_transformer, 'cleaned_phrase'),
        ('text_char', text_char_transformer, 'cleaned_phrase'),
        ('vader', vader_transformer, 'phrase'),
        ('num', numeric_transformer, num_cols)
    ])

# Prepare X and y
X = train.drop(['sentiment', 'id'], axis=1)
y = train['sentiment']
X_test_submission = test.drop(['id'], axis=1)
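
To see how a `ColumnTransformer` stitches text and numeric features together, here is a reduced sketch on a made-up three-row frame. The names `toy` and `toy_preprocessor` are hypothetical, and only one TF-IDF block and one numeric column are used to keep the output easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Made-up data with one text column and one numeric column.
toy = pd.DataFrame({
    "phrase": ["good movie", "bad movie", "great film"],
    "feature_1": [1.0, 2.0, 3.0],
})

toy_preprocessor = ColumnTransformer(transformers=[
    # A single column name (string) hands the vectorizer a 1-D series of documents.
    ("text", TfidfVectorizer(), "phrase"),
    # A list of column names hands the scaler a 2-D frame.
    ("num", MinMaxScaler(), ["feature_1"]),
])

features = toy_preprocessor.fit_transform(toy)
n_text = len(toy_preprocessor.named_transformers_["text"].vocabulary_)

# Output columns = TF-IDF vocabulary size + number of numeric columns.
print("Feature matrix shape:", features.shape)
print("TF-IDF vocabulary size:", n_text)
```

Passing the text column as a string rather than a one-element list matters: `TfidfVectorizer` expects an iterable of documents, not a 2-D table.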

## Rubric Step 9: Hyperparameter Tuning
We will perform hyperparameter tuning on 3 different models with a 3-fold grid search. To keep the search fast, we tune on the first 1,500 training rows only.

In [None]:
print("Hyperparameter Tuning on 3 Models (using subset for speed)...")
X_subset = X.iloc[:1500]
y_subset = y.iloc[:1500]

# 1. Logistic Regression Tuning
print("1. Tuning Logistic Regression...")
param_grid_lr = {'classifier__C': [0.1, 1, 10]}
pipeline_lr = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', LogisticRegression(max_iter=3000, random_state=42))])
grid_lr = GridSearchCV(pipeline_lr, param_grid_lr, cv=3, scoring='accuracy', n_jobs=1)
grid_lr.fit(X_subset, y_subset)
print(f"Best LR Params: {grid_lr.best_params_}, Score: {grid_lr.best_score_:.4f}")

# 2. Random Forest Tuning
print("2. Tuning Random Forest...")
param_grid_rf = {'classifier__n_estimators': [100, 200], 'classifier__max_depth': [10, 20]}
pipeline_rf = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', RandomForestClassifier(random_state=42))])
grid_rf = GridSearchCV(pipeline_rf, param_grid_rf, cv=3, scoring='accuracy', n_jobs=1)
grid_rf.fit(X_subset, y_subset)
print(f"Best RF Params: {grid_rf.best_params_}, Score: {grid_rf.best_score_:.4f}")

# 3. Multinomial NB Tuning
print("3. Tuning Multinomial NB...")
param_grid_mnb = {'classifier__alpha': [0.1, 0.5, 1.0]}
pipeline_mnb = Pipeline(steps=[('preprocessor', preprocessor), ('classifier', MultinomialNB())])
grid_mnb = GridSearchCV(pipeline_mnb, param_grid_mnb, cv=3, scoring='accuracy', n_jobs=1)
grid_mnb.fit(X_subset, y_subset)
print(f"Best MNB Params: {grid_mnb.best_params_}, Score: {grid_mnb.best_score_:.4f}")
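
Taking the first rows of a frame gives a representative subset only if the rows are in no particular order. As a hedged alternative, the sketch below draws a stratified random subset with `train_test_split` on made-up labels, which preserves the class proportions; the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows with an 80/20 class split.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# Keep 50 rows for tuning; stratify preserves the 80/20 class ratio.
X_sub, _, y_sub, _ = train_test_split(
    X_toy, y_toy, train_size=50, stratify=y_toy, random_state=42
)

counts = np.bincount(y_sub)
print("Class counts in subset:", counts.tolist())
```

In the notebook, the same call could be applied to `X` and `y` with `train_size=1500` in place of the `iloc` slices.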

## Rubric Step 8: Model Building (at least 7 models)
We will train and evaluate 7 different models using 5-fold stratified cross-validation and record the mean accuracy of each.

In [None]:
models = {
    'Logistic Regression': LogisticRegression(C=1.0, max_iter=3000, random_state=42),
    'Linear SVC': LinearSVC(C=0.5, random_state=42, dual=True, max_iter=3000),
    'Multinomial NB': MultinomialNB(alpha=0.5),
    'Random Forest': RandomForestClassifier(n_estimators=300, max_depth=None, random_state=42),
    'Ridge Classifier': RidgeClassifier(random_state=42, solver='lsqr'),
    'XGBoost': XGBClassifier(n_estimators=200, learning_rate=0.1, random_state=42, eval_metric='logloss'),
    'LightGBM': LGBMClassifier(n_estimators=200, learning_rate=0.05, random_state=42, verbose=-1)
}

results = {}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("Training 7 Models with Cross-Validation...")
for name, model in models.items():
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', model)])
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring='accuracy', n_jobs=1)
    results[name] = scores.mean()
    print(f"{name}: Mean CV Accuracy = {scores.mean():.4f}")
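
Mean accuracy alone hides how much the folds disagree. The sketch below, on a tiny made-up corpus with a single TF-IDF step (a simplification of the preprocessor used above), reports the standard deviation alongside the mean.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 10 positive and 10 negative phrases.
pos = ["great film", "loved every minute", "a wonderful story",
       "brilliant acting", "truly great fun"] * 2
neg = ["terrible film", "hated every minute", "a boring story",
       "awful acting", "truly dull mess"] * 2
texts = pos + neg
labels = [1] * len(pos) + [0] * len(neg)

toy_pipeline = Pipeline(steps=[
    ("tfidf", TfidfVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

toy_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# The vectorizer is refit inside each fold, so validation text never leaks into training.
toy_scores = cross_val_score(toy_pipeline, texts, labels, cv=toy_cv, scoring="accuracy")

print(f"Accuracy: {toy_scores.mean():.4f} +/- {toy_scores.std():.4f}")
```

Two models whose means differ by less than the fold-to-fold spread are hard to rank with confidence.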

## Rubric Step 10: Comparison of Model Performances
We will compare the performance of the models and build a Voting Ensemble for the final submission.

In [None]:
results_df = pd.DataFrame(list(results.items()), columns=['Model', 'Accuracy']).sort_values(by='Accuracy', ascending=False)
print("Model Comparison Table:")
print(results_df)

# Visualize Comparison
plt.figure(figsize=(10, 6))
sns.barplot(x='Accuracy', y='Model', data=results_df)
plt.title('Model Accuracy Comparison')
plt.xlim(0, 1.0)
plt.show()

In [None]:
# Building Voting Ensemble (Soft Voting) with Top Performers
print("Building Voting Ensemble...")

estimators = [
    ('lr', Pipeline([('preprocessor', preprocessor), ('clf', LogisticRegression(C=1, max_iter=3000))])),
    ('svc', Pipeline([('preprocessor', preprocessor), ('clf', CalibratedClassifierCV(LinearSVC(C=0.5, dual=True, max_iter=3000), cv=3))])),
    ('mnb', Pipeline([('preprocessor', preprocessor), ('clf', MultinomialNB(alpha=0.5))]))
]

voting_clf = VotingClassifier(estimators=estimators, voting='soft', n_jobs=1)

# Evaluate Ensemble on Hold-out set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
voting_clf.fit(X_train, y_train)
val_preds = voting_clf.predict(X_val)
ensemble_acc = accuracy_score(y_val, val_preds)
print(f"Voting Ensemble Validation Accuracy: {ensemble_acc:.4f}")
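
Soft voting averages the class probabilities predicted by each model and picks the class with the highest average, which is why `LinearSVC` (which has no `predict_proba`) is wrapped in `CalibratedClassifierCV` above. The sketch below uses made-up probabilities for a single review to show how soft voting can disagree with a simple majority vote.

```python
import numpy as np

# Made-up probabilities from three models for one review.
# Columns are [class 0, class 1].
probas = np.array([
    [0.90, 0.10],  # model A is confident in class 0
    [0.40, 0.60],  # model B leans towards class 1
    [0.45, 0.55],  # model C leans towards class 1
])

# Hard voting: each model casts one vote for its most likely class.
hard_votes = probas.argmax(axis=1)
hard_winner = int(np.bincount(hard_votes).argmax())

# Soft voting (equal weights): average the probabilities, then take the argmax.
mean_proba = probas.mean(axis=0)
soft_winner = int(mean_proba.argmax())

print("Hard voting picks class", hard_winner)
print("Soft voting picks class", soft_winner, "with mean probabilities", mean_proba.round(4))
```

One confident model can outweigh two hesitant ones under soft voting, so well-calibrated probabilities matter more than in hard voting.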

## Final Submission
We will retrain the ensemble model on the full training data and generate predictions for the test set.

In [None]:
print("Retraining Final Model on Full Data...")
voting_clf.fit(X, y)

print("Generating Predictions for Test Set...")
predictions = voting_clf.predict(X_test_submission)

submission_df = pd.DataFrame({'id': test['id'], 'sentiment': predictions})
submission_df.to_csv('submission.csv', index=False)
print("Submission saved to 'submission.csv'")
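
Before uploading, a few structural checks on the submission file can catch silent mistakes. The helper below is a hypothetical sketch (`check_submission` is not part of the notebook or of Kaggle's tooling), demonstrated on made-up ids.

```python
import pandas as pd

def check_submission(submission_df, test_ids):
    """Hypothetical helper: return a list of problems found (empty means OK)."""
    problems = []
    if len(submission_df) != len(test_ids):
        problems.append("row count differs from the test set")
    if not submission_df["id"].is_unique:
        problems.append("duplicate ids")
    if set(submission_df["id"]) != set(test_ids):
        problems.append("ids do not match the test set")
    if submission_df["sentiment"].isna().any():
        problems.append("missing predictions")
    return problems

# Made-up ids and predictions.
toy_ids = pd.Series([10, 11, 12])
good = pd.DataFrame({"id": toy_ids, "sentiment": [1, 0, 1]})
bad = pd.DataFrame({"id": [10, 11], "sentiment": [1, None]})

print("Good submission:", check_submission(good, toy_ids))
print("Bad submission: ", check_submission(bad, toy_ids))
```

In the notebook this would be called as `check_submission(submission_df, test['id'])` before writing the CSV.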