# IMDb Movie Review Sentiment Analysis - Complete Pipeline

## Overview
This notebook contains the complete end-to-end pipeline for IMDb movie review sentiment analysis:
1. Data Loading and Exploration
2. Data Preprocessing and Cleaning
3. Feature Engineering (TF-IDF, Word2Vec, Textual Features)
4. Model Development (Multiple Algorithms)
5. Model Evaluation and Visualization
6. Predictions on New Reviews

## Dataset
The dataset (`data/imdb_data.csv`) contains movie reviews in a `review` column, each with a binary label (positive/negative) in a `sentiment` column.


## Part 1: Setup and Import Libraries


In [None]:
# Data manipulation
import pandas as pd
import numpy as np
import os
import pickle
import time
import re
import warnings
warnings.filterwarnings('ignore')

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# Text processing
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer

# Machine Learning
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, roc_curve, precision_recall_curve, auc,
    confusion_matrix, classification_report
)

# Advanced ML
from xgboost import XGBClassifier
from gensim.models import Word2Vec
from scipy.sparse import hstack

# Download NLTK data (run once)
# 'punkt_tab' is required by word_tokenize in newer NLTK releases; older releases use 'punkt'.
nltk_resources = {
    'tokenizers/punkt': 'punkt',
    'tokenizers/punkt_tab': 'punkt_tab',
    'corpora/stopwords': 'stopwords',
    'corpora/wordnet': 'wordnet',
    'taggers/averaged_perceptron_tagger': 'averaged_perceptron_tagger',
}
for resource_path, package in nltk_resources.items():
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(package)

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 200)

# Create necessary directories
os.makedirs('data', exist_ok=True)
os.makedirs('models', exist_ok=True)
os.makedirs('models/visualizations', exist_ok=True)

print("✓ All libraries imported successfully!")
print("✓ NLTK data downloaded/verified!")
print("✓ Directories created!")


## Part 2: Data Loading and Exploration


In [None]:
# Load the dataset
df = pd.read_csv('data/imdb_data.csv')

print(f"✓ Dataset loaded successfully!")
print(f"Dataset shape: {df.shape}")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
print(f"Column names: {list(df.columns)}")


In [None]:
# Display first few rows
df.head(10)


In [None]:
# Display column information
df.info()


In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_percent = (missing_values / len(df)) * 100

missing_df = pd.DataFrame({
    'Column': missing_values.index,
    'Missing Count': missing_values.values,
    'Missing Percentage': missing_percent.values
})

missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

if len(missing_df) > 0:
    print("Missing Values Found:")
    print(missing_df)
else:
    print("✓ No missing values found in the dataset!")


In [None]:
# Check for duplicate rows
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

if duplicate_count > 0:
    print("\nRemoving duplicates...")
    df = df.drop_duplicates()
    print(f"✓ Dataset shape after removing duplicates: {df.shape}")
else:
    print("✓ No duplicates found!")


In [None]:
# Sentiment distribution
sentiment_counts = df['sentiment'].value_counts()
sentiment_percentages = df['sentiment'].value_counts(normalize=True) * 100

print("Sentiment Distribution:")
print(sentiment_counts)
print("\nSentiment Percentages:")
print(sentiment_percentages)

# Visualize sentiment distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Bar chart
sentiment_counts.plot(kind='bar', ax=axes[0], color=['#2ecc71', '#e74c3c'])
axes[0].set_title('Sentiment Distribution (Count)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Sentiment', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=0)

# Pie chart
sentiment_percentages.plot(kind='pie', ax=axes[1], autopct='%1.1f%%', colors=['#2ecc71', '#e74c3c'])
axes[1].set_title('Sentiment Distribution (Percentage)', fontsize=14, fontweight='bold')
axes[1].set_ylabel('')

plt.tight_layout()
plt.savefig('models/visualizations/sentiment_distribution.png', dpi=300, bbox_inches='tight')
plt.show()


In [None]:
# Calculate review length statistics
df['review_length'] = df['review'].str.len()
df['word_count'] = df['review'].str.split().str.len()

print("Review Length Statistics:")
print(df[['review_length', 'word_count']].describe())

# Visualize review length distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Character count distribution
df.boxplot(column='review_length', by='sentiment', ax=axes[0])
axes[0].set_title('Review Length (Characters) by Sentiment', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Sentiment', fontsize=10)
axes[0].set_ylabel('Character Count', fontsize=10)

# Word count distribution
df.boxplot(column='word_count', by='sentiment', ax=axes[1])
axes[1].set_title('Word Count by Sentiment', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Sentiment', fontsize=10)
axes[1].set_ylabel('Word Count', fontsize=10)

plt.suptitle('Review Length Analysis', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('models/visualizations/review_length_analysis.png', dpi=300, bbox_inches='tight')
plt.show()


In [None]:
# Create word clouds for positive and negative reviews
def create_wordcloud(text, title, ax):
    wordcloud = WordCloud(width=800, height=400, background_color='white', max_words=100).generate(text)
    ax.imshow(wordcloud, interpolation='bilinear')
    ax.set_title(title, fontsize=12, fontweight='bold')
    ax.axis('off')

# Combine all positive reviews
positive_text = ' '.join(df[df['sentiment'] == 'positive']['review'].astype(str))

# Combine all negative reviews
negative_text = ' '.join(df[df['sentiment'] == 'negative']['review'].astype(str))

# Create word clouds
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
create_wordcloud(positive_text, 'Positive Reviews Word Cloud', axes[0])
create_wordcloud(negative_text, 'Negative Reviews Word Cloud', axes[1])

plt.tight_layout()
plt.savefig('models/visualizations/wordclouds.png', dpi=300, bbox_inches='tight')
plt.show()


## Part 3: Text Preprocessing


In [None]:
# Initialize lemmatizer and stemmer
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """Clean text by removing HTML tags, special characters, and extra whitespace"""
    if pd.isna(text):
        return ""
    
    # Convert to string
    text = str(text)
    
    # Remove HTML tags (replace with a space so words around <br /> do not merge)
    text = re.sub(r'<[^>]+>', ' ', text)
    
    # Remove URLs
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    
    # Remove special characters and digits (keep only letters and spaces)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Convert to lowercase
    text = text.lower()
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

def tokenize_text(text):
    """Tokenize text into words"""
    return word_tokenize(text)

def remove_stopwords(tokens):
    """Remove stop words from tokens"""
    return [token for token in tokens if token not in stop_words]

def lemmatize_text(tokens):
    """Lemmatize tokens (convert to base form)"""
    return [lemmatizer.lemmatize(token) for token in tokens]

def preprocess_text(text, use_lemmatization=True, remove_stop=True):
    """Complete text preprocessing pipeline"""
    # Clean text
    text = clean_text(text)
    
    # Tokenize
    tokens = tokenize_text(text)
    
    # Remove stop words
    if remove_stop:
        tokens = remove_stopwords(tokens)
    
    # Lemmatize
    if use_lemmatization:
        tokens = lemmatize_text(tokens)
    
    # Join tokens back to string
    return ' '.join(tokens)

print("✓ Text preprocessing functions created!")
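
How HTML tags are stripped affects the tokens that reach the vectorizer. IMDb reviews often contain `<br /><br />` between sentences with no surrounding spaces. The sketch below uses a made-up review string and two hypothetical helpers (`strip_html`, `letters_only`) to compare deleting tags outright with replacing them by a space.

```python
import re

def strip_html(text, replacement):
    """Remove HTML tags, substituting `replacement` for each tag."""
    return re.sub(r'<[^>]+>', replacement, text)

def letters_only(text):
    """Keep letters and whitespace, lowercase, and squeeze repeated whitespace."""
    text = re.sub(r'[^a-zA-Z\s]', '', text).lower()
    return re.sub(r'\s+', ' ', text).strip()

# Made-up review text for illustration
raw = "Great film.<br /><br />Terrible ending."

joined = letters_only(strip_html(raw, ''))   # tags deleted outright
spaced = letters_only(strip_html(raw, ' '))  # tags replaced by a space

print(joined)  # great filmterrible ending
print(spaced)  # great film terrible ending
```

Deleting the tag fuses `film` and `terrible` into one out-of-vocabulary token, while the space keeps both words available as features.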


In [None]:
# Apply preprocessing to reviews
print("Starting text preprocessing...")
print("This may take a few minutes for large datasets...")

# Create a copy of the dataframe
df_processed = df.copy()

# Apply preprocessing (using lemmatization)
df_processed['cleaned_review'] = df_processed['review'].apply(
    lambda x: preprocess_text(x, use_lemmatization=True, remove_stop=True)
)

print("\n✓ Text preprocessing completed!")
print(f"\nSample original review:")
print(df_processed['review'].iloc[0][:200])
print(f"\nSample cleaned review:")
print(df_processed['cleaned_review'].iloc[0][:200])


## Part 4: Feature Engineering


In [None]:
# Extract various textual features
def extract_text_features(df):
    """Extract textual features from reviews"""
    features = df.copy()
    
    # Basic length features
    features['char_count'] = features['cleaned_review'].str.len()
    features['word_count'] = features['cleaned_review'].str.split().str.len()
    # Note: clean_text() strips punctuation, so this is 1 for every cleaned review
    features['sentence_count'] = features['cleaned_review'].str.split('.').str.len()
    
    # Approximate average word length (char_count includes spaces; +1 avoids division by zero)
    features['avg_word_length'] = features['char_count'] / (features['word_count'] + 1)
    
    # The counts below are taken on cleaned text (lowercase letters and spaces only),
    # so they are 0 for every row; they are kept so the feature layout stays unchanged.
    
    # Count of uppercase letters (if any remain after preprocessing)
    features['uppercase_count'] = features['cleaned_review'].str.findall(r'[A-Z]').str.len()
    
    # Count of digits (if any remain)
    features['digit_count'] = features['cleaned_review'].str.findall(r'\d').str.len()
    
    # Count of special characters
    features['special_char_count'] = features['cleaned_review'].str.findall(r'[^a-zA-Z0-9\s]').str.len()
    
    # Count of exclamation marks and question marks (sentiment indicators).
    # str.count() takes a regex, so '?' must be escaped; a bare '?' raises re.error.
    features['exclamation_count'] = features['cleaned_review'].str.count('!')
    features['question_count'] = features['cleaned_review'].str.count(r'\?')
    
    return features

# Extract features
df_features = extract_text_features(df_processed)

print("✓ Textual features extracted!")
print("\nFeature Statistics:")
print(df_features[['char_count', 'word_count', 'avg_word_length']].describe())
print("\nSample features:")
df_features[['cleaned_review', 'char_count', 'word_count', 'avg_word_length']].head()
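
Punctuation-based counts only carry signal if they are taken before punctuation is stripped. The sketch below is one possible variant and not the notebook's current behavior: a hypothetical helper `punctuation_features` counts the markers on the raw text, shown next to the same counts on a letters-only version of a made-up review. If this were adopted, the same raw-text computation would have to be used both when building the training features and inside the prediction function.

```python
import re

def punctuation_features(raw_text):
    """Count punctuation-style signals on the given text (hypothetical helper)."""
    return {
        'exclamation_count': raw_text.count('!'),
        'question_count': raw_text.count('?'),
        'uppercase_count': len(re.findall(r'[A-Z]', raw_text)),
        'digit_count': len(re.findall(r'\d', raw_text)),
    }

# Made-up review text for illustration
raw = "WOW! Best movie of 2023?! Absolutely."
# Letters-only, lowercased version, similar in spirit to the cleaning step
cleaned = re.sub(r'[^a-zA-Z\s]', '', raw).lower()

raw_counts = punctuation_features(raw)
cleaned_counts = punctuation_features(cleaned)

print(raw_counts)      # counts taken before cleaning
print(cleaned_counts)  # every count is 0 after cleaning
```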


In [None]:
# Initialize TF-IDF Vectorizer
# Using common parameters for text classification
tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,  # Top 5000 features
    ngram_range=(1, 2),  # Unigrams and bigrams
    min_df=2,  # Minimum document frequency
    max_df=0.95,  # Maximum document frequency (ignore very common words)
    sublinear_tf=True  # Apply sublinear tf scaling
)

# Fit and transform the cleaned reviews
print("Fitting TF-IDF vectorizer...")
X_tfidf = tfidf_vectorizer.fit_transform(df_features['cleaned_review'])

print(f"✓ TF-IDF matrix shape: {X_tfidf.shape}")
print(f"Number of features: {X_tfidf.shape[1]}")

# Get feature names
feature_names = tfidf_vectorizer.get_feature_names_out()
print(f"\nSample feature names (first 20):")
print(feature_names[:20])
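
To make the `sublinear_tf=True` setting concrete, the following sketch recomputes TF-IDF weights by hand for a made-up three-document corpus. It assumes scikit-learn's default formulas (smoothed IDF `ln((1 + n) / (1 + df)) + 1`, sublinear term frequency `1 + ln(tf)`, then L2 normalization) and ignores `max_features`, `min_df`, `max_df`, and bigrams for brevity. The corpus and the helper `tfidf_vector` are illustrative only.

```python
import math
from collections import Counter

# Made-up, already-cleaned documents
docs = [
    "good movie good cast",
    "bad movie",
    "good plot",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
df_counts = Counter(term for doc in tokenized for term in set(doc))

def tfidf_vector(tokens):
    """Sublinear TF times smoothed IDF, L2-normalised."""
    tf = Counter(tokens)
    weights = {
        term: (1 + math.log(count))
        * (math.log((1 + n_docs) / (1 + df_counts[term])) + 1)
        for term, count in tf.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.4f}")
```

In the first document, `good` occurs twice but its term-frequency factor is only `1 + ln(2)`, not 2, which is the dampening effect of sublinear scaling. The rarer term `cast` outweighs the more common `movie`.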


In [None]:
# Prepare tokenized sentences for Word2Vec
tokenized_reviews = [review.split() for review in df_features['cleaned_review']]

# Train Word2Vec model
print("Training Word2Vec model...")
word2vec_model = Word2Vec(
    sentences=tokenized_reviews,
    vector_size=100,  # Dimension of word vectors
    window=5,  # Context window size
    min_count=2,  # Minimum word frequency
    workers=4,  # Number of threads
    sg=0  # 0 for CBOW, 1 for Skip-gram
)

print(f"✓ Word2Vec model trained!")
print(f"Vocabulary size: {len(word2vec_model.wv.key_to_index)}")

# Create document vectors by averaging word vectors
def get_document_vector(words, model):
    """Get document vector by averaging word vectors"""
    vectors = [model.wv[word] for word in words if word in model.wv]
    if len(vectors) > 0:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(model.vector_size)

# Create document vectors
print("Creating document vectors...")
X_word2vec = np.array([get_document_vector(review, word2vec_model) for review in tokenized_reviews])

print(f"✓ Word2Vec matrix shape: {X_word2vec.shape}")
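
The averaging step can be checked on a toy example. The sketch below uses a hypothetical dictionary of 3-dimensional vectors in place of `word2vec_model.wv`, and shows both the mean over known words and the zero-vector fallback for a document with no known words.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for word2vec_model.wv
toy_vectors = {
    'great': np.array([1.0, 0.0, 2.0]),
    'movie': np.array([3.0, 2.0, 0.0]),
}
VECTOR_SIZE = 3

def average_vector(tokens, vectors, size):
    """Mean of the vectors of known tokens; zeros if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(size)
    return np.mean(known, axis=0)

doc_vec = average_vector(['great', 'unseenword', 'movie'], toy_vectors, VECTOR_SIZE)
empty_vec = average_vector(['unseenword'], toy_vectors, VECTOR_SIZE)

print(doc_vec)    # [2. 1. 1.]
print(empty_vec)  # [0. 0. 0.]
```

Unknown words are skipped rather than counted as zeros, so they do not pull the average toward the origin.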


In [None]:
# Extract textual features as numpy array
textual_features = df_features[['char_count', 'word_count', 'avg_word_length', 
                                 'exclamation_count', 'question_count']].values

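# Scale textual features to [0, 1]. MultinomialNB (trained later on these combined
# features) rejects negative inputs, which StandardScaler would produce.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(clip=True)  # clip keeps unseen reviews within [0, 1] at prediction time
textual_features_scaled = scaler.fit_transform(textual_features)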

print(f"Textual features shape: {textual_features_scaled.shape}")

# Combine TF-IDF with textual features (CSR format supports row indexing)
X_combined_tfidf = hstack([X_tfidf, textual_features_scaled]).tocsr()

print(f"✓ Combined TF-IDF + Textual features shape: {X_combined_tfidf.shape}")

# Combine Word2Vec with textual features
X_combined_word2vec = np.hstack([X_word2vec, textual_features_scaled])

print(f"✓ Combined Word2Vec + Textual features shape: {X_combined_word2vec.shape}")


In [None]:
# Encode target variable
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df_features['sentiment'])

print(f"✓ Target variable shape: {y.shape}")
print(f"Class distribution:")
print(pd.Series(y).value_counts())
print(f"\nLabel mapping:")
for i, label in enumerate(label_encoder.classes_):
    print(f"  {i}: {label}")


In [None]:
# Split both feature sets and the row indices in a single call so that
# every array shares exactly the same train/test partition
(X_train_tfidf, X_test_tfidf,
 X_train_word2vec, X_test_word2vec,
 train_indices, test_indices,
 y_train, y_test) = train_test_split(
    X_combined_tfidf, X_combined_word2vec, np.arange(len(y)), y,
    test_size=0.2, random_state=42, stratify=y
)

print("✓ Data split completed!")
print(f"Training set size (TF-IDF): {X_train_tfidf.shape}")
print(f"Test set size (TF-IDF): {X_test_tfidf.shape}")
print(f"Training labels: {y_train.shape}")
print(f"Test labels: {y_test.shape}")
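
In this notebook the TF-IDF vectorizer, Word2Vec model, and scaler are fitted on the full dataset before the split, so some information from the test rows reaches the fitted transformers. One common way to avoid this is to put the vectorizer and classifier in a scikit-learn `Pipeline`, so the vectorizer is refitted on the training portion of each fold. The sketch below illustrates the idea on a made-up toy corpus; it is not the notebook's method, and the texts, labels, and parameter choices are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up, already-cleaned reviews standing in for df_features['cleaned_review']
texts = [
    "great movie loved acting", "wonderful film brilliant cast",
    "excellent story great ending", "loved film wonderful story",
    "brilliant acting excellent movie", "great cast loved ending",
    "terrible movie boring plot", "awful film bad acting",
    "boring story terrible ending", "bad movie awful plot",
    "terrible acting boring film", "awful ending bad story",
]
labels = [1] * 6 + [0] * 6  # 1 = positive, 0 = negative

# The vectorizer is fitted inside each training fold only
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000, random_state=42),
)

scores = cross_val_score(pipeline, texts, labels, cv=3, scoring='f1_weighted')
print(f"Mean weighted F1 over {len(scores)} folds: {scores.mean():.3f}")
```

A fitted pipeline also accepts raw strings in `predict`, which keeps training and inference preprocessing in one object.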


## Part 5: Model Development


In [None]:
def train_and_evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    """Train a model and evaluate its performance"""
    print(f"\n{'='*60}")
    print(f"Training {model_name}...")
    print(f"{'='*60}")
    
    start_time = time.time()
    
    # Train the model
    model.fit(X_train, y_train)
    training_time = time.time() - start_time
    
    # Make predictions
    y_pred = model.predict(X_test)
    y_pred_proba = None
    
    # Get prediction probabilities if available
    if hasattr(model, 'predict_proba'):
        y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    roc_auc = None
    if y_pred_proba is not None:
        roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    # Print results
    print(f"\n{model_name} Results:")
    print(f"  Training Time: {training_time:.2f} seconds")
    print(f"  Accuracy: {accuracy:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall: {recall:.4f}")
    print(f"  F1-Score: {f1:.4f}")
    if roc_auc is not None:
        print(f"  ROC-AUC: {roc_auc:.4f}")
    
    return {
        'model': model,
        'model_name': model_name,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'roc_auc': roc_auc,
        'training_time': training_time,
        'y_pred': y_pred,
        'y_pred_proba': y_pred_proba
    }

print("✓ Model training function created!")
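
The `average='weighted'` setting used above computes precision, recall, and F1 per class and then averages them, weighting each class by its number of true examples. The following standard-library sketch works through that arithmetic on made-up labels; the helper names `binary_metrics` and `weighted_f1` are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    total = len(y_true)
    return sum(
        binary_metrics(y_true, y_pred, label)[2] * y_true.count(label) / total
        for label in sorted(set(y_true))
    )

# Made-up labels: 3 positive (1) and 5 negative (0) examples
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

f1_pos = binary_metrics(y_true, y_pred, positive=1)[2]  # 2/3
f1_neg = binary_metrics(y_true, y_pred, positive=0)[2]  # 4/5
overall = weighted_f1(y_true, y_pred)                   # (3 * 2/3 + 5 * 4/5) / 8

print(f"F1 positive: {f1_pos:.4f}, F1 negative: {f1_neg:.4f}, weighted: {overall:.4f}")
```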


In [None]:
# Store results
results_tfidf = {}

# 1. Logistic Regression
print("\n" + "="*60)
print("MODEL 1: Logistic Regression")
print("="*60)
lr_model = LogisticRegression(max_iter=1000, random_state=42, n_jobs=-1)
results_tfidf['Logistic Regression'] = train_and_evaluate_model(
    lr_model, X_train_tfidf, X_test_tfidf, y_train, y_test, "Logistic Regression"
)


In [None]:
# 2. Naive Bayes
print("\n" + "="*60)
print("MODEL 2: Naive Bayes")
print("="*60)
nb_model = MultinomialNB(alpha=1.0)
results_tfidf['Naive Bayes'] = train_and_evaluate_model(
    nb_model, X_train_tfidf, X_test_tfidf, y_train, y_test, "Naive Bayes"
)


In [None]:
# 3. Support Vector Machine (SVM)
print("\n" + "="*60)
print("MODEL 3: Support Vector Machine")
print("="*60)
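# Note: SVC training time grows quickly with the number of samples, and probability=True
# adds an internal cross-validation pass, so expect a long run on large datasets.
# A linear kernel is used to limit the cost.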
svm_model = SVC(kernel='linear', probability=True, random_state=42)
results_tfidf['SVM'] = train_and_evaluate_model(
    svm_model, X_train_tfidf, X_test_tfidf, y_train, y_test, "SVM"
)


In [None]:
# 4. Random Forest
print("\n" + "="*60)
print("MODEL 4: Random Forest")
print("="*60)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1, max_depth=20)
results_tfidf['Random Forest'] = train_and_evaluate_model(
    rf_model, X_train_tfidf, X_test_tfidf, y_train, y_test, "Random Forest"
)


In [None]:
# 5. XGBoost
print("\n" + "="*60)
print("MODEL 5: XGBoost")
print("="*60)
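# Convert sparse matrix to dense for XGBoost (memory-heavy: rows x up to 5,005 float64 values).
# XGBoost also accepts SciPy sparse input but then treats unstored zeros as missing,
# so use one format consistently for training and prediction.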
X_train_tfidf_dense = X_train_tfidf.toarray() if hasattr(X_train_tfidf, 'toarray') else X_train_tfidf
X_test_tfidf_dense = X_test_tfidf.toarray() if hasattr(X_test_tfidf, 'toarray') else X_test_tfidf

xgb_model = XGBClassifier(random_state=42, n_jobs=-1, eval_metric='logloss')
results_tfidf['XGBoost'] = train_and_evaluate_model(
    xgb_model, X_train_tfidf_dense, X_test_tfidf_dense, y_train, y_test, "XGBoost"
)


In [None]:
# Create comparison dataframe
comparison_data = {
    'Model': [],
    'Accuracy': [],
    'Precision': [],
    'Recall': [],
    'F1-Score': [],
    'ROC-AUC': [],
    'Training Time (s)': []
}

for model_name, result in results_tfidf.items():
    comparison_data['Model'].append(model_name)
    comparison_data['Accuracy'].append(result['accuracy'])
    comparison_data['Precision'].append(result['precision'])
    comparison_data['Recall'].append(result['recall'])
    comparison_data['F1-Score'].append(result['f1_score'])
    comparison_data['ROC-AUC'].append(result['roc_auc'] if result['roc_auc'] is not None else np.nan)
    comparison_data['Training Time (s)'].append(result['training_time'])

comparison_df = pd.DataFrame(comparison_data)
comparison_df = comparison_df.sort_values('F1-Score', ascending=False)

print("\n" + "="*80)
print("MODEL COMPARISON (TF-IDF Features)")
print("="*80)
print(comparison_df.to_string(index=False))

# Visualize comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 12))

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
for idx, metric in enumerate(metrics):
    ax = axes[idx // 2, idx % 2]
    comparison_df_sorted = comparison_df.sort_values(metric, ascending=True)
    ax.barh(comparison_df_sorted['Model'], comparison_df_sorted[metric], color='steelblue')
    ax.set_xlabel(metric, fontsize=12)
    ax.set_title(f'{metric} Comparison', fontsize=14, fontweight='bold')
    ax.set_xlim([0, 1])
    for i, v in enumerate(comparison_df_sorted[metric]):
        ax.text(v + 0.01, i, f'{v:.3f}', va='center', fontsize=10)

plt.tight_layout()
plt.savefig('models/visualizations/model_comparison_tfidf.png', dpi=300, bbox_inches='tight')
plt.show()


In [None]:
# Find best model
best_model_name = comparison_df.iloc[0]['Model']
print(f"\n✓ Best performing model: {best_model_name}")

# Hyperparameter tuning for best model
print(f"\nPerforming hyperparameter tuning for {best_model_name}...")

if best_model_name == 'Logistic Regression':
    param_grid = {
        'C': [0.1, 1, 10, 100],
        'penalty': ['l1', 'l2'],
        'solver': ['liblinear']
    }
    base_model = LogisticRegression(max_iter=1000, random_state=42)  # n_jobs has no effect with liblinear
    X_train_tuned = X_train_tfidf
    X_test_tuned = X_test_tfidf
    
elif best_model_name == 'Naive Bayes':
    param_grid = {
        'alpha': [0.1, 0.5, 1.0, 2.0]
    }
    base_model = MultinomialNB()
    X_train_tuned = X_train_tfidf
    X_test_tuned = X_test_tfidf
    
elif best_model_name == 'SVM':
    param_grid = {
        'C': [0.1, 1, 10],
        'kernel': ['linear', 'rbf']
    }
    base_model = SVC(probability=True, random_state=42)
    X_train_tuned = X_train_tfidf
    X_test_tuned = X_test_tfidf
    
elif best_model_name == 'Random Forest':
    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [10, 20, None],
        'min_samples_split': [2, 5]
    }
    base_model = RandomForestClassifier(random_state=42, n_jobs=-1)
    X_train_tuned = X_train_tfidf
    X_test_tuned = X_test_tfidf
    
elif best_model_name == 'XGBoost':
    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [3, 5, 7],
        'learning_rate': [0.01, 0.1, 0.2]
    }
    base_model = XGBClassifier(random_state=42, n_jobs=-1, eval_metric='logloss')
    X_train_tuned = X_train_tfidf_dense
    X_test_tuned = X_test_tfidf_dense

# Perform grid search
print("Running GridSearchCV (this may take a while)...")
grid_search = GridSearchCV(
    base_model, 
    param_grid, 
    cv=3, 
    scoring='f1_weighted',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train_tuned, y_train)

print(f"\n✓ Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

# Evaluate tuned model
best_tuned_model = grid_search.best_estimator_
y_pred_tuned = best_tuned_model.predict(X_test_tuned)
# Compute probabilities once and reuse them below
y_pred_proba_tuned = (
    best_tuned_model.predict_proba(X_test_tuned)[:, 1]
    if hasattr(best_tuned_model, 'predict_proba') else None
)

tuned_accuracy = accuracy_score(y_test, y_pred_tuned)
tuned_f1 = f1_score(y_test, y_pred_tuned, average='weighted')

print(f"\nTuned {best_model_name} Performance:")
print(f"  Accuracy: {tuned_accuracy:.4f}")
print(f"  F1-Score: {tuned_f1:.4f}")

# Update results
results_tfidf[f'{best_model_name} (Tuned)'] = {
    'model': best_tuned_model,
    'model_name': f'{best_model_name} (Tuned)',
    'accuracy': tuned_accuracy,
    'precision': precision_score(y_test, y_pred_tuned, average='weighted'),
    'recall': recall_score(y_test, y_pred_tuned, average='weighted'),
    'f1_score': tuned_f1,
    'roc_auc': roc_auc_score(y_test, y_pred_proba_tuned) if y_pred_proba_tuned is not None else None,
    'training_time': grid_search.refit_time_,  # seconds to refit the best estimator on the full training set
    'y_pred': y_pred_tuned,
    'y_pred_proba': y_pred_proba_tuned
}

# Update best model
best_model = best_tuned_model
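
The cost of the grid search depends on the number of parameter combinations times the number of folds. The sketch below counts the fits for two of the grids defined above with `cv=3`; the helper `grid_search_fits` is hypothetical, and `GridSearchCV` with its default `refit=True` performs one additional fit on the full training set afterwards.

```python
from math import prod

def grid_search_fits(param_grid, cv):
    """Return (number of candidate settings, number of cross-validation fits)."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * cv

lr_grid = {
    'C': [0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear'],
}
xgb_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
}

lr_fits = grid_search_fits(lr_grid, cv=3)
xgb_fits = grid_search_fits(xgb_grid, cv=3)

print(f"Logistic Regression: {lr_fits[0]} candidates, {lr_fits[1]} CV fits")
print(f"XGBoost: {xgb_fits[0]} candidates, {xgb_fits[1]} CV fits")
```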


In [None]:
# Save all models
for model_name, result in results_tfidf.items():
    # Clean model name for filename
    filename = model_name.lower().replace(' ', '_').replace('(', '').replace(')', '')
    filepath = f'models/{filename}.pkl'
    
    with open(filepath, 'wb') as f:
        pickle.dump(result['model'], f)
    
    print(f"Saved: {filepath}")

# Save best model separately
with open('models/best_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)

# Save vectorizers and encoders
with open('models/tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)

word2vec_model.save('models/word2vec_model.model')

with open('models/feature_scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

with open('models/label_encoder.pkl', 'wb') as f:
    pickle.dump(label_encoder, f)

print(f"\n✓ Best model saved: models/best_model.pkl ({best_model_name})")
print("✓ All models and vectorizers saved successfully!")
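
Saved artifacts are only useful if they can be loaded back together and in a consistent state. The sketch below shows a pickle round trip for a single bundle dictionary, using a temporary directory and made-up stand-in values in place of the fitted model, vectorizer, scaler, and encoder. The bundle layout and the file name `inference_bundle.pkl` are assumptions, not part of the notebook.

```python
import os
import pickle
import tempfile

# Made-up stand-ins for the fitted objects; in practice these would be
# the model, TF-IDF vectorizer, scaler and label encoder themselves
bundle = {
    'model_name': 'Logistic Regression',
    'feature_columns': ['char_count', 'word_count', 'avg_word_length',
                        'exclamation_count', 'question_count'],
    'tfidf_params': {'max_features': 5000, 'ngram_range': (1, 2)},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'inference_bundle.pkl')

    # Write every artifact in one file so they cannot drift apart
    with open(path, 'wb') as f:
        pickle.dump(bundle, f)

    # Load them back, as a separate inference script would
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored['model_name'])
print(restored['feature_columns'])
```

Pickle files should only be loaded from trusted sources, and fitted estimators are generally expected to be loaded with the same library versions that created them.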


## Part 6: Model Evaluation


In [None]:
# Make predictions with best model
y_pred = best_model.predict(X_test_tuned)
y_pred_proba = best_model.predict_proba(X_test_tuned)[:, 1] if hasattr(best_model, 'predict_proba') else None

print("✓ Predictions made successfully!")
print(f"Prediction shape: {y_pred.shape}")
if y_pred_proba is not None:
    print(f"Prediction probabilities shape: {y_pred_proba.shape}")


In [None]:
# Calculate all metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')
roc_auc = roc_auc_score(y_test, y_pred_proba) if y_pred_proba is not None else None

# Print metrics
print("="*60)
print("MODEL PERFORMANCE METRICS")
print("="*60)
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1-Score:  {f1:.4f}")
if roc_auc is not None:
    print(f"ROC-AUC:   {roc_auc:.4f}")
print("="*60)

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))


In [None]:
# Create confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Visualize confusion matrix
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
            xticklabels=label_encoder.classes_,
            yticklabels=label_encoder.classes_)
ax.set_xlabel('Predicted', fontsize=12, fontweight='bold')
ax.set_ylabel('Actual', fontsize=12, fontweight='bold')
ax.set_title('Confusion Matrix', fontsize=14, fontweight='bold')

# Add percentage annotations
cm_percent = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100
for i in range(len(label_encoder.classes_)):
    for j in range(len(label_encoder.classes_)):
        ax.text(j+0.5, i+0.7, f'({cm_percent[i,j]:.1f}%)',
                ha='center', va='center', fontsize=9, color='red', fontweight='bold')

plt.tight_layout()
plt.savefig('models/visualizations/confusion_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nConfusion Matrix:")
print(cm)
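
The percentage annotations above divide each row of the confusion matrix by that row's total, so every row sums to 100% and the diagonal entries are the per-class recall values. A small sketch with made-up counts follows; the helper `row_percentages` is hypothetical.

```python
def row_percentages(cm):
    """Convert each row of a confusion matrix to percentages of that row's total."""
    result = []
    for row in cm:
        total = sum(row)
        result.append([100.0 * value / total if total else 0.0 for value in row])
    return result

# Made-up counts: rows are actual classes, columns are predicted classes
cm_example = [
    [40, 10],  # actual negative: 40 correct, 10 predicted positive
    [5, 45],   # actual positive: 5 predicted negative, 45 correct
]

pct = row_percentages(cm_example)
print(pct)  # [[80.0, 20.0], [10.0, 90.0]]
```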


In [None]:
if y_pred_proba is not None:
    # Calculate ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    
    # Plot ROC curve
    plt.figure(figsize=(8, 6))
    plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.4f})')
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--', label='Random Classifier')
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate', fontsize=12, fontweight='bold')
    plt.ylabel('True Positive Rate', fontsize=12, fontweight='bold')
    plt.title('ROC Curve', fontsize=14, fontweight='bold')
    plt.legend(loc="lower right", fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('models/visualizations/roc_curve.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print(f"ROC-AUC Score: {roc_auc:.4f}")
else:
    print("ROC curve not available (model doesn't support probability predictions)")
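
The ROC-AUC can also be read as the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. The sketch below computes it that way on made-up scores, counting ties as half a win. The helper `pairwise_auc` is hypothetical and runs in quadratic time, so it is meant for understanding rather than for large test sets.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities for the positive class
y_true_example = [0, 0, 1, 1]
scores_example = [0.1, 0.4, 0.35, 0.8]

auc_value = pairwise_auc(y_true_example, scores_example)
print(auc_value)  # 0.75
```

Three of the four positive-negative pairs are ordered correctly here; only the pair (0.35, 0.4) is not.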


In [None]:
if y_pred_proba is not None:
    # Calculate Precision-Recall curve
    precision_vals, recall_vals, thresholds = precision_recall_curve(y_test, y_pred_proba)
    pr_auc = auc(recall_vals, precision_vals)
    
    # Plot Precision-Recall curve
    plt.figure(figsize=(8, 6))
    plt.plot(recall_vals, precision_vals, color='blue', lw=2, label=f'PR curve (AUC = {pr_auc:.4f})')
    plt.xlabel('Recall', fontsize=12, fontweight='bold')
    plt.ylabel('Precision', fontsize=12, fontweight='bold')
    plt.title('Precision-Recall Curve', fontsize=14, fontweight='bold')
    plt.legend(loc="lower left", fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('models/visualizations/precision_recall_curve.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print(f"Precision-Recall AUC Score: {pr_auc:.4f}")
else:
    print("Precision-Recall curve not available (model doesn't support probability predictions)")


In [None]:
# Get feature importance if available (for tree-based models)
if hasattr(best_model, 'feature_importances_'):
    feature_importances = best_model.feature_importances_
    
    # Get top 20 most important features (excluding textual features)
    n_tfidf_features = X_tfidf.shape[1]
    tfidf_importances = feature_importances[:n_tfidf_features]
    top_indices = np.argsort(tfidf_importances)[-20:][::-1]
    top_features = [(feature_names[i], tfidf_importances[i]) for i in top_indices]
    
    # Visualize top features
    features_df = pd.DataFrame(top_features, columns=['Feature', 'Importance'])
    
    plt.figure(figsize=(10, 8))
    sns.barplot(data=features_df, y='Feature', x='Importance', palette='viridis')
    plt.xlabel('Importance', fontsize=12, fontweight='bold')
    plt.ylabel('Feature', fontsize=12, fontweight='bold')
    plt.title('Top 20 Most Important Features', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.savefig('models/visualizations/feature_importance.png', dpi=300, bbox_inches='tight')
    plt.show()
    
    print("\nTop 20 Most Important Features:")
    print(features_df.to_string(index=False))
    
elif hasattr(best_model, 'coef_'):
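    # For linear models, use coefficients
    # (a linear SVC fitted on sparse input exposes coef_ as a sparse matrix)
    coef = best_model.coef_
    if hasattr(coef, 'toarray'):
        coef = coef.toarray()
    coef = np.asarray(coef)[0]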
    
    # Get top 20 features (positive and negative)
    n_tfidf_features = X_tfidf.shape[1]
    tfidf_coef = coef[:n_tfidf_features]
    
    top_positive_indices = np.argsort(tfidf_coef)[-10:][::-1]
    top_negative_indices = np.argsort(tfidf_coef)[:10]
    
    print("\nTop 10 Features for Positive Sentiment:")
    for idx in top_positive_indices:
        print(f"  {feature_names[idx]}: {tfidf_coef[idx]:.4f}")
    
    print("\nTop 10 Features for Negative Sentiment:")
    for idx in top_negative_indices:
        print(f"  {feature_names[idx]}: {tfidf_coef[idx]:.4f}")
else:
    print("Feature importance not available for this model type.")


In [None]:
# Analyze misclassified examples
df_results = pd.DataFrame({
    'review': df_processed['review'].iloc[test_indices].values,
    'sentiment': df_processed['sentiment'].iloc[test_indices].values,
    'predicted': label_encoder.inverse_transform(y_pred),
    'correct': (y_test == y_pred)
})

misclassified = df_results[df_results['correct'] == False]

print(f"Total misclassified reviews: {len(misclassified)}")
print(f"Misclassification rate: {len(misclassified) / len(df_results) * 100:.2f}%")

print("\nMisclassification breakdown:")
print(misclassified.groupby(['sentiment', 'predicted']).size().unstack(fill_value=0))

# Show some examples
print("\n" + "="*80)
print("SAMPLE MISCLASSIFIED REVIEWS")
print("="*80)

for idx, row in misclassified.head(10).iterrows():
    print(f"\nActual: {row['sentiment']} | Predicted: {row['predicted']}")
    print(f"Review: {row['review'][:200]}...")
    print("-" * 80)


## Part 7: Predictions on New Reviews


In [None]:
def predict_sentiment(review_text):
    """
    Predict sentiment for a given review text
    
    Parameters:
    -----------
    review_text : str
        The movie review text to analyze
    
    Returns:
    --------
    dict : Dictionary containing prediction, probability, and confidence
    """
    # Preprocess the text
    cleaned_text = preprocess_text(review_text)
    
    # Extract textual features
    char_count = len(cleaned_text)
    word_count = len(cleaned_text.split())
    avg_word_length = char_count / (word_count + 1) if word_count > 0 else 0
    exclamation_count = cleaned_text.count('!')
    question_count = cleaned_text.count('?')
    
    # Transform text using TF-IDF
    text_tfidf = tfidf_vectorizer.transform([cleaned_text])
    
    # Scale textual features
    textual_features = np.array([[char_count, word_count, avg_word_length, 
                                  exclamation_count, question_count]])
    textual_features_scaled = scaler.transform(textual_features)
    
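    # Combine features
    features = hstack([text_tfidf, textual_features_scaled]).tocsr()
    # XGBoost was trained on dense arrays; sparse input would treat zeros as missing
    if isinstance(best_model, XGBClassifier):
        features = features.toarray()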
    
    # Make prediction
    prediction = best_model.predict(features)[0]
    sentiment = label_encoder.inverse_transform([prediction])[0]
    
    # Get prediction probability if available
    if hasattr(best_model, 'predict_proba'):
        probabilities = best_model.predict_proba(features)[0]
        prob_dict = {label_encoder.classes_[i]: probabilities[i] 
                    for i in range(len(label_encoder.classes_))}
        confidence = max(probabilities)
    else:
        prob_dict = None
        confidence = None
    
    return {
        'sentiment': sentiment,
        'prediction': prediction,
        'probabilities': prob_dict,
        'confidence': confidence,
        'original_text': review_text,
        'cleaned_text': cleaned_text
    }

print("✓ Prediction function created!")


In [None]:
# Sample reviews for testing
sample_reviews = [
    "This movie is absolutely fantastic! The acting was superb and the storyline kept me engaged throughout. Highly recommended!",
    "I was really disappointed with this film. The plot was confusing and the characters were poorly developed. Not worth watching.",
    "The movie was okay. Nothing special, but not terrible either. It's a decent watch if you have nothing else to do.",
    "Amazing cinematography and brilliant performances by all actors. This is one of the best movies I've seen this year!",
    "Terrible movie. Boring plot, bad acting, and a complete waste of time. I would not recommend this to anyone."
]

print("="*80)
print("PREDICTING SENTIMENT FOR SAMPLE REVIEWS")
print("="*80)

for i, review in enumerate(sample_reviews, 1):
    result = predict_sentiment(review)
    
    print(f"\nReview {i}:")
    print(f"Text: {review[:100]}...")
    print(f"Predicted Sentiment: {result['sentiment'].upper()}")
    if result['probabilities']:
        print(f"Probabilities: {result['probabilities']}")
        print(f"Confidence: {result['confidence']:.2%}")
    print("-" * 80)
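
A binary classifier always picks one of the two labels, even for a lukewarm review like the third sample above. One possible extension, sketched here as an assumption and not part of the notebook, is to flag predictions whose top probability falls below a threshold. The helper `label_with_threshold` and the 0.6 cutoff are hypothetical; a suitable cutoff would need to be chosen on validation data.

```python
def label_with_threshold(probabilities, threshold=0.6):
    """Return (label, confidence), using 'uncertain' when confidence is below threshold."""
    best_label = max(probabilities, key=probabilities.get)
    confidence = probabilities[best_label]
    if confidence < threshold:
        return 'uncertain', confidence
    return best_label, confidence

# Made-up probability dictionaries in the shape returned by predict_sentiment
confident = label_with_threshold({'negative': 0.1, 'positive': 0.9})
borderline = label_with_threshold({'negative': 0.45, 'positive': 0.55})

print(confident)   # ('positive', 0.9)
print(borderline)  # ('uncertain', 0.55)
```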


## Summary

### Key Accomplishments:
1. ✅ **Data Loading & Exploration**: Loaded dataset, analyzed sentiment distribution, and explored review characteristics
2. ✅ **Text Preprocessing**: Cleaned text, removed HTML tags, tokenized, removed stopwords, and lemmatized
3. ✅ **Feature Engineering**: Created TF-IDF vectors, Word2Vec embeddings, and extracted textual features
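4. ✅ **Model Development**: Trained 5 different models (Logistic Regression, Naive Bayes, SVM, Random Forest, XGBoost) on the combined TF-IDF and textual features; the Word2Vec features were built and split but not used for training
5. ✅ **Model Evaluation**: Evaluated the best model with a confusion matrix, ROC and precision-recall curves, feature importance, and a review of misclassified examples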
6. ✅ **Predictions**: Created prediction function for new reviews

### Model Performance:
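- Best model selected based on weighted F1-Score on the test set
- Hyperparameter tuning performed on the best model with 3-fold GridSearchCV
- All models, the TF-IDF vectorizer, the Word2Vec model, the feature scaler, and the label encoder saved under `models/` for future use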

### Next Steps:
- Deploy model for production use
- Create API for real-time predictions
- Monitor model performance over time
- Retrain periodically with new data
