# IMDB Sentiment Analysis - Data Preprocessing

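This notebook cleans and tokenizes the IMDB movie reviews, converts them to TF-IDF features, encodes the sentiment labels, and saves a train/test split for model training.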

In [None]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import pickle
from src.utils import load_imdb_data, split_data, ensure_dir
from src.preprocessing import (
    clean_text, tokenize_text, preprocess_reviews,
    create_tfidf_features, encode_labels
)

import warnings
warnings.filterwarnings('ignore')

## 1. Load Data

In [None]:
# Load dataset
df = load_imdb_data('../data/IMDB Dataset.csv')
print(f"\nDataset shape: {df.shape}")
df.head()

Loading data from ../data/IMDB Dataset.csv...


Loaded 50000 reviews
Columns: ['review', 'sentiment']

Dataset shape: (50000, 2)


| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


## 2. Text Cleaning Examples

In [None]:
# Show cleaning examples
sample_review = df['review'].iloc[0]

print("Original review:")
print(sample_review[:500])
print("\n" + "="*80 + "\n")

cleaned = clean_text(sample_review)
print("Cleaned review:")
print(cleaned[:500])
print("\n" + "="*80 + "\n")

tokens = tokenize_text(cleaned)
print(f"Tokens (first 50): {tokens[:50]}")
print(f"\nTotal tokens: {len(tokens)}")

Original review:
One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ


Cleaned review:
one of the other reviewers has mentioned that after watching just oz episode you will be hooked they are right as this is exactly what happened with me the first thing that struck me about oz was its brutality and unflinching scenes of violence which set in right from the word go trust me this is not a show for the faint hearted or timid this show pulls no punches with regards to drugs sex or violence its is hardcore in the classic use of the word it is called
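
The helpers `clean_text` and `tokenize_text` live in `src.preprocessing` and their source is not shown here. Judging only from the output above (lowercased text, `<br />` tags gone, "you'll" expanded to "you will", the digit "1" dropped), a minimal sketch of similar behavior could look like the following. The names `clean_text_sketch`, `tokenize_sketch`, `CONTRACTIONS`, and `STOPWORDS` are hypothetical, and the tiny stopword set stands in for whatever list the project actually uses.

```python
import re

# Hypothetical suffix rules; the project's real contraction handling is not shown.
CONTRACTIONS = {
    "n't": " not",
    "'ll": " will",
    "'re": " are",
    "'ve": " have",
    "'m": " am",
    "'s": " is",  # crude: this also rewrites possessives
}

# Tiny illustrative stopword set, not the list used by src.preprocessing.
STOPWORDS = {"the", "a", "an", "of", "is", "this", "to", "and", "you", "will", "be", "just"}


def clean_text_sketch(text):
    """Lowercase, strip HTML tags, expand contractions, keep letters only."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)        # drop HTML tags such as <br />
    for suffix, expansion in CONTRACTIONS.items():
        text = text.replace(suffix, expansion)  # e.g. "you'll" -> "you will"
    text = re.sub(r"[^a-z\s]", " ", text)       # drop digits and punctuation
    return re.sub(r"\s+", " ", text).strip()    # collapse repeated whitespace


def tokenize_sketch(text):
    """Split on whitespace and drop stopwords."""
    return [token for token in text.split() if token not in STOPWORDS]


sample = "Just 1 Oz episode you'll be hooked.<br /><br />The first thing"
cleaned_sample = clean_text_sketch(sample)
print(cleaned_sample)
print(tokenize_sketch(cleaned_sample))
```

The suffix-based contraction rule is deliberately simple. A rule like `'s` to `is` cannot tell "it's" from a possessive, which matches the "petter mattei is love in the time of money" text that appears in the processed table later in this notebook.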

## 3. Preprocess All Reviews

In [None]:
# Apply preprocessing pipeline
df_processed = preprocess_reviews(df)
df_processed.head()

Cleaning text...


Tokenizing...


| | review | sentiment | cleaned_text | tokens | processed_text |
|---|---|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | one of the other reviewers has mentioned that ... | [one, reviewers, mentioned, watching, oz, epis... | one reviewers mentioned watching oz episode ho... |
| 1 | A wonderful little production. <br /><br />The... | positive | a wonderful little production the filming tech... | [wonderful, little, production, filming, techn... | wonderful little production filming technique ... |
| 2 | I thought this was a wonderful way to spend ti... | positive | i thought this was a wonderful way to spend ti... | [thought, wonderful, way, spend, time, hot, su... | thought wonderful way spend time hot summer we... |
| 3 | Basically there's a family where a little boy ... | negative | basically there is a family where a little boy... | [basically, family, little, boy, jake, thinks,... | basically family little boy jake thinks zombie... |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | petter mattei is love in the time of money is ... | [petter, mattei, love, time, money, visually, ... | petter mattei love time money visually stunnin... |


In [None]:
# Check preprocessing results
print("Sample processed texts:")
for i in range(3):
    print(f"\n{i+1}. Original length: {len(df['review'].iloc[i])} chars")
    print(f"   Processed length: {len(df_processed['processed_text'].iloc[i])} chars")
    print(f"   Token count: {len(df_processed['tokens'].iloc[i])}")
    print(f"   Sentiment: {df['sentiment'].iloc[i]}")

Sample processed texts:

1. Original length: 1761 chars
   Processed length: 1115 chars
   Token count: 164
   Sentiment: positive

2. Original length: 998 chars
   Processed length: 660 chars
   Token count: 86
   Sentiment: positive

3. Original length: 926 chars
   Processed length: 578 chars
   Token count: 85
   Sentiment: positive


## 4. Create TF-IDF Features

In [None]:
# Create TF-IDF feature matrix
X_tfidf, vectorizer = create_tfidf_features(
    df_processed['processed_text'].tolist(),
    max_features=5000,
    ngram_range=(1, 2),
    min_df=5,
    max_df=0.8
)

print(f"\nFeature matrix shape: {X_tfidf.shape}")
print(f"Matrix sparsity: {(1 - X_tfidf.nnz / (X_tfidf.shape[0] * X_tfidf.shape[1])):.2%}")

Creating TF-IDF features with max_features=5000, ngram_range=(1, 2)


Feature matrix shape: (50000, 5000)
Vocabulary size: 5000

Feature matrix shape: (50000, 5000)
Matrix sparsity: 98.36%
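
To put the sparsity figure in concrete terms, the short calculation below inverts the formula from the cell above, `1 - nnz / (rows * cols)`. It uses only the shape and the rounded sparsity printed in the output, so the results are approximations rather than exact counts from the matrix.

```python
# Values copied from the output above; sparsity is rounded to two decimals.
n_reviews, n_features = 50_000, 5_000
sparsity = 0.9836

total_cells = n_reviews * n_features
approx_nnz = (1 - sparsity) * total_cells        # stored (non-zero) entries
avg_features_per_review = approx_nnz / n_reviews

print(f"Total cells: {total_cells:,}")
print(f"Approx. non-zero entries: {approx_nnz:,.0f}")
print(f"Approx. non-zero features per review: {avg_features_per_review:.0f}")
```

This works out to roughly 4.1 million stored values out of 250 million cells, or about 82 active features per review, which is why a sparse matrix format is a practical choice here.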


In [None]:
# Show top features by TF-IDF score
feature_names = vectorizer.get_feature_names_out()
tfidf_scores = X_tfidf.sum(axis=0).A1
top_features_idx = tfidf_scores.argsort()[-20:][::-1]

print("Top 20 features by total TF-IDF score:")
for i, idx in enumerate(top_features_idx, 1):
    print(f"{i:2d}. {feature_names[idx]:20s} - Score: {tfidf_scores[idx]:.2f}")

Top 20 features by total TF-IDF score:
 1. movie                - Score: 2004.35
 2. film                 - Score: 1763.15
 3. one                  - Score: 1430.95
 4. like                 - Score: 1250.55
 5. good                 - Score: 1124.27
 6. would                - Score: 1096.33
 7. time                 - Score: 979.74
 8. see                  - Score: 968.50
 9. even                 - Score: 957.73
10. really               - Score: 956.71
11. story                - Score: 955.38
12. well                 - Score: 886.26
13. great                - Score: 872.34
14. bad                  - Score: 863.48
15. much                 - Score: 833.52
16. could                - Score: 832.50
17. people               - Score: 809.90
18. get                  - Score: 808.86
19. movies               - Score: 788.61
20. first                - Score: 771.80
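
Each score above is a column sum of per-review TF-IDF weights. The sketch below illustrates how such weights and totals can arise on a three-document toy corpus. It assumes smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization, which are scikit-learn's `TfidfVectorizer` defaults. Whether `create_tfidf_features` keeps those defaults is not shown in this notebook, and `tfidf_sketch` is a hypothetical name.

```python
import math
from collections import Counter


def tfidf_sketch(documents):
    """Return per-document TF-IDF weights (L2-normalized) and the IDF table."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    rows = []
    for tokens in tokenized:
        raw = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0  # guard empty docs
        rows.append({term: w / norm for term, w in raw.items()})
    return rows, idf


docs = ["great movie great cast", "bad movie", "great story"]
rows, idf = tfidf_sketch(docs)

# Column sums, analogous to X_tfidf.sum(axis=0) in the cell above.
totals = Counter()
for row in rows:
    totals.update(row)  # Counter.update adds the weights term by term
ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)

for term, score in ranked:
    print(f"{term:10s} {score:.3f}")
```

In this toy corpus the terms that occur in the most documents ("great", "movie") end up with the largest totals even though their IDF is the lowest. The same effect is consistent with generic words such as "movie" and "film" leading the ranking above.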


## 5. Encode Labels

In [None]:
# Encode sentiment labels
y = encode_labels(df['sentiment'])

print(f"Label shape: {y.shape}")
print(f"Positive samples: {y.sum()} ({y.mean():.1%})")
print(f"Negative samples: {len(y) - y.sum()} ({1 - y.mean():.1%})")

Label shape: (50000,)
Positive samples: 25000 (50.0%)
Negative samples: 25000 (50.0%)
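
The source of `encode_labels` is not included in this notebook. The sketch below shows one plausible mapping that matches the counts above and the convention stated in the summary (positive=1, negative=0). The name `encode_labels_sketch` is hypothetical, and it returns a plain list, whereas the real helper evidently returns an array with `.shape` and `.mean()`.

```python
LABEL_MAP = {"negative": 0, "positive": 1}


def encode_labels_sketch(sentiments):
    """Map sentiment strings to 0/1 and fail loudly on anything unexpected."""
    encoded = []
    for value in sentiments:
        key = str(value).strip().lower()
        if key not in LABEL_MAP:
            raise ValueError(f"Unexpected sentiment label: {value!r}")
        encoded.append(LABEL_MAP[key])
    return encoded


y_demo = encode_labels_sketch(["positive", "negative", "positive"])
print(y_demo)
print(f"Positive share: {sum(y_demo) / len(y_demo):.1%}")
```

Raising on unknown labels, instead of silently mapping them to a default, makes a mislabeled or partially loaded CSV visible at this step rather than during training.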


## 6. Train/Test Split

In [None]:
# Split data
X_train, X_test, y_train, y_test = split_data(
    X_tfidf, y,
    test_size=0.2,
    random_state=42
)

Splitting data: 80% train, 20% test
Train set: 40000 samples
Test set: 10000 samples
Train positive ratio: 50.00%
Test positive ratio: 50.00%
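
Identical positive ratios in both sets are what a stratified split produces. The `split_data` helper is not shown, and it may simply wrap scikit-learn's `train_test_split` with `stratify=y`. As an illustration of the idea only, the sketch below stratifies by splitting each class separately. The function name `stratified_split_indices` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, test_size=0.2, random_state=42):
    """Return (train_idx, test_idx), splitting every class in the same proportion."""
    rng = random.Random(random_state)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                       # seeded, so the split is repeatable
        n_test = round(len(indices) * test_size)   # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = [1] * 10 + [0] * 10
train_idx, test_idx = stratified_split_indices(labels)
print(f"Train: {len(train_idx)} samples, Test: {len(test_idx)} samples")
print(f"Test positive ratio: {sum(labels[i] for i in test_idx) / len(test_idx):.2%}")
```

The returned indices can be used to slice both the feature matrix and the label array, which keeps rows and labels aligned.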


## 7. Save Preprocessed Data

In [None]:
# Ensure models directory exists
ensure_dir('../models')

# Save vectorizer
with open('../models/tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(vectorizer, f)
print("Vectorizer saved to ../models/tfidf_vectorizer.pkl")

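# Save train/test splits
# Note: X_tfidf is a SciPy sparse matrix, so if split_data keeps that format,
# NumPy stores X_train/X_test as pickled 0-d object arrays. Reload them with
# np.load(path, allow_pickle=True) and call .item() on each entry.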
np.savez_compressed(
    '../models/train_test_data.npz',
    X_train=X_train,
    X_test=X_test,
    y_train=y_train,
    y_test=y_test
)
print("Train/test data saved to ../models/train_test_data.npz")

Vectorizer saved to ../models/tfidf_vectorizer.pkl


Train/test data saved to ../models/train_test_data.npz


## 8. Summary

### Preprocessing Steps Completed:
1. ✅ Text cleaning (lowercasing, HTML tag removal, contraction expansion, removal of digits and special characters)
2. ✅ Tokenization with stopword removal
3. ✅ TF-IDF vectorization (5000 features, unigrams + bigrams)
4. ✅ Label encoding (positive=1, negative=0)
5. ✅ Train/test split (80/20, stratified)

### Dataset Statistics:
- **Training samples**: 40,000
- **Test samples**: 10,000
- **Features**: 5,000 TF-IDF features
- **Matrix sparsity**: 98.36% (high sparsity is typical for text data)

### Next Steps:
- Train Naive Bayes classifier
- Train Logistic Regression classifier
- Evaluate and compare models