# Data Preprocessing - IMDb Movie Reviews Sentiment Analysis
## Introduction
This notebook focuses on **text preprocessing** for the IMDb Movie Reviews dataset, building upon the insights from the exploratory data analysis. The goal is to clean and transform the raw text data into formats suitable for machine learning models.

**Dataset:** IMDB Dataset of 50K Movie Reviews (Kaggle)

**Objective:** Clean, transform, and prepare text data for both traditional ML and deep learning approaches

**Author:** NGUYEN Ngoc Dang Nguyen - Final-year Student in Computer Science, Aix-Marseille University

**Preprocessing Pipeline:**
1. Load raw dataset and setup
2. Text cleaning functions implementation
3. Basic text preprocessing (HTML removal, normalization)
4. Advanced text preprocessing (tokenization, stopword removal)
5. Feature extraction for traditional ML (TF-IDF, Count Vectorizer)
6. Text preparation for deep learning (tokenization, padding)
7. Train/Validation/Test split with stratification
8. Save processed data and preprocessing artifacts

## 1. Load Libraries and Setup

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import string
import pickle
import os
import sys

# Add src to path
sys.path.append('../src')

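# Text processing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from bs4 import BeautifulSoup

# Fetch the NLTK resources used below; resources already on disk are not downloaded again.
# Recent NLTK releases look for 'punkt_tab' in word_tokenize, older ones for 'punkt'.
for resource in ('punkt', 'punkt_tab', 'stopwords', 'wordnet'):
    nltk.download(resource, quiet=True)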

# ML libraries
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import accuracy_score

# TensorFlow/Keras - compatible import
try:
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    from keras.utils import to_categorical
    print("Using standalone Keras")
except ImportError:
    try:
        from tensorflow.keras.preprocessing.text import Tokenizer
        from tensorflow.keras.preprocessing.sequence import pad_sequences
        from tensorflow.keras.utils import to_categorical
        print("Using tensorflow.keras")
    except ImportError:
        print("Warning: Could not import Keras preprocessing modules")

# Import from our modules
from config import *

print("Libraries imported successfully!")

## 2. Load Dataset

In [None]:
# Load the raw dataset
df = pd.read_csv('../data/raw/IMDB Dataset.csv')

print(f"Dataset loaded: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Quick overview
print(f"\nSentiment distribution:")
print(df['sentiment'].value_counts())

# Create binary labels for machine learning
df['label'] = df['sentiment'].map({'positive': 1, 'negative': 0})
print(f"\nLabel mapping: positive=1, negative=0")
print(df['label'].value_counts())

## 3. Text Cleaning Functions Implementation

In [None]:
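def remove_html_tags(text):
    """Remove HTML tags from text using BeautifulSoup"""
    soup = BeautifulSoup(text, 'html.parser')
    # Use a space separator so words on either side of a tag (e.g. <br />) are not glued together;
    # the extra whitespace is collapsed later in the pipeline
    cleaned = soup.get_text(separator=' ')
    return cleaned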

def remove_urls(text):
    """Remove URLs from text"""
    url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    return re.sub(url_pattern, '', text)

def remove_special_chars(text, keep_punctuation=False):
    """Remove special characters, optionally keep basic punctuation"""
    if keep_punctuation:
        # Keep basic punctuation: . , ! ? ; : ' -
        pattern = r'[^a-zA-Z0-9\s.,!?;:\'-]'
    else:
        # Remove all special characters except spaces
        pattern = r'[^a-zA-Z0-9\s]'
    return re.sub(pattern, '', text)

def normalize_text(text):
    """Basic text normalization"""
    # Convert to lowercase
    text = text.lower()
    # Remove extra whitespaces
    text = ' '.join(text.split())
    return text

def remove_numbers(text):
    """Remove standalone numbers"""
    return re.sub(r'\b\d+\b', '', text)

def expand_contractions(text):
    """Expand common English contractions"""
    contractions_dict = {
        "ain't": "are not", "aren't": "are not", "can't": "cannot",
        "couldn't": "could not", "didn't": "did not", "doesn't": "does not",
        "don't": "do not", "hadn't": "had not", "hasn't": "has not",
        "haven't": "have not", "he'd": "he would", "he'll": "he will",
        "he's": "he is", "i'd": "i would", "i'll": "i will", "i'm": "i am",
        "i've": "i have", "isn't": "is not", "it'd": "it would",
        "it'll": "it will", "it's": "it is", "let's": "let us",
        "shouldn't": "should not", "that's": "that is", "there's": "there is",
        "they'd": "they would", "they'll": "they will", "they're": "they are",
        "they've": "they have", "we'd": "we would", "we're": "we are",
        "we've": "we have", "weren't": "were not", "what's": "what is",
        "where's": "where is", "who's": "who is", "won't": "will not",
        "wouldn't": "would not", "you'd": "you would", "you'll": "you will",
        "you're": "you are", "you've": "you have"
    }
    
    for contraction, expansion in contractions_dict.items():
        text = re.sub(r'\b' + re.escape(contraction) + r'\b', expansion, text, flags=re.IGNORECASE)
    return text

def basic_text_cleaning(text):
    """Apply basic cleaning pipeline"""
    # Remove HTML tags
    text = remove_html_tags(text)
    # Remove URLs
    text = remove_urls(text)
    # Expand contractions
    text = expand_contractions(text)
    # Normalize text
    text = normalize_text(text)
    # Remove special characters (keep basic punctuation)
    text = remove_special_chars(text, keep_punctuation=True)
    # Remove extra whitespaces
    text = ' '.join(text.split())
    return text

print("Text cleaning functions defined successfully!")
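
The loop in `expand_contractions` runs one regex substitution per dictionary entry for every review. As an alternative sketch (not the implementation used in this notebook), the same table can be compiled into a single alternation and applied in one pass. The four-entry `CONTRACTIONS` table and the name `expand_contractions_single_pass` are illustrative assumptions; in practice the full dictionary above would be passed in.

```python
import re

# Illustrative subset of the contraction table defined above
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
    "don't": "do not",
}

# Build one alternation; longer keys go first so no key is shadowed by a shorter one
_CONTRACTION_RE = re.compile(
    r"\b("
    + "|".join(re.escape(key) for key in sorted(CONTRACTIONS, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions_single_pass(text):
    """Expand every known contraction with a single regex substitution."""
    # Lowercase the match before the lookup because the pattern ignores case
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


example = expand_contractions_single_pass("I can't say it's bad, but I Won't rewatch it.")
print(example)
```

As in the original function, the replacement text is always lowercase, which is harmless here because `normalize_text` lowercases everything afterwards.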

## 4. Apply Basic Text Preprocessing

In [None]:
print("APPLYING BASIC TEXT PREPROCESSING")
print("="*50)

# Apply basic cleaning
print("Step 1: Basic text cleaning...")
df['cleaned_text'] = df['review'].apply(basic_text_cleaning)

# Show examples of cleaning results
print("Cleaning examples:")
for i in range(2):
    print(f"\nExample {i+1}:")
    print("Original:")
    print(df.iloc[i]['review'][:200] + "...")
    print("Cleaned:")
    print(df.iloc[i]['cleaned_text'][:200] + "...")
    print("-" * 80)

# Compare lengths before and after cleaning
df['original_length'] = df['review'].str.len()
df['cleaned_length'] = df['cleaned_text'].str.len()

reduction_ratio = ((df['original_length'] - df['cleaned_length']) / df['original_length'] * 100).mean()
print(f"\nAverage text length reduction: {reduction_ratio:.1f}%")
print(f"Original average length: {df['original_length'].mean():.0f} characters")
print(f"Cleaned average length: {df['cleaned_length'].mean():.0f} characters")

## 5. Advanced Text Preprocessing for Traditional ML

In [None]:
print("\nADVANCED PREPROCESSING FOR TRADITIONAL ML")
print("="*50)

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Get English stopwords
stop_words = set(stopwords.words('english'))
print(f"Number of English stopwords: {len(stop_words)}")

def advanced_preprocessing_ml(text, remove_stopwords=True, apply_stemming=False, apply_lemmatization=True):
    """Advanced preprocessing for traditional ML models"""
    
    # Tokenize
    tokens = word_tokenize(text)
    
    # Remove non-alphabetic tokens and short words
    tokens = [token for token in tokens if token.isalpha() and len(token) > 2]
    
    # Remove stopwords
    if remove_stopwords:
        tokens = [token for token in tokens if token not in stop_words]
    
    # Apply stemming or lemmatization
    if apply_stemming:
        tokens = [stemmer.stem(token) for token in tokens]
    elif apply_lemmatization:
        tokens = [lemmatizer.lemmatize(token) for token in tokens]
    
    return ' '.join(tokens)

# Create different preprocessing versions for comparison
print("Creating different preprocessing versions...")

# Version 1: Basic cleaning + stopword removal + lemmatization
print("Processing ML Version 1: Basic + Stopwords removal + Lemmatization...")
df['text_ml_v1'] = df['cleaned_text'].apply(
    lambda x: advanced_preprocessing_ml(x, remove_stopwords=True, apply_lemmatization=True)
)

# Version 2: Basic cleaning + stemming (no stopword removal for comparison)
print("Processing ML Version 2: Basic + Stemming (keeping stopwords)...")
df['text_ml_v2'] = df['cleaned_text'].apply(
    lambda x: advanced_preprocessing_ml(x, remove_stopwords=False, apply_stemming=True)
)

# Version 3: Minimal preprocessing (just basic cleaning)
print("Processing ML Version 3: Minimal (just basic cleaning)...")
df['text_ml_minimal'] = df['cleaned_text']

# Compare word counts across versions
df['word_count_v1'] = df['text_ml_v1'].str.split().str.len()
df['word_count_v2'] = df['text_ml_v2'].str.split().str.len()
df['word_count_minimal'] = df['text_ml_minimal'].str.split().str.len()

# Statistics comparison
print("\nWord count statistics across preprocessing versions:")
versions_stats = pd.DataFrame({
    'Original': df['review'].str.split().str.len().describe(),
    'V1 (Lem+Stop)': df['word_count_v1'].describe(),
    'V2 (Stem)': df['word_count_v2'].describe(),
    'Minimal': df['word_count_minimal'].describe()
}).round(2)
print(versions_stats)
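
One point to keep in mind for Version 1: NLTK's English stopword list contains negation words such as `not`, `no`, and `nor`, and the basic cleaning step expands contractions like "didn't" into "did not". Stopword removal therefore turns "did not like" into "like". The sketch below shows one possible variant that keeps negations. It uses a tiny hand-written stopword set (`SAMPLE_STOPWORDS`, an assumption for illustration) instead of the NLTK list so that it runs without any downloads; `NEGATIONS` and `filter_tokens` are likewise hypothetical names.

```python
# Tiny illustrative stopword set; the notebook uses the full NLTK English list
SAMPLE_STOPWORDS = {"i", "the", "a", "did", "it", "was", "not", "no", "nor", "this"}

# Words we may want to protect from stopword and length filtering
NEGATIONS = {"not", "no", "nor", "never"}


def filter_tokens(tokens, keep_negations=True):
    """Drop stopwords and short tokens, optionally keeping negation words."""
    kept = []
    for token in tokens:
        if keep_negations and token in NEGATIONS:
            kept.append(token)
            continue
        # Same rule as advanced_preprocessing_ml: alphabetic, longer than 2 characters
        if token.isalpha() and len(token) > 2 and token not in SAMPLE_STOPWORDS:
            kept.append(token)
    return kept


tokens = "i did not like this movie".split()
without_negation = filter_tokens(tokens, keep_negations=False)
with_negation = filter_tokens(tokens, keep_negations=True)
print(without_negation)
print(with_negation)
```

Because the TF-IDF step also extracts bigrams, keeping `not` would allow features such as `not like` to form. Whether this improves accuracy on this dataset is an open question that would need to be checked on the validation set.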

## 6. Feature Extraction for Traditional ML

In [None]:
print("\nFEATURE EXTRACTION FOR TRADITIONAL ML")
print("="*50)

# Use V1 (lemmatized, stopwords removed) for feature extraction
text_for_ml = df['text_ml_v1']

# TF-IDF Vectorization
print("Creating TF-IDF features...")
tfidf_vectorizer = TfidfVectorizer(
    max_features=10000,      # Top 10K most important features
    ngram_range=(1, 2),      # Unigrams and bigrams
    min_df=2,                # Ignore terms appearing in less than 2 documents
    max_df=0.95,             # Ignore terms appearing in more than 95% of documents
    strip_accents='unicode',
    lowercase=True
)

# Fit and transform
tfidf_matrix = tfidf_vectorizer.fit_transform(text_for_ml)
print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"Matrix density: {tfidf_matrix.nnz / (tfidf_matrix.shape[0] * tfidf_matrix.shape[1]):.4f}")

# Count Vectorization
print("\nCreating Count Vectorizer features...")
count_vectorizer = CountVectorizer(
    max_features=10000,
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.95,
    strip_accents='unicode',
    lowercase=True
)

count_matrix = count_vectorizer.fit_transform(text_for_ml)
print(f"Count matrix shape: {count_matrix.shape}")

# Show top features
print("\nTop 20 TF-IDF features:")
feature_names = tfidf_vectorizer.get_feature_names_out()
tfidf_scores = tfidf_matrix.mean(axis=0).A1
top_features = sorted(zip(feature_names, tfidf_scores), key=lambda x: x[1], reverse=True)[:20]
for feature, score in top_features:
    print(f"{feature}: {score:.4f}")
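
To make the weighting concrete, here is a small pure-Python sketch of what TF-IDF computes under scikit-learn's default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. It deliberately leaves out n-grams, `min_df`, `max_df`, and `max_features`. The helper names `fit_idf` and `tfidf_vector` are hypothetical. The sketch learns the IDF weights from training documents only and then applies them to a new document, ignoring terms it has never seen.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn smoothed IDF weights from tokenized training documents."""
    n_docs = len(train_docs)
    # Document frequency: in how many documents each term appears at least once
    doc_freq = Counter(term for doc in train_docs for term in set(doc))
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(doc, idf):
    """Weight raw term counts by IDF and L2-normalize; unseen terms are ignored."""
    counts = Counter(term for term in doc if term in idf)
    weights = {term: count * idf[term] for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}


train_docs = [
    ["great", "movie"],
    ["terrible", "movie"],
    ["great", "acting", "great", "story"],
]
idf = fit_idf(train_docs)
vec = tfidf_vector(["great", "movie", "unseen"], idf)
print(idf)
print(vec)
```

Rare terms such as `terrible` receive a higher IDF than terms that appear in most documents, and the word `unseen` contributes nothing because it was not in the fitted vocabulary.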


## 7. Text Preparation for Deep Learning

In [None]:
print("\nTEXT PREPARATION FOR DEEP LEARNING")
print("="*50)

# For deep learning, we use less aggressive preprocessing
def preprocessing_for_dl(text):
    """Preprocessing for deep learning models (less aggressive)"""
    # Basic cleaning (already done)
    # Keep original structure more intact
    text = text.lower()
    # Remove numbers but keep text structure
    text = re.sub(r'\d+', '', text)
    # Remove extra whitespaces
    text = ' '.join(text.split())
    return text

# Prepare text for deep learning
print("Processing text for deep learning...")
df['text_dl'] = df['cleaned_text'].apply(preprocessing_for_dl)

# Tokenization for deep learning
print("Creating tokenizer for deep learning...")
tokenizer = Tokenizer(
    num_words=20000,          # Vocabulary size
    oov_token='<OOV>',        # Out of vocabulary token
    lower=True,
    split=' '
)

# Fit tokenizer
tokenizer.fit_on_texts(df['text_dl'])

# Convert texts to sequences
sequences = tokenizer.texts_to_sequences(df['text_dl'])

# Analyze sequence lengths
sequence_lengths = [len(seq) for seq in sequences]
seq_length_stats = pd.Series(sequence_lengths).describe()
print(f"\nSequence length statistics:")
print(seq_length_stats)

# Visualize sequence length distribution
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(sequence_lengths, bins=50, edgecolor='black', alpha=0.7)
plt.axvline(np.mean(sequence_lengths), color='red', linestyle='--', label=f'Mean: {np.mean(sequence_lengths):.0f}')
plt.axvline(np.percentile(sequence_lengths, 90), color='green', linestyle='--', label=f'90th percentile: {np.percentile(sequence_lengths, 90):.0f}')
plt.title('Distribution of Sequence Lengths')
plt.xlabel('Sequence Length')
plt.ylabel('Frequency')
plt.legend()
plt.grid(True)

plt.subplot(1, 2, 2)
plt.boxplot(sequence_lengths)
plt.title('Box Plot of Sequence Lengths')
plt.ylabel('Sequence Length')
plt.grid(True)

plt.tight_layout()
plt.show()

# Choose max sequence length (capture 90% of data)
MAX_SEQUENCE_LENGTH = int(np.percentile(sequence_lengths, 90))
print(f"\nChosen MAX_SEQUENCE_LENGTH: {MAX_SEQUENCE_LENGTH}")

# Pad sequences
print("Padding sequences...")
padded_sequences = pad_sequences(sequences, 
                                maxlen=MAX_SEQUENCE_LENGTH, 
                                padding='post', 
                                truncating='post')

print(f"Padded sequences shape: {padded_sequences.shape}")

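# Vocabulary info
# word_index lists every token seen during fitting; texts_to_sequences only
# emits indices below num_words, so the effective vocabulary is capped at 20000
total_vocab = len(tokenizer.word_index) + 1  # +1 for the padding index 0
vocab_size = min(total_vocab, 20000)
print(f"Distinct tokens seen: {total_vocab}")
print(f"Vocabulary size used by the sequences: {vocab_size}")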
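
The tokenization and padding steps above can be illustrated with a simplified pure-Python stand-in. This is a sketch of the idea, not the Keras implementation: words are ranked by frequency, index 0 is reserved for padding, index 1 for out-of-vocabulary words, and only indices below `num_words` are kept. The helper names `build_word_index`, `encode`, and `pad_post` are hypothetical.

```python
from collections import Counter

PAD_ID, OOV_ID = 0, 1  # 0 is reserved for padding, 1 for out-of-vocabulary words


def build_word_index(texts, num_words):
    """Rank words by frequency; only indices below num_words are kept."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = [word for word, _ in counts.most_common()]
    # Indices start at 2 because 0 and 1 are taken by padding and OOV
    return {word: i for i, word in enumerate(ranked, start=2) if i < num_words}


def encode(text, word_index):
    """Map each word to its index, falling back to the OOV index."""
    return [word_index.get(word, OOV_ID) for word in text.split()]


def pad_post(seq, maxlen):
    """Truncate from the end, then right-pad with PAD_ID."""
    seq = seq[:maxlen]
    return seq + [PAD_ID] * (maxlen - len(seq))


texts = ["good good good movie movie plot", "good movie twist"]
word_index = build_word_index(texts, num_words=4)
padded = [pad_post(encode(text, word_index), maxlen=5) for text in texts]
print(word_index)
print(padded)
```

With `truncating='post'`, long reviews keep their beginning and lose their end. If reviewers tend to state their verdict in the closing sentences, `truncating='pre'` may be worth comparing on the validation set.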


## 8. Train/Validation/Test Split

In [None]:
print("\nTRAIN/VALIDATION/TEST SPLIT")
print("="*50)

# First split: 80% train, 20% temp
X_train_text, X_temp_text, y_train, y_temp = train_test_split(
    df['text_ml_v1'], df['label'], 
    test_size=0.2, 
    random_state=RANDOM_STATE, 
    stratify=df['label']
)

# Second split: 10% validation, 10% test from the 20% temp
X_val_text, X_test_text, y_val, y_test = train_test_split(
    X_temp_text, y_temp, 
    test_size=0.5, 
    random_state=RANDOM_STATE, 
    stratify=y_temp
)

print(f"Training set size: {len(X_train_text)} ({len(X_train_text)/len(df)*100:.1f}%)")
print(f"Validation set size: {len(X_val_text)} ({len(X_val_text)/len(df)*100:.1f}%)")
print(f"Test set size: {len(X_test_text)} ({len(X_test_text)/len(df)*100:.1f}%)")

# Check stratification
print("\nClass distribution check:")
print("Training set:")
print(pd.Series(y_train).value_counts(normalize=True))
print("Validation set:")
print(pd.Series(y_val).value_counts(normalize=True))
print("Test set:")
print(pd.Series(y_test).value_counts(normalize=True))

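# Fit TF-IDF on the training split only, then transform validation and test,
# so that vocabulary and IDF statistics from held-out reviews do not leak into the features
print("\nFitting TF-IDF on the training split and transforming all splits...")
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_text)
X_val_tfidf = tfidf_vectorizer.transform(X_val_text)
X_test_tfidf = tfidf_vectorizer.transform(X_test_text)
# Refit the count vectorizer the same way so the saved artifact matches
count_vectorizer.fit(X_train_text)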

print(f"Train TF-IDF shape: {X_train_tfidf.shape}")
print(f"Validation TF-IDF shape: {X_val_tfidf.shape}")
print(f"Test TF-IDF shape: {X_test_tfidf.shape}")

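# Prepare deep learning data splits
# Reuse the row labels of the text splits as positions in padded_sequences
# (valid because df still has its default RangeIndex)
train_indices = X_train_text.index.to_numpy()
val_indices = X_val_text.index.to_numpy()
test_indices = X_test_text.index.to_numpy()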

# Extract corresponding padded sequences
X_train_seq = padded_sequences[train_indices]
X_val_seq = padded_sequences[val_indices]
X_test_seq = padded_sequences[test_indices]

print(f"\nDeep Learning data shapes:")
print(f"Train sequences: {X_train_seq.shape}")
print(f"Validation sequences: {X_val_seq.shape}")
print(f"Test sequences: {X_test_seq.shape}")
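
The two-step split relies on `stratify` to keep the class ratio the same in every subset. The pure-Python sketch below illustrates the idea on a toy label list: indices are grouped by class, shuffled, and each class is cut at the same 80/10/10 boundaries. It is not how `train_test_split` is implemented, and `stratified_split` and the default seed of 42 are assumptions for illustration (the notebook takes `RANDOM_STATE` from `config`).

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


labels = [0] * 50 + [1] * 50
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Working with index lists rather than the data itself mirrors the cell above, where the indices of the text splits are reused to select the matching rows of `padded_sequences`, so both model families see exactly the same reviews in each subset.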


## 9. Save Processed Data and Preprocessing Artifacts

In [None]:
print("\nSAVING PROCESSED DATA AND ARTIFACTS")
print("="*50)

# Create directories if they don't exist
os.makedirs('../data/processed', exist_ok=True)
os.makedirs('../models/preprocessors', exist_ok=True)

# Save processed datasets
print("Saving processed datasets...")

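# Traditional ML data
# Note: densifying is memory-hungry (about 1.6 GB for 40,000 x 10,000 float32 values),
# so cast the sparse matrices to float32 before calling toarray()
np.savez_compressed('../data/processed/traditional_ml_data.npz',
                    X_train=X_train_tfidf.astype(np.float32).toarray(),
                    X_val=X_val_tfidf.astype(np.float32).toarray(),
                    X_test=X_test_tfidf.astype(np.float32).toarray(),
                    y_train=y_train.values,
                    y_val=y_val.values,
                    y_test=y_test.values)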

# Deep learning data
np.savez_compressed('../data/processed/deep_learning_data.npz',
                    X_train=X_train_seq,
                    X_val=X_val_seq,
                    X_test=X_test_seq,
                    y_train=y_train.values,
                    y_val=y_val.values,
                    y_test=y_test.values)

# Save text data for reference
pd.DataFrame({
    'text': X_train_text,
    'label': y_train
}).to_csv('../data/processed/train_text.csv', index=False)

pd.DataFrame({
    'text': X_val_text,
    'label': y_val
}).to_csv('../data/processed/val_text.csv', index=False)

pd.DataFrame({
    'text': X_test_text,
    'label': y_test
}).to_csv('../data/processed/test_text.csv', index=False)

# Save preprocessing artifacts
print("Saving preprocessing artifacts...")

# Save vectorizers
with open('../models/preprocessors/tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)

with open('../models/preprocessors/count_vectorizer.pkl', 'wb') as f:
    pickle.dump(count_vectorizer, f)

# Save tokenizer
with open('../models/preprocessors/tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

# Save preprocessing parameters
preprocessing_config = {
    'max_sequence_length': MAX_SEQUENCE_LENGTH,
    'vocab_size': vocab_size,
    'tfidf_max_features': 10000,
    'random_state': RANDOM_STATE,
    'test_size': 0.2,
    'val_split': 0.5
}

with open('../models/preprocessors/preprocessing_config.pkl', 'wb') as f:
    pickle.dump(preprocessing_config, f)

# Save the full preprocessed dataframe for reference
print("Saving full preprocessed dataframe...")
df_processed = df[[
    'review', 'sentiment', 'label', 
    'cleaned_text', 'text_ml_v1', 'text_dl'
]].copy()
df_processed.to_csv('../data/processed/full_preprocessed_data.csv', index=False)

print("All data and artifacts saved successfully!")
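
Pickle is needed for the fitted vectorizers and the tokenizer, but `preprocessing_config` is a plain dictionary of numbers. One option is to also store it as JSON, which is human-readable and can be loaded without unpickling anything. The sketch below shows the round trip. The values 400 and 42 are placeholders rather than results from this notebook, and `save_config`, `load_config`, and the `.json` filename are hypothetical.

```python
import json
import os
import tempfile

# Placeholder values only; the notebook derives these at run time
preprocessing_config = {
    "max_sequence_length": 400,
    "vocab_size": 20000,
    "tfidf_max_features": 10000,
    "random_state": 42,
    "test_size": 0.2,
    "val_split": 0.5,
}


def save_config(config, path):
    """Write the configuration as indented, key-sorted JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2, sort_keys=True)


def load_config(path):
    """Read the configuration back into a dictionary."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# A temporary directory keeps the example from touching the project folders
config_path = os.path.join(tempfile.mkdtemp(), "preprocessing_config.json")
save_config(preprocessing_config, config_path)
restored = load_config(config_path)
print(restored)
```

JSON only accepts plain Python types, so NumPy integers would need to be converted first. `MAX_SEQUENCE_LENGTH` is already wrapped in `int()` when it is computed above.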

## Data Preprocessing Conclusion
This notebook implemented a text preprocessing pipeline for the IMDb reviews and produced datasets for both traditional ML and deep learning approaches. The main steps were HTML tag removal, contraction expansion, tokenization with optional stopword removal and lemmatization or stemming, TF-IDF feature extraction (up to 10K unigram and bigram features), and sequence padding to the 90th-percentile sequence length.

The pipeline generated stratified train/validation/test splits (80%/10%/10%) that preserve the balanced class distribution. The traditional ML data uses lemmatization and stopword removal, while the deep learning data keeps more of the original text structure and uses a tokenizer capped at 20K words. All processed datasets and preprocessing artifacts were saved for model training.