
# Part A – NLP Preprocessing and Feature Engineering

This notebook covers:
- Loading the 50,000 IMDb reviews.
- Inspecting raw text samples.
- Cleaning and normalizing text.
- Creating TF–IDF vectors for classical ML.
- Creating tokenized, padded sequences for deep learning.
- Reporting vocabulary size, average review length, and feature shapes.


In [2]:

# Imports and Global Config

import os
import re
import string
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# For tokenization / padding (you can choose another library if you prefer)
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

DATA_DIR = Path("../data")
RAW_IMDB_DIR = DATA_DIR / "aclImdb"  # update this if you use a different path

print("Data directory:", RAW_IMDB_DIR.resolve())


Data directory: C:\Users\Dell\Desktop\Group_2\imdb_sentiment_project\data\aclImdb



## 1. Load the IMDb Dataset

The IMDb Large Movie Review Dataset has:
- 25,000 labeled training reviews
- 25,000 labeled test reviews

Each review is either **positive (pos)** or **negative (neg)**.

Implement a loader that walks the `train/pos`, `train/neg`, `test/pos`, `test/neg`
directories, reads the text files, and stores them in a DataFrame.


In [3]:
def load_imdb_split(split_dir):
    """Load a single split (train or test) from aclImdb.

    Returns a DataFrame with columns: ['review', 'label', 'split']
    where label is 1 for positive, 0 for negative.
    """
    rows = []
    for label, label_int in [("pos", 1), ("neg", 0)]:
        path = split_dir / label
        for fname in sorted(path.glob("*.txt")):
            text = fname.read_text(encoding="utf-8")
            rows.append({
                "review": text,
                "label": label_int,
                "split": split_dir.name
            })
    return pd.DataFrame(rows)


# Load both splits and combine them into a single DataFrame
train_df = load_imdb_split(RAW_IMDB_DIR / "train")
test_df = load_imdb_split(RAW_IMDB_DIR / "test")
full_df = pd.concat([train_df, test_df], ignore_index=True)

full_df.head()


Unnamed: 0,review,label,split
0,Bromwell High is a cartoon comedy. It ran at t...,1,train
1,Homelessness (or Houselessness as George Carli...,1,train
2,Brilliant over-acting by Lesley Ann Warren. Be...,1,train
3,This is easily the most underrated film inn th...,1,train
4,This is not the typical Mel Brooks film. It wa...,1,train



## 2. Display Raw Samples and Dataset Size

- Print 10 raw example reviews.
- Describe number of reviews, class balance, and basic statistics.


In [4]:

# Summarize the dataset and print 10 raw samples (requires `full_df` from the previous cell)

if not full_df.empty:
    print("Total reviews:", len(full_df))
    print(full_df['label'].value_counts())
    print(full_df['split'].value_counts())

    # Show 10 raw samples
    sample_df = full_df.sample(10, random_state=42)
    for i, row in sample_df.iterrows():
        print("\n--- Sample", i, "---")
        print("Label:", row["label"])
        print(row["review"][:500], "...")  # print first 500 chars
else:
    print("full_df is empty. Load the dataset first.")


Total reviews: 50000
label
1    25000
0    25000
Name: count, dtype: int64
split
train    25000
test     25000
Name: count, dtype: int64

--- Sample 33553 ---
Label: 1
When I first saw the ad for this, I was like 'Oh here we go. He's done High School Musical, but he can't coast along on that so now he's making appearances on other Disney shows'. Personally, I love The Suite Life and I'm a big fan of Ashely Tisdale. But for some reason, I'm not too keen on Zac Efron, although all my friends think he's the best thing since Jesse McCartney. But he really annoys me. Anyway, I watched the show (taking a break from English coursework) and was pleasantly surprised. T ...

--- Sample 9427 ---
Label: 1
"A Girl's Folly" is a sort of half-comedy, half-mockumentary look at the motion picture business of the mid-1910's. We get a glimpse of life at an early movie studio, where we experience assembly of a set, running through a scene, handling of adoring movie fanatics, even lunch at the commissary. 


## 3. Text Cleaning

Apply standard preprocessing steps:
- Lowercasing
- Remove HTML tags
- Remove digits
- Remove punctuation
- Optional: remove stopwords, apply lemmatization

Define a `clean_text` function and apply it to the `review` column, storing the result in a `clean_review` column.
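
The optional stopword step is not implemented in this notebook. Below is a minimal sketch of how it could look, assuming text that has already been lowercased and stripped of punctuation. The `STOPWORDS` set and the `remove_stopwords` helper are hypothetical names, and the tiny word list is for illustration only; a fuller list such as scikit-learn's `ENGLISH_STOP_WORDS` would likely be used in practice.

```python
# Hypothetical, deliberately tiny stopword set for illustration only.
# Negations such as "not" are left out of the set on purpose, since they
# can carry sentiment.
STOPWORDS = {"a", "an", "the", "is", "it", "was", "this", "of", "and", "to"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Drop stopwords from an already cleaned, whitespace-separated string."""
    kept = [word for word in text.split() if word not in stopwords]
    return " ".join(kept)


cleaned = "this is not the typical mel brooks film it was"
print(remove_stopwords(cleaned))
```

Whether stopword removal helps a sentiment model is an empirical question, so it may be worth comparing features built with and without this step.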


In [10]:
import html

def clean_text(text):
    # Unescape HTML entities
    text = html.unescape(text)
    # Lowercase
    text = text.lower()
    # Remove HTML tags (simple regex)
    text = re.sub(r"<.*?>", " ", text)
    # Remove digits
    text = re.sub(r"\d+", " ", text)
    # Remove punctuation (characters are deleted, so "over-acting" becomes "overacting")
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse multiple spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text


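# Apply the cleaning function to every review
full_df["clean_review"] = full_df["review"].apply(clean_text)

print(full_df[["review", "clean_review"]].head())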



                                              review  \
0  Bromwell High is a cartoon comedy. It ran at t...   
1  Homelessness (or Houselessness as George Carli...   
2  Brilliant over-acting by Lesley Ann Warren. Be...   
3  This is easily the most underrated film inn th...   
4  This is not the typical Mel Brooks film. It wa...   

                                        clean_review  
0  bromwell high is a cartoon comedy it ran at th...  
1  homelessness or houselessness as george carlin...  
2  brilliant overacting by lesley ann warren best...  
3  this is easily the most underrated film inn th...  
4  this is not the typical mel brooks film it was...  



## 4. TF–IDF Vectors and Token Sequences

We need two representations of the text:
- **TF–IDF** for classical ML (Logistic Regression / SVM / Naive Bayes)
- **Tokenized, padded sequences** for deep learning (LSTM / GRU / CNN / etc.)
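
To make the TF–IDF representation concrete, here is a plain-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings (raw term counts, smoothed IDF, then L2 normalization per document). The toy corpus and the `tfidf` helper are assumptions for illustration; the notebook itself relies on scikit-learn, whose tokenization rules are not reproduced here.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses raw term counts, idf = ln((1 + n) / (1 + df)) + 1, and L2
    normalization, which mirrors the default weighting of TfidfVectorizer.
    """
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Toy corpus (hypothetical): "film" occurs in every document, "boring" in one
docs = ["great film", "boring film", "great great film"]
vectors = tfidf(docs)
print(vectors[1])
```

In the second toy document, the rarer term `boring` receives a larger weight than `film`, which appears in every document.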


In [8]:

# Parameters (you can tune)
MAX_TFIDF_FEATURES = 20000
MAX_NUM_WORDS = 20000
MAX_SEQUENCE_LENGTH = 200

if not full_df.empty:
    texts = full_df["clean_review"].tolist()
    labels = full_df["label"].values

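    # TF–IDF (fitted here on train + test together; for a leak-free test
    # evaluation, fit on the train split only and transform the test split)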
    tfidf_vectorizer = TfidfVectorizer(max_features=MAX_TFIDF_FEATURES)
    X_tfidf = tfidf_vectorizer.fit_transform(texts)

    # Tokenizer + sequences
    tokenizer = Tokenizer(num_words=MAX_NUM_WORDS, oov_token="<OOV>")
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)
    X_seq = pad_sequences(
        sequences,
        maxlen=MAX_SEQUENCE_LENGTH,
        padding="post",
        truncating="post",
    )

    print("TF-IDF shape:", X_tfidf.shape)
    print("Sequence shape:", X_seq.shape)
else:
    X_tfidf = None
    X_seq = None
    labels = None
    tokenizer = None
    tfidf_vectorizer = None
    print("Load the dataset first.")


TF-IDF shape: (50000, 20000)
Sequence shape: (50000, 200)
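
The effect of `padding="post"` and `truncating="post"` on a single review can be sketched in plain Python. The helper `pad_post` is a hypothetical stand-in for illustration, not the Keras implementation, and the token ids are made up.

```python
def pad_post(seq, maxlen, value=0):
    """Keep the first `maxlen` ids, then right-pad with `value`."""
    truncated = seq[:maxlen]  # truncating="post": cut the tail
    padding = [value] * (maxlen - len(truncated))  # padding="post": pad the end
    return truncated + padding


short_review = [12, 7, 305]
long_review = [4, 9, 2, 88, 15, 6]

print(pad_post(short_review, maxlen=5))
print(pad_post(long_review, maxlen=5))
```

If the closing sentences of long reviews matter for sentiment, `truncating="pre"` (which keeps the last `maxlen` tokens instead) is an alternative worth trying.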



## 5. Statistics

Report:
- Vocabulary size
- Average review length (in tokens) before padding
- TF–IDF matrix shape
- Sequence matrix shape


In [9]:

if not full_df.empty and tokenizer is not None:
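    # Vocabulary size: word_index holds every distinct token seen during fitting,
    # not only the most frequent ones kept under the num_words=MAX_NUM_WORDS limit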
    vocab_size = len(tokenizer.word_index)
    print("Vocabulary size (tokenizer.word_index):", vocab_size)

    # Average review length (before padding)
    raw_lengths = [len(seq) for seq in tokenizer.texts_to_sequences(full_df["clean_review"].tolist())]
    print("Average review length (tokens):", np.mean(raw_lengths))

    print("TF-IDF matrix shape:", X_tfidf.shape if X_tfidf is not None else None)
    print("Sequence matrix shape:", X_seq.shape if X_seq is not None else None)


Vocabulary size (tokenizer.word_index): 163229
Average review length (tokens): 226.93236
TF-IDF matrix shape: (50000, 20000)
Sequence matrix shape: (50000, 200)
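
The average length of about 227 tokens is above `MAX_SEQUENCE_LENGTH = 200`, so some reviews lose their tail during padding. Below is a sketch for quantifying this, assuming a list of per-review token counts such as `raw_lengths` from the cell above. The `lengths` values are made-up toy numbers, not results from the dataset.

```python
import statistics

MAX_SEQUENCE_LENGTH = 200

# Toy token counts (hypothetical); in the notebook, use raw_lengths instead
lengths = [120, 180, 200, 230, 260, 310, 95, 540, 150, 199]

# A review is truncated only if it is strictly longer than the limit
truncated = sum(length > MAX_SEQUENCE_LENGTH for length in lengths)
share_truncated = truncated / len(lengths)

print("Median length:", statistics.median(lengths))
print("Share of reviews truncated:", share_truncated)
```

Looking at the median and the truncated share alongside the mean can help decide whether a larger `MAX_SEQUENCE_LENGTH` is worth the extra memory.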



## 6. Save Processed Artifacts

Save TF–IDF features, sequences, labels, and vectorizers/tokenizers to disk,
so the other notebooks (ML, DL, RL) can load them directly.


In [10]:

PROCESSED_DIR = DATA_DIR / "processed"
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

if not full_df.empty and X_tfidf is not None and X_seq is not None:
    # Save numpy arrays
    np.save(PROCESSED_DIR / "X_seq.npy", X_seq)
    np.save(PROCESSED_DIR / "labels.npy", labels)

    # Save TF-IDF matrix as sparse
    from scipy import sparse
    sparse.save_npz(PROCESSED_DIR / "X_tfidf.npz", X_tfidf)

    # Save tokenizer and vectorizer using joblib
    import joblib
    joblib.dump(tokenizer, PROCESSED_DIR / "tokenizer.joblib")
    joblib.dump(tfidf_vectorizer, PROCESSED_DIR / "tfidf_vectorizer.joblib")

    # Save the cleaned dataframe (optional)
    full_df.to_csv(PROCESSED_DIR / "full_df_clean.csv", index=False)

    print("Saved processed features and objects to:", PROCESSED_DIR)
else:
    print("Nothing to save yet. Make sure data is loaded and processed.")


Saved processed features and objects to: ..\data\processed
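
A downstream notebook could load these files back roughly as follows. This is a sketch under assumptions: it writes small toy arrays to a temporary directory so that it runs on its own, whereas in the project the paths would point at `PROCESSED_DIR`.

```python
import tempfile
from pathlib import Path

import numpy as np
from scipy import sparse

with tempfile.TemporaryDirectory() as tmp:
    processed_dir = Path(tmp)  # stand-in for PROCESSED_DIR

    # Toy stand-ins for X_seq, labels, and X_tfidf
    np.save(processed_dir / "X_seq.npy", np.array([[1, 2, 0], [3, 0, 0]]))
    np.save(processed_dir / "labels.npy", np.array([1, 0]))
    sparse.save_npz(
        processed_dir / "X_tfidf.npz",
        sparse.csr_matrix([[0.0, 0.5], [1.0, 0.0]]),
    )

    # Loading mirrors saving: np.load for dense arrays, load_npz for sparse
    X_seq = np.load(processed_dir / "X_seq.npy")
    labels = np.load(processed_dir / "labels.npy")
    X_tfidf = sparse.load_npz(processed_dir / "X_tfidf.npz")

print(X_seq.shape, labels.shape, X_tfidf.shape)
```

The tokenizer and vectorizer written with `joblib.dump` can be restored with `joblib.load` in the same way. Rows keep the order of `full_df`, so the `split` column in `full_df_clean.csv` identifies which rows belong to the train and test sets.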
