
# IMDb ACL (aclImdb) — Cleaning & Preprocessing Pipeline

This notebook works with the **IMDb Large Movie Review Dataset (aclImdb)** organized as:

```
aclImdb/
  train/
    pos/*.txt
    neg/*.txt
  test/
    pos/*.txt
    neg/*.txt
```

It loads reviews, applies a cleaning pipeline, and saves a CSV with `review`, `cleaned_review`, and `sentiment`.


## 1) Setup

In [1]:

import os, re, glob
import pandas as pd
from tqdm import tqdm

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

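# Download NLTK resources: tokenizer models, stopword list, and WordNet data
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that newer NLTK releases look for (not needed on older ones)
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')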

STOPWORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DuaaHilal\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DuaaHilal\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DuaaHilal\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\DuaaHilal\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


## 2) Config

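In [2]:

IMDB_ROOT = r"aclImdb"        # path to the extracted dataset folder
SPLIT = "both"                # "train", "test", or "both"
SAMPLE_PER_SPLIT = None       # max rows to keep per split; None keeps every review
OUTPUT_PATH = "imdb_acl_cleaned.csv"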
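
Before loading, it can help to confirm that the folder under `IMDB_ROOT` really has the layout shown at the top of the notebook. The following is a minimal sketch using only the standard library; the helper name `count_review_files` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def count_review_files(root):
    """Return {(split, label): number of .txt files} for the aclImdb layout."""
    counts = {}
    for split in ("train", "test"):
        for label in ("pos", "neg"):
            folder = Path(root) / split / label
            # A missing folder is reported as 0 instead of raising an error
            if folder.is_dir():
                counts[(split, label)] = len(list(folder.glob("*.txt")))
            else:
                counts[(split, label)] = 0
    return counts
```

Calling `count_review_files(IMDB_ROOT)` on a complete copy of the dataset should report the same number of files in each of the four folders; a zero usually means the archive was extracted to a different path.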

## 3) Load Dataset

In [3]:

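def load_acl_imdb(root_dir, split="train", sample_per_split=None):
    """Load aclImdb reviews into a DataFrame with review, sentiment, and split columns.

    split: "train", "test", or "both".
    sample_per_split: if set, keep at most this many rows per split. Rows are
    sampled across both labels, so the result is not guaranteed to be balanced.
    """
    splits = ["train", "test"] if split == "both" else [split]
    rows = []

    for sp in splits:
        for label in ["pos", "neg"]:
            folder = os.path.join(root_dir, sp, label)
            if not os.path.isdir(folder):
                raise FileNotFoundError(f"Folder not found: {folder}")
            # sorted() keeps the row order independent of the filesystem
            for path in sorted(glob.glob(os.path.join(folder, "*.txt"))):
                # errors="ignore" silently drops bytes that are not valid UTF-8
                with open(path, "r", encoding="utf-8", errors="ignore") as f:
                    text = f.read()
                rows.append({
                    "review": text,
                    "sentiment": "positive" if label == "pos" else "negative",
                    "split": sp
                })

    df = pd.DataFrame(rows)

    if sample_per_split:
        dfs = []
        for sp in df["split"].unique():
            sub = df[df["split"] == sp]
            # Only downsample splits that exceed the requested size
            if len(sub) > sample_per_split:
                sub = sub.sample(sample_per_split, random_state=42)
            dfs.append(sub)
        df = pd.concat(dfs, ignore_index=True)
    return df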
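
The sampling step in `load_acl_imdb` draws from each split as a whole, so a small `SAMPLE_PER_SPLIT` can return uneven numbers of positive and negative reviews. If balanced samples matter, one option is to sample per `(split, sentiment)` group instead. The sketch below works on a list of dicts shaped like `rows` and uses only the standard library; `sample_balanced` and its `per_group` parameter are hypothetical names, not part of the original notebook.

```python
import random
from collections import defaultdict

def sample_balanced(rows, per_group, seed=42):
    """Keep at most `per_group` rows for each (split, sentiment) pair."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["split"], row["sentiment"])].append(row)

    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    sampled = []
    for key in sorted(groups):
        members = groups[key]
        if len(members) > per_group:
            members = rng.sample(members, per_group)
        sampled.extend(members)
    return sampled

# Toy rows shaped like the dicts built in load_acl_imdb: 20 positive, 10 negative
rows = [
    {"review": f"r{i}", "sentiment": "positive" if i % 3 else "negative", "split": "train"}
    for i in range(30)
]
subset = sample_balanced(rows, per_group=5)
print(len(subset))  # 10: five rows per label
```

The result can be passed to `pd.DataFrame` in the same way as the unsampled `rows` list.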


## 4) Cleaning Function

In [4]:

HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w\.-]+@[\w\.-]+\.\w+\b")
NON_ALPHA_RE = re.compile(r"[^A-Za-z\s]")
MULTISPACE_RE = re.compile(r"\s+")

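def clean_review(text: str) -> str:
    """Lowercase, strip markup and non-letters, drop stopwords, and lemmatize."""
    if not isinstance(text, str):
        return ""
    x = text.lower()
    # Replace HTML tags (e.g. <br />), URLs, and email addresses with spaces
    x = HTML_TAG_RE.sub(" ", x)
    x = URL_RE.sub(" ", x)
    x = EMAIL_RE.sub(" ", x)
    # Drop digits, punctuation, emoji, and any other non-ASCII-letter character
    x = NON_ALPHA_RE.sub(" ", x)
    tokens = word_tokenize(x)
    # Remove stopwords and tokens of one or two characters
    tokens = [t for t in tokens if t not in STOPWORDS and len(t) > 2]
    # lemmatize() treats every token as a noun by default, so verbs such as "loved" are unchanged
    tokens = [LEMMATIZER.lemmatize(t) for t in tokens]
    cleaned = " ".join(tokens)
    return MULTISPACE_RE.sub(" ", cleaned).strip()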

# Demo
print(clean_review("I LOVED this movie!! <br /> Email me at a@b.com 🤩"))


loved movie email
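
To see what each regex stage contributes before tokenization, the demo string can be traced step by step. This is a minimal sketch that copies the patterns from the cell above and needs no NLTK data; the helper `trace_regex_stages` is hypothetical, and it stops before stopword removal and lemmatization.

```python
import re

# Same patterns as in the cleaning cell
HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w\.-]+@[\w\.-]+\.\w+\b")
NON_ALPHA_RE = re.compile(r"[^A-Za-z\s]")

def trace_regex_stages(text):
    """Return (stage name, text after that stage) pairs for the regex steps."""
    x = text.lower()
    stages = [("lower", x)]
    for name, pattern in [
        ("html", HTML_TAG_RE),
        ("url", URL_RE),
        ("email", EMAIL_RE),
        ("non_alpha", NON_ALPHA_RE),
    ]:
        x = pattern.sub(" ", x)
        stages.append((name, x))
    # Collapse the runs of spaces left behind by the substitutions
    stages.append(("collapsed", " ".join(x.split())))
    return stages

for name, value in trace_regex_stages("I LOVED this movie!! <br /> Email me at a@b.com 🤩"):
    # !a escapes non-ASCII characters so the emoji shows up as a code point
    print(f"{name:>9}: {value!a}")
```

The last stage yields `i loved this movie email me at`; the stopword filter in `clean_review` then removes `i`, `this`, `me`, and `at`, which gives the `loved movie email` shown above.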


## 5) Run Pipeline

In [6]:

df = load_acl_imdb(IMDB_ROOT, split=SPLIT, sample_per_split=SAMPLE_PER_SPLIT)
print("Rows loaded:", len(df))

tqdm.pandas()
df["cleaned_review"] = df["review"].progress_apply(clean_review)

out = df[["review", "cleaned_review", "sentiment"]]
out.to_csv(OUTPUT_PATH, index=False)

print(f"✅ Saved cleaned dataset to: {OUTPUT_PATH}")
out.head()


Rows loaded: 50000


100%|██████████| 50000/50000 [03:23<00:00, 245.47it/s]


✅ Saved cleaned dataset to: imdb_acl_cleaned.csv


|   | review | cleaned_review | sentiment |
|---|---|---|---|
| 0 | Bromwell High is a cartoon comedy. It ran at t... | bromwell high cartoon comedy ran time program ... | positive |
| 1 | Homelessness (or Houselessness as George Carli... | homelessness houselessness george carlin state... | positive |
| 2 | Brilliant over-acting by Lesley Ann Warren. Be... | brilliant acting lesley ann warren best dramat... | positive |
| 3 | This is easily the most underrated film inn th... | easily underrated film inn brook cannon sure f... | positive |
| 4 | This is not the typical Mel Brooks film. It wa... | typical mel brook film much less slapstick mov... | positive |
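
After the file is written, a quick read-back can confirm the row count and flag reviews whose cleaned text ended up empty (for example, a review made up entirely of stopwords). The sketch below uses the standard-library `csv` module rather than pandas so that empty strings are read back as empty strings; `summarize_cleaned_csv` is a hypothetical helper, not part of the original notebook.

```python
import csv
from collections import Counter

def summarize_cleaned_csv(path):
    """Count rows, empty cleaned reviews, and labels in the saved CSV."""
    total = 0
    empty = 0
    labels = Counter()
    # newline="" lets the csv module handle line breaks inside quoted reviews
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            total += 1
            if not row["cleaned_review"].strip():
                empty += 1
            labels[row["sentiment"]] += 1
    return {"rows": total, "empty_cleaned": empty, "labels": dict(labels)}
```

For the run above, `summarize_cleaned_csv(OUTPUT_PATH)["rows"]` should match the `Rows loaded` figure printed by the pipeline cell.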


## 6) Before/After Examples

In [7]:

examples = out.sample(5, random_state=42)
examples


|   | review | cleaned_review | sentiment |
|---|---|---|---|
| 33553 | When I first saw the ad for this, I was like '... | first saw like done high school musical coast ... | positive |
| 9427 | "A Girl's Folly" is a sort of half-comedy, hal... | girl folly sort half comedy half mockumentary ... | positive |
| 199 | I started watching the show from the first sea... | started watching show first season beginning p... | positive |
| 12447 | This is a more interesting than usual porn mov... | interesting usual porn movie fantasy adventure... | positive |
| 39489 | I suppose for 1961 this film was supposed to b... | suppose film supposed cool looking back year c... | negative |
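
One side effect worth noticing in these before/after pairs is that stopword removal also discards negations: in the preview in section 5, "This is not the typical Mel Brooks film" becomes `typical mel brook film`. If negation matters for the downstream sentiment model, one option is to take negation words out of the stopword set. The sketch below illustrates the idea with a small stand-in stopword list so it runs without NLTK data; `TOY_STOPWORDS`, `NEGATIONS`, and `filter_tokens` are assumptions made for illustration, and lemmatization is left out.

```python
import re

# Stand-in stopword list for illustration only; the notebook uses NLTK's English list
TOY_STOPWORDS = {"this", "is", "not", "the", "a", "it", "was", "no"}
# Hypothetical choice of negation words to keep
NEGATIONS = {"not", "no", "nor"}

def filter_tokens(text, stopwords, min_len=3):
    """Lowercase, keep alphabetic tokens, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]

sentence = "This is not the typical Mel Brooks film."
print(filter_tokens(sentence, TOY_STOPWORDS))
# ['typical', 'mel', 'brooks', 'film']
print(filter_tokens(sentence, TOY_STOPWORDS - NEGATIONS))
# ['not', 'typical', 'mel', 'brooks', 'film']
```

In the notebook itself the equivalent change would be to filter against `STOPWORDS - NEGATIONS` inside `clean_review`. The length filter `len(t) > 2` would still drop the two-letter word "no" unless it is relaxed as well.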
