#### **As a Data Scientist, I want to clean and preprocess IMDb movie reviews, so that the text is standardized, noise-free, and ready for sentiment analysis.**

    
### **Acceptance Criteria**

## **1. Dataset**

1.1. Use the IMDb Movie Reviews Dataset:  

1.2. Dataset contains 50,000 labeled movie reviews (positive/negative).

1.3. Focus on the review text.

## **2. Cleaning Steps**

2.1. Convert text to lowercase.

2.2. Remove HTML tags (many reviews contain `<br />`).

2.3. Remove URLs and email addresses.

2.4. Remove punctuation, numbers, and emojis.

2.5. Remove stopwords (NLTK/spaCy).

2.6. Perform lemmatization (reduce words to their base form).

2.7. Keep only meaningful tokens (length > 2).
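
The steps above can be approximated with the standard library alone. This is a minimal sketch, not the notebook's pipeline: `basic_clean` and its regexes are hypothetical, tags are stripped with a simple regex instead of an HTML parser, and steps 2.5 and 2.6 (stopwords, lemmatization) are left out because they need NLTK or spaCy.

```python
import re

# Assumed patterns for illustration only
TAG_RE = re.compile(r"<[^>]+>")                                   # step 2.2
URL_EMAIL_RE = re.compile(r"https?://\S+|www\.\S+|[\w.-]+@[\w.-]+")  # step 2.3
NON_LETTER_RE = re.compile(r"[^a-z\s]")                           # step 2.4


def basic_clean(text, min_len=3):
    """Lowercase, strip tags/URLs/emails/non-letters, keep tokens of min_len+ chars."""
    text = text.lower()                  # step 2.1
    text = TAG_RE.sub(" ", text)         # replace tags with a space so words stay apart
    text = URL_EMAIL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)  # punctuation, digits, and emojis
    return [tok for tok in text.split() if len(tok) >= min_len]  # step 2.7


sample = "Great movie!<br /><br />See http://example.com or mail me@example.com. 10/10 😀"
result = basic_clean(sample)
print(result)  # ['great', 'movie', 'see', 'mail']
```

Replacing each removed span with a space rather than an empty string keeps neighbouring words from being fused into one token.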

## **3. Deliverables**

3.1. A cleaned dataset with original review and cleaned review text.

3.2. A function clean_review(text) that applies the pipeline.

3.3. At least 5 before/after cleaning examples.

In [22]:
import pandas as pd
import os
import random
import sys
from bs4 import BeautifulSoup
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [23]:
data_dir = "Data/aclImdb"

In [24]:
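def load_imdb_with_rating_v2(base_dir, subset="train", sample_size=5000):
    """Load a random sample of reviews per class; id and rating come from the filename."""
    data_records = []

    categories = {"pos": 1, "neg": 0}  # map folder names to binary labels

    for label, label_value in categories.items():
        path = os.path.join(base_dir, subset, label)
        # keep only review files so stray entries cannot break filename parsing
        candidates = [name for name in os.listdir(path) if name.endswith(".txt")]
        # cap the sample size: random.sample raises ValueError if it exceeds the population
        filenames = random.sample(candidates, min(sample_size, len(candidates)))

        for filename in filenames:
            # filename looks like "12345_7.txt" -> id 12345, rating 7
            file_id, rating_str = filename.split("_")
            rating = int(rating_str.split(".")[0])  # extract rating number

            file_path = os.path.join(path, filename)
            with open(file_path, encoding="utf-8") as f:
                review_text = f.read()

            data_records.append({
                "id": int(file_id),
                "rating": rating,
                "txt": review_text,
                "label": label_value
            })

    return pd.DataFrame(data_records)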
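
The loader relies on filenames of the form `12345_7.txt`. As a small illustration of that parsing step in isolation, here is a sketch with a hypothetical helper, `parse_review_filename`, which is not part of the notebook:

```python
import os


def parse_review_filename(filename):
    """Split '<id>_<rating>.txt' into (id, rating) as integers."""
    stem, _ext = os.path.splitext(filename)  # "12345_7.txt" -> "12345_7"
    file_id, rating = stem.split("_")        # raises ValueError on unexpected names
    return int(file_id), int(rating)


print(parse_review_filename("12345_7.txt"))  # (12345, 7)
```

Letting the helper raise on unexpected names makes a stray file in the folder fail loudly instead of producing a wrong rating.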


In [25]:
df_subset = load_imdb_with_rating_v2(data_dir, subset="train", sample_size=5000)
print(f"Shape: {df_subset.shape}")
df_subset.head()

Shape: (10000, 4)


| | id | rating | txt | label |
|---|---|---|---|---|
| 0 | 6784 | 8 | I like my Ronald Colman dashing and debonair, ... | 1 |
| 1 | 11884 | 8 | I found this film to be a fascinating study of... | 1 |
| 2 | 1656 | 9 | "Thieves and Liars" presents us with a very na... | 1 |
| 3 | 4745 | 7 | I can't understand why they decided to release... | 1 |
| 4 | 305 | 8 | Screwball comedy about romantic mismatches in ... | 1 |
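
Because the loader calls `random.sample` without a seed, the rows shown above will differ from run to run. If a reproducible subset is wanted, one option is a dedicated seeded generator. The sketch below is an assumption about how that could look; `sample_filenames` and its `seed` parameter are hypothetical names.

```python
import random


def sample_filenames(filenames, sample_size, seed=42):
    """Draw a reproducible sample; sorting removes dependence on directory order."""
    rng = random.Random(seed)   # local generator, leaves the global random state alone
    pool = sorted(filenames)
    return rng.sample(pool, min(sample_size, len(pool)))


names = [f"{i}_{(i % 10) + 1}.txt" for i in range(20)]  # made-up filenames
first = sample_filenames(names, 5)
second = sample_filenames(list(reversed(names)), 5)
print(first == second)  # True
```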


In [5]:
!pip install nltk






In [6]:
!{sys.executable} -m pip install nltk






In [26]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bbuser\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\bbuser\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\bbuser\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [27]:
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

In [28]:
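def preprocess_review(review_text):
    """Clean and normalize raw review text; returns space-joined tokens."""

    # Step A: lowercase
    cleaned = review_text.lower()

    # Step B: strip HTML tags; the separator keeps the words on either side of
    # "<br />" from being fused together (e.g. "end.<br />next" -> "end. next")
    cleaned = BeautifulSoup(cleaned, "html.parser").get_text(separator=" ")

    # Step C: remove URLs and emails ("http\S+" already covers "https")
    cleaned = re.sub(r'(http\S+|www\S+|[\w\.-]+@[\w\.-]+)', '', cleaned)

    # Step D: keep only alphabetic characters and whitespace (drops punctuation,
    # digits, and emojis). Apostrophes are deleted, so "you've" becomes "youve"
    # and slips past the stopword filter in Step F.
    cleaned = re.sub(r'[^a-z\s]', '', cleaned)

    # Step E: tokenize on whitespace
    words = cleaned.split()

    # Step F: drop stopwords
    words = [word for word in words if word not in stop_words]

    # Step G: lemmatize (WordNetLemmatizer treats every token as a noun by default)
    words = [lemmatizer.lemmatize(word) for word in words]

    # Step H: filter short tokens (keep length > 2)
    words = [word for word in words if len(word) > 2]

    return " ".join(words)


# Deliverable 3.2 asks for clean_review(text); expose the pipeline under that name too
clean_review = preprocess_review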
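
Step D deletes every non-letter outright, which is why tokens such as `cant` and `youve` appear in the cleaned output. A possible alternative, shown here only as a sketch and not as the notebook's behaviour, is to replace non-letters with a space so that contractions and hyphenated words split into pieces the later stopword and length filters can handle:

```python
import re

text = "i can't believe it's a well-known film"

# Deleting non-letters fuses the pieces of contractions and hyphenated words
deleted = re.sub(r"[^a-z\s]", "", text).split()

# Replacing them with a space keeps the pieces apart
spaced = re.sub(r"[^a-z\s]", " ", text).split()

print(deleted)  # ['i', 'cant', 'believe', 'its', 'a', 'wellknown', 'film']
print(spaced)   # ['i', 'can', 't', 'believe', 'it', 's', 'a', 'well', 'known', 'film']
```

With the spaced variant, fragments like `t` and `s` are short enough to be removed by the length filter in Step H.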


In [29]:
df_subset["cleaned_review"] = df_subset["txt"].map(preprocess_review)

In [30]:
df_subset.loc[:, ["txt", "cleaned_review"]].head(10)

| | txt | cleaned_review |
|---|---|---|
| 0 | I like my Ronald Colman dashing and debonair, ... | like ronald colman dashing debonair fellow see... |
| 1 | I found this film to be a fascinating study of... | found film fascinating study family crisis leo... |
| 2 | "Thieves and Liars" presents us with a very na... | thief liar present naturalistic depiction leve... |
| 3 | I can't understand why they decided to release... | cant understand decided release film introduce... |
| 4 | Screwball comedy about romantic mismatches in ... | screwball comedy romantic mismatch new york ci... |
| 5 | I finally purchased and added to my collection... | finally purchased added collection copy show p... |
| 6 | I was 16 when I first saw the movie, and it ha... | first saw movie always huge favorite mine cour... |
| 7 | If you've ever seen the trailer for the film "... | youve ever seen trailer film recruit colin far... |
| 8 | While the soundtrack is a bit dated, this stor... | soundtrack bit dated story relevant ever blue ... |
| 9 | This movie completely ran laps around the orig... | movie completely ran lap around original dolem... |
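
The table above truncates both columns, which makes the before/after comparison hard to read. A small formatter can print each pair on its own lines. This is a sketch under assumptions: `format_before_after` is a hypothetical helper, and the example pair is made up rather than taken from the dataset.

```python
import textwrap


def format_before_after(original, cleaned, width=100):
    """Return a two-line before/after summary, shortened to `width` characters each."""
    before = textwrap.shorten(original, width=width, placeholder=" ...")
    after = textwrap.shorten(cleaned, width=width, placeholder=" ...")
    return f"BEFORE: {before}\nAFTER:  {after}"


# Made-up pair for illustration
example = format_before_after(
    "This movie was GREAT!<br /><br />Loved it.",
    "movie great loved",
)
print(example)
```

In the notebook, the same helper could be applied to the first five rows of `df_subset[["txt", "cleaned_review"]]` to cover the before/after deliverable.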


In [31]:
df_subset

| | id | rating | txt | label | cleaned_review |
|---|---|---|---|---|---|
| 0 | 6784 | 8 | I like my Ronald Colman dashing and debonair, ... | 1 | like ronald colman dashing debonair fellow see... |
| 1 | 11884 | 8 | I found this film to be a fascinating study of... | 1 | found film fascinating study family crisis leo... |
| 2 | 1656 | 9 | "Thieves and Liars" presents us with a very na... | 1 | thief liar present naturalistic depiction leve... |
| 3 | 4745 | 7 | I can't understand why they decided to release... | 1 | cant understand decided release film introduce... |
| 4 | 305 | 8 | Screwball comedy about romantic mismatches in ... | 1 | screwball comedy romantic mismatch new york ci... |
| ... | ... | ... | ... | ... | ... |
| 9995 | 2510 | 4 | This TV film tells the story of extrovert Fran... | 0 | film tell story extrovert frannie suddenly ret... |
| 9996 | 5041 | 2 | Ye Lou's film Purple Butterfly pits a secret o... | 0 | lous film purple butterfly pit secret organiza... |
| 9997 | 8517 | 2 | The biggest mystery of Veronica Mars is not on... | 0 | biggest mystery veronica mar one tackle screen... |
| 9998 | 5903 | 1 | I live in Salt Lake City and I'm not a Mormon,... | 0 | live salt lake city mormon rent movie well liv... |


In [32]:
df_subset.to_csv("imdb_cleaned_sample.csv", index=False)
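
After exporting, it can be worth checking for reviews that became empty once stopwords, digits, and short tokens were removed. The sketch below uses only the `csv` module and a made-up toy file standing in for `imdb_cleaned_sample.csv`; `find_empty_cleaned` is a hypothetical helper, not part of the notebook.

```python
import csv
import os
import tempfile


def find_empty_cleaned(csv_path, text_column="cleaned_review"):
    """Return 0-based row positions whose cleaned text is empty or whitespace."""
    empty_rows = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for position, row in enumerate(csv.DictReader(f)):
            if not row[text_column].strip():
                empty_rows.append(position)
    return empty_rows


# Toy file with the same columns as the export (rows are made up)
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_cleaned.csv")
    with open(toy_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "rating", "txt", "label", "cleaned_review"])
        writer.writerow([1, 8, "Loved it!", 1, "loved"])
        writer.writerow([2, 1, "No. 0/10", 0, ""])
    empty_rows = find_empty_cleaned(toy_path)

print(empty_rows)  # [1]
```

Note that `pandas.read_csv` loads empty fields as `NaN` by default, so such rows may need handling before the file is fed to a sentiment model.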