# 02 — Data Cleaning (Netflix Reviews)

In this notebook, we clean the raw Netflix reviews by:
- Removing duplicate rows and rows with no review text
- Normalizing text (lowercasing; removing links, punctuation, digits, and stopwords; trimming spaces)
- Adding a `clean_review` column and dropping rows where it ends up empty

✅ Output of this notebook:  
`../data/netflix_reviews_clean.csv`


## Step 1: Load Raw Dataset

In [2]:
import re
import pandas as pd
from pathlib import Path

In [4]:
# Load raw dataset
RAW_DATA_PATH = Path("../data/netflix_reviews_raw.csv")
df = pd.read_csv(RAW_DATA_PATH)

print(f"✅ Loaded raw dataset with {len(df)} rows")
df.head()

✅ Loaded raw dataset with 41960 rows


Unnamed: 0,username,review,rating,date,country,source
0,Robert,"brainwashing children. don't deny it, the proo...",1,2025-10-08 00:57:12,NG,Play Store
1,Abraham Bernabe,disgusting woke agenda. just cancel and uninst...,1,2025-10-08 00:35:03,NG,Play Store
2,Kyle Martin,Please fix the pausing issue. I am trying to w...,2,2025-10-08 00:33:36,NG,Play Store
3,Jarrett,"Doesn't really have that much titles, heh lol",3,2025-10-08 00:30:51,NG,Play Store
4,Michael Raber,"gotta pay for the app, they raise the price wi...",1,2025-10-08 00:26:11,NG,Play Store
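
The `Unnamed: 0` column in the preview is most likely a row index that was written out when the raw CSV was saved. A minimal sketch, assuming that is the case, of reading it back as the index so it does not travel through the pipeline as a data column. The inline sample and the names `sample_csv` and `sample_df` are made up for illustration:

```python
import io

import pandas as pd

# Hypothetical two-row sample mimicking the raw file's layout,
# where the first, unnamed column is a saved row index.
sample_csv = io.StringIO(
    ",username,review,rating\n"
    "0,user_a,Great app,5\n"
    "1,user_b,Keeps crashing,1\n"
)

# index_col=0 tells pandas to use the first column as the index
# instead of loading it as a data column named "Unnamed: 0".
sample_df = pd.read_csv(sample_csv, index_col=0)
print(sample_df.columns.tolist())
```

Whether to do this in the loading cell is a judgment call; the rest of this notebook works either way.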


## Step 2: Basic Cleaning — Remove Null & Duplicate Reviews


In [5]:
# Remove rows that are exact duplicates across all columns
df.drop_duplicates(inplace=True)


In [6]:
# Handle missing values
df.dropna(subset=['review'], inplace=True)  # remove rows without text
df.reset_index(drop=True, inplace=True)

print(f"✅ After cleaning nulls & duplicates: {len(df)} rows remain")

✅ After cleaning nulls & duplicates: 41960 rows remain
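
The row count is unchanged after this step. One possible reason, an assumption rather than something verified against the raw file, is that `drop_duplicates()` with no arguments compares every column, including the unique `Unnamed: 0` values, so no two rows can ever match. A sketch of deduplicating on the content columns only, using a made-up sample (`toy`, `full_row`, and `by_content` are illustrative names):

```python
import pandas as pd

# Made-up sample: rows 0 and 1 are the same review, differing only in the index column.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "username": ["user_a", "user_a", "user_b"],
    "review": ["Great app", "Great app", "Keeps crashing"],
    "rating": [5, 5, 1],
})

# Comparing every column finds no duplicates because "Unnamed: 0" is unique per row.
full_row = toy.drop_duplicates()

# Comparing only the content columns treats rows 0 and 1 as duplicates.
by_content = toy.drop_duplicates(subset=["username", "review", "rating"])

print(len(full_row), len(by_content))
```

Which columns define a duplicate depends on the data: two different users can legitimately post the same short review, so including `username` (and possibly `date`) in `subset` is the more conservative choice.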


## Step 3: Normalize Text (lowercase, remove links, punctuation, digits, and stopwords)


In [7]:
# English stopwords to drop from each review (note: includes negations such as "no" and "not")
stop_words = {
    "i","me","my","myself","we","our","ours","ourselves","you","your","yours","yourself",
    "yourselves","he","him","his","himself","she","her","hers","herself","it","its","itself",
    "they","them","their","theirs","themselves","what","which","who","whom","this","that",
    "these","those","am","is","are","was","were","be","been","being","have","has","had",
    "having","do","does","did","doing","a","an","the","and","but","if","or","because","as",
    "until","while","of","at","by","for","with","about","against","between","into","through",
    "during","before","after","above","below","to","from","up","down","in","out","on","off",
    "over","under","again","further","then","once","here","there","when","where","why","how",
    "all","any","both","each","few","more","most","other","some","such","no","nor","not",
    "only","own","same","so","than","too","very","s","t","can","will","just","don","should",
    "now"
}


In [8]:
# Cleaning function
def clean_text(text):
    """Lowercase a review, strip links and non-letter characters, and drop stopwords."""
    text = str(text).lower()                          # lowercase
    text = re.sub(r"http\S+|www\S+", "", text)        # remove links
    # keep only a-z and whitespace (drops punctuation, digits, emoji, and non-Latin text)
    text = re.sub(r"[^a-z\s]", "", text)
    text = re.sub(r"\s+", " ", text).strip()          # collapse repeated whitespace
    # remove stopwords
    text = " ".join(word for word in text.split() if word not in stop_words)
    return text
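
To see what each step of `clean_text` does, it can help to run it on a few hand-picked strings. The sketch below repeats the same steps with a small, illustrative stopword subset (`demo_stop_words` and `clean_text_demo` are hypothetical names, not part of the notebook) so that it runs on its own:

```python
import re

# Abbreviated, illustrative subset of the notebook's stop_words set.
demo_stop_words = {"the", "is", "at", "not", "don", "t"}

def clean_text_demo(text):
    """Same steps as clean_text above, but using the small demo stopword set."""
    text = str(text).lower()
    text = re.sub(r"http\S+|www\S+", "", text)   # remove links
    text = re.sub(r"[^a-z\s]", "", text)         # keep only a-z and whitespace
    text = re.sub(r"\s+", " ", text).strip()     # collapse repeated whitespace
    return " ".join(word for word in text.split() if word not in demo_stop_words)

samples = [
    "Don't pause at 10:30! See http://example.com",  # link, digits, contraction
    "не работает",                                   # non-Latin script
    "👍👍👍",                                         # emoji only
]
for sample in samples:
    print(repr(clean_text_demo(sample)))
```

The first sample comes out as `'dont pause see'`; the other two come out as empty strings, because nothing in them matches `a-z`. Reviews like these are the ones removed later when empty `clean_review` values are dropped.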

In [9]:
# Apply cleaning
df["clean_review"] = df["review"].apply(clean_text)
df.head()

Unnamed: 0,username,review,rating,date,country,source,clean_review
0,Robert,"brainwashing children. don't deny it, the proo...",1,2025-10-08 00:57:12,NG,Play Store,brainwashing children dont deny proof pudding
1,Abraham Bernabe,disgusting woke agenda. just cancel and uninst...,1,2025-10-08 00:35:03,NG,Play Store,disgusting woke agenda cancel uninstall garbage
2,Kyle Martin,Please fix the pausing issue. I am trying to w...,2,2025-10-08 00:33:36,NG,Play Store,please fix pausing issue trying watch monday n...
3,Jarrett,"Doesn't really have that much titles, heh lol",3,2025-10-08 00:30:51,NG,Play Store,doesnt really much titles heh lol
4,Michael Raber,"gotta pay for the app, they raise the price wi...",1,2025-10-08 00:26:11,NG,Play Store,gotta pay app raise price warning work half time
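
The preview shows `dont` and `doesnt` surviving in `clean_review`: the apostrophe is stripped before stopword filtering, so contractions no longer match entries such as `don` or `t`. At the same time, `no` and `not` are removed, which may matter for the sentiment work in the next notebook. A hedged sketch of one alternative, assuming negation should be kept. `NEGATIONS`, `demo_stop_words`, `kept_stop_words`, and `clean_keep_negation` are hypothetical names, and the stopword set is a small subset for illustration:

```python
import re

NEGATIONS = {"no", "nor", "not"}

# Small illustrative subset of the notebook's stopword list.
demo_stop_words = {"the", "is", "it", "this", "no", "nor", "not"}

# Keep negation words by taking them out of the stopword set.
kept_stop_words = demo_stop_words - NEGATIONS

def clean_keep_negation(text):
    """Variant of the cleaning steps that expands "n't" and keeps negation words."""
    text = str(text).lower()
    # Crude expansion: "doesn't" -> "does not" (but "can't" -> "ca not").
    text = re.sub(r"n't\b", " not", text)
    text = re.sub(r"[^a-z\s]", "", text)
    return " ".join(word for word in text.split() if word not in kept_stop_words)

print(clean_keep_negation("This doesn't work"))
print(clean_keep_negation("The app is not good"))
```

This is only one option. The expansion handles the straight apostrophe only and produces odd stems for irregular contractions, so a lookup table of contractions may be a better fit if negation turns out to matter downstream.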


## Drop Rows Where the Cleaned Review Is Empty

In [10]:
# Drop rows where clean_review is missing or empty
df = df.dropna(subset=['clean_review'])  # Remove NaN values
df = df[df['clean_review'].str.strip() != ""]  # Remove blank strings if any

print(f"✅ Cleaned dataset after dropping empty reviews: {len(df)} rows remaining")


✅ Cleaned dataset after dropping empty reviews: 41238 rows remaining
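
This step removes 722 rows (41960 - 41238). Because `clean_text` keeps only the letters a-z, reviews made up entirely of emoji, digits, non-Latin script, or stopwords likely account for these. A small sketch for looking at the original text of such rows before dropping them, using a made-up frame (`toy`, `empty_mask`, and `kept` are illustrative names):

```python
import pandas as pd

# Made-up sample: three reviews that clean to an empty string, one that does not.
toy = pd.DataFrame({
    "review": ["Great app", "👍👍", "10/10", "very"],
    "clean_review": ["great app", "", "", ""],
})

# Flag rows whose cleaned text is blank.
empty_mask = toy["clean_review"].str.strip() == ""

# Inspect what the original reviews looked like before discarding them.
print(toy.loc[empty_mask, "review"].tolist())

# Keep the rest and reset the index so it stays contiguous.
kept = toy.loc[~empty_mask].reset_index(drop=True)
print(len(kept))
```

Checking a sample of the dropped rows on the real data would show whether they are mostly noise or, for example, non-English reviews that deserve separate handling.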


## Step 4: Save Cleaned Dataset for Next Notebook


In [11]:
CLEANED_DATA_PATH = Path("../data/netflix_reviews_clean.csv")
df.to_csv(CLEANED_DATA_PATH, index=False)

print(f"\n✅ Cleaned dataset saved to '{CLEANED_DATA_PATH}'")
print(f"📊 Total cleaned reviews: {len(df)}")



✅ Cleaned dataset saved to '../data/netflix_reviews_clean.csv'
📊 Total cleaned reviews: 41238
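
No cell above selects columns, so every raw column, including `Unnamed: 0`, is written to the output file. A minimal sketch of selecting columns before saving and reading the file back as a sanity check. The column choice in `KEEP_COLUMNS`, the sample data, and the temporary path are assumptions for illustration:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up sample with the same column layout as the cleaned frame.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "username": ["user_a", "user_b"],
    "review": ["Great app", "Keeps crashing"],
    "rating": [5, 1],
    "source": ["Play Store", "Play Store"],
    "clean_review": ["great app", "keeps crashing"],
})

# Hypothetical choice of columns to carry into the next notebook.
KEEP_COLUMNS = ["review", "clean_review", "rating", "source"]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "reviews_clean_demo.csv"
    toy[KEEP_COLUMNS].to_csv(out_path, index=False)
    # Read the file back to confirm the columns and that no cleaned text is missing.
    reloaded = pd.read_csv(out_path)

print(reloaded.columns.tolist())
print(int(reloaded["clean_review"].isna().sum()))
```

The read-back check is worth keeping in some form: `pd.read_csv` loads empty string fields as `NaN` by default, so any blank `clean_review` that slipped through would show up as a missing value in the next notebook.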


### Data Cleaning Completed!

➡️ Next Notebook: **03_sentiment_mapping.ipynb**
