**Data loading & inspection:** 
Load the dataset and check for missing values and duplicates to ensure data quality.

In [10]:
import numpy as np
import pandas as pd

csv_path = "../data/raw/amazon_reviews/train.csv"
df = pd.read_csv(csv_path, header=None, names=["polarity", "title", "text"], nrows=200000)

# check missing values and duplicates
print(df.isnull().sum())
print("Duplicates:", df.duplicated().sum())

# drop rows that have a missing title
df.dropna(subset=["title"], inplace=True)


polarity     0
title       18
text         0
dtype: int64
Duplicates: 0
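
The output above shows 18 missing titles. The snippet below is a minimal sketch of why those rows are dropped before the columns are combined: concatenating a missing title with a string yields a missing value. The three-row sample is hypothetical and only mimics the headerless `polarity, title, text` layout used above.

```python
import io
import pandas as pd

# Hypothetical three-row sample in the same headerless layout as train.csv
raw = io.StringIO(
    "2,Great,Loved it\n"
    "1,,Broke after a week\n"
    "2,Fine,Works as expected\n"
)
sample = pd.read_csv(raw, header=None, names=["polarity", "title", "text"])

# Count missing values per column (the middle row has no title)
missing = sample.isnull().sum()

# Concatenating a missing title with text propagates the missing value
combined = sample["title"] + " " + sample["text"]

# Dropping rows without a title avoids missing values in the combined column
cleaned = sample.dropna(subset=["title"])

print(missing)
print(combined.isna().sum(), "combined value(s) would be missing without dropna")
print(len(cleaned), "rows remain after dropna")
```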


**Text cleaning:**
The title and text columns were combined into a single full_review column, which gives our sentiment analysis model(s) richer context.
All text was converted to lowercase to standardize the data and reduce vocabulary size.
We removed punctuation and other special characters and collapsed extra whitespace using regular expressions. Note that the pattern in the code keeps digits.


In [11]:
import re

df["full_review"] = df["title"] + " " + df["text"]

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text) # removes special characters; letters, digits and whitespace are kept
    text = re.sub(r'\s+', ' ', text) # collapses runs of whitespace into a single space
    return text

df["cleaned_review"] = df["full_review"].apply(clean_text)
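
To make the effect of the cleaning step concrete, here is a small sketch that applies the same `clean_text` logic to one made-up review string (the sample text is an assumption, not taken from the dataset). It shows that punctuation is stripped and repeated spaces are collapsed, while digits survive.

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # drop everything except letters, digits, whitespace
    text = re.sub(r'\s+', ' ', text)            # collapse runs of whitespace
    return text

# Hypothetical review used only for illustration
sample = "Great Product!!!  5 stars, would buy again."
result = clean_text(sample)
print(result)  # great product 5 stars would buy again
```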


**Text normalization:**
Common English words were removed using NLTK's stopword list to focus on more meaningful words.
We used Keras' Tokenizer to convert the cleaned text into sequences of integers. An OOV (out-of-vocabulary) token was specified to handle words not seen during training.
Sequences were padded to a fixed length that we chose based on the histogram analysis from the first notebook.
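
As a rough illustration of the stopword step, the sketch below filters a single sentence. The small `demo_stop_words` set is a hypothetical stand-in for NLTK's list so that the snippet runs without downloading anything. NLTK's English list also contains negations such as "not" and "no", so it may be worth checking whether removing them hurts a sentiment model; the second call shows one way to keep them.

```python
# Hypothetical stand-in for NLTK's English stopword list
demo_stop_words = {"this", "is", "a", "the", "not", "and"}

def remove_stopwords_demo(text, stop_words):
    # Keep only the words that are not in the stopword set
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "this is not a good product"

# With the full set, the negation disappears along with the other stopwords
filtered = remove_stopwords_demo(sample, demo_stop_words)
print(filtered)  # good product

# Excluding negations from the set keeps them in the text
kept_negation = remove_stopwords_demo(sample, demo_stop_words - {"not"})
print(kept_negation)  # not good product
```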

In [12]:
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

def remove_stopwords(text):
    words = text.split()
    words = [word for word in words if word not in stop_words]
    return " ".join(words)

df["cleaned_review"] = df["cleaned_review"].apply(remove_stopwords)


max_words = 5000
maxlen = 70

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=max_words, oov_token="<OOV>")
tokenizer.fit_on_texts(df["cleaned_review"])
sequences = tokenizer.texts_to_sequences(df["cleaned_review"])

# maxlen (set above) is a hyperparameter that can be tuned. it was chosen based on the histogram created in notebook 1,
# which shows that the majority of reviews are below 60 words. we can adjust this in the future
padded_sequences = pad_sequences(sequences, maxlen=maxlen, padding="post")

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>
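
The padding call above uses `padding="post"` and leaves `truncating` at its default, which in Keras is `"pre"`. The plain-Python sketch below mimics that combination for a single sequence; it is an illustration of the behaviour, not Keras's implementation, and the function name `pad_post_truncate_pre` is made up for this example.

```python
def pad_post_truncate_pre(seq, maxlen, value=0):
    """Illustrate padding="post" combined with truncating="pre" for one sequence."""
    if len(seq) > maxlen:
        # "pre" truncation drops tokens from the start and keeps the last maxlen
        seq = seq[-maxlen:]
    # "post" padding appends the fill value at the end
    return list(seq) + [value] * (maxlen - len(seq))

short = pad_post_truncate_pre([4, 9, 2], maxlen=5)
long_ = pad_post_truncate_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)  # [4, 9, 2, 0, 0]
print(long_)  # [3, 4, 5, 6, 7]
```

Because the title is placed at the start of each combined review, reviews longer than `maxlen` lose their opening words under this default. Passing `truncating="post"` to `pad_sequences` would keep the beginning instead.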


Save the preprocessed DataFrame, including the cleaned_review column, to CSV. The padded sequences are not part of this file.

In [13]:
processed_path = "../data/processed/amazon_reviews_processed.csv"

df.to_csv(processed_path, index=False)
print(f"Preprocessed data saved to {processed_path}")

Preprocessed data saved to ../data/processed/amazon_reviews_processed.csv
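
One thing to keep in mind when reading the saved file back: if a review ends up as an empty string after cleaning (for example, when it consisted only of stopwords or special characters), pandas reads it as a missing value by default. The sketch below demonstrates this with a hypothetical two-row frame and an in-memory buffer instead of the real file.

```python
import io
import pandas as pd

# Hypothetical frame where one cleaned review is an empty string
demo = pd.DataFrame({"polarity": [2, 1], "cleaned_review": ["great product", ""]})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)

# Default behaviour: the empty field comes back as a missing value
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False preserves the empty string instead
buffer.seek(0)
kept_empty = pd.read_csv(buffer, keep_default_na=False)

print(default_read["cleaned_review"].isna().tolist())  # [False, True]
print(kept_empty["cleaned_review"].tolist())           # ['great product', '']
```

Either handling the missing values after loading or passing `keep_default_na=False` should avoid errors when the text is tokenized again in a later notebook.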
