# 01 - Exploratory Data Analysis

## Objective
Explore three Amazon review datasets, identify data quality issues, and prepare clean datasets for sentiment analysis, clustering, and text generation tasks.

## Approach

1. Data Loading & Initial Exploration
2. Data Quality Assessment
3. Sentiment Label Creation
4. Category Analysis & Consolidation
5. Dataset Preparation

In [4]:
# --- 1. Imports & Setup ---
import pandas as pd
import html

In [5]:
# --- 2. Load Data ---

ds1 = pd.read_csv("../data/1429_1.csv", low_memory=False)  # low_memory=False avoids the DtypeWarning raised for mixed-type columns
ds2 = pd.read_csv("../data/amazon_customer_reviews_2017_2018.csv")
ds3 = pd.read_csv("../data/amazon_customer_reviews_Feb_April_2019.csv")

print(ds1.shape)
print(ds1.dtypes)
print("\n")
print(ds2.shape)
print(ds2.dtypes)
print("\n")
print(ds3.shape)
print(ds3.dtypes)

(34660, 21)
id                          str
name                        str
asins                       str
brand                       str
categories                  str
keys                        str
manufacturer                str
reviews.date                str
reviews.dateAdded           str
reviews.dateSeen            str
reviews.didPurchase      object
reviews.doRecommend      object
reviews.id              float64
reviews.numHelpful      float64
reviews.rating          float64
reviews.sourceURLs          str
reviews.text                str
reviews.title               str
reviews.userCity        float64
reviews.userProvince    float64
reviews.username            str
dtype: object


(5000, 24)
id                         str
dateAdded                  str
dateUpdated                str
name                       str
asins                      str
brand                      str
categories                 str
primaryCategories          str
imageURLs                  str
keys      
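
The dtypes above hint at quality issues: for example, `reviews.userCity` in DS1 is `float64`, which for a text field may indicate a column that is entirely empty. A minimal sketch of how the missing-value share per column could be quantified follows. The helper name `null_share` and the toy frame are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def null_share(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return df.isna().mean().sort_values(ascending=False)

# Toy frame standing in for a raw dataset (values are made up)
toy = pd.DataFrame({
    "reviews.text": ["great tablet", None, "works fine", "too small"],
    "reviews.userCity": [np.nan, np.nan, np.nan, np.nan],
    "reviews.rating": [5.0, 4.0, np.nan, 3.0],
})

shares = null_share(toy)
print(shares)
```

Calling `null_share(ds1)` on the real data would show which columns are too sparse to be worth keeping.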

### Visualization of the datasets

In [6]:
ds1.head()

Unnamed: 0,id,name,asins,brand,categories,keys,manufacturer,reviews.date,reviews.dateAdded,reviews.dateSeen,...,reviews.doRecommend,reviews.id,reviews.numHelpful,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.userCity,reviews.userProvince,reviews.username
0,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,This product so far has not disappointed. My c...,Kindle,,,Adapter
1,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,great for beginner or experienced person. Boug...,very fast,,,truman
2,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,Inexpensive tablet for him to use and learn on...,Beginner tablet for our 9 year old son.,,,DaveZ
3,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-13T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,4.0,http://reviews.bestbuy.com/3545/5620406/review...,I've had my Fire HD 8 two weeks now and I love...,Good!!!,,,Shacks
4,AVqkIhwDv8e3D1O-lebb,"All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi,...",B01AHB9CN2,Amazon,"Electronics,iPad & Tablets,All Tablets,Fire Ta...","841667104676,amazon/53004484,amazon/b01ahb9cn2...",Amazon,2017-01-12T00:00:00.000Z,2017-07-03T23:33:15Z,"2017-06-07T09:04:00.000Z,2017-04-30T00:45:00.000Z",...,True,,0.0,5.0,http://reviews.bestbuy.com/3545/5620406/review...,I bought this for my grand daughter when she c...,Fantastic Tablet for kids,,,explore42


In [7]:
ds1.keys()

Index(['id', 'name', 'asins', 'brand', 'categories', 'keys', 'manufacturer',
       'reviews.date', 'reviews.dateAdded', 'reviews.dateSeen',
       'reviews.didPurchase', 'reviews.doRecommend', 'reviews.id',
       'reviews.numHelpful', 'reviews.rating', 'reviews.sourceURLs',
       'reviews.text', 'reviews.title', 'reviews.userCity',
       'reviews.userProvince', 'reviews.username'],
      dtype='str')

In [8]:
ds2.head()

Unnamed: 0,id,dateAdded,dateUpdated,name,asins,brand,categories,primaryCategories,imageURLs,keys,...,reviews.dateSeen,reviews.doRecommend,reviews.id,reviews.numHelpful,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.username,sourceURLs
0,AVqVGZNvQMlgsOJE6eUY,2017-03-03T16:56:05Z,2018-10-25T16:36:31Z,"Amazon Kindle E-Reader 6"" Wifi (8th Generation...",B00ZV9PXP2,Amazon,"Computers,Electronics Features,Tablets,Electro...",Electronics,https://pisces.bbystatic.com/image2/BestBuy_US...,allnewkindleereaderblack6glarefreetouchscreend...,...,"2018-05-27T00:00:00Z,2017-09-18T00:00:00Z,2017...",False,,0,3,http://reviews.bestbuy.com/3545/5442403/review...,I thought it would be as big as small paper bu...,Too small,llyyue,https://www.newegg.com/Product/Product.aspx%25...
1,AVqVGZNvQMlgsOJE6eUY,2017-03-03T16:56:05Z,2018-10-25T16:36:31Z,"Amazon Kindle E-Reader 6"" Wifi (8th Generation...",B00ZV9PXP2,Amazon,"Computers,Electronics Features,Tablets,Electro...",Electronics,https://pisces.bbystatic.com/image2/BestBuy_US...,allnewkindleereaderblack6glarefreetouchscreend...,...,"2018-05-27T00:00:00Z,2017-07-07T00:00:00Z,2017...",True,,0,5,http://reviews.bestbuy.com/3545/5442403/review...,This kindle is light and easy to use especiall...,Great light reader. Easy to use at the beach,Charmi,https://www.newegg.com/Product/Product.aspx%25...
2,AVqVGZNvQMlgsOJE6eUY,2017-03-03T16:56:05Z,2018-10-25T16:36:31Z,"Amazon Kindle E-Reader 6"" Wifi (8th Generation...",B00ZV9PXP2,Amazon,"Computers,Electronics Features,Tablets,Electro...",Electronics,https://pisces.bbystatic.com/image2/BestBuy_US...,allnewkindleereaderblack6glarefreetouchscreend...,...,2018-05-27T00:00:00Z,True,,0,4,https://reviews.bestbuy.com/3545/5442403/revie...,Didnt know how much i'd use a kindle so went f...,Great for the price,johnnyjojojo,https://www.newegg.com/Product/Product.aspx%25...
3,AVqVGZNvQMlgsOJE6eUY,2017-03-03T16:56:05Z,2018-10-25T16:36:31Z,"Amazon Kindle E-Reader 6"" Wifi (8th Generation...",B00ZV9PXP2,Amazon,"Computers,Electronics Features,Tablets,Electro...",Electronics,https://pisces.bbystatic.com/image2/BestBuy_US...,allnewkindleereaderblack6glarefreetouchscreend...,...,2018-10-09T00:00:00Z,True,177283626.0,3,5,https://redsky.target.com/groot-domain-api/v1/...,I am 100 happy with my purchase. I caught it o...,A Great Buy,Kdperry,https://www.newegg.com/Product/Product.aspx%25...
4,AVqVGZNvQMlgsOJE6eUY,2017-03-03T16:56:05Z,2018-10-25T16:36:31Z,"Amazon Kindle E-Reader 6"" Wifi (8th Generation...",B00ZV9PXP2,Amazon,"Computers,Electronics Features,Tablets,Electro...",Electronics,https://pisces.bbystatic.com/image2/BestBuy_US...,allnewkindleereaderblack6glarefreetouchscreend...,...,2018-05-27T00:00:00Z,True,,0,5,https://reviews.bestbuy.com/3545/5442403/revie...,Solid entry level Kindle. Great for kids. Gift...,Solid entry-level Kindle. Great for kids,Johnnyblack,https://www.newegg.com/Product/Product.aspx%25...


In [9]:
ds2.keys()

Index(['id', 'dateAdded', 'dateUpdated', 'name', 'asins', 'brand',
       'categories', 'primaryCategories', 'imageURLs', 'keys', 'manufacturer',
       'manufacturerNumber', 'reviews.date', 'reviews.dateAdded',
       'reviews.dateSeen', 'reviews.doRecommend', 'reviews.id',
       'reviews.numHelpful', 'reviews.rating', 'reviews.sourceURLs',
       'reviews.text', 'reviews.title', 'reviews.username', 'sourceURLs'],
      dtype='str')

In [10]:
ds3.head()

Unnamed: 0,id,dateAdded,dateUpdated,name,asins,brand,categories,primaryCategories,imageURLs,keys,...,reviews.didPurchase,reviews.doRecommend,reviews.id,reviews.numHelpful,reviews.rating,reviews.sourceURLs,reviews.text,reviews.title,reviews.username,sourceURLs
0,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,3,https://www.amazon.com/product-reviews/B00QWO9...,I order 3 of them and one of the item is bad q...,... 3 of them and one of the item is bad quali...,Byger yang,"https://www.barcodable.com/upc/841710106442,ht..."
1,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,4,https://www.amazon.com/product-reviews/B00QWO9...,Bulk is always the less expensive way to go fo...,... always the less expensive way to go for pr...,ByMG,"https://www.barcodable.com/upc/841710106442,ht..."
2,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,Well they are not Duracell but for the price i...,... are not Duracell but for the price i am ha...,BySharon Lambert,"https://www.barcodable.com/upc/841710106442,ht..."
3,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,Seem to work as well as name brand batteries a...,... as well as name brand batteries at a much ...,Bymark sexson,"https://www.barcodable.com/upc/841710106442,ht..."
4,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,,,,,5,https://www.amazon.com/product-reviews/B00QWO9...,These batteries are very long lasting the pric...,... batteries are very long lasting the price ...,Bylinda,"https://www.barcodable.com/upc/841710106442,ht..."


In [11]:
ds3.keys()

Index(['id', 'dateAdded', 'dateUpdated', 'name', 'asins', 'brand',
       'categories', 'primaryCategories', 'imageURLs', 'keys', 'manufacturer',
       'manufacturerNumber', 'reviews.date', 'reviews.dateSeen',
       'reviews.didPurchase', 'reviews.doRecommend', 'reviews.id',
       'reviews.numHelpful', 'reviews.rating', 'reviews.sourceURLs',
       'reviews.text', 'reviews.title', 'reviews.username', 'sourceURLs'],
      dtype='str')

In [None]:
# See the Electronics products in detail
electronics_ds3 = ds3[ds3["primaryCategories"] == "Electronics"]
print(electronics_ds3["name"].value_counts().head(30))

# Also check DS1 product names since it's all electronics
print(ds1["name"].value_counts().head(30))

# And review counts per product in Electronics
# (thin review counts = thin generation material)
print(electronics_ds3.groupby("name")["reviews.text"].count().sort_values(ascending=False).head(20))

name
Fire HD 8 Tablet with Alexa, 8 HD Display, 16 GB, Tangerine - with Special Offers                                                               2443
All-New Fire HD 8 Tablet, 8 HD Display, Wi-Fi, 16 GB - Includes Special Offers, Black                                                           2370
Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16 GB, Blue Kid-Proof Case                                                                          1425
Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16 GB, Green Kid-Proof Case                                                                         1212
Fire Tablet, 7 Display, Wi-Fi, 16 GB - Includes Special Offers, Black                                                                           1024
Fire Tablet with Alexa, 7 Display, 16 GB, Blue - with Special Offers                                                                             987
All-New Fire HD 8 Tablet with Alexa, 8 HD Display, 16 GB, Marine Blue - with Special Offers          

### Decision on meta-categories

Rather than forcing artificial diversity across low-signal categories, a deliberate choice was made to go deep within Electronics, a domain with sufficient product breadth and review volume to produce meaningful generation output. The 5 meta-categories reflect natural product families that consumers actually compare when making purchase decisions, mirroring how sites like The Wirecutter structure their buying guides.

Meta-Categories

1. Fire Tablets          ~15,000 reviews
2. Fire Kids Edition      ~5,000 reviews
3. Kindle E-Readers       ~4,500 reviews
4. Echo & Smart Speakers  ~4,270 reviews
5. Fire TV & Streaming    ~2,554 reviews
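
One way to assign each product name to a meta-category is an ordered list of keyword rules. The sketch below is only an illustration: the patterns in `META_CATEGORY_RULES` and the function `assign_meta_category` are hypothetical and are not the mapping used to produce the counts above. Order matters, because a name such as "Fire Kids Edition Tablet" would otherwise match the generic Fire tablet rule first.

```python
import re
from typing import Optional

# Hypothetical keyword rules, checked in order; the first match wins.
# These patterns are illustrative assumptions, not the notebook's mapping.
META_CATEGORY_RULES = [
    ("Fire Kids Edition",     r"kids edition"),
    ("Fire TV & Streaming",   r"fire\s?tv"),
    ("Echo & Smart Speakers", r"\becho\b|\btap\b"),
    ("Kindle E-Readers",      r"kindle|e-?reader"),
    ("Fire Tablets",          r"fire.*tablet|fire hd"),
]

def assign_meta_category(name: str) -> Optional[str]:
    """Return the first meta-category whose pattern matches the product name."""
    for label, pattern in META_CATEGORY_RULES:
        if re.search(pattern, name, flags=re.IGNORECASE):
            return label
    return None  # product falls outside the five families

examples = [
    "Fire HD 8 Tablet with Alexa, 8 HD Display, 16 GB, Tangerine - with Special Offers",
    "Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16 GB, Blue Kid-Proof Case",
    "AmazonBasics AAA Performance Alkaline Batteries",
]
for product in examples:
    print(assign_meta_category(product), "<-", product)
```

On a DataFrame, the same function could be applied with `df["name"].apply(assign_meta_category)`; products that return `None` would need manual review or an extra rule.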

### The Cleaning Tasks Per Column

```python
# Source column -> standardized column name (with the reason it is kept)
KEEP = {
    "name": "name",                           # product identity for clustering/generation
    "reviews.rating": "rating",               # sentiment labels
    "reviews.text": "review_text",            # the core input for all 3 models
    "primaryCategories": "primary_category",  # DS2/DS3 only, clustering reference
    "brand": "brand",                         # useful context for generation
}
```

---

**`reviews.text` (most important)**

- Drop nulls
- Drop reviews under 20 characters (no signal)
- Strip leading/trailing whitespace
- Remove the \r\n duplication artifact (same as names)
- Decode any HTML entities (&amp; → &, etc.)


**`name`**

- Split on \r\n, take first part only
- Strip trailing commas and whitespace
- Drop nulls (can't assign category without a name)


**`reviews.rating`**

- Coerce to numeric, errors → NaN
- Drop NaN
- Cast to int
- Drop anything outside 1–5 range (data corruption)


**`primaryCategories` (DS2/DS3 only)**

- Strip whitespace
- Standardize multi-label separator (some use comma, some use comma+space)
- Keep as-is otherwise


**`brand`**

- Strip whitespace
- Fill nulls with "Unknown" — don't drop rows over a missing brand
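
To make the rules above concrete, here is a minimal per-value sketch of the `reviews.text` and `reviews.rating` rules using only the standard library. The names `clean_review_value`, `clean_rating_value`, and `MIN_REVIEW_CHARS` are hypothetical and exist only for this illustration; they are not the notebook's pandas-based cleaners.

```python
import html
import math
from typing import Optional

MIN_REVIEW_CHARS = 20  # threshold from the reviews.text rule above

def clean_review_value(raw: object) -> Optional[str]:
    """Apply the reviews.text rules to one value; None means 'drop this row'."""
    if not isinstance(raw, str):           # nulls arrive as None or float NaN
        return None
    text = raw.split("\r\n")[0].strip()    # keep the part before the duplication artifact
    text = html.unescape(text)             # e.g. "&amp;" becomes "&"
    return text if len(text) >= MIN_REVIEW_CHARS else None

def clean_rating_value(raw: object) -> Optional[int]:
    """Apply the reviews.rating rules to one value; None means 'drop this row'."""
    try:
        value = float(raw)
    except (TypeError, ValueError):        # non-numeric input, e.g. "five" or None
        return None
    if not math.isfinite(value):           # NaN or infinity
        return None
    rating = int(value)                    # int() truncates, like casting a float column to int
    return rating if 1 <= rating <= 5 else None

print(clean_review_value("Works great &amp; arrived fast\r\nWorks great &amp; arrived fast"))
print(clean_review_value("Good"))
print(clean_rating_value("4.0"), clean_rating_value("five"), clean_rating_value(7))
```

The short review, the non-numeric rating, and the out-of-range rating all come back as `None`, which corresponds to a dropped row in the column-wise version.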


In [13]:
# Load a dataset 

def load_raw(path: str, low_memory: bool = False) -> pd.DataFrame:
    """Load a raw CSV file into a DataFrame."""
    return pd.read_csv(path, low_memory=low_memory)

In [14]:
# Select and Rename

def select_columns(df: pd.DataFrame, has_primary_cat: bool = False) -> pd.DataFrame:
    """
    Keep only relevant columns and rename them to standard names.
    Gracefully skips columns that don't exist in this dataset.
    """
    keep = ["name", "brand", "reviews.rating", "reviews.text"]
    if has_primary_cat:
        keep.append("primaryCategories")

    # Only keep columns that actually exist
    existing = [c for c in keep if c in df.columns]
    df = df[existing].copy()

    return df.rename(columns={
        "reviews.text":      "review_text",
        "reviews.rating":    "rating",
        "primaryCategories": "primary_category",
    })

In [15]:
# Individual column cleaners

def clean_name(df: pd.DataFrame) -> pd.DataFrame:
    """Clean product name: remove duplication artifact, strip junk."""
    df["name"] = df["name"].apply(
        lambda x: "" if not isinstance(x, str) else x
    )
    df["name"] = (
        df["name"]
        .str.split("\r\n").str[0]
        .str.strip(" ,")
    )
    return df[df["name"].str.len() > 0].copy()

def clean_review_text(df: pd.DataFrame) -> pd.DataFrame:
    """Clean review text: remove artifacts, decode HTML, drop short reviews."""
    # Convert to string first, before any chained operation
    df["review_text"] = df["review_text"].apply(
        lambda x: "" if not isinstance(x, str) else x
    )
    df["review_text"] = (
        df["review_text"]
        .str.split("\r\n").str[0]
        .str.strip()
        .apply(html.unescape)
    )
    df = df[df["review_text"].str.len() >= 20].copy()  # Drop reviews under 20 characters (too little signal)
    return df

def clean_brand(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize brand: fill nulls, strip whitespace."""
    if "brand" not in df.columns:
        return df
    df["brand"] = df["brand"].fillna("Unknown").astype(str).str.strip()
    return df

def clean_primary_category(df: pd.DataFrame) -> pd.DataFrame:
    """
    Normalize primary_category: fill nulls, strip whitespace,
    standardize comma separator.
    Handles float NaN values safely.
    """
    if "primary_category" not in df.columns:
        return df
    df["primary_category"] = (
        df["primary_category"]
        .fillna("Unknown")      # kills float NaN before any string ops
        .astype(str)            # ensures everything is a string
        .str.strip()
        .str.replace(", ", ",", regex=False)
    )
    return df

def clean_rating(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce rating to int, drop nulls and out-of-range values."""
    df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
    df = df.dropna(subset=["rating"]).copy()  # copy so the cast below never writes to a view
    df["rating"] = df["rating"].astype(int)
    return df[df["rating"].between(1, 5)].copy()

In [16]:
# Main function to call helper functions

def load_and_clean(
    path: str,
    has_primary_cat: bool = False,
    low_memory: bool = False
) -> pd.DataFrame:
    """
    Full pipeline: load → select → clean each column → reset index.
    """
    df = load_raw(path, low_memory=low_memory)
    df = select_columns(df, has_primary_cat=has_primary_cat)
    df = clean_name(df)
    df = clean_review_text(df)
    df = clean_rating(df)
    df = clean_brand(df)
    df = clean_primary_category(df)
    return df.reset_index(drop=True)

In [17]:
# ── DS1 ──────────────────────────────────────────────────────
# No primaryCategories column, so clean_primary_category is a no-op here
ds1_clean = load_and_clean("../data/1429_1.csv", has_primary_cat=False, low_memory=False)
print(f"DS1: {len(ds1_clean):,} rows")

# ── DS2 ──────────────────────────────────────────────────────
ds2_clean = load_and_clean("../data/amazon_customer_reviews_2017_2018.csv", has_primary_cat=True)
print(f"DS2: {len(ds2_clean):,} rows")

# ── DS3 ──────────────────────────────────────────────────────
ds3_clean = load_and_clean("../data/amazon_customer_reviews_Feb_April_2019.csv", has_primary_cat=True)
print(f"DS3: {len(ds3_clean):,} rows")

# ── Summary ──────────────────────────────────────────────────
print("\n── Final clean row counts ──")
for name, dataset in [("DS1", ds1_clean), ("DS2", ds2_clean), ("DS3", ds3_clean)]:
    print(f"{name}: {len(dataset):,} rows | columns: {dataset.columns.tolist()}")

DS1: 27,747 rows
DS2: 5,000 rows
DS3: 25,794 rows

── Final clean row counts ──
DS1: 27,747 rows | columns: ['name', 'brand', 'rating', 'review_text']
DS2: 5,000 rows | columns: ['name', 'brand', 'rating', 'review_text', 'primary_category']
DS3: 25,794 rows | columns: ['name', 'brand', 'rating', 'review_text', 'primary_category']


In [18]:
import os
os.makedirs("../data/processed", exist_ok=True)

# ── Check 1: Sentiment corpus (all three combined) ────────────────────────
sentiment_df = pd.concat(
    [ds1_clean, ds2_clean, ds3_clean], ignore_index=True
).drop_duplicates(subset=["review_text"])

# Map ratings to sentiment labels
def rating_to_sentiment(rating: int) -> str:
    if rating >= 4:
        return "positive"
    elif rating == 3:
        return "neutral"
    return "negative"

sentiment_df["sentiment"] = sentiment_df["rating"].apply(rating_to_sentiment)

print("── Sentiment Corpus ──")
print(f"Total reviews: {len(sentiment_df):,}")
print(f"Class distribution:\n{sentiment_df['sentiment'].value_counts()}")
print(f"Class balance (%):\n{sentiment_df['sentiment'].value_counts(normalize=True).mul(100).round(1)}")

# ── Check 2: Electronics corpus (clustering + generation) ─────────────────
electronics_df = pd.concat([
    ds1_clean[["name", "brand", "rating", "review_text"]],
    ds3_clean[ds3_clean["primary_category"].str.contains("Electronics", na=False)]
             [["name", "brand", "rating", "review_text"]]
], ignore_index=True).drop_duplicates(subset=["review_text"])

print("\n── Electronics Corpus ──")
print(f"Total reviews: {len(electronics_df):,}")
print(f"Unique products: {electronics_df['name'].nunique()}")

# ── Check 3: Echo vs Fire TV split decision ───────────────────────────────
for keywords, label in [
    (["Echo", "Tap"],       "Echo / Tap"),
    (["Fire TV", "FireTV"], "Fire TV")
]:
    count = electronics_df["name"].str.contains(
        "|".join(keywords), case=False, na=False
    ).sum()
    print(f"{label}: {count:,} reviews")

# ── Save processed files ──────────────────────────────────────────────────
sentiment_df.to_csv("../data/processed/sentiment_ready.csv", index=False)
electronics_df.to_csv("../data/processed/electronics_ready.csv", index=False)
print("\n── Saved ──")
print("sentiment_ready.csv")
print("electronics_ready.csv")

── Sentiment Corpus ──
Total reviews: 39,794
Class distribution:
sentiment
positive    36273
neutral      1845
negative     1676
Name: count, dtype: int64
Class balance (%):
sentiment
positive    91.2
neutral      4.6
negative     4.2
Name: proportion, dtype: float64

── Electronics Corpus ──
Total reviews: 30,487
Unique products: 80
Echo / Tap: 4,270 reviews
Fire TV: 2,554 reviews

── Saved ──
sentiment_ready.csv
electronics_ready.csv


## EDA Summary

Three raw datasets were cleaned and standardized using a modular pipeline 
(load → select → clean per column). After dropping duplicate review texts, the sentiment corpus 
contains 39,794 reviews with a severe class imbalance: 91.2% positive, 4.6% 
neutral, 4.2% negative. This will be addressed in training via class weighting 
and a balanced evaluation set.
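
As a rough illustration of what class weighting could look like for these counts, the sketch below uses the common inverse-frequency heuristic `n_samples / (n_classes * class_count)`, which is the formula behind scikit-learn's `class_weight="balanced"` option. The helper name `balanced_class_weights` is hypothetical, and the weighting scheme actually used in training may differ.

```python
from typing import Dict

def balanced_class_weights(counts: Dict[str, int]) -> Dict[str, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}

# Class counts reported for the sentiment corpus
counts = {"positive": 36273, "neutral": 1845, "negative": 1676}

weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(f"{label:<9} weight = {weight:.2f}")
```

With this heuristic every class contributes the same total weight, so the two minority classes are weighted roughly 20 times more heavily per example than the positive class.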

All three datasets are combined for sentiment training because sentiment is a 
domain-agnostic task: emotional language patterns generalize across product 
categories, and maximizing review volume yields more minority-class examples to learn from. 
For clustering and text generation, only DS1 and DS3 are used: DS2 is almost 
entirely Electronics and adds no category diversity. DS3 provides the labeled 
primary_category column that serves as ground truth for clustering evaluation, 
while DS1 contributes additional review volume within the Electronics domain.

The electronics corpus contains 30,487 reviews across 80 unique products, 
organized into 5 meta-categories: Fire Tablets, Fire Kids Edition, Kindle 
E-Readers, Echo & Smart Speakers, and Fire TV & Streaming. Rather than forcing 
artificial diversity across sparse non-electronics categories, I decided to go 
deep within a single domain where sufficient product breadth and review volume 
exist to produce meaningful clustering and generation output. 

All processed files are saved to `../data/processed/`.