# Image-Based Product Category Classification
## Dataset Preparation Notebook

This notebook documents the full data preprocessing pipeline used to prepare
an Amazon product dataset for image-based category classification.

### Objectives
- Migrate original Amazon categories into merged categories
- Filter irrelevant products
- Assign new category IDs
- Prepare a clean dataset for deep learning
- Analyze class balance

This dataset will be used for training a CNN-based image classifier.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

## 1. Loading Datasets

We load:
- `merged_categories.csv`: mapping of merged categories to original IDs
- `amazon_products_filtered.csv`: products filtered to valid categories

These files were generated in previous preprocessing steps.

In [None]:
merged_categories_path = "merged_categories.csv"
products_path = "amazon_products_filtered.csv"

merged_df = pd.read_csv(merged_categories_path)
products_df = pd.read_csv(products_path)

print("Merged Categories:")
display(merged_df.head())

print("Products:")
display(products_df.head())
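
Later cells rely on a few columns by name: `category_name` and `category_ids` in the merged categories file, and `category_id` and `imgUrl` in the products file. Below is a minimal sketch of an early check. The helper `missing_columns` is hypothetical (it is not part of the original pipeline), and the toy frames with made-up values stand in for the real CSVs.

```python
import pandas as pd

def missing_columns(df, required):
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]

# Toy stand-ins for the real files; the values are made up for illustration
toy_merged = pd.DataFrame({"category_name": ["Shoes"], "category_ids": ["1, 2"]})
toy_products = pd.DataFrame({"title": ["Sneaker"], "category_id": [1]})

merged_missing = missing_columns(toy_merged, ["category_name", "category_ids"])
products_missing = missing_columns(toy_products, ["category_id", "imgUrl"])

print("Merged categories missing:", merged_missing)
print("Products missing:", products_missing)
```

On the real data, the same call on `merged_df` and `products_df` should return empty lists before the notebook continues.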

## 2. Assigning New Category IDs

Each merged category is assigned a new integer ID, numbered from 0 in row order.
These IDs will be used as classification labels.

In [None]:
merged_df["merged_category_id"] = range(len(merged_df))

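# Build the original -> merged mapping.
# `category_ids` holds the comma-separated original Amazon IDs of each merged category.
original_to_merged = {}

for _, row in merged_df.iterrows():
    merged_id = row["merged_category_id"]
    for cid in str(row["category_ids"]).split(","):
        cid = int(cid.strip())
        # Without this guard, an ID listed under two merged categories is silently overwritten
        if cid in original_to_merged:
            raise ValueError(f"Original category {cid} is mapped twice")
        original_to_merged[cid] = merged_id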

display(merged_df)
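
To make the shape of the mapping concrete, here is a small worked example on a toy table. The category names and IDs are made up, and the `explode`-based construction is only one possible vectorized alternative to the loop, not the method used in this notebook.

```python
import pandas as pd

# Toy merged-category table; names and IDs are made up for illustration
toy = pd.DataFrame({
    "category_name": ["Footwear", "Bags", "Watches"],
    "category_ids": ["10, 11", "20", "30,31,32"],
})
toy["merged_category_id"] = range(len(toy))

# One row per original ID: split the comma-separated string, then explode
exploded = toy.assign(
    original_id=toy["category_ids"].astype(str).str.split(",")
).explode("original_id")
exploded["original_id"] = exploded["original_id"].str.strip().astype(int)

# Pair each original ID with the merged ID of the row it came from
toy_mapping = dict(
    zip(exploded["original_id"].tolist(), exploded["merged_category_id"].tolist())
)
print(toy_mapping)
```

Each original ID becomes a key, and every ID listed in the same row shares one merged ID as its value.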

## 3. Migrating Products to New Categories

We replace the original Amazon `category_id` with the new merged category ID.

In [None]:
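products_df["category_id"] = products_df["category_id"].astype(int)

# IDs missing from the mapping become NaN, which also turns the column into floats
products_df["merged_category_id"] = products_df["category_id"].map(
    original_to_merged
)

# Report and drop unmapped products, then restore integer labels
unmapped = products_df["merged_category_id"].isna()
print("Unmapped products dropped:", int(unmapped.sum()))
products_df = products_df[~unmapped].copy()
products_df["merged_category_id"] = products_df["merged_category_id"].astype(int)

# Remove old category_id
products_df = products_df.drop(columns=["category_id"])

display(products_df.head())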

## 4. Saving Migrated Datasets

We save:
- `merged_categories_new.csv`: merged categories with their new IDs
- `amazon_products_merged.csv`: products labeled with `merged_category_id`

In [None]:
merged_df_out = merged_df[
    ["merged_category_id", "category_name", "category_ids"]
]

merged_df_out.to_csv("merged_categories_new.csv", index=False)
products_df.to_csv("amazon_products_merged.csv", index=False)

print("Files saved successfully.")

## 5. Dataset Statistics

We count how many products fall into each merged category.
Strongly imbalanced classes can bias a classifier toward the majority categories,
so these counts guide the optional filtering in Section 7.

In [None]:
category_counts = products_df["merged_category_id"].value_counts().sort_index()

category_counts
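
The counts can be summarized in two numbers that are often useful later: the ratio between the largest and smallest class, and a per-class weight for a weighted loss. The sketch below uses toy counts in place of `category_counts` and assumes inverse-frequency weighting (total samples divided by number of classes times class count). That weighting is one common heuristic, not something this pipeline prescribes.

```python
import pandas as pd

# Toy per-class counts standing in for `category_counts`
toy_counts = pd.Series({0: 600, 1: 300, 2: 100})

# Ratio between the largest and smallest class
imbalance_ratio = toy_counts.max() / toy_counts.min()

# Inverse-frequency weights: total / (number of classes * class count)
class_weights = toy_counts.sum() / (len(toy_counts) * toy_counts)

print("Imbalance ratio:", imbalance_ratio)
print(class_weights.round(3))
```

Rare classes receive weights above 1 and frequent classes below 1, so a weighted loss penalizes mistakes on rare classes more heavily.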

## 6. Class Distribution Visualization

In [None]:
plt.figure(figsize=(12, 5))
category_counts.plot(kind="bar")
plt.title("Number of Products per Merged Category")
plt.xlabel("Merged Category ID")
plt.ylabel("Number of Products")
plt.tight_layout()
plt.show()

## 7. (Optional) Removing Underrepresented Categories

Deep learning models tend to generalize poorly on classes with very few samples.
We optionally remove categories with fewer than `MIN_SAMPLES` products.

In [None]:
MIN_SAMPLES = 100  # adjust as needed

valid_categories = category_counts[category_counts >= MIN_SAMPLES].index

filtered_df = products_df[
    products_df["merged_category_id"].isin(valid_categories)
]

print("Before:", len(products_df))
print("After: ", len(filtered_df))
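
Removing categories can leave gaps in `merged_category_id`, while many training setups expect labels numbered 0 to K-1 (PyTorch's `CrossEntropyLoss`, for example, takes class indices in that range). A minimal sketch of re-indexing the surviving IDs follows. The toy data and the column name `label` are assumptions for illustration and are not part of the original dataset.

```python
import pandas as pd

# Toy filtered data: categories 1, 3 and 4 were removed, leaving gaps
toy_filtered = pd.DataFrame({"merged_category_id": [0, 2, 2, 5, 0]})

# Map the surviving IDs to consecutive labels in sorted order
kept_ids = sorted(toy_filtered["merged_category_id"].unique().tolist())
id_to_label = {cat_id: label for label, cat_id in enumerate(kept_ids)}

# `label` is a hypothetical column name, not part of the original dataset
toy_filtered["label"] = toy_filtered["merged_category_id"].map(id_to_label)

print(id_to_label)
print(toy_filtered["label"].tolist())
```

Keeping `id_to_label` alongside the dataset makes it possible to translate model predictions back to merged category IDs.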

## 8. Final Dataset for Model Training

In [None]:
filtered_df.to_csv("amazon_products_final.csv", index=False)

display(filtered_df.head())

## Final Notes

The dataset is now:
- Clean
- Category-consistent
- Free of underrepresented categories (optionally)
- Ready for:
  - Image downloading
  - Train/validation/test splitting
  - CNN-based classification

Next steps:
- Download images using `imgUrl`
- Create PyTorch or TensorFlow dataloaders
- Train a baseline CNN (ResNet / EfficientNet)
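
For the train/validation/test splitting step, one option is a stratified split that keeps the class proportions the same in each subset. The sketch below uses pandas only. The function `stratified_split`, the 80/10/10 fractions, the seed, and the placeholder URLs are all assumptions for illustration, not choices made by this notebook.

```python
import pandas as pd

def stratified_split(df, label_col, val_frac=0.1, test_frac=0.1, seed=42):
    """Split df into train/val/test with the same fractions taken from every class."""
    # Shuffle once so that the per-class slices below are random
    shuffled = df.sample(frac=1.0, random_state=seed)
    parts = {"train": [], "val": [], "test": []}
    for _, group in shuffled.groupby(label_col):
        n_test = int(round(len(group) * test_frac))
        n_val = int(round(len(group) * val_frac))
        parts["test"].append(group.iloc[:n_test])
        parts["val"].append(group.iloc[n_test:n_test + n_val])
        parts["train"].append(group.iloc[n_test + n_val:])
    return {name: pd.concat(chunks) for name, chunks in parts.items()}

# Toy dataset: two classes with 20 products each and placeholder image URLs
toy = pd.DataFrame({
    "imgUrl": [f"https://example.com/{i}.jpg" for i in range(40)],
    "merged_category_id": [0] * 20 + [1] * 20,
})

splits = stratified_split(toy, "merged_category_id")
for name, part in splits.items():
    print(name, len(part))
```

On the real data, the same call would be made on the final filtered dataframe, and each subset could be saved to its own CSV before building dataloaders.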