# Dataset Analysis and Pipeline Validation
## Image-Based Product Category Classification

This notebook covers the initial analysis of the raw dataset and validates the effectiveness of our automated data preparation pipeline.

The goals of this notebook are to:
- Inspect the raw dataset and its original category distribution.
- **Validate the new category structure** created by our NLP-based merging script.
- **Analyze the balance of the final, merged dataset**.
- Verify data quality, such as the validity of image URLs.

In [None]:
# Import tools needed for data analysis and visualization
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
import requests
from concurrent.futures import ThreadPoolExecutor

plt.style.use("default")

## 1. Loading the Dataset

The dataset is provided as two CSV files. `products.csv` contains the product title, image URL, and category id of each product, and `categories.csv` contains the category names.
Due to its size, the dataset is hosted externally and downloaded using a dedicated script.

In [None]:
products_df = pd.read_csv("../data/raw/products.csv")
categories_df = pd.read_csv("../data/raw/categories.csv")
products_df.head()

In [None]:
print(f"Number of samples: {len(products_df)}")
print(f"Number of columns: {len(products_df.columns)}")
products_df.columns

- The dataset contains each product's title, image URL, and corresponding category id.
- The category labels are stored in a separate CSV file.
- The following analysis focuses on cleaning and refining these fields.
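
Because the category names live in a separate file, a join is needed to read the labels next to each product. The sketch below shows one way to do this on toy data. The `id`, `category_name`, `category_id`, and `imgUrl` column names are the ones used in this notebook, while the `title` column name and all values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for products.csv and categories.csv (values are made up)
toy_products = pd.DataFrame({
    "title": ["Running shoes", "Gaming laptop", "Trail shoes"],  # assumed column name
    "imgUrl": [
        "https://example.com/a.jpg",
        "https://example.com/b.jpg",
        "https://example.com/c.jpg",
    ],
    "category_id": [1, 2, 1],
})
toy_categories = pd.DataFrame({
    "id": [1, 2],
    "category_name": ["Shoes", "Laptops"],
})

# Attach the category name to every product via category_id -> id
toy_labeled = toy_products.merge(
    toy_categories,
    left_on="category_id",
    right_on="id",
    how="left",
).drop(columns="id")

print(toy_labeled[["title", "category_name"]])
```

A left join keeps every product row, so a product whose `category_id` has no match would show up with a missing `category_name` rather than disappear silently.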

## 2. Raw Category Inspection

We first examine the distribution of all available categories to understand their frequency and suitability for image-based classification.

In [None]:
raw_category_counts = products_df["category_id"].value_counts()
print(categories_df[["id","category_name"]].head())

print("\nMin frequency of a category: " + str(raw_category_counts.min()))
print("Max frequency of a category: " + str(raw_category_counts.max()))
print("Mean frequency of a category: " + str(int(raw_category_counts.mean())))

## 3. Automated Category Merging

The original dataset contains 248 categories, many of which are either too specific, too broad, or semantically very similar. A manual approach to filtering or merging is not scalable or reproducible.

In addition, some categories contain only a few products, which can unbalance the dataset.

To solve this, we implemented an automated pipeline that uses **NLP (Sentence-Transformers) and Clustering (K-Means)** to group the 248 original categories into a more manageable and balanced set of around 50 new categories. 
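
As a rough sketch of the clustering step, the example below groups a few category names with K-Means. The real pipeline embeds the names with Sentence-Transformers; here small hand-written 2D vectors stand in for those embeddings so the example runs without downloading a model. The category names, vectors, and cluster count are illustrative assumptions, not values taken from the actual script.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical category names with hand-made stand-in "embeddings".
# In the real pipeline these vectors would come from a sentence-embedding model.
category_names = ["Men's Shoes", "Women's Shoes", "Laptops", "Tablets"]
embeddings = np.array([
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.2, 0.9],
])

# Cluster the embeddings; each cluster becomes one merged category
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
merged_ids = kmeans.fit_predict(embeddings)

for name, merged_id in zip(category_names, merged_ids):
    print(f"{name} -> merged category {merged_id}")
```

The numeric cluster ids are arbitrary; only the grouping matters, so the two shoe categories share one id and the two electronics categories share the other.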

The following steps analyze the result of that pipeline.

First, we load the final processed dataset and the mapping file generated by our scripts.

In [None]:
# Load the results of the new data preparation pipeline
products_cleaned_df = pd.read_csv("../data/processed/products_cleaned.csv")
mapping_df = pd.read_csv("../data/processed/category_mapping.csv")

print(f"Loaded {len(products_cleaned_df)} cleaned products.")
print(f"Loaded mapping for {len(mapping_df)} original categories into {mapping_df['merged_category_id'].nunique()} new categories.")
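
A few structural checks can help validate a mapping like this one. The sketch below runs them on toy data; `original_category_id` is an assumed column name, since only `merged_category_id` is confirmed by the code above.

```python
import pandas as pd

# Toy mapping; "original_category_id" is an assumed column name
toy_mapping = pd.DataFrame({
    "original_category_id": [10, 11, 12, 13],
    "merged_category_id": [0, 0, 1, 1],
})
toy_products = pd.DataFrame({"merged_category_id": [0, 0, 1, 1, 1]})

# Each original category should map to exactly one merged category
assert toy_mapping["original_category_id"].is_unique

# Every merged id used by the products should exist in the mapping
unknown_ids = set(toy_products["merged_category_id"]) - set(toy_mapping["merged_category_id"])
assert not unknown_ids, f"Unmapped merged ids: {unknown_ids}"

# Number of original categories folded into each merged category
group_sizes = toy_mapping.groupby("merged_category_id").size()
print(group_sizes)
```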

To keep the dataset reliable, categories with fewer than 100 samples were removed, which gives a more balanced and learnable dataset.

The categories that remain after this step form the final category set.
These categories are visually distinguishable and suitable for classification.
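
The minimum-sample filter can be sketched as follows. This is an illustration on toy data, not the pipeline's actual code; the threshold is lowered from 100 to 3 so the effect is visible on a handful of rows.

```python
import pandas as pd

MIN_SAMPLES = 3  # the notebook's threshold is 100; lowered for this toy data

toy_df = pd.DataFrame({
    "merged_category_id": [0, 0, 0, 0, 1, 1, 2, 2, 2],
})

# Count samples per category and keep only categories at or above the threshold
counts = toy_df["merged_category_id"].value_counts()
kept_categories = counts[counts >= MIN_SAMPLES].index
filtered_df = toy_df[toy_df["merged_category_id"].isin(kept_categories)]

print(sorted(filtered_df["merged_category_id"].unique()))
```

Category 1 has only two samples, so it is dropped while categories 0 and 2 are kept.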


## 4. Dataset Balance Analysis

We analyze the number of samples per category to assess dataset balance.

In [None]:
# Getting ready for visualization
category_counts = products_cleaned_df["merged_category_id"].value_counts()
category_counts_df = (
    category_counts
    .reset_index()
)
category_counts_df.columns = ["merged_category_id", "count"]
category_counts_df = category_counts_df.merge(
    mapping_df,
    on="merged_category_id",
    how="left"
)
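
One thing worth checking in the merge above: if `mapping_df` holds one row per original category, several rows share the same `merged_category_id`, and a left merge repeats each count once per matching row. Whether this applies depends on the layout of `category_mapping.csv`, which is not shown here. The toy example below illustrates the effect and one possible fix; the `original_category_id` column and all values are assumptions.

```python
import pandas as pd

toy_counts = pd.DataFrame({"merged_category_id": [0, 1], "count": [500, 300]})
toy_mapping = pd.DataFrame({
    "original_category_id": [10, 11, 12],  # assumed column name
    "merged_category_id": [0, 0, 1],
    "category_name": ["Sneakers", "Boots", "Laptops"],
})

# Naive merge: merged id 0 matches two mapping rows, so its count appears twice
naive = toy_counts.merge(toy_mapping, on="merged_category_id", how="left")

# Keep one representative row per merged category before merging
one_per_merged = toy_mapping.drop_duplicates("merged_category_id")
fixed = toy_counts.merge(
    one_per_merged[["merged_category_id", "category_name"]],
    on="merged_category_id",
    how="left",
    validate="one_to_one",  # raises if either side has duplicate keys
)

print(len(naive), len(fixed))
```

With duplicates removed, each merged category contributes exactly one bar to a plot, labeled with the first original name found for that group.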

In [None]:
plt.figure(figsize=(15, 6))

plt.bar(
    category_counts_df["category_name"],
    category_counts_df["count"]
)

plt.xticks(rotation=90)
plt.xlabel("Category")
plt.ylabel("Number of Products")
plt.title("Category Distribution")

plt.tight_layout()
plt.show()

The graph shows some imbalance across categories.
This imbalance will be addressed in later phases using data augmentation and training strategies.
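
To put a number on the imbalance, one option is the ratio between the largest and smallest category. The sketch below computes it on made-up counts and also derives inverse-frequency class weights. Class weighting is one common training strategy and is shown here only as an assumption; this notebook does not state which strategy is used later.

```python
import pandas as pd

# Made-up per-category sample counts
toy_counts = pd.Series({"A": 400, "B": 200, "C": 100})

# Imbalance ratio: largest category size divided by smallest
imbalance_ratio = toy_counts.max() / toy_counts.min()

# Inverse-frequency class weights: total / (num_classes * count)
class_weights = toy_counts.sum() / (len(toy_counts) * toy_counts)

print(f"Imbalance ratio: {imbalance_ratio:.1f}")
print(class_weights.round(3))
```

Rare categories receive larger weights, so errors on them count more during training.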

## 5. Image URL Quality Check

To assess data quality, we verify whether image URLs are reachable.
Due to dataset size, this check is performed on a random subset.


In [None]:
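def is_url_valid(url):
    """Return True if the URL responds with HTTP 200, False otherwise."""
    try:
        # stream=True avoids downloading the body; the context manager
        # releases the connection once the status code has been read.
        with requests.get(
            url,
            timeout=5,
            stream=True,
            allow_redirects=True,
            headers={"User-Agent": "Mozilla/5.0"}
        ) as r:
            return r.status_code == 200
    except requests.RequestException:
        return False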
    
def check_urls_parallel(urls, max_workers=40):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        for res in tqdm(executor.map(is_url_valid, urls), total=len(urls)):
            results.append(res)
    return results

sample_df = products_cleaned_df.sample(200, random_state=42).copy()

sample_df["valid"] = check_urls_parallel(
    sample_df["imgUrl"].tolist(),
    max_workers=20
)

sample_df["valid"].value_counts()

The sample-based validation indicates that most image URLs are reachable.
Invalid URLs are handled automatically during the image download and training stages, where missing images are skipped.
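
Because the check covers only 200 products, the observed share of valid URLs is an estimate. A Wilson score interval gives a rough sense of its uncertainty. The counts in the sketch below are hypothetical and are not results from this notebook.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Hypothetical outcome: 190 of the 200 sampled URLs were reachable
low, high = wilson_interval(successes=190, n=200)
print(f"Estimated valid share: {190 / 200:.1%} (95% CI: {low:.1%} to {high:.1%})")
```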

## 6. Example Images from Each Category

To better understand the visual characteristics of each category and verify that
categories are visually distinguishable, we display a small number of example
images from each category.

For each category, 2–3 sample images are randomly selected and visualized.

In [None]:
from PIL import Image
from io import BytesIO
import time

def load_image_from_url(url, timeout=5, retries=3, backoff=1.0):
    """
    Fetch an image from a URL and return a PIL Image.

    Args:
        url (str): Image URL
        timeout (int): Request timeout in seconds
        retries (int): Number of retry attempts
        backoff (float): Seconds to wait between retries

    Returns:
        PIL.Image or None
    """
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return Image.open(BytesIO(response.content)).convert("RGB")
        except Exception:
            # Network error, bad status, or undecodable image: wait, then retry
            if attempt < retries:
                time.sleep(backoff)

    # All attempts failed (or retries < 1)
    return None
    
def fetch_image(row):
    """
    Fetch image for a single dataframe row.
    Returns (category, image or None).
    """
    img = load_image_from_url(row["imgUrl"])
    return row["merged_category_id"], img
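
`load_image_from_url` waits a fixed `backoff` between attempts. A common alternative is exponential backoff, where the wait doubles after each failure. The helper below is a generic sketch of that idea and is not part of this notebook's pipeline; `retry_with_backoff` and its parameters are hypothetical names.

```python
import time

def retry_with_backoff(func, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call func() up to `retries` times, doubling the wait after each failure.

    Returns func()'s result, or None if every attempt raises.
    """
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            if attempt < retries - 1:
                # Waits base_delay, 2 * base_delay, 4 * base_delay, ...
                sleep(base_delay * 2 ** attempt)
    return None

# Usage with a function that fails twice before succeeding
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"

delays = []  # record the requested waits instead of actually sleeping
result = retry_with_backoff(flaky, retries=3, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper easy to test, since the waits can be recorded rather than actually performed.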

In [None]:
images_per_category = 2
rows = []
categories = products_cleaned_df["merged_category_id"].unique()
n_rows = len(categories)
n_cols = images_per_category

for category in categories:
    category_df = products_cleaned_df[
        products_cleaned_df["merged_category_id"] == category
    ]
    sampled = category_df.sample(
        min(images_per_category, len(category_df)),
        random_state=42
    )
    rows.extend(sampled.to_dict("records"))

images = []

with ThreadPoolExecutor(max_workers=20) as executor:
    for result in tqdm(
        executor.map(fetch_image, rows),
        total=len(rows),
        desc="Downloading images"
    ):
        images.append(result)

In [None]:
fig, axes = plt.subplots(
    n_rows,
    n_cols,
    figsize=(n_cols * 2, n_rows * 2),
    squeeze=False  # always return a 2D array, even for a single row or column
)

idx = 0
for row_idx, category in enumerate(categories):
    for col_idx in range(images_per_category):
        ax = axes[row_idx][col_idx]
        img_category, img = images[idx]

        if img is not None:
            ax.imshow(img)
        else:
            ax.text(0.5, 0.5, "Image\nNot Available",
                    ha="center", va="center")

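        # Hide ticks and frame; ax.axis("off") would also hide the y-label
        ax.set_xticks([])
        ax.set_yticks([])
        for spine in ax.spines.values():
            spine.set_visible(False)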

        if col_idx == 0:
            ax.set_ylabel(str(category), rotation=0, labelpad=40)

        idx += 1
# plt.title("Example Images from Each Category")
plt.tight_layout()
plt.show()

The example images demonstrate that the selected categories are visually distinct
and suitable for image-based classification. This qualitative inspection supports
the feasibility of the proposed learning task.

## Summary

In this notebook, we performed a full cycle of data analysis and pipeline validation:
- We began by inspecting the **raw dataset** and its original, imbalanced category structure.
- We then loaded the results of our **automated NLP-based category merging pipeline**.
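- We **visualized the final dataset's balance**: the roughly 50 merged categories are more evenly distributed than the original 248, although some imbalance remains.
- We reviewed the mapping to understand how original categories were grouped.
- We verified the reachability of image URLs on a random sample of 200 products.
- We inspected example images from each category to check that the categories are visually distinguishable.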

The resulting dataset, `products_cleaned.csv`, is now validated and ready to serve as the foundation for model training.