# 03 - Product Filtering for Forecasting Readiness

Following weekly aggregation, the dataset contains time series for a large
number of products with highly varying sales behavior.

This notebook focuses on identifying products that are suitable for
time series forecasting by applying filtering criteria based on:

- Sales regularity (number of weeks with non-zero demand)
- Business relevance (total quantity sold)

The goal is to retain products with sufficient signal quality while
excluding long-tail and irregular products that would negatively impact
forecasting performance.


In [None]:
import pandas as pd

In [11]:
## Load Weekly Aggregated Sales Data

df = pd.read_csv("../data/processed/weekly_sales.csv")
df["week"] = pd.to_datetime(df["week"])

## Compute Product-Level Sales Statistics

We compute summary statistics per product to assess sales consistency
and overall demand.

In [12]:
product_stats = (
    df
    .groupby("product_name")
    .agg(
        weeks_with_sales=("qty_sold", lambda x: (x > 0).sum()),
        total_quantity=("qty_sold", "sum"),
        total_weeks=("week", "nunique")
    )
    .reset_index()
)


In [13]:
## Inspect Product Distribution
product_stats.describe()

| | weeks_with_sales | total_quantity | total_weeks |
|---|---|---|---|
| count | 4675.0 | 4675.0 | 4675.0 |
| mean | 7.458396 | 21.369626 | 49.0 |
| std | 7.893068 | 105.379287 | 0.0 |
| min | 1.0 | 1.0 | 49.0 |
| 25% | 1.0 | 2.0 | 49.0 |
| 50% | 4.0 | 5.0 | 49.0 |
| 75% | 11.0 | 15.0 | 49.0 |
| max | 31.0 | 5849.0 | 49.0 |


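Based on the observed distribution:

- Every product has 49 weekly rows (`total_weeks` has zero variance), so weeks without sales are stored as explicit zeros.
- At least half of the products record sales in 4 or fewer of the 49 weeks (median `weeks_with_sales` = 4).
- About 75% of products sell 15 units or fewer across the entire period, while the best seller reaches 5,849 units.
- Mean total quantity (21.4) is roughly four times the median (5), so a small subset of products dominates overall demand.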

To favor model stability and meaningful forecasts, we apply conservative
filtering thresholds that require both sales regularity and volume.

The regularity threshold (10 weeks with sales) sits close to the upper
quartile of `weeks_with_sales` (11 weeks). The volume threshold (50 units)
is set well above the upper quartile of `total_quantity` (15 units), so
only products with a clearly sufficient historical signal are retained.

> These thresholds are considered initial and may be adjusted
based on downstream EDA findings or business requirements.
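
Because the thresholds are provisional, it can help to see how sensitive the outcome is to them. The sketch below is one possible way to do that, not part of the original analysis: a hypothetical helper `sweep_thresholds` counts retained products and the share of demand they cover for each threshold pair. The toy statistics are made-up values; in the notebook the function would be applied to `product_stats`.

```python
import pandas as pd


def sweep_thresholds(product_stats, week_grid, quantity_grid):
    """Count retained products and demand share for each threshold pair."""
    total_products = len(product_stats)
    total_demand = product_stats["total_quantity"].sum()
    rows = []
    for min_weeks in week_grid:
        for min_qty in quantity_grid:
            # Same combined rule as the notebook: both conditions must hold
            keep = (
                (product_stats["weeks_with_sales"] >= min_weeks)
                & (product_stats["total_quantity"] >= min_qty)
            )
            kept_demand = product_stats.loc[keep, "total_quantity"].sum()
            rows.append(
                {
                    "min_weeks": min_weeks,
                    "min_quantity": min_qty,
                    "products_kept": int(keep.sum()),
                    "retention": keep.sum() / total_products,
                    "demand_coverage": kept_demand / total_demand,
                }
            )
    return pd.DataFrame(rows)


# Toy statistics (made-up values, not the real dataset)
toy_stats = pd.DataFrame(
    {
        "product_name": ["A", "B", "C", "D"],
        "weeks_with_sales": [30, 12, 4, 1],
        "total_quantity": [500, 60, 8, 1],
    }
)

sweep = sweep_thresholds(toy_stats, week_grid=[5, 10], quantity_grid=[20, 50, 100])
print(sweep)
```

Reading retention and demand coverage side by side shows which threshold pairs drop many products while giving up little demand.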


In [14]:
MIN_WEEKS_WITH_SALES = 10
MIN_TOTAL_QUANTITY = 50

In [15]:
## Apply Product Filtering
valid_products = product_stats.loc[
    (product_stats["weeks_with_sales"] >= MIN_WEEKS_WITH_SALES)
    & (product_stats["total_quantity"] >= MIN_TOTAL_QUANTITY),
    "product_name",
]

filtered_df = df[df["product_name"].isin(valid_products)].copy()

In [16]:
## Filtering Results Summary
print(f"Total products before filtering: {product_stats.shape[0]}")
print(f"Total products after filtering: {filtered_df['product_name'].nunique()}")


Total products before filtering: 4675
Total products after filtering: 423


In [18]:
retention_rate = filtered_df["product_name"].nunique() / product_stats.shape[0]
print(f"Retention rate: {retention_rate:.2%}")


Retention rate: 9.05%


## Filtering Outcome

Applying the combined filtering criteria removed roughly 91% of the
products:

- Total products before filtering: 4,675
- Products retained for forecasting: 423 (9.05%)

This is consistent with a strong long-tail distribution in the dataset,
where only a small subset of products sells regularly and in volume.

The filtered dataset represents products with sufficient sales frequency
and volume, making them suitable for reliable time series analysis and
forecasting.

This dataset will be used in subsequent exploratory analysis and model
development steps.
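
Before handing the filtered data to later steps, it may be worth confirming that every retained product still has one row per week, since many forecasting methods assume a regular weekly grid. The following is a minimal sketch of such a check under that assumption. The helper name `weeks_per_product` and the toy frame are hypothetical; in the notebook the check would run on `filtered_df`.

```python
import pandas as pd


def weeks_per_product(frame):
    """Return the number of distinct weeks observed for each product."""
    return frame.groupby("product_name")["week"].nunique()


# Toy frame (made-up values): product "B" is missing the second week
toy = pd.DataFrame(
    {
        "product_name": ["A", "A", "A", "B", "B"],
        "week": pd.to_datetime(
            ["2024-01-01", "2024-01-08", "2024-01-15", "2024-01-01", "2024-01-15"]
        ),
        "qty_sold": [3, 0, 2, 1, 4],
    }
)

counts = weeks_per_product(toy)
expected_weeks = toy["week"].nunique()

# Products with fewer weekly rows than the full calendar have gaps
incomplete = counts[counts < expected_weeks]
print(incomplete)
```

An empty `incomplete` result would indicate that no retained product has gaps in its weekly series.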

---


## Demand Coverage After Filtering

In addition to product retention, we assess how much of the total demand
is preserved after filtering to ensure business relevance.


In [19]:
total_demand_before = df["qty_sold"].sum()
total_demand_after = filtered_df["qty_sold"].sum()

demand_coverage = total_demand_after / total_demand_before

print(f"Demand coverage after filtering: {demand_coverage:.2%}")


Demand coverage after filtering: 64.55%


In [20]:
## Sales Density Improvement

density_before = (df["qty_sold"] > 0).mean()
density_after = (filtered_df["qty_sold"] > 0).mean()

print(f"Sales density before filtering: {density_before:.2%}")
print(f"Sales density after filtering: {density_after:.2%}")

Sales density before filtering: 15.22%
Sales density after filtering: 51.71%
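
The figures above are averages over all product-week rows, so they can hide differences between individual products. As a rough sketch, density can also be computed per product. The helper `per_product_density` and the toy data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def per_product_density(frame):
    """Share of weeks with non-zero sales, computed separately per product."""
    return (
        frame.assign(has_sale=frame["qty_sold"] > 0)
        .groupby("product_name")["has_sale"]
        .mean()
    )


# Toy data (made-up values): "A" sells in 3 of 4 weeks, "B" in 1 of 4
toy = pd.DataFrame(
    {
        "product_name": ["A"] * 4 + ["B"] * 4,
        "qty_sold": [2, 0, 1, 3, 0, 0, 0, 5],
    }
)

density = per_product_density(toy)
print(density)

# Summary of the per-product values, e.g. to spot very sparse products
print(density.describe())
```

Applied to `filtered_df`, the lower end of this distribution would show which retained products are still mostly zeros.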


In [17]:
## Save Filtered Dataset
filtered_df.to_csv("../data/processed/filtered_weekly_sales.csv", index=False)

## Final Notes

The applied product filtering strategy achieved a substantial improvement
in data quality while preserving a meaningful portion of overall demand.

- Although only **9.05% of products** were retained, they account for
  **64.55% of total demand**, confirming a strong long-tail structure
  in the dataset.

- Sales density increased from **15.22% to 51.71%**, a substantial
  reduction in zero-inflated time series. Roughly 48% of the remaining
  product-weeks still record no sales, so intermittent demand has not
  been eliminated.
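
The long-tail structure described in these notes can also be inspected directly as a cumulative demand curve: products are ranked by total quantity and the running share of demand is tracked. The snippet below is a minimal sketch with made-up numbers. The function name `cumulative_demand_share` is hypothetical, and in the notebook it would take `product_stats` as input.

```python
import pandas as pd


def cumulative_demand_share(product_stats):
    """Rank products by volume and compute the running share of total demand."""
    ranked = product_stats.sort_values(
        "total_quantity", ascending=False
    ).reset_index(drop=True)
    # Fraction of products included up to and including each rank
    ranked["product_share"] = (ranked.index + 1) / len(ranked)
    # Fraction of total demand covered by those products
    ranked["demand_share"] = (
        ranked["total_quantity"].cumsum() / ranked["total_quantity"].sum()
    )
    return ranked[["product_name", "product_share", "demand_share"]]


# Toy statistics (made-up values, not the real dataset)
toy_stats = pd.DataFrame(
    {
        "product_name": ["A", "B", "C", "D"],
        "total_quantity": [70, 20, 6, 4],
    }
)

curve = cumulative_demand_share(toy_stats)
print(curve)
```

Each row answers the question "what share of demand do the top products up to this rank account for", which is the same comparison as the retention and coverage figures reported here.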

This trade-off between product coverage and data quality means that
subsequent exploratory analysis and forecasting models work with denser,
more regular time series rather than sparse and irregular long-tail
products. The roughly 35% of demand carried by the excluded products is
not covered by these forecasts and would need separate handling if it is
relevant to the business.

The filtered dataset therefore provides a sounder foundation for
time series modeling and demand forecasting.
