# Demand & Spend Data Validation

This notebook validates upstream demand and spend data used for forecasting and impact analysis.

## Objectives
- Validate data completeness and continuity
- Verify realistic distributions and ranges
- Identify seasonality and spend–demand relationships
- Establish baseline metrics before modeling

> Note: Data is assumed to be ingested from an upstream system (e.g., BigQuery).  
> CSVs are used here strictly for demonstration and interpretability.


In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_columns", 50)
pd.set_option("display.width", 120)

print("Libraries loaded successfully.")

Libraries loaded successfully.


In [None]:
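from pathlib import Path

DATA_PATH = Path("../data/raw/demand_spend_country_daily.csv")

# Fail early with the resolved path so a wrong working directory is easy to spot.
if not DATA_PATH.exists():
    raise FileNotFoundError(f"Input file not found: {DATA_PATH.resolve()}")

df = pd.read_csv(DATA_PATH)

print("Data loaded successfully.")
print(f"Rows: {df.shape[0]:,}")
print(f"Columns: {df.shape[1]}")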


FileNotFoundError: [Errno 2] No such file or directory: '../data/raw/demand_spend_data.csv'

In [None]:
print("Column Names:")
print(df.columns.tolist())

print("\nData Types:")
display(df.dtypes)

print("\nSample Records:")
display(df.head())


In [None]:
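# "%b" expects an abbreviated month name (e.g. "2024-Jan-05").
# If the source stores ISO dates such as "2024-01-05", use "%Y-%m-%d" instead.
df["DATE"] = pd.to_datetime(df["DATE"], format="%Y-%b-%d")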

date_min = df["DATE"].min()
date_max = df["DATE"].max()

print(f"Date range: {date_min.date()} → {date_max.date()}")
print(f"Total days covered: {(date_max - date_min).days + 1:,}")


In [None]:
missing = df.isna().sum()
missing = missing[missing > 0]

if missing.empty:
    print("✅ No missing values detected.")
else:
    print("⚠️ Missing values detected:")
    display(missing)


In [None]:
categorical_cols = ["CHANNEL", "REGION", "PRODUCT"]

for col in categorical_cols:
    print(f"\n{col} — unique values:")
    display(df[col].value_counts())


In [None]:
numeric_cols = [
    "SPEND",
    "DEMAND_UNITS",
    "PRICE",
    "DISCOUNT_RATE",
    "INFLATION_RATE",
    "UNEMPLOYMENT_RATE"
]

display(df[numeric_cols].describe().round(2))


In [None]:
daily_counts = df.groupby("DATE").size()

missing_days = pd.date_range(
    start=df["DATE"].min(),
    end=df["DATE"].max()
).difference(daily_counts.index)

print(f"Missing dates: {len(missing_days)}")

if len(missing_days) == 0:
    print("✅ No gaps in daily time series.")
else:
    print("⚠️ Missing dates detected.")
    display(missing_days[:10])


In [None]:
daily_demand = (
    df.groupby("DATE")["DEMAND_UNITS"]
    .sum()
    .reset_index()
)

plt.figure()
plt.plot(daily_demand["DATE"], daily_demand["DEMAND_UNITS"])
plt.title("Total Daily Demand Over Time")
plt.xlabel("Date")
plt.ylabel("Demand Units")
plt.tight_layout()
plt.show()


In [None]:
df["MONTH"] = df["DATE"].dt.month

monthly_demand = df.groupby("MONTH")["DEMAND_UNITS"].mean()

plt.figure()
monthly_demand.plot(kind="bar")
plt.title("Average Demand by Month (Seasonality)")
plt.xlabel("Month")
plt.ylabel("Avg Demand Units")
plt.tight_layout()
plt.show()
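
The bar chart above averages row-level `DEMAND_UNITS`, so its scale depends on how many rows exist per day. One possible complement is a seasonal index: the monthly mean of daily totals divided by the overall daily mean, where values above 1 mark above-average months. The sketch uses synthetic data with an artificial December peak; the numbers are made up and `seasonal_index` is an illustrative name.

```python
import pandas as pd

# Synthetic daily totals: flat demand with an artificial December uplift.
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
daily_total = pd.Series(100.0, index=dates)
daily_total[dates.month == 12] = 200.0

# Mean daily demand per calendar month, relative to the overall daily mean.
monthly_mean = daily_total.groupby(daily_total.index.month).mean()
seasonal_index = (monthly_mean / daily_total.mean()).round(2)

print(seasonal_index)
```

With the notebook's data, `daily_total` would be `df.groupby("DATE")["DEMAND_UNITS"].sum()`.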


In [None]:
# Cap the sample size so the cell also works on datasets with fewer than 5,000 rows.
sample = df.sample(min(5000, len(df)), random_state=42)

plt.figure()
sns.scatterplot(
    data=sample,
    x="SPEND",
    y="DEMAND_UNITS",
    alpha=0.3
)
plt.title("Spend vs Demand Relationship")
plt.xlabel("Spend")
plt.ylabel("Demand Units")
plt.tight_layout()
plt.show()


In [None]:
corr_cols = [
    "SPEND",
    "DEMAND_UNITS",
    "PRICE",
    "DISCOUNT_RATE",
    "INFLATION_RATE",
    "UNEMPLOYMENT_RATE"
]

corr = df[corr_cols].corr().round(2)

plt.figure()
# Fix the colour scale to [-1, 1] so that 0 maps to the neutral colour.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, center=0)
plt.title("Correlation Matrix")
plt.tight_layout()
plt.show()
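
`DataFrame.corr()` defaults to Pearson correlation, which measures linear association only. If spend has diminishing returns, a rank-based (Spearman) coefficient may describe the relationship better. The sketch below compares the two on synthetic data with a square-root response; the data-generating process is an assumption made purely for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic spend/demand pair with diminishing returns plus noise.
rng = np.random.default_rng(42)
spend = rng.uniform(100, 1000, size=500)
demand = 5 * np.sqrt(spend) + rng.normal(0, 5, size=500)
demo = pd.DataFrame({"SPEND": spend, "DEMAND_UNITS": demand})

# Pearson captures linear association; Spearman captures monotonic association.
pearson = demo.corr(method="pearson").loc["SPEND", "DEMAND_UNITS"]
spearman = demo.corr(method="spearman").loc["SPEND", "DEMAND_UNITS"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

A large gap between the two coefficients on the real data would suggest a non-linear spend response worth modeling explicitly.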


In [None]:
baseline_metrics = {
    "AVG_DAILY_DEMAND": df.groupby("DATE")["DEMAND_UNITS"].sum().mean(),
    "AVG_DAILY_SPEND": df.groupby("DATE")["SPEND"].sum().mean(),
    "AVG_PRICE": df["PRICE"].mean(),
    "AVG_DISCOUNT": df["DISCOUNT_RATE"].mean()
}

baseline_df = pd.DataFrame(
    baseline_metrics, index=["BASELINE"]
).round(2)

display(baseline_df)


## Validation Summary

- Data spans multiple years with no temporal gaps
- Demand and spend show realistic variability and seasonality
- Strong relationship observed between spend and demand
- Macro-economic variables fall within expected ranges
- Dataset is suitable for:
  - Demand forecasting
  - Spend elasticity modeling
  - Scenario impact analysis
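
The range statements in this summary can be made repeatable by encoding them as explicit checks. The following is a minimal sketch, not the notebook's actual validation logic: `EXPECTED_RANGES` and `out_of_range_counts` are hypothetical names, and the bounds (non-negative spend, demand, and price; discount rate expressed as a fraction between 0 and 1) are assumptions to be replaced with the agreed business definitions.

```python
import pandas as pd

# Hypothetical inclusive bounds; None means "no limit on this side".
EXPECTED_RANGES = {
    "SPEND": (0.0, None),
    "DEMAND_UNITS": (0.0, None),
    "PRICE": (0.0, None),
    "DISCOUNT_RATE": (0.0, 1.0),  # assumes a fraction, not a percentage
}


def out_of_range_counts(frame: pd.DataFrame, ranges: dict) -> pd.Series:
    """Count rows per column that fall outside the inclusive bounds."""
    violations = {}
    for col, (low, high) in ranges.items():
        mask = pd.Series(False, index=frame.index)
        if low is not None:
            mask |= frame[col] < low
        if high is not None:
            mask |= frame[col] > high
        violations[col] = int(mask.sum())
    return pd.Series(violations, name="out_of_range_rows")


# Synthetic rows for illustration: one negative spend, one discount above 1.
demo = pd.DataFrame({
    "SPEND": [100.0, 250.0, -5.0],
    "DEMAND_UNITS": [10, 20, 30],
    "PRICE": [9.99, 9.99, 12.5],
    "DISCOUNT_RATE": [0.0, 0.15, 1.2],
})

report = out_of_range_counts(demo, EXPECTED_RANGES)
print(report)
```

Missing values compare as false and are therefore not counted here; they are covered separately by the missing-value check earlier in the notebook.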

Next step: feature engineering and model training.
