# Module 1.6: First Contact with the Data

> **Goal:** Run first-contact checks to confirm the data can support the 5Q Framework. 

### This Is Triage, Not Treatment (yet!)

First contact is a **quick pass** to answer: *Can this data support our 5Q design?*

We fix only what's blocking. Everything else gets flagged for the right module.

Each section maps to a 5Q question:

| Q | Name | What It Defines | First Contact Check |
|---|------|-----------------|---------------------|
| **Q1** | Decision | The Target | Is `y` clear, numeric, clean? |
| **Q2** | Metric | What "Good" Means | Issues that bias evaluation? (NAs, duplicates) |
| **Q3** | Horizon & Level | The Structure | Enough history? Right granularity? |
| **Q4** | Data & Drivers | What Model Learns | Behavioral signals (zeros, volatility) |
| **Q5** | Ownership | Transparency | Are decisions and assumptions documented? |

---

## 1. Setup

In [1]:
# --- Imports ---
import sys
import warnings
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from dtype_diet import optimize_dtypes, report_on_dataframe
import forecast_foundations as ff
from forecast_foundations.reports import ModuleReport
import tsforge as tsf

# --- Settings ---
env = ff.setup_notebook()

DATA_DIR = env.DATA_DIR
OUTPUT_DIR = env.OUTPUT_DIR
cache = env.cache
output = env.output

✓ Setup complete | Root: real-world-forecasting-foundations | Notebook: 1.06_first_contact | Data: /Users/lindsaytruong/forecast-academy/real-world-forecasting-foundations/data | Cache: on


## 2. Load Data

`messify=True` simulates real-world data issues (string dtypes, NaN injection, duplicates, etc.).

In [2]:
# M5 data downloads to ROOT_DIR/data, messified cache goes to DATA_DIR
daily_sales = ff.load_m5(
    DATA_DIR,
    cache=cache,
    cache_key='m5_messified',
    messify=True,
    messify_config={
        'random_state': 42,
        'zeros_to_na_frac': 0.30,         # 30% of zeros → NA
        'zeros_drop_frac': 0.02,           # Drop 2% of zero rows
        'zeros_drop_gaps_frac': 0.10,      # Drop 10% of zeros (gaps)
        'duplicates_add_n': 150,           # Add 150 duplicates
        'na_drop_frac': None,              # Don't drop NAs
        'dtypes_corrupt': True,            # Corrupt dtypes
    },
    include_hierarchy=True,
)

✓ Loaded 'm5_messified'
   Module: 1_06 | Shape: 45,311,289 × 8


---

<div style="text-align: center;">

## 3. `Q1: Decision` — Defines the Target

<div style="background: linear-gradient(135deg, #11998e 0%, #38ef7d 100%); color: white; padding: 12px 20px; border-radius: 8px; margin: 10px auto; max-width: 600px;">
<strong>Is the target clear, numeric, and clean?</strong><br>
</div>

</div>


### 3.1 Identify target columns

In [3]:
# What columns do we have? What are we forecasting?
daily_sales.columns

Index(['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id', 'unique_id',
       'ds', 'y'],
      dtype='object')

### 3.2 Check dtypes

In [4]:
# Are ds and y the right types?
daily_sales[['ds', 'y']].dtypes

ds    object
y     object
dtype: object

### 3.3 Fix dtypes

Messification corrupts dtypes, so both columns arrive as strings. For `y`, `errors='coerce'` converts unparseable values to `NaN` instead of raising an error.

In [5]:
daily_sales['ds'] = pd.to_datetime(daily_sales['ds'])
daily_sales['y'] = pd.to_numeric(daily_sales['y'], errors='coerce')

In [6]:
# Verify fix
daily_sales[['ds', 'y']].dtypes

ds    datetime64[ns]
y            float64
dtype: object
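
To make the effect of `errors='coerce'` concrete, here is a minimal sketch on toy data. The values below are illustrative and are not taken from the messified M5 set; the point is only that coercion turns bad values into `NaN`/`NaT` silently, so it is worth counting how many were affected.

```python
import pandas as pd

# Toy stand-ins for corrupted columns (illustrative values, not M5 data)
raw_y = pd.Series(["3", "0", "n/a", "4.0", "abc"])
raw_ds = pd.Series(["2011-01-29", "not a date"])

# errors='coerce' replaces anything unparseable with NaN / NaT instead of raising
clean_y = pd.to_numeric(raw_y, errors="coerce")
clean_ds = pd.to_datetime(raw_ds, errors="coerce")

# Count values lost to coercion so the damage is visible, not silent
n_coerced_y = int(clean_y.isna().sum() - raw_y.isna().sum())
n_coerced_ds = int(clean_ds.isna().sum() - raw_ds.isna().sum())
print(n_coerced_y, n_coerced_ds)
```

If the coerced count is large, that is a signal to inspect the raw strings before moving on.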

### 3.4 Optimize memory

Rule of thumb: keep DataFrames under about 1 GB to avoid memory issues.

In [7]:
# Is the memory manageable? Total size in MB (deep=True also counts string contents)
daily_sales.memory_usage(deep=True).sum() / 1e6

1091.087819

In [8]:
# use the dtype-diet package to optimize dtypes
daily_sales = optimize_dtypes(daily_sales, report_on_dataframe(daily_sales))


In [9]:
# memory after optimization
daily_sales.memory_usage(deep=True).sum() / 1e6


819.220085
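
For intuition about where savings like this come from, the sketch below applies two common pandas-only techniques by hand: converting repeated strings to `category` and downcasting floats. This is an assumption-laden illustration on toy data, not a description of what `dtype_diet` does internally.

```python
import numpy as np
import pandas as pd

# Toy frame: a low-cardinality string column and a float64 target
df = pd.DataFrame({
    "store_id": ["CA_1", "CA_2"] * 500,
    "y": np.arange(1000, dtype="float64"),
})
before = df.memory_usage(deep=True).sum()

# Repeated strings are stored once each when held as a category
df["store_id"] = df["store_id"].astype("category")

# downcast='float' picks the smallest float dtype that preserves the values
# (pandas goes no lower than float32 here)
df["y"] = pd.to_numeric(df["y"], downcast="float")

after = df.memory_usage(deep=True).sum()
print(f"{before / 1e3:.1f} KB -> {after / 1e3:.1f} KB")
```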

---

<div style="text-align: center;">

## 4. `Q2: Metric` — Defines What "Good" Means

<div style="background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 12px 20px; border-radius: 8px; margin: 10px auto; max-width: 600px;">
<strong>Are there issues that would bias our evaluation?</strong><br>
<em>Missingness, duplicates, and orphan data corrupt metrics.</em>
</div>

</div>


### 4.1 Check for NAs

In [10]:
# Where are the NAs?
daily_sales.isna().sum()

item_id            0
dept_id            0
cat_id             0
store_id           0
state_id           0
unique_id          0
ds                 0
y            8498653
dtype: int64
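
Raw counts are hard to judge at this scale, so a share is often easier to read. The sketch below redoes the arithmetic with the two numbers shown above (missing `y` values and total rows) and shows the equivalent one-liner on a toy Series.

```python
import numpy as np
import pandas as pd

# Numbers copied from the outputs above
n_missing_y = 8_498_653
n_rows = 45_311_289
na_share = n_missing_y / n_rows
print(f"{na_share:.1%} of rows have a missing target")

# Same idea on a toy Series: isna().mean() gives the share directly
toy_y = pd.Series([1.0, np.nan, 0.0, np.nan])
toy_share = toy_y.isna().mean()
```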

### 4.2 Drop invalid dates

Drop rows with null dates (can't aggregate without a date).

In [11]:
daily_sales = daily_sales.dropna(subset=['ds'])

### 4.3 Check for orphans

Rows with null ID columns are orphan data — they can't be aggregated properly.

In [12]:
id_cols = [c for c in daily_sales.columns if c not in ['ds', 'y']]
daily_sales[id_cols].isna().sum()

item_id      0
dept_id      0
cat_id       0
store_id     0
state_id     0
unique_id    0
dtype: int64

### 4.4 Remove duplicates

Duplicates inflate aggregates and bias metrics. Remove before aggregating.

In [13]:
non_target_cols = [c for c in daily_sales.columns if c != 'y']
# Boolean mask: True for every repeat of an (ids, ds) combination after the first
dup_mask = daily_sales.duplicated(subset=non_target_cols)

In [14]:
dup_mask.value_counts()

False    45311160
True          129
Name: count, dtype: int64

In [15]:
# Remove them
daily_sales = daily_sales.drop_duplicates(subset=non_target_cols)
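
Because duplicates are matched on everything except `y`, two rows with the same key can carry different target values, and `drop_duplicates` keeps the first by default. A minimal sketch on toy data (the column names mirror the notebook, the values are made up) shows how to count such conflicts before dropping:

```python
import pandas as pd

toy = pd.DataFrame({
    "unique_id": ["A", "A", "A", "B"],
    "ds": pd.to_datetime(["2011-01-29", "2011-01-29", "2011-01-30", "2011-01-29"]),
    "y": [3.0, 5.0, 1.0, 2.0],
})
key_cols = ["unique_id", "ds"]

# True for the second and later occurrences of each key
dup_mask = toy.duplicated(subset=key_cols)

# keep=False marks every member of a duplicated group, so we can
# count keys whose copies disagree on y
all_copies = toy[toy.duplicated(subset=key_cols, keep=False)]
n_conflicting = int(all_copies.groupby(key_cols)["y"].nunique().gt(1).sum())

# keep='first' is the pandas default, written out here for clarity
deduped = toy.drop_duplicates(subset=key_cols, keep="first")
print(int(dup_mask.sum()), n_conflicting, len(deduped))
```

If conflicting keys exist, "first occurrence wins" is a real modeling decision and worth recording in the decision log.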

---

<div style="text-align: center;">

## 5. `Q3: Horizon & Level` — Defines the Structure

<div style="background: linear-gradient(135deg, #4facfe 0%, #00f2fe 100%); color: white; padding: 12px 20px; border-radius: 8px; margin: 10px auto; max-width: 600px;">
<strong>Do we have enough history at the right granularity?</strong><br>
</div>

</div>


### 5.1 Preview data

In [16]:
# What does the data look like?
daily_sales.head(10)

| | item_id | dept_id | cat_id | store_id | state_id | unique_id | ds | y |
|---|---------|---------|--------|----------|----------|-----------|----|---|
| 0 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-01-29 | 3.0 |
| 1 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-01-30 | 0.0 |
| 2 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-01-31 | 0.0 |
| 3 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-02-01 | 1.0 |
| 4 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-02-02 | 4.0 |
| 5 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-02-03 | 2.0 |
| 6 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-02-04 | NaN |
| 7 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-02-05 | 2.0 |
| 8 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-02-06 | 0.0 |
| 9 | FOODS_1_001 | FOODS_1 | FOODS | CA_1 | CA | FOODS_1_001_CA_1 | 2011-02-07 | 0.0 |


### 5.2 Check date range

Rule of thumb: you need history of at least 2-3x the forecast horizon. For a 12-week forecast, that means 24 weeks at minimum and ideally 36+ weeks.

In [17]:
# min date
daily_sales['ds'].min()

Timestamp('2011-01-29 00:00:00')

In [18]:
# max date
daily_sales['ds'].max()

Timestamp('2016-06-19 00:00:00')

In [19]:
# number of weeks
((daily_sales['ds'].max() - daily_sales['ds'].min()).days // 7) + 1

282
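
The rule of thumb can be turned into a quick check. The helper below is hypothetical (it is not part of `tsforge` or `forecast_foundations`) and uses the same week-count formula as the cell above, with the min and max dates from the outputs.

```python
import pandas as pd

def has_enough_history(min_date, max_date, horizon_weeks, multiple=3):
    """Hypothetical helper: compare weeks of history with multiple x horizon."""
    n_weeks = (pd.Timestamp(max_date) - pd.Timestamp(min_date)).days // 7 + 1
    return n_weeks, n_weeks >= multiple * horizon_weeks

# Dates taken from the outputs above; a 12-week horizon is the running example
n_weeks, ok = has_enough_history("2011-01-29", "2016-06-19", horizon_weeks=12)
print(n_weeks, ok)
```

Note that this checks the overall date range only. Individual series can start later, so a per-series version of the same check belongs in data prep.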

### 5.3 Drop invalid dates

Null dates were already dropped in 4.2, so this repeat should be a no-op; it stays as a cheap safeguard before aggregation.

In [20]:
daily_sales = daily_sales.dropna(subset=['ds'])

### 5.4 Check for outlier dates

Look for dates before 1900, future dates, or outlier dates far from the main range.

In [21]:
unique_dates = (
    daily_sales['ds']
    .dropna()
    .drop_duplicates()
    .sort_values()
)

unique_dates.head(5), unique_dates.tail(5)

(0   2011-01-29
 1   2011-01-30
 2   2011-01-31
 3   2011-02-01
 4   2011-02-02
 Name: ds, dtype: datetime64[ns],
 1881   2016-06-15
 1882   2016-06-16
 1883   2016-06-17
 1884   2016-06-18
 1885   2016-06-19
 Name: ds, dtype: datetime64[ns])

In [22]:
# Any date outliers or coverage issues?
tsf.plots.plot_date_coverage(daily_sales)

We see an increasing trend — new series were added over time. This is common in retail as stores expand product assortments. We'll handle ragged start dates in data prep.
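
Eyeballing the head and tail works when the range is clean. As a complement, the bounds check described above can be sketched in a few lines. The lower and upper bounds here are assumptions chosen for illustration; in practice the upper bound would be the data's known "as of" date.

```python
import pandas as pd

# Toy dates with one implausibly old and one future value
dates = pd.Series(pd.to_datetime(
    ["1899-12-31", "2011-01-29", "2016-06-19", "2099-01-01"]
))

lower = pd.Timestamp("1900-01-01")  # assumed floor for plausible dates
upper = pd.Timestamp("2016-12-31")  # assumed "as of" date for this example

# Anything outside [lower, upper] gets flagged for inspection
suspicious = dates[(dates < lower) | (dates > upper)]
print(suspicious.tolist())
```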

### 5.5 Aggregate to weekly

Weekly aligns with business planning and reduces daily noise.

In [23]:
# Group columns (everything except ds and y)
group_cols = [c for c in daily_sales.columns if c not in ['ds', 'y']]

In [24]:
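# Create week column (W-SAT = weeks ending Saturday, i.e., Sunday through Saturday)
# start_time labels each week by its first day, which is the Sunday
daily_sales['week'] = daily_sales['ds'].dt.to_period('W-SAT').dt.start_time

In [25]:
# Inferred frequency of the week labels (they fall on Sundays, hence 'W-SUN')
pd.infer_freq(daily_sales['week'].drop_duplicates().sort_values())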

'W-SUN'
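
The `W-SAT` period and the `'W-SUN'` inferred frequency can look contradictory. They are not: the period defines weeks that end on Saturday, while `start_time` labels each week with its first day, a Sunday. A small sketch on a toy date range starting at the dataset's first date:

```python
import pandas as pd

# Four weeks of daily dates starting at the first date in the data (a Saturday)
days = pd.Series(pd.date_range("2011-01-29", "2011-02-26", freq="D"))

# Weeks run Sunday..Saturday; start_time is the Sunday that opens each week
week_start = days.dt.to_period("W-SAT").dt.start_time

# The labels are all Sundays, which is why infer_freq reports a Sunday anchor
label_days = week_start.dt.day_name().unique().tolist()
freq = pd.infer_freq(pd.DatetimeIndex(week_start.drop_duplicates().sort_values()))
print(week_start.iloc[0].date(), label_days, freq)
```

This also explains why the weekly data can start before the first daily date: the Saturday 2011-01-29 belongs to the week labelled Sunday 2011-01-23.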

In [26]:
# Aggregate: group by all id columns + week, sum the target
# Note: sum() skips NaN by default, so a week with missing days sums only the observed days
weekly_sales = (
    daily_sales
    .groupby(group_cols + ['week'], as_index=False, observed=True)['y']
    .sum()
    .rename(columns={'week': 'ds'})
)
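
One pandas behavior is worth knowing at this step: `groupby(...).sum()` skips `NaN`, and a group made up entirely of `NaN` sums to `0.0`. The sketch below contrasts the default with `min_count=1`, which returns `NaN` when a group has no valid values. Whether the weekly target should use `min_count` is a design choice this sketch does not settle; it only shows the difference on toy data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "unique_id": ["A", "A", "B", "B"],
    "week": pd.to_datetime(["2011-01-23"] * 4),
    "y": [2.0, np.nan, np.nan, np.nan],  # B has no observed days this week
})
keys = ["unique_id", "week"]

# Default: NaN is skipped, and an all-NaN group sums to 0.0
default_sum = toy.groupby(keys, as_index=False)["y"].sum()

# min_count=1: a group needs at least one valid value, otherwise NaN
strict_sum = toy.groupby(keys, as_index=False)["y"].sum(min_count=1)

print(default_sum["y"].tolist(), strict_sum["y"].tolist())
```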

### 5.6 Compare daily vs weekly

Compare the same series at daily vs weekly granularity. How does aggregation affect the signal?

In [27]:
# create a sample series by pulling the first unique series
first_series = daily_sales[group_cols].drop_duplicates().iloc[0]

# Filter data for this series
daily_mask = (daily_sales[group_cols] == first_series).all(axis=1)
weekly_mask = (weekly_sales[group_cols] == first_series).all(axis=1)


In [28]:
sample_daily = daily_sales[daily_mask].sort_values('ds')
sample_weekly = weekly_sales[weekly_mask].sort_values('ds')

In [29]:
sample_daily['agg'] = 'Daily'
sample_weekly['agg'] = 'Weekly'

combined = pd.concat([sample_daily, sample_weekly])

# Create label from series identifiers
label = f"unique_id={first_series['unique_id']}"

# Plot with facet mode
fig = tsf.plots.plot_timeseries(
    combined,
    id_col='agg',
    date_col='ds',
    value_col='y',
    mode='facet',
    theme='seaborn',
    style={
        'title': 'Daily vs Weekly Comparison',
        'subtitle': label
    }
)
fig


Weekly aggregation smooths daily noise and reveals seasonal patterns.

---

<div style="text-align: center;">

## 6. `Q4: Data & Drivers` — Defines What the Model Learns

<div style="background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%); color: white; padding: 12px 20px; border-radius: 8px; margin: 10px auto; max-width: 600px;">
<strong>What behavioral signals shape model selection?</strong><br>
<em>Intermittency, volatility, and data quality affect what the model can learn.</em>
</div>

</div>


### 6.1 Check sparsity

A high share of zeros affects metric choice (MAPE is undefined when the actual value is 0) and model selection.

In [30]:
# How sparse is the data?
n_zeros = (weekly_sales['y'] == 0).sum()
n_na = weekly_sales['y'].isna().sum()

In [31]:
# percent of weekly rows with zero or missing sales
((n_zeros+n_na) / len(weekly_sales) * 100)

24.02633925168771
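
A single overall percentage can hide very different series. As a rough sketch on toy data (values are made up), the same check can be run per series to see which ones are intermittent:

```python
import pandas as pd

toy = pd.DataFrame({
    "unique_id": ["A"] * 4 + ["B"] * 4,
    "y": [0.0, 0.0, 0.0, 5.0, 3.0, 2.0, 0.0, 4.0],
})

# Share of zero weeks per series: mean of a boolean is the fraction of True
zero_share = toy["y"].eq(0).groupby(toy["unique_id"]).mean()

# Overall share, for comparison with the per-series view
overall = toy["y"].eq(0).mean()
print(zero_share.to_dict(), overall)
```

Here the overall share is 50%, yet series A is mostly zeros and series B rarely is, which would point to different model families for each.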

### 6.2 Check memory

In [32]:
# before aggregating
daily_sales.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Index: 45311160 entries, 0 to 45311288
Data columns (total 9 columns):
 #   Column     Dtype         
---  ------     -----         
 0   item_id    category      
 1   dept_id    category      
 2   cat_id     category      
 3   store_id   category      
 4   state_id   category      
 5   unique_id  category      
 6   ds         datetime64[ns]
 7   y          float16       
 8   week       datetime64[ns]
dtypes: category(6), datetime64[ns](2), float16(1)
memory usage: 1.4 GB


In [33]:
# after aggregating
weekly_sales.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6848638 entries, 0 to 6848637
Data columns (total 8 columns):
 #   Column     Dtype         
---  ------     -----         
 0   item_id    category      
 1   dept_id    category      
 2   cat_id     category      
 3   store_id   category      
 4   state_id   category      
 5   unique_id  category      
 6   ds         datetime64[ns]
 7   y          float32       
dtypes: category(6), datetime64[ns](1), float32(1)
memory usage: 134.1 MB


### 6.3 Check calendar

The M5 calendar includes holidays, sporting events, and SNAP payment schedules. We'll focus on calendar features for now; prices and promotions come later.

In [34]:
calendar = ff.load_m5_calendar(DATA_DIR)

# Number of unique holidays/events
calendar['event_name_1'].dropna().nunique()

Loading calendar from: /Users/lindsaytruong/forecast-academy/real-world-forecasting-foundations/data/m5/datasets/calendar.csv
  Shape: 1,969 rows × 13 columns


30

---

<div style="text-align: center;">

## 7. `Q5: Ownership` — Defines Transparency

<div style="background: linear-gradient(135deg, #1d1f56 0%, #3a2f7e 100%); color: white; padding: 12px 20px; border-radius: 8px; margin: 10px auto; max-width: 600px;">
<strong>What did we decide, and based on which assumptions?</strong><br>
<em>Document decisions so downstream users can trace and adjust.</em>
</div>

</div>

### 7.1 Log Decisions

| Step | Decision | Why | Rev |
|------|----------|-----|-----|
| Dtype fixing | pd.to_datetime, pd.to_numeric | Raw strings → proper types | ✓ |
| Weekly aggregation | to_period('W-SAT').dt.start_time, then groupby(...).sum() | Matches Walmart cadence | ✓ |
| Deduplication | drop_duplicates(keep='first') | First occurrence is truth | — |

### 7.2 Generate Report

In [35]:
from forecast_foundations import ModuleReport

report = ModuleReport(
    "1.06",
    input_df=daily_sales,
    output_df=weekly_sales,
    drivers={'calendar': calendar}
)

report.display()


━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1.06 · First Contact
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

SNAPSHOT
─────────────────────────────────────────────────────────────────
         unique_id         ds   y
HOBBIES_1_001_CA_1 2013-07-14 1.0
HOBBIES_1_001_CA_1 2013-07-21 0.0
HOBBIES_1_001_CA_1 2013-07-28 2.0

DATA SUMMARY
─────────────────────────────────────────────────────────────────
  Rows             6,848,638
  Series           30,490
  Dates            2011-01-23 → 2016-06-19
  Frequency        Weekly
  History          282 weeks (5.4 yrs)
  Target zeros     24.0%

MEMORY
─────────────────────────────────────────────────────────────────
  ✓ 134 MB (Medium) — Fine for most operations

5Q CHECKS
─────────────────────────────────────────────────────────────────

  Q1 · Target
    ✓ ds exists              Yes
    ✓ ds is datetime         datetime64[ns]
    ✓ No NAs in ds           0
    ✓ y exists               Yes
    ✓ y is nu

### 7.3 Save

Save cleaned weekly data. NAs preserved for gap-filling in Module 1.08.

In [36]:
# Cache for downstream modules
output.save(
    df=weekly_sales,
    report=report,
)

✓ Report saved: /Users/lindsaytruong/forecast-academy/real-world-forecasting-foundations/data/output/reports/1.06_first_contact_report.txt
✓ Saved '1.06_first_contact'
   Data:   data/1.06_first_contact_output.parquet (8.09 MB, 6,848,638 rows)
   Report: reports/1.06_first_contact_report.txt


PosixPath('/Users/lindsaytruong/forecast-academy/real-world-forecasting-foundations/data/output/data/1.06_first_contact_output.parquet')

### 7.4 Next Steps

| Module | Focus |
|--------|-------|
| **1.7** | Understanding the M5 Dataset (hierarchy, calendar, prices, feature types & leakage risks) |
| **1.8** | Preparing Data for Forecasting: Timeline Engineering (fill gaps, fill policies, aggregate to weekly, merge calendar) |
| **1.9** | Diagnostics: The Big Picture (tsfeatures, forecastability camps, structure vs chaos) |
| **1.10** | The Lie Detector 6 (trend, seasonality, MI, entropy, intermittency, lumpiness) |
| **1.11** | GenAI-Assisted Diagnostics Using SPICE (prompt structure, safe constraints, 3 diagnostic questions) |
| **1.12** | First Look: Plotting Your Data (3-plot EDA workflow, demand archetypes, tsforge plotting helpers) |
| **1.13** | Understanding the Patterns (reading time series shapes, Pattern→Expectations matrix, detecting model failure) |
| **1.14** | Designing the Backtest Strategy (in-sample vs out-of-sample, rolling-origin, Walmart backtest plan) |