# Time-Series Forecasting | Corporación Favorita Grocery Sales
## Section 1: Data Preparation
### 1 - Notebook Overview

The raw `train.csv` file (~5GB) from the **Corporación Favorita** dataset is too large to load directly into memory for standard exploratory analysis. This notebook implements a memory-efficient preprocessing pipeline using **Pandas chunking** to create a lightweight, analysis‑ready subset by:

- **Regional Filtering:** Isolating the `Pichincha` region to focus the analysis.
- **Streamed Sampling:** Reading `train.csv` in 1-million-row chunks, then drawing a reproducible random sample of `2 million rows` from the filtered data.
- **Boolean Standardization:** Cleaning the `onpromotion` column for consistent logical processing.
- **Integrity Checks:** Identifying and handling missing values in the `train` dataset.
- **Outlier Validation:** Evaluating high-sales peaks identified via z-scores to decide if they represent true market signals or noise.
- **Temporal Alignment:** Reconstructing the full daily time index by filling missing dates with zero sales to ensure a continuous timeline.
- **Feature Engineering:** Extracting meaningful time-based signals (day of week, month, etc.) from the date index.
- **Optimized Persistence:** Saving the final prepared dataset as a `Pickle` file for fast loading in subsequent notebooks.

This preprocessing stage ensures a clean, reliable foundation for modeling. The prepared dataset is then utilized in [Section 2: Exploratory Data Analysis](/notebooks/02-exploratory-data-analysis.ipynb).

---

### 2 - Import Libraries

In [14]:
# Core libraries
import pandas as pd

# File handling
import os

# Dismiss deprecation warnings
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning)

### 3 - Data Ingestion

In [15]:
# Define Local Paths
stores_path = "../data/stores.csv"
train_path = "../data/train.csv"

# Verify both files exist before starting any processing
if not os.path.exists(train_path) or not os.path.exists(stores_path):
    print("❌ ERROR: Required data files not found in the /data/ folder.")
    print("-" * 50)
    print("To run this notebook, please download the dataset from Kaggle:")
    print("https://www.kaggle.com/competitions/favorita-grocery-sales-forecasting")
    print("\nEnsure 'train.csv' and 'stores.csv' are placed in your './data/' directory.")
    print("-" * 50)
else:
    print("✅ Local files detected. Starting chunked ingestion...")

    # Filter stores from the Pichincha Region only
    df_stores = pd.read_csv(stores_path)
    store_ids = df_stores[df_stores['state'] == 'Pichincha']['store_nbr'].unique()

    # Read train.csv in chunks of 1 million rows to avoid exhausting memory
    chunk_size = 10**6
    filtered_chunks = []

    for chunk in pd.read_csv(train_path, chunksize=chunk_size, dtype={'onpromotion': object}):
        # map handles the strings, .astype("boolean") supports nulls if they exist
        chunk['onpromotion'] = chunk['onpromotion'].map({'True': True, 'False': False, None: None}).astype('boolean')

        # Filter rows belonging to Pichincha region
        chunk_filtered = chunk[chunk['store_nbr'].isin(store_ids)]
        filtered_chunks.append(chunk_filtered)

        # Free up memory before reading next chunk
        del chunk

    # Combine all filtered chunks
    df_train = pd.concat(filtered_chunks, ignore_index=True)

    # Randomly sample 2 million rows in a reproducible way
    sample_size = min(2_000_000, len(df_train))
    df_train = (
        df_train
        .sample(n=sample_size, random_state=42)
        .reset_index(drop=True)
    )

    # Cleanup memory
    del filtered_chunks

    # Inspect the result
    print(f"\n---DATASET SHAPE---\nRows: {df_train.shape[0]}\nColumns: {df_train.shape[1]}")
    display(df_train.head())

✅ Local files detected. Starting chunked ingestion...

---DATASET SHAPE---
Rows: 2000000
Columns: 6


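|  | id | date | store_nbr | item_nbr | unit_sales | onpromotion |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 12891204 | 2013-10-22 | 46 | 308766 | 2.0 | `<NA>` |
| 1 | 51564450 | 2015-07-16 | 48 | 881910 | 1.0 | False |
| 2 | 112463413 | 2017-04-14 | 47 | 852934 | 2.0 | False |
| 3 | 17037468 | 2014-01-12 | 49 | 1473479 | 123.506 | `<NA>` |
| 4 | 56638373 | 2015-09-15 | 20 | 504457 | 1.0 | False |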
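
The string-to-boolean conversion inside the loop can be checked on a toy column. This is a minimal sketch, assuming (as the `dtype={'onpromotion': object}` argument implies) that the raw column arrives as the strings `'True'` and `'False'` with missing entries as NaN. The `raw` Series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values as read with dtype=object: strings plus a missing entry
raw = pd.Series(["True", "False", np.nan], dtype=object)

# Values not found in the mapping (including NaN) come back as missing;
# "boolean" is pandas' nullable boolean dtype, so the missing entry is kept
clean = raw.map({"True": True, "False": False}).astype("boolean")

print(clean.dtype)         # nullable boolean dtype
print(clean.isna().sum())  # number of missing entries
```

Keeping the missing entries at this stage, instead of coercing them to `False` inside the loop, leaves the decision about how to treat them to the later missing-data step.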


### 4 - Missing Data

In [16]:
# Checking missing values
df_train.isnull().sum()

id                  0
date                0
store_nbr           0
item_nbr            0
unit_sales          0
onpromotion    360337
dtype: int64

The `onpromotion` column is missing in 360,337 of the 2,000,000 sampled rows (about 18%). Since promotions are relatively uncommon in this dataset, I assume these missing values indicate non‑promoted items rather than true gaps in the data. To keep the column consistent, I’ll replace the missing values with **False** before continuing.
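
One way to probe the assumption that a missing `onpromotion` value means "not promoted" is to look at when the missing values occur. The following is a minimal sketch on a made-up frame (the `toy` data and the name `missing_share_by_year` are illustrative, not from the notebook). On the real data the same expression would be applied to `df_train` before the fill.

```python
import pandas as pd

# Made-up rows: missing flags in 2013, recorded flags later
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-03-01", "2013-08-15", "2015-02-10", "2016-07-04"]),
    "onpromotion": pd.array([None, None, False, True], dtype="boolean"),
})

# Share of rows per calendar year where onpromotion is missing
missing_share_by_year = (
    toy["onpromotion"].isna()
    .groupby(toy["date"].dt.year)
    .mean()
)
print(missing_share_by_year)
```

If the missing values cluster in particular periods rather than being spread evenly, that would suggest the flag was simply not recorded then, and filling with `False` is a modeling choice rather than an observed fact.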

In [17]:
# Replacing missing values in the 'onpromotion' column
df_train['onpromotion'] = (
    df_train['onpromotion']
    .astype('boolean')
    .fillna(False)
)

# Rechecking missing values
df_train.isnull().sum()

id             0
date           0
store_nbr      0
item_nbr       0
unit_sales     0
onpromotion    0
dtype: int64

### 5 - Data Integrity & Returns

To begin addressing outliers, I first look for negative sales values, since these represent product returns and need to be handled before any meaningful analysis can continue.

In [18]:
# Checking for negative sales (returns)
negative_sales = df_train[df_train['unit_sales'] < 0]
print(f"Number of negative sales: {len(negative_sales)}\n")

if not negative_sales.empty:
    display(negative_sales.head())

Number of negative sales: 153



|  | id | date | store_nbr | item_nbr | unit_sales | onpromotion |
| --- | --- | --- | --- | --- | --- | --- |
| 52819 | 41210498 | 2015-02-15 | 17 | 821186 | -11.0 | False |
| 77632 | 102753084 | 2017-01-12 | 1 | 1463806 | -1.0 | False |
| 122804 | 15649850 | 2013-12-18 | 4 | 559494 | -1.0 | False |
| 134805 | 76700861 | 2016-04-18 | 48 | 470760 | -1.0 | False |
| 139374 | 103377302 | 2017-01-18 | 1 | 1576285 | -1.0 | False |


Since negative sales values correspond to product returns rather than true sales demand, I’ll convert these entries to zero so the dataset reflects actual sales for forecasting purposes.

In [19]:
# Replacing negative sales with 0 using .clip() for better performance
df_train['unit_sales'] = df_train['unit_sales'].clip(lower=0)

# Confirm all negative sales were replaced
remaining_negatives = (df_train['unit_sales'] < 0).sum()
print(f"Negative sales remaining: {remaining_negatives}")

Negative sales remaining: 0


### 6 - Handling Outliers

It is essential to account for unusually high sales spikes that may not reflect normal demand patterns. These extreme observations can arise from promotions, special events, or simple data inconsistencies, and they can distort both exploratory analysis and forecasting models if left untreated. To flag these anomalies, I’ll compute a **Z‑score** for every row relative to the mean and standard deviation of its store–item pair, and flag values that sit more than 5 standard deviations above that mean.

In [20]:
# Group by store_nbr and item_nbr to calculate mean and std dev
group_stats = df_train.groupby(['store_nbr', 'item_nbr'])['unit_sales']

# Compute mean and standard deviation for each store-item group
mean_sales = group_stats.transform('mean')
std_sales = group_stats.transform('std')

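# Calculate Z-scores; a zero std is replaced by 1 to avoid division by zero.
# Groups with a single row have NaN std, so their Z-score is NaN and is never flagged.
z_scores = (df_train['unit_sales'] - mean_sales) / std_sales.replace(0, 1)

# Flag rows more than 5 standard deviations above their store-item mean
z_threshold = 5
is_outlier = z_scores > z_threshold
df_outliers = df_train[is_outlier].copy()
df_outliers['z_score'] = z_scores[is_outlier]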

# Print summary
print(f"Total rows: {len(df_train)}")
print(f"Number of outliers detected: {len(df_outliers)}")
print(f"Percentage of data flagged: {(len(df_outliers) / len(df_train)) * 100:.2f}%")
display(df_outliers.head())

Total rows: 2000000
Number of outliers detected: 2036
Percentage of data flagged: 0.10%


|  | id | date | store_nbr | item_nbr | unit_sales | onpromotion | z_score |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 297 | 92826350 | 2016-10-03 | 47 | 1250226 | 63.0 | True | 5.386841 |
| 608 | 73101908 | 2016-03-12 | 18 | 164647 | 24.0 | False | 5.069494 |
| 783 | 23491997 | 2014-05-07 | 49 | 1105212 | 217.0 | False | 6.487229 |
| 3322 | 24728213 | 2014-06-01 | 48 | 169028 | 23.0 | False | 5.329154 |
| 3464 | 96135373 | 2016-11-06 | 44 | 265279 | 146.0 | True | 7.051297 |


Overall, the outlier analysis shows that only a very small fraction (~0.1%, or 2,036 of 2,000,000 rows) of the dataset exhibits unusually high sales values. These spikes are not necessarily errors and may reflect genuine demand surges. Since the goal at this stage is to understand the data rather than aggressively clean it, I’m keeping these observations in the dataset. In a real forecasting workflow, decisions about handling extreme values would depend on domain knowledge and the model’s sensitivity to rare events. For now, identifying them is enough to stay aware of potential anomalies without prematurely removing information that could be meaningful for future modeling.
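
The per-group Z-score logic can be illustrated on a small example. This is a sketch on made-up numbers: the `toy` frame and the threshold of 2 are chosen only to keep the example readable, whereas the notebook uses a threshold of 5.

```python
import pandas as pd

# One store-item pair with six rows (one spike) and one pair with a single row
toy = pd.DataFrame({
    "store_nbr": [1] * 6 + [2],
    "item_nbr": [10] * 6 + [20],
    "unit_sales": [1.0, 1.0, 1.0, 1.0, 1.0, 25.0, 7.0],
})

grouped = toy.groupby(["store_nbr", "item_nbr"])["unit_sales"]
mean_sales = grouped.transform("mean")
std_sales = grouped.transform("std")  # sample std (ddof=1); NaN for one-row groups

# Same formula as in the notebook: zero std replaced by 1, NaN std left as NaN
z_scores = (toy["unit_sales"] - mean_sales) / std_sales.replace(0, 1)

# Comparisons with NaN are False, so the single-row group is never flagged
flagged = toy[z_scores > 2]
print(flagged)
```

One property worth keeping in mind: with the sample standard deviation, a group of n rows cannot produce a Z-score above (n − 1)/√n. A threshold of 5 can therefore only be exceeded in store–item pairs with at least 27 rows in the sample, so sparsely sampled pairs are never flagged.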

### 7 - Filling Missing Dates with Zeros

Before building any time‑series features, the dataset needs a complete and continuous calendar. If a product in a store is missing certain dates, the model can’t tell whether those gaps represent _no sales_ or _missing data_. To avoid misinterpretation and ensure stable feature engineering, I explicitly fill every missing date with 0 sales. This step is essential because it:

- **Keeps the time axis consistent:** Time‑series models assume evenly spaced observations. Missing days break that structure.
- **Preserves true demand signals:** A zero is meaningful and should not be silently dropped.
- **Prevents misalignment in lag and rolling features:** Sliding windows rely on a complete calendar. Gaps shift values and distort patterns.
- **Improves model reliability:** A fully populated timeline gives the model a clearer, more accurate view of historical behavior.

With that foundation in place, the next step is to generate a complete daily calendar for every (`store_nbr`, `item_nbr`) pair and fill any missing dates with `unit_sales = 0`.

In [21]:
# Ensure date is a proper datetime type
df_train['date'] = pd.to_datetime(df_train['date'])

# Preview train
print("Before filling in missing calendar dates:\n")
display(df_train.head())

# Convert onpromotion to numpy bool for compatibility with asfreq()
df_train['onpromotion'] = df_train['onpromotion'].astype(bool)

# Function to fill in missing calendar days for each store–item pair
def fill_calendar(group):
    # One (store_nbr, item_nbr) group at a time
    g = group.set_index("date").sort_index()  # Use date as index
    # Full daily calendar; fill_value=0 applies to every column on the new rows
    # (unit_sales, id, and onpromotion all become 0)
    g = g.asfreq("D", fill_value=0)

    # Restore the identifiers that fill_value=0 overwrote on the new rows
    g["store_nbr"] = group["store_nbr"].iloc[0]
    g["item_nbr"]  = group["item_nbr"].iloc[0]

    return g.reset_index()  # Bring date back as a normal column

# Apply the calendar-filling step to every store–item pair
df_train = (
    df_train
    .groupby(["store_nbr", "item_nbr"], group_keys=False)
    .apply(fill_calendar)
)

# Preview train
print("\nAfter filling in missing calendar dates:\n")
display(df_train.head())

Before filling in missing calendar dates:



|  | id | date | store_nbr | item_nbr | unit_sales | onpromotion |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 12891204 | 2013-10-22 | 46 | 308766 | 2.0 | False |
| 1 | 51564450 | 2015-07-16 | 48 | 881910 | 1.0 | False |
| 2 | 112463413 | 2017-04-14 | 47 | 852934 | 2.0 | False |
| 3 | 17037468 | 2014-01-12 | 49 | 1473479 | 123.506 | False |
| 4 | 56638373 | 2015-09-15 | 20 | 504457 | 1.0 | False |



After filling in missing calendar dates:



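|  | date | id | store_nbr | item_nbr | unit_sales | onpromotion |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 2013-05-09 | 5329375 | 1 | 96995 | 1.0 | False |
| 1 | 2013-05-10 | 0 | 1 | 96995 | 0.0 | 0 |
| 2 | 2013-05-11 | 0 | 1 | 96995 | 0.0 | 0 |
| 3 | 2013-05-12 | 0 | 1 | 96995 | 0.0 | 0 |
| 4 | 2013-05-13 | 0 | 1 | 96995 | 0.0 | 0 |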


In [22]:
# Inspect the result
print(f"\n---DATASET SHAPE---\nRows: {df_train.shape[0]}\nColumns: {df_train.shape[1]}")


---DATASET SHAPE---
Rows: 68537168
Columns: 6
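
Because `fill_value=0` is applied to every column, the inserted rows in the preview above carry `0` rather than `False` in `onpromotion`. The sketch below reproduces the fill on a made-up two-row group and then shows one possible way to bring the column back to a plain boolean dtype. The final `astype(bool)` step is a suggestion and is not part of the notebook.

```python
import pandas as pd

# Made-up store-item group with a two-day gap between observations
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-05-09", "2013-05-12"]),
    "store_nbr": [1, 1],
    "item_nbr": [96995, 96995],
    "unit_sales": [1.0, 3.0],
    "onpromotion": [False, True],
})

# Same fill as fill_calendar(): every column gets 0 on the inserted days
filled = toy.set_index("date").asfreq("D", fill_value=0)
filled["store_nbr"] = toy["store_nbr"].iloc[0]
filled["item_nbr"] = toy["item_nbr"].iloc[0]
filled = filled.reset_index()

# Suggested cleanup: 0 and False both map to False, giving a uniform bool column
filled["onpromotion"] = filled["onpromotion"].astype(bool)

print(filled)
```

On the full dataset the equivalent would be a single `df_train['onpromotion'].astype(bool)` after the `groupby(...).apply(fill_calendar)` step.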


### 8 - Feature Engineering

To make the forecasting model more aware of real‑world patterns, I enrich the dataset with a small set of calendar‑based and trend‑capturing features. These help the model recognise seasonality, weekly cycles, and short‑term momentum that aren’t obvious from the raw timestamp alone.

| Feature | Description | Why it helps |
| --- | --- | --- |
| `year` | Extracted from the date | Captures long‑term growth or decline trends |
| `month` | Month number (1–12) | Learns seasonal patterns like holidays or back‑to‑school spikes |
| `day` | Day of the month | Useful for month‑end or mid‑month effects |
| `day_of_week` | Monday=0 … Sunday=6 | Reveals weekday vs weekend behaviour |
| `unit_sales_7d_avg` | 7‑day rolling mean of unit sales | Smooths noise and highlights short‑term momentum |

In [23]:
# Split the timestamp into model-friendly parts
df_train['year']        = df_train['date'].dt.year
df_train['month']       = df_train['date'].dt.month
df_train['day']         = df_train['date'].dt.day
df_train['day_of_week'] = df_train['date'].dt.dayofweek # Monday=0 … Sunday=6

# Confirm new features added
df_train.head()

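|  | date | id | store_nbr | item_nbr | unit_sales | onpromotion | year | month | day | day_of_week |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2013-05-09 | 5329375 | 1 | 96995 | 1.0 | False | 2013 | 5 | 9 | 3 |
| 1 | 2013-05-10 | 0 | 1 | 96995 | 0.0 | 0 | 2013 | 5 | 10 | 4 |
| 2 | 2013-05-11 | 0 | 1 | 96995 | 0.0 | 0 | 2013 | 5 | 11 | 5 |
| 3 | 2013-05-12 | 0 | 1 | 96995 | 0.0 | 0 | 2013 | 5 | 12 | 6 |
| 4 | 2013-05-13 | 0 | 1 | 96995 | 0.0 | 0 | 2013 | 5 | 13 | 0 |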


In [24]:
# Compute a 7‑day rolling average of unit_sales for each (item, store)
df_train = df_train.sort_values(["item_nbr", "store_nbr", "date"]).reset_index(drop=True)

df_train["unit_sales_7d_avg"] = (
    df_train
    .groupby(["item_nbr", "store_nbr"])["unit_sales"]  # Isolate each time-series
    .transform(lambda s: s.rolling(window=7, min_periods=1).mean())  # 7‑day moving mean
)

# Inspect the result for one store–item pair
store_id = df_train.iloc[0]['store_nbr']
item_id = df_train.iloc[0]['item_nbr']

sample = df_train[
    (df_train['store_nbr'] == store_id) &
    (df_train['item_nbr'] == item_id)
]

# Preview
sample.head()

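|  | date | id | store_nbr | item_nbr | unit_sales | onpromotion | year | month | day | day_of_week | unit_sales_7d_avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2013-05-09 | 5329375 | 1 | 96995 | 1.0 | False | 2013 | 5 | 9 | 3 | 1.0 |
| 1 | 2013-05-10 | 0 | 1 | 96995 | 0.0 | 0 | 2013 | 5 | 10 | 4 | 0.5 |
| 2 | 2013-05-11 | 0 | 1 | 96995 | 0.0 | 0 | 2013 | 5 | 11 | 5 | 0.333333 |
| 3 | 2013-05-12 | 0 | 1 | 96995 | 0.0 | 0 | 2013 | 5 | 12 | 6 | 0.25 |
| 4 | 2013-05-13 | 0 | 1 | 96995 | 0.0 | 0 | 2013 | 5 | 13 | 0 | 0.2 |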
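
With `min_periods=1`, the first six values of each series are averaged over however many days are available so far, which is why the preview shows 1.0, 0.5, 0.333333 and so on rather than missing values. A small sketch on made-up numbers:

```python
import pandas as pd

# Made-up daily sales for a single store-item pair
sales = pd.Series([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 7.0])

# Window of 7 days, but emit a value as soon as 1 observation is available
avg_7d = sales.rolling(window=7, min_periods=1).mean()

# With the default min_periods (equal to the window), the first 6 positions are NaN
avg_7d_strict = sales.rolling(window=7).mean()

print(pd.DataFrame({"sales": sales, "avg_7d": avg_7d, "avg_7d_strict": avg_7d_strict}))
```

Note that the window includes the current day's sales. If the feature is later used to predict that same day, shifting the series by one day before rolling (for example `s.shift(1).rolling(...)`) would restrict it to past information. Whether that is needed depends on how the model is set up.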


### 9 - Export Cleaned Dataset

In [25]:
# Confirm cleaned column names and data types
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 68537168 entries, 0 to 68537167
Data columns (total 11 columns):
 #   Column             Dtype         
---  ------             -----         
 0   date               datetime64[ns]
 1   id                 int64         
 2   store_nbr          int64         
 3   item_nbr           int64         
 4   unit_sales         float64       
 5   onpromotion        object        
 6   year               int32         
 7   month              int32         
 8   day                int32         
 9   day_of_week        int32         
 10  unit_sales_7d_avg  float64       
dtypes: datetime64[ns](1), float64(2), int32(4), int64(3), object(1)
memory usage: 4.6+ GB
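
The three `int64` columns (`id`, `store_nbr`, `item_nbr`) account for roughly a third of this footprint. The sketch below shows how integer columns could be downcast with `pd.to_numeric` on a made-up frame. The column list is only illustrative, and whether the saving is worth the extra step here is a judgment call.

```python
import pandas as pd

# Made-up rows using value ranges similar to the notebook's columns
toy = pd.DataFrame({
    "store_nbr": [1, 46, 49],
    "item_nbr": [96995, 308766, 1473479],
    "day_of_week": [3, 4, 5],
})

before = toy.memory_usage(deep=True).sum()

# downcast="integer" picks the smallest signed integer dtype that holds the values
for col in ["store_nbr", "item_nbr", "day_of_week"]:
    toy[col] = pd.to_numeric(toy[col], downcast="integer")

after = toy.memory_usage(deep=True).sum()

print(toy.dtypes)
print(f"{before} bytes -> {after} bytes")
```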


In [26]:
# Export to Pickle
output_path = "../data/train_sample.pkl"

print(f"Saving {len(df_train):,} rows to Pickle...")
df_train.to_pickle(output_path) 

print(f"\n✅ Success! Data safely pickled at: {output_path}")

Saving 68,537,168 rows to Pickle...

✅ Success! Data safely pickled at: ../data/train_sample.pkl
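
In a later notebook the saved file can be read back with `pd.read_pickle`. The following is a minimal round-trip sketch that uses a temporary directory and a made-up frame so it runs anywhere. The real path would be `../data/train_sample.pkl`.

```python
import os
import tempfile

import pandas as pd

# Made-up frame standing in for the prepared dataset
toy = pd.DataFrame({
    "date": pd.to_datetime(["2013-05-09", "2013-05-10"]),
    "unit_sales": [1.0, 0.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "train_sample.pkl")
    toy.to_pickle(path)
    restored = pd.read_pickle(path)

# Pickle preserves dtypes, so the date column does not need to be re-parsed
print(restored.dtypes)
```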
