<a href="https://colab.research.google.com/github/krauseannelize/project-ml-favorita-sales-forecasting/blob/main/notebooks/favorita_s1_sampling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Time-Series Forecasting | Corporación Favorita Grocery Sales

# Section 1: Sampling

**Author:** [Annelize Krause](https://www.linkedin.com/in/annelizekrause/)  
**Date:** January 2026  

## 1 - Notebook Overview

The raw `train.csv` file from the **Corporación Favorita** dataset is extremely large, making it impractical to load directly during exploratory analysis or modeling. The sole purpose of this notebook is to create a lightweight, analysis‑ready subset of the data by:

- filtering the dataset to include only stores located in the **Pichincha** region
- sampling **2 million rows** to keep downstream notebooks fast and responsive
- cleaning the `onpromotion` column to ensure consistent boolean values
- saving the result as a separate file for reuse in later stages

This preprocessing step keeps the main workflow clean and efficient. The prepared sample is then used in [Section 2](https://colab.research.google.com/drive/1WM1RG4q3JP0dARrYMc1NkqiFdvmAG8AB?usp=sharing) for data preparation and exploratory data analysis.

## 2 - Import Libraries

In [None]:
# Core libraries
import pandas as pd
import numpy as np

# File handling
import gdown
from google.colab import drive

## 3 - Data Import & Initial Inspection

In [None]:
# File paths of CSV files necessary
stores_url = "https://drive.google.com/uc?id=1LIPJnAoFkpA0dDP-JpB2pryjSYEw2qR9"
train_url = "https://drive.google.com/uc?id=1r02PerNvXBwAJDP-9vaWUrxnwPymk2-U"

# Download the files using gdown
gdown.download(stores_url, "stores.csv", quiet=True)
gdown.download(train_url, "train.csv", quiet=True)

'train.csv'

In [None]:
# Filter stores from the Pichincha Region only
df_stores = pd.read_csv("stores.csv")
store_ids = df_stores[df_stores['state'] == 'Pichincha']['store_nbr'].unique()

# Read train.csv in chunks of 1 million rows to avoid loading the whole file into memory
chunk_size = 10**6
filtered_chunks = []

for chunk in pd.read_csv("train.csv", chunksize=chunk_size, dtype={'onpromotion': object}):
  # Convert the strings "True"/"False" into nullable booleans; empty fields are read as NaN and end up as <NA>
  chunk['onpromotion'] = chunk['onpromotion'].map({'True': True, 'False': False, None: None})
  chunk['onpromotion'] = chunk['onpromotion'].astype('boolean')

  # Filter rows belonging to Pichincha region
  chunk_filtered = chunk[chunk['store_nbr'].isin(store_ids)]
  filtered_chunks.append(chunk_filtered)

  # Free up memory before reading next chunk
  del chunk

# Combine all filtered chunks and randomly sample 2 million rows
df_train = pd.concat(filtered_chunks, ignore_index=True)

sample_size = min(2_000_000, len(df_train))
df_train = (
    df_train
    .sample(n=sample_size, random_state=42)
    .reset_index(drop=True)
)

# Remove intermediate objects to free memory
del filtered_chunks

# Quick preview of the filtered and sampled dataset
df_train.head(5)

|   | id | date | store_nbr | item_nbr | unit_sales | onpromotion |
|---|---|---|---|---|---|---|
| 0 | 12891204 | 2013-10-22 | 46 | 308766 | 2.0 | |
| 1 | 51564450 | 2015-07-16 | 48 | 881910 | 1.0 | False |
| 2 | 112463413 | 2017-04-14 | 47 | 852934 | 2.0 | False |
| 3 | 17037468 | 2014-01-12 | 49 | 1473479 | 123.506 | |
| 4 | 56638373 | 2015-09-15 | 20 | 504457 | 1.0 | False |
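
The `onpromotion` handling in the loop above can be checked in isolation. The following is a minimal sketch on a made-up in-memory CSV; the rows and the `clean_onpromotion` helper are illustrative assumptions, not part of the project. It shows that empty fields arrive as `NaN` when the column is read as `object`, pass through `.map()` unmatched, and become `<NA>` after the cast to the nullable `boolean` dtype.

```python
import io

import pandas as pd

# Toy stand-in for train.csv (made-up rows; the first row has no onpromotion value)
toy_csv = io.StringIO(
    "id,date,store_nbr,item_nbr,unit_sales,onpromotion\n"
    "1,2013-01-01,46,1001,2.0,\n"
    "2,2015-07-16,48,1002,1.0,False\n"
    "3,2017-04-14,47,1003,3.0,True\n"
)


def clean_onpromotion(chunk: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: map 'True'/'False' strings to nullable booleans."""
    out = chunk.copy()
    out["onpromotion"] = (
        out["onpromotion"]
        .map({"True": True, "False": False})  # unmatched values (NaN) stay missing
        .astype("boolean")                    # nullable dtype keeps missing values as <NA>
    )
    return out


toy = clean_onpromotion(pd.read_csv(toy_csv, dtype={"onpromotion": object}))

print(toy["onpromotion"].dtype)         # dtype after the cast
print(toy["onpromotion"].isna().sum())  # number of missing promotion flags
```

Keeping missing values as `<NA>` rather than coercing them to `False` leaves the decision about how to treat unknown promotion flags to the later data preparation stage.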


In [None]:
# Inspect shape of the sampled training dataset
print(f"---DATASET SHAPE---\nRows: {df_train.shape[0]}\nColumns: {df_train.shape[1]}")

---DATASET SHAPE---
Rows: 2000000
Columns: 6
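
Because `sample_size` is capped with `min(...)`, the row count above equals 2,000,000 only when the filtered data holds at least that many rows. The sketch below illustrates the cap and the reproducibility that `random_state` provides, using a made-up ten-row frame. The `sample_capped` helper is a hypothetical name introduced for illustration only.

```python
import pandas as pd


def sample_capped(df: pd.DataFrame, n: int, seed: int = 42) -> pd.DataFrame:
    """Hypothetical helper: sample at most n rows without replacement, reproducibly."""
    size = min(n, len(df))  # without the cap, sample() raises when n exceeds the row count
    return df.sample(n=size, random_state=seed).reset_index(drop=True)


# Made-up data standing in for the filtered Pichincha rows
toy = pd.DataFrame({"id": range(10), "unit_sales": [1.0] * 10})

small = sample_capped(toy, n=4)
capped = sample_capped(toy, n=1_000)  # request exceeds the available rows
repeat = sample_capped(toy, n=4)      # same seed, so the same rows come back

print(len(small), len(capped))
print(small.equals(repeat))
```

With a fixed seed, rerunning the notebook on the same input yields the same sample, which keeps results in later sections comparable across runs.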


## 4 - Export Sample Dataset

After filtering and sampling the raw `train.csv` file, the final step is to export the prepared subset as `train_pichincha_2M.csv`. [Section 2](https://colab.research.google.com/drive/1WM1RG4q3JP0dARrYMc1NkqiFdvmAG8AB?usp=sharing) can then reuse it for data preparation and exploratory data analysis without repeating the heavy preprocessing steps.

In [None]:
# Mount Google Drive
drive.mount('/content/drive')

df_train.to_csv("/content/drive/MyDrive/Colab Notebooks/ms-data/favorita/train_pichincha_2M.csv", index=False)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


For convenience, the filtered 2‑million‑row Pichincha sample used in this project can also be downloaded directly from [Google Drive](https://drive.google.com/file/d/1Pcw8fED4bi0EHyxVaef4gPq7_OXNdCKS/view?usp=sharing).
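
CSV files do not store column dtypes, so the nullable `boolean` type of `onpromotion` has to be requested again when the exported file is reloaded. The sketch below shows one way to do that, assuming a made-up three-row frame and a temporary placeholder path rather than the actual Drive location.

```python
import os
import tempfile

import pandas as pd

# Made-up rows mimicking the exported columns
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "date": ["2013-01-01", "2015-07-16", "2017-04-14"],
    "store_nbr": [46, 48, 47],
    "item_nbr": [1001, 1002, 1003],
    "unit_sales": [2.0, 1.0, 3.0],
    "onpromotion": pd.array([None, False, True], dtype="boolean"),
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")  # placeholder path for illustration
    sample.to_csv(path, index=False)        # <NA> is written as an empty field

    # Ask for the nullable boolean dtype and parse the date column on load
    reloaded = pd.read_csv(
        path,
        dtype={"onpromotion": "boolean"},
        parse_dates=["date"],
    )

print(reloaded.dtypes)
```

The same `dtype` and `parse_dates` arguments can be passed when reading `train_pichincha_2M.csv`, so the column types match those produced in this notebook.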