# Instacart Data Generation for CDC Pipeline

This notebook loads Instacart data into PostgreSQL to simulate an OLTP application:
- **Dimension tables** (aisles, departments, products): loaded in full, in a single pass
- **Transactional tables** (orders, order_products): an initial date window is bulk-loaded, then the remaining orders are loaded incrementally in small batches to simulate real-time data generation

## 1. Import Required Libraries

In [1]:
from dotenv import load_dotenv
load_dotenv("../.env")

True

In [2]:
%load_ext autoreload
%autoreload 2

import os
import sys

# Resolve the absolute path of the project root (the parent of the working directory)
module_path = os.path.abspath('..')

# Add it to sys.path, if missing, so the data_generation package can be imported
if module_path not in sys.path:
    sys.path.insert(0, module_path)

from data_generation.data_generation import InstacartDataLoader

In [3]:
import pandas as pd

## 2. Database Configuration

In [4]:
# Database connection parameters
DB_CONFIG = {
    'host': os.getenv('DB_HOST'),
    'port': os.getenv('DB_PORT'),
    'database': os.getenv('DB_NAME'),
    'user': os.getenv('DB_USER'),
    'password': os.getenv('DB_PASSWORD')
}

# Data directory
DATA_DIR = '../data'


# Application log file
LOG_FILE = '../logs/data_generation/data_loading.log'


# Initialize the data loader
loader = InstacartDataLoader(
    db_config=DB_CONFIG,
    data_dir=DATA_DIR,
    log_file=LOG_FILE,
)
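Because `os.getenv` returns `None` for an unset variable and a string for everything else, a missing `.env` entry only surfaces later as a connection error. The sketch below shows one way to fail early and cast the port once. The `build_db_config` helper and the sample values are hypothetical and not part of `InstacartDataLoader`.

```python
import os

REQUIRED_KEYS = ("DB_HOST", "DB_PORT", "DB_NAME", "DB_USER", "DB_PASSWORD")


def build_db_config(env=None):
    """Return a connection dict, raising if any required variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": env["DB_HOST"],
        "port": int(env["DB_PORT"]),  # environment values are strings; cast once here
        "database": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
    }


# Illustrative values only; a real run would call build_db_config() with no argument.
example_env = {
    "DB_HOST": "localhost",
    "DB_PORT": "5432",
    "DB_NAME": "instacart",
    "DB_USER": "postgres",
    "DB_PASSWORD": "secret",
}
config = build_db_config(example_env)
```

Passing the environment as an argument keeps the helper easy to test without touching the real process environment.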

In [6]:
# Load the dimension tables in full, plus the initial date window of the fact tables (orders, order_products)
loader.load_initial()

2025-12-28 09:07:03,775 - INFO - LOADING DIMENSION TABLES
2025-12-28 09:07:03,778 - INFO - Loading aisles...
2025-12-28 09:07:03,779 - INFO - Loading aisles.csv...
2025-12-28 09:07:03,782 - INFO - Loaded 134 rows from aisles.csv
2025-12-28 09:07:03,795 - INFO - Inserted 134 rows into instacart.aisles
2025-12-28 09:07:03,797 - INFO - Loading departments...
2025-12-28 09:07:03,798 - INFO - Loading departments.csv...
2025-12-28 09:07:03,802 - INFO - Loaded 21 rows from departments.csv
2025-12-28 09:07:03,810 - INFO - Inserted 21 rows into instacart.departments
2025-12-28 09:07:03,812 - INFO - Loading products...
2025-12-28 09:07:03,813 - INFO - Loading products.csv...
2025-12-28 09:07:03,852 - INFO - Loaded 49688 rows from products.csv
2025-12-28 09:07:06,702 - INFO - Inserted 49688 rows into instacart.products
2025-12-28 09:07:06,707 - INFO - DIMENSION TABLES LOADED SUCCESSFULLY
2025-12-28 09:07:06,710 - INFO - LOADING FACT TABLES
2025-12-28 09:07:06,712 - INFO - Loading orders with dates...
2025-12-28 09:07:06,712 - INFO - Loading orders_with_dates.csv...
2025-12-28 09:07:09,221 - INFO - Loaded 3421083 rows from orders_with_dates.csv
2025-12-28 09:07:09,904 - INFO - Filtered orders to 62974 rows with order_date between 2025-12-01 and 2025-12-28 00:00:00
2025-12-28 09:07:13,988 - INFO - Inserted 62974 rows into instacart.orders
2025-12-28 09:07:20,345 - INFO - Filtered orders to 623871 order products matching loaded orders
2025-12-28 09:07:54,574 - INFO - Inserted 623871 rows into instacart.order_products
2025-12-28 09:07:54,606 - INFO - FACT TABLES LOADED SUCCESSFULLY
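The log above shows two filtering steps for the fact tables: orders are restricted to a date window, and order products are then restricted to the orders that were kept. A minimal sketch of that logic follows, using plain Python records rather than pandas so it stays dependency-free. The function names, the `order_id` and `product_id` fields, and the half-open window (start inclusive, end exclusive) are assumptions; the loader's actual implementation is not shown in this notebook.

```python
from datetime import date


def filter_orders_by_date(orders, start, end):
    """Keep orders with start <= order_date < end (half-open window, an assumption)."""
    return [order for order in orders if start <= order["order_date"] < end]


def filter_order_products(order_products, loaded_orders):
    """Keep only order-product rows whose order_id belongs to a loaded order."""
    loaded_ids = {order["order_id"] for order in loaded_orders}
    return [row for row in order_products if row["order_id"] in loaded_ids]


# Tiny illustrative dataset, not taken from the Instacart files.
orders = [
    {"order_id": 1, "order_date": date(2025, 11, 30)},
    {"order_id": 2, "order_date": date(2025, 12, 1)},
    {"order_id": 3, "order_date": date(2025, 12, 27)},
    {"order_id": 4, "order_date": date(2025, 12, 28)},
]
order_products = [
    {"order_id": 1, "product_id": 10},
    {"order_id": 2, "product_id": 11},
    {"order_id": 2, "product_id": 12},
    {"order_id": 4, "product_id": 13},
]

initial_orders = filter_orders_by_date(orders, date(2025, 12, 1), date(2025, 12, 28))
initial_order_products = filter_order_products(order_products, initial_orders)
```

A half-open window keeps orders dated 2025-12-28 out of the initial load, so an incremental phase that starts from that date would not insert them twice.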


In [13]:
# Progress log file
PROGRESS_LOG_FILE = '../logs/data_generation/loading_progress.json'

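# Batch configuration for incremental loading
# NOTE: these constants are not passed to load_orders_incrementally() below, so they
# only take effect if the loader picks them up some other way. The logged run shows
# batch sizes and sleep intervals outside these ranges.
MIN_ORDERS_PER_BATCH = 50  # Minimum number of orders per batch
MAX_ORDERS_PER_BATCH = 150  # Maximum number of orders per batch
MIN_SLEEP_SECONDS = 1  # Minimum seconds to sleep between batches
MAX_SLEEP_SECONDS = 10  # Maximum seconds to sleep between batches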

loader.load_orders_incrementally(
    tracking_file_path=PROGRESS_LOG_FILE
)

2025-12-28 08:39:28,199 - INFO - LOADING ORDERS DATA (INCREMENTALLY)
2025-12-28 08:39:28,201 - INFO - Loading orders from 2025-12-28 onwards...
2025-12-28 08:39:28,202 - INFO - Loading orders_with_dates.csv...
2025-12-28 08:39:34,111 - INFO - Found 872123 orders to process
2025-12-28 08:39:34,113 - INFO - Loading test_batch_order_products.csv...
2025-12-28 08:39:35,744 - INFO - Found 8862276 order products
2025-12-28 08:39:35,747 - INFO - Incremental index range: 0 -> 7679
2025-12-28 08:39:35,855 - INFO - Batch completed: 21 orders, 190 order products | Latest: order_date=2025-12-28, index=21/7679 (0.3%) | 
2025-12-28 08:39:35,857 - INFO - Sleeping for 6 seconds...
2025-12-28 08:39:41,951 - INFO - Batch completed: 10 orders, 93 order products | Latest: order_date=2025-12-28, index=31/7679 (0.4%) | 
2025-12-28 08:39:41,953 - INFO - Sleeping for 14 seconds...
2025-12-28 08:39:56,053 - INFO - Batch completed: 33 orders, 367 order products | Latest: order_date=2025-12-28, index=64/7679 (0.8%) | 
2025-12-28 08:39:56,054 - INFO - Sleeping for 10 seconds...
2025-12-28 08:40:05,237 - INFO - Progress saved at: order_date=2025-12-28, index=64/7679
2025-12-28 08:40:05,238 - INFO - Remaining orders: 872
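The run above inserts a random number of orders per batch, sleeps a random interval, and saves its position so a later run can resume. The sketch below illustrates that pattern under stated assumptions: `load_incrementally`, `load_progress`, `save_progress`, the `insert_batch` callback, and the JSON layout (`index`, `total`) are all hypothetical and are not the API or file format of `InstacartDataLoader`.

```python
import json
import random
import tempfile
import time
from pathlib import Path


def load_progress(path):
    """Return the saved resume index, or 0 if no progress file exists yet."""
    path = Path(path)
    if not path.exists():
        return 0
    return json.loads(path.read_text())["index"]


def save_progress(path, index, total):
    """Persist the resume index so an interrupted run can pick up from it."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"index": index, "total": total}))


def load_incrementally(orders, progress_path, insert_batch,
                       min_orders=50, max_orders=150,
                       min_sleep=1, max_sleep=10,
                       max_batches=None, sleep=time.sleep, rng=None):
    """Insert orders in random-sized batches, checkpointing after each batch."""
    rng = rng or random.Random()
    index = load_progress(progress_path)  # resume where the last run stopped
    batches_done = 0
    while index < len(orders):
        if max_batches is not None and batches_done >= max_batches:
            break
        size = rng.randint(min_orders, max_orders)  # inclusive on both ends
        batch = orders[index:index + size]
        insert_batch(batch)  # stand-in for the real database insert
        index += len(batch)
        save_progress(progress_path, index, len(orders))
        batches_done += 1
        if index < len(orders):
            sleep(rng.randint(min_sleep, max_sleep))
    return index


# Demo with an in-memory "database" and sleeping disabled.
orders = list(range(1000))
inserted = []
no_sleep = lambda seconds: None
with tempfile.TemporaryDirectory() as tmp:
    progress_file = Path(tmp) / "loading_progress.json"
    # First run stops after three batches, as if it had been interrupted.
    first_stop = load_incrementally(orders, progress_file, inserted.extend,
                                    max_batches=3, sleep=no_sleep,
                                    rng=random.Random(0))
    # Second run reads the progress file and continues from first_stop.
    final_stop = load_incrementally(orders, progress_file, inserted.extend,
                                    sleep=no_sleep, rng=random.Random(1))
```

Saving progress only after a batch has been inserted means a crash can repeat at most one batch on resume. A production loader would typically pair this with idempotent inserts or a transaction per batch.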