# Instacart Data Generation for CDC Pipeline

This notebook loads the Instacart dataset into PostgreSQL to simulate an OLTP application that feeds a CDC pipeline:
- **Dimension tables** (aisles, departments, products): loaded in full, all at once
- **Transactional tables** (orders, order_products): loaded incrementally, in batches, to simulate real-time data generation
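
The incremental pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation of `InstacartDataLoader`: the names `iter_random_batches` and `load_incrementally`, their parameters, and the `insert_batch` callback are all hypothetical.

```python
import random
import time


def iter_random_batches(rows, min_size, max_size, start=0, rng=None):
    """Yield (next_index, batch) pairs, each batch holding a random number
    of rows between min_size and max_size (the last batch may be smaller)."""
    if not 1 <= min_size <= max_size:
        raise ValueError("expected 1 <= min_size <= max_size")
    rng = rng or random.Random()
    index = start
    while index < len(rows):
        size = rng.randint(min_size, max_size)  # inclusive on both ends
        batch = list(rows[index:index + size])
        index += len(batch)
        yield index, batch


def load_incrementally(rows, insert_batch, min_size=50, max_size=150,
                       min_sleep=1.0, max_sleep=10.0, sleep=time.sleep):
    """Hypothetical driver: insert one batch, pause a random time, repeat.

    insert_batch is an assumed callback that writes one batch to the database.
    Returns the index of the next row that has not been loaded yet.
    """
    rng = random.Random()
    next_index = 0
    for next_index, batch in iter_random_batches(rows, min_size, max_size, rng=rng):
        insert_batch(batch)
        if next_index < len(rows):
            # Pause between batches so inserts arrive spread out over time.
            sleep(rng.uniform(min_sleep, max_sleep))
    return next_index
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real delays, for example `load_incrementally(rows, print, sleep=lambda s: None)`.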

## 1. Import Required Libraries

In [1]:
from dotenv import load_dotenv
load_dotenv("../.env")

True

In [2]:
%load_ext autoreload
%autoreload 2

import os
import sys

# Get the absolute path to the project root, which contains the `src` package
module_path = os.path.abspath('..')

# Add the path to the system path if it's not already there
if module_path not in sys.path:
    sys.path.insert(0, module_path)

from src.data_generation import InstacartDataLoader

## 2. Database Configuration

In [3]:
# Database connection parameters
DB_CONFIG = {
    'host': os.getenv('DB_HOST'),
    'port': os.getenv('DB_PORT'),
    'database': os.getenv('DB_NAME'),
    'user': os.getenv('DB_USER'),
    'password': os.getenv('DB_PASSWORD')
}

# Data directory
DATA_DIR = '../data'


# Application log file
LOG_FILE = '../logs/data_loading.log'


# Initialize the data loader
loader = InstacartDataLoader(
    db_config=DB_CONFIG,
    data_dir=DATA_DIR,
    log_file=LOG_FILE,
)
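
Because `os.getenv` returns `None` for an unset variable, a missing entry in `.env` would only surface later as a connection error. A minimal sketch of an up-front check is shown here; `build_db_config` and `REQUIRED_KEYS` are hypothetical helpers, not part of `src.data_generation`, and converting the port to `int` is an assumption about what the loader accepts.

```python
import os

# Environment variable names taken from the DB_CONFIG cell above.
REQUIRED_KEYS = ("DB_HOST", "DB_PORT", "DB_NAME", "DB_USER", "DB_PASSWORD")


def build_db_config(env=None):
    """Build the connection dict, failing fast if any variable is unset or empty."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": env["DB_HOST"],
        "port": int(env["DB_PORT"]),  # raises ValueError if the port is not numeric
        "database": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
    }
```

With this helper, `DB_CONFIG = build_db_config()` would raise immediately and name the variables that are missing.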

In [None]:
# Load Dimension Data (All at Once)
loader.load_dimension_tables()

2025-12-21 01:43:18,094 - INFO - LOADING DIMENSION TABLES
2025-12-21 01:43:18,097 - INFO - Loading aisles...
2025-12-21 01:43:18,098 - INFO - Loading aisles.csv...
2025-12-21 01:43:18,101 - INFO - Loaded 134 rows from aisles.csv
2025-12-21 01:43:18,116 - INFO - Inserted 134 rows into instacart.aisles
2025-12-21 01:43:18,118 - INFO - Loading departments...
2025-12-21 01:43:18,121 - INFO - Loading departments.csv...
2025-12-21 01:43:18,124 - INFO - Loaded 21 rows from departments.csv
2025-12-21 01:43:18,140 - INFO - Inserted 21 rows into instacart.departments
2025-12-21 01:43:18,143 - INFO - Loading products...
2025-12-21 01:43:18,144 - INFO - Loading products.csv...
2025-12-21 01:43:18,218 - INFO - Loaded 49688 rows from products.csv
2025-12-21 01:43:21,487 - INFO - Inserted 49688 rows into instacart.products
2025-12-21 01:43:21,495 - INFO - DIMENSION TABLES LOADED SUCCESSFULLY
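
The incremental loader in the following cell takes a tracking file path and logs that it resumes from a saved order index. One way such resumable progress tracking could work is sketched here. The JSON field names and the helpers `read_progress` and `write_progress` are assumptions made for illustration; the actual file format used by `InstacartDataLoader` is not shown in this notebook.

```python
import json
import os
import tempfile

# Hypothetical schema; the real tracking file may use different keys.
DEFAULT_PROGRESS = {
    "last_order_index": 0,
    "orders_loaded": 0,
    "order_products_loaded": 0,
}


def read_progress(path):
    """Return saved progress, or zeroed defaults when no file exists yet."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            data = json.load(fh)
    except FileNotFoundError:
        return dict(DEFAULT_PROGRESS)
    return {**DEFAULT_PROGRESS, **data}


def write_progress(path, progress):
    """Write progress to a temp file, then swap it in with os.replace so an
    interrupted run does not leave a half-written tracking file behind."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(progress, fh)
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Saving after every batch, and again when the run is interrupted, lets the next run pick up at the saved index instead of reinserting orders.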


In [4]:
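# Progress log file
PROGRESS_LOG_FILE = '../logs/loading_progress.json'

# Batch configuration for incremental loading.
# NOTE: these four constants are defined here but never passed to the loader call
# below, so they have no effect on this run. The log output shows batches of
# 15 to 31 orders and sleeps of up to 12 seconds, outside the ranges set here.
MIN_ORDERS_PER_BATCH = 50  # Minimum number of orders per batch
MAX_ORDERS_PER_BATCH = 150  # Maximum number of orders per batch
MIN_SLEEP_SECONDS = 1  # Minimum seconds to sleep between batches
MAX_SLEEP_SECONDS = 10  # Maximum seconds to sleep between batches

loader.load_train_orders_incrementally(
    tracking_file_path=PROGRESS_LOG_FILE
)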

2025-12-21 01:45:02,218 - INFO - LOADING TRAIN ORDERS DATA (INCREMENTALLY)
2025-12-21 01:45:02,223 - INFO - Resuming from order index 95
2025-12-21 01:45:02,224 - INFO - Previously loaded: 95 orders, 990 order products
2025-12-21 01:45:02,225 - INFO - Loading orders.csv...
2025-12-21 01:45:04,253 - INFO - Found 131209 train orders
2025-12-21 01:45:04,256 - INFO - Loading order_products__train.csv...
2025-12-21 01:45:04,536 - INFO - Found 1384617 train order products
2025-12-21 01:45:04,572 - INFO - Train batch completed: 20 orders, 204 order products | Progress: 115/131209 orders (0.1%)
2025-12-21 01:45:04,577 - INFO - Sleeping for 12 seconds...
2025-12-21 01:45:16,629 - INFO - Train batch completed: 31 orders, 285 order products | Progress: 146/131209 orders (0.1%)
2025-12-21 01:45:16,631 - INFO - Sleeping for 5 seconds...
2025-12-21 01:45:21,668 - INFO - Train batch completed: 15 orders, 209 order products | Progress: 161/131209 orders (0.1%)
2025-12-21 01:45:21,670 - INFO - Sleeping for 12 seconds...
2025-12-21 01:45:28,096 - INFO - Progress saved at order index: 161
2025-12-21 01:45:28,097 - INFO - Total orders loaded so far: 161
2025-12-21 01:45:28,098 - INFO - Total order products loaded so far: 1688
2025-12-21 01:45:28,099 - INFO - Remaining order