# Instacart Data Generation for CDC Pipeline

This notebook loads Instacart data into PostgreSQL to simulate an OLTP application:
- **Dimension tables** (aisles, departments, products): Load all data at once
- **Transactional tables** (orders, order_products): Load incrementally to simulate real-time data generation
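
The incremental strategy for the transactional tables can be pictured as slicing an ordered list of orders into randomly sized batches. The sketch below is only an illustration of that idea, not the code inside `InstacartDataLoader`; `iter_batches` and its parameters are hypothetical names introduced here.

```python
import random


def iter_batches(items, min_size, max_size, rng=None):
    """Yield consecutive slices of `items`, each with a random length
    drawn from [min_size, max_size]. The final slice may be shorter.
    Assumes 1 <= min_size <= max_size."""
    rng = rng or random.Random()
    start = 0
    while start < len(items):
        size = rng.randint(min_size, max_size)  # inclusive on both ends
        yield items[start:start + size]
        start += size
```

A caller would typically insert each batch and then pause for a random interval, for example `time.sleep(rng.uniform(1, 10))`, so that rows arrive in PostgreSQL at an irregular pace, as they would from a live application.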

## 1. Import Required Libraries

In [6]:
from dotenv import load_dotenv
load_dotenv("../.env")

True

In [7]:
%load_ext autoreload
%autoreload 2

import os
import sys

# Absolute path to the parent directory, which contains the data_generation package
module_path = os.path.abspath('..')

# Put it on sys.path so the package can be imported from this notebook
if module_path not in sys.path:
    sys.path.insert(0, module_path)

from data_generation.data_generation import InstacartDataLoader

## 2. Database Configuration

In [8]:
# Database connection parameters
DB_CONFIG = {
    'host': os.getenv('DB_HOST'),
    'port': os.getenv('DB_PORT'),
    'database': os.getenv('DB_NAME'),
    'user': os.getenv('DB_USER'),
    'password': os.getenv('DB_PASSWORD')
}

# Data directory
DATA_DIR = '../data'


# Application log file
LOG_FILE = '../logs/data_generation/data_loading.log'


# Initialize the data loader
loader = InstacartDataLoader(
    db_config=DB_CONFIG,
    data_dir=DATA_DIR,
    log_file=LOG_FILE,
)
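
`os.getenv` returns `None` for any variable that is missing from `.env`, and the resulting connection error can be hard to trace back to the configuration. A minimal sketch of an up-front check follows; `build_db_config` and `REQUIRED_VARS` are hypothetical helpers, and casting the port to `int` assumes the loader accepts an integer port as well as a string.

```python
import os

# Names taken from the DB_CONFIG cell above
REQUIRED_VARS = ("DB_HOST", "DB_PORT", "DB_NAME", "DB_USER", "DB_PASSWORD")


def build_db_config(env=os.environ):
    """Build the connection dict, raising if any variable is unset or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "host": env["DB_HOST"],
        "port": int(env["DB_PORT"]),  # assumption: an int port is acceptable
        "database": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
    }
```

With this in place, `DB_CONFIG = build_db_config()` fails immediately and names the missing variables instead of passing `None` values on to the database driver.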

In [9]:
# Load Dimension Data (All at Once)
loader.load_dimension_tables()

2025-12-27 02:42:01,389 - INFO - LOADING DIMENSION TABLES
2025-12-27 02:42:01,391 - INFO - Loading aisles...
2025-12-27 02:42:01,392 - INFO - Loading aisles.csv...
2025-12-27 02:42:01,396 - INFO - Loaded 134 rows from aisles.csv
2025-12-27 02:42:01,412 - INFO - Inserted 134 rows into instacart.aisles
2025-12-27 02:42:01,414 - INFO - Loading departments...
2025-12-27 02:42:01,416 - INFO - Loading departments.csv...
2025-12-27 02:42:01,420 - INFO - Loaded 21 rows from departments.csv
2025-12-27 02:42:01,429 - INFO - Inserted 21 rows into instacart.departments
2025-12-27 02:42:01,431 - INFO - Loading products...
2025-12-27 02:42:01,432 - INFO - Loading products.csv...
2025-12-27 02:42:01,487 - INFO - Loaded 49688 rows from products.csv
2025-12-27 02:42:04,745 - INFO - Inserted 49688 rows into instacart.products
2025-12-27 02:42:04,750 - INFO - DIMENSION TABLES LOADED SUCCESSFULLY


In [10]:
# Progress log file
PROGRESS_LOG_FILE = '../logs/data_generation/loading_progress.json'

# Batch configuration for incremental loading.
# NOTE: these constants are not passed to the loader call below, so the loader
# runs with its own defaults. The log output confirms this: several batches
# contain fewer than 50 orders and one sleep lasts 11 seconds.
MIN_ORDERS_PER_BATCH = 50  # Minimum number of orders per batch
MAX_ORDERS_PER_BATCH = 150  # Maximum number of orders per batch
MIN_SLEEP_SECONDS = 1  # Minimum seconds to sleep between batches
MAX_SLEEP_SECONDS = 10  # Maximum seconds to sleep between batches

loader.load_train_orders_incrementally(
    tracking_file_path=PROGRESS_LOG_FILE
)

2025-12-27 02:42:15,546 - INFO - LOADING TRAIN ORDERS DATA (INCREMENTALLY)
2025-12-27 02:42:15,548 - INFO - Resuming from order index 647
2025-12-27 02:42:15,548 - INFO - Previously loaded: 647 orders, 6831 order products
2025-12-27 02:42:15,549 - INFO - Loading orders.csv...
2025-12-27 02:42:16,991 - INFO - Found 131209 train orders
2025-12-27 02:42:16,992 - INFO - Loading order_products__train.csv...
2025-12-27 02:42:17,215 - INFO - Found 1384617 train order products
2025-12-27 02:42:17,443 - INFO - Train batch completed: 47 orders, 643 order products | Progress: 694/131209 orders (0.5%)
2025-12-27 02:42:17,444 - INFO - Sleeping for 10 seconds...
2025-12-27 02:42:27,520 - INFO - Train batch completed: 42 orders, 349 order products | Progress: 736/131209 orders (0.6%)
2025-12-27 02:42:27,522 - INFO - Sleeping for 6 seconds...
2025-12-27 02:42:33,591 - INFO - Train batch completed: 31 orders, 389 order products | Progress: 767/131209 orders (0.6%)
2025-12-27 02:42:33,593 - INFO - Sleeping for 9 seconds...
2025-12-27 02:42:42,698 - INFO - Train batch completed: 50 orders, 476 order products | Progress: 817/131209 orders (0.6%)
2025-12-27 02:42:42,699 - INFO - Sleeping for 11 seconds...
2025-12-27 02:42:53,739 - INFO - Train batch completed: 23 orders, 236 or
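
The "Resuming from order index 647" line in the log suggests that the loader persists its position in the tracking file and picks up from there on the next run. How `InstacartDataLoader` does this is not shown in the notebook, so the following is only a sketch of one way to keep such a file; the function names and JSON keys are hypothetical.

```python
import json
import os
import tempfile


def load_progress(path):
    """Return the saved progress dict, or a fresh one if no file exists yet."""
    if not os.path.exists(path):
        return {"next_order_index": 0, "orders_loaded": 0, "order_products_loaded": 0}
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)


def save_progress(path, progress):
    """Write progress to a temp file in the same directory, then rename it
    over the target so a crash mid-write cannot leave a truncated file."""
    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(progress, fh)
    os.replace(tmp_path, path)
```

Calling `save_progress` only after a batch has been committed keeps the file in step with the database, so an interrupted run resumes at the first batch that was not committed.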

In [3]:
import pandas as pd
import os

orders_df = pd.read_csv(os.path.join('../data/', 'orders.csv'))

In [5]:
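# Largest recorded gap between a user's consecutive orders, in days
# (the Instacart dataset caps this value at 30)
orders_df['days_since_prior_order'].max()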

30.0