# 01 - Data Source Setup

## Overview
This notebook sets up the streaming data source for our ETL pipeline. We simulate real-time e-commerce transaction data by generating CSV files incrementally.

## Business Context
We are processing transaction events from an e-commerce platform. Each transaction contains:
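- Transaction ID and user ID
- Product ID, product category, and quantity
- Transaction amount and payment method
- Country code
- Event timestamp
- Transaction status (completed, pending, failed, refunded)
- Discount percent and customer segment (either may be null)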

## Approach
We generate batches of CSV files to simulate a streaming source. In production, a streaming platform such as Kafka or Kinesis would take the place of these files.

In [None]:
# Import required libraries
import os
import time
import random
import csv
from datetime import datetime, timedelta
from pathlib import Path

## Configuration

Define the input and output directories, resolved relative to the parent of the current working directory.

In [None]:
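# Configuration
# BASE_DIR is the parent of the current working directory, so this assumes
# the notebook is run from a subdirectory of the project root.
BASE_DIR = Path.cwd().parent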
INPUT_DIR = BASE_DIR / 'data' / 'input'
OUTPUT_DIR = BASE_DIR / 'data' / 'output'

# Ensure directories exist
INPUT_DIR.mkdir(parents=True, exist_ok=True)
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print(f"Base Directory: {BASE_DIR}")
print(f"Input Directory: {INPUT_DIR}")
print(f"Output Directory: {OUTPUT_DIR}")

## Data Generation Functions

Create helper functions to generate realistic transaction data.

In [None]:
# Sample data for generation
PRODUCT_CATEGORIES = ['Electronics', 'Clothing', 'Home', 'Books', 'Sports', 'Toys', 'Food', 'Beauty']
STATUSES = ['completed', 'pending', 'failed', 'refunded']
PAYMENT_METHODS = ['credit_card', 'debit_card', 'paypal', 'bank_transfer']

def generate_transaction(transaction_id, base_time):
    """
    Generate a single transaction record.
    
    Args:
        transaction_id: Unique transaction identifier
        base_time: Base timestamp for the transaction
    
    Returns:
        Dictionary containing transaction data
    """
    # Add some randomness to timestamp (within 1 minute)
    event_time = base_time + timedelta(seconds=random.randint(0, 60))
    
    # Generate transaction data with realistic distributions
    status = random.choices(
        STATUSES, 
        weights=[0.85, 0.08, 0.05, 0.02]  # Most transactions complete successfully
    )[0]
    
    transaction = {
        'transaction_id': f'TXN{transaction_id:08d}',
        'user_id': f'USER{random.randint(1000, 9999)}',
        'product_id': f'PROD{random.randint(100, 999)}',
        'product_category': random.choice(PRODUCT_CATEGORIES),
        'amount': round(random.uniform(10.0, 500.0), 2),
        'quantity': random.randint(1, 5),
        'payment_method': random.choice(PAYMENT_METHODS),
        'status': status,
        'event_time': event_time.strftime('%Y-%m-%d %H:%M:%S'),
        'country_code': random.choice(['US', 'UK', 'CA', 'DE', 'FR', 'JP', 'AU']),
        # Introduce data quality issues: ~30% of discounts and ~25% of segments are null
        'discount_percent': None if random.random() < 0.3 else round(random.uniform(0, 30), 2),
        'customer_segment': random.choice(['premium', 'regular', 'new', None])
    }
    
    return transaction
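
The status weights in `generate_transaction` are drawn from the global `random` state, so every run produces different data. The following is a minimal sketch, not part of the original notebook, of how the same weighted draw could be made reproducible with a dedicated seeded generator. The helper name `sample_statuses`, the `WEIGHTS` constant, and the seed value are assumptions chosen for illustration.

```python
import random
from collections import Counter

STATUSES = ['completed', 'pending', 'failed', 'refunded']
WEIGHTS = [0.85, 0.08, 0.05, 0.02]  # same weights as in generate_transaction


def sample_statuses(n, seed=42):
    """Draw n statuses from a dedicated, seeded generator (hypothetical helper)."""
    rng = random.Random(seed)  # local RNG; leaves the global random state untouched
    return rng.choices(STATUSES, weights=WEIGHTS, k=n)


first = sample_statuses(10_000)
second = sample_statuses(10_000)

# Same seed, same sequence of draws
print(first == second)

# Observed share of each status in the sample
shares = {status: count / len(first) for status, count in Counter(first).items()}
print(shares)
```

With 10,000 draws the observed share of `completed` should land close to the configured 0.85, which gives a quick sanity check on the weights.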

## Generate Streaming Data

Create multiple batches of CSV files to simulate streaming data arrival.

In [None]:
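def generate_batch(batch_number, num_records=100):
    """
    Generate a batch of transactions and save to CSV.
    
    Args:
        batch_number: Batch identifier
        num_records: Number of records to generate (must be at least 1)
    
    Returns:
        Path of the CSV file that was written
    """
    # Batches start 24 hours in the past and are spaced 5 minutes apart
    base_time = datetime.now() - timedelta(hours=24) + timedelta(minutes=batch_number * 5)
    # IDs stay unique across batches only if every batch uses the same num_records
    start_id = batch_number * num_records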
    
    transactions = [
        generate_transaction(start_id + i, base_time) 
        for i in range(num_records)
    ]
    
    # Write to CSV
    filename = INPUT_DIR / f'transactions_batch_{batch_number:04d}.csv'
    
    with open(filename, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = list(transactions[0].keys())
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(transactions)
    
    print(f"Generated batch {batch_number}: {filename} ({num_records} records)")
    return filename
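
A file-based streaming reader that lists the input directory while `generate_batch` is still writing could pick up a partially written file. One common mitigation, sketched here as an assumption rather than something this notebook does, is to write under a temporary name and rename the file into place once it is complete. `write_csv_atomically` and the `.tmp` naming are hypothetical, and `os.replace` is atomic only when both paths are on the same filesystem.

```python
import csv
import os
import tempfile
from pathlib import Path


def write_csv_atomically(rows, target_path):
    """Write rows (a non-empty list of dicts) to target_path via a temp file."""
    target_path = Path(target_path)
    # Dot-prefixed name without a '.csv' suffix, so a '*.csv' glob does not match it
    tmp_path = target_path.with_name(f'.{target_path.name}.tmp')
    with open(tmp_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    # Rename into place only after the file is fully written and closed
    os.replace(tmp_path, target_path)
    return target_path


# Demo in a throwaway directory rather than the notebook's INPUT_DIR
demo_dir = Path(tempfile.mkdtemp())
out = write_csv_atomically(
    [{'transaction_id': 'TXN00000000', 'amount': 19.99}],
    demo_dir / 'transactions_batch_0000.csv',
)
print(sorted(p.name for p in demo_dir.iterdir()))
```

After the call, only the final CSV remains in the directory; the temporary file exists only while the write is in progress.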

## Execute Data Generation

Generate initial batches of data. For demonstration purposes, we'll create 5 batches.
In a real streaming scenario, new files would arrive continuously.

In [None]:
# Clean up existing files (optional - for fresh runs)
print("Cleaning up existing input files...")
for file in INPUT_DIR.glob('*.csv'):
    file.unlink()
print("Cleanup complete.\n")

# Generate initial batches
NUM_BATCHES = 5
RECORDS_PER_BATCH = 100

print(f"Generating {NUM_BATCHES} batches with {RECORDS_PER_BATCH} records each...\n")

for batch_num in range(NUM_BATCHES):
    generate_batch(batch_num, RECORDS_PER_BATCH)
    # Simulate delay between batch arrivals (optional)
    if batch_num < NUM_BATCHES - 1:
        time.sleep(0.5)

print(f"\nData generation complete! Total records: {NUM_BATCHES * RECORDS_PER_BATCH}")

## Verify Generated Data

List the generated files, then inspect the first batch: its shape, a few sample records, the inferred data types, and the null counts per column.

In [None]:
# List generated files
files = sorted(INPUT_DIR.glob('*.csv'))
print(f"Total files generated: {len(files)}\n")

for file in files:
    print(f"  - {file.name}")

In [None]:
# Read and display sample data from first batch
import pandas as pd

if files:
    sample_df = pd.read_csv(files[0])
    print(f"\nSample data from {files[0].name}:")
    print(f"\nShape: {sample_df.shape}")
    print("\nFirst 5 records:")
    print(sample_df.head())
    print("\nData types:")
    print(sample_df.dtypes)
    print("\nNull counts:")
    print(sample_df.isnull().sum())
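
The null counts reported here trace back to fields that were `None` in the generated dictionaries. `csv.DictWriter` writes `None` as an empty field, and readers then have to interpret that empty field as missing. The sketch below illustrates the round trip with the standard library only; the two sample rows are made up for the example and use a subset of the notebook's columns.

```python
import csv
import io

# Two illustrative rows, each with one missing value
rows = [
    {'transaction_id': 'TXN00000001', 'discount_percent': None, 'customer_segment': 'premium'},
    {'transaction_id': 'TXN00000002', 'discount_percent': 12.5, 'customer_segment': None},
]

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)

# Read the same text back; every value comes back as a string
buffer.seek(0)
read_back = list(csv.DictReader(buffer))

# Count empty strings per column, the CSV-level trace of the original None values
empty_counts = {
    column: sum(1 for row in read_back if row[column] == '')
    for column in read_back[0]
}
print(buffer.getvalue())
print(empty_counts)
```

Because the CSV carries no type information, a downstream reader cannot distinguish a missing value from an empty string, which is one reason to define an explicit schema at ingestion.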

## Summary

This notebook has successfully generated streaming transaction data in CSV format. The data includes:

- Realistic transaction records with various attributes
- Multiple product categories and payment methods
- Intentional data quality issues (nulls) to demonstrate cleaning in downstream steps
- Batch base timestamps spaced 5 minutes apart, with per-event jitter of up to a minute, to simulate real-time arrival

**Next Steps:**
- Proceed to notebook 02 to set up Spark Structured Streaming ingestion
- Define explicit schema for type safety
- Configure streaming read from this input directory