# Data Extraction Process

This notebook reads a CSV extract from the source system and loads it into the Bronze layer as Parquet. It is the first step in our ETL pipeline.

## Process Overview
1. Set up environment and initialize Spark session
2. Define paths and configure logging
3. Extract data from source files
4. Load data into Bronze layer
5. Log results and clean up resources

## 1. Environment Setup

Initialize Spark session and import required libraries.

In [1]:
import logging
import os
from datetime import datetime

# Point PySpark at the local Java runtime (machine-specific path; adjust as needed).
# This must be set before the Spark session is created.
os.environ['JAVA_HOME'] = r'C:\Program Files\Java\jre1.8.0_451'

from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Data Ingestion") \
    .getOrCreate()
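
The hard-coded `JAVA_HOME` above only works on the machine it was written for. A minimal sketch of a more portable variant follows; the helper name `ensure_java_home` is hypothetical and not part of this notebook. It keeps any `JAVA_HOME` that is already defined and uses the notebook's path only as a fallback.

```python
import os

def ensure_java_home(default_path, env=os.environ):
    """Set JAVA_HOME to default_path only if it is not already defined.

    Returns the value that ends up in effect.
    """
    # setdefault leaves an existing value untouched
    return env.setdefault("JAVA_HOME", default_path)

# Hypothetical usage: fall back to the path used in this notebook
java_home = ensure_java_home(r"C:\Program Files\Java\jre1.8.0_451")
print(f"JAVA_HOME = {java_home}")
```

As with the original assignment, this would need to run before the Spark session is created.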

## 2. Path Configuration

Define paths for the source data, the Bronze layer, and the logs. Bronze output and logs are both organized by run date.

In [2]:
# Define paths
base_path = "../data/"
bronze_base_path = "../output/bronzeLayer"
date_str = datetime.now().strftime("%Y-%m-%d")

# Log path organized by date
log_dir = os.path.join("../logs", "data_ingestion", date_str)
log_path = os.path.join(log_dir, "data_ingestion.log")

# Ensure the logs directory exists
os.makedirs(log_dir, exist_ok=True)
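
On Windows, `os.path.join` inserts a backslash, which is why the cell output further down shows mixed separators such as `../output/bronzeLayer\2025-05-08`. One possible way to keep the separators consistent is sketched below. The helper `build_paths` and its parameter names are assumptions introduced for illustration; the directory layout mirrors the one defined above.

```python
from datetime import date
from pathlib import PurePosixPath

def build_paths(source, run_date, base="../data",
                bronze_base="../output/bronzeLayer", logs_base="../logs"):
    """Return the raw, bronze, and log paths for one source and run date.

    PurePosixPath always joins with forward slashes, regardless of the OS.
    """
    date_str = run_date.strftime("%Y-%m-%d")
    return {
        "raw": str(PurePosixPath(base) / f"{source}.csv"),
        "bronze": str(PurePosixPath(bronze_base) / date_str),
        "log": str(PurePosixPath(logs_base) / "data_ingestion" / date_str / "data_ingestion.log"),
    }

paths = build_paths("mrp_production", date(2025, 5, 8))
print(paths["bronze"])  # ../output/bronzeLayer/2025-05-08
```

Passing the run date in as an argument, rather than calling `datetime.now()` inside the helper, also makes it easier to re-run the ingestion for a past date.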

## 3. Logging Configuration

Set up logging to track the extraction process.

In [3]:
# Configure logging to write to the log file
logging.basicConfig(filename=log_path, level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')

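def log_message(message, level="info"):
    """Log a message at the given level ("info" or "error") and echo it to the notebook."""
    if level == "error":
        logging.error(message)
    else:
        # Treat any other value as info so no message is silently dropped from the log file
        logging.info(message)
    print(message)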

# Define source
source = "mrp_production"
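
`logging.basicConfig` does nothing if the root logger already has handlers, which can happen when a notebook cell is re-run or when another library configures logging first. A hedged alternative is a dedicated named logger with its own file and console handlers, sketched here. The function `get_ingestion_logger` and the logger name `data_ingestion` are hypothetical; the format string is the one used above.

```python
import logging

def get_ingestion_logger(log_path, name="data_ingestion"):
    """Return a logger that writes to log_path and to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid duplicate handlers when the cell is re-run
        formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
        for handler in (logging.FileHandler(log_path), logging.StreamHandler()):
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    # Do not pass records on to the root logger, which may have its own handlers
    logger.propagate = False
    return logger
```

With a logger like this, `logger.info(...)` and `logger.error(...)` would cover both the file write and the console echo that `log_message` currently handles with a separate `print`.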

## 4. Data Extraction and Loading

Extract data from source files and load it into the Bronze layer.

In [4]:
raw_path = os.path.join(base_path, f"{source}.csv")
bronze_path = os.path.join(bronze_base_path, date_str)

try:
    # Log start of the ingestion process
    log_message(f"Starting ingestion for {source} from {raw_path} to {bronze_path}")

    # Read raw data; the first row is the header and column types are inferred
    df = spark.read.csv(raw_path, header=True, inferSchema=True)

    # Ensure the bronze path exists
    os.makedirs(bronze_path, exist_ok=True)

    # Write to Bronze layer in Parquet format, organized by date.
    # "overwrite" replaces any output already written for the same date.
    df.write.mode("overwrite").parquet(bronze_path)

    # Log successful ingestion
    log_message(f"Successfully ingested {source} data to {bronze_path}")

except Exception as e:
    # Log any errors encountered during ingestion
    log_message(f"Error ingesting {source} data: {e}", level="error")

finally:
    # Stop the Spark session whether or not ingestion succeeded
    spark.stop()
    log_message("Data ingestion process completed.")

Starting ingestion for mrp_production from ../data/mrp_production.csv to ../output/bronzeLayer\2025-05-08
Successfully ingested mrp_production data to ../output/bronzeLayer\2025-05-08
Data ingestion process completed.
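
The final "completed" message is logged whether or not ingestion succeeded, so it cannot be used on its own to tell a good run from a failed one. The sketch below shows one way to make the outcome explicit. It is deliberately Spark-free: `run_ingestion`, `ingest`, and `cleanup` are hypothetical names, and the read/write step is passed in as a callable so the control flow can be exercised in isolation.

```python
def run_ingestion(source, ingest, cleanup, log=print):
    """Run ingest(), log the outcome, always run cleanup(), and return True on success."""
    succeeded = False
    try:
        log(f"Starting ingestion for {source}")
        ingest()
        succeeded = True
        log(f"Successfully ingested {source} data")
    except Exception as exc:
        log(f"Error ingesting {source} data: {exc}")
    finally:
        # Runs on both the success and the failure path
        cleanup()
    return succeeded

def failing_ingest():
    # Stand-in for a read that fails, e.g. because the CSV is missing
    raise FileNotFoundError("source file is missing")

messages = []
ok = run_ingestion("mrp_production", failing_ingest,
                   cleanup=lambda: messages.append("cleanup ran"),
                   log=messages.append)
print(ok)  # False
print(messages)
```

In this notebook, `ingest` would wrap the `spark.read.csv(...)` and `df.write...parquet(...)` calls and `cleanup` would be `spark.stop`; a downstream step could then check the returned flag before reading the Bronze output.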


## 5. Summary

The extraction process has successfully:
- Read data from the source CSV file
- Loaded it into the Bronze layer in Parquet format
- Organized the data by date
- Logged all operations for traceability

The data is now ready for the next step in the ETL pipeline: data auditing and transformation.
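
Before handing off to the auditing step, a lightweight check of the Bronze output directory can catch a partial write. The sketch below assumes Spark's default output committer settings, under which a successful write leaves a `_SUCCESS` marker file next to the Parquet part files; if that marker has been disabled in the Spark configuration, the check does not apply. The function name `bronze_output_looks_complete` is hypothetical.

```python
import os

def bronze_output_looks_complete(bronze_path):
    """Return True if bronze_path holds a _SUCCESS marker and at least one Parquet file."""
    if not os.path.isdir(bronze_path):
        return False
    names = os.listdir(bronze_path)
    has_marker = "_SUCCESS" in names
    has_data = any(name.endswith(".parquet") for name in names)
    return has_marker and has_data
```

This only inspects file names; it does not validate row counts or schema, which remain the job of the auditing step.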