# Data Ingestion and Raw Tables

This notebook performs the initial data ingestion from source files and creates raw tables in the data warehouse. The process includes:

1. Loading source data from CSV, JSON, and Excel files
2. Basic data validation and cleanup
3. Saving data in Parquet format for efficient processing

## Setup

First, we'll import required libraries and initialize our Spark session:

# E-commerce Data Ingestion and Raw Tables

This notebook handles the initial data ingestion from various sources (CSV, JSON, Excel) and creates raw tables for further processing.

## Contents
1. Setup and Dependencies
2. Schema Definitions
3. Data Loading
4. Initial Validation
5. Raw Table Creation

In [None]:
# Import required libraries
import os
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType, DateType
from datetime import datetime
from src.config import SparkConfig, DataConfig, get_spark_configs

# Start a session builder with the configured app name and master
builder = SparkSession.builder \
    .appName(SparkConfig.APP_NAME) \
    .master(SparkConfig.MASTER)

# Apply every configured key/value pair to the builder
for key, value in get_spark_configs().items():
    builder = builder.config(key, value)

# Create the session (or reuse one that is already running)
spark = builder.getOrCreate()
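
The cell above imports `SparkConfig`, `DataConfig`, and `get_spark_configs` from `src.config`, a module this notebook does not show. As a rough illustration only, such a module might look like the sketch below. Every value in it is an assumption chosen for the example (the app name, the thresholds, the required-column lists, and the two Spark settings), not the project's actual configuration.

```python
# Hypothetical sketch of src/config.py; all values are assumptions,
# not the project's real configuration.


class SparkConfig:
    APP_NAME = "ecom-ingestion"      # assumed application name
    MASTER = "local[*]"              # assumed: run Spark locally on all cores
    SHUFFLE_PARTITIONS = 8           # assumed tuning value


class DataConfig:
    DATA_DIR = "data"                # folder with the source files (assumed)
    PROCESSED_DIR = "processed"      # output folder (assumed)
    MAX_NULL_PERCENTAGE = 0.05       # assumed: warn above 5% nulls, stored as a fraction
    MAX_DUPLICATE_PERCENTAGE = 0.01  # assumed: warn above 1% duplicates, stored as a fraction
    # Keys are the df_name values passed to validate_dataframe (assumed lists,
    # here taken from the non-nullable fields of each schema)
    REQUIRED_COLUMNS = {
        "products": ["Product ID", "Product Name"],
        "customers": ["Customer ID", "Customer Name"],
        "orders": ["Order ID", "Customer ID", "Product ID", "Order Date",
                   "Quantity", "Sales", "Profit"],
    }


def get_spark_configs():
    """Return Spark settings as a dict of config keys to string values."""
    return {
        "spark.sql.shuffle.partitions": str(SparkConfig.SHUFFLE_PARTITIONS),
        "spark.sql.session.timeZone": "UTC",
    }
```

Keeping the thresholds as fractions (0.05 rather than 5) matches how the validation code later compares them against `null_count / row_count` ratios.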


## Load Raw Data

Before loading the raw data, we resolve the project's data and output directories and confirm that the source directory exists:

In [None]:
# Get project root directory
import os
from src.config import DataConfig

# Set up paths
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
data_dir = os.path.join(project_root, DataConfig.DATA_DIR)
processed_dir = os.path.join(project_root, DataConfig.PROCESSED_DIR)

# Verify directories exist
if not os.path.exists(data_dir):
    raise FileNotFoundError(f"Data directory not found at {data_dir}")

# Create processed directory if it doesn't exist
os.makedirs(processed_dir, exist_ok=True)

print(f"Using data directory: {data_dir}")
print(f"Using processed directory: {processed_dir}")

Using data directory: c:\Users\kulde\Downloads\PEI\ecom_assignment\data
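
The path setup above assumes the notebook is started from a folder exactly one level below the project root. A possible alternative, sketched here under the assumption that the project root is the nearest ancestor containing a `src` directory (the notebook imports from `src.config`), is to walk upward until that marker is found. The helper name `find_project_root` and the `marker` parameter are hypothetical.

```python
from pathlib import Path


def find_project_root(start, marker="src"):
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = Path(start).resolve()
    # Check the starting directory first, then each parent in turn
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(
        f"No parent of {start} contains a '{marker}' directory"
    )


# Hypothetical usage in place of os.path.join(os.getcwd(), '..'):
# project_root = find_project_root(Path.cwd())
```

This keeps the notebook working when it is launched from a different subfolder, at the cost of depending on the marker directory's name.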


## Schema Definitions

Define the schemas for our raw tables to ensure data consistency and proper type enforcement.

In [4]:
# Define schema for Products table
products_schema = StructType([
    StructField("Product ID", StringType(), False),
    StructField("Category", StringType(), True),
    StructField("Sub-Category", StringType(), True),
    StructField("Product Name", StringType(), False),
    StructField("Price", FloatType(), True)
])

# Define schema for Customers table
customers_schema = StructType([
    StructField("Customer ID", StringType(), False),
    StructField("Customer Name", StringType(), False),
    StructField("Email", StringType(), True),
    StructField("Country", StringType(), True),
    StructField("City", StringType(), True),
    StructField("Postal Code", StringType(), True)
])

# Define schema for Orders table
orders_schema = StructType([
    StructField("Order ID", StringType(), False),
    StructField("Customer ID", StringType(), False),
    StructField("Product ID", StringType(), False),
    StructField("Order Date", DateType(), False),
    StructField("Quantity", IntegerType(), False),
    StructField("Sales", FloatType(), False),
    StructField("Profit", FloatType(), False)
])


## Data Validation Functions

Define functions to validate data quality according to configured thresholds.

In [None]:
def validate_dataframe(df, df_name):
    """Validate a DataFrame against the configured quality thresholds."""
    from pyspark.sql.functions import col

    required_cols = DataConfig.REQUIRED_COLUMNS[df_name]

    # Check required columns
    missing_cols = set(required_cols) - set(df.columns)
    if missing_cols:
        raise ValueError(f"Missing required columns in {df_name}: {missing_cols}")

    # Count rows once; an empty DataFrame has nothing to check and would
    # otherwise cause a division by zero below
    total_rows = df.count()
    if total_rows == 0:
        print(f"WARNING: {df_name} is empty")
        return

    # Check for nulls in each required column
    for column in required_cols:
        null_count = df.filter(col(column).isNull()).count()
        null_percentage = null_count / total_rows

        if null_percentage > DataConfig.MAX_NULL_PERCENTAGE:
            print(f"WARNING: Column {column} has {null_percentage:.2%} null values")

    # Check for duplicates
    if df_name == "orders":  # Only check duplicates for order records
        duplicate_count = total_rows - df.dropDuplicates(["Order ID"]).count()
        duplicate_percentage = duplicate_count / total_rows

        if duplicate_percentage > DataConfig.MAX_DUPLICATE_PERCENTAGE:
            print(f"WARNING: Found {duplicate_percentage:.2%} duplicate orders")
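
To make the threshold arithmetic concrete, the sketch below reproduces the two ratios from `validate_dataframe` in plain Python on a handful of made-up rows. The sample rows, the helper names `null_fraction` and `duplicate_fraction`, and both threshold values are assumptions for illustration; the real thresholds live in `DataConfig`.

```python
def null_fraction(rows, column):
    """Fraction of rows whose value for `column` is None (0.0 for no rows)."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def duplicate_fraction(rows, key):
    """Fraction of rows that are surplus copies of a key (total minus distinct)."""
    if not rows:
        return 0.0
    distinct = len({row.get(key) for row in rows})
    return (len(rows) - distinct) / len(rows)


# Made-up sample rows, for illustration only
orders = [
    {"Order ID": "A-1", "Customer ID": "C-1"},
    {"Order ID": "A-2", "Customer ID": None},
    {"Order ID": "A-2", "Customer ID": "C-3"},
    {"Order ID": "A-3", "Customer ID": "C-4"},
]

MAX_NULL_PERCENTAGE = 0.05       # assumed threshold, expressed as a fraction
MAX_DUPLICATE_PERCENTAGE = 0.01  # assumed threshold, expressed as a fraction

null_share = null_fraction(orders, "Customer ID")   # 1 null in 4 rows
dup_share = duplicate_fraction(orders, "Order ID")  # 4 rows, 3 distinct keys

print(f"null share: {null_share:.2%}, over threshold: {null_share > MAX_NULL_PERCENTAGE}")
print(f"duplicate share: {dup_share:.2%}, over threshold: {dup_share > MAX_DUPLICATE_PERCENTAGE}")
```

Both ratios are fractions between 0 and 1, which is why the warnings format them with `:.2%` rather than multiplying by 100.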

## Data Loading

Load data from various source files (CSV, JSON, Excel) and apply the defined schemas.

In [None]:
# Load Products data from CSV
products_df = spark.read.format("csv") \
    .option("header", "true") \
    .schema(products_schema) \
    .load(os.path.join(data_dir, "Products.csv"))

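# Load Customers data from Excel via pandas (needs pandas and an Excel engine such as openpyxl).
# Read every column as text so values like postal codes match the StringType schema,
# and turn missing cells (NaN) into None so Spark stores them as nulls.
customers_pdf = pd.read_excel(os.path.join(data_dir, "Customer.xlsx"), dtype=str)
customers_pdf = customers_pdf.astype(object).where(customers_pdf.notna(), None)
customers_df = spark.createDataFrame(
    customers_pdf,
    schema=customers_schema
)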

# Load Orders data from JSON
orders_df = spark.read.format("json") \
    .schema(orders_schema) \
    .load(os.path.join(data_dir, "Orders.json"))

# Show sample data
print("Products sample:")
products_df.show(5)
print("\nCustomers sample:")
customers_df.show(5)
print("\nOrders sample:")
orders_df.show(5)
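
One thing worth checking before the JSON load: Spark's JSON reader expects one JSON record per line by default, and a file that holds a single pretty-printed document needs the `multiLine` option. The layout of `Orders.json` is not stated here, so the sketch below is only a heuristic for telling the two apart. The helper name `needs_multiline` is hypothetical, and it reads the whole file into memory, which is acceptable only for small files.

```python
import json


def needs_multiline(path):
    """Return True when the file looks like one JSON document, not JSON Lines.

    Heuristic: if every non-empty line parses on its own as a JSON object,
    treat the file as JSON Lines; otherwise assume a multi-line document.
    """
    with open(path, encoding="utf-8") as handle:
        lines = [line.strip() for line in handle if line.strip()]
    for line in lines:
        try:
            if not isinstance(json.loads(line), dict):
                return True
        except json.JSONDecodeError:
            return True
    return False


# Hypothetical usage with the reader from the cell above:
# orders_path = os.path.join(data_dir, "Orders.json")
# reader = spark.read.format("json").schema(orders_schema)
# if needs_multiline(orders_path):
#     reader = reader.option("multiLine", "true")
# orders_df = reader.load(orders_path)
```

If the sample output for Orders shows only nulls, a mismatch between the file layout and the reader mode is one plausible cause to rule out.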

## Initial Data Validation

Perform basic data quality checks on the loaded data.

In [None]:
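# Check for missing values
def check_missing_values(df, table_name):
    """Print the number of nulls for every column that has at least one."""
    print(f"\nMissing values in {table_name}:")
    # Use column_name rather than col to avoid clashing with pyspark.sql.functions.col
    for column_name in df.columns:
        missing_count = df.filter(df[column_name].isNull()).count()
        if missing_count > 0:
            print(f"{column_name}: {missing_count} missing values")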

# Check for duplicate keys
def check_duplicate_keys(df, key_col, table_name):
    duplicate_count = df.groupBy(key_col).count().filter("count > 1").count()
    print(f"\nDuplicate keys in {table_name}: {duplicate_count}")

# Perform checks
for df, name, key in [
    (products_df, "Products", "Product ID"),
    (customers_df, "Customers", "Customer ID"),
    (orders_df, "Orders", "Order ID")
]:
    check_missing_values(df, name)
    check_duplicate_keys(df, key, name)