## Task 3: Create Enriched Orders Table

This notebook builds the **Enriched Orders Table**, a denormalized view of each sale that joins order, customer, and product data into a single row per order.
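
To make "denormalized" concrete, here is a minimal plain-Python sketch of the idea, written without Spark so it can be run anywhere. It is not the project's implementation (that lives in `src.load_enriched_orders` and is not shown here). The function name `enrich_orders`, the left-join behaviour for unmatched keys, and all sample rows are assumptions made for illustration; the column names are taken from the log and view output later in this notebook.

```python
# Illustrative sketch only: a denormalizing join over lists of dicts.
# All sample rows below are made up and do not come from the project data.

def enrich_orders(orders, customers, products):
    """Attach customer and product attributes to each order row."""
    # Index the lookup tables by their keys for constant-time access.
    customers_by_id = {c["Customer ID"]: c for c in customers}
    products_by_id = {p["Product ID"]: p for p in products}

    enriched = []
    for order in orders:
        # Assumption: left-join semantics, so unmatched keys yield None values.
        customer = customers_by_id.get(order["Customer ID"], {})
        product = products_by_id.get(order["Product ID"], {})
        enriched.append({
            "Order ID": order["Order ID"],
            "Customer Name": customer.get("Customer Name"),
            "Country": customer.get("Country"),
            "Category": product.get("Category"),
            "Sub-Category": product.get("Sub-Category"),
            "Profit": order["Profit"],
        })
    return enriched


orders = [{"Order ID": "O-1", "Customer ID": "C-1", "Product ID": "P-1", "Profit": 10.0}]
customers = [{"Customer ID": "C-1", "Customer Name": "Sample Customer", "Country": "Sample Country"}]
products = [{"Product ID": "P-1", "Category": "Sample Category", "Sub-Category": "Sample Sub-Category"}]

enriched = enrich_orders(orders, customers, products)
print(enriched[0])
```

In PySpark the same shape is usually expressed with `DataFrame.join` on the key columns; whether the project uses left or inner joins is not visible from this notebook.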


### 0. Environment Setup and Data Loading

In [2]:
import os
import sys
from pyspark.sql import SparkSession
from pyspark.sql.types import *

# Add project root to path for module imports
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

from src.spark_session import get_spark_session
from src.load_source_data import load_customer_data, load_orders_data, load_products_data

# Initialize Spark session
spark = get_spark_session("EnrichedOrdersNotebook")

# Load Raw Data
customers_df = load_customer_data(spark, os.path.join(PROJECT_ROOT, "data", "Customer.xlsx"))
orders_df = load_orders_data(spark, os.path.join(PROJECT_ROOT, "data", "Orders.json"))
products_df = load_products_data(spark, os.path.join(PROJECT_ROOT, "data", "Products.csv"))

print("Data loaded.")


Checking file at: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Customer.xlsx
File found. Loading Excel data using Spark...
Customer data loaded successfully

Checking file at: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Orders.json
File found. Loading JSON data using Spark...
Orders data loaded successfully

Checking file at: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Products.csv
File found. Loading CSV data using Spark...
Products data loaded successfully
Data loaded.


### 1. Create Enriched Orders Table

In [4]:
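# Reload the module so that edits to src/load_enriched_orders.py take effect
# without restarting the notebook kernel
import importlib
import src.load_enriched_orders
importlib.reload(src.load_enriched_orders)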

from src.load_enriched_orders import create_enriched_orders_table

# Create Enriched Orders Table
print("Creating Enriched Orders Table...")
enriched_orders_df = create_enriched_orders_table(orders_df, customers_df, products_df)

print(f"\nEnriched orders table created with {enriched_orders_df.count()} rows\n")
enriched_orders_df.printSchema()
enriched_orders_df.show(5)

Creating Enriched Orders Table...

--- Cleaning Orders Data for Enrichment ---
Removing records with negative Profit...
   Records with negative Profit: 1870
Removing records with NULL Order ID...
   Records with NULL Order ID: 0
   Removing duplicate Order IDs...
   Records with duplicate Order ID removed: 3697
Validating date formats...
   Records with invalid date format: 0
Removing records with NULL Customer ID...
   Records with NULL Customer ID: 0
Removing records with NULL Product ID...
   Records with NULL Product ID: 0
Orders data cleaning completed
  Original records: 9
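
The cleaning log above lists six filters applied in order: negative Profit, NULL Order ID, duplicate Order IDs, invalid date formats, NULL Customer ID, and NULL Product ID. The sketch below reproduces that sequence in plain Python so it runs without a Spark session. It is not the code in `src.load_enriched_orders`. Several details are assumptions: that duplicates keep the first occurrence, that only `Order Date` is validated, and that dates use day/month/year (suggested by values such as `29/6/2017` in the view output). The names `clean_orders` and `is_valid_date` and the sample rows are hypothetical.

```python
from datetime import datetime

# Assumption: day/month/year, matching values such as "29/6/2017".
DATE_FORMAT = "%d/%m/%Y"


def is_valid_date(value):
    """Return True if value parses with DATE_FORMAT, False otherwise."""
    try:
        datetime.strptime(value, DATE_FORMAT)
        return True
    except (TypeError, ValueError):
        return False


def clean_orders(rows):
    """Apply the six filters from the log, in the order the log reports them."""
    seen_order_ids = set()
    cleaned = []
    for row in rows:
        profit = row.get("Profit")
        if profit is not None and profit < 0:
            continue  # negative Profit
        order_id = row.get("Order ID")
        if order_id is None:
            continue  # NULL Order ID
        if order_id in seen_order_ids:
            continue  # duplicate Order ID (assumption: keep the first occurrence)
        seen_order_ids.add(order_id)
        if not is_valid_date(row.get("Order Date")):
            continue  # invalid date format
        if row.get("Customer ID") is None:
            continue  # NULL Customer ID
        if row.get("Product ID") is None:
            continue  # NULL Product ID
        cleaned.append(row)
    return cleaned


# Made-up sample rows, one per rule.
sample_rows = [
    {"Order ID": "O-1", "Order Date": "29/6/2017", "Customer ID": "C-1", "Product ID": "P-1", "Profit": 5.0},
    {"Order ID": "O-1", "Order Date": "29/6/2017", "Customer ID": "C-1", "Product ID": "P-2", "Profit": 3.0},
    {"Order ID": "O-2", "Order Date": "3/3/2016", "Customer ID": "C-1", "Product ID": "P-1", "Profit": -5.0},
    {"Order ID": None, "Order Date": "3/3/2016", "Customer ID": "C-1", "Product ID": "P-1", "Profit": 1.0},
    {"Order ID": "O-3", "Order Date": "2017-06-29", "Customer ID": "C-1", "Product ID": "P-1", "Profit": 1.0},
    {"Order ID": "O-4", "Order Date": "8/3/2016", "Customer ID": None, "Product ID": "P-1", "Profit": 1.0},
    {"Order ID": "O-5", "Order Date": "4/10/2015", "Customer ID": "C-2", "Product ID": "P-3", "Profit": 2.0},
]

cleaned_rows = clean_orders(sample_rows)
print([row["Order ID"] for row in cleaned_rows])
```

Running the filters in a fixed order matters when reading the counts in the log: each count is measured against the rows that survived the previous step, not against the original input.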

### 2. Data Quality Validation

In [5]:
from src.load_enriched_orders import validate_enriched_orders_table

print("Running data quality checks on enriched orders data...")
validated_enriched_orders_df = validate_enriched_orders_table(enriched_orders_df, orders_df)
print("\nEnriched orders data validation complete.")


Running data quality checks on enriched orders data...

--- Validating Enriched Orders Table ---



Data quality checks for enriched orders table passed successfully.

Enriched orders data validation complete.
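
The output above does not show which checks `validate_enriched_orders_table` runs. As a hedged illustration of the kind of checks that fit its signature (an enriched table plus the original orders), the plain-Python sketch below tests three things: the join did not multiply rows, the expected columns are present, and no row has a NULL Order ID. The function name `validate_enriched`, the choice of checks, and the sample rows are all assumptions, not the project's actual rules.

```python
def validate_enriched(enriched_rows, source_rows, required_columns):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # A join on non-unique keys can fan out rows; cleaning should only ever remove them.
    if len(enriched_rows) > len(source_rows):
        problems.append("enriched table has more rows than the source orders")

    for index, row in enumerate(enriched_rows):
        missing = [column for column in required_columns if column not in row]
        if missing:
            problems.append(f"row {index} is missing columns: {missing}")
        if row.get("Order ID") is None:
            problems.append(f"row {index} has a NULL Order ID")

    return problems


# Made-up sample data for illustration.
source = [{"Order ID": "O-1"}, {"Order ID": "O-2"}]
good = [{"Order ID": "O-1", "Customer Name": "Sample Customer", "Category": "Sample Category"}]
bad = [{"Order ID": None, "Customer Name": "Sample Customer"}]
columns = ["Order ID", "Customer Name", "Category"]

print(validate_enriched(good, source, columns))
print(validate_enriched(bad, source, columns))
```

Returning a list of problems rather than raising on the first failure makes it possible to report every issue in one run, which is convenient in a notebook.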


### 3. Create the Enriched Orders View

In [6]:
# Register the validated DataFrame as a temporary view so it can be queried with SQL
validated_enriched_orders_df.createOrReplaceTempView("enriched_orders_view")

print("Temporary view 'enriched_orders_view' created.")


Temporary view 'enriched_orders_view' created.


25/12/01 14:54:02 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
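
The warning above is harmless: Spark shortened the printed query plan because the table has many columns. If the full plan is needed for debugging, the limit can be raised through the setting the warning names. The sketch below is one way to do that on an existing session. The helper `apply_spark_conf` and the value `100` are hypothetical choices for illustration; only the configuration key comes from the warning itself.

```python
# Assumption: 100 is an arbitrary illustrative limit, not a value taken from this project.
EXTRA_SPARK_CONF = {"spark.sql.debug.maxToStringFields": "100"}


def apply_spark_conf(spark, conf):
    """Set each option on an existing session through spark.conf.set(key, value)."""
    for key, value in conf.items():
        spark.conf.set(key, value)
```

In this notebook the call would be `apply_spark_conf(spark, EXTRA_SPARK_CONF)`, placed right after the session is created.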


### 4. Display Data from the Enriched Orders View

In [7]:
print("Enriched Orders View...\n")
spark.sql("SELECT * FROM enriched_orders_view LIMIT 5").show()


Enriched Orders View...

+--------------+----------+---------+--------------+-------------+-------------+---------------+------------+--------------------+--------+------+--------+------+
|      Order ID|Order Date|Ship Date|     Ship Mode|Customer Name|      Country|       Category|Sub-Category|        Product Name|Quantity| Price|Discount|Profit|
+--------------+----------+---------+--------------+-------------+-------------+---------------+------------+--------------------+--------+------+--------+------+
|CA-2017-147039| 29/6/2017| 4/7/2017|Standard Class|   Alex Avila|United States|Office Supplies|  Appliances|Belkin 325VA UPS ...|       3|362.94|     0.0| 90.73|
|CA-2016-103982|  3/3/2016| 8/3/2016|Standard Class|   Alex Avila|United States|     Technology| Accessories|Verbatim 25 GB 6x...|       7| 41.72|     0.2|  5.74|
|CA-2015-121391| 4/10/2015|7/10/2015|   First Class|   Alex Avila|United States|Office Supplies|     Storage|Tenex Personal Pr...|       2| 26.96|     0.0|   7.

### 5. Stop Spark Session

In [None]:
spark.stop()
print("Spark session stopped.")