## Task 2: Create Enriched Tables

This notebook demonstrates how to use the enrichment functions from `load_enriched_table.py` to create enriched customer and product tables.

### Enrichments Covered:
- **Customer Enrichments:**
  - Total Sales, Profit, and Orders
  - Average Order Value
  - Product Diversity
  - Churn Risk Score
- **Product Enrichments:**
  - Price Positioning
  - Inventory Velocity

### 0. Environment Setup and Data Loading

In [2]:
import os
import sys

# Add project root to path for module imports
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

from src.spark_session import get_spark_session
from src.load_source_data import load_customer_data, load_orders_data, load_products_data

# Initialize Spark session
spark = get_spark_session("EnrichmentNotebook")

# Load Raw Data
customers_df = load_customer_data(spark, os.path.join(PROJECT_ROOT, "data", "Customer.xlsx"))
orders_df = load_orders_data(spark, os.path.join(PROJECT_ROOT, "data", "Orders.json"))
products_df = load_products_data(spark, os.path.join(PROJECT_ROOT, "data", "Products.csv"))

print("Data loaded.")


Checking file at: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Customer.xlsx
File found. Loading Excel data using Spark...
Customer data loaded successfully

Checking file at: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Orders.json
File found. Loading JSON data using Spark...
Orders data loaded successfully

Checking file at: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Products.csv
File found. Loading CSV data using Spark...
Products data loaded successfully
Data loaded.


### 1. Customer Enrichment

In [None]:
import importlib
import src.load_enriched_table
importlib.reload(src.load_enriched_table)

from src.load_enriched_table import (
    summarize_customer_spending,
    calculate_average_basket_size,
    measure_customer_product_variety,
    identify_at_risk_customers,
    classify_product_price_level,
    identify_fast_and_slow_sellers
)

# Create Enriched Customer Table
print("Creating Enriched Customer Table...")

# Columns added: total_sales, total_profit, total_orders.
# Description: Summarizes lifetime value and engagement.
enriched_customers_df = summarize_customer_spending(customers_df, orders_df)

print(f"\nEnriched customers table created with {enriched_customers_df.count()} rows\n")
enriched_customers_df.printSchema()
enriched_customers_df.take(5)

Creating Enriched Customer Table...

Enriched customers table created with 793 rows

root
 |-- Customer ID: string (nullable = true)
 |-- Customer Name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- address: string (nullable = true)
 |-- Segment: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Postal Code: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- total_sales: double (nullable = true)
 |-- total_profit: double (nullable = true)
 |-- total_orders: long (nullable = true)



[Row(Customer ID='PW-19240', Customer Name='Pierre Wener', email='bettysullivan808@gmail.com', phone='421.580.0902x9815', address='001 Jones Ridges Suite 338\nJohnsonfort, FL 95462', Segment='Consumer', Country='United States', City='Louisville', State='Colorado', Postal Code='80027', Region='West', total_sales=3921.6, total_profit=1291.1500000000003, total_orders=7),
 Row(Customer ID='GH-14410', Customer Name='Gary567 Hansen', email='austindyer948@gmail.com', phone='001-542-415-0246x314', address='00347 Murphy Unions\nAshleyton, IA 29814', Segment='Home Office', Country='United States', City='Chicago', State='Illinois', Postal Code='60653', Region='Central', total_sales=2818.88, total_profit=-577.0099999999999, total_orders=9),
 Row(Customer ID='KL-16555', Customer Name='Kelly Lampkin', email='clarencehughes280@gmail.com', phone='7185624866', address='007 Adams Lane Suite 176\nEast Amyberg, IN 34581', Segment='Corporate', Country='United States', City='Colorado Springs', State='Colora
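
The three columns added in this step (`total_sales`, `total_profit`, `total_orders`) are per-customer aggregates over the orders data. The body of `summarize_customer_spending` is not shown in this notebook, so the following is only a plain-Python sketch of that kind of aggregation. The order records, their field names, and the choice to count distinct order IDs are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical order lines; the field names are assumed, not taken from Orders.json.
order_lines = [
    {"customer_id": "C-1", "order_id": "O-1", "price": 100.0, "profit": 20.0},
    {"customer_id": "C-1", "order_id": "O-1", "price": 50.0, "profit": -5.0},
    {"customer_id": "C-1", "order_id": "O-2", "price": 30.0, "profit": 10.0},
    {"customer_id": "C-2", "order_id": "O-3", "price": 80.0, "profit": 16.0},
]


def summarize_spending(lines):
    """Aggregate sales, profit, and distinct order count per customer."""
    totals = defaultdict(lambda: {"sales": 0.0, "profit": 0.0, "orders": set()})
    for line in lines:
        entry = totals[line["customer_id"]]
        entry["sales"] += line["price"]
        entry["profit"] += line["profit"]
        entry["orders"].add(line["order_id"])  # a set keeps each order ID once
    return {
        customer_id: {
            "total_sales": entry["sales"],
            "total_profit": entry["profit"],
            "total_orders": len(entry["orders"]),
        }
        for customer_id, entry in totals.items()
    }


summary = summarize_spending(order_lines)
```

Note that a negative `total_profit`, as seen for customer GH-14410 above, is a legitimate result of summing loss-making order lines rather than a data error.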

In [4]:
# Alias Spark's round so it does not shadow Python's built-in round()
from pyspark.sql.functions import col, desc, round as spark_round

print("Calculating average basket size for each customer...")

# Columns added: average_order_value.
# Description: Calculates the average spending per order.
customer_basket_size_df = calculate_average_basket_size(enriched_customers_df, orders_df)

customer_basket_size_df.select(
    col("Customer ID"),
    col("Customer Name"),
    spark_round(col("average_order_value"), 2).alias("average_order_value")
).orderBy(desc("average_order_value")).take(5)

Calculating average basket size for each customer...


[Row(Customer ID='DB-13555', Customer Name='Dorothy Badders', average_order_value=13892.0),
 Row(Customer ID='DP-13105', Customer Name='Dave Poirier', average_order_value=6216.74),
 Row(Customer ID='SC-20770', Customer Name='Stewart Carmichael', average_order_value=6110.55),
 Row(Customer ID='SM-20320', Customer Name='Sean Miller', average_order_value=5008.61),
 Row(Customer ID='SC-20020', Customer Name='Sam Craven', average_order_value=4495.38)]
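
The sample rows printed for the final customer table in this notebook are consistent with `average_order_value` being `total_sales` divided by `total_orders`. For example, PW-19240 has `total_sales=3921.6` and `total_orders=7`, and GH-14410 has `total_sales=2818.88` and `total_orders=9`. A minimal plain-Python check of that arithmetic follows; the helper name and the `None` guard for customers without orders are assumptions, not the notebook's actual implementation.

```python
def average_order_value(total_sales, total_orders):
    """Return sales per order, or None when the customer has no orders."""
    if not total_orders:
        return None
    return total_sales / total_orders


# Values copied from the sample rows shown in this notebook's output.
aov_pw = average_order_value(3921.6, 7)   # PW-19240
aov_gh = average_order_value(2818.88, 9)  # GH-14410

print(round(aov_pw, 2), round(aov_gh, 2))  # prints: 560.23 313.21
```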

In [5]:
from pyspark.sql.functions import col, desc

print("Analyzing product variety per customer...")

# Columns added: unique_products_purchased.
# Description: Counts the number of unique products purchased.
customer_product_variety_df = measure_customer_product_variety(customer_basket_size_df, orders_df)

customer_product_variety_df.select(
    col("Customer ID"),
    col("Customer Name"),
    col("unique_products_purchased")
).orderBy(desc("unique_products_purchased")).take(5)

Analyzing product variety per customer...


[Row(Customer ID='WB-21850', Customer Name='William Brown', unique_products_purchased=36),
 Row(Customer ID='MA-17560', Customer Name='Matt Abelman', unique_products_purchased=34),
 Row(Customer ID='PP-18955', Customer Name='Paul Prost', unique_products_purchased=34),
 Row(Customer ID='JL-15835', Customer Name='John Lee', unique_products_purchased=33),
 Row(Customer ID='EH-13765', Customer Name='Edward Hooks', unique_products_purchased=32)]
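
Counting unique products means that repeat purchases of the same product count only once. The plain-Python sketch below illustrates the idea with a set per customer; the input pairs and the helper name are hypothetical and do not reflect how `measure_customer_product_variety` is actually written.

```python
from collections import defaultdict

# Hypothetical (customer_id, product_id) pairs; a repeat purchase appears twice.
purchases = [
    ("C-1", "P-10"),
    ("C-1", "P-10"),
    ("C-1", "P-11"),
    ("C-2", "P-10"),
]


def unique_products_per_customer(pairs):
    """Count distinct product IDs per customer."""
    seen = defaultdict(set)
    for customer_id, product_id in pairs:
        seen[customer_id].add(product_id)  # duplicates are absorbed by the set
    return {customer_id: len(products) for customer_id, products in seen.items()}


variety = unique_products_per_customer(purchases)
```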

In [6]:
from pyspark.sql.functions import col, desc

print("Identifying at-risk customers based on purchase order...")

# Columns added: days_since_last_order, churn_risk_score.
# Description: Identifies customers at risk of churning.
at_risk_customers_df = identify_at_risk_customers(customer_product_variety_df, orders_df)

at_risk_customers_df.select(
    col("Customer ID"),
    col("Customer Name"),
    col("days_since_last_order"),
    col("churn_risk_score")
).where(col("days_since_last_order").isNotNull()).orderBy(desc("days_since_last_order")).take(5)

Identifying at-risk customers based on purchase order...


[]
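
The empty list means that no customer has a non-null `days_since_last_order`. The sample rows of the final customer table in this notebook likewise show `days_since_last_order=None` together with `churn_risk_score='Low Risk'`. One possible cause, stated here as an assumption because neither the function body nor the order date format is shown, is that the order dates fail to parse and a missing recency then falls through to the default risk label. The plain-Python sketch below shows one way to surface that case explicitly. The date format, the thresholds, and every label other than `'Low Risk'` are hypothetical.

```python
from datetime import date, datetime


def days_since(last_order, as_of, fmt="%d/%m/%Y"):
    """Days between a date string and as_of, or None if the string does not parse."""
    try:
        parsed = datetime.strptime(last_order, fmt).date()
    except (TypeError, ValueError):
        return None
    return (as_of - parsed).days


def churn_risk(days):
    """Map recency to a risk label; thresholds are illustrative assumptions."""
    if days is None:
        return "Unknown"  # keep missing recency visible instead of defaulting to low risk
    if days > 180:
        return "High Risk"
    if days > 90:
        return "Medium Risk"
    return "Low Risk"


as_of = date(2024, 1, 1)
ok = days_since("01/10/2023", as_of)   # parses with the assumed day/month/year format
bad = days_since("2023-10-01", as_of)  # does not match the assumed format, so None
```

With a separate label for missing recency, a parsing problem shows up in the score distribution instead of being hidden among low-risk customers.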

In [7]:
# Final enriched customers dataframe
final_enriched_customers_df = at_risk_customers_df

final_enriched_customers_df.printSchema()
final_enriched_customers_df.take(5)

root
 |-- Customer ID: string (nullable = true)
 |-- Customer Name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- phone: string (nullable = true)
 |-- address: string (nullable = true)
 |-- Segment: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- City: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Postal Code: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- total_sales: double (nullable = true)
 |-- total_profit: double (nullable = true)
 |-- total_orders: long (nullable = true)
 |-- average_order_value: double (nullable = true)
 |-- unique_products_purchased: long (nullable = true)
 |-- days_since_last_order: integer (nullable = true)
 |-- churn_risk_score: string (nullable = true)



[Row(Customer ID='PW-19240', Customer Name='Pierre Wener', email='bettysullivan808@gmail.com', phone='421.580.0902x9815', address='001 Jones Ridges Suite 338\nJohnsonfort, FL 95462', Segment='Consumer', Country='United States', City='Louisville', State='Colorado', Postal Code='80027', Region='West', total_sales=3921.6, total_profit=1291.1500000000003, total_orders=7, average_order_value=560.2285714285714, unique_products_purchased=12, days_since_last_order=None, churn_risk_score='Low Risk'),
 Row(Customer ID='GH-14410', Customer Name='Gary567 Hansen', email='austindyer948@gmail.com', phone='001-542-415-0246x314', address='00347 Murphy Unions\nAshleyton, IA 29814', Segment='Home Office', Country='United States', City='Chicago', State='Illinois', Postal Code='60653', Region='Central', total_sales=2818.88, total_profit=-577.0099999999999, total_orders=9, average_order_value=313.2088888888889, unique_products_purchased=17, days_since_last_order=None, churn_risk_score='Low Risk'),
 Row(Cust

In [None]:
from src.load_enriched_table import validate_enriched_customer_data

print("Running data quality checks on enriched customer data...")
validate_enriched_df = validate_enriched_customer_data(final_enriched_customers_df)
validate_enriched_df.take(5)
print("\nCustomer data validation complete.")

### 2. Product Enrichment

In [11]:
print("Classifying product price levels...")

# Columns added: price_position.
# Description: Classifies product price as Premium, Mid-Range, or Budget within its category.
product_price_level_df = classify_product_price_level(products_df)

product_price_level_df.take(5)

Classifying product price levels...


[Row(Category='Furniture', Product ID='FUR-CH-10002024', Sub-Category='Chairs', Product Name='HON 5400 Series Task Chairs for Big and Tall', State='Texas', Price per product=490.686, price_rank_in_category=1, price_position='Premium'),
 Row(Category='Furniture', Product ID='FUR-TA-10001950', Sub-Category='Tables', Product Name='Balt Solid Wood Round Tables', State='Washington', Price per product=446.49, price_rank_in_category=2, price_position='Premium'),
 Row(Category='Furniture', Product ID='FUR-BO-10004834', Sub-Category='Bookcases', Product Name='Riverside Palais Royal Lawyers Bookcase, Royale Cherry Finish', State='Pennsylvania', Price per product=440.49, price_rank_in_category=3, price_position='Premium'),
 Row(Category='Furniture', Product ID='FUR-BO-10003404', Sub-Category='Bookcases', Product Name='Global Adaptabilites Bookcase, Cherry/Storm Gray Finish', State='New Jersey', Price per product=430.98, price_rank_in_category=4, price_position='Premium'),
 Row(Category='Furniture
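
The output above shows a `price_rank_in_category` that starts at 1 for the most expensive product in a category, alongside the `price_position` label. The rule that maps a rank to a label is not shown in this notebook. The plain-Python sketch below uses a hypothetical split into thirds within each category purely to illustrate the rank-then-bucket pattern; the sample products, the helper name, and the cut-offs are all assumptions.

```python
from collections import defaultdict

# Hypothetical (product_id, category, price) rows.
products = [
    ("P-1", "Furniture", 490.0),
    ("P-2", "Furniture", 120.0),
    ("P-3", "Furniture", 35.0),
    ("P-4", "Technology", 900.0),
    ("P-5", "Technology", 60.0),
    ("P-6", "Technology", 15.0),
]


def price_positions(rows):
    """Rank products by descending price within each category and label by thirds."""
    by_category = defaultdict(list)
    for product_id, category, price in rows:
        by_category[category].append((product_id, price))

    result = {}
    for items in by_category.values():
        items.sort(key=lambda item: item[1], reverse=True)  # most expensive first
        n = len(items)
        for rank, (product_id, _price) in enumerate(items, start=1):
            if rank * 3 <= n:            # top third of the category
                label = "Premium"
            elif rank * 3 <= 2 * n:      # middle third
                label = "Mid-Range"
            else:                        # bottom third
                label = "Budget"
            result[product_id] = (rank, label)
    return result


positions = price_positions(products)
```

Ranking within a category rather than across the whole catalogue keeps an inexpensive category from being labelled entirely as Budget.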

In [12]:
print("Identifying fast and slow-selling products...")

# Columns added: total_quantity_sold, inventory_velocity.
# Description: Classifies products as Fast, Medium, or Slow moving.
product_sales_velocity_df = identify_fast_and_slow_sellers(product_price_level_df, orders_df)

product_sales_velocity_df.take(5)

Identifying fast and slow-selling products...


[Row(Product ID='FUR-CH-10002024', Category='Furniture', Sub-Category='Chairs', Product Name='HON 5400 Series Task Chairs for Big and Tall', State='Texas', Price per product=490.686, price_rank_in_category=1, price_position='Premium', total_quantity_sold=39, inventory_velocity='Medium Moving'),
 Row(Product ID='FUR-TA-10001950', Category='Furniture', Sub-Category='Tables', Product Name='Balt Solid Wood Round Tables', State='Washington', Price per product=446.49, price_rank_in_category=2, price_position='Premium', total_quantity_sold=19, inventory_velocity='Slow Moving'),
 Row(Product ID='FUR-BO-10004834', Category='Furniture', Sub-Category='Bookcases', Product Name='Riverside Palais Royal Lawyers Bookcase, Royale Cherry Finish', State='Pennsylvania', Price per product=440.49, price_rank_in_category=3, price_position='Premium', total_quantity_sold=24, inventory_velocity='Medium Moving'),
 Row(Product ID='FUR-BO-10003404', Category='Furniture', Sub-Category='Bookcases', Product Name='Glo
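
In the rows above, products with 39 and 24 units sold are labelled `'Medium Moving'` and the product with 19 units sold is labelled `'Slow Moving'`. The actual cut-offs used by `identify_fast_and_slow_sellers` are not shown, so the thresholds in the following plain-Python sketch are hypothetical values chosen only to be consistent with those three rows. The `'Fast Moving'` label and the treatment of products with no recorded sales are also assumptions.

```python
def inventory_velocity(total_quantity_sold, slow_below=20, fast_from=50):
    """Bucket a product by units sold; both thresholds are illustrative assumptions."""
    if total_quantity_sold is None or total_quantity_sold < slow_below:
        return "Slow Moving"  # assumption: no recorded sales counts as slow
    if total_quantity_sold >= fast_from:
        return "Fast Moving"
    return "Medium Moving"


# Quantities copied from the three complete rows shown above.
labels = [inventory_velocity(quantity) for quantity in (39, 19, 24)]
```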

In [None]:
final_enriched_product_df = product_sales_velocity_df

final_enriched_product_df.printSchema()
final_enriched_product_df.take(5)

root
 |-- Product ID: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- Sub-Category: string (nullable = true)
 |-- Product Name: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Price per product: double (nullable = false)
 |-- price_rank_in_category: integer (nullable = false)
 |-- price_position: string (nullable = false)
 |-- total_quantity_sold: long (nullable = true)
 |-- inventory_velocity: string (nullable = false)
 |-- return_rate_percent: double (nullable = false)



[Row(Product ID='FUR-CH-10002024', Category='Furniture', Sub-Category='Chairs', Product Name='HON 5400 Series Task Chairs for Big and Tall', State='Texas', Price per product=490.686, price_rank_in_category=1, price_position='Premium', total_quantity_sold=39, inventory_velocity='Medium Moving', return_rate_percent=0.0),
 Row(Product ID='FUR-TA-10001950', Category='Furniture', Sub-Category='Tables', Product Name='Balt Solid Wood Round Tables', State='Washington', Price per product=446.49, price_rank_in_category=2, price_position='Premium', total_quantity_sold=19, inventory_velocity='Slow Moving', return_rate_percent=0.0),
 Row(Product ID='FUR-BO-10004834', Category='Furniture', Sub-Category='Bookcases', Product Name='Riverside Palais Royal Lawyers Bookcase, Royale Cherry Finish', State='Pennsylvania', Price per product=440.49, price_rank_in_category=3, price_position='Premium', total_quantity_sold=24, inventory_velocity='Medium Moving', return_rate_percent=0.0),
 Row(Product ID='FUR-BO-1

In [None]:
from src.load_enriched_table import validate_enriched_product_data

print("Running data quality checks on enriched product data...")
validate_enriched_product_data(final_enriched_product_df)
print("\nProduct data validation complete.")

### 3. Creation of Enriched Views

In [27]:
# Create temporary views
final_enriched_customers_df.createOrReplaceTempView("enriched_customers_view")
final_enriched_product_df.createOrReplaceTempView("enriched_products_view")

print("Temporary views 'enriched_customers_view' and 'enriched_products_view' created.")

Temporary views 'enriched_customers_view' and 'enriched_products_view' created.


### 4. Displaying Data from Enriched Views

In [29]:
print("Enriched Customer View...\n")
spark.sql("SELECT * FROM enriched_customers_view LIMIT 5").show()

print("\nEnriched Product View...")
spark.sql("SELECT * FROM enriched_products_view LIMIT 5").show()

Enriched Customer View...

+-----------+-------------------+--------------------+--------------------+--------------------+-----------+-------------+----------------+----------+-----------+-------+------------------+-------------------+------------+-------------------+-------------------------+---------------------+----------------+
|Customer ID|      Customer Name|               email|               phone|             address|    Segment|      Country|            City|     State|Postal Code| Region|       total_sales|       total_profit|total_orders|average_order_value|unique_products_purchased|days_since_last_order|churn_risk_score|
+-----------+-------------------+--------------------+--------------------+--------------------+-----------+-------------+----------------+----------+-----------+-------+------------------+-------------------+------------+-------------------+-------------------------+---------------------+----------------+
|   PW-19240|       Pierre Wener|bettysullivan808