# Task 2: Create Enriched Tables for Customers and Products

This notebook creates enriched tables with calculated metrics, customer segmentation, and product performance analysis.

## Objectives:
- Enrich customer data with purchase behavior and segmentation
- Enhance product data with sales performance and profitability metrics
- Calculate derived metrics such as Customer Lifetime Value and RFM scores
- Implement business logic for customer and product classification
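
The RFM objective above can be illustrated with a small plain-Python sketch. It is not the notebook's implementation: the enrichment itself is delegated to `get_customer_metrics`, whose internals are not shown here. The sample rows, the `rfm_raw` and `rank_scores` helpers, and the rank-based scoring scheme are all assumptions made for illustration.

```python
from datetime import date

# Hypothetical sample rows: (customer_id, order_date, sales).
sample_orders = [
    (1, date(2023, 1, 1), 8000.0),
    (1, date(2023, 1, 15), 5000.0),
    (2, date(2023, 2, 1), 3000.0),
    (2, date(2023, 2, 15), 3500.0),
    (3, date(2023, 3, 1), 1500.0),
]

def rfm_raw(orders, as_of):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    result = {}
    for customer_id, order_date, sales in orders:
        days = (as_of - order_date).days
        # Start from this order's age, zero orders, and zero sales for a new customer.
        recency, frequency, monetary = result.get(customer_id, (days, 0, 0.0))
        result[customer_id] = (min(recency, days), frequency + 1, monetary + sales)
    return result

def rank_scores(values, higher_is_better=True):
    """Score each key from 1 (worst) to n (best) by rank; an assumed scheme."""
    ordered = sorted(values, key=values.get, reverse=not higher_is_better)
    return {key: rank for rank, key in enumerate(ordered, start=1)}

rfm = rfm_raw(sample_orders, as_of=date(2023, 4, 1))
# Lower recency (a more recent order) is better; higher monetary value is better.
recency_score = rank_scores({c: v[0] for c, v in rfm.items()}, higher_is_better=False)
monetary_score = rank_scores({c: v[2] for c, v in rfm.items()})

for customer_id in sorted(rfm):
    print(customer_id, rfm[customer_id],
          recency_score[customer_id], monetary_score[customer_id])
```

Frequency could be scored with the same helper, although plain ranking breaks ties by input order. Quantile buckets are a common alternative once the customer base is larger.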

In [None]:
# Import required libraries
import sys
import os
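# Add the project root (the parent of the notebook's working directory) to the import path.
# __file__ is not defined inside a notebook, so resolve from the working directory instead.
sys.path.append(os.path.dirname(os.getcwd()))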

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
from pyspark.sql.functions import col, sum as spark_sum, count, avg, max as spark_max, when, lit, round as spark_round
from src.processing import init_spark, get_customer_metrics, analyze_product_performance
from src.config import BusinessConfig

# Initialize Spark session
spark = init_spark("Task2_EnrichedTables")
print("Spark session initialized successfully")

In [None]:
# Load raw data (from Task 1 or sample data)
print("Loading Raw Data...")

# Sample data for demonstration
customer_data = [
    (1, "John Doe", "USA"),
    (2, "Jane Smith", "UK"), 
    (3, "Bob Wilson", "Canada"),
    (4, "Alice Brown", "USA"),
    (5, "Charlie Davis", "Germany")
]
customer_schema = StructType([
    StructField("Customer ID", IntegerType(), False),
    StructField("Customer Name", StringType(), False),
    StructField("Country", StringType(), False)
])
customers_df = spark.createDataFrame(customer_data, customer_schema)

# Sample orders data with varied amounts for segmentation
orders_data = [
    (1, 1, 1, "2023-01-01", 2, 8000.0, 1600.0),   # High value customer
    (2, 1, 2, "2023-01-15", 1, 5000.0, 1000.0),   # High value customer
    (3, 2, 1, "2023-02-01", 1, 3000.0, 600.0),    # Medium value customer
    (4, 2, 3, "2023-02-15", 2, 3500.0, 700.0),    # Medium value customer
    (5, 3, 2, "2023-03-01", 1, 1500.0, 300.0),    # Low value customer
    (6, 4, 1, "2023-03-15", 3, 4500.0, 900.0),    # Medium value customer
    (7, 5, 3, "2023-04-01", 1, 2000.0, 400.0),    # Low value customer
]
orders_schema = StructType([
    StructField("Order ID", IntegerType(), False),
    StructField("Customer ID", IntegerType(), False),
    StructField("Product ID", IntegerType(), False),
    StructField("Order Date", StringType(), False),
    StructField("Quantity", IntegerType(), False),
    StructField("Sales", DoubleType(), False),
    StructField("Profit", DoubleType(), False)
])
orders_df = spark.createDataFrame(orders_data, orders_schema)

# Sample products data
products_data = [
    (1, "Enterprise Laptop", "Technology", "Computers"),
    (2, "Executive Chair", "Furniture", "Office Furniture"),
    (3, "Business Phone", "Technology", "Mobile Devices")
]
products_schema = StructType([
    StructField("Product ID", IntegerType(), False),
    StructField("Product Name", StringType(), False),
    StructField("Category", StringType(), False),
    StructField("Sub-Category", StringType(), False)
])
products_df = spark.createDataFrame(products_data, products_schema)

print("Raw data loaded successfully")
print(f"Customers: {customers_df.count()}")
print(f"Orders: {orders_df.count()}")
print(f"Products: {products_df.count()}")

In [None]:
# Create Enriched Customer Table
print("Creating Enriched Customer Table...")

# Calculate customer metrics
enriched_customers = get_customer_metrics(orders_df, customers_df)

print(f"Enriched customers table created with {enriched_customers.count()} customers")
enriched_customers.printSchema()

print("\nEnriched Customer Data:")
enriched_customers.show(truncate=False)

In [None]:
# Customer Segmentation Analysis
print("Customer Segmentation Analysis...")

# Show customer segments
segment_distribution = enriched_customers.groupBy("Customer Segment").agg(
    count("*").alias("Customer Count"),
    spark_sum("Total Sales").alias("Segment Total Sales"),
    avg("Total Sales").alias("Avg Sales per Customer")
).orderBy("Segment Total Sales", ascending=False)

print("\nCustomer Segment Distribution:")
segment_distribution.show()

# High-value customer analysis
print("\nHigh-Value Customer Analysis:")
high_value_customers = enriched_customers.filter(col("Customer Segment") == "High Value")
print(f"High-value customers: {high_value_customers.count()}")
high_value_customers.select("Customer Name", "Country", "Total Sales", "Total Profit", "Total Orders").show()

In [None]:
# Create Enriched Product Table
print("Creating Enriched Product Table...")

# Calculate product performance metrics
enriched_products = analyze_product_performance(orders_df, products_df)

print(f"Enriched products table created with {enriched_products.count()} products")
enriched_products.printSchema()

print("\nEnriched Product Data:")
enriched_products.show(truncate=False)

In [None]:
# Product Performance Analysis
print("Product Performance Analysis...")

# Show performance by category
category_performance = enriched_products.groupBy("Category").agg(
    count("*").alias("Product Count"),
    spark_sum("Total Sales").alias("Category Total Sales"),
    avg("Profit Margin").alias("Avg Profit Margin"),
    spark_sum("Total Profit").alias("Category Total Profit")
).orderBy("Category Total Sales", ascending=False)

print("\nCategory Performance:")
category_performance.show()

# Top performing products
print("\nTop Performing Products (by Sales):")
top_products = enriched_products.orderBy(col("Total Sales").desc())
top_products.select("Product Name", "Category", "Sub-Category", "Total Sales", "Profit Margin", "Performance Flag").show()

In [None]:
# Advanced Customer Analytics
print("Advanced Customer Analytics...")

# Customer Activity Status Analysis
activity_distribution = enriched_customers.groupBy("Activity Status").agg(
    count("*").alias("Customer Count"),
    avg("Total Sales").alias("Avg Sales"),
    avg("Days Since Order").alias("Avg Days Since Order")
)

print("\nCustomer Activity Distribution:")
activity_distribution.show()

# Customer Value vs Activity Cross-Analysis
print("\nCustomer Value vs Activity Analysis:")
value_activity_matrix = enriched_customers.groupBy("Customer Segment", "Activity Status").agg(
    count("*").alias("Count")
).orderBy("Customer Segment", "Activity Status")

value_activity_matrix.show()

In [None]:
# Product Profitability Deep Dive
print("Product Profitability Analysis...")

# Profit margin distribution
profit_margin_ranges = enriched_products.withColumn(
    "Margin Range",
    when(col("Profit Margin") >= 25, "Excellent (25%+)")
    .when(col("Profit Margin") >= 15, "Good (15-25%)")
    .when(col("Profit Margin") >= 5, "Fair (5-15%)")
    .otherwise("Poor (<5%)")
)

margin_distribution = profit_margin_ranges.groupBy("Margin Range").agg(
    count("*").alias("Product Count"),
    avg("Total Sales").alias("Avg Sales")
).orderBy("Product Count", ascending=False)

print("\nProfit Margin Distribution:")
margin_distribution.show()

# Category profitability comparison
print("\nCategory Profitability Comparison:")
category_profitability = enriched_products.groupBy("Category", "Sub-Category").agg(
    spark_sum("Total Sales").alias("Total Sales"),
    spark_sum("Total Profit").alias("Total Profit"),
    avg("Profit Margin").alias("Avg Margin")
).orderBy("Total Profit", ascending=False)

category_profitability.show()

In [None]:
# Data Quality Validation for Enriched Tables
print("Data Quality Validation...")

def validate_enriched_data():
    """Validate enriched tables data quality"""
    print("\nEnriched Tables Validation:")
    print("=" * 40)
    
    # Customer table validation
    print("\nCustomer Table Validation:")
    
    # Check for null values in calculated fields
    null_total_sales = enriched_customers.filter(col("Total Sales").isNull()).count()
    null_segments = enriched_customers.filter(col("Customer Segment").isNull()).count()
    
    print(f"  Null Total Sales: {null_total_sales}")
    print(f"  Null Customer Segments: {null_segments}")
    
    # Validate segment logic
    high_value_count = enriched_customers.filter(col("Customer Segment") == "High Value").count()
    medium_value_count = enriched_customers.filter(col("Customer Segment") == "Medium Value").count()
    low_value_count = enriched_customers.filter(col("Customer Segment") == "Low Value").count()
    
    print(f"  High Value: {high_value_count}, Medium Value: {medium_value_count}, Low Value: {low_value_count}")
    
    # Product table validation
    print("\nProduct Table Validation:")
    
    null_profit_margins = enriched_products.filter(col("Profit Margin").isNull()).count()
    null_performance_flags = enriched_products.filter(col("Performance Flag").isNull()).count()
    
    print(f"  Null Profit Margins: {null_profit_margins}")
    print(f"  Null Performance Flags: {null_performance_flags}")
    
    # Validate profit margin calculations
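    # Use Spark's column-aware abs(); Python's builtin abs() does not support Column objects
    from pyspark.sql.functions import abs as spark_abs
    expected_margin = col("Total Profit") / col("Total Sales") * 100
    incorrect_margins = enriched_products.filter(
        (col("Total Sales") > 0) &
        (spark_abs(col("Profit Margin") - expected_margin) > 0.1)
    ).count()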
    
    print(f"  Incorrect Margin Calculations: {incorrect_margins}")
    
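    # Every counter reported above must be zero for the validation to pass
    if (null_total_sales == 0 and null_segments == 0 and null_profit_margins == 0
            and null_performance_flags == 0 and incorrect_margins == 0):
        print("\nAll validation checks passed!")
    else:
        print("\nSome validation issues detected")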

validate_enriched_data()

In [None]:
# Create views for SQL access
print("Creating Temporary Views...")

enriched_customers.createOrReplaceTempView("enriched_customers")
enriched_products.createOrReplaceTempView("enriched_products")

print("Views created:")
print("  - enriched_customers")
print("  - enriched_products")

# Test SQL queries
print("\nTesting SQL Access:")

print("\nCustomer Segment Summary:")
# Column names contain spaces, so they must be backtick-quoted in Spark SQL
spark.sql("""
    SELECT `Customer Segment`,
           COUNT(*) AS count,
           ROUND(AVG(`Total Sales`), 2) AS avg_sales
    FROM enriched_customers
    GROUP BY `Customer Segment`
    ORDER BY avg_sales DESC
""").show()

print("\nProduct Performance Summary:")
# Column names contain spaces, so they must be backtick-quoted in Spark SQL
spark.sql("""
    SELECT `Performance Flag`,
           COUNT(*) AS count,
           ROUND(AVG(`Profit Margin`), 2) AS avg_margin
    FROM enriched_products
    GROUP BY `Performance Flag`
    ORDER BY avg_margin DESC
""").show()

## Summary of Task 2: Enriched Tables Creation

### Accomplished:
1. **Customer Enrichment**: Created enriched customer table with metrics and segmentation
2. **Product Enrichment**: Enhanced product data with performance analytics
3. **Business Logic**: Implemented customer segmentation and product classification
4. **Advanced Analytics**: Customer lifetime value, activity status, and profitability analysis

### Enriched Tables Created:

#### Customer Enrichment:
- **Total Sales & Profit**: Aggregated customer purchase history
- **Customer Segmentation**: High/Medium/Low value classification
- **Activity Status**: Active/Inactive customer classification
- **Order Metrics**: Total orders, average order value, days since last order
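
As a rough illustration of how these customer fields fit together, here is a plain-Python sketch over the notebook's sample orders. It is not the code behind `get_customer_metrics`. The thresholds (`HIGH_VALUE_MIN`, `MEDIUM_VALUE_MIN`, `ACTIVE_WITHIN_DAYS`), the `Avg Order Value` field name, and the reference date are assumptions; the real values would presumably come from `BusinessConfig`.

```python
from collections import defaultdict
from datetime import date

# Assumed thresholds for illustration only; not taken from BusinessConfig.
HIGH_VALUE_MIN = 10_000.0
MEDIUM_VALUE_MIN = 3_000.0
ACTIVE_WITHIN_DAYS = 90

# The notebook's sample orders as (customer_id, order_date, sales, profit).
sample_orders = [
    (1, date(2023, 1, 1), 8000.0, 1600.0),
    (1, date(2023, 1, 15), 5000.0, 1000.0),
    (2, date(2023, 2, 1), 3000.0, 600.0),
    (2, date(2023, 2, 15), 3500.0, 700.0),
    (3, date(2023, 3, 1), 1500.0, 300.0),
    (4, date(2023, 3, 15), 4500.0, 900.0),
    (5, date(2023, 4, 1), 2000.0, 400.0),
]

def segment(total_sales):
    """Classify a customer by total sales using the assumed thresholds."""
    if total_sales >= HIGH_VALUE_MIN:
        return "High Value"
    if total_sales >= MEDIUM_VALUE_MIN:
        return "Medium Value"
    return "Low Value"

def enrich_customers(orders, as_of):
    """Aggregate orders per customer and derive segment and activity status."""
    grouped = defaultdict(list)
    for customer_id, order_date, sales, profit in orders:
        grouped[customer_id].append((order_date, sales, profit))

    enriched = {}
    for customer_id, rows in grouped.items():
        total_sales = sum(sales for _, sales, _ in rows)
        days_since = (as_of - max(order_date for order_date, _, _ in rows)).days
        enriched[customer_id] = {
            "Total Sales": total_sales,
            "Total Profit": sum(profit for _, _, profit in rows),
            "Total Orders": len(rows),
            "Avg Order Value": round(total_sales / len(rows), 2),
            "Days Since Order": days_since,
            "Customer Segment": segment(total_sales),
            "Activity Status": "Active" if days_since <= ACTIVE_WITHIN_DAYS else "Inactive",
        }
    return enriched

enriched = enrich_customers(sample_orders, as_of=date(2023, 6, 1))
for customer_id in sorted(enriched):
    print(customer_id, enriched[customer_id])
```

With these assumed cutoffs, the segments match the comments on the sample orders (customer 1 is high value, customers 2 and 4 medium, customers 3 and 5 low). Different thresholds would shift those boundaries.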

#### Product Enrichment:
- **Sales Performance**: Total sales, profit, and quantity metrics
- **Profitability**: Profit margin calculations and classifications
- **Performance Flags**: High/Good/Needs Improvement categorization
- **Category Analytics**: Performance by product categories
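
The product metrics can be sketched the same way. The margin formula (total profit divided by total sales, times 100) is the one the validation cell checks. The flag cutoffs, the exact flag strings, and the helper names below are assumptions, since the logic inside `analyze_product_performance` is not shown.

```python
from collections import defaultdict

# Assumed cutoffs in percent; the notebook's real thresholds are not shown.
HIGH_MARGIN_MIN = 25.0
GOOD_MARGIN_MIN = 15.0

# The notebook's sample orders as (product_id, quantity, sales, profit).
sample_orders = [
    (1, 2, 8000.0, 1600.0),
    (2, 1, 5000.0, 1000.0),
    (1, 1, 3000.0, 600.0),
    (3, 2, 3500.0, 700.0),
    (2, 1, 1500.0, 300.0),
    (1, 3, 4500.0, 900.0),
    (3, 1, 2000.0, 400.0),
]

def profit_margin(total_sales, total_profit):
    """Margin in percent, or None when there are no sales to divide by."""
    if total_sales <= 0:
        return None
    return round(total_profit / total_sales * 100, 2)

def performance_flag(margin):
    """Map a margin to an assumed High / Good / Needs Improvement label."""
    if margin is None:
        return "Needs Improvement"
    if margin >= HIGH_MARGIN_MIN:
        return "High"
    if margin >= GOOD_MARGIN_MIN:
        return "Good"
    return "Needs Improvement"

def enrich_products(orders):
    """Sum quantity, sales, and profit per product, then derive margin and flag."""
    totals = defaultdict(lambda: [0, 0.0, 0.0])
    for product_id, quantity, sales, profit in orders:
        totals[product_id][0] += quantity
        totals[product_id][1] += sales
        totals[product_id][2] += profit

    enriched = {}
    for product_id, (quantity, sales, profit) in totals.items():
        margin = profit_margin(sales, profit)
        enriched[product_id] = {
            "Total Quantity": quantity,
            "Total Sales": sales,
            "Total Profit": profit,
            "Profit Margin": margin,
            "Performance Flag": performance_flag(margin),
        }
    return enriched

enriched_products_sketch = enrich_products(sample_orders)
for product_id in sorted(enriched_products_sketch):
    print(product_id, enriched_products_sketch[product_id])
```

Every sample order carries a 20% profit, so all three products land on the same margin here. Real data would spread across the flags.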

### Key Insights:
- Customer segmentation reveals value distribution across customer base
- Product performance analysis identifies top performers and improvement opportunities
- Activity status helps identify at-risk customers for retention strategies
- Profit margin analysis guides pricing and product mix decisions

### Next Steps:
Ready for Task 3: Create enriched orders table with complete customer and product information.