## Task 5: SQL Profit Aggregates - Interactive Analysis

This notebook demonstrates the four SQL profit aggregate queries from the `sql_profit_aggregates.py` module.

Each query aggregates profit along a different dimension:
1. **Profit by Year** - Yearly profit trends
2. **Profit by Year + Category** - Category performance by year
3. **Profit by Customer** - Top and bottom customers
4. **Profit by Customer + Year** - Customer profitability over time
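
As a rough illustration of what a profit-by-year aggregate looks like in SQL, the sketch below runs a `GROUP BY` over a tiny in-memory table. It uses Python's built-in `sqlite3` instead of Spark SQL so that it runs without a Spark installation. The table name `enriched_orders`, the column names, the sample rows, and the function name `profit_by_year_sketch` are assumptions for illustration and are not taken from `sql_profit_aggregates.py`.

```python
import sqlite3

# Hypothetical sample rows: (order_id, order_year, category, profit).
SAMPLE_ORDERS = [
    ("O-1", 2014, "Furniture", 10.50),
    ("O-2", 2014, "Technology", 20.25),
    ("O-3", 2015, "Technology", 5.00),
    ("O-4", 2015, "Office Supplies", 7.75),
    ("O-5", 2015, "Furniture", 1.25),
]


def profit_by_year_sketch(rows):
    """Return (year, total_profit, order_count) tuples, ordered by year."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE enriched_orders ("
            "order_id TEXT, order_year INTEGER, category TEXT, profit REAL)"
        )
        conn.executemany("INSERT INTO enriched_orders VALUES (?, ?, ?, ?)", rows)
        query = """
            SELECT order_year,
                   ROUND(SUM(profit), 2) AS total_profit,
                   COUNT(DISTINCT order_id) AS order_count
            FROM enriched_orders
            GROUP BY order_year
            ORDER BY order_year
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


for year, total_profit, order_count in profit_by_year_sketch(SAMPLE_ORDERS):
    print(year, total_profit, order_count)
```

Adding `category` to both the `SELECT` list and the `GROUP BY` clause would give the year + category variant of the same query.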

## 1. Import Required Libraries

In [3]:
import sys
import os

# Add project root to path for module imports
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

from pyspark.sql import SparkSession
import pandas as pd

from src.sql_profit_aggregates import (
    get_profit_by_year,
    get_profit_by_year_category,
    get_profit_by_customer,
    get_profit_by_customer_year,
    validate_profit_aggregates
)

from src.load_source_data import load_customer_data, load_orders_data, load_products_data
from src.data_cleaning_utils import (
    clean_orders_for_enrichment,
    clean_customers_for_enrichment,
    clean_products_for_enrichment
)
from src.load_enriched_orders import create_enriched_orders_table

print("Libraries imported successfully!")

Libraries imported successfully!


## 2. Initialize Spark Session and Load Data

In [4]:
# Create Spark session
spark = SparkSession.builder \
    .appName("SQL_Profit_Aggregates") \
    .master("local[*]") \
    .getOrCreate()

print("Spark session created!")

# Define data paths
project_root = '/Users/kushalsenlaskar/Documents/E-commerce Sales Data'
customer_path = os.path.join(project_root, 'data', 'Customer.xlsx')
orders_path = os.path.join(project_root, 'data', 'Orders.json')
products_path = os.path.join(project_root, 'data', 'Products.csv')

# Load raw source data
print("\nLoading raw source data...")
customers_df = load_customer_data(spark, customer_path)
orders_df = load_orders_data(spark, orders_path)
products_df = load_products_data(spark, products_path)

print("Raw data loaded:")
print(f"  Customers: {customers_df.count()} records")
print(f"  Orders: {orders_df.count()} records")
print(f"  Products: {products_df.count()} records")

# Apply data cleaning
print("\n=== Applying Data Cleaning ===")
customers_df = clean_customers_for_enrichment(customers_df)
orders_df = clean_orders_for_enrichment(orders_df)
products_df = clean_products_for_enrichment(products_df)

print("\nCleaned data:")
print(f"  Customers: {customers_df.count()} records")
print(f"  Orders: {orders_df.count()} records")
print(f"  Products: {products_df.count()} records")

Spark session created!

Loading raw source data...

Checking file at: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Customer.xlsx
File found. Loading Excel data using Spark...
Customer data loaded successfully

Checking file at: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Orders.json
File found. Loading JSON data using Spark...
Orders data loaded successfully

Checking file at: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Products.csv
File found. Loading CSV data using Spark...
Products data loaded successfully
Raw data loaded:
  Customers: 793 records
  Orders: 9994 records
  Products: 1851 records

=== Applying Data Cleaning ===

--- Cleaning Customers Data for Enrichment ---
Cleaning special characters from Customer Name...
   Customer names cleaned successfully
Removing records with NULL

## 3. Create Enriched Orders Table

In [5]:
# Create enriched orders table
print("Creating enriched orders table...")
enriched_orders_df = create_enriched_orders_table(orders_df, customers_df, products_df)

print(f"Enriched orders table created: {enriched_orders_df.count()} records")
print(f"\nColumns: {enriched_orders_df.columns}")

Creating enriched orders table...

--- Cleaning Orders Data for Enrichment ---
Removing records with negative Profit...
   Records with negative Profit: 0
Removing records with NULL Order ID...
   Records with NULL Order ID: 0
   Removing duplicate Order IDs...
   Records with duplicate Order ID removed: 0
Validating date formats...
   Records with invalid date format: 0
Removing records with NULL Customer ID...
   Records with NULL Customer ID: 0
Removing records with NULL Product ID...
   Records with NULL Product ID: 0
Orders data cleaning completed
  Original records: 4427, A

## 4. Generate All Profit Aggregates

In [6]:
# Generate all four profit aggregates individually
print("Generating profit aggregates...\n")

profit_by_year = get_profit_by_year(enriched_orders_df)
profit_by_year_category = get_profit_by_year_category(enriched_orders_df)
profit_by_customer = get_profit_by_customer(enriched_orders_df)
profit_by_customer_year = get_profit_by_customer_year(enriched_orders_df)

print("Aggregates generated successfully!")
print("Available aggregates: profit_by_year, profit_by_year_category, profit_by_customer, profit_by_customer_year")

Generating profit aggregates...

Aggregates generated successfully!
Available aggregates: profit_by_year, profit_by_year_category, profit_by_customer, profit_by_customer_year


25/12/01 15:58:54 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


## 5. Profit by Year

In [7]:
# Display profit by year (one row per year, so no row limit is needed)
print("\n--- Profit by Year ---")
profit_by_year.show()



--- Profit by Year ---
+----+------------+-----------+
|Year|Total_Profit|Order_Count|
+----+------------+-----------+
|2014|     39104.9|        830|
|2015|     46621.2|        902|
|2016|    64079.74|       1147|
|2017|     69547.6|       1448|
+----+------------+-----------+



## 6. Profit by Year + Category

In [9]:
# Display the first 10 rows of profit by year and category
print("\n--- Profit by Year + Category ---")
profit_by_year_category.show(10)



--- Profit by Year + Category ---
+----+---------------+------------+-----------+
|Year|       Category|Total_Profit|Order_Count|
+----+---------------+------------+-----------+
|2014|      Furniture|     7395.89|        140|
|2014|Office Supplies|    15850.52|        524|
|2014|     Technology|    15858.49|        166|
|2015|      Furniture|    11258.47|        161|
|2015|Office Supplies|    15524.48|        539|
|2015|     Technology|    19838.25|        202|
|2016|      Furniture|    10429.55|        214|
|2016|Office Supplies|    23122.29|        712|
|2016|     Technology|     30527.9|        221|
|2017|      Furniture|     8895.09|        227|
+----+---------------+------------+-----------+
only showing top 10 rows


## 7. Profit by Customer

In [8]:
# Display profit by customer
print("\n--- Top 15 Customers by Profit ---")
profit_by_customer.limit(15).show()



--- Top 15 Customers by Profit ---
+----------------+------------+-----------+
|   Customer Name|Total_Profit|Order_Count|
+----------------+------------+-----------+
|    Tamara Chand|     8443.12|          4|
|   Adrian Barton|      5800.2|          9|
|    Hunter Lopez|     5185.18|          5|
|    Sanjit Engle|     2916.79|          9|
|    Karen Danels|     2592.29|          5|
|Tom Boeckenhauer|      2283.8|          7|
|       Jane Waco|     2115.66|          6|
|             NaN|      2108.6|         52|
|    Fred Hopkins|     1772.47|          7|
|       Pete Kriz|     1666.61|         11|
|     John Murray|      1559.6|          5|
|  Alan Dominguez|     1551.73|          7|
|Corinna Mitchell|     1510.51|          6|
|   Yana Sorensen|     1450.42|          7|
|  Lena Creighton|     1430.01|          9|
+----------------+------------+-----------+
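
The `NaN` row in the table above (52 orders) suggests that orders without a usable customer name are grouped under a single placeholder label, which then ranks among real customers. A minimal sketch of dropping such rows before ranking follows. It uses plain Python tuples copied from the output instead of the Spark DataFrame. The helper name `is_missing_name`, and the assumption that missing names arrive either as the string `"NaN"` or as a float NaN, are illustrative only.

```python
import math

# A few (customer_name, total_profit, order_count) rows copied from the output above.
rows = [
    ("Tamara Chand", 8443.12, 4),
    ("Jane Waco", 2115.66, 6),
    ("NaN", 2108.60, 52),
    ("Fred Hopkins", 1772.47, 7),
]


def is_missing_name(name):
    """Treat None, float NaN, blank strings and the literal 'NaN' as missing (assumption)."""
    if name is None:
        return True
    if isinstance(name, float):
        return math.isnan(name)
    stripped = name.strip()
    return stripped == "" or stripped.lower() == "nan"


# Keep only rows that carry a real customer name.
named_rows = [row for row in rows if not is_missing_name(row[0])]

for name, total_profit, order_count in named_rows:
    print(f"{name:<15} {total_profit:>10,.2f} {order_count:>4}")
```

In the Spark pipeline the equivalent filter would probably fit best in the customer cleaning step, so that every aggregate benefits from it.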


## 8. Profit by Customer + Year

In [10]:
# Display profit by customer and year
print("\n--- Profit by Customer + Year (Sample) ---")
profit_by_customer_year.show(15)



--- Profit by Customer + Year (Sample) ---
+--------------------+----+------------+-----------+
|       Customer Name|Year|Total_Profit|Order_Count|
+--------------------+----+------------+-----------+
|       Aaron Bergman|2014|        5.48|          1|
|       Aaron Hawkins|2014|       51.53|          4|
|      Aaron Smayling|2014|       32.23|          1|
|          Ad am Hart|2014|         2.4|          1|
|  Adam Shillingsburg|2014|        9.37|          2|
|       Adrian Barton|2014|       497.0|          1|
|         Aimee Bixby|2014|       16.75|          1|
|      Alan Dominguez|2014|        1.98|          1|
|          Alan Hwang|2014|       71.99|          1|
|   Alan Schoenberger|2014|       11.04|          1|
|        Alan Shonely|2014|       51.56|          2|
|Alejandro Ballentine|2014|        6.49|          1|
|     Alejandro Grove|2014|       17.99|          2|
| Aleksandra Gannaway|2014|        5.71|          1|
|          Alex Avila|2014|       21.09|          2|
+--------------------+----+------------+-----------+

## 9. Analyze Customer Trends Over Time

In [11]:
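# Get top 5 customers and analyze trends.
# head(5) relies on profit_by_customer already being sorted by Total_Profit
# in descending order, as the output in section 7 shows.
pdf_customer = profit_by_customer.toPandas()
top_5_customers = pdf_customer.head(5)['Customer Name'].tolist()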

print("Top 5 Customers and Their Profit Trends Over Years:")
print("=" * 80)

pdf_cust_year = profit_by_customer_year.toPandas()

for customer in top_5_customers:
    customer_data = pdf_cust_year[pdf_cust_year['Customer Name'] == customer].sort_values('Year')
    print(f"\n{customer}:")
    print(customer_data[['Year', 'Total_Profit', 'Order_Count']].to_string(index=False))
    
    total_customer_profit = customer_data['Total_Profit'].sum()
    print(f"  Total Profit (All Years): ${total_customer_profit:,.2f}")

Top 5 Customers and Their Profit Trends Over Years:

Tamara Chand:
 Year  Total_Profit  Order_Count
 2014         11.72            1
 2015         28.86            1
 2016       8402.54            2
  Total Profit (All Years): $8,443.12

Adrian Barton:
 Year  Total_Profit  Order_Count
 2014        497.00            1
 2015         33.59            1
 2016       4952.27            2
 2017        317.34            5
  Total Profit (All Years): $5,800.20

Hunter Lopez:
 Year  Total_Profit  Order_Count
 2014         10.78            1
 2016        128.54            2
 2017       5045.86            2
  Total Profit (All Years): $5,185.18

Sanjit Engle:
 Year  Total_Profit  Order_Count
 2014         12.05            3
 2015         34.91            1
 2016       2806.85            2
 2017         62.98            3
  Total Profit (All Years): $2,916.79

Karen Danels:
 Year  Total_Profit  Order_Count
 2014         16.70            1
 2015        128.97            1
 2016       2446.62        
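
The per-year breakdowns above show that most of each top customer's profit comes from a single year. The sketch below quantifies that with the figures printed for the first four customers (the fifth is cut off in the output). The function name `peak_year_share` is hypothetical and the calculation is plain Python, not part of the project modules.

```python
# Yearly profit per customer, copied from the output above.
yearly_profit = {
    "Tamara Chand": {2014: 11.72, 2015: 28.86, 2016: 8402.54},
    "Adrian Barton": {2014: 497.00, 2015: 33.59, 2016: 4952.27, 2017: 317.34},
    "Hunter Lopez": {2014: 10.78, 2016: 128.54, 2017: 5045.86},
    "Sanjit Engle": {2014: 12.05, 2015: 34.91, 2016: 2806.85, 2017: 62.98},
}


def peak_year_share(profit_by_year):
    """Return (peak_year, share_of_total) for one customer's yearly profit."""
    total = sum(profit_by_year.values())
    peak_year = max(profit_by_year, key=profit_by_year.get)
    return peak_year, profit_by_year[peak_year] / total


for customer, by_year in yearly_profit.items():
    year, share = peak_year_share(by_year)
    print(f"{customer}: {share:.1%} of profit earned in {year}")
```

A share close to 100% points to one or two unusually profitable orders in that year, not to a steady upward trend.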

## 10. Validate All Aggregates

In [12]:
# Validate all aggregates
print("\n=== Validation Results ===")

validate_profit_aggregates(profit_by_year, "Profit by Year")
validate_profit_aggregates(profit_by_year_category, "Profit by Year + Category")
validate_profit_aggregates(profit_by_customer, "Profit by Customer")
validate_profit_aggregates(profit_by_customer_year, "Profit by Customer + Year")

print("\nAll aggregates validated!")


=== Validation Results ===

--- Validating Profit by Year ---
Profit by Year validation passed (4 rows)

--- Validating Profit by Year + Category ---
Profit by Year + Category validation passed (12 rows)

--- Validating Profit by Customer ---
Profit by Customer validation passed (785 rows)

--- Validating Profit by Customer + Year ---
Profit by Customer + Year validation passed (2321 rows)

All aggregates validated!


## 11. Summary Statistics

In [14]:
# Calculate summary statistics
print("\n=== Summary Statistics ===")

pdf_year = profit_by_year.toPandas()
total_profit = pdf_year['Total_Profit'].sum()
total_orders = pdf_year['Order_Count'].sum()
avg_profit_per_year = pdf_year['Total_Profit'].mean()

print("\nProfit Metrics:")
print(f"  Total Profit: ${total_profit:,.2f}")
print(f"  Total Orders: {int(total_orders):,}")
print(f"  Average Profit per Year: ${avg_profit_per_year:,.2f}")

pdf_customer = profit_by_customer.toPandas()
print("\nCustomer Metrics:")
print(f"  Total Customers: {len(pdf_customer)}")
print(f"  Average Customer Profit: ${pdf_customer['Total_Profit'].mean():,.2f}")
print(f"  Max Customer Profit: ${pdf_customer['Total_Profit'].max():,.2f}")
print(f"  Min Customer Profit: ${pdf_customer['Total_Profit'].min():,.2f}")

pdf_year_cat = profit_by_year_category.toPandas()
print("\nCategory Metrics:")
category_summary = pdf_year_cat.groupby('Category')['Total_Profit'].sum().sort_values(ascending=False)
for category, profit in category_summary.items():
    print(f"  {category}: ${profit:,.2f}")


=== Summary Statistics ===

Profit Metrics:
  Total Profit: $219,353.44
  Total Orders: 4,327
  Average Profit per Year: $54,838.36

Customer Metrics:
  Total Customers: 785
  Average Customer Profit: $279.43
  Max Customer Profit: $8,443.12
  Min Customer Profit: $0.46

Category Metrics:
  Technology: $93,652.63
  Office Supplies: $87,721.81
  Furniture: $37,979.00
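
Because every aggregate is built from the same enriched orders table, the totals reported above should reconcile with each other. The sketch below re-checks that arithmetic with the figures printed in sections 5 and 11. It is a plain-Python sanity check that assumes the printed values are rounded to two decimals. It is not part of `validate_profit_aggregates`.

```python
import math

# Figures copied from the notebook output.
profit_by_year = {2014: 39104.90, 2015: 46621.20, 2016: 64079.74, 2017: 69547.60}
orders_by_year = {2014: 830, 2015: 902, 2016: 1147, 2017: 1448}
profit_by_category = {
    "Technology": 93652.63,
    "Office Supplies": 87721.81,
    "Furniture": 37979.00,
}
reported_total_profit = 219353.44
reported_total_orders = 4327
reported_avg_per_year = 54838.36

TOLERANCE = 0.01  # allow for rounding to cents in the printed output

year_total = sum(profit_by_year.values())
category_total = sum(profit_by_category.values())

# Yearly and category totals should both equal the overall profit.
assert math.isclose(year_total, reported_total_profit, abs_tol=TOLERANCE)
assert math.isclose(category_total, reported_total_profit, abs_tol=TOLERANCE)

# Order counts per year should add up to the overall order count.
assert sum(orders_by_year.values()) == reported_total_orders

# The yearly average is the total divided by the number of years.
assert math.isclose(
    year_total / len(profit_by_year), reported_avg_per_year, abs_tol=TOLERANCE
)

print("Totals reconcile across year and category aggregates.")
```

A failed assertion here would indicate that one aggregate dropped or double-counted rows relative to another.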


## 12. Cleanup

In [15]:
# Stop Spark session
spark.stop()
print("Spark session stopped!")

Spark session stopped!
