## Task 4: Create Profit Summary Table

This notebook builds the **Profit Summary Table**, which aggregates profit by four business dimensions: Year, Product Category, Product Sub-Category, and Customer.
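
To make the grouping concrete, here is a minimal plain-Python sketch of an aggregation over those four dimensions. It is not the implementation of `create_profit_summary_table`, which is not shown in this notebook. The sample rows, the source column name `Profit`, the helper `summarize_profit`, and the rounding to two decimals are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows: the column names mirror the summary schema,
# but the values are invented for illustration only.
orders = [
    {"Year": 2014, "Category": "Furniture", "Sub-Category": "Bookcases",
     "Customer Name": "A", "Profit": 3.5},
    {"Year": 2014, "Category": "Furniture", "Sub-Category": "Bookcases",
     "Customer Name": "A", "Profit": 1.25},
    {"Year": 2015, "Category": "Technology", "Sub-Category": "Phones",
     "Customer Name": "B", "Profit": 10.0},
]

GROUP_KEYS = ("Year", "Category", "Sub-Category", "Customer Name")


def summarize_profit(rows):
    """Sum profit and count orders for each combination of the group keys."""
    totals = defaultdict(lambda: {"Total_Profit": 0.0, "Order_Count": 0})
    for row in rows:
        key = tuple(row[k] for k in GROUP_KEYS)
        totals[key]["Total_Profit"] += row["Profit"]
        totals[key]["Order_Count"] += 1
    # Rounding to two decimals is an assumption about the display format.
    return {
        key: {"Total_Profit": round(v["Total_Profit"], 2),
              "Order_Count": v["Order_Count"]}
        for key, v in totals.items()
    }


summary = summarize_profit(orders)
for key, values in summary.items():
    print(key, values)
```

In PySpark the same idea would typically be expressed with `groupBy(...)` on the four columns followed by `agg(...)`, producing one row per unique combination.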

### 0. Environment Setup and Data Loading

In [21]:
import os
import sys

# Add project root to path for module imports
PROJECT_ROOT = os.path.abspath(os.path.join(os.getcwd(), '..', '..'))
if PROJECT_ROOT not in sys.path:
    sys.path.insert(0, PROJECT_ROOT)

from src.spark_session import get_spark_session
from src.load_source_data import load_customer_data, load_orders_data, load_products_data
from src.load_enriched_orders import create_enriched_orders_table

# Initialize Spark session
spark = get_spark_session("ProfitSummaryNotebook")

# Load Raw Data
customers_df = load_customer_data(spark, os.path.join(PROJECT_ROOT, "data", "Customer.xlsx"))
orders_df = load_orders_data(spark, os.path.join(PROJECT_ROOT, "data", "Orders.json"))
products_df = load_products_data(spark, os.path.join(PROJECT_ROOT, "data", "Products.csv"))

print("Data loaded successfully.")


Checking file at: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Customer.xlsx
File found. Loading Excel data using Spark...
Customer data loaded successfully

Checking file at: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Orders.json
File found. Loading JSON data using Spark...
Orders data loaded successfully

Checking file at: /Users/kushalsenlaskar/Documents/E-commerce Sales Data/data/Products.csv
File found. Loading CSV data using Spark...
Products data loaded successfully
Data loaded successfully.


### 1. Create Enriched Orders Table

In [22]:
# Create Enriched Orders Table as foundation
print("Creating Enriched Orders Table...")
enriched_orders_df = create_enriched_orders_table(orders_df, customers_df, products_df)

print(f"Enriched orders table created with {enriched_orders_df.count()} rows")
enriched_orders_df.show(3)

Creating Enriched Orders Table...

--- Cleaning Orders Data for Enrichment ---
Removing records with negative Profit...
   Records with negative Profit: 1870
Removing records with NULL Order ID...
   Records with NULL Order ID: 0
   Removing duplicate Order IDs...
   Records with duplicate Order ID removed: 3697
Validating date formats...
   Records with invalid date format: 0
Removing records with NULL Customer ID...
   Records with NULL Customer ID: 0
Removing records with NULL Product ID...
   Records with NULL Product ID: 0
Orders data

### 2. Create Profit Summary Table

In [23]:
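# Reload the module so that edits to src/load_profit_summary.py take effect
# without restarting the kernel
import importlib
import src.load_profit_summary
importlib.reload(src.load_profit_summary)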

from src.load_profit_summary import create_profit_summary_table

# Create Profit Summary Table
print("Creating Profit Summary Table by Year, Category, Sub-Category, and Customer...")
profit_summary_df = create_profit_summary_table(enriched_orders_df)

print(f"\nProfit summary table created with {profit_summary_df.count()} groups")
profit_summary_df.printSchema()
profit_summary_df.show(10, truncate=False)

Creating Profit Summary Table by Year, Category, Sub-Category, and Customer...

Profit summary table created with 4020 groups
root
 |-- Year: integer (nullable = true)
 |-- Category: string (nullable = true)
 |-- Sub-Category: string (nullable = true)
 |-- Customer Name: string (nullable = true)
 |-- Total_Profit: double (nullable = true)
 |-- Order_Count: long (nullable = false)


+----+---------+------------+-------------------+------------+-----------+
|Year|Category |Sub-Category|Customer Name      |Total_Profit|Order_Count|
+----+---------+------------+-------------------+------------+-----------+
|2014|Furniture|Bookcases   |Brian Dahlen       |3.93        |1          |
|2014|Furni

### 3. Data Quality Validation

In [24]:
from src.load_profit_summary import validate_profit_summary_table

print("Running data quality checks on profit summary data...")
validated_profit_summary_df = validate_profit_summary_table(profit_summary_df, enriched_orders_df)
print("\nProfit summary data validation complete.")

Running data quality checks on profit summary data...

--- Validating Profit Summary Table ---

Info: All 4020 records are unique combinations.

Info: Profit summary has 4020 unique groups from 4327 orders.

Data quality checks for profit summary table passed successfully.

Profit summary data validation complete.

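
The log messages above suggest two checks: every group key combination is unique, and the number of groups does not exceed the number of source orders. The body of `validate_profit_summary_table` is not shown here, so the following is only a plain-Python sketch of checks of that kind. The helper `check_summary` and the sample rows are hypothetical.

```python
GROUP_KEYS = ("Year", "Category", "Sub-Category", "Customer Name")


def check_summary(summary_rows, source_order_count, key_fields=GROUP_KEYS):
    """Return simple quality metrics for a list of summary-row dicts."""
    keys = [tuple(row[field] for field in key_fields) for row in summary_rows]
    return {
        "groups": len(keys),
        # A correct summary has exactly one row per key combination.
        "duplicate_groups": len(keys) - len(set(keys)),
        # Every group holds at least one order, so groups cannot outnumber orders.
        "groups_within_order_count": len(keys) <= source_order_count,
    }


# Invented rows for illustration only.
sample_summary = [
    {"Year": 2014, "Category": "Furniture", "Sub-Category": "Bookcases",
     "Customer Name": "A"},
    {"Year": 2014, "Category": "Furniture", "Sub-Category": "Chairs",
     "Customer Name": "A"},
]

report = check_summary(sample_summary, source_order_count=3)
print(report)
```

Applied to the figures logged above, 4020 groups from 4327 orders would satisfy the second check, since each group aggregates one or more orders.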


### 4. Display Summary Statistics

In [26]:
# Alias the functions that share names with Python builtins (sum, max, min, round)
from pyspark.sql.functions import (
    avg,
    col,
    count,
    max as pyspark_max,
    min as pyspark_min,
    round as pyspark_round,
    sum as pyspark_sum,
)

print("\n=== Profit Summary Statistics ===")

# Overall statistics
print("\nOverall Profit Summary Statistics:")
stats = profit_summary_df.agg(
    pyspark_round(pyspark_sum("Total_Profit"), 2).alias("Total_Profit_All_Groups"),
    pyspark_round(avg("Total_Profit"), 2).alias("Average_Profit_Per_Group"),
    pyspark_round(pyspark_max("Total_Profit"), 2).alias("Max_Profit_Group"),
    pyspark_round(pyspark_min("Total_Profit"), 2).alias("Min_Profit_Group"),
    count("*").alias("Total_Groups")
).collect()[0]

print(f"  Total Profit (All Groups): {stats['Total_Profit_All_Groups']}")
print(f"  Average Profit Per Group: {stats['Average_Profit_Per_Group']}")
print(f"  Max Profit (Single Group): {stats['Max_Profit_Group']}")
print(f"  Min Profit (Single Group): {stats['Min_Profit_Group']}")
print(f"  Total Number of Groups: {stats['Total_Groups']}")

# Breakdown by Year
print("\n\nProfit Breakdown by Year:")
profit_by_year = profit_summary_df.groupBy("Year").agg(
    pyspark_round(pyspark_sum("Total_Profit"), 2).alias("Yearly_Profit"),
    count("*").alias("Groups_Per_Year")
).orderBy("Year")

profit_by_year.show()

# Breakdown by Category
print("\nProfit Breakdown by Category:")
profit_by_category = profit_summary_df.groupBy("Category").agg(
    pyspark_round(pyspark_sum("Total_Profit"), 2).alias("Category_Profit"),
    count("*").alias("Groups_Per_Category")
).orderBy("Category")

profit_by_category.show()

# Top 5 Most Profitable Groups
print("\nTop 5 Most Profitable Groups:")
profit_summary_df.orderBy(col("Total_Profit").desc()).show(5, truncate=False)

# Bottom 5 Least Profitable Groups
print("\nBottom 5 Least Profitable Groups:")
profit_summary_df.orderBy(col("Total_Profit").asc()).show(5, truncate=False)


=== Profit Summary Statistics ===

Overall Profit Summary Statistics:
  Total Profit (All Groups): 219353.44
  Average Profit Per Group: 54.57
  Max Profit (Single Group): 8399.98
  Min Profit (Single Group): 0.0
  Total Number of Groups: 4020


Profit Breakdown by Year:
+----+-------------+---------------+
|Year|Yearly_Profit|Groups_Per_Year|
+----+-------------+---------------+
|2014|      39104.9|            792|
|2015|      46621.2|            855|
|2016|     64079.74|           1052|
|2017|      69547.6|           1321|
+----+-------------+---------------+


Profit Breakdown by Category:
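
As a quick sanity check, the yearly figures printed above can be reconciled with the overall statistics. The sketch below simply re-enters the printed values by hand in plain Python (it does not query Spark) and confirms that the yearly profits and group counts add up to the reported totals and average.

```python
# Values copied from the "Profit Breakdown by Year" output: (profit, groups)
yearly = {
    2014: (39104.9, 792),
    2015: (46621.2, 855),
    2016: (64079.74, 1052),
    2017: (69547.6, 1321),
}

total_profit = round(sum(profit for profit, _ in yearly.values()), 2)
total_groups = sum(groups for _, groups in yearly.values())
average_profit = round(total_profit / total_groups, 2)

print(total_profit)    # compare with "Total Profit (All Groups)"
print(total_groups)    # compare with "Total Number of Groups"
print(average_profit)  # compare with "Average Profit Per Group"
```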

### 5. Create Profit Summary View

In [27]:
# Register the validated summary as a temporary view for SQL queries
validated_profit_summary_df.createOrReplaceTempView("profit_summary_view")

print("Temporary view 'profit_summary_view' created.")

Temporary view 'profit_summary_view' created.


### 6. SQL Queries on Profit Summary View

In [28]:
print("\n=== SQL Queries on Profit Summary ===")

# Query 1: Total profit by year
print("\nTotal Profit by Year:")
spark.sql("""
    SELECT Year, SUM(Total_Profit) as Total_Profit, COUNT(*) as Groups
    FROM profit_summary_view
    GROUP BY Year
    ORDER BY Year
""").show()

# Query 2: Customer performance
print("\nTop 10 Customers by Profit:")
spark.sql("""
    SELECT `Customer Name`, SUM(Total_Profit) as Total_Customer_Profit, COUNT(*) as Order_Groups
    FROM profit_summary_view
    GROUP BY `Customer Name`
    ORDER BY Total_Customer_Profit DESC
    LIMIT 10
""").show()

# Query 3: Category performance
print("\nCategory & Sub-Category Performance:")
spark.sql("""
    SELECT Category, `Sub-Category`, SUM(Total_Profit) as Total_Profit, COUNT(*) as Groups
    FROM profit_summary_view
    GROUP BY Category, `Sub-Category`
    ORDER BY Category, Total_Profit DESC
""").show()


=== SQL Queries on Profit Summary ===

Total Profit by Year:
+----+------------------+------+
|Year|      Total_Profit|Groups|
+----+------------------+------+
|2014|39104.899999999994|   792|
|2015| 46621.19999999999|   855|
|2016| 64079.73999999996|  1052|
|2017| 69547.59999999993|  1321|
+----+------------------+------+


Top 10 Customers by Profit:
+----------------+---------------------+------------+
|   Customer Name|Total_Customer_Profit|Order_Groups|
+----------------+---------------------+------------+
|    Tamara Chand|    8443.119999999999|           4|
|   Adrian Barton|    5800.199999999999|           7|
|    Hunter Lopez|    5185.179999999999|           5|
|    Sanjit Engle|        
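
The SQL totals above show values such as `39104.899999999994`, whereas the statistics in section 4 show `39104.9` for the same year. The difference is most likely presentation only: `Total_Profit` is a `double`, and summing binary floating-point numbers accumulates tiny representation errors that section 4 hides by rounding. The plain-Python sketch below illustrates the effect with invented values. Wrapping the SQL aggregate as `ROUND(SUM(Total_Profit), 2)` would be one way to get the same tidy display in the queries.

```python
from decimal import Decimal

# Ten copies of 0.1 should sum to 1.0, but 0.1 has no exact binary representation.
values = [0.1] * 10

naive_total = sum(values)               # accumulates binary rounding error
rounded_total = round(naive_total, 2)   # rounding for display hides the error
exact_total = sum(Decimal("0.1") for _ in values)  # decimal arithmetic is exact here

print(naive_total)
print(rounded_total)
print(exact_total)
```

Rounding at display time is usually sufficient for reporting. Exact decimal types matter more when totals feed further financial calculations.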

### 7. Stop Spark Session

In [29]:
spark.stop()
print("Spark session stopped.")

Spark session stopped.
