# Module 3: PySpark Aggregations - Retail Sales Analytics
**Scenario:** You work at a services company (e.g., TCS/Infosys) on a project for a retail client (e.g., Walmart, Target, Tesco).

**Objective:** Compute Key Performance Indicators (KPIs) like Total Revenue, Top Selling Products, and Store Performance.

**Thinking Like a Data Engineer:**
In retail, raw transaction data (POS data) is massive (billions of rows).
*   **Managers don't want raw data.** They want *Aggregates* (Sums, Averages, Counts).
*   **Your Job:** Take the 100GB of raw sales data -> Group By Store/Product -> Calculate Metrics -> Save Small Report.

---
## 1. Setup Environment

In [None]:
# Setup PySpark
try:
    import pyspark
    print("PySpark is already installed")
except ImportError:
    print("Installing PySpark...")
    !pip install pyspark findspark

from pyspark.sql import SparkSession
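# Note: this wildcard import shadows Python built-ins such as sum, round, min, and max
# with their PySpark column-function equivalents for the rest of the notebook.
from pyspark.sql.functions import *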
from pyspark.sql.types import *

spark = SparkSession.builder \
    .appName("Retail_Sales_Analytics") \
    .master("local[*]") \
    .getOrCreate()

print("Spark Session Ready")

## 2. Load Sales Data (Simulate POS System)
We will create a DataFrame representing daily sales across different stores.
*   `store_id`: Which physical store?
*   `product_name`: What item?
*   `category`: Electronics, Grocery, etc.
*   `quantity`: How many sold?
*   `unit_price`: Selling price per item.
*   `cost_price`: How much the store paid (needed for profit calculation).

In [None]:
# --- Generate Mock Sales Data ---
sales_data = [
    ("Store_NYC_1", "TV_55_Inch", "Electronics", 5, 500.0, 300.0),
    ("Store_NYC_1", "Laptop_Pro", "Electronics", 2, 1200.0, 900.0),
    ("Store_NYC_1", "Milk_1Gal", "Grocery", 100, 4.5, 3.0),
    ("Store_LA_1", "TV_55_Inch", "Electronics", 10, 500.0, 300.0),
    ("Store_LA_1", "Laptop_Pro", "Electronics", 5, 1200.0, 900.0),
    ("Store_LA_1", "Milk_1Gal", "Grocery", 150, 4.5, 3.0),
    ("Store_CHI_1", "TV_55_Inch", "Electronics", 1, 500.0, 300.0), # Low sales
    ("Store_CHI_1", "Milk_1Gal", "Grocery", 20, 6.0, 3.0)  # Higher price
]

schema = ["store_id", "product_name", "category", "quantity", "unit_price", "cost_price"]
df_sales = spark.createDataFrame(sales_data, schema=schema)

print("--- Raw Daily Sales Data ---")
df_sales.show()

## 3. Basic Calculations: Revenue & Profit
Raw data often gives you Quantity and Unit Price.
We need to calculate:
1.  **Total Revenue per Row** = `quantity` * `unit_price`
2.  **Total Cost per Row** = `quantity` * `cost_price`
3.  **Profit per Row** = Revenue - Cost

**PySpark Tool:** `withColumn("new_col_name", expression)`

In [None]:
# 1. Calculate Total Revenue
df_calculated = df_sales.withColumn("total_revenue", col("quantity") * col("unit_price"))

# 2. Calculate Total Cost
df_calculated = df_calculated.withColumn("total_cost", col("quantity") * col("cost_price"))

# 3. Calculate Profit
df_calculated = df_calculated.withColumn("profit", col("total_revenue") - col("total_cost"))

print("--- Sales with Revenue & Profit ---")
df_calculated.show()
# Can you see which product makes the most money? Not yet. We need to Group By.
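To sanity-check the three derived columns, the same arithmetic can be done by hand for a single row. This is a plain-Python sketch using the first mock row (`Store_NYC_1`, `TV_55_Inch`); it does not touch Spark.

```python
# Values copied from the first row of the mock sales data.
quantity, unit_price, cost_price = 5, 500.0, 300.0

total_revenue = quantity * unit_price    # 5 * 500.0
total_cost = quantity * cost_price       # 5 * 300.0
profit = total_revenue - total_cost      # revenue minus cost

print(total_revenue, total_cost, profit)  # 2500.0 1500.0 1000.0
```

The first row of `df_calculated.show()` should carry the same three values.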

## 4. Aggregation: Where the Business Value Is
CEOs don't look at single transactions. They ask:
1.  **Which Store made the most money?** (Group By `store_id`)
2.  **Which Category has the highest sales?** (Group By `category`)

**PySpark Tool:** `groupBy("col").agg(sum("col"), avg("col"))`
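To make clear what a group-by with `sum` actually computes, here is a plain-Python equivalent for the per-store totals. It is only an illustration on the eight mock rows; the variable names are hypothetical, and Spark performs the same bucketing in parallel across partitions.

```python
from collections import defaultdict

# (store_id, quantity, unit_price, cost_price) copied from the mock sales data.
rows = [
    ("Store_NYC_1", 5, 500.0, 300.0),
    ("Store_NYC_1", 2, 1200.0, 900.0),
    ("Store_NYC_1", 100, 4.5, 3.0),
    ("Store_LA_1", 10, 500.0, 300.0),
    ("Store_LA_1", 5, 1200.0, 900.0),
    ("Store_LA_1", 150, 4.5, 3.0),
    ("Store_CHI_1", 1, 500.0, 300.0),
    ("Store_CHI_1", 20, 6.0, 3.0),
]

# One running total per group key, which is what groupBy + sum boils down to.
revenue_by_store = defaultdict(float)
profit_by_store = defaultdict(float)
for store_id, quantity, unit_price, cost_price in rows:
    revenue_by_store[store_id] += quantity * unit_price
    profit_by_store[store_id] += quantity * (unit_price - cost_price)

# Highest revenue first, like orderBy("revenue", ascending=False).
for store_id in sorted(revenue_by_store, key=revenue_by_store.get, reverse=True):
    print(store_id, revenue_by_store[store_id], profit_by_store[store_id])
# Store_LA_1 11675.0 3725.0
# Store_NYC_1 5350.0 1750.0
# Store_CHI_1 620.0 260.0
```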

In [None]:
# 1. Total Revenue Per Store
df_store_revenue = df_calculated.groupBy("store_id") \
    .agg(
        sum("total_revenue").alias("revenue"),
        sum("profit").alias("net_profit")
    ) \
    .orderBy("revenue", ascending=False) # Rank them highest to lowest

print("--- Top Performing Stores ---")
df_store_revenue.show()

# 2. Category Performance
df_category_performance = df_calculated.groupBy("category") \
    .agg(
        sum("quantity").alias("units_sold"),
        avg("unit_price").alias("avg_price")
    )

print("--- Category Insights ---")
df_category_performance.show()
# Notice: Grocery sells MANY units (270) at a low average price ($5.00). Electronics sells few (23) but expensive ($780).
# avg("unit_price") is a simple average over rows, not weighted by quantity sold.
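`avg("unit_price")` gives every row one vote regardless of how many units that row sold. The plain-Python sketch below contrasts that with a quantity-weighted average for the three Grocery rows. It avoids `sum`, `round`, and `len` because the notebook's wildcard import may have replaced those names with PySpark functions.

```python
# (quantity, unit_price) for the three Grocery rows in the mock data.
grocery_rows = [(100, 4.5), (150, 4.5), (20, 6.0)]

row_count = 0
price_total = 0.0
units_total = 0
revenue_total = 0.0
for quantity, unit_price in grocery_rows:
    row_count += 1
    price_total += unit_price
    units_total += quantity
    revenue_total += quantity * unit_price

unweighted_avg = price_total / row_count    # what avg("unit_price") returns
weighted_avg = revenue_total / units_total  # average price actually paid per unit

print(f"{unweighted_avg:.2f} {weighted_avg:.2f}")  # 5.00 4.61
```

In PySpark the weighted figure could be expressed inside `.agg(...)` as `sum("total_revenue") / sum("quantity")`, since both columns already exist on `df_calculated`.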

## 5. Identifying "Loss Leaders" (Or Low Margin Items)
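Sometimes high sales don't mean high profit.
Let's calculate the **Margin Percentage**: `(Profit / Revenue) * 100`.

*   If Margin < 10%, we might be discounting too much. The code below filters at 30% instead: with this mock data a 10% cutoff would flag nothing, while 30% flags `Laptop_Pro` (25%).
*   This is called "Margin Analysis" - very common in retail interviews.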

In [None]:
# 1. Group By Product and Sum revenue/profit
df_product_margins = df_calculated.groupBy("product_name") \
    .agg(
        sum("total_revenue").alias("product_revenue"),
        sum("profit").alias("product_profit")
    )

# 2. Calculate Margin %
df_product_margins = df_product_margins.withColumn(
    "margin_percentage",
    round((col("product_profit") / col("product_revenue")) * 100, 2)
)

print("--- Product Profitability ---")
df_product_margins.orderBy("margin_percentage").show()

# 3. Filter Low Margin Products (< 30%)
df_low_margin = df_product_margins.filter(col("margin_percentage") < 30)

print("--- WARNING: Low Margin Products (Check Pricing Strategy) ---")
df_low_margin.show()
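The margin figures can be cross-checked by hand, and doing so shows how much the result depends on the cutoff. This is a plain-Python sketch; the per-product totals are worked out from the mock rows, and the helper name `products_below` is hypothetical.

```python
# (product_revenue, product_profit) totals derived from the mock sales data.
product_totals = {
    "TV_55_Inch": (8000.0, 3200.0),  # 16 units at 500.0, 200.0 profit each
    "Laptop_Pro": (8400.0, 2100.0),  # 7 units at 1200.0, 300.0 profit each
    "Milk_1Gal": (1245.0, 435.0),    # 270 units across three stores
}

# Margin % = profit / revenue * 100
margins = {
    name: profit / revenue * 100
    for name, (revenue, profit) in product_totals.items()
}

def products_below(threshold):
    """Return product names whose margin % is strictly below the threshold."""
    return [name for name, margin in margins.items() if margin < threshold]

for name, margin in margins.items():
    print(name, f"{margin:.2f}")
# TV_55_Inch 40.00
# Laptop_Pro 25.00
# Milk_1Gal 34.94

print(products_below(10))  # []
print(products_below(30))  # ['Laptop_Pro']
```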

## 6. Save The Report
Instead of Parquet, business managers often want **CSV** reports they can open in Excel.
We will save the **Store Revenue** report as a CSV.

*   `header=true`: So they see column names.
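*   `coalesce(1)`: Merges all partitions into 1 partition so Spark writes a single CSV file (Useful for small reports < 1GB, so you don't get 100 tiny files). Spark still creates a directory at the output path; the CSV itself is the `part-*.csv` file inside it.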

In [None]:
output_path = "store_sales_report_csv"

# Coalesce to 1 single CSV file (For easy emailing/Excel use)
df_store_revenue.coalesce(1).write.option("header", "true").mode("overwrite").csv(output_path)

print(f"Sales Report Saved to {output_path}")
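Spark writes a directory rather than a single named file, so the report you would email is the `part-*.csv` file inside `store_sales_report_csv`. A minimal sketch for locating it follows; the helper name `find_csv_parts` is hypothetical, and it assumes the output was written to the local filesystem.

```python
import glob
import os

def find_csv_parts(output_dir):
    """Return the part-*.csv files inside a Spark CSV output directory, sorted by name."""
    return sorted(glob.glob(os.path.join(output_dir, "part-*.csv")))

# After the write above has run, this should list exactly one file
# because of coalesce(1). It prints an empty list if the directory is missing.
print(find_csv_parts("store_sales_report_csv"))
```

From there the file can be copied or renamed with standard tools such as `shutil.copy` before sharing it.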