## Spark Setup
We create a SparkSession to process the batch CSV files and the micro-batch inventory stream.
The app name and master URL are read from "config.yaml".

In [99]:
from pathlib import Path

project_root = Path.cwd().parent

In [100]:
import yaml
from pyspark.sql import SparkSession

# Load config.yaml from the project root
with open(project_root / "config.yaml", "r") as f:
    cfg = yaml.safe_load(f)

# Initialize the Spark session
spark = (SparkSession.builder
    .appName(cfg["spark"]["app_name"])
    .master(cfg["spark"]["master"])
    .getOrCreate())
spark
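
The cell above assumes that "config.yaml" contains a `spark` mapping with `app_name` and `master` keys. As a minimal sketch, a small check like the one below could fail early with a readable message when a key is missing. The helper name `require_spark_settings` and the example values are hypothetical and not part of the original notebook.

```python
def require_spark_settings(cfg):
    """Return (app_name, master) from a parsed config, or raise a clear error."""
    spark_cfg = (cfg or {}).get("spark")
    if not isinstance(spark_cfg, dict):
        raise KeyError("config.yaml is missing a 'spark' mapping")
    # Collect every missing or empty key so the error names all of them at once.
    missing = [key for key in ("app_name", "master") if not spark_cfg.get(key)]
    if missing:
        raise KeyError(f"config.yaml is missing spark settings: {missing}")
    return spark_cfg["app_name"], spark_cfg["master"]


# Placeholder values with the shape the notebook expects (assumed, not the real config).
example_cfg = {"spark": {"app_name": "example-app", "master": "local[*]"}}
app_name, master = require_spark_settings(example_cfg)
```

Calling it on the dictionary returned by `yaml.safe_load` before building the session keeps configuration errors separate from Spark start-up errors.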

## Data Ingestion
We load historical sales and customer data as a batch, and we read inventory sensor events from a simulated mini-stream of file drops.

In [101]:
# Configure paths for global use
sales_path = project_root / "data" / "raw" / "sales_data.csv"
customers_path = project_root / "data" / "raw" / "customers_data.csv"
inventory_path = project_root / "data" / "raw" / "inventory_stream"

In [102]:
sales_df = spark.read.csv(str(sales_path), header=True, inferSchema=True)
customers_df = spark.read.csv(str(customers_path), header=True, inferSchema=True)

sales_df.show(5)
customers_df.show(5)


+------+--------+------+--------+-----------+--------+-------+
|txn_id|    date|sku_id|store_id|customer_id|quantity|  price|
+------+--------+------+--------+-----------+--------+-------+
|T13899|20250830|  P001|    X001|       C003|      23|2240.29|
|T87892|20250830|  P001|    X001|       C002|       6| 613.95|
|T35545|20250830|  P001|    X001|       C002|       6| 634.08|
|T96496|20250830|  P001|    X001|       C002|      19|1815.12|
|T72236|20250830|  P001|    X001|       C004|      14|1388.67|
+------+--------+------+--------+-----------+--------+-------+
only showing top 5 rows

+-----------+-------------+------------+-----------+
|customer_id|last_purchase|total_orders|total_spend|
+-----------+-------------+------------+-----------+
|       C001|     20250822|          63|       6174|
|       C002|     20250820|          27|       3240|
|       C003|     20250828|          53|       5777|
|       C004|     20250825|          43|       7740|
+-----------+-------------+------------+-----------+

We define the inventory schema explicitly because the inventory files arrive incrementally as a stream, so Spark cannot infer the schema from the data up front.

In [103]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define Schema
inventory_schema = StructType([
    StructField("timestamp", StringType(), True),
    StructField("store_id", StringType(), True),
    StructField("sku_id", StringType(), True),
    StructField("on_stock", IntegerType(), True),
])

# Read the JSON file drops as a stream, one file per micro-batch.
# JSON needs no header option, and the explicit schema replaces inference.
inventory_df = spark.readStream \
    .format("json") \
    .schema(inventory_schema) \
    .option("maxFilesPerTrigger", 1) \
    .load(str(inventory_path))

inventory_df.printSchema()


root
 |-- timestamp: string (nullable = true)
 |-- store_id: string (nullable = true)
 |-- sku_id: string (nullable = true)
 |-- on_stock: integer (nullable = true)
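
To exercise the stream locally, batch files have to be dropped into the inventory directory one at a time. The sketch below shows one way to do that in plain Python: it writes events as JSON Lines (one object per line, the layout Spark's JSON source reads by default) with the four fields from the schema above. The helper name `drop_inventory_batch`, the batch file name, the ISO-style timestamp format, and the sample values are all assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def drop_inventory_batch(target_dir, batch_name, events):
    """Write one batch of events as a JSON Lines file and return its path."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    final_path = target_dir / f"{batch_name}.json"
    # Write under a temporary name first, then rename, so a reader never
    # sees a half-written file under the final name.
    tmp_path = target_dir / f".{batch_name}.json.tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        for event in events:
            handle.write(json.dumps(event) + "\n")
    tmp_path.replace(final_path)
    return final_path


# Illustrative events only; the real sensor data and timestamp format may differ.
sample_events = [
    {"timestamp": "2025-08-30T09:00:00", "store_id": "X001", "sku_id": "P001", "on_stock": 120},
    {"timestamp": "2025-08-30T09:00:00", "store_id": "X002", "sku_id": "P002", "on_stock": 45},
]

# A temporary directory stands in for data/raw/inventory_stream in this demo.
demo_dir = Path(tempfile.mkdtemp()) / "inventory_stream"
written = drop_inventory_batch(demo_dir, "batch_0001", sample_events)
```

With `maxFilesPerTrigger` set to 1, each file dropped this way would be picked up as its own micro-batch.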



## Processing Layer - Curate the Data
### Daily item-store sales
It tells us how many units of each SKU were sold at each store per day, together with the sales value. This is the foundation for demand calculations.
### Moving Average Demand
It helps us smooth out day-to-day fluctuations and detect trends. We compute it on the batch sales data with a window function; on a streaming DataFrame, Spark supports only time-based windows.
### Stock-Out Risk Signal
It flags store and SKU combinations that are at risk of running out of stock.
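
The notebook does not define the risk signal at this point, so the following is only one possible definition, sketched in plain Python: divide the latest `on_stock` value by the moving average of daily demand to get "days of cover", then bucket the result. The function names, the labels, and the thresholds of 1 and 3 days are assumptions, not the author's method.

```python
def days_of_cover(on_stock, moving_avg_qty):
    """Days the current stock would last at the recent average daily demand."""
    if moving_avg_qty is None or moving_avg_qty <= 0:
        return None  # no recent demand, so cover is undefined
    return on_stock / moving_avg_qty


def stock_out_risk(on_stock, moving_avg_qty, high_below=1.0, medium_below=3.0):
    """Bucket days of cover into a coarse label (thresholds are assumed)."""
    cover = days_of_cover(on_stock, moving_avg_qty)
    if cover is None:
        return "UNKNOWN"
    if cover < high_below:
        return "HIGH"
    if cover < medium_below:
        return "MEDIUM"
    return "LOW"


# Hypothetical stock levels against an average demand of 224 units per day.
labels = [stock_out_risk(stock, 224.0) for stock in (100, 500, 1000)]
```

In Spark, the same rule could be applied after joining the latest inventory reading to the daily demand table on store, SKU, and date.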
### RFM Analysis On Customers
It scores each customer on three measures: Recency (how recently the customer last purchased), Frequency (how often the customer purchased), and Monetary (the customer's total spend).
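
As a rough illustration of the three measures, the sketch below derives them in plain Python from the columns of the customers table (`last_purchase` as a yyyyMMdd value, `total_orders`, `total_spend`), using the four rows shown in the earlier output. The reference date of 2025-08-30 and the helper name `rfm_row` are assumptions made for this example.

```python
from datetime import date, datetime


def rfm_row(last_purchase, total_orders, total_spend, as_of):
    """Return raw RFM measures for one customer."""
    last = datetime.strptime(str(last_purchase), "%Y%m%d").date()
    return {
        "recency_days": (as_of - last).days,  # smaller means more recent
        "frequency": total_orders,
        "monetary": total_spend,
    }


# Assumed reference date; a real run would likely use the latest sales date.
as_of = date(2025, 8, 30)

# (customer_id, last_purchase, total_orders, total_spend) from the output above.
customers = [
    ("C001", 20250822, 63, 6174),
    ("C002", 20250820, 27, 3240),
    ("C003", 20250828, 53, 5777),
    ("C004", 20250825, 43, 7740),
]

rfm = {cid: rfm_row(lp, orders, spend, as_of) for cid, lp, orders, spend in customers}
```

These are raw values only; turning them into scores (for example by ranking customers into quantiles) is a separate design choice.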


# Daily Item-Store Sales and Moving Average of Quantity

In [104]:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

# Aggregate sales at store, date and sku level
sales_df = sales_df.withColumn("date", f.to_date(f.col("date").cast("string"), "yyyyMMdd"))
daily_item_store = sales_df.groupby("date", "store_id", "sku_id") \
    .agg(
    f.sum("quantity").alias("total_qty"),
    f.round(f.sum("price"),2).alias("total_sales"),
    f.round(f.avg("price"), 2).alias("avg_price"),
    f.max("price").alias("max_price"),
    f.min("price").alias("min_price")
)

# Moving average of quantity (demand) over a trailing 7-row window:
# the current row plus the 6 rows before it, per store and SKU
window_spec = Window.partitionBy("store_id", "sku_id").orderBy("date").rowsBetween(-6, 0)
daily_item_store = daily_item_store.withColumn("moving_avg_qty", f.avg("total_qty").over(window_spec))

# Performance optimizations

# Repartition by store_id and sku_id so that rows for the same store and SKU
# land in the same partition. The repartition itself shuffles the data once,
# but later joins, window functions, or aggregations on these keys can then
# avoid shuffling again.
daily_item_store = daily_item_store.repartition("store_id", "sku_id")

# Cache the DataFrame so that repeated use does not recompute it from scratch.
# cache() is lazy: the data is stored when an action first computes it.
daily_item_store = daily_item_store.cache()

daily_item_store.show(5)

+----------+--------+------+---------+-----------+---------+---------+---------+------------------+
|      date|store_id|sku_id|total_qty|total_sales|avg_price|max_price|min_price|    moving_avg_qty|
+----------+--------+------+---------+-----------+---------+---------+---------+------------------+
|2025-08-30|    X002|  P002|      224|   44874.94|   4986.1|  9149.04|  1695.15|             224.0|
|2025-08-31|    X002|  P002|      185|   37471.49|  5353.07|  8159.79|  1334.84|             204.5|
|2025-09-01|    X002|  P002|      234|   46675.81|  4667.58|   8398.4|  2164.03|214.33333333333334|
|2025-09-02|    X002|  P002|      242|   48352.79|   6044.1|  9510.97|  1076.87|            221.25|
|2025-09-03|    X002|  P002|      327|   66061.94|  7340.22| 10158.55|  3053.65|             242.4|
+----------+--------+------+---------+-----------+---------+---------+---------+------------------+
only showing top 5 rows
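
To make the `rowsBetween(-6, 0)` frame concrete, here is a minimal plain-Python sketch of the same trailing average, applied to the five `total_qty` values shown in the output above. The helper name `trailing_mean` is hypothetical; it only mirrors the arithmetic and is not how Spark executes the window.

```python
def trailing_mean(values, window=7):
    """Mean of each value and up to window - 1 values before it."""
    result = []
    for i in range(len(values)):
        start = max(0, i - window + 1)  # fewer rows are available at the start
        chunk = values[start:i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


# total_qty for store X002, SKU P002, ordered by date (from the output above).
total_qty = [224, 185, 234, 242, 327]
moving_avg = trailing_mean(total_qty)
```

The results match the `moving_avg_qty` column: the first rows average over fewer than seven values because no earlier rows exist. Note that the frame counts rows, not calendar days, so a store and SKU with missing days would average over a span longer than one week.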



25/09/19 21:45:33 WARN CacheManager: Asked to cache already cached data.


# Stock-Out Risk Signal

In [106]:
# Extracting date from timestamp for joining
spark.stop()