## Spark Setup
We create a SparkSession to process batch CSVs and micro-batch inventory streams.
The app name and master setting are read from "config.yaml".
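
Because the session depends on two config keys, it can help to fail early with a clear message when one is missing. The sketch below is a plain-Python illustration, assuming "config.yaml" parses into a nested dict with a `spark` section holding `app_name` and `master` (the keys the notebook reads). The helper name `spark_settings` and the example values are hypothetical.

```python
def spark_settings(cfg: dict) -> tuple[str, str]:
    """Return (app_name, master) from a parsed config, failing early if a key is missing."""
    try:
        spark_cfg = cfg["spark"]
        return spark_cfg["app_name"], spark_cfg["master"]
    except (KeyError, TypeError) as exc:
        # KeyError: a key is absent; TypeError: the "spark" section is empty (None)
        raise ValueError(f"config.yaml is missing a required spark setting: {exc}") from exc


# Hypothetical values for illustration; the real ones live in config.yaml.
example_cfg = {"spark": {"app_name": "retail-demo", "master": "local[*]"}}
app_name, master = spark_settings(example_cfg)
print(app_name, master)
```

The returned pair could then be passed to `.appName(...)` and `.master(...)` instead of indexing `cfg` inline.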

In [52]:
import os
from pathlib import Path

project_root = Path.cwd().parent

In [53]:
import yaml
from pyspark.sql import SparkSession

# Load config.yaml
with open("../config.yaml", 'r') as f:
    cfg = yaml.safe_load(f)

# Initialize Spark Session
spark = (SparkSession.builder
    .appName(cfg["spark"]["app_name"])
    .master(cfg["spark"]["master"])
    .getOrCreate())
spark

## Data Ingestion
We batch-load historical sales and customer data, and read inventory sensor events from a simulated mini-stream of file drops.

In [54]:
# Paths used throughout the notebook
sales_path = project_root / "data" / "raw" / ""
customers_path = project_root / "data" / "raw" / ""
inventory_path = project_root / "data" / "raw" / ""

In [55]:
sales_df = spark.read.csv(str(sales_path), header=True, inferSchema=True)
customers_df = spark.read.csv(str(customers_path), header=True, inferSchema=True)

sales_df.show(5)
customers_df.show(5)


+------+--------+------+-----------+--------+-----+
|txn_id|    date|sku_id|customer_id|quantity|price|
+------+--------+------+-----------+--------+-----+
| T3650|20250830|  P001|       C002|      17| 1700|
| T4366|20250830|  P002|       C004|      10| 2000|
| T7417|20250830|  P003|       C002|      20| 6000|
| T1662|20250830|  P001|       C003|      32| 3200|
| T3343|20250830|  P002|       C004|      40| 8000|
+------+--------+------+-----------+--------+-----+
only showing top 5 rows

+------+--------+------+-----------+--------+-----+
|txn_id|    date|sku_id|customer_id|quantity|price|
+------+--------+------+-----------+--------+-----+
| T3650|20250830|  P001|       C002|      17| 1700|
| T4366|20250830|  P002|       C004|      10| 2000|
| T7417|20250830|  P003|       C002|      20| 6000|
| T1662|20250830|  P001|       C003|      32| 3200|
| T3343|20250830|  P002|       C004|      40| 8000|
+------+--------+------+-----------+--------+-----+
only showing top 5 rows



Streaming file sources in Spark require an explicit schema by default, so we define one for the inventory stream, whose batch files are dropped in incrementally.

In [56]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define Schema
inventory_schema = StructType([
    StructField("timestamp", StringType(), True),
    StructField("store_id", StringType(), True),
    StructField("sku_id", StringType(), True),
    StructField("on_stock", IntegerType(), True),
])

# Stream JSON files from the drop directory, at most one new file per micro-batch.
# The schema is supplied explicitly, so no header or schema-inference options are needed.
inventory_df = (spark.readStream
    .format("json")
    .schema(inventory_schema)
    .option("maxFilesPerTrigger", 1)
    .load(str(inventory_path)))

inventory_df.printSchema()


root
 |-- timestamp: string (nullable = true)
 |-- store_id: string (nullable = true)
 |-- sku_id: string (nullable = true)
 |-- on_stock: integer (nullable = true)
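
To make the schema above concrete, here is a small plain-Python sketch of one event that would fit it, with a strict pre-flight check on field names and types. Spark's JSON source reads one JSON object per line by default. The field names come from `inventory_schema`; the sample values, the timestamp format, and the helper `matches_schema` are assumptions for illustration only, and the check is stricter than Spark, which allows nulls in these nullable fields.

```python
import json

# Hypothetical event line; field names follow inventory_schema, values are made up.
line = '{"timestamp": "2025-08-30T09:00:00", "store_id": "S001", "sku_id": "P001", "on_stock": 120}'

# Python types corresponding to StringType / IntegerType in the schema.
expected_types = {"timestamp": str, "store_id": str, "sku_id": str, "on_stock": int}


def matches_schema(record: dict) -> bool:
    """True if the record has exactly the expected fields with the expected Python types."""
    return set(record) == set(expected_types) and all(
        isinstance(record[name], typ) for name, typ in expected_types.items()
    )


record = json.loads(line)
print(matches_schema(record))
```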



## Processing Layer - Curate the Data
### Daily item-store sales
It tells us how many units of each SKU were sold at each store per day. This is the foundation for demand calculations.
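
The aggregation itself is a grouped sum. The sketch below shows the logic in plain Python on the five rows from the `sales_df` preview. Those rows carry no store column, so this version groups by date and SKU only; a store key would be added to the grouping tuple where it is available.

```python
from collections import defaultdict

# Rows copied from the sales_df preview: (txn_id, date, sku_id, quantity)
sales_rows = [
    ("T3650", 20250830, "P001", 17),
    ("T4366", 20250830, "P002", 10),
    ("T7417", 20250830, "P003", 20),
    ("T1662", 20250830, "P001", 32),
    ("T3343", 20250830, "P002", 40),
]

# Sum quantity per (date, sku_id)
daily_units = defaultdict(int)
for _txn_id, date, sku_id, quantity in sales_rows:
    daily_units[(date, sku_id)] += quantity

for key in sorted(daily_units):
    print(key, daily_units[key])
```

In Spark, the same result would typically come from `groupBy("date", "sku_id")` followed by `agg(spark_sum("quantity"))`.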
### Moving Average Demand
It smooths out day-to-day fluctuations so that trends are easier to detect. For streaming data, it can be computed with time-based window aggregations.
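
As a minimal sketch of the idea, the plain-Python function below computes a trailing moving average over a list of daily unit counts. The function name, the 3-day window, and the sample values are assumptions for illustration; the notebook does not state which window length it uses.

```python
from collections import deque


def moving_average(values, window):
    """Trailing moving average: each output averages the current value
    and up to (window - 1) values before it."""
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # keeps only the last `window` values
    averages = []
    for value in values:
        recent.append(value)
        averages.append(sum(recent) / len(recent))
    return averages


daily_units = [10, 20, 30, 40, 50]  # hypothetical daily sales for one SKU
print(moving_average(daily_units, window=3))
# [10.0, 15.0, 20.0, 30.0, 40.0]
```

The first two outputs average fewer than three values because the window is not yet full.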
### Stock-Out Risk Signal
It flags SKUs whose stock on hand is low relative to recent demand and that are therefore at risk of running out.
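
One common way to express such a signal is days of cover: stock on hand divided by average daily demand. The sketch below is an assumption about how the signal could be defined, not the notebook's own rule; the function names and the 3-day threshold are hypothetical.

```python
def days_of_cover(on_stock: int, avg_daily_demand: float) -> float:
    """How many days the current stock lasts at the given average daily demand."""
    if avg_daily_demand <= 0:
        return float("inf")  # no demand, so the stock never runs out
    return on_stock / avg_daily_demand


def stock_out_risk(on_stock: int, avg_daily_demand: float, min_days: float = 3.0) -> bool:
    """Flag an item when its days of cover fall below min_days (hypothetical threshold)."""
    return days_of_cover(on_stock, avg_daily_demand) < min_days


print(stock_out_risk(on_stock=40, avg_daily_demand=20.0))   # 2 days of cover
print(stock_out_risk(on_stock=100, avg_daily_demand=20.0))  # 5 days of cover
```

Here `on_stock` would come from the inventory stream and `avg_daily_demand` from the moving average of daily sales.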
### RFM Analysis On Customers
It scores each customer on Recency (how recently they purchased), Frequency (how often they purchased), and Monetary value (their total spend).
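
The three measures can be sketched in plain Python on the rows from the `sales_df` preview. Two assumptions are made here: `price` is treated as the line total (the sample rows suggest this, since price divided by quantity is constant per SKU, but the source does not say so), and the reference date `as_of` is hypothetical.

```python
from datetime import date, datetime

# Rows from the sales_df preview: (customer_id, date as yyyymmdd, price)
sales_rows = [
    ("C002", 20250830, 1700),
    ("C004", 20250830, 2000),
    ("C002", 20250830, 6000),
    ("C003", 20250830, 3200),
    ("C004", 20250830, 8000),
]


def rfm(rows, as_of):
    """Per customer: days since last purchase, purchase count, total spend."""
    stats = {}  # customer_id -> (last_purchase_date, frequency, monetary)
    for customer_id, yyyymmdd, price in rows:
        purchased = datetime.strptime(str(yyyymmdd), "%Y%m%d").date()
        last, frequency, monetary = stats.get(customer_id, (purchased, 0, 0))
        stats[customer_id] = (max(last, purchased), frequency + 1, monetary + price)
    return {
        customer_id: {
            "recency_days": (as_of - last).days,
            "frequency": frequency,
            "monetary": monetary,
        }
        for customer_id, (last, frequency, monetary) in stats.items()
    }


rfm_table = rfm(sales_rows, as_of=date(2025, 9, 1))  # hypothetical reference date
for customer_id in sorted(rfm_table):
    print(customer_id, rfm_table[customer_id])
```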


In [58]:
from pyspark.sql.functions import to_date, col, sum as spark_sum

spark.stop()