Understand the architecture, core abstractions, and design philosophy behind Spark Structured Streaming, the declarative, fault-tolerant, and scalable solution for real-time data pipelines.

# Batch vs. Real Time

## Traditional Systems (Age of Batch Processing)
- Data at rest: files, tables, records. Queried in bulk.

## Modern Systems - Constantly Changing

- User clicks
- Sensor readings from IoT devices
- Credit card transactions
- Social Media Messages

# Data at Rest vs. Data in Motion

Data at rest is stored first and analyzed later; data in motion is reacted to as it arrives.

## What Is Streaming?

Streaming refers to the continuous and incremental processing of data as it arrives, rather than waiting to collect a full dataset first (as in batch processing).
In streaming, every event is processed, transformed, and acted upon as it arrives.
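
The difference between the two styles can be sketched in plain Python. This is only an illustration of incremental computation, not Spark code; the function names and the sample quantities are made up for the example.

```python
from typing import Iterable, Iterator


def batch_total(quantities: list[int]) -> int:
    """Batch style: wait for the full dataset, then compute once."""
    return sum(quantities)


def streaming_totals(quantities: Iterable[int]) -> Iterator[int]:
    """Streaming style: update the result as each event arrives."""
    running_total = 0
    for quantity in quantities:
        running_total += quantity
        yield running_total  # an up-to-date answer after every event


events = [5, 3, 1]  # made-up quantities
print(batch_total(events))             # 9
print(list(streaming_totals(events)))  # [5, 8, 9]
```

The streaming version never needs the whole list at once, which is what makes it usable on unbounded data.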

# Characteristics of Streaming

- Low Latency
- Unbounded Data
- Incremental Computation
- Fault Tolerance
- Time Awareness

# Why Spark?

1. Unified APIs
2. Declarative Processing
3. Fault Tolerance and Scalability

In [0]:
# Architecture of Structured Streaming

+------------+      +----------------+      +------------+
|  Data      |      |  Spark Engine  |      |   Sink     |
|  Source    | ---> |  (Transform)   | ---> | (Output)   |
+------------+      +----------------+      +------------+

1. Source: the entry point where data comes from.
2. Transformations: the operations applied to the data.
3. Sink: the destination where the stream's output is written.


Behind the scenes, the stream is processed as a series of micro-batches.
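
The source, transformation, and sink stages map onto the `readStream` / `writeStream` API roughly as follows. This is a minimal sketch rather than a definitive pipeline: the function name `start_sales_stream`, the `quantity > 0` filter, and the checkpoint path argument are hypothetical, and it assumes a runtime where `trigger(availableNow=True)` is available (Spark 3.3 or later).

```python
def start_sales_stream(spark, input_path, checkpoint_path, target_table):
    """Wire a file source through one transformation into a Delta table sink."""
    # Source: treat JSON files landing in the directory as an unbounded stream.
    # Streaming file sources need an explicit schema; a DDL string works.
    raw = (
        spark.readStream
        .format("json")
        .schema(
            "transaction_time TIMESTAMP, transaction_id INT, "
            "customer_id INT, product_id INT, quantity INT"
        )
        .option("multiLine", "true")  # assumes pretty-printed JSON objects
        .load(input_path)
    )

    # Transformation: the same DataFrame operations used in batch code.
    valid = raw.filter("quantity > 0")  # hypothetical rule, for illustration

    # Sink: append each micro-batch to the Delta table.
    return (
        valid.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # process what is available, then stop
        .toTable(target_table)
    )
```

In a notebook where `spark` is already defined, this could be called as `start_sales_stream(spark, "/Volumes/company/unit/sales_input_volume/", checkpoint_path, "company.unit.sales_landing")`, with `checkpoint_path` set to a storage location of your choosing.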

# Concept of Checkpointing
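
A checkpoint is how a streaming query remembers its progress: Structured Streaming records what it has already processed in a checkpoint location, so a restarted query resumes from there instead of starting over. The plain-Python sketch below imitates that idea with a small JSON file. It is a simplified illustration, not Spark's actual checkpoint format, and `run_once`, `checkpoint.json`, and the file names are hypothetical.

```python
import json
import os
import tempfile


def run_once(available_files, checkpoint_path):
    """Process only files not yet recorded in the checkpoint, then record them."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)

    new_files = [name for name in available_files if name not in done]
    # ... a real job would read and transform new_files here ...

    with open(checkpoint_path, "w") as f:
        json.dump(done + new_files, f)  # persist progress for the next run
    return new_files


checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
first = run_once(["sample2.json", "sample3.json"], checkpoint)
# Simulated restart: one more file has arrived since the first run.
second = run_once(["sample2.json", "sample3.json", "sample4.json"], checkpoint)
print(first)   # ['sample2.json', 'sample3.json']
print(second)  # ['sample4.json']
```

In PySpark the location is supplied through the `checkpointLocation` option on `writeStream`.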


# Concepts of Watermarks
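
A watermark tells the engine how long to wait for late events: it trails the largest event time seen so far by a fixed delay, and events older than the watermark may be dropped from stateful operations such as windowed aggregations. In PySpark it is declared with `DataFrame.withWatermark(eventTime, delayThreshold)`. The plain-Python sketch below imitates the rule. It is a simplification (Spark advances the watermark between micro-batches, not per event), and `split_on_time`, the 5-minute delay, and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def split_on_time(event_times, delay):
    """Return (accepted, dropped) events under a simple watermark rule."""
    max_seen = None
    accepted, dropped = [], []
    for event_time in event_times:  # events in arrival order
        watermark = max_seen - delay if max_seen is not None else None
        if watermark is not None and event_time < watermark:
            dropped.append(event_time)  # arrived too late
        else:
            accepted.append(event_time)
            max_seen = event_time if max_seen is None else max(max_seen, event_time)
    return accepted, dropped


def t(clock):
    """Build a timestamp on a made-up sample day."""
    return datetime.fromisoformat(f"2024-06-06T{clock}")


arrivals = [t("10:02:30"), t("10:08:00"), t("10:04:00"), t("10:01:00")]
accepted, dropped = split_on_time(arrivals, timedelta(minutes=5))
print([d.strftime("%H:%M") for d in dropped])  # ['10:01']
```

The 10:04 event arrives out of order but is still accepted because it is within 5 minutes of the latest event time (10:08); the 10:01 event falls behind the watermark (10:03) and is dropped.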

Step 1: Set the catalog and schema (SQL cell or %sql) and create the target Delta table (sales_landing)

In [0]:
%sql

USE CATALOG company;
USE SCHEMA unit;

In [0]:
%sql
CREATE OR REPLACE TABLE sales_landing (
  transaction_time TIMESTAMP,
  transaction_id INT,
  customer_id INT,
  product_id INT,
  quantity INT
) USING DELTA;


In [0]:
%sql

USE CATALOG company;
USE SCHEMA unit;

CREATE VOLUME IF NOT EXISTS sales_input_volume
COMMENT 'Volume for streaming sales input JSON files';



Read JSON or CSV File from Unity Catalog Volume (PySpark Cell)

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, TimestampType

# Define schema for the input file
sales_schema = StructType([
    StructField("transaction_time", TimestampType()),
    StructField("transaction_id", IntegerType()),
    StructField("customer_id", IntegerType()),
    StructField("product_id", IntegerType()),
    StructField("quantity", IntegerType())
])

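# Read data from the volume (assume JSON format).
# multiLine is set because each sample file in this notebook holds one
# pretty-printed JSON object spread over several lines; by default the
# reader expects one JSON object per line.
df = spark.read \
    .format("json") \
    .schema(sales_schema) \
    .option("multiLine", "true") \
    .load("/Volumes/company/unit/sales_input_volume/")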

df.show()


Write DataFrame to Unity Catalog Delta Table

In [0]:
# Write to managed Delta table in Unity Catalog
df.write \
    .format("delta") \
    .mode("append") \
    .saveAsTable("company.unit.sales_landing")


In [0]:
%sql

select * from company.unit.sales_landing

In [0]:
# File 2
dbutils.fs.put("/Volumes/company/unit/sales_input_volume/sample2.json", """
{
  "transaction_time": "2024-06-06T10:02:30",
  "transaction_id": 1003,
  "customer_id": 1,
  "product_id": 103,
  "quantity": 5
}
""", overwrite=True)

# File 3
dbutils.fs.put("/Volumes/company/unit/sales_input_volume/sample3.json", """
{
  "transaction_time": "2024-06-06T10:04:00",
  "transaction_id": 1004,
  "customer_id": 3,
  "product_id": 104,
  "quantity": 3
}
""", overwrite=True)

# File 4
dbutils.fs.put("/Volumes/company/unit/sales_input_volume/sample4.json", """
{
  "transaction_time": "2024-06-06T10:08:00",
  "transaction_id": 1005,
  "customer_id": 2,
  "product_id": 101,
  "quantity": 1
}
""", overwrite=True)


In [0]:
%sql

select * from company.unit.sales_landing