## Streaming joins

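1. Suppose Event1 from Source1 must be joined with Event2 from Source2 on their ID columns.
2. The two events can arrive at different times, so Spark keeps each unmatched row in memory (as streaming state) in case a match arrives later.
3. We must specify a time bound beyond which Spark can safely drop a row from state; otherwise the state grows without limit and the job eventually runs out of memory (OOM).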
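
The three points above can be illustrated with a toy, single-process model in plain Python. This is only a sketch of the idea and not how Spark implements its state store; the class `ToyJoinBuffer` and its `max_delay` parameter are hypothetical names introduced here for illustration.

```python
from datetime import datetime, timedelta


class ToyJoinBuffer:
    """Toy model of buffering one side of a stream-stream join (not Spark's implementation)."""

    def __init__(self, max_delay):
        self.max_delay = max_delay    # how long a row may wait for a match
        self.rows = {}                # id -> list of (event_time, value)
        self.max_event_time = None    # latest event time seen so far

    def add(self, row_id, event_time, value):
        # Buffer the row so that a later event with the same id can still match it.
        self.rows.setdefault(row_id, []).append((event_time, value))
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        self._evict()

    def _evict(self):
        # Drop rows older than (latest event time - max_delay); without this
        # step the buffer would grow forever.
        cutoff = self.max_event_time - self.max_delay
        for row_id in list(self.rows):
            kept = [(t, v) for (t, v) in self.rows[row_id] if t >= cutoff]
            if kept:
                self.rows[row_id] = kept
            else:
                del self.rows[row_id]

    def size(self):
        return sum(len(v) for v in self.rows.values())


buf = ToyJoinBuffer(max_delay=timedelta(minutes=10))
buf.add("a", datetime(2025, 7, 15, 9, 0), "impression")
buf.add("b", datetime(2025, 7, 15, 9, 5), "impression")
size_before = buf.size()

# An event at 09:30 moves the cutoff to 09:20, so rows "a" and "b" are dropped.
buf.add("c", datetime(2025, 7, 15, 9, 30), "impression")
size_after = buf.size()
print(size_before, size_after)
```

The buffer holds two rows until the 09:30 event arrives, after which only that event remains. In the notebook below, the watermark and the time-range join condition play the role of `max_delay`.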

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, from_json
from pyspark.sql.types import StringType, StructType, TimestampType

spark = SparkSession \
    .builder \
    .appName("Python Spark Streaming join example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/07/21 12:11:21 WARN Utils: Your hostname, spark-master, resolves to a loopback address: 127.0.1.1; using 10.168.136.115 instead (on interface ens3)
25/07/21 12:11:21 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/07/21 12:11:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
schema1 = StructType() \
    .add("id", StringType()) \
    .add("value1", StringType()) \
    .add("event_time", TimestampType())

schema2 = StructType() \
    .add("id", StringType()) \
    .add("value2", StringType()) \
    .add("event_time", TimestampType())

In [3]:
# Stream 1: listens on port 9999
stream1 = spark.readStream.format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

25/07/21 12:11:26 WARN TextSocketSourceProvider: The socket source should not be used for production applications! It does not support recovery.


In [4]:
# Stream 2: listens on port 9998
stream2 = spark.readStream.format("socket") \
    .option("host", "localhost") \
    .option("port", 9998) \
    .load()


25/07/21 12:11:26 WARN TextSocketSourceProvider: The socket source should not be used for production applications! It does not support recovery.


In [5]:
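# Parse each JSON line with its schema, then declare a 10-minute watermark on event_time
# so Spark knows how late data may arrive and when buffered join state can be dropped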
df1 = stream1.select(from_json(col("value"), schema1).alias("data1")) \
    .selectExpr("data1.id", "data1.value1", "data1.event_time") \
    .withWatermark("event_time", "10 minutes")

df2 = stream2.select(from_json(col("value"), schema2).alias("data2")) \
    .selectExpr("data2.id", "data2.value2", "data2.event_time") \
    .withWatermark("event_time", "10 minutes")
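
The socket source delivers each received line as a string in the `value` column, which `from_json` then parses against the schema. The notebook does not show the lines that were actually sent, so the following is a reconstruction: a sketch of input lines that match `schema1` and `schema2` and would yield the row printed in Batch 1. The variable names and the exact timestamp text format are assumptions.

```python
import json

# Hypothetical input lines, reconstructed from the row printed in Batch 1.
# Keys must match the field names declared in schema1 / schema2.
line_for_port_9999 = json.dumps(
    {"id": "124", "value1": "click", "event_time": "2025-07-15 10:00:00"}
)
line_for_port_9998 = json.dumps(
    {"id": "124", "value2": "impression", "event_time": "2025-07-15 09:55:00"}
)

print(line_for_port_9999)
print(line_for_port_9998)
```

Each line would be sent on its own to the matching port, for example by typing it into a listener such as `nc -lk 9999` started before the stream. A line that is not valid JSON is parsed by `from_json` as a null struct, so its `id` is null and it never matches in the join.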

In [13]:
# Alias each side so the join condition can refer to df1.* and df2.* unambiguously
df1 = df1.alias("df1")
df2 = df2.alias("df2")

In [7]:
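# Inner join on id, keeping only pairs where the df1 event occurs
# between 0 and 5 minutes after the df2 event
joined = df1.join(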
    df2,
    expr("""
        df1.id = df2.id AND
        df1.event_time BETWEEN df2.event_time AND df2.event_time + interval 5 minutes
    """)
)
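
The time-range part of the join condition can be checked by hand. The sketch below mirrors the `BETWEEN` expression in plain Python, using the two timestamps from the row printed in Batch 1 of this notebook's output; the helper name `within_join_window` is hypothetical.

```python
from datetime import datetime, timedelta


def within_join_window(t1, t2, window=timedelta(minutes=5)):
    """Mirror of: df1.event_time BETWEEN df2.event_time AND df2.event_time + interval 5 minutes."""
    # BETWEEN is inclusive on both ends.
    return t2 <= t1 <= t2 + window


impression = datetime(2025, 7, 15, 9, 55, 0)   # df2.event_time
click = datetime(2025, 7, 15, 10, 0, 0)        # df1.event_time

on_boundary = within_join_window(click, impression)
one_second_late = within_join_window(click + timedelta(seconds=1), impression)
before_impression = within_join_window(impression - timedelta(seconds=1), impression)
print(on_boundary, one_second_late, before_impression)
```

The click at 10:00:00 sits exactly on the upper bound (09:55:00 plus 5 minutes), so the pair matches. A click one second later, or any click earlier than the impression, falls outside the window.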

In [8]:
# Print each micro-batch of newly joined rows to the console
query = joined.writeStream \
    .format("console") \
    .outputMode("append") \
    .start()

25/07/21 12:11:32 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-53c16d3b-154d-4195-b72a-cc31528c92a5. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
25/07/21 12:11:32 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.

-------------------------------------------
Batch: 0
-------------------------------------------
+---+------+----------+---+------+----------+
| id|value1|event_time| id|value2|event_time|
+---+------+----------+---+------+----------+
+---+------+----------+---+------+----------+




-------------------------------------------
Batch: 1
-------------------------------------------
+---+------+-------------------+---+----------+-------------------+
| id|value1|         event_time| id|    value2|         event_time|
+---+------+-------------------+---+----------+-------------------+
|124| click|2025-07-15 10:00:00|124|impression|2025-07-15 09:55:00|
+---+------+-------------------+---+----------+-------------------+




-------------------------------------------
Batch: 2
-------------------------------------------
+---+------+----------+---+------+----------+
| id|value1|event_time| id|value2|event_time|
+---+------+----------+---+------+----------+
+---+------+----------+---+------+----------+



In [14]:
spark.stop()