In [1]:
!pip install pyspark



In [2]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.sql import SparkSession
from pprint import pprint
import time


In [3]:

# Initialize SparkContext
sc = SparkContext()
# Create a StreamingContext with batch interval of 10 seconds
ssc = StreamingContext(sc, 10)





In [4]:
# Initialize an empty list to hold RDDs
rdd_queue = []

# Create 5 batches of data
for i in range(5):
    # Generate data ranging from 0 to 999
    rdd_data = range(1000)

    # Create an RDD from the generated data
    rdd = ssc.sparkContext.parallelize(rdd_data)

    # Append the RDD to the list
    rdd_queue.append(rdd)

# Print the list of RDDs
pprint(rdd_queue)

[PythonRDD[5] at RDD at PythonRDD.scala:53,
 PythonRDD[6] at RDD at PythonRDD.scala:53,
 PythonRDD[7] at RDD at PythonRDD.scala:53,
 PythonRDD[8] at RDD at PythonRDD.scala:53,
 PythonRDD[9] at RDD at PythonRDD.scala:53]




---


`ssc.queueStream(rdd_queue)`: Creates a DStream from `rdd_queue`. With the default `oneAtATime=True`, one RDD from the queue is consumed per batch interval.

`.map(lambda x: (x % 10, 1))`: Maps each element `x` of the DStream to the key-value pair `(x % 10, 1)`, where the key is the remainder of `x` divided by 10 and the value is 1. No grouping happens at this step; the pairs are only combined later, by `reduceByKey`.

`.window(10)`: Creates a sliding window of 10 seconds over the DStream. Because no slide duration is given, the window slides by the batch interval (also 10 seconds here), so each window covers a single batch.

`.reduceByKey(lambda v1, v2: v1 + v2)`: Sums the values of all pairs that share a key. Because every value is 1, the result is the number of elements with each remainder within the 10-second window.
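
As a rough illustration of what the `map` and `reduceByKey` steps compute for a single batch, the following plain-Python sketch reproduces the counting without Spark. The names `batch`, `pairs`, and `counts` are hypothetical and used only here, and the sketch assumes a batch holds the values 0 to 999, as each RDD in `rdd_queue` does.

```python
from collections import Counter

# One batch of data, assumed to match the contents of each RDD in rdd_queue.
batch = range(1000)

# Counterpart of .map(lambda x: (x % 10, 1)): emit one (key, 1) pair per element.
pairs = [(x % 10, 1) for x in batch]

# Counterpart of .reduceByKey(lambda v1, v2: v1 + v2): sum the values per key.
counts = Counter()
for key, value in pairs:
    counts[key] += value

print(sorted(counts.items()))
```

Each of the ten remainders occurs 100 times in this sketch, which matches the `(key, 100)` pairs in the notebook output for `dataset1`.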

In [5]:
# Create a DStream from a queue of RDDs, mapping each element to (remainder, 1), apply a 10-second window, and count occurrences.
dataset1 = ssc.queueStream(rdd_queue).map(lambda x: (x % 10, 1)).window(10).reduceByKey(lambda v1, v2: v1 + v2)
# Print the result of dataset1 DStream
dataset1.pprint()



---


The following code builds a streaming pipeline similar to the previous one, but with a 20-second window and with keys computed as the remainder when dividing by 5.
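
To see how a 20-second window behaves on top of 10-second batches, here is a minimal plain-Python model of the windowed count. It is a sketch, not Spark's implementation: it assumes the window slides by one batch interval and that exactly one batch of the values 0 to 999 arrives per interval. `batches_per_window` and `totals_per_batch` are hypothetical names for this example.

```python
from collections import Counter, deque

BATCH_INTERVAL = 10   # seconds, as in StreamingContext(sc, 10)
WINDOW_DURATION = 20  # seconds, as in .window(windowDuration=20)

# Number of batches one window spans (assumption: window slides per batch).
batches_per_window = WINDOW_DURATION // BATCH_INTERVAL

# A bounded deque drops the oldest batch automatically when a new one arrives.
window = deque(maxlen=batches_per_window)
totals_per_batch = []

for _ in range(5):                  # five batches, like the five queued RDDs
    window.append(range(1000))      # one new batch arrives per interval
    counts = Counter(x % 5 for batch in window for x in batch)
    totals_per_batch.append(dict(sorted(counts.items())))

for batch_number, totals in enumerate(totals_per_batch, start=1):
    print(batch_number, totals)
```

In this model the first window holds only one batch, so every key has a count of 200; from the second batch onward the window holds two batches and every count is 400.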

In [6]:
# Create a DStream from a queue of RDDs, mapping each element to (remainder, 1), apply a 20-second window, and count occurrences.
dataset2 = ssc.queueStream(rdd_queue).map(lambda x: (x % 5, 1)).window(windowDuration=20).reduceByKey(lambda v1, v2: v1 + v2)
# Print the result of dataset2 DStream
dataset2.pprint()



---
The code below performs an inner join between the two DStreams.

The keys of `dataset1` are the remainders when dividing each element `x` by 10 (0 to 9), and the keys of `dataset2` are the remainders when dividing by 5 (0 to 4). The join matches records with equal keys, so only keys 0 to 4, which occur in both datasets, appear in the inner join result.
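
The overlap between the two key sets can be checked with a short plain-Python sketch. The variable names are hypothetical, and the input is assumed to be the values 0 to 999 used for every RDD in this notebook.

```python
# Keys produced by each pipeline's map step (assumed input: 0..999).
keys_dataset1 = {x % 10 for x in range(1000)}
keys_dataset2 = {x % 5 for x in range(1000)}

# Keys present on both sides: these are the keys an inner join can match.
common_keys = sorted(keys_dataset1 & keys_dataset2)

# Keys present only on the left side: these have no partner in dataset2.
left_only_keys = sorted(keys_dataset1 - keys_dataset2)

print(common_keys)
print(left_only_keys)
```

The common keys are 0 through 4, and keys 5 through 9 exist only in `dataset1`.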


In [7]:
# Perform an inner join operation between dataset1 and dataset2
joinedStream = dataset1.join(dataset2)

# Print the output of the joinedStream
joinedStream.pprint()

In [8]:
# Perform left outer join operation between dataset1 and dataset2
joinedStream_left_outer = dataset1.leftOuterJoin(dataset2)

# Print the output of the joinedStream
joinedStream_left_outer.pprint()

In [9]:
# Start the streaming context
ssc.start()



---

**Inner Join**:

The output of the inner join consists of tuples whose first element is the key (the remainder) and whose second element is a tuple holding the count from `dataset1` followed by the count from `dataset2`.

At the timestamp "`Time: 2024-05-10 18:11:30`", the joined value for key 0 is (100, 200). This means that within the respective window durations:

* Key 0 appears 100 times in `dataset1` (1000 values spread evenly over 10 remainders).

* Key 0 appears 200 times in `dataset2` (1000 values spread evenly over 5 remainders).


**Left Outer Join**:

A left outer join between two DStreams returns every record from the left DStream (`dataset1`), paired with the matching record from the right DStream (`dataset2`) when one exists. If a key exists in `dataset1` but not in `dataset2`, the second element of the value tuple is `None`.

At the timestamp "`Time: 2024-05-10 18:11:30`", the joined value for key 6 is (100, None). This means that within the respective window durations:

* Key 6 appears 100 times in `dataset1`.

* Key 6 does not appear in `dataset2`.


**Comparison**:

* Inner join focuses on the intersection of keys present in both datasets, filtering out non-matching keys.

* Left outer join retains all keys from the left dataset (`dataset1`), including those that do not have a match in the right dataset (`dataset2`), represented by None in the output.
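
The difference between the two joins can be sketched in plain Python with dictionaries standing in for the reduced counts of one batch. `inner_join` and `left_outer_join` are hypothetical helpers written for this example, not PySpark functions, and they rely on each key being unique per side, which holds here because both sides come from `reduceByKey`.

```python
def inner_join(left, right):
    """Keep only keys present in both dicts and pair up their values."""
    return {key: (left[key], right[key]) for key in left if key in right}


def left_outer_join(left, right):
    """Keep every key of the left dict; use None when the right has no match."""
    return {key: (value, right.get(key)) for key, value in left.items()}


# Counts for the first batch, taken from the notebook output.
counts1 = {key: 100 for key in range(10)}  # dataset1: remainders modulo 10
counts2 = {key: 200 for key in range(5)}   # dataset2: remainders modulo 5

inner = inner_join(counts1, counts2)
left_outer = left_outer_join(counts1, counts2)

print(sorted(inner.items()))
print(sorted(left_outer.items()))
```

The inner result holds only keys 0 to 4, while the left outer result holds all ten keys, with `None` as the second count for keys 5 to 9.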

In [10]:
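# Wait up to 30 seconds for the streaming computation to terminate
ssc.awaitTermination(timeout=30)

# Stop the streaming context together with the underlying SparkContext.
# Note: this stops immediately; pass stopGraceFully=True to let pending batches finish.
ssc.stop(stopSparkContext=True)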

-------------------------------------------
Time: 2024-05-10 18:11:30
-------------------------------------------
(0, 100)
(2, 100)
(4, 100)
(6, 100)
(8, 100)
(1, 100)
(3, 100)
(5, 100)
(7, 100)
(9, 100)

-------------------------------------------
Time: 2024-05-10 18:11:30
-------------------------------------------
(0, 200)
(2, 200)
(4, 200)
(1, 200)
(3, 200)

-------------------------------------------
Time: 2024-05-10 18:11:30
-------------------------------------------
(0, (100, 200))
(2, (100, 200))
(4, (100, 200))
(1, (100, 200))
(3, (100, 200))

-------------------------------------------
Time: 2024-05-10 18:11:30
-------------------------------------------
(0, (100, 200))
(2, (100, 200))
(4, (100, 200))
(6, (100, None))
(8, (100, None))
(1, (100, 200))
(3, (100, 200))
(5, (100, None))
(7, (100, None))
(9, (100, None))

-------------------------------------------
Time: 2024-05-10 18:11:40
-------------------------------------------
(0, 100)
(2, 100)
(4, 100)
(6, 100)
(8, 100)
(