# Stream-Stream Join Demo

### Join Operations

Finally, it's worth highlighting how easily you can perform different kinds of joins in Spark Streaming.

### Stream-stream joins

A stream can be joined directly with another stream, provided both are DStreams of key-value pairs:
```python
stream1 = ...
stream2 = ...
joinedStream = stream1.join(stream2)
```
Here, in each batch interval, the RDD generated by `stream1` is joined with the RDD generated by `stream2`. `leftOuterJoin`, `rightOuterJoin`, and `fullOuterJoin` are available as well. It is often useful to join over windows of the streams, which is just as straightforward (window lengths are given in seconds):
```python
windowedStream1 = stream1.window(20)
windowedStream2 = stream2.window(60)
joinedStream = windowedStream1.join(windowedStream2)
```
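To make the difference between the join variants concrete, here is a minimal plain-Python sketch of what happens to the key-value pairs of a single batch. It is not Spark code: `join_batch` and its `how` argument are hypothetical names introduced for illustration, and the sketch assumes the usual pair-join semantics in which a missing side is filled with `None`.

```python
from collections import defaultdict


def _group(pairs):
    """Group (key, value) pairs into {key: [values]}."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def join_batch(left, right, how="inner"):
    """Plain-Python stand-in for a per-batch pair join (hypothetical helper, not a Spark API)."""
    l, r = _group(left), _group(right)
    if how == "inner":
        keys = l.keys() & r.keys()
    elif how == "left":
        keys = l.keys()
    elif how == "right":
        keys = r.keys()
    elif how == "full":
        keys = l.keys() | r.keys()
    else:
        raise ValueError("unknown join type: %r" % how)

    out = []
    for key in sorted(keys):
        # A key missing on one side contributes a single None on that side.
        for lv in l.get(key, [None]):
            for rv in r.get(key, [None]):
                out.append((key, (lv, rv)))
    return out


batch1 = [("a", 1), ("b", 2)]
batch2 = [("b", 20), ("c", 30)]

print(join_batch(batch1, batch2, "inner"))
print(join_batch(batch1, batch2, "left"))
print(join_batch(batch1, batch2, "full"))
```

Only the key `"b"` appears in both batches, so the inner join keeps a single pair, while the outer variants keep the unmatched keys and pad the missing value with `None`.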

### Demo

In [1]:
import findspark
# TODO: your path will likely not have 'matthew' in it. Change it to reflect your path.
findspark.init('/home/matthew/spark-2.1.0-bin-hadoop2.7')

In [2]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pprint import pprint
import time

In [3]:
sc = SparkContext()
ssc = StreamingContext(sc, 1)  # batch interval of 1 second

In [4]:
# Build a queue of 5 RDDs, each holding the integers 0..999
rdd_queue = []
for i in range(5):
    rdd_data = range(1000)
    rdd = ssc.sparkContext.parallelize(rdd_data)
    rdd_queue.append(rdd)
pprint(rdd_queue)

# Creating queue stream # 1: count elements per key (x % 10) over a 4-second window
ds1 = ssc.queueStream(rdd_queue).map(lambda x: (x % 10, 1)).window(4).reduceByKey(lambda v1, v2: v1 + v2)
ds1.pprint()

[PythonRDD[5] at RDD at PythonRDD.scala:48,
 PythonRDD[6] at RDD at PythonRDD.scala:48,
 PythonRDD[7] at RDD at PythonRDD.scala:48,
 PythonRDD[8] at RDD at PythonRDD.scala:48,
 PythonRDD[9] at RDD at PythonRDD.scala:48]


In [5]:
# Creating queue stream # 2: count elements per key (x % 5) over a 20-second window
ds2 = ssc.queueStream(rdd_queue).map(lambda x: (x % 5, 1)).window(windowDuration=20).reduceByKey(lambda v1, v2: v1 + v2)
ds2.pprint()

In [6]:
# Crossing the streams: an inner join on key, so only keys present in both streams (0-4) survive
joinedStream = ds1.join(ds2)
joinedStream.pprint()

In [7]:
ssc.start()

-------------------------------------------
Time: 2018-03-01 22:03:45
-------------------------------------------
(0, 100)
(8, 100)
(2, 100)
(4, 100)
(6, 100)
(1, 100)
(3, 100)
(9, 100)
(5, 100)
(7, 100)

-------------------------------------------
Time: 2018-03-01 22:03:45
-------------------------------------------
(0, 200)
(2, 200)
(4, 200)
(1, 200)
(3, 200)

-------------------------------------------
Time: 2018-03-01 22:03:45
-------------------------------------------
(0, (100, 200))
(2, (100, 200))
(4, (100, 200))
(1, (100, 200))
(3, (100, 200))

-------------------------------------------
Time: 2018-03-01 22:03:46
-------------------------------------------
(0, 200)
(8, 200)
(2, 200)
(4, 200)
(6, 200)
(1, 200)
(3, 200)
(9, 200)
(5, 200)
(7, 200)

-------------------------------------------
Time: 2018-03-01 22:03:46
-------------------------------------------
(0, 400)
(2, 400)
(4, 400)
(1, 400)
(3, 400)

-------------------------------------------
Time: 2018-03-01 22:03:46
-----

In [8]:
ssc.stop()  # by default this also stops the underlying SparkContext

-------------------------------------------
Time: 2018-03-01 22:03:52
-------------------------------------------
(0, (100, 1000))
(2, (100, 1000))
(4, (100, 1000))
(1, (100, 1000))
(3, (100, 1000))

-------------------------------------------
Time: 2018-03-01 22:03:53
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 22:03:53
-------------------------------------------
(0, 1000)
(2, 1000)
(4, 1000)
(1, 1000)
(3, 1000)
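The counts printed above can be reproduced without Spark. The following plain-Python sketch assumes that each queue stream consumes its own copy of the five RDDs, one per 1-second batch, and that the batch stamped 22:03:45 is batch 1 (so 22:03:52 is batch 8). The helper names `windowed_counts` and `joined` are hypothetical and exist only for this illustration.

```python
from collections import Counter

BATCHES = 5            # RDDs in the queue; assumed to be consumed one per batch
ELEMENTS = range(1000)  # contents of every RDD in the queue


def windowed_counts(batch_index, modulus, window_len):
    """Per-key counts over the last `window_len` batches ending at `batch_index` (1-based)."""
    counts = Counter()
    first = max(1, batch_index - window_len + 1)
    for b in range(first, batch_index + 1):
        if b <= BATCHES:  # batches after the queue runs dry contribute nothing
            counts.update(x % modulus for x in ELEMENTS)
    return dict(counts)


def joined(batch_index):
    """Inner join of the two windowed counts, mirroring ds1.join(ds2)."""
    ds1 = windowed_counts(batch_index, modulus=10, window_len=4)
    ds2 = windowed_counts(batch_index, modulus=5, window_len=20)
    return {k: (ds1[k], ds2[k]) for k in sorted(ds1.keys() & ds2.keys())}


for batch in (1, 8, 9):
    print(batch, joined(batch))
```

Under these assumptions, batch 1 pairs 100 with 200 for keys 0 to 4, and batch 8 pairs 100 with 1000: the 4-second window of `ds1` then holds only the last queued RDD, while the 20-second window of `ds2` still holds all five. By batch 9 the `ds1` window is empty, so the join produces nothing even though `ds2` still reports 1000 per key.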



## Reference
1. Spark Streaming Programming Guide, Join Operations: https://spark.apache.org/docs/latest/streaming-programming-guide.html#join-operations