# Stream-Stream Join Demo

### Join Operations

Finally, it's worth highlighting how easily you can perform different kinds of joins in Spark Streaming.

### Stream-stream joins

A stream can be joined with another stream, provided both are DStreams of key-value pairs.
```python
stream1 = ...
stream2 = ...
joinedStream = stream1.join(stream2)
```
Here, in each batch interval, the RDD generated by `stream1` is joined by key with the RDD generated by `stream2`. The variants `leftOuterJoin`, `rightOuterJoin`, and `fullOuterJoin` are also available. It is often useful to join over windows of the streams, which works the same way (in the Python API, window durations are given in seconds):
```python
windowedStream1 = stream1.window(20)
windowedStream2 = stream2.window(60)
joinedStream = windowedStream1.join(windowedStream2)
```
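
To make the per-batch semantics concrete, here is a minimal pure-Python sketch of what an inner join and a left outer join produce for a single batch of key-value pairs. It only illustrates the result, not how Spark computes it; the helper names `join_batch` and `left_outer_join_batch` are hypothetical and not part of any Spark API.

```python
from collections import defaultdict


def _group_by_key(pairs):
    """Collect the values of (key, value) pairs into a list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def join_batch(left, right):
    """Inner join: keep only keys present on both sides."""
    right_by_key = _group_by_key(right)
    return [(key, (lv, rv))
            for key, lv in left
            for rv in right_by_key.get(key, [])]


def left_outer_join_batch(left, right):
    """Left outer join: keep every left pair, using None when the right side has no match."""
    right_by_key = _group_by_key(right)
    result = []
    for key, lv in left:
        matches = right_by_key.get(key)
        if matches:
            result.extend((key, (lv, rv)) for rv in matches)
        else:
            result.append((key, (lv, None)))
    return result


# Illustrative data standing in for one batch from each stream.
batch1 = [("a", 1), ("b", 2)]
batch2 = [("a", "x"), ("c", "y")]

inner = join_batch(batch1, batch2)
left_outer = left_outer_join_batch(batch1, batch2)
print(inner)
print(left_outer)
```

Key `"b"` has no partner in `batch2`, so it disappears from the inner join but survives the left outer join with `None` on the right.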

### Demo

In [None]:
import findspark
# TODO: your path will likely not have 'matthew' in it. Change it to reflect your path.
findspark.init('/home/matthew/spark-2.1.0-bin-hadoop2.7')

In [None]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pprint import pprint
import time

In [None]:
sc = SparkContext()
# Streaming context with a batch interval of 1 second
ssc = StreamingContext(sc, 1)

In [None]:
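# Build a queue of 5 RDDs, each holding the integers 0..999.
# range works on both Python 2 and 3; xrange exists only in Python 2.
rdd_queue = []
for i in range(5):
    rdd_data = range(1000)
    rdd = ssc.sparkContext.parallelize(rdd_data)
    rdd_queue.append(rdd)
pprint(rdd_queue)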

# Create queue stream #1: key each value by x % 10, then count per key over a 4-second window
ds1 = ssc.queueStream(rdd_queue).map(lambda x: (x % 10, 1)).window(4).reduceByKey(lambda v1, v2: v1 + v2)
ds1.pprint()

In [None]:
# Create queue stream #2: key each value by x % 5, then count per key over a 20-second window
ds2 = ssc.queueStream(rdd_queue).map(lambda x: (x % 5, 1)).window(windowDuration=20).reduceByKey(lambda v1, v2: v1 + v2)
ds2.pprint()

In [None]:
# Crossing the streams: inner join on key, so only keys present in both ds1 and ds2 (0 to 4) survive
joinedStream = ds1.join(ds2)
joinedStream.pprint()
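
As a rough guide to what the joined stream should print, the following pure-Python sketch reproduces the per-key counts without Spark. It assumes each queue stream emits one RDD per batch (the default `oneAtATime=True` behavior of `queueStream`) and that a window simply accumulates the batches it currently covers. The helpers `windowed_counts` and `join_counts` are hypothetical names used only for this illustration.

```python
from collections import Counter


def windowed_counts(modulus, batches_in_window, batch=range(1000)):
    """Count values per key (x % modulus) across the batches a window covers."""
    counts = Counter()
    for _ in range(batches_in_window):
        counts.update(x % modulus for x in batch)
    return dict(counts)


def join_counts(counts1, counts2):
    """Inner join of two {key: count} dicts, keeping keys found in both."""
    return {key: (counts1[key], counts2[key])
            for key in sorted(counts1) if key in counts2}


# First batch: each window has seen exactly one RDD of 0..999.
first_batch = join_counts(windowed_counts(10, 1), windowed_counts(5, 1))
# Fourth batch: both windows cover four RDDs.
fourth_batch = join_counts(windowed_counts(10, 4), windowed_counts(5, 4))

print(first_batch)
print(fourth_batch)
```

Under these assumptions, keys 5 to 9 exist only in `ds1` and are dropped by the join, while keys 0 to 4 pair the `ds1` count with the `ds2` count.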

In [None]:
# Start the streaming computation; output is printed once per batch interval
ssc.start()

In [None]:
# Stop streaming; by default this also stops the underlying SparkContext
ssc.stop()

## Reference
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#join-operations