# Join Operations Exercise

### Join Operations

Finally, it's worth highlighting how easily you can perform different kinds of joins in Spark Streaming.

### Stream-stream joins

Streams can be very easily joined with other streams.
```python
stream1 = ...
stream2 = ...
joinedStream = stream1.join(stream2)
```
Here, in each batch interval, the RDD generated by `stream1` is joined by key with the RDD generated by `stream2`, so both streams must contain key-value pairs. You can also use `leftOuterJoin`, `rightOuterJoin`, and `fullOuterJoin`. It is often useful to join over windows of the streams, which is just as easy:
```python
windowedStream1 = stream1.window(20)
windowedStream2 = stream2.window(60)
joinedStream = windowedStream1.join(windowedStream2)
```
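
To make the per-batch semantics concrete, here is a minimal plain-Python sketch (not Spark code) of what the four join types return for one batch of key-value pairs. The helper names `group_by_key` and `join_batch` are hypothetical and only model the behaviour of `join`, `leftOuterJoin`, `rightOuterJoin`, and `fullOuterJoin`. Spark does not guarantee record order, so the sketch sorts by key for readability.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect the values of (key, value) pairs into a list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def join_batch(left, right, how="inner"):
    """Model a key-based join of two batches; `how` is inner, left, right, or full."""
    left_groups, right_groups = group_by_key(left), group_by_key(right)
    if how == "inner":
        keys = left_groups.keys() & right_groups.keys()
    elif how == "left":
        keys = left_groups.keys()
    elif how == "right":
        keys = right_groups.keys()
    elif how == "full":
        keys = left_groups.keys() | right_groups.keys()
    else:
        raise ValueError("unknown join type: " + how)

    joined = []
    for key in sorted(keys):
        # A side with no match for this key contributes a single None.
        for left_value in left_groups.get(key, [None]):
            for right_value in right_groups.get(key, [None]):
                joined.append((key, (left_value, right_value)))
    return joined


batch1 = [("a", 1), ("b", 2)]      # one batch from the first stream
batch2 = [("a", "x"), ("c", "y")]  # the same batch interval from the second stream

inner = join_batch(batch1, batch2, "inner")
left_outer = join_batch(batch1, batch2, "left")
full_outer = join_batch(batch1, batch2, "full")
```

Only key `"a"` appears in both batches, so it is the only key in `inner`. The outer variants keep the unmatched keys and fill the missing side with `None`.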

### Stream-dataset joins

This was already shown earlier while explaining the `DStream.transform` operation. Here is another example, joining a windowed stream with a dataset.
```python
dataset = ... # some RDD
windowedStream = stream.window(20)
joinedStream = windowedStream.transform(lambda rdd: rdd.join(dataset))
```
In fact, you can also dynamically change the dataset you want to join against. The function provided to `transform` is evaluated every batch interval and therefore uses whatever dataset the `dataset` reference currently points to.
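
The reason rebinding works is ordinary Python name lookup: a function resolves `dataset` each time it runs, not when it is defined. The sketch below illustrates this in plain Python, with a dictionary standing in for the RDD. The function `join_with_current_dataset` and the sample data are hypothetical and are not part of the Spark API.

```python
# Hypothetical stand-in for the reference data; in Spark this would be an RDD.
dataset = {"alice": "gold"}


def join_with_current_dataset(batch):
    # `dataset` is looked up when the function runs, so each call sees
    # whatever object the name refers to at that moment.
    return [(key, (value, dataset[key])) for key, value in batch if key in dataset]


batch = [("alice", 10), ("bob", 20)]
first = join_with_current_dataset(batch)

# Rebind the name to a refreshed dataset between two batches.
dataset = {"alice": "silver", "bob": "gold"}
second = join_with_current_dataset(batch)
```

The first call joins against the original dataset and matches only `"alice"`. The second call picks up the refreshed dataset without the function being redefined.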

The complete list of DStream transformations is available in the API documentation. For the Python API, see [DStream](https://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.DStream).



### Exercise
Create a streaming app that can join the incoming orders with our previous knowledge of whether this customer is good or bad.
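
As a way to check your output, the following plain-Python sketch (not Spark code) models what one batch could look like after the join and filter. It assumes the solution keeps the joined records in the `(customer_id, (order, is_good_customer))` shape that a key-based join produces. If you map the records to another shape, your printed output will differ. The variable names are illustrative only.

```python
# Same test data as the cells below: ten customers, even ids are good.
customers = [(customer_id, customer_id % 2 == 0) for customer_id in range(10)]
transactions = [(customer_id, None) for customer_id in range(10)]

# Model the key-based join: pair each order with the customer's flag.
is_good = dict(customers)
joined = [
    (customer_id, (order, is_good[customer_id]))
    for customer_id, order in transactions
    if customer_id in is_good
]

# Keep only the records whose customer is flagged as good.
good_only = [record for record in joined if record[1][1]]
```

Under these assumptions, each batch contains five records, one for each even `customer_id`.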

In [None]:
import findspark
# TODO: your path will likely not have 'matthew' in it. Change it to reflect your path.
findspark.init('/home/matthew/spark-2.1.0-bin-hadoop2.7')

In [None]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pprint import pprint
import time

In [None]:
sc = SparkContext()
ssc = StreamingContext(sc, 1)

In [None]:
# For testing, create prepopulated QueueStream of streaming customer orders. 
transaction_rdd_queue = []
for i in range(5): 
    transactions = [(customer_id, None) for customer_id in range(10)]
    transaction_rdd = ssc.sparkContext.parallelize(transactions)
    transaction_rdd_queue.append(transaction_rdd)


In [None]:
# Batch RDD of whether customers are good or bad. 
# (customer_id, is_good_customer)
customers = [(0, True), (1, False), (2, True), (3, False), (4, True), (5, False), (6, True), (7, False), (8, True), (9, False)]
customer_rdd = ssc.sparkContext.parallelize(customers)

In [None]:
# Creating queue stream
ds = ssc.queueStream(transaction_rdd_queue)

In [None]:
# Join the streaming RDDs with the batch RDD and filter out bad customers.
# Assign the resulting DStream to `dst`, which is printed below.


## END OF EXERCISE SECTION ==================================
dst.pprint()

In [None]:
ssc.start()
time.sleep(6)
ssc.stop()

## Reference
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#join-operations