# Stream-DataSet Join Demo

### Stream-dataset joins

This was shown earlier while explaining the `DStream.transform` operation. Here is another example, this time joining a windowed stream with a dataset.
```python
dataset = ... # some RDD
windowedStream = stream.window(20)
joinedStream = windowedStream.transform(lambda rdd: rdd.join(dataset))
```
In fact, you can also dynamically change the `dataset` you want to join against. The function provided to `transform` is evaluated every batch interval, so it uses whichever dataset the `dataset` reference currently points to.
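
The rebinding behaviour described above follows from how Python resolves names: a function looks up `dataset` each time it runs, not when it is defined. Here is a minimal sketch of that idea without Spark. The dict, the `join_batch` helper, and the sample batch are stand-ins invented for illustration; they are not part of the Spark API.

```python
# Stand-in for the RDD that the `dataset` name refers to (a plain dict here).
dataset = {"a": 1}

def join_batch(batch):
    # `dataset` is looked up when the function runs, not when it is defined.
    return [(key, (value, dataset[key])) for key, value in batch if key in dataset]

# Hypothetical batch of (key, value) pairs, reused for both calls.
batch = [("a", 10), ("b", 20)]
first = join_batch(batch)

# Rebind the name, as a driver program could do between batches.
dataset = {"b": 2}
second = join_batch(batch)

print(first)
print(second)
```

The second call sees the new `dataset` even though `join_batch` itself was never redefined, which is the same reason reassigning the variable in a driver program can change what later batches are joined against.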

The complete list of DStream transformations is available in the API documentation. For the Python API, see [DStream](https://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark.streaming.DStream).



### Demo

In [None]:
import findspark
# TODO: your path will likely not have 'matthew' in it. Change it to reflect your path.
findspark.init('/home/matthew/spark-2.1.0-bin-hadoop2.7')

In [None]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
import time

In [None]:
sc = SparkContext("local[2]", "IP-Matcher")
ssc = StreamingContext(sc, 2)  # 2-second batch interval

In [None]:
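# Start with an empty RDD; the loop further down replaces it with (ip, 1) pairs.
ips_rdd = sc.parallelize(set())
# Read lines of text from a TCP socket on localhost:9999.
lines_ds = ssc.socketTextStream("localhost", 9999)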

In [None]:
# split each line into IPs
ips_ds = lines_ds.flatMap(lambda line: line.split(" "))
pairs_ds = ips_ds.map(lambda ip: (ip, 1))

In [None]:
# join with the IPs RDD
matches_ds = pairs_ds.transform(lambda rdd: rdd.join(ips_rdd))
matches_ds.pprint()
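
To make the shape of the joined records concrete, here is a rough plain-Python sketch of an inner join on `(key, value)` pairs. It only imitates what `rdd.join` does for a single batch; the `inner_join` helper and the sample IP addresses are made up for illustration, and Spark does not guarantee the output order shown by a local list.

```python
from collections import defaultdict

def inner_join(left, right):
    """Pair up values from two lists of (key, value) tuples that share a key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    # Keys present on only one side are dropped, as in an inner join.
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]

# Illustrative data only: one batch of pairs_ds and a loaded ips_rdd.
stream_batch = [("10.0.0.1", 1), ("10.0.0.9", 1)]
known_ips = [("10.0.0.1", 1), ("10.0.0.2", 1)]

matches = inner_join(stream_batch, known_ips)
print(matches)
```

Only the address present on both sides survives, and its value is a tuple holding the value from each side. While `ips_rdd` is still empty, the join has nothing to match, so no records come out.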

In [None]:
# In another window run:
# $ nc -lk 9999
# Then enter IP addresses separated by spaces into the nc window

In [None]:
ssc.start()

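# Alternate between two sets of IP addresses for the RDD. The transform
# function looks up ips_rdd on every batch, so each reassignment takes effect.
# This loop runs until the kernel is interrupted.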
IP_FILES = ('data/ip_file1.txt', 'data/ip_file2.txt')
file_index = 0
while True:
    with open(IP_FILES[file_index]) as f:
        ips = f.read().splitlines()
    ips_rdd = sc.parallelize(ips).map(lambda ip: (ip, 1))
    print("using", IP_FILES[file_index])
    file_index = (file_index + 1) % len(IP_FILES)
    time.sleep(8)

In [None]:
ssc.stop()

## Reference
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#join-operations