# Transformation Operation Exercise

The `transform` operation (along with its variations like `transformWith`) allows arbitrary RDD-to-RDD functions to be applied on a DStream. It can be used to apply any RDD operation that is not exposed in the DStream API. For example, joining every batch in a data stream with another dataset is not directly exposed in the DStream API, but you can easily use `transform` to do this. This enables powerful possibilities: for instance, you can do real-time data cleaning by joining the input data stream with precomputed spam information (perhaps generated with Spark as well) and then filtering based on it.
```python
spamInfoRDD = sc.pickleFile(...)  # RDD containing spam information

# join data stream with spam information to do data cleaning
cleanedDStream = wordCounts.transform(lambda rdd: rdd.join(spamInfoRDD).filter(...))
```
Note that the supplied function gets called in every batch interval. This allows you to do time-varying RDD operations; that is, the RDD operations, number of partitions, broadcast variables, etc. can be changed between batches.
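
To make the per-batch behaviour concrete, here is a minimal sketch in plain Python rather than Spark. It models a stream as a sequence of lists and uses a hypothetical `transform` generator that, like `DStream.transform`, calls the supplied function once per batch. The names `transform`, `drop_blocked`, and `blocked` are illustrative assumptions and are not part of the PySpark API.

```python
# Plain-Python model of per-batch transformation; no Spark required.
# Each inner list stands in for the RDD of one batch. All names are hypothetical.

def transform(batches, func):
    """Lazily call func once per batch, one batch at a time."""
    for batch in batches:
        yield func(batch)

# Reference data that may change while the stream is running.
blocked = {"spam"}

def drop_blocked(batch):
    # `blocked` is read on every call, so each batch sees its current contents.
    return [word for word in batch if word not in blocked]

batches = [["hello", "spam", "ads"], ["ads", "world", "spam"]]
stream = transform(batches, drop_blocked)

first = next(stream)   # processed while only "spam" is blocked
blocked.add("ads")     # reference data is updated between batches
second = next(stream)  # processed with both "spam" and "ads" blocked

print(first)   # ['hello', 'ads']
print(second)  # ['world']
```

Because `drop_blocked` runs again for each batch, the second batch is filtered with the updated set. The function passed to `transform` in Spark is likewise called again in every batch interval, which is what makes time-varying operations possible.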

What is the benefit of calling the function again in every batch interval?


## Exercise

Suppose we have two RDDs that are combined into a DStream.

We would like to apply the `union()` function to each RDD in this DStream and the RDD `commonRdd`.

In [None]:
import findspark
# TODO: your path will likely not have 'matthew' in it. Change it to reflect your path.
findspark.init('/home/matthew/spark-2.1.0-bin-hadoop2.7')

In [None]:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

In [None]:
# Run Spark locally with 2 threads
conf = SparkConf().setMaster("local[2]").setAppName("StreamingTransformExample")
sc = SparkContext(conf=conf)
# Streaming context with a batch interval of 5 seconds
ssc = StreamingContext(sc, 5)

In [None]:
rdd1 = ssc.sparkContext.parallelize([1, 2, 3])
rdd2 = ssc.sparkContext.parallelize([4, 5, 6])
rddQueue = [rdd1, rdd2]

In [None]:
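# Create a DStream from the queued RDDs (by default, one RDD is consumed per batch)
numsDStream = ssc.queueStream(rddQueue)
# Add 1 to every element of each batch and print the result
plusOneDStream = numsDStream.map(lambda x: x + 1)
plusOneDStream.pprint()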

In [None]:
commonRdd = ssc.sparkContext.parallelize([7,8,9])
# TODO: Use the transform function to apply the union function to the RDDs within numsDStream and elements of commonRdd
# and print the resulting DStream




In [None]:
ssc.start() 
# ssc.awaitTermination()

In [None]:
ssc.stop(stopSparkContext=True, stopGraceFully=True)

## References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation
