# Transformation Operation Exercise

The `transform` operation (along with variations such as `transformWith`) allows arbitrary RDD-to-RDD functions to be applied to a DStream. It can be used to apply any RDD operation that is not exposed in the DStream API. For example, joining every batch in a data stream with another dataset is not directly exposed in the DStream API, but you can easily do it with `transform`. This opens up powerful possibilities: for example, you can clean data in real time by joining the input data stream with precomputed spam information (possibly generated with Spark as well) and then filtering based on it.
```python
spamInfoRDD = sc.pickleFile(...)  # RDD containing spam information

# join data stream with spam information to do data cleaning
cleanedDStream = wordCounts.transform(lambda rdd: rdd.join(spamInfoRDD).filter(...))
```
Note that the supplied function gets called in every batch interval. This allows you to do time-varying RDD operations, that is, RDD operations, number of partitions, broadcast variables, etc. can be changed between batches.
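
The per-batch call pattern can be illustrated without a cluster. The sketch below is a plain-Python emulation, not Spark code: `spam_words`, `clean_batch`, and `run_batches` are hypothetical names, the data is made up, and lists stand in for RDDs. It only shows why re-evaluating the function in every interval lets the reference data change between batches.

```python
# Plain-Python emulation of the call pattern; lists stand in for RDDs.
spam_words = {"winner"}  # hypothetical reference data that may change over time


def clean_batch(batch):
    """Called once per batch, so it always sees the current spam_words."""
    return [(word, count) for word, count in batch if word not in spam_words]


def run_batches(batches, func, updates):
    """Apply func to each batch, updating the reference data before batch i if requested."""
    results = []
    for i, batch in enumerate(batches):
        spam_words.update(updates.get(i, ()))
        results.append(func(batch))
    return results


batches = [
    [("hello", 2), ("lottery", 1)],
    [("hello", 1), ("lottery", 3)],
]
# "lottery" is added to the spam set just before the second batch.
cleaned = run_batches(batches, clean_batch, updates={1: {"lottery"}})
print(cleaned)
```

The first batch keeps `lottery` and the second drops it, because the function is evaluated again for each batch rather than once when the pipeline is defined.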

What is the benefit of having the function re-evaluated in every batch?


### Exercise

Suppose we have two RDDs that we need to join. The first is RDD1:
```
[(u'2', u'100', 2),
 (u'1', u'300', 1),
 (u'1', u'200', 1)]
```
and RDD2
```
[(u'1', u'2'), (u'1', u'3')]
```
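We would like to select the elements of RDD2 whose second value matches a value in RDD1. In the code below, `rdd` holds RDD1, `rdd1` holds RDD2, and `rdd2` is RDD2 with each pair swapped so that its former second value becomes the join key.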

```python
rdd = sc.parallelize([(u'2', u'100', 2), (u'1', u'300', 1), (u'1', u'200', 1)])
rdd1 = sc.parallelize([(u'1', u'2'), (u'1', u'3')])
rdd2 = rdd1.map(lambda x: (x[1], x[0]))
```
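
The swap in `rdd2` matters because a pair-RDD `join` matches on the first element of each tuple and emits `(key, (left_value, right_value))` for every match. As a minimal plain-Python sketch of that matching rule (the helper `join_pairs` is hypothetical, it only emulates the semantics on lists rather than reproducing Spark's implementation, and the data is made up):

```python
from collections import defaultdict


def join_pairs(left, right):
    """Emulate a pair-RDD join: emit (key, (left_value, right_value)) for every key match."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    ]


left = [("a", 1), ("b", 2), ("a", 3)]
right = [("a", "x"), ("c", "y")]

# Only key "a" appears on both sides, so "b" and "c" are dropped.
joined = join_pairs(left, right)

# Swapping each pair makes the former second element the join key.
swapped = [(value, key) for key, value in right]
joined_on_second = join_pairs([("x", 10)], swapped)

print(joined)
print(joined_on_second)
```

Elements that are not already `(key, value)` pairs, such as three-field tuples, need to be reshaped into that form before they can be joined this way.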

##### TODO: Create a `newRdd` variable with the elements from RDD2 that have the same second value as RDD1



## References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation
