# Transformation Operation Demo

The `transform` operation (along with variations such as `transformWith`) applies an arbitrary RDD-to-RDD function to a DStream, so any RDD operation that is not exposed in the DStream API can still be used. For example, the DStream API does not directly expose a way to join every batch in a data stream with another dataset, but `transform` makes this easy. This enables powerful possibilities: one can do real-time data cleaning by joining the input data stream with precomputed spam information (possibly generated with Spark as well) and then filtering based on it.
```python
spamInfoRDD = sc.pickleFile(...)  # RDD containing spam information

# join data stream with spam information to do data cleaning
cleanedDStream = wordCounts.transform(lambda rdd: rdd.join(spamInfoRDD).filter(...))
```
Note that the supplied function is called in every batch interval. This lets you do time-varying RDD operations: the RDD operations, number of partitions, broadcast variables, and so on can change between batches.
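
Because the function runs once per batch, anything it looks up on the driver is evaluated again for each batch. The following is a minimal sketch of that idea rather than part of the Spark API: `make_cleaner` and `load_blocklist` are hypothetical names, and the stream is assumed to hold `(key, value)` pairs like `wordCounts` above.

```python
def make_cleaner(load_blocklist):
    """Build an RDD-to-RDD function suitable for DStream.transform().

    load_blocklist is a hypothetical zero-argument callable that returns
    the current set of blocked keys.
    """
    def clean(rdd):
        # Runs once per batch, so the blocklist is reloaded for every batch.
        blocked = set(load_blocklist())
        # Keep only the (key, value) pairs whose key is not blocked.
        return rdd.filter(lambda pair: pair[0] not in blocked)
    return clean

# Assumed usage, given a DStream of pairs such as wordCounts:
# cleanedDStream = wordCounts.transform(make_cleaner(load_blocklist))
```

Wrapping the lookup in a callable keeps the per-batch behaviour explicit: changing what `load_blocklist` returns changes the filtering from the next batch onward, without restarting the stream.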


## Demo
Create a DStream from text files in a monitored directory.

In [1]:
import findspark
# TODO: your path will likely not have 'matthew' in it. Change it to reflect your path.
findspark.init('/home/matthew/spark-2.1.0-bin-hadoop2.7')

In [2]:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
import sys, re
import random
from apache_log_parser import ApacheAccessLog

In [3]:
random.seed(15)
conf = (SparkConf().setMaster("local[2]").setAppName("TextUpdater").set("spark.executor.memory", "2g"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 1)
ssc.checkpoint("checkpoint")

In [4]:
# TODO: Change this directory to match yours
directory = 'file:///home/matthew/pyspark-streaming/2_basics/logs'

In [5]:
log_data = ssc.textFileStream(directory)
access_log_dstream = log_data.map(ApacheAccessLog.parse_from_log_line).filter(lambda parsed_line: parsed_line is not None)
access_log_dstream.pprint(num = 30)

In [6]:
# Define the RDD-to-RDD function that `transform()` applies to each batch
def mapIPValues(rdd):
    # Placeholder implementation: pair each parsed line's IP address with a count of 1
    rdd1 = rdd.map(lambda parsed_line: (parsed_line.ip, 1))
    return rdd1
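
The placeholder above only emits `(ip, 1)` pairs. As a sketch of what a fuller RDD-to-RDD function might look like, the hypothetical `countHitsPerIP` below sums those pairs within each batch and sorts by hit count. It assumes, as in the cell above, that each parsed line exposes an `ip` attribute.

```python
from operator import add

def countHitsPerIP(rdd):
    """Count requests per IP address within one batch, busiest IP first."""
    return (rdd.map(lambda parsed_line: (parsed_line.ip, 1))
               .reduceByKey(add)  # sum the 1s for each IP
               .sortBy(lambda ip_count: ip_count[1], ascending=False))

# Assumed usage with the DStream defined earlier:
# access_log_dstream.transform(countHitsPerIP).pprint(num=30)
```

`sortBy` is a plain RDD operation, which makes it a typical case for `transform`: the counts are per batch, not accumulated across batches.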

In [7]:
transformed_access_log_dstream = access_log_dstream.transform(mapIPValues)
transformed_access_log_dstream.pprint(num = 30)

Note: Spark Streaming monitors this directory for new files. Start this program first, then copy the log file `logs/access_log.log` into the `directory` location.
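
The copy step can also be scripted. The sketch below is one possible approach, not something the notebook itself does: the hypothetical helper `drop_into_watched_dir` writes the copy to a staging file outside the watched directory and then renames it into place, so the stream never sees a half-written file. It assumes a local filesystem path (without the `file://` prefix) and that the parent of the watched directory is writable and on the same filesystem.

```python
import os
import shutil
import tempfile

def drop_into_watched_dir(src_path, watched_dir):
    """Copy src_path into watched_dir so that it appears there in one step."""
    os.makedirs(watched_dir, exist_ok=True)
    # Stage the copy next to the watched directory (assumed same filesystem).
    parent = os.path.dirname(os.path.abspath(watched_dir))
    fd, staging_path = tempfile.mkstemp(dir=parent)
    os.close(fd)
    shutil.copyfile(src_path, staging_path)
    # Rename the finished copy into the watched directory.
    dest_path = os.path.join(watched_dir, os.path.basename(src_path))
    os.replace(staging_path, dest_path)
    return dest_path

# Assumed usage, run after ssc.start() with your own local paths:
# drop_into_watched_dir('logs/access_log.log', '/path/to/watched/logs')
```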

In [8]:
ssc.start() 
#ssc.awaitTermination()

-------------------------------------------
Time: 2018-02-13 16:28:43
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:28:43
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:28:44
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:28:44
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:28:45
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:28:45
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:28:46
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:28:46
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:28:47
----------

-------------------------------------------
Time: 2018-02-13 16:29:02
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:29:02
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:29:03
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:29:03
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:29:04
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:29:04
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:29:05
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:29:05
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:29:06
----------

In [9]:
ssc.stop(stopSparkContext=True, stopGraceFully=True)

-------------------------------------------
Time: 2018-02-13 16:29:20
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:29:20
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:29:21
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:29:21
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:29:22
-------------------------------------------

-------------------------------------------
Time: 2018-02-13 16:29:22
-------------------------------------------
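
The start and stop cells above can also be combined when the demo runs as a script instead of interactively. This is a minimal sketch under that assumption. `run_for` is a hypothetical helper name, and it relies only on the `start`, `awaitTerminationOrTimeout` and `stop` methods of `StreamingContext`.

```python
def run_for(ssc, seconds):
    """Start the streaming context, let it run for about `seconds`, then stop it."""
    ssc.start()
    try:
        # Returns when the context stops on its own or the timeout elapses.
        ssc.awaitTerminationOrTimeout(seconds)
    finally:
        # Finish processing received data, then stop the SparkContext as well.
        ssc.stop(stopSparkContext=True, stopGraceFully=True)

# Assumed usage with the ssc created earlier:
# run_for(ssc, 60)
```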



## References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation
