# Transformation Operation Demo

The `transform` operation (along with its variations such as `transformWith`) allows arbitrary RDD-to-RDD functions to be applied to a DStream. It can be used to apply any RDD operation that is not exposed in the DStream API. For example, joining every batch in a data stream with another dataset is not directly exposed in the DStream API, but you can easily use `transform` to do it. This opens up powerful possibilities: for instance, you can clean data in real time by joining the input data stream with precomputed spam information (possibly generated with Spark as well) and then filtering based on it.
```python
spamInfoRDD = sc.pickleFile(...)  # RDD containing spam information

# join data stream with spam information to do data cleaning
cleanedDStream = wordCounts.transform(lambda rdd: rdd.join(spamInfoRDD).filter(...))
```
Note that the supplied function is called in every batch interval. This lets you do time-varying RDD operations; that is, the RDD operations, number of partitions, broadcast variables, etc. can change between batches.
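
The per-batch invocation can be illustrated without Spark. The following is a minimal plain-Python model, not PySpark code: `simulate_transform`, `drop_spam`, and the sample batches are hypothetical names and data introduced only for this sketch. Each batch is a list of `(word, count)` pairs, and the function is called once per batch, so it picks up reference data that changed since the previous batch.

```python
def simulate_transform(batches, func):
    """Model of transform: call func once for every batch, lazily, in order."""
    for batch in batches:
        yield func(batch)


# Reference data that may change between batches (hypothetical sample values).
spam_words = {"casino"}


def drop_spam(batch):
    # spam_words is read at call time, so each batch sees the latest set.
    return [(word, count) for (word, count) in batch if word not in spam_words]


batches = [
    [("hello", 2), ("lottery", 1)],
    [("hello", 1), ("lottery", 3)],
]

results = []
for i, cleaned in enumerate(simulate_transform(batches, drop_spam)):
    results.append(cleaned)
    if i == 0:
        # Update the reference data after the first batch has been processed.
        spam_words.add("lottery")

print(results)
```

In this model the first batch keeps `lottery`, while the second batch drops it because the spam set was updated in between.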

What is the benefit of this per-batch evaluation?


### Demo


```python
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
import sys
import random
from apache_log_parser import ApacheAccessLog

random.seed(15)

if len(sys.argv) != 2:
    print('Please provide the path to Apache log file')
    print('10_10.py <path_to_log_directory>')
    sys.exit(2)

conf = (SparkConf().setMaster("local[4]").setAppName("log processor").set("spark.executor.memory", "2g"))

sc = SparkContext(conf=conf)

# Streaming context with a 2-second batch interval
ssc = StreamingContext(sc, 2)
ssc.checkpoint("checkpoint")

directory = sys.argv[1]
print(directory)

# Create a DStream from text files.
# Note: Spark Streaming monitors this directory for new files.
# So start this program first, and then copy the log file logs/access_log.log to the 'directory' location.

log_data = ssc.textFileStream(directory)
# Parse each log line and drop the lines that could not be parsed
access_log_dstream = log_data.map(ApacheAccessLog.parse_from_log_line).filter(lambda parsed_line: parsed_line is not None)
```

Next, build a transformed stream with the `transform()` function:
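```python
def extractOutliers(rdd):
    # Placeholder: no outlier logic is implemented yet, so the RDD is returned unchanged
    return rdd

# extractOutliers is called once per batch with that batch's RDD
transformed_access_log_dstream = access_log_dstream.transform(extractOutliers)
transformed_access_log_dstream.pprint()

ssc.start()
ssc.awaitTermination()
```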
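
The demo leaves `extractOutliers` empty. One way it could be filled in is sketched below, under assumptions that are not part of the demo: an "outlier" is taken to be a request whose response size exceeds the batch mean by more than `k` standard deviations, and each parsed record is assumed to expose a numeric `content_size` attribute. The names `upper_bound` and `make_outlier_extractor` are hypothetical. The sketch only defines functions, so it does not need a running Spark context until it is passed to `transform`.

```python
def upper_bound(mean, stdev, k):
    """Sizes above mean + k * stdev are treated as outliers."""
    return mean + k * stdev


def make_outlier_extractor(k=2.0):
    """Build a one-argument RDD-to-RDD function suitable for transform()."""

    def extract_outliers(rdd):
        # An empty batch has nothing to summarize, so pass it through.
        if rdd.isEmpty():
            return rdd
        # content_size is an ASSUMED numeric attribute of each parsed record.
        stats = rdd.map(lambda rec: rec.content_size).stats()
        threshold = upper_bound(stats.mean(), stats.stdev(), k)
        # Keep only the records whose size exceeds the per-batch threshold.
        return rdd.filter(lambda rec: rec.content_size > threshold)

    return extract_outliers
```

It could then be plugged in as `access_log_dstream.transform(make_outlier_extractor(k=3.0))`. The threshold `k` is kept in a closure rather than as a second parameter because PySpark's `transform` also accepts a two-argument `(time, rdd)` function and decides which form to call from the argument count.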

## References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation
