# Basics of Transformations Demo

In Spark Streaming, DStreams are treated much like the RDDs that make them up. As with RDDs, a wide variety of data transformations is available.

Here are some of the transformations from the Spark documentation that might be useful for your purposes:

| Transformation        | Meaning         |
| ------------------------------ |:-------------|
| **map**(func)      | Return a new DStream by passing each element of the source DStream through a function func.    |
| **flatMap**(func)	| Similar to map, but each input item can be mapped to 0 or more output items.    |
| **filter**(func)	| Return a new DStream by selecting only the records of the source DStream on which func returns true.    |
| **repartition**(numPartitions)	| Changes the level of parallelism in this DStream by creating more or fewer partitions.    |
| **union**(otherStream)	| Return a new DStream that contains the union of the elements in the source DStream and otherStream. |
| **count**()	| Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.  |
| **reduce**(func)	| Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel. |
| **countByValue**()	| When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream. |
| **reduceByKey**(func, [numTasks])	| When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| **join**(otherStream, [numTasks])	| When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key. |
| **cogroup**(otherStream, [numTasks])	| When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples. |


If you look at the Spark Streaming documentation, you will also find `transform(func)` and `updateStateByKey(func)`. We will discuss these later in the course.
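
To make the per-batch behavior of the transformations in the table concrete, here is a minimal plain-Python sketch (no Spark required) that models a DStream as a list of batches, with each inner list standing in for one RDD. The names `batches` and `apply_per_batch` are hypothetical and exist only for illustration; real DStreams are distributed and evaluated lazily.

```python
from collections import Counter
from functools import reduce

# Hypothetical stand-in for a DStream: each inner list plays the role of one RDD (one batch).
batches = [["a", "b", "a"], ["c"], ["b", "b", "c", "a"]]

def apply_per_batch(stream, func):
    """Apply func to every batch independently, as the DStream transformations above do."""
    return [func(batch) for batch in stream]

# filter: keep only the records for which the predicate returns true
not_a = apply_per_batch(batches, lambda batch: [x for x in batch if x != "a"])

# count: one single-element result per batch
counts = apply_per_batch(batches, lambda batch: [len(batch)])

# countByValue: (value, frequency) pairs computed within each batch
by_value = apply_per_batch(batches, lambda batch: sorted(Counter(batch).items()))

# reduce: fold each batch with an associative, commutative function (here: max)
maxima = apply_per_batch(batches, lambda batch: [reduce(max, batch)])

print(not_a)     # [['b'], ['c'], ['b', 'b', 'c']]
print(counts)    # [[3], [1], [4]]
print(by_value)  # [[('a', 2), ('b', 1)], [('c', 1)], [('a', 1), ('b', 2), ('c', 1)]]
print(maxima)    # [['b'], ['c'], ['c']]
```

In this sketch nothing is carried over from one batch to the next: every result is computed from a single batch on its own.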


### Demo (Part 1)

We're going to demo the `map` and `flatMap` functions with respect to DStreams. One important question is: what is the difference between the two?

`map`: Returns a new RDD by applying a function to each element of the RDD. The function passed to `map` produces exactly one output item per input element. It works with DStreams as well as RDDs.

`flatMap`: Like `map`, it returns a new RDD by applying a function to each element of the RDD, but the output is flattened. The function passed to `flatMap` can return a sequence of zero or more elements for each input element. It also works with DStreams as well as RDDs.

Here's an example:

In [None]:
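import findspark
# TODO: your path will likely not have 'matthew' in it. Change it to reflect your path.
findspark.init('/home/matthew/spark-2.1.0-bin-hadoop2.7')
from pyspark import SparkContext
sc = SparkContext.getOrCreate()  # the `sc` used in the cells below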

In [None]:
sc.parallelize([3,4,5]).map(lambda x: list(range(1,x))).collect()

In [None]:
sc.parallelize([3,4,5]).flatMap(lambda x: range(1,x)).collect()

Notice that the output is flattened into a single list.

Here's another example:

In [None]:
sc.parallelize([3,4,5]).map(lambda x: [x,  x*x]).collect() 

In [None]:
sc.parallelize([3,4,5]).flatMap(lambda x: [x, x*x]).collect() 

Notice that the list is flattened in the latter version.
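
The relationship between the two can also be sketched in plain Python, with no Spark required: `flatMap` behaves like `map` followed by flattening one level. The helper names `local_map` and `local_flat_map` are hypothetical stand-ins for the RDD methods, not part of any Spark API.

```python
from itertools import chain

def local_map(func, items):
    """One output item per input item, like RDD.map."""
    return [func(x) for x in items]

def local_flat_map(func, items):
    """Apply func, then flatten the resulting sequences one level, like RDD.flatMap."""
    return list(chain.from_iterable(func(x) for x in items))

data = [3, 4, 5]

mapped = local_map(lambda x: [x, x * x], data)
flattened = local_flat_map(lambda x: [x, x * x], data)
print(mapped)     # [[3, 9], [4, 16], [5, 25]]
print(flattened)  # [3, 9, 4, 16, 5, 25]

# The range example from earlier: each input contributes a different number of outputs.
ranges = local_flat_map(lambda x: range(1, x), data)
print(ranges)     # [1, 2, 1, 2, 3, 1, 2, 3, 4]
```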

Here's another example, this time reading from a file, which is often useful for debugging code that will later run on full DStreams.

There is a text file `greetings.txt` with the following lines:
```
Good Morning
Good Evening
Good Day
Happy Birthday
Happy New Year
```

In [None]:
lines = sc.textFile("greetings.txt")
lines.map(lambda line: line.split()).collect()

In [None]:
lines.flatMap(lambda line: line.split()).collect()
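
For reference, the difference between the two results can be reproduced without Spark or the file on disk. This sketch assumes `greetings.txt` contains exactly the five lines listed above and uses plain list operations in place of `sc.textFile`, `map`, and `flatMap`; the variable names are illustrative only.

```python
# Assumed contents of greetings.txt, one string per line.
greeting_lines = [
    "Good Morning",
    "Good Evening",
    "Good Day",
    "Happy Birthday",
    "Happy New Year",
]

# Counterpart of lines.map(lambda line: line.split()): one list of words per line.
per_line = [line.split() for line in greeting_lines]

# Counterpart of lines.flatMap(lambda line: line.split()): a single flat list of words.
all_words = [word for line in greeting_lines for word in line.split()]

print(per_line)
print(all_words)
```

With `map` the result keeps one entry per line (five lists), while `flatMap` yields eleven individual words.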

### Demo (Part 2)

Last time we went over the `map` and `flatMap` functions. Now we'll explore a few other options.

Suppose we have this example text from Dr. Seuss's _The Cat in the Hat_.

In [None]:
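from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# StreamingContext wraps a SparkContext; the second argument is the batch interval in seconds
conf = SparkConf().setMaster("local[2]").setAppName("PythonSparkApp")
scc = StreamingContext(SparkContext.getOrCreate(conf), 10)

myFile = scc.sparkContext.textFile("../data/DrSeuss.txt")
wordspair = myFile.flatMap(lambda row: row.split(" ")).map(lambda x: (x, 1))
oldwordcount = wordspair.reduceByKey(lambda x,y : x + y)
lines = scc.socketTextStream("192.168.56.101", 9999)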

# Suppose that we want word counts for the incoming stream. We can use the map function from before here.
# On a DStream, map returns a new DStream whose values are created by applying the supplied function to each
# value in every batch. Here we use a lambda function which replaces some common punctuation characters with
# spaces and converts to lower case, producing a new DStream:

wordcounts1 = lines.map( lambda x: x.replace(',',' ').replace('.',' ').replace('-',' ').lower())
# DStreams have no take(); pprint(n) prints the first n elements of each batch once the context is started
wordcounts1.pprint(10)

# The flatMap function applies the supplied function to each input value and flattens the results. In this case,
# the lines are split into words and each word becomes a separate value in the output DStream:

wordcounts2 = wordcounts1.flatMap(lambda x: x.split())
wordcounts2.pprint(20)

# Turn each word into a (key, value) tuple of the form (word, 1) so that the counts can be aggregated by key

wordcounts3 = wordcounts2.map(lambda x: (x, 1))
wordcounts3.pprint(20)

# reduceByKey expects the input to contain tuples of the form (key, value). Within each batch it creates one tuple
# for each unique key, where the value in the second position is produced by applying the supplied lambda
# function to the values that share that key. Here the key is the word and the lambda sums the counts, so the
# output consists of a single tuple per unique word: the word in the first position and its count in the second

wordcounts4 = wordcounts3.reduceByKey(lambda x,y:x+y)
wordcounts4.pprint(20)

# Map a lambda function over the data to swap the first and second values in each tuple, so that the word count
# appears in the first position and the word in the second position

wordcounts5 = wordcounts4.map(lambda x:(x[1],x[0]))
wordcounts5.pprint(20)

# Sort each batch by key (i.e., the value at the first position in each tuple).
# In this example the first position stores the word count, so the most frequently occurring words
# come first. The ascending=False parameter results in a descending sort order.
# DStreams have no sortByKey of their own, so transform (covered later in the course) applies the RDD method to each batch

wordcounts6 = wordcounts5.transform(lambda rdd: rdd.sortByKey(ascending=False))
wordcounts6.pprint(20)

# Nothing runs until the streaming context is started
scc.start()
scc.awaitTermination()
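
Because the streaming job above needs a live socket source, it can help to check the logic on a single batch first. The following is a minimal plain-Python sketch of the same six steps, not the Spark implementation. The sample lines and the function name `word_counts_for_batch` are hypothetical, and ties are broken alphabetically only to make the output deterministic, which the Spark version does not guarantee.

```python
from collections import defaultdict

def word_counts_for_batch(batch_lines):
    """Mimic the map / flatMap / reduceByKey / sortByKey steps on one batch of lines."""
    # Step 1 (map): replace common punctuation with spaces and lowercase each line.
    cleaned = [
        line.replace(",", " ").replace(".", " ").replace("-", " ").lower()
        for line in batch_lines
    ]
    # Step 2 (flatMap): split every line into words and flatten into one list.
    words = [word for line in cleaned for word in line.split()]
    # Step 3 (map): pair each word with an initial count of 1.
    pairs = [(word, 1) for word in words]
    # Step 4 (reduceByKey): sum the counts of identical words.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    # Step 5 (map): swap to (count, word) so the count becomes the key.
    swapped = [(count, word) for word, count in totals.items()]
    # Step 6 (sortByKey, descending): highest counts first, ties broken alphabetically.
    return sorted(swapped, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical sample batch, standing in for lines received from the socket.
sample_batch = ["The cat sat. The cat ran,", "the hat - the cat"]
result = word_counts_for_batch(sample_batch)
print(result)  # [(4, 'the'), (3, 'cat'), (1, 'hat'), (1, 'ran'), (1, 'sat')]
```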


# References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams
