# Basics of Transformations Demo 2

As we discussed earlier, a wide variety of data transformations are available on DStreams, most of which mirror the transformations available on the RDDs that make up each DStream.

As a reminder, here is the list of transformations from the previous demo:

| Transformation        | Meaning         |
| ------------------------------ |:-------------|
| **map**(func)      | Return a new DStream by passing each element of the source DStream through a function func.    |
| **flatMap**(func) | Similar to map, but each input item can be mapped to 0 or more output items. |
| **filter**(func) | Return a new DStream by selecting only the records of the source DStream on which func returns true. |
| **repartition**(numPartitions) | Changes the level of parallelism in this DStream by creating more or fewer partitions. |
| **union**(otherStream) | Return a new DStream that contains the union of the elements in the source DStream and otherStream. |
| **count**() | Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream. |
| **reduce**(func) | Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel. |
| **countByValue**() | When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream. |
| **reduceByKey**(func, [numTasks]) | When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| **join**(otherStream, [numTasks]) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key. |
| **cogroup**(otherStream, [numTasks]) | When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples. |
| **transform**(func) | Return a new DStream by applying an RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream. |
| **updateStateByKey**(func) | Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key. |


Let's go through another example:



### Demo

Last time we went over the `map` and `flatMap` functions. This time we'll explore a few other options.

Suppose we have this example text from Dr. Seuss's _The Cat in the Hat_.

In [None]:
# Assumes `sc` is an already-created SparkContext.
lines = sc.parallelize(['Its fun to have fun,', 'but you have to know how.'])

Suppose we want word counts for this text. We can use the `map` function from before, which returns a new RDD containing the values created by applying the supplied function to each value in the original RDD.

Here we use a lambda function that replaces some common punctuation characters with spaces and converts the text to lower case, producing a new RDD:

In [None]:
wordcounts1 = lines.map( lambda x: x.replace(',',' ').replace('.',' ').replace('-',' ').lower())
wordcounts1.take(10)

The `flatMap` function applies a function that returns a sequence for each input value and flattens the results into a single RDD. In this case, each line is split into words, and each word becomes a separate value in the output RDD:

In [None]:
wordcounts2 = wordcounts1.flatMap(lambda x: x.split())
wordcounts2.take(20)
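
To see why `flatMap` is used here rather than `map`, the following plain-Python sketch mimics both behaviours on the cleaned lines. No Spark is involved, and the variable names are illustrative; it only models the semantics, not how Spark executes them.

```python
from itertools import chain

# The cleaned lines, as produced by the punctuation/lower-case step.
cleaned = ['its fun to have fun ', 'but you have to know how ']

# map-like behaviour: one output per input, so the result is a list of lists.
mapped = [line.split() for line in cleaned]

# flatMap-like behaviour: the per-line lists are flattened into one sequence.
flat_mapped = list(chain.from_iterable(line.split() for line in cleaned))

print(mapped)
print(flat_mapped)
```

With `map`, each line would remain a separate list of words; `flatMap` yields one flat collection of words, which is what the counting steps need.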

In this second `map` invocation, we use a function that replaces each value in the input RDD with a 2-tuple containing the word in the first position and the integer 1 in the second position:

In [None]:
wordcounts3 = wordcounts2.map(lambda x: (x, 1))
wordcounts3.take(20)

Next we use `reduceByKey`, which expects the input RDD to contain tuples of the form `(key, value)`. It creates a new RDD containing one tuple for each unique `key` in the input, where the value in the second position is produced by applying the supplied lambda function to the `value`s that share that `key` in the input RDD.

Here the key is the word, and the lambda function sums the counts for each word. The output RDD consists of a single tuple for each unique word in the data, with the word in the first position and its count in the second position.

In [None]:
wordcounts4 = wordcounts3.reduceByKey(lambda x,y:x+y)
wordcounts4.take(20) 
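
To make the semantics of `reduceByKey` concrete, here is a minimal plain-Python model of it. The helper `reduce_by_key` is hypothetical (it is not part of Spark) and runs on a local list; it is only a sketch of what the transformation computes.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(pairs, func):
    """Plain-Python model of reduceByKey: group values by key, then fold each group with func."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}


# (word, 1) pairs, as produced by the previous map step.
pairs = [(w, 1) for w in 'its fun to have fun but you have to know how'.split()]
counts = reduce_by_key(pairs, lambda x, y: x + y)
print(counts)
```

Unlike this sequential model, Spark combines values within and across partitions, which is why the reduce function should be associative and commutative.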

Next, we map a lambda function over the data that swaps the first and second values in each tuple, so that the word count appears in the first position and the word in the second:

In [None]:
wordcounts5 = wordcounts4.map(lambda x:(x[1],x[0]))
wordcounts5.take(20)

Now we sort the input RDD by key (i.e., the value at the first position in each tuple) using `sortByKey`.

In this example the first position stores the word count, so the sort places the most frequently occurring words first in the RDD. The `ascending=False` parameter produces a descending sort order:

In [None]:
wordcounts6 = wordcounts5.sortByKey(ascending=False)
wordcounts6.take(20)
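
As a cross-check, the swap-then-sort steps can be modeled in plain Python. This sketch uses `collections.Counter` as a stand-in for the counting steps and `sorted` as a stand-in for `sortByKey`; the variable names are illustrative.

```python
from collections import Counter

words = 'its fun to have fun but you have to know how'.split()
counts = Counter(words)

# Swap (word, count) to (count, word), mirroring the map step.
swapped = [(n, w) for w, n in counts.items()]

# Sort on the first tuple element in descending order,
# mirroring sortByKey(ascending=False).
ranked = sorted(swapped, key=lambda pair: pair[0], reverse=True)
print(ranked)
```

Three words (`fun`, `to`, `have`) share a count of 2 and the rest appear once. The relative order of words with equal counts should not be relied on in the Spark output.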

And now we have our sorted word count for this incredibly small poem.
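
The demo above operates on a plain RDD. As a hedged sketch of how the same steps might be applied to a DStream, the code below chains the transformations and uses `transform` to sort each batch's RDD. The functions `clean`, `word_counts`, and `run_demo` are hypothetical names introduced here, the 1-second batch interval and 5-second timeout are arbitrary choices, and `run_demo` assumes an existing SparkContext is passed in.

```python
def clean(line):
    """Replace the punctuation handled in the demo with spaces and lower-case the line."""
    return line.replace(',', ' ').replace('.', ' ').replace('-', ' ').lower()


def word_counts(dstream):
    """Apply the demo's word-count steps to a DStream of text lines."""
    return (dstream
            .map(clean)
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda x, y: x + y)
            # Sort each batch by count, descending, on the underlying RDD.
            .transform(lambda rdd: rdd.sortBy(lambda kv: kv[1], ascending=False)))


def run_demo(sc):
    """Hypothetical driver: feed one batch through the pipeline and print it."""
    from pyspark.streaming import StreamingContext  # lazy import; requires PySpark

    ssc = StreamingContext(sc, 1)  # 1-second batches (arbitrary choice)
    batch = sc.parallelize(['Its fun to have fun,', 'but you have to know how.'])
    lines = ssc.queueStream([batch])  # a queue of RDDs acts as a test input stream
    word_counts(lines).pprint()
    ssc.start()
    ssc.awaitTerminationOrTimeout(5)  # let the single batch run, then stop
    ssc.stop(stopSparkContext=False)
```

Calling `run_demo(sc)` in a session with a live SparkContext should print the per-batch word counts. Sorting by the count directly with `sortBy` avoids the separate swap step used above.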

# References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams 