# Basics of Transformations Exercise

DStream Transformations

### Exercise
Use any of the transformations in the table below to return the largest key of every RDD in a DStream (that is, one result per RDD, not a single maximum for the entire DStream).

| Transformation        | Meaning         |
| ------------------------------ |:-------------|
| **map**(func)      | Return a new DStream by passing each element of the source DStream through a function func.    |
| **flatMap**(func)	| Similar to map, but each input item can be mapped to 0 or more output items.    |
| **filter**(func)	| Return a new DStream by selecting only the records of the source DStream on which func returns true.    |
| **repartition**(numPartitions)	| Changes the level of parallelism in this DStream by creating more or fewer partitions.    |
| **union**(otherStream)	| Return a new DStream that contains the union of the elements in the source DStream and otherStream. |
| **count**()	| Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.  |
| **reduce**(func)	| Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative and commutative so that it can be computed in parallel. |
| **countByValue**()	| When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream. |
| **reduceByKey**(func, [numTasks])	| When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property `spark.default.parallelism`) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks. |
| **join**(otherStream, [numTasks])	| When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key. |
| **cogroup**(otherStream, [numTasks])	| When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples. |


The Spark Streaming documentation also lists two further transformations, `transform(func)` and `updateStateByKey(func)`. We will discuss these later in the course.
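
Every transformation in the table is applied to each RDD of the DStream separately. The sketch below models that behaviour in plain Python, with no Spark required. It is only an illustration: the `batches` list (a stand-in for a DStream, one inner list per RDD) and the `reduce_by_key` helper are hypothetical names, not part of the PySpark API.

```python
from collections import Counter

# Hypothetical stand-in for a DStream: one inner list per RDD (batch).
batches = [
    [(1, "a"), (2, "b"), (1, "c")],
    [(2, "d"), (3, "e")],
]

def reduce_by_key(batch, func):
    """Aggregate the values of each key within a single batch."""
    out = {}
    for key, value in batch:
        out[key] = func(out[key], value) if key in out else value
    return sorted(out.items())

# map: apply a function to every record of every batch.
keys = [[k for k, _ in batch] for batch in batches]

# filter: keep only the records for which the predicate is true.
odd_keys = [[rec for rec in batch if rec[0] % 2 == 1] for batch in batches]

# count: one number per batch, not one number for the whole stream.
counts = [len(batch) for batch in batches]

# countByValue: frequency of each distinct element within a batch.
key_freq = [sorted(Counter(ks).items()) for ks in keys]

# reduceByKey: combine the values that share a key within a batch.
merged = [reduce_by_key(batch, lambda a, b: a + b) for batch in batches]

print(keys)      # [[1, 2, 1], [2, 3]]
print(odd_keys)  # [[(1, 'a'), (1, 'c')], [(3, 'e')]]
print(counts)    # [3, 2]
print(key_freq)  # [[(1, 2), (2, 1)], [(2, 1), (3, 1)]]
print(merged)    # [[(1, 'ac'), (2, 'b')], [(2, 'd'), (3, 'e')]]
```

Note that every result has one entry per batch, which is the behaviour the exercise asks for.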


In [1]:
import findspark
# TODO: your path will likely not have 'matthew' in it. Change it to reflect your path.
findspark.init('/home/matthew/spark-2.1.0-bin-hadoop2.7')

In [2]:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

In [3]:
sc = SparkContext(appName="PythonStreamingExercise")
ssc = StreamingContext(sc, 1)

In [4]:
# Define the stream: a queue holding a single RDD of (key, value) pairs in 3 partitions
stream = ssc.queueStream([sc.parallelize(
    [(1, "a"), (2, "b"), (1, "c"), (2, "d"), (1, "e"), (3, "f")], 3)])
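
The `3` passed to `sc.parallelize` spreads the six records over three partitions, which is why `reduce` needs an associative and commutative function: each partition can be reduced on its own and the partial results combined afterwards. The following is a rough plain-Python sketch of that idea. The `split` helper and its contiguous chunking are assumptions made for illustration and are not meant to reproduce exactly how Spark assigns records to partitions.

```python
from functools import reduce
from operator import sub

records = [(1, "a"), (2, "b"), (1, "c"), (2, "d"), (1, "e"), (3, "f")]

def split(items, n):
    """Split items into n contiguous chunks (illustrative partitioning only)."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

# Reduce each partition independently, then combine the partial results.
partitions = split(records, 3)
partials = [reduce(max, part) for part in partitions]
combined = reduce(max, partials)

# max is associative and commutative, so the partitioned result
# matches a single sequential pass over all records.
print(partials)                          # [(2, 'b'), (2, 'd'), (3, 'f')]
print(combined == reduce(max, records))  # True

# Subtraction is neither, so the partitioned result differs.
numbers = [1, 2, 3, 4, 5, 6]
sequential = reduce(sub, numbers)
partitioned = reduce(sub, [reduce(sub, p) for p in split(numbers, 3)])
print(sequential, partitioned)           # -19 1
```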

In [5]:
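# TODO: Use any of the functions above, or some combination, to
# return the largest key of every RDD in a DStream (not just the largest in the entire DStream).

# Tuples compare element by element, so max picks the record with the largest key
# (ties on the key are broken by the value).
maxstream = stream.reduce(max)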
maxstream.pprint()
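
`stream.reduce(max)` yields the whole `(key, value)` record with the largest key, because Python compares tuples element by element. If only the key is wanted, one option is to map each record to its key first, for example `stream.map(lambda kv: kv[0]).reduce(max)`. The plain-Python sketch below shows the difference between the two variants. The two-batch `batches` list is a hypothetical stand-in for a DStream with two RDDs, not the stream defined in this notebook.

```python
from functools import reduce

# Hypothetical stand-in for a DStream with two RDDs (batches).
batches = [
    [(1, "a"), (2, "b"), (1, "c"), (2, "d"), (1, "e"), (3, "f")],
    [(5, "x"), (4, "y"), (5, "w")],
]

# Analogue of stream.reduce(max): the largest record of each batch.
# Records with equal keys are ordered by their values.
largest_records = [reduce(max, batch) for batch in batches]

# Analogue of stream.map(lambda kv: kv[0]).reduce(max): the largest key only.
largest_keys = [reduce(max, [k for k, _ in batch]) for batch in batches]

print(largest_records)  # [(3, 'f'), (5, 'x')]
print(largest_keys)     # [3, 5]
```

Mapping to the key first also avoids comparing values at all, which matters when the values of tied keys are not comparable with each other.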

###### End of Exercise section

In [6]:
ssc.start() 
# ssc.awaitTermination()

-------------------------------------------
Time: 2018-03-01 03:48:44
-------------------------------------------
(3, 'f')

-------------------------------------------
Time: 2018-03-01 03:48:45
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 03:48:46
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 03:48:47
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 03:48:48
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 03:48:49
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 03:48:50
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 03:48:51
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 03:48:52
-------------------------------------------

In [7]:
ssc.stop(stopSparkContext=True, stopGraceFully=True)

-------------------------------------------
Time: 2018-03-01 03:49:25
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 03:49:26
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 03:49:27
-------------------------------------------



# References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#transformations-on-dstreams