# countByValueAndWindow Transformation Exercise

This exercise gives an overview of the countByValueAndWindow transformation and then uses it to count requests per IP address over a sliding window.

| Transformation        | Meaning           |
| -------------:|:-------------|
| **countByValueAndWindow**(windowLength, slideInterval, [numTasks]) | When called on a DStream, returns a new DStream of (value, count) pairs, where each count is the number of times that distinct value occurs within the sliding window. As with reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. In PySpark the parameters are named windowDuration, slideDuration, and numPartitions. |

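To make the windowing semantics concrete, here is a minimal pure-Python sketch that mimics what the transformation computes. It does not use Spark. The function name `count_by_value_and_window` and the sample IP addresses are made up for illustration, and the window length and slide interval are expressed in numbers of batches rather than seconds, which is a simplifying assumption.

```python
from collections import Counter


def count_by_value_and_window(batches, window_length, slide_interval):
    """Simulate windowed value counts over a list of batches.

    batches[i] holds the elements that arrived in batch i. Both
    window_length and slide_interval are given in batches here
    (an assumption; Spark expresses them as durations in seconds).
    """
    results = []
    # One result is produced every `slide_interval` batches.
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        window_counts = Counter()
        # Count every element in the batches covered by this window.
        for batch in batches[start:end]:
            window_counts.update(batch)
        results.append(dict(window_counts))
    return results


# Hypothetical input: four batches of IP addresses.
batches = [
    ["10.0.0.1", "10.0.0.2"],  # batch 0
    ["10.0.0.1"],              # batch 1
    [],                        # batch 2
    ["10.0.0.2", "10.0.0.2"],  # batch 3
]

# A window of 3 batches that slides by 2 batches.
windows = count_by_value_and_window(batches, window_length=3, slide_interval=2)
for window in windows:
    print(window)
```

With a 2-second batch interval, a window of 3 batches sliding by 2 batches corresponds to the 6-second window and 4-second slide used in the exercise. Batch 1 falls into both windows because the window is longer than the slide.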
## Exercise

In [1]:
import findspark
# TODO: replace this path with the location of your own Spark installation.
findspark.init('/home/matthew/spark-2.1.0-bin-hadoop2.7')

In [2]:
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
import sys
import random
from apache_log_parser import ApacheAccessLog

In [3]:
conf = (SparkConf().setMaster("local[4]").setAppName("log processor").set("spark.executor.memory", "2g"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 2)
ssc.checkpoint("checkpoint")


In [4]:
# Create a DStream from text files.
# Note: Spark Streaming monitors this directory and only picks up files that
# appear after the stream has started. So start this program first, and then
# copy the log file into the monitored directory ('logs' here).
log_data = ssc.textFileStream('logs')
# Parse each line and drop the lines that could not be parsed.
access_log_dstream = log_data.map(ApacheAccessLog.parse_from_log_line).filter(lambda parsed_line: parsed_line is not None)
# Number of requests per IP address in each batch.
ip_dstream = access_log_dstream.map(lambda parsed_line: (parsed_line.ip, 1))
ip_count = ip_dstream.reduceByKey(lambda x, y: x + y)
ip_count.pprint(num=30)
# Total bytes sent per IP address in each batch.
ip_bytes_dstream = access_log_dstream.map(lambda parsed_line: (parsed_line.ip, parsed_line.content_size))
ip_bytes_sum_dstream = ip_bytes_dstream.reduceByKey(lambda x, y: x + y)
# Join the two into (ip, (request_count, bytes_sum)) pairs.
ip_bytes_request_count_dstream = ip_count.join(ip_bytes_sum_dstream)
ip_bytes_request_count_dstream.pprint(num=30)

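The `apache_log_parser` module imported above is not shown in this notebook. The sketch below is a hypothetical stand-in for illustration only, assuming log lines in Apache Common Log Format and a parser that returns `None` for lines it cannot parse (which is what the `filter` in the cell above suggests). Only the `ip` and `content_size` field names come from the notebook; `LogEntry`, `LOG_PATTERN`, `parse_log_line`, the remaining fields, and the sample line are assumptions.

```python
import re
from collections import namedtuple

# Hypothetical record type; only `ip` and `content_size` are used in the notebook.
LogEntry = namedtuple(
    "LogEntry", ["ip", "method", "endpoint", "response_code", "content_size"]
)

# Assumed Common Log Format: host ident user [time] "method endpoint protocol" status size
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-)'
)


def parse_log_line(line):
    """Return a LogEntry, or None if the line does not match the pattern."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    ip, method, endpoint, code, size = match.groups()
    # A size of "-" means no content was returned; treat it as 0 bytes.
    content_size = 0 if size == "-" else int(size)
    return LogEntry(ip, method, endpoint, int(code), content_size)


# Made-up sample line for illustration.
sample = '127.0.0.1 - - [01/Mar/2018:19:12:28 +0000] "GET /index.html HTTP/1.1" 200 2326'
entry = parse_log_line(sample)
print(entry.ip, entry.content_size)
print(parse_log_line("not a log line"))
```

Returning `None` instead of raising keeps a single malformed line from failing the whole batch; the unparsable lines are simply filtered out downstream.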
In [5]:
####### TODO: Windowed count operation using countByValueAndWindow() ###########
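# Keep only the IP address of each request, then count how often each address
# occurs in a 6-second window that slides every 4 seconds. Both durations must
# be multiples of the 2-second batch interval set on the StreamingContext.
ip_dstream = access_log_dstream.map(lambda entry: entry.ip)
ip_address_request_count = ip_dstream.countByValueAndWindow(windowDuration=6, slideDuration=4)
ip_address_request_count.pprint()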

####### Exercise End ##########################################################

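The window and slide durations interact with the batch interval of the StreamingContext: both must be multiples of it. The sketch below checks that constraint and lists which batches each window would cover for the values used in this exercise. The helper `window_batches` is hypothetical, and identifying batches by their end time in seconds since the stream started is a simplifying assumption about alignment, not a description of Spark internals.

```python
def window_batches(batch_interval, window_duration, slide_duration, num_windows):
    """List the batch end times (in seconds) covered by each window."""
    for name, value in (("window", window_duration), ("slide", slide_duration)):
        if value % batch_interval != 0:
            raise ValueError(
                f"{name} duration {value}s is not a multiple of "
                f"the {batch_interval}s batch interval"
            )
    windows = []
    for i in range(1, num_windows + 1):
        window_end = i * slide_duration
        window_start = max(0, window_end - window_duration)
        # Each batch is identified by the time at which it ends.
        batch_ends = list(
            range(window_start + batch_interval, window_end + 1, batch_interval)
        )
        windows.append((window_end, batch_ends))
    return windows


# Values from this exercise: 2s batches, 6s window, 4s slide.
schedule = window_batches(
    batch_interval=2, window_duration=6, slide_duration=4, num_windows=3
)
for window_end, batch_ends in schedule:
    print(f"window ending at {window_end}s covers batches ending at {batch_ends}")
```

Because the window (6 seconds) is longer than the slide (4 seconds), consecutive windows share one batch, so a request in that batch is counted in two successive results.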
In [6]:
ssc.start() 
# ssc.awaitTermination()

-------------------------------------------
Time: 2018-03-01 19:12:28
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 19:12:28
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 19:12:30
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 19:12:30
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 19:12:30
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 19:12:32
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 19:12:32
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 19:12:34
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 19:12:34
-------------------------------------------

In [7]:
ssc.stop(stopSparkContext=True, stopGraceFully=True)

-------------------------------------------
Time: 2018-03-01 19:12:58
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 19:12:58
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 19:13:00
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 19:13:00
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 19:13:02
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 19:13:02
-------------------------------------------

-------------------------------------------
Time: 2018-03-01 19:13:02
-------------------------------------------



## References
1. https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations