## Task 2. Heuristic user segmentation

In this task you should first parse the user logs, then distinguish the segments and count the *unique* uids in each segment. Sort the output by count.

You may find more useful methods in the following sources:

* Book "Learning Spark: Lightning-Fast Big Data Analysis" by Holden Karau.

* [SparkStreaming Documentation](https://spark.apache.org/docs/latest/api/python/pyspark.streaming.html#pyspark-streaming-module)

* [HyperLogLog documentation](https://pypi.org/project/hyperloglog/), [HyperLogLog theory](https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html)

* [ua-parser documentation](https://pypi.org/project/ua-parser/0.7.0/)

In [1]:
import os
from time import sleep
from pyspark import SparkContext, SparkConf
from pyspark.streaming import StreamingContext

# You also need these specific libraries
from ua_parser import user_agent_parser
from hyperloglog import HyperLogLog


In [2]:
# # Here is an example of `user_agent_parser`
# from ua_parser import user_agent_parser
# ua = 'Mozilla/5.0 (iPad; CPU OS 7_0_4 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11B554a Safari/9537.53'
# user_agent_parser.Parse(ua)

**NB.** Please don't change the cell below. It emulates batches arriving in real time. Do read through the code, though: it will help you when you work with real SparkStreaming applications.

In [3]:
sc = SparkContext(master='local[4]')

# Preparing batches with the input data
DATA_PATH = "/data/course4/uid_ua_100k_splitted_by_5k"
batches = [sc.textFile(os.path.join(DATA_PATH, path)) for path in os.listdir(DATA_PATH)]

# Creating Dstream to emulate realtime data generating
BATCH_TIMEOUT = 5 # Timeout between the batch generation
ssc = StreamingContext(sc, BATCH_TIMEOUT)
dstream = ssc.queueStream(rdds=batches)

There are 2 flags used in this task.
* `finished` is set once an empty RDD arrives, which means the stream has ended.
* `printed` indicates that the result has been printed and the SparkStreaming context can be stopped.

In [5]:
finished = False
printed = False

def set_ending_flag(rdd):
    global finished
    if rdd.isEmpty():
        finished = True

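def print_only_at_the_end(rdd):
    global printed
    rdd.count()  # run an action so this batch's RDD is evaluated
    if finished and not printed:
        # Sort the segments by unique-user count, largest first
        ans = rdd.sortBy(lambda x: x[1], ascending=False).collect()
        for key in ans:
            print("seg_" + key[0], key[1])
        printed = True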

# Once an empty RDD has been received, the stream is finished,
# so the result can be printed and the context stopped.

dstream.foreachRDD(set_ending_flag)

In [6]:
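def aggregator(values, old):
    # Update function for updateStateByKey: `old` is None the first time a key is seen
    if old is None:
        old = HyperLogLog(0.01)  # target relative error of about 1%

    # Add this batch's uids to the per-segment sketch
    for value in values:
        old.add(value)

    return old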
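
The way `updateStateByKey` threads state through the batches can be emulated without Spark. The sketch below is a plain-Python stand-in, not the Spark implementation: `update_state` and `run_batches` are hypothetical names, and an exact `set` replaces the `HyperLogLog` sketch so that the example runs with the standard library only. As in Spark, the update function receives the new values of a key for the current batch together with the previous state, which is `None` for a key seen for the first time.

```python
def update_state(new_values, old_state):
    """Same shape as an updateStateByKey update function: (new values, old state) -> new state."""
    state = set() if old_state is None else old_state
    state.update(new_values)
    return state


def run_batches(batches):
    """Emulate stateful per-key aggregation over a sequence of batches of (key, value) pairs."""
    state = {}
    for batch in batches:
        # Group the values of the current batch by key
        grouped = {}
        for key, value in batch:
            grouped.setdefault(key, []).append(value)
        # Call the update function for every known key, including keys
        # that received no new values in this batch (empty list)
        for key in set(state) | set(grouped):
            state[key] = update_state(grouped.get(key, []), state.get(key))
    return state


# Two illustrative batches of (segment, uid) pairs; the data is made up
batches = [
    [("firefox", "u1"), ("windows", "u1"), ("firefox", "u2")],
    [("firefox", "u1"), ("iphone", "u3")],
]
state = run_batches(batches)
counts = sorted(((seg, len(uids)) for seg, uids in state.items()),
                key=lambda pair: pair[1], reverse=True)
for seg, count in counts:
    print("seg_" + seg, count)
```

The uid `u1` appears under `firefox` in both batches but is counted once, which is the behaviour the task asks for. Swapping the `set` for a `HyperLogLog` object trades exactness for bounded memory per segment.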

In [7]:
def preprocess(line):
    # Each input line is "<uid>\t<user agent string>"
    line_split = line.split("\t")
    user_hash = line_split[0]
    parsed_line = user_agent_parser.Parse(line_split[1])

    device = parsed_line['device']['family'].lower()
    browser = parsed_line['user_agent']['family'].lower()
    # Named os_family so that it does not shadow the imported `os` module
    os_family = parsed_line['os']['family'].lower()

    # Emit one (segment candidate, uid) pair per attribute
    return [
        (device, user_hash),
        (browser, user_hash),
        (os_family, user_hash),
    ]
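
To see what `preprocess` emits for one log line, the sketch below repeats the same steps on a hand-written stand-in for the parser result, so it runs without `ua_parser` installed. The helper name `to_segment_pairs`, the uid `uid_42` and the dictionary values are assumptions made for illustration; only the three keys (`device`, `user_agent`, `os`, each with a `family` field) mirror what the cell above reads.

```python
def to_segment_pairs(uid, parsed):
    """Turn one parsed user agent into (lower-cased family, uid) pairs."""
    return [
        (parsed["device"]["family"].lower(), uid),
        (parsed["user_agent"]["family"].lower(), uid),
        (parsed["os"]["family"].lower(), uid),
    ]


# A log line has the form "<uid>\t<user agent string>"
line = "uid_42\tExampleAgent/1.0"
uid, ua_string = line.split("\t")

# Hand-written stand-in for user_agent_parser.Parse(ua_string);
# the values are illustrative, not captured parser output
parsed = {
    "device": {"family": "iPad"},
    "user_agent": {"family": "Mobile Safari"},
    "os": {"family": "iOS"},
}

pairs = to_segment_pairs(uid, parsed)
print(pairs)
```

Each line therefore contributes up to three keys to the stream, and the same uid can fall into several segments at once.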

In [8]:
def filter_devices(filter_tuple):
    word, _ = filter_tuple
    return word=="iphone" or word=="firefox" or word=="windows"

In [9]:
# Type your code for data processing and aggregation here


dstream.flatMap(preprocess)\
       .filter(filter_devices)\
       .updateStateByKey(aggregator)\
       .map(lambda pair: (pair[0], len(pair[1])))\
       .foreachRDD(print_only_at_the_end)

**NB.** Please don't change the cell below. It stops the SparkStreaming context and the Spark context once the stream has finished.

In [None]:
ssc.checkpoint('./checkpoint')  # checkpoint for storing current state        
ssc.start()
while not printed:
    pass
ssc.stop()  # once the result has been printed, stop the SparkStreaming context
sc.stop()  # stop the Spark context so the code can be rerun without restarting the kernel

Here is part of the output on the sample dataset:
```
...
seg_unknown 22377
seg_firefox 8237
...
```
Of course, your numbers may differ slightly (an error of about 1% is acceptable).
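
The 1% tolerance matches the `HyperLogLog(0.01)` setting used in the aggregator. As a rough illustration, the sketch below uses the usual HyperLogLog standard-error estimate of about 1.04/sqrt(m) for m registers to find the smallest power-of-two register count that meets a target error, and then checks an estimate against a reference count. The function names are hypothetical, the reference values `22300` and `8000` are made up for the example, and the sizing rule is an assumption about how such sketches are typically dimensioned rather than a statement about the `hyperloglog` package internals.

```python
import math


def registers_for_error(error_rate):
    """Smallest power-of-two register count m with 1.04 / sqrt(m) <= error_rate."""
    precision = math.ceil(math.log2((1.04 / error_rate) ** 2))
    return 1 << precision


def within_tolerance(estimate, reference, tolerance=0.01):
    """True if the relative error of `estimate` against `reference` is within `tolerance`."""
    return abs(estimate - reference) <= tolerance * reference


m = registers_for_error(0.01)
print("registers:", m)
print("expected standard error:", 1.04 / math.sqrt(m))

# Hypothetical exact counts to compare the estimates against
print(within_tolerance(22377, 22300))
print(within_tolerance(8237, 8000))
```

The standard error is a statistical figure, so an individual estimate can occasionally land somewhat outside it.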