# Spark exercise
In this exercise, we'll use Spark from an IPython notebook to load data from a file, process it, and extract some information.

Here are some important links / references:
- [Spark documentation overview](https://spark.apache.org/documentation.html)
- [Spark Python API](https://spark.apache.org/docs/latest/api/python/index.html) / [Python RDD API](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD)
- [Spark programming guide](https://spark.apache.org/docs/latest/programming-guide.html)

## 1. Load data
In the current directory, you can find a CSV dump of the first 200000 numbers (1 through 200000, i.e. `range(1, 200001)`) with their FizzBuzz output. Load this data into an RDD as a text file and count the number of lines it has.

In [1]:
fizzbuzz = sc.textFile('fizzbuzz.csv')
fizzbuzz.count()

200000
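For reference, a file with this shape could be produced by a few lines of plain Python. This is only a sketch under assumptions: the helper names `fizzbuzz_value` and `fizzbuzz_lines` are hypothetical, and the `number,value` layout without a header row is assumed rather than taken from the actual generator of `fizzbuzz.csv`.

```python
def fizzbuzz_value(n):
    """Return the FizzBuzz output for n as a string."""
    if n % 15 == 0:
        return 'FizzBuzz'
    if n % 3 == 0:
        return 'Fizz'
    if n % 5 == 0:
        return 'Buzz'
    return str(n)


def fizzbuzz_lines(limit=200000):
    """Build one 'number,value' CSV line for each n in 1..limit (inclusive)."""
    return ['%d,%s' % (n, fizzbuzz_value(n)) for n in range(1, limit + 1)]


lines = fizzbuzz_lines()

# Writing the file is left commented out so the sketch has no side effects.
# with open('fizzbuzz.csv', 'w') as f:
#     f.write('\n'.join(lines) + '\n')
```

One line per number means the line count reported by `count()` should equal the upper bound of the range.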

## 2. Parse the CSV
Parse each line in the CSV into a tuple. Display the first 20 parsed tuples as a result.

In [2]:
parsed = fizzbuzz.map(lambda line: tuple(line.strip().split(',')))
parsed.take(20)

[(u'1', u'1'),
 (u'2', u'2'),
 (u'3', u'Fizz'),
 (u'4', u'4'),
 (u'5', u'Buzz'),
 (u'6', u'Fizz'),
 (u'7', u'7'),
 (u'8', u'8'),
 (u'9', u'Fizz'),
 (u'10', u'Buzz'),
 (u'11', u'11'),
 (u'12', u'Fizz'),
 (u'13', u'13'),
 (u'14', u'14'),
 (u'15', u'FizzBuzz'),
 (u'16', u'16'),
 (u'17', u'17'),
 (u'18', u'Fizz'),
 (u'19', u'19'),
 (u'20', u'Buzz')]
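The tuples above keep the number as a string. If a later step needs arithmetic on it, the parsing function can convert the first field to an integer. The sketch below assumes every line has exactly the form `number,value`; the name `parse_line` is hypothetical and not part of the exercise.

```python
def parse_line(line):
    """Split one 'number,value' line into an (int, str) tuple."""
    number, value = line.strip().split(',', 1)
    return (int(number), value)


# Small in-memory sample standing in for lines of the CSV file.
sample = ['1,1', '3,Fizz', '15,FizzBuzz\n']
parsed_sample = [parse_line(line) for line in sample]

# With the RDD from the exercise this would be used as:
# parsed_int = fizzbuzz.map(parse_line)
```

A named function is also easier to test outside Spark than an inline lambda, which is the main reason to prefer it here.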

## 3. Make a histogram of Fizz, Buzz, FizzBuzz or number
Count how often the values in the dataset are Fizz, Buzz, FizzBuzz or just a number. In other words, make a histogram of the types of values in the data.

In [3]:
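def fizzbuzz_key(fb):
    # Keep the three special values; map every plain number to 'Number'.
    if fb in ('Fizz', 'Buzz', 'FizzBuzz'):
        return fb
    return 'Number'

# Index into the pair instead of unpacking it in the lambda signature:
# tuple parameters such as `lambda (n, fb): ...` are Python 2 only syntax.
(parsed
 .map(lambda pair: (fizzbuzz_key(pair[1]), 1))
 .reduceByKey(lambda x, y: x + y)
 .collect())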

[(u'FizzBuzz', 13333), ('Number', 106667), (u'Fizz', 53333), (u'Buzz', 26667)]
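The counts can be cross-checked without Spark by recomputing the FizzBuzz values in plain Python and tallying them with `collections.Counter`. This is a sketch under the assumption that the file covers exactly the numbers 1 through 200000; `fizzbuzz_value` is a hypothetical helper, not part of the exercise.

```python
from collections import Counter


def fizzbuzz_value(n):
    """Return the FizzBuzz output for n as a string."""
    if n % 15 == 0:
        return 'FizzBuzz'
    if n % 3 == 0:
        return 'Fizz'
    if n % 5 == 0:
        return 'Buzz'
    return str(n)


def fizzbuzz_key(fb):
    """Bucket a value as 'Fizz', 'Buzz', 'FizzBuzz' or 'Number'."""
    if fb in ('Fizz', 'Buzz', 'FizzBuzz'):
        return fb
    return 'Number'


# Tally the bucket of every value for n in 1..200000 (inclusive).
counts = Counter(fizzbuzz_key(fizzbuzz_value(n)) for n in range(1, 200001))

# On the RDD, countByValue() gives the same tally without an explicit reduce:
# parsed.map(lambda pair: fizzbuzz_key(pair[1])).countByValue()
```

The four counts add up to 200000, which matches the line count from the first step.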