In [9]:
import pyspark as ps
import json

spark = (ps.sql.SparkSession.builder
         .master("local[4]")
         .appName("morning sprint")
         .getOrCreate()
         )
sc = spark.sparkContext

Spark operates on **[Resilient Distributed Datasets](http://spark.apache.org/docs/latest/programming-guide.html#resilient-distributed-datasets-rdds) (RDDs)**. An RDD is
a **collection of data partitioned across machines**. Because the data is split
into partitions, it can be processed in parallel. RDDs can be created from
a SparkContext in two ways: by loading an external dataset, or by parallelizing
an existing collection of objects in your currently running program (in our
Python programs, this is often a list).
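
To make the idea of partitioning concrete, here is a minimal plain-Python analogy (not Spark code). The helper `split_into_partitions` is a hypothetical name introduced only for illustration, and Spark's own partitioning logic may split the data differently.

```python
def split_into_partitions(items, num_partitions):
    """Split a list into num_partitions contiguous chunks of near-equal size."""
    n = len(items)
    return [
        items[(i * n) // num_partitions:((i + 1) * n) // num_partitions]
        for i in range(num_partitions)
    ]


partitions = split_into_partitions([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 4)

# Each chunk can be processed independently (in Spark, by a different worker),
# and the partial results are combined at the end.
partial_sums = [sum(chunk) for chunk in partitions]
total = sum(partial_sums)
```

In Spark itself, `sc.parallelize` accepts an optional `numSlices` argument to request a number of partitions, and `rdd.glom().collect()` returns the contents grouped by partition.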


* Create an RDD from a Python list.

In [10]:
lst_rdd = sc.parallelize([1, 2, 3])

* Read an RDD in from a text file. **By default, the RDD will treat each line
as an item and read it in as a string.**


In [11]:
dir_link = '/home/asus/DSI_Lectures/spark/natalie_hunt/'
file_rdd = sc.textFile(dir_link + 'data/cookie_data.txt')

In [12]:
file_rdd.first() # Returns the first entry in the RDD
file_rdd.take(2) # Returns the first two entries in the RDD as a list

['{"Jane": "2"}', '{"Jane": "1"}']

  To retrieve all the items in your RDD, every partition in the RDD has to be
  accessed, which can take a long time. Before you execute commands like the
  following, which retrieve every item, you should know roughly how many
  entries you are pulling. Keep in mind that calling the `.collect()` method
  on an RDD (as we do below) requires the entire dataset to fit in memory in
  your driver program, so in general we don't want to call `.collect()` on
  very large datasets.

  The standard workflow when working with RDDs is to perform all the big data
  operations/transformations **before** you pool/retrieve the results. If the
  results can't be collected onto your driver program, it's common to write
  data out to a distributed storage system, like HDFS or S3.
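
One way to act on this advice is to check the size before collecting. The sketch below is an assumption-laden illustration: `collect_if_small` and its `max_items` threshold are hypothetical names, and `ListBackedRDD` is only a tiny stand-in so the example runs without Spark. It relies on nothing beyond the `.count()`, `.collect()`, and `.take()` actions that RDDs provide.

```python
def collect_if_small(rdd, max_items=1000):
    """Collect everything only if the RDD holds at most max_items entries;
    otherwise return just the first max_items entries."""
    n = rdd.count()  # an action: counts entries without pulling them all to the driver
    if n <= max_items:
        return rdd.collect()
    return rdd.take(max_items)


class ListBackedRDD:
    """Stand-in exposing the three methods used above (illustration only, not Spark)."""

    def __init__(self, items):
        self._items = list(items)

    def count(self):
        return len(self._items)

    def collect(self):
        return list(self._items)

    def take(self, n):
        return self._items[:n]


small = collect_if_small(ListBackedRDD(range(5)), max_items=10)
capped = collect_if_small(ListBackedRDD(range(5000)), max_items=10)
```

Note that `.count()` is itself an action and scans the whole dataset, so on a real cluster this check has a cost of its own.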

  With that said, we can retrieve all the items from our RDD as follows:

In [13]:
file_rdd.collect()

['{"Jane": "2"}',
 '{"Jane": "1"}',
 '{"Pete": "20"}',
 '{"Tyler": "3"}',
 '{"Duncan": "4"}',
 '{"Yuki": "5"}',
 '{"Duncan": "6"}',
 '{"Duncan": "4"}',
 '{"Duncan": "5"}']

In [14]:
lst_rdd.collect()

[1, 2, 3]

## Part 2: Intro to Functional Programming

Spark operations fit within the [functional programming paradigm](https://en.wikipedia.org/wiki/Functional_programming).
In terms of our RDD objects, this means that our RDD objects are immutable and that
anytime we apply a **transformation** to an RDD (such as `.map()`, `.reduceByKey()`,
or `.filter()`) it returns another RDD.

Transformations in Spark are lazy: performing a transformation does
not cause any computation to be performed. Instead, an RDD remembers the chain of
transformations that you define and computes them all only when an action requires
a result to be returned.
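
Laziness can be illustrated without Spark using Python's built-in `map` and `filter`, which also return lazy iterators. This is only an analogy: the `log` list and the `double` function are made up for the demonstration, and `list(...)` plays the role of an action such as `.collect()`.

```python
log = []


def double(x):
    """Double x and record that the function was actually called."""
    log.append(f"double({x})")
    return 2 * x


# Defining the chain runs nothing yet: map() and filter() return lazy iterators.
pipeline = filter(lambda v: v > 2, map(double, [1, 2, 3]))
calls_before = len(log)

# Consuming the iterator forces the whole chain to be evaluated.
result = list(pipeline)
calls_after = len(log)
```

Here `calls_before` is 0 because no element has been processed when the chain is defined; `double` is called only once the result is requested.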

**Spark notes**:

  * A lot of Spark's functionalities assume the items in an RDD to be tuples
  of `(key, value)` pairs, so it is often useful to structure your
  RDDs this way.
  * Beware of [lazy evaluation](https://en.wikipedia.org/wiki/Lazy_evaluation), where transformations
  on the RDD are not executed until an **action** is executed on the RDD
  to retrieve items from it (such as `.collect()`, `.first()`, `.take()`, or
  `.count()`). So if you are doing a lot of transformations in a row, it can
  be helpful to call `.first()` in between to ensure your transformations are
  running properly.
  * If you are not sure what RDD transformations/actions there are, you can
  check out the [docs](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD).
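
To see why `(key, value)` pairs are convenient, here is a plain-Python sketch of what a by-key reduction such as `.reduceByKey()` computes. The function `reduce_by_key` and the sample data are hypothetical; Spark performs the merge across partitions, requires the merge function to be associative and commutative, and does not promise any particular output order.

```python
from functools import reduce


def reduce_by_key(pairs, func):
    """Merge the values of each key pairwise with func, keeping first-seen key order."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    return [(key, reduce(func, values)) for key, values in grouped.items()]


pairs = [("Jane", 2), ("Jane", 1), ("Pete", 20), ("Duncan", 4), ("Duncan", 6)]
totals = reduce_by_key(pairs, lambda a, b: a + b)
```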

Turn the items in `file_rdd` into `(key, value)` pairs using `.map()`. To do that, you'll find a template function `parse_json_first_key_pair` in the `spark_intro.py` file. Implement this function so that it takes a JSON-formatted string (use `json.loads()`) and returns the `(key, value)` pair you need. Test it with the string `u'{"Jane": "2"}'`; your function should return `(u'Jane', 2)`. **Remember to cast the value to type** `int`.

In [24]:
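def parse_json_first_key_pair(json_string):
    """Return the first (key, int(value)) pair of a JSON object string."""
    json_data = json.loads(json_string)
    # Return on the first iteration: only the first pair is needed.
    for k, v in json_data.items():
        return (k, int(v))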


In [25]:
parse_json_first_key_pair('{"Jane": "2"}')

('Jane', 2)

In [46]:
x = 99
y = 10


In [73]:
file_rdd = sc.textFile(dir_link + 'data/cookie_data.txt')\
            .map(parse_json_first_key_pair)\
            .filter(lambda x: x[1] >= 5)\
            .reduceByKey(lambda x, y: max(x, y))\
            .sortBy(lambda x: x[1])
file_rdd.collect()

[('Yuki', 5), ('Duncan', 6), ('Pete', 20)]

In [74]:
file_rdd.take(2)

[('Yuki', 5), ('Duncan', 6)]
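
The chained pipeline above can be cross-checked in plain Python. The sketch below replays each step on the nine lines shown in the earlier `.collect()` output; `first_pair` is a hypothetical stand-in for `parse_json_first_key_pair`, and the list and dict operations only mimic what the corresponding RDD methods compute.

```python
import json

lines = [
    '{"Jane": "2"}', '{"Jane": "1"}', '{"Pete": "20"}',
    '{"Tyler": "3"}', '{"Duncan": "4"}', '{"Yuki": "5"}',
    '{"Duncan": "6"}', '{"Duncan": "4"}', '{"Duncan": "5"}',
]


def first_pair(json_string):
    """Return the first (key, int(value)) pair of a JSON object string."""
    key, value = next(iter(json.loads(json_string).items()))
    return key, int(value)


pairs = [first_pair(line) for line in lines]          # mimics .map(...)
kept = [(k, v) for k, v in pairs if v >= 5]           # mimics .filter(...)

best = {}
for k, v in kept:                                     # mimics .reduceByKey(max)
    best[k] = max(v, best.get(k, v))

result = sorted(best.items(), key=lambda kv: kv[1])   # mimics .sortBy(value)
```

`result` matches the list returned by `file_rdd.collect()`, and `result[:2]` matches `file_rdd.take(2)`.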