# Spark (WIP)


**TODO: Merging with ds_pragmatic_programming_pyspark**

* How to run Spark using a Docker image

    * One option is the course Docker image: ucsddse230/cse255-dse230
 
```sh
docker run --name edx_big_data -it -p 8889:8888 -v /media/leandroohf/sdb1/leandro/edx_big_data_analytics_using_spark:/home/ucsddse230/ ucsddse230/cse255-dse230 /bin/bash

# To open another shell in the running container
docker exec -it edx_big_data /bin/bash

```
    * TODO: look for a simpler Spark image on Docker Hub


* refs:

    * https://courses.edx.org/courses/course-v1:BerkeleyX+CS105x+1T2016/course/
    * https://courses.edx.org/courses/course-v1:BerkeleyX+CS105x+1T2016/courseware/d1f293d0cb53466dbb5c0cd81f55b45b/fe9a95cc542d4c30b855e632663c4797/8?activate_block_id=block-v1%3ABerkeleyX%2BCS105x%2B1T2016%2Btype%40vertical%2Bblock%4083ff2d3b4e93489b9b7b4861811e0872



In [None]:
import numpy as np
import pandas as pd

from scipy import stats

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline 

import IPython
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

from sqlalchemy import create_engine
import datetime as dt


# Locate the Spark installation before importing pyspark
import findspark
findspark.init()

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType

# Start the SparkContext with 4 local worker threads
sc = SparkContext(master="local[4]")


In [None]:
!pwd
!ls data/Weather/

## Creation

In [None]:
# from a text file
!head data/Moby-Dick.txt

text_file = sc.textFile('data/Moby-Dick.txt')
type(text_file)


pair_rdd = sc.parallelize([(1,2), (3,4)])
print(pair_rdd.collect())

## Transformations

* Not every operation is safe to pass to Spark. The result must NOT depend on the order in which the elements are processed: order-dependent operations such as subtraction or division (for example inside a reduce) can give different results on different runs.

Transformations on (key,value) RDDs. **RDD $\to$ RDD**

### map, filter and sample

* **No** communication needed.
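
One way to see why no communication is needed is to emulate partitions with plain Python lists. This is only a sketch, not Spark code: the `partitions` list and the helpers `local_map` and `local_filter` are made-up names for illustration, and each inner list stands in for the data held by one worker.

```python
# Emulate an RDD split into partitions; each inner list lives on one worker.
partitions = [[1, 2, 3], [4, 2], [5, 6]]

def local_map(parts, f):
    # Each partition is transformed on its own; no element crosses partitions.
    return [[f(x) for x in part] for part in parts]

def local_filter(parts, pred):
    # Filtering is also local: a partition may shrink, but nothing moves.
    return [[x for x in part if pred(x)] for part in parts]

squared = local_map(partitions, lambda x: (x, x * x))
large = local_filter(partitions, lambda x: x > 3)

print(squared)
print(large)
```

Because every output element depends on a single input element, the work can be done wherever the data already sits.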

In [None]:
regular_rdd = sc.parallelize([1, 2, 3, 4, 2, 5, 6])

pair_rdd = regular_rdd.map( lambda x: (x, x*x) )
print(pair_rdd.collect())

print(regular_rdd.filter( lambda x: x > 3 ).collect())

# sample(withReplacement, fraction, seed)
print(regular_rdd.sample(True, 0.5, 11).collect())


rdd = sc.parallelize([(1,2), (2,4), (2,6)])
print("Original RDD :", rdd.collect())

# LHOF Notes
x = 3
print('list: ', list(range(x,x+2)))

# for each value v the lambda returns the list [v, v+1]; flatMapValues emits one (key, element) pair per list element
print("After transformation : ", rdd.flatMapValues(lambda x: list(range(x,x+2))).collect())

### groupByKey and reduceByKey

**Shuffles:** RDD $\to$ RDD, **shuffle** needed

**Shuffles are costly transformations**

* **Examples:** sort, distinct, repartition, sortByKey, reduceByKey, join [More](http://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations)
  * **A LOT** of communication might be needed.


**Properties of reduce operations**

* Reduce operations **must not depend on the order**
  * Order of operands should not matter
  * Order of application of reduce operator should not matter

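* Multiplication and summation are good; any order or grouping gives the same result:

```
        ((1 + 3) + 5) + 2 = 11             (5 + 3) + (1 + 2) = 11
```

* Division and subtraction are bad; the result depends on the grouping:

```
        ((1 - 3) - 5) - 2 = -9             (1 - 3) - (5 - 2) = -5
```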
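
The order-independence requirement can be checked with a small pure-Python sketch. It assumes a simplified model of a distributed reduce: the data is cut into contiguous chunks, each chunk is reduced locally, and the partial results are then reduced again. The helper `partitioned_reduce` is a hypothetical name, not a Spark function.

```python
from functools import reduce

def partitioned_reduce(values, op, num_partitions):
    # Split the data into contiguous chunks, reduce each chunk locally,
    # then reduce the partial results.
    size = -(-len(values) // num_partitions)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    partials = [reduce(op, chunk) for chunk in chunks]
    return reduce(op, partials)

data = [1, 3, 5, 2]
add = lambda a, b: a + b
sub = lambda a, b: a - b

# Same data, same operator, only the number of partitions changes.
sums = [partitioned_reduce(data, add, n) for n in (1, 2, 4)]
diffs = [partitioned_reduce(data, sub, n) for n in (1, 2, 4)]

print(sums)   # [11, 11, 11]
print(diffs)  # [-9, -5, -9]
```

Addition gives the same answer for every partitioning, while subtraction changes with the number of partitions, which is why it is unsafe as a reduce operator.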


**groupByKey():**
Returns a new RDD of `(key,<iterator>)` pairs where the iterator iterates over the values associated with the key.


[Iterators](http://anandology.com/python-practice-book/iterators.html) are Python objects that generate a sequence of values on demand. In Python 2, writing a loop over `n` elements as
```python
for i in range(n):
    pass  # do something
```
is inefficient because `range(n)` first allocates a list of `n` elements and then iterates over it; the iterator `xrange(n)` achieves the same result without materializing the list.
In Python 3, which this notebook uses, `range(n)` is already lazy and generates its elements on the fly (`xrange` no longer exists).

To materialize the list of values returned by an iterator we will use the list comprehension command:
```python
[a for a in <iterator>]
```
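
The difference between a lazy iterator and a materialized list can be illustrated with a plain Python generator. This is a generic sketch; `count_up_to` is a hypothetical helper and not part of Spark.

```python
def count_up_to(n):
    # A generator: values are produced one at a time, only when requested.
    i = 0
    while i < n:
        yield i
        i += 1

it = count_up_to(5)
first = next(it)          # consumes the value 0
rest = [a for a in it]    # materializes the remaining values
again = [a for a in it]   # the iterator is now exhausted

print(first, rest, again)  # 0 [1, 2, 3, 4] []
```

Materializing is what makes the values printable, at the cost of holding all of them in memory at once.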


In [None]:
# groupByKey returns (key, <iterable>) pairs

A = sc.parallelize([(1,3), (3,100),(1,-5),(3,2)])
print(A.groupByKey().mapValues(lambda x: [elem for elem in x]).collect())

# output (the order of keys and values may vary)
# [(1, [3, -5]), (3, [100, 2])]

print(A.groupByKey().map(lambda elem: (elem[0],[x for x in elem[1] ])).collect())

rdd = sc.parallelize([(1,2), (2,4), (2,6)])
print("Original RDD :", rdd.collect())
print("After transformation : ", rdd.reduceByKey(lambda a,b: a+b).collect())


rdd = sc.parallelize([(2,2), (1,4), (3,6)])
print("Original RDD :", rdd.collect())
print("After transformation : ", rdd.sortByKey().collect())

# Using sortBy
print("After transformation : ", rdd.sortBy(lambda x: x[1],ascending=False).collect())


### Operations on two RDDs

**subtractByKey**
Remove from RDD1 all elements whose key is present in RDD2.

In [None]:
# LHOF Notes

rdd1 = sc.parallelize([(1,2), (2,1), (2,2)])
rdd2 = sc.parallelize([(2,5), (3,1)])

print('rdd1: ', rdd1.collect())
print('rdd2: ', rdd2.collect())
print('subtractByKey: ', rdd1.subtractByKey(rdd2).collect())

print()
# Careful: subtract is a set-style operation on whole (key, value) elements, so only ("a", 3) is removed
x = sc.parallelize([("a", 1), ("b", 4), ("b", 5), ("a", 3)])
y = sc.parallelize([("a", 3), ("c", None)])

print('x: ', x.collect())
print('y: ', y.collect())
print('subtract: ', sorted(x.subtract(y).collect()))

**join**

* A fundamental operation in relational databases.
* assumes two tables have a **key** column in common. 
* merges rows with the same key.


When `join` is called on datasets of type `(Key, V)` and `(Key, W)`, it returns a dataset of `(Key, (V, W))` pairs with all pairs of elements for each key. The examples below refer to a left dataset with keys `John, Jill, Kate` and a right dataset with keys `Jill, Grace, John`.


There are four variants of `join`, which differ in how they treat keys that appear in one dataset but not the other.
* `join` is an *inner* join, which means that keys that appear in only one dataset are eliminated.
* `leftOuterJoin` keeps all keys from the left dataset even if they don't appear in the right dataset. The result of `leftOuterJoin` in our example will contain the keys `John, Jill, Kate`
* `rightOuterJoin` keeps all keys from the right dataset even if they don't appear in the left dataset. The result of `rightOuterJoin` in our example will contain the keys `Jill, Grace, John`
* `fullOuterJoin` keeps all keys from both datasets. The result of `fullOuterJoin` in our example will contain the keys `Jill, Grace, John, Kate`

In outer joins, if a key appears in only one dataset, the value in `(K,(V,W))` that would come from the other dataset is represented by `None`.
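
The four variants can be emulated in plain Python to make the `None` handling concrete. This is a sketch of the semantics only, not how Spark implements joins: `emulate_join` and `keys` are hypothetical helpers, and the values attached to each name are invented for illustration (only the keys come from the example above).

```python
def emulate_join(left, right, how="inner"):
    """Emulate the RDD join variants on plain lists of (key, value) pairs."""
    left_keys = {k for k, _ in left}
    out = []
    for k, v in left:
        matches = [w for k2, w in right if k2 == k]
        if matches:
            # Every matching pair of values is emitted for a shared key.
            out.extend((k, (v, w)) for w in matches)
        elif how in ("left", "full"):
            # Key only on the left: the missing right value becomes None.
            out.append((k, (v, None)))
    if how in ("right", "full"):
        # Keys only on the right: the missing left value becomes None.
        out.extend((k, (None, w)) for k, w in right if k not in left_keys)
    return out

# Invented values; only the keys follow the example in the text.
left = [("John", 1), ("Jill", 2), ("Kate", 3)]
right = [("Jill", 10), ("Grace", 20), ("John", 30)]

def keys(pairs):
    return sorted({k for k, _ in pairs})

for how in ("inner", "left", "right", "full"):
    print(how, keys(emulate_join(left, right, how)))
```

In Spark itself the order of the resulting pairs is not guaranteed, which is why the sketch compares sorted keys.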

In [None]:
# rightOuterJoin
print('rdd1=',rdd1.collect())
print('rdd2=',rdd2.collect())
print("Result:", rdd1.rightOuterJoin(rdd2).collect())

print()

# leftOuterJoin
print('rdd1=',rdd1.collect())
print('rdd2=',rdd2.collect())
print("Result:", rdd1.leftOuterJoin(rdd2).collect())


## Actions

Actions on (key,val) RDDs. **RDD $\to$ Python-object in head node.**

In [None]:
#  countByKey: returns dictionary
A = sc.parallelize([(1,3), (3,100),(1,-5),(3,2)])

A.countByKey()

# output (dictionary)
# {1:2, 3:2}

# lookup (key): returns the list of all of the values associated with key
A = sc.parallelize([(1,3), (3,100),(1,-5),(3,2)])

A.lookup(3)

# output (list)
# [100,2]

# collectAsMap(): like collect(), but returns a dictionary instead of a list of tuples
A = sc.parallelize([(1,3), (3,100),(1,-5),(3,2)])
A.collectAsMap()

# output (dictionary): a dict holds one value per key, so for duplicate keys
# only the last collected value survives
# {1: -5, 3: 2}

regular_rdd = sc.parallelize([1, 2, 3, 4, 2, 5, 6])

# takeSample(withReplacement, num, [seed])
print(regular_rdd.takeSample(True, 5, 11))


## Famous word count example (hello world)

In [None]:
## Famous word count example (hello world)

words = ['this', 'is', 'the', 'best', 'mac', 'ever']

wordRDD = sc.parallelize(words)

wordRDD.reduce(lambda w,v: w if len(w) < len(v) else v)

# another example is sum (the order does not matter)
B=sc.parallelize([1,3,5,2])

B.reduce(lambda x,y: x+y)
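
The word count itself can be sketched in plain Python, with each step labelled by the RDD operation it stands for (`flatMap`, `map`, `reduceByKey`). This is an emulation for illustration, not Spark code, and the two input lines are invented sample data.

```python
from collections import defaultdict

# Invented sample input; in Spark this would come from sc.textFile(...)
lines = ["this is the best mac ever", "this mac is the best"]

# flatMap: split every line into words and flatten into one sequence
words = [word for line in lines for word in line.split()]

# map: turn each word into a (word, 1) pair
pairs = [(word, 1) for word in words]

# reduceByKey: add up the counts that share the same word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(sorted(counts.items()))
```

Addition is used as the reduce operator, so the counts do not depend on the order in which the pairs are combined.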


## DataFrame

### Read and write from disk


* Parquet files (each "file" is actually a folder) are very popular and efficient for disk I/O
* Parquet data can be queried directly from disk
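
A minimal sketch of a Parquet round trip follows. It assumes a `SparkSession` named `spark` is available (the cells above only create a `SparkContext`), and the helper `parquet_roundtrip`, the sample rows and the path `data/people.parquet` are all hypothetical.

```python
def parquet_roundtrip(spark, path):
    """Write a tiny DataFrame to `path` as Parquet, then read it back two ways.

    `spark` is assumed to be an existing SparkSession.
    """
    # Invented sample rows, for illustration only.
    df = spark.createDataFrame([("Ann", 31), ("Bob", 45)], ["name", "age"])

    # Writing produces a folder of part files, not a single file.
    df.write.mode("overwrite").parquet(path)

    # Load the folder back into a DataFrame.
    loaded = spark.read.parquet(path)

    # Query the files in place, without registering a table first.
    older = spark.sql(f"SELECT name FROM parquet.`{path}` WHERE age > 40")
    return loaded, older

# Example call (requires pyspark, so it is left commented out):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.master("local[2]").getOrCreate()
#   loaded, older = parquet_roundtrip(spark, "data/people.parquet")
#   loaded.show(); older.show()
```

Passing the session in as an argument keeps the sketch independent of how Spark was started in the notebook.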