# Spark Tutorial

Author: Matthew K. MacLeod


### Tutorial goals:

* background
* configuration
* introduction
* HQL
* machine learning
* streaming

## Background

Spark is an open source project created in 2009 by the UC Berkeley RAD lab.

Spark is a distributed computing framework that generalizes the MapReduce paradigm and keeps working data in memory where possible.

PySpark API reference: http://spark.apache.org/docs/latest/api/python/pyspark.html

Spark allows for:
* interactive queries (Spark SQL and Hive)
* stream processing
* data analytics (MLlib)
* graph processing (GraphX)


## Configuration

Download and install Spark, then build it with Hive support:
    
    cd spark-1.5.2
    
    ./sbt/sbt -Phive assembly

Set environment variables:

    export SPARK_HOME="$HOME/programs/spark/spark-1.5.2"
    
    export PYSPARK_SUBMIT_ARGS="--master local[4] pyspark-shell"


In [1]:
# double check env 
!echo $SPARK_HOME

/home/matej/programs/spark/spark-1.5.2


We will load the Spark configuration from inside the notebook, so there is no need to configure an IPython profile (profile-based setup has issues). Start the notebook:

    ipython notebook

In [2]:
import os
import sys

In [3]:
# spark configuration
spark_home = os.environ.get('SPARK_HOME')
if spark_home is None:
    raise EnvironmentError("SPARK_HOME is not set")

# make pyspark and its bundled py4j importable
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# make sure PYSPARK_SUBMIT_ARGS ends with "pyspark-shell" before the shell starts
spark_release_file = os.path.join(spark_home, 'RELEASE')
if os.path.exists(spark_release_file) and "Spark 1.5" in open(spark_release_file).read():
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if "pyspark-shell" not in pyspark_submit_args:
        pyspark_submit_args += " pyspark-shell"
        os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# run pyspark's shell.py, which creates sc and sqlContext
filename = os.path.join(spark_home, 'python/pyspark/shell.py')
with open(filename, "rb") as f:
    exec(compile(f.read(), filename, 'exec'))


Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 3.4.3 (default, Oct 19 2015 21:52:17)
SparkContext available as sc, HiveContext available as sqlContext.


## Introduction to Spark

We will run Spark interactively in Jupyter.

Normally, a PySpark script is run with spark-submit:

    bin/spark-submit ps.py

In [4]:
from pyspark import SparkContext

In [13]:
spark_home = os.environ.get('SPARK_HOME')

text_file = sc.textFile(spark_home + "/README.md")

word_counts = text_file \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

word_counts.first()

('guide,', 1)

**collect** returns a list containing all of the elements in the RDD. Because the whole result is brought back to the driver, use it only on small RDDs.

###  RDDs

An RDD (_resilient distributed dataset_) is an immutable, distributed collection of objects. Because RDDs are partitioned, their partitions can be computed on different nodes in the cluster.

**transformations**, in Spark-speak, operate on RDDs and return new RDDs. They are evaluated lazily: nothing is computed until an action needs the result. map() and filter() are transformations.
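
Lazy evaluation can be illustrated without a cluster. The following is only a plain-Python analogy, not Spark code: Python 3's builtin map() and filter() are lazy iterators, so building the pipeline does no work until something consumes it, much like an action triggers an RDD's transformations. The names `double`, `calls` and `pipeline` are made up for this sketch.

```python
# Plain-Python analogy: builtin map() and filter() are lazy in Python 3.
calls = []  # records every element that is actually processed


def double(x):
    calls.append(x)
    return 2 * x


# "Transformations": the pipeline is only described here, nothing runs yet.
pipeline = filter(lambda v: v > 4, map(double, [1, 2, 3, 4]))
calls_before = len(calls)

# "Action": consuming the pipeline forces the computation.
result = list(pipeline)
calls_after = len(calls)

print(calls_before, result, calls_after)
```

The first printed number shows that no element was touched while the pipeline was being built; all the work happens inside list().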

### Transformations on one pair RDD

* reduceByKey
* groupByKey
* combineByKey
* mapValues
* flatMapValues
* keys()
* values()
* sortByKey


### NB 

    reduceByKey() and foldByKey()

automatically combine values locally on each machine before computing the global totals for each key.

    combineByKey()

lets you customize the combining behavior.
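
To make the local-then-global combining concrete, here is a minimal plain-Python sketch of the three-function contract that combineByKey() expects. The helper `combine_by_key` and the two hand-made partitions are hypothetical and exist only for illustration; real Spark distributes the partitions and shuffles data between the two steps.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Hypothetical helper that mimics the contract of RDD.combineByKey."""
    # Step 1: combine locally inside each partition.
    local_results = []
    for partition in partitions:
        combiners = {}
        for key, value in partition:
            if key in combiners:
                combiners[key] = merge_value(combiners[key], value)
            else:
                combiners[key] = create_combiner(value)
        local_results.append(combiners)

    # Step 2: merge the per-partition combiners into global results.
    merged = {}
    for combiners in local_results:
        for key, comb in combiners.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], comb)
            else:
                merged[key] = comb
    return merged


# Two made-up partitions of (key, value) pairs.
partitions = [[("a", 3), ("b", 4)], [("a", 1), ("a", 5)]]

# Build a (sum, count) pair per key, then derive the average.
sum_count = combine_by_key(
    partitions,
    create_combiner=lambda v: (v, 1),
    merge_value=lambda acc, v: (acc[0] + v, acc[1] + 1),
    merge_combiners=lambda x, y: (x[0] + y[0], x[1] + y[1]),
)
averages = {k: s / n for k, (s, n) in sum_count.items()}
print(averages)
```

With a live SparkContext, the same three lambdas can be passed to rdd.combineByKey(createCombiner, mergeValue, mergeCombiners) on a pair RDD.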


### Transformations on two pair RDDs
* subtractByKey
* join
* rightOuterJoin
* leftOuterJoin
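
The semantics of these four operations can be sketched on ordinary Python lists. This is an illustration only: the data and variable names are made up, and Spark does not guarantee the order of the resulting pairs, whereas the list comprehensions below happen to preserve input order.

```python
left = [("a", 1), ("b", 2), ("c", 3)]
right = [("a", "x"), ("a", "y"), ("d", "z")]

# Index each side by key so lookups are cheap.
left_by_key = {}
for k, v in left:
    left_by_key.setdefault(k, []).append(v)
right_by_key = {}
for k, w in right:
    right_by_key.setdefault(k, []).append(w)

# join: only keys present on both sides, one pair per matching combination.
join = [(k, (v, w)) for k, v in left for w in right_by_key.get(k, [])]

# leftOuterJoin: every left pair survives; missing right values become None.
left_outer = [(k, (v, w)) for k, v in left for w in right_by_key.get(k, [None])]

# rightOuterJoin: every right pair survives; missing left values become None.
right_outer = [(k, (v, w)) for k, w in right for v in left_by_key.get(k, [None])]

# subtractByKey: left pairs whose key does not appear on the right.
subtract = [(k, v) for k, v in left if k not in right_by_key]

print(join, left_outer, right_outer, subtract, sep="\n")
```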


In [6]:
lines = sc.textFile(spark_home +"/README.md")
lines.count()

98

In [7]:
pythonlines = lines.filter(lambda line: "Python" in line)
pythonlines.first()

'high-level APIs in Scala, Java, Python, and R, and an optimized engine that'

In [None]:
pairs = lines.map(lambda x: (x.split(" ")[0], x))

In [None]:
result = pairs.filter(lambda keyValue: len(keyValue[1]) < 20)

In [None]:
rdd = sc.textFile(spark_home +"/README.md")
words = rdd.flatMap(lambda x: x.split(" "))

In [None]:
#result = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)

**parallelize** distributes a local Python collection to form an RDD. When the input is a range, using xrange (range in Python 3) is recommended for performance.

In [None]:
data = [("a", 3), ("b", 4), ("a", 1)]
sc.parallelize(data).reduceByKey(lambda x, y: x + y)      # Default parallelism
sc.parallelize(data).reduceByKey(lambda x, y: x + y, 10)  # Custom parallelism

In [None]:
#rdd.sortByKey(ascending=True, numPartitions=None, keyfunc = lambda x: str(x))

In [None]:
rdd = sc.parallelize([(1, 2), (3, 6), (3, 4)],2)
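grouped = rdd.groupByKey()          # transformations return a new RDD; rdd itself is unchanged
grouped.mapValues(list).collect()   # turn each group's values into a list for display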

In [None]:
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)],2)
summed = rdd.reduceByKey(lambda x, y: x + y)   # keep the new RDD; rdd itself is unchanged
summed.collect()

In [None]:
print(rdd.keys().collect())
print(rdd.values().collect())
print(rdd.sortByKey().collect())

### flatMap() vs map()

    map:  produces exactly one output element for each input element

    flatMap:  produces zero or more output elements for each input element, flattened into one RDD


In [9]:
lines = sc.parallelize(["this is the first line", "hello second line", "third guy"])
words = lines.flatMap(lambda line: line.split(" "))
words.collect()

['this',
 'is',
 'the',
 'first',
 'line',
 'hello',
 'second',
 'line',
 'third',
 'guy']
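
For contrast, the same input can be pushed through both operations in plain Python. This is an analogy rather than Spark code, and the variable names are made up: a list comprehension stands in for map(), and itertools.chain.from_iterable stands in for the flattening step of flatMap().

```python
from itertools import chain

sample_lines = ["this is the first line", "hello second line", "third guy"]

# map-like: one output element (a list of words) per input line.
mapped = [line.split(" ") for line in sample_lines]

# flatMap-like: the per-line lists are flattened into a single sequence.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in sample_lines))

print(len(mapped), len(flat_mapped))
```

The mapped result keeps one entry per line, while the flattened result has one entry per word.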

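In [10]:
# filter: keep only the elements that satisfy the predicate
lines.filter(lambda line: "line" in line).collect()

In [12]:
# distinct: drop duplicate elements
words.distinct().collect()

In [None]:
# sample with replacement (fraction = expected number of times each element is chosen)
words.sample(True, 0.5).collect()


In [None]:
# sample without replacement (fraction = probability of keeping each element)
words.sample(False, 0.5).collect()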

###  RDD Actions

**actions**, in Spark-speak, are computations on RDDs that return a result to the driver. This is where something is actually done with the data, e.g.,

* take()
* first()
* count()
* reduce()
* aggregate()
* fold()
* collect()
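
Of these, aggregate() is the least obvious, so here is a minimal plain-Python sketch of how it behaves. The helper `aggregate` and the hand-made partitions are hypothetical; the sketch assumes the behavior described in the PySpark API docs, where seqOp folds the elements of each partition starting from zeroValue and combOp then merges the per-partition results.

```python
from functools import reduce


def aggregate(partitions, zero_value, seq_op, comb_op):
    """Hypothetical helper that mimics RDD.aggregate(zeroValue, seqOp, combOp)."""
    # Fold each partition on its own, starting from the zero value.
    per_partition = [reduce(seq_op, part, zero_value) for part in partitions]
    # Merge the partition results, again starting from the zero value.
    return reduce(comb_op, per_partition, zero_value)


# Two made-up partitions of numbers.
partitions = [[1, 2], [3, 4, 5]]

# Compute (sum, count) in one pass, then the mean.
total, count = aggregate(
    partitions,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
mean = total / count
print(total, count, mean)
```

The accumulator type, here a (sum, count) tuple, may differ from the element type, which is what distinguishes aggregate() from reduce() and fold().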

In [None]:
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)],2)
rdd.countByKey()

#### numeric RDD operations
* count()
* mean()
* sum()
* max()
* min()
* variance()
* sampleVariance()
* stdev()
* sampleStdev()
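
The paired names differ in the divisor. As a sketch, assuming (as the PySpark API docs describe) that variance() and stdev() divide by N while sampleVariance() and sampleStdev() divide by N - 1, Python's statistics module shows the same distinction on a small made-up list:

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0]

# Population variance divides by N, like variance().
population_variance = statistics.pvariance(values)

# Sample variance divides by N - 1, like sampleVariance().
sample_variance = statistics.variance(values)

print(population_variance, sample_variance)
```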

## Spark SQL

## Machine Learning in Spark:  MLlib

## Streaming in Spark