# Spark Tutorial

Author: Matthew K. MacLeod


### Tutorial goals:

* background
* configuration
* introduction
* HQL
* machine learning
* streaming

## Background

Spark is an open source project created in 2009 by the UC Berkeley RAD lab.

Spark is a distributed computing framework that keeps the MapReduce paradigm, runs it in memory, and adds much more on top.

PySpark API reference: http://spark.apache.org/docs/latest/api/python/pyspark.html

Spark allows for:
* interactive queries (Spark SQL and Hive)
* stream processing
* data analytics (MLlib)
* graph processing (GraphX)


### Directed Acyclic Graph Scheduler

Spark uses a DAG scheduler to figure out how to execute the data analysis pipeline.

The DAG records the dependency flow between transformations. This gives every dataset a lineage, which in turn allows lost data partitions to be recomputed.

RDDs (Spark objects, more on these below) are the nodes, and the transformations are the directed edges; together they form an execution graph.

The DAG scheduler also uses this graph to optimize how the work is parallelized.
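The lineage idea can be illustrated without Spark. The following is a minimal plain-Python sketch, not Spark's scheduler: `LineageNode` and its methods are hypothetical names invented for this illustration. Each node remembers its parent and the function that derives it, so a single lost partition can be rebuilt by replaying the chain from the source.

```python
# Toy model of RDD lineage; LineageNode is a made-up name, not a Spark class.
class LineageNode:
    def __init__(self, parent=None, func=None, source=None):
        self.parent = parent    # upstream node (an edge of the DAG)
        self.func = func        # per-element transformation on that edge
        self.source = source    # list of partitions, only set on the root

    def map(self, func):
        # Transformations only extend the graph; nothing is computed yet.
        return LineageNode(parent=self, func=func)

    def compute_partition(self, index):
        # Replay the lineage for ONE partition, starting from the root.
        if self.parent is None:
            return list(self.source[index])
        return [self.func(x) for x in self.parent.compute_partition(index)]


root = LineageNode(source=[[1, 2], [3, 4]])
squared_plus_one = root.map(lambda x: x * x).map(lambda x: x + 1)

# Pretend partition 1 was lost on a failed node: rebuild only that partition.
recovered = squared_plus_one.compute_partition(1)
print(recovered)  # [10, 17]
```

Only the lost partition is recomputed; partition 0 is never touched, which is the point of tracking dependencies per partition.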


### Caching

Another important aspect of understanding Spark is caching, since in-memory operation is what sets Spark apart.

In general:

   * 10x to 100x speed-ups are typical

   * caching is gradual: partitions are stored as they are first computed

   * cached data is fault tolerant (like the rest of Spark)

   * a good candidate for caching is the cleaned data set

   * heavy calculations might want to use both memory and disk
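As a rough illustration of the "cache the cleaned data set" advice, here is a minimal sketch. The helper name `cache_cleaned_lines` and the cleaning rules (strip whitespace, drop empty lines) are assumptions made for this example; only `map`, `filter` and `cache` are RDD methods.

```python
def cache_cleaned_lines(raw_rdd):
    """Strip whitespace, drop empty lines, and mark the result for caching."""
    cleaned = raw_rdd.map(lambda line: line.strip()) \
                     .filter(lambda line: line != "")
    # cache() is lazy: partitions are stored as the first action computes
    # them, and later actions reuse them instead of re-reading the input.
    cleaned.cache()
    return cleaned

# Usage inside the notebook, where `sc` and `spark_home` already exist:
# cleaned = cache_cleaned_lines(sc.textFile(spark_home + "/README.md"))
# cleaned.count()   # first action: computes and caches the partitions
# cleaned.count()   # second action: served from the cache
```

For heavy calculations that may not fit in memory, `cleaned.persist(StorageLevel.MEMORY_AND_DISK)` (with `from pyspark import StorageLevel`) can be used in place of `cache()`.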

## Configuration

Download and install Spark, then build it with Hive support:
    
    cd spark-1.5.2
    
    ./sbt/sbt -Phive assembly

Set the environment variables:

    export SPARK_HOME="$HOME/programs/spark/spark-1.5.2"
    
    export PYSPARK_SUBMIT_ARGS="--master local[4] pyspark-shell"


In [3]:
# double check env 
!echo $SPARK_HOME

/home/matej/programs/spark/spark-1.5.2


We will load the configuration inside the notebook. This way there is no need to configure an IPython profile, which has its own issues.

    ipython notebook

In [4]:
import os
import sys

In [5]:
# spark configuration
spark_home = os.environ.get('SPARK_HOME')
if spark_home is None:
    raise EnvironmentError("SPARK_HOME is not set")
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# make sure PYSPARK_SUBMIT_ARGS contains "pyspark-shell" before shell.py reads it
spark_release_file = os.path.join(spark_home, 'RELEASE')
if os.path.exists(spark_release_file):
    with open(spark_release_file) as f:
        is_spark_15 = "Spark 1.5" in f.read()
    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if is_spark_15 and "pyspark-shell" not in pyspark_submit_args:
        os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args + " pyspark-shell"

# run the pyspark startup script, which creates sc and sqlContext
filename = os.path.join(spark_home, 'python/pyspark/shell.py')
with open(filename, "rb") as f:
    exec(compile(f.read(), filename, 'exec'))


Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/

Using Python version 3.4.3 (default, Oct 19 2015 21:52:17)
SparkContext available as sc, HiveContext available as sqlContext.


### Small note on setup

It may have been quicker to simply use:

    IPYTHON_OPTS="notebook" ./bin/pyspark

to start the Jupyter notebook.

# Introduction to Spark

  * starting Spark, initialization
  * RDD transformations
  * RDD actions

We will run Spark interactively in Jupyter. Normally, a PySpark script is run with:

    bin/spark-submit ps.py

In [6]:
from pyspark import SparkContext

In [7]:
# the standard spark 'hello world' example:
spark_home = os.environ.get('SPARK_HOME')

text_file = sc.textFile(spark_home + "/README.md")

word_counts = text_file \
    .flatMap(lambda line: line.split()) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)

word_counts.take(5)

[('guide,', 1), ('APIs', 1), ('optimized', 1), ('name', 1), ('Scala,', 1)]

**collect** returns a list that contains all of the elements in this RDD, so use it only on small results. **take(n)**, used above, returns just the first n elements.

###  RDDs

An RDD (_resilient distributed dataset_) is an **immutable** distributed data container, like a collection of objects. Its partitions are distributed across the nodes of a cluster.

Resiliency is realized by tracking the history of each partition and re-running that history when a partition is lost.

RDDs can be created from many sources, including local text files, Amazon S3, JSON, HDFS, HBase and Cassandra.

RDD operations consist of transformations and actions.

**transformations**, in Spark-speak, operate on RDDs and *return new RDDs*. These are evaluated **lazily**. For example, map() and filter() are transformations.

Due to the immutability of RDDs, a functional style of transforming the data is essential.
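Lazy evaluation can be seen without a cluster. Python 3's built-in `map()` and `filter()` are lazy as well, so this minimal sketch uses them as a stand-in for Spark transformations, with `list()` playing the role of an action. The `calls` list and the `double` function are illustrative names only.

```python
# Stand-in for Spark laziness using plain Python 3 built-ins.
calls = []

def double(x):
    calls.append(x)       # record every time the function really runs
    return 2 * x

numbers = [1, 2, 3, 4]
pipeline = filter(lambda x: x > 4, map(double, numbers))  # "transformations"

print(len(calls))         # 0: nothing has been computed yet
result = list(pipeline)   # the "action" forces evaluation
print(result)             # [6, 8]
print(numbers)            # [1, 2, 3, 4]: the input is left untouched
```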


### one pair RDD  Transformations

* map
* filter
* reduceByKey
* groupByKey
* combineByKey
* mapValues
* flatMapValues
* keys
* values
* sortByKey
* sample(withReplacement, fraction, seed)
* coalesce(numPartitions), to reduce the number of partitions


### Locality

Another important distinction is whether a transformation is local (each output partition is computed from a single input partition, for example on one node) or not. These are referred to as narrow and wide transformations.

**Narrow transformations** (no transfer over the network)

* map
* flatMap
* filter
* coalesce (generally local)

**Wide transformations** (can be expensive, involve shuffles)

* groupByKey
* repartition
* distinct
* subtract
* intersection
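The difference between the two kinds can be modelled in plain Python. This is only a sketch: an "RDD" here is a list of partitions, and the functions `narrow_map_values` and `wide_group_by_key` as well as the toy partitioner are assumptions made for illustration, not Spark internals.

```python
# Plain-Python model: an "RDD" is a list of partitions (lists of pairs).
partitions = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

def narrow_map_values(parts, func):
    # Narrow: each output partition depends on exactly one input partition.
    return [[(k, func(v)) for k, v in part] for part in parts]

def wide_group_by_key(parts, num_partitions):
    # Wide: every record is routed by key, so one output partition may
    # need records from all input partitions (the "shuffle").
    shuffled = [dict() for _ in range(num_partitions)]
    for part in parts:
        for key, value in part:
            target = sum(key.encode()) % num_partitions  # toy partitioner
            shuffled[target].setdefault(key, []).append(value)
    return [sorted(bucket.items()) for bucket in shuffled]

print(narrow_map_values(partitions, lambda v: v * 10))
# [[('a', 10), ('b', 20)], [('a', 30), ('c', 40)]]
print(wide_group_by_key(partitions, 2))
# [[('b', [2])], [('a', [1, 3]), ('c', [4])]]
```

In the wide case the two `"a"` records start in different partitions and end up in the same one, which is exactly the data movement that makes shuffles expensive.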


### NB 

    reduceByKey() and foldByKey()

will automatically perform combining _locally_ on each machine before computing global totals for each key.

    combineByKey()

lets you customize the combining behavior. When aggregating values per key, prefer it (or reduceByKey) over groupByKey.
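To make the local combining concrete, here is a plain-Python model of the three functions that `combineByKey(createCombiner, mergeValue, mergeCombiners)` takes. It is a sketch of the idea, not Spark's implementation; the helper `combine_by_key` and the list-of-partitions representation are assumptions for this example.

```python
def combine_by_key(parts, create_combiner, merge_value, merge_combiners):
    # Step 1: combine inside each partition (no network traffic).
    local_results = []
    for part in parts:
        combined = {}
        for key, value in part:
            if key in combined:
                combined[key] = merge_value(combined[key], value)
            else:
                combined[key] = create_combiner(value)
        local_results.append(combined)
    # Step 2: merge the small per-partition results across partitions.
    merged = {}
    for combined in local_results:
        for key, acc in combined.items():
            merged[key] = merge_combiners(merged[key], acc) if key in merged else acc
    return merged

parts = [[("a", 3), ("b", 4), ("a", 1)], [("a", 2), ("b", 6)]]

# Per-key mean via (sum, count) accumulators.
sum_count = combine_by_key(
    parts,
    lambda v: (v, 1),                          # createCombiner
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # mergeValue
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # mergeCombiners
)
means = {k: s / n for k, (s, n) in sum_count.items()}
print(means)  # {'a': 2.0, 'b': 5.0}
```

Only one small accumulator per key and partition crosses the partition boundary in step 2, whereas groupByKey would move every single value.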


### two pair RDD  Transformations
* subtractByKey
* join
* rightOuterJoin
* leftOuterJoin
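The semantics of these operations can be sketched on plain Python lists of pairs. The functions below are illustrative models with made-up sample data, not the Spark implementations, and real RDD results come back in no guaranteed order.

```python
def join(left, right):
    # Inner join: keep only keys present on both sides.
    return [(k, (lv, rv)) for k, lv in left for rk, rv in right if k == rk]

def left_outer_join(left, right):
    # Keep every left record; pad with None when the right side has no match.
    right_keys = {k for k, _ in right}
    unmatched = [(k, (lv, None)) for k, lv in left if k not in right_keys]
    return join(left, right) + unmatched

def subtract_by_key(left, right):
    # Keep left records whose key does not appear on the right at all.
    right_keys = {k for k, _ in right}
    return [(k, v) for k, v in left if k not in right_keys]

users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (1, "pen"), (3, "ink")]

print(join(users, orders))
# [(1, ('ann', 'book')), (1, ('ann', 'pen')), (3, ('cy', 'ink'))]
print(left_outer_join(users, orders)[-1])
# (2, ('bob', None))
print(subtract_by_key(users, orders))
# [(2, 'bob')]
```

rightOuterJoin is the mirror image of leftOuterJoin: every right record is kept and the left value is None when there is no match.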


In [8]:
lines = sc.textFile(spark_home +"/README.md")
lines.count()

98

In [9]:
pythonlines = lines.filter(lambda line: "Python" in line)
pythonlines.first()

'high-level APIs in Scala, Java, Python, and R, and an optimized engine that'

In [10]:
pairs = lines.map(lambda x: (x.split(" ")[0], x))

In [13]:
result = pairs.filter(lambda keyValue: len(keyValue[1]) < 20)

In [14]:
rdd = sc.textFile(spark_home +"/README.md")
words = rdd.flatMap(lambda x: x.split(" "))

In [16]:
result = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
result.take(5)

[('', 67), ('guide,', 1), ('APIs', 1), ('name', 1), ('It', 2)]

**parallelize** distributes a local Python collection to form a partitioned
RDD. If the input represents a range, passing `range` (`xrange` in Python 2) is recommended for performance.

In [14]:
data = [("a", 3), ("b", 4), ("a", 1)]
sc.parallelize(data).reduceByKey(lambda x, y: x + y)      # Default parallelism
sc.parallelize(data).reduceByKey(lambda x, y: x + y, 10)  # Custom parallelism

PythonRDD[23] at RDD at PythonRDD.scala:43

In [15]:
#rdd.sortByKey(ascending=True, numPartitions=None, keyfunc = lambda x: str(x))

In [20]:
# the 3 at the end means we split the list into 3 partitions
rdd = sc.parallelize([(1, 2), (3, 6), (3, 4)],3)
rdd.groupByKey()   # returns a new RDD, which is discarded here; rdd is unchanged
rdd.collect()

[(1, 2), (3, 6), (3, 4)]

In [21]:
# to see partitions use glom, useful also for debugging
rdd.glom().collect()

[[(1, 2)], [(3, 6)], [(3, 4)]]

In [22]:
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)],2)
rdd.reduceByKey(lambda x, y: x + y)   # returns a new RDD, which is discarded here; rdd is unchanged
rdd.collect()

[(1, 2), (3, 4), (3, 6)]

In [23]:
print(rdd.keys().collect())
print(rdd.values().collect())
print(rdd.sortByKey().collect())

[1, 3, 3]
[2, 4, 6]
[(1, 2), (3, 4), (3, 6)]


### flatMap() vs map()

* **map**: produces exactly one output element for each input element.

* **flatMap**: produces zero or more output elements for each input element, so the output can have a different size from the input.

Both are completely **local** (narrow, in Spark-speak) operations.


In [24]:
lines = sc.parallelize(["this is the first line", "hello second line", "third guy"])
words = lines.flatMap(lambda line: line.split(" "))
words.collect()

['this',
 'is',
 'the',
 'first',
 'line',
 'hello',
 'second',
 'line',
 'third',
 'guy']
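For comparison, the same three lines can be pushed through plain-Python equivalents of both operations. This is a sketch of the semantics only; a list comprehension stands in for map and `itertools.chain.from_iterable` stands in for the flattening step of flatMap.

```python
from itertools import chain

lines = ["this is the first line", "hello second line", "third guy"]

# map: exactly one output element per input element (here, a list per line)
mapped = [line.split(" ") for line in lines]

# flatMap: the per-element results are concatenated into one flat collection
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(len(mapped))       # 3
print(mapped[2])         # ['third', 'guy']
print(len(flat_mapped))  # 10
```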

In [39]:
# named functions work as well as lambdas: set up word count key-value pairs
def split_words(line):
    return line.split()

def create_pair(word):
    return (word,1)

# equivalent to lambda x, y: x + y
def sum_counts(a,b):
    return a + b

def starts_with_vowel(pair):
    vowels = ["a","e","i","o","u"]
    word = pair[0]
    first_letter = word[0]
    return first_letter.lower() in vowels

In [26]:
pairs_rdd = text_file.flatMap(split_words).map(create_pair)
pairs_rdd.take(5)

[('#', 1), ('Apache', 1), ('Spark', 1), ('Spark', 1), ('is', 1)]

In [27]:
# reduce example
wordcounts_rdd = pairs_rdd.reduceByKey(sum_counts)
wordcounts_rdd.take(5)

[('guide,', 1), ('APIs', 1), ('optimized', 1), ('name', 1), ('Scala,', 1)]

In [40]:
# filter
vs = pairs_rdd.filter(starts_with_vowel).reduceByKey(sum_counts)
vs.take(5)

[('APIs', 1), ('Once', 1), ('only', 1), ('overview', 1), ('examples', 2)]

In [29]:
# group by key example
pairs_rdd.groupByKey().take(5)

[('guide,', <pyspark.resultiterable.ResultIterable at 0x7f28f4158d30>),
 ('APIs', <pyspark.resultiterable.ResultIterable at 0x7f28f4158cc0>),
 ('optimized', <pyspark.resultiterable.ResultIterable at 0x7f28f41685f8>),
 ('name', <pyspark.resultiterable.ResultIterable at 0x7f28f4168438>),
 ('Scala,', <pyspark.resultiterable.ResultIterable at 0x7f28f41686a0>)]

In [21]:
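# distinct: drop duplicate elements (a wide transformation)
sc.parallelize([1, 1, 2, 3, 3]).distinct().collect()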

In [22]:
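# sample with replacement: sample(withReplacement, fraction, seed)
sc.parallelize(range(100)).sample(True, 0.1, 42).collect()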


In [23]:
# sample without replacement
sc.parallelize(range(100)).sample(False, 0.1, 42).collect()


In [30]:
# coalesce example
sc.parallelize(range(10),4).coalesce(2).glom().collect()

[[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]

###  RDD Actions

**actions**, in Spark-speak, are computations on RDDs that return non-RDD objects and values. This is where something is actually done with the data, so actions typically sit at the end of a data analysis pipeline, for example reduce(). More examples:

* take()
* top()
* first()
* count()
* reduce()
* aggregate()
* fold()
* collect()
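Of these, aggregate() is the least obvious, because it takes a zero value and two functions: `aggregate(zeroValue, seqOp, combOp)`. The following plain-Python model sketches how the pieces fit together; the `aggregate` helper and the list-of-partitions representation are assumptions for illustration, not Spark code.

```python
from functools import reduce

def aggregate(parts, zero, seq_op, comb_op):
    # seq_op folds the elements of one partition into an accumulator ...
    per_partition = [reduce(seq_op, part, zero) for part in parts]
    # ... and comb_op merges the per-partition accumulators.
    return reduce(comb_op, per_partition, zero)

parts = [[1, 2, 3], [4, 5]]

# Compute sum and count in a single pass, then derive the mean.
total, count = aggregate(
    parts,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),   # seqOp
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # combOp
)
print(total, count, total / count)  # 15 5 3.0
```

fold() follows the same pattern with a single function used for both steps, and reduce() is fold() without a zero value.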

In [24]:
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)],2)
rdd.countByKey()

defaultdict(int, {1: 1, 3: 2})

#### numeric RDD operations
* count()
* mean()
* sum()
* max()
* min()
* variance()
* sampleVariance()
* stdev()
* sampleStdev()
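The two variance flavours differ only in the divisor: variance() and stdev() divide by n (population), while sampleVariance() and sampleStdev() divide by n - 1. A small sketch with Python's `statistics` module, on made-up numbers, shows the difference:

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

population_variance = statistics.pvariance(data)  # divides by n
sample_variance = statistics.variance(data)       # divides by n - 1

print(population_variance)  # 4.0
print(sample_variance > population_variance)  # True
```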

# Advanced Spark

**joins**

**broadcast variables** ship a read-only value to every node once; they may be useful for large configuration objects or lookup tables

**accumulators** accumulate a variable across the cluster: tasks write into it concurrently, and the driver reads the total
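A broadcast variable might be used as follows. This is a minimal sketch: the helper `label_codes` and the language lookup table are hypothetical, while `sc.broadcast()`, the `.value` attribute and `sc.parallelize()` are the regular SparkContext calls.

```python
def label_codes(sc, codes, lookup):
    # Ship the lookup table to every executor once, not with every task.
    shared = sc.broadcast(lookup)
    # Workers read the broadcast data through .value
    return (sc.parallelize(codes)
              .map(lambda code: shared.value.get(code, "unknown"))
              .collect())

# Usage in the notebook, where `sc` already exists:
# label_codes(sc, ["en", "ja", "xx"], {"en": "English", "ja": "Japanese"})
```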




In [41]:
def test_accum(x):
    accum.add(x)

In [44]:
# test the famous Gauss summation 1 to 100
accum = sc.accumulator(0)
sc.parallelize(range(1,101)).foreach(test_accum)
accum.value

5050

# Spark SQL

Here we will illustrate some HiveQL work on Twitter stream data.

In [26]:
from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)

In [1]:
# get data from twitter, use api to download tweets.
# python twitterstream.py > output.json
import os
os.chdir('/home/matej/develop/mkm_notebooks/data/twitter')
os.getcwd()

'/home/matej/develop/mkm_notebooks/data/twitter'

In [45]:
!ls -lah output.json

-rw-rw-r--. 1 matej matej 188M Dec  8 19:54 output.json


In [46]:
tweets = hiveCtx.read.json('output.json')
tweets.registerTempTable("tweets")
results = hiveCtx.sql("SELECT user.name, text FROM tweets")
results.take(5)

[Row(name='didi', text='RT @daiIygopro: Retweet if you want to travel the world 🌎🌍🌏'),
 Row(name='✨vega✨™', text='RT @vahlokmusic: Better'),
 Row(name='c4deb1t4', text='https://t.co/fAPWkVQdsD'),
 Row(name='HAL@12/11Kiraカン2.0観覧', text='RT @Camus_SH: お付き合い頂き、ありがとうございました。また時間がみつけて答えていこうと思います。'),
 Row(name='dan', text='Amber is the color of my energy✨')]

In [53]:
# get number of tweets
r = hiveCtx.sql("SELECT count(*) FROM tweets")
r.collect()

[Row(_c0=53151)]

Note that Spark comes with beeline, if it was compiled with the Hive Thrift server profile:

    ./sbt/sbt -Phive-thriftserver clean assembly/assembly

Activate it via:

    spark$ ./bin/beeline -u jdbc:hive2://

and then use the standard beeline Hive interface.

# Machine Learning in Spark:  MLlib

In [54]:
from pyspark.mllib.feature import HashingTF

In [55]:
sentence = "hello there world of spark"
words = sentence.split()
tf = HashingTF(1000)
tf.transform(words)

SparseVector(1000, {389: 1.0, 391: 1.0, 606: 1.0, 833: 1.0, 874: 1.0})
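The idea behind HashingTF (the "hashing trick") is to map each term to a bucket index with a hash function and count occurrences per bucket. The sketch below models that in plain Python. Note the assumption: `zlib.crc32` is a stand-in hash chosen for this example, not the hash MLlib uses, so the indices will differ from the SparseVector above.

```python
import zlib

def hashing_tf(terms, num_features):
    # Map each term to a bucket index and count occurrences per bucket.
    # zlib.crc32 is a stand-in hash for this sketch, not MLlib's hash.
    vector = {}
    for term in terms:
        index = zlib.crc32(term.encode("utf-8")) % num_features
        vector[index] = vector.get(index, 0.0) + 1.0
    return vector

words = "hello there world of spark hello".split()
tf = hashing_tf(words, 1000)
print(sum(tf.values()))  # 6.0: every term lands in some bucket
```

Different terms can collide in the same bucket; a larger `num_features` makes collisions less likely at the cost of a longer vector.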

# Streaming in Spark

Note: see the file mkm_notebooks/license.txt for the license of this notebook.