In [1]:
%%HTML
<link rel="stylesheet" type="text/css" href="https://fonts.googleapis.com/css?family=Quicksand:300,700" />
<link rel="stylesheet" type="text/css" href="https://fonts.googleapis.com/css?family=Fira Code" />
<link rel="stylesheet" type="text/css" href="rise.css">

# Apache Spark



![footer_logo_new](images/logo_new.png)

## Agenda

 - Welcome
 - Introductions
 - Course subjects
 - The environment

## Welcome

 - Class start and end times
 - Facilities

    - Wifi
    - Restrooms
    - Lunch

## Special Corona Measures

 - Let's try to keep webcams on, unless bandwidth becomes a problem.
 - If it is a problem, please switch on your webcam if/when you have a question.
 - Please mute your microphone if you're not talking.
 - If you run into trouble and need help: let me know!
 - It's harder for me as a teacher because I can't see you (well), so bear with me :-).
 - Let's have a quick round-up of how everyone is doing :-)

### WiFi

The access point for this class is `Xebia Guest`. The password is `EasyAccess`.

## Introductions

 - About me
 - About you

   - Your role
   - Your skillset

       - Data Science
       - Hadoop
       - Spark maybe
       - Programming in general

   - Your expectations

## Course subjects


 - Spark basics:

   - Spark Execution
   - SparkSession
   - DataFrames
   - Transformation
   - Laziness
   - Lineage

## More Course Subjects

 - Spark Advanced:

   - How Spark reads/writes data
   - From Spark DataFrames to pandas DataFrames and back
   - DataFrames: Basic concepts
   - SparkSQL
   - Narrow and wide operations (and why you should care)
   - The Catalyst optimizer
   - Caching and persistence
   - More DataFrame operations
   - Spark applications

## Even More Course Subjects

 -  DataFrames:

    - Windowing operations
    - UDFs
    - UDAFs

# Hadoop: First Generation Big Data Platform

 - Solution for distributed storage and computation
 - Initially Hadoop consisted of two components:

   1. HDFS: distributed filesystem

      - Divide data in blocks
      - Replicate blocks across nodes (fault-tolerance)
      - Keep track of what is where!

   2. MapReduce: distributed computation

      - _Map:_ perform computation in parallel
      - _Reduce:_ combine results

Block replication was also aimed at performance: the assumption is that we read more often than we write, and reads go faster if we have multiple copies to choose from.
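
The HDFS bullets above can be sketched as a toy in plain Python. This is only an illustration: the function names, the tiny block size, the node names and the round-robin placement are all made up for this example. Real HDFS uses far larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Divide the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int) -> dict:
    """Assign each block to `replication` distinct nodes, round-robin.

    Hypothetical placement policy, for illustration only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["node1", "node2", "node3"], replication=2)

# The "keep track of what is where" part: a block-to-nodes lookup table.
for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

With two replicas per block, a reader can pick either holder, and the loss of a single node never makes a block unavailable.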

MapReduce also tries to run computations where the data is stored, so the data doesn't have to cross the network (again). This is easier when there are replicas. For this to be a win, the code to run must be smaller than the data.
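
To make the map and reduce steps concrete, here is a minimal word-count sketch in plain Python. It is not Hadoop code: the three strings stand in for blocks stored on different nodes, and `map_block` and `combine` are names invented for this example. A real MapReduce job also has a shuffle step that groups results by key, which is left out here.

```python
from collections import Counter
from functools import reduce

# Each string stands in for one block of a file, stored on a different node.
blocks = [
    "spark is fast",
    "hadoop is big",
    "spark is spark",
]


def map_block(block: str) -> Counter:
    """Map step: count the words in one block, independently of the others."""
    return Counter(block.split())


def combine(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial results into one."""
    return left + right


# The map step could run in parallel, one task per block, next to the data.
partial_counts = [map_block(block) for block in blocks]

# The reduce step combines the partial results into the final answer.
word_counts = reduce(combine, partial_counts, Counter())

print(dict(word_counts))
```

Only the small partial counts need to travel over the network, not the blocks themselves.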

# What is Spark?

 - Spark is a second-generation Big Data platform.
 - The first generation was based on MapReduce: powerful, but cumbersome to use and with technical limitations.
 - Spark introduced a new architecture.

   - Early versions presented a functional programming abstraction.
   - In recent years the focus has moved to a DataFrame/SQL abstraction.
   - The Hadoop roots are still there!

What's different about the new architecture?

 - Spark tries to keep most data in memory or nearby; MapReduce passes everything via HDFS.
 - Spark is designed to work on multiple distributed platforms.

# Spark History

 - Emerged from the AMPLab at UC Berkeley, at the same time as Mesos.
 - Spark is now an Apache project.
 - It's now bundled with the Cloudera, Hortonworks and MapR distributions.
 - Available on AWS, Google Cloud and Microsoft Azure.
 - Evolving very quickly.

Spark was originally written over a weekend as a demonstration for Mesos.

# Spark Data Processing

 - Hadoop is based on two key concepts:

    - Distribute where data is stored.
    - Run computation where the data lives. __*__

 - Spark adds to that:

    - Provide a high level API.
    - From the high level API automatically produce an execution plan, and optimize it where possible.
    - Keep data in memory where possible for faster computation.

__\*__ While true for Hadoop, this is no longer *necessarily* the case for Spark jobs.

The in-memory claim is often cited, but it is a bit of a simplification: Spark writes intermediate data to disk as well. Unlike MapReduce, however, this is usually "local" disk, which is faster to write to than HDFS.

# Spark Language Support

Spark is implemented in Scala (a functional programming language that runs on the JVM) but supports programming in the following languages:

 - Scala
 - Python
 - R
 - Java

Via Java you also get other languages like Kotlin, if that floats your boat.

# Common Workflows

There are several common ways of working with Spark:

 - Notebooks, typically Jupyter
 - Spark shell (a REPL in Python or Scala)
 - Spark applications

# Spark & Jupyter Notebook

Jupyter is a stand-alone Notebook environment.
Start a Jupyter Notebook server with:

```sh
% jupyter notebook
```

This will start a notebook server and will direct your browser to it.

Since Spark 2.2.0, PySpark can be installed with `pip install pyspark` (without needing things like `findspark`).

In [None]:
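from pyspark.sql import SparkSession

# Reuse the running SparkSession, or create one if none exists yet.
spark = SparkSession.builder.getOrCreate()

# The SparkContext is the entry point to the lower-level (RDD) API.
sc = spark.sparkContext
spark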

In [None]:
spark.stop()

# Spark Shell for Python

Spark provides a shell for interactive work. To launch it, use the `$SPARK_HOME/bin/pyspark` command, which starts a Python REPL connected to Spark:

```
Python 3.5.3 | packaged by conda-forge | (default, Jan 24 2017, 06:45:37)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.54)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 3.5.3 (default, Jan 24 2017 06:45:37)
SparkSession available as 'spark'.
>>>
>>> sc
<pyspark.context.SparkContext object at 0x103ba6d90>
>>> sqlContext
<pyspark.sql.context.SQLContext object at 0x103d7a790>
>>> spark
<pyspark.sql.session.SparkSession object at 0x103d7a590>
>>> exit()
```

# Spark Shell for Scala

Spark provides a shell for interactive work. To launch it, use the `$SPARK_HOME/bin/spark-shell` command, which starts a Scala REPL connected to Spark:

```
% spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://192.168.1.114:4040
Spark context available as 'sc' (master = local[*], app id = local-1487929648200).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@37b1218

scala> spark
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@88e85a

scala> :quit
```

# Summary

In this chapter we covered:

- What Spark is
- How to start and use the Spark Shell
- How to use Spark from within a Jupyter Notebook