
# Spark applications


![footer_logo_new](images/logo_new.png)

So far we have used interactive techniques to explore data and run small transformations to get it into the form we want.

This approach does not work well for transformations that need to run on a regular basis.

For those, we have the `spark-submit` command, which runs a Spark application from a file.


## Spark shell vs Spark submit

#### Spark shell

- Explore data
- Try out transformations
- Experiment with different algorithms

#### Spark application

- Python, Scala or Java
- For regularly running jobs
- ETL
- Streaming applications
- Productionized code

## SparkContext in an application

#### Spark shell

- Creates the Spark context for you.

#### Spark application

- You need to create it yourself (just like in the notebook).
- Called `sc` by convention.
- Call `sc.stop()` when the program terminates.
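
As a minimal sketch of the points above, the helper below creates a context, runs a job and always calls `stop()`, even when the job raises an error. The names `create_spark_context` and `run_with_context` and the application name `demo-app` are hypothetical; `pyspark` is imported lazily, so the helper can also be tried out with a stand-in context object.

```python
def create_spark_context(app_name="demo-app"):
    """Create a SparkContext through a SparkSession (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    spark = SparkSession.builder.appName(app_name).getOrCreate()
    return spark.sparkContext


def run_with_context(job, create_context=create_spark_context):
    """Run job(sc) and stop the context afterwards, even on errors."""
    sc = create_context()
    try:
        return job(sc)
    finally:
        # Always release the resources held by the context.
        sc.stop()
```

With Spark installed, `run_with_context(lambda sc: sc.parallelize(range(100)).count())` would run a small job and shut the context down afterwards.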

### Wordcount example application

Don't worry if you don't understand what's in here. We will cover it later.

```python
import argparse
import pyspark
import pyspark.sql.functions as sf
import pyspark.sql.types as st


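def main(df):
    """Count how often each word occurs in the `word` column of `df`."""
    results = (
        # Split each line on whitespace and emit one row per word.
        df.select(sf.explode(sf.split(sf.col("word"), r"\s+")).alias("word"))
          # Give every word a count of 1, then sum the counts per word.
          .withColumn('nr', sf.lit(1))
          .groupBy('word')
          .agg(sf.sum('nr').alias('sum'))
    )
    return results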


if __name__ == "__main__":

    parser = argparse.ArgumentParser(description='This is a demo application.')
    parser.add_argument('-i', '--input',  help='Input file name',  required=True)
    parser.add_argument('-o', '--output', help='Output file name', required=True)
    args = parser.parse_args()

    spark = (
        pyspark.sql.SparkSession.builder
               .getOrCreate()
    )

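    # Read every input line into a single string column called "word".
    # Caveat: the CSV reader splits on commas, so with this one-column schema
    # anything after the first comma on a line is dropped.
    schema = st.StructType([st.StructField("word", st.StringType())])
    input_df = (
        spark.read
             .option("header", "false")
             .csv(args.input, schema=schema)
    )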

    counts = main(input_df)
    counts.write.parquet(args.output)

    spark.sparkContext.stop()
```    
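
If the DataFrame code is hard to follow, the plain-Python sketch below mirrors roughly the same logic without Spark: split every line on whitespace, then count each word. The function name `word_counts` is made up for illustration, and edge cases (for example empty tokens caused by leading whitespace) may differ slightly from the Spark version.

```python
import re
from collections import Counter


def word_counts(lines):
    """Return a dict mapping each word to the number of times it occurs."""
    counts = Counter()
    for line in lines:
        # Same idea as sf.split(..., r"\s+") followed by sf.explode:
        # one entry per whitespace-separated token.
        counts.update(re.split(r"\s+", line))
    return dict(counts)


print(word_counts(["to be or", "not to be"]))
```

The Spark version does the same work, but distributes the splitting and counting over the executors and writes the result as Parquet instead of returning a dictionary.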

## Running an application

```bash
$SPARK_HOME/bin/spark-submit wordcount.py -i inputdir -o outputdir
```

## Where do we run this?

Spark can run:

- Locally
- Locally with multiple threads
- On a cluster
    - Client mode
    - Cluster mode

The cluster option is mostly used for production jobs.  
Local can be useful for testing and development.

### Client mode

When using client mode:

 - The Spark driver is part of the `spark-submit` process.
 - The `spark-submit` command does not exit until the application has finished and the driver has shut down.

### Cluster mode

When using cluster mode:

 - The Spark driver runs remotely, normally in a process very similar to the workers. (The exact mechanism depends on the type of cluster.)
 - The `spark-submit` command does not need to stay alive for the application to keep running. Depending on the cluster manager and its configuration, it can return once the application has started, without waiting for it to finish.
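
An application can check which mode it was submitted in by reading the `spark.submit.deployMode` setting, which `spark-submit` normally sets. A minimal sketch, assuming `conf` is any object with a `get(key, default)` method, such as the `SparkConf` returned by `sc.getConf()`; the function name `driver_location` is hypothetical.

```python
def driver_location(conf):
    """Describe where the driver runs, based on the configured deploy mode."""
    # Fall back to "client" when the setting is absent.
    mode = conf.get("spark.submit.deployMode", "client")
    if mode == "cluster":
        return "driver runs on the cluster, outside the spark-submit process"
    return "driver runs inside the spark-submit process"


# A plain dict offers the same get(key, default) call, which is handy for a dry run.
print(driver_location({"spark.submit.deployMode": "cluster"}))
print(driver_location({}))
```

Inside a real application this would be called as `driver_location(sc.getConf())`.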

## How to specify the environment to run on

```bash
spark-submit --master 'local[2]' wordcount.py ....
```

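Other options for `--master`:

- `local`: run locally with a single thread
- `local[n]`: run locally with `n` threads
- `local[*]`: run locally with as many threads as there are cores
- `yarn`: run on a YARN cluster; the default deploy mode is `client`, specify `--deploy-mode cluster` if you want `cluster` mode
- `spark://HOST:PORT`: a Spark standalone cluster
- `mesos://HOST:PORT`: a Mesos cluster

Kubernetes is a more recent addition and uses a `k8s://HOST:PORT` master URL.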
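
When jobs are launched from a scheduler or another script, it can help to assemble the `spark-submit` call programmatically. The sketch below only builds the argument list from a master URL and an optional deploy mode; the helper name `build_submit_command` and its parameters are hypothetical, and only the `--master` and `--deploy-mode` flags shown above are used.

```python
import shlex


def build_submit_command(script, master="local[*]", deploy_mode=None, app_args=()):
    """Build the argument list for a spark-submit call."""
    cmd = ["spark-submit", "--master", master]
    if deploy_mode is not None:
        # Only passed when set, so the cluster manager's default applies otherwise.
        cmd += ["--deploy-mode", deploy_mode]
    cmd.append(script)
    cmd.extend(app_args)
    return cmd


cmd = build_submit_command(
    "wordcount.py",
    master="yarn",
    deploy_mode="cluster",
    app_args=["-i", "inputdir", "-o", "outputdir"],
)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to actually submit the application.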

# Summary

In this chapter we looked at:

+ How Spark Applications can be packaged up.
+ The differences between running in client and cluster mode.

# Exercise
Grab the code from the previous exercise. The code should do the following:
1. First, load the heroes dataset and calculate the average attack per role using `groupBy`.
2. Now filter the resulting averages to show only the Warrior role and print the result using `show`.

Package it up in a Python file and run it using `spark-submit` on your local machine.