## Batch processing data
Batch processing refers to partitioning and processing data in blocks called **batches**. Data is usually partitioned by time intervals (weekly, daily, hourly, 5 min, etc.).
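
As a rough illustration of time-based partitioning, the sketch below assigns records to hourly batches using only the standard library. The `events` list and the `batch_key` helper are hypothetical and only serve to show the idea.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event records: (timestamp, value)
events = [
    (datetime(2023, 1, 1, 9, 15), 10),
    (datetime(2023, 1, 1, 9, 50), 20),
    (datetime(2023, 1, 1, 10, 5), 30),
]


def batch_key(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour (the batch boundary)."""
    return ts.replace(minute=0, second=0, microsecond=0)


# Group the records by hourly batch
batches = defaultdict(list)
for ts, value in events:
    batches[batch_key(ts)].append(value)

for start, values in sorted(batches.items()):
    print(start.isoformat(), values)
```

Each batch can then be processed, retried, or parallelized independently of the others.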

The advantages of batch processing include:
* Easy to manage pipelines.
* Easy to retry or repair pipeline workflows.
* Easy to scale and parallelize.

Some disadvantages include:
* Slower execution times (in many cases the entire pipeline needs to be rerun).

Batch processing is the most common type of data processing in industry.


## Apache Spark
The most common tool used today for batch processing is Apache Spark. Spark is a multi-language open-source analytics **engine** for large-scale data processing that provides an interface for programming distributed machine clusters. Spark's main language is Scala, but it also has wrappers for Java, Python, etc. PySpark is usually the preferred wrapper for Spark.

Spark is usually used to process files from a data lake and write the results back into the data lake.

```mermaid
    flowchart LR
        A(Raw Data) --> B
        B[(Data Lake)] --> C
        C["SQL (Athena or Presto)"] --> D
        D[Spark] --> B
```



## Intro to PySpark
Spark works over distributed coordinated clusters. There are two main abstractions:
* RDD (Resilient Distributed Dataset): a distributed collection of objects.
* DataFrame: a distributed dataset of tabular data.

Two important concepts in PySpark are:
* Immutability: changes create new object references, old versions are unchanged.
* Actions vs. Transformations: Spark commands are either transformations or actions.
    - Transformations are **lazy**, meaning the actual computation does not happen until an output is requested. This allows Spark to collect all transformations and optimize them together when the output is requested.
    - Actions are **eager**, meaning that they are evaluated immediately. This applies to commands such as `show`, `take`, `head`, etc.

#### Simplified Spark Architectural Overview
The user interacts with the **driver**, which controls the **executors**. The executors are managed by a **master** and operate on the data.
```mermaid
    flowchart LR
        A[Driver] --> B
        subgraph "Master"
            B[Executors]
        end
        B --> C[Data]
```

#### Common operations
Load a csv
```python
df = spark.read \
    .options(header=True, inferSchema=True) \
    .csv("mtcars.csv")
```

View a dataframe
```python
df.show()
df.show(10)  # specific number of rows
```

View columns and datatypes
```python
df.columns
df.dtypes
```

Rename columns
```python
df.toDF('a', 'b', 'c')
df.withColumnRenamed('old', 'new')  # rename specific column
```

Drop columns
```python
df.drop('mpg')
```

Filtering
```python
df[df.mpg < 20]
df[(df.mpg < 20) & (df.cyl == 6)]
```

Add columns
```python
df.withColumn('gpm', 1 / df.mpg)
```

Fill nulls
```python
df.fillna(0)
```

Aggregation
```python
df.groupby(['cyl', 'gear']) \
    .agg({'mpg': 'mean', 'disp': 'min'})
```

Standard Transformations
There are tons of common transformations available in the `functions` module.
```python
import pyspark.sql.functions as F
df.withColumn('logdisp', F.log(df.disp))
```
Using transformations from the `functions` module keeps the code execution within the JVM and keeps the execution performant.

Row conditional statements
```python
import pyspark.sql.functions as F
df.withColumn(
    'cond',
    F.when(df.mpg > 20, 1)
     .when(df.cyl == 6, 2)
     .otherwise(3)
)
```

When Python is required
You can register UDFs (User Defined Functions).
```python
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType
fn = F.udf(lambda x: x+1, DoubleType())
df.withColumn('disp1', fn(df.disp))
```
It is important that the UDF be deterministic, because Spark may evaluate the function more than once or apply optimizations.

Merge/join dataframes
```python
left.join(right, on='key')
left.join(right, left.a == right.b)
```

Pivot tables
```python
df.groupBy('A', 'B').pivot('C').sum('D')
```

Summary statistics
```python
df.describe().show()  # display count, mean, stddev, min, max
# percentiles:
df.selectExpr(
    "percentile_approx(mpg, array(.25, .5, .75)) as mpg"
).show()
```

Plotting
There is no option for plotting directly in PySpark, but you can export to Pandas for plotting. This is not advised for large dataframes; sample the data first.
```python
df.sample(False, 0.1).toPandas().hist()
```

SQL
Spark allows you to switch between SQL and Dataframes. To use our dataframe within a SQL query, we need to register the dataframe as a view or table:
```python
df.createOrReplaceTempView('foo')  # registering a table in SQL
```

Then we can use SQL to reference our dataframe like a table and query it:
```python
df2 = spark.sql('select * from foo')
```

#### PySpark Best Practices
* Use `pyspark.sql.functions` and other built in functions.
* Use the same versions of Python and packages on the cluster as on the driver.
* Check the UI at http://localhost:4040/.
* Learn about SSH port forwarding for working with Spark in a notebook.
* Check out Spark MLlib, basically the Spark equivalent of scikit-learn.
* Check out the docs at https://spark.apache.org/docs/latest

#### Things to avoid
* Iterating through rows.
* Hard-coding a master in your driver (pass it to the `spark-submit` command instead).
* Converting to Pandas without filtering first.



# Installing and running Spark

#### Installing Java
Download OpenJDK 11 or Oracle JDK 11. The Java version matters: Spark 3.3 supports Java 8, 11, and 17, and these notes use 11.

```bash
# Run in a terminal from ~/spark (the paths below assume that directory)
wget https://download.java.net/java/GA/jdk11/9/GPL/openjdk-11.0.2_linux-x64_bin.tar.gz
tar xzfv openjdk-11.0.2_linux-x64_bin.tar.gz

# Point JAVA_HOME at the extracted JDK and put it on the PATH
export JAVA_HOME="${HOME}/spark/jdk-11.0.2"
export PATH="${JAVA_HOME}/bin:${PATH}"

# Verify the installation, then remove the archive
java --version
rm openjdk-11.0.2_linux-x64_bin.tar.gz
```

#### Installing Spark
Download Spark 3.3.2

```bash
# Download, extract, and remove the archive (run from ~/spark)
wget https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
tar xzfv spark-3.3.2-bin-hadoop3.tgz
rm spark-3.3.2-bin-hadoop3.tgz

# Point SPARK_HOME at the extracted distribution and put it on the PATH
export SPARK_HOME="${HOME}/spark/spark-3.3.2-bin-hadoop3"
export PATH="${SPARK_HOME}/bin:${PATH}"
```

To make the environment variables persistent, add them to your `.bashrc` file: open it with `nano ~/.bashrc` and copy in the variable definitions:
```
export JAVA_HOME="${HOME}/spark/jdk-11.0.2"
export PATH="${JAVA_HOME}/bin:${PATH}"

export SPARK_HOME="${HOME}/spark/spark-3.3.2-bin-hadoop3"
export PATH="${SPARK_HOME}/bin:${PATH}"
```
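
After opening a new shell, a quick way to confirm the variables are visible to Python is a check like the one below. The helper `missing_spark_env` is hypothetical and uses only the standard library.

```python
import os


def missing_spark_env(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    required = ("JAVA_HOME", "SPARK_HOME")
    return [name for name in required if not env.get(name)]


# An empty list means both variables are set in the current environment
print(missing_spark_env())
```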

Install PySpark:
```bash
pip install pyspark
```

Add PySpark to `PYTHONPATH`:
```bash
# The py4j file name varies between Spark releases; check ${SPARK_HOME}/python/lib/
export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH"
```

Alternatively, use [findspark](https://github.com/minrk/findspark) to add PySpark to `sys.path` at runtime:
```bash
pip install -q findspark
```

Test that the installation worked:

```python
import findspark
findspark.init()  # adds PySpark to sys.path

import pyspark
print(pyspark.__version__)
```

```python
from pyspark.sql import SparkSession

# Create a Spark session with a local cluster using all available CPUs.
# Settings such as the UI port are set on the builder, before the session exists.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .config('spark.ui.port', '4040') \
    .getOrCreate()
```

# Connecting to Google Cloud Storage in Spark
To read data from a GCS data lake we can use the [Google Cloud Storage connector for Hadoop](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#clusters).

Download [gcs-connector-hadoop3-latest.jar](https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar) and move the `.jar` file to a `/jars` directory within your Spark directory.

Within PySpark, import the following:
```python
import pyspark
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
```

And configure the Spark context:
```python
credentials_location = './path/to/google_credentials.json'

conf = SparkConf() \
    .setMaster('local[*]') \
    .setAppName('test') \
    .set('spark.jars', './apache-spark/<version>/jars/gcs-connector-hadoop3-latest.jar') \
    .set("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .set("spark.hadoop.google.cloud.auth.service.account.json.keyfile", credentials_location)

sc = SparkContext(conf=conf)

hadoop_conf = sc._jsc.hadoopConfiguration()

hadoop_conf.set("fs.AbstractFileSystem.gs.impl",  "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS")
hadoop_conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
hadoop_conf.set("fs.gs.auth.service.account.json.keyfile", credentials_location)
hadoop_conf.set("fs.gs.auth.service.account.enable", "true")
```

Then build the SparkSession with the new parameters:
```python
spark = SparkSession.builder \
    .config(conf=sc.getConf()) \
    .getOrCreate()
```

This will allow you to read files directly from GCS:
```python
df_green = spark.read.parquet("gs://{BUCKET}/green/202*/")
```
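
Note that `{BUCKET}` above is a placeholder, not an f-string substitution. One way to fill it in is sketched below; the bucket name and the `gcs_path` helper are hypothetical.

```python
BUCKET = "my-example-bucket"  # hypothetical bucket name, replace with your own


def gcs_path(bucket: str, prefix: str) -> str:
    """Build a gs:// URI for a prefix (glob patterns are passed through as-is)."""
    return f"gs://{bucket}/{prefix.strip('/')}/"


green_path = gcs_path(BUCKET, "green/202*")
print(green_path)

# With the GCS-enabled session configured above, the read would then be:
# df_green = spark.read.parquet(green_path)
```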