# **1. Introduction to Spark**

# **2. Install Spark Locally**

## **2.1 Spark**

1. **Installing Java**

First, download OpenJDK 11 from the [OpenJDK archive](https://jdk.java.net/archive/) (Oracle JDK 11 also works).

Download it to the `~/spark` folder:

```bash
wget https://download.java.net/java/GA/jdk11/9/GPL/openjdk-11.0.2_linux-x64_bin.tar.gz
```

Unpack it. In `xzfv`, `x` extracts the archive, `z` decompresses it with gzip, `f` says that the next argument is the archive file name, and `v` prints each file name as it is extracted:

```bash
tar xzfv openjdk-11.0.2_linux-x64_bin.tar.gz
```

In `.bashrc`, define `JAVA_HOME` and add its `bin` directory to `PATH` so that Java is available in every new terminal:

```bash
export JAVA_HOME="${HOME}/spark/jdk-11.0.2"
export PATH="${JAVA_HOME}/bin:${PATH}"
```

Check that it works (open a new terminal or run `source ~/.bashrc` first):

```bash
java --version
```

2. **Installing Spark**


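Download [Spark](https://spark.apache.org/downloads.html) to the `~/spark` folder. If this release is no longer served from the download mirror, older releases are kept in the Apache archive at `https://archive.apache.org/dist/spark/`: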

```bash
wget https://dlcdn.apache.org/spark/spark-3.4.2/spark-3.4.2-bin-hadoop3-scala2.13.tgz
```

Unpack:

```bash
tar xzfv spark-3.4.2-bin-hadoop3-scala2.13.tgz
```

In `.bashrc`, define `SPARK_HOME` and add its `bin` directory to `PATH`:

```bash
export SPARK_HOME="${HOME}/spark/spark-3.4.2-bin-hadoop3-scala2.13"
export PATH="${SPARK_HOME}/bin:${PATH}"
```

3. **Testing Spark**

Start `spark-shell` and run the following. The last line should return the numbers 1 through 9:

```scala
val data = 1 to 10000
val distData = sc.parallelize(data)
distData.filter(_ < 10).collect()
```

<center>
<img src="data/spark-test.png" >
</center>



## **2.2 PySpark**

Assuming Python is already installed, PySpark first needs to be added to `PYTHONPATH`:

```bash
export PYTHONPATH="${SPARK_HOME}/python/:$PYTHONPATH"
export PYTHONPATH="${SPARK_HOME}/python/lib/py4j-0.10.9.7-src.zip:$PYTHONPATH"
```
This tells the Python interpreter where to find modules and packages that are not in the standard library or the current directory. Make sure the py4j filename matches the version actually present under `${SPARK_HOME}/python/lib/`; otherwise `import pyspark` fails with `ModuleNotFoundError: No module named 'py4j'`.
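Because the py4j version in the filename changes between Spark releases, the lookup can also be done from Python instead of being hard-coded. The following is a minimal sketch, not part of the original setup: the helper names `find_py4j_zip` and `add_pyspark_to_sys_path` are made up for illustration, and it assumes the standard `python/lib/py4j-*-src.zip` layout of a Spark binary distribution.

```python
import glob
import os
import sys


def find_py4j_zip(spark_home):
    """Return the py4j source zip bundled with Spark, or None if absent."""
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    return matches[-1] if matches else None


def add_pyspark_to_sys_path(spark_home):
    """Prepend Spark's Python sources and its py4j zip to sys.path.

    This mirrors the two PYTHONPATH exports, but discovers the py4j
    version instead of spelling it out.
    """
    py4j_zip = find_py4j_zip(spark_home)
    if py4j_zip is None:
        raise FileNotFoundError(
            f"no py4j-*-src.zip found under {spark_home}/python/lib"
        )
    entries = [os.path.join(spark_home, "python"), py4j_zip]
    # Insert in reverse so the final order in sys.path matches `entries`.
    for entry in reversed(entries):
        if entry not in sys.path:
            sys.path.insert(0, entry)
    return entries
```

Calling `add_pyspark_to_sys_path(os.environ["SPARK_HOME"])` before `import pyspark` has the same effect as the exports for the current Python process only.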

Alternatively, install the `pyspark` package with pip:

```bash
pip install pyspark
```

pip installs `pyspark` and its py4j dependency into the active Python environment, so no `PYTHONPATH` changes are needed in that case.



Download a CSV file that we'll use for testing:

```bash
wget https://s3.amazonaws.com/nyc-tlc/misc/taxi+_zone_lookup.csv
```

Now run the following:

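```python
from pyspark import SparkContext

# Reuse the running SparkContext if there is one, otherwise start it
sc = SparkContext.getOrCreate()

data = range(10000)
distData = sc.parallelize(data)
# Keep the even numbers (lowest bit is 0) and return the first ten
distData.filter(lambda x: not x & 1).take(10)
```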



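If `JAVA_HOME` is not visible to the Python process (for example, the notebook was started before `.bashrc` was updated), the cell prints `JAVA_HOME is not set` and fails with `PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.`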
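One way around this error is to set `JAVA_HOME` inside the Python process before the first Spark call, since PySpark launches the JVM with the environment of that process. The sketch below is an assumption-laden illustration: `ensure_java_home` is a hypothetical helper, and the JDK path in the usage note is the one from the installation step above.

```python
import os


def ensure_java_home(java_home, env=None):
    """Set JAVA_HOME and put its bin/ first on PATH, unless already set.

    `env` defaults to os.environ; passing a plain dict makes the helper
    easy to try out without touching the real environment.
    """
    env = os.environ if env is None else env
    if env.get("JAVA_HOME"):
        # Respect an existing setting instead of overriding it.
        return env["JAVA_HOME"]
    env["JAVA_HOME"] = java_home
    bin_dir = os.path.join(java_home, "bin")
    existing = env.get("PATH", "")
    env["PATH"] = bin_dir + os.pathsep + existing if existing else bin_dir
    return java_home
```

In a notebook this would be called as `ensure_java_home(os.path.expanduser("~/spark/jdk-11.0.2"))` in a cell that runs before `SparkContext.getOrCreate()` or `SparkSession.builder...getOrCreate()`. Restarting the notebook from a terminal where `java --version` works should achieve the same result.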

```python
import pyspark
from pyspark.sql import SparkSession

# Start (or reuse) a local session that uses all available cores
spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

# Read the CSV, treating the first row as column names
df = spark.read \
    .option("header", "true") \
    .csv('taxi+_zone_lookup.csv')

df.show()
```


Finally, write the DataFrame to a `zones` directory in Parquet format:

```python
df.write.parquet('zones')
```

# **3. Spark with Airflow using Dataproc**

With Google Cloud Dataproc, Spark itself is managed for you: Dataproc provides the clusters and the Spark environment, so there is no Spark installation to maintain. Airflow only needs the Dataproc operators from the Google provider package to submit and manage Spark jobs on those clusters.
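As a rough illustration of what Airflow sends to Dataproc, the sketch below builds the job definition for a PySpark script. Everything concrete in it is an assumption made for the example: the helper name `build_pyspark_job`, the project, cluster, region, and bucket path are placeholders, and the commented operator call assumes the `apache-airflow-providers-google` package is installed.

```python
def build_pyspark_job(project_id, cluster_name, main_python_file_uri, args=None):
    """Build a Dataproc job definition for a PySpark script stored in GCS."""
    job = {
        "reference": {"project_id": project_id},
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {"main_python_file_uri": main_python_file_uri},
    }
    if args:
        # Command-line arguments passed to the PySpark script
        job["pyspark_job"]["args"] = list(args)
    return job


# Placeholder values; replace with your own project, cluster, and bucket.
job = build_pyspark_job(
    project_id="my-project",
    cluster_name="my-cluster",
    main_python_file_uri="gs://my-bucket/jobs/zones_to_parquet.py",
    args=["--output", "gs://my-bucket/zones"],
)

# Inside a DAG file, the job could then be submitted with:
#
# from airflow.providers.google.cloud.operators.dataproc import (
#     DataprocSubmitJobOperator,
# )
#
# submit_job = DataprocSubmitJobOperator(
#     task_id="submit_pyspark_job",
#     job=job,
#     region="us-central1",
#     project_id="my-project",
# )
```

Keeping the job definition in a plain function makes it possible to inspect it without an Airflow installation; the operator only comes into play inside the DAG.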