I've stumbled across the name "Apache Spark" on the internet so many times, yet I never had the chance to really get to know what it was. For one thing, it seemed rather intimidating, full of buzzwords like "cloud computing," "data streaming," and "scalability," just to name a few. However, a few days ago, I decided to give it a shot and at least get a glimpse of what it was all about. So here I report my findings after binge-watching online tutorials on Apache Spark.

# Apache

If you're into data science or even just software development, you might have heard of other Apache projects like Apache Kafka, Apache Cassandra, and many more. When I first heard about these, I began wondering: is Apache some sort of umbrella software, with Spark, Kafka, and the others being spinoffs from this parent entity? I was all the more confused because the Apache I had heard of, at least as far as I recalled, had to do with web servers and hosting.

Turns out that there is an organization called the Apache Software Foundation, which is the world's largest open source foundation. This foundation, of course, is behind the Apache HTTP Server project, which was the web server side of things that I had ever so faintly heard about. Then what is Apache Spark? Spark was originally developed at UC Berkeley's AMPLab. Later, its code base was open sourced and eventually donated to the Apache Software Foundation; hence its current name, Apache Spark.

# Setup

For this tutorial, we will be loading Apache Spark in a Jupyter notebook. There are many tutorials on how to install Apache Spark, and they are easy to follow along. However, I'll also share a quick synopsis of my own just for reference.

## Installation

Installing Apache Spark is pretty straightforward if you are comfortable dealing with `.bash_profile` on macOS or `.bashrc` on Linux. The executive summary is that you need to add the Apache Spark binaries to the `PATH` variable of your system.

What is `PATH`? Basically, the `PATH` variable is a list of directories in which your shell looks for programs. For example, when we run simple commands like `ls` or `mkdir`, we are essentially invoking small programs that ship with our POSIX system. The `PATH` variable tells the computer which directories to search for these mini-programs, such as `/bin` and `/usr/bin`, which are part of `PATH` by default.
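To make this concrete, here is a small sketch in Python that lists the directories currently on `PATH` and asks where a command would be found. It uses only the standard library (`os` and `shutil`). The helper name `path_directories` is made up for illustration, and the output will differ from machine to machine.

```python
import os
import shutil


def path_directories():
    """Return the directories listed in PATH, in search order."""
    # PATH is one string; entries are joined by os.pathsep (":" on macOS and Linux).
    return [d for d in os.environ.get("PATH", "").split(os.pathsep) if d]


for directory in path_directories():
    print(directory)

# shutil.which returns the full path of the program a command name resolves to,
# or None when no directory on PATH contains it.
print(shutil.which("ls"))
```

The same lookup is available directly in the terminal with `which ls`.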

Can the `PATH` variable be modified? The answer is a huge yes. Say we have our own little mini-program, and we want to be able to run it from the command line prompt. Then, we would simply modify `PATH` so that the computer knows where our custom mini-program is located and knows what to do whenever we type its name in the terminal.

This is why typing `python` in the terminal drops us into the interactive Python shell: the computer finds a `python` executable in one of the `PATH` directories. Here is the little setup I have in my own `.bash_profile`:

```bash
export PYTHON_HOME="/Library/Frameworks/Python.framework/Versions/3.7"
export SPARK_HOME="/Users/jaketae/opt/apache-spark/spark-2.4.5-bin-hadoop2.7"
export PATH="${PYTHON_HOME}/bin:${PATH}:${SPARK_HOME}/bin"
```

Here, I prepended `${PYTHON_HOME}/bin` to the default `PATH`, then appended `${SPARK_HOME}/bin` at the end. Appending and prepending result in different behaviors: the computer searches the directories in `PATH` in order and runs the first match it finds. In other words, in my current setup, the computer will first search the `bin` directory under `PYTHON_HOME`, then the default `PATH` directories, and finally the `bin` directory under `SPARK_HOME`.
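The effect of ordering can be sketched in a few lines of Python. This is a simplified stand-in for what the shell does, not its actual implementation: the functions `first_match` and `demo` are hypothetical names, the two directories are throwaway temporary ones, and the executable-permission check a real shell performs is skipped.

```python
import os
import tempfile


def first_match(command, directories):
    """Return the first existing file named `command`, searching `directories` in order."""
    for directory in directories:
        candidate = os.path.join(directory, command)
        # Simplification: a real shell also checks that the file is executable.
        if os.path.isfile(candidate):
            return candidate
    return None


def demo():
    """Put a fake `python` file in two temporary directories and try both orderings."""
    with tempfile.TemporaryDirectory() as dir_a, tempfile.TemporaryDirectory() as dir_b:
        for directory in (dir_a, dir_b):
            with open(os.path.join(directory, "python"), "w") as handle:
                handle.write("placeholder\n")
        a_first = first_match("python", [dir_a, dir_b])
        b_first = first_match("python", [dir_b, dir_a])
        # Whichever directory comes first in the list wins.
        return os.path.dirname(a_first) == dir_a, os.path.dirname(b_first) == dir_b


print(demo())
```

Both values come out `True`: the same command name resolves to a different file depending on which directory is listed first, which is exactly why prepending and appending behave differently.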

The `SPARK_HOME` directory simply contains the result of extracting the `.tgz` archive available for download on the Apache Spark website.

Once the `PATH` variable has been configured, run `source ~/.bash_profile`, and you should be ready to run Apache Spark on your local workstation! To see if the installation and `PATH` configuration have been done correctly, type `pyspark` in the terminal:

```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.5
      /_/

Using Python version 3.7.5 (default, Oct 25 2019 10:52:18)
SparkSession available as 'spark'.
>>> 
```
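If the banner does not show up, a quick check from Python can narrow down what went wrong. The sketch below is only a diagnostic aid under the setup described above: `check_spark_setup` is a made-up helper, and it merely looks at the `SPARK_HOME` and `PATH` environment variables with standard-library calls.

```python
import os
import shutil


def check_spark_setup(environ=None):
    """Report whether SPARK_HOME is set, exists, and whether pyspark is on PATH."""
    environ = os.environ if environ is None else environ
    spark_home = environ.get("SPARK_HOME")
    return {
        "SPARK_HOME set": spark_home is not None,
        "SPARK_HOME exists": bool(spark_home) and os.path.isdir(spark_home),
        # Search only the PATH from `environ`; an empty PATH finds nothing.
        "pyspark on PATH": shutil.which("pyspark", path=environ.get("PATH", "")) is not None,
    }


for check, passed in check_spark_setup().items():
    print(f"{check}: {passed}")
```

If `SPARK_HOME` is set but `pyspark` is not found, the likely culprit is the `PATH` line in `.bash_profile`, or a terminal session that has not yet been refreshed with `source ~/.bash_profile`.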

## Jupyter Notebook

To use Jupyter with Spark, we need to do a little more work. There are two ways to do this, but I will introduce the method that I found not only fairly simple, but also more applicable and generalizable. All we need is to install the `findspark` package via `pip install findspark`. Then, in Jupyter, we can do:

```python
import findspark
findspark.init()
```

Then simply import Apache Spark via

```python
import pyspark
```
That is literally all we need! We can of course still use Apache Spark on the terminal simply by typing `pyspark` if we want, but it's always good to have more options on the table. 
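To confirm that the notebook can actually talk to Spark, a tiny job is enough. The following is a minimal sketch, assuming `findspark` is installed and `SPARK_HOME` is configured as above; the function name `spark_smoke_test` and the app name `smoke-test` are placeholders of my choosing. The imports sit inside the function so that defining it does not require Spark to be available.

```python
def spark_smoke_test():
    """Start a local Spark session, sum the integers 0 through 99, and stop the session."""
    import findspark

    # Locates the Spark installation (for example via SPARK_HOME)
    # and makes the pyspark package importable.
    findspark.init()

    from pyspark.sql import SparkSession

    # "local[*]" runs Spark on this machine using all available cores.
    spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
    try:
        # Distribute the numbers as an RDD and add them up.
        return spark.sparkContext.parallelize(range(100)).sum()
    finally:
        spark.stop()
```

Calling `spark_smoke_test()` in a notebook cell should return `4950` if everything is wired up correctly.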
