## Intermediate Parallel Computing
### Segment 2 of 6

### Apache Spark, the Spark of a Journey!

### In this segment we will answer:
* What is Apache Spark?
* What is its architecture?
* How do we install Spark and start a Spark computing session?


*Lesson Developer: Mohsen Ahmadkhani, ahmad178@umn.edu*

## Thank you for helping our study


<a href="#/slide-1-0" class="navigate-right" style="background-color:blue;color:white;padding:8px;margin:2px;font-weight:bold;">Continue with the lesson</a>

Throughout this lesson you will see reminders, like the one below, to ensure that all participants understand that they are in a voluntary research study.

### Reminder

<font size="+1">

By continuing with this lesson you are granting your permission to take part in this research study for the Hour of Cyberinfrastructure: Developing Cyber Literacy for GIScience project. In this study, you will be learning about cyberinfrastructure and related concepts using a web-based platform that will take approximately one hour per lesson. Participation in this study is voluntary.

Participants in this research must be 18 years or older. If you are under the age of 18 then please exit this webpage or navigate to another website such as the Hour of Code at https://hourofcode.com, which is designed for K-12 students.

If you are not interested in participating please exit the browser or navigate to this website: http://www.umn.edu. Your participation is voluntary and you are free to stop the lesson at any time.

For the full description please navigate to this website: <a href="../../gateway-lesson/gateway/gateway-1.ipynb">Gateway Lesson Research Study Permission</a>.

</font>

In [1]:
# This code cell starts the necessary setup for Hour of CI lesson notebooks.
# First, it enables users to hide and unhide code by producing a 'Toggle raw code' button below.
# Second, it imports the hourofci package, which is necessary for lessons and interactive Jupyter Widgets.
# Third, it helps hide/control other aspects of Jupyter Notebooks to improve the user experience
# This is an initialization cell
# It is not displayed because the Slide Type is 'Skip'

from IPython.display import HTML, IFrame, Javascript, display
from ipywidgets import interactive
import ipywidgets as widgets
from ipywidgets import Layout

import getpass # This library allows us to get the username (User agent string)

# import package for hourofci project
import sys
sys.path.append('../../supplementary') # relative path (may change depending on the location of the lesson notebook)
import hourofci

import warnings
warnings.filterwarnings('ignore') # Hide warnings

# load javascript to initialize/hide cells, get user agent string, and hide output indicator
# hide code by introducing a toggle button "Toggle raw code"
# HTML(''' 
#     <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
#     <input id="toggle_code" type="button" value="Toggle raw code">
# ''')

HTML(''' 
    <script type="text/javascript" src=\"../../supplementary/js/custom.js\"></script>
    
    <style>
        .output_prompt{opacity:0;}
    </style>
    
    <input id="toggle_code" type="button" value="Toggle raw code">
''')


# What is Apache Spark?

Apache Spark is an open-source data processing framework designed for large volumes of data. It splits the data into smaller chunks and distributes them across multiple processing units, so that many processing tasks run at the same time.
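
The split-and-distribute idea can be sketched in plain Python, without Spark. This is only an illustration of the pattern, not how Spark is implemented: the helper names `split_into_chunks` and `chunk_sum` are hypothetical, and Python threads are used just to keep the example short, whereas Spark runs its tasks in executors that may live on many machines.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Split a list into n_chunks roughly equal, contiguous pieces."""
    size, remainder = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks get one extra element.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def chunk_sum(chunk):
    """The 'task' that is applied to each chunk independently."""
    return sum(chunk)


data = list(range(1, 101))           # toy dataset: the numbers 1..100
chunks = split_into_chunks(data, 4)  # 4 chunks of 25 numbers each

# Hand one chunk to each worker, then combine the partial results.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(chunk_sum, chunks))

total = sum(partial_sums)
print(partial_sums, total)
```

Each worker only ever sees its own chunk, and the final answer is assembled from the partial results. Spark follows the same divide, process, and combine pattern at a much larger scale.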



# Apache Spark architecture


<table>
    <tr style="background: #fff; text-align: left; vertical-align:">
        <td style="background: #fff; text-align: left; font-size: 23px;">
            <ul>
                <li>
                    <b>Driver Program:</b> The code we write to process our data with the Apache Spark framework is the driver program. The first thing we do in the driver program is build a Spark Session.
                </li>
                <li>
                    <b>Spark Session:</b> Spark Session is a gateway to all Spark functionalities (methods, data types, etc.). 
                </li>
                <li>
                    <b>Cluster Manager:</b> The cluster manager allocates computing resources on the worker nodes to our application. The driver program then divides each job into smaller tasks and sends them to the executors running on those nodes.
                </li>
                <li>
                    <b>Worker Node:</b> A worker node runs the tasks it receives. Each worker node hosts one or more executors, and each executor can run multiple tasks in parallel.
                </li>
            </ul>
        </td>
        <td style="width: 50%; background: #fff; text-align: left; vertical-align: top;"> 
            <img src="supplementary/spark_ar.png" width="100%">
    </td>
    </tr>
</table>




Before jumping into the code...
<table>
    <tr style="background: #fff; text-align: left; vertical-align:">
        <td style="background: #fff; text-align: left; font-size: 23px;">
            <p>Apache Spark has a core engine and four extensions:</p>
            <ul>
                <li>
<b>Spark SQL (enables executing SQL queries on data)</b>
                </li>
                <li>
Spark Streaming (streaming data add-on)
                </li>
                <li>
Spark MLlib (set of machine learning libraries)
                </li>
                <li>
GraphX (designed for distributed graph processing)
                </li>
            </ul>
</td>
     <td style="width: 40%; background: #fff; text-align: left; vertical-align: top;"> <img src='https://blog.kakaocdn.net/dn/uO5KB/btqA2a0rio3/n95MXm1PSKxJzz1oaoCA21/img.png?raw=1' width="500" height="700" alt='map'></td>
    </tr>
</table>



In this lesson, we will focus on **Spark SQL** only. Let's start by installing PySpark and then writing the driver program. 

First, we need to install the Python version of Spark. It's called PySpark. <br>
Run the cell below to install it with pip, the Python package installer. <br>
<br>
Don't worry about restarting the kernel. We will do that next.

In [None]:
%pip install pyspark --quiet

Now, click the "Restart Kernel" button to update the list of installed packages.

In [4]:

def restarter():
    display(HTML(
        '''
            <script id="p1">
                function restart_kernel(){
                    IPython.notebook.kernel.restart();
                    var element = document.getElementById('p1').parentElement;
                    element.innerHTML = "Kernel restarted successfully!"+"<br/>"+"Move on to the next slide.";
               }
            </script>
            <button onclick="restart_kernel()">Restart Kernel</button>
        '''
    ))
    
restarter()

All right, now let's build a Spark Session. Remember, this is the first thing to do in the driver program. 

We use `SparkConf()` to give our application a name and to specify the number of threads it runs with. 

Here, we name it `hourofci`, request 4 **threads** on our local machine, and store the session in a variable named `spark`. Run the following code:


In [None]:
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = SparkConf().setAppName("hourofci").setMaster("local[4]")
spark = SparkSession.builder.config(conf=conf).getOrCreate()
spark

Cool! Now we have a platform to perform our data analysis using 4 threads in parallel! 

Now, quiz time!

In [5]:
widget1 = widgets.RadioButtons(
    options = ['No', 'Yes'],
    description = 'Is the number of threads equivalent to the number of physical CPU cores in our machine?', style={'description_width': 'initial'},
    layout = Layout(width='100%',display="flex", justify_content="flex-start"),
    value = None
)

display(widget1)

hourofci.SubmitBtn2(widget1)




## Threads vs. Physical CPU Cores

Despite how it may sound, threads and CPU cores are **NOT** the same. 
To explain threads, let's go back to our grocery store analogy. 

Normally, a grocery store has a **checkout section** with one or more **cashiers** working in it. You (a customer) unload your items onto a cashier's **conveyor belt**, and the cashier's job is to pick up and scan them. 

In the computer/grocery analogy:
<ul>
    <li>
        the <font style="font-weight: bold; color:red">checkout section</font> would be the <font style="font-weight: bold; color:red">CPU</font>,
    </li> 
    <li>
        each <font style="font-weight: bold; color:blue">cashier</font> is equivalent to a <font style="font-weight: bold; color:blue">physical core</font>,
    </li>
    <li>
        and the <font style="font-weight: bold; color:green">conveyor belt</font> would be analogous to a single <font style="font-weight: bold; color:green">thread</font>. 
    </li>
</ul>

So, in a fantasy world, if a cashier could handle two customers (= two conveyor belts) at the same time, that would correspond to a single core running two threads. 
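
To see that the thread count is not tied to the core count, here is a small sketch in plain Python (not Spark). The value `N_THREADS = 8` is an arbitrary choice for illustration. The barrier only releases once all threads have arrived, so the program can finish only if all of them are alive at the same moment, no matter how many cores the machine has.

```python
import os
import threading

N_THREADS = 8  # arbitrary number for illustration; not derived from the core count

barrier = threading.Barrier(N_THREADS)  # releases only when all threads have arrived
lock = threading.Lock()
seen = []


def worker():
    barrier.wait()  # blocks until all N_THREADS threads are alive together
    with lock:
        seen.append(threading.get_ident())  # record this thread's identifier


threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("CPUs reported by the operating system:", os.cpu_count())
print("Threads that were alive at the same time:", len(set(seen)))
```

The operating system takes turns scheduling the threads onto whatever cores exist, much like one cashier switching between several conveyor belts.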


## Parallelizing Using Threads!

The number of threads is the number of lanes (conveyor belts, in our analogy) you open to process your data in parallel. We saw how to pass a master string such as `local[4]` to the `setMaster()` function to specify the number of threads. The following options are available:

* `local`: Run Spark with a single thread - No parallelism at all!

* `local[N]`: Run Spark with N threads.

* `local[*]`: Run Spark with as many threads as there are logical cores on your machine. This can exceed the number of physical cores when each core runs more than one thread.
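
The three options above can be captured in a tiny helper that builds the master string. The function name `local_master` is hypothetical and is not part of PySpark. It only produces the text that you would then pass to `setMaster()`.

```python
def local_master(n_threads=None):
    """Build a Spark master string for local mode (hypothetical helper).

    None          -> "local"     (a single thread)
    "*"           -> "local[*]"  (one thread per logical core)
    positive int  -> "local[N]"  (exactly N threads)
    """
    if n_threads is None:
        return "local"
    if n_threads == "*":
        return "local[*]"
    is_int = isinstance(n_threads, int) and not isinstance(n_threads, bool)
    if is_int and n_threads > 0:
        return f"local[{n_threads}]"
    raise ValueError("n_threads must be None, '*', or a positive integer")


for option in (None, 4, "*"):
    print(repr(option), "->", local_master(option))
```

With PySpark installed, the result could be used as `SparkConf().setAppName("hourofci").setMaster(local_master(4))`, which is equivalent to the `local[4]` setting used earlier in this lesson.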


Click back 3 slides to re-visit the code where we set the number of threads. 



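If you are curious to see how many CPU cores are available on the HourofCI server (<a href="https://jetstream-cloud.org">Jetstream2</a>) right now, run the following code. Note that `os.cpu_count()` reports logical cores, so the number can be larger than the count of physical cores:


In [None]:
import os
# os.cpu_count() returns the number of logical CPUs (or None if it cannot be determined)
print('The number of available logical CPU cores: ', os.cpu_count())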

Great! You now have a general sense of what Spark is and how to install and load it!<br> 
In the next segment we will get into some more details of **Spark SQL**.<br><br>


<font size="+1"><a style="background-color:blue;color:white;padding:12px;margin:10px;font-weight:bold;" href="pc-4.ipynb">Click here to go to the next notebook.</a></font>