# Module 6 - Advanced Distributed Computation

This module examines how technologies and concepts from prior modules combine to support advanced parallel data processing with Map-Reduce based computation (Spark).
### Topics
 1. Map-Reduce
 2. Hadoop
 3. Spark
 4. GCP Dataproc
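
As a rough illustration of the Map-Reduce idea behind these topics, here is a minimal single-machine sketch in plain Python. It is not how Hadoop or Spark are implemented; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this example, and a real framework would run each phase across many workers.

```python
from collections import defaultdict
from functools import reduce


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single total."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence on a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["spark runs on hadoop", "hadoop runs map reduce"])
print(counts)
```

The map and reduce steps only ever look at one line or one key at a time, which is what lets a cluster split the work across machines.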
 

#### Module Kickoff Video
* [Introduction to Map-Reduce, Hadoop, Spark, and Dataproc (12 min)](https://youtu.be/RV4TWYefZwo)
 * [Slides](./resources/DSA8430_Parallel_AdvancedDist.pdf)
 

## Videos

For a quick taste of how Map Reduce, Hadoop, Spark, and GCP Dataproc relate, please watch these short videos first.
 * GCP Dataproc (1 min): https://youtu.be/Jj6mp7Sam10
 * Introduction to Map Reduce and Hadoop (6 min): https://youtu.be/aReuLtY0YMI
 * Introduction to Spark (38 min): https://youtu.be/znBa13Earms
 * GCP Dataproc as Hadoop/Spark (3 min): https://youtu.be/h1LvACJWjKc

Suggested Additional Viewings
 * Map Reduce (35 min): https://youtu.be/b-IvmXoO0bU
 * GCP Dataproc (47 min): https://youtu.be/IgnwXDU770M
 * PySpark vs Pandas (31 min): https://youtu.be/b-IvmXoO0bU

## Setup (Lab)

Remember to access GCP through our special IAM portal.  See [GCP Getting Started](https://europa.dsa.missouri.edu/user/tpgd5/notebooks/ParallelProgrammingAnalytics/module1/practices/GCP_Getting_Started.ipynb) for details.

### [Enable the appropriate APIs for Dataproc, etc.](https://console.cloud.google.com/apis/enableflow?apiid=dataproc,bigquery.googleapis.com,compute_component&_ga=2.131884943.1006469968.1647351992-382265758.1640621151&project=umc-dsa-8430-sp2022)


Ensure you see the correct project, `umc-dsa-8430-sp2022`

![GCP_Dataproc_Enable_1.png MISSING](./images/GCP_Dataproc_Enable_1.png)

You should then see

> You are about to enable:
>   
> Cloud Dataproc API  
> BigQuery API  
> Compute Engine API

Click **ENABLE**


## Lab - BigQuery Revisited

#### Estimated Tutorial Time: <span style='color:blue'>approximately 15 minutes</span>

Follow the tutorial here: https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml#create_a_subset_of_bigquery_natality_data

Work through the tutorial only until you have created your `regression_input` table within your newly created dataset.

**Don't forget to use your SSO to prefix resources you create in the project!**

**Additionally, set your table to expire in 30 days!**
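
The expiration can be set in the Console when you save the table. As an alternative, the sketch below builds a BigQuery DDL statement that sets the expiration relative to the current time. The helper name `expiration_ddl` and the dataset name `SSO_natality_regression` are hypothetical; substitute your own dataset and run the printed statement in the BigQuery query editor.

```python
def expiration_ddl(dataset, table, days=30):
    """Build an ALTER TABLE statement that expires the table after `days` days."""
    if days <= 0:
        raise ValueError("days must be a positive integer")
    return (
        f"ALTER TABLE `{dataset}.{table}` "
        f"SET OPTIONS (expiration_timestamp = "
        f"TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL {days} DAY))"
    )


# Hypothetical dataset name; replace SSO with your own identifier.
statement = expiration_ddl("SSO_natality_regression", "regression_input")
print(statement)
```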

#### Saving the query results to a table in your dataset should look similar to below:

![GCP_BigQuery_Result_to_Datatable.png MISSING](./images/GCP_BigQuery_Result_to_Datatable.png)

## Practices and Tutorials

The practices for this module involve following selected tutorials from the User Guide.
We then ask you to do a few extra steps to practice your data science skills.

For each practice, ensure your artifacts are properly linked into this notebook and uploaded in the appropriate location.

---

#### Status Check:  Create a screen snip of your Table Info similar to below

![GCP_BigQuery_Datatable_Info.png is MISSING](./images/GCP_BigQuery_Datatable_Info.png)

##### Artifact - Your image `regression_input.png` should be linked below

![Your regression_input.png image is MISSING](regression_input.png)

---


### Google Cloud Shell
For some of the activities in this module, we will be using the **Google Cloud Shell**.

You can launch the Google Cloud Shell from the GCP Console using the icon noted below.
![GCP_GoogleCloudShell_launch.png MISSING](./images/GCP_GoogleCloudShell_launch.png)

---

Once you activate the Google Cloud Shell, you will get a Linux Terminal view at the bottom of that browser tab.
This is from a container launched in GCP, similar to how you accessed pods in the Kubernetes module.

![GCP_GoogleCloudShell_initialized.png MISSING](./images/GCP_GoogleCloudShell_initialized.png)


---

## Practice 1:

#### Estimated Time: <span style='color:blue'>approximately 20 minutes</span>

Create a Dataproc cluster using the Console

https://console.cloud.google.com/dataproc/clustersAdd?_ga=2.130526095.1006469968.1647351992-382265758.1640621151

Name your cluster `cluster-SSO`, substituting your actual SSO.

Leave the type as _Standard (1 master, N workers)_

Check the box to _Enable component Gateway_

For _Optional components_, select **Jupyter Notebook**

Leave all the default options under _Configure Nodes_, _Customize Cluster_, and _Manage Security_.

Click the **Create** button when you are ready!

### Practice 1 - Artifact

#### 1. From the Cloud Shell, export the YAML file that defines/configures your cluster! 
Example (guess what you need to do with SSO below):
```BASH
gcloud dataproc clusters export cluster-SSO --destination cluster-SSO.yaml --region us-central1
```

#### 2. After the YAML file is produced, download it from GCP using the download option from the Google Cloud Shell.

#### 3. Then, upload it into this module's `practices/` folder.


#### 4. Update the link below to point to your Cluster YAML file.
Test that the link does not produce a 404 error (and therefore zero points).

[cluster-SSO.yaml](./practices/cluster-lcmhng.yaml)
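
Before committing, one way to catch a broken link early is to check that the linked file really exists relative to the notebook. This is a small sketch; `missing_artifacts` is a hypothetical helper, and the path in the example is a placeholder for your own file name.

```python
from pathlib import Path


def missing_artifacts(relative_paths, base_dir="."):
    """Return the linked paths that do not exist as files under base_dir."""
    base = Path(base_dir)
    return [p for p in relative_paths if not (base / p).is_file()]


if __name__ == "__main__":
    # Placeholder link target; substitute your own SSO.
    links = ["practices/cluster-SSO.yaml"]
    for path in missing_artifacts(links):
        print(f"MISSING: {path}")
```

Run it from the module folder so the relative paths resolve the same way the notebook links do.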


#### 5. Delete your Cluster!

---

## Practice 2

#### Estimated Time: <span style='color:blue'>approximately 60 minutes</span>


#### 1. We will now recreate the cluster using the Google Cloud Shell, but add a suffix of `-2` to the cluster name.

```BASH
gcloud dataproc clusters import cluster-SSO-2 --source cluster-SSO.yaml --region us-central1
```

This will take a little bit to run, but initially you should see output similar to below:
```BASH
Waiting on operation [projects/umc-dsa-8430-sp2022/regions/us-central1/operations/36783211-573b-362d-b525-64dff1987991].
Waiting for cluster creation operation...
WARNING: For PD-Standard without local SSDs, we strongly recommend provisioning 1TB or larger to ensure consistently high I/O performance. See https://cloud.google.com/compute/docs/disks/performance for information on disk I/O performance.
Waiting for cluster creation operation...working   
```

Eventually, you should see a message similar to:
```BASH
Created [https://dataproc.googleapis.com/v1/projects/umc-dsa-8430-sp2022/regions/us-central1/clusters/cluster-scottgs-2] Cluster placed in zone [us-central1-a].
```
And the cluster will be visible in the Console.
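
The cluster state can also be checked from the Cloud Shell instead of the Console. The sketch below assumes `gcloud` is on the PATH (as it is in Cloud Shell) and wraps `gcloud dataproc clusters describe`; the helper names are hypothetical, and only the command-building part runs without GCP access.

```python
import shlex
import subprocess


def describe_state_argv(cluster, region="us-central1"):
    """Build the gcloud argument list that prints only the cluster state."""
    return [
        "gcloud", "dataproc", "clusters", "describe", cluster,
        "--region", region,
        "--format=value(status.state)",
    ]


def cluster_state(cluster, region="us-central1"):
    """Run the command and return the state string reported by gcloud."""
    result = subprocess.run(
        describe_state_argv(cluster, region),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Print the command for review; substitute your own SSO in the cluster name.
print(shlex.join(describe_state_argv("cluster-SSO-2")))
```

Calling `cluster_state("cluster-SSO-2")` in Cloud Shell should return the state once the cluster exists.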

### Practice 2 - Artifact - A

Click on your running cluster and get a screen snip of the top details (Name, Cluster UUID, Type, and Status ... showing the cluster in Running state).
Name your screen snip `practice2a.png`, and save it into the module `practices` folder. 
Ensure your screen snip shows in the cell below.


![Your ./practices/practice2a.png is MISSING](./practices/practice2a.png)

#### 2. Create a Jupyter Notebook from the BigQuery Spark-ML Linear Regression tutorial code.

The code is found at the link below.

https://cloud.google.com/dataproc/docs/tutorials/bigquery-sparkml#run_a_linear_regression


As you make your Notebook with multiple Code Cells, 
be sure to add Markdown Cells and copy code comments into Markdown to allow a future user (or your future self) to read back through and understand what is going on.

##### Important Change from Tutorial

Update the way the Spark Session is instantiated.
This changes the way that the Java Archive (JAR) is loaded into memory for your PySpark session.

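```python
# Original instantiation from the tutorial (commented out):
#sc = SparkContext()
#spark = SparkSession(sc)

# Build the session with the BigQuery connector JAR configured.
spark = SparkSession.builder \
  .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
  .getOrCreate()
```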

**Also make sure you update the SQL** to be appropriate for your dataset and table name.
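
To make the dataset and table substitution concrete, here is a hedged sketch of how the session and the table reference might fit together. The helper names (`table_ref`, `build_session`, `load_view`), the view name `natality`, and the dataset name are assumptions for illustration; the tutorial code remains the reference. PySpark is imported inside `build_session` so the rest of the block loads even where PySpark is not installed.

```python
# Connector JAR path taken from the session snippet in this notebook.
BIGQUERY_JAR = "gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"


def table_ref(dataset, table="regression_input"):
    """Return the dataset.table string that the connector reads from."""
    return f"{dataset}.{table}"


def build_session():
    """Create (or reuse) a Spark session with the BigQuery connector JAR."""
    from pyspark.sql import SparkSession  # available on the Dataproc cluster
    return SparkSession.builder.config("spark.jars", BIGQUERY_JAR).getOrCreate()


def load_view(spark, dataset, view_name="natality"):
    """Read the BigQuery table and expose it to Spark SQL under view_name."""
    df = spark.read.format("bigquery").option("table", table_ref(dataset)).load()
    df.createOrReplaceTempView(view_name)
    return df


# Hypothetical dataset name; replace SSO with your own identifier.
print(table_ref("SSO_natality_regression"))
```

On the cluster, `load_view(build_session(), "SSO_natality_regression")` would register the view that the SQL query then selects from.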

### Practice 2 - Artifact - B

Name your notebook `SSO_SparkML.ipynb`, using your SSO instead of literal characters `SSO`.
**Ensure the output is saved in the notebook.**

Upload the notebook into the module `practices` folder, then update the link in the cell below to point to your uploaded file.



[Your Spark ML Notebook](./practices/lcmhng_SparkML.ipynb)

#### 3. Delete your Dataproc cluster (cluster-SSO-2).

---

## Exercise 1:

#### Estimated Time: <span style='color:blue'>approximately 60 minutes</span>

Spin up a new cluster, **cluster-SSO-3**.

Briefly review your work on the [BigQuery Exercise in Module 3](../module3/GCP_BigQuery.ipynb#Exercise)

Re-use or re-stage the BigQuery dataset.
In module 3, you created a Logistic Regression model in BigQuery.

For this exercise, build a notebook modeled on Practice 2, but using a **classification model** of your choice for the `google_analytics_sample` data.
https://spark.apache.org/docs/3.1.2/api/python/reference/
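
As a starting point only, the sketch below shows one possible shape for a Spark ML classification pipeline. It assumes a DataFrame whose label column is a numeric 0/1 value and whose remaining columns are numeric features without nulls; the column names, the choice of a random forest, and the helper names are illustrative assumptions, not requirements of the exercise.

```python
def feature_columns(columns, label_col):
    """Treat every column except the label as a feature."""
    return [c for c in columns if c != label_col]


def train_classifier(df, label_col="label", seed=42):
    """Fit a random forest on 80% of df and return (model, test AUC)."""
    # Imported lazily so feature_columns works without PySpark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.feature import VectorAssembler

    train, test = df.randomSplit([0.8, 0.2], seed=seed)
    assembler = VectorAssembler(
        inputCols=feature_columns(df.columns, label_col),
        outputCol="features",
    )
    forest = RandomForestClassifier(
        featuresCol="features", labelCol=label_col, numTrees=20, seed=seed
    )
    model = Pipeline(stages=[assembler, forest]).fit(train)
    evaluator = BinaryClassificationEvaluator(
        labelCol=label_col, metricName="areaUnderROC"
    )
    return model, evaluator.evaluate(model.transform(test))


# Hypothetical column names, shown only to illustrate the helper.
print(feature_columns(["pageviews", "hits", "label"], "label"))
```

Any other classifier from `pyspark.ml.classification` can be swapped in for the random forest stage.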



Name your notebook `SSO_SparkClassification.ipynb`, using your SSO instead of literal characters `SSO`.
**Ensure the output is saved in the notebook.**

Upload the notebook into the module `exercises` folder, then update the link in the cell below to point to your uploaded file.


[Your Spark Classification Notebook](./exercises/lcmhng_SparkClassification.ipynb)

---

## Exercise 2:

#### Estimated Time: <span style='color:blue'>approximately 90 minutes</span>

Briefly review your work on the [AWS Quicksight Exercise in Module 4](../module4/AWS_Quicksight.ipynb#Excercises)

Using the same dataset, if possible, create a new BigQuery dataset.
For this exercise, use [Spark ML](https://spark.apache.org/docs/3.1.2/api/python/reference/) to perform one of the following options on your data.
 * Classification (not the same as you used in Exercise 1)
 * Regression (not the same as you used in Practice 2 above)
 * Clustering Analysis
 
Be sure to add some analytics and explanations of your data, models, and findings.
Visualizations are good!
 
Name your notebook `SSO_SparkExercise.ipynb`, using your SSO instead of literal characters `SSO`.
**Ensure the output is saved in the notebook.**

Upload the notebook into the module `exercises` folder, then update the link in the cell below to point to your uploaded file.


[Your Spark Exercises Notebook](./exercises/lcmhng_SparkExercise.ipynb)

## Exercise 3:

Delete your Dataproc cluster and your BigQuery Datasets!
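
Cleanup can be done in the Console, or from the Cloud Shell. Because these commands are destructive, the sketch below only builds and prints them for review rather than running them. The helper name `cleanup_commands` and the dataset name are hypothetical; the project and region are the ones used earlier in this module.

```python
import shlex


def cleanup_commands(cluster, datasets,
                     project="umc-dsa-8430-sp2022", region="us-central1"):
    """Return the argument lists for deleting a cluster and BigQuery datasets."""
    commands = [[
        "gcloud", "dataproc", "clusters", "delete", cluster,
        "--region", region, "--quiet",
    ]]
    for dataset in datasets:
        # -r removes contained tables, -f skips confirmation, -d targets a dataset.
        commands.append(["bq", "rm", "-r", "-f", "-d", f"{project}:{dataset}"])
    return commands


# Placeholder names; substitute your own SSO before running anything.
for argv in cleanup_commands("cluster-SSO-3", ["SSO_natality_regression"]):
    print(shlex.join(argv))
```

Read each printed line carefully before pasting it into the Cloud Shell.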

---

# Save your Notebook and commit the Notebook and artifacts for Grading

---

## Submitting your work

### <span style='background:lightblue'>Please be sure the artifacts from all practices and exercises are added into your repository for the commit and push!</span>

#### Steps:
  1. Open Terminal in JupyterHub
  1. Change into the course folder
  1. Stage (Git Add) the module's learning activities  
  `git add module6`
  1. Create your work snapshot (Git Commit)  
  `git commit -m "Module 6 submission"`
  1. Upload the snapshot to the server (Git Push)  
  `git push`