# Working with computing clusters

  <a href = "http://yogen.io"><img src="http://yogen.io/assets/logo.svg" alt="yogen" style="width: 200px; float: right;"/></a>


This section shows how a data scientist works on remote machines.

We do it all the time.

## Amazon Web Services

![AWS](https://d7umqicpi7263.cloudfront.net/img/product/0f7858eb-0831-4a33-9af9-8e78db6b23d8/c7939ac3-d352-4bee-bf8a-7ee4a2dd2bff.png)

AWS is widely used.

You can use an introductory offer to get a taste of it for free.


### Signing up for AWS

https://aws.amazon.com/

### AWS services of interest:

- EC2: Elastic Compute Cloud. Lets us rent virtual machines (computing power) sized to our needs, under several pricing models.

- S3: Simple Storage Service. Lets us rent storage capacity that our virtual servers can read from and write to.

- EMR: Elastic MapReduce. Simplifies the creation and management of Hadoop/Spark clusters. 

- Lambda: Serverless computing.

### Creating a single instance: 

- choosing an operating system

- choosing an instance type - spot prices

- auto scaling groups

- creating a new keypair and storing the private key in .ssh/

## Google Cloud Platform

![Google Cloud](https://cloud.google.com/blog/static/assets/GCP_Twitter_Card-2000%C3%971000.png)


Rival to AWS

$300 introductory credit!

We will look into it in a bit.

## Accessing remote computers

ssh is your basic tool. You should always use public/private key pair authentication rather than passwords, especially if the ssh port (usually 22) is open to the internet.

### public-private keys

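Generating ssh keys:

- [ssh-keygen](https://www.ssh.com/ssh/keygen/)

If we already have a private key (for example, the one downloaded when creating a key pair in AWS), `ssh-keygen -y` prints the public key that matches it:

```shell
ssh-keygen -y -f $PRIVATE_KEY
```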

### `ssh`

We need to keep the private key (not the `.pub` file) somewhere where we will not lose it. The standard place is the `~/.ssh/` folder.

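```shell
mkdir -p ~/.ssh
mv $key_file ~/.ssh
# ssh refuses to use a private key that other users can read
chmod 600 ~/.ssh/$(basename $key_file)
```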



`ssh` lets you control a remote machine as if you were typing at its terminal.

Let's connect to the instance we just created. 

We need to use the "identity file" (private key) to authenticate ourselves:

```shell
ssh -i $PRIVATE_KEY $REMOTE_USER@$REMOTE_MACHINE
```


### `scp`

`scp` sends our data to a remote computer.

Let's send `coupon150720.csv` to the recently created instance.

```shell
scp -i $PRIVATE_KEY $LOCAL_PATH $REMOTE_USER@$REMOTE_MACHINE:$REMOTE_PATH
```

### SSH config file


An SSH config file saves us from typing those long connection commands every time. It lives in `~/.ssh/config` and looks like this:

```
Host mygcpcluster
    User remoteuser
    HostName masternodename
    IdentityFile ~/.ssh/my-private-key
```

Once it's there, we can just type

```
ssh mygcpcluster
```
and we'll be connected.

There are lots and lots of options to ssh we can configure like this. More details [here](https://nerderati.com/2011/03/17/simplify-your-life-with-an-ssh-config-file/).


## Google Cloud Platform

![Google Cloud](https://cloud.google.com/images/velostrata/cloud-lockup-logo.png)


Three ways to interact with GCP:

* The Google Cloud Platform console (the GUI)

* `gcloud` command-line tool

* Cloud Dataproc REST API


## Google Dataproc

[Cloud Dataproc FAQ](https://cloud.google.com/dataproc/docs/resources/faq)

### Creating a cluster in Google Dataproc

* Creating a cluster

* Installing gcloud SDK 


### Installing the `gcloud` SDK

[On Ubuntu/Debian](https://cloud.google.com/sdk/docs/quickstart-debian-ubuntu):

```bash
# Add the Cloud SDK distribution URI as a package source
echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list

# Import the Google Cloud Platform public key
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key --keyring /usr/share/keyrings/cloud.google.gpg add -

# Update the package list and install the Cloud SDK
sudo apt-get update && sudo apt-get install google-cloud-sdk
```

### Configuring the `gcloud` SDK

```bash
gcloud init
```

### Adding users to a project

[Identity and access management](https://cloud.google.com/iam/docs/)

### Creating a cluster

With the GUI: Google Cloud Console -> Dataproc -> Clusters -> Create cluster

With the SDK: 

```bash
gcloud dataproc clusters create $CLUSTERNAME --region $REGION
```

Many more options are available. You can explore them within the SDK (`gcloud dataproc clusters create --help`) or through the GUI.

[Creating a Cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster)

### Uploading data to a GCP cluster

Two options:

* `scp` to the master node

* Upload the data to Google Cloud Storage, then use `gs://` as a path prefix in your script

    * First, you'll need to [create a storage bucket].
    
[create a storage bucket]: https://cloud.google.com/storage/docs/creating-buckets

### Creating a storage bucket

```bash
# -p is the ID of the project that will own the bucket
gsutil mb -p kschool-spark gs://bucket-name
```


### Uploading your data

```bash
gsutil cp [LOCAL_OBJECT_LOCATION] gs://[DESTINATION_BUCKET_NAME]/
```


### Creating a cluster from the `gcloud` SDK

```bash
gcloud dataproc clusters create [CLUSTER_NAME] --region=[REGION]
```

### Submitting a job to Google Dataproc

To submit a PySpark job, run:

```bash
gcloud dataproc jobs submit pyspark --cluster $CLUSTERNAME --region $REGION \
    my_script.py -- arg1 arg2
```

https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/
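
Everything after the `--` separator reaches the script through `sys.argv`. The sketch below is only an illustration of how `my_script.py` might read those two values; the helper `parse_args`, the bucket name and the file names are hypothetical.

```python
def parse_args(argv):
    """Split the script's argv into (input_path, output_path)."""
    # argv[0] is the script itself, so two arguments mean three entries
    if len(argv) != 3:
        raise SystemExit("Usage: my_script.py <input_path> <output_path>")
    _, input_path, output_path = argv
    return input_path, output_path


# Simulated argv for: ... my_script.py -- <input_path> <output_path>
example_argv = [
    "my_script.py",
    "gs://bucket-name/coupon150720.csv",  # made-up input location
    "gs://bucket-name/output",            # made-up output location
]
input_path, output_path = parse_args(example_argv)
print(input_path, output_path)
```

In the real script the call would be `parse_args(sys.argv)`, and the two paths would go to `sc.textFile` and `saveAsTextFile`.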

## Storage: HDFS

Assumptions in [HDFS design]:

* The system is built from many inexpensive commodity components that often fail. 

* The system stores a modest number of large files. 

* The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. 

* The workloads also have many large, sequential writes that append data to files.

* The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file.

* High sustained bandwidth is more important than low latency. 

[HDFS design]: https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html


![HDFS](https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/images/hdfsarchitecture.png)


### hdfs dfs

Mimics the shell, but with a few differences:

* We call shell-like commands as options to the `hdfs dfs` command

* There is no concept of a current working directory (therefore, no cd command)

* It has some annoying inconsistencies with regular bash

```shell

[hadoop@masternode ~]$ hdfs dfs 

Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
	[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] <path> ...]
	[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] <path> ...]
	[-expunge]
	[-find <path> ... <expression> ...]
	[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] <src> <localdst>]
	[-help [cmd ...]]
	[-ls [-d] [-h] [-R] [<path> ...]]
	[-mkdir [-p] <path> ...]
	[-moveFromLocal <localsrc> ... <dst>]
	[-moveToLocal <src> <localdst>]
	[-mv <src> ... <dst>]
	[-put [-f] [-p] [-l] <localsrc> ... <dst>]
	[-renameSnapshot <snapshotDir> <oldName> <newName>]
	[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
	[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
	[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
	[-setfattr {-n name [-v value] | -x name} <path>]
	[-setrep [-R] [-w] <rep> <path> ...]
	[-stat [format] <path> ...]
	[-tail [-f] <file>]
	[-test -[defsz] <path>]
	[-text [-ignoreCrc] <src> ...]
	[-touchz <path> ...]
	[-truncate [-w] <length> <path> ...]
	[-usage [cmd ...]]
```

Try:

```shell
user@gateway$ hdfs dfs -ls
```

Why does it return nothing?

Now try:

```shell
user@gateway$ hdfs dfs -ls /
user@gateway$ hdfs dfs -ls /user
```

#### `hdfs dfs -mkdir`

Create a folder called `data` inside your HDFS home folder. Try it on your own.


#### `hdfs dfs -put`

By analogy with `ls`, can you guess where
`$LOCAL_FILE` will be put if I do this? (Don't do it.)

```shell
user@gateway$ hdfs dfs -put $LOCAL_FILE
```

Now, put the file in HDFS, inside your "data" folder:

```shell
user@gateway$ hdfs dfs -put $LOCAL_FILE $HDFS_FOLDER
```

#### `hdfs dfs -get` / `hdfs dfs -cat`

If you do any kind of work in HDFS, eventually you'll need to get something out of it!

```shell
user@gateway$ hdfs dfs -get $HDFS_FILE
```

However, you might only need to take a peek at the contents of a file:

```shell
user@gateway$ hdfs dfs -cat $HDFS_FILE
```

The neat thing about `hdfs dfs -cat` is that it writes to stdout, so you can chain it into all your favorite shell pipelines!
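
A Python script can sit at the end of such a pipeline too. This is a minimal sketch, not part of the original material: `count_lines_with_prefix` is a hypothetical helper, and `io.StringIO` with made-up lines stands in for `sys.stdin` so the example runs without a cluster.

```python
import io


def count_lines_with_prefix(stream, prefix):
    """Count the lines in a text stream that start with prefix."""
    return sum(1 for line in stream if line.startswith(prefix))


# io.StringIO plays the role of sys.stdin; the lines are made up.
fake_stdin = io.StringIO("79065,a\n12345,b\n79065,c\n")
matches = count_lines_with_prefix(fake_stdin, "79065")
print(matches)
```

In a real pipeline the stream would be `sys.stdin`, fed by something like `hdfs dfs -cat $HDFS_FILE | python count_prefix.py`, where `count_prefix.py` is a hypothetical file name.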

Other useful hadoop filesystem commands:
    
```shell
user@gateway$ hdfs dfs -getmerge $HDFS_GLOB $LOCAL_FILE
user@gateway$ hdfs dfs -stat $HDFS_FILE
user@gateway$ hdfs dfs -tail $HDFS_FILE
```

Much more at:
https://hadoop.apache.org/docs/r2.8.3/hadoop-project-dist/hadoop-common/FileSystemShell.html

## spark-submit

#### ```mysparkjob.py```


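```python
from __future__ import print_function  # keeps print() working on Python 2
import sys

from pyspark import SparkContext

if __name__ == "__main__":
    # Expect exactly two arguments: the input path and the output path
    if len(sys.argv) != 3:
        print("Usage: mysparkjob <input_path> <output_path>", file=sys.stderr)
        sys.exit(-1)
    sc = SparkContext(appName="MyTestJob")
    # Read the input as an RDD of text lines
    dataTextAll = sc.textFile(sys.argv[1])
    # Keep only the lines that start with '79065'
    dataRDD = dataTextAll.filter(lambda line: line.startswith('79065'))
    # Save the filtered lines under the output path
    dataRDD.saveAsTextFile(sys.argv[2])
    sc.stop()
```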

Just a simple Spark job.
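
The filter predicate is plain Python, so it can be checked on a laptop before using any cluster time. The sketch below is an illustration only: `keep_line` and the sample lines are hypothetical, and the built-in `filter` stands in for `RDD.filter`.

```python
def keep_line(line, prefix="79065"):
    """Return True for lines that start with the given prefix."""
    return line.startswith(prefix)


# Made-up records, just to exercise the predicate.
sample_lines = [
    "79065,first made-up record",
    "12345,second made-up record",
    "79065,third made-up record",
]

# The built-in filter plays the role of RDD.filter on a plain list.
kept = list(filter(keep_line, sample_lines))
print(kept)
```

In the job itself, the same function could be passed directly to the RDD, as in `dataTextAll.filter(keep_line)`.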

### Running our Spark app

```shell
./bin/spark-submit \
    mysparkjob.py \
    data/coupon150720.csv \
    test.csv
    
```

Once it runs, what is test.csv? How would you get it back on the local file system?

#### Exercise 

Adapt our exercise from notebook 02 to run in the cluster. As a reminder:

Get stats for all tickets with destination MAD from `coupon150720.csv`. You will need to extract ticket amounts with destination MAD, and then calculate:

* Total ticket amounts per origin


### Running on cluster versus client mode

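This setting controls where the driver runs.

The default deployment mode is `client`: the driver runs on the machine that is running the `spark-submit` script. In `cluster` mode, the driver runs on one of the nodes of the cluster instead.

Note that `--deploy-mode` is an option of `spark-submit`, so it has to come before the application file.

```shell
./bin/spark-submit \
    --deploy-mode client \
    mysparkjob.py \
    data/coupon150720.csv \
    test.csv
```


```shell
./bin/spark-submit \
    --deploy-mode cluster \
    mysparkjob.py \
    data/coupon150720.csv \
    test.csv
```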
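
When these commands are generated from a script, a small helper can keep the options ahead of the application file, since `spark-submit` hands everything after that file to the application as arguments. This is a sketch under that assumption; `build_spark_submit` is a hypothetical helper, not part of Spark.

```python
import shlex


def build_spark_submit(app, app_args, deploy_mode="client"):
    """Build a spark-submit command line as a list of arguments.

    Options go before the application file; everything after the
    application file is handed to the application itself.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return ["./bin/spark-submit", "--deploy-mode", deploy_mode, app, *app_args]


cmd = build_spark_submit(
    "mysparkjob.py",
    ["data/coupon150720.csv", "test.csv"],
    deploy_mode="cluster",
)
# shlex.join (Python 3.8+) renders the list as a shell-safe string
print(shlex.join(cmd))
```

The resulting list can be passed as is to `subprocess.run(cmd)` on a machine where Spark is installed.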

## Further reading



[hadoop fs](https://hadoop.apache.org/docs/r2.8.3/hadoop-project-dist/hadoop-common/FileSystemShell.html)

[standalone Spark versus yarn versus Mesos](http://www.agildata.com/apache-spark-cluster-managers-yarn-mesos-or-standalone/)

[How Spark runs on clusters](https://spark.apache.org/docs/2.2.0/cluster-overview.html)

[spark-submit](https://aws.amazon.com/es/blogs/big-data/submitting-user-applications-with-spark-submit/)

[Cluster versus Client deployment modes](https://stackoverflow.com/questions/28807490/what-conditions-should-cluster-deploy-mode-be-used-instead-of-client)

[Tunnelling web connections through ssh to view the Spark management web views](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ssh-tunnel.html)

[Findings on running Google Dataproc](https://www.inovex.de/blog/findings-in-running-google-dataproc/)

[Dataproc - Spark cluster in minutes](https://medium.com/google-cloud/dataproc-spark-cluster-on-gcp-in-minutes-3843b8d8c5f8)

[Using the `gcloud` command line tool](https://cloud.google.com/sdk/docs/quickstart-debian-ubuntu)