# Working with computing clusters

  <a href = "http://yogen.io"><img src="http://yogen.io/assets/logo.svg" alt="yogen" style="width: 200px; float: right;"/></a>


The goal here is to convey how a Data Scientist works remotely.

We do it all the time.

## Amazon Web Services

![AWS](https://d7umqicpi7263.cloudfront.net/img/product/0f7858eb-0831-4a33-9af9-8e78db6b23d8/c7939ac3-d352-4bee-bf8a-7ee4a2dd2bff.png)

AWS is widely used.

You can use an introductory offer to get a taste of it for free.

Jeff Bezos pushed for Amazon's internal products to behave as if they were aimed at external customers. For example, services should not share a database; to do something with a service, you have to go through a service interface that acts as a gateway (which is also a good software engineering pattern). That made it easy to offer those services externally.


### Signing up for AWS

https://aws.amazon.com/

### AWS services of interest:

- EC2: Elastic Compute Cloud. Allows us to rent virtual machines, computing power, fit to our needs and under several pricing models. It lets you launch virtual machines on Amazon's systems, and it was among the first services Amazon launched, together with S3. You can get _clean_ virtual machines with only the OS (Windows, Linux, ...) or machines with packages already installed, for example for ML. The big advantage, for a startup for example, is the tremendous elasticity. With spot pricing, you say how much you are willing to pay per hour, and the price for the same type of resource changes with demand. If it rises above what you are willing to pay, your instance is shut down.

- S3: Simple Storage Service: Allows us to rent storage capacity to plug into our virtual servers.

- EMR: Elastic MapReduce. Simplifies the creation and management of Hadoop/Spark clusters. It lets you create a cluster with as many nodes as you want, with Hadoop or Spark already configured. *Google equivalent*: `Dataproc`

- Lambda: Serverless computing.
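The spot-pricing rule described in the EC2 entry can be sketched in a few lines of Python. This is only a toy model: the function name `hours_until_interrupted` and the price series are hypothetical, and the real spot market has more rules than this.

```python
def hours_until_interrupted(max_price, hourly_prices):
    """Return how many hours the instance runs before the spot price
    exceeds max_price (the index of the first too-expensive hour).
    Returns len(hourly_prices) if the price never exceeds max_price."""
    for hour, price in enumerate(hourly_prices):
        if price > max_price:
            return hour
    return len(hourly_prices)


# Hypothetical prices in dollars per hour, for illustration only.
prices = [0.10, 0.12, 0.11, 0.25, 0.09]
print(hours_until_interrupted(0.20, prices))
```

With a maximum price of 0.20, the instance in this made-up series is cut off when the price jumps to 0.25.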

### Creating a single instance: 

- choosing an operating system

- choosing an instance type - spot prices

- auto scaling groups

- creating a new keypair and storing the private key in .ssh/

## Google Cloud Platform

![Google Cloud](https://cloud.google.com/_static/5b213a6cb2/images/cloud/cloud-logo.svg)


Rival to AWS

$300 introductory credit!

We will look into it in a bit.

## Accessing remote computers

ssh is your basic tool. You should always use public/private key pair authentication rather than passwords, especially if the ssh port (usually 22) is open to the internet.

### public-private keys

Generating ssh keys:

- [ssh-keygen](https://www.ssh.com/ssh/keygen/)

Note that when the key pair is created, the private key (the file without the `.pub` extension) is readable and writable only by its owner, for security.

```
-rw-------  1 juanluisgarcialopez  staff   1.8K Nov 15 18:36 kschool_jlg
(base) ➜  ~ ll kschool_jlg.pub
-rw-r--r--  1 juanluisgarcialopez  staff   411B Nov 15 18:36 kschool_jlg.pub
(base) ➜  ~
```
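If you want to check those permissions from a script, something like the following could work. It is a minimal sketch for POSIX systems; `is_private_to_owner` is a hypothetical helper, and it is demonstrated on a throwaway temporary file rather than on a real key.

```python
import os
import stat
import tempfile


def is_private_to_owner(path):
    """True if group and others have no permissions at all on the file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


# Demonstrate on a temporary file instead of a real key.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name

os.chmod(demo_path, 0o600)  # same as -rw------- in the listing
private_ok = is_private_to_owner(demo_path)

os.chmod(demo_path, 0o644)  # same as -rw-r--r--, fine for a .pub file
public_like = is_private_to_owner(demo_path)

os.remove(demo_path)
print(private_ok, public_like)
```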

We did not use this in class, but it prints the public key that corresponds to a private key:

```shell
ssh-keygen -y -f $PRIVATE_KEY
```

Then, on the remote computer, add this public key to the `~/.ssh/authorized_keys` file (including the `ssh-rsa` prefix at the beginning).
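Adding the key by hand works fine, but the step can also be scripted. The sketch below assumes one key per line in `authorized_keys`; `authorize_key` is a hypothetical helper, and the key and file used here are made up for the demonstration.

```python
import os
import tempfile


def authorize_key(public_key_line, authorized_keys_path):
    """Append a public key line unless it is already present.
    Returns True if the key was added, False if it was already there."""
    public_key_line = public_key_line.strip()
    content = ""
    if os.path.exists(authorized_keys_path):
        with open(authorized_keys_path) as fh:
            content = fh.read()
    if public_key_line in [line.strip() for line in content.splitlines()]:
        return False
    with open(authorized_keys_path, "a") as fh:
        # Make sure the new key starts on its own line.
        if content and not content.endswith("\n"):
            fh.write("\n")
        fh.write(public_key_line + "\n")
    return True


# Try it on a temporary file with a fake key, not a real one.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "authorized_keys")
fake_key = "ssh-rsa AAAAB3FAKEKEYFORILLUSTRATION user@laptop"

first = authorize_key(fake_key, demo_file)
second = authorize_key(fake_key, demo_file)
print(first, second)
```

The second call does nothing, so running the script twice does not duplicate the key.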

### `ssh`

We need to keep the private key (not the `.pub` file) somewhere where we will not lose it. The standard place is the `~/.ssh/` folder.

```shell
mkdir -p ~/.ssh
mv $key_file ~/.ssh
```



`ssh` will let you control a remote machine as if you were typing at its terminal.

Let's connect to the instance we just created. 

We need to use the "identity file" (private key) to authenticate ourselves:

```shell
ssh -i $PRIVATE_KEY $REMOTE_USER@$REMOTE_MACHINE
```

One downside is that if you lose the connection, any long-running process you had started is lost with it. To avoid that, use `screen` or `tmux`.


### scp

`scp` sends files to a remote computer, such as our virtual machine, or fetches them from it. To copy in the other direction, just swap the source and the destination.

Let's send coupon150720.csv to the recently created instance.

```shell
scp -i $PRIVATE_KEY $LOCAL_PATH $REMOTE_USER@$REMOTE_MACHINE:$REMOTE_PATH
```
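To make the "swap source and destination" point concrete, here is a small sketch that only builds the argument lists for `scp` without running anything. The user name, the IP address (from a range reserved for documentation) and the key path are placeholders standing in for `$REMOTE_USER`, `$REMOTE_MACHINE` and `$PRIVATE_KEY`.

```python
import os


def remote(user, machine, path):
    """Format a remote location the way scp expects it: user@host:path."""
    return "{}@{}:{}".format(user, machine, path)


def scp_command(private_key, source, destination):
    """Build the argument list for scp; nothing is executed here."""
    return ["scp", "-i", private_key, source, destination]


# Placeholder values, for illustration only.
key = os.path.expanduser("~/.ssh/my-private-key")
there = remote("remoteuser", "203.0.113.10", "data/coupon150720.csv")

upload = scp_command(key, "coupon150720.csv", there)    # local -> remote
download = scp_command(key, there, "coupon150720.csv")  # remote -> local

print(upload)
print(download)
```

Either list could then be handed to `subprocess.run` to perform the copy.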

### SSH config file


An SSH config file saves us from typing those long connections every time. It needs to be in `~/.ssh/config` and looks like this:

```
Host mycluster
    User remoteuser
    HostName masternodename
    IdentityFile ~/.ssh/my-private-key
```

Once it's there, we can just type

```
ssh mycluster
```
and we'll be connected.

There are lots and lots of options to ssh we can configure like this. More details [here](https://nerderati.com/2011/03/17/simplify-your-life-with-an-ssh-config-file/).
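Since the master node name changes every time we create a new cluster, it can be handy to generate the entry instead of typing it. A minimal sketch, where `ssh_config_entry` is a hypothetical helper and the values are the same placeholders used in the example config:

```python
def ssh_config_entry(alias, user, hostname, identity_file):
    """Render one Host block in the layout used by ~/.ssh/config."""
    lines = [
        "Host {}".format(alias),
        "    User {}".format(user),
        "    HostName {}".format(hostname),
        "    IdentityFile {}".format(identity_file),
    ]
    return "\n".join(lines) + "\n"


entry = ssh_config_entry(
    "mycluster", "remoteuser", "masternodename", "~/.ssh/my-private-key"
)
print(entry)
```

The returned text would then be appended to `~/.ssh/config`.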


## Google Cloud Platform

![Google Cloud](https://cloud.google.com/_static/5b213a6cb2/images/cloud/cloud-logo.svg)


Three ways to interact with GCP:

* The Google Cloud Platform console (the GUI)

* `gcloud` command-line tool

* Cloud Dataproc REST API: you can create and destroy clusters using a REST API!


## Google Dataproc

[Cloud Dataproc FAQ](https://cloud.google.com/dataproc/docs/resources/faq)

### Installing the `gcloud` SDK

[On Ubuntu/Debian](https://cloud.google.com/sdk/docs/quickstart-debian-ubuntu):

```bash
# Create environment variable for correct distribution
export CLOUD_SDK_REPO="cloud-sdk-$(lsb_release -c -s)"

# Add the Cloud SDK distribution URI as a package source
echo "deb http://packages.cloud.google.com/apt $CLOUD_SDK_REPO main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list

# Import the Google Cloud Platform public key
curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -

# Update the package list and install the Cloud SDK
sudo apt-get update && sudo apt-get install google-cloud-sdk
```

### Configuring the `gcloud` SDK

```bash
gcloud init
```

After running this, you can do from the command line anything that can be done from the web console.

### Adding users to a project

[Identity and access management](https://cloud.google.com/iam/docs/)

### Creating a cluster

With GUI: Google Cloud Console -> dataproc -> Clusters -> create cluster

With SDK: 

```bash
gcloud dataproc clusters create $CLUSTERNAME --region $REGION
```

Many more options available. You can explore them within the SDK or through the GUI.

[Creating a Cluster](https://cloud.google.com/dataproc/docs/guides/create-cluster)

### Uploading data to a GCP cluster

Two options:

* scp to the master node

* Upload the data to Google Cloud Storage, then use `gs://` as a path prefix in your script

    * First, you'll need to [create a storage bucket].
    
[create a storage bucket]: https://cloud.google.com/storage/docs/creating-buckets

### Creating a storage bucket

```bash
gsutil mb -p $PROJECT_NAME  gs://bucket-name
```


### Uploading your data

```bash
gsutil cp [LOCAL_OBJECT_LOCATION] gs://[DESTINATION_BUCKET_NAME]/
```


### Submitting a job to Google Dataproc

To submit a PySpark job, run:

```bash
  $ gcloud dataproc jobs submit pyspark --cluster my_cluster \
      my_script.py -- arg1 arg2
      
```

https://cloud.google.com/sdk/gcloud/reference/dataproc/jobs/submit/

## Storage: HDFS (Hadoop Distributed File System)

HDFS is tied to the cluster: if the cluster is deleted, HDFS goes with it. Google Cloud Storage, in contrast, is not.

The "internal storage" of a Spark cluster is usually HDFS. In GCP we don't worry too much about it because we use Google Cloud Storage, but it is important to know its basic concepts in case we use another cloud provider or have an internal cluster.

Assumptions in [HDFS design]:

* The system is built from many inexpensive commodity components that often fail.

* The system stores a modest number of large files. 

* The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. 

* The workloads also have many large, sequential writes that append data to files.

* The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file.

* High sustained bandwidth is more important than low latency. 

[HDFS design]: https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html


![HDFS](https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/images/hdfsarchitecture.png)


* `Blocks`: the pieces into which a file is split.
* `Rack`: HDFS is aware that nodes live in racks, and knows that communication between nodes in different racks is slower.
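To get a feeling for what splitting into blocks means in numbers, here is a small Python 3 calculation. It assumes the Hadoop 2.x defaults of 128 MB per block and a replication factor of 3; both are configurable, so check your own cluster. `hdfs_footprint` is a hypothetical helper name.

```python
import math

MB = 1024 * 1024


def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster).
    The last block only takes as much space as the data it holds."""
    n_blocks = math.ceil(file_size_bytes / block_size_bytes)
    raw_bytes = file_size_bytes * replication
    return n_blocks, raw_bytes


blocks, raw = hdfs_footprint(1000 * MB)
print(blocks, raw // MB)
```

Under those assumptions, a 1000 MB file is split into 8 blocks and takes 3000 MB of raw disk once every block has been replicated.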

### hdfs dfs

Mimics the shell, but with a few differences:

* We call shell commands as options to a module named `hdfs dfs`

* There is no concept of a current working directory (therefore, no cd command)

* It has some annoying inconsistencies with regular bash

```shell

[hadoop@masternode ~]$ hdfs dfs 

Usage: hadoop fs [generic options]
	[-appendToFile <localsrc> ... <dst>]
	[-cat [-ignoreCrc] <src> ...]
	[-checksum <src> ...]
	[-chgrp [-R] GROUP PATH...]
	[-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
	[-chown [-R] [OWNER][:[GROUP]] PATH...]
	[-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
	[-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-count [-q] [-h] <path> ...]
	[-cp [-f] [-p | -p[topax]] <src> ... <dst>]
	[-createSnapshot <snapshotDir> [<snapshotName>]]
	[-deleteSnapshot <snapshotDir> <snapshotName>]
	[-df [-h] [<path> ...]]
	[-du [-s] [-h] <path> ...]
	[-expunge]
	[-find <path> ... <expression> ...]
	[-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
	[-getfacl [-R] <path>]
	[-getfattr [-R] {-n name | -d} [-e en] <path>]
	[-getmerge [-nl] <src> <localdst>]
	[-help [cmd ...]]
	[-ls [-d] [-h] [-R] [<path> ...]]
	[-mkdir [-p] <path> ...]
	[-moveFromLocal <localsrc> ... <dst>]
	[-moveToLocal <src> <localdst>]
	[-mv <src> ... <dst>]
	[-put [-f] [-p] [-l] <localsrc> ... <dst>]
	[-renameSnapshot <snapshotDir> <oldName> <newName>]
	[-rm [-f] [-r|-R] [-skipTrash] <src> ...]
	[-rmdir [--ignore-fail-on-non-empty] <dir> ...]
	[-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
	[-setfattr {-n name [-v value] | -x name} <path>]
	[-setrep [-R] [-w] <rep> <path> ...]
	[-stat [format] <path> ...]
	[-tail [-f] <file>]
	[-test -[defsz] <path>]
	[-text [-ignoreCrc] <src> ...]
	[-touchz <path> ...]
	[-truncate [-w] <length> <path> ...]
	[-usage [cmd ...]]
```

Try:

```shell
user@gateway$ hdfs dfs -ls
```

Why does it return nothing?

Now try:

```shell
user@gateway$ hdfs dfs -ls /
user@gateway$ hdfs dfs -ls /user
```

#### `hdfs dfs -mkdir`

On your own, create a folder called "data" inside your HDFS home folder.


#### `hdfs dfs -put`

By analogy with ls, can you guess where the
`$LOCAL_FILE` will be put if I do this? (don't do it)
                                       
```shell
user@gateway$ hdfs dfs -put $LOCAL_FILE

```
                                       
                                       
Now, put the file in hdfs, inside your "data" folder:
```shell
user@gateway$ hdfs dfs -put $LOCAL_FILE $HDFS_FOLDER
```
 
                                       

#### `hdfs dfs -get` / `hdfs dfs -cat`

If you do any kind of work in HDFS, eventually you'll need to get something out of it!

```shell
user@gateway$ hdfs dfs -get $HDFS_FILE
```

However, you might only need to take a peek at the contents of a file:

```shell
user@gateway$ hdfs dfs -cat $HDFS_FILE
```

The neat thing about `hdfs dfs -cat` is that it outputs to stdout, so you can chain it into all your favorite shell pipelines!

Other useful hadoop filesystem commands:
    
```shell
user@gateway$ hdfs dfs -getmerge $HDFS_GLOB $LOCAL_FILE
user@gateway$ hdfs dfs -stat $HDFS_FILE
user@gateway$ hdfs dfs -tail $HDFS_FILE
```

Much more at:
https://hadoop.apache.org/docs/r2.8.3/hadoop-project-dist/hadoop-common/FileSystemShell.html

## spark-submit

#### ```mysparkjob.py```


```python
from __future__ import print_function
from pyspark import SparkContext
import sys

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: mysparkjob arg1 arg2 ", file=sys.stderr)
        sys.exit(-1)
    sc = SparkContext(appName="MyTestJob")
    dataTextAll = sc.textFile(sys.argv[1])
    dataRDD = dataTextAll.filter(lambda line: line.startswith('79065'))
    dataRDD.saveAsTextFile(sys.argv[2])
    sc.stop()
```

Just a simple Spark job.
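Before sending the job to a cluster, the filtering logic can be checked in plain Python. This sketch reuses the same condition as the lambda in the job; `filter_by_prefix` is a hypothetical helper and the sample lines are made up.

```python
def filter_by_prefix(lines, prefix="79065"):
    """Keep only the lines that start with the given prefix,
    like the lambda passed to filter() in the Spark job."""
    return [line for line in lines if line.startswith(prefix)]


# Made-up lines, only to exercise the filter.
sample = ["79065123,1,MAD,BCN", "12345678,1,BCN,MAD", "79065999,2,MAD,PMI"]
print(filter_by_prefix(sample))
```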

### Running our Spark app

```shell
./bin/spark-submit \
    mysparkjob.py \
    data/coupon150720.csv \
    test.csv
    
```

Once it runs, what is test.csv? How would you get it back on the local file system?
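A hint for the question above: `saveAsTextFile` writes a folder containing one `part-NNNNN` file per partition plus a `_SUCCESS` marker, not a single file. On HDFS, `hdfs dfs -getmerge` brings the parts back as one local file. If the folder is already on the local file system, the same idea can be sketched in Python; `merge_parts` is a hypothetical helper and the folder below is fake, built only for the demonstration.

```python
import glob
import os
import tempfile


def merge_parts(output_dir, merged_path):
    """Concatenate the part-* files of an output folder, in name order.
    Returns the number of part files that were merged."""
    part_files = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(merged_path, "w") as merged:
        for part in part_files:
            with open(part) as fh:
                merged.write(fh.read())
    return len(part_files)


# Build a fake output folder that looks like what Spark writes.
demo_out = tempfile.mkdtemp()
for i, text in enumerate(["a\n", "b\n"]):
    with open(os.path.join(demo_out, "part-{:05d}".format(i)), "w") as fh:
        fh.write(text)
open(os.path.join(demo_out, "_SUCCESS"), "w").close()

merged_file = os.path.join(tempfile.mkdtemp(), "test_merged.csv")
n_parts = merge_parts(demo_out, merged_file)
print(n_parts)
```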

#### Exercise 

Adapt our exercise from notebook 02 to run in the cluster. Remember:

Get stats for all tickets with destination MAD from `coupon150720.csv`. You will need to extract ticket amounts with destination MAD, and then calculate:

* Total ticket amounts per origin


```python
path = 'data/coupon150720.csv'
df = spark.sql(f'''SELECT
            _c0 AS tft_number,
            _c1 AS coupon_number,
            _c2 AS origin,
            _c3 AS destination,
            _c4 AS carrier,
            CAST(_c6 AS double) AS amount
            FROM csv.`{path}`''')
```

```python
amounts = (
    df
    .where(df['origin'] == 'MAD')
    .groupBy('destination')
    .sum('amount')
    .withColumnRenamed('sum(amount)', 'sum_amount')
)

amounts
```

```
DataFrame[destination: string, sum_amount: double]
```

```python
amounts.take(10)
```

```
[Row(destination='PMI', sum_amount=37475.79000000004),
 Row(destination='HEL', sum_amount=8268.34),
 Row(destination='SXB', sum_amount=264.46),
 Row(destination='UIO', sum_amount=14339.710000000005),
 Row(destination='XRY', sum_amount=8611.57),
 Row(destination='OLB', sum_amount=785.98),
 Row(destination='CCS', sum_amount=53432.03),
 Row(destination='VRN', sum_amount=2469.480000000001),
 Row(destination='SPC', sum_amount=6033.049999999999),
 Row(destination='AUH', sum_amount=8775.109999999999)]
```
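To sanity-check the aggregation on a small sample without Spark, the same filter, group and sum can be written in plain Python. This is only a sketch: it assumes the file is comma-separated with no header, with origin, destination and amount in columns 2, 3 and 6 as in the query. `sum_amount_by_destination` is a hypothetical helper and the rows are made up, not real coupon data.

```python
import csv
import io
from collections import defaultdict


def sum_amount_by_destination(rows, origin="MAD"):
    """Sum column 6 (amount) per column 3 (destination),
    keeping only rows whose column 2 (origin) matches."""
    totals = defaultdict(float)
    for row in rows:
        if row[2] == origin:
            totals[row[3]] += float(row[6])
    return dict(totals)


# Made-up rows in the same column layout; not real data.
fake_csv = io.StringIO(
    "T1,1,MAD,PMI,XX,0,100.5\n"
    "T2,1,MAD,PMI,XX,0,50.0\n"
    "T3,1,BCN,MAD,XX,0,70.0\n"
    "T4,1,MAD,HEL,XX,0,20.0\n"
)
totals = sum_amount_by_destination(csv.reader(fake_csv))
print(totals)
```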

I copy it into a `.py` file, `sum_amount.py`:

```python
from __future__ import print_function
from pyspark.sql import SparkSession, types, functions
import sys


if __name__ == '__main__':

    in_ = sys.argv[1]
    out_ = sys.argv[2]

    spark = SparkSession.builder.getOrCreate()

    df = spark.sql(f'''SELECT
            _c0 AS tft_number,
            _c1 AS coupon_number,
            _c2 AS origin,
            _c3 AS destination,
            _c4 AS carrier,
            CAST(_c6 AS double) AS amount
            FROM csv.`{in_}`''')

    amounts = (
        df
        .where(df['origin'] == 'MAD')
        .groupBy('destination')
        .sum('amount')
        .withColumnRenamed('sum(amount)', 'sum_amount')
    )

    amounts.write.json(out_)
    spark.stop()
```


**This one does not work on the cluster because f-strings (`f'{value}'`) are a Python 3 feature, and the cluster has Python 2.**

So we are using `sum_amount_2.py` instead.
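`sum_amount_2.py` itself is not shown in these notes. One way to avoid the f-string, and this is only an assumption about how it could be done, is to build the query with `str.format`, which works in both Python 2 and Python 3. `build_query` is a hypothetical helper name.

```python
def build_query(path):
    """Build the SELECT statement with str.format instead of an f-string,
    so the same source runs under both Python 2 and Python 3."""
    return (
        "SELECT _c0 AS tft_number, "
        "_c1 AS coupon_number, "
        "_c2 AS origin, "
        "_c3 AS destination, "
        "_c4 AS carrier, "
        "CAST(_c6 AS double) AS amount "
        "FROM csv.`{}`"
    ).format(path)


query = build_query("data/coupon150720.csv")
print(query)
```

The resulting string would be passed to `spark.sql` in place of the f-string version.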

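To test the script from the notebook before uploading it, run the following in a cell with a leading `!`. Keep in mind that this runs Python 3 locally while the cluster runs Python 2, so a successful local run does not guarantee 100% that it will work there.

```shell
unset PYSPARK_DRIVER_PYTHON; spark-submit sum_amount.py data/coupon150720.csv out.json
```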

19/11/16 09:43:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/11/16 09:43:22 INFO SparkContext: Running Spark version 2.4.4
19/11/16 09:43:22 INFO SparkContext: Submitted application: sum_amount.py
19/11/16 09:43:22 INFO SecurityManager: Changing view acls to: juanluisgarcialopez
19/11/16 09:43:22 INFO SecurityManager: Changing modify acls to: juanluisgarcialopez
19/11/16 09:43:22 INFO SecurityManager: Changing view acls groups to: 
19/11/16 09:43:22 INFO SecurityManager: Changing modify acls groups to: 
19/11/16 09:43:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(juanluisgarcialopez); groups with view permissions: Set(); users  with modify permissions: Set(juanluisgarcialopez); groups with modify permissions: Set()
19/11/16 09:43:23 INFO Utils: Success

19/11/16 09:43:27 INFO FileSourceStrategy: Post-Scan Filters: isnotnull(_c2#18),(_c2#18 = MAD)
19/11/16 09:43:27 INFO FileSourceStrategy: Output Data Schema: struct<_c2: string, _c3: string, _c6: string ... 1 more fields>
19/11/16 09:43:27 INFO FileSourceScanExec: Pushed Filters: IsNotNull(_c2),EqualTo(_c2,MAD)
19/11/16 09:43:27 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:27 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/11/16 09:43:27 INFO CodeGenerator: Code generated in 29.562143 ms
19/11/16 09:43:27 INFO CodeGenerator: Code generated in 43.282683 ms
19/11/16 09:43:27 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 277.3 KB, free 365.4 MB)
19/11/16 09:43:27 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 23.3 KB, free 365.4 MB)
19/11/16 09:43:27 INFO BlockManagerInfo: Added broadcast_3_piece0 

19/11/16 09:43:28 INFO Executor: Finished task 10.0 in stage 1.0 (TID 11). 2459 bytes result sent to driver
19/11/16 09:43:28 INFO TaskSetManager: Finished task 10.0 in stage 1.0 (TID 11) in 697 ms on localhost (executor driver) (2/12)
19/11/16 09:43:28 INFO Executor: Finished task 9.0 in stage 1.0 (TID 10). 2459 bytes result sent to driver
19/11/16 09:43:28 INFO TaskSetManager: Finished task 9.0 in stage 1.0 (TID 10) in 710 ms on localhost (executor driver) (3/12)
19/11/16 09:43:28 INFO Executor: Finished task 4.0 in stage 1.0 (TID 5). 2459 bytes result sent to driver
19/11/16 09:43:28 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID 5) in 725 ms on localhost (executor driver) (4/12)
19/11/16 09:43:28 INFO Executor: Finished task 7.0 in stage 1.0 (TID 8). 2502 bytes result sent to driver
19/11/16 09:43:28 INFO TaskSetManager: Finished task 7.0 in stage 1.0 (TID 8) in 745 ms on localhost (executor driver) (5/12)
19/11/16 09:43:28 INFO Executor: Finished task 6.0 in stage 1.0 (T

19/11/16 09:43:28 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/11/16 09:43:28 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/11/16 09:43:28 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:28 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/11/16 09:43:28 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:28 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/11/16 09:43:28 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:28 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter


19/11/16 09:43:28 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks including 0 local blocks and 0 remote blocks
19/11/16 09:43:28 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:28 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks including 0 local blocks and 0 remote blocks
19/11/16 09:43:28 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:28 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:28 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks including 0 local blocks and 0 remote blocks
19/11/16 09:43:28 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks including 0 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:28 INFO ShuffleBlockFetcherIterator: Getting 0 non-

19/11/16 09:43:29 INFO Executor: Finished task 52.0 in stage 2.0 (TID 37). 3691 bytes result sent to driver
19/11/16 09:43:29 INFO TaskSetManager: Starting task 88.0 in stage 2.0 (TID 49, localhost, executor driver, partition 88, PROCESS_LOCAL, 7767 bytes)
19/11/16 09:43:29 INFO Executor: Running task 88.0 in stage 2.0 (TID 49)
19/11/16 09:43:29 INFO TaskSetManager: Finished task 52.0 in stage 2.0 (TID 37) in 38 ms on localhost (executor driver) (25/200)
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks including 0 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks including 0 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks including 0 local blocks and 0 remot

19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks including 0 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:29 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/11/16 09:43:29 INFO TaskSetManager: Starting task 119.0 in stage 2.0 (TID 60, localhost, executor driver, partition 119, PROCESS_LOCAL, 7767 bytes)
19/11/16 09:43:29 INFO SparkHadoopMapRedUtil: No need to commit output of task because needsTaskCommit=false: attempt_20191116094327_0002_m_000090_50
19/11/16 09:43:29 INFO TaskSetManager: Finished task 77.0 in stage 2.0 (TID 44) in 98 ms on localhost (executor driver) (36/200)
19/11/16 09:43:29 INFO Executor: Finished task 90.0 in stage 2.0 (TID 50). 3734 bytes result sent to driver
19/11/16 09:43:29 I

19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks including 0 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:29 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks including 0 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks including 0 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:29 INFO SQLHadoopMapReduce

19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks including 0 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks including 0 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks including 0 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 0 non-empty blocks including 0 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:29 INFO SQLHadoopMapReduceCommitProtocol: Using out

19/11/16 09:43:29 INFO SparkHadoopMapRedUtil: No need to commit output of task because needsTaskCommit=false: attempt_20191116094327_0002_m_000175_86
19/11/16 09:43:29 INFO Executor: Finished task 175.0 in stage 2.0 (TID 86). 3734 bytes result sent to driver
19/11/16 09:43:29 INFO TaskSetManager: Starting task 3.0 in stage 2.0 (TID 103, localhost, executor driver, partition 3, ANY, 7767 bytes)
19/11/16 09:43:29 INFO TaskSetManager: Finished task 175.0 in stage 2.0 (TID 86) in 95 ms on localhost (executor driver) (79/200)
19/11/16 09:43:29 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:29 INFO Executor: Running task 3.0 in stage 2.0 (TID 103)
19/11/16 09:43:29 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/11/16 09:43:29 INFO SparkHadoopMapRedUtil: No need to commit output of task because needsTaskCommit=false: attempt_20191116094327_0002_m_000164_83
19/11/16

19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks including 4 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks including 3 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks including 1 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks including 3 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks including 3 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetch

19/11/16 09:43:29 INFO Executor: Running task 36.0 in stage 2.0 (TID 118)
19/11/16 09:43:29 INFO Executor: Running task 37.0 in stage 2.0 (TID 119)
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 7 non-empty blocks including 7 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 5 non-empty blocks including 5 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks including 2 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 1 non-empty blocks including 1 local blocks and 0 remote blocks
19/11/16 09:

19/11/16 09:43:29 INFO Executor: Finished task 41.0 in stage 2.0 (TID 121). 3777 bytes result sent to driver
19/11/16 09:43:29 INFO TaskSetManager: Finished task 35.0 in stage 2.0 (TID 117) in 111 ms on localhost (executor driver) (106/200)
19/11/16 09:43:29 INFO Executor: Finished task 37.0 in stage 2.0 (TID 119). 3777 bytes result sent to driver
19/11/16 09:43:29 INFO TaskSetManager: Starting task 59.0 in stage 2.0 (TID 131, localhost, executor driver, partition 59, ANY, 7767 bytes)
19/11/16 09:43:29 INFO TaskSetManager: Starting task 61.0 in stage 2.0 (TID 132, localhost, executor driver, partition 61, ANY, 7767 bytes)
19/11/16 09:43:29 INFO TaskSetManager: Finished task 41.0 in stage 2.0 (TID 121) in 110 ms on localhost (executor driver) (107/200)
19/11/16 09:43:29 INFO TaskSetManager: Finished task 37.0 in stage 2.0 (TID 119) in 111 ms on localhost (executor driver) (108/200)
19/11/16 09:43:29 INFO Executor: Running task 53.0 in stage 2.0 (TID 128)
19/11/16 09:43:29 INFO Executor:

19/11/16 09:43:29 INFO Executor: Running task 66.0 in stage 2.0 (TID 136)
19/11/16 09:43:29 INFO TaskSetManager: Finished task 55.0 in stage 2.0 (TID 129) in 105 ms on localhost (executor driver) (112/200)
19/11/16 09:43:29 INFO FileOutputCommitter: Saved output of task 'attempt_20191116094327_0002_m_000063_133' to file:/Users/juanluisgarcialopez/repos-datascience/big-data-spark/out.json/_temporary/0/task_20191116094327_0002_m_000063
19/11/16 09:43:29 INFO SparkHadoopMapRedUtil: attempt_20191116094327_0002_m_000063_133: Committed
19/11/16 09:43:29 INFO Executor: Finished task 63.0 in stage 2.0 (TID 133). 3820 bytes result sent to driver
19/11/16 09:43:29 INFO TaskSetManager: Starting task 67.0 in stage 2.0 (TID 137, localhost, executor driver, partition 67, ANY, 7767 bytes)
19/11/16 09:43:29 INFO TaskSetManager: Finished task 63.0 in stage 2.0 (TID 133) in 112 ms on localhost (executor driver) (113/200)
19/11/16 09:43:29 INFO Executor: Running task 67.0 in stage 2.0 (TID 137)
1

19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks including 4 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks including 3 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:29 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 3 non-empty blocks including 3 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks including 2 local blocks and 0 remote blocks
19/11/16 09:43:

19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 6 non-empty blocks including 6 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 1 ms
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 6 non-empty blocks including 6 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:29 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/11/16 09:43:29 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks including 4 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator

19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks including 2 local blocks and 0 remote blocks
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:29 INFO FileOutputCommitter: Saved output of task 'attempt_20191116094327_0002_m_000102_160' to file:/Users/juanluisgarcialopez/repos-datascience/big-data-spark/out.json/_temporary/0/task_20191116094327_0002_m_000102
19/11/16 09:43:29 INFO SparkHadoopMapRedUtil: attempt_20191116094327_0002_m_000102_160: Committed
19/11/16 09:43:29 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:29 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/11/16 09:43:29 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:29 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks including 2 local blocks and 0 remote blocks
19/11/16 09:43

19/11/16 09:43:30 INFO FileOutputCommitter: Saved output of task 'attempt_20191116094327_0002_m_000113_167' to file:/Users/juanluisgarcialopez/repos-datascience/big-data-spark/out.json/_temporary/0/task_20191116094327_0002_m_000113
19/11/16 09:43:30 INFO SparkHadoopMapRedUtil: attempt_20191116094327_0002_m_000113_167: Committed
19/11/16 09:43:30 INFO FileOutputCommitter: Saved output of task 'attempt_20191116094327_0002_m_000122_172' to file:/Users/juanluisgarcialopez/repos-datascience/big-data-spark/out.json/_temporary/0/task_20191116094327_0002_m_000122
19/11/16 09:43:30 INFO SparkHadoopMapRedUtil: attempt_20191116094327_0002_m_000122_172: Committed
19/11/16 09:43:30 INFO Executor: Finished task 113.0 in stage 2.0 (TID 167). 3820 bytes result sent to driver
19/11/16 09:43:30 INFO Executor: Finished task 122.0 in stage 2.0 (TID 172). 3777 bytes result sent to driver
19/11/16 09:43:30 INFO TaskSetManager: Starting task 141.0 in stage 2.0 (TID 183, localhost, executor driver, part

19/11/16 09:43:30 INFO ShuffleBlockFetcherIterator: Getting 4 non-empty blocks including 4 local blocks and 0 remote blocks
19/11/16 09:43:30 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:30 INFO ShuffleBlockFetcherIterator: Getting 6 non-empty blocks including 6 local blocks and 0 remote blocks
19/11/16 09:43:30 INFO ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
19/11/16 09:43:30 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:30 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/11/16 09:43:30 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
19/11/16 09:43:30 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/11/16 09:43:30 INFO ShuffleBlockFetcherIterator: Getting 6 non-empty blocks including 6 local blocks an

19/11/16 09:43:30 INFO TaskSetManager: Finished task 143.0 in stage 2.0 (TID 185) in 70 ms on localhost (executor driver) (177/200)
19/11/16 09:43:30 INFO Executor: Running task 174.0 in stage 2.0 (TID 201)
19/11/16 09:43:30 INFO FileOutputCommitter: Saved output of task 'attempt_20191116094327_0002_m_000158_192' to file:/Users/juanluisgarcialopez/repos-datascience/big-data-spark/out.json/_temporary/0/task_20191116094327_0002_m_000158
19/11/16 09:43:30 INFO SparkHadoopMapRedUtil: attempt_20191116094327_0002_m_000158_192: Committed
19/11/16 09:43:30 INFO FileOutputCommitter: Saved output of task 'attempt_20191116094327_0002_m_000155_190' to file:/Users/juanluisgarcialopez/repos-datascience/big-data-spark/out.json/_temporary/0/task_20191116094327_0002_m_000155
19/11/16 09:43:30 INFO SparkHadoopMapRedUtil: attempt_20191116094327_0002_m_000155_190: Committed
19/11/16 09:43:30 INFO Executor: Finished task 155.0 in stage 2.0 (TID 190). 3777 bytes result sent to driver
19/11/16 09:43:30 INFO 

19/11/16 09:43:30 INFO FileOutputCommitter: Saved output of task 'attempt_20191116094327_0002_m_000176_202' to file:/Users/juanluisgarcialopez/repos-datascience/big-data-spark/out.json/_temporary/0/task_20191116094327_0002_m_000176
19/11/16 09:43:30 INFO SparkHadoopMapRedUtil: attempt_20191116094327_0002_m_000176_202: Committed
19/11/16 09:43:30 INFO FileOutputCommitter: Saved output of task 'attempt_20191116094327_0002_m_000173_200' to file:/Users/juanluisgarcialopez/repos-datascience/big-data-spark/out.json/_temporary/0/task_20191116094327_0002_m_000173
19/11/16 09:43:30 INFO SparkHadoopMapRedUtil: attempt_20191116094327_0002_m_000173_200: Committed
19/11/16 09:43:30 INFO FileOutputCommitter: Saved output of task 'attempt_20191116094327_0002_m_000181_203' to file:/Users/juanluisgarcialopez/repos-datascience/big-data-spark/out.json/_temporary/0/task_20191116094327_0002_m_000181
19/11/16 09:43:30 INFO SparkHadoopMapRedUtil: attempt_20191116094327_0002_m_000181_203: Committed
19/11/16 0

19/11/16 09:43:30 INFO FileFormatWriter: Finished processing stats for write job 122c47cf-5679-4adc-b340-bd585416216d.
19/11/16 09:43:30 INFO SparkUI: Stopped Spark web UI at http://juans-mbp:4043
19/11/16 09:43:30 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/11/16 09:43:30 INFO MemoryStore: MemoryStore cleared
19/11/16 09:43:30 INFO BlockManager: BlockManager stopped
19/11/16 09:43:30 INFO BlockManagerMaster: BlockManagerMaster stopped
19/11/16 09:43:30 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/11/16 09:43:30 INFO SparkContext: Successfully stopped SparkContext
19/11/16 09:43:30 INFO ShutdownHookManager: Shutdown hook called
19/11/16 09:43:30 INFO ShutdownHookManager: Deleting directory /private/var/folders/k4/33yt9vyn3_x0v6tsv21x51_w0000gn/T/spark-ee3b1700-274c-456f-bf68-557b62232ac3/pyspark-44243b4d-c7dc-428a-a1f8-9fda8840e5eb
19/11/16 09:43:30 INFO ShutdownHookManager: Deleting directory /pri

### Running on cluster versus client mode

This setting controls where the driver runs.

The default deployment mode is `client`, that is, the driver runs on the machine that is running the spark-submit script.


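```shell
# Options go before the application file; anything after it
# is passed to mysparkjob.py as an application argument.
./bin/spark-submit \
    --deploy-mode client \
    mysparkjob.py \
    data/coupon150720.csv
```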


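In `cluster` mode, the driver runs on a node inside the cluster instead. For Python applications this needs a cluster manager that supports it, such as YARN; Spark's standalone manager does not.

```shell
./bin/spark-submit \
    --deploy-mode cluster \
    mysparkjob.py \
    data/coupon150720.csv
```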
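
`spark-submit` treats everything after the application file as arguments for the application itself, so options such as `--deploy-mode` have to come before it. A minimal sketch of a helper that assembles the command in that order follows; the function name `build_spark_submit` and its parameters are hypothetical, made up for illustration, and are not part of Spark.

```python
def build_spark_submit(app, app_args=(), deploy_mode="client", master=None):
    """Return a spark-submit command line as a list of arguments.

    Hypothetical helper: it only builds the argument list, it does not run it.
    Options are emitted before the application file, because spark-submit
    hands everything after the application file to the application.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode!r}")

    cmd = ["./bin/spark-submit", "--deploy-mode", deploy_mode]
    if master is not None:
        cmd += ["--master", master]

    cmd.append(app)        # the application file
    cmd.extend(app_args)   # arguments for the application, not for spark-submit
    return cmd


if __name__ == "__main__":
    cmd = build_spark_submit(
        "mysparkjob.py",
        ["data/coupon150720.csv"],
        deploy_mode="cluster",
        master="yarn",  # assumption: a YARN-managed cluster
    )
    print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run` from a machine that has Spark installed, which keeps the option ordering in one place when the same job is submitted in both modes.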

## Further reading



[hadoop fs](https://hadoop.apache.org/docs/r2.8.3/hadoop-project-dist/hadoop-common/FileSystemShell.html)

[Standalone Spark versus YARN versus Mesos](http://www.agildata.com/apache-spark-cluster-managers-yarn-mesos-or-standalone/)

[How Spark runs on clusters](https://spark.apache.org/docs/2.2.0/cluster-overview.html)

[spark-submit](https://aws.amazon.com/es/blogs/big-data/submitting-user-applications-with-spark-submit/)

[Cluster versus Client deployment modes](https://stackoverflow.com/questions/28807490/what-conditions-should-cluster-deploy-mode-be-used-instead-of-client)

[Tunnelling web connections through ssh to view the Spark management web views](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-ssh-tunnel.html)

[Findings on running Google Dataproc](https://www.inovex.de/blog/findings-in-running-google-dataproc/)

[Dataproc - Spark cluster in minutes](https://medium.com/google-cloud/dataproc-spark-cluster-on-gcp-in-minutes-3843b8d8c5f8)

[Using the `gcloud` command line tool](https://cloud.google.com/sdk/docs/quickstart-debian-ubuntu)

#### Exercise

- Create cluster
- Upload job.py file. 
- Submit job. 
- ???
- Profit