# Spark + Hadoop + H2O Demo

Because of the complexity of the code involved, most of the examples below can't actually be run from the notebook. Where noted, copy the code shown into an appropriate script and execute it from the console.

## Hadoop

### Hadoop Installation

Follow the tutorial at https://www.edureka.co/blog/install-hadoop-single-node-hadoop-cluster . Some notes to consider:
* Download and install JDK 8u101 from https://www.oracle.com/technetwork/java/javase/downloads/java-archive-javase8-2177648.html (the version from the tutorial may not work for your processor)
* Download and install Hadoop
* Configure environment for use
   * This configuration consists of editing XML files so that Hadoop creates a single-node environment. The tutorial's steps are easy to follow.
   * Before starting the daemons, run the command `sudo apt-get install openssh-client openssh-server`. Hadoop's start scripts connect to the node over SSH, so an SSH server must be listening on port 22.
   * At the end of this tutorial, you will have a Hadoop instance running. Browse to http://localhost:50070/dfshealth.html to see it in action.
   * Shell script to configure the environment and launch Hadoop:

```
#Setup Java
export JAVA_HOME=$HOME/jdk1.8.0_101
export PATH=$JAVA_HOME/bin:$PATH

#Setup Hadoop
export HADOOP_HOME=$HOME/hadoop-2.7.3
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export PATH=$PATH:$HADOOP_HOME/bin

#Launch Hadoop
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
```
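If the daemons fail to start, a common cause is that the exports above did not reach the current shell (they only persist when the script is sourced rather than executed). The following is a minimal Python sketch for checking this; the helper name `missing_vars` is hypothetical, and the list of variables is simply taken from the script above.

```python
import os

# Variables exported by the setup script above.
REQUIRED_VARS = [
    "JAVA_HOME",
    "HADOOP_HOME",
    "HADOOP_CONF_DIR",
    "HADOOP_MAPRED_HOME",
    "HADOOP_COMMON_HOME",
    "HADOOP_HDFS_HOME",
    "YARN_HOME",
]


def missing_vars(env):
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All Hadoop variables are set.")
```

Run it from the same console session in which the setup script was sourced.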

### Hadoop Demo

Now that the Hadoop services are running, you can insert data into HDFS:

`$ hadoop fs -mkdir /input`

`$ hadoop fs -put starcraft.csv /input/.`

starcraft.csv is a dataset with information about StarCraft players, including their rank and age. Let's perform a MapReduce operation on this dataset to obtain the average age at each rank.

To achieve this, we have created a mapper.py script, which extracts the age and rank from each record of the dataset, and a reducer.py script, which performs the averaging operation.
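As an illustration only, here is a minimal sketch of what such a mapper and reducer could look like. It is not the project's actual code: the column positions `RANK_COL` and `AGE_COL`, the function names, and the sample rows are all assumptions, so adjust them to the real layout of starcraft.csv. The reducer relies on its input arriving sorted by key, which is what Hadoop Streaming provides between the map and reduce phases.

```python
from itertools import groupby

# Assumed column positions; adjust to the real layout of starcraft.csv.
RANK_COL = 1
AGE_COL = 2


def map_lines(lines):
    """Yield 'rank<TAB>age' for every row with a numeric rank and age."""
    for line in lines:
        fields = line.strip().split(",")
        try:
            rank = int(fields[RANK_COL])
            age = int(fields[AGE_COL])
        except (IndexError, ValueError):
            # Skips the header row and rows with missing values.
            continue
        yield f"{rank}\t{age}"


def reduce_lines(lines):
    """Yield 'rank,average_age' from key-sorted 'rank<TAB>age' lines."""
    pairs = (line.strip().split("\t") for line in lines if line.strip())
    for rank, group in groupby(pairs, key=lambda pair: pair[0]):
        ages = [int(age) for _, age in group]
        yield f"{rank},{sum(ages) / len(ages):.6f}"


# Made-up sample rows, only to exercise the two functions in memory.
sample = ["id,rank,age", "52,5,27", "55,5,23", "56,4,30", "57,3,?"]
mapped = sorted(map_lines(sample))  # stands in for Hadoop's sort step
for result in reduce_lines(mapped):
    print(result)
```

In a real mapper.py, the loop would read from `sys.stdin` instead of the sample list and print each yielded line; the reducer would do the same with `reduce_lines(sys.stdin)`.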

The key thing to note about these scripts is that they operate on standard input and standard output. This allows Hadoop to run several instances in parallel on different portions of the dataset.
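Because the scripts only read standard input and write standard output, they can be rehearsed locally before submitting a job. The sketch below imitates the map, sort, and reduce stages with plain subprocesses; the helper names `run_streaming_locally` and `python_script` are hypothetical, and sorting whole lines is only an approximation of Hadoop's sort-by-key step (it is enough to keep identical tab-separated keys adjacent).

```python
import subprocess
import sys


def python_script(path):
    """Build a command that runs a script with the current interpreter."""
    return [sys.executable, path]


def run_streaming_locally(mapper_cmd, reducer_cmd, input_text):
    """Pipe input_text through mapper, a sort, and reducer; return the output."""
    mapped = subprocess.run(
        mapper_cmd, input=input_text, capture_output=True, text=True, check=True
    ).stdout
    # Local stand-in for the shuffle/sort phase between map and reduce.
    lines = sorted(mapped.splitlines())
    shuffled = "\n".join(lines) + "\n" if lines else ""
    reduced = subprocess.run(
        reducer_cmd, input=shuffled, capture_output=True, text=True, check=True
    ).stdout
    return reduced
```

With mapper.py, reducer.py, and starcraft.csv in the current directory, a call such as `run_streaming_locally(python_script("mapper.py"), python_script("reducer.py"), open("starcraft.csv").read())` would return the reducer's output as a string. This runs a single mapper and a single reducer, so it checks the logic but not the parallel behaviour.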

Now to execute our code (the current directory has to contain the mapper.py and reducer.py scripts):

`$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar -file mapper.py -file reducer.py -mapper mapper.py -reducer reducer.py -input /input/starcraft.csv -output /output`

This tells Hadoop to perform the operation on the dataset, distributing the work across the cluster (a single node in this setup). Once all the mapping and reducing operations are done, a success message is printed:

```
19/07/16 15:42:19 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
packageJobJar: [mapper.py, reducer.py, /tmp/hadoop-unjar3585551381596376153/] [] /tmp/streamjob3165364863209447883.jar tmpDir=null
19/07/16 15:42:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/07/16 15:42:20 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
19/07/16 15:42:20 INFO mapred.FileInputFormat: Total input paths to process : 1
19/07/16 15:42:20 INFO mapreduce.JobSubmitter: number of splits:2
19/07/16 15:42:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1563301111758_0003
19/07/16 15:42:21 INFO impl.YarnClientImpl: Submitted application application_1563301111758_0003
19/07/16 15:42:21 INFO mapreduce.Job: The url to track the job: http://ubuntu:8088/proxy/application_1563301111758_0003/
19/07/16 15:42:21 INFO mapreduce.Job: Running job: job_1563301111758_0003
19/07/16 15:42:27 INFO mapreduce.Job: Job job_1563301111758_0003 running in uber mode : false
19/07/16 15:42:27 INFO mapreduce.Job:  map 0% reduce 0%
19/07/16 15:42:32 INFO mapreduce.Job:  map 100% reduce 0%
19/07/16 15:42:37 INFO mapreduce.Job:  map 100% reduce 100%
19/07/16 15:42:37 INFO mapreduce.Job: Job job_1563301111758_0003 completed successfully
19/07/16 15:42:37 INFO mapreduce.Job: Counters: 49
        File System Counters            
                FILE: Number of bytes read=26726          
                FILE: Number of bytes written=419739      
                FILE: Number of read operations=0           
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=548941
                HDFS: Number of bytes written=91
                HDFS: Number of read operations=9
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters              
                Launched map tasks=2
                Launched reduce tasks=1
                Data-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=6275
                Total time spent by all reduces in occupied slots (ms)=2226
                Total time spent by all map tasks (ms)=6275
                Total time spent by all reduce tasks (ms)=2226           
                Total vcore-milliseconds taken by all map tasks=6275       
                Total vcore-milliseconds taken by all reduce tasks=2226
                Total megabyte-milliseconds taken by all map tasks=6425600 
                Total megabyte-milliseconds taken by all reduce tasks=2279424
        Map-Reduce Framework
                Map input records=3395
                Map output records=3340
                Map output bytes=20040
                Map output materialized bytes=26732
                Input split bytes=186
                Combine input records=0
                Combine output records=0
                Reduce input groups=145
                Reduce shuffle bytes=26732
                Reduce input records=3340
                Reduce output records=7
                Spilled Records=6680
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=189
                CPU time spent (ms)=2120
                Physical memory (bytes) snapshot=701128704
                Virtual memory (bytes) snapshot=5860065280
                Total committed heap usage (bytes)=533725184
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=548755
        File Output Format Counters
                Bytes Written=91
19/07/16 15:42:37 INFO streaming.StreamJob: Output directory: /output
```

The results are stored in the directory indicated by the previous command:

`$ hadoop fs -cat /output/part-00000`

```
1,22.724551
2,22.155620
3,22.050633
4,21.981504
5,21.362283
6,20.677939
7,21.171429
```
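Each output line is a `rank,average_age` pair, so the result is easy to post-process. As a small sketch (the function name `parse_averages` is hypothetical), the following parses the output shown above and finds the rank with the lowest average age:

```python
def parse_averages(text):
    """Parse 'rank,average' lines into a {rank: average_age} dict."""
    averages = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        rank, average = line.split(",")
        averages[int(rank)] = float(average)
    return averages


# Output of the job, copied from the listing above.
job_output = """\
1,22.724551
2,22.155620
3,22.050633
4,21.981504
5,21.362283
6,20.677939
7,21.171429
"""

averages = parse_averages(job_output)
youngest = min(averages, key=averages.get)
print(youngest, averages[youngest])  # prints: 6 20.677939
```

In practice the text would come from the output of `hadoop fs -cat /output/part-00000` rather than from a pasted string.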

Now you know enough to bring up a Hadoop environment, populate it with data and perform operations on it.

## Spark

### Spark Installation

Follow the tutorial at https://www.edureka.co/blog/spark-tutorial/ . Some notes to consider:
* Java is already installed from the previous step
* Use `spark-shell` to run commands interactively