# Lab #1 - HDFS and MapReduce

This lab demonstrates the use of HDFS and Hadoop's Map Reduce algorithm. We will use data from the National Climatic Data Center (NCDC), available at http://www.ncdc.noaa.gov/. The data is stored in a line-oriented ASCII format, one record per line. The format supports a rich set of meteorological elements, many of which are optional or have variable length. The following lines are a sample from one of those files:

    0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
    0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
    0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
    0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
    0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999


The following lines explain the meaning of the fields, using the first sample record (note that the fields are packed into one line with no delimiters):

    0067
    011990   # USAF weather station identifier
    99999    # WBAN weather station identifier
    19500515 # observation date
    0700     # observation time
    4
    +68750   # latitude (degrees x 1000)
    +023550  # longitude (degrees x 1000)
    FM-12
    +0382    # elevation (meters)
    99999
    V020
    330      # wind direction (degrees)
    1        # quality code
    N
    0067
    1
    22000    # sky ceiling height (meters)
    1        # quality code
    C
    N
    999999   # visibility distance (meters)
    9        # quality code
    N
    9
    +0000    # air temperature (degrees Celsius x 10)
    1        # quality code
    +9999    # dew point temperature (degrees Celsius x 10)
    9        # quality code
    99999    # atmospheric pressure (hectopascals x 10)
    9        # quality code
    

A subset of the dataset was downloaded from the server. It comprises data files from 1901 to 1960. To avoid too many small files, the records were grouped by year, resulting in one larger file per year.

For simplicity, we will focus on the temperature, which is always present and is of fixed width.
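
Because every field sits at a fixed offset, a record can be decoded with plain string slices. The sketch below is illustrative only: `parse_record` and its dictionary keys are hypothetical names, the offsets are derived from the field breakdown above, and only a handful of fields are decoded.

```python
# First sample record from the listing above, split in two for readability.
RECORD = ("0067011990999991950051507004+68750+023550FM-12+038299999"
          "V0203301N00671220001CN9999999N9+00001+99999999999")


def parse_record(line):
    """Slice a few fixed-width fields out of one NCDC record (hypothetical helper)."""
    return {
        "usaf": line[4:10],                        # USAF station identifier
        "date": line[15:23],                       # observation date, YYYYMMDD
        "time": line[23:27],                       # observation time, HHMM
        "latitude": int(line[28:34]) / 1000,       # stored as degrees x 1000
        "longitude": int(line[34:41]) / 1000,      # stored as degrees x 1000
        "elevation_m": int(line[46:51]),           # meters
        "air_temp_c": int(line[87:92]) / 10,       # stored as degrees Celsius x 10
        "air_temp_quality": line[92:93],           # one-character quality code
    }


fields = parse_record(RECORD)
for name, value in fields.items():
    print("{:<18}{}".format(name, value))
```

Note that `int()` accepts the explicit sign in values such as `+68750`, so no extra cleaning is needed before the conversion.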

## What are we going to solve?

Our goal is to find the maximum temperature recorded in each year, using Map Reduce. The Hadoop Streaming documentation is available on Hadoop's website: https://hadoop.apache.org/docs/r1.2.1/streaming.html


# Set-up environment variables

In our setup, some environment variables are not automatically loaded, in particular, the path to hadoop's executables. 

So let's add it for the Jupyter notebook as well as for the terminal.

## Jupyter notebooks

In [None]:
environment = %env

In [None]:
PATH = environment['PATH']

In [None]:
%env PATH=$PATH:/usr/local/hadoop/bin

In [None]:
%env HADOOP_HOME=/usr/local/hadoop/
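
The same setup can also be sketched in plain Python through `os.environ`, which child processes started from the notebook inherit. This is an alternative to the `%env` cells above, not a replacement for them; `add_to_path` is a hypothetical helper, and the paths are the ones used in this lab.

```python
import os
import shutil

HADOOP_BIN = "/usr/local/hadoop/bin"  # same directory as in the %env cell above


def add_to_path(directory):
    """Append directory to PATH for this process unless it is already listed."""
    entries = os.environ.get("PATH", "").split(os.pathsep)
    if directory not in entries:
        entries.append(directory)
        # Drop empty entries that appear when PATH was unset or empty.
        os.environ["PATH"] = os.pathsep.join(e for e in entries if e)
    return os.environ["PATH"]


add_to_path(HADOOP_BIN)
os.environ["HADOOP_HOME"] = "/usr/local/hadoop/"

# shutil.which returns None when the executable cannot be found on PATH.
print("hdfs found at:", shutil.which("hdfs"))
```

If the last line prints `None`, the Hadoop executables are still not reachable and the path should be checked.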

## Terminal

In the terminal, run the following command:

    source /etc/profile.d/apache-hadoop.sh
    

# Explore HDFS file system

HDFS has a web page where we can check its status and architecture, as well as browse the file system. The web page is available at https://iscte.me/hdfs .

There are, however, also some commands to work with HDFS. Let's see some of them.


### List files of your home

Note: `%%bash` is a magic that interprets all the following lines of the cell as bash commands.

In [None]:
%%bash 

hdfs dfs -ls

### datasets directory

Some datasets are already in our system. They are available at `/home/ABD/datasets`

Note: the magic `%%script` allows the following commands to be interpreted by any kind of environment (bash, python, sh, zsh, etc.).

Thus, `%%script env bash` is equivalent to `%%bash`:

In [None]:
%%script env bash 

hdfs dfs -ls /home/ABD/datasets

### What is the size of each dataset?

Enter a command below to see the size of each dataset.

In [None]:
%%bash

# your code here

### Copy the sample file from HDFS to a local folder.


Copy the sample file to a local folder. The sample is in the HDFS in the following path:

    /home/ABD/datasets/ncdc-sample/sample.txt

In [None]:
%%bash

# your code here

### Show the sample file's content.


In [None]:
%%bash

# your code here

---

# Map Reduce algorithm

## Mapper function

Use the magic `%%file` to create a mapper file. The year is at character positions 15 to 19, the temperature at 87 to 92, and the quality value at 92 to 93 (zero-based and end-exclusive, as in Python slices).

Process the standard input, i.e., `sys.stdin`, line by line in a for loop. Strip each line, extract the three fields, and check that the temperature is not "+9999" and that the quality value belongs to {0, 1, 4, 5, 9}. If both conditions are met, print the key-value pair (year and temperature), separated by a tab.

Checking whether a character belongs to a set can be done with a regular expression: `re.match("[01459]", variable)`
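
As a quick illustration of that check, the sketch below wraps the regular expression in a small function and tries it on a few codes. The name `is_valid_quality` is hypothetical and only used here.

```python
import re


def is_valid_quality(q):
    """Return True when the one-character quality code is 0, 1, 4, 5 or 9."""
    # re.match anchors at the start of the string; q is a single character here.
    return re.match("[01459]", q) is not None


for code in ["0", "1", "2", "4", "5", "9", ""]:
    print(repr(code), is_valid_quality(code))
```

An empty string does not match, so a truncated record with no quality code is rejected as well.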

In [None]:
%%file mapper.py
#!/usr/bin/env python3
import re
import sys

for line in sys.stdin:
    
    # your code here
    val = line.strip()
    (year, temp, q) = (val[15:19], val[87:92], val[92:93])
    if temp != "+9999" and re.match("[01459]", q):    
        print("{}\t{}".format(year, temp))

                                    

### Test the mapper function

To test your mapper function, run the following command:

    cat sample.txt | /home/.../mapper.py
    
Note: enter the full path to your `mapper.py` script.

Expected values:

    1950	+0000
    1950	+0022
    1950	-0011
    1949	+0111
    1949	+0078

In [None]:
%%bash

# your code here

## Sort

Sort can be achieved by using the `sort` command:

    cat sample.txt | /home/.../mapper.py | sort
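
Sorting matters because the next stage processes its input sequentially and can only recognize a key as one group when all of its lines are adjacent. The sketch below shows the effect with `itertools.groupby`, which has the same adjacency requirement. The interleaved order of `lines` is made up for illustration, and `key_of` and `runs` are hypothetical helpers.

```python
from itertools import groupby

# Mapper-style "year<TAB>temperature" lines in a made-up, interleaved order.
lines = ["1950\t+0000", "1949\t+0111", "1950\t+0022", "1949\t+0078", "1950\t-0011"]


def key_of(line):
    """Return the key (the year) of a tab-separated line."""
    return line.split("\t")[0]


def runs(seq):
    """List (key, run length) for each run of adjacent lines sharing a key."""
    return [(key, len(list(group))) for key, group in groupby(seq, key=key_of)]


print("unsorted:", runs(lines))          # every year is split into several runs
print("sorted:  ", runs(sorted(lines)))  # exactly one run per year
```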

In [None]:
%%bash

# your code here

## Reducer

The reducer receives the sorted key-value pairs, so all the temperatures of the same year arrive consecutively, and it outputs the maximum value for each year. The function is already defined:

In [None]:
%%file reducer.py
#!/usr/bin/env python3

import sys

last_key, max_val = (None, -sys.maxsize)
for line in sys.stdin:
    key, val = line.strip().split('\t')
    if last_key and last_key != key:
        print("{}\t{}".format(last_key, max_val))
        last_key, max_val = (key, int(val))
    else:
        last_key, max_val = key, max(max_val, int(val))

if last_key:
    print("{}\t{}".format(last_key, max_val))
    


# Test all your functions first

Before submitting a Map Reduce job, test all your functions together. We can do so with the following command:

    cat sample.txt | /home/joao/hadoop/mapper.py | sort | /home/joao/hadoop/reducer.py 
    
Measure the time it takes to complete using Python's `time` module.

    import time
    t0 = time.time()
    ...
    t1 = time.time()
    print('Elapsed time: {}s'.format(t1-t0))
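
The shell pipeline can also be mimicked entirely in Python, which is handy for checking the logic before involving any files. The sketch below is a local simulation under stated assumptions: `map_line` and `reduce_sorted` are hypothetical in-process stand-ins for `mapper.py` and `reducer.py`, and `SAMPLE` holds the five sample records shown at the top of this lab.

```python
import re
import time

SAMPLE = [
    "0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999",
    "0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999",
    "0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999",
    "0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999",
    "0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999",
]


def map_line(line):
    """Return (year, temperature) for a valid record, otherwise None."""
    val = line.strip()
    year, temp, q = val[15:19], val[87:92], val[92:93]
    if temp != "+9999" and re.match("[01459]", q):
        return year, temp
    return None


def reduce_sorted(pairs):
    """Maximum temperature per year; expects pairs already sorted by year."""
    result = []
    for year, temp in pairs:
        if result and result[-1][0] == year:
            result[-1] = (year, max(result[-1][1], int(temp)))
        else:
            result.append((year, int(temp)))
    return result


t0 = time.time()
mapped = [kv for kv in map(map_line, SAMPLE) if kv is not None]  # map
mapped.sort()                                                    # sort
reduced = reduce_sorted(mapped)                                  # reduce
t1 = time.time()

for year, temp in reduced:
    print("{}\t{}".format(year, temp))
print('Elapsed time: {}s'.format(t1 - t0))
```

The temperatures are printed in tenths of a degree Celsius, exactly as they are stored in the records.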
    

In [None]:
import time

In [None]:
t0 = time.time()
# your code here: start with !


# Run Map Reduce job in hadoop

The hadoop command to execute our code is the following:

    hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar\
       -files mapper.py,reducer.py\
       -input /home/ABD/datasets/ncdc/ \
       -mapper mapper.py\
       -combiner reducer.py\
       -reducer reducer.py\
       -output output

The options have the following meaning:

- `-files`: the files to upload to every node in the Hadoop cluster
- `-input`: the HDFS directory where the dataset's files are located
- `-mapper`: the mapper script
- `-combiner`: the combiner script, run on each mapper's output before it is sent to the reducers
- `-reducer`: the reducer script
- `-output`: the HDFS directory where the results are written; it must not already exist
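
The command passes `reducer.py` as both `-combiner` and `-reducer`. That is safe for this job because taking a maximum is associative and commutative, so the maximum of the per-split maxima equals the overall maximum. A minimal sketch of that reasoning, using made-up in-memory splits and a hypothetical helper `max_per_year` (not part of the lab's scripts):

```python
def max_per_year(pairs):
    """Return {year: maximum temperature} for an iterable of (year, temp) pairs."""
    result = {}
    for year, temp in pairs:
        result[year] = max(result.get(year, temp), temp)
    return result


# Two made-up input splits, as if handled by two different mappers.
split_a = [("1950", 0), ("1950", 22), ("1949", 111)]
split_b = [("1950", -11), ("1949", 78)]

# With a combiner: each split is pre-aggregated before the final reduce.
combined_a = list(max_per_year(split_a).items())
combined_b = list(max_per_year(split_b).items())
with_combiner = max_per_year(combined_a + combined_b)

# Without a combiner: the reducer sees every pair.
without_combiner = max_per_year(split_a + split_b)

print(with_combiner == without_combiner, with_combiner)
```

The same shortcut would not work for an operation such as the mean, where an average of per-split averages generally differs from the overall average.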



In [None]:
%%bash

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar\
       -files mapper.py,reducer.py\
       -input /home/ABD/datasets/ncdc \
       -mapper mapper.py\
       -combiner reducer.py\
       -reducer reducer.py\
       -output output

## Check the results in the output directory

In [None]:
%%bash

# your code here
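
The job's output consists of `year<TAB>temperature` lines, with the temperature still in tenths of a degree Celsius. The sketch below converts such lines to degrees; `to_celsius` is a hypothetical helper, and `output_lines` is a stand-in holding the values the sample file produces, not the result of the full dataset.

```python
def to_celsius(lines):
    """Map each "year<TAB>tenths" line to {year: temperature in degrees Celsius}."""
    result = {}
    for line in lines:
        year, tenths = line.strip().split("\t")
        result[int(year)] = int(tenths) / 10
    return result


# Stand-in for the job output; these are the maxima of the five sample records.
output_lines = ["1949\t111\n", "1950\t22\n"]

for year, celsius in sorted(to_celsius(output_lines).items()):
    print("{}: {:.1f} C".format(year, celsius))
```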

# Solutions

Mapper function: 
    
    %%file mapper.py
    #!/usr/bin/env python3
    import re
    import sys

    for line in sys.stdin:
        val = line.strip()
        (year, temp, q) = (val[15:19], val[87:92], val[92:93])
        if temp != "+9999" and re.match("[01459]", q):
            print("{}\t{}".format(year, temp))