# Python MapReduce


This notebook exemplifies the execution of a map-reduce program in Python, using Hadoop.
In this example, hadoop runs in standalone mode and reads data from the local filesystem, while in cluster mode data is read typically from HDFS dsitributed file system.


### Download the dataset 

In [1]:
!wget -O air_quality.csv https://www.dropbox.com/s/4jxfdsgn2tdo7zo/epa_hap_daily_summary-small.csv?dl=0

--2021-12-23 17:29:18--  https://www.dropbox.com/s/4jxfdsgn2tdo7zo/epa_hap_daily_summary-small.csv?dl=0
Resolving www.dropbox.com (www.dropbox.com)... 162.125.68.18, 2620:100:6024:18::a27d:4412
Connecting to www.dropbox.com (www.dropbox.com)|162.125.68.18|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/4jxfdsgn2tdo7zo/epa_hap_daily_summary-small.csv [following]
--2021-12-23 17:29:23--  https://www.dropbox.com/s/raw/4jxfdsgn2tdo7zo/epa_hap_daily_summary-small.csv
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc4600fedfcb79390d4a2e527e1d.dl.dropboxusercontent.com/cd/0/inline/BcY4xug1_aVEDRAn04Vyj2X9oyQiqueVT2MV83Rjr1oS8edED_EkzhPD4jrmONQLMWoXafByHUuQOqCbcl5-qlEClWQKbQEa6RuFD-TX89WflIy4kbLHU79M4rnxb2c-Qb0vW6DQ7oJBcnoaidUDUqGV/file# [following]
--2021-12-23 17:29:23--  https://uc4600fedfcb79390d4a2e527e1d.dl.dropboxusercontent.com/cd/0/inline/BcY4xug1_aVEDRAn04Vyj2X9o

### Mapper

By starting an element with "%%file", you are specifying that when run, the contents are written to the local disk.

In [6]:
%%file mapper.py
#!/usr/bin/env python

# import sys
import sys
# import string library function  
import string
import math


skipLine1=True

# input comes from STDIN (standard input)
for line in sys.stdin:
    if skipLine1 == False:
        # remove leading and trailing whitespace
        line = line.strip()
        # split the line into words
        words = line.split(",")
        
        #send county and average quality for the sample. Since the sample is done daily, there is no need 
        #to send the date or do any type of filtering
        print(words[25]+","+words[16])
    else:
        skipLine1 = False

Overwriting mapper.py


### Reducer

In [14]:
%%file reducer.py
#!/usr/bin/env python

import sys
import math
import string
lastCounty = None
listMean = list() #lists store repeated values
avgMean = 0.0
avgMean1 = 0.0


# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    county, mean = line.split(',', 1)
    if county == lastCounty:
        listMean.add(float(mean)) #convert values to float so they can get summed
    else:
        if lastCounty:
            ##sum all values of pollutant for this state and divide by the number of samples
            avgMean = str(math.fsum(list(listMean))/len(list(listMean)))
            avgMean1 = avgMean.zfill(0) #zfill desnecessario afinal
            print(avgMean+','+lastCounty)          
        lastCounty = county
        listMean = {float(mean)}
               
if lastCounty:
    ##sum all values of pollutatnt for this last state and divide by the number of samples
    avgMean = str(math.fsum(list(listMean))/len(list(listMean)))
    avgMean1 = avgMean.zfill(0) #zfill desnecessario afinal
    print(avgMean+','+lastCounty)
    
##sort values.. smallest will be first, zfill makes sure that strings are ordered


Overwriting reducer.py


### Local execution

The scripts can be tested using just the unix shell, as follows...

#### Make the scripts executable

In [15]:
!chmod a+x mapper.py && chmod a+x reducer.py

#### Execute

The execution workflow is as follows:

+ The input file is piped into the input of the mapper;
+ The output the mapper is sorted;
+ The sorted output of the mapper is fed to the reducer stage.

In [16]:
!cat "air_quality.csv" | ./mapper.py | sort -k1,1 | ./reducer.py | sort -k1,1 

0.00032333333333333335,Wrangell Petersburg
0.0004684782608695652,Josephine
0.00048750000000000003,Matanuska-Susitna
0.0005637142857142858,Kenai Peninsula
0.0005898039215686275,Powell
0.0005,Crook
0.0006271739130434782,Lewis
0.0006620833333333334,Taos
0.0006708955223880598,Clallam
0.0006862162162162162,Garden
0.0007133333333333333,Sandusky
0.0007305454545454545,Sanders
0.0007508333333333334,Aleutians East
0.0007671153846153846,Thomas
0.0007708333333333333,Roosevelt
0.0007991304347826087,Mono
0.0008047422680412372,Maui
0.0008137999999999999,Trinity
0.00081725,Rosebud
0.0008324,Okanogan
0.0008364000000000001,Rio Arriba
0.0008384615384615385,Siskiyou
0.0008555696202531645,Keweenaw
0.0008692452830188679,Lemhi
0.0008708510638297872,Denali
0.0008870491803278688,Sheridan
0.00091734375,Apache
0.0009225423728813559,Wallowa
0.0009614000000000001,Del Norte
0.0009740740740740741,Hawaii
0.0009855445544554456,Sublette
0.0009,Vilas
0.0010026050420168066,Shasta
0.001009

### Hadoop standalone mode execution

For executing in an hadoop cluster, input data should be moved into an HDFS directory. For executing in standalone mode, data can be read from the local filesystem. 


The output directory needs to be cleared...

In [6]:
rm -rf results

#### Submitting the job

The _hadoop_ command is used to submit the mapreduce job to the cluster...

In [7]:
!hadoop jar /opt/hadoop-3.2.0/share/hadoop/tools/lib/hadoop-*streaming*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input os_maias.txt -output results

2021-12-23 17:29:56,074 INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties
2021-12-23 17:29:56,129 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2021-12-23 17:29:56,129 INFO impl.MetricsSystemImpl: JobTracker metrics system started
2021-12-23 17:29:56,142 WARN impl.MetricsSystemImpl: JobTracker metrics system already initialized!
2021-12-23 17:29:56,264 INFO mapreduce.JobSubmitter: Cleaning up the staging area file:/tmp/hadoop/mapred/staging/jovyan1942714672/.staging/job_local1942714672_0001
2021-12-23 17:29:56,264 ERROR streaming.StreamJob: Error Launching job : Input path does not exist: file:/home/jovyan/work/os_maias.txt
Streaming Command Failed!


#### Checking the results
The result is stored in directory results.

In [8]:
!cat results/part-*

cat: 'results/part-*': No such file or directory
