# Project3-Climate Chart

## Description
Given a geohash prefix as input (*input_geo_hash*), this notebook creates a climate chart showing the high, low, and average temperatures and the average rainfall for each month.  

The *parse_line* function parses each line read from the given file path (in HDFS). Records whose location does not match the given geohash prefix, or that fail to parse, are mapped to a sentinel value and then filtered out.  
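
To make the prefix filter concrete, here is a minimal sketch of how a latitude/longitude pair becomes a geohash string. The notebook itself relies on the `geohash` package; `encode_geohash` below is a hypothetical pure-Python stand-in written only for illustration, following the standard geohash scheme (alternating longitude and latitude bits, five bits per base-32 character).

```python
# Hypothetical stand-in for geohash.encode, for illustration only.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def encode_geohash(lat, lon, precision=12):
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits form one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

# Approximate coordinates for Las Vegas (an assumption for this example).
las_vegas = encode_geohash(36.17, -115.14, precision=5)
print(las_vegas, las_vegas.startswith("9qq"))
```

A longer prefix selects a smaller cell, so `startswith` on the encoded string works as a cheap bounding-box test.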

## Input 
Assign a geohash prefix to *input_geo_hash*. The default value is "9qq" (roughly the Las Vegas area).  
A NAM dataset is also required. This example uses a sampled dataset; the path below points to `sampled_2016`.

In [29]:
input_geo_hash = '9qq'
text_file = sc.textFile("hdfs://orion11:12001/pj3/3hr_sample/sampled_2016")

In [30]:
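import datetime

import geohash

def parse_line(line):
    """Map one tab-separated record to (month, (temperature, precipitation)).

    Records outside the geohash prefix, or records that fail to parse,
    map to the sentinel (0, (0, 0)) so they can be filtered out afterwards.
    """
    variables = line.split("\t")
    try:
        # Column 0 holds the timestamp in milliseconds.
        milliseconds = int(variables[0])
        dt = datetime.datetime.fromtimestamp(milliseconds / 1000.0)

        lat = float(variables[1])
        lon = float(variables[2])
        temperature = float(variables[10])
        precipitation = float(variables[13])
        gh = geohash.encode(lat, lon)
    except Exception:
        # Malformed or incomplete record.
        return (0, (0, 0))

    if gh.startswith(input_geo_hash):
        return (dt.month, (temperature, precipitation))
    return (0, (0, 0))

# Month 0 is the sentinel for rejected records; keep only real months.
parsed_data = text_file.map(parse_line).filter(lambda val: val[0] > 0)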

## Spark Strategy
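We use *aggregateByKey* to combine all records for the same month within the selected area. To do so, we define a zero value, a sequence function (which folds one record into an accumulator), and a combine function (which merges two accumulators) for *aggregateByKey*.   

The accumulator is a tuple of (count, max temperature, min temperature, sum of precipitation, sum of temperature).   

The third element of the zero value is 9999 so that the first observed temperature replaces it as the minimum. The second element (the maximum) starts at 0, which is only safe if temperatures are never negative, for example when they are given in Kelvin.    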

In [31]:
zero_value = (0, 0, 9999, 0, 0)

def seqf(acc, temp_prec_pair):
    (temp, prec) = temp_prec_pair
    cnt = acc[0] + 1
    max_temp = max(acc[1], temp)
    min_temp = min(acc[2], temp)
    sum_prec = acc[3] + prec
    sum_temp = acc[4] + temp    
    return (cnt, max_temp, min_temp, sum_prec, sum_temp)

def combf(a, b): 
    cnt = a[0] + b[0]
    max_temp = max(a[1], b[1])
    min_temp = min(a[2], b[2])
    sum_prec = a[3] + b[3]
    sum_temp = a[4] + b[4]
    return (cnt, max_temp, min_temp, sum_prec, sum_temp)

def map_to_output_format(metric):
    month = "{:02d}".format(metric[0])
    cnt = metric[1][0]
    max_temp = metric[1][1]
    min_temp = metric[1][2]
    avg_prec = metric[1][3] / cnt
    avg_temp = metric[1][4] / cnt
    return (month, max_temp, min_temp, avg_prec, avg_temp)
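
To see what *aggregateByKey* does with these functions, the sketch below simulates it for a single month in plain Python: each partition is folded with the sequence function starting from the zero value, and the partial results are then merged with the combine function. The names `seq_op` and `comb_op` mirror `seqf` and `combf`, and the readings and the two-partition split are made up for illustration.

```python
from functools import reduce

# (count, max temp, min temp, sum precipitation, sum temperature)
ZERO = (0, 0, 9999, 0, 0)

def seq_op(acc, pair):
    # Fold one (temperature, precipitation) reading into the accumulator.
    temp, prec = pair
    return (acc[0] + 1, max(acc[1], temp), min(acc[2], temp),
            acc[3] + prec, acc[4] + temp)

def comb_op(a, b):
    # Merge two accumulators produced on different partitions.
    return (a[0] + b[0], max(a[1], b[1]), min(a[2], b[2]),
            a[3] + b[3], a[4] + b[4])

# Made-up readings for one month, split across two partitions.
partition_a = [(280.0, 1.0), (290.0, 0.0)]
partition_b = [(270.0, 2.0)]

partials = [reduce(seq_op, part, ZERO) for part in (partition_a, partition_b)]
total = reduce(comb_op, partials)
print(total)  # (3, 290.0, 270.0, 3.0, 840.0)
```

Keeping sums and a count in the accumulator, rather than running averages, is what lets partial results from different partitions be merged exactly.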


### Start to aggregate data
We aggregate the data here and write the result to a .clim file, which the instructor's script uses to create a chart.    
**Note:** If the result does not contain 12 elements (one per month), an exception is raised and no output file is created.   
The output files are written under climate-chart-script/

In [32]:
aggregated = parsed_data.aggregateByKey(zero_value, seqf, combf) \
                        .map(map_to_output_format)
output_file_path = 'climate-chart-script/climate-chart-' + input_geo_hash + '.clim'
# collect() does not guarantee key order, so sort the rows by month label.
output_data = sorted(aggregated.collect())

if len(output_data) != 12:
    raise Exception("Not enough data to make a chart")
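
When the check fails, it helps to know which months have no data for the chosen prefix. The helper below is a hypothetical addition, not part of the original notebook. It assumes rows shaped like the output of `map_to_output_format`, with a zero-padded month label as the first element; the sample rows are made up.

```python
def missing_months(rows):
    """Return the zero-padded month labels that have no row."""
    present = {row[0] for row in rows}
    expected = ["{:02d}".format(m) for m in range(1, 13)]
    return [m for m in expected if m not in present]

# Made-up rows: (month, max temp, min temp, avg precipitation, avg temperature)
sample_rows = [
    ("01", 290.0, 270.0, 0.5, 280.0),
    ("03", 295.0, 275.0, 0.2, 285.0),
]
print(missing_months(sample_rows))
```

Calling `missing_months(output_data)` before raising would let the exception message name the months that are missing.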


If the check above passes, the next cell writes the .clim output file and then runs the instructor's script to create a chart. 

In [33]:
import os
from IPython.display import IFrame

# Header comment, then one line per month:
# month, max temp, min temp, avg precipitation, avg temperature.
with open(output_file_path, 'w') as f:
    f.write("#Geo-hash starts with " + input_geo_hash + "\n")
    for elem in output_data:
        f.write("{} {:.5f} {:.5f} {:.5f} {:.5f}\n".format(*elem))

os.system('python climate-chart-script/plot.py ' + output_file_path)
IFrame(output_file_path + '-climate.pdf', width=600, height=600)