# Climate Chart

### Problem
Given a Geohash prefix, create a climate chart for the region. This includes high, low, and average temperatures, as well as monthly average rainfall. 

Earn up to 1 point of extra credit for enhancing or improving this chart, or for porting it to a more feature-rich visualization library.

## Set up the environment
Import the required Spark SQL types and build the schema from the feature list.

In [123]:
from pyspark.sql.types import StructType, StructField, FloatType, LongType, StringType

In [124]:
hdfs_port = "hdfs://orion11:26990"
# data_path = "/nam_s/nam_201501_s*"
data_path = "/nam_s/*"
# data_path = "/sample/nam_tiny*"

In [125]:
feats = []
# Use a context manager so the features file is closed after reading
with open('../features.txt') as f:
    for line_num, line in enumerate(f):
        if line_num == 0:
            # Timestamp
            feats.append(StructField(line.strip(), LongType(), True))
        elif line_num == 1:
            # Geohash
            feats.append(StructField(line.strip(), StringType(), True))
        else:
            # Other features
            feats.append(StructField(line.strip(), FloatType(), True))

schema = StructType(feats)

## Loading the Data

In [126]:
df = spark.read.format('csv').option('sep', '\t').schema(schema).load(f'{hdfs_port}{data_path}')

CPU times: user 0 ns, sys: 2.93 ms, total: 2.93 ms
Wall time: 50 ms


## Custom Geohash

This is the Geohash prefix that selects the region to chart. Every record whose Geohash starts with this prefix is included.

In [127]:
hash_prefix = "c6s64"
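To get a feel for how large a region a prefix covers, the standard Geohash decoding can be sketched in plain Python. The helper name `geohash_bounds` is hypothetical and not part of the notebook. It walks the base-32 characters, alternating longitude and latitude bits, and returns the bounding box of the cell.

```python
# Standard Geohash base-32 alphabet (no a, i, l, o)
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_bounds(geohash):
    """Return (lat_min, lat_max, lon_min, lon_max) for a Geohash string.

    Hypothetical helper for illustration; not used elsewhere in the notebook.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # bits alternate, starting with longitude
    for ch in geohash:
        value = BASE32.index(ch)
        for shift in range(4, -1, -1):  # 5 bits per character, high bit first
            bit = (value >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    return lat_lo, lat_hi, lon_lo, lon_hi


lat_min, lat_max, lon_min, lon_max = geohash_bounds("c6s64")
print(f"lat {lat_min:.4f}..{lat_max:.4f}, lon {lon_min:.4f}..{lon_max:.4f}")
```

A five-character prefix carries 25 bits (13 for longitude, 12 for latitude), so the cell spans roughly 0.044 degrees in each direction.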

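## Query and Filtering

This query keeps only the rows whose Geohash starts with hash_prefix, then computes the average, minimum, and maximum surface temperature for each month. It also averages the rain flag (1 = rain, 0 = no rain), which gives the fraction of observations per month that reported rain.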
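The monthly grouping relies on turning a millisecond timestamp into a year-month label. As a rough illustration of that step outside Spark, here is a minimal sketch; the function name `year_month` is hypothetical, and it assumes UTC, whereas Spark formats timestamps in the session time zone.

```python
from datetime import datetime, timezone


def year_month(timestamp_ms):
    """Bucket a millisecond Unix timestamp into a 'YYYY-MM' label (UTC assumed)."""
    # The dataset stores milliseconds, so divide by 1000 to get seconds
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m")


# Midnight on 1 January 2015 (UTC), expressed in milliseconds
print(year_month(1420070400000))
```

All records that map to the same label end up in the same group, so each output row of the query describes one calendar month.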


In [3]:
%%time

dg = df

dg.createOrReplaceTempView("nam_small")

query_str = f'''
SELECT AVG(temperature_surface) AS tmp_avg,
    MIN(temperature_surface) AS tmp_min,
    MAX(temperature_surface) AS tmp_max,
    AVG(categorical_rain_yes1_no0_surface) AS rain_avg,
    FROM_UNIXTIME(Timestamp/1000, 'yyyy-MM') AS year_month
FROM nam_small
WHERE Geohash LIKE "{hash_prefix}%"
GROUP BY year_month
ORDER BY year_month
'''

print(query_str)

NameError: name 'df' is not defined

### Run the SQL Query

In [1]:
%%time

climate = spark.sql(query_str).collect()

NameError: name 'query_str' is not defined

## Results

In [None]:
climate
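Before plotting, the collected rows need to be reshaped into one list per series. The following is a minimal sketch of that step, not the notebook's own charting code. The helper name `to_chart_series` is hypothetical, and the Kelvin-to-Celsius conversion assumes `temperature_surface` is stored in Kelvin, which the notebook does not confirm. Rows only need to support lookup by column name, which both Spark `Row` objects and plain dicts do.

```python
def to_chart_series(rows, kelvin=True):
    """Reshape monthly result rows into parallel lists for a climate chart.

    Hypothetical helper. `kelvin=True` assumes temperatures are in Kelvin
    and converts them to Celsius; pass False to keep the values unchanged.
    """
    offset = 273.15 if kelvin else 0.0
    series = {"month": [], "low": [], "avg": [], "high": [], "rain": []}
    for row in rows:
        series["month"].append(row["year_month"])
        series["low"].append(row["tmp_min"] - offset)
        series["avg"].append(row["tmp_avg"] - offset)
        series["high"].append(row["tmp_max"] - offset)
        # rain_avg is the mean of a 0/1 flag, i.e. the fraction of rainy observations
        series["rain"].append(row["rain_avg"])
    return series


# Made-up sample rows for illustration only; real rows come from `climate`
sample = [
    {"year_month": "2015-01", "tmp_min": 263.15, "tmp_avg": 273.15,
     "tmp_max": 283.15, "rain_avg": 0.25},
    {"year_month": "2015-02", "tmp_min": 268.15, "tmp_avg": 276.15,
     "tmp_max": 285.15, "rain_avg": 0.10},
]
series = to_chart_series(sample)
print(series["month"])
```

With the series in this shape, a plotting library can draw the high, low, and average temperatures as lines and the rain fraction as bars on a secondary axis.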