# So Snowy



In [1]:
from pyspark.sql.types import StructType, StructField, FloatType, LongType, StringType
import pyspark.sql.functions as F
import numpy as np

from scipy.constants import convert_temperature

from datetime import datetime

In [2]:
hdfs_port = "hdfs://orion11:26990"
# data_path = "/nam_s/nam_201501_s*"
data_path = "/nam_s/*"
# data_path = "/sample/nam_tiny*"

In [3]:
feats = []
# Read one feature name per line; the context manager closes the file when done.
with open('../features.txt') as f:
    for line_num, line in enumerate(f):
        line = line.strip()
        if line_num == 0:
            # Timestamp
            feats.append(StructField(line, LongType(), True))
        elif line_num == 1:
            # Geohash
            feats.append(StructField(line, StringType(), True))
        else:
            # Other features
            feats.append(StructField(line, FloatType(), True))

schema = StructType(feats)

In [4]:
df = spark.read.format('csv').option('sep', '\t').schema(schema).load(f'{hdfs_port}{data_path}')

## Filter

I assume that a value of 1 for the feature "categorical_snow_yes1_no0_surface" means there is snow on the surface. Summing this feature per Geohash counts the snowy observations, and dividing that sum by the number of observations gives the fraction of observations in which each location has snow on its surface. The location with the highest fraction is the one that has snow on its surface the most.

In [24]:
%%time

dg = df.select("Timestamp", "Geohash", "categorical_snow_yes1_no0_surface", "snow_cover_surface")
# Per-Geohash aggregates: snowy-observation count, snow-cover statistics, and total observations.
dg = dg.groupby("Geohash").agg(
    F.sum("categorical_snow_yes1_no0_surface"),
    F.avg("snow_cover_surface"),
    F.min("snow_cover_surface"),
    F.max("snow_cover_surface"),
    F.count("Timestamp"),
)
# Fraction of observations flagged as snowy, highest first.
dg = dg.withColumn("ratio_snow_counts", dg["sum(categorical_snow_yes1_no0_surface)"] / dg["count(Timestamp)"])\
    .sort(F.desc("ratio_snow_counts"))
dg_head = dg.head(10)

CPU times: user 22.9 ms, sys: 3.08 ms, total: 26 ms
Wall time: 2min 59s


## Results

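It appears that none of the locations in the small data set has year-round snow: even the top-ranked Geohash has the snow flag set in only about 40% of its observations.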
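As a quick sanity check on that claim, the ratios can be recomputed by hand from the snow-flag sums and observation counts of the four complete rows in the output below. This is a plain-Python sketch, not part of the Spark job; the `rows` dictionary and the "ratio of 1.0" definition of year-round snow are my own illustrative assumptions.

```python
# Snow-flag sum and observation count per Geohash, copied from the output rows.
rows = {
    "c41uhb4r5n00": (168.0, 421),
    "c45277s4gjpb": (163.0, 423),
    "c44jc11cn1rz": (165.0, 433),
    "c41yek3dwk2p": (152.0, 403),
}

# Fraction of observations with the snow flag set.
ratios = {gh: snow / count for gh, (snow, count) in rows.items()}

# Assumption: "year-round snow" means the flag is set in every observation.
year_round = [gh for gh, ratio in ratios.items() if ratio >= 1.0]

for gh, ratio in ratios.items():
    print(f"{gh}: {ratio:.3f}")
print("year-round:", year_round)
```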

In [26]:
dg_head

[Row(Geohash='c41uhb4r5n00', sum(categorical_snow_yes1_no0_surface)=168.0, avg(snow_cover_surface)=98.812351543943, min(snow_cover_surface)=0.0, max(snow_cover_surface)=100.0, count(Timestamp)=421, ratio_snow_counts=0.3990498812351544),
 Row(Geohash='c45277s4gjpb', sum(categorical_snow_yes1_no0_surface)=163.0, avg(snow_cover_surface)=88.87706855791963, min(snow_cover_surface)=0.0, max(snow_cover_surface)=100.0, count(Timestamp)=423, ratio_snow_counts=0.38534278959810875),
 Row(Geohash='c44jc11cn1rz', sum(categorical_snow_yes1_no0_surface)=165.0, avg(snow_cover_surface)=98.15242494226328, min(snow_cover_surface)=0.0, max(snow_cover_surface)=100.0, count(Timestamp)=433, ratio_snow_counts=0.3810623556581986),
 Row(Geohash='c41yek3dwk2p', sum(categorical_snow_yes1_no0_surface)=152.0, avg(snow_cover_surface)=98.01488833746899, min(snow_cover_surface)=0.0, max(snow_cover_surface)=100.0, count(Timestamp)=403, ratio_snow_counts=0.3771712158808933),
 Row(Geohash='c1uz20wg2gxb', sum(categorical_

## Analysis

The location with the most snow is "c41u...", which is in the Coast Mountains, a mountain range northeast of the town of Petersburg, Alaska. The town itself is in a basin, across the sound from the mountains, so it probably doesn't receive as much snow as a mountain peak.
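To check where a Geohash falls on the map, it can be decoded into a latitude and longitude. The function below is a minimal sketch of the standard Geohash decoding scheme (base-32 characters, with bits alternating between longitude and latitude, longitude first). `decode_geohash` is a helper name of my own and is not used elsewhere in this notebook.

```python
# Standard Geohash base-32 alphabet (no "a", "i", "l", or "o").
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def decode_geohash(geohash):
    """Return the (latitude, longitude) at the center of the Geohash cell."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # bits alternate, starting with longitude

    for ch in geohash:
        value = BASE32.index(ch)
        # Each character carries 5 bits, most significant first.
        for shift in range(4, -1, -1):
            bit = (value >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon

    return (lat_lo + lat_hi) / 2, (lon_lo + lon_hi) / 2


print(decode_geohash("c41uhb4r5n00"))
```

For the four-character prefix "c41u", this gives a cell centered near 57.04°N, 132.36°W, in southeast Alaska.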

It is a little surprising that this location is not farther north, such as inside the Arctic Circle, where there is permafrost in the ground. This is probably because the reduced data set is a much smaller set (I'm assuming it's built by taking a subset of Geohashes and all their relevant data).

![Petersburg, Alaska](img/petersburg.png)

![Alaska](img/alaska.png)