# Hot hot hot

In this notebook, I find the hottest recorded data point in the data set and look at when and where it was recorded.

## Initializing

First, I set up the imports and paths and load the data into Spark. I'm loading the smaller data set for a faster turnaround time.

In [1]:
from pyspark.sql.types import StructType, StructField, FloatType, LongType, StringType
import pyspark.sql.functions as F
import numpy as np

from scipy.constants import convert_temperature

from datetime import datetime

In [2]:
hdfs_port = "hdfs://orion11:26990"
# data_path = "/nam_s/nam_201501_s*"
data_path = "/nam_s/*"
# data_path = "/sample/nam_tiny*"

In [3]:
feats = []
# Build the schema from the feature list: one column name per line.
# The context manager closes the file once the schema fields are read.
with open('../features.txt') as f:
    for line_num, line in enumerate(f):
        line = line.strip()
        if line_num == 0:
            # Timestamp
            feats.append(StructField(line, LongType(), True))
        elif line_num == 1:
            # Geohash
            feats.append(StructField(line, StringType(), True))
        else:
            # Other features
            feats.append(StructField(line, FloatType(), True))

schema = StructType(feats)

In [4]:
df = spark.read.format('csv').option('sep', '\t').schema(schema).load(f'{hdfs_port}{data_path}')

## Filtering

After loading the data, I reduce the data set to only the relevant columns: Timestamp, Geohash, and temperature_surface.

I then order it by the surface temperature and look at the top 30 values.

In [None]:
dg = df.select("Timestamp", "Geohash", "temperature_surface")
dg = dg.orderBy(F.desc("temperature_surface"))

In [None]:
%%time

dg_head = dg.head(30)
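
Only the 30 largest values are needed here, so selecting them does not strictly require a full sort of the data set. As a plain-Python illustration of the same top-N selection (not how Spark executes the query), the sketch below uses `heapq.nlargest`. The rows are made-up values and `top_n_by_temperature` is a hypothetical helper, neither comes from the data set.

```python
import heapq

# Hypothetical rows: (Timestamp in ms, Geohash, temperature_surface in Kelvin).
rows = [
    (1420070400000, "d5f0", 301.2),
    (1420081200000, "9q8y", 288.7),
    (1420092000000, "d5dt", 305.9),
    (1420102800000, "c2b2", 270.4),
]


def top_n_by_temperature(rows, n):
    """Return the n rows with the highest temperature, hottest first."""
    return heapq.nlargest(n, rows, key=lambda row: row[2])


hottest = top_n_by_temperature(rows, 2)
print(hottest)
```

Asking for more rows than exist simply returns all of them in descending order, which mirrors how `head(30)` behaves on a data set with fewer than 30 rows.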

## Results

Now I convert the top 30 rows to a pandas DataFrame for easier use, adding a readable date and the temperature in Fahrenheit.

In [None]:
%%time

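rdd = sc.parallelize(dg_head)
p_df = rdd.toDF().toPandas()
# Timestamps are in epoch milliseconds; convert them to a UTC calendar date.
p_df["date"] = p_df.apply(lambda x: datetime.utcfromtimestamp(x["Timestamp"]//1000).strftime('%Y-%m-%d'), axis=1)
# Surface temperatures are stored in Kelvin; convert them to Fahrenheit.
p_df["fahrenheit"] = p_df.apply(lambda x: convert_temperature(x.temperature_surface, 'K', 'F'), axis=1)
p_df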
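
The two conversions in this cell can also be written out by hand, which makes the arithmetic explicit. The following is a minimal standard-library sketch, assuming timestamps in epoch milliseconds and temperatures in Kelvin as above; `kelvin_to_fahrenheit` and `millis_to_date` are hypothetical helper names, and the sample inputs are illustrative rather than taken from the results.

```python
from datetime import datetime, timezone


def kelvin_to_fahrenheit(kelvin):
    """Convert Kelvin to Fahrenheit: F = (K - 273.15) * 9/5 + 32."""
    return (kelvin - 273.15) * 9 / 5 + 32


def millis_to_date(timestamp_ms):
    """Convert epoch milliseconds to a UTC date string (YYYY-MM-DD)."""
    seconds = timestamp_ms // 1000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime('%Y-%m-%d')


# Illustrative inputs: the freezing point of water and an arbitrary timestamp.
print(kelvin_to_fahrenheit(273.15))
print(millis_to_date(1420070400000))
```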

## Analysis

It appears that while this is an anomaly (temperatures of 135 °F are extremely rare), it is not the only one of its kind. The rest of the top 30 values have similar surface temperatures, and all fall in the same "d5" geohash region.

The highest temperature recorded maps to Cancún, Mexico.

![cancun](img/cancun.png)
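
To see what area the "d5" prefix covers, a geohash can be decoded into a latitude/longitude bounding box. The sketch below implements the standard geohash decoding (base-32 characters, bits alternating between longitude and latitude, starting with longitude). `geohash_bounds` is a hypothetical helper, and the Cancún coordinates are approximate values that I am assuming, not values taken from the data set.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_bounds(geohash):
    """Return (lat_min, lat_max, lon_min, lon_max) for a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    is_lon = True  # bits alternate, starting with longitude
    for char in geohash.lower():
        value = BASE32.index(char)
        for shift in range(4, -1, -1):  # 5 bits per character, high bit first
            bit = (value >> shift) & 1
            if is_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            is_lon = not is_lon
    return lat_lo, lat_hi, lon_lo, lon_hi


lat_min, lat_max, lon_min, lon_max = geohash_bounds("d5")

# Approximate coordinates for Cancún (assumed for illustration).
cancun_lat, cancun_lon = 21.16, -86.85
in_region = lat_min <= cancun_lat <= lat_max and lon_min <= cancun_lon <= lon_max
print((lat_min, lat_max, lon_min, lon_max), in_region)
```

Each extra character narrows the box further, so the full geohash of the hottest row pins down a much smaller area than the two-character prefix.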