# Hot hot hot

In this notebook, I try to find the hottest recorded data point and some information about it.

### Takeaway
The data is not 100% correct. The hottest temperatures in the data set are higher than the hottest widely accepted temperature ever recorded.

## Initializing

First we initialize the data and load it into Spark. I'm loading the smaller data set for a faster turnaround time.

In [1]:
from pyspark.sql.types import StructType, StructField, FloatType, LongType, StringType
import pyspark.sql.functions as F
import numpy as np

from scipy.constants import convert_temperature

from datetime import datetime

In [2]:
hdfs_port = "hdfs://orion11:26990"
# data_path = "/nam_s/nam_201501_s*"
data_path = "/nam_s/*"
# data_path = "/sample/nam_tiny*"

In [3]:
feats = []
# Build the schema from the feature names, one name per line
with open('../features.txt') as f:
    for line_num, line in enumerate(f):
        line = line.strip()
        if line_num == 0:
            # Timestamp
            feats.append(StructField(line, LongType(), True))
        elif line_num == 1:
            # Geohash
            feats.append(StructField(line, StringType(), True))
        else:
            # Other features
            feats.append(StructField(line, FloatType(), True))

schema = StructType(feats)

In [4]:
df = spark.read.format('csv').option('sep', '\t').schema(schema).load(f'{hdfs_port}{data_path}')
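
As an aside, `DataFrameReader.schema` also accepts a DDL-formatted string, so the same schema could be sketched without building `StructField` objects by hand. The helper name `build_ddl_schema` and the three example feature names are assumptions for illustration; the real names come from `features.txt`.

```python
def build_ddl_schema(feature_names):
    """Return a Spark DDL schema string: first column BIGINT, second STRING, rest FLOAT."""
    columns = []
    for position, name in enumerate(feature_names):
        if position == 0:
            sql_type = "BIGINT"   # Timestamp (epoch milliseconds)
        elif position == 1:
            sql_type = "STRING"   # Geohash
        else:
            sql_type = "FLOAT"    # every other feature
        # Backticks keep names with unusual characters valid in DDL
        columns.append(f"`{name}` {sql_type}")
    return ", ".join(columns)

# Hypothetical feature names, standing in for the lines of features.txt
example_names = ["Timestamp", "Geohash", "temperature_surface"]
ddl_schema = build_ddl_schema(example_names)
print(ddl_schema)
```

The resulting string could then be passed as `.schema(ddl_schema)` in place of the `StructType`.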

## Filtering

After loading the data, I reduce the data set to only the relevant columns: Timestamp, Geohash and temperature_surface.

I then order it by the surface temperature and look at the top 30 values.

In [5]:
dg = df.select("Timestamp", "Geohash", "temperature_surface")
dg = dg.orderBy(F.desc("temperature_surface"))

In [6]:
%%time

dg_head = dg.head(30)

CPU times: user 29.1 ms, sys: 6.13 ms, total: 35.2 ms
Wall time: 4min 48s
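
Ordering by temperature and reading the first 30 rows is, conceptually, a top-N selection. The sketch below shows the same idea in plain Python with `heapq.nlargest`. It is only an illustration on made-up rows, not a description of what Spark does internally, and the function name `top_n_hottest` is hypothetical.

```python
import heapq

def top_n_hottest(rows, n):
    """Return the n rows with the highest temperature, hottest first.

    Each row is a (timestamp, geohash, temperature_surface) tuple.
    Rows with a missing temperature are skipped.
    """
    with_values = (row for row in rows if row[2] is not None)
    return heapq.nlargest(n, with_values, key=lambda row: row[2])

# Made-up rows for illustration only
sample_rows = [
    (1, "aaaa", 290.5),
    (2, "bbbb", 310.0),
    (3, "cccc", None),
    (4, "dddd", 305.25),
]
hottest = top_n_hottest(sample_rows, 2)
print(hottest)
```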


## Results

Now I convert the top 30 rows to a pandas data frame for easier use.

In [7]:
%%time

rdd = sc.parallelize(dg_head)
p_df = rdd.toDF().toPandas()
p_df["date"] = p_df.apply(lambda x: datetime.utcfromtimestamp(x["Timestamp"]//1000).strftime('%Y-%m-%d'), axis=1)
p_df["fahrenheit"] = p_df.apply(lambda x: convert_temperature(x.temperature_surface, 'K', 'F'), axis=1)

CPU times: user 184 ms, sys: 44.7 ms, total: 229 ms
Wall time: 19.1 s


In [9]:
p_df

| | Timestamp | Geohash | temperature_surface | date | fahrenheit |
|---|---|---|---|---|---|
| 0 | 1440352800000 | d5f0jqerq27b | 330.674316 | 2015-08-23 | 135.54377 |
| 1 | 1440266400000 | d5f0vd8eb80p | 330.640625 | 2015-08-22 | 135.483125 |
| 2 | 1430157600000 | 9g77js659k20 | 330.604492 | 2015-04-27 | 135.418086 |
| 3 | 1439056800000 | d5f0jqerq27b | 330.536621 | 2015-08-08 | 135.295918 |
| 4 | 1440612000000 | d59d5yttuc5b | 330.481934 | 2015-08-26 | 135.19748 |
| 5 | 1440612000000 | d59eqv7e03pb | 330.356934 | 2015-08-26 | 134.97248 |
| 6 | 1440612000000 | d59dntd726gz | 330.231934 | 2015-08-26 | 134.74748 |
| 7 | 1440698400000 | d59eqv7e03pb | 330.220703 | 2015-08-27 | 134.727266 |
| 8 | 1438279200000 | d5f04xyhucez | 330.179932 | 2015-07-30 | 134.653877 |
| 9 | 1439488800000 | d5dpds10m55b | 330.149902 | 2015-08-13 | 134.599824 |
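
The date and Fahrenheit columns can be checked by hand. The sketch below redoes both conversions for row 0 using only the standard library. It assumes the Timestamp is in milliseconds since the Unix epoch (as the `// 1000` in the cell implies) and uses the standard kelvin-to-Fahrenheit formula; the two function names are hypothetical.

```python
from datetime import datetime, timezone

def epoch_ms_to_date(timestamp_ms):
    """Format a millisecond Unix timestamp as a UTC date string."""
    moment = datetime.fromtimestamp(timestamp_ms // 1000, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")

def kelvin_to_fahrenheit(kelvin):
    """Convert kelvin to degrees Fahrenheit."""
    return (kelvin - 273.15) * 9 / 5 + 32

# Values copied from row 0 of the table
row_date = epoch_ms_to_date(1440352800000)
row_fahrenheit = kelvin_to_fahrenheit(330.674316)
print(row_date, round(row_fahrenheit, 5))
```

Both results agree with the values shown for row 0.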


## Analysis

The data shows that while this is an anomaly (temperatures of 135.5 °F are extremely rare), it is not the only one of its kind. The rest of the top 30 values have similar surface temperatures, and nearly all of the ones shown above are in the same "d5" geohash region (the one exception is 9g77js659k20).

The highest recorded temperature maps to Cancún, Mexico. In fact, most of the locations listed with a geohash starting with "d5" are in or near Cancún.

![cancun](img/cancun.png)
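
To see how a geohash maps to a place, the standard geohash decoding can be sketched in a few lines: each base-32 character contributes five bits, which alternately halve the longitude and latitude ranges. This is a minimal stand-alone sketch, not the tool used to produce the map above, and the function name `decode_geohash` is hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def decode_geohash(geohash):
    """Return the (latitude, longitude) at the centre of a geohash cell."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    use_lon = True  # bits alternate, starting with longitude
    for char in geohash:
        value = BASE32.index(char)
        for shift in range(4, -1, -1):
            bit = (value >> shift) & 1
            target = lon_range if use_lon else lat_range
            mid = (target[0] + target[1]) / 2
            if bit:
                target[0] = mid  # keep the upper half
            else:
                target[1] = mid  # keep the lower half
            use_lon = not use_lon
    lat = (lat_range[0] + lat_range[1]) / 2
    lon = (lon_range[0] + lon_range[1]) / 2
    return lat, lon

# Geohash of the hottest row in the table
lat, lon = decode_geohash("d5f0jqerq27b")
print(round(lat, 3), round(lon, 3))
```

The `d5f0` prefix alone already confines the point to about 21.1 to 21.3° N and 86.8 to 87.2° W, on the northeastern tip of the Yucatán Peninsula.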

I believe that these results are wrong, likely due to improper calibration or faulty instruments. A quick search shows that "according to the World Meteorological Organization (WMO), the highest registered air temperature on Earth was 56.7 °C (134.1 °F) in Furnace Creek Ranch, California, located in the Death Valley..." (https://en.wikipedia.org/wiki/Highest_temperature_recorded_on_Earth). That is roughly 1.5 °F lower than the highest value in the data set.
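
The size of the gap is easy to verify. The sketch below converts the top value from the table to °C and compares it with the WMO record from the quote; the function names are hypothetical and the two input numbers are copied from above.

```python
def celsius_to_fahrenheit(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

def kelvin_to_celsius(kelvin):
    """Convert kelvin to degrees Celsius."""
    return kelvin - 273.15

record_c = 56.7          # WMO record from the quote above
hottest_k = 330.674316   # top temperature_surface value in the table

hottest_c = kelvin_to_celsius(hottest_k)
gap_c = hottest_c - record_c
gap_f = celsius_to_fahrenheit(hottest_c) - celsius_to_fahrenheit(record_c)
print(f"{hottest_c:.2f} °C vs {record_c} °C: gap of {gap_c:.2f} °C ({gap_f:.2f} °F)")
```

The gap comes out a little under 1.5 °F, or about 0.8 °C.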