# Weather data

### Read HDF5 files, convert to pandas format, concatenate data for 2018-2020

This file contains the code to

1) Read in the weather data in HDF5 format, with each year stored in a separate file

2) Convert the data to a Python dictionary of pandas DataFrames, one per weather parameter, covering the available time span

        The dictionary is structured as follows:

        {

        'WEATHER_APPARENT_TEMPERATURE_TOTAL': apparent temperature data,

        'WEATHER_ATMOSPHERIC_PRESSURE_TOTAL': atmospheric pressure data,

        'WEATHER_PRECIPITATION_RATE_TOTAL': precipitation rate data,

        'WEATHER_PROBABILITY_OF_PRECIPITATION_TOTAL': probability of precipitation data,

        'WEATHER_RELATIVE_HUMIDITY_TOTAL': relative humidity data,

        'WEATHER_SOLAR_IRRADIANCE_GLOBAL': solar irradiance data,

        'WEATHER_TEMPERATURE_TOTAL': temperature data,

        'WEATHER_WIND_DIRECTION_TOTAL': wind direction data,

        'WEATHER_WIND_GUST_SPEED_TOTAL': wind gust speed data,

        'WEATHER_WIND_SPEED_TOTAL': wind speed data,

         }

3) Save the dictionary to a pickle file at 'Data/weather/data_weather.pkl'

4) (Additional code used to check code functionality and data quality)

-------------

#### Imports

In [28]:
import h5py
import pandas as pd
import numpy as np
import pickle 
from datetime import datetime
import math

pd.options.mode.chained_assignment = None 

#### Functions to convert data

In [29]:
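def hdf_to_pandas(hdf_dataset):
    """Convert a compound (table-like) HDF5 dataset to a pandas DataFrame."""
    # The field names of the compound dtype become the column names
    column_list = list(hdf_dataset.dtype.fields.keys())
    # Read the dataset row by row and convert each record to plain Python values
    list_of_rows = []
    for line in range(0, hdf_dataset.size):
        list_of_rows.append(np.asarray(hdf_dataset[line]).tolist())
    return pd.DataFrame(data=list_of_rows, columns=column_list)

def first_n_digits(num, n):
    """Return the first n decimal digits of the positive integer num."""
    # int(math.log(num, 10)) + 1 is the digit count; drop all digits beyond the first n
    return num // 10 ** (int(math.log(num, 10)) - n + 1)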
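
`first_n_digits` derives the digit count from `math.log`, which works in floating point; for exact powers of ten the logarithm can come out slightly too small, so the digit count is off by one. The timestamps shown later in this notebook do not appear to be affected, but a minimal integer-only sketch avoids the issue altogether. The name `first_n_digits_exact` is hypothetical and not part of the original notebook, and the 19-digit example value is assumed to be a nanosecond timestamp.

```python
def first_n_digits_exact(num, n):
    """Return the first n decimal digits of a positive integer.

    Hypothetical variant of first_n_digits that counts digits via str(),
    so no floating-point logarithm is involved.
    """
    num = int(num)  # also accepts NumPy integer scalars
    n_digits = len(str(num))
    if n >= n_digits:
        return num  # nothing to cut off
    return num // 10 ** (n_digits - n)


# A 19-digit value (assumed nanosecond timestamp) reduced to 10 digits
print(first_n_digits_exact(1546297200000000000, 10))  # 1546297200
# An exact power of ten is handled without rounding issues
print(first_n_digits_exact(1000, 2))  # 10
```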

-------------

#### Weather data for 2018 in one dictionary

In [30]:
# Open the 2018 file read-only; the context manager closes it after the loop
with h5py.File('Data/HDF5data/weather/2018_weather.hdf5', 'r') as file:
    dset_weather = file["WEATHER_SERVICE"]["IN"]

    weather_dict_2018 = {}
    for key in dset_weather:
        # Each parameter group stores its records in a 'table' dataset
        weather_dict_2018[key] = hdf_to_pandas(dset_weather[key]['table'])

        # Truncate the index to its first 10 digits (Unix time in seconds)
        weather_dict_2018[key]["index"] = weather_dict_2018[key]["index"].apply(lambda x: first_n_digits(x, 10))

#### Weather data for 2019 in one dictionary

In [31]:
# Open the 2019 file read-only; the context manager closes it after the loop
with h5py.File('Data/HDF5data/weather/2019_weather.hdf5', 'r') as file:
    dset_weather = file["WEATHER_SERVICE"]["IN"]

    weather_dict_2019 = {}
    for key in dset_weather:
        # Each parameter group stores its records in a 'table' dataset
        weather_dict_2019[key] = hdf_to_pandas(dset_weather[key]['table'])

        # Truncate the index to its first 10 digits (Unix time in seconds)
        weather_dict_2019[key]["index"] = weather_dict_2019[key]["index"].apply(lambda x: first_n_digits(x, 10))

#### Weather data for 2020 in one dictionary

In [32]:
# Open the 2020 file read-only; the context manager closes it after the loop
with h5py.File('Data/HDF5data/weather/2020_weather.hdf5', 'r') as file:
    dset_weather = file["WEATHER_SERVICE"]["IN"]

    weather_dict_2020 = {}
    for key in dset_weather:
        # Each parameter group stores its records in a 'table' dataset
        weather_dict_2020[key] = hdf_to_pandas(dset_weather[key]['table'])

        # Truncate the index to its first 10 digits (Unix time in seconds)
        weather_dict_2020[key]["index"] = weather_dict_2020[key]["index"].apply(lambda x: first_n_digits(x, 10))
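
The three loading cells differ only in the year, so they could be folded into one helper. The following is a minimal sketch, not the notebook's own method: the names `tables_to_frames` and `load_weather_year` are hypothetical, and the sketch assumes that every field of the `table` datasets is scalar, so that a whole dataset can be read at once and passed to `pd.DataFrame` instead of being converted row by row.

```python
import numpy as np
import pandas as pd


def tables_to_frames(group, n_digits=10):
    """Convert every group[key]['table'] into a DataFrame.

    `group` is anything dict-like, for example the h5py group
    file["WEATHER_SERVICE"]["IN"] used in this notebook.
    """
    frames = {}
    for key in group:
        # [:] reads the whole dataset as one structured NumPy array
        records = np.asarray(group[key]["table"][:])
        df = pd.DataFrame(records)
        # Keep only the first n_digits digits of the timestamp index
        df["index"] = df["index"].map(lambda x: int(str(int(x))[:n_digits]))
        frames[key] = df
    return frames


def load_weather_year(year):
    """Hypothetical loader that follows the notebook's file naming."""
    import h5py  # imported here so tables_to_frames can be tried without h5py

    path = f"Data/HDF5data/weather/{year}_weather.hdf5"
    with h5py.File(path, "r") as file:
        return tables_to_frames(file["WEATHER_SERVICE"]["IN"])
```

With the data files in place, `{year: load_weather_year(year) for year in (2018, 2019, 2020)}` would yield one dictionary of DataFrames per year.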

#### Concatenate weather data for 2018-2020 into one DataFrame per parameter

In [36]:
weather_dict = {}

for parameter in weather_dict_2018:
    weather_dict[parameter] = pd.concat([
        weather_dict_2018[parameter],
        weather_dict_2019[parameter],
        weather_dict_2020[parameter],
    ])
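
`pd.concat` keeps each input frame's own row labels by default, so after concatenating the yearly frames the labels 0, 1, 2, ... occur once per year. A minimal sketch on toy data, assuming a fresh running row index is wanted; whether to also sort or de-duplicate by timestamp is a separate decision that this sketch does not make. The toy frames and the names `default` and `combined` are illustrative only.

```python
import pandas as pd

# Toy stand-ins for two yearly frames of one parameter
df_a = pd.DataFrame({"index": [100, 400], "value": [1.0, 2.0]})
df_b = pd.DataFrame({"index": [700, 1000], "value": [3.0, 4.0]})

# Default behaviour: row labels are carried over from each input frame
default = pd.concat([df_a, df_b])
print(default.index.tolist())  # [0, 1, 0, 1]

# ignore_index=True renumbers the rows from 0 to n-1
combined = pd.concat([df_a, df_b], ignore_index=True)
print(combined.index.tolist())  # [0, 1, 2, 3]
```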

#### Save to pickle file

In [None]:
with open('Data/weather/data_weather.pkl', 'wb') as f:
    pickle.dump(weather_dict, f)

#### Read saved file

In [39]:
with open('Data/weather/data_weather.pkl', 'rb') as f:
    weather_dict = pickle.load(f)

______________________________

#### Weather data exploration

In [40]:
for parameter in weather_dict:
    print(str(parameter) + " " + str(len(weather_dict[parameter])))

WEATHER_APPARENT_TEMPERATURE_TOTAL 256047
WEATHER_ATMOSPHERIC_PRESSURE_TOTAL 256047
WEATHER_PRECIPITATION_RATE_TOTAL 256047
WEATHER_PROBABILITY_OF_PRECIPITATION_TOTAL 256047
WEATHER_RELATIVE_HUMIDITY_TOTAL 256047
WEATHER_SOLAR_IRRADIANCE_GLOBAL 255368
WEATHER_TEMPERATURE_TOTAL 256098
WEATHER_WIND_DIRECTION_TOTAL 256098
WEATHER_WIND_GUST_SPEED_TOTAL 256047
WEATHER_WIND_SPEED_TOTAL 255368


#### Exploration for 'WEATHER_TEMPERATURE_TOTAL'

In [41]:
parameter = 'WEATHER_TEMPERATURE_TOTAL'
weather_dict_2019[parameter]

| | index | TEMPERATURE:TOTAL |
|---|---|---|
| 0 | 1546297200 | 8.4 |
| 1 | 1546297800 | 8.8 |
| 2 | 1546298100 | 8.8 |
| 3 | 1546298400 | 8.8 |
| 4 | 1546298700 | 8.8 |
| ... | ... | ... |
| 104544 | 1577832000 | 1.2 |
| 104545 | 1577832300 | 1.1 |
| 104546 | 1577832600 | 1.1 |
| 104547 | 1577832900 | 1.1 |


In [42]:
weather_dict_2019[parameter]['time_difference'] = weather_dict_2019[parameter]['index'] - weather_dict_2019[parameter]['index'].shift(1)
weather_dict_2019[parameter]['time_difference'].value_counts()

time_difference
300.0      104310
600.0         146
0.0            37
900.0          32
1200.0         12
2100.0          3
3000.0          2
1500.0          2
24900.0         1
70120.0         1
80.0            1
1800.0          1
Name: count, dtype: int64
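
The counts above show that most consecutive records are 300 seconds apart, with some repeated timestamps (difference 0) and some longer gaps. A minimal sketch of how such rows could be located, using toy data rather than the real frames; the 300-second step is read off the output above, and the column name `timestamp`, the constant `EXPECTED_STEP`, and the toy values are assumptions for illustration.

```python
import pandas as pd

EXPECTED_STEP = 300  # seconds, the most frequent difference in the output above

# Toy frame with one repeated timestamp and one gap
df = pd.DataFrame({"index": [0, 300, 300, 600, 1500]})

# Difference to the previous row; the first row has no predecessor (NaN)
df["time_difference"] = df["index"].diff()

# Human-readable timestamps from Unix time in seconds
df["timestamp"] = pd.to_datetime(df["index"], unit="s")

# Rows that repeat the previous timestamp, and rows that follow a gap
duplicates = df[df["time_difference"] == 0]
gaps = df[df["time_difference"] > EXPECTED_STEP]

print(duplicates["index"].tolist())  # [300]
print(gaps["index"].tolist())  # [1500]
```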