# Introduction

I have downloaded the weather data for the Theewaterskloof dam weather station, and I need to parse it and combine it with the Theewaterskloof dam-level data. I have limited myself to this one dam because repeating the exercise for every dam is only worthwhile if the results are interesting enough. It's a good idea to get a 'minimum viable product' going first to see whether more effort is justified. As it turns out, when I analyse the results, I find that the relationship between the observed weather and the dam-level data is quite tenuous.

We need to do two things:
+ 1) Read in the weather data, which currently sits in multiple JSON files (one for each day)
+ 2) Combine the weather data with the clean dam-level data (which I clean [here](http://localhost:8888/notebooks/dam-levels/clean-dam-level-data.ipynb))

In [34]:
import pandas as pd
import numpy as np
import datetime

# Read and clean weather data

Read the weather data from the JSON files and parse it into a DataFrame. The data is very detailed and is reported on an hourly basis, but for our purposes I'm going to concentrate on the daily summary. When we read a JSON file directly into a DataFrame, the relevant data sits in the 'history' column, in the 'dailysummary' row, as a list with a single entry: a dictionary. We need to extract the keys and values from this dictionary.
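To make that nesting concrete, here is a minimal sketch that walks the same path (`history` → `dailysummary` → first list entry) with the standard `json` module. The payload is a hypothetical, heavily trimmed stand-in for one of the daily files, and `extract_daily_summary` is a name I made up for illustration.

```python
import json

# Hypothetical, trimmed payload that mimics the layout described above.
sample = json.loads('''
{
  "response": {"version": "0.1"},
  "history": {
    "dailysummary": [
      {"date": {"year": "2012", "mon": "01", "mday": "01"},
       "maxtempm": "23",
       "precipm": "0.0"}
    ]
  }
}
''')

def extract_daily_summary(payload):
    """Return the non-dictionary fields of the single daily-summary entry."""
    summary = payload["history"]["dailysummary"][0]
    # Nested dictionaries (here: the date block) are skipped; plain values are kept.
    return {key: value for key, value in summary.items() if not isinstance(value, dict)}

daily = extract_daily_summary(sample)
print(daily)
```

Note that the values arrive as strings, which is why a numeric conversion step is needed later on.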

In [35]:
pd.read_json('data/weather/Theewaterskloof_20120101.json')

,history,response
dailysummary,"[{'monthtodatecoolingdegreedays': '', 'since1j...",
date,"{'min': '00', 'mon': '01', 'tzname': 'Africa/J...",
features,,{'history': 1}
observations,"[{'wspdi': '10.4', 'tempm': '18.0', 'precipm':...",
termsofService,,http://www.wunderground.com/weather/api/d/term...
utcdate,"{'min': '00', 'mon': '12', 'tzname': 'UTC', 'm...",
version,,0.1


In [36]:
weather_data = pd.DataFrame()
index_error_dates = []  # Dates whose 'dailysummary' list is empty, so there is no entry to read
date = datetime.date(year = 2012, month = 1, day = 1)
# Loop over every day up to and including 22 December 2017
while date < datetime.date(year = 2017, month = 12, day = 23):
    nextday = pd.Series(dtype = object)  # explicit dtype avoids the empty-Series warning
    nextday.name = date
    nextday_df = pd.read_json('data/weather/Theewaterskloof_{0}.json'.format(date.strftime('%Y%m%d')))
    try:
        # Indexing [0] raises IndexError when the 'dailysummary' list is empty
        for key, value in nextday_df['history']['dailysummary'][0].items():
            if not isinstance(value, dict):  # skip nested dictionaries
                nextday[key] = value
        # Turn the Series into a one-row DataFrame indexed by the date and append it
        nextday = nextday.to_frame().transpose()
        weather_data = pd.concat([weather_data, nextday], axis = 0)
    except IndexError:
        index_error_dates.append(date.strftime('%Y-%m-%d'))
    date = date + datetime.timedelta(days = 1)
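Calling `pd.concat` once per day copies the growing DataFrame on every pass. A possible alternative, sketched below, is to collect one dictionary per day and build the DataFrame in a single step. The `payloads` dictionary is a hypothetical in-memory stand-in for the per-day JSON files, so this is an illustration of the pattern rather than a drop-in replacement for the loop above.

```python
import datetime
import pandas as pd

# Hypothetical stand-ins for three daily files; the middle one has no summary.
payloads = {
    datetime.date(2012, 1, 1): {"history": {"dailysummary": [{"maxtempm": "23", "date": {"year": "2012"}}]}},
    datetime.date(2012, 1, 2): {"history": {"dailysummary": []}},
    datetime.date(2012, 1, 3): {"history": {"dailysummary": [{"maxtempm": "26", "date": {"year": "2012"}}]}},
}

rows = {}           # date -> dictionary of plain (non-dictionary) summary values
missing_dates = []  # dates where the summary list is empty

for day, payload in payloads.items():
    try:
        summary = payload["history"]["dailysummary"][0]
    except IndexError:
        missing_dates.append(day.strftime("%Y-%m-%d"))
        continue
    rows[day] = {k: v for k, v in summary.items() if not isinstance(v, dict)}

# Build the DataFrame once: one row per date, one column per summary field.
weather_sketch = pd.DataFrame.from_dict(rows, orient="index")
print(weather_sketch)
print(missing_dates)
```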

Only one date is missing its daily summary:

In [37]:
index_error_dates

['2015-04-11']
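A second way to confirm gaps like this one is to compare the dates that actually made it into the index against a complete daily range. The sketch below uses a small made-up index rather than the real `weather_data`, so the dates are illustrative only.

```python
import pandas as pd

# Illustrative index with one day deliberately left out.
index = pd.to_datetime(["2015-04-09", "2015-04-10", "2015-04-12"])

# Every calendar day between the first and last observed dates.
expected = pd.date_range(index.min(), index.max(), freq="D")

# Days that should be present but are not.
gaps = expected.difference(index)
print(list(gaps.strftime("%Y-%m-%d")))
```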

Convert the index to dates, convert numeric values to numeric data types, and drop any columns that are entirely null.

In [38]:
weather_data.index = pd.to_datetime(weather_data.index)
# Convert each column to a numeric dtype where possible; text columns are left unchanged
weather_data = weather_data.apply(pd.to_numeric, errors = 'ignore')
# Drop columns that contain no data at all
weather_data = weather_data.dropna(axis = 1, how = 'all')
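As far as I know, newer pandas releases (2.2 onwards) deprecate `errors = 'ignore'` in `pd.to_numeric`. If that warning shows up, one possible workaround is to attempt the conversion column by column and fall back to the original column when it fails. The helper name `to_numeric_where_possible` and the sample columns are my own, chosen only for illustration.

```python
import pandas as pd

def to_numeric_where_possible(column):
    """Return the column as numeric if every value parses, else return it unchanged."""
    try:
        return pd.to_numeric(column)
    except (ValueError, TypeError):
        return column

# Illustrative frame: one numeric-looking column, one text column, one empty column.
sample = pd.DataFrame({
    "maxtempm": ["23", "24"],
    "meanwdire": ["WNW", "NW"],
    "emptycolumn": [None, None],
})

converted = sample.apply(to_numeric_where_possible)
# Drop columns in which every value is null.
converted = converted.dropna(axis=1, how="all")
print(converted.dtypes)
```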

# Read dam-level data & combine with the weather data

In [39]:
dam_levels = pd.read_csv('data/Dam-levels-clean-20120101-20171206.csv', encoding = 'latin1')
# Keep only the Theewaterskloof rows and index them by date
tw_dam_levels = dam_levels.loc[dam_levels['dam_name'] == 'Theewaterskloof']
tw_dam_levels = tw_dam_levels.set_index('date')
tw_dam_levels.index = pd.to_datetime(tw_dam_levels.index)
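The date handling can also be pushed into `read_csv` itself through its `index_col` and `parse_dates` parameters. The sketch below reads from an in-memory string instead of the real file; the rows and the second dam name are made up for the example.

```python
import io
import pandas as pd

# Hypothetical three-row stand-in for the clean dam-level CSV.
csv_text = io.StringIO(
    "date,dam_name,height_m\n"
    "2012-01-01,Theewaterskloof,24.83\n"
    "2012-01-01,Some other dam,10.00\n"
    "2012-01-02,Theewaterskloof,24.80\n"
)

# Parse the 'date' column as dates and use it as the index in one step.
levels = pd.read_csv(csv_text, index_col="date", parse_dates=["date"])

# Keep only the rows for the dam of interest.
tw_sketch = levels.loc[levels["dam_name"] == "Theewaterskloof"]
print(tw_sketch)
```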

In [40]:
# Inner join on the date index: only dates present in both frames are kept
data = pd.merge(tw_dam_levels, weather_data, left_index = True, right_index = True)
data.index.name = 'date'
data.head()

date,dam_name,height_m,storage_ml,current_%,last year_%,heatingdegreedays,precipi,maxtempi,meanwdird,gdegreedays,...,maxhumidity,meanvisi,meanwindspdm,maxdewpti,maxtempm,coolingdegreedays,minhumidity,maxdewptm,minpressurei,thunder
2012-01-01,Theewaterskloof,24.83,357963.0,74.5,,0,0.0,74,290,15,...,88,8.4,15,63,23,0,44,17,29.95,0
2012-01-02,Theewaterskloof,24.8,356677.0,74.3,,0,0.01,75,325,18,...,100,6.3,19,65,24,4,61,18,29.89,0
2012-01-03,Theewaterskloof,24.77,355394.0,74.0,,0,0.0,79,321,20,...,94,7.8,15,64,26,4,31,18,29.95,0
2012-01-04,Theewaterskloof,24.73,353687.0,73.7,,0,0.0,78,248,15,...,88,8.8,9,61,25,0,30,16,29.95,0
2012-01-05,Theewaterskloof,24.67,351135.0,73.1,,0,0.0,78,180,21,...,83,9.2,22,61,25,6,39,16,30.01,0
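`pd.merge` performs an inner join by default, so any date that appears in only one of the two frames silently disappears from the result. A quick way to see what was lost is to repeat the merge with `how='outer'` and `indicator=True`. The two small frames below are illustrative, not the real data.

```python
import pandas as pd

# Illustrative frames: the weather frame has no row for 2012-01-02.
dam = pd.DataFrame(
    {"height_m": [24.83, 24.80, 24.77]},
    index=pd.to_datetime(["2012-01-01", "2012-01-02", "2012-01-03"]),
)
weather = pd.DataFrame(
    {"maxtempm": [23, 26]},
    index=pd.to_datetime(["2012-01-01", "2012-01-03"]),
)

# Default inner join: only dates found in both frames survive.
inner = pd.merge(dam, weather, left_index=True, right_index=True)

# Outer join with an indicator column that records where each row came from.
audit = pd.merge(dam, weather, left_index=True, right_index=True,
                 how="outer", indicator=True)
unmatched = audit[audit["_merge"] != "both"]
print(unmatched)
```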


The following columns aren't needed, so I drop them:
+ The dam variables that give current and last year's storage as a percentage
+ The weather variables with imperial measurements (we already have the metric analogues)

In [41]:
data = data.drop(['current_%', 'last year_%', 'precipi', 'meanwindspdi', 'maxwspdi', 'minwspdi',
                 'maxvisi', 'minvisi', 'meanvisi', 'maxpressurei', 'minpressurei', 'meanpressurei',
                 'maxtempi', 'mintempi', 'meantempi', 'maxdewpti', 'mindewpti', 'meandewpti'], axis = 1)
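The imperial columns dropped above all end in `i`, and their metric twins end in `m`. Assuming that naming convention holds across the whole dataset (I have only checked it against the names listed here), the list could be derived instead of typed out. The sample frame is illustrative.

```python
import pandas as pd

# Illustrative frame with two imperial/metric pairs and one unrelated column.
sample = pd.DataFrame({
    "maxtempi": [74], "maxtempm": [23],
    "precipi": [0.0], "precipm": [0.0],
    "humidity": [67],
})

# A column counts as imperial only if it ends in 'i' AND a matching '...m' column exists.
imperial = [
    col for col in sample.columns
    if col.endswith("i") and col[:-1] + "m" in sample.columns
]

metric_only = sample.drop(columns=imperial)
print(imperial)
print(list(metric_only.columns))
```

Requiring the metric twin to exist guards against dropping a column that merely happens to end in `i`.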

# Write output to disk

All that's left is to save this as a CSV.

In [42]:
data.head()

date,dam_name,height_m,storage_ml,heatingdegreedays,meanwdird,gdegreedays,meanpressurem,minvism,minwspdm,meanvism,...,meanwdire,maxpressurem,humidity,maxhumidity,meanwindspdm,maxtempm,coolingdegreedays,minhumidity,maxdewptm,thunder
2012-01-01,Theewaterskloof,24.83,357963.0,0,290,15,1014.88,10.0,4,13.6,...,WNW,1016.0,67,88,15,23,0,44,17,0
2012-01-02,Theewaterskloof,24.8,356677.0,0,325,18,1013.54,2.0,11,10.3,...,NW,1015.0,88,100,19,24,4,61,18,0
2012-01-03,Theewaterskloof,24.77,355394.0,0,321,20,1015.06,10.0,6,12.5,...,NW,1016.0,65,94,15,26,4,31,18,0
2012-01-04,Theewaterskloof,24.73,353687.0,0,248,15,1015.91,10.0,0,14.1,...,WSW,1019.0,62,88,9,25,0,30,16,0
2012-01-05,Theewaterskloof,24.67,351135.0,0,180,21,1017.22,10.0,15,14.7,...,South,1018.0,65,83,22,25,6,39,16,0


In [43]:
data.to_csv('data/Theewaterskloof-weather-and-dam-levels-clean-20120101-20171206.csv')