# Merging SOEP with weather data

This notebook merges the weather data downloaded from `meteostat` with the SOEP household panel data. The merge relies on the timestamp and geographic location of each household, so the quality of the merged dataset depends heavily on the geographic granularity of the SOEP data. For now the merge is performed at the **NUTS 1 level** (Bundesländer, i.e. federal states), because more granular location data can only be accessed at the secure data center in Berlin.

The merging of the data follows the specifications of the paper *Hue et al.*

In [1]:
import pandas as pd
import numpy as np
import random
from weather.src.helper import read_nuts_weather_data

## Reading the weather data

The weather data comes from the website `meteostat.com`. It is downloaded and prepared in the subfolder `./weather`. Helper functions are located in `./weather/src/helper.py`; they convert data from individual stations into a SOEP-compatible format. For example, `read_nuts_weather_data` takes all stations within the same NUTS region and averages their data.
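
As an illustration of the averaging step, the following minimal sketch shows how station-level readings could be collapsed into one value per NUTS region and day. The column names `station`, `NUTS_CODE`, `time` and `tavg` and all sample values are assumptions made for illustration; the actual implementation of `read_nuts_weather_data` may differ.

```python
import pandas as pd

# Hypothetical station-level data: two stations in DE1, one in DE2.
stations = pd.DataFrame({
    "station": ["A", "B", "C", "A", "B", "C"],
    "NUTS_CODE": ["DE1", "DE1", "DE2", "DE1", "DE1", "DE2"],
    "time": pd.to_datetime(["1985-01-01"] * 3 + ["1985-01-02"] * 3),
    "tavg": [-1.0, -2.0, 0.5, -3.0, -5.0, -1.5],
})

# Average all stations that share a NUTS region and a day.
regional = stations.groupby(["NUTS_CODE", "time"], as_index=False)["tavg"].mean()
print(regional)
```

Regions with many stations and regions with a single station both end up with exactly one row per day, which is the shape needed for the merge with SOEP.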

Note that the SOEP data uses a different geographic naming convention than Meteostat.
The following chart shows the conversion for NUTS level 1:

| Meteostat | SOEP | Name |
|-----|----|------------------------|
| DEF | 1  | Schleswig-Holstein     |
| DE6 | 2  | Hamburg                |
| DE9 | 3  | Niedersachsen          |
| DE5 | 4  | Bremen                 |
| DEA | 5  | Nordrhein-Westfalen    |
| DE7 | 6  | Hessen                 |
| DEB | 7  | Rheinland-Pfalz        |
| DE1 | 8  | Baden-Wuerttemberg     |
| DE2 | 9  | Bayern                 |
| DEC | 10 | Saarland               |
| DE3 | 11 | Berlin                 |
| DE4 | 12 | Brandenburg            |
| DE8 | 13 | Mecklenburg-Vorpommern |
| DED | 14 | Sachsen                |
| DEE | 15 | Sachsen-Anhalt         |
| DEG | 16 | Thueringen             |
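
One way to guard against unmapped codes is to apply the chart with `Series.map`, which returns `NaN` for any code missing from the dictionary. The sketch below uses a made-up series of codes, including the deliberately invalid code `DEX`; it is a sanity check under these assumptions, not part of the original pipeline.

```python
import pandas as pd

# Conversion chart from the table above (Meteostat NUTS 1 code -> SOEP code).
chart = {
    "DEF": 1, "DE6": 2, "DE9": 3, "DE5": 4, "DEA": 5, "DE7": 6,
    "DEB": 7, "DE1": 8, "DE2": 9, "DEC": 10, "DE3": 11, "DE4": 12,
    "DE8": 13, "DED": 14, "DEE": 15, "DEG": 16,
}

# Hypothetical sample of codes; "DEX" is deliberately invalid.
codes = pd.Series(["DE1", "DEF", "DEX"])

# map() yields NaN for codes that are not in the chart.
mapped = codes.map(chart)
unmapped = codes[mapped.isna()].tolist()
print(unmapped)
```

Unlike `replace`, which leaves unknown values untouched, `map` makes unknown codes visible as missing values, so they can be reported before the merge.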

In [2]:
# reading the weather data for NUTS 1 area codes
weather = read_nuts_weather_data('./weather/prod/weatherdata/nuts1', bar=False)
weather['time'] = pd.to_datetime(weather['time'])

# rename variables
chart = {
    "DE6" : 2, "DEF" : 1, "DE9" : 3, "DE5" : 4, "DEA" : 5, "DE7" : 6, "DEB" : 7, "DE1" : 8, "DE2" : 9, 
    "DEC" : 10, "DE3" : 11, "DE4" : 12, "DE8" : 13, "DED" : 14, "DEE" : 15, "DEG" : 16
}
# map the Meteostat NUTS codes to the SOEP codes
weather["NUTS_CODE"] = weather["NUTS_CODE"].replace(chart)
# rename the column so that it matches the SOEP variable
weather.rename(columns={'NUTS_CODE':'bula_h'}, inplace=True)

# drop columns that are not needed
weather.drop(["wdir", "wpgt", "elevation", "tmin"], axis=1, inplace=True)
weather.head()

,time,tavg,tmax,prcp,snow,wspd,pres,tsun,bula_h
0,1985-01-01,-1.455556,-0.368889,7.133333,40.0,19.326667,1011.508333,5.777778,8
1,1985-01-02,-3.862222,-0.4,5.442222,113.333333,14.513333,1006.208333,0.222222,8
2,1985-01-03,-6.426667,-4.477778,4.691111,170.666667,11.366667,1011.95,44.222222,8
3,1985-01-04,-11.495556,-5.075556,1.668889,229.555556,12.793333,1008.608333,202.888889,8
4,1985-01-05,-13.1,-10.142222,0.72,239.555556,5.526667,1012.908333,102.666667,8


## Computing climate variables

This step follows the data preparation in *Hue et al.* and consists of the following parts:
1. Bin the daily temperatures into 5-degree intervals. The first and last bins are open-ended: the first covers everything up to 0 degrees, the last everything above 30 degrees, which gives eight bins in total.
2. Count how many days fall within each bin per month. The code below approximates a month with a rolling 30-day window per federal state.
3. For the other weather variables (e.g. pressure or hours of sun), take the average over the same window.
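
To make the bin edges concrete, here is a small sketch that applies the same edges to a handful of made-up temperatures and counts the days per bin. The sample values are hypothetical; only the edges are taken from the notebook.

```python
import pandas as pd

# Hypothetical daily mean temperatures in degrees Celsius.
temps = pd.Series([-7.2, 0.0, 3.1, 5.0, 17.4, 30.0, 33.8])

# Same edges as in the notebook: open-ended first and last bin.
bins = [-float("inf")] + list(range(0, 31, 5)) + [float("inf")]
labels = list(range(len(bins) - 1))

# pd.cut uses right-closed intervals by default, e.g. (0, 5].
binned = pd.cut(temps, bins, labels=labels)
print(binned.tolist())

# Count how many days fall into each bin (empty bins are kept with count 0).
counts = binned.value_counts().sort_index()
print(counts.tolist())
```

Because the intervals are closed on the right, a day with exactly 0.0 degrees lands in the first bin and a day with exactly 30.0 degrees lands in the second-to-last bin.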

In [3]:
# binning the data into 5 degree intervals (first and last bin are open-ended)
bins = [-float('inf')] + list(range(0, 31, 5)) + [float('inf')]
labels = list(range(len(bins) - 1))
weather['tavg'] = pd.cut(weather['tavg'], bins, labels=labels)
weather['tmax'] = pd.cut(weather['tmax'], bins, labels=labels)
weather.head()

,time,tavg,tmax,prcp,snow,wspd,pres,tsun,bula_h
0,1985-01-01,0,0,7.133333,40.0,19.326667,1011.508333,5.777778,8
1,1985-01-02,0,0,5.442222,113.333333,14.513333,1006.208333,0.222222,8
2,1985-01-03,0,0,4.691111,170.666667,11.366667,1011.95,44.222222,8
3,1985-01-04,0,0,1.668889,229.555556,12.793333,1008.608333,202.888889,8
4,1985-01-05,0,0,0.72,239.555556,5.526667,1012.908333,102.666667,8


In [4]:
# one-hot encode the temperature bins so that the days per bin can be counted later
dummies1 = pd.get_dummies(weather[['tavg']], dtype=int)  # one 0/1 column per tavg interval
dummies2 = pd.get_dummies(weather[['tmax']], dtype=int)  # one 0/1 column per tmax interval
weather = pd.concat([weather, dummies1, dummies2], axis=1)
weather.set_index('time', inplace=True)
weather.head()

time,tavg,tmax,prcp,snow,wspd,pres,tsun,bula_h,tavg_0,tavg_1,...,tavg_6,tavg_7,tmax_0,tmax_1,tmax_2,tmax_3,tmax_4,tmax_5,tmax_6,tmax_7
1985-01-01,0,0,7.133333,40.0,19.326667,1011.508333,5.777778,8,1,0,...,0,0,1,0,0,0,0,0,0,0
1985-01-02,0,0,5.442222,113.333333,14.513333,1006.208333,0.222222,8,1,0,...,0,0,1,0,0,0,0,0,0,0
1985-01-03,0,0,4.691111,170.666667,11.366667,1011.95,44.222222,8,1,0,...,0,0,1,0,0,0,0,0,0,0
1985-01-04,0,0,1.668889,229.555556,12.793333,1008.608333,202.888889,8,1,0,...,0,0,1,0,0,0,0,0,0,0
1985-01-05,0,0,0.72,239.555556,5.526667,1012.908333,102.666667,8,1,0,...,0,0,1,0,0,0,0,0,0,0


In [5]:
# count the days per temperature bin and average the other climate variables

weather: pd.DataFrame

# define which aggregation function is used for which column
# (only the tavg bins are aggregated; the tmax dummies are not carried over)
aggs = {f"tavg_{x}": "sum" for x in range(8)}
aggs.update({col: "mean" for col in ["prcp", "tsun", "wspd", "pres", "snow"]})

# aggregate over a rolling 30-day window per federal state,
# requiring at least 15 daily observations in the window
weather = weather.groupby('bula_h').rolling('30D', min_periods=15).agg(aggs).dropna()
weather.head()

bula_h,time,tavg_0,tavg_1,tavg_2,tavg_3,tavg_4,tavg_5,tavg_6,tavg_7,prcp,tsun,wspd,pres,snow
1,1985-01-15,15.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.98967,117.92,18.895238,1018.550667,97.871429
1,1985-01-16,16.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.927816,118.6875,19.285714,1019.13625,101.084821
1,1985-01-17,17.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.873239,122.188235,19.461345,1019.054118,103.62605
1,1985-01-18,18.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.825122,117.666667,19.392857,1018.435556,105.527778
1,1985-01-19,19.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.781695,111.6,18.686466,1017.910526,106.890977
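
The behaviour of the time-based rolling window can be seen on a toy example. The sketch below uses a shorter 3-day window with `min_periods=2` and made-up indicator values for two regions, so that the effect of the window and of the minimum number of observations is easy to check by hand.

```python
import pandas as pd

# Hypothetical daily indicator (1 = day falls into the bin) for two regions.
idx = pd.date_range("1985-01-01", periods=4, freq="D")
toy = pd.DataFrame({
    "bula_h": [1] * 4 + [2] * 4,
    "time": list(idx) * 2,
    "tavg_0": [1, 1, 0, 1, 0, 0, 1, 1],
}).set_index("time")

# Per region: sum over the trailing 3 days, requiring at least 2 observations.
rolled = (
    toy.groupby("bula_h")["tavg_0"]
    .rolling("3D", min_periods=2)
    .sum()
    .dropna()
)
print(rolled)
```

The first day of each region is dropped because its window holds only one observation. This is the same reason the aggregated output in the notebook starts on 1985-01-15 rather than 1985-01-01.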


## Reading the SOEP data

The final step is to merge the SOEP data prepared in `soeplong.ipynb` with the climate data. The merge keys are the NUTS 1 region (`bula_h`) and the timestamp. As mentioned above, the NUTS 1 variable is included in the SOEP dataset for each household.
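
The following sketch illustrates how a join on the two-level index (`bula_h`, `time`) behaves. Both data frames and the variable name `life_satisfaction` are made up for illustration and do not reflect the actual SOEP variables.

```python
import pandas as pd

# Hypothetical SOEP-style rows: one interview per household.
soep_toy = pd.DataFrame({
    "bula_h": [1, 8],
    "time": pd.to_datetime(["1985-01-15", "1985-01-16"]),
    "life_satisfaction": [7, 6],  # made-up variable name
}).set_index(["bula_h", "time"])

# Hypothetical climate rows with the same two-level index.
climate_toy = pd.DataFrame({
    "bula_h": [1, 1],
    "time": pd.to_datetime(["1985-01-15", "1985-01-16"]),
    "prcp": [0.99, 0.93],
}).set_index(["bula_h", "time"])

# Left join on (bula_h, time); rows without a climate match get NaN.
merged = soep_toy.join(climate_toy)
print(merged)

# dropna() then removes every interview without a complete set of values.
print(merged.dropna())
```

Since `join` is a left join by default, interviews without matching weather data survive the join and are only removed by the subsequent `dropna()`, which also removes rows with missing values in any SOEP column.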

In [6]:
# read soep data
soep = pd.read_csv('./prod/soeplong.csv')

# merge with weather df
soep['time'] = pd.to_datetime(soep['time'])
soep.set_index(['bula_h', 'time'], inplace=True)
# join and drop nan values
soep = soep.join(weather).dropna()

# free up some memory
del weather 

# save dataset
soep.to_csv('./prod/data.csv')