# Import temperature data from the DWD and process it

This notebook pulls historical temperature data from the DWD server and formats it for future use in other projects. The data is reported hourly for each of the available weather stations and packaged in a zip file. To use the data, we need to store everything in a single .csv file, all stations side-by-side. Also, we need the daily average.

To reduce computing time, we also exclude data earlier than 2007.

The notebooks should be run in the following order:
* 1-dwd_konverter_download
* 2-dwd_konverter_extract
* 3-dwd_konverter_build_df
* 4-dwd_konverter_final_processing

## 4.) Final data processing
We load the data that was saved in the previous step, so we don't need to repeat the calculations if we pause and come back later.
### Data Cleaning
The data contains some errors that need to be cleaned. In the output of main_df.describe() from the previous notebook, the minimum temperature at some stations is -999. This value is a placeholder, so there is no plausible measurement for that particular hour. Replacing it with np.nan lets us safely calculate the average values, because pandas skips NaN when computing a mean.
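
To illustrate why the placeholder has to go before averaging, here is a minimal sketch. The three readings are made up for the example and are not taken from the DWD data.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly readings for one station; -999 marks a missing value
hourly = pd.Series([2.0, -999, 4.0])

# Averaging with the placeholder still in place distorts the result
naive_mean = hourly.mean()

# After replacing the placeholder with NaN, mean() skips the gap by default
clean_mean = hourly.replace(-999, np.nan).mean()

print(naive_mean, clean_mean)
```

The first value is far below any plausible temperature, while the second is the average of the two valid readings only.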
### Change the frequency
Finally, we resample the hourly data to daily means.
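
The resampling step can be sketched on a small made-up frame. The column name `station_a` and the values are assumptions for illustration; only the `resample('D').mean().round(decimals=2)` chain mirrors the cell below.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 48 hourly readings covering two full days
idx = pd.date_range("2012-01-01", periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame({"station_a": np.arange(48, dtype=float)}, index=idx)

# Simulate one missing hour (for example a former -999 reading)
hourly.iloc[5, 0] = np.nan

# One row per calendar day; NaN hours are ignored when averaging
daily = hourly.resample("D").mean().round(decimals=2)

print(daily)
```

A day with a few missing hours still gets a mean from the remaining readings. A day with no valid readings at all comes out as NaN, which is consistent with the empty cells in the output tables below.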

In [1]:
import numpy as np
import pandas as pd
from pathlib import Path

# Import and export paths
pkl_file = Path.cwd() / "export_uncleaned" / "to_clean.pkl"
cleaned_file = Path.cwd() / "export_cleaned" / "cleaned.csv"

# Read in the pickle file written by the previous notebook
cleaning_df = pd.read_pickle(pkl_file)


# Replace all -999 values, which indicate missing data, with NaN
cleaning_df.replace(to_replace=-999, value=np.nan, inplace=True)

# Resample to daily frequency
cleaning_df = cleaning_df.resample('D').mean().round(decimals=2)

# Save as .csv
cleaning_df.to_csv(cleaned_file, sep=";", decimal=",")

display(cleaning_df.loc['2011-12-31':'2012-01-04'])
display(cleaning_df.describe())
display(cleaning_df)

,TT_TU,TT_TU,TT_TU,TT_TU,TT_TU,TT_TU,TT_TU,TT_TU,TT_TU
STATIONS_ID,3,44,71,73,78,91,96,102,125
MESS_DATUM,,,,,,,,,
2011-12-31,,3.88,2.76,1.19,4.3,2.43,,3.8,
2012-01-01,,10.9,8.14,4.03,10.96,10.27,,9.01,
2012-01-02,,7.41,6.18,4.77,7.57,7.77,,6.48,4.66
2012-01-03,,6.14,3.61,4.46,6.38,5.28,,5.63,3.51
2012-01-04,,5.8,2.48,4.45,5.46,4.57,,5.85,1.94


,TT_TU,TT_TU,TT_TU,TT_TU,TT_TU,TT_TU,TT_TU,TT_TU,TT_TU
STATIONS_ID,3,44,71,73,78,91,96,102,125
count,1551.0,4995.0,3683.0,5018.0,5114.0,5112.0,633.0,4856.0,4301.0
mean,10.103939,10.147512,8.411244,9.705891,9.944949,9.265892,11.777093,10.283785,8.492823
std,6.74246,6.588813,7.511708,7.795431,6.599975,7.073431,6.704972,6.021094,7.639642
min,-10.87,-10.71,-14.94,-14.32,-12.39,-15.71,-1.16,-8.17,-16.42
25%,5.41,5.375,2.62,3.4125,5.18,3.92,5.92,5.8675,2.45
50%,10.14,10.35,8.57,9.95,9.955,9.315,11.72,10.21,8.59
75%,15.35,15.33,14.07,16.0475,15.12,14.84,16.88,15.2925,14.52
max,28.41,28.45,27.19,27.03,29.89,27.55,27.8,27.33,28.03


,TT_TU,TT_TU,TT_TU,TT_TU,TT_TU,TT_TU,TT_TU,TT_TU,TT_TU
STATIONS_ID,3,44,71,73,78,91,96,102,125
MESS_DATUM,,,,,,,,,
2007-01-01,7.38,,,,7.42,6.55,,8.32,
2007-01-02,4.67,,,,4.49,2.88,,6.73,0.51
2007-01-03,6.19,,,,4.87,4.25,,7.12,0.91
2007-01-04,7.69,,,,7.82,5.85,,8.34,4.43
2007-01-05,7.78,,,,7.47,6.03,,8.20,3.92
...,...,...,...,...,...,...,...,...,...
2020-12-27,,3.89,,-2.72,4.16,0.78,1.54,4.47,-5.86
2020-12-28,,3.10,,0.79,3.02,2.13,2.09,3.78,-1.40
2020-12-29,,2.93,,0.04,2.78,1.59,1.48,2.95,-1.63
2020-12-30,,3.61,,-0.45,3.16,1.92,1.71,4.38,-1.20
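
Because the file is written with `;` as the separator and `,` as the decimal mark, a project that reuses it has to pass the same settings when reading. The following is a rough sketch of such a round trip, assuming the columns form a two-level header (`TT_TU` over `STATIONS_ID`) as the output above suggests. The tiny frame and the in-memory buffer are stand-ins for the real `cleaned.csv`.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical miniature of the cleaned frame: two stations, two days
columns = pd.MultiIndex.from_product([["TT_TU"], [3, 44]], names=[None, "STATIONS_ID"])
index = pd.DatetimeIndex(["2012-01-02", "2012-01-03"], name="MESS_DATUM")
small_df = pd.DataFrame([[1.5, 2.5], [np.nan, 3.25]], index=index, columns=columns)

# Write with the same separator and decimal mark as the export above
buffer = io.StringIO()
small_df.to_csv(buffer, sep=";", decimal=",")
buffer.seek(0)

# Read back: two header rows, first column as a date index
restored = pd.read_csv(
    buffer,
    sep=";",
    decimal=",",
    header=[0, 1],
    index_col=0,
    parse_dates=True,
)

print(restored)
```

Without `decimal=","` the temperatures would be read as text rather than numbers, so the two settings need to travel together with the file.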
