Download data from DMI:
=========================
https://dmigw.govcloud.dk/v2/climateData/bulk/?api-key=

This notebook opens and transforms bulk climateData from DMI (Danish Meteorological Institute) into a pandas DataFrame and then saves it as a CSV file for further analysis.

_Please note: this notebook loads the bulk data from DMI. Applied to the full dataset, it reads >5000 files with >170 GB of data and takes ~25 minutes to run._

In [1]:
import os
import glob
import pandas as pd

import preprocessing as pp
%load_ext autoreload
%autoreload 2

In [3]:
# paths
folder_path = '/Users/johan/Documents/04 Div Uni/09 Asset Pricing Data/all-muni'  # Update with your folder path
output_folder_path = '/Users/johan/Documents/04 Div Uni/09 Asset Pricing Data/weather-muni'  # Specify your output folder path

# id columns
ids = ['municipalityId', 'municipalityName'] # for municipality data
# ids = ['cellId'] # for grid data

In [4]:
# run the function
pp.process_all_years(folder_path, output_folder_path, id_fields=ids)


Started processing 1 files for year 2010.
Processed 1/1 files for year 2010.
Saved data for year 2010 to /Users/johan/Documents/04 Div Uni/09 Asset Pricing Data/weather-muni/2010.csv.

Started processing 365 files for year 2011.
Processed 100/365 files for year 2011.
Processed 200/365 files for year 2011.
Processed 300/365 files for year 2011.
Processed 365/365 files for year 2011.
Saved data for year 2011 to /Users/johan/Documents/04 Div Uni/09 Asset Pricing Data/weather-muni/2011.csv.

Started processing 366 files for year 2012.
Processed 100/366 files for year 2012.
Processed 200/366 files for year 2012.
Processed 300/366 files for year 2012.
Processed 366/366 files for year 2012.
Saved data for year 2012 to /Users/johan/Documents/04 Div Uni/09 Asset Pricing Data/weather-muni/2012.csv.

Started processing 365 files for year 2013.
Processed 100/365 files for year 2013.
Processed 200/365 files for year 2013.
Processed 300/365 files for year 2013.
Processed 365/365 files for year 2013
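
The `preprocessing` module itself is not shown in this notebook, so the following is only a minimal sketch of the kind of transformation `process_all_years` appears to perform, judging by the wide columns in the final DataFrame. It assumes each bulk file holds one GeoJSON-like feature per line, with one parameter value per feature under the hypothetical keys `parameterId` and `value`. The function name `pivot_features` and the sample records are illustrative and not taken from DMI or from the author's code.

```python
import json
from collections import defaultdict


def pivot_features(lines, id_fields):
    """Pivot one-parameter-per-line records into one wide row per id and interval.

    Assumes (not confirmed by the notebook) that every line is a JSON object
    with a "properties" dict holding the id fields, "from", "to",
    "parameterId" and "value".
    """
    rows = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        props = json.loads(line)["properties"]
        # One output row per (ids..., from, to) combination.
        key = tuple(props[f] for f in id_fields) + (props["from"], props["to"])
        row = rows[key]
        for f in id_fields:
            row[f] = props[f]
        row["from"] = props["from"]
        row["to"] = props["to"]
        # Each parameter becomes its own column.
        row[props["parameterId"]] = props["value"]
    return list(rows.values())


# Made-up sample lines that mimic the assumed layout.
sample = [
    json.dumps({"properties": {"municipalityId": 101, "municipalityName": "København",
                               "from": "2010-12-31T23:00:00+00:00", "to": "2011-01-01T00:00:00+00:00",
                               "parameterId": "mean_temp", "value": 3.1}}),
    json.dumps({"properties": {"municipalityId": 101, "municipalityName": "København",
                               "from": "2010-12-31T23:00:00+00:00", "to": "2011-01-01T00:00:00+00:00",
                               "parameterId": "min_temp", "value": 2.8}}),
    json.dumps({"properties": {"municipalityId": 147, "municipalityName": "Frederiksberg",
                               "from": "2010-12-31T23:00:00+00:00", "to": "2011-01-01T00:00:00+00:00",
                               "parameterId": "mean_temp", "value": 3.2}}),
]

wide_rows = pivot_features(sample, ["municipalityId", "municipalityName"])
```

A list of dicts like `wide_rows` can be passed directly to `pd.DataFrame(...)`; parameters missing for a given row would then show up as `NaN`, which matches the empty cells in the output further down.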

In [5]:
# List all CSV files in the directory
csv_files = glob.glob(os.path.join(output_folder_path, "*.csv"))

# only read file if name is YEAR.csv
csv_files = [file for file in csv_files
             if os.path.splitext(os.path.basename(file))[0].isnumeric()]

dfs = []
for file in csv_files:
    df = pd.read_csv(file)
    dfs.append(df)
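
The year filter matters because the combined `weather_muni.csv` is written to the same output folder, so a rerun would otherwise read it back in alongside the per-year files. As a small sketch of a stricter variant, the check can be written as a helper that accepts only four-digit names. The helper name `is_year_csv` and the example paths are hypothetical.

```python
import os
import re

# Accept only names like "2011.csv": exactly four digits plus the extension.
YEAR_CSV = re.compile(r"\d{4}\.csv")


def is_year_csv(path):
    """Return True if the file name (without directories) is YEAR.csv."""
    return YEAR_CSV.fullmatch(os.path.basename(path)) is not None


# Hypothetical folder listing, including the combined output file.
candidates = [
    "weather-muni/2011.csv",
    "weather-muni/weather_muni.csv",
    "weather-muni/2010.csv",
    "weather-muni/2011_backup.csv",
]

# Sorting makes the read order deterministic; glob does not guarantee one.
year_files = sorted(p for p in candidates if is_year_csv(p))
```

Using `os.path.basename` instead of splitting on `/` keeps the check independent of the path separator.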

In [6]:
# Combine all dataframes into one
df_combined = pd.concat(dfs, ignore_index=True)

# convert 'from' to datetime (the 'to' column is left as read from the CSV)
df_combined['from'] = pd.to_datetime(df_combined['from'])

# sort and rebase index
sort_by = ['from'] + ids
df_combined.sort_values(by=sort_by, inplace=True, ascending=True)
df_combined.reset_index(drop=True, inplace=True)

# save to csv
df_combined.to_csv(os.path.join(output_folder_path, 'weather_muni.csv'), index=False)

In [7]:
print(df_combined.shape)
df_combined.head()

(11955314, 25)


Unnamed: 0,municipalityId,municipalityName,from,to,geometry_type,coordinates,mean_radiation,mean_wind_speed,acc_precip,temp_soil_10,...,mean_cloud_cover,leaf_moisture,mean_temp,vapour_pressure_deficit_mean,no_lightning_strikes,mean_pressure,min_temp,temp_soil_30,mean_wind_dir,bright_sunshine
0,101,København,2010-12-31 23:00:00+00:00,2011-01-01 00:00:00+00:00,Point,"[12.49390862, 55.7040906]",0.0,10.0,0.0,,...,,,3.1,,0,1002.9,2.8,,250,0.0
1,147,Frederiksberg,2010-12-31 23:00:00+00:00,2011-01-01 00:00:00+00:00,Point,"[12.52373306, 55.67936546]",0.0,9.7,0.0,,...,,,3.2,,0,1002.8,3.0,,248,0.0
2,151,Ballerup,2010-12-31 23:00:00+00:00,2011-01-01 00:00:00+00:00,Point,"[12.36840182, 55.72775072]",0.0,10.6,0.1,,...,,,2.9,,0,1002.8,2.6,,258,0.0
3,153,Brøndby,2010-12-31 23:00:00+00:00,2011-01-01 00:00:00+00:00,Point,"[12.40438199, 55.64503727]",0.0,10.4,0.0,,...,,,3.1,,0,1003.1,2.8,,253,0.0
4,155,Dragør,2010-12-31 23:00:00+00:00,2011-01-01 00:00:00+00:00,Point,"[12.65022813, 55.59380739]",0.0,11.8,0.0,,...,,,3.1,,0,1003.4,2.9,,259,0.0
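
Because the yearly files are read back with `pd.read_csv`, the `coordinates` column holds strings such as `"[12.49390862, 55.7040906]"` rather than lists. A minimal sketch for turning one such string into numbers follows. It assumes the GeoJSON order `[longitude, latitude]`, which is consistent with the values shown for København, and the helper name `split_coordinates` is hypothetical.

```python
import json


def split_coordinates(text):
    """Parse a "[lon, lat]" string into a (lon, lat) tuple of floats.

    Assumes GeoJSON ordering: longitude first, latitude second.
    """
    lon, lat = json.loads(text)
    return float(lon), float(lat)


# Value copied from the first row of the output above.
lon, lat = split_coordinates("[12.49390862, 55.7040906]")
```

On the full DataFrame, the same helper could be applied per row, for example with `df_combined['coordinates'].apply(split_coordinates)`, though with roughly 12 million rows it may be cheaper to parse the unique strings once and map the results back.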
