Constructing datasets from CDS products #22

Open
Dan-C-Reed opened this issue Nov 21, 2023 · 6 comments

Comments

@Dan-C-Reed

My understanding is that each input data file is constructed from a couple of CDS data products for 'atmospheric' and 'single-level' variables. For example, the two queries below pull the required data (for a specific date), as I understand it:

import cdsapi

c = cdsapi.Client()

c.retrieve(
    'reanalysis-era5-pressure-levels',
    {
        'product_type': 'reanalysis',
        'format': 'netcdf',
        'variable': [
            'geopotential', 'specific_humidity', 'temperature',
            'u_component_of_wind', 'v_component_of_wind', 'vertical_velocity',
        ],
        'year': '2023',
        'month': '11',
        'day': '01',
        'time': [
            '00:00', '06:00',
        ],
        'pressure_level': [
            '1', '2', '3',
            '5', '7', '10',
            '20', '30', '50',
            '70', '100', '125',
            '150', '175', '200',
            '225', '250', '300',
            '350', '400', '450',
            '500', '550', '600',
            '650', '700', '750',
            '775', '800', '825',
            '850', '875', '900',
            '925', '950', '975',
            '1000',
        ],
    },
    'atmospheric.nc')

c.retrieve(
    'reanalysis-era5-single-levels',
    {
        'product_type': 'reanalysis',
        'variable': [
            '10m_u_component_of_wind', '10m_v_component_of_wind', '2m_temperature',
            'geopotential', 'land_sea_mask', 'mean_sea_level_pressure',
            'toa_incident_solar_radiation', 'total_precipitation',
        ],
        'year': '2023',
        'month': '11',
        'day': '01',
        'time': [
            '00:00', '06:00',
        ],
        'format': 'netcdf',
    },
    'single-level.nc')

Presumably, these files are then spliced together to form an input file. While I'm able to combine the files easily enough, there are subtleties that I'm clearly messing up, like constructing the datetime and batch coordinates, which seem to be mandatory (e.g., ValueError: 'datetime' must be in data coordinates).
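For context, the coordinate layout I'm trying to reproduce looks roughly like the sketch below, built on synthetic data with xarray. This is just my reading of the example data (a leading batch dimension, absolute timestamps in a (batch, time)-shaped datetime coordinate, and time as offsets), so the model's exact expectations may differ:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in for a merged CDS download: (time, lat, lon).
times = pd.to_datetime(["2023-11-01T00:00", "2023-11-01T06:00"])
ds = xr.Dataset(
    {"2m_temperature": (("time", "lat", "lon"), np.zeros((2, 3, 4)))},
    coords={"time": times, "lat": [0.0, 1.0, 2.0], "lon": [0.0, 1.0, 2.0, 3.0]},
)

# Add a leading batch dimension of size 1.
ds = ds.expand_dims(batch=1)

# Move the absolute timestamps into a (batch, time)-shaped 'datetime'
# coordinate, and turn 'time' into timedelta offsets from the first step,
# mirroring what the example data appears to do.
datetimes = ds["time"].values
ds = ds.assign_coords(
    datetime=(("batch", "time"), datetimes[np.newaxis, :]),
    time=datetimes - datetimes[0],
)
print(ds["datetime"].dims)  # ('batch', 'time')
```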

Do you plan to publish a script or instructions for constructing input files from these data products in the format expected by the model? It would be a fantastic help for applying the model!

Many thanks in advance,
Dan

@ab3llini

ab3llini commented Nov 22, 2023

Following up: this would definitely be much appreciated! You have done an amazing job; please help us understand how we can feed in some real-world data if you find some spare time 🤍

Alberto

@sailgrib

I would also love to better understand how we can get or generate these input files.
Awesome job, thanks,

Henri

@tvonich

tvonich commented Nov 28, 2023

I'm in the same boat: I can't manage to get any data working other than the example data. This would really jumpstart user research!

tv

@abhinavyesss

> My understanding is that each input data file is constructed from a couple of CDS data products for 'atmospheric' and 'single-level' variables. [...] Many thanks in advance, Dan

From what I have noticed: after I retrieved the data from CDS, merged the single-level and multi-level files using a pandas dataframe, and converted the result to the same format as the graphcast example data (an xarray Dataset), every data variable had all five dimensions as coordinates: batch, time, lat, lon, and level.

However, that is not the case in the example data: "land_sea_mask" has only two coordinates, 'lat' and 'lon', and similarly the single-level fields do not have 'level' as a coordinate.

The extra dimensions can be dropped by selecting a single index, e.g.

data['land_sea_mask'].isel(batch=0, time=0, level=0)

Other than that, 'time' has to be in pd.Timedelta format (offsets, not absolute timestamps).
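One caveat worth flagging: isel on its own keeps the selected values around as scalar coordinates, which still won't match the example data's bare (lat, lon) static fields. Passing drop=True discards them. A sketch on synthetic data (names are illustrative):

```python
import numpy as np
import xarray as xr

# Synthetic field with all five dimensions, as after a naive merge.
da = xr.DataArray(
    np.zeros((1, 2, 3, 4, 5)),
    dims=("batch", "time", "level", "lat", "lon"),
    coords={"batch": [0], "time": [0, 1], "level": [50, 100, 150]},
)

# isel removes the dimensions but leaves scalar coordinates behind ...
selected = da.isel(batch=0, time=0, level=0)

# ... so pass drop=True to end up with a plain (lat, lon) field.
static = da.isel(batch=0, time=0, level=0, drop=True)
print(static.dims)  # ('lat', 'lon')
```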

@tvonich

tvonich commented Dec 4, 2023

Recently learned about the ai-models package from ECMWF. It pulls the data directly from their MARS archive, no formatting required, and it works for GraphCast, Panguweather, and FourCastNet! I haven't had time to try it myself yet, but it seems like a golden ticket. @sailgrib @Dan-C-Reed @ab3llini

https://github.com/ecmwf-lab/ai-models

@ChrisAGBlake

This works for me to download and format one day's worth of data; you can modify it for multiple days:

import cdsapi
import datetime
from netCDF4 import Dataset
import numpy as np

def download_era5_data(date_str, levels, grid, save_dir):
    c = cdsapi.Client()

    # download the data at all levels
    c.retrieve('reanalysis-era5-complete', {
        'date'    : date_str,
        'levelist': '/'.join([str(x) for x in levels]),
        'levtype' : 'pl',
        'param'   : '/'.join([str(x) for x in era5_pram_codes_levels]),
        'stream'  : 'oper',
        'time'    : '00/to/23/by/6', 
        'type'    : 'an',
        'grid'    : grid,               
        'format'  : 'netcdf',                
    }, f'{save_dir}/levels.nc') 

    # download the data at the surface
    c.retrieve('reanalysis-era5-single-levels', {
        'date'    : date_str,
        'product_type': 'reanalysis',
        'param'   : '/'.join([str(x) for x in era5_pram_codes_surface]),
        'time'    : '00/to/23/by/6', 
        'grid'    : grid,               
        'format'  : 'netcdf',                
    }, f'{save_dir}/surface.nc') 

    # download the precipitation data for the given day
    c.retrieve('reanalysis-era5-single-levels', {
        'date'    : date_str,
        'product_type': 'reanalysis',
        'param'   : era5_precipitation_code,
        'time'    : '00/to/18/by/1', 
        'grid'    : grid,               
        'format'  : 'netcdf',                
    }, f'{save_dir}/precipitation.nc')
    
    # download the last 6 hours of the previous day
    date = datetime.datetime.strptime(date_str, '%Y-%m-%d')
    prev_date = date - datetime.timedelta(days=1)
    prev_date_str = prev_date.strftime('%Y-%m-%d')
    c.retrieve('reanalysis-era5-single-levels', {
        'date'    : prev_date_str,
        'product_type': 'reanalysis',
        'param'   : era5_precipitation_code,
        'time'    : '19/to/23/by/1', 
        'grid'    : grid,               
        'format'  : 'netcdf',                
    }, f'{save_dir}/prev_precipitation.nc')

    # calculate the total precipitation for the previous 6 hours
    ds_l = Dataset(f'{save_dir}/levels.nc')
    ds_s = Dataset(f'{save_dir}/surface.nc')
    dsp1 = Dataset(f'{save_dir}/prev_precipitation.nc')
    dsp2 = Dataset(f'{save_dir}/precipitation.nc')
    precip = np.concatenate([dsp1['tp'][:], dsp2['tp'][:]], axis=0)
    _, nlat, nlon = precip.shape
    precip_6hr = np.zeros((4, nlat, nlon))
    for i in range(4):
        precip_6hr[i, :, :] = np.sum(precip[i*6:(i+1)*6, :, :], axis=0)

    # create the new combined dataset
    ds = Dataset(f'{save_dir}/{date_str}.nc', 'w', format='NETCDF4')

    # time dimension
    t_dim = ds.createDimension('time', 4)
    t_var = ds.createVariable('time', 'f4', ('time',))
    t_var.units = 'hours since 1900-01-01 00:00:00'
    t_var[:] = ds_s['time'][:]

    # batch dimension
    b_dim = ds.createDimension('batch', 1)
    b_var = ds.createVariable('batch', 'i4', ('batch',))
    b_var.units = 'batch'
    b_var[:] = [0]

    # datetime dimension
    dt_dim = ds.createDimension('datetime', 4)
    dt_var = ds.createVariable('datetime', 'i4', ('batch', 'datetime',))
    dt_var.units = 'hours'
    dt_var[:] = [[int(x * 6) for x in range(4)]]

    # level dimension
    l_dim = ds.createDimension('level', len(levels))
    l_var = ds.createVariable('level', 'f4', ('level',))
    l_var.units = 'millibars'
    l_var[:] = ds_l['level'][:]

    # latitude and longitude dimensions
    lat_dim = ds.createDimension('lat', nlat)
    lat_var = ds.createVariable('lat', 'f4', ('lat',))
    lat_var.units = 'degrees_north'
    lat_var[:] = np.flip(ds_s['latitude'][:], axis=0)
    lon_dim = ds.createDimension('lon', nlon)
    lon_var = ds.createVariable('lon', 'f4', ('lon',))
    lon_var.units = 'degrees_east'
    lon_var[:] = ds_s['longitude'][:]

    # temperature
    t_var = ds.createVariable('temperature', 'f4', ('batch', 'time', 'level', 'lat', 'lon',))
    t_var.units = 'K'
    t_var.missing_value = -32767
    t_var[:] = np.expand_dims(np.flip(ds_l['t'][:], axis=2), axis=0)

    # u component of wind
    u_var = ds.createVariable('u_component_of_wind', 'f4', ('batch', 'time', 'level', 'lat', 'lon',))
    u_var.units = 'm/s'
    u_var.missing_value = -32767
    u_var[:] = np.expand_dims(np.flip(ds_l['u'][:], axis=2), axis=0)

    # v component of wind
    v_var = ds.createVariable('v_component_of_wind', 'f4', ('batch', 'time', 'level', 'lat', 'lon',))
    v_var.units = 'm/s'
    v_var.missing_value = -32767
    v_var[:] = np.expand_dims(np.flip(ds_l['v'][:], axis=2), axis=0)

    # w vertical velocity
    w_var = ds.createVariable('vertical_velocity', 'f4', ('batch', 'time', 'level', 'lat', 'lon',))
    w_var.units = 'Pa/s'
    w_var.missing_value = -32767
    w_var[:] = np.expand_dims(np.flip(ds_l['w'][:], axis=2), axis=0)

    # specific humidity
    q_var = ds.createVariable('specific_humidity', 'f4', ('batch', 'time', 'level', 'lat', 'lon',))
    q_var.units = 'kg/kg'
    q_var.missing_value = -32767
    q_var[:] = np.expand_dims(np.flip(ds_l['q'][:], axis=2), axis=0)

    # geopotential
    z_var = ds.createVariable('geopotential', 'f4', ('batch', 'time', 'level', 'lat', 'lon',))
    z_var.units = 'm^2/s^2'
    z_var.missing_value = -32767
    z_var[:] = np.expand_dims(np.flip(ds_l['z'][:], axis=2), axis=0)

    # 10m u wind
    u10_var = ds.createVariable('10m_u_component_of_wind', 'f4', ('batch', 'time', 'lat', 'lon',))
    u10_var.units = 'm/s'
    u10_var.missing_value = -32767
    u10_var[:] = np.expand_dims(np.flip(ds_s['u10'][:], axis=1), axis=0)

    # 10m v wind
    v10_var = ds.createVariable('10m_v_component_of_wind', 'f4', ('batch', 'time', 'lat', 'lon',))
    v10_var.units = 'm/s'
    v10_var.missing_value = -32767
    v10_var[:] = np.expand_dims(np.flip(ds_s['v10'][:], axis=1), axis=0)

    # 2m temperature
    t2m_var = ds.createVariable('2m_temperature', 'f4', ('batch', 'time', 'lat', 'lon',))
    t2m_var.units = 'K'
    t2m_var.missing_value = -32767
    t2m_var[:] = np.expand_dims(np.flip(ds_s['t2m'][:], axis=1), axis=0)

    # mean sea level pressure
    msl_var = ds.createVariable('mean_sea_level_pressure', 'f4', ('batch', 'time', 'lat', 'lon',))
    msl_var.units = 'Pa'
    msl_var.missing_value = -32767
    msl_var[:] = np.expand_dims(np.flip(ds_s['msl'][:], axis=1), axis=0)

    # toa solar radiation
    tisr_var = ds.createVariable('toa_incident_solar_radiation', 'f4', ('batch', 'time', 'lat', 'lon',))
    tisr_var.units = 'J/m^2'
    tisr_var.missing_value = -32767
    tisr_var[:] = np.expand_dims(np.flip(ds_s['tisr'][:], axis=1), axis=0)

    # precipitation
    precip_var = ds.createVariable('total_precipitation_6hr', 'f4', ('batch', 'time', 'lat', 'lon',))
    precip_var.units = 'm'
    precip_var.missing_value = -32767
    precip_var[:] = np.expand_dims(np.flip(precip_6hr, axis=1), axis=0)

    # geopotential at the surface
    gh_var = ds.createVariable('geopotential_at_surface', 'f4', ('lat', 'lon',))
    gh_var.units = 'm^2/s^2'
    gh_var.missing_value = -32767
    gh_var[:] = np.flip(ds_s['z'][0, :, :], axis=0)

    # land sea mask
    lsm_var = ds.createVariable('land_sea_mask', 'f4', ('lat', 'lon',))
    lsm_var.units = '1'
    lsm_var.missing_value = -32767
    lsm_var[:] = np.flip(ds_s['lsm'][0, :, :], axis=0)

    ds.close()

    return f'{save_dir}/{date_str}.nc'

where

# parameter codes information at https://apps.ecmwf.int/codes/grib/param-db/
era5_pram_codes_levels = [
    130, # temperature
    131, # u component of wind
    132, # v component of wind
    135, # w vertical velocity
    133, # q specific humidity
    129, # z geopotential
]
era5_pram_codes_surface = [
    165, # 10u 10m component of wind
    166, # 10v 10m component of wind
    167, # 2t temperature
    151, # msl mean sea level pressure
    212, # tisr toa solar radiation
    129, # z geopotential at the surface
    172, # lsm land sea mask
]
era5_precipitation_code = 228

# prod model
prod_model_levels = [1, 2, 3, 5, 7, 10, 20, 30, 50, 70, 100, 125, 150, 175, 200, 225, 250, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 775, 800, 825, 850, 875, 900, 925, 950, 975, 1000]
prod_model_grid = '0.25/0.25'
prod_checkpoint = 'params/params_GraphCast - ERA5 1979-2017 - resolution 0.25 - pressure levels 37 - mesh 2to6 - precipitation input and output.npz'

# small models
small_model_levels = [50, 100, 150, 200, 250, 300, 400, 500, 600, 700, 850, 925, 1000]
small_model_grid = '1.0/1.0'
small_checkpoint = 'params/params_GraphCast_small - ERA5 1979-2015 - resolution 1.0 - pressure levels 13 - mesh 2to5 - precipitation input and output.npz'
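As a sanity check on the accumulation step in the script above: the loop that sums hourly precipitation into 6-hour buckets is equivalent to a reshape-and-sum, which is easy to verify on synthetic data (pure numpy; the grid size here is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
precip = rng.random((24, 5, 8))  # 24 hourly steps on a tiny 5x8 grid

# The loop from the script above:
nlat, nlon = precip.shape[1:]
precip_6hr = np.zeros((4, nlat, nlon))
for i in range(4):
    precip_6hr[i, :, :] = np.sum(precip[i * 6:(i + 1) * 6, :, :], axis=0)

# Equivalent vectorised form: group the 24 hourly steps into 4 blocks of 6.
alt = precip.reshape(4, 6, nlat, nlon).sum(axis=1)
assert np.allclose(precip_6hr, alt)
```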
