# Pre-processing the data
This notebook demonstrates transforming the data extracted from MASS into a tabular dataset (almost) ready for use in machine learning. It does this for one forecast reference time and one realisation, to make the development process easier (and able to run on a smaller compute instance); a separate notebook will then be created to do the actual "batch" processing of data.

We are using data around Storm Dennis ([Met Office](https://www.metoffice.gov.uk/weather/warnings-and-advice/uk-storm-centre/storm-dennis), [Wikipedia](https://en.wikipedia.org/wiki/Storm_Dennis)).

The key steps in this process are as follows:
* Prepare radar data
  * Load in radar data files extracted from MASS and aggregated into 1 file per day of data
  * Accumulate in 3 hour accumulations to match the 3 hour frequency of model data
  * Load a sample MOGREPS-UK UK cutout grid to use as a regridding target
  * Regrid radar data
  * Transform into tabular data
* Prepare MOGREPS-G data
  * Load in data extracted from MASS for the forecast reference times and lead time of interest (in this case around Storm Dennis, 15/16 Feb 2020)
  * Extract UK data
  * For each forecast reference time, load in all single level variables and transform to tabular (using `xarray.Dataset.to_dataframe`)
  * For each forecast reference time, load in variables on height levels and transform to a dataframe
  * The xarray function by default puts variables on different heights on different rows, whereas we want all levels of a variable for a particular time/lat/lon/realization to be on the same row as separate features/columns. Transform by selecting different heights, renaming variables to include name and height, and merging together.
  * Merge single level and height level variables
  * Concatenate different times into a single dataframe and save to disk.

In [1]:
import pathlib
import datetime
import functools
import os

In [2]:
import numpy

In [3]:
import pandas

In [4]:
import xarray
import iris
import iris.quickplot
import iris.coord_categorisation

In [5]:
import matplotlib.pyplot

# Set parameters for notebook
Set the paths and lists of things to process

In [6]:
project_name = 'precip_rediagnosis'
mogreps_g_name = 'mogreps-g'
event_dir = '202002_storm_dennis'
dataset_version_dir = 'train_20221216'
input_data_dir = pathlib.Path('/scratch') / os.environ['USER']
output_dir = pathlib.Path('/scratch') / os.environ['USER'] / project_name

In [7]:
root_data_dir = input_data_dir / project_name / dataset_version_dir / event_dir
mogreps_g_data_dir = root_data_dir / mogreps_g_name
radar_data_dir = root_data_dir / 'radar'
print(f'mogreps-g data: {str(mogreps_g_data_dir)}\n radar data dir {str(radar_data_dir)}')

mogreps-g data: /scratch/shaddad/precip_rediagnosis/train_20221216/202002_storm_dennis/mogreps-g
 radar data dir /scratch/shaddad/precip_rediagnosis/train_20221216/202002_storm_dennis/radar


In [8]:
output_fname_template = 'prd_{lt:03d}H_{vt.year:04d}{vt.month:02d}{vt.day:02d}T{vt.hour:02d}{vt.minute:02d}Z.csv'

We are processing several height level variables (that is, variables reported at each model height level) and several single level variables (that is, variables that report only one value for each column, such as precipitation).

In [9]:
precip_vars = [
    "rainfall_rate",
    "rainfall_rate_from_convection",
    "snowfall_rate",
    "snowfall_rate_from_convection",
]

variables_single_level = [
    "cloud_amount_of_total_cloud",
    "height_of_orography",
    "pressure_at_mean_sea_level",
] + precip_vars

variables_height_levels = [
    "cloud_amount_on_height_levels",
    "pressure_on_height_levels",
    "temperature_on_height_levels",
    "relative_humidity_on_height_levels",
    "wind_direction_on_height_levels",
    "wind_speed_on_height_levels",
    
]

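Here we define the ranges of the precipitation intensity bands. Each key names a band and maps to its `[lower, upper)` range of rainfall rate in mm/h (lower bound inclusive, upper bound exclusive).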

In [10]:
rainfall_thresholds = {
    "0.0": [0.0, 0.01],
    "0.25": [0.01, 0.5],
    "2.5": [0.5, 4.0],
    "7.0": [4.0, 10.0],
    "10.0": [10.0, 220.0],
}


In [11]:
merge_coords_model_hlvs = ['latitude', 'longitude', 'time', 'realization']


In [12]:
num_periods = 10
start_ref_time = datetime.datetime(2020,2,14,12)
forecast_ref_time_range = [start_ref_time + datetime.timedelta(hours=6)*i1 for i1 in range(num_periods)]
leadtime_hours = 6
realizations_list = list(range(35))

In [13]:
target_time_range = [dt1 + datetime.timedelta(hours=leadtime_hours) for dt1 in forecast_ref_time_range]
target_time_range

[datetime.datetime(2020, 2, 14, 18, 0),
 datetime.datetime(2020, 2, 15, 0, 0),
 datetime.datetime(2020, 2, 15, 6, 0),
 datetime.datetime(2020, 2, 15, 12, 0),
 datetime.datetime(2020, 2, 15, 18, 0),
 datetime.datetime(2020, 2, 16, 0, 0),
 datetime.datetime(2020, 2, 16, 6, 0),
 datetime.datetime(2020, 2, 16, 12, 0),
 datetime.datetime(2020, 2, 16, 18, 0),
 datetime.datetime(2020, 2, 17, 0, 0)]

In [14]:
dataset = 'mogreps-g'
subset = 'lev1'
forecast_ref_template = '{frt.year:04d}{frt.month:02d}{frt.day:02d}T{frt.hour:02d}00Z.nc.file'
fname_template = '{vt.year:04d}{vt.month:02d}{vt.day:02d}T{vt.hour:02d}00Z-PT{lead_time:04d}H00M-{var_name}.nc'
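
As a quick check of how the file name template expands, the sketch below formats it for the first validity time and a 6 hour lead time. The variable name `rainfall_rate` is simply one of the variables listed earlier, chosen for illustration.

```python
import datetime

# Same template as in the cell above, repeated so this example is self-contained.
fname_template = '{vt.year:04d}{vt.month:02d}{vt.day:02d}T{vt.hour:02d}00Z-PT{lead_time:04d}H00M-{var_name}.nc'

example_fname = fname_template.format(
    vt=datetime.datetime(2020, 2, 14, 18),  # validity time
    lead_time=6,                            # lead time in hours
    var_name='rainfall_rate',               # variable name as used in the file name
)
print(example_fname)
```

The result should have the same form as the file names listed later in the notebook for the single level variables.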

In [15]:
variables_to_extract = variables_height_levels + variables_single_level

The variable names used in the file names don't match the variable names in the file metadata, so we will need a mapping between the two. As a first step, we gather the list of files for each variable name.

In [16]:
path_lists_vars = {
    var_name: [f1 for f1 in mogreps_g_data_dir.iterdir() if var_name in str(f1)]
    for var_name in variables_to_extract
}
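
One thing to be aware of is that the substring test `var_name in str(f1)` also matches longer names that contain the shorter one; for example `rainfall_rate` also matches files for `rainfall_rate_from_convection`. Below is a minimal sketch of a stricter match, assuming the file names end in `-{var_name}.nc` as in `fname_template`. The helper name `matches_variable` and the example file names are hypothetical.

```python
import pathlib


def matches_variable(path, var_name):
    """Return True only if the file name ends with exactly this variable name."""
    return pathlib.Path(path).name.endswith(f'-{var_name}.nc')


# Illustrative file names following the pattern of fname_template.
example_files = [
    '20200214T1800Z-PT0006H00M-rainfall_rate.nc',
    '20200214T1800Z-PT0006H00M-rainfall_rate_from_convection.nc',
]

# The substring test picks up both files ...
substring_matches = [f1 for f1 in example_files if 'rainfall_rate' in f1]
# ... whereas the suffix test only picks up the exact variable.
strict_matches = [f1 for f1 in example_files if matches_variable(f1, 'rainfall_rate')]
print(len(substring_matches), len(strict_matches))
```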


We have an example target cube, which represents the grid that we are aiming for. This is the same as the grid for the model data, so we don't actually need to regrid the model data, but it provides the cutout area that we are considering.

In [17]:
target_cube_path = '/project/informatics_lab/precip_rediagnosis/target_cube.nc'

In [18]:
target_grid_cube = iris.load_cube(
    str(target_cube_path)
)


In [19]:
uk_bounds = {
    'latitude': (min(target_grid_cube.coord('latitude').points), max(target_grid_cube.coord('latitude').points)),
    'longitude': (min(target_grid_cube.coord('longitude').points), max(target_grid_cube.coord('longitude').points))}
xarray_select_uk = {k1: slice(*v1) for k1, v1 in uk_bounds.items()}
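# Note: the lower bounds below are strict (<), so points exactly at the minimum latitude/longitude
# are excluded here, whereas the xarray slices above include both end points.
uk_bounds_constraint = iris.Constraint(
    latitude=lambda c1: uk_bounds['latitude'][0] < c1.point <= uk_bounds['latitude'][1],
    longitude=lambda c1: uk_bounds['longitude'][0] < c1.point <= uk_bounds['longitude'][1])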
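
The two selections defined in this cell treat the lower bound differently: an xarray label slice includes both end points, while the Iris constraint uses a strict `<` on the lower bound. The sketch below illustrates the effect in plain Python on a made-up set of latitude points; the values are illustrative only.

```python
# Made-up coordinate points; the bounds are the min and max, as for uk_bounds.
points = [49.0, 49.5, 50.0, 50.5]
lower, upper = min(points), max(points)

# Both end points included (the behaviour of an xarray label slice).
inclusive = [p for p in points if lower <= p <= upper]
# Strict lower bound (the form of the Iris constraint lambdas).
strict_lower = [p for p in points if lower < p <= upper]

print(len(inclusive), len(strict_lower))
```

This is consistent with the outputs later in the notebook, where the height level dataframe (selected with xarray) contains one more latitude and longitude value than the single level dataframe (selected with Iris).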


### Helper functions

We have some helper functions to make some of the operations easier and the code a bit more compact and easier to read.

In [20]:
def calc_dates_list(start_datetime, end_datetime, delta_hours, tz_str='UTC'):
    dates_to_extract = list(pandas.date_range(
        start=start_datetime,
        end=end_datetime,
        freq=datetime.timedelta(hours=delta_hours),
        tz=tz_str,
    ).to_pydatetime())
    return dates_to_extract
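
As an illustration of what `calc_dates_list` returns, the sketch below repeats the function so the block is self-contained and calls it with the forecast reference time range used in this notebook. Both end points are included and the returned datetimes are timezone-aware.

```python
import datetime

import pandas


def calc_dates_list(start_datetime, end_datetime, delta_hours, tz_str='UTC'):
    dates_to_extract = list(pandas.date_range(
        start=start_datetime,
        end=end_datetime,
        freq=datetime.timedelta(hours=delta_hours),
        tz=tz_str,
    ).to_pydatetime())
    return dates_to_extract


# 6-hourly times from 12Z on 14 Feb to 18Z on 16 Feb 2020.
example_dates = calc_dates_list(
    start_datetime=datetime.datetime(2020, 2, 14, 12),
    end_datetime=datetime.datetime(2020, 2, 16, 18),
    delta_hours=6,
)
print(len(example_dates), example_dates[0], example_dates[-1])
```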


In [21]:
def compare_time(t1, t2):
    # Two times match if they agree down to the minute.
    is_match = (
        (t1.year == t2.year) and (t1.month == t2.month) and (t1.day == t2.day)
        and (t1.hour == t2.hour) and (t1.minute == t2.minute)
    )
    return is_match

## Create a dataset from MOGREPS-G data
Information on Met Office ensemble forecasts - https://www.metoffice.gov.uk/research/weather/ensemble-forecasting#
Paper - https://www.metoffice.gov.uk/research/weather/ensemble-forecasting

In [22]:
fcst_ref_time = forecast_ref_time_range[0]
real1 = realizations_list[10]
validity_time = fcst_ref_time + datetime.timedelta(hours=leadtime_hours)

In [23]:
validity_time

datetime.datetime(2020, 2, 14, 18, 0)

The file names do not match the variable names within the files, so we need to create a mapping to work with.

In [24]:
%%time
# load a cube for each variable in iris to get the actual variable name, and populate dictionary mapping from the var name in the file name to the variable as loaded into iris/xarray
file_to_var_mapping = {
    var_file_name: iris.load_cube(str(mogreps_g_data_dir / fname_template.format(vt=validity_time,
                                                                                 lead_time=leadtime_hours,
                                                                                 var_name=var_file_name))).name()
    for var_file_name in variables_single_level + variables_height_levels}
file_to_var_mapping

CPU times: user 473 ms, sys: 43.5 ms, total: 517 ms
Wall time: 512 ms


{'cloud_amount_of_total_cloud': 'cloud_area_fraction',
 'height_of_orography': 'surface_altitude',
 'pressure_at_mean_sea_level': 'air_pressure_at_sea_level',
 'rainfall_rate': 'rainfall_rate',
 'rainfall_rate_from_convection': 'convective_rainfall_rate',
 'snowfall_rate': 'lwe_snowfall_rate',
 'snowfall_rate_from_convection': 'lwe_convective_snowfall_rate',
 'cloud_amount_on_height_levels': 'cloud_volume_fraction_in_atmosphere_layer',
 'pressure_on_height_levels': 'air_pressure',
 'temperature_on_height_levels': 'air_temperature',
 'relative_humidity_on_height_levels': 'relative_humidity',
 'wind_direction_on_height_levels': 'wind_from_direction',
 'wind_speed_on_height_levels': 'wind_speed'}

Load an example cube to get a list of the heights for height level variables.

In [25]:
heights = iris.load_cube(
    str(mogreps_g_data_dir / fname_template.format(
        vt=target_time_range[0],
        lead_time=leadtime_hours,
        var_name=variables_height_levels[0]))).coord('height').points


In [26]:
single_level_var_mappings = {v1: file_to_var_mapping[v1] for v1 in variables_single_level}
height_level_var_mappings = {v1: file_to_var_mapping[v1] for v1 in variables_height_levels}

In [27]:
for fcst_ref_time in forecast_ref_time_range:
    print(fcst_ref_time)

2020-02-14 12:00:00
2020-02-14 18:00:00
2020-02-15 00:00:00
2020-02-15 06:00:00
2020-02-15 12:00:00
2020-02-15 18:00:00
2020-02-16 00:00:00
2020-02-16 06:00:00
2020-02-16 12:00:00
2020-02-16 18:00:00


Set up the paths to the files containing the single level variables.

In [28]:
sl_paths = [mogreps_g_data_dir / fname_template.format(vt=validity_time,
                                             lead_time=leadtime_hours,
                                               var_name=var1) for var1 in variables_single_level]
sl_paths

[PosixPath('/scratch/shaddad/precip_rediagnosis/train_20221216/202002_storm_dennis/mogreps-g/20200214T1800Z-PT0006H00M-cloud_amount_of_total_cloud.nc'),
 PosixPath('/scratch/shaddad/precip_rediagnosis/train_20221216/202002_storm_dennis/mogreps-g/20200214T1800Z-PT0006H00M-height_of_orography.nc'),
 PosixPath('/scratch/shaddad/precip_rediagnosis/train_20221216/202002_storm_dennis/mogreps-g/20200214T1800Z-PT0006H00M-pressure_at_mean_sea_level.nc'),
 PosixPath('/scratch/shaddad/precip_rediagnosis/train_20221216/202002_storm_dennis/mogreps-g/20200214T1800Z-PT0006H00M-rainfall_rate.nc'),
 PosixPath('/scratch/shaddad/precip_rediagnosis/train_20221216/202002_storm_dennis/mogreps-g/20200214T1800Z-PT0006H00M-rainfall_rate_from_convection.nc'),
 PosixPath('/scratch/shaddad/precip_rediagnosis/train_20221216/202002_storm_dennis/mogreps-g/20200214T1800Z-PT0006H00M-snowfall_rate.nc'),
 PosixPath('/scratch/shaddad/precip_rediagnosis/train_20221216/202002_storm_dennis/mogreps-g/20200214T1800Z-PT0006H00

In [29]:
sl_cubes = [iris.load_cube(str(sl_path), uk_bounds_constraint) for sl_path in sl_paths]
sl_cubes

[<iris 'Cube' of cloud_area_fraction / (1) (realization: 18; latitude: 51; longitude: 30)>,
 <iris 'Cube' of surface_altitude / (m) (realization: 18; latitude: 51; longitude: 30)>,
 <iris 'Cube' of air_pressure_at_sea_level / (Pa) (realization: 18; latitude: 51; longitude: 30)>,
 <iris 'Cube' of rainfall_rate / (m s-1) (realization: 18; latitude: 51; longitude: 30)>,
 <iris 'Cube' of convective_rainfall_rate / (m s-1) (realization: 18; latitude: 51; longitude: 30)>,
 <iris 'Cube' of lwe_snowfall_rate / (m s-1) (realization: 18; latitude: 51; longitude: 30)>,
 <iris 'Cube' of lwe_convective_snowfall_rate / (m s-1) (realization: 18; latitude: 51; longitude: 30)>]

For our precipitation variables, we want to convert from the standard unit output from the model, to the more human comprehensible `mm` or `mm/h`.

In [30]:
for cube1 in sl_cubes:
    if 'thickness' in cube1.name() and ('rainfall' in cube1.name() or 'snowfall' in cube1.name()):
        cube1.convert_units('mm')
        print(f'converting {cube1.name()}')
    if 'rate' in cube1.name() and ('rainfall' in cube1.name() or 'snowfall' in cube1.name()):
        cube1.convert_units('mm/h')
        print(f'converting {cube1.name()}')


converting rainfall_rate
converting convective_rainfall_rate
converting lwe_snowfall_rate
converting lwe_convective_snowfall_rate
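
To make the size of this conversion concrete: going from `m s-1` to `mm/h` multiplies by 1000 (mm per m) and by 3600 (seconds per hour), i.e. by 3.6e6. A minimal plain-Python sketch of the arithmetic follows; the function name is hypothetical, and Iris performs the equivalent conversion internally in `convert_units`.

```python
SECONDS_PER_HOUR = 3600.0
MM_PER_M = 1000.0


def m_per_s_to_mm_per_h(rate_m_per_s):
    """Convert a precipitation rate from metres per second to millimetres per hour."""
    return rate_m_per_s * MM_PER_M * SECONDS_PER_HOUR


# A rate of 1e-7 m/s corresponds to roughly 0.36 mm/h.
example_rate = m_per_s_to_mm_per_h(1.0e-7)
print(example_rate)
```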


In [31]:
single_level_ds = xarray.merge( [xarray.DataArray.from_iris(slc1) for slc1 in sl_cubes] )

In [32]:
single_level_df = single_level_ds.to_dataframe().reset_index()

In [33]:
single_level_df

Unnamed: 0,realization,latitude,longitude,forecast_period,forecast_reference_time,time,cloud_area_fraction,surface_altitude,air_pressure_at_sea_level,rainfall_rate,convective_rainfall_rate,lwe_snowfall_rate,lwe_convective_snowfall_rate
0,0,49.40625,-5.484375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.000000,0.0,101512.0,0.137463,0.177696,0.0,0.0
1,0,49.40625,-5.203125,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.000000,0.0,101537.0,0.083819,0.000000,0.0,0.0
2,0,49.40625,-4.921875,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.000000,0.0,101566.0,0.124052,0.000000,0.0,0.0
3,0,49.40625,-4.640625,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.000000,0.0,101599.0,0.248104,0.000000,0.0,0.0
4,0,49.40625,-4.359375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.000000,0.0,101633.0,0.466034,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
27535,17,58.78125,1.546875,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,0.875000,0.0,99871.0,0.000000,0.000000,0.0,0.0
27536,17,58.78125,1.828125,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,0.984375,0.0,99910.0,0.000000,0.000000,0.0,0.0
27537,17,58.78125,2.109375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.000000,0.0,99952.0,0.000000,0.000000,0.0,0.0
27538,17,58.78125,2.390625,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.000000,0.0,99992.0,0.150874,0.000000,0.0,0.0


In [34]:
def calc_ensemble_fractions(model_df, lower_bound, upper_bound):
    return ((model_df >= lower_bound) & (model_df < upper_bound)).sum() / model_df.shape[0]
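
To illustrate what `calc_ensemble_fractions` computes, here is a small sketch on made-up data: four ensemble members at each of two grid points, with the fraction of members in the `[0.01, 0.5)` band calculated per point. For simplicity this applies the function to a Series per group, rather than the single-column DataFrame used in the notebook; all values are invented.

```python
import pandas


def calc_ensemble_fractions(model_df, lower_bound, upper_bound):
    return ((model_df >= lower_bound) & (model_df < upper_bound)).sum() / model_df.shape[0]


# Made-up rainfall rates (mm/h) for 4 realizations at 2 grid points.
toy_df = pandas.DataFrame({
    'latitude': [50.0] * 4 + [51.0] * 4,
    'longitude': [0.0] * 8,
    'realization': [0, 1, 2, 3] * 2,
    'rainfall_rate': [0.0, 0.2, 0.3, 1.0, 0.0, 0.0, 0.0, 5.0],
})

# Fraction of members at each grid point with a rate in [0.01, 0.5).
toy_fractions = toy_df.groupby(['latitude', 'longitude'])['rainfall_rate'].apply(
    lambda rates: calc_ensemble_fractions(rates, 0.01, 0.5))
print(toy_fractions)
```

At the first point two of the four members fall in the band, giving 0.5; at the second point none do, giving 0.0.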


In [35]:
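def load_ds(ds_path, selected_bounds):
    """
    Load a dataset and subset it to the selected coordinate bounds,
    selecting `bnds=0`. Returns None if the selection raises a KeyError.
    """
    try:
        subset1 = dict(selected_bounds)
        subset1['bnds'] = 0
        single_level_ds = xarray.load_dataset(ds_path).sel(**subset1)
    except KeyError:
        single_level_ds = None
    return single_level_ds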

In [36]:
intensity_band_template = '{source}_fraction_in_band_instant_{band_centre}'

In [37]:
ensemble_fractions = [
    single_level_df.groupby(['latitude', 'longitude', 'time'])[
        ['rainfall_rate']].apply(
        lambda x: calc_ensemble_fractions(x, lower_bound,
                                         upper_bound)).rename(
        columns={'rainfall_rate': intensity_band_template.format(
            source='mogrepsg', band_centre=intensity_band)})
    for intensity_band, [lower_bound, upper_bound] in
    rainfall_thresholds.items()]

In [38]:
ensemble_fractions_df = pandas.concat(ensemble_fractions, axis=1)
ensemble_fractions_df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,mogrepsg_fraction_in_band_instant_0.0,mogrepsg_fraction_in_band_instant_0.25,mogrepsg_fraction_in_band_instant_2.5,mogrepsg_fraction_in_band_instant_7.0,mogrepsg_fraction_in_band_instant_10.0
latitude,longitude,time,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
49.40625,-5.484375,2020-02-14 18:00:00,0.055556,0.777778,0.166667,0.0,0.0
49.40625,-5.203125,2020-02-14 18:00:00,0.000000,1.000000,0.000000,0.0,0.0
49.40625,-4.921875,2020-02-14 18:00:00,0.000000,0.944444,0.055556,0.0,0.0
49.40625,-4.640625,2020-02-14 18:00:00,0.055556,0.944444,0.000000,0.0,0.0
49.40625,-4.359375,2020-02-14 18:00:00,0.055556,0.944444,0.000000,0.0,0.0
...,...,...,...,...,...,...,...
58.78125,1.546875,2020-02-14 18:00:00,0.944444,0.055556,0.000000,0.0,0.0
58.78125,1.828125,2020-02-14 18:00:00,0.777778,0.222222,0.000000,0.0,0.0
58.78125,2.109375,2020-02-14 18:00:00,0.333333,0.611111,0.055556,0.0,0.0
58.78125,2.390625,2020-02-14 18:00:00,0.111111,0.777778,0.111111,0.0,0.0


In [39]:
single_level_df = pandas.merge(single_level_df,
                               ensemble_fractions_df,
                     left_on=['latitude', 'longitude', 'time'],
                     right_index=True)
single_level_df

Unnamed: 0,realization,latitude,longitude,forecast_period,forecast_reference_time,time,cloud_area_fraction,surface_altitude,air_pressure_at_sea_level,rainfall_rate,convective_rainfall_rate,lwe_snowfall_rate,lwe_convective_snowfall_rate,mogrepsg_fraction_in_band_instant_0.0,mogrepsg_fraction_in_band_instant_0.25,mogrepsg_fraction_in_band_instant_2.5,mogrepsg_fraction_in_band_instant_7.0,mogrepsg_fraction_in_band_instant_10.0
0,0,49.40625,-5.484375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,101512.0,0.137463,0.177696,0.0,0.0,0.055556,0.777778,0.166667,0.0,0.0
1530,1,49.40625,-5.484375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,101522.0,0.301749,0.248104,0.0,0.0,0.055556,0.777778,0.166667,0.0,0.0
3060,2,49.40625,-5.484375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,101543.0,0.301749,0.000000,0.0,0.0,0.055556,0.777778,0.166667,0.0,0.0
4590,3,49.40625,-5.484375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,101549.0,0.204518,0.000000,0.0,0.0,0.055556,0.777778,0.166667,0.0,0.0
6120,4,49.40625,-5.484375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,101533.0,0.177696,0.026822,0.0,0.0,0.055556,0.777778,0.166667,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
21419,13,58.78125,2.671875,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,100081.0,0.466034,0.000000,0.0,0.0,0.000000,0.611111,0.388889,0.0,0.0
22949,14,58.78125,2.671875,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,100036.0,0.191107,0.000000,0.0,0.0,0.000000,0.611111,0.388889,0.0,0.0
24479,15,58.78125,2.671875,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,100050.0,0.576675,0.026822,0.0,0.0,0.000000,0.611111,0.388889,0.0,0.0
26009,16,58.78125,2.671875,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,100007.0,0.013411,0.000000,0.0,0.0,0.000000,0.611111,0.388889,0.0,0.0


Now load and process the variables on height levels

In [40]:
height_levels_ds = xarray.merge([load_ds(
    ds_path=mogreps_g_data_dir / fname_template.format(vt=validity_time,
                                                       lead_time=leadtime_hours,
                                                       var_name=var1),
    selected_bounds=xarray_select_uk,
    )
                                 for var1 in variables_height_levels])

In [41]:
hl_df_multirow = height_levels_ds.to_dataframe()


When using the xarray method `to_dataframe`, different heights will be put on different rows, which in machine learning jargon is putting them in different data points for training purposes. What we really want is for values at different heights for the same physical phenomenon to be different features within a data point, i.e. in the same row in separate columns, e.g. `air_temperature_5m`, `air_temperature_10m` etc. Pandas provides an `unstack` method which achieves this when we start from a dataframe with a multi-index, as we get from xarray.

In [42]:
height_levels_df = hl_df_multirow[[file_to_var_mapping[v1] for v1 in variables_height_levels]].unstack('height')
height_levels_df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,cloud_volume_fraction_in_atmosphere_layer,cloud_volume_fraction_in_atmosphere_layer,cloud_volume_fraction_in_atmosphere_layer,cloud_volume_fraction_in_atmosphere_layer,cloud_volume_fraction_in_atmosphere_layer,cloud_volume_fraction_in_atmosphere_layer,cloud_volume_fraction_in_atmosphere_layer,cloud_volume_fraction_in_atmosphere_layer,cloud_volume_fraction_in_atmosphere_layer,cloud_volume_fraction_in_atmosphere_layer,...,wind_speed,wind_speed,wind_speed,wind_speed,wind_speed,wind_speed,wind_speed,wind_speed,wind_speed,wind_speed
Unnamed: 0_level_1,Unnamed: 1_level_1,height,5.0,10.0,20.0,30.0,50.0,75.0,100.0,150.0,200.0,250.0,...,2750.0,3000.0,3250.0,3500.0,3750.0,4000.0,4500.0,5000.0,5500.0,6000.0
realization,latitude,longitude,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2,Unnamed: 22_level_2,Unnamed: 23_level_2
0,49.21875,-5.765625,0.0,0.0,0.0,0.0,0.0,0.218750,0.468750,0.820312,0.953125,1.000000,...,22.3125,22.6875,23.5000,25.0000,27.1250,29.7500,31.6250,33.1250,32.9375,32.1250
0,49.21875,-5.484375,0.0,0.0,0.0,0.0,0.0,0.039062,0.078125,0.507812,0.742188,0.875000,...,21.7500,22.1875,23.2500,25.1875,27.6250,29.8125,31.1875,32.3125,32.2500,31.6875
0,49.21875,-5.203125,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.289062,0.476562,0.648438,...,21.0000,21.7500,23.0625,25.3125,27.8750,29.7500,30.5000,31.1875,31.4375,31.3750
0,49.21875,-4.921875,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.320312,0.656250,...,20.3125,21.4375,23.1875,25.6250,28.0625,29.5625,29.8125,29.8125,30.5625,31.1875
0,49.21875,-4.640625,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.132812,0.289062,0.453125,...,19.8125,21.4375,23.5000,25.9375,28.1250,29.3125,29.0000,28.4375,29.5625,30.9375
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17,58.78125,1.546875,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,...,22.7500,23.6875,25.1250,27.4375,30.5000,33.8750,37.1875,38.0000,39.4375,41.6250
17,58.78125,1.828125,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,...,23.1875,23.8750,24.7500,26.5000,29.3125,32.7500,36.4375,37.5625,39.3125,41.5000
17,58.78125,2.109375,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.039062,0.218750,...,24.6250,25.3750,25.6250,26.3750,28.2500,31.3125,35.5000,37.0000,39.0000,41.2500
17,58.78125,2.390625,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.109375,0.242188,...,26.3750,27.0625,27.0625,27.0625,27.8750,29.8750,34.3750,36.3125,38.3750,40.8125


In [43]:
height_levels_df.columns = [f'{c1[0]}_{c1[1]}' for c1 in height_levels_df]
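
The unstack-and-flatten pattern used in the last two cells can be seen more easily on a tiny made-up dataframe. The sketch below uses one variable, two realizations and two heights; all values are invented for illustration.

```python
import pandas

# A "long" dataframe with one row per (realization, height), as xarray would produce.
long_index = pandas.MultiIndex.from_product(
    [[0, 1], [5.0, 10.0]], names=['realization', 'height'])
long_df = pandas.DataFrame(
    {'air_temperature': [280.0, 279.5, 281.0, 280.5]}, index=long_index)

# Move the height level of the index into the columns: one row per realization.
wide_df = long_df.unstack('height')

# Flatten the (variable, height) column MultiIndex into single names.
wide_df.columns = [f'{name}_{height}' for name, height in wide_df.columns]
print(wide_df)
```

Each realization now has a single row, with one column per height (`air_temperature_5.0` and `air_temperature_10.0`).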


In [44]:
height_levels_df = height_levels_df.reset_index()
height_levels_df['time'] = single_level_df['time'][0]
height_levels_df['forecast_reference_time'] = single_level_df['forecast_reference_time'][0]
height_levels_df['forecast_period'] = single_level_df['forecast_period'][0]
height_levels_df

Unnamed: 0,realization,latitude,longitude,cloud_volume_fraction_in_atmosphere_layer_5.0,cloud_volume_fraction_in_atmosphere_layer_10.0,cloud_volume_fraction_in_atmosphere_layer_20.0,cloud_volume_fraction_in_atmosphere_layer_30.0,cloud_volume_fraction_in_atmosphere_layer_50.0,cloud_volume_fraction_in_atmosphere_layer_75.0,cloud_volume_fraction_in_atmosphere_layer_100.0,...,wind_speed_3500.0,wind_speed_3750.0,wind_speed_4000.0,wind_speed_4500.0,wind_speed_5000.0,wind_speed_5500.0,wind_speed_6000.0,time,forecast_reference_time,forecast_period
0,0,49.21875,-5.765625,0.0,0.0,0.0,0.0,0.0,0.218750,0.468750,...,25.0000,27.1250,29.7500,31.6250,33.1250,32.9375,32.1250,2020-02-14 18:00:00,2020-02-14 12:00:00,0 days 06:00:00
1,0,49.21875,-5.484375,0.0,0.0,0.0,0.0,0.0,0.039062,0.078125,...,25.1875,27.6250,29.8125,31.1875,32.3125,32.2500,31.6875,2020-02-14 18:00:00,2020-02-14 12:00:00,0 days 06:00:00
2,0,49.21875,-5.203125,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,25.3125,27.8750,29.7500,30.5000,31.1875,31.4375,31.3750,2020-02-14 18:00:00,2020-02-14 12:00:00,0 days 06:00:00
3,0,49.21875,-4.921875,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,25.6250,28.0625,29.5625,29.8125,29.8125,30.5625,31.1875,2020-02-14 18:00:00,2020-02-14 12:00:00,0 days 06:00:00
4,0,49.21875,-4.640625,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,25.9375,28.1250,29.3125,29.0000,28.4375,29.5625,30.9375,2020-02-14 18:00:00,2020-02-14 12:00:00,0 days 06:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29011,17,58.78125,1.546875,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,27.4375,30.5000,33.8750,37.1875,38.0000,39.4375,41.6250,2020-02-14 18:00:00,2020-02-14 12:00:00,0 days 06:00:00
29012,17,58.78125,1.828125,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,26.5000,29.3125,32.7500,36.4375,37.5625,39.3125,41.5000,2020-02-14 18:00:00,2020-02-14 12:00:00,0 days 06:00:00
29013,17,58.78125,2.109375,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,26.3750,28.2500,31.3125,35.5000,37.0000,39.0000,41.2500,2020-02-14 18:00:00,2020-02-14 12:00:00,0 days 06:00:00
29014,17,58.78125,2.390625,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,27.0625,27.8750,29.8750,34.3750,36.3125,38.3750,40.8125,2020-02-14 18:00:00,2020-02-14 12:00:00,0 days 06:00:00


In [45]:
list(height_levels_df.columns)

['realization',
 'latitude',
 'longitude',
 'cloud_volume_fraction_in_atmosphere_layer_5.0',
 'cloud_volume_fraction_in_atmosphere_layer_10.0',
 'cloud_volume_fraction_in_atmosphere_layer_20.0',
 'cloud_volume_fraction_in_atmosphere_layer_30.0',
 'cloud_volume_fraction_in_atmosphere_layer_50.0',
 'cloud_volume_fraction_in_atmosphere_layer_75.0',
 'cloud_volume_fraction_in_atmosphere_layer_100.0',
 'cloud_volume_fraction_in_atmosphere_layer_150.0',
 'cloud_volume_fraction_in_atmosphere_layer_200.0',
 'cloud_volume_fraction_in_atmosphere_layer_250.0',
 'cloud_volume_fraction_in_atmosphere_layer_300.0',
 'cloud_volume_fraction_in_atmosphere_layer_400.0',
 'cloud_volume_fraction_in_atmosphere_layer_500.0',
 'cloud_volume_fraction_in_atmosphere_layer_600.0',
 'cloud_volume_fraction_in_atmosphere_layer_700.0',
 'cloud_volume_fraction_in_atmosphere_layer_800.0',
 'cloud_volume_fraction_in_atmosphere_layer_1000.0',
 'cloud_volume_fraction_in_atmosphere_layer_1250.0',
 'cloud_volume_fraction_in

Now that we have created the correct dataframe for variables on height levels, we can merge this with the dataframe for single level variables. We are merging on the following coordinates:
* location (latitude and longitude)
* time (validity time)
* realization

In [46]:
merge_coords = ['latitude', 'longitude', 'time', 'realization']

In [47]:
mogreps_g_single_ts_uk_df = single_level_df.merge(height_levels_df,
                                                  on=merge_coords)
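
Two properties of this merge are worth keeping in mind: it is an inner join by default, so rows present in only one dataframe are dropped, and columns that appear in both dataframes but are not in `merge_coords` come out duplicated with `_x` and `_y` suffixes. A small sketch on invented data shows both effects; the values and the integer `forecast_period` are illustrative only.

```python
import pandas

merge_coords = ['latitude', 'longitude', 'time', 'realization']

# Invented single level data: two points at latitude 50, one at latitude 51.
left_df = pandas.DataFrame({
    'latitude': [50.0, 50.0, 51.0],
    'longitude': [0.0, 0.0, 0.0],
    'time': ['2020-02-14T18:00'] * 3,
    'realization': [0, 1, 0],
    'forecast_period': [6, 6, 6],
    'rainfall_rate': [0.1, 0.3, 0.0],
})
# Invented height level data: the third row is at latitude 52, so has no partner.
right_df = pandas.DataFrame({
    'latitude': [50.0, 50.0, 52.0],
    'longitude': [0.0, 0.0, 0.0],
    'time': ['2020-02-14T18:00'] * 3,
    'realization': [0, 1, 0],
    'forecast_period': [6, 6, 6],
    'wind_speed_5.0': [3.0, 4.0, 5.0],
})

merged_df = left_df.merge(right_df, on=merge_coords)
print(merged_df.columns.tolist())
print(len(merged_df))
```

One possible way to avoid the duplicated columns, if desired, would be to drop them from one side before merging or to add them to the merge keys.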

In [48]:
mogreps_g_single_ts_uk_df

Unnamed: 0,realization,latitude,longitude,forecast_period_x,forecast_reference_time_x,time,cloud_area_fraction,surface_altitude,air_pressure_at_sea_level,rainfall_rate,...,wind_speed_3250.0,wind_speed_3500.0,wind_speed_3750.0,wind_speed_4000.0,wind_speed_4500.0,wind_speed_5000.0,wind_speed_5500.0,wind_speed_6000.0,forecast_reference_time_y,forecast_period_y
0,0,49.40625,-5.484375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,101512.0,0.137463,...,23.6875,25.0000,26.9375,29.2500,31.5000,33.2500,33.0000,32.1250,2020-02-14 12:00:00,0 days 06:00:00
1,1,49.40625,-5.484375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,101522.0,0.301749,...,22.1250,22.3750,23.5000,25.8750,29.2500,30.8125,32.3750,32.0000,2020-02-14 12:00:00,0 days 06:00:00
2,2,49.40625,-5.484375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,101543.0,0.301749,...,23.6875,25.2500,27.4375,29.5625,30.1875,32.7500,33.5625,33.5000,2020-02-14 12:00:00,0 days 06:00:00
3,3,49.40625,-5.484375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,101549.0,0.204518,...,23.3750,24.1250,25.8750,28.5625,30.8750,31.4375,31.5625,31.4375,2020-02-14 12:00:00,0 days 06:00:00
4,4,49.40625,-5.484375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,101533.0,0.177696,...,23.0625,24.6250,26.6250,28.3750,30.7500,32.6875,32.5000,32.0625,2020-02-14 12:00:00,0 days 06:00:00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
27535,13,58.78125,2.671875,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,100081.0,0.466034,...,30.3125,30.9375,30.8125,30.4375,33.3750,38.0000,39.0000,38.8125,2020-02-14 12:00:00,0 days 06:00:00
27536,14,58.78125,2.671875,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,100036.0,0.191107,...,24.4375,25.2500,26.4375,28.0625,37.6875,41.9375,42.2500,42.5625,2020-02-14 12:00:00,0 days 06:00:00
27537,15,58.78125,2.671875,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,100050.0,0.576675,...,27.6875,28.8750,29.9375,30.9375,33.5000,37.4375,41.6250,44.7500,2020-02-14 12:00:00,0 days 06:00:00
27538,16,58.78125,2.671875,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.0,0.0,100007.0,0.013411,...,28.3750,29.6250,30.9375,32.0625,34.6250,37.4375,38.3125,38.7500,2020-02-14 12:00:00,0 days 06:00:00


In [49]:
coords = list(set(hl_df_multirow.columns) - set(height_level_var_mappings.values()))
print(coords)

['forecast_period', 'forecast_reference_time', 'time', 'latitude_longitude', 'longitude_bnds', 'latitude_bnds']


### Load radar data

Now we want to load the radar data, to add a radar rainfall measurement to each column.

### Radar helper functions

In [51]:
def calc_lat_lon_coords(radar_cubes, target_grid_cube):
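    """
    Calculate the latitude and longitude of each radar grid cell in the
    coordinate system of the target grid, and add them to each radar
    cube as 2D auxiliary coordinates.
    :param radar_cubes: The radar cubes, all on the same grid, with dimensions (time, y, x)
    :param target_grid_cube: A cube on the target grid
    :return: 2D numpy arrays of the latitude and longitude of each radar cell
    """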
    radar_crs = radar_cubes[0].coord_system().as_cartopy_crs()

    # Create some helper arrays for converting from our radar grid to the mogreps-g grid
    X_radar,Y_radar = numpy.meshgrid(radar_cubes[0].coord('projection_x_coordinate').points, 
                                     radar_cubes[0].coord('projection_y_coordinate').points,)

    target_crs = target_grid_cube.coord_system().as_cartopy_crs()
    ret_val = target_crs.transform_points(
        radar_crs,
        X_radar,
        Y_radar,
    )        

    lat_vals = ret_val[:, :, 1]
    lon_vals = ret_val[:, :, 0]

    lon_coord = iris.coords.AuxCoord(
        lon_vals,
        standard_name='longitude',
        units='degrees',
    )
    lat_coord = iris.coords.AuxCoord(
        lat_vals,
        standard_name='latitude',
        units='degrees',
    )

    for rc1 in radar_cubes:
        rc1.add_aux_coord(lon_coord, [1, 2])
        rc1.add_aux_coord(lat_coord, [1, 2])

    return lat_vals, lon_vals


In [52]:
def calc_target_cube_indices(lat_vals, lon_vals, radar_cube, target_grid_cube):
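    """
    Calculate the latitude and longitude index in the target cube
    coordinate system of each grid square in the radar cube.
    :param lat_vals: A 2D array of the latitude of each radar grid cell
    :param lon_vals: A 2D array of the longitude of each radar grid cell
    :param radar_cube: The source radar cube for calculating the mapping
    :param target_grid_cube: The cube defining the target grid (its latitude and longitude coordinates must have bounds)
    :return: 2D numpy arrays with a mapping for each cell in the radar
    cube to the index in latitude and longitude of the target cube (-1 where
    a radar cell falls outside the target grid), and a 2D array with the
    number of radar cells in each target grid cell.
    """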
    lat_target_index = -1 * numpy.ones(
        (radar_cube.shape[1], radar_cube.shape[2]),
        dtype='int32',
    )
    lon_target_index = -1 * numpy.ones(
        (radar_cube.shape[1], radar_cube.shape[2]),
        dtype='int32',
    )

    num_cells = numpy.zeros((target_grid_cube.shape[0],
                             target_grid_cube.shape[1], ))
    for i_lon, bnd_lon in enumerate(
            target_grid_cube.coord('longitude').bounds):

        for i_lat, bnd_lat in enumerate(
                target_grid_cube.coord('latitude').bounds):
            arr1, arr2 = numpy.where((lat_vals >= bnd_lat[0]) &
                                     (lat_vals < bnd_lat[1]) &
                                     (lon_vals >= bnd_lon[0]) &
                                     (lon_vals < bnd_lon[1])
                                     )
            lon_target_index[arr1, arr2] = i_lon
            lat_target_index[arr1, arr2] = i_lat
            num_cells[i_lat, i_lon] = len(arr1)

    return lat_target_index, lon_target_index, num_cells
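
The double loop above scans the full radar grid once per target cell. As a rough sketch of an alternative (not the method used in this notebook), the same kind of index lookup can be done in one pass with `numpy.searchsorted`, assuming the target cell edges are contiguous and ascending. All `toy_*` names, the edge arrays and the helper `edges_to_index` are hypothetical.

```python
import numpy

# Hypothetical toy target grid: contiguous, ascending cell edges (an assumption)
lat_edges = numpy.array([50.0, 51.0, 52.0, 53.0])  # 3 latitude cells
lon_edges = numpy.array([-4.0, -3.0, -2.0])        # 2 longitude cells

# Hypothetical 2D latitude/longitude of each radar cell
toy_lat = numpy.array([[50.2, 50.7], [51.5, 53.5]])
toy_lon = numpy.array([[-3.5, -2.5], [-3.9, -2.1]])


def edges_to_index(values, edges):
    """Index of the half-open bin [edges[i], edges[i+1]) holding each value, or -1."""
    idx = numpy.searchsorted(edges, values, side='right') - 1
    outside = (values < edges[0]) | (values >= edges[-1])
    return numpy.where(outside, -1, idx).astype('int32')


toy_lat_index = edges_to_index(toy_lat, lat_edges)
toy_lon_index = edges_to_index(toy_lon, lon_edges)
```

Unlike `calc_target_cube_indices`, this sketch flags latitude and longitude separately, so a radar cell outside the grid in only one direction gets -1 for that index alone.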

In [53]:
radar_days = list(set([datetime.datetime(year=dt1.year, month=dt1.month, day=dt1.day, hour=0, minute=0, second=0) for dt1 in target_time_range]))
radar_days

[datetime.datetime(2020, 2, 17, 0, 0),
 datetime.datetime(2020, 2, 15, 0, 0),
 datetime.datetime(2020, 2, 14, 0, 0),
 datetime.datetime(2020, 2, 16, 0, 0)]

In [54]:
radar_fname_template = "{product}_{selected_day.year:04d}{selected_day.month:02d}{selected_day.day:02d}.nc"
product1 = 'composite_rainfall'



In [55]:
cl1 = iris.cube.CubeList([
    iris.load_cube(str(radar_data_dir / radar_fname_template.format(
        selected_day=dt1,
        product=product1)))
    for dt1 in radar_days])
iris.util.equalise_attributes(cl1)
radar_cube = cl1.concatenate_cube()
radar_cube


Rainfall Rate Composite (mm/h),time,projection_y_coordinate,projection_x_coordinate
Shape,1152,2175,1725
Dimension coordinates,,,
time,x,-,-
projection_y_coordinate,-,x,-
projection_x_coordinate,-,-,x
Auxiliary coordinates,,,
forecast_reference_time,x,-,-
Scalar coordinates,,,forecast_period 0 second
Attributes,,,Conventions CF-1.7 field_code 213 institution Met Office nimrod_version 2 probability_period_of_event 0 source Plr single site radars title Unknown


In [56]:
validity_times = target_time_range
validity_times

[datetime.datetime(2020, 2, 14, 18, 0),
 datetime.datetime(2020, 2, 15, 0, 0),
 datetime.datetime(2020, 2, 15, 6, 0),
 datetime.datetime(2020, 2, 15, 12, 0),
 datetime.datetime(2020, 2, 15, 18, 0),
 datetime.datetime(2020, 2, 16, 0, 0),
 datetime.datetime(2020, 2, 16, 6, 0),
 datetime.datetime(2020, 2, 16, 12, 0),
 datetime.datetime(2020, 2, 16, 18, 0),
 datetime.datetime(2020, 2, 17, 0, 0)]

Add some additional time coordinate information for subsequent processing.

In [57]:
iris.coord_categorisation.add_hour(radar_cube, coord='time')
iris.coord_categorisation.add_day_of_year(radar_cube, coord='time')

In [58]:
coord_3hr = iris.coords.AuxCoord(
    radar_cube.coord('hour').points // 3,
    long_name='3hr',
    units='hour',
)
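
The integer division by 3 above labels each hour of the day with a 3-hour bin index, so hours 0 to 2 fall in bin 0, hours 3 to 5 in bin 1, and so on up to bin 7. A small illustration with made-up hours (`toy_hours` and `toy_bins` are hypothetical names):

```python
import numpy

# Hours of the day for a few hypothetical radar time steps
toy_hours = numpy.array([0, 1, 2, 3, 5, 6, 23])

# Integer division groups the hours into eight 3-hour bins per day
toy_bins = toy_hours // 3
print(toy_bins)
```

The bin index repeats every day, which is why it needs to be combined with the day of year to identify a unique 3-hour window.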


We now need to get our radar data onto the same grid as our MOGREPS-G data. Iris has a regridding function, but it doesn't do exactly what we want, so we are going to calculate the values directly. The values we want are:
* the fraction of each grid box where a certain amount of precipitation (in a particular range) was recorded. This is essentially a histogram, but with the amounts normalised to add up to 1.0, rather than to the total number of samples as in a normal histogram.
* the maximum recorded rainfall in a grid box
* the average recorded rainfall in a grid box

To do this we will
* load in a sample of MOGREPS-G data as a target
* create latitude and longitude coordinates for the radar data, which doesn't have them initially because it is not on a lat/lon grid
* for each radar grid cell, calculate which MOGREPS-G cell it maps to
* for each accumulation range, calculate which radar cells fall in that range
  * count those cells for each MOGREPS-G cell, then divide by the total number of radar cells in that MOGREPS-G cell to get the normalised histogram value
* for each MOGREPS-G cell, also calculate the max and average.
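
The steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the values, the cell mapping and the band edges are invented (they are not the notebook's `rainfall_thresholds`), and the bands here are half-open so each value lands in exactly one band.

```python
import numpy

# Hypothetical radar values (mm) and the target-cell index of each radar cell
toy_rain = numpy.array([0.0, 0.1, 0.3, 3.0, 0.0, 8.0])
toy_target = numpy.array([0, 0, 0, 0, 1, 1])

# Hypothetical interior band edges, giving four half-open bands:
# [-inf, 0.25), [0.25, 2.5), [2.5, 7.0), [7.0, inf)
toy_interior_edges = numpy.array([0.25, 2.5, 7.0])
num_bands = len(toy_interior_edges) + 1

toy_results = {}
for cell in numpy.unique(toy_target):
    values = toy_rain[toy_target == cell]
    # band index of every radar value inside this target cell
    band_index = numpy.digitize(values, toy_interior_edges)
    counts = numpy.bincount(band_index, minlength=num_bands)
    toy_results[int(cell)] = {
        'fraction_in_band': counts / len(values),  # normalised histogram
        'max': values.max(),
        'mean': values.mean(),
    }
```

Because every value falls in exactly one band, the fractions for each target cell sum to 1.0.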



In [59]:
radar_cube.add_aux_coord(coord_3hr, data_dims=0)
radar_agg_3hr = radar_cube.aggregated_by(['3hr', 'day_of_year'],iris.analysis.SUM)
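time_coord_3hr = radar_agg_3hr.coord('time')
aux_coord1 = iris.coords.AuxCoord(
    [time_coord_3hr.units.date2num(c1.bound[0] + datetime.timedelta(hours=3))
     for c1 in time_coord_3hr.cells()],
    long_name='model_accum_time',
    # this is a time coordinate, so reuse the units of the radar time
    # coordinate rather than a rainfall rate unit
    units=time_coord_3hr.units,
)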
radar_agg_3hr.add_aux_coord(
    aux_coord1,
    data_dims=0)

Radar data is instantaneous rainfall rates, measured every 5 minutes. Model data is every three hours. To match these together, we will calculate "pseudo-accumulations" (pseudo because we're assuming that the instantaneous rate, divided by 12, represents a 5 minute accumulation, although the rain rate will not be constant over a 5 minute period). Something we could consider is a better statistical model to interpolate between measurements and calculate more accurate accumulations, but this is a starting point.

In [60]:
radar_agg_3hr.data = radar_agg_3hr.data * (1.0 / 12.0)
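
As a quick check of the arithmetic, here is a worked example with made-up numbers: a constant rate of 2 mm/h sampled every 5 minutes for 3 hours gives 36 samples, and summing them and dividing by 12 recovers the expected 6 mm. The names `toy_rates` and `toy_accum_mm` are hypothetical.

```python
import numpy

# Hypothetical rain rates (mm/h), one every 5 minutes for 3 hours (36 samples)
toy_rates = numpy.full(36, 2.0)

# Each rate is assumed to hold for 5 minutes, i.e. 1/12 of an hour
toy_accum_mm = toy_rates.sum() * (1.0 / 12.0)
print(toy_accum_mm)
```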

In [61]:
lat_vals, lon_vals = calc_lat_lon_coords(
    [radar_cube, radar_agg_3hr],
    target_grid_cube,
)



In [62]:
# remove these coordinates as they interfere with subsequent calculations
for coord_name in ['model_accum_time', 'forecast_reference_time', 'hour', 'day_of_year','3hr']:
    radar_agg_3hr.remove_coord(coord_name)


In [63]:
%%time
lat_target_index, lon_target_index, num_cells = calc_target_cube_indices(
    lat_vals=lat_vals,
    lon_vals=lon_vals,
    radar_cube=radar_cube,
    target_grid_cube=target_grid_cube,
)

CPU times: user 1min 11s, sys: 707 ms, total: 1min 12s
Wall time: 1min 12s


In [64]:
# Set up arrays to store regridded radar precip data
out_vars_dict = {'radar_fraction_in_band_aggregate_3hr': 'VECTOR',
                 'radar_fraction_in_band_instant': 'VECTOR',
                 'bands_mask': 'MASK_VECTOR',
                 'scalar_value_mask': 'MASK_SCALAR',
                 'radar_max_rain_aggregate_3hr': 'SCALAR',
                 'radar_mean_rain_aggregate_3hr': 'SCALAR',
                 'radar_max_rain_instant': 'SCALAR',
                 'radar_mean_rain_instant': 'SCALAR',
                 'fraction_sum_agg': 'SCALAR',
                 'fraction_sum_instant': 'SCALAR',
                 }


In [65]:
out_vars_long_names = {
    'radar_fraction_in_band_aggregate_3hr': 'Fraction of radar rainfall cells in specified 3hr aggregate rain band',
    'radar_fraction_in_band_instant': 'Fraction of radar rainfall cells in specified instant rain band',
    'radar_max_rain_aggregate_3hr': 'maximum 3hr aggregate rain in radar cells within mogreps-g cell',
    'radar_mean_rain_aggregate_3hr': 'average 3hr aggregate rain in radar cells within mogreps-g cell',
    'radar_max_rain_instant': 'maximum instant rain in radar cells within mogreps-g cell',
    'radar_mean_rain_instant': 'average instant rain in radar cells within mogreps-g cell',
    'fraction_sum_agg': 'Sum of fractions for each cell for aggregate 3hr data',
    'fraction_sum_instant': 'Sum of fractions for each cell for instant precip data',
}


Create some container arrays for the regridded output.

In [66]:
regridded_arrays_dict = {}
for var_name in [k1 for k1,v1 in out_vars_dict.items() if v1 == 'VECTOR']:
    regridded_arrays_dict[var_name] = numpy.zeros(
        [len(target_time_range),
         target_grid_cube.shape[0],
         target_grid_cube.shape[1],
         len(rainfall_thresholds)])

for var_name in [k1 for k1,v1 in out_vars_dict.items() if v1 == 'MASK_SCALAR']:
    regridded_arrays_dict[var_name] = numpy.ones(
        [len(target_time_range), target_grid_cube.shape[0],
         target_grid_cube.shape[1]],
        dtype='bool',
    )

for var_name in [k1 for k1,v1 in out_vars_dict.items() if v1 == 'MASK_VECTOR']:
    regridded_arrays_dict[var_name] = numpy.ones(
        [len(target_time_range),
         target_grid_cube.shape[0],
         target_grid_cube.shape[1],
         len(rainfall_thresholds)],
        # boolean, to match the scalar mask above
        dtype='bool',
    )

for var_name in [k1 for k1,v1 in out_vars_dict.items() if v1 == 'SCALAR']:
    regridded_arrays_dict[var_name] = numpy.zeros(
        [len(target_time_range),
         target_grid_cube.shape[0],
         target_grid_cube.shape[1]]
    )

In [67]:
%%time
for i_time, validity_time in enumerate(validity_times):
    print(f'Processing radar data for validity time {validity_time}')
    radar_select_time = radar_agg_3hr.extract(iris.Constraint(
        time=lambda c1: compare_time(c1.bound[0], validity_time)))
    masked_radar = numpy.ma.MaskedArray(
        radar_select_time.data.data,
        radar_agg_3hr[0].data.mask)

    radar_instant_select_time = radar_cube.extract(iris.Constraint(
        time=lambda c1: compare_time(c1.point, validity_time)))
    masked_radar_instant = numpy.ma.MaskedArray(
        radar_instant_select_time.data.data,
        radar_cube[0].data.mask)
    for i_lat in range(target_grid_cube.shape[0]):
        for i_lon in range(target_grid_cube.shape[1]):
            selected_cells = (~(radar_select_time.data.mask)) & \
                             (lat_target_index == i_lat) & (
                                         lon_target_index == i_lon)
            
            masked_radar.mask = ~selected_cells
            masked_radar_instant.mask = ~selected_cells

            radar_cells_in_mg = numpy.count_nonzero(selected_cells)
            # only proceed with processing for this target grid cell
            # if there are some radar grid cells within this target
            # grid cell
            if radar_cells_in_mg > 0:
                # set the values for this location to be unmasked,
                # as we have valid radar values for this location
                regridded_arrays_dict['bands_mask'][i_time, i_lat, i_lon, :] = False
                regridded_arrays_dict['scalar_value_mask'][i_time, i_lat, i_lon] = False
                for imp_ix, (imp_key, imp_bounds) in enumerate(
                        rainfall_thresholds.items()):
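                    # calculate fraction in band for 3 hour aggregate data
                    # (both band bounds are inclusive, so if adjacent bands
                    # share an edge, a value exactly on that edge is counted
                    # in both and the fractions can sum to just over 1.0)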
                    num_in_band_agg = numpy.count_nonzero(
                        (masked_radar.compressed() >= imp_bounds[0]) &
                        (masked_radar.compressed() <= imp_bounds[1]) )
                    regridded_arrays_dict['radar_fraction_in_band_aggregate_3hr'][
                        i_time, i_lat, i_lon, imp_ix] = num_in_band_agg / (len(masked_radar.compressed()))

                    # calculate fraction in band for instant radar data
                    num_in_band_instant = numpy.count_nonzero(
                        (masked_radar_instant.compressed() >= imp_bounds[0]) &
                        (masked_radar_instant.compressed() <= imp_bounds[1]) )
                    regridded_arrays_dict['radar_fraction_in_band_instant'][i_time, i_lat, i_lon, imp_ix] = num_in_band_instant / (len(masked_radar_instant.compressed()))
                regridded_arrays_dict['fraction_sum_agg'][i_time, i_lat, i_lon] = regridded_arrays_dict['radar_fraction_in_band_aggregate_3hr'][i_time, i_lat, i_lon, :].sum()
                regridded_arrays_dict['fraction_sum_instant'][i_time, i_lat, i_lon] = regridded_arrays_dict['radar_fraction_in_band_instant'][i_time, i_lat, i_lon, :].sum()

                # calculate the max and average of all radar cells within each mogreps-g cell
                regridded_arrays_dict['radar_max_rain_aggregate_3hr'][i_time, i_lat, i_lon] = masked_radar.max()
                regridded_arrays_dict['radar_mean_rain_aggregate_3hr'][i_time, i_lat, i_lon] = (masked_radar.sum()) / radar_cells_in_mg

                # create instant radar rate feature data
                regridded_arrays_dict['radar_max_rain_instant'][i_time, i_lat, i_lon] = masked_radar_instant.max()
                regridded_arrays_dict['radar_mean_rain_instant'][i_time, i_lat, i_lon] = (masked_radar_instant.sum()) / radar_cells_in_mg




Processing radar data for validity time 2020-02-14 18:00:00
Processing radar data for validity time 2020-02-15 00:00:00
Processing radar data for validity time 2020-02-15 06:00:00
Processing radar data for validity time 2020-02-15 12:00:00
Processing radar data for validity time 2020-02-15 18:00:00
Processing radar data for validity time 2020-02-16 00:00:00
Processing radar data for validity time 2020-02-16 06:00:00
Processing radar data for validity time 2020-02-16 12:00:00
Processing radar data for validity time 2020-02-16 18:00:00
Processing radar data for validity time 2020-02-17 00:00:00
CPU times: user 25min 6s, sys: 18.6 s, total: 25min 25s
Wall time: 25min 43s


In [68]:
total_num_pts = (regridded_arrays_dict['fraction_sum_instant'].shape[0] *
                 regridded_arrays_dict['fraction_sum_instant'].shape[1] *
                 regridded_arrays_dict['fraction_sum_instant'].shape[2])

In [69]:
target_lat_coord = target_grid_cube.coord('latitude')
target_lon_coord = target_grid_cube.coord('longitude')

In [70]:
band_coord = iris.coords.DimCoord(
    [float(b1) for b1 in rainfall_thresholds.keys()],
    bounds=list(rainfall_thresholds.values()),
    var_name='band',
    units='mm',
)
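radar_time_coord = iris.coords.DimCoord(
    # convert with the coordinate's own units rather than datetime.timestamp(),
    # which interprets naive datetimes in the machine's local time zone
    [radar_cube.coord('time').units.date2num(vt) for vt in
     validity_times],
    var_name='time',
    units=radar_cube.coord('time').units,
)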

In [71]:
radar_regrided_cubes = {}

In [72]:
for var_name in [k1 for k1, v1 in out_vars_dict.items() if v1 == 'VECTOR']:
    radar_regrided_cubes[var_name] = iris.cube.Cube(
        data=numpy.ma.MaskedArray(data=regridded_arrays_dict[var_name],
                                  mask=regridded_arrays_dict['bands_mask'],
                                  ),
        dim_coords_and_dims=(
            (radar_time_coord, 0), (target_lat_coord, 1),
            (target_lon_coord, 2),
            (band_coord, 3)),
        units=None,
        var_name=var_name,
        long_name=out_vars_long_names[var_name],
    )


In [73]:
radar_regrided_cubes['num_cells_cube'] = iris.cube.Cube(
    data=num_cells,
    dim_coords_and_dims=(
     (target_lat_coord, 0), (target_lon_coord, 1),),
    var_name='num_radar_cells',
)

In [74]:
for var_name in [k1 for k1, v1 in out_vars_dict.items() if v1 == 'SCALAR']:
    radar_regrided_cubes[var_name] = iris.cube.Cube(
        data=numpy.ma.MaskedArray(data=regridded_arrays_dict[var_name],
                                  mask=regridded_arrays_dict[
                                      'scalar_value_mask'],
                                  ),
        dim_coords_and_dims=(
            (radar_time_coord, 0), (target_lat_coord, 1),
            (target_lon_coord, 2),),
        units='mm',
        var_name=var_name,
        long_name=out_vars_long_names[var_name],
    )


In [75]:
cubelist_to_save = iris.cube.CubeList(radar_regrided_cubes.values())

In [76]:
rain_bands = list(rainfall_thresholds.keys())


In [77]:
vector_var_dataframes = {
    var_name: xarray.DataArray.from_iris(
    radar_regrided_cubes[var_name]).to_dataframe().reset_index()
    for var_name, var_type in out_vars_dict.items() if var_type == 'VECTOR'
}


In [78]:
scalar_cube_list = [radar_regrided_cubes[k1] for k1,v1 in out_vars_dict.items() if v1 == 'SCALAR']


First, merge the mean and max scalar fields (scalar in the sense that each grid cell has a scalar value, unlike the rain band fractions, where each grid cell has a vector of outputs).

In [79]:
radar_df = functools.reduce(
    lambda x, y: pandas.merge(x, y, on=('latitude',
                                        'longitude',
                                        'time')),
    (xarray.DataArray.from_iris(arr1).to_dataframe().reset_index()
     for arr1 in scalar_cube_list))


Next, merge in the fractions for each intensity band, one band at a time.

In [80]:
for var_name in [k1 for k1,v1 in out_vars_dict.items() if v1 == 'VECTOR']:
    for band1 in rain_bands:
        vector_df = vector_var_dataframes[var_name]
        df1 = vector_df[vector_df['band'] == float(band1)][
            ['time', 'latitude', 'longitude', var_name]]
        df1 = df1.rename({var_name: f'{var_name}_{band1}'},
                         axis='columns')
        radar_df = pandas.merge(radar_df, df1,
                                on=['time', 'latitude', 'longitude'])
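
The loop above turns one row per band into one column per band. As a sketch of the same long-to-wide reshaping on a toy table (an alternative formulation, not the notebook's method), pandas can do this with `set_index` and `unstack`. The table `toy_long` and the column `toy_fraction` are hypothetical.

```python
import pandas

# Hypothetical long-format table: one row per (time, latitude, longitude, band)
toy_long = pandas.DataFrame({
    'time': ['t0'] * 4,
    'latitude': [50.0, 50.0, 51.0, 51.0],
    'longitude': [-3.0] * 4,
    'band': [0.0, 0.25, 0.0, 0.25],
    'toy_fraction': [0.9, 0.1, 0.4, 0.6],
})

# Move the band level from the row index into the columns
toy_wide = (
    toy_long
    .set_index(['time', 'latitude', 'longitude', 'band'])['toy_fraction']
    .unstack('band')
)
# Rename the band columns to include the variable name, as in the loop above
toy_wide.columns = [f'toy_fraction_{band}' for band in toy_wide.columns]
toy_wide = toy_wide.reset_index()
```

The result has one row per time and location, with a separate feature column for each band.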

Find where the fields are NaN, and exclude those rows from the table. These represent the values masked out because there are no radar cells in the particular MOGREPS-G cells.

In [81]:
selected_var = [k1 for k1, v1 in out_vars_dict.items() if v1 == 'SCALAR'][0]

# use any output variable to find where there are NaNs and exclude those data points
radar_df = radar_df[~(radar_df[selected_var].isna())]


In [82]:
radar_df

Unnamed: 0,time,latitude,longitude,radar_max_rain_aggregate_3hr,radar_mean_rain_aggregate_3hr,radar_max_rain_instant,radar_mean_rain_instant,fraction_sum_agg,fraction_sum_instant,radar_fraction_in_band_aggregate_3hr_0.0,radar_fraction_in_band_aggregate_3hr_0.25,radar_fraction_in_band_aggregate_3hr_2.5,radar_fraction_in_band_aggregate_3hr_7.0,radar_fraction_in_band_aggregate_3hr_10.0,radar_fraction_in_band_instant_0.0,radar_fraction_in_band_instant_0.25,radar_fraction_in_band_instant_2.5,radar_fraction_in_band_instant_7.0,radar_fraction_in_band_instant_10.0
0,2020-02-14 18:00:00,49.21875,-5.765625,2.153646,1.401717,0.18750,0.005257,1.000000,1.000000,0.000000,0.000000,1.000000,0.0,0.0,0.901869,0.098131,0.000000,0.0,0.0
1,2020-02-14 18:00:00,49.21875,-5.484375,2.083333,0.971208,3.65625,0.458528,1.004673,1.009346,0.000000,0.116822,0.887850,0.0,0.0,0.091121,0.623832,0.294393,0.0,0.0
2,2020-02-14 18:00:00,49.21875,-5.203125,1.343750,0.351861,0.56250,0.006733,1.002342,1.002342,0.000000,0.697892,0.304450,0.0,0.0,0.957845,0.039813,0.004684,0.0,0.0
3,2020-02-14 18:00:00,49.21875,-4.921875,0.638021,0.066473,0.25000,0.022311,1.000000,1.000000,0.004651,0.990698,0.004651,0.0,0.0,0.732558,0.267442,0.000000,0.0,0.0
4,2020-02-14 18:00:00,49.21875,-4.640625,0.117188,0.050964,0.37500,0.098934,1.000000,1.000000,0.147196,0.852804,0.000000,0.0,0.0,0.394860,0.605140,0.000000,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16111,2020-02-17 00:00:00,58.78125,0.421875,0.000000,0.000000,0.00000,0.000000,1.000000,1.000000,1.000000,0.000000,0.000000,0.0,0.0,1.000000,0.000000,0.000000,0.0,0.0
16112,2020-02-17 00:00:00,58.78125,0.703125,0.000000,0.000000,0.00000,0.000000,1.000000,1.000000,1.000000,0.000000,0.000000,0.0,0.0,1.000000,0.000000,0.000000,0.0,0.0
16113,2020-02-17 00:00:00,58.78125,0.984375,0.000000,0.000000,0.00000,0.000000,1.000000,1.000000,1.000000,0.000000,0.000000,0.0,0.0,1.000000,0.000000,0.000000,0.0,0.0
16114,2020-02-17 00:00:00,58.78125,1.265625,0.000000,0.000000,0.00000,0.000000,1.000000,1.000000,1.000000,0.000000,0.000000,0.0,0.0,1.000000,0.000000,0.000000,0.0,0.0


In [83]:
list(radar_df.columns)

['time',
 'latitude',
 'longitude',
 'radar_max_rain_aggregate_3hr',
 'radar_mean_rain_aggregate_3hr',
 'radar_max_rain_instant',
 'radar_mean_rain_instant',
 'fraction_sum_agg',
 'fraction_sum_instant',
 'radar_fraction_in_band_aggregate_3hr_0.0',
 'radar_fraction_in_band_aggregate_3hr_0.25',
 'radar_fraction_in_band_aggregate_3hr_2.5',
 'radar_fraction_in_band_aggregate_3hr_7.0',
 'radar_fraction_in_band_aggregate_3hr_10.0',
 'radar_fraction_in_band_instant_0.0',
 'radar_fraction_in_band_instant_0.25',
 'radar_fraction_in_band_instant_2.5',
 'radar_fraction_in_band_instant_7.0',
 'radar_fraction_in_band_instant_10.0']

In [84]:
merged_dataset = pandas.merge(mogreps_g_single_ts_uk_df, radar_df, on=['latitude', 'longitude', 'time'])
merged_dataset

Unnamed: 0,realization,latitude,longitude,forecast_period_x,forecast_reference_time_x,time,cloud_area_fraction,surface_altitude,air_pressure_at_sea_level,rainfall_rate,...,radar_fraction_in_band_aggregate_3hr_0.0,radar_fraction_in_band_aggregate_3hr_0.25,radar_fraction_in_band_aggregate_3hr_2.5,radar_fraction_in_band_aggregate_3hr_7.0,radar_fraction_in_band_aggregate_3hr_10.0,radar_fraction_in_band_instant_0.0,radar_fraction_in_band_instant_0.25,radar_fraction_in_band_instant_2.5,radar_fraction_in_band_instant_7.0,radar_fraction_in_band_instant_10.0
0,0,49.40625,-5.484375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.000000,0.0,101512.0,0.137463,...,0.0,0.0,1.0,0.0,0.0,0.094118,0.755294,0.16,0.0,0.0
1,1,49.40625,-5.484375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.000000,0.0,101522.0,0.301749,...,0.0,0.0,1.0,0.0,0.0,0.094118,0.755294,0.16,0.0,0.0
2,2,49.40625,-5.484375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.000000,0.0,101543.0,0.301749,...,0.0,0.0,1.0,0.0,0.0,0.094118,0.755294,0.16,0.0,0.0
3,3,49.40625,-5.484375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.000000,0.0,101549.0,0.204518,...,0.0,0.0,1.0,0.0,0.0,0.094118,0.755294,0.16,0.0,0.0
4,4,49.40625,-5.484375,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,1.000000,0.0,101533.0,0.177696,...,0.0,0.0,1.0,0.0,0.0,0.094118,0.755294,0.16,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26707,13,58.78125,1.546875,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,0.968750,0.0,99930.0,0.000000,...,1.0,0.0,0.0,0.0,0.0,1.000000,0.000000,0.00,0.0,0.0
26708,14,58.78125,1.546875,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,0.656250,0.0,99869.0,0.000000,...,1.0,0.0,0.0,0.0,0.0,1.000000,0.000000,0.00,0.0,0.0
26709,15,58.78125,1.546875,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,0.890625,0.0,99901.0,0.000000,...,1.0,0.0,0.0,0.0,0.0,1.000000,0.000000,0.00,0.0,0.0
26710,16,58.78125,1.546875,0 days 06:00:00,2020-02-14 12:00:00,2020-02-14 18:00:00,0.968750,0.0,99843.0,0.000000,...,1.0,0.0,0.0,0.0,0.0,1.000000,0.000000,0.00,0.0,0.0
