# Demo of `create_climpred_data`

This demo shows how to set up raw output from a climate model to match `climpred`'s expectations.

In [None]:
from dask.distributed import Client
import multiprocessing
ncpu = multiprocessing.cpu_count()
threads = 6
nworker = ncpu//threads
print(f'Number of CPUs: {ncpu}, number of threads: {threads}, number of workers: {nworker}')

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import xarray as xr

import climpred

In [None]:
from climpred.preprocessing.shared import load_hindcast, climpred_preprocess_internal
from climpred.preprocessing.mpi import get_path

Assuming your raw model output is stored in multiple files per member and initialization, `load_hindcast` is a convenient wrapper around `get_path`, designed for the output format of `MPI-ESM`, that aggregates all hindcast output into one dataset as expected by `climpred`.

The basic idea is to loop over the output of all members and concatenate, then loop over all initializations and concatenate. Before concatenation, it is important to align the `time` dimension.
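
The nested concatenation can be sketched on synthetic data. This is a minimal illustration, not the implementation of `load_hindcast`: the helpers `make_fake_output` and `align_time` and the variable name `var` are hypothetical stand-ins for reading real files.

```python
import numpy as np
import xarray as xr


def make_fake_output(init_year, member, nlead=3):
    """Hypothetical stand-in for one member's raw output (annual values)."""
    time = np.arange(init_year + 1, init_year + 1 + nlead)
    data = np.full(nlead, float(member))
    return xr.Dataset({"var": ("time", data)}, coords={"time": time})


def align_time(ds):
    """Replace calendar years with an integer axis 1..n so all inits line up."""
    ds = ds.assign_coords(time=np.arange(1, ds.time.size + 1))
    return ds.rename({"time": "lead"})


inits = [1961, 1962]
members = [1, 2]

init_list = []
for init in inits:
    # inner loop: concatenate all members of one initialization
    member_list = [align_time(make_fake_output(init, m)) for m in members]
    one_init = xr.concat(member_list, dim="member").assign_coords(member=members)
    init_list.append(one_init)

# outer loop result: concatenate all initializations
hind = xr.concat(init_list, dim="init").assign_coords(init=inits)
print(hind["var"].dims)
```

Without `align_time`, each initialization would carry different calendar years on `time`, and concatenating along `init` would not produce a common axis.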

To reduce the data size, use the `preprocess` function passed to `xr.open_mfdataset` in combination with `set_integer_axis`, e.g. to extract only a certain region or only a few variables from a multi-variable input file, as in MPI-ESM standard output.
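
As a rough sketch of such a reduction, the function below keeps selected variables and a latitude/longitude box. The name `preprocess_region`, its default bounds, and the synthetic dataset are assumptions for illustration. It also assumes `lat` and `lon` are monotonically increasing 1D coordinates, which may not hold for raw model grids.

```python
import numpy as np
import xarray as xr


def preprocess_region(ds, variables=("tas",), lat_bounds=(-10, 10), lon_bounds=(0, 20)):
    """Keep only selected variables inside a lat/lon box (hypothetical helper)."""
    ds = ds[list(variables)]
    return ds.sel(lat=slice(*lat_bounds), lon=slice(*lon_bounds))


# synthetic multi-variable dataset on a coarse regular grid
lat = np.arange(-90, 91, 10)
lon = np.arange(0, 360, 10)
shape = (lat.size, lon.size)
full = xr.Dataset(
    {
        "tas": (("lat", "lon"), np.zeros(shape)),
        "pr": (("lat", "lon"), np.ones(shape)),
    },
    coords={"lat": lat, "lon": lon},
)

small = preprocess_region(full)
print(small.sizes)
```

A function like this could be passed as `preprocess=` so that each file is reduced before it is concatenated.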

In [None]:
get_path?

In [None]:
set_integer_axis?

In [None]:
load_hindcast?

In [None]:
def preprocess_1var(ds, v="global_primary_production"):
    return ds[v].to_dataset(name=v).squeeze()

In [None]:
%time ds = load_hindcast(preprocess=preprocess_1var)

In [7]:
ds.coords

Processing init 1961 ...
Processing init 1962 ...
Processing init 1963 ...
Processing init 1964 ...


Coordinates:
    lat      float64 0.0
    lon      float64 0.0
    depth    float64 0.0
  * lead     (lead) int64 1 2 3 4 5 6 7 8 9 10
  * member   (member) int64 1 2
  * init     (init) int64 1961 1962 1963 1964

In [8]:
# calc skill lazily
climpred.prediction.compute_perfect_model(ds, ds.rename({'lead':'time'}))

In [7]:
# loading the data into memory
%time ds = ds.load()

# `intake-esm` for cmorized output

If you have access to cmorized output of CMIP experiments, consider using `intake-esm`. With the `preprocess` function you can align the `time` dimension of the output. Finally, `rename_to_climpred_dims` only renames the dimensions.

In [9]:
from climpred.create_climpred_data import rename_to_climpred_dims, climpred_preprocess_internal

In [10]:
import intake

In [11]:
col_url = "/home/mpim/m300524/intake-esm-datastore/catalogs/mistral-cmip6.json"
col = intake.open_esm_datastore(col_url)

In [19]:
col.df.columns

Index(['activity_id', 'institution_id', 'source_id', 'experiment_id',
       'member_id', 'table_id', 'variable_id', 'grid_label', 'dcpp_init_year',
       'version', 'time_range', 'path'],
      dtype='object')

In [20]:
# load 2 members for 2 inits from one model
query = dict(experiment_id=[
    'dcppA-hindcast'], table_id='Amon', member_id=['r1i1p1f1', 'r2i1p1f1'], dcpp_init_year=[1970, 1971],
    variable_id='tas', source_id='MPI-ESM1-2-HR')
cat = col.search(**query)
cdf_kwargs = {'chunks': {'time': 12}, 'decode_times': False}

In [21]:
cat.df.head()

| | activity_id | institution_id | source_id | experiment_id | member_id | table_id | variable_id | grid_label | dcpp_init_year | version | time_range | path |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | DCPP | MPI-M | MPI-ESM1-2-HR | dcppA-hindcast | r1i1p1f1 | Amon | tas | gn | 1971.0 | v20190906 | 197111-198112 | /work/ik1017/CMIP6/data/CMIP6/DCPP/MPI-M/MPI-E... |
| 1 | DCPP | MPI-M | MPI-ESM1-2-HR | dcppA-hindcast | r1i1p1f1 | Amon | tas | gn | 1970.0 | v20190906 | 197011-198012 | /work/ik1017/CMIP6/data/CMIP6/DCPP/MPI-M/MPI-E... |
| 2 | DCPP | MPI-M | MPI-ESM1-2-HR | dcppA-hindcast | r2i1p1f1 | Amon | tas | gn | 1971.0 | v20190906 | 197111-198112 | /work/ik1017/CMIP6/data/CMIP6/DCPP/MPI-M/MPI-E... |
| 3 | DCPP | MPI-M | MPI-ESM1-2-HR | dcppA-hindcast | r2i1p1f1 | Amon | tas | gn | 1970.0 | v20190906 | 197011-198012 | /work/ik1017/CMIP6/data/CMIP6/DCPP/MPI-M/MPI-E... |


In [13]:
def preprocess(ds):
    # extract tiny spatial and temporal subset
    ds = ds.isel(lon=[50, 51, 52], lat=[50, 51, 52],
                 time=np.arange(12 * 2))
    # make time dim identical
    ds = climpred_preprocess_internal(ds)
    return ds

In [14]:
dset_dict = cat.to_dataset_dict(
    cdf_kwargs=cdf_kwargs, preprocess=preprocess)

In [14]:
# get first dict value
_, ds = dset_dict.popitem()
ds.coords

--> The keys in the returned dictionary of datasets are constructed as follows:
	'activity_id.institution_id.source_id.experiment_id.table_id.grid_label'
             
--> There are 1 group(s)


Coordinates:
    height          float64 2.0
  * lon             (lon) float64 46.88 47.81 48.75
  * dcpp_init_year  (dcpp_init_year) float64 1.97e+03 1.971e+03
  * time            (time) int64 1 2 3 4 5 6 7 8 9 ... 17 18 19 20 21 22 23 24
  * lat             (lat) float64 -42.55 -41.61 -40.68
  * member_id       (member_id) <U8 'r1i1p1f1' 'r2i1p1f1'

In [15]:
ds = rename_to_climpred_dims(ds)
ds.coords

Coordinates:
    height   float64 2.0
  * lon      (lon) float64 46.88 47.81 48.75
  * init     (init) float64 1.97e+03 1.971e+03
  * lead     (lead) int64 1 2 3 4 5 6 7 8 9 10 ... 15 16 17 18 19 20 21 22 23 24
  * lat      (lat) float64 -42.55 -41.61 -40.68
  * member   (member) <U8 'r1i1p1f1' 'r2i1p1f1'

In [16]:
# You may actually want to use `compute_hindcast` to calculate skill from a hindcast;
# for this you also need an `observation` to compare to.
# Here, `compute_perfect_model` compares each member in turn to the ensemble mean of the remaining members.
climpred.prediction.compute_perfect_model(ds, ds.rename({'lead':'time'}))

  r = r_num / r_den
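
The member-versus-ensemble-mean comparison described in the comments above can be illustrated with a small leave-one-out sketch on random data. This is an assumption-laden illustration, not `climpred`'s implementation: the helper `leave_one_out_corr` is hypothetical, it uses Pearson correlation over `init`, and it simply averages the per-member correlations.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
da = xr.DataArray(
    rng.normal(size=(5, 4, 3)),
    dims=("init", "member", "lead"),
    coords={
        "init": np.arange(1961, 1966),
        "member": np.arange(1, 5),
        "lead": np.arange(1, 4),
    },
)


def leave_one_out_corr(da):
    """Correlate each member with the mean of the others (hypothetical helper)."""
    skills = []
    for m in da.member.values:
        # the held-out member acts as the verification
        verif = da.sel(member=m, drop=True)
        # the remaining members form the forecast ensemble mean
        fct = da.drop_sel(member=m).mean("member")
        skills.append(xr.corr(fct, verif, dim="init"))
    # average the per-member correlations into one value per lead
    return xr.concat(skills, dim="member").mean("member")


skill = leave_one_out_corr(da)
print(skill.dims)
```

With independent random members, the resulting correlations should scatter around zero. A constant time series would give a zero denominator in the correlation, which is one way a division warning like the one above can arise.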


In [None]:
# loading the data into memory
%time ds = ds.load()