## Building an MPI-GE Collection Catalog


Building an MPI-GE collection catalog follows the same steps as building a CMIP or CESM collection catalog:

- Define a collection catalog in a YAML file or a nested dictionary.
- Pass this collection definition to ``intake.open_esm_metadatastore()``.
- Use the built collection catalog.

For demonstration purposes, we are going to use sample data from the MPI-GE project.

In [1]:
import intake



In [2]:
cdefinition = {
    'name': 'mpige_test',
    'collection_type': 'mpige',
    'include_cache_dir': False,
    'data_sources': {
        'pictrl': {
            'locations': [
                {
                    'name': 'SAMPLE-DATA',
                    'loc_type': 'posix',
                    'direct_access': True,
                    'urlpath': '../../../tests/sample_data/mpi-ge/pictrl',
                    'exclude_dirs': ['*/restart/*', '*/log/*'],
                }
            ],
            'component_attrs': {'mpiom': {'grid': 'POP_gx1v6'}},
            'case_members': [
                {'case': 'pictrl0001', 'sequence_order': 0, 'ensemble': 1, 'year_offset': 1448}
            ],
        },
        'hist': {
            'locations': [
                {
                    'name': 'SAMPLE-DATA',
                    'loc_type': 'posix',
                    'direct_access': True,
                    'urlpath': '../../../tests/sample_data/mpi-ge/hist',
                    'exclude_dirs': ['*/restart/*', '*/log/*'],
                }
            ],
            'component_attrs': {'mpiom': {'grid': 'POP_gx1v6'}},
            'case_members': [
                {'case': 'hist0001', 'sequence_order': 0, 'ensemble': 1},
                {'case': 'hist0002', 'sequence_order': 0, 'ensemble': 2},
                {'case': 'hist0003', 'sequence_order': 0, 'ensemble': 3},
            ],
        },
        'rcp85': {
            'locations': [
                {
                    'name': 'SAMPLE-DATA',
                    'loc_type': 'posix',
                    'direct_access': True,
                    'urlpath': '../../../tests/sample_data/mpi-ge/rcp85',
                    'exclude_dirs': ['*/restart/*', '*/log/*'],
                }
            ],
            'component_attrs': {'mpiom': {'grid': 'POP_gx1v6'}},
            'case_members': [
                {'case': 'rcp850001', 'sequence_order': 1, 'ensemble': 1},
                {'case': 'rcp850002', 'sequence_order': 1, 'ensemble': 2},
                {'case': 'rcp850003', 'sequence_order': 1, 'ensemble': 3},
            ],
        },
    },
}
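
Before building, it can help to check which experiments and cases a definition actually declares. The following is a minimal sketch, not part of intake-esm: the helper `summarize_definition` and the trimmed-down `example` dictionary are hypothetical, and the sketch only assumes the nested layout shown above (`data_sources` → experiment → `case_members`).

```python
def summarize_definition(definition):
    """Return (experiment, case, ensemble) tuples declared in a definition.

    Hypothetical helper for illustration; it only walks the nested
    dictionary and does not touch the file system or intake-esm.
    """
    rows = []
    for experiment, source in definition["data_sources"].items():
        for member in source.get("case_members", []):
            rows.append((experiment, member["case"], member["ensemble"]))
    return rows


# Trimmed-down stand-in with the same layout as `cdefinition` above.
example = {
    "name": "mpige_test",
    "collection_type": "mpige",
    "data_sources": {
        "pictrl": {
            "case_members": [
                {"case": "pictrl0001", "sequence_order": 0, "ensemble": 1}
            ]
        },
        "hist": {
            "case_members": [
                {"case": "hist0001", "sequence_order": 0, "ensemble": 1},
                {"case": "hist0002", "sequence_order": 0, "ensemble": 2},
            ]
        },
    },
}

for row in summarize_definition(example):
    print(row)
```

Calling `summarize_definition(cdefinition)` on the full definition should list one row per entry in each `case_members` list.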


### Building the Collection

The build method loops over all experiments and over each ensemble member within them.
It attempts to parse each file name; files whose names cannot be parsed are skipped with a warning.
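
The parse-or-skip behaviour can be illustrated with a small sketch. This is not intake-esm's actual parser: the regular expression, `FILENAME_PATTERN`, and `parse_basename` are assumptions inferred from the sample file names in this notebook (for example `rcp850003_mpiom_data_2d_mm_20060101_20061231.nc`), where the name appears to encode case, component, stream, and a start and end date.

```python
import re
import warnings

# Assumed layout: <case>_<component>_<stream>_<YYYYMMDD>_<YYYYMMDD>.nc
# The stream itself may contain underscores (e.g. "data_2d_mm").
FILENAME_PATTERN = re.compile(
    r"^(?P<case>[^_]+)_(?P<component>[^_]+)_(?P<stream>.+)"
    r"_(?P<start>\d{8})_(?P<end>\d{8})\.nc$"
)


def parse_basename(basename):
    """Split a file name into metadata fields, or warn and return None."""
    match = FILENAME_PATTERN.match(basename)
    if match is None:
        warnings.warn(f"Could not parse {basename!r}; skipping")
        return None
    fields = match.groupdict()
    # Join the two dates in the same form as the date_range column.
    fields["date_range"] = f"{fields.pop('start')}-{fields.pop('end')}"
    return fields


print(parse_basename("rcp850003_mpiom_data_2d_mm_20060101_20061231.nc"))
```

A file whose name does not fit the assumed pattern (say, a stray `README.txt`) produces a warning and is left out, which mirrors the skipping described above.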

In [3]:
col = intake.open_esm_metadatastore(
    collection_input_definition=cdefinition,
    overwrite_existing=True,
)

Working on experiment: pictrl
Getting file listing : SAMPLE-DATA:posix:../../../tests/sample_data/mpi-ge/pictrl
Building file database : SAMPLE-DATA:posix:../../../tests/sample_data/mpi-ge/pictrl
Working on experiment: hist
Getting file listing : SAMPLE-DATA:posix:../../../tests/sample_data/mpi-ge/hist
Building file database : SAMPLE-DATA:posix:../../../tests/sample_data/mpi-ge/hist
Working on experiment: rcp85
Getting file listing : SAMPLE-DATA:posix:../../../tests/sample_data/mpi-ge/rcp85
Building file database : SAMPLE-DATA:posix:../../../tests/sample_data/mpi-ge/rcp85
None


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 16 columns):
resource            84 non-null object
resource_type       84 non-null object
direct_access       84 non-null object
experiment          84 non-null object
case                84 non-null object
component           84 non-null object
stream              84 non-null object
date_range          84 non-null object
ensemble            84 non-null object
file_fullpath       84 non-null object
file_basename       84 non-null object
file_dirname        84 non-null object
ctrl_branch_year    0 non-null object
year_offset         12 non-null object
sequence_order      84 non-null object
grid                42 non-null object
dtypes: object(16)
memory usage: 10.6+ KB
Persisting mpige_test at : /Users/abanihi/.intake_esm/collections/mpige/mpige_test.mpige.csv


### Using the Built Collection

In [4]:
col.df.head()

| | resource | resource_type | direct_access | experiment | case | component | stream | date_range | ensemble | file_fullpath | file_basename | file_dirname | ctrl_branch_year | year_offset | sequence_order | grid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SAMPLE-DATA:posix:../../../tests/sample_data/m... | posix | True | rcp85 | rcp850003 | mpiom | data_2d_mm | 20060101-20061231 | 3 | ../../../tests/sample_data/mpi-ge/rcp85/rcp850... | rcp850003_mpiom_data_2d_mm_20060101_20061231.nc | ../../../tests/sample_data/mpi-ge/rcp85/rcp850... | | | 1 | POP_gx1v6 |
| 1 | SAMPLE-DATA:posix:../../../tests/sample_data/m... | posix | True | rcp85 | rcp850003 | mpiom | data_2d_mm | 20080101-20081231 | 3 | ../../../tests/sample_data/mpi-ge/rcp85/rcp850... | rcp850003_mpiom_data_2d_mm_20080101_20081231.nc | ../../../tests/sample_data/mpi-ge/rcp85/rcp850... | | | 1 | POP_gx1v6 |
| 2 | SAMPLE-DATA:posix:../../../tests/sample_data/m... | posix | True | rcp85 | rcp850003 | mpiom | monitoring_ym | 20070101-20071231 | 3 | ../../../tests/sample_data/mpi-ge/rcp85/rcp850... | rcp850003_mpiom_monitoring_ym_20070101_2007123... | ../../../tests/sample_data/mpi-ge/rcp85/rcp850... | | | 1 | POP_gx1v6 |
| 3 | SAMPLE-DATA:posix:../../../tests/sample_data/m... | posix | True | rcp85 | rcp850003 | mpiom | monitoring_ym | 20060101-20061231 | 3 | ../../../tests/sample_data/mpi-ge/rcp85/rcp850... | rcp850003_mpiom_monitoring_ym_20060101_2006123... | ../../../tests/sample_data/mpi-ge/rcp85/rcp850... | | | 1 | POP_gx1v6 |
| 4 | SAMPLE-DATA:posix:../../../tests/sample_data/m... | posix | True | rcp85 | rcp850003 | mpiom | monitoring_ym | 20080101-20081231 | 3 | ../../../tests/sample_data/mpi-ge/rcp85/rcp850... | rcp850003_mpiom_monitoring_ym_20080101_2008123... | ../../../tests/sample_data/mpi-ge/rcp85/rcp850... | | | 1 | POP_gx1v6 |


Now you can query the collection catalog and load the datasets of interest into xarray objects:

In [5]:
cat = col.search(
    component=['mpiom', 'hamocc'],
    stream='monitoring_ym',
    experiment=['hist', 'rcp85'],
)
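
One way to read such a query is as a row filter over the collection table: every keyword must match, and a list of values means "any of these". The sketch below illustrates that reading with plain dictionaries. It is an assumption about the semantics, not intake-esm's implementation, and the `search` function and `rows` data here are hypothetical stand-ins.

```python
def search(rows, **query):
    """Keep rows where every queried column equals (or is one of) the wanted value(s)."""

    def matches(row):
        for key, wanted in query.items():
            # Treat a scalar as a one-element list of allowed values.
            allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
            if row.get(key) not in allowed:
                return False
        return True

    return [row for row in rows if matches(row)]


# Hypothetical stand-in for a few rows of the collection table.
rows = [
    {"experiment": "hist", "component": "mpiom", "stream": "monitoring_ym"},
    {"experiment": "rcp85", "component": "hamocc", "stream": "monitoring_ym"},
    {"experiment": "pictrl", "component": "mpiom", "stream": "monitoring_ym"},
    {"experiment": "rcp85", "component": "mpiom", "stream": "data_2d_mm"},
]

hits = search(
    rows,
    component=["mpiom", "hamocc"],
    stream="monitoring_ym",
    experiment=["hist", "rcp85"],
)
print(hits)
```

Under this reading, the `pictrl` row is dropped by the `experiment` filter and the `data_2d_mm` row by the `stream` filter.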

In [6]:
print(cat.yaml(True))

plugins:
  source:
  - module: intake_esm.mpige
sources:
  mpige_test_279e6e1b-8d04-4669-8ef0-4a69bd347f90:
    args:
      collection_name: mpige_test
      query:
        case: null
        component:
        - mpiom
        - hamocc
        ctrl_branch_year: null
        date_range: null
        direct_access: null
        ensemble: null
        experiment:
        - hist
        - rcp85
        file_basename: null
        file_dirname: null
        file_fullpath: null
        grid: null
        resource: null
        resource_type: null
        sequence_order: null
        stream: monitoring_ym
        year_offset: null
    description: Catalog entry from mpige_test collection
    driver: mpige
    metadata:
      cache: {}
      catalog_dir: ''



#### Concatenate datasets along a new experiment_id dimension


In [7]:
ds = cat.to_xarray(merge_exp=False)
ds


<xarray.Dataset>
Dimensions:                                    (depth: 1, experiment_id: 2, lat: 1, lon: 1, member_id: 3, time: 6)
Coordinates:
  * time                                       (time) datetime64[ns] 1850-12-31T23:15:00 ...
  * lon                                        (lon) float64 0.0
  * lat                                        (lat) float64 0.0
  * depth                                      (depth) float64 0.0
  * member_id                                  (member_id) int64 1 2 3
  * experiment_id                              (experiment_id) <U5 'hist' ...
Data variables:
    global_primary_production                  (experiment_id, member_id, time, depth, lat, lon) float32 dask.array<shape=(2, 3, 6, 1, 1, 1), chunksize=(1, 1, 1, 1, 1, 1)>
    global_zooplankton_grazing                 (experiment_id, member_id, time, depth, lat, lon) float32 dask.array<shape=(2, 3, 6, 1, 1, 1), chunksize=(1, 1, 1, 1, 1, 1)>
    global_OM_export_at_90m                    (experime

#### Merge datasets

In [8]:
cat2 = col.search(
    component=['mpiom', 'hamocc'],
    stream='monitoring_ym',
    experiment=['hist', 'rcp85'],
)

In [9]:
ds = cat2.to_xarray(merge_exp=True)
ds


<xarray.Dataset>
Dimensions:                                    (depth: 1, lat: 1, lon: 1, member_id: 3, time: 6)
Coordinates:
  * time                                       (time) datetime64[ns] 1850-12-31T23:15:00 ...
  * lon                                        (lon) float64 0.0
  * lat                                        (lat) float64 0.0
  * depth                                      (depth) float64 0.0
  * member_id                                  (member_id) int64 1 2 3
Data variables:
    global_primary_production                  (member_id, time, depth, lat, lon) float32 dask.array<shape=(3, 6, 1, 1, 1), chunksize=(1, 1, 1, 1, 1)>
    global_zooplankton_grazing                 (member_id, time, depth, lat, lon) float32 dask.array<shape=(3, 6, 1, 1, 1), chunksize=(1, 1, 1, 1, 1)>
    global_OM_export_at_90m                    (member_id, time, depth, lat, lon) float32 dask.array<shape=(3, 6, 1, 1, 1), chunksize=(1, 1, 1, 1, 1)>
    global_calc_export_at_90m              

#### Workaround when merging or concatenating datasets does not work: return dictionary of datasets

In [10]:
cat3 = col.search(component='mpiom', stream='monitoring_ym')

In [11]:
ds = cat3.to_xarray(merge_exp=True)
ds


  warn('Could not merge datasets. Returning non-merged datasets')


OrderedDict([('hist', <xarray.Dataset>
              Dimensions:                                    (depth: 1, lat: 1, lon: 1, member_id: 3, time: 3)
              Coordinates:
                * lon                                        (lon) float64 0.0
                * lat                                        (lat) float64 0.0
                * depth                                      (depth) float64 0.0
                * time                                       (time) datetime64[ns] 1850-12-31T23:15:00 ...
                * member_id                                  (member_id) int64 1 2 3
              Data variables:
                  gmsl_st                                    (member_id, time, depth, lat, lon) float32 dask.array<shape=(3, 3, 1, 1, 1), chunksize=(1, 1, 1, 1, 1)>
                  gmsl_eu                                    (member_id, time, depth, lat, lon) float32 dask.array<shape=(3, 3, 1, 1, 1), chunksize=(1, 1, 1, 1, 1)>
                  netwatertransp
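
Because the call may hand back either a single dataset or a dictionary keyed by experiment, as in the output above, downstream code may want to normalize the result first. The following is a minimal sketch under that assumption. The helper `as_experiment_dict` and the `"merged"` key are hypothetical, and plain strings stand in for the xarray datasets so the example runs without any data.

```python
from collections import OrderedDict


def as_experiment_dict(result, default_key="merged"):
    """Return a dictionary of datasets whether or not merging succeeded.

    A dictionary result (the non-merged fallback) is passed through;
    anything else is wrapped under `default_key`.
    """
    if isinstance(result, dict):
        return OrderedDict(result)
    return OrderedDict([(default_key, result)])


# Strings stand in for xarray.Dataset objects in this sketch.
fallback = OrderedDict([("hist", "ds_hist"), ("rcp85", "ds_rcp85")])
merged = "ds_merged"

for result in (fallback, merged):
    for experiment, dataset in as_experiment_dict(result).items():
        print(experiment, dataset)
```

With this wrapper, later cells can always iterate over `(experiment, dataset)` pairs instead of branching on the return type.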

In [12]:
%load_ext watermark

In [13]:
%watermark --iversion -g  -m -v -u -d

intake 0.4.1
last updated: 2019-04-25 

CPython 3.6.7
IPython 7.1.1

compiler   : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
system     : Darwin
release    : 17.7.0
machine    : x86_64
processor  : i386
CPU cores  : 8
interpreter: 64bit
Git hash   : 90b6eed2869c557066a5aea345f59da5178f00f5
