# Building a collection

`intake-esm` supports building user-defined collections. This notebook demonstrates how to do this.

In [1]:
import intake

## Collection definition file

Aspects of the catalog can be defined in the `intake-esm` configuration file, `~/.intake_esm/config.yaml`. This is a YAML file that can contain the following keys.

`collection_columns` : list of columns to include in catalog; for example:
```yaml
collection_columns:
  - resource
  - experiment
  - case
  - component
  ...
```

`replacements` : nested dictionary of the form `{column_name: {to_replace: value}}`; this is passed to [`pandas.DataFrame.replace`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html); for example:
```yaml
replacements:
    freq:
      monthly: month_1
      daily: day_1
      yearly: year_1
```
(*Note: the current CESM catalog definition does not include the `freq` column.*)
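
To illustrate how a nested dictionary of this form behaves when handed to `pandas.DataFrame.replace`, here is a minimal sketch. The toy `DataFrame` and its `case` values are made up for the example; only the `replacements` mapping mirrors the YAML above.

```python
import pandas as pd

# Toy stand-in for a catalog; the rows are illustrative, not real catalog entries.
df = pd.DataFrame(
    {
        "case": ["case_a", "case_b", "case_c"],
        "freq": ["monthly", "daily", "yearly"],
    }
)

# Same structure as the `replacements` entry: {column_name: {to_replace: value}}.
replacements = {"freq": {"monthly": "month_1", "daily": "day_1", "yearly": "year_1"}}

# Only the named column is touched; other columns are left unchanged.
df = df.replace(replacements)
print(df["freq"].tolist())  # ['month_1', 'day_1', 'year_1']
```

Because the outer key names the column, the same value appearing in another column would not be replaced.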

`component_streams` : dictionary with lists of `stream` strings for each `component`; for example:
```yaml
component_streams:
    ocn:
      - pop.h.nday1
      - pop.h.nyear1
      - pop.h.ecosys.nday1
      - pop.h.ecosys.nyear1
      - pop.h
    atm:
      - cam.h0
      - cam.h1
      - cam.h2
    ...
```
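
One way such a mapping could be used is to identify the component and stream of a file from its name. The sketch below is an assumption about usage, not `intake-esm`'s own logic: `find_component_and_stream` is a hypothetical helper and the file names are invented. It tries longer stream names first, because `pop.h` is a prefix of `pop.h.nday1`.

```python
# Subset of the `component_streams` example above.
component_streams = {
    "ocn": ["pop.h.nday1", "pop.h.nyear1", "pop.h.ecosys.nday1", "pop.h.ecosys.nyear1", "pop.h"],
    "atm": ["cam.h0", "cam.h1", "cam.h2"],
}


def find_component_and_stream(filename, streams_by_component):
    """Return (component, stream) for the longest stream found in `filename`, or None."""
    candidates = [
        (component, stream)
        for component, streams in streams_by_component.items()
        for stream in streams
    ]
    # Try longer stream names first so "pop.h.nday1" wins over its prefix "pop.h".
    for component, stream in sorted(candidates, key=lambda pair: len(pair[1]), reverse=True):
        if f".{stream}." in filename:
            return component, stream
    return None


# Hypothetical file names, used only to exercise the helper.
daily = find_component_and_stream("some_case.pop.h.nday1.SST2.02490101-03161231.nc", component_streams)
default = find_component_and_stream("some_case.pop.h.TEMP.024901-031612.nc", component_streams)
print(daily, default)
```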

## Collection input file

Collections are built from a `yaml` input file containing a nested dictionary. 

It might look something like this:
```yaml
cesm1_le:  # name of collection
  type: cesm  # type of collection, used to determine build method
  data_sources: # dictionary of data sources to include; each key is an `experiment`
    CTRL: # experiment name followed by list of dictionaries for each ensemble member in the experiment
      - case: b.e11.B1850C5CN.f09_g16.005
        sequence_order: 0
        ensemble: 0
        has_ocean_bgc: True
        year_offset: 1448
        locations:          # ordered list of locations from which to get data
          - name: GLADE
            type: posix
            urlpath: /glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE
          - name: HPSS
            type: hsi
            urlpath: /CCSM/csm/CESM-CAM5-BGC-LE
        component_attrs:
          ocn:
            grid: POP_gx1v6
    20C:
      - case: b.e11.B20TRC5CNBDRD.f09_g16.001
        sequence_order: 0
        ensemble: 1
        has_ocean_bgc: True
        locations:
          - name: GLADE
            type: posix
            urlpath: /glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE
          - name: HPSS
            type: hsi
            urlpath: /CCSM/csm/CESM-CAM5-BGC-LE
        component_attrs:
          ocn:
            grid: POP_gx1v6

      - case: b.e11.B20TRC5CNBDRD.f09_g16.002
        sequence_order: 0
        ensemble: 2
        has_ocean_bgc: True
        locations:
          - name: GLADE
            type: posix
            urlpath: /glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE
          - name: HPSS
            type: hsi
            urlpath: /CCSM/csm/CESM-CAM5-BGC-LE
        component_attrs:
          ocn:
            grid: POP_gx1v6
```
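
The nested structure above can be traversed as follows. This is a minimal sketch, not `intake-esm`'s build code: it assumes the YAML has already been parsed into a Python dictionary (for instance with PyYAML's `yaml.safe_load`), uses a trimmed-down literal in its place, and `iter_members` is a hypothetical helper name.

```python
# Trimmed-down stand-in for the parsed input file (two experiments, three members).
locations = [{"name": "GLADE", "type": "posix"}, {"name": "HPSS", "type": "hsi"}]
collection_input = {
    "cesm1_le": {
        "type": "cesm",
        "data_sources": {
            "CTRL": [
                {"case": "b.e11.B1850C5CN.f09_g16.005", "ensemble": 0, "locations": locations},
            ],
            "20C": [
                {"case": "b.e11.B20TRC5CNBDRD.f09_g16.001", "ensemble": 1, "locations": locations},
                {"case": "b.e11.B20TRC5CNBDRD.f09_g16.002", "ensemble": 2, "locations": locations},
            ],
        },
    }
}


def iter_members(parsed_input):
    """Yield (collection, experiment, case, ensemble, location names) per ensemble member."""
    for collection_name, definition in parsed_input.items():
        for experiment, members in definition["data_sources"].items():
            for member in members:
                names = [location["name"] for location in member.get("locations", [])]
                yield collection_name, experiment, member["case"], member["ensemble"], names


rows = list(iter_members(collection_input))
for row in rows:
    print(row)
```

Each experiment key maps to a list, so an experiment with many ensemble members is just a longer list of dictionaries with the same keys.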

In [9]:
collection_type_def_file = "../../example_input/collection_input_cesm_dple.yml"

## Build the collection

The build method loops over all the experiments and each of the ensemble members therein. It attempts to parse each file name; when parsing fails, it skips the file with a warning. If HPSS access is not available (such as from compute nodes on Cheyenne), this resource is omitted from the catalog.

Ultimately, the collection is saved to disk as a CSV file at `collection.active_db`.
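
The parse-or-skip behaviour can be pictured with the sketch below. It is not the library's parser: `parse_filename` and `TAIL_PATTERN` are hypothetical, and the assumed layout `<case>.<stream>.<variable>.<date_range>.nc` (with no dots inside the variable name) is a guess based on the file names and messages shown in this notebook. The second file name is invented for the example.

```python
import re

# Assumed layout of what follows "<case>.<stream>.": "<variable>.<date_range>.nc",
# where the variable name contains no dots.
TAIL_PATTERN = re.compile(r"^(?P<variable>[^.]+)\.(?P<date_range>\d+-\d+)\.nc$")


def parse_filename(filename, case, stream):
    """Return a dict of parsed fields, or None if the name does not fit the pattern."""
    prefix = f"{case}.{stream}."
    if not filename.startswith(prefix):
        return None
    match = TAIL_PATTERN.match(filename[len(prefix):])
    if match is None:
        return None
    return {"case": case, "stream": stream, **match.groupdict()}


case = "g.e11_LENS.GECOIAF.T62_g16.009"
filenames = [
    f"{case}.pop.h.SALT_on_PD=26.5.024901-031612.nc",  # variable contains a dot
    f"{case}.pop.h.ADVT.024901-031612.nc",  # invented conforming name
]

parsed = []
for name in filenames:
    entry = parse_filename(name, case, "pop.h")
    if entry is None:
        # Report the problem and move on instead of aborting the whole build.
        print(f"Filename : {name} does not conform to expected pattern")
        continue
    parsed.append(entry)

print(parsed)
```

Skipping rather than raising keeps one oddly named file from blocking the rest of the catalog.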

In [10]:
col = intake.open_esm_metadatastore(collection_input_file=collection_type_def_file, collection_type="cesm")
col

<Intake catalog: None>

In [11]:
# Trigger the build
col.build_collections()

Filename : g.e11_LENS.GECOIAF.T62_g16.009.pop.h.SALT_on_PD=26.5.024901-031612.nc does not conform to expected pattern
Filename : g.e11_LENS.GECOIAF.T62_g16.009.pop.h.O2_on_PD=26.5.024901-031612.nc does not conform to expected pattern


<Intake catalog: None>

## Examining the collection

`intake_esm` builds a `pandas.DataFrame` to store the collection. The `DataFrame` is stored as an attribute on the collection object.

In [14]:
col = col.open_collection(collection_name="cesm_dple", collection_type="cesm")

In [16]:
col.df

| Unnamed: 0 | resource | resource_type | direct_access | experiment | case | component | stream | variable | date_range | ensemble | files | files_basename | files_dirname | ctrl_branch_year | year_offset | sequence_order | has_ocean_bgc | grid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GLADE:posix:/glade/p/cgd/oce/projects/DPLE_O2/... | posix | True | g.e11_LENS.GECOIAF.T62_g16.009_sigma_coord | g.e11_LENS.GECOIAF.T62_g16.009 | ocn | pop.h.sigma | O2 | 024901-031612 | 0 | /glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/... | g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.O2.... | /glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/... |  | 1699 | 0 | False | POP_gx1v6 |
| 1 | GLADE:posix:/glade/p/cgd/oce/projects/DPLE_O2/... | posix | True | g.e11_LENS.GECOIAF.T62_g16.009_sigma_coord | g.e11_LENS.GECOIAF.T62_g16.009 | ocn | pop.h.sigma | NO3 | 024901-031612 | 0 | /glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/... | g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.NO3... | /glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/... |  | 1699 | 0 | False | POP_gx1v6 |
| 2 | GLADE:posix:/glade/p/cgd/oce/projects/DPLE_O2/... | posix | True | g.e11_LENS.GECOIAF.T62_g16.009_sigma_coord | g.e11_LENS.GECOIAF.T62_g16.009 | ocn | pop.h.sigma | SALT | 024901-031612 | 0 | /glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/... | g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.SAL... | /glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/... |  | 1699 | 0 | False | POP_gx1v6 |
| 3 | GLADE:posix:/glade/p/cgd/oce/projects/DPLE_O2/... | posix | True | g.e11_LENS.GECOIAF.T62_g16.009_sigma_coord | g.e11_LENS.GECOIAF.T62_g16.009 | ocn | pop.h.sigma | TEMP | 024901-031612 | 0 | /glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/... | g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.TEM... | /glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/... |  | 1699 | 0 | False | POP_gx1v6 |
| 4 | GLADE:posix:/glade/p/cesm/community/CESM-DPLE/... | posix | True | g.e11_LENS.GECOIAF.T62_g16.009 | g.e11_LENS.GECOIAF.T62_g16.009 | ocn | pop.h | ADVT | 024901-031612 | 0 | /glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO... | g.e11_LENS.GECOIAF.T62_g16.009.pop.h.ADVT.0249... | /glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO... |  | 1699 | 0 | False | POP_gx1v6 |
| 5 | GLADE:posix:/glade/p/cesm/community/CESM-DPLE/... | posix | True | g.e11_LENS.GECOIAF.T62_g16.009 | g.e11_LENS.GECOIAF.T62_g16.009 | ocn | pop.h | CaCO3_form | 024901-031612 | 0 | /glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO... | g.e11_LENS.GECOIAF.T62_g16.009.pop.h.CaCO3_for... | /glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO... |  | 1699 | 0 | False | POP_gx1v6 |
| 6 | GLADE:posix:/glade/p/cesm/community/CESM-DPLE/... | posix | True | g.e11_LENS.GECOIAF.T62_g16.009 | g.e11_LENS.GECOIAF.T62_g16.009 | ice | cice.h | meltb_sh | 024901-031612 | 0 | /glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO... | g.e11_LENS.GECOIAF.T62_g16.009.cice.h.meltb_sh... | /glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO... |  | 1699 | 0 | False |  |
| 7 | GLADE:posix:/glade/p/cesm/community/CESM-DPLE/... | posix | True | g.e11_LENS.GECOIAF.T62_g16.009 | g.e11_LENS.GECOIAF.T62_g16.009 | ocn | pop.h | H2CO3 | 024901-031612 | 0 | /glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO... | g.e11_LENS.GECOIAF.T62_g16.009.pop.h.H2CO3.024... | /glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO... |  | 1699 | 0 | False | POP_gx1v6 |
| 8 | GLADE:posix:/glade/p/cesm/community/CESM-DPLE/... | posix | True | g.e11_LENS.GECOIAF.T62_g16.009 | g.e11_LENS.GECOIAF.T62_g16.009 | ocn | pop.h.nday1 | SST2 | 02490101-03161231 | 0 | /glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO... | g.e11_LENS.GECOIAF.T62_g16.009.pop.h.nday1.SST... | /glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO... |  | 1699 | 0 | False | POP_gx1v6 |
| 9 | GLADE:posix:/glade/p/cesm/community/CESM-DPLE/... | posix | True | g.e11_LENS.GECOIAF.T62_g16.009 | g.e11_LENS.GECOIAF.T62_g16.009 | ocn | pop.h | diaz_Fe_lim | 024901-031612 | 0 | /glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO... | g.e11_LENS.GECOIAF.T62_g16.009.pop.h.diaz_Fe_l... | /glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO... |  | 1699 | 0 | False | POP_gx1v6 |
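
Since the collection is an ordinary `pandas.DataFrame`, standard pandas selection applies to it. The sketch below uses a small hand-built frame as a stand-in for `col.df`, with a few column values copied from the rows shown above, so that it runs without access to the catalog.

```python
import pandas as pd

# Hand-built stand-in for a few columns of `col.df`.
df = pd.DataFrame(
    {
        "component": ["ocn", "ocn", "ice", "ocn"],
        "stream": ["pop.h.sigma", "pop.h", "cice.h", "pop.h.nday1"],
        "variable": ["O2", "ADVT", "meltb_sh", "SST2"],
        "date_range": ["024901-031612", "024901-031612", "024901-031612", "02490101-03161231"],
    }
)

# Boolean-mask selection: ocean variables in the `pop.h` stream.
subset = df[(df["component"] == "ocn") & (df["stream"] == "pop.h")]
print(subset["variable"].tolist())  # ['ADVT']

# Count catalog rows per component.
counts = df["component"].value_counts().to_dict()
print(counts)
```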


In [17]:
%load_ext watermark

In [18]:
%watermark --iversion -g -h -m -v -u -d

intake 0.4.1
last updated: 2019-02-17 

CPython 3.6.7
IPython 7.1.1

compiler   : GCC 7.3.0
system     : Linux
release    : 3.12.62-60.64.8-default
machine    : x86_64
processor  : x86_64
CPU cores  : 72
interpreter: 64bit
host name  : r6i6n30
Git hash   : 71d5340c3bbabf8a76f6ba780675f5b7e4a1c8e5
