# Building a collection catalog

`intake-esm` supports building user-defined collection catalogs. This notebook demonstrates how to build one.

In [1]:
import intake

## Collection definition file

Aspects of the catalog can be defined in the `intake-esm` configuration file, `~/.intake_esm/config.yaml`. This YAML file contains the following keys.

`collection_columns` : list of columns to include in the catalog; for example:
```yaml
collection_columns:
  - resource
  - experiment
  - case
  - component
  ...
```

`replacements` : nested dictionaries of the form {column_name: {`to_replace`: `value`}}; this is passed to [`pandas.DataFrame.replace`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html); for example:
```yaml
replacements:
    freq:
      monthly: month_1
      daily: day_1
      yearly: year_1
```
(*Note: the current CESM catalog definition does not include the `freq` column.*)
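To make the nested-dictionary form concrete, here is a minimal sketch of how `pandas.DataFrame.replace` treats such a mapping. The small `DataFrame` and its `variable` column are made up for illustration; only the `replacements` mapping comes from the example above.

```python
import pandas as pd

# Hypothetical catalog fragment; the values are illustrative only.
df = pd.DataFrame(
    {
        "freq": ["monthly", "daily", "yearly", "monthly"],
        "variable": ["SST", "SST", "O2", "daily"],
    }
)

# Same structure as the `replacements` entry in the config example.
replacements = {"freq": {"monthly": "month_1", "daily": "day_1", "yearly": "year_1"}}

# With a nested dictionary, replacement is restricted to the named column,
# so the string "daily" in the `variable` column is left untouched.
replaced = df.replace(replacements)
print(replaced)
```

Because the outer key names the column, the same string can appear in other columns without being rewritten.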

`component_streams` : dictionary mapping each `component` to a list of `stream` strings; for example:
```yaml
component_streams:
    ocn:
      - pop.h.nday1
      - pop.h.nyear1
      - pop.h.ecosys.nday1
      - pop.h.ecosys.nyear1
      - pop.h
    atm:
      - cam.h0
      - cam.h1
      - cam.h2
    ...
```
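One way such a stream list could be used is to split a history file name into its parts. The sketch below is not the `intake-esm` parser: `parse_filename` is a hypothetical helper, the first file name is a made-up conforming example, and the assumed pattern is `<case>.<stream>.<variable>.<date_range>.nc`.

```python
import re

# Stream lists per component, abridged from the config example above.
COMPONENT_STREAMS = {
    "ocn": ["pop.h.nday1", "pop.h.nyear1", "pop.h.ecosys.nday1",
            "pop.h.ecosys.nyear1", "pop.h"],
    "atm": ["cam.h0", "cam.h1", "cam.h2"],
}


def parse_filename(basename, case, component_streams=COMPONENT_STREAMS):
    """Return a dict of parsed fields, or None if the name does not fit
    the assumed <case>.<stream>.<variable>.<date_range>.nc pattern."""
    prefix = case + "."
    if not basename.startswith(prefix) or not basename.endswith(".nc"):
        return None
    rest = basename[len(prefix):-len(".nc")]
    # Try longer stream names first so "pop.h.nday1" wins over "pop.h".
    candidates = [(stream, component)
                  for component, streams in component_streams.items()
                  for stream in streams]
    candidates.sort(key=lambda pair: len(pair[0]), reverse=True)
    for stream, component in candidates:
        if rest.startswith(stream + "."):
            tail = rest[len(stream) + 1:]
            match = re.fullmatch(r"(?P<variable>[^.]+)\.(?P<date_range>\d+-\d+)", tail)
            if match is None:
                return None
            return {"component": component, "stream": stream, **match.groupdict()}
    return None


case = "g.e11_LENS.GECOIAF.T62_g16.009"
# Hypothetical conforming name.
ok = parse_filename(case + ".pop.h.SALT.024901-031612.nc", case)
# A name with an extra dot in the variable part does not fit the pattern.
bad = parse_filename(case + ".pop.h.SALT_on_PD=26.5.024901-031612.nc", case)
print(ok)
print(bad)
```

Under these assumptions, a name whose variable part contains a dot is rejected rather than guessed at, which is the kind of file a build would skip with a warning.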

## Collection input file

Collections are built from a YAML input file containing a nested dictionary.

It might look something like this:
```yaml
name: cesm_dple
collection_type: cesm
overwrite_existing : True
include_cache_dir: False
data_sources:
    g.e11_LENS.GECOIAF.T62_g16.009:
      locations:
        - name: GLADE
          loc_type: posix
          direct_access: True
          urlpath: /glade/p/cesm/community/CESM-DPLE/CESM-DPLE_POPCICEhindcast
      component_attrs:
        ocn: {grid: POP_gx1v6}
      case_members:
        - case: g.e11_LENS.GECOIAF.T62_g16.009
          year_offset: 1699
    g.e11_LENS.GECOIAF.T62_g16.009_sigma_coord:
      locations:
        - name: GLADE
          loc_type: posix
          direct_access: True
          urlpath: /glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/CESM-DPLE_POPCICEhindcast
      component_attrs:
        ocn: {grid: POP_gx1v6}
      case_members:
        - case: g.e11_LENS.GECOIAF.T62_g16.009
          year_offset: 1699
```
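The nesting can be read as experiments, then storage locations, then case members. The following sketch walks that structure using a Python dictionary equivalent to what a YAML loader would return for the input above (abridged). `iter_build_tasks` is a hypothetical helper, and using a case member's position in the list as its ensemble number is an assumption, not something the input file states.

```python
# Abridged Python equivalent of the YAML input file shown above.
collection_input = {
    "name": "cesm_dple",
    "collection_type": "cesm",
    "data_sources": {
        "g.e11_LENS.GECOIAF.T62_g16.009": {
            "locations": [
                {"name": "GLADE", "loc_type": "posix",
                 "urlpath": "/glade/p/cesm/community/CESM-DPLE/CESM-DPLE_POPCICEhindcast"},
            ],
            "case_members": [
                {"case": "g.e11_LENS.GECOIAF.T62_g16.009", "year_offset": 1699},
            ],
        },
        "g.e11_LENS.GECOIAF.T62_g16.009_sigma_coord": {
            "locations": [
                {"name": "GLADE", "loc_type": "posix",
                 "urlpath": "/glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/CESM-DPLE_POPCICEhindcast"},
            ],
            "case_members": [
                {"case": "g.e11_LENS.GECOIAF.T62_g16.009", "year_offset": 1699},
            ],
        },
    },
}


def iter_build_tasks(collection_input):
    """Yield (experiment, resource, ensemble, case) for every combination
    of storage location and case member in the input dictionary."""
    for experiment, spec in collection_input["data_sources"].items():
        for location in spec["locations"]:
            # Resource key in the form name:loc_type:urlpath.
            resource = ":".join(
                [location["name"], location["loc_type"], location["urlpath"]]
            )
            # Assumption: ensemble number = position in `case_members`.
            for ensemble, member in enumerate(spec["case_members"]):
                yield experiment, resource, ensemble, member["case"]


tasks = list(iter_build_tasks(collection_input))
for task in tasks:
    print(task)
```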

In [2]:
collection_type_def_file = "../../example_input/collection_input_cesm_dple.yml"

## Build the collection

The build method loops over all the experiments and each of the ensemble members therein. It attempts to parse each file name; files whose names cannot be parsed are skipped with a warning. If HPSS access is not available (such as from compute nodes on Cheyenne), HPSS resources are omitted from the catalog.

Ultimately, the collection is saved to disk as a CSV file at `collection.active_db`.

In [3]:
col = intake.open_esm_metadatastore(collection_input_file=collection_type_def_file)
col

Working on experiment: g.e11_LENS.GECOIAF.T62_g16.009
Getting file listing : GLADE:posix:/glade/p/cesm/community/CESM-DPLE/CESM-DPLE_POPCICEhindcast
Building file database : GLADE:posix:/glade/p/cesm/community/CESM-DPLE/CESM-DPLE_POPCICEhindcast
Filename : g.e11_LENS.GECOIAF.T62_g16.009.pop.h.SALT_on_PD=26.5.024901-031612.nc does not conform to expected pattern
Filename : g.e11_LENS.GECOIAF.T62_g16.009.pop.h.O2_on_PD=26.5.024901-031612.nc does not conform to expected pattern
Working on experiment: g.e11_LENS.GECOIAF.T62_g16.009_sigma_coord
Getting file listing : GLADE:posix:/glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/CESM-DPLE_POPCICEhindcast
Building file database : GLADE:posix:/glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/CESM-DPLE_POPCICEhindcast
None
Persisting cesm_dple at : /glade/work/abanihi/intake-collections/cesm/cesm_dple.csv


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 583 entries, 0 to 582
Data columns (total 18 columns):
resource            583 non-null object
resource_type       583 non-null object
direct_access       583 non-null object
experiment          583 non-null object
case                583 non-null object
component           583 non-null object
stream              583 non-null object
variable            583 non-null object
date_range          583 non-null object
ensemble            583 non-null object
files               583 non-null object
files_basename      583 non-null object
files_dirname       583 non-null object
ctrl_branch_year    0 non-null object
year_offset         583 non-null object
sequence_order      583 non-null object
has_ocean_bgc       583 non-null object
grid                325 non-null object
dtypes: object(18)
memory usage: 82.1+ KB


<Intake catalog: None>

## Examining the collection

`intake-esm` builds a `pandas.DataFrame` to store the collection. The `DataFrame` is stored as the `df` attribute on the collection object.

In [4]:
col.df.head()

| | resource | resource_type | direct_access | experiment | case | component | stream | variable | date_range | ensemble | files | files_basename | files_dirname | ctrl_branch_year | year_offset | sequence_order | has_ocean_bgc | grid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GLADE:posix:/glade/p/cgd/oce/projects/DPLE_O2/... | posix | True | g.e11_LENS.GECOIAF.T62_g16.009_sigma_coord | g.e11_LENS.GECOIAF.T62_g16.009 | ocn | pop.h.sigma | O2 | 024901-031612 | 0 | /glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/... | g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.O2.... | /glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/... | | 1699 | 0 | False | POP_gx1v6 |
| 1 | GLADE:posix:/glade/p/cgd/oce/projects/DPLE_O2/... | posix | True | g.e11_LENS.GECOIAF.T62_g16.009_sigma_coord | g.e11_LENS.GECOIAF.T62_g16.009 | ocn | pop.h.sigma | NO3 | 024901-031612 | 0 | /glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/... | g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.NO3... | /glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/... | | 1699 | 0 | False | POP_gx1v6 |
| 2 | GLADE:posix:/glade/p/cgd/oce/projects/DPLE_O2/... | posix | True | g.e11_LENS.GECOIAF.T62_g16.009_sigma_coord | g.e11_LENS.GECOIAF.T62_g16.009 | ocn | pop.h.sigma | SALT | 024901-031612 | 0 | /glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/... | g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.SAL... | /glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/... | | 1699 | 0 | False | POP_gx1v6 |
| 3 | GLADE:posix:/glade/p/cgd/oce/projects/DPLE_O2/... | posix | True | g.e11_LENS.GECOIAF.T62_g16.009_sigma_coord | g.e11_LENS.GECOIAF.T62_g16.009 | ocn | pop.h.sigma | TEMP | 024901-031612 | 0 | /glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/... | g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.TEM... | /glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/... | | 1699 | 0 | False | POP_gx1v6 |
| 4 | GLADE:posix:/glade/p/cesm/community/CESM-DPLE/... | posix | True | g.e11_LENS.GECOIAF.T62_g16.009 | g.e11_LENS.GECOIAF.T62_g16.009 | ocn | pop.h | ADVT | 024901-031612 | 0 | /glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO... | g.e11_LENS.GECOIAF.T62_g16.009.pop.h.ADVT.0249... | /glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO... | | 1699 | 0 | False | POP_gx1v6 |
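Since the collection is an ordinary `DataFrame`, standard pandas selection applies. The sketch below uses a small stand-in for `col.df` built from three columns of the rows shown above; on a real collection the same expressions would run against `col.df` directly.

```python
import pandas as pd

# Stand-in for `col.df`, restricted to three columns of the five rows above.
df = pd.DataFrame(
    {
        "experiment": ["g.e11_LENS.GECOIAF.T62_g16.009_sigma_coord"] * 4
        + ["g.e11_LENS.GECOIAF.T62_g16.009"],
        "stream": ["pop.h.sigma"] * 4 + ["pop.h"],
        "variable": ["O2", "NO3", "SALT", "TEMP", "ADVT"],
    }
)

# Variables available in the sigma-coordinate stream.
sigma_variables = df.loc[df["stream"] == "pop.h.sigma", "variable"].tolist()

# Number of catalog entries per experiment.
entries_per_experiment = df.groupby("experiment").size().to_dict()

print(sigma_variables)
print(entries_per_experiment)
```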


In [5]:
%load_ext watermark

In [6]:
%watermark --iversion -g -h -m -v -u -d

intake 0.4.1
last updated: 2019-02-26 

CPython 3.6.7
IPython 7.1.1

compiler   : GCC 7.3.0
system     : Linux
release    : 3.12.62-60.64.8-default
machine    : x86_64
processor  : x86_64
CPU cores  : 72
interpreter: 64bit
host name  : r6i6n30
Git hash   : c8261b8c4233784c05d290e22c0e7a8ef11951b5
