# Building a collection

`intake_cesm` supports building user-defined collections. This notebook demonstrates how to do so.

In [1]:
import intake_cesm

## Collection definition file

Aspects of the catalog are defined in the `collection_type_def_file`. This is a `yaml` file with the following top-level keys.

`collection_columns` : list of columns to include in catalog; for example:
```yaml
collection_columns:
  - resource
  - experiment
  - case
  - component
  ...
```

`replacements` : nested dictionaries of the form `{column_name: {to_replace: value}}`; this is passed to [`pandas.DataFrame.replace`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html); for example:
```yaml
replacements:
    freq:
      monthly: month_1
      daily: day_1
      yearly: year_1
```
(*Note: the current CESM catalog definition does not include the `freq` column.*)
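
To illustrate how a mapping of this shape behaves, here is a minimal sketch that passes the same nested dictionary to `pandas.DataFrame.replace`. The small `df` below is made up for illustration and is not taken from any `intake_cesm` catalog; how `intake_cesm` applies the mapping internally is not shown here.

```python
import pandas as pd

# Hypothetical catalog fragment; only the `freq` column is affected.
df = pd.DataFrame({
    "component": ["ocn", "ocn", "atm"],
    "freq": ["monthly", "daily", "hourly6"],
})

# Same structure as the `replacements` entry above:
# {column_name: {to_replace: value}}
replacements = {
    "freq": {"monthly": "month_1", "daily": "day_1", "yearly": "year_1"},
}

# Values without a mapping (here "hourly6") are left unchanged,
# and columns not named in the mapping are not touched.
df_replaced = df.replace(replacements)
print(df_replaced)
```

Because the outer key names the column, a value such as `monthly` appearing in some other column would not be rewritten.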

`component_streams` : dictionary with a list of `stream` strings for each `component`; for example:
```yaml
component_streams:
    ocn:
      - pop.h.nday1
      - pop.h.nyear1
      - pop.h.ecosys.nday1
      - pop.h.ecosys.nyear1
      - pop.h
    atm:
      - cam.h0
      - cam.h1
      - cam.h2
    ...
```
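
One plausible use of such a mapping is to identify which stream a file belongs to. The sketch below is an assumption about how that could work, not the matching logic `intake_cesm` actually uses: `find_stream` is a hypothetical helper, and the file names are invented. It tries longer stream names first so that `pop.h.nday1` is not mistaken for its prefix `pop.h`.

```python
# Excerpt of `component_streams`, written as the dict a YAML parser would return.
component_streams = {
    "ocn": ["pop.h.nday1", "pop.h.nyear1", "pop.h.ecosys.nday1",
            "pop.h.ecosys.nyear1", "pop.h"],
    "atm": ["cam.h0", "cam.h1", "cam.h2"],
}


def find_stream(basename, component):
    """Return the longest stream of `component` found in `basename`, or None."""
    # Longest first, so a specific stream wins over a shorter prefix of it.
    for stream in sorted(component_streams[component], key=len, reverse=True):
        # Require dots on both sides so "cam.h1" does not match "cam.h10".
        if f".{stream}." in basename:
            return stream
    return None


# Hypothetical file names, for illustration only.
print(find_stream("mycase.pop.h.nday1.SST.nc", "ocn"))
print(find_stream("mycase.pop.h.TEMP.nc", "ocn"))
print(find_stream("mycase.clm2.h0.TLAI.nc", "atm"))
```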

In [2]:
collection_type_def_file = '../../intake_cesm/cesm_definitions.yml'

## Collection input file

Collections are built from a `yaml` input file containing a nested dictionary.

It might look something like this:
```yaml
cesm1_le:  # name of collection
  type: cesm  # type of collection, used to determine build method
  data_sources: # dictionary of data sources to include; each key is an `experiment`
    CTRL: # experiment name followed by list of dictionaries for each ensemble member in the experiment
      - case: b.e11.B1850C5CN.f09_g16.005
        sequence_order: 0
        ensemble: 0
        has_ocean_bgc: True
        year_offset: 1448
        locations:          # ordered list of locations from which to get data
          - name: GLADE
            type: posix
            urlpath: /glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE
          - name: HPSS
            type: hsi
            urlpath: /CCSM/csm/CESM-CAM5-BGC-LE
        component_attrs:
          ocn:
            grid: POP_gx1v6
    20C:
      - case: b.e11.B20TRC5CNBDRD.f09_g16.001
        sequence_order: 0
        ensemble: 1
        has_ocean_bgc: True
        locations:
          - name: GLADE
            type: posix
            urlpath: /glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE
          - name: HPSS
            type: hsi
            urlpath: /CCSM/csm/CESM-CAM5-BGC-LE
        component_attrs:
          ocn:
            grid: POP_gx1v6

      - case: b.e11.B20TRC5CNBDRD.f09_g16.002
        sequence_order: 0
        ensemble: 2
        has_ocean_bgc: True
        locations:
          - name: GLADE
            type: posix
            urlpath: /glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE
          - name: HPSS
            type: hsi
            urlpath: /CCSM/csm/CESM-CAM5-BGC-LE
        component_attrs:
          ocn:
            grid: POP_gx1v6
```
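
To make the nesting concrete, the sketch below walks a hand-written Python dictionary that mirrors part of the YAML above (roughly what a YAML parser such as `yaml.safe_load` would return) and flattens it into one record per collection, experiment, ensemble member, and location. The dictionary is trimmed to a few keys, and the flattening is only an illustration of the structure, not the library's own code.

```python
# Trimmed, hand-written mirror of the YAML input shown above.
locations = [
    {"name": "GLADE", "type": "posix"},
    {"name": "HPSS", "type": "hsi"},
]
collection_input = {
    "cesm1_le": {
        "type": "cesm",
        "data_sources": {
            "CTRL": [
                {"case": "b.e11.B1850C5CN.f09_g16.005", "ensemble": 0,
                 "locations": locations},
            ],
            "20C": [
                {"case": "b.e11.B20TRC5CNBDRD.f09_g16.001", "ensemble": 1,
                 "locations": locations},
                {"case": "b.e11.B20TRC5CNBDRD.f09_g16.002", "ensemble": 2,
                 "locations": locations},
            ],
        },
    }
}

# collection -> experiment -> ensemble member -> location
records = []
for name, collection in collection_input.items():
    for experiment, members in collection["data_sources"].items():
        for member in members:
            for location in member["locations"]:
                records.append((name, experiment, member["case"],
                                member["ensemble"], location["name"]))

for record in records:
    print(record)
```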

In [3]:
collection_input_file = '../../intake_cesm/collection_input_cesm1_le.yml'

## Build the collection

The build method loops over all experiments and the ensemble members within each. It attempts to parse each file name; where parsing fails, the file is skipped with a warning. If HPSS access is not available (for example, from compute nodes on Cheyenne), that resource is omitted from the catalog.

Ultimately, the collection is saved to disk as a CSV file at `collection.active_db`.
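
The parse-or-skip behaviour can be sketched as follows. This is a simplified illustration under stated assumptions, not the parser in `intake_cesm`: `parse_basename` is a hypothetical helper, the assumed naming pattern is `<case>.<stream>.<variable>.<date_range>.nc`, and both file names are invented.

```python
import logging
import re

logging.basicConfig(level=logging.INFO)


def parse_basename(basename, case, streams):
    """Split a file name into catalog fields, or return None if it does not fit."""
    # Assumed pattern: <case>.<stream>.<variable>.<date_range>.nc
    for stream in sorted(streams, key=len, reverse=True):
        prefix = f"{case}.{stream}."
        if basename.startswith(prefix):
            match = re.fullmatch(
                r"(?P<variable>[^.]+)\.(?P<date_range>\d+-\d+)\.nc",
                basename[len(prefix):],
            )
            if match:
                return {"case": case, "stream": stream, **match.groupdict()}
    return None


case = "b.e11.B20TRC5CNBDRD.f09_g16.001"
basenames = [
    f"{case}.cam.h1.FLNS.19200101-20051231.nc",  # hypothetical, fits the pattern
    f"{case}.cam.h1.README.txt",                 # hypothetical, does not fit
]

rows = []
for basename in basenames:
    parsed = parse_basename(basename, case, ["cam.h0", "cam.h1", "cam.h2"])
    if parsed is None:
        # Mirror the behaviour described above: warn and move on.
        logging.warning("could not parse file name, skipping: %s", basename)
        continue
    rows.append(parsed)

print(rows)
```

Skipping with a warning keeps one unexpected file from aborting a build over a very large file listing, at the cost of silently smaller catalogs if the warnings are ignored.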

In [4]:
col = intake_cesm.CESMCollections(collection_input_file=collection_input_file,
                                   collection_type_def_file=collection_type_def_file)
col

INFO:root:Active collection : cesm1_le
INFO:root:Active database: ./collections/cesm1_le.csv
INFO:root:calling build
INFO:root:working on experiment: CTRL
0it [00:00, ?it/s]
INFO:root:getting file listing: GLADE:posix:/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE
INFO:root:building file database: GLADE:posix:/glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE
  0%|          | 0/157253 [00:00<?, ?it/s]
  5%|▌         | 8496/157253 [00:00<00:01, 84958.46it/s]
 11%|█         | 16812/157253 [00:00<00:01, 84410.10it/s]
 16%|█▌        | 25105/157253 [00:00<00:01, 83959.54it/s]
 21%|██▏       | 33523/157253 [00:00<00:01, 84025.28it/s]
 28%|██▊       | 43436/157253 [00:00<00:01, 88048.45it/s]
 43%|████▎     | 68276/157253 [00:00<00:01, 74096.39it/s]
 48%|████▊     | 75437/157253 [00:00<00:01, 72449.42it/s]
 62%|██████▏   | 97031/157253 [00:01<00:00, 71080.49it/s]
 66%|██████▋   | 104189/157253 [00:01<00:00, 71228.01it/s]
 71%|███████   | 111482/157253 [0

<intake_cesm.manage_collections.CESMCollections at 0x2b9c4d7d6390>

## Examining the collection

`intake_cesm` builds a `pandas.DataFrame` to store the collection. The `DataFrame` is stored as an attribute on the collection object.

In [5]:
col.df

Unnamed: 0,resource,experiment,case,component,stream,variable,date_range,ensemble,files,files_basename,files_dirname,ctrl_branch_year,year_offset,sequence_order,has_ocean_bgc,grid
0,HPSS:hsi:/CCSM/csm/CESM-CAM5-BGC-LE,RCP85,b.e11.BRCP85C5CNBDRD.f09_g16.105,atm,cam.h1,FLNS,20060101-21001231,105,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,b.e11.BRCP85C5CNBDRD.f09_g16.105.cam.h1.FLNS.2...,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,,,1,True,
1,HPSS:hsi:/CCSM/csm/CESM-CAM5-BGC-LE,RCP85,b.e11.BRCP85C5CNBDRD.f09_g16.105,atm,cam.h1,FLNSC,20060101-21001231,105,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,b.e11.BRCP85C5CNBDRD.f09_g16.105.cam.h1.FLNSC....,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,,,1,True,
2,HPSS:hsi:/CCSM/csm/CESM-CAM5-BGC-LE,RCP85,b.e11.BRCP85C5CNBDRD.f09_g16.105,atm,cam.h1,FLUT,20060101-21001231,105,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,b.e11.BRCP85C5CNBDRD.f09_g16.105.cam.h1.FLUT.2...,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,,,1,True,
3,HPSS:hsi:/CCSM/csm/CESM-CAM5-BGC-LE,RCP85,b.e11.BRCP85C5CNBDRD.f09_g16.105,atm,cam.h1,FSNS,20060101-21001231,105,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,b.e11.BRCP85C5CNBDRD.f09_g16.105.cam.h1.FSNS.2...,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,,,1,True,
4,HPSS:hsi:/CCSM/csm/CESM-CAM5-BGC-LE,RCP85,b.e11.BRCP85C5CNBDRD.f09_g16.105,atm,cam.h1,FSNSC,20060101-21001231,105,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,b.e11.BRCP85C5CNBDRD.f09_g16.105.cam.h1.FSNSC....,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,,,1,True,
5,HPSS:hsi:/CCSM/csm/CESM-CAM5-BGC-LE,RCP85,b.e11.BRCP85C5CNBDRD.f09_g16.105,atm,cam.h1,FSNTOA,20060101-21001231,105,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,b.e11.BRCP85C5CNBDRD.f09_g16.105.cam.h1.FSNTOA...,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,,,1,True,
6,HPSS:hsi:/CCSM/csm/CESM-CAM5-BGC-LE,RCP85,b.e11.BRCP85C5CNBDRD.f09_g16.105,atm,cam.h1,ICEFRAC,20060101-21001231,105,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,b.e11.BRCP85C5CNBDRD.f09_g16.105.cam.h1.ICEFRA...,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,,,1,True,
7,HPSS:hsi:/CCSM/csm/CESM-CAM5-BGC-LE,RCP85,b.e11.BRCP85C5CNBDRD.f09_g16.105,atm,cam.h1,LHFLX,20060101-21001231,105,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,b.e11.BRCP85C5CNBDRD.f09_g16.105.cam.h1.LHFLX....,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,,,1,True,
8,HPSS:hsi:/CCSM/csm/CESM-CAM5-BGC-LE,RCP85,b.e11.BRCP85C5CNBDRD.f09_g16.105,atm,cam.h1,PRECL,20060101-21001231,105,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,b.e11.BRCP85C5CNBDRD.f09_g16.105.cam.h1.PRECL....,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,,,1,True,
9,HPSS:hsi:/CCSM/csm/CESM-CAM5-BGC-LE,RCP85,b.e11.BRCP85C5CNBDRD.f09_g16.105,atm,cam.h1,PRECSC,20060101-21001231,105,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,b.e11.BRCP85C5CNBDRD.f09_g16.105.cam.h1.PRECSC...,/CCSM/csm/CESM-CAM5-BGC-LE/atm/proc/tseries/da...,,,1,True,
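
Since the collection is an ordinary `pandas.DataFrame`, standard pandas selection applies to it. The sketch below rebuilds a few columns of the first three rows shown above by hand so that it runs on its own; with a real collection, `df` would simply be `col.df`.

```python
import pandas as pd

# A few columns of the first rows shown above, typed in by hand.
df = pd.DataFrame({
    "experiment": ["RCP85"] * 3,
    "component": ["atm"] * 3,
    "stream": ["cam.h1"] * 3,
    "variable": ["FLNS", "FLNSC", "FLUT"],
    "ensemble": [105] * 3,
})

# Select the rows for one variable of one component.
subset = df[(df["component"] == "atm") & (df["variable"] == "FLNS")]
print(subset)

# List the distinct variables available in the collection.
variables = sorted(df["variable"].unique())
print(variables)
```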
