# Getting Started

This notebook demonstrates data catalog functionality provided by `intake-cesm`. Let's begin by importing intake:

In [1]:
import intake

## Open a Collection

So far, `intake-cesm` supports datasets from three CESM collections:

- `cesm_dple`: DPLE | Decadal Prediction Large Ensemble Project
- `cesm2_runs`: CESM runs
- `cesm1_le`: LE | Large Ensemble Community Project



To use `intake-cesm`, we instantiate the `cesm_metadatastore` class with the name of the collection we want to use.

Because the class is defined at the top level of the package (in `__init__.py`) and the package name starts with `intake_`, Intake scans it on import. The plugin then appears automatically in the Intake registry, and an associated `intake.open_cesm_metadatastore` function is created at import time.
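
The registration idea described here can be pictured with a toy model. The sketch below is not Intake's actual implementation: `ToyCesmMetadataStore`, `is_plugin_package`, and `make_open_functions` are hypothetical names, used only to illustrate how a naming convention plus a registry of driver classes can yield an `open_<name>` helper.

```python
from types import SimpleNamespace


class ToyCesmMetadataStore:
    """Hypothetical stand-in for a catalog driver class."""

    name = "cesm_metadatastore"

    def __init__(self, collection):
        self.collection = collection


def is_plugin_package(module_name, prefix="intake_"):
    """Return True if a package name follows the ``intake_*`` convention."""
    return module_name.startswith(prefix)


def make_open_functions(registry):
    """Build one ``open_<driver name>`` callable per registered driver."""
    namespace = SimpleNamespace()
    for driver_name, driver_cls in registry.items():
        # A class is itself callable, so it can serve directly as the opener.
        setattr(namespace, f"open_{driver_name}", driver_cls)
    return namespace


# Only packages whose names start with "intake_" are considered plugins.
candidates = ["intake_cesm", "numpy", "intake_xarray"]
plugin_packages = [name for name in candidates if is_plugin_package(name)]

# Toy registry mapping a driver name to its class (an assumption for illustration).
registry = {ToyCesmMetadataStore.name: ToyCesmMetadataStore}
toy_intake = make_open_functions(registry)
store = toy_intake.open_cesm_metadatastore("cesm_dple")
print(plugin_packages, store.collection)
```

In the real package the equivalent lookup is simply `intake.registry`, as the next cell shows.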

In [2]:
intake.registry

{'yaml_file_cat': intake.catalog.local.YAMLFileCatalog,
 'yaml_files_cat': intake.catalog.local.YAMLFilesCatalog,
 'remote-xarray': intake_xarray.xarray_container.RemoteXarray,
 'xarray_image': intake_xarray.image.ImageSource,
 'netcdf': intake_xarray.netcdf.NetCDFSource,
 'opendap': intake_xarray.opendap.OpenDapSource,
 'rasterio': intake_xarray.raster.RasterIOSource,
 'zarr': intake_xarray.xzarr.ZarrSource,
 'cesm_metadatastore': intake_cesm.core.CesmMetadataStoreCatalog,
 'cesm': intake_cesm.core.CesmSource,
 'csv': intake.source.csv.CSVSource,
 'textfiles': intake.source.textfiles.TextFilesSource,
 'catalog': intake.catalog.base.Catalog,
 'intake_remote': intake.catalog.base.RemoteCatalog,
 'numpy': intake.source.npy.NPySource}

In [3]:
collection = intake.open_cesm_metadatastore('cesm_dple')

Active collection: cesm_dple


In [4]:
collection.df.head()

Unnamed: 0,case,component,date_range,ensemble,experiment,file_basename,files,grid,sequence_order,stream,variable,year_offset,ctrl_branch_year,has_ocean_bgc
0,g.e11_LENS.GECOIAF.T62_g16.009,ocn,"['024901', '031612']",0,hindcast_sigma_coord,g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.NO3...,/glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/...,POP_gx1v6,0,pop.h.sigma,NO3,1699,,
1,g.e11_LENS.GECOIAF.T62_g16.009,ocn,"['024901', '031612']",0,hindcast_sigma_coord,g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.O2....,/glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/...,POP_gx1v6,0,pop.h.sigma,O2,1699,,
2,g.e11_LENS.GECOIAF.T62_g16.009,ocn,"['024901', '031612']",0,hindcast_sigma_coord,g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.SAL...,/glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/...,POP_gx1v6,0,pop.h.sigma,SALT,1699,,
3,g.e11_LENS.GECOIAF.T62_g16.009,ocn,"['024901', '031612']",0,hindcast_sigma_coord,g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.TEM...,/glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/...,POP_gx1v6,0,pop.h.sigma,TEMP,1699,,
4,g.e11_LENS.GECOIAF.T62_g16.009,ice,"['024901', '031612']",0,hindcast,g.e11_LENS.GECOIAF.T62_g16.009.cice.h.FYarea_n...,/glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO...,,0,cice.h,FYarea_nh,1699,,


In [5]:
collection.df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 583 entries, 0 to 582
Data columns (total 14 columns):
case                583 non-null object
component           583 non-null object
date_range          583 non-null object
ensemble            583 non-null int64
experiment          583 non-null object
file_basename       583 non-null object
files               583 non-null object
grid                325 non-null object
sequence_order      583 non-null int64
stream              583 non-null object
variable            583 non-null object
year_offset         583 non-null int64
ctrl_branch_year    0 non-null float64
has_ocean_bgc       0 non-null float64
dtypes: float64(2), int64(3), object(9)
memory usage: 68.3+ KB


## Set Active Collection

`intake-cesm` lets the user switch the active collection by calling `set_collection(collection_name)`.

In [6]:
collection.set_collection('cesm1_le')

Active collection: cesm1_le


In [7]:
collection.df.head()

Unnamed: 0,case,component,date_range,ensemble,experiment,file_basename,files,freq,grid,has_ocean_bgc,sequence_order,variable,year_offset,ctrl_branch_year
0,b.e11.BRCP85C5CNBDRD.f09_g16.105,ice,"['200601', '210012']",105,RCP85,b.e11.BRCP85C5CNBDRD.f09_g16.105.cice.h.hisnap...,/glade/collections/cdg/data/cesmLE/CESM-CAM5-B...,month_1,POP_gx1v6,True,1,hisnap_nh,,
1,b.e11.BRCP85C5CNBDRD.f09_g16.105,ice,"['200601', '210012']",105,RCP85,b.e11.BRCP85C5CNBDRD.f09_g16.105.cice.h.hisnap...,/glade/collections/cdg/data/cesmLE/CESM-CAM5-B...,month_1,POP_gx1v6,True,1,hisnap_sh,,
2,b.e11.BRCP85C5CNBDRD.f09_g16.105,ice,"['200601', '210012']",105,RCP85,b.e11.BRCP85C5CNBDRD.f09_g16.105.cice.h.strair...,/glade/collections/cdg/data/cesmLE/CESM-CAM5-B...,month_1,POP_gx1v6,True,1,strairy_nh,,
3,b.e11.BRCP85C5CNBDRD.f09_g16.105,ice,"['200601', '210012']",105,RCP85,b.e11.BRCP85C5CNBDRD.f09_g16.105.cice.h.strair...,/glade/collections/cdg/data/cesmLE/CESM-CAM5-B...,month_1,POP_gx1v6,True,1,strairy_sh,,
4,b.e11.BRCP85C5CNBDRD.f09_g16.105,ice,"['200601', '210012']",105,RCP85,b.e11.BRCP85C5CNBDRD.f09_g16.105.cice.h.strcor...,/glade/collections/cdg/data/cesmLE/CESM-CAM5-B...,month_1,POP_gx1v6,True,1,strcory_nh,,


In [8]:
collection.df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 116275 entries, 0 to 116274
Data columns (total 14 columns):
case                116275 non-null object
component           116275 non-null object
date_range          116275 non-null object
ensemble            116275 non-null int64
experiment          116275 non-null object
file_basename       116275 non-null object
files               116275 non-null object
freq                116275 non-null object
grid                112347 non-null object
has_ocean_bgc       116275 non-null bool
sequence_order      116275 non-null int64
variable            116275 non-null object
year_offset         15145 non-null float64
ctrl_branch_year    0 non-null float64
dtypes: bool(1), float64(2), int64(2), object(9)
memory usage: 12.5+ MB


## Search Entries Matching Query

`intake-cesm` supports querying a collection through the `search` method. The query is specified with keyword arguments, and the method returns the subset of the collection whose entries match it.

In [9]:
cat = collection.search(experiment=['20C', 'RCP85'], component='ocn', ensemble=1, variable='FG_CO2')
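
To make the query semantics concrete, here is a minimal sketch of keyword-based filtering over plain dictionaries standing in for rows of `collection.df`. It rests on two assumptions that the source does not confirm: a list value means "match any of these", and a key left as `None` imposes no constraint. The helper names `matches` and `search_rows` and the sample rows are made up for illustration; this is not the library's own code.

```python
def matches(row, query):
    """Return True if `row` satisfies every non-None key in `query`."""
    for key, wanted in query.items():
        if wanted is None:
            continue  # None is treated as "no constraint on this column"
        # Normalise scalars to a one-element list so both cases share one test.
        allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
        if row.get(key) not in allowed:
            return False
    return True


def search_rows(rows, **query):
    """Return the rows that match all keyword constraints."""
    return [row for row in rows if matches(row, query)]


# Illustrative rows only; real entries carry many more columns.
rows = [
    {"experiment": "20C", "component": "ocn", "ensemble": 1, "variable": "FG_CO2"},
    {"experiment": "RCP85", "component": "ocn", "ensemble": 1, "variable": "FG_CO2"},
    {"experiment": "RCP85", "component": "ice", "ensemble": 1, "variable": "hisnap_nh"},
    {"experiment": "RCP85", "component": "ocn", "ensemble": 2, "variable": "FG_CO2"},
]

hits = search_rows(
    rows,
    experiment=["20C", "RCP85"],
    component="ocn",
    ensemble=1,
    variable="FG_CO2",
)
print(len(hits))
```

Only the first two sample rows satisfy all four constraints, so `hits` has two elements.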

In [10]:
print(cat.yaml(True))

plugins:
  source:
  - module: intake_cesm.core
sources:
  cesm1_le-de2ba608-e8f4-4726-b434-8238d91722ea:
    args:
      chunks:
        time: 1
      collection: cesm1_le
      concat_dim: time
      decode_coords: false
      decode_times: false
      engine: netcdf4
      query:
        case: null
        component: ocn
        ctrl_branch_year: null
        date_range: null
        ensemble: 1
        experiment:
        - 20C
        - RCP85
        has_ocean_bgc: null
        stream: null
        variable: FG_CO2
    description: Catalog from cesm1_le collection
    driver: cesm
    metadata:
      cache: {}
      catalog_dir: ''



In [11]:
cat.results.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 100755 to 64402
Data columns (total 14 columns):
case                3 non-null object
component           3 non-null object
date_range          3 non-null object
ensemble            3 non-null int64
experiment          3 non-null object
file_basename       3 non-null object
files               3 non-null object
freq                3 non-null object
grid                3 non-null object
has_ocean_bgc       3 non-null bool
sequence_order      3 non-null int64
variable            3 non-null object
year_offset         0 non-null float64
ctrl_branch_year    0 non-null float64
dtypes: bool(1), float64(2), int64(2), object(9)
memory usage: 339.0+ bytes


In [12]:
ds = cat.to_xarray()
ds

ImportError: /glade/work/abanihi/softwares/miniconda3/envs/py-dev/lib/python3.6/site-packages/netCDF4/../../../././libcom_err.so.3: symbol k5_os_mutex_destroy, version krb5support_0_MIT not defined in file libkrb5support.so.0 with link time reference

In [13]:
cat2 = collection.search(experiment='RCP85', component='ice', ensemble=1)

In [14]:
print(cat2.yaml(True))

plugins:
  source:
  - module: intake_cesm.core
sources:
  cesm1_le-9804d3da-79fc-4d8e-b2e8-9f44708698ab:
    args:
      chunks:
        time: 1
      collection: cesm1_le
      concat_dim: time
      decode_coords: false
      decode_times: false
      engine: netcdf4
      query:
        case: null
        component: ice
        ctrl_branch_year: null
        date_range: null
        ensemble: 1
        experiment: RCP85
        has_ocean_bgc: null
        stream: null
        variable: null
    description: Catalog from cesm1_le collection
    driver: cesm
    metadata:
      cache: {}
      catalog_dir: ''



In [16]:
cat2.results.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 592 entries, 63697 to 63336
Data columns (total 14 columns):
case                592 non-null object
component           592 non-null object
date_range          592 non-null object
ensemble            592 non-null int64
experiment          592 non-null object
file_basename       592 non-null object
files               592 non-null object
freq                592 non-null object
grid                592 non-null object
has_ocean_bgc       592 non-null bool
sequence_order      592 non-null int64
variable            592 non-null object
year_offset         0 non-null float64
ctrl_branch_year    0 non-null float64
dtypes: bool(1), float64(2), int64(2), object(9)
memory usage: 65.3+ KB


As the user queries the collection, `intake-cesm` builds a dictionary of catalog entries for the searches executed so far:

In [19]:
collection._entries

{'cesm1_le-9134d01b-3e19-4410-9fd1-521c41150205': <Catalog Entry: cesm1_le-9134d01b-3e19-4410-9fd1-521c41150205>,
 'cesm1_le-d46ef138-8cd0-496e-9333-1cf86d7a2c5a': <Catalog Entry: cesm1_le-d46ef138-8cd0-496e-9333-1cf86d7a2c5a>}
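
The keys in this dictionary look like the collection name followed by a UUID. A toy sketch of that bookkeeping is given below, assuming each search generates a fresh random UUID; `ToyCollection` is a hypothetical class, and whether `intake-cesm` uses `uuid.uuid4` internally is an assumption rather than something this notebook shows.

```python
import uuid


class ToyCollection:
    """Hypothetical collection that remembers every search it has run."""

    def __init__(self, name):
        self.name = name
        self._entries = {}

    def search(self, **query):
        # Key format mirrors the output above: "<collection name>-<uuid>".
        key = f"{self.name}-{uuid.uuid4()}"
        self._entries[key] = dict(query)
        return key


toy = ToyCollection("cesm1_le")
first = toy.search(experiment=["20C", "RCP85"], component="ocn", ensemble=1)
second = toy.search(experiment="RCP85", component="ice", ensemble=1)
print(list(toy._entries))
```

Because the UUID is random, rerunning a search produces a new key, which would be consistent with the keys here differing from those printed in the earlier YAML output.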

In [18]:
for key, val in collection._entries.items():
    print(val.yaml(True))

plugins:
  source:
  - module: intake_cesm.core
sources:
  cesm1_le-de2ba608-e8f4-4726-b434-8238d91722ea:
    args:
      chunks:
        time: 1
      collection: cesm1_le
      concat_dim: time
      decode_coords: false
      decode_times: false
      engine: netcdf4
      query:
        case: null
        component: ocn
        ctrl_branch_year: null
        date_range: null
        ensemble: 1
        experiment:
        - 20C
        - RCP85
        has_ocean_bgc: null
        stream: null
        variable: FG_CO2
    description: Catalog from cesm1_le collection
    driver: cesm
    metadata:
      cache: {}
      catalog_dir: ''

plugins:
  source:
  - module: intake_cesm.core
sources:
  cesm1_le-9804d3da-79fc-4d8e-b2e8-9f44708698ab:
    args:
      chunks:
        time: 1
      collection: cesm1_le
      concat_dim: time
      decode_coords: false
      decode_times: false
      engine: netcdf4
      query:
        case: null
        component: ice
        ctrl_branch_year: nu

In [17]:
%load_ext watermark

In [18]:
%watermark --iversion -g -h -m -v -u -d

intake    0.4.1
last updated: 2019-01-30 

CPython 3.6.7
IPython 7.1.1

compiler   : GCC 7.3.0
system     : Linux
release    : 3.12.62-60.64.8-default
machine    : x86_64
processor  : x86_64
CPU cores  : 72
interpreter: 64bit
host name  : r6i6n31
Git hash   : 9f7c07a3ee429e174acdc9c78a8afdc539d53ca6
