# Getting Started

This notebook demonstrates the data catalog functionality provided by `intake-esm`. Let's begin by importing `intake`:

In [1]:
import intake

## Open a Collection


To use `intake-esm`, we can instantiate an `esm_metadatastore` class in two ways:
- With the name of the collection and the type of collection we want to use.
- With a collection input YAML file.

Because the class is exposed at the top level of the package (in `__init__.py`) and the package name starts with `intake_`, Intake scans it when it is imported. The plugin then appears automatically in the set of known plugins in the Intake registry, and an associated `intake.open_esm_metadatastore` function is created at import time.

In [2]:
intake.registry

{'yaml_file_cat': intake.catalog.local.YAMLFileCatalog,
 'yaml_files_cat': intake.catalog.local.YAMLFilesCatalog,
 'remote-xarray': intake_xarray.xarray_container.RemoteXarray,
 'cmip5': intake_cmip.cmip5.CMIP5DataSource,
 'xarray_image': intake_xarray.image.ImageSource,
 'netcdf': intake_xarray.netcdf.NetCDFSource,
 'opendap': intake_xarray.opendap.OpenDapSource,
 'rasterio': intake_xarray.raster.RasterIOSource,
 'zarr': intake_xarray.xzarr.ZarrSource,
 'esm_metadatastore': intake_esm.core.ESMMetadataStoreCatalog,
 'csv': intake.source.csv.CSVSource,
 'textfiles': intake.source.textfiles.TextFilesSource,
 'catalog': intake.catalog.base.Catalog,
 'intake_remote': intake.catalog.base.RemoteCatalog,
 'numpy': intake.source.npy.NPySource}
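
The name-based scan described above can be illustrated with the standard library alone. The following is a simplified sketch, not Intake's actual discovery code: the function name `discover_plugin_modules` is hypothetical, and it only lists candidate module names without importing them or registering any driver.

```python
import pkgutil


def discover_plugin_modules(prefix="intake_", path=None):
    """Return sorted names of top-level modules whose name starts with `prefix`.

    `path` is a list of directories to scan; None scans everything on sys.path.
    This helper is hypothetical and only mimics the naming convention.
    """
    return sorted(
        module.name
        for module in pkgutil.iter_modules(path)
        if module.name.startswith(prefix)
    )


# Lists whichever intake_* packages happen to be installed in this environment.
print(discover_plugin_modules())
```

In an environment like the one used for this notebook, the result would be expected to include names such as `intake_esm` and `intake_xarray`, which correspond to the module paths visible in the registry output.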

In [3]:
collection = intake.open_esm_metadatastore(collection_name='cesm_dple', collection_type="cesm")

In [4]:
collection.df.head()

Unnamed: 0,resource,resource_type,direct_access,experiment,case,component,stream,variable,date_range,ensemble,files,files_basename,files_dirname,ctrl_branch_year,year_offset,sequence_order,has_ocean_bgc,grid
0,GLADE:posix:/glade/p/cgd/oce/projects/DPLE_O2/...,posix,True,g.e11_LENS.GECOIAF.T62_g16.009_sigma_coord,g.e11_LENS.GECOIAF.T62_g16.009,ocn,pop.h.sigma,O2,024901-031612,0,/glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/...,g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.O2....,/glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/...,,1699,0,False,POP_gx1v6
1,GLADE:posix:/glade/p/cgd/oce/projects/DPLE_O2/...,posix,True,g.e11_LENS.GECOIAF.T62_g16.009_sigma_coord,g.e11_LENS.GECOIAF.T62_g16.009,ocn,pop.h.sigma,NO3,024901-031612,0,/glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/...,g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.NO3...,/glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/...,,1699,0,False,POP_gx1v6
2,GLADE:posix:/glade/p/cgd/oce/projects/DPLE_O2/...,posix,True,g.e11_LENS.GECOIAF.T62_g16.009_sigma_coord,g.e11_LENS.GECOIAF.T62_g16.009,ocn,pop.h.sigma,SALT,024901-031612,0,/glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/...,g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.SAL...,/glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/...,,1699,0,False,POP_gx1v6
3,GLADE:posix:/glade/p/cgd/oce/projects/DPLE_O2/...,posix,True,g.e11_LENS.GECOIAF.T62_g16.009_sigma_coord,g.e11_LENS.GECOIAF.T62_g16.009,ocn,pop.h.sigma,TEMP,024901-031612,0,/glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/...,g.e11_LENS.GECOIAF.T62_g16.009.pop.h.sigma.TEM...,/glade/p/cgd/oce/projects/DPLE_O2/sigma_coord/...,,1699,0,False,POP_gx1v6
4,GLADE:posix:/glade/p/cesm/community/CESM-DPLE/...,posix,True,g.e11_LENS.GECOIAF.T62_g16.009,g.e11_LENS.GECOIAF.T62_g16.009,ocn,pop.h,ADVT,024901-031612,0,/glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO...,g.e11_LENS.GECOIAF.T62_g16.009.pop.h.ADVT.0249...,/glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO...,,1699,0,False,POP_gx1v6


In [5]:
collection.df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 583 entries, 0 to 582
Data columns (total 18 columns):
resource            583 non-null object
resource_type       583 non-null object
direct_access       583 non-null bool
experiment          583 non-null object
case                583 non-null object
component           583 non-null object
stream              583 non-null object
variable            583 non-null object
date_range          583 non-null object
ensemble            583 non-null int64
files               583 non-null object
files_basename      583 non-null object
files_dirname       583 non-null object
ctrl_branch_year    0 non-null float64
year_offset         583 non-null int64
sequence_order      583 non-null int64
has_ocean_bgc       583 non-null bool
grid                325 non-null object
dtypes: bool(2), float64(1), int64(3), object(12)
memory usage: 78.6+ KB


## Search entries matching query

`intake-esm` supports querying the collection through the `search` method. The query is specified with keyword arguments, and the method returns the subset of the collection whose entries match it.

In [7]:
cat = collection.search(case="g.e11_LENS.GECOIAF.T62_g16.009", component='ocn', variable=["O2"], stream="pop.h")
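
To make the matching semantics of a query like this one concrete, here is a minimal pure-Python sketch. It rests on assumptions that this notebook does not confirm: that keywords are combined with a logical AND, that a list value matches any of its items, and that unspecified (`None`) keywords are ignored. The standalone `search` function and the toy rows are illustrative only and are not how `intake-esm` is implemented.

```python
def search(rows, **query):
    """Return the rows whose fields match every keyword in `query`.

    A scalar value must equal the field; a list or tuple matches if the
    field equals any of its items. Keywords set to None are ignored.
    """
    def matches(row):
        for field, wanted in query.items():
            if wanted is None:
                continue
            allowed = wanted if isinstance(wanted, (list, tuple)) else [wanted]
            if row.get(field) not in allowed:
                return False
        return True

    return [row for row in rows if matches(row)]


# Toy rows reusing column names and values from the collection shown above.
rows = [
    {"component": "ocn", "stream": "pop.h.sigma", "variable": "O2"},
    {"component": "ocn", "stream": "pop.h.sigma", "variable": "TEMP"},
    {"component": "ocn", "stream": "pop.h", "variable": "ADVT"},
    {"component": "ocn", "stream": "pop.h", "variable": "O2"},
]

subset = search(rows, component="ocn", variable=["O2"], stream="pop.h")
print(subset)
```

Under these assumptions only the last toy row survives, because it is the only one with both `stream == "pop.h"` and `variable == "O2"`.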

In [8]:
print(cat.yaml(True))

plugins:
  source:
  - module: intake_esm.cesm
sources:
  cesm_dple_a9c6c878-0fee-43f1-bd39-978a2d342057:
    args:
      chunks:
        time: 1
      collection_name: cesm_dple
      collection_type: cesm
      concat_dim: time
      decode_coords: false
      decode_times: false
      engine: netcdf4
      query:
        case: g.e11_LENS.GECOIAF.T62_g16.009
        component: ocn
        ctrl_branch_year: null
        date_range: null
        direct_access: null
        ensemble: null
        experiment: null
        files: null
        files_basename: null
        files_dirname: null
        grid: null
        has_ocean_bgc: null
        resource: null
        resource_type: null
        sequence_order: null
        stream: pop.h
        variable:
        - O2
        year_offset: null
    description: Catalog entry from cesm_dple collection
    driver: cesm
    metadata:
      cache: {}
      catalog_dir: ''



In [9]:
cat.results.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1 entries, 522 to 522
Data columns (total 18 columns):
resource            1 non-null object
resource_type       1 non-null object
direct_access       1 non-null bool
experiment          1 non-null object
case                1 non-null object
component           1 non-null object
stream              1 non-null object
variable            1 non-null object
date_range          1 non-null object
ensemble            1 non-null int64
files               1 non-null object
files_basename      1 non-null object
files_dirname       1 non-null object
ctrl_branch_year    0 non-null float64
year_offset         1 non-null int64
sequence_order      1 non-null int64
has_ocean_bgc       1 non-null bool
grid                1 non-null object
dtypes: bool(2), float64(1), int64(3), object(12)
memory usage: 138.0+ bytes


In [10]:
cat.results

Unnamed: 0,resource,resource_type,direct_access,experiment,case,component,stream,variable,date_range,ensemble,files,files_basename,files_dirname,ctrl_branch_year,year_offset,sequence_order,has_ocean_bgc,grid
522,GLADE:posix:/glade/p/cesm/community/CESM-DPLE/...,posix,True,g.e11_LENS.GECOIAF.T62_g16.009,g.e11_LENS.GECOIAF.T62_g16.009,ocn,pop.h,O2,024901-031612,0,/glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO...,g.e11_LENS.GECOIAF.T62_g16.009.pop.h.O2.024901...,/glade/p/cesm/community/CESM-DPLE/CESM-DPLE_PO...,,1699,0,False,POP_gx1v6


In [11]:
ds = cat.to_xarray()
ds

<xarray.Dataset>
Dimensions:               (d2: 2, ens: 1, lat_aux_grid: 395, moc_comp: 3, moc_z: 61, nlat: 384, nlon: 320, time: 816, transport_comp: 5, transport_reg: 2, z_t: 60, z_t_150m: 15, z_w: 60, z_w_bot: 60, z_w_top: 60)
Coordinates:
  * z_t                   (z_t) float32 500.0 1500.0 ... 512502.8 537500.0
  * z_w                   (z_w) float32 0.0 1000.0 2000.0 ... 500004.7 525000.94
  * moc_z                 (moc_z) float32 0.0 1000.0 ... 525000.94 549999.06
  * z_w_top               (z_w_top) float32 0.0 1000.0 ... 500004.7 525000.94
  * z_w_bot               (z_w_bot) float32 1000.0 2000.0 ... 549999.06
  * lat_aux_grid          (lat_aux_grid) float32 -79.48815 -78.952896 ... 90.0
  * z_t_150m              (z_t_150m) float32 500.0 1500.0 ... 13500.0 14500.0
  * time                  (time) float64 9.092e+04 9.094e+04 ... 1.157e+05
Dimensions without coordinates: d2, ens, moc_comp, nlat, nlon, transport_comp, transport_reg
Data variables:
    REGION_MASK           (nlat, 

In [12]:
cat2 = collection.search(component=["ocn"], variable=["TEMP", "ADVT"])

In [13]:
print(cat2.yaml(True))

plugins:
  source:
  - module: intake_esm.cesm
sources:
  cesm_dple_af8dd059-b59f-4964-ae9b-30362c33bb67:
    args:
      chunks:
        time: 1
      collection_name: cesm_dple
      collection_type: cesm
      concat_dim: time
      decode_coords: false
      decode_times: false
      engine: netcdf4
      query:
        case: null
        component:
        - ocn
        ctrl_branch_year: null
        date_range: null
        direct_access: null
        ensemble: null
        experiment: null
        files: null
        files_basename: null
        files_dirname: null
        grid: null
        has_ocean_bgc: null
        resource: null
        resource_type: null
        sequence_order: null
        stream: null
        variable:
        - TEMP
        - ADVT
        year_offset: null
    description: Catalog entry from cesm_dple collection
    driver: cesm
    metadata:
      cache: {}
      catalog_dir: ''



As the user queries the collection, `intake-esm` builds a dictionary of the catalog entries created by the searches executed so far:

In [14]:
collection._entries

{'cesm_dple_a9c6c878-0fee-43f1-bd39-978a2d342057': <Catalog Entry: cesm_dple_a9c6c878-0fee-43f1-bd39-978a2d342057>,
 'cesm_dple_af8dd059-b59f-4964-ae9b-30362c33bb67': <Catalog Entry: cesm_dple_af8dd059-b59f-4964-ae9b-30362c33bb67>}
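
The bookkeeping shown above can be sketched in a few lines. This is a hypothetical illustration: the `SearchLog` class is invented here, and the naming pattern (collection name followed by a random UUID) is an assumption inferred from the keys in the output, not taken from the `intake-esm` source.

```python
import uuid


class SearchLog:
    """Toy stand-in for a collection that remembers every search it ran."""

    def __init__(self, collection_name):
        self.collection_name = collection_name
        self.entries = {}  # entry name -> query that produced it

    def search(self, **query):
        # Name pattern assumed from the keys shown above: <collection>_<uuid4>.
        name = f"{self.collection_name}_{uuid.uuid4()}"
        self.entries[name] = query
        return name


log = SearchLog("cesm_dple")
first = log.search(component="ocn", variable=["O2"], stream="pop.h")
second = log.search(component=["ocn"], variable=["TEMP", "ADVT"])

# One entry per executed search, in the order the searches were run.
for name, query in log.entries.items():
    print(name, query)
```

Because each name embeds a fresh UUID, repeating an identical query in this sketch would add a new entry rather than reuse an existing one.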

In [15]:
for entry in collection._entries.values():
    print(entry.yaml(True))

plugins:
  source:
  - module: intake_esm.cesm
sources:
  cesm_dple_a9c6c878-0fee-43f1-bd39-978a2d342057:
    args:
      chunks:
        time: 1
      collection_name: cesm_dple
      collection_type: cesm
      concat_dim: time
      decode_coords: false
      decode_times: false
      engine: netcdf4
      query:
        case: g.e11_LENS.GECOIAF.T62_g16.009
        component: ocn
        ctrl_branch_year: null
        date_range: null
        direct_access: null
        ensemble: null
        experiment: null
        files: null
        files_basename: null
        files_dirname: null
        grid: null
        has_ocean_bgc: null
        resource: null
        resource_type: null
        sequence_order: null
        stream: pop.h
        variable:
        - O2
        year_offset: null
    description: Catalog entry from cesm_dple collection
    driver: cesm
    metadata:
      cache: {}
      catalog_dir: ''
      coords: !!python/tuple
      - z_t
      - z_w
      - moc_z
      

In [16]:
%load_ext watermark

In [17]:
%watermark --iversion -g -h -m -v -u -d

intake 0.4.1
last updated: 2019-02-17 

CPython 3.6.7
IPython 7.1.1

compiler   : GCC 7.3.0
system     : Linux
release    : 3.12.62-60.64.8-default
machine    : x86_64
processor  : x86_64
CPU cores  : 72
interpreter: 64bit
host name  : r6i6n30
Git hash   : 71d5340c3bbabf8a76f6ba780675f5b7e4a1c8e5
