# Use a custom parser

While many of the parsers included in this library may be useful, we do not have parsers for **every** dataset out there. If you are interested in adding your own parser (and hopefully contributing that parser to the main repo 😊), check out this walkthrough of how to build one!

## What is a Parser?
Basically, a parser collects information from two main sources:
* The file string
* The dataset itself

This means there are two main steps:
* Parsing out the file string, separating based on some symbol
* Opening the file, and extracting variables and their attributes, or even global attributes

The result from a "parser" is a dictionary of fields to add to the catalog; the dictionaries from all files are stored together in a `pandas.DataFrame`.
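As a minimal sketch of that idea, the snippet below turns two hand-written parser results into a `pandas.DataFrame`. The first record mirrors the AIRS file used in this walkthrough; the second record is entirely made up for illustration, including its source name, variable names, and path.

```python
import pandas as pd

# Each parsed file contributes one dictionary of catalog fields.
records = [
    {
        "source": "AIRS",
        "temporal": "JAN",
        "time_period": "monthly",
        "variable": ["gw", "T", "RELHUM", "O3", "SHUM", "PREH2O"],
        "path": "/glade/p/cesm/amwg/amwg_diagnostics/obs_data/AIRS_01_climo.nc",
    },
    {
        # Hypothetical record: every value here is a placeholder.
        "source": "EXAMPLE",
        "temporal": "ANN",
        "time_period": "annual",
        "variable": ["var_a", "var_b"],
        "path": "/path/to/EXAMPLE_ANN_climo.nc",
    },
]

# One row per file, one column per field.
catalog = pd.DataFrame(records)
print(catalog[["source", "temporal", "time_period"]])
```

Keeping the parser's output a plain dictionary means the same function can be tested on its own, without building a full catalog.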

It would probably be **more helpful** to walk through a concrete example of this...

## Example of Building a Parser
Let's say we have a list of files that we want to parse! In this example, we are using a set of observational data on NCAR HPC resources. A full blog post detailing this dataset and comparison is [included here](https://ncar.github.io/esds/posts/2021/intake-obs-cesm2le-comparison/).

### Imports

In [16]:
import glob
import pathlib
import traceback
from datetime import datetime

import xarray as xr

from ecgtools import Builder
from ecgtools.builder import INVALID_ASSET, TRACEBACK

In [3]:
files = sorted(glob.glob('/glade/p/cesm/amwg/amwg_diagnostics/obs_data/*'))
files[::20]

['/glade/p/cesm/amwg/amwg_diagnostics/obs_data/AIRS_01_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ARM_annual_cycle_twp_c2_cmbe_sound_p_f.cdf',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES-EBAF_01_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES2_04_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CLOUDSATCOSP_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CLOUDSAT_10_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ECMWF_09_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/EP.ERAI_DJF_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERAI_04_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERBE_07_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/ERS_12_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/GPCP_JJA_climo.nc',
 '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/HadISST_CL_03_climo.nc',
 '/glade/p/cesm/am

Observational datasets in this directory follow the convention `source_(month/season/annual)_climo.nc`.
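Not every file in the listing follows that convention (the ARM file ends in `.cdf`, for example), so it can help to check names before parsing them. The helper below is a rough sketch; `follows_climo_convention` is a hypothetical name, and the rule it applies (a `.nc` suffix, at least three underscore-separated parts, and a trailing `climo`) is an assumption drawn from the convention stated above.

```python
import pathlib


def follows_climo_convention(filename):
    """Return True if the name looks like source_<temporal>_climo.nc."""
    path = pathlib.Path(filename)
    parts = path.stem.split("_")
    return path.suffix == ".nc" and len(parts) >= 3 and parts[-1] == "climo"


for name in [
    "AIRS_01_climo.nc",
    "EP.ERAI_DJF_climo.nc",
    "ARM_annual_cycle_twp_c2_cmbe_sound_p_f.cdf",
]:
    print(name, follows_climo_convention(name))
```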

Let’s open up one of those datasets.

In [5]:
ds = xr.open_dataset('/glade/p/cesm/amwg/amwg_diagnostics/obs_data/CERES-EBAF_01_climo.nc')
ds

We see that this dataset is gridded on a global 0.5° grid, with several variables related to solar fluxes (e.g. `TOA net shortwave`).

### Parsing the Filepath
As mentioned before, the first step is parsing out information from the filepath. Here, we use [pathlib](https://docs.python.org/3/library/pathlib.html), which can be helpful when working with filepaths generically.

In [7]:
path = pathlib.Path(files[0])
path.stem

'AIRS_01_climo'

The stem can be split using `.split('_')`, which separates it into the following:
* Observational dataset source
* Month Number, Season, or Annual
* “climo”

In [8]:
path.stem.split('_')

['AIRS', '01', 'climo']
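The pieces can then be given names. The sketch below uses a hypothetical helper, `split_stem`, and indexes the temporal token from the end of the list, so that a name with an extra underscore, such as `HadISST_CL_03_climo.nc` from the listing above, still yields the right month. Note that the source is taken from the first piece only, so `HadISST_CL` is shortened to `HadISST`.

```python
import pathlib


def split_stem(filename):
    """Name the underscore-separated pieces of a climatology filename."""
    parts = pathlib.Path(filename).stem.split("_")
    return {
        "source": parts[0],     # first piece: dataset source
        "temporal": parts[-2],  # second-to-last piece: month, season, or ANN
        "suffix": parts[-1],    # last piece: expected to be "climo"
    }


print(split_stem("AIRS_01_climo.nc"))
print(split_stem("HadISST_CL_03_climo.nc"))
```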

### Open the File for More Information
We can also gather useful insight by opening the file!

In [10]:
ds = xr.open_dataset(files[0])
ds

Let’s look at the variable “Temperature” (`T`)

In [11]:
ds.T

In this case, we want to include the list of variables available from this single file, such that each entry in our catalog represents a single file. We can search for variables in this dataset using the following:

In [13]:
variable_list = [var for var in ds if 'long_name' in ds[var].attrs]
variable_list

['gw', 'T', 'RELHUM', 'O3', 'SHUM', 'PREH2O']
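To see what the `long_name` filter does without access to the files, here is a small sketch on an in-memory dataset. The dataset is a toy built for illustration: the values, the `scratch` variable, and the `lat` coordinate are all made up. Iterating over an `xarray.Dataset` yields its data variable names, so coordinates such as `lat` are not considered.

```python
import xarray as xr

# Toy dataset: two variables carry a long_name attribute, one does not.
ds_demo = xr.Dataset(
    data_vars={
        "T": ("lat", [250.0, 260.0], {"long_name": "Temperature"}),
        "RELHUM": ("lat", [40.0, 55.0], {"long_name": "Relative humidity"}),
        "scratch": ("lat", [0.0, 1.0]),  # no attributes at all
    },
    coords={"lat": [-45.0, 45.0]},
)

# Same comprehension as above: keep variables that have a long_name.
described = [var for var in ds_demo if "long_name" in ds_demo[var].attrs]
print(described)
```

Filtering on `long_name` is a heuristic; a dataset whose variables lack that attribute would need a different criterion.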

### Assembling These Parts into a Function
Now that we have methods of extracting the relevant information, we can assemble them into a function that returns a dictionary. You'll notice the addition of exception handling: if a file cannot be parsed, the function returns the file and the associated traceback instead, and these are added to a `pandas.DataFrame` of unparsable files.

In [15]:
def parse_amwg_obs(file):
    """Parse the filename and contents of an atmospheric observational data file."""
    file = pathlib.Path(file)
    info = {}

    try:
        stem = file.stem
        split = stem.split('_')
        source = split[0]
        temporal = split[-2]
        if len(temporal) == 2:
            month_number = int(temporal)
            time_period = 'monthly'
            temporal = datetime(2020, month_number, 1).strftime('%b').upper()

        elif temporal == 'ANN':
            time_period = 'annual'
        else:
            time_period = 'seasonal'

        with xr.open_dataset(file, chunks={}, decode_times=False) as ds:
            variable_list = [var for var in ds if 'long_name' in ds[var].attrs]

            info = {
                'source': source,
                'temporal': temporal,
                'time_period': time_period,
                'variable': variable_list,
                'path': str(file),
            }

        return info

    except Exception:
        return {INVALID_ASSET: file, TRACEBACK: traceback.format_exc()}
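The error-handling pattern can be tried in isolation. The sketch below is a deliberately strict toy, not the parser above: it only accepts numeric month tokens, so seasonal names fail on purpose. The two key strings are hypothetical stand-ins for `INVALID_ASSET` and `TRACEBACK` from `ecgtools.builder`, and the split into valid and invalid records is done by hand here.

```python
import traceback

# Hypothetical stand-ins for the constants imported from ecgtools.builder.
INVALID_ASSET_KEY = "INVALID_ASSET"
TRACEBACK_KEY = "TRACEBACK"


def parse_stem(stem):
    """Toy parser: return fields on success, or an error record on failure."""
    try:
        parts = stem.split("_")
        month = int(parts[-2])  # may raise IndexError or ValueError
        return {"source": parts[0], "month": month}
    except Exception:
        return {INVALID_ASSET_KEY: stem, TRACEBACK_KEY: traceback.format_exc()}


records = [parse_stem(s) for s in ["AIRS_01_climo", "GPCP_JJA_climo", "broken"]]

# Separate usable records from the ones that failed to parse.
valid = [r for r in records if INVALID_ASSET_KEY not in r]
invalid = [r for r in records if INVALID_ASSET_KEY in r]
print(len(valid), "valid,", len(invalid), "invalid")
```

Returning an error record rather than raising lets one bad file be reported without stopping the rest of the run.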

### Test this Parser on Some Files
We can try this parser on a single file to make sure that it returns a dictionary.

In [20]:
parse_amwg_obs(files[0])

{'source': 'AIRS',
 'temporal': 'JAN',
 'time_period': 'monthly',
 'variable': ['gw', 'T', 'RELHUM', 'O3', 'SHUM', 'PREH2O'],
 'path': '/glade/p/cesm/amwg/amwg_diagnostics/obs_data/AIRS_01_climo.nc'}
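The filename logic can also be exercised without opening any files. The helper below, `classify_temporal`, is a hypothetical extraction of the `if`/`elif`/`else` branch from `parse_amwg_obs`; it assumes the default (English) locale for the month abbreviation. Any token that is neither two characters long nor `ANN` falls through to `seasonal`, including tokens that are not seasons at all.

```python
from datetime import datetime


def classify_temporal(token):
    """Return (temporal label, time period) for a filename token."""
    if len(token) == 2:
        # Two characters: treat as a month number, e.g. "01" -> "JAN".
        label = datetime(2020, int(token), 1).strftime("%b").upper()
        return label, "monthly"
    if token == "ANN":
        return token, "annual"
    # Everything else is labelled seasonal, whether or not it is a season.
    return token, "seasonal"


for token in ["01", "ANN", "DJF", "c2h6"]:
    print(token, classify_temporal(token))
```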

Now that we have made sure that it works, we can use it in `ecgtools`!

First, we set up the `Builder` object.

In [17]:
b = Builder('/glade/p/cesm/amwg/amwg_diagnostics/obs_data')

Next, we build the catalog using our newly created parser!

In [18]:
b.build(parse_amwg_obs)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 40 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    0.9s remaining:    0.0s
[Parallel(n_jobs=-1)]: Done   3 out of   3 | elapsed:    0.9s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 40 concurrent workers.
[Parallel(n_jobs=-1)]: Done  82 tasks      | elapsed:    3.4s
[Parallel(n_jobs=-1)]: Done 216 tasks      | elapsed:    3.8s
[Parallel(n_jobs=-1)]: Done 760 tasks      | elapsed:    4.1s
[Parallel(n_jobs=-1)]: Done 2333 tasks      | elapsed:    5.1s
[Parallel(n_jobs=-1)]: Done 2882 tasks      | elapsed:    5.5s
[Parallel(n_jobs=-1)]: Done 3096 out of 3096 | elapsed:    5.8s finished

Builder(root_path=PosixPath('/glade/p/cesm/amwg/amwg_diagnostics/obs_data'), extension='.nc', depth=0, exclude_patterns=None, njobs=-1)

Let's take a look at our resultant catalog...

In [19]:
b.df

| | source | temporal | time_period | variable | path |
|---|---|---|---|---|---|
| 0 | ABLE-2A | c2h6 | seasonal | [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
| 1 | ABLE-2A | c2h6 | seasonal | [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
| 2 | ABLE-2A | c3h8 | seasonal | [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
| 3 | ABLE-2A | c3h8 | seasonal | [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
| 6 | ABLE-2A | noday | seasonal | [dnum, dmin, dmax, dmed, dmn, dstd, d5pt, d25p... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
| ... | ... | ... | ... | ... | ... |
| 3091 | ozonesondes | polar1995 | seasonal | [levels, o3_mean, o3_med, o3_num, o3_std, o3_w... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
| 3092 | ozonesondes | tropics11995 | seasonal | [levels, o3_mean, o3_med, o3_num, o3_std, o3_w... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
| 3093 | ozonesondes | tropics21995 | seasonal | [levels, o3_mean, o3_med, o3_num, o3_std, o3_w... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
| 3094 | ozonesondes | tropics31995 | seasonal | [levels, o3_mean, o3_med, o3_num, o3_std, o3_w... | /glade/p/cesm/amwg/amwg_diagnostics/obs_data/c... |
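Since `b.df` is a `pandas.DataFrame`, ordinary pandas selection can be used to narrow the catalog. The sketch below works on a small stand-in DataFrame rather than the real catalog: it keeps only three columns, and its three rows reuse values that appear in this walkthrough.

```python
import pandas as pd

# Stand-in for b.df, reduced to three columns and three rows.
df_demo = pd.DataFrame(
    {
        "source": ["AIRS", "ABLE-2A", "ozonesondes"],
        "temporal": ["JAN", "c2h6", "polar1995"],
        "time_period": ["monthly", "seasonal", "seasonal"],
    }
)

# Keep only the rows that the parser labelled as monthly climatologies.
monthly = df_demo[df_demo["time_period"] == "monthly"]
print(monthly)
```

The same boolean-mask approach works for any other column, for example selecting a single `source`.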
