## Script to Create New `intake-esm` Collections (csv.gz) from Old Collections (nc)

This notebook assumes a netCDF collection already exists.
If needed, create the nc collection files in `~/.intake_esm/collections` from the YAML files in `intake-esm-collection-defs`.
Note that I needed to update the old version of `intake-esm` to make `cesm1-le.nc`.
I did this in a slightly modified conda environment based on my `intake-esm` sandbox:

```
$ conda activate /glade/work/mlevy/miniconda3/envs/legacy_intake/
(legacy_intake) $ intake-esm-builder -cdef cesm1-le-collection.yaml
```

I believe the above conda environment only needs `intake-esm` v2019.8.23.
Work is in progress to let `intake-esm-builder` point at a directory and generate the `csv.gz` collection from it.

In [1]:
import xarray as xr
import numpy as np

def prune_bad_collection_data(df):
    df = df.copy()
    bad_data = np.zeros(len(df), dtype=bool)
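    # Flag entries with these date ranges or with paths inside backup directories
    # (raw string so the regex escapes are not treated as string escapes)
    bad_data = (bad_data
                | df.date_range.isin(['185002-190001', '190002-195001', '195002-200001', '200002-201412'])
                | df.path.str.contains(r'\.back/|\.back2/|\.backup\.04012019/'))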
    print(f"Removing {sum(bad_data)} entries")
    return df[~bad_data]

def netcdf_to_df(file_in):
    """
       file_in: netcdf file generated from intake-esm-builder
    """
    df = xr.open_dataset(file_in).to_dataframe()
    df = df.drop(columns=['resource', 'resource_type', 'direct_access', 'file_basename', 'year_offset', 'sequence_order', 'grid'])
    df = df.rename(columns = {'file_fullpath' : 'path'})
    return df
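
As a quick illustration of the path filter in `prune_bad_collection_data`, the sketch below applies the same regular expression with Python's `re.search`, which (like pandas' `str.contains` with its default `regex=True`) matches anywhere in the string. The sample paths and the `is_backup_path` helper are hypothetical, made up for this example; they are not taken from the collection.

```python
import re

# Same pattern used in prune_bad_collection_data, written as a raw string
BACKUP_DIR_PATTERN = re.compile(r'\.back/|\.back2/|\.backup\.04012019/')

def is_backup_path(path):
    """Hypothetical helper: True if path contains one of the backup directory suffixes."""
    return BACKUP_DIR_PATTERN.search(path) is not None

# Hypothetical paths, for illustration only
sample_paths = [
    '/data/run1/ocn/file.nc',
    '/data/run1.back/ocn/file.nc',
    '/data/run1.back2/ocn/file.nc',
    '/data/run1.backup.04012019/ocn/file.nc',
    '/data/run1.backup/ocn/file.nc',
]

flags = {path: is_backup_path(path) for path in sample_paths}
for path, flagged in flags.items():
    print(f"{'remove' if flagged else 'keep':6s} {path}")
```

Only directory names ending in exactly `.back`, `.back2`, or `.backup.04012019` are flagged, so a directory such as `run1.backup/` would be kept.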

### Generate the collection dataframe

There are two different `CESM2-CMIP6.nc` files.
The one in `~mclong/` points to data in `/glade/collections/cdg/timeseries-cmip6`; those runs are being moved to `/glade/campaign/collections/cmip/CMIP6/timeseries-cmip6`.

There is also a collection for `CESM1-CMIP5.nc`, which assumes data has been copied from HPSS to `/glade/p/cgd/oce/projects/cesm2-marbl/intake-esm-data`.
The `get_ocn_cmip5_files.sh` script can be used to add more data to those directories.

In [2]:
CESM1_LENS = netcdf_to_df('/glade/u/home/mlevy/.intake_esm/collections/CESM1-LE.nc')
print("Dataframe for CESM1-LENS (generated from /glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE)\n----\n{}".format(CESM1_LENS))

Dataframe for CESM1-LENS (generated from /glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE)
----
       experiment                              case component  stream  \
index                                                                   
0            CTRL       b.e11.B1850C5CN.f09_g16.005       atm  cam.h1   
1            CTRL       b.e11.B1850C5CN.f09_g16.005       atm  cam.h1   
2            CTRL       b.e11.B1850C5CN.f09_g16.005       atm  cam.h1   
3            CTRL       b.e11.B1850C5CN.f09_g16.005       atm  cam.h1   
4            CTRL       b.e11.B1850C5CN.f09_g16.005       atm  cam.h1   
...           ...                               ...       ...     ...   
161259      RCP45  b.e11.BRCP45C5CNBDRD.f09_g16.011       rof  rtm.h0   
161260      RCP45  b.e11.BRCP45C5CNBDRD.f09_g16.012       rof  rtm.h0   
161261      RCP45  b.e11.BRCP45C5CNBDRD.f09_g16.013       rof  rtm.h0   
161262      RCP45  b.e11.BRCP45C5CNBDRD.f09_g16.014       rof  rtm.h0   
161263      RCP45  b.e11.

### Write `JSON` and `csv` Files Defining Collection

`intake-esm` wants the dataframe written as a `csv` file (it is okay to compress with `gzip`).
Additionally, there is a JSON file that points to the `.csv.gz` file and also defines the different columns.

In [3]:
# df.to_csv('/glade/work/mlevy/intake-esm-collection/CESM1-CMIP5_only-NOT_CMORIZED.csv.gz', compression='gzip', index=False)
import os
import json

def write_collection(df, root_dir, collection_name, desc):
    """
       Write df as a csv file (file name is collection_name+'.csv.gz'; written to root_dir+'/csv.gz/')
       Write json file describing collection (file name is collection_name+'.json'; written to root_dir+'/json/')
    """

    # Make sure output directories exist
    dir_not_found = False
    csv_dir = os.path.join(root_dir, 'csv.gz')
    json_dir = os.path.join(root_dir, 'json')
    if not os.path.isdir(csv_dir):
        dir_not_found = True
        print('Cannot find directory {}'.format(csv_dir))
    if not os.path.isdir(json_dir):
        dir_not_found = True
        print('Cannot find directory {}'.format(json_dir))
    if dir_not_found:
        raise ValueError('Cannot find needed subdirectories in {}'.format(root_dir))

    # Write csv file
    csv_file = os.path.join(csv_dir, collection_name+'.csv.gz')
    df.to_csv(csv_file, compression='gzip', index=False)

    # Write json_file
    json_file = os.path.join(json_dir, collection_name+'.json')
    collection = dict()
    collection["esmcat_version"] = "0.1.0"
    collection["id"] = collection_name
    collection["description"] = desc
    collection["catalog_file"] = csv_file
    collection["attributes"] = []
    for col in df.columns:
        collection["attributes"].append({"column_name" : col, "vocabulary" : ""})
    collection["assets"] = {"column_name" : "path", "format" : "netcdf"}
    collection["aggregation_control"] = dict()
    collection["aggregation_control"]["variable_column_name"] = "variable"
    collection["aggregation_control"]["groupby_attrs"] = ["component", "experiment", "stream"]
    collection["aggregation_control"]["aggregations"] = []
    collection["aggregation_control"]["aggregations"].append({"type" : "union", "attribute_name" : "variable"})
    collection["aggregation_control"]["aggregations"].append({"type" : "join_existing", "attribute_name" : "date_range", "options" : {"dim" : "time", "coords" : "minimal", "compat": "override"}})
    collection["aggregation_control"]["aggregations"].append({"type" : "join_new", "attribute_name" : "member_id", "options" : {"coords" : "minimal", "compat": "override"}})
    
    with open(json_file, "w") as f:
        json.dump(collection, f, indent=2)

In [4]:
write_collection(CESM1_LENS, '/glade/work/mlevy/intake-esm-collection', 'glade-cesm1-le',
                 desc="ESM collection for the CESM1 LENS data stored on GLADE in /glade/collections/cdg/data/cesmLE/CESM-CAM5-BGC-LE")
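
One way to sanity-check the output is to confirm that every column named in the JSON file actually appears in the header of the `csv.gz` file it points to. The sketch below does this with the standard library only. The `check_collection` helper, the temporary directory, and the tiny catalog are all hypothetical stand-ins that mimic the layout produced by `write_collection`; they are not part of `intake-esm`.

```python
import csv
import gzip
import json
import os
import tempfile

def check_collection(json_file):
    """Hypothetical helper: list columns named in json_file that are missing from its csv.gz catalog."""
    with open(json_file) as f:
        collection = json.load(f)
    # Read only the header row of the compressed catalog
    with gzip.open(collection["catalog_file"], "rt", newline="") as f:
        header = next(csv.reader(f))
    expected = [attr["column_name"] for attr in collection["attributes"]]
    expected.append(collection["assets"]["column_name"])
    return sorted(set(expected) - set(header))

# Build a tiny, made-up collection in a temporary directory
tmp_dir = tempfile.mkdtemp()
columns = ["experiment", "variable", "path"]
csv_file = os.path.join(tmp_dir, "demo.csv.gz")
with gzip.open(csv_file, "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(columns)
    writer.writerow(["CTRL", "TEMP", "/hypothetical/path/TEMP.nc"])

json_file = os.path.join(tmp_dir, "demo.json")
demo_collection = {
    "catalog_file": csv_file,
    "attributes": [{"column_name": col, "vocabulary": ""} for col in columns],
    "assets": {"column_name": "path", "format": "netcdf"},
}
with open(json_file, "w") as f:
    json.dump(demo_collection, f, indent=2)

missing = check_collection(json_file)
print("Missing columns:", missing)
```

An empty list means the JSON description and the catalog header agree; pointing `check_collection` at a real collection file should work the same way, assuming it has the `catalog_file`, `attributes`, and `assets` keys written above.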