# Convert lots of small NetCDFs to one big Zarr
The National Water Model writes a new NetCDF file for each hour, resulting in 8760 files for a year! Here's how we are converting bunches of little NetCDF files to Zarr.

In theory, this would be as simple as:

```
import xarray as xr
ds = xr.open_mfdataset('*.nc')
ds.to_zarr('all_nc.zarr', consolidated=True)
```

In practice, we usually want to rechunk, and xarray has issues with certain NetCDF elements, so it's a bit more complicated.

In [None]:
import numpy as np
import xarray as xr
import pandas as pd
import numcodecs
from dask.distributed import Client, progress, LocalCluster

Build the list of hourly filenames to pass to `open_mfdataset`

In [None]:
dates = pd.date_range(start='2017-01-01 00:00',end='2017-12-31 23:00', freq='1h')

files = ['./nc/{}/{}.CHRTOUT_DOMAIN1.comp'.format(date.strftime('%Y'),date.strftime('%Y%m%d%H%M')) for date in dates]
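
Before opening thousands of files, it can help to confirm that they all exist, since a missing hour will typically make `open_mfdataset` fail partway through. A minimal sketch, assuming the files live on a local filesystem; `find_missing` is a hypothetical helper, not part of the original workflow:

```python
from pathlib import Path

def find_missing(paths):
    """Return the paths that are not regular files on disk, in input order."""
    return [p for p in paths if not Path(p).is_file()]

# Example with one path following the pattern above; in the notebook, pass `files`
example = ['./nc/2017/201701010000.CHRTOUT_DOMAIN1.comp']
missing = find_missing(example)
print(f'{len(missing)} of {len(example)} files missing')
```

An empty result means every expected hour is present.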

In [None]:
len(files)

In [None]:
dset = xr.open_dataset(files[0])

In [None]:
dset

A nice chunk size for object storage is on the order of 100 MB.

In [None]:
time_chunk_size = 672   
feature_chunk_size = 30000

In [None]:
len(files)/time_chunk_size

In [None]:
nchunks = len(dset.feature_id)/feature_chunk_size
nchunks

In [None]:
nt_chunks = int(np.ceil(len(files)/time_chunk_size))
nt_chunks

In [None]:
(time_chunk_size * feature_chunk_size )*8 / 1e6

That works out to about 161 MB per uncompressed chunk (assuming 8-byte values), which is close enough to 100 MB.
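
The same arithmetic can be wrapped in a small function to try other chunk shapes or data types. This is only a sketch: `chunk_megabytes` is a hypothetical helper, and the 4-byte case is an assumption about how the data might be stored, not something the files are known to use.

```python
import math

def chunk_megabytes(chunk_shape, itemsize=8):
    """Uncompressed size in MB (10**6 bytes) of one chunk with the given shape."""
    return math.prod(chunk_shape) * itemsize / 1e6

print(chunk_megabytes((672, 30000)))              # 8-byte values, as in the cell above
print(chunk_megabytes((672, 30000), itemsize=4))  # assumption: 4-byte floats
```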

Create a preprocessing function that drops the variables and coordinates that trip up `open_mfdataset`

In [None]:
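def drop_coords(ds):
    # Drop the variables/coordinates that interfere with combining the files;
    # drop_vars replaces the deprecated Dataset.drop in recent xarray releases
    ds = ds.drop_vars(['reference_time', 'feature_id', 'crs'])
    return ds.reset_coords(drop=True)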

Create a local dask cluster

In [None]:
cluster = LocalCluster()
cluster

In [None]:
client = Client(cluster)

Tell blosc not to use threads since we are using dask to parallelize

In [None]:
numcodecs.blosc.use_threads = False

Step through the dataset one chunk along the time dimension at a time, so that dask does not read too many chunks before writing and blow out memory. The first time chunk is written to Zarr, then the others are appended.
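
The batch boundaries used by the loop that follows can be sketched on their own. `time_batches` is a hypothetical helper; 8760 and 672 are the file count and time chunk size from above.

```python
def time_batches(n_files, batch_size):
    """Yield (istart, istop) index pairs covering range(n_files) in order."""
    for istart in range(0, n_files, batch_size):
        # The final batch is clipped so it never runs past the last file
        yield istart, min(istart + batch_size, n_files)

batches = list(time_batches(8760, 672))
print(len(batches), batches[0], batches[-1])
```

The last batch is shorter than the others because 8760 is not a multiple of 672.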

In [None]:
%%time
for i in range(nt_chunks):
#for i in range(1):
    print(i)
    istart = i * time_chunk_size
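    istop = min((i + 1) * time_chunk_size, len(files))

    # files are already in time order, so concatenate them as listed;
    # recent xarray rejects concat_dim when combine='by_coords'
    ds = xr.open_mfdataset(files[istart:istop], parallel=True, preprocess=drop_coords,
                           combine='nested', concat_dim='time')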

    # add back in the 'feature_id' coordinate removed by preprocessing 
    ds.coords['feature_id'] = dset.coords['feature_id']

    ds1 = ds.chunk(chunks={'time':time_chunk_size, 'feature_id':feature_chunk_size})

    if i==0:
        ds1.to_zarr('zarr/2017f', consolidated=True, mode='w')
    else:
        ds1.to_zarr('zarr/2017f', consolidated=True, append_dim='time')