# Data Pipeline: `precipitation_flux` [.pp -> zarr]

Using Iris and Xarray to consolidate `precipitation_flux` data stored in many .pp files into one large Zarr store.

This will follow the general pattern of:
1. Load .pp for one time period  [Iris]
2. Convert to zarr-able state  [Xarray]
3. Append to Zarr store
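
The three steps above can be sketched as a single loop. This is a minimal sketch, not the final pipeline: the names `zarr_write_kwargs` and `pp_to_zarr` are hypothetical, the chunk shape is the one used later in this notebook, and it assumes that sorting the filenames gives the order in which the files should be appended.

```python
import os


def zarr_write_kwargs(index, append_dim="time"):
    """Return to_zarr keyword arguments: create the store first, append afterwards."""
    if index == 0:
        return {"mode": "w", "consolidated": True}
    return {"append_dim": append_dim, "consolidated": True}


def pp_to_zarr(pp_dir, store, name="precipitation_flux", chunks=(10, 219, 286)):
    """Append the named field from every file in pp_dir to one Zarr store."""
    # Imported here so the helper above can be used without Iris installed.
    import iris
    import xarray as xr

    # Assumption: sorted filenames are in the intended append order.
    for index, fname in enumerate(sorted(os.listdir(pp_dir))):
        # 1. Load one .pp file lazily with Iris.
        cube, = iris.load(os.path.join(pp_dir, fname), name)
        # 2. Convert to an xarray Dataset with a fixed chunk shape.
        ds = xr.DataArray.from_iris(cube).to_dataset()
        ds = ds.chunk(dict(zip(ds[name].dims, chunks)))
        # 3. Write on the first pass, append along time afterwards.
        ds.to_zarr(store, **zarr_write_kwargs(index))
```

Keeping the write-or-append decision in its own small function makes it easy to check without touching any data.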

## Load sample dataset

In [None]:
import os
import iris

In [None]:
filepath = '/data/cssp-china/sample-data-17-01-20/cssp_china_pp'
filename = 'apepda.pa508i0.pp'
cubelist = iris.load(os.path.join(filepath, filename), 'precipitation_flux')
cubelist

In [None]:
cube, = cubelist
cube

## Turn this one cube into a Zarr store
This gives us a store onto which the remaining files can be appended.

In [None]:
cube.lazy_data()

Each cube is about 2.5 MB in size, which is a reasonable chunk size for our Zarr store.
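
As a quick sanity check on that figure, the uncompressed size of one chunk is the product of its shape and the bytes per value. The sketch below assumes 4-byte (float32) values and the shape `(10, 219, 286)` used for chunking later in this notebook; `chunk_nbytes` is a hypothetical helper name.

```python
from math import prod


def chunk_nbytes(shape, itemsize=4):
    """Uncompressed size in bytes of one chunk with the given shape."""
    return prod(shape) * itemsize


# Assumed: float32 data (4 bytes per value) with shape (time, lat, lon).
nbytes = chunk_nbytes((10, 219, 286))
print(f"{nbytes} bytes = {nbytes / 1e6:.2f} MB")
```

Under those assumptions this comes to 2,505,360 bytes, consistent with the roughly 2.5 MB reported for the cube.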

In [None]:
import xarray as xr

In [None]:
da = xr.DataArray.from_iris(cube)
da

In [None]:
da.chunk()

It looks like Xarray preserves the chunk size of the Iris cube.

Let's convert it to an xr.Dataset.

In [None]:
ds = da.to_dataset()
ds

In [None]:
# Convert the np.ndarray to a dask.array
ds1 = ds.chunk(chunks={'time':10, 'grid_latitude':219, 'grid_longitude':286})
ds1

In [None]:
ds1.precipitation_flux.data

## How many chunks will we be appending to the Zarr store?

In [None]:
!ls -1q {filepath} | wc -l

This is a good number of chunks to try and append to the Zarr. At some point we might want to rechunk them (e.g. `100x100x100`) but for now let's not.
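
The same count can be taken in Python, which keeps the result in a variable for later use. This is a sketch: `count_files` is a hypothetical helper, and unlike `ls | wc -l` it counts only regular files with the given suffix, assuming every input file ends in `.pp`.

```python
import os


def count_files(directory, suffix=".pp"):
    """Count regular files in `directory` whose names end with `suffix`."""
    return sum(
        1
        for entry in os.scandir(directory)
        if entry.is_file() and entry.name.endswith(suffix)
    )
```

For example, `n_files = count_files(filepath)` gives the number of .pp files, and so the number of appends, for the sample directory.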

## `xr.Dataset` to Zarr

In [None]:
ds1.to_zarr('zarr_precip', consolidated=True, mode='w')

In [None]:
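# Write-or-append pattern for a loop over files; `i` is the loop index (not defined in this cell).
if i == 0:
    ds1.to_zarr('zarr/2017f', consolidated=True, mode='w')
else:
    ds1.to_zarr('zarr/2017f', consolidated=True, append_dim='time')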

In [None]:
PRECIP_STASH = 'm01s05i216'

In [None]:
filepath = '/data/cssp-china/sample-data-17-01-20/cssp_china_pp'
filename = 'apepda.paj56i0.pp'
# cube2, = iris.load(os.path.join(filepath, filename), iris.AttributeConstraint(STASH=PRECIP_STASH))
cube2, = iris.load(os.path.join(filepath, filename), 'precipitation_flux')
cube2

In [None]:
cube2.data = cube2.core_data().rechunk((10,219,286))

In [None]:
cube2.core_data()

In [None]:
ds2 = xr.DataArray.from_iris(cube2).to_dataset()
ds2.precipitation_flux.data

In [None]:
# Append the second file's dataset (ds2); ds1 is already in the store
ds2.to_zarr('zarr_precip', consolidated=True, append_dim='time')

In [None]:
ds_z = xr.open_zarr('zarr_precip/')
ds_z

In [None]:
ds_z.precipitation_flux

In [None]:
ds_z.time

In [None]:
ds_z.forecast_reference_time

In [None]:
from IPython.display import display

# Compare the time cells of the two cubes
display(list(cube.coord('time').cells()))
print()
display(list(cube2.coord('time').cells()))

In [None]:
print(filename)
print('apepda.pa508i0.pp')

In [None]:
sorted(os.listdir(filepath))[0:4]

In [None]:
cube.coord('time')

In [None]:
for file in sorted(os.listdir(filepath))[0:4]:
    cube, = iris.load(os.path.join(filepath, file), iris.AttributeConstraint(STASH=PRECIP_STASH))
    
    display([c for c in cube.coord('time').cells()])
    print()
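
The cells above compare time coordinates by eye. Assuming the aim is to confirm that files will be appended in chronological order, the check could be automated along these lines. This is a sketch on plain sequences of time values; `is_strictly_increasing` and `first_out_of_order` are hypothetical helper names.

```python
def is_strictly_increasing(values):
    """True if every value is greater than the one before it."""
    values = list(values)
    return all(a < b for a, b in zip(values, values[1:]))


def first_out_of_order(per_file_times):
    """Return the index of the first file whose times do not follow on from
    the previous files, or None if the whole sequence is increasing."""
    last = None
    for index, times in enumerate(per_file_times):
        times = list(times)
        # Times within a single file must themselves be increasing.
        if not is_strictly_increasing(times):
            return index
        # The first time in this file must come after the last one seen.
        if times and last is not None and times[0] <= last:
            return index
        if times:
            last = times[-1]
    return None
```

It could be fed the `cube.coord('time').points` of each file in append order; a result of `None` means no overlap or reordering was found.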