# Preparing the data for CNN models
**Author**: Code by Yifei Hang (UW) and edited by Eli Holmes


### Library imports

In this tutorial, we will be using the following libraries:
- `import numpy as np`: NumPy is a fundamental library for scientific computing in Python. It provides support for multidimensional arrays, matrices, and many fast mathematical operations tomanipulate these data structures. We use NumPy for array operations and data processing. 
- `import dask.array as da`: Dask is a parallel computing library that handles large datasets, and dask array allows parallel computation coordinating a collection of NumPy arrays, which greatly improves efficiency. We use Dask array for more efficient computation of larger data.
- `import xarray as xr`: Xarray is designed for working with labeled multi-dimensional arrays, commonly used in scientific data analysis, particularly in fields like oceanography and climate science. It extends the capabilities of NumPy arrays by introducing labels of dimensions, coordinates, and attributes to the data. We use Xarray to create and manipulate datasets.
- `import zarr`: Zarr is a cloud-based data format for chunked, compressed, N-dimensional arrays. The library allows efficient storage and access of large datasets. We use Zarr to store Xarray datasets in the easily accessible Zarr format.
- `from os import path`: The OS library provides functions for interacting with the operating system, and the Path submodule is specifically useful in pathname manipulations. We use path to check whether a file or directory exists and avoid recomputations.
- `import matplotlib.pyplot as plt`: Matplotlib is a powerful plotting library, and the pyplot module introduces a collection of functions that allows MATLAB-like plotting. We use pyplot to visualize our data and results.


In [1]:
import numpy as np
import dask.array as da
import xarray as xr
import zarr
from os import path
import matplotlib.pyplot as plt

### Prep the data

### Data Preprocessing
#### 1. Load the dataset
We start by loading the dataset of IO.zarr, slicing the region to the desired dimension, and removing days with no valid CHL data. This is mainly prior to 1997.

In [3]:
zarr_path = "~/shared-public/mind_the_chl_gap/IO.zarr"
zarr_ds = xr.open_zarr(store=zarr_path, consolidated=True)  # get data
 
zarr_ds = zarr_ds.sel(lat=slice(32, -11.75), lon=slice(42,101.75))  # choose long and lat

all_nan_CHL = np.isnan(zarr_ds['CHL_cmes-level3']).all(dim=["lon", "lat"]).compute()  # find sample indices where CHL is NaN

zarr_ds = zarr_ds.sel(time=(~all_nan_CHL))  # select samples with CHL not NaN

zarr_ds = zarr_ds.sortby('time')

In [4]:
zarr_ds

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,370.32 MiB,3.95 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,uint8 numpy.ndarray,uint8 numpy.ndarray
"Array Chunk Bytes 370.32 MiB 3.95 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type uint8 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,370.32 MiB,3.95 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,uint8 numpy.ndarray,uint8 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,41.25 kiB,41.25 kiB
Shape,"(176, 240)","(176, 240)"
Dask graph,1 chunks in 3 graph layers,1 chunks in 3 graph layers
Data type,uint8 numpy.ndarray,uint8 numpy.ndarray
"Array Chunk Bytes 41.25 kiB 41.25 kiB Shape (176, 240) (176, 240) Dask graph 1 chunks in 3 graph layers Data type uint8 numpy.ndarray",240  176,

Unnamed: 0,Array,Chunk
Bytes,41.25 kiB,41.25 kiB
Shape,"(176, 240)","(176, 240)"
Dask graph,1 chunks in 3 graph layers,1 chunks in 3 graph layers
Data type,uint8 numpy.ndarray,uint8 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,80.41 MiB
Shape,"(9193, 176, 240)","(499, 176, 240)"
Dask graph,19 chunks in 5 graph layers,19 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 80.41 MiB Shape (9193, 176, 240) (499, 176, 240) Dask graph 19 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,80.41 MiB
Shape,"(9193, 176, 240)","(499, 176, 240)"
Dask graph,19 chunks in 5 graph layers,19 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,80.41 MiB
Shape,"(9193, 176, 240)","(499, 176, 240)"
Dask graph,19 chunks in 5 graph layers,19 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 80.41 MiB Shape (9193, 176, 240) (499, 176, 240) Dask graph 19 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,80.41 MiB
Shape,"(9193, 176, 240)","(499, 176, 240)"
Dask graph,19 chunks in 5 graph layers,19 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,330.00 kiB,330.00 kiB
Shape,"(176, 240)","(176, 240)"
Dask graph,1 chunks in 3 graph layers,1 chunks in 3 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray
"Array Chunk Bytes 330.00 kiB 330.00 kiB Shape (176, 240) (176, 240) Dask graph 1 chunks in 3 graph layers Data type float64 numpy.ndarray",240  176,

Unnamed: 0,Array,Chunk
Bytes,330.00 kiB,330.00 kiB
Shape,"(176, 240)","(176, 240)"
Dask graph,1 chunks in 3 graph layers,1 chunks in 3 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
"Array Chunk Bytes 1.45 GiB 15.79 MiB Shape (9193, 176, 240) (98, 176, 240) Dask graph 94 chunks in 5 graph layers Data type float32 numpy.ndarray",240  176  9193,

Unnamed: 0,Array,Chunk
Bytes,1.45 GiB,15.79 MiB
Shape,"(9193, 176, 240)","(98, 176, 240)"
Dask graph,94 chunks in 5 graph layers,94 chunks in 5 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray


#### 2. Process data

#### Function: `data_preprocessing`
This function selects and standardizes feature variables, and stores them to a zarr file for easy access in future training and evaluation.

##### Parameters:
- `zarr_ds`: original zarr dataset after region slicing and NaN CHL filtering
- `features`: a list of features available directly from _zarr_ds_
- `train_year`: the first year of train data
- `train_range`: length of train data in year


##### Other Features (X):
- `sin_time`:
$
\sin({\text{day in the year} \over 366} \cdot 2 \pi)
$
for seasonal information
- `cos_time`:
$
\cos({\text{day in the year} \over 366} \cdot 2 \pi)
$
for seasonal information as well
- `masked_CHL` (logged): artifically masked CHL to simulate cloud coverage. Artificial clouds are the overlapping pixels of _current day observed CHL location_ and _10 day after cloud location_
- `prev_day_CHL`: CHL data from the previous day
- `next_day_CHL`: CHL data from the next day
- `land_flag`: flag for land, with 1 = land and 0 = not land
- `real_cloud_flag`: flag for real cloud, with 1 = real cloud and 0 = not real cloud
- `valid_CHL_flag`: flag for observed CHL after applying artifical masks, with 1 = CHL observed and 0 = CHL not observed
- `fake_cloud_flag`: flag for fake cloud, with 1 = fake cloud and 0 = not fake cloud

##### label (y):
- `CHL` (logged): observed CHL

##### Standardization:
First standardize based on train dataset, and then apply the calculated mean and standard deviation to all data. Only numerical features and the label are standardized. Mean and standard deviation of _CHL_ and _masked_CHL_ are stored in a `.npy` file for evaluation. 

#### Function: `create_zarr`
This function creates a zarr file and stores standardized features and label to the zarr file. 

__Note__: If you run the code for the first time (the `data_preprocessing` function creates the zarr file), it is recommended to restart the kernel and release the memory. Otherwise you might run out of memory during the trainning phase.

In [17]:
def data_preprocessing(zarr_ds, features, train_year, train_range, data_dir="./data"):
    numer_features = []  # numerical features
    cat_features = []  # categorical features
    zarr_label = f'{train_year}_{train_range}'  # later passed to create_zarr as zarr file name
    zarr_label = f'{zarr_label}_full_2days'

    print('label created')

    if path.exists(f'data/{zarr_label}.zarr'):
        print('Zarr file exists')
        return zarr_label
    
    # add raw data features
    for feature in features:
        feat_arr = zarr_ds[feature].data
        numer_features.append(feat_arr)
    print('raw data features added')

    # get label
    CHL_data = zarr_ds['CHL_cmes-level3']
    CHL_data = np.log(CHL_data.copy())
    print('CHL logged')
    
    # additional features
    # sin and cos of day for seasonal features
    time_data = da.array(zarr_ds.time)
    day_rad = (time_data - np.datetime64("1900-01-01")) / np.timedelta64(1, "D") / 365 * 2 * np.pi
    day_rad = day_rad.astype(np.float32)
    day_sin = np.sin(day_rad)
    day_cos = np.cos(day_rad)
    print('sin and cos time calculated')
    day_sin = np.tile(day_sin[:, np.newaxis, np.newaxis], (1,) + CHL_data[0].shape)
    day_sin = da.rechunk(day_sin, (100, *day_sin.shape[1:]))
    numer_features.append(day_sin)
    print('sin time added')
    day_cos = np.tile(day_cos[:, np.newaxis, np.newaxis], (1,) + CHL_data[0].shape)
    day_cos = da.rechunk(day_cos, (100, *day_cos.shape[1:]))
    numer_features.append(day_cos)
    print('cos time added')

    
    # artifically masked CHL (10 day shift)
    day_shift_flag = np.vstack((zarr_ds['CHL_cmes-cloud'].data[10:], zarr_ds['CHL_cmes-cloud'].data[:10]))
    assert CHL_data.shape == day_shift_flag.shape
    
    masked_CHL = da.where(day_shift_flag == 0, np.nan, CHL_data)
    numer_features.append(masked_CHL)

    print('masked CHL added')

    prev_day = np.vstack((np.zeros((1, ) + CHL_data[0].shape), CHL_data.data[:-1]))
    numer_features.append(prev_day)
    print('prev day CHL added')
    next_day = np.vstack((CHL_data.data[1:], np.zeros((1, ) + CHL_data[0].shape)))
    numer_features.append(next_day)
    print('next day CHL added')

    # land one-hot encoding
    land_flag = da.zeros(CHL_data.shape)
    land_flag = da.where(zarr_ds['CHL_cmes-cloud'][0] == 2, 1, land_flag)
    cat_features.append(land_flag)
    
    print('land flag added')

    # real cloud one-hot encoding
    real_cloud_flag = da.zeros(CHL_data.shape)
    real_cloud_flag = da.where(zarr_ds['CHL_cmes-cloud'] == 1, 1, real_cloud_flag)
    cat_features.append(real_cloud_flag)

    print('real cloud flag added')

    # valid CHL one-hot encoding
    valid_CHL_flag = da.zeros(CHL_data.shape)
    valid_CHL_flag = da.where(~da.isnan(masked_CHL), 1, valid_CHL_flag)
    cat_features.append(valid_CHL_flag)

    print('valid CHL flag added')

    
    # fake cloud one-hot encoding
    fake_cloud_flag = da.zeros(CHL_data.shape)
    fake_cloud_flag = da.where((land_flag + real_cloud_flag + valid_CHL_flag) == 0, 1, fake_cloud_flag)
    cat_features.append(fake_cloud_flag)

    print('fake cloud flag added')


    # find train data start and end indices
    train_start_ind = np.where(zarr_ds.time.values == np.datetime64(f'{train_year}-01-01'))[0][0]
    train_end_ind = np.where(zarr_ds.time.values == np.datetime64(f'{train_year + train_range}-01-01'))[0][0]
    

    # get mean and stdev for numerical features
    feat_mean = []
    feat_stdev = []

    for feature in numer_features:
        feature_train = feature[train_start_ind: train_end_ind]
        feat_mean.append(da.nanmean(feature_train).compute())
        feat_stdev.append(da.nanstd(feature_train).compute())
        print('calculating mean and stdev...')

    # calculate standardized features
    numer_features_stdized = []
    feature_shape = numer_features[0].shape
    for feature, mean, stdev in zip(numer_features, feat_mean, feat_stdev):
        numer_features_stdized.append((feature - da.full(feature_shape, mean)) / da.full(feature_shape, stdev))
        print('standardizing...')

    # get mean and stdev for CHL
    CHL_mean = da.nanmean(CHL_data).compute()
    CHL_stdev = da.nanstd(CHL_data).compute()
    np.save(f'{data_dir}/{zarr_label}.npy', {'CHL': np.array([CHL_mean, CHL_stdev]), 'masked_CHL': np.array([feat_mean[-3], feat_stdev[-3]])})

    # calculate standardized CHL
    CHL_data_stdized = (CHL_data - da.full(feature_shape, CHL_mean)) / da.full(feature_shape, CHL_stdev)

    print('all standardized')

    numer_var_names = features + ['sin_time', 'cos_time', 'masked_CHL', 'prev_day_CHL', 'next_day-CHL']
    cat_var_names = ['land_flag', 'real_cloud_flag', 'valid_CHL_flag', 'fake_cloud_flag']

    print('creating zarr')
    create_zarr(zarr_ds, numer_features_stdized, numer_var_names, cat_features, cat_var_names, CHL_data_stdized.data, zarr_label, data_dir)

    del time_data, day_rad, day_sin, day_cos
    del feature, feat_arr
    del numer_features, numer_features_stdized, numer_var_names, cat_features, cat_var_names, CHL_data, CHL_data_stdized
    del feat_mean, feat_stdev
    
    return zarr_label

In [16]:
def create_zarr(zarr_ds, numer_features, numer_var_names, cat_features, cat_var_names, CHL_data, zarr_label, data_dir):
    chunk_size = 100
    coord_names = ['time', 'lat', 'lon']
    coords = {coord_name: zarr_ds[coord_name] for coord_name in coord_names}
    
    numer_features_dict = {var_name: (coord_names, feature) for var_name, feature in zip(numer_var_names, numer_features)}
    cat_features_dict = {var_name: (coord_names, feature) for var_name, feature in zip(cat_var_names, cat_features)}
    label_dict = {'CHL': (coord_names, CHL_data)}
    print('variables dicts loaded')
    
    ds_numer = xr.Dataset(numer_features_dict, coords=coords)
    ds_cat = xr.Dataset(cat_features_dict, coords=coords)
    ds_label = xr.Dataset(label_dict, coords=coords)
    print('xarray datasets created')    

    for var in list(ds_numer.keys()):
        ds_numer[var]=ds_numer[var].chunk({"time": chunk_size}) 
    for var in list(ds_cat.keys()):
        ds_cat[var]=ds_cat[var].chunk({"time": chunk_size})
    for var in list(ds_label.keys()):
        ds_label[var]=ds_label[var].chunk({"time": chunk_size}) 
    print('chunked')

    store = zarr.DirectoryStore(f'{data_dir}/{zarr_label}.zarr')
    ds_numer.to_zarr(store, mode='w')
    ds_cat.to_zarr(store, mode='a')
    ds_label.to_zarr(store, mode='a')

In [19]:
features = ['u_wind', 'v_wind', 'sst', 'air_temp']
train_year = 2015
train_range = 3
val_range = 1
test_range = 1
sharedir = "~/shared-public/mind_the_chl_gap"
zarr_label = data_preprocessing(zarr_ds, features, train_year, train_range, data_dir=sharedir)
zarr_stdized = xr.open_zarr(zarr.DirectoryStore(f'{sharedir}/{zarr_label}.zarr'))

label created
raw data features added
CHL logged
sin and cos time calculated
sin time added
cos time added
masked CHL added
prev day CHL added
next day CHL added
land flag added
real cloud flag added
valid CHL flag added
fake cloud flag added
calculating mean and stdev...
calculating mean and stdev...
calculating mean and stdev...
calculating mean and stdev...
calculating mean and stdev...
calculating mean and stdev...
calculating mean and stdev...
calculating mean and stdev...
calculating mean and stdev...
standardizing...
standardizing...
standardizing...
standardizing...
standardizing...
standardizing...
standardizing...
standardizing...
standardizing...


FileNotFoundError: [Errno 2] No such file or directory: '~/shared-public/mind_the_chl_gap/2015_3_full_2days.npy'

#### Function: `data_split`
This function selects the train, validation, and test data from the standardized data and splits the features and label.
##### Parameters:
- `zarr_stdized`: Zarr file storing standardized features and label.
- `train_year`: the first year of train data
- `train_range`: length of train data in year
- `val_range`: length of validation data in year
- `test_range`: length of test data in year
##### Return:
- `X_train, X_val, X_test`: the predictor variables of the train/validation/test data
- `y_train, y_val, y_test`: the response variables of the train/validation/test data

In [9]:
def data_split(zarr_stdized, train_year, train_range, val_range, test_range):
    X_vars = list(zarr_stdized.keys())
    X_vars.remove('CHL')
    
    zarr_train = zarr_stdized.sel(time=slice(f'{train_year}-01-01', f'{train_year+train_range}-01-01'))
    X_train = []
    for var in X_vars:
        var = zarr_train[var].to_numpy()
        X_train.append(np.where(np.isnan(var), 0.0, var))
    y_train = zarr_train.CHL.to_numpy()
    y_train = np.where(np.isnan(y_train), 0.0, y_train)
    X_train = np.array(X_train)
    X_train = np.moveaxis(X_train, 0, -1)
    del zarr_train
    
    zarr_val = zarr_stdized.sel(time=slice(f'{train_year+train_range}-01-01', f'{train_year+train_range+val_range}-01-01'))
    X_val = []
    for var in X_vars:
        var = zarr_val[var].to_numpy()
        X_val.append(np.where(np.isnan(var), 0.0, var))
    y_val = zarr_val.CHL.to_numpy()
    y_val = np.where(np.isnan(y_val), 0.0, y_val)
    X_val = np.array(X_val)
    X_val = np.moveaxis(X_val, 0, -1)
    del zarr_val
    
    zarr_test = zarr_stdized.sel(time=slice(f'{train_year+train_range+val_range}-01-01', f'{train_year+train_range+val_range+test_range}-01-01'))
    X_test= []
    for var in X_vars:
        var = zarr_test[var].to_numpy()
        X_test.append(np.where(np.isnan(var), 0.0, var))
    y_test = zarr_test.CHL.to_numpy()
    y_test = np.where(np.isnan(y_test), 0.0, y_test)
    X_test = np.array(X_test)
    X_test = np.moveaxis(X_test, 0, -1)
    del zarr_test, var

    return (X_train, y_train, 
            X_val, y_val,
            X_test, y_test)

## Data ready for training and testing

- `X_train, X_val, X_test`: the predictor variables of the train/validation/test data
- `y_train, y_val, y_test`: the response variables of the train/validation/test data
 

In [10]:
X_train, y_train, X_val, y_val, X_test, y_test = data_split(zarr_stdized, train_year, train_range, val_range, test_range)