# Training data explainer
## Max Thomas

Data for training the emulator were generated using GOSI9 (Global Ocean and Sea Ice Configuration 9), which is the Met Office's global ocean (NEMO) and sea ice (SI3) model configuration, documented [here](https://gmd.copernicus.org/articles/18/377/2025/). The model was run for 1 year (1976), and timestep-level sea ice data were saved.

Raw data exist at timestep frequency for:
- EVP rheology with 120 iterations. This is standard for the Met Office, but we expect convergence to be incomplete.
- aEVP rheology with 100 iterations. This is a different numerical formulation of EVP, and should be better converged.
- EVP rheology with 1200 iterations. This should be better converged, but would be too expensive in practice.

Initial testing is with 1200 iteration EVP, as good emulation of this would improve the cost *and* performance of the existing rheology solver.

Raw model output data were processed into a format better suited to machine learning using ```code/src/make_pairs_2.py``` and config files stored in ```configs/data_gathering/```. The script loads the data, separates them into pairs at time *t* and time *t+1*, flattens them (so the gridded lat/lon/time coordinates become a single 1D dimension), and removes all data points where there is no sea ice (variable *siconc* equal to 0).
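
The steps described above can be sketched with NumPy on a small synthetic array. This illustrates the general pattern only and is not the contents of ```code/src/make_pairs_2.py```: the array shapes, the toy values, and the choice to mask on the concentration at time *t* are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy output: 4 timesteps on a 3 x 5 grid.
n_time, n_lat, n_lon = 4, 3, 5
siconc = rng.random((n_time, n_lat, n_lon))
siconc[siconc < 0.4] = 0.0  # pretend these points are ice-free
sivelu = rng.normal(size=(n_time, n_lat, n_lon))

# 1. Pair each field at time t with the same field at time t+1.
siconc_t = siconc[:-1]
sivelu_t, sivelu_tp1 = sivelu[:-1], sivelu[1:]

# 2. Flatten (time, lat, lon) into a single sample dimension z.
siconc_t = siconc_t.reshape(-1)
sivelu_t = sivelu_t.reshape(-1)
sivelu_tp1 = sivelu_tp1.reshape(-1)

# 3. Keep only samples with sea ice (assumption: test siconc at time t).
has_ice = siconc_t > 0
features = np.stack([siconc_t[has_ice], sivelu_t[has_ice]])  # (feature, z)
labels = sivelu_tp1[has_ice][np.newaxis, :]                  # (label, z)

print(features.shape, labels.shape)
```

Masking after flattening keeps the feature and label arrays aligned along *z*, since the same boolean mask is applied to both.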

The largest dataset processed so far uses one day of timestep frequency output from each month in 1976:
```python make_pairs.py evp_120itr_12day```

The resulting file is ```data/raw/evp_1200itr_fmt2.zarr```. 'evp' here refers to the elastic-viscous-plastic rheology (see [here](https://www.annualreviews.org/content/journals/10.1146/annurev.fluid.40.111406.102151) for a discussion of various rheologies). '1200itr' refers to the number of iterations allowed for the rheology solver. 'fmt2' is an identifier that distinguishes data made by ```code/src/make_pairs_2.py``` (fmt2) from data made by ```code/src/make_pairs.py``` (which is obsolete).

Taking a look at the file...

```python
import xarray as xr

data = xr.open_zarr('../data/raw/evp_1200itr_fmt2.zarr')
data
```

Output (summary of the arrays in the dataset; variable names are not shown):

| Arrays | Shape | Chunk shape | Data type | Size | Chunk size | Dask chunks |
|---|---|---|---|---|---|---|
| 2 | (126676851,) | (494832,) | float32 | 483.23 MiB | 1.89 MiB | 256 |
| 3 | (126676851,) | (247416,) | int64 | 0.94 GiB | 1.89 MiB | 512 |
| 2 | (126676851,) | (247416,) | object | 0.94 GiB | 1.89 MiB | 512 |
| 1 | (10, 126676851) | (1, 1736640) | float32 | 4.72 GiB | 6.62 MiB | 730 |
| 1 | (2, 126676851) | (1, 1736640) | float32 | 0.94 GiB | 6.62 MiB | 146 |


We see there are about 127 million instances of the solver behavior (length of *z* dimension). 
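
The array sizes reported in the output above follow directly from the sample count and the data type. As a quick check in plain Python, using only numbers that appear in that output:

```python
n_samples = 126_676_851  # length of the z dimension
bytes_float32 = 4
MIB, GIB = 2**20, 2**30

one_float32_array = n_samples * bytes_float32
feature_array = 10 * one_float32_array  # ten features
label_array = 2 * one_float32_array     # two labels

print(f"1D float32 array: {one_float32_array / MIB:.2f} MiB")  # 483.23 MiB
print(f"feature array:    {feature_array / GIB:.2f} GiB")      # 4.72 GiB
print(f"label array:      {label_array / GIB:.2f} GiB")        # 0.94 GiB
```

Features and labels together come to roughly 5.7 GiB uncompressed, which is worth bearing in mind before loading the full arrays into memory rather than working chunk by chunk.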

There are two potential labels, corresponding to the *u* and *v* velocities in the model (these are horizontal and orthogonal to each other).

There are ten potential features.

At any *z*, the features are at time *t* and the labels are at time *t+1*.

```python
print(data.label)
print(data.feature)
```

```
<xarray.DataArray 'label' (label: 2)> Size: 48B
array(['sivelv', 'sivelu'], dtype='<U6')
Coordinates:
  * label    (label) <U6 48B 'sivelv' 'sivelu'
<xarray.DataArray 'feature' (feature: 10)> Size: 280B
array(['siconc', 'sithic', 'sivelv', 'sivelu', 'utau_ai', 'utau_oi', 'vtau_ai',
       'vtau_oi', 'sidive', 'sishea'], dtype='<U7')
Coordinates:
  * feature  (feature) <U7 280B 'siconc' 'sithic' 'sivelv' ... 'sidive' 'sishea'
```
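
Because *feature* and *label* are labelled coordinates, subsets can be selected by name rather than by position. The sketch below builds a tiny synthetic dataset with the same coordinate values as above. The data variable names `features` and `labels`, the dimension order, and the random values are assumptions for illustration and may differ from the real file.

```python
import numpy as np
import xarray as xr

feature_names = ['siconc', 'sithic', 'sivelv', 'sivelu', 'utau_ai',
                 'utau_oi', 'vtau_ai', 'vtau_oi', 'sidive', 'sishea']
label_names = ['sivelv', 'sivelu']
n_samples = 6  # the real file has about 127 million

rng = np.random.default_rng(0)
toy = xr.Dataset(
    data_vars={
        # 'features' and 'labels' are hypothetical variable names.
        'features': (('feature', 'z'),
                     rng.normal(size=(10, n_samples)).astype('float32')),
        'labels': (('label', 'z'),
                   rng.normal(size=(2, n_samples)).astype('float32')),
    },
    coords={'feature': feature_names, 'label': label_names},
)

# Select a subset of features by name, then put samples on the first axis.
wanted = ['siconc', 'sithic', 'sivelu', 'sivelv']
X = toy['features'].sel(feature=wanted).transpose('z', 'feature').values
y = toy['labels'].sel(label=['sivelu', 'sivelv']).transpose('z', 'label').values

print(X.shape, y.shape)
```

Selecting by name makes the column order explicit, which avoids relying on the stored order, where `sivelv` precedes `sivelu`.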
