# IceNet: Pipeline usage

## Context

### Purpose
The first notebook demonstrated the use of high level command-line interfaces (CLI) of the IceNet library to download, process, train and predict from end to end.

Now that you have gone through the basic steps of running the IceNet model via the CLI, you may wish to establish a framework to run the model automatically for end-to-end runs. This is often called a Pipeline. A Pipeline can schedule ongoing model runs or run multiple model variations simultaneously.

This notebook illustrates the use of helper scripts from the IceNet pipeline repository for testing and producing operational forecasts.

We recommend going through the first notebook before proceeding, as the data download sits outside the pipeline and is covered in detail there. That said, this notebook has been designed to run independently of the other notebooks in this repository.

This demonstrator notebook has been run on the British Antarctic Survey in-house HPC; however, the pipeline is by no means limited to running on HPCs.

### Highlights
The key features of an end to end run are:

 * [1. Introduction](#1-Introduction)
 * [2. Setup](#2-Setup)
 * [3. Process](#3-Process)
 * [4. Train](#4-Train)
 * [5. Predict](#5-Predict)
 * [6. Visualisation](#6-Visualisation)

**Note:** Steps 3, 4 and 5 are within the IceNet pipeline.

### Contributions
#### Notebook

James Byrne (author)

Matthew Gascoyne

Bryn Noel Ubald

__Please raise issues [in this repository](https://github.com/icenet-ai/icenet-notebooks/issues) to suggest updates to this notebook!__ 

Contact me at _jambyr \<at\> bas.ac.uk_ for anything else...

#### Modelling codebase
James Byrne (code author), Tom Andersson (science author)

#### Modelling publications
Andersson, T.R., Hosking, J.S., Pérez-Ortiz, M. et al. Seasonal Arctic sea ice forecasting with probabilistic deep learning. Nat Commun 12, 5124 (2021). https://doi.org/10.1038/s41467-021-25257-4

#### Involved organisations
The Alan Turing Institute and British Antarctic Survey

___
# 1. Introduction

## CLI vs Library vs Pipeline usage

The IceNet package is designed to support automated runs from end to end by exposing the CLI operations demonstrated in the first notebook. These are simple wrappers around the library itself, and __any__ step of this can be undertaken manually or programmatically by inspecting the relevant endpoints. 

IceNet can be run in a number of ways: from the command line, through the Python interface, or as a pipeline.

The rule of thumb to follow: 

* Use the [pipeline repository](https://github.com/icenet-ai/icenet-pipeline) if you want to run the end to end IceNet processing out of the box.
* Adapt or customise this process using `icenet_*` commands described in this notebook and in the scripts contained in [the pipeline repo](https://github.com/icenet-ai/icenet-pipeline).
* For ultimate customisation, you can interact with the IceNet repository programmatically (which is how the CLI commands operate). For more information, look at the [IceNet CLI implementations](https://github.com/JimCircadian/icenet2/blob/main/setup.py#L32) and the [library notebook](04.library_usage.ipynb), along with the [library documentation](#TODO).

## Using the Pipeline

As noted in the Context section, a Pipeline lets you run the model automatically from end to end, whether to schedule ongoing model runs or to run multiple model variations simultaneously. The pipeline is driven by a series of bash scripts and an `ENVS` environment configuration file.

![Diagram of IceNet and its pipeline](./pipeline_diagram3.png "IceNet pipeline diagram displaying process blocks and data being processed from input on the left to output on the right, through the pipeline")

To automatically produce daily IceNet forecasts we train multiple variations of the model, each with different starting conditions. We call this ensemble training. Then we run predictions for each model variation, producing a mean and error across the whole model ensemble. This captures some of the model uncertainty.
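
As a rough illustration of the aggregation step described above, the sketch below computes a per-cell mean and spread across ensemble members. It uses only the Python standard library and made-up values. Treating the "error" as a population standard deviation is an assumption made here for illustration, and the function name `ensemble_stats` is hypothetical; the pipeline's own implementation may differ.

```python
from statistics import fmean, pstdev


def ensemble_stats(members):
    """Return per-cell (means, spreads) across ensemble members.

    `members` holds one flat list of sea ice concentration values per
    ensemble member; all lists must have the same length.
    """
    cells = list(zip(*members))  # regroup the values cell by cell
    means = [fmean(cell) for cell in cells]
    # Assumption: spread is measured as the population standard deviation
    spreads = [pstdev(cell) for cell in cells]
    return means, spreads


# Two hypothetical ensemble members predicting three grid cells
member_a = [0.25, 1.0, 0.0]
member_b = [0.75, 0.5, 0.0]
means, spreads = ensemble_stats([member_a, member_b])
print(means)
print(spreads)
```

Cells where the members disagree get a larger spread, which is the sense in which the ensemble captures some of the model uncertainty.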

### Data

This assumes that you have a data store in a `data/` folder (this can be the same `data/` directory generated when running through the first notebook). Since the data is common across pipelines, you do not need to redownload data that you have previously downloaded. It is recommended to symbolically link to a data store so that data is only downloaded when it has not been downloaded previously.
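
One way to set up such a link from Python is sketched below. The helper `ensure_data_link` and the directory names in the demonstration are hypothetical; the only behaviour assumed is that the pipeline looks for a `data` entry in its working directory, as described above.

```python
import tempfile
from pathlib import Path


def ensure_data_link(pipeline_dir, data_store):
    """Link `<pipeline_dir>/data` to a shared data store unless it already exists."""
    link = Path(pipeline_dir) / "data"
    if link.is_symlink() or link.exists():
        return link  # leave an existing link or directory untouched
    link.symlink_to(Path(data_store).resolve(), target_is_directory=True)
    return link


# Demonstration in a throwaway directory (all paths here are hypothetical)
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "shared-data"
    pipeline = Path(tmp) / "green"
    store.mkdir()
    pipeline.mkdir()
    link = ensure_data_link(pipeline, store)
    linked_ok = link.is_symlink() and link.resolve() == store.resolve()
    print(linked_ok)
```

Because the helper returns early when `data` already exists, it is safe to call repeatedly.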

### Ensemble Running

To do this, the [icenet-pipeline](https://www.github.com/icenet-ai/icenet-pipeline) repository is available. It offers the `run_train_ensemble.sh` and `run_predict_ensemble.sh` scripts, which operate similarly to the `icenet_train` and `icenet_predict` CLI commands from the IceNet library demonstrated in the first notebook.

___
# 2. Setup

## Get the IceNet Pipeline

Before progressing you will need to clone the icenet-pipeline repository. Assuming you have followed the directory structure from the first notebook:


```bash
git clone https://www.github.com/icenet-ai/icenet-pipeline.git green
ln -s green notebook-pipeline
cd icenet-notebooks
```

We clone a 'fresh' pipeline repository into a directory called 'green' (as an arbitrary way of identifying the fresh pipeline) and then symbolically link to it. This allows us to symbolically swap to another pipeline later if we want to.

```bash
my-icenet-project/       <--- we're in here!
├── data/
├── icenet-notebooks/
├── green/               <--- Clone of icenet-pipeline
└── notebook-pipeline@   <--- Symlink to the green/ `icenet-pipeline` repo we've just cloned into
```
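
Swapping the `notebook-pipeline` link to another clone can also be scripted. The sketch below is one possible approach, not part of the pipeline: it builds the new link under a temporary name and renames it over the old one, so the link is never missing in between. The helper name `repoint_symlink` and the `blue` directory are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def repoint_symlink(link, new_target):
    """Repoint `link` at `new_target`, replacing any existing symlink."""
    link = Path(link)
    tmp_link = link.with_name(link.name + ".tmp")
    if tmp_link.is_symlink():
        tmp_link.unlink()            # clear leftovers from an earlier attempt
    tmp_link.symlink_to(new_target)  # build the new link alongside the old one
    os.replace(tmp_link, link)       # then rename it over the old link
    return link


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "green").mkdir()
    (root / "blue").mkdir()          # hypothetical second pipeline clone
    link = root / "notebook-pipeline"
    link.symlink_to("green")
    repoint_symlink(link, "blue")
    current_target = os.readlink(link)
    print(current_target)
```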

In [1]:
# Viewing symbolically linked files.
!find .. -maxdepth 1 -type l -ls

317836352179    0 lrwxrwxrwx   1 bryald   ailab           5 Apr 12 13:07 ../notebook-pipeline -> green


## Configure the Pipeline

Move into the `notebook-pipeline` directory.

In [2]:
import os
os.chdir("../notebook-pipeline")
!pwd

/data/hpcdata/users/bryald/git/icenet/green


The pipeline is driven by environmental variables that are defined within an `ENVS` file.

The `notebook-pipeline` directory contains an example file, `ENVS.example`, to which `ENVS` is symbolically linked by default.

You can copy the `ENVS.example` file and create many variations to cover your usage scenario. Then, update the `ENVS` file symbolic link to the run you would like to go through.

As a demonstrator, we will change the existing `my-icenet-project/notebook-pipeline/ENVS` link, which points to `my-icenet-project/notebook-pipeline/ENVS.example`.

We will instead point it to the example provided in this notebook repository, `my-icenet-project/icenet-notebooks/ENVS.notebook_tutorial`.

ENVS files are typically kept together within the `notebook-pipeline` repo, which is why we link `ENVS` in that repository to the `ENVS.notebook_tutorial` file in this one.

In [3]:
# Unlink the existing symbolic link (under `my-icenet-project/notebook-pipeline/ENVS`)
!unlink ENVS

# Point to the ENVS file from the icenet-notebooks repository (where this notebook is)
!ln -s ../icenet-notebooks/ENVS.notebook_tutorial ENVS

Before running through this notebook, please update the following variables in the ENVS file to point to your icenet conda environment (if different to the default):

<pre>
export ICENET_HOME=${ICENET_HOME:-${HOME}/icenet/${ICENET_ENVIRONMENT}}
export ICENET_CONDA=${ICENET_CONDA:-${HOME}/conda-envs/icenet}
</pre>
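
The `${VAR:-default}` form used in these two lines is standard shell parameter expansion: the default is used when the variable is unset or empty, so a value already exported in your shell takes priority over the one in the ENVS file. The sketch below mimics that rule in Python purely to make the behaviour concrete; the helper `env_or_default` and the example paths are hypothetical.

```python
import os


def env_or_default(name, default, environ=os.environ):
    """Mimic the shell expansion ${NAME:-default}.

    The default applies when the variable is unset *or* set to an empty string.
    """
    value = environ.get(name, "")
    return value if value else default


# Hypothetical environments, passed explicitly so the real one is untouched
fallback = "/home/user/conda-envs/icenet"
unset_case = env_or_default("ICENET_CONDA", fallback, environ={})
empty_case = env_or_default("ICENET_CONDA", fallback, environ={"ICENET_CONDA": ""})
set_case = env_or_default(
    "ICENET_CONDA", fallback, environ={"ICENET_CONDA": "/opt/envs/icenet"}
)
print(unset_case, empty_case, set_case)
```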

In [4]:
# Looking at the symlinked files in the `notebook-pipeline` directory
!find . -maxdepth 1 -type l -ls

225529190527    0 lrwxrwxrwx   1 bryald   ailab          45 Apr 12 12:50 ./data -> /data/hpcdata/users/jambyr/icenet/yellow/data
225493182496    0 lrwxrwxrwx   1 bryald   ailab          42 Apr 12 13:56 ./ENVS -> ../icenet-notebooks/ENVS.notebook_tutorial


## Download data before initiating pipeline

As shown in the pipeline image at the top, the source data download is external to the pipeline since it is common across pipelines.

Hence, the same commands from the first notebook can be used to download the required data into a data store (if not previously downloaded), which is then symbolically linked into the working directory before using the pipeline. Please check the first notebook for details regarding the usage of these commands.

**Please note that you do not need to redownload data you have already downloaded previously** (i.e., for date ranges you have previously downloaded into your data store).

In [5]:
!icenet_data_masks south

[12-04-24 13:56:11 :INFO    ] - Skipping ./data/masks/south/masks/active_grid_cell_mask_01.npy, already exists
[12-04-24 13:56:11 :INFO    ] - Skipping ./data/masks/south/masks/active_grid_cell_mask_02.npy, already exists
[12-04-24 13:56:11 :INFO    ] - Skipping ./data/masks/south/masks/active_grid_cell_mask_03.npy, already exists
[12-04-24 13:56:11 :INFO    ] - Skipping ./data/masks/south/masks/active_grid_cell_mask_04.npy, already exists
[12-04-24 13:56:11 :INFO    ] - Skipping ./data/masks/south/masks/active_grid_cell_mask_05.npy, already exists
[12-04-24 13:56:11 :INFO    ] - Skipping ./data/masks/south/masks/active_grid_cell_mask_06.npy, already exists
[12-04-24 13:56:11 :INFO    ] - Skipping ./data/masks/south/masks/active_grid_cell_mask_07.npy, already exists
[12-04-24 13:56:11 :INFO    ] - Skipping ./data/masks/south/masks/active_grid_cell_mask_08.npy, already exists
[12-04-24 13:56:11 :INFO    ] - Skipping ./data/masks/south/masks/active_grid_cell_mask_09.npy, already exists
[

In [6]:
!icenet_data_era5 south --vars uas,vas,tas,zg --levels ',,,500|250' 2020-1-1 2020-4-30

[12-04-24 13:56:15 :INFO    ] - ERA5 Data Downloading
[12-04-24 13:56:15 :INFO    ] - Building request(s), downloading and daily averaging from ERA5 API
[12-04-24 13:56:15 :INFO    ] - Processing single download for uas @ None with 121 dates
[12-04-24 13:56:15 :INFO    ] - Processing single download for vas @ None with 121 dates
[12-04-24 13:56:15 :INFO    ] - Processing single download for tas @ None with 121 dates
[12-04-24 13:56:15 :INFO    ] - Processing single download for zg @ 500 with 121 dates
[12-04-24 13:56:15 :INFO    ] - Processing single download for zg @ 250 with 121 dates
[12-04-24 13:56:16 :INFO    ] - No requested dates remain, likely already present
[12-04-24 13:56:16 :INFO    ] - No requested dates remain, likely already present
[12-04-24 13:56:16 :INFO    ] - No requested dates remain, likely already present
[12-04-24 13:56:16 :INFO    ] - No requested dates remain, likely already present
[12-04-24 13:56:16 :INFO    ] - No requested dates remain, likely already pres

**Note:** We also download sea-ice concentration data for the time period we're predicting for (in addition to the training range).

In this case, the ENVS file defines the latest train date as `2020-3-31` and the latest test date as `2020-4-2`. Since we would like to forecast for `7` days (also defined within the `ENVS` file, as `export FORECAST_DAYS=7`), we should download up to 7 days after the end dates of train/validation/test.

This will also be of use when comparing the prediction data.

These do not have to be downloaded in separate date ranges: you can cover the entire period in one go (`2019-12-29 2020-4-30`), or use the comma-separated syntax (e.g. `2019-12-29,2020-4-3 2020-3-31,2020-4-23`). The download is split into multiple sections here to demonstrate that previously downloaded data will be skipped. The same applies to the ERA5 download above.
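
The date arithmetic behind these ranges can be checked with a short sketch. The split boundaries below are read off the commands and text in this notebook rather than from the ENVS file itself, so treat them as assumptions; the helper `download_range` is hypothetical.

```python
from datetime import date, timedelta

FORECAST_DAYS = 7  # matches `export FORECAST_DAYS=7` in the ENVS file


def download_range(start, end, forecast_days=FORECAST_DAYS):
    """Extend a split's end date by the forecast length.

    Returns (start, extended_end, number_of_days), counting both endpoints.
    """
    extended_end = end + timedelta(days=forecast_days)
    n_days = (extended_end - start).days + 1
    return start, extended_end, n_days


# Assumed split boundaries, inferred from this notebook
splits = {
    "train": (date(2020, 1, 1), date(2020, 3, 31)),
    "val": (date(2020, 4, 3), date(2020, 4, 23)),
    "test": (date(2020, 4, 1), date(2020, 4, 2)),
}
ranges = {name: download_range(start, end) for name, (start, end) in splits.items()}
for name, (start, end, n_days) in ranges.items():
    print(f"{name}: {start} to {end} ({n_days} days)")
```

Under these assumptions the extended end dates come out as 7 April, 30 April and 9 April 2020 for train, validation and test respectively.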

In [7]:
# Date range for training (Adding 7 days forecast period to end date)
!icenet_data_sic south -d 2020-1-1 2020-4-7

# Date range for validation (Adding 7 days forecast period to end date)
!icenet_data_sic south -d 2020-4-3 2020-4-30

# Date range for test  (Adding 7 days forecast period to end date)
# Note: Above date range already covers this, so this data will not be re-downloaded.
!icenet_data_sic south -d 2020-4-1 2020-4-9

[12-04-24 13:56:19 :INFO    ] - OSASIF-SIC Data Downloading
[12-04-24 13:56:19 :INFO    ] - Downloading SIC datafiles to .temp intermediates...
[12-04-24 13:56:19 :INFO    ] - Excluding 366 dates already existing from 98 dates requested.
[12-04-24 13:56:19 :INFO    ] - Opening for interpolation: ['./data/osisaf/south/siconca/2020.nc']
[12-04-24 13:56:19 :INFO    ] - Processing 0 missing dates
[12-04-24 13:56:20 :INFO    ] - OSASIF-SIC Data Downloading
[12-04-24 13:56:20 :INFO    ] - Downloading SIC datafiles to .temp intermediates...
[12-04-24 13:56:20 :INFO    ] - Excluding 366 dates already existing from 28 dates requested.
[12-04-24 13:56:20 :INFO    ] - Opening for interpolation: ['./data/osisaf/south/siconca/2020.nc']
[12-04-24 13:56:20 :INFO    ] - Processing 0 missing dates
[12-04-24 13:56:21 :INFO    ] - OSASIF-SIC Data Downloading
[12-04-24 13:56:21 :INFO    ] - Downloading SIC datafiles to .temp intermediates...
[12-04-24 13:56:21 :INFO    ] - Excluding 366 dates already exis

___
# 3. Process

The following command processes the downloaded data for the dates defined in the ENVS file.

This is equivalent to running `icenet_process_era5`, `icenet_process_ora5`, `icenet_process_sic`, `icenet_process_metadata` commands from the IceNet library (as demonstrated in the first notebook).

The arguments passed to these commands are obtained from the `PROC_ARGS_*` variables in the ENVS file.

The dates that are processed are defined by the following variables in the ENVS file:
* `TRAIN_START_*`
* `TRAIN_END_*`
* `VAL_START_*`
* `VAL_END_*`
* `TEST_START_*`
* `TEST_END_*`

This only needs to be run once unless the above variables need to be changed. Hence, it can be run as a precursor to the pipeline if the processed data does not need to change.

In [8]:
!./run_data.sh south


CondaError: Run 'conda init' before 'conda activate'

[12-04-24 13:56:37 :INFO    ] - Got 91 dates for train
[12-04-24 13:56:37 :INFO    ] - Got 21 dates for val
[12-04-24 13:56:37 :INFO    ] - Got 2 dates for test
[12-04-24 13:56:37 :INFO    ] - Creating path: ./processed/tutorial_pipeline_south/era5
[12-04-24 13:56:37 :DEBUG   ] - Setting range for linear trend steps based on 7
[12-04-24 13:56:37 :INFO    ] - Processing 91 dates for train category
[12-04-24 13:56:37 :INFO    ] - Including lag of 1 days
[12-04-24 13:56:37 :INFO    ] - Including lead of 93 days
[12-04-24 13:56:37 :DEBUG   ] - Globbing train from ./data/era5/south/**/[12]*.nc
[12-04-24 13:56:37 :DEBUG   ] - Globbed 494 files
[12-04-24 13:56:37 :DEBUG   ] - Create structure of 494 files
[12-04-24 13:56:37 :INFO    ] - Processing 21 dates for val category
[12-04-24 13:56:37 :INFO    ] - Including lag of 1 days
[12-04-24 13:56:37 :INFO    ] - Including lead of 93 days
[12-04-24 13:56:37 :DEBUG   ] - Globbing val from ./da

___
# 4. Train

To produce forecasts in the pipeline described above, we run a set of models using the [model-ensembler](https://github.com/JimCircadian/model-ensembler) tool, and the pipeline provides convenience scripts for doing this as part of the end-to-end run.

This requires the [model-ensembler](https://pypi.org/project/model-ensembler/) (`pip install model-ensembler`) module to be installed.

Note that the model-ensembler submits jobs on your behalf. To configure the job scripts, edit the templates used to generate them: these are the `.yaml` files in the `ensemble/` folder of your clone of the `icenet-pipeline` repository (in particular [`train.tmpl.yaml`](https://github.com/icenet-ai/icenet-pipeline/blob/main/ensemble/train.tmpl.yaml) for the training ensemble jobs).

Many of the arguments for the following command are equivalent to those of the `icenet_train` command. However, the filters factor, which is `-n` for `icenet_train`, is `-f` here; `-n` instead selects the node to run on, `-p` the pre_run script to use, and `-j` the number of simultaneous runs to execute on the SLURM cluster we use at BAS. These arguments are not necessarily required for other clusters, nor is the model-ensembler limited to running on SLURM (at present, it can also run locally).

The pipeline repository shell scripts that provide this functionality are easily adaptable, as well as the ensemble itself which is stored in the pipeline repository under `/ensemble/`.

_Please review the `-h` help option for the script to gain further insight into the options available._

In [9]:
!./run_train_ensemble.sh --help

Usage ./run_train_ensemble.sh LOADER DATASET NAME



The optional arguments (some are not defined in this example):
| argument | description                                                                                                  | value |
|     ---: |:---                                                                                                          | :---  |
|*-b*      | Batch size                                                                                                   | -     |
|*-d*      | Run locally instead of submitting SLURM jobs                                                                 | -     |
|*-e*      | Number of epochs to train for                                                                                | 10    |
|*-f*      | Scale the neural network channel sizes by this factor (reduces network size, priority over ENVS definition)  | 0.6   |
|*-m*      | Memory required                                                                                              | 64gb  |
|*-n*      | Node to run on                                                                                               | -     |
|*-p*      | pre_run script to use                                                                                        | -     |
|*-q*      | Maximum queue size                                                                                           | 4     |
|*-r*      | Seed values for ensemble members (determines no. of ensemble members, overrides values in ENVS if specified) | -     |
|*-j*      | No. of simultaneous runs to execute on the SLURM cluster                                                     | 5     |


The positional arguments:
| argument | description                                    | value                   |
|     ---: |:---                                            | :---                    |
|*LOADER*  | Name of loader: loader.{LOADER}.json           | tutorial_pipeline_south |
|*DATASET* | Name of dataset: dataset_config.{DATASET}.json | tutorial_pipeline_south |
|*NAME*    | Neural network output name                     | tutorial_south_ensemble |

The loader and dataset names are defined by the prefix in the `ENVS` file. The hemisphere is appended to the defined string, so the following in the `ENVS.notebook_tutorial` file becomes "tutorial_pipeline_south".

```bash
PREFIX="TUTORIAL_PIPELINE"
```
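
To make the naming rule concrete, here is a small sketch that builds the loader/dataset name and the matching configuration file names. The lower-casing of the prefix is inferred from the example above and not confirmed against the pipeline scripts, and the helper `config_names` is hypothetical.

```python
def config_names(prefix, hemisphere):
    """Build the loader/dataset name and the matching config file names."""
    # Assumption: the name is the lower-cased prefix, an underscore, then the hemisphere
    name = f"{prefix.lower()}_{hemisphere}"
    return {
        "name": name,
        "loader_config": f"loader.{name}.json",
        "dataset_config": f"dataset_config.{name}.json",
    }


names = config_names("TUTORIAL_PIPELINE", "south")
print(names["name"])
print(names["loader_config"])
print(names["dataset_config"])
```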

In [10]:
# Positional Arguments
# argument 1: The loader json file:          loader.tutorial_pipeline_south.json
# argument 2: The dataset json file:         dataset_config.tutorial_pipeline_south.json
# argument 3: The trained network name:      tutorial_south_ensemble
!./run_train_ensemble.sh -e 10 -f 0.6 -m 64gb -q 4 -j 5 tutorial_pipeline_south tutorial_pipeline_south tutorial_south_ensemble

ARGS: -e 10 -f 0.6 -m 64gb -q 4 -j 5 tutorial_pipeline_south tutorial_pipeline_south tutorial_south_ensemble
ARGS = -x arg_epochs=10 arg_filter_factor=0.6 mem=64gb arg_queue=4 , Leftovers: tutorial_pipeline_south tutorial_pipeline_south tutorial_south_ensemble
No. of ensemble members:  2
Ensemble members:  42,46
Running model_ensemble ./tmp.pr8RNWCTp7.train slurm -x arg_epochs=10 arg_filter_factor=0.6 mem=64gb arg_queue=4 
[12-04-24 13:59:45    :INFO    ] - Model Ensemble Runner
[12-04-24 13:59:45    :INFO    ] - Validated configuration file ./tmp.pr8RNWCTp7.train successfully
[12-04-24 13:59:45    :INFO    ] - Importing model_ensembler.cluster.slurm
[12-04-24 13:59:45    :INFO    ] - Running batcher
[12-04-24 13:59:45    :INFO    ] - Running command: mkdir -p ./results/networks
[12-04-24 13:59:45    :INFO    ] - Start batch: 2024-04-12 12:59:45.388853
[12-04-24 13:59:45    :INFO    ] - Running cycle 1
[12-04-24 13:59:45    :INFO    ] - Start run tutorial_south_ensemble-0 at 2024-04-12

This trains based on the processed data, and creates a sub-directory under `ensemble/` with the network name that contains each of the ensemble runs. This includes log files for debugging in case of any errors/issues in the training process.

```bash
ensemble/
└── tutorial_south_ensemble/
    ├── tutorial_south_ensemble-0/
    │   ├── *.err    <-- Error file
    │   └── *.out    <-- Log file
    └── tutorial_south_ensemble-1/
        └── ...
```
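
When an ensemble has many members, checking each run folder by hand becomes tedious. The sketch below lists the run directories that contain a non-empty `*.err` file, assuming the layout shown above. The helper `runs_with_errors` and the file name `train.err` are hypothetical, and a non-empty error file may hold only warnings, so treat the result as a list of places to look rather than a list of failures.

```python
import tempfile
from pathlib import Path


def runs_with_errors(network_dir):
    """Return names of run directories holding at least one non-empty *.err file."""
    flagged = []
    run_dirs = sorted(p for p in Path(network_dir).iterdir() if p.is_dir())
    for run_dir in run_dirs:
        if any(err.stat().st_size > 0 for err in run_dir.glob("*.err")):
            flagged.append(run_dir.name)
    return flagged


# Demonstration with a throwaway copy of the layout above
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "tutorial_south_ensemble"
    for index, message in enumerate(["", "example error text\n"]):
        run_dir = base / f"tutorial_south_ensemble-{index}"
        run_dir.mkdir(parents=True)
        (run_dir / "train.err").write_text(message)  # hypothetical file name
    flagged = runs_with_errors(base)
    print(flagged)
```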

The output from the trained network can be found in `results/networks`. The specifics of what is contained there are out of scope for this notebook (please see [03.data_and_forecasts.ipynb](03.data_and_forecasts.ipynb) after running through this notebook), but in general it stores the trained model and a history of the losses and other metrics.

```bash
results/
└── networks/
    └── tutorial_south_ensemble/
        ├── *.h5
        ├── *.json
        └── ...
```

___
# 5. Predict

In a similar manner to the training script, the `run_predict_ensemble` script will submit jobs to the HPC. The template corresponding to the prediction run is [`predict.tmpl.yaml`](https://github.com/icenet-ai/icenet-pipeline/blob/main/ensemble/predict.tmpl.yaml) found in the `icenet-pipeline` repo.

For the ensemble prediction, we define the dates we want to predict for in a CSV file. This can be automatically generated from the dataset as follows.

In [11]:
!./loader_test_dates.sh tutorial_pipeline_south | tee testdates.csv

2020-04-01
2020-04-02
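
If you want to inspect or validate the date file before submitting a prediction run, a minimal reader might look like the sketch below. It assumes one ISO-formatted date per line, as in the output above; the helper `read_forecast_dates` is hypothetical.

```python
import io
from datetime import date


def read_forecast_dates(handle):
    """Parse one ISO date (YYYY-MM-DD) per line, skipping blank lines."""
    return [date.fromisoformat(line.strip()) for line in handle if line.strip()]


# In practice you would pass open("testdates.csv"); here we reuse the output above
sample = io.StringIO("2020-04-01\n2020-04-02\n")
dates = read_forecast_dates(sample)
print(dates)
```

A malformed line raises `ValueError`, which surfaces problems before any jobs are submitted.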


First look at the required input arguments for running the prediction ensemble.

In [12]:
!./run_predict_ensemble.sh --help

Usage ./run_predict_ensemble.sh NETWORK DATASET NAME DATEFILE [LOADER]


Many of the command line arguments are the same as with `run_train_ensemble` listed above.

So, to predict from an ensemble training run, we use:

| argument  | description                                          | value                            |
|     ---:  |:---                                                  | :---                             |
|*NETWORK*  | Name of trained neural network to use for prediction | tutorial_south_ensemble          |
|*DATASET*  | Name of dataset: dataset_config.{DATASET}.json       | tutorial_pipeline_south          |
|*NAME*     | Name of output prediction                            | tutorial_south_ensemble_forecast |
|*DATEFILE* | Dates to predict for                                 | testdates.csv                    |
|*LOADER*   | Name of loader: loader.{LOADER}.json (optional)      | -                                |

In [13]:
# -f: n_filters_factor (matching the value used for training)
# -p: prep bash script (A bash script to run before running the prediction)
!./run_predict_ensemble.sh -f 0.6 -p bashpc.sh tutorial_south_ensemble tutorial_pipeline_south tutorial_south_ensemble_forecast testdates.csv

ARGS: -f 0.6 -p bashpc.sh tutorial_south_ensemble tutorial_pipeline_south tutorial_south_ensemble_forecast testdates.csv
ARGS = -x arg_filter_factor=0.6 arg_prep=bashpc.sh , Leftovers: tutorial_south_ensemble tutorial_pipeline_south tutorial_south_ensemble_forecast testdates.csv
No. of ensemble members:  2
Ensemble members:  42,46
Running model_ensemble ./tmp.jpYObID30M.predict slurm -x arg_filter_factor=0.6 arg_prep=bashpc.sh 
[12-04-24 14:09:18    :INFO    ] - Model Ensemble Runner
[12-04-24 14:09:18    :INFO    ] - Validated configuration file ./tmp.jpYObID30M.predict successfully
[12-04-24 14:09:18    :INFO    ] - Importing model_ensembler.cluster.slurm
[12-04-24 14:09:18    :INFO    ] - Running batcher
[12-04-24 14:09:18    :INFO    ] - Start batch: 2024-04-12 13:09:18.576347
[12-04-24 14:09:18    :INFO    ] - Running cycle 1
[12-04-24 14:09:18    :INFO    ] - Running command: /usr/bin/ln -s ../../data
[12-04-24 14:09:18    :INFO    ] - Start run tutorial_south_ensemble_forecast-0

As with the previous example, the individual numpy outputs, samples and sample weights are deposited into `/results/predict` for each ensemble member. However, the ensemble also runs `icenet_output` to generate __a CF-compliant NetCDF containing the forecasts requested__ which can then be post-processed or [deposited to an external location](#Uploading-to-Azure) (which is the platform for the [wider IceNet forecasting infrastructure](https://github.com/alan-turing-institute/IceNet-Project)). 

In [14]:
# Numpy files location (under each ensemble directory listed in the output of this cell)
!ls ./results/predict/tutorial_south_ensemble_forecast

tutorial_south_ensemble.42  tutorial_south_ensemble.46


In [15]:
# Combined netCDF file location
!ls ./results/predict/tutorial_south_ensemble_forecast.nc

./results/predict/tutorial_south_ensemble_forecast.nc


___
# 6. Visualisation

## View the forecast output from the pipeline

Now that we have a prediction, we can visualise the binary sea ice concentration using some of the built-in tools in IceNet that utilise `cartopy` and `matplotlib`.

(Note: There are also some scripts in the [icenet-pipeline](https://github.com/icenet-ai/icenet-pipeline) repository that enable plotting common results such as `produce_op_assets.sh`)

Here, we are loading the prediction netCDF file we've just created in the previous step.

We are also using the `Masks` class from IceNet to create a land mask region that will mask out the land regions in the forecast plot.

In [16]:
from icenet.plotting.video import xarray_to_video as xvid
from icenet.data.sic.mask import Masks
from IPython.display import HTML
import xarray as xr, pandas as pd, datetime as dt

# Load our output prediction file
ds = xr.open_dataset("results/predict/tutorial_south_ensemble_forecast.nc")
land_mask = Masks(south=True, north=False).get_land_mask()
ds.info()

PyTorch not found - not required if not using PyTorch
xarray.Dataset {
dimensions:
	time = 2 ;
	yc = 432 ;
	xc = 432 ;
	leadtime = 7 ;

variables:
	int32 Lambert_Azimuthal_Grid() ;
		Lambert_Azimuthal_Grid:grid_mapping_name = lambert_azimuthal_equal_area ;
		Lambert_Azimuthal_Grid:longitude_of_projection_origin = 0.0 ;
		Lambert_Azimuthal_Grid:latitude_of_projection_origin = -90.0 ;
		Lambert_Azimuthal_Grid:false_easting = 0.0 ;
		Lambert_Azimuthal_Grid:false_northing = 0.0 ;
		Lambert_Azimuthal_Grid:semi_major_axis = 6378137.0 ;
		Lambert_Azimuthal_Grid:inverse_flattening = 298.257223563 ;
		Lambert_Azimuthal_Grid:proj4_string = +proj=laea +lon_0=0 +datum=WGS84 +ellps=WGS84 +lat_0=-90.0 ;
	float32 sic_mean(time, yc, xc, leadtime) ;
		sic_mean:long_name = mean sea ice area fraction across ensemble runs of icenet model ;
		sic_mean:standard_name = sea_ice_area_fraction ;
		sic_mean:short_name = sic ;
		sic_mean:valid_min = 0 ;
		sic_mean:valid_max = 1 ;
		sic_mean:ancillary_variables = 

The next cell obtains the start date of the forecast.

In [17]:
# Get the forecast start date
forecast_date = ds.time.values[0]
print(forecast_date)

2020-04-01T00:00:00.000000000


Here, we plot the forecast across the range of days we've defined within the `ENVS` file (7 days in this case).

Since this is a demonstrator notebook, we have not trained our network for a prolonged period of time or for a large date range, but the plot below shows indicative results of what the output would look like.

In [18]:
fc = ds.sic_mean.isel(time=0).drop_vars("time").rename(dict(leadtime="time"))
fc['time'] = [pd.to_datetime(forecast_date) \
              + dt.timedelta(days=int(e)) for e in fc.time.values]

anim = xvid(fc, 15, figsize=4, mask=land_mask)
HTML(anim.to_jshtml())
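
The relabelling in the cell above turns each lead time into the calendar date it refers to. The same mapping is sketched below with only the standard library. The lead times are assumed to be numbered 1 to 7 here, which you should check against `ds.leadtime.values` in your own output; the helper `valid_dates` is hypothetical.

```python
from datetime import datetime, timedelta


def valid_dates(forecast_start, leadtimes):
    """Map lead times (in days) to the calendar dates they refer to."""
    return [forecast_start + timedelta(days=int(lead)) for lead in leadtimes]


start = datetime(2020, 4, 1)
leads = range(1, 8)  # assumption: lead times are numbered 1..7
dates = valid_dates(start, leads)
print(dates[0].date(), dates[-1].date())
```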

## Other Pipeline Considerations

### A bit more information on ensemble runs

#### Cleaning up runs

Ensemble runs take place under `/ensemble/` in the pipeline folder and ARE NOT deleted after they've happened, to allow for debugging. Commonly, the ensemble configurations will contain a delete task to remove the extraneous run folders. __In the meantime this should be done manually__ after running `run_train_ensemble` or `run_predict_ensemble`.

The only exception to this is the use of `run_daily.sh` (see below) which does clean up prior to rerunning. 
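
A manual clean-up can be scripted along the lines of the sketch below. It removes only the named run folder, leaving the templates in `ensemble/` and any other runs in place, and defaults to a dry run so nothing is deleted by accident. The helper `clean_ensemble_runs` is hypothetical and not part of the pipeline.

```python
import shutil
import tempfile
from pathlib import Path


def clean_ensemble_runs(ensemble_dir, run_name, dry_run=True):
    """Remove `<ensemble_dir>/<run_name>/`, leaving templates and other runs alone.

    Returns the path that was (or would be) removed, or None if it is absent.
    """
    target = Path(ensemble_dir) / run_name
    if not target.is_dir():
        return None
    if not dry_run:
        shutil.rmtree(target)
    return target


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    ensemble = Path(tmp) / "ensemble"
    run = ensemble / "tutorial_south_ensemble" / "tutorial_south_ensemble-0"
    run.mkdir(parents=True)
    template = ensemble / "train.tmpl.yaml"
    template.write_text("# template placeholder\n")

    would_remove = clean_ensemble_runs(ensemble, "tutorial_south_ensemble")
    still_there = (ensemble / "tutorial_south_ensemble").is_dir()
    clean_ensemble_runs(ensemble, "tutorial_south_ensemble", dry_run=False)
    removed = not (ensemble / "tutorial_south_ensemble").exists()
    template_kept = template.is_file()
    print(would_remove.name, still_there, removed, template_kept)
```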

### Daily execution

Daily execution is facilitated in the pipeline by using [`run_daily.sh`](https://github.com/antarctica/IceNet-Pipeline/blob/main/run_daily.sh). This wraps all the necessary steps to perform the following sequence for producing forecasts from yesterday for the next 93 days, for both northern and southern hemispheres. 

* Removes any old ensemble runs
* Downloads [HRES forecast data from the ECMWF MARS API](https://www.ecmwf.int/en/forecasts/datasets/catalogue-ecmwf-real-time-products)
* Processes the HRES and necessary training metadata to produce a data loader
* Creates a dataset configuration for it
* Runs a [prediction ensemble](#5-Predict) to produce a NetCDF
* Uploads to the necessary endpoint

#### Automation

With the above shell script it's trivial to automate using cron. Of course this is simply for demonstration, with more complex workflow managers offering far greater flexibility, especially when considering analysis of the produced forecasts.

```bash
# We assume your environment is configured appropriately to run conda from cron files, for example by adding...
#
# SHELL=/bin/bash
# BASH_ENV=~/.bashrc_env
#
# With conda initialisation in bashrc_env at the top of your crontab
25 9 * * * conda activate icenet; cd $HOME/hpc/icenet/pipeline && bash run_daily.sh >$HOME/daily.log 2>&1; conda deactivate
```
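
The five schedule fields in the crontab entry above are minute, hour, day of month, month and day of week, so `25 9 * * *` fires at 09:25 every day. The sketch below works out when such an entry would next run from a given moment; it only illustrates the schedule, and the helper `next_daily_run` is hypothetical.

```python
from datetime import datetime, timedelta


def next_daily_run(now, hour=9, minute=25):
    """Return the next time a `25 9 * * *` style daily cron entry would fire."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has passed, use tomorrow's
    return candidate


before_slot = next_daily_run(datetime(2024, 4, 12, 8, 0))
after_slot = next_daily_run(datetime(2024, 4, 12, 13, 56))
print(before_slot, after_slot)
```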

TODO: more information on the usage of this command.

## Summary

Within this notebook we've attempted to give a full crash course on the IceNet pipeline and how to utilise it for a generalised run using the __pipeline helper scripts__. This is the second of six (currently) notebooks contained within the pipeline repository, covering further information: 

* [Data structure and analysis](03.data_and_forecasts.ipynb): understand the structure of the data stores and products created by these workflows and what tools currently exist in IceNet to look over them.
* [Library usage](04.library_usage.ipynb): understand how to programmatically perform an end to end run.
* [Library extension](05.library_extension.ipynb): understand why and how to extend the IceNet library.

## Version
- IceNet Codebase: v0.2.8