In [None]:
# Quick hack to put us in the icenet-pipeline folder, assuming it was created as per 01.cli_demonstration.ipynb
import os
if os.path.exists("03.library_usage.ipynb"):
    os.chdir("../notebook-pipeline")
print("Running in {}".format(os.getcwd()))

%matplotlib inline

# IceNet Library Usage

## Context

### Purpose
The IceNet library provides the ability to download, process, train and predict from end to end via a set of command-line interfaces.

Using this notebook we can understand how to programmatically undertake various activities using the IceNet library, allowing for significant customisation of the end to end deep learning pipeline for research and operational use.

### Modelling approach
This modelling approach allows users to immediately utilise the library for producing sea ice concentration forecasts.

### Highlights
The key features of an end to end run are: 
* Setup: _this was concerned with setting up the conda environment, which remains the same as in 01.cli_demonstration_
* [Download](#Download) 
* [Process](#Process)
* [Train](#Train)
* [Predict](#Predict)

_This follows the same structure as the CLI demonstration notebook so that it's easy to follow step-by-step..._

### Contributions
#### Notebook
James Byrne (author)

__Please raise issues [in this repository](https://github.com/antarctica/IceNet-Pipeline) to suggest updates to this notebook!__ 

Contact me at _jambyr \<at\> bas.ac.uk_ for anything else...

#### Modelling codebase
James Byrne (code author), Tom Andersson (science author)

#### Modelling publications
Andersson, T.R., Hosking, J.S., Pérez-Ortiz, M. et al. Seasonal Arctic sea ice forecasting with probabilistic deep learning. Nat Commun 12, 5124 (2021). https://doi.org/10.1038/s41467-021-25257-4

#### Involved organisations
The Alan Turing Institute and British Antarctic Survey

## Introduction

Once installed, the API can be used from any Python interpreter as easily as the CLI commands can be used from a shell. As usual, ensure that you're operating within the conda environment you installed the library into.

### A tip on CLI - API usage

Behind the scenes, all of the `icenet_*` CLI commands call into the API. By inspecting the [`setup.py`](https://github.com/JimCircadian/icenet2/blob/main/setup.py#L32) entry points you can locate the module, and thus the code, used by each command.

In most cases the CLI makes assumptions about what to do and does not necessarily expose every option for changing the library's behaviour. This is primarily because the CLI entry points are still under development. The CLI operations are therefore best suited to introductory use, while the API is recommended for advanced use cases and pipeline integrations.
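
As a rough sketch of that inspection, the standard library's `importlib.metadata` can list the installed console scripts and the `module:function` targets they point at. The `icenet_` prefix comes from the text above; the helper name `find_console_scripts` is purely illustrative, and the result will simply be empty if the library is not installed in the active environment.

```python
from importlib.metadata import entry_points


def find_console_scripts(prefix):
    """Return sorted (name, target) pairs for console scripts starting with prefix."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer Python versions expose a selectable collection
        scripts = eps.select(group="console_scripts")
    else:
        # Older Python versions return a plain dict keyed by group
        scripts = eps.get("console_scripts", [])
    return sorted((ep.name, ep.value) for ep in scripts
                  if ep.name.startswith(prefix))


for name, target in find_console_scripts("icenet_"):
    print("{} -> {}".format(name, target))
```

Each target names the module and function behind a command, which is the place to start reading when you want to reproduce a CLI step through the API.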

### What we'll cover

For the sake of illustration this notebook will display and execute the API code equivalent to [the first notebook of this collection](01.cli_demonstration.ipynb), along with some updates that incorporate the visualisations from [the second notebook describing the data](02.data_and_forecasts.ipynb). To extend our dataset, we'll work towards widening our original downloads from covering *2019-12-28 through 2020-4-30* to cover 2020 in totality, as well as creating a more complex selection of dates for our dataset and training and predicting with new networks.

We'll start with some frequently useful imports...

In [None]:
import glob, json, os, random, sys
import datetime as dt
import numpy as np, pandas as pd, xarray as xr, matplotlib.pyplot as plt
from IPython.display import HTML

# We also set the logging level so that we get some feedback from the API
import logging
logging.basicConfig(level=logging.INFO)

## Download

The following is preparation of the downloaders, whose instantiation describes the interactions with the upstream APIs/data interfaces used to source various types of data. 

In [None]:
from icenet.data.sic.mask import Masks
from icenet.data.interfaces.cds import ERA5Downloader
from icenet.data.sic.osisaf import SICDownloader

masks = Masks(north=False, south=True)
era5 = ERA5Downloader(
    var_names=["tas", "zg", "uas", "vas"],
    pressure_levels=[None, [250, 500], None, None],
    dates=[pd.to_datetime(date).date() for date in
           pd.date_range("2020-1-1", "2020-4-30", freq="D")],
    delete_tempfiles=False,
    max_threads=64,
    north=False,
    south=True,
    # NOTE: there appears to be a bug with the toolbox API at present (icenet#54)
    use_toolbox=False
)
sic = SICDownloader(
    dates=[pd.to_datetime(date).date() for date in
           pd.date_range("2020-1-1", "2020-4-30", freq="D")],
    delete_tempfiles=False,
    north=False,
    south=True,
)

Next we download all required data with our extended date range. All downloaders inherit a `download` method from the `Downloader` class in [`icenet2.data.producers`](https://github.com/JimCircadian/icenet2/blob/main/icenet2/data/producers.py), which also contains two other data producing classes `Generator` (which Masks inherits from) and `Processor` (used in the next section), each providing abstract implementations that multiple classes derive from.
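
To make the inheritance pattern concrete, here is a minimal sketch using toy stand-ins. None of the `Toy*` classes or the `folder_for` helper are IceNet code; they are hypothetical names that only illustrate how an abstract base can supply shared behaviour while requiring subclasses to implement `download`.

```python
from abc import ABC, abstractmethod


class ToyProducer(ABC):
    """Hypothetical base class: knows where its outputs live."""

    def __init__(self, identifier, base_path="./data"):
        self.identifier = identifier
        self.base_path = base_path

    def folder_for(self, var_name):
        # Shared helper that every subclass gets for free
        return "/".join([self.base_path, self.identifier, var_name])


class ToyDownloader(ToyProducer):
    """Hypothetical abstract downloader: subclasses must implement download()."""

    @abstractmethod
    def download(self):
        ...


class ToyERA5Downloader(ToyDownloader):
    def __init__(self, var_names, **kwargs):
        super().__init__("era5", **kwargs)
        self.var_names = var_names

    def download(self):
        # A real implementation would call the upstream data interface here
        return [self.folder_for(var_name) for var_name in self.var_names]


print(ToyERA5Downloader(["tas", "zg"]).download())
```

The abstract class cannot be instantiated directly, which is what pushes each concrete data source to provide its own `download`.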

In [None]:
# The original downloading takes a while, hence I left it over a weekend to preserve 
# the logging. When needing to pick up the processing, you can rerun these items
# to continue to regridding/rotating of fields

masks.generate(save_polarhole_masks=False)
era5.download()
sic.download()

The `ERA5Downloader` inherits from `ClimateDownloader`, from which several implementations derive their functionality. Two particularly useful methods shown below allow the downloaded data to be converted to the same grid and orientation as the OSISAF SIC data.
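
As background for the rotation step, the sketch below shows the general idea of re-expressing a wind vector in a rotated frame. It is an assumption-laden illustration rather than IceNet's implementation: `rotate_vector` is a hypothetical helper, the anticlockwise sign convention is chosen arbitrarily, and in practice the angle would vary for every grid cell.

```python
import math


def rotate_vector(u, v, angle_deg):
    """Rotate the vector (u, v) anticlockwise by angle_deg degrees."""
    theta = math.radians(angle_deg)
    u_rot = u * math.cos(theta) - v * math.sin(theta)
    v_rot = u * math.sin(theta) + v * math.cos(theta)
    return u_rot, v_rot


# Rotate a purely eastward 5 m/s wind by 90 degrees
u_rot, v_rot = rotate_vector(5.0, 0.0, 90.0)
print(round(u_rot, 6), round(v_rot, 6))

# The wind speed is unchanged by the rotation, only its components differ
print(round(math.hypot(u_rot, v_rot), 6))
```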

In [None]:
era5.regrid()
era5.rotate_wind_data()

It is hopefully obvious now that the CLI operations wrap several activities within the API up for convenience and initial ease of use, but that for experimentation, research and advancing the pipeline the API offers greater flexibility to manipulate processing chains as required for these purposes.

## Process

Similarly to the downloaders, each data producer (be it a `Downloader` or `Generator`) has a respective `Processor` that converts the `/data/` products into a normalised, preprocessed dataset under `/processed/` as per the `icenet_process_*` commands.

Firstly, to make life a bit easier, we set up some variables that are normally handled from the CLI arguments. In this case we're splitting the validation and test sets out of the 2020 data in a fairly naive manner.
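
When splitting dates by hand like this it is easy to let two sets overlap by accident. The following sketch uses only the standard library to rebuild the three date ranges used in this section and check that they are disjoint; `date_range` here is a small hypothetical helper, not the pandas function.

```python
import datetime as dt


def date_range(start, end):
    """Inclusive list of daily dates from start to end."""
    return [start + dt.timedelta(days=n)
            for n in range((end - start).days + 1)]


splits = {
    "train": date_range(dt.date(2020, 1, 1), dt.date(2020, 3, 31)),
    "val": date_range(dt.date(2020, 4, 3), dt.date(2020, 4, 23)),
    "test": date_range(dt.date(2020, 4, 1), dt.date(2020, 4, 2)),
}

# Compare every pair of splits and fail loudly if any date is shared
names = list(splits)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        overlap = set(splits[first]) & set(splits[second])
        assert not overlap, "{} and {} share dates".format(first, second)

print({name: len(dates) for name, dates in splits.items()})
```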

In [None]:
processing_dates = dict(
    train=[pd.to_datetime(el) for el in pd.date_range("2020-1-1", "2020-3-31")],
    val=[pd.to_datetime(el) for el in pd.date_range("2020-4-3", "2020-4-23")],
    test=[pd.to_datetime(el) for el in pd.date_range("2020-4-1", "2020-4-2")],
)
processed_name = "notebook_api_data"

Next, we create the data processors and configure them for the dataset we want to create.

In [None]:
from icenet.data.processors.era5 import IceNetERA5PreProcessor
from icenet.data.processors.meta import IceNetMetaPreProcessor
from icenet.data.processors.osi import IceNetOSIPreProcessor

pp = IceNetERA5PreProcessor(
    ["uas", "vas"],
    ["tas", "zg500", "zg250"],
    processed_name,
    processing_dates["train"],
    processing_dates["val"],
    processing_dates["test"],
    linear_trends=tuple(),
    north=False,
    south=True
)
osi = IceNetOSIPreProcessor(
    ["siconca"],
    [],
    processed_name,
    processing_dates["train"],
    processing_dates["val"],
    processing_dates["test"],
    linear_trends=None,
    north=False,
    south=True
)
meta = IceNetMetaPreProcessor(
    processed_name,
    north=False,
    south=True
)

Next, we initialise the data processors using `init_source_data` which scans the data source directories to understand what data is available for processing based on the parameters. 
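
To see why lag and lead values matter when scanning for source data, the sketch below works out which days would be needed around a single sample date. It assumes that `lag_days` counts days before the date and `lead_days` counts days after it; that reading is inferred from the parameter names rather than confirmed here, and `required_dates` is a hypothetical helper.

```python
import datetime as dt


def required_dates(sample_date, lag_days=0, lead_days=0):
    """Dates needed around sample_date under the assumption stated above."""
    return [sample_date + dt.timedelta(days=offset)
            for offset in range(-lag_days, lead_days + 1)]


window = required_dates(dt.date(2020, 4, 1), lag_days=1, lead_days=7)
print(window[0], "to", window[-1], "({} days)".format(len(window)))
```

Under this reading, dates at the very start or end of the downloaded range can only be used as samples if the surrounding days are also present on disk.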

In [None]:
pp.init_source_data(
    lag_days=1,
)
pp.process()
osi.init_source_data(
    lag_days=1,
    lead_days=7,
)
osi.process()
meta.process()

At this point the preprocessed data is ready to be used to create a configuration for the network dataset.

### Dataset creation

As with the `icenet_dataset_create` command we can create a dataset configuration for training the network. As before this can include cached data for the network in the format of a TFRecordDataset compatible set of tfrecords. To achieve this we create the `IceNetDataLoader`, which can both generate `IceNetDataSet` configurations (which easily provide the necessary functionality for training and prediction) as well as individual data samples for direct usage.

In [None]:
from icenet.data.loaders import IceNetDataLoaderFactory

dl = IceNetDataLoaderFactory().create_data_loader(
    "loader.notebook_api_data.json",
    "api_dataset",
    1,
    n_forecast_days=93,
    north=False,
    south=True,
    output_batch_size=4,
    generate_workers=8)

At this point we can either use `generate` or `write_dataset_config_only` to produce a ready-to-go `IceNetDataSet` configuration. 
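
Once a configuration has been written it can be inspected like any other JSON file. This sketch lists the top-level keys without assuming what they are; `summarise_config` is a hypothetical helper, and the file name is the one used when loading the dataset for training in this notebook. It returns `None` if the file has not been generated yet.

```python
import json
import os


def summarise_config(path):
    """Return the sorted top-level keys of a JSON config, or None if it is absent."""
    if not os.path.exists(path):
        return None
    with open(path) as fh:
        config = json.load(fh)
    return sorted(config.keys())


# File name as used when loading the dataset for training in this notebook
print(summarise_config("dataset_config.api_dataset.json"))
```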

In [None]:
dl.generate()

## Train

For single runs we can programmatically call the same method used by the CLI: `train_model` defines the training process from start to finish. The [`model-ensembler`](https://github.com/JimCircadian/model-ensembler) works outside the API, controlling multiple CLI submissions. Customising an ensemble can be achieved by looking at the configuration in [the pipeline repository](https://github.com/antarctica/IceNet-Pipeline). That said, if workflow system integration (e.g. Airflow) is desired, integrating via this method is the way to go.

In [None]:
from icenet.model.train import train_model
from icenet.data.dataset import IceNetDataSet
import tensorflow as tf

dataset = IceNetDataSet("dataset_config.api_dataset.json",
                        batch_size=4)
strategy = tf.distribute.get_strategy()

trained_path, history = \
    train_model("api_test_run",
                dataset,
                batch_size=4,
                epochs=10,
                n_filters_factor=0.6,
                seed=42,
                strategy=strategy,
                # == 2 for notebook usage
                training_verbosity=2,
                # Various other parameters can be set here as with the CLI
                # learning_rate=args.lr,
                # lr_10e_decay_fac=args.lr_10e_decay_fac,
                # lr_decay_start=args.lr_decay_start,
                # lr_decay_end=args.lr_decay_end,
                # pre_load_network=args.preload is not None,
                # pre_load_path=args.preload,
                # use_multiprocessing=args.multiprocessing,
                # use_wandb=not args.no_wandb,
                # wandb_offline=args.wandb_offline,
                # workers=args.workers, 
    )

Breaking `train_model` apart, one can look at customising the training process itself programmatically. Here, we've reduced `train_model` to its component parts with some notes about missing items (e.g. callbacks and wandb integration), to give some insight into how the training workflow is architected.

In [None]:
from icenet.data.dataset import IceNetDataSet
from icenet.model.models import unet_batchnorm
import icenet.model.losses as losses
import icenet.model.metrics as metrics

# train_model sets up wandb and attempts seeding here (see icenet#8 for issues around multi-GPU determinism)
os.environ['PYTHONHASHSEED'] = str(42)
np.random.seed(42)
random.seed(42)
tf.random.set_seed(42)
tf.keras.utils.set_random_seed(42)

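# Defined here so that this cell runs on its own; same values as used with train_model above
dataset_config = "dataset_config.api_dataset.json"
seed = 42

ds = IceNetDataSet(dataset_config, batch_size=4)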

input_shape = (*ds.shape, ds.num_channels)
train_ds, val_ds, test_ds = ds.get_split_datasets()

# train_model handles pickup runs/trained networks
run_name = "custom_run"
network_folder = os.path.join(".", "results", "networks", run_name)

if not os.path.exists(network_folder):
    logging.info("Creating network folder: {}".format(network_folder))
    os.makedirs(network_folder)

network_path = os.path.join(network_folder,
                            "{}.network_{}.{}.h5".format(run_name,
                                                         ds.identifier,
                                                         seed))

callbacks_list = list()
# train_model sets up various callbacks: early stopping, lr scheduler, 
# checkpointing, wandb and tensorboard

with strategy.scope():
    loss = losses.WeightedMSE()
    metrics_list = [
        metrics.WeightedMAE(),
        metrics.WeightedRMSE(),
        losses.WeightedMSE()
    ]

    network = unet_batchnorm(
        input_shape=input_shape,
        loss=loss,
        metrics=metrics_list,
        learning_rate=1e-4,
        filter_size=3,
        n_filters_factor=0.6,
        n_forecast_days=ds.n_forecast_days,
    )

# train_model loads weights
network.summary()

model_history = network.fit(
    train_ds,
    epochs=5,
    verbose=2,
    callbacks=callbacks_list,
    validation_data=val_ds,
    max_queue_size=10,
)

logging.info("Saving network to: {}".format(network_path))
network.save_weights(network_path)


As can be seen the training workflow is very standard for deep learning networks, with `train_model` and CLI wrapping up the training process with a lot of customisation of extraneous functionality. 
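
One of the pieces left out of the reduced training code is the learning rate scheduler. The sketch below shows one plausible shape for it, assuming that `lr_10e_decay_fac` is the factor by which the rate shrinks over ten epochs and that `lr_decay_start` and `lr_decay_end` bound the epochs in which decay applies. That interpretation comes from the parameter names only, and `make_lr_schedule` is a hypothetical helper.

```python
import math


def make_lr_schedule(decay_factor_per_10_epochs, decay_start, decay_end):
    """Build an (epoch, lr) -> lr function that decays only inside a window."""
    # Per-epoch multiplier that compounds to the requested factor over 10 epochs
    per_epoch = math.exp(math.log(decay_factor_per_10_epochs) / 10.0)

    def schedule(epoch, lr):
        if decay_start <= epoch < decay_end:
            return lr * per_epoch
        return lr

    return schedule


schedule = make_lr_schedule(0.5, decay_start=10, decay_end=30)
lr = 1e-4
for epoch in range(40):
    lr = schedule(epoch, lr)
print("Learning rate after 40 epochs: {:.3e}".format(lr))
```

A function with this `(epoch, lr)` signature can be wrapped in `tf.keras.callbacks.LearningRateScheduler` and appended to `callbacks_list` before calling `fit`.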

## Predict

In much the same manner as with `train_model`, the `predict_forecast` method acts as a convenient entry point for workflow system integration and the CLI, as well as an overridable method upon which to base custom implementations. Using the method directly relies on loading from a prepared (but perhaps not cached) dataset.

Some parameters are fed to `predict_forecast` that ideally shouldn't need to be specified (like `seed` and `n_filters_factor`) and might seem contextually odd. They're used to locate the appropriate saved network. *This will be cleaned up in a future version*.  
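
The file naming used when saving weights earlier in this notebook shows why the seed is needed to find a network again. The sketch below rebuilds that path from its parts; `network_weights_path` is a hypothetical helper that simply repeats the pattern from the custom training code, not a function provided by the library.

```python
import os


def network_weights_path(run_name, dataset_identifier, seed, base_dir="."):
    """Build the weights path using the pattern from the custom training code."""
    network_folder = os.path.join(base_dir, "results", "networks", run_name)
    filename = "{}.network_{}.{}.h5".format(run_name, dataset_identifier, seed)
    return os.path.join(network_folder, filename)


print(network_weights_path("custom_run", "api_dataset", 42))
```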

In [None]:
from icenet.model.predict import predict_forecast

# Same as our training set, we'll use the test dates defined when we created this
# dataset
dataset_config = "dataset_config.api_dataset.json"

# Follows the naming convention used by the CLI version
output_dir = os.path.join(".", "results", "predict",
                          "custom_run_forecast",
                          "{}.{}".format("custom_run", "42"))

forecasts, gen_outputs, sample_weights = \
    predict_forecast(dataset_config,
                     "custom_run",
                     n_filters_factor=0.6,
                     seed=42,
                     # Range previously defined as processing_dates["test"]
                     start_dates=[pd.to_datetime(el).date()
                                  for el in pd.date_range("2020-4-1", "2020-4-2")],
                     testset=True)

The persistence and subsequent use of these results is then up to the user, with the three outputs corresponding to what the CLI command normally saves to disk as individual files containing the numpy arrays.
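
If you do want to keep the results, a minimal approach is to write each array to its own `.npy` file. The sketch below is one way to do that under stated assumptions: `save_outputs` is a hypothetical helper, the file names are illustrative rather than those chosen by the CLI, and small placeholder arrays stand in for the real outputs.

```python
import os
import tempfile

import numpy as np


def save_outputs(output_dir, **arrays):
    """Save each named array to <output_dir>/<name>.npy and return the paths."""
    os.makedirs(output_dir, exist_ok=True)
    paths = {}
    for name, array in arrays.items():
        path = os.path.join(output_dir, "{}.npy".format(name))
        np.save(path, np.asarray(array))
        paths[name] = path
    return paths


# Small placeholder arrays stand in for the real outputs of predict_forecast
demo_dir = os.path.join(tempfile.mkdtemp(), "custom_run.42")
paths = save_outputs(demo_dir,
                     forecasts=np.zeros((2, 4, 4, 3)),
                     gen_outputs=np.zeros((2, 4, 4, 3)),
                     sample_weights=np.ones((2, 4, 4, 3)))
print(sorted(paths))
```

In the notebook itself you would pass the `output_dir` defined above together with the three values returned by `predict_forecast`.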

The [internals of the `predict_forecast` method](https://github.com/JimCircadian/icenet2/blob/main/icenet2/model/predict.py#L17) are still undergoing some development, but it should be noted that this method can be easily overridden or called as part of a larger workflow. In particular, within this method it's worth noting the importance of the `testset` parameter. 

Should `testset` be false, the cached data generated in `network_datasets` is never used and the preprocessed data in `processed` is used directly instead. This makes the implementation of `predict_forecast` extremely simple compared with the alternative, which is complicated by some outstanding work to derive dates from the cached batched files.

As before, this is a revised implementation intended to illustrate the "non-testset" use case, so several modifications are in place for notebook execution:

In [None]:
from icenet.data.dataset import IceNetDataSet
from icenet.model.models import unet_batchnorm
import tensorflow as tf

# Usually passed to predict forecast
start_dates = [pd.to_datetime(el).date()
               for el in pd.date_range("2020-4-1", "2020-4-2")]
testset = False
# End predict_forecast args

ds = IceNetDataSet(dataset_config)
dl = ds.get_data_loader()

if not testset:
    logging.info("Generating forecast inputs from processed/ files")

    forecast_inputs, gen_outputs, sample_weights = \
        list(zip(*[dl.generate_sample(date) for date in start_dates]))
else:
    # Use the network_dataset cached data, which is a much messier implementation
    # but worthwhile using if running massive datasets incl.datasets
    # ...
    pass

network_folder = os.path.join(".", "results", "networks", "custom_run")

dataset_name = ds.identifier
network_path = os.path.join(network_folder,
                            "{}.network_{}.{}.h5".format("custom_run",
                                                         "api_dataset",
                                                         42))

logging.info("Loading model from {}...".format(network_path))

network = unet_batchnorm(
    (*ds.shape, dl.num_channels),
    [],
    [],
    n_filters_factor=0.6,
    n_forecast_days=ds.n_forecast_days
)
network.load_weights(network_path)

predictions = []

for i, net_input in enumerate(forecast_inputs):
    logging.info("Running prediction {} - {}".format(i, start_dates[i]))
    pred = network(tf.convert_to_tensor([net_input]), training=False)
    predictions.append(pred)
print("Predictions: {} shape {}".format(len(predictions), 
                                        predictions[0].shape))
print("Generated outputs: {} shape {}".format(len(gen_outputs), 
                                              gen_outputs[0].shape))
print("Sample weights: {} shape {}".format(len(sample_weights), 
                                           sample_weights[0].shape))

## Summary

This notebook has attempted to illustrate the workflow implementations behind the CLI, as well as highlight the flexibility of integrating directly with the library. Ultimately, library usage is the only way to achieve truly novel and flexible usage; the CLI exists for the convenience of running existing pipelines without having to manually implement complex scripts.

To get the benefits of both interfaces, consider the following workflow:

* Get your environment(s) set up, be they research, development or production
* Use the existing CLI implementations to seed the data stores and get baseline networks operational
* Start to customise the existing operations via custom calls to the API, for example by downloading new variables or adding extra analysis to training/prediction runs
* If researching, consider [extending the functionality of the API to include revised or completely new implementations, such as additional data sources](04.library_extension.ipynb)

This last point brings us to the topic of the last of the introductory notebooks. 

## Version
- Codebase: drafted for v0.2.0, as yet untested