In [1]:
# Quick hack to put us in the icenet-pipeline folder, assuming it was created as per 01.cli_demonstration.ipynb
import os
if os.path.exists("05.library_extension.ipynb"):
    os.chdir("../notebook-pipeline")
print("Running in {}".format(os.getcwd()))

%matplotlib inline

Running in /data/hpcdata/users/jambyr/icenet/notebook-pipeline


# IceNet Library Extension

## Context

### Purpose
The IceNet library provides the ability to download, process, train and predict from end to end via a set of API/CLI tools.

In this notebook we demonstrate how to extend the functionality of the API provided by the library.

### Highlights
The key features of this notebook are:

* [Reviewing the integration of MARS HRES data as an additional data source.](#Data:-Reviewing-MARS-HRES)
* [Demonstrating the incorporation of a new datasource through notebook based development.](#Data:-Extending-with-another-implementation)
* [Considerations when having extended the library.](#Considerations-when-extending)

### Contributions
#### Notebook
James Byrne (author)

__Please raise issues [in this repository](https://github.com/antarctica/IceNet-Pipeline) to suggest updates to this notebook!__ 

Contact me at _jambyr \<at\> bas.ac.uk_ for anything else...

#### Modelling codebase
James Byrne (code author), Tom Andersson (science author)

#### Modelling publications
Andersson, T.R., Hosking, J.S., Pérez-Ortiz, M. et al. Seasonal Arctic sea ice forecasting with probabilistic deep learning. Nat Commun 12, 5124 (2021). https://doi.org/10.1038/s41467-021-25257-4

#### Involved organisations
The Alan Turing Institute and British Antarctic Survey

## Data: Reviewing MARS HRES

The first extension to the IceNet codebase, after it was refactored from the original research code, was the integration of ECMWF HRES atmospheric data. This data is [not readily available without agreeing to the license conditions](https://www.ecmwf.int/en/forecasts/datasets/catalogue-ecmwf-real-time-products), but it still serves as a worked example of how to implement a new data stream.

As mentioned in the [previous notebook](03.library_usage.ipynb), there are three types of data producers defined in the IceNet class hierarchy:

* `Downloader` type producers, from which all `icenet.data.interfaces` implementations derive their functionality, as does `SICDownloader` in `icenet.data.sic.osisaf`, our reference dataset for sea-ice concentration. An additional class, `ClimateDownloader`, currently implements extra atmosphere-related functionality for the implementations in `icenet.data.interfaces`.
* `Generator` type producers, which at present differ from `Downloader` in semantics only. `IceNetDataLoader` and `Masks` currently use this as a base class.
* `Processor` type producers. At present all concrete implementations descend from its child `IceNetPreProcessor`, which defines common handling for variable naming, configuration generation, normalisation and linear trend forecasting. `Processor` itself also differs from `Downloader` and `Generator` in that it implements source data store pickup via `init_source_data`.

These distinctions will be useful in future, as refactoring is still ongoing to consolidate functionality that is currently duplicated. The overall class diagram looks as follows.

<img src="classhierarchy.png" alt="IceNet high level class diagram" />
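For readers without the diagram to hand, the relationships described in the list above can be sketched with stand-in classes. This is only an illustration of the inheritance structure: the class names come from the text, but the bodies (including what `init_source_data` returns) are placeholders, not the library's code.

```python
# Stand-in classes only: these mirror the names used in the text, not the
# real icenet implementations, which carry far more functionality.
class DataProducer:
    """Common base: every producer is keyed by an identifier."""

    def __init__(self, identifier):
        self.identifier = identifier


class Downloader(DataProducer):
    """Fetches source data from an external system."""


class ClimateDownloader(Downloader):
    """Adds atmosphere-related handling on top of Downloader."""


class Generator(DataProducer):
    """Currently differs from Downloader in semantics only."""


class Processor(DataProducer):
    """Picks up an existing source data store."""

    def init_source_data(self):
        # Placeholder body: the real method locates source files on disk
        return "source data for {}".format(self.identifier)


class IceNetPreProcessor(Processor):
    """Shared naming, normalisation and linear trend handling."""


# Walk the inheritance chain of a preprocessor, dropping `object`
hierarchy = [cls.__name__ for cls in IceNetPreProcessor.__mro__[:-1]]
print(hierarchy)
```

Whichever branch a new class lives on, it ends up at the same base, which is why the file handling stays consistent across producers.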

The important thing to note is that all data producers derive from `DataProducer`, which implements the key file handling routines that keep the data flow (*FIXME: at time of writing, relatively*) consistent. When implementing new data processing for use in an end to end pipeline, all that is needed is:

1. A concrete implementation of `Downloader` or `ClimateDownloader` that takes care of interfacing with the external system;
1. A concrete implementation of `Processor` via `IceNetPreProcessor`, that defines any specific

### Data Downloaders

The first step in defining the new downloader is to inherit from `ClimateDownloader` and, in the case of MARS HRES data, define our request template and parameter map. The map is not strongly defined, but it allows us to link the MARS parameters to the equivalent channels used for training, as this data is at present used primarily for predictions.

The most important thing to note is that data downloaded and used for training creates one or more channels in the input dataset. As such, if you train with, say, a *geopotential height at 500hPa* variable, you need to ensure that any other composite data source provides a channel whose name can be derived in the same way. In this case the channel was called **zg500** in the ERA5 training data, so we supply it via **zg** (with 500 appended when the class is initialised for this variable with a `[250, 500]` entry in the `pressure_levels` argument of `ClimateDownloader`)...
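As a minimal sketch of that naming rule, the hypothetical helper below (`channel_names` is not part of the library) pairs each variable with its pressure levels in the same way the downloader code further down does, by appending the level to the variable name.

```python
def channel_names(var_names, pressure_levels):
    """Return names such as 'zg500' for each variable/level pair.

    Hypothetical helper for illustration: variables with no pressure
    levels keep their plain name, the rest get one name per level.
    """
    names = []
    for var_name, levels in zip(var_names, pressure_levels):
        if not levels:
            names.append(var_name)
        else:
            names.extend("{}{}".format(var_name, lvl) for lvl in levels)
    return names


print(channel_names(["tas", "zg"], [None, [250, 500]]))
# ['tas', 'zg250', 'zg500']
```

Any new data source needs to produce names that line up with those used in training, otherwise the channels cannot be matched.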

```python
class HRESDownloader(ClimateDownloader):
    PARAM_TABLE = 128
    HRES_PARAMS = {
        "siconca":      (31, "siconc"), # sea_ice_area_fraction
        "tos":          (34, "sst"),    # sea surface temperature
        "zg":           (129, "z"),     # geopotential
        "ta":           (130, "t"),     # air_temperature (t)
        "hus":          (133, "q"),     # specific_humidity
        "psl":          (134, "sp"),    # surface_pressure
        "uas":          (165, "u10"),   # 10m_u_component_of_wind
        "vas":          (166, "v10"),   # 10m_v_component_of_wind
        "tas":          (167, "t2m"),   # 2m_temperature (t2m)
        "rlds":         (175, "strd"),  # surface_thermal_radiation_downwards
        "rsds":         (169, "ssrd"),  # surface_solar_radiation_downwards
    }

    MARS_TEMPLATE = """
retrieve,
  class=od,
  date={date},
  expver=1,
  levtype={levtype},
  {levlist}param={params},
  step=0,
  stream=oper,
  time=00:00:00,
  type=fc,
  area={area},
  grid=0.25/0.25,
  target="{target}",
  format=netcdf
    """
```

Next, we define our constructor. The important part here is supplying an identifier, which is used to locate the source data under the `/data/` directory. Note that the constructor is also an opportunity to start up the API client for this particular service.

The `HRESDownloader` is interesting in itself because, down the line, we instantiate **more than a single processor** for it. A single source data store under `/data/mars.hres/` seeds multiple preprocessing steps: it provides both `siconca` (our sea ice concentration channels) and atmospheric channels, which respectively do and do not require linear trends to be generated, and therefore need different configurations.

```python
    def __init__(self,
                 *args,
                 identifier="mars.hres",
                 **kwargs):
        super().__init__(*args,
                         identifier=identifier,
                         **kwargs)

        self._server = ecmwfapi.ECMWFService("mars")
```

**The first of two abstract methods requiring implementation is `_get_dates_for_request`**, which is defined in `ClimateDownloader` at time of writing. This returns a list of batches of dates, sized according to best practice for interacting with the external system, so that `_single_download` can be called with each batch as the argument `req_dates`.

```python
    def _get_dates_for_request(self):
        return batch_requested_dates(self._dates, attribute="month")
```
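The implementation of `batch_requested_dates` is not shown in this notebook, so the following is only a guess at the idea behind monthly batching, written with the standard library. The function name `batch_by_month` is hypothetical and the real helper may behave differently.

```python
import datetime as dt
from itertools import groupby


def batch_by_month(dates):
    """Group dates into lists that each cover a single calendar month.

    Illustrative stand-in only, not the library's batch_requested_dates.
    """
    ordered = sorted(dates)
    return [list(group)
            for _, group in groupby(ordered, key=lambda d: (d.year, d.month))]


batches = batch_by_month([
    dt.date(2022, 1, 30),
    dt.date(2022, 2, 1),
    dt.date(2022, 1, 31),
])
print(batches)
```

Each inner list can then be handed to `_single_download` as `req_dates`, which is consistent with the assertions at the top of that method that every date in a batch shares one year and month.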

**The second of the two abstract methods from `ClimateDownloader` requiring implementation is `_single_download`**. This takes each batch, along with the requested variable names and associated pressure levels, then prepares and issues the request. It is responsible for downloading a file for the batch *and then splitting it into daily files in the source directory* ([see the data_and_forecasts notebook if that doesn't make sense to you!](02.data_and_forecasts.ipynb)).

[Please note this is subject to a future refactor as the data interfaces implementation has a bit of technical debt still to address to move common functionality back up the hierarchy](https://github.com/JimCircadian/icenet2/issues/9).

Note in particular the importance of populating `self._files_downloaded` with newly downloaded files. Maintaining this list thus feeds the `regrid` and `rotate_wind_data` methods that are inherited from `ClimateDownloader` for use with each instantiation, if required.

```python
    def _single_download(self, var_names, pressures, req_dates):
        levtype = "plev" if pressures else "sfc"

        for dt in req_dates:
            assert dt.year == req_dates[0].year
            assert dt.month == req_dates[0].month

        request_month = req_dates[0].strftime("%Y%m")
        request_target = "{}.{}.{}.nc".format(
            self.hemisphere_str[0], levtype, request_month)

        download_dates = []

        for var_name, pressure in product(var_names, pressures.split('/')
                                          if pressures else [None]):
            var = var_name if not pressure else \
                "{}{}".format(var_name, pressure)
            var_folder = os.path.join(self.get_data_var_folder(var),
                                      str(req_dates[0].year))

            for destination_date in req_dates:
                daily_path, regridded_name = get_daily_filenames(
                    var_folder, var, destination_date.strftime("%Y_%m_%d"))

                if not os.path.exists(daily_path) \
                        and not os.path.exists(regridded_name):
                    if destination_date not in download_dates:
                        download_dates.append(destination_date)
                elif not os.path.exists(regridded_name):
                    self._files_downloaded.append(daily_path)

        download_dates = sorted(list(set(download_dates)))

        if not len(download_dates):
            logging.info("We have all the files we need from MARS API")
            return

        request = HRESDownloader.MARS_TEMPLATE.format(
            area="/".join([str(s) for s in self.hemisphere_loc]),
            date="/".join([el.strftime("%Y%m%d") for el in download_dates]),
            levtype=levtype,
            levlist="levelist={},\n  ".format(pressures) if pressures else "",
            params="/".join(
                ["{}.{}".format(
                    HRESDownloader.HRES_PARAMS[v][0],
                    HRESDownloader.PARAM_TABLE)
                 for v in var_names]),
            target=request_target,
        )

        logging.debug("MARS REQUEST: \n{}\n".format(request))

        if not os.path.exists(request_target):
            self._server.execute(request, request_target)

        ds = xr.open_dataset(request_target)

        for day in ds.time.values:
            date_str = pd.to_datetime(day).strftime("%Y_%m_%d")

            for var_name, pressure in product(var_names, pressures.split('/')
                                              if pressures else [None]):
                var = var_name if not pressure else \
                    "{}{}".format(var_name, pressure)
                var_folder = os.path.join(self.get_data_var_folder(var),
                                          str(pd.to_datetime(day).year))

                # For the year component - 365 * 50 is a lot of files ;)
                os.makedirs(var_folder, exist_ok=True)

                daily_path, _ = get_daily_filenames(var_folder, var, date_str)

                da = getattr(ds,
                             HRESDownloader.HRES_PARAMS[var_name][1])

                if pressure:
                    da = da.sel(level=int(pressure))

                # Just to make sure
                da_daily = da.sel(time=slice(
                    pd.to_datetime(day), pd.to_datetime(day)))

                logging.info("Saving new daily file: {}".format(daily_path))
                da_daily.to_netcdf(daily_path)

                if daily_path not in self._files_downloaded:
                    self._files_downloaded.append(daily_path)

        logging.info("Removing {}".format(request_target))
        ds.close()
        os.unlink(request_target)
```
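To see what the request-building step in the method above produces without contacting MARS, the sketch below repeats the template and formatting logic in a standalone function. The function `build_request`, the trimmed parameter map and the example `area` values are assumptions made for illustration; in the class the area comes from `self.hemisphere_loc`.

```python
import datetime as dt

PARAM_TABLE = 128
# Trimmed copy of HRES_PARAMS, enough for the example
HRES_PARAMS = {"tas": (167, "t2m"), "zg": (129, "z"), "ta": (130, "t")}

MARS_TEMPLATE = """
retrieve,
  class=od,
  date={date},
  expver=1,
  levtype={levtype},
  {levlist}param={params},
  step=0,
  stream=oper,
  time=00:00:00,
  type=fc,
  area={area},
  grid=0.25/0.25,
  target="{target}",
  format=netcdf
"""


def build_request(var_names, pressures, dates, area, target):
    """Format the MARS request text; nothing is sent anywhere."""
    return MARS_TEMPLATE.format(
        area="/".join(str(s) for s in area),
        date="/".join(d.strftime("%Y%m%d") for d in dates),
        levtype="plev" if pressures else "sfc",
        # The levelist line is only emitted for pressure level requests
        levlist="levelist={},\n  ".format(pressures) if pressures else "",
        params="/".join("{}.{}".format(HRES_PARAMS[v][0], PARAM_TABLE)
                        for v in var_names),
        target=target,
    )


days = [dt.date(2022, 1, 1), dt.date(2022, 1, 2)]
# The area values here are made up for the example
plev_request = build_request(["zg", "ta"], "250/500", days,
                             [90, -180, 0, 180], "n.plev.202201.nc")
sfc_request = build_request(["tas"], None, days,
                            [90, -180, 0, 180], "n.sfc.202201.nc")
print(plev_request)
```

Printing the request like this is a cheap way to check a new parameter map before spending time in the MARS queue.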

***Finally, because the MARS API benefits from single level and pressure level variables being grouped together***, we define a **custom** implementation of the `ClimateDownloader.download` method. The original issues a `_single_download` request for each combination of variable, level and `req_dates` batch, which is not favoured by the ECMWF API.

```python
    def download(self):
        logging.info("Building request(s), downloading and daily averaging "
                     "from {} API".format(self.identifier.upper()))

        sfc_vars = [var for idx, var in enumerate(self.var_names)
                    if not self.pressure_levels[idx]]
        plev_vars = [var for idx, var in enumerate(self.var_names)
                     if self.pressure_levels[idx]]
        pressures = "/".join([str(s) for s in sorted(set(
            [p for ps in self.pressure_levels if ps for p in ps]))])

        dates_per_request = self._get_dates_for_request()

        for req_batch in dates_per_request:
            self._single_download(sfc_vars, None, req_batch)
            self._single_download(plev_vars, pressures, req_batch)

        logging.info("{} daily files downloaded".
                     format(len(self._files_downloaded)))
```
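The grouping logic at the top of `download` can be tried in isolation. The sketch below lifts it into a hypothetical standalone function, `group_variables`, using plain lists in place of `self.var_names` and `self.pressure_levels`.

```python
def group_variables(var_names, pressure_levels):
    """Split variables into surface and pressure level groups.

    Returns (sfc_vars, plev_vars, pressures), where pressures is the
    '/'-joined, sorted union of every requested level.
    """
    pairs = list(zip(var_names, pressure_levels))
    sfc_vars = [var for var, levels in pairs if not levels]
    plev_vars = [var for var, levels in pairs if levels]
    all_levels = {p for _, levels in pairs if levels for p in levels}
    pressures = "/".join(str(p) for p in sorted(all_levels))
    return sfc_vars, plev_vars, pressures


groups = group_variables(["tas", "zg", "ta"], [None, [250, 500], [500]])
print(groups)
# (['tas'], ['zg', 'ta'], '250/500')
```

Note that the pressure levels are pooled: in this example `ta` only asked for 500, but the grouped request covers both 250 and 500 for every pressure level variable. That is the trade-off for issuing fewer, larger requests.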

### Data Processors


```python
class IceNetHRESPreProcessor(IceNetPreProcessor):
    def __init__(self, *args, **kwargs):
        super().__init__(*args,
                         file_filters=["latlon_"],
                         identifier="mars.hres",
                         **kwargs)
```

## Data: Extending with another implementation

TODO

## Other extensions

TODO

## Considerations when extending

### Open a PR!

The IceNet library is (or, at the time of writing, will be) open source so that people can contribute back to it. Therefore, if you've implemented a useful downloader, processor or other item of functionality, **the community will definitely benefit from it!** Please do contribute via a pull request, even if it's a quick and dirty implementation.

### Documentation

If possible, preparing even a small amount of documentation will go a long way, especially if it points people at external sites describing data or illustrates the reasoning/usage of the new functionality.

## Summary

TODO

## Version
- Codebase: drafting for v0.2.0