# How-to Finetune

This tutorial shows how to adapt a pretrained model to a different, possibly much smaller dataset, a concept called finetuning. Finetuning is a well-established technique in machine learning. Generally speaking, the idea is to first use a (very) large and diverse dataset to learn a general understanding of the underlying problem and then, in a second step, to adapt this general model to the target data. Especially if the available target data is limited, pretraining plus finetuning usually yields (much) better results than training on the target data alone.

The connection to hydrology is the following: Often, researchers or operators are only interested in a single basin. However, a Deep Learning (DL) model has to learn all (physical) process understanding from the available training data, so the data record of a single basin might not be enough (see e.g. the presentation linked at [this](https://meetingorganizer.copernicus.org/EGU2020/EGU2020-8855.html) EGU'20 abstract).

This is where we apply the concept of pretraining and finetuning: First, we train a DL model (e.g. an LSTM) on a large and diverse multi-basin dataset (e.g. CAMELS) and then finetune this model to our basin of interest. Everything you need is available in the `neuralHydrology` package, and in this notebook we will give you an overview of how to actually do it.

**Note**: Finetuning can be a tedious task and is usually very sensitive to the learning rate as well as the number of epochs used for finetuning. One reason is that pretrained models are usually quite large. In fact, most often they are much larger than what would be possible to train for just a single basin. So during finetuning, we have to make sure that this large capacity does not negatively impact our model results. Common approaches are to a) only allow parts of the model to be adapted during finetuning and/or b) train with a much lower learning rate. So far, no publication presents a universally working approach for finetuning in hydrology. Be aware that results may vary and that you might need to invest some time before finding a good strategy. However, in our experience it was always possible to get better results _with_ finetuning than without.

**To summarize**: If you are interested in getting the best-performing Deep Learning model for a single basin, pretraining on a large and diverse dataset, followed by finetuning the pretrained model on your target basin is the way to go.

In [13]:
# Imports
from pathlib import Path

import numpy as np
import pandas as pd

from neuralhydrology.nh_run import start_run, eval_run, finetune

## Pretraining

In the first step, we need to pretrain our model on a large and possibly diverse dataset. Our target basin does not necessarily have to be part of this dataset, but it is usually better to include it.

For the sake of the demonstration, we will train an LSTM on the CAMELS US dataset and then finetune this model to a random basin. Note that it is possible to use different inputs for pretraining and finetuning if additional embedding layers (before the LSTM) are used, which we will ignore for now. Furthermore, we will concentrate on demonstrating the "how-to" rather than striving for the best possible performance. To save time and energy, we will only pretrain the model for a small number of epochs. When striving for the best possible performance, you should make sure that you pretrain the model as well as possible before starting to finetune.

We will stick closely to the model and experimental setup from [Kratzert et al. (2019)](https://hess.copernicus.org/articles/23/5089/2019/hess-23-5089-2019.html). To summarize:
- A single LSTM layer with a hidden size of 128.
- Input sequences are 365 days and the prediction is made at the last timestep.
- For the sake of this demonstration, we will only consider the 5 meteorological variables from the [extended Maurer](https://doi.org/10.4211/hs.17c896843cf940339c3c3496d0c1c077) forcing data. Either download these forcings and place the `maurer_extended` folder into the `basin_mean_forcing` folder within the CAMELS US root directory or change the forcing product and dynamic inputs in the config file.
- We will use the same CAMELS attributes, as in the publication mentioned above, as additional inputs at every time step so that the model can learn different hydrological behaviors depending on the catchment properties.
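
To make the second bullet concrete, here is a minimal sketch of how a time series can be cut into training samples in which a window of inputs is paired with the target at the window's last time step. This is only an illustration and not the data pipeline of the library: `make_samples` is a hypothetical helper, and the toy series uses a single input feature and a window of 4 steps instead of 365 days with several meteorological variables and static attributes.

```python
from typing import List, Tuple


def make_samples(inputs: List[float], targets: List[float],
                 seq_length: int = 365) -> List[Tuple[List[float], float]]:
    """Pair each window of `seq_length` input steps with the target at the window's last step."""
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    samples = []
    for end in range(seq_length, len(inputs) + 1):
        window = inputs[end - seq_length:end]       # the `seq_length` most recent steps
        samples.append((window, targets[end - 1]))  # target of the last step in the window
    return samples


# Toy example: 10 "days" of data and a window of 4 days
toy_inputs = [float(day) for day in range(10)]
toy_targets = [10.0 * day for day in range(10)]
samples = make_samples(toy_inputs, toy_targets, seq_length=4)
print(len(samples))  # 7
print(samples[0])    # ([0.0, 1.0, 2.0, 3.0], 30.0)
```

Note that the first `seq_length - 1` time steps never appear as prediction targets, because no complete input window is available for them.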

For more details, take a look at the config print-out below.

In [2]:
config_file = Path("531_basins.yml")
start_run(config_file=config_file)

2021-09-28 14:11:48,584: Logging to /home/frederik/Projects/neuralhydrology/examples/06-Finetuning/runs/cudalstm_maurer_531_basins_2809_141148/output.log initialized.
2021-09-28 14:11:48,586: ### Folder structure created at /home/frederik/Projects/neuralhydrology/examples/06-Finetuning/runs/cudalstm_maurer_531_basins_2809_141148
2021-09-28 14:11:48,587: ### Run configurations for cudalstm_maurer_531_basins
2021-09-28 14:11:48,588: experiment_name: cudalstm_maurer_531_basins
2021-09-28 14:11:48,590: run_dir: /home/frederik/Projects/neuralhydrology/examples/06-Finetuning/runs/cudalstm_maurer_531_basins_2809_141148
2021-09-28 14:11:48,592: train_basin_file: 531_basin_list.txt
2021-09-28 14:11:48,594: validation_basin_file: 531_basin_list.txt
2021-09-28 14:11:48,595: test_basin_file: 531_basin_list.txt
2021-09-28 14:11:48,597: train_start_date: 1999-10-01 00:00:00
2021-09-28 14:11:48,599: train_end_date: 2008-09-30 00:00:00
2021-09-28 14:11:48,600: validation_start_date: 1980-10-01 00:00:0

We end up with an okay-ish model that should be enough for the purpose of this demonstration. Remember that we only trained for a limited number of epochs here.

Next, let's look in the `runs/` folder, where the folder of this model is stored, to look up its exact name.

In [3]:
!ls runs/

cudalstm_maurer_531_basins_2809_141148


Next, we'll load the validation results into memory so we can select a basin to demonstrate how to finetune based on the model performance. Here, we will select a random basin from the lower 50% of the NSE distribution, i.e. a basin where the NSE is below the median NSE. Usually, you'll see better performance gains for basins with lower model performance than for those where the base model is already really good.

In [4]:
# Load validation results from the last epoch
run_dir = Path("runs/cudalstm_maurer_531_basins_2809_141148/")
df = pd.read_csv(run_dir / "validation" / "model_epoch003" / "metrics_freq1D.csv", dtype={'basin': str})
df = df.set_index('basin')

# Compute the median NSE from all basins, where discharge observations are available for that period
print(f"Median NSE of the validation period {df['NSE'].median():.3f}")

# Select a random basin from the lower 50% of the NSE distribution
basin = df.loc[df["NSE"] < df["NSE"].median()].sample(n=1).index[0]

print(f"Selected basin: {basin} with an NSE of {df.loc[df.index == basin, 'NSE'].values[0]:.3f}")

Median NSE of the validation period 0.701
Selected basin: 02055100 with an NSE of 0.296
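
For reference, the NSE (Nash-Sutcliffe efficiency) reported above compares the squared simulation errors to the variance of the observations: a value of 1 is a perfect fit, and a value of 0 means the simulation is no better than always predicting the mean observed discharge. The following is a minimal sketch of that formula, not the metric implementation of the library; the helper `nse` is written for illustration only and assumes complete records without missing observations.

```python
from typing import Sequence


def nse(obs: Sequence[float], sim: Sequence[float]) -> float:
    """Nash-Sutcliffe efficiency: 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    mean_obs = sum(obs) / len(obs)
    numerator = sum((o - s) ** 2 for o, s in zip(obs, sim))
    denominator = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - numerator / denominator


observed = [1.0, 2.0, 3.0, 4.0]
print(nse(observed, observed))              # perfect simulation: 1.0
print(nse(observed, [2.5, 2.5, 2.5, 2.5]))  # always predicting the observed mean: 0.0
```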


## Finetuning

Next, we will show how to perform finetuning for the basin selected above, based on the model we just trained. The function to use is `finetune` from `neuralhydrology.nh_run` if you want to train from within a script or notebook. If you want to start finetuning from the command line, you can also call the `nh-run` utility with the `finetune` argument, instead of e.g. `train` or `evaluate`.

The only thing required, similar to the model training itself, is a config file. This config, however, has slightly different requirements than a normal model config and works slightly differently:
- The config has to contain the following two arguments:
    - `base_run_dir`: The path to the directory of the pre-trained model.
    - `finetune_modules`: Which parts of the pre-trained model you want to finetune. Check the documentation of each model class for a list of all possible parts. Often only some parts, e.g. the output layer, are trained during finetuning and the rest is kept fixed. There is no general rule of thumb, and most likely you will have to try both partial and full finetuning.
- Any additional argument contained in this config will overwrite the corresponding argument of the pre-trained model. Everything _not_ specified will be taken from the pre-trained model. That is, you can e.g. specify a new basin file in the finetuning config (via `train_basin_file`) to finetune the pre-trained model on a different set of basins, or even just a single basin as we will do in this notebook. You can also change the learning rate, loss function, evaluation metrics and so on. The only things you cannot change are arguments that change the model architecture (e.g. `model`, `hidden_size` etc.), because this leads to errors when you try to load the pre-trained weights into the initialized model.
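
The override rule described in the last bullet can be pictured as a dictionary merge. The sketch below is purely illustrative and is not how the library implements it: `merge_configs` and `ARCHITECTURE_KEYS` are hypothetical names, the example values are made up for the demonstration, and the early `ValueError` only stands in for the error you would otherwise hit when loading the pre-trained weights.

```python
from typing import Any, Dict

# Hypothetical set of architecture-defining keys; the real list depends on the model class.
ARCHITECTURE_KEYS = {"model", "hidden_size"}


def merge_configs(base: Dict[str, Any], finetune: Dict[str, Any]) -> Dict[str, Any]:
    """Return the base config with every entry of the finetune config laid on top."""
    changed = {key for key in ARCHITECTURE_KEYS
               if key in base and key in finetune and base[key] != finetune[key]}
    if changed:
        raise ValueError(f"Cannot change architecture arguments: {sorted(changed)}")
    merged = dict(base)      # everything not specified comes from the pre-trained run
    merged.update(finetune)  # everything specified overwrites the pre-trained value
    return merged


# Illustrative values only
base_cfg = {"model": "cudalstm", "hidden_size": 128, "epochs": 3,
            "train_basin_file": "531_basin_list.txt"}
finetune_cfg = {"epochs": 10, "train_basin_file": "finetune_basin.txt",
                "finetune_modules": ["head", "lstm"]}

merged = merge_configs(base_cfg, finetune_cfg)
print(merged["epochs"], merged["hidden_size"], merged["train_basin_file"])
```

In this toy example, `epochs` and `train_basin_file` come from the finetuning config, while `hidden_size` is inherited from the base config.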

Let's have a look at the `finetune.yml` config that we prepared for this tutorial.

In [5]:
!cat finetune.yml

# --- Experiment configurations --------------------------------------------------------------------

# experiment name, used as folder name
experiment_name: cudalstm_maurer_531_basins_finetuned

# files to specify training, validation and test basins (relative to code root or absolute path)
train_basin_file: finetune_basin.txt
validation_basin_file: finetune_basin.txt
test_basin_file: finetune_basin.txt

# --- Training configuration -----------------------------------------------------------------------

# specify learning rates to use starting at specific epochs (0 is the initial learning rate)
learning_rate:
    0: 5e-4
    2: 5e-5

# Number of training epochs
epochs: 10

finetune_modules:
- head
- lstm
base_run_dir: /home/frederik/Projects/neuralhydrology/examples/06-Finetuning/runs/cudalstm_maurer_531_basins_2809_133043

Of the two required arguments, `base_run_dir` does not yet point at the model we just trained (the entry in the print-out above refers to a different run directory, `cudalstm_maurer_531_basins_2809_133043`). We will add the argument from here and point it at the directory of the model we just trained. Furthermore, the config points to a new file for training, validation and testing, called `finetune_basin.txt`, which does not exist yet. We will create this file and add the basin we selected above as the only basin we want to use here. The remaining entries change the learning rate and the number of training epochs and set a new experiment name. Also note that we train the full model here by listing all model parts available for the `CudaLSTM` under `finetune_modules`.
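
The `learning_rate` entry in the config maps a starting epoch to the learning rate used from that epoch on. As a minimal sketch of how such a schedule can be read (an illustration of the config comment, not the code of the library), the hypothetical helper `learning_rate_for_epoch` below returns the rate of the most recent entry. It assumes that epochs are counted from 1 up to `epochs`.

```python
from typing import Dict


def learning_rate_for_epoch(schedule: Dict[int, float], epoch: int) -> float:
    """Return the rate of the latest schedule entry whose start epoch is <= `epoch`."""
    start_epochs = [start for start in schedule if start <= epoch]
    if not start_epochs:
        raise ValueError(f"No learning rate defined at or before epoch {epoch}")
    return schedule[max(start_epochs)]


schedule = {0: 5e-4, 2: 5e-5}  # values from finetune.yml
for epoch in range(1, 11):     # assumption: epochs run from 1 to `epochs`
    print(epoch, learning_rate_for_epoch(schedule, epoch))
```

Under this reading, only the first epoch uses the initial rate of 5e-4, and the remaining nine epochs use 5e-5.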

In [6]:
# Add the path to the pre-trained model to the finetune config.
# The trailing newline keeps later appends from ending up on the same line.
with open("finetune.yml", "a") as fp:
    fp.write(f"base_run_dir: {run_dir.absolute()}\n")
    
# Create a basin file with the basin we selected above
with open("finetune_basin.txt", "w") as fp:
    fp.write(basin)
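
Because the cell above opens the config in append mode, every execution adds another `base_run_dir` line to `finetune.yml`. If you rerun the notebook often, a small helper that replaces an existing entry instead of appending a new one can keep the file clean. The sketch below is one possible approach and not part of the library: `set_top_level_key` is a hypothetical helper that only handles simple, un-indented `key: value` lines, and the paths in the demo are made up.

```python
import tempfile
from pathlib import Path


def set_top_level_key(config_path: Path, key: str, value: str) -> None:
    """Replace the line `key: ...` if it exists, otherwise append it."""
    lines = config_path.read_text().splitlines() if config_path.exists() else []
    new_line = f"{key}: {value}"
    for i, line in enumerate(lines):
        if line.startswith(f"{key}:"):
            lines[i] = new_line  # overwrite the stale entry
            break
    else:
        lines.append(new_line)   # key not present yet
    config_path.write_text("\n".join(lines) + "\n")


# Demo on a temporary file so that the real finetune.yml is not touched
with tempfile.TemporaryDirectory() as tmp:
    demo_cfg = Path(tmp) / "finetune_demo.yml"
    demo_cfg.write_text("epochs: 10\nbase_run_dir: runs/old_run\n")
    set_top_level_key(demo_cfg, "base_run_dir", "runs/new_run")
    set_top_level_key(demo_cfg, "base_run_dir", "runs/new_run")  # second call changes nothing
    result = demo_cfg.read_text()

print(result)
```

In the notebook, the equivalent call would be `set_top_level_key(Path("finetune.yml"), "base_run_dir", str(run_dir.absolute()))`.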

With that, we are ready to start the finetuning. As mentioned above, we have two options to start finetuning:
1. Call the `finetune()` function from a different Python script or a Jupyter Notebook with the path to the config.
2. Start the finetuning from the command line by calling

```bash
nh-run finetune --config-file /path/to/config.yml
```

Here, we will use the first option.

In [8]:
finetune(Path("finetune.yml"))

2021-09-28 15:13:36,650: Logging to /home/frederik/Projects/neuralhydrology/examples/06-Finetuning/runs/cudalstm_maurer_531_basins_finetuned_2809_151336/output.log initialized.
2021-09-28 15:13:36,651: ### Folder structure created at /home/frederik/Projects/neuralhydrology/examples/06-Finetuning/runs/cudalstm_maurer_531_basins_finetuned_2809_151336
2021-09-28 15:13:36,652: ### Start finetuning with pretrained model stored in /home/frederik/Projects/neuralhydrology/examples/06-Finetuning/runs/cudalstm_maurer_531_basins_2809_141148
2021-09-28 15:13:36,652: ### Run configurations for cudalstm_maurer_531_basins_finetuned
2021-09-28 15:13:36,653: batch_size: 256
2021-09-28 15:13:36,654: clip_gradient_norm: 1
2021-09-28 15:13:36,655: commit_hash: 7c75e42
2021-09-28 15:13:36,655: data_dir: /data/Hydrology/CAMELS_US
2021-09-28 15:13:36,656: dataset: camels_us
2021-09-28 15:13:36,657: device: cuda:0
2021-09-28 15:13:36,658: dynamic_inputs: ['prcp(mm/day)', 'srad(W/m2)', 'tmax(C)', 'tmin(C)', 'v

Looking at the validation result, we can see an increase of roughly 0.05 NSE.

Last but not least, we will compare the pre-trained and the finetuned model on the test period. For this, we will make use of the `eval_run` function from `neuralhydrology.nh_run`. Alternatively, you could evaluate both runs from the command line by calling

```bash
nh-run evaluate --run-dir /path/to/run_directory/
```

In [9]:
eval_run(run_dir, period="test")

2021-09-28 15:13:56,872: Using the model weights from runs/cudalstm_maurer_531_basins_2809_141148/model_epoch003.pt
# Evaluation: 100%|██████████| 531/531 [03:05<00:00,  2.87it/s]
2021-09-28 15:17:02,286: Stored results at runs/cudalstm_maurer_531_basins_2809_141148/test/model_epoch003/test_results.p


Next, we check the full name of the finetuning run (which we could also extract from the log output above).

In [10]:
!ls runs/

cudalstm_maurer_531_basins_2809_141148
cudalstm_maurer_531_basins_finetuned_2809_151336


Now we can call the `eval_run()` function as above, but pointing to the directory of the finetuned run. By default, this function evaluates the last checkpoint, which can be changed with the `epoch` argument. Here however, we use the default.

In [11]:
finetune_dir = Path("runs/cudalstm_maurer_531_basins_finetuned_2809_151336")
eval_run(finetune_dir, period="test")

2021-09-28 15:17:11,469: Using the model weights from runs/cudalstm_maurer_531_basins_finetuned_2809_151336/model_epoch010.pt
# Evaluation: 100%|██████████| 1/1 [00:00<00:00,  1.91it/s]
2021-09-28 15:17:12,007: Stored results at runs/cudalstm_maurer_531_basins_finetuned_2809_151336/test/model_epoch010/test_results.p


Now let's look at the test period results of the pre-trained base model and the finetuned model for the basin that we chose above.

In [15]:
# load test results of the base run
df_pretrained = pd.read_csv(run_dir / "test/model_epoch003/metrics_freq1D.csv", dtype={'basin': str})
df_pretrained = df_pretrained.set_index("basin")
    
# load test results of the finetuned model
df_finetuned = pd.read_csv(finetune_dir / "test/model_epoch010/metrics_freq1D.csv", dtype={'basin': str})
df_finetuned = df_finetuned.set_index("basin")
    
# extract basin performance
base_model_nse = df_pretrained.loc[df_pretrained.index == basin, "NSE"].values[0]
finetune_nse = df_finetuned.loc[df_finetuned.index == basin, "NSE"].values[0]
print(f"Basin {basin} base model performance: {base_model_nse:.3f}")
print(f"Performance after finetuning: {finetune_nse:.3f}")

Basin 02055100 base model performance: 0.568
Performance after finetuning: 0.646


So we see a similar, slightly higher performance increase in the test period: the NSE rises from 0.568 to 0.646, i.e. by roughly 0.08, which is great. However, note that a) our base model was not optimally trained (we stopped quite early) and b) the finetuning settings were chosen rather arbitrarily. From our experience so far, you can almost always get performance increases for individual basins with finetuning, but it is difficult to find settings that are universally applicable. This tutorial was just a showcase of how easy it is to finetune models with the `neuralHydrology` library. Now it is up to you to experiment with it.