# Backward model training

In this notebook we take the prepared experimental (NIST) and synthetic (NEIMS-gen, RASSP-gen) datasets and train our final backward model. 

The model is an adapted encoder-decoder transformer that takes m/z values and intensities as input and predicts the SMILES string of a molecule as output. The architecture is based on the Hugging Face implementation of the [BART model](https://huggingface.co/docs/transformers/model_doc/bart) and can be found in `{PROJECT_ROOT}/bart_spektro/modeling_bart_spektro.py`.

The training phase is divided into two parts:
1. Pretraining on the synthetic dataset (NEIMS-gen, RASSP-gen)
2. Fine-tuning on the experimental dataset (NIST)

To train the best possible model, we ran several experiments searching for the best architecture and hyperparameters. First we show how to train the final model, and then how to reproduce the experiments.

All the various configurations of the runs are stored in the `{PROJECT_ROOT}/configs` folder.

## Final model training

NOTE: before running the commands below, edit the data paths in the config files to match your setup.

The final model's configuration can be found in `{PROJECT_ROOT}/configs/pretrain_final.yaml` and `{PROJECT_ROOT}/configs/finetune_final.yaml`. The run commands are as follows:

### Pretraining

```bash
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/pretrain_final.yaml \
                     --additional-info _final \
                     --additional-tags scratch:rassp1:neims1:final \
                     --wandb-group pretrain
```

### Finetuning
```bash
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_final.yaml \
                     --checkpoint checkpoints/pretrain/{CHECKPOINT_NAME}/checkpoint-224000 \
                     --additional-info _final \
                     --additional-tags final \
                     --wandb-group finetune 
```

NOTES:

The training process is logged to [Weights & Biases](https://wandb.ai/home). We encourage you to create a profile to track the training. When using `wandb` you can also resume a failed run easily.  

The final model is saved in the `{PROJECT_ROOT}/checkpoints/{wandb-group}/{wandb-run-id}` folder. More information on the `train_bart.py` script arguments can be found by running `python train_bart.py --help`.

If you have more GPUs available, list the chosen IDs, separated by commas, in the `CUDA_VISIBLE_DEVICES` environment variable right before the `python` call at the beginning of the line (for example `CUDA_VISIBLE_DEVICES=0,1`).

If your GPU memory is smaller than 40 GB, you might need to adjust the batch size in the config files: set `auto_bs` to `False` and specify the batch size manually (by setting `per_device_train_batch_size`, `per_device_eval_batch_size`, `gradient_accumulation_steps`, and `eval_accumulation_steps` in `hf_training_args`). Beware that the batch size is an important hyperparameter and can significantly affect the model's performance.
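When lowering the per-device batch size, one common way to keep the optimization behaviour close to the original is to raise `gradient_accumulation_steps` so that the effective batch size stays the same. The sketch below only illustrates that arithmetic; the helper names and the numbers (128, 32, 16) are hypothetical and are not taken from the project's configs.

```python
def effective_batch_size(per_device_train_batch_size, gradient_accumulation_steps, num_gpus=1):
    """Number of samples that contribute to a single optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus


def accumulation_steps_for(target_effective_bs, per_device_train_batch_size, num_gpus=1):
    """Gradient accumulation steps needed to reach a target effective batch size."""
    samples_per_forward = per_device_train_batch_size * num_gpus
    if target_effective_bs % samples_per_forward != 0:
        raise ValueError("target batch size must be a multiple of per-device batch size * GPUs")
    return target_effective_bs // samples_per_forward


# Hypothetical numbers: halving the per-device batch size doubles the accumulation steps.
steps_large_gpu = accumulation_steps_for(128, per_device_train_batch_size=32)
steps_small_gpu = accumulation_steps_for(128, per_device_train_batch_size=16)
print(steps_large_gpu, steps_small_gpu)
```

Check the effective batch size implied by the original config before choosing new values, so that the manual settings reproduce it.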

Training on CPU (debug mode) is currently not supported due to dependency issues.

Since the training process is quite long (60 h of pretraining plus 24 h of finetuning on an H100 GPU), we recommend running it in a `tmux` session.

## Experiments

### 1) Positional embeddings
In this experiment we tested whether we can effectively input two channels of information (m/z and intensities) in the form of two sets of trained embeddings that are added together. A model given only m/z values and static positional encoding is compared to a model where trained binned intensity embeddings replace positional encoding.
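The idea of adding two sets of embeddings can be sketched in a few lines. This is only an illustration in plain Python, not the code from `modeling_bart_spektro.py`: the table sizes, the random initialisation, and the names `make_table` and `embed_spectrum` are assumptions, and in the real model both tables would be trained embedding layers.

```python
import random


def make_table(num_rows, dim, seed):
    """Stand-in for a trainable embedding table: one vector of size `dim` per row."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_rows)]


def embed_spectrum(mz_ids, intensity_bins, mz_table, intensity_table):
    """Embed each peak as the element-wise sum of its m/z and intensity-bin vectors."""
    if len(mz_ids) != len(intensity_bins):
        raise ValueError("every peak needs both an m/z id and an intensity bin")
    return [
        [m + i for m, i in zip(mz_table[mz], intensity_table[b])]
        for mz, b in zip(mz_ids, intensity_bins)
    ]


# Hypothetical sizes: 500 integer m/z values, 10 intensity bins, 8-dimensional vectors.
mz_table = make_table(500, 8, seed=0)
intensity_table = make_table(10, 8, seed=1)

# Three peaks given as (m/z id, intensity bin) pairs.
peaks = embed_spectrum([41, 43, 57], [3, 9, 6], mz_table, intensity_table)
print(len(peaks), len(peaks[0]))
```

In the baseline variant of this experiment, the intensity table would be replaced by a static positional encoding indexed by the position of the peak in the sequence.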

To save time and resources, for this experiment we conducted only the finetuning step.

#### Training (Finetuning)
```bash
# m/z + static positional encoding
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_exp1_pos_emb.yaml \
                     --additional-info _exp1_pos_emb \
                     --additional-tags exp1:pos_emb:from_scratch \
                     --wandb-group finetune 

# m/z + intensity embeddings
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_exp1_int_emb.yaml \
                     --additional-info _exp1_int_emb \
                     --additional-tags exp1:int_emb:from_scratch \
                     --wandb-group finetune 
```

In this experiment we showed that the model can effectively learn to use the intensity embeddings as a positional encoding. 

### 2) Intensity binning
This experiment aims to validate our intuition regarding logarithmic intensity binning and to find an optimal combination of the number of bins *s + 1* and the logarithm base *b*.

We investigated two variants of linear binning, rounding the intensities to two and three decimal places, resulting in 100 and 1000 trainable bins, respectively. Additionally, we experimented with four different settings of the logarithmic binning parameters (s, b), specifically (9, 2.2), (20, 1.43), (29, 1.28), and (39, 1.2). These seemingly arbitrary pairs were hand-crafted to produce a roughly uniform distribution over bins, without unnecessary empty bins, on the NIST train dataset.
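To make the role of *s* and *b* concrete, here is a minimal sketch of one way such a logarithmic binning could be computed. It assumes intensities normalised so that the base peak equals 1, and it assumes the formula `bin = max(0, floor(log_b(x)) + s)`; the exact formula used in the project may differ, and the function name `log_bin` is hypothetical.

```python
import math


def log_bin(intensity, s, b):
    """Map a normalised intensity in [0, 1] to one of s + 1 logarithmic bins (0..s).

    Assumed formula: floor(log_b(intensity)) + s, clipped to the range [0, s].
    """
    if intensity <= 0.0:
        return 0  # zero intensity falls into the lowest bin
    raw = math.floor(math.log(intensity) / math.log(b)) + s
    return min(max(raw, 0), s)


# The four (s, b) pairs tested in this experiment.
for s, b in [(9, 2.2), (20, 1.43), (29, 1.28), (39, 1.2)]:
    threshold = b ** (-s)  # intensities below this value all end up in bin 0
    print(s + 1, "bins, base", b, "-> lowest resolved intensity ~", round(threshold, 5))
```

Under this assumed formula, all four pairs put the lowest resolved intensity at roughly 0.0008 of the base peak, so they differ mainly in how finely the range above that value is divided.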

Again, we only conducted the finetuning step for this experiment.

#### Training (Finetuning)
```bash
# linear binning 100 bins
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_exp2_lin_100.yaml \
                     --additional-info _exp2_lin_100 \
                     --additional-tags exp2:lin_100:from_scratch \
                     --wandb-group finetune

# linear binning 1000 bins
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_exp2_lin_1000.yaml \
                     --additional-info _exp2_lin_1000 \
                     --additional-tags exp2:lin_1000:from_scratch \
                     --wandb-group finetune

# log binning 10 bins, base 2.2
# !!! This is the same run as [exp1: m/z + intensity embeddings]
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_exp2_log_9_2.2.yaml \
                     --additional-info _exp2_log_9_2.2 \
                     --additional-tags exp2:log_9_2.2:from_scratch \
                     --wandb-group finetune

# log binning 21 bins, base 1.43
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_exp2_log_20_1.43.yaml \
                     --additional-info _exp2_log_20_1.43 \
                     --additional-tags exp2:log_20_1.43:from_scratch \
                     --wandb-group finetune

# log binning 30 bins, base 1.28
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_exp2_log_29_1.28.yaml \
                     --additional-info _exp2_log_29_1.28 \
                     --additional-tags exp2:log_29_1.28:from_scratch \
                     --wandb-group finetune

# log binning 40 bins, base 1.2
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_exp2_log_39_1.2.yaml \
                     --additional-info _exp2_log_39_1.2 \
                     --additional-tags exp2:log_39_1.2:from_scratch \
                     --wandb-group finetune
```

In this experiment we showed that logarithmic binning can be used as an effective intensity encoding. There was a significant improvement for models using more than 10 bins, but the differences between 30 and 40 bins were negligible in our runs. Linear binning with 1000 bins performed reasonably well, while linear binning with 100 bins did not. The best results were achieved with 30 bins and log base 1.28.

### 3) Molecular representation & Tokenization
Another decision to make was choosing the optimal encoding for the model’s decoder subcomponent. We initially questioned whether the BBPE tokenization of SMILES was a suitable approach and, if so, what properties the tokenizer should have. Additionally, we questioned the choice of SMILES as a molecular representation and compared the results with the recently proposed SELFIES, a representation designed specifically for generative neural networks.

We decided to test four values of the `min_frequency` parameter: 10, 100, 10 000, and 10 million, resulting in vocabulary sizes of 1286, 780, 367 and 267 tokens, respectively. The last tokenizer (10M) splits strings simply at the level of characters, while the others also use longer tokens. In addition, we included a SELFIES tokenizer that we created by wrapping the Hugging Face `PreTrainedTokenizer` class around the [SELFIES tokenization algorithm](https://pypi.org/project/selfies/).
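The effect of `min_frequency` can be illustrated with a toy BPE-style merge loop: a pair of adjacent tokens is merged only if it occurs at least `min_frequency` times, so a very high threshold leaves the vocabulary at the character level. This is a simplified sketch on characters rather than bytes, with hypothetical function names and a toy corpus; it is not the tokenizer training code used in the project.

```python
from collections import Counter


def merge_pair(seq, left, right):
    """Replace every adjacent (left, right) pair in seq with the merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def learn_merges(corpus, min_frequency, max_merges=50):
    """Learn merged tokens; stop once the most frequent pair is rarer than min_frequency."""
    sequences = [list(text) for text in corpus]  # start from single characters
    merges = []
    for _ in range(max_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < min_frequency:
            break  # no remaining pair is frequent enough to become a token
        merges.append(left + right)
        sequences = [merge_pair(seq, left, right) for seq in sequences]
    return merges


toy_corpus = ["CCO", "CCN", "CCC", "c1ccccc1"]
print(learn_merges(toy_corpus, min_frequency=2))           # some multi-character tokens
print(learn_merges(toy_corpus, min_frequency=10_000_000))  # [] (character level only)
```

On a real corpus, lowering `min_frequency` lets more pairs pass the threshold, which matches the growing vocabulary sizes reported above.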

Also here, we only conducted the finetuning step.

#### Training (Finetuning)
```bash
# BBPE with minimal token frequency of 10
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_exp3_mf10.yaml \
                     --additional-info _exp3_mf10 \
                     --additional-tags exp3:mf10:from_scratch \
                     --wandb-group finetune

# BBPE with minimal token frequency of 100
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_exp3_mf100.yaml \
                     --additional-info _exp3_mf100 \
                     --additional-tags exp3:mf100:from_scratch \
                     --wandb-group finetune

# BBPE with minimal token frequency of 10K
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_exp3_mf10K.yaml \
                     --additional-info _exp3_mf10K \
                     --additional-tags exp3:mf10K:from_scratch \
                     --wandb-group finetune

# BBPE with minimal token frequency of 10M
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_exp3_mf10M.yaml \
                     --additional-info _exp3_mf10M \
                     --additional-tags exp3:mf10M:from_scratch \
                     --wandb-group finetune

# SELFIES tokenizer
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_exp3_selfies.yaml \
                     --additional-info _exp3_selfies \
                     --additional-tags exp3:selfies:from_scratch \
                     --wandb-group finetune
```

In this experiment we showed that the BBPE tokenization with a minimal token frequency of 10M (character-level tokenization) is the best choice for the model.

### 4) Pretraining datasets mixing
In this experiment, we tested whether pretraining on synthetic data positively affects the model’s results. We also tried to find a good mixture of pretraining datasets to optimize the performance.

By placing a source token at the beginning of every SMILES string to mark the origin dataset of each spectrum (`<rassp>`, `<neims>`, `<nist>`), we give our model a chance to adapt to the differences between the datasets. We tested the following pretraining combinations of datasets:

- (no pretraining - baseline trained in the previous experiment - exp3_mf10M) 
- RASSP-gen only
- NEIMS-gen only
- RASSP-gen + NEIMS-gen (mixing 1:1)
- RASSP-gen + NEIMS-gen + NIST (mixing 1:1:0.1)

All the pretrained models were finetuned on the NIST dataset and the performance was measured on the NIST validation set.
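As an illustration of the two ingredients of this experiment, the sketch below prepends a source token to a SMILES string and turns mixing ratios into sampling probabilities. It assumes that the token is concatenated directly in front of the string and that a ratio such as 1:1:0.1 is interpreted as relative sampling weights; the function names are hypothetical, and the project's data pipeline may implement both steps differently.

```python
SOURCE_TOKENS = {"rassp": "<rassp>", "neims": "<neims>", "nist": "<nist>"}


def add_source_token(smiles, source):
    """Prefix a SMILES string with the token of the dataset it comes from."""
    return SOURCE_TOKENS[source] + smiles


def mixing_probabilities(ratios):
    """Normalise relative mixing ratios so that they sum to 1."""
    total = sum(ratios.values())
    if total <= 0:
        raise ValueError("at least one ratio must be positive")
    return {name: weight / total for name, weight in ratios.items()}


print(add_source_token("CCO", "nist"))  # <nist>CCO

# The RASSP-gen + NEIMS-gen + NIST mixture (1:1:0.1) as sampling probabilities.
probs = mixing_probabilities({"rassp": 1.0, "neims": 1.0, "nist": 0.1})
print({name: round(p, 3) for name, p in probs.items()})
```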

#### Pretraining
```bash
# RASSP-gen only
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/pretrain_exp4_rassp.yaml \
                     --additional-info _exp4_rassp \
                     --additional-tags exp4:rassp:from_scratch \
                     --wandb-group pretrain

# NEIMS-gen only
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/pretrain_exp4_neims.yaml \
                     --additional-info _exp4_neims \
                     --additional-tags exp4:neims:from_scratch \
                     --wandb-group pretrain

# RASSP-gen + NEIMS-gen
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/pretrain_exp4_rassp_neims.yaml \
                     --additional-info _exp4_rassp_neims \
                     --additional-tags exp4:rassp:neims:from_scratch \
                     --wandb-group pretrain

# RASSP-gen + NEIMS-gen + NIST
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/pretrain_exp4_rassp_neims_nist.yaml \
                     --additional-info _exp4_rassp_neims_nist \
                     --additional-tags exp4:rassp:neims:nist:from_scratch \
                     --wandb-group pretrain
```

#### Finetuning
The finetuning step is the same as in the previous experiment - just replace the `--checkpoint` argument with the path to the pretrained model.

```bash
cd {PROJECT_ROOT}
# TODO: fill in the correct checkpoint path, mixture name, and tags for the chosen mixture
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_exp3_mf10M.yaml \
                     --checkpoint checkpoints/pretrain/{CHECKPOINT_NAME}/checkpoint-112000 \
                     --additional-info _exp4_{MIXTURE_NAME} \
                     --additional-tags exp4:from_pretrained \
                     --wandb-group finetune
```

In this experiment we showed that the model benefits from pretraining on synthetic data. The best results were achieved with the mixture of RASSP-gen and NEIMS-gen datasets.

### 5) Source token vindication
In this experiment, we tested the effect of the source tokens on the model's performance. We compared the model trained with three different source tokens to the model trained with the same source token for all datasets.

#### Pretraining
```bash
# one source token for all datasets (we chose <nist> for convenience)
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/pretrain_exp5_one_src_token.yaml \
                     --additional-info _exp5_one_src_token \
                     --additional-tags exp5:one_src_token:from_scratch \
                     --wandb-group pretrain
```

#### Finetuning
The finetuning step is the same as in the previous experiment - just replace the `--checkpoint` argument with the path to the pretrained model.

```bash
cd {PROJECT_ROOT}
# TODO: fill in the correct checkpoint path
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_exp3_mf10M.yaml \
                     --checkpoint checkpoints/pretrain/{CHECKPOINT_NAME}/checkpoint-112000 \
                     --additional-info _exp5_one_src_token \
                     --additional-tags exp5:from_pretrained:one_src_token \
                     --wandb-group finetune
```

In this experiment we showed that the model does not benefit from the source tokens. The model trained with dataset-specific source tokens performed comparably to the model trained with the same source token for all datasets, which suggests that the model learned to ignore the source tokens.

### 6) Layer freezing
In this experiment, we investigated whether we could speed up the finetuning process or even enhance the model’s performance using layer freezing. Layer freezing is commonly used as a regularization method and, given the considerable overfitting visible in the model's loss values, might be beneficial. We tested the following layer freezing strategies:

- no freezing (here we used the best model from the previous experiments: the unfrozen RASSP:NEIMS model finetuned on the NIST dataset)
- train *fc1* and *decoder embedding* layers (72% frozen)
- train *fc1*, *cross-attention*, *decoder self-attention* and *decoder embedding* layers (45% frozen)
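A freezing strategy of this kind can be sketched as a name-based filter over model parameters. The helper below works on any iterable of `(name, parameter)` pairs whose parameters expose `requires_grad` and `numel()`, which is the interface of `model.named_parameters()` in PyTorch. The parameter names, the keywords, and the `FakeParam` stand-in are assumptions for illustration; the project configures freezing through its config files, and the real parameter names may differ.

```python
def freeze_except(named_parameters, trainable_keywords):
    """Freeze every parameter whose name contains none of the keywords.

    Returns the fraction of parameter values that ended up frozen.
    """
    total = frozen = 0
    for name, param in named_parameters:
        trainable = any(keyword in name for keyword in trainable_keywords)
        param.requires_grad = trainable
        count = param.numel()
        total += count
        if not trainable:
            frozen += count
    return frozen / total if total else 0.0


class FakeParam:
    """Minimal stand-in for a tensor parameter, used only for this demo."""

    def __init__(self, size):
        self.size = size
        self.requires_grad = True

    def numel(self):
        return self.size


# Hypothetical parameter names and sizes, loosely following BART naming.
demo_params = [
    ("model.encoder.layers.0.fc1.weight", FakeParam(20)),
    ("model.encoder.layers.0.fc2.weight", FakeParam(50)),
    ("model.decoder.layers.0.self_attn.q_proj.weight", FakeParam(20)),
    ("model.decoder.embed_tokens.weight", FakeParam(10)),
]
frozen_fraction = freeze_except(demo_params, ["fc1", "decoder.embed_tokens"])
print(f"{frozen_fraction:.0%} of parameter values frozen")
```

With a real PyTorch model, the same helper could be called as `freeze_except(model.named_parameters(), [...])` before the trainer is created.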

#### Finetuning
```bash
# train fc1 and decoder embedding layers
cd {PROJECT_ROOT}
# TODO: fill in the correct checkpoint path
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_exp6_72perc_forzen.yaml \
                     --checkpoint checkpoints/pretrain/{NEIMS_RASSP_CHECKPOINT_NAME}/checkpoint-112000 \
                     --additional-info _exp6_72perc_forzen \
                     --additional-tags exp6:72perc_forzen:from_pretrained \
                     --wandb-group finetune

# train fc1, cross-attention, decoder self-attention and decoder embedding layers
cd {PROJECT_ROOT}
# TODO: fill in the correct checkpoint path
CUDA_VISIBLE_DEVICES=0 python train_bart.py --config-file configs/finetune_exp6_45perc_forzen.yaml \
                     --checkpoint checkpoints/pretrain/{NEIMS_RASSP_CHECKPOINT_NAME}/checkpoint-112000 \
                     --additional-info _exp6_45perc_forzen \
                     --additional-tags exp6:45perc_forzen:from_pretrained \
                     --wandb-group finetune
```

In this experiment we showed that the model does not benefit from layer freezing. The results were significantly worse than those of the model trained without freezing, and the more layers were frozen, the worse the results were.

### 7) Training length
We identified the best combination of hyperparameters from those we tested. Since the training didn’t seem to fully converge yet, we doubled the number of epochs for both the pretraining and finetuning phases.

This experiment results in the final models, the training of which is described at the beginning of this notebook.