# Backward model training

In this notebook we take the prepared experimental (NIST) and synthetic (NEIMS-gen, RASSP-gen) datasets and train our final backward model. 

The model is an adapted encoder-decoder transformer architecture that takes m/z values and intensities as input and predicts the SMILES string of a molecule as output. The architecture is based on the Hugging Face implementation of the [BART model](https://huggingface.co/docs/transformers/model_doc/bart) and can be found in `{PROJECT_ROOT}/spectus/model/modeling_spectus.py`.

The training phase is divided into two parts:
1. Pretraining on the synthetic dataset (NEIMS-gen, RASSP-gen)
2. Fine-tuning on the experimental dataset (NIST)

To train the best model possible, we ran several experiments searching for the best architecture and hyperparameters. First we show how to train the final model, and then we show how to reproduce the experiments.

All the various configurations of the runs are stored in the `{PROJECT_ROOT}/configs` folder.

## Final model training

The final model's configuration can be found in `{PROJECT_ROOT}/configs/pretrain_final.yaml` and `{PROJECT_ROOT}/configs/finetune_final.yaml`. The run commands are as follows:

### Pretraining

```bash
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/pretrain_final.yaml \
                     --additional-info _final \
                     --additional-tags scratch:rassp1:neims1:final \
                     --wandb-group pretrain
```

### Finetuning
```bash
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_final.yaml \
                     --checkpoint checkpoints/pretrain/{CHECKPOINT_NAME}/checkpoint-448000 \
                     --additional-info final \
                     --additional-tags final \
                     --wandb-group finetune 
```

NOTES:

The training process is logged to [Weights & Biases](https://wandb.ai/home). We encourage you to create a profile to track the training (set your login and project name in the [train script](../spectus/train_spectus.py)). When using `wandb` you can also resume a failed run easily.  

The models are saved in the `{PROJECT_ROOT}/checkpoints/{wandb-group}/{wandb-run-name}` folder. More information on the `spectus/train_spectus.py` script arguments can be found by running `python spectus/train_spectus.py --help`.
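
To choose the folder to pass to `--checkpoint`, it can help to list which steps were saved for a run. The following is a minimal sketch, assuming the checkpoints follow the `checkpoint-<step>` naming seen in the commands above. The helper name `latest_checkpoint` and the example run folder are hypothetical and are not part of the `spectus` package.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(run_dir: Path) -> Optional[Path]:
    """Return the `checkpoint-<step>` subfolder with the highest step, or None."""
    pattern = re.compile(r"checkpoint-(\d+)$")
    candidates = []
    for child in run_dir.iterdir():
        match = pattern.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    # Compare steps numerically so that checkpoint-448000 beats checkpoint-56000.
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    # Hypothetical run folder; replace with your own {wandb-group}/{wandb-run-name}.
    run_dir = Path("checkpoints/pretrain/my_run")
    if run_dir.is_dir():
        print(latest_checkpoint(run_dir))
```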

If you have more GPUs available, list the chosen device IDs in the `CUDA_VISIBLE_DEVICES` environment variable right before the `python` call at the beginning of the line (for example, `CUDA_VISIBLE_DEVICES=0,1`).

If your GPU memory is smaller than 40GB, you might need to adjust the batch size in the config files: set `auto_bs` to `False` and specify the batch size manually (setting values for `per_device_train_batch_size`, `per_device_eval_batch_size`, `gradient_accumulation_steps`, and `eval_accumulation_steps` in `hf_training_args`). Beware that the batch size is an important hyperparameter and can significantly affect the model's performance.
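
When setting the batch size by hand, the quantity that usually matters for optimization is the effective batch size per optimizer step. With Hugging Face-style training arguments, this is commonly the product of the per-device batch size, the number of gradient accumulation steps, and the number of GPUs. The sketch below only illustrates this arithmetic; the function names and all numeric values are hypothetical and are not taken from the project's config files.

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int = 1) -> int:
    """Number of samples contributing to a single optimizer step."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus


def accumulation_steps_for(target_effective: int,
                           per_device_train_batch_size: int,
                           num_gpus: int = 1) -> int:
    """Accumulation steps needed to reach a target effective batch size."""
    per_step = per_device_train_batch_size * num_gpus
    if target_effective % per_step != 0:
        raise ValueError("target is not a multiple of per-device batch size * GPUs")
    return target_effective // per_step


if __name__ == "__main__":
    # Hypothetical example: a smaller GPU fits only 16 samples per step,
    # so 4 accumulation steps keep an effective batch size of 64.
    steps = accumulation_steps_for(64, 16)
    print(steps, effective_batch_size(16, steps))
```

Keeping the effective batch size unchanged while lowering the per-device value is one way to fit a smaller GPU without changing this hyperparameter.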

Training on CPU (debug mode) is currently not supported due to dependency issues.

Since the training process is quite long (58h + 28h on H100 GPU), we recommend running it in a `tmux` session.
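
To plan shorter runs, such as the experiments below, the reported durations can be scaled by the number of training steps. This is only a rough sketch: it assumes that the 58 h figure refers to the 448k-step pretraining run, that throughput is constant, and that the same hardware and batch size are used. The function name `estimate_hours` is hypothetical.

```python
def estimate_hours(steps: int, reference_steps: int, reference_hours: float) -> float:
    """Linearly extrapolate wall-clock time from a reference run."""
    if reference_steps <= 0:
        raise ValueError("reference_steps must be positive")
    return steps * reference_hours / reference_steps


if __name__ == "__main__":
    # Assumed reference: 448k pretraining steps in about 58 hours.
    for steps in (112_000, 224_000, 448_000):
        print(f"{steps} steps: about {estimate_hours(steps, 448_000, 58.0):.1f} h")
```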

## Experiments

### 1) Intensity binning
This experiment aims to validate our intuition regarding logarithmic intensity binning and to find an optimal combination of the number of bins *s + 1* and the logarithm base *b*.

We investigated three variants of linear binning, rounding the intensities to two, three and four decimal places, resulting in 100, 1000 and 10000 trainable bins, respectively. Additionally, we experimented with four variants of the logarithmic binning parameters (s, b), specifically (9, 2.2), (20, 1.43), (29, 1.28), and (39, 1.2). These seemingly arbitrary pairs were hand-crafted to produce a roughly uniform distribution over the bins, without unnecessary empty bins, on the NIST train dataset.

We only conducted the finetuning step for this experiment (no pretraining).
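
The two binning schemes can be illustrated with a minimal sketch, assuming intensities are normalized to the range [0, 1]. The function names and the exact logarithmic formula (floor of the base-*b* logarithm, shifted by *s* and clipped to 0..*s*) are assumptions made for illustration and are not necessarily the project's implementation.

```python
import math


def linear_bin(intensity: float, decimals: int) -> int:
    """Round a normalized intensity to `decimals` places, returned as an integer bin."""
    return round(intensity * 10 ** decimals)


def log_bin(intensity: float, s: int, b: float) -> int:
    """Map a normalized intensity to one of s + 1 logarithmic bins (0..s)."""
    if intensity <= 0.0:
        return 0
    shifted = s + math.floor(math.log(intensity, b))
    # Clip so that very small intensities share bin 0 and the base peak gets bin s.
    return min(s, max(0, shifted))


if __name__ == "__main__":
    intensities = (1.0, 0.5, 0.1, 0.01, 0.001)
    for s, b in [(9, 2.2), (20, 1.43), (29, 1.28), (39, 1.2)]:
        bins = [log_bin(x, s, b) for x in intensities]
        print(f"s={s}, b={b}, b**s={b ** s:.0f}, bins={bins}")
```

Under this formulation, `b ** s` is on the order of 10^3 for all four tested pairs, so bin 0 would collect intensities below roughly 0.1% of the base peak in each case.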

#### Training (Finetuning)
```bash
# linear binning 100 bins
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp1_lin_100.yaml \
                     --additional-info _exp1_lin_100 \
                     --additional-tags exp1:lin_100:from_scratch \
                     --wandb-group finetune

# linear binning 1000 bins
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp1_lin_1000.yaml \
                     --additional-info _exp1_lin_1000 \
                     --additional-tags exp1:lin_1000:from_scratch \
                     --wandb-group finetune

# linear binning 10000 bins
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp1_lin_10000.yaml \
                     --additional-info _exp1_lin_10000 \
                     --additional-tags exp1:lin_10000:from_scratch \
                     --wandb-group finetune

# log binning 10 bins, base 2.2
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp1_log_9_2.2.yaml \
                     --additional-info _exp1_log_9_2.2 \
                     --additional-tags exp1:log_9_2.2:from_scratch \
                     --wandb-group finetune

# log binning 21 bins, base 1.43
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp1_log_20_1.43.yaml \
                     --additional-info _exp1_log_20_1.43 \
                     --additional-tags exp1:log_20_1.43:from_scratch \
                     --wandb-group finetune

# log binning 30 bins, base 1.28
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp1_log_29_1.28.yaml \
                     --additional-info _exp1_log_29_1.28 \
                     --additional-tags exp1:log_29_1.28:from_scratch \
                     --wandb-group finetune

# log binning 40 bins, base 1.2
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp1_log_39_1.2.yaml \
                     --additional-info _exp1_log_39_1.2 \
                     --additional-tags exp1:log_39_1.2:from_scratch \
                     --wandb-group finetune
```

In this experiment we showed that logarithmic binning is an effective intensity encoding. There was a significant improvement for models using more than 10 bins, but the difference between 30 and 40 bins was negligible in our runs. Linear binning with 1000 and 10000 bins performed reasonably well, whereas 100 bins (`lin_100`) did not. The best results were achieved with 30 bins and logarithm base 1.28.

### 2) Molecular representation & Tokenization
Another decision to make was choosing the optimal encoding for the model’s decoder subcomponent. We initially questioned whether the BPE tokenization of SMILES was a suitable approach and, if so, what properties the tokenizer should have. Additionally, we questioned the choice of SMILES as a molecular representation and compared the results with the recently proposed SELFIES, a representation designed specifically for generative neural networks.

We decided to test four values of the `min_frequency` parameter: 10, 100, 10 000, and 10 million, resulting in vocabulary sizes of 1286, 780, 367 and 267 tokens, respectively. The last tokenizer (10M) simply splits strings at the character level, while the others also use longer tokens. In addition, we included a SELFIES tokenizer that we created by wrapping the Hugging Face `PreTrainedTokenizer` class around the [SELFIES tokenization algorithm](https://pypi.org/project/selfies/).

Here too, we only conducted the finetuning step.
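
To see why a very high `min_frequency` leads to character-level tokenization, consider the following toy BPE trainer. It is a simplified, pure-Python sketch written for illustration only: it is not the tokenizer used in this project, the function names are hypothetical, and the three-molecule corpus is made up.

```python
from collections import Counter
from typing import List, Set, Tuple


def merge_pair(seq: List[str], left: str, right: str) -> List[str]:
    """Replace every adjacent (left, right) occurrence with the merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_toy_bpe(corpus: List[str], min_frequency: int,
                  max_merges: int = 1000) -> Tuple[List[str], Set[str]]:
    """Merge the most frequent adjacent pair until its count drops below min_frequency."""
    sequences = [list(s) for s in corpus]
    merges: List[str] = []
    for _ in range(max_merges):
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < min_frequency:
            break  # no pair is frequent enough: stop merging
        merges.append(left + right)
        sequences = [merge_pair(seq, left, right) for seq in sequences]
    vocab = {ch for s in corpus for ch in s} | set(merges)
    return merges, vocab


if __name__ == "__main__":
    corpus = ["CCO", "CCN", "c1ccccc1"]  # made-up SMILES examples
    for mf in (2, 10_000_000):
        merges, vocab = train_toy_bpe(corpus, min_frequency=mf)
        print(f"min_frequency={mf}: {len(merges)} merges, vocab size {len(vocab)}")
```

With a threshold larger than any pair count in the corpus, no merge is ever accepted and the vocabulary consists of single characters only.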

#### Training (Finetuning)
```bash
# BBPE with minimal token frequency of 10
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp2_mf10.yaml \
                     --additional-info _exp2_mf10 \
                     --additional-tags exp2:mf10:from_scratch \
                     --wandb-group finetune

# BBPE with minimal token frequency of 100
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp2_mf100.yaml \
                     --additional-info _exp2_mf100 \
                     --additional-tags exp2:mf100:from_scratch \
                     --wandb-group finetune

# BBPE with minimal token frequency of 10K
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp2_mf10K.yaml \
                     --additional-info _exp2_mf10K \
                     --additional-tags exp2:mf10K:from_scratch \
                     --wandb-group finetune

# BBPE with minimal token frequency of 10M
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp2_mf10M.yaml \
                     --additional-info _exp2_mf10M \
                     --additional-tags exp2:mf10M:from_scratch \
                     --wandb-group finetune

# SELFIES tokenizer
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp2_selfies.yaml \
                     --additional-info _exp2_selfies \
                     --additional-tags exp2:selfies:from_scratch \
                     --wandb-group finetune
```

In this experiment we showed that the BBPE tokenization with a minimal token frequency of 10M (character-level tokenization) is the best choice for the model.

### 3) Pretraining datasets mixing
In this experiment, we tested whether pretraining on synthetic data positively affects the model’s results. We also tried to find a good mixture of pretraining datasets to optimize the performance.

By prepending a source token to every SMILES string to mark the origin dataset of each spectrum (`<rassp>`, `<neims>`, `<nist>`), we give the model a chance to adapt to the differences between the datasets. We tested the following pretraining combinations of datasets:

- (no pretraining - baseline trained in the previous experiment - exp2_mf10M) 
- RASSP-gen only
- NEIMS-gen only
- RASSP-gen + NEIMS-gen (mixing 1:1)
- RASSP-gen + NEIMS-gen + NIST (mixing 1:1:0.1)

All the pretrained models were finetuned on the NIST dataset and the performance was measured on the NIST validation set.
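
The source tokens and the mixing ratios can be sketched as follows. This is a minimal illustration, assuming that mixing is done by sampling each example's dataset with probability proportional to its weight; the actual data pipeline may differ. The function names and the tiny example datasets are hypothetical.

```python
import random
from typing import Dict, List

SOURCE_TOKENS = {"rassp": "<rassp>", "neims": "<neims>", "nist": "<nist>"}


def add_source_token(smiles: str, source: str) -> str:
    """Prepend the token identifying the origin dataset to a SMILES string."""
    return SOURCE_TOKENS[source] + smiles


def mixing_probabilities(weights: Dict[str, float]) -> Dict[str, float]:
    """Normalize mixing weights such as 1:1:0.1 into sampling probabilities."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def sample_mixture(datasets: Dict[str, List[str]], weights: Dict[str, float],
                   n: int, seed: int = 0) -> List[str]:
    """Draw n labeled SMILES strings, choosing the dataset by its mixing weight."""
    rng = random.Random(seed)
    probs = mixing_probabilities(weights)
    names = list(datasets)
    chosen = rng.choices(names, weights=[probs[name] for name in names], k=n)
    return [add_source_token(rng.choice(datasets[name]), name) for name in chosen]


if __name__ == "__main__":
    # Made-up SMILES strings standing in for the real datasets.
    datasets = {"rassp": ["CCO"], "neims": ["CCN"], "nist": ["c1ccccc1"]}
    weights = {"rassp": 1.0, "neims": 1.0, "nist": 0.1}
    print(mixing_probabilities(weights))
    print(sample_mixture(datasets, weights, n=5))
```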

#### Pretraining
```bash
# RASSP-gen only
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/pretrain_exp3_rassp.yaml \
                     --additional-info _exp3_rassp \
                     --additional-tags exp3:rassp:from_scratch \
                     --wandb-group pretrain

# NEIMS-gen only
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/pretrain_exp3_neims.yaml \
                     --additional-info _exp3_neims \
                     --additional-tags exp3:neims:from_scratch \
                     --wandb-group pretrain

# RASSP-gen + NEIMS-gen
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/pretrain_exp3_rassp_neims.yaml \
                     --additional-info _exp3_rassp_neims \
                     --additional-tags exp3:rassp:neims:from_scratch \
                     --wandb-group pretrain

# RASSP-gen + NEIMS-gen + NIST
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/pretrain_exp3_rassp_neims_nist.yaml \
                     --additional-info _exp3_rassp_neims_nist \
                     --additional-tags exp3:rassp:neims:nist:from_scratch \
                     --wandb-group pretrain
```

#### Finetuning
The finetuning step reuses the configuration from the previous experiment; add the `--checkpoint` argument with the path to the pretrained model and specify the correct name for each run.

```bash
# TODO: set the correct checkpoint path, run name, and tags
# (comments must not follow a line-continuation backslash, otherwise the command breaks)
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp2_mf10M.yaml \
                     --checkpoint checkpoints/pretrain/{CHECKPOINT_NAME}/checkpoint-112000 \
                     --additional-info _exp3_{MIXTURE_NAME} \
                     --additional-tags exp3:from_pretrained \
                     --wandb-group finetune
```

In this experiment we showed that the model benefits from pretraining on synthetic data. The best results were achieved with the mixture of RASSP-gen and NEIMS-gen datasets.

### 4) Source token indication
In this experiment, we tested the effect of the source tokens on the model's performance. We compared the model trained with three different source tokens to the model trained with the same source token for all datasets.

#### Pretraining
```bash
# one source token for all datasets (we chose <nist> for convenience)
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/pretrain_exp4_one_src_token.yaml \
                     --additional-info _exp4_one_src_token \
                     --additional-tags exp4:one_src_token:from_scratch \
                     --wandb-group pretrain
```

#### Finetuning
The finetuning step is the same as in the previous experiment; just set the `--checkpoint` argument to the path of the pretrained model.

```bash
# TODO: set the correct checkpoint path
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp2_mf10M.yaml \
                     --checkpoint checkpoints/pretrain/{CHECKPOINT_NAME}/checkpoint-112000 \
                     --additional-info _exp4_one_src_token \
                     --additional-tags exp4:from_pretrained:one_src_token \
                     --wandb-group finetune
```

In this experiment we showed that the model does not benefit from the source tokens. The results with three distinct source tokens were comparable to those of the model trained with the same source token for all datasets, which suggests that the model learned to ignore the source tokens.

### 5) Training length and dataset size
In the last experiment we tested the impact of pretraining dataset size, finetuning length and pretraining length on the model's performance. We identified the best combination of hyperparameters from those we tested. Since the training didn’t seem to fully converge yet, we doubled the number of epochs for both the pretraining and finetuning phases.

We compared the following combinations:
- 4.2M pretraining compounds, 112k pretraining steps, 74k finetuning steps (the best model from experiment 3)
  
- 4.2M pretraining compounds, 224k pretraining steps, 74k finetuning steps (doubling only the pretraining length)
  
- 4.2M pretraining compounds, 224k pretraining steps, 148k finetuning steps (doubling both the pretraining and finetuning length)
  
- 8.6M pretraining compounds, 224k pretraining steps, 148k finetuning steps (doubling the pretraining dataset size)
  
- 8.6M pretraining compounds, 448k pretraining steps, 296k finetuning steps (the last model pushed to the limit)

---------------------------------------------------


```bash
# 4.2M_224k_74k
# we use the pretrained model from exp3 and extend it to 224k 
cd {PROJECT_ROOT}
python spectus/train_spectus.py --config-file configs/pretrain_exp5_4.2_224.yaml \
                     --additional-info "_exp5_4.2M_224" \
                     --wandb-group pretrain \
                     --resume-id j99ipkke \
                     --checkpoint ../checkpoints/pretrain_clean/{CHECKPOINT_NAME_1}_exp4_custom_rassp_neims/checkpoint-112000

# finetuning as in exp3 and exp2
# TODO: set the correct checkpoint path
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp2_mf10M.yaml \
                     --checkpoint checkpoints/pretrain/{CHECKPOINT_NAME_2}/checkpoint-224000 \
                     --additional-info _exp5_4.2M_224_74 \
                     --additional-tags exp5:4.2M:from_pretrained:224:74 \
                     --wandb-group finetune

######################
# 4.2M_224k_148k
# using the pretrained model from 4.2M_224k_74k
# finetuning for 148k steps
# TODO: set the correct checkpoint path
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp5_224_148.yaml \
                     --checkpoint checkpoints/pretrain/{CHECKPOINT_NAME_2}/checkpoint-224000 \
                     --additional-info _exp5_4.2M_224_148 \
                     --additional-tags exp5:4.2M:from_pretrained:224:148 \
                     --wandb-group finetune

######################
# 8.6M_224k_148k
# to match the previous two experiments, we need to split the pretraining into two parts
# first 112k steps
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/pretrain_exp5_112.yaml \
                     --additional-info _exp5_8.6M_112_112 \
                     --additional-tags exp5:8.6M:112_112:from_scratch \
                     --wandb-group pretrain

# second 112k steps
# TODO: set the correct checkpoint path
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/pretrain_exp5_224.yaml \
                     --checkpoint checkpoints/pretrain/{CHECKPOINT_NAME_3}/checkpoint-112000 \
                     --additional-info _exp5_8.6M_112_112 \
                     --additional-tags exp5:8.6M:112_112:from_scratch \
                     --wandb-group pretrain

# finetuning as in 4.2M_224k_148k
# TODO: set the correct checkpoint path
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp5_224_148.yaml \
                     --checkpoint checkpoints/pretrain/{CHECKPOINT_NAME_3}/checkpoint-224000 \
                     --additional-info _exp5_8.6M_224_148 \
                     --additional-tags exp5:8.6M:from_pretrained:224:148 \
                     --wandb-group finetune

######################
# 8.6M_448k_296k
# to push the model to the limit, we train the whole 8.6M dataset for 448k steps
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/pretrain_exp5_448.yaml \
                     --additional-info _exp5_8.6M_448 \
                     --additional-tags exp5:8.6M:448_296:from_scratch \
                     --wandb-group pretrain

# finetuning for the full 296k steps
# TODO: set the correct checkpoint path
cd {PROJECT_ROOT}
CUDA_VISIBLE_DEVICES=0 python spectus/train_spectus.py --config-file configs/finetune_exp5_448_296.yaml \
                     --checkpoint checkpoints/pretrain/{CHECKPOINT_NAME_4}/checkpoint-448000 \
                     --additional-info _exp5_8.6M_448_296 \
                     --additional-tags exp5:from_pretrained:448:296 \
                     --wandb-group finetune

```

This experiment yields the final model (8.6M_448k_296k), the training of which is described at the beginning of this notebook.