# Training

The neural network architectures are implemented in `MolecularStructureTeFT.py`, `MolecularStructureTeFTWithBias.py`, `MolecularStructureTeFTWithIntensity.py` (under test), and `MolecularStructureTeFTWithMLP.py` (multi-task learning, under revision).

The tokenizer is implemented in `MolecularStructureDictionary.py`.

The training integrator is implemented in `TrainWorkFlowV2.py` (branch `v2`).

The loss functions are defined as classes in `LossFunctions.py` and, for multi-task learning, also in `LossWithMassPrediction.py`.

The script used to train the neural network is `train_teft_original_data_datadriven_tokenizer_v2.py` (the version in the `v2` branch).

To list the arguments accepted by the script, run:

```bash
python train_teft_original_data_datadriven_tokenizer_v2.py --help
```

This prints the list of arguments:

- *-h*, *--help*: Show this help message and exit
- *--max_sentence_size*: Maximum length of MS/MS to be considered
- *--tokenizer*: Tokenizer class name (available options: 'DataDrivenTokenizer' and 'SimpleTokenizer')
- *--dir_vocab*: Path to the folder containing the tokenizer vocabulary (i.e. SPE_ChEMBL.txt)
- *--class_balancing*: Enable class balancing
- *--balance_by*: Balance the dataset by m/z values or molecular weights ('mz' or 'molecular_weight')
- *--balance_bins*: Number of histogram bins to use when balancing the dataset
- *--batch_size*: Batch size for training
- *--device*: Device to use for training ('cpu' or 'cuda')
- *--loss_function*: Loss function to use for training ('CrossEntropy' or 'RegByMassCrossEntropy')
- *--architecture*: Transformer-based architecture to use for training ('MolecularStructureTeFT', 'MolecularStructureTeFTWithBias' or 'MolecularStructureTeFTWithIntensity')
- *--model_name*: Model name for saving
- *--max_samples*: Maximum number of samples to use from the dataset. If None, use all samples.
- *--epochs*: Number of training epochs
- *--d_model*: Transformer architecture parameter: embedding hidden size
- *--d_ff*: Transformer architecture parameter: feed-forward dimension
- *--d_k*: Transformer architecture parameter: dimension of K (and Q)
- *--d_v*: Transformer architecture parameter: dimension of V
- *--n_layers*: Number of encoder/decoder layers
- *--n_heads*: Number of heads in multi-head attention
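
As a rough illustration of how flags like these are typically declared, here is a minimal `argparse` sketch covering a subset of the options listed above. It is not the actual parser of the training script: the `build_parser` name, the default values, and the choice to make `--class_balancing` a boolean switch are all assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring a subset of the documented flags."""
    parser = argparse.ArgumentParser(description="Illustrative training CLI sketch")
    # Defaults below are assumptions, not the values used by the real script.
    parser.add_argument("--max_sentence_size", type=int, default=100,
                        help="Maximum length of MS/MS to be considered")
    parser.add_argument("--tokenizer", default="DataDrivenTokenizer",
                        choices=["DataDrivenTokenizer", "SimpleTokenizer"])
    parser.add_argument("--class_balancing", action="store_true",
                        help="Enable class balancing (assumed to be a switch)")
    parser.add_argument("--balance_by", default="mz",
                        choices=["mz", "molecular_weight"])
    parser.add_argument("--device", default="cpu", choices=["cpu", "cuda"])
    parser.add_argument("--loss_function", default="CrossEntropy",
                        choices=["CrossEntropy", "RegByMassCrossEntropy"])
    parser.add_argument("--architecture", default="MolecularStructureTeFT",
                        choices=["MolecularStructureTeFT",
                                 "MolecularStructureTeFTWithBias",
                                 "MolecularStructureTeFTWithIntensity"])
    # None means "use all samples", as described in the argument list.
    parser.add_argument("--max_samples", type=int, default=None)
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args([
    "--max_sentence_size", "100",
    "--device", "cuda",
    "--architecture", "MolecularStructureTeFTWithIntensity",
])
print(args.architecture, args.device, args.max_samples)
```

Restricting values with `choices` makes a misspelled architecture or loss name fail at parse time rather than after the data has been loaded.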

To start a training run from the command line, execute:

```bash
python train_teft_original_data_datadriven_tokenizer_v2.py --max_sentence_size 100 --tokenizer DataDrivenTokenizer --dir_vocab /users/jdvillegas/ptfi-frijol-pujc-v2/data --batch_size 246 --device cuda --loss_function CrossEntropy --architecture MolecularStructureTeFTWithIntensity --model_name dd_su_arch3_lf1_code_pl
```
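
The command above passes `--device cuda`, which only makes sense on a machine with a usable GPU. The following sketch shows one way to fall back to the CPU when no GPU is available. It assumes the training code is PyTorch-based (suggested by the `.pth` model files, but not stated here), and `pick_device` is a hypothetical helper, not part of the repository.

```python
def pick_device(requested: str = "cuda") -> str:
    """Return 'cuda' only if it was requested and a GPU is actually usable."""
    if requested != "cuda":
        return "cpu"
    try:
        import torch  # assumption: the training code relies on PyTorch
    except ImportError:
        # Without PyTorch there is no way to query the GPU, so stay on the CPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


# The returned string can be passed as the value of --device.
print(pick_device("cuda"))
```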

Training is computationally demanding, so it is better to run it on the GPUs available in the cluster.

Here is an example SLURM batch script:

```bash
#!/bin/bash
#SBATCH --job-name=dd_su_arch3_lf1_code_pl  # Job name
#SBATCH -o dd_su_arch3_lf1_code_pl.out  # File to which STDOUT will be written (use %j in the name to insert the job ID)
#SBATCH -e dd_su_arch3_lf1_code_pl.err  # File to which STDERR will be written (use %j in the name to insert the job ID)
#SBATCH --partition=GPU # FULL if running CPU, GPU if running GPU, 3 GPUs with 24 GB each
#SBATCH --nodes=1
##SBATCH --tasks-per-node=12 # 40 CPUs per node
##SBATCH --ntasks-per-socket=6 # 19 CPUs per socket
##SBATCH --cpus-per-task=1 # 1 CPU per task
#SBATCH --mem=200G # each node has around 380 GB, do not reserve all memory if you don't need all of it
##SBATCH --nodelist=node22  # exactly which node to use, node 22 -> test node
##SBATCH --exclude=node21   # which node to exclude
#SBATCH --gres=gpu:3

#module load lang/python/3.12.3
#module load cuda/11.8

# Source the Conda script to enable the `conda activate` command
source ~/miniconda3/etc/profile.d/conda.sh  # Update this to your actual conda.sh path

# Activate the specific Conda environment
conda activate metabolomicscuda
#conda activate wearmepylow

scratch_dir=/scratch/jdvillegas/ptfi_$SLURM_JOB_ID

mkdir -p $scratch_dir;
cp -r /users/jdvillegas/ptfi-frijol-pujc-v2/classes/InputManagement.py $scratch_dir
cp -r /users/jdvillegas/ptfi-frijol-pujc-v2/classes/MolecularStructureDictionary.py $scratch_dir
cp -r /users/jdvillegas/ptfi-frijol-pujc-v2/classes/MolecularStructureTeFT.py $scratch_dir
cp -r /users/jdvillegas/ptfi-frijol-pujc-v2/classes/MolecularStructureTeFTWithBias.py $scratch_dir
cp -r /users/jdvillegas/ptfi-frijol-pujc-v2/classes/MolecularStructureTeFTWithIntensity.py $scratch_dir
cp -r /users/jdvillegas/ptfi-frijol-pujc-v2/classes/TrainWorkFlowWithIntensity.py $scratch_dir
cp -r /users/jdvillegas/ptfi-frijol-pujc-v2/classes/LossFunctions.py $scratch_dir
cp -r /users/jdvillegas/ptfi-frijol-pujc-v2/classes/Plotter.py $scratch_dir
cp -r /users/jdvillegas/ptfi-frijol-pujc-v2/classes/SignalMath.py $scratch_dir
cp -r /users/jdvillegas/ptfi-frijol-pujc-v2/classes/TrainWorkFlowV2.py $scratch_dir
cp -r /users/jdvillegas/ptfi-frijol-pujc-v2/scripts/train_teft_original_data_datadriven_tokenizer_v2.py $scratch_dir
cd $scratch_dir

# Run your Python script
#time python train_teft_original_data_datadriven_tokenizer_v2.py
time python train_teft_original_data_datadriven_tokenizer_v2.py --max_sentence_size 100 --tokenizer DataDrivenTokenizer --dir_vocab /users/jdvillegas/ptfi-frijol-pujc-v2/data --batch_size 246 --device cuda --loss_function CrossEntropy --architecture MolecularStructureTeFTWithIntensity --model_name dd_su_arch3_lf1_code_pl

#mkdir $SLURM_SUBMIT_DIR/TeFT_DD_wo_reg_v2_nopad_nobal_dir_attn_$SLURM_JOB_ID
#cp -r * $SLURM_SUBMIT_DIR/TeFT_DD_wo_reg_v2_nopad_nobal_dir_attn_$SLURM_JOB_ID

cp -r $scratch_dir /users/jdvillegas/slurm_results/ptfi/$SLURM_JOB_ID

echo "TrainWorkflowV2 LF1 Arch3 (Intensity gating + Bias in Attention)" > /users/jdvillegas/slurm_results/ptfi/$SLURM_JOB_ID/DescriptionBatchJob.txt
#rm -r $scratch_dir

# Deactivate the Conda environment
conda deactivate
```


# Predict

SMILES prediction is performed by the `PredictWorkflow` class in `Workflow.py`.

The call below outlines the arguments of `predict`; each type stands in for an actual value:

```python
from ptfifrijolpujc import PredictWorkflow

pw = PredictWorkflow()
pw.predict( List[List], # List of lists of intensities
            List[List], # List of lists of m/z values
            smiles=List, # List of SMILES
            full_path_to_mdl=str, # full path to the trained model
            device=str, # either 'cpu' or 'cuda'
            output_dir=str, # path to the directory where results are saved
            num_pred=int, # number of SMILES predictions per MS/MS
            predict_approach=str, # either "beam_search" or "greedy"
            predict_approach_args=dict|None) # Only used when `predict_approach` is "beam_search": a dict with the single key `beam_size` (e.g. {'beam_size': 10})
```
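
The decoding options are easy to get wrong, so it can help to validate them before calling `predict`. The sketch below uses a hypothetical helper, `build_predict_kwargs`, that is not part of the package. It assumes, in line with the `--beam_size` option of the prediction script described further down, that `beam_size` applies only to beam search and that greedy decoding takes no extra arguments.

```python
from typing import Optional


def build_predict_kwargs(predict_approach: str,
                         beam_size: Optional[int] = None) -> dict:
    """Hypothetical helper: check the decoding options and build keyword arguments."""
    if predict_approach not in ("greedy", "beam_search"):
        raise ValueError(f"Unknown predict_approach: {predict_approach!r}")
    if predict_approach == "beam_search":
        # Assumption: beam search needs a positive beam size.
        if beam_size is None or beam_size < 1:
            raise ValueError("beam_search requires a positive beam_size")
        approach_args = {"beam_size": beam_size}
    else:
        # Assumption: greedy decoding ignores predict_approach_args.
        approach_args = None
    return {"predict_approach": predict_approach,
            "predict_approach_args": approach_args}


kwargs = build_predict_kwargs("beam_search", beam_size=10)
print(kwargs)
# The result could then be unpacked into the call, for example:
# pw.predict(intensities, mzs, smiles=smiles, ..., **kwargs)
```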


Working prediction examples are available in `predict_known_metabolites_dec_2024.py` and `predict_CASMI_2022_modular.py`.

- `predict_known_metabolites_dec_2024.py`: runs predictions for a list of 314 known metabolites stored in the file `path/to/the/repo/ptfi-frijol-pujc/data/Espectros_conocidos 1.xlsx`.

To check the input parameters of `predict_known_metabolites_dec_2024.py`, set the current working directory to the `scripts` folder and execute:

```bash
python predict_known_metabolites_dec_2024.py --help
```

- *-h*, *--help*: show this help message and exit
- *--dir_path*: Directory path containing the Excel file
- *--filename*: Excel filename
- *--path_to_mdl*: Full path to the model directory
- *--trained_model_name*: Name of the trained model file
- *--output_dir*: Output directory for results
- *--num_pred*: Number of predictions
- *--predict_approach*: Prediction approach is either 'greedy' or 'beam_search'
- *--device*: 'cpu' or 'cuda'
- *--beam_size*: [OPTIONAL] Beam size, only used when `--predict_approach` is 'beam_search'. On the command line it is passed as an integer (e.g. `--beam_size 5`).

Specific example of use:

```bash 
python predict_known_metabolites_dec_2024.py --dir_path /home/julian/Documents/repos/ptfi-frijol-pujc/data --filename "Espectros_conocidos 1.xlsx" --path_to_mdl /home/julian/Documents/FTP/ptfi/models --trained_model_name dd_arch1_lf1_data_1.pth --output_dir /home/julian/Documents/repos/ptfi-frijol-pujc/results/kn-dd-su-arch1-lf1-test --num_pred 10 --device cpu --predict_approach beam_search --beam_size 5
```
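
The filename `Espectros_conocidos 1.xlsx` contains a space, so it has to reach the script as a single argument. The short sketch below uses Python's standard `shlex` module to show how a shell-style command line is split with and without quoting. Only a few of the flags are included to keep it short.

```python
import shlex

# Each list element is exactly one argument, so the space in the filename is safe.
argv = [
    "python", "predict_known_metabolites_dec_2024.py",
    "--filename", "Espectros_conocidos 1.xlsx",
    "--num_pred", "10",
]

# shlex.join adds the quoting a POSIX shell needs to keep the filename together.
command = shlex.join(argv)
print(command)

# Splitting the quoted command gives back the original argument list.
round_trip = shlex.split(command)

# Without quotes the filename is broken into two separate arguments.
unquoted = "python predict_known_metabolites_dec_2024.py --filename Espectros_conocidos 1.xlsx"
broken = shlex.split(unquoted)
print(broken[-2:])
```

When launching the script from Python, passing the `argv` list directly to `subprocess.run` avoids shell quoting altogether.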