# Run MS2DeepScore Training Flow Locally

This notebook shows how to run the MS2DeepScore training flow locally. The output is the path to the trained model.

## Imports

In [1]:
from omigami.spectra_matching.ms2deepscore.factory import MS2DeepScoreFlowFactory
import mlflow

## Build Training Flow

You can adjust the following parameters:
- `flow_name`: name of the flow
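- `fingerprint_n_bits`: MS2DeepScore model parameter (number of bits in the molecular fingerprints used for the Tanimoto scores)
- `scores_decimals`: MS2DeepScore model parameter (number of decimals the Tanimoto scores are rounded to)
- `spectrum_binner_n_bins`: MS2DeepScore model parameter (number of bins used when binning spectra)
- `epochs`: MS2DeepScore model parameter (number of training epochs)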
- `dataset_id`: dataset to train the MS2DeepScore model on. The data is downloaded from the URL associated with the chosen option. Available options are:
  - `small`: data from https://raw.githubusercontent.com/MLOps-architecture/share/main/test_data/SMALL_GNPS.json. This data is not up to date with GNPS.
  - `small_500`: data from https://raw.githubusercontent.com/MLOps-architecture/share/main/test_data/SMALL_GNPS_500_spectra.json. This data is not up to date with GNPS.
  - `10k`: this dataset has no URL of its own; it uses the first 10k spectra from GNPS. This data is not up to date with GNPS.
  - `complete`: data from https://gnps-external.ucsd.edu/gnpslibrary/ALL_GNPS.json. This data is always up to date with GNPS.
- `ion_mode`: `"positive"` or `"negative"`
- `train_ratio`: fraction of the dataset (for example `0.8`) used to train the model
- `validation_ratio`: fraction of the dataset used to validate the model
- `test_ratio`: fraction of the dataset used to test the model

The remaining parameters can stay as they are, since they relate to the tools used to develop the flow.
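Several of these parameters accept only specific values, so it can help to check them before building the flow. The sketch below is a hypothetical helper (`check_flow_parameters` is not part of omigami) that validates `dataset_id` and `ion_mode` against the options listed above. It also assumes each ratio is a number between 0 and 1, as in the values used in this notebook.

```python
# Allowed values taken from the parameter list in this notebook.
VALID_DATASET_IDS = {"small", "small_500", "10k", "complete"}
VALID_ION_MODES = {"positive", "negative"}


def check_flow_parameters(dataset_id, ion_mode, train_ratio, validation_ratio, test_ratio):
    """Return a list of problems; an empty list means the values look valid."""
    problems = []
    if dataset_id not in VALID_DATASET_IDS:
        problems.append(f"unknown dataset_id: {dataset_id!r}")
    if ion_mode not in VALID_ION_MODES:
        problems.append(f"unknown ion_mode: {ion_mode!r}")
    ratios = {
        "train_ratio": train_ratio,
        "validation_ratio": validation_ratio,
        "test_ratio": test_ratio,
    }
    for name, value in ratios.items():
        # Assumption: every ratio is a fraction in the closed interval [0, 1].
        if not 0 <= value <= 1:
            problems.append(f"{name} must be between 0 and 1, got {value}")
    return problems


# The values used later in this notebook pass the check.
print(check_flow_parameters("small_500", "positive", 0.8, 0.2, 0.2))
```

The helper returns a list instead of raising on the first error, so all problems can be reported at once before the (potentially slow) flow is built.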

In [2]:
from pathlib import Path
storage_root = Path.cwd() / "results"

In [3]:
factory = MS2DeepScoreFlowFactory(
    storage_root=storage_root,
    model_registry_uri="sqlite:///mlflow.sqlite",
    mlflow_output_directory=storage_root / "ms2deepscore/models"
)
flow = factory.build_training_flow(
    flow_name="ms2deepscore",
    fingerprint_n_bits=2048,
    scores_decimals=5,
    spectrum_binner_n_bins=10000,
    epochs=5,
    dataset_id="small_500",
    ion_mode="positive",
    train_ratio=0.8,
    validation_ratio=0.2,
    test_ratio=0.2,
    image="image",
    project_name="ms2deepscore-positive",
    spectrum_ids_chunk_size=100,
)
flow

<Flow: name="ms2deepscore">

## Run Training Flow

In [4]:
flow_run = flow.run()

[2022-01-13 12:08:15-0300] INFO - prefect.FlowRunner | Beginning Flow run for 'ms2deepscore'
[2022-01-13 12:08:15-0300] INFO - prefect.TaskRunner | Task 'DownloadData': Starting task run...
[2022-01-13 12:08:15-0300] INFO - prefect.TaskRunner | Task 'DownloadData': Finished task run for task with final state: 'Cached'
[2022-01-13 12:08:15-0300] INFO - prefect.TaskRunner | Task 'CreateChunks': Starting task run...
[2022-01-13 12:08:15-0300] INFO - prefect.TaskRunner | Task 'CreateChunks': Finished task run for task with final state: 'Cached'
[2022-01-13 12:08:15-0300] INFO - prefect.TaskRunner | Task 'CleanRawSpectra': Starting task run...
[2022-01-13 12:08:15-0300] INFO - prefect.TaskRunner | Task 'CleanRawSpectra': Finished task run for task with final state: 'Mapped'
[2022-01-13 12:08:15-0300] INFO - prefect.TaskRunner | Task 'CleanRawSpectra[0]': Starting task run...
[2022-01-13 12:08:15-0300] INFO - prefect.CleanRawSpectra | Loading spectra from /Users/czanella/dev/datarevenue/omig

Spectrum binning: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 336/336 [00:00<00:00, 5058.48it/s]
Create BinnedSpectrum instances: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 336/336 [00:00<00:00, 60225.90it/s]

[2022-01-13 12:08:15-0300] INFO - prefect.ProcessSpectrum | Finished processing 336 binned spectra. Saving into spectrum database.
[2022-01-13 12:08:15-0300] INFO - prefect.TaskRunner | Task 'ProcessSpectrum': Finished task run for task with final state: 'Success'
[2022-01-13 12:08:15-0300] INFO - prefect.TaskRunner | Task 'CalculateTanimotoScore': Starting task run...
[2022-01-13 12:08:15-0300] INFO - prefect.CalculateTanimotoScore | Calculating the Tanimoto Scores





[2022-01-13 12:08:15-0300] INFO - prefect.CalculateTanimotoScore | Calculating Tanimoto scores for 147 unique InChIkeys
[2022-01-13 12:08:16-0300] INFO - prefect.TaskRunner | Task 'CalculateTanimotoScore': Finished task run for task with final state: 'Success'
[2022-01-13 12:08:16-0300] INFO - prefect.TaskRunner | Task 'TrainModel': Starting task run...
The value for batch_size is set from 32 (default) to 32
117 out of 147 InChIKeys found in selected spectrums.
The value for batch_size is set from 32 (default) to 32
29 out of 147 InChIKeys found in selected spectrums.
The value for batch_size is set from 32 (default) to 32
1 out of 147 InChIKeys found in selected spectrums.
[2022-01-13 12:08:16-0300] INFO - prefect.TrainModel | 268 spectra in training data 
[2022-01-13 12:08:16-0300] INFO - prefect.TrainModel | 66 spectra in validation data 
[2022-01-13 12:08:16-0300] INFO - prefect.TrainModel | 2 spectra in test data 
Epoch 1/5


2022-01-13 12:08:16.906276: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-01-13 12:08:16.930468: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:258] None of the MLIR optimization passes are enabled (registered 0 passes)


Epoch 2/5
1/8 [==>...........................] - ETA: 0s - batch: 0.0000e+00 - size: 32.0000 - loss: 0.0955



Epoch 3/5
Epoch 4/5
Epoch 5/5
[2022-01-13 12:08:20-0300] INFO - prefect.TrainModel | Saving trained model to /Users/czanella/dev/datarevenue/omigami-core/notebooks/training/results/ms2deepscore/tmp/387cab63-c30a-4f8f-9989-bf39f6db841e/ms2deep_score.hdf5.
[2022-01-13 12:08:20-0300] INFO - prefect.TaskRunner | Task 'TrainModel': Finished task run for task with final state: 'Success'
[2022-01-13 12:08:20-0300] INFO - prefect.TaskRunner | Task 'RegisterModel': Starting task run...
[2022-01-13 12:08:20-0300] INFO - prefect.RegisterModel | Registering model on URI sqlite:///mlflow.sqlite on path: /Users/czanella/dev/datarevenue/omigami-core/notebooks/training/results/ms2deepscore/models.


Registered model 'MS2DeepScore-positive' already exists. Creating a new version of this model...
2022/01/13 12:08:21 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: MS2DeepScore-positive, version 2


[2022-01-13 12:08:21-0300] INFO - prefect.RegisterModel | Created model run_id: c204b0952fc24d7c99faf3220707f9da.
[2022-01-13 12:08:21-0300] INFO - prefect.TaskRunner | Task 'RegisterModel': Finished task run for task with final state: 'Success'
[2022-01-13 12:08:21-0300] INFO - prefect.FlowRunner | Flow run SUCCESS: all reference tasks succeeded


Created version '2' of model 'MS2DeepScore-positive'.
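The log above reports 117, 29, and 1 of 147 InChIKeys for the three data splits, although the configured ratios (0.8, 0.2, 0.2) sum to 1.2. One way to arrive at those numbers is a sequential split in which the test set receives whatever remains after the training and validation shares are taken. The sketch below illustrates that arithmetic only. `split_counts` is a hypothetical helper, and the sequential behaviour is an assumption inferred from the log, not a confirmed description of the flow's splitting code.

```python
def split_counts(n_items, train_ratio, validation_ratio):
    """Sequential split: train first, then validation, test gets the remainder."""
    n_train = int(n_items * train_ratio)            # rounded down
    n_validation = int(n_items * validation_ratio)  # rounded down
    n_test = max(n_items - n_train - n_validation, 0)
    return n_train, n_validation, n_test


# 147 unique InChIKeys, as reported by the CalculateTanimotoScore task.
print(split_counts(147, 0.8, 0.2))  # prints (117, 29, 1)
```

Under this assumption, `test_ratio` has no effect once the first two ratios already cover the whole dataset, which would explain why only 2 spectra end up in the test data.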


## Output Model Path

In [5]:
register_task = flow.get_tasks("RegisterModel")[0]
run_id = flow_run.result[register_task].result

artifact_uri = mlflow.get_run(run_id).info.artifact_uri
model_uri = f"{artifact_uri}/model/python_model.pkl"

In [6]:
print(f"MS2DeepScore model is available at: {model_uri}")

MS2DeepScore model is available at: /Users/czanella/dev/datarevenue/omigami-core/notebooks/training/results/ms2deepscore/models/c204b0952fc24d7c99faf3220707f9da/artifacts/model/python_model.pkl
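Before handing the model path to another tool, it can be useful to confirm that it points at a local file. In this run the artifact URI is a plain filesystem path, but MLflow can also return `file://` or remote URIs (such as `s3://`), depending on how the tracking store is configured. The sketch below uses only the standard library. `artifact_uri_to_local_path` is a hypothetical helper and assumes POSIX-style paths, and the example URI is made up for illustration.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def artifact_uri_to_local_path(uri: str) -> Path:
    """Convert a plain path or file:// URI to a Path; reject remote URIs."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("", "file"):
        raise ValueError(f"Not a local artifact URI: {uri}")
    # unquote() restores characters such as spaces that were percent-encoded.
    return Path(unquote(parsed.path))


# Illustrative URI only; in the notebook you would pass `artifact_uri` instead.
artifacts_dir = artifact_uri_to_local_path("file:///tmp/models/abc/artifacts")
model_file = artifacts_dir / "model" / "python_model.pkl"
print(model_file)
print(model_file.exists())
```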
