# Run Spec2Vec Training Flow Locally

This notebook shows how to run the Spec2Vec training flow locally. The output is the path to the trained model.
![spec2vec_training_flow](spec2vec_training_flow.png)

## Imports

In [1]:
from omigami.spectra_matching.spec2vec.factory import Spec2VecFlowFactory
import mlflow

## Build Training Flow

You can adjust the following parameters:
- `flow_name`: Flow name
- `iterations`: Spec2Vec model parameter
- `window`: Spec2Vec model parameter
- `intensity_weighting_power`: Spec2Vec model parameter
- `allowed_missing_percentage`: Spec2Vec model parameter
- `n_decimals`: Spec2Vec model parameter
- `dataset_id`: dataset to train the Spec2Vec model on. The data is downloaded from a URL that depends on the option chosen. Available options are:
  - `small`: data from https://raw.githubusercontent.com/MLOps-architecture/share/main/test_data/SMALL_GNPS.json. This data is not up-to-date with GNPS.
  - `small_500`: data from https://raw.githubusercontent.com/MLOps-architecture/share/main/test_data/SMALL_GNPS_500_spectra.json. This data is not up-to-date with GNPS.
  - `10k`: This dataset has no URL of its own; it uses the first 10k spectra from GNPS. This data is not up-to-date with GNPS.
  - `complete`: data from https://gnps-external.ucsd.edu/gnpslibrary/ALL_GNPS.json. This is always up-to-date with GNPS.
- `ion_mode`: `"positive"` or `"negative"`

The rest of the parameters can stay as they are, since they relate to the tools used in developing the flow.
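As a quick sanity check before building the flow, the `dataset_id` and `ion_mode` options listed above can be validated up front. The sketch below is only an illustration: `DATASET_URLS`, `ION_MODES`, and `check_training_options` are hypothetical names that are not part of omigami, and the URLs are simply copied from the list above.

```python
from typing import Optional

# Dataset options and URLs copied from the parameter list above.
# "10k" has no URL of its own, so it maps to None.
DATASET_URLS = {
    "small": "https://raw.githubusercontent.com/MLOps-architecture/share/main/test_data/SMALL_GNPS.json",
    "small_500": "https://raw.githubusercontent.com/MLOps-architecture/share/main/test_data/SMALL_GNPS_500_spectra.json",
    "10k": None,
    "complete": "https://gnps-external.ucsd.edu/gnpslibrary/ALL_GNPS.json",
}
ION_MODES = ("positive", "negative")


def check_training_options(dataset_id: str, ion_mode: str) -> Optional[str]:
    """Hypothetical helper: validate both options and return the dataset URL, if any."""
    if dataset_id not in DATASET_URLS:
        raise ValueError(
            f"Unknown dataset_id {dataset_id!r}; expected one of {sorted(DATASET_URLS)}"
        )
    if ion_mode not in ION_MODES:
        raise ValueError(f"Unknown ion_mode {ion_mode!r}; expected one of {ION_MODES}")
    return DATASET_URLS[dataset_id]
```

Calling `check_training_options("small", "positive")` before `build_training_flow` would surface a mistyped option early instead of partway through the flow.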

In [2]:
factory = Spec2VecFlowFactory()
flow = factory.build_training_flow(
    flow_name="spec2vec",
    iterations=5,
    window=10,
    intensity_weighting_power=0.5,
    allowed_missing_percentage=5.0,
    n_decimals=2,
    dataset_id="small",
    ion_mode="positive",
    project_name="spec2vec-positive",
    image="image",
)
flow



<Flow: name="spec2vec">

## Run Training Flow

In [3]:
flow_run = flow.run()

[2022-01-12 16:18:04+0300] INFO - prefect.FlowRunner | Beginning Flow run for 'spec2vec'
[2022-01-12 16:18:04+0300] INFO - prefect.TaskRunner | Task 'DownloadData': Starting task run...
[2022-01-12 16:18:04+0300] INFO - prefect.TaskRunner | Task 'DownloadData': Finished task run for task with final state: 'Cached'
[2022-01-12 16:18:04+0300] INFO - prefect.TaskRunner | Task 'CreateChunks': Starting task run...
[2022-01-12 16:18:04+0300] INFO - prefect.TaskRunner | Task 'CreateChunks': Finished task run for task with final state: 'Cached'
[2022-01-12 16:18:04+0300] INFO - prefect.TaskRunner | Task 'CleanRawSpectra': Starting task run...
[2022-01-12 16:18:04+0300] INFO - prefect.TaskRunner | Task 'CleanRawSpectra': Finished task run for task with final state: 'Mapped'
[2022-01-12 16:18:04+0300] INFO - prefect.TaskRunner | Task 'CacheCleanedSpectra': Starting task run...
[2022-01-12 16:18:04+0300] INFO - prefect.TaskRunner | Task 'CacheCleanedSpectra': Finished task run for task with final

Registered model 'spec2vec-model' already exists. Creating a new version of this model...
2022/01/12 16:18:12 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: spec2vec-model, version 3


[2022-01-12 16:18:12+0300] INFO - prefect.RegisterModel | Created model run_id: a04ba1d61b1c44f2aa8d3f0e23866113.
[2022-01-12 16:18:12+0300] INFO - prefect.TaskRunner | Task 'RegisterModel': Finished task run for task with final state: 'Success'
[2022-01-12 16:18:12+0300] INFO - prefect.TaskRunner | Task 'CacheCleanedSpectra[0]': Starting task run...
[2022-01-12 16:18:12+0300] INFO - prefect.CacheCleanedSpectra | Loading spectra from /Users/cereniyim/GitHub/omigami-core/local-deployment/results/datasets/small/cleaned/positive/chunk_0.pickle
[2022-01-12 16:18:12+0300] INFO - prefect.CacheCleanedSpectra | Finished loading file. File contains 100 spectra.
[2022-01-12 16:18:12+0300] INFO - prefect.CacheCleanedSpectra | There is no new spectra to save.
[2022-01-12 16:18:12+0300] INFO - prefect.TaskRunner | Task 'CacheCleanedSpectra[0]': Finished task run for task with final state: 'Success'


Created version '3' of model 'spec2vec-model'.


[2022-01-12 16:18:12+0300] INFO - prefect.FlowRunner | Flow run SUCCESS: all reference tasks succeeded
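Instead of reading the log line by line, the final task states can be tallied from the object returned by `flow.run()`. The sketch below assumes only that `flow_run.result` is a dict mapping tasks to their final state objects, which is how the next cell of this notebook uses it. `summarize_task_states` is a hypothetical helper, and the `Success` and `Cached` classes are stand-ins so the snippet runs without Prefect installed.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_task_states(flow_state):
    """Count final task states by class name (hypothetical helper).

    Assumes `flow_state.result` is a dict of task -> state object.
    """
    return Counter(type(state).__name__ for state in flow_state.result.values())


# Stand-in state classes and flow state, used only to demonstrate the helper.
class Success:
    pass


class Cached:
    pass


demo_state = SimpleNamespace(
    result={
        "DownloadData": Cached(),
        "RegisterModel": Success(),
        "CacheCleanedSpectra": Success(),
    }
)
summary = summarize_task_states(demo_state)
print(dict(summary))
```

With the real run, the equivalent call would be `summarize_task_states(flow_run)`.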


## Output Model Path

In [4]:
register_task = flow.get_tasks("RegisterModel")[0]
run_id = flow_run.result[register_task].result

artifact_uri = mlflow.get_run(run_id).info.artifact_uri
model_uri = f"{artifact_uri}/model/python_model.pkl"

In [5]:
print(f"Spec2Vec model is available at: {model_uri}")

Spec2Vec model is available at: /Users/cereniyim/GitHub/omigami-core/local-deployment/results/mlflow/a04ba1d61b1c44f2aa8d3f0e23866113/artifacts/model/python_model.pkl
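To open the pickled model from Python, the printed URI needs to be a local filesystem path. In the run above it already is one, but MLflow artifact URIs for a local store can also carry a `file://` scheme. The sketch below handles both cases with the standard library only. `local_model_path` is a hypothetical helper, and it assumes POSIX-style paths like the one shown above.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def local_model_path(model_uri: str) -> Path:
    """Hypothetical helper: turn a local model URI into a filesystem path.

    Accepts plain POSIX paths and file:// URIs; rejects remote schemes.
    """
    parsed = urlparse(model_uri)
    if parsed.scheme == "file":
        # Decode percent-escapes such as %20 in file:// URIs.
        return Path(unquote(parsed.path))
    if parsed.scheme == "":
        return Path(model_uri)
    raise ValueError(f"Not a local URI: {model_uri!r}")
```

After training, `local_model_path(model_uri).exists()` gives a quick check that the model file was actually written.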
