# Introduction
The following modules run within the Conda environment `MapleEmbedder-cu11.7`, which is configured for MAPLE’s deep-learning-based inference workflows. These modules embed mass spectrometry features (MS<sup>1</sup> peaks and MS<sup>2</sup> fragmentation patterns) into vector representations using pretrained Graphormer models. The provided pipelines process mzXML files after peak picking and prepare the data for downstream metabolomic inference and knowledge graph integration.

# Embedding MS<sup>1</sup> Signals
MAPLE generates embeddings that capture contextual relationships between co-eluting metabolites, enabling the construction of MS<sup>1</sup>-level similarity networks. These networks can be used to assess metabolomic uniqueness across taxa and prioritize lineage-specific chemical signatures.

We currently host the Qdrant vector databases on a cloud-based virtual machine. Because this instance is shared and subject to platform resource limits, performance varies with concurrent usage, network latency, and API call volume, and response times may be slower during peak periods. For faster and more reliable performance, we recommend deploying a local instance of the vector database, which lets you allocate dedicated CPU resources and avoids the bottlenecks of shared infrastructure. Please refer to the README for setup instructions.

All MS<sup>1</sup>-level peaks in the vector database were generated using the same chromatographic method and instrumentation. To ensure accurate retrieval of related peaks, query peaks must be acquired under similar experimental conditions. Related peaks are matched on retention time (±30 seconds), monoisotopic mass (10 ppm tolerance), and embedding similarity (the 100 most contextually similar metabolites). The following code returns the current number of peaks stored in the database.

In [1]:
from qdrant_client import QdrantClient

# By default, connect to the shared Google Cloud VM
client = QdrantClient(host="34.66.123.176", port=6333)
info = client.get_collection("ms1_full_collection")
num_points = info.points_count
print(f"Number of MS1 Peaks: {num_points}")

Number of MS1 Peaks: 29740226
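
The retention-time and mass criteria described above can be illustrated with a small, self-contained sketch. This is not MAPLE's implementation: the function names, the `"mz"`/`"rt"` dictionary keys, and the choice to compute the ppm error relative to the reference mass are all assumptions made for illustration.

```python
def ppm_error(query_mz: float, reference_mz: float) -> float:
    """Signed mass error of the query relative to the reference, in ppm."""
    return (query_mz - reference_mz) / reference_mz * 1e6


def passes_mass_rt_filter(query, reference, rt_tol_s=30.0, ppm_tol=10.0):
    """Return True if the reference peak lies inside both tolerance windows.

    `query` and `reference` are hypothetical dicts with keys "mz" and "rt"
    (retention time in seconds); MAPLE's internal field names may differ.
    """
    within_rt = abs(query["rt"] - reference["rt"]) <= rt_tol_s
    within_mass = abs(ppm_error(query["mz"], reference["mz"])) <= ppm_tol
    return within_rt and within_mass


# Made-up peaks used only to exercise the filter.
query = {"mz": 500.0000, "rt": 300.0}
candidates = [
    {"mz": 500.0040, "rt": 310.0},  # ~8 ppm and 10 s apart: kept
    {"mz": 500.0100, "rt": 305.0},  # ~20 ppm: rejected on mass
    {"mz": 500.0010, "rt": 350.0},  # 50 s apart: rejected on retention time
]
matches = [c for c in candidates if passes_mass_rt_filter(query, c)]
print(len(matches))
```

In the database query, candidates that pass both windows would then be ranked by embedding similarity, with the 100 most similar retained.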


Using the JSON output from the peak picking module, MS<sup>1</sup>-level contextual embeddings can be computed. These embeddings aim to capture the biological context of each MS feature, providing an additional dimension of comparison, alongside mass and retention time, for identifying related signals.

In [2]:
from Maple.Embedder import run_MS1Former_on_mzXML

run_MS1Former_on_mzXML(
    peaks_fp="sample_output/20109_peaks.json", # input data
    output_fp="sample_output/20109_MS1Former_embeddings.pkl",
    gpu_id=0
)

The spectrum-level embedding, stored under the `spectra_embedding` key, can be used to compare metabolic profiles and to visualize sample clustering by taxonomic relatedness using UMAP. An example is shown below.

In [3]:
import pickle

with open("sample_output/20109_MS1Former_embeddings.pkl", "rb") as f:
    data = pickle.load(f)
print(data['spectra_embedding'])


[-3.8981495e+00 -1.3277055e+00  1.6500206e+00  2.9224319e+00
  1.5815086e+00 -3.9603739e+00  6.6490784e+00 -4.0497327e+00
  6.8188375e-01 -2.1867735e+00 -5.5364203e+00  6.8557801e+00
 -6.7879987e+00  1.8217114e+00 -1.8177940e+00 -4.3249135e+00
 -1.2523000e-01 -1.6304187e-01  8.0084810e+00 -3.2079360e+00
  5.5583405e+00  3.8594007e+00 -5.6871185e+00 -5.9769100e-01
  6.0343027e-01  3.7484665e+00  4.4908389e-01 -1.9937955e+00
 -1.3957467e+00  1.7780576e+00 -3.9347951e+00  5.6917858e-01
 -3.6425056e+00  1.1756643e+00  9.1851330e-01  1.8546214e+00
 -5.3653449e-01 -3.6172569e+00  6.1181359e+00 -6.2345564e-01
 -9.8220104e-01  3.3371737e+00 -5.5778213e+00  4.2966428e+00
 -2.4021327e+00 -3.3779266e-01  3.3927186e+00 -6.6053414e+00
 -8.2158737e+00 -5.2275443e+00  1.9655819e+00  2.3178278e-01
  6.3059753e-01 -5.4222226e+00 -4.7024903e+00  3.3142669e+00
 -4.8132739e+00  3.8984404e+00  3.8266685e+00 -2.3295348e+00
  2.2599087e+00 -2.1927767e-01 -7.2922057e-01 -4.0921397e+00
 -2.1875234e+00  3.62838

The peak embeddings are stored as a dictionary and can be referenced using the `peak_id` from the original input file. While this is an intermediate file, it’s useful to understand its structure so you can access the embedded vectors for any downstream analysis.

In [4]:
import pickle

with open("sample_output/20109_MS1Former_embeddings.pkl", "rb") as f:
    data = pickle.load(f)

print("Embedding for MS1 Peak ID 1310723")
print(data['peak_embeddings'][1310723])


Embedding for MS1 Peak ID 1310723
[ -2.9566813   -0.61362964   2.7477226    3.4256096    2.1508465
  -5.2667255    8.466292    -3.9359527    0.6548786   -2.135022
  -4.593832     7.0485954   -8.525942     2.7478826   -2.408204
  -3.5771382   -0.632742     0.96321845   7.6369667   -4.463799
   6.4599214    4.8708143   -6.548872    -1.8393247    1.8864954
   4.4927397    1.2182733   -4.112611    -1.5134687   -0.08773378
  -5.456712     1.0186031   -3.7417157    0.99092036   1.6092199
   2.4876695   -0.41741157  -2.894666     5.1541023   -0.82748556
  -0.79933286   3.6033702   -6.0620866    4.395275    -2.0441446
  -1.2979503    3.5587027   -6.5615954   -9.203628    -5.8445168
   2.8178449   -0.14990334   0.6845444   -5.67352     -5.6418915
   5.2566533   -4.1525936    3.7429066    5.209329    -2.9008915
   1.0177721    0.50938565  -2.6087213   -6.738433    -3.2829165
   2.8265395   -0.9556246   -0.20054305   0.7810776    1.4643962
  -1.4790617    0.2756552   -3.5523725   -1.7743621    1.
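
As a minimal sketch of downstream use, the dictionary structure described above makes it straightforward to rank peaks within a sample by embedding similarity. The example uses a toy stand-in for `data['peak_embeddings']` and assumes cosine similarity as the comparison metric; the source does not state which metric MAPLE uses for MS<sup>1</sup> embeddings, and the helper names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar_peaks(peak_embeddings, query_id, top_k=3):
    """Rank every other peak by cosine similarity to `query_id`."""
    query_vec = peak_embeddings[query_id]
    scored = [
        (other_id, cosine_similarity(query_vec, vec))
        for other_id, vec in peak_embeddings.items()
        if other_id != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional stand-in for data["peak_embeddings"]; real vectors are longer.
toy_embeddings = {
    1310723: [1.0, 0.0, 0.0],
    616210: [0.9, 0.1, 0.0],
    616233: [0.0, 1.0, 0.0],
}
ranking = most_similar_peaks(toy_embeddings, 1310723)
print(ranking[0][0])
```

The same functions accept the NumPy arrays stored in the pickle file, since they only iterate over the vector elements.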

After generating MS<sup>1</sup> embeddings, run the following command to compute taxonomy consistency scores. Query taxonomic labels must follow the naming conventions in the taxonomy tables located at `Maple/Embedder/dat/taxonomy_tables`. A score of 1 indicates that a signal was found exclusively within the query taxon at that taxonomic level. A score of `None` means that no overlapping signals were detected in the current LC–MS/MS database, so a reliable score could not be computed.

Running this module on 1,000 peaks takes approximately 1 hour on the cloud VM, because each peak is compared against ~30 million MS<sup>1</sup> entries in the database. To reduce runtime, use the `peak_ids` parameter to restrict the analysis to specific peaks of interest from your sample.

In [5]:
from Maple.Embedder import annotate_mzXML_with_tax_scores

annotate_mzXML_with_tax_scores(
    peaks_fp="sample_output/20109_peaks.json", # input data
    ms1_emb_fp="sample_output/20109_MS1Former_embeddings.pkl", # input data
    peak_ids=[1310723, 616210, 616233],
    output_fp="sample_output/20109_MS1Former_taxscores.csv",
    query_phylum="bacteroidetes",
    query_class="sphingobacteriia",
    query_order="sphingobacteriales",
    query_family="chitinophagaceae",
    query_genus="chitinophaga",
    use_cloud_service=True
)

Loading existing collection ms1_full_collection.


100%|██████████| 3/3 [00:00<00:00, 11949.58it/s]


This is an example output from the module.

In [6]:
import pandas as pd

df = pd.read_csv("sample_output/20109_MS1Former_taxscores.csv")
display(df)

| Unnamed: 0 | peak_id | phylum_score | class_score | order_score | family_score | genus_score |
|---|---|---|---|---|---|---|
| 0 | 1310723 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 1 | 616210 | 0.62 | 0.5 | 0.5 | 0.12 | 0.12 |
| 2 | 616233 | 0.6 | 0.4 | 0.4 | 0.1 | 0.1 |
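
One way to read these scores is as the fraction of matched database peaks that come from the query taxon. The sketch below illustrates that interpretation only; it is an assumption, not a confirmed description of MAPLE's calculation, and the function name and input format are hypothetical.

```python
def taxonomy_consistency(matched_taxa, query_taxon):
    """Fraction of matched reference peaks annotated with `query_taxon`.

    Returns None when there are no matched peaks, mirroring the `None`
    described in the text for signals without overlap in the database.
    """
    if not matched_taxa:
        return None
    hits = sum(1 for taxon in matched_taxa if taxon == query_taxon)
    return round(hits / len(matched_taxa), 2)


# Hypothetical genus labels of database peaks matched to one query peak.
matched_genera = ["chitinophaga", "chitinophaga", "streptomyces", "chitinophaga"]
print(taxonomy_consistency(matched_genera, "chitinophaga"))
print(taxonomy_consistency([], "chitinophaga"))
```

Under this reading, a score of 1.0 means every matched peak belongs to the query taxon, and scores typically decrease toward finer taxonomic levels, as in the example output.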


# Embedding MS<sup>2</sup> Signals

MAPLE generates two distinct embeddings from MS fragmentation data, each fine-tuned for a specific downstream task:

1. Chemotype embedding – optimized for approximate nearest neighbor (ANN)–based prediction of biosynthetic classes. This embedding captures broader chemical features relevant to class-level annotations.
2. Analog embedding – designed to cluster structurally related derivatives within the same biosynthetic class, enabling finer-grained comparison of molecular variants.

The following modules generate these embeddings based on the specified embedding type.

In [7]:
# generating chemotype embedding

from Maple.Embedder import run_MS2Former_on_mzXML

run_MS2Former_on_mzXML(
    peaks_fp="sample_output/20109_peaks.json", # input data
    output_fp="sample_output/20109_MS2Former_chemotype_embeddings.pkl",
    embedding_type="chemotype",
    gpu_id=0,
    min_ms2=5,
)

100%|██████████| 150/150 [00:14<00:00, 10.21it/s]


In [8]:
# generating analog embedding 

from Maple.Embedder import run_MS2Former_on_mzXML

run_MS2Former_on_mzXML(
    peaks_fp="sample_output/20109_peaks.json", # input data
    output_fp="sample_output/20109_MS2Former_analog_embeddings.pkl",
    embedding_type="analog",
    gpu_id=0,
    min_ms2=5,
)

100%|██████████| 150/150 [00:14<00:00, 10.32it/s]


The following shows the first entry of the output generated by the embedding modules.

In [9]:
import pickle

with open("sample_output/20109_MS2Former_analog_embeddings.pkl", "rb") as f:
    data = pickle.load(f)
print(data[0])

{'peak_id': 868422, 'embedding': array([-0.09877048, -1.5212228 ,  1.3384156 ,  0.4698092 ,  1.1919297 ,
        1.701223  ,  0.9596497 ,  2.117453  , -0.927798  ,  0.6383174 ,
        0.29976436, -0.783218  ,  3.620066  , -0.85373783,  1.6131049 ,
        1.0558908 , -1.0299243 , -0.5408578 ,  1.5006096 ,  0.6871171 ,
        1.6958839 ,  0.9642316 ,  0.84937024,  1.1635346 ,  1.4147906 ,
       -1.6264155 ,  0.5953009 , -1.2708269 ,  1.8673004 ,  2.1998034 ,
       -0.063664  ,  0.4811926 ,  2.9784348 ,  1.7199275 ,  1.3015515 ,
       -0.5835962 ,  0.8174096 ,  1.49722   , -1.9747276 ,  1.8467396 ,
       -0.01789707, -0.19416761,  2.5114574 ,  1.0124359 ,  0.58799523,
        0.30747232, -0.847512  ,  0.77019155, -1.2247756 , -0.8647843 ,
        2.7590735 ,  0.8098427 , -0.85064805,  1.4891274 ,  1.473404  ,
        0.9046491 ,  1.8194584 ,  0.71513844, -0.86943173,  1.7123741 ,
        0.8720225 , -2.1895921 , -2.0222898 , -1.1792221 , -1.2615494 ,
        0.68223554,  0.00691056
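
Because the MS<sup>2</sup> output is a sequence of entries rather than a dictionary, looking up a specific peak requires a scan. A minimal sketch of building a lookup keyed by `peak_id` follows; the helper name is hypothetical and the toy entries stand in for the unpickled data.

```python
def index_by_peak_id(entries):
    """Build a {peak_id: embedding} lookup from a list of entry dicts."""
    return {entry["peak_id"]: entry["embedding"] for entry in entries}


# Toy stand-in for the unpickled list; real embeddings are NumPy arrays.
toy_entries = [
    {"peak_id": 868422, "embedding": [-0.10, -1.52, 1.34]},
    {"peak_id": 1466445, "embedding": [0.47, 1.19, 1.70]},
]
lookup = index_by_peak_id(toy_entries)
print(lookup[868422])
```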

Run the following code to perform ANN-based biosynthetic class prediction. It can be run directly on the peak-picking results from the mzXML file, in which case embedding generation is included in the call.

In [10]:
from Maple.Embedder import annotate_mzXML_with_chemotypes

annotate_mzXML_with_chemotypes(
    peaks_fp="sample_output/20109_peaks.json", # input data
    output_fp="sample_output/20109_MS2Former_chemotype_predictions.csv",
)

100%|██████████| 150/150 [00:14<00:00, 10.35it/s]


Loading existing collection ms2_chemotype_reference.

Alternatively, you can run it directly on the embedding output.

In [11]:
from Maple.Embedder import annotate_mzXML_with_chemotypes

annotate_mzXML_with_chemotypes(
    ms2_emb_fp="sample_output/20109_MS2Former_chemotype_embeddings.pkl", # input data
    output_fp="sample_output/20109_MS2Former_chemotype_predictions.csv",
)

Loading existing collection ms2_chemotype_reference.

This is an example output from the module. The `distance` column is the Euclidean distance to the nearest annotated data point in the vector database. The `homology` column is the fraction of the top 10 nearest neighbors that share the predicted chemotype (for example, 0.8 means 8 of 10).

In [12]:
import pandas as pd

df = pd.read_csv('sample_output/20109_MS2Former_chemotype_predictions.csv')
display(df)

| Unnamed: 0 | peak_id | label | homology | distance |
|---|---|---|---|---|
| 0 | 868422 | TypeIPolyketide | 0.8 | 11.268 |
| 1 | 1466445 | TypeIPolyketide | 0.6 | 11.985 |
| 2 | 688318 | NonRibosomalPeptide | 0.9 | 9.350 |
| 3 | 1270043 | TypeIPolyketide | 0.5 | 5.059 |
| 4 | 1220895 | TypeIPolyketide | 0.9 | 8.979 |
| ... | ... | ... | ... | ... |
| 145 | 802553 | TypeIPolyketide | 0.7 | 12.743 |
| 146 | 679713 | TypeIPolyketide | 0.7 | 11.137 |
| 147 | 1335137 | TypeIPolyketide | 0.5 | 13.450 |
| 148 | 1220478 | TypeIIPolyketide | 1.0 | 8.915 |
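
To make the `homology` and `distance` columns concrete, the sketch below summarizes a set of nearest neighbors that have already been retrieved. It assumes the predicted label is chosen by majority vote among the neighbors, which the source does not confirm; the function name and the toy neighbor list are hypothetical.

```python
from collections import Counter


def summarize_neighbors(neighbors):
    """Summarize (label, distance) pairs for the k nearest reference spectra.

    Returns (predicted_label, homology, nearest_distance), where the label is
    chosen by majority vote (an assumption, not confirmed by the source).
    """
    label_counts = Counter(label for label, _ in neighbors)
    predicted_label, votes = label_counts.most_common(1)[0]
    homology = votes / len(neighbors)
    nearest_distance = min(distance for _, distance in neighbors)
    return predicted_label, homology, nearest_distance


# Made-up (label, Euclidean distance) pairs for the 10 nearest reference spectra.
toy_neighbors = (
    [("TypeIPolyketide", 9.0 + 0.5 * i) for i in range(8)]
    + [("NonRibosomalPeptide", 13.5), ("NonRibosomalPeptide", 14.0)]
)
label, homology, distance = summarize_neighbors(toy_neighbors)
print(label, homology, distance)
```

A low homology combined with a large distance suggests that the query spectrum lies far from any well-annotated region of the reference database, so the prediction deserves more caution.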


Run the following code to perform density-based MS<sup>2</sup> clustering across multiple processed mzXML files. The method supports comparison of millions of peaks simultaneously. For optimal performance, we recommend tuning the clustering parameters (`min_cluster_size` and `n_neighbors`). Default parameters used in the study are provided.

In [13]:
from Maple.Embedder import compute_ms2_networks_from_mzXMLs

out = compute_ms2_networks_from_mzXMLs(
    ms2_emb_fps=[
        'sample_output/20109_MS2Former_analog_embeddings.pkl'
    ],
    output_fp="sample_output/example_MS2Former_analog_predictions.csv",
    n_neighbors=15,
    min_cluster_size=5,
)

Running Dimension Reduction ...
Took 3.92 seconds
Running Soft Clustering ...
Took 0.03 seconds


This is an example output from the module. Each peak is assigned a `family_id`, and the `source` column records the embedding file the peak came from.

In [14]:
import pandas as pd

df = pd.read_csv("sample_output/example_MS2Former_analog_predictions.csv")
display(df)

| Unnamed: 0 | source | peak_id | family_id |
|---|---|---|---|
| 0 | 20109_MS2Former_analog_embeddings.pkl | 868422 | 0 |
| 1 | 20109_MS2Former_analog_embeddings.pkl | 1466445 | 0 |
| 2 | 20109_MS2Former_analog_embeddings.pkl | 688318 | 2 |
| 3 | 20109_MS2Former_analog_embeddings.pkl | 1270043 | 0 |
| 4 | 20109_MS2Former_analog_embeddings.pkl | 1220895 | 0 |
| ... | ... | ... | ... |
| 145 | 20109_MS2Former_analog_embeddings.pkl | 802553 | 0 |
| 146 | 20109_MS2Former_analog_embeddings.pkl | 679713 | 2 |
| 147 | 20109_MS2Former_analog_embeddings.pkl | 1335137 | 2 |
| 148 | 20109_MS2Former_analog_embeddings.pkl | 1220478 | 0 |
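
For downstream analysis it is often convenient to collect the members of each family. The following sketch groups a few rows from the example output by `family_id` using only the standard library; the inline CSV text stands in for the file on disk, and it assumes the first, unnamed column is the pandas row index.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the example output; the first column is the row index.
example_csv = """,source,peak_id,family_id
0,20109_MS2Former_analog_embeddings.pkl,868422,0
1,20109_MS2Former_analog_embeddings.pkl,1466445,0
2,20109_MS2Former_analog_embeddings.pkl,688318,2
3,20109_MS2Former_analog_embeddings.pkl,1270043,0
"""

# Map each family_id to the (source, peak_id) pairs assigned to it.
families = defaultdict(list)
for row in csv.DictReader(io.StringIO(example_csv)):
    families[int(row["family_id"])].append((row["source"], int(row["peak_id"])))

for family_id, members in sorted(families.items()):
    print(family_id, len(members))
```

Keeping `source` alongside `peak_id` matters when several mzXML files are clustered together, because peak identifiers are only guaranteed to be meaningful within their own file.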
