# Introduction
The following modules operate within the Conda environment `MaplePeakPicker`. This environment runs the peak picking software (including adduct detection and molecular formula prediction) that processes raw mzXML data for downstream use in the embedding modules. A separate Conda environment is used because the dependencies required for peak picking conflict with those used in the deep learning–based embedding modules.

# Preprocessing Raw mzXML Files

MAPLE provides high-level inference functions for streamlined LC–MS/MS data preprocessing. Raw WIFF files should be converted to the mzXML format using [ProteoWizard](https://hub.docker.com/r/proteowizard/pwiz-skyline-i-agree-to-the-vendor-licenses).

The preprocessing pipeline consists of the following steps:
1. Data Loading – mzXML files are parsed using [pyOpenMS](https://pyopenms.readthedocs.io/en/latest/).
2. EIC Decomposition – Spectral data is decomposed into extracted ion chromatograms (EICs).
3. Isotopic Pattern Analysis – MS<sup>1</sup> signals are grouped into isotopic distributions, and charge states are assigned.
4. Adduct Annotation – [Common adducts](https://github.com/magarveylab/maple-publication/blob/main/Maple/PeakPicker/database/adducts.csv) are identified to compute an accurate neutral monoisotopic mass.
5. Molecular Formula Prediction – Candidate molecular formulas within 5 ppm of the observed mass are generated using [ChemCalc](https://www.chemcalc.org/), limited to common natural product elements (C, H, O, N, S, Cl, Br). [Brainpy](https://github.com/mobiusklein/brainpy) is used to compute theoretical isotopic distributions, which are compared to experimental data to prioritize high-confidence candidates.
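
Steps 4 and 5 rest on two small calculations: converting an adduct m/z to a neutral mass, and checking a candidate formula against a ppm tolerance. The sketch below illustrates both with plain arithmetic. It is not MAPLE's implementation: the helper names are hypothetical, and it assumes that the adduct label `2MpH` used in the formula prediction example of this document denotes [2M+H]<sup>+</sup>.

```python
# Monoisotopic masses (u) of the most abundant isotope of each element.
MONO = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.99491461956}
PROTON = 1.007276466812  # proton mass (u)


def neutral_mass(mz, charge, n_molecules=1, adduct_mass=PROTON):
    """Neutral monoisotopic mass M for an ion [nM + adduct]^z+.

    Assumes mz * z = n * M + adduct_mass, where adduct_mass is the
    total mass contributed by the adduct species.
    """
    return (mz * charge - adduct_mass) / n_molecules


def formula_mass(composition):
    """Monoisotopic mass of a formula given as {element: count}."""
    return sum(MONO[element] * count for element, count in composition.items())


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


# Values taken from the formula prediction example in this document.
observed = neutral_mass(567.1758, charge=1, n_molecules=2)  # assumed [2M+H]+
candidate = formula_mass({"C": 16, "H": 13, "N": 1, "O": 4})
error = ppm_error(observed, candidate)

print(f"observed {observed:.4f} u, candidate {candidate:.4f} u, error {error:.2f} ppm")
print("within 5 ppm:", abs(error) <= 5.0)
```

A candidate that passes the mass filter would still need its theoretical isotopic distribution compared with the experimental one, which this sketch does not attempt.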

In [1]:
from Maple.PeakPicker import run_peak_picker

run_peak_picker(
    mzXML_fp="sample_data/20109_Chitinophaga__C408L_Czapek-Dox-1perstarch_HP20-XAD7bags_C12_1.mzXML", # input data
    output_fp="sample_output/20109_peaks.json"
)

2025-07-21 01:04:44,045 Read mzXML data




2025-07-21 01:04:44,679 Parse mzXML data


100%|██████████| 6481/6481 [00:02<00:00, 2176.03it/s]

2025-07-21 01:04:47,664 Decompose spectral data into EICs
2025-07-21 01:04:47,665 Overlapping signals and organizing into network



100%|██████████| 1704/1704 [00:12<00:00, 131.70it/s]

2025-07-21 01:05:00,964 Find repersentative signals



100%|██████████| 230670/230670 [00:08<00:00, 26754.78it/s]

2025-07-21 01:05:12,716 Pre: 1528317 signals
2025-07-21 01:05:12,717 Post: 258102 signals





2025-07-21 01:05:13,424 Determine isotopic distributions
2025-07-21 01:05:13,426 Find all possible isotopic distributions


100%|██████████| 1666/1666 [00:10<00:00, 158.22it/s]

2025-07-21 01:05:24,175 Priortization of isotopic distributions



100%|██████████| 151067/151067 [00:05<00:00, 26374.61it/s]


2025-07-21 01:05:30,490 Overlapping signals and organizing into network


100%|██████████| 1073/1073 [00:00<00:00, 1618.78it/s]

2025-07-21 01:05:31,532 Find repersentative signals



100%|██████████| 3727/3727 [00:00<00:00, 14712.34it/s]

2025-07-21 01:05:31,813 Pre: 17710 signals
2025-07-21 01:05:31,814 Post: 3731 signals





2025-07-21 01:05:32,373 Assign ms2 data


100%|██████████| 3731/3731 [00:01<00:00, 2645.64it/s]

2025-07-21 01:05:33,789 Detect Adducts
2025-07-21 01:05:33,790 Pre-calculate Adduct Masses





2025-07-21 01:05:35,006 Compute all possible adduct clusters


100%|██████████| 3731/3731 [00:11<00:00, 334.04it/s]

2025-07-21 01:05:46,801 Overlapping all adduct clusters



100%|██████████| 790/790 [00:01<00:00, 401.31it/s]

2025-07-21 01:05:49,770 Total clusters: 790



100%|██████████| 340/340 [00:08<00:00, 40.67it/s]


2025-07-21 01:05:58,851 Permutation: 1
2025-07-21 01:05:58,852 Remaining Clusters: 34


100%|██████████| 13/13 [00:00<00:00, 23.70it/s]

2025-07-21 01:05:59,972 Permutation: 2
2025-07-21 01:05:59,974 Remaining Clusters: 3



100%|██████████| 2/2 [00:00<00:00,  5.52it/s]

2025-07-21 01:06:00,829 Permutation: 3
2025-07-21 01:06:00,830 Remaining Clusters: 0
2025-07-21 01:06:00,843 Found 355 adduct clusters
2025-07-21 01:06:00,891 Cleaning MASTERbook values for JSON output
2025-07-21 01:06:00,933 Finished





This is the first entry of the example output. It includes the peak's m/z, charge, intensity, and retention time, along with the number of scans in which the peak appears and any associated fragmentation data. The predicted adduct type is provided, and if the peak is part of an adduct network (i.e., other related adducts are detected), an `adduct_cluster_id` is assigned. The monoisotopic mass is calculated from the adduct. The isotopic distribution is reported as a list of pairs, one per isotope peak: the first value is the relative intensity and the second is the m/z.

In [2]:
import json

data = json.load(open("sample_output/20109_peaks.json"))

print(data[0])

{'scan_id': 1741, 'mz': 506.289, 'intensity': 4170, 'rt': 422.3, 'scan_count': 13, 'peak_id': 647173.0, 'charge': 1, 'ms2': [], 'adduct_type': 'MpIPpH', 'adduct_cluster_id': 159, 'monoisotopic_mass': 445.224, 'base_adduct': False, 'isotopic_distribution': [[1.0, 506.289], [0.608, 507.231]]}
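
As a minimal sketch of how such a record might be consumed, the snippet below copies the printed entry into a dictionary and splits the isotopic distribution into parallel lists. The assumption that a peak outside any adduct network has a missing or `None` `adduct_cluster_id` is not confirmed by the example and is only illustrative.

```python
# First record of the example output, copied from the printout above.
peak = {
    "scan_id": 1741, "mz": 506.289, "intensity": 4170, "rt": 422.3,
    "scan_count": 13, "peak_id": 647173.0, "charge": 1, "ms2": [],
    "adduct_type": "MpIPpH", "adduct_cluster_id": 159,
    "monoisotopic_mass": 445.224, "base_adduct": False,
    "isotopic_distribution": [[1.0, 506.289], [0.608, 507.231]],
}

# Each pair is [relative_intensity, m/z]; split into two parallel lists.
iso_intens = [pair[0] for pair in peak["isotopic_distribution"]]
iso_mz = [pair[1] for pair in peak["isotopic_distribution"]]

# Convenience flags (assumption: no cluster is encoded as a missing/None id).
has_ms2 = len(peak["ms2"]) > 0
in_adduct_cluster = peak.get("adduct_cluster_id") is not None

print("isotope m/z:", iso_mz)
print("isotope relative intensities:", iso_intens)
print("has MS2:", has_ms2, "| in adduct cluster:", in_adduct_cluster)
```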


Molecular formulas are predicted in a separate step with the following code. Running prediction independently keeps inference fast, because it can be limited to prioritized high-intensity signals.

In [4]:
from Maple.PeakPicker import run_formula_predictor

query_peaks = [
    {"peak_id": 31365361,
     "mass": 283.084,
     "charge": 1,
     "adduct_type": "2MpH",
     "iso_mz": [567.1758, 568.1783, 569.1757, 570.1852],
     "iso_intens": [1, 0.324, 0.112, 0.039]}
]

run_formula_predictor(
    peaks=query_peaks,
    output_fp="sample_output/example_formula_predictions.json",
    cpu=10 # the number of cpu cores to use
)

2025-07-21 01:06:12,524 Predict Elemental Ratios


100%|██████████| 1/1 [00:00<00:00, 383.81it/s]

2025-07-21 01:06:12,532 Predict Formulas



100%|██████████| 1/1 [00:01<00:00,  1.82s/it]


The following shows a sample output:

In [5]:
import json

data = json.load(open("sample_output/example_formula_predictions.json"))
print(data)

[{'peak_id': 31365361, 'formulas': [{'formula': 'C16H13NO4', 'score': 89.12}]}]
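
The query format above can be built from peak picker records. The sketch below is one possible way to do this, keeping only the most intense peaks before prediction. The helper `to_formula_queries` and its `top_n` parameter are hypothetical, and mapping `monoisotopic_mass` to the `mass` field is an assumption based on the example values, not documented behaviour.

```python
def to_formula_queries(peaks, top_n=100):
    """Reshape peak picker records into formula predictor queries.

    Keeps the top_n records by intensity (hypothetical prioritization).
    """
    ranked = sorted(peaks, key=lambda p: p["intensity"], reverse=True)[:top_n]
    queries = []
    for p in ranked:
        queries.append({
            "peak_id": int(p["peak_id"]),
            "mass": p["monoisotopic_mass"],  # assumed to be the neutral mass
            "charge": p["charge"],
            "adduct_type": p["adduct_type"],
            # isotopic_distribution pairs are [relative_intensity, m/z]
            "iso_mz": [mz for _, mz in p["isotopic_distribution"]],
            "iso_intens": [inten for inten, _ in p["isotopic_distribution"]],
        })
    return queries


# The first record of the example peak picker output (relevant fields only).
example_peaks = [{
    "peak_id": 647173.0, "intensity": 4170, "charge": 1,
    "adduct_type": "MpIPpH", "monoisotopic_mass": 445.224,
    "isotopic_distribution": [[1.0, 506.289], [0.608, 507.231]],
}]

queries = to_formula_queries(example_peaks, top_n=10)
print(queries[0])
```

The resulting list could then be passed as the `peaks` argument of `run_formula_predictor`, provided the field mapping assumed here is correct.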


# Analyzing Isotope Feeding Studies

MAPLE includes inference functions for analyzing isotope feeding experiments by comparing control and labeled samples. It detects changes in isotopic distributions (measured using skewness) and identifies mass shifts in MS<sup>2</sup> fragmentation data. <sup>12</sup>C mzXML files must first be processed using the peak picking module.

In [6]:
from Maple.FeedingAnalysis import run_feeding_analysis

run_feeding_analysis(
    c13_mzXML_fp="sample_data/20111_Chitinophaga__C408L_Czapek-Dox-1perstarch_HP20-XAD7bags_C13_1.mzXML", # input data
    c12_peaks_fp="sample_output/20109_peaks.json", # input data
    output_fp="sample_output/20111_feeding_results.json"
)

100%|██████████| 6251/6251 [00:01<00:00, 3254.12it/s]
100%|██████████| 1523/1523 [00:00<00:00, 379037.86it/s]
100%|██████████| 1523/1523 [00:01<00:00, 882.26it/s]


This is the first entry of the example output. It includes MS information for a peak detected in the <sup>13</sup>C file, the corresponding overlapping peak in the <sup>12</sup>C file, and the skewness calculated for each.

In [7]:
import json

data = json.load(open("sample_output/20111_feeding_results.json"))
print(data[0])

{'scan_id': 4630, 'mz': 964.644, 'charge': 1, 'rt': 836.898, 'isotopic_distribution': [[1.0, 964.644], [0.9073, 965.643], [0.4908, 966.649], [0.1814, 967.643]], 'intensity_raw': 183090, 'ms2': [], 'skew': -0.1833, 'shifted_ms2': [], 'overlap_peaks': [{'peak_id': 1327140.0, 'skew': 0.1087}]}
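
A simple way to read such a record is to compare the skewness of the labeled peak with that of each overlapping control peak. The snippet below is only an illustrative sketch over the fields shown above. The difference it computes is not a MAPLE output and is not by itself a test for label incorporation.

```python
# Relevant fields of the first feeding result, copied from the printout above.
record = {
    "mz": 964.644,
    "skew": -0.1833,
    "overlap_peaks": [{"peak_id": 1327140.0, "skew": 0.1087}],
}

# Difference between the labeled skewness and each control skewness.
deltas = []
for control in record["overlap_peaks"]:
    delta = record["skew"] - control["skew"]
    deltas.append((int(control["peak_id"]), delta))

for peak_id, delta in deltas:
    print(f"control peak {peak_id}: skew difference {delta:+.4f}")
```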


The following function can be used to calculate skewness from the isotopic distribution. This calculation is already integrated into the `run_feeding_analysis` function.

In [8]:
from Maple.FeedingAnalysis import get_isot_dist_skewness

iso_mz = [937.684, 938.687, 939.689, 940.694, 941.694]
iso_intens = [1, 0.558, 0.188, 0.046, 0.009]
skew = get_isot_dist_skewness(iso_mz=iso_mz, iso_intens=iso_intens)

print(f"Calculated Skewness {skew}")

Calculated Skewness 0.273


The Dixon Q test is used to determine whether the <sup>13</sup>C skewness is a statistical outlier relative to the <sup>12</sup>C skewness values.

In [9]:
from Maple.FeedingAnalysis import does_peak_shift

c12_skew_values = [0.3098, 0.2913, 0.273, 0.2825, 0.2695, 0.2822, 0.2668]
c13_skew_value = 0.1537
result = does_peak_shift(c12_skew_values, c13_skew_value, alpha=0.01)
print(result)

{'shift': True, 'stats': {'normality': True, 'Q_min': 0.725, 'Q_crit': 0.634, 'n': 8, 'outlier': True}}
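
The `Q_min` and `Q_crit` values above are consistent with the classical Dixon Q statistic (gap to the nearest neighbour divided by the full range), applied to the pooled eight values at 99% confidence. The sketch below reproduces that arithmetic by hand. It is not the `does_peak_shift` implementation, it omits the normality check reported in the output, and the function name `dixon_q_low` is hypothetical.

```python
# Standard critical values of Dixon's Q at 99% confidence for n = 3..10.
Q_CRIT_99 = {3: 0.994, 4: 0.926, 5: 0.821, 6: 0.740,
             7: 0.680, 8: 0.634, 9: 0.598, 10: 0.568}


def dixon_q_low(values):
    """Q statistic for testing the smallest value as an outlier."""
    ordered = sorted(values)
    data_range = ordered[-1] - ordered[0]
    if data_range == 0:
        return 0.0  # all values identical, nothing can be an outlier
    return (ordered[1] - ordered[0]) / data_range


c12_skew_values = [0.3098, 0.2913, 0.273, 0.2825, 0.2695, 0.2822, 0.2668]
c13_skew_value = 0.1537

# Pool the labeled value with the control values, as suggested by n = 8 above.
pooled = c12_skew_values + [c13_skew_value]
n = len(pooled)
q_min = dixon_q_low(pooled)
q_crit = Q_CRIT_99[n]

# The labeled value is flagged only if it is the minimum and Q exceeds Q_crit.
is_outlier = min(pooled) == c13_skew_value and q_min > q_crit

print(f"n={n}, Q_min={q_min:.3f}, Q_crit={q_crit}, outlier={is_outlier}")
```

Because the table only covers 3 to 10 values, this sketch would need an extended table, or a different test, for larger sets of control peaks.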
