# SpecXplore Demo Notebook

This notebook walks the user through the data import steps of specXplore. 
This includes a run of ms2query on a demo dataset. 
The dataset is kept very small so that all processing steps run quickly. 
It consists of the first 30 ion entries of the wheat data .mgf file used in the illustrative examples of the specXplore publication.

Note that running ms2query, as well as the ms2deepscore and spec2vec models used in specXplore, requires pre-trained model files. 
These are assumed to be in the "models" directory, the path to which needs to be specified by the user. 
To download them, please use the latest model files linked on the GitHub repo of [ms2query](https://github.com/iomega/ms2query), i.e. [https://zenodo.org/record/6124552](https://zenodo.org/record/6124552) for positive mode and [https://zenodo.org/record/7104184](https://zenodo.org/record/7104184) for negative mode (Sept. 5, 2023). 
All model files for the ion mode of the data should be put into one folder, the path of which is needed for specXplore initialization and for running ms2query.

This notebook assumes the following folder structure:

```
|--[parent folder]
     |-- demo.ipynb
     |-- models
          |-- {model & library files from ms2query for positive mode}
     |-- data
          |-- demo_data.mgf
     |-- output
          |-- demo_data.csv (ms2query results created using demo.ipynb)
          |-- demo.pickle (created using demo.ipynb)
```
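
The folder layout above can also be prepared from Python. The following is a minimal sketch, assuming it is run from the parent folder; the helper name `ensure_demo_folders` is hypothetical and not part of specXplore.

```python
import os

def ensure_demo_folders(parent_folder="."):
    """Create the models, data, and output folders if they are missing.

    Hypothetical helper for illustration; not part of specXplore.
    Returns the folder paths in the order models, data, output.
    """
    folder_paths = [
        os.path.join(parent_folder, name)
        for name in ("models", "data", "output")
    ]
    for folder_path in folder_paths:
        # exist_ok=True leaves existing folders and their contents untouched
        os.makedirs(folder_path, exist_ok=True)
    return folder_paths
```

Calling `ensure_demo_folders()` only creates empty folders. The model files and `demo_data.mgf` still have to be placed into `models` and `data` by hand.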

# Step 1: Initialize Jupyter Notebook

This step starts with selecting the right kernel to run the Jupyter notebook in. If the set-up instructions were followed, the kernel with specXplore installed is the conda environment named specxplore_environment. Select this environment as the kernel for the notebook. If it was created successfully, this kernel contains all packages required for specXplore to run, including matchms, ms2query, spec2vec, and ms2deepscore.
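
Whether the selected kernel actually provides these packages can be checked before importing anything. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of specXplore.

```python
import importlib.util

def missing_packages(package_names):
    """Return the names from package_names that cannot be found in this kernel."""
    return [
        name for name in package_names
        if importlib.util.find_spec(name) is None
    ]

# Top-level package names listed in the paragraph above
required = ["specxplore", "matchms", "ms2query", "spec2vec", "ms2deepscore"]
print("Missing packages:", missing_packages(required))
```

An empty list suggests that the right kernel is selected. Any name that is printed points to an incomplete environment or a wrong kernel.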

The following code block makes all required packages available in this notebook session:

In [None]:
import specxplore.importing
import matchms
import matchms.importing  # used below for load_from_mgf
import matchms.filtering
import ms2query
import os
import pandas as pd

The next code block gives relative paths to all inputs and outputs used in this notebook. 
Having all paths specified in one place helps with keeping an overview of file paths. 
The os module is used to ensure that filepaths composed of folder names and filenames are constructed to conform with the standards of the operating system. 
This ensures that relative filepaths generalize across systems.

In [None]:
################ File paths for specXplore ################
# Folder path for the pre-trained models and local ms2query library
models_and_library_folder_path = os.path.join("models")

# File path for spectral data .mgf file to be explored using specXplore
input_mgf_filepath = os.path.join("data", "demo_data.mgf")

################ File paths for ms2query ################
# Folder path that contains the .mgf file with spectral data
input_data_folder  = os.path.join("data")

# The filename of the .mgf file
mgf_filename = os.path.join("demo_data.mgf")

# Folder to which ms2query puts the results .csv file
output_ms2query_directory = os.path.join("output")

# ms2query csv file name derived from input spectrum filename
output_ms2query_filepath = os.path.join("output", "demo_data.csv") 

# Output paths
output_filepath = os.path.join("output", "demo.pickle")
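
Before moving on, it can help to confirm that the inputs referenced by these paths are actually present. The following is a minimal sketch, assuming the notebook is run from the parent folder; the helper name `find_missing_inputs` is hypothetical and not part of specXplore.

```python
import os

def find_missing_inputs(folder_paths, file_paths):
    """Return every expected input folder or file that does not exist."""
    missing = [path for path in folder_paths if not os.path.isdir(path)]
    missing += [path for path in file_paths if not os.path.isfile(path)]
    return missing

# Same relative input paths as defined in the cell above
missing = find_missing_inputs(
    folder_paths=[os.path.join("models"), os.path.join("data")],
    file_paths=[os.path.join("data", "demo_data.mgf")],
)
print("Missing inputs:", missing)
```

An empty list means that all checked inputs were found relative to the current working directory.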

# Step 2: Load and pre-process input data files

In [None]:
spectra_matchms = list(
    matchms.importing.load_from_mgf(input_mgf_filepath)
)
spectra_matchms = specxplore.importing.apply_basic_matchms_filters_to_spectra(spectra_matchms)

The following code block does a quick pre-emptive check for uniqueness of feature identifiers. Should these be non-unique, problems may occur in follow-up processing. The specXplore constructor performs a final uniqueness check as well.

In [None]:
# Check for uniqueness of feature_ids 
feature_ids = [spec.get("feature_id") for spec in spectra_matchms]
print(feature_ids[0:4])
assert len(feature_ids) == len(set(feature_ids)), "Feature_ids must be unique!"
print("Uniqueness assertion passed")
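
If the assertion fails, it is useful to know which identifiers are duplicated. The following is a minimal sketch using `collections.Counter`; the helper name `find_duplicate_ids` and the toy identifiers are made up for illustration.

```python
from collections import Counter

def find_duplicate_ids(feature_ids):
    """Return a dict of identifier -> count for identifiers seen more than once."""
    counts = Counter(feature_ids)
    return {feature_id: n for feature_id, n in counts.items() if n > 1}

# Toy identifiers for illustration only
example_ids = ["76", "198", "76", "301"]
print(find_duplicate_ids(example_ids))  # {'76': 2}
```

Applied to the notebook data, the call would be `find_duplicate_ids(feature_ids)`, where an empty dict corresponds to a passing assertion.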

# Step 3: Run ms2query 

ms2query is used in specXplore to provide additional chemical context to unknown spectra. 
This includes the putative analog matches, as well as the chemical classifications of the latter which may serve as guidelines for which part of the t-SNE embedding can be considered of interest. 
In general, it is advised to apply a suitable match threshold for ms2query to avoid excessive numbers of false positive hits (non-analogs). 
In this example no threshold is set, and every putative analog match is kept regardless of score. 
Individual results should thus be read carefully and judged against their match score. 
Running ms2query is the step with the longest runtime in this notebook, because every spectrum is compared against a large offline database.

In [None]:
ms2library = ms2query.create_library_object_from_one_dir(
    models_and_library_folder_path
)

ms2query.run_ms2query_single_file(
    ms2library = ms2library, 
    folder_with_spectra = input_data_folder,
    spectrum_file_name = mgf_filename, 
    results_folder = output_ms2query_directory,
    settings = ms2query.utils.SettingsRunMS2Query()
)
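
A threshold can also be applied after the run by filtering the results table. The following is a minimal sketch on a toy table. It assumes that the score is stored in a column named `ms2query_model_prediction`, which should be verified against the actual .csv file, and the cutoff of 0.7 is an arbitrary value for illustration, not a recommendation.

```python
import pandas as pd

def filter_by_score(results_table, score_column, minimum_score):
    """Keep only rows whose score is at least minimum_score."""
    keep = results_table[score_column] >= minimum_score
    return results_table[keep].reset_index(drop=True)

# Toy table standing in for the ms2query results .csv
toy_results = pd.DataFrame({
    "query_spectrum_nr": [1, 2, 3],
    "ms2query_model_prediction": [0.91, 0.42, 0.75],  # assumed column name
})

filtered = filter_by_score(toy_results, "ms2query_model_prediction", 0.7)
print(filtered["query_spectrum_nr"].tolist())  # [1, 3]
```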

For the ms2query analog annotation table to be used inside the specXplore dashboard it needs to be post-processed. 
By default, ms2query does not run for spectra that do not fulfill its quality criteria, and the corresponding entries are missing from the .csv table. 
specXplore, on the other hand, makes use of all spectra inside the spectra_matchms object. 
The successful ms2query runs thus have to be aligned with the matching spectra used in specXplore. 

In addition, analog classifications can be converted into a table suitable for class based coloring in specXplore.

Both tables can be joined into specxplore via their feature_id key column in later steps.

In [None]:
# Get list of raw spectra without specXplore pre-processing as used in ms2query
raw_mgf_spectra = list(
    matchms.importing.load_from_mgf(
        input_mgf_filepath
    )
)

# Get ascending order number for each spectrum (query number)
raw_data_spectrum_number = list(
    range(1, len(raw_mgf_spectra) + 1)
)

# Get feature_id entries for all spectra
raw_data_feature_ids = [
    spec.get('feature_id') 
    for spec in raw_mgf_spectra
]

# Create a mapping of feature_id to query_number
raw_iloc_to_feature_id_mapping = pd.DataFrame(
    {
        "feature_id": raw_data_feature_ids, 
        "query_spectrum_nr" : raw_data_spectrum_number
    }
)

# Load ms2query results table using pandas
ms2query_annotation_table = pd.read_csv(
    output_ms2query_filepath
)

# Join the ms2query results table with the feature mapping such that for each available query, a feature_id is present
ms2query_annotation_table = ms2query_annotation_table.merge(
    raw_iloc_to_feature_id_mapping, 
    how = "left", 
    on="query_spectrum_nr"
)

# Recast the feature_id column as string type if it is not already
ms2query_annotation_table["feature_id"] = ms2query_annotation_table["feature_id"].astype("string")
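
To see which spectra ended up without an ms2query result, the merge direction can be reversed for a diagnostic check. The following is a minimal sketch on toy tables with made-up identifiers and compound names. It is a variation for inspection, not the merge used in this notebook.

```python
import pandas as pd

# Toy mapping: three spectra in the .mgf file (identifiers are made up)
mapping = pd.DataFrame({
    "feature_id": ["10", "11", "12"],
    "query_spectrum_nr": [1, 2, 3],
})

# Toy ms2query results: spectrum number 2 was skipped
results = pd.DataFrame({
    "query_spectrum_nr": [1, 3],
    "analog_compound_name": ["compound A", "compound B"],
})

# A left merge from the mapping keeps every spectrum; indicator=True adds a
# "_merge" column that reads "left_only" where no ms2query row exists
aligned = mapping.merge(
    results, how="left", on="query_spectrum_nr", indicator=True
)
without_annotation = aligned.loc[
    aligned["_merge"] == "left_only", "feature_id"
].tolist()
print(without_annotation)  # ['11']
```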


In [None]:

# Extract ms2query analog classification table for heuristic highlighting
ms2query_analog_classification = ms2query_annotation_table.loc[
    :, 
    [
        'cf_superclass', 'cf_class', 'cf_subclass', 'cf_direct_parent', 'npc_class_results', 'npc_superclass_results',
        'npc_pathway_results', 'feature_id'
    ]
]
ms2query_analog_classification

# Step 4: Initialize specXplore session

This is the first specXplore-specific step. Here, the spectral data is supplied to the SessionData() constructor, which separates spectral information and feature_ids and runs pairwise similarity computations using matchms (unless provided). The models and library folder path is used by the SessionData constructor to access the trained ms2deepscore and spec2vec models.

In [None]:
specxplore_demo_session = specxplore.importing.SessionData(
    spectra_matchms, 
    models_and_library_folder_path
)

# Step 5: Run Grids and Select t-SNE and k-medoid parameters

In [None]:
specxplore_demo_session.attach_kmedoid_grid(
    k_values=[3, 6, 8]
)

In [None]:
specxplore_demo_session.attach_run_tsne_grid(
    perplexity_values=[3, 5, 10]
)

In [None]:
# select a particular iloc of the tsne grid with good distance preservation
specxplore_demo_session.select_tsne_coordinates(2) 

# select particular iloc(s) for kmedoid cluster assignments to add to class table
specxplore_demo_session.select_kmedoid_cluster_assignments([0,1,2]) 

# Step 6: Attach classification data to SessionData

In [None]:
specxplore_demo_session.attach_addon_data_to_class_table(ms2query_analog_classification)

# Step 7: Attach metadata to the SessionData

In [None]:
specxplore_demo_session.attach_addon_data_to_metadata(ms2query_annotation_table)

# Step 8: Designate Highlighted Spectra

In [None]:
specxplore_demo_session.construct_highlight_table(['1961', '76', '198', '301'])

# Step 9: Initialize SessionData derived variables

In [None]:
specxplore_demo_session.initialize_specxplore_session()

# Step 10: Save the file

In [None]:
specxplore_demo_session.check_and_save_to_file(output_filepath)

# Step 11: Start up dashboard and explore data

Follow the readme guidelines on how to open a specXplore dashboard using the terminal and upload your saved demo.pickle using the full filepath. 
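
Since the notebook only defines relative paths, the full filepath has to be resolved first. The following is a minimal sketch, assuming the notebook was run from the parent folder so that the relative output path from Step 1 still applies.

```python
import os

# Same relative path as output_filepath defined in Step 1
output_filepath = os.path.join("output", "demo.pickle")

# Resolve against the current working directory to obtain the full path
full_output_filepath = os.path.abspath(output_filepath)
print(full_output_filepath)
print("File exists:", os.path.isfile(full_output_filepath))
```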