Template code structure for the pre-processing steps to be done by the end user.

In [None]:
filename_data = "your-spectral-dataset-filepath.mgf"
filename_standards = "your-reference-standards-filepath.mgf"
filename_spec2vec1 = "spec2vec-model-filepath-1" # trained model
filename_spec2vec2 = "spec2vec-model-filepath-2" # ???
filename_spec2vec3 = "spec2vec-model-filepath-3" # ???
filename_ms2deepsc = "ms2deepscore-model-filepath"
path_ms2query = "ms2query-downloaded-files-folder-path"

In [None]:
from specxplore import processing

### Process and Annotate spectrum data

- An mgf file is loaded and matchms spectrum cleaning and harmonization functions are applied as a pipeline. 
- ms2query is run to generate classification tables and analog predictions. 
- The spectrum metadata is extracted as a pandas DataFrame.

Returns:
Cleaned spectrum list, classification table, analog table, and metadata table, all linked through sample_idx.

In [None]:
from matchms.importing import load_from_mgf

spectra = list(load_from_mgf(filename_data)) # standard matchms loader
spectra = clean_spectra(spectra) # placeholder: custom pipeline; metadata cleaning, etc.
# --> get idx
classification_table, analog_table = run_ms2query(spectra, path_ms2query) # placeholder: run ms2query and fetch results; add print for results.csv location
metadata_table = fetch_metadata(spectra) # placeholder; any custom add-ons should be introduced here.
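
The linkage through sample_idx can be sketched as follows. This is a minimal, dependency-free illustration in which plain dictionaries stand in for spectra and tables; in practice the tables would be pandas DataFrames. The field names (`feature_id`, `precursor_mz`, `predicted_class`) and the helpers `assign_idx` and `join_on_idx` are hypothetical and not part of specXplore, matchms, or ms2query.

```python
# Toy stand-ins for the metadata of cleaned spectra (hypothetical fields).
spectra_metadata = [
    {"feature_id": "a17", "precursor_mz": 301.14},
    {"feature_id": "b02", "precursor_mz": 455.29},
]


def assign_idx(records, idx_name="sample_idx"):
    """Return copies of the records with a running integer index added."""
    return [{idx_name: i, **record} for i, record in enumerate(records)]


def join_on_idx(left, right, idx_name="sample_idx"):
    """Left-join two lists of row dicts on the shared index column."""
    right_by_idx = {row[idx_name]: row for row in right}
    return [{**row, **right_by_idx.get(row[idx_name], {})} for row in left]


metadata_rows = assign_idx(spectra_metadata)

# Hypothetical classification output keyed by the same index.
classification_rows = [
    {"sample_idx": 0, "predicted_class": "flavonoid"},
    {"sample_idx": 1, "predicted_class": "lipid"},
]

joined = join_on_idx(metadata_rows, classification_rows)
```

Because every table carries the same index column, any table can be joined to any other without relying on row order.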

### Process and Annotate reference standards

- An mgf file is loaded and matchms spectrum cleaning and harmonization functions are applied as a pipeline. 
- GNPS API interfacing code is used to run ClassyFire and NPClassifier on InChI/SMILES.
- The spectrum metadata is extracted as a pandas DataFrame.
- Any additional spectrum metadata is added if available.
- All spectra are indexed by their standards_idx.


In [None]:
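from matchms.importing import load_from_mgf

standards = list(load_from_mgf(filename_standards)) # standard matchms loader
standards = clean_spectra(standards) # placeholder: custom cleaning pipeline
standards_classification_table = run_classification(standards) # placeholder; try/except based
standards_metadata_table = fetch_metadata(standards) # placeholder; any custom add-ons should be introduced here.
# e.g. is_standard = True column
standards_metadata_table["is_standard"] = True # scalar is broadcast to every row of the df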
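
The try/except-based classification step could look roughly like the sketch below. It assumes the classifier is any callable that takes a structure string and may raise; `classify_safely` and `toy_classifier` are hypothetical names, and the toy classifier merely stands in for a remote ClassyFire or NPClassifier lookup without reproducing either API.

```python
def classify_safely(structures, classifier):
    """Apply `classifier` to each structure; record None where the call fails."""
    results = []
    for idx, structure in enumerate(structures):
        try:
            label = classifier(structure)
        except Exception as error:  # e.g. network failure or malformed input
            print(f"standards_idx {idx}: classification failed ({error})")
            label = None
        results.append(
            {"standards_idx": idx, "structure": structure, "class": label}
        )
    return results


def toy_classifier(smiles):
    """Hypothetical stand-in for a remote classification request."""
    if not smiles:
        raise ValueError("empty structure string")
    return "aromatic" if "c1ccccc1" in smiles else "unclassified"


table = classify_safely(["c1ccccc1O", "", "CCO"], toy_classifier)
```

A failed lookup leaves a None entry instead of aborting the run, so the table keeps one row per reference standard.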

### Merge spectral data

Sample data and reference standard data are combined. Reference standards can be identified afterwards via their is_standard == True entry in the joint metadata table or via their InChI/SMILES.

A new idx is generated to uniquely identify each spectrum in the merged data. Any other useful spectrum identifiers remain available in the metadata.

In [None]:
merged_metadata = merge_tables() # placeholder; possibly little overlap in columns, lots of NA information for metadata
merged_spectra = merge_spectra() # placeholder
merged_idx = get_idx() # placeholder; for merged data
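
A minimal sketch of the table merge, assuming both tables are available as lists of row dictionaries (in practice pandas would handle this). Columns that exist in only one table are filled with None, and a fresh `idx` is assigned to every row. The helper name `merge_metadata_rows` and the example columns are hypothetical.

```python
def merge_metadata_rows(sample_rows, standard_rows):
    """Stack two lists of row dicts; columns absent from a row become None."""
    all_rows = sample_rows + standard_rows
    columns = []
    for row in all_rows:
        for key in row:
            if key not in columns:
                columns.append(key)
    merged = []
    for new_idx, row in enumerate(all_rows):
        full_row = {column: row.get(column) for column in columns}
        full_row["idx"] = new_idx  # new unique identifier for the merged data
        merged.append(full_row)
    return merged


sample_rows = [
    {"sample_idx": 0, "feature_id": "a17"},
    {"sample_idx": 1, "feature_id": "b02"},
]
standard_rows = [{"standards_idx": 0, "smiles": "CCO", "is_standard": True}]

merged = merge_metadata_rows(sample_rows, standard_rows)
standards_only = [row for row in merged if row["is_standard"] is True]
```

The original sample_idx and standards_idx values stay in the merged rows, so each spectrum can still be traced back to its source table.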

### Get Pairwise Similarities using matchms

Requires: model files and their paths, spectrum list
Returns: idx-ordered pairwise similarities as NumPy matrices

Note:
- spec2vec and ms2deepscore come with their own tutorials on how to do this.
- Installation of both tools may be tricky depending on the operating system.
- matchms already has a nice interface for this; all we can do is wrap it and restrict its options.
- WARNING: all three similarity matrices are currently required by the dashboard; none can be missing.

Proposed Solution:

--> Leave these steps in the style of the original functions and mainly provide output glue.
--> A wrapper function adds baggage and is only useful if we can guarantee that it will run.

In [None]:
sm_ms2deepscore = get_pairwise_similarities(
    merged_spectra, "ms2deepscore")
sm_modified_cosine = get_pairwise_similarities(
    merged_spectra, "modified_cosine")
sm_spec2vec = get_pairwise_similarities(
    merged_spectra, "spec2vec")
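
The "output glue" idea can be illustrated with a small sketch: build an idx-ordered symmetric matrix from any pairwise score, then verify that all three matrices are present and square before handing them to the dashboard. The helpers `pairwise_matrix` and `check_matrices` are hypothetical, nested lists stand in for NumPy arrays, and the Jaccard overlap used here is only a toy score, not modified cosine, spec2vec, or ms2deepscore.

```python
def pairwise_matrix(items, score):
    """Build a symmetric, idx-ordered matrix of pairwise scores."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            matrix[i][j] = matrix[j][i] = score(items[i], items[j])
    return matrix


def check_matrices(matrices, n_spectra):
    """Raise if any required matrix is missing or not n_spectra x n_spectra."""
    for name in ("ms2deepscore", "modified_cosine", "spec2vec"):
        matrix = matrices.get(name)
        if matrix is None:
            raise ValueError(f"similarity matrix '{name}' is missing")
        if len(matrix) != n_spectra or any(len(row) != n_spectra for row in matrix):
            raise ValueError(f"similarity matrix '{name}' has the wrong shape")


def toy_score(a, b):
    """Toy score: Jaccard overlap of two sets of peak m/z values."""
    return len(a & b) / len(a | b)


peaks = [{100, 150, 200}, {100, 150}, {300}]
sm = pairwise_matrix(peaks, toy_score)
check_matrices({"ms2deepscore": sm, "modified_cosine": sm, "spec2vec": sm}, len(peaks))
```

Row and column i of each matrix refer to the spectrum with idx i, which is what allows the three matrices to be used interchangeably downstream.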

### Run K-Medoid Clustering Grid

Here, K-Medoid clustering is run for many levels of K to achieve a good Silhouette score.

Idea: this particular code can be run and rerun easily; the grid can be modified until a suitable K is found.

Returns: a table with cluster assignments for the selected values of K. Small K captures broad trends; large K gives finer granularity in the t-SNE embedding.

In [None]:
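k_grid = [5, 10, 15] # ... extend the grid as needed
scores = run_k_medoid_grid(k_grid) # placeholder
plot(scores) # placeholder
clustering_table = construct_clustering_table() # placeholder; with desired clustering levels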
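
To make the grid idea concrete, here is a small dependency-free sketch of K-Medoid clustering on a precomputed distance matrix, scored with the mean silhouette coefficient for each K. It is a naive alternating variant with a deterministic start, not the implementation used by specXplore; a library implementation would normally be preferred. With similarity matrices, a distance such as 1 - similarity is assumed as input. All function names and the toy data are hypothetical.

```python
def k_medoids(dist, k, max_iter=100):
    """Naive alternating k-medoids on a precomputed distance matrix."""
    n = len(dist)
    medoids = list(range(k))  # deterministic start: the first k points
    labels = [0] * n
    for _ in range(max_iter):
        # Assign every point to its closest medoid.
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster runs empty
                new_medoids.append(medoids[c])
                continue
            # New medoid: member with the smallest total distance to its cluster.
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


def silhouette_score(dist, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    n = len(dist)
    clusters = sorted(set(labels))
    total = 0.0
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            continue  # singleton clusters contribute a silhouette of 0
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / n


# Toy data: two well-separated groups on a line.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(p - q) for q in points] for p in points]

k_grid = [2, 3]
scores = {k: silhouette_score(dist, k_medoids(dist, k)) for k in k_grid}
best_k = max(scores, key=scores.get)
```

On this toy data the two-cluster solution scores highest, which matches the two visible groups; on real data the score curve is usually flatter and several K values may be worth keeping.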

### Run t-SNE Grid

Here, t-SNE is tuned to find perplexity values for which the embedding preserves pairwise distances well. Learning rate and number of iterations may also be investigated, but this makes tuning slower.

Speed depends on data size and settings. A single run may take a couple of minutes for large datasets and certain settings.

In [None]:
perplexities = [...]
learning_rates = [...]
iterations = [...]
scores = run_tsne_grid() # placeholder
plot(scores) # placeholder

tsne_xy = construct_tsne_xy() # placeholder; for selected settings
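
One possible way to quantify distance preservation is the Pearson correlation between the original pairwise distances and the Euclidean distances in the 2D embedding. The source does not state which score is used, so this choice is an assumption, as are the function names and the example grid values. The sketch only builds the settings grid and scores a given embedding; running t-SNE itself is left out.

```python
from itertools import product
from math import dist as euclidean, sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def distance_preservation(distance_matrix, embedding):
    """Correlate original pairwise distances with distances in the embedding."""
    original, embedded = [], []
    n = len(embedding)
    for i in range(n):
        for j in range(i + 1, n):
            original.append(distance_matrix[i][j])
            embedded.append(euclidean(embedding[i], embedding[j]))
    return pearson(original, embedded)


# Hypothetical grid values, for illustration only.
perplexities = [10, 30]
learning_rates = [100, 200]
grid = [
    {"perplexity": p, "learning_rate": lr}
    for p, lr in product(perplexities, learning_rates)
]

# Toy check: an embedding that keeps all distances versus a scrambled one.
points = [0, 1, 3]
distance_matrix = [[abs(p - q) for q in points] for p in points]
faithful = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
scrambled = [(0.0, 0.0), (3.0, 0.0), (1.0, 0.0)]
score_faithful = distance_preservation(distance_matrix, faithful)
score_scrambled = distance_preservation(distance_matrix, scrambled)
```

Each entry of `grid` would be passed to one t-SNE run, and the resulting embedding scored this way, so that settings can be compared on a single number.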

### Construct specXplore data structure

Construct a specXplore data structure for use within the dashboard. This is essentially a class with named data entries. It avoids passing many parameters around at each step of the dashboard and provides a single place to inspect the data structure used throughout specXplore.
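
Such a container could be sketched with a dataclass, as below. The class name `SpecxploreData`, its fields, and the consistency checks are assumptions made for illustration; the actual specXplore class may be organized differently.

```python
from dataclasses import dataclass, field


@dataclass
class SpecxploreData:
    """Hypothetical container bundling the pre-processing outputs."""

    spectra: list
    metadata_table: list
    classification_table: list
    tsne_xy: list
    similarity_matrices: dict = field(default_factory=dict)

    def __post_init__(self):
        # Every entry must describe the same, idx-ordered set of spectra.
        n = len(self.spectra)
        if len(self.tsne_xy) != n:
            raise ValueError("tsne_xy needs one coordinate pair per spectrum")
        for name, matrix in self.similarity_matrices.items():
            if len(matrix) != n:
                raise ValueError(f"{name} matrix does not match the number of spectra")


data = SpecxploreData(
    spectra=["spectrum_0", "spectrum_1"],
    metadata_table=[{"idx": 0}, {"idx": 1}],
    classification_table=[{"idx": 0}, {"idx": 1}],
    tsne_xy=[(0.0, 0.0), (1.0, 1.0)],
    similarity_matrices={"ms2deepscore": [[1.0, 0.2], [0.2, 1.0]]},
)
```

Dashboard functions can then accept a single `data` argument and read the entries they need by name.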