# Setting up the environment (Google Colab)

The following script sets up the Jupyter notebook to work in Google Colab. The notebook should be downloaded from [*github*](), uploaded to your *gdrive*, and opened using *Google Colaboratory*. The course module package is then cloned from GitHub and installed. Please be aware that the code deliberately does not run automatically; the user needs to confirm cloning and installation explicitly by entering yes at the input prompt. The code block contains a number of checks so that a run_all command can be used without re-triggering installation or input requests (*resetting the kernel leads to a re-evaluation and a quick re-install, since all dependencies are already available*). Please note that the git clone command fails when run a second time because the module folder is already present. This does not affect the runtime in any way. To remove the folder in Google Colab, use the following command in a Jupyter notebook cell:

    !rm -rf CompMet_Tutorials

**--> Important:** Only run the following two code cells when using google colab.

In [1]:
selected_environment = "runtime_colab" # or "runtime_local" if using a local development environment, see below

In [2]:
# SETUP COLAB ENVIRONMENT
import importlib.util # import the submodule explicitly so that importlib.util.find_spec is available
if 'confirm' not in locals(): # this effectively caches the response so run all can be used
  confirm = input("WARNING: The git clone and pip install commands can have unintended side effects when used outside of google colab. Please enter >yes< to continue (without arrow brackets).")
# only clone and install in colab, when the package is missing, and when the user confirmed
package_missing = importlib.util.find_spec("compMetabolomics") is None
if selected_environment == "runtime_colab" and package_missing and confirm == "yes":
    !rm -rf CompMet_Tutorials
    !git clone https://github.com/vdhooftcompmet/CompMet_Tutorials
    !pip install CompMet_Tutorials/.
if confirm != "yes":
    print(f"Clone and installation skipped since confirm = >{confirm}<")

Clone and installation skipped since confirm = >no<
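
The three conditions guarding the installation can be looked at in isolation. The block below is a minimal sketch and not part of the course module; the helper name `needs_install` is hypothetical. It relies only on `importlib.util.find_spec`, which returns `None` when a top-level package cannot be found.

```python
import importlib.util

def needs_install(package_name: str, environment: str, confirm: str) -> bool:
    """Hypothetical helper: True only if all three installation conditions hold."""
    in_colab = environment == "runtime_colab"
    # find_spec returns None when the top-level package is not importable
    missing = importlib.util.find_spec(package_name) is None
    return in_colab and missing and confirm == "yes"

# 'json' ships with Python, so it is never reported as missing
print(needs_install("json", "runtime_colab", "yes"))
```

Because the package check is re-evaluated on every run, a second run_all after a successful installation skips the clone and install steps.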


# Setting up the environment (local installation)

To run the module locally on your computer, you need to have conda ([install guide](https://conda.io/projects/conda/en/latest/user-guide/install/index.html)) and git installed. To set up the module, run the following commands in the terminal:

    conda create --name compMetEnv python=3.10
    conda activate compMetEnv
    git clone https://github.com/vdhooftcompmet/CompMet_Tutorials
    cd CompMet_Tutorials
    pip install .
    jupyter-notebook

These commands 1) create a new isolated Python environment, 2) activate this environment, 3) clone the GitHub repository, 4) change the working directory to the downloaded repository folder, 5) install the compMetabolomics course modules, and 6) start a Jupyter notebook server from within the conda environment. Set-up may differ when using Google Colab or hosted JupyterLab environments.

**--> Important:** Only uncomment and run the following cell when using a local set-up.

In [8]:
# Remove the # comment mark below to set the environment to local
#selected_environment = "runtime_local" # or "runtime_colab" if using google colab, see above

# Practical Assignments 
In this practical we will explore the GNPS natural products library from https://gnps.ucsd.edu/ProteoSAFe/gnpslibrary.jsp?library=GNPS-NIH-NATURALPRODUCTSLIBRARY. This dataset is a collection of natural product reference standards with associated MS/MS spectra. The data have been processed using matchms in the natural_product_library_preprocessing.ipynb notebook.

# Interactive Code Examples

The processing code below mimics that of the main practical very closely. For details on the different steps, please refer to the main practical in practical_may2024.

In [9]:
# LOADING PYTHON PACKAGES & COMP-METABOLOMICS MODULES
import numpy as np
import pandas as pd
import os
import copy
import json
from compMetabolomics.map_to_size import transform_log2_fold_change_to_node_size
from compMetabolomics.spectrum import Spectrum, load_json_spectra, parse_json_spectrum, get_min_max, get_spectrum_ids
from compMetabolomics.tsne_embedding import run_tsne_grid, plot_tsne_grid, print_tsne_grid, extract_coordinates_from_entry, plot_embedding
from compMetabolomics.kmedoid_clustering import run_kmedoid_grid, print_kmedoid_grid, plot_kmedoid_grid, get_kmedoid_grid_entry_cluster_assignments
from compMetabolomics.utils import convert_similarity_to_distance, get_spectrum_by_id
from compMetabolomics.integrate import align_with_spectral_feature_id
from compMetabolomics.construct_cytoscape_elements import generate_edge_list, generate_node_list_no_stats

The data path differs depending on the environment in which the Jupyter notebook is run. The code below works for both a local and a Google Colab set-up, provided "selected_environment" has been set to the matching value.

In [10]:
# DEFINE DATA DIRECTORIES AND FILEPATHS
if selected_environment == "runtime_local":
  data_directory = os.path.join("data_natural_products_library")
elif selected_environment == "runtime_colab":
  data_directory = os.path.join("CompMet_Tutorials", "data_natural_products_library")
filepath_spectra = os.path.join(data_directory, "spectra.json")
filepath_similarity_matrix_modcos = os.path.join(data_directory, "similarities_modcos.npy")
filepath_similarity_matrix_ms2deepscore = os.path.join(data_directory, "similarities_ms2ds.npy")
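
If `selected_environment` holds an unexpected value, the cell above leaves `data_directory` undefined and the error only surfaces later. As a hedged alternative, the lookup could be written with an explicit failure message. The helper name `resolve_data_directory` and the dictionary are illustrative and not part of the course module.

```python
import os

# Mapping from environment name to the data directory used in this notebook
DATA_DIRECTORIES = {
    "runtime_local": os.path.join("data_natural_products_library"),
    "runtime_colab": os.path.join("CompMet_Tutorials", "data_natural_products_library"),
}

def resolve_data_directory(environment: str) -> str:
    """Hypothetical helper: look up the data directory or fail with a clear message."""
    if environment not in DATA_DIRECTORIES:
        raise ValueError(f"Unknown selected_environment: {environment!r}")
    return DATA_DIRECTORIES[environment]

print(resolve_data_directory("runtime_colab"))
```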

In [11]:
# LOAD & INSPECT INPUT DATA SOURCES - SPECTRA
spectra = load_json_spectra(filepath_spectra)
# Load the raw json spectra as well for metadata access
with open(filepath_spectra) as f:
  json_spectra = json.load(f)
print("Json Spectra is a list of generic spectrum entries: \n", json_spectra[0])
print("Spectra is a list of Spectrum tuples: \n", spectra[0])

Json Spectra is a list of generic spectrum entries: 
 {'charge': 1, 'ionmode': 'positive', 'smiles': 'OC(=O)[C@H](NC(=O)CCN1C(=O)[C@@H]2Cc3ccccc3CN2C1=O)c4ccccc4', 'scans': '1865', 'ms_level': '2', 'instrument_type': 'LC-ESI-qTof', 'file_name': 'p1-A05_GA5_01_17878.mzXML', 'peptide_sequence': '*..*', 'organism_name': 'GNPS-NIH-NATURALPRODUCTSLIBRARY', 'compound_name': '(2R)-2-[3-[(10aS)-1,3-dioxo-10,10a-dihydro-5H-imidazo[1,5-b]isoquinolin-2-yl]propanoylamino]-2-phenylacetic acid"', 'principal_investigator': 'Dorrestein', 'data_collector': 'VVP/LMS', 'submit_user': 'vphelan', 'confidence': '1', 'spectrum_id': 'CCMSLIB00000079350', 'precursor_mz': 408.156, 'adduct': '[M+H]+', 'feature_id': 'CCMSLIB00000079350', 'retention_time': 'not-available', 'peaks_json': [[95.718742, 0.007350539039529565], [97.039551, 0.007840574975498203], [97.131714, 0.009637373407383208], [97.412415, 0.009474028095393662], [97.58535, 0.007350539039529565], [99.071404, 0.007350539039529565], [100.596634, 0.007187
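
The raw JSON entries store the peaks under `peaks_json` as a list of pairs. The sketch below assumes each pair is `[m/z, intensity]`, which matches the value ranges printed above, and uses a shortened entry containing only the first three peaks. The helper `split_peaks` is hypothetical.

```python
# A shortened entry using the first three peaks printed above
entry = {
    "feature_id": "CCMSLIB00000079350",
    "precursor_mz": 408.156,
    "peaks_json": [
        [95.718742, 0.007350539039529565],
        [97.039551, 0.007840574975498203],
        [97.131714, 0.009637373407383208],
    ],
}

def split_peaks(json_entry):
    """Hypothetical helper: split [m/z, intensity] pairs into two lists."""
    mz_values = [peak[0] for peak in json_entry["peaks_json"]]
    intensities = [peak[1] for peak in json_entry["peaks_json"]]
    return mz_values, intensities

mz_values, intensities = split_peaks(entry)
# m/z of the most intense peak within this shortened entry
base_peak_mz = mz_values[intensities.index(max(intensities))]
print(base_peak_mz)
```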

In [12]:
# LOAD & INSPECT INPUT DATA SOURCES - SPECTRAL SIMILARITY MATRICES
similarity_matrix_modcos = np.loadtxt(filepath_similarity_matrix_modcos)
similarity_matrix_ms2ds = np.loadtxt(filepath_similarity_matrix_ms2deepscore)
print(
  "The similarity matrices are square matrices displaying similarity score output",
  "for each feature pair (via index):\n",
  "Modified Cosine Score entries (first 4)\n", 
  similarity_matrix_modcos[0:4, 0:4],
  "\nMs2DeepScore entries (first 4) \n", 
  similarity_matrix_ms2ds[0:4, 0:4]
)

The similarity matrices are square matrices displaying similarity score output for each feature pair (via index):
 Modified Cosine Score entries (first 4)
 [[1.         0.02073893 0.00303784 0.05186433]
 [0.02073893 1.         0.04486132 0.94989006]
 [0.00303784 0.04486132 1.         0.03535681]
 [0.05186433 0.94989006 0.03535681 1.        ]] 
Ms2DeepScore entries (first 4) 
 [[1.         0.36936951 0.41749796 0.71601178]
 [0.36936951 1.         0.40013656 0.36121405]
 [0.41749796 0.40013656 1.         0.34816449]
 [0.71601178 0.36121405 0.34816449 1.        ]]
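
A quick sanity check on such matrices is to confirm that they are square, symmetric, and have ones on the diagonal. The following sketch applies these checks to the 4 x 4 modified cosine block printed above; the helper `check_similarity_matrix` is hypothetical and not part of the course module.

```python
import numpy as np

# First 4 x 4 block of the modified cosine matrix as printed above
block = np.array([
    [1.0, 0.02073893, 0.00303784, 0.05186433],
    [0.02073893, 1.0, 0.04486132, 0.94989006],
    [0.00303784, 0.04486132, 1.0, 0.03535681],
    [0.05186433, 0.94989006, 0.03535681, 1.0],
])

def check_similarity_matrix(matrix):
    """Hypothetical helper: return (is_square, is_symmetric, has_unit_diagonal)."""
    is_square = matrix.ndim == 2 and matrix.shape[0] == matrix.shape[1]
    is_symmetric = is_square and np.allclose(matrix, matrix.T)
    has_unit_diagonal = is_square and np.allclose(np.diag(matrix), 1.0)
    return bool(is_square), bool(is_symmetric), bool(has_unit_diagonal)

print(check_similarity_matrix(block))
```

The same call can be applied to the full `similarity_matrix_modcos` and `similarity_matrix_ms2ds` arrays.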


In [13]:
import compMetabolomics.heatmap
threshold = 0.7
feature_ilocs = list(range(400, 500)) # selected range of features from whole matrix, try different ranges.
feature_ids = [spectrum.feature_id for spectrum in spectra]
compMetabolomics.heatmap.generate_augmap_graph(feature_ilocs, similarity_matrix_modcos, feature_ids, threshold).show()
compMetabolomics.heatmap.generate_augmap_graph(feature_ilocs, similarity_matrix_ms2ds, feature_ids, threshold).show()

# Embedding the spectral similarity data using t-SNE

As a first step in exploring the data, we will perform a t-SNE embedding of the spectral similarity matrix. The code below mimics that of the main practical.

In [14]:
# SELECT PAIRWISE SIMILARITY MATRIX TO BE USED THROUGHOUT THE REMAINDER OF THE CODE
# IMPORTANT: NOTE THAT THE ANSWERS IN THE VANILLA DOCUMENT ASSUME THE MS2DEEPSCORE SIMILARITY MATRIX TO HAVE BEEN USED!
similarity_matrix = similarity_matrix_ms2ds # feel free to rerun this and subsequent cells with similarity_matrix_modcos for networking using this score

In [15]:
tsne_grid = run_tsne_grid(
  convert_similarity_to_distance(similarity_matrix), 
  perplexity_values = [10, 20, 30, 40, 1200] # more values are possible here (time consuming), e.g. [50, 200, 800]
)
plot_tsne_grid(tsne_grid)
print_tsne_grid(tsne_grid)

T-sne grid results. Use to inform t-sne embedding selection.
   iloc  perplexity  pearson_score  spearman_score  random_seed_used
0     0          10       0.449521        0.434647                 0
1     1          20       0.504859        0.487354                 0
2     2          30       0.486665        0.471076                 0
3     3          40       0.489098        0.474994                 0
4     4        1200       0.820232        0.828682                 0
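
One plausible way to read such scores is as the correlation between the input distances and the pairwise distances in the embedding. This is an assumption about what the grid reports, not a description of the `compMetabolomics` implementation. The sketch below computes a Pearson correlation of that kind with NumPy. The conversion `1 - similarity` is also only an illustrative assumption, and the helper name is hypothetical.

```python
import numpy as np

def pairwise_distance_correlation(distance_matrix, coordinates):
    """Pearson correlation between input distances and embedded Euclidean distances.

    Assumption: one plausible embedding quality score, not necessarily
    the one reported by print_tsne_grid.
    """
    # Euclidean distances between all pairs of embedded points
    deltas = coordinates[:, None, :] - coordinates[None, :, :]
    embedded = np.sqrt((deltas ** 2).sum(axis=-1))
    # Use each pair once: strictly upper triangle
    upper = np.triu_indices_from(distance_matrix, k=1)
    return float(np.corrcoef(distance_matrix[upper], embedded[upper])[0, 1])

# Toy example: three points on a line whose embedding preserves all distances
similarity = np.array([[1.0, 0.8, 0.6], [0.8, 1.0, 0.8], [0.6, 0.8, 1.0]])
distance = 1.0 - similarity  # assumed conversion, for illustration only
coords = np.array([[0.0, 0.0], [0.2, 0.0], [0.4, 0.0]])
score = pairwise_distance_correlation(distance, coords)
print(round(score, 6))
```

Under this reading, a score close to 1 means the embedding preserves the ordering and spacing of the input distances well.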


In [16]:
tsne_iloc = 3
coordinates_table = extract_coordinates_from_entry(tsne_grid[tsne_iloc])
print(coordinates_table[0:4])

   x_coordinate  y_coordinate
0     -6.354599     -5.403307
1     15.997694    -10.932066
2     26.008554     12.648302
3      2.610598      9.902471


In [17]:
plot_embedding(coordinates_table, feature_ids = [spec.feature_id for spec in spectra]).update_layout(title = "t-SNE with moderate perplexity")

In [18]:
# VISUALIZE THE HIGH PERPLEXITY TSNE RESULTS
tsne_iloc_high = 4 # ASSUMED TO BE ILOC = 4 AS SET IN THE VANILLA DOCUMENT
plot_embedding(
  embedding_coordinates_table = extract_coordinates_from_entry(tsne_grid[tsne_iloc_high]), 
  feature_ids = [spec.feature_id for spec in spectra]
).update_layout(title = "t-SNE with very high perplexity run")

# Clustering the spectral similarity data using k-medoid clustering

The code below runs k-medoid clustering on the spectral similarity data. This produces sub-divisions of the data that are useful for determining which groups of features can be considered related. The code below mimics that of the main practical.

In [19]:
# set the number of clusters (similarity matrix same as above assumed)
n_groups_list = [2, 3, 5, 10, 20, 30, 50, 100, 150, 200, 250, 500, 750, 1000, 1250, len(spectra)-1]

In [20]:
kmedoid_grid = run_kmedoid_grid(convert_similarity_to_distance(similarity_matrix), n_groups_list) # uses the similarity matrix selected above
print_kmedoid_grid(kmedoid_grid)
plot_kmedoid_grid(kmedoid_grid)

Kmedoid grid results. Use to inform kmedoid classification selection ilocs.
    iloc     k  silhouette_score  random_seed_used
0      0     2          0.170511                 0
1      1     3          0.103227                 0
2      2     5          0.120938                 0
3      3    10          0.140374                 0
4      4    20          0.154879                 0
5      5    30          0.179703                 0
6      6    50          0.195309                 0
7      7   100          0.215762                 0
8      8   150          0.214980                 0
9      9   200          0.225087                 0
10    10   250          0.226147                 0
11    11   500          0.217264                 0
12    12   750          0.183405                 0
13    13  1000          0.119485                 0
14    14  1250          0.016010                 0
15    15  1266          0.001090                 0
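
To make the idea behind k-medoid clustering concrete, here is a minimal alternating k-medoids sketch on a precomputed distance matrix. It is an illustration under simplifying assumptions (random initialisation, a fixed iteration cap) and is not the implementation used by `run_kmedoid_grid`. The function name `simple_kmedoids` is hypothetical.

```python
import numpy as np

def simple_kmedoids(distance_matrix, k, n_iterations=100, seed=0):
    """Minimal alternating k-medoids on a precomputed distance matrix (sketch)."""
    rng = np.random.default_rng(seed)
    n = distance_matrix.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iterations):
        # Assignment step: each point joins its nearest medoid
        labels = np.argmin(distance_matrix[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for cluster in range(k):
            members = np.flatnonzero(labels == cluster)
            if members.size == 0:
                continue  # keep the old medoid for an empty cluster
            # Update step: pick the member with the smallest summed distance to the others
            within = distance_matrix[np.ix_(members, members)].sum(axis=1)
            new_medoids[cluster] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged: medoids no longer change
        medoids = new_medoids
    labels = np.argmin(distance_matrix[:, medoids], axis=1)
    return medoids, labels

# Toy data: two tight groups on a line, far apart from each other
points = np.array([0.0, 0.1, 0.2, 10.0, 10.1, 10.2])
distances = np.abs(points[:, None] - points[None, :])
medoids, labels = simple_kmedoids(distances, k=2)
print(labels)
```

Unlike k-means, each cluster centre is an actual data point, which is why the method only needs pairwise distances and works directly on a distance matrix derived from spectral similarities.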


In [21]:
# SELECT KMEDOID GRID ENTRY & extract assignment list
k_medoid_iloc = 6
kmedoid_assignments = get_kmedoid_grid_entry_cluster_assignments(kmedoid_grid,  k_medoid_iloc)
print(kmedoid_assignments[0:5])
print(np.unique(kmedoid_assignments))

['km_33', 'km_30', 'km_6', 'km_17', 'km_36']
['km_0' 'km_1' 'km_10' 'km_11' 'km_12' 'km_13' 'km_14' 'km_15' 'km_16'
 'km_17' 'km_18' 'km_19' 'km_2' 'km_20' 'km_21' 'km_22' 'km_23' 'km_24'
 'km_25' 'km_26' 'km_27' 'km_28' 'km_29' 'km_3' 'km_30' 'km_31' 'km_32'
 'km_33' 'km_34' 'km_35' 'km_36' 'km_37' 'km_38' 'km_39' 'km_4' 'km_40'
 'km_41' 'km_42' 'km_43' 'km_44' 'km_45' 'km_46' 'km_47' 'km_48' 'km_49'
 'km_5' 'km_6' 'km_7' 'km_8' 'km_9']
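
Note that `np.unique` sorts the labels alphabetically, so `km_10` appears before `km_2`. The sketch below shows one way to order labels by their numeric suffix and to count cluster sizes, using the first five assignments printed above. The helper `cluster_number` is hypothetical.

```python
from collections import Counter

# First five assignments printed above
assignments = ["km_33", "km_30", "km_6", "km_17", "km_36"]

def cluster_number(label: str) -> int:
    """Extract the integer suffix from a label such as 'km_17'."""
    return int(label.split("_")[1])

# Sort labels by their numeric suffix instead of alphabetically
ordered_labels = sorted(set(assignments), key=cluster_number)
# Number of features assigned to each cluster
cluster_sizes = Counter(assignments)
print(ordered_labels)
print(cluster_sizes["km_6"])
```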


# Interactive Network Visualization

In this section we generate the network visualization used for exploring the data interactively. The code below mimics that of the main practical.

In [22]:
edge_list = generate_edge_list(similarity_matrix, [s.feature_id for s in spectra], top_k = 50)
# sorting edge list:
sorted_indices = np.array([elem["data"]["weight"] for elem in edge_list]).argsort()
edge_list = np.array(edge_list)[sorted_indices[::-1]].tolist()
edge_list[0:3]

[{'data': {'source': 'CCMSLIB00000080005',
   'target': 'CCMSLIB00000080372',
   'weight': 0.9950322181963641,
   'label': '1.0',
   'id': 'CCMSLIB00000080005-to-CCMSLIB00000080372'}},
 {'data': {'source': 'CCMSLIB00000079921',
   'target': 'CCMSLIB00000080320',
   'weight': 0.9928834953127419,
   'label': '0.99',
   'id': 'CCMSLIB00000079921-to-CCMSLIB00000080320'}},
 {'data': {'source': 'CCMSLIB00000079905',
   'target': 'CCMSLIB00000079996',
   'weight': 0.9922914289708642,
   'label': '0.99',
   'id': 'CCMSLIB00000079905-to-CCMSLIB00000079996'}}]
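
The element structure printed above can be reproduced with a small top-K edge builder. The sketch below is a hypothetical illustration of the idea (keep each node's `top_k` most similar neighbours, store each undirected edge once) and is not the implementation of `generate_edge_list`. The function name and toy identifiers are made up.

```python
def build_top_k_edges(similarity, feature_ids, top_k):
    """Hypothetical sketch: keep each node's top_k most similar neighbours as edges."""
    edges = {}
    for i, row in enumerate(similarity):
        # Rank all other nodes by similarity to node i, highest first
        neighbours = sorted(
            (j for j in range(len(row)) if j != i),
            key=lambda j: row[j],
            reverse=True,
        )[:top_k]
        for j in neighbours:
            low, high = sorted((i, j))  # store each undirected edge once
            edge_id = f"{feature_ids[low]}-to-{feature_ids[high]}"
            edges[edge_id] = {"data": {
                "source": feature_ids[low],
                "target": feature_ids[high],
                "weight": row[j],
                "label": str(round(row[j], 2)),
                "id": edge_id,
            }}
    # Strongest edges first
    return sorted(edges.values(), key=lambda e: e["data"]["weight"], reverse=True)

similarity = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
]
toy_edges = build_top_k_edges(similarity, ["a", "b", "c"], top_k=1)
print([edge["data"]["id"] for edge in toy_edges])
```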

In [23]:
class_assignments = kmedoid_assignments
node_list = generate_node_list_no_stats(spectra=spectra, coordinates_table=coordinates_table, group_ids=class_assignments)
node_list[0:3]

[{'data': {'id': 'CCMSLIB00000079350',
   'precursor_mz': 408.156,
   'label': 'CCMSLIB00000079350; 408.156',
   'size': 25,
   'log2ratio': 'none',
   'effect_direction': 'none',
   'group': 'group_km_33'},
  'position': {'x': -635.4598999023438, 'y': -540.330696105957},
  'classes': 'group_km_33'},
 {'data': {'id': 'CCMSLIB00000079351',
   'precursor_mz': 277.098,
   'label': 'CCMSLIB00000079351; 277.098',
   'size': 25,
   'log2ratio': 'none',
   'effect_direction': 'none',
   'group': 'group_km_30'},
  'position': {'x': 1599.769401550293, 'y': -1093.2065963745117},
  'classes': 'group_km_30'},
 {'data': {'id': 'CCMSLIB00000079352',
   'precursor_mz': 269.081,
   'label': 'CCMSLIB00000079352; 269.081',
   'size': 25,
   'log2ratio': 'none',
   'effect_direction': 'none',
   'group': 'group_km_6'},
  'position': {'x': 2600.8554458618164, 'y': 1264.830207824707},
  'classes': 'group_km_6'}]
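
Comparing the node positions with the t-SNE coordinates printed earlier suggests that the coordinates are multiplied by 100 for display, and that the group label is the cluster assignment prefixed with `group_`. Both are inferences from the output, not documented behaviour. The sketch below builds one such node element under these assumptions; `build_node` is a hypothetical helper and omits the `log2ratio` and `effect_direction` fields.

```python
def build_node(feature_id, precursor_mz, x, y, group_id, scale=100.0):
    """Hypothetical sketch of one Cytoscape node element, modelled on the output above."""
    group = f"group_{group_id}"
    return {
        "data": {
            "id": feature_id,
            "precursor_mz": precursor_mz,
            "label": f"{feature_id}; {precursor_mz}",
            "size": 25,
            "group": group,
        },
        # Assumption: embedding coordinates are multiplied by 100 for display
        "position": {"x": x * scale, "y": y * scale},
        "classes": group,
    }

node = build_node("CCMSLIB00000079350", 408.156, -6.354599, -5.403307, "km_33")
print(node["data"]["label"], node["classes"])
```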

The code below runs an interactive network visualization for the data we've processed. Some remarks:

* The dashboard works best when used in a full browser window, which unfortunately does not work in colab. You may see an additional scroll bar appear at the side that you need to use to move from the top of the dashboard to the bottom.
* Be careful when zooming in and out (mouse wheel), as you can quickly lose the network within its canvas. You can left click and drag to move around the canvas.
* Clicking a node that is already selected will fail to trigger a callback. To change top-K for a selected node, change the top-K value, deselect the node (clicking on empty area), and re-select the node (clicking on node).
* If you have lost the network, it is easiest to regenerate the dashboard by rerunning the Jupyter cell.
* The top of the network contains some useful commands that can be expanded on click.
* Below the network there are a number of text output containers and a top-K adjustment slider.
* Sliding the top-K slider changes the number of edges shown for each node. Beware of large node selections and top-K combinations as they may crash the session!
* Hovering over a node will automatically highlight its corresponding cluster until another node is hovered over.
* You can select individual nodes, or multiple nodes, prompting detailed information to be overlaid or added below the network. To perform multi-selection, hold ctrl, shift, or cmd when clicking nodes, or hold shift and use the rectangular drag selection tool.
* Selection of one or more nodes results in: hover group highlighting, node information display below the network, and top-K edge overlay.
* Spectral plots are available: for a single selected node, a single spectrum plot is shown. For two selected nodes, a mirror plot is shown. For three to ten selected nodes, aligned spectral plots are shown. If more nodes are selected, spectral plots are no longer generated. Spectral plots are interactive, allowing zoom and hover tooltips.
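
The rule for which spectral plot appears, as described in the last remark, can be summarised as a small function. This is only a restatement of the described behaviour; `choose_spectral_plot` is a hypothetical helper and not part of the dashboard code.

```python
def choose_spectral_plot(n_selected: int) -> str:
    """Hypothetical helper mapping a selection size to the plot type described above."""
    if n_selected <= 0:
        return "none"
    if n_selected == 1:
        return "single spectrum"
    if n_selected == 2:
        return "mirror plot"
    if n_selected <= 10:
        return "aligned spectra"
    return "none"  # more than 10 nodes: no spectral plot is generated

for n in (1, 2, 3, 10, 11):
    print(n, choose_spectral_plot(n))
```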


In [24]:
# RUN INTERACTIVE NETWORK VISUALIZATION
%load_ext autoreload
%autoreload 2
from compMetabolomics.dash_group_highlight_topknet import run_network_visualization
app = run_network_visualization(node_list, edge_list, 50, spectra)
if selected_environment == "runtime_colab":
  jupyter_mode = "inline" # to render output as jupyter cell
else:
  jupyter_mode = "external" # to render output in a separate browser tab (does not work in colab)
app.run_server(port = "8051", jupyter_mode = jupyter_mode)

Dash app running on http://127.0.0.1:8051/


Use the following code cell to show the extended data available for a specific spectrum, given its identifier:

In [25]:
feature_to_show = "CCMSLIB00000079350" # -- modify the identifier string
get_spectrum_by_id(feature_to_show, json_spectra)

[{'charge': 1,
  'ionmode': 'positive',
  'smiles': 'OC(=O)[C@H](NC(=O)CCN1C(=O)[C@@H]2Cc3ccccc3CN2C1=O)c4ccccc4',
  'scans': '1865',
  'ms_level': '2',
  'instrument_type': 'LC-ESI-qTof',
  'file_name': 'p1-A05_GA5_01_17878.mzXML',
  'peptide_sequence': '*..*',
  'organism_name': 'GNPS-NIH-NATURALPRODUCTSLIBRARY',
  'compound_name': '(2R)-2-[3-[(10aS)-1,3-dioxo-10,10a-dihydro-5H-imidazo[1,5-b]isoquinolin-2-yl]propanoylamino]-2-phenylacetic acid"',
  'principal_investigator': 'Dorrestein',
  'data_collector': 'VVP/LMS',
  'submit_user': 'vphelan',
  'confidence': '1',
  'spectrum_id': 'CCMSLIB00000079350',
  'precursor_mz': 408.156,
  'adduct': '[M+H]+',
  'feature_id': 'CCMSLIB00000079350',
  'retention_time': 'not-available',
  'peaks_json': [[95.718742, 0.007350539039529565],
   [97.039551, 0.007840574975498203],
   [97.131714, 0.009637373407383208],
   [97.412415, 0.009474028095393662],
   [97.58535, 0.007350539039529565],
   [99.071404, 0.007350539039529565],
   [100.596634, 0.007