# Setting up the environment (Google Colab)

The following script sets up the Jupyter notebook to work in Google Colab. The notebook should be downloaded from [*github*](), uploaded to your *gdrive*, and opened using *Google Colaboratory*. The course module package is cloned from GitHub and installed. Please be aware that the code is written so that it does not run automatically; the user needs to confirm cloning and installation explicitly by entering yes at the input prompt. The code block contains a number of checks so that a run_all command can be used without re-triggering installation or input requests (*resetting the kernel will lead to a re-evaluation and a quick re-install, since all dependencies are already available*). Please note that the git clone command fails if the module folder is already present, which is why the code removes any existing folder before cloning. A failed clone does not impact the runtime in any way. To remove the folder manually in Google Colab, use the following command in a Jupyter notebook cell: 

    !rm -rf moduleCompMet2024May

In [None]:
selected_environment = "runtime_colab" # or "runtime_local"
import importlib.util
if 'confirm' not in locals(): # this effectively caches the response so run all can be used
  confirm = input("WARNING: The git clone and pip install commands can have unintended side effects when used outside of google colab. Please enter >yes< to continue (without arrow brackets).")
if (selected_environment == "runtime_colab") and (not importlib.util.find_spec("compMetabolomics") and (confirm == "yes")):
    !rm -rf moduleCompMet2024May
    !git clone https://github.com/kevinmildau/moduleCompMet2024May
    !pip install moduleCompMet2024May/.
if confirm != "yes":
    print(f"Clone and installation skipped since confirm = >{confirm}<")

# Setting up the environment (local installation)

To run the module locally on your computer you need to have conda installed ([install guide](https://conda.io/projects/conda/en/latest/user-guide/install/index.html)).
The best way to make use of the course modules is to clone the github repository, open the repository folder using the command line, and run the following lines of code:

    conda create --name compMetEnv python=3.10
    conda activate compMetEnv
    pip install .
    jupyter-notebook

These commands 1) create a new isolated Python environment, 2) activate this environment, 3) install the compMetabolomics course modules, and 4) start a Jupyter notebook server from within the conda environment. Set-up may differ when using Google Colab or hosted JupyterLab environments.
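
To check from within Python whether the installation succeeded, a minimal sketch along the following lines may help. It relies only on the standard library; the helper name `is_installed` is hypothetical and not part of the course module.

```python
import importlib.util

def is_installed(package_name: str) -> bool:
    """Return True if Python can locate an importable top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None

# Expected to report True once `pip install .` has completed in the active environment.
print("compMetabolomics installed:", is_installed("compMetabolomics"))
```

The Colab set-up cell uses the same `find_spec` check to avoid re-installing the package on repeated runs.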

# Background on Mass Spectral Networking

## Spectral Networking - A means of organizing and visualizing abstract spectral data

In this practical we will delve deeper into the computational metabolomics toolkit in the form of mass spectral molecular networking. Mass spectral networking, also known as molecular networking, is a staple method used by untargeted metabolomics researchers for exploratory data analysis and for presenting their results. Networks serve to organize the data, find connections between different spectra, and assist in propagating structural information to and from neighboring nodes. In addition, spectral feature groups, also known as molecular families, often serve as a canvas for overlaying annotation information to be presented in scientific papers and presentations.

## Spectral Networking - Spectral and Structural Similarity Link

On a conceptual level, molecular networking inverts the observation that similar structures tend to fragment similarly into the reverse hypothesis that similar spectra imply similar structures. This inversion opens up a way to organize unknown spectra into groups with implied structural overlaps. While we may not know the chemical identity of the grouped spectra, we do know that their spectral similarity implies a certain structural similarity. It is important to take into account that we may have library matches or high confidence structural annotations for at least some of the features, rendering the molecular families a promising stepping stone into comparative spectral analysis with the purpose of structural elucidation.
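
To make the notion of spectral similarity concrete, the sketch below computes a plain cosine score between binned fragment spectra. This is a simplified illustration only: the modified cosine score used in molecular networking additionally matches fragments shifted by the precursor mass difference, which is not implemented here. The function name, bin width, and toy peak lists are assumptions made for the example.

```python
import numpy as np

def binned_cosine(peaks_a, peaks_b, bin_width=1.0, max_mz=2000.0):
    """Cosine similarity of two peak lists [(mz, intensity), ...] after m/z binning."""
    n_bins = int(max_mz / bin_width) + 1
    vectors = []
    for peaks in (peaks_a, peaks_b):
        vec = np.zeros(n_bins)
        for mz, intensity in peaks:
            vec[int(mz / bin_width)] += intensity  # accumulate intensity per m/z bin
        vectors.append(vec)
    vec_a, vec_b = vectors
    norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    return float(vec_a @ vec_b / norm) if norm > 0 else 0.0

# Toy spectra (made-up values for illustration)
spectrum_1 = [(85.03, 20.0), (127.04, 100.0), (289.07, 45.0)]
spectrum_2 = [(85.03, 25.0), (127.04, 90.0), (303.09, 40.0)]  # two shared fragments
spectrum_3 = [(60.5, 100.0), (410.2, 30.0)]                   # no shared fragments

print("similar pair:   ", round(binned_cosine(spectrum_1, spectrum_2), 3))
print("dissimilar pair:", round(binned_cosine(spectrum_1, spectrum_3), 3))
```

Spectra sharing their most intense fragments score close to 1, while spectra without shared fragments score 0.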

## Spectral Networking - Data Processing and Visualization

On a technical level, molecular networking can be perceived as a two-step data processing and data visualization workflow. In the data processing step, spectral data are turned into a pairwise similarity matrix based on the modified cosine score, which is then turned into a collection of subnetworks using various topological settings, including:

+ **spectral similarity thresholds**: a minimum pairwise similarity cutoff value required before a connection (edge, link) between spectra is made. This limits connectivity to promising pairwise relationships and prevents visual overload from excess connections. 
+ **minimum fragment overlaps**: a minimum number of shared fragments required between a pair of spectra before the pair is considered for connection. This limits connectivity to promising pairwise relationships and prevents visual overload from excess connections. 
+ **maximum node degree**: a maximum number of connections to a given feature, used primarily to prevent visual clutter from excessive numbers of edges for certain nodes.
+ **maximum sub-network size**: a limit on the number of members within a molecular family; essentially a limitation to cluster size.

The molecular networking workflow makes use of these settings to generate a collection of disjoint sub-networks from the full data. These sub-networks, some of which will inevitably be singletons (single features disconnected from everything else), are visualized all together (side by side) or group-wise as network diagrams (also known as node-link or vertex-edge diagrams) in Cytoscape. Molecular networks in Cytoscape are laid out by sub-network size, which can give the misleading impression that larger clusters are more important.
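
The following minimal sketch illustrates how a similarity threshold and a maximum node degree can turn a similarity matrix into disjoint sub-networks. It is not the GNPS implementation: the function names, the greedy strongest-edge-first strategy, and the toy matrix are assumptions, and the minimum fragment overlap and maximum sub-network size settings are left out for brevity.

```python
import numpy as np

def build_edges(similarity, threshold=0.7, max_degree=2):
    """Keep edges above the threshold, strongest first, respecting a per-node degree cap."""
    n = similarity.shape[0]
    candidates = [
        (float(similarity[i, j]), i, j)
        for i in range(n) for j in range(i + 1, n)
        if similarity[i, j] >= threshold
    ]
    degree = np.zeros(n, dtype=int)
    edges = []
    for weight, i, j in sorted(candidates, reverse=True):
        if degree[i] < max_degree and degree[j] < max_degree:
            edges.append((i, j, weight))
            degree[i] += 1
            degree[j] += 1
    return edges

def connected_components(n_nodes, edges):
    """Group nodes into disjoint sub-networks using a simple union-find."""
    parent = list(range(n_nodes))

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    for i, j, _ in edges:
        parent[find(i)] = find(j)
    groups = {}
    for node in range(n_nodes):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values(), key=len, reverse=True)

# Toy similarity matrix for six features (made-up values)
similarity = np.array([
    [1.00, 0.90, 0.80, 0.10, 0.00, 0.05],
    [0.90, 1.00, 0.75, 0.20, 0.00, 0.10],
    [0.80, 0.75, 1.00, 0.10, 0.10, 0.00],
    [0.10, 0.20, 0.10, 1.00, 0.85, 0.20],
    [0.00, 0.00, 0.10, 0.85, 1.00, 0.30],
    [0.05, 0.10, 0.00, 0.20, 0.30, 1.00],
])
edges = build_edges(similarity)
families = connected_components(len(similarity), edges)
print("edges:", edges)
print("molecular families (largest first):", families)
```

In this toy example the last feature has no similarity above the threshold and ends up as a singleton. Lowering `max_degree` to 1 removes the weaker edges of the first family and splits it further.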

## Spectral Networking - Data Subdivision & Structural Hypothesis Generation

Spectral similarity groupings and their network visualization are useful for two primary reasons. First, they subdivide large heterogeneous datasets into smaller, more homogeneous and thus more manageable subsets. Dealing with small subnetworks and gaining an overview of the features within them is much more straightforward than dealing with a whole dataset at once. Second, the nodes within the groupings contain an implicit structural relationship with one another that may be useful for structural hypothesis generation. The latter is especially useful in conjunction with library matches or high confidence annotations for some spectra.


# Practical Assignments 

In this practical we will make use of the publicly available natural product discovery dataset of Soliman Kathib. In this study, Kathib and colleagues explored the effect of increased fractions of olive mill solid waste (OMSW) in the growth substrate of Hericium and Pleurotus mushrooms. For this practical we will look at the Pleurotus data only, and compare zero percent OMSW against eighty percent OMSW. More information on the dataset can be found in the [preprint](https://www.biorxiv.org/content/10.1101/2024.02.09.579616v1.full.pdf).

***Task: Open the mushroom data of Soliman Kathib using the gnps network viewer [(link)](https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=60727fe5228643e6a482bd797d83df38). Assume you are interested in how the chemistry of Pleurotus mushrooms adjusts to the increased amount of Olive Mill Solid Waste (OMSW) in the growth substrate. Do the network views provide a means of identifying important node clusters or nodes?***

<details>
    <summary>Hint</summary>
    The web browser version (Network Visualizations --> View Spectral Families (In Browser Network Visualizer)) does not integrate enough statistical information to reveal which nodes and node clusters show differential intensity trends. To inspect this aspect of the data, the Cytoscape desktop app needs to be used to visualize the whole dataset, and some custom styling needs to be applied. Follow the instructions below to generate a suitable network view:  <br> 
    --> download the *.graphml* file from the gnps repository (Advanced Views - External Visualization [Direct Cytoscape Preview/Download]), <br> 
    --> open Cytoscape and navigate to import network from file (top bar), <br> 
    --> import the network data, <br> 
    --> move to style, and make sure to activate node styling (on the left side), <br> 
    --> within the opened panel, find the image/chart entry, and click on the left-most selection box, <br> 
    --> this opens a window within which you can select between images, charts, and gradient overlays, <br> 
    --> move to Charts, <br> 
    --> select the pie chart, and make sure that the selected columns are GNPSGROUP:0 and GNPSGROUP:80, which represent the cumulative intensities for features in samples with these respective OMSW percentages. <br> 
    <br> 
    This will show cumulative intensities as percentage pie charts on the network nodes. An alternative representation is the bar chart; use the stacked variant and deactivate the "same value range for all charts" option for a better two-group comparison on a node-by-node basis.
</details>
<details>
    <summary>Answer</summary>
    The web browser network visualization does not offer enough visual integration to inspect statistical aspects of the data. 
    The gnps cytoscape network export does provide this information provided you apply appropriate styling. With some inspection, searching, and custom styling, one can find information on each node, as well as sample group specific abundance patterns. Molecular families of interest can become apparent via differential intensity coloring. It should be noted that these analyses are exploratory, and may be heavily distorted by sampling artefacts such as batch effects. Example views from Cytoscape where OMSW0 sample intensities are portrayed in blue and OMSW80 sample intensities are portrayed in green: <br>
    Molecular Families with differences and overlaps between conditions: <br>
    <img src="https://raw.githubusercontent.com/kevinmildau/moduleCompMet2024May/main/images/omsw%200%20vs%2080%20(80%20is%20green)%20overlaps.png">
    Molecular Families with almost no overlap between conditions: <br>
    <img src="https://github.com/kevinmildau/moduleCompMet2024May/blob/main/images/omsw%200%20vs%2080%20(80%20is%20green)%20large%20difference.png?raw=true">
</details>


*Inspect the edge list and node list data within Cytoscape. What information did the GNPS workflow add to the spectral data?*
<details>
    <summary>Answer</summary>
    The edge list does not provide any useful information beyond the connectivity shown in the network view itself. The node view contains many data columns with processed metadata and any annotations from library matching.  <br>
    Partial view of tabular data available in node table: <br>
    <img src="https://github.com/kevinmildau/moduleCompMet2024May/blob/main/images/annotations%20data.png?raw=true">
</details>

## Limitations of "molecular networking"

While clearly advantageous and an important first step in generating organization in datasets, molecular networking and its subdivision into subnetworks does come with tradeoffs. Subdividing networks into strict subnetworks may obscure relationships between such subnetworks or across the spectra they contain. While strict edge cutoffs may be needed for organizing spectral data into disjoint groups using this method, they do represent a loss of topological neighborhood information. The feature grouping itself relies on a plethora of topological settings and is correspondingly difficult to tune. This is especially true when switching to scoring approaches that behave differently from the modified cosine score, such as MS2DeepScore. Modified cosine scoring tends to produce sparse matrices suitable for disjoint subdivision, while machine learning embedding-based scores such as MS2DeepScore tend to create much higher interconnectivity between features, easily leading to dense hairball networks that are difficult to read.

In addition to difficulties in finding the right settings to use, the molecular networking workflow comes in a blank canvas form; nodes and edges are shown, some metadata and annotation information is integrated, yet visual mappings are left to the user. Hence, a good understanding of the cytoscape user interface and its settings is required to achieve good visual integration of the type of information sought. As such, molecular networking is better viewed as a starting point for additional customization and data explorations rather than an exploratory end-point.

In the following tasks we will make use of different approaches to generating molecular network-like visualizations interactively within this jupyter notebook. Specifically, we will explore a workflow similar to the one used in [specXplore](https://doi.org/10.1021/acs.analchem.3c04444) and [msFeaST](https://doi.org/10.26434/chemrxiv-2024-h7sm8), where two-dimensional embedding projections and interactive network overlays are used together.

# Interactive Code Examples

In [None]:
# LOADING PYTHON PACKAGES & COMP-METABOLOMICS MODULES
import numpy as np
import pandas as pd
import os
import copy
import json
from compMetabolomics.map_to_size import transform_log2_fold_change_to_node_size
from compMetabolomics.spectrum import Spectrum, load_json_spectra, parse_json_spectrum, get_min_max, get_spectrum_ids
from compMetabolomics.tsne_embedding import run_tsne_grid, plot_tsne_grid, print_tsne_grid, extract_coordinates_from_entry, plot_embedding
from compMetabolomics.kmedoid_clustering import run_kmedoid_grid, print_kmedoid_grid, plot_kmedoid_grid, get_kmedoid_grid_entry_cluster_assignments
from compMetabolomics.utils import convert_similarity_to_distance
from compMetabolomics.integrate import align_with_spectral_feature_id
from compMetabolomics.construct_cytoscape_elements import generate_edge_list, generate_node_list

The data path will differ depending on the environment within which the Jupyter notebook is run. The code below works for both a local and a Google Colab setup, provided that "selected_environment" has been set to the matching value.

In [None]:
# DEFINE DATA DIRECTORIES AND FILEPATHS
if selected_environment == "runtime_local":
  data_directory = os.path.join("data")
if selected_environment == "runtime_colab":
  data_directory = os.path.join("moduleCompMet2024May", "data")
filepath_spectra = os.path.join(data_directory, "spectra.json")
filepath_similarity_matrix_modcos = os.path.join(data_directory, "similarities_modcos.npy")
filepath_similarity_matrix_ms2deepscore = os.path.join(data_directory, "similarities_ms2ds.npy")
filepath_quant_table = os.path.join(data_directory, "quant_table.csv")
filepath_treat_table = os.path.join(data_directory, "treat_table.csv")

***TASK: Load and inspect the example data using the code below. Give a short description of each data component.***
<details>
    <summary>Hint</summary>
    For each loaded data item, inspect the console output. To describe the data give a short description of the data
    utility in the context of molecular networking. The data shown here is a subset and tidied up version of the 
    information you can find in the tabular and spectral exports provided by gnps. Importantly, the feature_id's here
    correspond to the scans (in spectral .mgf file), name / shared_name entries (cytoscape node table), and row ID 
    (.csv file) seen in the gnps output.
</details>
<details>
    <summary>Answer</summary>

1. Spectra contains the spectral information for each ms/ms feature: the feature_id, precursor_mz, retention_time, and the fragment mass-to-charge ratio and intensity pairs.

2. The similarity matrices contain pairwise spectral similarities computed using the modified cosine score or the ms2deepscore 2.0 score. They give an indication of how similar the spectra are.

3. The quant_table and treat_table contain tabular data on the ms1 information (precursor intensity per sample) and treatments (omsw0 vs omsw80). Notice that the intensities within the quantification table have been normalized to sum to unit intensity within each sample.

</details>

In [None]:
# LOAD & INSPECT INPUT DATA SOURCES - SPECTRA
spectra = load_json_spectra(filepath_spectra)
print("Spectra is a list of Spectrum tuples: \n", spectra[0])

In [None]:
# LOAD & INSPECT INPUT DATA SOURCES - SPECTRAL SIMILARITY MATRICES
similarity_matrix_modcos = np.loadtxt(filepath_similarity_matrix_modcos)
similarity_matrix_ms2ds = np.loadtxt(filepath_similarity_matrix_ms2deepscore)
print(
  "The similarity matrices are square matrices displaying similarity score output",
  "for each feature pair (via index):\n",
  "Modified Cosine Score entries (first 4)\n", 
  similarity_matrix_modcos[0:4, 0:4],
  "\nMs2DeepScore entries (first 4) \n", 
  similarity_matrix_ms2ds[0:4, 0:4]
)

In [None]:
# LOAD & INSPECT INPUT DATA SOURCES - QUANTIFICATION TABLE & STATISTICAL METADATA
quant_table = pd.read_csv(filepath_quant_table)
treat_table = pd.read_csv(filepath_treat_table)
print(
  "The quantification table with sample_id vs feature specific intensity. Normalized to unit sum per sample.\n",
  quant_table.iloc[:, 0:7], 
  "\nThe treatment table with sample_id vs treatment.\n",
  treat_table
)
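
The unit-sum normalization mentioned above can be sketched as follows. The toy table is made up, and its layout (one `sample_id` column plus one intensity column per feature) is an assumption based on how the quantification table is used later in this notebook.

```python
import pandas as pd

# Hypothetical raw intensity table: one row per sample, one column per feature
raw_table = pd.DataFrame({
    "sample_id": ["sample_a", "sample_b"],
    "1": [200.0, 50.0],
    "2": [600.0, 50.0],
    "3": [200.0, 100.0],
})

feature_columns = raw_table.drop(columns="sample_id")
# Divide every row by its row sum so that intensities sum to 1 within each sample
normalized = feature_columns.div(feature_columns.sum(axis=1), axis=0)
normalized.insert(0, "sample_id", raw_table["sample_id"])

print(normalized)
print("Row sums:", normalized.drop(columns="sample_id").sum(axis=1).tolist())
```

The same row-sum check can be applied to the loaded `quant_table` to confirm that each sample sums to approximately one.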

# Embedding the spectral similarity data using t-SNE

As a first step of our exploration of the data we will perform a t-SNE embedding of the spectral similarity matrix.

***TASK: Select a pairwise similarity matrix to work with: similarity_matrix_ms2ds or similarity_matrix_modcos. Run the t-SNE grid computations and select an embedding. Give a brief explanation of your choice.***
<details>
    <summary>Hint/Answer</summary>
    Perplexity as a parameter roughly corresponds to the number of nearest neighbors considered in the t-SNE embedding. The Pearson and Spearman correlations should ideally be high, yet at the same time the embedding should lead to good point clusters. Notice that high perplexity values improve the overall distance preservation. However, local grouping becomes poorer: while the overall score benefits from better global distance preservation, local distance preservation in the form of nearest neighbors can suffer. This is difficult to see in static representations but becomes apparent when working with the network overlays below. In this practical, we stick to lower perplexity values as we would like to inspect local connectivity using network approaches.
</details>
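
The distance preservation scores discussed in the hint can be illustrated with the sketch below, which correlates the original pairwise distances with the pairwise distances in an embedding. How the course module computes its scores internally is not shown in this notebook, so the functions and the random toy data here are assumptions for illustration only.

```python
import numpy as np

def pairwise_euclidean(points):
    """Full matrix of Euclidean distances between all rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def to_ranks(values):
    """Rank transform via double argsort (ties are not averaged in this sketch)."""
    return np.argsort(np.argsort(values))

def distance_preservation(original_distances, embedding):
    """Pearson and Spearman correlation between original and embedded pairwise distances."""
    upper = np.triu_indices(original_distances.shape[0], k=1)  # each pair once
    d_original = original_distances[upper]
    d_embedded = pairwise_euclidean(embedding)[upper]
    pearson = np.corrcoef(d_original, d_embedded)[0, 1]
    spearman = np.corrcoef(to_ranks(d_original), to_ranks(d_embedded))[0, 1]
    return float(pearson), float(spearman)

rng = np.random.default_rng(0)
points_3d = rng.normal(size=(30, 3))
original = pairwise_euclidean(points_3d)
good_embedding = points_3d[:, :2]          # drops one axis, keeps most of the structure
poor_embedding = rng.normal(size=(30, 2))  # unrelated to the original data

print("good embedding:", distance_preservation(original, good_embedding))
print("poor embedding:", distance_preservation(original, poor_embedding))
```

Both correlations summarize all pairwise distances at once, which is why they reward global structure and say little about whether the nearest neighbors of each point are preserved.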

In [None]:
similarity_matrix = similarity_matrix_ms2ds # feel free to rerun this and subsequent cells with similarity_matrix_modcos for networking using this score

In [None]:
tsne_grid = run_tsne_grid(
  convert_similarity_to_distance(similarity_matrix), 
  perplexity_values = [10, 20, 30, 40] # more values are possible here (time consuming), e.g. [50, 200, 800]
)
plot_tsne_grid(tsne_grid)
print_tsne_grid(tsne_grid)

In [None]:
tsne_iloc = 3
coordinates_table = extract_coordinates_from_entry(tsne_grid[tsne_iloc])
print(coordinates_table[0:4])

In [None]:
plot_embedding(coordinates_table, feature_ids = [spec.feature_id for spec in spectra])

# Clustering the spectral similarity data using k-medoid clustering

***TASK: Run k-medoid clustering on the pairwise similarity data to obtain data subdivisions akin to molecular families. What do you notice about the silhouette scores?***
<details>
    <summary>Hint/Answer</summary>
    According to the silhouette score, larger numbers of clusters would be appropriate, with a good choice being around 500 clusters. Notice, however, that silhouette scores are always rather poor for this data. This is because there are often no clear-cut differences between clusters. For visual ease within this practical we choose iloc = 3, which corresponds to 50 clusters.
</details>
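
For reference, the silhouette score of a clustering can be computed directly from a precomputed distance matrix, as in the sketch below. For each point it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b) and reports (b - a) / max(a, b), averaged over all points. The function name and toy data are assumptions; the sketch expects at least two clusters and is not necessarily how the course module computes its scores.

```python
import numpy as np

def silhouette_score_precomputed(distances, labels):
    """Mean silhouette coefficient from a square distance matrix and cluster labels."""
    labels = np.asarray(labels)
    unique_labels = np.unique(labels)
    scores = []
    for i in range(len(labels)):
        same_cluster = labels == labels[i]
        same_cluster[i] = False  # exclude the point itself
        if not same_cluster.any():
            scores.append(0.0)   # common convention: singleton clusters score 0
            continue
        a = distances[i, same_cluster].mean()
        b = min(
            distances[i, labels == other].mean()
            for other in unique_labels if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated groups on a line (made-up values)
positions = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
distances = np.abs(positions[:, None] - positions[None, :])

good_labels = [0, 0, 0, 1, 1, 1]
poor_labels = [0, 1, 0, 1, 0, 1]
print("good clustering:", round(silhouette_score_precomputed(distances, good_labels), 3))
print("poor clustering:", round(silhouette_score_precomputed(distances, poor_labels), 3))
```

Scores near 1 indicate compact, well-separated clusters, whereas scores near 0 or below indicate overlapping clusters, which matches the behavior described for the spectral data above.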

In [None]:
# set the number of clusters (similarity matrix same as above assumed)
n_groups_list = [10, 20, 30, 50, 100, 150, 200, 250, 500, 1000, 1500]

In [None]:
kmedoid_grid = run_kmedoid_grid(convert_similarity_to_distance(similarity_matrix), n_groups_list) # uses the similarity matrix selected above
print_kmedoid_grid(kmedoid_grid)
plot_kmedoid_grid(kmedoid_grid)

In [None]:
# SELECT KMEDOID GRID ENTRY & extract assignment list
k_medoid_iloc = 3 # index into n_groups_list; 3 corresponds to 50 clusters
kmedoid_assignments = get_kmedoid_grid_entry_cluster_assignments(kmedoid_grid,  k_medoid_iloc)
print(kmedoid_assignments[0:5])
print(np.unique(kmedoid_assignments))

# Computing Log2FoldChanges for the features

***TASK: Run the log2 fold-change computations below to create feature specific trend estimates across treatment groups. Explain why fold-change is an appropriate measure here.***
<details>
    <summary>Hint/Answer</summary>
    Fold change gives a measure of the relative increase of the intensity of a feature in one group compared to the other. If a feature has low intensity in OMSW0 and high intensity in OMSW80, we will see a large fold change and be alerted to the respective treatment-responsive chemicals for further inspection. This information is precursor based and thus independent of the ms/ms fragmentation information we use to organize the spectra into groups of related compounds. Notice that fold change has one issue: as a relative measure it deals poorly with 0 values. In fact, a number of features in the dataset are completely absent in one group but present in the other, immediately leading to infinite fold changes. In addition, features with low abundance in one group can quickly display very high fold changes as well. To deal with potentially out-of-bounds fold changes, the node_size column computed below limits the node size to 50 for high or infinite fold changes.
</details>

The code below divides by zero for some features, which produces -inf and +inf values and triggers warnings. The warnings can be ignored, as these values are handled by the node size conversion function.
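
The sketch below illustrates how zero means lead to infinite or undefined log2 fold changes and how these can be mapped to a bounded node size. The mapping function here is hypothetical: only the upper bound of 50 is taken from the text above, while the minimum size of 10 and the saturation point at an absolute log2 fold change of 5 are assumptions that may differ from `transform_log2_fold_change_to_node_size`.

```python
import numpy as np

def log2_fold_change(mean_treated, mean_control):
    """log2(treated / control); zeros yield +inf, -inf, or nan without raising warnings."""
    treated = np.asarray(mean_treated, dtype=float)
    control = np.asarray(mean_control, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.log2(treated / control)

def fold_change_to_size(log2fc, min_size=10.0, max_size=50.0, max_abs_log2fc=5.0):
    """Hypothetical linear mapping of |log2 fold change| to a bounded node size."""
    magnitude = np.nan_to_num(np.abs(log2fc), nan=0.0, posinf=max_abs_log2fc)
    magnitude = np.clip(magnitude, 0.0, max_abs_log2fc)
    return min_size + (max_size - min_size) * magnitude / max_abs_log2fc

# Toy group means (made-up values): doubled, quartered, absent in control,
# absent in treatment, absent in both
mean_control = [0.1, 0.2, 0.0, 0.3, 0.0]
mean_treated = [0.2, 0.05, 0.4, 0.0, 0.0]

log2fc = log2_fold_change(mean_treated, mean_control)
sizes = fold_change_to_size(log2fc)
print("log2 fold changes:", log2fc)
print("node sizes:       ", sizes)
```

Features absent in one group receive the maximum size, and features absent in both groups (an undefined ratio) fall back to the minimum size in this sketch.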

In [None]:
# Compute log2fold Changes & align data (complex code, focus on the output data frame & collapse the processing script function)
def script_create_log2foldchange_summary(quantification_table, treatment_table, spectrum_list):
  # disconnect inputs from the out of function scope variables
  quant_table = copy.deepcopy(quantification_table)
  treat_table = copy.deepcopy(treatment_table)
  spectra = copy.deepcopy(spectrum_list)
  # Merge quantification and treatment tables
  joined_data = pd.merge(quant_table, treat_table, on='sample_id', how="inner")
  # drop sample_id column which is no longer needed in joined data
  joined_data = joined_data.drop('sample_id', axis=1)
  # drop treatment information from joined data to create feature quantification exclusive data
  numeric_feature_data = joined_data.drop('treatment', axis=1)
  # Compute mean by feature and group combination (feature_id will be a pandas.df index)
  mean_condition_0 = numeric_feature_data[joined_data['treatment'] == 'PleurotusOMSW0'].mean()
  mean_condition_80 = numeric_feature_data[joined_data['treatment'] == 'PleurotusOMSW80'].mean()
  # Merge pandas dataframes (using identical index)
  meansdf = pd.concat([mean_condition_0, mean_condition_80], axis=1, join="inner")
  # Rename columns
  meansdf.columns = ["omsw0", "omsw80"]
  # Compute ratio and log2 ratio
  meansdf['ratio'] = meansdf['omsw80'] / meansdf['omsw0']
  meansdf['log2ratio'] = np.log2(meansdf['ratio'])
  # add feature id column and reset index
  meansdf = meansdf.reset_index().rename(columns={'index': 'feature_id'})
  # compute node size from log2fold change (linear mapping with bounds)
  meansdf['node_size'] = [transform_log2_fold_change_to_node_size(val) for val in meansdf['log2ratio']]
  # give columns more meaningful names
  meansdf = meansdf.reset_index(drop=True).rename(columns={'omsw0': 'mean_omsw0', 'omsw80': 'mean_omsw80', 'ratio': 'ratio (omsw80 / omsw0)'})
  # add increasing vs decreasing qualifier
  meansdf["effect_direction"] = np.where((meansdf['mean_omsw80']-meansdf['mean_omsw0'])>=0, "positive", "negative")
  # Align summary data frame with the spectrum id ordering used elsewhere
  aligned_meansdf = align_with_spectral_feature_id(meansdf, spectra)
  return aligned_meansdf
summary_statistics_df = script_create_log2foldchange_summary(quant_table, treat_table, spectra)

In [None]:
summary_statistics_df

# Interactive Network Visualization

***TASK: The following code creates node and edge lists for interactive visualization of the data within the jupyter notebook. Run the code and inspect the node and edge lists.***
<details>
  <summary>Hint/Answer</summary>
</details>

In [None]:
edge_list = generate_edge_list(similarity_matrix, [s.feature_id for s in spectra], top_k = 50)
# sorting edge list:
sorted_indices = np.array([elem["data"]["weight"] for elem in edge_list]).argsort()
edge_list = np.array(edge_list)[sorted_indices[::-1]].tolist()
edge_list[0:3]

In [None]:
node_list = generate_node_list(spectra=spectra, coordinates_table=coordinates_table, group_ids=kmedoid_assignments, summary_statistics_df=summary_statistics_df)
node_list[0:3]
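
To clarify what a top-k edge list may look like, the sketch below links each node to its k most similar neighbors and stores the result as Cytoscape-style element dictionaries. The `data` and `weight` keys follow the structure accessed in the sorting code above; the `source` and `target` keys, the function name, and the toy data are assumptions and may differ from the output of `generate_edge_list`.

```python
import numpy as np

def top_k_edge_list(similarity, feature_ids, top_k=2):
    """Link every node to its top_k most similar neighbors; each pair is kept once."""
    n = similarity.shape[0]
    pairs = {}
    for i in range(n):
        scores = similarity[i].astype(float).copy()
        scores[i] = -np.inf  # never link a node to itself
        for j in np.argsort(scores)[::-1][:top_k]:
            key = (min(i, int(j)), max(i, int(j)))  # undirected edge identifier
            pairs[key] = float(similarity[key])
    edges = [
        {"data": {"source": str(feature_ids[i]), "target": str(feature_ids[j]), "weight": weight}}
        for (i, j), weight in pairs.items()
    ]
    # Strongest edges first
    return sorted(edges, key=lambda edge: edge["data"]["weight"], reverse=True)

# Toy similarity matrix and feature identifiers (made-up values)
toy_similarity = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.6, 0.3],
    [0.2, 0.6, 1.0, 0.8],
    [0.1, 0.3, 0.8, 1.0],
])
toy_feature_ids = ["101", "102", "103", "104"]

toy_edges = top_k_edge_list(toy_similarity, toy_feature_ids, top_k=1)
for edge in toy_edges:
    print(edge)
```

Because every node contributes up to k edges, the number of edges grows quickly with k, which is why large top-k settings combined with many selected nodes can overwhelm the interactive view.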

In [None]:
%load_ext autoreload
%autoreload 2
# BEWARE OF SELECTING LARGE NUMBERS OF NODES WITH LARGE TOPK SELECTED --> CAN CRASH THE SESSION
from compMetabolomics.dash_group_highlight_topknet import run_network_visualization as run_ns1
app = run_ns1(node_list, edge_list, 50, spectra)
app.run_server(port = "8050", jupyter_mode = "external")