# Background on Mass Spectral Networking

## Spectral Networking - A means of organizing and visualizing abstract spectral data
In this practical we will delve deeper into the computational metabolomics toolkit by looking at mass spectral molecular networking. Mass spectral networking, also known as molecular networking, is a staple method that untargeted metabolomics researchers use for exploratory data analysis and for presenting their results. Networks serve to organize the data, find connections between different spectra, and assist in propagating structural information between neighboring nodes. In addition, spectral feature groups, also known as molecular families, often serve as a canvas for overlaying annotation information in scientific papers and presentations.

## Spectral Networking - Spectral and Structural Similarity Link
On a conceptual level, molecular networking inverts the observation that similar structures tend to fragment similarly into the reverse hypothesis that similar spectra imply similar structures. This inversion opens up a way to organize unknown spectra into groups with implied structural overlaps. While we may not know the chemical identity of the grouped spectra, we do know that their spectral similarity implies a certain structural similarity. In addition, we may have library matches or high-confidence structural annotations for at least some of the features, which makes the molecular families a promising stepping stone into comparative spectral analysis for structural elucidation.
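To make the notion of spectral similarity concrete, the following is a minimal sketch of a plain cosine score with greedy peak matching. It is a simplification and not the modified cosine score used by molecular networking, which additionally allows peak matches shifted by the precursor mass difference. The function name, the `(mz, intensity)` tuple representation, and the default tolerance are assumptions made for illustration.

```python
import math


def cosine_score(spec_a, spec_b, tolerance=0.01):
    """Greedy cosine similarity between two spectra (illustrative sketch).

    Each spectrum is a list of (mz, intensity) tuples. Two peaks can be
    matched when their m/z values differ by at most `tolerance`.
    Returns the score and the number of matched peak pairs.
    """
    # Collect every candidate peak pair within tolerance.
    candidates = []
    for i, (mz_a, int_a) in enumerate(spec_a):
        for j, (mz_b, int_b) in enumerate(spec_b):
            if abs(mz_a - mz_b) <= tolerance:
                candidates.append((int_a * int_b, i, j))

    # Greedily keep the pairs with the largest intensity product,
    # using each peak at most once.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    dot, n_matches = 0.0, 0
    for product, i, j in candidates:
        if i in used_a or j in used_b:
            continue
        used_a.add(i)
        used_b.add(j)
        dot += product
        n_matches += 1

    # Normalize by the intensity vector lengths of both spectra.
    norm_a = math.sqrt(sum(intensity ** 2 for _, intensity in spec_a))
    norm_b = math.sqrt(sum(intensity ** 2 for _, intensity in spec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0, 0
    return dot / (norm_a * norm_b), n_matches
```

Two spectra with identical peaks score close to 1, while spectra without any peaks within tolerance score 0. The number of matched peaks is returned alongside the score because a high score based on a single shared fragment is much weaker evidence than one based on many.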

## Spectral Networking - Data Processing and Visualization
On a technical level, molecular networking can be viewed as a two-step workflow of data processing followed by data visualization. In the data processing step, spectral data are turned into a pairwise similarity matrix based on the modified cosine score. This matrix is then turned into a collection of subnetworks using various topological settings, including:

+ **spectral similarity thresholds**: a minimum pairwise similarity cutoff value required before a connection (edge, link) between spectra is made. This limits connectivity to promising pairwise relationships and prevents visual overload from excess connections. 
+ **minimum fragment overlaps**: a minimum number of shared fragments required between a pair of spectra before the pair is considered for connection. Like the similarity threshold, this limits connectivity to promising pairwise relationships and prevents visual overload from excess connections.
+ **maximum node degree**: a maximum number of connections to a given feature, used primarily to prevent visual clutter from excessive numbers of edges for certain nodes.
+ **maximum sub-network size**: a limit on the number of members within a molecular family; essentially a limitation to cluster size.
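The first three settings above can be sketched as a small edge-selection function. This is a hedged illustration, not the GNPS implementation: the function name, the default values, and the choice to enforce the maximum node degree through a mutual top-k rule (an edge is kept only if each node is among the other's best-scoring partners) are assumptions. The maximum sub-network size is not handled here.

```python
def select_edges(similarity, n_matches, min_score=0.7, min_matches=6, max_degree=10):
    """Select network edges from pairwise score matrices (illustrative sketch).

    similarity : square, symmetric list of lists with pairwise scores
    n_matches  : square, symmetric list of lists with shared fragment counts
    Returns a list of (i, j, score) tuples with i < j.
    """
    n = len(similarity)

    # Keep only pairs that pass the score and fragment-overlap thresholds.
    candidates = [
        (similarity[i][j], i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if similarity[i][j] >= min_score and n_matches[i][j] >= min_matches
    ]

    # For each node, record its best-scoring partners, at most `max_degree`.
    ranked = sorted(candidates, reverse=True)
    top_partners = {node: set() for node in range(n)}
    for score, i, j in ranked:
        if len(top_partners[i]) < max_degree:
            top_partners[i].add(j)
        if len(top_partners[j]) < max_degree:
            top_partners[j].add(i)

    # Keep an edge only if both nodes list each other as a top partner,
    # which guarantees that no node exceeds `max_degree` edges.
    return [
        (i, j, score)
        for score, i, j in ranked
        if j in top_partners[i] and i in top_partners[j]
    ]
```

Lowering `max_degree` removes the weakest edges of highly connected nodes first, which is why this setting mainly affects visual clutter rather than which strong relationships are shown.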

The molecular networking workflow makes use of these settings to generate a collection of disjoint sub-networks from the full data. These sub-networks, some of which will inevitably be singletons (single features disconnected from everything else), are visualized all together (side by side) or group-wise as network diagrams (also known as node-link or vertex-edge diagrams) in Cytoscape. Molecular networks in Cytoscape are organized by sub-network size, which can give the misleading impression that larger clusters are more important.
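The split into disjoint sub-networks corresponds to finding the connected components of the graph. A minimal sketch, assuming nodes are numbered from 0 and edges are given as `(i, j)` pairs (both the function name and this representation are illustrative choices):

```python
def connected_components(n_nodes, edges):
    """Group nodes into disjoint sub-networks (illustrative sketch).

    Returns a list of sorted node lists, largest sub-network first.
    Nodes without any edge end up as singletons.
    """
    # Build an undirected adjacency map.
    adjacency = {node: set() for node in range(n_nodes)}
    for i, j in edges:
        adjacency[i].add(j)
        adjacency[j].add(i)

    seen = set()
    components = []
    for start in range(n_nodes):
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        stack = [start]
        seen.add(start)
        members = []
        while stack:
            node = stack.pop()
            members.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(members))

    # Order by size, mirroring a layout that places large families first.
    components.sort(key=len, reverse=True)
    return components
```

Filtering the result with `[c for c in components if len(c) == 1]` gives the singletons, which are often numerous and easy to overlook in a size-ordered layout.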

## Spectral Networking - Data Subdivision & Structural Hypothesis Generation
Spectral similarity groupings and their network visualization are useful for two primary reasons. First, they subdivide large heterogeneous datasets into smaller, more homogeneous and thus more manageable subsets. Dealing with small subnetworks and gaining an overview of the features within them is much more straightforward than dealing with a whole dataset at once. Second, the nodes within the groupings share an implicit structural relationship with one another that may be useful for structural hypothesis generation. The latter is especially useful in conjunction with library matches or high-confidence annotations for some spectra.
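As a sketch of how annotations can seed structural hypotheses, the function below lists, for each unannotated node, the annotations of its direct neighbours. The function name, the edge list format, and the example annotation are hypothetical; the output is a set of hints for manual inspection, not an identification.

```python
def annotated_neighbours(edges, annotations):
    """Collect neighbour annotations for unannotated nodes (illustrative sketch).

    edges       : iterable of (i, j) node pairs
    annotations : dict mapping node id -> annotation label
    Returns a dict mapping each unannotated node to a list of
    (neighbour id, neighbour annotation) tuples.
    """
    hints = {}
    for i, j in edges:
        # Check the edge in both directions.
        for node, neighbour in ((i, j), (j, i)):
            if node not in annotations and neighbour in annotations:
                hints.setdefault(node, []).append((neighbour, annotations[neighbour]))
    return hints


# Hypothetical example: node 1 has a library match, nodes 0 and 2 do not.
example_hints = annotated_neighbours([(0, 1), (1, 2), (3, 4)], {1: "compound A"})
```

In this toy example nodes 0 and 2 both receive a hint pointing at node 1, while nodes 3 and 4 receive none because their sub-network contains no annotated member.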

# Practical Assignments 

In this practical we will make use of the publicly available natural product discovery dataset of Soliman Kathib. In their study, Kathib and colleagues explored the effect of increased fractions of olive mill solid waste (OMSW) in the growth substrate of Hericium and Pleurotus mushrooms. For this practical we will look at the Pleurotus data only, and compare zero percent OMSW against eighty percent OMSW.

### Task 1

***Task: Open the mushroom data of Soliman Kathib using the GNPS network viewer [data](https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=60727fe5228643e6a482bd797d83df38). Assume you are interested in how the chemistry of Pleurotus mushrooms adjusts to an increased amount of olive mill solid waste (OMSW) in the growth substrate. Does the network provide a means of identifying important node clusters or nodes?***

<details>
    <summary>Hint</summary>
    The web browser version does not integrate enough statistical information to reveal which nodes and node clusters show differential intensity trends. To inspect this aspect of the data, use the Cytoscape desktop app to visualize the whole dataset and apply some custom styling. Follow the instructions below to generate a suitable network view:  
    In Cytoscape, load the graphml file downloaded from the GNPS repository, move to Style, and make sure that node styling is active. Within the panel, find the Image/Chart entry and click on the left-most selection box. This opens a window in which you can select between images, charts, and gradient overlays. Move to Charts, select the pie chart, and make sure that the selected columns are GNPSGROUP:0 and GNPSGROUP:80, which represent the cumulative intensities for features in samples with these respective OMSW percentages. An alternative representation is the bar chart: use the stacked variant and deactivate the "same value range for all charts" option for a better two-group comparison on a node-by-node basis.
</details>
<details>
    <summary>Answer</summary>
    The GNPS and Cytoscape networks provide an overwhelming amount of information at first glance. With some inspection, searching, and custom styling, one can find information on each node, as well as sample-group-specific abundance patterns. Molecular families of interest can become apparent via differential intensity coloring. It should be noted that these analyses are exploratory and may be heavily distorted by sampling artefacts such as batch effects. Example views from Cytoscape: <br>
    Molecular Families with differences and overlaps between conditions: <br>
    <img src="images/omsw 0 vs 80 (80 is green) overlaps.png">
    Molecular Families with almost no overlap between conditions: <br>
    <img src="images/omsw 0 vs 80 (80 is green) large difference.png">
</details>
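The per-node comparison behind the pie chart styling can also be computed directly from the node table. The sketch below assumes each node is available as a dictionary containing the `GNPSGROUP:0` and `GNPSGROUP:80` columns; the function name and the pseudocount of 1.0, which avoids division by zero for features absent in one group, are illustrative choices.

```python
import math


def summarize_node(row, control="GNPSGROUP:0", treated="GNPSGROUP:80", pseudocount=1.0):
    """Summarize the two-group intensity pattern of one node (illustrative sketch).

    Returns the share of the summed intensity that comes from the treated
    group (the pie chart slice) and the log2 fold change treated vs. control.
    """
    intensity_control = float(row[control])
    intensity_treated = float(row[treated])

    total = intensity_control + intensity_treated
    treated_share = intensity_treated / total if total > 0 else 0.0

    # The pseudocount keeps the ratio finite when one group has zero intensity.
    log2_fc = math.log2((intensity_treated + pseudocount) / (intensity_control + pseudocount))
    return treated_share, log2_fc
```

A share near 0.5 with a log2 fold change near 0 indicates a feature present at similar levels in both conditions, whereas shares close to 0 or 1 point to features that are largely specific to one condition.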

*Inspect the edge list and node list data within Cytoscape. What information did the GNPS workflow add to the spectral data?*
<details>
    <summary>Answer</summary>
    The edge list does not provide any useful information beyond the connectivity shown in the network view itself. The node table contains many data columns with processed metadata and any annotations from library matching.  <br>
    Partial view of tabular data available in the node table: <br>
    <img src="images/annotations data.png">
</details>


##  Limitations of "molecular networking"

While clearly advantageous as a first step toward organizing a dataset, molecular networking and its subdivision into subnetworks come with tradeoffs. Subdividing the data into strictly disjoint subnetworks may obscure relationships between subnetworks or between the spectra they contain. Strict edge cutoffs may be needed to organize spectral data into disjoint groups with this method, but they entail a loss of topological neighborhood information. The feature grouping itself depends on a plethora of topological settings and is correspondingly difficult to tune. This is especially true when switching from the modified cosine score to scoring approaches that behave differently, such as MS2DeepScore. Modified cosine scoring tends to produce sparse matrices suitable for disjoint subdivision, while machine learning embedding-based scores such as MS2DeepScore tend to create much higher interconnectivity between features, easily leading to dense hairball networks that are difficult to read.
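One way to see whether a score and threshold combination is heading toward a hairball is to check what fraction of all possible pairs would become edges. The sketch below uses two invented toy matrices; their values are purely illustrative and are not real modified cosine or MS2DeepScore output.

```python
def edge_density(similarity, threshold):
    """Fraction of node pairs whose score reaches `threshold` (illustrative sketch)."""
    n = len(similarity)
    n_pairs = n * (n - 1) // 2
    if n_pairs == 0:
        return 0.0
    n_edges = sum(
        1
        for i in range(n)
        for j in range(i + 1, n)
        if similarity[i][j] >= threshold
    )
    return n_edges / n_pairs


# Toy matrices with invented values: one sparse, one densely connected.
sparse_scores = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
dense_scores = [
    [1.0, 0.9, 0.8, 0.75],
    [0.9, 1.0, 0.85, 0.8],
    [0.8, 0.85, 1.0, 0.9],
    [0.75, 0.8, 0.9, 1.0],
]
```

At a threshold of 0.7 the sparse toy matrix connects only a third of all pairs, while the dense one connects every pair. Scanning `edge_density` over a range of thresholds before building the network gives a quick indication of where a given score starts to produce readable sub-networks.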

In addition to the difficulty of finding the right settings, the molecular networking workflow comes in a style-free form: nodes and edges are shown and some metadata and annotation information is integrated, yet visual mappings are left to the user. Hence, a good understanding of the Cytoscape user interface and its settings is required to achieve good visual integration of the type of information sought.

Molecular networking is thus better viewed as a starting point for further customization and data exploration rather than as an exploratory end point.

### Task 2 


In [None]:
import os
data_directory = os.path.join("data")
filepath_test_spectra = os.path.join(data_directory, "spectra.mgf")
filepath_test_quant_table = os.path.join(data_directory, "quant_table.csv")
filepath_test_treat_table = os.path.join(data_directory, "treat_table.csv")
model_path = os.path.join("..", "models", "ms2deepscore_model.pt")




***Task: Generate the t-SNE overview of the data using plotly. Is the t-SNE overview useful in its own right? What does it show? Does the second version with cluster highlighting improve the utility of the plot?***

***Task: Generate the interactive t-SNE overview of the data using plotly. How does this representation of the data compare to the Cytoscape network visualization?***

***Task: Generate the interactive t-SNE overview of the data using plotly. This representation provides a color gradient for the node-specific fold change. What other ways of visualizing these data do you know?***

--> heatmaps, possibly per-cluster, possibly using optimal leaf ordering based on spectral clustering rather than statistical data

***Task: Generate the interactive t-SNE representation with edge overlays. What does the network connectivity data add to the t-SNE overview?***


**Main Task Series based on Soliman's data**

***Task: Find differentially abundant features in the data.***

***Task: Find differentially abundant features in the data. Do they cluster in some way?***

***Task: Find differentially abundant features in the data. Do they have high connectivity with one another?***
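One possible way to approach the connectivity question is to count how many network edges fall entirely within the set of differentially abundant features. The sketch below assumes that the features have already been selected by some criterion and that edges are available as `(i, j)` pairs; the function name and inputs are hypothetical.

```python
def induced_connectivity(edges, selected):
    """Count edges among selected features (illustrative sketch).

    Returns the number of edges with both ends in `selected` and the
    fraction of all possible pairs within `selected` that are connected.
    """
    selected = set(selected)
    n_inside = sum(1 for i, j in edges if i in selected and j in selected)
    n_possible = len(selected) * (len(selected) - 1) // 2
    density = n_inside / n_possible if n_possible else 0.0
    return n_inside, density
```

Comparing this density with the density of the whole network gives a rough indication of whether the selected features are more interconnected than expected; it is an exploratory check rather than a statistical test.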


***Task: Find differentially abundant features in the data. Do they share spectral similarities?***
--> fragmap like plot would be useful

***Task: Find differentially abundant features in the data. Do they share similar structural annotations?***
--> more extensive spectrum info print



***TASK: ...***
<details>
    <summary>Hint</summary>
    ...
</details>
<details>
    <summary>Answer</summary>
    ...
</details>