This notebook explores the MolNetEnhancer output from the GNPS run on the mushroom dataset used in the practical. It extracts and exports the data table used in the practical.

Important: this notebook is designed for local runtimes only. The data produced are available in the data folder.

In [11]:
import pandas as pd
import numpy as np
import os

# Inspecting MolNetEnhancer Output

In this section we explore the node table produced by MolNetEnhancer.

In [12]:
# LOAD MOLNETENHANCER IDS & DATA
df_molnetenhancer = pd.read_csv(os.path.join("data", "node_table_from_molnetenhancer_all_pos_network.csv"))
feature_ids_molnetenhancer = [str(element) for element in df_molnetenhancer['shared name'].to_list()]
print("Number of features in the full GNPS network for both mushroom types:", len(feature_ids_molnetenhancer))

Number of features in the full GNPS network for both mushroom types: 2984


In [13]:
# LOAD QUANTIFICATION TABLE IDS AND DATA
df_quant_table = pd.read_csv(os.path.join("data", "quant_table.csv"))
column_names = df_quant_table.columns.to_list()
feature_ids_quant_table = column_names[1:]  # every column after the first is treated as a feature id
print("Number of features in Pleurotus data subset quantification table: ", len(feature_ids_quant_table))

Number of features in Pleurotus data subset quantification table:  1975


Note that the node table contains 2984 features, with the ids stored in the "shared name" column. The quantification table we use has fewer, since it is subset to features with at least one occurrence in the Pleurotus-specific data and includes some spectral data pre-processing. Let's do a sanity check to make sure that the features in our two tables actually match:

In [14]:
print("Set of ids in quant table is subset of larger table? -->", set(feature_ids_quant_table).issubset(feature_ids_molnetenhancer))
print("Is the size of the intersection equal to 1975, the size of the smaller table? -->", len(set(feature_ids_quant_table).intersection(feature_ids_molnetenhancer)) == 1975)

Set of ids in quant table is subset of larger table? --> True
Is the size of the intersection equal to 1975, the size of the smaller table? --> True
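
When a check like this fails, it helps to know which ids are responsible. The sketch below uses small hypothetical id lists (`node_ids` and `quant_ids` are stand-ins, not the real tables) and reports the ids that are in the smaller table but not in the larger one. Both lists hold strings, mirroring the cast to `str` applied to the "shared name" column above; comparing integer ids against string column names would silently yield an empty intersection.

```python
# Hypothetical toy ids standing in for the two real tables
node_ids = ["1", "2", "3", "4"]   # plays the role of feature_ids_molnetenhancer
quant_ids = ["2", "4", "7"]       # plays the role of feature_ids_quant_table

# Ids present in the quantification table but absent from the node table
missing = sorted(set(quant_ids) - set(node_ids))
is_subset = len(missing) == 0

print("Subset?", is_subset, "| missing ids:", missing)
```

An empty `missing` list is equivalent to the `issubset` check returning `True`.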


They do indeed. The node table is a large dataset with many columns in its own right. These are the columns included:

In [15]:
df_molnetenhancer.columns

Index(['Adduct', 'AllGroups', 'Analog:Adduct', 'Analog:Compound_Name',
       'Analog:Compound_Source', 'Analog:Data_Collector',
       'Analog:GNPSLibraryURL', 'Analog:IIN Best Ion=Library Adduct',
       'Analog:INCHI', 'Analog:Instrument', 'Analog:Ion_Source',
       'Analog:IonMode', 'Analog:Library_Class', 'Analog:MassDiff',
       'Analog:MQScore', 'Analog:MZErrorPPM', 'Analog:PI',
       'Analog:SharedPeaks', 'Analog:Smiles', 'Analog:SpectrumID',
       'Analog:tags', 'ATTRIBUTE_ Percent of OMSW  in MS',
       'ATTRIBUTE_ Taxonomy', 'CF_class', 'CF_class_score',
       'CF_componentindex', 'CF_Dparent', 'CF_Dparent_score', 'CF_kingdom',
       'CF_kingdom_score', 'CF_MFramework', 'CF_MFramework_score',
       'CF_NrNodes', 'CF_subclass', 'CF_subclass_score', 'CF_superclass',
       'CF_superclass_score', 'charge', 'cluster index', 'componentindex',
       'Compound_Name', 'Compound_Source', 'Data_Collector', 'DefaultGroups',
       'G1', 'G2', 'G3', 'G4', 'G5', 'G6', 'GNPSGROUP

Molecular families are indicated using the componentindex column. The code below assesses the number of families, and their individual size.

Note that we will be operating on a slice of the full feature data, meaning that some families may be larger in the full dataset covering both mushroom types than in the Pleurotus subset.

In [16]:
# Inspect number of molecular families and their sizes (-1 groups all singleton features)
# Ids are cast to str, so the resulting table is sorted lexicographically, not numerically
unique, counts = np.unique(np.array(df_molnetenhancer["componentindex"], dtype=str), return_counts=True)
count_table = pd.DataFrame({"molfam_id": unique, "member_count": counts})
count_table

Unnamed: 0,molfam_id,member_count
0,-1,992
1,1,6
2,10,2
3,101,9
4,102,3
...,...,...
316,95,3
317,96,6
318,97,10
319,98,3
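
The table above is ordered as text (-1, 1, 10, 101, ...) because the ids were converted to strings before counting. As an alternative sketch, the same counts can be obtained with `value_counts` while keeping the ids numeric, which gives a numerically sorted table. The `component_index` series below is a hypothetical toy stand-in for `df_molnetenhancer["componentindex"]`.

```python
import pandas as pd

# Hypothetical toy stand-in for df_molnetenhancer["componentindex"]
component_index = pd.Series([-1, -1, 10, 2, 2, 2, 10, -1, 101])

count_table_numeric = (
    component_index.value_counts()
    .sort_index()                       # numeric order: -1, 2, 10, 101
    .rename_axis("molfam_id")
    .reset_index(name="member_count")
)

# The -1 row counts singleton features; all other rows are actual families
is_singleton_row = count_table_numeric["molfam_id"] == -1
n_singletons = int(count_table_numeric.loc[is_singleton_row, "member_count"].sum())
n_families = int((~is_singleton_row).sum())

print(count_table_numeric)
print("Singleton features:", n_singletons, "| families:", n_families)
```

Separating the -1 row from the rest makes it explicit that the number of unique `componentindex` values counts the singleton group as one entry.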


# Parsing MolNetEnhancer Outputs & Aligning

In this section we will extract the data we wish to incorporate into our network visualization.

In [17]:
print("Number of classes:      ", np.unique(np.array(df_molnetenhancer["CF_class"], dtype = str)).size)
print("Number of superclasses: ", np.unique(np.array(df_molnetenhancer["CF_superclass"], dtype = str)).size)
print("Number of subclasses:   ", np.unique(np.array(df_molnetenhancer["CF_subclass"], dtype = str)).size)
print("Number of mol.families: ", np.unique(np.array(df_molnetenhancer["componentindex"], dtype = str)).size)
df_molnetenhancer = df_molnetenhancer[["shared name", "CF_class", "CF_superclass", "CF_subclass", "componentindex"]]
print("Extracted Data:")
df_molnetenhancer

Number of classes:       74
Number of superclasses:  14
Number of subclasses:    130
Number of mol.families:  321
Extracted Data:


Unnamed: 0,shared name,CF_class,CF_superclass,CF_subclass,componentindex
0,9559,Fatty Acyls,Lipids and lipid-like molecules,Lineolic acids and derivatives,370
1,11451,Prenol lipids,Lipids and lipid-like molecules,Diterpenoids,149
2,10150,Prenol lipids,Lipids and lipid-like molecules,Sesquiterpenoids,54
3,15149,Prenol lipids,Lipids and lipid-like molecules,Glycerophosphates,5
4,3113,2-arylbenzofuran flavonoids,Organoheterocyclic compounds,,735
...,...,...,...,...,...
2979,10015,no matches,no matches,no matches,-1
2980,8178,no matches,no matches,no matches,-1
2981,10953,Prenol lipids,Lipids and lipid-like molecules,Triterpenoids,-1
2982,16747,Glycerolipids,Lipids and lipid-like molecules,Triradylcglycerols,116


String processing of the classification table for Cytoscape compatibility:

In [18]:
df_molnetenhancer = df_molnetenhancer.astype(str)
df_molnetenhancer = df_molnetenhancer.replace(r'^\s*$', "_", regex=True)
df_molnetenhancer = df_molnetenhancer.replace(r' ', "_", regex=True)
df_molnetenhancer = df_molnetenhancer.replace(r',', "-", regex=True)
df_molnetenhancer = df_molnetenhancer.replace(r';', "-", regex=True)
# Anchor the patterns so only whole-cell values are replaced, not substrings inside class names
df_molnetenhancer = df_molnetenhancer.replace(r'^nan$', "not_available", regex=True)
df_molnetenhancer["componentindex"] = df_molnetenhancer["componentindex"].replace(r'^-1$', "singleton", regex=True)
df_molnetenhancer["componentindex"] = "mf_" + df_molnetenhancer["componentindex"]
df_molnetenhancer.columns = ["feature_id", "cf_class", "cf_superclass", "cf_subclass", "molecular_family"]
print("Data frame entries are now all of type string / object, and entries no longer contain white spaces:")
df_molnetenhancer

Data frame entries are now all of type string / object, and entries no longer contain white spaces:


Unnamed: 0,feature_id,cf_class,cf_superclass,cf_subclass,molecular_family
0,9559,Fatty_Acyls,Lipids_and_lipid-like_molecules,Lineolic_acids_and_derivatives,mf_370
1,11451,Prenol_lipids,Lipids_and_lipid-like_molecules,Diterpenoids,mf_149
2,10150,Prenol_lipids,Lipids_and_lipid-like_molecules,Sesquiterpenoids,mf_54
3,15149,Prenol_lipids,Lipids_and_lipid-like_molecules,Glycerophosphates,mf_5
4,3113,2-arylbenzofuran_flavonoids,Organoheterocyclic_compounds,not_available,mf_735
...,...,...,...,...,...
2979,10015,no_matches,no_matches,no_matches,mf_singleton
2980,8178,no_matches,no_matches,no_matches,mf_singleton
2981,10953,Prenol_lipids,Lipids_and_lipid-like_molecules,Triterpenoids,mf_singleton
2982,16747,Glycerolipids,Lipids_and_lipid-like_molecules,Triradylcglycerols,mf_116
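
With `regex=True`, whether the `nan` pattern is anchored makes a difference for any class name that happens to contain those three letters. The sketch below illustrates this on a hypothetical toy column; "Phenanthrenes" is used purely as an example of such a name, and the literal `"nan"` string mimics what a missing value looks like after conversion to string.

```python
import pandas as pd

# Hypothetical toy column: a name containing "nan", a stringified missing value, a normal name
cf_subclass = pd.Series(["Phenanthrenes", "nan", "Diterpenoids"])

# Unanchored pattern: also rewrites the "nan" inside "Phenanthrenes"
unanchored = cf_subclass.replace(r"nan", "not_available", regex=True)

# Anchored pattern: only cells that are exactly "nan" are rewritten
anchored = cf_subclass.replace(r"^nan$", "not_available", regex=True)

print(unanchored.tolist())
print(anchored.tolist())
```

The same reasoning applies to the `-1` pattern used for singletons, although integer ids leave less room for accidental substring matches.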


Subsetting to the required feature set:

In [19]:
df_molnetenhancer = df_molnetenhancer[df_molnetenhancer["feature_id"].isin(feature_ids_quant_table)]
df_molnetenhancer

Unnamed: 0,feature_id,cf_class,cf_superclass,cf_subclass,molecular_family
0,9559,Fatty_Acyls,Lipids_and_lipid-like_molecules,Lineolic_acids_and_derivatives,mf_370
2,10150,Prenol_lipids,Lipids_and_lipid-like_molecules,Sesquiterpenoids,mf_54
5,7059,Unsaturated_hydrocarbons,Benzenoids,not_available,mf_35
6,2557,Organonitrogen_compounds,Organic_nitrogen_compounds,Quaternary_ammonium_salts,mf_532
7,4729,Pyrans,Organoheterocyclic_compounds,Pyranones_and_derivatives,mf_singleton
...,...,...,...,...,...
2976,15193,Prenol_lipids,Lipids_and_lipid-like_molecules,Triterpenoids,mf_67
2977,9933,Benzopyrans,Organoheterocyclic_compounds,1-benzopyrans,mf_singleton
2978,8759,Unsaturated_hydrocarbons,Benzenoids,not_available,mf_35
2981,10953,Prenol_lipids,Lipids_and_lipid-like_molecules,Triterpenoids,mf_singleton
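
Filtering with `isin` keeps the row order of the node table, not the column order of the quantification table. If a downstream step needs the two to line up position by position (an assumption, since the practical may match on `feature_id` instead), the rows can be reordered explicitly. The sketch below uses hypothetical toy data to show one way of doing this.

```python
import pandas as pd

# Hypothetical toy stand-ins for feature_ids_quant_table and df_molnetenhancer
feature_ids_quant = ["7059", "9559", "10150"]
df = pd.DataFrame({
    "feature_id": ["9559", "11451", "10150", "7059"],
    "cf_class": ["Fatty_Acyls", "Prenol_lipids", "Prenol_lipids", "Unsaturated_hydrocarbons"],
})

# Same filtering step as in the notebook: keeps the node table's row order
subset = df[df["feature_id"].isin(feature_ids_quant)]

# Reorder rows to follow the quantification table's id order;
# .loc with a list raises a KeyError if any requested id is absent
aligned = (
    subset.set_index("feature_id")
    .loc[feature_ids_quant]
    .reset_index()
)

print(aligned)
```

Because `.loc` fails loudly on missing ids, this reordering doubles as a second check that every quantified feature has a classification row.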


Saving to data folder for use in practical:

In [20]:
output_filepath = os.path.join("data", "molnetenhancer_processed.csv")
df_molnetenhancer.to_csv(output_filepath, index=False)
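
One thing to keep in mind when loading the exported file again: `pd.read_csv` infers column types, so purely numeric feature ids come back as integers unless a string dtype is requested. The sketch below demonstrates this with a hypothetical two-row table written to a temporary directory, so it does not touch the real data folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row stand-in for the processed table
df = pd.DataFrame({
    "feature_id": ["9559", "10150"],
    "molecular_family": ["mf_370", "mf_54"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "molnetenhancer_processed.csv")
    df.to_csv(path, index=False)
    default_read = pd.read_csv(path)             # feature_id inferred as integer
    string_read = pd.read_csv(path, dtype=str)   # feature_id kept as text

ids_are_int = pd.api.types.is_integer_dtype(default_read["feature_id"])
ids_roundtrip = string_read["feature_id"].tolist() == df["feature_id"].tolist()

print("Default read gives integer ids:", ids_are_int)
print("String read reproduces the ids:", ids_roundtrip)
```

Reading with `dtype=str` keeps the ids comparable with the string column names of the quantification table.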