# Case Study Data Preparation

The msFeaST workflow requires three linked data structures:

1. A quantification table with sample-specific feature intensities. One column must be called "sample_id" and contain the sample identifiers; all remaining columns are named after their feature, e.g., "feature_1".
2. A metadata table mapping sample identifiers to treatments. Its column names are "sample_id" and "treatment", and the sample identifiers must match those in the quantification table.
3. An mgf file with spectral data for the features, including a feature_id entry that matches the feature names used in the quantification table.

For the msFeaST workflow to work reliably, feature identifiers, sample identifiers, and treatment identifiers must match across files.
This Jupyter notebook showcases the pre-processing required to turn a GNPS-FBMN file export into the input data required by msFeaST.
While the process is largely automatic, the user must make some choices regarding the statistical metadata and, if it is missing from the GNPS export, provide appropriately formatted metadata.
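
The matching requirement above can be illustrated with a small, self-contained sketch. The toy tables, the list `spectrum_feature_ids`, and the helper `check_identifier_consistency` are hypothetical and only serve to show the expected shape of the inputs; they are not part of msFeaST.

```python
import pandas as pd

# Toy stand-ins for the three inputs (all values are made up for illustration).
quant_table = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3", "s4"],
    "feature_1": [10.0, 12.0, 0.0, 3.0],
    "feature_2": [0.0, 5.0, 7.0, 9.0],
})
treatment_table = pd.DataFrame({
    "sample_id": ["s1", "s2", "s3", "s4"],
    "treatment": ["control", "control", "treated", "treated"],
})
# In the real workflow these would be read from the feature_id entries of the mgf file.
spectrum_feature_ids = ["feature_1", "feature_2"]


def check_identifier_consistency(quant, treat, feature_ids):
    """Hypothetical helper: return a list of problems; an empty list means the inputs agree."""
    problems = []
    if set(quant["sample_id"]) != set(treat["sample_id"]):
        problems.append("sample_id values differ between quantification and metadata table")
    quant_features = set(quant.columns) - {"sample_id"}
    if quant_features != set(map(str, feature_ids)):
        problems.append("feature identifiers differ between quantification table and spectra")
    return problems


problems = check_identifier_consistency(quant_table, treatment_table, spectrum_feature_ids)
print(problems if problems else "All identifiers match.")
```

A check of this kind is cheap to run on the final output tables before they are handed to the pipeline.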

# 1 - Import Python Dependencies & Provide File Paths
This code chunk loads the Python package dependencies for data processing. All packages are installed by default when following the msFeaST installation guide. To make sure they are available in this notebook, activate the conda environment from the console with ```conda activate msfeast_environment``` and then start the notebook server with ```jupyter-notebook```. Using the browser interface of jupyter-notebook, navigate to the .ipynb notebook file and open it. The default Python environment will correspond to the one set up with the conda environment manager.

In [1]:
%load_ext autoreload
%autoreload 2
from matchms.importing import load_from_mgf
from matchms.exporting import save_as_mgf
import pandas as pd
import os
import copy
from msfeast.preprocessing import apply_default_spectral_processing
from msfeast.preprocessing import extract_treatment_table
from msfeast.preprocessing import restructure_quantification_table
from msfeast.preprocessing import normalize_via_total_ion_current
from msfeast.preprocessing import subset_quantification_table_to_samples
from msfeast.preprocessing import align_feature_subsets
from msfeast.preprocessing import get_sample_ids_from_treatment_table
from msfeast.preprocessing import align_treatment_and_quantification_table

In addition to the package dependencies, this notebook relies on the mushroom case study data from msFeaST. This data is assumed to be located in a folder called "data" with the subfolder "mushroom_data_gnps_export". The commands below specify the relative paths to the required metadata.tsv, quantification_table.csv, and spectra.mgf files. The os package is used so that relative file paths work across operating systems. When working on macOS or Linux, specifying e.g., ```"data/mushroom_data_gnps_export/metadata.tsv"``` would also work.

*<span style="color:magenta">Required user input: Relative file paths to data</span>*

In [2]:
gnps_metadata_filepath = os.path.join("data", "mushroom_data_gnps_export", "metadata.tsv")
gnps_quant_table_filepath = os.path.join("data", "mushroom_data_gnps_export", "quantification_table.csv")
gnps_spectra_filepath = os.path.join("data", "mushroom_data_gnps_export", "spectra.mgf")
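
Before loading anything, it can help to confirm that the three paths actually resolve from the notebook's working directory. The following is a minimal sketch; the helper `find_missing_files` is hypothetical and not part of msFeaST.

```python
import os


def find_missing_files(filepaths):
    """Hypothetical helper: return the paths that do not point to an existing file."""
    return [path for path in filepaths if not os.path.isfile(path)]


# Same relative paths as used in this notebook.
expected_files = [
    os.path.join("data", "mushroom_data_gnps_export", "metadata.tsv"),
    os.path.join("data", "mushroom_data_gnps_export", "quantification_table.csv"),
    os.path.join("data", "mushroom_data_gnps_export", "spectra.mgf"),
]

missing_files = find_missing_files(expected_files)
if missing_files:
    print("Missing input files:", missing_files)
else:
    print("All input files found.")
```

If files are reported missing, the most likely cause is that the notebook was started from a different folder than the one containing "data".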

# 2 - Load gnps data
GNPS-FBMN network data contains numerous entries not required by the msFeaST workflow. In the following processing steps, the input data is reduced to only the data expected by the msFeaST pipeline. We distinguish between general steps and mushroom-data-specific steps to allow users to customize these steps for their own data. Unfortunately, because the metadata is specific to each dataset, complete automation of this process is not possible. Users have to ensure that they have the right data available to arrive at the expected pipeline input.

**Loading gnps-fbmn metadata**

Note that the raw data file is placed inside the data/mushroom_data_gnps_export folder and named metadata.tsv, in tab separated format (.tsv). The mushroom dataset contains numerous samples not of direct relevance to the statistical analyses we're performing in msFeaST. The relevant data subsets must hence be extracted for the automatic analysis pipeline of msFeaST to make use of the correct data in following steps.

In [3]:
gnps_statistical_metadata = pd.read_table(gnps_metadata_filepath)
print("Data dimensions (number of rows, number of columns):", gnps_statistical_metadata.shape)
gnps_statistical_metadata.head()

Data dimensions (number of rows, number of columns): (54, 12)


Unnamed: 0,filename,SampleType,SampleType1,ATTRIBUTE_ Percent of OMSW in MS,Species,ATTRIBUTE_ Taxonomy,NCBITaxonomy,Sample Collection,Sample Extract,MassSpectrometer,IonizationSourceAndPolarity,ChromatographyAndPhase
0,MS0_NEW_POS.mzXML,BLANK_MS,Mushroom Substrate,0,Fungi,MS,91752Hericium erinaceus,Dry Solid Material,Methanol100%,Q Exactive Plus|MS:1002661,electrospray ionization (Positive),reverse phase (C18)
1,MS0_OLD_POS.mzXML,BLANK_MS,Mushroom Substrate,0,Fungi,MS,91752Hericium erinaceus,Dry Solid Material,Methanol100%,Q Exactive Plus|MS:1002662,electrospray ionization (Positive),reverse phase (C18)
2,MS33_NEW_POS.mzXML,BLANK_MS,Mushroom Substrate,33,Fungi,MS,91752Hericium erinaceus,Dry Solid Material,Methanol100%,Q Exactive Plus|MS:1002663,electrospray ionization (Positive),reverse phase (C18)
3,MS33_OLD_POS.mzXML,BLANK_MS,Mushroom Substrate,33,Fungi,MS,91752Hericium erinaceus,Dry Solid Material,Methanol100%,Q Exactive Plus|MS:1002664,electrospray ionization (Positive),reverse phase (C18)
4,MS60_NEW_POS.mzXML,BLANK_MS,Mushroom Substrate,60,Fungi,MS,91752Hericium erinaceus,Dry Solid Material,Methanol100%,Q Exactive Plus|MS:1002665,electrospray ionization (Positive),reverse phase (C18)


**Loading gnps-fbmn Quantification Table**

Note that the raw data file is placed inside the data/mushroom_data_gnps_export folder and named quantification_table.csv, in comma separated format (.csv). Each row describes one feature, identified by the ```'row ID'``` column. Each feature has an associated precursor m/z value in the ```'row m/z'``` column and a retention time in the ```'row retention time'``` column (time in minutes). The data required by msFeaST are the feature-specific intensity profiles across samples, stored in columns named ```sample identifier``` + ```' Peak area'```, e.g., ```'E37_pos.mzXML Peak area'```.

Note that importing the raw data into pandas produces many columns with NaN (not a number, i.e., missing value) entries. The ```'Peak area'``` suffix in the intensity column names also prevents direct matching to sample identifiers. In addition, a trailing column named ```"Unnamed: 67"``` is added to the end of the data frame. These issues are dealt with in the processing code below after all raw data are loaded.

In [4]:
gnps_quantification_table = pd.read_csv(gnps_quant_table_filepath)
print("Data dimensions (number of rows, number of columns):", gnps_quantification_table.shape)
gnps_quantification_table.head()

Data dimensions (number of rows, number of columns): (2984, 68)


Unnamed: 0,row ID,row m/z,row retention time,row ion mobility,row ion mobility unit,row CCS,correlation group ID,annotation network number,best ion,auto MS2 verify,...,E37_pos.mzXML Peak area,E38_pos.mzXML Peak area,E39_pos.mzXML Peak area,E36_pos.mzXML Peak area,E43_pos.mzXML Peak area,E40_pos.mzXML Peak area,E41_pos.mzXML Peak area,E44_pos.mzXML Peak area,E42_pos.mzXML Peak area,Unnamed: 67
0,555,69.03428,1.209123,,,,,,,,...,7182873.0,3206877.0,6456761.5,5544007.5,6295892.0,10589970.0,7853519.0,7683877.0,9745292.0,
1,994,70.06589,1.216007,,,,,,,,...,536219.1,1073616.5,370348.06,682284.7,290696.38,369843.0,387468.44,333006.6,294155.34,
2,15743,71.086306,17.530378,,,,,,,,...,37769.434,22959.324,22191.406,41042.824,17818.375,16550.38,25448.713,33429.113,57842.637,
3,2563,79.05493,5.326331,,,,,,,,...,499695.28,514116.4,707964.06,479117.06,552020.9,916963.6,717532.2,790312.94,951237.25,
4,8783,83.049808,13.057878,,,,,,,,...,88554.445,58410.418,28739.414,25085.014,23766.312,73575.2,27015.748,47035.324,85310.11,
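
The column issues described above can be inspected and resolved with plain pandas. The sketch below uses a made-up toy table that mimics the export structure. It illustrates one possible way of stripping the suffix and reshaping to one row per sample, and it is not necessarily how msFeaST implements this step.

```python
import numpy as np
import pandas as pd

# Toy table mimicking a GNPS-FBMN quantification export (sample names and values are made up).
toy_export = pd.DataFrame({
    "row ID": [555, 994],
    "row m/z": [69.03, 70.07],
    "row retention time": [1.21, 1.22],
    "row ion mobility": [np.nan, np.nan],
    "A_pos.mzXML Peak area": [7182873.0, 536219.1],
    "B_pos.mzXML Peak area": [3206877.0, 1073616.5],
    "Unnamed: 6": [np.nan, np.nan],
})

suffix = "Peak area"
# Columns without a single value carry no information.
empty_columns = [col for col in toy_export.columns if toy_export[col].isna().all()]
# Columns holding per-sample intensities are recognized by their suffix.
intensity_columns = [col for col in toy_export.columns if col.endswith(suffix)]
# Sample identifiers are the column names without the suffix and the separating space.
sample_ids = [col[: -len(suffix)].strip() for col in intensity_columns]

# Reshape to one row per sample and one column per feature.
reshaped = toy_export.set_index("row ID")[intensity_columns].T
reshaped.columns = [str(feature_id) for feature_id in reshaped.columns]
reshaped.insert(0, "sample_id", sample_ids)
reshaped = reshaped.reset_index(drop=True)
print(reshaped)
```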


**Loading gnps-fbmn spectral data**

Similar to the other raw data, the raw spectral data from the GNPS export may contain compatibility artefacts. For instance, some features may be MS1-only, and some MS/MS features may have empty spectra or very few fragment peaks.

In [5]:
gnps_spectra = list(load_from_mgf(gnps_spectra_filepath))
print("Number of spectra loaded from file: ", len(gnps_spectra))

Number of spectra loaded from file:  18562


# 3 - Extracting, transforming, and loading the data for msFeaST compatibility

The loaded quantification table, metadata table, and spectral data form the basis of msFeaST. However, they still contain redundant information. Not all columns in the metadata table are relevant, nor are all rows. Not all samples in the quantification table are used. Depending on processing and subsetting, we may end up with spectra whose features have no intensity information in any of the samples intended for analysis. The loaded data therefore needs to be processed to remove irrelevant or incompatible information.
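
As an illustration of the last point, the sketch below removes features that have zero intensity in all selected samples and then keeps only features present in both the quantification table and the spectra. All names and values are hypothetical toy data, and the approach is an assumption about what such an alignment could look like rather than the msFeaST implementation.

```python
import pandas as pd

# Toy quantification table after subsetting to the samples of interest (values are made up).
quant = pd.DataFrame({
    "sample_id": ["s1", "s2"],
    "101": [5.0, 0.0],
    "102": [0.0, 0.0],  # no signal in any selected sample
    "103": [2.0, 4.0],
})
# Toy feature identifiers of the available spectra; "104" has no quantification column.
spectrum_ids = ["101", "103", "104"]

# Step 1: drop feature columns that are zero in every selected sample.
feature_columns = [col for col in quant.columns if col != "sample_id"]
all_zero = [col for col in feature_columns if (quant[col] == 0).all()]
quant_nonzero = quant.drop(columns=all_zero)

# Step 2: keep only the features present in both data structures.
shared = [col for col in quant_nonzero.columns if col in set(spectrum_ids)]
quant_aligned = quant_nonzero[["sample_id"] + shared]
spectra_aligned = [feature_id for feature_id in spectrum_ids if feature_id in set(shared)]

print("Dropped all-zero features:", all_zero)
print("Features kept in both structures:", shared)
```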


**Step 1 - Extract statistical contrast data**

The statistical metadata is processed here. This is the most important pre-processing step and requires user input for sensible results.

Unlike in the mushroom type comparison example, the metadata for the OMSW comparison is not in a directly usable format. The GNPS metadata object is therefore first modified below to include a more straightforward "omsw" column, which makes it easier to extract the relevant contrast. The example below is highly specific to this dataset; equivalent processing can also be done in a spreadsheet program such as Excel.

*<span style="color:magenta">Required user input: Treatment column, relevant treatment entries, and reference treatment selection. Depending on formatting, more input may be required to achieve expected format.</span>*

In [6]:
# copy the metadata so the original input variable is not modified
modified_metadata = copy.deepcopy(gnps_statistical_metadata)
# convert the numeric percentage entries to strings
modified_metadata["omsw"] = modified_metadata["ATTRIBUTE_ Percent of OMSW  in MS"].apply(str)
# combine taxonomy, a constant text, and the percentage into one label, e.g., <'FB_Pleurotus_omsw_percentage_0'> and <'FB_Pleurotus_omsw_percentage_80'>
modified_metadata["omsw"] = modified_metadata["ATTRIBUTE_ Taxonomy"] + "_omsw_percentage_" + modified_metadata["omsw"]
# rename the two labels of interest to short treatment identifiers
modified_metadata["omsw"] = modified_metadata["omsw"].replace(to_replace="FB_Pleurotus_omsw_percentage_0", value="PleurotusOMSW0")
modified_metadata["omsw"] = modified_metadata["omsw"].replace(to_replace="FB_Pleurotus_omsw_percentage_80", value="PleurotusOMSW80")
modified_metadata.head()

Unnamed: 0,filename,SampleType,SampleType1,ATTRIBUTE_ Percent of OMSW in MS,Species,ATTRIBUTE_ Taxonomy,NCBITaxonomy,Sample Collection,Sample Extract,MassSpectrometer,IonizationSourceAndPolarity,ChromatographyAndPhase,omsw
0,MS0_NEW_POS.mzXML,BLANK_MS,Mushroom Substrate,0,Fungi,MS,91752Hericium erinaceus,Dry Solid Material,Methanol100%,Q Exactive Plus|MS:1002661,electrospray ionization (Positive),reverse phase (C18),MS_omsw_percentage_0
1,MS0_OLD_POS.mzXML,BLANK_MS,Mushroom Substrate,0,Fungi,MS,91752Hericium erinaceus,Dry Solid Material,Methanol100%,Q Exactive Plus|MS:1002662,electrospray ionization (Positive),reverse phase (C18),MS_omsw_percentage_0
2,MS33_NEW_POS.mzXML,BLANK_MS,Mushroom Substrate,33,Fungi,MS,91752Hericium erinaceus,Dry Solid Material,Methanol100%,Q Exactive Plus|MS:1002663,electrospray ionization (Positive),reverse phase (C18),MS_omsw_percentage_33
3,MS33_OLD_POS.mzXML,BLANK_MS,Mushroom Substrate,33,Fungi,MS,91752Hericium erinaceus,Dry Solid Material,Methanol100%,Q Exactive Plus|MS:1002664,electrospray ionization (Positive),reverse phase (C18),MS_omsw_percentage_33
4,MS60_NEW_POS.mzXML,BLANK_MS,Mushroom Substrate,60,Fungi,MS,91752Hericium erinaceus,Dry Solid Material,Methanol100%,Q Exactive Plus|MS:1002665,electrospray ionization (Positive),reverse phase (C18),MS_omsw_percentage_60


In [7]:
treatment_table = extract_treatment_table(
  metadata_table = modified_metadata,
  treatment_column_name = "omsw", 
  treatment_identifiers = ['PleurotusOMSW0', 'PleurotusOMSW80'],
  reference_category = "PleurotusOMSW0"
)
print("Treatment table: \n", treatment_table)

Treatment table: 
        sample_id        treatment
0   E1_pos.mzXML   PleurotusOMSW0
1   E2_pos.mzXML   PleurotusOMSW0
2   E6_pos.mzXML   PleurotusOMSW0
3  E10_pos.mzXML  PleurotusOMSW80
4  E11_pos.mzXML  PleurotusOMSW80
5  E12_pos.mzXML  PleurotusOMSW80
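
To make the extraction step more concrete, the sketch below shows how a comparable treatment table could be built with plain pandas. The function `sketch_extract_treatment_table`, its `sample_column` parameter, and the toy metadata are hypothetical. Placing the reference treatment first is an assumption, and the code is not the msFeaST implementation of `extract_treatment_table`.

```python
import pandas as pd

# Toy metadata (file names and groups are made up).
toy_metadata = pd.DataFrame({
    "filename": ["a.mzXML", "b.mzXML", "c.mzXML", "d.mzXML", "e.mzXML"],
    "group": ["high", "low", "blank", "low", "high"],
})


def sketch_extract_treatment_table(metadata, sample_column, treatment_column, treatments, reference):
    """Hypothetical sketch: keep rows of the selected treatments, rename columns, reference first."""
    if reference not in treatments:
        raise ValueError("The reference category must be one of the selected treatments.")
    # Keep only the samples belonging to the treatments of interest.
    keep = metadata[treatment_column].isin(treatments)
    table = metadata.loc[keep, [sample_column, treatment_column]].copy()
    table = table.rename(columns={sample_column: "sample_id", treatment_column: "treatment"})
    # Assumption: order rows so that the reference treatment comes first (stable sort).
    table["_not_reference"] = table["treatment"] != reference
    table = table.sort_values("_not_reference", kind="mergesort").drop(columns="_not_reference")
    return table.reset_index(drop=True)


toy_treatment_table = sketch_extract_treatment_table(
    toy_metadata, "filename", "group", treatments=["low", "high"], reference="low"
)
print(toy_treatment_table)
```

Samples belonging to neither treatment (here the "blank" sample) are dropped, which is what limits the downstream analysis to the contrast of interest.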


**Step 2 - Clean Spectral Data**

Spectral data are loaded and processed using matchms within the msFeaST workflow. While the initial number of features in the spectral data file is large, processing drastically reduces this number, especially through the minimum number of fragments required. Setting the minimum number of fragments to some lower bound is advisable, since spectra with too little fragment information prevent meaningful spectral similarity scoring and thus only introduce complexity and noise into the workflow.

*<span style="color:magenta">Required User Input: Double-check that "scans" is the correct feature-identifying entry; modify if not.</span>*

In [8]:
spectra = apply_default_spectral_processing(
  gnps_spectra, 
  feature_identifier="scans", 
  minimum_number_of_fragments=5, 
  maximum_number_of_fragments=200
)

Number of spectral features provided:  18562
Number of spectral features which passed pre-processing:  2910
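
The effect of the fragment-count bounds can be sketched without matchms, using plain Python lists as stand-ins for spectra. The function `filter_by_fragment_count` and the toy spectra are hypothetical. In particular, handling the upper bound by keeping only the most intense peaks is an assumption and may differ from what `apply_default_spectral_processing` does.

```python
# Each toy spectrum is a dict with a feature identifier and a list of (m/z, intensity) pairs.
toy_spectra = [
    {"feature_id": "1", "peaks": [(50.0, 10.0), (75.1, 200.0), (120.2, 35.0)]},
    {"feature_id": "2", "peaks": []},  # e.g., an MS1-only feature without fragments
    {"feature_id": "3", "peaks": [(60.0 + i, 1.0 + i) for i in range(8)]},
]


def filter_by_fragment_count(spectra, minimum, maximum):
    """Hypothetical sketch: cap spectra at `maximum` peaks and drop those below `minimum`."""
    kept = []
    for spectrum in spectra:
        # Assumption: when there are too many peaks, keep the most intense ones.
        by_intensity = sorted(spectrum["peaks"], key=lambda peak: peak[1], reverse=True)
        peaks = by_intensity[:maximum]
        if len(peaks) >= minimum:
            # Restore m/z order for the retained peaks.
            kept.append({"feature_id": spectrum["feature_id"], "peaks": sorted(peaks)})
    return kept


kept_spectra = filter_by_fragment_count(toy_spectra, minimum=5, maximum=6)
print("Number of toy spectra provided:", len(toy_spectra))
print("Number of toy spectra kept:", len(kept_spectra))
```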


**Step 3 - Process quantification table**

Reformat the quantification table into the format expected by msFeaST: a sample_id column plus one column per feature, where the feature column names are the respective feature identifiers without trailing text.

*<span style="color:magenta">Required User Input: Double check that the feature id column is called "row ID", and that the sample column suffix is "Peak area", modify if not.</span>*

In [9]:
quant_table = restructure_quantification_table(
  gnps_quantification_table, 
  feature_id_column_name="row ID", 
  sample_id_suffix="Peak area"
)
print("Quantification table: \n", quant_table.head())

Quantification table: 
        sample_id      10001  10010       10012      10013  10015      10023  \
0  E10_pos.mzXML  24834.273    0.0   10558.262     0.0000    0.0    0.00000   
1  E11_pos.mzXML  36303.040    0.0   17791.215     0.0000    0.0    0.00000   
2  E12_pos.mzXML  29205.367    0.0   13557.281     0.0000    0.0    0.00000   
3   E1_pos.mzXML  92415.710    0.0  131143.000  5604.7725    0.0  699.12036   
4   E2_pos.mzXML  15658.006    0.0  105531.950  5802.3750    0.0    0.00000   

        10026      10041  10043  ...        996  9960       9963       9965  \
0  132596.050     0.0000    0.0  ...  16122.999   0.0  300892.80     0.0000   
1  195410.200     0.0000    0.0  ...  25487.537   0.0  311379.75     0.0000   
2  172219.330     0.0000    0.0  ...  18186.266   0.0  175308.56     0.0000   
3  538345.100     0.0000    0.0  ...      0.000   0.0  314285.34  2021.3296   
4   84646.195  1530.7404    0.0  ...      0.000   0.0  215472.31  1582.7941   

   9972        9976       

**Step 4 - Align, Filter, & Normalize**

These steps do not require any further user input. They produce the output data structures.

In [10]:
sample_ids = get_sample_ids_from_treatment_table(treatment_table)
quant_table = subset_quantification_table_to_samples(quant_table, sample_ids)
quant_table = normalize_via_total_ion_current(quant_table) # <-- comment out if normalization done elsewhere
quant_table, spectra = align_feature_subsets(quant_table, spectra)
treatment_table, quant_table = align_treatment_and_quantification_table(treatment_table, quant_table)

Number of columns with only zero entries: 964
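
For readers unfamiliar with total ion current (TIC) normalization, the sketch below shows the basic idea under the assumption that each sample's intensities are divided by the sum of all feature intensities in that sample. The toy values are made up, and the exact scaling used by `normalize_via_total_ion_current` may differ.

```python
import pandas as pd

# Toy quantification table with one row per sample (values are made up).
quant = pd.DataFrame({
    "sample_id": ["s1", "s2"],
    "101": [30.0, 5.0],
    "102": [70.0, 15.0],
})

feature_columns = [col for col in quant.columns if col != "sample_id"]
# Total ion current per sample: the row-wise sum over all feature intensities.
totals = quant[feature_columns].sum(axis=1)

normalized = quant.copy()
# Divide every intensity by its sample total so that each row sums to 1.
normalized[feature_columns] = quant[feature_columns].div(totals, axis=0)
print(normalized)
```

Note that a sample whose intensities are all zero would produce missing values in this sketch, so such samples would need to be handled before dividing.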


In [11]:
print(
  "Printing first 5 rows of treatment and quantification table (not all columns shown): \n", 
  treatment_table.head(), quant_table.head()
)

Printing first 5 rows of treatment and quantification table (not all columns shown): 
        sample_id        treatment
0   E1_pos.mzXML   PleurotusOMSW0
1   E2_pos.mzXML   PleurotusOMSW0
2   E6_pos.mzXML   PleurotusOMSW0
3  E10_pos.mzXML  PleurotusOMSW80
4  E11_pos.mzXML  PleurotusOMSW80        sample_id          6868         11428      9554     18097  14836  \
0   E1_pos.mzXML  1.489321e-06  1.317067e-05  0.000003  0.000004    0.0   
1   E2_pos.mzXML  4.371383e-07  4.495211e-06  0.000002  0.000004    0.0   
2   E6_pos.mzXML  0.000000e+00  7.931358e-06  0.000001  0.000028    0.0   
3  E10_pos.mzXML  5.121265e-07  0.000000e+00  0.000000  0.000011    0.0   
4  E11_pos.mzXML  5.270876e-07  5.276695e-07  0.000000  0.000047    0.0   

          16638      3798     13602          9664  ...         12010  \
0  4.947459e-06  0.000032  0.000377  4.559361e-07  ...  7.892191e-07   
1  4.344093e-07  0.000052  0.000355  0.000000e+00  ...  4.489333e-07   
2  4.977471e-07  0.000058  0.000251  0.000

**Step 5 - Exporting data for use in msFeaST**

*<span style="color:magenta">Required user input: Relative file paths to output data (requires existing folders) </span>*

In [12]:
output_quant_table_filepath = os.path.join("data", "omsw_pleurotus_comparison", "quant_table.csv")
output_treatment_table_filepath = os.path.join("data", "omsw_pleurotus_comparison", "treat_table.csv")
output_spectra_filepath = os.path.join("data", "omsw_pleurotus_comparison", "spectra.mgf")
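
Since the export requires the output folders to exist, they can also be created programmatically. The helper `ensure_parent_folder` below is hypothetical and not part of msFeaST; the demonstration runs inside a temporary directory so that it does not touch the real data folder.

```python
import os
import tempfile


def ensure_parent_folder(filepath):
    """Hypothetical helper: create the folder that will contain filepath, if it is missing."""
    folder = os.path.dirname(filepath)
    if folder:
        # exist_ok=True makes the call safe to repeat when the folder already exists.
        os.makedirs(folder, exist_ok=True)
    return folder


# Demonstration in a temporary directory using the folder layout of this notebook.
with tempfile.TemporaryDirectory() as base_folder:
    demo_filepath = os.path.join(base_folder, "data", "omsw_pleurotus_comparison", "quant_table.csv")
    created_folder = ensure_parent_folder(demo_filepath)
    folder_exists = os.path.isdir(created_folder)

print("Output folder created:", folder_exists)
```

In the notebook itself, the helper would be called once for each of the three output file paths before writing.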

In [13]:
quant_table.to_csv(output_quant_table_filepath, index = False)
treatment_table.to_csv(output_treatment_table_filepath, index = False)
# remove any previous export so the spectra are written to a fresh file
if os.path.exists(output_spectra_filepath):
  os.remove(output_spectra_filepath)
save_as_mgf(spectra, filename = output_spectra_filepath)