# msFeaST example: OMSW comparison between Pleurotus 0 and 80 percent OMSW samples

This Jupyter notebook collects all the steps needed to run the msFeaST pipeline on the illustrative example and produce the dashboard JSON file.

In the first block, all dependencies are loaded. All specified packages should be available after installing msFeaST as instructed.

In [1]:
%load_ext autoreload
%autoreload 2
from matchms.importing import load_from_mgf
from matchms.exporting import save_as_mgf
import pandas as pd
import os
import msfeast.pipeline

The second block specifies all relative file paths, covering both input and output. When using ms2deepscore, a model path may also need to be specified (see the corresponding notebook).

*<span style="color:magenta">Required user input: Make sure that all relative file paths are specified correctly.</span>*

In [2]:
print("Define Filepaths...")
test_data_directory = os.path.join("data", "omsw_pleurotus_comparison")
filepath_test_spectra = os.path.join(test_data_directory, "spectra.mgf")
filepath_test_quant_table = os.path.join(test_data_directory, "quant_table.csv")
filepath_test_treat_table = os.path.join(test_data_directory, "treat_table.csv")

output_directory = test_data_directory
r_output_filename = "r_output.json"
r_filepath = os.path.join(output_directory, r_output_filename)
dashboard_output_filepath = os.path.join(output_directory, "dashboard_data.json")

Define Filepaths...
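
Since every later step depends on these paths, it can help to confirm that the input files exist before initializing the pipeline. The following is a minimal sketch using only the standard library; `find_missing_files` is a hypothetical helper written for this illustration and is not part of msFeaST.

```python
import os


def find_missing_files(filepaths):
    """Return the subset of filepaths that do not point to an existing file."""
    return [path for path in filepaths if not os.path.isfile(path)]


# Paths mirror the ones defined in the cell above.
test_data_directory = os.path.join("data", "omsw_pleurotus_comparison")
required_files = [
    os.path.join(test_data_directory, name)
    for name in ("spectra.mgf", "quant_table.csv", "treat_table.csv")
]

missing = find_missing_files(required_files)
if missing:
    print("Missing input files:", missing)
else:
    print("All input files found.")
```

If any path is reported as missing, check the working directory from which the notebook is run, since all paths here are relative.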


The third block initializes the pipeline and attaches the spectra, the quantification table, and the treatment table.

In [3]:
print("Initializing pipeline...")
pipeline = msfeast.pipeline.Msfeast()

print("Attaching data...")
treat_table = pd.read_csv(filepath_test_treat_table)
quant_table = pd.read_csv(filepath_test_quant_table)
pipeline.attach_spectra_from_file(filepath_test_spectra, identifier_key="scans")
pipeline.attach_quantification_table(quant_table)
pipeline.attach_treatment_table(treat_table)


Initializing pipeline...
Attaching data...


The spectral similarity computations are run using the specified score, here `ModifiedCosine`.

In [4]:
print("Running spectral similarity computations...")
pipeline.run_and_attach_spectral_similarity_computations("ModifiedCosine")

Running spectral similarity computations...


*<span style="color:magenta">Required user input: Specify the values of k to trial run using k-medoid clustering.</span>*

In [5]:
print("Run kmedoid grid...")
pipeline.run_and_attach_kmedoid_grid([50,100,150,200,250])

Run kmedoid grid...
Kmedoid grid results. Use to inform kmedoid classification selection ilocs.
   iloc    k  silhouette_score  random_seed_used
0     0   50          0.209813                 0
1     1  100          0.226936                 0
2     2  150          0.242555                 0
3     3  200          0.254547                 0
4     4  250          0.269968                 0


*<span style="color:magenta">Required user input: Select an appropriate value of k using its iloc, balancing silhouette score and desired number of clusters.</span>*

In [6]:
pipeline.select_kmedoid_settings(iloc = 4)
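
As a rough cross-check of the selection, the grid results printed above can be re-typed into a pandas DataFrame to look at how much the silhouette score gains per step in k. This is only an illustrative sketch: the values are copied from the output above, and the `gain` column is a hypothetical addition for this example, not something msFeaST reports.

```python
import pandas as pd

# Values copied from the k-medoid grid output printed above.
grid = pd.DataFrame(
    {
        "iloc": [0, 1, 2, 3, 4],
        "k": [50, 100, 150, 200, 250],
        "silhouette_score": [0.209813, 0.226936, 0.242555, 0.254547, 0.269968],
    }
)

# Improvement in silhouette score relative to the previous k (NaN for the first row).
grid["gain"] = grid["silhouette_score"].diff()

# Row with the highest silhouette score.
best_iloc = int(grid.loc[grid["silhouette_score"].idxmax(), "iloc"])

print(grid)
print("iloc with highest silhouette score:", best_iloc)
```

The highest score is only one criterion. A smaller k with a slightly lower score may still be preferable if fewer, larger clusters are easier to interpret in the dashboard.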

*<span style="color:magenta">Required user input: Select perplexity values to trial run.</span>*

In [7]:
print("Run t-sne grid...")
pipeline.run_and_attach_tsne_grid([20, 30, 40, 50, 100])

Run t-sne grid...
T-sne grid results. Use to inform t-sne embedding selection.
   iloc  perplexity  pearson_score  spearman_score  random_seed_used
0     0          20       0.444379        0.397913                 0
1     1          30       0.492475        0.466076                 0
2     2          40       0.507838        0.480615                 0
3     3          50       0.519132        0.499464                 0
4     4         100       0.515202        0.494382                 0


*<span style="color:magenta">Required user input: Select an appropriate perplexity value using its iloc.</span>*
High Pearson and Spearman scores indicate good correspondence between high-dimensional and low-dimensional distances. However, distance preservation should be balanced against good grouping qualities in the embedding.

In [8]:
pipeline.select_tsne_settings(iloc = 3)
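
To make the idea of distance preservation concrete, the sketch below correlates pairwise distances before and after a dimensionality reduction. It assumes Euclidean distances on synthetic data and uses SciPy directly. How msFeaST computes its `pearson_score` and `spearman_score` internally is not shown in this notebook, so this should be read as an illustration of the concept rather than the pipeline's implementation. The function name `distance_preservation` is hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr, spearmanr


def distance_preservation(high_dim_points, embedding):
    """Correlate pairwise distances in the original space and in the embedding."""
    d_high = pdist(high_dim_points)  # condensed vector of pairwise Euclidean distances
    d_low = pdist(embedding)
    pearson = pearsonr(d_high, d_low)[0]
    spearman = spearmanr(d_high, d_low)[0]
    return float(pearson), float(spearman)


# Synthetic stand-in data: 30 points in 10 dimensions.
rng = np.random.default_rng(0)
points = rng.normal(size=(30, 10))

# A deliberately crude "embedding": keep only the first two coordinates.
embedding = points[:, :2]

pearson, spearman = distance_preservation(points, embedding)
print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

Both values approach 1 when the embedding preserves the ordering and scale of the original distances, which is why higher scores in the grid point to a more faithful layout.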

Run the statistics routine and integrate all pieces of information into the output JSON file.

In [9]:
print("Initializing R runtime...")
# Remove R output left over from a previous run before starting a new one.
if os.path.isfile(r_filepath):
    os.remove(r_filepath)
pipeline.run_and_attach_statistical_comparisons(output_directory, r_output_filename)

print("Integrating pipeline results...")
pipeline.integrate_and_attach_dashboard_data(top_k_max=50, alpha=0.01)

print("Exporting json file...")
pipeline.export_dashboard_json(filepath=dashboard_output_filepath)

print("Processing complete.")

Initializing R runtime...
[1] "Starting Routine log at "   "2024-06-06 15:31:05.616792"
[1] "R Routine: run integration test..."
[1] "R Routine: Validating input file paths..."
[1] "R Routine: Loading required packages..."
[1] "R Routine: Reading input files..."
[1] "R Routine: running global test and fold change computations..."
[1] "R Routine: exporting globaltest and log fold change computations..."
[1] "R Routine: complete, file saved, exiting R session."
Integrating pipeline results...
Exporting json file...
Processing complete.
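
After the export, a quick look at the written file can confirm that it is valid JSON. The sketch below makes no assumption about the dashboard schema and only reports the top-level structure; `summarize_json` is a hypothetical helper for this illustration.

```python
import json
import os


def summarize_json(filepath):
    """Load a JSON file and describe its top-level structure."""
    with open(filepath, "r", encoding="utf-8") as handle:
        data = json.load(handle)
    if isinstance(data, dict):
        return {"type": "dict", "entries": sorted(data.keys())}
    if isinstance(data, list):
        return {"type": "list", "entries": len(data)}
    return {"type": type(data).__name__, "entries": None}


# Path mirrors dashboard_output_filepath as defined earlier in the notebook.
dashboard_output_filepath = os.path.join(
    "data", "omsw_pleurotus_comparison", "dashboard_data.json"
)

if os.path.isfile(dashboard_output_filepath):
    print(summarize_json(dashboard_output_filepath))
else:
    print("Dashboard file not found:", dashboard_output_filepath)
```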
