## Application 2: Analysis of candidate reference standard materials

This is the code to generate the results and figures for Application 2, published in **(ADD DOI)**.

In this example we perform data curation as a first step for a suite of plasma candidate reference materials. PCA score plots are generated before and after curation to evaluate the data, in particular temporal drift and dispersion of the QC samples.

The curation strategy consists of the following steps:

1. Discard features with retention time values lower than 90 seconds, as the system dead time was approximately 0.8 minutes.
2. Discard features that were not present in all pooled QC samples as a condition for performing the LOESS-batch correction.
3. If the peak area of a feature in a plasma sample was no more than 10 times the maximum peak area of the same feature in the solvent, zero volume injection, and sample preparation blanks, its peak area was set to 0. Otherwise, the mean peak area in those blanks was subtracted from the feature peak areas in the plasma samples. In this way, potential contaminants and solvent signals were removed before further analysis.
4. A filter based on the relative standard deviation (RSD) of each feature in the pooled QC samples was applied: all features with an RSD > 20% in pooled QC sample injections were removed. To obtain a robust estimate of the RSD, the standard deviation was estimated from the median absolute deviation multiplied by a scaling factor.
5. A 100% intraclass prevalence filter was subsequently applied with a threshold area value of 5; i.e., all area values below this threshold were considered as zero.
6. The prevalence filter was followed by a D-ratio filter, where the D-ratio was estimated as the median absolute deviation (MAD) of the pooled QC samples divided by the MAD of all samples. This filter removes features with little or no biological information: only features with a D-ratio lower than 10% were accepted, leading to a final data matrix of 665 metabolic features.
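
Steps 3, 4 and 6 can be illustrated with a small standard-library sketch. It follows the wording of the list above and is not the tidyms implementation. The function names, the scaling constant 1.4826 (the usual MAD-to-standard-deviation factor for normally distributed data), the use of the median as the denominator of the robust RSD, and the toy values are all assumptions made for illustration.

```python
from statistics import mean, median

# Assumed scaling factor: makes the MAD comparable to the standard
# deviation for normally distributed data.
MAD_SCALE = 1.4826


def mad(values):
    """Median absolute deviation around the median."""
    values = list(values)
    center = median(values)
    return median(abs(v - center) for v in values)


def blank_correct(sample_areas, blank_areas, factor=10):
    """Step 3 as worded above: set to zero the areas that are at most
    `factor` times the maximum blank area, otherwise subtract the mean
    blank area."""
    limit = factor * max(blank_areas)
    offset = mean(blank_areas)
    return [0.0 if area <= limit else area - offset for area in sample_areas]


def robust_rsd(qc_areas):
    """Step 4: scaled MAD divided by the median (assumed definition)."""
    return MAD_SCALE * mad(qc_areas) / median(qc_areas)


def d_ratio(qc_areas, sample_areas):
    """Step 6: MAD of the QC samples over the MAD of the other samples."""
    return mad(qc_areas) / mad(sample_areas)


# made-up peak areas for a single feature
qc = [100, 102, 98, 101, 99]
study = [50, 150, 100, 200, 20]

print(blank_correct([30, 500], blank_areas=[2, 4]))
print(robust_rsd(qc) <= 0.20)      # True: passes the 20% RSD filter
print(d_ratio(qc, study) <= 0.10)  # True: passes the 10% D-ratio filter
```

The scaling factor cancels in the D-ratio, so only the robust RSD depends on it.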

At the end of the data curation pipeline, the 665-feature matrix was normalized to the total area of each sample.
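
As a rough illustration of total-area normalization, the NumPy sketch below divides each row of a toy matrix by its own sum. It assumes that rows are samples and columns are features; the function name and the values are made up, and the exact scaling that tidyms applies with `method="sum"` may differ.

```python
import numpy as np


def normalize_total_area(areas):
    """Divide each row (sample) by the sum of its feature areas."""
    areas = np.asarray(areas, dtype=float)
    totals = areas.sum(axis=1, keepdims=True)
    return areas / totals


# toy matrix: 2 samples x 3 features (made-up values)
areas = np.array([[10.0, 30.0, 60.0],
                  [5.0, 5.0, 10.0]])
normalized = normalize_total_area(areas)
print(normalized)
```

After this step every sample has the same total signal, so differences in overall intensity between injections no longer dominate the comparison.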


**The batch correction is the most computationally intensive step, with running times of ~10 min on a personal computer with an 8th generation Intel i5 processor and 8 GiB of memory.**

In [None]:
import tidyms as ms
import bokeh.plotting
bokeh.plotting.output_notebook()
import pandas as pd
import matplotlib.pyplot as plt
import os
import numpy as np
import seaborn as sns
sns.set_context("paper", font_scale=1.5)
from adjustText import adjust_text
from download_from_metabolights import get_application2_data

In [None]:
# check if the data is available.
get_application2_data("data")

In [None]:
# data loading
data_matrix_path = "data/NIST001.csv"
data = ms.fileio.read_progenesis(data_matrix_path)

# add order and batch information
data.add_order_from_csv("data/run-order-data.csv")

# rename classes from 1, 2, 3, 4 to RM1, RM2, RM3, SRM4
class_mapping = {"1": "RM1", "2": "RM2", "3": "RM3", "4": "SRM4",
                 "QC": "QC", "Z": "Z", "SCQC": "SCQC",
                 "B": "B", "SSS": "SSS", "SV": "SV"}
data.classes = data.sample_metadata["class"].map(class_mapping)

# set sample mapping
sample_mapping = {"qc": ["QC"],
                  "blank": ["SV","B", "Z"],
                  "sample": ["RM1", "RM2", "RM3", "SRM4"],
                  "suitability": ["SSS"]}
data.mapping = sample_mapping

# set plot mode to seaborn
data.set_plot_mode("seaborn")

In [None]:
# custom filter to remove features based on retention time
class RTFilter(ms.filter.Processor):
    """Remove features with a retention time outside [min_rt, max_rt]."""

    def __init__(self, min_rt=None, max_rt=None, verbose=False):
        super(RTFilter, self).__init__("filter", "features", verbose=verbose)
        # a missing bound leaves that side of the interval open
        if min_rt is None:
            min_rt = 0
        if max_rt is None:
            max_rt = np.inf
        self.params = {"min_rt": min_rt, "max_rt": max_rt}

    def func(self, dc):
        # return the labels of the features to remove
        rt = dc.feature_metadata["rt"]
        min_rt = self.params["min_rt"]
        max_rt = self.params["max_rt"]
        invalid_rt = (rt < min_rt) | (rt > max_rt)
        rm_features = rt[invalid_rt].index
        return rm_features

## Figure 4
PCA score plot of study and QC samples before data curation. Data are mean centered and scaled to unit variance.
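
The preprocessing named in the caption (mean centering and scaling to unit variance, often called autoscaling) followed by PCA can be sketched with NumPy as below. This is an illustration on random toy data, not how tidyms computes the scores internally. The function names, the use of the sample standard deviation (`ddof=1`) and the SVD-based PCA are assumptions.

```python
import numpy as np


def autoscale(x):
    """Mean center each column and scale it to unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)


def pca_scores(x, n_components=2):
    """Scores on the first principal components, computed by SVD."""
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# toy data matrix: 12 samples x 6 features (random, for illustration only)
rng = np.random.default_rng(0)
areas = rng.lognormal(mean=5.0, sigma=1.0, size=(12, 6))

scaled = autoscale(areas)
scores = pca_scores(scaled)
print(scores.shape)
```

Autoscaling gives every feature the same weight in the PCA, so high-intensity features do not dominate the first components.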

In [None]:
ignore_classes = ["Z", "SV", "B", "SCQC", "SSS"]
palette = ["#8bcde8ff", "#e54042ff", "#efbf33ff", "#455d7aff", "#a7b3bdff"]
g = data.plot.pca_scores(ignore_classes=ignore_classes,
                         relplot_params={"hue_order": ["QC", "RM1", "RM2", "RM3", "SRM4"], "s": 50,
                                         "palette": palette},
                         scaling="autoscaling")

# use adjust_text to label each point
ax = g.axes[0, 0]
scores, _, _, _ = data.metrics.pca(scaling="autoscaling", ignore_classes=ignore_classes)
qc_samples_index = data.classes.isin(["QC"])
qc_samples_index = qc_samples_index[qc_samples_index].index
pc1 = scores.loc[qc_samples_index, "PC1"].values
pc2 = scores.loc[qc_samples_index, "PC2"].values
# convert to an array so that order[i] below is positional
order = data.order[qc_samples_index].values
text = [ax.text(pc1[i], pc2[i], str(order[i]), ha='center', va='center') for i in range(pc1.size)]
adjust_text(text);
# g.savefig("pca-before.png", dpi=300)

## Figure S4 (a)

PCA score plot of QC samples before data curation. Data are normalized to total area, mean centered and scaled to unit variance.

In [None]:
ignore_classes = ["Z", "SV", "B", "SCQC", "SSS",
                  "RM1", "RM2", "RM3", "SRM4"]
g = data.plot.pca_scores(ignore_classes=ignore_classes,
                         scaling="autoscaling", normalization="sum")

# use adjust_text to label each point
ax = g.axes[0, 0]
scores, _, _, _ = data.metrics.pca(scaling="autoscaling", 
                                   normalization="sum",
                                   ignore_classes=ignore_classes)
pc1 = scores["PC1"].values
pc2 = scores["PC2"].values
order = data.order[scores.index].values
text = [ax.text(pc1[i], pc2[i], str(order[i]), ha='center', va='center') for i in range(pc1.size)]
adjust_text(text)
# g.savefig("pca-qc-before.png", dpi=300)

In [None]:
%%time

# remove the last three blank samples
rmsamples = ["NZ_20200227_095", "NZ_20200227_097", "NZ_20200227_099"]
data.remove(rmsamples, axis="samples")

# building a data curation pipeline

# remove features with rt lower than 90 seconds
rt_filter = RTFilter(min_rt=90)
# correct instrumental signal drift
batch_corrector = ms.filter.BatchCorrector(min_qc_dr=0.8, first_n_qc=3,
                                           method="additive")
# blank correction
blank_corrector = ms.filter.BlankCorrector(mode="max", factor=10)
# remove blank and conditioning QC samples
class_filter = ms.filter.ClassRemover(["B","SV","SSS","SCQC","Z"])
# remove features with a %RSD higher than 20 % in the QC samples
vf = ms.filter.VariationFilter(ub=0.2, robust=True)
# remove features that are not detected in all study samples
pf = ms.filter.PrevalenceFilter(lb=1, threshold=5)
# remove features with low biological variation
drf = ms.filter.DRatioFilter(ub=0.1, robust=True)

# Build and apply the data curation pipeline
processors = [rt_filter, batch_corrector, blank_corrector, class_filter,
              vf, pf, drf]
pipeline = ms.filter.Pipeline(processors, verbose=True)
pipeline.process(data)

# normalize to total signal intensity
data.preprocess.normalize(method="sum", inplace=True)

In [None]:
data.data_matrix.shape

In [None]:
data.save("data/nist001-curated.pickle")

In [None]:
# load curated data to bypass curation step
data = ms.DataContainer.from_pickle("data/nist001-curated.pickle")

In [None]:
data.metrics.cv(robust=True, fill_value=0).loc["QC"].plot.hist()

## Figure 4
PCA score plot of study and QC samples after data curation. Data are normalized to total area, mean centered and scaled to unit variance.

In [None]:
palette = ["#8bcde8ff", "#e54042ff", "#efbf33ff", "#455d7aff", "#a7b3bdff"]
g = data.plot.pca_scores(scaling="autoscaling", normalization="sum",
                         relplot_params={"hue_order": ["QC", "RM1", "RM2", "RM3", "SRM4"],
                                         "palette": palette, "s": 50})
# g.savefig("pca-after.png", dpi=300)

## Figure S4 (b)

PCA score plot of QC samples after data curation. Data are normalized to total area, mean centered and scaled to unit variance.

In [None]:
# class names must match the mapping defined when loading the data
ignore_classes = ["RM1", "RM2", "RM3", "SRM4"]
g = data.plot.pca_scores(ignore_classes=ignore_classes,
                         scaling="autoscaling", normalization="sum")

# use adjust_text to label each point
ax = g.axes[0, 0]
scores, _, _, _ = data.metrics.pca(scaling="autoscaling", 
                                   normalization="sum",
                                   ignore_classes=ignore_classes)
pc1 = scores["PC1"].values
pc2 = scores["PC2"].values
order = data.order[scores.index].values
text = [ax.text(pc1[i], pc2[i], str(order[i]), ha='center', va='center') for i in range(pc1.size)]
adjust_text(text);
# g.savefig("pca-qc-after.png", dpi=300)