In [None]:
import numpy as np
import pandas as pd
import ms_feature_validation as mfv
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

## Reading Metabolomics data

Metabolomics data is stored in a DataContainer object. A DataContainer can be built from pandas DataFrames or read directly from the output files of common processing tools.

In [None]:
# Reading data from a Progenesis csv file
fname = "SuerosRCC_ESi_neg_default_SepOct2017.csv"
data = mfv.filter.read_progenesis(fname)

The data container stores information in three different DataFrames:

1. Data Matrix: contains feature values for each sample. Each sample is a row and each feature is a column.
2. Sample Metadata: contains sample information, such as class, run order, batch, sample id, etc. Each row is a sample.
3. Feature Metadata: contains feature information. In the case of LC-MS data it contains retention time, exact mass, etc. Each row is a feature.

Indexes are shared between data matrix rows and sample metadata rows, and between data matrix columns and feature metadata rows.
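To illustrate the shared indexes, here is a minimal sketch built with plain pandas rather than with the DataContainer itself. The sample labels, feature labels and values are hypothetical and only mimic the layout described above.

```python
import pandas as pd

# Hypothetical toy data: 3 samples x 2 features.
samples = ["s1", "s2", "s3"]
features = ["FT001", "FT002"]

data_matrix = pd.DataFrame(
    [[10.0, 200.0], [12.0, 180.0], [11.0, 220.0]],
    index=samples, columns=features)
sample_metadata = pd.DataFrame(
    {"class": ["QC", "EI", "EII"]}, index=samples)
feature_metadata = pd.DataFrame(
    {"mz": [203.0821, 508.3403], "rt": [35.2, 120.7]}, index=features)

# Data matrix rows line up with sample metadata rows,
# and data matrix columns line up with feature metadata rows.
rows_aligned = data_matrix.index.equals(sample_metadata.index)
cols_aligned = data_matrix.columns.equals(feature_metadata.index)

# Shared labels make it easy to select features using their metadata.
early = feature_metadata.index[feature_metadata["rt"] < 100]
early_values = data_matrix.loc[:, early]
```

Because the labels are shared, a selection made on the metadata can be used directly to index the data matrix.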

In [None]:
data.data_matrix.head()

In [None]:
data.sample_metadata.head()

In [None]:
data.feature_metadata.head()

Some common fields, such as class, run order and batch number, are accessible as DataContainer attributes. Accessing run order or batch raises an exception if they are not defined.

In [None]:
data.classes.head()

### Setting run order information and batch information

The order and batch attributes of the DataContainer can be set directly.

In [None]:
# In this example, the sample name contains batch and order information.
# This code extracts that information from the sample name and sets the
# batch and order attributes.

# The index has the following format: name_date_project_run_order
# The date is extracted and converted to a batch number.
# extracting batch data
batch = pd.Series(data=data.sample_metadata.index.str.split("_"), index=data.data_matrix.index)
batch = batch.apply(lambda x: x[1])
days = np.sort(batch.unique())
batch_map = dict(zip(days, np.arange(1, days.size + 1)))
batch = batch.map(batch_map)
# extracting order data
order = pd.Series(data=data.sample_metadata.index.str.split("_"), index=data.data_matrix.index)
order = order.apply(lambda x: x[-1]).astype(int)

data.order = order
data.batch = batch
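The parsing logic can be checked on a handful of made-up sample names before applying it to real data. The sketch below assumes the same `name_date_project_order` layout; the names themselves are hypothetical.

```python
import pandas as pd

# Hypothetical sample names following name_date_project_order.
names = pd.Index(["QC_20170925_RCC_1", "EI_20170925_RCC_2",
                  "QC_20171002_RCC_3", "SV_20171002_RCC_4"])

# Split each name once; the result is a Series of lists indexed by name.
parts = names.to_series().str.split("_")

# The second field is the acquisition date. Sorted unique dates are
# numbered 1, 2, ... to obtain batch numbers.
dates = parts.str[1]
batch_map = {day: k for k, day in enumerate(sorted(dates.unique()), start=1)}
batch = dates.map(batch_map)

# The last field is the run order.
order = parts.str[-1].astype(int)
```

Splitting once and reusing the result avoids building the same Series twice.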

## Data curation

Data curation is implementated through a series of Process objects that perform transformations on the Data matrix or remove features/samples according to a criteria. Data curation is strongly based on concepts defined on [this paper](https://doi.org/10.1007).

Although the filters are highly customizable, the easiest way to perform data curation is to first define a mapping.
A mapping is a dictionary that maps sample types to sample classes. Using the information provided by the mapping, a Processor knows which samples to use to correct a data set and which classes are to be corrected.

Once created, a filter is applied by calling its process method.

In [None]:
# In this example, quality control samples are samples of the class "QC",
# blank samples are samples of the class "SV", and study samples are samples
# of the classes "EI", "EII", "EIII", "EIV" and "CS".
mapping = {"blank": ["SV"],
           "qc": ["QC"],
           "sample": ["EI", "EII", "EIII", "EIV", "CS"]}
data.mapping = mapping
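To make the role of the mapping concrete, the sketch below shows how a sample type can be resolved to sample labels using only pandas. The helper `samples_of_type` and the class labels are hypothetical and are not part of the library; they only illustrate the lookup a filter would need to perform.

```python
import pandas as pd

# Hypothetical class labels for six samples.
classes = pd.Series(["QC", "SV", "EI", "EII", "QC", "CS"],
                    index=["s1", "s2", "s3", "s4", "s5", "s6"])
mapping = {"blank": ["SV"],
           "qc": ["QC"],
           "sample": ["EI", "EII", "EIII", "EIV", "CS"]}


def samples_of_type(classes, mapping, sample_type):
    """Return the labels of samples whose class belongs to sample_type."""
    return classes.index[classes.isin(mapping[sample_type])]


qc_samples = samples_of_type(classes, mapping, "qc")
blank_samples = samples_of_type(classes, mapping, "blank")
study_samples = samples_of_type(classes, mapping, "sample")
```

Classes listed in the mapping but absent from the data (EIII and EIV here) are simply ignored by the lookup.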

## Getting common metrics from DataContainer objects

Some common metrics associated with metabolomics data can be obtained using the metrics object:

1. Coefficient of variation (CV) for each feature
2. D-Ratio for each feature
3. Detection rate for each feature
4. PCA loadings, scores and cumulative variance
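The first three metrics can be sketched directly in pandas. This is an illustration under stated assumptions, not the library's implementation: the CV is taken as standard deviation over mean, the D-ratio as the QC standard deviation divided by the study-sample standard deviation, and a feature counts as detected when its value is above zero. The data and class labels are hypothetical.

```python
import pandas as pd

# Hypothetical data: rows are samples, columns are features.
X = pd.DataFrame(
    {"FT001": [100.0, 102.0, 98.0, 50.0, 150.0, 100.0],
     "FT002": [10.0, 0.0, 12.0, 0.0, 30.0, 0.0]},
    index=["q1", "q2", "q3", "s1", "s2", "s3"])
classes = pd.Series(["QC", "QC", "QC", "EI", "EI", "EI"], index=X.index)

# CV per class: standard deviation divided by mean, feature by feature.
grouped = X.groupby(classes)
cv = grouped.std() / grouped.mean()

# D-ratio: dispersion in QC samples relative to dispersion in study samples.
d_ratio = X[classes == "QC"].std() / X[classes == "EI"].std()

# Detection rate: fraction of samples in which the feature is above a
# threshold (zero here, which is an assumption).
detection_rate = (X > 0).mean()
```

A low CV in the QC class and a low D-ratio both indicate that technical variation is small compared with the variation between study samples.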

In [None]:
# cv for each class
cv = data.metrics.cv()
cv.head()

In [None]:
score, loading, variance = data.metrics.pca(n_components=2)
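For reference, the three quantities returned by a PCA can be sketched with NumPy alone using a singular value decomposition of the mean-centered data. This is only an illustration on random, hypothetical data; the library may scale the data differently or rely on another implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))      # hypothetical: 20 samples x 5 features

Xc = X - X.mean(axis=0)           # mean-center each feature
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

n_components = 2
# Scores: coordinates of each sample on the principal components.
scores = U[:, :n_components] * S[:n_components]
# Loadings: weight of each feature on the principal components.
loadings = Vt[:n_components].T
# Cumulative fraction of the total variance explained by the components.
explained = S**2 / np.sum(S**2)
cumulative_variance = np.cumsum(explained)[:n_components]
```

Projecting the centered data onto the loadings (`Xc @ loadings`) reproduces the scores, which is a quick consistency check.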

In [None]:
fig, axes = plt.subplots(figsize=(12, 8))
sns.scatterplot(data=score, x="PC1", y="PC2", hue=data.classes, ax=axes)

In [None]:
# blank correction
br = mfv.filter.BlankCorrector(mode="lod")
br.process(data)

# prevalence filter
pf = mfv.filter.PrevalenceFilter()
pf.process(data)

# variation filter
vf = mfv.filter.VariationFilter()
vf.process(data)

Several filters can be applied in sequence using the Pipeline object.

In [None]:
# revert filter effects
data.reset()

# process data with several filters using a pipeline
pipe = mfv.filter.Pipeline([br, pf, vf])
pipe.process(data)
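The idea behind a pipeline is simply to apply each filter in order. The sketch below shows that pattern on a plain DataFrame. All three classes are hypothetical stand-ins, not the library's filters, and unlike the cells above (where `data.reset()` is needed to undo the filters) they return a new DataFrame instead of modifying the input.

```python
import pandas as pd


class ThresholdFilter:
    """Hypothetical filter: drop features whose mean is below a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def process(self, df):
        keep = df.mean() >= self.threshold
        return df.loc[:, keep]


class DetectionFilter:
    """Hypothetical filter: drop features detected in too few samples."""

    def __init__(self, min_fraction):
        self.min_fraction = min_fraction

    def process(self, df):
        keep = (df > 0).mean() >= self.min_fraction
        return df.loc[:, keep]


class SimplePipeline:
    """Apply each filter in order, feeding its output to the next one."""

    def __init__(self, filters):
        self.filters = list(filters)

    def process(self, df):
        for f in self.filters:
            df = f.process(df)
        return df


df = pd.DataFrame({"a": [10, 12, 11], "b": [0, 0, 30], "c": [1, 1, 1]})
pipe_sketch = SimplePipeline([ThresholdFilter(5), DetectionFilter(0.5)])
result = pipe_sketch.process(df)
```

Feature `c` is removed by the first filter (low mean) and feature `b` by the second (detected in only one of three samples), so only `a` remains.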

## Analyzing raw LC-MS data

Raw MS data in the mzML format is read using the pyopenms module. The MSData object provides several methods to read and process MS data.

In [None]:
import ms_feature_validation as mfv
lcms_data = mfv.fileio.MSData("20190918_039.mzML")

In [None]:
# build an extracted ion chromatogram (EIC) for each m/z in a list
mz_list = [203.0821, 508.3403, 285.2066]
rt, eic = lcms_data.get_eic(mz_list)
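An extracted ion chromatogram sums, for every scan, the intensity found in a narrow window around a target m/z. The sketch below illustrates the idea with NumPy on three hypothetical scans. The function `extract_eic`, the tolerance value and the data are assumptions for illustration and do not reflect how `get_eic` is implemented.

```python
import numpy as np


def extract_eic(spectra, mz_list, tolerance=0.005):
    """Sum intensities within target m/z +/- tolerance for each scan.

    spectra: list of (mz_array, intensity_array) pairs, one per scan.
    Returns an array with one row per target m/z and one column per scan.
    """
    eic = np.zeros((len(mz_list), len(spectra)))
    for scan, (mz, intensity) in enumerate(spectra):
        for k, target in enumerate(mz_list):
            in_window = np.abs(mz - target) <= tolerance
            eic[k, scan] = intensity[in_window].sum()
    return eic


# Hypothetical scans: (m/z values, intensities).
spectra = [
    (np.array([203.0820, 285.2000, 508.3400]), np.array([10.0, 5.0, 1.0])),
    (np.array([203.0825, 508.3405]), np.array([20.0, 3.0])),
    (np.array([150.0000]), np.array([7.0])),
]
rt_sketch = np.array([0.0, 0.5, 1.0])  # hypothetical retention times
eic_sketch = extract_eic(spectra, [203.0821, 508.3403, 285.2066])
```

Each row of the result corresponds to one target m/z, matching the `eic[1, :]` indexing used in the plotting cell. The third target finds no signal because the closest peak lies outside the tolerance window.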

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(rt, eic[1, :])

In [None]:
mfv.peaks.pick_cwt(rt, eic[1, :], min_width=1)

In [None]:
np.diff(rt).min()