# Preprocessing LC-MS data with the nPYc-toolbox

#### This document provides a pipeline for the import of LC-MS data and any associated sample metadata, followed by summaries and quality control reports of the data, implementation of quality control analysis, and output of a final dataset ready for sharing with collaborators and data modelling.

#### By default, all summary reports (with the exception of the final report) are output only to this notebook. The notebook (including outputs) can be saved using File > Save and Checkpoint. If HTML copies of any reports are required, they can be saved automatically to the save directory by adding the optional input argument destinationPath=saveDir.

# 1. Initial Setup

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
    Define the path to the raw data and corresponding metadata. 
    </font>
</div>

### Define file paths

In [None]:
peakPickedDataPath = '/path to peak picked data file/PipelineTest U RPOS xcms.csv'
basicCSVPath = '/path to sample metadata file/PipelineTest U RPOS Basic CSV.csv'

# rawDataPath = 'path to raw data files'

saveDir = '/path to save outputs'
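A mistyped path otherwise only surfaces when the import call fails. The following is a minimal sketch of a pre-flight check. `missing_paths` is a hypothetical helper written for this notebook, not part of the nPYc-toolbox, and the example paths are the same placeholders used above.

```python
import os

def missing_paths(paths):
    """Return the names of entries in `paths` (name -> path) that do not exist on disk."""
    return [name for name, path in paths.items() if not os.path.exists(path)]

# Placeholder paths mirroring the variables defined above
example = {
    'peakPickedDataPath': '/path to peak picked data file/PipelineTest U RPOS xcms.csv',
    'basicCSVPath': '/path to sample metadata file/PipelineTest U RPOS Basic CSV.csv',
}
print('Missing:', missing_paths(example))
```

An empty list means every listed path was found.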

### Import code

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
    Import the required Python libraries and the nPYc toolbox.
    </font>
</div>

In [None]:
import os
import matplotlib.pyplot as plt
import scipy
import pandas
import numpy
import pickle
import seaborn as sns
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
%matplotlib inline
import sys
import nPYc
import pyChemometrics
import copy
from nPYc.enumerations import VariableType, DatasetLevel, AssayRole, SampleType
from nPYc.utilities.normalisation import NullNormaliser, TotalAreaNormaliser, ProbabilisticQuotientNormaliser

# 2. Import Data and Sample Metadata

### Import acquired data and associated acquisition parameters

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
    Import dataset (peak-picked in XCMS). Set the name of the 'dataset' object.
    </font>
</div>

In [None]:
dataset = nPYc.MSDataset(peakPickedDataPath, fileType='XCMS')

In [None]:
dataset.name = 'nPYc example PipelineTest U RPOS'

### Match Acquired Samples to Sample IDs and subject information (if available)

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
    Add the corresponding sample metadata (including Sample IDs and any other sample-specific information) and match it to the acquired data. Here, the sample metadata is in the 'Basic CSV' format.
    <br/><br/>
    Note that acquisition-related parameters (for example, acquisition time) can also be added by reading the raw data files (if available) using the 'addSampleInfo' function (line commented out below).
    </font>
</div>

In [None]:
dataset.addSampleInfo(descriptionFormat='Basic CSV', filePath=basicCSVPath)

In [None]:
# dataset.addSampleInfo(descriptionFormat='Raw Data', filePath=rawDataPath)

### Exclude features outside of the useful retention time range of the assay

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        Use the 'excludeFeatures' function to mark features outside of the useful RT range for exclusion. For the RPOS assay, this corresponds to features with a retention time below 0.6 or above 10.5.
        <br/><br/>
        Subsequently, use the 'applyMasks' function to permanently remove these features from the dataset.
    </font>
</div>

In [None]:
print('Number of original features: ' + str(sum(dataset.featureMask)))
dataset.excludeFeatures(dataset.featureMetadata[dataset.featureMetadata['Retention Time'] > 10.5]['Feature Name'], on='Feature Name', message='Outside RT limits')
dataset.excludeFeatures(dataset.featureMetadata[dataset.featureMetadata['Retention Time'] < 0.6]['Feature Name'], on='Feature Name', message='Outside RT limits')
print('Number of features within RT range: ' + str(sum(dataset.featureMask)))

In [None]:
dataset.applyMasks()
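The two exclusion calls above amount to a single boolean window on retention time. A minimal sketch on toy values, assuming inclusive limits (which matches the strict `<` and `>` comparisons used for exclusion). `rt_window_mask` is a hypothetical helper, not a toolbox function.

```python
import numpy as np

def rt_window_mask(retention_times, rt_min=0.6, rt_max=10.5):
    """True for features whose retention time lies inside [rt_min, rt_max]."""
    rt = np.asarray(retention_times, dtype=float)
    return (rt >= rt_min) & (rt <= rt_max)

# Toy retention times, not real data
rt = np.array([0.3, 0.6, 5.2, 10.5, 11.0])
mask = rt_window_mask(rt)
print(mask.sum(), 'of', mask.size, 'features retained')
```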

# 3. Sample & Feature Filtering

### Generate sample summary report

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
       This summary can be used to check the expected samples against those acquired (for example, sample numbers, sample type, samples missing from acquisition or lacking metadata information).
    </font>
</div>

In [None]:
nPYc.reports.generateReport(dataset, 'sample summary')
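The check the sample summary supports (expected versus acquired samples) is in essence a pair of set differences. A minimal sketch with made-up sample IDs. `compare_samples` is a hypothetical helper and does not reproduce the report itself.

```python
def compare_samples(acquired, expected):
    """Compare acquired sample IDs against the IDs listed in the metadata."""
    acquired, expected = set(acquired), set(expected)
    return {
        'missing from acquisition': sorted(expected - acquired),
        'lacking metadata': sorted(acquired - expected),
    }

# Made-up IDs for illustration only
result = compare_samples(acquired=['S1', 'S2', 'S4'], expected=['S1', 'S2', 'S3'])
print(result)
```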

### Generate feature summary report

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        The feature summary report provides visualisations summarising the quality of the dataset and highlighting any problematic areas, including:
        <ul>
        <li>The distribution of feature intensities in each sample class.</li>
        <li>The TIC in each sample against sample acquisition order, coloured by both sample class and instrument gain parameter (detector voltage). This provides insight into potential run-order and batch effects.</li>
        <li>The correlation of feature intensity to dilution and TIC in the dilution series. This provides insight into potential issues in correlation to dilution.</li>
        <li>A histogram of feature RSDs, and a plot comparing the RSD measured in the different sample classes (study reference samples, study samples, etc.). This provides insight into variance structures in the dataset, with the expectation that biological variance should exceed analytical variance.</li>
        <li>An ion map visualises the location of the detected features in the m/z and retention time space of the assay.</li>
        </ul>
    </font>
</div>

In [None]:
nPYc.reports.generateReport(dataset,'feature summary')
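The RSD comparison described above can be sketched on toy data as follows. This assumes the sample standard deviation (`ddof=1`) and a percentage scale, and the toolbox's own calculation may differ in detail. `rsd` is a hypothetical helper.

```python
import numpy as np

def rsd(intensities):
    """Percentage relative standard deviation per feature (one column per feature)."""
    X = np.asarray(intensities, dtype=float)
    return 100.0 * X.std(axis=0, ddof=1) / X.mean(axis=0)

# Toy intensities: rows are samples, columns are two features
reference = np.array([[100., 50.], [102., 80.], [98., 20.]])   # repeated reference sample
study = np.array([[60., 50.], [140., 55.], [100., 45.]])       # study samples

# A feature is informative when study-sample variation exceeds reference variation
informative = rsd(study) > rsd(reference)
print(rsd(reference), rsd(study), informative)
```

In this toy case the first feature varies more between study samples than between repeated reference injections, while the second does not.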

### Assess batch and run-order effects and apply correction if necessary

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        Visualise feature intensity vs analysis order for a small subset of features in order to assess the need to correct batch and run-order effects. 
    </font>
</div>

In [None]:
nPYc.reports.generateReport(dataset, 'batch correction assessment')

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
       Apply correction if necessary.
    </font>
</div>

In [None]:
datasetCorrected = nPYc.batchAndROCorrection.correctMSdataset(dataset, window=11)
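To illustrate the general idea of run-order correction, the sketch below estimates a trend from reference samples only and divides it out of every sample. It is a deliberately simplified stand-in (a moving median plus linear interpolation), not the algorithm used by `correctMSdataset`, and the meaning given to `window` here is an assumption made for the illustration.

```python
import numpy as np

def correct_run_order(intensity, run_order, is_reference, window=3):
    """Divide out a smoothed run-order trend estimated from reference samples only."""
    intensity = np.asarray(intensity, dtype=float)
    run_order = np.asarray(run_order, dtype=float)
    is_reference = np.asarray(is_reference, dtype=bool)

    # Reference samples sorted by run order
    idx = np.argsort(run_order[is_reference])
    ref_order = run_order[is_reference][idx]
    ref_int = intensity[is_reference][idx]

    # Moving median over neighbouring reference samples (window shrinks at the edges)
    half = window // 2
    trend = np.array([np.median(ref_int[max(0, i - half): i + half + 1])
                      for i in range(ref_int.size)])

    # Interpolate the trend to every sample, then rescale to the reference median
    fitted = np.interp(run_order, ref_order, trend)
    return intensity / fitted * np.median(ref_int)

# Toy feature with a linear upward drift; every second injection is a reference sample
run_order = np.arange(9)
is_reference = run_order % 2 == 0
measured = 100.0 * (1 + 0.1 * run_order)
corrected = correct_run_order(measured, run_order, is_reference, window=3)
print('Spread before:', measured.std(), 'after:', corrected.std())
```

Samples acquired before the first or after the last reference sample take the nearest trend value, because `np.interp` holds its end values constant.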

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        Compare the dataset before and after correction.
    </font>
</div>

In [None]:
nPYc.reports.generateReport(dataset, 'batch correction summary', msDataCorrected=datasetCorrected)

### Filter features

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        Generate the feature selection report to assess the number of features passing the quality control thresholds described in: <br/>
        &emsp; Lewis et al, Development and Application of Ultra-Performance Liquid Chromatography-TOF MS for Precision Large Scale <br/> &emsp; Urinary Metabolic Phenotyping, Anal. Chem., 2016, 88 (18), pp 9004–9013
        <br/><br/>
        Filter the features based on these parameters using the 'updateMasks' function. To keep all samples, set the 'filterSamples' argument to False.
    </font>
</div>

In [None]:
nPYc.reports.generateReport(datasetCorrected,'feature selection')

In [None]:
datasetCorrected.updateMasks(filterFeatures=True, filterSamples=False)
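Conceptually, feature filtering combines several per-feature criteria into one boolean mask. The sketch below is built on the quantities shown in the feature summary report (correlation to dilution, RSD in reference samples, RSD in study samples). The function name, argument names, and threshold values are illustrative assumptions and should not be read as the toolbox defaults.

```python
import numpy as np

def feature_pass_mask(corr_to_dilution, rsd_reference, rsd_study,
                      corr_threshold=0.7, rsd_threshold=30.0, variance_ratio=1.1):
    """True for features that pass all three illustrative criteria."""
    corr = np.asarray(corr_to_dilution, dtype=float)
    rsd_ref = np.asarray(rsd_reference, dtype=float)
    rsd_ss = np.asarray(rsd_study, dtype=float)
    return ((corr >= corr_threshold)                 # responds to dilution
            & (rsd_ref <= rsd_threshold)             # analytically reproducible
            & (rsd_ss > rsd_ref * variance_ratio))   # biological > analytical variance

# Toy values for four features
passed = feature_pass_mask(corr_to_dilution=[0.9, 0.5, 0.95, 0.8],
                           rsd_reference=[10, 10, 40, 20],
                           rsd_study=[30, 30, 60, 21])
print(passed)
```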

### Filter samples

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        Remove samples which fail based on any of the above analytical criteria by applying the sample masks.
        <br/><br/>
        At this point we can also exclude any other samples which are not required by setting preferences with the 'sampleTypes' argument. In this example, we limit our dataset to study samples and quality control samples only.
        <br/><br/>
        To keep all features, set the 'filterFeatures' argument to False.
    </font>
</div>

In [None]:
datasetCorrected.updateMasks(sampleTypes=[SampleType.StudySample, SampleType.StudyPool], filterSamples=True, filterFeatures=False)
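The effect of combining a sample-type preference with the analytical pass/fail status can be sketched as below. `SampleTypeSketch` is a hypothetical stand-in for `nPYc.enumerations.SampleType`, used so that the example runs without the toolbox, and `sample_keep_mask` is likewise an illustrative helper.

```python
from enum import Enum

class SampleTypeSketch(Enum):
    """Hypothetical stand-in for nPYc.enumerations.SampleType."""
    StudySample = 'Study Sample'
    StudyPool = 'Study Pool'
    ExternalReference = 'External Reference'

def sample_keep_mask(sample_types, passed_qc, keep):
    """Keep a sample only if its type is wanted and it passed the analytical checks."""
    return [(s_type in keep) and ok for s_type, ok in zip(sample_types, passed_qc)]

types = [SampleTypeSketch.StudySample, SampleTypeSketch.StudyPool,
         SampleTypeSketch.ExternalReference, SampleTypeSketch.StudySample]
passed_qc = [True, True, True, False]
keep_mask = sample_keep_mask(types, passed_qc,
                             keep={SampleTypeSketch.StudySample, SampleTypeSketch.StudyPool})
print(keep_mask)
```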

### Permanently exclude masked samples/features

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
If happy with the samples and features masked for exclusion, apply these exclusions (permanently remove samples/features from the dataset) using the 'applyMasks' function.
    </font>
</div>

In [None]:
datasetCorrected.applyMasks()

# 4. Analytical Multivariate Quality Control

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        The analytical multivariate report provides visualisations summarising the largest sources of variance in the dataset (by PCA) with particular emphasis on any potential analytical sources. Plots include:
        <ul>
        <li>Model statistics.</li>
        <li>Scores plots. These provide insight into the relationships between samples, for example, consistency of the QC samples, sample outliers, etc.</li>
        <li>Loadings plots. These provide insight into the features with the largest variance in the dataset.</li>
        <li>Potential associations with analytical parameters. The correlation (for continuous metadata) or Kruskal-Wallis test (for categorical metadata) is calculated between each metadata field and each set of PCA scores generated, and any significant associations are flagged.</li>
        <li>The default scaling is unit variance ('scaling=1'), but other scaling options are available (0 for mean centring; 0.5 for Pareto scaling).</li>
        </ul>
    </font>
</div>

In [None]:
PCAmodelAnalytical = nPYc.multivariate.exploratoryAnalysisPCA(datasetCorrected, scaling=1)
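The three scaling options listed above can be read as an exponent applied to each feature's standard deviation after mean centring. A minimal sketch under that assumption, using the sample standard deviation. `scale` is a hypothetical helper, not the function the toolbox calls.

```python
import numpy as np

def scale(X, power=1.0):
    """Mean-centre each column, then divide by its standard deviation raised to `power`.

    power=0 -> mean centring only, power=0.5 -> Pareto, power=1 -> unit variance.
    """
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    return centred / X.std(axis=0, ddof=1) ** power

# Toy matrix: the second feature is ten times larger than the first
X = np.array([[1., 10.], [3., 30.], [5., 50.]])
for power in (0, 0.5, 1):
    print(power, scale(X, power).tolist())
```

With unit variance scaling both toy features end up on the same scale, whereas mean centring alone leaves the larger feature dominant.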

In [None]:
nPYc.reports.multivariateQCreport(datasetCorrected, PCAmodelAnalytical)

### OPTIONAL: generate interactive scores and loadings plots

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        Interactive scores plot:
        <br/><br/>
        For example, plot the scores for PCA components 1 vs. 2 and colour by Class.
    </font>
</div>

In [None]:
data = nPYc.plotting.plotScoresInteractive(datasetCorrected, PCAmodelAnalytical, 'Class', components=[1, 2])
iplot(data)

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        Interactive loadings plot:
        <br/><br/>
        For example, plot the loadings for PCA component 2.
    </font>
</div>

In [None]:
data = nPYc.plotting.plotLoadingsInteractive(datasetCorrected, PCAmodelAnalytical, component=2)
iplot(data)

# 5. Finalise & Export Dataset

### Check final dataset output:

In [None]:
nPYc.reports.generateReport(datasetCorrected, 'final report', pcaModel=PCAmodelAnalytical)

### Export data:

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
    Export a unified csv file, which contains the intensity data (one row per sample, one column per variable), alongside all sample and feature metadata (left columns and top rows respectively).
    <br/><br/>
    Output the final report to provide a summary of the dataset.
    </font>
</div>

In [None]:
# Create the save directory if it does not already exist
os.makedirs(saveDir, exist_ok=True)

In [None]:
datasetCorrected.exportDataset(saveFormat='UnifiedCSV', destinationPath=saveDir)

In [None]:
nPYc.reports.generateReport(datasetCorrected, 'final report', pcaModel=PCAmodelAnalytical, destinationPath=saveDir)