# Preprocessing and Quality Control of targeted NMR Data (Bruker IVDr) with the nPYc-Toolbox

This tutorial demonstrates how to use the targeted NMR data processing modules of the nPYc-Toolbox to import outputs from the Bruker IVDr targeted quantification methods, perform basic quality control, and generate a final high-quality dataset ready for data modeling.

This tutorial should be performed after **Preprocessing and quality control of NMR data with nPYc-Toolbox**, as we will exclude quantification outputs from spectra which failed basic quality control checks. 

Details of how to install the required dependencies and set up your computing environment, together with full documentation for the nPYc-Toolbox, can be found on [Read the Docs](https://npyc-toolbox.readthedocs.io/en/latest/index.html).

The dataset used in this example (DEVSET U 1D NMR) comprises six samples of pooled human urine, aliquoted and independently prepared and measured by 1H NMR spectroscopy. Each source sample was separately prepared and assayed thirteen times. A pooled QC sample (study reference, SR) and an independent external reference (long-term reference, LTR) of a comparable matrix were also acquired to assist in assessing analytical precision. See the MetaboLights study [MTBLS694](https://www.ebi.ac.uk/metabolights/MTBLS694) for details.

# 1. Import the nPYc-Toolbox and Configure the Jupyter Notebook

In [None]:
import sys
import pyChemometrics
toolboxPath = '/home/gd2212/PycharmProjects/nPYc-Toolbox/'
sys.path.append(toolboxPath)

In [None]:
# Import the nPYc-Toolbox
import nPYc

# Import enumerations for sample type
from nPYc.enumerations import VariableType, DatasetLevel, AssayRole, SampleType

# Import normalisation objects for data normalisation
from nPYc.utilities.normalisation import NullNormaliser, TotalAreaNormaliser, ProbabilisticQuotientNormaliser

# Import matplotlib plotting, configure the Jupyter notebook to plot inline
import matplotlib.pyplot as plt
%matplotlib inline

# Set up plotly to work in offline mode with the notebook
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

# Set up the pandas numerical display precision
import pandas
pandas.set_option('display.float_format', '{:.2f}'.format)

# 2. Import and Preprocess targeted NMR Data (Bruker IVDr)

The NMR DevSet Dataset is located in 'DEVSET U 1D NMR raw data files'. This folder contains 94 directories. 
Each directory contains a .xml file which was obtained by extracting metabolite concentrations from the ‘noesygppr1d’ pulse sequence spectrum with the Bruker IVDr "BI-QUANT-UR" software. 
We will start by loading these .xml files into an nPYc-Toolbox [TargetedDataset object](https://npyc-toolbox.readthedocs.io/en/latest/objects.html).

The "dataPath" parameter sets the location of the NMR raw data.

In [None]:
dataPath = './DEVSET U 1D NMR raw data files'

The “sop” parameter points to a configuration file (encoded in JSON format) which contains a set of pre-set parameters to use during data import. See [Configuration Files](https://npyc-toolbox.readthedocs.io/en/latest/configuration/configuration.html) for full details.
For an exhaustive list of all the parameters in the SOP and their function, see Table 6 in [Built-in Configuration SOPs](https://npyc-toolbox.readthedocs.io/en/latest/configuration/Built-in-Configuration-SOPs.html).

The nPYc-Toolbox contains three configuration files for Bruker IVDr methods: 'BrukerQuant-UR' (urine small molecules), 'BrukerBI-QUANT-PS' (serum/plasma small molecules) and 'BrukerBI-LISA' (serum/plasma lipoproteins).
For this urine biofluid dataset, we will load data from the 'BI-QUANT-UR' IVDr method using the 'BrukerQuant-UR' configuration.

In [None]:
sop = 'BrukerQuant-UR'

The .xml files from the BI-QUANT-UR method contain values in absolute and relative (normalised to creatinine) units.
The 'unit' parameter selects which of these to import, either 'mmol/L' or 'mmol/mol Crea'. We will extract only the absolute units in this tutorial.
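
The relationship between the two units is simple arithmetic, sketched below. This is only an illustration of the unit conversion, not a description of how the Bruker software computes its relative values; the function name and the concentrations are hypothetical.

```python
def to_mmol_per_mol_creatinine(conc_mmol_per_l, creatinine_mmol_per_l):
    """Convert an absolute concentration (mmol/L) to mmol per mol of creatinine."""
    if creatinine_mmol_per_l <= 0:
        raise ValueError("creatinine concentration must be positive")
    # Creatinine in mol/L is creatinine in mmol/L divided by 1000,
    # so the ratio picks up a factor of 1000.
    return 1000.0 * conc_mmol_per_l / creatinine_mmol_per_l


# Hypothetical values, not taken from the DEVSET data
metabolite = 2.0   # mmol/L
creatinine = 10.0  # mmol/L
relative = to_mmol_per_mol_creatinine(metabolite, creatinine)
print(relative)
```

Because the relative values depend on the creatinine measurement of each sample, the two unit systems can rank samples differently.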


The following line creates an object representing the targeted dataset.


In [None]:
IVDrTargetedData = nPYc.TargetedDataset(dataPath, fileType='Bruker Quantification', sop=sop, unit='mmol/L')


Each nPYc Dataset object contains a name that can be changed as shown in the next cell. 
This name will be used in the summary and visualisation reports and in the file names of the exported data.


In [None]:
IVDrTargetedData.name = 'nPYc Targeted NMR Tutorial dataset'


# 3. Import Sample Metadata and Match to Acquired Data

The default way to add sample metadata is to prepare a CSV file which follows the set of conventions as described in [Sample Metadata](https://npyc-toolbox.readthedocs.io/en/latest/samplemetadata.html) and match it with the acquired data using the 'addSampleInfo' method.

Although optional, this is recommended in order to make optimal use of the quality control features and visualisations provided by the nPYc-Toolbox.

An example CSV file, 'DEVSET U 1D NMR IVDr Basic CSV.csv', is provided:

In [None]:
IVDrTargetedData.addSampleInfo(descriptionFormat='Basic CSV', filePath='DEVSET U 1D NMR IVDr Basic CSV.csv')

As described in [Datasets](https://npyc-toolbox.readthedocs.io/en/latest/objects.html), the measured data (here, the quantified concentrations), sample metadata and feature metadata can be inspected directly using:

```
dataset.intensityData
dataset.sampleMetadata
dataset.featureMetadata
```

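
As a rough mental model, the three attributes can be thought of as a samples-by-features matrix with one metadata table per axis. The sketch below uses toy numpy/pandas stand-ins with made-up values and assumes that rows of the intensity matrix align with rows of the sample metadata, and columns with rows of the feature metadata.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the three attributes (all values are made up)
intensity_data = np.array([[1.2, 0.0, 3.4],
                           [1.1, 0.0, 3.9]])  # samples x features
sample_metadata = pd.DataFrame({'Sample File Name': ['run/10', 'run/20']})
feature_metadata = pd.DataFrame({'Feature Name': ['Creatinine', 'Citric acid', 'Alanine']})

# One metadata row per sample, and one per feature
n_samples, n_features = intensity_data.shape
rows_match = n_samples == len(sample_metadata)
cols_match = n_features == len(feature_metadata)

# Wrapping the matrix in a labelled DataFrame makes it easier to inspect
labelled = pd.DataFrame(intensity_data,
                        index=sample_metadata['Sample File Name'],
                        columns=feature_metadata['Feature Name'])
print(labelled)
```
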
# 4. Generate Quality Control Reports

The nPYc-Toolbox offers a series of reports: pre-set visualisations comprising text, figures and tables that describe and summarise the characteristics of the dataset, and help the user assess the overall impact of quality control decisions (i.e., excluding samples or features and changing filtering criteria).

For full details see [Reports](https://npyc-toolbox.readthedocs.io/en/latest/reports.html).

### Sample Summary Report

The first report can be used to check the expected samples against those acquired, in terms of numbers, sample type, and any samples either missing from acquisition or not recorded in the sample metadata CSV file.

In [None]:
nPYc.reports.generateReport(IVDrTargetedData, 'sample summary')

For example, for this dataset there are two samples with entries in the sample metadata CSV file, but missing from acquisition. This allows the user to quickly assess the completeness of the dataset and, for example, investigate why these samples were missing.

Based on the corresponding entries in the sample metadata CSV file, the acquired samples are categorised into different types, where 'Study Samples' comprise the main core of the study, and the others are acquired for specific roles in characterising data quality. The main QC samples here are the 'Study Reference' samples, which comprise a pool of study samples and are used to assess analytical stability across the run. For interest we have also included some 'Long Term Reference' samples (a QC sample external to the study) and a blank. For full details see [Sample and study design nomenclature](https://npyc-toolbox.readthedocs.io/en/latest/samplemetadata.html#Sample-and-study-design-nomenclature).

From the 'sample summary' report, it can be seen that corresponding information is missing from the sample metadata CSV file for one sample. This sample is listed as having unknown type and missing information. As above, this allows the user to quickly determine whether information should be added to the sample metadata CSV file for this sample, or whether its quantification results should be excluded from the final dataset (see '5. Exclude Samples and/or Features if Required' for details).
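
The two kinds of mismatch described above (acquired without metadata, and listed in the metadata but not acquired) amount to a comparison of two lists of identifiers. A minimal sketch with pandas follows; the file names are hypothetical, and this is not the toolbox's own matching code.

```python
import pandas as pd

# Hypothetical identifiers: what was acquired vs. what the metadata CSV lists
acquired = pd.DataFrame({'Sample File Name': ['run/10', 'run/20', 'run/30']})
metadata = pd.DataFrame({'Sample File Name': ['run/10', 'run/20', 'run/40'],
                         'SampleType': ['Study Sample', 'Study Pool', 'Study Sample']})

# An outer merge with indicator=True labels each row by where it was found
merged = acquired.merge(metadata, on='Sample File Name', how='outer', indicator=True)

# Acquired, but no entry in the metadata file
no_metadata = merged.loc[merged['_merge'] == 'left_only', 'Sample File Name'].tolist()
# Listed in the metadata file, but never acquired
not_acquired = merged.loc[merged['_merge'] == 'right_only', 'Sample File Name'].tolist()

print(no_metadata, not_acquired)
```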

### Feature Summary Report

The feature summary report provides visualisations summarising the quality of the quantified features.

In order, for an NMR targeted Dataset these consist of:
- Tables containing summary information about each of the quantified features (Name, and if available, reference ranges). These are further broken down by the type of quantification (not applicable for targeted NMR datasets).
- The relative standard deviation (RSD) observed for each feature per Sample Type/Assay Role (Figure 2)
- Boxplots showing the distribution across the different sample types for each chemical compound (Figure 3)
- A table with the RSD values for each feature, calculated separately for the Study Pool, External Reference and Study Sample types (Table 2)
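
To make the RSD table concrete, the sketch below computes RSD (100 × standard deviation / mean) per feature within each sample type on a toy table. The values are made up, and the use of the sample standard deviation (`ddof=1`) is an assumption; the toolbox may use a different convention.

```python
import pandas as pd

# Toy concentrations (made-up values) for two features across two sample types
data = pd.DataFrame({
    'SampleType': ['Study Pool', 'Study Pool', 'Study Pool',
                   'Study Sample', 'Study Sample', 'Study Sample'],
    'Creatinine': [10.0, 10.0, 10.0, 5.0, 10.0, 15.0],
    'Alanine': [0.9, 1.0, 1.1, 0.5, 1.0, 1.5],
})

grouped = data.groupby('SampleType')

# RSD in percent for every feature within each sample type
rsd_table = 100.0 * grouped.std(ddof=1) / grouped.mean()
print(rsd_table)
```

A low RSD in the pooled QC samples alongside a higher RSD in the study samples suggests that a feature's variation is mostly biological rather than analytical.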


In [None]:
nPYc.reports.generateReport(IVDrTargetedData, 'feature summary')

For a targeted NMR dataset, sample exclusions are based on assessment of the raw, full-resolution NMR spectra. Further down, we will exclude one of the samples highlighted in the 'Preprocessing and quality control of NMR data with the nPYc-Toolbox' tutorial as failing the solvent suppression check.

The boxplots can be used to identify features with a very high proportion of zeros. We will show how to exclude these in the 'Excluding Specific Samples and/or Features' section.
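
The proportion of zeros can also be checked numerically. The sketch below works on a toy matrix; the feature names and the 0.7 cut-off are hypothetical and are not toolbox defaults.

```python
import numpy as np

# Toy samples x features matrix (made-up values)
intensity = np.array([[1.2, 0.0, 0.0],
                      [1.1, 0.0, 0.4],
                      [1.3, 0.0, 0.0],
                      [0.9, 0.0, 0.0]])
feature_names = ['Creatinine', 'Hypothetical A', 'Hypothetical B']

# Fraction of samples in which each feature is exactly zero
zero_fraction = (intensity == 0).mean(axis=0)

threshold = 0.7  # hypothetical cut-off, chosen for illustration only
mostly_zero = [name for name, frac in zip(feature_names, zero_fraction)
               if frac > threshold]
print(mostly_zero)
```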



### Exporting Reports

By default, all reports are output directly to the notebook (as above). If HTML copies are required, these can be saved automatically by adding the optional input argument "destinationPath".

For example, to save to the path defined in "saveDir":  

In [None]:
saveDir = '/path to save outputs'
nPYc.reports.generateReport(IVDrTargetedData, 'feature summary', destinationPath=saveDir)

# 5. Exclude Samples and/or Features if Required

Dataset objects contain two internal 'mask' vectors, the 'sampleMask' and 'featureMask', which store whether each sample or feature, respectively, should be used when calculating QC metrics, shown in the reports and finally exported. See [Sample and Feature Masks](https://npyc-toolbox.readthedocs.io/en/latest/objects.html#Sample-and-Feature-Masks).

There are several functions which modify these masks, which are useful at various stages of quality control and in preparing a final dataset for export.
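
The idea behind the masks can be sketched with plain numpy boolean arrays. This is a conceptual illustration under the assumption that masks are boolean vectors aligned with the rows and columns of the data matrix; it does not reproduce the toolbox's internals.

```python
import numpy as np

intensity = np.arange(12, dtype=float).reshape(4, 3)  # 4 samples x 3 features

# Masks start out all True: everything is included
sample_mask = np.ones(4, dtype=bool)
feature_mask = np.ones(3, dtype=bool)

# Flag one sample and one feature for exclusion
sample_mask[2] = False
feature_mask[0] = False

# While only masked, the underlying data is untouched and the mask is reversible
visible = intensity[sample_mask][:, feature_mask]

# "Applying" the masks discards the masked rows and columns for good,
# after which the masks are reset to match the smaller matrix
intensity = intensity[np.ix_(sample_mask, feature_mask)]
sample_mask = np.ones(intensity.shape[0], dtype=bool)
feature_mask = np.ones(intensity.shape[1], dtype=bool)
print(intensity.shape)
```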

### Mask Samples

The 'updateMasks' function can be used to automatically mask samples (and/or features).

For now, we do not mask any features, setting "filterFeatures=False".

By default (running the line below), samples which fail quality control (as described above) will be automatically masked, alongside any samples of unknown 'SampleType' and/or 'AssayRole':

```
IVDrTargetedData.updateMasks(filterSamples=True, filterFeatures=False)
```

Here, by setting preferences with the 'sampleTypes' argument, any other samples which are not required can also be masked. In this example, we limit our dataset to study samples ('SampleType.StudySample') and study reference samples ('SampleType.StudyPool') only, excluding the long term reference (which has 'SampleType.ExternalReference') and the sample blank (which has 'SampleType.ProceduralBlank'). For full details see [Sample and study design nomenclature](https://npyc-toolbox.readthedocs.io/en/latest/samplemetadata.html#Sample-and-study-design-nomenclature) and [Enumerations](https://npyc-toolbox.readthedocs.io/en/latest/enumerations.html).

In [None]:
IVDrTargetedData.updateMasks(sampleTypes=[SampleType.StudySample, SampleType.StudyPool], filterFeatures=False)

The results of masking can be summarised using the 'sample summary' report.

In [None]:
nPYc.reports.generateReport(IVDrTargetedData, 'sample summary')

### Mask Features

This can be done automatically using the 'updateMasks' function.

As the sample mask has been specified in the previous section, here we set "filterSamples=False".

In [None]:
IVDrTargetedData.updateMasks(filterSamples=False, filterFeatures=True)


### Excluding Specific Samples and/or Features 

The 'updateMasks' function masks samples or features that do not meet specific criteria. In addition, the nPYc-Toolbox contains two methods to mask specific samples or features directly, 'excludeSamples' and 'excludeFeatures' respectively; see [Sample and Feature Masks](https://npyc-toolbox.readthedocs.io/en/latest/objects.html#Sample-and-Feature-Masks).

Each of these functions takes three input arguments; firstly, a list of sample or feature identifiers; secondly, the name of the column in 'sampleMetadata' (for 'excludeSamples') or 'featureMetadata' (for 'excludeFeatures') where these identifiers can be found; and finally an optional message as to why these samples or features have been flagged for exclusion.
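
Conceptually, excluding by identifier means looking the identifiers up in the named metadata column and setting the matching mask entries to False. The helper below is a hypothetical stand-in written for illustration, not the toolbox's implementation, and the file names are made up.

```python
import numpy as np
import pandas as pd


def exclude_by_identifier(metadata, mask, identifiers, on):
    """Return a new mask with matching rows set False, plus any identifiers not found."""
    known = set(metadata[on])
    hits = metadata[on].isin(identifiers).to_numpy()
    new_mask = np.asarray(mask, dtype=bool) & ~hits
    not_found = [ident for ident in identifiers if ident not in known]
    return new_mask, not_found


# Hypothetical sample metadata and an all-True starting mask
sample_metadata = pd.DataFrame({'Sample File Name': ['run/10', 'run/20', 'run/30']})
mask = np.ones(len(sample_metadata), dtype=bool)

new_mask, not_found = exclude_by_identifier(
    sample_metadata, mask, ['run/20', 'run/99'], on='Sample File Name')
print(new_mask, not_found)
```

Reporting identifiers that were not found is a design choice in this sketch; it guards against typos silently excluding nothing.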

The IVDr algorithms will attempt to generate a .xml file for all NMR raw data directories. However, if the high resolution NMR spectrum from which the quantifications were derived is not of high quality, we recommend excluding the quantification results from the TargetedDataset.

In the next cell, we use the 'excludeSamples' method to exclude a sample that failed quality control in the 'Preprocessing and quality control of NMR data with the nPYc-Toolbox' tutorial, and another sample for which no metadata was provided (which was highlighted in the 'Sample Summary' report).

In [None]:
IVDrTargetedData.excludeSamples(['PipelineTest_Urine_300K_RFT_290118/580'], on='Sample File Name', message='Incomplete solvent suppression')
IVDrTargetedData.excludeSamples(['PipelineTest_Urine_300K_RFT_290118/930'], on='Sample File Name', message='No metadata provided')



Finally, we will use the 'excludeFeatures' method to exclude all features containing only zero values.

In [None]:
zero_value_features = IVDrTargetedData.featureMetadata[IVDrTargetedData.intensityData.sum(axis=0) == 0]['Feature Name']
IVDrTargetedData.excludeFeatures(zero_value_features, on='Feature Name', message='Features with only zero values')


### Permanently Exclude Masked Samples/Features

Once satisfied with the sample and feature masks, exclusions can be applied (permanently removed from the dataset) using the 'applyMasks' function.

This method should be used only when it is certain that the masked features and samples are to be removed, as excluded data can only be recovered by re-importing the dataset.

Before masks have been applied, they can be re-set to include all samples/features using:

```
IVDrTargetedData.initialiseMasks() 
```

For details see [Sample and Feature Masks](https://npyc-toolbox.readthedocs.io/en/latest/objects.html#Sample-and-Feature-Masks).

In [None]:
IVDrTargetedData.applyMasks()


# 6. Analytical Multivariate Quality Control

The nPYc-Toolbox provides the capacity to generate a principal component analysis (PCA) model of the data (via the pyChemometrics module), and subsequently to use this to assess data quality, identify potential sample and feature outliers, and determine any potential analytical associations with the main sources of variance in the data (see [Multivariate Analysis](https://npyc-toolbox.readthedocs.io/en/latest/Multivariate-Analysis.html)).

A PCA model can be generated using 'exploratoryAnalysisPCA', and there are a number of parameters which can be optimised depending on the dataset (see [PCA Model](https://npyc-toolbox.readthedocs.io/en/latest/Multivariate-Analysis.html#PCA-Model) for full details).

One key parameter is 'scaling', which divides each column in the data matrix by its respective standard deviation raised to the power of the scaling parameter. This parameter can range in value between 0 and 1, and recommended values are 0 for mean centering only, 0.5 for Pareto scaling and 1 for unit variance (UV) scaling. The outcome of the PCA model will vary based on the scaling method selected, and different scaling functions can be appropriate depending on the data itself and the question being asked of the data; see [van den Berg et al. 2006](https://www.ncbi.nlm.nih.gov/pubmed/16762068).

The default scaling is unit variance ("scaling=1"), which scales every variable to have a variance of one, and thus all variables (despite their different magnitudes) become equally important in the model.
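
The effect of the scaling parameter can be sketched in a few lines of numpy. This follows the description above (mean centre, then divide by the standard deviation raised to the power of the scaling parameter); the function name, the use of `ddof=1` and the guard for constant columns are assumptions made for this illustration, not the pyChemometrics implementation.

```python
import numpy as np


def centre_and_scale(X, scaling=1.0):
    """Mean-centre each column, then divide by its standard deviation ** scaling."""
    X = np.asarray(X, dtype=float)
    centred = X - X.mean(axis=0)
    std = X.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant columns unscaled (assumed guard)
    return centred / std ** scaling


# Two made-up features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

mean_centred = centre_and_scale(X, scaling=0)  # centering only
pareto = centre_and_scale(X, scaling=0.5)      # Pareto scaling
unit_var = centre_and_scale(X, scaling=1)      # unit variance scaling

print(unit_var)
```

With `scaling=1` both columns end up with the same spread, whereas with `scaling=0` the second column would dominate any variance-based model.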

Each model is cross-validated using 7-fold cross-validation, and the recommended number of principal components is estimated automatically based on two criteria. When either one of these is met, no more components will be added and the PCA model will be returned. These criteria are:
1. 'minQ2': Q2 is the variance predicted by each component (from cross-validation), when adding a component does not improve Q2 by at least this value (default "minQ2=0.05") then no more components will be added.
2. 'maxComponents': this defines the maximum number of components (default "maxComponents=10") returned by the model (regardless of Q2 increases).
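
One way to read the two stopping criteria is sketched below. The function is hypothetical and operates on a pre-computed list of cumulative Q2 values (made up here); the exact rule used by pyChemometrics, including how edge cases are handled, may differ.

```python
def choose_n_components(cumulative_q2, min_q2=0.05, max_components=10):
    """Pick the number of components from cumulative cross-validated Q2 values.

    cumulative_q2[i] is assumed to be the Q2 of a model with i + 1 components.
    """
    n_components = 0
    previous_q2 = 0.0
    for q2 in cumulative_q2[:max_components]:
        # Stop as soon as an extra component adds less than min_q2
        if q2 - previous_q2 < min_q2:
            break
        n_components += 1
        previous_q2 = q2
    return n_components


# Made-up Q2 values: the third component adds only 0.03, so two are kept
n_by_q2 = choose_n_components([0.40, 0.55, 0.58, 0.70])

# Every component helps here, so the cap on the number of components applies
n_by_cap = choose_n_components([0.20, 0.40, 0.60, 0.80], max_components=3)

print(n_by_q2, n_by_cap)
```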

Again these parameters can be amended by adding them as input arguments to 'exploratoryAnalysisPCA'.

In [None]:
PCAmodel = nPYc.multivariate.exploratoryAnalysisPCA(IVDrTargetedData, scaling=1, withExclusions=True)

The analytical multivariate report provides visualisations summarising the largest sources of variance in the dataset (from the PCA model generated) with particular emphasis on any potential analytical sources, as defined in 'analyticalMeasurements' in the [Built-in Configuration SOPs](https://npyc-toolbox.readthedocs.io/en/latest/configuration/builtinSOPs.html).

These consist of:
- Scree plot of variance (Figure 1)
- Scores plots coloured by sample type (Figure 2)
- Strong sample outliers (Figure 3)
- DmodX sample outliers (Figure 4)
- Loadings plots (Figure 5)
- Distribution of analytical parameters (Figure 6)
- Heatmap of potential associations between analytical parameters and the main sources of variance (Figures 7 and 8)
- Scores plots coloured by analytical parameters with potential association (Figures 9-11)

For full details of each of the above, see [Multivariate Analysis Report](https://npyc-toolbox.readthedocs.io/en/latest/Multivariate-Analysis.html#Multivariate-Analysis-Report).

In [None]:
nPYc.reports.multivariateReport(IVDrTargetedData, PCAmodel, withExclusions=True)

# 7. Finalise and Export Dataset

Datasets can be exported in a variety of formats with the 'exportDataset' method (see [Exporting Data](https://npyc-toolbox.readthedocs.io/en/latest/exportingdata.html)).

By default, datasets are exported to the current working directory. To export to a different location, add the optional input argument "destinationPath".

For example, to save to the path defined in "saveDir":  

```
saveDir = '/path to save outputs'
IVDrTargetedData.exportDataset(saveFormat='UnifiedCSV', destinationPath=saveDir)
```

To export a single CSV file, which contains a row for every sample and a column for every feature, alongside all of the sample and feature specific metadata, set "saveFormat='UnifiedCSV'".

In [None]:
IVDrTargetedData.exportDataset(saveFormat='UnifiedCSV')

In addition, the 'final report' provides a document summarising the contents of the dataset alongside pertinent visualisations from the 'feature summary' and multivariate reports:

In [None]:
nPYc.reports.generateReport(IVDrTargetedData, 'final report', pcaModel=PCAmodel, destinationPath='.')