# Preprocessing and quality control of NMR data with the nPYc-Toolbox

This tutorial demonstrates how to use the NMR data processing modules of the nPYc-Toolbox to import NMR data, perform basic preprocessing and quality control, and output a final high-quality dataset ready for data modeling.

Details of how to install the required dependencies and set up your computing environment can be found in 'document.txt', and full documentation for the nPYc-Toolbox can be found on [Read the Docs](https://npyc-toolbox.readthedocs.io/en/latest/index.html).

The nPYc-Toolbox has been developed based on the quality control criteria previously described in [Dona et al. 2014](https://www.ncbi.nlm.nih.gov/pubmed/25180432).

The dataset used in this example (NMR DevSet Dataset) comprises six samples of pooled human urine, aliquoted and independently prepared and measured by 1H NMR spectroscopy. Each source sample was separately prepared and assayed thirteen times. A pooled QC sample and an independent external reference of a comparable matrix were also acquired to assist in assessing analytical precision. See the MetaboLights study [MTBLS694](https://www.ebi.ac.uk/metabolights/MTBLS694).

# 1. Import the nPYc-Toolbox and configure Jupyter notebook

In [None]:
# Import the nPYc-Toolbox
import nPYc

# Import enumerations for sample type
from nPYc.enumerations import VariableType, DatasetLevel, AssayRole, SampleType

# Import normalisation objects for data normalisation
from nPYc.utilities.normalisation import NullNormaliser, TotalAreaNormaliser, ProbabilisticQuotientNormaliser

# Import matplotlib plotting, configure the Jupyter notebook to plot inline
import matplotlib.pyplot as plt
%matplotlib inline

# Set up plotly to work in offline mode with the notebook
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)

# 2. Import and preprocess NMR data

The first step is to import the raw NMR data (Bruker format) into an nPYc-Toolbox [nmrData object](https://npyc-toolbox.readthedocs.io/en/latest/objects.html#nmrdataset)

The "rawDataPath" parameter sets the location of the NMR raw data.

The NMR DevSet Dataset is located xxxxxxx. This folder contains 93 directories, each corresponding to a spectrum acquired with the ‘noesygppr1d’ pulse sequence. The Fourier transform, apodization and phasing have already been performed with the vendor software (i.e., TopSpin).

The “sop” parameter points to a file which contains a set of parameters to use during data import and pre-processing. Each SOP file is encoded in JSON format. For full details see [Table 6 SOP parameters for all NMRDataset objects](https://npyc-toolbox.readthedocs.io/en/latest/configuration/configurationSOPs.html)

The nPYc-Toolbox contains two default SOP files, GenericNMRurine and GenericNMRblood. These contain the recommended parameters for import and quality control of human urine and plasma/serum biofluids, respectively. Since this is a urine dataset, we will use the GenericNMRurine SOP file.

In [None]:
rawDataPath = '/path to NMR raw data files'
pulseProgram = 'noesygppr1d'
sop = 'GenericNMRurine'

The following line triggers the pre-processing of the NMR spectra and creates an object representing the dataset:

In [None]:
nmrData = nPYc.NMRDataset(rawDataPath, pulseProgram=pulseProgram, sop=sop)

Any of the arguments present in the SOP file can be overridden by passing them directly to the data import command, without having to modify or generate a SOP file. For example, to import all the ‘noesygppr1d’ spectra but interpolate them to a higher resolution, the `variableSize` argument can be overridden in the following manner:

```
nmrData = nPYc.NMRDataset(rawDataPath, pulseProgram=pulseProgram, sop=sop, variableSize=64000)
```
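
The override behaviour can be pictured as a dictionary merge: defaults are read from the JSON SOP, and any keyword arguments then replace the matching entries. The sketch below is illustrative only. `load_sop_with_overrides` and `exampleParameter` are hypothetical names, and the values are placeholders rather than the contents of the real SOP file or the toolbox's own loading code.

```python
import json

# Placeholder SOP text; the real SOP file holds more parameters and its
# values may differ from the ones shown here.
EXAMPLE_SOP_JSON = '{"variableSize": 20000, "exampleParameter": "default"}'


def load_sop_with_overrides(sop_text, **overrides):
    """Return SOP parameters with keyword overrides applied on top (hypothetical helper)."""
    params = json.loads(sop_text)
    params.update(overrides)  # explicitly passed arguments win over SOP defaults
    return params


params = load_sop_with_overrides(EXAMPLE_SOP_JSON, variableSize=64000)
print(params["variableSize"])
```

Parameters that are not passed explicitly keep the value read from the SOP text.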

Each nPYc Dataset object has a name that can be changed as shown in the next cell. This name is used in the summary and visualization reports and in the exported data.

In [1]:
nmrData.name = 'nPYc NMR Tutorial dataset'

# 3. Import sample metadata and match to acquired NMR data

The default way to add sample metadata to an NMRDataset object is to prepare a CSV file that follows the toolbox conventions for metadata CSV files and pass it to the `addSampleInfo` command shown at the end of this section.

More information about the CSV file format used for linking metadata, as well as the study design nomenclatures used in the nPYc-Toolbox can be found in the toolbox documentation (https://npyc-toolbox.readthedocs.io/en/latest/nomenclature.html).

By default, all imported spectra are considered ‘equal’ in terms of their role in the assay/overall experiment, and are assumed to correspond to a regular study sample. But in many instances, including the dataset we are using in this tutorial, a series of quality control samples can be acquired as well. In this case, we have samples with different roles in the overall assay QC process.

Although it is not necessary to provide extra information about the samples to use the nPYc-Toolbox to process and QC an NMR dataset, the extra information will enhance the quality of the reports generated in the next section. It is also recommended to add metadata related to run order, batch effects and other potential analytical confounders, so we can assess their influence in the multivariate reports.


In [None]:
nmrData.addSampleInfo(descriptionFormat='Basic CSV', filePath='/path to sample metadata file/PipelineTest 1D NMR Basic CSV.csv')
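
Before calling `addSampleInfo`, it can help to check that the metadata file parses and that each row names a distinct spectrum. The sketch below uses only the standard library. The column names ('Sample File Name', 'Run Order') and the rows are assumptions made for illustration, and `read_metadata` is a hypothetical helper; consult the toolbox documentation for the columns the 'Basic CSV' format actually requires.

```python
import csv
import io

# Illustrative metadata; column names and values are assumptions, not the
# contents of the tutorial's CSV file.
EXAMPLE_CSV = """Sample File Name,Run Order
sample_A/10,1
sample_B/20,2
sample_C/30,3
"""


def read_metadata(text, key_column="Sample File Name"):
    """Parse CSV text and return its rows, checking the key column is present and unique."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if not rows or key_column not in rows[0]:
        raise ValueError(f"missing column: {key_column}")
    keys = [row[key_column] for row in rows]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate sample names found")
    return rows


rows = read_metadata(EXAMPLE_CSV)
print(len(rows), "metadata rows read")
```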

### Dataset summaries

The sample metadata, feature metadata and raw measurements can be inspected directly:

In [None]:
nmrData.sampleMetadata

In [None]:
nmrData.featureMetadata

In [None]:
nmrData.intensityData

The sample and feature summary reports are generated in the next section.

# 4. Sample & Feature Filtering

### Generate sample summary report

By default, all summary reports (with the exception of the final report) are output only to this notebook. The notebook (including outputs) can be saved using File > Save and Checkpoint. However, if HTML copies of any reports are required, these can be saved automatically to the save directory by adding the optional input argument destinationPath=saveDir.

In [None]:
saveDir = '/path to save outputs'

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        This summary can be used to check the expected samples against those acquired (for example, sample numbers, sample type, samples missing from acquisition or lacking metadata information).
    </font>
</div>

In [None]:
nPYc.reports.generateReport(nmrData, 'sample summary')

### Generate feature summary report

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        The feature summary report provides visualisations summarising the quality of the dataset and highlighting any problematic areas. These include: a chemical shift calibration check, the distribution of peak widths, and the distribution of intensities at baseline and surrounding the residual water peak.
        <br/><br/>
        Any samples which fall outside the pre-set thresholds for any of these analytical parameters are flagged (final table) and subsequently marked for exclusion from the dataset.
    </font>
</div>

In [None]:
nPYc.reports.generateReport(nmrData,'feature summary')
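
The flagging step amounts to comparing each sample's quality metric with a threshold. A minimal sketch follows, assuming a single metric (peak width in Hz) and a made-up threshold. The sample names, values and `flag_failures` helper are hypothetical, and the threshold is not the one stored in the SOP file.

```python
# Hypothetical per-sample peak widths in Hz; values are invented for illustration.
peak_widths_hz = {"sample_1": 0.9, "sample_2": 1.1, "sample_3": 2.3}

# Assumed threshold for this sketch only; the toolbox reads its own from the SOP.
MAX_PEAK_WIDTH_HZ = 1.4


def flag_failures(widths, threshold):
    """Return the sorted names of samples whose peak width exceeds the threshold."""
    return sorted(name for name, width in widths.items() if width > threshold)


failed = flag_failures(peak_widths_hz, MAX_PEAK_WIDTH_HZ)
print(failed)
```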

### Filter samples

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        Remove samples which fail based on any of the above analytical criteria by applying the sample masks.
        <br/><br/>
        At this point we can also exclude any other samples which are not required by setting preferences with the 'sampleTypes' argument. In this example, we limit our dataset to study samples and quality control samples only.
        <br/><br/>
        To keep all features, set the 'filterFeatures' argument to False.
    </font>
</div>

In [None]:
nmrData.updateMasks(sampleTypes=[SampleType.StudySample, SampleType.StudyPool], filterFeatures=False)
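
Conceptually, a sample mask holds one boolean per sample, where True means 'keep'. The sketch below shows how a sample-type filter might combine with an existing quality control mask. It is a plain-Python illustration with invented data, and `ExampleSampleType` is a stand-in defined here; it is not the toolbox's implementation of `updateMasks`.

```python
from enum import Enum


class ExampleSampleType(Enum):
    """Stand-in for the toolbox's SampleType enumeration (hypothetical)."""
    StudySample = 1
    StudyPool = 2
    ExternalReference = 3


sample_types = [
    ExampleSampleType.StudySample,
    ExampleSampleType.StudyPool,
    ExampleSampleType.ExternalReference,
    ExampleSampleType.StudySample,
]
# Existing mask from the QC checks: the last sample has failed.
qc_mask = [True, True, True, False]

keep_types = {ExampleSampleType.StudySample, ExampleSampleType.StudyPool}

# A sample is kept only if it passed QC and has one of the wanted types.
sample_mask = [ok and (kind in keep_types) for ok, kind in zip(qc_mask, sample_types)]
print(sample_mask)
```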

### Filter features

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        Remove unwanted/uninformative regions of the spectrum (the defaults here are the TSP peak region and the region containing the water pre-saturation signal) by updating the masks with 'filterFeatures=True'. Here, we also demonstrate removing an additional region between 8.4 and 8.5 ppm.
        <br/><br/>
        To keep all samples (as specified at the previous step) set the 'filterSamples' argument to False.
        <br/><br/>
        To summarise features retained/marked for exclusion use the feature summary report.
    </font>
</div>

In [None]:
nmrData.updateMasks(filterSamples=False, filterFeatures=True)

In [None]:
nmrData.updateMasks(filterSamples=False, filterFeatures=True, exclusionRegions=[(8.4, 8.5)])
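
An exclusion region can be thought of as switching off every point on the chemical shift axis that falls between its two limits. The sketch below builds such a feature mask for an invented, very coarse ppm axis. `build_feature_mask` is a hypothetical helper, and treating both limits as inclusive is an assumption of this example rather than documented toolbox behaviour.

```python
def build_feature_mask(ppm, exclusion_regions):
    """Return one boolean per ppm value; False marks points inside an excluded region."""
    mask = []
    for shift in ppm:
        # Limits are treated as inclusive here (assumption for this sketch).
        excluded = any(low <= shift <= high for low, high in exclusion_regions)
        mask.append(not excluded)
    return mask


# Coarse, invented ppm axis for illustration only.
ppm_axis = [8.30, 8.35, 8.40, 8.45, 8.50, 8.55]
feature_mask = build_feature_mask(ppm_axis, [(8.4, 8.5)])
print(feature_mask)
```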

In [None]:
nPYc.reports.generateReport(nmrData,'feature summary')

### Permanently exclude masked samples/features

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        If you are satisfied with the samples and features masked for exclusion, apply these exclusions (permanently removing the samples/features from the dataset) using the 'applyMasks' function.
    </font>
</div>

In [None]:
nmrData.applyMasks()

# 5. Analytical Multivariate Quality Control

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        The analytical multivariate report provides visualisations summarising the largest sources of variance in the dataset (by PCA) with particular emphasis on any potential analytical sources. Plots include:
        <ul>
        <li>Model statistics.</li>
        <li>Scores plots. These provide insight into the relationships between samples, for example, the consistency of the QC samples and the presence of sample outliers.</li>
        <li>Loadings plots. These provide insight into the features with the largest variance in the dataset.</li>
        <li>Potential associations with analytical parameters. The correlation (for continuous metadata) or Kruskal-Wallis test (for categorical metadata) between each metadata field and each set of PCA scores is calculated, and any significant associations are flagged.</li>
        <li>The default scaling is unit variance ('scaling=1'), but other scaling options are available (0 for mean centring only; 0.5 for Pareto scaling).</li>
        </ul>
    </font>
</div>

In [None]:
PCAmodelAnalytical = nPYc.multivariate.exploratoryAnalysisPCA(nmrData, scaling=1)
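
One common way to read the `scaling` argument is as an exponent applied to each feature's standard deviation after mean centring: 0 leaves the centred values unchanged, 0.5 gives Pareto scaling and 1 gives unit variance. The sketch below illustrates that arithmetic for a single feature. It is not the toolbox's code; `scale_feature` is a hypothetical helper and the use of the sample standard deviation is an assumption.

```python
import statistics


def scale_feature(values, power):
    """Mean-centre values and divide by the standard deviation raised to `power`."""
    mean = statistics.mean(values)
    spread = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - mean) / (spread ** power) for v in values]


feature = [2.0, 4.0, 6.0]  # invented intensities for one spectral variable

centred = scale_feature(feature, 0)     # mean centring only
pareto = scale_feature(feature, 0.5)    # Pareto scaling
unit_var = scale_feature(feature, 1)    # unit variance scaling
print(centred, pareto, unit_var)
```

Pareto scaling sits between the other two: high-variance features are shrunk, but less strongly than under unit variance scaling.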

In [None]:
nPYc.reports.multivariateQCreport(nmrData, PCAmodelAnalytical)
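
The association check for continuous metadata can be illustrated with a correlation between one metadata field and one set of PCA scores. The numbers below are invented, and the choice of Pearson's coefficient is an assumption for this sketch; the report may use a different correlation measure and also tests for significance, which is not shown here.

```python
import math


def pearson(x, y):
    """Return the Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Invented example: scores on one component drifting with run order.
run_order = [1, 2, 3, 4, 5]
pc_scores = [-2.1, -0.9, 0.2, 0.8, 2.0]

r = pearson(run_order, pc_scores)
print(round(r, 3))
```

A coefficient close to 1 or -1 would suggest that the component is capturing variance linked to run order, which is the kind of analytical effect the report is designed to highlight.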

# 6. Finalise & Export Dataset

### Check final dataset output:

In [None]:
nPYc.reports.generateReport(nmrData, 'final report', pcaModel=PCAmodelAnalytical)

### Export data:

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
    Export a unified csv file, which contains the intensity data (one row per sample, one column per variable), alongside all sample and feature metadata (left columns and top rows respectively).
    <br/><br/>
    Output the final report to provide a summary of the dataset.
    </font>
</div>

In [None]:
import os
# Create the save directory if it does not already exist
os.makedirs(saveDir, exist_ok=True)

In [None]:
nmrData.exportDataset(saveFormat='UnifiedCSV', destinationPath=saveDir)

In [None]:
nPYc.reports.generateReport(nmrData, 'final report', pcaModel=PCAmodelAnalytical, destinationPath=saveDir)