# Preprocessing NMR targeted data with the nPYc-toolbox

#### This document provides a pipeline for the import of Targeted data (_quantitative, semi-quantitative or annotated data_), and any associated sample metadata, followed by summaries and quality control reports of the data (both in sample and feature dimensions) and output of a final dataset ready for sharing with collaborators and data modeling.

#### By default, all summary reports (with the exception of the final report) are output only to this notebook. The notebook (including outputs) can be saved using File > Save and Checkpoint. If HTML copies of any reports are required, they can be saved automatically to the save directory by adding the optional input argument destinationPath=saveDir.

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        This notebook will import semi-quantitative data extracted from urine NMR spectra using the Bruker BI-QuantUR processing routine.
    </font>
</div>

# 1. Initial Setup

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
    Define the path to the raw data and corresponding metadata. 
    </font>
</div>

### Define file paths

In [None]:
rawDataPath = '/path to NMR raw data files'
basicCSVPath = '/path to sample metadata file/PipelineTest 1D NMR Basic CSV.csv'

saveDir = '/path to save outputs'

### Import code

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
    Import the required Python libraries and the nPYc toolbox.
    </font>
</div>

In [None]:
import os
import matplotlib.pyplot as plt
import scipy
import pandas
import numpy
import pickle
import seaborn as sns
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
%matplotlib inline
import sys
import nPYc
import copy
from nPYc.enumerations import VariableType, DatasetLevel, AssayRole, SampleType, CalibrationMethod, QuantificationType

# 2. Import Data and Sample Metadata

### Import acquired data and associated acquisition parameters

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
    Import all data in the 'rawDataPath' with import parameters as defined in the 'BrukerQuant-UR' SOP. Set the name of the 'TargetedDataset' object ('targetedData').
    <br/><br/>
    BI-QuantUR reports values in multiple units, so the 'unit' parameter is used to import only values expressed in 'mmol/L' ('mmol/mol Crea' values are also available in this example dataset).
    </font>
</div>

In [None]:
# Load a single plate, or point to a parent folder: all sub-folders will be searched
# The pattern is a raw string so that '\.' is passed to the regex engine unchanged
targetedData = nPYc.TargetedDataset(rawDataPath, fileType='Bruker Quantification', sop='BrukerQuant-UR', fileNamePattern=r'.*?urine_quant_report_b\.xml$', unit='mmol/L')
targetedData.name = 'nPYc example NMR targeted dataset'
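
To illustrate which report files the 'fileNamePattern' argument would select, here is a minimal sketch using Python's standard 're' module. It assumes the toolbox applies the pattern as an ordinary regular expression to each file name; the file names below are hypothetical.

```python
import re

# Same pattern as passed to fileNamePattern above, written as a raw string.
pattern = re.compile(r'.*?urine_quant_report_b\.xml$')

# Hypothetical file names, for illustration only.
candidates = [
    'plate1/10/pdata/1/urine_quant_report_b.xml',
    'plate1/10/pdata/1/urine_quant_report_b.xml.bak',
    'plate1/10/pdata/1/urine_quant_report_e.xml',
    'plate1/10/pdata/1/plasma_quant_report.xml',
]

# Keep only names that end in 'urine_quant_report_b.xml'.
matched = [name for name in candidates if pattern.match(name)]
print(matched)
```

Only the first name is kept: the trailing '$' rejects the '.bak' copy, and the literal '_b' rejects the other report variants.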

---
### Match acquired samples to Sample IDs and subject information (if available)

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
    Add the corresponding sample metadata (including Sample IDs and any other sample-specific information) and match it to the acquired data. Here, the sample metadata is in the 'Basic CSV' format.
    </font>
</div>

In [None]:
targetedData.addSampleInfo(descriptionFormat='Basic CSV', filePath=basicCSVPath)
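
The matching step can be pictured as a join between the acquired sample names and the rows of the metadata file. The sketch below is not the toolbox's implementation; it assumes the metadata is keyed on a 'Sample File Name' column, and all names and values are hypothetical.

```python
import csv
import io

# Hypothetical metadata file contents; the column names are assumptions.
metadata_csv = io.StringIO(
    "Sample File Name,Sample ID,Age\n"
    "run_001,S1,34\n"
    "run_002,S2,51\n"
)
metadata = {row['Sample File Name']: row for row in csv.DictReader(metadata_csv)}

# Hypothetical names of the acquired samples.
acquired = ['run_001', 'run_002', 'run_003']

# Acquired samples with a metadata row, and those without one.
matched = {name: metadata[name] for name in acquired if name in metadata}
missing_metadata = [name for name in acquired if name not in metadata]

print(sorted(matched))
print(missing_metadata)
```

Samples that end up in a list like 'missing_metadata' are the kind of discrepancy the sample summary report in the next section is intended to surface.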

# 3. Sample & Feature Filtering

### Generate sample summary report

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
       This summary can be used to check the expected samples against those acquired (for example, sample numbers, sample type, samples missing from acquisition or lacking metadata information).
    </font>
</div>

In [None]:
nPYc.reports.generateReport(targetedData, 'sample summary')

### Generate feature summary report

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        The feature summary report provides visualisations summarising the quality of the dataset. This includes tables of feature properties and limits of quantification and RSD, and violin plots of feature concentration by sample type.
    </font>
</div>

In [None]:
nPYc.reports.generateReport(targetedData, 'feature summary')
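
For reference, the relative standard deviation (RSD) of a feature is its standard deviation expressed as a percentage of its mean. The sketch below uses the sample standard deviation from Python's 'statistics' module. Whether the toolbox uses exactly this estimator is an assumption, and the values are hypothetical.

```python
import statistics

def rsd_percent(values):
    """Relative standard deviation: 100 * sample standard deviation / mean."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError('RSD is undefined when the mean is zero')
    return 100 * statistics.stdev(values) / mean

# Hypothetical concentrations (mmol/L) of one feature in repeated QC samples.
qc_values = [9.0, 10.0, 11.0]
print(round(rsd_percent(qc_values), 1))
```

A low RSD across repeated measurements of the same QC material indicates that the feature is measured precisely.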

### Filter the samples

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        Mark for exclusion any samples which fail any of the above analytical criteria by updating the sample masks.
        <br/><br/>
        At this point we can also exclude any other samples which are not required by setting preferences with the 'sampleTypes' argument. In this example, we limit our dataset to study samples and quality control (study pool) samples only.
        <br/><br/>
        To keep all features, set the 'filterFeatures' argument to False.
    </font>
</div>

In [None]:
targetedData.updateMasks(sampleTypes=[SampleType.StudySample, SampleType.StudyPool], filterFeatures=False)

### Filter the features

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        In the targeted assay we do not mark any features for exclusion, but simply separate our reports into two sections: one for those features which are present and measured with high accuracy and precision, and one for those features which do not meet these quality criteria.
    </font>
</div>

### Permanently exclude masked samples

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
    If you are happy with the samples and features masked for exclusion, apply these exclusions (permanently removing the masked samples from the dataset) using the 'applyMasks' function.
    </font>
</div>

In [None]:
targetedData.applyMasks()
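
The two-step pattern used above (first update the masks, then apply them) can be sketched in plain Python. This is an illustration of the idea only, not the toolbox's code: 'SampleKind' is a hypothetical stand-in for nPYc's 'SampleType', and the convention that True means "keep" is an assumption.

```python
from enum import Enum

class SampleKind(Enum):  # hypothetical stand-in for nPYc's SampleType
    STUDY_SAMPLE = 'Study Sample'
    STUDY_POOL = 'Study Pool'
    EXTERNAL_REFERENCE = 'External Reference'

sample_kinds = [SampleKind.STUDY_SAMPLE, SampleKind.EXTERNAL_REFERENCE,
                SampleKind.STUDY_POOL, SampleKind.STUDY_SAMPLE]
data = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.2, 2.2]]  # one row per sample

# Step 1: update the mask (True = keep). Nothing is removed yet.
keep = {SampleKind.STUDY_SAMPLE, SampleKind.STUDY_POOL}
sample_mask = [kind in keep for kind in sample_kinds]

# Step 2: apply the mask, which drops the masked-out rows for good.
data = [row for row, kept in zip(data, sample_mask) if kept]
sample_kinds = [kind for kind, kept in zip(sample_kinds, sample_mask) if kept]

print(sample_mask)
print(len(data))
```

Keeping the two steps separate means the masks can be reviewed, and the reports regenerated, before anything is removed permanently.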

# 4. Analytical Multivariate Quality Control

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
        The analytical multivariate report provides visualisations summarising the largest sources of variance in the dataset (by PCA) with particular emphasis on any potential analytical sources. Plots include:
        <ul>
        <li>Model statistics.</li>
        <li>Scores plots. These provide insight into the relationships between samples, for example, the consistency of the QC samples and the presence of sample outliers.</li>
        <li>Loadings plots. These provide insight into the features with the largest variance in the dataset.</li>
        <li>Potential associations with analytical parameters. Correlation (for continuous metadata) or a Kruskal-Wallis test (for categorical metadata) is calculated between each metadata field and each set of PCA scores generated, and any significant associations are flagged.</li>
        <li>The default scaling is unit variance ('scaling=1'), but other scaling options are available (0 for mean centring only; 0.5 for Pareto scaling).</li>
        </ul>
    </font>
</div>
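
The three scaling options listed above are commonly written as a single formula: each feature is mean-centred and then divided by its standard deviation raised to the power of the scaling value. The sketch below follows that common formulation with hypothetical data. It is not taken from the toolbox source, so the exact estimator used there is an assumption.

```python
import statistics

def scale_column(values, scaling=1.0):
    """Mean-centre, then divide by the standard deviation raised to `scaling`."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / (sd ** scaling) for v in values]

column = [2.0, 4.0, 6.0]  # hypothetical concentrations of one feature

centred = scale_column(column, scaling=0)    # mean centring only
pareto = scale_column(column, scaling=0.5)   # Pareto scaling
unit_var = scale_column(column, scaling=1)   # unit variance

print(centred, pareto, unit_var)
```

With a scaling of 0 the divisor is 1, so the features keep their original variance. With a scaling of 1 every feature contributes equal variance to the model. Pareto scaling sits between the two.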

In [None]:
# SVD does not accept missing values
# Define the model name up front so that later cells can still reference it if PCA fails
PCAmodelAnalytical = None
try:
    PCAmodelAnalytical = nPYc.multivariate.exploratoryAnalysisPCA(targetedData, withExclusions=True, scaling=1.0)
    nPYc.reports.multivariateReport.multivariateQCreport(targetedData, PCAmodelAnalytical, reportType='analytical', withExclusions=True)
except ValueError:
    print('Multivariate analysis is not currently possible with values <LLOQ or >ULOQ.')
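
Before fitting, it can be useful to see which features would prevent the PCA from running. The sketch below counts non-finite entries per feature in a small hypothetical matrix. It assumes that out-of-range or missing values are stored as NaN or infinite numbers, which may not match how the toolbox represents them.

```python
import math

def count_non_finite(matrix):
    """Count NaN or infinite entries per column of a row-major matrix."""
    n_columns = len(matrix[0])
    return [sum(not math.isfinite(row[j]) for row in matrix) for j in range(n_columns)]

# Hypothetical data: rows are samples, columns are features.
values = [
    [1.2, float('-inf'), 3.1],
    [1.1, 0.4, float('nan')],
    [1.3, 0.5, 2.9],
]

counts = count_non_finite(values)
pca_ready = all(c == 0 for c in counts)

print(counts, pca_ready)
```

Any feature with a non-zero count would need to be handled before a decomposition of the full matrix could succeed.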

# 5. Finalise & Export Dataset

### Check final dataset output:

In [None]:
nPYc.reports.generateReport(targetedData, 'final report', pcaModel=PCAmodelAnalytical)

### Export data:

<div style='background-color:#EEFFEC'>
    <font color='#0B6D01'>
    Export a unified CSV file, which contains the intensity data (one row per sample, one column per variable), alongside all sample and feature metadata (left columns and top rows respectively).
    <br/><br/>
    Output the final report to provide a summary of the dataset.
    </font>
</div>

In [None]:
targetedData.exportDataset(saveFormat='UnifiedCSV', destinationPath=saveDir)

In [None]:
nPYc.reports.generateReport(targetedData, 'final report', pcaModel=PCAmodelAnalytical, destinationPath=saveDir)