# TRIC alignment

## Overview

TRIC uses a graph-based alignment strategy based on non-linear retention time correction to integrate information from all available runs. The input consists of a set of csv files derived from a targeted proteomics experiment generated by OpenSWATH (using either mProphet or pyProphet) or generated by Peakview.

There are two basic running modes available. The first one uses a reference-based alignment where a single run is chosen as a reference and all other runs are aligned to it. This is a useful choice for a small number of runs that are chromatographically similar. The second mode generates a guidance tree based on chromatographic similarity of the input runs and uses this tree to align the targeted proteomics runs (the nodes in the tree are runs and the edges are pairwise alignments). Generally this mode is better for a large number of runs or for chromatographically dissimilar samples.

## Design of the Algorithm

### Alignment Order and RT correction

The first step in the algorithm is to compute the alignment order. If the tree-based alignment is used, a set of high-confidence anchor points is first used to estimate the pairwise chromatographic distance between all runs. This distance matrix is then used to compute a guidance tree (minimum spanning tree, MST) where the nodes represent LC-MS/MS runs and the edges represent pairwise alignments. If a reference-based approach is used, a reference run is selected first (the run with the most features) and a star-shaped tree is created with the reference run in the middle, connected to all other runs.
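The two tree shapes described above can be illustrated with a minimal sketch. It is not TRIC's own code: the function names and the distance values are hypothetical, and Prim's algorithm is used here simply as one standard way to build a minimum spanning tree from a distance matrix.

```python
def minimum_spanning_tree(dist):
    """Return MST edges (i, j) for a symmetric distance matrix (Prim's algorithm)."""
    n = len(dist)
    if n == 0:
        return []
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge that connects the tree to a run outside it.
        _, i, j = min(
            (dist[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


def star_tree(n_runs, reference):
    """Return the edges of a star-shaped tree centred on the reference run."""
    return [(reference, j) for j in range(n_runs) if j != reference]


# Hypothetical chromatographic distances between four runs
# (runs 0/1 and runs 2/3 behave like two batches).
distances = [
    [0.0, 1.0, 9.0, 8.0],
    [1.0, 0.0, 7.0, 9.5],
    [9.0, 7.0, 0.0, 2.0],
    [8.0, 9.5, 2.0, 0.0],
]
mst_edges = minimum_spanning_tree(distances)
star_edges = star_tree(len(distances), reference=0)
```

In the MST, the two batches are joined by a single edge (run 1 to run 2), whereas the star-shaped tree aligns every run directly against run 0 regardless of how dissimilar it is.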

Then, for each edge in the tree, a pairwise non-linear transformation between the retention time (RT) domains of the two runs at the nodes is computed, using one of several available methods (e.g. local regression, spline fit, k-nearest neighbor).

### Confidence transfer

Using the guidance tree from above (star-shaped or MST-based), for each measured targeted proteomics assay, traversal of the global guidance tree starts with a suitable starting point, or seed identification (an identification below the --target_fdr or --fdr_cutoff cutoff). During traversal each edge of the tree is visited sequentially and a confident identification is mapped from one node (run $n$) to an adjacent node (run $m$), where the choice of an MST guidance tree ensures that the mapping only occurs between chromatographically similar runs. During confidence transfer, the identification confidence of all peakgroups in run $m$ within the specified retention time window (user-defined or adaptive) is considered. If the confidence score of the best peakgroup within the RT window passes the user-defined threshold given by --max_fdr_quality, it gets added to the result.
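A single confidence-transfer step along one edge can be sketched as follows. This is a simplified illustration under stated assumptions, not the tool's implementation: peakgroups are reduced to hypothetical `(rt, q_value)` tuples, the expected RT in run $m$ is assumed to have been computed already, and "passes the threshold" is taken to mean a q-value strictly below `max_fdr_quality`.

```python
def transfer_confidence(candidates, expected_rt, max_rt_diff, max_fdr_quality):
    """Pick the best peakgroup of run m inside the RT window, or None.

    candidates: list of (rt, q_value) tuples for one assay in run m.
    expected_rt: RT of the confident peakgroup from run n, mapped into run m.
    """
    # Only peakgroups inside expected_rt +/- max_rt_diff are considered.
    in_window = [pg for pg in candidates if abs(pg[0] - expected_rt) <= max_rt_diff]
    if not in_window:
        return None
    # The best peakgroup is the one with the lowest q-value.
    best = min(in_window, key=lambda pg: pg[1])
    return best if best[1] < max_fdr_quality else None


# Hypothetical peakgroups in run m for one assay.
peakgroups = [(1200.0, 0.20), (1255.0, 0.03), (1400.0, 0.001)]
accepted = transfer_confidence(peakgroups, 1250.0, 30.0, 0.05)
rejected = transfer_confidence(peakgroups, 1250.0, 30.0, 0.01)
```

Note that the peakgroup at 1400.0 has the best q-value overall but is never considered, because it lies outside the RT window.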

The size of the RT window during confidence transfer is given by --max_rt_diff. However, as different parts of the tree may have different alignment quality, it is possible to use adaptive retention time windows, derived from the quality of the alignment. This approach allows different parameters for confidence transfer on different parts of the tree, increasing robustness and decreasing the influence of outlier runs (see the --mst:Stdev_multiplier parameter).
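As a rough sketch of how an adaptive window might be derived per edge: the function below assumes that alignment quality is summarised by the standard deviation of the anchor-point residuals after alignment (an assumption made for illustration; the function and argument names are hypothetical), and it never lets the tolerance fall below `max_rt_diff`.

```python
import statistics


def adaptive_rt_tolerance(residuals, max_rt_diff, stdev_multiplier):
    """Return the RT tolerance (in seconds) to use for one pairwise alignment.

    residuals: RT differences of the anchor points after alignment (assumed input).
    """
    stdev = statistics.pstdev(residuals)
    # A poor alignment widens the window; a good one falls back to max_rt_diff.
    return max(max_rt_diff, stdev_multiplier * stdev)


good_edge = adaptive_rt_tolerance([-2, 2, -2, 2], max_rt_diff=20.0, stdev_multiplier=3.0)
bad_edge = adaptive_rt_tolerance([-10, 10, -10, 10], max_rt_diff=20.0, stdev_multiplier=3.0)
```

Here the well-aligned edge keeps the strict 20 second tolerance, while the poorly aligned edge is given a wider window.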

### Requantification

TRIC contains a separate, optional requantification step where runs in the guidance tree where no peakgroup passed the confidence filter can be re-visited for re-quantification. In these cases, the software can infer the peak boundaries from the closest neighboring run and quantify the fragment ion signal within those boundaries, see TRIC requantification.
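The idea of quantifying signal inside transferred peak boundaries can be sketched in a few lines. This is only an illustration with hypothetical names and data: it sums the intensities of one chromatogram between two borders and ignores details such as background handling or per-fragment aggregation that a real implementation would need to address.

```python
def integrate_within_borders(rts, intensities, left, right):
    """Sum the intensity of all data points with left <= rt <= right."""
    return sum(i for rt, i in zip(rts, intensities) if left <= rt <= right)


# Hypothetical fragment ion chromatogram and borders inferred from a neighboring run.
rts = [100.0, 101.0, 102.0, 103.0, 104.0]
intensities = [1.0, 5.0, 9.0, 4.0, 2.0]
signal = integrate_within_borders(rts, intensities, left=101.0, right=103.0)
```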

## Installing TRIC

Please see the main README file for installation instructions.

## Running TRIC

To get an overview of all available options, please use

./analysis/alignment/feature_alignment.py --help

A sample run of the tool may look as follows:

./analysis/alignment/feature_alignment.py \
--in file1_input.csv file2_input.csv file3_input.csv \
--out aligned.csv \
--method best_overall --realign_method diRT --max_rt_diff 90 \
--target_fdr 0.01 --max_fdr_quality 0.05

This command aligns 3 files using the (initial) linear iRT alignment and, using a reference-based alignment, picks an appropriate peakgroup in each run within the aligned window. In order to be reported, each peptide must have at least one identification below the 1 % q-value cutoff in at least one run, and each quantitative cell in the resulting data matrix must have a q-value below 5 %. The maximal RT deviation between the aligned runs is 90 seconds in the above example (you may choose a smaller value if you select one of the nonlinear alignment methods).
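The interplay of the two thresholds can be sketched on a small, made-up data matrix. The sketch deliberately simplifies: it treats --target_fdr as a plain q-value cutoff and ignores retention time entirely, and the function name and data layout are hypothetical.

```python
def filter_matrix(qvalues, target_fdr, max_fdr_quality):
    """Filter a {peptide: {run: q_value}} matrix with the two thresholds."""
    result = {}
    for peptide, per_run in qvalues.items():
        # A peptide is reported only if at least one run provides a confident seed.
        if not any(q < target_fdr for q in per_run.values()):
            continue
        # Each remaining cell must still pass the (looser) quality threshold.
        result[peptide] = {run: q for run, q in per_run.items() if q < max_fdr_quality}
    return result


# Hypothetical q-values for two peptides across three runs.
matrix = {
    "PEPTIDEA": {"run1": 0.001, "run2": 0.03, "run3": 0.2},
    "PEPTIDEB": {"run1": 0.02, "run2": 0.04},
}
filtered = filter_matrix(matrix, target_fdr=0.01, max_fdr_quality=0.05)
```

PEPTIDEB is dropped even though both of its cells are below 5 %, because no run identifies it below the 1 % cutoff.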

The individual parameters can be adjusted as follows:

  • --method refers to using either a reference-based alignment or a tree-based alignment (see below).
  • --realign_method refers to the (non)-linear alignment strategy employed (see below).
  • --max_rt_diff refers to the maximal shift in RT after alignment that is tolerated. If a peakgroup is shifted by more than this amount, it is excluded from the result (except if its FDR is below the set FDR threshold and a non-global strategy was selected in the reference-based approach). Note that this is a difference, thus the RT window for alignment is twice the size of this parameter (i.e. the window considered is expectedRT +/- max_rt_diff).
  • --target_fdr refers to the desired FDR on assay level.
  • --max_fdr_quality refers to the maximal FDR value a value in the data matrix may have to still be considered for quantitation.
  • --file_format Which input file format is used (openswath (default), mprophet or peakview). openswath is used for a file generated by the OpenSwath workflow (OpenSwath + mProphet / pyProphet) while mprophet is used for traditional SRM files generated by the mQuest + mProphet workflow. peakview is for PeakView files.

### (Non)-linear pairwise alignment

Several options are available for (non)-linear pairwise alignment. Generally, the alignment is performed by taking a set of highly confident "anchor points" that are present in both runs and then computing a transformation function from the RT-space of one run into the RT-space of the other run.
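To make the notion of a transformation function concrete, here is a minimal sketch of the simplest case: an ordinary least-squares line fitted through the anchor points. The function name and the anchor values are hypothetical, and the non-linear methods replace the straight line with a smoother or spline.

```python
def fit_linear_transformation(anchors):
    """Fit rt_b = slope * rt_a + intercept through (rt_a, rt_b) anchor points."""
    n = len(anchors)
    mean_a = sum(a for a, _ in anchors) / n
    mean_b = sum(b for _, b in anchors) / n
    s_aa = sum((a - mean_a) ** 2 for a, _ in anchors)
    s_ab = sum((a - mean_a) * (b - mean_b) for a, b in anchors)
    slope = s_ab / s_aa
    intercept = mean_b - slope * mean_a
    # The returned function maps an RT from run A into the RT-space of run B.
    return lambda rt: slope * rt + intercept


# Hypothetical anchor points: the same confident peptides seen in both runs.
anchors = [(100.0, 110.0), (200.0, 220.0), (300.0, 330.0)]
to_run_b = fit_linear_transformation(anchors)
predicted = to_run_b(400.0)
```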

The method for pairwise alignment can be selected using --realign_method. The recommended methods are lowess (or the faster lowess_cython) and SmoothLLDMedian.

The very simple or linear alignment methods are:

  • diRT uses the difference to the expected elution time of the assay computed by OpenSWATH
  • linear performs a linear alignment using the anchor points

The more complex, non-linear alignment methods are:

  • lowess use Robust locally weighted regression for alignment (lowess smoother)
  • splinePy use Python native spline from scikits.datasmooth (slow!)
  • nonCVSpline compute a spline for alignment (no cross-validation)
  • CVSpline compute a spline for alignment (using cross-validation)
  • WeightedNearestNeighbour weighted interpolation using local linear differences of the k nearest neighbors
  • SmoothLLDMedian local median interpolation using local linear differences of the k nearest neighbors
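As a loose illustration of the nearest-neighbor idea behind the last two methods, the sketch below shifts an RT by the median RT difference of its k nearest anchor points. It is an interpretation of the one-line descriptions above, not the actual WeightedNearestNeighbour or SmoothLLDMedian code; the function name, the default k and the data are all hypothetical.

```python
import statistics


def knn_median_shift(anchors, rt, k=3):
    """Map rt from run A to run B using the k nearest (rt_a, rt_b) anchor points."""
    # Find the k anchors whose run-A RT is closest to the query RT.
    nearest = sorted(anchors, key=lambda anchor: abs(anchor[0] - rt))[:k]
    # Use the median of their local RT differences as the correction.
    shift = statistics.median(b - a for a, b in nearest)
    return rt + shift


anchors = [(100.0, 105.0), (200.0, 208.0), (300.0, 312.0), (400.0, 450.0)]
mapped = knn_median_shift(anchors, 210.0, k=3)
```

Using a median rather than a mean keeps a single badly matched anchor (such as the last one in this example) from dominating the local correction.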

Several alignment methods require additional packages to be installed:

  • splineR perform alignment using the smooth.spline function in R (needs the rpy2 package)
  • splineR_external perform alignment using the smooth.spline function in R (starts an R process using the command line)
  • Earth use Multivariate Adaptive Regression Splines (needs the py-earth package)
  • lowess_cython uses a faster lowess implementation (see the main README file, "Fast lowess" for install instructions)

### Reference-based alignment

The reference-based alignment selects the run with the most features (identified peakgroups) as the reference. Then all other runs are aligned against the reference in a pairwise fashion.
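Reference selection as described here amounts to a simple maximum over per-run feature counts. The snippet below is a minimal sketch with made-up run names and counts, not code taken from the tool.

```python
def select_reference(feature_counts):
    """Return the run with the most identified peakgroups."""
    return max(feature_counts, key=feature_counts.get)


# Hypothetical number of identified peakgroups per run.
counts = {"run1": 1520, "run2": 1843, "run3": 1710}
reference = select_reference(counts)
# Every other run is then aligned pairwise against the reference.
pairs = [(reference, run) for run in counts if run != reference]
```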

This mode can be enabled by choosing --method to be one of the following:

  • best_overall
  • best_cluster_score
  • global_best_cluster_score
  • global_best_overall

The recommended method is global_best_overall. Note that the two global options will align all peakgroups according to retention time whereas the other two methods will keep peakgroups below the FDR cutoff in all cases. This means that when using the global option, peakgroups below the FDR cutoff may be removed if they are not at the expected position in retention time (this is useful to remove spurious identifications but may lead to low identification numbers if the parameters are too strict).
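The difference between the global and non-global options can be sketched as a single decision rule. This is a simplified reading of the description above with hypothetical argument names; it only models the RT check and not the subsequent quality filtering.

```python
def survives_rt_check(q_value, rt_shift, fdr_cutoff, max_rt_diff, use_global):
    """Decide whether a peakgroup is kept after alignment to the reference."""
    within_window = abs(rt_shift) <= max_rt_diff
    if use_global:
        # Global options: the RT position always has to match.
        return within_window
    # Non-global options: confident peakgroups are kept regardless of position.
    return within_window or q_value < fdr_cutoff


# A confident peakgroup that elutes 120 s away from its expected position.
kept_non_global = survives_rt_check(0.001, 120.0, 0.01, 90.0, use_global=False)
kept_global = survives_rt_check(0.001, 120.0, 0.01, 90.0, use_global=True)
```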

The reference-based approach will try to automatically estimate a sensible value if you set --max_rt_diff to auto_3medianstdev.

### Tree-based alignment

Alternatively, a tree-based alignment is available where the input runs are arranged in a guidance tree (the nodes in the tree are runs and the edges are pairwise alignments). This approach is reference-free and means that each alignment step is purely local and each run is only aligned against runs that are chromatographically close. Generally this mode is better for a large number of runs or for chromatographically dissimilar samples.

This mode can be enabled by choosing --method to be one of the following:

  • LocalMST
  • LocalMSTAllCluster

The best choice here is to use LocalMST which reports the best result for each assay. If you want to have a full output where multiple results (multiple clusters) per peakgroup may be reported, use LocalMSTAllCluster.

The tree-based alignment has several options specific to it:

  • --mst:useRTCorrection Use aligned peakgroup RT to continue threading in MST algorithm. It is highly recommended to set this to "True".
  • --mst:Stdev_multiplier Turn on adaptive RT tolerances: How many standard deviations the peakgroup can deviate in RT during the alignment (if less than max_rt_diff, then max_rt_diff is used). It is recommended to set this to a value between 2.0 and 4.0.
  • --mst:useLocalStdev Use standard deviation of local region of the chromatogram. This is experimental and may not work.

Adaptive RT tolerance can be very useful if not all alignments have the same quality. This allows the user to set an overall strict tolerance for the alignment while a few, particularly bad pairwise alignments are allowed to have a larger tolerance. These "bad" pairwise alignments may potentially be the edges in the tree that connect two sub-trees which may represent two batches for example.

Thus, a sample command for a tree-based alignment may look like this:

./analysis/alignment/feature_alignment.py \
--in file1_input.csv file2_input.csv file3_input.csv \
--out aligned.csv \
--method LocalMST --realign_method lowess_cython --max_rt_diff 60 \
--mst:useRTCorrection True --mst:Stdev_multiplier 3.0 \
--target_fdr 0.01 --max_fdr_quality 0.05

### Further parameters

  • --disable_isotopic_grouping Disable grouping of isotopic variants by peptide_group_label, thus disabling matching of isotopic variants of the same peptide across channels. If grouping is disabled, each isotopic channel will be matched independently of the others. If grouping is enabled, the more certain identification will be used to infer the location of the peak in the other channel.
  • --use_dscore_filter Enable the filter by d score (this is mainly for speedup)
  • --dscore_cutoff Quality cutoff to still consider a feature for alignment using the d_score: everything below this d-score is discarded (this is mainly for speedup)
  • --nr_high_conf_exp Number of experiments in which the peptide needs to be identified with high confidence (e.g. passing fdr_cutoff)
  • --readmethod Read full or minimal transition groups (minimal,full)
  • --tmpdir Temporary directory location
  • --alignment_score Minimal score needed for a feature to be considered for alignment between runs (e.g. score needed to be considered an "anchor point" for pairwise alignment)
  • --fdr_cutoff A fixed m-score cutoff which does not take into account the number of runs (use target_fdr instead)

# TRIC requantification

## Overview

Even after alignment, the data matrix is usually not complete. A last step in the TRIC-based workflow allows requantification of signal across the integration borders derived from alignment. This is implemented as a separate script that runs after TRIC, since this step needs the chromatograms generated by OpenSWATH:

./analysis/alignment/requantAlignedValues.py \
  --do_single_run run_n_chromatograms.mzML \
  --peakgroups_infile aligned_peakgroups.csv \
  --out requantified_output.csv \
  --realign_runs linear \
  --method singleShortestPath

Note that the --do_single_run input file is a chromatogram mzML file generated by OpenSWATH. If you have n files, you should run the above command n times, once for each mzML file, and then concatenate the resulting output files.
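Running the script once per file and merging the results can be automated with a small wrapper. The following is a sketch under assumptions rather than a supported workflow: the helper names and the output naming scheme are hypothetical, and the merge step assumes that every output file starts with the same single header line.

```python
import subprocess


def build_requant_command(chrom_mzml, peakgroups_csv, out_csv):
    """Build the requantification command for one chromatogram file."""
    return [
        "./analysis/alignment/requantAlignedValues.py",
        "--do_single_run", chrom_mzml,
        "--peakgroups_infile", peakgroups_csv,
        "--out", out_csv,
        "--realign_runs", "linear",
        "--method", "singleShortestPath",
    ]


def concatenate_outputs(paths, combined_path):
    """Concatenate per-run outputs, keeping the header of the first file only."""
    with open(combined_path, "w") as combined:
        for index, path in enumerate(paths):
            with open(path) as handle:
                header = handle.readline()
                if index == 0:
                    combined.write(header)
                combined.writelines(handle)


def requantify_all(chrom_files, peakgroups_csv, combined_path):
    """Run the requantification once per mzML file and merge the results."""
    outputs = []
    for chrom in chrom_files:
        out = chrom + ".requant.csv"  # hypothetical naming scheme
        subprocess.run(build_requant_command(chrom, peakgroups_csv, out), check=True)
        outputs.append(out)
    concatenate_outputs(outputs, combined_path)


command = build_requant_command(
    "run_1_chromatograms.mzML", "aligned_peakgroups.csv", "run_1_requant.csv"
)
```

A call such as `requantify_all(["run_1_chromatograms.mzML", "run_2_chromatograms.mzML"], "aligned_peakgroups.csv", "requantified_output.csv")` would then process all runs in sequence.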

The individual parameters can be adjusted as follows:

  • --peakgroups_infile Infile containing peakgroups (outfile from feature_alignment.py)
  • --file_format Which input file format is used for --peakgroups_infile (openswath (default), mprophet or peakview). openswath is used for a file generated by the OpenSwath workflow (OpenSwath + mProphet / pyProphet) while mprophet is used for traditional SRM files generated by the mQuest + mProphet workflow. peakview is for PeakView files.
  • --method Which method to use (singleShortestPath or singleClosestRun are recommended)
  • --realign_runs Same as realign_method above, see (Non)-linear pairwise alignment options

## Alignment approach

There are multiple alignment approaches available, which can be controlled with --method:

  • singleShortestPath (tree-based alignment): The integration borders are taken from the run that is closest to the current run in the guidance tree
  • singleClosestRun (tree-based alignment): The integration borders are transferred from the single closest run, disregarding the guidance tree
  • reference (reference-based alignment): The integration borders are aggregated across all runs (see advanced parameters)

If singleShortestPath or singleClosestRun is given, a tree-based alignment is chosen, whereas reference selects a reference-based alignment. Note that using a tree-based alignment is recommended; the reference-based approach is currently not recommended.
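One way to picture the singleShortestPath idea is a breadth-first search over the guidance tree that stops at the first run with a usable peakgroup. The sketch below measures closeness as the number of edges, which is an assumption made for illustration (the tool may weight edges differently), and all names are hypothetical.

```python
from collections import deque


def closest_run_with_peak(edges, start, runs_with_peak):
    """Return the run nearest to `start` in the tree that has a peakgroup."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    visited = {start}
    queue = deque([start])
    while queue:
        run = queue.popleft()
        if run != start and run in runs_with_peak:
            return run
        for nxt in neighbours.get(run, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return None  # no run in the tree has a peakgroup for this assay


# Hypothetical guidance tree A - B - C - D; run B lacks a confident peakgroup.
tree = [("A", "B"), ("B", "C"), ("C", "D")]
donor = closest_run_with_peak(tree, "B", {"A", "D"})
far_donor = closest_run_with_peak(tree, "B", {"D"})
```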

## Advanced parameters

  • --border_option (only in effect when --method is reference): How to determine integration border for the aggregate alignment (possible values: max_width, mean, median). All integration borders will be computed across all runs and then an aggregate is computed, using either the maximal width, the mean or the median. Max width will use the maximal possible width (most conservative since it will overestimate the background signal).
  • --cache_in_memory Cache data from a single run in memory
  • --disable_isotopic_grouping Disable grouping of isotopic variants by peptide_group_label, thus disabling matching of isotopic variants of the same peptide across channels. If grouping is disabled, each isotopic channel will be matched independently of the others. If grouping is enabled, the more certain identification will be used to infer the location of the peak in the other channel.
  • --disable_isotopic_transfer Disable the transfer of isotopic boundaries in all cases. If the transfer is enabled (default), the best-scoring isotopic channel dictates the peak boundaries and all other channels use those boundaries. This ensures consistency in peak picking in all cases.
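The three --border_option values can be illustrated with a short sketch. It is based on an assumption about what "maximal width" means (the earliest left border combined with the latest right border); the function name and the border values are hypothetical.

```python
import statistics


def aggregate_borders(borders, option):
    """Aggregate (left, right) integration borders collected across runs."""
    lefts = [left for left, _ in borders]
    rights = [right for _, right in borders]
    if option == "max_width":
        # Assumed meaning: the widest window covering all individual borders.
        return (min(lefts), max(rights))
    if option == "mean":
        return (statistics.mean(lefts), statistics.mean(rights))
    if option == "median":
        return (statistics.median(lefts), statistics.median(rights))
    raise ValueError("unknown border option: " + option)


# Hypothetical integration borders (in seconds) from three runs.
borders = [(100.0, 130.0), (104.0, 128.0), (98.0, 126.0)]
widest = aggregate_borders(borders, "max_width")
typical = aggregate_borders(borders, "median")
```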