NovoBridge +

This is the repository for the NovoBridge pipeline, as described in:
Hugo B. C. Kleikamp, Mario Pronk, Claudia Tugui, Leonor Guedes da Silva, Ben Abbas, Yue Mei Lin, Mark C.M. van Loosdrecht and Martin Pabst* Database-independent de novo metaproteomics of complex microbial communities, Cell Systems (2021) doi:10.1016/j.cels.2021.04.003

The pipeline was established and tested with shotgun (meta)proteomics data obtained from Q Exactive Orbitrap Mass Spectrometers, using either PEAKS or DeepNovo generated de novo sequence lists. The generation of accurate de novo peptide sequence lists depends on high quality peptide sequencing spectra.

NovoBridge has been tested only in an Anaconda Spyder environment!
Novobridge + is essentially the same pipeline as Novobridge, with two modifications:

  1. Tryptic digestion of uncleaved peptides: Since uncleaved peptides do not give an exact match with the Unipept API, each uncleaved peptide is cleaved and the resulting peptides are submitted separately.
  2. Multiple candidate submission: Since both PEAKS and DeepNovo can report multiple well-scoring candidate peptides for a single spectrum, the candidates are submitted one after another to increase recall.
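
The first modification can be illustrated with a minimal sketch of in silico tryptic digestion. It assumes the common trypsin rule (cleave after K or R, except when the next residue is P); whether the pipeline applies the proline exception is not stated here, and the function name `tryptic_digest` and its `min_length` argument are hypothetical.

```python
def tryptic_digest(peptide, min_length=1):
    """Split a peptide after every K or R that is not followed by P.

    Assumption: standard trypsin rule with proline exception; the
    actual rule used by the pipeline may differ.
    """
    fragments = []
    start = 0
    for i, residue in enumerate(peptide):
        next_residue = peptide[i + 1] if i + 1 < len(peptide) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(peptide[start:i + 1])
            start = i + 1
    # Keep a trailing fragment that does not end in K or R.
    if start < len(peptide):
        fragments.append(peptide[start:])
    return [f for f in fragments if len(f) >= min_length]


print(tryptic_digest("LSKPEPTIDERAAK"))  # → ['LSKPEPTIDER', 'AAK']
```

Each returned fragment would then be submitted as its own peptide, typically after applying the minimum length filter.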

What is Novobridge?

Novobridge is an automated pipeline for fast processing, and integrated annotation and visualization of de novo proteomics data.


Basic use

How does it work?

The current version of the pipeline is included as a single script: Novobridge+.py, which can be run from any Python interpreter. The pipeline uses the Unipept API methods pept2lca and pept2fun to annotate taxonomy and function, and uses the KEGG database to match the functional annotations to pathways based on EC numbers. It is based on the repository hbckleikamp/NovoBridge, with two modifications:

  1. Miscleavage handling: any uncleaved peptides are cleaved with trypsin.
  2. Multiple candidate submission: PEAKS generates multiple candidate sequences for each peptide. Starting with the longest, best-scoring sequence, candidates are submitted to Unipept until a match is returned.
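
The candidate fallback described above could look roughly like the sketch below. It is not the code from the script: the helper names, the `(sequence, score)` tuple layout, and the `lookup` callable are hypothetical, and the endpoint URL and `equate_il` option are assumptions about the public Unipept API. The lookup is injected so the logic can be tried without network access.

```python
from urllib.parse import urlencode

# Assumed public Unipept endpoint for the pept2lca method.
UNIPEPT_PEPT2LCA = "https://api.unipept.ugent.be/api/v1/pept2lca.json"


def build_pept2lca_url(peptides, equate_il=True):
    """Build a GET URL that submits one or more peptides to pept2lca."""
    params = [("input[]", p) for p in peptides]
    # Assumption: isoleucine and leucine are treated as equal.
    params.append(("equate_il", "true" if equate_il else "false"))
    return UNIPEPT_PEPT2LCA + "?" + urlencode(params)


def first_matching_candidate(candidates, lookup):
    """Try candidates from longest/best-scoring down until one matches.

    candidates: list of (sequence, score) tuples for one spectrum.
    lookup: callable returning an annotation, or None if there is no match.
    """
    ranked = sorted(candidates, key=lambda c: (len(c[0]), c[1]), reverse=True)
    for sequence, _score in ranked:
        annotation = lookup(sequence)
        if annotation:
            return sequence, annotation
    return None, None


# Usage with a stand-in lookup table instead of a real API call.
known = {"PEPTIDEK": "Bacteria"}
print(first_matching_candidate([("PEPTIDEK", 80), ("PEPTLDEKAA", 85)], known.get))
```

In a real run, `lookup` would wrap an HTTP request to the URL produced by `build_pept2lca_url` and return the parsed response for that peptide.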


What does it do?

The Novobridge pipeline consists of 3 main parts:

  1. Unipept submission
  2. Taxonomic analysis
  3. Functional analysis

In Part 1: input files are read, parsed, filtered and submitted to Unipept for taxonomic and functional annotations.
In Part 2: Unipept taxonomic annotations are quantified, and visual outputs are generated.
In Part 3: Unipept functional annotations are matched to KEGG orthologies and quantified.


Running Novobridge+

  • Novobridge is designed as a single "tunable" Python script.
  • Novobridge does not offer command line options, but parameters can be altered in the script Novobridge.py.
  • The script will automatically loop through all files present in the folder input_peaks, located in the same folder as Novobridge.py
  • The input path can be altered by changing variable pathin in Novobridge.py
  • To run the script, simply open it in your interpreter of choice and run.
  • Outputs will be generated in folders: output_unipept, output_composition, output_function.
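
As an illustration of the file loop, the sketch below lists every file in an input folder whose extension is one of the supported file types. It is an assumption-laden stand-in for what the script does internally: the function name `list_input_files` and the `SUPPORTED` set are hypothetical, and only the folder name `input_peaks` and the variable name `pathin` come from this README.

```python
from pathlib import Path

# File types the README lists as supported inputs.
SUPPORTED = {".txt", ".tsv", ".csv", ".xls", ".xlsx"}


def list_input_files(pathin="input_peaks"):
    """Return the supported input files in the folder, sorted by path."""
    folder = Path(pathin)
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```

Calling `list_input_files()` from the folder that holds the script would return the files that a run is expected to process.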

What input files does it use?

  • It is recommended to use de novo sequence lists obtained from high resolution mass spectrometers. The pipeline was established and tested with data from QE Orbitrap mass spectrometers.
  • NovoBridge can work with file types .txt, .tsv, .csv, .xls, or .xlsx.
  • The only required input data to run Novobridge is a single column of peptides with header Peptide.
  • When filtering steps are required, Novobridge is designed to work with the output formats of the de novo sequencing software PEAKS and DeepNovo.
  • Apart from the input files, there are two utility files: keg.pkl, which is required for functional annotation, and Krona_template.xlsm, which is required for Krona-plot visualization. Both files can be created with the script download_utilities.py.
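
To make the minimal input format concrete, the following sketch reads a delimited text file that contains a Peptide column. It uses only the standard library and is not how the script itself parses files; the function name `read_peptides` is hypothetical.

```python
import csv
import io


def read_peptides(handle, delimiter="\t"):
    """Return the non-empty values of the 'Peptide' column."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    if reader.fieldnames is None or "Peptide" not in reader.fieldnames:
        raise ValueError("input needs a column with header 'Peptide'")
    return [row["Peptide"].strip() for row in reader if row["Peptide"]]


# Smallest valid input: one column with header Peptide.
example = io.StringIO("Peptide\nLSKPEPTIDER\nAAK\n")
print(read_peptides(example))  # → ['LSKPEPTIDER', 'AAK']
```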

What outputs does it generate?

The outputs generated by the pipeline are distributed over 3 folders. For each file in input_peaks the following outputs are generated:

  1. output_unipept: input file, annotated with unipept for each separate peptide.
  2. output_composition: quantified taxonomic distributions, krona plots and stacked bar charts.
  3. output_function: quantified KEGG pathways

By default, each input dataset generates one of each output for the normal data, and one of each for a randomized dataset of scrambled peptides, labelled `Rand_`.
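
One simple way to build such a randomized control is to shuffle the residues of every peptide, as sketched below. How the pipeline scrambles peptides (for example, whether the C-terminal residue is kept in place) is not described here, so this is only an assumed approach with a hypothetical function name.

```python
import random


def scramble_peptide(peptide, rng):
    """Return the peptide with its residues in random order.

    Composition and length are preserved; only the order changes.
    """
    residues = list(peptide)
    rng.shuffle(residues)
    return "".join(residues)


rng = random.Random(0)  # fixed seed so the control dataset is reproducible
scrambled = [scramble_peptide(p, rng) for p in ["LSKPEPTIDER", "AAKGGWVR"]]
```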

Parameter options

Parameters can be freely changed within the script Novobridge.py. There are several parameters that can be changed to include more stringent filtering for de novo peptides, and to change quantification methods.

Part 1: Unipept submission

Filter parameters

| Parameter | Default value | Description |
|---|---|---|
| ALC_cutoff | 40 | numeric, minimum required ALC score (Peaks score) |
| Score_cutoff | -0.1 | numeric, minimum required score cutoff (DeepNovo score) |
| ppm_cutoff | 20 | numeric, maximum allowed ppm |
| length_cutoff | 7 | numeric, minimum required peptide length |
| Area_cutoff | 0 | numeric, minimum required peak area |
| Intensity_cutoff | 0 | numeric, minimum required intensity |
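
As a rough illustration of how these cutoffs act on a PEAKS-style row, consider the sketch below. The parameter names and defaults come from the table above; the row keys `ALC`, `ppm`, and `Area`, the function name, and the use of the absolute ppm error are assumptions, and the DeepNovo score and intensity filters are left out for brevity.

```python
# Defaults taken from the Part 1 filter table.
FILTERS = {
    "ALC_cutoff": 40,
    "ppm_cutoff": 20,
    "length_cutoff": 7,
    "Area_cutoff": 0,
}


def passes_filters(row, filters=FILTERS):
    """Return True if a peptide row satisfies every cutoff."""
    return (
        row["ALC"] >= filters["ALC_cutoff"]
        # Assumption: the ppm cutoff applies to the absolute mass error.
        and abs(row["ppm"]) <= filters["ppm_cutoff"]
        and len(row["Peptide"]) >= filters["length_cutoff"]
        and row["Area"] >= filters["Area_cutoff"]
    )


rows = [
    {"Peptide": "LSKPEPTIDER", "ALC": 85, "ppm": -3.2, "Area": 1.2e6},
    {"Peptide": "AAK", "ALC": 90, "ppm": 1.0, "Area": 5.0e5},  # too short
]
kept = [r for r in rows if passes_filters(r)]
```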

Part 2: Taxonomic analysis

Filter parameters (also applied to Part 3: functional analysis)

| Parameter | Default value | Description |
|---|---|---|
| comp_ALC_cutoff | 70 | numeric, minimum required ALC score (Peaks score) |
| comp_Score_cutoff | -0.1 | numeric, minimum required score cutoff (DeepNovo score) |
| comp_ppm_cutoff | 15 | numeric, maximum allowed ppm |
| comp_length_cutoff | 7 | numeric, minimum required peptide length |
| comp_Area_cutoff | 0 | numeric, minimum required peak area |
| comp_Intensity_cutoff | 0 | numeric, minimum required intensity |
| cutbranch | 3 | numeric, minimum number of unique peptides per taxonomic branch in denoising |
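
The cutbranch denoising step can be pictured with the simplified sketch below, which works on a single taxonomic rank: taxa supported by fewer than cutbranch unique peptides are dropped. The real pipeline may apply this along the whole lineage; the function name and the `(peptide, taxon)` input layout are hypothetical.

```python
from collections import defaultdict


def cut_branches(assignments, cutbranch=3):
    """Keep only assignments whose taxon has enough unique peptides.

    assignments: list of (peptide, taxon) tuples at one rank.
    """
    peptides_per_taxon = defaultdict(set)
    for peptide, taxon in assignments:
        peptides_per_taxon[taxon].add(peptide)
    kept_taxa = {
        taxon for taxon, peptides in peptides_per_taxon.items()
        if len(peptides) >= cutbranch
    }
    return [(p, t) for p, t in assignments if t in kept_taxa]
```

Repeated observations of the same peptide count only once toward the threshold, because peptides are collected in a set per taxon.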

Quantification parameters

| Parameter | Default value | Description |
|---|---|---|
| comp_ranks | ["superkingdom","phylum","class","order","family","genus"] | list, which NCBI taxonomic ranks to annotate and quantify |
| tax_count_targets | ["Spectral_counts","Area","Intensity"] | list or string, on which value the quantification should be done |
| tax_count_methods | ["average","total","topx"] | list or string, how the quantification should be done |
| tax_topx | 5 | integer, the number of top hits selected, in case of topx quantification |
| normalize | False | boolean, normalize quantification to total for that rank |

Part 3: Functional analysis

Quantification targets

| Parameter | Default value | Description |
|---|---|---|
| Pathways | 09100 Metabolism; 09120 Genetic Information Processing; 09130 Environmental Information Processing; 09140 Cellular Processes | list, which KEGG pathways to annotate |
| cats | cat1,cat2,cat3,cat4 | list, on which levels of pathways to quantify |

Quantification parameters

| Parameter | Default value | Description |
|---|---|---|
| fun_count_targets | ["Spectral_counts","Area","Intensity"] | list or string, on which value the quantification should be done |
| fun_count_methods | ["average","total","topx"] | list or string, how the quantification should be done |
| fun_topx | 5 | integer, the number of top hits selected, in case of topx quantification |
| normalize | False | boolean, normalize quantification to total for that rank |

How are quantities calculated?

By default, taxa and KEGG pathways are quantified with 3 different methods and 3 different targets. The targets determine whether to count by Spectral_counts of peptides, by Area, or by Intensity, if they are available. The user can also supply custom columns as targets to count by, provided the parameters tax_count_targets or fun_count_targets are changed accordingly.

If the target is spectral counting, the only quantification method is the sum of total spectra. When quantification is done on Area, Intensity or a custom target, several quantification methods are available: average, which averages all amounts belonging to a pathway or taxon; total, which sums all amounts; and topx, which sums the topx largest amounts, where topx is set by tax_topx or fun_topx.
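
The three methods can be written down in a few lines. The sketch below is an illustration of the definitions above, not the script's own code; the function name `quantify` and its handling of empty input are assumptions.

```python
def quantify(values, method="total", topx=5):
    """Combine the amounts of all peptides assigned to one taxon or pathway."""
    if not values:
        return 0.0  # assumption: nothing assigned counts as zero
    if method == "total":
        return float(sum(values))
    if method == "average":
        return sum(values) / len(values)
    if method == "topx":
        # Sum only the topx largest amounts.
        return float(sum(sorted(values, reverse=True)[:topx]))
    raise ValueError(f"unknown method: {method}")


areas = [1, 2, 3, 4, 5, 6]
print(quantify(areas, "total"))         # → 21.0
print(quantify(areas, "average"))       # → 3.5
print(quantify(areas, "topx", topx=5))  # → 20.0
```

With fewer than topx values, topx gives the same result as total.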

As an example: if only spectral counts are desired as outputs, the parameter configuration could be changed to: tax_count_targets="Spectral_counts", tax_count_methods="", fun_count_targets="Spectral_counts", fun_count_methods=""


Licensing

The pipeline is licensed under the standard MIT license.
If you would like to use this pipeline in your research, please cite the following papers:

  • Hugo B. C. Kleikamp, Mario Pronk, Claudia Tugui, Leonor Guedes da Silva, Ben Abbas, Yue Mei Lin, Mark C.M. van Loosdrecht and Martin Pabst* Quantitative profiling of microbial communities by de novo metaproteomics, bioRxiv (2020) (accepted in Cell Systems)

  • Robbert Gurdeep Singh, Alessandro Tanca, Antonio Palomba, Felix Van der Jeugt, Pieter Verschaffelt, Sergio Uzzau, Lennart Martens, Peter Dawyndt, and Bart Mesuere. (2019). Unipept 4.0: Functional Analysis of Metaproteome Data. J. Proteome Res., 18, 606–615.

  • Kanehisa, M., & Goto, S. (2000). KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research, 28(1), 27–30.

Contact:

-Hugo Kleikamp (Developer): hugo.kleikamp@uantwerpen.be
-Martin Pabst: M.Pabst@tudelft.nl

Recommended links to other repositories:

https://github.com/unipept
https://github.com/marbl/Krona
https://github.com/nh2tran/DeepNovo
