1. Home

Natasha Pavlovikj edited this page Feb 21, 2021 · 5 revisions

ProkEvo is:

  1. An automated, user-friendly, reproducible, and open-source platform for bacterial population genomics analyses that uses the Pegasus Workflow Management System;
  2. A platform that can scale analyses from a few to tens of thousands of bacterial genomes using high-performance and high-throughput computational resources;
  3. An easily modifiable and expandable platform that can accommodate additional steps, custom scripts and software, user databases, and species-specific data;
  4. A modular platform that can run many thousands of analyses concurrently, if the resources are available;
  5. A platform in which memory and run-time allocations are specified per job, and the memory allocation is automatically increased on the next retry;
  6. A platform that is distributed with a conda environment and a Docker image containing all bioinformatics tools and databases needed to perform population genomics analyses.
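
The per-job allocation and retry behaviour described in item 5 can be illustrated with a minimal Python sketch. This is not ProkEvo's or Pegasus's actual code: the `JobSpec` class, the `OutOfMemory` exception, the `run_with_retries` helper, and the linear scaling policy (memory multiplied by the attempt number) are all hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


class OutOfMemory(Exception):
    """Raised by a job that needs more memory than it was given (hypothetical)."""


@dataclass
class JobSpec:
    """Per-job resource request (hypothetical structure, not a Pegasus type)."""
    name: str
    memory_mb: int       # memory requested for the first attempt
    runtime_min: int     # run time requested for each attempt
    max_retries: int = 3


def memory_for_attempt(spec: JobSpec, attempt: int) -> int:
    """Assumed policy: scale memory linearly with the 1-based attempt number."""
    return spec.memory_mb * attempt


def run_with_retries(
    spec: JobSpec, run: Callable[[JobSpec, int], None]
) -> Tuple[int, int]:
    """Run a job, requesting more memory after each out-of-memory failure.

    Returns the attempt number that succeeded and the memory it was given.
    """
    # One initial attempt plus up to `max_retries` retries.
    for attempt in range(1, spec.max_retries + 2):
        memory_mb = memory_for_attempt(spec, attempt)
        try:
            run(spec, memory_mb)
            return attempt, memory_mb
        except OutOfMemory:
            continue  # retry with a larger allocation
    raise RuntimeError(f"{spec.name} failed after {spec.max_retries} retries")


def fake_assembly(spec: JobSpec, memory_mb: int) -> None:
    """Stand-in for a real tool; pretends to need 5000 MB."""
    if memory_mb < 5000:
        raise OutOfMemory(spec.name)


if __name__ == "__main__":
    spec = JobSpec(name="assembly", memory_mb=2000, runtime_min=60)
    print(run_with_retries(spec, fake_assembly))
```

In this sketch each job carries its own request, so a memory-hungry step can start with a larger allocation than a lightweight one, and only the jobs that actually fail are rerun with more memory.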

To demonstrate the versatility of the ProkEvo platform, we performed population-based analyses of available genomes from three distinct pathogenic bacterial species as individual case studies (three serovars of Salmonella enterica, as well as Campylobacter jejuni and Staphylococcus aureus).

The specific case studies used reproducible Python and R scripts documented in Jupyter Notebooks and collectively illustrate how hierarchical analyses of population structures, genotype frequencies, and distribution of specific gene functions can be used to generate novel hypotheses about the evolutionary history and ecological characteristics of specific populations of each pathogen.

The scalability and portability of ProkEvo were measured with two datasets comprising significantly different numbers of input genomes (one with ~2,400 genomes, and the second with ~23,000 genomes) on two different computational platforms: the University of Nebraska high-performance computing cluster (Crane) and the Open Science Grid (OSG), a distributed, high-throughput cluster. Depending on the dataset and the computational platform used, the running time of ProkEvo varied from ~3 to ~26 days.