Some thoughts on comparing Roary to LS BSR
When the Roary paper (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4817141/) came out and reported differences between the size of the core genome between LS-BSR and Roary, I didn't think too much about it, as you will often see differences between methods on different datasets. However, I just reviewed a paper that made a blanket statement that Roary is more accurate than LS-BSR and is also much faster. This made me revisit the Roary paper and investigate these results using updated versions of both software packages. The purpose of this re-analysis is not a rebuttal of the results presented in the Roary paper, but represents an update using recent versions of both software packages.
Concern 1: LS-BSR overclusters sequences. In Table 1 of the Roary manuscript, the results clearly show that the USEARCH method in LS-BSR seems to overcluster simulated genome datasets. Without knowing which version of USEARCH was used, it's hard to reproduce these results, although USEARCH is the likely culprit. Also, LS-BSR has changed substantially since this test. I tested the most recent implementation of LS-BSR to see if this is a lingering problem or has been resolved with software updates. I received the 12 simulated genomes in the Roary paper from the authors. The following tests were performed with the following software versions:
VSEARCH: vsearch v2.5.0_linux_x86_64
I ran LS-BSR using all 3 clustering methods in conjunction with all aligners. In all cases, the results were better than what was described in the Roary paper, although the pan/core genome numbers didn't match exactly with what was reported by Roary. VSEARCH returned a smaller pan-genome than the other clustering methods.
Roary was run (v 3.11.0) with the following parameters:
time roary -o roary_out -f . -cd 100 -p 4 gffs/*gff
The time comparisons between pipelines showed that Roary did run faster than any cluster/alignment combinations in LS-BSR. In general, LS-BSR found 1 extra core gene compared to Roary and one less pan gene. To investigate further, I re-ran Roary to force the creation of the pan_genome_reference.fa file (using -e flag):
roary -o roary_out -f . -cd 100 -p 4 -e gffs/*gff
Using this method took significantly longer:
One of the proteins had 2 AA mismatches to the centroid produced by LS-BSR, which likely explains the differences in pan-genome numbers. Overall, such a small difference on a clustering of very similar proteins doesn't suggest that Roary is demonstrating significantly enhanced accuracy.
Thoughts on issue 1
Differences were observed in LS-BSR using different clustering methods and aligners, although the extent of the issue using updated clustering methods and LS-BSR source code has resulted in improved results compared to the USEARCH overclustering problem reported in the Roary manuscript.
Concern 2: LS-BSR is significantly slower than Roary. Table 2 in the Roary paper demonstrates that LS-BSR is significantly slower than Roary. If LS-BSR is used with TBLASTN, this wouldn't be surprising to me as TBLASTN performs 6 frame translations on each genome while Roary uses BLASTP. Additionally, the comparison of LS-BSR and Roary may not have been an apples to apples comparison as LS-BSR uses FASTA files (not annotated) and Roary uses GFF files. LS-BSR predicts coding regions with Prodigal, so that step is part of the pipeline and should be part of the time comparison. Roary requires gff files which means that genomes must be previously annotated (e.g using Prokka): annotations are not always publicly available. Running Prokka on 184 Salmonella genomes described below took between 5-12 minutes per genome to annotate (average was 7.2m). Running this on a cluster still took a significant amount of time and would be much worse on a single machine/laptop. Total system time to run all annotations was 1423 minutes (~24h) for 184 genomes; this was performed by adding the real time taken for each job and summing all values. Prokka was run with the following command:
time prokka --outdir prefix.prokka --prefix prefix --locustag prefix --genus Salmonella --species enterica --usegenus --force prefix.fasta
As a direct comparison, we downloaded SRA data from the study described in the Roary paper (ERP001718). We assembled all genomes with SPAdes and removed a number of assemblies (n=16) with a very small assembly size. This left us with 184 S. enterica genomes (https://gist.github.com/jasonsahl/d61b3c6762ca43080bb666c63c255c42). I ran these through LS-BSR with the following command:
time ls_bsr.py -d fasta_files -b blat -c usearch -p 8 -x bsr_results
Pan-genome size = 5869
Core-genome size = 4516
I then ran Roary with the following command:
time roary -o roary_out -f . -cd 100 gffs/*gff -p 8
Pan-genome size = 5813
Core-genome size = 4442
I also ran LS-BSR using CD-HIT and DIAMOND, which uses peptides on gene predictions (from Prodigal) instead of genomes, for a more appropriate comparison:
time ls_bsr.py -d fastas/ -b diamond -c cd-hit -p 8 -x diamond_cdhit
Pan-genome size = 5861
Core-genome size = 4427
Thoughts on issue 2
Time comparisons are much closer than what was reported in the Roary manuscript. If using Diamond, a new addition to the pipeline, LS-BSR runs faster. Roary compares proteins, while LS-BSR uses can use both nucleotides and proteins, but protein searches using TBLASTN are much slower. When comparing time between pipelines, the up front time to run Prokka should be considered on genomes where the gff files are not already available. Time for annotation of coding regions after the pipeline, as is done as a post-pipeline method in LS-BSR, could also be considered in a direct time comparison. The point of this analysis was to demonstrate that the new implementation of LS-BSR is not significantly slower than Roary, and may be faster depending on the method used.
A comparative analysis against a larger set of genomes
To test the ability/time to process a larger number of genomes, we downloaded all S. aureus genome assemblies from GenBank and filtered those out that failed several internal quality filters, leaving us with 2805 genomes (https://gist.github.com/jasonsahl/024fcc941fe940ff183b55b9dc1b8c5c). First they were analyzed with LS-BSR with the following command in LS-BSR:
time ls_bsr.py -d ../genomes/ -c cd-hit -b diamond -p 8 -x Sa_diamond_cdhit -z F
Pan-genome size = 33153
Core-genome size = 542
For comparison, we also ran Roary, first by generating gff files, which took an average of 2.5m and a total time of 6912m (115h). Roary was then run with the following command:
time roary -o roary_out -p 8 -f . -cd 100 ../gffs/*gff
Pan-genome size = 24697
Core-genome size = 102
Here, the pan-genome size is much larger using Roary. LS-BSR finished ~80% faster than Roary. The small strict core genome likely represents variable quality in some portion of the genome assemblies.
When choosing a method for comparative pan-genomes, there are differences between pipelines that need to be understood:
- LS-BSR provides BSR values and Roary provides presence/absence values. LS-BSR provides a user the flexibility to change thresholds when comparing genes between bacterial groups, while Roary output is perhaps more straightforward to interpret
- Since Roary uses GFF files, functional information for differences between genes is integrated into the output, which can be useful for making biological conclusions
- LS-BSR has integrated coding region prediction, which may not be as thorough as Prokka, but can result in less up front time requirements and still provide comparable results
- For diagnostics, the use of nucleotides may be more useful than proteins for finding differences between groups. For functional differences between groups, proteins are more useful. Single base deletions can cause large differences in proteins, but negligible differences in nucleotides
- Roary is highly dependent on annotation for pan-genomic comparisons. LS-BSR uses coding region prediction, but then aligns the region back against the entire genome to calculate the BSR. This ensures that differences in annotation are substantiated against the raw genome. This is one of the reasons that LS-BSR is much slower when using TBLASTN alignments. However, LS-BSR can now use blastp and Diamond to compare peptides against gene predictions, which improves the speed, but may reduce the accuracy.
- LS-BSR can separately screen selected genes against assemblies to look at patterns of presence/absence, while Roary does all vs. all comparisons only.
Overall, the current LS-BSR implementation is not slower than Roary and may be faster depending on the dataset analyzed. Furthermore, the claim that Roary is more accurate than LS-BSR is not substantiated using recent software implementations; differences between one manufactured dataset appear to be based on the result of one event of clustering highly similar proteins.