We are happy to announce anvi'o v7 with the code name "hope"!
After more than 3,000 changes that introduced about 35,000 new lines of code, this stable release of anvi'o represents one of the largest leaps forward in the history of the platform and introduces many new features, performance improvements, and fixes for known bugs.
This page intends to give you a summary of some of the notable changes that come with hope.
The code name recognizes Hope E. Hopps as a tribute to all laboratory technicians whose contributions have often been poorly recognized in science. This is despite the fact that technicians not only ensure accuracy, efficiency, and reproducibility in any laboratory, but also push the boundaries of science as much as any other member of their groups, if not more in many cases. Hopps was a specialist in infectious diseases and in 1966 she developed, together with Harry M. Meyer and Paul J. Parkman, a highly effective vaccine for rubella, a viral infection which caused more than 30,000 stillbirths in the United States alone between 1962 and 1965. Despite her role in the vaccine development, in a historical photograph by the NIH that portrays the rubella vaccine development team, Hopps was only identified as "Female Lab Technician" until recently, even though the caption of the same photograph explicitly named Meyer and Parkman. The unfair treatment of laboratory technicians remains commonplace in today's science. In fact, "not more than a technician's job" can serve as an argument for professors when they wish to refuse the recognition of one's contributions to science. We can't ignore the significant progress we have made as a community during the past few years. But while we continue working on increasing diversity, equity, and inclusion in science, we must also recognize and face the implicit and explicit biases against those in science who are not PIs, post-docs, or graduate students.
Disclosures: The code name was a suggestion by Alon Shaiber, a Genomics Data Scientist at Weill Cornell Medicine. The release notes were written by Meren and proofread by Iva Veseli. Alon, Meren, and Iva are among the developers of anvi'o.
New help pages for anvi'o programs and artifacts
As anvi'o developers, we always knew the critical importance of providing our users with extensive tutorials so they can find their way through their data themselves. However, as anvi'o matured, the number of anvi'o programs and artifacts increased dramatically. This created a bottleneck since every anvi'o tutorial assumed that our users knew about the common concepts in anvi'o (such as 'the profile database' or 'a collection') or common anvi'o programs (such as 'anvi-profile' or 'anvi-interactive'). Solving this fundamental problem required us to think of an entirely new technical approach to our documentation that is now in place.
We have now implemented a system (#1425) that makes two things possible:
- (1) For anvi'o developers, a means to quickly describe their contributions without leaving the environment where they write code (for instance, here is the description of 'collection' in the codebase),
- (2) For anvi'o users, a means to be able to see that information on a web page where all anvi'o programs and concepts are interconnected (for instance, here is the description of 'collection' on the web page).
This way, anvi'o could accumulate information from its developers without burdening them and present it to its users in a way where self-learning is possible. However, there was one significant problem: retrospectively describing all the things that have already been implemented in the codebase. Enter Jessica Pan (@Jessica-Pan), an undergraduate student at MIT. Jessica took the responsibility of describing existing anvi'o programs and artifacts a few months ago, and with the guidance of other anvi'o developers, she was able to populate this technical framework with her words and descriptions (#1470), which added more than 200 files and tens of thousands of words of documentation to the codebase.
With this release, we are happy to also release the first outputs of this documentation project here, with the hope that it will make your life with anvi'o a little easier going forward:
Perhaps it will not be a surprise to the long-term anvi'o users that this documentation system is also connected to our command-line programs. Thanks to this, they will be able to offer you more useful help menu outputs. For instance, if you were to type anvi-interactive --help in your terminal in v7, you would see the following section at the end of the help menu, so you can click on the link to go to the online description of the program and browse through examples and artifacts associated with it:
If you visit the help pages you will see there are 'edit' links under every file. It is our way of inviting the rest of the community to contribute to these pages with their own experiences with anvi'o tools. If you have ideas to make it better, come to our Slack channel for a discussion, or file a GitHub issue. We are all ears.
Significant performance improvements
This version will likely be remembered for significant performance improvements by multiple heroes, including Evan Kiefl (@ekiefl), Iva Veseli (@ivagljiva), and Ryan Moore (@mooreryan). Here is a glimpse of what happened in v7 compared to the previous stable release:
Profiling BAM files is one of the most critical steps in anvi'o and the program anvi-profile has been a nightmare for memory and processing time. Thanks to @ekiefl's significant improvements that influenced the runtime and memory requirements of the profiling step (#1362, #1339), anvi-profile is now ~17 times faster.
One of the first steps in any anvi'o workflow is turning boring FASTA files into talented contigs databases that can be used by many anvi'o programs. Yet, the program anvi-gen-contigs-database had not been multithreaded, which had been a significant performance bottleneck (#1344, #1431). Responding to our plea on Twitter, @mooreryan has made remarkable contributions to the anvi'o codebase (while wrapping up his PhD, that is) (#1437, #1468, and #1445). As a result, anvi-gen-contigs-database is now multithreaded, at least two times faster than before, and can take advantage of your fancy clusters to be even more efficient.
After making our users who care about the quality of their MAGs go through so much pain and suffering for years, anvi-refine in this release is ~13 times faster after significant improvements in its memory requirements (#1455, #1458), thanks to help from Xabier Vázquez-Campos (@xvazquezc).
Our dealings with HMMs also benefited from major performance improvements. Thanks to @ivagljiva's efforts (#1413), two frequently used programs that rely on HMMER, anvi-run-hmms and anvi-run-pfams, will perform HMM annotations ~3x faster.
An integrated subsystem for metabolic reconstruction
We are also thrilled to announce that starting with this release, anvi'o includes a new suite of programs for predicting metabolic capabilities for genomic and metagenomic data, thanks to @ivagljiva's extensive work that started with #1413. The new programs in this release rely on the extensive set of resources in the Kyoto Encyclopedia of Genes and Genomes (KEGG) for gene annotation and metabolism estimation, although in future releases we will expand source resources for metabolic reconstruction.
As a part of this subsystem, this release introduces a new database, the anvi'o MODULES.db, which is generated from parsed KEGG data files (such as KOfam HMM profiles and text-based descriptions of KEGG MODULE) and used by subsequent programs detailed below for easy, organized access to metabolic data. A version tracking system ensures metabolism estimation is run using the same MODULES.db that was used to annotate a given CONTIGS.db. Here are the key programs for the anvi'o subsystem for metabolism:
The program anvi-setup-kegg-kofams generates a MODULES.db and stores it on your server or the local computer.
The program anvi-run-kegg-kofams annotates genes in a given anvi'o contigs database with KEGG Orthology (KO) numbers (via hits to the KEGG KOfam database). This program is included in the anvi'o snakemake workflows Alon Shaiber (@ShaiberAlon) had introduced, which enables bulk annotation of several contigs databases with a single command.
The program anvi-estimate-metabolism predicts metabolic potential in a given set of sequences by integrating KO annotations with KEGG MODULE information. These estimates are integrated into anvi'o in various ways and can be summarized in flat text files. In addition to contigs databases, and optionally profile databases and collections, the program accepts internal, external, or metagenomes files as input. The program is able to work with a variety of output options.
We look forward to the input from the community to offer improved and integrated metabolic insights into microbial genomes and metagenomes.
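To give a rough sense of what integrating KO annotations with module definitions can mean, here is a minimal Python sketch. It is not the algorithm anvi-estimate-metabolism implements: the module definition, the placeholder KO labels, and the function name `module_completeness` are all hypothetical, and completeness is simply assumed to be the fraction of module steps for which at least one alternative KO was annotated.

```python
def module_completeness(module_steps, annotated_kos):
    """Return the fraction of steps that have at least one of their
    alternative KOs among the annotated KOs (toy definition)."""
    if not module_steps:
        return 0.0
    satisfied = sum(1 for alternatives in module_steps if alternatives & annotated_kos)
    return satisfied / len(module_steps)


# Hypothetical three-step module; each step lists alternative KOs.
toy_module = [{"KO_A"}, {"KO_B", "KO_C"}, {"KO_D"}]

# Hypothetical KO annotations found in one genome.
genome_kos = {"KO_A", "KO_C", "KO_Z"}

completeness = module_completeness(toy_module, genome_kos)
print(f"toy module completeness: {completeness:.2f}")
```

In this toy case two of the three steps are covered, so the sketch reports a completeness of about 0.67.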
New tools for Transfer RNA biology
Among the most significant advances in this release are the new tools developed by Samuel Miller (@semiller10) for the study of Transfer RNA transcripts that are generated by tRNA-seq, a sequencing protocol developed by the members of Tao Pan's group at the University of Chicago. This sequencing strategy aims to offer high-resolution insights into the translational regulation of cells by revealing changes in the abundance and chemical modifications of tRNA transcripts across environmental conditions. While the sequencing strategy makes accessible tRNA transcripts from diverse environments, the extremely complex data generated by this strategy requires completely new computational approaches not only to characterize tRNA sequences with their structural properties but also to resolve chemical modification fractions and their taxonomy. We hope that the more than 10,000 lines of code @semiller10 has created behind anvi-convert-trnaseq-database (primarily through #1509 and #1615) and their cousin programs anvi-estimate-trna-taxonomy, will set the stage for new horizons that can bring more RNA biology into the anvi'o ecosystem in an integrated fashion.
Ability to profile INDELs in mapping results
Anvi'o offers a powerful and comprehensive framework to enable in-depth investigations of microbial population genetics. Yet, these insights have so far been limited to single-nucleotide variants, single-codon variants, and single-amino acid variants as reported and/or used by anvi-gen-variability-profile, anvi-display-structure, and anvi-gen-fixation-index-matrix or as displayed in various interactive interfaces.
This version comes with a completely redesigned anvi-profile, which is now able to characterize INDELs in read recruitment results thanks to @ekiefl's additions to the codebase especially in #1394. The ability to characterize INDELs pushes the boundaries of our ability to make sense of microbial population genetics through metagenomes, and we hope you will find many gems in your data, as well.
Currently, anvi'o reports INDEL tables that are ready to go into R or any other statistical analysis or visualization environment that you can obtain via the program
To benefit from this feature in your existing projects, you will need to re-profile your BAM files ... which should be easy-peasy if you have been using anvi'o snakemake workflows ;)
Improvements in the interactive interface
The power of anvi'o in the command line is complemented by its interactive interfaces, which also benefit from numerous new features in this release. But perhaps the most critical improvements were those contributed by Isaac Fink (@isaacfink21), an undergraduate student at the University of Chicago, who revamped the inspection page (in #1466 and others).
Not only are we now able to visualize single-nucleotide variants better,
but also the interactive interface is now able to visualize INDELs in (meta)genomic/(meta)transcriptomic read recruitment results,
so you can spend EVEN MORE TIME looking at your coverages.
Isaac Fink's revamped inspection page comes with a 'settings' panel, organizing all the features this page offers, which looks like this:
There are other exciting features that will likely make those who use anvi'o for pangenomics and phylogenomics very happy.
The anvi'o interactive interface was not programmed to show support values for phylogenomic trees, which was a long-standing item (#1450) in the list of feature requests we had. Matthew Lawrence Klein (@matthewlawrenceklein), who joined our team only a few weeks ago, managed to address this shortcoming through #1618 thanks to the guidance we received from Tom Delmont. Result? Anvi'o can now visualize branch support values on trees when applicable:
While this is a preliminary implementation of this feature, we are looking forward to the feedback we will receive from the community to improve it.
Another critical shortcoming was the difficulty of selecting multiple items using the interactive interface when there was no tree or dendrogram to guide the selection of objects. This happened especially when working with pangenomes, where ordering gene clusters based on their synteny is a common need, yet selecting regions of interest required clicking on each item individually. @matthewlawrenceklein also addressed this through #1614, and it is now possible to select a range of items in a straightforward fashion:
On top of that, after selecting regions of interest in a pangenome, it is now possible to take a quick peek at the functions that gene clusters encode through the interactive interface without having to run anvi-summarize:
Other fancy tools and functionality
Ability to estimate per-residue binding frequencies for protein structures with anvi-run-interacdome by Evan Kiefl. This enables very highly resolved analyses of environmental variants in conjunction with the structural context of proteins and their ligand binding sites. If this sounds interesting to you, read Evan's journey implementing this feature, see his gargantuan pull request at #1472, or read this paper by Shilpa Nadimpalli Kobren and Mona Singh to get yourself familiarized with the goal here.
Ability to extract gene loci or operons from any genomic context with anvi-export-locus by Matthew Schechter (@mschecht) and others. By enabling high-throughput recovery of loci of interest across any number of genomes by marking genes with their functional annotations or HMM hits, this strategy makes it possible to ask very specific questions regarding the gene content, evolution, and ecology of genomic operons. You can read Matt's tutorial here, or see his pull request at #1386.
Generalization of the functional enrichment analysis in #1500 by Iva Veseli. This statistical approach was initially developed by Amy Willis (@adw96) and was implemented in the anvi'o program anvi-get-enriched-functions-per-pan-group. As the name suggests, this tool was specific to studying functional enrichment in pangenomes. Thanks to Iva's contribution, the new program anvi-compute-functional-enrichment is now able to work with pangenomes, metabolic pathways, and internal or external genomes to study functional enrichment statistics between distinct groups of entities.
And many more.
A list of new anvi'o programs
This release comes with the following programs that were not in the previous stable release: anvi-convert-trnaseq-database, anvi-display-metabolism, anvi-estimate-metabolism, anvi-estimate-trna-taxonomy, anvi-run-interacdome, anvi-run-kegg-kofams, anvi-run-trna-taxonomy, anvi-script-augustus-output-to-external-gene-calls, anvi-script-fix-homopolymer-indels, anvi-script-gen-pseudo-paired-reads-from-fastq, anvi-script-get-primer-matches, anvi-script-pfam-accessions-to-hmms-directory, anvi-script-tabulate, anvi-setup-interacdome, anvi-setup-kegg-kofams, anvi-setup-scg-taxonomy, anvi-setup-trna-taxonomy, anvi-trnaseq.
Anvi'o as a community platform
We have always imagined anvi'o as a community platform, and we are getting there. Even this very release is a product of voluntary contributions of many members of the anvi'o community, who slowly shape this open-source software ecosystem for integrated multi-omics.
A few weeks ago we published a comment that was authored by those who are mentioned in these release notes and many more who have been supporting anvi'o in many ways to make it more accessible to the community of microbiologists.
Our paper ends with this statement:
(...) As an open-source platform that empowers microbiologists by offering them integrated yet uncharted means to steer through complex ‘omics data, anvi’o welcomes its new users and contributors.
We thank you for your interest in anvi'o and for your patience with it, in advance. We hope that anvi'o will continue to empower you in 2021 so you can find the answers you are looking for in the avalanche of data that surrounds you.
See our up-to-date installation instructions here, which include docker and conda solutions and ways to reach out to the anvi'o community for help if you run into a problem.
After nearly 9,000 changes that introduced about 16,000 new lines of code, the current version of anvi'o represents many fixes to big and small bugs, as well as new features. This page intends to give you a summary of the most notable changes that come with esther.
The codename is a small tribute to Esther Lederberg (1922-2006), an American microbiologist who studied plasmids and bacterial viruses. Lederberg discovered lambda phage, an E. coli virus that is commonly used in bacterial genetics and molecular biology to deliver DNA into a recipient organism. This led to her description of specialized transduction, which occurs when a prophage improperly excises from the host chromosome carrying host DNA in addition to the viral DNA. In collaboration with her husband, Lederberg developed the technique known as replica plating, which allows repeatable inoculation of bacterial colonies. Lederberg and Luigi L. Cavalli-Sforza discovered the Fertility factor or F-plasmid in E. coli. This is a sequence of DNA that lets the host cell transfer genetic material via a rod-like structure into recipient cells (conjugation). Despite her many incredible scientific accomplishments, she was constantly overshadowed by her husband. She was not appointed to a tenured position while they were both faculty at Stanford, and after their divorce she had a difficult time retaining her appointment. We dedicate anvi'o version 6 to the memory and revolutionary discoveries of Dr. Lederberg.
Real-time estimation of genome taxonomy
Working with genomes often requires insights into their taxonomy. This becomes a critical need especially in genome-resolved metagenomics studies as we are burning to find out where the genomes we reconstruct from metagenomes fit in the tree of life. Until esther, anvi’o did not offer anything to address this need. This new version, however, comes with a novel solution that covers both the interactive interface during binning:
and the terminal environment to survey existing collections of genomes:
These two examples are from the infant gut dataset by Sharon et al. (2013), which we often use to demonstrate anvi'o features, but we can't wait to hear from you to learn about your experience with this feature.
Please see this article for usage details, our thanks to The Genome Taxonomy Database for making their raw data public, and potential caveats of our approach:
A new tool for genome de-replication
De-replication is a critical need to minimize bias in metagenomic read recruitment analyses. In our previous studies we had performed de-replication with a series of Python scripts, but no more. Thanks to Mahmoud Yousef and Evan Kiefl's efforts, we now have two new programs, anvi-compute-genome-similarity and anvi-dereplicate-genomes, which are integrated with metagenomic and pangenomic workflows in anvi'o and use sourmash and PyANI in the backend.
A tutorial for their usage is on the way!
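Until then, the general idea of de-replication can be illustrated with a small Python sketch. This is a toy greedy clustering under stated assumptions, not a description of how anvi-dereplicate-genomes works: the genome names, the pairwise similarity values, the 0.95 threshold, and the function name `dereplicate` are all hypothetical.

```python
def dereplicate(genomes, similarity, threshold=0.95):
    """Greedy clustering: each genome joins the first representative it is
    at least `threshold` similar to; otherwise it founds a new cluster."""
    clusters = {}
    for genome in genomes:
        for representative in clusters:
            if similarity[frozenset((genome, representative))] >= threshold:
                clusters[representative].append(genome)
                break
        else:
            # No existing representative was similar enough.
            clusters[genome] = [genome]
    return clusters


# Hypothetical pairwise similarities (e.g., ANI expressed as a fraction).
toy_similarity = {
    frozenset(("G1", "G2")): 0.99,
    frozenset(("G1", "G3")): 0.80,
    frozenset(("G2", "G3")): 0.81,
}

clusters = dereplicate(["G1", "G2", "G3"], toy_similarity)
print(clusters)
```

Here G1 and G2 collapse into a single cluster represented by G1, while G3 remains on its own. The order in which genomes are visited decides which one becomes the representative, which is one of the design choices a real tool has to make explicitly.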
Support for more binning algorithms
In previous versions of anvi'o we had a native module for CONCOCT, one of the popular binning algorithms for automatic clustering of contigs into genome bins. We have changed that behavior in this version. You will still be able to use the program anvi-import-collection to import binning results from ANY binning software as before, but anvi'o will also be able to automatically use binning tools existing on your system through our new program anvi-cluster-contigs. Here is a command line output to give you a sense of it:
This framework is highly modular, so the integration of new binning algorithms is extremely straightforward thanks to Özcan Esen's excellent design. If you are a programmer you can take a look at the module for MaxBin2 or BinSanity to develop one for your algorithm for benchmarking or testing efforts.
Effective ways to inspect and visualize contig coverages
Recognizing the importance of actually 'looking' at data, we have been putting a lot of emphasis on the inspection capabilities of anvi'o. When it comes to metagenomic read recruitment and coverages, inspecting contigs can be critical to gain deeper insights into what is actually going on.
In this version we have two new programs. The first one is anvi-inspect. The inspect page of anvi’o is very useful for careful examination of contig coverages and single nucleotide variants. Sometimes this might even be all you want. This new program enables you to immediately pull up the inspection page of a given contig without going through the whole hassle of opening the interactive interface.
We often feel the need to put coverage patterns of contigs in presentations or publications. Yet it becomes challenging when there are too many samples in a dataset as it makes it harder to study or save patterns comfortably using the interface. So we thought it would be very useful if anvi'o could export coverage statistics using ggplot, but we didn't know enough R to be able to do this properly. As a result, we did what anyone who wishes to work with talented people would do and asked for help on Twitter:
Our call for help was heard by Ryan Moore, who actually developed a new anvi'o program that did exactly what we thought you would need, and much more: anvi-script-visualize-split-coverages (we sent him an anvi'o t-shirt as a token of our deep gratitude for his contribution, but we never got a photo back, so we don't know whether he is wearing it).
This program can export split coverages along with single-nucleotide variants on them into PDF files for even very large numbers of samples. It uses the output files anvi'o generates through anvi-get-split-coverages and optionally anvi-gen-variability-profile. The output is customizable with respect to plot color, axes, SNV color and grouping of samples. The tutorial for this feature will soon be on our web page.
Improved genome completion/redundancy estimates
New single-copy core gene collections
Starting with this version, we no longer use the Campbell et al. and Rinke et al. single-copy core gene (SCG) HMM sets to estimate completion of bacterial and archaeal genomes. Instead, we are using a modified version of the bacterial single-copy core gene collections Mike Lee recently described, and a set of BUSCO HMMs Tom Delmont curated. Now anvi'o can estimate the completion of bacterial, archaeal, and protist genomes (#1150).
New random forest domain of life classifier
In previous versions anvi'o has relied on multiple heuristics to predict the domains of selected contigs or genomes for the determination of which SCG collection to use to estimate and display completion and redundancy. In this version we have a brand new random forest classifier to take care of this challenging task. This robust classifier with appropriate addition of noise solves this issue like magic, and when you have a bunch of genomes, it gives you proper estimates in the interface (the example is also from the infant gut dataset),
or in the terminal,
Undo/Redo for the interactive interface
Yes. This feature is finally here. Now when you make a mistake while curating or refining your genomes using anvi-interactive or anvi-refine, you will be able to use the Ctrl + Z and Ctrl + Shift + Z key combinations to undo and redo your binning decisions. If you can't contain your emotions, consider taking Özcan Esen for a coffee for this excellent feature :)
A new tool to extract target loci from genomes and metagenomes
Some genetic analyses call for the comparison of specific genetic loci between genomes. For example, one may be interested in investigating evidence for adaptive evolution of the lac operon between different E. coli strains by extracting all loci from different genomes. Anvi'o esther comes with a very talented tool, anvi-export-locus, that will help you extract target loci from a larger genomic context, whether those contexts are genomes or metagenomic assemblies.
This tool cuts out loci using two approaches: default mode or what we call flank-mode. In the default mode, the tool locates a designated anchor gene, then cuts upstream and downstream context based on user-defined input. Flank-mode, on the other hand, locates designated genes that surround the target locus, then cuts in between them. Target genes of interest to locate anchors for excision can be defined through their specific ids in anvi'o or through search-terms that query functional annotations or HMM hits stored in your contigs database!
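The difference between the two modes can be sketched in a few lines of Python. This is only an illustration on an ordered list of gene names, not the anvi-export-locus implementation: the function names `cut_default` and `cut_flank`, the toy contig, and the choice to include the flanking genes in the output are assumptions, and strand orientation is ignored.

```python
def cut_default(genes, anchor, upstream, downstream):
    """Default mode: keep `upstream` genes before and `downstream` genes
    after the anchor gene, along with the anchor itself."""
    i = genes.index(anchor)
    start = max(0, i - upstream)
    return genes[start:i + downstream + 1]


def cut_flank(genes, first_flank, second_flank):
    """Flank mode: keep everything between the two flanking genes
    (here the flanking genes themselves are included, by assumption)."""
    i, j = sorted((genes.index(first_flank), genes.index(second_flank)))
    return genes[i:j + 1]


# Hypothetical gene order on a contig.
toy_contig = ["g1", "g2", "lacI", "lacZ", "lacY", "lacA", "g7"]

default_locus = cut_default(toy_contig, "lacZ", upstream=1, downstream=2)
flank_locus = cut_flank(toy_contig, "g2", "g7")

print(default_locus)
print(flank_locus)
```

The default mode needs only one recognizable gene plus a guess about how much context to keep, whereas the flank mode needs two recognizable genes but makes no assumption about the size of the locus between them.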
Much faster HMMs
You complained, we heard (hehe). In anvi'o esther we finally fixed the sluggish speeds of HMM operations from which you have suffered even when you assigned multiple threads to anvi-run-hmms. Özcan Esen revamped our code and has improved our speed dramatically with increasing number of threads given to anvi'o. Our tests indicate that speed gains roam around as much as four gazillion.
Much better functional enrichment analyses for pangenomes
Anvi'o esther comes with a new version of anvi-get-enriched-functions-per-pan-group thanks to the invaluable statistics input and code we have received from Amy Willis (@AmyDWillis). Please take a look at our tutorial on pangenomics for details.
Anvi'o gets better at helping you
Getting offline help from anvi'o has been difficult. Recognizing this limitation, Evan Kiefl created the program anvi-help that will help you find your way through anvi'o by simply asking anvi'o what it has to do X. Here is an example. You type the following,
And you get back this:
We are also very thankful for our users, whose feature requests, bug reports, and patience continue to give us energy to push things forward (although I can promise that we are not going to be pushing anything anywhere for a week or two after this release as we all just want to take a very long nap).
Finally, we thank all the open-source software developers and data curators everywhere. Without them none of these would have ever existed.
We hope esther helps you with your research
To read the updated installation instructions for v6, please visit http://merenlab.org/install-anvio
We are happy to announce a new version of anvi'o, "margaret".
After nearly 1,500 changes that introduced about 15,000 new lines to the anvi'o codebase and removed about 4,000 from it, the current version includes many fixes to big and small bugs, as well as new features. This page intends to give you a summary of the most notable changes that come with margaret.
The codename is a small tribute to Margaret Oakley Dayhoff, an American physical chemist, who is known as the founder of bioinformatics. Dayhoff developed the first programmable computer methods to compare protein sequences, and published in 1965 a book titled "Atlas of Protein Sequence and Structure", which is considered to this day the first textbook of bioinformatics. The codename was suggested by Mick Watson, and won the popular vote on Twitter. Dayhoff sadly died at the early age of 57 in 1983, shortly before bioinformatics emerged as a distinct field. However, her astonishing contributions to life sciences, such as the development of essential approaches for protein sequence comparison and evolutionary tree construction, still constitute some of the most common approaches in our bioinformatics toolkit.
Your new disconcerting toy: GC-content overlaid on reference contexts
Metagenomic read recruitment often results in wavy coverage patterns in the reference context. This phenomenon, which can be attributed to three major sources, can result in up to an order of magnitude coverage difference for genes within the same contig. While we are kind enough to leave those alone who solely work with metagenomic short reads to quantify functions in metagenomes in their blissful world, we wanted to include in this version of anvi'o something so you can overlay GC-content change throughout your contigs to see whether variation you observe in the context of some of your key genes is largely driven by GC-content or not:
This is not yet anything but a qualitative insight for you to make sense of to what extent variation in coverage could be explained by deterministic factors that have nothing to do with the biology of your system given the metagenome, but it shows that more quantitative insights into this could be useful. We will think about this going forward, and we are open to your suggestions!
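For readers who want a feel for the quantity being overlaid, here is a minimal Python sketch of a sliding-window GC-content calculation. It is not taken from the anvi'o codebase: the window size, the step, the toy sequence, and the function name `gc_content_windows` are assumptions made for illustration.

```python
def gc_content_windows(sequence, window=10, step=5):
    """Return (start, GC fraction) pairs for windows sliding along a sequence."""
    seq = sequence.upper()
    profile = []
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        chunk = seq[start:start + window]
        if not chunk:
            continue  # nothing to measure for an empty sequence
        gc = chunk.count("G") + chunk.count("C")
        profile.append((start, gc / len(chunk)))
    return profile


# Toy contig: an AT-rich half followed by a GC-rich half.
toy_contig = "ATATATATATGCGCGCGCGC"

profile = gc_content_windows(toy_contig, window=10, step=5)
for start, gc in profile:
    print(start, gc)
```

Plotting such a profile next to per-nucleotide coverage is one simple way to eyeball whether dips and peaks in coverage follow the GC-content of the reference.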
A new anvi'o workflow management system for serious anvians
This new version of anvi'o includes a new program, anvi-run-workflow, which provides an interface to our new module that implements snakemake-based anvi'o workflows.
These workflows offer accessible, reproducible, and comprehensible solutions for complex analyses that may include hundreds of samples. We have been using anvi-run-workflow every day in our lab since it first appeared in our master repository, and we are happy to make its power available to you as soon as we could.
There will be an extensive tutorial very soon, but until then you can send your questions to Alon (smiley).
Single-codon variants for a more powerful framework to study microbial population genetics
Anvi'o already could make sense of single-amino acid variants (SAAVs) in environmental metagenomes. But working with SAAVs was limiting our ability to infer and quantify neutral processes that may not result in changes in the amino acid sequence. We changed our design so that anvi-profile can now characterize single-codon variants (SCVs) if the --profile-SCVs flag is declared. We updated our reference manual for variability analysis to include new sections describing SCVs and SAAVs.
With SNVs, SCVs, and SAAVs, anvi'o v5, deserving of its codename, offers a robust framework to investigate population genetics of environmental microbes, while SCVs and SAAVs leverage our ability to tease apart evolutionary forces acting upon them. We hope you enjoy these new toys, and feel free to get in touch with us if you have questions or suggestions.
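Why codon-level variants carry information that amino acid-level variants miss can be shown with a tiny Python sketch. It is a conceptual illustration only, not how anvi-profile reports variation: the function name `classify_codon_variation` and the return labels are hypothetical, and the codon table is a three-entry excerpt of the standard genetic code.

```python
# Small excerpt of the standard genetic code, enough for the example.
CODON_TO_AA = {"CTG": "Leu", "CTA": "Leu", "ATG": "Met"}


def classify_codon_variation(reference, observed):
    """Say whether a codon difference is visible at the codon level only
    or at both the codon and the amino acid level."""
    if reference == observed:
        return "none"
    if CODON_TO_AA[reference] == CODON_TO_AA[observed]:
        return "SCV only (synonymous)"
    return "SCV and SAAV (non-synonymous)"


synonymous = classify_codon_variation("CTG", "CTA")
non_synonymous = classify_codon_variation("CTG", "ATG")

print(synonymous)
print(non_synonymous)
```

The CTG to CTA change leaves leucine in place, so it would be invisible in an analysis based on SAAVs alone, whereas the CTG to ATG change replaces leucine with methionine and shows up at both levels.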
Visualize environmental variation on protein structures through the new Structure DB
Our efforts to push the boundaries of investigations of environmental variation within microbial populations reach a new level in this release with a brand new ability about which we are very excited: linking variation to predicted protein structures.
With the new workflows associated with the structure database, anvi’o can predict the tertiary structure of genes identified from a contigs database using the Protein Data Bank. Then, it can directly overlay onto the predicted protein structures the variability data from your metagenomes in the form of SCVs and SAAVs. All of this is accomplished in just two new programs,
We believe that this nexus between structural biology and metagenomics will elevate environmental metagenomics into the realm of biophysics, and enable investigations into evolutionary processes driving the diversity of proteins that could not be learned from sequence analyses alone.
Here is a teaser from the new interactive
We will soon make available an extensive tutorial to describe this workflow in detail. Until then, you can send your questions to Evan and Ozcan.
Computing average nucleotide identity for genomes in pangenomes
This release also includes significant improvements for our comparative genomics and pangenomics workflows.
One of these improvements is the inclusion of a new program,
anvi-compute-ani, to calculate the average nucleotide identities across a given set of genomes, which can be automatically added into any anvi'o pangenome.
For instance, this is an anvi'o pangenome of the 31 Prochlorococcus isolates we played with in our recent paper:
And this is what you get when you run
We updated our tutorial on pangenomics to describe intermediate steps.
A new approach to explore functional enrichment in pangenomes
This version of anvi'o also includes a new analytical framework to study functional enrichment in a given pangenome based on any arbitrary organization of genomes. You simply define how you would like to partition your genomes, whether based on a phylogenetic tree or a dendrogram that anvi'o computed from gene cluster distributions, and this new tool finds functions that are enriched in those groups (i.e. functions that are characteristic of a given group of genomes, and predominantly absent from genomes outside this group).
This is done by the new program
anvi-get-enriched-functions-per-pan-group, and Alon extended our current tutorial on pangenomics with an extensive description of how it works.
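As a rough illustration of the idea of "characteristic of a group and predominantly absent outside it", here is a small Python sketch. It is only a presence/absence filter with arbitrary thresholds, not the method the anvi'o program uses; the function name, the thresholds, and the genome and function names are all hypothetical.

```python
from typing import Dict, List, Set


def characteristic_functions(
    functions_per_genome: Dict[str, Set[str]],
    group: Set[str],
    min_inside: float = 0.9,
    max_outside: float = 0.1,
) -> List[str]:
    """Return functions found in most genomes of `group` and in few genomes outside it."""
    inside = [g for g in functions_per_genome if g in group]
    outside = [g for g in functions_per_genome if g not in group]
    if not inside or not outside:
        raise ValueError("need at least one genome inside and one outside the group")

    all_functions = set().union(*functions_per_genome.values())
    hits = []
    for function in sorted(all_functions):
        # Fraction of genomes carrying the function, inside and outside the group.
        frac_in = sum(function in functions_per_genome[g] for g in inside) / len(inside)
        frac_out = sum(function in functions_per_genome[g] for g in outside) / len(outside)
        if frac_in >= min_inside and frac_out <= max_outside:
            hits.append(function)
    return hits


if __name__ == "__main__":
    # Toy data: two hypothetical groups of genomes and their annotated functions.
    toy = {
        "HL_1": {"core function", "photolyase"},
        "HL_2": {"core function", "photolyase"},
        "LL_1": {"core function", "some transporter"},
        "LL_2": {"core function"},
    }
    print(characteristic_functions(toy, group={"HL_1", "HL_2"}))
```

A real analysis would replace the fixed thresholds with a statistical test and account for group sizes, which this sketch deliberately ignores.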
Native functional annotation options += PFAMs
If you have your own functional annotations for your genes in an anvi'o contigs database, it is quite straightforward to import them via the
anvi-import-functions program. Anvi'o
v3 had made available another program to automate the annotation process,
anvi-run-ncbi-cogs, if you were fine with NCBI's Clusters of Orthologous Groups. This release contains a new program,
anvi-run-pfams, to use the collection of HMMs produced by the European Bioinformatics Institute based on UniProt.
Tree modification through the interactive interface
It has been a challenge to deal with phylogenetic tree operations in the anvi'o interactive interface. This version includes a significant code refactoring effort, which makes it possible to have new toys that we could not have before. These new toys include basic tree editing and storage abilities such as re-rooting trees, and rotating and collapsing branches. You can even see the branch support values in the mouse tab of the anvi'o interactive interface. These functions are now available to you through the menu that appears when you click a branch in the interactive interface while pressing the Command or Control key:
A new HMM collection to estimate completion of eukaryotic bins
Since its conception, anvi’o has included single-copy core gene collections to assess the completion and redundancy of bacterial and archaeal bins. This release includes a collection to estimate the completion of eukaryotic bins that Tom Delmont, who recently left us physically to join the ranks of Genoscope, curated from the BUSCO collection.
If you are recovering tiny eukaryotic organisms from your metagenomes please help us improve this collection by reporting back your experiences with it.
Importing metagenome-level short-read taxonomy and the enhanced stacked bar data type
While our efforts on shotgun metagenomes largely focus on genome-resolved strategies, we acknowledge that one could learn a lot from taxonomic annotation of short-reads as an additional layer of information. In this release anvi'o comes with a new program,
anvi-import-taxonomy-for-layers with a KrakenHLL parser, which can import short-read level taxonomic annotations into anvi'o profiles. Thanks to the improved data groups, different levels of taxonomy would be available in the layers tab,
And could be visualized easily:
The best part is that our improved stacked bar data type in this release allows you to order your metagenomes based on the relative abundance of any given taxon at any given taxonomic level in those metagenomes according to short reads (the example below orders metagenomes in the infant gut dataset from Sharon et al. based on the increasing relative abundance of Enterococcus):
Here we would like to assume that you're saying to yourself "the example is boring, but the concept has promise". Thanks! We agree.
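For readers who want to see the ordering logic spelled out, here is a minimal Python sketch of sorting samples by the increasing relative abundance of one taxon. It does not reflect how the interactive interface implements this; the function name, sample names, and read counts below are made up for illustration.

```python
from typing import Dict, List


def order_by_relative_abundance(
    read_counts: Dict[str, Dict[str, int]], taxon: str
) -> List[str]:
    """Order sample names by the increasing relative abundance of `taxon`."""

    def relative_abundance(sample: str) -> float:
        counts = read_counts[sample]
        total = sum(counts.values())
        # A sample with no classified reads gets a relative abundance of zero.
        return counts.get(taxon, 0) / total if total else 0.0

    return sorted(read_counts, key=relative_abundance)


if __name__ == "__main__":
    # Hypothetical short-read counts per genus in three metagenomes.
    toy_counts = {
        "sample_A": {"Enterococcus": 50, "Staphylococcus": 50},
        "sample_B": {"Enterococcus": 10, "Staphylococcus": 90},
        "sample_C": {"Enterococcus": 80, "Staphylococcus": 20},
    }
    print(order_by_relative_abundance(toy_counts, "Enterococcus"))
```

Working with relative rather than raw counts keeps samples of different sequencing depths comparable, which is what makes the ordering meaningful.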
A year ago I listened to Jeff Gordon's talk at the University of Chicago, which he started with this African proverb:
If you want to go fast, go alone. If you want to go far, go together.
This concept applies to scientific endeavors so well. Speed is transient, and teamwork is essential for major contributions. Fortunately anvi'o has been becoming more and more of a team effort. But looking at our release notes, I don't know whether we could go any faster from
v5 either. This release was a result of significant intellectual and coding contributions from Alon Shaiber, Evan Kiefl, as well as Özcan Esen, whose guidance and hard work continue to keep this operation together. Altogether, they spent hours and hours on big and small features and issues, with an enthusiasm that can be best justified by curiosity and the desire to contribute to your journey in data-driven microbiology. I, Meren, who gets to write this release note one more time, thank them wholeheartedly.
As a team we also thank Jarrod Scott, Alexandra Campbell, Samantha Atkinson, Carlos Ruiz, Bryan Merill, Mike Lee, Varun Srinivasan, and many others who asked for features and reported bugs with their endless patience with us.
We hope you find
v5 useful for your research, and we certainly hope you will not run into any bugs we probably left in the code.