update README, add check for too few genomes
nickp60 committed Apr 19, 2019
1 parent 3cbb80d commit cc6a76b
Showing 3 changed files with 12 additions and 6 deletions.
10 changes: 6 additions & 4 deletions README.md
@@ -19,7 +19,7 @@
![annofilt](https://github.com/nickp60/annofilt/blob/master/docs/icon/icon.svg)

# The Problem
Pangenomes from genome assemblies can be befuddled by misassemblies of genes, especially those truncated by contig breaks.
Pangenomes from genome assemblies can be befuddled by misassemblies of genes. Often the repeats from multicopy genes cause regions that are impossible to assemble from short reads; this results in truncated genes often being found on contig ends.

# The Solution
`annofilt` is used to filter annotations that appear to be truncated, based on BLAST comparison with a pangenome generated from closed genomes. Briefly, the algorithm proceeds as follows:
@@ -46,13 +46,15 @@ create filtered .gff file
To verify the length of annotated genes, we compare annotation length, alignment coverage, and e-value to a pangenome built from well-curated annotations for a given strain. To build a pangenome for your strain of interest, do the following:

1. Download as many complete genomes from RefSeq as desired (minimum of 10?, maybe?) with `get_complete_genomes`
2. Run Roary. This is a good time to explore its stringency options for percentage identity (which defaults to 95%)
2. Create the pangenome with `make_annofilt_pangenome`. This is a good time to explore Roary's stringency options for percentage identity (which defaults to 95%). If you want, you can adjust the default parameters using the `--add_roary_args` option.
3. Take a look at the resulting `summary_statistics.txt` file, to make sure nothing looks amiss.
4. Move the `pan_genome_reference.fa` file to a convenient location for use with annofilt. This contains a representative nucleotide sequence for each gene in the core.
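
As a rough illustration of the comparison described at the top of this section, the sketch below decides whether one annotation looks full-length from its length, the alignment coverage, and the e-value of its best BLAST hit against the pangenome. The `BlastHit` fields, the `looks_complete` function, and all three thresholds are assumptions made for this example; they are not annofilt's actual internals or defaults.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    """Best hit of one annotated gene against the pangenome (hypothetical layout)."""
    query_length: int      # length of the annotated gene
    subject_length: int    # length of the matching pangenome gene
    alignment_length: int  # length of the aligned region
    evalue: float


def looks_complete(hit, min_coverage=0.8, min_length_ratio=0.8, max_evalue=1e-5):
    """Return True if the annotation appears full-length.

    All three thresholds are illustrative placeholders, not annofilt defaults.
    """
    coverage = hit.alignment_length / hit.subject_length
    length_ratio = hit.query_length / hit.subject_length
    return (
        coverage >= min_coverage
        and length_ratio >= min_length_ratio
        and hit.evalue <= max_evalue
    )


full = BlastHit(query_length=900, subject_length=900,
                alignment_length=890, evalue=1e-50)
truncated = BlastHit(query_length=300, subject_length=900,
                     alignment_length=295, evalue=1e-20)
print(looks_complete(full), looks_complete(truncated))
```

A gene cut short by a contig break can still produce a very significant e-value, which is why the sketch checks coverage and length as well.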

# Installation
```
conda create -n annofilt -c conda-forge -c bioconda prokka roary blast
conda activate annofilt
pip install annofilt
```

# Quick Start
@@ -100,7 +102,7 @@ annofilt annofilt_test_data_archive/11complete_colis/pan_genome_reference.fa ./a


# So what does it do to my assemblies?
I used a subset of the Enterobase E. coli collection, where I downloaded a representative from each Ackman sequence type (~1100 strains).
I used a subset of the Enterobase E. coli collection, where I downloaded a representative from each Achtman sequence type (~1100 strains).

By default, annofilt checks the annotations at the end of each contig. The figure below shows the number of genes searched (2 * number of contigs) in gray, and the number of genes retained in red.
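
To make the "2 * number of contigs" figure concrete, here is a minimal sketch of picking the first and last gene on each contig. The tuple layout and the `terminal_genes` name are hypothetical stand-ins for the gene features in a GFF file, not annofilt's own code.

```python
from collections import defaultdict


def terminal_genes(annotations):
    """Return the first and last gene ID on each contig.

    `annotations` is an iterable of (contig, start, gene_id) tuples, a
    simplified stand-in for the gene features in a GFF file.
    """
    by_contig = defaultdict(list)
    for contig, start, gene_id in annotations:
        by_contig[contig].append((start, gene_id))

    selected = []
    for contig in sorted(by_contig):
        genes = sorted(by_contig[contig])  # order genes by start coordinate
        selected.append(genes[0][1])
        if len(genes) > 1:  # a single-gene contig contributes one gene, not two
            selected.append(genes[-1][1])
    return selected


example = [
    ("contig_1", 100, "geneA"),
    ("contig_1", 2000, "geneB"),
    ("contig_1", 5000, "geneC"),
    ("contig_2", 50, "geneD"),
    ("contig_2", 900, "geneE"),
]
print(terminal_genes(example))  # ['geneA', 'geneC', 'geneD', 'geneE']
```

With two contigs, four genes are selected; `geneB` sits in the middle of its contig and is never searched. In this sketch a contig with a single gene contributes only one entry, so 2 * number of contigs is an upper bound.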

@@ -129,4 +131,4 @@ Overall, in the pangenome we generated with and without annofilt, we reduced the


## Notes for running with Docker
To keep the image size from being outrageously large, we did not include Prokka in the image. I maintain a separate Prokka image, which can be obtained from Docker Hub. So, we really only recommend using Docker to run the main annofilt procedure, not using it to download data, run Prokka, or run Roary.
To keep the image size from being outrageously large, we did not include Prokka in the image. I maintain a separate Prokka image, which can be obtained from Docker Hub. So, we really only recommend using Docker to run the main annofilt procedure, not using it to run `get_complete_genomes` or `make_annofilt_pangenome`.
2 changes: 1 addition & 1 deletion annofilt/_version.py
@@ -1 +1 @@
__version__ = '0.0.6'
__version__ = '0.0.7'
6 changes: 5 additions & 1 deletion annofilt/make_annofilt_pangenome.py
@@ -90,7 +90,11 @@ def main(args=None, logger=None):
    for k, v in sorted(vars(args).items()):
        logger.debug("{0}: {1}".format(k, v))
    genomes = glob.glob(args.genomes + "*.fna")
    logger.info("Preparing Prokka commands")
    if len(genomes) < 2:
        raise ValueError("Prokka needs a minimum of 2 genomes to run! " +
                         "check the contents of your --genomes dir. " +
                         "Genomes need to end in .fna")
    logger.info("Running Prokka commands")
    for i, genome in enumerate(genomes):
        thisname = os.path.basename(os.path.splitext(genome)[0])
        outdir = os.path.join(output_root,
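
The check added in this hunk can be tried in isolation with a standalone sketch like the one below. The `find_genomes` helper and its `minimum` parameter are hypothetical names introduced here, not functions from annofilt, and the error wording is shortened for the example.

```python
import glob
import os
import tempfile


def find_genomes(genomes_dir, minimum=2):
    """Return the sorted .fna files in `genomes_dir`, or raise if there are too few."""
    # os.path.join is used so the directory does not need a trailing separator.
    genomes = sorted(glob.glob(os.path.join(genomes_dir, "*.fna")))
    if len(genomes) < minimum:
        raise ValueError(
            "Found {0} genome(s) in {1}; at least {2} files ending in .fna "
            "are required".format(len(genomes), genomes_dir, minimum))
    return genomes


with tempfile.TemporaryDirectory() as tmp:
    # One genome is not enough, so the helper raises.
    open(os.path.join(tmp, "a.fna"), "w").close()
    try:
        find_genomes(tmp)
    except ValueError as err:
        print("rejected:", err)

    # A second genome satisfies the minimum.
    open(os.path.join(tmp, "b.fna"), "w").close()
    print(len(find_genomes(tmp)), "genomes accepted")
```

Note that the committed code builds the pattern as `args.genomes + "*.fna"`, so it appears to rely on the `--genomes` value ending in a path separator; the sketch sidesteps that by joining the path components.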