update README, add check for too few genomes

nickp60 · Apr 19, 2019 · cc6a76b
1 parent 3cbb80d

Showing 3 changed files with 12 additions and 6 deletions.
diff --git a/README.md b/README.md
@@ -19,7 +19,7 @@
 ![annofilt](https://github.com/nickp60/annofilt/blob/master/docs/icon/icon.svg)
 
 # The Problem
-Pangenomes from genome assemblies can be befuddled by missassemblies of genes, expecially those truncated by contig breaks.
+Pangenomes built from genome assemblies can be befuddled by misassemblies of genes. Repeats from multicopy genes often create regions that are impossible to assemble from short reads, so truncated genes are frequently found at contig ends.
 
 # The Solution
 `annofilt` is used to filter annotations that appear to be truncated, based on BLAST comparison with a pangenome generated from closed genomes.  Briefly, the algorithm proceeds as follows:
@@ -46,13 +46,15 @@ create filtered .gff file
 To verify the length of annotated genes, we compare annotation length, alignment coverage, and e-value to a pangenome built from well-curated annotations for a given strain.  To build a pangenome for your strain of interest, do the following:
 
 1. Download as many complete genomes from RefSeq as desired (perhaps 10 at minimum) with `get_complete_genomes`.
-2. Run Roary.  This is a good time to explore their stringincy options for percentage identity (which defaults to 95%)
+2. Create the pangenome with `make_annofilt_pangenome`.  This is a good time to explore Roary's stringency options for percentage identity (which defaults to 95%).  If you want, you can adjust the default parameters using the `--add_roary_args` option.
+3. Take a look at the resulting `summary_statistics.txt` file to make sure nothing looks amiss.
 4. Move the `pan_genome_reference.fa` file to a convenient location for use with annofilt.  This contains a representative nucleotide sequence for each gene in the core.
 
 # Installation
 ```
 conda create -n annofilt -c conda-forge -c bioconda prokka roary blast
 conda activate annofilt
+pip install annofilt
 ```
 
 # Quick Start
@@ -100,7 +102,7 @@ annofilt annofilt_test_data_archive/11complete_colis/pan_genome_reference.fa ./a
 
 
 # So what does it do to my assemblies?
-I used a subset of the Enterobase E coli collection, where I downloaded a representative from each Ackman sequence types (~1100 strains).
+I used a subset of the Enterobase E. coli collection, downloading one representative from each Achtman sequence type (~1100 strains).
 
 By default, annofilt checks the annotations at the end of each contig. The figure below shows the number of genes searched (2 * number of contigs) in gray, and the number of genes retained is in red.
 
@@ -129,4 +131,4 @@ Overall, in the pangenome we generated with and without annofilt, we reduced the
 
 
 ## Notes for running with Docker
-To keep the image size rom being outrageously large, we did not include Prokka in the image.  I maintain a separate Prokka image, which can be obtained from docker hub.  So, we really only recoomend using Docker to run the main annofilt procedure, not using it to download data, run Prokka, or run Roary.
+To keep the image size from being outrageously large, we did not include Prokka in the image.  I maintain a separate Prokka image, which can be obtained from Docker Hub.  So, we really only recommend using Docker to run the main annofilt procedure, not to run `get_complete_genomes` or `make_annofilt_pangenome`.
diff --git a/annofilt/_version.py b/annofilt/_version.py
@@ -1 +1 @@
-__version__ = '0.0.6'
+__version__ = '0.0.7'
diff --git a/annofilt/make_annofilt_pangenome.py b/annofilt/make_annofilt_pangenome.py
@@ -90,7 +90,11 @@ def main(args=None, logger=None):
     for k, v in sorted(vars(args).items()):
         logger.debug("{0}: {1}".format(k, v))
     genomes = glob.glob(args.genomes + "*.fna")
-    logger.info("Preparing Prokka commands")
+    if len(genomes) < 2:
+        raise ValueError("Prokka needs a minimum of 2 genomes to run! " +
+                         "Check the contents of your --genomes dir.  " +
+                         "Genomes need to end in .fna")
+    logger.info("Running Prokka commands")
     for i, genome in enumerate(genomes):
         thisname = os.path.basename(os.path.splitext(genome)[0])
         outdir = os.path.join(output_root,
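
The "too few genomes" check added in this commit can be tried in isolation. What follows is a minimal standalone sketch, not the project's code: the function name `find_genomes` and its `min_genomes` and `extension` parameters are hypothetical. It also assumes `os.path.join` in place of the string concatenation used in the diff, so that a `--genomes` directory given without a trailing slash still matches files.

```python
import glob
import os


def find_genomes(genomes_dir, min_genomes=2, extension=".fna"):
    """Return sorted genome paths in genomes_dir; raise if too few are found.

    Hypothetical helper for illustration only. The names and defaults here
    are assumptions, not part of annofilt.
    """
    # os.path.join works whether or not genomes_dir ends with a separator
    pattern = os.path.join(genomes_dir, "*" + extension)
    genomes = sorted(glob.glob(pattern))
    if len(genomes) < min_genomes:
        raise ValueError(
            "Found {0} genome(s) matching {1}; at least {2} are needed. "
            "Check the contents of your --genomes dir.".format(
                len(genomes), pattern, min_genomes))
    return genomes
```

Failing early like this means a mistyped directory or a wrong file extension is reported before any Prokka or Roary work starts.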