
peasant 1.0 Manual


The following software must be installed and be included in the PATH.

  1. Operating System: UNIX/Mac OS X
  2. GCC Compiler
  3. Python – Version 2.6+
  4. BioPython Package 1.68
    1. NumPy
  5. Sickle
    1. Zlib
  6. SPAdes 3.10.1
  7. GLIMMER3
    1. ELPH
  8. BLAST+ Binaries
  9. BBMap
    1. Java v6+
  10. tRNAscan-SE

All binaries' directories must be added to system PATH.
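
Before a first run, it can help to confirm that the required programs are actually reachable through PATH. The following is a minimal sketch in Python 3, not part of peasant itself. The executable names in REQUIRED_TOOLS are assumptions about how each package is typically installed; adjust them to match your system.

```python
import shutil

# Assumed executable names for the listed dependencies (not taken from the
# manual); edit this list to match your own installation.
REQUIRED_TOOLS = [
    "gcc", "sickle", "spades.py", "glimmer3", "elph",
    "blastn", "bbmap.sh", "java", "tRNAscan-SE",
]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    for tool in missing_tools():
        print("Not found on PATH: " + tool)
```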


The script g3-iterated-viral.csh should be located in the same folder as the main peasant script. Note that this script includes three paths which must be set upon installation. Open the script in a text editor and set awkpath, glimmerpath, and elphbin to the locations of these three folders on your machine. The script must also be executable; this may require chmod.
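
If you prefer to set the executable bit from Python rather than with chmod on the command line, a small sketch such as the following should work. The function name make_executable is hypothetical; the effect is intended to match `chmod a+x`.

```python
import os
import stat


def make_executable(path):
    """Add execute permission for user, group, and others (like chmod a+x)."""
    current_mode = os.stat(path).st_mode
    os.chmod(path, current_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)


# Example (path is a placeholder for wherever you saved the script):
# make_executable("g3-iterated-viral.csh")
```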

Download the RNA databases file, rna_databases.tar, from GitHub and place its uncompressed contents in your database_path folder. Precompiled bacteria databases for annotation are publicly available online at:
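
Unpacking the archive can also be done from Python. This is a minimal Python 3 sketch using the standard tarfile module; the function name extract_databases is hypothetical, and both arguments are placeholders for your own archive location and database_path folder.

```python
import os
import tarfile


def extract_databases(archive_path, database_path):
    """Unpack a tar archive into database_path and return the member names."""
    os.makedirs(database_path, exist_ok=True)
    with tarfile.open(archive_path) as archive:
        names = archive.getnames()
        archive.extractall(database_path)
    return names


# Example (both paths are placeholders):
# extract_databases("rna_databases.tar", "/mydirectory/database_path")
```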

The script must be modified by the user to set the path where the databases (RNA and bacteria) are located:

Usage: [assembly options] [filter options] [database] [homology options] -o output_path

assembly options:
-a {spades}, --assembler {spades}
Assembly methods:
spades: SPAdes assembler

-A , --assembled_contigs
Starts the analysis from an existing assembly (multi-FASTA format file).

-s , --single_reads
Single reads file (FASTA or FASTQ).
-p , --paired_end_reads
Paired-end read files. List both read files (FASTA or FASTQ).

filter options:
-m , --min_contig_size
Minimum contig size.

-M , --max_contig_size
Maximum contig size.

-c , --min_coverage
Minimum coverage.

-cov , --min_SPAdes_cov
Minimum SPAdes cov value.

database options:
-g , --genus
Genus name to use for annotation. To consider more than one genus, list the genus names separated by spaces. (required)

homology options:
-q , --qcov
Minimum query coverage to call homologous genes. Default=70.0.

-i , --pident
Minimum percent identity to call homologous genes. Default=70.0.

-b , --bitscore
Minimum bitscore to call homologous genes. Default=50.0.

other options:
-o , --output_path
Directory to store all the resulting files (required)

-t , --num_threads
Number of processors to use. (default=1)

-h, --help
Shows help message and exits.

Shows the program's version number and exits.
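
To make the option list above concrete, here is a sketch of how such an interface could be declared with Python's argparse module. It is an illustration only and is not taken from peasant's source code. The value types (int or float) and the choice of which options take two values are assumptions inferred from the descriptions above.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented options")

    assembly = parser.add_argument_group("assembly options")
    assembly.add_argument("-a", "--assembler", choices=["spades"])
    assembly.add_argument("-A", "--assembled_contigs")
    assembly.add_argument("-s", "--single_reads")
    assembly.add_argument("-p", "--paired_end_reads", nargs=2)

    filters = parser.add_argument_group("filter options")
    filters.add_argument("-m", "--min_contig_size", type=int)    # type assumed
    filters.add_argument("-M", "--max_contig_size", type=int)    # type assumed
    filters.add_argument("-c", "--min_coverage", type=float)     # type assumed
    filters.add_argument("-cov", "--min_SPAdes_cov", type=float) # type assumed

    database = parser.add_argument_group("database options")
    database.add_argument("-g", "--genus", nargs="+", required=True)

    homology = parser.add_argument_group("homology options")
    homology.add_argument("-q", "--qcov", type=float, default=70.0)
    homology.add_argument("-i", "--pident", type=float, default=70.0)
    homology.add_argument("-b", "--bitscore", type=float, default=50.0)

    other = parser.add_argument_group("other options")
    other.add_argument("-o", "--output_path", required=True)
    other.add_argument("-t", "--num_threads", type=int, default=1)
    return parser  # -h/--help is added automatically by argparse


if __name__ == "__main__":
    print(build_parser().parse_args())
```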


Let’s say I have paired-end reads R1 and R2. They are for the genome of a novel E. coli strain isolated in the lab. I want to output my results to a folder my_output.

python -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -g Escherichia -o /mydirectory/my_output

This will produce the output folder my_output with the following files and directory:
Directory temp: This directory contains all of the temporary files generated by read QC (‘trimmed_’ files and, in the case of paired-end reads, singletons.fastq), the assembly (in the subdirectory /assembly), and coding region prediction.
annotations.csv: comma delimited file listing the contig/location of predicted genes.
final_contigs.fasta: Multi-FASTA format file of the final contigs (after filtering)
predicted_orfs.fasta: all predicted ORFs for the final contigs
predicted_orfs_aa.fasta: Multi-FASTA amino acid sequences for the ORFs predicted within the contigs.
predicted_orfs_nt.fasta: Multi-FASTA nucleotide sequences for the ORFs predicted within the contigs.
predicted_rRNAs.fasta: Multi-FASTA nucleotide sequences for the rRNA (5S, 16S, and 23S) sequences predicted.
tRNAscan-SE_output.txt: output generated by tRNAscan-SE listing the locations and tRNA type found.
log.txt: Log file generated. Includes information about parameters, results of filtering and classification.
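
After a run finishes, a quick way to confirm that everything listed above was produced is to check the output folder for each file. The sketch below is an optional helper, not part of peasant; the function name missing_outputs is hypothetical, and the file names are copied from the list above.

```python
import os

# File names as listed in this manual.
EXPECTED_FILES = [
    "annotations.csv",
    "final_contigs.fasta",
    "predicted_orfs.fasta",
    "predicted_orfs_aa.fasta",
    "predicted_orfs_nt.fasta",
    "predicted_rRNAs.fasta",
    "tRNAscan-SE_output.txt",
    "log.txt",
]


def missing_outputs(output_path):
    """Return the expected result files that are absent from output_path."""
    return [
        name for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(output_path, name))
    ]


# Example (path is a placeholder):
# print(missing_outputs("/mydirectory/my_output"))
```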

Let’s say I’m most interested in finding sequences of a certain size, say 20,000 to 80,000 bp. I can revise the command above as follows:
python -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -m 20000 -M 80000 -g Escherichia -o /mydirectory/my_output2
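
To illustrate what a contig size filter does, the sketch below reads FASTA records and keeps only those whose length falls between a minimum and a maximum. It is not peasant's own implementation; in particular, treating both bounds as inclusive is an assumption, and the function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_size(records, min_size, max_size):
    """Keep records whose sequence length is within [min_size, max_size]."""
    return [(h, s) for h, s in records if min_size <= len(s) <= max_size]


# Example (file name is a placeholder):
# with open("final_contigs.fasta") as handle:
#     kept = filter_by_size(read_fasta(handle), 20000, 80000)
```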

I can filter based upon the SPAdes k-mer coverage value (-cov / --min_SPAdes_cov) if the assembly is done using SPAdes. I can also filter for base coverage (-c / --min_coverage). Following up on the example above, I’d like only those contigs with a SPAdes k-mer coverage value > 10.
python -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -m 20000 -M 80000 -cov 10 -g Escherichia -o /mydirectory/my_output3
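
SPAdes records the k-mer coverage in each contig name, for example NODE_1_length_45000_cov_23.5. The sketch below shows one way such a value could be read from a header and compared with a threshold. It is not peasant's own code; the function names are hypothetical, and the strict "greater than" comparison follows the wording of the example above rather than a confirmed behavior.

```python
import re

# Matches the "_cov_<number>" part of a SPAdes contig name.
_COV_PATTERN = re.compile(r"_cov_([0-9]+(?:\.[0-9]+)?)")


def spades_cov(header):
    """Return the k-mer coverage from a SPAdes contig name, or None if absent."""
    match = _COV_PATTERN.search(header)
    return float(match.group(1)) if match else None


def passes_cov(header, min_cov):
    """True if the header carries a coverage value strictly above min_cov."""
    cov = spades_cov(header)
    return cov is not None and cov > min_cov
```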

Single-end reads can also be used:
python -a spades -s /mydirectory/myReads.fastq -g Escherichia -o /mydirectory/my_output3

If you’re not sure which taxa would be best to annotate your genome, you can select more than one, e.g.,
python -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -g Escherichia Salmonella Shigella -o /mydirectory/my_output4

I can also create my own repository (BLAST database) for annotation and pass its name with -g, e.g.,
python -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -g myTaxa -o /mydirectory/my_output5
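
A BLAST database is normally built with the makeblastdb program from BLAST+. The sketch below assembles such a command from Python. This manual does not state which database type (nucleotide or protein), naming scheme, or folder peasant expects, so the dbtype default, the function names, and the file names here are all assumptions to be checked against your setup.

```python
import subprocess


def makeblastdb_command(fasta_path, db_name, dbtype="prot"):
    """Return the makeblastdb argument list for building a BLAST database.

    dbtype must be "nucl" or "prot"; the default here is only a guess.
    """
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", dbtype, "-out", db_name]


def build_database(fasta_path, db_name, dbtype="prot"):
    """Run makeblastdb; requires the BLAST+ binaries on PATH."""
    subprocess.check_call(makeblastdb_command(fasta_path, db_name, dbtype))


# Example (file and database names are placeholders):
# build_database("myTaxa.fasta", "myTaxa")
```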

I can also provide an assembly. Let’s say it’s a file called myAssembly.fasta. Just as if I were going to run the assembly here, I can also include filters. Note that the SPAdes k-mer coverage filter (-cov / --min_SPAdes_cov) cannot be used if the assembly file is provided. The base coverage filter (-c / --min_coverage) can be used, but only if reads are provided.
python -A /mydirectory/myAssembly.fasta -p /mydirectory/R1.fastq /mydirectory/R2.fastq -c 10 -g Escherichia -o /mydirectory/my_output6
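
The two restrictions described above can be expressed as a small check on the chosen options. The sketch below is illustrative only and does not reproduce peasant's own validation; the function name, parameter names, and message texts are hypothetical.

```python
def check_option_rules(assembled_contigs=None, single_reads=None,
                       paired_end_reads=None, min_coverage=None,
                       min_spades_cov=None):
    """Return a list of problems with the given option combination."""
    problems = []
    has_reads = bool(single_reads) or bool(paired_end_reads)
    if assembled_contigs and min_spades_cov is not None:
        # The SPAdes k-mer coverage is only known when SPAdes runs the assembly.
        problems.append("-cov cannot be used together with -A")
    if assembled_contigs and min_coverage is not None and not has_reads:
        # Base coverage has to be computed from reads.
        problems.append("-c requires reads (-s or -p) when -A is given")
    return problems
```

For the command above (an assembly plus paired-end reads and -c 10), this check returns an empty list.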