peasant 1.0 Manual


The following software must be installed and included in the PATH.

  1. Operating System: UNIX/Mac OSX
  2. GCC Compiler
  3. Python – Version 2.6+
  4. BioPython Package 1.68
    1. NumPy
  5. Sickle
    1. Zlib
  6. SPAdes 3.10.1
  7. GLIMMER3
    1. ELPH
  8. BLAST+ Binaries
  9. BBMap
    1. JAVA v6+
  10. tRNAScan-SE

All binaries' directories must be added to system PATH.
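
A quick way to confirm the PATH requirement is to look up each tool before running the pipeline. The sketch below is only an illustration: the executable names in REQUIRED_TOOLS are assumptions (they may differ by version or install method), and it needs Python 3.3 or later for shutil.which.

```python
import shutil

# Assumed executable names for the prerequisites listed above;
# adjust them to match what is actually installed on your machine.
REQUIRED_TOOLS = [
    "gcc", "sickle", "spades.py", "glimmer3", "elph",
    "blastn", "bbmap.sh", "java", "tRNAscan-SE",
]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Any tool reported as missing needs its directory added to PATH (or its assumed name corrected in the list).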


The script g3-iterated-viral.csh should be located in the same folder as Note, this script includes three paths that must be set upon installation. Open the script in a text editor and set awkpath, glimmerpath, and elphbin to the locations of these three folders on your machine. The script must also be executable; this may require chmod.
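
The permission step is usually a single shell command (chmod +x g3-iterated-viral.csh). For completeness, here is a minimal Python sketch that does the same thing; the helper name make_executable is hypothetical and not part of the distributed scripts.

```python
import os
import stat


def make_executable(path):
    """Add execute permission for user, group, and others (like chmod a+x)."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

Calling make_executable("g3-iterated-viral.csh") from the folder that holds the script keeps the existing read and write bits and only adds the execute bits.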

Download the RNA databases from GitHub (the rna_databases.tar file) and place the uncompressed contents in your database_path folder. Precompiled bacteria databases for annotation are publicly available online at:

The script must be modified by the user to set the path where the databases (RNA and bacteria) are located:

Usage: [assembly options] [filter options] [database] [homology options] -o output_path

assembly options:
-a {spades}, --assembler {spades}
Assembly methods:
spades: spades assembler

-A , --assembled_contigs
This will start the analysis from an existing assembly (a multi-FASTA format file).

-s , --single_reads
Single reads file (FASTA or FASTQ).
-p , --paired_end_reads
Paired-end read files. List both read files (FASTA or FASTQ).

filter options:
-m , --min_contig_size
Minimum contig size.

-M , --max_contig_size
Maximum contig size.

-c , --min_coverage
Minimum coverage.

-cov , --min_SPAdes_cov
Minimum SPAdes cov value.

database options:
-g , --genus
Genus name to use for annotation. To consider more than one genus, list the genus names separated by spaces. (required)

homology options:
-q , --qcov
Minimum query coverage to call homologous genes. Default=70.0.

-i , --pident
Minimum percent identity to call homologous genes. Default=70.0.

-b , --bitscore
Minimum bitscore to call homologous genes. Default=50.0.
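
To make the three thresholds concrete, the sketch below shows one way a BLAST hit could be tested against them. It is an illustration under stated assumptions, not the tool's own code: the function name is_homolog and the dictionary keys are hypothetical, and it assumes that a hit must meet all three minimums and that a value equal to the minimum passes.

```python
def is_homolog(hit, min_qcov=70.0, min_pident=70.0, min_bitscore=50.0):
    """Return True if a hit meets every homology threshold.

    `hit` is assumed to be a dict with numeric "qcov", "pident",
    and "bitscore" entries (hypothetical field names).
    """
    return (
        hit["qcov"] >= min_qcov
        and hit["pident"] >= min_pident
        and hit["bitscore"] >= min_bitscore
    )


# Example: the second hit fails only because of its bitscore.
hits = [
    {"qcov": 95.0, "pident": 88.2, "bitscore": 310.0},
    {"qcov": 80.0, "pident": 75.0, "bitscore": 42.0},
]
homologs = [hit for hit in hits if is_homolog(hit)]
```

Raising -q, -i, or -b makes the call stricter, so fewer predicted genes are labeled as homologous to database entries.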

other options:
-o , --output_path
Directory to store all the resulting files (required)

-t , --num_threads
Number of processors to use. Default=1.

-h, --help
Shows help message and exits.

Shows program's version number and exits.
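
For readers who want to see how the options fit together, the following is a rough argparse sketch that mirrors the flags, defaults, and required options documented above. It is a reconstruction for illustration only and should not be read as the tool's actual parser; the function name build_parser is hypothetical, and the version option is left out.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(description="Sketch of the documented options.")

    assembly = parser.add_argument_group("assembly options")
    assembly.add_argument("-a", "--assembler", choices=["spades"])
    assembly.add_argument("-A", "--assembled_contigs")
    assembly.add_argument("-s", "--single_reads")
    assembly.add_argument("-p", "--paired_end_reads", nargs=2)  # two read files

    filters = parser.add_argument_group("filter options")
    filters.add_argument("-m", "--min_contig_size", type=int)
    filters.add_argument("-M", "--max_contig_size", type=int)
    filters.add_argument("-c", "--min_coverage", type=float)
    filters.add_argument("-cov", "--min_SPAdes_cov", type=float)

    database = parser.add_argument_group("database options")
    database.add_argument("-g", "--genus", nargs="+", required=True)

    homology = parser.add_argument_group("homology options")
    homology.add_argument("-q", "--qcov", type=float, default=70.0)
    homology.add_argument("-i", "--pident", type=float, default=70.0)
    homology.add_argument("-b", "--bitscore", type=float, default=50.0)

    other = parser.add_argument_group("other options")
    other.add_argument("-o", "--output_path", required=True)
    other.add_argument("-t", "--num_threads", type=int, default=1)
    return parser


args = build_parser().parse_args(
    ["-a", "spades", "-p", "R1.fastq", "R2.fastq", "-g", "Escherichia", "-o", "my_output"]
)
```

Note that -g accepts one or more genus names and -p expects exactly two files, which matches the option descriptions above.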


Let’s say I have paired-end reads R1 and R2. They are from the genome of a novel E. coli strain isolated in the lab. I want to output my results to a folder my_output.

python -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -g Escherichia -o /mydirectory/my_output

This will produce the output folder my_output with the following files and directory:
Directory temp: This directory contains all of the temporary files generated by read QC (‘trimmed_’ files and, in the case of paired-end reads, singletons.fastq), the assembly (in the subdirectory /assembly), and coding region prediction.
annotations.csv: comma-delimited file listing the contig/location of predicted genes.
final_contigs.fasta: Multi-FASTA format file of the final contigs (after filtering).
predicted_orfs.fasta: all predicted ORFs for the final contigs
predicted_orfs_aa.fasta: Multi-FASTA amino acid sequences for the ORFs predicted within the contigs.
predicted_orfs_nt.fasta: Multi-FASTA nucleotide sequences for the ORFs predicted within the contigs.
predicted_rRNAs.fasta: Multi-FASTA nucleotide sequences for the rRNA (5S, 16S, and 23S) sequences predicted.
tRNAscan-SE_output.txt: output generated by tRNAscan-SE listing the locations and tRNA type found.
log.txt: Log file generated. Includes information about parameters, results of filtering and classification.

Let’s say I’m most interested in finding sequences of a certain size, say 20 kbp to 80 kbp. I can revise the call above as follows:
python -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -m 20000 -M 80000 -g Escherichia -o /mydirectory/my_output2

I can filter based upon the SPAdes k-mer coverage value (-cov / --min_SPAdes_cov) if the assembly is done using SPAdes. I can also filter for base coverage (-c / --min_coverage). Following up on the example above, I’d like only those contigs with a SPAdes k-mer coverage value > 10.
python -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -m 20000 -M 80000 -cov 10 -g Escherichia -o /mydirectory/my_output3
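
The size and SPAdes coverage filters can be pictured with the short sketch below. It relies on the fact that SPAdes names contigs like NODE_1_length_45000_cov_23.5, with the k-mer coverage after "cov_". The function names are hypothetical, this is not the pipeline's own code, and treating a contig exactly at a threshold as passing is an assumption.

```python
import re

# SPAdes contig headers look like "NODE_1_length_45000_cov_23.5".
SPADES_HEADER = re.compile(r"length_(\d+)_cov_([\d.]+)")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records, min_size=None, max_size=None, min_cov=None):
    """Keep contigs within the size range and at or above the SPAdes cov value."""
    kept = []
    for header, seq in records:
        if min_size is not None and len(seq) < min_size:
            continue
        if max_size is not None and len(seq) > max_size:
            continue
        if min_cov is not None:
            match = SPADES_HEADER.search(header)
            # Contigs without a SPAdes-style header cannot be cov-filtered.
            if match is None or float(match.group(2)) < min_cov:
                continue
        kept.append((header, seq))
    return kept
```

For example, filter_contigs(read_fasta(open("contigs.fasta")), min_size=20000, max_size=80000, min_cov=10) would apply the same three limits as the command above to a SPAdes contigs file (the file name here is a placeholder).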

Single end reads can also be considered:
python -a spades -s /mydirectory/myReads.fastq -g Escherichia -o /mydirectory/my_output3

If you’re not sure which taxa would be best to annotate your genome, you can select more than one, e.g.,
python -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -g Escherichia Salmonella Shigella -o /mydirectory/my_output4

I can also create my own repository (BLAST database) for annotation, e.g.,
python -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -g myTaxa -o /mydirectory/my_output5

I can also provide an assembly. Let’s say it’s a file called myAssembly.fasta. Just as if I were running the assembly here, I can also include filters. Note that the SPAdes k-mer coverage value (-cov) cannot be used if an assembly file is provided. The base coverage (-c / --min_coverage) can be used, but only if reads are provided.
python -A /mydirectory/myAssembly.fasta -p /mydirectory/R1.fastq /mydirectory/R2.fastq -c 10 -g Escherichia -o /mydirectory/my_output6

