peasant 1.0 Manual
Dependencies:
The following software must be installed and available on the PATH.
- Operating System: UNIX/Mac OS X
- GCC Compiler: http://gcc.gnu.org/
- Python – Version 2.6+: https://www.python.org/downloads/
- BioPython Package 1.68: http://biopython.org/wiki/Download
- Sickle: https://github.com/ucdavis-bioinformatics/sickle
- Zlib: http://www.zlib.net/
- SPAdes 3.10.1: http://bioinf.spbau.ru/content/spades-download
- GLIMMER3: http://ccb.jhu.edu/software/glimmer/index.shtml
- BLAST+ Binaries: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
- BBMap: https://sourceforge.net/projects/bbmap/
- tRNAScan-SE: http://eddylab.org/software.html
The directory containing each tool's binaries must be added to the system PATH.
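A quick way to confirm the PATH requirement is to look each tool up before running peasant. The following is a minimal sketch, written for Python 3 and separate from peasant itself. The executable names in REQUIRED_TOOLS are assumptions based on how these tools are usually distributed and are not taken from this manual; adjust them to match your installation.

```python
import shutil

# Assumed executable names (not specified by this manual); edit as needed.
REQUIRED_TOOLS = [
    "gcc",
    "sickle",
    "spades.py",
    "glimmer3",
    "blastn",
    "bbmap.sh",
    "tRNAscan-SE",
]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

Any name reported as missing either is not installed or lives in a directory that still needs to be added to the PATH.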
Setup/Install:
The script g3-iterated-viral.csh must be located in the same folder as peasant.py. This script contains three paths that must be set upon installation: open the script in a text editor and set awkpath, glimmerpath, and elphbin to the locations of the corresponding folders on your machine. The script must also be executable; this may require chmod.
- Note: the Glimmer scripts may not point to the location of awk on your machine, in which case they will need to be updated. (See: https://ccb.jhu.edu/software/glimmer/glim302notes.pdf)
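If you prefer to set the executable bit from Python rather than with chmod on the command line, something along these lines should work. This is a sketch written for Python 3; the helper name make_executable is hypothetical and not part of peasant.

```python
import os
import stat


def make_executable(path):
    """Add execute permission for user, group, and others (similar to chmod a+x).

    Returns True if the file is executable by the current user afterwards.
    """
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return os.access(path, os.X_OK)
```

For example, make_executable("g3-iterated-viral.csh") run from the folder that holds peasant.py.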
Download the RNA databases (the rna_databases.tar file) from GitHub and place the uncompressed contents in your database_path folder. Precompiled bacterial databases for annotation are publicly available online at: https://drive.google.com/drive/folders/0B3MjIo6BB7_1NGZTN3l4WXl5VEE?usp=sharing.
The peasant.py script must be modified by the user to set the path where the databases (RNA and bacteria) are located:
database_path='/mydirectory/peasant/databases'
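Unpacking the RNA databases into database_path can be sketched as follows. This is written for Python 3 and is not part of peasant; the function name unpack_rna_databases is hypothetical. Whether the archive contains a top-level folder is not stated here, so inspect the returned listing and make sure the database files end up directly where peasant.py expects them.

```python
import os
import tarfile


def unpack_rna_databases(tar_path, database_path):
    """Extract the tar archive into database_path and list what is there."""
    # Create the database folder if it does not exist yet.
    os.makedirs(database_path, exist_ok=True)
    with tarfile.open(tar_path) as archive:
        archive.extractall(database_path)
    return sorted(os.listdir(database_path))
```

For example, unpack_rna_databases("rna_databases.tar", "/mydirectory/peasant/databases"), using the same path that you set for database_path in peasant.py.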
Usage: peasant.py [assembly options] [filter options] [database] [homology options] -o output_path
assembly options:
-a {spades}, --assembler {spades}
Assembly methods:
spades: spades assembler
-A , --assembled_contigs
Starts the analysis from an existing assembly (a multi-FASTA format file).
-s , --single_reads
Single reads file (FASTA or FASTQ).
-p , --paired_end_reads
Paired-end read files. List both read files (FASTA or FASTQ).
filter options:
-m , --min_contig_size
Minimum contig size.
-M , --max_contig_size
Maximum contig size.
-c , --min_coverage
Minimum coverage.
-cov , --min_SPAdes_cov
Minimum SPAdes cov value.
database options:
-g , --genus
Genus name to use for annotation. To consider more than one genus, list the genus names separated by spaces. (required)
homology options:
-q , --qcov
Minimum query coverage to call homologous genes. Default=70.0.
-i , --pident
Minimum percent identity to call homologous genes. Default=70.0.
-b , --bitscore
Minimum bitscore to call homologous genes. Default=50.0.
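To illustrate how the three homology thresholds interact, here is a small sketch written for Python 3. It assumes that a hit must meet all three minimums and that the bounds are inclusive; neither detail is stated in this manual, and the function name is_homolog is hypothetical rather than taken from peasant.py.

```python
def is_homolog(qcov, pident, bitscore,
               min_qcov=70.0, min_pident=70.0, min_bitscore=50.0):
    """Return True if a BLAST hit meets all three minimum thresholds.

    Assumption: the three criteria are combined with AND and the
    bounds are inclusive. The defaults mirror -q, -i, and -b.
    """
    return (qcov >= min_qcov
            and pident >= min_pident
            and bitscore >= min_bitscore)
```

Under these assumptions, a hit with 95% identity and a bitscore of 200 but only 60% query coverage would be rejected with the default settings, and accepted if -q were lowered to 60.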
other options:
-o , --output_path
Directory to store all the resulting files (required)
-t , --num_threads
Number of processors to use. (default=1)
-h, --help
Shows help message and exits.
--version
Shows program's version number and exits.
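The option list above maps naturally onto Python's argparse module. The sketch below shows one way such a command line could be declared; it is not the actual peasant.py source. The argument types, the handling of defaults, and the version string format are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Declare the documented options (illustrative; not peasant's own code)."""
    parser = argparse.ArgumentParser(prog="peasant.py")
    parser.add_argument("--version", action="version", version="peasant 1.0")

    assembly = parser.add_argument_group("assembly options")
    assembly.add_argument("-a", "--assembler", choices=["spades"])
    assembly.add_argument("-A", "--assembled_contigs")
    assembly.add_argument("-s", "--single_reads")
    # Paired-end reads take exactly two file names.
    assembly.add_argument("-p", "--paired_end_reads", nargs=2)

    filters = parser.add_argument_group("filter options")
    filters.add_argument("-m", "--min_contig_size", type=int)
    filters.add_argument("-M", "--max_contig_size", type=int)
    filters.add_argument("-c", "--min_coverage", type=float)
    filters.add_argument("-cov", "--min_SPAdes_cov", type=float)

    database = parser.add_argument_group("database options")
    # One or more genus names separated by spaces.
    database.add_argument("-g", "--genus", nargs="+", required=True)

    homology = parser.add_argument_group("homology options")
    homology.add_argument("-q", "--qcov", type=float, default=70.0)
    homology.add_argument("-i", "--pident", type=float, default=70.0)
    homology.add_argument("-b", "--bitscore", type=float, default=50.0)

    other = parser.add_argument_group("other options")
    other.add_argument("-o", "--output_path", required=True)
    other.add_argument("-t", "--num_threads", type=int, default=1)
    return parser
```

With this sketch, build_parser().parse_args(["-a", "spades", "-s", "reads.fastq", "-g", "Escherichia", "-o", "out"]) yields a namespace in which genus is the list ["Escherichia"] and the homology thresholds keep their documented defaults. The file name reads.fastq and the folder out are placeholders.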
Example:
Let’s say I have paired-end reads R1 and R2. They are for the genome of a novel E. coli strain isolated in the lab. I want to output my results to a folder my_output.
python peasant.py -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -g Escherichia -o /mydirectory/my_output
This will produce the output folder my_output with the following files and directory:
Directory temp: This directory contains all of the temporary files generated by read QC (the ‘trimmed_’ files and, in the case of paired-end reads, singletons.fastq), the assembly (in the subdirectory /assembly), and coding region prediction.
annotations.csv: comma-delimited file listing the contig/location of predicted genes.
final_contigs.fasta: Multi-FASTA format file of the final contigs (after filtering)
predicted_orfs.fasta: all predicted ORFs for the final contigs
predicted_orfs_aa.fasta: Multi-FASTA amino acid sequences for the ORFs predicted within the contigs.
predicted_orfs_nt.fasta: Multi-FASTA nucleotide sequences for the ORFs predicted within the contigs.
predicted_rRNAs.fasta: Multi-FASTA nucleotide sequences for the rRNA (5S, 16S, and 23S) sequences predicted.
tRNAscan-SE_output.txt: output generated by tRNAscan-SE listing the locations and tRNA type found.
log.txt: Log file generated. Includes information about parameters, results of filtering and classification.
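Once a run has finished, the multi-FASTA outputs can be inspected with a few lines of Python. The sketch below (Python 3, standard library only, not part of peasant) tallies the sequence length of every record in a FASTA file such as final_contigs.fasta. The function name contig_lengths is hypothetical.

```python
def contig_lengths(lines):
    """Map each FASTA header (without the leading '>') to its sequence length.

    `lines` can be an open file handle or any iterable of text lines.
    """
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths
```

For example, open /mydirectory/my_output/final_contigs.fasta with open(...) and pass the handle to contig_lengths to see how many contigs survived filtering and how long each one is.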
Let’s say I’m most interested in finding sequences of a certain size, say 20K-80K. I can revise my command above as follows:
python peasant.py -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -m 20000 -M 80000 -g Escherichia -o /mydirectory/my_output2
I can filter based upon the SPAdes k-mer coverage value (-cov / --min_SPAdes_cov) if the assembly is done using SPAdes. I can also filter for base coverage (-c / --min_coverage). Following up on the example above, I’d like only those contigs with a SPAdes k-mer coverage value > 10.
python peasant.py -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -m 20000 -M 80000 -cov 10 -g Escherichia -o /mydirectory/my_output3
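For intuition about what the size and SPAdes coverage filters act on: SPAdes names its contigs with headers of the form NODE_1_length_25000_cov_12.5, which carry the contig length and the k-mer coverage value. The sketch below (Python 3) parses such a header and applies the three filters. It is not peasant's own code. The boundary handling is an assumption: size limits are treated as inclusive, and coverage must be strictly greater than the threshold, following the "> 10" wording above.

```python
import re

# SPAdes contig header, e.g. NODE_1_length_25000_cov_12.5
SPADES_HEADER = re.compile(r"NODE_\d+_length_(\d+)_cov_([0-9.]+)")


def keep_contig(header, min_size=None, max_size=None, min_cov=None):
    """Return True if the contig described by `header` passes all given filters."""
    match = SPADES_HEADER.search(header)
    if match is None:
        raise ValueError("not a SPAdes-style header: %r" % header)
    length = int(match.group(1))
    cov = float(match.group(2))
    if min_size is not None and length < min_size:
        return False
    if max_size is not None and length > max_size:
        return False
    # Assumption: coverage must exceed the threshold (strictly greater).
    if min_cov is not None and cov <= min_cov:
        return False
    return True
```

A filter that is left as None is simply not applied, mirroring a command line on which that option was omitted.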
Single end reads can also be considered:
python peasant.py -a spades -s /mydirectory/myReads.fastq -g Escherichia -o /mydirectory/my_output3
If you’re not sure which taxa would be best for annotating your genome, you can select more than one, e.g.,
python peasant.py -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -g Escherichia Salmonella Shigella -o /mydirectory/my_output4
I can also create my own repository (BLAST database) for annotation and pass its name to -g, e.g.,
python peasant.py -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -g myTaxa -o /mydirectory/my_output5
I can also provide an assembly. Let’s say it’s a file called myAssembly.fasta. Just as if I were going to run the assembly here, I can also include filters. Note, the SPAdes k-mer coverage value (-cov / --min_SPAdes_cov) cannot be used if the assembly file is provided. The base coverage (-c / --min_coverage) can be used, but only if reads are provided.
python peasant.py -A /mydirectory/myAssembly.fasta -p /mydirectory/R1.fastq /mydirectory/R2.fastq -c 10 -g Escherichia -o /mydirectory/my_output6
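The two restrictions on coverage filters when an assembly is supplied can be written down as a small pre-flight check. This is a sketch in Python 3 under the rules stated above, not a copy of peasant's own validation. The function name check_options and its messages are hypothetical.

```python
def check_options(assembled_contigs=None, single_reads=None,
                  paired_end_reads=None, min_coverage=None,
                  min_spades_cov=None):
    """Return a list of problems with the chosen option combination."""
    problems = []
    has_reads = bool(single_reads or paired_end_reads)
    if assembled_contigs and min_spades_cov is not None:
        problems.append(
            "-cov/--min_SPAdes_cov cannot be used with -A/--assembled_contigs")
    if assembled_contigs and min_coverage is not None and not has_reads:
        problems.append(
            "-c/--min_coverage with -A/--assembled_contigs requires reads "
            "(-s or -p)")
    return problems
```

The command above (assembly plus paired-end reads plus -c 10) produces an empty list, whereas dropping the -p argument from it would produce one problem.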