peasant
g3-iterated.csh initial commit Apr 12, 2017
make_peasant_db.py initial commit Apr 12, 2017
peasant.py fix line 412 remove /n Feb 25, 2018
readme.md Update readme.md May 22, 2017
rrna_databases.tar Add files via upload Apr 24, 2017

readme.md

peasant 1.0 Manual

Dependencies:

The following software must be installed and be included in the PATH.

  1. Operating System: UNIX/Mac OSX
  2. GCC Compiler: http://gcc.gnu.org/
  3. Python – Version 2.6+: https://www.python.org/downloads/
  4. BioPython Package 1.68: http://biopython.org/wiki/Download
    1. NumPy: http://www.scipy.org/scipylib/download.html
  5. Sickle: https://github.com/ucdavis-bioinformatics/sickle
    1. Zlib: http://www.zlib.net/
  6. SPAdes 3.10.1: http://bioinf.spbau.ru/content/spades-download
  7. GLIMMER3: http://ccb.jhu.edu/software/glimmer/index.shtml
    1. ELPH: http://www.cbcb.umd.edu/software/ELPH/
  8. BLAST+ Binaries: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/
  9. BBMap: https://sourceforge.net/projects/bbmap/
    1. JAVA v6+: https://www.java.com/en/download/ie_manual.jsp?locale=en
  10. tRNAScan-SE: http://eddylab.org/software.html

All binaries' directories must be added to system PATH.
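A quick way to confirm the PATH requirement is to look up each tool before running peasant. The sketch below is not part of peasant; the executable names are assumptions about typical installations of the tools listed above, and `shutil.which` needs Python 3.3 or later, so adjust both to your system.

```python
import shutil

# Assumed executable names for the dependencies listed above.
# These are illustrative; edit them to match your installation.
REQUIRED_TOOLS = [
    "gcc", "sickle", "spades.py", "glimmer3", "elph",
    "blastn", "makeblastdb", "bbmap.sh", "java", "tRNAscan-SE",
]


def missing_tools(tools=REQUIRED_TOOLS, path=None):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

Any name reported as missing either is not installed or lives in a directory that still has to be added to PATH.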


Setup/Install:

The script g3-iterated-viral.csh should be located in the same folder as peasant.py. This script contains three paths that must be set at installation: open it in a text editor and set awkpath, glimmerpath, and elphbin to the locations of the corresponding folders on your machine. The script must also be executable; this may require chmod (e.g., chmod +x g3-iterated-viral.csh).

Download the RNA databases (the rna_databases.tar file) from GitHub and extract the contents into your database_path folder. Precompiled bacterial databases for annotation are publicly available at https://drive.google.com/drive/folders/0B3MjIo6BB7_1NGZTN3l4WXl5VEE?usp=sharing

The peasant.py script must be modified by the user to set the path where the databases (RNA and bacteria) are located:
database_path='/mydirectory/peasant/databases'
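The setup steps above can be checked with a few lines of Python before the first run. This is a minimal sketch, not code from peasant: the function name `check_setup` is hypothetical, and it only tests that the helper script exists and is executable and that the database folder exists and is not empty.

```python
import os


def check_setup(script_dir, database_path, helper_script="g3-iterated-viral.csh"):
    """Return a list of human-readable problems found in the setup.

    `script_dir` is the folder holding peasant.py; `database_path` should
    match the value set inside peasant.py. Hypothetical helper, for
    illustration only.
    """
    problems = []
    helper = os.path.join(script_dir, helper_script)
    if not os.path.isfile(helper):
        problems.append("missing helper script: " + helper)
    elif not os.access(helper, os.X_OK):
        problems.append("helper script is not executable (try chmod +x): " + helper)
    if not os.path.isdir(database_path):
        problems.append("database_path is not a directory: " + database_path)
    elif not os.listdir(database_path):
        problems.append("database_path is empty: " + database_path)
    return problems
```

For example, `check_setup('/mydirectory/peasant', '/mydirectory/peasant/databases')` returns an empty list when nothing is wrong.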


Usage: peasant.py [assembly options] [filter options] [database] [homology options] -o output_path

assembly options:
-a {spades}, --assembler {spades}
Assembly methods:
spades: spades assembler

-A , --assembled_contigs
Starts the analysis from an existing assembly (a multi-FASTA format file).

-s , --single_reads
Single reads file (FASTA or FASTQ).
-p , --paired_end_reads
Paired-end read files. List both read files (FASTA or FASTQ).

filter options:
-m , --min_contig_size
Minimum contig size.

-M , --max_contig_size
Maximum contig size.

-c , --min_coverage
Minimum coverage.

-cov , --min_SPAdes_cov
Minimum SPAdes cov value.

database options:
-g , --genus
Genus name to use for annotation. To consider more than one genus, list the genus names separated by spaces. (required)

homology options:
-q , --qcov
Minimum query coverage to call homologous genes. Default=70.0.

-i , --pident
Minimum percent identity to call homologous genes. Default=70.0.

-b , --bitscore
Minimum bitscore to call homologous genes. Default=50.0.

other options:
-o , --output_path
Directory to store all the resulting files (required)

-t , --num_threads
Number of processors to use. (default=1)

-h, --help
Shows help message and exits.

--version
Shows the program's version number and exits.
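To make the option list concrete, here is a sketch of an argparse parser rebuilt from the help text above. It is an illustration, not the parser in peasant.py: the argument types, the version string, and the absence of defaults for the filter options are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="peasant.py")
    # Version string is an assumption based on the manual title.
    parser.add_argument("--version", action="version", version="peasant 1.0")

    assembly = parser.add_argument_group("assembly options")
    assembly.add_argument("-a", "--assembler", choices=["spades"])
    assembly.add_argument("-A", "--assembled_contigs")
    assembly.add_argument("-s", "--single_reads")
    assembly.add_argument("-p", "--paired_end_reads", nargs=2)

    filters = parser.add_argument_group("filter options")
    filters.add_argument("-m", "--min_contig_size", type=int)
    filters.add_argument("-M", "--max_contig_size", type=int)
    filters.add_argument("-c", "--min_coverage", type=float)
    filters.add_argument("-cov", "--min_SPAdes_cov", type=float)

    database = parser.add_argument_group("database options")
    database.add_argument("-g", "--genus", nargs="+", required=True)

    homology = parser.add_argument_group("homology options")
    homology.add_argument("-q", "--qcov", type=float, default=70.0)
    homology.add_argument("-i", "--pident", type=float, default=70.0)
    homology.add_argument("-b", "--bitscore", type=float, default=50.0)

    other = parser.add_argument_group("other options")
    other.add_argument("-o", "--output_path", required=True)
    other.add_argument("-t", "--num_threads", type=int, default=1)
    return parser
```

Passing a list to `build_parser().parse_args([...])` shows how a given command line would be interpreted under these assumptions, for example that `-g` collects every genus name up to the next option.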

Example:

Let’s say I have paired-end reads R1 and R2. They are for the genome of a novel E. coli strain isolated in the lab. I want to output my results to a folder my_output.

python peasant.py -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -g Escherichia -o /mydirectory/my_output

This will produce the output folder my_output with the following files and directory:
Directory temp: This directory contains all of the temporary files generated by read QC (files prefixed with ‘trimmed_’ and, in the case of paired-end reads, singletons.fastq), the assembly (in the subdirectory /assembly), and coding region prediction.
annotations.csv: comma-delimited file listing the contig/location of predicted genes.
final_contigs.fasta: Multi-FASTA format file of the final contigs (after filtering)
predicted_orfs.fasta: all predicted ORFs for the final contigs
predicted_orfs_aa.fasta: Multi-FASTA amino acid sequences for the ORFs predicted within the contigs.
predicted_orfs_nt.fasta: Multi-FASTA nucleotide sequences for the ORFs predicted within the contigs.
predicted_rRNAs.fasta: Multi-FASTA nucleotide sequences for the rRNA (5S, 16S, and 23S) sequences predicted.
tRNAscan-SE_output.txt: output generated by tRNAscan-SE listing the locations and tRNA type found.
log.txt: Log file generated. Includes information about parameters, results of filtering and classification.
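Several of the outputs above are multi-FASTA files, so a quick sanity check is to count the sequences and their lengths. The sketch below uses only the standard library and assumes nothing beyond the FASTA format itself; the function name `fasta_lengths` is hypothetical. Biopython's `SeqIO.parse` would do the same job.

```python
def fasta_lengths(path):
    """Map each sequence ID in a (multi-)FASTA file to its sequence length."""
    lengths = {}
    current = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The ID is the first whitespace-delimited token of the header.
                parts = line[1:].split()
                current = parts[0] if parts else ""
                lengths[current] = 0
            elif current is not None:
                lengths[current] += len(line)
    return lengths
```

For instance, `len(fasta_lengths('/mydirectory/my_output/final_contigs.fasta'))` gives the number of contigs that passed filtering.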

Let’s say I’m most interested in finding sequences within a certain size range, say 20,000 to 80,000 bases. I can revise the command above as follows:
python peasant.py -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -m 20000 -M 80000 -g Escherichia -o /mydirectory/my_output2

I can filter based upon the SPAdes k-mer coverage value (-cov / --min_SPAdes_cov) if the assembly is done using SPAdes. I can also filter for base coverage (-c / --min_coverage). Following up on the example above, I’d like only those contigs with a SPAdes k-mer coverage value > 10.
python peasant.py -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -m 20000 -M 80000 -cov 10 -g Escherichia -o /mydirectory/my_output3
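The size and SPAdes coverage filters can be pictured as a test on each contig header. SPAdes names contigs in the form `NODE_1_length_45000_cov_22.5417`, so both values can be read from the name. The sketch below is an illustration and not peasant's code: the function names are hypothetical, size bounds are assumed inclusive, and the coverage threshold is assumed strict (cov > threshold) to match the wording of the example above.

```python
import re

# Matches the length and k-mer coverage fields of a SPAdes contig name.
SPADES_HEADER = re.compile(r"_length_(\d+)_cov_(\d+(?:\.\d+)?)")


def parse_spades_header(header):
    """Return (length, cov) from a SPAdes contig name, or None if absent."""
    match = SPADES_HEADER.search(header)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


def keep_contig(header, min_size=None, max_size=None, min_cov=None):
    """Decide whether a contig passes the size and coverage filters.

    Assumptions: size bounds are inclusive and coverage must be strictly
    greater than `min_cov`. Headers without SPAdes fields are rejected.
    """
    parsed = parse_spades_header(header)
    if parsed is None:
        return False
    length, cov = parsed
    if min_size is not None and length < min_size:
        return False
    if max_size is not None and length > max_size:
        return False
    if min_cov is not None and cov <= min_cov:
        return False
    return True
```

With the thresholds from the command above, `keep_contig("NODE_1_length_45000_cov_22.5417", 20000, 80000, 10)` keeps the contig, while a 1,500-base contig would be dropped.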

Single-end reads can also be used:
python peasant.py -a spades -s /mydirectory/myReads.fastq -g Escherichia -o /mydirectory/my_output3

If you’re not sure which genus would be best for annotating your genome, you can select more than one, e.g.,
python peasant.py -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -g Escherichia Salmonella Shigella -o /mydirectory/my_output4

I can also create my own repository (blast database) for annotation, e.g.,
python peasant.py -a spades -p /mydirectory/R1.fastq /mydirectory/R2.fastq -g myTaxa -o /mydirectory/my_output5

I can also provide an existing assembly. Let’s say it’s a file called myAssembly.fasta. Just as when the assembly is run here, I can include filters. Note that the SPAdes k-mer coverage filter (-cov / --min_SPAdes_cov) cannot be used if an assembly file is provided. The base coverage filter (-c / --min_coverage) can be used, but only if reads are also provided.
python peasant.py -A /mydirectory/myAssembly.fasta -p /mydirectory/R1.fastq /mydirectory/R2.fastq -c 10 -g Escherichia -o /mydirectory/my_output6