PhyloSift

PhyloSift is a suite of software tools to conduct phylogenetic analysis of genomes and metagenomes. Using a reference database of protein sequences, PhyloSift can scan new sequences against that database for homologs and identify the phylogenetic relationship of the new sequence to the database sequences. During this procedure, high quality alignments of codon and amino acid sequences are generated.

Website : http://phylosift.wordpress.com/
Twitter : @PhyloSift
User support forum: https://groups.google.com/d/forum/phylosift

Our Google group is a place to ask questions, request features, get help if you're stuck, and generally interact with our development team at UC Davis. If you run into a bug during your analysis, please post it as an "Issue" on GitHub. These reports get priority, and we'll fix any broken pipes as soon as possible!

== Quick Start for the Impatient ==

Download PhyloSift from here:
http://edhar.genomecenter.ucdavis.edu/~koadman/phylosift/phylosift_latest.tar.bz2

Unpack with `tar xvjf phylosift_latest.tar.bz2`, then change into the new directory and run:

`bin/phylosift all --output=results sequences.fasta`

Or using compressed paired FastQ data from an Illumina instrument:

`bin/phylosift all --output=results --paired read_1.gz read_2.gz`

Results will be stored in a new directory called "results". See `results/krona.html` for a visual summary of the sequence data.

== NEW VERSION ==

PhyloSift has several new features :
* The marker gene database has been expanded to encompass Bacteria, Archaea, Eukarya, and viruses.
* Support for processing of next-gen sequence data including Illumina, 454, and others
* Integration with metAMOS for pre-assembling next-generation datasets
* Codon based alignment of marker genes
...more to come!
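For scripted runs, the two Quick Start invocations can be assembled programmatically. The following is a minimal sketch, not part of PhyloSift itself: the helper name `build_phylosift_command` and its parameters are hypothetical, and it only reproduces the `all` mode, `--output` and `--paired` usage shown in the Quick Start.

```python
from typing import List, Optional


def build_phylosift_command(
    inputs: List[str],
    output_dir: Optional[str] = None,
    mode: str = "all",
    executable: str = "bin/phylosift",
) -> List[str]:
    """Build the argument list for one PhyloSift run (hypothetical helper).

    One input file gives a single-file run; two input files give a
    paired run using the --paired flag, as in the Quick Start.
    """
    if len(inputs) not in (1, 2):
        raise ValueError("expected one sequence file or two paired files")

    command = [executable, mode]
    if output_dir is not None:
        command.append(f"--output={output_dir}")
    if len(inputs) == 2:
        command.append("--paired")
    command.extend(inputs)
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` from inside the unpacked PhyloSift directory, which avoids shell quoting problems with unusual file names.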

== Installation ==

The easy way :

Download the tarball archive from
http://edhar.genomecenter.ucdavis.edu/~koadman/phylosift/phylosift_latest.tar.bz2

Navigate to the directory where you downloaded the software and run :

    tar -xvf phylosift_latest.tar.bz2

PhyloSift is ready for use. Refer to the Usage section.

The hard way (installing from the development repository) :

Download and install Git for your system. http://git-scm.com/
For debian/ubuntu systems, run the following command :

    sudo apt-get install git

To check out the PhyloSift development repository, run the following command :

    git clone git://github.com/gjospin/PhyloSift.git

It may also be necessary to install several support packages, including BioPerl, Bio::Phylo, and others. BioPerl is available as bioperl-live on GitHub; we suggest installing the other dependencies using cpan.

== Updating PhyloSift ==

If running PhyloSift from a packaged release (i.e., you chose the "easy way"), simply download the newest tarball.

If running PhyloSift from the development repository, run the following command from within the repository directory :

    git pull

== Usage ==

PhyloSift currently accepts input data in the following file formats:
* FASTA
* paired FASTA (specify paired data by using the --paired flag)
* interleaved paired FASTA (specify paired data by using the --paired flag)
* FASTQ (same as FASTA but with quality scores; this file type is the standard output from Illumina platforms)
* interleaved FASTQ (same as interleaved FASTA but with quality scores)
* .gz (any of the above file types compressed using gzip)
* .bz (any of the above file types compressed using bzip)

PhyloSift can be executed by running the following command:

    $ phylosift <Mode> <options> <sequence_file>
        sequence_file needs to be in an accepted format

    $ phylosift <Mode> <options> --paired <sequence_file_1> <sequence_file_2>
        sequence_file_1 and _2 must contain paired sequences

Creating a PhyloSift Marker (does not allow for taxonomic summaries, only the searching, aligning and placement steps) :

    $ phylosift build_marker <alignment_file> <PD pruning threshold>

    Example 1: $ phylosift build_marker test.aln 0.01
    Example 2: $ phylosift build_marker -f test.aln 0.01

WARNING : This step does not automatically add the new marker to the search indices. You will need to rerun the indexing step.

The marker directory will be added directly to the directory in which PhyloSift looks for markers when running. If a marker with the same name exists, the marker building process will be halted unless the -f option is used.

The marker name will be the same as the name of the alignment file minus the trailing suffix. If the user builds a marker from the file <test.aln>, the marker name will be <test>.

== Index the search databases ==

This step is run automatically when markers are downloaded, but if you add new markers you will need to run this step manually.

    $ phylosift index

== Modes ==

Mode : Only execute a specific step in the pipeline

    all          : execute all the following steps in that order
    search       : only execute the database searches + write candidate files
    align        : hmmsearch + hmmalign + trims + coverage filters
    placer       : run pplacer on the results from the align step
    summary      : run the taxonomy assignment steps
    index        : indexes the various databases needed for Rapsearch, Bowtie and Blast
    build_marker : creates a reference package from a multiple alignment file
    sim          : Generates simulated reads using Grinder from a set of genomes
                   (top PD contribution and randomly picked from the concatenated tree for the marker DB).
                   The genomes need to be downloaded to a directory.
                   This feature should be reserved for in-house usage.
                   Generates paired-end reads in fastq format, 454 reads in fasta format and paired-end reads in fasta format.
                   Generates 2 files containing the taxon IDs for the genomes used in the simulations.

== Options ==

Options available at this time :

    -h, --help            Prints the usage section of the README
    -f                    Overrides a previous run of the input data by PhyloSift, if applicable
    --custom=<file>       Reads a custom marker list from a file; otherwise all the markers from the markers directory are used.
                          Marker names shouldn't contain '_'
    --threaded=<Number>   Runs Blast and Hmmer using the number of processors specified.
                          Runs 1 Pplacer per processor specified (DEFAULT : 1)
    --extended            Uses the extended set of markers
    --clean               Cleans the temporary files created by PhyloSift (NOT YET IMPLEMENTED)
    --paired              Looks for 2 input files (paired end sequencing) in FastQ format.
                          The sequences in the second file listed are reversed and appended to the corresponding pair from the first file listed.
    --continue            Enables the pipeline to continue to subsequent steps when not using the 'all' mode
    --reverse             If the submitted sequences are nucleotides, the program prints out an alignment for all the markers in DNA space in addition to the one in protein space.
    --isolate             Use this mode if you are running data from an isolate genome
    --simple              Generates a simple taxonomic summary of the data only; no Krona output
    --besthit             When there are multiple hits to the same read, this option will keep only the best hit for that read
    --updated             Use the set of updated markers with newly sequenced genomes instead of the stock markers
    --marker_url=<url>    PhyloSift will use markers available from the url provided
    --coverage=<file>     Provides a contig/scaffold coverage file to PhyloSift
    --config=<file>       Provides a custom configuration file to PhyloSift
    --output=<directory>  Specifies an output directory instead of using PhyloSift's default
    --debug               Prints debugging messages

== Requirements ==

A 64-bit operating system is required because the binaries packaged with the PhyloSift scripts were compiled on a 64-bit OS. PhyloSift will NOT work on a 32-bit operating system.
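
If you are unsure whether a machine meets the 64-bit requirement, a quick check along the following lines may help. This is only a sketch: the helper `is_64bit_machine` and the list of architecture names are assumptions made for illustration, and the check says nothing about whether the bundled binaries match your particular architecture.

```python
import platform

# Architecture names commonly reported by 64-bit systems. This list is an
# illustrative assumption for the sketch and is not exhaustive.
KNOWN_64BIT_MACHINES = {"x86_64", "amd64", "aarch64", "arm64", "ppc64", "ppc64le"}


def is_64bit_machine(machine: str) -> bool:
    """Return True if the given machine string names a known 64-bit architecture."""
    return machine.strip().lower() in KNOWN_64BIT_MACHINES


if __name__ == "__main__":
    # platform.machine() returns the hardware name reported by the OS.
    machine = platform.machine()
    if is_64bit_machine(machine):
        print(f"{machine}: looks like a 64-bit system")
    else:
        print(f"{machine}: not recognised as 64-bit; the bundled binaries may not run")
```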

PhyloSift depends on a great many other open source software packages. The precompiled version linked above bundles most of the dependencies into a single downloadable package.

Default Markers used by PhyloSift
=========================================
PMPROK00003  PMPROK00025  PMPROK00048  PMPROK00064  PMPROK00081  PMPROK00106
PMPROK00014  PMPROK00028  PMPROK00050  PMPROK00067  PMPROK00086  PMPROK00123
PMPROK00015  PMPROK00029  PMPROK00051  PMPROK00068  PMPROK00087  PMPROK00126
PMPROK00019  PMPROK00031  PMPROK00052  PMPROK00069  PMPROK00092
PMPROK00020  PMPROK00034  PMPROK00053  PMPROK00071  PMPROK00093
PMPROK00022  PMPROK00041  PMPROK00054  PMPROK00074  PMPROK00094
PMPROK00024  PMPROK00047  PMPROK00060  PMPROK00075  PMPROK00097

This data has not yet been published.

== Results ==

All results are generated within the PhyloSift directory in the Amph_temp/<filename>/ directory.

blastDir : All files related to the search step.
    *.candidate          Fasta format of the candidate sequences in protein space for each marker
    *.candidate.ffn      Fasta format of the candidate sequences in DNA space for each marker (option activated)

alignDir : All files related to the alignment and masking steps.
    *.aln_hmmer3.fasta   hmmer3 generated alignment for the candidate sequences in fasta format
    *.aln_hmmer3.trim    Trimmed version of the previous file
    *.hmmsearch.out      hmmsearch output of the candidate sequences vs the marker's HMM profile
    *.hmmsearch.tblout   Tabular version of the hmmsearch.out file
    *.newCandidate       Fasta file of the candidate sequences after the hmmsearch filter
    *.seed.stock         Stockholm format of the reference sequences alignment
    *.stock.hmm          HMM profile

treeDir : All files related to the tree placements of the candidate sequences by Pplacer.
    *.jplace             Placement file for the newCandidate sequences

taxasummary.txt
    Probability mass over taxa present in the sample, tab-delimited text
    Column 1 -- NCBI Taxon ID
    Column 2 -- Taxonomic rank (genus, species, phylum, etc)
    Column 3 -- Name
    Column 4 -- Read/sequence probability sum placed at this taxon
    The values in column 4 can be normalized to sum to 1; the result is a rank-abundance distribution.

taxaconfidence.txt
    Quantiles over the probability distribution for the number of reads at a taxonomic level, under a multinomial model of read sampling from taxa during the sequencing process.
    Column 1 -- NCBI Taxon ID
    Column 2 -- Taxonomic rank (genus, species, phylum, etc)
    Column 3 -- Name
    Column 4 -- Minimum number of reads
    Column 5 -- Number of reads at 10th percentile
    Column 6 -- Number of reads at 25th percentile
    Column 7 -- Median number of reads
    Column 8 -- Number of reads at 75th percentile
    Column 9 -- Number of reads at 90th percentile
    Column 10 -- Maximum number of reads

== SUPPORT AND DOCUMENTATION ==

After installing, you can find documentation for this module with the perldoc command.

    perldoc Phylosift::Phylosift

Bugs and other apparent problems with the software can be reported as issues in our GitHub issue tracker :
https://github.com/gjospin/PhyloSift/issues

== LICENSE AND COPYRIGHT ==

Copyright (C) 2011 Aaron Darling and Guillaume Jospin

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation.

== 3rd PARTY SOFTWARE ==

PhyloSift is distributed with several open source components that were developed by other groups. These components are (c) their respective developers and are redistributed with PhyloSift to provide ease-of-use.
Please see the following web sites for licensing details and source code for these other components:
* pplacer -- http://matsen.fhcrc.org/pplacer/
* rapsearch2 -- http://omics.informatics.indiana.edu/mg/RAPSearch2/
* HMMER 3 -- http://hmmer.janelia.org/
* LAST -- http://last.cbrc.jp
* pda -- http://www.cibiv.at/software/pda/
* bowtie2 -- http://bowtie-bio.sourceforge.net/bowtie2/
* FastTree -- http://www.microbesonline.org/fasttree/
* infernal -- http://infernal.janelia.org/

== CONTACT INFORMATION ==

Please direct correspondence to aarondarling (at) uc davis (dot) edu
Or on twitter to @PhyloSift
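
== Example: normalizing taxasummary.txt ==

The Results section notes that column 4 of taxasummary.txt can be normalized to sum to 1 to give a rank-abundance distribution. The sketch below shows one way this might be done. The function name `rank_abundance` is hypothetical, and the optional filter on a single taxonomic rank is an assumption of this example (the file mixes ranks, so normalizing within one rank is often what is wanted); it is not something PhyloSift does for you.

```python
from typing import Iterable, List, Optional, Tuple


def rank_abundance(
    lines: Iterable[str], rank: Optional[str] = None
) -> List[Tuple[str, str, float]]:
    """Normalize column 4 of taxasummary.txt rows so the kept values sum to 1.

    Returns (taxon_id, name, fraction) tuples sorted from most to least
    abundant. If `rank` is given, only rows whose column 2 equals it are kept.
    """
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        taxon_id, taxon_rank, name, mass = fields[:4]
        if rank is not None and taxon_rank != rank:
            continue
        try:
            rows.append((taxon_id, name, float(mass)))
        except ValueError:
            continue  # skip rows whose fourth column is not numeric

    total = sum(mass for _, _, mass in rows)
    if total == 0:
        return []
    normalized = [(taxon_id, name, mass / total) for taxon_id, name, mass in rows]
    normalized.sort(key=lambda row: row[2], reverse=True)
    return normalized


# Made-up rows in the four-column layout described above (not real results).
example = [
    "1239\tphylum\tFirmicutes\t2.0",
    "1224\tphylum\tProteobacteria\t6.0",
    "561\tgenus\tEscherichia\t3.0",
]
for taxon_id, name, fraction in rank_abundance(example, rank="phylum"):
    print(taxon_id, name, fraction)
```

To use it on real output, pass an open file handle for taxasummary.txt in place of `example`; the location of that file depends on your output directory.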