This pipeline assembles organelle genomes from genome skimming data.
Citation: Jian-Jun Jin*, Wen-Bin Yu*, Jun-Bo Yang, Yu Song, Ting-Shuang Yi, De-Zhu Li. 2018. GetOrganelle: a simple and fast pipeline for de novo assembly of a complete circular chloroplast genome using genome skimming data. bioRxiv, 256479. http://doi.org/10.1101/256479
License: GPL https://www.gnu.org/licenses/gpl-3.0.html
Please cite the dependencies if they are used:
SPAdes: Bankevich, A., S. Nurk, D. Antipov, A. A. Gurevich, M. Dvorkin, A. S. Kulikov, V. M. Lesin, S. I. Nikolenko, S. Pham, A. D. Prjibelski, A. V. Pyshkin, A. V. Sirotkin, N. Vyahhi, G. Tesler, M. A. Alekseyev and P. A. Pevzner. 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology 19: 455-477.
This pipeline was written in Python 3.5.1, but it is compatible with versions higher than 3.5.1 and 2.7.11. GetOrganelle is more efficient under Python 3.*.
Execute the following simple git commands to download the latest version, or find older stable versions here:
# Supposing you are going to install it at ~/Applications/bin
GetOrganellePATH=~/Applications/bin
cd $GetOrganellePATH
git clone https://github.com/Kinggerm/GetOrganelle
then add GetOrganelle to the path:
# for MacOS
echo "PATH=$GetOrganellePATH/GetOrganelle:\$PATH" >> ~/.bash_profile
echo "PATH=$GetOrganellePATH/GetOrganelle/Utilities:\$PATH" >> ~/.bash_profile
echo "export PATH" >> ~/.bash_profile
source ~/.bash_profile
# for Linux
echo "PATH=$GetOrganellePATH/GetOrganelle:\$PATH" >> ~/.bashrc
echo "PATH=$GetOrganellePATH/GetOrganelle/Utilities:\$PATH" >> ~/.bashrc
echo "export PATH" >> ~/.bashrc
source ~/.bashrc
and make them writable/executable if they are not:
chmod +x $GetOrganellePATH/GetOrganelle/*.py
chmod +x $GetOrganellePATH/GetOrganelle/Utilities/*.py
chmod +x $GetOrganellePATH/GetOrganelle/Library/*.py
chmod +w $GetOrganellePATH/GetOrganelle/Library/*Reference
It is also very IMPORTANT to keep GetOrganelle updated (if you find your version is out of date!):
cd $GetOrganellePATH/GetOrganelle
rm Library/SeqReference/*index*
git pull
You could run the main script (get_organelle_reads.py) to get organelle reads (*.fastq) successfully, without any third-party libraries or software.
However, to get a complete organelle genome (such as a plastome) rather than organelle reads, the other files in GetOrganelle are needed in their original relative paths. The following software/libraries must also be installed and added to the PATH, since they may be called automatically:
Python libraries numpy, scipy, sympy are used to solve the assembly graph, and could be easily installed by typing in:
pip install numpy scipy sympy
SPAdes is the assembler
Bowtie2 is used to speed up initial recruitment of target-like reads
BLAST+ is used to filter target-like contigs and simplify the final assembly graph
Bandage is suggested to view the final contig graph
[NOT necessary] If you have installed the Python library psutil (pip install psutil), the memory cost of get_organelle_reads.py will be automatically logged. If you want to evaluate your results and plot the evaluation with round_statistics.py, you have to further install the Python library matplotlib (pip install matplotlib).
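As a quick sanity check before a run, the requirements listed above can be verified with a short script. This is only a sketch: the executable names (spades.py, bowtie2, blastn, Bandage) are assumptions about how the tools are typically installed and may differ on your system, and find_missing is a hypothetical helper, not part of GetOrganelle.

```python
import importlib.util
import shutil

# Assumed executable names; adjust them to match your installation.
EXECUTABLES = ["spades.py", "bowtie2", "blastn", "Bandage"]
# numpy, scipy and sympy are needed to solve the assembly graph;
# psutil and matplotlib are optional.
PYTHON_LIBRARIES = ["numpy", "scipy", "sympy", "psutil", "matplotlib"]


def find_missing(executables=EXECUTABLES, libraries=PYTHON_LIBRARIES):
    """Return (missing_executables, missing_libraries) as two lists."""
    missing_exe = [name for name in executables if shutil.which(name) is None]
    missing_lib = [name for name in libraries
                   if importlib.util.find_spec(name) is None]
    return missing_exe, missing_lib


if __name__ == "__main__":
    missing_exe, missing_lib = find_missing()
    print("Executables not found on PATH:", ", ".join(missing_exe) or "none")
    print("Python libraries not found:", ", ".join(missing_lib) or "none")
```

An empty report only means the names were found; it does not check versions.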
All you actually need to do is type in one simple command as suggested in Example. But you are still invited to read the following introductions:
Currently, this script is written for Illumina paired-end/single-end data (fastq or fastq.gz). 1G per end is enough for the chloroplast for most normal angiosperm samples, and 5G per end is enough for mitochondria data. You could simply assign a maximum number of reads (number of seqs, not number of bases) for GetOrganelle to use with the flag --max-reads, or manually cut the raw data to a certain size before running GetOrganelle using a Linux or Mac OS built-in command (e.g. head -n 20000000 large.fq > small.fq).
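Because a FASTQ record normally spans four lines, head -n 20000000 keeps the first 5,000,000 reads. The same truncation can be sketched in Python, which also handles fastq.gz input. The function names below are hypothetical, and the sketch assumes unwrapped four-line records.

```python
import gzip
import itertools


def open_fastq(path, mode="rt"):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode)
    return open(path, mode)


def take_first_reads(in_path, out_path, max_reads):
    """Copy the first max_reads records (4 lines each) to out_path.

    Returns the number of reads actually written.
    """
    lines_written = 0
    with open_fastq(in_path, "rt") as src, open_fastq(out_path, "wt") as dst:
        for line in itertools.islice(src, 4 * max_reads):
            dst.write(line)
            lines_written += 1
    return lines_written // 4
```

For paired-end data, call take_first_reads with the same max_reads on both files so that the pairs stay in sync.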
Filtering and Assembly
Taking your input reference (fasta or bowtie index; the default is Library/SeqReference/*.fasta) as the probe, the script recruits target reads in successive rounds (the extending process). You could also use a more closely related reference, which would be safer if the sequence quality is bad (say, degraded DNA samples). The word size (set with "-w"), like the kmer in assembly, is crucial to the feasibility and efficiency of this process. The best word size varies with the data and is affected by read length, read quality, base coverage, organelle DNA percentage and other factors. Since version 1.4.0, if no word size is assigned by the user, GetOrganelle automatically estimates a proper word size based on the data characteristics. Although the automatically estimated word size ensures neither the best performance nor the best result, you do not need to adjust the value if a complete/circular organelle result is produced, because the circular result by GetOrganelle is generally consistent under different options. After extending, the script automatically calls SPAdes to assemble the target reads produced by the former step. The best kmer also depends on a wide variety of factors.
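To give an intuition for the extending process and the role of the word size, here is a toy sketch of iterative read recruitment by shared words. It is a deliberately simplified illustration, not GetOrganelle's actual implementation: it ignores reverse complements, base quality and paired-end information, and the names words_of and recruit_reads are hypothetical.

```python
def words_of(seq, word_size):
    """Return the set of all substrings of length word_size in seq."""
    return {seq[i:i + word_size] for i in range(len(seq) - word_size + 1)}


def recruit_reads(seed, reads, word_size, max_rounds):
    """Toy iterative recruitment: a read is accepted if it shares at least
    one word with the seed or with any previously accepted read."""
    accepted_words = words_of(seed, word_size)
    recruited = set()  # indices of accepted reads
    for _ in range(max_rounds):
        found_new = False
        for index, read in enumerate(reads):
            if index in recruited:
                continue
            read_words = words_of(read, word_size)
            if read_words & accepted_words:
                recruited.add(index)
                accepted_words |= read_words  # extend the word pool
                found_new = True
        if not found_new:
            break  # no read was added in this round, so stop extending
    return [reads[i] for i in sorted(recruited)]
```

With seed "AAACCCGGG" and reads ["TTTACGCA", "GATTGATT", "CGGGTTTA"], a word size of 4 recruits only "CGGGTTTA" in one round and additionally "TTTACGCA" in a second round, while a word size of 6 recruits nothing. In this toy, a smaller word size recruits more permissively and a larger one more strictly, and more rounds extend further from the seed.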
By default, SPAdes is automatically called to produce the assembly graph file filtered_spades/assembly_graph.fastg. Then, Utilities/slim_fastg.py is called to modify the filtered_spades/assembly_graph.fastg file and produce a new fastg file (which would be assembly_graph.fastg.extend_plant_cp-del_plant_mt.fastg if -F plant_cp has been used) along with a tab-format annotation file (assembly_graph.fastg.extend_plant_cp-del_plant_mt.csv). The assembly_graph.fastg.extend_plant_cp-del_plant_mt.fastg file along with the assembly_graph.fastg.extend_plant_cp-del_plant_mt.csv file would be further parsed by disentangle_organelle_assembly.py, and your target sequence file(s) *complete*path_sequence.fasta would be produced as the final result, if disentangle_organelle_assembly.py successfully solves the path.
Otherwise, if disentangle_organelle_assembly.py fails to solve the path (producing *contigs*path_sequence.fasta), you could use the incomplete sequence to conduct downstream analysis, or manually view assembly_graph.fastg.extend_plant_cp-del_plant_mt.fastg, load assembly_graph.fastg.extend_plant_cp-del_plant_mt.csv in Bandage, and choose the best path(s) as the final result.
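When processing many samples, it can be handy to check which of the two outcomes described above a run produced. The following sketch searches an output directory for the two file name patterns. The helper classify_result is hypothetical, and the recursive search is an assumption made because the exact location of the files inside the output directory is not specified here.

```python
from pathlib import Path


def classify_result(output_dir):
    """Return (status, files) for a GetOrganelle output directory.

    status is "complete" if *complete*path_sequence.fasta files exist,
    "contigs" if only *contigs*path_sequence.fasta files exist,
    and "none" otherwise.
    """
    out = Path(output_dir)
    complete = sorted(out.rglob("*complete*path_sequence.fasta"))
    if complete:
        return "complete", complete
    contigs = sorted(out.rglob("*contigs*path_sequence.fasta"))
    if contigs:
        return "contigs", contigs
    return "none", []
```

A "contigs" status is the cue to open the fastg and csv files in Bandage and pick the path manually.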
Here (or here) is a short video showing a standard way to manually extract the plastome from the assembly graph with Bandage. See here or here for more examples with more complicated situations (do not miss 3m01s - 5m53s).
To assemble a chloroplast genome (e.g. using 2G raw data of 150 bp paired reads), typically I use:
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -o chloroplast_output -R 15 -k 75,85,95,105 -F plant_cp
or in a draft way:
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -o chloroplast_output --fast -k 75,85,95,105 -w 0.68 -F plant_cp
or in a slow and memory-economical way:
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -s cp_reference.fasta -o chloroplast_output -R 30 -k 75,85,95,105 -F plant_cp --memory-save -a mitochondria.fasta
To assemble plant mitochondria (usually you need more than 5G raw data):
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -s mt_reference.fasta -o mitochondria_output -R 50 -k 65,75,85,95,105 -P 1000000 -F plant_mt
To assemble plant nuclear ribosomal RNA (18S-ITS1-5.8S-ITS2-26S):
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -o nr_output -R 7 -k 95,105,115 -P 0 -F plant_nr
To assemble fungus mitochondria (currently only tested on limited samples; the suggested parameters might not be the best):
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -s fungus_mt_reference.fasta --genes fungus_mt_genes.fasta -R 3 -k 65,75,85,95,105 -F fungus_mt
To assemble animal mitochondria (currently only tested on limited samples; the suggested parameters might not be the best):
get_organelle_reads.py -1 sample_1.fq -2 sample_2.fq -s animal_mt_reference.fasta --genes animal_mt_genes.fasta -R 3 -k 65,75,85,95,105 -F animal_mt
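For batches of samples, the example commands above can be generated programmatically. The sketch below builds an argument list using only flags that appear in the examples (-1, -2, -o, -R, -k, -F). The function name, the input and output naming scheme, and the extra_args parameter are assumptions for illustration.

```python
import shlex


def build_command(sample, target="plant_cp", rounds=15,
                  kmers=(75, 85, 95, 105), extra_args=()):
    """Build a get_organelle_reads.py argument list for one sample.

    Assumes paired reads named <sample>_1.fq and <sample>_2.fq and writes
    to <sample>_output (a hypothetical naming scheme).
    """
    command = [
        "get_organelle_reads.py",
        "-1", "{0}_1.fq".format(sample),
        "-2", "{0}_2.fq".format(sample),
        "-o", "{0}_output".format(sample),
        "-R", str(rounds),
        "-k", ",".join(str(k) for k in kmers),
        "-F", target,
    ]
    command.extend(extra_args)  # e.g. ("-P", "0") for plant_nr
    return command


if __name__ == "__main__":
    for name in ["sampleA", "sampleB"]:
        # Print a copy-pastable shell line for each sample.
        print(" ".join(shlex.quote(arg) for arg in build_command(name)))
```

The returned list can be passed directly to subprocess.run(command, check=True) to launch a run.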
See the detailed illustrations of those arguments by typing in:
or see verbose illustrations:
Also see GetOrganelleComparison for a benchmark test of GetOrganelle and NOVOPlasty using 50 online samples.
Published Works Using GetOrganelle
Yu Song, Wen-Bin Yu, Yun-Bong Tan, Bing Liu, Xin Yao, Jian-Jun Jin, Michael Padmanaba, Jun-Bo Yang, Richard T. Corlett. 2017. Evolutionary comparisons of the chloroplast genome in Lauraceae and insights into loss events in the Magnoliids. Genome Biology and Evolution. 9(9): 2354-64. doi: https://doi.org/10.1093/gbe/evx180
Twyford AD, Ness RW. 2017. Strategies for complete plastid genome sequencing. Molecular Ecology Resources. 17(5):858-68. doi: https://doi.org/10.1111/1755-0998.12626
Guan-Song Yang, Yin-Huan Wang, Yue-Hua Wang, Shi-Kang Shen. 2017. The complete chloroplast genome of a vulnerable species Champereia manillana (Opiliaceae). Conservation Genetics Resources. 9(3): 415-418. doi: https://doi.org/10.1007/s12686-017-0697-1
Thanks to Chao-Nan Fu, Han-Tao Qin, Xiao-Jian Qu, Shuo Wang, and Rong Zhang for testing and suggestions.