Skip to content

Latest commit

 

History

History
408 lines (292 loc) · 18.5 KB

README.md

File metadata and controls

408 lines (292 loc) · 18.5 KB

Botany meeting workshop tutorial

This is a step-by-step tutorial to prossess genome skimming data for phylogeny reconstruction. This tutorial will use the example dataset hosted on the PhyloHerb repository to reconstruct a plastome-based phylogeny for five species.

Events:

Upcoming workshop: Botany 2022, July 24 Abstract

Botany 2021 workshop: Abstract on the BSA website and archived here. Video recordings on Youtube.

I. Prerequisites

Install the following software

  1. Install GetOrganelle v1.7.0+, Bowtie2 v2.2.2+, and BLAST+ all at once with conda

Install Anaconda or miniconda for Linux, Mac, or PC (see https://conda.io/projects/conda/en/latest/user-guide/install/index.html)

Create a conda environment and install GetOrganelle and its dependencies (may take a while)

conda create --name phyloherb python=3.7.0
#answer promt questions 
#proceed ([y]/n)?
#y
#for Mac
#conda activate phyloherb

#for linux
source activate phyloherb
conda install -c bioconda getorganelle
conda install -c conda-forge biopython
conda install -c etetoolkit ete3

#get database for getorganelle
get_organelle_config.py --add embplant_pt,embplant_mt,embplant_nr

IMPORTANT: If you run into errors associated with Bowtie2, try: conda install tbb=2020.2 GetOrgannelle currently is not support on Windows :(

  1. Bandage: Assembly graph viewer with a graphical user interface. Can be downloaded and installed based on the instruction online.

  2. Optional Geneious: the free trial version lasts for 14 days. After that some functions are restricted. Your institute may have purchased a license for you.

  3. Pasta or MAFFT. Pasta is highly recommended for distantly related taxa and large datasets (many species, but not long alignments). Installing from Docker image should be the easiest because installation from the source code may require root permissions. If you plan to use it during the workshop, please try to have it ready prior to the workshop.

  4. IQ-TREE or RAxML or ExaML. Choose your favorite one!

II. Assembly

We will use GetOrganelle to assemble the plastome, mitochondrial genome, and ribosomal regions.

1. Input preparation:

Download PhyloHerb and example datasets from Github

git clone https://github.com/lmcai/PhyloHerb.git

Alternatively, you can download PhyloHerb from GitHub directly and then unzip the folder. No installation is needed.

Then let's create a working directory herbariomics_workshop.

mkdir herbariomics_workshop
cd herbariomics_workshop
#define path to PhyloHerb, the following command is for Mac/Linux only
export PH=$(pwd)/../PhyloHerb

#For PC users, you can always call phyloherb from its relative path.

Move and rename the example dataset from PhyloHerb to the current working directory

mkdir 0_fastq
mv $PH/example/*.gz 0_fastq

2. How to:

Load dependencies (assuming installing GetOrganelle under the conda environment named 'phyloherb')

#load Anaconda and GetOrganelle (for linux users)
module load Anaconda
source activate phyloherb

#for mac
#conda activate phyloherb

Create a working directory for assembly

mkdir 1_getorg
cd 1_getorg

mkdir chl
mkdir ITS
mkdir mito

The basic commands for running assembly with pair end data is as follows:

export DATA_DIR=$(pwd)/../0_fastq
#To assemble plant plastome
get_organelle_from_reads.py -1 $DATA_DIR/SP1_R1.100m.1.fastq.gz -2 $DATA_DIR/SP1_R2.100m.1.fastq.gz -o chl/sp1 -R 15 -k 21,45,65,85,95,105 -F embplant_pt

#To assemble plant nuclear ribosomal RNA
get_organelle_from_reads.py -1 $DATA_DIR/SP1_R1.100m.1.fastq.gz -2 $DATA_DIR/SP1_R2.100m.1.fastq.gz  -o ITS/sp1 -R 10 -k 35,85,105,115 -F embplant_nr

#To assemble plant mitochondria:
get_organelle_from_reads.py -1 $DATA_DIR/SP1_R1.100m.1.fastq.gz -2 $DATA_DIR/SP1_R2.100m.1.fastq.gz -o mito/sp1 -R 50 -k 21,45,65,85,105 -P 1000000 -F embplant_mt

For this dataset, each plastome assembly takes ~5GB memory, and ~120s CPU time.

If your GetOrganelle is complaining about not finding the seed database

############################################################################
ERROR: default embplant_pt,embplant_mt database not added yet!

Install it by: get_organelle_config.py -a embplant_pt,embplant_mt
or
Install all types by: get_organelle_config.py -a all

You will need to download them manually to ~/.GetOrganelle

cd ~/.GetOrganelle/SeedDatabase
#or create such directory if there is none
curl -L https://github.com/Kinggerm/GetOrganelleDB/releases/download/0.0.1/v0.0.1.tar.gz | tar zx
get_organelle_config.py -a embplant_pt,embplant_mt,embplant_nr --use-local ./0.0.1

3. Large dataset and batch submission to cluster

If you are working with a high performance computing cluster with slurm workload manager, you can modify the bash file getorg.sh and submit jobs to your cluster simultaneously.

Copy the example getorg.sh to current directory

cp $PH/phyloherbLib/getorg.sh .

The bash job looks like this:

#!/bin/bash
#
#SBATCH -n 8                 # Number of cores
#SBATCH -N 1                 # Number of nodes for the cores
#SBATCH -t 0-10:00           # Runtime in D-HH:MM format
#SBATCH -p <partition>    # Partition to submit to
#SBATCH --mem=10000            # Memory pool for all CPUs
#SBATCH -o getorg.out      # File to which standard out will be written
#SBATCH -e getorg.err      # File to which standard err will be written


#Activate conda environment, assuming the name of the conda environment is 'getorg'
#module load Anaconda
#source activate getorg

export DATA_DIR=<absolute path to input data>

#To assemble plant plastome
get_organelle_from_reads.py -1 $DATA_DIR/$1 -2 $DATA_DIR/$2 -o chl/$3 -R 15 -k 21,45,65,85,95,105 -F embplant_pt

#To assemble plant nuclear ribosomal RNA
get_organelle_from_reads.py -1 $DATA_DIR/$1 -2 $DATA_DIR/$2 -o ITS/$3 -R 10 -k 35,85,105,115 -F embplant_nr

#To assemble plant mitochondria:
get_organelle_from_reads.py -1 $DATA_DIR/$1 -2 $DATA_DIR/$2 -o mito/$3 -R 50 -k 21,45,65,85,105 -P 1000000 -F embplant_mt

IMPORTANT: Make sure you modify the above bash file to load correct environment, provide full path to the data, request correct resources, etc (marked by "<>").

To submit the job, you can provide a tab-delimited text file sample_sheet.tsv. The sp_prefix will be used throughout and in the final phylogeny.

sp_prefix	Forward_reads	Reverse_reads
sp1	SP1_R1.100m.1.fastq.gz	SP1_R2.100m.1.fastq.gz
sp2	SP2_R1.100m.1.fastq.gz	SP2_R2.100m.1.fastq.gz
sp3	SP3_R1.100m.1.fastq.gz	SP3_R2.100m.1.fastq.gz
sp4	SP4_R1.100m.1.fastq.gz	SP4_R2.100m.1.fastq.gz
sp5	SP5_R1.100m.1.fastq.gz	SP5_R2.100m.1.fastq.gz

Then generate a batch file using the submission function of phyloherb

cp $PH/example/sample_sheet.tsv .
python $PH/phyloherb.py -m submission -b getorg.sh -s sample_sheet.tsv -o submitter.sh

#submit jobs
sh submitter.sh

4. Output

The batch submission will generate three subdirectories chl/, ITS/, and mito/, each containing Getorganelle output directories named after sample-specific prefixes. The expected output can be found in example/results/1_getorg.tgz. For detailed descriptions of the output, see Getorganelle instructions

To view these resulting files:

cp $PH/example/results/1_getorg.tgz .
tar xzvf 1_getorg.tgz
rm 1_getorg.tgz

5. Assembly QC

After the assemblies are completed, you can summarize the results using the QC function of phyloherb. For each species, it will extract the following information: the number of total input reads, the number of reads used for assembly, average base coverage, the total length of the assembly, GC%, and whether the assembly is circularized.

#create a new folder under the herbariomics_workshop to store assemblies
cd herbariomics_workshop
mkdir 2_assemblies
python $PH/phyloherb.py -m qc -s 1_getorg/sample_sheet.tsv -i 1_getorg/chl -o 2_assemblies/chl

This command will copy all of the assemblies under the input directory 1_getorg/chl to a new directory 2_assemblies/chl and rename the files based on their species prefixes. In the output directory, you will also find a summary spreadsheet assembly_sum.tsv that looks like this

sp_prefix       Total_reads     Reads_in_target_region  Average_base_coverage   Length  GC%     Circularized
sp1     666666  30400.0 31.4    159658  0.36851895927545125     Yes
sp2     666666  48728.0 -246.9  133603  0.3567809106082947      No
sp3     666666  35628.0 43.1    133223  0.35511135464596955     No
sp4     666666  24956.0 25.4    159410  0.3685026033498526      No
sp5     666666  20474.0 35.3    131162  0.357679815800308       No

For nuclear ribosomal regions and mitochondrial assemblies:

python $PH/phyloherb.py -m qc -s 1_getorg/sample_sheet.tsv -i 1_getorg/ITS -o 2_assemblies/ITS
python $PH/phyloherb.py -m qc -s 1_getorg/sample_sheet.tsv -i 1_getorg/mito -o 2_assemblies/mito

The expected results can be found in example/results/2_assemblies.tgz

6. Assembly visualization with Bandage

Bandage is a program for visualising de novo assembly graphs. The assembly graph files *.fastg and *.gfa generated from our previous assembly step could be visualized in Bandage and exported into sequences. You can find a more detailed introduction about Bandage here.

The key output files generated from assembly include

• *.path_sequence.fasta, each fasta file represents one type of genome structure

• *.selected_graph.gfa, the assembly graph only with true contigs of organelle genome.

• extended_K*.assembly_graph.fastg, the raw assembly graph

• extended_K*.assembly_graph.fastg.extend_embplant_pt-embplant_mt.fastg, a simplified assembly graph

• extended_K*.assembly_graph.fastg.extend_embplant_pt-embplant_mt.csv, a tab-format contig label file for bandage visualization

• get_org.log.txt, the log file

The *.path_sequence.fasta files do not always navigate the right paths for organelle genomes, especially the ones with complicated structures. Here is an video introducing how to generate complete (if possible) and accurate sequences from Bandage with different examples. We will also practice with our own data during the workshop.

VI. Annotation and organelle structure variations

We will upload the circularized assembly to the web-based annnotator GeSeq. The annotations will be returned as genbank files, graphics, and alignments.

The genbank files can be imported to Geneious for visualization and manual curation.

V. Ortholog identification and alignment

  1. Ortholog identification

We will use the build-in references to extract plastid genes. The list of our reference species is here. The list of the genes in the database is here. We will only use blast hits longer than 120 bp.

Makesure executable blast commands are loaded in your environment. In the herbariomics_workshop directory, type the following commands

#create a new folder under the herbariomics_workshop to store alignments
cd herbariomics_workshop
mkdir 3_alignments
python $PH/phyloherb.py -m ortho -i 2_assemblies/chl -o 3_alignments/chl/ -l 120

In the output directory 3_alignments/chl, orthologous genes will be written to separate fasta files and the headers will be the species prefixes.

  1. Alignment

I will demonstrate the use of mafft and pasta, but the installation of pasta can be tricky though. We may not have enough time troubleshoot it during the workshop.

cd 3_alignments/chl
mafft --adjustdirection accD.fas | sed 's/_R_//g' >accD.mafft.aln.fas
mkdir accD.pasta
cd accD.pasta
cp ../accD.mafft.aln.fas .
run_pasta.py -i accD.mafft.aln.fas
  1. Nuclear ribosomal regions

The nuclear ribosomal data requires a slightly different curation strategy. The highly variable sequence requires more manual curation than the plastome. The nuclear ribosomal region exists as tandem repeats on multiple chromosomes.

Based on our experiences, NTS is not alignable even between closely related taxa. The entire rDNA region (18S+ITS1+5.8S+ITS2+28S) and some portion of ETS can be aligned at family level.

To get 18S, ITS1, 5.8S, ITS2, and 28S as separate fasta files, use the following commands

#go to the herbariomics_workshop folder
python $PH/phyloherb.py -m ortho -i 2_assemblies/ITS -o 3_alignments/ITS -rdna
  1. Build a custom reference database

We will use three annotated plastid genomes within the same family (Malpighiaceae) to build a custom reference data for gene and intergenic regions.

First, go to the NCBI genome database (https://www.ncbi.nlm.nih.gov/genome/browse/#!/overview/) to search and download published plastid genome annotations in genbank format. You can use the pre-downloaded one:

#create a new folder to store reference sequences 
cd herbariomics_workshop
mkdir genbank_ref
cp $PH/example/*.gb genbank_ref

Then to extract all genes from these genbank annotations

python $PH/phyloherb.py -m getseq -f gene -i genbank_ref -suffix .gb -o custom_ref

Within custom_ref, you will find a fasta file for each gene *.gene.fas. Each fasta file contains sequences from all species for that gene.

To extract intergenic regions, we need to define intergenic regions based on genbank annotations first. Here we will use three predefined intergenic regions.

cp $PH/example/gene_def.txt .
less gene_def.txt

#It looks like this:
name	start	end
INT1	psbA	matK
INT2	rps16	atpA
INT3	rpoC2	rpoB

Each row define the begin and end position of a genetic block. Note that they may contain multiple gene and intergenic regions.

Visualization of the INT1 region in Geneious

Visualization of the INT2 region in Geneious

Longer genetic block can result in higher blast accuracy, especially for intergenic regions. To get these three regions and the genes on both ends, type the following commands

python $PH/phyloherb.py -m getseq -f genetic_block -i genbank_ref -suffix .gb -o custom_ref -gene_def gene_def.txt

This will similarly result in multiple fasta files, each containing sequences from all species for that loci.

After checking gene names to make sure they are consistent, we can combine them into a single file.

cat custom_ref/*.fas | awk '/^>/{f=!d[$1];d[$1]=1}f' >custom_ref.fas

To use this custom database for ortholog extraction, use the following command

python $PH/phyloherb.py -m ortho -i 2_assemblies/chl -o 3_alignments/chl_custom -l 120 -ref custom_ref.fas
  1. Manual curation in Geneious

The fully functional version of Geneious (trial or subscribed version) is required to edit alignments. Alternative tools to view and edit alignments include AliView, Seaview, and many more.

VI. Phylogeny reconstruction

  1. Concatenation

Many tools are available for concatenating alignments. I recommend the conc function of phyloherb or Geneious. I have applied both tools to dataset with 1000 sp x 100 genes. The conc function of phyloherb will also output a gene delineation file that you can use to generate the configuration file for PartitionFinder.

To concatenate all of the fasta sequences in the input directory 3_alignments/chl with the suffix .mafft.aln.fas. Use the following commands:

#now go to the herbariomics_workshop folder
#copy pre-aligned fasta files
cp $PH/example/results/3_alignments.tgz .
tar xzvf 3_alignments.tgz
python $PH/phyloherb.py -m conc -i 3_alignments/chl -o 5sp_chl -suffix .mafft.aln.fas

IMPORTANT: Make sure you are using python 3. An error will occur if python 2 is used.

  1. Maximum likehood phylogeny

For this small dataset, we will use IQTREE to generate the maximum likelihood tree. After loading IQTREE to your environment, the example command is:

iqtree2 -m GTR+G -s 5sp_chl.conc.fas --prefix 5sp_chl.rnd1
  1. Second round of manual alignment curation

We will conduct a second round of alignment curation and remove spurious regions arising from assembly errors or false positive BLAST hits.

First, using a reference species tree (newick format), we will order the sequences based on their phylogenetic affinity using the order function of phyloherb. We will also remove sequences with >50% missing data.

python $PH/phyloherb.py -m order -t 5sp_chl.rnd1.treefile -i 3_alignments/chl -o 3_alignments/chl_ordered -suffix .mafft.aln.fas -missing 0.5

This will generate an ordered alignment *.ordered.fas and a companion tree file *.pasta_ref.tre for each gene in the newly created directory chl_ordered. You will need this tree for the second round of pasta alignment after manual curation.

Now let's load the ordered alignments to Geneious for some fine tuning. This time we will delete blocks of problematic sequences. They usually appears as a cluster of SNPs highlighted in red below. These SNPs are not conserved in their close relatives, so they are phylogenetically uninformative autapomorphies. Regardless of the causes, we can safely delete them.

After a second manual check, your alignments is ready for re-alignment in pasta. This time we will use a reference tree *.pasta_ref.tre to guide the alignment for each gene.

#go to the chl_ordered folder
run_pasta.py -i accD.mafft.aln.ordered.fas -a -t accD.mafft.aln.pasta_ref.tre -o accD.pasta

After the alignment is done, use the newly generated ordered alignment *.ordered.aln in each gene folder to repeat VI.1 to VI.2 to get your final species tree.