# Phylogeny from whole genome sequence data

When we sequence a population, we aim to capture the variation (SNPs, indels, gene gain and loss etc.) in the samples and use it to infer the relationships between the samples. Two of the main approaches to capturing this variation and reconstructing the bacterial genomes are:

* De novo genome assembly and annotation
* Mapping and variant calling against a reference genome

Each approach has its benefits and limitations. We will focus on mapping and variant calling in this tutorial. Whether we are dealing with different bacterial isolates, with viral populations in a patient, or even with genomes of different human individuals, the principles of mapping and variant calling are essentially the same. Instead of assembling the newly generated sequence reads de novo to produce a new genome sequence, it is easier and much faster to align, or map, the sequence reads to a reference genome. We can then readily identify SNPs and indels that distinguish closely related populations or individual organisms, and thus learn about genetic differences that may cause drug resistance or increased virulence in pathogens, or changed susceptibility to disease in humans. One important prerequisite for read mapping to work is that the reference and the re-sequenced subject have the same genome architecture.

In this exercise, we will use sequence data from _Salmonella enterica serovar Typhi_ samples to demonstrate the mapping and variant calling approach. Importantly, although the data is based on real sequence data, it has been edited to make it run more efficiently for the purpose of this tutorial.

Navigate to the directory that contains the sequence data:

In [None]:
cd ~/course_data/snp-phylogeny/data/typhi

Take a look at the directory containing the sequence data for the samples:

In [None]:
ls fastq

## Introducing the tutorial dataset

We will use data adapted from the following paper:

> **A genomic snapshot of Salmonella enterica serovar Typhi in Colombia**  
> Guevara, Paula Diaz, et al.  
> _PLoS Neglected Tropical Diseases_ 2021. doi: 10.1371/journal.pntd.0009755  
> PMID: [34529660](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8478212/) 

_Salmonella enterica serovar Typhi_ (_S. Typhi_) is the causative agent of typhoid fever, with between 9 and 13 million cases and 116,800 associated deaths annually. Typhoid fever is still a public health problem in many countries, including in Latin America, which has a modelled incidence of up to 169 (32–642) cases per 100,000 person-years. Several international studies have aimed to fill data gaps regarding the global distribution and genetic landscape of typhoid; however, in spite of these efforts, Latin America is still underrepresented. This study provided the first enhanced insights into the molecular epidemiology of _S. Typhi_ in Colombia, using whole genome sequencing data to investigate the population structure in Colombia and identify predominant circulating genotypes.

## Overview of mapping and variant calling approach

The diagram below illustrates the steps involved when mapping and calling variants for a set of bacterial samples.

![Approach](img/snp-phylogeny-approach.png)

The first step once you have obtained your sequence data (FASTQ) from the sequencing machine is to QC the data. After QC, the sequence data is matched or aligned to a reference genome (FASTA) in a process called read mapping to produce a set of read alignments (SAM/BAM). These read alignments are inspected to identify differences between the aligned reads and the reference genome. This process is called variant calling and produces VCF files. In fact, during this process we capture information about every position in the genome (variant and non-variant sites) in the VCFs. Each site in the VCF has a set of quality filters applied, and any sites identified as low quality (e.g. fewer than 4 reads aligned at that position) are marked as low quality in the VCF to produce a filtered VCF. We use this filtered VCF file in a process called consensus calling to reconstruct a consensus _pseudosequence_ or _pseudogenome_ for our sample (FASTA). In the _pseudogenome_, any sites marked as low quality will be represented as an N in the reconstructed sequence. These pseudogenomes (multi-FASTA) are then aligned, and the variation between them is identified and used to reconstruct a phylogeny of our samples.

## Exercise

Now let's analyse some data!

### Prepare the data

First take a look at the sequence data provided.

In [None]:
ls fastq/

#### Check your understanding

1. How many samples have been sequenced?
1. How many fastq files are there? 


We will use the chromosome sequence of _Salmonella_ Typhi CT18 as the reference genome. This has already been downloaded from RefSeq. Take a look at the reference genome:

In [None]:
ls ref/

Check the size of the reference file:

In [None]:
assembly-stats ref/Styphi_CT18.fasta

Now use bwa to index the reference genome. This creates a lookup table that bwa uses when matching the sequence reads against the reference genome.

In [None]:
bwa index ref/Styphi_CT18.fasta

#### Check your understanding

1. How many sequences are in the reference fasta file?
2. What are the names of the sequences in the reference fasta file?
3. What is the size of the reference? 
4. What additional files did the indexing step produce?

### QC the sequence data

An important first step in any analysis is QC. Use FastQC to QC the data:

In [None]:
fastqc fastq/*.fastq.gz

Collate together all the QC reports into one file using MultiQC:

In [None]:
multiqc --filename summary.html fastq/

Open the collated report summary.html in firefox.

#### Check your understanding

1. XXX?
2. XXX?

### Trim the reads to remove low quality and adapter sequence

This is an optional step and can be carried out if your sequence reads have a high level of adapter contamination and/or have low quality bases at the end of the reads. The fastp software can be used to do this.

Use fastp to trim the reads for sample ERR5243693.

In [None]:
cd fastq
fastp \
   --in1 ERR5243693_1.fastq.gz --in2 ERR5243693_2.fastq.gz \
   --out1 ERR5243693_1.trim.fastq.gz \
   --out2 ERR5243693_2.trim.fastq.gz \
   --json ERR5243693.fastp.json --html ERR5243693.fastp.html \
   --detect_adapter_for_pe --cut_mean_quality 20 \
   --thread 2

Now repeat for the other samples.
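
Rather than retyping the command, the repeat can be written as a loop. The sketch below is one possible way to do it, not part of the original exercise: it assumes the remaining samples are ERR5243695 and ERR5243699 (the IDs used in the next cell) and that their files follow the same naming pattern as ERR5243693. The helper name `fastp_cmd` is made up for illustration. As written it only prints the commands so you can check them first.

```shell
# Build the fastp command for one sample (same options as used for ERR5243693).
# fastp_cmd is a hypothetical helper name, not part of fastp.
fastp_cmd() {
    sample="$1"
    echo "fastp" \
        "--in1 ${sample}_1.fastq.gz --in2 ${sample}_2.fastq.gz" \
        "--out1 ${sample}_1.trim.fastq.gz --out2 ${sample}_2.trim.fastq.gz" \
        "--json ${sample}.fastp.json --html ${sample}.fastp.html" \
        "--detect_adapter_for_pe --cut_mean_quality 20 --thread 2"
}

# Dry run: print one command per remaining sample.
for sample in ERR5243695 ERR5243699; do
    fastp_cmd "$sample"
done
```

Once the printed commands look right, replacing `fastp_cmd "$sample"` in the loop with `eval "$(fastp_cmd "$sample")"` would run them.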

Take a look at the output from fastp:

In [None]:
head -20 ERR5243693.fastp.json
head -20 ERR5243695.fastp.json
head -20 ERR5243699.fastp.json

#### Check your understanding
1. How much data (bp/base pairs) was lost due to trimming and adapter removal?
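
The fastp JSON report records the total number of bases before and after filtering (under `summary`, in `before_filtering` and `after_filtering`, as `total_bases`). A minimal sketch of the arithmetic is shown below. The function name `bases_lost` and the two totals are hypothetical and are not taken from the tutorial data; substitute the values from your own reports.

```shell
# bases_lost BEFORE AFTER: report how many bases trimming and filtering removed.
# Hypothetical helper for illustration only.
bases_lost() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        printf "%.0f bp lost (%.2f%%)\n", before - after, 100 * (before - after) / before
    }'
}

# Made-up totals: 1,000,000 bases before and 950,000 bases after trimming.
bases_lost 1000000 950000   # prints: 50000 bp lost (5.00%)
```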

### Map the data to a reference genome 

Use bwa to map the reads for sample ERR5243693 to the reference genome.  

In [None]:
bwa mem -t 2 ../ref/Styphi_CT18.fasta \
ERR5243693_1.sub.fastq.gz ERR5243693_2.sub.fastq.gz > \
ERR5243693.sam

Convert the sam file to a bam file with samtools:

In [None]:
samtools view -@ 2 -bhS -o ERR5243693.bam ERR5243693.sam

Sort the bam file and then index the sorted bam file:

In [None]:
samtools sort -@ 1 -o ERR5243693.sorted.bam -T \
ERR5243693.sorted ERR5243693.bam

In [None]:
samtools index ERR5243693.sorted.bam

Generate some statistics about the alignment:

In [None]:
samtools stats ERR5243693.sorted.bam > ERR5243693.stats
samtools flagstat ERR5243693.sorted.bam > ERR5243693.flagstat
samtools coverage ERR5243693.sorted.bam > ERR5243693.coverage

Now repeat for the other samples.

#### Check your understanding
1. What percentage of reads mapped to the reference for each sample?
2. What percentage of the genome was covered for each sample?
3. What is the mean depth of coverage for each sample?
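
The output of samtools coverage is a tab-separated table with one row per reference sequence, including the columns startpos, endpos, covbases and meandepth. If a reference has more than one sequence, the per-sequence values need to be weighted by sequence length to get genome-wide figures. The sketch below shows one way to do that with awk. The function name `summarise_coverage` and the toy table (two sequences with made-up values) are assumptions for illustration, not output from the tutorial data.

```shell
# summarise_coverage FILE: length-weighted summary of a samtools coverage table.
# Columns used: 2 = startpos, 3 = endpos, 5 = covbases, 7 = meandepth.
summarise_coverage() {
    awk -F '\t' '
        !/^#/ {
            len = $3 - $2 + 1
            total_len += len
            covered += $5
            depth_sum += $7 * len
        }
        END {
            printf "%.2f%% covered, mean depth %.1fx\n", 100 * covered / total_len, depth_sum / total_len
        }' "$1"
}

# Toy table with made-up values, written in the samtools coverage layout.
toy_cov=$(mktemp)
printf '#rname\tstartpos\tendpos\tnumreads\tcovbases\tcoverage\tmeandepth\tmeanbaseq\tmeanmapq\n' > "$toy_cov"
printf 'chrom1\t1\t1000\t500\t950\t95\t40\t36\t60\n' >> "$toy_cov"
printf 'plasmid1\t1\t200\t50\t100\t50\t10\t35\t55\n' >> "$toy_cov"

summarise_coverage "$toy_cov"   # prints: 87.50% covered, mean depth 35.0x
```

On the real data you would pass a file such as ERR5243693.coverage instead of the toy table.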

### Call variants

Go through each position in the reference genome and look at reads aligned at that position and make a call about what the base is in the sample at that position. If there are any differences then this is a variant. 

Do this for sample ERR5243693 using bcftools:

In [None]:
bcftools mpileup --fasta-ref ../ref/Styphi_CT18.fasta \
--min-BQ 20 \
--annotate \
FORMAT/AD,FORMAT/ADF,FORMAT/ADR,FORMAT/DP,FORMAT/SP,INFO/AD,\
INFO/ADF,INFO/ADR ERR5243693.sorted.bam | bcftools call \
--output-type v --ploidy 1 --multiallelic-caller - | \
bcftools view --output-file ERR5243693.vcf.gz --output-type z

Index the VCF file and generate some statistics:

In [None]:
tabix -p vcf -f ERR5243693.vcf.gz
bcftools stats ERR5243693.vcf.gz > ERR5243693.bcf_stats.txt

Now repeat for the other samples.

Look at the statistics for the variant calling:

In [None]:
less ERR5243693.bcf_stats.txt
less ERR5243695.bcf_stats.txt
less ERR5243699.bcf_stats.txt

#### Check your understanding
1. How many sites are in the VCF file for each sample?
2. Does this match the size of the reference used in the read mapping step?
3. How many variant sites were identified for each sample?

### Filter the sites/variants 
We want to identify calls where we have a high confidence that they are correct (and not due to sequencing errors and/or misalignment of the reads). We use criteria like read depth at a position, quality scores etc. to filter out low quality calls at each position.
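
To make the idea of a soft filter concrete, the toy sketch below flags sites in a small table rather than in a real VCF. It assumes a made-up four-column table (position, QUAL, depth, depth of the most common allele) and applies three thresholds of the kind used with bcftools in this section (quality below 25, depth below 10, major allele fraction below 0.9). The function name `flag_sites` is hypothetical. Note that every site is kept and only its label changes, which is what soft filtering means.

```shell
# flag_sites TABLE: label each site PASS or LowQual.
# TABLE columns (toy format): POS, QUAL, DP, depth of the most common allele.
flag_sites() {
    awk '{
        frac = ($3 > 0) ? $4 / $3 : 0
        flag = ($2 < 25 || $3 < 10 || frac < 0.9) ? "LowQual" : "PASS"
        print $1 "\t" flag
    }' "$1"
}

# Made-up sites: one good call, then low quality, low depth and mixed alleles.
toy_sites=$(mktemp)
printf '101 60 40 40\n102 12 35 34\n103 55 3 3\n104 70 50 30\n' > "$toy_sites"

flag_sites "$toy_sites"
```

Only position 101 is labelled PASS here; the other three sites each fail one criterion.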

Use bcftools to filter sites for sample ERR5243693.

In [None]:
bcftools filter \
--output ERR5243693.filtered.vcf.gz \
--soft-filter LowQual --exclude "%QUAL<25 || FORMAT/DP<10 \
|| MAX(FORMAT/ADF)<2 || MAX(FORMAT/ADR)<2 || \
MAX(FORMAT/AD)/SUM(FORMAT/DP)<0.9 || MQ<30 || MQ0F>0.1" \
--output-type z ERR5243693.vcf.gz

Index the filtered VCF file and generate some statistics:

In [None]:
tabix -p vcf -f ERR5243693.filtered.vcf.gz
bcftools stats ERR5243693.filtered.vcf.gz > \
ERR5243693.filtered.bcf_stats.txt

Take a look at the sites in the VCF file that passed the filters and notice how it contains information about both variant and non-variant sites.

In [None]:
bcftools view -f PASS ERR5243693.filtered.vcf.gz | less

Now repeat for the other samples.

#### Check your understanding
1. How many sites were marked as low quality in the filtering step?
2. How many variant sites were marked as low quality in the filtering step?

### Call a consensus sequence for each sample

A pseudogenome is a reconstruction of what we think the genome is for the sample using the reference genome as a basis. To create it for a sample, you go through each position in the reference and determine what base is called (using the VCF from the previous steps) for the sample. Sometimes this will be the same as the reference, sometimes different (a variant). For positions that are flagged as low quality/filtered out (e.g. no reads covering the position) we use an N in the pseudogenome. This is because you cannot be confident what the base is at this position for the sample. In the end the length of the pseudogenome for your sample should be the same as the length of the reference. 
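
The logic described above can be sketched on a toy example. This is not how vcf2pseudogenome.py is implemented; it is a simplified illustration that assumes a made-up three-column call table (position, alternative base or `.` for "same as reference", and filter status) and an 8 bp reference. The function name `make_pseudogenome` is hypothetical.

```shell
# make_pseudogenome REF CALLS: REF is the reference sequence as a string;
# CALLS has one line per called site: POS, ALT ("." = same as reference), FILTER.
make_pseudogenome() {
    awk -v ref="$1" '
        BEGIN { n = length(ref); for (i = 1; i <= n; i++) base[i] = "N" }  # default: unknown
        $3 == "PASS" { base[$1] = ($2 == ".") ? substr(ref, $1, 1) : $2 }  # confident call
        END { for (i = 1; i <= n; i++) printf "%s", base[i]; printf "\n" }
    ' "$2"
}

# Toy calls: a variant at position 3, a low-quality site at 4, no call at 7.
toy_calls=$(mktemp)
cat > "$toy_calls" <<'EOF'
1 . PASS
2 . PASS
3 T PASS
4 . LowQual
5 . PASS
6 . PASS
8 . PASS
EOF

make_pseudogenome ACGTACGT "$toy_calls"   # prints: ACTNACNT
```

The output has the same length as the reference, with the variant base at position 3 and N at the two positions that could not be called confidently.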

To create a pseudogenome for sample ERR5243693 use the script _vcf2pseudogenome.py_.

In [None]:
vcf2pseudogenome.py -r ../ref/Styphi_CT18.fasta \
-b ERR5243693.filtered.vcf.gz -o ERR5243693.fas

Now repeat for the other samples.

#### Check your understanding
1. What is the length of the pseudogenomes?
2. Does it match the length of the reference?
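
One way to check sequence lengths without extra software is a short awk function that sums the length of the sequence lines under each FASTA header. The sketch below is an illustration with a made-up toy file; the function name `fasta_lengths` is hypothetical. It handles sequences wrapped over several lines.

```shell
# fasta_lengths FILE: print each sequence name and its length, tab-separated.
fasta_lengths() {
    awk '
        /^>/ { if (name != "") print name "\t" len; name = substr($1, 2); len = 0; next }
        { len += length($0) }
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Toy FASTA: the first sequence is wrapped over two lines.
toy_fasta=$(mktemp)
printf '>ref\nACGTAC\nGT\n>sample1\nACTNACNT\n' > "$toy_fasta"

fasta_lengths "$toy_fasta"
```

Both toy sequences are reported as 8 bp long. On the real data you could run it on a pseudogenome such as ERR5243693.fas and on ../ref/Styphi_CT18.fasta and compare the two numbers.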

### Create a multiple sequence alignment of all pseudogenomes

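Remember, to reconstruct the phylogeny of our samples we need a multi-FASTA alignment of our sequences. Because every pseudogenome has the same length as the reference, the sequences are already aligned to one another, so we can simply concatenate the files.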

In [None]:
cat *.fas > aligned_pseudogenomes.aln

#### Check your understanding
1. How many sequences are in the multiple sequence alignment file of pseudogenomes?
2. What are the largest and mean sequence lengths?

### Add the reference genome to the multiple sequence alignment

In [None]:
cat ../ref/Styphi_CT18.fasta aligned_pseudogenomes.aln > ref_and_aligned_pseudogenomes.aln

At this point it would be useful to look at your alignment in a multiple sequence alignment viewer e.g. seaview.

In [None]:
seaview &

#### Check your understanding
1. How many sequences are in the multiple sequence alignment file of reference and pseudogenomes?

### Mask repetitive regions (optional but good practice)

Bases called in repetitive regions may not be true variation (e.g. due to misalignment of reads) and may compromise the core phylogenetic signal. Therefore it is good practice to identify known repetitive regions and mask them out of your alignment. Masking means replacing the bases in those regions with N in every sequence, so that they are treated as missing data while the alignment keeps its length and coordinates.

To achieve this, you need a file of regions to mask. Either one already exists for the reference you aligned to (see the literature), or you can generate one by matching the reference genome against itself with MUMmer (to identify repeat regions) and using PHAST to identify prophage. (Out of scope for now and details on how to do this will be added later).
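
As a toy illustration of what masking does to a sequence (not the masking commands for this tutorial, which are still to be added), the sketch below replaces a 1-based, inclusive range of positions with N. The function name `mask_region`, the sequence and the coordinates are all made up.

```shell
# mask_region SEQ START END: replace positions START..END (1-based, inclusive) with N.
mask_region() {
    awk -v seq="$1" -v s="$2" -v e="$3" 'BEGIN {
        masked = ""
        for (i = s; i <= e; i++) masked = masked "N"
        print substr(seq, 1, s - 1) masked substr(seq, e + 1)
    }'
}

mask_region ACGTACGTAC 4 6   # prints: ACGNNNGTAC
```

The masked sequence has the same length as the input, so it stays aligned with the other sequences.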

In [None]:
insert commands for masking

## Draw a tree with iqtree
Now that we have a multiple sequence alignment, we can use IQ-TREE to build a maximum likelihood phylogeny.

### Make a SNP-only alignment using snp-sites

Calculating a phylogeny on whole genome sequences can be very time consuming. We can speed this up by only using the variable sites (SNPs). However, we need to be aware that only including variable sites can affect the evolutionary rate estimates made by phylogenetics software - therefore, we need to account for the sites we remove in our analysis.
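
To see what "accounting for the sites we remove" involves, the toy sketch below counts the constant (invariant) columns of a tiny alignment, split by base in the order A,C,G,T, which is the form that IQ-TREE's -fconst option takes. It is a simplified illustration and not a replacement for snp-sites: it assumes one aligned sequence per line with no FASTA headers, and it does not treat N or gap characters specially. The function name `count_constant_sites` is hypothetical.

```shell
# count_constant_sites ALN: ALN holds one aligned sequence per line (toy format,
# no FASTA headers). Prints constant-site counts in the order A,C,G,T.
count_constant_sites() {
    awk '
        NR == 1 {
            n = length($0)
            for (i = 1; i <= n; i++) { first[i] = substr($0, i, 1); same[i] = 1 }
            next
        }
        { for (i = 1; i <= n; i++) if (substr($0, i, 1) != first[i]) same[i] = 0 }
        END {
            for (i = 1; i <= n; i++) if (same[i]) count[first[i]]++
            printf "%d,%d,%d,%d\n", count["A"], count["C"], count["G"], count["T"]
        }
    ' "$1"
}

# Toy alignment of three 7 bp sequences; columns 2 and 5 are variable.
toy_aln=$(mktemp)
printf 'ACGTAAC\nACGTAAC\nATGTCAC\n' > "$toy_aln"

count_constant_sites "$toy_aln"   # prints: 2,1,1,1
```

Here five of the seven columns are constant, so a SNP-only alignment would keep just two columns, and the four counts tell the phylogenetics software what was removed.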

We will use snp-sites to do this. You can view the options for snp-sites by running `snp-sites -h`.

First, remove all the invariant sites and create a SNP-only multiple sequence alignment.

In [None]:
snp-sites -o clean.full.SNPs.aln clean.full.aln

We can see how many invariant sites were removed (and what proportion of A, T, G, C they were) using

In [None]:
snp-sites -C clean.full.aln

We can look at the options for IQ-TREE below

In [None]:
iqtree -h

Draw the tree with IQ-TREE:

In [None]:
iqtree -s clean.full.SNPs.aln -fconst $( snp-sites -C clean.full.aln ) -m GTR+F+I -T 2 -mem 2G -B 1000 -o M66

In the command above, we:

* specify the multiple sequence alignment using -s clean.full.SNPs.aln
* ask IQ-TREE to take account of the missing invariant sites using -fconst $(snp-sites -C clean.full.aln)
* specify the evolutionary model we want IQ-TREE to use with -m GTR+F+I
* tell IQ-TREE to use a maximum of 2 CPUs (threads) and 2GB memory with -T 2 -mem 2G
* perform 1000 ultrafast bootstraps with -B 1000
* use sample M66 as an outgroup with -o M66

Our maximum likelihood tree is written to clean.full.SNPs.aln.treefile. The .treefile suffix is not recognised by some tools, so we'll copy the tree to a file with a different suffix:

In [None]:
cp clean.full.SNPs.aln.treefile clean.full.SNPs.aln.tre

We can look at the raw text:

In [None]:
cat clean.full.SNPs.aln.tre

Alternatively, we can visualise the tree using FigTree:

In [None]:
figtree clean.full.SNPs.aln.tre &

Side note:
Often you may want to incorporate a ‘finished reference genome’ into your tree, e.g. to use as an outgroup (a sequence that is more distantly related to your samples than they are to one another, which can be used to root the tree) or to see how your samples/isolates relate to it. If there is no sequence data available for the isolate, e.g. there is only a complete reference genome, then one approach is to shred the reference genome (FASTA file) into simulated reads (remember it is a haploid genome), treat the simulated data as another sample, and process it like the other samples. You will then have a ‘pseudogenome’ of the reference that will be included in the ref_and_aligned_pseudogenomes.aln file at this stage. The practical aspect of this is beyond the scope of this tutorial. The article below lists some popular tools for simulating reads:

https://www.nature.com/articles/s41437-022-00577-3

### Draw a phylogenetic tree

This step uses the SNP-only alignment pseudogenomes.snpsites.aln, which is created with snp-sites in the step "Identify variant sites across all samples".

Get the count of constant sites (for use with IQ-TREE -fconst):

In [None]:
conda activate snp-sites-2.5.1
snp-sites -C ref_and_aligned_pseudogenomes.aln
conda deactivate

Draw a tree using IQ-TREE, passing the constant-site counts to -fconst:

In [None]:
conda activate iqtree-2.2.0
iqtree \
    -fconst 1150024,1250471,1254262,1153503 \
    -alrt 1000 -B 1000 -m MFP -czb \
    -s pseudogenomes.snpsites.aln \
    -nt AUTO \
    -ntmax 2 \
    -mem 2GB
conda deactivate

Rename the tree file:

In [None]:
mv pseudogenomes.snpsites.aln.treefile \
pseudogenomes.snpsites.iqtree.tree

Draw a tree using FastTree:

In [None]:
conda activate fasttree-2.1.11
FastTree -gtr -gamma -nt pseudogenomes.snpsites.aln > \
pseudogenomes.snpsites.fasttree.tree
conda deactivate

## Remove recombination with Gubbins


### Identify regions of recombination (optional)

Variation due to recombination events can mask the core phylogenetic signal, so it is recommended to identify these regions in your alignment and mask them out.

Regions of recombination can be identified using Gubbins, as described later in this section.

### Identify variant sites across all samples

You now need to identify the sites in the genomes that differ in at least one of your samples and output a file containing the variant sites only. This can be done using snp-sites.

In [None]:
conda activate snp-sites-2.5.1
snp-sites ref_and_aligned_pseudogenomes.aln \
-o pseudogenomes.snpsites.aln
conda deactivate

#### Check your understanding

1. How many variant sites are identified?
2. Does this agree with the number you would expect, e.g. from the literature?

You can use assembly-stats to check the length of the SNP-only alignment:

In [None]:
conda activate assembly-stats-1.0.1
assembly-stats pseudogenomes.snpsites.aln
conda deactivate

### Accounting for recombination with Gubbins

We can use Gubbins to infer recombining sites by looking for increased SNP density that occurs in specific ancestral nodes. You can view the options for Gubbins:

In [None]:
run_gubbins.py -h

The following command runs Gubbins with default settings. The -c option tells the program to use 4 CPUs and the -p option tells it to name all output files with the prefix gubbins. This command can take a few minutes to run.

In [None]:
run_gubbins.py -c 4 -p gubbins clean.full.aln

NB. If Gubbins takes more than 10 minutes to complete, we have already run it for you. The files are available at ~/Module_5_Mapping_and_Phylogeny/gubbins_backups/.

In [None]:
ls -lh gubbins_backups/
cp gubbins_backups/* ./

Let's look at what Gubbins has done:

In [None]:
ls -l gubbins.*

You can explore these files. For example, gubbins.recombination_predictions.gff is a GFF file that contains a record of each recombination block identified, how many SNPs it contains, and which samples are affected.

In [None]:
head gubbins.recombination_predictions.gff

gubbins.final_tree.tre is a phylogeny in which recombination has already been accounted for.

You can visualise this in FigTree, as above, or in Microreact.

## Root the phylogeny

The trees generated by iqtree and gubbins are unrooted, but we may want to apply some evolutionary direction to them. One strategy for rooting a tree is called _midpoint rooting_. Midpoint rooting involves locating the midpoint of the longest path between any two tips and putting the root in that location. Note that this does not necessarily infer the true root, and this should be used with caution.

To midpoint root our tree, we will use a simple script written in python that uses the ete package. You can examine the code:

In [None]:
less midpoint.root.py

![Python script to midpoint root a tree](img/midpoint.root.image.png)

Run this script to midpoint root the tree.

In [None]:
python midpoint.root.py gubbins.final_tree.tre > gubbins.final_tree.midpoint.tre

Visualise this in figtree. How does it compare to the unrooted version?

Another common strategy for rooting the tree is _outgroup rooting_. This is the preferred approach for bacterial datasets. Outgroup rooting involves including one or more sequences in the analysis that are more distantly related to our sequences of interest than they are to one another. These sequences are usually referred to as _outgroups_. The root estimate is then simply the point at which the outgroup(s) join the tree. The best possible outgroups are those available which are most closely related to our sequences of interest but different enough to .... Examples.... 

How to include the outgroup: select sequence data for the outgroup and include it in your analysis (mapping and SNP calling from the beginning), then run Gubbins again with the outgroup and generate the stats.

### Visualise the trees

Use tree visualisation software such as iTOL, FigTree or Phandango to visualise your phylogenetic tree.

Using iTOL, open the files pseudogenomes.snpsites.iqtree.tree and pseudogenomes.snpsites.fasttree.tree.

#### Check your understanding

1. How do the trees produced by IQ-TREE and FastTree compare?

### Clean-up!

Clean up any intermediate files that were generated during the analysis that you no longer require. This is always an important last step of any analysis as sequence data analysis files can use up large amounts of disk space.

In [None]:
rm *.trim.fastq.gz
rm *.sub.fastq.gz
rm *.html
rm *.sam
rm ERR5243693.bam*
rm ERR5243695.bam*
rm ERR5243699.bam*
rm ERR5243693.vcf.gz*
rm ERR5243695.vcf.gz*
rm ERR5243699.vcf.gz*

Now go to the next section: [Phylogeny and Metadata](metadata.ipynb)