# Mapping and Phylogeny

## Introduction

Phylogenetics is essentially about similarity: it looks at patterns of similarity between taxa to infer their relationships. It has important applications in many fields of genome biology. For example, when annotating a gene in a new genome, phylogeny is useful for identifying previously annotated genes in other genomes that share a common ancestry. It is also becoming increasingly common to use phylogeny to trace the evolution and spread of bacterial diseases, and even as an epidemiological tool to help identify disease outbreaks in a clinical setting. Further analyses of genome sequences, such as examining recombination, molecular adaptation and the evolution of gene function, all benefit from phylogeny.

With sequence data, we can use multiple approaches to infer phylogenetic relationships between samples. All involve identifying variation, and they can work at different levels of resolution:

* a single gene
* MLST (multilocus sequence typing)
* cgMLST (core-genome MLST)
* core/accessory pangenome
* whole genome (WG)

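Broadly, resolution increases down this list: the more of the genome that is compared, the more finely closely related samples can be told apart. The two main approaches are the pangenome approach and the whole-genome approach. In this tutorial we will focus on the whole-genome approach.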

For reference mapping, whether we are dealing with different bacterial isolates, with viral populations in a patient, or even with genomes of different human individuals, the principles are essentially the same. Sequence reads are matched to a reference genome, and SNPs and INDELs are identified. These variants can be used to distinguish closely related populations or individual organisms, and so we may learn about genetic differences that cause drug resistance or increased virulence in pathogens, or changed susceptibility to disease in humans. One important prerequisite for the mapping of sequence data to work is that the reference and the re-sequenced subject have the same genome architecture.
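The mapping and variant-calling steps described above can be sketched as a small shell function. This is a minimal illustration rather than the exact workflow used later in the tutorial: the function name `map_and_call` and all file names are hypothetical, and `--ploidy 1` assumes a haploid genome such as a bacterium.

```shell
# Sketch: map paired-end reads to a reference and call variants.
# Arguments (all hypothetical placeholders):
#   $1 reference FASTA, $2 forward reads, $3 reverse reads, $4 output prefix
map_and_call() {
    local ref="$1" reads1="$2" reads2="$3" prefix="$4"

    # Build the BWA index for the reference (needed once per reference).
    bwa index "$ref" || return 1

    # Align the reads, then coordinate-sort the alignments into a BAM file.
    bwa mem "$ref" "$reads1" "$reads2" | samtools sort -o "$prefix.bam" - || return 1
    samtools index "$prefix.bam" || return 1

    # Pile up the aligned bases against the reference and call SNPs and INDELs.
    # --ploidy 1 assumes a haploid organism; -mv keeps variant sites only.
    bcftools mpileup -Ou -f "$ref" "$prefix.bam" \
        | bcftools call --ploidy 1 -mv -Oz -o "$prefix.vcf.gz" || return 1
}
```

A call such as `map_and_call reference.fa sample_1.fastq.gz sample_2.fastq.gz sample` would, under these assumptions, leave `sample.bam` and `sample.vcf.gz` in the current directory.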

## Learning outcomes
On completion of the tutorial, you can expect to be able to:

* List the different approaches to constructing a phylogeny for WGS data
* Map sequence reads to a reference genome and identify variants in your sample
* Create a sequence alignment of your samples
* Identify and remove recombination with Gubbins
* Draw a phylogenetic tree
* Visualise a phylogenetic tree in the context of sample metadata

## Tutorial sections
This tutorial comprises the following sections:   
 1. [Data formats for NGS](formats.ipynb) 
 2. [Converting between formats](conversion.ipynb)
  
## Authors and License
This tutorial was written by [Jacqui Keane](https://github.com/jacquikeane).

The content is licensed under a [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).

## Running the commands in this tutorial
You can follow this tutorial by typing all the commands you see in a terminal window on your computer. Remember, the terminal window is similar to the "Command Prompt" window on MS Windows systems, which allows the user to type DOS commands to manage files.

To get started, open a terminal window and type the command below followed by the `Enter` key:

In [None]:
cd ~/course_data/snp-phylogeny/data

## Prerequisites
This tutorial assumes that you have the following software and its dependencies installed on your computer. The software used in this tutorial may be updated from time to time, so we have also given the version that was used when writing this tutorial.


| Package name | Link for download/installation instructions                          | Version |
| :----------: | :------------------------------------------------------------------: |:------: |
| samtools     | https://github.com/samtools/samtools                                 | 1.17    |
| seaview      | https://doua.prabi.fr/software/seaview                               | 1.17    |
| bcftools     | https://github.com/samtools/bcftools                                 | 1.17    |
| bwa          | https://github.com/lh3/bwa                                           | 3.0.0   |
| fastp        | https://github.com/OpenGene/fastp                                    | 3.0.0   |
| snp-sites    | https://github.com/sanger-pathogens/snp-sites                        | 3.0.0   |
| gubbins      | https://github.com/nickjcroucher/gubbins                             | 3.0.0   |
| iqtree       | http://www.iqtree.org/                                               | 3.0.0   |
| FigTree      | http://tree.bio.ed.ac.uk/software/figtree/                           | 3.0.0   |
| Phandango    | https://jameshadfield.github.io/phandango/                           | 3.0.0   |
| Microreact   | https://microreact.org/                                              | 3.0.0   |

The easiest way to install the required software is using `conda`, a software package manager. The software has already been installed on this computer for you. To activate it, type:

In [None]:
conda activate snp-phylogeny
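If you are following along on your own machine, where the `snp-phylogeny` environment does not exist yet, it could be created along the lines below. This is only a sketch under assumptions: the function name `create_env` is hypothetical, the command-line tools are assumed to be available from the bioconda and conda-forge channels under the package names shown, and no versions are pinned.

```shell
# Packages to install; the names are assumed to match the bioconda package names.
PACKAGES="samtools bcftools bwa fastp snp-sites gubbins iqtree"

# Create the conda environment used in this tutorial.
create_env() {
    # $PACKAGES is left unquoted on purpose so that each name
    # is passed to conda as a separate argument.
    conda create --name snp-phylogeny \
        --channel conda-forge --channel bioconda \
        --yes $PACKAGES
}
```

Running `create_env` once and then `conda activate snp-phylogeny` should give a comparable setup. The graphical and web-based tools in the table are not covered by this sketch.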

After the software is activated, type the following commands:

In [None]:
samtools --help

In [None]:
bcftools --help

In [None]:
bwa

In [None]:
fastp

In [None]:
snp-sites

In [None]:
gubbins

In [None]:
iqtree

In [None]:
FigTree

Each command should return the help message for that tool.
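As an optional alternative to running each tool by hand, the check can be sketched as one helper that reports which commands are on your `PATH`. The function name `check_tools` is hypothetical and not part of any of the tools above. It only tests that a command can be found, not that it runs correctly.

```shell
# Report which of the named commands can be found on PATH.
# Returns 0 if every command is found, 1 if any is missing.
check_tools() {
    local tool missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_tools samtools bcftools bwa fastp snp-sites gubbins iqtree` prints one line per tool. Any line starting with `MISSING` suggests that the conda environment is not activated.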

To get started with the tutorial, go to the first section: [Introduction to phylogeny](phylogeny.ipynb)