# Day 1, Practical 2

A statistic commonly used to represent the genetic diversity within an individual is the proportion of sites that have two different alleles, referred to as heterozygosity. While heterozygosity can refer to the proportion out of any set of markers, in this exercise we will look at the proportion of heterozygous sites out of all sites. Although heterozygosity is a measure of within-individual genetic variation, it also provides information about the genetic variation in the population as a whole.

In this exercise we will cover:
 - Filtering data to get accurate calls of heterozygous sites
 - How to estimate heterozygosity for a single individual
 - Comparing these values for multiple individuals
    
    
Tools used: bcftools, samtools, AWK, R

The notebooks are editable, so feel free to experiment and change the code to see what happens, or to write notes in the text cells. Just remember to download the notebooks (both the originals and any edited versions you make) to your own computer at some point so you can access them later.

First, we define the paths for the files we need during the exercise.

In [None]:
### data paths
BAM=/davidData/users/thomas/workshop/CTauTzS_8872.Goat.bam
GOODSITES=/davidData/users/thomas/workshop/Goat.siteQC.good.bed
GOAT_REF=/davidData/users/thomas/workshop/goat.fa.gz
HET=/davidData/users/thomas/workshop/het.roh.tsv

### make sure required software is installed
which bcftools
which samtools
which awk 

### make directory for the exercise
mkdir -p ~/kenya2024/GeneticDiversity
cd ~/kenya2024/GeneticDiversity


Do you remember from earlier what a BAM file contains? Let's look inside the BAM file we are going to use as input by running the following:

In [None]:
samtools view $BAM | head -n1

This outputs the information for a single read coming from a single wildebeest individual, the one with ID number 8872. The BAM file contains around 242 million such reads - only from one individual! Pause a bit and think about the enormity of this data, and how much information it contains. That's the power of whole-genome sequencing.

## Filtering
Since we are using data that has not been filtered, and because heterozygosity estimation depends a lot on correctly determining which sites are truly heterozygous, we will want to employ some filters on the mapped reads. But first we will need to know the average sequencing depth of our data. This is done by first extracting the depth for every site mapped to goat chromosome 27 or "NC_030834.1" with "samtools depth" and then piping this result into AWK to compute the mean.

In [None]:
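### compute depth
### note: by default samtools depth skips positions with zero coverage, so this is the mean over covered sites
samtools depth -r NC_030834.1 "$BAM" | awk '{sum+=$3} END { print "Mean = ", sum/NR}'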

Two of the filters we are going to use are given as options to bcftools below: "-Q 30  -q 25", which tell the program to simply ignore bases in reads if their base calling quality score is below 30 and to ignore entire reads if their mapping quality score is below 25.
 - What is the difference between base calling quality score and mapping quality score?
 - What do values of 30 and 25 correspond to? (Hint: https://en.wikipedia.org/wiki/Phred_quality_score)
 
In addition to the quality filters, we will also set a minimum and a maximum sequencing depth for each site. Here, we exclude a site from the estimation if its depth is less than half, or more than twice, the mean sequencing depth we just calculated. We add this filter because sites with unusually low or high depth are likely to be problematic. For example, unusually high depth can arise due to "paralogy", when two very similar regions in the genome of the sampled species both map to a single region in the reference genome.
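
The depth cutoffs described above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name `depth_cutoffs`, the toy input and the choice to round both cutoffs down to whole numbers are ours, not part of the exercise.

```shell
# Hypothetical helper: read "chrom pos depth" lines (the samtools depth format)
# and print the mean depth plus the half-mean and double-mean cutoffs.
depth_cutoffs() {
    awk '{ sum += $3 }
         END {
             if (NR == 0) exit 1                     # no input lines, nothing to average
             mean = sum / NR
             printf "mean=%.1f min=%d max=%d\n", mean, int(mean / 2), int(mean * 2)
         }'
}

# Toy input with made-up depths, only to show the output format.
printf 'chrT\t1\t16\nchrT\t2\t18\nchrT\t3\t17\n' | depth_cutoffs
```

In the exercise, the same function could be fed with the output of the depth command, for example `samtools depth -r NC_030834.1 "$BAM" | depth_cutoffs`.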

In addition to these basic filters, there are several others that can be considered before calling genome-wide heterozygosity. For example, various kinds of repetitive sequence are spread across the genomes of many organisms (in the goat genome they make up about 45%). Mapping is tricky in such regions, which makes the information about sites there unreliable. One way to identify these regions so that they can be excluded from the analysis is the tool RepeatMasker, which in our case was run on the goat reference genome beforehand. We will not go into detail about this procedure, but we have prepared in advance a list of sites that have passed more extensive filtering. This list of what we can call "good sites" for heterozygosity estimation can then be supplied to bcftools with the "-T" option.

It looks like this inside:

In [None]:
head $GOODSITES

Now we are almost ready to apply these filters and call genotypes on sites that pass the filters. However, going through an entire genome and evaluating every single position takes time and processing power, so for the purpose of demonstration we will use data mapped to only one small region, namely positions 30,000,000 to 31,000,000 on goat chromosome 27. In a real study, we would want to base the heterozygosity estimation on as much data as possible to get the most accurate estimate.
 - How large a portion of chromosome 27 is this subset of 1 million bases? What about compared to the whole genome? (Hint: Look here https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_001704415.2/)
 
We now apply our filters and call genotypes in the small region:

In [None]:
### filtering
bcftools mpileup --threads 10 --full-BAQ -r NC_030834.1:30000000-31000000 -T $GOODSITES -Q 30 -q 25 -O u \
    --fasta-ref $GOAT_REF --per-sample-mF -a FORMAT/AD,FORMAT/DP $BAM | \
    bcftools call -Ob -o CTauTzS_8872.subset.bcf.gz --threads 10 -c

Next, we want to do some more processing of the output and fill in the tag "AC" in the BCF file, which is short for "allele count". The command to do this may look a bit scary, but if we break it down it shouldn't be too bad. In the first part, the expression after -i removes any site where the reference or alternate allele is not a single base (non-SNPs), as well as any site where the number of reads covering it is less than half or more than double the mean depth of coverage. The next part removes any site that is called heterozygous but where fewer than two reads support one of the two alleles, because this is unlikely to occur if the site is truly heterozygous, but can happen because of a single erroneous base call. The last part on the last line fills in the tag "AC" with the number of alternate alleles at each remaining site. We will need this tag later.

In [None]:
### more filtering and filling AC tag
bcftools view --threads 10 -i 'strlen(REF)==1 & (strlen(ALT)==1 || ALT=".") &  FMT/DP>=8 & FMT/DP<=34' \
    -M 2 -Ou CTauTzS_8872.subset.bcf.gz | \
    bcftools view --threads 10 -i '(GT=="het" & FMT/AD[*:0]>=2 & FMT/AD[*:1]>=2 ) || GT=="hom"' | \
    bcftools +fill-tags /dev/stdin -Ob -o CTauTzS_8872.subset.filtered.bcf.gz -- -t AC
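
To see the logic of the allele-support filter in isolation, here is a toy sketch in plain awk. It does not read BCF files. It assumes a made-up three-column text input (genotype, reads supporting the reference allele, reads supporting the alternate allele), and the function name `keep_supported` is hypothetical.

```shell
# Hypothetical helper mimicking the allele-support filter on plain text rows:
# column 1 = genotype, column 2 = reads supporting REF, column 3 = reads supporting ALT.
keep_supported() {
    awk '$1 == "0/0" || $1 == "1/1" { print; next }      # homozygous calls pass unchanged
         $1 == "0/1" && $2 >= 2 && $3 >= 2 { print }'    # het calls need at least 2 reads per allele
}

# Made-up rows: the third one is a het call with a single ALT read and is dropped.
printf '0/0 15 0\n0/1 9 8\n0/1 16 1\n1/1 0 14\n' | keep_supported
```

The heterozygous row with only one read for the alternate allele is the kind of call that a single erroneous base could produce, which is why it is removed.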

Let's have a look at the resulting file (grep -v '^##' skips the header lines that contain metadata):

In [None]:
bcftools view CTauTzS_8872.subset.filtered.bcf.gz | grep -v '^##' | head -n500 | column -t 

 - Try to see if you can find a heterozygous site in the cell above. (Hint: Start by looking at sites where the "ALT" field is not "." and then look at the genotype or "GT")
 - If a site is truly heterozygous, what proportion of reads covering that site would you expect to support each of the two alleles?

## Estimation
Above, we made a cautiously filtered BCF file containing genotypes from a small region of chromosome 27. From here, the actual heterozygosity estimation is quite simple both in concept and execution. We are simply going to count all the sites called as heterozygous and then calculate their proportion out of all the sites contained in the file. We can do this by piping the "AC" field into the tool AWK and then counting how often each value occurs.

In [None]:
# count up alleles:
bcftools query -f '%INFO/AC\n' CTauTzS_8872.subset.filtered.bcf.gz | \
awk '{a[$1]++} END {for (allele in a){print allele, a[allele]}}' > CTauTzS_8872.AC

In [None]:
cat CTauTzS_8872.AC

Saved to the file above we have the counts of sites that are homozygous for the alternative allele, sites that are heterozygous and sites that are homozygous for the reference allele.
- Why do you suppose the counts are labeled as "2", "1" and "."?
- What do you think could be the reason for the large number of sites homozygous for the alternative allele? (Hint: What could the reference/alternative alleles be defined in relation to? What is the reference genome that our data is mapped to?)

To find the proportion of heterozygous sites simply divide the count of heterozygous sites by the total count (fill in the blanks and run the cell):

In [None]:
... / ( ... + ... + ... )
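
The same division can also be sketched with AWK. This is a minimal illustration, assuming an input in the two-column "allele count, number of sites" layout of the .AC file. The function name `het_proportion` and the counts below are made up and are not results from the exercise data.

```shell
# Hypothetical helper: read "allele_count n_sites" lines, as written to the .AC file,
# and print heterozygous sites (allele count 1) divided by all sites.
het_proportion() {
    awk '{ total += $2 }
         $1 == "1" { het += $2 }
         END { if (total > 0) printf "%.4f\n", het / total }'
}

# Made-up counts, only to show how the calculation works.
printf '. 9000\n1 20\n2 980\n' | het_proportion
```

With the real counts, the function could be run as `het_proportion < CTauTzS_8872.AC`.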

 - Now try to run the estimation again, but on this file: /davidData/users/thomas/workshop/CTauTzS_8872.chr27.filtered.bcf.gz, which is the same as the one we generated except that it covers the whole of chromosome 27. (Hint: replace "CTauTzS_8872.subset.filtered.bcf.gz" with the new file path in the cell where we count up the alleles.) Why do you get different values on the whole of chromosome 27 compared to the small region/subset, and which should we trust more?
 - Then try to run the filtering and then the estimation again with one or more of the filters turned off and see what kind of difference it makes. (Hint: look to the options "-T", "-Q" or "-q")
 - Why do you think we see a difference?

## Comparison between individuals
In isolation, the heterozygosity value we just estimated does not tell us much, so we need something to compare it to. Compared to other large African mammals, this value is generally on the low side, but let's look at some more wildebeest samples for comparison.
While we could have estimated values for more individuals by repeating the previous procedure, we have cheated a bit and done this ahead of time, saving the heterozygosities estimated from whole genomes in a file where the relevant values look like this:

In [None]:
cut -f1,6,7 "$HET" | column -t

To get a better overview of these values, it is common to plot them, for example as a boxplot separated by a grouping of interest such as population, locality or species. Here the field "map" denotes each sample's locality, and we can use this information to separate the samples into groups.

In [None]:
options(repr.plot.width = 16)
het_table <- read.table("/davidData/users/thomas/workshop/het.roh.tsv", header = TRUE)
boxplot(het ~ map, data = het_table, col = "hotpink")
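
As a numeric complement to the boxplot, the mean heterozygosity per locality can be sketched with AWK. This is only an illustration: it assumes a whitespace-separated table with a header row containing the columns "het" and "map" (the names used in the R formula above), and the function name `mean_het_by_map` and the toy rows are made up.

```shell
# Hypothetical helper: mean of the "het" column per value of the "map" column,
# locating both columns by their header names.
mean_het_by_map() {
    awk 'NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # remember column positions
         { sum[$(col["map"])] += $(col["het"]); n[$(col["map"])]++ }
         END { for (g in sum) printf "%s %.4f\n", g, sum[g] / n[g] }' | sort
}

# Made-up rows, not values from the workshop table.
printf 'sampleID het map\ns1 0.0020 A\ns2 0.0030 A\ns3 0.0010 B\n' | mean_het_by_map
```

On the workshop data this could be run as `mean_het_by_map < "$HET"`, provided the locality names contain no spaces.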

 Now we have a better overview of the distributions of heterozygosity.
  - What could be a reason that we see differences in heterozygosity between some populations, but not so much within the different populations?
  - What could be the reason for the outliers we see? (the dots at Etosha and Monduli)
  - The value we got for individual 8872 was lower than those of the other wildebeest populations. Individual 8872 comes from Nyerere National Park in Tanzania (formerly Selous Game Reserve). Can you suggest an explanation for why it has lower heterozygosity?
  
Finally, we have also estimated heterozygosity in a similar manner from a range of other species belonging to the Tragelaphines, or spiral-horned antelopes. Here we plot these values to allow comparison with the wildebeests.

In [None]:
het_table_trag <- read.table("/davidData/users/thomas/workshop/heterozygosity_trag.txt", header = FALSE)
# map the four-letter sample prefix to a species name (avoids shadowing base::names)
species_names <- c("Tory" = "eland", "Tder" = "giant_eland", "Tstr" = "greater_kudu", "Timb" = "lesser_kudu", "Tbux" = "mountain_nyala", "Scaf" = "nyala", "Tspe" = "sitatunga", "Tscr" = "bushbuck")

het_table_trag$species <- species_names[substr(het_table_trag$V1, 1, 4)]
het_table_wildebeest <- data.frame(V1 = het_table$sampleID, V2 = het_table$het, species = rep("wildebeest", nrow(het_table)))
het_table_species <- rbind(het_table_trag, het_table_wildebeest)

boxplot(V2 ~ species, data = het_table_species, col = "salmon", ylab = "Heterozygosity")



- Compare the heterozygosity values for the different species, and try to relate them to the actual population size of each species. Is there a clear correlation between population sizes and heterozygosities? Why or why not? (Estimates of the current total number of individuals can be found online, for example here: https://www.iucnredlist.org/ja/species/22054/166487759)