# Genotyping short tandem repeats (STRs) in sequencing data with GangSTR
### BIO392 30.09.2022
Contact: Max Verbiest (maxadriaan.verbiest@uzh.ch)

In this notebook, we will do a hands-on introduction to genotyping STRs in alignments using GangSTR. We will subsequently use GangSTR's output to look for a mutation in one of the two samples we analyse.

The situation is as follows:

You have two alignments resulting from the sequencing of two samples from a colorectal cancer patient. You know that one sample was taken from healthy tissue, and one from the patient's tumour. The only problem is... you forgot which sample came from which tissue! (very embarrassing)
Your task is now to figure out which of the two samples is most likely to come from the tumour, based on the STR genotypes that you will determine with GangSTR.

### 1: Run GangSTR

First of all, we need to run GangSTR. Open up your terminal and follow the steps from the image below:


![title](img/run_gangstr.png)

If all went well, GangSTR should have generated output files in the 'results' directory. Let's check:


![title](img/check_output.png)

Feel free to take a look at the contents of these output files. The most relevant files for us are the ones with the '.vcf' extension (Variant Call Format). For now, just know that these are output files generated by GangSTR that will tell us the STR genotypes at the loci we specified. In a later session we will take a closer look at file formats related to the analysis of biological sequences.
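
If you prefer to check the output from inside the notebook rather than the terminal, the following is a minimal sketch that groups the files below a directory by extension. The helper name `group_by_extension` is made up for this illustration, and the `../results` path is assumed from the paths used later in this notebook.

```python
from collections import defaultdict
from pathlib import Path


def group_by_extension(directory):
    """Return {extension: [relative file paths]} for all files below `directory`."""
    root = Path(directory)
    grouped = defaultdict(list)
    if not root.is_dir():
        # Nothing to report, e.g. if GangSTR has not been run yet.
        return dict(grouped)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            grouped[path.suffix].append(path.relative_to(root).as_posix())
    return dict(grouped)


# Assumed location of the GangSTR output; adjust it if your layout differs.
for extension, files in group_by_extension("../results").items():
    print(extension, files)
```

The entries listed under `.vcf` are the files we will load in the next section.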

### 2: Load GangSTR output as a pandas DataFrame

First, we will load the required libraries and scripts.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("poster")

import gangstr_utils

%matplotlib inline

Next, we load the GangSTR vcf files into Pandas DataFrames.

As a side note: for this notebook I wrote a *very* rudimentary function to parse these vcf files and combine them with our previously generated set of reference STR loci. This reduces the number of dependencies that need to be installed for the notebook to work. In any real-life scenario, it is much wiser to use a third-party vcf parsing library, such as [cyvcf2](https://brentp.github.io/cyvcf2/).
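
To give an idea of what such a rudimentary parser involves, here is a minimal sketch in plain Python. It is not the code in `gangstr_utils`; the function names are hypothetical and the example record is made up. It assumes a single-sample VCF in which the copy numbers are stored in a FORMAT field called `REPCN` and the locus end in an INFO field called `END`, so check the header of your own file before relying on those names.

```python
def parse_vcf_records(vcf_text):
    """Yield one dict per data line of a single-sample VCF given as a string."""
    for line in vcf_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, '##' meta lines and the '#CHROM' header
        fields = line.split("\t")
        chrom, pos = fields[0], fields[1]
        info, fmt, sample = fields[7], fields[8], fields[9]
        # INFO is a ';'-separated list of KEY=VALUE pairs (flags have no '=').
        info_dict = dict(
            item.split("=", 1) if "=" in item else (item, True)
            for item in info.split(";")
        )
        # FORMAT names the ':'-separated values in the sample column.
        sample_dict = dict(zip(fmt.split(":"), sample.split(":")))
        yield {"chr": chrom, "pos": int(pos), "info": info_dict, "sample": sample_dict}


def repeat_copy_numbers(record):
    """Return the copy numbers in the (assumed) REPCN field as a tuple of ints."""
    # A missing genotype (".") is not handled in this sketch.
    return tuple(int(n) for n in record["sample"]["REPCN"].split(","))


# Made-up record for illustration only, not taken from the real output files.
example_vcf = "\n".join([
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample_1",
    "chr5\t298\t.\tAAAAAAAAAAAA\t.\t.\t.\tEND=309;RU=a;PERIOD=1;REF=12\tGT:REPCN\t0/0:12,12",
])

for record in parse_vcf_records(example_vcf):
    print(record["chr"], record["pos"], record["info"]["END"], repeat_copy_numbers(record))
```

A dedicated library handles many cases this sketch ignores, such as multi-sample files, missing values and header validation.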

In [2]:
df_str_loci = pd.read_csv(
    "../data/repeats/APC_repeats.tsv", 
    sep="\t",
    header=None,
    names=["chr", "start", "end", "unit_len", "unit"]
)

df_gangstr_results_s1 = gangstr_utils.load_gangstr_output(df_str_loci, "../results/sample_1/sample_1.vcf")
df_gangstr_results_s2 = gangstr_utils.load_gangstr_output(df_str_loci, "../results/sample_2/sample_2.vcf")

Let's print the first 10 entries of one of these dataframes:

In [3]:
df_gangstr_results_s1.head(10)

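|   | chr | start | end | unit_len | unit | ref | alt |
|---|-----|-------|-----|----------|------|-----|-----|
| 0 | chr5 | 298 | 309 | 1 | A | 12 | 12 |
| 1 | chr5 | 7241 | 7249 | 1 | A | 9 | 9 |
| 2 | chr5 | 9390 | 9399 | 1 | A | 10 | 10 |
| 3 | chr5 | 10062 | 10077 | 1 | T | 16 | 16 |
| 4 | chr5 | 10673 | 10688 | 1 | A | 16 | 16 |
| 5 | chr5 | 15411 | 15439 | 1 | T | 29 | 29 |
| 6 | chr5 | 15503 | 15512 | 1 | T | 10 | 10 |
| 7 | chr5 | 16887 | 16897 | 1 | T | 11 | 11 |
| 8 | chr5 | 17044 | 17058 | 1 | T | 15 | 15 |
| 9 | chr5 | 19394 | 19402 | 1 | A | 9 | 9 |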


This is what every column in this dataframe represents:
 - 'chr': the chromosome the STR is located on
 - 'start': the first position of the STR 
 - 'end': the last position of the STR
 - 'unit_len': the length of the repeating DNA motif
 - 'unit': the sequence of the repeating DNA motif 
 - 'ref': the copy number that this STR locus has in the reference genome
 - 'alt': the copy number that GangSTR found for this STR locus in the alignment
 
Using this information, we can start our hunt for mutations!
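
As a quick sanity check on these column definitions: because 'start' and 'end' are both inclusive, the number of positions spanned by a locus divided by 'unit_len' should give back 'ref'. The sketch below checks this for three rows copied by hand from the table above; the helper name `expected_copy_number` is made up for this illustration.

```python
def expected_copy_number(start, end, unit_len):
    """Copy number implied by inclusive start/end coordinates and the motif length."""
    return (end - start + 1) // unit_len


# (start, end, unit_len, ref) for the first three loci in the table above.
example_loci = [
    (298, 309, 1, 12),
    (7241, 7249, 1, 9),
    (9390, 9399, 1, 10),
]

for start, end, unit_len, ref in example_loci:
    print(start, end, expected_copy_number(start, end, unit_len) == ref)
```

Since the function only uses arithmetic operators, it should also work on whole DataFrame columns, for example `expected_copy_number(df["start"], df["end"], df["unit_len"])`.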

### 3: Combine information from both samples and look for mutations

We will first combine data from our two samples into one DataFrame.

**Bonus exercise:**
If you are already familiar with Pandas, or would like to learn, you can try to merge the two dataframes yourself! The [pandas documentation pages](https://pandas.pydata.org/docs/) have a ton of information and tutorials. Alternatively, you can use the pre-written merging function in the block below.

In [5]:
df_comparison = gangstr_utils.merge_samples(
    sample1=df_gangstr_results_s1,
    sample2=df_gangstr_results_s2
)


df_comparison.head(10)

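|   | chr | start | end | unit_len | unit | ref | alt_s1 | alt_s2 | sample_difference |
|---|-----|-------|-----|----------|------|-----|--------|--------|-------------------|
| 0 | chr5 | 298 | 309 | 1 | A | 12 | 12 | 12 | 0 |
| 1 | chr5 | 7241 | 7249 | 1 | A | 9 | 9 | 9 | 0 |
| 2 | chr5 | 9390 | 9399 | 1 | A | 10 | 10 | 10 | 0 |
| 3 | chr5 | 10062 | 10077 | 1 | T | 16 | 16 | 16 | 0 |
| 4 | chr5 | 10673 | 10688 | 1 | A | 16 | 16 | 16 | 0 |
| 5 | chr5 | 15411 | 15439 | 1 | T | 29 | 29 | 29 | 0 |
| 6 | chr5 | 15503 | 15512 | 1 | T | 10 | 10 | 10 | 0 |
| 7 | chr5 | 16887 | 16897 | 1 | T | 11 | 11 | 11 | 0 |
| 8 | chr5 | 17044 | 17058 | 1 | T | 15 | 15 | 15 | 0 |
| 9 | chr5 | 19394 | 19402 | 1 | A | 9 | 9 | 9 | 0 |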
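
For the bonus exercise, one possible way to build a table like this with plain pandas is sketched below. It is not necessarily how `gangstr_utils.merge_samples` is implemented; the function name `merge_two_samples` is hypothetical. The sign convention for 'sample_difference' (sample 2 minus sample 1) is inferred from the output shown in this notebook, and the toy input uses two loci copied by hand from the tables.

```python
import pandas as pd


def merge_two_samples(sample1, sample2):
    """Join two per-sample genotype tables on the columns that describe the locus."""
    locus_columns = ["chr", "start", "end", "unit_len", "unit", "ref"]
    merged = sample1.merge(
        sample2,
        on=locus_columns,
        how="inner",               # keep only loci genotyped in both samples
        suffixes=("_s1", "_s2"),   # 'alt' becomes 'alt_s1' and 'alt_s2'
        validate="one_to_one",     # raise if a locus occurs twice in either table
    )
    merged["sample_difference"] = merged["alt_s2"] - merged["alt_s1"]
    return merged


# Toy input with two loci, typed in by hand for illustration.
locus_info = {
    "chr": ["chr5", "chr5"],
    "start": [298, 137481],
    "end": [309, 137490],
    "unit_len": [1, 2],
    "unit": ["A", "AG"],
    "ref": [12, 5],
}
toy_s1 = pd.DataFrame({**locus_info, "alt": [12, 9]})
toy_s2 = pd.DataFrame({**locus_info, "alt": [12, 5]})

print(merge_two_samples(toy_s1, toy_s2))
```

An inner join silently drops loci that are missing from one sample; using `how="outer"` instead would keep them with missing values, which may be preferable if you want to notice such gaps.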


Now that we have both samples in the same DataFrame, we can easily determine if there is an STR locus where the samples have a different genotype:

In [6]:
df_comparison[df_comparison["sample_difference"] != 0]

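|    | chr | start | end | unit_len | unit | ref | alt_s1 | alt_s2 | sample_difference |
|----|-----|-------|-----|----------|------|-----|--------|--------|-------------------|
| 86 | chr5 | 137481 | 137490 | 2 | AG | 5 | 9 | 5 | -4 |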
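
To interpret this row, it helps to translate copy numbers into base pairs: each sample's deviation from the reference is the copy number difference times 'unit_len'. The sketch below does this for the values in the row above, using a made-up helper name. It also reports whether the length change is a multiple of 3, because an insertion or deletion whose length is not a multiple of 3 shifts the reading frame if it falls inside a protein-coding sequence. Whether this locus is coding is a separate question that the code does not answer.

```python
def length_change_bp(ref_copies, alt_copies, unit_len):
    """Length difference to the reference in base pairs (positive = expansion)."""
    return (alt_copies - ref_copies) * unit_len


# Values copied from the row with index 86 above.
ref, alt_s1, alt_s2, unit_len = 5, 9, 5, 2

for name, alt in [("sample 1", alt_s1), ("sample 2", alt_s2)]:
    change = length_change_bp(ref, alt, unit_len)
    multiple_of_three = change % 3 == 0
    print(f"{name}: {change:+d} bp relative to the reference, multiple of 3: {multiple_of_three}")
```

With these numbers, sample 1 carries four extra AG units (8 bp) compared to the reference, while sample 2 matches the reference.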


### 4: Determining the relevance of observed STR variation

The APC gene starts at position 112'702'498 on the forward strand of chromosome 5. You can find the entry for the APC gene in Ensembl [here](http://www.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000134982;r=5:112707498-112846239;t=ENST00000257430). Using the start position of the APC gene, and the start and end positions of the mutated STR we detected, try to figure out if you expect the observed mutation to have an impact on the function of the APC gene.

To make life easier, you can set the 'start' and 'end' variables in the code block below to the appropriate values. If you then run the code, it will print a link that you can follow to the region of interest. Look for the 'All phenotype-associated - short variants (SNPs and indels)' section of the page.

In [10]:
# TODO: fill in start and end positions you are interested in
start = 112702498+137481
end = 112702498+137490


print(f"https://www.ensembl.org/Homo_sapiens/Location/View?db=core;r=5:{start}-{end};contigviewbottom=variation_set_ph_variants=labels")


https://www.ensembl.org/Homo_sapiens/Location/View?db=core;r=5:112839979-112839988;contigviewbottom=variation_set_ph_variants=labels
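
If you want to inspect more than one locus, the same link can be built by a small helper instead of editing the numbers by hand. This is a sketch with a hypothetical function name; like the cell above, it assumes that the 'start' and 'end' columns are plain offsets that can be added to 112702498.

```python
REGION_START = 112702498  # offset used in the cell above


def ensembl_region_url(locus_start, locus_end, region_start=REGION_START, chromosome="5"):
    """Build an Ensembl location link for a locus given as offsets into the region."""
    start = region_start + locus_start
    end = region_start + locus_end
    return (
        "https://www.ensembl.org/Homo_sapiens/Location/View?db=core;"
        f"r={chromosome}:{start}-{end};"
        "contigviewbottom=variation_set_ph_variants=labels"
    )


# The mutated locus found in section 3.
print(ensembl_region_url(137481, 137490))
```

Applied to each row of `df_comparison[df_comparison["sample_difference"] != 0]`, this would give one link per locus where the samples differ.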


After having explored the potential effects and phenotype associations on Ensembl, do you expect our observed mutation to have an effect on the functionality of the *APC* gene product?

#### Task: explain whether you expect the observed mutation to have a functional impact 
--> Yes, it probably has a functional impact, because the STR variation lies in the protein-coding region.

Which of the two samples do you expect to originate from the healthy sample? And which from the tumour sample?

#### Task: which sample came from healthy tissue, and which from tumour?
--> The first sample came from tumour tissue, the second sample came from healthy tissue.