# Day 2, Exercise 2 - frequency analysis of SNP rs4988235 in VCF data

### Background:

The SNP rs4988235, located in the MCM6 gene on chromosome 2 at position 136608646, is associated with lactose persistence. This variant is significantly more common in Scandinavian countries, indicating that individuals from these regions are less likely to be lactose intolerant. 

### Task:

Calculate the frequency of the rs4988235 SNP amongst the samples in the VCF file 'genotypes_small.vcf'.

- The file `genotypes_small.vcf` can be found <a href="https://python-bioinfo.bioshu.se/downloads/genotypes_small.vcf" target="_blank">here</a> if you have not downloaded it yet.

### Tips:

#### 1. A VCF file is structured like this:

<img src="../../img/vcf_header.png" alt="Drawing" style="width: 1000px;"/> 

The genotypes start from column 10 (index 9). The first part of each genotype field denotes which alleles the individual has, and it takes one of the following three forms:
```
- 0/0         means there are no alternate alleles, called homozygous reference allele, or wildtype
- 0/1 or 1/0  means there is one alternate allele, called heterozygous
- 1/1         means there are two alternate alleles, called homozygous alternate allele.

Note: when there are multiple alternate alleles, the allele number can be larger than 1.
```
(Each genotype has two numbers, separated by `/`, because humans have two sets of chromosomes, one inherited from each parent.)

So here you have to count the number of alternate alleles, and divide by the total number of alleles.

Feel free to use the code presented in the lecture for this dataset, and modify it accordingly. 
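To make the counting idea concrete, here is a minimal sketch on a made-up list of four genotypes. The list `toy_genotypes` and the variable names are illustrative assumptions and are not taken from the VCF file; the point is only to show what "alternate alleles divided by total alleles" means.

```python
# Made-up genotypes for four hypothetical individuals (not from the VCF file)
toy_genotypes = ['0/0', '0/1', '1/1', '0/1']

alt_alleles = 0      # number of alternate alleles seen so far
total_alleles = 0    # number of alleles seen so far

for genotype in toy_genotypes:
    for allele in genotype.split('/'):   # each genotype holds two alleles
        total_alleles += 1
        if allele != '0':                # anything other than 0 is an alternate allele
            alt_alleles += 1

toy_freq = alt_alleles / total_alleles
print(alt_alleles, total_alleles, toy_freq)   # 4 8 0.5
```

Four individuals carry eight alleles in total, four of which are alternate, so the frequency in this toy example is 0.5.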

  
#### 2. It is possible to loop over a list starting from a specified position using:  
`for x in list[9:]`  
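As a small illustration, the sketch below loops over a made-up list from index 9 onwards. The name `columns` and its contents are placeholders; in the tip above, `list` stands for whatever list you are working with (it is best not to name your own variable `list`, since that hides Python's built-in type).

```python
# A made-up list standing in for the columns of one VCF line
columns = ['c0', 'c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 's1', 's2', 's3']

selected = []
for x in columns[9:]:      # start at index 9 and continue to the end of the list
    selected.append(x)

print(selected)            # ['s1', 's2', 's3']
```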

  
#### 3. Before starting to write any code, break down the problem into smaller, manageable components by utilizing pseudocode, and devise a strategy for tackling each of these components individually. 

 
#### 4. There is always more than one solution to a problem. If you find another way to reach the same result, that's perfectly fine.

If you get stuck, one possible answer is presented piece by piece below, so you can always scroll down a bit, look at part of the code, and try again from there.

#### 5. If you really don't know where to start, follow these step by step instructions:  
- Read the file line by line
- Find the line where chromosome is 2 and position is 136608646
- Get the genotype information by looping over the line starting at position 9
- Split on : to only get the genotypes
- Initialize three counters and count the number of wildtype, heterozygous, and homozygous alternate genotypes
- Calculate the frequency using these three counters

___

### The answers

There are various ways to write the code for this task. Here, we present one solution in the answers, but if you have written a different one, that's perfectly fine. Just ensure that you test your code to confirm that it performs as expected.

Below is the second code example from the lecture. Let's start with that one and modify it.

In [27]:
vcf_file = '../../downloads/genotypes_small.vcf' # set this to the actual path on your computer
with open(vcf_file, 'r') as fh:
    for line in fh:
        if not line.startswith('#'):
            cols = line.strip().split('\t')
            if cols[0] == '5':
                print(line)
                break

5	106565	rs115608877	G	T	.	PASS	AA=.;AC=7;AN=120;DP=91;GP=5:53565;BN=132	GT:DP:CB	1/0:1:SM	1/0:3:SM	0/0:1:SM	0/0:2:SM	0/0:0:S	0/0:0:SM	0/0:0:SM	0/0:1:S	0/0:0:S	0/0:0:S	0/0:2:SM	0/1:1:SM	0/0:3:SM	0/1:0:SM	0/0:2:SM	1/0:1:S	0/0:1:SM	0/0:0:SM	0/0:2:SM	0/0:1:SM	0/0:5:SM	0/0:0:S	0/0:1:SM	0/0:0:S	0/0:1:SM	0/0:1:SM	0/1:1:SM	0/1:1:SM	0/0:2:SM	0/0:3:SM	0/0:2:SM	0/0:8:SM	0/0:0:SM	0/0:7:SM	0/0:4:SM	0/0:0:S	0/0:2:S	0/0:0:SM	0/0:3:SM	0/0:2:SM	0/0:1:SM	0/0:2:SM	0/0:3:S	0/0:0:SM	0/0:3:SM	0/0:1:SM	0/0:0:SM	0/0:0:SM	0/0:3:S	0/0:1:SM	0/0:0:SM	0/0:0:SM	0/0:3:SM	0/0:4:SM	0/0:2:SM	0/0:1:SM	0/0:0:SM	0/0:0:SM	0/0:1:SM	0/0:2:SM



Now let's start modifying it:

- First we want to find the line with the SNP we are interested in

In [28]:
vcf_file = '../../downloads/genotypes_small.vcf' # set this to the actual path on your computer
with open(vcf_file, 'r') as fh:
    for line in fh:
        if not line.startswith('#'):
            cols  = line.strip().split('\t')
            chrom = cols[0]                          # This is the chromosome
            pos   = cols[1]                          # This is the position of the SNP on the chromosome
            if chrom == '2' and pos == '136608646':  # Check if chrom and pos match. Notice the type! python reads all as strings!
                print(line)                          # Print the line to see if it looks correct


2	136608646	rs4988235	G	A	.	PASS	AA=C;AC=94;AN=120;DP=254;HM2;GP=2:136892165;BN=92	GT:DP:CB	0/1:4:SMB	1/1:17:SMB	1/1:8:SMB	0/1:8:SMB	0/1:1:SMB	0/1:1:SMB	1/1:1:SMB	1/1:1:SMB	1/1:3:SMB	0/1:5:SMB	1/1:9:SMB	0/1:3:SB	0/0:4:SMB	1/1:5:SMB	1/1:2:SMB	0/1:2:SMB	0/1:2:SB	0/1:2:SMB	1/1:1:SMB	0/1:3:SMB	1/1:18:SMB	1/1:4:SMB	1/1:6:SMB	0/1:3:SMB	1/1:1:SMB	0/1:5:SMB	1/1:12:SMB	0/1:3:SMB	1/1:3:SMB	0/1:1:SMB	0/0:2:SMB	1/1:9:SMB	1/1:2:SM	1/1:8:SMB	1/1:2:SB	1/1:2:SMB	1/1:3:SMB	1/1:0:SMB	0/1:8:SMB	1/1:2:SMB	1/1:8:SMB	1/1:1:SMB	1/1:5:SMB	1/1:3:SMB	1/1:4:SMB	1/1:4:SMB	1/1:5:SMB	0/0:4:SMB	0/1:7:SMB	0/1:3:SMB	1/1:10:SMB	1/1:1:SMB	0/1:0:SMB	0/1:1:SMB	1/1:2:SMB	1/1:2:SMB	1/1:0:SMB	0/1:6:SMB	1/1:8:SMB	1/1:4:SMB



When making comparisons, it is crucial to consider the data type of the values being compared. When Python reads from a file, all content is read as strings, so to compare a value with an integer you must first convert it. Here we compare `chrom` to `'2'` (a string), which is reasonable because chromosome names include non-integer identifiers such as 'X', 'Y', and 'M'. The position, however, is inherently an integer, so whether you need to convert it depends on the comparison. Exact matches, like `pos == '136608646'`, work as expected on strings. Comparisons involving order do not, because strings are compared character by character: `'9' > '136608646'` is `True`, since `'9'` sorts after `'1'`. For those, convert the position to an integer first (e.g., `int(pos) > 136608646`).
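The difference between string and integer comparison can be seen in a short sketch. The position `'9000'` is a made-up value chosen only to show the effect; it does not come from the file.

```python
pos = '9000'             # hypothetical position, a string as it would be read from a file
threshold = 136608646    # the position of rs4988235, as an integer

as_strings = pos > str(threshold)      # compares character by character: '9' sorts after '1'
as_integers = int(pos) > threshold     # compares the numeric values

print(as_strings, as_integers)         # True False
```

Only the integer comparison gives the numerically correct answer.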

Now that we have identified the line of interest, what should be our next step?

- Let's attempt to extract and print only the genotype information.

In [29]:
vcf_file = '../../downloads/genotypes_small.vcf' # set this to the actual path on your computer
with open(vcf_file, 'r') as fh:
    for line in fh:
        if not line.startswith('#'):
            cols  = line.strip().split('\t')
            chrom = cols[0]                          # This is the chromosome
            pos   = cols[1]                          # This is the position of the SNP on the chromosome
            if chrom == '2' and pos == '136608646':  # Check if chrom and pos match. Notice the type! python reads all as strings!
                for geno_col in cols[9:]:            # Loop over the items in cols, starting from index 9
                    print(geno_col)


0/1:4:SMB
1/1:17:SMB
1/1:8:SMB
0/1:8:SMB
0/1:1:SMB
0/1:1:SMB
1/1:1:SMB
1/1:1:SMB
1/1:3:SMB
0/1:5:SMB
1/1:9:SMB
0/1:3:SB
0/0:4:SMB
1/1:5:SMB
1/1:2:SMB
0/1:2:SMB
0/1:2:SB
0/1:2:SMB
1/1:1:SMB
0/1:3:SMB
1/1:18:SMB
1/1:4:SMB
1/1:6:SMB
0/1:3:SMB
1/1:1:SMB
0/1:5:SMB
1/1:12:SMB
0/1:3:SMB
1/1:3:SMB
0/1:1:SMB
0/0:2:SMB
1/1:9:SMB
1/1:2:SM
1/1:8:SMB
1/1:2:SB
1/1:2:SMB
1/1:3:SMB
1/1:0:SMB
0/1:8:SMB
1/1:2:SMB
1/1:8:SMB
1/1:1:SMB
1/1:5:SMB
1/1:3:SMB
1/1:4:SMB
1/1:4:SMB
1/1:5:SMB
0/0:4:SMB
0/1:7:SMB
0/1:3:SMB
1/1:10:SMB
1/1:1:SMB
0/1:0:SMB
0/1:1:SMB
1/1:2:SMB
1/1:2:SMB
1/1:0:SMB
0/1:6:SMB
1/1:8:SMB
1/1:4:SMB


Here, `cols` is a list, meaning we can loop over it. By writing `cols[9:]` we use slicing to tell Python to start the loop at index 9 of the list and continue until the end.  


So, now we have the full genotype information, but we are only interested in the first part before the `:` separator, so let's remove that:

In [30]:
vcf_file = '../../downloads/genotypes_small.vcf' # set this to the actual path on your computer
with open(vcf_file, 'r') as fh:
    for line in fh:
        if not line.startswith('#'):
            cols  = line.strip().split('\t')
            chrom = cols[0]                          # This is the chromosome
            pos   = cols[1]                          # This is the position of the SNP on the chromosome
            if chrom == '2' and pos == '136608646':  # Check if chrom and pos match. Notice the type! python reads all as strings!
                for geno_col in cols[9:]:            # Loop over the items in cols, starting from index 9
                    geno = geno_col.split(":")[0]    # Here we get the first item from geno_col, split by ':'
                    print(geno)

0/1
1/1
1/1
0/1
0/1
0/1
1/1
1/1
1/1
0/1
1/1
0/1
0/0
1/1
1/1
0/1
0/1
0/1
1/1
0/1
1/1
1/1
1/1
0/1
1/1
0/1
1/1
0/1
1/1
0/1
0/0
1/1
1/1
1/1
1/1
1/1
1/1
1/1
0/1
1/1
1/1
1/1
1/1
1/1
1/1
1/1
1/1
0/0
0/1
0/1
1/1
1/1
0/1
0/1
1/1
1/1
1/1
0/1
1/1
1/1


There are several ways to do this. For example, one could extract the first three characters from `geno_col` using `geno = geno_col[:3]`. However, this breaks if the allele numbers have varying lengths: with 10 different alternate alleles, a genotype like `'0/10'` is possible, and taking three characters would cut it to `'0/1'`.
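A quick sketch of the two approaches on a made-up genotype column with a two-digit allele number (the value `'0/10:4:SMB'` is hypothetical and does not occur in this dataset):

```python
geno_col = '0/10:4:SMB'    # hypothetical genotype column with a two-digit allele number

by_slice = geno_col[:3]              # takes a fixed number of characters
by_split = geno_col.split(':')[0]    # takes everything before the first ':'

print(by_slice, by_split)            # 0/1 0/10
```

Slicing silently returns the wrong genotype here, while splitting on `:` does not depend on the length of the allele numbers.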

Now that we have the genotypes for each individual, it's time to count them.

In [31]:
vcf_file = '../../downloads/genotypes_small.vcf' # set this to the actual path on your computer
with open(vcf_file, 'r') as fh:
    wt  = 0      # Remember to initialize the counters outside the loop, otherwise they will reset for every iteration
    het = 0
    hom = 0
    for line in fh:
        if not line.startswith('#'):
            cols  = line.strip().split('\t')
            chrom = cols[0]                          # This is the chromosome
            pos   = cols[1]                          # This is the position of the SNP on the chromosome
            if chrom == '2' and pos == '136608646':  # Check if chrom and pos match. Notice the type! python reads all as strings!
                for geno_col in cols[9:]:            # Loop over the items in cols, starting from index 9
                    geno = geno_col.split(":")[0]    # Here we get the first item from geno_col, split by ':'
                    allele_list = geno.split('/')    # Here we get each allele and convert the number to integer
                    allele_1 = int(allele_list[0])
                    allele_2 = int(allele_list[1])
                    if allele_1 == 0 and allele_2 == 0: # Conditional to test whether both alleles are 0
                        wt += 1                      #  If matching, wt is increased by 1
                    elif allele_1 > 0 and allele_2 > 0:
                        hom += 1
                    else:        
                        het += 1
                    
print("wt =", wt)        # After looping through the entire file, print the counts for the different genotypes
print("het =", het)
print("hom =", hom)

wt = 3
het = 20
hom = 37


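Here, we initialize three counters before the loop, one for each of the three genotype classes. For each genotype, we increment the corresponding counter by 1. Note that the `elif` branch counts any genotype with two non-reference alleles as `hom`, which is fine for this SNP because it has only one alternate allele. Finally, we calculate the allele frequency and print the result.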


Equation: `allele_freq = (2*hom + het)/(wt*2+hom*2+het*2)`
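As a worked example, the sketch below plugs the three counts printed by the previous cell (`wt = 3`, `het = 20`, `hom = 37`) into the equation. The intermediate names `alt_alleles` and `total_alleles` are introduced here only for readability.

```python
wt, het, hom = 3, 20, 37    # counts printed by the previous cell

alt_alleles = 2*hom + het              # each hom carries two alternate alleles, each het one
total_alleles = 2*(wt + het + hom)     # every individual carries two alleles
allele_freq = alt_alleles / total_alleles

print(alt_alleles, total_alleles, round(allele_freq, 3))   # 94 120 0.783
```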

In [32]:
vcf_file = '../../downloads/genotypes_small.vcf' # set this to the actual path on your computer
with open(vcf_file, 'r') as fh:

    wt  = 0      # Remember to initialize the counters outside the loop, otherwise they will reset for every iteration
    het = 0
    hom = 0

    for line in fh:
        if not line.startswith('#'):
            cols  = line.strip().split('\t')
            chrom = cols[0]                          # This is the chromosome
            pos   = cols[1]                          # This is the position of the SNP on the chromosome
            if chrom == '2' and pos == '136608646':  # Check if chrom and pos match. Notice the type! python reads all as strings!
                for geno_col in cols[9:]:                # Loop over the items in cols, starting from index 9
                    geno = geno_col.split(":")[0]    # Here we get the first item from geno_col, split by ':'
                    allele_list = geno.split('/') # Here we get each allele and convert the number to integer
                    allele_1 = int(allele_list[0])
                    allele_2 = int(allele_list[1])
                    if allele_1 == 0 and allele_2 == 0: # Conditional to test whether both alleles are 0
                        wt += 1                      # If match, wt is increased by 1
                    elif allele_1 > 0 and allele_2 > 0:
                        hom += 1
                    else:        
                        het += 1
    freq = (2*hom + het)/((wt+hom+het)*2)                       # Calculate the allele frequency
    print('The frequency of the rs4988235 SNP is: '+str(round(freq,3)))  # Print a nice message and format freq to a string before printing


The frequency of the rs4988235 SNP is: 0.783
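One way to sanity-check this result is to compare it with the INFO column of the same line (`cols[7]`). In the VCF format, `AC` is the count of alternate alleles in the genotypes and `AN` is the total number of alleles in the called genotypes, so `AC/AN` should give the same frequency. The sketch below copies the INFO string from the line printed earlier; the dictionary name `info_fields` is just an illustrative choice.

```python
# INFO column (cols[7]) of the rs4988235 line, copied from the output above
info = 'AA=C;AC=94;AN=120;DP=254;HM2;GP=2:136892165;BN=92'

info_fields = {}
for item in info.split(';'):
    if '=' in item:                        # flags such as HM2 carry no value, skip them
        key, value = item.split('=', 1)
        info_fields[key] = value

info_freq = int(info_fields['AC']) / int(info_fields['AN'])
print(round(info_freq, 3))                 # 0.783
```

The value agrees with the frequency computed from the genotype columns.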


<b>All done!</b>