# Lab 3: Population genetics

## Exercises 1 (10 pts)

This week's exercises consist of two parts, broken into two notebooks:
* Part 1 (10 pts): We'll become familiar with the concepts of genotype and allele frequencies, and how they behave when we are looking at large or diverse populations. 
* Part 2 (10 pts): We'll explore linear regression, p-values, and effect sizes.

Reminder: you may work with a partner or consult your classmates for help!

## 1. Allele and genotype frequencies

If you were to look at the genomes of two unrelated people, the vast majority would be the same. But in some cases, mutations have arisen over time that make genomes slightly different from each other.

For this week, we'll focus almost exclusively on a type of genetic variation (mutation) which we call "SNPs" (single nucleotide polymorphisms). A SNP can be thought of simply as a spelling mistake, where originally there was a particular nucleotide at a position in the genome (say "A"), but at some point there was a mutation (for example, changing the "A" to a "C").

We use **allele** to refer to a particular "version" of a SNP. For example, a SNP may have two different *alleles* (e.g. "A" or "C").

A **genotype** is the combination of an individual's two alleles (humans are diploid, meaning we have two copies of each chromosome, except for the sex chromosomes). So for a T/A SNP a person's genotype can be either TT, AT, or AA. (At least for now, we don't care about the order of alleles in heterozygous SNPs, so you could also say "TA" instead of "AT".)
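
As a minimal sketch of the idea above, the unordered genotypes for a biallelic SNP can be listed with Python's `itertools.combinations_with_replacement`. The helper name `enumerate_genotypes` is hypothetical and not part of the lab code; it uses the T/A example from the text.

```python
from itertools import combinations_with_replacement

def enumerate_genotypes(alleles):
    """Return all unordered diploid genotypes for the given alleles (hypothetical helper)."""
    # Sorting makes the output deterministic; the order of alleles
    # within a genotype carries no meaning here ("AT" is the same as "TA").
    return ["".join(pair) for pair in combinations_with_replacement(sorted(alleles), 2)]

print(enumerate_genotypes(["T", "A"]))  # ['AA', 'AT', 'TT']
```

With two alleles there are always three unordered genotypes: two homozygous and one heterozygous.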

**Question 1 (2 pts)**: For a SNP with alleles "G" and "C", what are all possible diploid genotypes? Set the list `possible_gts_GC` to your answer below. An example is shown for you. Please keep alleles in uppercase.

In [3]:
# Example: what are the possible genotypes for a SNP with alleles "A" and "G"?
possible_gts_AG = ["AA","AG","GG"]

# What are the possible genotypes for a SNP with alleles "G" and "C"?
possible_gts_GC = [] 

# your code here
possible_gts_GC = ["GG", "GC", "CC"] 
# raise NotImplementedError

In [4]:
"""Test the list of possible genotypes is correct"""
assert(len(possible_gts_AG)==3)
assert("AA" in possible_gts_AG)
assert("GA" in possible_gts_AG or "AG" in possible_gts_AG)
assert("GG" in possible_gts_AG)


We can think of all of the copies of a chromosome in the population (2 per person) as deriving from some common ancestor a long time ago. The leaves of the tree below are all the present-day copies of the chromosome. Since humans are diploid, humans each have two copies (right).

<img src=popgen_fig1.png width=800>

Once a mutation arises on one copy of a chromosome, it can be passed on to future generations and can eventually spread throughout the population (see left figure above).

We use **allele frequency** to refer to the frequency of a particular allele out of all the copies of the genome in the population.

We use **genotype frequency** to refer to the frequency of each possible *genotype* (consisting of two alleles) in the population.

Since we each have two copies of each chromosome, allele frequency is usually the number of times the allele was seen divided by two times the number of people we analyzed (ignoring sex chromosomes or mitochondria).

For example, consider an "A/T" SNP that we are analyzing in a set of 1,000 people. We find that 40 people have genotype AA, 320 have genotype AT, and 640 have genotype TT:

* The *genotype frequencies* are: AA=40/1000 = 4%, AT=320/1000 = 32%, and TT=640/1000 = 64%.
* To find the *allele frequencies*:
  * We have 2,000 (2*number of people) total copies of the chromosome
  * For A, 40 people have two copies of A and 320 people have one copy of A, so there are $40*2+320 = 400$ total copies of A.
  * Similarly, there are $640*2+320=1600$ copies of T
  * So the allele frequencies are A=400/2000 = 20%, T=1600/2000 = 80%
  
We use **minor allele** to refer to the allele at a position that is least frequent in the population, and **major allele** to refer to the most common allele. (Note that in some cases the reference allele is actually the minor allele!) In the example above, A is the minor allele.

We use **minor allele frequency (MAF)** to refer to the frequency of the minor allele. In the example above, the MAF is 0.2.
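
The arithmetic in the A/T example above can be checked with a short sketch. This is only an illustration of the worked example using `collections.Counter`; the variable names are assumptions and are not the function you are asked to write below.

```python
from collections import Counter

# Genotype counts from the worked A/T example (1,000 people)
gt_counts = {"AA": 40, "AT": 320, "TT": 640}

n_people = sum(gt_counts.values())   # 1000
n_chromosomes = 2 * n_people         # 2000, since each person carries two copies

# Genotype frequency: people with that genotype / total people
genotype_freqs = {gt: n / n_people for gt, n in gt_counts.items()}

# Allele counts: each letter in a genotype is one chromosome copy,
# so "AA" contributes two copies of A per person and "AT" one of each.
allele_counts = Counter()
for gt, n in gt_counts.items():
    for allele in gt:
        allele_counts[allele] += n

allele_freqs = {a: n / n_chromosomes for a, n in allele_counts.items()}

# The minor allele is the one with the smallest count
minor_allele = min(allele_counts, key=allele_counts.get)

print(dict(allele_counts))                        # {'A': 400, 'T': 1600}
print(minor_allele, allele_freqs[minor_allele])   # A 0.2
```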

**Question 2 (7 pts)**: Complete the function `ComputeMinorAlleleFrequency` to compute the minor allele frequency of a SNP, given the counts of all the possible genotypes observed in a dataset.

The cell below tests your function on the example given above. Additional tests, some of which are hidden, check your answers for other examples.

In [13]:
def ComputeMinorAlleleFrequency(gt_counts):
    """
    Compute the minor allele frequency (MAF) of a SNP
    
    Parameters
    ----------
    gt_counts : dict[str]->int
       Count of how many times each genotype was seen
       e.g. {"AA": 40, "AT": 320, "TT": 640}
    Returns
    -------
    maf : float
       Minor allele frequency
       Return -1 if more than two alleles identified
    """
    # Print out some basic stats about the
    # genotypes and alleles
    gts = gt_counts.keys()
    print("Genotype frequencies:")
    for gt in gts:
        print("%s: %.2f"%(gt, gt_counts[gt]/sum(gt_counts.values())))
    alleles = set("".join(gts))
    if len(alleles) > 2:
        print("ERROR: more than 2 alleles identified")
        return -1
    print("Possible alleles: %s"%alleles)
    # Compute minor allele frequency
    maf = -1
    # your code here
    allele_counts = {}
    total_alleles = sum(gt_counts.values())*2

    for gt, count in gt_counts.items():
        for allele in gt:
            allele_counts[allele] = allele_counts.get(allele, 0) + count
    allele_freqs = {allele: count / total_alleles for allele, count in allele_counts.items()}
    maf = min(allele_freqs.values())
    
#     raise NotImplementedError
    return maf
    
ex_gt_counts = {
    "AA": 40,
    "AT": 320,
    "TT": 640
}
print("Expected MAF=0.20; Observed MAF=%.2f"%ComputeMinorAlleleFrequency(ex_gt_counts))

Genotype frequencies:
AA: 0.04
AT: 0.32
TT: 0.64
Possible alleles: {'T', 'A'}
Expected MAF=0.20; Observed MAF=0.20


In [14]:
"""Test results of ComputeMinorAlleleFrequency"""
ex_gt_counts_2 = {
    "AG": 10,
    "GG": 20,
    "AA": 20
}
assert(ComputeMinorAlleleFrequency(ex_gt_counts_2)==0.50)

Genotype frequencies:
AG: 0.20
GG: 0.40
AA: 0.40
Possible alleles: {'A', 'G'}


In [15]:
"""Test results of ComputeMinorAlleleFrequency"""
ex_gt_counts_2 = {
    "AC": 10,
    "CC": 20,
    "AA": 70
}
assert(ComputeMinorAlleleFrequency(ex_gt_counts_2)==0.25)

Genotype frequencies:
AC: 0.10
CC: 0.20
AA: 0.70
Possible alleles: {'C', 'A'}


In [16]:
"""Test results of ComputeMinorAlleleFrequency"""
ex_gt_counts_2 = {
    "AC": 10,
    "CC": 70,
    "AA": 20
}
assert(ComputeMinorAlleleFrequency(ex_gt_counts_2)==0.25)

Genotype frequencies:
AC: 0.10
CC: 0.70
AA: 0.20
Possible alleles: {'C', 'A'}


In [17]:
"""Test results of ComputeMinorAlleleFrequency"""
ex_gt_counts_2 = {
    "AC": 0,
    "CC": 40,
    "AA": 10
}
assert(ComputeMinorAlleleFrequency(ex_gt_counts_2)==0.2)

Genotype frequencies:
AC: 0.00
CC: 0.80
AA: 0.20
Possible alleles: {'C', 'A'}


In [18]:
"""Test results of ComputeMinorAlleleFrequency"""
# Hidden test

'Test results of ComputeMinorAlleleFrequency'

## 2. Mutations in populations

Depending on when a mutation occurs, it can have very different frequencies in the population. Consider a mutation that occurred thousands of years ago vs. a mutation that happened very recently:

<img src=popgen_fig2.png width=800>

Now think about different human populations in the world. At the highest level, we can think of the major population groups of the world as being related as depicted in the tree below:

<img src=popgen_fig3.png width=400>

(This is a major simplification that ignores things like admixture between different populations.)

If a mutation occurred before these populations split, it might be pretty common all across the globe. On the other hand, if a mutation occurred after a population split, it may be common in one population but completely missing from another.
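
The contrast between the two scenarios can be sketched with toy data. Everything below is hypothetical: the populations, the sample sizes, and the helper `allele_frequency` are made up for illustration and do not come from real data.

```python
def allele_frequency(chromosomes, allele):
    """Fraction of chromosome copies carrying the given allele (hypothetical helper)."""
    return chromosomes.count(allele) / len(chromosomes)

# Each list holds the allele carried by every sampled chromosome copy
# in a toy population (made-up numbers, 10 copies each).

# Scenario 1: the T->C mutation arose BEFORE the populations split,
# so both descendant populations inherited some copies of "C".
pop1_before = ["C"] * 3 + ["T"] * 7
pop2_before = ["C"] * 4 + ["T"] * 6

# Scenario 2: the mutation arose AFTER the split, in population 1 only,
# so population 2 carries only the ancestral "T".
pop1_after = ["C"] * 3 + ["T"] * 7
pop2_after = ["T"] * 10

scenarios = [
    ("before split", pop1_before, pop2_before),
    ("after split", pop1_after, pop2_after),
]
for label, p1, p2 in scenarios:
    print(label, allele_frequency(p1, "C"), allele_frequency(p2, "C"))
```

In the second scenario the "C" allele has frequency 0 in the population where the mutation never arose, ignoring admixture and repeat mutations at the same position.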

<img src=popgen_fig4.png width=700>

**Question 3 (1 pt)**: The ancestral copy of a chromosome has base "T" at a particular position. This "T" mutated to a "C" in an Asian individual thousands of years ago but after Asian populations had diverged from other populations. What do you expect the minor allele frequency (frequency of allele "C") of this SNP to be in Europeans? Set `expected_maf` to your answer below (ignoring new mutations at the same position, which are rare but can happen). 

In [20]:
expected_maf = None
# Set to your answer
# your code here
expected_maf = 0
# raise NotImplementedError

In [21]:
"""Test result of expected_maf"""

'Test result of expected_maf'