# Exercises
Here we will read a genome sequence from a .txt file and apply some operations commonly used in bioinformatics.

_Tip_: It's helpful to draw out how you plan to develop a function or pipeline. 

In [1]:
import pathlib
import numpy as np

# Import the data
data_path = pathlib.Path("data/genome_sequence.txt")

# Split each line on spaces; the pieces of the first line are joined into one sequence below
with open(data_path) as file:
    kmers = [line.strip().split(" ") for line in file]

sequence = "".join(kmers[0])

# You now have your sequence!
print(sequence[0:20])

AACGGTGTCATCGCTATACT


### Task 1: divide the sequence into a list of kmers of length 3, using a moving window with a step of 1.
Example input: "AACGG"

Example output: ["AAC", "ACG", "CGG"]

Below is a helpful example:

In [2]:
randomness = "hdjsfakidiauweoubfdhfigawnkjdsbcvhjsiruahwfoen"


def separate(string_: str, spliced_length: int) -> list:
    output_list = list()
    for index_ in range(0, len(string_) - spliced_length + 1):
        output_list.append(string_[index_ : index_ + spliced_length])

    return output_list


separated_randomness = separate(string_=randomness, spliced_length=3)
print(separated_randomness)

['hdj', 'djs', 'jsf', 'sfa', 'fak', 'aki', 'kid', 'idi', 'dia', 'iau', 'auw', 'uwe', 'weo', 'eou', 'oub', 'ubf', 'bfd', 'fdh', 'dhf', 'hfi', 'fig', 'iga', 'gaw', 'awn', 'wnk', 'nkj', 'kjd', 'jds', 'dsb', 'sbc', 'bcv', 'cvh', 'vhj', 'hjs', 'jsi', 'sir', 'iru', 'rua', 'uah', 'ahw', 'hwf', 'wfo', 'foe', 'oen']
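
The same moving window can also be written as a list comprehension. The sketch below is only an alternative phrasing of the example above, not the reference answer; the function name `sliding_kmers` and its `step` parameter are illustrative assumptions that do not appear elsewhere in this notebook.

```python
def sliding_kmers(sequence: str, k: int, step: int = 1) -> list[str]:
    """Return every substring of length k, moving the window by `step` characters."""
    if k <= 0 or step <= 0:
        raise ValueError("k and step must be positive integers")
    # The last valid start position is len(sequence) - k
    return [sequence[i : i + k] for i in range(0, len(sequence) - k + 1, step)]


example_kmers = sliding_kmers("AACGG", k=3)
print(example_kmers)
```

A sequence shorter than `k` yields an empty list, because no full window fits.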


### Your turn! Write your code in the chunk below
See `exercise_answers/` for help.

In [3]:
# Your code here

### Task 2
Put it back together!

_hint_: the kmers are kept in their original order and come from a moving window with a step of 1, so each kmer overlaps the next one (e.g. "ACTGCG" --> ["ACT", "CTG", "TGC", "GCG"])
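
To see the overlap the hint describes without giving away the reconstruction itself, the sketch below checks that each kmer, minus its first character, matches the next kmer, minus its last character. The helper name `overlaps_by_k_minus_1` is a hypothetical choice for illustration.

```python
def overlaps_by_k_minus_1(kmers: list[str]) -> bool:
    """Return True if every kmer's suffix equals the next kmer's prefix."""
    # Pair each kmer with its successor and compare the shared k-1 characters
    return all(left[1:] == right[:-1] for left, right in zip(kmers, kmers[1:]))


hint_kmers = ["ACT", "CTG", "TGC", "GCG"]
print(overlaps_by_k_minus_1(hint_kmers))
```

If this check fails for your Task 1 output, the kmers are out of order or were not produced with a step of 1, and the steps below will not recover the original sequence.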

In [None]:
# Steps
# 1. Initialize an empty string to build the sequence in --> empty_string = ""
# 2. Iterate through each kmer
# 3. Add the first character of the kmer to the string in the loop
# 4. After the loop, add the rest of the last kmer (everything after its first character),
#    because the loop only added the first character of each kmer

# Your code here

### Task 3

#### Population genetics: Hardy-Weinberg simulator

The Hardy-Weinberg (HW) principle is a simple model that relates a population's allele frequencies to its genotype frequencies. The HW equation is the following:
- [Eqn1]: $p^2 + 2pq + q^2 = 1$
- Where $p^2$, $2pq$, and $q^2$ are the frequencies of the genotypes $A/A$ (homozygous dominant), $A/C$ (heterozygous), and $C/C$ (homozygous recessive), respectively.

------

#### Your task is to write a function to calculate the allele frequency _p_ after 1 generation with selection. I'll help outline some of the steps here.
1. Define the allele frequencies *$p$* and *$q$* where $p + q = 1$
2. Define a function to calculate the genotype frequencies after selection and, from them, the allele frequencies $p_{1}$ and $q_{1}$
 - [Eqn2]: $\left[p_{0}^{2} \cdot \frac{w_{11}}{\overline{W}}\right] + \left[2p_{0}q_{0} \cdot \frac{w_{12}}{\overline{W}}\right] + \left[q_{0}^{2} \cdot \frac{w_{22}}{\overline{W}}\right] = 1$
   - Where $w_{mn}$ is the respective fitness coefficient and $\overline{W}$ is the mean fitness of the population
   - [Eqn3]: $\overline{W} = w_{11} p_{0}^{2} + w_{12} (2 p_{0} q_{0}) + w_{22} q_{0}^{2}$
 - Most variables will need to be passed as arguments to the function.
    - _Think_: which variable do we not need to pass as an argument?
 - Each bracketed term of Eqn2 is the frequency of one genotype after selection: $A/A$, $A/C$ (heterozygous), and $C/C$, in that order. The allele frequency $p_{1}$ is the $A/A$ term plus half of the $A/C$ term, and $q_{1} = 1 - p_{1}$.
 - Your function should return a list or dictionary of the $F_{1}$ frequencies.
3. Check that Eqn1 is true for the F1 generation.
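
As a starting point, Eqn3 on its own can be sketched as below. This covers only the mean fitness, not the full Task 3 function. The name `mean_fitness` and the numeric values (`p0=0.6`, `q0=0.4`, and the three fitness coefficients) are made-up assumptions for illustration, not values from the exercise.

```python
import math


def mean_fitness(p0: float, q0: float, w11: float, w12: float, w22: float) -> float:
    """Mean fitness (Eqn3): each genotype frequency weighted by its fitness coefficient."""
    if not math.isclose(p0 + q0, 1.0):
        raise ValueError("p0 and q0 must sum to 1")
    return w11 * p0**2 + w12 * (2 * p0 * q0) + w22 * q0**2


# Hypothetical starting frequencies and fitness coefficients
w_bar = mean_fitness(p0=0.6, q0=0.4, w11=1.0, w12=0.9, w22=0.8)
print(round(w_bar, 4))
```

With no selection (all coefficients equal to 1), the mean fitness reduces to Eqn1 and equals 1, which makes a convenient sanity check.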

In [None]:
# Step 1: define allele frequencies p and q where p+q=1

# Step 2: define the relative fitness coefficients w11, w12, and w22

# Step 3: define a function to calculate the F1 allele frequencies

# Step 4: check to make sure p^2 + 2pq + q^2 = 1 for the F1 allele frequencies
# Here's an example function that I used to check the output of my F1 dict
def hwe(p, q):
    check_ = (p**2) + (2 * p * q) + (q**2)
    return check_


# print("Check that p^2 + 2pq + q^2 =", hwe(p=next_gen["p_1"], q=next_gen["q_1"]))
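
Because the frequencies are floating-point numbers, the sum in Step 4 may come out as something like 0.9999999999999999 rather than exactly 1. A minimal sketch of a tolerance-based check follows, assuming a hypothetical helper name `is_hardy_weinberg_sum_one` and made-up frequencies of 0.7 and 0.3.

```python
import math


def is_hardy_weinberg_sum_one(p: float, q: float, rel_tol: float = 1e-9) -> bool:
    """Return True if p^2 + 2pq + q^2 equals 1 within a floating-point tolerance."""
    total = p**2 + 2 * p * q + q**2
    return math.isclose(total, 1.0, rel_tol=rel_tol)


# Hypothetical F1 allele frequencies
print(is_hardy_weinberg_sum_one(p=0.7, q=0.3))
```

Comparing with `math.isclose` instead of `==` avoids false failures caused by rounding in the intermediate products.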