# Problem Set 1: Intro to Bash and Python

## Due Monday September 11th, 11:30AM

Please submit this assignment by uploading the completed Jupyter notebook to Canvas.

**Before submitting, please make sure to run all cells so that reviewers do not have to run your code to see your results.**

Also, please make sure to save your work before uploading to Canvas.
You can do this by clicking on `File -> Save` in the menu bar or by pressing `Ctrl-S` (`Cmd-S` on Mac).

## Problem 1: Bash Basics

First, make sure that you have downloaded and unzipped the contents of the [ion_channel_sequences folder](https://github.com/gofflab/Quant_mol_neuro_2022/tree/main/modules/module_1/pset/ion_channel_sequences) of genome sequence files.
This dataset will be included in Canvas for this problem set as well.

Then, answer each question using Bash scripting.

**Please note that the bash operations should be executed inside a `bash` code chunk like so:**


In [None]:
%%bash

You'll need to have your folder of sequence files (`ion_channel_sequences`) in your working directory.

**Print your working directory and its contents. (Hint: use `pwd` and `ls`)**


In [None]:
%%bash
echo "Working directory:"
pwd
echo "Contents of working directory:"
ls -l

We'll start with `Kcna1.fa`, a FASTA file that contains the genome sequence of a mouse voltage-gated potassium channel.

The first line begins with a greater-than sign `>`, followed by a unique sequence identifier.
The actual sequence starts on the next line.

**See for yourself by printing the first three lines of the file**. Note that FASTA files can contain multiple sequences, but this one has just one.


In [1]:
%%bash
head -3 ion_channel_sequences/Kcna1.fa

>ref|NM_010595.3|:1-8970 Mus musculus potassium voltage-gated channel, shaker-related subfamily, member 1 (Kcna1), mRNA
GGGGGCTCCTCAGAGGCTCCGCAGCGGTGGAAGGACTGGAGCTGCTGGCTGCCTCCTCCGGTGCAGCCTG
TATCCAGGTGCAGCGGCACTGGGGACGCGGTGCATATCCCTTGCTCAGACTGCCACTGTGACCCTTGCGC


**Print just the sequence into a new file called `Kcna1_sequence.txt`**

In [12]:
%%bash

tail -n +2 ion_channel_sequences/Kcna1.fa > Kcna1_sequence.txt

**How many lines are in `Kcna1_sequence.txt`? How many characters are in the file?**

If you subtract the number of lines from the number of characters, you should get the number of nucleotides in the Kcna1 gene.

**Why? Answer in a comment.**


In [14]:
%%bash

wc Kcna1_sequence.txt

     129     129    9099 Kcna1_sequence.txt


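129 lines and 9099 characters. Every line ends with a newline character (`\n`) that `wc` counts as a character, so subtracting the 129 newlines leaves 9099 - 129 = 8970 nucleotides.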

**Count the number of times each nucleotide appears in the _Kcna1_ gene.**

In [15]:
%%bash

fold -w1 Kcna1_sequence.txt | sort | uniq -c  # one base per line, then tally each distinct base


2267 A
2207 C
2166 G
2330 T
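
As a cross-check on the counts above, the same tally can be sketched in Python with `collections.Counter`. The short `toy_text` string is a made-up stand-in for the file contents, and the last lines only re-add the four counts printed above to confirm that they match the 8970 nucleotides implied by `wc` (9099 characters minus 129 newlines).

```python
from collections import Counter

# Hypothetical stand-in for the text of Kcna1_sequence.txt (two short lines).
toy_text = "GGGGCT\nCCTCAG\n"

# Drop newlines so that only nucleotides are tallied.
toy_counts = Counter(toy_text.replace("\n", ""))
print(sorted(toy_counts.items()))

# Counts reported by `uniq -c` above, re-added as a consistency check.
reported = {"A": 2267, "C": 2207, "G": 2166, "T": 2330}
total = sum(reported.values())
print(total, total == 9099 - 129)
```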


Many voltage-gated potassium channels have a signature selectivity filter motif with the amino acid sequence `TVGYG`.
The reverse translation of this AA sequence can be modeled using the following DNA codons:

`AC[TCAG] GT[TCAG] GG[TCAG] TA[TC] GG[TCAG]`

Note the bases in square brackets.
This string is a [regular expression](https://quickref.me/regex) that allows flexibility in the wobble base of each codon.
For example, the first codon in this motif is written `AC[TCAG]`; when used in a `grep` search, it matches `AC` followed by _any_ one of the bases in the set `[TCAG]`.

**Use this provided motif sequence as an argument to `grep` to search the _Kcna1_ gene for any instances that match.**

Confirm that the _Kcna1_ gene has a sequence that would encode these amino acids.

Hint: Your first step should be to remove newline characters from `Kcna1_sequence.txt`.

In [17]:
%%bash
tvgyg="AC[TCAG]GT[TCAG]GG[TCAG]TA[TC]GG[TCAG]"

# Quote the pattern so the shell does not treat the square brackets as a filename glob
tr -d '\n' < Kcna1_sequence.txt | grep -o "$tvgyg"


ACTGTGGGATACGGT
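
The reason for removing newlines first can be illustrated with a small Python sketch using the `re` module. The pattern is the same one stored in `tvgyg`; the flanking bases and the position of the line break in `toy_text` are invented for illustration, while the 15-base match itself is the one reported by `grep` above.

```python
import re

# Same degenerate-codon pattern as the Bash variable `tvgyg` above.
TVGYG_PATTERN = re.compile(r"AC[TCAG]GT[TCAG]GG[TCAG]TA[TC]GG[TCAG]")

# The reported match, embedded in made-up flanking bases and split across
# a line break the way a FASTA body might be.
toy_text = "TTTACTGTGG\nGATACGGTAAA\n"

# With the newline in place, the motif spans two lines and is missed.
print(TVGYG_PATTERN.findall(toy_text))

# After removing newlines, the motif is contiguous and is found.
print(TVGYG_PATTERN.findall(toy_text.replace("\n", "")))
```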


## Problem 2: Python

1. Here are 25 numbers drawn from a normal distribution. Calculate the mean, standard deviation, and variance of the set.

    **Do the mean calculation manually, and then use the built-in functions provided in the `statistics` module for the others.**

    **Hint**: The functions `sum`, `len`, `statistics.stdev`, and `statistics.variance` will be useful.
    Use the `help` function to learn more about them, if needed.

In [20]:
import random
import statistics


def sample_normal(mean: float, sd: float, *, n: int = 1, seed: int = 42) -> list[float]:
    """Draw n samples from a normal distribution with given mean and standard deviation"""
    random.seed(seed)
    return [random.normalvariate(mean, sd) for _ in range(n)]


my_samples = sample_normal(10, 1, n=25)

print(my_samples)

## your manual mean calculation code here
manual_mean = sum(my_samples) / len(my_samples)

print(manual_mean)

## Calculate mean, sd, and variance using statistics module
mean = statistics.mean(my_samples)
sd = statistics.stdev(my_samples)
var = statistics.variance(my_samples)

print((mean, sd, var))

[10.245326341707864, 9.503155526588797, 11.254785931057462, 9.861940937257188, 9.024179666704676, 10.565050217344965, 8.83235498617007, 11.738025467335527, 9.675490796158401, 11.182326307936343, 11.504353761657875, 11.949497039569765, 12.311774710432584, 9.534829130637144, 11.481342356701061, 11.46816841749334, 10.367803012316621, 9.343104196870232, 9.060388465261997, 9.051983236409392, 10.366470756822187, 9.718266746644588, 10.649728727536722, 7.917214485592975, 9.069458652778662]
10.227080794999457
(10.227080794999457, 1.1534838840843793, 1.3305250708423857)
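
To see what `statistics.variance` and `statistics.stdev` compute, the sketch below reproduces them by hand. The `toy` list is a small made-up data set (not the 25 samples above), chosen so that the arithmetic is easy to follow.

```python
import math
import statistics

# Small made-up data set, used only for illustration.
toy = [2, 4, 4, 4, 5, 5, 7, 9]

toy_mean = sum(toy) / len(toy)
squared_devs = sum((x - toy_mean) ** 2 for x in toy)

# statistics.variance divides by n - 1 (sample variance);
# statistics.pvariance divides by n (population variance).
sample_var = squared_devs / (len(toy) - 1)
population_var = squared_devs / len(toy)

print(math.isclose(sample_var, statistics.variance(toy)))
print(math.isclose(population_var, statistics.pvariance(toy)))
print(math.isclose(math.sqrt(sample_var), statistics.stdev(toy)))
```

Because the samples here are a draw from a larger distribution, the `n - 1` versions (`variance`, `stdev`) are the ones used in the cell above.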


From `my_samples` above, print out the following:
- The first five elements (remember that Python indexing begins at 0)
- The last five elements
- The 13th and 14th elements
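
One possible approach uses list slicing. The sketch below runs on a hypothetical 25-element list of letters so that the positions are easy to verify by eye; the same slices would apply to `my_samples`.

```python
# Hypothetical 25-element list standing in for `my_samples`.
toy = list("abcdefghijklmnopqrstuvwxy")

print(toy[:5])     # first five elements: indices 0 through 4
print(toy[-5:])    # last five elements
print(toy[12:14])  # 13th and 14th elements: indices 12 and 13
```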

2. You are given a sample of a DNA plasmid with a known concentration of 1.85 μg/μL and a length of 1,354 bases, and are asked to calculate the molarity of the sample. 

  * **Create a function to calculate the molarity of a double-stranded DNA molecule given this information**
    (Google is your friend here to find the formula and the molecular weight for an 'average' oligonucleotide base pair).

In [22]:
def calc_molarity(plasmid_length: int, conc: float) -> float:
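    """Calculate the molarity (mol/L) of a double-stranded DNA sample.
    plasmid_length is in base pairs and conc is in μg/μL (equal to g/L);
    660 g/mol is the approximate average mass of one base pair."""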
    return conc / (plasmid_length * 660)


# Test your function with the above values
plasmid_length = 1354
conc = 1.85
calc_molarity(plasmid_length, conc)

2.0701848619130745e-06
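
The unit bookkeeping behind this result can be sketched as follows. The 660 g/mol figure is the approximate per-base-pair average already used in `calc_molarity`; the constant and function names below are illustrative and not part of the assignment.

```python
# Approximate average mass of one double-stranded DNA base pair, in g/mol
# (the same value used in calc_molarity above).
AVG_BP_MASS_G_PER_MOL = 660


def molarity_from_ug_per_ul(length_bp: int, conc_ug_per_ul: float) -> float:
    """Return mol/L for double-stranded DNA (hypothetical helper)."""
    grams_per_litre = conc_ug_per_ul  # 1 μg/μL equals 1 g/L
    grams_per_mole = length_bp * AVG_BP_MASS_G_PER_MOL
    return grams_per_litre / grams_per_mole


molar = molarity_from_ug_per_ul(1354, 1.85)
print(f"{molar:.3e} M = {molar * 1e6:.2f} μM")
```

Reporting the value in μM (about 2.07 μM here) is often easier to read than scientific notation in mol/L.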

  * You receive another plasmid with a length of 2,500 bases. You make a series of 10 dilutions ranging from 0-10 μg/μl.
  
    **Construct a loop or list comprehension to calculate the molarity of each dilution. (hint: help(range))**

In [23]:
# The new plasmid is 2,500 bases long, so pass that length instead of reusing `plasmid_length` (1,354).
[f"{calc_molarity(2500, conc):.3e}" for conc in range(11)]

['0.000e+00',
 '6.061e-07',
 '1.212e-06',
 '1.818e-06',
 '2.424e-06',
 '3.030e-06',
 '3.636e-06',
 '4.242e-06',
 '4.848e-06',
 '5.455e-06',
 '6.061e-06']

3. Using either a for loop or a list comprehension approach, **translate the following formulae and solve for the indicated range of values**.
    * $x^2$ for $x:\{0 ... 9\}$
    * $2^x$ for all even numbers between 0 and 20
    * $3x^4-2x^3+17x$ for $x:\{1 ... 200\}$

In [24]:
[x**2 for x in range(10)]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
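
For comparison, the same result can be built with an explicit `for` loop. This is only a sketch of the equivalence between the two styles; the variable name `squares` is arbitrary.

```python
# Same result as [x**2 for x in range(10)], written as an explicit loop.
squares = []
for x in range(10):
    squares.append(x**2)

print(squares == [x**2 for x in range(10)])
```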

In [25]:
[2**x for x in range(0, 21, 2)]

[1, 4, 16, 64, 256, 1024, 4096, 16384, 65536, 262144, 1048576]

In [27]:
print([3 * x**4 - 2 * x**3 + 17 * x for x in range(1, 201)])

[18, 66, 240, 708, 1710, 3558, 6636, 11400, 18378, 28170, 41448, 58956, 81510, 109998, 145380, 188688, 241026, 303570, 377568, 464340, 565278, 681846, 815580, 968088, 1141050, 1336218, 1555416, 1800540, 2073558, 2376510, 2711508, 3080736, 3486450, 3930978, 4416720, 4946148, 5521806, 6146310, 6822348, 7552680, 8340138, 9187626, 10098120, 11074668, 12120390, 13238478, 14432196, 15704880, 17059938, 18500850, 20031168, 21654516, 23374590, 25195158, 27120060, 29153208, 31298586, 33560250, 35942328, 38449020, 41084598, 43853406, 46759860, 49808448, 53003730, 56350338, 59852976, 63516420, 67345518, 71345190, 75520428, 79876296, 84417930, 89150538, 94079400, 99209868, 104547366, 110097390, 115865508, 121857360, 128078658, 134535186, 141232800, 148177428, 155375070, 162831798, 170553756, 178547160, 186818298, 195373530, 204219288, 213362076, 222808470, 232565118, 242638740, 253036128, 263764146, 274829730, 286239888, 298001700, 310122318, 322608966, 335468940, 348709608, 362338410, 376362858, 3

4. Solve the following equation for $x$:

    $x = p^2 + 2pq + q^2$ where $0<p<1$ and $q=1-p$ over the range of P values provided below. 

In [28]:
# Here are the p values you will need (note 0<p<1 and that the list variable P is capitalized.)
P = [0.0, 0.2, 0.33, 0.5, 0.66666667, 0.99]


# First define a function `q` that takes p as an argument and returns q (ie. q=1-p)
def q(p: float):
    return 1 - p

Using either a list comprehension or a for-loop along with your newly created `q` function, 

**translate the above formula and calculate the solution over the values in `P`**

In [29]:
# Loop over P, compute q for each p, then evaluate the formula

for p in P:
    # Find q for given p
    q_ = q(p)  # trailing underscore avoids shadowing the function `q`
    x = p**2 + 2 * p * q_ + q_**2  # all three terms are added
    print(round(x, 10))  # round away floating-point noise in the last digits

1.0
1.0
1.0
1.0
1.0
1.0


_Bonus points: What is the name of this formula and what does it describe?_

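Answer: This is the Hardy-Weinberg equation. Algebraically, it is the binomial expansion of $(p+q)^2$, which always equals 1 because $p+q=1$. It describes the genotype frequencies ($p^2$, $2pq$, and $q^2$) expected in an idealized population whose two alleles have frequencies $p$ and $q$.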
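
The three terms of the formula can also be inspected separately. The sketch below uses a hypothetical helper, `genotype_frequencies`, and compares the sum with 1 using `math.isclose`, since floating-point sums can differ from 1 in the last digits.

```python
import math


def genotype_frequencies(p: float) -> tuple[float, float, float]:
    """Return (p^2, 2pq, q^2) for allele frequency p, with q = 1 - p."""
    q = 1 - p
    return p**2, 2 * p * q, q**2


for p in [0.0, 0.2, 0.33, 0.5, 0.66666667, 0.99]:
    homo_p, hetero, homo_q = genotype_frequencies(p)
    # The three genotype frequencies should always sum to 1.
    print(p, math.isclose(homo_p + hetero + homo_q, 1.0))
```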

#### Python functions

5. Write a function that takes a DNA sequence as an argument and:
    * a) Checks to make sure that the DNA sequence uses only appropriate nucleotides
    * b) Returns a tuple containing the **GC content, length, and reverse complement sequence** of the input DNA molecule
    
**Hint:** _you can define functions for each of the requested properties and another function to create/organize the output_

In [1]:
def is_DNA(seq: str) -> bool:
    """Test whether a given sequence is a valid DNA sequence (ie contains only ATGC bases).
    Should return True or False depending on whether the sequence is valid."""
    return all(base in "ATGC" for base in seq)


def calc_GC(seq: str) -> float:
    """Calculate the GC content of seq as a fraction between 0 and 1.
    Only uppercase G and C are counted; lowercase bases are ignored."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def revcomp(seq: str) -> str:
    """Take a DNA sequence (seq) and return the reverse complement.
    Prints a message and returns an empty string if seq contains a non-ATGC character.
    Hint: you can create a dictionary of complementary bases and use that to look up the complement of a given base
    """
    complements = dict(zip("ATGCatgc", "TACGtacg"))
    # An alternative is seq.translate(str.maketrans(complements))[::-1],
    # which leaves unknown characters unchanged instead of raising KeyError.
    try:
        return "".join(complements[base] for base in reversed(seq))
    except KeyError as e:
        print(e, "Invalid DNA sequence")
        return ""


def main(seq: str) -> tuple[float, int, str]:
    """Take an input sequence and process through each of the functions described above and return a tuple of the results"""
    return calc_GC(seq), len(seq), revcomp(seq)


# Test your main() function with the following sequence:
test_seq = "CAGTACGATCTTGACGGTACG"
print(main(test_seq))

(0.5238095238095238, 21, 'CGTACCGTCAAGATCGTACTG')


c) Apply your main() function to calculate the above parameters of interest (b) for all of the following sequences in the `dna_seqs` list.

In [2]:
dna_seqs = [
    "TTATCAGCGGATTATTAGGTATAGTGCTATGC",
    "CGAGATTAGCGATTTGTG",
    "GGTATACTCTGCACGACGAGCGAGCGACGGACGACGGCICGATCTATCTA",
    "ACGTACGTACGTACGTACGTACGTACGTACGTACGT",
    "tacgagctactgagcgatcggatcgtacgtagc",
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
    "GGCTTAATATCGAGCTAGTAGTCTATTCTAGCGAGCGACTATTCGACTATCGATGCTATCTGCGCAGCGAGCATCGAGCGCTATCGAGCTAGCTAGCTAGCTATCATCGAGCTACTAGCATCTGATTATTCTTTAGCGCGACGACT",
]

[main(seq) for seq in dna_seqs]

'I' Invalid DNA sequence


[(0.375, 32, 'GCATAGCACTATACCTAATAATCCGCTGATAA'),
 (0.4444444444444444, 18, 'CACAAATCGCTAATCTCG'),
 (0.58, 50, ''),
 (0.5, 36, 'ACGTACGTACGTACGTACGTACGTACGTACGTACGT'),
 (0.0, 33, 'gctacgtacgatccgatcgctcagtagctcgta'),
 (0.0, 46, 'TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT'),
 (0.4794520547945205,
  146,
  'AGTCGTCGCGCTAAAGAATAATCAGATGCTAGTAGCTCGATGATAGCTAGCTAGCTAGCTCGATAGCGCTCGATGCTCGCTGCGCAGATAGCATCGATAGTCGAATAGTCGCTCGCTAGAATAGACTACTAGCTCGATATTAAGCC')]

### Undefined behavior for I
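
The `I` in the third sequence is not a DNA base, so GC content and a reverse complement are not well defined for that input. One way to make the behavior explicit, sketched here as an assumption and not as the required solution, is to validate the sequence first and raise a `ValueError`. The function name `analyze_dna` and the choice to upper-case the input are illustrative.

```python
VALID_BASES = set("ATGC")
COMPLEMENT = str.maketrans("ATGC", "TACG")


def analyze_dna(seq: str) -> tuple[float, int, str]:
    """Hypothetical stricter variant of main(): validate, then summarize.

    Returns (GC fraction, length, reverse complement) for an upper-cased
    copy of seq, and raises ValueError for empty or non-ATGC input.
    """
    upper = seq.upper()  # treat 'acgt' and 'ACGT' alike
    invalid = sorted(set(upper) - VALID_BASES)
    if not upper or invalid:
        raise ValueError(f"invalid DNA sequence; unexpected characters: {invalid}")
    gc_fraction = (upper.count("G") + upper.count("C")) / len(upper)
    return gc_fraction, len(upper), upper.translate(COMPLEMENT)[::-1]


print(analyze_dna("CAGTACGATCTTGACGGTACG"))
try:
    analyze_dna("GGCICGA")
except ValueError as err:
    print(err)
```

Raising an exception forces the caller to decide what to do with an invalid sequence, whereas returning an empty string can silently pass through later steps.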