# Problem Set 1: Intro to Bash and Python

## Due Monday September 11th, 11:30AM

Please submit this assignment by uploading the completed Jupyter notebook to Canvas.

**Before submitting, please make sure to run all cells so that reviewers do not have to run your code to see your results.**

Also, please make sure to save your work before uploading to Canvas.
You can do this by clicking on `File -> Save` in the menu bar or by pressing `Ctrl-S` (`Cmd-S` on Mac).

## Problem 1: Bash Basics

First, make sure that you have downloaded and unzipped the contents of the [ion_channel_sequences folder](https://github.com/gofflab/Quant_mol_neuro_2022/tree/main/modules/module_1/pset/ion_channel_sequences) of genome sequence files.
This dataset is also available on Canvas for this problem set.

Then, answer each question using Bash scripting.

**Please note that the bash operations should be executed inside a `bash` code chunk like so:**


In [None]:
%%bash

You'll need to have your folder of sequence files (`ion_channel_sequences`) in your working directory.

**Print your working directory and its contents. (Hint: use `pwd` and `ls`)**


In [None]:
%%bash
echo "Working directory:"
pwd
echo "Contents of working directory:"
ls -l

We'll start with `Kcna1.fa`, a FASTA file that contains the genome sequence of a mouse voltage-gated potassium channel.

The first line begins with a greater-than sign (`>`), followed by a unique sequence identifier.
The actual sequence starts on the next line.

**See for yourself by printing the first three lines of the file**. Note that FASTA files can contain multiple sequences, but this one has just one.


In [None]:
%%bash
head -3 ion_channel_sequences/Kcna1.fa
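The FASTA layout described above (a `>` header line followed by one or more sequence lines) can also be sketched in Python. This is only an illustration: the `toy_fasta` text and the `parse_fasta` helper are made up for this example and are not part of the assignment, which should still be answered in Bash.

```python
# Toy FASTA text, invented for illustration (not the real Kcna1 record)
toy_fasta = ">toy_seq_1 example record\nACGTAC\nGGTTAA\n>toy_seq_2\nTTTT\n"


def parse_fasta(text: str) -> dict[str, str]:
    """Map each header (without the leading '>') to its joined sequence."""
    records: dict[str, str] = {}
    header = None
    for line in text.splitlines():
        if line.startswith(">"):
            # A new record starts at every header line
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are concatenated without their newlines
            records[header] += line.strip()
    return records


records = parse_fasta(toy_fasta)
for header, seq in records.items():
    print(header, len(seq))
```

Note that the sequence of a single record may span several lines, which is why the lines are joined before measuring the length.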

**Print just the sequence into a new file called `Kcna1_sequence.txt`**

In [None]:
%%bash


**How many lines are in `Kcna1_sequence.txt`? How many characters are in the file?**

If you subtract the number of lines from the number of characters, you should get the number of nucleotides in the Kcna1 gene.

**Why? Answer in a comment.**


In [None]:
%%bash

**Count the number of times each nucleotide appears in the _Kcna1_ gene.**

Many voltage-gated potassium channels have a signature selectivity filter motif with the amino acid sequence `TVGYG`.
The reverse translation of this AA sequence can be modeled using the following DNA codons:

`AC[TCAG] GT[TCAG] GG[TCAG] TA[TC] GG[TCAG]`

Note the bases in square brackets.
This string uses [regular expressions](https://quickref.me/regex) to allow flexibility in the wobble base of each codon.
For example, the first codon in this motif is written as `AC[TCAG]`, which, when used in a `grep` search, matches `AC` followed by _any one_ of the bases in the set `[TCAG]`.
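To see how a bracketed character set behaves, here is a small sketch using Python's `re` module, which uses the same bracket notation as `grep`. The pattern and sequence are toy values chosen for illustration. They are not the `TVGYG` motif or the _Kcna1_ sequence.

```python
import re

# Hypothetical pattern: "GC", then any one base, then "AA", then T or C
pattern = "GC[TCAG]AA[TC]"
toy_seq = "TTGCAAATCCGCGAACGG"

# Collect the start position and matched text of every non-overlapping match
matches = [(m.start(), m.group()) for m in re.finditer(pattern, toy_seq)]
print(matches)  # [(2, 'GCAAAT'), (10, 'GCGAAC')]
```

Each bracketed set matches exactly one character, so the six-element pattern only ever matches six-base stretches.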

**Use this provided motif sequence as an argument to `grep` to search the _Kcna1_ gene for any instances that match.**

Confirm that the _Kcna1_ gene has a sequence that would encode these amino acids.

Hint: Your first step should be to remove newline characters from `Kcna1_sequence.txt`.

In [None]:
%%bash
tvgyg="AC[TCAG]GT[TCAG]GG[TCAG]TA[TC]GG[TCAG]"


## Problem 2: Python

1. Here are 25 numbers drawn from a normal distribution. Calculate the mean, standard deviation, and variance of the set.

    **Do the mean calculation manually, and then use the built-in functions provided in the `statistics` module for the others.**

    **Hint**: The functions `sum`, `len`, `statistics.stdev`, and `statistics.variance` will be useful.
    Use the `help` function to learn more about them, if needed.
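As a quick illustration of the functions named in the hint, the sketch below applies them to a small made-up list (`toy_values` is an assumption for this example, not the assignment data). Note that `statistics.stdev` and `statistics.variance` compute the sample versions, which divide by n - 1.

```python
import statistics

# Toy data, invented for illustration
toy_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Manual mean: total divided by the number of values
manual = sum(toy_values) / len(toy_values)

# Sample standard deviation and sample variance (n - 1 denominator)
sample_sd = statistics.stdev(toy_values)
sample_var = statistics.variance(toy_values)

print(manual, sample_sd, sample_var)
```

The standard deviation is the square root of the variance, which gives a simple sanity check on the two results.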

In [None]:
import random
import statistics

def sample_normal(mean: float, sd: float, *, n: int = 1, seed: int = 42) -> list[float]:
    """Draw n samples from a normal distribution with given mean and standard deviation"""
    random.seed(seed)
    return [random.normalvariate(mean, sd) for _ in range(n)]

my_samples = sample_normal(10, 1, n=25)

print(my_samples)

## your manual mean calculation code here
manual_mean = 0

print(manual_mean)

## Calculate mean, sd, and variance using statistics module
mean = 0
sd = 0
var = 0

print((mean, sd, var))

From `my_samples` above, print out the following:
- The first five elements (remember that Python indexing begins at 0)
- The last five elements
- The 13th and 14th elements
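The slicing rules involved can be sketched on a different list. The `letters` list below is a toy example, and the positions chosen deliberately differ from the ones requested above.

```python
# Toy list of the 26 lowercase letters
letters = list("abcdefghijklmnopqrstuvwxyz")

# A slice [start:stop] includes start and excludes stop
first_three = letters[:3]           # indices 0, 1, 2
last_three = letters[-3:]           # negative indices count from the end
seventh_and_eighth = letters[6:8]   # the 7th element sits at index 6

print(first_three, last_three, seventh_and_eighth)
```

Because indexing starts at 0, the n-th element always lives at index n - 1.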

2. You are given a sample of a DNA plasmid with a known concentration of 1.85 μg/μL and a length of 1,354 bases, and are asked to calculate the molarity of the sample. 

  * **Create a function to calculate the molarity of a double-stranded DNA molecule given this information**
    (Google is your friend here to find the formula and the molecular weight for an 'average' oligonucleotide base pair).

In [None]:
def calc_molarity(plasmid_length: int, conc: float) -> float:
    mol = 0
    return mol


# Test your function with the above values
plasmid_length = 1354
conc = 1.85

  * You receive another plasmid with a length of 2,500 bases. You make a series of 10 dilutions ranging from 0-10 μg/μL.
  
    **Construct a loop or list comprehension to calculate the molarity of each dilution. (hint: help(range))**
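As a reminder of how `range` and list comprehensions combine, here is a minimal sketch. The `double` function is a stand-in invented for this example, and the values are arbitrary rather than the dilution series.

```python
def double(value: float) -> float:
    """Hypothetical stand-in for any function applied to each element."""
    return 2 * value


# range(start, stop, step) yields integers and excludes stop
evens = list(range(0, 11, 2))

# range only produces integers, so scale each one to get fractional steps
steps = [i * 0.5 for i in range(0, 5)]

# Apply a function to every element with a list comprehension
doubled = [double(s) for s in steps]

print(evens, steps, doubled)
```

The same pattern works with a `for` loop that appends each result to a list.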

In [None]:
#

3. Using either a for loop or a list comprehension approach, **translate the following formulae and solve for the indicated range of values**.
    * $x^2$ for $x:\{0 ... 9\}$
    * $2^x$ for all even numbers between 0 and 20
    * $3x^4-2x^3+17x$ for $x:\{1 ... 200\}$

In [None]:
[3*x**4-2*x**3+17*x for x in range(1,201)]

4. Solve the following equation for $x$:

    $x = p^2 + 2pq + q^2$ where $0 \le p < 1$ and $q = 1 - p$, over the values of `P` provided below.

In [None]:
# Here are the values of p you will need (note that 0 <= p < 1 and that the list variable P is capitalized)
P = [0.0, 0.2, 0.33, 0.5, 0.66666667, 0.99]


# First define a function `q` that takes p as an argument and returns q (i.e., q = 1 - p)
def q(p: float):
    return 0.0

Using either a list comprehension or a for-loop along with your newly created `q` function, 

**translate the above formula and calculate the solution over the values in `P`**

In [None]:
# Below is a pseudocode template to help you get started

# for p in P:
#     Find q for given p
#     x = ??? # put your formula here using p and q
#     print(x)

_Bonus points: What is the name of this formula and what does it describe?_

Answer:

#### Python functions

5. Write a function that takes a DNA sequence as an argument and:
    * a) Checks to make sure that the DNA sequence uses only appropriate nucleotides
    * b) Returns a tuple containing the **GC content, length, and reverse complement sequence** of the input DNA molecule
    
**Hint:** _you can define functions for each of the requested properties and another function to create/organize the output_

In [None]:
def is_DNA(seq: str) -> bool:
    """Test whether a given sequence is a valid DNA sequence (i.e., contains only A, T, G, and C bases).
    Should return True or False depending on whether the sequence is valid."""
    return False


def calc_GC(seq: str) -> float:
    """Calculate the GC percent of the DNA sequence passed in as an argument."""
    return 0.0


def revcomp(seq: str) -> str:
    """Take a DNA sequence (seq) and return the reverse complement.
    Hint: you can create a dictionary of complementary bases and use it to look up the complement of a given base.
    """
    complements = {}

    return 


def main(seq: str) -> tuple[float, int, str]:
    """Take an input sequence and process through each of the functions described above and return a tuple of the results"""
    ...


# Test your main() function with the following sequence:
test_seq = "CAGTACGATCTTGACGGTACG"
print(main(test_seq))

c) Apply your `main()` function to calculate the parameters of interest from (b) for each of the sequences in the `dna_seqs` list.

In [None]:
dna_seqs = [
    "TTATCAGCGGATTATTAGGTATAGTGCTATGC",
    "CGAGATTAGCGATTTGTG",
    "GGTATACTCTGCACGACGAGCGAGCGACGGACGACGGCICGATCTATCTA",
    "ACGTACGTACGTACGTACGTACGTACGTACGTACGT",
    "tacgagctactgagcgatcggatcgtacgtagc",
    "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA",
    "GGCTTAATATCGAGCTAGTAGTCTATTCTAGCGAGCGACTATTCGACTATCGATGCTATCTGCGCAGCGAGCATCGAGCGCTATCGAGCTAGCTAGCTAGCTATCATCGAGCTACTAGCATCTGATTATTCTTTAGCGCGACGACT",
]