# Overview 

In this lab, you will use your Python skills to analyze the SARS-CoV-2 genome
in various ways.
 
We will be using the RefSeq genome sequence from NCBI
(https://www.ncbi.nlm.nih.gov/nuccore/NC_045512). The lab relies on
a few techniques (string manipulation, dictionaries, plotting) that, as
you will see throughout the course, are very useful in biological sequence analysis.

As with any coding question, Google and the Python documentation are your
friends! As always, make sure to clone your GitHub repo to DataHub, document
all your steps and answer all questions using Markdown sections in your
Jupyter notebook, and display all plots inline.

## Submission checklist

- Commit notebook on DataHub and push to GitHub
- Make sure `lab-2-username` repository contains changes on GitHub (GSIs can see your private repos)

# Read data

First, let's read the genome FASTA file using Biopython and store the genome as a
Python string. Similarly, let's read the transcriptome FASTA file and store the
transcriptome as a list of strings. Be sure that you are working with native
Python strings and lists, not Biopython `Seq` objects. After this step, you are not allowed to use any Biopython functions.

In [None]:
# code here

# Conversion

The genome in the RefSeq database is stored as cDNA (written with T
instead of U), but SARS-CoV-2 is an RNA virus! Convert the genome into an RNA
string. Print the first 100 bases.
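
As a minimal sketch of the idea, assuming the sequence is a plain Python string over A, C, G, and T, the conversion amounts to swapping every T for a U. The function name `dna_to_rna` and the toy sequence are hypothetical and only for illustration.

```python
def dna_to_rna(dna: str) -> str:
    """Return the RNA version of a DNA string by replacing T with U."""
    # Upper-casing first is an assumption: it guards against lower-case input.
    return dna.upper().replace("T", "U")


# Toy sequence for illustration only, not the real genome.
toy_dna = "ATTAAAGGTT"
toy_rna = dna_to_rna(toy_dna)
print(toy_rna[:100])  # slicing past the end of a short string is safe
```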

In [None]:
# code here

The SARS-CoV-2 genome is positive-stranded. Convert the genome into
a negative-strand sequence by computing the reverse complement. Print
the first 100 bases of the reverse complement sequence.
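
One possible way to compute a reverse complement with only built-in string methods is sketched below. It assumes an RNA string over A, C, G, and U; the names `RNA_COMPLEMENT` and `reverse_complement_rna` are hypothetical.

```python
# Translation table mapping each RNA base to its complement (A<->U, C<->G).
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement_rna(rna: str) -> str:
    """Complement every base, then reverse the resulting string."""
    return rna.translate(RNA_COMPLEMENT)[::-1]


# Toy sequence for illustration only, not the real genome.
toy_rna = "AUUAAAGGUU"
toy_negative = reverse_complement_rna(toy_rna)
print(toy_negative[:100])
```

Applying the function twice should return the original string, which is a quick sanity check for any implementation.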

In [None]:
# code here

# $k$-mer analysis

## Part 1

What does a $k$-mer represent?

### Answer

[Answer here]

## Part 2

Build two dictionaries of $k$-mer frequencies, one for $k=3$ (use `threemers` as the variable name) and one for $k=4$ (use `fourmers`). In each dictionary, the keys should be every possible $k$-mer of the given length, and the corresponding values should be the number of times that $k$-mer occurs in the SARS-CoV-2 genome. To find the $k$-mers, scan along the sequence with a window size of $k$ and a stride of 1.

Note: Use the negative strand sequence to compute the $k$-mer frequencies.

Example for k=3:  
String: "ACTGACT"  
```
threemers = {"ACT": 2,
             "CTG": 1,
             "TGA": 1,
             "GAC": 1,
             ...}
```
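
The sliding-window scan described above can be sketched as follows. This is an illustration on the toy string from the example, not a reference solution: the function name `count_kmers` and its `alphabet` parameter are hypothetical, and skipping windows that contain characters outside the alphabet is an assumption.

```python
from itertools import product


def count_kmers(sequence: str, k: int, alphabet: str = "ACGT") -> dict:
    """Count every possible k-mer over `alphabet` using a stride of 1."""
    # Start every possible k-mer at zero so that absent k-mers still appear as keys.
    counts = {"".join(letters): 0 for letters in product(alphabet, repeat=k)}
    # A sequence of length n has n - k + 1 windows of size k.
    for start in range(len(sequence) - k + 1):
        kmer = sequence[start:start + k]
        # Assumption: windows with characters outside the alphabet are skipped.
        if kmer in counts:
            counts[kmer] += 1
    return counts


toy_threemers = count_kmers("ACTGACT", 3)
print(toy_threemers["ACT"], toy_threemers["CTG"], toy_threemers["AAA"])
```

For an RNA sequence, the same sketch would need `alphabet="ACGU"`.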

In [None]:
# code here

## Part 3

Plot a bar graph of the frequency of each 4-mer. Include only the 50 most
frequent 4-mers, and sort the bars in the plot by frequency (most frequent
→ least frequent).
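
Selecting and ordering the bars can be done before any plotting call. The sketch below shows one way to do it on a toy dictionary; the name `top_kmers` and the toy counts are hypothetical.

```python
def top_kmers(counts: dict, n: int) -> list:
    """Return the n (k-mer, count) pairs with the highest counts, in descending order."""
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


# Toy counts for illustration only.
toy_counts = {"ACT": 2, "CTG": 1, "TGA": 1, "GAC": 1, "AAA": 0}
top = top_kmers(toy_counts, 3)
labels = [kmer for kmer, _ in top]      # x-axis tick labels
heights = [count for _, count in top]   # bar heights
print(labels, heights)
```

The two lists can then be handed to a bar-plotting function such as matplotlib's `plt.bar(labels, heights)`.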

In [None]:
# code here and plot inline

## Part 4

Plot the frequency of all 64 3-mers, similarly sorted by
frequency.

In [None]:
# code here and plot inline

# Translation

We have a transcriptome that contains the mRNA sequences expressed
by the virus, but we want to know the final protein sequences. Build a
new list of proteins by converting each mRNA sequence to an amino
acid sequence (use the [standard genetic code](https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=tgencodes#SG1)). Print the amino acid sequence for the S protein.

Note: Translate through the entire sequence, ignoring the stop codons. This is not biologically accurate, but it is sufficient for the purposes of this lab.
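
A codon lookup table can be built without Biopython from the compact form of the standard genetic code, in which the 64 amino acid letters are listed with bases ordered T, C, A, G (U in place of T for RNA). The sketch below is one possible approach on a toy sequence. The names `CODON_TABLE` and `translate` are hypothetical, and two choices are assumptions rather than requirements of the lab: stop codons are written as `*` while translation continues, and a trailing incomplete codon is dropped.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code (NCBI translation table 1), codons ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate an mRNA string codon by codon without stopping at stop codons."""
    # Accept DNA-style input as well by converting T to U first.
    mrna = mrna.upper().replace("T", "U")
    # Step through complete codons only; leftover bases at the end are ignored.
    codons = (mrna[i:i + 3] for i in range(0, len(mrna) - len(mrna) % 3, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


# Toy sequence for illustration only: AUG GCC UAA GGG, plus one leftover base.
toy_protein = translate("AUGGCCUAAGGGA")
print(toy_protein)
```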

In [None]:
# code here

Make a stacked bar chart of amino acid composition per transcript. You should
have one bar for each transcript along the x-axis, and each bar should be made
of 20 stacked segments, one for each amino acid.

To make this bar chart, we can leverage the Pandas package to create a `DataFrame` that makes the data amenable to a one-line plotting command.

This `DataFrame` should have the following structure:

In [None]:
# example DataFrame, your DataFrame should have the same exact columns
import pandas as pd

df = pd.DataFrame(columns=["Gene", "Position", "AA"])
df["Position"] = [1, 2, 3] * 3
df["Gene"] = ["Gene 1"] * 3 + ["Gene 2"] * 3 + ["Gene 3"] * 3
# arbitrary letters, for illustration only (not real sequence data)
df["AA"] = ["D", "T", "P"] + ["A", "B", "A"] + ["D", "A", "T"]

df

This is called a **long-form** DataFrame! Each gene name is repeated many times in the `Gene` column, once for every position in its protein.
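
One way to reach this long-form layout from a list of protein strings is sketched below, using only built-in Python. The function name `to_long_form`, the gene names, and the toy proteins are hypothetical.

```python
def to_long_form(gene_names: list, proteins: list) -> list:
    """Build one record per residue with the keys Gene, Position, and AA."""
    rows = []
    for gene, protein in zip(gene_names, proteins):
        # Positions are 1-based, matching the example DataFrame.
        for position, aa in enumerate(protein, start=1):
            rows.append({"Gene": gene, "Position": position, "AA": aa})
    return rows


# Toy input for illustration only.
toy_rows = to_long_form(["Gene 1", "Gene 2"], ["DTP", "MA"])
print(toy_rows[0])
```

Passing such a list of dictionaries to `pd.DataFrame` gives a table with the `Gene`, `Position`, and `AA` columns. From there, one possible route to the chart is a count table such as `pd.crosstab(df["Gene"], df["AA"])`, which can be drawn with `.plot(kind="bar", stacked=True)`.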

In [None]:
# code here and plot inline