In [None]:
## Sanity Check, look away or you will turn into stone
import sys
# Check that python versions are correct
assert sys.version_info.major == 3
assert sys.version_info.minor == 6

__author__ = "Emanuel Burgos"
__email__ = "eburgos@wisc.edu"

# Hour of Code with Mandel Lab #4

# 2020-09-24: Exercises

Textbook: [Python for Biologists](https://pythonforbiologists.com/) by Dr. Martin Jones

### Guidelines:

- The notebook is divided into sections by headers. Each section has small exercises that we can practice with as the discussion goes on. Each practice cell comes with a test cell that you can run to verify your solution. DO NOT MODIFY THE TEST CELL IN ANY WAY: run it to check your solution, but do not change the code inside it. Have fun.

## Easy

The file `data.csv` has the gene records for several *Drosophila* species (including *Drosophila melanogaster*) that we will be using. The column names are **[species, sequence, gene, expression]**.

In [3]:
## READ IN YOUR DATA

# ONE WAY
import pandas as pd
df = pd.read_csv('data.csv', names=['species', 'sequence', 'gene','expression'])

# ANOTHER WAY
data = []
with open('data.csv', 'r') as f:
    for line in f:
        data.append(line.strip().split(','))
data

Unnamed: 0,species,sequence,gene,expression
0,Drosophila melanogaster,atatatatatcgcgtatatatacgactatatgcattaattatagca...,kdy647,264
1,Drosophila melanogaster,actgtgacgtgtactgtacgactatcgatacgtagtactgatcgct...,jdg766,185
2,Drosophila simulans,atcgatcatgtcgatcgatgatgcatccgactatcgtcgatcgtga...,kdy533,485
3,Drosophila yakuba,cgcgcgctcgcgcatacggcctaatgcgcgcgctagcgatgc,hdt739,85
4,Drosophila ananassae,ttacgatcgatcgatcgatcgatcgtcgatcgtcgatgctacatcg...,hdu045,356
5,Drosophila ananassae,gcatcgatcgatcgcggcgcatcgatcgcgatcatcgatcatacgc...,teg436,222


#### 1. Several Species
Print out gene names for all genes belonging to *Drosophila melanogaster* or *Drosophila simulans*.

In [8]:
### YOUR SOLUTION HERE
df['length'] >= 90

0     True
1    False
2     True
3    False
4     True
5     True
Name: length, dtype: bool
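One possible approach to exercise 1 is sketched below. It uses the list-of-lists form produced by the second way of reading the file. The rows, gene names, and sequences here are made-up placeholders (an assumption for illustration, not the contents of `data.csv`), so treat it as a sketch rather than the expected answer.

```python
# Hypothetical rows in the column order [species, sequence, gene, expression].
# Gene names and sequences are placeholders, not the real records.
rows = [
    ["Drosophila melanogaster", "atatatcgcg", "geneA", "264"],
    ["Drosophila simulans", "atcgatcatg", "geneB", "485"],
    ["Drosophila yakuba", "cgcgcgctcg", "geneC", "85"],
]

wanted_species = {"Drosophila melanogaster", "Drosophila simulans"}

# Keep the gene name of every row whose species is one of the wanted ones.
selected_genes = [gene for species, _seq, gene, _expr in rows
                  if species in wanted_species]

for gene in selected_genes:
    print(gene)
```

With a pandas DataFrame, the same idea can be expressed with `df['species'].isin([...])` as the row filter.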

#### 2. Length range
Print out gene names for all genes between 90 and 110 bases long.

In [10]:
### YOUR SOLUTION HERE
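# Store each sequence's length in a new column
df['length'] = df['sequence'].str.len()
# Keep rows whose length is between 90 and 110 (inclusive) and print their gene names
for i, row in df[(df['length'] >= 90) & (df['length'] <= 110)].iterrows():
    print(row['gene'])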

kdy647
kdy533
teg436


#### 3. AT Content
Print out gene names for all genes whose AT content is less than 0.5 and whose expression level is greater than 200.

In [None]:
### YOUR SOLUTION HERE

#### 4. Complex Condition
Print out gene names for all genes whose names begin with "k" or "h", except those that belong to *Drosophila melanogaster*.

In [None]:
### YOUR SOLUTION HERE
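For exercise 4, the condition combines an "or" on the gene name with a "not" on the species. The sketch below shows one way to write that; the rows and gene names are hypothetical placeholders.

```python
# Hypothetical rows in the column order [species, sequence, gene, expression].
rows = [
    ["Drosophila melanogaster", "atat", "kxx001", "10"],
    ["Drosophila simulans", "atcg", "kxx002", "20"],
    ["Drosophila yakuba", "cgcg", "hxx003", "30"],
    ["Drosophila ananassae", "ttac", "txx004", "40"],
]

# str.startswith accepts a tuple, which covers the "k or h" part in one call.
matches = [gene for species, _seq, gene, _expr in rows
           if gene.startswith(("k", "h"))
           and species != "Drosophila melanogaster"]

for gene in matches:
    print(gene)
```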

#### 5. High low medium
For each gene, print out a message giving the gene name and saying whether its AT content is high (> 0.65), low (< 0.45), or medium (between 0.45 and 0.65).

In [None]:
### YOUR SOLUTION HERE
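A sketch of one way to structure exercise 5 with an `if` chain. It assumes AT content is expressed as a fraction, so the cutoffs are written as 0.65 and 0.45 (equivalent to 65% and 45%). The helper names `at_content` and `classify_at` and the rows are hypothetical.

```python
def at_content(sequence):
    """Return the fraction of bases in `sequence` that are A or T."""
    seq = sequence.lower()
    if not seq:
        return 0.0
    return (seq.count("a") + seq.count("t")) / len(seq)

def classify_at(fraction):
    """Label an AT fraction as 'high', 'low', or 'medium'."""
    if fraction > 0.65:
        return "high"
    if fraction < 0.45:
        return "low"
    return "medium"  # everything from 0.45 to 0.65 inclusive

# Hypothetical rows in the column order [species, sequence, gene, expression].
rows = [
    ["Drosophila melanogaster", "atatatatcg", "geneA", "264"],
    ["Drosophila simulans", "atatcgcgcg", "geneB", "485"],
    ["Drosophila yakuba", "atatacgcgc", "geneC", "85"],
]

labels = {gene: classify_at(at_content(seq)) for _species, seq, gene, _expr in rows}

for gene, label in labels.items():
    print(f"{gene}: AT content is {label}")
```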

## Hard

- I found this **side** exercise on Exercism, and I think it summarizes what we learned today well. If you cannot solve it today, do not worry; just keep it as homework.
- If you know how to use Exercism, go ahead and download the scripts to your computer and use your favorite `IDE` to solve it.
- Once you do, post the solution in **Exercism Teams** so you can share it with us and see how others did!

#### Introduction

##### Translate RNA sequences into proteins.

RNA can be broken into three-nucleotide sequences called codons, and then translated to a polypeptide like so:

RNA: `AUGUUUUCU` => translates to

Codons: `AUG`, `UUU`, `UCU` => which become a polypeptide with the following sequence =>

Protein: "Methionine", "Phenylalanine", "Serine"

There are 64 codons, which in turn correspond to 20 amino acids; however, not all of the codon sequences and resulting amino acids are needed for this exercise. If it works for one codon, the program should work for all of them. However, feel free to expand the list in the test suite to include them all.

There are also three terminating codons (also known as 'STOP' codons); if any of these codons are encountered (by the ribosome), all translation ends and the protein is terminated.

All subsequent codons are ignored, like this:

RNA: `AUGUUUUCUUAAAUG` =>

Codons: `AUG`, `UUU`, `UCU`, `UAA`, `AUG` =>

Protein: "Methionine", "Phenylalanine", "Serine"

Note the stop codon "UAA" terminates the translation and the final methionine is not translated into the protein sequence.

Below are the codons and resulting Amino Acids needed for the exercise.

| Codon | Protein |
|-|-|
| AUG | Methionine |
|UUU, UUC | Phenylalanine |
|UUA, UUG | Leucine |
|UCU, UCC, UCA, UCG | Serine |
|UAU, UAC | Tyrosine |
|UGU, UGC | Cysteine |
|UGG | Tryptophan |
|UAA, UAG, UGA | STOP |
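
The table above maps naturally onto a dictionary lookup. Below is a minimal sketch of the translation step, not the official Exercism solution. The function name `proteins` and the constant names are assumptions chosen for illustration, and only the codons listed in the table are included.

```python
# Codon-to-amino-acid lookup built from the table above.
CODON_TABLE = {
    "AUG": "Methionine",
    "UUU": "Phenylalanine", "UUC": "Phenylalanine",
    "UUA": "Leucine", "UUG": "Leucine",
    "UCU": "Serine", "UCC": "Serine", "UCA": "Serine", "UCG": "Serine",
    "UAU": "Tyrosine", "UAC": "Tyrosine",
    "UGU": "Cysteine", "UGC": "Cysteine",
    "UGG": "Tryptophan",
}
STOP_CODONS = {"UAA", "UAG", "UGA"}

def proteins(strand):
    """Translate an RNA strand into a list of amino acid names."""
    result = []
    # Step through the strand three bases at a time.
    for start in range(0, len(strand), 3):
        codon = strand[start:start + 3]
        if codon in STOP_CODONS:
            break  # a STOP codon ends translation; later codons are ignored
        result.append(CODON_TABLE[codon])  # raises KeyError for unknown codons
    return result

print(proteins("AUGUUUUCU"))
print(proteins("AUGUUUUCUUAAAUG"))
```

Both calls should produce Methionine, Phenylalanine, Serine, matching the two worked examples above. This sketch raises `KeyError` on a codon outside the table or on a trailing fragment shorter than three bases; how to handle those cases is left as a design choice.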