# Day 1 - Introduction to Python Programming for Biologists
All commands for today's session are found below, except where you see the word ***EXAMPLE***, which may be non-functional pseudo-code used to explain a concept.

If you see ***Exercises*** then there is a challenge for you to write some code yourself!

## Day 1
- Introduction to Python concepts and coding in notebooks
- Basic data types and simple functions
- Data manipulation and I/O handling      
- **More data types!<-------------**
- Project: Creating a program with variable inputs to manipulate DNA sequences

# More Data formats

So far we've just looked at single pieces of data, whether they are individual numbers or large strings of text. However, the real power comes from collections of data in lists, tuples, and dictionaries.



# Lists

A list is a collection of items that are ordered and changeable. Lists are one of the most commonly used data structures in Python and are used to store a collection of related values.

To create a list in Python, we simply use square brackets and separate the items in the list with commas:

In [None]:
# Create a list of DNA bases
bases = ['A', 'T', 'G', 'C', 'G', 'C']

print(bases)


['A', 'T', 'G', 'C', 'G', 'C']


We can access individual elements in a list by using their index. In Python, list indexes start at 0, so the first element in the list has an index of 0, the second element has an index of 1, and so on. 

When selecting multiple elements with a slice such as bases[1:4], the start index is included but the stop index is not, so the stop must be one past the last element you want.

In [None]:
# Access the first element in the list
print(bases[0])

# Access the second, third, and fourth element in the list
print(bases[1:4])

# Access from the third to the end of the list
print(bases[2:])

# Access the last element in the list
print(bases[-1])


A
['T', 'G', 'C']
['G', 'C', 'G', 'C']
C
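To make the slicing rule concrete, here is a small sketch using a made-up `codons` list (it is not part of the session data). It shows that the stop index is left out, and that an optional third number sets the step.

```python
# Hypothetical example list, used only to illustrate slicing
codons = ['ATG', 'GCC', 'TTA', 'CGA', 'TAA']

# The stop index is excluded: indexes 1 and 2 are returned, 3 is not
middle = codons[1:3]
print(middle)

# Leaving out the start index means "from the beginning"
first_two = codons[:2]
print(first_two)

# A third number sets the step: here, every second element
every_other = codons[::2]
print(every_other)

# Negative indexes count from the end of the list
last_two = codons[-2:]
print(last_two)
```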


Once you have data in your list, you can replace individual elements, or add, remove, insert etc.

In [None]:
# Modify the third element in the list
bases[2] = 'N'

print(bases)


['A', 'T', 'N', 'C', 'G', 'C']


In [None]:
# Add an element to the end of the list
bases.append('U')

# Insert an element at a specific index
bases.insert(4, 'N')

print(bases)


['A', 'T', 'N', 'C', 'N', 'G', 'C', 'U']


In [None]:
# Remove an element from the list - only the first instance!
bases.remove('N')

# Remove the last element from the list
bases.pop()

print(bases)


['A', 'T', 'C', 'N', 'G', 'C']
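The sketch below, on a made-up `seq` list, shows two details that are easy to miss: `.remove()` only deletes the first match, and `.pop()` hands back the element it removed. The `.count()` method has not been covered in this session; it is a standard list method used here only to check the result.

```python
# Hypothetical read-out containing ambiguous 'N' bases
seq = ['A', 'N', 'T', 'N', 'G']

# .count() reports how many times a value appears
n_before = seq.count('N')

# .remove() deletes only the first matching element
seq.remove('N')
n_after = seq.count('N')

# .pop() removes the last element and also returns it
removed = seq.pop()

print(n_before, n_after)
print(removed)
print(seq)
```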


You can also add multiple elements at the same time, or combine lists easily:

In [None]:
confirmed_genes = ['BRCA1', 'TP53', 'EGFR', 'KRAS']
suggested_target_genes = ['MET1', 'ROS']

all_genes = confirmed_genes + suggested_target_genes

print(all_genes)

['BRCA1', 'TP53', 'EGFR', 'KRAS', 'MET1', 'ROS']
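As a rough comparison, the sketch below shows three ways of adding one list to another and how the results differ. The short gene lists are trimmed-down copies used only for illustration.

```python
confirmed = ['BRCA1', 'TP53']
suggested = ['MET1', 'ROS']

# + builds a new list and leaves both originals unchanged
combined = confirmed + suggested

# .extend() adds each element of another list to the existing list
extended = ['BRCA1', 'TP53']
extended.extend(suggested)

# .append() adds its argument as ONE element, so the list becomes nested
nested = ['BRCA1', 'TP53']
nested.append(suggested)

print(combined)
print(extended)
print(nested)
print(len(extended), len(nested))
```

If you want one flat list of genes, `+` or `.extend()` is usually what you need; `.append()` is for adding a single item.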


## Exercise - BugFix

Try to find and correct the issue in the following code blocks:

In [2]:
# Our gene as a list of bases
my_insert = ['C', 'A', 'T', 'G', 'C', 'G', 'T', 'A', 'T', 'A', 'T', 'G', 'C', 'C', 'A', 'T', 'C', 'G']

# Get last base
length = len(my_insert)
last_base = my_inser[length]

print(last_base)

# Remove all N's
my_insert.remove("N")

print(my_insert)

IndexError: ignored

## Exercise - Manipulating lists

We know that the restriction enzyme **DANase** always cuts 9 bp after the ATG start codon. We want to create two list variables for the beginning and remaining part of the sequence and print them out.

Fix and complete this code. 
(Hint: In this example it should cut at CAT|GCA)

In [None]:
# Our gene as a list of bases
my_gene = ['A', 'T', 'G', 'C', 'C', 'G', 'C', 'G', 'G', 'C', 'A', 'T', 'G', 'C', 'A', 'T', 'G', 'C', 'G', 'T', 'A', 'T', 'A', 'T', 'G', 'C', 'C', 'A', 'T', 'C', 'G']

# Expected cut site: CAT|GCA
first_part = my_gene[]
second_part = my_gene[]

print(first_part)
print(second_part)

We can use the .join() method to change a list into a string for output.
Make an easily readable output highlighting the restriction cut in our sequence so that it looks like:
```...ATCG-CGAT...```


In [None]:
# To start you off
print(' -space- '.join(first_part))

---

# Tuples
A tuple is a collection of ordered and immutable elements. Tuples are similar to lists, but **once a tuple is created, its elements cannot be modified**. Because they cannot be sorted in place, elongated, or shortened, tuples are often used to store related values that are constant and do not change. A classic example would be genomic coordinates.

Because they are immutable, tuples are useful for storing data that should not be changed once read in. They are also generally a little more memory-efficient than lists holding the same elements, and can be marginally faster to work with.

Tuples are differentiated from lists by using round brackets (parentheses) instead. Because we define a specific format we can access elements like so:

In [None]:
# Define a genomic coordinate as a tuple
TFBS_coords = ('TFBS01', 1500, 1700)

# Print the ID
print(TFBS_coords[0])

# Print the start position
print(TFBS_coords[1])

# Print the end position
print(TFBS_coords[2])


TFBS01
1500
1700
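To see what "immutable" means in practice, this small sketch tries to change a value inside a made-up `coords` tuple and catches the error that Python raises. The `try`/`except` pattern has not been covered yet; it is used here only so the example runs to the end.

```python
# Hypothetical coordinate tuple: (ID, start, end)
coords = ('TFBS02', 200, 450)

try:
    coords[1] = 250   # attempt to change the start position
except TypeError as err:
    error_message = str(err)
    print("Tuples cannot be modified:", error_message)

# To "change" a tuple, build a new one instead
new_coords = (coords[0], 250, coords[2])
print(new_coords)
```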


Because of the fixed order of tuples, there is also a very useful feature called unpacking, which quickly assigns the values to variables:

In [None]:
# Unpack the tuple into three variables (site_id avoids shadowing the built-in id())
site_id, start, end = TFBS_coords

print("ID:\t", site_id)
print("Start:\t", start)
print("End:\t", end)
print("\nRegion string:", site_id + ":" + str(start) + "-" + str(end))

ID:	 TFBS01
Start:	 1500
End:	 1700

Region string: TFBS01:1500-1700
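Unpacking is handy when you hold several coordinate tuples. The sketch below uses a made-up `sites` list and assumes, for illustration only, that the length of a site is simply end minus start. It also shows the error you get when the number of variables does not match the number of elements.

```python
# Hypothetical list of (ID, start, end) tuples
sites = [('TFBS01', 1500, 1700), ('TFBS02', 2300, 2350)]

# Unpack the first tuple into three variables in one step
site_id, start, end = sites[0]
length = end - start
print(site_id, "spans", length, "bp")

# The number of variables must match the number of elements
try:
    site_id, start = sites[1]
except ValueError as err:
    unpack_error = str(err)
    print("Unpacking failed:", unpack_error)
```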


---

# Ranges
A range produces a sequence of integers on demand, where you define the start, the stop, and the step of your sequence. Strictly speaking, range is its own sequence type rather than a list, but it can be indexed like one and converted to one, and it is very useful!

You can easily create a simple range like so, but note that it is stored as a range object until you force it to be a list:

In [None]:
x = range(20)

# As its own object
print(x)

# Forced List
print(list(x))

range(0, 20)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]


You can also define the interval (step) that you want your sequence to have:

In [None]:
my_timeline = range(0, 20, 3)
print(list(my_timeline))

[0, 3, 6, 9, 12, 15, 18]
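A few more properties of ranges are sketched below with made-up sampling times: the stop value is never included, `len()` works on a range without converting it, and a negative step counts down.

```python
# Sample every 15 minutes from 0 up to, but not including, 60
sample_times = range(0, 60, 15)

print(list(sample_times))
print(len(sample_times))

# The stop value is excluded, so go one step further to include 60 itself
inclusive_times = range(0, 60 + 15, 15)
print(list(inclusive_times))

# A negative step counts down
countdown = list(range(10, 0, -2))
print(countdown)
```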


## Exercise - Creating ranges

You have taken a fluorescence measurement of your algae culture every **6 minutes for 3 hours** to see the response to a growth stimulation.

1. How many elements are going to be in your list? (You could either calculate it, or count the length of your measurements)
2. Create a list object named ```timeline```
3. Run the graph making code to plot your data

Note: Remember that the stop value of a range is not included, so it must be larger than the last time point you want!

In [None]:
# Your data
measurements = [0.68, 1.69, 3.37, 5.61, 7.89, 9.86, 10.9, 10.84, 9.86, 8.22, 6.46, 4.69, 3.19, 2.05, 1.24, 0.7, 0.38, 0.2, 0.1, 0.05, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 2.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

## Create your timeline
timeline = 


51

In [None]:
# Make a quick graph
import matplotlib.pyplot as plt

def plot_data(x, y):
    plt.plot(x, y)
    plt.xlabel('Time')
    plt.ylabel('Fluorescence')
    plt.show()

plot_data(timeline, measurements)

---

# Two Dimensional (2D) lists 

A 2D list is a list in which each element is itself a list. In the example below, each inner list represents a row of an annotation table.

Here the first element in each row is a confidence value and the second is the gene ID.

We can access individual elements of the annotation table using two indices, combining square bracket indexing as with a normal list:

In [None]:
gene_confs = [[0.92, 'MET1'], [0.82, 'EGFR'], [0.93, 'KRAS'], [0.4, 'TP53'], [0.94, 'ROS5'], [0.87, 'BRCA1']]

print(gene_confs[1][1])
print(gene_confs[3][0])
print(gene_confs[5][-1])

EGFR
0.4
BRCA1


We can then add, remove, or sort the rows as if it were a single list.
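As a minimal sketch of those operations, the example below works on a made-up table of read counts (not the gene data) so that the same ideas can be reused in the exercise further down.

```python
# Hypothetical table: each row is [read_count, sample_name]
read_counts = [[340, 'leaf'], [125, 'root'], [560, 'stem']]

# Add a new row to the end, just like appending to a normal list
read_counts.append([98, 'seed'])

# sorted() compares rows by their first element, then the second on ties
by_count = sorted(read_counts)
print(by_count)

# The last row of the sorted table has the highest count
top_sample = by_count[-1][1]
print(top_sample)

# A row is itself a list, so it can be modified in place
by_count[0].append('LOW')
print(by_count[0])
```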

There is also a built-in function named ```zip``` which can combine lists into pairs (tuples). We will come back to this later in the course, but it works like this (note that the zip object must be converted with ```list``` for printing):


In [None]:
bases = ["A", "T", "C", "G"]
count = [24, 17, 73, 26]

combined_base_count = zip(bases, count)

print(combined_base_count)
print(list(combined_base_count))

<zip object at 0x7f72335a5600>
[('A', 24), ('T', 17), ('C', 73), ('G', 26)]
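Two behaviours of `zip` are worth knowing early, and the sketch below illustrates both: a zip object can only be walked through once, and zipping stops at the end of the shortest input. The variable names are chosen for this example only.

```python
bases = ["A", "T", "C", "G"]
counts = [24, 17, 73, 26]

pairs = zip(bases, counts)

# The first conversion walks through the zip object and uses it up
first_pass = list(pairs)

# A second conversion finds nothing left
second_pass = list(pairs)

print(first_pass)
print(second_pass)

# zip stops at the shortest input, so the extra base is dropped
short = list(zip(["A", "T", "C"], [1, 2]))
print(short)
```

If you need the pairs more than once, store the result of `list(zip(...))` in a variable and reuse that list.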


It can also sometimes be useful to do this in reverse, splitting a list of pairs into two separate sequences (which are returned as tuples):

In [None]:
line_exp = [("WT", 98), ("Mut1", 76), ("Mut2", 86), ("Mut3", 79)]

line, exp = zip(*line_exp)

print(line)
print(exp)


('WT', 'Mut1', 'Mut2', 'Mut3')
(98, 76, 86, 79)


## Exercise: Manipulating 2D lists

1. Sort the gene list by confidence value - sort() orders the sublists by their first element
2. Create a new variable named ```highest``` with the gene ID with the highest confidence value and print it
3. We have new data! Add the gene ```APC1``` with the conf value ```0.97``` and the gene ```COI``` with ```0.72``` to the list
4. Let's put the three genes with the lowest confidence into a new list named ```low_conf_genes```
5. Actually, even better, let's add a new element to each of the low confidence genes saying ```RETEST```
6. Print out just the list of conf_values
7. Calculate the average of the values. You can use the ```sum()``` and ```len()``` functions on a list like this.

Note: If you have done programming before, you may notice that in several of these examples there are better and quicker ways to achieve the steps by using loops and conditionals. We will cover this in the next session, but for now use the commands we've learned.


In [None]:
gene_confs = [[0.92, 'MET1'], [0.62, 'EGFR'], [0.93, 'KRAS'], [0.74, 'TP53'], [0.94, 'ROS5'], [0.87, 'BRCA1']]


Highest confidence gene is: ROS5
[[0.62, 'EGFR'], [0.72, 'COI'], [0.74, 'TP53'], [0.87, 'BRCA1'], [0.92, 'MET1'], [0.93, 'KRAS'], [0.94, 'ROS5'], [0.97, 'APC1']]
[[0.62, 'EGFR', 'RETEST'], [0.72, 'COI', 'RETEST'], [0.74, 'TP53', 'RETEST'], [0.87, 'BRCA1'], [0.92, 'MET1'], [0.93, 'KRAS'], [0.94, 'ROS5'], [0.97, 'APC1']]


---

# Dictionaries

A dictionary at first looks like a 2D list, but it is different because all data is stored in key-value pairs that are looked up by key rather than by position. They are particularly useful (like a real dictionary!) for finding the corresponding value when you have the key, for example:

- ```gene => sequence```
- ```aminoAcid => frequency```

This dictionary has a frequency count of each base, where the first element in each pair is the "key" reference, and the second element is the "value". 

It's defined in the format of 'key': 'value', between braces (curly brackets):

In [None]:
# Create a dictionary of DNA base counts
base_counts = {'A': 101, 'T': 250, 'G': 125, 'C': 92}
print(base_counts)

# Print just the values
print(base_counts.values())

# Print just the keys
print(base_counts.keys())

{'A': 101, 'T': 250, 'G': 125, 'C': 92}
dict_values([101, 250, 125, 92])
dict_keys(['A', 'T', 'G', 'C'])


Note how the ```.keys()``` and ```.values()``` methods return dictionary view objects (similar to how ```range()``` gave us a range object). They can be turned into a list using the ```list()``` function in the same way.

We can also use the ```.items()``` method to return each key-value pair as a tuple, but that is for more complicated functions we will look at later.
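The sketch below, using a made-up `aa_counts` dictionary, shows the views being converted to lists and what `.items()` returns. It also shows that a view stays linked to its dictionary, so it reflects keys added later.

```python
# Hypothetical amino acid counts
aa_counts = {'Ala': 12, 'Gly': 7, 'Ser': 9}

# Convert the view objects into ordinary lists
names = list(aa_counts.keys())
numbers = list(aa_counts.values())
print(names)
print(numbers)

# .items() gives each key-value pair as a tuple
pairs = list(aa_counts.items())
print(pairs)

# A view is "live": it shows keys added after the view was created
keys_view = aa_counts.keys()
aa_counts['Thr'] = 4
updated_names = list(keys_view)
print(updated_names)
```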

We can then access individual values based on their key, or modify/delete them:

In [None]:
# Print corresponding value
print("Number of Adenine:", base_counts['A'])
print("Number of Thymine:", base_counts['T'])

# Modify the value for a key
base_counts['A'] = 65

# Add a new key-value pair to the dictionary
base_counts['N'] = 5

# Remove the 'C' key-value pair from the dictionary
del base_counts['C']

print(base_counts)


Dictionaries are really important and powerful. Let's look at another example where we can attach specific information to gene IDs, and use the ```.get()``` method to look up a key in the dictionary.

Here we can combine a gene sequence dictionary with our 2D list exercise from earlier:

In [None]:
# Dictionary of gene names and sequences
gene_dict = {'BRCA1': 'ATGTTGTCATCGTTGAGCTTTGCTTCCT',
             'TP53': 'ATGGAGGAGCCGCAGTCAGATC',
             'EGFR': 'ATGACCATCCAAGATGATGGTGTC',
             'KRAS': 'ATGACTGAATATAAACTTGTGGTAG',
             'BRAF': 'ATGGTCCAGCTTGGACCCACTCC',
             'ALK': 'ATGAAGGAGCCCTCAGATTTCTTG',
             'RET': 'ATGGGTGGGTTGTCGGAAGATCTT',
             'ROS1': 'ATGAGCCACCCAGGTCCCTGTAGT',
             'MET': 'ATGGCTTCAAGCTGTTGTCGTGAAGA'}

# gene confidence values
gene_confs = [[0.92, 'MET1'], [0.62, 'EGFR'], [0.93, 'KRAS'], [0.74, 'TP53'], [0.94, 'ROS5'], [0.87, 'BRCA1']]

# sort and get the lowest confidence entry
sorted_gene_confs = sorted(gene_confs)
lowest = sorted_gene_confs[0]
lowest_gene_ID = lowest[1]

print(lowest_gene_ID)

# Search dictionary keys for that ID
print(gene_dict.get(lowest_gene_ID))

# easier to read!
print("Gene sequence for", lowest[1], "(Confidence value:", str(lowest[0]) + ") is:",  gene_dict.get(lowest[1]))


EGFR
ATGACCATCCAAGATGATGGTGTC
Gene sequence for EGFR (Confidence value: 0.62) is: ATGACCATCCAAGATGATGGTGTC
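It is worth seeing what happens when the key is not in the dictionary. The sketch below uses a small made-up `sequences` dictionary with shortened, illustrative sequences to compare square brackets, `.get()`, and the `in` keyword.

```python
# Hypothetical dictionary of gene names and (shortened) sequences
sequences = {'EGFR': 'ATGACCATC', 'KRAS': 'ATGACTGAA'}

# Square brackets raise a KeyError when the key is missing
try:
    sequences['MET1']
except KeyError as err:
    print("No entry for", err)

# .get() returns None instead of raising an error
missing = sequences.get('MET1')
print(missing)

# A second argument to .get() sets the value returned for missing keys
fallback = sequences.get('MET1', 'sequence not found')
print(fallback)

# 'in' checks whether a key exists
has_kras = 'KRAS' in sequences
print(has_kras)
```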


Actually, let's throw away the lowest value. We can use ```.pop()``` much as with a list, except that for a dictionary we must give the key to remove. Let's throw away the bottom three:

In [None]:
print(sorted_gene_confs)

[[0.62, 'EGFR'], [0.74, 'TP53'], [0.87, 'BRCA1'], [0.92, 'MET1'], [0.93, 'KRAS'], [0.94, 'ROS5']]


In [None]:
print("gene_dict contains", len(gene_dict), "genes")
gene_dict.pop("EGFR")
gene_dict.pop("TP53")
gene_dict.pop("BRCA1")
print("gene_dict contains", len(gene_dict), "genes")

gene_dict contains 9 genes
gene_dict contains 6 genes


Actually, even better idea! Let's put the lowest three into a new list called ```bad_genes```.

What happens if you run this code immediately after the previous cell? Read the error, identify what is wrong, and correct it.

In [None]:
bad_genes = []

print("gene_dict contains", len(gene_dict), "genes")
bad_genes += gene_dict.pop("EGFR")
bad_genes += gene_dict.pop("TP53")
bad_genes += gene_dict.pop("BRCA1")
print("gene_dict contains", len(gene_dict), "genes")

There are lots of powerful methods to search through not just the keys, but also the values to find relevant keys, which can be very useful. But first we need to learn a bit about loops and conditionals!

## Exercises

1. Create a dictionary of bacteria and confluence values using the data:

```
["E. coli", "S. aureus", "P. aeruginosa", "K. pneumoniae", "A. baumannii"]

[60, 80, 75, 90, 70]
```
2. Print the confluence value of *K. pneumoniae* from the dictionary
3. We met ```sum()``` briefly in the 2D list exercise; it is called just like ```len()```. Use both of these to calculate the average confluence of all the samples in the dictionary




In [None]:
# Format reminder
bacteria_dict = {"E. coli": 60, ........}

{'E. coli': 60, 'S. aureus': 80, 'P. aeruginosa': 75, 'K. pneumoniae': 90, 'A. baumannii': 70}
90
75.0
