# Practical Python Programming for Biologists
Author: Dr. Daniel Pass | www.CompassBioinformatics.com

---

# More Data formats

So far we've just looked at single pieces of data (as individual numbers, strings of text, or True/False values). However, the real power comes from collections of data in lists, tuples, and dictionaries.

# Lists

A list is a collection of items that are ordered and changeable. Lists are one of the most commonly used data structures in Python and are used to store a collection of related values.

To create a list in Python, we simply use square brackets and separate the items in the list with commas:

In [None]:
# Create a list of DNA bases
bases = ['A', 'T', 'G', 'C', 'G', 'C']

print(bases)

Because they are stored in order we can access individual elements in a list by using their index. Indexes always start at 0, so the first element in the list has an index of 0, the second element has an index of 1, and so on. 

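When selecting multiple elements with a slice, give a start and a stop index separated by a colon. The start index is included but the stop index is not, so ```bases[1:4]``` returns the elements at indexes 1, 2, and 3. Note that a slice returns a list too, not just the elements you ask for.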

In [None]:
# Access the first element in the list
print(bases[0])

In [None]:
# Access the second, third, and fourth element in the list
print(bases[1:4])
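
As a quick check of the slicing rule, here is a minimal sketch using a hypothetical list named ```codon_bases``` (an assumption for illustration, not part of the course data). It shows that the stop index is left out and that a slice is a list even when it holds a single element.

```python
# Hypothetical example list, used only to illustrate slicing
codon_bases = ['A', 'T', 'G', 'C', 'C', 'G']

# Start at index 1 and stop before index 4
middle = codon_bases[1:4]
print(middle)

# A one-element slice is still a list, unlike plain indexing
single_slice = codon_bases[0:1]
single_item = codon_bases[0]
print(single_slice, single_item)

# The number of elements returned is stop minus start
print(len(middle))
```

A handy rule of thumb: for a slice ```[start:stop]``` that stays within the list, the result has ```stop - start``` elements.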

There are also some useful shortcuts. For example, if you leave one value empty it will assume the start/end of the list. You can also use the minus ```-``` character to count from the end of the list.

In [None]:

# Access from the third to the end of the list
print(bases[2:])

# Access the last element in the list
print(bases[-1])
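
Negative indexes and empty slice values can also be combined. The sketch below uses a hypothetical list named ```read_bases``` (an assumption, not course data) to take the last three elements and everything except the last one.

```python
# Hypothetical example list
read_bases = ['A', 'T', 'G', 'C', 'G', 'C']

# From the third-from-last element to the end of the list
last_three = read_bases[-3:]
print(last_three)

# From the start of the list, stopping before the last element
all_but_last = read_bases[:-1]
print(all_but_last)
```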


Once you have data in your list, you can easily replace individual elements, or add, remove, insert etc.

In [None]:
# Modify the third element in the list
bases[2] = 'N'

print(bases)

In [None]:
# Add an element to the end of the list
bases.append('U')

# Insert an element at a specific index
bases.insert(4, 'N')

print(bases)

# Remove an element from the list - only the first instance!
bases.remove('N')
print(bases)

The ```.pop()``` method is useful because it removes an element (the last one by default) and also returns it, so you can store it in a new variable.

In [None]:
# A list of sites we still have to process
sites = ["CDF", "BRI", "LDN", "SWN", "EXT"]

# Remove the last element from the list (and discard it)
print(sites)
sites.pop()
print(sites)

In [None]:
# A list of sites we still have to process
sites = ["CDF", "BRI", "LDN", "SWN", "EXT"]

last_site = sites.pop()
print("The most recent site was", last_site)
print("Remaining sites to be processed are:", sites)
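
The ```.pop()``` method also accepts an optional index if you want an element other than the last one. A minimal sketch, assuming a hypothetical queue of sample IDs named ```samples```:

```python
# Hypothetical queue of sample IDs
samples = ["S1", "S2", "S3", "S4"]

# pop() accepts an optional index; 0 removes and returns the first element
first_sample = samples.pop(0)

print("Processing first:", first_sample)
print("Still queued:", samples)
```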

You can also add multiple elements at the same time, or combine lists easily:

In [None]:
confirmed_genes = ['BRCA1', 'TP53', 'EGFR', 'KRAS']
suggested_target_genes = ['MET1', 'ROS']

all_genes = confirmed_genes + suggested_target_genes + ['ABC13']

print(all_genes)
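
Using ```+``` builds a new list. If you would rather grow an existing list in place, the ```.extend()``` method is one option. The sketch below uses hypothetical gene names (assumptions for illustration only) and also shows what happens if ```.append()``` is given a whole list.

```python
# Hypothetical gene lists
core_genes = ['GENE_A', 'GENE_B']
extra_genes = ['GENE_C', 'GENE_D']

# + builds a new list and leaves both originals untouched
merged = core_genes + extra_genes
print(core_genes)

# .extend() adds the elements to the existing list in place
core_genes.extend(extra_genes)
print(core_genes)

# .append() with a list adds it as ONE nested element
nested = ['GENE_A']
nested.append(['GENE_B', 'GENE_C'])
print(nested)
```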

## Exercise - BugFix

Try to find and correct the issues in the code block:

In [None]:
# Our gene insert as a list of bases
my_insert = ['C', 'A', 'T', 'G', 'C', 'G', 'T', 'A', 'T', 'A', 'T', 'N', 'N', 'C', 'A', 'T', 'C', 'G']

# Get last base
region_length = len(my_insert)
last_base = my_insert[length]

print(last_base)

# Remove all N's
my_insert.remove("N")

print(my_insert)

## Exercise - Manipulating lists

We know that the restriction enzyme **DANase** always cuts 9bp after the start codon. We want to create two list variables for the beginning and remaining part of the sequence and print them out.

1. Complete this code. Using numerical slicing, cut the list 9 bases ***after*** ATG

In [None]:
# Our gene as a list of bases
my_gene = ['A', 'T', 'G', 'C', 'C', 'G', 'C', 'G', 'G', 'C', 'A', 'T', 'G', 'G', 'A', 'C', 'G', 'C', 'G', 'T', 'A', 'T', 'A', 'T', 'G', 'C', 'C', 'A', 'T', 'C', 'G']

# Expected cut site: CAT|GGA
first_part = my_gene[]
second_part = my_gene[]

print(first_part)
print(second_part)

<details>
<summary>Expected output</summary>

In this example it should cut at (```CAT|GGA```)

</details>

Extension: We can use the ```.join()``` method to join all the elements in a list into a single string, which makes it easier to read. The string between the quotes is what will be placed between each element.

2. Make an easily readable output highlighting the restriction cut in our sequence so that it looks like:
```...ATCG-CGAT...```


In [None]:
# To start you off
print(' - '.join(first_part))
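
To get a feel for how the separator in ```.join()``` behaves, here is a small sketch on a hypothetical list named ```short_seq``` (an assumption, unrelated to the exercise data).

```python
# Hypothetical short sequence, unrelated to the exercise data
short_seq = ['G', 'A', 'T', 'T', 'A', 'C', 'A']

# An empty separator glues the bases straight together
joined_plain = ''.join(short_seq)
print(joined_plain)

# Any string can be used as the separator
joined_dash = '-'.join(short_seq)
print(joined_dash)
```

Note that ```.join()``` only works when every element in the list is a string.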

---

# Tuples
A tuple is a collection of "ordered and immutable" elements. Basically they are similar to lists, but **once a tuple is created, its elements cannot be changed**. 

Because they won't be sorted, elongated or shortened, tuples are often used to store related values that are constant and do not change. A classic example would be genomic co-ordinates.

Because they are immutable they can also be slightly more memory-efficient than lists, and in some cases faster to create and access.

Tuples are differentiated from lists by using round brackets (parentheses) instead. Because we define a specific format we can access elements like so:

In [None]:
# Define a genomic coordinate as a tuple
TFBS_coords = ('TFBS01', 1500, 1700)

# Print the ID
print(TFBS_coords[0])

# Print the start position
print(TFBS_coords[1])

# Print the end position
print(TFBS_coords[2])
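
To see immutability in action, the sketch below tries to overwrite one element of a hypothetical tuple named ```exon_coords``` (an assumption for illustration). The ```try```/```except``` wrapper has not been covered yet in this course; it is used here only so that the cell keeps running after the error.

```python
# Hypothetical coordinate tuple
exon_coords = ('EXON01', 200, 350)

try:
    exon_coords[1] = 250   # attempt to overwrite the start position
except TypeError as error:
    print("Tuples cannot be modified:", error)

# To "change" a tuple, build a new one from the old values instead
updated_coords = (exon_coords[0], 250, exon_coords[2])
print(updated_coords)
```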


Because of the fixed order of tuples there is also a very useful function to quickly assign the values to variables.

We know the exact length and order of the tuple elements so we can assign them together on one line:

In [None]:
# Assign our three value tuple, to three new variable names
# (site_id is used rather than id, which is the name of a built-in Python function)
site_id, start, end = TFBS_coords

# And we can use them immediately
print("ID:   ", site_id)
print("Start:", start)
print("End:  ", end)
print("Region format output:", site_id + ":" + str(start) + "-" + str(end))

---

# Ranges
A range is its own type of object rather than a list, but let's look at ranges quickly now as they are a very useful way of creating lists of numbers. You can think of a range as an automatic number sequence, where you define the start, the stop, and the step (interval) of the sequence.

You can easily create a simple range like so, but note that it is stored as a range-object until you force it to be a list:

In [None]:
x = range(20)

# As its own object
print(x)

# Forced List
print(list(x))

You can also define the interval that you want your sequence to have:

In [None]:
my_timeline = range(0, 20, 3)
print(list(my_timeline))
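
The stop value of a range is never included, which matters when the final time point has to appear in the sequence. A minimal sketch, assuming a hypothetical sampling scheme of every 15 minutes for 1 hour:

```python
# Hypothetical sampling scheme: every 15 minutes for 1 hour, starting at 0
sample_times = list(range(0, 61, 15))
print(sample_times)

# With a stop of 60 the final time point would be missing
too_short = list(range(0, 60, 15))
print(too_short)

# A range knows its length without being converted to a list
n_points = len(range(0, 61, 15))
print(n_points)
```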

## Exercise - Creating ranges

You have taken a fluorescence measurement of your algae culture every **6 minutes for 3 hours** to see the response to a growth stimulation. Let's plot that data on a graph, starting from time 0.

Your data is stored in a list variable named ```measurements```

1. How many elements are going to be in your list? (You could either calculate it, or count the length of your measurements)
2. Create a list object named ```timeline``` using ```range()```. This will be your X-axis
3. Run the graph making code to plot your data

Note: Remember that the stop value of ```range()``` is not included in the sequence, so it must be larger than your final time point!

In [None]:
# Your data
measurements = [0.68, 1.69, 3.37, 5.61, 7.89, 9.86, 10.9, 10.84, 9.86, 8.22, 6.46, 4.69, 3.19, 2.05, 1.24, 0.7, 0.38, 0.2, 0.1, 0.05, 0.02, 0.01, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 1.0, 2.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

## Create your timeline
timeline =  


In [None]:
# Make a quick graph
import matplotlib.pyplot as plt

plt.plot(timeline, measurements)
plt.xlabel('Time')
plt.ylabel('Fluorescence')
plt.show()

---

# Two Dimensional (2D) lists 

2D lists are where each element of the list is a list itself. You can imagine it a little like rows and columns in a spreadsheet or dataframe.

Here the first element in each row is a confidence value, the second is the gene ID, and the third is the length of the gene. 

We can access individual elements of the annotation table using two indices, the first for the row and the second for the position within that row, by chaining square brackets just like with a normal list:

In [None]:
gene_confs = [[0.92, 'MET1', 2205], [0.82, 'EGFR', 1567], [0.93, 'KRAS', 6523], [0.4, 'TP53', 5002], [0.94, 'ROS5', 1999], [0.87, 'BRCA1', 2323]]

# Try to find each of these outputs in the gene_confs list
print(gene_confs[1][1])
print(gene_confs[3][0])
print(gene_confs[5][2])
print(gene_confs[5][-1])

We can then also add, remove, or sort the list in just the same way as a simple list. Sorting compares the first element of each sublist, and only looks at the following elements to break ties (there are more complex methods to sort on other elements, for later in the course!)

One thing to note: there are two sort-related approaches. The ```.sort()``` method changes the list you already have, whereas ```sorted(list)``` creates a new list and leaves the original unchanged.

In [None]:
new_list = sorted(gene_confs)
# Now we have two lists
print(gene_confs)
print(new_list)

# Here we just modify the first list
gene_confs.sort(reverse = True)
print(gene_confs)
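
When two sublists share the same first element, Python moves on to compare the next element. The sketch below illustrates this with a hypothetical list of ```[score, label]``` rows (an assumption, not course data).

```python
# Hypothetical rows of [score, label]
scored = [[0.9, 'B'], [0.5, 'C'], [0.9, 'A']]

# The two 0.9 rows tie on the first element, so the labels decide their order
ordered = sorted(scored)
print(ordered)

# sorted() leaves the original list unchanged
print(scored)
```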

#### Advanced: 

There is also a built-in function named ```zip()``` which pairs up the elements of two or more lists. Incredibly useful!

We will come back to this later in the course in more complex examples, but it works simply like this:

In [None]:
bases = ["A", "T", "C", "G"]
count = [24, 17, 73, 26]

combined_base_count = zip(bases, count)

print(combined_base_count)
print(list(combined_base_count))

Note that each pair is a tuple, not a list, and that the zip object must be converted with ```list()``` before its contents can be printed.

It can also be helpful to do this in reverse, splitting a list of pairs into two separate sequences (which are returned as tuples):

In [None]:
line_exp = [("WT", 98), ("Mut1", 76), ("Mut2", 86), ("Mut3", 79)]

line, exp = zip(*line_exp)

print(line)
print(exp)
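
One thing to watch out for: a zip object can only be read through once. A minimal sketch with hypothetical codon data (the names ```codons``` and ```counts``` are assumptions for illustration):

```python
# Hypothetical paired data
codons = ["ATG", "TAA", "TGA"]
counts = [12, 3, 5]

pairs = zip(codons, counts)

first_pass = list(pairs)    # reads every pair out of the zip object
second_pass = list(pairs)   # nothing is left to read the second time

print(first_pass)
print(second_pass)
```

If you need the pairs more than once, store the result of ```list()``` in a variable and reuse that.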


## Exercise: Manipulating 2D lists

Here we have a list of genes with three values: confidence score, geneID, and gene length

1. Sort the gene list by confidence value
2. Create a new variable named ```highest``` which contains the gene ID of the gene with the highest confidence value and print it
3. We have new data! Add the gene ```APC1``` with the conf value ```0.97``` and length ```4287``` and the gene ```COI``` with ```0.72``` and ```1660``` to the list (there are several ways to do this! You could look back at the joining list section)
4. Put the three genes with the lowest confidence into a new list named ```low_conf_genes``` and print it.

Extension: 

5. Add a new element into the gene_confs list for the low confidence genes so that they have a 4th element saying ```RETEST```

Note: If you have done some programming before you may notice that in several of these examples there are better and quicker ways to achieve the steps by using loops and conditionals. We will cover this in the next session but for now, use the commands we've learned


In [None]:
gene_confs = [[0.92, 'MET1', 2205], [0.82, 'EGFR', 1567], [0.93, 'KRAS', 6523], [0.4, 'TP53', 5002], [0.94, 'ROS5', 1999], [0.87, 'BRCA1', 2323]]

# Your code


---

# Dictionaries

A dictionary at first looks like a 2D list but is actually quite different because all data is stored as key-value pairs, and values are looked up by their key rather than by their position. Dictionaries are particularly useful for finding the corresponding value (like a real dictionary!) when you have the key, for example:

- ```geneID => sequence```
- ```aminoAcid => frequency```
- ```site => longitude/latitude```

This example dictionary has a frequency count of each base, where the first element in each pair is the "key" reference, and the second element is the "value". 

It's defined in the format of ```'key' : 'value'```, between braces ```{ }``` (curly brackets)

In [None]:
# Create a dictionary of DNA base counts
base_counts = {'A': 101, 'T': 250, 'G': 125, 'C': 92}
print(base_counts)

We can then access individual values based on their key just like looking them up in a dictionary, or modify/delete them:

In [90]:
base_counts = {'A': 101, 'T': 250, 'G': 125, 'C': 92}

# Print corresponding value (two methods)
print("Number of Adenine:", base_counts.get('A'))
print("Number of Thymine:", base_counts['T'])

print(base_counts)
# Modify the value for a key
base_counts['A'] = 65

# Add a new key-value pair to the dictionary
base_counts['N'] = 5

# Remove the 'C' key-value pair from the dictionary
base_counts.pop('C')

print(base_counts)

# There is also the del statement
del base_counts['T']
print(base_counts)

Number of Adenine: 101
Number of Thymine: 250
{'A': 101, 'T': 250, 'G': 125, 'C': 92}
{'A': 65, 'T': 250, 'G': 125, 'N': 5}
{'A': 65, 'G': 125, 'N': 5}


We can use the square bracket method to return the value of a dictionary key, however the .get() method is usually better because it handles missing data better. Note the difference:

In [92]:
print("Number of X:", base_counts.get('X'))

Number of X: None


In [93]:
print("Number of X:", base_counts['X'])

KeyError: 'X'
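
The ```.get()``` method also takes an optional second argument, which is returned instead of ```None``` when the key is missing. The sketch below uses a hypothetical dictionary named ```codon_counts``` (an assumption, not course data) and also shows the ```in``` operator for checking whether a key exists.

```python
# Hypothetical counts dictionary
codon_counts = {'ATG': 12, 'TAA': 3}

# The second argument to .get() is returned when the key is missing
missing_count = codon_counts.get('TGA', 0)
print(missing_count)

# The 'in' operator checks whether a key exists without raising an error
has_atg = 'ATG' in codon_counts
has_tga = 'TGA' in codon_counts
print(has_atg, has_tga)
```

Using ```0``` as the fallback is convenient for counts, because the result can go straight into a calculation.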

A useful pair of methods are ```.keys()``` and ```.values()```, which return dictionary view objects (similar to the range object we saw earlier). They can be turned into a list using the ```list()``` function in the same way.

We can also use the ```.items()``` method to return the key:value pairs as tuples, but that is for more complicated functions we will look at later.


In [None]:
# Print just the values
print(base_counts.values())

# Print just the keys
print(base_counts.keys())
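
As a short sketch of what these view objects look like once converted, here is a hypothetical dictionary named ```site_depths``` (the name and values are assumptions for illustration). It also shows that a view keeps up with later changes to the dictionary.

```python
# Hypothetical dictionary of site codes and values
site_depths = {'CDF': 12.5, 'BRI': 8.0, 'LDN': 15.2}

site_names = list(site_depths.keys())
depth_values = list(site_depths.values())
site_pairs = list(site_depths.items())

print(site_names)
print(depth_values)
print(site_pairs)

# A view object reflects later changes to the dictionary
names_view = site_depths.keys()
site_depths['SWN'] = 9.9
print(names_view)
```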

---
#### Dictionary Example
Dictionaries are really important and powerful. Let's look at another example where we can attach specific information to gene IDs, and use the ```.get()``` method to search the dictionary.

Here we can combine a gene sequence dictionary with our 2D list exercise from earlier:

In [None]:
## The data

# Dictionary of gene names and sequences
gene_dict = {'BRCA1': 'ATGTTGTCATCGTTGAGCTTTGCTTCCT',
             'TP53': 'ATGGAGGAGCCGCAGTCAGATC',
             'EGFR': 'ATGACCATCCAAGATGATGGTGTC',
             'KRAS': 'ATGACTGAATATAAACTTGTGGTAG',
             'BRAF': 'ATGGTCCAGCTTGGACCCACTCC',
             'ALK': 'ATGAAGGAGCCCTCAGATTTCTTG',
             'RET': 'ATGGGTGGGTTGTCGGAAGATCTT',
             'ROS1': 'ATGAGCCACCCAGGTCCCTGTAGT',
             'MET': 'ATGGCTTCAAGCTGTTGTCGTGAAGA'}

# gene confidence values
gene_confs = [[0.92, 'MET1', 2205], [0.82, 'EGFR', 1567], [0.93, 'KRAS', 6523], [0.4, 'TP53', 5002], [0.94, 'ROS5', 1999], [0.87, 'BRCA1', 2323]]

# sort and get lowest conf gene ID
gene_confs.sort()
lowest_gene_conf = gene_confs[0][0]
lowest_gene_ID = gene_confs[0][1]

print(lowest_gene_ID)

In [None]:
# Search dictionary keys for that ID
print(gene_dict.get(lowest_gene_ID))

# easier to read!
print("Gene sequence for", lowest_gene_ID, "(Confidence value:", str(lowest_gene_conf) + ") is:",  gene_dict.get(lowest_gene_ID))

Let's continue our example from 2D lists and remove the three lowest-confidence genes from our dictionary:

In [None]:
print("gene_dict contains", len(gene_dict), "genes")
gene_dict.pop("EGFR")
gene_dict.pop("TP53")
gene_dict.pop("BRCA1")
print("gene_dict contains", len(gene_dict), "genes")

Actually, an even better idea! Instead of just throwing the data out, let's put the DNA sequences of the three lowest-confidence genes from the dictionary into a new list called ```bad_genes``` before they get discarded.

Note: what happens if you run this code immediately after the previous cell? Read the error and identify what is wrong.

In [None]:
bad_genes = []

print("gene_dict contains", len(gene_dict), "genes")
bad_genes.append(gene_dict.pop("EGFR"))
bad_genes.append(gene_dict.pop("TP53"))
bad_genes.append(gene_dict.pop("BRCA1"))
print("gene_dict contains", len(gene_dict), "genes")

In [None]:
print(bad_genes)
print(gene_dict)
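
If a key might already have been removed, one option is to give ```.pop()``` a default value as its second argument, which is returned instead of raising a ```KeyError```. A minimal sketch on a hypothetical dictionary named ```seqs``` (an assumption, not the course data):

```python
# Hypothetical dictionary of gene names and sequences
seqs = {'GENE_A': 'ATGC', 'GENE_B': 'ATGG'}

removed_first = seqs.pop('GENE_A', None)   # key exists: returns 'ATGC'
removed_again = seqs.pop('GENE_A', None)   # key already gone: returns None

print(removed_first)
print(removed_again)
print(seqs)
```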

Dictionaries and lists are powerful ways to handle data, and working with them often involves going between the two. There are also lots of powerful methods to search through not just the keys but also the values to find relevant keys, which can be very useful. But first we need to learn a bit about loops and conditionals!

## Exercises

1. Create a dictionary of bacteria and confluence values using the data:

```
["E. coli", "S. aureus", "P. aeruginosa", "K. pneumoniae", "A. baumannii"]
[60, 82, 75, 91, 70]
```
2. Print the confluence value of *K. pneumoniae* from the dictionary
3. A function we haven't used yet is ```sum()```, which is called in the same way as ```len()``` but adds up the numbers in a collection. Use both of these to calculate the average confluence of all the samples in the dictionary (what's the easiest way to get all the values?)




In [95]:
# Format reminder
bacteria_dict = {"E. coli": 60,}

91
75.6
