# Practical Python Programming for Biologists
Author: Dr. Daniel Pass | www.CompassBioinformatics.com

---
# General Python Concepts

We want to get on to some cool biology, but first we need to learn a few basic things about how Python works. Let's go!

## Basic Data types in Python

All data in Python (stored in variables) falls into a small number of types. The type of a variable determines how it behaves in your code, so it's important to know which one to use.

Some are obvious, like a string. We assign it with an equals character, with the information inside quotes ```" "``` or ```' '``` (single and double quotes work the same). We can use the ```type()``` function to tell us what type of data it is:

### Strings & Booleans

In [None]:
my_gene = "TGCATGCATGCTAGCGTAC"
type(my_gene)

str

---

Another important but self-explanatory data type is the Boolean, which holds a True/False value. Note the capital first letter of ```True``` and ```False```.




In [None]:
# Was the site sampled?
WT01 = True

type(WT01)

bool

If we run a line of code that makes a comparison, Python returns whether it is true or false. The comparison could be mathematical, string-based, or one of many other kinds we will see.

Note:
*   one equals symbol ( ```=``` ) assigns a value
*   two equals symbols ( ```==``` ) TEST whether the two sides are equal.

In [None]:
2 + 2 == 4

True

In [None]:
my_species = "Xenopus"

"Drosophila" == my_species

False
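Other comparison operators follow the same pattern and also return a Boolean. The sketch below is only an illustration: `site_reads` and `min_reads` are made-up variables that are not part of the examples above.

```python
my_species = "Xenopus"
site_reads = 400500   # hypothetical read count, for illustration only
min_reads = 1000      # hypothetical threshold, for illustration only

is_xenopus = my_species == "Xenopus"      # equal to
not_fly = my_species != "Drosophila"      # not equal to
enough_reads = site_reads >= min_reads    # greater than or equal to
case_matters = my_species == "xenopus"    # string comparison is case sensitive

print(is_xenopus, not_fly, enough_reads, case_matters)
```

Note that the last comparison is False because "Xenopus" and "xenopus" differ in their first letter.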



---
### Numbers

There are two main number types: **integers** (whole numbers) and **floats** (decimal, or floating-point, numbers).

In [None]:
read_count = 400500
type(read_count)

int

In [None]:
frequencyObserved = 0.4231
type(frequencyObserved)

float


You have to be especially careful with numbers, because converting a float (decimal number) to an integer can cause errors. ```int()``` doesn't round the value, it just throws away everything after the decimal point.

In [None]:
int(frequencyObserved)

0

In [None]:
count = int(432.237)
count * 1000

432000
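To make the difference between truncating and rounding concrete, here is a small sketch comparing ```int()``` with the built-in ```round()``` function. The value 432.737 is an arbitrary number chosen for illustration.

```python
value = 432.737   # arbitrary example value

truncated = int(value)      # drops everything after the decimal point
rounded = round(value)      # rounds to the nearest whole number
negative = int(-432.737)    # truncation moves towards zero, not downwards

print(truncated)
print(rounded)
print(negative)
```

If you actually want the nearest whole number, ```round()``` is usually the safer choice.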

More information on how floating-point numbers become strange [can be found here](https://docs.python.org/3/tutorial/floatingpoint.html)

It is really important to be aware of what type your data is, especially when combining data. If the types differ, things can go very wrong!

We're using Python 3, which is usually more predictable than older Python versions. In Python 2, dividing one integer by another returned an integer by default, whereas now ```/``` always returns a float. But there can still be some unexpected rounding to be aware of.

In [None]:
100/3

33.333333333333336
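As a rough sketch of the division behaviour described above, the example below compares normal division with the ```//``` (whole-number division) and ```%``` (remainder) operators. These two operators are not used elsewhere in this section and are shown only for comparison, along with one classic floating-point surprise.

```python
true_division = 100 / 3      # always a float in Python 3
floor_division = 100 // 3    # whole-number part only
remainder = 100 % 3          # what is left over

print(true_division, floor_division, remainder)

# Floats are stored in binary, so some decimal sums are very slightly off
total = 0.1 + 0.2
exact_match = total == 0.3               # False, because of the tiny error
rounded_match = round(total, 2) == 0.3   # True once we round first

print(exact_match, rounded_match)
```

Rounding before comparing is one simple way to avoid being caught out by these tiny errors.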

### Combining variables

Here we have three variables. A string, an integer, and a float. Or do we?

Run the codeblock and work out what is happening.

In [None]:
my_gene = "TGCATGCATGCTAGCGTAC"
read_count = 400500
frequency_observed = "0.4231"

print(my_gene * 4)
print(read_count * 4)
print(frequency_observed * 4)

TGCATGCATGCTAGCGTACTGCATGCATGCTAGCGTACTGCATGCATGCTAGCGTACTGCATGCATGCTAGCGTAC
1602000
0.42310.42310.42310.4231


Can you see what went wrong there? Putting the float value inside " " quotes resulted in it being interpreted as a string, so it was repeated four times rather than multiplied as a number like the **read_count** variable above it.
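If a number has arrived as a string (which often happens when reading data from a file), one way to fix it is to convert it with the built-in ```float()``` function. A minimal sketch, where `frequency_as_number` is a new variable name introduced just for this example:

```python
frequency_observed = "0.4231"   # a string, because of the quotes

# Convert the text into a real number
frequency_as_number = float(frequency_observed)

print(type(frequency_observed))
print(type(frequency_as_number))
print(frequency_as_number * 4)
```

The multiplication now behaves mathematically rather than repeating the text.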

In each of these examples we're assigning data or values to a label, or "variable name". This means we can refer back to them later, or multiple times, in our code. Variables don't automatically get displayed on the screen or saved to a file, which is why we use the ```print()``` function whenever we want to see a variable's value.

It also means we can pass variables through multiple steps. Let's convert our gene sequence into something a bit more gene-like by repeating it 4 times for good luck and adding a start codon, stop codon, and polyA tail!

In [None]:
# Defining some variables
my_gene = "TGCATGCATGCTAGCGTAC"
start_codon = "ATG"
stop_codon = "GGC"  # Placeholder only: GGC codes for glycine; real stop codons are TAA, TAG and TGA
polyA = "AAAAAAAAAA"

# My code
newGene = start_codon + my_gene * 4 + stop_codon + polyA
print("My new gene is:", newGene)
print("It is", len(newGene), "bases long")

My new gene is: ATGTGCATGCATGCTAGCGTACTGCATGCATGCTAGCGTACTGCATGCATGCTAGCGTACTGCATGCATGCTAGCGTACGGCAAAAAAAAAA
It is 92 bases long


We introduced a few things there. Firstly, you'll see the lines that begin with a # character (called a hash, a pound, or a hashtag in the Twitter era!). We use them to put comments in our code, to explain it or to help it make sense when we return to it later. Comments are super important!

Secondly, we defined our variables at the top. In this short example we could have just typed the sequences straight in, but imagine you had code that was hundreds of lines long and you typed the stop codon every time. Then you realise you used a stop codon for bacteria, not eukaryotes! This way you only need to change one line and it will correct your whole script.

The actual code line does what is called string concatenation (basically just joining strings together) and also multiplies the central sequence. This is a very Pythonic way to work: other languages may require you to declare the types, but Python works them out for you.

And to be fancy we also finish up by counting the length of our string using the `len` function and including it in the print command.

## Errors & Help

Coding is always going to go wrong at some point. Let's see what happens in this case, where we have a similar program to the one above, but I've made some mistakes.

### Exercise - BugFix

Run the broken code below and read the error message. It contains three separate mistakes, so fix one, run it again, and read the next message. Try to correct the errors without looking back at the example above.

In [None]:
# Defining some variables
gene_part1 = "TGCATGCATGCTAGCGTAC"
gene_part2 = "GGGTCGATAGCGCGTATAATGC"
repeat_element = "CAG"
insertion = "-"

# My code - Join the parts, with an x8 CAG repeat and insertion.
newGene = gene_part1 + (repeat_element + insertion ) + 8 + gene_prt2
print("My new gene is:", newGene)
print("It is", len(newGene), "bases long"))

TypeError: ignored

Sometimes you'll get an error like this. Other times your code will run to completion but the output will be incorrect, as with the quoted number earlier, so it's important to check and test your outputs to be certain they are as expected!
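One simple way to catch silent mistakes is to check a variable's type or value before relying on it. The sketch below uses the built-in ```isinstance()``` function, which is not covered elsewhere in this section. The variable names `is_number` and `in_range` are made up for this example.

```python
frequency_observed = 0.4231

# Confirm the value really is a number before doing maths with it
is_number = isinstance(frequency_observed, (int, float))
print("Is frequency_observed a number?", is_number)

# A frequency should always fall between 0 and 1
in_range = 0 <= frequency_observed <= 1
print("Is it a valid frequency?", in_range)
```

If either check prints False, you know to look at how the variable was defined before trusting any later results.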

Python has a huge number of methods and syntax rules, so it's good to consult the manual or find help. Google is always useful, but you can find incorrect information written for a different version, so it can be easiest to ask for help in code, which is guaranteed to describe the version you're working with:

In [None]:
help(print)

Help on built-in function print in module builtins:

print(...)
    print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
    
    Prints the values to a stream, or to sys.stdout by default.
    Optional keyword arguments:
    file:  a file-like object (stream); defaults to the current sys.stdout.
    sep:   string inserted between values, default a space.
    end:   string appended after the last value, default a newline.
    flush: whether to forcibly flush the stream.
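The help text above mentions the optional ```sep``` and ```end``` arguments. As a quick sketch of what they do (the `gene_id` value here is a made-up label for illustration):

```python
gene_id = "COI"       # hypothetical label, for illustration only
read_count = 400500

# sep changes what is placed between the values (default is a space)
print(gene_id, read_count, sep="\t")

# end changes what is printed after the last value (default is a newline)
print("Reads:", end=" ")
print(read_count)
```

The first line prints the two values separated by a tab, and the last two ```print()``` calls end up on the same line of output.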



---

## Exercise - Data Types 
Let's imagine we have written a program that produces a range of result variables. Try to create a useful output in plain text that could be easily understood by a non-coding biologist, using as many of the variables as possible.

Extension: Include a calculation of % Up/Down regulated genes

In [None]:
# Number of upregulated genes
up_genes = 8423

# Number of downregulated genes
down_genes = 7239

# Total genes in reference database
tot_genes = 23000

# Significance threshold for t-test
sig_thresh = 0.05

# False Discovery Rate correction was performed?
FDR = True

# Author of analysis
auth = ""


In [None]:
# Write your output code here




# Methods & Data Manipulation
Python has many useful built-in functions, as well as methods that belong to a particular type of object.

We've already seen ```len()``` to output the length of an object. Two other useful numerical ones are ```count``` and ```round```. 

```count()``` is a string method and ```round()``` is a function, which affects how they are written in code, but let's not worry about the background for now. We'll look at methods in more detail later; here is a quick demonstration of how they are used, as we will rely on them a lot:

In [None]:
myAA = "MVKLRYFMVKLRYFHPCQDEGANISTWHPCQDEGANISTISTW"
print("My amino acid sequence is:", myAA)

# Count the number of Ks (lysines) in your sequence
countK = myAA.count("K")
print("There are", countK, "Lysine residues in my protein")

# As a percentage
K_percent = countK / len(myAA) * 100
print("That is", K_percent, "% of the protein")

# Rounded
K_percent_rounded = round(K_percent, 2)
print("That is", K_percent_rounded, "% of the protein")



My amino acid sequence is: MVKLRYFMVKLRYFHPCQDEGANISTWHPCQDEGANISTISTW
There are 2 Lysine residues in my protein
That is 4.651162790697675 % of the protein
That is 4.65 % of the protein
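The practical difference between a function and a method is mostly in how you write them. A minimal sketch, using a made-up sequence `my_seq` and the string method ```upper()```, which is not used elsewhere in this section:

```python
my_seq = "atgcatgc"   # hypothetical sequence, for illustration only

# Functions take the object as an argument: function(object)
seq_length = len(my_seq)

# Methods are attached to the object with a dot: object.method()
upper_seq = my_seq.upper()
g_count = my_seq.count("g")

print(seq_length, upper_seq, g_count)
```

Note that ```upper()``` returns a new string and leaves `my_seq` itself unchanged.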




---


# Exercises - Data Manipulation

We've seen various ways of changing the raw data, so now you will try to use them. Don't forget to use the ```help()``` function for a quick reminder. In Colab, hovering the mouse over a function will show the same help text.

**1 - Write some code to calculate the GC% and print the result**

**2 - Output your result as a float number. Then to 2 decimal places.**

**3 - Include your outputs as a sentence combining the numbers and text**

I have put in the first lines, which read in a block of text and remove the newlines. We'll look at better ways of reading data into our scripts in a later session.

In [None]:
#>U96639.2:5349-6893 Canis familiaris mitochondrion, cytochrome c oxidase subunit I
gene_id = "C. familiaris COI"
input = """ATGTTCATTAACCGATGACTGTTCTCCACTAATCACAAGGATATTGGTACTTTATACTTACTATTTGGAG
CATGAGCCGGTATAGTAGGCACTGCTTTGAGCCTCCTCATCCGAGCCGAACTAGGTCAGCCCGGTACTTT
ACTAGGTGACGATCAAATTTATAATGTCATCGTAACCGCCCATGCTTTCGTAATAATCTTCTTCATAGTC
ATGCCCATCATAATTGGGGGCTTTGGAAACTGACTAGTGCCGTTAATAATTGGTGCTCCGGACATGGCAT
TCCCCCGAATAAATAACATGAGCTTCTGACTCCTTCCTCCATCCCGCCTTCTACTATTAGCATCTTCTAT
GGTAGAAGCAGGTGCAGGAACGGGATGAACCGTATACCCCCCACTGGCTGGCAATCTGGCCCATGCAGGA
GCATCCGTTGACCTTAGCGCGCGCGCGCCACACTTAGCCGGAGTCTCTTCTATTTTAGGGGCAATTAATT
TCATCACTACTATTATCAACATAAAACCCCCTGCAATATCCCAGTATCAAACTCCCCTGTTTGTATGATC
AGTACTAATTACAGCAGTTCTACTCTTACTATCCCTGCCTGTACTGGCTGCTGGAATTACAATACTTTTA
ACAGACCGGAATCTTAATACAACACGCGCGGATCCCGCTGGAGGAGGAGACCCTATCCTATATCAACACC
TATTCTGATTCTTCGGACATCCTGAAGTTTACATTCTTATCCTGCCCGGATTCGGAATAATTTCTCACAT
CGCGCGCGCGCGCGAGCGAGCGCGCAGCGGGCGCATGCCAACCACGGCATCGCGCGCGACGCGCCCCCCG"""

# Remove the newlines (note: the name "input" shadows Python's built-in input() function)
gene_sequence = input.replace("\n", "")


In [None]:
# Write your code here


