# Python for Biologists

Exercises for the book written by Dr. Martin Jones.

## Chapter 2: Printing and manipulating text

Some methods modify the object they are called on in place, while others return a new object and leave the original unchanged.
String methods like `.lower()` and `.replace()` do not modify the string they are called on, because **strings and integers are immutable**; their result has to be assigned to a variable to be kept. **Lists are mutable**, so a call to `my_list.reverse()` reverses the list in place without re-assignment.
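
The difference can be checked directly. The following is a minimal sketch; the variable names `dna`, `lowered`, `bases` and `result` are chosen only for illustration.

```python
# Strings are immutable: methods return a new string, the original is untouched.
dna = 'ACTG'
lowered = dna.lower()
print(dna, lowered)      # ACTG actg

# Lists are mutable: reverse() changes the list in place and returns None.
bases = ['A', 'C', 'T', 'G']
result = bases.reverse()
print(bases, result)     # ['G', 'T', 'C', 'A'] None
```

Because in-place methods such as `reverse()` return `None`, writing `bases = bases.reverse()` would lose the list.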

### Exercises

Calculating AT content

In [1]:
seq = 'ACTGATCGAATTCACGTATATATATTTCATATATAGCTAGCTAGCTA'
(seq.count('A') + seq.count('T'))/len(seq)

0.7021276595744681

Complementing DNA

In [2]:
new_seq = ''
for i in seq:
    if i == 'A': new_seq += 'T'
    elif i == 'T': new_seq += 'A'
    elif i == 'G': new_seq += 'C'
    else: new_seq += 'G'
print(seq + '\n' + new_seq)

ACTGATCGAATTCACGTATATATATTTCATATATAGCTAGCTAGCTA
TGACTAGCTTAAGTGCATATATATAAAGTATATATCGATCGATCGAT
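
As an alternative to the loop, the complement can probably be written more compactly with a translation table. This is a sketch that assumes the sequence contains only the upper-case letters A, C, G and T; any other character would be passed through unchanged.

```python
seq = 'ACTGATCGAATTCACGTATATATATTTCATATATAGCTAGCTAGCTA'

# map each base to its complement: A<->T, G<->C
complement_table = str.maketrans('ATGC', 'TACG')
complement = seq.translate(complement_table)

print(seq)
print(complement)
```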


Restriction fragment length. Write a program that recognizes the _EcoRI_ restriction site `G*AATTC` and returns the lengths of the resulting fragments.
This version handles only a single restriction site, not multiple sites.

In [3]:
def restrict(seq, site = 'GAATTC', cut = 1):
    # find start pos of seq and calculate fragment lengths
    cut_site = seq.find(site) + cut
    print('total length: ' + str(len(seq)))
    print('fragment length: {0}, {1}'.format(str(cut_site),str(len(seq)-cut_site)))

restrict(seq)

total length: 47
fragment length: 8, 39
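
One possible way to lift the single-site limitation is to collect every cut position with repeated calls to `str.find()`. The function name `fragment_lengths` and the second test sequence are made up for this sketch; it also returns the whole sequence length when the site is absent, a case in which `restrict()` would report a misleading cut at position 0.

```python
def fragment_lengths(seq, site='GAATTC', cut=1):
    """Return the lengths of all fragments after cutting at every site."""
    cut_positions = []
    start = seq.find(site)
    while start != -1:
        cut_positions.append(start + cut)
        # continue searching one position after the last hit
        start = seq.find(site, start + 1)
    # fragment boundaries: sequence start, every cut position, sequence end
    boundaries = [0] + cut_positions + [len(seq)]
    return [end - begin for begin, end in zip(boundaries, boundaries[1:])]

seq = 'ACTGATCGAATTCACGTATATATATTTCATATATAGCTAGCTAGCTA'
print(fragment_lengths(seq))                    # [8, 39]
print(fragment_lengths('AAGAATTCTTGAATTCAA'))   # [3, 8, 7]
print(fragment_lengths('ACGTACGT'))             # [8]
```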


Splicing out introns, part one.

From the following sequence, remove one intron spanning nt 15 to 30 (including the boundary nts). Note that Python counts from zero and that a slice excludes its end index. Nt 15 has index 14 and nt 30 has index 29, so the slice has to run from 14 up to 30.

In [4]:
intron = seq[14:30]
exon = seq.split(intron)
print(''.join(exon))

ACTGATCGAATTCAATATAGCTAGCTAGCTA
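
The conversion from 1-based, inclusive nucleotide positions to Python slice indices can be wrapped in a small helper. This is only a sketch; `splice_out` is a hypothetical name. Slicing on both sides of the intron also avoids `split()`, which would cut at every occurrence if the intron sequence appeared more than once.

```python
def splice_out(seq, first_nt, last_nt):
    """Remove nucleotides first_nt..last_nt (1-based, inclusive) from seq."""
    start = first_nt - 1   # 1-based position -> 0-based index
    stop = last_nt         # slice end is exclusive, so no -1 here
    return seq[:start] + seq[stop:]

seq = 'ACTGATCGAATTCACGTATATATATTTCATATATAGCTAGCTAGCTA'
print(splice_out(seq, 15, 30))   # ACTGATCGAATTCAATATAGCTAGCTAGCTA
```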


Splicing out introns, part two.

Calculate the percentage of the DNA sequence that is coding (i.e. consists of exons). The value below is a fraction; multiply it by 100 to obtain a percentage.

In [5]:
len(''.join(exon))/len(seq)

0.6595744680851063

Splicing out introns, part three.

Print coding sequences (exons) in upper case and non-coding (introns) in lower case.

In [6]:
print(seq)
print(exon[0] + intron.lower() + exon[1])

ACTGATCGAATTCACGTATATATATTTCATATATAGCTAGCTAGCTA
ACTGATCGAATTCAcgtatatatatttcatATATAGCTAGCTAGCTA


## Chapter 3: Reading and writing files

### Exercises

Splitting genomic DNA. Write intron and exon sequences to two separate files.

In [7]:
with open('./intron.txt', 'x') as f:
    f.write(intron)

with open('./exon.txt', 'x') as f:
    f.write('\n'.join(exon))

Write a `.fasta` file with three arbitrary sequences.

In [8]:
with open('./example.fasta', 'x') as f:
    for i, j in enumerate(exon + [intron]):
        f.write('>sequence_' + str(i) + '\n')
        f.write(j + '\n')

Read the written files and print.

In [9]:
for fil in ['./intron.txt', './exon.txt', './example.fasta']:
    with open(fil, 'r') as f:
        print('FILE: {}'.format(fil))
        print(f.read(), '\n')

FILE: ./intron.txt
CGTATATATATTTCAT 

FILE: ./exon.txt
ACTGATCGAATTCA
ATATAGCTAGCTAGCTA 

FILE: ./example.fasta
>sequence_0
ACTGATCGAATTCA
>sequence_1
ATATAGCTAGCTAGCTA
>sequence_2
CGTATATATATTTCAT
 



In [10]:
# remove test files again
import os
for f in ['./intron.txt', './exon.txt', './example.fasta']:
    os.remove(f)
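
Manual cleanup can be avoided by writing test files into a temporary directory that is deleted automatically. The sketch below uses the standard library `tempfile` module; the three sequences are the exon and intron strings from above, and the names `sequences`, `fasta_path` and `content` are chosen for illustration.

```python
import os
import tempfile

sequences = ['ACTGATCGAATTCA', 'ATATAGCTAGCTAGCTA', 'CGTATATATATTTCAT']

# the directory and everything in it is removed when the with-block ends
with tempfile.TemporaryDirectory() as tmp_dir:
    fasta_path = os.path.join(tmp_dir, 'example.fasta')
    with open(fasta_path, 'w') as f:
        for i, s in enumerate(sequences):
            f.write('>sequence_{}\n{}\n'.format(i, s))
    with open(fasta_path, 'r') as f:
        content = f.read()

print(content)
print(os.path.exists(fasta_path))   # False
```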

## Chapter 4: Lists and Loops

### Exercises

Read an example FASTA file that contains different DNA sequences which share a 14 bp adapter at the beginning.
Write a program that will a) remove the adapter, and b) print the length of the sequence on screen.
From now on we will omit writing new files as they have to be removed again.

In [11]:
# make new list
res = []

# read content line by line and modify sequences
with open('./python_for_biologists_examples/lists_and_loops/exercises/input.txt', 'r') as f:
    for l in f.readlines():
        # strip the trailing newline so it is not counted in the length
        seq_trimmed = l.rstrip('\n').replace('ATTCGATTATAAGC', '')
        print('length = {0} bp'.format(len(seq_trimmed)))
        res.append(seq_trimmed)

print('\n'.join(res))

length = 42 bp
length = 37 bp
length = 48 bp
length = 33 bp
length = 47 bp
TCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
ACTGATCGATCGATCGATCGATCGATGCTATCGTCGT
ATCGATCACGATCTATCGTACGTATGCATATCGATATCGATCGTAGTC
ACTATCGATGATCTAGCTACGATCGTAGCTGTA
ACTAGCTAGTCTCGATGCATGATCAGCTTAGCTGATGATGCTATGCA
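
Lines read from a file keep their trailing newline, and that character counts toward `len()`. The sketch below illustrates this with an in-memory file; the two short sequences are invented and only the 14 bp adapter is taken from the exercise. It trims the adapter by slicing, which touches only the start of the line.

```python
import io

adapter = 'ATTCGATTATAAGC'
# hypothetical stand-in for input.txt
fake_file = io.StringIO('ATTCGATTATAAGCACTG\nATTCGATTATAAGCTTGACC\n')

raw_lengths = []
clean_lengths = []
for line in fake_file:
    raw = line[len(adapter):]                  # still ends with '\n'
    clean = line.rstrip('\n')[len(adapter):]   # newline removed first
    raw_lengths.append(len(raw))
    clean_lengths.append(len(clean))
    print(repr(raw), len(raw), repr(clean), len(clean))
```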



Load a single 'long' genomic sequence, and extract exon sequences by positional index.

In [12]:
# read the genomic sequence from the first line of the file
with open('./python_for_biologists_examples/lists_and_loops/exercises/genomic_dna.txt', 'r') as f:
    seq_genomic = f.readline()

res = []
with open('./python_for_biologists_examples/lists_and_loops/exercises/exons.txt', 'r') as f:
    for l in f.readlines():
        ind = l.replace('\n', '')
        ind = ind.split(',')
        start = int(ind[0])
        stop = int(ind[1])
        print(seq_genomic[start:stop])
        res += [seq_genomic[start:stop]]
        
print('concatenated result:\n' + ''.join(res))

CGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCG
CGATCGATCGATATCGATCGATATCATCGATGCATCGATCATCGATCGATCGATCGATCGA
CGATCGATCGATCGTAGCTAGCTAGCTAGATCGATCATCATCGTAGCTAGCTCGACTAGCTACGTACGATCGATGCATCGATCGTA
CGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGTAGCTAGCTACGATCG
concatenated result:
CGTACCGTCGACGATGCTACGATCGTCGATCGTAGTCGATCATCGATCGATCGCGATCGATCGATATCGATCGATATCATCGATGCATCGATCATCGATCGATCGATCGATCGACGATCGATCGATCGTAGCTAGCTAGCTAGATCGATCATCATCGTAGCTAGCTCGACTAGCTACGTACGATCGATGCATCGATCGTACGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGTAGCTAGCTACGATCG
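
The same logic can be tried without the example files. In this sketch the genomic sequence and the `start,stop` pairs are invented, and `io.StringIO` stands in for `exons.txt`; the coordinates are assumed to be 0-based with an exclusive stop, as in the code above.

```python
import io

genomic = 'AAAACCCCGGGGTTTT'              # hypothetical genomic sequence
exon_file = io.StringIO('0,4\n8,12\n')    # hypothetical exon coordinates

coding = ''
for line in exon_file:
    # 'start,stop' -> two integers
    start, stop = (int(x) for x in line.strip().split(','))
    coding += genomic[start:stop]

print(coding)   # AAAAGGGG
```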


## Chapter 5: Writing your own functions

### Useful functions

The `assert` statement can be used to conduct simple unit tests, i.e. to compare a function's output to a predefined result.
The following function only prints its argument to the console as a side effect and has no `return` statement. Such functions return `None` in Python (the equivalent of `NULL` in R).

In [13]:
def my_fun(arg):
    print(arg)

print(my_fun("test"))

test
None


In [14]:
# this test does not fail
assert my_fun("test") is None

test
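
A failing assertion raises an `AssertionError`, and an optional message after the comma is attached to it. The function `gc_count` below is a made-up example, and the second assertion is wrong on purpose to show what a failure looks like.

```python
def gc_count(seq):
    """Count G and C bases in an upper-case DNA string."""
    return seq.count('G') + seq.count('C')

# passes silently
assert gc_count('GGCC') == 4, 'GGCC should contain four G/C bases'

# fails on purpose; the message after the comma is attached to the error
failure_message = None
try:
    assert gc_count('ATAT') == 1, 'expected one G/C base in ATAT'
except AssertionError as err:
    failure_message = str(err)

print('test failed:', failure_message)
```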


### Exercises

Percentage of amino acids, part 1.

Write a function that takes two arguments, a protein sequence and an amino acid residue (one-letter code), and returns the percentage of the protein that the amino acid makes up. Use assertions to test your function.

In [15]:
def percent_as(seq, as_residue):
    return(seq.count(as_residue)/len(seq)*100)

assert percent_as(seq = "MSRSLLLRFLLFLLLLPPLP", as_residue = "M") == 5
assert percent_as(seq = "MSRSLLLRFLLFLLLLPPLP", as_residue = "R") == 10
assert percent_as(seq = "MSRSLLLRFLLFLLLLPPLP", as_residue = "L") == 50
assert percent_as(seq = "MSRSLLLRFLLFLLLLPPLP", as_residue = "Y") == 0

Percentage of amino acids, part 2.

Modify the function above so that it returns the combined percentage for an arbitrary list of amino acids. If no amino acid argument is passed, the function should return the percentage of hydrophobic amino acids.

In [16]:
def percent_as(seq, as_residue = ("A", "I", "L", "M", "F", "W", "Y", "V")):
    # a tuple default avoids the pitfalls of a mutable default argument
    aa_percent = [seq.count(i)/len(seq)*100 for i in as_residue]
    return sum(aa_percent)

assert percent_as(seq = "MSRSLLLRFLLFLLLLPPLP", as_residue = ["M"]) == 5
assert percent_as(seq = "MSRSLLLRFLLFLLLLPPLP", as_residue = ["M", "L"]) == 55
assert percent_as(seq = "MSRSLLLRFLLFLLLLPPLP", as_residue = ["F", "S", "L"]) == 70
assert percent_as(seq = "MSRSLLLRFLLFLLLLPPLP") == 65
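
The assertions above compare floating point results with `==`. That works for these values, but in general rounding errors can make such comparisons fail. A more tolerant sketch uses `math.isclose()`; the function name `percent_aa` is chosen here to keep it apart from the solution above.

```python
import math

def percent_aa(seq, residues=('A', 'I', 'L', 'M', 'F', 'W', 'Y', 'V')):
    """Return the combined percentage of the given residues in seq."""
    return sum(seq.count(r) for r in residues) / len(seq) * 100

# exact comparison of floats can fail unexpectedly
print(0.1 + 0.2 == 0.3)               # False
print(math.isclose(0.1 + 0.2, 0.3))   # True

protein = 'MSRSLLLRFLLFLLLLPPLP'
assert math.isclose(percent_aa(protein, ['M']), 5)
assert math.isclose(percent_aa(protein, ['F', 'S', 'L']), 70)
assert math.isclose(percent_aa(protein), 65)
```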

## Chapter 6: Conditional tests

### Exercises

Import the data table `data.csv`. Then perform various tasks with it.

In [17]:
with open('./python_for_biologists_examples/conditional_tests/exercises/data.csv', 'r') as f:
    data = f.readlines()

print(data)

['Drosophila melanogaster,atatatatatcgcgtatatatacgactatatgcattaattatagcatatcgatatatatatcgatattatatcgcattatacgcgcgtaattatatcgcgtaattacga,kdy647,264\n', 'Drosophila melanogaster,actgtgacgtgtactgtacgactatcgatacgtagtactgatcgctactgtaatgcatccatgctgacgtatctaagt,jdg766,185\n', 'Drosophila simulans,atcgatcatgtcgatcgatgatgcatccgactatcgtcgatcgtgatcgatcgatcgatcatcgatcgatgtcgatcatgtcgatatcgt,kdy533,485\n', 'Drosophila yakuba,cgcgcgctcgcgcatacggcctaatgcgcgcgctagcgatgc,hdt739,85\n', 'Drosophila ananassae,ttacgatcgatcgatcgatcgatcgtcgatcgtcgatgctacatcgatcatcatcggattagtcacatcgatcgatcatcgactgatcgtcgatcgtagatgctgacatcgatagca,hdu045,356\n', 'Drosophila ananassae,gcatcgatcgatcgcggcgcatcgatcgcgatcatcgatcatacgcgtcatatctatacgtcactgccgcgcgtatctacgcgatgactagctagact,teg436,222\n']
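
Instead of splitting every line by hand, the standard library `csv` module could be used to parse the table once. In this sketch the two rows are shortened, hypothetical versions of the entries above, and `io.StringIO` stands in for the file; the dictionary keys are names chosen for illustration.

```python
import csv
import io

# shortened, hypothetical rows in the same four-column layout as data.csv
sample = io.StringIO(
    'Drosophila yakuba,cgcgcgct,hdt739,85\n'
    'Drosophila simulans,atcgatca,kdy533,485\n'
)

records = []
for species, sequence, gene, expression in csv.reader(sample):
    records.append({
        'species': species,
        'sequence': sequence,
        'gene': gene,
        'expression': int(expression),   # newline is already stripped
    })

for record in records:
    print(record['gene'], record['expression'])
```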


Print out gene names for selected species.

In [18]:
for entry in data:
    entry_list = entry.split(',')
    if entry_list[0] in ['Drosophila melanogaster', 'Drosophila simulans']:
        print('gene name = ', entry_list[2])

gene name =  kdy647
gene name =  jdg766
gene name =  kdy533


Print gene names for genes that are between 90 and 110 bases long.

In [19]:
for entry in data:
    entry_list = entry.split(',')
    length = len(entry_list[1])
    if 90 <= length <= 110:
        print('gene name = ', entry_list[2])

gene name =  kdy647
gene name =  kdy533
gene name =  teg436


Print gene names for all genes whose AT content is less than 0.5 and whose expression level is greater than 200.

In [20]:
for entry in data:
    entry_list = entry.split(',')
    seq = entry_list[1]
    at_content = (seq.count('a') + seq.count('t'))/len(seq)
    level = int(entry_list[3].rstrip('\n'))
    if level > 200 and at_content < 0.5:
        print('gene name = ', entry_list[2])

gene name =  teg436


Print gene names for genes that begin with 'k' or 'h' but DO NOT belong to 'Drosophila melanogaster' species.

In [21]:
for entry in data:
    entry_list = entry.split(',')
    name_start = entry_list[2][0]
    if name_start in ['k', 'h'] and entry_list[0] != 'Drosophila melanogaster':
        print('gene name = ', entry_list[2])

gene name =  kdy533
gene name =  hdt739
gene name =  hdu045


For each gene, print the gene name and its AT content category. Low: AT <= 0.45. Medium: 0.45 < AT <= 0.65. High: AT > 0.65.

In [22]:
for entry in data:
    entry_list = entry.split(',')
    seq = entry_list[1]
    at_content = (seq.count('a') + seq.count('t'))/len(seq)
    if at_content <= 0.45:
        at_cat = 'low'
    elif at_content <= 0.65 and at_content > 0.45:
        at_cat = 'medium'
    elif at_content > 0.65:
        at_cat = 'high'
    print('gene {0} with AT content: {1}'.format(entry_list[2], at_cat))

gene kdy647 with AT content: high
gene jdg766 with AT content: medium
gene kdy533 with AT content: medium
gene hdt739 with AT content: low
gene hdu045 with AT content: medium
gene teg436 with AT content: medium
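
The categorization can be moved into small functions so that the boundary values can be tested with assertions. This is a sketch; `at_content` and `at_category` are hypothetical helper names, and the sequences are assumed to be lower case as in `data.csv`.

```python
def at_content(seq):
    """Return the AT fraction of a lower-case DNA sequence."""
    return (seq.count('a') + seq.count('t')) / len(seq)

def at_category(fraction):
    """Classify an AT fraction as 'low', 'medium' or 'high'."""
    if fraction <= 0.45:
        return 'low'
    elif fraction <= 0.65:
        # reached only when fraction > 0.45
        return 'medium'
    else:
        return 'high'

# boundary values belong to the lower category
assert at_category(0.45) == 'low'
assert at_category(0.65) == 'medium'
assert at_category(0.66) == 'high'

print(at_category(at_content('atgc')))   # medium
```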
