# Python for biologists, chapter 2

Geert Jan Bex ([geertjan.bex@uhasselt.be](mailto:geertjan.bex@uhasselt.be))

Exercises taken from 'Python for biologists' by Martin Jones.

## Calculating AT content

Given a string that represents a DNA sequence, compute the AT content, i.e., the fraction of nucleotides that are either A or T.

In [1]:
dna = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'

In [2]:
at_content = (dna.count('A') + dna.count('T'))/len(dna)

In [3]:
print('{0:.3%}'.format(at_content))

68.519%


If the `print` function call above looks intimidating, don't worry, we will talk about it later. However, you can try to guess how it works by looking carefully at the result and experimenting a bit.

Alternatively, a function can be defined that iterates over the DNA sequence only once. The approach above iterates over it twice, once for each call to `count`.

In [4]:
def compute_at_content(seq):
    at_count = 0
    for nucl in seq:
        if nucl == 'A' or nucl == 'T':
            at_count += 1
    return at_count/len(seq)

In [5]:
at_content = compute_at_content(dna)

In [6]:
at_content

0.6851851851851852

## Complementing DNA

Given a string representing a DNA sequence, compute its complement.

In [7]:
dna = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'

In [8]:
for nucl in dna:
    if nucl == 'A':
        print('T', end='')
    elif nucl == 'C':
        print('G', end='')
    elif nucl == 'G':
        print('C', end='')
    elif nucl == 'T':
        print('A', end='')

TGACTAGCTAATGCATATCATAAACGATAGTATGTATATATAGCTACGCAAGTA

Alternatively, use the string method `replace` to substitute the nucleotides.

In [9]:
print(dna.replace('A', 't').replace('C', 'g').replace('G', 'c').replace('T', 'a').upper())

TGACTAGCTAATGCATATCATAAACGATAGTATGTATATATAGCTACGCAAGTA
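
The lower case replacement characters in the call above are not cosmetic. The following sketch, which uses a short toy sequence chosen here for illustration, compares a naive chain with upper case replacements to the lower case version. In the naive chain, the `T` produced by the first `replace` is turned back into an `A` by the last one.

```python
dna = 'ACTG'  # toy sequence, chosen for illustration only

# Naive chain: every replacement is upper case, so later calls also
# modify characters that were already substituted by earlier calls.
naive = dna.replace('A', 'T').replace('C', 'G').replace('G', 'C').replace('T', 'A')

# Lower case intermediates are invisible to the later replace calls;
# upper() restores the case once all substitutions are done.
safe = dna.replace('A', 't').replace('C', 'g').replace('G', 'c').replace('T', 'a').upper()

print(naive)  # prints ACAC, which is not the complement
print(safe)   # prints TGAC
```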


A nicer approach is to define a function that takes a sequence, and returns the complement as a string.

In [10]:
def complement(seq):
    compl = ''
    for nucl in seq:
        if nucl == 'A':
            compl += 'T'
        elif nucl == 'C':
            compl += 'G'
        elif nucl == 'G':
            compl += 'C'
        elif nucl == 'T':
            compl += 'A'
    return compl

In [11]:
complement(dna)

'TGACTAGCTAATGCATATCATAAACGATAGTATGTATATATAGCTACGCAAGTA'
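
As a possible variation, the same mapping can be expressed with the built-in `str.maketrans` and `str.translate`. This is only a sketch; the names `COMPLEMENT_TABLE` and `complement_translate` are made up for this example and do not come from the exercises.

```python
# Translation table that maps each nucleotide to its complement.
COMPLEMENT_TABLE = str.maketrans('ACGT', 'TGCA')


def complement_translate(seq):
    """Return the complement of seq (hypothetical helper for illustration)."""
    return seq.translate(COMPLEMENT_TABLE)


dna = 'ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT'
print(complement_translate(dna))
```

Note one difference in behavior: `complement` above silently drops any character other than A, C, G or T, whereas `translate` leaves such characters unchanged.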

## Restriction fragment lengths

Given a string representing a DNA sequence that contains a recognition site for the EcoRI restriction enzyme, compute the lengths of the resulting fragments.  The motif is `G*AATTC` where `*` indicates the position of the cut.

In [12]:
dna = 'ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT'
motif = 'GAATTC'

In [13]:
cut_index = dna.index(motif) + 1
seq1, seq2 = dna[:cut_index], dna[cut_index:]
print('{0}*{1}'.format(seq1, seq2))

ACTGATCGATTACGTATAGTAG*AATTCTATCATACATATATATCGATGCGTTCAT


In [14]:
print(len(seq1), len(seq2))

22 33


Test whether the result is correct by concatenating `seq1` and `seq2`; the result should be equal to `dna`.

In [15]:
seq1 + seq2 == dna

True
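
The steps above could be wrapped in a function, along the lines of the following sketch. The function name `fragment_lengths` and the parameter `cut_offset` are assumptions made for this example. It handles only the first occurrence of the motif, as in the exercise.

```python
def fragment_lengths(seq, motif='GAATTC', cut_offset=1):
    """Return the lengths of the two fragments obtained by cutting seq
    at the first occurrence of motif (hypothetical helper).

    cut_offset is the number of motif characters before the cut,
    so 1 corresponds to G*AATTC.
    """
    # str.index raises a ValueError if the motif does not occur in seq.
    cut_index = seq.index(motif) + cut_offset
    return cut_index, len(seq) - cut_index


dna = 'ACTGATCGATTACGTATAGTAGAATTCTATCATACATATATATCGATGCGTTCAT'
print(fragment_lengths(dna))  # prints (22, 33)
```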

## Splicing out introns, part one

The given DNA sequence contains two exons, separated by an intron. The first exon runs from the start of the sequence to the 63rd character, while the second starts at the 91st character and contains the remainder of the sequence. Print the coding region only. To make it easier to check, print a `*` between the two exons.

In [16]:
dna = ('ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGAT'
       'CGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT')

In [17]:
exon1, exon2 = dna[:63], dna[90:]
print('{0}*{1}'.format(exon1, exon2))

ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGA*ATCATCGATCGATATCGATGCATCGACTACTAT
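
The exercise counts characters starting from 1, whereas Python slices count from 0 and exclude the end index. The sketch below spells out the conversion that leads to `dna[:63]` and `dna[90:]`. The variable names `exon1_end` and `exon2_start` are introduced here for illustration only.

```python
dna = ('ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGAT'
       'CGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT')

exon1_end = 63    # last character of the first exon, counting from 1
exon2_start = 91  # first character of the second exon, counting from 1

# A 1-based inclusive end is the same number as a 0-based exclusive end.
exon1 = dna[:exon1_end]
# A 1-based start has to be lowered by 1 to get the 0-based start.
exon2 = dna[exon2_start - 1:]

print(len(exon1), len(exon2))  # prints 63 33
```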


## Splicing out introns, part two

Given the previous DNA sequence, compute the fraction of the DNA that is coding.

In [18]:
dna = ('ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGAT'
       'CGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT')

In [19]:
exon1 = dna[:63]
exon2 = dna[90:]
print('{0:.4f}'.format((len(exon1) + len(exon2))/len(dna)))

0.7805


Again, try to guess what happens in the `print` function call.

## Splicing out introns, part three

Given the DNA sequence of the previous part, write out the DNA with the exons in upper case characters, and the intron in lower case.

In [20]:
dna = ('ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGAT'
       'CGATCGATCGATCGATCGATCATGCTATCATCGATCGATATCGATGCATCGACTACTAT')

Here, we use a Python feature: on the left-hand side of the assignment, three variables are listed (`exon1`, `intron`, `exon2`), and on the right-hand side three substrings of `dna` are computed. The first substring is assigned to the first variable, the second substring to the second variable, and, of course, the last substring to the third variable.

In [21]:
exon1, intron, exon2 = dna[:63], dna[63:90], dna[90:]
print('{0}{1}{2}'.format(exon1.upper(), intron.lower(), exon2.upper()))

ATCGATCGATCGATCGACTGACTAGTCATAGCTATGCATGTAGCTACTCGATCGATCGATCGAtcgatcgatcgatcgatcgatcatgctATCATCGATCGATATCGATGCATCGACTACTAT


Strictly speaking, the `upper` method doesn't need to be applied to `exon1` and `exon2` here, but doing so ensures that the correct result is produced even if the DNA sequence contains lower case symbols.
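
To see the difference, consider the following sketch with a short mixed case toy sequence. Both the sequence and the slice positions are made up for this example.

```python
dna = 'atcgATCGatcg'  # toy sequence, not taken from the exercises
exon1, intron, exon2 = dna[:4], dna[4:8], dna[8:]

# Without upper(), lower case input in the exons passes through unchanged.
without_upper = '{0}{1}{2}'.format(exon1, intron.lower(), exon2)
# With upper(), the exons are upper case regardless of the input.
with_upper = '{0}{1}{2}'.format(exon1.upper(), intron.lower(), exon2.upper())

print(without_upper)  # prints atcgatcgatcg
print(with_upper)     # prints ATCGatcgATCG
```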