# Python for biologists, chapter 6

Geert Jan Bex ([geertjan.bex@uhasselt.be](mailto:geertjan.bex@uhasselt.be))

Exercises taken from 'Python for biologists', Martin Jones.

## Several species

The file `data_6.csv` contains data on a number of genes.  Print the gene names for all genes belonging to *Drosophila melanogaster* and *Drosophila simulans*.

In [11]:
target_species = ['Drosophila melanogaster', 'Drosophila simulans']
data_filename = 'data_6.csv'

In [1]:
!head -5 data_6.csv

Drosophila melanogaster,atatatatatcgcgtatatatacgactatatgcattaattatagcatatcgatatatatatcgatattatatcgcattatacgcgcgtaattatatcgcgtaattacga,kdy647,264
Drosophila melanogaster,actgtgacgtgtactgtacgactatcgatacgtagtactgatcgctactgtaatgcatccatgctgacgtatctaagt,jdg766,185
Drosophila simulans,atcgatcatgtcgatcgatgatgcatccgactatcgtcgatcgtgatcgatcgatcgatcatcgatcgatgtcgatcatgtcgatatcgt,kdy533,485
Drosophila yakuba,cgcgcgctcgcgcatacggcctaatgcgcgcgctagcgatgc,hdt739,85
Drosophila ananassae,ttacgatcgatcgatcgatcgatcgtcgatcgtcgatgctacatcgatcatcatcggattagtcacatcgatcgatcatcgactgatcgtcgatcgtagatgctgacatcgatagca,hdu045,356


In [14]:
with open(data_filename, 'r') as data_file:
    for line in data_file:
        species, dna, gene_name, expr_level = line.rstrip().split(',')
        if species in target_species:
            print(gene_name)

kdy647
jdg766
kdy533
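
The same filter can also be written with the standard-library `csv` module, which takes care of splitting on commas, and a set for the species lookup. The sketch below is one possible variant, not part of the original solution; the `SAMPLE` rows and the helper name `genes_for_species` are illustrative assumptions, with sequences shortened for readability.

```python
import csv
import io

# Illustrative rows in the same column layout as data_6.csv
# (species, sequence, gene name, expression level); sequences are shortened.
SAMPLE = """\
Drosophila melanogaster,atatcg,kdy647,264
Drosophila simulans,atcgat,kdy533,485
Drosophila yakuba,cgcgcg,hdt739,85
"""


def genes_for_species(lines, target_species):
    """Return the gene names of all rows whose species is in target_species."""
    targets = set(target_species)  # a set gives fast membership tests
    names = []
    for species, dna, gene_name, expr_level in csv.reader(lines):
        if species in targets:
            names.append(gene_name)
    return names


wanted = ['Drosophila melanogaster', 'Drosophila simulans']
print(genes_for_species(io.StringIO(SAMPLE), wanted))  # ['kdy647', 'kdy533']
```

With the real file, an open file object can be passed in place of the `io.StringIO` wrapper.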


## Length range

The file `data_6.csv` contains data on a number of genes.  Print the gene names for all genes with sequence length between 90 and 110 base pairs.

In [15]:
with open(data_filename, 'r') as data_file:
    for line in data_file:
        species, dna, gene_name, expr_level = line.rstrip().split(',')
        dna_length = len(dna)
        if 90 <= dna_length <= 110:
            print(gene_name)

kdy647
kdy533
teg436
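
The solution treats "between 90 and 110" as inclusive at both ends. A minimal sketch of that range check as a reusable helper, using Python's chained comparison, makes the boundary behaviour easy to verify; the name `in_length_range` and its default bounds are assumptions made for illustration.

```python
def in_length_range(dna, low=90, high=110):
    """Return True if len(dna) lies in the inclusive range [low, high]."""
    return low <= len(dna) <= high


# Check the behaviour just outside and exactly on both boundaries.
print(in_length_range('a' * 89))   # False
print(in_length_range('a' * 90))   # True
print(in_length_range('a' * 110))  # True
print(in_length_range('a' * 111))  # False
```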


## AT content

The file `data_6.csv` contains data on a number of genes.  Print the gene names for all genes with AT content of less than 0.5, and an expression level larger than 200.

In [16]:
def at_content(dna):
    dna = dna.upper()
    return (dna.count('A') + dna.count('T'))/len(dna)

In [18]:
with open(data_filename, 'r') as data_file:
    for line in data_file:
        species, dna, gene_name, expr_level = line.rstrip().split(',')
        if at_content(dna) < 0.5 and 200 < int(expr_level):
            print(gene_name)

teg436
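
To see how the two conditions combine, it helps to evaluate them on inputs small enough to check by hand. In the sketch below, the record `ggccat`/`xyz000` is a made-up example, not a row from `data_6.csv`. Note that `expr_level` comes out of `split` as a string, so it has to be converted with `int` before the numeric comparison.

```python
def at_content(dna):
    """Fraction of bases in dna that are A or T (case-insensitive)."""
    dna = dna.upper()
    return (dna.count('A') + dna.count('T')) / len(dna)


print(at_content('atgc'))  # 0.5  (2 of 4 bases are A or T)
print(at_content('ggcc'))  # 0.0  (no A or T at all)

# Hypothetical record in the data_6.csv column layout.
line = 'Drosophila yakuba,ggccat,xyz000,250\n'
species, dna, gene_name, expr_level = line.rstrip().split(',')

# AT content is 2/6, which is below 0.5, and 250 is above 200.
selected = at_content(dna) < 0.5 and 200 < int(expr_level)
print(selected)  # True
```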


## Complex condition

The file `data_6.csv` contains data on a number of genes.  Print the names of all genes whose name starts with either 'k' or 'h', excluding genes that belong to *Drosophila melanogaster*.

In [20]:
with open(data_filename, 'r') as data_file:
    for line in data_file:
        species, dna, gene_name, expr_level = line.rstrip().split(',')
        if gene_name.startswith(('k', 'h')) and species != 'Drosophila melanogaster':
            print(gene_name)

kdy533
hdt739
hdu045
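
`str.startswith` accepts a tuple of prefixes, so the condition can be packaged as a small predicate and tried on individual records. This is a sketch only: the function name `is_selected` is an assumption, and the last species/gene pairing below is invented for illustration.

```python
def is_selected(species, gene_name):
    """True if the gene name starts with 'k' or 'h' and the
    species is not Drosophila melanogaster."""
    return (gene_name.startswith(('k', 'h'))
            and species != 'Drosophila melanogaster')


print(is_selected('Drosophila simulans', 'kdy533'))      # True
print(is_selected('Drosophila melanogaster', 'kdy647'))  # False: excluded species
print(is_selected('Drosophila yakuba', 'hdt739'))        # True
# Illustrative pairing: the prefix 't' matches neither 'k' nor 'h'.
print(is_selected('Drosophila yakuba', 'teg436'))        # False
```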


## High, low, medium

The file `data_6.csv` contains data on a number of genes.  Print for each gene its name, and whether its AT content is high (i.e., larger than 0.65), low (i.e., less than 0.45), or medium otherwise.

In [21]:
def at_content(dna):
    dna = dna.upper()
    return (dna.count('A') + dna.count('T'))/len(dna)

In [23]:
with open(data_filename, 'r') as data_file:
    for line in data_file:
        species, dna, gene_name, expr_level = line.rstrip().split(',')
        at = at_content(dna)
        print('{0}: '.format(gene_name), end='')
        if at < 0.45:
            print('low')
        elif at > 0.65:
            print('high')
        else:
            print('medium')

kdy647: high
jdg766: medium
kdy533: medium
hdt739: low
hdu045: medium
teg436: medium
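
The three-way classification can also be separated from the printing, which makes the thresholds easy to test on their own. The sketch below assumes a helper named `classify_at` with the two cut-offs as default arguments; both names are illustrative. Because the exercise uses strict inequalities, the boundary values 0.45 and 0.65 themselves count as medium.

```python
def classify_at(at, low=0.45, high=0.65):
    """Classify an AT content value as 'low', 'high' or 'medium'."""
    if at < low:
        return 'low'
    if at > high:
        return 'high'
    return 'medium'


print(classify_at(0.30))  # low
print(classify_at(0.45))  # medium (boundary is not below 0.45)
print(classify_at(0.65))  # medium (boundary is not above 0.65)
print(classify_at(0.70))  # high
```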
