#Classes and objects

Warning: long notebook ahead.

What types of data do we already know about in Python?

Built in types:
- string
- integer 
- float
- File object
- dict
- list
- set
- tuple

and some types that live in modules:
- re.Matchobject
- collections.defaultdict
- collections.Counter
- datetime.date

We know that they store different types of data and have different methods. 

###How does this all work?


####Everything is an object
####Each object is an instance of a specific class
####The class definition tells Python how instances of a class behave

First, a couple of useful functions. Use `type()` to get the type of an object:

In [1]:
type(4), type("hello"), type([4,5,6])

(int, str, list)

Use `dir()` to list all the attributes and methods of an object:

In [2]:
dir([4,5,6])

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__delslice__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getslice__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__setslice__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']
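
The names wrapped in double underscores are special methods that Python calls behind the scenes; the plain names at the end are the ones we normally call ourselves. As a small sketch (the variable name `public_names` is only for illustration), we can filter the `dir()` output down to the plain names and check which class an object belongs to:

```python
# Keep only the attribute names that do not start with an underscore.
public_names = [name for name in dir([4, 5, 6]) if not name.startswith('_')]
print(public_names)

# Every object is an instance of some class; isinstance() checks which one.
print(isinstance([4, 5, 6], list))
print(isinstance("hello", list))
```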

Some classes (mostly built-in ones) are defined in the Python interpreter. Others are written in Python, e.g. [datetime.date](https://hg.python.org/cpython/file/3.5/Lib/datetime.py#l641).

Let's approach the idea of writing our own classes slowly...

Here is some code to print the AT content and the complement of a DNA sequence:

In [None]:
from __future__ import division
my_dna = "ACTGATCGTTACGTACGAGTCAT" 

# print the AT content
length = len(my_dna) 
a_count = my_dna.count('A') 
t_count = my_dna.count('T') 
at_content = (a_count + t_count) / length 
print(at_content)

# print the complement
replacement1 = my_dna.replace('A', 't') 
replacement2 = replacement1.replace('T', 'a') 
replacement3 = replacement2.replace('C', 'g') 
replacement4 = replacement3.replace('G', 'c') 
print(replacement4.upper())

Obviously, these bits of code are good candidates for being turned into functions:

In [None]:
def get_AT(my_dna): 
    length = len(my_dna) 
    a_count = my_dna.count('A') 
    t_count = my_dna.count('T') 
    at_content = (a_count + t_count) / length 
    return at_content 
 
def complement(my_dna): 
    replacement1 = my_dna.replace('A', 't') 
    replacement2 = replacement1.replace('T', 'a') 
    replacement3 = replacement2.replace('C', 'g') 
    replacement4 = replacement3.replace('G', 'c') 
    return replacement4.upper() 

which allows the main code to be much simpler:

In [None]:
dna_sequence = "ACTGATCGTTACGTACGAGTCAT" 
print(get_AT(dna_sequence)) 
print(complement(dna_sequence)) 

Now what if we want to add some metadata to our sequence: a species name and a gene name:

In [None]:
dna_sequence = "ACTGATCGTTACGTACGAGTCAT" 
species = "Drosophila melanogaster"
gene_name = "ABC1"
print("Looking at the " + species + " " + gene_name + " gene")
print("AT content is " + str(get_AT(dna_sequence))) 
print("complement is " + complement(dna_sequence)) 

Fine for a single record, but tricky to scale. We could resort to complex data structures as we saw before, but let's try another way: create a new class to store DNA sequences.

A DNA sequence class needs to have variables:
- a sequence
- a gene name
- a species name

and some methods:
- complement()
- get_AT()

Let's dive in. Here's attempt number one:

In [None]:
class DNARecord(object): 
    sequence = 'ACGTAGCTGACGATC'
    gene_name = 'ABC1'
    species_name = 'Drosophila melanogaster'

    def complement(self): 
        replacement1 = self.sequence.replace('A', 't') 
        replacement2 = replacement1.replace('T', 'a') 
        replacement3 = replacement2.replace('C', 'g') 
        replacement4 = replacement3.replace('G', 'c') 
        return replacement4.upper() 

    def get_AT(self): 
        length = len(self.sequence) 
        a_count = self.sequence.count('A') 
        t_count = self.sequence.count('T') 
        at_content = (a_count + t_count) / length 
        return at_content 

Looks OK: we can create a new instance, access its variables using a dot, and call its methods using a dot as well:

In [None]:
d = DNARecord()
print('Created a record for ' + d.gene_name + ' from ' + d.species_name)
print('AT is ' + str(d.get_AT()))
print('complement is ' + d.complement())

One major problem: all the variables are set as part of the class definition, so every object we create will have the same sequence, etc.
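
To see the problem concretely, here is a minimal sketch. `DemoRecord` is a hypothetical cut-down stand-in for `DNARecord` (so that running it does not replace the class defined above), and the names `first` and `second` are only for illustration:

```python
class DemoRecord(object):
    # class-level variables: every instance sees these same values
    sequence = 'ACGTAGCTGACGATC'
    gene_name = 'ABC1'
    species_name = 'Drosophila melanogaster'

first = DemoRecord()
second = DemoRecord()

# Two separate objects, but identical data.
print(first is second)
print(first.sequence == second.sequence)
print(first.gene_name, second.gene_name)
```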

We can change the object variables after creation just like any variable:

In [None]:
d1 = DNARecord() 
d1.sequence = 'ATATATTATTATATTATA' 
d1.gene_name = 'COX1' 
d1.species_name = 'Homo sapiens' 
 
d2 = DNARecord() 
d2.sequence = 'CGGCGGCGCGGCGCGGCG' 
d2.gene_name = 'ATP6' 
d2.species_name = 'Gorilla gorilla' 
 
for r in [d1, d2]: 
    print('Created ' + r.gene_name + ' from ' + r.species_name) 
    print('AT is ' + str(r.get_AT())) 
    print('complement is ' + r.complement())

This works perfectly well, but is a bit awkward. Why don't we add a method whose job is to set all the variables in one go:

In [None]:
class DNARecord(object): 
    sequence = 'ACGTAGCTGACGATC'
    gene_name = 'ABC1'
    species_name = 'Drosophila melanogaster'

    def complement(self): 
        replacement1 = self.sequence.replace('A', 't') 
        replacement2 = replacement1.replace('T', 'a') 
        replacement3 = replacement2.replace('C', 'g') 
        replacement4 = replacement3.replace('G', 'c') 
        return replacement4.upper() 

    def get_AT(self): 
        length = len(self.sequence) 
        a_count = self.sequence.count('A') 
        t_count = self.sequence.count('T') 
        at_content = (a_count + t_count) / length 
        return at_content 
    
    def set_variables(self, new_seq, new_gene_name, new_species_name): 
        self.sequence = new_seq 
        self.gene_name = new_gene_name 
        self.species_name = new_species_name     

Now we can do this:

In [None]:
d1 = DNARecord() 
d1.set_variables('ATATATTATTATATTATA','COX1','Homo sapiens') 

Since we can set the variables easily, we don't need them in the class definition any more:

In [None]:
class DNARecord(object): 
    
    def complement(self): 
        replacement1 = self.sequence.replace('A', 't') 
        replacement2 = replacement1.replace('T', 'a') 
        replacement3 = replacement2.replace('C', 'g') 
        replacement4 = replacement3.replace('G', 'c') 
        return replacement4.upper() 

    def get_AT(self): 
        length = len(self.sequence) 
        a_count = self.sequence.count('A') 
        t_count = self.sequence.count('T') 
        at_content = (a_count + t_count) / length 
        return at_content 
    
    def set_variables(self, new_seq, new_gene_name, new_species_name): 
        self.sequence = new_seq 
        self.gene_name = new_gene_name 
        self.species_name = new_species_name     
        
d1 = DNARecord() 
d1.set_variables('ATATATTATTATATTATA','COX1','Homo sapiens') 

This works fine until we forget to set the variables:

In [None]:
d1 = DNARecord() 
print(d1.complement())
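
The cell above fails with an `AttributeError`, because `self.sequence` does not exist until `set_variables()` has been called. A minimal sketch of the same failure, caught so that we can look at it; `UnsetRecord` is a hypothetical cut-down stand-in for `DNARecord`:

```python
class UnsetRecord(object):
    # hypothetical cut-down stand-in for DNARecord

    def get_AT(self):
        length = len(self.sequence)
        return (self.sequence.count('A') + self.sequence.count('T')) / length

    def set_variables(self, new_seq):
        self.sequence = new_seq

record = UnsetRecord()
error_name = None
try:
    record.get_AT()  # set_variables() was never called
except AttributeError as err:
    error_name = type(err).__name__
    print(error_name + ': ' + str(err))

# Once the variable is set, the same call succeeds.
record.set_variables('ATATGC')
print(record.get_AT())
```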

We need a **constructor**: a special method whose job is to create an object and set up the variables. The constructor has a special name, `__init__()`:

In [None]:
class DNARecord(object): 
    
    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name
        
    def complement(self): 
        replacement1 = self.sequence.replace('A', 't') 
        replacement2 = replacement1.replace('T', 'a') 
        replacement3 = replacement2.replace('C', 'g') 
        replacement4 = replacement3.replace('G', 'c') 
        return replacement4.upper() 

    def get_AT(self): 
        length = len(self.sequence) 
        a_count = self.sequence.count('A') 
        t_count = self.sequence.count('T') 
        at_content = (a_count + t_count) / length 
        return at_content         

And we can call it just using the class name:

In [None]:
d1 = DNARecord('ATATATTATTATATTATA', 'COX1', 'Homo sapiens')
print(d1.complement())

Now Python will helpfully prevent us from creating a `DNARecord` object without variables:

In [None]:
d1 = DNARecord() 
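
A minimal sketch of what that refusal looks like, using a hypothetical `StrictRecord` class with the same constructor signature. The exact wording of the error message varies between Python versions, so we just print whatever we get:

```python
class StrictRecord(object):
    # hypothetical stand-in for DNARecord with the same constructor

    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name

message = ''
try:
    StrictRecord()  # no arguments supplied
except TypeError as err:
    message = str(err)
    print('TypeError: ' + message)
```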

###Pause for a moment to consider the difference between non-object (let's call it *imperative*) code and OO code

Remember that the actual code to do the calculations is the same for both.

In imperative code, we pass data to functions in order to get the answers we want:

In [None]:
# imperative code
dna_sequence = "ACTGATCGTTACGTACGAGT" 
species = "Drosophila melanogaster"
gene_name = "ABC1"
print("Looking at the " + species + " " + gene_name + " gene")
print("AT content is " + str(get_AT(dna_sequence))) 
print("complement is " + complement(dna_sequence))

In OO code, we ask the object for the answer we want, and the object is responsible for figuring out how to calculate it:

In [None]:
# object-oriented code
d = DNARecord("ACTGATCGTTACGTACGAGT", "ABC1", "Drosophila melanogaster")
print("Looking at the " + d.species_name + " " + d.gene_name + " gene")
print("AT content is " + str(d.get_AT())) 
print("complement is " + d.complement())

To illustrate the difference in one line:

In [None]:
assert get_AT(d.sequence) == d.get_AT()

Because objects store data along with methods, they always have access to their own internal state, so when we call `d.get_AT()` we don't have to worry about telling Python **which** DNA sequence we want the AT content for - it's the one that belongs to the object. 

This allows us to do things that would be awkward with functions. For example, say we want to write a function to generate a FASTA format record for a sequence, where the header contains the gene and species name. These are three separate bits of information, but if we add a method to the class like this:

In [None]:
class DNARecord(object): 
    
    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name
        
    def complement(self): 
        replacement1 = self.sequence.replace('A', 't') 
        replacement2 = replacement1.replace('T', 'a') 
        replacement3 = replacement2.replace('C', 'g') 
        replacement4 = replacement3.replace('G', 'c') 
        return replacement4.upper() 

    def get_AT(self): 
        length = len(self.sequence) 
        a_count = self.sequence.count('A') 
        t_count = self.sequence.count('T') 
        at_content = (a_count + t_count) / length 
        return at_content       
    
    def get_fasta(self): 
        safe_species_name = self.species_name.replace(' ','_')
        header = '>' + self.gene_name + '_' + safe_species_name
        return header + '\n' + self.sequence + '\n' 

It becomes straightforward to call:

In [None]:
d1 = DNARecord('ATATATTATTATATTATA', 'COX1', 'Homo sapiens')
print(d1.get_fasta())
d1.get_fasta()

and we can write fairly complex programs with very little code:

In [None]:
# given a list of DNARecord objects in my_dna_records
# the with block closes the file for us when the loop has finished
with open("high_at.fasta", "w") as output:
    for d in my_dna_records:
        if d.get_AT() > 0.6:
            output.write(d.get_fasta())
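
For a version that can be run on its own, here is a sketch with two made-up records. `MiniRecord` is a hypothetical cut-down `DNARecord`, `example_records` is an illustrative list, and an in-memory `io.StringIO` buffer stands in for the output file so that nothing is written to disk:

```python
import io

class MiniRecord(object):
    # hypothetical cut-down DNARecord with only what this example needs

    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name

    def get_AT(self):
        at_count = self.sequence.count('A') + self.sequence.count('T')
        return at_count / len(self.sequence)

    def get_fasta(self):
        safe_species_name = self.species_name.replace(' ', '_')
        header = '>' + self.gene_name + '_' + safe_species_name
        return header + '\n' + self.sequence + '\n'

example_records = [
    MiniRecord('ATATATTATTATATTATA', 'COX1', 'Homo sapiens'),
    MiniRecord('CGGCGGCGCGGCGCGGCG', 'ATP6', 'Gorilla gorilla'),
]

# io.StringIO behaves like an open text file but keeps everything in memory
buffer = io.StringIO()
for record in example_records:
    if record.get_AT() > 0.6:
        buffer.write(record.get_fasta())

print(buffer.getvalue())
```

Only the AT-rich COX1 record passes the filter; the loop never needs to know how AT content or FASTA formatting is computed.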

Now that we've created the DNARecord class, we can pass DNARecord objects around just like any other data type. E.g. here's a function that takes a DNARecord object as its argument and returns a protein translation:

In [None]:
def translate_dna(dna_record): 
    gencode = { 
	    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 
	    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T', 
	    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 
	    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R', 
	    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 
	    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 
	    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 
	    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R', 
	    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 
	    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A', 
	    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 
	    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 
	    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 
	    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L', 
	    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 
	    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'} 
    last_codon_start = len(dna_record.sequence) - 2 
    protein = "" 
    for start in range(0,last_codon_start,3): 
        codon = dna_record.sequence[start:start+3] 
        aa = gencode.get(codon.upper(), 'X') 
        protein = protein + aa 
    return protein

translate_dna(d)

We could probably turn this into a method of DNARecord objects if we wanted to. 
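
One way this might look, as a sketch only: `TranslatableRecord` is a hypothetical stand-in for `DNARecord`, and its codon table is deliberately cut down to four codons to keep the example short (a real method would use the full table from `translate_dna()`). Keeping the table at class level is fine here because it is a constant shared by all records, not per-record data:

```python
class TranslatableRecord(object):
    # hypothetical stand-in for DNARecord

    # Deliberately incomplete codon table, just enough for the demo below.
    gencode = {'ATG': 'M', 'GCA': 'A', 'TTT': 'F', 'TAA': '_'}

    def __init__(self, sequence, gene_name, species_name):
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name

    def translate(self):
        # same loop as translate_dna(), but reading self.sequence
        last_codon_start = len(self.sequence) - 2
        protein = ""
        for start in range(0, last_codon_start, 3):
            codon = self.sequence[start:start + 3]
            protein = protein + self.gencode.get(codon.upper(), 'X')
        return protein

demo = TranslatableRecord('ATGGCATTTTAA', 'ABC1', 'Drosophila melanogaster')
print(demo.translate())
```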

---

##Exercises

This exercise is a bit different - we have just one for the whole day. This is because we can't really understand how OO code works with just a small example. 

We are going to build a (bad!) population genetics simulation:
- 3 loci
- each with 2 alleles
- in a population of 100 diploid individuals
- multiplicative fitness 
- individuals die in each generation with probability inversely proportional to fitness
- new individuals are created with alleles drawn randomly from the population
- allele frequencies are logged over time

![alleles](alleles.png)

This sounds like a lot, but by the end of the day we will have built this together. 

We will tackle it in steps and pause to discuss each step.

Don't think too hard about the shortcomings of the simulation!



###Step one: design

Sketch out (paper or plain text is fine) designs for the classes we will need:
- `Allele`
- `Locus`
- `Individual`
- `Population`

What methods and variables do they need?
Which objects contain other objects?

Write minimal class definitions for the four classes - just variables and constructors.




###Step two: create some objects

From now on we will use initial caps when talking about classes and objects. 

Write code to create two `Alleles`.
Write code to create a `Locus` with the two `Alleles`.
Write code to create two more `Loci`, each with two `Alleles`.
Write code to create an `Individual` with these three `Loci`.
Write code to create a `Population` with two `Individuals`.



###Step three: randomization

Add a method to the `Locus` class to return a random `Allele` (hint: import random)
 
Write a function that takes a list of `Loci` and returns an `Individual` with randomly picked `Alleles` (this could be a constructor in the `Individual` class).
 
Write a function that takes a size and list of Loci and returns a `Population` of `Individuals` with randomly picked `Alleles` (this could be a constructor in the `Population` class).

Create a `Population` with one hundred `Individuals`. 
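
As a hint for the randomization step, here is a small sketch of `random.choice()`, which picks one item from a list. The list holds plain strings standing in for `Allele` objects, and the names `allele_names`, `picked` and `draws` are only for illustration:

```python
import random

allele_names = ['A', 'a']  # plain strings standing in for Allele objects

# Pick a single item at random.
picked = random.choice(allele_names)
print(picked)

# Drawing repeatedly gives a mix of the available values.
draws = [random.choice(allele_names) for _ in range(10)]
print(draws)
```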

###Step four: measuring `Individuals`

Add a method to the `Individual` class which returns the genotype as a string (e.g. AaBBcc).
 
Add a method to the `Population` class which prints the genotypes for all `Individuals`. Run it on your test `Population`.
 
Add a method to the `Individual` class which returns the fitness (i.e. the product of all `Allele` fitnesses).
 
Add a method to the `Population` class which prints the genotype and fitness of all `Individuals`.
 
Run it on your test `Population`. Do the results make sense, i.e. do individuals with more "better" alleles have higher fitness? (AaBbCc > aabbcc)

###Step five: measuring allele frequencies

Add a method to the `Population` class which takes an `Allele` and returns the frequency in the `Population`.
 
Add a method to the `Population` class which prints out the names and frequencies for all `Alleles` at all `Loci`.
 
How does the `Population` know what the list of alleles is?

###Step six: simulating death

Add a method to the `Population` class which randomly kills `Individuals` with a probability proportional to the inverse of their fitness (hint: lists have a `.remove()` method)
 
How will we know if it's working? The size of the population should go down in each generation (because we haven't implemented birth yet).
 
Run the simulation for 10 generations and monitor the population size. 
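
One thing to watch out for when using `.remove()`: removing items from a list while looping over that same list can cause items to be skipped. A common workaround, sketched here with made-up numbers rather than real `Individuals`, is to loop over a copy of the list:

```python
import random

random.seed(1)  # fixed seed so repeated runs behave the same

# made-up survival chances standing in for Individuals
survival_chances = [0.9, 0.5, 0.1, 0.8, 0.3]
starting_size = len(survival_chances)

# Loop over a copy (survival_chances[:]) so that removing items
# from the original list does not disturb the iteration.
for chance in survival_chances[:]:
    if random.random() > chance:
        survival_chances.remove(chance)

print(str(len(survival_chances)) + ' of ' + str(starting_size) + ' survived')
```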

###Step seven: simulating birth

Add a method to the `Population` class to add a single new `Individual` using `Alleles` picked from the current list of `Individuals`.
 
Add a method to the `Population` class to repeatedly add `Individuals` until the population size is back up to 100.
 
Add a method to the `Population` class which simulates a single generation (death then birth).

Run the simulation for 10 generations and monitor the population size again.

###Step eight: logging allele frequencies

Add a method to the `Population` class to write a CSV header line to an output file:

`generation, A, a, B, b, C, c`
 
Add a method to the `Population` class to write a CSV row for the current state:

`56, 0.6, 0.4, 0.7, 0.3, 0.5, 0.5`

This should be called every generation.
 
Run the simulation for 100 generations.
 
Open the output file in a text editor.
Open the output file in a spreadsheet package. Plot the allele frequencies vs. generation.
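
If you would rather not build the CSV lines by hand, the standard library `csv` module can do the formatting. A minimal sketch using the example row above, with an in-memory `io.StringIO` buffer standing in for the output file; note that `csv.writer` separates values with a comma and no following space:

```python
import csv
import io

log = io.StringIO()  # stands in for an open output file
writer = csv.writer(log)

# header line, then one row per generation
writer.writerow(['generation', 'A', 'a', 'B', 'b', 'C', 'c'])
writer.writerow([56, 0.6, 0.4, 0.7, 0.3, 0.5, 0.5])

print(log.getvalue())
```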

---

##More object-oriented features if we have time

Let's create a `ProteinRecord` class to keep our `DNARecord` class company. What does it need?

Variables:
- sequence
- gene name
- species name

Methods:
- ~~complement~~
- ~~get_AT()~~
- get_fasta()
- **get_hydrophobic()**

Here we go:

In [None]:
class ProteinRecord(object): 

    def __init__(self, sequence, gene_name, species_name): 
        self.sequence = sequence 
        self.gene_name = gene_name 
        self.species_name = species_name 

    def get_fasta(self): 
        safe_species_name = self.species_name.replace(' ','_') 
        header = '>' + self.gene_name + '_' + safe_species_name 
        return header + '\n' + self.sequence + '\n' 

    def get_hydrophobic(self): 
        aa_list=['A','I','L','M','F','W','Y','V'] 
        protein_length = len(self.sequence) 
        total = 0 
        for aa in aa_list: 
            aa = aa.upper() 
            aa_count = self.sequence.count(aa) 
            total = total + aa_count 
        percentage = total * 100 / protein_length 
        return percentage
    
p1 = ProteinRecord('MSRSLLLRFLLFLLLLPPLP', 'COX1', 'Homo sapiens') 
print(p1.get_fasta()) 
print(str(p1.get_hydrophobic()))

Everything works, but I am not quite happy; the `get_fasta()` code is duplicated in two places. We could turn it back into a function but that would break the encapsulation of objects. 

The solution is **inheritance**. We will create a generic `SequenceRecord` class which has the shared code (constructor and `get_fasta()`):

In [None]:
class SequenceRecord(object): 

    def __init__(self, sequence, gene_name, species_name): 
        self.sequence = sequence 
        self.gene_name = gene_name 
        self.species_name = species_name 

    def get_fasta(self): 
        safe_species_name = self.species_name.replace(' ','_') 
        header = '>' + self.gene_name + '_' + safe_species_name 
        return header + '\n' + self.sequence + '\n' 

Both `DNARecord` and `ProteinRecord` will inherit from `SequenceRecord`, which means that they will have access to the constructor and `get_fasta()`:

In [None]:
class ProteinRecord(SequenceRecord): 
    
    def get_hydrophobic(self): 
        aa_list=['A','I','L','M','F','W','Y','V'] 
        protein_length = len(self.sequence) 
        total = 0 
        for aa in aa_list: 
            aa = aa.upper() 
            aa_count = self.sequence.count(aa) 
            total = total + aa_count 
        return total * 100 / protein_length  

class DNARecord(SequenceRecord): 

    def complement(self): 
        replacement1 = self.sequence.replace('A', 't') 
        replacement2 = replacement1.replace('T', 'a') 
        replacement3 = replacement2.replace('C', 'g') 
        replacement4 = replacement3.replace('G', 'c') 
        return replacement4.upper() 

    def get_AT(self): 
        length = len(self.sequence) 
        a_count = self.sequence.count('A') 
        t_count = self.sequence.count('T') 
        return (a_count + t_count) / length 

Now we can happily write code like this:

In [None]:
p1 = ProteinRecord('MSRSLLLRFLLFLLLLPPLP', 'COX1', 'Homo sapiens') 
print(p1.get_fasta()) 
print(p1.get_hydrophobic()) 
 
d1 = DNARecord('ACCACACGATCGATCGAG', 'COX1', 'Homo sapiens') 
print(d1.get_fasta()) 
print(d1.complement()) 
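
To check what inheritance is doing, we can ask Python about the relationship between classes. This is a sketch with hypothetical toy classes: `BaseRecord` plays the role of `SequenceRecord` and `ChildRecord` plays the role of `DNARecord`:

```python
class BaseRecord(object):  # plays the role of SequenceRecord
    def __init__(self, sequence):
        self.sequence = sequence

    def get_length(self):
        return len(self.sequence)

class ChildRecord(BaseRecord):  # plays the role of DNARecord
    pass

child = ChildRecord('ACGT')

# The constructor and get_length() are both found on BaseRecord.
print(child.get_length())

# A ChildRecord object is also a BaseRecord object.
print(isinstance(child, BaseRecord))
print(issubclass(ChildRecord, BaseRecord))

# The order in which Python searches classes when looking up a method.
print(ChildRecord.__mro__)
```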

Just for fun, here's a version of our protein translation function which takes a `DNARecord` object and returns a `ProteinRecord` object:

In [None]:
def translate_dna(dna_record): 
    gencode = { 
	    'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 
	    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T', 
	    'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 
	    'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R', 
	    'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 
	    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 
	    'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 
	    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R', 
	    'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 
	    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A', 
	    'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 
	    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 
	    'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 
	    'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L', 
	    'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 
	    'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'} 
    last_codon_start = len(dna_record.sequence) - 2 
    protein = "" 
    for start in range(0,last_codon_start,3): 
        codon = dna_record.sequence[start:start+3] 
        aa = gencode.get(codon.upper(), 'X') 
        protein = protein + aa 
    
    # gather the information to create the protein record
    protein_name = dna_record.gene_name
    protein_species = dna_record.species_name

    # create the protein record and return it
    protein_record = ProteinRecord(protein,protein_name,protein_species) 
    return protein_record

translate_dna(d1).get_hydrophobic()

Now let's make life even more complicated and add a genetic code variable to `DNARecord`, but not `ProteinRecord`. The genetic code variable can't go in `SequenceRecord`, so it has to go in `DNARecord`, which means that `DNARecord` now needs to have its own constructor:

In [None]:
class DNARecord(SequenceRecord): 
    
    def __init__(self, sequence, gene_name, species_name, genetic_code): 
        self.sequence = sequence 
        self.gene_name = gene_name 
        self.species_name = species_name 
        self.genetic_code = genetic_code 

    def complement(self): 
        replacement1 = self.sequence.replace('A', 't') 
        replacement2 = replacement1.replace('T', 'a') 
        replacement3 = replacement2.replace('C', 'g') 
        replacement4 = replacement3.replace('G', 'c') 
        return replacement4.upper() 

    def get_AT(self): 
        length = len(self.sequence) 
        a_count = self.sequence.count('A') 
        t_count = self.sequence.count('T') 
        return (a_count + t_count) / length 

d1 = DNARecord('ATCGCGTACGTGATCGTAG', 'COX1', 'Homo sapiens', 5) 
print(d1.get_fasta()) 
print(d1.complement()) 
print('genetic code is ' + str(d1.genetic_code))

We say that the constructor in `DNARecord` **overrides** the one in `SequenceRecord`.

Let's say we want to add a bit of error-checking to the constructors. Before creating an object, we will check that the species name looks correct. We can add error checking to the `SequenceRecord` constructor (don't worry if you haven't seen `raise` before):

In [None]:
import re

class SequenceRecord(object): 

    def __init__(self, sequence, gene_name, species_name): 
        if not re.match(r'[A-Z][a-z]+ [a-z]+', species_name): 
            raise ValueError(species_name + ' is not a valid species name!')
        self.sequence = sequence 
        self.gene_name = gene_name 
        self.species_name = species_name 

    def get_fasta(self): 
        safe_species_name = self.species_name.replace(' ','_') 
        header = '>' + self.gene_name + '_' + safe_species_name 
        return header + '\n' + self.sequence + '\n' 

class ProteinRecord(SequenceRecord): 
    
    def get_hydrophobic(self): 
        aa_list=['A','I','L','M','F','W','Y','V'] 
        protein_length = len(self.sequence) 
        total = 0 
        for aa in aa_list: 
            aa = aa.upper() 
            aa_count = self.sequence.count(aa) 
            total = total + aa_count 
        return total * 100 / protein_length  
    
    
p1 = ProteinRecord('MSRSLLLRFLLFLLLLPPLP', 'COX1', 'human') 


Problem: because `DNARecord` has its own constructor, it can't take advantage of this error checking:

In [None]:
d1 = DNARecord('ATCGCGTACGTGATCGTAG', 'COX1', 'human', 5) 
d1

So we have to add the error-checking code separately to the `DNARecord` constructor:

In [None]:
class DNARecord(SequenceRecord): 
    
    def __init__(self, sequence, gene_name, species_name, genetic_code): 
        if not re.match(r'[A-Z][a-z]+ [a-z]+', species_name): 
            raise ValueError(species_name + ' is not a valid species name!')
        self.sequence = sequence 
        self.gene_name = gene_name 
        self.species_name = species_name 
        self.genetic_code = genetic_code 

    def complement(self): 
        replacement1 = self.sequence.replace('A', 't') 
        replacement2 = replacement1.replace('T', 'a') 
        replacement3 = replacement2.replace('C', 'g') 
        replacement4 = replacement3.replace('G', 'c') 
        return replacement4.upper() 

    def get_AT(self): 
        length = len(self.sequence) 
        a_count = self.sequence.count('A') 
        t_count = self.sequence.count('T') 
        return (a_count + t_count) / length 
    
d1 = DNARecord('ATCGCGTACGTGATCGTAG', 'COX1', 'human', 5) 

Just like with `get_fasta()`, this is a problem - we have the same code duplicated in two places. The solution this time is to have the `DNARecord` constructor call the `SequenceRecord` constructor first, then set the genetic code variable:

In [None]:
class SequenceRecord(object): 

    def __init__(self, sequence, gene_name, species_name): 
        if not re.match(r'[A-Z][a-z]+ [a-z]+', species_name): 
            raise ValueError(species_name + ' is not a valid species name!')
        self.sequence = sequence 
        self.gene_name = gene_name 
        self.species_name = species_name     
        
class DNARecord(SequenceRecord): 
    
    def __init__(self, sequence, gene_name, species_name, genetic_code): 
        # first call the SequenceRecord constructor to check the species name
        SequenceRecord.__init__(self, sequence, gene_name, species_name) 
        # now set the genetic code 
        self.genetic_code = genetic_code 
        

This is called **calling a method in the superclass**. We can do it for any method, not just constructors. It's useful when we want our classes to have some common behaviour (checking species name; setting sequence, gene name and species name) and some specialized behaviour (setting genetic code).
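
In Python 3 the same call is usually written with `super()`, which looks up the superclass for us and passes `self` automatically. A sketch, assuming Python 3; `CheckedRecord` and `CheckedDNARecord` are hypothetical names standing in for `SequenceRecord` and `DNARecord`:

```python
import re

class CheckedRecord(object):  # plays the role of SequenceRecord
    def __init__(self, sequence, gene_name, species_name):
        if not re.match(r'[A-Z][a-z]+ [a-z]+', species_name):
            raise ValueError(species_name + ' is not a valid species name!')
        self.sequence = sequence
        self.gene_name = gene_name
        self.species_name = species_name

class CheckedDNARecord(CheckedRecord):  # plays the role of DNARecord
    def __init__(self, sequence, gene_name, species_name, genetic_code):
        # run the superclass constructor first (species check, shared variables)
        super().__init__(sequence, gene_name, species_name)
        # then do the specialized part
        self.genetic_code = genetic_code

good = CheckedDNARecord('ATCGCGTACGTGATCGTAG', 'COX1', 'Homo sapiens', 5)
print(good.species_name, good.genetic_code)

# The species check in the superclass now protects the subclass too.
rejected = False
try:
    CheckedDNARecord('ATCGCGTACGTGATCGTAG', 'COX1', 'human', 5)
except ValueError:
    rejected = True
print(rejected)
```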

Final complication: we want to have methods with the same name but different behaviour:

In [None]:
class ProteinRecord(SequenceRecord): 
    
    def get_protein_length(self): 
        return len(self.sequence) 
    
class DNARecord(SequenceRecord): 

    def get_protein_length(self): 
        return len(self.sequence) // 3  # whole codons only, so integer division
    

This is fine and is called **polymorphism**. It allows us to do things like this:

In [None]:
for my_record in list_of_records:
    # we don't care whether it's a DNA or protein record
    if my_record.get_protein_length() > 100:
        pass  # do something with the record
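
Here is a version that can be run on its own, as a sketch: `ProteinLike` and `DNALike` are hypothetical cut-down classes, and `mixed_records` is a made-up list containing one of each. Integer division (`//`) is used for the DNA case so that both methods return whole numbers:

```python
class ProteinLike(object):
    def __init__(self, sequence):
        self.sequence = sequence

    def get_protein_length(self):
        return len(self.sequence)

class DNALike(object):
    def __init__(self, sequence):
        self.sequence = sequence

    def get_protein_length(self):
        # three bases per amino acid
        return len(self.sequence) // 3

mixed_records = [ProteinLike('MSRSLLLRFL'), DNALike('ATGGCATTTTAA')]

# The same method call works on both kinds of object.
lengths = [record.get_protein_length() for record in mixed_records]
print(lengths)
```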

We have encountered this many times before without realizing it:

In [94]:
print(3 + 4)
print('abc' + 'def')

print(len('hello'))
print(len([2,4,6,8]))

for i in 'python':
    print(i)
    
for i in [2,4,6,8]:
    print(i)
    


7
abcdef
5
4
p
y
t
h
o
n
2
4
6
8
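
Built-in functions like `len()` work on many types because each type supplies its own special method (the `__len__` we saw in the `dir()` output earlier). Our own classes can take part in the same way. A minimal sketch, with `SizedRecord` as a hypothetical example class:

```python
class SizedRecord(object):
    # hypothetical example class

    def __init__(self, sequence):
        self.sequence = sequence

    def __len__(self):
        # called by Python whenever len() is applied to a SizedRecord
        return len(self.sequence)

sized = SizedRecord('ACGTAC')
print(len(sized))
print(len('hello'))
```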
