# Python Tutorial - Part 4

---
<a id='oo_python'></a>
## Object Oriented Python

---
The Python you have written so far has been procedural: a sequence of steps that takes input data, processes it and produces output data. Object oriented (OO) programming is organised around objects rather than actions, and around data rather than logic. It takes the approach that what matters are the objects we want to manipulate rather than the logic required to manipulate them. Objects can range from human beings (described by name, address etc.) to buildings, GUIs and apps. All of these have features that can be described and managed.
OO programming is not suitable for all tasks, but used appropriately it can model the task being undertaken more closely.

Python does not force you to write object oriented code, as Java does, but even procedural Python relies on the object oriented aspects of the language. In an object oriented language an object is an instance of a class and can access the methods (functions) defined in that class.

Python lists, dictionaries and strings are all objects. For this reason they have their own methods that can be called:

    String:	          	line.rstrip()			Strip trailing whitespace
    List:	          	demolist.index(42)		Return the index of the first element equal to 42
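
As a minimal sketch of these two calls, the snippet below uses made-up values for `line` and `demolist` (the values are assumptions chosen only for illustration):

```python
# Strings and lists are objects, so their methods are called with dot notation.
line = "ATGC  \n"               # illustrative value
stripped = line.rstrip()        # removes trailing whitespace, including the newline
print(repr(stripped))

demolist = [7, 42, 13]          # illustrative value
position = demolist.index(42)   # position of the first element equal to 42
print(position)
```

Note that `index` raises a `ValueError` if the value is not in the list.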

---
<a id='defining_class'></a>
## Defining a Class

A class is defined by the keyword “class”, followed by the class name and a colon. All code within the class is then indented.

In Python 2 the class definition should always inherit "object", which means the declaration should be:

	class Name(object):

In Python 3 this is not necessary and there are 3 equally valid methods to declare the class:

	class Name(object):
	class Name():
	class Name:

This tutorial will use the last option but all can be used.

---
<a id='the_self'></a>
## The self

Methods in a class differ from ordinary functions in one specific way: they must have an extra first parameter at the beginning of the parameter list. You do not give a value for this parameter when you call the method; Python provides it. This parameter refers to the object itself and, by convention, is given the name self. 
 
Although you can give this parameter any name, it is strongly recommended that you use the name self.
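
The following sketch shows what Python does with self behind the scenes. The `Greeter` class and its `greet` method are hypothetical names used only for this illustration:

```python
class Greeter:
    def greet(self, name):
        # self is the instance the method was called on
        return 'Hello %s' % name

g = Greeter()
via_instance = g.greet('Stan')          # Python passes g as self
via_class = Greeter.greet(g, 'Stan')    # the same call with self given explicitly
print(via_instance == via_class)
```

Both calls run the same code; the first form is the one normally used.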

---
<a id='example_class'></a>
## Example Class

A very simple class is shown below:


In [None]:
class Hello: 
    def sayHi(self): 
        print ('Hello world') 

h = Hello() 
h.sayHi() 

# This short example can also be written as Hello().sayHi() 


When run the output from the program is simple:

	Hello world

<b>NOTE:</b> The sayHi method takes no parameters but still has the self in the function definition. 

The above example has the class and the calling code in the same file, which will work but ideally you want the classes to be separate so you can use them as required.

To achieve this, store the class code in a file called “hello.py”. The class can then be imported from this file:

	from hello import Hello	

	h = Hello() 
	h.sayHi() 


The class file can contain multiple classes, which can either be imported individually or all together with "from hello import *".

The import method for classes is the same as for modules.

---

## Exercise 24

Write the above program, with the simple <b><i>Hello</i></b> class and run it.

The model answer for this is called <b>Hello.py</b> and can be run with the <b>runHello.py</b> script.

(Answers to all exercises are available <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">here</a>.)


In [6]:
class Hello: 
    def sayHi(self): 
        print ('Hello world') 

h = Hello()
h.sayHi()

Hello world


---
<a id='init_method'></a>
## The __init__ method

The <b><i>__init__ method</i></b> is run as soon as an object of a class is instantiated. The method can be used to do any initialization of the object you want to do. 

<b>NOTE:</b> There is a double underscore at both the beginning and the end of the name.

In [None]:
class Person: 
    def __init__(self, name): 
        self.name = name 
    def sayHi(self): 
        print ('Hello, my name is', self.name)

p = Person('Stan') 

p.sayHi() 
# Can also be written as Person('Stan').sayHi()  


The <b><i>__init__ method</i></b> takes a parameter <b><i>name</i></b> (along with the usual <b><i>self</i></b>). 

	 def __init__(self, name): 

Variables that belong to an object or class are called <b><i>fields</i></b> and a new <b><i>field</i></b>, also called <b><i>name</i></b>, is created. Notice these are two different variables even though they have the same name. 
The dotted notation differentiates between them. 

	self.name = name 


Most importantly, notice that the <b><i>__init__ method</i></b> is not explicitly called but the arguments in the parentheses following the class name are passed to it when creating a new instance of the class. This is the special significance of this method. 

	 p = Person('Stan') 


The <b><i>self.name</i></b> field can now be used in methods, as in the <b><i>sayHi</i></b> method. 

	p.sayHi() 

---

## Exercise 25

Write the above program, with the modified <b><i>Person</i></b> class and run it.

The model answer for this is called <b>Person.py</b>.

(Answers to all exercises are available <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">here</a>.)



In [7]:
class Person:
    def __init__(self, name): 
        self.name = name 

    def sayHi(self): 
        print ('Hello, my name is', self.name) 

p = Person("Abigail")
p.sayHi()



Hello, my name is Abigail


---
<a id='class_object_variables'></a>
## Class and Object Variables

Class variables are shared in the sense that they are accessed by all objects (instances) of that class. There is only one copy of the class variable and when any one object makes a change to it, the change is reflected in all the other instances as well. 

Object variables are owned by each individual object/instance of the class. In this case, each object has its own copy of the field i.e. they are not shared and are not related in any way to the field by the same name in a different instance of the same class. 

This is demonstrated in the following modification of the <b><i>Person</i></b> class:


In [None]:
class Person: 
    #Represents a person
    population = 0 
    def __init__(self, name): 
        #Initializes the person's data
        self.name = name 
        print  ('Initializing %s' % self.name)
        # Add them to the population 
        Person.population += 1 

    def sayHi(self): 
        #Greeting by the person
        print ('Hi, my name is %s.' % self.name)

    def howMany(self): 
        # Prints the current population.
        print ('We have a population of %d here.' % Person.population)

stan  = Person('Stan')
stan.sayHi() 
stan.howMany() 

brian = Person('Brian')
brian.sayHi() 
brian.howMany() 


When this is run the output will be:

	Initializing Stan
	Hi, my name is Stan. 
	We have a population of 1 here. 
	Initializing Brian
	Hi, my name is Brian.
	We have a population of 2 here. 

The <b><i>population</i></b> variable belongs to the <b><i>Person</i></b> class and hence is a <b><i>class variable</i></b>. The <b><i>name variable</i></b> belongs to the object (it is assigned using self) and hence is an <b><i>object variable</i></b>. 

Therefore, refer to the population class variable as <b><i>Person.population</i></b> and not as <b><i>self.population</i></b>. 

Note that an object variable with the same name as a class variable will hide the class variable! Refer to the object variable name using self.
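
A small sketch of this hiding behaviour is given below. The `Counter` class and the values are assumptions made up for the illustration:

```python
class Counter:
    total = 0                 # class variable, shared by all instances

a = Counter()
b = Counter()

Counter.total = 5             # changes the shared class variable
print(a.total, b.total)       # both instances see the class variable

a.total = 99                  # creates an object variable on a only
print(a.total, b.total, Counter.total)
```

After the assignment through `a`, only `a` sees the new value; `b` and the class itself still refer to the class variable.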

The classes also use a print formatting option that is common to many other languages:

	print ('Initializing %s' % self.name)

The %s is a placeholder for a string in the text to be printed. The value to place there is given after a % that follows the closing quote, in this case <b><i>% self.name</i></b>. This means that <b><i>self.name</i></b> will be printed in place of %s.

Multiple values can be inserted by enclosing the variables in parentheses:

In [None]:
first = "Ema"
second = "Jones"

print ("My first name is %s and my second is %s" %(first, second))

The %s is used to denote a string but %d can be used for an integer and %f for a float.

There are options to modify these that enable text to be formatted for example to align text or fix floats to a set number of decimal places. 
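
As a brief sketch of these options, the example below uses a field width to align text and a precision to fix the number of decimal places. The variable names and values are made up for illustration:

```python
name = "Ema"
score = 3.14159
count = 42

print('%-10s|' % name)      # left-align in a field 10 characters wide
print('%10s|' % name)       # right-align in a field 10 characters wide
print('%.2f' % score)       # float fixed to 2 decimal places
print('%5d' % count)        # integer right-aligned in a field 5 characters wide
```

The same width and precision syntax is used later in this tutorial in the `%10.4f` placeholders of the Atom class.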

---
## Exercise 26

Update your program to use the modified <b><i>Person</i></b> class and run it.

The model answer for this is called <b>Person2.py</b>.

(Answers to all exercises are available <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">here</a>.)


In [11]:
class Person: 
    #Represents a person
    population = 0 
    def __init__(self, name): 
        #Initializes the person's data
        self.name = name 
        print  ('Initializing %s' % self.name)
        # Add them to the population 
        Person.population += 1 

    def sayHi(self): 
        #Greeting by the person
        print ('Hi, my name is %s.' % self.name)

    def howMany(self): 
        # Prints the current population.
        print ('We have a population of %d here.' % Person.population)


stan  = Person("Stan")
stan.sayHi() 
stan.howMany() 
print('')
brian = Person("Brian")
brian.sayHi() 
brian.howMany() 



Initializing Stan
Hi, my name is Stan.
We have a population of 1 here.

Initializing Brian
Hi, my name is Brian.
We have a population of 2 here.


---
<a id='inheritance'></a>
## Inheritance

As expected inheritance is supported in Python. 

For example, consider a program which has to keep track of the teachers and students in a college. They have some common characteristics such as name, age and address. They also have specific characteristics such as salary, courses and leave for teachers, and marks and fees for students. 

A class called <b><i>SchoolMember</i></b> can be created and then have the teacher and student classes <b><i>inherit</i></b> from this class i.e. they will become sub-types of this type (class) and then specific characteristics can be added to these sub-types. 

This means that any addition/change to the functionality in SchoolMember is automatically reflected in the subtypes as well. 

For example, a new ID card field can be added for both teachers and students by simply adding it to the <b><i>SchoolMember</i></b> class. However, changes in the subtypes do not affect other subtypes. 

Another advantage is that you can refer to a teacher or student object as a <b><i>SchoolMember</i></b> object which could be useful in some situations such as counting the number of school members. 

This is <b><i>polymorphism</i></b> where a sub-type can be substituted in any situation where a parent type is expected i.e. the object can be treated as an instance of the parent class. 

The <b><i>SchoolMember</i></b> class in this situation is known as the <b><i>base class</i></b> or the <b><i>superclass</i></b>. The <b><i>Teacher</i></b> and <b><i>Student</i></b> classes are called the <b><i>derived classes</i></b> or <b><i>subclasses</i></b>. 

An example class to demonstrate this is shown below, including code to run it.


In [None]:
class SchoolMember:
    #Represents any school member
    def __init__(self, name, age):
        self.name = name
        self.age = age
        print ('Initialized SchoolMember: %s' % self.name)

    def tell(self):
        #Tell my details
        print ('Name:"%s" Age:"%s"' % (self.name, self.age) )

class Teacher(SchoolMember):
    #Represents a teacher
    def __init__(self, name, age, salary):
        SchoolMember.__init__(self, name, age)
        self.salary = salary
        print ('Initialized Teacher: %s' % self.name)

    def tell(self):
        SchoolMember.tell(self)
        print ('Salary: "%d"' % self.salary)

class Student(SchoolMember):
    #Represents a student
    def __init__(self, name, age, marks):
        SchoolMember.__init__(self, name, age)
        self.marks = marks
        print ('Initialized Student: %s' % self.name)

    def tell(self):
        SchoolMember.tell(self)
        print ('Marks: "%d"' % self.marks)
        

# To create objects of the classes
t = Teacher('Mrs. Jones', 40, 30000)
s = Student('Stan', 22, 75)

print() # prints a blank line

members = [t, s]
for member in members:
    member.tell() # works for both Teachers and Students

The output would be:

	Initialized SchoolMember: Mrs. Jones
	Initialized Teacher: Mrs. Jones
	Initialized SchoolMember: Stan
	Initialized Student: Stan

	Name:"Mrs. Jones" Age:"40"
	Salary: "30000"
	Name:"Stan" Age:"22"
	Marks: "75"

The Teacher and Student classes both inherit the <b><i>tell</i></b> method from <b><i>SchoolMember</i></b> but they override it with their own <b><i>tell</i></b> method. In these cases they also call the <b><i>SchoolMember tell</i></b> method but that isn’t necessary. Overriding a method in this way is a form of polymorphism: they inherit a method but implement it differently.

<b>NOTE:</b> Python does not automatically call the constructor of the superclass, you have to explicitly call it yourself. 

	
	class Teacher(SchoolMember):
		'''Represents a teacher.'''
		def __init__(self, name, age, salary):
			SchoolMember.__init__(self, name, age)  # <---------


Methods of the superclass can be called by prefixing the method call with the class name and passing in the self variable along with any arguments. 

Instances of <b><i>Teacher</i></b> or <b><i>Student</i></b> can also be treated as instances of <b><i>SchoolMember</i></b>, as happens when the tell method of the <b><i>SchoolMember</i></b> class is called on them: 


	SchoolMember.tell(self)
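
The sketch below illustrates both points with a cut-down version of the classes. It makes two assumptions that differ from the code above: `tell` returns a string instead of printing it, and the superclass is reached with `super()`, which in Python 3 is a common alternative to naming the superclass explicitly:

```python
class SchoolMember:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def tell(self):
        return 'Name:"%s" Age:"%s"' % (self.name, self.age)

class Teacher(SchoolMember):
    def __init__(self, name, age, salary):
        # Same effect as SchoolMember.__init__(self, name, age)
        super().__init__(name, age)
        self.salary = salary

    def tell(self):
        # Reuse the superclass method, then add the teacher-specific detail
        return super().tell() + ' Salary: "%d"' % self.salary

t = Teacher('Mrs. Jones', 40, 30000)
print(isinstance(t, Teacher))        # a Teacher is a Teacher
print(isinstance(t, SchoolMember))   # and also a SchoolMember
print(t.tell())
```

Either style works; the explicit `SchoolMember.__init__(self, ...)` form used in this tutorial makes it obvious which class is being called.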

The classes can all be in the same file, perhaps called <b>People.py</b> and would be imported in a script with:

	from People import *


## Exercise 27
 
Write a program that implements and tests the school member classes. 

The model answer for this is called <b>SchoolMember.py</b>.

(Answers to all exercises are available <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">here</a>.)


In [12]:
class SchoolMember:
    #Represents any school member
    def __init__(self, name, age):
        self.name = name
        self.age = age
        print ('Initialized SchoolMember: %s' % self.name)

    def tell(self):
        #Tell my details
        print ('Name:"%s" Age:"%s"' % (self.name, self.age))

class Teacher(SchoolMember):
    #Represents a teacher
    def __init__(self, name, age, salary):
        SchoolMember.__init__(self, name, age)
        self.salary = salary
        print ('Initialized Teacher: %s' % self.name)
    def tell(self):
        SchoolMember.tell(self)
        print ('Salary: "%d"' % self.salary)

class Student(SchoolMember):
    #Represents a student
    def __init__(self, name, age, marks):
        SchoolMember.__init__(self, name, age)
        self.marks = marks
        print ('Initialized Student: %s' % self.name)

    def tell(self):
        SchoolMember.tell(self)
        print ('Marks: "%d"' % self.marks)


t = Teacher('Mrs. Jones', 40, 30000)
s = Student('Stan', 22, 75)

print() # prints a blank line

members = [t, s]
for member in members:
    member.tell() # works for both Teachers and Students



Initialized SchoolMember: Mrs. Jones
Initialized Teacher: Mrs. Jones
Initialized SchoolMember: Stan
Initialized Student: Stan

Name:"Mrs. Jones" Age:"40"
Salary: "30000"
Name:"Stan" Age:"22"
Marks: "75"


---
<a id='atom_class'></a>
## Example Atom Class

“Atom” is a generic term covering many different elements. A molecule is made of multiple atoms, so it would consist of multiple Atom objects. An Atom class can be defined to represent these generic characteristics.

Example variables would be:

	Symbol – C (carbon), N (nitrogen) etc

The class could also store the atom's position in a molecule:

	x coordinate, y coordinate and z coordinate

Other variables could be size, full name etc. The class could then contain methods to set and get these variables:

In [13]:
class Atom:
    def __init__(self,symbol,x,y,z):
        self.symbol = symbol
        self.position = (x,y,z)

    def getSymbol(self):
        return self.symbol

    def __repr__(self):
        return '%s %10.4f %10.4f %10.4f' % (self.getSymbol(), self.position[0], self.position[1],self.position[2])
    
at = Atom('C',0.0,1.0,2.0)

print (at)

C     0.0000     1.0000     2.0000



<b><i>__repr__</i></b> is a reserved method name to represent the object. It should return a printable representation of the object and in this case it returns atom position details.

<b>at = Atom('C',0.0,1.0,2.0)</b>

	Create an atom object for carbon with the positions 0.0,1.0 and 2.0 

<b>print (at)</b>

	This will print:

<b>C     0.0000     1.0000     2.0000</b>

	"at" is the atom object name and printing it will call the __repr__ method of the Atom class

	This returns the string representation of the object

<b>at.getSymbol()</b>

	returns the atom symbol, in this case:

<b>C</b>

---
<a id='molecule_class'></a>
## Molecule Class

We can now create a <b><i>Molecule</i></b> class, which will consist of multiple <b><i>Atom</i></b> objects:

In [14]:
class Atom:
    def __init__(self,symbol,x,y,z):
        self.symbol = symbol
        self.position = (x,y,z)

    def getSymbol(self):
        return self.symbol

    def __repr__(self):
        return '%s %10.4f %10.4f %10.4f' % (self.getSymbol(), self.position[0], self.position[1],self.position[2])
    
class Molecule:
    def __init__(self,name='Test Molecule'):
        self.name = name
        self.atomlist = []

    def addatom(self,atom):
        self.atomlist.append(atom)

    def __repr__(self):
        s = 'Molecule name is %s ' % self.name
        s = s + ' and it has %d atoms\n' % len(self.atomlist)
        for atom in self.atomlist:
            s = s + str(atom) + '\n'
        return s
                

mol = Molecule('Water')

at = Atom('O',0.,0.,0.)  #  Create first atom
mol.addatom(at)   #  Add it to the molecule object

# The next atoms are added directly to the object:
mol.addatom(Atom('H',0.,0.,1.))
mol.addatom(Atom('H',0.,1.,0.))

print (mol)  # Print the molecule object details 


Molecule name is Water  and it has 3 atoms
O     0.0000     0.0000     0.0000
H     0.0000     0.0000     1.0000
H     0.0000     1.0000     0.0000



Ideally you would actually create the <b>Atom</b> and <b>Molecule</b> classes as separate files and then import them into your script as required:

    # Creating a water molecule:
    from Molecule import Molecule
    from Atom import Atom 

    mol = Molecule('Water')

    at = Atom('O',0.,0.,0.)  #  Create first atom
    mol.addatom(at)   #  Add it to the molecule object

    # The next atoms are added directly to the object:
    mol.addatom(Atom('H',0.,0.,1.))
    mol.addatom(Atom('H',0.,1.,0.))

    print (mol)  # Print the molecule object details 
    
Files for these classes, called <b>Atom.py</b> and <b>Molecule.py</b>, are available on the exercise answers page and can be tested with <b>runAtom.py</b> and <b>runMolecule.py</b> respectively.

---
## Exercise 28

Currently the position variable in <b>Atom</b> is publicly accessible. Test the code to demonstrate the problem that this can cause. For example, write a script that changes the positions to different values. Modify the code so that the position is private and cannot be changed and test its accessibility with a script.

There is no model answer for this exercise.

In [21]:
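class Atom:
    def __init__(self,symbol,x,y,z):
        self.symbol = symbol
        self._position = (x,y,z)   # leading underscore: private by convention

    @property
    def position(self):
        # Read-only: no setter is defined, so assignment fails
        return self._position
    
    def getSymbol(self):
        return self.symbol

    def __repr__(self):
        return '%s %10.4f %10.4f %10.4f' % (self.getSymbol(), self.position[0], self.position[1],self.position[2])
    
at = Atom('C',0.0,1.0,2.0)
print (at)

# Raises AttributeError (the message wording varies between Python versions)
at.position = (5.0,5.0,5.0)

C     0.0000     1.0000     2.0000
AttributeError: can't set attribute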

---
## Exercise 29

Download the <b>1l2y.coords</b> from the <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">Exercise Answers</a> page.

This is an extract from the PDB file for Trp-cage protein, which is 20 amino acids in length. The file contains the atomic coordinates for one model of the structure. 

Create a new class called Protein to store the protein sequence information. This class could contain a list of molecule objects but the actual design is up to you. The class should provide methods for the following functionality: 

i)	Return the amino acid sequence of the protein. You can use the 3 letter names as provided in the coordinates file. This is the method used in the model answer but it would be better to use a dictionary to map these to the single letter codes.
ii)	Provide the amino acid name, atom name and atom coordinate details for any atom at a particular position. This method will take the atom position in the sequence as its argument. For example, “3” would provide all of the information for the third atom, which is a carbon in an aspartic acid.
iii)	Provide the amino acid name and atom coordinate details for all atoms in an amino acid at a particular position. This method will take the amino acid position in the sequence as its argument. For example, “2” would provide all of the information for the atoms in the second amino acid, which is alanine.
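
As a hedged sketch of the dictionary approach mentioned in (i), the snippet below maps three letter names to single letter codes. The names `THREE_TO_ONE` and `to_one_letter` are hypothetical, and the table is deliberately partial:

```python
# Partial mapping for illustration; a full table would list all 20 amino acids.
THREE_TO_ONE = {
    'ASP': 'D', 'ALA': 'A', 'TYR': 'Y', 'GLN': 'Q', 'TRP': 'W',
    'LEU': 'L', 'GLY': 'G', 'SER': 'S', 'ARG': 'R', 'PRO': 'P',
}

def to_one_letter(three_letter_names):
    # Unknown names become 'X' rather than raising an error
    return ''.join(THREE_TO_ONE.get(name, 'X') for name in three_letter_names)

print(to_one_letter(['ASP', 'ALA', 'TYR', 'ALA', 'GLN']))
```

Using `get` with a default is one possible design choice; raising an error for an unknown name would be equally reasonable.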


For reference the full amino acid sequence is:

DAYAQWLADAGWASARPPPS

The format of the coords file is:

ATOM      1  N   ASP A   1      -5.411  -4.929   5.995  1.00  0.00           N  
ATOM      2  CA  ASP A   1      -6.722  -4.516   5.416  1.00  0.00           C  
ATOM      3  C   ASP A   1      -6.568  -3.352   4.434  1.00  0.00           C  

Column 2 is the atom number
Column 3 is the atom name
Column 4 is the amino acid and column 5 is the chain identifier
Column 6 is the amino acid number
Columns 7 – 9 are the X, Y and Z coordinates for the atom
Column 12 is the element symbol of the atom
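
The column layout can be sketched by splitting one of the sample lines on whitespace. This is only an illustration under the assumption that the fields are separated by spaces, as they are in this extract; PDB files are strictly fixed-width, so whitespace splitting can fail on files where columns run together:

```python
line = "ATOM      2  CA  ASP A   1      -6.722  -4.516   5.416  1.00  0.00           C  "

fields = line.split()
atom_number = int(fields[1])                 # column 2
atom_name = fields[2]                        # column 3
amino_acid = fields[3]                       # column 4
aa_number = int(fields[5])                   # column 6
x, y, z = (float(v) for v in fields[6:9])    # columns 7 - 9
element = fields[11]                         # column 12

print(atom_number, atom_name, amino_acid, aa_number, (x, y, z), element)
```

Remember that Python list indexes start at 0, so column 2 of the file is `fields[1]`.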

The full PDB is also available from the <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">Exercise Answers</a> page.

The model answer for this is called <b>Protein.py</b> and can be tested with <b>runProtein.py</b>.

(Answers to all exercises are available <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">here</a>.)


In [29]:
class Protein:
    def __init__(self,name='Test Molecule'):
        self.name = name
        self.atomlist = []
        self.aa_dict = {}
        self.aanum_dict = {}
    def addatom(self,atom, aa, aanum):
        self.atomlist.append(atom)
        self.aa_dict[len(self.atomlist)] = aa
        self.aanum_dict[len(self.atomlist)] = aanum
    def addsequence(self,seq):
        self.sequence = seq
    def getsequence(self):
        print (self.sequence)
    def getatomdetails(self, atom_pos):
        at = self.atomlist[atom_pos - 1]
        print (self.aa_dict[atom_pos], at)
    def getaadetails(self, aapos):
        # Atom positions are numbered from 1, matching the dictionary keys
        for atpos in range(1, len(self.atomlist) + 1):
            if int(self.aanum_dict[atpos]) == int(aapos):
                self.getatomdetails(atpos)
    def __repr__(self):
        s = 'Molecule name is %s ' % self.name
        s = s + ' and it has %d atoms\n' % len(self.atomlist)
        for atom in self.atomlist:
            s = s + str(atom) + '\n'
        return s

class Atom:
    def __init__(self,symbol,x,y,z):
        self.symbol = symbol
        self.position = (x,y,z)

    def getSymbol(self):
        return self.symbol

    def __repr__(self):
        return '%s %10.4f %10.4f %10.4f' % (self.getSymbol(), self.position[0], self.position[1],self.position[2])
     


In [49]:

# Creating a protein molecule:

prot = Protein('Trp-cage')
seq = ""

countaa = 0


with open ("1l2y.coords") as protfile:
    for line in protfile:
        line = line.rstrip()
        split_line = line.split()
        aa = split_line[3]
        aanum = split_line[5]
        at = Atom(split_line[11], float(split_line[6]), float(split_line[7]), float(split_line[8]))  #  Create an atom from this line
        prot.addatom(at, aa, aanum)   #  Add it to the protein object

        if(int(aanum) > countaa):
            seq += aa + " "
            countaa+=1

prot.addsequence(seq)

# print (prot)  # Print the molecule object details 

print ("\nProtein Sequence:")
prot.getsequence()

print ("\nAtom Details:")
prot.getatomdetails(1)

print("\nAA Details")
prot.getaadetails(1)




Protein Sequence:
ASN LEU TYR ILE GLN TRP LEU LYS ASP GLY GLY PRO SER SER GLY ARG PRO PRO PRO SER 

Atom Details:
ASN N    -8.9010     4.1270    -0.5550

AA Details
ASN N    -8.9010     4.1270    -0.5550
ASN C    -8.6080     3.1350    -1.6180
ASN C    -7.1170     2.9640    -1.8970
ASN O    -6.6340     1.8490    -1.7580
ASN C    -9.4370     3.3960    -2.8890
ASN C   -10.9150     3.1300    -2.6110
ASN O   -11.2690     2.7000    -1.5240
ASN N   -11.8060     3.4060    -3.5430
ASN H    -8.3300     3.9570     0.2610
ASN H    -8.7400     5.0680    -0.8890
ASN H    -9.8770     4.0410    -0.2930
ASN H    -8.9300     2.1620    -1.2390
ASN H    -9.3100     4.4170    -3.1930
ASN H    -9.1080     2.7190    -3.6790
ASN H   -11.5720     3.7910    -4.4440
ASN H   -12.7570     3.1830    -3.2940


---
<a id='sequence_files'></a>
# Handling Sequence Files

An example problem may be the need to write a program to parse a multi fasta sequence file and extract a specific sequence based on accession ID. For example, extract P01322 from a file with the following format:

\>sp|Q91XI3|INS_SPETR Insulin OS=Spermophilus tridecemlineatus GN=INS PE=3 SV=1
MALWTRLLPLLALLALLGPDPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKSRREVEEQQGGQVELGGGPGAGLPQPLALEMALQKRGIVEQCCTSICSLYQLENYCN<br>
\>sp|P01313|INS_CRILO Insulin OS=Cricetulus longicaudatus GN=INS PE=1 SV=2
MTLWMRLLPLLTLLVLWEPNPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKSRRGVEDPQVAQLELGGGPGADDLQTLALEVAQQKRGIVDQCCTSICSLYQLENYCN<br>
\>sp|P01322|INS1_RAT Insulin-1 OS=Rattus norvegicus GN=Ins1 PE=1 SV=1
MALWMRFLPLLALLVLWEPKPAQAFVKQHLCGPHLVEALYLVCGERGFFYTPKSRREVEDPQVPQLELGGGPEAGDLQTLALEVARQKRGIVDQCCTSICSLYQLENYCN<br>
\>sp|P01315|INS_PIG Insulin OS=Sus scrofa GN=INS PE=1 SV=2
MALWTRLLPLLALLALWAPAPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKARREAENPQAGAVELGGGLGGLQALALEGPPQKRGIVEQCCTSICSLYQLENYCN<br>
......

A procedural example would be:

    with open('seq.fasta') as in_file:
        match = 0
        for line in in_file:
            line = line.rstrip()  # Remove newline
            if line.startswith('>'):
                # A description line so search for required ID
                if line.find('P01322') != -1:
                    # It is the required sequence so set flag to 1
                    match = 1
                else:
                    # Not the required sequence so set flag to 0
                    match = 0
            if match:
                # In the correct sequence so print it out
                print(line)


This is not the most elegant solution but it fulfils the requirement, even if it is not immediately clear how it works. The code is also limited and it is difficult to add further functionality.

It may be necessary to add functionality to return the length of the sequence, to handle nucleotide as well as protein sequences (including options to translate the former) and to extract multiple sequences based on a list of IDs.

All of these are achievable but it is not immediately obvious how to add them to the code. They will also make the code more complex and difficult to understand.

OO provides an easier mechanism for adding functionality and also makes the code easier to understand. Object oriented programming groups data and functionality together, provides the ability to abstract away some of the complexities of the processing logic and encapsulates the data.

Two classes will provide this functionality:

<b>FastaRecord</b> - Store and provide the sequence information 
<b>FastaParser</b> - Parse the fasta record to create the FastaRecord objects 

To test the following code you'll need the <b>multi_sequences.fasta</b> file from the <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">exercise answers</a> page. You need to store this file in the same directory as the Jupyter Notebook files.


In [None]:
class FastaRecord():
    # Class representing a FASTA record

    def __init__(self, description_line):
        # Initialise an instance of the FastaRecord class
        self.description = description_line.strip()
        self.sequences = []

    def add_sequence_line(self, sequence_line):

        # Add a sequence line to the FastaRecord instance.
        # This function can be called more than once.

        self.sequences.append( sequence_line.strip() )

    def __repr__(self):
        # Representation of the FastaRecord instance
        lines = [self.description,]
        lines.extend(self.sequences)
        return '\n'.join(lines)


class FastaParser():
    # Class for parsing FASTA files

    def __init__(self, fpath):
        # Initialise an instance of the FastaParser
        self.fpath = fpath

    def __iter__(self):
        # Yield FastaRecord instances
        fasta_record = None
        with open(self.fpath, 'r') as fh:
            for line in fh:
                if line.startswith('>'):
                    if fasta_record:
                        yield fasta_record
                    fasta_record = FastaRecord(line)
                else:
                    fasta_record.add_sequence_line(line)
        yield fasta_record

for fasta_record in FastaParser('multi_sequences.fasta'):
    print(fasta_record.description)


Both classes can be in the same file and then imported into the test script. Assuming they are in a file called <b>Fasta.py</b>:

    from Fasta import FastaParser

    for fasta_record in FastaParser('multi_sequences.fasta'):
        print(fasta_record.description)


The <b>Fasta.py</b> file and a testing script called <b>runFasta.py</b> are available from the <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">exercise answers</a> page.

---
<a id='iter'></a>
## __iter__

<b>def __iter__(self):</b>

A special method that makes an object iterable, so that it can be used in a for loop.

For example:

    for fasta_record in FastaParser('example.fasta'):
        print(fasta_record.description)


This creates a <b>FastaParser</b> object with the fasta sequence file <b><i>example.fasta</i></b>. As the for loop iterates over the FastaParser object it automatically calls the <b>__iter__</b> method, in the same way that print automatically calls the <b>__repr__ method</b>.
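
A minimal sketch of the same mechanism, independent of the fasta classes, is shown below. The `Countdown` class is a hypothetical example; its `__iter__` simply hands back an iterator over a range:

```python
class Countdown:
    # Hypothetical class used only to illustrate __iter__
    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # Called automatically when the object is used in a for loop
        return iter(range(self.start, 0, -1))

for value in Countdown(3):
    print(value)
```

The for loop never mentions `__iter__`; Python calls it to obtain the values to loop over.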

---
<a id='magic_methods'></a>
## Magic Methods

<b>__repr__</b> and <b>__iter__</b> are examples of magic methods. The <i>magic</i> aspect of these methods is that you don’t need to call them explicitly. A for loop calls <b>__iter__</b> and print calls <b>__repr__</b> (when the class does not define a separate <b>__str__</b> method).

There are many magic methods that can be used and they are not well documented but they are explained at:

	http://www.rafekettler.com/magicmethods.html

Magic methods are used all the time in Python but as they are called silently they are not noticed:

	+ 		object.__add__(self, other)
	*= 		object.__imul__(self, other) 
	int() 		object.__int__(self) 
	> 		object.__gt__(self, other) 
	etc
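
To make the table above concrete, here is a small sketch of a class that defines three of these methods. The `Money` class and its `pence` field are hypothetical and exist only for this illustration:

```python
class Money:
    # Hypothetical class used only to illustrate magic methods
    def __init__(self, pence):
        self.pence = pence

    def __add__(self, other):
        # Called for: self + other
        return Money(self.pence + other.pence)

    def __gt__(self, other):
        # Called for: self > other
        return self.pence > other.pence

    def __int__(self):
        # Called for: int(self)
        return self.pence

total = Money(150) + Money(275)
print(int(total))
print(Money(5) > Money(3))
```

None of the magic methods is called by name; the operators and the `int()` function trigger them.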
    
---    
<a id='iterables'></a>
## Iterables

To explain <b><i>yield</i></b> you first need to understand <b><i>iterables</i></b> and <b><i>generators</i></b>. When you create a <b><i>list</i></b> you can read it one element at a time in a <b><i>for loop</i></b> (iteration) and anything that can be read in a <b><i>for loop</i></b> is an <b><i>iterable</i></b> (<b><i>list, string, file</i></b> etc). Lists and strings can be read as many times as you like as they are stored in memory. However, storing everything in memory is not efficient if you have a large amount of data, particularly if you only want to iterate over it once.

---
<a id='generators'></a>
## Generators

<b><i>Generators</i></b> are <b><i>iterators</i></b>, but you can only iterate over them once.

They do not store all the values in memory and instead generate them on the fly:

	mygenerator = (x*x for x in range(3))
	for i in mygenerator:
		print(i)


You cannot usefully perform <b><i>"for i in mygenerator"</i></b> a second time: the <b><i>generator</i></b> is exhausted after the first loop, so a second loop produces no values. Each value is calculated once and then forgotten.
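
The difference from a list can be checked directly. The short sketch below (the variable names are chosen only for this illustration) builds the same squares once as a list and once as a generator, then reads each of them twice.

```python
squares_list = [x * x for x in range(3)]    # list: all values are stored
squares_gen = (x * x for x in range(3))     # generator: values made on demand

first_list_pass = list(squares_list)
second_list_pass = list(squares_list)

first_gen_pass = list(squares_gen)
second_gen_pass = list(squares_gen)         # the generator is already exhausted

print(first_list_pass, second_list_pass)    # [0, 1, 4] [0, 1, 4]
print(first_gen_pass, second_gen_pass)      # [0, 1, 4] []
```

Note that the second pass over the generator does not raise an error; it simply yields nothing.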

---
<a id='yield'></a>
## Yield

<b><i>Yield</i></b> is a keyword that is used like return, except that a function containing it returns a <b><i>generator</i></b> when it is called. Each <b><i>yield</i></b> hands back one value and pauses the function until the next value is requested.

In the <b>FastaParser</b> class the magic method <b><i>__iter__</i></b> must return an <b><i>iterator</i></b>, and using <b><i>yield</i></b> is a simple way to do this. The method returns a <b><i>generator</i></b>, so each fasta record is returned individually and the records are not all stored:
    
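    for line in in_file:
        if line.startswith('>'):
            # A new description line: hand back the previous record, if any
            if fasta_record:
                yield fasta_record
            fasta_record = FastaRecord(line)
        else:
            fasta_record.add_sequence_line(line)
    # After the loop, hand back the last record in the file
    yield fasta_record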
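
The behaviour of <b><i>yield</i></b> can be tried without a fasta file. The sketch below is a simplified stand-in rather than the code from <b>Fasta.py</b>: the function <b>read_records</b> is hypothetical, and it groups plain strings into lists instead of building <b>FastaRecord</b> objects.

```python
def read_records(lines):
    # Hypothetical generator function: groups lines into records, where a
    # record starts at each line beginning with '>'
    record = None
    for line in lines:
        if line.startswith('>'):
            if record is not None:
                yield record          # hand back the finished record
            record = [line]
        elif record is not None:
            record.append(line)
    if record is not None:
        yield record                  # the last record in the input


lines = ['>seq1', 'ACGT', 'TTGA', '>seq2', 'GGCC']
gen = read_records(lines)     # nothing runs yet; a generator is returned

print(next(gen))              # ['>seq1', 'ACGT', 'TTGA']
print(next(gen))              # ['>seq2', 'GGCC']
```

Calling <b>read_records</b> does not process any lines. The function body only runs when a value is requested, and it pauses again at each <b><i>yield</i></b>.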
    
---   
## Expanding the Fasta Classes

The FastaRecord and FastaParser classes could now be extended to include additional methods.

<b>FastaRecord:</b>

- Returns a fasta sequence with a specified ID
- Returns the length of a sequence with a specified ID
- Returns a translation of a specified sequence

<b>FastaParser:</b>

- Returns the description lines of all the sequences
- Returns the IDs of all the sequences

---
## Benefits of the Object Oriented Version

Writing an OO version of the fasta parsing program does take longer and requires more code, but there are distinct advantages.

It is a lot easier to understand as the functionality is grouped.

It is extendible – additional functions can be added with relative ease, whereas with the procedural version it is not so clear where to start.

It is easy to use – once written the class methods are just called.

---
<a id='oo_exercises'></a>
## Exercises

Note that the model answers for these exercises are the simplest solutions but could be improved. For example, many of the variables could be made private with associated getter and setter methods.
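
As a rough sketch of what that improvement could look like, the hypothetical <b>Record</b> class below (it is not one of the model answers) keeps its ID in an attribute with a leading underscore and exposes it through a getter and a setter. Python does not enforce privacy; the underscore is only a convention that tells other programmers not to access the attribute directly.

```python
class Record:
    # Hypothetical class, not part of the model answers

    def __init__(self, seqid):
        # The leading underscore marks the attribute as private by convention
        self._seqid = seqid

    def getSeqID(self):
        # Getter: read access to the private attribute
        return self._seqid

    def setSeqID(self, seqid):
        # Setter: a single place to validate new values
        if not seqid:
            raise ValueError('seqid must not be empty')
        self._seqid = seqid


record = Record('P01308')
record.setSeqID('Q6YK33')
print(record.getSeqID())      # Q6YK33
```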

---
## Exercise 30


Currently the <b>FastaParser</b> class reads the fasta file each time it is used, so the <b>FastaRecord</b> objects are transient and not stored. Modify the code so that an instance of the <b>FastaParser</b> class creates and stores <b>FastaRecord</b> objects, which can then be retrieved. Write a script that tests the modified class.

The model answer for this is called <b>Fasta2.py</b> and can be run with the <b>runFasta2.py</b> script.

(Answers to all exercises are available <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">here</a>.)

In [53]:
class FastaRecord(object):
    # Class representing a FASTA record

    def __init__(self, description_line):
        # Initialise an instance of the FastaRecord class
        self.sequence = description_line
        # Remove > and whitespace on ends
        self.description = description_line[1:]
        self.description = self.description.strip()
        self.seqid = (self.description.split())[0]
        self.seq_length = 0

    def add_sequence_line(self, sequence_line):

        # Add a sequence line to the FastaRecord instance.
        # This function can be called more than once.

        self.seq_length += len(sequence_line.strip())
        self.sequence += sequence_line

    def __repr__(self):
        # Representation of the FastaRecord instance
        return self.sequence


class FastaParser(object):
    # Class for parsing FASTA files
    def __init__(self, fpath):
        # Initialise an instance of the FastaParser
        self.fpath = fpath
        self.record_list = []

        fasta_record = None
        with open(self.fpath, 'r') as fh:
            for line in fh:
                if line.startswith('>'):
                    if fasta_record:
                        self.record_list.append(fasta_record)
                    fasta_record = FastaRecord(line)
                else:
                    fasta_record.add_sequence_line(line)
            # Store the final record (skipped if the file contained no records)
            if fasta_record:
                self.record_list.append(fasta_record)


    def __repr__(self):
        seqs = ""
        for record in self.record_list:
            seqs += str(record) + "\n"
        return seqs

    def getRecords(self):
        return self.record_list




In [57]:
parser = FastaParser('multi_sequences.fasta')
# print (parser)

print ("Now the list of fasta records")

for fasta_record in parser.getRecords():
	print(fasta_record)





Now the list of fasta records
>sp|P01308|INS_HUMAN Insulin OS=Homo sapiens 
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

>sp|Q6YK33|INS_GORGO Insulin OS=Gorilla gorilla gorilla 
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

>sp|Q8HXV2|INS_PONPY Insulin OS=Pongo pygmaeus 
MALWMRLLPLLALLALWGPDPAQAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

>sp|P30410|INS_PANTR Insulin OS=Pan troglodytes 
MALWMRLLPLLVLLALWGPDPASAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
LQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

>sp|P30406|INS_MACFA Insulin OS=Macaca fascicularis 
MALWMRLLPLLALLALWGPDPAPAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
PQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

>sp|P30407|INS_CHLAE Insulin OS=Chlorocebus aethiops 
MALWMRLLPLLALLALWGPDPVPAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAED
PQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCT

---
## Exercise 31

Modify the code to include the following functionality:

- Returns a fasta sequence with a specified ID.
- Returns the description lines of all the sequences as a list.
- Returns the IDs of all the sequences as a list.
- Returns the length of a sequence with a specified ID.

To incorporate this functionality, separate methods should be added to the <b>FastaRecord</b> and <b>FastaParser</b> classes. Write a script that tests the modified classes.

The model answer for this is called <b>Fasta3.py</b> and can be run with the <b>runFasta3.py</b> script.

(Answers to all exercises are available <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">here</a>.)

In [60]:
class FastaRecord(object):
    # Class representing a FASTA record

	def __init__(self, description_line):
		# Initialise an instance of the FastaRecord class
		self.sequence = description_line
		# Remove > and whitespace on ends
		self.description = description_line[1:]
		self.description = self.description.strip()
		self.seqid = (self.description.split())[0]
		self.seq_length = 0

	def add_sequence_line(self, sequence_line):
        
		# Add a sequence line to the FastaRecord instance.
		# This function can be called more than once.
        
		self.seq_length += len(sequence_line.strip())
		self.sequence += sequence_line

	def matchesID(self, searchid):
		# Return True if this record has the given ID, otherwise False
		return self.seqid == searchid

	def getDescription(self):
		# Return description line
		return self.description 

	def getSeqLength(self):
		# Return seq_length
		return self.seq_length
    
	def __repr__(self):
		# Representation of the FastaRecord instance
		return self.sequence


class FastaParser(object):
   	# Class for parsing FASTA files
	def __init__(self, fpath):
		# Initialise an instance of the FastaParser
		self.fpath = fpath
		self.record_list = []

		fasta_record = None
		with open(self.fpath, 'r') as fh:
			for line in fh:
				if line.startswith('>'):
					if fasta_record:
						self.record_list.append(fasta_record)
					fasta_record = FastaRecord(line)
				else:
					fasta_record.add_sequence_line(line)
			# Store the final record (skipped if the file contained no records)
			if fasta_record:
				self.record_list.append(fasta_record)

	
	def __repr__(self):
		seqs = ""
		for record in self.record_list:
			seqs += str(record) + "\n"
		return seqs

	def getRecords(self):
		return self.record_list

	def getRecordByID(self, seqid):
		for record in self.record_list:
			if record.matchesID(seqid):
				return record

	def getSeqLength(self, seqid):
		for record in self.record_list:
			if record.matchesID(seqid):
				return record.getSeqLength()

	def getIDs(self):
		ids = []
		for record in self.record_list:
			ids.append(record.seqid)
		return ids

	def getDescriptions(self):
		descriptions = []
		for record in self.record_list:
			descriptions.append(record.description)
		return descriptions




In [66]:

parser = FastaParser('multi_sequences.fasta')
# print (parser)

#for fasta_record in parser:
#	print(fasta_record.description)
    	#if fasta_record.matches('Q9Y233'):
        #	print(fasta_record)

print ("Now the list of fasta records")

# for fasta_record in parser.getRecords():
# 	print(fasta_record)


# Get sequence by ID (prints None when no record has this ID, as is the case here)
print ("Record by ID")

print(parser.getRecordByID("DS572233"))

# Get the description lines

descriptions = parser.getDescriptions()
print(descriptions)

# Get the IDs

ids = parser.getIDs()
print(ids)

# Get sequence length by ID

seqlen = parser.getSeqLength("DS572233")
print(seqlen)


Now the list of fasta records
Record by ID
None
['sp|P01308|INS_HUMAN Insulin OS=Homo sapiens', 'sp|Q6YK33|INS_GORGO Insulin OS=Gorilla gorilla gorilla', 'sp|Q8HXV2|INS_PONPY Insulin OS=Pongo pygmaeus', 'sp|P30410|INS_PANTR Insulin OS=Pan troglodytes', 'sp|P30406|INS_MACFA Insulin OS=Macaca fascicularis', 'sp|P30407|INS_CHLAE Insulin OS=Chlorocebus aethiops', 'sp|P67972|INS_AOTTR Insulin OS=Aotus trivirgatus', 'sp|P01321|INS_CANFA Insulin OS=Canis familiaris', 'sp|Q91XI3|INS_SPETR Insulin OS=Spermophilus tridecemlineatus', 'sp|P01313|INS_CRILO Insulin OS=Cricetulus longicaudatus', 'sp|P01322|INS1_RAT Insulin-1 OS=Rattus norvegicus', 'sp|P01315|INS_PIG Insulin OS=Sus scrofa', 'sp|P01311|INS_RABIT Insulin OS=Oryctolagus cuniculus', 'sp|Q62587|INS_PSAOB Insulin OS=Psammomys obesus', 'sp|P06306|INS_FELCA Insulin OS=Felis catus', 'sp|P01317|INS_BOVIN Insulin OS=Bos taurus', 'sp|P01318|INS_SHEEP Insulin OS=Ovis aries', 'sp|P17715|INS_OCTDE Insulin OS=Octodon degus', 'sp|P01329|INS_CAVPO Insu

---
## Exercise 32

The classes so far are designed to handle multiple fasta sequences but sometimes a requirement is just to manipulate a single sequence. Below is a simple class to store a nucleotide sequence, which can be modified to provide additional functionality. This class will just store the sequence itself and not the description line.

	class DNA:
		def __init__(self, s):
			# Create DNA instance initialized to string s
			self.seq = s
      
Add methods to this class to perform the following functions:

- Reverse the sequence.
- Complement the sequence.
- Reverse complement the sequence.
- Provide the %GC content of the sequence.
- Provide a list of codons in the sequence.

You could additionally add a method to translate the sequence. You should have the code for this from a previous exercise.

Write a script that tests the DNA class. You can use the <b>sequence.fasta</b> file available from the <a href="http://teaching.bc.ic.ac.uk/msc/ipython-files/exercises.html">exercise answers</a> page to test the class. You will need to extract just the sequence from this file and use that to create an object instance of the DNA class.

The model answer for this is called <b>DNA.py</b> and can be run with the <b>runDNA.py</b> script. It does not include the final option to translate the sequence, the code for which has already been provided.

In [67]:
class DNA: 
 
	def __init__(self, s): 
		# Create DNA instance initialized with the sequence s
		self.seq = s 
		self.comp = {'a':'t', 't':'a', 'g':'c', 'c':'g'}
	
	def reverse(self): 
		# Return the reversed sequence
		self.revseq = self.seq[::-1]
		return self.revseq 

	def complement(self): 
		# Return the complementary sequence
		compseq = ""
		for nuc in self.seq:
			compseq += self.comp[nuc] 
		return compseq 
     
	def reversecomplement(self): 
		# Return the reverse complement of the sequence
		# Reuses complement() so the base pairing logic lives in one place
		return self.complement()[::-1]
     
	def gc(self): 
		# Return the percentage of sequence composed of G and C
		# Note: like complement(), this method is case sensitive and expects lower case
		# It needs to be modified to work with a sequence in upper case
		gc = self.seq.count('g') + self.seq.count('c') 
		return gc * 100.0 / len(self.seq) 
 
	def codons(self): 
		# Return a list of codons for the sequence
		codons = []
		for count in range(0, len(self.seq), 3):  
			codons.append(self.seq[count:count+3])
		 
		return codons 

	# Return the printable representation of the object, the sequence
	def __repr__(self):
		return self.seq
  


In [85]:
sequence = 'atgttcaaatcc'
dna = DNA(sequence)
rev = dna.reversecomplement()
dna.codons()

['atg', 'ttc', 'aaa', 'tcc']
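
The comments in the model answer note that the class only works with lower case sequences. One possible way to remove that limitation is sketched below. This is an assumption about how it could be done, not the approach taken in <b>DNA.py</b>: the functions <b>complement</b> and <b>gc_percent</b> and the name <b>COMPLEMENT_TABLE</b> are hypothetical, and they are written as standalone functions to keep the example short.

```python
# Translation table mapping each base to its complement, in both cases
COMPLEMENT_TABLE = str.maketrans('acgtACGT', 'tgcaTGCA')


def complement(seq):
    # Complement a sequence, preserving the case of each base
    return seq.translate(COMPLEMENT_TABLE)


def gc_percent(seq):
    # Percentage of G and C bases, ignoring case
    if not seq:
        return 0.0
    lowered = seq.lower()
    return (lowered.count('g') + lowered.count('c')) * 100.0 / len(seq)


print(complement('ATGttc'))               # TACaag
print(complement('atgttcaaatcc')[::-1])   # ggatttgaacat
print(gc_percent('ATGC'))                 # 50.0
```

Characters that are not in the table, such as N, are left unchanged by <b>translate</b>, whereas the dictionary lookup in the model answer would raise a KeyError for them.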