Basic Biological Computing in Python {#chap:pythonI}
====================================

<span>**Firstly, chapter \[chap:unix1\]’s UNIX question?**</span>

In [None]:
find . -type f -exec ls -s {} \; | sort -n | head -10

What is the command doing? How has it been built (explain the
components)?

Outline of the <span>python</span> module
-----------------------------------------

The <span>python</span> module is geared towards teaching you scientific
programming in biology using this modern, and for good reason, immensely
popular language. The components of this module across all the chapters
(Basic, Advanced, Additional topics) are:

Basics of <span>python</span>

How to write and run <span>python</span> code

Understand and implement “control flows”

Learning to use the <span>ipython</span> environment

Writing, debugging, using, and testing <span>python</span> functions

Learning efficient numerical programming in <span>python</span>

Using regular expressions in <span>python</span>

Introduction to certain particularly useful <span>python</span> packages

Using <span>python</span> for building and modifying databases

Using <span>python</span> to run other “stuff” and to patch together
data analysis and/or numerical simulation work flows

![Is <span>python</span> the most common answer to your daily
programming needs? Possibly!](python.png "fig:"){width=".4\textwidth"}\
[www.xkcd.com](www.xkcd.com)

Why <span>python</span>?
------------------------

<span>python</span> was designed with readability and re-usability in
mind. The total time spent programming, debugging, and running code is
likely to be lower in <span>python</span> than in less intuitive or more
cluttered languages (e.g., <span>FORTRAN</span>, <span>perl</span>). It
is a pretty good solution if you want to easily write readable code that
is also reasonably efficient (computationally speaking).

![<span>python</span> is pretty
fast!](benchmark.png "fig:"){width="100.00000%"}\
<http://julialang.org/>

### The Zen of python

Open a terminal and type

In [None]:
$ python -c "import this"

Installing <span>python</span>
------------------------------

<span>**We will use 2.7.x, not 3.x (you can use 3.x later, if you
want)**</span>

Your Ubuntu distribution needs <span>python</span>, so it will already
be installed. However, let’s install the interactive python shell
<span>ipython</span> which we will soon use.

\[$\quad\star$\]

On Ubuntu/Linux, open a terminal (ctrl+alt+t) and type:

In [None]:
$ sudo apt-get install ipython python-scipy python-matplotlib

In Linux, you can easily install python packages that come with the
standard python distribution using the usual <span>sudo apt-get install
python-packagename</span>

Getting started with <span>python</span>
----------------------------------------

Open a terminal (<span>ctrl+alt+t</span>) and type <span>python</span>
(or use the terminal that you just used to install
<span>ipython</span>). Then, try the following:

In [None]:
>>> 2 + 2 # Summation; note that comments start with #
4

>>> 2 * 2 # Multiplication
4

>>> 2 / 2 # Integer division
1

>>> 2 / 2.0 # "Float" division, note the output is float
1.0

>>> 2 / 2.
1.0

>>> 2 > 3
False

>>> 2 >= 2
True

What does “float” mean in the above comment? Why is it necessary to
specify this in Python 2.x (it is not necessary in Python 3.x)? You will
inevitably run into some such jargon in this chapter. The main ones you
need to know are (you will learn more about these along the way):

  ----------- -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  Workspace   The state of the “environment” of your current python <span>*session*</span>, including all variables, functions, objects, etc.
  Variable    A named number, text string, boolean (<span>True</span> or <span>False</span>), or data structure that can change (more on variable and data types later)
  Function    A computer procedure or routine that returns some value(s), and which can be used again and again
  Module      <span>*Variables*</span> and <span>*functions*</span> packaged into a single set of programs that can be invoked as a command (potentially with sub-commands)
  Class       Also, variables and functions packaged into a single set of programs that can be invoked as a command (potentially with sub-commands), but unlike modules, you can spawn many copies of a class within a python session or program
  Object      A particular instance of a class (every object belongs to a class) that is created in a session and eventually destroyed; pretty much everything in your workspace is an object in python!
  ----------- -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

This Module vs. Class vs. Object business is confusing. These constructs
are created to make an (object-oriented) programming language like
<span>python</span> more flexible and user friendly (though it might not
seem so to you currently!). In practice, at least for your current
purposes, you will not build python classes yourself much, typically
working with modules. More on all this later. Also, have a look at
<https://learnpythonthehardway.org/book/ex40.html>

### ipython

We will now immediately switch to the <span>i</span>nteractive
<span>python</span> shell, <span>ipython</span> that you installed
above.

OK, now let’s continue learning python using <span>ipython</span>.

\[$\quad\star$\]

Type <span>ctrl+D</span> in the terminal at the python prompt: this will
exit you from the python shell and you will see the bash prompt again.

Now type <span>ipython</span>

You should now see (after some text):

In [None]:
In [ ]: 

(I have deleted the prompt numbering, e.g., <span>In \[1\]</span>,
<span>In \[2\]</span>, etc., to avoid confusion). This is the
<span>i</span>nteractive <span>python</span> shell (or,
“<span>ipython</span>”). This shell has many advantages over the
bare-bones default python shell with the $>>>$ prompt. For example, as
in the bash shell, <span>TAB</span> leads to auto-completion of a
command or file name (try it).

### Magic commands

IPython also has “magic commands” (start with %; e.g.,
<span>%run</span>). Some useful magic commands:

  ----------------------- ----------------------------------------------------------------------------------------------------------
  <span>%who</span>       Shows current namespace (all variables, modules and functions)
  <span>%whos</span>      Also display the type of each variable; typing <span> %whos function</span> only displays functions etc.
  <span>%pwd</span>       Print working directory
  <span>%history</span>   Print recent commands
  ----------------------- ----------------------------------------------------------------------------------------------------------

Try any of these now!

### Determining an object’s type

Another useful IPython feature is the question mark, which can be used
to find what a particular Python object is, including variables you
created. For example, try:

In [None]:
In [1]: a = 1

In [2]: ?a
Type:        int
String form: 1
Docstring:
int(x=0) -> int or long
int(x, base=10) -> int or long

Convert a number or string to an integer, or return 0 if no arguments
are given.  If x is floating point, the conversion truncates towards zero.
If x is outside the integer range, the function returns a long instead.

If x is not a number or if base is given, then x must be a string or
Unicode object representing an integer literal in the given base.  The
literal can be preceded by '+' or '-' and be surrounded by whitespace.
The base defaults to 10.  Valid bases are 0 and 2-36.  Base 0 means to
interpret the base from the string as an integer literal.
>>> int('0b100', base=0)
4

You can configure ipython’s environment and behavior by editing the
<span>ipython\_config.py</span> file:

In [None]:
$ geany ~/.config/ipython/profile_default/ipython_config.py &

This file does not initially exist, but you can create it by running
<span>ipython profile create</span> in a bash terminal (try it now).

Now you can configure ipython. For example, if you don’t like the blue
<span>ipython</span> prompt, you can type <span>%colors linux</span>
(once inside the shell). If you want to make this color the default,
then edit <span> ipython\_config.py</span>: search for “Set the color
scheme” in the file.

Python variables
----------------

Now, let’s continue our python intro. We will first learn about the
<span> python</span> variable types that were mentioned above. The types
are:

In [None]:
In [ ]: a = 2 #integer

In [ ]: ?a
Type:        int
String form: 2
Docstring:
int(x=0) -> int or long
int(x, base=10) -> int or long

Convert a number or string to an integer, or return 0 if no arguments
are given.  If x is floating point, the conversion truncates towards zero.
If x is outside the integer range, the function returns a long instead.

If x is not a number or if base is given, then x must be a string or
Unicode object representing an integer literal in the given base.  The
literal can be preceded by '+' or '-' and be surrounded by whitespace.
The base defaults to 10.  Valid bases are 0 and 2-36.  Base 0 means to
interpret the base from the string as an integer literal.
>>> int('0b100', base=0)
4

In [ ]: a = 2. #Float

In [ ]: ?a
Type:        float
String form: 2.0
Docstring:
float(x) -> floating point number

Convert a string or number to a floating point number, if possible.

In [ ]: a = "Two" #String

In [ ]: ?a
Type:        str
String form: Two
Length:      3
Docstring:
str(object='') -> string

Return a nice string representation of the object.
If the argument is a string, the return value is the same object.

In [ ]: a = True #Boolean

In [ ]: ?a
Type:        bool
String form: True
Docstring:
bool(x) -> bool

Returns True when the argument x is true, False otherwise.
The builtins True and False are the only two instances of the class bool.
The class bool is a subclass of the class int, and cannot be subclassed.

Thus, <span>python</span> has integer, float (real numbers, with
different precision levels), string, and boolean variables.

### <span>python</span> operators

Here are the operators in python that you can use on variables:

  -------------------------- ------------------
  <span>+</span>             Addition
  <span>-</span>             Subtraction
  <span>\*</span>            Multiplication
  <span>/</span>             Division
  <span>\*\*</span>          Power
  <span>%</span>             Modulo
  <span>//</span>            Integer division
  <span>==</span>            Equals
  <span>!=</span>            Differs
  <span>$>$</span>           Greater
  <span>$>$=</span>          Greater or equal
  <span>&, and</span>        Logical and
  <span>$\vert$, or</span>   Logical or
  <span>not</span>           Logical not
  -------------------------- ------------------

### Assigning and manipulating variables

In [None]:
In []: 2 == 2
    Out []: True

In []: 2 != 2
    Out []: False

In []: 3 / 2
    Out []: 1

In []: 3 / 2.
    Out []: 1.5

In []: 'hola, ' + 'mi llamo Samraat' #why not two languages at the same time?!
    Out []: 'hola, mi llamo Samraat'

In []: x = 5

In [None]:
In []: x + 3
    Out []: 8

In []: y = 8

In []: x + y
    Out []: 13

In []: x = 'My string'

In []: x + ' now has more stuff'
    Out []: 'My string now has more stuff'

In []: x + y
    Out []: TypeError: cannot concatenate 'str' and 'int' objects

OK, so concatenating string and numeric (integer in this case) variables
doesn’t work. No problem, we can convert from one type to another:

In [None]:
In []: x + str(y)
    Out []: 'My string8'

In []: z = '88'

In []: x + z
    Out []: 'My string88'

In []: y + int(z)
    Out []: 96


In <span>python</span>, the type of a variable is determined while the
program or command is running (dynamic typing), as in <span>R</span>
and unlike <span>C</span> or <span>FORTRAN</span>. This is convenient,
but can make programs slow. More on efficient computing later.

<span>python</span> data types and data structures
--------------------------------------------------

<span>python</span> number or string variables (or both) can be stored
and manipulated in:

<span>**List**</span>: most versatile, can contain compound data,
“mutable”, enclosed in brackets, \[ \]

<span>**Tuple**</span>: like a list, but “immutable” — like a read only
list, enclosed in parentheses, ( )

<span>**Dictionary**</span>: a kind of “hash table” of key-value pairs
enclosed by curly braces, { } — key can be number or string, values can
be any object! (well OK, a python object)

<span>**numpy arrays**</span>: Fast, compact, convenient for numerical
computing — more on this later!

### Lists

In [None]:
In []: MyList = [3,2.44,'green',True]

In []: MyList[1]
    Out []: 2.44

In []: MyList[0] # NOTE: FIRST ELEMENT -> 0
    Out []: 3

In []: MyList[4]
    Out []: IndexError: list index out of range

In []: MyList[2] = 'blue'

In []: MyList
    Out []: [3, 2.44, 'blue', True]

In []: MyList[0] = 'blue'

In []: MyList
    Out []: ['blue', 2.44, 'blue', True]

In []: MyList.append('a new item') # NOTE: ".append"!

In []: MyList
    Out []: ['blue', 2.44, 'blue', True, 'a new item']

In []: MyList.sort() # NOTE: suffix a ".", hit tab, and wonder!

In []: MyList
    Out []: [True, 2.44, 'a new item', 'blue', 'blue']

In the above commands, notice that <span>python</span> “indexing” starts
at 0, not 1!

### Tuples

In [None]:
In []: FoodWeb=[('a','b'),('a','c'),('b','c'),('c','c')]

In []: FoodWeb[0]
    Out []: ('a', 'b')

In []: FoodWeb[0][0]
    Out []: 'a'

In []: FoodWeb[0][0] = "bbb"  # NOTE: tuples are "immutable"
     TypeError: 'tuple' object does not support item assignment

In []: FoodWeb[0] = ("bbb","ccc")

In []: FoodWeb[0]
    Out []: ('bbb', 'ccc')

Note that tuples are “immutable”; that is, a particular pair or sequence
of strings or numbers cannot be modified after it is created.

In the above example, why assign these food web data to a list of tuples
and not a list of lists? — because we want to maintain the species
associations, no matter what — they are sacrosanct!

A tuple itself is immutable, but a mutable object stored inside it (such
as a list) can still be modified in place:

In [None]:
In []: a = (1, 2, [])

In []: a[2].append(1000)

In []: a
    Out []: (1, 2, [1000])

### Sets

You can convert a list to a “set”, an unordered collection with no
duplicate elements. Once you create a set you can perform set
operations on it:

In [None]:
In []: a = [5,6,7,7,7,8,9,9]

In []: b = set(a)

In []: b
    Out []: set([8, 9, 5, 6, 7])

In []: c = set([3,4,5,6])

In []: b & c
    Out []: set([5, 6])

In []: b | c
    Out []: set([3, 4, 5, 6, 7, 8, 9])

In []: list(b | c) # set to list
    Out []: [3, 4, 5, 6, 7, 8, 9]

The key set operations in <span>python</span> are:

  -------------------------- -------------------
  <span>a - b </span>        a.difference(b)
  <span>a $<=$ b</span>      a.issubset(b)
  <span>a $>=$ b</span>      b.issubset(a)
  <span>a & b</span>         a.intersection(b)
  <span>a $\vert$ b</span>   a.union(b)
  -------------------------- -------------------
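As a quick check of the table above, the sketch below applies each
operator to two small sets and keeps the result under a descriptive
name. The set contents are arbitrary values chosen only for
illustration, and the code is written to run under Python 2.7 as well as
3.x.

```python
# Two small example sets (arbitrary values, for illustration only)
a = set([5, 6, 7, 8, 9])
b = set([3, 4, 5, 6])

# Each operator gives the same result as the corresponding method
difference = a - b              # same as a.difference(b)
intersection = a & b            # same as a.intersection(b)
union = a | b                   # same as a.union(b)
is_subset = set([5, 6]) <= a    # same as set([5, 6]).issubset(a)

# sorted() turns each set into an ordered list for printing
print(sorted(difference))
print(sorted(intersection))
print(sorted(union))
print(is_subset)
```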

### Dictionaries

A set of values (any <span>python</span> object) indexed by keys (string
or number), a bit like <span>R</span> lists.

In [None]:
In []: GenomeSize = {'Homo sapiens': 3200.0, 'Escherichia coli': 4.6,
'Arabidopsis thaliana': 157.0}

 In []: GenomeSize
Out []: 
{'Arabidopsis thaliana': 157.0,
  'Escherichia coli': 4.6,
  'Homo sapiens': 3200.0}

 In []: GenomeSize['Arabidopsis thaliana']
Out []: 157.0

 In []: GenomeSize['Saccharomyces cerevisiae'] = 12.1

 In []: GenomeSize
Out []: 
{'Arabidopsis thaliana': 157.0,
'Escherichia coli': 4.6,
'Homo sapiens': 3200.0,
'Saccharomyces cerevisiae': 12.1}

 In []: GenomeSize['Escherichia coli'] = 4.6  # ALREADY IN DICTIONARY!

 In []: GenomeSize
Out []: 
{'Arabidopsis thaliana': 157.0,
 'Escherichia coli': 4.6,
 'Homo sapiens': 3200.0,
 'Saccharomyces cerevisiae': 12.1}

 In []: GenomeSize['Homo sapiens'] = 3201.1

 In []: GenomeSize
Out []: 
{'Arabidopsis thaliana': 157.0,
 'Escherichia coli': 4.6,
 'Homo sapiens': 3201.1,
 'Saccharomyces cerevisiae': 12.1} 

So, in summary,

If your elements/data are ordered, indexed by numbers, and may need to
change, use <span>**lists**</span>

If they are ordered sequences that should not change, use a
<span>**tuple**</span>

If you want to perform set operations on them, use a
<span>**set**</span>

If they are unordered and indexed by keys (e.g., names), use a
<span>**dictionary**</span>

<span>*But why not use dictionaries for everything?*</span> – because it
can slow down your code!

### Copying mutable objects

Copying mutable objects can be tricky. Try this:

So, you need to employ <span>deepcopy</span> to really copy an existing
object or variable and assign a new name to the copy.

When you assign an existing mutable object to a new (“variable”) name,
Python does not copy the underlying object: both names refer to the same
object (“passing by reference”), and even a shallow copy shares any
nested objects with the original. Python does this for computing
performance, because it avoids unnecessary memory copying. That does not
change the fact that shallow vs. deep copying can be confusing, of
course!
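A minimal sketch of this behaviour, using the standard <span>copy</span>
module, is given below. The list contents and variable names are
assumptions chosen for illustration; the point is to compare plain
assignment, a shallow copy, and a deep copy of the same nested list.

```python
import copy

a = [1, 2, [3, 4]]

b = a                  # no copy at all: b is another name for the same list
c = copy.copy(a)       # shallow copy: new outer list, shared inner list
d = copy.deepcopy(a)   # deep copy: the inner list is copied as well

a[0] = 'changed'       # affects a and b only
a[2].append(5)         # affects a, b and c (they share the inner list)

print(a)
print(b)
print(c)
print(d)
```

Only <span>d</span> is fully independent of <span>a</span>, which is why
<span>deepcopy</span> is the safe choice for nested structures.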

### <span>python</span> with strings

One of the things that makes python so useful and versatile, is that it
has a powerful set of inbuilt commands to perform string manipulations.
For example, try these:
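A few commonly used string methods are sketched below. The example
string is an arbitrary assumption, and this is only a small selection of
what is available (in ipython, type a string variable followed by a “.”
and hit tab to see the rest).

```python
s = " this is a string "

n_chars = len(s)                 # length, including the surrounding spaces
stripped = s.strip()             # remove leading and trailing whitespace
hyphenated = s.replace(" ", "-") # replace every space with a hyphen
position = s.find("s")           # index of the first "s" (-1 if absent)
words = s.split()                # split on whitespace into a list of words
joined = "_".join(words)         # glue the words back together

print(n_chars)
print(stripped)
print(hyphenated)
print(position)
print(words)
print(joined)
```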

Writing <span>python</span> code
--------------------------------

Now let’s learn to write and run python code from a <span>.py</span>
file. But first, some guidelines for good code-writing practices
(see [python.org/dev/peps/pep-0008/](python.org/dev/peps/pep-0008/)):

Wrap lines to be $<$80 characters long. You can use parentheses $()$ or
signal that the line continues using a “backslash” $\backslash$

Use either 4 spaces for indentation or tabs, but not both! (I use tabs!)

Separate functions using a blank line

When possible, write comments on separate lines

Make sure you have chosen a particular indent type (space or tab) in
<span>geany</span> (or whatever IDE you are using) — indentation is
all-important in <span>python</span>. Furthermore,

Use “docstrings” to <span>**document how to use the code**</span>, and
<span>**comments to explain why and how the code works**</span>

Naming conventions (bit of a mess, you’ll learn as you go!):

<span>\_internal\_global\_variable</span> (for use inside module only)

<span>a\_variable</span>

<span>SOME\_CONSTANT</span>

<span>a\_function</span>

Never call a variable <span>l</span> or <span>O</span> or
<span>o</span>\
<span> *why not?*</span> – you are likely to confuse it with
<span>1</span> or <span>0</span>!

Use spaces around operators and after commas:\
<span>a = func(x, y) + other(3, 4)</span>

<span>python</span> Input/Output
--------------------------------

Let’s look at importing and exporting data. Make a textfile called
<span>test.txt</span> in <span>Week2/Sandbox/</span> with the following
content (including the empty lines):

In [None]:
First Line
Second Line

Third Line

Fourth Line

Then, type the following in <span>Week2/Code/basic\_io.py</span> (note
the indentation!):

Note the following:

The <span>for line in f</span> is an implicit loop: implicit because
stating the range of things in <span>f</span> to loop over in this way
allows python to handle any kind of object to loop through. For
example, if <span>f</span> was an array of numbers 1 to 10, it would
loop through them; if <span>f</span> is a file, as in the case of the
script above, it will loop through the lines in the file.

<span>if len(line.strip()) &gt; 0</span> checks that the line is not
empty. Try <span>?</span> to see what <span>.strip()</span> does.
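The two points above can be sketched as follows. This is an illustrative
sketch rather than the course script itself: to keep it runnable
anywhere, it first recreates <span>test.txt</span> in a temporary
directory (an assumption; the notes place the file in
<span>Week2/Sandbox/</span>), and the list name <span>non\_blank</span>
is hypothetical.

```python
import os
import tempfile

# Recreate the small text file in a temporary directory (assumed location,
# used only so that this sketch runs anywhere)
path = os.path.join(tempfile.mkdtemp(), 'test.txt')
f = open(path, 'w')
f.write('First Line\nSecond Line\n\nThird Line\n\nFourth Line\n')
f.close()

# Loop over the lines of the file, skipping the blank ones
non_blank = []
f = open(path, 'r')
for line in f:                      # implicit loop over the lines in f
    if len(line.strip()) > 0:       # True only if the line is not empty
        non_blank.append(line.strip())
f.close()

print(non_blank)
```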

The <span>csv</span> package makes it easy to manipulate CSV files (get
<span> testcsv.csv</span> from <span>CMEEMasteRepo</span>). Type the
following script in <span>Week2/Code/basic\_csv.py</span>
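As a rough sketch of how <span>csv.reader</span> is typically used, the
example below writes a tiny CSV file and reads it back row by row. The
file name, column names, and rows are hypothetical and do not reproduce
the contents of <span>testcsv.csv</span>.

```python
import csv
import os
import tempfile

# Write a tiny CSV file to read back (hypothetical columns and rows)
path = os.path.join(tempfile.mkdtemp(), 'example.csv')
f = open(path, 'w')
f.write('Species,BodyMass\nHomo sapiens,70\nMus musculus,0.02\n')
f.close()

# Read the file row by row; each row arrives as a list of strings
rows = []
f = open(path, 'r')
for row in csv.reader(f):
    rows.append(row)
f.close()

header = rows[0]                          # first row holds the column names
species = [row[0] for row in rows[1:]]    # first column of the data rows
print(header)
print(species)
```

Note that every value is read as a string, so numeric columns need an
explicit conversion such as <span>float(row\[1\])</span>.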

Now that you have seen how all-important indentation of python code is,
you might find the <span>ipython %cpaste</span> function very handy, as
it allows you to run fragments of code, indentation and all, directly in
the <span>ipython</span> commandline. Let’s try it. Type the following
code in a temporary file:

In [None]:
for i in range(x):
        if i > 3: #4 spaces or 2 tabs in this case
                print i 

Now, assign some integer value to a variable <span>x</span>:

In [None]:
In [ ]: x = 11

Then,

In [None]:
In [ ]: %cpaste
Pasting code; enter '--' alone on the line to stop or use Ctrl-D.
:for i in range(x):
:    if i > 3: #4 spaces or 2 tabs in this case
:        print i
:--
4
5
6
7
8
9
10

Of course, this code is simple, so directly pasting works as well;
<span>%cpaste</span> is really useful when you have more complex code
fragments you want to try out. See how far you can push direct pasting
before you need <span>%cpaste</span>.

### Writing <span>python</span> functions (or modules)

Now let’s write proper <span>python</span> functions. We will start
with a “boilerplate” code. Type the code below and save as <span>
boilerplate.py</span> in <span>CMEECourseWork/Week2/Code</span>:

#### Running your <span>python</span> code

Now <span>cd </span> to the directory and run the code:

In [None]:
$ cd ~/Documents/../CMEECourseWork/Week2/Code
$ python boilerplate.py

You should see “This is a boilerplate” in your terminal window.

Alternatively, you can use ipython:

In [None]:
$ ipython boilerplate.py

You can also execute a python script file from within the
<span>ipython</span> shell with <span>run MyScript.py</span>. So, enter
<span>ipython</span> from bash, and do:

In [None]:
In [ ]: run boilerplate.py

To run the script from the native python shell, you would use <span>
execfile(“MyScript.py”)</span>.

### Components of the <span>python</span> function

Now let’s look at the elements of your first, boilerplate code:

#### The shebang

Just like UNIX shell scripts, the first “shebang” line tells the
computer where to look for python. It determines the script’s ability to
be executed like a standalone executable without typing python
beforehand in the terminal or when double clicking it in a file manager
(when configured properly to be an executable). It isn’t necessary, but
it is generally put there so that when someone sees the file opened in
an editor, they immediately know what they’re looking at. However, which
shebang line you use is important.

Here, by using <span>\#!/usr/bin/python</span>, we are specifying the
location of the python executable on your machine with which the rest of
the script needs to be interpreted. You may want to use
<span>\#!/usr/bin/env python</span> instead, which will prevent failure
to run if the Python executable on some other machine or distribution
isn’t actually located at <span>/usr/bin/python</span>, but
elsewhere.

#### The Docstring

Triple quotes start a “docstring” comment, which is meant to describe
the operation of the script or a function/module within it. Docstrings
are considered part of the running code, while normal comments are
stripped. Hence, you can access your docstrings at run time. It is a
good idea to have docstrings at the start of every python script and
module, as they can provide useful information to the user and to you as
well, down the line.

You can access the docstring(s) in a script (both for the overall script
and the ones in each of its functions), by importing the function (say,
<span>my\_func</span>), and then typing <span> help(my\_func)</span> in
the python or ipython shell. For example, try <span> import
boilerplate</span> and then <span>help(boilerplate)</span> (but you have
to be in the python or ipython shell).

For more info, see <https://www.python.org/dev/peps/pep-0257>
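To see the difference between a docstring and a normal comment at run
time, consider this small sketch. The function <span>my\_func</span> and
its body are hypothetical; <span>\_\_doc\_\_</span> is the standard
attribute where python stores an object's docstring.

```python
def my_func(x):
    """Return x squared (a hypothetical example function)."""
    # A normal comment: ignored at run time, unlike the docstring above
    return x * x

# The docstring is stored on the function object and can be read at run
# time; help(my_func) displays the same text in the python or ipython shell
print(my_func.__doc__)
```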

#### Internal Variables

“<span>\_\_</span>” signal “internal” variables (never name your
variables so!)

#### Function <span>def</span>initions and “modules”

<span>def</span> indicates the start of a python function; all
subsequent lines must be indented.

It’s important to know that, somewhat confusingly, Pythonistas call a
file containing function <span>def</span>initions and statements (e.g.,
assignments of constant variables) a “module”. There is a practical
reason (there’s always one!) for this. You might want to use a
particular set of python <span>def</span>’s (functions) and statements
either as a standalone function, or use it or subsets of it from other
scripts. So in theory, every function you <span>def</span>ine can be a
sub-module usable by other scripts.

<span>*In other words, <span>def</span>initions from a module can be
imported into other modules and scripts, or into the main module
itself.*</span>

At this juncture, you might also want to know more about a Python
“class”. Have a look at
<http://learnpythonthehardway.org/book/ex40.html> — a nice, intuitive
tutorial that should help you understand functions vs. modules vs.
classes in Python.

The last few lines, including the <span>main</span> function/module are
somewhat esoteric but important; more on this below.

#### Why include <span>\_\_name\_\_ == “\_\_main\_\_”</span> and all that jazz

When you run a Python module with or without arguments, the code in the
called module will be executed just as if you imported it, but with the
<span>\_\_name\_\_</span> set to <span>“\_\_main\_\_”</span>. So adding
this code at the end of your module,

In [None]:
if (__name__ == "__main__"):

checks whether the special <span>\_\_name\_\_</span> variable has the
value “<span>\_\_main\_\_</span>”, and runs the code under it only if it
does. The file is thus usable as a script as well as an importable
module. How do you import? Simply as (in python or ipython shell):

In [None]:
In []: import boilerplate

Then type

In [None]:
In []: boilerplate
Out[]: <module 'boilerplate' from 'boilerplate.py'>

One more script to hopefully clarify this further. Type and save the
following in a script file called <span>using\_name.py</span>:

Now run it:

In [None]:
In []: run using_name.py
This program is being run by itself

Now, try:

In [None]:
In []: import using_name
I am being imported from another module

The output <span>I am being imported from another module</span> will
only show up once.
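A sketch of the logic that would produce the two messages above is given
below. The helper function <span>describe</span> is an assumption
introduced so that both branches can be tried directly; a script could
equally test <span>\_\_name\_\_</span> in a plain
<span>if</span>/<span>else</span> at the top level.

```python
def describe(name):
    """Report how a module is being used, given its __name__ value."""
    if name == '__main__':
        return 'This program is being run by itself'
    else:
        return 'I am being imported from another module'

# __name__ is '__main__' when the file is run directly, and the module's
# own name (e.g. 'using_name') when it is imported
print(describe(__name__))
```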

Also please look up <https://docs.python.org/2/tutorial/modules.html>

#### What on earth is <span>sys.argv</span>?

In your boilerplate code, as in any other Python code, <span>argv</span>
is the “argument variable”. Such variables are necessarily very common
across programming languages, and play an important role:
<span>argv</span> is a variable that holds the arguments you pass to
your Python script when you run it. <span>sys.argv</span> is simply an
object created by python using the <span>sys</span> module (which you
imported at the beginning of the script) that contains the script name
followed by the arguments passed to the current script.

To understand this in a practical way, let’s write and save a script
called <span>sysargv.py</span>:

Now run <span>sysargv.py</span> with different numbers of arguments:

In [None]:
run sysargv.py
run sysargv.py var1 var2
run sysargv.py 1 2 var3

As you can see, the first variable is always the file name, and is
always available to the Python interpreter.
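A sketch of a script along these lines might look as follows. The helper
<span>describe\_args</span> and the wording of its messages are
assumptions, not the course script; the only thing relied upon is that
<span>sys.argv</span> is a list of strings whose first element is the
script name.

```python
import sys

def describe_args(argv):
    """Summarise an argument list such as sys.argv (hypothetical helper)."""
    return ['This is the name of the script: ' + argv[0],
            'Number of arguments: ' + str(len(argv)),
            'The arguments are: ' + str(argv)]

# Report on the arguments this script was actually started with
for line in describe_args(sys.argv):
    print(line)
```

Note that the count includes the script name itself, and that all
arguments arrive as strings (so <span>1</span> is passed as
<span>'1'</span>).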

Then, the command <span>main(argv=sys.argv)</span> directs the
interpreter to pass the argument variables to the main function. Which
brings us to,

In [None]:
def main(argv):
    print 'This is a boilerplate' # NOTE: indented using two tabs or four spaces

This is the main function. Arguments obtained in the <span>if
(\_\_name\_\_ == “\_\_main\_\_”):</span> part of the script are “fed” to
this main function where the printing of the line “This is a
boilerplate” happens.

OK, finally, what about this bit:

In [None]:
sys.exit(status) 

It’s just a way to terminate and exit the Python program in an explicit
manner, returning an appropriate status code. In this case, we have
decided that <span>main()</span> returns 0 on a successful run, so
<span> sys.exit(status)</span> will return zero indicating “successful
termination”. Try putting <span>sys.exit(“I am exiting right now!”)
</span> in other places in <span>boilerplate.py</span> and see what
happens.

### Variable scope

One important thing to note about functions, in any language, is that
variables created inside a function are invisible outside of it, and do
not persist once the function has run. These are called “local”
variables, and are only accessible inside their function. However,
“global” variables are visible inside and outside of functions. In
python, you can assign global variables. Type the following script in
<span> scope.py</span> and try it:

However, in general, avoid assigning globals because you run the risk of
“exposing” unwanted variables to all functions within your
workspace/namespace.

Control statements
------------------

OK, let’s get deeper into <span>python</span> functions. To begin, first
copy and rename <span>boilerplate.py</span> (to make use of its
existing structure and save you some typing):

In [None]:
$ cp boilerplate.py control_flow.py
$

Then type the following script into <span>control\_flow.py</span>:

Now run the code:

In [None]:
In []: run control_flow.py

You can also call any of the functions within
<span>control\_flow.py</span>:

In [None]:
In []: even_or_odd(11)
Out[]: '11 is Odd!'

This is possible without explicitly importing the modules because you
are only running one script. You would have to do an explicit <span>
import</span> if you needed a module from another python script file.
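For reference, a function that behaves like the
<span>even\_or\_odd</span> call shown above could be sketched as
follows. This is an assumed implementation that reproduces the output
format shown, not necessarily the one in the course script.

```python
def even_or_odd(x=0):
    """Find whether a number x is even or odd."""
    if x % 2 == 0:   # the remainder of x / 2 is zero for even numbers
        return "%d is Even!" % x
    return "%d is Odd!" % x

print(even_or_odd(22))
print(even_or_odd(11))
```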

### Control flow exercises

\[$\quad\star$\]

Write the following, and save them to <span>cfexercises.py</span>.

Now try these <span>*function by function*</span>, pasting the block in
the ipython command line (hopefully you have set your code editor to
send a selection to the commandline by now)

Loops
-----

Write the following, and save them to <span>loops.py</span>.
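As a minimal sketch of the two basic looping constructs (illustrative
only; the variable names and values are assumptions), a
<span>for</span> loop over a range and a <span>while</span> loop with a
stopping condition might look like this:

```python
# for loop: visit each element of a sequence in turn
squares = []
for i in range(5):
    squares.append(i ** 2)

# while loop: repeat until the condition becomes False
z = 0
while z < 100:
    z = z + 15

print(squares)
print(z)
```

A <span>while</span> loop whose condition never becomes
<span>False</span> runs forever; use <span>ctrl+c</span> to stop it.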

![In case you were wondering who Geronimo
was.](Geronimo.jpg){width=".5\textwidth"}

### List comprehensions

Python offers a way to combine loops, functions and logical tests in a
single line of code. Type the following in a script file called <span>
oaks.py</span>:

Don’t go mad with list comprehensions — code readability is more
important than squeezing lots into a single line!
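A sketch of the idea, comparing an explicit loop with the equivalent
list comprehension, is given below. The species list and the helper
<span>is\_an\_oak</span> are hypothetical examples and do not reproduce
the course script.

```python
# Hypothetical data: a few species names, some of which are oaks (Quercus)
taxa = ['Quercus robur', 'Fraxinus excelsior', 'Pinus sylvestris',
        'Quercus cerris', 'Quercus petraea']

def is_an_oak(name):
    """Return True if the genus in a species name is Quercus."""
    return name.lower().startswith('quercus ')

# Using an explicit loop
oaks_loop = []
for species in taxa:
    if is_an_oak(species):
        oaks_loop.append(species)

# The same result with a list comprehension
oaks_lc = [species for species in taxa if is_an_oak(species)]

# A comprehension can also transform each element it keeps
oaks_upper = [species.upper() for species in taxa if is_an_oak(species)]

print(oaks_lc)
print(oaks_upper)
```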

Practicals
----------

As always, test, add, commit and push all your new code and data to your
git repository.

1.  Modify <span>cfexercises.py</span> to make it a “module” like <span>
    control\_flow.py</span>. That is, all the <span>fooXX</span>
    functions should take arguments from the user (like the functions
    inside <span> control\_flow.py</span>). Also, add some test arguments
    to show that they work (again, like <span>control\_flow.py</span>):
    for example, “foo5(10)”. Thus, running <span>cfexercises.py</span>
    should now also output evaluations of all the <span>fooXX</span>
    modules along with a bunch of hellos.

2.  Open and complete the tasks in <span>lc1.py</span>,
    <span>lc2.py</span>, <span>dictionary.py</span>,
    <span>tuple.py</span> (you can tackle them in any order)

Functions, Modules, and code compartmentalization
-------------------------------------------------

Ideally you should aim to compartmentalize your code into a bunch of
functions, typically written in a single <span>.py</span> file: these
are Python “modules”, which you were introduced to previously. Why
bother with modules? Because:

Keeping code compartmentalized is good for debugging, unit testing, and
profiling (coming up later)

Makes code more compact by minimizing redundancies (write repeatedly
used code segments as a module)

Allows you to import and use useful functions that you yourself wrote,
just like you would from standard python packages (coming up)

### Importing Modules

There are different ways to <span>**import**</span> a module:

<span>import my\_module</span>, then functions in the module can be
called as\
<span>my\_module.one\_of\_my\_functions()</span>.

<span>from my\_module import my\_function</span> imports only the
function <span>my\_function</span> in the module
<span>my\_module</span>. It can then be called as if it were part of the
main file: <span>my\_function()</span>.

<span>import my\_module as mm</span> imports the module
<span>my\_module</span> and calls it <span>mm</span>. Convenient when
the name of the module is very long. The functions in the module can be
called as <span>mm.one\_of\_my\_functions()</span>.

<span>from my\_module import \*</span>. Avoid doing this!\
<span>*Why?*</span> – to avoid name conflicts!

You can also access variables written into modules: <span>import
my\_module</span>, then\
<span>my\_module.one\_of\_my\_variables</span>
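The import styles above can be tried out straight away using the standard `math` module as a stand-in for the hypothetical `my_module` (the stand-in and the variable names are choices made for this sketch):

```python
# Style 1: import the module, call functions through its name
import math
print(math.sqrt(9.0))

# Style 2: import one function, call it as if it were defined here
from math import sqrt
print(sqrt(9.0))

# Style 3: import the module under a shorter alias
import math as m
print(m.sqrt(9.0))

# Variables defined in a module are accessed in the same way
print(math.pi)

# Why wildcard imports are risky: an imported name silently replaces
# one of your own. Here a single name is imported to show the effect;
# `from math import *` would do this for every public name in math.
pi = 3  # a value defined earlier in your own script
from math import pi
print(pi)  # no longer 3
```

With the first and third styles, the module name (or alias) acts as a prefix, so your own names and the module's names cannot collide.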

Python packages
---------------

A Python package is simply a directory of Python modules (quite like an
<span>R</span> package). Many packages, such as the following that I
find particularly useful, are always available as standard libraries
(just require <span>import</span> from within python or ipython):

<span>io</span>: file input-output with <span>.csv</span>,
<span>.txt</span>, etc.

<span>subprocess</span>: to run other programs, including several at
the same time, and to access operating system-dependent functionality

<span>sqlite3</span>: for manipulating and querying <span>sqlite</span>
databases

<span>math</span>: for mathematical functions
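As a small sketch of how little is needed to use standard library packages, the snippet below touches three of those listed above. The data and variable names are invented for illustration, and the database lives only in memory.

```python
import io
import math
import sqlite3

# math: mathematical functions and constants
area = math.pi * 2.0 ** 2  # area of a circle of radius 2

# io: an in-memory text stream that behaves like an open text file
buffer = io.StringIO("Quercus robur\nFagus sylvatica\n")
species = [line.strip() for line in buffer]

# sqlite3: create, fill and query a temporary in-memory database
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE taxa (genus TEXT, species TEXT)")
con.execute("INSERT INTO taxa VALUES (?, ?)", ("Quercus", "robur"))
rows = con.execute("SELECT genus, species FROM taxa").fetchall()
con.close()

print(area)
print(species)
print(rows)
```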

Scores of other packages are accessible by explicitly installing them
using\
<span>sudo apt-get install python-packagename</span> (as you did
previously) or by using <span>pip</span>. Some particularly notable
ones are:

<span>scipy</span> (<http://scipy.org>) contains a wide array of
numerical tools for scientific computing, and builds on
<span>NumPy</span> for efficient data crunching

<span>matplotlib</span>: for plotting (very matlab-like, requires
<span>numpy</span>) (all packaged in <span>pylab</span>)

<span>pandas</span>: provides a powerful set of methods for manipulating
data, and comes with a DataFrame object similar to the <span>R</span>
data frame.

<span>scikit-learn</span> (<http://scikit-learn.org/>): for applying
different machine learning algorithms to data

<span>ipython</span>: an enhanced python terminal (which we are currently
using!)

<span>jupyter</span>: an interactive notebook environment for exploratory
data analysis, visualization, and creation of interactive documents that
can be shared. This course is in the process of being written entirely
in Jupyter notebooks.

<span>scrapy</span>: for writing web spiders that crawl web sites and
extract data from them

<span>beautifulsoup</span>: for parsing HTML and XML (can do what
<span>scrapy</span> does)

<span>biopython</span>: for bioinformatics

Of course, you have already installed some of these (<span>scipy</span>,
<span> matplotlib</span>).

For those of you interested in bioinformatics, the
<span>biopython</span> package will be particularly useful. We will not
cover bioinformatics in any depth within the python weeks, but you may
want to try to use Python for bioinformatics in other weeks, especially
the Genomics weeks, and perhaps use it for your own research projects. I
suggest that if bioinformatics is your thing, check out
<span>biopython</span> — in particular the worked examples at
<http://biopython.org/DIST/docs/tutorial/Tutorial.html>.

Practicals
----------

As always, test, add, commit and push all your new code and data to your
git repository.

#### Align DNA sequences

Align two DNA sequences such that they are as similar as possible.

The idea is to start with the longest string and try to position the
shorter string in all possible positions. For each position, count a
“score” : number of bases matched perfectly over the number of bases
attempted. Your tasks:

1.  Open and run <span>Practicals/Code/align\_seqs.py</span> and make
    sure you understand what each line is doing.

2.  Now convert <span>align\_seqs.py</span> to a Python function that
    takes the DNA sequences as an input from a single external file and
    saves the best alignment along with its corresponding score in a
    single text file (your choice of format and file type) to an
    appropriate location. No external input should be needed; that is,
    you should still only need to use
    <span>python align\_seqs.py</span> to run it.

    For example, the input file can be a single <span>.csv</span> file
    with the two example sequences given at the top of the
    original script.

    <span>*Don’t forget to add docstrings
    where necessary/appropriate.*</span>

3.  Extra Credit: align all the <span>.fasta</span> sequences from Week
    1; call the new script\
    <span>align\_seqs\_fasta.py</span>. Unlike
    <span>align\_seqs.py</span>, this script should take
    <span>*any*</span> two fasta sequences (in separate files) to be
    aligned as input. So this script would typically run by using
    explicit inputs, by calling something like
    <span>python align\_seqs\_fasta.py seq1.fasta seq2.fasta</span>.
    However, it should still run if no inputs were given, using two
    fasta sequences from <span>Data</span> as defaults.
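The scoring idea described at the start of this practical (slide the shorter sequence along the longer one and count matching bases at each position) can be sketched as follows. This is not the course's <span>align\_seqs.py</span>: the function names, the toy sequences, and the choice to rank positions by the raw number of matches are all assumptions made for illustration.

```python
def score_at(long_seq, short_seq, start):
    """Place short_seq at offset `start` on long_seq.

    Returns (matched, attempted): the number of identical bases and the
    number of bases that could be compared at this offset.
    """
    overlap = long_seq[start:start + len(short_seq)]
    matched = sum(a == b for a, b in zip(overlap, short_seq))
    return matched, len(overlap)


def best_alignment(seq1, seq2):
    """Return (start, matched, attempted) for the best-scoring offset."""
    # Make sure the longer sequence is the one we slide along
    if len(seq1) >= len(seq2):
        long_seq, short_seq = seq1, seq2
    else:
        long_seq, short_seq = seq2, seq1

    best_start, best_matched, best_attempted = 0, -1, 0
    for start in range(len(long_seq)):
        matched, attempted = score_at(long_seq, short_seq, start)
        if matched > best_matched:  # ties keep the earliest offset
            best_start, best_matched, best_attempted = start, matched, attempted
    return best_start, best_matched, best_attempted


start, matched, attempted = best_alignment("AAACGTAA", "CGT")
print("." * start + "CGT")
print("AAACGTAA")
print("Best score: %d of %d bases matched" % (matched, attempted))
```

Ranking by matches divided by bases attempted instead would favour offsets where the short sequence overhangs the end and only one or two bases are compared, so the raw count is used here.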

Errors in your <span>python</span> code
---------------------------------------

What do you want from your code? Rank the following by importance:

1.  it is very fast

2.  it gives me the right answer

3.  it is easy to read

4.  it uses lots of ’clever’ programming techniques

5.  it uses cool features of the language

Then, think about this:

If you are <span>*very lucky*</span>, your program will crash when you
run it

If you are <span>*lucky*</span>, you will get an answer that is
obviously wrong

If you are <span>*unlucky*</span>, you won’t notice until after
publication

If you are <span>*very unlucky*</span>, someone else will notice it
after publication

Ultimately, most of your time could well be spent checking for errors
and fixing them (“debugging”), not writing code. You can debug when
errors appear, but why not just nip as many as you can in the bud? For
this, you would use unit testing.

### Unit testing

Unit testing prevents the most common mistakes and helps write reliable
code. Indeed, there are many reasons for testing:

Can you prove (to yourself) that your code does what you think it does?

Did you think about the things that might go wrong?

Can you prove to other people that your code works?

Does it still all work if you fix a bug?

Does it still all work if you add a feature?

Does it work with that new dataset?

Does it work on the latest version of the language (e.g., Python 3.x
vs. 2.7.x)?

Does it work on Mac? on Linux? on Windows?

Does it work on 64 bit <span>*and*</span> 32 bit?

Does it work on an old version of a Mac?

Does it work on Harvey, or Imperial’s Linux cluster?

The idea is to write *independent* tests for the <span>*smallest
units*</span> of code. Why the smallest units? — to be able to retain
the tests upon code modification.

#### Unit testing with <span>doctest</span>

Let’s try <span>doctest</span>, the simplest testing tool in python:
simple tests for each function are embedded in the docstring. Copy the
file <span>control\_flow.py</span> into the file
<span>test\_control\_flow.py</span> and edit the original function so:
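A minimal sketch of such a file is given here. The four tests are the ones that appear in the verbose output shown in this section; the function body is an assumption that is consistent with those expected results.

```python
import doctest  # the testing module


def even_or_odd(x=0):
    """Find whether a number x is even or odd.

    >>> even_or_odd(10)
    '10 is Even!'

    >>> even_or_odd(5)
    '5 is Odd!'

    Fractional numbers are truncated when reported:
    >>> even_or_odd(3.2)
    '3 is Odd!'

    Negative numbers work too:
    >>> even_or_odd(-2)
    '-2 is Even!'
    """
    if x % 2 == 0:
        return "%d is Even!" % x
    return "%d is Odd!" % x


if __name__ == "__main__":
    # Run every test embedded in the docstrings of this file
    doctest.testmod()
```

Each `>>>` line is executed as if typed at the prompt, and the line below it is the output that must be reproduced exactly for the test to pass.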

Now type <span>run test\_control\_flow.py -v</span> :

In [None]:
In []: run  test_control_flow.py -v
Trying:
    even_or_odd(10)
Expecting:
    '10 is Even!'
ok
Trying:
    even_or_odd(5)
Expecting:
    '5 is Odd!'
ok
Trying:
    even_or_odd(3.2)
Expecting:
    '3 is Odd!'
ok
Trying:
    even_or_odd(-2)
Expecting:
    '-2 is Even!'
ok
1 items had no tests:
    __main__
1 items passed all tests:
   4 tests in __main__.even_or_odd
4 tests in 2 items.
4 passed and 0 failed.
Test passed.
    

You can also run doctest “on the fly”, without writing <span>
doctest.testmod()</span> in the code by typing in a terminal:
<span>python -m doctest -v your\_function\_to\_test.py</span>

<span>*Other unit testing approaches*</span>

For more complex testing, see the documentation of <span>doctest</span>
at <https://docs.python.org/2/library/doctest.html>, the package
<span>nose</span>, and the package <span>unittest</span>.
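For comparison, here is a rough sketch of how similar checks might look with <span>unittest</span>, where tests live in a separate class rather than in the docstring. The body of `even_or_odd` and the test names are assumptions made for this example.

```python
import unittest


def even_or_odd(x=0):
    """Return a string saying whether x is even or odd."""
    if x % 2 == 0:
        return "%d is Even!" % x
    return "%d is Odd!" % x


class TestEvenOrOdd(unittest.TestCase):
    """One small, independent test per behaviour."""

    def test_even(self):
        self.assertEqual(even_or_odd(10), "10 is Even!")

    def test_odd(self):
        self.assertEqual(even_or_odd(5), "5 is Odd!")

    def test_negative(self):
        self.assertEqual(even_or_odd(-2), "-2 is Even!")


# Collect the tests in the class and run them, reporting each by name
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestEvenOrOdd)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

In a stand-alone test file, the last two lines are usually replaced by a call to `unittest.main()` under an `if __name__ == "__main__":` guard.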

Please start testing as early as possible, but don’t try to test
everything either! Remember, it is easier to test if code is
compartmentalized into functions.

### Debugging

OK, so you unit-tested, let’s go look at life through beer-goggles...
BUT NO! YOU WILL VERY LIKELY RUN INTO BUGS!

Bugs happen, inevitably, in life and programming. You need to find and
debug them. Banish all thoughts of littering your code with
<span>print</span> statements to find bugs.

Enter the debugger. In ipython, the command <span>%pdb</span> turns on
automatic calling of the python debugger whenever an error occurs. Type
the following in a file and save as <span>debugme.py</span> in your
<span>Code</span> directory:
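The contents of <span>debugme.py</span> can be read off the debugger listing in this section: a function `createabug(x)` that divides by zero, followed by the bare call `createabug(25)`. The sketch below reproduces the function. The `try`/`except` around the call is an addition so that the snippet runs to completion; in <span>debugme.py</span> itself, leave the call bare so that the script crashes.

```python
def createabug(x):
    y = x**4
    z = 0.
    y = y/z  # z is zero, so this line raises ZeroDivisionError
    return y


# In debugme.py the final line is simply: createabug(25)
# It is wrapped here (an assumption for illustration) so that the
# sketch reports the error instead of halting.
try:
    createabug(25)
except ZeroDivisionError as err:
    print("createabug(25) failed:", err)
```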

Now run it:

In [None]:
In []: %run debugme.py
[lots of text]
createabug(x)
      2     y = x**4
      3     z = 0.
----> 4     y = y/z
      5     return y
      6 

ZeroDivisionError: float division by zero
  

OK, so let’s <span>%pdb</span> it

In [None]:
In []: %pdb
Automatic pdb calling has been turned ON

In []: run debugme.py
[lots of text]
ZeroDivisionError: float division by zero
> createabug()
      3     z = 0.
----> 4     y = y/z
      5     return y

ipdb> 

Now we’re in the debugger shell (in “normal” python, you would use
<span>pdb</span> instead of <span>ipdb</span>), and can use the
following commands to navigate and test the code line by line or block
by block:

  -------------------------- -------------------------------------------------------------------------------------------------------------------
  <span>n</span>             move to the next line
  <span>ENTER</span>         repeat the previous command
  <span>s</span>             “step” into function or procedure (i.e., continue the debugging inside the function, as opposed to simply run it)
  <span>p x</span>           print variable x
  <span>pp locals()</span>   pretty print all variables and objects in current workspace scope
  <span>c</span>             continue until next break-point
  <span>q</span>             quit
  <span>l</span>             print the code surrounding the current position (you can specify how many)
  <span>r</span>             continue until the end of the function
  -------------------------- -------------------------------------------------------------------------------------------------------------------

So let’s continue our debugging:

In [None]:
ipdb> p x
25
ipdb> p y
390625
ipdb> p z
0.0
ipdb> p y/z
*** ZeroDivisionError: ZeroDivisionError
('float division by zero',)
ipdb> l
      1 def createabug(x):
      2     y = x**4
      3     z = 0.
----> 4     y = y/z
      5     return y
      6 
      7 createabug(25)

ipdb> q

In []: %pdb
Automatic pdb calling has been turned OFF
  

Once in the debugger, use <span>pp locals()</span> and/or <span>pp
globals()</span> to see all local or global objects (including variables
and functions) available at the point where the debugger stopped in the
script. <span>pp</span> stands for “pretty print”.

### Paranoid programming: debugging with breakpoints

You may want to pause the program run and inspect a given line or block
of code (<span>*why?*</span> — impromptu unit-testing is one reason). To
do so, simply put this snippet of code where you want to pause and start
a debugging session and then run the program again:

In [None]:
import ipdb; ipdb.set_trace()

Or, you can use <span>import pdb; pdb.set\_trace()</span>

Alternatively, running the code with the flag <span>%run -d</span>
starts a debugging session from the first line of your code (you can
also specify the line to stop at). If you are serious about programming,
please start using a debugger (R, Python, whatever...)!
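To make the placement of the `set_trace()` snippet concrete, here is a small sketch using the standard <span>pdb</span>. The function `count_oaks` and the `PAUSE` switch are hypothetical, introduced only so that the example runs straight through unless you deliberately turn the breakpoint on.

```python
import pdb

PAUSE = False  # hypothetical switch: set to True to stop at the breakpoint


def count_oaks(taxa):
    """Count the names in `taxa` whose genus is Quercus."""
    n = 0
    for name in taxa:
        if PAUSE:
            # Execution pauses here on every pass through the loop;
            # inspect `name` and `n` with `p name`, `p n`, then `c`.
            pdb.set_trace()
        if name.lower().startswith("quercus "):
            n += 1
    return n


print(count_oaks(["Quercus robur", "Fagus sylvatica", "Quercus petraea"]))
```

Putting the breakpoint inside the loop lets you watch the variables change from one iteration to the next, which is often where a bug first becomes visible.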

Practicals
----------

As always, test, add, commit and push all your new code and data to your
git repository.

#### Missing oaks problem

1.  Open and run the code <span>test\_oaks.py</span> — there’s a bug,
    for no oaks are being found! (where’s
    <span>TestOaksData.csv</span>?)

2.  Fix the bug (hint: <span>import ipdb; ipdb.set\_trace()</span>)

3.  Now, write doctests to make sure that, bug or no bug, your
    <span>is\_an\_oak</span> function is working as expected (hint:
    <span>>>> is\_an\_oak('Fagus sylvatica')</span> should return
    <span>False</span>)

4.  If you wrote good doctests, you will note that you found another
    error that you might not have come across just by debugging (hint:
    what happens if you try the doctest with 'Quercuss' instead
    of 'Quercus'?). How would you fix the new error you found using the
    doctest?

Practicals wrap-up
------------------

1.  Review and make sure you can run all the commands, code fragments,
    and scripts we have covered so far and get the expected outputs; all
    scripts should work on any other linux laptop.

2.  Run <span>boilerplate.py</span> and <span>control\_flow.py</span>
    from the bash terminal instead of from within the ipython shell (try
    both python and ipython from the bash)

3.  Include an appropriate docstring (if one is missing) at the
    beginning of <span>*each*</span> of the python script / module
    files you have written, as well as at the start of every
    function (or sub-module) in a module.

4.  Also annotate your code lines as much and as often as necessary
    using \#.

5.  Keep all code files organized in <span>
    CMEECourseWork/Week2/Code</span>

*<span>git add</span>, <span>commit</span> and <span>push</span> all
your code and data to your git repository by next Wednesday 5 PM.*

Readings and Resources
----------------------

Code like a Pythonista: Idiomatic <span>python</span> (Google it)

Also good: the Google <span>python</span> Style Guide

Browse the python tutorial: <https://docs.python.org/3/tutorial/>

For functions and modules:\
<https://learnpythonthehardway.org/book/ex40.html>

For IPython:\
<http://ipython.org/ipython-doc/stable/interactive/tips.html>

Cookbooks can be very useful: <https://github.com/ipython/ipython/wiki>

Look up <https://docs.python.org/2/library/index.html> – Read about the
packages you think will be important to you

Some of you might find the python package <span>biopython</span>
particularly useful — check out <http://biopython.org/>, and especially,
the cookbook

In general, scores of good module/package-specific cookbooks are out
there — google “cookbook” along with the name of the package you are
interested in (e.g., “scipy cookbook”).