#Packaging and distribution

##Background

A **module** is a collection of useful functions and classes (actually just a Python file, as we'll see). 

**Built in modules** are ones that are part of the standard Python distribution (`re`, `os`, `datetime`, etc.)

**Third party modules** are ones that have to be installed separately (including modules which you write).

**Packages** have two purposes. They can collect a bunch of different related modules together (e.g. BioPython). They also make it easy to **distribute** Python modules, i.e. they make it easy for other people to install them.

Other languages usually call these **libraries**; in Python they are called **modules**.

##Built in modules
Hopefully you already know how to use built in modules:

In [None]:
import random

# simulate a die roll
print(random.choice(range(1,7)))

Where does the code for this actually live?

In [None]:
random.__file__

`.pyc` means a compiled Python file, but if we drop the `c` from the end and open up the `.py` we can prove that it's just normal Python code.

Notice that the name of the module is just the name of the file without the extension. 

How does Python know what folders to look for the file in?

In [None]:
import sys
print(sys.path)
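
To make the search concrete, here is a rough sketch of the lookup for a plain `.py` module. The helper name `find_module_file` is made up for this illustration, and the real import system does more (packages, compiled files, zip archives), so treat it as a simplification rather than a description of what Python actually does:

```python
import os
import sys


def find_module_file(module_name):
    """Return the first <module_name>.py found in the folders on sys.path, or None.

    Hypothetical helper for illustration only: it ignores packages,
    compiled modules and zip archives.
    """
    for folder in sys.path:
        # an empty entry in sys.path stands for the current directory
        candidate = os.path.join(folder or os.getcwd(), module_name + ".py")
        if os.path.isfile(candidate):
            return candidate
    return None


random_path = find_module_file("random")
print(random_path)
```

The folders are tried in order, so the first match wins.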

##Third party modules

This is where things get complicated. To install a new module, we could just copy the file to one of the folders in the path, but that leaves some problems unsolved:
- how to remove old modules
- how to update modules when new versions become available
- how to ensure module dependencies are met (A->B->C,D)
- how to install extra command-line tools (we'll see later)

Solution: `pip`, the Python package manager. 

`pip install mymodule`
`pip install --upgrade mymodule`
`pip uninstall mymodule`

How does `pip` know about modules? It uses the Python Package Index, PyPI (more later).



##Making a new module

Hello old friend...

In [None]:
from __future__ import division 
 
# calculate the AT content 
def calculate_at(dna): 
    length = len(dna) 
    a_count = dna.count('A') 
    t_count = dna.count('T') 
    at_content = (a_count + t_count) / length 
    return at_content 

Let's create a module to store this function. All we have to do is create a new file called `at_calculator.py` and move the function into it.

Now we can `import` and use it just like any other module.

In [None]:
import at_calculator

In [None]:
at_calculator.calculate_at('TAGCTCGACTAGCTA')

Note that the file it's reading from is in the current directory:

In [None]:
at_calculator.__file__

so this will only work as long as we are in the same folder. We could copy it to one of the path folders to use it universally (but we will see a better way).
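
The link between folders and imports can be sketched in a self-contained way. The example below writes a tiny stand-in module (the name `at_demo` is made up for this sketch) into a temporary folder and shows that the import only works once that folder is on `sys.path`. Editing `sys.path` by hand like this is for illustration only.

```python
import os
import sys
import tempfile

# source for a tiny stand-in module (hypothetical name: at_demo)
module_source = '''
def calculate_at(dna):
    """Return the AT content of an uppercase DNA sequence."""
    return (dna.count("A") + dna.count("T")) / float(len(dna))
'''

# write the module into a fresh folder that is not yet on the search path
module_folder = tempfile.mkdtemp()
with open(os.path.join(module_folder, "at_demo.py"), "w") as module_file:
    module_file.write(module_source)

# once the folder is on sys.path, the import works from any working directory
sys.path.append(module_folder)
import at_demo

at_content = at_demo.calculate_at("TAGCTCGACTAGCTA")
print(at_content)
print(at_demo.__file__)
```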

We can write a small program that uses the module:

In [None]:
import at_calculator 
 
# ask the user for a DNA sequence and filename 
dna = raw_input("Enter a DNA sequence:\n")
output_filename = raw_input("Enter a filename:\n")
 
# write the AT content to the output file 
with open(output_filename, "w") as out: 
    out.write(str(at_calculator.calculate_at(dna))) 

###Names and namespaces

Notice how we need to call `modulename.functionname()` i.e. `at_calculator.calculate_at()`. Annoying, but good because it means each module has its own **namespace** - authors (including you!) don't have to worry about what names might have been used by other people. 

For long names this gets annoying:

In [None]:
import some_incredibly_long_awkward_to_type_module_name
some_incredibly_long_awkward_to_type_module_name.foo()

We can get around it by using an alias:

In [None]:
import some_incredibly_long_awkward_to_type_module_name as bob
bob.foo()

If we're really sure that names don't clash we can import the function directly:

In [None]:
from at_calculator import calculate_at
calculate_at("ACTGATCGTCGAT")

If we know that names do clash, we can import functions with aliases:

In [None]:
from at_calculator import calculate_at as at1
from another_package import calculate_at as at2

But don't do this if you're expecting another person to read the code.
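
To see what a clash looks like in practice, here is a small sketch using two built in modules, `math` and `cmath`, which both define a function called `sqrt()`. The alias names `real_sqrt` and `complex_sqrt` are made up for this example.

```python
import math
import cmath

# both modules define sqrt(), but the module prefix keeps them apart
print(math.sqrt(4))
print(cmath.sqrt(-4))

# importing both names directly makes the second silently replace the first
from math import sqrt
from cmath import sqrt
print(sqrt is cmath.sqrt)

# aliases keep both functions available under different names
from math import sqrt as real_sqrt
from cmath import sqrt as complex_sqrt
print(real_sqrt(4))
print(complex_sqrt(-4))
```

Python gives no warning when the second import replaces the first, which is why the module prefix is the safer default.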

###Documenting modules

Python's docstring system is very simple. Include a single string literal as the first statement of a module or function. Triple quotes let us run over multiple lines:

In [31]:
"""Functions for calculating metrics of DNA sequences"""

from __future__ import division 

def calculate_at(dna): 
    """Return the AT content of the argument. 
    Only works for uppercase DNA sequences
    """
    
    length = len(dna)
    a_count = dna.count('A') 
    t_count = dna.count('T') 
    at_content = (a_count + t_count) / length 
    return at_content 



In [38]:
# in iPython we have to explicitly reload the module after making changes
reload(at_calculator)
help(at_calculator)

Help on module at_calculator:

NAME
    at_calculator - Functions for calculating metrics of DNA sequences

FILE
    /home/martin/Dropbox/projects/course_notebooks/eg_ap/at_calculator.py

FUNCTIONS
    calculate_at(dna)
        Return the AT content of the argument. 
        Only works for uppercase DNA sequences

DATA
    division = _Feature((2, 2, 0, 'alpha', 2), (3, 0, 0, 'alpha', 0), 8192...
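
The text that `help()` displays is stored on the object itself, so we can also read it from code. A minimal sketch, repeating a shortened version of the function so that it runs on its own:

```python
import inspect


def calculate_at(dna):
    """Return the AT content of the argument.
    Only works for uppercase DNA sequences
    """
    return (dna.count('A') + dna.count('T')) / float(len(dna))


# the docstring is stored on the function as the __doc__ attribute
print(calculate_at.__doc__)

# inspect.getdoc() returns the same text with indentation cleaned up
doc_lines = inspect.getdoc(calculate_at).splitlines()
print(doc_lines[0])
```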




###Modules as programs

Sometimes we want to have a Python file that can act either as a module, or as a program:
- if there's currently only a single program that uses the module
- if it's handy to have a small demonstration program for the module
- to include a program which tests the module

A Python file can access its own name using the `__name__` variable.
- If the file is being run as a program, then the `__name__` variable is `__main__`
- If the file is being imported, then the `__name__` variable is the name of the file minus `.py`

Here's how it works:

In [40]:
if __name__ == "__main__":
    print("I am being run as a script!")    
else:
    print("I am being imported as a module!")

I am being run as a script!


Most of the time we don't need to do anything special when the file is being imported as a module, so we leave out the `else` and it looks like:

In [None]:
if __name__ == "__main__":
    print("I am being run as a script!")   
    # demo program code goes here


So for our purposes:

In [None]:
from __future__ import division 

# calculate the AT content 
def calculate_at(dna): 
    length = len(dna) 
    a_count = dna.count('A') 
    t_count = dna.count('T') 
    at_content = (a_count + t_count) / length 
    return at_content 

if __name__ == "__main__": 
    dna = raw_input("Enter a DNA sequence:\n").rstrip("\n") 
    print("AT content is " + str(calculate_at(dna))) 

This is hard to demonstrate inside iPython. 
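
One way to see both cases from inside a notebook is to write a small file and start separate Python processes that run it and import it. This is only a sketch: the file name `name_demo.py` is made up, and it assumes that `sys.executable` points at a working Python interpreter.

```python
import os
import subprocess
import sys
import tempfile

# a one-line file that reports its own __name__
source = 'print("my __name__ is " + __name__)\n'

folder = tempfile.mkdtemp()
path = os.path.join(folder, "name_demo.py")
with open(path, "w") as demo_file:
    demo_file.write(source)

# make sure the child processes can find the file when importing it
env = dict(os.environ, PYTHONPATH=folder)

# case 1: run the file as a program
run_output = subprocess.check_output([sys.executable, path], env=env)
run_output = run_output.decode().strip()
print(run_output)

# case 2: import the same file as a module
import_output = subprocess.check_output(
    [sys.executable, "-c", "import name_demo"], env=env)
import_output = import_output.decode().strip()
print(import_output)
```

The first process reports `__main__` and the second reports `name_demo`, matching the two cases listed above.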

###Initialization code in modules

Some modules require code to run before we can call any functions. Consider a DNA translation function:

In [None]:
def translate_dna(dna): 
    """Return the translation of a DNA sequence"""

    # define a dict to hold the genetic code 
    gencode = { 
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T', 
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R', 
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R', 
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A', 
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L', 
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'} 

    last_codon_start = len(dna) - 2 
    protein = "" 

    # for each codon in the dna, append an amino acid to the protein
    for start in range(0,last_codon_start,3): 
        codon = dna[start:start+3] 
        amino_acid = gencode.get(codon.upper(), 'X') 
        protein = protein + amino_acid

    return protein 

There's a pretty glaring inefficiency here: we redefine the `gencode` dict every time the function is run. If we call `translate_dna()` many times then this will be slow. Better to put it outside the function definition in the module:

In [None]:
# define a dict to hold the genetic code 
gencode = { 
        'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M', 
        'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T', 
        'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 
        'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R', 
        'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L', 
        'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 
        'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q', 
        'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R', 
        'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 
        'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A', 
        'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E', 
        'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 
        'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S', 
        'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L', 
        'TAC':'Y', 'TAT':'Y', 'TAA':'_', 'TAG':'_', 
        'TGC':'C', 'TGT':'C', 'TGA':'_', 'TGG':'W'} 

def translate_dna(dna): 
    last_codon_start = len(dna) - 2 
    protein = "" 

    # for each codon in the dna, append an amino acid to the protein
    for start in range(0,last_codon_start,3): 
        codon = dna[start:start+3] 
        amino_acid = gencode.get(codon.upper(), 'X') 
        protein = protein + amino_acid

    return protein 

Now the dict will be defined at the time of import. Be careful though: don't run slow code on import unless it will definitely be used. We can imagine more sophisticated solutions here e.g. build the dict when first needed then reuse. 
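
As a rough sketch of the "build when first needed, then reuse" idea: the helper `get_gencode()` and the variable `_gencode` below are names invented for this example, and the compact string is just a shorter way of writing the same table as the dict above.

```python
_gencode = None  # filled in the first time it is needed


def get_gencode():
    """Build the genetic code dict on first use, then return the stored copy."""
    global _gencode
    if _gencode is None:
        bases = "TCAG"
        # one amino acid per codon, in the order TTT, TTC, TTA, TTG, TCT, ...
        amino_acids = ("FFLLSSSSYY__CC_W" "LLLLPPPPHHQQRRRR"
                       "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
        codons = [a + b + c for a in bases for b in bases for c in bases]
        _gencode = dict(zip(codons, amino_acids))
    return _gencode


def translate_dna(dna):
    """Return the translation of a DNA sequence"""
    gencode = get_gencode()
    last_codon_start = len(dna) - 2
    protein = ""

    # for each codon in the dna, append an amino acid to the protein
    for start in range(0, last_codon_start, 3):
        codon = dna[start:start+3]
        protein = protein + gencode.get(codon.upper(), 'X')

    return protein


print(translate_dna("ATGGCCTAA"))
```

Importing a module written this way costs almost nothing, and the dict is built at most once, however many times `translate_dna()` is called.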

##Packages

A package gathers together related useful modules, just like a module gathers together related useful functions/classes. Also, we need to build packages if we want to distribute code, even if it's just one module. To make a package we create a folder with the package name, and add a special file called `__init__.py` inside it:

```
dnatools/
    __init__.py
    dna_translation.py
```

Then we have to use the package name as part of the module name:


In [None]:
import dnatools.dna_translation
print(dnatools.dna_translation.translate_dna("ACTGTGAC"))
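
The folder layout can be tried out without touching any real project. The sketch below builds a throwaway package in a temporary folder and imports from it. The names `dnatools_demo` and `at_content` are made up for this example.

```python
import os
import sys
import tempfile

# build this layout in a temporary folder (all names are hypothetical):
#   dnatools_demo/
#       __init__.py
#       at_content.py
parent_folder = tempfile.mkdtemp()
package_folder = os.path.join(parent_folder, "dnatools_demo")
os.mkdir(package_folder)

# an empty __init__.py is enough to mark the folder as a package
open(os.path.join(package_folder, "__init__.py"), "w").close()

with open(os.path.join(package_folder, "at_content.py"), "w") as module_file:
    module_file.write(
        "def calculate_at(dna):\n"
        "    return (dna.count('A') + dna.count('T')) / float(len(dna))\n"
    )

# the folder that CONTAINS the package must be on the search path
sys.path.append(parent_folder)

import dnatools_demo.at_content
from dnatools_demo import at_content  # same module, shorter name

print(dnatools_demo.at_content.calculate_at("AATT"))
print(at_content is dnatools_demo.at_content)
```

Note that it is the parent folder that goes on the path, not the package folder itself.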

Packaging is a necessary prerequisite for distribution....

##Distributing packages

- to distribute a program that uses built in modules, just send the .py file
- to distribute a program that uses third party modules, send the .py file and tell the end user what modules to install
- to distribute a program plus separate module, send a zipped folder
- for anything more complicated, use `pip`

Let's look at an overview of the process (we will likely not be able to do this with the computing setup we have here).

###Register on PyPI
Go to [this page](https://pypi.python.org/pypi) then click *register*.

###Put the module folder inside a package folder
and also create a readme file
```
dnatools/
     README.txt
     dnatools/
         __init__.py
         dna_translation_2.py
```

###Create a `setup.py` file
which is where we put our metadata - it just uses a single call to `setuptools.setup()`

In [None]:
from setuptools import setup 
 
setup(name='dnatools', 
      version = '0.1', 
      description = 'Functions for working with DNA sequences', 
      url = 'http://example.com', 
      author = 'Martin Jones', 
      author_email = 'martin@pythonforbiologists.com', 
      license = 'MIT', 
      packages = ['dnatools']) 

Important: if we have dependencies we can add them to the `setup()` call like this:

In [None]:
install_requires = ['requests', 'BioPython']

This allows `pip` to know that when installing our package it also needs to install these others. 

###Register the package on PyPI
This is easy, a single command line:

```
python setup.py register
```

###Upload the package to PyPI
This is also easy:

```
python setup.py sdist upload
```
We use the same command to update our package (remember to change the version number first).

This is a lot of work; however it means that anyone in the world can now run 

```
pip install dnatools
```
and get a copy of your code plus all dependencies. 

##Other distribution stuff you can do
- distribute a package with dependencies that are not on PyPI
- include a test suite as part of your package
- include command-line tools along with your package
- include data files along with your package
- include code written in a non-Python language as part of your package
- tell setup() to include/exclude specific files when it builds the distribution
- create a Windows installer or a Linux rpm/deb for your package


##Exercises

Pick some exercise solutions from previous sessions (or use your existing code) and turn the code into a module plus program. Try it for both object-oriented and imperative code. 

If you want to experiment with creating packages (either here or later) without polluting the PyPI namespace, there's a testing server you can use - [follow instructions here](https://wiki.python.org/moin/TestPyPI).

