# A bioinformatics exercise
This notebook uses a small bioinformatics exercise to show aspects of the Python programming 
language in the context of a real(ish) data processing activity.

We will be reading, writing, and manipulating text files and running a small sequence alignment
program.  Over the course of this we will cover programming topics such as:

   * Built-in Python types including strings, ints, floats
   * Python code blocks including if/then/else, for loops, functions,
     and context managers
   * Data structures like lists and dictionaries
   * System calls, including multiprocessing Pools
   
Additional topics including Python packages and environments and the object-orientation of Python
will be covered elsewhere.

## Set up an annotation file name (in 5 different ways)
This section shows five different ways to get to a filename that can be opened.

### 1. Assign a string literal to a variable

In Python, the equal sign means "assignment".  Double equal ("==") tests equality.
You can use tab completion to fill out the filename, because Jupyter lets you do that.

The single quotes ensure that file_name will be a Python string (single quotes and double quotes are interchangeable).  You can check this with the _type()_ function.
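A minimal sketch of the assignment described above. The path is an assumed example and may not match the file actually used in this notebook.

```python
# Assign a string literal to a variable (the path is an illustrative assumption).
file_name = 'data/chr12/annotations.1.txt'

# Single and double quotes produce the same kind of object.
same_name = "data/chr12/annotations.1.txt"

print(type(file_name))         # <class 'str'>
print(file_name == same_name)  # True; '==' tests equality, '=' assigns
```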

#### A brief interlude on Python's basic types

In addition to strings, Python has integers...

... which are different than strings that look like numbers.

Python also has floating point numbers

... that have the same problems that floats in other systems have

Addition works like you'd expect for numbers, 

but the plus sign means concatenation when strings are involved

Boolean is a type as well

that is important for expressions
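The points in this interlude can be sketched in a few lines. The variable names are made up for illustration.

```python
# Integers versus strings that look like numbers
an_int = 42
looks_like_int = '42'
print(type(an_int), type(looks_like_int))  # <class 'int'> <class 'str'>
print(an_int == looks_like_int)            # False

# Floats carry the usual binary rounding error
a_float = 0.1 + 0.2
print(a_float)         # 0.30000000000000004
print(a_float == 0.3)  # False

# '+' adds numbers but concatenates strings
print(1 + 2)      # 3
print('1' + '2')  # 12

# Expressions evaluate to booleans
is_dna = 'ATCG' == 'ATCG'
print(type(is_dna), is_dna)  # <class 'bool'> True
```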

### 2. Concatenate string elements.

Strings can be concatenated with the '+' operator.  Non-strings must be
converted first with _str()_

```python
data_dir = 'data'
project_name = 'chr12'
annotations_file_name = 'annotations'
annotations_file_version = 1
annotations_file_ext = 'txt'
```

Let's make a function out of it using the _def_ keyword and a code block

```python
def get_annotation_file_name(
    data_dir, 
    project_name, 
    annotations_file_version, 
    annotations_file_name='annotations', 
    annotations_file_ext='txt'):
    
    '''
    Concatenates data_dir and project_name for path.  "annotations.<version>.<extension>" is the file name.
    '''
    
    return data_dir + '/' + project_name + '/' + annotations_file_name + '.' + str(annotations_file_version) + '.' + annotations_file_ext
```

#### A brief interlude about functions

A function is a block of code that can be run on 0 or more arguments using the "call" operator _()_ and return some value.

```python
def get_dna_chars():
    return 'ATCG'

dna_chars = get_dna_chars()
dna_chars
```

A function can have an arbitrary number of arguments.  They can be treated like positional arguments

```python
def get_nuc_chars(nuc_type, copies):
    if nuc_type.upper() == 'DNA':
        return 'ATCG' * copies
    else:
        return 'AUCG' * copies
result = get_nuc_chars('RNA', 5)
result
```

They can also be treated as keyword arguments and specified in arbitrary order

```python
result = get_nuc_chars(copies=5, nuc_type='DNA')
result
```

Arguments that don't have a default must be specified

```python
result = get_nuc_chars()
```

You can specify defaults when it makes sense, but parameters without defaults must come first

```python
def get_nuc_chars(nuc_type, copies=1):
    if nuc_type.upper() == 'DNA':
        return 'ATCG' * copies
    else:
        return 'AUCG' * copies
result = get_nuc_chars('RNA')
result
```

Run our annotation file function with required arguments and defaults

Specify the annotations_file_name with a different value

Specify arguments as keyword args in arbitrary order
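The three calls described above might look like the following sketch. The function is repeated so the cell runs on its own, and the override value 'annot' is made up.

```python
def get_annotation_file_name(data_dir, project_name, annotations_file_version,
                             annotations_file_name='annotations',
                             annotations_file_ext='txt'):
    '''Same concatenation-based function as defined earlier.'''
    return (data_dir + '/' + project_name + '/' + annotations_file_name
            + '.' + str(annotations_file_version) + '.' + annotations_file_ext)

# Required arguments only; the defaults fill in the rest.
default_name = get_annotation_file_name('data', 'chr12', 1)
print(default_name)  # data/chr12/annotations.1.txt

# Override one default ('annot' is a made-up value).
other_name = get_annotation_file_name('data', 'chr12', 1, annotations_file_name='annot')
print(other_name)    # data/chr12/annot.1.txt

# Keyword arguments in arbitrary order.
keyword_name = get_annotation_file_name(
    annotations_file_version=2, project_name='chr12', data_dir='data')
print(keyword_name)  # data/chr12/annotations.2.txt
```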

The triple quote string is called a "docstring".  Besides being useful to developers that need to read your code, the Python help function can be used to display it.
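As a small illustration, here is a stand-in function with a docstring, displayed with _help()_ and via the \_\_doc\_\_ attribute.

```python
def transcribe(dna):
    '''Convert DNA into RNA by replacing T with U.'''
    return dna.replace('T', 'U')

# help() prints the signature followed by the docstring.
help(transcribe)

# The docstring is also stored on the function object itself.
print(transcribe.__doc__)  # Convert DNA into RNA by replacing T with U.
```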

### 3. Formatted strings

Python supports both positional and named string template substitution.  See the
[Pyformat page](https://pyformat.info/) for details

#### String concatenation is expensive because Python strings are immutable

#### Old style string formatting is common

```python
def get_annotation_file_name(data_dir, project_name, annotations_file_version, annotations_file_name='annotations', annotations_file_ext='txt'):
    '''Concatenates data_dir and project_name for path.  "annotations.<version>.<extension>" is the file name.'''
    
    return '%s/%s/%s.%d.%s' % (data_dir, project_name, annotations_file_name, annotations_file_version, annotations_file_ext)
```

#### format function is more readable and powerful

The format function of strings allows for positional substitution like old style
formatting, but also supports named place holders and rich formatting options

Types can be enforced using type specifiers like ':d'

Precision (or width) can be specified
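A short sketch of the formatting options just mentioned. The values are arbitrary examples.

```python
# Positional substitution, in order or by index
print('{}/{}'.format('data', 'chr12'))    # data/chr12
print('{1}/{0}'.format('chr12', 'data'))  # data/chr12

# The ':d' specifier requires an integer; a string raises ValueError
print('{name}.{version:d}'.format(name='annotations', version=1))  # annotations.1
try:
    '{:d}'.format('1')
except ValueError as err:
    print('type mismatch:', err)

# Precision and width
print('{:.2f}'.format(3.14159))   # 3.14
print('[{:>8}]'.format('chr12'))  # [   chr12]
print('{:05d}'.format(42))        # 00042
```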

Keyword arguments can be really helpful for readability

```python
'{data_dir}/{project_name}/{annotations_file_name}.{annotations_file_version:d}.{annotations_file_ext}'.format(
    annotations_file_name=annotations_file_name, 
    annotations_file_version=1, 
    annotations_file_ext=annotations_file_ext,
    data_dir=data_dir, 
    project_name=project_name, 
)
```

#### A brief interlude about classes, functions, and objects in Python

_format()_ is a good example of a function (a "method") that is defined on an object-oriented 
"class", in this case _str_, and called on instances of that class, which are called "objects".

### 4. Joining list elements

A list of elements can be "join"ed into a string.

#### A brief interlude about Python lists. 

Like arrays in other languages, Python lists are a group of items that can be indexed by an integer.

Lists are initialized with [] or list() and indexing starts with zero.

Check the length with _len()_

You can use negative indexes

Slices can be taken from lists using [:] notation.  Don't forget that the upper bound index is not included.

And you can slice with negative indexes

Lists can be appended to

and extended

List elements are mutable

You can also create an immutable list, a tuple, using parens.
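The list operations above, collected into one sketch with made-up values:

```python
bases = ['A', 'T', 'C', 'G']   # initialize with []
empty = list()                 # or with list()

print(bases[0])     # A  (indexing starts at zero)
print(len(bases))   # 4
print(bases[-1])    # G  (negative indexes count from the end)
print(bases[1:3])   # ['T', 'C']  (upper bound is not included)
print(bases[-2:])   # ['C', 'G']

bases.append('N')            # add one element
bases.extend(['R', 'Y'])     # add several elements
bases[4] = 'U'               # list elements are mutable
print(bases)                 # ['A', 'T', 'C', 'G', 'U', 'R', 'Y']

codon = ('A', 'T', 'G')      # a tuple is an immutable sequence
try:
    codon[0] = 'C'
except TypeError as err:
    print('tuples are immutable:', err)
```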

List values can be iterated with a _for_ block

If you need the index, _enumerate()_

Strings act like lists...

but they are not mutable
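A brief sketch of iteration and of the list-like behavior of strings:

```python
bases = ['A', 'T', 'C', 'G']

# Iterate over values
for base in bases:
    print(base)

# enumerate() yields (index, value) pairs
for index, base in enumerate(bases):
    print(index, base)

# Strings support indexing, slicing, and len() like lists...
seq = 'ATCG'
print(seq[0], seq[-1], seq[1:3], len(seq))  # A G TC 4

# ...but item assignment fails because strings are immutable
try:
    seq[0] = 'U'
except TypeError as err:
    print('strings are immutable:', err)
```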

#### We can redefine the function to join the list of path elements using the _join()_ function of strings

```python
def get_annotation_file_name(data_dir, project_name, annotations_file_version, annotations_file_name='annotations', annotations_file_ext='txt'):
    '''Concatenates data_dir and project_name for path.  "annotations.<version>.<extension>" is the file name.'''
    
    path_elements = [data_dir, project_name, '{}.{:d}.{}'.format(annotations_file_name, annotations_file_version, annotations_file_ext)]
    
    return '/'.join(path_elements)
```

### 5. Joining list elements with os.path.join
The _os_ module must be imported and contains functions that are sensitive to the operating system

Everything you use in a Python script must either be built in (e.g. the _return_ keyword or the _len()_ function), defined in your code (e.g. _file_name_, _get_annotation_file_name_) or imported
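One possible version of the function using _os.path.join()_, which inserts the path separator appropriate for the operating system:

```python
import os

def get_annotation_file_name(data_dir, project_name, annotations_file_version,
                             annotations_file_name='annotations',
                             annotations_file_ext='txt'):
    '''Joins data_dir, project_name, and the file name with the OS path separator.'''
    file_name = '{}.{:d}.{}'.format(
        annotations_file_name, annotations_file_version, annotations_file_ext)
    return os.path.join(data_dir, project_name, file_name)

# On Linux or macOS this prints data/chr12/annotations.1.txt
print(get_annotation_file_name('data', 'chr12', 1))
```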

## Convert the annotation file data into useful records and add to the FASTA sequence headers

We want to read the annotations file, read a sequence FASTA file and add the annotations to the FASTA file description line

#### There are lots of ways to read a text file.

In Python you interact with a file by opening a file handle in a particular mode, in this case 'read'.  The file handle is a lot like a pointer to the next part of the file that you're going to read.

Read it all into a single string using _read()_

Read it into a list of lines using _readlines()_.  You may need to re-open the file, because the fileh is now pointing to the end.

Or, especially if your file is large, you can read one line at a time using _for_ because a file handle is iterable, like a list. <br/>Using print() will display the \t and \n characters as tabs and newlines respectively

Using a context manager (_with_ _as_) is a good way to ensure that the file will close when you're done with it.
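The reading styles above can be tried on a throwaway file. This sketch writes its own small file (the contents are made up) so that it does not depend on the notebook's data directory.

```python
import os
import tempfile

# Create a small example file to read back.
file_name = os.path.join(tempfile.mkdtemp(), 'example.txt')
with open(file_name, 'w') as fileh:
    fileh.write('sample\torganism\nsample1\tHomo sapiens\n')

# Read everything into one string
fileh = open(file_name, 'r')
contents = fileh.read()
leftover = fileh.read()   # the handle is at the end, so this is empty
fileh.close()
print(repr(contents))

# Re-open and read into a list of lines
fileh = open(file_name, 'r')
lines = fileh.readlines()
fileh.close()

# Read one line at a time; the context manager closes the file for us
with open(file_name, 'r') as fileh:
    for line in fileh:
        print(line)
```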

#### Read the data lines and stash the header line by itself using _if_

An _if_ statement is another Python block that will execute code (or not) based on an expression that evaluates to _True_ or _False_

#### Convert the lines into lists of data fields using _split()_.  Add them to a list to make a 2D matrix.
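A sketch of both steps, using an in-memory stand-in for the annotations file. The column names and the sample2 row are assumptions, since the real file's layout is not shown here.

```python
import io

# Made-up stand-in for the annotations file; real column names may differ.
annotations_text = (
    'sample\torganism\tgene\ttype\n'
    'sample1\tHomo sapiens\tacrosin binding protein\tmRNA\n'
    'sample2\tPan troglodytes\tacrosin binding protein\tmRNA\n'
)

header = None
rows = []
for line in io.StringIO(annotations_text):
    fields = line.rstrip('\n').split('\t')   # one list of fields per line
    if header is None:
        header = fields          # the first line is the header
    else:
        rows.append(fields)      # a list of lists acts as a 2D matrix

print(header)
print(rows[0][1])   # Homo sapiens
```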

#### Report out the unique organism common names using a list

#### Report out the unique organism common names using a _set()_

A _set_ is a collection of unique elements that can participate in set operations like unions and intersections
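Both approaches, sketched on a made-up list of common names:

```python
common_names = ['Human', 'Chimp', 'Human', 'Macaque', 'Chimp']  # made-up data

# With a list: only append names we have not seen yet
unique_list = []
for name in common_names:
    if name not in unique_list:
        unique_list.append(name)
print(unique_list)          # ['Human', 'Chimp', 'Macaque']

# With a set: duplicates are dropped automatically (order is not kept)
unique_set = set(common_names)
print(sorted(unique_set))   # ['Chimp', 'Human', 'Macaque']

# Set operations
great_apes = {'Human', 'Chimp', 'Gorilla'}
print(sorted(unique_set & great_apes))   # ['Chimp', 'Human']
print(sorted(unique_set | great_apes))   # ['Chimp', 'Gorilla', 'Human', 'Macaque']
```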

#### Use a dictionary to map the common names

Python dictionaries (analogous to hashes or maps in other languages) map 'keys' to values, much like arrays with named indexes.  They can be initialized with curly braces (or dict()) and are mutable.

You can access individual elements by key

It's an error to access a key that isn't there.

But you can use the _get()_ function to safely return a default value
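A quick sketch of creating a dictionary, accessing it by key, and handling a missing key. The 'Mus musculus' lookup is just an example of a key that is absent.

```python
org_name_map = {'Homo sapiens': 'Human', 'Pan troglodytes': 'Chimp'}

# Dictionaries are mutable: add a new key
org_name_map['Macaca mulatta'] = 'Macaque'

# Access by key
print(org_name_map['Homo sapiens'])   # Human

# A missing key raises KeyError
try:
    org_name_map['Mus musculus']
except KeyError as err:
    print('missing key:', err)

# get() returns a default instead of raising
print(org_name_map.get('Mus musculus', 'Unknown'))   # Unknown
```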

You can iterate over a dictionary with _for_ using the _items()_ function

Before Python 3.7, dictionary keys were not guaranteed to come back in the order you added them, as in the sample output below.  Since Python 3.7 insertion order is guaranteed (in CPython 3.6 it was an implementation detail).

```python
org_name_map = {
    'Homo sapiens': 'Human',
    'Pan troglodytes': 'Chimp',
    'Macaca mulatta': 'Macaque'
}

for key, val in org_name_map.items():
    print(key, val)
```
_Pan troglodytes Chimp_<br/>
_Homo sapiens Human_<br/>
_Macaca mulatta Macaque_<br/>

If you need ordered keys on older Python versions, use an OrderedDict from the collections module

#### Now that we know what dicts are, wouldn't it be great if we could access our data row elements by the column headers?

You can do it by iterating through the column headers and row values simultaneously

Or use the very cool _zip()_ function to combine them in a couple of lines
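Both approaches side by side. The header and row values are illustrative assumptions.

```python
header = ['sample', 'organism', 'gene', 'type']   # hypothetical column headers
row = ['sample1', 'Homo sapiens', 'acrosin binding protein', 'mRNA']

# Iterate through headers and values together using the index
record = {}
for i, column in enumerate(header):
    record[column] = row[i]

# zip() pairs the two lists up; dict() turns the pairs into a dictionary
record_from_zip = dict(zip(header, row))

print(record_from_zip['organism'])   # Homo sapiens
```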

### Sort the records by length

#### Python sorts lists by 'natural' order, either in place...

#### ... or as new list

#### Reversing the direction is easy

#### A key function provides flexibility in sorting
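The sorting variations above, sketched with made-up lengths and records:

```python
lengths = [1200, 350, 987]

new_list = sorted(lengths)                   # returns a new list
print(new_list, lengths)                     # [350, 987, 1200] [1200, 350, 987]

lengths.sort()                               # sorts in place
print(lengths)                               # [350, 987, 1200]

descending = sorted(lengths, reverse=True)   # reverse the direction
print(descending)                            # [1200, 987, 350]

# A key function picks the value to sort by (records are made up)
records = [
    {'sample': 'sample1', 'length': 1200},
    {'sample': 'sample2', 'length': 350},
    {'sample': 'sample3', 'length': 987},
]
by_length = sorted(records, key=lambda record: record['length'])
print([record['sample'] for record in by_length])   # ['sample2', 'sample3', 'sample1']
```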

### Read FASTA records and set a more informative description line

FASTA records have two parts, a description line, starting with '>', and the sequence, e.g.

    >NC_000012.12 Homo sapiens chromosome 12, GRCh38.p13 Primary Assembly     <-- Description line
    ATCGAGACCATCCTGGCCAACATAGTGAAAACCTTTCTCTACTAAAAATACAAAAATTAGCCAGGTATGG    <-- Sequence (DNA in this case)
    TCGAGAGGCTGAGGCAGGAGGATCGCTTAAACCTGGGAGGTAGAGGTTCCAGTGAGCTGAGATTGCGACA
    ...
    >NC_000013.12 Homo sapiens chromosome 13, GRCh38.p13 Primary Assembly

In this example, the first line is the description line, starting with a '>' and the second line starts the DNA sequence.
There can be multiple lines of sequence separated by newlines or just a single line.

The description line has further structure in that the characters between the '>' and the first whitespace are 
treated as the sequence record identifier, in this case NC_000012.12 or NC_000013.12

More than one FASTA record may be in a FASTA file.


First, let's look at the description lines in our samples.fa sequence file

Next, let's read them into a list of dictionaries so that we can make changes before we write them out. 

We'll need to create a new dictionary for each record (each time we see '>')

There are multiple lines of DNA sequence for each record that should get saved
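A minimal sketch of this parsing step, using an in-memory FASTA snippet with made-up descriptions and sequence in place of samples.fa. The dictionary keys are an arbitrary choice, and the loop assumes the first non-blank line is a description line.

```python
import io

# Made-up FASTA text standing in for samples.fa
fasta_text = (
    '>sample1 made-up description\n'
    'ATCGAGACC\n'
    'ATCCTGGCC\n'
    '>sample2 another made-up description\n'
    'TCGAGAGGC\n'
)

records = []
for line in io.StringIO(fasta_text):
    line = line.strip()
    if not line:
        continue
    if line.startswith('>'):
        # Start a new record each time we see a description line
        description = line[1:]
        records.append({
            'id': description.split()[0],   # text up to the first whitespace
            'description': description,
            'sequence_lines': [],
        })
    else:
        # Sequence lines belong to the most recent record
        records[-1]['sequence_lines'].append(line)

for record in records:
    print(record['id'], len(record['sequence_lines']))
```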

Change the description lines to include the gene name, organism and sequence type so that sample1, for example, looks like this:

    >sample1 Homo sapiens acrosin binding protein, mRNA
    
The .format() function should work well.

First, make a dictionary out of our annotations data, keyed by the sample name
    

Use the write function of the file handle to write to the new file.  Don't forget to add newlines.
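Putting these steps together, one possible sketch follows. The annotation columns and sequence are assumptions, and the output goes to a temporary directory rather than the notebook's data directory.

```python
import os
import tempfile

# Hypothetical annotation rows and FASTA records (see the assumptions above)
annotation_rows = [
    {'sample': 'sample1', 'organism': 'Homo sapiens',
     'gene': 'acrosin binding protein', 'type': 'mRNA'},
]
fasta_records = [
    {'id': 'sample1', 'sequence_lines': ['ATCGAGACC', 'ATCCTGGCC']},
]

# Key the annotations by sample name for quick lookup
annotations = {}
for row in annotation_rows:
    annotations[row['sample']] = row

out_name = os.path.join(tempfile.mkdtemp(), 'annotated-samples.fa')
with open(out_name, 'w') as outh:
    for record in fasta_records:
        annot = annotations[record['id']]
        description = '>{} {} {}, {}'.format(
            record['id'], annot['organism'], annot['gene'], annot['type'])
        outh.write(description + '\n')     # write() does not add newlines
        for seq_line in record['sequence_lines']:
            outh.write(seq_line + '\n')

with open(out_name) as inh:
    written = inh.read()
print(written)
```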

## Run minimap2 using annotated-samples.fa as the query and chr12.fa.gz as the reference sequence

minimap2 is a command line tool for mapping query sequences to a reference.  This is useful for characterizing 
query sequences, SNP detection, finding orthologs (from close relatives), etc.  Command line usage is described 
as follows:

    Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]

where 'target' is the reference sequence (chr12.fa.gz for us)

### The most convenient way to run a shell command is _os.system()_

_os.system_ runs a command in a subshell (`/bin/sh` on Unix-like systems) and sends stderr and stdout to the console.  It returns a value derived from the command's exit status (zero for success)

Because it goes to the console, your Python code does not capture the output.

Execution is synchronous, so your program has to wait until it's done.

Shell interpolation is done (by `sh`, which is not necessarily your login shell) so PATH is honored, redirection works, etc.

You can check the return code for non-zero-ness

But you need to capture stderr to find out what happened
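A small sketch of _os.system()_ on a Unix-like system, using `echo` and `ls` as stand-in commands. The nonexistent path is deliberate.

```python
import os

# Output goes straight to the console; only the return value comes back
ret = os.system('echo hello from the shell')
print(ret)   # 0

# A failing command returns non-zero, but its error message is not captured
bad = os.system('ls /no/such/directory/for/this/demo')
if bad != 0:
    print('command failed with return value', bad)
```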

### The subprocess _Popen()_ constructor allows more flexibility and power in the execution of shell commands.

The _Popen()_ constructor creates a process handle that can be used to capture stderr, stdout or pipe data into
stdin.

Run a process using Popen just like _os.system()_

To capture stderr and stdout, use _PIPE_ and _.communicate()_

In Python 3, shell output is returned as a _bytes_ object that must be decoded

A runcmd function can be handy
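One way such a helper could look. The name _runcmd_ comes from the text above, but its signature and return value are assumptions of this sketch, which targets a Unix-like shell.

```python
from subprocess import Popen, PIPE

def runcmd(cmd):
    '''
    Hypothetical helper: run a shell command and return
    (returncode, stdout, stderr) with the output decoded to str.
    '''
    proc = Popen(cmd, shell=True, stdout=PIPE, stderr=PIPE)
    stdout, stderr = proc.communicate()   # waits for the process to finish
    return proc.returncode, stdout.decode('utf-8'), stderr.decode('utf-8')

ok_code, ok_out, ok_err = runcmd('echo hello')
print(ok_code, repr(ok_out))

# This time the error message is available to the program
bad_code, bad_out, bad_err = runcmd('ls /no/such/directory/for/this/demo')
if bad_code != 0:
    print('failed:', bad_err.strip())
```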

### A Pool from the multiprocessing module can support parallel execution

In standard CPython builds, threads cannot execute Python code in parallel because of the [GIL](https://realpython.com/python-gil/).  The multiprocessing module offers an interface similar to a threading library, but uses separate processes.

#### An interlude about Python modules

##### A module is a file with Python definitions and statements.  The _import_ statement allows you to use those definitions in your code

The creation of modules is how Python libraries are made and shared.

For example, if you're doing several projects with DNA sequence, you might like a module that had common DNA sequence manipulations.  In a file called dna.py you could define several functions and data that you might use repeatedly:

```python
DNA_COMPLEMENT = {
    'A': 'T',
    'T': 'A',
    'C': 'G',
    'G': 'C',
}

def reverse_complement(dna):
    '''
    Return the reverse complement of the DNA sequence
    '''
    complement = []
    for base in reversed(dna):
        complement.append(DNA_COMPLEMENT[base.upper()])
    return ''.join(complement)


def translate(dna, frame=0):
    '''
    Translate a string of dna sequence into protein sequence using the given frame
    '''
    protein_sequence = []
    for i in range(frame, len(dna), 3):
        ...
    return ''.join(protein_sequence)

def transcribe(dna):
    '''
    Convert DNA into RNA
    '''
    return dna.replace('T', 'U')
```


To use the functions in this file, you would have to either import the entire module and use the functions (via the dot operator):

```python
import dna

transcript_sequence = 'TACGATCGATCGATCGATTATCGATCAGTCA'
protein_sequence = dna.translate(transcript_sequence)
```

Or you could import specific functions from the file

```python
from dna import translate

protein_sequence = translate('TACGATCGATCGATCGATTATCGATCAGTCA')
``` 
    
The _from_ keyword names the module to look in, but only the names listed after _import_ become available in your code

##### Python modules can be organized in directories traversed by _from_

If the _dna.py_ file described above is placed under a path, e.g. _seqlib/seq/nuc/dna.py_, functions could be accessed using the _from_ keyword with dots replacing the path separator.

```python
from seqlib.seq.nuc.dna import transcribe
```
    
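This will work, but for regular packages a file named \_\_init\_\_.py must be present in each of the directories (since Python 3.3, directories without one can still be imported as namespace packages)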

##### Python starts looking for modules based on the value of _sys.path_, which may include PYTHONPATH, the current directory, and ~/.local

    [akitzmiller@bioinf01 ~]$ echo $PYTHONPATH
    /odyssey/rc_admin/sw/admin/rcpy:

    [akitzmiller@bioinf01 ~]$ pwd
    /n/home_rc/akitzmiller

    [akitzmiller@bioinf01 ~]$ python
    Python 2.7.5 (default, Apr  9 2019, 14:30:50) 
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    
    >>> import sys, os
    
    >>> os.environ['PYTHONPATH']
    '/odyssey/rc_admin/sw/admin/rcpy:'
    
    >>> print '\n'.join(sys.path)

    /odyssey/rc_admin/sw/admin/rcpy
    /n/home_rc/akitzmiller
    /usr/lib64/python27.zip
    /usr/lib64/python2.7
    /usr/lib64/python2.7/plat-linux2
    /usr/lib64/python2.7/lib-tk
    /usr/lib64/python2.7/lib-old
    /usr/lib64/python2.7/lib-dynload
    /usr/lib64/python2.7/site-packages
    /usr/lib64/python2.7/site-packages/gtk-2.0
    /usr/lib/python2.7/site-packages
    >>> 


##### You can find where a module comes from using the \_\_file\_\_ attribute of the module
Seriously, everything is an object
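For example, with the standard library _os_ module (the printed path depends on your installation):

```python
import os

# A module is an object like any other, with attributes of its own
print(type(os))       # <class 'module'>
print(os.__file__)    # location of the file the module was loaded from
```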

##### sys.path is set up relative to the interpreter path, which is why virtual environments work (more about them later)

#### A multiprocessing Pool allows you to manage parallel processes easily

A multiprocessing Pool is an object that allows you to launch, manage, and retrieve results from a set of forked processes.

#### The _map_ function applies a single-argument function to each of a set of values.  This is a useful way to do a "parameter sweep" type of execution.

```python
from multiprocessing import Pool
import os

def echo(echoable):
    os.system('echo %s && sleep 10' % echoable)
    
echoables = [
    'ajk',
    '123',
    'qwerty',
    'uiop',
    'lkjdsa',
]

numprocs = 3
pool = Pool(numprocs)
result = pool.map(echo,echoables)
```

_123_ <br/>
_ajk_ <br/>
_qwerty_ <br/>
_lkjdsa_ <br/>
_uiop_ <br/>


#### The _apply_async_ function allows you to apply many arguments and returns a 'handle' for interacting with the process.

In order for this to work in parallel, you'll need to collect the result handles in a list

```python
from multiprocessing import Pool
import os
def greet(name, message):
    os.system('echo "Hi %s, %s" && sleep 10' % (name,message))
    return '%s was greeted' % name

greetings = [
    ('Aaron', "What's up?"),
    ('Bert', "Where's Ernie?"),
    ('Donald', "What're you thinking?"),
    ('folks', 'Sup!'),
]
numprocs = 3
pool = Pool(numprocs)
results = []
for greeting in greetings:
    result = pool.apply_async(greet, greeting)
    results.append(result)
```

_Hi Bert, Where's Ernie?_ <br/>
_Hi Aaron, What's up?_ <br/>
_Hi Donald, What're you thinking?_ <br/>
_Hi folks, Sup!_ <br/>
    
```python
for result in results:
    print(result.get())
```

_Aaron was greeted_ <br/>
_Bert was greeted_ <br/>
_Donald was greeted_ <br/>
_folks was greeted_ <br/>


#### Run several minimap2 processes in parallel

Create a function that runs minimap2

Set up function arguments in a list

Running in series will be pretty slow

But in parallel
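The four steps above could be sketched as follows. The command follows the usage line shown earlier with no extra options. The second query file name is hypothetical, and the run is skipped when minimap2 is not on the PATH, so treat this as an outline rather than the notebook's actual cells.

```python
import shutil
from multiprocessing import Pool
from subprocess import Popen, PIPE

def build_minimap2_cmd(target, query):
    '''Follows the usage line: minimap2 [options] <target.fa> [query.fa]'''
    return ['minimap2', target, query]

def run_minimap2(target, query):
    '''Run minimap2 and return (returncode, stdout, stderr) as decoded text.'''
    proc = Popen(build_minimap2_cmd(target, query), stdout=PIPE, stderr=PIPE)
    stdout, stderr = proc.communicate()
    return proc.returncode, stdout.decode('utf-8'), stderr.decode('utf-8')

# Function arguments in a list; 'more-samples.fa' is a hypothetical second query
arg_list = [
    ('chr12.fa.gz', 'annotated-samples.fa'),
    ('chr12.fa.gz', 'more-samples.fa'),
]

if __name__ == '__main__' and shutil.which('minimap2'):
    # In series: each run waits for the previous one to finish
    serial_results = [run_minimap2(*args) for args in arg_list]

    # In parallel: submit everything, then collect the results
    with Pool(2) as pool:
        handles = [pool.apply_async(run_minimap2, args) for args in arg_list]
        parallel_results = [handle.get() for handle in handles]
```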

## Search for patterns in the DNA sequence using regular expressions

Python has a full-featured, Perl-ish regular expression syntax provided by the _re_ module

First, a simple search for DNA-ness in each of the fasta record sequences.

Using _re.search_ looks for at least one instance of the pattern

You can search for multiple character patterns, like 'A' followed by 'T'

You can also search for character sets, e.g. one of A,T,C, or G, using square brackets [].

There are more general character classes built in, like \S (any non-whitespace) or \s (any whitespace)

Quantifiers ({n,m}) can define how many times you see the character(s) you're searching for.

Without the second number and comma, it must be an exact number

If you leave the comma in, it's n or more

There are special quantifiers '+' (one or more) and '*' (zero or more)
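The searches described so far, sketched against a short made-up sequence:

```python
import re

seq = 'ATCGGATTTTACGNNAT'   # made-up sequence containing some N's

# 'A' followed by 'T'
print(bool(re.search('AT', seq)))            # True

# Character set: is the whole sequence made of A, T, C, or G?
print(bool(re.search('^[ATCG]+$', seq)))     # False, because of the N's
print(bool(re.search('^[ATCGN]+$', seq)))    # True

# Built-in character class: any whitespace?
print(bool(re.search(r'\s', seq)))           # False

# Quantifiers
print(re.search('T{4}', seq).group())        # TTTT  (exactly 4)
print(re.search('T{2,3}', seq).group())      # TTT   (2 to 3)
print(re.search('T{2,}', seq).group())       # TTTT  (2 or more)
print(re.search('G+', seq).group())          # GG    (one or more)
print(repr(re.search('N*', seq).group()))    # ''    (zero or more matches at the start)
```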

Non-capturing groups _(?:)_ support or-ing together strings

Using capture groups, you can extract the matches

The _split()_ function allows you to break a string based on a regular expression.

Find potential genes by splitting chr12 on stop codons followed by a lot of T
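A sketch of groups, capturing, and splitting on a toy sequence rather than chr12 itself. The threshold of four or more T's is an arbitrary assumption for "a lot of T".

```python
import re

# Non-capturing group: any one of the three stop codons
print(bool(re.search('(?:TAA|TAG|TGA)', 'CCCTAGCCC')))   # True

# Capture groups: pull the identifier and the rest out of a description line
match = re.search(r'^>(\S+)\s+(.*)$',
                  '>sample1 Homo sapiens acrosin binding protein, mRNA')
print(match.group(1))   # sample1
print(match.group(2))   # Homo sapiens acrosin binding protein, mRNA

# re.split() breaks a string wherever the pattern matches
print(re.split(r'\s+', 'sample1  Homo   sapiens'))   # ['sample1', 'Homo', 'sapiens']

# Split a toy sequence on a stop codon followed by 4 or more T's
toy = 'ATGGCCTAATTTTTTATGAAACCCTGATTTTGG'
pieces = re.split('(?:TAA|TAG|TGA)T{4,}', toy)
print(pieces)   # ['ATGGCC', 'ATGAAACCC', 'GG']
```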