# A bioinformatics exercise
This notebook uses a small bioinformatics exercise to show aspects of the Python programming 
language in the context of a real(ish) data processing activity.

We will be reading, writing, and manipulating text files and running a small sequence alignment
program.  Over the course of this we will cover programming topics such as:

   * Built-in Python types including strings, ints, floats
   * Python code blocks including if/then/else, for loops, functions,
     and context managers
   * Data structures like lists and dictionaries
   * System calls, including multiprocessing Pools
   
Additional topics including Python packages and environments and the object-orientation of Python
will be covered elsewhere.

## Set up an annotation file name (in 5 different ways)

This section shows five different ways to get to a filename that can be opened.

### 1. Assign a string literal to a variable

In Python, the equal sign means "assignment".  Double equal ("==") tests equality.
You can use tab completion to fill out the filename, because Jupyter lets you do that.

In [None]:
file_name = 'data/chr12/annotations.1.txt'
file_name

The quotes make file_name a Python string (single quotes and double quotes are interchangeable).  You can check this with the _type()_ function.

In [None]:
type(file_name)

#### A brief interlude on Python's basic types

In addition to strings, Python has integers...

In [None]:
file_number = 1
type(file_number)

... which are different than strings that look like numbers.

In [None]:
file_number = '1'
type(file_number)

Python also has floating point numbers

In [None]:
file_number = 1.5
type(file_number)

... that have the same problems that floats in other systems have

In [None]:
small_number = 0.000000232023402031029834721043
small_number
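
The best-known symptom of those problems is that decimal fractions such as 0.1 have no exact binary representation. A minimal sketch using only the standard library; _math.isclose()_ is one common way to compare floats within a tolerance:

```python
import math

total = 0.1 + 0.2

# Exact comparison fails because neither 0.1 nor 0.2 is stored exactly
exactly_equal = (total == 0.3)

# Comparing within a tolerance is usually what you want
close_enough = math.isclose(total, 0.3)

print(total, exactly_equal, close_enough)
```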

Addition works like you'd expect for numbers, 

In [None]:
1 + 1

but the plus sign means concatenation when strings are involved

In [None]:
'1' + '1'

Boolean is a type as well

In [None]:
type(True)

that is important for expressions

In [None]:
type('a' == 'b')

### 2. Concatenate string elements.

Strings can be concatenated with the '+' operator.  Non-strings must be
converted first with _str()_

In [None]:
data_dir = 'data'
project_name = 'chr12'
annotations_file_name = 'annotations'
annotations_file_version = 1
annotations_file_ext = 'txt'

In [None]:
file_name = data_dir + '/' + project_name + '/' + annotations_file_name + '.' + str(annotations_file_version) + '.' + annotations_file_ext
file_name

Let's make a function out of it using the _def_ keyword and a code block

In [None]:
def get_annotation_file_name(
    data_dir, 
    project_name, 
    annotations_file_version, 
    annotations_file_name='annotations', 
    annotations_file_ext='txt'):

    '''
    Concatenates data_dir and project_name for path.  "annotations.<version>.<extension>" is the file name.
    '''

    return data_dir + '/' + project_name + '/' + annotations_file_name + '.' + str(annotations_file_version) + '.' + annotations_file_ext

In [None]:
get_annotation_file_name(data_dir, project_name, annotations_file_version)

#### A brief interlude about functions

A function is a block of code that can be run on 0 or more arguments using the "call" operator _()_ and return some value.

In [None]:
def get_dna_chars():
    return 'ATCG'

dna_chars = get_dna_chars()
dna_chars

A function can take any number of arguments.  They can be passed by position

In [None]:
def get_nuc_chars(nuc_type, copies):
    if nuc_type.upper() == 'DNA':
        return 'ATCG' * copies
    else:
        return 'AUCG' * copies
result = get_nuc_chars('RNA', 5)
result

They can also be treated as keyword arguments and specified in arbitrary order

In [None]:
result = get_nuc_chars(copies=5, nuc_type='DNA')
result

Arguments that don't have a default must be specified

In [None]:
result = get_nuc_chars()

You can specify defaults when it makes sense, but positional arguments must come first

In [None]:
def get_nuc_chars(nuc_type, copies=1):
    if nuc_type.upper() == 'DNA':
        return 'ATCG' * copies
    else:
        return 'AUCG' * copies
result = get_nuc_chars('RNA')
result

Run our annotation file function with required arguments and defaults

In [None]:
get_annotation_file_name(data_dir, project_name, annotations_file_version)

Specify the annotations_file_name with a different value

In [None]:
get_annotation_file_name(data_dir, project_name, annotations_file_version, annotations_file_name='anothername')

Specify arguments as keyword args in arbitrary order

In [None]:
get_annotation_file_name(annotations_file_ext='csv', data_dir=data_dir, annotations_file_version=3, project_name='chr13')

The triple quote string is called a "docstring".  Besides being useful to developers that need to read your code, the Python help function can be used to display it.

In [None]:
help(get_annotation_file_name)

### 3. Formatted strings

Python supports both positional and named string template substitution.  See the
[Pyformat page](https://pyformat.info/) for details

#### String concatenation is expensive because Python strings are immutable

In [None]:
file_name = get_annotation_file_name(data_dir, project_name, annotations_file_version)

In [None]:
file_name[0] = 'a'

#### Old style string formatting is common

In [None]:
def get_annotation_file_name(data_dir, project_name, annotations_file_version, annotations_file_name='annotations', annotations_file_ext='txt'):
    '''Concatenates data_dir and project_name for path.  "annotations.<version>.<extension>" is the file name.'''
    
    return '%s/%s/%s.%d.%s' % (data_dir, project_name, annotations_file_name, annotations_file_version, annotations_file_ext)

In [None]:
file_name = get_annotation_file_name(data_dir, project_name, annotations_file_version)
file_name

#### format function is more readable and powerful

The format function of strings allows for positional substitution like old style
formatting, but also supports named place holders and rich formatting options

In [None]:
'{}/{}/{}.{}.{}'.format(data_dir, project_name, annotations_file_name, annotations_file_version, annotations_file_ext)

Types can be enforced using type specifiers like ':d'

In [None]:
'{}/{}/{}.{:d}.{}'.format(data_dir, project_name, annotations_file_name, annotations_file_version, annotations_file_ext)

Precision (or width) can be specified

In [None]:
'{}/{}/{:.5}.{:d}.{}'.format(data_dir, project_name, annotations_file_name, annotations_file_version, annotations_file_ext)

Keyword arguments can be really helpful for readability

In [None]:
'{data_dir}/{project_name}/{annotations_file_name}.{annotations_file_version:d}.{annotations_file_ext}'.format(
    annotations_file_name=annotations_file_name, 
    annotations_file_version=1, 
    annotations_file_ext=annotations_file_ext,
    data_dir=data_dir, 
    project_name=project_name, 
)
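
In Python 3.6 and later, an f-string gives the same named substitution with less typing. A minimal sketch, assuming the same variable values used in the cells above:

```python
data_dir = 'data'
project_name = 'chr12'
annotations_file_name = 'annotations'
annotations_file_version = 1
annotations_file_ext = 'txt'

# The f prefix lets each placeholder name a variable (or any expression) directly;
# format specifiers like :d work the same way as in format()
file_name = f'{data_dir}/{project_name}/{annotations_file_name}.{annotations_file_version:d}.{annotations_file_ext}'
print(file_name)
```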

#### A brief interlude about classes, functions, and objects in Python

_format()_ is a good example of a function that is defined on an object-oriented 
"class" (here, _str_) and used on instances of that class, called "objects".
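
A small sketch of that relationship: the same function can be reached through the object or through its class. The template string here is just an illustration.

```python
template = '{}.{:d}.{}'

# Calling the function on the object (the usual form)
from_object = template.format('annotations', 1, 'txt')

# Calling the same function through the class, passing the object explicitly
from_class = str.format(template, 'annotations', 1, 'txt')

print(type(template), from_object, from_class)
```

Both calls produce the same string, because _template.format(...)_ is shorthand for looking up _format_ on the _str_ class and passing _template_ as the first argument.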

### 4. Joining list elements

A list of elements can be "join"ed into a string.

In [None]:
path = '/'.join(['data', 'chr12', 'annotations.1.txt'])
path

#### A brief interlude about Python lists. 

Like arrays in other languages, Python lists are a group of items that can be indexed by an integer.

Lists are initialized with [] or list() and indexing starts with zero.

In [None]:
path_elements = ['nano-course', 'python', 'data', 'chr12']

In [None]:
path_elements[0]

Check the length with _len()_

In [None]:
len(path_elements)

You can use negative indexes

In [None]:
path_elements[-1]

Slices can be taken from lists using [:] notation.  Don't forget that the upper bound index is not included.

In [None]:
path_elements[0:2]

And you can slice with negative indexes

In [None]:
path_elements[-2:-1]

Lists can be appended to

In [None]:
path_elements.append('annotations.1.txt')
path_elements

and extended

In [None]:
full_path = ['Users','akitzmiller']
full_path.extend(path_elements)
print(full_path)

List elements are mutable

In [None]:
path_elements[1] = 'R'
path_elements

You can also create an immutable list, a tuple, using parens

In [None]:
path_tuple = ('nano-course', 'python', 'data', 'chr12')
path_tuple[1] = 'x'

List values can be iterated with a _for_ block

In [None]:
for path_element in path_elements:
    print(path_element)

If you need the index, _enumerate()_

In [None]:
for i, path_element in enumerate(path_elements):
    print(i, path_element)

Strings act like lists...

In [None]:
data_dir[-1]

In [None]:
for ch in data_dir:
    print(ch)

but they are not mutable

In [None]:
data_dir[1] = 'x'

#### We can redefine the function to join the list of path elements using the _join()_ function of strings

In [None]:
'/'.join(path_elements)

In [None]:
def get_annotation_file_name(data_dir, project_name, annotations_file_version, annotations_file_name='annotations', annotations_file_ext='txt'):
    '''Concatenates data_dir and project_name for path.  "annotations.<version>.<extension>" is the file name.'''
    
    path_elements = [data_dir, project_name, '{}.{:d}.{}'.format(annotations_file_name, annotations_file_version, annotations_file_ext)]
    
    return '/'.join(path_elements)

In [None]:
file_name = get_annotation_file_name(data_dir, project_name, annotations_file_version)
file_name

### 5. Joining list elements with os.path.join
The _os_ module must be imported and contains functions that are sensitive to the operating system

In [None]:
os.path.join()

Everything you use in a Python script must be part of the language or built in (e.g. _return_), defined in your code (e.g. _file_name_, _get_annotation_file_name_), or imported

In [None]:
import os

In [None]:
help(os.path.join)

In [None]:
os.path.join(data_dir, project_name, '{}.{:d}.{}'.format(annotations_file_name, annotations_file_version, annotations_file_ext))

In [None]:
def get_annotation_file_name(data_dir, project_name, annotations_file_version, annotations_file_name='annotations', annotations_file_ext='txt'):
    '''Concatenates data_dir and project_name for path.  "annotations.<version>.<extension>" is the file name.'''
    
    path = os.path.join(data_dir, project_name, '{}.{:d}.{}'.format(annotations_file_name, annotations_file_version, annotations_file_ext))
    
    return path

In [None]:
file_name = get_annotation_file_name(data_dir, project_name, annotations_file_version)
file_name

## Convert the annotation file data into useful records and add to the FASTA sequence headers

We want to read the annotations file, read a sequence FASTA file and add the annotations to the FASTA file description line

#### There are lots of ways to read a text file.

In Python you interact with a file by opening a file handle in a particular mode, in this case 'read'.  The file handle is a lot like a pointer to the next part of the file that you're going to read.

In [None]:
fileh = open(file_name, 'r')

Read it all into a single string using _read()_

In [None]:
fileh.read()

Read it into a list of lines using _readlines()_.  You may need to re-open the file, because the fileh is now pointing to the end.

In [None]:
fileh.readlines()

In [None]:
fileh = open(file_name, 'r')

In [None]:
lines = fileh.readlines()
lines

Or, especially if your file is large, you can read one line at a time using _for_, because a file handle can be iterated over like a list. <br/>Using print() will render the \t and \n as tabs and newlines respectively

In [None]:
fileh = open(file_name, 'r')

In [None]:
for line in fileh:
    print(line.strip())

Using a context manager (_with_ _as_) is a good way to ensure that the file will close when you're done with it.

In [None]:
lines = []
with open(file_name, 'r') as fileh:
    for line in fileh:
        lines.append(line.strip())

In [None]:
fileh.closed

In [None]:
print(lines)

#### Read the data lines and stash the header line by itself using _if_

An _if_ statement is another Python block that will execute code (or not) based on an expression that evaluates to _True_ or _False_

In [None]:
lines = []
header_line = ''
with open(file_name, 'r') as fileh:
    for line in fileh:
        line = line.strip()
        if line.startswith('Accession'):
            header_line = line
        else:
            lines.append(line)
lines

#### Convert the lines into lists of data fields using _split()_.  Add them to a list to make a 2D matrix.

In [None]:
data = []
for line in lines:
    field_list = line.split('\t')
    data.append(field_list)
data

In [None]:
data[0]

In [None]:
data[0][1]

#### Report out the unique organism common names using a list

In [None]:
common_names = []
for row in data:
    org = row[1]
    common_name = ''
    if org == 'Homo sapiens':
        common_name = 'Human'
    elif org == 'Pan troglodytes':
        common_name = 'Chimp'
    elif org == 'Macaca mulatta':
        common_name = 'Macaque'
    else:
        print('Unknown organism %s' % org)
        common_name = org
        
    if common_name not in common_names:
        common_names.append(common_name)
common_names

#### Report out the unique organism common names using a _set()_

A _set_ is a collection of unique elements that can participate in set operations like unions and intersects

In [None]:
common_names = set()
for row in data:
    org = row[1]
    common_name = ''
    if org == 'Homo sapiens':
        common_name = 'Human'
    elif org == 'Pan troglodytes':
        common_name = 'Chimp'
    elif org == 'Macaca mulatta':
        common_name = 'Macaque'
    else:
        print('Unknown organism %s' % org)
        common_name = org
        
    common_names.add(common_name)
common_names

In [None]:
model_organisms = set(['Human', 'Mouse', 'Fruit fly', 'Macaque', 'Zebrafish', 'E. coli'])
common_names - model_organisms
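
The subtraction above is a set difference. A short sketch of the other common operations, using made-up sets (the _\_demo_ names are only there to avoid clobbering the notebook's variables):

```python
common_names_demo = {'Human', 'Chimp', 'Macaque'}
model_organisms_demo = {'Human', 'Mouse', 'Macaque', 'Zebrafish'}

in_both = common_names_demo & model_organisms_demo      # intersection
in_either = common_names_demo | model_organisms_demo    # union
only_ours = common_names_demo - model_organisms_demo    # difference

# sorted() gives a stable display order, since sets are unordered
print(sorted(in_both), sorted(in_either), sorted(only_ours))
```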

#### Use a dictionary to map the common names

Python dictionaries (analogous to hashes or maps in other languages) behave like arrays that are indexed by names, called 'keys', instead of integers.  They can be initialized with curly braces (or dict()) and are mutable.

In [None]:
org_name_map = {
    'Homo sapiens': 'Human',
    'Pan troglodytes': 'Chimp'
}
org_name_map

In [None]:
org_name_map['Macaca mulatta'] = 'Macaque'
org_name_map

You can access individual elements by key

In [None]:
org_name_map['Homo sapiens']

It's an error to access a key that isn't there.

In [None]:
org_name_map['Mus musculus']

But you can use the _get()_ function to safely return a default value

In [None]:
org_name_map.get('Mus musculus', 'Not found')

You can iterate over a dictionary with _for_ using the _items()_ function

In [None]:
for org, common in org_name_map.items():
    print('%s (%s)' % (org, common))

In [None]:
common_names = set()
for row in data:
    org = row[1]
    common_name = org_name_map.get(org, org)
    common_names.add(common_name)

common_names

Dictionaries preserve insertion order in Python 3.7 and later (CPython 3.6 already did so as an implementation detail).  In older versions, keys may come back in a different order than you added them, for example:

```python
org_name_map = {
    'Homo sapiens': 'Human',
    'Pan troglodytes': 'Chimp',
    'Macaca mulatta': 'Macaque'
}

for key, val in org_name_map.items():
    print(key, val)
```
_Pan troglodytes Chimp_<br/>
_Homo sapiens Human_<br/>
_Macaca mulatta Macaque_<br/>

If you need ordering guarantees on older versions, use an _OrderedDict_ from the _collections_ module
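
A minimal sketch of _OrderedDict_, reusing the organism names from above (_ordered_name_map_ is a hypothetical name). Besides keeping insertion order on any Python version, it has a few ordering helpers that a plain dict lacks:

```python
from collections import OrderedDict

ordered_name_map = OrderedDict()
ordered_name_map['Homo sapiens'] = 'Human'
ordered_name_map['Pan troglodytes'] = 'Chimp'
ordered_name_map['Macaca mulatta'] = 'Macaque'

# Iteration follows insertion order
ordered_keys = list(ordered_name_map.keys())

# move_to_end() is one of the extras a plain dict does not have
ordered_name_map.move_to_end('Homo sapiens')
reordered_keys = list(ordered_name_map.keys())

print(ordered_keys)
print(reordered_keys)
```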

#### Now that we know what dictionaries are, wouldn't it be great if we could access our data row elements by the column headers?

In [None]:
header_line

In [None]:
col_names = header_line.split('\t')
col_names

You can do it by iterating through the column headers and row values simultaneously

In [None]:
labeled_data = []
for row in data:
    labeled_row = {}
    for i, col_name in enumerate(col_names):
        labeled_row[col_name] = row[i]
    labeled_data.append(labeled_row)
labeled_data
    

Or use the very cool _zip()_ function to combine them in a couple of lines

In [None]:
labeled_data = []
for row in data:
    labeled_row = zip(col_names, row)
    labeled_data.append(dict(labeled_row))
labeled_data
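
To see what _zip()_ is doing, here is a sketch with a made-up three-column row (the values are illustrative, not taken from the annotations file). Note that _zip()_ stops at the shortest input, so a row with a missing field silently loses a column:

```python
col_names_demo = ['Accession', 'Organism', 'Length']
row_demo = ['sample1', 'Homo sapiens', '1234']

# zip() pairs up items by position
pairs = list(zip(col_names_demo, row_demo))

# dict() turns those (key, value) pairs into a dictionary
labeled = dict(zip(col_names_demo, row_demo))

# A short row produces a dictionary with no 'Length' key at all
short_row = dict(zip(col_names_demo, ['sample2', 'Pan troglodytes']))

print(pairs)
print(labeled)
print(short_row)
```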

### Sort the records by length

#### Python sorts lists by 'natural' order, either in place...

In [None]:
letters = ['a','x','t']
letters.sort()
letters

In [None]:
numbers  = [1, 5, 20, 1.5]
numbers.sort()
numbers

In [None]:
numberchars = ['1', '2', '100', '150']
numberchars.sort()
numberchars

#### ... or as new list

In [None]:
numbers = [1,5,3,8]
sortednumbers = sorted(numbers)
numbers

In [None]:
sortednumbers

#### Reversing the direction is easy

In [None]:
sortednumbers.sort(reverse=True)
sortednumbers

#### A key function provides flexibility in sorting

In [None]:
def case_insensitive(item):
    return item.lower()

words = ['and', 'or', 'But']
sortedwords = sorted(words)
sortedwords

In [None]:
sortedwords = sorted(words, key=case_insensitive)
sortedwords

In [None]:
def seq_length(item):
    return int(item['Length'])

sorted_labeled_data = sorted(labeled_data, key=seq_length, reverse=True)
sorted_labeled_data
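
The _int()_ call in the key function matters because the fields read from the file are strings. A sketch with made-up values showing the difference, plus a _lambda_ (an inline, unnamed function) as a one-off key:

```python
numberchars_demo = ['1', '2', '100', '150', '20']

as_text = sorted(numberchars_demo)               # compares character by character
as_numbers = sorted(numberchars_demo, key=int)   # compares the integer values

# A lambda saves defining a named function for a key you only use once
records_demo = [{'Length': '900'}, {'Length': '1200'}, {'Length': '85'}]
longest_first = sorted(records_demo, key=lambda rec: int(rec['Length']), reverse=True)

print(as_text)
print(as_numbers)
print(longest_first)
```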

### Read FASTA records and set a more informative description line

FASTA records have two parts, a description line, starting with '>', and the sequence, e.g.

    >NC_000012.12 Homo sapiens chromosome 12, GRCh38.p13 Primary Assembly     <-- Description line
    ATCGAGACCATCCTGGCCAACATAGTGAAAACCTTTCTCTACTAAAAATACAAAAATTAGCCAGGTATGG    <-- Sequence (DNA in this case)
    TCGAGAGGCTGAGGCAGGAGGATCGCTTAAACCTGGGAGGTAGAGGTTCCAGTGAGCTGAGATTGCGACA
    ...
    >NC_000013.12 Homo sapiens chromosome 13, GRCh38.p13 Primary Assembly

In this example, the first line is the description line, starting with a '>' and the second line starts the DNA sequence.
There can be multiple lines of sequence separated by newlines or just a single line.

The description line has further structure in that the characters between the '>' and the first whitespace are 
treated as the sequence record identifier, in this case NC_000012.12 or NC_000013.12

More than one FASTA record may be in a FASTA file.
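
As a sketch of that structure, the identifier can be pulled out of a description line with a slice and a single _split()_. The line below is the one from the example above; the variable names are hypothetical.

```python
description_line = '>NC_000012.12 Homo sapiens chromosome 12, GRCh38.p13 Primary Assembly'

# Drop the leading '>' and split once on the first run of whitespace
parts = description_line[1:].split(None, 1)
record_id = parts[0]

# A description line may hold only an identifier, so guard the second part
description_text = parts[1] if len(parts) > 1 else ''

print(record_id)
print(description_text)
```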


First, let's look at the description lines in our samples.fa sequence file

In [None]:
sample_file = 'data/chr12/samples.fa'
fileh = open(sample_file, 'r')
for line in fileh:
    line = line.strip()
    if line.startswith('>'):
        print(line)

Next, let's read them into a list of dictionaries so that we can make changes before we write them out. 

We'll need to create a new dictionary for each record (each time we see '>')

There are multiple lines of DNA sequence for each record that should get saved

In [None]:
fasta_records = []
sample_file = 'data/chr12/samples.fa'
fileh = open(sample_file, 'r')
current_description = None
current_sequence_lines = []
for line in fileh:
    line = line.strip()
    if line.startswith('>'):
        if current_description is not None:
            new_record = {'description': current_description, 'sequence_lines': current_sequence_lines}
            fasta_records.append(new_record)
        current_description = line
        current_sequence_lines = []
    else:
        current_sequence_lines.append(line)
fasta_records.append({'description': current_description, 'sequence_lines': current_sequence_lines})
    

In [None]:
fasta_records

Change the description lines to include the gene name, organism and sequence type so that sample1, for example, looks like this:

    >sample1 Homo sapiens acrosin binding protein, mRNA
    
The .format() function should work well.

First, make a dictionary out of our annotations data, keyed by the sample name
    

In [None]:
labeled_data_dict = {}
for record in sorted_labeled_data:
    labeled_data_dict[record['Accession']] = record
labeled_data_dict

In [None]:
for fasta_record in fasta_records:
    # The record identifier is the text between '>' and the first whitespace
    key = fasta_record['description'][1:].split()[0]
    record = labeled_data_dict[key]
    new_description = '>{accession} {organism} {gene_name}, {seq_type}'.format(
        accession=record['Accession'],
        organism=record['Organism'],
        gene_name=record['Gene name'],
        seq_type=record['Seq type'],
    )
    fasta_record['description'] = new_description
fasta_records
    

Use the write function of the file handle to write to the new file.  Don't forget to add newlines.

In [None]:
annotated_sample_file = 'data/chr12/annotated-samples.fa'
fileh = open(annotated_sample_file, 'w')
for fasta_record in fasta_records:
    fileh.write('%s\n' % fasta_record['description'])
    fileh.write('%s\n' % '\n'.join(fasta_record['sequence_lines']))
fileh.close()
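
The same write can be done with a context manager, so the file is closed even if something fails partway through. A self-contained sketch that writes made-up records to a temporary directory rather than to the course data (the records and file name are assumptions for illustration):

```python
import os
import tempfile

fasta_demo = [
    {'description': '>sample1 Homo sapiens example gene, mRNA', 'sequence_lines': ['ATCG', 'GGCA']},
    {'description': '>sample2 Pan troglodytes example gene, mRNA', 'sequence_lines': ['TTAA']},
]

out_path = os.path.join(tempfile.mkdtemp(), 'annotated-demo.fa')

# The file is closed automatically when the with block ends
with open(out_path, 'w') as outh:
    for rec in fasta_demo:
        outh.write(rec['description'] + '\n')
        outh.write('\n'.join(rec['sequence_lines']) + '\n')

# Read it back to check what was written
with open(out_path, 'r') as inh:
    written = inh.read()

print(written)
```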

## Run minimap2 using annotated-samples.fa as the query and chr12.fa.gz as the reference sequence

minimap2 is a command line tool for mapping query sequences to a reference.  This is useful for characterizing 
query sequences, SNP detection, finding orthologs (from close relatives), etc.  Command line usage is described 
as follows:

    Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]

where 'target' is the reference sequence (chr12.fa.gz for us)

In [None]:
target_file = 'data/chr12/chr12.fa.gz'

In [None]:
cmd = './minimap2 {} {}'.format(target_file, annotated_sample_file)

In [None]:
cmd

### The most convenient way to run a shell command is _os.system()_

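_os.system_ runs a command in a subshell (_/bin/sh_ on Unix-like systems) and sends stderr and stdout to the console.  It returns the command's exit status (zero for success); on Unix the value is in the encoded form used by _os.wait()_, so it is safest to test only whether it is zero.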

Because it goes to the console, your Python code does not capture the output.

Execution is synchronous, so your program has to wait until it's done.

Because the command goes through a shell, PATH is honored, redirection works, etc.

In [None]:
os.system(cmd)

You can check the return code for non-zero-ness

In [None]:
cmd = './minimap2 --non-existent-switch {} {}'.format(target_file, annotated_sample_file)

In [None]:
if os.system(cmd) != 0:
    print('Fail!')
else:
    print('Success!')

But you need to capture stderr to find out what happened

In [None]:
cmd = './minimap2 --non-existent-switch {} {} 2> stderr 1> stdout'.format(target_file, annotated_sample_file)

In [None]:
if os.system(cmd) != 0:
    stderrh = open('stderr', 'r')
    print(stderrh.read())

### The subprocess _Popen()_ constructor allows more flexibility and power in the execution of shell commands.

The _Popen()_ constructor creates a process handle that can be used to capture stderr, stdout or pipe data into
stdin.

Run a process using Popen just like _os.system()_

In [None]:
import subprocess

In [None]:
cmd = './minimap2 -a {} {}'.format(target_file, annotated_sample_file)

In [None]:
proc = subprocess.Popen(cmd, shell=True)
proc.wait()

To capture stderr and stdout, use _PIPE_ and _.communicate()_

In [None]:
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

In [None]:
stdout, stderr = proc.communicate()
if proc.returncode == 0:
    print(stdout)
else:
    print('Fail %s' % stderr)

In Python 3, shell output is returned as a _bytes_ object that must be decoded

In [None]:
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = proc.communicate()
if proc.returncode == 0:
    print(stdout.decode('ascii'))
else:
    print('Fail %s' % stderr)

A runcmd function can be handy

In [None]:
def runcmd(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE):
    proc = subprocess.Popen(cmd, shell=True, stdout=stdout, stderr=stderr)
    stdout, stderr = proc.communicate()
    return {'returncode': proc.returncode, 'stdout': stdout.decode('utf-8'), 'stderr': stderr.decode('utf-8')}

In [None]:
result = runcmd(cmd)

In [None]:
print(result['returncode'], "\n", result['stdout'].split("\n")[:10], "\n", result['stderr'])
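
In Python 3.5 and later, _subprocess.run()_ wraps the Popen-then-communicate pattern in a single call. A sketch, assuming nothing about minimap2: it runs the Python interpreter itself as a stand-in command so that it works anywhere.

```python
import subprocess
import sys

# sys.executable stands in for a real tool so the sketch runs anywhere
demo_cmd = [sys.executable, '-c', 'print("hello from a child process")']

# Passing an argument list (no shell=True) avoids shell quoting problems;
# universal_newlines=True returns decoded text instead of bytes
completed = subprocess.run(demo_cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                           universal_newlines=True)

print(completed.returncode, completed.stdout.strip())
```

The returned object carries _returncode_, _stdout_, and _stderr_ attributes, much like the dictionary that _runcmd_ builds.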

### A Pool from the multiprocessing module can support parallel execution

Python threads cannot run Python code truly in parallel because of the [GIL](https://realpython.com/python-gil/) (in the standard CPython interpreter).  The multiprocessing module offers a threading-like API, but runs the work in separate processes.

#### An interlude about Python modules

##### A module is a file with Python definitions and statements.  The _import_ statement allows you to use those definitions in your code

The creation of modules is how Python libraries are made and shared.

For example, if you're doing several projects with DNA sequence, you might like a module that had common DNA sequence manipulations.  In a file called dna.py you could define several functions and data that you might use repeatedly:

```python
DNA_COMPLEMENT = {
    'A': 'T',
    'T': 'A',
    'C': 'G',
    'G': 'C',
}

def reverse_complement(dna):
    '''
    Return the reverse complement of the DNA sequence
    '''
    complement = []
    for base in reversed(dna):
        complement.append(DNA_COMPLEMENT[base.upper()])
    return ''.join(complement)


def translate(dna, frame=0):
    '''
    Translate a string of dna sequence into protein sequence using the given frame
    '''
    protein_sequence = []
    for i in range(frame, len(dna), 3):
        ...
    return ''.join(protein_sequence)

def transcribe(dna):
    '''
    Convert DNA into RNA
    '''
    return dna.replace('T', 'U')
```


To use the functions in this file, you would have to either import the entire module and use the functions (via the dot operator):

```python
import dna

transcript_sequence = 'TACGATCGATCGATCGATTATCGATCAGTCA'
protein_sequence = dna.translate(transcript_sequence)
```

Or you could import specific functions from the file

```python
from dna import translate

protein_sequence = translate('TACGATCGATCGATCGATTATCGATCAGTCA')
``` 
    
The _from_ part says where to look; only the names listed after _import_ become available in your code

##### Python modules can be organized in directories traversed by _from_

If the _dna.py_ file described above is placed under a path, e.g. _seqlib/seq/nuc/dna.py_, functions could be accessed using the _from_ keyword with dots replacing the path separator.

```python
from seqlib.seq.nuc.dna import transcribe
```
    
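This will work, but traditionally a file named \_\_init\_\_.py must be present in each of the directories (Python 3.3 and later can also import directories without one, as "namespace packages")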

##### Python starts looking for modules based on the value of _sys.path_, which may include PYTHONPATH, the current directory, and ~/.local

    [akitzmiller@bioinf01 ~]$ echo $PYTHONPATH
    /odyssey/rc_admin/sw/admin/rcpy:

    [akitzmiller@bioinf01 ~]$ pwd
    /n/home_rc/akitzmiller

    [akitzmiller@bioinf01 ~]$ python
    Python 2.7.5 (default, Apr  9 2019, 14:30:50) 
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    
    >>> import sys, os
    
    >>> os.environ['PYTHONPATH']
    '/odyssey/rc_admin/sw/admin/rcpy:'
    
    >>> print '\n'.join(sys.path)

    /odyssey/rc_admin/sw/admin/rcpy
    /n/home_rc/akitzmiller
    /usr/lib64/python27.zip
    /usr/lib64/python2.7
    /usr/lib64/python2.7/plat-linux2
    /usr/lib64/python2.7/lib-tk
    /usr/lib64/python2.7/lib-old
    /usr/lib64/python2.7/lib-dynload
    /usr/lib64/python2.7/site-packages
    /usr/lib64/python2.7/site-packages/gtk-2.0
    /usr/lib/python2.7/site-packages
    >>> 


##### You can find where a module comes from using the \_\_file\_\_ attribute of the module
Seriously, everything is an object

In [None]:
os.__file__

##### sys.path is set up relative to the interpreter path, which is why virtual environments work (more about them later)

In [None]:
import sys
print('\n'.join(sys.path))

#### A multiprocessing Pool allows you to manage parallel processes easily

A multiprocessing Pool is an object that allows you to launch, manage, and retrieve results from a set of forked processes.

#### The _map_ function applies a single-argument function to each value in a collection.  This is a useful way to do a "parameter sweep" type of execution.

```python
from multiprocessing import Pool
import os

def echo(echoable):
    os.system('echo %s && sleep 10' % echoable)
    
echoables = [
    'ajk',
    '123',
    'qwerty',
    'uiop',
    'lkjdsa',
]

numprocs = 3
pool = Pool(numprocs)
result = pool.map(echo,echoables)
```

_123_ <br/>
_ajk_ <br/>
_qwerty_ <br/>
_lkjdsa_ <br/>
_uiop_ <br/>


#### The _apply_async_ function allows you to apply many arguments and returns a 'handle' for interacting with the process.

In order for this to work in parallel, you'll need to collect the result handles in a list

```python
from multiprocessing import Pool
import os
def greet(name, message):
    os.system('echo "Hi %s, %s" && sleep 10' % (name,message))
    return '%s was greeted' % name

greetings = [
    ('Aaron', "What's up?"),
    ('Bert', "Where's Ernie?"),
    ('Donald', "What're you thinking?"),
    ('folks', 'Sup!'),
]
numprocs = 3
pool = Pool(numprocs)
results = []
for greeting in greetings:
    result = pool.apply_async(greet, greeting)
    results.append(result)
```

_Hi Bert, Where's Ernie?_ <br/>
_Hi Aaron, What's up?_ <br/>
_Hi Donald, What're you thinking?_ <br/>
_Hi folks, Sup!_ <br/>
    
```python
for result in results:
    print(result.get())
```

_Aaron was greeted_ <br/>
_Bert was greeted_ <br/>
_Donald was greeted_ <br/>
_folks was greeted_ <br/>
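
When every call takes the same kind of argument tuple, _starmap_ (Python 3.3+) is a shorter alternative to collecting _apply_async_ handles. The sketch below is an assumption-laden illustration: it uses _multiprocessing.pool.ThreadPool_, which has the same methods as _Pool_ but runs workers as threads, so it needs no `if __name__ == '__main__':` guard on platforms that start worker processes by spawning. The function and data names are hypothetical.

```python
from multiprocessing.pool import ThreadPool

def greet_demo(name, message):
    # Stand-in for real work such as launching a shell command
    return 'Hi %s, %s' % (name, message)

greetings_demo = [
    ('Aaron', "What's up?"),
    ('Bert', "Where's Ernie?"),
    ('folks', 'Sup!'),
]

# The with block shuts the pool down once the results are collected
with ThreadPool(2) as pool_demo:
    # starmap unpacks each tuple into (name, message) and keeps input order
    greeted = pool_demo.starmap(greet_demo, greetings_demo)

print(greeted)
```

Threads are often enough when each worker only launches and waits on an external command, since the heavy lifting happens in the child process; for CPU-bound Python code, a process-based _Pool_ is the better fit.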


#### Run several minimap2 processes in parallel

Create a function that runs minimap2

In [None]:
def minimap2(target_file, query_file):
    cmd = './minimap2 {} {}'.format(target_file, query_file)
    return runcmd(cmd)

Set up function arguments in a list

In [None]:
queries = [
    'data/chr12/annotated-samples.fa',
    'data/chr12/mouse.fa',
    'data/chr12/zebrafish.fa',
]
target = 'data/chr12/chr12.fa.gz'

Running in series will be pretty slow

In [None]:
import time

starttime = time.time()
for query in queries:
    output = minimap2(target, query)
    print(output['stderr'])
elapsed = time.time() - starttime
print('%d seconds elapsed' % elapsed)

But in parallel

In [None]:
from multiprocessing import Pool

numprocs = 2
pool = Pool(numprocs)
results = []
starttime = time.time()
for query in queries:
    result = pool.apply_async(minimap2, [target, query])
    results.append(result)

print('Finished applying to Pool')

for result in results:
    output = result.get()
    print(output['stderr'])
elapsed = time.time() - starttime
print('%d seconds elapsed' % elapsed)

In [None]:
annotated_sample_file

In [None]:
fasta_records = []
sample_file = annotated_sample_file
fileh = open(sample_file, 'r')
current_description = None
current_sequence_lines = []
for line in fileh:
    line = line.strip()
    if line.startswith('>'):
        if current_description is not None:
            new_record = {'description': current_description, 'sequence': ''.join(current_sequence_lines)}
            fasta_records.append(new_record)
        current_description = line
        current_sequence_lines = []
    else:
        current_sequence_lines.append(line)
fasta_records.append({'description': current_description, 'sequence': ''.join(current_sequence_lines)})
    

In [None]:
print(fasta_records[0])

## Search for patterns in the DNA sequence using regular expressions

Python has a full-featured, Perl-ish regular expression syntax provided by the _re_ module

First, a simple search for DNA-ness in each of the fasta record sequences.

_re.search_ looks for at least one instance of the pattern

In [None]:
import re

In [None]:
for fasta_record in fasta_records:
    if re.search(r'A', fasta_record['sequence']):
        print('Found at least one Adenine in FASTA record %s' % fasta_record['description'])
        

You can search for multiple character patterns, like 'A' followed by 'T'

In [None]:
for fasta_record in fasta_records:
    if re.search(r'AT', fasta_record['sequence']):
        print('Found at least one Adenine-Thymine in FASTA record %s' % fasta_record['description'])

You can also search for character sets, e.g. one of A,T,C, or G, using square brackets [].

In [None]:
for fasta_record in fasta_records:
    if re.search(r'[ATCG]', fasta_record['sequence']):
        print('Found at least one of A or T or C or G in FASTA record %s' % fasta_record['description'])

In [None]:
for fasta_record in fasta_records:
    if re.search(r'[U]', fasta_record['sequence']):
        print('Found a U in FASTA record %s' % fasta_record['description'])
    else:
        print('No U found in %s' % fasta_record['description'])

There are more general character classes built in, like \S (any non-whitespace) or \s (any whitespace)

In [None]:
for fasta_record in fasta_records:
    if re.search(r'\S', fasta_record['sequence']):
        print('Found at least one non whitespace character in FASTA record %s' % fasta_record['description'])

In [None]:
for fasta_record in fasta_records:
    if re.search(r'\S\s', fasta_record['sequence']):
        print('Found a non-whitespace followed by a whitespace in FASTA record %s' % fasta_record['description'])
    else:
        print('No non-whitespace followed by a whitespace found in %s' % fasta_record['description'])

Quantifiers ({n,m}) can define how many times you see the character(s) you're searching for.

In [None]:
for fasta_record in fasta_records:
    if re.search(r'CA{2,3}', fasta_record['sequence']):
        print('Found C followed by 2 or 3 As in FASTA record %s' % fasta_record['description'])

Without the second number and comma, it must be an exact number

In [None]:
for fasta_record in fasta_records:
    if re.search(r'CA{6}', fasta_record['sequence']):
        print('Found at least one C followed by 6 As in FASTA record %s' % fasta_record['description'])

If you leave the comma in, it's n or more

In [None]:
for fasta_record in fasta_records:
    if re.search(r'CA{5,}', fasta_record['sequence']):
        print('Found at least C followed by 5 or more As in FASTA record %s' % fasta_record['description'])

There are special quantifiers '+' (one or more) and '*' (zero or more)

In [None]:
for fasta_record in fasta_records:
    if re.search(r'ATG+', fasta_record['sequence']):
        print('Found AT followed by at least one G %s' % fasta_record['description'])

In [None]:
for fasta_record in fasta_records:
    if re.search(r'U*', fasta_record['sequence']):
        print('Found zero or more uracil bases in FASTA record %s' % fasta_record['description'])

Non-capturing groups _(?:)_ support or-ing together strings

In [None]:
for fasta_record in fasta_records:
    if re.search(r'ATG.+(?:TAG|TAA|TGA)', fasta_record['sequence']):
        print('Found a transcript looking thing %s' % fasta_record['description'])

Using capture groups, you can extract the matches

In [None]:
for fasta_record in fasta_records:
    match = re.search(r'(ATG.+(?:TAG|TAA|TGA))', fasta_record['sequence'])
    if match:
        print('Found a transcript looking thing %s' % fasta_record['description'])
        print(match.group(1))
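
One thing to watch: _.+_ is greedy, so the pattern above runs from the first ATG to the last stop codon in the sequence. Adding _?_ makes it stop at the first one. A sketch on a made-up sequence; neither pattern respects the reading frame, so this illustrates the regex behavior rather than real gene finding:

```python
import re

# Made-up sequence containing two ATG...stop stretches
seq_demo = 'CCATGAAATAGGGATGCCCTGACC'

# Greedy: .+ grabs as much as possible, up to the LAST stop codon
greedy = re.search(r'ATG.+(?:TAG|TAA|TGA)', seq_demo).group(0)

# Non-greedy: .+? grabs as little as possible, up to the FIRST stop codon
lazy = re.search(r'ATG.+?(?:TAG|TAA|TGA)', seq_demo).group(0)

# findall() returns every non-overlapping match
all_lazy = re.findall(r'ATG.+?(?:TAG|TAA|TGA)', seq_demo)

print(greedy)
print(lazy)
print(all_lazy)
```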

The _split()_ function allows you to break a string based on a regular expression.

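Find potential genes by splitting chr12 on stop codons followed by a lot of A.  The pattern _T[GA][GA]_ matches the stop codons TGA, TAG, and TAA, but also TGG, so treat this as a rough heuristic.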

In [None]:
# -c writes to stdout, so the .gz file that minimap2 uses is left in place
os.system('gzip -dc data/chr12/chr12.fa.gz > data/chr12/chr12.fa')
chr12 = []
with open('data/chr12/chr12.fa', 'r') as fileh:
    for line in fileh:
        if not line.startswith('>'):
            chr12.append(line.strip())

In [None]:
len(chr12)

In [None]:
chr12 = ''.join(chr12)

In [None]:
coding = re.split(r'(T[GA][GA]A{20,})', chr12)

In [None]:
len(coding)

In [None]:
for c in coding:
    print(len(c))
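
Because the pattern is wrapped in a capture group, _re.split_ keeps the separators in the result, alternating with the pieces between them. A sketch on a made-up sequence, with the run length lowered to 3 so the example stays short:

```python
import re

seq_demo = 'GGCCTGAAAAATTGGTAAAAAACC'

# With a capture group, the matched separators are kept in the list
with_separators = re.split(r'(T[GA][GA]A{3,})', seq_demo)

# Without the group, only the pieces between separators are returned
without_separators = re.split(r'T[GA][GA]A{3,}', seq_demo)

# With exactly one capture group, the even positions hold the pieces
pieces_only = with_separators[::2]

print(with_separators)
print(without_separators)
```

So in the cell above, every second entry of _coding_ is a matched stop-plus-poly-A stretch rather than a candidate coding region.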