# A bioinformatics exercise
This notebook uses a small bioinformatics exercise to show aspects of the Python programming 
language in the context of a real(ish) data processing activity.

We will be reading, writing, and manipulating text files and running a small sequence alignment
program.  Over the course of this we will cover programming topics such as:

   * Built-in Python types including strings, ints, floats
   * Python code blocks including if/then/else, for loops, functions,
     and context managers
   * Data structures like lists and dictionaries
   * System calls, including multiprocessing Pools
   
Additional topics including Python packages and environments and the object-orientation of Python
will be covered elsewhere.

In this course we will often demonstrate something and then follow that with a quick excercise where you try something.  Then we'll go over what it should have looked like and take any questions. 

# 1. Navigating [JupyterLab](https://jupyterlab.readthedocs.io/)
    
  * you've already discovered the lefthand file navigator, double click to open files or enter a folder
  * you will notice some blocks are text and some are code boxes
  * double click on a text block and you will see the mark down version
  * notice there is a play botton at the top of the file
  * inside the text block you clicked on push the play button to see the formatted text
  * code blocks are interpreted by python click in the code block below and type 1, then push the play button
  * you will see the output of the code below the code block (note that you cannot click on this)
  * jupyter by default gives any output of the last line in the code block
  * jupyter also displays any output from functions in the code block


In [108]:
1

1

In [109]:
1
2
3

3

In [110]:
print(1)
print(2)
3

1
2


3

# 2. Numbers and strings at the interactive interpreter

Math works as expected 

In [111]:
1+1

2

Strings are output by the interpreter with single quotes

In [112]:
'hello world'

'hello world'

# 3. Hello world with the print function

Let's try one more hello world, type "print('hello world')"

In [113]:
print('hello world')

hello world


print is a builtin python language function and 'hello world' is a parameter passed to the function.

python language is open source:

https://github.com/python/cpython/blob/master/Python/bltinmodule.c#L1821

More about functions shortly.

# 4. Python types and variables

Rahter than working with raw numbers and strings at the interpreter it can be handy to assign them to variables that can be used multiple times

Let's create a variable called string (the name is a spoiler alert for the type, ha). In Python, the equal sign means "assignment".  Double equal ("==") tests equality.

In [114]:
string = 'hello world'

In [115]:
print(string)

hello world


Use the function type() to see what python type 'hello world' is

In [116]:
type('hello world')

str

All variables are objects in python, we'll examine more of what that means later, but str is a type of object.   Our variable string points to an object of the type str.

In [117]:
type(string)

str

### Exercise 4.1: Let's try some of Python's basic types

Set a variable called number to 1, then check the type

In [118]:
number = 1
type(number)

int

Set a variable called number to '1' and see what the type is then

In [119]:
number = '1'
type(number)

str

Set a varialbe to the number 1.5 and see what type it is

In [120]:
number = 1.5
type(number)

float

Set a variable to True and see what type it is

In [121]:
type(True)

bool

In preparation for reading an annotations file set a variable named file_name to the file path 'data/chr12/annotations.1.txt

In [122]:
file_name = 'data/chr12/annotations.1.txt'
file_name

'data/chr12/annotations.1.txt'

# 5. String concatenation

Strings can be concatenated with the '+' operator.  Non-strings must be
converted first with _str()_

In [123]:
'python ' + 'is ' + 'number ' + str(1)

'python is number 1'

In [124]:
data_dir = 'data'
project = 'chr12'
name = 'annotations'
version = 1
ext = 'txt'

### Exercise 5.1: use concatenation of the above variables to create a variable called file_name which is the same as the one from end end of Exercise 1  (file_name = 'data/chr12/annotations.1.txt')

In [125]:
file_name = data_dir + '/' + project + '/' + name + '.' + str(version) + '.' + ext
file_name

'data/chr12/annotations.1.txt'

# 6. Functions

A function is a block of code that can be run on 0 or more arguments using the "call" operator _()_ and may return some value. 

In [126]:
def hello_world():
    print('hello world')
    
hello_world()

hello world


In [127]:
def python_is(descriptor, action):
    string = 'Python is ' + str(descriptor) + ' everyone should ' + action + ' it '
    return string

python_is('fun', 'try')

'Python is fun everyone should try it '

You can add a multiline comment, surrounded by ''', to a function for documentation.  Functions are objects too and you can see this comment by passing the object to the help function. 

In [128]:
def python_is(descriptor, action):
    '''
    Concatenates the string 'Python is ' with descriptor and ' everyone should ' with action 
    '''
    string = 'Python is ' + str(descriptor) + ' everyone should ' + action + ' it '
    return string

python_is('easy', 'learn')

'Python is easy everyone should learn it '

In [129]:
help(python_is)

Help on function python_is in module __main__:

python_is(descriptor, action)
    Concatenates the string 'Python is ' with descriptor and ' everyone should ' with action



Positionsl arguments must be passed to the function in the order they are listed

In [130]:
def python_is(descriptor, action):
    '''
    Concatenates the string 'Python is ' with descriptor and ' everyone should ' with action 
    '''
    string = 'Python is ' + str(descriptor) + ' everyone should ' + action + ' it '
    return string

python_is('learn', 'easy')

'Python is learn everyone should easy it '

### Exercise 6.1: Make a function called get_annotation_file_name which takes the 5 variables we used in exercies 2 as parameters and returns the concatenated file path (hint you can copy paste that part from exercise 2)

In [131]:
def get_annotation_file_name(
    data_dir, 
    project, 
    version, 
    name, 
    ext):

    '''
    Concatenates data_dir and project_name for path.  "annotations.<version>.<extension>" is the file name.
    '''

    file_name = data_dir + '/' + project + '/' + name + '.' + str(version) + '.' + ext
    return file_name

use the below to call your function and test it

In [132]:
get_annotation_file_name(data_dir, project, version, name, ext)

'data/chr12/annotations.1.txt'

# 7. Function argumemnts

You can specify defaults when it makes sense, but positional arguments must come first

In [133]:
def python_is(descriptor, action = 'learn'):
    '''
    Concatenates the string 'Python is ' with descriptor and ' everyone should ' with action 
    '''
    string = 'Python is ' + str(descriptor) + ' everyone should ' + action + ' it '
    return string

python_is('easy')

'Python is easy everyone should learn it '

Arguments that don't have a default must be specified

In [134]:
result = python_is()

TypeError: python_is() missing 1 required positional argument: 'descriptor'

They can also be treated as keyword arguments and specified in arbitrary order

In [None]:
result = python_is(action='enjoy', descriptor='useful')
result

### Exercise 7: Defaults for annotation file name function

#### 7.1. Copy the function you wrote from exercise 6 and add defaults for extension and name. 

In [None]:
def get_annotation_file_name(
    data_dir, 
    project, 
    version, 
    name = 'annotations', 
    ext = 'txt'):

    '''
    Concatenates data_dir and project for path.  "annotations.<version>.<extension>" is the file name.
    '''

    file_name = data_dir + '/' + project + '/' + name + '.' + str(version) + '.' + ext
    return file_name

#### 7.2 Run our annotation file function with only the required arguments.

In [None]:
get_annotation_file_name(data_dir, project, version)

#### 7.3. Specify the name with a different value

In [None]:
get_annotation_file_name(data_dir, project, version, 'anothername')

#### 7.4. Try specifying the arguments as keyword arguments in a different order than their position in the function definition

In [None]:
get_annotation_file_name(ext='csv', data_dir=data_dir, version=3, project='chr13')

# 8. Formatted strings

Python supports both positional and named string template substitution.  See the
[Pyformat page](https://pyformat.info/) for details

#### String concatentation is expensive because Python strings are immutable

In [None]:
file_name = get_annotation_file_name(data_dir, project, version)

In [None]:
file_name[0] = 'a'

#### Old style string formatting is common

In [None]:
address = '%d %s %s %s,%s' % (52, 'Oxford', 'Street', 'Cambridge', 'MA')
address

#### format function is more readable and powerful

The format function of strings allows for positional substitution like old style
formatting, but also supports named place holders and rich formatting options

_format()_ is a good example of functions that are part of defined on object-oriented 
"classes" and used on instances called "objects".

You can access the properties of an object, both its public functions and public properties through the dot notation (.)

In [None]:
address = '{} {} {} {},{}'.format(52, 'Oxford', 'Street', 'Cambridge', 'MA')
address

Types can be enforced using type specifiers like ':d'

In [None]:
address = '{:d} {} {} {},{}'.format(52, 'Oxford', 'Street', 'Cambridge', 'MA')
address

In [None]:
address = '{:d} {} {:.2}. {},{}'.format(52, 'Oxford', 'Street', 'Cambridge', 'MA')
address

Keyword arguments can be really helpful for readability

In [None]:
address = '{number:d} {street} {suffix:.2}. {city},{state}'.format(
    number=52, 
    street="Oxford", 
    suffix="Street",
    city="Cambridge", 
    state="MA" 
)
address

### Exercise 8. use string formatting to rewrite the get_annotation_file_name function

#### 8.1 use old style formatting 

In [None]:
def get_annotation_file_name(data_dir, project, version, name='annotations', ext='txt'):
    '''Concatenates data_dir and project for path.  "annotations.<version>.<extension>" is the file name.'''
    
    return '%s/%s/%s.%d.%s' % (data_dir, project, name, version, ext)

In [None]:
file_name = get_annotation_file_name(data_dir, project, version)
file_name

#### 8.2 use the format function to rewrite the get_annotation_file_name function

In [None]:
def get_annotation_file_name(data_dir, project, version, name='annotations', ext='txt'):
    '''Concatenates data_dir and project for path.  "annotations.<version>.<extension>" is the file name.'''

    return '{data_dir}/{project}/{name}.{version:d}.{ext}'.format(
        data_dir = data_dir, 
        project = project, 
        name = name, 
        version = version, 
        ext = ext
    )

In [None]:
file_name = get_annotation_file_name(data_dir, project, version)
file_name

# 9. Lists 

Like arrays in other languages, Python lists are a group of items that can be indexed by an integer.

Lists are initialized with [] or list() and indexing starts with zero.

In [None]:
path_elements = ['nano-course', 'python', 'data', 'chr12']

In [None]:
path_elements[0]

In [None]:
path_elements[2]

Check the length with _len()_

In [None]:
len(path_elements)

You can use negative indexes

In [None]:
path_elements[-1]

Slices can be taken from lists using [:] notation.  Don't forget that the upper bound index is not included.

In [None]:
path_elements[0:2]

And you can slice with negative indexes

In [None]:
path_elements[-2:-1]

strings can be sliced just like lists 

In [None]:
hello = 'Greetings'
hello[0]

if you leave a number out with the : notation it assumes you mean until the end or from the beginning

In [None]:
hello[:5]

In [None]:
hello[5:]

Lists can be appended to

In [None]:
path_elements.append('annotations.1.txt')
path_elements

and extended

In [None]:
full_path = ['Users','maria']
full_path.extend(path_elements)
print(full_path)

List elements are mutable

In [None]:
path_elements[1] = 'R'
path_elements

You can also create an immutable list, a tuple, using parens.  These are generally used for data that does not changes and should remain together such as a commpound key like ('category', 'color') might be the key for a group of products. 

In [None]:
path_tuple = ('nano-course', 'python', 'data', 'chr12')
path_tuple[1] = 'x'

## Exercise 9. try out lists

#### 9.1 create a list called list1 with 3 objects 1, 2, 3

In [None]:
list1 = [1, 2, 3]
list1

#### 9.2 append 4, 5, and 6 to the list

In [None]:
list1.append(4)
list1.append(5)
list1.append(6)
list1

#### 9.3 create a list2 with 7, 8, 9 and then extend list1 with list2

In [None]:
list2 = [7, 8, 9]
list1.extend(list2)
list1

#### 9.4 print out the element 1 of the list1 and the element 9 of list1

In [None]:
print(len(list1))
print(list1[0])
print(list1[8])
print(list1[-1])

# 10. Iterating, joining and spliting lists

We can iterate a list with a for loop.

In [None]:
for path_element in path_elements:
    print(path_element)

If you need the index, _enumerate()_

In [None]:
for i, path_element in enumerate(path_elements):
    print(i, path_element)

Strings act like lists...

In [None]:
data_dir[-1]

In [None]:
for ch in data_dir:
    print(ch)

but they are not mutable

In [None]:
data_dir[1] = 'x'

#### You can join list elements into a string with the join function 

In [None]:
address_list = ['52', 'Oxford', 'Street', 'Cambridge', 'MA']
address_display = ' '.join(address_list)
address_display

#### You can also split a string into a list, space is the default seperator

In [None]:
address_list2 = address_display.split()
address_list2

## Exercise 10. More with lists 

#### 10.1 use join on path_elements to create a '/' sepearted path

In [None]:
'/'.join(path_elements)

#### 10.2 We can redefine the function get_annotation_file_name using a list which contains 3 elements (data_dir, project, and a formatted string of the name, version, ext) then join with '/'

In [None]:
def get_annotation_file_name(data_dir, project, version, name='annotations', ext='txt'):
    '''Concatenates data_dir and project for path.  "annotations.<version>.<extension>" is the file name.'''
    
    path_elements = [data_dir, project, '{}.{:d}.{}'.format(name, version, ext)]
    
    return '/'.join(path_elements)

In [None]:
file_name = get_annotation_file_name(data_dir, project, version)
file_name

#### 10.3 create a list of your own with atleast 3 items in it

In [None]:
grocery = ['milk', 'berries', 'chocolate']

#### 10.4 iterate over the list you created using a for loop and print out each element

In [None]:
for item in grocery:
    print(item)

#### 10.5 split the following string into a list using the split() function

In [None]:
header = 'Name;Email;Address;City;State;Country'
header.split(';')

# 11. Modules 

A module is a file with Python definitions and statements.  The _import_ statement allows you to use those definitions in your code

The creation of modules is how Python libraries are made and shared.

For example, if you're doing several projects with DNA sequence, you might like a module that had common DNA sequence manipulations.  In a file called dna.py you could define several functions and data that you might use repeatedly:

```python
DNA_COMPLEMENT = {
    'A': 'T',
    'T': 'A',
    'C': 'G',
    'G': 'C',
}

def reverse_complement(dna):
    '''
    Return the reverse complement of the DNA sequence
    '''
    complement = []
    for base in reversed(dna):
        complement.append(DNA_COMPLEMENT[base.upper()])
    return complement


def translate(dna, frame=0):
    '''
    Translate a string of dna sequence into protein sequence using the given frame
    '''
    protein_sequence = []
    for i in range(frame, len(dna), 3):
        ...
    return ''.join(protein_sequence)

def transcribe(dna):
    '''
    Convert DNA into RNA
    '''
    return dna.replace('T', 'U')
```


To use the functions in this file, you would have to either import the entire module and use the functions (via the dot operator):

```python
import dna

transcript_sequence = 'TACGATCGATCGATCGATTATCGATCAGTCA'
protein_sequence = dna.translate(transcript_sequence)
```

Or you could import specific functions from the file

```python
from dna import translate

protein_sequence = translate('TACGATCGATCGATCGATTATCGATCAGTCA')
``` 
    
The _from_ keyword will get you to the thing you want to import, but the import is what you're allowed to use in your code

##### Python modules can be organized in directories traversed by _from_

If the _dna.py_ file described above is placed under a path, e.g. _seqlib/seq/nuc/dna.py_, functions could be accessed using the _from_ keyword with dots replacing the path separator.

```python
from seqlib.seq.nuc.dna import transcribe
```
    
This will work, but a file named \_\_init\_\_.py must be present in each of the directories

##### Python starts looking for modules based on the value of _sys.path_, which may include PYTHONPATH, the current directory, and ~/.local

    [akitzmiller@bioinf01 ~]$ echo $PYTHONPATH
    /odyssey/rc_admin/sw/admin/rcpy:

    [akitzmiller@bioinf01 ~]$ pwd
    /n/home_rc/akitzmiller

    [akitzmiller@bioinf01 ~]$ python
    Python 2.7.5 (default, Apr  9 2019, 14:30:50) 
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    
    >>> import sys, os
    
    >>> os.environ['PYTHONPATH']
    '/odyssey/rc_admin/sw/admin/rcpy:'
    
    >>> print '\n'.join(sys.path)

    /odyssey/rc_admin/sw/admin/rcpy
    /n/home_rc/akitzmiller
    /usr/lib64/python27.zip
    /usr/lib64/python2.7
    /usr/lib64/python2.7/plat-linux2
    /usr/lib64/python2.7/lib-tk
    /usr/lib64/python2.7/lib-old
    /usr/lib64/python2.7/lib-dynload
    /usr/lib64/python2.7/site-packages
    /usr/lib64/python2.7/site-packages/gtk-2.0
    /usr/lib/python2.7/site-packages
    >>> 


The _os_ module must be imported and contains functions that are sensitive to the operating system

os.path.join will join each parameter with a '/' to create a path

In [None]:
os.path.join()

Notice we get an error above because there is no os currently defined. 

Everything you use in a Python script must either be a built-in (e.g. print), defined in your code (e.g. , get_annotation_file_name) or imported

In [None]:
import os

##### You can find where a module comes from using the \_\_file\_\_ property of the module
Seriously, everything is an object

In [236]:
os.__file__

'/opt/conda/lib/python3.7/os.py'

In [None]:
help(os.path.join)

In [None]:
os.path.join('nano-course', 'python', 'data', 'chr12')

let's import another module like datetime

In [None]:
import datetime

In [None]:
today = datetime.datetime.now()
today

since what we really want to use is the datetime function inside the datetime module we can import that directly with the from keyword

In [None]:
from datetime import datetime

In [None]:
datetime(2020, 1, 21)

## Exercise 11

#### 11.1 rewrite the get_annotation_file_name function again using the os.path.join function

In [None]:
def get_annotation_file_name(data_dir, project, version, name='annotations', ext='txt'):
    '''Concatenates data_dir and project for path.  "annotations.<version>.<extension>" is the file name.'''
    
    path = os.path.join(data_dir, project, '{}.{:d}.{}'.format(name, version, ext))
    
    return path

In [None]:
file_name = get_annotation_file_name(data_dir, project, version)
file_name

#### 11.2 import the random module and use the random() function to generate a random number

In [None]:
import random 

In [None]:
random.random()

# 12. If else

Python's if statement allow flow control by executing blocks of code only when conditions are met.  Else if is shortened to elif, there can be zero or more elif parts, and the else part is optional.

In [None]:
x = int(input("input an int please: "))

In [None]:
if x > 0:
    print('positive')
elif x == 0:
    print('zero')
else:
    print('negative')

## Exercise 12 species to common name

#### 12.1 write an input string called org for organism

In [None]:
org = input("input a organism")

#### 12.2 set a variable common_name to empty string ('') then write an if else block that maps a few species to the common name and then an else block set any other species to the common_name 'Unknown'

In [None]:
'''
here are the species -> common names that your function should use
Homo sapiens -> Human
Pan troglodytes -> Chimp
Macaca mulatta -> Macaque
other -> Unknown
'''
common_name = ''
if org == 'Homo sapiens':
    common_name = 'Human'
elif org == 'Pan troglodytes':
    common_name = 'Chimp'
elif org == 'Macaca mulatta':
    common_name = 'Macaque'
else:
    common_name = 'Unknown'
common_name

#### 12.3 write a function called get_common_name() that takes an organism and returns the common name or Unknown (feel free to copy and paste some of your code from above)

In [None]:
def get_common_name(org):
    common_name = ''
    if org == 'Homo sapiens':
        common_name = 'Human'
    elif org == 'Pan troglodytes':
        common_name = 'Chimp'
    elif org == 'Macaca mulatta':
        common_name = 'Macaque'
    else:
        common_name = 'Unknown'
    return common_name

get_common_name('Homo sapiens')

# 13. Open a file 

First let's open data/samples.txt by double clicking so we can take a look at what the content looks like

In Python you interact with a file by opening a file handle in a particular mode, in this case 'read'.  A file handle is a lot like a pointer to the next part of the file that you're going to read.

In [None]:
sample_file = 'data/samples.txt'
fileh = open(sample_file, 'r')

Read it all into a single string using _read()_

In [None]:
fileh.read()

Read it into a list of lines using _readlines()_.  You may need to re-open the file, because the fileh is now pointing to the end.

In [None]:
fileh.readlines()

In [None]:
fileh = open(sample_file, 'r')

In [None]:
lines = fileh.readlines()
lines

Or, especially if your file is large, you can read one line at a time using _for_ because a file handle acts like a list. <br/>Using print() will convert the \t and \n into tabs and newlines respectively

In [None]:
fileh = open(sample_file, 'r')

In [None]:
for line in fileh:
    print(line.strip())

In [None]:
fileh = open(sample_file, 'r')

for line in fileh:
    if not line.startswith('Sample'):
        print(line.strip())

Using a context manager (_with_ _as_) is a good way to ensure that the file will close when you're done with it.

In [None]:
with open(sample_file, 'r') as fileh:
    for line in fileh:
        if not line.startswith('Sample'):
            print(line.strip())

We can see that the fileh is closed because we are using the context manager.

In [None]:
fileh.closed

## Exercise 13. Try opening the annotations.1.txt file

#### 13.1 open the data/chr12/annotations.1.txt file and get a sense of the content, how are the values delimited?

#### 13.2 open the annotations file and print out the lines 

In [None]:
lines = []
header_line = ''
with open(file_name, 'r') as fileh:
    for line in fileh:
        lines.append(line.strip())
lines

#### 13.3 print out everything but the header row, optionally store the header row in a variable called header

In [None]:
header = ''
with open(file_name, 'r') as fileh:
    for line in fileh:
        if line.startswith('Accession'):
            header = line.strip()
        else:
            print(line.strip())
print(header)

#### 13.4 now use the split function on each line before you add it to the list (as you recall the split function splits a string into a list)

In [None]:
lines = []
header = ''
with open(file_name, 'r') as fileh:
    for line in fileh:
        line = line.strip()
        if line.startswith('Accession'):
            header = line
        else:
            lines.append(line.split('\t'))
lines

#### 13.5 using list indexing get the second element of the first row of lines

In [None]:
lines[0]

In [None]:
lines[0][1]

#### 13.6 loop through each line of data and use the get_common_name function you wrote to get the common name for that row and add it to a list called common_names

#### Report out the unique organism common names using a list

In [None]:
common_names = []
for row in lines:
    org = row[1]
    common_names.append(get_common_name(org))
common_names

# 14. Sets and Dictionaries

A _set_ is a collection of unique elements that can participate in set operations like unions and intersects

In [None]:
model_organisms = set(['Human', 'Mouse', 'Fruit fly', 'Macaque', 'Zebrafish'])
model_organisms

you can add elements to a set with the function add which is similar to list append

In [None]:
model_organisms.add('E. coli')
model_organisms

But it will not add duplicates

In [None]:
model_organisms.add('Human')
model_organisms

A dictionary as a set of key: value pairs.  The keys are unique (within one dictionary). A pair of braces creates an empty dictionary: {}. You can add keys with a braken notation setting them equal to values.

In [None]:
capitals = {}
capitals['MA'] = 'Boston'
capitals['NH'] = 'Concord'
capitals

A dictionary can also be initialized with values using a colon to seperate keys and values.

In [None]:
capitals = {'MA': 'Boston', 'NH': 'Concord'}
capitals

You can access individual elements by key

In [None]:
capitals['MA']

It's an error to access a key that isn't there.

In [None]:
capitals['ME']

But you can use the _get()_ function to safely return a default value

In [None]:
capitals.get('ME', 'not available')

You can iterate over a dictionary with _for_ using the _items()_ function

In [None]:
capitals.items()

In [None]:
for k, v in capitals.items():
    print('the capital of %s is %s' % (k, v))

in addition to items function dictionaries have keys() and values() which return lists of the keys or values

In [None]:
capitals.keys()

In [None]:
list(capitals.keys())

In [None]:
capitals.values()

You can use the zip function to turn two lists into a dictionary 

In [None]:
states = ['MA', 'NH', 'ME']
cities = ['Boston', 'Concord', 'Agusta']
capitals = zip(states, cities)
capitals

Note that this is a zip object to complete turning it into a dictionary use the dict() funciton

In [None]:
dict(capitals)

## Exercise 14

#### 14.1 rewrite your code from 13.5 use a set rather than a list

In [None]:
common_names = set()
for row in lines:
    org = row[1]
    common_names.add(get_common_name(org))
common_names

#### 14.2 create a dictionary to map the organism to common names we used in get_common_name()

In [None]:
org_names = {
    'Homo sapiens': 'Human',
    'Pan troglodytes': 'Chimp',
    'Macaca mulatta': 'Macaque'
}
org_names

#### 14.3 loop through the dictionary and print out the keys and values

In [None]:
for org, common in org_names.items():
    print('%s (%s)' % (org, common))

#### 14.4 use the __dict__.get() function to look for 'Mus musculus' and make sure to add a default (the second parameter)

In [None]:
org_names.get('Mus musculus', 'Not found')

#### 14.5 rewrite the fetching of our data rows from annotations.1.txt file (13.3) to make each row a dictionary using the zip function, use the header as the keys (you will have to split the header into a list)

In [None]:
col_names = header.split('\t')
col_names

In [None]:
labeled_data = []
for row in lines:
    labeled_row = zip(col_names, row)
    labeled_data.append(dict(labeled_row))
labeled_data

#### 14.6 optional if you have time - rather than using zip iterate through the header and the line elements at one time using the enumerate function

In [None]:
labeled_data = []
for row in lines:
    labeled_row = {}
    for i, col_name in enumerate(col_names):
        labeled_row[col_name] = row[i]
    labeled_data.append(labeled_row)
labeled_data
    

#  15. Sorting lists

#### Python sorts lists by 'natural' order, either in place...

In [None]:
letters = ['a','x','t']
letters.sort()
letters

In [None]:
numbers  = [1, 5, 20, 1.5]
numbers.sort()
numbers

In [None]:
numberchars = ['1', '2', '100', '150']
numberchars.sort()
numberchars

#### ... or as new list

In [None]:
numbers = [1,5,3,8]
sortednumbers = sorted(numbers)
numbers

In [None]:
sortednumbers

#### Reversing the direction is easy

In [None]:
sortednumbers.sort(reverse=True)
sortednumbers

#### A key function provides flexibility in sorting

In [None]:
def case_insensitive(item):
    return item.lower()

words = ['and', 'or', 'But']
sortedwords = sorted(words)
sortedwords

In [None]:
sortedwords = sorted(words, key=case_insensitive)
sortedwords

## Exercise 15 

#### 15.1 we are going to sort the labeled_data by seqence length, to do this first write a key function which returns the sequencing length of a row as an integer 

In [None]:
def seq_length(item):
    return int(item['Length'])

#### 15.2 use that function to sort the labeled data with the longest sequence first

In [None]:
sorted_labeled_data = sorted(labeled_data, key=seq_length, reverse=True)
sorted_labeled_data

#### 15.2 let's create a dictionary of dictionaries here, create an empty dictionary called labeled_data_dictionary loop through sorted_labeled_data list and use Acccession as the key add the record dictionary as the value for a new record in labeled_data_dict.

In [None]:
labeled_data_dict = {}
for record in sorted_labeled_data:
    labeled_data_dict[record['Accession']] = record
labeled_data_dict

# 16. Writing to file

using the open function to create a filehandle we can not only read as we've already seen with annotations.1.txt but write.  To open the file for writing use a 'w' for the second param for open.

In [None]:
test_file = 'data/test.txt'
fileh = open(test_file, 'w')
fileh.write('this is a line')
fileh.write('another line')
fileh.close()

to get a newline we have to add the newline character \n

In [None]:
test_file = 'data/test.txt'
fileh = open(test_file, 'w')
fileh.write('this is a line\n')
fileh.write('another line')
fileh.close()

opening with a, opens in append mode

In [None]:
test_file = 'data/test.txt'
fileh = open(test_file, 'a')
fileh.write('this is a line\n')
fileh.write('another line')
fileh.close()

## Exercise 16. write an annotated sample file

FASTA records have two parts, a description line, starting with '>', and the sequence, e.g.

    >NC_000012.12 Homo sapiens chromosome 12, GRCh38.p13 Primary Assembly     <-- Description line
    ATCGAGACCATCCTGGCCAACATAGTGAAAACCTTTCTCTACTAAAAATACAAAAATTAGCCAGGTATGG    <-- Sequence (DNA in this case)
    TCGAGAGGCTGAGGCAGGAGGATCGCTTAAACCTGGGAGGTAGAGGTTCCAGTGAGCTGAGATTGCGACA
    ...
    >NC_000013.12 Homo sapiens chromosome 13, GRCh38.p13 Primary Assembly

In this example, the first line is the description line, starting with a '>' and the second line starts the DNA sequence.
There can be multiple lines of sequence separated by newlines or just a single line.

The description line has further structure in that the characters between the '>' and the first whitespace are 
treated as the sequence record identifier, in this case NC_000012.12 or NC_000013.12

More than one FASTA record may be in a FASTA file.


First, open the samples.fa sequence file from data/chr12

#### 16.1 open that file for reading and loop through all the lines printing out only the sample descriptions which start with > 

In [None]:
sample_file = 'data/chr12/samples.fa'
fileh = open(sample_file, 'r')
for line in fileh:
    line = line.strip()
    if line.startswith('>'):
        print(line)

#### 16.2 now let's create a dictionary called fastq_records then as you loop through the file if it's a sample name then set a variable called description to the sample name and create a key in fasta_records for the description set to an empty list, if the line is sequence then append it to that list

In [None]:
fasta_records = {}
sample_file = 'data/chr12/samples.fa'
fileh = open(sample_file, 'r')
current_sequence_lines = []
for line in fileh:
    line = line.strip()
    if line.startswith('>'):
        current_description = line
        fasta_records[current_description] = []
    else:
        fasta_records[current_description].append(line)

In [None]:
fasta_records

#### 16.3 create a new dictionary called annotated_description then loop through labeled_data_dict from 15.2 and add a record to annotated_description whith the same key (exp: 'sample1') and a new description with this format '>{accession} {organism} {gene_name}, {seq_type}'

In [None]:
annotated_description = {}
for k, record in labeled_data_dict.items():
    new_description = '>{accession} {organism} {gene_name}, {seq_type}'.format(
        accession=record['Accession'],
        organism=record['Organism'],
        gene_name=record['Gene name'],
        seq_type=record['Seq type'],
    )
    annotated_description[k] = new_description
    
annotated_description

#### 16.4 loop through fasta_records and print out only the sample name from each key excluding the < (one way to do this is to use string indexing)

In [None]:
for key in fasta_records.keys():
    print(key[1:])

#### 16.5 create a new dictionary called annotated_records, loop through the fasta_records and use the new annotated_description as the key and all the seq lines as the value (hint: you can use string slicing to remove the >, see section 9)

In [None]:
annotated_records = {}
for key, fasta_record in fasta_records.items():
    new_description = annotated_description[key[1:]]
    annotated_records[new_description] = fasta_record
annotated_records

#### 16.6 write a file 'data/chr12/annotated-samples.fa' which contains the new description as a line then all the sequence lines, you can do this by looping through your annotated_records dictionary. Don't forget to add newlines.

In [None]:
annotated_sample_file = 'data/chr12/annotated-samples.fa'
fileh = open(annotated_sample_file, 'a')
for key, record in annotated_records.items():
    fileh.write('%s\n' % key)
    fileh.write('%s\n' % '\n'.join(record))
fileh.close()

# 17. Running shell commands from python 

### The most convenient way to run a shell command is _os.system()_

_os.system_ runs a command in a bash shell and outputs stderr and stdout to the console.  It returns the shell return code (e.g. zero for success)

Because it goes to the console, your Python code does not capture the output.

Execution is synchronous, so your program has to wait until it's done.

Bash shell (or whatever your current shell is) interpolation is done so PATH is honored, redirection works, etc.

In [204]:
cmd = 'cat nonexistant_file.txt'

In [195]:
os.system(cmd)

256

You can check the return code for non-zero-ness

In [196]:
if os.system(cmd) != 0:
    print('Fail!')
else:
    print('Success!')

Fail!


In [197]:
cmd

'cat nonexistant_file.txt'

But you need to capture stderr to find out what happened

In [198]:
cmd = 'cat nonexistant_file.txt 2> stderr 1> stdout'

In [199]:
if os.system(cmd) != 0:
    stderrh = open('stderr', 'r')
    print(stderrh.read())

cat: nonexistant_file.txt: No such file or directory



### The subprocess _Popen()_ constructor allows more flexibility and power in the execution of shell commands.

The _Popen()_ constructor creates a process handle that can be used to capture stderr, stdout or pipe data into
stdin.

Run a process using Popen just like _os.system()_

In [200]:
import subprocess

In [208]:
cmd = 'echo "hello shell"'

In [209]:
proc = subprocess.Popen(cmd, shell=True)
proc.wait()

0

To capture stderr and stdout, use _PIPE_ and _.communicate()_

In [210]:
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

In [211]:
stdout, stderr = proc.communicate()
if proc.returncode == 0:
    print(stdout)
else:
    print('Fail %s' % stderr)

b'hello shell\n'


In Python 3, shell output is returned as a bytearray that must be decoded

In [212]:
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = proc.communicate()
if proc.returncode == 0:
    print(stdout.decode('ascii'))
else:
    print('Fail %s' % stderr)

hello shell



## Exercise 17

minimap2 is a command line tool for mapping query sequences to a reference.  This is useful for characterizing 
query sequences, SNP detection, finding orthologs (from close relatives), etc.  Command line usage is described 
as follows:

    Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]

where 'target' is the reference sequence (chr12.fa.gz for us)

use your annotated_sample_file from 16 as the query

In [219]:
target_file = 'data/chr12/chr12.fa'

#### 17.1 set the cmd variable to a minimap2 command, note that it should start with ./minimap2 because this is a bash script that we want to execute.  Optionally, use string formatting like in section 8.

In [232]:
cmd = './minimap2 {} {}'.format(target_file, annotated_sample_file)

In [221]:
cmd

'./minimap2 data/chr12/chr12.fa data/chr12/annotated-samples.fa'

#### 17.2 use subprocess.popen to run the cmd, make sure to capture stdout, please be patient it may take a minute to run

In [223]:
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = proc.communicate()
if proc.returncode == 0:
    print(stdout.decode('ascii'))
else:
    print('Fail %s' % stderr)

sample1	2759	1983	2749	-	NC_000012.12	133275309	6556880	6557644	598	766	60	tp:A:P	cm:i:81	s1:i:598	s2:i:0	dv:f:0.0411	rl:i:0
sample1	2759	1421	1974	-	NC_000012.12	133275309	6560108	6561052	408	944	60	tp:A:P	cm:i:52	s1:i:341	s2:i:0	dv:f:0.0481	rl:i:0
sample1	2759	679	1170	-	NC_000012.12	133275309	6563084	6563943	377	859	60	tp:A:P	cm:i:53	s1:i:315	s2:i:0	dv:f:0.0323	rl:i:0
sample1	2759	321	671	-	NC_000012.12	133275309	6566105	6566818	235	713	60	tp:A:P	cm:i:31	s1:i:176	s2:i:0	dv:f:0.0514	rl:i:0
sample1	2759	1175	1400	-	NC_000012.12	133275309	6561666	6561970	184	304	60	tp:A:P	cm:i:28	s1:i:170	s2:i:0	dv:f:0.0254	rl:i:0
sample1	2759	244	293	-	NC_000012.12	133275309	6567839	6567888	49	49	14	tp:A:P	cm:i:6	s1:i:49	s2:i:0	dv:f:0.0192	rl:i:0
sample1	2759	1983	2749	-	NC_000012.12	133275309	6556880	6557644	598	766	60	tp:A:P	cm:i:81	s1:i:598	s2:i:0	dv:f:0.0411	rl:i:0
sample1	2759	1421	1974	-	NC_000012.12	133275309	6560108	6561052	408	944	60	tp:A:P	cm:i:52	s1:i:341	s2:i:0	dv:f:0.0481	rl:i:0
sample1	2

#### 17.3 let's run the same command similar to 17.2 but let's create a dictionary with the following in it: returncode, stdout, stderr

In [228]:
proc_dict = {}
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = proc.communicate()
proc_dict = {'returncode': proc.returncode, 'stdout': stdout.decode('utf8'), 'stderr': stderr.decode('utf8')}

#### 17.4 let's write a function called runcmd, the definition is there for you, fill in the body with the subprocess.Popen similar to 17.3 and return the dictionary. Create a variable called result and set it equal to the runcmd with the minimap2 cmd then pring out the returncode and output from the result.  

In [233]:
def runcmd(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE):
    proc = subprocess.Popen(cmd, shell=True, stdout=stdout, stderr=stderr)
    stdout, stderr = proc.communicate()
    return {'returncode': proc.returncode, 'stdout': stdout.decode('utf-8'), 'stderr': stderr.decode('utf-8')}

In [234]:
result = runcmd(cmd)

In [235]:
print(result['returncode'], "\n", result['stdout'].split("\n")[:10], "\n", result['stderr'])

0 
 ['sample1\t2759\t1983\t2749\t-\tNC_000012.12\t133275309\t6556880\t6557644\t598\t766\t60\ttp:A:P\tcm:i:81\ts1:i:598\ts2:i:0\tdv:f:0.0411\trl:i:0', 'sample1\t2759\t1421\t1974\t-\tNC_000012.12\t133275309\t6560108\t6561052\t408\t944\t60\ttp:A:P\tcm:i:52\ts1:i:341\ts2:i:0\tdv:f:0.0481\trl:i:0', 'sample1\t2759\t679\t1170\t-\tNC_000012.12\t133275309\t6563084\t6563943\t377\t859\t60\ttp:A:P\tcm:i:53\ts1:i:315\ts2:i:0\tdv:f:0.0323\trl:i:0', 'sample1\t2759\t321\t671\t-\tNC_000012.12\t133275309\t6566105\t6566818\t235\t713\t60\ttp:A:P\tcm:i:31\ts1:i:176\ts2:i:0\tdv:f:0.0514\trl:i:0', 'sample1\t2759\t1175\t1400\t-\tNC_000012.12\t133275309\t6561666\t6561970\t184\t304\t60\ttp:A:P\tcm:i:28\ts1:i:170\ts2:i:0\tdv:f:0.0254\trl:i:0', 'sample1\t2759\t244\t293\t-\tNC_000012.12\t133275309\t6567839\t6567888\t49\t49\t14\ttp:A:P\tcm:i:6\ts1:i:49\ts2:i:0\tdv:f:0.0192\trl:i:0', 'sample1\t2759\t1983\t2749\t-\tNC_000012.12\t133275309\t6556880\t6557644\t598\t766\t60\ttp:A:P\tcm:i:81\ts1:i:598\ts2:i:0\tdv:f:0.0411

# 18. parallel execution with a multiprocessing pool

Python cannot do real, parallel multithreading due to the [GIL](https://realpython.com/python-gil/).  The multiprocessing module simulates a threading library, but uses forked processes.

#### A multiprocessing Pool allows you to manage parallel processes easily

A multiprocessing Pool is an object that allows you to launch, manage, and retrieve results from a set of forked processes.

#### The _map_ function applies a set of values to a single argument function.  This is a useful way to do a "parameter sweep" type of execution.

```python
from multiprocessing import Pool
import os

def echo(echoable):
    os.system('echo %s && sleep 10' % echoable)
    
echoables = [
    'ajk',
    '123',
    'qwerty',
    'uiop',
    'lkjdsa',
]

numprocs = 3
pool = Pool(numprocs)
result = pool.map(echo,echoables)
```

_123_ <br/>
_ajk_ <br/>
_qwerty_ <br/>
_lkjdsa_ <br/>
_uiop_ <br/>


Let's try something in serial then commpare with the parallel

In [249]:
def greet(name, message):
    os.system('echo "Hi %s, %s" && sleep 5' % (name,message))
    return '%s was greeted' % name

greetings = [
    ('Howa', "What's up?"),
    ('Sidney', "How are you?"),
    ('Maria', "What're you thinking?"),
    ('folks', 'Sup!'),
]
import time
starttime = time.time()
for greeting in greetings:
    print(greet(greeting[0], greeting[1]))
elapsed = time.time() - starttime
print('%d seconds elapsed' % elapsed)

Howa was greeted
Sidney was greeted
Maria was greeted
folks was greeted
20 seconds elapsed


#### The _apply_async_ function allows you to apply many arguments and returns a 'handle' for interacting with the process.

In order for this to work in parallel, you'll need to collect the result handles in a list

note that apply_async takes a function and then a list of function arguments

In [250]:
from multiprocessing import Pool
import os
numprocs = 3
pool = Pool(numprocs)
results = []

starttime = time.time()
for greeting in greetings:
    result = pool.apply_async(greet, greeting)
    results.append(result)

for result in results:
    print(result.get())
elapsed = time.time() - starttime
print('%d seconds elapsed' % elapsed)

Howa was greeted
Sidney was greeted
Maria was greeted
folks was greeted
10 seconds elapsed


# Exercise 18. Run several minimap2 processes in parallel

#### 18.1 Create a function that runs minimap2 and return the result

In [251]:
def minimap2(target_file, query_file):
    cmd = './minimap2 {} {}'.format(target_file, query_file)
    return runcmd(cmd)

#### 18.2 use the following function arguments and set up a multiprocessing pool with 2 processes then add these queries to the pool with apply_async, don't forget to create a list from [target, query] when you pass it to apply_async because the function arguments must be a list

In [254]:
queries = [
    'data/chr12/annotated-samples.fa',
    'data/chr12/mouse.fa',
    'data/chr12/zebrafish.fa',
]
target = 'data/chr12/chr12.fa'

In [255]:
from multiprocessing import Pool

numprocs = 2
pool = Pool(numprocs)
results = []
starttime = time.time()
for query in queries:
    result = pool.apply_async(minimap2, [target, query])
    results.append(result)

print('Finished applying to Pool')

for result in results:
    output = result.get()
    print(output['stderr'])
elapsed = time.time() - starttime
print('%d seconds elapsed' % elapsed)

Finished applying to Pool
[M::mm_idx_gen::44.889*0.22] collected minimizers
[M::mm_idx_gen::47.598*0.27] sorted minimizers
[M::main::47.599*0.27] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::48.051*0.27] mid_occ = 188
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::48.303*0.28] distinct minimizers: 15811443 (80.14% are singletons); average occurrences: 1.587; average spacing: 5.312
[M::worker_pipeline::48.318*0.28] mapped 14 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: ./minimap2 data/chr12/chr12.fa data/chr12/annotated-samples.fa
[M::main] Real time: 48.363 sec; CPU: 13.480 sec; Peak RSS: 0.998 GB

Killed

[M::mm_idx_gen::8.257*0.76] collected minimizers
[M::mm_idx_gen::9.663*0.86] sorted minimizers
[M::main::9.663*0.86] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::9.964*0.87] mid_occ = 188
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::10.240*0.87] distinct minimize

# 19. Regular Expressions 

Python has a full-featured, Perl-ish regular expression syntax provided by the _re_ module

First, a simple search for DNA-ness in each of the fasta record sequences.

Using _re.search_ looks for at least one instance of the pattern

In [137]:
import re

In [138]:
for fasta_record in fasta_records:
    if re.search(r'A', fasta_record['sequence']):
        print('Found at least one Adenine in FASTA record %s' % fasta_record['description'])
        

Found at least one Adenine in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least one Adenine in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least one Adenine in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least one Adenine in FASTA record >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found at least one Adenine in FASTA record >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
Found at least one Adenine in FASTA record >sample4 Macaca mulatta NOP2 nucleolar protein, mRNA
Found at least one Adenine in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least one Adenine in FASTA record >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found at least one Adenine in FASTA record >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
Found at least one Adenine in FASTA record >sample4 Ma

You can search for multiple character patterns, like 'A' followed by 'T'

In [139]:
for fasta_record in fasta_records:
    if re.search(r'AT', fasta_record['sequence']):
        print('Found at least one Adenine-Thymine in FASTA record %s' % fasta_record['description'])

Found at least one Adenine-Thymine in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least one Adenine-Thymine in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least one Adenine-Thymine in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least one Adenine-Thymine in FASTA record >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found at least one Adenine-Thymine in FASTA record >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
Found at least one Adenine-Thymine in FASTA record >sample4 Macaca mulatta NOP2 nucleolar protein, mRNA
Found at least one Adenine-Thymine in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least one Adenine-Thymine in FASTA record >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found at least one Adenine-Thymine in FASTA record >sample3 Homo sapiens inhibitor of growth fami

You can also search for character sets, e.g. one of A,T,C, or G, using square brackets [].

In [140]:
for fasta_record in fasta_records:
    if re.search(r'[ATCG]', fasta_record['sequence']):
        print('Found at least one of A or T or C or G in FASTA record %s' % fasta_record['description'])

Found at least one of A or T or C or G in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least one of A or T or C or G in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least one of A or T or C or G in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least one of A or T or C or G in FASTA record >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found at least one of A or T or C or G in FASTA record >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
Found at least one of A or T or C or G in FASTA record >sample4 Macaca mulatta NOP2 nucleolar protein, mRNA
Found at least one of A or T or C or G in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least one of A or T or C or G in FASTA record >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found at least one of A or T or C or G in FASTA record >sample3 H

In [141]:
for fasta_record in fasta_records:
    if re.search(r'[U]', fasta_record['sequence']):
        print('Found a U in FASTA record %s' % fasta_record['description'])
    else:
        print('No U found in %s' % fasta_record['description'])

No U found in >sample1 Homo sapiens acrosin binding protein, mRNA
No U found in >sample1 Homo sapiens acrosin binding protein, mRNA
No U found in >sample1 Homo sapiens acrosin binding protein, mRNA
No U found in >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
No U found in >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
No U found in >sample4 Macaca mulatta NOP2 nucleolar protein, mRNA
No U found in >sample1 Homo sapiens acrosin binding protein, mRNA
No U found in >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
No U found in >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
No U found in >sample4 Macaca mulatta NOP2 nucleolar protein, mRNA
No U found in >sample1 Homo sapiens acrosin binding protein, mRNA
No U found in >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
No U found in >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
No U 

There are more general character classes built in, like \S (any non-whitespace) or \s (any whitespace)

In [142]:
for fasta_record in fasta_records:
    if re.search(r'\S', fasta_record['sequence']):
        print('Found at least one non whitespace character in FASTA record %s' % fasta_record['description'])

Found at least one non whitespace character in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least one non whitespace character in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least one non whitespace character in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least one non whitespace character in FASTA record >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found at least one non whitespace character in FASTA record >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
Found at least one non whitespace character in FASTA record >sample4 Macaca mulatta NOP2 nucleolar protein, mRNA
Found at least one non whitespace character in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least one non whitespace character in FASTA record >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found at least one non wh

In [143]:
for fasta_record in fasta_records:
    if re.search(r'\S\s', fasta_record['sequence']):
        print('Found a non-whitespace followed by a whitespace in FASTA record %s' % fasta_record['description'])
    else:
        print('No non-whitespace followed by a whitespace found in %s' % fasta_record['description'])

No non-whitespace followed by a whitespace found in >sample1 Homo sapiens acrosin binding protein, mRNA
No non-whitespace followed by a whitespace found in >sample1 Homo sapiens acrosin binding protein, mRNA
No non-whitespace followed by a whitespace found in >sample1 Homo sapiens acrosin binding protein, mRNA
No non-whitespace followed by a whitespace found in >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
No non-whitespace followed by a whitespace found in >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
No non-whitespace followed by a whitespace found in >sample4 Macaca mulatta NOP2 nucleolar protein, mRNA
No non-whitespace followed by a whitespace found in >sample1 Homo sapiens acrosin binding protein, mRNA
No non-whitespace followed by a whitespace found in >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
No non-whitespace followed by a whitespace found in >sample3 Homo sapiens inhibitor of gr

Quantifiers ({n,m}) can define how many times you see the character(s) you're searching for.

In [144]:
for fasta_record in fasta_records:
    if re.search(r'CA{2,3}', fasta_record['sequence']):
        print('Found C followed by 2 or 3 As in FASTA record %s' % fasta_record['description'])

Found C followed by 2 or 3 As in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found C followed by 2 or 3 As in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found C followed by 2 or 3 As in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found C followed by 2 or 3 As in FASTA record >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found C followed by 2 or 3 As in FASTA record >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
Found C followed by 2 or 3 As in FASTA record >sample4 Macaca mulatta NOP2 nucleolar protein, mRNA
Found C followed by 2 or 3 As in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found C followed by 2 or 3 As in FASTA record >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found C followed by 2 or 3 As in FASTA record >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
Found C followed by 2 or 3 

Without the second number and comma, it must be an exact number

In [145]:
for fasta_record in fasta_records:
    if re.search(r'CA{6}', fasta_record['sequence']):
        print('Found at least one C followed by 6 As in FASTA record %s' % fasta_record['description'])

Found at least one C followed by 6 As in FASTA record >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found at least one C followed by 6 As in FASTA record >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found at least one C followed by 6 As in FASTA record >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA


If you leave the comma in, it's n or more

In [146]:
for fasta_record in fasta_records:
    if re.search(r'CA{5,}', fasta_record['sequence']):
        print('Found at least C followed by 5 or more As in FASTA record %s' % fasta_record['description'])

Found at least C followed by 5 or more As in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least C followed by 5 or more As in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found at least C followed by 5 or more As in FASTA record >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found at least C followed by 5 or more As in FASTA record >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
Found at least C followed by 5 or more As in FASTA record >sample4 Macaca mulatta NOP2 nucleolar protein, mRNA
Found at least C followed by 5 or more As in FASTA record >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found at least C followed by 5 or more As in FASTA record >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
Found at least C followed by 5 or more As in FASTA record >sample4 Macaca mulatta NOP2 nucleolar protein, mRNA
Found at least C followed by

There are special quantifiers '+' (one or more) and '*' (zero or more)

In [147]:
for fasta_record in fasta_records:
    if re.search(r'ATG+', fasta_record['sequence']):
        print('Found AT followed by at least one G %s' % fasta_record['description'])

Found AT followed by at least one G >sample1 Homo sapiens acrosin binding protein, mRNA
Found AT followed by at least one G >sample1 Homo sapiens acrosin binding protein, mRNA
Found AT followed by at least one G >sample1 Homo sapiens acrosin binding protein, mRNA
Found AT followed by at least one G >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found AT followed by at least one G >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
Found AT followed by at least one G >sample4 Macaca mulatta NOP2 nucleolar protein, mRNA
Found AT followed by at least one G >sample1 Homo sapiens acrosin binding protein, mRNA
Found AT followed by at least one G >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found AT followed by at least one G >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
Found AT followed by at least one G >sample4 Macaca mulatta NOP2 nucleolar protein, mRNA
Found AT followed by at leas

In [148]:
for fasta_record in fasta_records:
    if re.search(r'U*', fasta_record['sequence']):
        print('Found zero or more uracil bases in FASTA record %s' % fasta_record['description'])

Found zero or more uracil bases in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found zero or more uracil bases in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found zero or more uracil bases in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found zero or more uracil bases in FASTA record >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found zero or more uracil bases in FASTA record >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
Found zero or more uracil bases in FASTA record >sample4 Macaca mulatta NOP2 nucleolar protein, mRNA
Found zero or more uracil bases in FASTA record >sample1 Homo sapiens acrosin binding protein, mRNA
Found zero or more uracil bases in FASTA record >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found zero or more uracil bases in FASTA record >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
Found zer

Non-capturing groups _(?:)_ support or-ing together strings

In [149]:
for fasta_record in fasta_records:
    if re.search(r'ATG.+(?:TAG|TAA|TGA)', fasta_record['sequence']):
        print('Found a transcript looking thing %s' % fasta_record['description'])

Found a transcript looking thing >sample1 Homo sapiens acrosin binding protein, mRNA
Found a transcript looking thing >sample1 Homo sapiens acrosin binding protein, mRNA
Found a transcript looking thing >sample1 Homo sapiens acrosin binding protein, mRNA
Found a transcript looking thing >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found a transcript looking thing >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
Found a transcript looking thing >sample4 Macaca mulatta NOP2 nucleolar protein, mRNA
Found a transcript looking thing >sample1 Homo sapiens acrosin binding protein, mRNA
Found a transcript looking thing >sample2 Pan troglodytes potassium voltage-gated channel subfamily A member 1 , mRNA
Found a transcript looking thing >sample3 Homo sapiens inhibitor of growth family member 4, mRNA
Found a transcript looking thing >sample4 Macaca mulatta NOP2 nucleolar protein, mRNA
Found a transcript looking thing >sample1 Homo sapiens acr

Using capture groups, you can extract the matches

In [150]:
for fasta_record in fasta_records:
    match = re.search(r'(ATG.+(?:TAG|TAA|TGA))', fasta_record['sequence'])
    if match:
        print('Found a transcript looking thing %s' % fasta_record['description'])
        print(match.group(1))

Found a transcript looking thing >sample1 Homo sapiens acrosin binding protein, mRNA
ATGGGGCGCAAATTGGACCCTACAAAGAAGGAGAAGCGGGGGCCAGGCCGAAAGGCCCGGAAGCAGAAGGGTGCCGAGACAGAACTCGCCAGATTCTTGCCTGCAGTAAGTGACGAAAATTCCAAGAGGCTGTCTAGTCGTGCTCGAAAGAGGGCAGCCAAGAGGAGGCTGGGTTCTGCTGAAGTCCCTAAGACAAATAAGTCCCCTGAGGCCAAACCATTGCCTGGAAAGCTACCAAAAGGAGCTGTCCAGACAGCTGGTAAGAAGGGACCCCAGTCCCTATTTAATGCTGCTCAAGGCAAGAAGCGCCCAGCACCTAGCAGTGATGAGGAAGAGGAGGAGGAAGACTCTGAAGAAGATGATGTGGTGAACCAGGGGGACCTCTGGGGCTCCGAGGATGATGCTGATATGGTAGATGACTATGGAGCTGACTCCAACTCTGAGGATGAGGAGGAAGGTGAAGAGCTGCTGCCCATTGAAAGAGCTGCTCGGAAGCAGAAGGTCCGGGAAGCTGCTGCTGGGGTCCAGTGGAGTGAAGAGGAGACGGAGGATGAGGAGGAAGAAGTGACCCCTGAGTCCGGCCCCTCAAAGGAGGAGGAGGCAGATGGGGGCCTGCAGATCAATGTGGATGAGGAACCATTTGTGCTGCCCCCTGCCGGGGAGATGGAGCAGGATGCCCAGGCTCCAGACCTGCAACGAGTTCACAAGCGGATCCAGGATATCGTGGGAATTCTGCGTGATTTTGGGGCTCAGCGGGAGGAAGGGCGGTCTCGTTCTGAATACCTGAACCGGCTCAAGAAGGATCTGGCCACTTACTACTCCTATGGAGACTTCCTGCTTGGCAAGCTCATGGACCTCTTCCCTCTGTCTGAGCTGGTGGAGTTCTTAGAAGCTAATGAGGTGCCTCGGCCCGTC

The _split()_ function allows you to break a string based on a regular expression.

Find potential genes by splitting chr12 on stop codons followed by a lot of T

In [151]:
os.system('gzip -d data/chr12/chr12.fa.gz')
chr12 = []
with open('data/chr12/chr12.fa', 'r') as fileh:
    for line in fileh:
        if not line.startswith('>'):
            chr12.append(line.strip())

In [152]:
len(chr12)

1903934

In [153]:
chr12 = ''.join(chr12)

In [154]:
coding = re.split(r'(T[GA][GA]T{20,})', chr12)

In [155]:
len(coding)

885

## Perform a virtual restriction fragment digest on chromosome 12 and map the annotated samples to the largest fragments

* Read the chromosome 12 sequence
* Cut it with the BisI restriction enzyme (recognizes GCAGC, GCTGC, GCGGC and GCCGC)
* Sort the unique fragments by size
* Write the 3 largest fragments to 3 separate FASTA files
* Run minimap2 with the annotated samples against each of the 3 largest fragments

In [157]:
chr12 = []
with open('data/chr12/chr12.fa', 'r') as fileh:
    for line in fileh:
        if not line.startswith('>'):
            chr12.append(line.strip())

In [158]:
len(chr12)

1903934

In [159]:
chr12 = ''.join(chr12)

In [160]:
len(chr12)

133275309

In [161]:
import re
digested = re.split(r'GC[ATGC]GC', chr12)

In [162]:
len(digested)

250772

In [163]:
unique_digested = set(digested)
len(unique_digested)

241783

In [164]:
def bylen(item):
    return len(item)

sorted_unique_digested = sorted(unique_digested, key=bylen, reverse=True)

In [165]:
len(sorted_unique_digested[0])

72368

In [166]:
len(sorted_unique_digested[-1])

0

In [167]:
target_names = []
for i in [0,1,2]:
    file_name = 'digested%d.fa' % i
    file_path = 'data/chr12/%s' % file_name
    with open(file_path, 'w') as fileh:
        fileh.write('>%s\n' % file_name)
        fileh.write('%s\n' % sorted_unique_digested[i])
    target_names.append(file_path)

In [168]:
target_names

['data/chr12/digested0.fa',
 'data/chr12/digested1.fa',
 'data/chr12/digested2.fa']

In [169]:
from multiprocessing import Pool
numprocs = 2
pool = Pool(numprocs)
results = []
query = 'data/chr12/annotated-samples.fa'
for target in target_names:
    result = pool.apply_async(minimap2, [target, query])
    results.append(result)
print('Done launching')

for result in results:
    output = result.get()
    print(output['stderr'])

NameError: name 'minimap2' is not defined

Process ForkPoolWorker-2:
Process ForkPoolWorker-1:
Traceback (most recent call last):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "/opt/conda/lib/python3.7/multiprocessing/queues.py", line 351, in get
    with self._rlock:
  File "/opt/conda/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.7/multiprocessing/queues.py", line 352, in get
    res = self._reader.recv_bytes()
  File "/opt/conda/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "/opt/conda/lib