# A bioinformatics exercise
This notebook uses a small bioinformatics exercise to show aspects of the Python programming 
language in the context of a real(ish) data processing activity.

We will be reading, writing, and manipulating text files and running a small sequence alignment
program.  Over the course of this we will cover programming topics such as:

   * Built-in Python types including strings, ints, floats
   * Python code blocks including if/then/else, for loops, functions,
     and context managers
   * Data structures like lists and dictionaries
   * System calls, including multiprocessing Pools
   
Additional topics including Python packages and environments and the object-orientation of Python
will be covered elsewhere.

In this course we will often demonstrate something and then follow that with a quick exercise where you try it yourself.  Then we'll go over what it should have looked like and take any questions. 

# 1. Navigating [JupyterLab](https://jupyterlab.readthedocs.io/)
    
  * you've already discovered the left-hand file navigator; double-click to open a file or enter a folder
  * you will notice that some blocks are text and some are code boxes
  * double-click on a text block and you will see its Markdown source
  * notice there is a play button at the top of the file
  * with the cursor inside the text block you clicked on, push the play button to see the formatted text
  * code blocks are interpreted by Python: click in the code block below, type 1, then push the play button
  * you will see the output of the code below the code block (note that you cannot click on this)
  * by default, Jupyter displays the value of the last line in the code block
  * Jupyter also displays anything printed by functions called in the code block


In [21]:
1

1

In [24]:
1
2
3

3

In [25]:
print(1)
print(2)
3

1
2


3

# 2. Numbers and strings at the interactive interpreter

Math works as expected 

In [6]:
1+1

2

Strings are output by the interpreter with single quotes

In [5]:
'hello world'

'hello world'

# 3. Hello world with the print function

Let's try one more hello world, type "print('hello world')"

In [16]:
print('hello world')

hello world


print is a built-in Python function and 'hello world' is an argument passed to the function.

The Python language is open source, so you can read how the built-in functions are implemented:

https://github.com/python/cpython/blob/master/Python/bltinmodule.c#L1821

More about functions shortly.

# 4. Python types and variables

Rather than working with raw numbers and strings at the interpreter, it can be handy to assign them to variables that can be used multiple times

Let's create a variable called string (the name is a spoiler alert for the type, ha). In Python, the equal sign means "assignment".  Double equal ("==") tests equality.

In [9]:
string = 'hello world'

In [10]:
print(string)

hello world
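The difference between "=" and "==" is easy to mix up, so here is a minimal sketch of both. The variable names (greeting, is_same, is_different_case) are made up for this illustration and do not appear elsewhere in the notebook.

```python
# '=' binds a name to a value; the statement itself produces no result.
greeting = 'hello world'

# '==' compares two values and evaluates to True or False.
is_same = (greeting == 'hello world')
is_different_case = (greeting == 'Hello World')   # comparison is case sensitive

print(is_same)
print(is_different_case)
```

The result of a comparison is itself a value (of type bool), so it can be assigned to a variable like anything else.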


Use the function type() to see what python type 'hello world' is

In [29]:
type('hello world')

str

Every value in Python is an object; we'll examine more of what that means later, but str is a type of object.   Our variable string refers to an object of the type str.

In [13]:
type(string)

str

### Exercise 4.1: Let's try some of Python's basic types

Set a variable called number to 1, then check the type

In [30]:
number = 1
type(number)

int

Set a variable called number to '1' and see what the type is then

In [31]:
number = '1'
type(number)

str

Set a variable to the number 1.5 and see what type it is

In [32]:
number = 1.5
type(number)

float

Set a variable to True and see what type it is

In [None]:
type(True)

In preparation for reading an annotations file, set a variable named file_name to the file path 'data/chr12/annotations.1.txt'

In [42]:
file_name = 'data/chr12/annotations.1.txt'
file_name

'data/chr12/annotations.1.txt'

# 5. String concatenation

Strings can be concatenated with the '+' operator.  Non-strings must be
converted first with _str()_

In [18]:
'python ' + 'is ' + 'number ' + str(1)

'python is number 1'
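To see why the _str()_ conversion matters, the sketch below tries the concatenation without it. The try/except construct has not been covered in this notebook; it is used here only so that the cell keeps running after the error. The names count, message, and fixed are made up for the example.

```python
count = 1

try:
    # Adding an int to a str is not allowed, so this line raises TypeError.
    message = 'python is number ' + count
except TypeError as err:
    message = 'conversion needed: ' + str(err)

# Converting the number first makes the concatenation work.
fixed = 'python is number ' + str(count)

print(message)
print(fixed)
```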

In [14]:
data_dir = 'data'
project = 'chr12'
name = 'annotations'
version = 1
ext = 'txt'

### Exercise 5.1: use concatenation of the above variables to create a variable called file_name which is the same as the one from the end of Exercise 4.1  (file_name = 'data/chr12/annotations.1.txt')

In [44]:
file_name = data_dir + '/' + project + '/' + name + '.' + str(version) + '.' + ext
file_name

'data/chr12/annotations.1.txt'

# 6. Functions

A function is a block of code that can be run on 0 or more arguments using the "call" operator _()_ and may return some value. 

In [63]:
def hello_world():
    print('hello world')
    
hello_world()

hello world


In [185]:
def python_is(descriptor, action):
    string = 'Python is ' + str(descriptor) + ' everyone should ' + action + ' it '
    return string

python_is('fun', 'try')

'Python is fun everyone should try it '

You can document a function with a docstring: a string literal, usually surrounded by ''' so that it can span multiple lines, placed as the first statement in the function body.  Functions are objects too and you can see the docstring by passing the function to the help function. 

In [189]:
def python_is(descriptor, action):
    '''
    Concatenates the string 'Python is ' with descriptor and ' everyone should ' with action 
    '''
    string = 'Python is ' + str(descriptor) + ' everyone should ' + action + ' it '
    return string

python_is('easy', 'learn')

'Python is easy everyone should learn it '

In [190]:
help(python_is)

Help on function python_is in module __main__:

python_is(descriptor, action)
    Concatenates the string 'Python is ' with descriptor and ' everyone should ' with action



Positional arguments must be passed to the function in the order they are listed

In [191]:
def python_is(descriptor, action):
    '''
    Concatenates the string 'Python is ' with descriptor and ' everyone should ' with action 
    '''
    string = 'Python is ' + str(descriptor) + ' everyone should ' + action + ' it '
    return string

python_is('learn', 'easy')

'Python is learn everyone should easy it '

### Exercise 6.1: Make a function called get_annotation_file_name which takes the 5 variables we used in Exercise 5.1 as parameters and returns the concatenated file path (hint: you can copy and paste that part from Exercise 5.1)

In [181]:
def get_annotation_file_name(
    data_dir, 
    project, 
    version, 
    name, 
    ext):

    '''
    Concatenates data_dir and project for path.  "annotations.<version>.<extension>" is the file name.
    '''

    file_name = data_dir + '/' + project + '/' + name + '.' + str(version) + '.' + ext
    return file_name

Use the call below to test your function

In [47]:
get_annotation_file_name(data_dir, project, version, name, ext)

'data/chr12/annotations.1.txt'

# 7. Function arguments

You can specify default values when it makes sense, but parameters with defaults must come after the parameters without them

In [192]:
def python_is(descriptor, action = 'learn'):
    '''
    Concatenates the string 'Python is ' with descriptor and ' everyone should ' with action 
    '''
    string = 'Python is ' + str(descriptor) + ' everyone should ' + action + ' it '
    return string

python_is('easy')

'Python is easy everyone should learn it '

Arguments that don't have a default must be specified

In [193]:
result = python_is()

TypeError: python_is() missing 1 required positional argument: 'descriptor'

Arguments can also be passed as keyword arguments and specified in arbitrary order

In [194]:
result = python_is(action='enjoy', descriptor='useful')
result

'Python is useful everyone should enjoy it '
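Positional and keyword arguments can also be mixed in one call, as long as the positional ones come first. The sketch below repeats the python_is function so that it runs on its own; the variable name mixed is made up for the example.

```python
def python_is(descriptor, action='learn'):
    '''Same function as above, repeated so this cell runs on its own.'''
    return 'Python is ' + str(descriptor) + ' everyone should ' + action + ' it '

# 'useful' is matched to descriptor by position, action is passed by keyword.
mixed = python_is('useful', action='enjoy')
print(mixed)
```

Writing the keyword argument first, as in python_is(action='enjoy', 'useful'), is rejected by Python with a SyntaxError before the code even runs.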

### Exercise 7: Defaults for annotation file name function

#### 7.1. Copy the function you wrote from exercise 6 and add defaults for extension and name. 

In [52]:
def get_annotation_file_name(
    data_dir, 
    project, 
    version, 
    name = 'annotations', 
    ext = 'txt'):

    '''
    Concatenates data_dir and project for path.  "annotations.<version>.<extension>" is the file name.
    '''

    file_name = data_dir + '/' + project + '/' + name + '.' + str(version) + '.' + ext
    return file_name

#### 7.2 Run our annotation file function with only the required arguments.

In [54]:
get_annotation_file_name(data_dir, project, version)

'data/chr12/annotations.1.txt'

#### 7.3. Specify the name with a different value

In [61]:
get_annotation_file_name(data_dir, project, version, 'anothername')

'data/chr12/anothername.1.txt'

#### 7.4. Try specifying the arguments as keyword arguments in a different order than their position in the function definition

In [None]:
get_annotation_file_name(ext='csv', data_dir=data_dir, version=3, project='chr13')

# 8. Formatted strings

Python supports both positional and named string template substitution.  See the
[Pyformat page](https://pyformat.info/) for details

#### String concatenation is expensive because Python strings are immutable

In [64]:
file_name = get_annotation_file_name(data_dir, project, version)

In [65]:
file_name[0] = 'a'

TypeError: 'str' object does not support item assignment
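Because a string cannot be changed in place, every "modification" builds a new string. A minimal sketch, using slicing and the built-in _replace()_ function of strings; the names changed and renamed, and the replacement word 'features', are made up for the example.

```python
file_name = 'data/chr12/annotations.1.txt'

# Slicing plus concatenation creates a brand-new string object.
changed = 'D' + file_name[1:]

# replace() also returns a new string; the third argument limits
# the number of replacements to one.
renamed = file_name.replace('annotations', 'features', 1)

print(changed)
print(renamed)
print(file_name)   # the original string is untouched
```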

#### Old style string formatting is common

In [70]:
address = '%d %s %s %s,%s' % (52, 'Oxford', 'Street', 'Cambridge', 'MA')
address

'52 Oxford Street Cambridge,MA'

#### format function is more readable and powerful

The format function of strings allows for positional substitution like old style
formatting, but also supports named placeholders and rich formatting options.

_format()_ is a good example of a function that is defined on an object-oriented
"class" (here, str) and called on instances of that class, known as "objects".

You can access the members of an object, both its public functions and its public properties, through dot notation (.)

In [72]:
address = '{} {} {} {},{}'.format(52, 'Oxford', 'Street', 'Cambridge', 'MA')
address

'52 Oxford Street Cambridge,MA'

Types can be enforced using type specifiers like ':d'

In [73]:
address = '{:d} {} {} {},{}'.format(52, 'Oxford', 'Street', 'Cambridge', 'MA')
address

'52 Oxford Street Cambridge,MA'

In [75]:
address = '{:d} {} {:.2}. {},{}'.format(52, 'Oxford', 'Street', 'Cambridge', 'MA')
address

'52 Oxford St. Cambridge,MA'

Keyword arguments can be really helpful for readability

In [81]:
address = '{number:d} {street} {suffix:.2}. {city},{state}'.format(
    number=52, 
    street="Oxford", 
    suffix="Street",
    city="Cambridge", 
    state="MA" 
)
address

'52 Oxford St. Cambridge,MA'
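If you are running Python 3.6 or later (an assumption about your environment), formatted string literals, or "f-strings", offer a third option that is not used elsewhere in this notebook. They accept the same format specifiers as _format()_ but read the values directly from variables in scope. A short sketch:

```python
number = 52
street = 'Oxford'
suffix = 'Street'
city = 'Cambridge'
state = 'MA'

# The f prefix makes Python evaluate the expressions inside the braces.
# ':d' and ':.2' are the same specifiers used with format() above.
address = f'{number:d} {street} {suffix:.2}. {city},{state}'
print(address)
```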

### Exercise 8. use string formatting to rewrite the get_annotation_file_name function

#### 8.1 use old style formatting 

In [84]:
def get_annotation_file_name(data_dir, project, version, name='annotations', ext='txt'):
    '''Concatenates data_dir and project for path.  "annotations.<version>.<extension>" is the file name.'''
    
    return '%s/%s/%s.%d.%s' % (data_dir, project, name, version, ext)

In [85]:
file_name = get_annotation_file_name(data_dir, project, version)
file_name

'data/chr12/annotations.1.txt'

#### 8.2 use the format function to rewrite the get_annotation_file_name function

In [91]:
def get_annotation_file_name(data_dir, project, version, name='annotations', ext='txt'):
    '''Concatenates data_dir and project for path.  "annotations.<version>.<extension>" is the file name.'''

    return '{data_dir}/{project}/{name}.{version:d}.{ext}'.format(
        data_dir = data_dir, 
        project = project, 
        name = name, 
        version = version, 
        ext = ext
    )

In [92]:
file_name = get_annotation_file_name(data_dir, project, version)
file_name

'data/chr12/annotations.1.txt'

# 9. Lists 

Like arrays in other languages, Python lists are a group of items that can be indexed by an integer.

Lists are initialized with [] or list() and indexing starts with zero.

In [5]:
path_elements = ['nano-course', 'python', 'data', 'chr12']

In [95]:
path_elements[0]

'nano-course'

In [6]:
path_elements[2]

'data'

Check the length with _len()_

In [96]:
len(path_elements)

4

You can use negative indexes

In [97]:
path_elements[-1]

'chr12'

Slices can be taken from lists using [:] notation.  Don't forget that the upper bound index is not included.

In [98]:
path_elements[0:2]

['nano-course', 'python']

And you can slice with negative indexes

In [102]:
path_elements[-2:-1]

['data']
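Either bound of a slice can be left out, and an optional third number sets the step. A few more slices of the same list, as a sketch; the variable names on the left are made up for the example.

```python
path_elements = ['nano-course', 'python', 'data', 'chr12']

first_two = path_elements[:2]      # from the start up to, not including, index 2
last_two = path_elements[-2:]      # from the second-to-last element to the end
every_other = path_elements[::2]   # whole list, taking every second element
whole_copy = path_elements[:]      # a new list with the same elements

print(first_two)
print(last_two)
print(every_other)
print(whole_copy)
```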

Lists can be appended to

In [103]:
path_elements.append('annotations.1.txt')
path_elements

['nano-course', 'python', 'data', 'chr12', 'annotations.1.txt']

and extended

In [104]:
full_path = ['Users','akitzmiller']
full_path.extend(path_elements)
print(full_path)

['Users', 'akitzmiller', 'nano-course', 'python', 'data', 'chr12', 'annotations.1.txt']


List elements are mutable

In [105]:
path_elements[1] = 'R'
path_elements

['nano-course', 'R', 'data', 'chr12', 'annotations.1.txt']

You can also create an immutable list, a tuple, using parens

In [106]:
path_tuple = ('nano-course', 'python', 'data', 'chr12')
path_tuple[1] = 'x'

TypeError: 'tuple' object does not support item assignment
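When you do need a changed version of a tuple, one common approach is to convert it to a list, modify the list, and convert back. A minimal sketch; as_list and new_tuple are names made up for the example.

```python
path_tuple = ('nano-course', 'python', 'data', 'chr12')

# list() copies the elements into a new, mutable list.
as_list = list(path_tuple)
as_list[1] = 'R'

# tuple() builds a new immutable tuple from the modified list.
new_tuple = tuple(as_list)

print(path_tuple)   # unchanged
print(new_tuple)
```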

## Exercise 9. try out lists

#### 9.1 create a list called list1 with 3 objects 1, 2, 3

In [116]:
list1 = [1, 2, 3]
list1

[1, 2, 3]

#### 9.2 append 4, 5, and 6 to the list

In [117]:
list1.append(4)
list1.append(5)
list1.append(6)
list1

[1, 2, 3, 4, 5, 6]

#### 9.3 create a list2 with 7, 8, 9 and then extend list1 with list2

In [119]:
list2 = [7, 8, 9]
list1.extend(list2)
list1

[1, 2, 3, 4, 5, 6, 7, 8, 9]

#### 9.4 print out the element 1 of the list1 and the element 9 of list1

In [124]:
print(len(list1))
print(list1[0])
print(list1[8])
print(list1[-1])

9
1
9
9


# 10. Iterating, joining and splitting lists

We can iterate a list with a for loop.

In [127]:
for path_element in path_elements:
    print(path_element)

nano-course
R
data
chr12
annotations.1.txt


If you need the index, use _enumerate()_

In [126]:
for i, path_element in enumerate(path_elements):
    print(i, path_element)

0 nano-course
1 R
2 data
3 chr12
4 annotations.1.txt


Strings act like lists...

In [128]:
data_dir[-1]

'a'

In [129]:
for ch in data_dir:
    print(ch)

d
a
t
a


but they are not mutable

In [130]:
data_dir[1] = 'x'

TypeError: 'str' object does not support item assignment

#### You can join list elements into a string with the join function 

In [133]:
address_list = ['52', 'Oxford', 'Street', 'Cambridge', 'MA']
address_display = ' '.join(address_list)
address_display

'52 Oxford Street Cambridge MA'

#### You can also split a string into a list; by default it splits on any run of whitespace

In [136]:
address_list2 = address_display.split()
address_list2

['52', 'Oxford', 'Street', 'Cambridge', 'MA']
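Calling _split()_ with no argument is not quite the same as splitting on a single space. The sketch below uses a deliberately messy string (made up for the example) to show the difference.

```python
messy = '52  Oxford\tStreet \n'

# No argument: split on any run of whitespace (spaces, tabs, newlines)
# and ignore leading and trailing whitespace.
by_whitespace = messy.split()

# Explicit ' ': split on every single space, keeping empty strings
# and leaving tabs and newlines inside the pieces.
by_space = messy.split(' ')

print(by_whitespace)
print(by_space)
```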

## Exercise 10. More with lists 

#### 10.1 use join on path_elements to create a '/' separated path

In [140]:
'/'.join(path_elements)

'nano-course/R/data/chr12/annotations.1.txt'

#### 10.2 We can redefine the function get_annotation_file_name using a list which contains 3 elements (data_dir, project, and a formatted string of the name, version, ext) then join with '/'

In [145]:
def get_annotation_file_name(data_dir, project, version, name='annotations', ext='txt'):
    '''Concatenates data_dir and project for path.  "annotations.<version>.<extension>" is the file name.'''
    
    path_elements = [data_dir, project, '{}.{:d}.{}'.format(name, version, ext)]
    
    return '/'.join(path_elements)

In [146]:
file_name = get_annotation_file_name(data_dir, project, version)
file_name

'data/chr12/annotations.1.txt'

#### 10.3 create a list of your own with at least 3 items in it

In [207]:
grocery = ['milk', 'berries', 'chocolate']

#### 10.4 iterate over the list you created using a for loop and print out each element

In [208]:
for item in grocery:
    print(item)

milk
berries
chocolate


#### 10.5 split the following string into a list using the split() function

In [211]:
header = 'Name;Email;Address;City;State;Country'
header.split(';')

['Name', 'Email', 'Address', 'City', 'State', 'Country']

# 11. Modules 

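The _os_ module must be imported before it is used; it contains functions whose behavior depends on the operating system

os.path.join joins its arguments with the operating system's path separator ('/' on Linux and macOS) to create a path.  It needs at least one argument.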

In [152]:
os.path.join()

TypeError: join() missing 1 required positional argument: 'a'

Everything you use in a Python script must either be a built-in (e.g. print), defined in your code (e.g. get_annotation_file_name), or imported

In [17]:
import os

In [149]:
help(os.path.join)

Help on function join in module posixpath:

join(a, *p)
    Join two or more pathname components, inserting '/' as needed.
    If any component is an absolute path, all previous path components
    will be discarded.  An empty last part will result in a path that
    ends with a separator.



In [153]:
os.path.join('nano-course', 'python', 'data', 'chr12')

'nano-course/python/data/chr12'
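The help text above mentions two special cases: absolute components and an empty last part. Here is a small sketch of both. It uses os.sep, the separator for the current operating system, so the printed results will differ between Linux/macOS and Windows; the variable names are made up for the example.

```python
import os

# Ordinary case: the components are joined with the OS separator.
plain = os.path.join('data', 'chr12')

# A component that starts at the root discards everything before it.
rooted = os.path.join('nano-course', os.sep + 'data', 'chr12')

# An empty last component leaves a trailing separator.
trailing = os.path.join('data', '')

print(plain)
print(rooted)
print(trailing)
```

The second case is a common source of surprises when one of the pieces comes from user input and happens to begin with a separator.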

## Exercise 11.1 rewrite the get_annotation_file_name function again using the os.path.join function

In [18]:
def get_annotation_file_name(data_dir, project, version, name='annotations', ext='txt'):
    '''Concatenates data_dir and project for path.  "annotations.<version>.<extension>" is the file name.'''
    
    path = os.path.join(data_dir, project, '{}.{:d}.{}'.format(name, version, ext))
    
    return path

In [19]:
file_name = get_annotation_file_name(data_dir, project, version)
file_name

'data/chr12/annotations.1.txt'

# 12. If else

Python's if statement allows flow control by executing blocks of code only when conditions are met.  Else if is shortened to elif; there can be zero or more elif parts, and the else part is optional.

In [205]:
x = int(input("input an int please: "))

input an int please:  7


In [206]:
if x > 0:
    print('positive')
elif x == 0:
    print('zero')
else:
    print('negative')

positive


## Exercise 12. Species to common name

#### 12.1 write an input string called org for organism

In [2]:
org = input("input an organism: ")

input an organism:  dog


#### 12.2 set a variable common_name to empty string ('') then write an if/elif/else block that maps a few species to their common names, with an else branch that sets common_name to 'Unknown' for any other species

In [3]:
'''
here are the species -> common names that your function should use
Homo sapiens -> Human
Pan troglodytes -> Chimp
Macaca mulatta -> Macaque
other -> Unknown
'''
common_name = ''
if org == 'Homo sapiens':
    common_name = 'Human'
elif org == 'Pan troglodytes':
    common_name = 'Chimp'
elif org == 'Macaca mulatta':
    common_name = 'Macaque'
else:
    common_name = 'Unknown'
common_name

'Unknown'

#### 12.3 write a function called get_common_name() that takes an organism and returns the common name or Unknown (feel free to copy and paste some of your code from above)

In [8]:
def get_common_name(org):
    common_name = ''
    if org == 'Homo sapiens':
        common_name = 'Human'
    elif org == 'Pan troglodytes':
        common_name = 'Chimp'
    elif org == 'Macaca mulatta':
        common_name = 'Macaque'
    else:
        common_name = 'Unknown'
    return common_name

get_common_name('Homo sapiens')

'Human'

# 13. Open a file 

In Python you interact with a file by opening a file handle in a particular mode, in this case 'read'.  A file handle is a lot like a pointer to the next part of the file that you're going to read.

In [195]:
sample_file = 'data/samples.txt'
fileh = open(sample_file, 'r')

Read it all into a single string using _read()_

In [196]:
fileh.read()

'sample1\tHomo sapiens\t\nsample2\tPan troglodytes\t\nsample3\tHomo sapiens\t\nsample4\tMacaca mulatta\t\n'

Read it into a list of lines using _readlines()_.  You may need to re-open the file, because the fileh is now pointing to the end.

In [197]:
fileh.readlines()

[]

In [198]:
fileh = open(sample_file, 'r')

In [199]:
lines = fileh.readlines()
lines

['sample1\tHomo sapiens\t\n',
 'sample2\tPan troglodytes\t\n',
 'sample3\tHomo sapiens\t\n',
 'sample4\tMacaca mulatta\t\n']

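Or, especially if your file is large, you can read one line at a time using _for_, because a file handle can be iterated over like a list. <br/>print() shows the tabs and newlines themselves rather than the \t and \n escape sequences; strip() removes the trailing whitespace from each line so that print() does not add a second newline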

In [217]:
fileh = open(sample_file, 'r')

In [218]:
for line in fileh:
    print(line.strip())

Sample	Species
sample1	Homo sapiens
sample2	Pan troglodytes
sample3	Homo sapiens
sample4	Macaca mulatta


In [214]:
fileh = open(sample_file, 'r')

for line in fileh:
    if not line.startswith('Sample'):
        print(line.strip())

sample1	Homo sapiens
sample2	Pan troglodytes
sample3	Homo sapiens
sample4	Macaca mulatta


Using a context manager (_with_ _as_) is a good way to ensure that the file will close when you're done with it.

In [215]:
with open(sample_file, 'r') as fileh:
    for line in fileh:
        if not line.startswith('Sample'):
            print(line.strip())

sample1	Homo sapiens
sample2	Pan troglodytes
sample3	Homo sapiens
sample4	Macaca mulatta


We can see that the fileh is closed because we are using the context manager.

In [174]:
fileh.closed

True
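The same pattern works for writing as well as reading. The sketch below is self-contained: it writes a small throwaway file in a temporary directory (tempfile.TemporaryDirectory is itself a context manager from the standard library) and reads it back. The file name demo_samples.txt and the variable names are made up for the example.

```python
import os
import tempfile

demo_lines = []

# The temporary directory is deleted when this outer block ends.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_samples.txt')

    # Mode 'w' opens the file for writing; it is closed at the end of the block.
    with open(demo_path, 'w') as out_fh:
        out_fh.write('sample1\tHomo sapiens\n')
        out_fh.write('sample2\tPan troglodytes\n')

    # Re-open the same file for reading.
    with open(demo_path, 'r') as in_fh:
        for line in in_fh:
            demo_lines.append(line.strip())

print(demo_lines)
print(out_fh.closed, in_fh.closed)
```

Both handles report closed even though close() was never called explicitly, which is the main reason to prefer _with_ over a bare open().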

## Exercise 13. Try opening the annotations.1.txt file

#### 13.1 open the annotations file and print out the lines 

In [204]:
lines = []
header_line = ''
with open(file_name, 'r') as fileh:
    for line in fileh:
        lines.append(line.strip())
lines

['Accession\tOrganism\tGene name\tSeq type\tReference\tLength',
 'sample1\tHomo sapiens\tacrosin binding protein\tmRNA\tNM_032489.3\t1905',
 'sample2\tPan troglodytes\tpotassium voltage-gated channel subfamily A member 1 \tmRNA\tXM_003313436.4\t7990',
 'sample3\tHomo sapiens\tinhibitor of growth family member 4\tmRNA\tXM_011520964.2\t1321',
 'sample4\tMacaca mulatta\tNOP2 nucleolar protein\tmRNA\tXM_015150909.2\t2759']

#### 13.2 print out everything but the header row, optionally store the header row in a variable called header

In [219]:
header = ''
with open(file_name, 'r') as fileh:
    for line in fileh:
        if line.startswith('Accession'):
            header = line.strip()
        else:
            print(line.strip())
print(header)

sample1	Homo sapiens	acrosin binding protein	mRNA	NM_032489.3	1905
sample2	Pan troglodytes	potassium voltage-gated channel subfamily A member 1 	mRNA	XM_003313436.4	7990
sample3	Homo sapiens	inhibitor of growth family member 4	mRNA	XM_011520964.2	1321
sample4	Macaca mulatta	NOP2 nucleolar protein	mRNA	XM_015150909.2	2759
Accession	Organism	Gene name	Seq type	Reference	Length


#### 13.3 now use the split function on each line before you add it to the list (as you recall the split function splits a string into a list)

In [49]:
lines = []
header = ''
with open(file_name, 'r') as fileh:
    for line in fileh:
        line = line.strip()
        if line.startswith('Accession'):
            header = line
        else:
            lines.append(line.split('\t'))
lines

[['sample1',
  'Homo sapiens',
  'acrosin binding protein',
  'mRNA',
  'NM_032489.3',
  '1905'],
 ['sample2',
  'Pan troglodytes',
  'potassium voltage-gated channel subfamily A member 1 ',
  'mRNA',
  'XM_003313436.4',
  '7990'],
 ['sample3',
  'Homo sapiens',
  'inhibitor of growth family member 4',
  'mRNA',
  'XM_011520964.2',
  '1321'],
 ['sample4',
  'Macaca mulatta',
  'NOP2 nucleolar protein',
  'mRNA',
  'XM_015150909.2',
  '2759']]

#### 13.4 using list indexing get the second element of the first row of lines

In [223]:
lines[0]

['sample1',
 'Homo sapiens',
 'acrosin binding protein',
 'mRNA',
 'NM_032489.3',
 '1905']

In [224]:
lines[0][1]

'Homo sapiens'

#### 13.5 loop through each line of data and use the get_common_name function you wrote to get the common name for that row and add it to a list called common_names

#### Report out the unique organism common names using a list

In [22]:
common_names = []
for row in lines:
    org = row[1]
    common_names.append(get_common_name(org))
common_names

['Human', 'Chimp', 'Human', 'Macaque']

# 14. Sets and Dictionaries

A _set_ is a collection of unique elements that can participate in set operations like unions and intersections

In [30]:
model_organisms = set(['Human', 'Mouse', 'Fruit fly', 'Macaque', 'Zebrafish'])
model_organisms

{'Fruit fly', 'Human', 'Macaque', 'Mouse', 'Zebrafish'}

you can add elements to a set with the function add which is similar to list append

In [31]:
model_organisms.add('E. coli')
model_organisms

{'E. coli', 'Fruit fly', 'Human', 'Macaque', 'Mouse', 'Zebrafish'}

But it will not add duplicates

In [32]:
model_organisms.add('Human')
model_organisms

{'E. coli', 'Fruit fly', 'Human', 'Macaque', 'Mouse', 'Zebrafish'}
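The set operations mentioned at the top of this section can be sketched as follows. The primates set is made up for the example; the operators |, & and - are the built-in union, intersection and difference operators for sets.

```python
model_organisms = {'Human', 'Mouse', 'Fruit fly', 'Macaque', 'Zebrafish'}
primates = {'Human', 'Chimp', 'Macaque'}   # hypothetical second set

in_both = model_organisms & primates        # intersection
in_either = model_organisms | primates      # union
only_primates = primates - model_organisms  # difference

print(in_both)
print(in_either)
print(only_primates)
```

Sets are unordered, so the elements may be printed in a different order from the one shown when the set was created.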

A dictionary is a set of key: value pairs.  The keys are unique (within one dictionary). A pair of braces creates an empty dictionary: {}. You can add keys with bracket notation, setting them equal to values.

In [34]:
capitals = {}
capitals['MA'] = 'Boston'
capitals['NH'] = 'Concord'
capitals

{'MA': 'Boston', 'NH': 'Concord'}

A dictionary can also be initialized with values using a colon to separate keys and values.

In [35]:
capitals = {'MA': 'Boston', 'NH': 'Concord'}
capitals

{'MA': 'Boston', 'NH': 'Concord'}

You can access individual elements by key

In [36]:
capitals['MA']

'Boston'

It's an error to access a key that isn't there.

In [37]:
capitals['ME']

KeyError: 'ME'

But you can use the _get()_ function to safely return a default value

In [38]:
capitals.get('ME', 'not available')

'not available'
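Another way to avoid the KeyError is to test for the key with the _in_ operator before looking it up. A short sketch; the variable names has_ma, has_me and fallback are made up for the example.

```python
capitals = {'MA': 'Boston', 'NH': 'Concord'}

# 'in' checks the keys of a dictionary, not the values.
has_ma = 'MA' in capitals
has_me = 'ME' in capitals

# Without a second argument, get() returns None for a missing key.
fallback = capitals.get('ME')

print(has_ma, has_me, fallback)
```

Note that get() only reads the dictionary: asking for a missing key does not add it.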

You can iterate over a dictionary with _for_ using the _items()_ function

In [39]:
for k, v in capitals.items():
    print('the capital of %s is %s' % (k, v))

the capital of MA is Boston
the capital of NH is Concord


You can use the zip function to pair up two lists as the first step in turning them into a dictionary

In [40]:
states = ['MA', 'NH', 'ME']
cities = ['Boston', 'Concord', 'Augusta']
capitals = zip(states, cities)
capitals

<zip at 0x7f5b38292388>

Note that this is a zip object.  To finish turning it into a dictionary, use the dict() function

In [41]:
dict(capitals)

{'MA': 'Boston', 'NH': 'Concord', 'ME': 'Augusta'}
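One thing to watch for: in Python 3 a zip object is an iterator, so it can only be consumed once. The sketch below shows what happens when the same zip object is converted twice; the names pairs, first and second are made up for the example.

```python
states = ['MA', 'NH', 'ME']
cities = ['Boston', 'Concord', 'Augusta']

pairs = zip(states, cities)

first = dict(pairs)    # consumes every pair
second = dict(pairs)   # nothing is left, so this is empty

print(first)
print(second)
```

If you need the pairs more than once, either call zip again or store the result of dict() (or list()) in a variable.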

## Exercise 14

#### 14.1 rewrite your code from 13.5 using a set rather than a list, so that each common name appears only once

In [43]:
common_names = set()
for row in lines:
    org = row[1]
    common_names.add(get_common_name(org))
common_names

{'Chimp', 'Human', 'Macaque'}

#### 14.2 create a dictionary to map the organism to common names we used in get_common_name()

In [44]:
org_names = {
    'Homo sapiens': 'Human',
    'Pan troglodytes': 'Chimp',
    'Macaca mulatta': 'Macaque'
}
org_names

{'Homo sapiens': 'Human',
 'Pan troglodytes': 'Chimp',
 'Macaca mulatta': 'Macaque'}

#### 14.3 loop through the dictionary and print out the keys and values

In [45]:
for org, common in org_names.items():
    print('%s (%s)' % (org, common))

Homo sapiens (Human)
Pan troglodytes (Chimp)
Macaca mulatta (Macaque)


#### 14.4 use the _dict.get()_ function to look for 'Mus musculus' and make sure to add a default (the second parameter)

In [46]:
org_names.get('Mus musculus', 'Not found')

'Not found'

#### 14.5 rewrite the fetching of our data rows from annotations.1.txt file (13.3) to make each row a dictionary using the zip function, use the header as the keys (you will have to split the header into a list)

In [50]:
col_names = header.split('\t')
col_names

['Accession', 'Organism', 'Gene name', 'Seq type', 'Reference', 'Length']

In [51]:
labeled_data = []
for row in lines:
    labeled_row = zip(col_names, row)
    labeled_data.append(dict(labeled_row))
labeled_data

[{'Accession': 'sample1',
  'Organism': 'Homo sapiens',
  'Gene name': 'acrosin binding protein',
  'Seq type': 'mRNA',
  'Reference': 'NM_032489.3',
  'Length': '1905'},
 {'Accession': 'sample2',
  'Organism': 'Pan troglodytes',
  'Gene name': 'potassium voltage-gated channel subfamily A member 1 ',
  'Seq type': 'mRNA',
  'Reference': 'XM_003313436.4',
  'Length': '7990'},
 {'Accession': 'sample3',
  'Organism': 'Homo sapiens',
  'Gene name': 'inhibitor of growth family member 4',
  'Seq type': 'mRNA',
  'Reference': 'XM_011520964.2',
  'Length': '1321'},
 {'Accession': 'sample4',
  'Organism': 'Macaca mulatta',
  'Gene name': 'NOP2 nucleolar protein',
  'Seq type': 'mRNA',
  'Reference': 'XM_015150909.2',
  'Length': '2759'}]

#### 14.6 optional, if you have time: rather than using zip, iterate through the header and the row elements together using the enumerate function

In [53]:
labeled_data = []
for row in lines:
    labeled_row = {}
    for i, col_name in enumerate(col_names):
        labeled_row[col_name] = row[i]
    labeled_data.append(labeled_row)
labeled_data
    

[{'Accession': 'sample1',
  'Organism': 'Homo sapiens',
  'Gene name': 'acrosin binding protein',
  'Seq type': 'mRNA',
  'Reference': 'NM_032489.3',
  'Length': '1905'},
 {'Accession': 'sample2',
  'Organism': 'Pan troglodytes',
  'Gene name': 'potassium voltage-gated channel subfamily A member 1 ',
  'Seq type': 'mRNA',
  'Reference': 'XM_003313436.4',
  'Length': '7990'},
 {'Accession': 'sample3',
  'Organism': 'Homo sapiens',
  'Gene name': 'inhibitor of growth family member 4',
  'Seq type': 'mRNA',
  'Reference': 'XM_011520964.2',
  'Length': '1321'},
 {'Accession': 'sample4',
  'Organism': 'Macaca mulatta',
  'Gene name': 'NOP2 nucleolar protein',
  'Seq type': 'mRNA',
  'Reference': 'XM_015150909.2',
  'Length': '2759'}]

### Sort the records by length

#### Python sorts lists by 'natural' order, either in place...

In [None]:
letters = ['a','x','t']
letters.sort()
letters

In [None]:
numbers  = [1, 5, 20, 1.5]
numbers.sort()
numbers

In [None]:
numberchars = ['1', '2', '100', '150']
numberchars.sort()
numberchars
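The numberchars example is worth a closer look, because strings are compared character by character rather than by numeric value. The sketch below contrasts the default order with a numeric one; it uses the key argument of _sorted()_, which is introduced a little further down, with the built-in int function as the key.

```python
numberchars = ['1', '2', '100', '150']

# Default: text order, so '100' and '150' come before '2'.
as_text = sorted(numberchars)

# key=int converts each item for comparison only; the results stay strings.
as_numbers = sorted(numberchars, key=int)

print(as_text)
print(as_numbers)
```

This matters for the annotation data above, where the Length column was read from the file as strings.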

#### ... or as new list

In [None]:
numbers = [1,5,3,8]
sortednumbers = sorted(numbers)
numbers

In [None]:
sortednumbers

#### Reversing the direction is easy

In [None]:
sortednumbers.sort(reverse=True)
sortednumbers

#### A key function provides flexibility in sorting

In [None]:
def case_insensitive(item):
    return item.lower()

words = ['and', 'or', 'But']
sortedwords = sorted(words)
sortedwords

In [None]:
sortedwords = sorted(words, key=case_insensitive)
sortedwords

In [None]:
def seq_length(item):
    return int(item['Length'])

sorted_labeled_data = sorted(labeled_data, key=seq_length, reverse=True)
sorted_labeled_data

### Read FASTA records and set a more informative description line

FASTA records have two parts, a description line, starting with '>', and the sequence, e.g.

    >NC_000012.12 Homo sapiens chromosome 12, GRCh38.p13 Primary Assembly     <-- Description line
    ATCGAGACCATCCTGGCCAACATAGTGAAAACCTTTCTCTACTAAAAATACAAAAATTAGCCAGGTATGG    <-- Sequence (DNA in this case)
    TCGAGAGGCTGAGGCAGGAGGATCGCTTAAACCTGGGAGGTAGAGGTTCCAGTGAGCTGAGATTGCGACA
    ...
    >NC_000013.12 Homo sapiens chromosome 13, GRCh38.p13 Primary Assembly

In this example, the first line is the description line, starting with a '>' and the second line starts the DNA sequence.
There can be multiple lines of sequence separated by newlines or just a single line.

The description line has further structure in that the characters between the '>' and the first whitespace are 
treated as the sequence record identifier, in this case NC_000012.12 or NC_000013.12

More than one FASTA record may be in a FASTA file.
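As a minimal sketch of the structure just described, the identifier can be pulled out of a description line by dropping the '>' and splitting once on whitespace. The variable names (fields, record_id, remainder) are made up for the example.

```python
description = '>NC_000012.12 Homo sapiens chromosome 12, GRCh38.p13 Primary Assembly'

# Drop the leading '>' and split at the first run of whitespace only.
fields = description[1:].split(None, 1)

record_id = fields[0]
# A description line may consist of the identifier alone.
remainder = fields[1] if len(fields) > 1 else ''

print(record_id)
print(remainder)
```

Passing None as the separator keeps the default whitespace splitting, and the 1 limits the result to at most two pieces.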


First, let's look at the description lines in our samples.fa sequence file

In [None]:
sample_file = 'data/chr12/samples.fa'
fileh = open(sample_file, 'r')
for line in fileh:
    line = line.strip()
    if line.startswith('>'):
        print(line)

Next, let's read them into a list of dictionaries so that we can make changes before we write them out. 

We'll need to create a new dictionary for each record (each time we see '>')

There are multiple lines of DNA sequence for each record that should get saved

In [None]:
fasta_records = []
sample_file = 'data/chr12/samples.fa'
current_description = None
current_sequence_lines = []
with open(sample_file, 'r') as fileh:
    for line in fileh:
        line = line.strip()
        if line.startswith('>'):
            # A new '>' line means the previous record (if any) is complete
            if current_description is not None:
                new_record = {'description': current_description, 'sequence_lines': current_sequence_lines}
                fasta_records.append(new_record)
            current_description = line
            current_sequence_lines = []
        else:
            current_sequence_lines.append(line)
# Save the last record, which has no following '>' line to trigger the append
fasta_records.append({'description': current_description, 'sequence_lines': current_sequence_lines})
    

In [None]:
fasta_records
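
The same parsing logic can also be wrapped in a reusable generator function. The sketch below is one possible arrangement, not the notebook's own solution: the name `read_fasta` is hypothetical, the input is a made-up two-record string, and unlike the loop above it skips blank lines and yields nothing for an empty input.

```python
import io


def read_fasta(handle):
    """Yield one dict per FASTA record from an open file-like object."""
    description = None
    sequence_lines = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new description line closes the previous record
            if description is not None:
                yield {'description': description, 'sequence_lines': sequence_lines}
            description = line
            sequence_lines = []
        else:
            sequence_lines.append(line)
    # Emit the final record, if the input contained any
    if description is not None:
        yield {'description': description, 'sequence_lines': sequence_lines}


# Made-up example data; with a real file you would write
# `with open(path) as handle:` so the file is closed automatically.
example = io.StringIO('>seq1 first\nACGT\nTTGA\n>seq2 second\nGGCC\n')
records = list(read_fasta(example))
print(records)
```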

Change the description lines to include the gene name, organism and sequence type so that sample1, for example, looks like this:

    >sample1 Homo sapiens acrosin binding protein, mRNA
    
The string _.format()_ method should work well.

First, make a dictionary out of our annotations data, keyed by the sample name
    

In [None]:
labeled_data_dict = {}
for record in sorted_labeled_data:
    labeled_data_dict[record['Accession']] = record
labeled_data_dict

In [None]:
for fasta_record in fasta_records:
    key = fasta_record['description'][1:]
    record = labeled_data_dict[key]
    new_description = '>{accession} {organism} {gene_name}, {seq_type}'.format(
        accession=record['Accession'],
        organism=record['Organism'],
        gene_name=record['Gene name'],
        seq_type=record['Seq type'],
    )
    fasta_record['description'] = new_description
fasta_records
    

Use the write function of the file handle to write to the new file.  Don't forget to add newlines.

In [None]:
annotated_sample_file = 'data/chr12/annotated-samples.fa'
fileh = open(annotated_sample_file, 'w')
for fasta_record in fasta_records:
    fileh.write('%s\n' % fasta_record['description'])
    fileh.write('%s\n' % '\n'.join(fasta_record['sequence_lines']))
fileh.close()
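
As a variation, the writing step can be put in a small function that accepts any writable file-like object. This is a sketch under assumptions: `write_fasta` is a hypothetical name, the records are made up, and an in-memory `io.StringIO` buffer stands in for a real file.

```python
import io


def write_fasta(handle, records):
    """Write records (dicts with 'description' and 'sequence_lines') to handle."""
    for record in records:
        handle.write(record['description'] + '\n')
        for sequence_line in record['sequence_lines']:
            handle.write(sequence_line + '\n')


# Made-up records for illustration
demo_records = [
    {'description': '>demo1 made-up record', 'sequence_lines': ['ACGT', 'TTGA']},
    {'description': '>demo2 made-up record', 'sequence_lines': ['GGCC']},
]
buffer = io.StringIO()
write_fasta(buffer, demo_records)
print(buffer.getvalue())
```

With a real file, `with open(path, 'w') as handle:` would close the file for you, so no explicit `close()` call is needed.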

## Run minimap2 using annotated-samples.fa as the query and chr12.fa.gz as the reference sequence

minimap2 is a command line tool for mapping query sequences to a reference.  This is useful for characterizing 
query sequences, SNP detection, finding orthologs (from close relatives), etc.  Command line usage is described 
as follows:

    Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]

where 'target' is the reference sequence (chr12.fa.gz for us)

In [None]:
target_file = 'data/chr12/chr12.fa.gz'

In [None]:
cmd = './minimap2 {} {}'.format(target_file, annotated_sample_file)

In [None]:
cmd

### The most convenient way to run a shell command is _os.system()_

_os.system_ runs a command in a subshell and sends stderr and stdout to the console.  It returns a status value that is zero when the command succeeds.

Because it goes to the console, your Python code does not capture the output.

Execution is synchronous, so your program has to wait until it's done.

The command is interpreted by a shell (on Unix this is the system's _/bin/sh_, which may not be your login shell), so PATH is honored, redirection works, etc.

In [None]:
os.system(cmd)

You can check the return code for non-zero-ness

In [None]:
cmd = './minimap2 --non-existent-switch {} {}'.format(target_file, annotated_sample_file)

In [None]:
if os.system(cmd) != 0:
    print('Fail!')
else:
    print('Success!')
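
A small sketch of what the return value looks like, assuming a Unix-like system where the standard `true` and `false` utilities are available. On Unix the non-zero value is an encoded wait status rather than the bare exit code, so comparing against zero is safe but reading the exact code takes one more step.

```python
import os

# `true` exits with status 0 and `false` exits with status 1
ok_status = os.system('true')
fail_status = os.system('false')
print(ok_status == 0, fail_status != 0)

# Ask the shell to exit with code 3 and decode the returned status.
# os.waitstatus_to_exitcode exists in Python 3.9+; the fallback relies on
# the Unix convention that the exit code sits in the high byte.
raw_status = os.system('exit 3')
if hasattr(os, 'waitstatus_to_exitcode'):
    exit_code = os.waitstatus_to_exitcode(raw_status)
else:
    exit_code = raw_status >> 8
print(exit_code)
```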

But you need to capture stderr to find out what happened

In [None]:
cmd = './minimap2 --non-existent-switch {} {} 2> stderr 1> stdout'.format(target_file, annotated_sample_file)

In [None]:
if os.system(cmd) != 0:
    # Read back the stderr file that the shell redirection created
    with open('stderr', 'r') as stderrh:
        print(stderrh.read())

### The subprocess _Popen()_ constructor allows more flexibility and power in the execution of shell commands.

The _Popen()_ constructor creates a process handle that can be used to capture stderr, stdout or pipe data into
stdin.

Run a process using Popen just like _os.system()_

In [24]:
import subprocess

In [None]:
cmd = './minimap2 -a {} {}'.format(target_file, annotated_sample_file)

In [None]:
proc = subprocess.Popen(cmd, shell=True)
proc.wait()

To capture stderr and stdout, use _PIPE_ and _.communicate()_

In [None]:
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

In [None]:
stdout, stderr = proc.communicate()
if proc.returncode == 0:
    print(stdout)
else:
    print('Fail %s' % stderr)

In Python 3, shell output is returned as a _bytes_ object that must be decoded to get a string

In [None]:
proc = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = proc.communicate()
if proc.returncode == 0:
    print(stdout.decode('ascii'))
else:
    print('Fail %s' % stderr.decode('ascii'))

A runcmd function can be handy

In [25]:
def runcmd(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE):
    proc = subprocess.Popen(cmd, shell=True, stdout=stdout, stderr=stderr)
    out, err = proc.communicate()  # each is None if that stream was not piped
    return {'returncode': proc.returncode,
            'stdout': out.decode('utf-8') if out is not None else None,
            'stderr': err.decode('utf-8') if err is not None else None}

In [None]:
result = runcmd(cmd)

In [None]:
print(result['returncode'], "\n", result['stdout'].split("\n")[:10], "\n", result['stderr'])
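
For comparison, the standard library's `subprocess.run` covers much the same ground as the `runcmd` helper. The sketch below assumes Python 3.7 or later and a Unix-like shell, and uses `echo` as a stand-in for a real tool such as minimap2.

```python
import subprocess

completed = subprocess.run(
    'echo hello && echo oops 1>&2',
    shell=True,
    capture_output=True,  # shorthand for stdout=PIPE and stderr=PIPE
    text=True,            # decode the captured bytes to str for us
)
print(completed.returncode)
print(repr(completed.stdout))
print(repr(completed.stderr))
```

The returned object carries `returncode`, `stdout` and `stderr` as attributes, much like the keys of the dictionary returned by `runcmd`.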

### A Pool from the multiprocessing module can support parallel execution

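In the standard CPython interpreter, the [GIL](https://realpython.com/python-gil/) prevents threads from running Python code in parallel on multiple cores.  The multiprocessing module offers an interface similar to the threading library, but uses separate processes (forked, on Linux) instead of threads.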

#### An interlude about Python modules

##### A module is a file with Python definitions and statements.  The _import_ statement allows you to use those definitions in your code

The creation of modules is how Python libraries are made and shared.

For example, if you're doing several projects with DNA sequence, you might like a module that had common DNA sequence manipulations.  In a file called dna.py you could define several functions and data that you might use repeatedly:

```python
DNA_COMPLEMENT = {
    'A': 'T',
    'T': 'A',
    'C': 'G',
    'G': 'C',
}

def reverse_complement(dna):
    '''
    Return the reverse complement of the DNA sequence
    '''
    complement = []
    for base in reversed(dna):
        complement.append(DNA_COMPLEMENT[base.upper()])
    return ''.join(complement)


def translate(dna, frame=0):
    '''
    Translate a string of dna sequence into protein sequence using the given frame
    '''
    protein_sequence = []
    for i in range(frame, len(dna), 3):
        ...
    return ''.join(protein_sequence)

def transcribe(dna):
    '''
    Convert DNA into RNA
    '''
    return dna.replace('T', 'U')
```
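
As an aside, a reverse complement can also be written with the built-in `str.maketrans` and `str.translate`. This is an alternative sketch, not the module above: it preserves case rather than upper-casing, and characters missing from the table (such as N) pass through unchanged instead of raising a `KeyError`.

```python
# Translation table mapping each base to its complement (upper and lower case)
COMPLEMENT_TABLE = str.maketrans('ACGTacgt', 'TGCAtgca')


def reverse_complement(dna):
    """Return the reverse complement of a DNA string as a string."""
    # translate() swaps each base; the [::-1] slice reverses the result
    return dna.translate(COMPLEMENT_TABLE)[::-1]


print(reverse_complement('ATGC'))
```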


To use the functions in this file, you can either import the entire module and access the functions via the dot operator:

```python
import dna

transcript_sequence = 'TACGATCGATCGATCGATTATCGATCAGTCA'
protein_sequence = dna.translate(transcript_sequence)
```

Or you could import specific functions from the file

```python
from dna import translate

protein_sequence = translate('TACGATCGATCGATCGATTATCGATCAGTCA')
``` 
    
The _from_ keyword names the module to look in; only the names listed after _import_ become available in your code

##### Python modules can be organized in directories traversed by _from_

If the _dna.py_ file described above is placed under a path, e.g. _seqlib/seq/nuc/dna.py_, functions could be accessed using the _from_ keyword with dots replacing the path separator.

```python
from seqlib.seq.nuc.dna import transcribe
```
    
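This will work as long as Python can find the top-level _seqlib_ directory.  A file named \_\_init\_\_.py in each of the directories marks it as a package; this was required in Python 2, and although Python 3.3 and later can import from directories without it, including it remains the common convention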
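
To see the directory-to-dots mapping in action, the sketch below builds a throwaway package tree in a temporary directory and imports from it. The package name `seqlib_demo` is made up for this illustration so that it cannot clash with a real module.

```python
import importlib
import os
import sys
import tempfile

# Build <tmp>/seqlib_demo/seq/nuc/ as a nested package tree
root = tempfile.mkdtemp()
package_dir = os.path.join(root, 'seqlib_demo', 'seq', 'nuc')
os.makedirs(package_dir)

# Put an empty __init__.py in each directory level
current = root
for part in ['seqlib_demo', 'seq', 'nuc']:
    current = os.path.join(current, part)
    with open(os.path.join(current, '__init__.py'), 'w') as fileh:
        fileh.write('')

# Write a tiny dna.py module containing one function
with open(os.path.join(package_dir, 'dna.py'), 'w') as fileh:
    fileh.write("def transcribe(dna):\n    return dna.replace('T', 'U')\n")

# Make the tree findable, then import with dots in place of path separators
sys.path.insert(0, root)
importlib.invalidate_caches()
from seqlib_demo.seq.nuc.dna import transcribe

print(transcribe('ATGT'))
```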

##### Python starts looking for modules based on the value of _sys.path_, which may include PYTHONPATH, the current directory, and ~/.local

    [akitzmiller@bioinf01 ~]$ echo $PYTHONPATH
    /odyssey/rc_admin/sw/admin/rcpy:

    [akitzmiller@bioinf01 ~]$ pwd
    /n/home_rc/akitzmiller

    [akitzmiller@bioinf01 ~]$ python
    Python 2.7.5 (default, Apr  9 2019, 14:30:50) 
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    
    >>> import sys, os
    
    >>> os.environ['PYTHONPATH']
    '/odyssey/rc_admin/sw/admin/rcpy:'
    
    >>> print '\n'.join(sys.path)

    /odyssey/rc_admin/sw/admin/rcpy
    /n/home_rc/akitzmiller
    /usr/lib64/python27.zip
    /usr/lib64/python2.7
    /usr/lib64/python2.7/plat-linux2
    /usr/lib64/python2.7/lib-tk
    /usr/lib64/python2.7/lib-old
    /usr/lib64/python2.7/lib-dynload
    /usr/lib64/python2.7/site-packages
    /usr/lib64/python2.7/site-packages/gtk-2.0
    /usr/lib/python2.7/site-packages
    >>> 


##### You can find where a module comes from using the \_\_file\_\_ attribute of the module
Seriously, everything is an object

In [None]:
os.__file__

##### sys.path is set up relative to the interpreter path, which is why virtual environments work (more about them later)

In [None]:
import sys
print('\n'.join(sys.path))

#### A multiprocessing Pool allows you to manage parallel processes easily

A multiprocessing Pool is an object that allows you to launch, manage, and retrieve results from a set of forked processes.

#### The _map_ function applies a single-argument function to each value in a list.  This is a useful way to do a "parameter sweep" type of execution.

```python
from multiprocessing import Pool
import os

def echo(echoable):
    os.system('echo %s && sleep 10' % echoable)
    
echoables = [
    'ajk',
    '123',
    'qwerty',
    'uiop',
    'lkjdsa',
]

numprocs = 3
pool = Pool(numprocs)
result = pool.map(echo,echoables)
```

_123_ <br/>
_ajk_ <br/>
_qwerty_ <br/>
_lkjdsa_ <br/>
_uiop_ <br/>


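#### The _apply_async_ function lets you call a function with several arguments and returns a 'handle' for retrieving the result.

In order for this to work in parallel, submit all of the jobs first and collect the result handles in a list.  Calling _get()_ on a handle blocks until that job has finished, so calling it inside the submission loop would run the jobs one at a time.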

```python
from multiprocessing import Pool
import os
def greet(name, message):
    os.system('echo "Hi %s, %s" && sleep 10' % (name,message))
    return '%s was greeted' % name

greetings = [
    ('Aaron', "What's up?"),
    ('Bert', "Where's Ernie?"),
    ('Donald', "What're you thinking?"),
    ('folks', 'Sup!'),
]
numprocs = 3
pool = Pool(numprocs)
results = []
for greeting in greetings:
    result = pool.apply_async(greet, greeting)
    results.append(result)
```

_Hi Bert, Where's Ernie?_ <br/>
_Hi Aaron, What's up?_ <br/>
_Hi Donald, What're you thinking?_ <br/>
_Hi folks, Sup!_ <br/>
    
```python
for result in results:
    print(result.get())
```

_Aaron was greeted_ <br/>
_Bert was greeted_ <br/>
_Donald was greeted_ <br/>
_folks was greeted_ <br/>
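
When every job calls the same function, `Pool.starmap` is a compact alternative to collecting `apply_async` handles yourself. The sketch below makes two assumptions: a Unix-like system where the 'fork' start method is available, and the standard library's `operator.contains` as a stand-in worker (in a script or notebook you would normally pass your own top-level function). The sequences are made up.

```python
import multiprocessing
import operator

# Made-up (sequence, motif) pairs; operator.contains(a, b) evaluates `b in a`
jobs = [
    ('ATGCGT', 'GCG'),
    ('TTTT', 'A'),
    ('CCATGG', 'ATG'),
]

# Assumption: the 'fork' start method is available (Linux and other Unix-likes)
context = multiprocessing.get_context('fork')
with context.Pool(2) as pool:
    # starmap unpacks each tuple into the function's arguments, blocks until
    # all jobs are done, and returns results in the same order as the inputs
    found = pool.starmap(operator.contains, jobs)

print(found)
```

The trade-off is that `starmap` blocks until all jobs are done, whereas the handles from `apply_async` let you collect each result separately.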


#### Run several minimap2 processes in parallel

Create a function that runs minimap2

In [21]:
def minimap2(target_file, query_file):
    cmd = './minimap2 {} {}'.format(target_file, query_file)
    return runcmd(cmd)

Set up function arguments in a list

In [None]:
queries = [
    'data/chr12/annotated-samples.fa',
    'data/chr12/mouse.fa',
    'data/chr12/zebrafish.fa',
]
target = 'data/chr12/chr12.fa.gz'

Running in series will be pretty slow

In [None]:
import time

starttime = time.time()
for query in queries:
    output = minimap2(target, query)
    print(output['stderr'])
elapsed = time.time() - starttime
print('%d seconds elapsed' % elapsed)

Running the same jobs in parallel through a Pool should take less time

In [None]:
from multiprocessing import Pool

numprocs = 2
pool = Pool(numprocs)
results = []
starttime = time.time()
for query in queries:
    result = pool.apply_async(minimap2, [target, query])
    results.append(result)

print('Finished applying to Pool')

for result in results:
    output = result.get()
    print(output['stderr'])
elapsed = time.time() - starttime
print('%d seconds elapsed' % elapsed)

In [None]:
annotated_sample_file

In [None]:
# Re-read the annotated file, this time joining each record's lines into one sequence string
fasta_records = []
sample_file = annotated_sample_file
current_description = None
current_sequence_lines = []
with open(sample_file, 'r') as fileh:
    for line in fileh:
        line = line.strip()
        if line.startswith('>'):
            if current_description is not None:
                new_record = {'description': current_description, 'sequence': ''.join(current_sequence_lines)}
                fasta_records.append(new_record)
            current_description = line
            current_sequence_lines = []
        else:
            current_sequence_lines.append(line)
fasta_records.append({'description': current_description, 'sequence': ''.join(current_sequence_lines)})
    

In [None]:
print(fasta_records[0])

## Search for patterns in the DNA sequence using regular expressions

Python has a full-featured, Perl-ish regular expression syntax provided by the _re_ module

First, a simple search for DNA-ness in each of the fasta record sequences.

_re.search_ looks for at least one instance of the pattern anywhere in the string

In [None]:
import re

In [None]:
for fasta_record in fasta_records:
    if re.search(r'A', fasta_record['sequence']):
        print('Found at least one Adenine in FASTA record %s' % fasta_record['description'])
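
The `if` test above works because `re.search` returns a match object when the pattern is found and `None` otherwise. A minimal sketch on a made-up sequence:

```python
import re

sequence = 'GGCATTACAGG'   # made-up sequence for illustration

match = re.search(r'AT', sequence)
print(match.group(0))               # the text that matched
print(match.start(), match.end())   # where the match begins and ends

# No match: re.search returns None, which counts as false in an `if`
print(re.search(r'U', sequence))
```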
        

You can search for multiple character patterns, like 'A' followed by 'T'

In [None]:
for fasta_record in fasta_records:
    if re.search(r'AT', fasta_record['sequence']):
        print('Found at least one Adenine-Thymine in FASTA record %s' % fasta_record['description'])

You can also search for character sets, e.g. one of A,T,C, or G, using square brackets [].

In [None]:
for fasta_record in fasta_records:
    if re.search(r'[ATCG]', fasta_record['sequence']):
        print('Found at least one of A or T or C or G in FASTA record %s' % fasta_record['description'])

In [None]:
for fasta_record in fasta_records:
    if re.search(r'[U]', fasta_record['sequence']):
        print('Found a U in FASTA record %s' % fasta_record['description'])
    else:
        print('No U found in %s' % fasta_record['description'])

There are more general character classes built in, like \S (any non-whitespace) or \s (any whitespace)

In [None]:
for fasta_record in fasta_records:
    if re.search(r'\S', fasta_record['sequence']):
        print('Found at least one non whitespace character in FASTA record %s' % fasta_record['description'])

In [None]:
for fasta_record in fasta_records:
    if re.search(r'\S\s', fasta_record['sequence']):
        print('Found a non-whitespace followed by a whitespace in FASTA record %s' % fasta_record['description'])
    else:
        print('No non-whitespace followed by a whitespace found in %s' % fasta_record['description'])

Quantifiers ({n,m}) can define how many times you see the character(s) you're searching for.

In [None]:
for fasta_record in fasta_records:
    if re.search(r'CA{2,3}', fasta_record['sequence']):
        print('Found C followed by 2 or 3 As in FASTA record %s' % fasta_record['description'])

Without the second number and comma, it must be an exact number

In [None]:
for fasta_record in fasta_records:
    if re.search(r'CA{6}', fasta_record['sequence']):
        print('Found at least one C followed by 6 As in FASTA record %s' % fasta_record['description'])

If you leave the comma in, it's n or more

In [None]:
for fasta_record in fasta_records:
    if re.search(r'CA{5,}', fasta_record['sequence']):
        print('Found at least one C followed by 5 or more As in FASTA record %s' % fasta_record['description'])

There are special quantifiers '+' (one or more) and '*' (zero or more)

In [None]:
for fasta_record in fasta_records:
    if re.search(r'ATG+', fasta_record['sequence']):
        print('Found AT followed by at least one G %s' % fasta_record['description'])

In [None]:
for fasta_record in fasta_records:
    if re.search(r'U*', fasta_record['sequence']):
        print('Found zero or more uracil bases in FASTA record %s' % fasta_record['description'])
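
Note that a pattern such as `U*` matches every string, because "zero occurrences" is always satisfied. The sketch below, on a made-up sequence, shows the empty match and contrasts it with `+`.

```python
import re

sequence = 'ACGT'   # made-up DNA string containing no U

# '*' accepts zero occurrences, so this matches the empty string at position 0
star_match = re.search(r'U*', sequence)
print(star_match is not None)
print(repr(star_match.group(0)))

# '+' needs at least one U, so there is no match here
plus_match = re.search(r'U+', sequence)
print(plus_match)
```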

Non-capturing groups _(?:)_ support or-ing together strings

In [None]:
for fasta_record in fasta_records:
    if re.search(r'ATG.+(?:TAG|TAA|TGA)', fasta_record['sequence']):
        print('Found a transcript looking thing %s' % fasta_record['description'])

Using capture groups, you can extract the matches

In [None]:
for fasta_record in fasta_records:
    match = re.search(r'(ATG.+(?:TAG|TAA|TGA))', fasta_record['sequence'])
    if match:
        print('Found a transcript looking thing %s' % fasta_record['description'])
        print(match.group(1))
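
One thing to keep in mind with `.+` is that quantifiers are greedy by default: the match extends to the last possible stop codon rather than the first. Adding `?` after the quantifier makes it non-greedy. A sketch on a made-up sequence (neither pattern checks the reading frame):

```python
import re

# Made-up sequence: ATG, then two possible stop codons (TAG and TGA)
sequence = 'CCATGAAATAGCCCTGACC'

# Greedy: .+ grabs as much as it can, so the match ends at the last stop
greedy = re.search(r'ATG.+(?:TAG|TAA|TGA)', sequence)
print(greedy.group(0))

# Non-greedy: .+? grabs as little as it can, so the match ends at the first stop
lazy = re.search(r'ATG.+?(?:TAG|TAA|TGA)', sequence)
print(lazy.group(0))
```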

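The _re.split()_ function allows you to break a string based on a regular expression.

Find potential genes by splitting chr12 on stop codons followed by a lot of T.  Note that the pattern below is only approximate: _T[GA][GA]_ matches the stop codons TAA, TAG and TGA, but also TGG, which is not a stop codon.  Because the pattern is wrapped in a capturing group, the matched separators are also included in the resulting list.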

In [9]:
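# -c writes the decompressed data to stdout, so the .gz used by earlier cells is kept
os.system('gzip -dc data/chr12/chr12.fa.gz > data/chr12/chr12.fa')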
chr12 = []
with open('data/chr12/chr12.fa', 'r') as fileh:
    for line in fileh:
        if not line.startswith('>'):
            chr12.append(line.strip())

In [10]:
len(chr12)

1903934

In [11]:
chr12 = ''.join(chr12)

In [12]:
coding = re.split(r'(T[GA][GA]T{20,})', chr12)

In [13]:
len(coding)

885

In [None]:
for c in coding:
    print(len(c))
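
The effect of parentheses on `re.split` is easy to miss: when the pattern contains a capturing group, the separators themselves are returned as list items. With one capturing group, n matches produce 2n + 1 items, so the 885 items in `coding` above include the matched separators. A sketch on made-up strings:

```python
import re

text = 'AAAXXCCCXXGGG'   # made-up string; 'XX' plays the role of the separator

# No group: only the pieces between separators are returned
plain = re.split(r'XX', text)
print(plain)

# Capturing group: the separators are returned as list items too
with_separators = re.split(r'(XX)', text)
print(with_separators)

# Non-capturing group: alternation without keeping the separators
alternatives = re.split(r'(?:XX|YY)', 'AAAXXCCCYYGGG')
print(alternatives)
```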

## Perform a virtual restriction fragment digest on chromosome 12 and map the annotated samples to the largest fragments

* Read the chromosome 12 sequence
* Cut it with the BisI restriction enzyme (recognizes GCAGC, GCTGC, GCGGC and GCCGC)
* Sort the unique fragments by size
* Write the 3 largest fragments to 3 separate FASTA files
* Run minimap2 with the annotated samples against each of the 3 largest fragments

In [1]:
chr12 = []
with open('data/chr12/chr12.fa', 'r') as fileh:
    for line in fileh:
        if not line.startswith('>'):
            chr12.append(line.strip())

In [2]:
len(chr12)

1903934

In [3]:
chr12 = ''.join(chr12)

In [4]:
len(chr12)

133275309

In [9]:
import re
digested = re.split(r'GC[ATGC]GC', chr12)

In [10]:
len(digested)

250772

In [13]:
unique_digested = set(digested)
len(unique_digested)

241783

In [14]:
def bylen(item):
    return len(item)

sorted_unique_digested = sorted(unique_digested, key=bylen, reverse=True)

In [15]:
len(sorted_unique_digested[0])

72368

In [16]:
len(sorted_unique_digested[-1])

0

In [17]:
target_names = []
for i in [0,1,2]:
    file_name = 'digested%d.fa' % i
    file_path = 'data/chr12/%s' % file_name
    with open(file_path, 'w') as fileh:
        fileh.write('>%s\n' % file_name)
        fileh.write('%s\n' % sorted_unique_digested[i])
    target_names.append(file_path)

In [18]:
target_names

['data/chr12/digested0.fa',
 'data/chr12/digested1.fa',
 'data/chr12/digested2.fa']

In [26]:
from multiprocessing import Pool
numprocs = 2
pool = Pool(numprocs)
results = []
query = 'data/chr12/annotated-samples.fa'
for target in target_names:
    result = pool.apply_async(minimap2, [target, query])
    results.append(result)
print('Done launching')

for result in results:
    output = result.get()
    print(output['stderr'])

Done launching
[M::mm_idx_gen::0.020*1.03] collected minimizers
[M::mm_idx_gen::0.024*1.05] sorted minimizers
[M::main::0.024*1.05] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.025*1.05] mid_occ = 29
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.026*1.04] distinct minimizers: 2556 (80.16% are singletons); average occurrences: 1.669; average spacing: 16.964
[M::worker_pipeline::0.048*0.64] mapped 4 sequences
[M::main] Version: 2.17-r941
[M::main] CMD: ./minimap2 data/chr12/digested0.fa data/chr12/annotated-samples.fa
[M::main] Real time: 0.050 sec; CPU: 0.033 sec; Peak RSS: 0.010 GB

[M::mm_idx_gen::0.019*0.85] collected minimizers
[M::mm_idx_gen::0.020*0.86] sorted minimizers
[M::main::0.020*0.86] loaded/built the index for 1 target sequence(s)
[M::mm_mapopt_update::0.021*0.86] mid_occ = 4
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 1
[M::mm_idx_stat::0.021*0.86] distinct minimizers: 143 (76.92% are singletons)