In [21]:
import argparse
import collections

Scripting
=========

Unlike in R, creating handy scripts in Python is extremely easy. Scripts are very useful when working from the command line and/or on HPC (high performance computing) systems.

A word on the Unix philosophy
------------

When writing a script, it's always a good idea to follow the [Unix philosophy](https://en.wikipedia.org/wiki/Unix_philosophy), which emphasizes simplicity, interoperability and modularity instead of overengineering. In short:

* Write programs that do one thing and do it well.
* Write programs to work together.
* Write programs to handle text streams, because that is a universal interface.

If you have even a basic knowledge of the `bash` (or bash-like) command line, you are probably already familiar with these concepts. Consider the following example:

    > curl --silent "http://wodaklab.org/cyc2008/resources/CYC2008_complex.tab" | head -n 5
    ORF     Name    Complex PubMed_id       Method  GO_id   GO_term Jaccard_Index
    YKR068C BET3    TRAPP complex   10727015        "Affinity Capture-Western,Affinity Capture-MS"  GO:0030008      TRAPP complex   1
    YML077W BET5    TRAPP complex
    YDR108W GSG1    TRAPP complex
    YGR166W KRE11   TRAPP complex
    
Here we have chained two command line tools: `curl` streams a text file from the internet, and its output is piped into `head` to show only the first 5 rows. An ideal Python script should follow the same principles. Imagine we wanted to replace `head` with a little script that transforms the text file so that for each complex name (`Name` column) we report all the genes belonging to that complex. For instance:

    > curl --silent "http://wodaklab.org/cyc2008/resources/CYC2008_complex.tab" | ./cyc2txt | head -n 5
    SIR     YLR442C,YDL042C,YDR227W
    SIP     YGL208W,YDR422C
    PAC1    YGR078C,YDR488C
    SIT     YDL047W
    CPA     YJR109C,YOR303W
    
Parsing the command line
-----------

As shown in the example above, command line tools often accept options and even input files (e.g. `head -n 5`). Writing a command line argument parser that handles positional and optional arguments with the necessary flexibility, potentially with some checks on their type, is not trivial.

In [8]:
def parse_args(cmd_line):
    Args = collections.namedtuple('Args',
                                  ['n', 'in_file'])
    n_trigger = False
    # default value for "n"
    n = 1
    for arg in cmd_line:
        if n_trigger:
            n = int(arg)
            n_trigger = False
            continue
        if arg == '-n':
            # next argument belongs to "-n"
            n_trigger = True
            continue
        else:
            # it must be the positional argument
            in_file = arg
    return Args(n=n, in_file=in_file)

In [9]:
# imaginary command line
cmd_line = '-n 5 myfile.txt'
parse_args(cmd_line.split())

Args(n=5, in_file='myfile.txt')

In [10]:
# imaginary command line with multiple input files
cmd_line = '-n 5 myfile.txt another_one.txt'
parse_args(cmd_line.split())

Args(n=5, in_file='another_one.txt')

**Note:** in real life we would use the following strategy to read the arguments from the command line:

    import sys
    cmd_line = ' '.join(sys.argv[1:])
    
`sys.argv[0]` will be the name of the script, as called from the command line.
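
Since the `parse_args` functions in this section iterate over a list of tokens, `sys.argv[1:]` could also be passed to them directly. The sketch below illustrates why that may be preferable; the helper name `cmd_line_tokens` and the invocation are hypothetical, chosen only for illustration.

```python
import sys


def cmd_line_tokens(argv=None):
    """Return the arguments that follow the script name."""
    if argv is None:
        argv = sys.argv
    return argv[1:]


# hypothetical invocation: ./myscript -n 5 "my file.txt"
argv = ['./myscript', '-n', '5', 'my file.txt']

# the shell has already split the command line, honouring the quotes
tokens = cmd_line_tokens(argv)

# joining and splitting again breaks the quoted file name in two
rejoined = ' '.join(argv[1:]).split()
```

Here `tokens` keeps `'my file.txt'` as a single argument, while `rejoined` contains `'my'` and `'file.txt'` as separate items.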

We need to extend our original function, to account for additional positional arguments. We'll also add an extra boolean option. 

In [13]:
def parse_args(cmd_line):
    Args = collections.namedtuple('Args',
                                  ['n',
                                   'verbose',
                                   'in_file',
                                   'another_file'])
    n_trigger = False
    # default value for "n"
    n = 1
    # default value for "verbose"
    verbose = False
    # list to hold the positional arguments
    positional = []
    for arg in cmd_line:
        if n_trigger:
            n = int(arg)
            n_trigger = False
            continue
        if arg == '-n':
            # next argument belongs to "-n"
            n_trigger = True
        elif arg == '--verbose' or arg == '-v':
            verbose = True
        else:
            # it must be the positional argument
            positional.append(arg)
    return Args(n=n,
                verbose=verbose,
                in_file=positional[0],
                another_file=positional[1])

In [14]:
# imaginary command line with multiple input files
cmd_line = '-n 5 myfile.txt another_one.txt'
parse_args(cmd_line.split())

Args(n=5, verbose=False, in_file='myfile.txt', another_file='another_one.txt')

What if the `--verbose` option can be called multiple times to modulate the amount of verbosity of our script?

In [15]:
def parse_args(cmd_line):
    Args = collections.namedtuple('Args',
                                  ['n',
                                   'verbose',
                                   'in_file',
                                   'another_file'])
    n_trigger = False
    # default value for "n"
    n = 1
    # default value for "verbose"
    verbose = 0
    # list to hold the positional arguments
    positional = []
    for arg in cmd_line:
        if n_trigger:
            n = int(arg)
            n_trigger = False
            continue
        if arg == '-n':
            # next argument belongs to "-n"
            n_trigger = True
        elif arg == '--verbose' or arg == '-v':
            verbose += 1
        else:
            # it must be the positional argument
            positional.append(arg)
    return Args(n=n,
                verbose=verbose,
                in_file=positional[0],
                another_file=positional[1])

In [16]:
# imaginary command line with increased verbosity
cmd_line = '-n 5 -v -v myfile.txt another_one.txt'
parse_args(cmd_line.split())

Args(n=5, verbose=2, in_file='myfile.txt', another_file='another_one.txt')

In [18]:
# by convention we can also increase verbosity in the following manner
cmd_line = '-n 5 -vvv myfile.txt another_one.txt'
parse_args(cmd_line.split())

Args(n=5, verbose=0, in_file='-vvv', another_file='myfile.txt')

Let's add this additional functionality. Hopefully you are starting to see how complicated and bug-prone writing your own command line parser is!

In [19]:
def parse_args(cmd_line):
    Args = collections.namedtuple('Args',
                                  ['n',
                                   'verbose',
                                   'in_file',
                                   'another_file'])
    n_trigger = False
    # default value for "n"
    n = 1
    # default value for "verbose"
    verbose = 0
    # list to hold the positional arguments
    positional = []
    for arg in cmd_line:
        if n_trigger:
            n = int(arg)
            n_trigger = False
            continue
        if arg == '-n':
            # next argument belongs to "-n"
            n_trigger = True
        elif arg == '--verbose':
            verbose += 1
        elif len(arg) > 1 and set(arg[1:]) == {'v'}:
            # "-v", "-vv", "-vvv", ...: one verbosity level per "v"
            verbose += len(arg) - 1
        else:
            # it must be the positional argument
            positional.append(arg)
    return Args(n=n,
                verbose=verbose,
                in_file=positional[0],
                another_file=positional[1])

In [20]:
# by convention we can also increase verbosity in the following manner
cmd_line = '-n 5 -vvv myfile.txt another_one.txt'
parse_args(cmd_line.split())

Args(n=5, verbose=3, in_file='myfile.txt', another_file='another_one.txt')

The `argparse` module
----------

Python has a very useful module for creating scripts, and it is included in the standard library: [`argparse`](https://docs.python.org/3/library/argparse.html). It allows you to create command line parsers that are concise yet very flexible and powerful.

Let's rewrite our last example using `argparse`.

In [24]:
def parse_args(cmd_line):
    parser = argparse.ArgumentParser()
    
    # positional arguments
    parser.add_argument('my_file',
                        help='My input file')
    parser.add_argument('another_file',
                        help='Another input file')
    
    # optional arguments
    parser.add_argument('-n',
                        type=int,
                        default=1,
                        help='Number of Ns [Default: 1]')
    parser.add_argument('-v', '--verbose',
                        action='count',
                        default=0,
                        help='Increase verbosity level')
    
    return parser.parse_args(cmd_line)

In [25]:
# by convention we can also increase verbosity in the following manner
cmd_line = '-n 5 -vvv myfile.txt another_one.txt'
parse_args(cmd_line.split())

Namespace(another_file='another_one.txt', my_file='myfile.txt', n=5, verbose=3)
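
The returned `Namespace` object exposes each argument as an attribute. The short sketch below rebuilds the same parser in compact form to show how the parsed values might be used afterwards; the variable names `first_n` and `as_dict` are just for illustration.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('my_file', help='My input file')
parser.add_argument('another_file', help='Another input file')
parser.add_argument('-n', type=int, default=1,
                    help='Number of Ns [Default: 1]')
parser.add_argument('-v', '--verbose', action='count', default=0,
                    help='Increase verbosity level')

args = parser.parse_args('-n 5 -vvv myfile.txt another_one.txt'.split())

# each argument is a plain attribute, already converted to the right type
first_n = args.n

# vars() turns the Namespace into a regular dictionary
as_dict = vars(args)
```

Note that `args.n` is already an `int`, so no further conversion is needed in the rest of the script.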

By declaring the type of the `-n` option, we get its value checked (and converted) for free.

In [27]:
# passing a value that is not an integer to "-n" triggers an error
cmd_line = '-n not_an_integer -vvv myfile.txt another_one.txt'
parse_args(cmd_line.split())

usage: __main__.py [-h] [-n N] [-v] my_file another_file
__main__.py: error: argument -n: invalid int value: 'not_an_integer'


SystemExit: 2

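
The `SystemExit` shown above is how `argparse` terminates a script after printing the error message. In a notebook or in a test it can be convenient to catch it and inspect the exit code. The following is a minimal sketch; the helper names `build_parser` and `try_parse` and the program name `example` are hypothetical.

```python
import argparse


def build_parser():
    # "example" is a made-up program name, used only in the usage message
    parser = argparse.ArgumentParser(prog='example')
    parser.add_argument('my_file', help='My input file')
    parser.add_argument('-n', type=int, default=1,
                        help='Number of Ns [Default: 1]')
    return parser


def try_parse(cmd_line):
    """Return (args, exit_code); exit_code is None when parsing succeeds."""
    try:
        return build_parser().parse_args(cmd_line), None
    except SystemExit as exc:
        # argparse has already printed the usage and error to stderr
        return None, exc.code


bad_args, bad_code = try_parse(['-n', 'not_an_integer', 'myfile.txt'])
good_args, good_code = try_parse(['-n', '5', 'myfile.txt'])
```

With the invalid value, `bad_args` is `None` and `bad_code` is `2`, the same exit status reported above; with a valid command line, `good_code` is `None`.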


...and we also get an `-h` (help) option for free, already formatted!

In [29]:
# ask for the automatically generated help message
cmd_line = '-h'
parse_args(cmd_line.split())

usage: __main__.py [-h] [-n N] [-v] my_file another_file

positional arguments:
  my_file        My input file
  another_file   Another input file

optional arguments:
  -h, --help     show this help message and exit
  -n N           Number of Ns [Default: 1]
  -v, --verbose  Increase verbosity level


SystemExit: 0



More `argparse` examples
--------------

Logging
-------

Script template
---------------

Multiprocessing
---------------