# A real use case

![ngs sequencing](http://www.genohm.com/wp-content/uploads/2015/05/DNA-actg1.jpg)

## Preparation

In [1]:
# Base dir
mydir = './tmp/'
# Python variables
myfile = mydir + 'job.py'
myinput = mydir + 'ngs.sam'
myshortinput = myinput + '.short'
ngs_module = mydir + 'ngs.py'
# Bash variables
%env mydir $mydir
%env myinput $myinput
%env myshortinput $myshortinput

env: mydir=./tmp/
env: myinput=./tmp/ngs.sam
env: myshortinput=./tmp/ngs.sam.short


In [2]:
%%bash
mkdir -p $mydir
head -1500 $myinput > $myshortinput
wc -l $mydir*sam*

   334121 ./tmp/ngs.sam
     1500 ./tmp/ngs.sam.short
   335621 total


## Constants

In [3]:
%%writefile $ngs_module
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

""" NGS functions """

# Symbol for header inside SAM format
HEADER_CHAR = "@"
# Chromosomic separator
SEP = ":"
# Formats for quality reading: http://j.mp/1zo30bt
# Illumina 1.8+ Phred+33,  raw reads typically (0, 41)
PHRED_INIT = 33
# What is the minimum quality?
PHRED_THOLD = 20  # Handle STRAND:
FORWARD_STRAND_FLAG = 16



Overwriting ./tmp/ngs.py


Handling the strand

In [4]:
%%writefile -a $ngs_module

def compute_strand(flag, myseq='ACTG', mystart=1):
    # Convert the flag to decimal, and check
    # the bit in the 5th position from the right.
    if flag & FORWARD_STRAND_FLAG:       # Strand forward
        mystop = mystart + len(myseq)
    else:               # Strand reverse
        mystop = mystart + 1
        mystart = mystop - len(myseq)
    return mystart, mystop



Appending to ./tmp/ngs.py


Generating ngs data from one line

In [5]:
%%writefile -a $ngs_module

def ngs_split(line=None):
    """ Recover necessary data for NGS analysis """
    pieces = line.split("\t")

    myflag = int(pieces[1])
    mychr = pieces[2]
    mystart = int(pieces[3])
    myseq = pieces[9]
    myqc = pieces[10]
    mystart, mystop = compute_strand(myflag, myseq, mystart)

    if isinstance(mychr[3:], int):
        code = int(mychr[3:]).zfill(2)  # sortable chromosome code
    else:
        code = ord(mychr[3:]).__str__()

    return mychr, mystart, mystop, myseq, myflag, myqc, code



Appending to ./tmp/ngs.py


In [6]:
cat $ngs_module

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

""" NGS functions """

# Symbol for header inside SAM format
HEADER_CHAR = "@"
# Chromosomic separator
SEP = ":"
# Formats for quality reading: http://j.mp/1zo30bt
# Illumina 1.8+ Phred+33,  raw reads typically (0, 41)
PHRED_INIT = 33
# What is the minimum quality?
PHRED_THOLD = 20  # Handle STRAND:
FORWARD_STRAND_FLAG = 16

def compute_strand(flag, myseq='ACTG', mystart=1):
    # Convert the flag to decimal, and check
    # the bit in the 5th position from the right.
    if flag & FORWARD_STRAND_FLAG:       # Strand forward
        mystop = mystart + len(myseq)
    else:               # Strand reverse
        mystop = mystart + 1
        mystart = mystop - len(myseq)
    return mystart, mystop

def ngs_split(line=None):
    """ Recover necessary data for NGS analysis """
    pieces = line.split("\t")

    myflag = int(pieces[1])
    mychr = pieces[2]
    mystart = int(pieces[3])
    myseq = pieces[9]
   

In [7]:
%%writefile $myfile

from mrjob.job import MRJob
from ngs import ngs_split, \
    PHRED_INIT, PHRED_THOLD, HEADER_CHAR, SEP

class MRcoverage(MRJob):
    """ Map Reduce for NGS coverage computation"""

    def mapper(self, _, line):
        """ Split data into keys and value """

        if line[0] == HEADER_CHAR:
            yield "header", 1
        else:
            # Get data
            mychr, mystart, mystop, myseq, myflag, myqc, code = ngs_split(line)
            # For each base of my dna sequence
            for i in range(mystart, mystop):
                mypos = i - mystart
                current_qc = ord(myqc[mypos]) - PHRED_INIT  # quality value
                if current_qc > PHRED_THOLD:
                    # Choose the key to emit
                    label = code + SEP + i.__str__() + SEP + mychr
                    current_letter = myseq[mypos]
                    yield label, 1
                    yield label + SEP + current_letter, 1

    def reducer(self, key, values):
        """ Sum up values"""
        yield key, sum(values)

if __name__ == '__main__':
    MRcoverage().run()


Overwriting ./tmp/job.py


In [8]:
%system python $myfile $myshortinput

['no configs found; falling back on auto-configuration',
 'no configs found; falling back on auto-configuration',
 'creating tmp directory /tmp/job.jovyan.20151214.162912.231757',
 'writing to /tmp/job.jovyan.20151214.162912.231757/step-0-mapper_part-00000',
 'Counters from step 1:',
 '  (none found)',
 'writing to /tmp/job.jovyan.20151214.162912.231757/step-0-mapper-sorted',
 '> sort /tmp/job.jovyan.20151214.162912.231757/step-0-mapper_part-00000',
 'writing to /tmp/job.jovyan.20151214.162912.231757/step-0-reducer_part-00000',
 'Counters from step 1:',
 '  (none found)',
 'Moving /tmp/job.jovyan.20151214.162912.231757/step-0-reducer_part-00000 -> /tmp/job.jovyan.20151214.162912.231757/output/part-00000',
 'Streaming final output from /tmp/job.jovyan.20151214.162912.231757/output',
 '"49:10000:chr1"\t2',
 '"49:10000:chr1:A"\t2',
 '"49:10001:chr1"\t2',
 '"49:10001:chr1:C"\t2',
 '"49:10002:chr1"\t2',
 '"49:10002:chr1:A"\t2',
 '"49:10003:chr1"\t2',
 '"49:10003:chr1:G"\t2',
 '"49:10004:chr

## This is not going to work

In [9]:
! python $myfile -r hadoop $myshortinput 1> log.out 2> log.err

In [10]:
cat log.err

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/job.jovyan.20151214.163021.842422
writing wrapper script to /tmp/job.jovyan.20151214.163021.842422/setup-wrapper.sh
Looking for hadoop binary in /usr/local/hadoop/bin
Using Hadoop version 2.6.0
STDERR: mkdir: Permission denied: user=jovyan, access=WRITE, inode="/user":root:supergroup:drwxr-xr-x
Traceback (most recent call last):
  File "/opt/mrjob/mrjob/fs/hadoop.py", line 293, in mkdir
    self.invoke_hadoop(args, ok_stderr=[_HADOOP_FILE_EXISTS_RE])
  File "/opt/mrjob/mrjob/fs/hadoop.py", line 180, in invoke_hadoop
    raise CalledProcessError(proc.returncode, args)
subprocess.CalledProcessError: Command '['/usr/local/hadoop/bin/hadoop', 'fs', '-mkdir', '-p', 'hdfs:///user/jovyan/tmp/mrjob/job.jovyan.20151214.163021.842422/files/']' returned non-zero exit status 1

During handling of the above exception, another exception occurred:

Traceback (most rece

Your modules could be compressed and sent to HADOOP or EMR

http://mrjob.readthedocs.org/en/latest/guides/setup-cookbook.html#putting-your-source-tree-in-pythonpath

In [11]:
cat $ngs_module $myfile > tmp/hadoop.py

In [13]:
! python tmp/hadoop.py -r hadoop $myshortinput

no configs found; falling back on auto-configuration
no configs found; falling back on auto-configuration
creating tmp directory /tmp/hadoop.jovyan.20151214.164022.701186
writing wrapper script to /tmp/hadoop.jovyan.20151214.164022.701186/setup-wrapper.sh
Looking for hadoop binary in /usr/local/hadoop/bin
Using Hadoop version 2.6.0
Copying local files into hdfs:///user/jovyan/tmp/mrjob/hadoop.jovyan.20151214.164022.701186/files/
Looking for Hadoop streaming jar in /usr/local/hadoop
Found Hadoop streaming jar: /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar
HADOOP: packageJobJar: [/tmp/hadoop-unjar5440388343696524568/] [] /tmp/streamjob9001070335311574816.jar tmpDir=null
HADOOP: Connecting to ResourceManager at /0.0.0.0:8032
HADOOP: Connecting to ResourceManager at /0.0.0.0:8032
HADOOP: Total input paths to process : 1
HADOOP: number of splits:2
HADOOP: Submitting tokens for job: job_1450107962907_0002
HADOOP: Submitted application application_1450107962907_0002
HADO

## Counters

## Runners

You cannot use the programmatic runner functionality in the same file as your job class. As an example of what not to do, here is some code that does not work.