# Add Strandedness

I have been thinking about optimizing the workflows. I think a huge sticking point is using the database to store various flags. While conceptually this was a great way to do it, I think performance on the NFS is not good enough for this to be maintainable. Instead, I am going to migrate towards using the file system for some things, and HDF5 file stores for aggregated files and flags.

Strandedness is determined during the pre-alignment workflow. I think the best way to store this is as a flag in a file `{srx}/{srr}/STRAND`. This will be immediately accessible and can be used as a file dependency. I want to go ahead and create this file for samples that I have already processed.

I output the following flags:
* `same_strand` when reads align to the same strand as the gene model. Sometimes called `first strand`.
* `opposite_strand` when reads align to the strand opposite the gene model. Sometimes called `second strand`.
* `unstranded` when reads align equally to both strands, regardless of which strand the gene model is on.
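
As a minimal sketch of how a downstream step might consume this flag, the snippet below reads a `STRAND` file and checks it against the three values listed above. The helper name `read_strand`, the temporary directory, and the accessions are hypothetical and are not part of the workflow.

```python
import os
import tempfile

# The three flag values listed above.
VALID_FLAGS = {'same_strand', 'opposite_strand', 'unstranded'}


def read_strand(fname):
    """Return the flag stored in a STRAND file (hypothetical helper)."""
    with open(fname) as fh:
        flag = fh.read().strip()
    if flag not in VALID_FLAGS:
        raise ValueError('unexpected strand flag: {!r}'.format(flag))
    return flag


# Demonstrate on a throwaway {srx}/{srr} directory; the accessions are made up.
tmpdir = tempfile.mkdtemp()
run_dir = os.path.join(tmpdir, 'SRX000001', 'SRR000001')
os.makedirs(run_dir)
strand_file = os.path.join(run_dir, 'STRAND')
with open(strand_file, 'w') as fh:
    fh.write('same_strand')

print(read_strand(strand_file))
```

Validating on read means a truncated or hand-edited file fails loudly instead of passing an unknown value downstream.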

In [1]:
# %load ../start.py
# Load useful extensions

# Activate the autoreload extension for easy reloading of external packages
%reload_ext autoreload
%autoreload 2

# Turn on the watermark
%reload_ext watermark
%watermark -u -d -g

# Load ipycache extension
%reload_ext ipycache
from ipycache import CacheMagics
CacheMagics.cachedir = '../cachedir'

# Add project library to path
import sys
sys.path.insert(0, '../../lib/python')

# The usual suspects
import os
import numpy as np
import pandas as pd

# plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_context('poster')

# Turn off scientific notation
np.set_printoptions(precision=5, suppress=True)


last updated: 2017-10-18 
Git hash: bef20d94a84e5a2308a3513d0727c10724ef2729


In [2]:
# %load ../../bin/load.py
from pymongo import MongoClient
with open('/home/fearjm/Projects/ncbi_remap/output/.mongodb_host', 'r') as fh:
    host = fh.read().strip()
client = MongoClient(host=host, port=27022)
db = client['sra2']
remap = db['remap']


In [3]:
from dask import delayed, compute
from dask.diagnostics import ProgressBar

In [4]:
same = list(remap.aggregate([
    {'$unwind': '$runs'},
    {
        '$match': {
            'runs.pre_aln_flags': 'same_strand'
        }
    },
    {
        '$project': {
            '_id': 0,
            'srx': '$srx',
            'srr': '$runs.srr'
        }
    }
]))

In [7]:
opposite = list(remap.aggregate([
    {'$unwind': '$runs'},
    {
        '$match': {
            'runs.pre_aln_flags': 'opposite_strand'
        }
    },
    {
        '$project': {
            '_id': 0,
            'srx': '$srx',
            'srr': '$runs.srr'
        }
    }
]))

In [8]:
unstranded = list(remap.aggregate([
    {'$unwind': '$runs'},
    {
        '$match': {
            'runs.pre_aln_flags': 'unstranded'
        }
    },
    {
        '$project': {
            '_id': 0,
            'srx': '$srx',
            'srr': '$runs.srr'
        }
    }
]))
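
The three aggregation pipelines above differ only in the flag being matched. As a sketch of one way to avoid the repetition, the function below builds the same pipeline for any flag. `strand_pipeline` is a hypothetical name, and the snippet only constructs the pipelines, so it runs without a database connection.

```python
def strand_pipeline(flag):
    """Build the aggregation pipeline used above for a given strand flag."""
    return [
        # One document per run instead of one per experiment.
        {'$unwind': '$runs'},
        # Keep only runs carrying the requested flag.
        {'$match': {'runs.pre_aln_flags': flag}},
        # Return just the identifiers needed to build the file path.
        {'$project': {'_id': 0, 'srx': '$srx', 'srr': '$runs.srr'}},
    ]


pipelines = {
    flag: strand_pipeline(flag)
    for flag in ('same_strand', 'opposite_strand', 'unstranded')
}
# With the `remap` collection defined earlier, each pipeline would be run as
# list(remap.aggregate(pipelines[flag])).
```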

In [10]:
len(same), len(opposite), len(unstranded)

(3024, 5435, 14556)

In [11]:
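# Quick function to write the STRAND flag file.
def write_file(fname, flag, dryrun=False):
    # With dryrun=True, return what would be written without touching the disk.
    if dryrun:
        return fname, flag

    # Only write when the {srx}/{srr} directory already exists,
    # i.e. the sample has already been processed.
    if os.path.exists(os.path.dirname(fname)):
        with open(fname, 'w') as fh:
            fh.write(flag)
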
pattern = '../../output/prealignment/raw/{srx}/{srr}/STRAND'
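
Because `write_file` silently skips runs whose directory does not exist, it can be useful to know how many files were actually written. The following is a sketch under that assumption, not the code used in this notebook: `write_flag` is a hypothetical variant that reports whether it wrote, and `demo_pattern` points at a temporary directory with made-up accessions.

```python
import os
import tempfile


def write_flag(fname, flag):
    """Write flag to fname if its directory exists; report whether it did."""
    if not os.path.isdir(os.path.dirname(fname)):
        return False
    with open(fname, 'w') as fh:
        fh.write(flag)
    return True


root = tempfile.mkdtemp()
demo_pattern = os.path.join(root, '{srx}', '{srr}', 'STRAND')
runs = [
    {'srx': 'SRX000001', 'srr': 'SRR000001'},  # directory will exist
    {'srx': 'SRX000002', 'srr': 'SRR000002'},  # directory is missing
]
os.makedirs(os.path.dirname(demo_pattern.format(**runs[0])))

results = [write_flag(demo_pattern.format(**x), 'unstranded') for x in runs]
print(sum(results), 'of', len(results), 'files written')
```

Comparing the number written against the number of runs returned by the query gives a quick check on how many samples have not been processed yet.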

In [12]:
# Write same strand
dfs = [delayed(write_file)(pattern.format(**x), 'same_strand') for x in same]
with ProgressBar():
    compute(*dfs, num_workers=10)


In [13]:
# Write opposite strand
dfs = [delayed(write_file)(pattern.format(**x), 'opposite_strand') for x in opposite]
with ProgressBar():
    compute(*dfs, num_workers=10)


In [14]:
# Write unstranded
dfs = [delayed(write_file)(pattern.format(**x), 'unstranded') for x in unstranded]
with ProgressBar():
    compute(*dfs, num_workers=10)
