# Add Layout

I have been thinking about optimizing the workflows. I think a major sticking point is using the database to store various flags. While conceptually this was a great way to do it, I think performance on the NFS is not good enough for this to be maintainable. Instead I am going to migrate towards using the file system for some things, and HDF5 file stores for aggregated files and flags.

Library layout is determined during file download and initial parsing. I also add flags recording whether I am keeping the first read or the second read when a PE library is really an SE library. Unfortunately I need this information often, so I think the best way to store this flag is in a file `{srx}/{srr}/LAYOUT`. This will be immediately accessible and can be used as a file dependency. I want to go ahead and create this file for samples that I have already processed.

I output the following flags:
* `SE` for single end
* `PE` for paired end
* `keep_R1` for a PE library that is really single end, keeping Read 1
* `keep_R2` for a PE library that is really single end, keeping Read 2
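
Reading the flag back could look something like the sketch below. The helper `read_layout` and the `VALID_LAYOUTS` set are names I am making up here for illustration; the only things taken from the notes above are the `LAYOUT` file name and the four flag values.

```python
from pathlib import Path

# The four flags listed above; anything else is treated as an error.
VALID_LAYOUTS = {'SE', 'PE', 'keep_R1', 'keep_R2'}


def read_layout(run_dir):
    """Return the flag stored in `<run_dir>/LAYOUT`, or None if the file is missing.

    `read_layout` is a hypothetical helper, not part of the project library.
    """
    layout_file = Path(run_dir) / 'LAYOUT'
    if not layout_file.exists():
        return None

    # Strip whitespace so a trailing newline does not break the comparison.
    flag = layout_file.read_text().strip()
    if flag not in VALID_LAYOUTS:
        raise ValueError(f'Unexpected layout flag {flag!r} in {layout_file}')
    return flag
```

Returning `None` for a missing file keeps "not processed yet" distinct from a corrupt flag, which raises instead.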

In [2]:
# %load ../start.py
# Load useful extensions

# Activate the autoreload extension for easy reloading of external packages
%reload_ext autoreload
%autoreload 2

# Turn on the watermark
%reload_ext watermark
%watermark -u -d -g

# Load ipycache extension
%reload_ext ipycache
from ipycache import CacheMagics
CacheMagics.cachedir = '../cachedir'

# Add project library to path
import sys
sys.path.insert(0, '../../lib/python')

# The usual suspects
import os
import numpy as np
import pandas as pd

# plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_context('poster')

# Turn off scientific notation
np.set_printoptions(precision=5, suppress=True)


last updated: 2017-10-12 
Git hash: 8c83e87b7c4eac097d2ea2f50eee0e3a81393eaa


In [4]:
# %load ../../bin/load.py
from pymongo import MongoClient
with open('/home/fearjm/Projects/ncbi_remap/output/.mongodb_host', 'r') as fh:
    host = fh.read().strip()
client = MongoClient(host=host, port=27022)
db = client['sra2']
remap = db['remap']


In [None]:
from dask import delayed, compute
from dask.diagnostics import ProgressBar

In [5]:
paired = list(remap.aggregate([
    {'$unwind': '$runs'},
    {
        '$match': {
            'runs.pre_aln_flags': 'PE'
        }
    },
    {
        '$project': {
            '_id': 0,
            'srx': '$_id',
            'srr': '$runs.srr'
        }
    }
]))

In [7]:
single = list(remap.aggregate([
    {'$unwind': '$runs'},
    {
        '$match': {
            '$and': [
                {'runs.pre_aln_flags': 'SE'},
                {'runs.pre_aln_flags': {'$ne': 'keep_R1'}},
                {'runs.pre_aln_flags': {'$ne': 'keep_R2'}},
            ]
        }
    },
    {
        '$project': {
            '_id': 0,
            'srx': '$_id',
            'srr': '$runs.srr'
        }
    }
]))

In [8]:
keep1 = list(remap.aggregate([
    {'$unwind': '$runs'},
    {
        '$match': {
            'runs.pre_aln_flags': 'keep_R1',
        }
    },
    {
        '$project': {
            '_id': 0,
            'srx': '$_id',
            'srr': '$runs.srr'
        }
    }
]))

In [9]:
keep2 = list(remap.aggregate([
    {'$unwind': '$runs'},
    {
        '$match': {
            'runs.pre_aln_flags': 'keep_R2',
        }
    },
    {
        '$project': {
            '_id': 0,
            'srx': '$_id',
            'srr': '$runs.srr'
        }
    }
]))

len(single), len(paired), len(keep1), len(keep2)
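
The four queries above differ only in which flag they match and which flags they exclude, so they could probably be built by one function. This is a rough sketch; `layout_pipeline` is a hypothetical name, and I have not run it against the database. It only builds the pipeline list, so it can be checked without a Mongo connection.

```python
def layout_pipeline(include, exclude=()):
    """Build an aggregation pipeline selecting runs by pre-alignment flag.

    `include` is the flag a run must have; `exclude` lists flags it must
    not have. Hypothetical helper mirroring the queries in this notebook.
    """
    conditions = [{'runs.pre_aln_flags': include}]
    conditions += [{'runs.pre_aln_flags': {'$ne': flag}} for flag in exclude]
    return [
        {'$unwind': '$runs'},
        {'$match': {'$and': conditions}},
        {'$project': {'_id': 0, 'srx': '$_id', 'srr': '$runs.srr'}},
    ]


# Intended usage, assuming the `remap` collection loaded earlier:
# single = list(remap.aggregate(layout_pipeline('SE', exclude=('keep_R1', 'keep_R2'))))
# paired = list(remap.aggregate(layout_pipeline('PE')))
```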

In [12]:
# Sanity check 1: no keep_R1 run should also appear in the single-end list
for x in keep1:
    if x in single:
        print(x)

In [50]:
# Sanity check 2: no keep_R2 run should also appear in the single-end list
for x in keep2:
    if x in single:
        print(x)
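
The `x in single` test scans the whole list for every record, which gets slow as the lists grow. A set-based version of the same check might look like this. `overlap` is a hypothetical helper and the accessions in the example are made up; it assumes each record has the `srx` and `srr` keys produced by the `$project` stage.

```python
def overlap(left, right):
    """Return the (srx, srr) pairs present in both lists of records."""
    left_keys = {(rec['srx'], rec['srr']) for rec in left}
    right_keys = {(rec['srx'], rec['srr']) for rec in right}
    return left_keys & right_keys


# Made-up records in the same shape as the query results.
example_single = [{'srx': 'SRX0000001', 'srr': 'SRR0000001'}]
example_keep1 = [
    {'srx': 'SRX0000001', 'srr': 'SRR0000001'},
    {'srx': 'SRX0000002', 'srr': 'SRR0000002'},
]
print(overlap(example_keep1, example_single))
```

An empty set means the two groups are disjoint, which is what the sanity checks expect.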

In [40]:
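# Quick function to write the LAYOUT file.
def write_file(fname, flag, dryrun=False):
    """Write `flag` to `fname`; with `dryrun`, return the arguments instead."""
    if dryrun:
        return fname, flag

    # Only write when the run directory already exists; runs without a
    # directory are skipped silently.
    if os.path.exists(os.path.dirname(fname)):
        with open(fname, 'w') as fh:
            fh.write(flag)

pattern = '../../output/prealignment/raw/{srx}/{srr}/LAYOUT'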

In [44]:
# Write single
dfs = [delayed(write_file)(pattern.format(**x), 'SE') for x in single]
with ProgressBar():
    compute(*dfs, num_workers=10)

[########################################] | 100% Completed |  3min 47.1s


In [47]:
# Write paired
dfs = [delayed(write_file)(pattern.format(**x), 'PE') for x in paired]
with ProgressBar():
    compute(*dfs, num_workers=10)

[########################################] | 100% Completed |  2min  2.1s


In [48]:
# Write keep_R1
dfs = [delayed(write_file)(pattern.format(**x), 'keep_R1') for x in keep1]
with ProgressBar():
    compute(*dfs, num_workers=10)

[########################################] | 100% Completed |  8.6s


In [49]:
# Write keep_R2
dfs = [delayed(write_file)(pattern.format(**x), 'keep_R2') for x in keep2]
with ProgressBar():
    compute(*dfs, num_workers=10)

[########################################] | 100% Completed |  8.5s
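
Because `write_file` only writes when the run directory exists, some records may end up without a `LAYOUT` file and nothing reports it. A follow-up check could list those paths. This is a sketch only; `missing_layouts` is a hypothetical name, and it assumes the same `pattern` and record shape used above.

```python
import os


def missing_layouts(records, pattern):
    """Return the LAYOUT paths built from `records` that do not exist on disk."""
    paths = (pattern.format(**rec) for rec in records)
    return [path for path in paths if not os.path.exists(path)]


# Intended usage with the lists from this notebook:
# skipped = missing_layouts(single + paired + keep1 + keep2, pattern)
# len(skipped)
```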
