# Reorganize store

I have realized that the data store is awkwardly organized, and I think there are better ways to organize the various flags. A script that first initializes the different parts of the store and then simply fills in the bits of data should be easier to use in the long run.
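A minimal sketch of that initialize-then-fill idea, using small in-memory tables rather than the real store. The names `toy_ids`, `init_labels`, `fill_labels`, and `layout_demo` are hypothetical and only for illustration; the one assumption carried over from the store is that rows are keyed by `(srx, srr)` pairs.

```python
import pandas as pd

# Hypothetical toy IDs standing in for the store's real ID table.
toy_ids = pd.DataFrame({
    "srx": ["SRX1", "SRX1", "SRX2"],
    "srr": ["SRR1", "SRR2", "SRR3"],
})
toy_index = toy_ids.set_index(["srx", "srr"]).index


def init_labels(index, name, default="Missing"):
    """Initialize a label Series with one default entry per (srx, srr) pair."""
    return pd.Series(default, index=index, name=name)


def fill_labels(labels, members, value):
    """Set `value` for every (srx, srr) row listed in the `members` table."""
    labels.loc[members.set_index(["srx", "srr"]).index] = value
    return labels


# Step 1: initialize everything to the default.
layout_demo = init_labels(toy_index, "layout")

# Step 2: fill in the bits of data that are actually known.
pe_runs = pd.DataFrame({"srx": ["SRX1"], "srr": ["SRR2"]})
layout_demo = fill_labels(layout_demo, pe_runs, "PE")
```

Because every run starts with an explicit default, anything that was never filled in stays visible as `Missing` instead of silently dropping out of the table.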

In [2]:
import os
import sys

import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Project level imports
sys.path.insert(0, '../lib')
from ncbi_remap.notebook import Nb
from ncbi_remap.plotting import make_figs

# Setup notebook
nbconfig = Nb.setup_notebook()

# Turn on cache
# Note: recent joblib releases take `location`; older releases used `cachedir`.
from joblib import Memory
memory = Memory(location=nbconfig.cache, verbose=0)

# Connect to data store
store = pd.HDFStore('../output/sra.h5')

last updated: 2017-12-21 
Git hash: e8b461e4623cb168ee15bba032c8be008bcc76e9


In [5]:
store.root

/ (RootGroup) ''
  children := ['aln' (Group), 'ids' (Group), 'layout' (Group), 'prealn' (Group), 'strand' (Group), 'test' (Group)]

## Current organization

I think the broad structure of the store is fine. However, the individual flags should be combined into a single table so they are easy to track. Indicator (boolean) columns, one per flag, seem like the easiest way to do that.
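To make the indicator-variable idea concrete, here is a small sketch on toy data. It assumes each flag is currently stored as its own table of `(srx, srr)` pairs; the IDs, the `flag_members` dictionary, and `flags_demo` are hypothetical stand-ins, not contents of the real store.

```python
import pandas as pd

# Hypothetical toy data: all known runs, plus one membership table per flag.
toy_ids = pd.DataFrame({
    "srx": ["SRX1", "SRX1", "SRX2"],
    "srr": ["SRR1", "SRR2", "SRR3"],
})
flag_members = {
    "flag_complete": pd.DataFrame({"srx": ["SRX1", "SRX2"], "srr": ["SRR1", "SRR3"]}),
    "flag_download_bad": pd.DataFrame({"srx": ["SRX1"], "srr": ["SRR2"]}),
}

toy_index = toy_ids.set_index(["srx", "srr"]).index

# Start with every indicator off, then switch on the rows listed for each flag.
flags_demo = pd.DataFrame(False, index=toy_index, columns=list(flag_members))
for column, members in flag_members.items():
    flags_demo.loc[members.set_index(["srx", "srr"]).index, column] = True

# With one table, combined queries become a single boolean expression,
# e.g. runs that are complete and did not fail download.
ok = flags_demo[flags_demo["flag_complete"] & ~flags_demo["flag_download_bad"]]
```

The trade-off is one wide boolean table instead of many small groups, which keeps every run's status on a single row.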

In [3]:
store.root.prealn

/prealn (Group) ''
  children := ['abi_solid' (Group), 'alignment_bad' (Group), 'complete' (Group), 'download_bad' (Group), 'qc_passed' (Group), 'quality_scores_bad' (Group), 'queue' (Group), 'workflow' (Group)]

In [4]:
store.root.prealn.workflow

/prealn/workflow (Group) ''
  children := ['bamtools_stats' (Group), 'collectrnaseqmetrics' (Group), 'fastq' (Group), 'fastq_screen' (Group), 'feature_counts' (Group), 'hisat2' (Group), 'markduplicates' (Group), 'merge' (Group), 'samtools_idxstats' (Group), 'samtools_stats' (Group)]

## Update flags

In [115]:
# Start with every flag off for each (srx, srr) pair, so the columns are boolean from the outset.
flags = pd.DataFrame(False, index=store['ids'].set_index(['srx', 'srr']).index,
                     columns=['flag_abi_solid', 'flag_alignment_bad', 'flag_complete', 'flag_download_bad', 'flag_quality_scores_bad', 'flag_qc_passed'])

flags.loc[store['prealn/abi_solid'].set_index(['srx', 'srr']).index, 'flag_abi_solid'] = True
flags.loc[store['prealn/alignment_bad'].set_index(['srx', 'srr']).index, 'flag_alignment_bad'] = True
flags.loc[store['prealn/complete'].set_index(['srx', 'srr']).index, 'flag_complete'] = True
flags.loc[store['prealn/download_bad'].set_index(['srx', 'srr']).index, 'flag_download_bad'] = True
flags.loc[store['prealn/quality_scores_bad'].set_index(['srx', 'srr']).index, 'flag_quality_scores_bad'] = True
flags.loc[store['prealn/qc_passed'].set_index(['srx', 'srr']).index, 'flag_qc_passed'] = True

In [162]:
store['prealn/flags'] = flags

## Update strand

In [93]:
# Default every (srx, srr) pair to 'Missing' until a strand call is filled in.
strand = pd.Series('Missing', index=store['ids'].set_index(['srx', 'srr']).index, name='strand')

strand[store['strand/first'].set_index(['srx', 'srr']).index] = 'first'
strand[store['strand/second'].set_index(['srx', 'srr']).index] = 'second'
strand[store['strand/unstranded'].set_index(['srx', 'srr']).index] = 'unstranded'

In [164]:
store['strand'] = strand

## Update layout

In [167]:
# Default every (srx, srr) pair to 'Missing' until a layout is filled in.
layout = pd.Series('Missing', index=store['ids'].set_index(['srx', 'srr']).index, name='layout')

layout[store['layout/PE'].set_index(['srx', 'srr']).index] = 'PE'
layout[store['layout/SE'].set_index(['srx', 'srr']).index] = 'SE'
layout[store['layout/keep_R1'].set_index(['srx', 'srr']).index] = 'keep_R1'
layout[store['layout/keep_R2'].set_index(['srx', 'srr']).index] = 'keep_R2'

In [169]:
store['layout'] = layout