# ID Files for FlyBase

FlyBase has requested the following:

1. High-quality stranded RNA-seq data from Dmel (SRA project).
    * Minimally: the whole compressed set (if that is feasible)
    * Next priority: embryonic stages
    * Next priorities: testis, neural tissues
    * Are there other specific compressed sets that look particularly good? If so, we would like them, as well.
    
Splitting out the stages is going to be the hardest part. To address this I need:

1. List of high-quality samples
2. Identify which samples are technical replicates
3. Cluster samples to identify tissues
4. Corresponding normalized coverage tracks. 

I think I have the QC and the merging worked out, but I still need to finalize both. Once I have the sample list I can start building bedgraphs while I figure out the tissues.

In [91]:
import os
import sys

import numpy as np
import pandas as pd

import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Project level imports
sys.path.insert(0, '../lib')
from ncbi_remap.notebook import Nb
from ncbi_remap.plotting import make_figs
from ncbi_remap.prealn_wf import srx_reproducibility_score

# Setup notebook
nbconfig = Nb.setup_notebook('flybase_files')

# Turn on cache
from joblib import Memory
# `location` replaces the deprecated `cachedir` argument in newer joblib releases
memory = Memory(location=nbconfig.cache, verbose=0)

# Connect to data store
store = pd.HDFStore('../sra.h5', mode='r')

last updated: 2017-12-14 
Git hash: e87982c4731fad7c2e653c44bdac2b3c0b09a594


In [15]:
# Constants
from ncbi_remap.prealn_wf import (LIBSIZE_CUTOFF, READLEN_CUTOFF, STRAND_CUTOFF1, 
    STRAND_CUTOFF2, UNALIGN_CUTOFF, CONTAMINATION_CUTOFF)

READY_SAMPLES = store['prealn/complete']
num_samples = READY_SAMPLES.shape[0]

## Quality Control

### Library Size

In [20]:
from ncbi_remap.prealn_wf import libsize, libsize_cnts
ok = libsize(store, cutoff=LIBSIZE_CUTOFF)
ok['flag_libsize_ok'] = True
READY_SAMPLES = READY_SAMPLES.merge(ok, how='left', on=['srx', 'srr']).fillna(False)
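
Each QC step in this section follows the same pattern: a helper returns the (srx, srr) pairs that pass, those pairs get a `True` flag, and a left merge back onto the full sample table leaves `NaN` for the failures, which is then filled with `False`. A minimal sketch of that pattern on made-up accessions (the `samples` and `passing` tables are hypothetical stand-ins, not data from the store):

```python
import pandas as pd

# Hypothetical sample table; the accessions are made up for illustration.
samples = pd.DataFrame({
    'srx': ['SRX1', 'SRX1', 'SRX2', 'SRX3'],
    'srr': ['SRR1', 'SRR2', 'SRR3', 'SRR4'],
})

# Stand-in for the output of a QC helper: only the pairs that passed.
passing = pd.DataFrame({
    'srx': ['SRX1', 'SRX3'],
    'srr': ['SRR1', 'SRR4'],
})
passing['flag_libsize_ok'] = True

# The left merge keeps every sample; rows absent from `passing` get NaN,
# which is filled with False in the flag column only.
flagged = samples.merge(passing, how='left', on=['srx', 'srr'])
flagged['flag_libsize_ok'] = flagged['flag_libsize_ok'].fillna(False).astype(bool)

print(flagged)
```

Filling only the flag column, rather than the whole frame, avoids overwriting unrelated missing values with `False` if the sample table ever gains other columns.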

### Read Length

In [26]:
from ncbi_remap.prealn_wf import readlen
ok = readlen(store, cutoff=READLEN_CUTOFF)
ok['flag_readlen_ok'] = True
READY_SAMPLES = READY_SAMPLES.merge(ok, how='left', on=['srx', 'srr']).fillna(False)

### Stranded

In [41]:
from ncbi_remap.prealn_wf import strandedness
fs, sc, un = strandedness(store, cutoff=STRAND_CUTOFF2)
ok = pd.concat([fs, sc])
ok['flag_stranded_ok'] = True
READY_SAMPLES = READY_SAMPLES.merge(ok, how='left', on=['srx', 'srr']).fillna(False)

### Mappability

In [53]:
from ncbi_remap.prealn_wf import mappability
ok = mappability(store, cutoff=UNALIGN_CUTOFF)
ok['flag_map_ok'] = True
READY_SAMPLES = READY_SAMPLES.merge(ok, how='left', on=['srx', 'srr']).fillna(False)

### Summary

In [62]:
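flag_cols = ['flag_libsize_ok', 'flag_readlen_ok', 'flag_stranded_ok', 'flag_map_ok']  # sum only the QC flags, not srx/srr
OK = READY_SAMPLES.loc[READY_SAMPLES[flag_cols].astype(bool).all(axis=1), ['srx', 'srr']].copy()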

In [65]:
print('There are {:,} samples that pass all QC metrics.'.format(OK.shape[0]))

There are 5,069 samples that pass all QC metrics.
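
To see which metric removes the most samples, the flags can also be tallied per column before they are combined. The sketch below does this on a hypothetical flag table; the accessions and flag values are invented, and only the four column names come from the cells above.

```python
import pandas as pd

flag_cols = ['flag_libsize_ok', 'flag_readlen_ok', 'flag_stranded_ok', 'flag_map_ok']

# Hypothetical flag table with one row per (srx, srr) pair.
flags = pd.DataFrame({
    'srx': ['SRX1', 'SRX2', 'SRX3', 'SRX4'],
    'srr': ['SRR1', 'SRR2', 'SRR3', 'SRR4'],
    'flag_libsize_ok': [True, True, False, True],
    'flag_readlen_ok': [True, True, True, False],
    'flag_stranded_ok': [True, False, True, True],
    'flag_map_ok': [True, True, True, True],
})

# Number of samples passing each individual metric.
n_pass_per_metric = flags[flag_cols].sum()

# Samples passing every metric at once.
pass_all = flags[flag_cols].all(axis=1)
passed = flags.loc[pass_all, ['srx', 'srr']]

print(n_pass_per_metric)
print(passed)
```

Selecting the flag columns by name keeps the test independent of how many other columns the table carries.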


## Merge Technical Replicates

In [95]:
@memory.cache
def calc_score(dat, method='spearman', multi='pairwise', TH=1, show_warn=True, **kwargs):
    """Score the reproducibility of the SRRs within each SRX in `dat`."""
    dfs = []
    for srx, df in dat.groupby('srx'):
        dfs.extend(srx_reproducibility_score(
            store, srx, srr=df.srr.values, method=method, multi=multi,
            TH=TH, show_warn=show_warn, **kwargs))
    return pd.DataFrame(dfs, columns=['srx', 'srrs', method])
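
`srx_reproducibility_score` is project code whose internals are not shown here. As a rough illustration of the idea only, the sketch below computes pairwise Spearman correlations between the runs of one experiment from a hypothetical gene-by-run count table and takes the minimum over all pairs. The count values and the `min_pairwise_spearman` helper are assumptions made for illustration, not the project's implementation.

```python
import numpy as np
import pandas as pd


def min_pairwise_spearman(counts):
    """Return the lowest pairwise Spearman correlation between columns (runs)."""
    corr = counts.corr(method='spearman').to_numpy()
    # Look only at the upper triangle, excluding the diagonal of 1.0s.
    upper = corr[np.triu_indices_from(corr, k=1)]
    return float(upper.min())


# Hypothetical gene-level counts for three runs (SRRs) of one experiment (SRX).
counts = pd.DataFrame(
    {
        'SRR1': [10, 200, 35, 0, 900, 48],
        'SRR2': [12, 180, 40, 1, 950, 51],
        'SRR3': [900, 0, 48, 200, 10, 35],  # deliberately discordant run
    },
    index=['gene{}'.format(i) for i in range(6)],
)

score_all = min_pairwise_spearman(counts)
score_good = min_pairwise_spearman(counts[['SRR1', 'SRR2']])

print('all three runs: {:.3f}'.format(score_all))
print('SRR1 vs SRR2:   {:.3f}'.format(score_good))
```

Using the minimum rather than the mean means a single discordant run is enough to flag the whole experiment, which matches the "all SRRs must agree" rule applied later in this notebook.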

In [156]:
# Split SRX by the number of SRRs they contain
cnts = OK.groupby('srx').count()

singletons = OK[OK.srx.isin(cnts[cnts.srr == 1].index)]
print('There are {:,} SRX with a single SRR.'.format(singletons.shape[0]))

# Note: shape[0] counts rows (SRRs), not distinct SRX
multitons = OK[OK.srx.isin(cnts[cnts.srr > 1].index)]
print('There are {:,} SRRs from SRX with multiple SRRs.'.format(multitons.shape[0]))

There are 4,591 SRX with a single SRR.
There are 478 SRRs from SRX with multiple SRRs.
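
Because the table has one row per SRR, `shape[0]` counts runs rather than experiments; for the multi-run group the two numbers differ. A small sketch on hypothetical accessions shows the split and both ways of counting:

```python
import pandas as pd

# Hypothetical QC-passing runs: SRX1 has one run, SRX2 two, SRX3 three.
ok = pd.DataFrame({
    'srx': ['SRX1', 'SRX2', 'SRX2', 'SRX3', 'SRX3', 'SRX3'],
    'srr': ['SRR1', 'SRR2', 'SRR3', 'SRR4', 'SRR5', 'SRR6'],
})

# Attach to every row the number of runs in its experiment.
runs_per_srx = ok.groupby('srx')['srr'].transform('count')
singletons = ok[runs_per_srx == 1]
multitons = ok[runs_per_srx > 1]

n_multi_runs = multitons.shape[0]                  # rows, i.e. SRRs
n_multi_experiments = multitons['srx'].nunique()   # distinct SRXs

print('{} SRRs across {} multi-run SRXs'.format(n_multi_runs, n_multi_experiments))
```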


In [161]:
# Only keep SRX with multiple SRRs if all SRRs have a Spearman correlation >= 0.9
score = calc_score(multitons)
min_scores = score.groupby('srx').spearman.min()
multi_ok = multitons[multitons.srx.isin(min_scores[min_scores >= .9].index)]

print('There are {:,} SRRs from multi-SRR SRX whose SRRs all have a Spearman correlation ≥0.9'.format(multi_ok.shape[0]))

There are 309 SRRs from multi-SRR SRX whose SRRs all have a Spearman correlation ≥0.9


In [162]:
OK2 = pd.concat([singletons, multi_ok], ignore_index=True)

In [165]:
print('There are {:,} samples that passed all QC and have good SRRs.'.format(OK2.shape[0]))

There are 4,900 samples that passed all QC and have good SRRs.


## Output Sample List to get running

In [166]:
OK2[['srx', 'srr']].to_csv('../output/flybase_samples.tsv', sep='\t', index=False)
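
Downstream steps can read the list back with `pd.read_csv(..., sep='\t')`. The sketch below checks the round trip using an in-memory buffer in place of the real output path; the accessions are hypothetical.

```python
import io

import pandas as pd

# Hypothetical sample list with the same two columns as the real output.
sample_list = pd.DataFrame({
    'srx': ['SRX1', 'SRX2', 'SRX2'],
    'srr': ['SRR1', 'SRR2', 'SRR3'],
})

# Write tab-separated text without the index, as in the cell above.
buffer = io.StringIO()
sample_list.to_csv(buffer, sep='\t', index=False)

# Read it back and confirm nothing was lost.
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep='\t')

print(reloaded)
```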