# What runs did the Thierry-Miegs use?

I want to make a list of the runs that the Thierry-Miegs used, so I can start pushing them through the alignment pipeline. I can also use this list to try to work out their QC thresholds. All of the lists they have given me have SRRs and SRXs partially collapsed, so I need to expand everything back out to individual SRRs before running the alignment pipeline.
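
To make the "expand back out to SRR" step concrete, here is a minimal in-memory sketch. The accessions and the `srx_to_srrs` mapping are made up for illustration, and `expand_to_srrs` is a hypothetical helper; in this notebook the real SRX-to-SRR mapping comes from the `remap` collection.

```python
# Toy SRX -> SRR mapping; these accessions are invented for illustration only.
srx_to_srrs = {
    'SRX000001': ['SRR000001', 'SRR000002'],
    'SRX000002': ['SRR000003'],
}


def expand_to_srrs(accessions, srx_to_srrs):
    """Expand a mixed list of SRX/SRR accessions into a de-duplicated list of SRRs.

    Hypothetical helper: order of first appearance is preserved.
    """
    seen = set()
    srrs = []
    for acc in accessions:
        if acc in srx_to_srrs:
            # An SRX expands to all of its runs
            runs = srx_to_srrs[acc]
        elif acc.startswith('SRR'):
            # An SRR is already at run level
            runs = [acc]
        else:
            # Unknown accession (e.g. an SRX missing from the mapping): skip it
            runs = []
        for srr in runs:
            if srr not in seen:
                seen.add(srr)
                srrs.append(srr)
    return srrs


mixed = ['SRX000001', 'SRR000003', 'SRR000002']
expanded = expand_to_srrs(mixed, srx_to_srrs)
print(expanded)
```

Because one SRX can hold several runs, the expanded list can be longer than the input list, which is why the SRR count below is larger than the number of entries in the Thierry-Miegs' file.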



In [31]:
# %load ../start.py
# Load useful extensions

# Activate the autoreload extension for easy reloading of external packages
%reload_ext autoreload
%autoreload 2

# Turn on the watermark
%reload_ext watermark
%watermark -u -d -g

# Load ipycache extension
%reload_ext ipycache
from ipycache import CacheMagics
CacheMagics.cachedir = '../cachedir'

# Add project library to path
import sys
sys.path.insert(0, '../../lib/python')

# The usual suspects
import os
import numpy as np
import pandas as pd

# plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_context('poster')

# Turn off scientific notation
np.set_printoptions(precision=5, suppress=True)


last updated: 2017-04-11 
Git hash: ca261abfa14497711573fc501eae08874a5e0da0


In [32]:
from pymongo import MongoClient

In [33]:
client = MongoClient(host='128.231.83.74', port=27022)
db = client['sra2']
ncbi = db['ncbi']
remap = db['remap']

In [34]:
# Get the list of sample accessions (a mix of SRXs and SRRs) analyzed by the Thierry-Miegs
df = pd.read_table('../../data/jean/RunsDroso_4annot_9549_libraries.txt', encoding='ISO-8859-1', skiprows=1, header=None, index_col=0)
mieg_analyzed = df.index.tolist()
print('There were {:,} samples analyzed by the Thierry-Miegs'.format(len(mieg_analyzed)))

There were 9,549 samples analyzed by the Thierry-Miegs


In [35]:
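good = pd.DataFrame(list(remap.aggregate([
    # One document per run, so each SRR can be matched on its own
    {'$unwind': '$runs'},
    {
        '$match': {
            '$or': [
                # Entry in the list is an SRR: keep just that run
                {'runs.srr': {'$in': mieg_analyzed}},
                # Entry in the list is an SRX (the document _id): keep all of its runs
                {'_id': {'$in': mieg_analyzed}},
            ]
        }
    },
    # Keep only the SRX/SRR pair for each matched run
    {'$project': {'_id': 0, 'srx': '$_id', 'srr': '$runs.srr'}}
])))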

print('There were {:,} SRRs analyzed by the Thierry-Miegs'.format(good.shape[0]))

There were 13,495 SRRs analyzed by the Thierry-Miegs
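
A quick sanity check that could follow this cell is to ask which entries in the Thierry-Miegs' list matched nothing in `remap`, either as an SRX or as an SRR. The sketch below uses made-up accessions, and `unmatched_accessions` is a hypothetical helper, not part of the project library.

```python
def unmatched_accessions(requested, pairs):
    """Return requested accessions that appear in no (srx, srr) pair.

    Hypothetical helper: `pairs` is any iterable of (srx, srr) tuples.
    """
    matched = set()
    for srx, srr in pairs:
        matched.add(srx)
        matched.add(srr)
    return sorted(set(requested) - matched)


# Toy data; these accessions are invented for illustration only.
requested = ['SRX000001', 'SRR000003', 'SRR999999']
pairs = [
    ('SRX000001', 'SRR000001'),
    ('SRX000001', 'SRR000002'),
    ('SRX000002', 'SRR000003'),
]
missing = unmatched_accessions(requested, pairs)
print(missing)
```

With the objects in this notebook, the equivalent call would be `unmatched_accessions(mieg_analyzed, zip(good.srx, good.srr))`. An empty result would mean every entry in the list was found in `remap`.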


In [36]:
# Write the SRRs to a file for use in the workflow. This is only temporary; I am just trying to prioritize runs.
with open('../../data/13495_runs_analyzed_by_mieg.txt', 'w') as fh:
    for srr in good.srr.tolist():
        fh.write(srr + '\n')