# Inverse Filter Testing

Brian wants to know the location of all reads that map to non-annotated regions. These could be new genes, new exons/CDS, anti-sense transcripts, etc. I have tried several methods to pull these regions out, but they are difficult to capture for technical reasons. The approach I am working out here:

1) Create First/Second strand BedGraphs
2) Iterate over each BedGraph and expand it so that each position is its own row in the file (BedGraphs collapse adjacent positions with the same count into a single row).
3) Use BedTools intersect against the inverse First/Second strand files to produce four BedGraphs:

* first.sense.bg
* first.antisense.bg
* second.sense.bg
* second.antisense.bg
    
4) Combine the two sense BedGraphs, and separately the two antisense BedGraphs, adding strand information.

In the end I should have two BED files: one listing un-annotated sense locations and the other listing anti-sense transcripts.
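
Step 2 is not implemented in the cells below, so here is a minimal sketch of what the expansion could look like. It assumes BedGraph rows are already parsed into `(chrom, start, end, value)` tuples with 0-based, half-open coordinates; the function name `expand_bedgraph` and the toy rows are made up for illustration.

```python
def expand_bedgraph(rows):
    """Yield one (chrom, start, end, value) row per base position.

    Assumes 0-based, half-open intervals, as in BedGraph files.
    """
    for chrom, start, end, value in rows:
        for pos in range(int(start), int(end)):
            yield (chrom, pos, pos + 1, value)


# Toy rows (hypothetical values) in BedGraph column order.
toy = [('chr2L', 11940, 11943, '1'), ('chr2L', 11943, 11945, '0')]
expanded = list(expand_bedgraph(toy))

for row in expanded:
    print(*row, sep='\t')
```

Expanding a whole genome this way produces one row per base, so for real files it is probably better to stream the generator straight to disk than to build a list.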

In [2]:
# %load ../start.py
# Load useful extensions

# Activate the autoreload extension for easy reloading of external packages
%reload_ext autoreload
%autoreload 2

# Turn on the watermark
%reload_ext watermark
%watermark -u -d -g

# Load ipycache extension
%reload_ext ipycache
from ipycache import CacheMagics
CacheMagics.cachedir = '../cachedir'

# Add project library to path
import sys
sys.path.insert(0, '../../lib/python')

# The usual suspects
import os
import numpy as np
import pandas as pd

# plotting
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_context('poster')

# Turn off scientific notation
np.set_printoptions(precision=5, suppress=True)


last updated: 2017-04-21 
Git hash: d9167e06cac5e4fe3468cd0c675caa30879df4ef


In [3]:
import pybedtools

In [41]:
first = pybedtools.BedTool([x[:4] + ['.', '+'] for x in pybedtools.BedTool('../../output/alignment/raw/ERX455041/ERR489286/ERR489286.fq.first.bedgraph')])
second = pybedtools.BedTool([x[:4] + ['.', '-'] for x in pybedtools.BedTool('../../output/alignment/raw/ERX455041/ERR489286/ERR489286.fq.second.bedgraph')])
inverse_first = pybedtools.BedTool([x[:3] + ['.', '.', '+'] for x in pybedtools.BedTool('../../output/inverse_exons_20bp.first.bed')])
inverse_second = pybedtools.BedTool([x[:3] + ['.', '.', '-'] for x in pybedtools.BedTool('../../output/inverse_exons_20bp.second.bed')])

In [44]:
intersect_sense = first.intersect(inverse_first, s=True)
fil = intersect_sense.filter(lambda x: x[3] != '0').saveas()
pybedtools.BedTool([x[:3] for x in fil]).saveas('../../output/test.first.bed')

<BedTool(../../output/test.first.bed)>

In [45]:
intersect_antisense = second.intersect(inverse_second, s=True)
fil = intersect_antisense.filter(lambda x: x[3] != '0').saveas()
pybedtools.BedTool([x[:3] for x in fil]).saveas('../../output/test.second.bed')

<BedTool(../../output/test.second.bed)>

In [48]:
intersect_sense = first.intersect(inverse_first, s=True).saveas('../../output/test.first.bedgraph')
intersect_antisense = second.intersect(inverse_second, s=True).saveas('../../output/test.second.bedgraph')

In [49]:
intersect_sense.head()

chr2L	0	7508	0	.	+
chr2L	8136	8172	0	.	+
chr2L	9504	11940	0	.	+
chr2L	11940	12000	1	.	+
chr2L	12000	12020	0	.	+
chr2L	12020	12080	1	.	+
chr2L	12080	12350	0	.	+
chr2L	12350	12410	1	.	+
chr2L	12410	12670	0	.	+
chr2L	12670	12730	1	.	+
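
Step 4 (combining strand-specific results and adding strand information) is not run above. A rough sketch in plain Python, assuming rows are `(chrom, start, end, count)` tuples and using the same six-column layout as the cells above (count in column 4, `.` in column 5, strand in column 6). The helper name `combine_with_strand` is hypothetical, and the second-strand row is invented for illustration; only the first-strand rows come from the output above.

```python
def combine_with_strand(first_rows, second_rows):
    """Tag first-strand rows '+' and second-strand rows '-', drop rows
    with zero coverage, and return them sorted by position."""
    combined = []
    for strand, rows in (('+', first_rows), ('-', second_rows)):
        for chrom, start, end, count in rows:
            if str(count) != '0':  # same filter as x[3] != '0' above
                combined.append((chrom, int(start), int(end), str(count), '.', strand))
    return sorted(combined, key=lambda r: (r[0], r[1], r[2], r[5]))


# First-strand rows taken from the head() output above.
first_rows = [('chr2L', 11940, 12000, '1'), ('chr2L', 12000, 12020, '0')]
# Second-strand row is made up for this example.
second_rows = [('chr2L', 11950, 11980, '2')]

combined = combine_with_strand(first_rows, second_rows)
for row in combined:
    print(*row, sep='\t')
```

With real data the same thing could likely be done by concatenating the two BedTool objects and sorting, but the plain-Python version makes the column layout explicit.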
 