# Compare different motif search locations

We are interested in figuring out which genes are regulated by which transcription 
factors. Our results will likely depend on how we make this association.
**Question: Where should we look for transcription factor binding motifs?** 
Here I look into the effects of using various regulatory regions. This is a 
deceptively simple question, which quickly becomes complicated.

Given a single isoform, the regulatory region would include a region upstream 
and downstream of the transcription start site. Typically this region is 
defined as +/- 1 or 2 kb. Sometimes this region is expanded to include the 
first exon and/or first intron, or the entire genic region. Things get more 
complicated for genes with multiple isoforms, because each isoform can have its 
own transcription start site.
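
As a minimal sketch of the single-isoform case described above, the window around a start site can be computed in a strand-aware way. The function name `tss_window` and the toy coordinates are hypothetical and only illustrate the idea; clamping at zero is an added assumption so that windows near a chromosome start stay valid.

```python
def tss_window(start, end, strand, flank=1000):
    """Return (window_start, window_end) covering +/- `flank` bp around the TSS.

    The TSS is the left-most coordinate on the '+' strand and the
    right-most coordinate on the '-' strand.
    """
    if strand == '+':
        tss = start
    elif strand == '-':
        tss = end
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clamp the left edge so the window never starts before the chromosome.
    return max(0, tss - flank), tss + flank


# Toy feature spanning 5000-9000 (made-up coordinates).
plus_1kb = tss_window(5000, 9000, '+')          # window around the left end
minus_2kb = tss_window(5000, 9000, '-', 2000)   # window around the right end
```

The same feature gives very different windows depending on strand, which is why the strand check has to come before any arithmetic on the coordinates.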

Here I take two major approaches: the first focuses on the gene level, and the 
second looks at each isoform individually and then summarizes to the gene 
level. For each of these approaches I define several regions:

* basic 1kb: the region 1 kb upstream and downstream of the gene/transcription 
  start site.
* basic 2kb: the region 2 kb upstream and downstream of the gene/transcription 
  start site.
* first exon: from 1 kb upstream of the gene/transcription start site through 
  the longest first exon.
* first intron: from 1 kb upstream of the gene/transcription start site through 
  the longest first intron.
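
The "first exon" and "first intron" regions in the list above could be sketched as follows. This is an illustration under stated assumptions, not the implementation used later in this notebook: the helper names `upstream_through` and `longest`, and all coordinates, are hypothetical. "Longest" is taken here to mean the candidate feature with the greatest length.

```python
def upstream_through(tss, feature_start, feature_end, strand, upstream=1000):
    """Region from `upstream` bp before the TSS through the far end of a feature.

    The feature is a first exon or first intron given as (start, end).
    """
    if strand == '+':
        # Upstream is to the left; the region ends where the feature ends.
        return max(0, tss - upstream), feature_end
    if strand == '-':
        # Upstream is to the right; the region starts where the feature starts.
        return feature_start, tss + upstream
    raise ValueError("strand must be '+' or '-'")


def longest(features):
    """Pick the (start, end) feature with the greatest length."""
    return max(features, key=lambda f: f[1] - f[0])


# Two made-up candidate first exons sharing a '+' strand TSS at 5000.
first_exons = [(5000, 5200), (5000, 5400)]
exon = longest(first_exons)
exon_region = upstream_through(5000, exon[0], exon[1], '+')

# A made-up first intron on the same transcript; the region then also
# covers the first exon, because it runs from upstream through the intron.
intron_region = upstream_through(5000, 5400, 7000, '+')
```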

In [1]:
# %load ../start.py
# Load useful extensions

# Activate the autoreload extension for easy reloading of external packages
%reload_ext autoreload
%autoreload 1

# Turn on the watermark
%reload_ext watermark
%watermark -u -d -v -g

# Load ipycache extension and set up cachedir
%reload_ext ipycache
from ipycache import CacheMagics
CacheMagics.cachedir = '../cachedir'

# Add project library to path
import sys
sys.path.insert(0, '../../lib/python')


last updated: 2017-02-13 

CPython 3.5.2
IPython 5.1.0
Git hash: 791144d59378fbaea73087996644b5f6be125ca1


In [5]:
import pandas as pd
import gffutils
import pybedtools

In [3]:
db = gffutils.FeatureDB('/data/LCDB/lcdb-references/dm6/r6-11/gtf/dm6_r6-11.gtf.db')

## Gene Level Analysis

In [4]:
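genes = []
for gene in db.features_of_type('gene'):
    # The gene start site depends on strand: the left-most coordinate on '+',
    # the right-most coordinate on '-'.
    if gene.strand == '+':
        gene_start = gene.start
    elif gene.strand == '-':
        gene_start = gene.end
    else:
        # Without a strand the start site is undefined, so skip the feature.
        continue

    # Clamp window starts at 0 so genes near a chromosome start stay valid.
    genes.append([gene.chrom,
                  gene.start,
                  gene.end,
                  gene.id,
                  '.',
                  gene.strand,
                  max(0, gene_start - 1000),
                  gene_start + 1000,
                  max(0, gene_start - 2000),
                  gene_start + 2000])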

header = ['chrom', 'start', 'end', 'name', 'score', 'strand', '1kb_start', '1kb_end', '2kb_start', '2kb_end']
gene_df = pd.DataFrame(genes, columns=header)
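
One detail to keep in mind when turning annotation coordinates into BED intervals: GTF/GFF coordinates are 1-based and inclusive, while BED intervals are 0-based and half-open. If base-exact overlaps matter, the start can be shifted by one before building the BED file. The helper below is a hypothetical sketch of that conversion and is not applied in the cells that follow.

```python
def gtf_to_bed(start, end):
    """Convert a 1-based inclusive interval to a 0-based half-open interval."""
    if start < 1 or end < start:
        raise ValueError("expected 1-based coordinates with start <= end")
    # Only the start shifts; the inclusive end equals the exclusive BED end.
    return start - 1, end


# Made-up feature covering bases 101 through 200 (100 bases).
bed_start, bed_end = gtf_to_bed(101, 200)
length = bed_end - bed_start
```

In BED coordinates the feature length is simply `end - start`, which is a quick sanity check after converting.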

In [6]:
# Gene body and 1 kb window for a single test gene (FBgn0000008)
test_gene = gene_df['name'] == 'FBgn0000008'
gene_bed = pybedtools.BedTool.from_dataframe(gene_df.loc[test_gene, ['chrom', 'start', 'end', 'name', 'score', 'strand']])
onekb_bed = pybedtools.BedTool.from_dataframe(gene_df.loc[test_gene, ['chrom', '1kb_start', '1kb_end', 'name', 'score', 'strand']])
# 2 kb window for all genes (not restricted to the test gene)
twokb_bed = pybedtools.BedTool.from_dataframe(gene_df.loc[:, ['chrom', '2kb_start', '2kb_end', 'name', 'score', 'strand']])

In [7]:
onekb_bed.head()

chr2R	22135968	22137968	FBgn0000008	.	+


In [8]:
otf = pd.read_csv('../../output/fimo/motif_alignments_onTheFly_dm6.txt', sep='\t')

In [9]:
otf.head()

index,#pattern name,sequence name,start,stop,strand,score,p-value,q-value,matched sequence
0,OTF0063.1,chr3R,16912689,16912710,+,19.3433,6.33e-13,0.000172,CCCACAAAAAAAACCCCCAAAA
1,OTF0481.1,chr3R,23630570,23630589,-,32.5321,1.63e-12,0.00019,GGGGGGGGGGGGGGGAAACT
2,OTF0481.1,chr2R,10690729,10690748,-,32.5046,2.25e-12,0.00019,GGGGGGGGGGGGGGGAATAT
3,OTF0481.1,chrX,9492014,9492033,-,32.3853,3e-12,0.00019,GAGGGGGGGGGGGGGAAAAT
4,OTF0481.1,chrX,6505661,6505680,+,31.945,5e-12,0.00019,GGGGGGGGGGGGGGGAAAAG


In [10]:
otf_bed = pybedtools.BedTool.from_dataframe(otf.loc[:, ['sequence name', 'start', 'stop', '#pattern name', 'q-value', 'strand']])

In [11]:
otf_bed.head()

chr3R	16912689	16912710	OTF0063.1	0.000172	+
chr3R	23630570	23630589	OTF0481.1	0.00018999999999999998	-
chr2R	10690729	10690748	OTF0481.1	0.00018999999999999998	-
chrX	9492014	9492033	OTF0481.1	0.00018999999999999998	-
chrX	6505661	6505680	OTF0481.1	0.00018999999999999998	+
chrX	7020679	7020698	OTF0481.1	0.00018999999999999998	-
chrX	10536826	10536845	OTF0481.1	0.00018999999999999998	-
chrX	12153398	12153417	OTF0481.1	0.00018999999999999998	-
chrX	14601015	14601034	OTF0481.1	0.00018999999999999998	+
chr3R	22891728	22891747	OTF0481.1	0.00018999999999999998	-


In [12]:
otf_gene = otf_bed.intersect(gene_bed, wo=True)

In [13]:
otf_gene.count()

58
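
To make the layout of the intersect result easier to follow, here is a pure-Python sketch of what `bedtools intersect -wo` reports for BED6 input: each overlapping pair yields the original A fields, then the original B fields, then the number of overlapping bases. The function `intersect_wo` and the toy records are hypothetical and only mimic the row layout; they are not how pybedtools works internally.

```python
def intersect_wo(a_features, b_features):
    """Mimic the row layout of `bedtools intersect -wo` for BED6 tuples."""
    rows = []
    for a in a_features:
        for b in b_features:
            if a[0] != b[0]:
                continue  # different chromosomes never overlap
            # BED intervals are half-open, so the overlap is min(end) - max(start).
            overlap = min(a[2], b[2]) - max(a[1], b[1])
            if overlap > 0:
                rows.append(a + b + (overlap,))
    return rows


# Made-up motif hits (A) and one made-up gene (B).
motifs = [("chrT", 100, 120, "motifA", 0.01, "+"),
          ("chrT", 900, 920, "motifB", 0.02, "-")]
genes = [("chrT", 50, 500, "geneX", ".", "+")]

rows = intersect_wo(motifs, genes)
motif_name, q_value, gene_name = rows[0][3], rows[0][4], rows[0][9]
```

With six fields per record, fields 0 to 5 come from the motif, fields 6 to 11 from the gene, and field 12 is the overlap size, so the gene name sits at index 9.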

In [113]:
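# Each intersect row holds the 6 motif fields, the 6 gene fields, then the overlap size.
# Keep the motif name (field 3), the gene ID (field 9), and the motif q-value (field 4).
res = []
for row in otf_gene:
    res.append([row[3], row[9], row[4]])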

In [114]:
df2 = pd.DataFrame(res, columns=['otf', 'FBgn', 'q-value'])
# pybedtools returns fields as strings, so convert q-values before aggregating
df2['q-value'] = df2['q-value'].astype(float)

In [115]:
df2.groupby(['otf', 'FBgn']).agg(['count', 'mean', 'std'])

otf,FBgn,q-value count,q-value mean,q-value std
OTF0039.1,FBgn0000008,4,0.0234,0.0138
OTF0063.1,FBgn0000008,13,0.041846,0.006446
OTF0231.2,FBgn0000008,5,0.0219,0.0
OTF0249.1,FBgn0000008,6,0.011822,0.003621
OTF0304.1,FBgn0000008,1,0.0388,
OTF0351.1,FBgn0000008,9,0.012,0.0
OTF0361.1,FBgn0000008,6,0.006653,0.007231
OTF0397.1,FBgn0000008,1,0.033,
OTF0481.1,FBgn0000008,3,0.0329,0.000755
OTF0516.1,FBgn0000008,10,0.02,0.000422


In [118]:
df2[df2['otf'] == 'OTF0039.1']

index,otf,FBgn,q-value
16,OTF0039.1,FBgn0000008,0.0165
17,OTF0039.1,FBgn0000008,0.0165
18,OTF0039.1,FBgn0000008,0.0165
43,OTF0039.1,FBgn0000008,0.0441
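
As a quick check on the aggregation, the four q-values listed above for OTF0039.1 can be summarized by hand. The sketch below uses only the standard library; `statistics.stdev` computes the sample standard deviation (dividing by n - 1), which matches the pandas default of `ddof=1`.

```python
import statistics

# q-values for OTF0039.1 / FBgn0000008, copied from the table above
q_values = [0.0165, 0.0165, 0.0165, 0.0441]

count = len(q_values)
mean = statistics.mean(q_values)
std = statistics.stdev(q_values)  # sample standard deviation (n - 1)

print(count, round(mean, 4), round(std, 4))
```

Up to rounding, this should reproduce the count, mean, and std reported for OTF0039.1 in the grouped summary (4, 0.0234, 0.0138). Three of the four hits share the same q-value, so the spread comes entirely from the single weaker hit.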
