# CTCF / SIX5 / ZNF143 Data Analysis

Here’s an idea: we can try to replicate the test using the raw number of bases that overlap between the files. That would be as close to the original data as we can get, and it’s something Anshul would have the time to replicate. Try setting up a single notebook that does this analysis:

The total number of bases on the assembled human chromosomes is 3095677412; you get this by summing the sizes of chr1-chr22, chrX, and chrY listed here: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.chrom.sizes. We can let 3095677412 represent the size of our “universe”. Ignore ambiguous regions for now; it would be surprising if they altered the overall trend.
(1) The bases covered by IDR-optimal CTCF peaks are the “special” set: count them and print the total in a cell. It is a good idea to run the CTCF peaks through `bedtools merge -i ctcf_file.bed > merged_ctcf_file.bed` first, to avoid double-counting bases in overlapping regions.
(2) The bases covered by IDR-optimal SIX5 peaks are the “picked” set: count them and print the total in a cell. Again, run the SIX5 peaks through `bedtools merge` first to avoid double-counting bases.
(3) The bases in the intersection of CTCF and SIX5 peaks are the “special, picked” set. Count the bases in the file produced by `bedtools intersect -a ctcf_file.bed -b six5_file.bed | bedtools merge > merged_ctcf_six5_intersection.bed`
(4) Print out the odds ratio [(special,picked)/(special,not_picked)] / [(not_special,picked)/(not_special,not_picked)]. Also print out the p-value from Fisher’s exact test.
(5) Repeat the analysis above, but conditioning on bases that overlap the ZNF143 peak set.
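
The universe size quoted above can be checked by summing a chrom.sizes file. Below is a minimal sketch, assuming the usual two-column `name<TAB>size` layout; the helper name `sum_canonical_sizes` and the toy sizes are made up for illustration and are not part of the original notebook.

```python
import re

# Canonical chromosome names only: chr1..chr22, chrX, chrY
# (chrM and unplaced/alternate contigs are excluded).
CANONICAL = re.compile(r"^chr([1-9]|1[0-9]|2[0-2]|X|Y)$")

def sum_canonical_sizes(lines):
    """Sum the sizes of canonical chromosomes from chrom.sizes-style lines."""
    total = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, size = fields[0], int(fields[1])
        if CANONICAL.match(name):
            total += size
    return total

# Toy input with made-up sizes, just to show the filtering.
toy = """chr1\t1000
chr2\t900
chrX\t500
chrM\t16
chr6_apd_hap1\t40
""".splitlines()

print(sum_canonical_sizes(toy))  # 2400
```

Passing an open handle to a local copy of `hg19.chrom.sizes` instead of `toy` should reproduce the 3095677412 figure used as the universe.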

In the meantime, I will let Anshul know the results of this analysis to get his initial thoughts.

In [1]:
universe = 3095677412
data_dir = "/home/ktian/kundajelab/tfnet/ENCODE_data/"
#data_dir = "/home/ktian/oak/stanford/groups/akundaje/marinovg/TF-models/2018-06-15-ENCODE-TF-ChIP-seq-for-multitask-models/"
dnase_dir = "/mnt/data/integrative/dnase/ENCSR845CFB.Primary_hematopoietic_stem_cells.UW_Stam.DNase-seq/" + \
            "out_50m/peak/idr/pseudo_reps/rep1/"

ctcf_file   = data_dir + "GM12878-CTCF-human-ENCSR000AKB-optimal_idr.narrowPeak.gz"
six5_file   = data_dir + "GM12878-SIX5-human-ENCSR000BJE-optimal_idr.narrowPeak.gz"
znf143_file = data_dir + "GM12878-ZNF143-human-ENCSR000DZL-optimal_idr.narrowPeak.gz"
dnase_file  = dnase_dir + \
            "ENCSR845CFB.Primary_hematopoietic_stem_cells.UW_Stam.DNase-seq_rep1-pr.IDR0.1.filt.narrowPeak.gz"

import os

os.system("ls -l " + ctcf_file)

os.system("mkdir -p _tmp_")
merged_ctcf   = "_tmp_/ctcf.bed"
merged_six5   = "_tmp_/six5.bed"
merged_znf143 = "_tmp_/znf143.bed"
merged_dnase  = "_tmp_/dnase.bed"

os.system("gunzip -c " + ctcf_file   + " | bedtools sort | bedtools merge > " + merged_ctcf)
os.system("gunzip -c " + six5_file   + " | bedtools sort | bedtools merge > " + merged_six5)
os.system("gunzip -c " + znf143_file + " | bedtools sort | bedtools merge > " + merged_znf143)
os.system("gunzip -c " + dnase_file  + " | bedtools sort | bedtools merge > " + merged_dnase)

0
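
One caveat with `os.system` is that it only reports the exit status of the last command in a pipeline, so a failing `gunzip` or `bedtools sort` can go unnoticed. A possible alternative is sketched below, assuming `bash` is available; `merge_command` and `run_merge` are hypothetical helper names, not part of the original notebook.

```python
import shlex
import subprocess

def merge_command(src_gz, dst_bed):
    """Build the same gunzip | sort | merge pipeline as above, with quoted paths."""
    return "set -o pipefail; gunzip -c {} | bedtools sort | bedtools merge > {}".format(
        shlex.quote(src_gz), shlex.quote(dst_bed))

def run_merge(src_gz, dst_bed):
    """Run the pipeline under bash; raises CalledProcessError if any stage fails."""
    subprocess.run(["bash", "-c", merge_command(src_gz, dst_bed)], check=True)

# Only print the command here; run_merge() needs bedtools and a real input file.
print(merge_command("peaks.narrowPeak.gz", "_tmp_/peaks.bed"))
```

With this in place, a call such as `run_merge(ctcf_file, merged_ctcf)` would stop the cell with an exception instead of silently writing an empty or partial file.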

In [10]:
!ls ./_tmp_
!cat ./_tmp_/dnase.bed | head -10

ctcf.bed	      ctcf_znf143.bed	     six5_znf143.bed
ctcf_dnase.bed	      ctcf_znf143_dnase.bed  six5_znf143_dnase.bed
ctcf_six5.bed	      dnase.bed		     znf143.bed
ctcf_six5_dnase.bed   six5.bed		     znf143_ctcf.bed
ctcf_six5_znf143.bed  six5_dnase.bed	     znf143_dnase.bed
chr1	115626	115840
chr1	713880	714405
chr1	762639	763159
chr1	773680	774027
chr1	778139	778531
chr1	793421	793620
chr1	805107	805431
chr1	839811	840278
chr1	842177	842434
chr1	846590	846941
cat: write error: Broken pipe


In [3]:
import pandas as pd
def sum_up_intervals_len(file, desc):
    df = pd.read_csv(file, sep='\t', index_col=None, header=None)
    #experiments = ['ENCSR000AKB', 'ENCSR000BJE','ENCSR000DZL']
    #df.columns = ['chr', 'start', 'end']
    
    df_new = df.iloc[:,1:].astype('int') # drop the first column (chromosome name); keep start and end as int

    df_len = df_new.iloc[:,1] - df_new.iloc[:,0] # end - start
    
    total_len = df_len.sum()
    print("%18s: #bp in merged file is %10d" % (desc, total_len))
    return total_len

In [4]:
len_ctcf   = sum_up_intervals_len(merged_ctcf, "ctcf")
len_six5   = sum_up_intervals_len(merged_six5, "six5")
len_znf143 = sum_up_intervals_len(merged_znf143, "znf143")
len_dnase  = sum_up_intervals_len(merged_dnase, "dnase")

              ctcf: #bp in merged file is   12407139
              six5: #bp in merged file is    1180900
            znf143: #bp in merged file is   10903313
             dnase: #bp in merged file is   41137677
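
To see why the peaks are merged before counting, here is a small pure-Python sketch of what `bedtools merge` followed by the length sum does. The functions and toy intervals are illustrative only; the notebook itself relies on bedtools and pandas.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching (chrom, start, end) intervals (half-open, BED-style)."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]

def covered_bases(intervals):
    """Number of distinct bases covered, counting each base once."""
    return sum(end - start for _, start, end in merge_intervals(intervals))

toy = [("chr1", 100, 200), ("chr1", 150, 250), ("chr2", 0, 50)]
naive = sum(end - start for _, start, end in toy)

print(naive, covered_bases(toy))  # 250 200
```

The naive sum counts the 50 bases in chr1:150-200 twice; merging first removes that double-counting.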


In [5]:
ctcf_six5_bed   = "_tmp_/ctcf_six5.bed"
six5_znf143_bed = "_tmp_/six5_znf143.bed"
ctcf_znf143_bed = "_tmp_/ctcf_znf143.bed"
znf143_ctcf_bed = "_tmp_/znf143_ctcf.bed"
ctcf_six5_znf143_bed = "_tmp_/ctcf_six5_znf143.bed"

In [11]:
def intersect(tf_a, tf_b):
    tf_a_bed = "_tmp_/" + tf_a + ".bed"
    tf_b_bed = "_tmp_/" + tf_b + ".bed"
    tf_ab_intersect = "_tmp_/" + tf_a + "_" + tf_b + ".bed"
    
    cmd = "bedtools intersect -sorted -a " + tf_a_bed + " -b " + tf_b_bed + " | bedtools merge > " + tf_ab_intersect
    os.system(cmd)
    
    len_intersect = sum_up_intervals_len(tf_ab_intersect, tf_a + "_" + tf_b)
    return len_intersect

len_ctcf_six5   = intersect("ctcf", "six5")
len_six5_znf143 = intersect("six5", "znf143")
len_znf143_ctcf = intersect("znf143", "ctcf")
len_ctcf_znf143 = intersect("ctcf","znf143")

len_ctcf_six5_znf143 = intersect("ctcf_six5","znf143")
len_all = len_ctcf_six5_znf143

         ctcf_six5: #bp in merged file is     130386
       six5_znf143: #bp in merged file is     729608
       znf143_ctcf: #bp in merged file is    5765970
       ctcf_znf143: #bp in merged file is    5765970
  ctcf_six5_znf143: #bp in merged file is     105690
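
The identical counts for `znf143_ctcf` and `ctcf_znf143` are expected, because the number of shared bases does not depend on the order of the two files. A minimal sketch of the underlying sweep for a single chromosome follows, assuming both inputs are already sorted and merged; `intersect_bases` is an illustrative helper, not what bedtools runs internally.

```python
def intersect_bases(a, b):
    """Count bases shared by two sorted, merged lists of (start, end) intervals."""
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo  # length of the overlapping stretch
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total

a = [(100, 200), (300, 400)]
b = [(150, 350), (390, 500)]

print(intersect_bases(a, b), intersect_bases(b, a))  # 110 110
```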


### Fisher's Exact Test -- presence of ZNF143
(4) Print out the odds ratio [(special,picked)/(special,not_picked)] / [(not_special,picked)/(not_special,not_picked)]. Also print out the p-value from Fisher’s exact test.
(5) Repeat the analysis above, but conditioning on bases that overlap the ZNF143 peak set.
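
As a sanity check on the tables, the odds ratio in step (4) can be computed by hand from the base counts printed earlier. This is a sketch under the assumption that the first value returned by `fisher_exact` is the plain sample odds ratio a*d/(b*c); `contingency_table` and `odds_ratio` are hypothetical helper names.

```python
def contingency_table(universe, special, picked, both):
    """2x2 table of base counts; the last cell follows from inclusion-exclusion."""
    return [
        [both,          special - both],
        [picked - both, universe - special - picked + both],
    ]

def odds_ratio(table):
    (a, b), (c, d) = table
    return (a * d) / (b * c)

# Genome-wide: universe, CTCF, SIX5, CTCF-and-SIX5 base counts from the outputs above.
genome = contingency_table(3095677412, 12407139, 1180900, 130386)

# Conditioned on ZNF143: the universe is the ZNF143 bases, and every other
# count is the corresponding intersection with ZNF143.
znf = contingency_table(10903313, 5765970, 729608, 105690)

print(odds_ratio(genome))  # about 31.16
print(odds_ratio(znf))     # about 0.135
```

The four cells of each table sum to the chosen universe, which is a quick way to catch a count that was taken from the wrong conditioning set.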

In [7]:
#question 1: what is the coenrichment of CTCF with six5 (special = CTCF, picked = six5)
#contingency table should look like:

# [ pos_for_ctcf_and_six5, pos_for_ctcf_but_not_six5]
# [ neg_for_ctcf_pos_for_six5, neg_for_ctcf_and_six5]

from scipy.stats import fisher_exact
print(fisher_exact(table=[
    [len_ctcf_six5, len_ctcf - len_ctcf_six5],
    [len_six5 - len_ctcf_six5, universe - len_ctcf - len_six5 + len_ctcf_six5]
])) # bottom-right cell uses the principle of inclusion-exclusion (PIE)

#question 2: What is the coenrichment of CTCF with six5 in the presence of znf143
print(fisher_exact(table=[
    [len_all, len_ctcf_znf143 - len_all],
    [len_six5_znf143 - len_all, len_znf143 - len_ctcf_znf143 - len_six5_znf143 + len_all]
])) # znf "universe"

(31.160843103361707, 0.0)
(0.13507491336845634, 0.0)


### Fisher's Exact Test conditioned on DHS peaks (accessible regions)

In [8]:
len_ctcf   = intersect("ctcf","dnase")
len_six5   = intersect("six5","dnase")
len_znf143 = intersect("znf143","dnase")
len_ctcf_six5   = intersect("ctcf_six5","dnase")
len_ctcf_znf143 = intersect("ctcf_znf143","dnase")
len_six5_znf143 = intersect("six5_znf143","dnase")

        ctcf_dnase: #bp in merged file is    7226162
        six5_dnase: #bp in merged file is    1006727
      znf143_dnase: #bp in merged file is    7860007
   ctcf_six5_dnase: #bp in merged file is     126364
 ctcf_znf143_dnase: #bp in merged file is    4739905
 six5_znf143_dnase: #bp in merged file is     654317


In [9]:
#question 3: condition on DHS peaks
# [ pos_for_ctcf_and_six5_DHS, pos_for_ctcf_but_not_six5_DHS]
# [ neg_for_ctcf_pos_for_six5_DHS, neg_for_ctcf_and_six5_DHS]

print(fisher_exact(table=[
    [len_ctcf_six5, len_ctcf - len_ctcf_six5],
    [len_six5 - len_ctcf_six5, len_dnase - len_ctcf - len_six5 + len_ctcf_six5]
]))

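#question 4: What is the coenrichment of CTCF with six5 in the presence of znf143, within DHS peaks
# NOTE: len_all is still the genome-wide value from In [11]; the other counts here are DHS-restricted.
# For a consistent table, len_all would need to be intersect("ctcf_six5_znf143", "dnase").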
print(fisher_exact(table=[
    [len_all, len_ctcf_znf143 - len_all],
    [len_six5_znf143 - len_all, len_znf143 - len_ctcf_znf143 - len_six5_znf143 + len_all]
]))

(0.6677890770954302, 0.0)
(0.10689634318360579, 0.0)
