# Relationship between CHC interactions that were not filtered for distance-dependent interaction frequencies and TAD boundaries

The heatmaps show that the interaction profiles are cell type specific, especially for strong interactions (with imbalanced read pair counts), but also for all other interaction categories, including the profiles obtained when all categories are combined. In the UCSC Genome Browser, the coverage in profiles from strong interactions (`DIX`) appears to drop at TAD boundaries. One possible explanation is that interactions spread from a bait and are prevented from spreading further by structural obstacles such as TAD boundaries.

The aim here is to investigate whether interactions end at TAD boundaries more often than expected by chance. Furthermore, we investigate whether interactions span TAD boundaries less often than expected by chance. For this purpose, there is the module `TadBoundarySet`, which contains TAD boundaries and supports two functions:

1. `get_distance_to_nearest_tad_boundary(chr, pos) -> distance_to_next_tad`
2. `get_number_of_boundaries_spanned_by_region(chr, sta_pos, end_pos) -> number_of_spanned_tads`

The first function returns the distance to the nearest TAD boundary for a given position. The second function returns the number of TAD boundaries that are spanned by a given region. To process the interactions, the module `DiachromaticInteractionSet` is used.
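To make the two queries concrete, here is a minimal sketch of how they could be answered with sorted boundary positions and binary search. The class `SketchTadBoundarySet` is a hypothetical stand-in and not the `diachr` implementation; in particular, counting boundaries that lie exactly on the region ends as spanned is an assumption.

```python
import bisect
from collections import defaultdict


class SketchTadBoundarySet:
    """Hypothetical stand-in for illustration, not the diachr class."""

    def __init__(self, boundaries):
        # boundaries: iterable of (chromosome, position) pairs
        self._pos = defaultdict(list)
        for chrom, pos in boundaries:
            self._pos[chrom].append(pos)
        # Sort once so that both queries can use binary search
        for chrom in self._pos:
            self._pos[chrom].sort()

    def get_distance_to_nearest_tad_boundary(self, chrom, pos):
        positions = self._pos.get(chrom, [])
        if not positions:
            return None  # no boundaries known for this chromosome
        i = bisect.bisect_left(positions, pos)
        candidates = []
        if i < len(positions):
            candidates.append(positions[i] - pos)      # first boundary at or right of pos
        if i > 0:
            candidates.append(pos - positions[i - 1])  # last boundary left of pos
        return min(candidates)

    def get_number_of_boundaries_spanned_by_region(self, chrom, sta_pos, end_pos):
        positions = self._pos.get(chrom, [])
        # Assumption: boundaries b with sta_pos <= b <= end_pos count as spanned
        left = bisect.bisect_left(positions, sta_pos)
        right = bisect.bisect_right(positions, end_pos)
        return right - left


toy = SketchTadBoundarySet([("chr1", 100), ("chr1", 500), ("chr1", 900), ("chr2", 300)])
print(toy.get_distance_to_nearest_tad_boundary("chr1", 450))
print(toy.get_number_of_boundaries_spanned_by_region("chr1", 50, 600))
```

With sorted positions, each query takes logarithmic time per chromosome, which matters when several million interactions are processed.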

In [1]:
import sys
import os
import numpy as np
sys.path.append("..")
from diachr import TadBoundarySet
from diachr import DiachromaticInteractionSet
from scipy import stats
import random

There are interactions that were evaluated using the `ST` or `HT` rule. An FDR of 5% was used for the `ST` rule and an FDR of 1% for the `HT` rule. For both rules, there are interactions without `RPC` filter, with `RPC` filter, and with the complement of the `RPC` filter. With the `RPC` filter, all interactions in which at least one of the four read pair counts is `0` were discarded at the very beginning of the analysis (before the P-value threshold was determined).
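The `RPC` filter and its complement can be illustrated with a small sketch. It assumes, for illustration only, that each interaction is represented as a tuple of its four read pair counts; the function names are hypothetical and do not come from `diachr` or `DICer`.

```python
def has_zero_read_pair_count(rp_counts):
    """Return True if at least one of the four read pair counts is 0."""
    if len(rp_counts) != 4:
        raise ValueError("Expected exactly four read pair counts")
    return any(count == 0 for count in rp_counts)


def split_by_rpc_filter(interactions):
    """Split interactions into those that pass the filter and the complement."""
    kept = [i for i in interactions if not has_zero_read_pair_count(i)]
    complement = [i for i in interactions if has_zero_read_pair_count(i)]
    return kept, complement


# Toy read pair counts, not taken from the data
toy_interactions = [(5, 3, 2, 1), (10, 0, 4, 0), (0, 0, 7, 7)]
kept, complement = split_by_rpc_filter(toy_interactions)
print(kept)
print(complement)
```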

In [2]:
RPC_RULE = "st"
ANALYSIS='ST_FDR005'
#ANALYSIS='ST_RMRO_FDR005'
#ANALYSIS='ST_KRO_FDR005'
#RPC_RULE = "ht"
#ANALYSIS='HT_FDR001'
#ANALYSIS='HT_RMRO_FDR001'
#ANALYSIS='HT_KRO_FDR001'

There is one CHC dataset for each of the 17 cell types and, for eight of the cell types, there are HC data and TAD boundaries.

In [3]:
CELL_TYPE_SHORT = 'MK'             # Has HC data
#CELL_TYPE_SHORT = 'ERY'           # Has HC data
#CELL_TYPE_SHORT = 'NEU'           # Has HC data
#CELL_TYPE_SHORT = 'MON'           # Has HC data
#CELL_TYPE_SHORT = 'MAC_M0'        # Has HC data
#CELL_TYPE_SHORT = 'MAC_M1'
#CELL_TYPE_SHORT = 'MAC_M2'
#CELL_TYPE_SHORT = 'EP'
#CELL_TYPE_SHORT = 'NB'            # Has HC data
#CELL_TYPE_SHORT = 'TB'
#CELL_TYPE_SHORT = 'FOET'
#CELL_TYPE_SHORT = 'NCD4'          # Has HC data
#CELL_TYPE_SHORT = 'TCD4'
#CELL_TYPE_SHORT = 'NACD4'
#CELL_TYPE_SHORT = 'ACD4'
#CELL_TYPE_SHORT = 'NCD8'          # Has HC data
#CELL_TYPE_SHORT = 'TCD8'

A `TadBoundarySet` can be created with one of the eight BED files with the published TADs or a BED file with TADs from all eight cell types that was created using `BedTools`. See bash script in: `../additional_files/javierre_2016/tad_regions_hg38/`.

In [4]:
#tad_boundaries = '../additional_files/javierre_2016/tad_regions_hg38/hglft_genome_TADs_' + CELL_TYPE_SHORT + '_hg38.bed'
tad_boundaries = '../additional_files/javierre_2016/tad_regions_hg38/merged_tad_boundary_centers.bed'
tbs = TadBoundarySet(tad_boundaries)

An interaction file that was created with `DICer` is read in and reference interactions are re-selected afterwards. The selection of reference interactions can be omitted as soon as the new reference selection (no distinction between `NE` and `EN` and additional `DIX` category) is integrated into `DICer`.

In [5]:
INTERACTION_FILE = '../DICer_interactions/' + ANALYSIS.upper() + '/CHC/JAV_' + CELL_TYPE_SHORT + '_RALT_20000_' + ANALYSIS.lower() + '_evaluated_and_categorized_interactions.tsv.gz'
OUT_PREFIX = 'JAV_' + CELL_TYPE_SHORT + '_RALT_20000_' + ANALYSIS.lower()

d11_interaction_set = DiachromaticInteractionSet(rpc_rule = RPC_RULE)
d11_interaction_set.parse_file(
    i_file = INTERACTION_FILE,
    verbose = True)

report_dict = d11_interaction_set.select_reference_interactions_2x(verbose=True)

[INFO] Parsing Diachromatic interaction file ...
	[INFO] ../DICer_interactions/ST_FDR005/CHC/JAV_MK_RALT_20000_st_fdr005_evaluated_and_categorized_interactions.tsv.gz
	[INFO] Parsed 1,000,000 interaction lines ...
	[INFO] Parsed 2,000,000 interaction lines ...
	[INFO] Parsed 3,000,000 interaction lines ...
	[INFO] Parsed 4,000,000 interaction lines ...
	[INFO] Parsed 5,000,000 interaction lines ...
	[INFO] Set size: 5,249,507
[INFO] ... done.
[INFO] Select reference interactions ...
	[INFO] Treating NE and EN as one category ...
	[INFO] First pass: Count directed interactions for different read pair counts ...
	[INFO] Second pass: Select undirected reference interactions for different read pair counts ...
	[INFO] Third pass: Mark directed interactions for which there is no reference ...
[INFO] ... done.


## Test whether interactions end near TAD boundaries more often than expected by chance

Now we have everything in place to do the first analysis. We compare the distances to the nearest TAD boundary for the following interaction categories:

1. `DIX`: Imbalanced interactions with high read pair counts and without counterpart in the reference interactions
2. `DI`: Imbalanced interactions with counterpart in the reference interactions
3. `UIR`: Balanced reference interactions (comparable to `DI` with respect to total number and distribution of read pair numbers)
4. `UI`: Balanced interactions (remaining powered interactions)
5. `ALL`: All interaction categories combined
6. `RANDOM`: Distances to the nearest TAD boundary from randomly chosen positions (one position per interaction, drawn between the start and end of the interaction)

In [6]:
tad_dist_lists = {
    'DIX': [],
    'DI': [],
    'UIR': [],
    'UI': [],
    'ALL': [],
    'RANDOM': []
}
for d11_inter in d11_interaction_set._inter_dict.values():
    
    # Determine the distance to the nearest TAD boundary from the outermost position of the 'N' digest.
    # For tag pairs other than 'NE' and 'EN', `dist` is not reassigned and keeps the value from the previous interaction.
    if d11_inter.enrichment_status_tag_pair == 'NE':
        dist = tbs.get_distance_to_nearest_tad_boundary(d11_inter.chrA, d11_inter.fromA)
    elif d11_inter.enrichment_status_tag_pair == 'EN':
        dist = tbs.get_distance_to_nearest_tad_boundary(d11_inter.chrA, d11_inter.toB)
    
    # Add determined distance to list
    tad_dist_lists[d11_inter.get_category()].append(dist)
    tad_dist_lists['ALL'].append(dist)
    
    # Draw a random position between the start and end of the interaction
    random_position = random.randint(d11_inter.fromA, d11_inter.toB)
    random_dist = tbs.get_distance_to_nearest_tad_boundary(d11_inter.chrA, random_position)

    # Add determined distance to list
    tad_dist_lists['RANDOM'].append(random_dist)
    
print('... done.')

... done.


Output summary statistics and results of two-sample KS test for selected pairs of categories.

In [7]:
print(OUT_PREFIX)
print()
for i_cat in ['DIX','DI','UIR','UI', 'ALL', 'RANDOM']:
    print(i_cat + ' -----------')
    print('\tQuantiles: ' + str(np.quantile(tad_dist_lists[i_cat], [0.25, 0.50, 0.75])))
    print('\tn=' + str(len(tad_dist_lists[i_cat])))

print()
print('DIX vs. DI: ' + str(stats.ks_2samp(tad_dist_lists['DIX'], tad_dist_lists['DI'])))
print('DI vs. UIR: ' + str(stats.ks_2samp(tad_dist_lists['DI'], tad_dist_lists['UIR'])))
print('UIR vs. UI: ' + str(stats.ks_2samp(tad_dist_lists['UIR'], tad_dist_lists['UI'])))
print('UI vs. RANDOM: ' + str(stats.ks_2samp(tad_dist_lists['UI'], tad_dist_lists['RANDOM'])))
print('ALL vs. RANDOM: ' + str(stats.ks_2samp(tad_dist_lists['ALL'], tad_dist_lists['RANDOM'])))
print()

JAV_MK_RALT_20000_st_fdr005

DIX -----------
	Quantiles: [24094. 55287. 99168.]
	n=379
DI -----------
	Quantiles: [ 24727.75  57735.   111228.  ]
	n=209728
UIR -----------
	Quantiles: [ 25893.    60083.5  115234.75]
	n=209728
UI -----------
	Quantiles: [ 26551.  61649. 118971.]
	n=4829672
ALL -----------
	Quantiles: [ 26447.  61425. 118484.]
	n=5249507
RANDOM -----------
	Quantiles: [ 25577.  63036. 122781.]
	n=5249507

DIX vs. DI: KstestResult(statistic=0.0574021544578307, pvalue=0.15928971711902928)
DI vs. UIR: KstestResult(statistic=0.01462370308208727, pvalue=6.567710282478044e-20)
UIR vs. UI: KstestResult(statistic=0.011950831224567637, pvalue=2.3012593216484192e-25)
UI vs. RANDOM: KstestResult(statistic=0.01250798279386095, pvalue=0.0)
ALL vs. RANDOM: KstestResult(statistic=0.013350206028870937, pvalue=0.0)
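When reading these results, it helps to separate the KS statistic (the maximum difference between the two empirical distribution functions) from the P-value, which also depends on sample size. The following sketch uses simulated data, not the interaction data: two exponential samples whose scales differ by 5% are compared at a small and a large sample size. The scales, sample sizes and the helper name `ks_for_sample_size` are arbitrary choices for illustration.

```python
import numpy as np
from scipy import stats


def ks_for_sample_size(n, seed=0):
    """Two-sample KS test on simulated distances with a 5% difference in scale."""
    rng = np.random.default_rng(seed)
    a = rng.exponential(scale=100_000, size=n)
    b = rng.exponential(scale=105_000, size=n)
    return stats.ks_2samp(a, b)


small = ks_for_sample_size(300)
large = ks_for_sample_size(200_000)
print('n=300:     statistic=' + str(small.statistic) + ', pvalue=' + str(small.pvalue))
print('n=200,000: statistic=' + str(large.statistic) + ', pvalue=' + str(large.pvalue))
```

With hundreds of thousands of values per category, even a small shift between distributions yields a very small P-value, so the size of the statistic and the quantiles should be considered alongside the P-values.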



## Test whether interactions span TAD boundaries less often than expected by chance, taking into account their length

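In the second analysis, we investigate how often interactions span TAD boundaries. To take interaction length into account, the number of spanned boundaries is summed for each category and divided by the summed interaction distance. For comparison, a random region of the same length is drawn for each interaction.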

In [8]:
# For each category:
#   I_NUM:  number of interactions
#   I_DIST: summed interaction distance
#   SB_NUM: summed number of spanned TAD boundaries
spanned_boundary_length_dict = {
    i_cat: {'I_NUM': 0, 'I_DIST': 0, 'SB_NUM': 0}
    for i_cat in ['DIX', 'DI', 'UIR', 'UI', 'ALL', 'RANDOM']
}
for d11_inter in d11_interaction_set._inter_dict.values():
    
    if d11_inter.enrichment_status_tag_pair == 'NE' or d11_inter.enrichment_status_tag_pair == 'EN':
        
        # Get interaction distance and number of spanned TAD boundaries
        i_dist = d11_inter.i_dist
        sb_num = tbs.get_number_of_boundaries_spanned_by_region(d11_inter.chrA, d11_inter.fromA, d11_inter.toB)
        
        # Increment numbers for interaction category
        spanned_boundary_length_dict[d11_inter.get_category()]['I_NUM'] += 1
        spanned_boundary_length_dict[d11_inter.get_category()]['I_DIST'] += i_dist
        spanned_boundary_length_dict[d11_inter.get_category()]['SB_NUM'] += sb_num
        
        # Increment numbers for all interaction categories combined
        spanned_boundary_length_dict['ALL']['I_NUM'] += 1
        spanned_boundary_length_dict['ALL']['I_DIST'] += i_dist
        spanned_boundary_length_dict['ALL']['SB_NUM'] += sb_num

    
        # Draw the start position of a random region with the same length as the interaction
        random_position = random.randint(d11_inter.fromA, d11_inter.toB)
        sb_num = tbs.get_number_of_boundaries_spanned_by_region(d11_inter.chrA, random_position, random_position + i_dist)
            
        # Increment numbers for random regions
        spanned_boundary_length_dict['RANDOM']['I_NUM'] += 1
        spanned_boundary_length_dict['RANDOM']['I_DIST'] += i_dist
        spanned_boundary_length_dict['RANDOM']['SB_NUM'] += sb_num
        
print('... done.')

... done.


Output summary statistics.

In [9]:
print(OUT_PREFIX)
print()
for i_cat in ['DIX','DI','UIR','UI', 'ALL','RANDOM']:
    print(i_cat + ' -----------')
    print("\tn=" + str(spanned_boundary_length_dict[i_cat]['I_NUM']))
    print("\t1000000*(tsb/len)=" + str(1000000*spanned_boundary_length_dict[i_cat]['SB_NUM']/spanned_boundary_length_dict[i_cat]['I_DIST']))

JAV_MK_RALT_20000_st_fdr005

DIX -----------
	n=145
	1000000*(tsb/len)=2.7889821261107994
DI -----------
	n=197902
	1000000*(tsb/len)=4.695716958403391
UIR -----------
	n=197902
	1000000*(tsb/len)=4.840515852521417
UI -----------
	n=4523334
	1000000*(tsb/len)=4.963951042804502
ALL -----------
	n=4919283
	1000000*(tsb/len)=4.956231046793079
RANDOM -----------
	n=4919283
	1000000*(tsb/len)=5.067359488114551
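The reported value `1000000*(tsb/len)` is the number of spanned TAD boundaries per megabase of summed interaction distance. Dividing by the summed distance makes categories with different length distributions comparable, because longer regions span more boundaries on average. A minimal sketch of the calculation with toy numbers follows; the helper name `boundaries_per_megabase` and the numbers are hypothetical and not taken from the results above.

```python
def boundaries_per_megabase(sb_num, i_dist):
    """Spanned boundaries per 1,000,000 bp of summed interaction distance."""
    if i_dist <= 0:
        raise ValueError("Summed interaction distance must be positive")
    return 1_000_000 * sb_num / i_dist


# Toy counts in the same layout as spanned_boundary_length_dict
toy_dict = {
    'DI': {'I_NUM': 10, 'I_DIST': 2_500_000, 'SB_NUM': 12},
    'RANDOM': {'I_NUM': 10, 'I_DIST': 2_500_000, 'SB_NUM': 15},
}
toy_rates = {
    i_cat: boundaries_per_megabase(d['SB_NUM'], d['I_DIST'])
    for i_cat, d in toy_dict.items()
}
print(toy_rates)
```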
