In [1]:
import sys
import os
sys.path.append("../..")
from diachr import DiachromaticInteractionSet
from diachr import BaitedDigest
from diachr import BaitedDigestSet

# Create directory for output files generated in this notebook 
NOTEBOOK_RESULTS_DIR = 'bd_analysis_results'
%mkdir -p $NOTEBOOK_RESULTS_DIR

# Classification of baited digests into BDC0, BDC1 or BDC2

In this notebook, we divide baited digests into three classes, BDC0, BDC1 and BDC2, based on which configurations predominate among the interactions that extend from a baited digest in the 5' or 3' direction.

## Input file with interactions

The input is a file in `DiachromaticInteraction11` format created with the Python script `DICer.py`.

In [2]:
AUTHOR = 'JAV' # MIF or JAV
PROTOCOL = 'CHC' # HC or CHC
CELL_TYPE_SHORT = 'MAC_M0' # GM12878, MK, ERY, NEU, MON, MAC_M0, ...
OUT_PREFIX = AUTHOR + '_' + CELL_TYPE_SHORT + '_' + PROTOCOL +'_REPC'
INTERACTION_FILE = '../../DICer_interactions/' + PROTOCOL + '/' + OUT_PREFIX + '_evaluated_and_categorized_interactions.tsv.gz' 

## Create `BaitedDigestSet`

In a `BaitedDigestSet` object, interactions are grouped by bait.

In [3]:
# Create DiachromaticInteractionSet
d11_interaction_set = DiachromaticInteractionSet()
d11_interaction_set.parse_file(
    i_file = INTERACTION_FILE,
    verbose = True)
# Create BaitedDigestSet
baited_digest_set = BaitedDigestSet()
read_interactions_info_dict = baited_digest_set.ingest_interaction_set(d11_interaction_set, verbose=True)
print(baited_digest_set.get_ingest_interaction_set_info_report())

[INFO] Parsing Diachromatic interaction file ...
	[INFO] ../../DICer_interactions/CHC/JAV_MAC_M0_CHC_REPC_evaluated_and_categorized_interactions.tsv.gz
	[INFO] Parsed 1,000,000 interaction lines ...
	[INFO] Parsed 2,000,000 interaction lines ...
	[INFO] Parsed 3,000,000 interaction lines ...
	[INFO] Parsed 4,000,000 interaction lines ...
	[INFO] Parsed 5,000,000 interaction lines ...
	[INFO] Parsed 6,000,000 interaction lines ...
	[INFO] Parsed 7,000,000 interaction lines ...
	[INFO] Parsed 8,000,000 interaction lines ...
	[INFO] Parsed 9,000,000 interaction lines ...
	[INFO] Set size: 9,648,210
[INFO] ... done.
[INFO] Reading interactions and group them according to chromosomes and baited digests ...
	[INFO] Read 1,000,000 interactions ...
	[INFO] Read 2,000,000 interactions ...
	[INFO] Read 3,000,000 interactions ...
	[INFO] Read 4,000,000 interactions ...
	[INFO] Read 5,000,000 interactions ...
	[INFO] Read 6,000,000 interactions ...
	[INFO] Read 7,000,000 interactions ...
	[INFO] R

## Determine frequencies of configurations at individual baited digest

The following function determines the frequencies of interactions separately for interaction category, enrichment state and configuration for a given list of interactions. We will use this function to determine the frequencies at individual baited digests by passing a list with all interactions that are associated with a specific baited digest.

In [4]:
def get_htc_freq_dicts(interaction_list):
    
    # Initialize count dictionary returned by this function
    HTC_TAG_FREQ_DICT = dict()
    for i_cat in ['DIX', 'DI', 'UIR', 'UI', 'ALL']:
        HTC_TAG_FREQ_DICT[i_cat] = dict()
        for e_cat in ['NN', 'EE', 'NE', 'EN', 'ALL']:
            HTC_TAG_FREQ_DICT[i_cat][e_cat] = dict()
            for i_conf in ['0X', '1X', '2X', '3X', '01', '02', '03', '12', '13', '23']:
                HTC_TAG_FREQ_DICT[i_cat][e_cat][i_conf] = 0

    # Get frequencies of configurations
    for d11_inter in interaction_list:
        i_cat = d11_inter.get_category()
        e_cat = d11_inter.enrichment_status_tag_pair
        HTC_TAG_FREQ_DICT[i_cat][e_cat][d11_inter.get_ht_tag()] += 1
        HTC_TAG_FREQ_DICT['ALL'][e_cat][d11_inter.get_ht_tag()] += 1
        HTC_TAG_FREQ_DICT[i_cat]['ALL'][d11_inter.get_ht_tag()] += 1
        HTC_TAG_FREQ_DICT['ALL']['ALL'][d11_inter.get_ht_tag()] += 1
                    
    return HTC_TAG_FREQ_DICT
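To illustrate the counting scheme without loading a real interaction file, the sketch below applies the same logic to a handful of toy interactions. `StubInteraction` and `count_configurations` are hypothetical names introduced only for this example; the stub models just the three members the function above relies on (`get_category()`, `enrichment_status_tag_pair` and `get_ht_tag()`), and it is not part of `diachr`.

```python
# Hypothetical stand-in for a diachr interaction object (illustration only)
class StubInteraction:
    def __init__(self, category, enrichment, ht_tag):
        self._category = category
        self.enrichment_status_tag_pair = enrichment
        self._ht_tag = ht_tag

    def get_category(self):
        return self._category

    def get_ht_tag(self):
        return self._ht_tag


I_CATS = ['DIX', 'DI', 'UIR', 'UI', 'ALL']
E_CATS = ['NN', 'EE', 'NE', 'EN', 'ALL']
HT_TAGS = ['0X', '1X', '2X', '3X', '01', '02', '03', '12', '13', '23']


def count_configurations(interaction_list):
    # Nested layout: interaction category -> enrichment state -> configuration
    freq = {i: {e: {t: 0 for t in HT_TAGS} for e in E_CATS} for i in I_CATS}
    for inter in interaction_list:
        i_cat = inter.get_category()
        e_cat = inter.enrichment_status_tag_pair
        tag = inter.get_ht_tag()
        # Count each interaction in its own cell and in the three 'ALL' margins
        for i in (i_cat, 'ALL'):
            for e in (e_cat, 'ALL'):
                freq[i][e][tag] += 1
    return freq


toy = [StubInteraction('DI', 'NE', '03'),
       StubInteraction('DI', 'NE', '03'),
       StubInteraction('UI', 'EN', '13'),
       StubInteraction('UIR', 'NE', '12')]
freq = count_configurations(toy)
print(freq['DI']['NE']['03'])    # 2
print(freq['ALL']['NE']['12'])   # 1
print(freq['ALL']['ALL']['13'])  # 1
```

Because every interaction is also added to the `ALL` margins, the counts under `freq['ALL']['ALL']` sum to the length of the input list.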

## Calculate baited digest score

We use the following function to calculate a score for each baited digest. We first determine two sums of interaction counts: `sum_0313` for the configurations and enrichment states associated with BDC1 (`03` in `NE` and `13` in `EN` interactions) and `sum_1202` for those associated with BDC2 (`12` in `NE` and `02` in `EN` interactions). To avoid division by zero, we add a pseudocount of 1 to both sums. Then we divide the larger sum by the smaller sum, so the score is always at least 1. If the score is greater than or equal to a pre-specified threshold, we assign the baited digest to BDC1 or BDC2, depending on which of the two sums is larger. If the score is smaller than the threshold, we assign the baited digest to BDC0, the class for digests without imbalances in the configurations.

In [5]:
def get_bd_score_and_class(NE_dict, EN_dict, bds_threshold):
    
    # Get sum of interactions that are associated with BDC1 baited digest
    sum_0313 = NE_dict['03'] + EN_dict['13']

    # Get sum of interactions that are associated with BDC2 baited digest    
    sum_1202 = NE_dict['12'] + EN_dict['02']

    # Calculate imbalanced configuration score
    if sum_1202 < sum_0313:
        bd_score = (sum_0313 + 1)/(sum_1202 + 1)
        bd_class = 'BDC1'
    else:
        bd_score = (sum_1202 + 1)/(sum_0313 + 1)
        bd_class = 'BDC2'

    # No imbalances in the configurations
    if bd_score < bds_threshold:
        bd_class = 'BDC0'

    return bd_score, bd_class, sum_0313, sum_1202
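A few hand-picked counts make the behavior of the score easier to follow. The sketch below restates the scoring rule of the function above on the two sums directly; `score_and_class` and its `pseudo_count` parameter are hypothetical names used only for this illustration.

```python
def score_and_class(sum_0313, sum_1202, threshold, pseudo_count=1):
    # The larger sum goes into the numerator, so the score is never below 1
    if sum_1202 < sum_0313:
        score = (sum_0313 + pseudo_count) / (sum_1202 + pseudo_count)
        bd_class = 'BDC1'
    else:
        score = (sum_1202 + pseudo_count) / (sum_0313 + pseudo_count)
        bd_class = 'BDC2'
    # Scores below the threshold indicate no clear imbalance
    if score < threshold:
        bd_class = 'BDC0'
    return score, bd_class


print(score_and_class(59, 2, 20))   # (20.0, 'BDC1')
print(score_and_class(0, 41, 20))   # (42.0, 'BDC2')
print(score_and_class(30, 25, 20))  # score of about 1.19, 'BDC0'
print(score_and_class(0, 0, 20))    # (1.0, 'BDC0')
```

Note that a score exactly equal to the threshold still counts as imbalanced, and that a digest without any `NE` or `EN` interactions receives a score of 1 and therefore falls into BDC0 for any threshold above 1.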

## Classify baited digests

The following code iterates over all baited digests of a `BaitedDigestSet`. For each baited digest, the lists of `NE` and `EN` interactions are retrieved and combined, and the function `get_htc_freq_dicts` is used to determine the frequencies of configurations. From these frequencies, we calculate a score. Based on the score, we classify the digest as BDC0, BDC1 or BDC2. For each of the three classes, we create a BED file that can be loaded into the UCSC Genome Browser. In the browser, BDC1 digests are shown in blue, BDC2 digests in green, and BDC0 digests in gray. For BDC1 and BDC2 digests, the end of the digest that is predominantly sequenced is drawn thick.
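The BED lines written for the browser tracks use the nine-column layout (`chrom`, `chromStart`, `chromEnd`, `name`, `score`, `strand`, `thickStart`, `thickEnd`, `itemRgb`). As a minimal sketch, the helper below builds one such line in isolation. The colors and the rule for the thick region are taken from the classification cell of this notebook; the function name `make_bed_line` and the `thick_len` parameter are illustrative assumptions.

```python
# RGB colors per class, as used in the classification cell
BDC_COLORS = {'BDC0': '128,128,128', 'BDC1': '0,0,100', 'BDC2': '0,100,0'}


def make_bed_line(chrom, sta, end, bd_class, name, score, thick_len=100):
    if bd_class == 'BDC0':
        thick_sta, thick_end = sta, sta              # no thick region
    elif bd_class == 'BDC1':
        thick_sta, thick_end = sta, sta + thick_len  # thick region at the digest start
    elif bd_class == 'BDC2':
        thick_sta, thick_end = end - thick_len, end  # thick region at the digest end
    else:
        raise ValueError('Invalid class ID: ' + str(bd_class))
    fields = [chrom, sta, end, name, score, '.',
              thick_sta, thick_end, BDC_COLORS[bd_class]]
    return '\t'.join(str(f) for f in fields)


# Name field: class|score:sum_1202:sum_0313:sum_total
line = make_bed_line('chr1', 1000, 5000, 'BDC1', 'BDC1|25.00:3:99:102', '25.00')
print(line.split('\t'))
```

Raising an exception for an unknown class, rather than printing a message, is a design choice of this sketch; it stops before any malformed line can be written.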

In [6]:
# If True, details are reported for each baited digest
verbose = False

# Threshold for baited digest score
bds_threshold = 20

# Interaction categories taken into account
i_cat = 'ALL'

# Directory for output
OUT_DIR = NOTEBOOK_RESULTS_DIR + '/bdc_lists'
%mkdir -p $OUT_DIR

# The coordinates of baited digests of classes BDC0, BDC1 and BDC2 are written to separate BED files
BDC_FH2 = dict()
BDC_NUM_T = dict()
for bd_class in ['BDC0','BDC1','BDC2']:
    BDC_FH2[bd_class] = open(OUT_DIR + '/' + OUT_PREFIX + '_' + bd_class.lower() + '.bedx', 'w')
    BDC_FH2[bd_class].write("track name=\"" + OUT_PREFIX + "_" + bd_class.lower() + "\" description=\"" + OUT_PREFIX + " " + bd_class + "\" itemRgb=\"On\"" + '\n')
    BDC_NUM_T[bd_class] = 0

# Iterate over all chromosomes
for chrom in baited_digest_set._baited_digest_dict.keys():
    
    print('Chromosome: ' + chrom)
    
    # Numbers of baited digests of classes BDC0, BDC1 and BDC2 on this chromosome
    BDC_NUM_C = dict()
    for i in ['BDC0','BDC1','BDC2']:
        BDC_NUM_C[i] = 0
    
    # Iterate over all baited digests on this chromosome   
    for baited_digest_key, baited_digest in baited_digest_set._baited_digest_dict[chrom].items():
        
        # Prepare list of NE and EN interactions that belong to this baited digest 
        interaction_list = baited_digest.interactions[i_cat]['NE'] + baited_digest.interactions[i_cat]['EN']

        # Get frequencies of interactions
        HTC_TAG_FREQ_DICT = get_htc_freq_dicts(interaction_list)
        
        # Calculate score and assign to a class       
        bd_score, bd_class, sum_0313, sum_1202 = get_bd_score_and_class(
            HTC_TAG_FREQ_DICT[i_cat]['NE'],
            HTC_TAG_FREQ_DICT[i_cat]['EN'],
            bds_threshold)
        
        # Count baited digests of different classes
        BDC_NUM_C[bd_class] += 1
        BDC_NUM_T[bd_class] += 1
        
        # Get coordinates from key for output
        chom, sta, end = baited_digest_key.split('\t')
        
        # Format score for output
        bd_score_formatted = "{:.2f}".format(bd_score)
        
        # Get total number of interactions at this bait for output
        sum_total = len(interaction_list)
        
        # Write coordinates and additional information to corresponding BED files        
        name = bd_class + '|' + bd_score_formatted + ':' + str(sum_1202) + ':' + str(sum_0313) + ':' + str(sum_total)
        BED_line = chom + '\t' + sta + '\t' + end + '\t' + name + '\t' + bd_score_formatted + '\t' + '.'
        if bd_class == 'BDC0':
            BED_line += '\t' + sta + '\t' + sta + '\t' + '128,128,128' + '\n'
        elif bd_class == 'BDC1':
            BED_line += '\t' + sta + '\t' + str(int(sta)+100) + '\t' + '0,0,100' + '\n'
        elif bd_class == 'BDC2':
            BED_line += '\t' + str(int(end)-100) + '\t' + end + '\t' + '0,100,0' + '\n'
        else:
            print('[ERROR] Invalid class ID: ' + bd_class + '!')
        BDC_FH2[bd_class].write(BED_line)
    
        # Output details about each individual baited digest            
        if verbose:
            print('-------------------------')
            print(baited_digest_key)
            print('sum_1202: ' + str(sum_1202))
            print('sum_0313: ' + str(sum_0313))
            print('sum_total: ' + str(sum_total))
            print('bd_class: ' + bd_class)            
            print('bd_score: ' + bd_score_formatted)
            print()
            for i_conf in ['0X', '1X', '2X', '3X', '01', '02', '03', '12', '13', '23']:
                for e_cat in ['NE','EN']:
                    print(i_cat + '-' + e_cat + '-' + i_conf + ': ' + str(HTC_TAG_FREQ_DICT[i_cat][e_cat][i_conf]))
                print()
                
    print('\tNumber of baited digests: ' + "{:,}".format(sum(BDC_NUM_C.values())))
    for bd_class in ['BDC0','BDC1','BDC2']:
        print('\t\t' + bd_class  + ": {:,}".format(BDC_NUM_C[bd_class]))

print()
print('Total number of baited digests: ' + "{:,}".format(sum(BDC_NUM_T.values())))
for bd_class in ['BDC0','BDC1','BDC2']:
    print('\t' + bd_class  + ": {:,}".format(BDC_NUM_T[bd_class]))
    BDC_FH2[bd_class].close()

Chromosome: chr2
	Number of baited digests: 1,598
		BDC0: 592
		BDC1: 446
		BDC2: 560
Chromosome: chr6
	Number of baited digests: 1,211
		BDC0: 457
		BDC1: 342
		BDC2: 412
Chromosome: chr9
	Number of baited digests: 856
		BDC0: 308
		BDC1: 258
		BDC2: 290
Chromosome: chr10
	Number of baited digests: 928
		BDC0: 331
		BDC1: 272
		BDC2: 325
Chromosome: chr12
	Number of baited digests: 1,196
		BDC0: 433
		BDC1: 335
		BDC2: 428
Chromosome: chr7
	Number of baited digests: 1,008
		BDC0: 375
		BDC1: 276
		BDC2: 357
Chromosome: chr3
	Number of baited digests: 1,330
		BDC0: 518
		BDC1: 344
		BDC2: 468
Chromosome: chrX
	Number of baited digests: 844
		BDC0: 345
		BDC1: 224
		BDC2: 275
Chromosome: chr4
	Number of baited digests: 940
		BDC0: 347
		BDC1: 255
		BDC2: 338
Chromosome: chr1
	Number of baited digests: 2,238
		BDC0: 781
		BDC1: 654
		BDC2: 803
Chromosome: chr18
	Number of baited digests: 351
		BDC0: 126
		BDC1: 97
		BDC2: 128
Chromosome: chr5
	Number of baited digests: 1,111
		BDC0: 459
