<H1> `pooler.py`: Pooling read pair counts of interactions from different files </H1>
<p>
<a href="https://diachromatic.readthedocs.io/en/latest/" target="_blank">Diachromatic</a> is a Java application that implements a capture Hi-C preprocessing pipeline followed by analysis of differential chromatin interactions (“loopings”). The Diachromatic pipeline transforms the data in a FASTQ file from a Hi-C or capture Hi-C experiment into an interaction file that records the chromosomal positions of the two interacting restriction fragments together with some additional information.
</p>
<p>For some analyses presented in "Using paired-end read orientations to assess and mitigate technical biases in capture Hi-C", we use Diachromatic to process multiple input files (e.g., representing the same experiment or cell type) separately, and then combine the interaction files for downstream analysis.</p>
<p>In this notebook, we describe how to use the Python script `pooler.py` to combine the paired-end read counts of interactions from different Diachromatic interaction files. Two interactions are considered equal if they have the same restriction fragment coordinates. Interactions with the same fragment coordinates are pooled by summing the read pair counts separately for each of the four relative paired-end read orientations. For instance, the two interactions:
 </p>


<pre>
chr2	95043367	95054745	E	chr2	121918565	121924527	N	5:2:8:0
chr2	95043367	95054745	E	chr2	121918565	121924527	N	4:1:7:2
</pre>

are combined into:

<pre>
chr2	95043367	95054745	E	chr2	121918565	121924527	N	9:3:15:2
</pre>
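
<p>The per-orientation summation can be sketched in a few lines of Python. This is only a minimal illustration of the arithmetic, not the code used by <tt>pooler.py</tt>; the helper name <tt>sum_counts</tt> is hypothetical.</p>

```python
def sum_counts(counts_a, counts_b):
    """Add two colon-separated read pair count strings element-wise."""
    a = [int(x) for x in counts_a.split(":")]
    b = [int(x) for x in counts_b.split(":")]
    if len(a) != len(b):
        raise ValueError("Count strings must have the same number of fields")
    # Each position corresponds to one relative read pair orientation
    return ":".join(str(x + y) for x, y in zip(a, b))


print(sum_counts("5:2:8:0", "4:1:7:2"))  # prints 9:3:15:2
```
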

The format of the columns is as follows.
<ul>
    <li>chromosome (fragment 1)</li>
    <li>start pos (fragment 1)</li>
    <li>end pos (fragment 1)</li>
    <li>baiting status "E" for "enriched", "N" for "not enriched" (fragment 1)</li>
    <li>chromosome (fragment 2)</li>
    <li>start pos (fragment 2)</li>
    <li>end pos (fragment 2)</li>
    <li>baiting status "E" for "enriched", "N" for "not enriched" (fragment 2)</li>
    <li>read-pair counts</li>
</ul>
<p>The read pair counts column (e.g., 9:3:15:2) shows the counts of configurations 0:1:2:3 as defined in Figure 1B of the main manuscript.</p>
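
<p>To make the column layout concrete, the following sketch splits one interaction line into named fields. It assumes tab-separated columns in the order listed above; the names <tt>Interaction</tt> and <tt>parse_interaction_line</tt> are hypothetical and are not part of <tt>pooler.py</tt>.</p>

```python
from typing import NamedTuple, Tuple


class Interaction(NamedTuple):
    chrom_1: str
    start_1: int
    end_1: int
    status_1: str  # "E" (enriched) or "N" (not enriched)
    chrom_2: str
    start_2: int
    end_2: int
    status_2: str  # "E" (enriched) or "N" (not enriched)
    counts: Tuple[int, ...]  # read pair counts for configurations 0:1:2:3


def parse_interaction_line(line):
    """Parse one tab-separated interaction line (assumed nine columns)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"Expected 9 columns, found {len(fields)}")
    counts = tuple(int(c) for c in fields[8].split(":"))
    return Interaction(fields[0], int(fields[1]), int(fields[2]), fields[3],
                       fields[4], int(fields[5]), int(fields[6]), fields[7],
                       counts)


example = parse_interaction_line(
    "chr2\t95043367\t95054745\tE\tchr2\t121918565\t121924527\tN\t9:3:15:2")
print(example.chrom_1, example.status_1, example.counts)
```
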

<h2>Example</h2>
<p>We present a small example to demonstrate how to use the <tt>pooler.py</tt> script to pool data from separate files. The script searches for lines that represent the same interaction and combines them by summing each of the four counts across the original files.</p>

In [1]:
import pandas as pd
import os
import gzip
import shutil

In [2]:
NOTEBOOK_RESULTS_DIR = 'results/usage_of_pooler'
if not os.path.exists(NOTEBOOK_RESULTS_DIR):
    os.makedirs(NOTEBOOK_RESULTS_DIR)

<p>Because Diachromatic reports all interactions with at least one supporting read pair, pooling interaction files can be extremely memory intensive. We have prepared four small Diachromatic interaction files for testing and demonstration purposes.</p>
<p>The files are provided in gzipped form and so we use a convenience function to unzip them.</p>

In [3]:
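def convert_gzip_file(gzfile_name):
    """Unzip a gzipped interaction file next to the original and print its lines."""
    # Strip only the trailing ".gz"; str.replace would also alter ".gz" elsewhere in the path
    if not gzfile_name.endswith(".gz"):
        raise ValueError(f"Expected a .gz file: {gzfile_name}")
    file_name = gzfile_name[:-len(".gz")]
    # Write the decompressed bytes to a file without the ".gz" suffix
    with gzip.open(gzfile_name, 'rb') as f_in:
        with open(file_name, 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
    # Print the interactions contained in the unzipped file
    print()
    print(f"{gzfile_name}: interactions")
    with open(file_name) as f:
        for line in f:
            print(line.rstrip())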
<p>The first file contains one interaction, which is present in all four files, with read pair counts that differ between some of the files. The second file adds a second interaction that is also present in the third and fourth files. The third file adds a third interaction that is also present in the fourth file. The fourth file adds a fourth interaction that is not present in any other file.</p>

In [4]:
#$!gunzip -c ../../tests/data/test_01/diachromatic_interaction_file_r1.tsv.gz
convert_gzip_file("../../tests/data/test_01/diachromatic_interaction_file_r1.tsv.gz")
convert_gzip_file("../../tests/data/test_01/diachromatic_interaction_file_r2.tsv.gz")
convert_gzip_file("../../tests/data/test_01/diachromatic_interaction_file_r3.tsv.gz")
convert_gzip_file("../../tests/data/test_01/diachromatic_interaction_file_r4.tsv.gz")


../../tests/data/test_01/diachromatic_interaction_file_r1.tsv.gz: interactions
chr1	46297999	46305684	E	chr1	51777391	51781717	N	1:1:1:0

../../tests/data/test_01/diachromatic_interaction_file_r2.tsv.gz: interactions
chr1	46297999	46305684	E	chr1	51777391	51781717	N	2:0:1:0
chr17	72411026	72411616	N	chr17	72712662	72724357	N	3:0:1:1

../../tests/data/test_01/diachromatic_interaction_file_r3.tsv.gz: interactions
chr1	46297999	46305684	E	chr1	51777391	51781717	N	0:2:1:0
chr17	72411026	72411616	N	chr17	72712662	72724357	N	3:0:0:2
chr7	69513952	69514636	N	chr7	87057837	87061499	E	3:1:1:2

../../tests/data/test_01/diachromatic_interaction_file_r4.tsv.gz: interactions
chr1	46297999	46305684	E	chr1	51777391	51781717	N	1:1:1:0
chr17	72411026	72411616	N	chr17	72712662	72724357	N	3:0:2:0
chr7	69513952	69514636	N	chr7	87057837	87061499	E	2:2:2:1
chr11	47259263	47272706	N	chr11	91641153	91642657	E	3:2:1:3


<h2>Pooling</h2>
<p>We use the `pooler.py` script to pool the interactions from the four files.</p>

In [5]:
%run ../../pooler.py \
--out-prefix $NOTEBOOK_RESULTS_DIR/DEMO \
--required-replicates 2 \
--interaction-files-path ../../tests/data/test_01/

[INFO] Input parameters
	[INFO] --out-prefix: results/usage_of_pooler/DEMO
	[INFO] --interaction-files-path: ../../tests/data/test_01/
	[INFO] --required-replicates: 2

[INFO] Parsing Diachromatic interaction file ...
	[INFO] ../../tests/data/test_01/diachromatic_interaction_file_r2.tsv.gz
	[INFO] Set size: 2
[INFO] ... done.
[INFO] Parsing Diachromatic interaction file ...
	[INFO] ../../tests/data/test_01/diachromatic_interaction_file_r4.tsv.gz
	[INFO] Set size: 4
[INFO] ... done.
[INFO] Parsing Diachromatic interaction file ...
	[INFO] ../../tests/data/test_01/diachromatic_interaction_file_r1.tsv.gz
	[INFO] Set size: 4
[INFO] ... done.
[INFO] Parsing Diachromatic interaction file ...
	[INFO] ../../tests/data/test_01/diachromatic_interaction_file_r3.tsv.gz
	[INFO] Set size: 4
[INFO] ... done.

[INFO] Writing Diachromatic interaction file ...
	[INFO] Required replicates: 2
	[INFO] Target file: results/usage_of_pooler/DEMO_at_least_2_combined_interactions.tsv.gz
[INFO] ... done.


<h2>Result of pooling</h2>
<p>The four files we pooled contained four different interactions, one each on chromosomes 1, 7, 11, and 17. The interaction on chromosome 11 is filtered out of the pooled results because we chose to retain only interactions found in at least two input files (<tt>--required-replicates 2</tt>). </p>

In [16]:
import pandas as pd
infile = f"{NOTEBOOK_RESULTS_DIR}/DEMO_at_least_2_combined_interactions.tsv.gz"
df = pd.read_csv(infile, compression='gzip', header=None, sep='\t')
print(df.to_string(index=False, header=False))

 chr1 46297999 46305684 E  chr1 51777391 51781717 N 4:4:4:0
chr17 72411026 72411616 N chr17 72712662 72724357 N 9:0:3:3
 chr7 69513952 69514636 N  chr7 87057837 87061499 E 5:3:3:3


The interaction on chromosome 11, which is present only in the fourth file, is missing from the output because we required each interaction to occur in at least two replicates (`--required-replicates 2`). For the remaining interactions, the read pair counts from the different files were summed separately for each of the four orientations.
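
<p>The pooling and filtering logic described here can be reproduced on the four demo files with a short in-memory sketch. It illustrates the idea and is not the implementation in <tt>pooler.py</tt>. The function name <tt>pool_interactions</tt> is hypothetical, and the sketch assumes that the fragment coordinates alone identify an interaction and that the baiting status is taken from the first occurrence.</p>

```python
def pool_interactions(files, required_replicates):
    """Pool interaction lines from several files (each file is a list of lines)."""
    # fragment coordinates -> [first eight columns, summed counts, number of files]
    pooled = {}
    for lines in files:
        seen_in_file = set()
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            key = (fields[0], fields[1], fields[2], fields[4], fields[5], fields[6])
            counts = [int(c) for c in fields[8].split(":")]
            if key not in pooled:
                pooled[key] = [fields[:8], counts, 0]
            else:
                pooled[key][1] = [a + b for a, b in zip(pooled[key][1], counts)]
            # Count each file at most once per interaction
            if key not in seen_in_file:
                seen_in_file.add(key)
                pooled[key][2] += 1
    result = []
    for columns, counts, n_files in pooled.values():
        if n_files >= required_replicates:
            result.append("\t".join(columns + [":".join(map(str, counts))]))
    return result


# The first eight columns of the four interactions in the demo files
CHR1 = "chr1\t46297999\t46305684\tE\tchr1\t51777391\t51781717\tN\t"
CHR17 = "chr17\t72411026\t72411616\tN\tchr17\t72712662\t72724357\tN\t"
CHR7 = "chr7\t69513952\t69514636\tN\tchr7\t87057837\t87061499\tE\t"
CHR11 = "chr11\t47259263\t47272706\tN\tchr11\t91641153\t91642657\tE\t"

demo_files = [
    [CHR1 + "1:1:1:0"],
    [CHR1 + "2:0:1:0", CHR17 + "3:0:1:1"],
    [CHR1 + "0:2:1:0", CHR17 + "3:0:0:2", CHR7 + "3:1:1:2"],
    [CHR1 + "1:1:1:0", CHR17 + "3:0:2:0", CHR7 + "2:2:2:1", CHR11 + "3:2:1:3"],
]

pooled_lines = pool_interactions(demo_files, required_replicates=2)
for pooled_line in pooled_lines:
    print(pooled_line)
```

<p>With <tt>required_replicates=2</tt>, the sketch yields the same three pooled interactions as shown above; with <tt>required_replicates=1</tt>, the chromosome 11 interaction would be kept as well.</p>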

In addition to the interaction file, a file with summary statistics is created.

In [17]:
with open(f"{NOTEBOOK_RESULTS_DIR}/DEMO_at_least_2_combined_summary.txt") as f:
    for line in f:
        print(line.rstrip())

[INFO] Input parameters
	[INFO] --out-prefix: results/usage_of_pooler/DEMO
	[INFO] --interaction-files-path: ../../tests/data/test_01/
	[INFO] --required-replicates: 2

[INFO] Report on reading files:
	[INFO] Read interaction data from 4 files:
		[INFO] 2 interactions from:
			[INFO] ../../tests/data/test_01/diachromatic_interaction_file_r2.tsv.gz
			[INFO] Minimum number of read pairs: 0
			[INFO] Skipped because less than 0 read pairs: 0
			[INFO] Minimum interaction distance: 0
			[INFO] Skipped because shorter than 0 bp: 0
			[INFO] Added to set: 2
			[INFO] Set size: 2
		[INFO] 4 interactions from:
			[INFO] ../../tests/data/test_01/diachromatic_interaction_file_r4.tsv.gz
			[INFO] Minimum number of read pairs: 0
			[INFO] Skipped because less than 0 read pairs: 0
			[INFO] Minimum interaction distance: 0
			[INFO] Skipped because shorter than 0 bp: 0
			[INFO] Added to set: 4
			[INFO] Set size: 4
		[INFO] 1 interactions from:
			[INFO] ../../tests/data/test_01/diachromatic_inter