In [1]:
import os
import pandas as pd

# Create directory for output files generated in this notebook
NOTEBOOK_RESULTS_DIR = 'results/usage_of_pooler'
os.makedirs(NOTEBOOK_RESULTS_DIR, exist_ok=True)

# `pooler.py`: Pooling read pair counts of interactions from different files

[Diachromatic](https://diachromatic.readthedocs.io/en/latest/) is a Java application that implements a preprocessing and quality control pipeline for Hi-C and capture Hi-C (CHi-C) data. The Diachromatic pipeline transforms the data contained in FASTQ files from a Hi-C or CHi-C experiment into an interaction file that records the chromosomal positions and enrichment states of the two interacting restriction fragments together with the counts of supporting mapped paired-end reads. For this work, we extended the output of Diachromatic so that four read pair counts are reported for each interaction, one for each relative orientation of mapped paired-end reads, instead of a single count. We use Diachromatic to separately process FASTQ input files that represent the same experiment, and then use the Python script `pooler.py` to combine the resulting interaction files for downstream analysis.

In this notebook, we describe how to use `pooler.py` to combine paired-end read counts of interactions from different Diachromatic interaction files. Each row in a Diachromatic interaction file describes one interaction: for each of the two restriction fragments, it contains the coordinates followed by the enrichment status (`E` for enriched and `N` for non-enriched). The last column contains the four paired-end read counts separated by colons.

In [2]:
col_names = ['chrA','staA','endA','enrA','chrB','staB','endB','enrB','rp_counts']
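
To make the row layout concrete, the following is a minimal sketch of parsing one interaction line with plain Python. The helper `parse_interaction_line` is hypothetical and written only for illustration; it is not part of `pooler.py` or Diachromatic, and it assumes the nine tab-separated columns described above.

```python
col_names = ['chrA', 'staA', 'endA', 'enrA', 'chrB', 'staB', 'endB', 'enrB', 'rp_counts']

def parse_interaction_line(line):
    """Parse one tab-separated interaction line into a dict (illustrative helper)."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) != len(col_names):
        raise ValueError(f"expected {len(col_names)} fields, got {len(fields)}")
    record = dict(zip(col_names, fields))
    # Fragment coordinates are integers
    for key in ('staA', 'endA', 'staB', 'endB'):
        record[key] = int(record[key])
    # The last column holds four colon-separated read pair counts
    record['rp_counts'] = tuple(int(c) for c in record['rp_counts'].split(':'))
    if len(record['rp_counts']) != 4:
        raise ValueError("expected four colon-separated read pair counts")
    return record

example = parse_interaction_line(
    "chr2\t95043367\t95054745\tE\tchr2\t121918565\t121924527\tN\t5:2:8:0\n"
)
print(example['enrA'], example['enrB'], example['rp_counts'])
```

Storing the counts as a tuple of integers keeps the four orientation-specific values separate, which is what pooling needs.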

Two interactions are considered equal if they have the same restriction fragment coordinates. Interactions with the same fragment coordinates are pooled by summing the read pair counts separately for the four relative orientations of mapped paired-end reads. For instance, the two interactions:


```
chr2	95043367	95054745	E	chr2	121918565	121924527	N	5:2:8:0
chr2	95043367	95054745	E	chr2	121918565	121924527	N	4:1:7:2
```

will be combined to:

```
chr2	95043367	95054745	E	chr2	121918565	121924527	N	9:3:15:2
```
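
The element-wise summation in this example can be sketched in a few lines of Python. The function `pool_counts` is a hypothetical name used for illustration only and is not taken from `pooler.py`.

```python
def pool_counts(*count_strings):
    """Sum colon-separated read pair counts position by position (illustrative helper)."""
    count_lists = [[int(c) for c in s.split(':')] for s in count_strings]
    if any(len(counts) != 4 for counts in count_lists):
        raise ValueError("each entry must contain exactly four counts")
    # zip(*...) groups the counts that belong to the same relative orientation
    pooled = [sum(column) for column in zip(*count_lists)]
    return ':'.join(str(c) for c in pooled)

pooled = pool_counts('5:2:8:0', '4:1:7:2')
print(pooled)  # 9:3:15:2
```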

## Example files

We have prepared four small Diachromatic interaction files for testing and demonstration purposes. The first file contains an interaction that is present in all four files, but with different read pair counts.

In [3]:
i_file = "../../tests/data/test_01/diachromatic_interaction_file_r1.tsv.gz"
pd.read_csv(i_file, compression='gzip',  sep='\t', names=col_names)

Unnamed: 0,chrA,staA,endA,enrA,chrB,staB,endB,enrB,rp_counts
0,chr1,46297999,46305684,E,chr1,51777391,51781717,N,1:1:1:0


The second file contains a second interaction that is also present in the third and fourth files.

In [4]:
i_file = "../../tests/data/test_01/diachromatic_interaction_file_r2.tsv.gz"
pd.read_csv(i_file, compression='gzip',  sep='\t', names=col_names)

Unnamed: 0,chrA,staA,endA,enrA,chrB,staB,endB,enrB,rp_counts
0,chr1,46297999,46305684,E,chr1,51777391,51781717,N,2:0:1:0
1,chr17,72411026,72411616,N,chr17,72712662,72724357,N,3:0:1:1


The third file contains a third interaction that is also present in the fourth file.

In [5]:
i_file = "../../tests/data/test_01/diachromatic_interaction_file_r3.tsv.gz"
pd.read_csv(i_file, compression='gzip',  sep='\t', names=col_names)

Unnamed: 0,chrA,staA,endA,enrA,chrB,staB,endB,enrB,rp_counts
0,chr1,46297999,46305684,E,chr1,51777391,51781717,N,0:2:1:0
1,chr17,72411026,72411616,N,chr17,72712662,72724357,N,3:0:0:2
2,chr7,69513952,69514636,N,chr7,87057837,87061499,E,3:1:1:2


The fourth file contains a fourth interaction that is not present in any other file.

In [6]:
i_file = "../../tests/data/test_01/diachromatic_interaction_file_r4.tsv.gz"
pd.read_csv(i_file, compression='gzip',  sep='\t', names=col_names)

Unnamed: 0,chrA,staA,endA,enrA,chrB,staB,endB,enrB,rp_counts
0,chr1,46297999,46305684,E,chr1,51777391,51781717,N,1:1:1:0
1,chr17,72411026,72411616,N,chr17,72712662,72724357,N,3:0:2:0
2,chr7,69513952,69514636,N,chr7,87057837,87061499,E,2:2:2:1
3,chr11,47259263,47272706,N,chr11,91641153,91642657,E,3:2:1:3


## Pooling

We use the `pooler.py` script to pool the interactions from the four files.

In [7]:
%run ../../pooler.py \
--out-prefix $NOTEBOOK_RESULTS_DIR/DEMO \
--required-replicates 2 \
--interaction-files-path ../../tests/data/test_01/

[INFO] Input parameters
	[INFO] --out-prefix: results/usage_of_pooler/DEMO
	[INFO] --interaction-files-path: ../../tests/data/test_01/
	[INFO] --required-replicates: 2

[INFO] Parsing Diachromatic interaction file ...
	[INFO] ../../tests/data/test_01/diachromatic_interaction_file_r1.tsv.gz
	[INFO] Set size: 1
[INFO] ... done.
[INFO] Parsing Diachromatic interaction file ...
	[INFO] ../../tests/data/test_01/diachromatic_interaction_file_r3.tsv.gz
	[INFO] Set size: 3
[INFO] ... done.
[INFO] Parsing Diachromatic interaction file ...
	[INFO] ../../tests/data/test_01/diachromatic_interaction_file_r4.tsv.gz
	[INFO] Set size: 4
[INFO] ... done.
[INFO] Parsing Diachromatic interaction file ...
	[INFO] ../../tests/data/test_01/diachromatic_interaction_file_r2.tsv.gz
	[INFO] Set size: 4
[INFO] ... done.

[INFO] Writing Diachromatic interaction file ...
	[INFO] Required replicates: 2
	[INFO] Target file: results/usage_of_pooler/DEMO_at_least_2_combined_interactions.tsv.gz
[INFO] ... done.


## Result of pooling

Here is the content of the resulting Diachromatic interaction file:

In [8]:
i_file = NOTEBOOK_RESULTS_DIR + "/DEMO_at_least_2_combined_interactions.tsv.gz"
pd.read_csv(i_file, compression='gzip',  sep='\t', names=col_names)

Unnamed: 0,chrA,staA,endA,enrA,chrB,staB,endB,enrB,rp_counts
0,chr1,46297999,46305684,E,chr1,51777391,51781717,N,4:4:4:0
1,chr17,72411026,72411616,N,chr17,72712662,72724357,N,9:0:3:3
2,chr7,69513952,69514636,N,chr7,87057837,87061499,E,5:3:3:3


The four files we pooled contained four different interactions, one each on chromosomes 1, 7, 11, and 17. The interaction on chromosome 11 is filtered out of the pooled results because it occurs in only one input file and we requested that only interactions found in at least two input files be retained (`--required-replicates 2`).
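
The combination of pooling and replicate filtering can be reproduced on the four example files with a short, self-contained sketch. This is not the implementation used in `pooler.py`; the function `pool` and the dictionary layout are assumptions made for illustration. Interactions are keyed by their fragment coordinates, and the counts are copied from the tables shown above.

```python
# Fragment coordinates (chrA, staA, endA, chrB, staB, endB) of the four interactions
CHR1 = ('chr1', 46297999, 46305684, 'chr1', 51777391, 51781717)
CHR17 = ('chr17', 72411026, 72411616, 'chr17', 72712662, 72724357)
CHR7 = ('chr7', 69513952, 69514636, 'chr7', 87057837, 87061499)
CHR11 = ('chr11', 47259263, 47272706, 'chr11', 91641153, 91642657)

# One dict per example file (r1 to r4): coordinates -> four read pair counts
files = [
    {CHR1: (1, 1, 1, 0)},
    {CHR1: (2, 0, 1, 0), CHR17: (3, 0, 1, 1)},
    {CHR1: (0, 2, 1, 0), CHR17: (3, 0, 0, 2), CHR7: (3, 1, 1, 2)},
    {CHR1: (1, 1, 1, 0), CHR17: (3, 0, 2, 0), CHR7: (2, 2, 2, 1), CHR11: (3, 2, 1, 3)},
]

def pool(files, required_replicates):
    """Sum counts per interaction and keep those seen in enough files (illustrative)."""
    pooled = {}  # coordinates -> [summed counts, number of files containing it]
    for interactions in files:
        for key, counts in interactions.items():
            if key not in pooled:
                pooled[key] = [list(counts), 1]
            else:
                entry = pooled[key]
                entry[0] = [a + b for a, b in zip(entry[0], counts)]
                entry[1] += 1
    return {
        key: tuple(counts)
        for key, (counts, n_files) in pooled.items()
        if n_files >= required_replicates
    }

result = pool(files, required_replicates=2)
for key, counts in result.items():
    print(key[0], ':'.join(str(c) for c in counts))
```

With `required_replicates=2`, the sketch keeps the interactions on chromosomes 1, 17, and 7 with the same pooled counts as in the table above and drops the chromosome 11 interaction, which occurs in one file only.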

In addition to the interaction file, a file with summary statistics is created.

In [9]:
with open(f"{NOTEBOOK_RESULTS_DIR}/DEMO_at_least_2_combined_summary.txt") as f:
    for line in f:
        print(line.rstrip())

[INFO] Input parameters
	[INFO] --out-prefix: results/usage_of_pooler/DEMO
	[INFO] --interaction-files-path: ../../tests/data/test_01/
	[INFO] --required-replicates: 2

[INFO] Report on reading files:
	[INFO] Read interaction data from 4 files:
		[INFO] 1 interactions from:
			[INFO] ../../tests/data/test_01/diachromatic_interaction_file_r1.tsv.gz
			[INFO] Minimum number of read pairs: 0
			[INFO] Skipped because less than 0 read pairs: 0
			[INFO] Minimum interaction distance: 0
			[INFO] Skipped because shorter than 0 bp: 0
			[INFO] Added to set: 1
			[INFO] Set size: 1
		[INFO] 3 interactions from:
			[INFO] ../../tests/data/test_01/diachromatic_interaction_file_r3.tsv.gz
			[INFO] Minimum number of read pairs: 0
			[INFO] Skipped because less than 0 read pairs: 0
			[INFO] Minimum interaction distance: 0
			[INFO] Skipped because shorter than 0 bp: 0
			[INFO] Added to set: 3
			[INFO] Set size: 3
		[INFO] 4 interactions from:
			[INFO] ../../tests/data/test_01/diachromatic_inter