# Combine interactions from different Diachromatic interaction files

In this notebook, we describe how to use the modules of ``diachrscripts`` to combine interactions from different Diachromatic interaction files. Two interactions are considered equal if they have the same coordinates. When equal interactions are combined, their simple and twisted read pair counts are summed up separately. For instance, the two interactions:

```
chr2	95043367	95054745	E	chr2	121918565	121924527	N	5:2
chr2	95043367	95054745	E	chr2	121918565	121924527	N	4:1
```

will be combined to:

```
chr2	95043367	95054745	E	chr2	121918565	121924527	N	9:3
```

It does not matter whether equal interactions occur in the same file or in different files. In this notebook, we assume that each interaction occurs only once within an individual file. This condition can be met by first applying the combine procedure to individual files and then to different files.
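
The combination rule described above can be illustrated with a minimal sketch in plain Python. It does not use ``diachrscripts``; the function name ``combine_lines`` is hypothetical, and we assume that the enrichment states of the first occurrence of an interaction are kept.

```python
def combine_lines(lines):
    """Combine lines in Diachromatic interaction format (hypothetical helper)."""
    combined = {}
    for line in lines:
        f = line.rstrip("\n").split("\t")
        # Two interactions are equal if the coordinates of both digests match
        key = (f[0], int(f[1]), int(f[2]), f[4], int(f[5]), int(f[6]))
        simple, twisted = (int(x) for x in f[8].split(":"))
        if key not in combined:
            # Assumption: keep the first eight columns of the first occurrence
            combined[key] = {"fields": f[:8], "simple": 0, "twisted": 0}
        # Simple and twisted read pair counts are summed up separately
        combined[key]["simple"] += simple
        combined[key]["twisted"] += twisted
    return [
        "\t".join(e["fields"]) + "\t{}:{}".format(e["simple"], e["twisted"])
        for e in combined.values()
    ]


lines = [
    "chr2\t95043367\t95054745\tE\tchr2\t121918565\t121924527\tN\t5:2",
    "chr2\t95043367\t95054745\tE\tchr2\t121918565\t121924527\tN\t4:1",
]
combined = combine_lines(lines)
print(combined[0])
```

For the two example lines, the sketch yields a single line with the read pair counts ``9:3``.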

## Setting up the notebook

In [1]:
import sys
import os
import pandas
pandas.set_option('max_colwidth', 400)
sys.path.append("..")
from diachr import DiachromaticInteractionSet

In this notebook, we use only class ``DiachromaticInteractionSet`` and the following functions from this class:
- ``parse_file``: Read interactions from file
- ``get_read_file_info_dict``: Get information about files that have already been read
- ``get_read_file_info_report``: Get information about files that have already been read in form of a formatted string
- ``write_diachromatic_interaction_file``: Write interactions to a file

## Test files

In this notebook, four Diachromatic interaction test files are used. We prepared these files by creating one file with four interactions and then removing one interaction at a time to create the other three files. Execute the following four cells to see the content of these files.

In [2]:
!gunzip -c ../tests/data/test_01/diachromatic_interaction_file_r1.tsv.gz

chr1	46297999	46305684	E	chr1	51777391	51781717	N	2:1


In [3]:
!gunzip -c ../tests/data/test_01/diachromatic_interaction_file_r2.tsv.gz

chr1	46297999	46305684	E	chr1	51777391	51781717	N	2:1
chr17	72411026	72411616	N	chr17	72712662	72724357	N	3:2


In [4]:
!gunzip -c ../tests/data/test_01/diachromatic_interaction_file_r3.tsv.gz

chr1	46297999	46305684	E	chr1	51777391	51781717	N	2:1
chr17	72411026	72411616	N	chr17	72712662	72724357	N	3:2
chr7	69513952	69514636	N	chr7	87057837	87061499	E	4:3


In [5]:
!gunzip -c ../tests/data/test_01/diachromatic_interaction_file_r4.tsv.gz

chr1	46297999	46305684	E	chr1	51777391	51781717	N	2:1
chr17	72411026	72411616	N	chr17	72712662	72724357	N	3:2
chr7	69513952	69514636	N	chr7	87057837	87061499	E	4:3
chr11	91641153	91642657	N	chr11	47259263	47272706	E	5:4


## Reading interactions from multiple Diachromatic interaction files

The class ``DiachromaticInteractionSet`` is instantiated as follows:

In [2]:
interaction_set = DiachromaticInteractionSet()

The function ``parse_file`` can be used to read interactions from Diachromatic interaction files.

In [3]:
interaction_set.parse_file(i_file="../tests/data/test_01/diachromatic_interaction_file_r1.tsv.gz")

A ``DiachromaticInteractionSet`` keeps track of how many interactions were read from which files. This information can be queried using the function ``get_read_file_info_dict``.

In [4]:
read_file_info_dict = interaction_set.get_read_file_info_dict()
pandas.DataFrame(read_file_info_dict)

Unnamed: 0,I_FILE,I_NUM,MIN_RP_NUM,MIN_DIST,I_NUM_SKIPPED_RP,I_NUM_SKIPPED_DIST,I_NUM_ADDED,I_SET_SIZE
0,../tests/data/test_01/diachromatic_interaction_file_r1.tsv.gz,1,0,0,0,0,1,1


Internally, the interaction set contains a dictionary with interaction objects that are considered to be the same if they have the same coordinates. In addition to the number of interactions from individual files (``I_NUM``), the current size of the interaction set is tracked (``I_SET_SIZE``).
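
The difference between ``I_NUM`` and ``I_SET_SIZE`` can be made concrete with a small sketch. This is not the implementation in ``diachrscripts``: instead of interaction objects, it only stores coordinate keys in a Python set, and the labels ``r1`` and ``r2`` are shorthand for the first two test files.

```python
def coordinate_key(line):
    """Return the coordinates of both digests as a hashable key."""
    f = line.split("\t")
    return (f[0], f[1], f[2], f[4], f[5], f[6])


# Content of the first two test files (labels are shorthand, not real paths)
files = {
    "r1": ["chr1\t46297999\t46305684\tE\tchr1\t51777391\t51781717\tN\t2:1"],
    "r2": [
        "chr1\t46297999\t46305684\tE\tchr1\t51777391\t51781717\tN\t2:1",
        "chr17\t72411026\t72411616\tN\tchr17\t72712662\t72724357\tN\t3:2",
    ],
}

seen_keys = set()
info = {"I_FILE": [], "I_NUM": [], "I_SET_SIZE": []}
for name, lines in files.items():
    seen_keys.update(coordinate_key(line) for line in lines)
    info["I_FILE"].append(name)
    info["I_NUM"].append(len(lines))           # interactions in this file
    info["I_SET_SIZE"].append(len(seen_keys))  # distinct interactions so far
print(info)
```

In this sketch, ``I_NUM`` counts every line of a file, whereas ``I_SET_SIZE`` grows only when a file contributes coordinates that have not been seen before.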

For a given interaction set, the function ``parse_file`` can be applied to multiple files. Execute the cell below to read in the second file and to query the information about read files.

In [9]:
interaction_set.parse_file(i_file="../tests/data/test_01/diachromatic_interaction_file_r2.tsv.gz")
read_file_info_dict = interaction_set.get_read_file_info_dict()
pandas.DataFrame(read_file_info_dict)

Unnamed: 0,I_FILE,I_NUM,I_SET_SIZE
0,../tests/data/test_01/diachromatic_interaction_file_r1.tsv.gz,1,1
1,../tests/data/test_01/diachromatic_interaction_file_r2.tsv.gz,2,2


A total of three interactions have been read from two files. The size of the interaction set is two because the interaction on chromosome ``chr1`` occurs in both files. Execute the cell below to read the interactions from the remaining files.

In [4]:
interaction_set.parse_file(i_file="../tests/data/test_01/diachromatic_interaction_file_r3.tsv.gz")
interaction_set.parse_file(i_file="../tests/data/test_01/diachromatic_interaction_file_r4.tsv.gz")
read_file_info_dict = interaction_set.get_read_file_info_dict()
pandas.DataFrame(read_file_info_dict)

Unnamed: 0,I_FILE,I_NUM,MIN_RP_NUM,MIN_DIST,I_NUM_SKIPPED_RP,I_NUM_SKIPPED_DIST,I_NUM_ADDED,I_SET_SIZE
0,../tests/data/test_01/diachromatic_interaction_file_r3.tsv.gz,3,0,0,0,0,3,3
1,../tests/data/test_01/diachromatic_interaction_file_r4.tsv.gz,4,0,0,0,0,4,4


A total of 10 interactions have been read from four files. The interaction set has a size of 4. This corresponds to what we expect for these test files.  

Alternatively, the information about read files and interaction set size can be returned as a formatted string.

In [5]:
print(interaction_set.get_read_file_info_report())

[INFO] Report on reading files:
	[INFO] Read interaction data from 2 files:
		[INFO] 3 interactions from: 
			[INFO] ../tests/data/test_01/diachromatic_interaction_file_r3.tsv.gz
			[INFO] Minimum number of read pairs: 0
			[INFO] Skipped because less than 0 read pairs: 0
			[INFO] Skipped because shorter than 0 bp: 0
			[INFO] Added to set: 3
			[INFO] Set size: 3
		[INFO] 4 interactions from: 
			[INFO] ../tests/data/test_01/diachromatic_interaction_file_r4.tsv.gz
			[INFO] Minimum number of read pairs: 0
			[INFO] Skipped because less than 0 read pairs: 0
			[INFO] Skipped because shorter than 0 bp: 0
			[INFO] Added to set: 4
			[INFO] Set size: 4
	[INFO] The interaction set has 4 interactions.
[INFO] End of report.



### Warning when reading the same file multiple times

The function ``DiachromaticInteractionSet.parse_file`` can be used multiple times for a given interaction set. However, it is not possible to add the same file more than once. If this were allowed, no new interactions would be added to the set, but the simple and twisted read pair counts would be summed up again. To prevent this, an interaction set keeps track of the names (including path) of the files that have already been read. If a file is read in repeatedly, a warning is issued and the interaction set remains unchanged. Execute the following cell to trigger the warning.

In [12]:
interaction_set.parse_file(i_file="../tests/data/test_01/diachromatic_interaction_file_r1.tsv.gz")

Filename: ../tests/data/test_01/diachromatic_interaction_file_r1.tsv.gz
Won't add interactions from this file to the interaction set.
  "Won't add interactions from this file to the interaction set.")

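The guard against reading a file twice can be sketched as follows. The class ``ToyInteractionSet``, its attributes and the wording of the warning are hypothetical and only illustrate the idea; they are not taken from ``diachrscripts``.

```python
import warnings


class ToyInteractionSet:
    """Hypothetical stand-in that only tracks which files were read."""

    def __init__(self):
        self.read_files = []  # names (including path) of files already read
        self.n_parsed = 0     # number of calls that changed the set

    def parse_file(self, i_file):
        if i_file in self.read_files:
            # Repeated file: warn and leave the set unchanged
            warnings.warn("File already read, set remains unchanged: " + i_file)
            return
        self.read_files.append(i_file)
        self.n_parsed += 1


toy_set = ToyInteractionSet()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    toy_set.parse_file("file_r1.tsv.gz")
    toy_set.parse_file("file_r1.tsv.gz")  # second call only triggers the warning
print(len(caught), toy_set.n_parsed)
```

Because the check is based on the file name including the path, the same file reached via a different path would not be recognized as a repeat in this sketch.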

## Write interactions that occur in a required number of replicates to an interaction file

As soon as interactions have been read into a ``DiachromaticInteractionSet`` object, they can be written to a file in Diachromatic interaction format using the function ``write_diachromatic_interaction_file``. The function expects a path to an output file as an argument. In addition, a required number of replicates can be specified. Every object of class ``DiachromaticInteraction`` records how often an interaction with identical coordinates was added to the interaction set. When the number of required replicates is 2, only interactions that have been read in at least twice are written out.

In [13]:
target_file = "combined_interactions.tsv.gz"
required_replicates = 2
write_file_info_dict = interaction_set.write_diachromatic_interaction_file(target_file=target_file, required_replicates=required_replicates)

Like the other functions of class ``DiachromaticInteractionSet``, this function returns a dictionary with information about the performed operation.

In [14]:
pandas.DataFrame(write_file_info_dict)

Unnamed: 0,TARGET_FILE,REQUIRED_REPLICATES,N_INCOMPLETE_DATA,N_COMPLETE_DATA
0,combined_interactions.tsv.gz,2,1,3


The dictionary contains the following information:
- ``TARGET_FILE``: Output file in Diachromatic interaction format
- ``REQUIRED_REPLICATES``: Chosen number of required replicates (defaults to 1)
- ``N_INCOMPLETE_DATA``: Number of interactions that occur in fewer replicates than required
- ``N_COMPLETE_DATA``: Number of interactions that occur in required number of replicates and were written to the output file

Alternatively, the same information can be retrieved as a formatted string.

In [15]:
print(interaction_set.get_write_file_info_report())

[INFO] Report on writing files:
	[INFO] Wrote interactions that occur in at least 2 replicates to:
		[INFO] combined_interactions.tsv.gz
	[INFO] Interactions that occur in at least 2 replicates: 3
	[INFO] Other interactions: 1
[INFO] End of report.



The information can also be returned as a table that consists of only two tab-separated lines: a header line and a line with values.

In [16]:
print(interaction_set.get_write_file_info_table_row())

TARGET_FILE	REQUIRED_REPLICATES	N_INCOMPLETE_DATA	N_COMPLETE_DATA
combined_interactions.tsv.gz	2	1	3
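
Such a two-line table is easy to process further. The following sketch turns it into a dictionary with plain Python. The string is copied from the output above; whether the returned string ends with a newline is an assumption, which ``rstrip`` makes irrelevant here.

```python
# Header line and value line, as shown in the output above
table_row = (
    "TARGET_FILE\tREQUIRED_REPLICATES\tN_INCOMPLETE_DATA\tN_COMPLETE_DATA\n"
    "combined_interactions.tsv.gz\t2\t1\t3\n"
)

header, values = table_row.rstrip("\n").split("\n")
# Pair each column name with its value (all values remain strings)
row = dict(zip(header.split("\t"), values.split("\t")))
print(row["N_COMPLETE_DATA"])
```

Rows from several runs can be collected this way, for example to compare different values for the required number of replicates.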



If the interactions have not yet been evaluated and categorized, the generated file contains nine columns. The first eight columns contain the coordinates and enrichment states of the two digests involved, and column nine contains the simple and twisted read pair counts separated by a colon.

In [17]:
df_interaction_file = pandas.read_csv('combined_interactions.tsv.gz', compression='gzip', sep='\t', header=None)
df_interaction_file.columns = ['CHR_D1','STA_D1','END_D1','ENR_CAT_D1',
                               'CHR_D2','STA_D2','END_D2','ENR_CAT_D2',
                               'RP_S:RP_T']
df_interaction_file

Unnamed: 0,CHR_D1,STA_D1,END_D1,ENR_CAT_D1,CHR_D2,STA_D2,END_D2,ENR_CAT_D2,RP_S:RP_T
0,chr1,46297999,46305684,E,chr1,51777391,51781717,N,8:4
1,chr17,72411026,72411616,N,chr17,72712662,72724357,N,9:6
2,chr7,69513952,69514636,N,chr7,87057837,87061499,E,8:6


The file contains three interactions. The interaction on chromosome ``chr11`` was discarded because it occurs only once. The summed-up simple and twisted read pair counts correspond to what we expect for this test dataset.
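
The expected values can be derived by hand from the four test files. The sketch below does this in plain Python without ``diachrscripts``. Interactions are labelled by their chromosome for brevity, and the calculation relies on the fact that each interaction has the same read pair counts in every test file.

```python
from collections import Counter

# Simple and twisted read pair counts of each interaction in the test files
read_pairs = {"chr1": (2, 1), "chr17": (3, 2), "chr7": (4, 3), "chr11": (5, 4)}

# Interactions contained in the four test files (r1 to r4)
file_labels = [
    ["chr1"],
    ["chr1", "chr17"],
    ["chr1", "chr17", "chr7"],
    ["chr1", "chr17", "chr7", "chr11"],
]
required_replicates = 2

# Number of files in which each interaction occurs
replicates = Counter(label for labels in file_labels for label in labels)

# Summed counts for interactions that occur in enough replicates
expected = {
    label: "{}:{}".format(n * read_pairs[label][0], n * read_pairs[label][1])
    for label, n in replicates.items()
    if n >= required_replicates
}
n_incomplete = sum(1 for n in replicates.values() if n < required_replicates)
print(expected, n_incomplete)
```

The result agrees with the table above: ``8:4`` for ``chr1``, ``9:6`` for ``chr17`` and ``8:6`` for ``chr7``, with one interaction that does not reach the required number of replicates.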

## Run the entire workflow with one script

The steps described in this notebook are summarized in a script that can be executed as follows:

In [8]:
%run ../01_combine_interactions_from_replicates.py \
--out-prefix DEMO \
--required-replicates 2 \
--interaction-files-path ../tests/data/test_01/

[INFO] Input parameters
	[INFO] --out-prefix: DEMO
	[INFO] --interaction-files-path: ../tests/data/test_01/
	[INFO] --required-replicates: 2

[INFO] Parsing Diachromatic interaction file ...
	[INFO] ../tests/data/test_01/diachromatic_interaction_file_r1.tsv.gz
	[INFO] Set size: 1
[INFO] ... done.
[INFO] Parsing Diachromatic interaction file ...
	[INFO] ../tests/data/test_01/diachromatic_interaction_file_r3.tsv.gz
	[INFO] Set size: 3
[INFO] ... done.
[INFO] Parsing Diachromatic interaction file ...
	[INFO] ../tests/data/test_01/diachromatic_interaction_file_r4.tsv.gz
	[INFO] Set size: 4
[INFO] ... done.
[INFO] Parsing Diachromatic interaction file ...
	[INFO] ../tests/data/test_01/diachromatic_interaction_file_r2.tsv.gz
	[INFO] Set size: 4
[INFO] ... done.

[INFO] Writing Diachromatic interaction file ...
	[INFO] Required replicates: 2
	[INFO] Target file: DEMO_at_least_2_combined_interactions.tsv.gz
[INFO] ... done.


The script outputs information about the status of processing and creates two files. The first file contains the reports and summary statistics that were presented above for each step individually.

In [19]:
cat DEMO_at_least_2_combined_summary.txt

[INFO] Input parameters
	[INFO] --out-prefix: DEMO
	[INFO] --interaction-files-path: ../tests/data/test_01/
	[INFO] --required-replicates: 2

[INFO] Report on reading files:
	[INFO] Read interaction data from 4 files:
		[INFO] 1 interactions from: 
			[INFO] ../tests/data/test_01/diachromatic_interaction_file_r1.tsv.gz
			[INFO] Set size: 1
		[INFO] 3 interactions from: 
			[INFO] ../tests/data/test_01/diachromatic_interaction_file_r3.tsv.gz
			[INFO] Set size: 3
		[INFO] 4 interactions from: 
			[INFO] ../tests/data/test_01/diachromatic_interaction_file_r4.tsv.gz
			[INFO] Set size: 4
		[INFO] 2 interactions from: 
			[INFO] ../tests/data/test_01/diachromatic_interaction_file_r2.tsv.gz
			[INFO] Set size: 4
	[INFO] The interaction set has 4 interactions.
[INFO] End of report.

[INFO] Report on writing files:
	[INFO] Wrote interactions that occur in at least 2 replicates to:
		[INFO] DEMO_at_least_2_combined_interactions.tsv.gz
	[INFO] Interactions that occur i

The second file contains the combined interactions.

In [20]:
!gunzip -c DEMO_at_least_2_combined_interactions.tsv.gz | head

chr1	46297999	46305684	E	chr1	51777391	51781717	N	8:4
chr17	72411026	72411616	N	chr17	72712662	72724357	N	9:6
chr7	69513952	69514636	N	chr7	87057837	87061499	E	8:6
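
Files in this nine-column format can also be read with the Python standard library. The following sketch first writes a small gzipped example file to a temporary directory so that it runs on its own. The file name ``toy_interactions.tsv.gz`` and the dictionary keys are hypothetical.

```python
import csv
import gzip
import os
import tempfile

# Write a small example file in Diachromatic interaction format
lines = [
    "chr1\t46297999\t46305684\tE\tchr1\t51777391\t51781717\tN\t8:4",
    "chr17\t72411026\t72411616\tN\tchr17\t72712662\t72724357\tN\t9:6",
]
path = os.path.join(tempfile.mkdtemp(), "toy_interactions.tsv.gz")
with gzip.open(path, "wt") as handle:
    handle.write("\n".join(lines) + "\n")

# Read the file back and split column nine into its two counts
interactions = []
with gzip.open(path, "rt", newline="") as handle:
    for f in csv.reader(handle, delimiter="\t"):
        simple, twisted = (int(x) for x in f[8].split(":"))
        interactions.append({
            "chr_d1": f[0], "sta_d1": int(f[1]), "end_d1": int(f[2]), "enr_cat_d1": f[3],
            "chr_d2": f[4], "sta_d2": int(f[5]), "end_d2": int(f[6]), "enr_cat_d2": f[7],
            "rp_simple": simple, "rp_twisted": twisted,
        })
print(interactions[0]["rp_simple"], interactions[0]["rp_twisted"])
```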
