# Evaluate and categorize interactions

In this notebook, we describe how to use the modules of ``diachrscripts`` to calculate P-values for interactions and, based on these, to divide them into the categories directed (``DI``) and undirected (``UI``). We also show how to select a set of undirected reference interactions (``UIR``) from the undirected interactions that is comparable to the directed interactions in terms of the enrichment status of the digests and the number of read pairs per interaction.

## Setting up the notebook

In [1]:
import sys
import os
import pandas
sys.path.append("..")
from diachr import DiachromaticInteractionSet

In this notebook, we use only the class ``DiachromaticInteractionSet`` and the following functions from this class:
- ``parse_file``: Read in interaction files
- ``get_read_file_info_dict``: Get information about files that have already been read
- ``get_read_file_info_report``: Get information about files that have already been read in the form of a formatted string
- ``evaluate_and_categorize_interactions``: Calculate P-values and assign interactions to the categories ``DI`` and ``UI``
- ``get_eval_cat_info_report``: Get information about P-value calculation and categorization in the form of a formatted string
- ``get_eval_cat_info_table_row``: Get information about P-value calculation and categorization in the form of a table row
- ``select_reference_interactions``: Select undirected reference interactions ``UIR`` from ``UI``
- ``get_select_ref_info_report``: Get information about reference interaction selection in the form of a formatted string
- ``get_select_ref_info_table_row``: Get information about reference interaction selection in the form of a table row
- ``write_diachromatic_interaction_file``: Write interactions to an interaction file
- ``get_write_file_info_report``: Get information about the written interaction file in the form of a formatted string
- ``get_write_file_info_table_row``: Get information about the written interaction file in the form of a table row

## Test file

To test this section of the analysis, we constructed a test file that contains a certain number of directed and undirected interactions. In addition, this file contains interactions that cannot be significant at a P-value threshold of ``0.01`` because they do not have enough read pairs. Finally, we have chosen the numbers of read pairs in such a way that no matching undirected reference interaction can be selected for some directed interactions. More details on this file can be found below in this notebook and in the unit test in which it is used.

In [2]:
!head ../tests/data/test_04/diachromatic_interaction_file.tsv

chr14	43059116	43059494	N	chr14	43101212	43101810	N	100:1
chr8	129042054	129044258	N	chr8	129121269	129121986	N	100:2
chr15	73467156	73468652	N	chr15	73526903	73528438	N	100:3
chr17	72411026	72411616	N	chr17	72712662	72724357	E	100:1
chr18	38724804	38726198	N	chr18	76794986	76803172	E	100:2
chr11	114362648	114362686	N	chr11	114396073	114404234	E	100:3
chr15	56158017	56158267	N	chr15	56462978	56465983	E	100:4
chr14	34714080	34716362	E	chr14	50135355	50139051	N	100:1
chr1	91022201	91023797	E	chr1	116561813	116566655	N	100:2
chr1	15681566	15697108	E	chr1	19411358	19417940	N	100:3
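
Each line shown above has nine tab-separated fields: the coordinates and enrichment status (``E`` or ``N``) of the two digests, followed by two read pair counts separated by a colon. The following is a minimal sketch of how such a line could be parsed without ``diachrscripts``. The function name ``parse_interaction_line`` and the dictionary keys are made up for illustration, and the order simple:twisted of the two counts is assumed from the column label ``RP_S:RP_T`` used further below.

```python
def parse_interaction_line(line):
    """Split one line in Diachromatic interaction format into named fields."""
    fields = line.rstrip("\n").split("\t")
    chr_d1, sta_d1, end_d1, enr_d1, chr_d2, sta_d2, end_d2, enr_d2, rp = fields[:9]
    # Assumed order of the two counts: simple read pairs, then twisted read pairs
    rp_simple, rp_twisted = (int(x) for x in rp.split(":"))
    return {
        "chr_d1": chr_d1, "sta_d1": int(sta_d1), "end_d1": int(end_d1),
        "chr_d2": chr_d2, "sta_d2": int(sta_d2), "end_d2": int(end_d2),
        "enr_cat": enr_d1 + enr_d2,  # NN, NE, EN or EE
        "rp_simple": rp_simple,
        "rp_twisted": rp_twisted,
        "rp_total": rp_simple + rp_twisted,
    }

# Fourth line of the test file shown above
line = "chr17\t72411026\t72411616\tN\tchr17\t72712662\t72724357\tE\t100:1"
interaction = parse_interaction_line(line)
print(interaction["enr_cat"], interaction["rp_total"])  # NE 101
```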


## Reading in a Diachromatic interaction file

A central class in ``diachrscripts`` is the class ``DiachromaticInteractionSet``. It contains functions for reading, evaluating, categorizing and writing interactions. This class is instantiated as follows:

In [3]:
interaction_set = DiachromaticInteractionSet()

The function ``parse_file`` can be used to read interactions from Diachromatic interaction files:

In [4]:
interaction_set.parse_file(i_file="../tests/data/test_04/diachromatic_interaction_file.tsv", verbose=False)

For a given ``DiachromaticInteractionSet`` object, this function can also be used multiple times, with the interactions remaining unique with regard to their coordinates (see combining interactions). For this notebook, we only read in one interaction file.

An object of the class ``DiachromaticInteractionSet`` keeps track of which interaction files have been read and how many interactions were in each file. This information is stored in a dictionary that can be queried as follows:

In [5]:
read_file_info_dict = interaction_set.get_read_file_info_dict()
pandas.DataFrame(read_file_info_dict)

Unnamed: 0,I_FILE,I_NUM,MIN_RP_NUM,MIN_DIST,I_NUM_SKIPPED_RP,I_NUM_SKIPPED_DIST,I_NUM_ADDED,I_SET_SIZE
0,../tests/data/test_04/diachromatic_interaction...,50,0,0,0,0,50,50


Alternatively, the same information can be returned in the form of a string that is formatted as follows:

In [6]:
print(interaction_set.get_read_file_info_report())

[INFO] Report on reading files:
	[INFO] Read interaction data from 1 files:
		[INFO] 50 interactions from: 
			[INFO] ../tests/data/test_04/diachromatic_interaction_file.tsv
			[INFO] Minimum number of read pairs: 0
			[INFO] Skipped because less than 0 read pairs: 0
			[INFO] Minimum interaction distance: 0
			[INFO] Skipped because shorter than 0 bp: 0
			[INFO] Added to set: 50
			[INFO] Set size: 50
	[INFO] The interaction set has 50 interactions.
[INFO] End of report.



Such strings can be used to compile the output of scripts in which different modules are used.

### Warning when reading the same file multiple times

The function ``DiachromaticInteractionSet.parse_file`` can be used multiple times for a given interaction set. However, it is not possible to add the same file more than once. In this case, a warning is issued and the interaction set remains unchanged (see combining interactions).

## P-value threshold

We use an empirical method to determine a P-value threshold for a given dataset so that the FDR remains below a chosen threshold. This method is demonstrated in a separate notebook. The test file used in this notebook is not suitable for estimating the FDR. Therefore, we specify an arbitrary P-value threshold here.

In [7]:
pval_thresh = 0.01

## Calculate P-values and divide into directed and undirected interactions

We test the imbalance of simple and twisted read pairs within individual interactions using a two-sided binomial test. The function ``evaluate_and_categorize_interactions`` can be used to calculate P-values for all interactions in a ``DiachromaticInteractionSet`` object. The function takes a P-value threshold as an argument. Interactions that do not have enough read pairs to reach a P-value that is less than or equal to the threshold are discarded, i.e. they are removed from the interaction set. All other interactions are categorized as undirected (``UI``) if they have a P-value greater than the threshold, and as directed (``DI``) if they have a P-value less than or equal to the threshold.
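
To make the test concrete, here is a minimal sketch of a two-sided binomial test with a success probability of 0.5, written with the Python standard library only. It is not the implementation used in ``diachrscripts``. The function names ``binomial_two_sided_pval`` and ``categorize`` are made up for illustration, and doubling the smaller tail is assumed as the two-sided convention (for a probability of 0.5, the distribution is symmetric).

```python
from math import comb, log10

def binomial_two_sided_pval(n_simple, n_twisted):
    """Two-sided binomial P-value for p = 0.5 (smaller tail doubled, capped at 1)."""
    n = n_simple + n_twisted
    k = min(n_simple, n_twisted)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def categorize(n_simple, n_twisted, pval_thresh):
    """Return 'DI' if the P-value is at most the threshold, otherwise 'UI'."""
    pval = binomial_two_sided_pval(n_simple, n_twisted)
    return "DI" if pval <= pval_thresh else "UI"

pval = binomial_two_sided_pval(100, 3)
print(round(-log10(pval), 2), categorize(100, 3, 0.01))  # 25.44 DI
print(categorize(25, 27, 0.01))                          # UI
```

An interaction with ``100:3`` read pairs is strongly imbalanced and is categorized as ``DI``, whereas an interaction with ``25:27`` read pairs is categorized as ``UI``.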

In [8]:
eval_cat_info_dict = interaction_set.evaluate_and_categorize_interactions(pval_thresh=pval_thresh, verbose=False)

The function ``evaluate_and_categorize_interactions`` returns a dictionary that contains information about the evaluation and categorization of interactions.

In [9]:
pandas.DataFrame(eval_cat_info_dict)

Unnamed: 0,PVAL_THRESH,MIN_RP,MIN_RP_PVAL,N_PROCESSED,N_DISCARDED,N_UNDIRECTED,N_DIRECTED
0,0.01,8,0.007813,50,4,28,18


Alternatively, the same information can be retrieved as a formatted string.

In [10]:
print(interaction_set.get_eval_cat_info_report())

[INFO] Report on evaluation and categorization interactions:
	[INFO] P-value threshold: 0.01000
	[INFO] Minimum number of read pairs required for significance: 8
	[INFO] Smallest P-value with 8 read pairs: 0.00781
	[INFO] Processed interactions: 50
	[INFO] Discarded interactions: 4
	[INFO] Not significant interactions (UI): 28
	[INFO] Significant interactions (DI): 18
[INFO] End of report.



Or in the form of a table that consists of only two tab-separated lines: a header line and a line with values.

In [11]:
print(interaction_set.get_eval_cat_info_table_row())

:TR_EVAL_CAT:	DESCRIPTION	PVAL_THRESH	MIN_RP	MIN_RP_PVAL	N_PROCESSED	N_DISCARDED	N_UNDIRECTED	N_DIRECTED
:TR_EVAL_CAT:	None	0.01000	8	0.00781	50	4	28	18



This type of output is useful if you want to compile the output from different datasets in one table.
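
As a sketch of how such table rows could be processed further, the snippet below turns the two lines into a dictionary that maps column names to values. The helper ``parse_table_row`` is hypothetical and not part of ``diachrscripts``. The input string is copied from the output above, and the leading tag (here ``:TR_EVAL_CAT:``) is dropped.

```python
def parse_table_row(table_row):
    """Turn a two-line, tab-separated table row into a dictionary."""
    # Ignore empty lines before or after the two table lines
    lines = [line for line in table_row.splitlines() if line.strip()]
    header = lines[0].split("\t")
    values = lines[1].split("\t")
    return dict(zip(header[1:], values[1:]))  # skip the leading tag

table_row = (
    ":TR_EVAL_CAT:\tDESCRIPTION\tPVAL_THRESH\tMIN_RP\tMIN_RP_PVAL"
    "\tN_PROCESSED\tN_DISCARDED\tN_UNDIRECTED\tN_DIRECTED\n"
    ":TR_EVAL_CAT:\tNone\t0.01000\t8\t0.00781\t50\t4\t28\t18\n"
)
row = parse_table_row(table_row)
print(row["N_DIRECTED"], row["N_UNDIRECTED"])  # 18 28
```

Dictionaries obtained in this way for several datasets can then be collected in a list and passed to ``pandas.DataFrame`` to obtain one table with one row per dataset.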

### Warnings

An interaction set can be evaluated and categorized multiple times. However, if a larger P-value threshold is used than before, then interactions may already have been discarded because they did not have enough read pairs at the smaller threshold. Therefore, it is only possible to re-evaluate and re-categorize an interaction set if a smaller threshold is used than before. Otherwise, a warning is issued and the interaction set remains unchanged. Execute the cell below to trigger a warning.

In [12]:
pval_thresh = 0.02
eval_cat_info_dict = interaction_set.evaluate_and_categorize_interactions(pval_thresh=pval_thresh, verbose=False)

Nothing is done. Interaction set remains unchanged.
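
The reason for this restriction can be illustrated with a short calculation. For a two-sided binomial test with a probability of 0.5, the smallest possible P-value for ``n`` read pairs is obtained when all read pairs are of one type and equals ``2 * 0.5 ** n``. The sketch below (the function name is made up for illustration and the code is not taken from ``diachrscripts``) finds the smallest ``n`` for which this P-value reaches a given threshold.

```python
def min_read_pairs_for_significance(pval_thresh):
    """Smallest n for which the most extreme outcome (n:0) reaches the threshold."""
    n = 1
    while 2 * 0.5 ** n > pval_thresh:
        n += 1
    return n, 2 * 0.5 ** n

print(min_read_pairs_for_significance(0.01))  # (8, 0.0078125)
print(min_read_pairs_for_significance(0.02))  # (7, 0.015625)
```

At a threshold of ``0.01``, interactions need at least 8 read pairs, which agrees with the report above. At ``0.02``, 7 read pairs would be sufficient, but interactions with 7 read pairs were already discarded during the first evaluation.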


## Select undirected reference interactions

We use a binomial test with a fixed P-value threshold to categorize interactions as directed or undirected. The power of the test depends on the total number of read pairs per interaction, which differs between interactions. The test has lower power for interactions with few read pairs. In order to compare directed with undirected interactions regardless of the number of read pairs per interaction, we select a comparison set of reference interactions (``UIR``) from the undirected interactions (``UI``) that is comparable to the directed interactions (``DI``) with regard to the number of read pairs per interaction. Interactions also differ with regard to the enrichment states of the digests involved (``E`` stands for *enriched* and ``N`` for *not enriched*). Since it can be assumed that the enrichment status of interactions (``NN``, ``NE``, ``EN`` or ``EE``) affects the number of read pairs per interaction, we carry out the selection separately for the different enrichment categories. For a given interaction set, the selection of reference interactions can be performed with the function ``DiachromaticInteractionSet.select_reference_interactions``:

In [13]:
interaction_set.shuffle_inter_dict(random_seed=1)
select_ref_report_dict = interaction_set.select_reference_interactions_2(verbose=True)

[INFO] Select reference interactions ...
	[INFO] Treating NE and EN as one category ...
	[INFO] First pass: Count directed interactions for different read pair counts ...
	[INFO] Second pass: Select undirected reference interactions for different read pair counts ...
[INFO] ... done.


In a first pass, the function determines the numbers of directed interactions for different read pair numbers per interaction. In a second pass, the function tries to select the same number of undirected reference interactions for each read pair number.
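
The two passes can be sketched as follows. This is a simplified illustration and not the implementation in ``diachrscripts``: interactions are represented as plain dictionaries with the hypothetical keys ``enr_cat``, ``i_cat`` and ``rp_total``, the toy data is made up, and reference interactions are matched on the exact read pair number within each enrichment category as given.

```python
from collections import Counter

def select_reference_interactions_sketch(interactions):
    """Relabel matching 'UI' entries as 'UIR' and return the number of missing references."""
    # First pass: count directed interactions per enrichment category and read pair number
    di_counts = Counter(
        (i["enr_cat"], i["rp_total"]) for i in interactions if i["i_cat"] == "DI"
    )
    # Second pass: select one undirected interaction for each counted directed interaction
    remaining = Counter(di_counts)
    for i in interactions:
        key = (i["enr_cat"], i["rp_total"])
        if i["i_cat"] == "UI" and remaining[key] > 0:
            i["i_cat"] = "UIR"
            remaining[key] -= 1
    # Directed interactions for which no reference could be selected
    return sum(remaining.values())

interactions = [
    {"enr_cat": "NE", "i_cat": "DI", "rp_total": 101},
    {"enr_cat": "NE", "i_cat": "DI", "rp_total": 104},
    {"enr_cat": "NE", "i_cat": "UI", "rp_total": 101},
    {"enr_cat": "NE", "i_cat": "UI", "rp_total": 52},
]
n_missing = select_reference_interactions_sketch(interactions)
print([i["i_cat"] for i in interactions], n_missing)  # ['DI', 'DI', 'UIR', 'UI'] 1
```

In this toy example, the directed interaction with 104 read pairs has no undirected counterpart with the same read pair number, so one reference interaction is missing.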

It can happen that there is no undirected reference interaction for a certain number of read pairs per interaction. However, because there are usually many more undirected than directed interactions, such cases are rare. To keep track of such cases, the function ``select_reference_interactions`` returns a dictionary that contains, for each enrichment category, the numbers of directed, undirected and undirected reference interactions, as well as the number of missing reference interactions. We also use this dictionary to test the function.

The returned dictionary is nested, with the first key being the enrichment status (``NN``, ``NE``, ``EN`` or ``EE``) and the second key being the interaction category (``DI``, ``UIR``, ``M_UIR`` or ``UI``). Execute the cell below to print out the interaction counts for the enrichment category ``NN``.

In [14]:
df_nn = pandas.DataFrame(select_ref_report_dict['NN']).transpose()
df_nn.columns = ['Interaction counts for NN']
df_nn

Unnamed: 0,Interaction counts for NN
DI,3
UIR,3
M_UIR,0
UI,3


Within ``NN``, we have three directed interactions, and an undirected reference interaction could be selected for each of them. In addition, there are three undirected interactions that were not selected as reference interactions.

Now execute the cell below to print out the interaction counts for the enrichment category ``NE``.

In [15]:
df_ne = pandas.DataFrame(select_ref_report_dict['NE']).transpose()
df_ne.columns = ['Interaction counts for NE']
df_ne

Unnamed: 0,Interaction counts for NE
DI,4
UIR,3
M_UIR,1
UI,3


Within ``NE``, we have four directed interactions, and a reference interaction could be selected for only three of them. In addition, there are three undirected interactions that were not selected as reference interactions. Use the keys ``EN`` and ``EE`` to print out the interaction counts for the two other enrichment categories.
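
The counts for all four enrichment categories can also be checked for consistency in one go. The sketch below uses a hand-written nested dictionary with the counts reported for this test file. The nesting follows the description above, but the plain integer values are an assumption, and the dictionary actually returned by the function may store them differently.

```python
# Counts for the test file, keyed by enrichment category and interaction category
select_ref_counts = {
    "NN": {"DI": 3, "UIR": 3, "M_UIR": 0, "UI": 3},
    "NE": {"DI": 4, "UIR": 3, "M_UIR": 1, "UI": 3},
    "EN": {"DI": 5, "UIR": 5, "M_UIR": 0, "UI": 3},
    "EE": {"DI": 6, "UIR": 5, "M_UIR": 1, "UI": 3},
}

for enr_cat, counts in select_ref_counts.items():
    # Every directed interaction either has a reference or is counted as missing
    assert counts["DI"] == counts["UIR"] + counts["M_UIR"]
    print(enr_cat, counts)

# Sum each interaction category over the four enrichment categories
totals = {
    i_cat: sum(counts[i_cat] for counts in select_ref_counts.values())
    for i_cat in ("DI", "UIR", "M_UIR", "UI")
}
print(totals)  # {'DI': 18, 'UIR': 16, 'M_UIR': 2, 'UI': 12}
```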

Alternatively, the same information can be retrieved as a formatted string.

In [16]:
print(interaction_set.get_select_ref_info_report())

[INFO] Report on selection of undirected reference interactions:
	[INFO] Numbers of directed interactions
		[INFO] Interactions in NN: 3
		[INFO] Interactions in NE: 4
		[INFO] Interactions in EN: 5
		[INFO] Interactions in EE: 6
		[INFO] Total: 18
	[INFO] Numbers of undirected reference interactions
		[INFO] Interactions in NN: 3
		[INFO] Interactions in NE: 3
		[INFO] Interactions in EN: 5
		[INFO] Interactions in EE: 5
		[INFO] Total: 16
	[INFO] Numbers of missing undirected reference interactions
		[INFO] Interactions in NN: 0
		[INFO] Interactions in NE: 1
		[INFO] Interactions in EN: 0
		[INFO] Interactions in EE: 1
		[INFO] Total: 2
	[INFO] Numbers undirected interactions
		[INFO] Interactions in NN: 3
		[INFO] Interactions in NE: 3
		[INFO] Interactions in EN: 3
		[INFO] Interactions in EE: 3
		[INFO] Total: 12
[INFO] End of report.



In this type of output, it is easy to see that no reference could be selected for two directed interactions, one in ``NE`` and one in ``EE``. This is what we expect for this test dataset.

Or in the form of a table that consists of only two tab-separated lines: a header line and a line with values.

In [17]:
print(interaction_set.get_select_ref_info_table_row())

:TR_SELECT:	DESCRIPTION	DI_NN	DI_NE	DI_EN	DI_EE	DI_TOTAL	UIR_NN	UIR_NE	UIR_EN	UIR_EE	UIR_TOTAL	M_UIR_NN	M_UIR_NE	M_UIR_EN	M_UIR_EE	M_UIR_TOTAL	UI_NN	UI_NE	UI_EN	UI_EE	UI_TOTAL
:TR_SELECT:	None	3	4	5	6	18	3	3	5	5	16	0	1	0	1	2	3	3	3	3	12



This type of output is useful if you want to compile the output from different datasets in one table.

## Write evaluated and categorized interactions to a Diachromatic interaction file

As soon as interactions have been read into a ``DiachromaticInteractionSet`` object, they can be written to a file in Diachromatic interaction format using the function ``write_diachromatic_interaction_file``. This function is also used when combining interactions and is described in that context. Here, only the path to an output file is passed to the function.

In [18]:
target_file = "evaluated_and_categorized_interactions.tsv.gz"
write_file_info_dict = interaction_set.write_diachromatic_interaction_file(target_file=target_file)

Like the other functions of the class ``DiachromaticInteractionSet``, this function returns a dictionary with information about the performed operation.

In [19]:
pandas.DataFrame(write_file_info_dict)

Unnamed: 0,TARGET_FILE,REQUIRED_REPLICATES,N_INCOMPLETE_DATA,N_COMPLETE_DATA
0,evaluated_and_categorized_interactions.tsv.gz,1,0,46


By default the number of required replicates is 1, i.e. all interactions from the interaction set have been written to file.

There are other functions that embed the contents of the dictionary in strings that can be used for logging (see combine interactions).

### Diachromatic11 interaction format

If an interaction set has previously been evaluated and categorized, as in this case, then the output file contains two additional columns on the right. Column 10 contains the negative decadic (base-10) logarithm of the P-value (for example, ``25.44`` for an interaction with ``100:3`` read pairs) and column 11 the interaction category, either ``DI`` or ``UI``. If reference interactions have been selected, then column 11 can also contain ``UIR``. Execute the cell below to view the contents of the generated file.

In [20]:
df_interaction_file = pandas.read_csv('evaluated_and_categorized_interactions.tsv.gz', compression='gzip', sep='\t', header=None)
df_interaction_file.columns = ['CHR_D1','STA_D1','END_D1','ENR_CAT_D1',
                               'CHR_D2','STA_D2','END_D2','ENR_CAT_D2',
                               'RP_S:RP_T',
                               'NLN_PVAL',
                               'I_CAT']
df_interaction_file

Unnamed: 0,CHR_D1,STA_D1,END_D1,ENR_CAT_D1,CHR_D2,STA_D2,END_D2,ENR_CAT_D2,RP_S:RP_T,NLN_PVAL,I_CAT
0,chr15,73467156,73468652,N,chr15,73526903,73528438,N,100:3,25.44,DI
1,chr1,15681566,15697108,E,chr1,19411358,19417940,N,100:3,25.44,DI
2,chr17,80806561,80813742,E,chr17,80880355,80886553,E,25:25,0.0,UI
3,chr5,112154251,112162055,E,chr5,115769718,115774925,E,50:55,0.16,UIR
4,chr11,114362648,114362686,N,chr11,114396073,114404234,E,100:3,25.44,DI
5,chr18,31373121,31377878,E,chr18,53036132,53037663,N,50:53,0.07,UIR
6,chr7,25228385,25228778,E,chr7,42234764,42240281,E,100:1,28.09,DI
7,chr2,113676580,113686263,E,chr2,202796295,202797013,N,100:5,23.3,DI
8,chr9,120719242,120724112,N,chr9,135466848,135469617,E,25:27,0.05,UI
9,chr10,100185111,100188716,E,chr10,100911854,100914842,E,100:4,24.33,DI


## Run the entire workflow with one script

The steps described in this notebook are summarized in a script that can be executed as follows:

In [21]:
%run ../DICer.py \
--out-prefix DEMO \
--p-value-threshold 0.01 \
--diachromatic-interaction-file ../tests/data/test_04/diachromatic_interaction_file.tsv

[INFO] Input parameters
	[INFO] --out-prefix: DEMO
	[INFO] --description-tag: DESCRIPTION_TAG
	[INFO] --diachromatic-interaction-file:
		[INFO] ../tests/data/test_04/diachromatic_interaction_file.tsv
	[INFO] --min-inter-dist: 0
	[INFO] --p-value-threshold: 0.01
	[INFO] --random-seed-shuff-inter: 1
		[INFO] Will use this P-value threshold instead of the one determined by the FDR procedure.
		[INFO] We use the negative decadic logarithm of the P-values.
			[INFO] The chosen threshold corresponds to: -log10(0.01) = 2.0

[INFO] Parsing Diachromatic interaction file ...
	[INFO] ../tests/data/test_04/diachromatic_interaction_file.tsv
	[INFO] Set size: 50
[INFO] ... done.

[INFO] Shuffling dictionary with interactions ...
	[INFO] Random seed: 1
[INFO] ... done.

[INFO] Evaluate and categorize interactions ...
[INFO] ... done.

[INFO] Select reference interactions ...
	[INFO] Treating NE and EN as one category ...
	[INFO] First pass: Count directed interactions for different read pair counts .

The script outputs information about the status of processing and creates two files. The first file contains the reports and summary statistics that were presented above for each step individually.

In [22]:
cat DEMO_reports.txt

[INFO] Input parameters
	[INFO] --out-prefix: DEMO
	[INFO] --description-tag: DESCRIPTION_TAG
	[INFO] --diachromatic-interaction-file:
		[INFO] ../tests/data/test_04/diachromatic_interaction_file.tsv
	[INFO] --min-inter-dist: 0
	[INFO] --p-value-threshold: 0.01
	[INFO] --random-seed-shuff-inter: 1
		[INFO] Will use this P-value threshold instead of the one determined by the FDR procedure.
		[INFO] We use the negative decadic logarithm of the P-values.
			[INFO] The chosen threshold corresponds to: -log10(0.01) = 2.0

[INFO] Report on reading files:
	[INFO] Read interaction data from 1 files:
		[INFO] 50 interactions from: 
			[INFO] ../tests/data/test_04/diachromatic_interaction_file.tsv
			[INFO] Minimum number of read pairs: 5
			[INFO] Skipped because less than 5 read pairs: 0
			[INFO] Minimum interaction distance: 0
			[INFO] Skipped because shorter than 0 bp: 0
			[INFO] Added to set: 50
			[INFO] Set size: 50
	[INFO] The interaction set has 50 interactions.

The second file contains the evaluated and categorized interactions.

In [23]:
!gunzip -c DEMO_evaluated_and_categorized_interactions.tsv.gz | head

chr15	73467156	73468652	N	chr15	73526903	73528438	N	100:3	25.44	DI
chr10	100185111	100188716	E	chr10	100911854	100914842	E	100:4	24.33	DI
chr1	15681566	15697108	E	chr1	19411358	19417940	N	100:3	25.44	DI
chr11	86392415	86393959	N	chr11	125833282	125834157	E	50:51	0.00	UIR
chr3	185935734	185943372	E	chr3	194132402	194139770	E	50:51	0.00	UIR
chr11	114362648	114362686	N	chr11	114396073	114404234	E	100:3	25.44	DI
chr7	123787495	123793134	E	chr7	141015018	141017643	E	100:3	25.44	DI
chr2	113676580	113686263	E	chr2	202796295	202797013	N	100:5	23.30	DI
chr1	23789605	23791509	E	chr1	153951062	153962812	E	50:54	0.11	UIR
chr8	110169057	110171420	N	chr8	110203244	110203772	N	25:25	0.00	UI
