# Evaluate and categorize interactions

In this notebook, we describe how to use the modules of ``diachrscripts`` to calculate P-values for interactions and, based on these, to divide them into the categories directed (``DI``) and undirected (``UI``). We also show how to select a set of undirected reference interactions (``UIR``) from the undirected interactions that is comparable to the directed interactions in terms of the enrichment status of the digests and the number of read pairs per interaction.

## Setting up the notebook

In [1]:
import sys
import os
import pandas
sys.path.append("..")
from diachr import DiachromaticInteractionSet

## Read a Diachromatic interaction file

A central class in ``diachrscripts`` is ``DiachromaticInteractionSet``. It contains functions for reading, evaluating, categorizing and writing interactions. The class is instantiated as follows:

In [2]:
interaction_set = DiachromaticInteractionSet()

The function ``DiachromaticInteractionSet.parse_file`` can be used to read interactions from Diachromatic interaction files:

In [3]:
interaction_set.parse_file(i_file="../tests/data/test_04/diachromatic_interaction_file.tsv", verbose=False)

For a given instance, this function can be called multiple times, with the interactions remaining unique with regard to their coordinates. If interactions with identical coordinates are read, the simple and twisted read pair counts are added up separately (see combining interactions). However, a given interaction file can only be parsed once; otherwise a warning is issued (see below). For this notebook, we read only one interaction file, which is also used for tests.
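The bookkeeping described above can be pictured with a small stand-alone sketch. It is not the ``diachrscripts`` implementation: the class ``ToyInteractionSet``, its method ``add_file``, the file names and the tuple layout are hypothetical and chosen only for illustration.

```python
import warnings


class ToyInteractionSet:
    """Hypothetical stand-in that mimics the bookkeeping described above."""

    def __init__(self):
        self.counts = {}          # coordinate key -> [simple, twisted]
        self.read_files = set()   # names of files that were already added

    def add_file(self, name, interactions):
        """Add (key, simple, twisted) records that stand for one file."""
        if name in self.read_files:
            # Adding a file twice would only inflate the read pair counts
            warnings.warn(f"{name} was already read. Nothing is added.")
            return
        self.read_files.add(name)
        for key, simple, twisted in interactions:
            # Interactions stay unique by coordinates; counts are summed separately
            entry = self.counts.setdefault(key, [0, 0])
            entry[0] += simple
            entry[1] += twisted


toy_set = ToyInteractionSet()
key_a = ("chr1", 100, 200, "chr1", 900, 1000)
key_b = ("chr2", 300, 400, "chr2", 700, 800)
toy_set.add_file("rep1.tsv", [(key_a, 5, 1), (key_b, 2, 2)])
toy_set.add_file("rep2.tsv", [(key_a, 3, 0)])
print(len(toy_set.counts), toy_set.counts[key_a])
```

In this toy example, the interaction with the key ``key_a`` occurs in both inputs, so the set contains two interactions and the counts for ``key_a`` are summed per read pair type.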

An object of the class ``DiachromaticInteractionSet`` keeps track of which interaction files have been read and how many interactions were in each file. This information is stored in a dictionary that can be queried as follows:

In [4]:
read_file_info_dict = interaction_set.get_read_file_info_dict()
pandas.DataFrame(read_file_info_dict)

Unnamed: 0,I_FILE,I_NUM,I_UNION_NUM
0,../tests/data/test_04/diachromatic_interaction...,50,50


Alternatively, the same information can be returned as a formatted string:

In [5]:
print(interaction_set.get_read_file_info_report())

[INFO] Report on reading files:
	[INFO] Read interaction data from 1 files:
		[INFO] 50 interactions from ../tests/data/test_04/diachromatic_interaction_file.tsv (Union: 50)
	[INFO] The union of all interactions has 50 interactions.
[INFO] End of report.



Such strings can be used to compile the output of scripts in which different modules are used.

### Warnings

The function ``DiachromaticInteractionSet.parse_file`` can be used multiple times for a given interaction set. However, it is not possible to add the same file more than once. If this were allowed, no new interactions would be added to the set, but the simple and twisted read pair counts would still be added up. An interaction set therefore keeps track of the names (including the path) of the files that have already been read. If a file is read again, a warning is issued and the interaction set remains unchanged. Execute the following cell to trigger the warning.

In [6]:
interaction_set.parse_file(i_file="../tests/data/test_04/diachromatic_interaction_file.tsv", verbose=False)

Filename: ../tests/data/test_04/diachromatic_interaction_file.tsv
Won't add interactions from this file to the interaction set.
  "Won't add interactions from this file to the interaction set.")


## Calculate P-values and divide into directed and undirected interactions

We test the imbalance of simple and twisted read pairs within individual interactions using a two-sided binomial test. The function ``evaluate_and_categorize_interactions`` can be used to calculate P-values for all interactions in a ``DiachromaticInteractionSet`` object. The function takes a P-value threshold as an argument. Interactions that do not have enough read pairs to reach a P-value less than or equal to the threshold are discarded, i.e. they are removed from the interaction set. All other interactions are categorized as undirected (``UI``) if their P-value is greater than the threshold, and as directed (``DI``) if their P-value is less than or equal to the threshold.
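To make the categorization rule concrete, here is a minimal sketch in plain Python. It assumes that the two-sided P-value is obtained by doubling the lower tail of a binomial distribution with success probability 0.5 (capped at 1). The function names are hypothetical, and this is not necessarily how ``diachrscripts`` computes the values internally.

```python
from math import comb


def two_sided_binomial_pval(simple, twisted):
    """Assumed test: twice the lower binomial tail for p = 0.5, capped at 1."""
    n = simple + twisted
    k = min(simple, twisted)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def min_read_pairs(pval_thresh):
    """Smallest read pair number that can reach the threshold (threshold > 0)."""
    n = 1
    # The most extreme split (n:0) gives the smallest possible P-value for n
    while two_sided_binomial_pval(n, 0) > pval_thresh:
        n += 1
    return n


def categorize(simple, twisted, pval_thresh):
    """Return 'discarded', 'DI' or 'UI' following the rule described above."""
    if simple + twisted < min_read_pairs(pval_thresh):
        return "discarded"
    if two_sided_binomial_pval(simple, twisted) <= pval_thresh:
        return "DI"
    return "UI"


print(min_read_pairs(0.01), two_sided_binomial_pval(8, 0))
print(categorize(100, 1, 0.01), categorize(5, 5, 0.01), categorize(4, 3, 0.01))
```

Under this assumption, a threshold of 0.01 requires at least 8 read pairs, because the most extreme split of 8 read pairs has a P-value of 0.0078125, whereas 7 read pairs can reach 0.015625 at best.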

In [7]:
pval_thresh = 0.01
eval_cat_info_dict = interaction_set.evaluate_and_categorize_interactions(pval_thresh=pval_thresh, verbose=False)

The function ``evaluate_and_categorize_interactions`` returns a dictionary that contains information about the evaluation and categorization of interactions.

In [8]:
pandas.DataFrame(eval_cat_info_dict)

Unnamed: 0,PVAL_THRESH,MIN_RP,MIN_RP_PVAL,N_PROCESSED,N_DISCARDED,N_UNDIRECTED,N_DIRECTED
0,0.01,8,0.007813,50,4,28,18


Alternatively, the same information can be retrieved as a formatted string.

In [9]:
print(interaction_set.get_eval_cat_info_report())

[INFO] Report on evaluation and categorization interactions:
	[INFO] P-value threshold: 0.0100000
	[INFO] Minimum number of read pairs required for significance: 8
	[INFO] Corresponding largest P-value: 0.0078125
	[INFO] Processed interactions: 50
	[INFO] Discarded interactions: 4
	[INFO] Not significant interactions (UI): 28
	[INFO] Significant interactions (DI): 18
[INFO] End of report.



The information can also be retrieved as a table that consists of only two tab-separated lines: a header line and a line with values.

In [10]:
print(interaction_set.get_eval_cat_info_table_row())

OUT_PREFIX	PVAL_THRESH	MIN_RP	MIN_RP_PVAL	N_PROCESSED	N_DISCARDED	N_UNDIRECTED	N_DIRECTED
None	0.0100000	8	0.0078125	50	4	28	18



This type of output is useful if you want to compile the output from different datasets in one table.
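Compiling such rows could look roughly like the sketch below, which uses only the standard library. The helper ``combine_table_rows`` is hypothetical, the header is shortened to three columns, and the two input strings (including the ``OUT_PREFIX`` values ``A`` and ``B`` and the second row's counts) are made up for illustration.

```python
import csv
import io


def combine_table_rows(table_rows):
    """Merge two-line, tab-separated tables into one header and many rows."""
    header = None
    rows = []
    for text in table_rows:
        lines = list(csv.reader(io.StringIO(text.strip("\n")), delimiter="\t"))
        if header is None:
            header = lines[0]
        elif lines[0] != header:
            raise ValueError("Header lines differ between tables")
        rows.append(lines[1])
    return header, rows


# Illustrative strings with a shortened header; the values are made up
row_1 = "OUT_PREFIX\tPVAL_THRESH\tN_DIRECTED\nA\t0.0100000\t18\n"
row_2 = "OUT_PREFIX\tPVAL_THRESH\tN_DIRECTED\nB\t0.0100000\t7\n"
header, rows = combine_table_rows([row_1, row_2])
print("\t".join(header))
for row in rows:
    print("\t".join(row))
```

The header is kept once and checked for consistency, so rows produced with different column layouts are not silently mixed.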

### Warnings

An interaction set can be evaluated and categorized multiple times. However, if a larger P-value threshold is used than before, interactions may already have been discarded because they did not have enough read pairs at the smaller threshold. Therefore, an interaction set can only be re-evaluated and re-categorized if the new threshold is smaller than the previous one. Otherwise, a warning is issued and the interaction set remains unchanged. Execute the cell below to trigger a warning.

In [11]:
pval_thresh = 0.02
eval_cat_info_dict = interaction_set.evaluate_and_categorize_interactions(pval_thresh=pval_thresh, verbose=False)

Nothing is done. Interaction set remains unchanged.


## Select undirected reference interactions

We use a binomial test with a fixed P-value threshold to categorize interactions as directed or undirected. The power of the test depends on the total number of read pairs per interaction, which differs between interactions. The test has lower power for interactions with few read pairs. In order to compare directed with undirected interactions regardless of the number of read pairs per interaction, we select a comparison set of reference interactions (``UIR``) from the undirected interactions (``UI``) that is comparable to the directed interactions (``DI``) with regard to the number of read pairs per interaction. Interactions also differ with regard to the enrichment states of the digests involved (``E`` stands for *enriched* and ``N`` for *not enriched*). Since it can be assumed that the enrichment status of interactions (``NN``, ``NE``, ``EN`` or ``EE``) affects the number of read pairs per interaction, we carry out the selection separately for each enrichment category. For a given interaction set, the selection of reference interactions can be performed with the function ``DiachromaticInteractionSet.select_reference_interactions``:

In [12]:
select_ref_report_dict = interaction_set.select_reference_interactions(verbose=True)

[INFO] Select reference interactions ...
	[INFO] First pass: Count directed interactions for different read pair counts ...
	[INFO] Second pass: Select undirected reference interactions for different read pair counts ...
[INFO] ...done.


In a first pass, the function determines the number of directed interactions for each read pair number per interaction. In a second pass, the function tries to select the same number of undirected reference interactions for each read pair number.

It can happen that there is no undirected reference interaction for a certain number of read pairs per interaction. However, because there are many more undirected than directed interactions, such cases are rare. To keep track of such cases, the function ``select_reference_interactions`` returns a dictionary that contains, for each enrichment category, the numbers of directed, undirected and undirected reference interactions as well as the number of missing reference interactions. We also use this dictionary to test the function.
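The two passes can be sketched as follows. This is only an illustration under the assumption that references are matched on the exact read pair number within each enrichment category; the function ``select_references``, the tuple layout and the toy data are hypothetical and do not reproduce the ``diachrscripts`` code.

```python
from collections import Counter


def select_references(interactions):
    """interactions: list of (enrichment, read_pair_number, category) tuples."""
    report = {enr: {"DI": 0, "UIR": 0, "M_UIR": 0, "UI": 0}
              for enr in ("NN", "NE", "EN", "EE")}

    # First pass: count directed interactions per enrichment state and read pair number
    open_slots = Counter()
    for enr, rp_num, cat in interactions:
        if cat == "DI":
            open_slots[(enr, rp_num)] += 1
            report[enr]["DI"] += 1

    # Second pass: turn matching undirected interactions into references
    categories = []
    for enr, rp_num, cat in interactions:
        if cat == "UI" and open_slots[(enr, rp_num)] > 0:
            open_slots[(enr, rp_num)] -= 1
            cat = "UIR"
        if cat != "DI":
            report[enr][cat] += 1
        categories.append(cat)

    # Slots that are still open correspond to missing reference interactions
    for (enr, _), missing in open_slots.items():
        report[enr]["M_UIR"] += missing
    return categories, report


toy = [("NE", 101, "DI"), ("NE", 104, "DI"), ("NE", 101, "UI"),
       ("NE", 50, "UI"), ("NN", 20, "DI"), ("NN", 20, "UI")]
categories, report = select_references(toy)
print(categories)
print(report["NE"])
```

In the toy data, the directed ``NE`` interaction with 104 read pairs has no undirected counterpart with the same read pair number, so it is counted under ``M_UIR``.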

The returned dictionary is nested, with the first key being the enrichment status (``NN``, ``NE``, ``EN`` or ``EE``) and the second key being the interaction category (``DI``, ``UIR``, ``M_UIR`` or ``UI``). Execute the cell below to print out the interaction counts for the enrichment category ``NN``.

In [13]:
df_nn = pandas.DataFrame(select_ref_report_dict['NN']).transpose()
df_nn.columns = ['Interaction counts for NN']
df_nn

Unnamed: 0,Interaction counts for NN
DI,3
UIR,3
M_UIR,0
UI,3


Within ``NN``, we have three directed interactions, and an undirected reference could be selected for each of them. In addition, there are three undirected interactions that were not selected as references.

Now execute the cell below to print out the interaction counts for the enrichment category ``NE``.

In [14]:
df_ne = pandas.DataFrame(select_ref_report_dict['NE']).transpose()
df_ne.columns = ['Interaction counts for NE']
df_ne

Unnamed: 0,Interaction counts for NE
DI,4
UIR,3
M_UIR,1
UI,3


Within ``NE``, we have four directed interactions, and a reference could be selected for only three of them. In addition, there are three undirected interactions that were not selected as references. Use the keys ``EN`` and ``EE`` to print out the interaction counts for the two other enrichment categories.

Alternatively, the same information can be retrieved as a formatted string.

In [15]:
print(interaction_set.get_select_ref_info_report())

[INFO] Report on selection of undirected reference interactions:
	[INFO] Numbers of directed interactions
		[INFO] Interactions in NN: 3
		[INFO] Interactions in NE: 4
		[INFO] Interactions in EN: 5
		[INFO] Interactions in EE: 6
		[INFO] Total: 18
	[INFO] Numbers of undirected reference interactions
		[INFO] Interactions in NN: 3
		[INFO] Interactions in NE: 3
		[INFO] Interactions in EN: 5
		[INFO] Interactions in EE: 5
		[INFO] Total: 16
	[INFO] Numbers of missing undirected reference interactions
		[INFO] Interactions in NN: 0
		[INFO] Interactions in NE: 1
		[INFO] Interactions in EN: 0
		[INFO] Interactions in EE: 1
		[INFO] Total: 2
	[INFO] Numbers undirected interactions
		[INFO] Interactions in NN: 3
		[INFO] Interactions in NE: 3
		[INFO] Interactions in EN: 3
		[INFO] Interactions in EE: 3
		[INFO] Total: 12
[INFO] End of report.



In this type of output, it is easy to see that no reference could be selected for two directed interactions, one in ``NE`` and one in ``EE``. This corresponds to the result that we expect for this test dataset.

The information can also be retrieved as a table that consists of only two tab-separated lines: a header line and a line with values.

In [16]:
print(interaction_set.get_select_ref_info_table_row())

OUT_PREFIX	DI_NN	DI_NE	DI_EN	DI_EE	UIR_NN	UIR_NE	UIR_EN	UIR_EE	M_UIR_NN	M_UIR_NE	M_UIR_EN	M_UIR_EE	UI_NN	UI_NE	UI_EN	UI_EE
None	3	4	5	6	3	3	5	5	0	1	0	1	3	3	3	3



This type of output is useful if you want to compile the output from different datasets in one table.

## Write evaluated and categorized interactions to a Diachromatic interaction file

As soon as interactions have been read into a ``DiachromaticInteractionSet`` object, they can be written at any time to a file in Diachromatic interaction format using the function ``write_diachromatic_interaction_file``. The function expects the path to an output file as an argument.

In [17]:
target_file = "evaluated_and_categorized_interactions.tsv.gz"
write_file_info_dict = interaction_set.write_diachromatic_interaction_file(target_file=target_file)

Like the other functions of the class ``DiachromaticInteractionSet``, this function returns a dictionary with information about the operation performed.

In [18]:
pandas.DataFrame(write_file_info_dict)

Unnamed: 0,TARGET_FILE,REQUIRED_REPLICATES,N_INCOMPLETE_DATA,N_COMPLETE_DATA
0,evaluated_and_categorized_interactions.tsv.gz,1,0,46


The dictionary contains the following information:
- ``TARGET_FILE``: Output file in Diachromatic interaction format
- ``REQUIRED_REPLICATES``: Chosen number of required replicates (defaults to 1)
- ``N_INCOMPLETE_DATA``: Number of interactions that occur in fewer replicates than required
- ``N_COMPLETE_DATA``: Number of interactions that occur in the required number of replicates and were written to the output file

If an interaction set has previously been evaluated and categorized, as in this case, then the output file will contain two additional columns on the right. Column 10 then contains the negative of the natural logarithm of the P-value and column 11 the interaction category, either ``DI`` or ``UI``. If reference interactions have been selected, then column 11 can also contain ``UIR``. Execute the cell below to view the contents of the generated file.
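As a plausibility check for column 10, the value can be recomputed from the read pair counts in column 9. The sketch below assumes that the P-value is twice the lower tail of a binomial distribution with success probability 0.5 and works in log space to avoid underflow; the function ``neg_ln_pval`` is a hypothetical helper and not part of ``diachrscripts``.

```python
from math import comb, log


def neg_ln_pval(rp_counts):
    """Negative natural log of the assumed two-sided binomial P-value.

    rp_counts: string of the form 'simple:twisted', as in column 9.
    """
    simple, twisted = (int(x) for x in rp_counts.split(":"))
    n = simple + twisted
    k = min(simple, twisted)
    # Number of outcomes in the lower tail; doubling it makes the test two-sided
    tail_count = sum(comb(n, i) for i in range(k + 1))
    # -ln(2 * tail_count / 2**n), floored at 0 because a P-value cannot exceed 1
    return max(0.0, n * log(2) - log(2 * tail_count))


print(round(neg_ln_pval("100:1"), 2))
```

For the read pair counts ``100:1`` this sketch gives approximately 64.69, which agrees with the value shown for the first interaction in the output of the next cell.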

In [19]:
df_interaction_file = pandas.read_csv('evaluated_and_categorized_interactions.tsv.gz', compression='gzip', sep='\t', header=None)
df_interaction_file.columns = ['CHR_D1','STA_D1','END_D1','ENR_CAT_D1',
                               'CHR_D2','STA_D2','END_D2','ENR_CAT_D2',
                               'RP_S:RP_T',
                               'NLN_PVAL',
                               'I_CAT']
df_interaction_file

Unnamed: 0,CHR_D1,STA_D1,END_D1,ENR_CAT_D1,CHR_D2,STA_D2,END_D2,ENR_CAT_D2,RP_S:RP_T,NLN_PVAL,I_CAT
0,chr14,43059116,43059494,N,chr14,43101212,43101810,N,100:1,64.69,DI
1,chr8,129042054,129044258,N,chr8,129121269,129121986,N,100:2,61.44,DI
2,chr15,73467156,73468652,N,chr15,73526903,73528438,N,100:3,58.59,DI
3,chr17,72411026,72411616,N,chr17,72712662,72724357,E,100:1,64.69,DI
4,chr18,38724804,38726198,N,chr18,76794986,76803172,E,100:2,61.44,DI
5,chr11,114362648,114362686,N,chr11,114396073,114404234,E,100:3,58.59,DI
6,chr15,56158017,56158267,N,chr15,56462978,56465983,E,100:4,56.01,DI
7,chr14,34714080,34716362,E,chr14,50135355,50139051,N,100:1,64.69,DI
8,chr1,91022201,91023797,E,chr1,116561813,116566655,N,100:2,61.44,DI
9,chr1,15681566,15697108,E,chr1,19411358,19417940,N,100:3,58.59,DI
