# Evaluate and categorize interactions

In this notebook, we describe how to use the modules of ``diachrscripts`` to calculate P-values for interactions and, based on these, to divide them into the categories directed (``DI``) and undirected (``UI``). We also show how to select a set of undirected reference interactions (``UIR``) from the undirected interactions that are comparable to the directed interactions in terms of the enrichment status of their digests and the number of read pairs per interaction.

## Setting up the notebook

In [9]:
import sys
import os
import numpy
import pandas
sys.path.append("..")  # make the local diachr package importable before importing it
from diachr import DiachromaticInteractionSet

## Read a Diachromatic interaction file

A central class in ``diachrscripts`` is ``DiachromaticInteractionSet``. It contains functions for reading, evaluating, categorizing and writing interactions. This class is instantiated as follows:

In [77]:
interaction_set = DiachromaticInteractionSet()

The function ``DiachromaticInteractionSet.parse_file`` can be used to read interactions from Diachromatic interaction files:

In [78]:
interaction_set.parse_file(i_file="../tests/data/test_04/diachromatic_interaction_file.tsv", verbose=False)

For a given instance, this function can also be called multiple times, with the interactions remaining unique with regard to their coordinates. If interactions with identical coordinates are read, the numbers of simple and twisted read pairs are added up separately (see combining interactions). For this notebook, we only read in one interaction file that is also used for tests.
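The merging behaviour described above can be pictured with a small, self-contained sketch. This is not the code used by ``DiachromaticInteractionSet``; the function name ``combine_interactions``, the tuple layout and the toy coordinates are assumptions made for illustration only.

```python
from collections import defaultdict

def combine_interactions(*interaction_lists):
    """Merge interactions by coordinates, summing simple and twisted counts separately."""
    combined = defaultdict(lambda: [0, 0])  # coordinate key -> [simple, twisted]
    for interactions in interaction_lists:
        for key, n_simple, n_twisted in interactions:
            combined[key][0] += n_simple
            combined[key][1] += n_twisted
    return {key: tuple(counts) for key, counts in combined.items()}

# Hypothetical toy data: (coordinate key, simple read pairs, twisted read pairs)
file_1 = [(("chr1", 100, 200, "chr1", 900, 1000), 5, 1),
          (("chr2", 300, 400, "chr2", 700, 800), 2, 2)]
file_2 = [(("chr1", 100, 200, "chr1", 900, 1000), 3, 4)]

union = combine_interactions(file_1, file_2)
print(len(union))                                    # 2
print(union[("chr1", 100, 200, "chr1", 900, 1000)])  # (8, 5)
```

Three interactions are read in total, but the union contains only two, because the interaction shared by both toy files is stored once with summed counts.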

An object of the class ``DiachromaticInteractionSet`` keeps track of which interaction files have been read and how many interactions were in each file. This information is stored in a dictionary that can be queried as follows:

In [82]:
read_file_info_dict = interaction_set.get_read_file_info_dict()
pandas.DataFrame(read_file_info_dict)

Unnamed: 0,I_FILE,I_NUM
0,../tests/data/test_04/diachromatic_interaction...,50
1,UNION,46


Alternatively, the same information can be returned in form of a string that is formatted as follows:

In [80]:
print(interaction_set.get_read_file_info_report())

[INFO] Report on reading files:
	[INFO] Read interaction data from 1 files:
		[INFO] 50 interactions from ../tests/data/test_04/diachromatic_interaction_file.tsv
	[INFO] The union of all interactions has 50 interactions.
[INFO] End of report.



## Calculate P-values and divide into directed and undirected interactions

We test the imbalance of simple and twisted read pairs within individual interactions using a two-sided binomial test. The function ``evaluate_and_categorize_interactions`` can be used to calculate P-values for all interactions in a ``DiachromaticInteractionSet`` object. Note that the threshold must be passed as the negative of the natural logarithm of the P-value threshold, not as the P-value itself.
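To make the test concrete, here is a minimal sketch of a two-sided binomial P-value and its negative natural logarithm, written in plain Python. It assumes a null probability of 0.5 for simple versus twisted read pairs and doubles the smaller tail; the function name ``binomial_nln_pval`` is hypothetical, and the implementation inside ``diachrscripts`` may differ in detail.

```python
import math

def binomial_nln_pval(n_simple, n_twisted):
    """Negative natural log of a two-sided binomial P-value (null: p = 0.5)."""
    n = n_simple + n_twisted
    k = min(n_simple, n_twisted)
    # Under p = 0.5 the distribution is symmetric, so the two-sided P-value
    # is twice the lower tail P(X <= k), capped at 1.
    tail = sum(math.comb(n, i) for i in range(k + 1))
    pval = min(1.0, 2 * tail / 2 ** n)
    # Very small P-values underflow to 0.0, which corresponds to infinity here
    return math.inf if pval == 0.0 else -math.log(pval)

threshold = -math.log(0.01)
print(binomial_nln_pval(8, 0) > threshold)  # True
print(binomial_nln_pval(7, 0) > threshold)  # False
```

Working with the negative logarithm means that larger values indicate stronger imbalance, so an interaction counts as directed when its value exceeds the threshold.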

In [83]:
# Negative natural logarithm of the P-value threshold 0.01
nln_pval_thresh = -numpy.log(0.01)
rate_and_cat_report_dict = interaction_set.evaluate_and_categorize_interactions(nln_pval_thresh=nln_pval_thresh, verbose=False)

The function ``evaluate_and_categorize_interactions`` returns a dictionary ...

In [87]:
pandas.DataFrame(rate_and_cat_report_dict)

Unnamed: 0,NLN_PVAL_THRESH,MIN_RP,MIN_RP_PVAL,N_PROCESSED,N_DISCARDED,N_UNDIRECTED,N_DIRECTED
0,4.60517,8,0.007813,46,0,28,18
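The reported values ``MIN_RP`` and ``MIN_RP_PVAL`` are consistent with the following reading, which is an assumption and not taken from the ``diachrscripts`` source: ``MIN_RP`` is the smallest total number of read pairs for which the most extreme split (all simple or all twisted) can reach the P-value threshold, and ``MIN_RP_PVAL`` is the corresponding P-value. A minimal sketch under that assumption, with the hypothetical helper ``min_read_pairs``:

```python
def min_read_pairs(pval_thresh):
    """Smallest n whose most extreme two-sided binomial P-value is <= pval_thresh."""
    if not 0.0 < pval_thresh <= 1.0:
        raise ValueError("pval_thresh must be in (0, 1]")
    n = 1
    # For n read pairs that are all simple (or all twisted), the two-sided
    # P-value under p = 0.5 is 2 * 0.5**n (capped at 1).
    while min(1.0, 2 * 0.5 ** n) > pval_thresh:
        n += 1
    return n, min(1.0, 2 * 0.5 ** n)

print(min_read_pairs(0.01))  # (8, 0.0078125)
```

With seven read pairs the smallest attainable P-value is 0.015625, which is above 0.01, so interactions with fewer than eight read pairs cannot be significant at this threshold.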


## Select undirected reference interactions

In [6]:
select_ref_report_dict = interaction_set.select_reference_interactions(verbose=True)

[INFO] Select reference interactions ...
	[INFO] First pass: Count directed interactions for different read pair counts ...
	[INFO] Second pass: Select undirected reference interactions for different read pair counts ...
[INFO] ...done.


Use ``pandas`` to print out the dictionary returned by the function ``select_reference_interactions``.

In [43]:
# Transpose once, then slice four rows per enrichment category
df_report = pandas.DataFrame(select_ref_report_dict).transpose()
df_nn = df_report[:4]
df_ne = df_report[4:8]
df_en = df_report[8:12]
df_ee = df_report[12:16]

The first four columns contain the interaction counts for the enrichment category ``NN``. In this category, we have three directed interactions (``DI_NN``) for which three undirected reference interactions (``UIR_NN``) could be selected. Since a reference interaction could be selected for all directed interactions, there are no missing reference interactions (``M_UIR_NN``). Finally, we have three undirected interactions (``UI_NN``).

In [49]:
df_nn.columns = ['Interaction counts']
df_nn

Unnamed: 0,Interaction counts
DI_NN,3
UIR_NN,3
M_UIR_NN,0
UI_NN,3


Columns five to eight contain the interaction counts for the category ``NE``. In this category, there is one directed interaction for which no matching reference interaction could be selected (``M_UIR_NE``). 

In [50]:
df_ne.columns = ['Interaction counts']
df_ne

Unnamed: 0,Interaction counts
DI_NE,4
UIR_NE,3
M_UIR_NE,1
UI_NE,3


Columns nine to twelve contain the interaction counts for the category ``EN``.

In [51]:
df_en.columns = ['Interaction counts']
df_en

Unnamed: 0,Interaction counts
DI_EN,5
UIR_EN,5
M_UIR_EN,0
UI_EN,3


Columns thirteen to sixteen contain the interaction counts for the category ``EE``. In this category, there is again one directed interaction for which no matching reference interaction could be selected (``M_UIR_EE``).

In [52]:
df_ee.columns = ['Interaction counts']
df_ee

Unnamed: 0,Interaction counts
DI_EE,6
UIR_EE,5
M_UIR_EE,1
UI_EE,3
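The two passes named in the log output, and the way missing reference interactions arise, can be sketched as follows. This is an illustrative approximation, assuming that matching is done on enrichment category and exact read pair count; the dictionary keys ``i_cat``, ``enr_cat`` and ``rp_total`` are hypothetical, and the actual selection in ``diachrscripts`` may work differently.

```python
from collections import Counter

def select_reference_interactions(interactions):
    """Mark undirected interactions as reference ('UIR') in two passes.

    Each interaction is a dict with the keys 'i_cat' ('DI' or 'UI'),
    'enr_cat' ('NN', 'NE', 'EN' or 'EE') and 'rp_total' (read pair count).
    Returns the number of unmatched directed interactions per enrichment category.
    """
    # First pass: count directed interactions per (enrichment category, read pair count)
    needed = Counter((i["enr_cat"], i["rp_total"])
                     for i in interactions if i["i_cat"] == "DI")
    # Second pass: select one matching undirected interaction per directed one
    for i in interactions:
        key = (i["enr_cat"], i["rp_total"])
        if i["i_cat"] == "UI" and needed[key] > 0:
            i["i_cat"] = "UIR"
            needed[key] -= 1
    # Counts that remain belong to directed interactions without a reference
    missing = Counter()
    for (enr_cat, _), n in needed.items():
        missing[enr_cat] += n
    return missing

# Hypothetical toy data
toy = [
    {"i_cat": "DI", "enr_cat": "NE", "rp_total": 20},
    {"i_cat": "DI", "enr_cat": "NE", "rp_total": 30},
    {"i_cat": "UI", "enr_cat": "NE", "rp_total": 20},
    {"i_cat": "UI", "enr_cat": "NE", "rp_total": 25},
]
missing = select_reference_interactions(toy)
print([i["i_cat"] for i in toy])  # ['DI', 'DI', 'UIR', 'UI']
print(missing["NE"])              # 1
```

In the toy data, the directed interaction with 30 read pairs has no undirected counterpart with the same count, so it is reported as missing, analogous to ``M_UIR_NE`` and ``M_UIR_EE`` above.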


## Write evaluated and categorized interactions to a Diachromatic interaction file

In [67]:
write_report_dict = interaction_set.write_diachromatic_interaction_file(target_file="evaluated_and_categorized_interactions.tsv.gz")

Use ``pandas`` to print out the dictionary returned by the function ``write_diachromatic_interaction_file``.

In [68]:
pandas.DataFrame(write_report_dict)

Unnamed: 0,TARGET_FILE,INTERACTIONS_NUMBERS,REQUIRED_REPLICATES,HAS_ALL_DATA,INCOMPLETE_DATA
0,evaluated_and_categorized_interactions.tsv.gz,"[50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50, 50]",1,50,0


In [69]:
df_interaction_file = pandas.read_csv('evaluated_and_categorized_interactions.tsv.gz', compression='gzip', sep='\t', header=None)
df_interaction_file.columns = ['CHR_D1', 'STA_D1', 'END_D1', 'ENR_CAT_D1',
                               'CHR_D2', 'STA_D2', 'END_D2', 'ENR_CAT_D2',
                               'RP_S:RP_T',
                               'NLN_PVAL',
                               'I_CAT']
df_interaction_file

Unnamed: 0,CHR_D1,STA_D1,END_D1,ENR_CAT_D1,CHR_D2,STA_D2,END_D2,ENR_CAT_D2,RP_S:RP_T,NLN_PVAL,I_CAT
0,chr14,43059116,43059494,N,chr14,43101212,43101810,N,1200:12,inf,DI
1,chr8,129042054,129044258,N,chr8,129121269,129121986,N,1200:24,732.07,DI
2,chr15,73467156,73468652,N,chr15,73526903,73528438,N,1200:36,695.93,DI
3,chr17,72411026,72411616,N,chr17,72712662,72724357,E,1200:12,inf,DI
4,chr18,38724804,38726198,N,chr18,76794986,76803172,E,1200:24,732.07,DI
5,chr11,114362648,114362686,N,chr11,114396073,114404234,E,1200:36,695.93,DI
6,chr15,56158017,56158267,N,chr15,56462978,56465983,E,1200:48,663.7,DI
7,chr14,34714080,34716362,E,chr14,50135355,50139051,N,1200:12,inf,DI
8,chr1,91022201,91023797,E,chr1,116561813,116566655,N,1200:24,732.07,DI
9,chr1,15681566,15697108,E,chr1,19411358,19417940,N,1200:36,695.93,DI
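For downstream analyses it can be convenient to split the combined ``RP_S:RP_T`` field into two integers. The following sketch parses a single tab-separated line using the eleven column names assigned above; the helper ``parse_interaction_line`` is hypothetical, and the example line is assembled from row 1 of the table, assuming the file stores the values as displayed.

```python
def parse_interaction_line(line):
    """Split one tab-separated interaction line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 fields, got {len(fields)}")
    # The ninth field holds simple and twisted read pair counts as 'RP_S:RP_T'
    n_simple, n_twisted = (int(x) for x in fields[8].split(":"))
    return {
        "CHR_D1": fields[0], "STA_D1": int(fields[1]), "END_D1": int(fields[2]),
        "ENR_CAT_D1": fields[3],
        "CHR_D2": fields[4], "STA_D2": int(fields[5]), "END_D2": int(fields[6]),
        "ENR_CAT_D2": fields[7],
        "RP_S": n_simple, "RP_T": n_twisted,
        "NLN_PVAL": float(fields[9]),  # float() also accepts the string 'inf'
        "I_CAT": fields[10],
    }

line = "chr8\t129042054\t129044258\tN\tchr8\t129121269\t129121986\tN\t1200:24\t732.07\tDI\n"
record = parse_interaction_line(line)
print(record["RP_S"], record["RP_T"], record["I_CAT"])  # 1200 24 DI
```

The value ``inf`` in the ``NLN_PVAL`` column, as in row 0, parses to floating-point infinity; it arises when the P-value is too small to be represented and is rounded to zero before the logarithm is taken.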
