# Determine a P-value threshold at a chosen FDR threshold

In this notebook, we describe how to determine a P-value threshold for a given dataset so that the FDR is kept below a chosen threshold. For a given dataset, we determine the number of significant interactions for ascending P-value thresholds.
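
The general idea can be sketched independently of `diachr`. The following is a minimal sketch, assuming that the FDR at a given P-value threshold is estimated as the number of significant interactions after randomization divided by the number of significant interactions in the observed data. The function `pval_thresh_at_fdr` and the toy P-values are hypothetical and not part of `diachr`; the `Randomize` class may estimate the FDR differently.

```python
import numpy as np

def pval_thresh_at_fdr(observed_pvals, randomized_pvals, fdr_thresh, pval_max, pval_step):
    """Return (P-value threshold, estimated FDR) for the largest threshold
    whose estimated FDR is below fdr_thresh, or None if there is none.

    Hypothetical helper. Assumption: FDR is estimated as
    (# significant randomized) / (# significant observed).
    """
    observed_pvals = np.asarray(observed_pvals)
    randomized_pvals = np.asarray(randomized_pvals)
    n_steps = int(round(pval_max / pval_step))
    best = None
    # Go through ascending P-value thresholds
    for k in range(1, n_steps + 1):
        p_thresh = k * pval_step
        n_obs = int((observed_pvals <= p_thresh).sum())
        n_rand = int((randomized_pvals <= p_thresh).sum())
        if n_obs == 0:
            continue  # FDR is undefined without significant interactions
        fdr = n_rand / n_obs
        if fdr < fdr_thresh:
            best = (p_thresh, fdr)
    return best

# Toy P-values (made up for illustration only)
observed = [0.001] * 40 + [0.004] * 20 + [0.021] * 40
randomized = [0.004] * 2 + [0.021] * 20 + [0.5] * 78
result = pval_thresh_at_fdr(observed, randomized, fdr_thresh=0.05,
                            pval_max=0.05, pval_step=0.0025)
print(result)
```

In this sketch, the estimated FDR does not have to increase monotonically with the threshold, so the largest threshold that passes is kept rather than stopping at the first one that fails.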

## Setting up the notebook

In [6]:
import sys
import os
from numpy import arange
import pandas
sys.path.append("..")
from diachr import DiachromaticInteractionSet
from diachr import Randomize

In this notebook, we use the following classes:
- ``DiachromaticInteractionSet``
- ``Randomize``

## Test file

We prepare a test file that contains the same number of interactions for consecutive P-value intervals. To create it, the following parameters must be specified:
- an interaction file
- a maximum P-value
- a P-value step size
- a requested number of interactions per interval

Make sure that the interaction file has not previously been categorized with a P-value threshold that is smaller than the specified maximum P-value. Otherwise, interactions that do not have enough read pairs to be significant at the smaller threshold may have already been discarded.

In [18]:
interaction_file = '../tests/data/test_03/MK_0.06_evaluated_and_categorized_interactions.tsv.gz'
p_value_max = 0.05
p_value_step = 0.0025
i_count_per_range = 10

First, we load the interaction file into a ``DiachromaticInteractionSet``.

In [6]:
interaction_set_test = DiachromaticInteractionSet()
interaction_set_test.parse_file(interaction_file, verbose=False)
read_file_info_report = interaction_set_test.get_read_file_info_report()
print(read_file_info_report)

[INFO] Report on reading files:
	[INFO] Read interaction data from 1 files:
		[INFO] 9000910 interactions from: 
			[INFO] ../tests/data/test_03/MK_0.06_evaluated_and_categorized_interactions.tsv.gz
			[INFO] Set size: 9000910
	[INFO] The interaction set has 9000910 interactions.
[INFO] End of report.



Then we evaluate and categorize the interactions, using ``p_value_max`` as the threshold.

In [7]:
interaction_set_test.evaluate_and_categorize_interactions(p_value_max, verbose=False)
eval_cat_info_report = interaction_set_test.get_eval_cat_info_report()
print(eval_cat_info_report)

[INFO] Report on evaluation and categorization interactions:
	[INFO] P-value threshold: 0.0500000
	[INFO] Minimum number of read pairs required for significance: 6
	[INFO] Corresponding largest P-value: 0.0312500
	[INFO] Processed interactions: 9000910
	[INFO] Discarded interactions: 0
	[INFO] Not significant interactions (UI): 8266032
	[INFO] Significant interactions (DI): 734878
[INFO] End of report.
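
The reported minimum of 6 read pairs and the corresponding largest P-value of 0.03125 are consistent with a two-sided binomial test with a success probability of 0.5, for which the smallest attainable P-value with n read pairs is 2 · 0.5^n. The sketch below reproduces these two numbers under that assumption. The function `min_read_pairs_for_significance` is hypothetical, and `diachr` may compute these values differently.

```python
def min_read_pairs_for_significance(p_thresh, n_max=1000):
    """Return (n, smallest P-value) for the smallest n that can reach p_thresh.

    Hypothetical helper. Assumption: two-sided binomial test with p=0.5,
    so the smallest P-value for n read pairs is 2 * 0.5**n (capped at 1).
    """
    for n in range(1, n_max + 1):
        smallest_pval = min(1.0, 2 * 0.5 ** n)
        if smallest_pval <= p_thresh:
            return n, smallest_pval
    raise ValueError("No n <= n_max can reach the given P-value threshold")

n_min, p_largest = min_read_pairs_for_significance(0.05)
print(n_min, p_largest)  # prints: 6 0.03125
```

Under this assumption, interactions with fewer than 6 read pairs cannot be significant at a threshold of 0.05, which is why a file that was previously categorized at a smaller threshold may lack interactions needed here.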



We use the step size together with the maximum P-value to create a list with P-value thresholds.

In [10]:
p_threshs = arange(p_value_step, p_value_max + p_value_step, p_value_step)
print(p_threshs)

[0.0025 0.005  0.0075 0.01   0.0125 0.015  0.0175 0.02   0.0225 0.025
 0.0275 0.03   0.0325 0.035  0.0375 0.04   0.0425 0.045  0.0475 0.05  ]
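
With a non-integer step, ``arange`` is sensitive to floating-point rounding, and the NumPy documentation suggests ``linspace`` for such cases. As a possible alternative, the sketch below builds the same list of thresholds from the number of steps; the variable `n_thresholds` is introduced here for illustration only.

```python
import numpy as np

p_value_max = 0.05
p_value_step = 0.0025

# Derive the number of thresholds once and round to the nearest integer,
# so that the length of the list does not depend on rounding errors
n_thresholds = int(round(p_value_max / p_value_step))

# The end point p_value_max is included
p_threshs = np.linspace(p_value_step, p_value_max, num=n_thresholds)
print(len(p_threshs))
```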


For the parameters given above, this results in a list of 20 P-value thresholds. For each of the threshold values, we go through the interaction set until we have selected the required number of interactions with P-values within the current interval.

In [20]:
# Write the selected interactions to a new Diachromatic interaction file
with open('diachromatic_fdr_test_file.tsv', 'wt') as out_fh:
    i_count = 0  # cumulative number of selected interactions
    for p_thresh in p_threshs:
        p_lower = p_thresh - p_value_step
        i_count_range = 0  # number of interactions selected for the current range
        for d_inter in interaction_set_test.interaction_list:
            # Select interactions with P-values in the range ]p_lower; p_thresh]
            if p_lower < d_inter.get_pval() <= p_thresh:
                i_count_range += 1
                i_count += 1
                out_fh.write(d_inter.get_diachromatic_interaction_line() + '\n')
            if i_count_range == i_count_per_range:
                break
        if i_count_range < i_count_per_range:
            print("[WARNING] Could not select the required number (only "
                  + str(i_count_range) + " of " + str(i_count_per_range) +
                  ") of interactions for the P-value range ]"
                  + str(p_lower) + ';' + str(p_thresh) + ']')
        print(str(p_lower) + '\t' + str(p_thresh) + '\t' + str(i_count_range) + '\t' + str(i_count))

0.0	0.0025	10	10
0.0025	0.005	10	20
0.004999999999999999	0.0075	10	30
0.0075	0.01	10	40
0.01	0.0125	10	50
0.0125	0.015000000000000001	10	60
0.014999999999999998	0.017499999999999998	10	70
0.0175	0.02	10	80
0.02	0.0225	10	90
0.0225	0.024999999999999998	10	100
0.025	0.0275	10	110
0.0275	0.03	10	120
0.030000000000000002	0.0325	10	130
0.0325	0.035	10	140
0.035	0.037500000000000006	10	150
0.0375	0.04	10	160
0.04	0.0425	10	170
0.0425	0.045000000000000005	10	180
0.045	0.0475	10	190
0.0475	0.05	10	200


The code snippet above reports, for each P-value range, the number of selected interactions within this range and the cumulative number of selected interactions for this and all previous ranges.
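
As a sanity check, the number of P-values per range can also be counted directly from a list of P-values. The following is a minimal sketch that uses ``numpy.digitize`` with ``right=True``, which assigns each value to a range that is open on the left and closed on the right, as in the selection above. The function `count_per_range` and the toy P-values are hypothetical and not part of `diachr`.

```python
import numpy as np

def count_per_range(p_values, p_value_max, p_value_step):
    """Count P-values in consecutive ranges ]lower; upper] up to p_value_max.

    Hypothetical helper for illustration. Returns the range edges and
    one count per range.
    """
    n_ranges = int(round(p_value_max / p_value_step))
    edges = np.linspace(0.0, p_value_max, num=n_ranges + 1)
    # right=True: index i means edges[i-1] < p <= edges[i];
    # index 0 means p <= 0 and index n_ranges + 1 means p > p_value_max
    idx = np.digitize(p_values, edges, right=True)
    counts = np.bincount(idx, minlength=n_ranges + 2)[1:n_ranges + 1]
    return edges, counts

# Toy P-values (made up for illustration only); 0.2 exceeds the maximum
toy_pvals = [0.001, 0.002, 0.003, 0.0049, 0.031, 0.049, 0.2]
edges, counts = count_per_range(toy_pvals, p_value_max=0.05, p_value_step=0.0025)
for lower, upper, count in zip(edges[:-1], edges[1:], counts):
    print(f"{lower:.4f}\t{upper:.4f}\t{count}")
```

For the test file written above, each range would be expected to have a count of ``i_count_per_range`` if the P-values of the selected interactions were passed to such a function.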

## Reading in a Diachromatic interaction file and evaluating and categorizing interactions

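We read the test file created above into a new ``DiachromaticInteractionSet`` and then evaluate and categorize the interactions, using a P-value threshold of 0.05.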

In [1]:
import sys
import os
from numpy import arange
import pandas
sys.path.append("..")
from diachr import DiachromaticInteractionSet
from diachr import Randomize

In [2]:
interaction_set = DiachromaticInteractionSet()
interaction_set.parse_file('diachromatic_fdr_test_file.tsv', verbose=False)
read_file_info_report = interaction_set.get_read_file_info_report()
print(read_file_info_report)

[INFO] Report on reading files:
	[INFO] Read interaction data from 1 files:
		[INFO] 200 interactions from: 
			[INFO] diachromatic_fdr_test_file.tsv
			[INFO] Set size: 200
	[INFO] The interaction set has 200 interactions.
[INFO] End of report.



In [3]:
p_value_max = 0.05
interaction_set.evaluate_and_categorize_interactions(p_value_max, verbose=False)
eval_cat_info_report = interaction_set.get_eval_cat_info_report()
print(eval_cat_info_report)

[INFO] Report on evaluation and categorization interactions:
	[INFO] P-value threshold: 0.0500000
	[INFO] Minimum number of read pairs required for significance: 6
	[INFO] Corresponding largest P-value: 0.0312500
	[INFO] Processed interactions: 200
	[INFO] Discarded interactions: 0
	[INFO] Not significant interactions (UI): 0
	[INFO] Significant interactions (DI): 200
[INFO] End of report.



## Pass interaction set to ``Randomize`` module

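We create a ``Randomize`` object from the interaction set and call ``get_pval_tresh_at_chosen_fdr_tresh`` with the chosen FDR threshold, the maximum P-value threshold and the P-value step size.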

In [4]:
randomize = Randomize(interaction_set=interaction_set)

In [5]:
chosen_fdr_thresh = 0.05
p_value_max = 0.05
p_value_step = 0.0025
randomize.get_pval_tresh_at_chosen_fdr_tresh(chosen_fdr_thresh=chosen_fdr_thresh,
                                             pval_thresh_max=p_value_max,
                                             pval_thresh_step_size=p_value_step)

[0.0025 0.005  0.0075 0.01   0.0125 0.015  0.0175 0.02   0.0225 0.025
 0.0275 0.03   0.0325 0.035  0.0375 0.04   0.0425 0.045  0.0475 0.05  ]
X		0.765	0.0025	153	200
X		0.765	0.005	153	200
X		0.765	0.0075	153	200
X		0.765	0.01	153	200
X		0.765	0.0125	153	200
X		0.765	0.015000000000000001	153	200
X		0.765	0.017499999999999998	153	200
X		0.765	0.02	153	200
X		0.765	0.0225	153	200
X		0.765	0.024999999999999998	153	200
X		0.765	0.0275	153	200
X		0.765	0.03	153	200
X		0.765	0.0325	153	200
X		0.765	0.035	153	200
X		0.765	0.037500000000000006	153	200
X		0.765	0.04	153	200
X		0.765	0.0425	153	200
X		0.765	0.045000000000000005	153	200
X		0.765	0.0475	153	200
X		0.765	0.05	153	200
