In [1]:
import pathlib
import pandas as pd
import numpy as np

## Configuration and Inputs

We start by defining the inputs and configuration that the rest of the notebook relies on.

**Input**: pipeline inputs (raw data) directory:

In [2]:
RAW_DATA_DIR = pathlib.Path('../../data/raw')

**Input**: metadata table listing the ENCODE files (from the pipeline's raw data):

In [3]:
ENCODE_METADATA_TSV = RAW_DATA_DIR / 'encode/encode_metadata.2021-11-05.tsv.gz'
assert ENCODE_METADATA_TSV.is_file()

**Input**: blacklist file for hg38

In [4]:
BLACKLIST = RAW_DATA_DIR / 'blacklist/hg38-blacklist.v2.bed.gz'

**Input**: chromsizes file for `hg38`, only the canonical chromosomes

In [5]:
CHROMSIZES = RAW_DATA_DIR / 'genome/hg38.filtered.chrom.sizes'

Get the set of valid chromosomes from the chromsizes file:

In [6]:
VALID_CHROMOSOMES = set()

with open(CHROMSIZES, 'rt') as f:
    for line in f:
        VALID_CHROMOSOMES.add(line.partition('\t')[0])
VALID_CHROMOSOMES

{'chr1',
 'chr10',
 'chr11',
 'chr12',
 'chr13',
 'chr14',
 'chr15',
 'chr16',
 'chr17',
 'chr18',
 'chr19',
 'chr2',
 'chr20',
 'chr21',
 'chr22',
 'chr3',
 'chr4',
 'chr5',
 'chr6',
 'chr7',
 'chr8',
 'chr9',
 'chrX'}

**Config**: the configuration directives required for computations, as they're defined in the [main config file](../../config/config.yaml).

In [7]:
BIN_SIZE = 1000  
STATISTICS_PSEUDOCOUNT = 100
# Note that this has no effect if kendall correlation is used
MIN_PERIODS_FOR_CORRELATIONS = 1
CORRELATION_TYPE = 'kendall'

## Downloading of ENCODE datasets

While the pipeline analyses the ENCODE data exhaustively, in this script we will only process the data for two `BED` datasets that we call $X$ and $Y$. We define these datasets by their ENCODE identifiers below:

In [8]:
ENCODE_IDENTIFIER_BEDFILE_X = 'ENCFF981ISM' # PHF8
ENCODE_IDENTIFIER_BEDFILE_Y = 'ENCFF122CSI' # H3K4me3

We will write outputs to a directory named after the two dataset identifiers:

In [9]:

NOTEBOOK_OUTPUT_DIR = pathlib.Path('outputs') / f'{ENCODE_IDENTIFIER_BEDFILE_X}-{ENCODE_IDENTIFIER_BEDFILE_Y}'
# Create the directory (and any missing parents); do nothing if it already exists
NOTEBOOK_OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

Just like the pipeline, we can extract the information about these identifiers from the ENCODE metadata TSV:

In [10]:
encode_meta = pd.read_csv(ENCODE_METADATA_TSV, sep='\t', index_col=0)
encode_meta = encode_meta.loc[[ENCODE_IDENTIFIER_BEDFILE_X, ENCODE_IDENTIFIER_BEDFILE_Y]]
encode_meta


Unnamed: 0_level_0,File format,File type,File format type,Output type,File assembly,Experiment accession,Assay,Donor(s),Biosample term id,Biosample term name,...,Genome annotation,Platform,Controlled by,File Status,s3_uri,File analysis title,File analysis status,Audit WARNING,Audit NOT_COMPLIANT,Audit ERROR
File accession,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ENCFF981ISM,bed narrowPeak,bed,narrowPeak,IDR thresholded peaks,GRCh38,ENCSR000AQH,TF ChIP-seq,/human-donors/ENCDO000AAD/,EFO:0002067,K562,...,,,,released,s3://encode-public/2021/01/07/4e27a806-b056-4a...,ENCODE4 v1.6.1 GRCh38,released,"borderline replicate concordance, low read len...",,
ENCFF122CSI,bed narrowPeak,bed,narrowPeak,pseudoreplicated peaks,GRCh38,ENCSR000EWA,Histone ChIP-seq,/human-donors/ENCDO000AAD/,EFO:0002067,K562,...,,,,released,s3://encode-public/2020/09/29/f70dc6ea-ac1b-41...,ENCODE4 v1.5.1 GRCh38,released,"low read length, moderate library complexity, ...",,


The download location and the expected checksum of each file are given in the columns `File download URL` and `md5sum`:

In [11]:
encode_meta[['File download URL', 'md5sum']]

Unnamed: 0_level_0,File download URL,md5sum
File accession,Unnamed: 1_level_1,Unnamed: 2_level_1
ENCFF981ISM,https://www.encodeproject.org/files/ENCFF981IS...,9c193f16fb52a60daa00555fc986a6a1
ENCFF122CSI,https://www.encodeproject.org/files/ENCFF122CS...,e5694050796ccb5eeace1e6a4efec34d


For simplicity, we will keep using the file identifiers as filenames and download the files to the output directory:

In [12]:
BED_FILENAMES = {
    id_: (NOTEBOOK_OUTPUT_DIR / f'{id_}.bed.gz') for id_ in encode_meta.index
}
BED_FILENAMES

{'ENCFF981ISM': PosixPath('outputs/ENCFF981ISM-ENCFF122CSI/ENCFF981ISM.bed.gz'),
 'ENCFF122CSI': PosixPath('outputs/ENCFF981ISM-ENCFF122CSI/ENCFF122CSI.bed.gz')}

In [13]:
import urllib.request
import hashlib

for id_, row in encode_meta[['File download URL', 'md5sum']].iterrows():
    url = row['File download URL']
    expected_checksum = row['md5sum']
    target_filename = BED_FILENAMES[id_]
    
    print(f"-> Downloading {url} to {target_filename}")
    urllib.request.urlretrieve(url, target_filename)
    
    print(f"-> Verifying checksum")
    
    # Computes MD5 checksum for file, see https://stackoverflow.com/a/59056796/171400
    with open(target_filename, 'rb') as f:
        checksum = hashlib.md5()
        while chunk := f.read(8192):
            checksum.update(chunk)
        
        checksum = checksum.hexdigest()
    print(f"-> Downloaded file checksum: {checksum}")
    assert checksum == expected_checksum, "Checksums do not match"

-> Downloading https://www.encodeproject.org/files/ENCFF981ISM/@@download/ENCFF981ISM.bed.gz to outputs/ENCFF981ISM-ENCFF122CSI/ENCFF981ISM.bed.gz
-> Verifying checksum
-> Downloaded file checksum: 9c193f16fb52a60daa00555fc986a6a1
-> Downloading https://www.encodeproject.org/files/ENCFF122CSI/@@download/ENCFF122CSI.bed.gz to outputs/ENCFF981ISM-ENCFF122CSI/ENCFF122CSI.bed.gz
-> Verifying checksum
-> Downloaded file checksum: e5694050796ccb5eeace1e6a4efec34d
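
The checksum step above can be factored into a small helper. The sketch below is a minimal, standard-library-only illustration that runs on a throwaway temporary file; the name `md5_of_file` and the toy payload are assumptions made for this example and are not part of the pipeline.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=8192):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # Reading in chunks keeps memory use constant regardless of file size
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


# Toy demonstration on a temporary file (not an ENCODE download)
payload = b'chrToy\t0\t1000\n' * 10_000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name

chunked_digest = md5_of_file(tmp_path)
os.remove(tmp_path)
```

The chunked digest is identical to the digest of the whole payload hashed in one go, which is why the comparison against the `md5sum` column works for files of any size.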


## Processing of the `bed` files

### Non-overlapping genomic bins

We start the data processing by dividing the whole genome into a set of non-overlapping bins of the specified bin size (parameter `BIN_SIZE`). This is achieved with `bedtools makewindows`, which we call through [`pybedtools.BedTool.window_maker`](https://daler.github.io/pybedtools/autodocs/pybedtools.bedtool.BedTool.window_maker.html).

In [14]:
import pybedtools

In [15]:
genomic_windows = pybedtools.BedTool().window_maker(w=BIN_SIZE, g=CHROMSIZES, i='srcwinnum')
genomic_windows = genomic_windows.saveas(NOTEBOOK_OUTPUT_DIR / 'genomic_windows.bed.gz')

In [16]:
print("Made {:,} genomic windows of bin size {:,}bp".format(len(genomic_windows), BIN_SIZE))

Made 3,031,053 genomic windows of bin size 1,000bp
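
To make the windowing step concrete, here is a pure-Python sketch of what it produces for a toy chromosome. It is an illustration only, not the `bedtools` implementation: `make_windows` and `chrToy` are hypothetical names, and the naming scheme (chromosome name plus a 1-based window number) is inferred from the window names that appear in the outputs of this notebook.

```python
def make_windows(chrom_sizes, bin_size):
    """Yield (chrom, start, end, name) tuples covering each chromosome."""
    for chrom, size in chrom_sizes.items():
        for number, start in enumerate(range(0, size, bin_size), start=1):
            # The last window is truncated at the chromosome end
            end = min(start + bin_size, size)
            yield chrom, start, end, f'{chrom}_{number}'


# A toy 2,500 bp chromosome split into 1,000 bp bins
toy_windows = list(make_windows({'chrToy': 2500}, bin_size=1000))
```

Coordinates are half-open (`start` inclusive, `end` exclusive), following the BED convention, so consecutive windows share a boundary without overlapping.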


Next, we discard the genomic windows that overlap blacklisted regions:

In [17]:
blacklist = pybedtools.BedTool(BLACKLIST)

In [18]:
genomic_windows_no_blacklist = genomic_windows.intersect(blacklist, v=True)
genomic_windows_no_blacklist.saveas(NOTEBOOK_OUTPUT_DIR / 'genomic_windows_wo_blacklist.bed.gz')


<BedTool(outputs/ENCFF981ISM-ENCFF122CSI/genomic_windows_wo_blacklist.bed.gz)>

In [19]:
print("Was left with {:,} genomic windows of bin size {:,}bp".format(len(genomic_windows_no_blacklist), BIN_SIZE))

Was left with 2,835,235 genomic windows of bin size 1,000bp
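
The filtering rule can be sketched without `bedtools`: a window is dropped as soon as it shares at least one base with any blacklisted interval on the same chromosome. The helper names and toy coordinates below are assumptions for illustration, and the quadratic loop is only suitable for tiny inputs.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def drop_blacklisted(windows, blacklist):
    """Keep only the windows that overlap no blacklisted interval."""
    kept = []
    for chrom, start, end in windows:
        hit = any(
            chrom == b_chrom and overlaps(start, end, b_start, b_end)
            for b_chrom, b_start, b_end in blacklist
        )
        if not hit:
            kept.append((chrom, start, end))
    return kept


toy_windows = [('chrToy', 0, 1000), ('chrToy', 1000, 2000), ('chrToy', 2000, 3000)]
# One blacklisted region straddling the boundary between the first two windows
toy_blacklist = [('chrToy', 900, 1100)]
kept_windows = drop_blacklisted(toy_windows, toy_blacklist)
```

Note that a partial overlap is enough to remove a whole window, so a short blacklisted region on a bin boundary removes both neighbouring bins.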


### Genomic bins vs BED file intersection matrix

Once we have the genomic windows, we can use them as a common grid onto which the ChIP-seq datasets are projected.

For each genomic bin, we want to obtain the maximum [`signalValue` (column 7)](https://genome.ucsc.edu/FAQ/FAQformat.html#format12) of the peaks from the `bed` files that overlap this bin. We can achieve this with the [`bedtools map`](https://bedtools.readthedocs.io/en/latest/content/tools/map.html) operation.

In [20]:
matrix_to_bedtool_maps = {}

for id_, bed_filename in  BED_FILENAMES.items():
    print(f"-> Processing {bed_filename}")
    
    # Load bed filename
    bed = pybedtools.BedTool(bed_filename)
    
    # Drop chromosomes we are not interested in
    bed = bed.filter(lambda x: x.chrom in VALID_CHROMOSOMES)
    
    # Sort it
    bed = bed.sort(g=CHROMSIZES)
    
    # Map this file onto the genomic window
    mapped = genomic_windows_no_blacklist.map(
        bed,
        
        c=7, # signalValue
        o='max'
    )
    
    new_filename = (NOTEBOOK_OUTPUT_DIR / f'{id_}.mapped_to_genomic_windows.bed.gz')
    print(f'-> Done. Saving as {new_filename}')
    mapped.saveas(new_filename)
    
    matrix_to_bedtool_maps[id_] = mapped
    

-> Processing outputs/ENCFF981ISM-ENCFF122CSI/ENCFF981ISM.bed.gz
-> Done. Saving as outputs/ENCFF981ISM-ENCFF122CSI/ENCFF981ISM.mapped_to_genomic_windows.bed.gz
-> Processing outputs/ENCFF981ISM-ENCFF122CSI/ENCFF122CSI.bed.gz
-> Done. Saving as outputs/ENCFF981ISM-ENCFF122CSI/ENCFF122CSI.mapped_to_genomic_windows.bed.gz


The mapping step produces a BED file whose regions are the same as the genomic windows, but whose fifth column (score) now holds the maximum signal value of the overlapping peaks, or a dot (`.`) if no peak overlaps the window:

In [21]:
matrix_to_bedtool_maps[ENCODE_IDENTIFIER_BEDFILE_X].head(20)

chr1	793000	794000	chr1_794	.
 chr1	794000	795000	chr1_795	.
 chr1	795000	796000	chr1_796	.
 chr1	796000	797000	chr1_797	.
 chr1	797000	798000	chr1_798	.
 chr1	798000	799000	chr1_799	.
 chr1	799000	800000	chr1_800	.
 chr1	800000	801000	chr1_801	.
 chr1	801000	802000	chr1_802	.
 chr1	802000	803000	chr1_803	.
 chr1	803000	804000	chr1_804	.
 chr1	804000	805000	chr1_805	.
 chr1	805000	806000	chr1_806	.
 chr1	806000	807000	chr1_807	.
 chr1	807000	808000	chr1_808	.
 chr1	808000	809000	chr1_809	.
 chr1	809000	810000	chr1_810	.
 chr1	810000	811000	chr1_811	.
 chr1	811000	812000	chr1_812	.
 chr1	812000	813000	chr1_813	.
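
To make the semantics of the mapping step concrete, the sketch below reimplements "maximum signal of overlapping peaks per window" in plain Python on toy data. It is a simplified illustration rather than what `bedtools map` does internally; `map_max` and the toy intervals are hypothetical, and `None` plays the role of the `.` placeholder.

```python
def map_max(windows, peaks):
    """For each window, return the max signal of overlapping peaks, or None."""
    result = []
    for chrom, start, end in windows:
        signals = [
            signal
            for p_chrom, p_start, p_end, signal in peaks
            # Half-open interval overlap on the same chromosome
            if p_chrom == chrom and p_start < end and start < p_end
        ]
        result.append((chrom, start, end, max(signals) if signals else None))
    return result


toy_windows = [('chrToy', 0, 1000), ('chrToy', 1000, 2000), ('chrToy', 2000, 3000)]
toy_peaks = [
    ('chrToy', 900, 1200, 5.0),   # straddles the first two windows
    ('chrToy', 1500, 1600, 8.0),  # lies entirely in the second window
]
mapped_toy = map_max(toy_windows, toy_peaks)
```

A peak that straddles a bin boundary contributes its signal to both bins, which is presumably why neighbouring bins often carry identical values in the real data.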
 

At this point we can get out of the bedtools environment and work in pandas:

In [22]:
matrix_to_bedtool_maps_df = {
    id_: mapped_bed.to_dataframe()
    for id_, mapped_bed in matrix_to_bedtool_maps.items()
}

In [23]:
matrix_to_bedtool_maps_df[ENCODE_IDENTIFIER_BEDFILE_X]

Unnamed: 0,chrom,start,end,name,score
0,chr1,793000,794000,chr1_794,.
1,chr1,794000,795000,chr1_795,.
2,chr1,795000,796000,chr1_796,.
3,chr1,796000,797000,chr1_797,.
4,chr1,797000,798000,chr1_798,.
...,...,...,...,...,...
2835230,chrX,155978000,155979000,chrX_155979,.
2835231,chrX,155979000,155980000,chrX_155980,.
2835232,chrX,155980000,155981000,chrX_155981,.
2835233,chrX,155981000,155982000,chrX_155982,.


We're mostly interested in the `score` column.
We can create the data matrix by:

1. Setting the index to the column `name`
2. Selecting only the `score` column
3. Replacing `.` scores with NaNs
4. Converting to `float`

In [24]:
bed_matrix = pd.DataFrame(
    {
        id_: df.set_index('name')['score'].replace('.', np.nan).astype(float)
        for id_, df in matrix_to_bedtool_maps_df.items()
    }
)

In [25]:
bed_matrix

Unnamed: 0_level_0,ENCFF981ISM,ENCFF122CSI
name,Unnamed: 1_level_1,Unnamed: 2_level_1
chr1_794,,
chr1_795,,
chr1_796,,
chr1_797,,
chr1_798,,
...,...,...
chrX_155979,,
chrX_155980,,
chrX_155981,,
chrX_155982,,
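
As an aside, the "replace `.` with NaN, then convert to `float`" steps can also be written with `pandas.to_numeric`. The sketch below uses a toy series with hypothetical bin names; it is an alternative formulation, not the one used in this notebook.

```python
import pandas as pd

# Toy 'score' column as it comes out of the mapping step: strings, with '.' for no overlap
toy_scores = pd.Series(
    ['.', '109.37204', '.', '86.66685'],
    index=['bin_1', 'bin_2', 'bin_3', 'bin_4'],
    name='score',
)

# errors='coerce' turns anything that cannot be parsed as a number into NaN
toy_numeric = pd.to_numeric(toy_scores, errors='coerce')
```

One caveat of this shortcut is that `errors='coerce'` silently maps any unparseable string to NaN, not only `.`, so the explicit replacement is the stricter option when malformed values should raise an error.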


Note that not all rows are null. The counts below give the number of non-null entries in each column, i.e. the total number of bins that overlap peaks in each dataset:

In [26]:
(~bed_matrix.isnull()).sum()

ENCFF981ISM    30523
ENCFF122CSI    47840
dtype: int64

Some of these rows:

In [27]:
bed_matrix[(~bed_matrix.isnull()).any(axis=1)].head(20)

Unnamed: 0_level_0,ENCFF981ISM,ENCFF122CSI
name,Unnamed: 1_level_1,Unnamed: 2_level_1
chr1_827,,55.93549
chr1_828,109.37204,55.93549
chr1_904,,23.00623
chr1_905,86.66685,23.00623
chr1_906,82.95709,23.00623
chr1_924,16.61978,18.20666
chr1_925,16.61978,18.20666
chr1_926,59.18352,18.20666
chr1_941,52.96328,12.27774
chr1_942,22.70389,8.26568


### Correlations between bed files

At this point we can compute the correlations between the two peak sets directly.
To do this we will use [`pandas.DataFrame.corr`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html). Note that this function uses only the rows where both columns have a value. Additionally, we set the `min_periods` parameter as described in the parameter set, but it may not have any effect, depending on the correlation method selected (e.g. it has no effect for Kendall).

In [28]:
print(f"Computing correlation {CORRELATION_TYPE=} using {MIN_PERIODS_FOR_CORRELATIONS=}")

Computing correlation CORRELATION_TYPE='kendall' using MIN_PERIODS_FOR_CORRELATIONS=1


In [29]:
correlation_answer = bed_matrix.corr(
    method=CORRELATION_TYPE, 
    min_periods=MIN_PERIODS_FOR_CORRELATIONS
)

In [30]:
correlation_answer

Unnamed: 0,ENCFF981ISM,ENCFF122CSI
ENCFF981ISM,1.0,0.359587
ENCFF122CSI,0.359587,1.0


This returns a matrix, but we only need one of the off-diagonal values:

In [31]:
correlation_answer = correlation_answer.loc[ENCODE_IDENTIFIER_BEDFILE_X, ENCODE_IDENTIFIER_BEDFILE_Y]
correlation_answer

0.35958678366773367

Note that this is equivalent to computing the correlation only on the rows where both values are non-null:

In [32]:
correlation_answer_alternative = bed_matrix[(~bed_matrix.isnull()).all(axis=1)].corr(
    method=CORRELATION_TYPE, 
    min_periods=MIN_PERIODS_FOR_CORRELATIONS
)
correlation_answer_alternative =  correlation_answer_alternative.loc[ENCODE_IDENTIFIER_BEDFILE_X, ENCODE_IDENTIFIER_BEDFILE_Y]

assert correlation_answer == correlation_answer_alternative
correlation_answer_alternative

0.35958678366773367
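
For intuition, the sketch below computes Kendall's tau-b by brute force over the pairs where both values are present. It assumes the tau-b variant (which, to the best of our knowledge, is what `pandas` obtains from `scipy.stats.kendalltau` by default); `kendall_tau_b` and the toy data are hypothetical, and the quadratic loop would be far too slow for the real matrix.

```python
import math
from itertools import combinations


def kendall_tau_b(xs, ys):
    """Kendall's tau-b over the pairs where both values are present (not None)."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(pairs, 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: ignored by tau-b
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif (dx > 0) == (dy > 0):
            concordant += 1
        else:
            discordant += 1
    # Toy code: assumes at least one untied pair in each variable
    denominator = math.sqrt(
        (concordant + discordant + ties_x) * (concordant + discordant + ties_y)
    )
    return (concordant - discordant) / denominator


# None marks bins without a peak; only four complete pairs remain
toy_x = [None, 1.0, 2.0, 3.0, None, 4.0]
toy_y = [5.0, 1.0, 3.0, 2.0, None, 4.0]
toy_tau = kendall_tau_b(toy_x, toy_y)
```

Of the six pairs formed by the four complete observations, five are concordant and one is discordant, giving a tau of 4/6.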

### Entropy based statistics

The entropy-based statistics, however, are the main workhorse of this pipeline.
The code below shows how they can be computed. First we binarise the `bed_matrix` into a set of `yes/no` entries:

In [33]:
bed_matrix_binary = ~bed_matrix.isnull()
bed_matrix_binary.head()

Unnamed: 0_level_0,ENCFF981ISM,ENCFF122CSI
name,Unnamed: 1_level_1,Unnamed: 2_level_1
chr1_794,False,False
chr1_795,False,False
chr1_796,False,False
chr1_797,False,False
chr1_798,False,False


Then we compute a 2x2 cooccurrence matrix for the two `bed` datasets:

In [34]:
import itertools

cooccurrence_matrix = []

# For all four combinations of [False, True] 
for x_value, y_value in itertools.product([False, True], repeat=2):
    
    # Count how many times they occur
    mask = bed_matrix_binary[ENCODE_IDENTIFIER_BEDFILE_X] == x_value
    mask &= bed_matrix_binary[ENCODE_IDENTIFIER_BEDFILE_Y] == y_value
    count = mask.sum()
    
    # Store that
    cooccurrence_matrix.append([x_value, y_value, count])

# Make this into a dataframe
cooccurrence_matrix = pd.DataFrame(
    cooccurrence_matrix, 
    columns=[
         ENCODE_IDENTIFIER_BEDFILE_X,
         ENCODE_IDENTIFIER_BEDFILE_Y,
         'count',
     ]
)

# Reshape this into 2x2 matrix
cooccurrence_matrix = cooccurrence_matrix.set_index([
    ENCODE_IDENTIFIER_BEDFILE_X, 
    ENCODE_IDENTIFIER_BEDFILE_Y
])['count'].unstack(ENCODE_IDENTIFIER_BEDFILE_Y)

From this we get the number of bins in which the BED files overlap (based on our genomic grid), the number of bins in which both are missing, and the number of bins in which one occurs without the other (see below).

Note that the pipeline code uses some optimisations, so the numbers there are calculated in a slightly more sophisticated way, but the end result is the same.

In [35]:
cooccurrence_matrix

ENCFF122CSI,False,True
ENCFF981ISM,Unnamed: 1_level_1,Unnamed: 2_level_1
False,2782330,22382
True,5065,25458


Note that the matrix entries sum to the universe length (i.e. the number of genomic windows):

In [36]:
universe_length = len(genomic_windows_no_blacklist)
_sum = np.sum(np.asarray(cooccurrence_matrix))
print("Universe length: {:,}, sum of co-occurrence matrix columns: {:,}".format(universe_length, _sum))
assert universe_length == _sum


Universe length: 2,835,235, sum of co-occurrence matrix columns: 2,835,235


Once we have the co-occurrence matrix, we can compute the joint probabilities.

To add some statistical smoothing we will use a pseudocount defined in `STATISTICS_PSEUDOCOUNT` above:

In [37]:
print(f'{STATISTICS_PSEUDOCOUNT=}')

STATISTICS_PSEUDOCOUNT=100


We will add this pseudocount to every cell of the co-occurrence matrix, and then divide the cells by the total effective universe size, which is now equal to `universe_length + 4*STATISTICS_PSEUDOCOUNT`.

In [38]:
joint_probabilities = (cooccurrence_matrix + STATISTICS_PSEUDOCOUNT) / (universe_length + 4*STATISTICS_PSEUDOCOUNT)

In [39]:
joint_probabilities

ENCFF122CSI,False,True
ENCFF981ISM,Unnamed: 1_level_1,Unnamed: 2_level_1
False,0.981237,0.007928
True,0.001821,0.009013
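
As a worked check of the arithmetic, the sketch below redoes the smoothing in plain Python, using the four counts from the co-occurrence matrix printed above. The dictionary layout is an illustrative choice and is not how the pipeline stores these values.

```python
# Counts copied from the co-occurrence matrix above, keyed by (X, Y)
counts = {
    (False, False): 2_782_330,
    (False, True): 22_382,
    (True, False): 5_065,
    (True, True): 25_458,
}
pseudocount = 100

# Effective universe size: observed bins plus one pseudocount per cell
total = sum(counts.values()) + pseudocount * len(counts)

joint = {cell: (n + pseudocount) / total for cell, n in counts.items()}
```

Dividing by the effective universe size rather than the raw one is what keeps the smoothed probabilities summing to one.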


We can obtain marginal probabilities as well:

In [40]:
marginal_x = joint_probabilities.sum(axis=1)
assert marginal_x.index.name == ENCODE_IDENTIFIER_BEDFILE_X

marginal_y = joint_probabilities.sum(axis=0)
assert marginal_y.index.name == ENCODE_IDENTIFIER_BEDFILE_Y


In [41]:
marginal_x

ENCFF981ISM
False    0.989165
True     0.010835
dtype: float64

In [42]:
marginal_y

ENCFF122CSI
False    0.983058
True     0.016942
dtype: float64

If the two datasets were independent, we would expect to observe the following joint probability table (obtained by multiplying the marginal probabilities):

In [43]:
import itertools

joint_independent = []

# For all four combinations of [False, True] 
for x_value, y_value in itertools.product([False, True], repeat=2):
    
    # Compute the probability under independence
    p = marginal_x.loc[x_value] * marginal_y.loc[y_value]
    
    # Store that
    joint_independent.append([x_value, y_value, p])

# Make this into a dataframe
joint_independent = pd.DataFrame(
    joint_independent, 
    columns=[
         ENCODE_IDENTIFIER_BEDFILE_X,
         ENCODE_IDENTIFIER_BEDFILE_Y,
         'probability',
     ]
)

# Reshape this into 2x2 matrix
joint_independent = joint_independent.set_index([
    ENCODE_IDENTIFIER_BEDFILE_X, 
    ENCODE_IDENTIFIER_BEDFILE_Y
])['probability'].unstack(ENCODE_IDENTIFIER_BEDFILE_Y)

In [44]:
assert joint_independent.index.equals(joint_probabilities.index)
assert joint_independent.columns.equals(joint_probabilities.columns)


In [45]:
joint_independent

ENCFF122CSI,False,True
ENCFF981ISM,Unnamed: 1_level_1,Unnamed: 2_level_1
False,0.972407,0.016758
True,0.010651,0.000184
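
The independence table is simply the outer product of the two marginals. A minimal sketch on a toy joint table follows; `independent_joint` and the toy numbers are assumptions for illustration only.

```python
def independent_joint(joint):
    """Product of marginals for a joint table given as nested lists (rows = X)."""
    row_marginals = [sum(row) for row in joint]
    col_marginals = [sum(col) for col in zip(*joint)]
    return [[r * c for c in col_marginals] for r in row_marginals]


# Toy joint table in which X and Y tend to agree
toy_joint = [[0.4, 0.1], [0.1, 0.4]]
toy_independent = independent_joint(toy_joint)
```

In the toy table both marginals are 0.5, so independence would put 0.25 in every cell; the observed 0.4 on the diagonal signals association. In the real data above, the observed joint presence (about 0.009013) is roughly 49 times the value expected under independence (about 0.000184).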


We now have everything we need to compute the marginal and joint entropies, as well as the relative entropy (mutual information).
First, the marginal entropies:

In [46]:
import scipy.stats 

entropy_marginal_x = scipy.stats.entropy(marginal_x) 
print(f'-> Computed marginal entropy of {ENCODE_IDENTIFIER_BEDFILE_X} = {entropy_marginal_x}')

entropy_marginal_y = scipy.stats.entropy(marginal_y) 
print(f'-> Computed marginal entropy of {ENCODE_IDENTIFIER_BEDFILE_Y} = {entropy_marginal_y}')


-> Computed marginal entropy of ENCFF981ISM = 0.05980241823702084
-> Computed marginal entropy of ENCFF122CSI = 0.0858845552486885


Now we can compute the joint entropy of the two variables:

In [47]:
joint_entropy_xy = scipy.stats.entropy(np.asarray(joint_probabilities).ravel())
print(f'-> Computed joint entropy of {ENCODE_IDENTIFIER_BEDFILE_X},{ENCODE_IDENTIFIER_BEDFILE_Y} = {joint_entropy_xy}')


-> Computed joint entropy of ENCFF981ISM,ENCFF122CSI = 0.11087141848050827


We can also compute the mutual information, which is the Kullback-Leibler (K-L) divergence of the observed joint distribution from the joint distribution expected under independence:

In [48]:
mi_xy_using_kl_formula = scipy.stats.entropy(
    np.asarray(joint_probabilities).ravel(),
    np.asarray(joint_independent).ravel(),
    
)
print(f'-> Computed MI of {ENCODE_IDENTIFIER_BEDFILE_X},{ENCODE_IDENTIFIER_BEDFILE_Y} = {mi_xy_using_kl_formula}')


-> Computed MI of ENCFF981ISM,ENCFF122CSI = 0.03481555500520116


This should be the same as the sum of marginal entropies minus joint entropy:

In [49]:
mi_xy_using_entropies = entropy_marginal_x + entropy_marginal_y - joint_entropy_xy
print(f'-> Recomputed MI of {ENCODE_IDENTIFIER_BEDFILE_X},{ENCODE_IDENTIFIER_BEDFILE_Y} = {mi_xy_using_entropies}')

from numpy.testing import assert_array_almost_equal
# I'm using the array function because the default parameters make sense there
assert_array_almost_equal([mi_xy_using_kl_formula], [mi_xy_using_entropies])

-> Recomputed MI of ENCFF981ISM,ENCFF122CSI = 0.03481555500520106
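
The same quantities can be reproduced without `scipy`, which makes the formulas explicit. The sketch below starts from the four counts printed earlier and uses natural logarithms, matching the default of `scipy.stats.entropy`; the helper names are hypothetical.

```python
import math


def entropy(probabilities):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Co-occurrence counts from above (rows: X = False, True; columns: Y = False, True)
counts = [[2_782_330, 22_382], [5_065, 25_458]]
pseudocount = 100
total = sum(map(sum, counts)) + 4 * pseudocount

joint = [[(n + pseudocount) / total for n in row] for row in counts]
marginal_x = [sum(row) for row in joint]
marginal_y = [sum(col) for col in zip(*joint)]

# Flatten both tables in the same (X, Y) order before comparing them
flat_joint = [p for row in joint for p in row]
flat_independent = [px * py for px in marginal_x for py in marginal_y]

h_x = entropy(marginal_x)
h_y = entropy(marginal_y)
h_xy = entropy(flat_joint)
mi_from_kl = kl_divergence(flat_joint, flat_independent)
mi_from_entropies = h_x + h_y - h_xy
```

Both routes to the mutual information agree up to floating-point error, and the marginal entropies match the values printed by the `scipy` cells.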


The uncertainty coefficients are the mutual information normalised by the marginal entropy of one of the two variables, and are computed as follows:

In [50]:
uncertainty_x_given_y = mi_xy_using_kl_formula / entropy_marginal_x
uncertainty_y_given_x = mi_xy_using_kl_formula / entropy_marginal_y

In [51]:
print(f"-> Uncertainty coefficient U(X|Y) <- MI(X,Y)/H(X) =  {uncertainty_x_given_y}")
print(f"-> Uncertainty coefficient U(Y|X) <- MI(X,Y)/H(Y) =  {uncertainty_y_given_x}")


-> Uncertainty coefficient U(X|Y) <- MI(X,Y)/H(X) =  0.5821763739923229
-> Uncertainty coefficient U(Y|X) <- MI(X,Y)/H(Y) =  0.4053762041893185


Note that in the pipeline output we use the convention that $X$ is in the rows of the output and $Y$ is in the columns. We further assume that $X$ is a protein (e.g. PHF8) and $Y$ is a histone mark. Under these assumptions we use `uncertainty_x_given_y` as our final output. The verification section at the end of this notebook checks this against the pipeline output and shows how to extract these numbers.

In [52]:
uncertainty_x_given_y

0.5821763739923229

In one exceptional case (network analysis), the harmonic mean of the two uncertainty coefficients is used instead. It can be computed with the `scipy.stats.hmean` function:

In [53]:
from scipy.stats import hmean
hmean([uncertainty_x_given_y, uncertainty_y_given_x])

0.47795014437054345

Or, equivalently, by the closed-form expression that often appears in textbooks:

In [54]:
2 * mi_xy_using_kl_formula / (entropy_marginal_x + entropy_marginal_y)

0.47795014437054345
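
The two results coincide because the mutual information cancels out of the harmonic mean. A short derivation, using the same notation as the printed labels above (with $U(X|Y) = MI(X,Y)/H(X)$ and $U(Y|X) = MI(X,Y)/H(Y)$):

$$
\operatorname{hmean}\bigl(U(X|Y),\, U(Y|X)\bigr)
= \frac{2}{\dfrac{1}{U(X|Y)} + \dfrac{1}{U(Y|X)}}
= \frac{2}{\dfrac{H(X)}{MI(X,Y)} + \dfrac{H(Y)}{MI(X,Y)}}
= \frac{2\, MI(X,Y)}{H(X) + H(Y)}
$$

Unlike the two directional coefficients, this quantity is symmetric in $X$ and $Y$.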

## Verification of pipeline results

As we used two ENCODE datasets to compute this example, we can use this notebook to verify the pipeline results, provided that the pipeline has been run.

In [55]:
cell_line = encode_meta['Biosample term name'].unique()
assert len(cell_line) == 1
cell_line = cell_line[0]

In [56]:
MIN_PERIODS_FOR_CORRELATIONS

1

In [57]:
OUTPUT_CSV_GZ = pathlib.Path(
    f'../../output/final/analysis/params_{BIN_SIZE}bp_pc_{STATISTICS_PSEUDOCOUNT}_mp_{MIN_PERIODS_FOR_CORRELATIONS}/{cell_line}/consolidated_tables/bedstats_consolidated_{cell_line}_{BIN_SIZE}bp_params_pc_{STATISTICS_PSEUDOCOUNT}_mp_{MIN_PERIODS_FOR_CORRELATIONS}_from_bed.csv.gz'
)

print(f'-> Reading pipeline outputs from {OUTPUT_CSV_GZ}')
assert OUTPUT_CSV_GZ.is_file(), "Pipeline output not found - are you sure the pipeline was run to completion?"


pipeline_outputs = pd.read_csv(OUTPUT_CSV_GZ, index_col=0)

pipeline_outputs.columns = pd.MultiIndex.from_tuples([c.split('__') for c in pipeline_outputs.columns], names=['col_type', 'colname'])

print('-> column headers parsed:')
print(pipeline_outputs.columns.get_level_values('col_type').unique())


-> Reading pipeline outputs from ../../output/final/analysis/params_1000bp_pc_100_mp_1/K562/consolidated_tables/bedstats_consolidated_K562_1000bp_params_pc_100_mp_1_from_bed.csv.gz
-> column headers parsed:
Index(['metadata', 'marcs_feature_significant_category', 'normalised_mi',
       'entropy_by_row', 'kendall_correlation', 'mi', 'entropy_by_col',
       'counts_true_true', 'counts_false_true', 'counts_true_false',
       'counts_false_false'],
      dtype='object', name='col_type')
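
The flat `<col_type>__<colname>` headers become a two-level column index, so individual values are addressed with a row label and a `(col_type, colname)` tuple. The sketch below shows that lookup pattern on a toy table; all row labels, column names and numbers in it are made up for illustration.

```python
import pandas as pd

# Hypothetical flat column names following the '<col_type>__<colname>' pattern
toy = pd.DataFrame(
    [[0.03, 0.58], [0.01, 0.12]],
    index=['FactorA-Cell-ID1', 'FactorB-Cell-ID2'],
    columns=['mi__Mark-Cell-ID3', 'normalised_mi__Mark-Cell-ID3'],
)

# Split each header once on the double underscore to obtain (col_type, colname)
toy.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split('__', 1)) for c in toy.columns],
    names=['col_type', 'colname'],
)

# Row label first, then the (col_type, colname) tuple
looked_up = toy.loc['FactorA-Cell-ID1', ('normalised_mi', 'Mark-Cell-ID3')]
```

The same pattern is used in the validation cells further down to pull single statistics out of the pipeline table.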


In [58]:
pipeline_outputs.head()

col_type,metadata,metadata,metadata,metadata,metadata,marcs_feature_significant_category,marcs_feature_significant_category,marcs_feature_significant_category,marcs_feature_significant_category,marcs_feature_significant_category,...,counts_false_false,counts_false_false,counts_false_false,counts_false_false,counts_false_false,counts_false_false,counts_false_false,counts_false_false,counts_false_false,counts_false_false
colname,encode_id,cell_line,factor,factor_type,marcs_gene_label,H2A.Z,H3K27ac,H3K27me2,H3K27me3,H3K4me1,...,state:15_Quies,state:1_TssA,state:2_TssAFlnk,state:3_TxFlnk,state:4_Tx,state:5_TxWk,state:6_EnhG,state:7_Enh,state:8_ZNF/Rpts,state:9_Het
Factor_Cell_Identifier,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
ADNP-K562-ENCFF739AJO,ENCFF739AJO,K562,ADNP,protein,ADNP,Neither,,,Neither,Neither,...,944095.0,2770050.0,2758028.0,2796910.0,2701476.0,2416319.0,2777986.0,2652595.0,2800345.0,2776332.0
AFF1-K562-ENCFF195YGC,ENCFF195YGC,K562,AFF1,protein,AFF1,,,,,,...,949267.0,2779125.0,2765182.0,2803088.0,2706542.0,2417014.0,2782919.0,2653527.0,2805917.0,2781939.0
AFF1-K562-ENCFF674XTY,ENCFF674XTY,K562,AFF1,protein,AFF1,,,,,,...,957511.0,2788113.0,2772685.0,2814105.0,2717880.0,2426518.0,2793931.0,2660970.0,2817366.0,2793315.0
ARID1B-K562-ENCFF879NTL,ENCFF879NTL,K562,ARID1B,protein,ARID1B,Neither,,,,,...,912195.0,2738438.0,2735645.0,2760289.0,2663106.0,2385776.0,2742981.0,2642590.0,2761588.0,2737625.0
ARID2-K562-ENCFF913WRW,ENCFF913WRW,K562,ARID2,protein,ARID2,Neither,,,Neither,Neither,...,958075.0,2784414.0,2769572.0,2812570.0,2717226.0,2427008.0,2792948.0,2660247.0,2816181.0,2792197.0


Pipeline outputs are indexed by factor-cell-identifier keys, so we create those keys here:

In [59]:
def factor_cell_identifier(encode_meta_row):
    assay = encode_meta_row['Assay']

    if assay in ['TF ChIP-seq', 'Histone ChIP-seq']:
        factor = encode_meta_row['Experiment target'].partition('-')[0]
    else:
        factor = assay
        
    cell_line = encode_meta_row['Biosample term name']
    identifier = encode_meta_row.name
    
    return f'{factor}-{cell_line}-{identifier}'

In [60]:
FACTOR_CELL_IDENTIFIER_X = factor_cell_identifier(encode_meta.loc[ENCODE_IDENTIFIER_BEDFILE_X])
print(f'-> {ENCODE_IDENTIFIER_BEDFILE_X} -> {FACTOR_CELL_IDENTIFIER_X}')
FACTOR_CELL_IDENTIFIER_Y = factor_cell_identifier(encode_meta.loc[ENCODE_IDENTIFIER_BEDFILE_Y])
print(f'-> {ENCODE_IDENTIFIER_BEDFILE_Y} -> {FACTOR_CELL_IDENTIFIER_Y}')


-> ENCFF981ISM -> PHF8-K562-ENCFF981ISM
-> ENCFF122CSI -> H3K4me3-K562-ENCFF122CSI


Note that the pipeline outputs are not symmetric: the important statistics are only computed for `[all assays]` vs `[non-protein assays]`. From this point onwards we will therefore assume that the assay $Y$ is one of the non-protein assays, and verify the data accordingly.

In [61]:
assert encode_meta.loc[ENCODE_IDENTIFIER_BEDFILE_Y, 'Assay'] in ['Histone ChIP-seq', 'ATAC-seq', 'DNAse-seq']


### Validation of the correlation calculation

The correlation value is stored in the column called `{CORRELATION_TYPE}_correlation`. We can extract the computed correlation result for our two example bed files:

In [62]:
f'{CORRELATION_TYPE}_correlation'

'kendall_correlation'

In [63]:
correlation_answer_from_pipeline = pipeline_outputs.loc[
    FACTOR_CELL_IDENTIFIER_X, 
    (
        f'{CORRELATION_TYPE}_correlation', FACTOR_CELL_IDENTIFIER_Y
    )
]

In [64]:
print(f'-> Correlation computed by the pipeline: {correlation_answer_from_pipeline}, correlation computed by us: {correlation_answer}')

assert_array_almost_equal(
    [correlation_answer], 
    [correlation_answer_from_pipeline]
)

-> Correlation computed by the pipeline: 0.3595867836677336, correlation computed by us: 0.35958678366773367


### Validation of the co-occurrence counts

Co-occurrence counts are split between four columns in the pipeline output:

1. `counts_true_true`, representing X=True, Y=True (X in rows)
2. `counts_false_true`, representing X=False, Y=True (X in rows)
3. `counts_true_false`, representing X=True, Y=False (X in rows)
4. `counts_false_false`, representing X=False, Y=False (X in rows)


In [65]:
FACTOR_CELL_IDENTIFIER_X

'PHF8-K562-ENCFF981ISM'

In [66]:
cooccurrence_matrix

ENCFF122CSI,False,True
ENCFF981ISM,Unnamed: 1_level_1,Unnamed: 2_level_1
False,2782330,22382
True,5065,25458


We can use these columns to reconstruct the co-occurrence matrix from the pipeline output:

In [67]:
co_occurrence_matrix_from_pipeline = pd.DataFrame([
    [
     pipeline_outputs.loc[FACTOR_CELL_IDENTIFIER_X, ('counts_false_false', FACTOR_CELL_IDENTIFIER_Y)],
     pipeline_outputs.loc[FACTOR_CELL_IDENTIFIER_X, ('counts_false_true', FACTOR_CELL_IDENTIFIER_Y)],
    ],
    [
     pipeline_outputs.loc[FACTOR_CELL_IDENTIFIER_X, ('counts_true_false', FACTOR_CELL_IDENTIFIER_Y)],
     pipeline_outputs.loc[FACTOR_CELL_IDENTIFIER_X, ('counts_true_true', FACTOR_CELL_IDENTIFIER_Y)],
    ],
], index=[False, True], columns=[False, True], dtype=int)

co_occurrence_matrix_from_pipeline.index.name = ENCODE_IDENTIFIER_BEDFILE_X
co_occurrence_matrix_from_pipeline.columns.name = ENCODE_IDENTIFIER_BEDFILE_Y

co_occurrence_matrix_from_pipeline

ENCFF122CSI,False,True
ENCFF981ISM,Unnamed: 1_level_1,Unnamed: 2_level_1
False,2782330,22382
True,5065,25458


Make sure the results are identical:

In [68]:
from numpy.testing import assert_array_equal
assert_array_equal(cooccurrence_matrix, co_occurrence_matrix_from_pipeline)

### Validation of the entropies

We can now validate the marginal entropies.

For the $X$ variable, the entropy is listed in the pipeline matrix only once, in the column `entropy_by_row`:

In [69]:
marginal_entropy_x_from_pipeline = pipeline_outputs.loc[
    FACTOR_CELL_IDENTIFIER_X, 
    ('entropy_by_row', 'entropy_by_row')
]

print(f'-> Marginal entropy for {ENCODE_IDENTIFIER_BEDFILE_X}: computed here: {entropy_marginal_x}, computed by pipeline: {marginal_entropy_x_from_pipeline}')
assert_array_almost_equal([entropy_marginal_x], [marginal_entropy_x_from_pipeline])

-> Marginal entropy for ENCFF981ISM: computed here: 0.05980241823702084, computed by pipeline: 0.0598024182370209


For the $Y$ variable, the entropy is listed twice: once in the column `entropy_by_row`, and once as one of the columns in `entropy_by_col`. First, let's find the entropy by row:

In [70]:
marginal_entropy_y_from_pipeline_by_row = pipeline_outputs.loc[
    FACTOR_CELL_IDENTIFIER_Y, 
    ('entropy_by_row', 'entropy_by_row')
]

print(f'-> Marginal entropy for {ENCODE_IDENTIFIER_BEDFILE_Y}: computed here: {entropy_marginal_y}, computed by pipeline: {marginal_entropy_y_from_pipeline_by_row}')
assert_array_almost_equal([entropy_marginal_y], [marginal_entropy_y_from_pipeline_by_row])

-> Marginal entropy for ENCFF122CSI: computed here: 0.0858845552486885, computed by pipeline: 0.0858845552486885


And now by column (every row of this `entropy_by_col` column holds the same value, so we collapse it with `unique()`):

In [71]:
marginal_entropy_y_from_pipeline_by_col = pipeline_outputs[('entropy_by_col', FACTOR_CELL_IDENTIFIER_Y)].unique()

print(f'-> Marginal entropy for {ENCODE_IDENTIFIER_BEDFILE_Y}: computed here: {entropy_marginal_y}, computed by pipeline (column): {marginal_entropy_y_from_pipeline_by_col}')


assert_array_almost_equal([entropy_marginal_y], marginal_entropy_y_from_pipeline_by_col)



-> Marginal entropy for ENCFF122CSI: computed here: 0.0858845552486885, computed by pipeline (column): [0.08588456]


Similarly, we can verify Mutual Information calculations:

In [72]:
mi_from_pipeline = pipeline_outputs.loc[FACTOR_CELL_IDENTIFIER_X, ('mi', FACTOR_CELL_IDENTIFIER_Y)]

print(f'-> MI for {ENCODE_IDENTIFIER_BEDFILE_X}, {ENCODE_IDENTIFIER_BEDFILE_Y}: computed here: {mi_xy_using_kl_formula}, computed by pipeline: {mi_from_pipeline}')

assert_array_almost_equal([mi_xy_using_kl_formula], [mi_from_pipeline])

-> MI for ENCFF981ISM, ENCFF122CSI: computed here: 0.03481555500520116, computed by pipeline: 0.0348155550052011


Finally, and most importantly, we need to verify that the normalised mutual information column in the pipeline output corresponds to `uncertainty_x_given_y`, and _not_ to `uncertainty_y_given_x`:

In [73]:
normed_mi_from_pipeline = pipeline_outputs.loc[FACTOR_CELL_IDENTIFIER_X, ('normalised_mi', FACTOR_CELL_IDENTIFIER_Y)]

print(f'-> Normed MI {ENCODE_IDENTIFIER_BEDFILE_X} "given" {ENCODE_IDENTIFIER_BEDFILE_Y}: computed here: {uncertainty_x_given_y}, computed by pipeline: {normed_mi_from_pipeline}')

assert_array_almost_equal([uncertainty_x_given_y], [normed_mi_from_pipeline])

print(f'-> Note this is not the same as normed MI {ENCODE_IDENTIFIER_BEDFILE_Y} "given" {ENCODE_IDENTIFIER_BEDFILE_X} (={uncertainty_y_given_x})')


-> Normed MI ENCFF981ISM "given" ENCFF122CSI: computed here: 0.5821763739923229, computed by pipeline: 0.5821763739923219
-> Note this is not the same as normed MI ENCFF122CSI "given" ENCFF981ISM (=0.4053762041893185)


That's it!