# List of contents
* [Creating interrelation profiles](#creating)
  * [Creating co-occurrence matrices](#creating_corms)
  * [Creating other relation matrices](#creating_other)
  * [Using the pre-computed interrelation matrices](#precomputed)
* [Evaluating interrelation profiles](#evaluation)
  * [Direct interpretation](#direct_interpretation)
  * [Matching feature vectors to interrelation profiles](#vector_match)
  * [Comparing interrelation profiles](#profile_match)

# Creating interrelation profiles <a class="anchor" id="creating"/>

In fip, interrelation profiles consist of matrices containing the raw counts of feature co-occurrences within the profiled set (CORM), probabilities of feature co-occurrence within the profiled set (COPRM), PMI values for each feature pair (PMIRM), etc.

## Creating feature co-occurrence relation matrices (CORMs) <a class="anchor" id="creating_corms"/>

There are several ways of creating CORMs:

In [1]:
# from an iterable containing binary feature vectors, such as fingerprints.
# The vectors can be anything that can be cast to a NumPy boolean array:
from fiprofiling.relationmatrices import CORM
feature_vectors = [[True, False, False, False], [True, True, False, False],
                   [False, True, True, False], [False, False, False, True]]
rm = CORM.from_fingerprints(feature_vectors)
rm

<fiprofiling.relationmatrices.CORM at 0x7f2bb0b63f98>

In [2]:
# CORM objects, like the other interrelation matrices,
# store the interrelation values in a Pandas DataFrame accessible through the 'df' attribute
rm.df

Unnamed: 0,0,1,2,3
0,2,1,0,0
1,1,2,1,0
2,0,1,1,0
3,0,0,0,1
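
For intuition, the counts above can be reproduced with plain NumPy. If the vectors are stacked into a 0/1 matrix X, the product of its transpose with X counts, for each feature pair, the vectors in which both features are on. This is only a minimal sketch and makes no claim about how fip computes CORMs internally; `cooccurrence_counts` is a hypothetical helper, not part of fip.

```python
import numpy as np

def cooccurrence_counts(feature_vectors):
    """Count, for every feature pair, the vectors in which both features are on."""
    # rows = feature vectors, columns = features, values = 0 or 1
    x = np.asarray(feature_vectors, dtype=bool).astype(int)
    # entry (i, j) sums x[:, i] * x[:, j] over all vectors
    return x.T @ x

feature_vectors = [[True, False, False, False], [True, True, False, False],
                   [False, True, True, False], [False, False, False, True]]
counts = cooccurrence_counts(feature_vectors)
print(counts)
```

The diagonal holds the plain occurrence count of each feature, which is why it is never smaller than any other value in the same row.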


In [3]:
# it is also possible to create a CORM from binary strings:
feature_vectors = ['1000', '1100', '0110', '0001']
rm = CORM.from_fingerprints(feature_vectors, fpformat='bintext')
rm

<fiprofiling.relationmatrices.CORM at 0x7f2bb0b32080>

In [4]:
rm.df

Unnamed: 0,0,1,2,3
0,2,1,0,0
1,1,2,1,0
2,0,1,1,0
3,0,0,0,1


In [5]:
# as well as hexadecimal strings:
feature_vectors = ['11', '1a', 'bb', '50']
rm = CORM.from_fingerprints(feature_vectors, fpformat='hextext')
rm

<fiprofiling.relationmatrices.CORM at 0x7f2bb0b327f0>

In [6]:
rm.df

Unnamed: 0,0,1,2,3,4,5,6,7
0,2,1,0,1,2,1,0,1
1,1,2,0,2,2,1,0,1
2,0,0,0,0,0,0,0,0
3,1,2,0,2,2,1,0,1
4,2,2,0,2,4,1,1,1
5,1,1,0,1,1,1,0,1
6,0,0,0,0,1,0,1,0
7,1,1,0,1,1,1,0,1
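
The bit order of 'hextext' input is not spelled out here. The output above is consistent with feature 0 being the least significant bit of the hexadecimal number: for example, feature 4 is on in all four vectors, and feature 2 in none. A minimal sketch of that reading, using a hypothetical helper `hex_to_bits` that is not part of fip:

```python
def hex_to_bits(hex_string):
    """Decode a hex string into a list of 0/1 values, least significant bit first."""
    value = int(hex_string, 16)
    width = 4 * len(hex_string)  # each hex digit encodes four bits
    return [(value >> i) & 1 for i in range(width)]

for fp in ['11', '1a', 'bb', '50']:
    print(fp, hex_to_bits(fp))
```

Note that this differs from 'bintext' input, where the examples above place feature 0 at the leftmost character of the string.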


In [7]:
# Relation matrices may also be instantiated directly
# from a Pandas DataFrame containing the co-occurrence counts.
# However, this method also needs the total number of feature vectors represented by the provided DataFrame,
# so that co-occurrence probabilities can be computed if a CORM -> COPRM conversion is desired.
rm2 = CORM.from_dataframe(rm.df, num_datapoints=len(feature_vectors))
rm2.df

Unnamed: 0,0,1,2,3,4,5,6,7
0,2,1,0,1,2,1,0,1
1,1,2,0,2,2,1,0,1
2,0,0,0,0,0,0,0,0
3,1,2,0,2,2,1,0,1
4,2,2,0,2,4,1,1,1
5,1,1,0,1,1,1,0,1
6,0,0,0,0,1,0,1,0
7,1,1,0,1,1,1,0,1


In [8]:
# feature vectors may also be added to an already instantiated CORM
rm2.add_fingerprint('11111111', fpformat='bintext')
rm2.df

Unnamed: 0,0,1,2,3,4,5,6,7
0,3,2,1,2,3,2,1,2
1,2,3,1,3,3,2,1,2
2,1,1,1,1,1,1,1,1
3,2,3,1,3,3,2,1,2
4,3,3,1,3,5,2,2,2
5,2,2,1,2,2,2,1,2
6,1,1,1,1,2,1,2,1
7,2,2,1,2,2,2,1,2


## Creating Co-occurrence probability and other derivative matrices <a class="anchor" id="creating_other"/>
Each CORM, along with the total count of feature vectors, can be transformed into a corresponding COPRM, which can then be transformed into other derivative matrices. The usual order is CORM -> COPRM -> PMIRM -> ZPMIRM, though it is not necessary to perform all these steps manually. If only matrices such as PMIRM and ZPMIRM are desired, they may be instantiated directly from any of the prerequisite matrices (CORM, COPRM, ...), with the intermediate steps carried out in the background.

In [9]:
from fiprofiling.relationmatrices import COPRM
example_coprm = COPRM.from_CORM(rm2)
example_coprm.df

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.6,0.4,0.2,0.4,0.6,0.4,0.2,0.4
1,0.4,0.6,0.2,0.6,0.6,0.4,0.2,0.4
2,0.2,0.2,0.2,0.2,0.2,0.2,0.2,0.2
3,0.4,0.6,0.2,0.6,0.6,0.4,0.2,0.4
4,0.6,0.6,0.2,0.6,1.0,0.4,0.4,0.4
5,0.4,0.4,0.2,0.4,0.4,0.4,0.2,0.4
6,0.2,0.2,0.2,0.2,0.4,0.2,0.4,0.2
7,0.4,0.4,0.2,0.4,0.4,0.4,0.2,0.4


In [10]:
from fiprofiling.relationmatrices import PMIRM
example_pmirm = PMIRM.from_COPRM(example_coprm)
example_pmirm.df

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.0,0.152003,0.736966,0.152003,0.0,0.736966,-0.263034,0.736966
1,0.152003,0.0,0.736966,0.736966,0.0,0.736966,-0.263034,0.736966
2,0.736966,0.736966,0.0,0.736966,0.0,1.321928,1.321928,1.321928
3,0.152003,0.736966,0.736966,0.0,0.0,0.736966,-0.263034,0.736966
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.736966,0.736966,1.321928,0.736966,0.0,0.0,0.321928,1.321928
6,-0.263034,-0.263034,1.321928,-0.263034,0.0,0.321928,0.0,0.321928
7,0.736966,0.736966,1.321928,0.736966,0.0,1.321928,0.321928,0.0
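
The CORM -> COPRM -> PMIRM steps can be sketched in plain NumPy. The sketch below assumes that the probabilities are the counts divided by the number of feature vectors, that PMI is log2(p(i,j) / (p(i) * p(j))) with p(i) read from the diagonal, and that the diagonal is then set to 0. These assumptions reproduce the tables above, but the code is not fip's implementation.

```python
import numpy as np

# Co-occurrence counts of rm2 (after adding the all-ones fingerprint).
counts = np.array([
    [3, 2, 1, 2, 3, 2, 1, 2],
    [2, 3, 1, 3, 3, 2, 1, 2],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [2, 3, 1, 3, 3, 2, 1, 2],
    [3, 3, 1, 3, 5, 2, 2, 2],
    [2, 2, 1, 2, 2, 2, 1, 2],
    [1, 1, 1, 1, 2, 1, 2, 1],
    [2, 2, 1, 2, 2, 2, 1, 2],
])
num_datapoints = 5  # four hex fingerprints plus the one added afterwards

p = counts / num_datapoints   # co-occurrence probabilities, as in the COPRM
p_single = np.diag(p)         # p(i): a feature co-occurring with itself
with np.errstate(divide='ignore'):  # log2(0) would give -inf for pairs that never co-occur
    pmi = np.log2(p / np.outer(p_single, p_single))
np.fill_diagonal(pmi, 0.0)    # the PMIRM tables show 0 on the diagonal
print(np.round(pmi, 6))
```

Feature 4 is on in every vector, so p(4) = 1 and its PMI with every other feature is 0: knowing that it is present says nothing about the rest.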


In [11]:
from fiprofiling.relationmatrices import ZPMIRM
example_zpmirm = ZPMIRM.from_PMIRM(example_pmirm)
example_zpmirm.df

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.0,-0.610733,0.557855,-0.610733,-0.914392,0.557855,-1.43986,0.557855
1,-0.610733,0.0,0.557855,0.557855,-0.914392,0.557855,-1.43986,0.557855
2,0.557855,0.557855,0.0,0.557855,-0.914392,1.726444,1.726444,1.726444
3,-0.610733,0.557855,0.557855,0.0,-0.914392,0.557855,-1.43986,0.557855
4,-0.914392,-0.914392,-0.914392,-0.914392,0.0,-0.914392,-0.914392,-0.914392
5,0.557855,0.557855,1.726444,0.557855,-0.914392,0.0,-0.271271,1.726444
6,-1.43986,-1.43986,1.726444,-1.43986,-0.914392,-0.271271,0.0,-0.271271
7,0.557855,0.557855,1.726444,0.557855,-0.914392,1.726444,-0.271271,0.0
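
The Z-scored values above are consistent with standardizing each PMI value by the mean and population standard deviation of all off-diagonal PMI values, while leaving the diagonal at 0. The sketch below follows that assumption; `zscore_off_diagonal` is a hypothetical helper, not fip's API, and it does not handle missing (NaN) values.

```python
import numpy as np

def zscore_off_diagonal(pmi):
    """Z-score a PMI matrix using off-diagonal statistics; the diagonal stays 0."""
    pmi = np.asarray(pmi, dtype=float)
    off_diag = ~np.eye(len(pmi), dtype=bool)
    values = pmi[off_diag]
    z = (pmi - values.mean()) / values.std()  # population standard deviation (ddof=0)
    z[~off_diag] = 0.0
    return z

# Rebuild the example PMI matrix from the rm2 counts (5 feature vectors).
counts = np.array([
    [3, 2, 1, 2, 3, 2, 1, 2], [2, 3, 1, 3, 3, 2, 1, 2],
    [1, 1, 1, 1, 1, 1, 1, 1], [2, 3, 1, 3, 3, 2, 1, 2],
    [3, 3, 1, 3, 5, 2, 2, 2], [2, 2, 1, 2, 2, 2, 1, 2],
    [1, 1, 1, 1, 2, 1, 2, 1], [2, 2, 1, 2, 2, 2, 1, 2],
])
p = counts / 5
pmi = np.log2(p / np.outer(np.diag(p), np.diag(p)))
np.fill_diagonal(pmi, 0.0)

z = zscore_off_diagonal(pmi)
print(np.round(z, 6))
```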


In [12]:
# the latter matrices can also be computed directly from any of the former ones
example_zpmirm2 = ZPMIRM.from_CORM(rm2)
example_zpmirm2.df

Unnamed: 0,0,1,2,3,4,5,6,7
0,0.0,-0.610733,0.557855,-0.610733,-0.914392,0.557855,-1.43986,0.557855
1,-0.610733,0.0,0.557855,0.557855,-0.914392,0.557855,-1.43986,0.557855
2,0.557855,0.557855,0.0,0.557855,-0.914392,1.726444,1.726444,1.726444
3,-0.610733,0.557855,0.557855,0.0,-0.914392,0.557855,-1.43986,0.557855
4,-0.914392,-0.914392,-0.914392,-0.914392,0.0,-0.914392,-0.914392,-0.914392
5,0.557855,0.557855,1.726444,0.557855,-0.914392,0.0,-0.271271,1.726444
6,-1.43986,-1.43986,1.726444,-1.43986,-0.914392,-0.271271,0.0,-0.271271
7,0.557855,0.557855,1.726444,0.557855,-0.914392,1.726444,-0.271271,0.0


## Using pre-computed interrelation matrices <a class="anchor" id="precomputed"/>
Creating interrelation profiles can be computationally expensive, especially for large datasets. Some precomputed interrelation profiles are already available in fip, and can be directly imported:

In [13]:
from fiprofiling.data.precomputed_rm.pubchem.ecfp4_1024 import coprm
coprm.df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1014,1015,1016,1017,1018,1019,1020,1021,1022,1023
0,0.010251,0.003804,0.000451,0.000408,0.001935,0.000659,0.000069,0.000114,0.000376,0.000235,...,0.000214,0.000255,0.000087,0.000793,0.000526,0.004583,0.000152,0.000230,0.000280,0.000105
1,0.003804,0.321488,0.016214,0.011528,0.060131,0.008477,0.003607,0.005651,0.013449,0.008371,...,0.011354,0.007528,0.002097,0.017997,0.009484,0.105171,0.004947,0.005371,0.005978,0.003495
2,0.000451,0.016214,0.052611,0.002004,0.044092,0.001467,0.000390,0.000661,0.002053,0.001124,...,0.001241,0.001033,0.000431,0.002958,0.001338,0.031557,0.000980,0.001027,0.001187,0.000504
3,0.000408,0.011528,0.002004,0.048739,0.009139,0.001661,0.000255,0.001146,0.002302,0.001227,...,0.000880,0.000750,0.000322,0.002763,0.001572,0.033621,0.003386,0.001450,0.001033,0.000768
4,0.001935,0.060131,0.044092,0.009139,0.196233,0.005763,0.001608,0.002648,0.007821,0.004645,...,0.004923,0.003765,0.001359,0.009482,0.004508,0.127306,0.004253,0.004033,0.003759,0.002373
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1019,0.004583,0.105171,0.031557,0.033621,0.127306,0.018990,0.003235,0.006711,0.015020,0.017759,...,0.009400,0.007881,0.002750,0.023536,0.011212,0.367534,0.007321,0.007346,0.008536,0.007448
1020,0.000152,0.004947,0.000980,0.003386,0.004253,0.000793,0.000108,0.000239,0.001327,0.000496,...,0.000439,0.000271,0.000131,0.001383,0.000676,0.007321,0.017961,0.000562,0.000622,0.000267
1021,0.000230,0.005371,0.001027,0.001450,0.004033,0.000550,0.000107,0.000210,0.000802,0.000372,...,0.000351,0.000265,0.000086,0.000978,0.000786,0.007346,0.000562,0.017900,0.000286,0.000208
1022,0.000280,0.005978,0.001187,0.001033,0.003759,0.000566,0.000136,0.000236,0.001022,0.000443,...,0.000432,0.000324,0.000192,0.004318,0.000559,0.008536,0.000622,0.000286,0.019322,0.000232


As of Feb 2020, there are precomputed interrelation profiles for ZINC ('zinc'), PubChem ('pubchem'), DrugBank ('drugbank'), ChEMBL ('chembl') and all of those combined ('synthetizable'), using MACCS ('maccs') and PubChem ('pubchemkey') structural keys, and ECFP4 and ECFP6 circular fingerprints hashed to 1024 bits ('ecfp4_1024' and 'ecfp6_1024'). The profiles can be imported in various forms: 'corm', 'coprm', 'pmirm' and 'zpmirm'.

# Evaluating interrelation profiles <a class="anchor" id="evaluation"/>
## Direct interpretation <a class="anchor" id="direct_interpretation"/>
The feature co-occurrences within the characterized set, their co-occurrence probabilities, [pointwise mutual information values](https://en.wikipedia.org/wiki/Pointwise_mutual_information) and [Z-scored](https://en.wikipedia.org/wiki/Standard_score) PMI values are directly accessible within CORM, COPRM, PMIRM and ZPMIRM, respectively. The data is stored in the Pandas DataFrame contained in each matrix object.
### Interrelation profile of completely independent features
To establish a baseline, let's look at interrelation profiles of completely unrelated features:

In [14]:
# let's create a CORM from random feature vectors consisting of completely unrelated features
import numpy as np
random_feature_vectors = [np.random.choice([0, 1], size=(10,)) for i in range(10000)]
random_corm = CORM.from_fingerprints(random_feature_vectors)
random_corm.df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,5076,2517,2584,2561,2497,2567,2444,2547,2525,2561
1,2517,4980,2556,2507,2497,2544,2431,2451,2479,2518
2,2584,2556,5087,2593,2537,2551,2521,2497,2519,2531
3,2561,2507,2593,5060,2561,2572,2504,2571,2531,2602
4,2497,2497,2537,2561,4979,2463,2447,2453,2441,2502
5,2567,2544,2551,2572,2463,5023,2463,2544,2500,2528
6,2444,2431,2521,2504,2447,2463,4929,2459,2441,2508
7,2547,2451,2497,2571,2453,2544,2459,4978,2482,2504
8,2525,2479,2519,2531,2441,2500,2441,2482,4986,2564
9,2561,2518,2531,2602,2502,2528,2508,2504,2564,5064


The above CORM contains raw co-occurrence counts. The counts are higher on the diagonal, since a feature co-occurring with itself is merely an occurrence of the feature. The values on the diagonal stay at around 5000, since there were 10k vectors and each bit had a flat 50% chance of being on. The off-diagonal values are around 2500, since the chance of two flat 50% features co-occurring is 25%. 

In [15]:
random_coprm = COPRM.from_CORM(random_corm)
random_coprm.df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.5076,0.2517,0.2584,0.2561,0.2497,0.2567,0.2444,0.2547,0.2525,0.2561
1,0.2517,0.498,0.2556,0.2507,0.2497,0.2544,0.2431,0.2451,0.2479,0.2518
2,0.2584,0.2556,0.5087,0.2593,0.2537,0.2551,0.2521,0.2497,0.2519,0.2531
3,0.2561,0.2507,0.2593,0.506,0.2561,0.2572,0.2504,0.2571,0.2531,0.2602
4,0.2497,0.2497,0.2537,0.2561,0.4979,0.2463,0.2447,0.2453,0.2441,0.2502
5,0.2567,0.2544,0.2551,0.2572,0.2463,0.5023,0.2463,0.2544,0.25,0.2528
6,0.2444,0.2431,0.2521,0.2504,0.2447,0.2463,0.4929,0.2459,0.2441,0.2508
7,0.2547,0.2451,0.2497,0.2571,0.2453,0.2544,0.2459,0.4978,0.2482,0.2504
8,0.2525,0.2479,0.2519,0.2531,0.2441,0.25,0.2441,0.2482,0.4986,0.2564
9,0.2561,0.2518,0.2531,0.2602,0.2502,0.2528,0.2508,0.2504,0.2564,0.5064


The probability values in the above COPRM, based on the 10k entirely random vectors, are obtained from the corresponding CORM by dividing its counts by the number of vectors, 10k. As expected, in vectors from the random set, there is around a 50% chance of each feature occurring on its own, and around a 25% chance of two such features co-occurring.

In [16]:
random_pmirm = PMIRM.from_COPRM(random_coprm)
random_pmirm.df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.0,-0.006205,0.001027,-0.004194,-0.017424,0.00977,-0.033815,0.011469,-0.003363,-0.005334
1,-0.006205,0.0,0.012855,-0.007393,0.010122,0.024332,-0.013963,-0.016413,-0.002342,-0.002217
2,0.001027,0.012855,0.0,0.010598,0.002381,-0.002373,0.007814,-0.020257,-0.009919,-0.025457
3,-0.004194,-0.007393,0.010598,0.0,0.023642,0.017132,0.00573,0.029554,0.004615,0.022134
4,-0.017424,0.010122,0.002381,0.023642,0.0,-0.022061,-0.004209,-0.014947,-0.024339,-0.011124
5,0.00977,0.024332,-0.002373,0.017132,-0.022061,0.0,-0.0075,0.024911,-0.002576,-0.008902
6,-0.033815,-0.013963,0.007814,0.00573,-0.004209,-0.0075,0.0,0.003139,-0.009777,0.006893
7,0.011469,-0.016413,-0.020257,0.029554,-0.014947,0.024911,0.003139,0.0,-1.8e-05,-0.009681
8,-0.003363,-0.002342,-0.009919,0.004615,-0.024339,-0.002576,-0.009777,-1.8e-05,0.0,0.022164
9,-0.005334,-0.002217,-0.025457,0.022134,-0.011124,-0.008902,0.006893,-0.009681,0.022164,0.0


The above PMIRM shows the amount of mutual information the features provide about each other within the profiled set. Since all features appear together about as often as could be expected from the rate of their occurrence, the PMI values are very low. PMI values on the diagonal are always 0, since the individual occurrence of features is the exact same as their co-occurrence with themselves.

In [17]:
random_zpmirm = ZPMIRM.from_PMIRM(random_pmirm)
random_zpmirm.df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.0,-0.369064,0.123789,-0.232048,-1.133726,0.71966,-2.250798,0.835433,-0.175437,-0.309745
1,-0.369064,0.0,0.929917,-0.450067,0.743658,1.712093,-0.897816,-1.064837,-0.105838,-0.097285
2,0.123789,0.929917,0.0,0.776076,0.216036,-0.107959,0.586358,-1.326821,-0.622204,-1.681185
3,-0.232048,-0.450067,0.776076,0.0,1.665081,1.221409,0.444336,2.068015,0.368347,1.562327
4,-1.133726,0.743658,0.216036,1.665081,0.0,-1.449713,-0.23305,-0.964892,-1.604963,-0.704314
5,0.71966,1.712093,-0.107959,1.221409,-1.449713,0.0,-0.457325,1.751589,-0.121766,-0.552917
6,-2.250798,-0.897816,0.586358,0.444336,-0.23305,-0.457325,0.0,0.267704,-0.612575,0.523583
7,0.835433,-1.064837,-1.326821,2.068015,-0.964892,1.751589,0.267704,0.0,0.052575,-0.605999
8,-0.175437,-0.105838,-0.622204,0.368347,-1.604963,-0.121766,-0.612575,0.052575,0.0,1.564358
9,-0.309745,-0.097285,-1.681185,1.562327,-0.704314,-0.552917,0.523583,-0.605999,1.564358,0.0


Despite the very small PMI values in this random PMIRM, they may still be standardized by Z-score, rating each individual interrelation relative to all others in terms of standard deviations from the mean. In other words, ZPMI values indicate how high the PMI of each co-occurrence is relative to all other PMI values within the interrelation profile.

### Interrelation profile of a set with strong interdependencies
To contrast with the above interrelation profile consisting of independent features, let's make a dummy set of features with some strong interdependence between features. Let's modify the generated random vectors so that:
* bit 0 implies bit 1
* bit 2 and 3 always occur together (equivalence)
* bit 4 negatively implies bit 5 (when 4 is True, 5 is False)
* bit 6 is negation of bit 7

In [18]:
def modify_interrelations(feature_vector):
    if feature_vector[0]:
        feature_vector[1] = 1
    feature_vector[3] = feature_vector[2]
    if feature_vector[4]:
        feature_vector[5] = 0
    feature_vector[6] = 1 - feature_vector[7]
    return feature_vector
interrelated_corm = CORM.from_fingerprints((modify_interrelations(fv) for fv in random_feature_vectors))
interrelated_coprm = COPRM.from_CORM(interrelated_corm)
interrelated_coprm.df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.5076,0.5076,0.2584,0.2584,0.2497,0.1305,0.2529,0.2547,0.2525,0.2561
1,0.5076,0.7539,0.3844,0.3844,0.3741,0.1933,0.3779,0.376,0.377,0.3826
2,0.2584,0.3844,0.5087,0.5087,0.2537,0.1301,0.259,0.2497,0.2519,0.2531
3,0.2584,0.3844,0.5087,0.5087,0.2537,0.1301,0.259,0.2497,0.2519,0.2531
4,0.2497,0.3741,0.2537,0.2537,0.4979,0.0,0.2526,0.2453,0.2441,0.2502
5,0.1305,0.1933,0.1301,0.1301,0.0,0.256,0.1251,0.1309,0.1291,0.1313
6,0.2529,0.3779,0.259,0.259,0.2526,0.1251,0.5022,0.0,0.2504,0.256
7,0.2547,0.376,0.2497,0.2497,0.2453,0.1309,0.0,0.4978,0.2482,0.2504
8,0.2525,0.377,0.2519,0.2519,0.2441,0.1291,0.2504,0.2482,0.4986,0.2564
9,0.2561,0.3826,0.2531,0.2531,0.2502,0.1313,0.256,0.2504,0.2564,0.5064


As shown in the above COPRM, the effects of the introduced interdependencies can be directly observed on the co-occurrence probabilities.

In [19]:
interrelated_pmirm = PMIRM.from_COPRM(interrelated_coprm)
interrelated_pmirm.df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.0,0.407555,0.001027,0.001027,-0.017424,0.006142,-0.011459,0.011469,-0.003363,-0.005334
1,0.407555,0.0,0.003348,0.003348,-0.004877,0.002253,-0.002703,0.002721,0.004237,0.003114
2,0.001027,0.003348,0.0,0.975113,0.002381,-0.00141,0.019803,-0.020257,-0.009919,-0.025457
3,0.001027,0.003348,0.975113,0.0,0.002381,-0.00141,0.019803,-0.020257,-0.009919,-0.025457
4,-0.017424,-0.004877,0.002381,0.002381,0.0,,0.014665,-0.014947,-0.024339,-0.011124
5,0.006142,0.002253,-0.00141,-0.00141,,0.0,-0.039396,0.038683,0.01639,0.018374
6,-0.011459,-0.002703,0.019803,0.019803,0.014665,-0.039396,0.0,,1.8e-05,0.009532
7,0.011469,0.002721,-0.020257,-0.020257,-0.014947,0.038683,,0.0,-1.8e-05,-0.009681
8,-0.003363,0.004237,-0.009919,-0.009919,-0.024339,0.01639,1.8e-05,-1.8e-05,0.0,0.022164
9,-0.005334,0.003114,-0.025457,-0.025457,-0.011124,0.018374,0.009532,-0.009681,0.022164,0.0


In terms of pointwise mutual information, the implication between bits 0 and 1, as well as the equivalence between bits 2 and 3, resulted in high PMI values. Conversely, the negative implication between bits 4 and 5 and the negation between bits 6 and 7 resulted in "holes" within the PMI profile (the empty cells above), as log2(p(6,7)/(p(6) * p(7))) = log2(0) = -inf. Other interrelations remain random, resulting in PMI values near 0.
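
As a worked check, the PMI values can be recomputed by hand from the COPRM shown earlier, taking $p(i)$ from the diagonal and assuming the standard definition of pointwise mutual information with a base-2 logarithm:

$$
\mathrm{PMI}(i,j) = \log_2 \frac{p(i,j)}{p(i)\,p(j)}
$$

$$
\begin{aligned}
\mathrm{PMI}(0,1) &= \log_2 \frac{0.5076}{0.5076 \times 0.7539} = \log_2 \frac{1}{0.7539} \approx 0.4076 \\
\mathrm{PMI}(2,3) &= \log_2 \frac{0.5087}{0.5087 \times 0.5087} = \log_2 \frac{1}{0.5087} \approx 0.9751 \\
\mathrm{PMI}(8,9) &= \log_2 \frac{0.2564}{0.4986 \times 0.5064} \approx 0.0222 \\
\mathrm{PMI}(6,7) &= \log_2 \frac{0}{0.5022 \times 0.4978} = -\infty
\end{aligned}
$$

Bit 0 never occurs without bit 1, so $p(0,1) = p(0)$ and the PMI reduces to $-\log_2 p(1)$. The same happens for the equivalent bits 2 and 3, while the untouched pair 8 and 9 stays close to 0.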

## Matching feature vectors to interrelation profiles<a class="anchor" id="vector_match"/>

Direct profile interpretation aside, it is also possible to quantify how well a given feature vector matches an interrelation profile using a measurement such as Relative Feature Tightness (RFT):

In [20]:
# let's create a feature vector that conforms to the interrelations within a profile from the above section:
conforming_feature_vector = '1111000011'
# and quantify how well it matches the profile built from 10k vectors featuring such interrelations:
interrelated_pmirm.fp_tightness(conforming_feature_vector, fpformat='bintext')

0.07452690957641434

In [21]:
# compare with a feature vector that does not match the interrelations within the same reference profile
nonconforming_feature_vector = '0000111100'
interrelated_pmirm.fp_tightness(nonconforming_feature_vector, fpformat='bintext')

-0.00016583829298040761

In [22]:
# the fp_tightness method accepts feature vectors in the same formats as the CORM construction:
interrelated_pmirm.fp_tightness([False, False, False, False, True, True, True, True, False, False])
interrelated_pmirm.fp_tightness(np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0]))
# etc.

-0.00016583829298040761
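
The definition of RFT is not given here. The two values above are, however, reproduced to rounding precision by averaging the PMI values over all pairs of "on" features in the vector, including the zero diagonal and skipping missing cells. The sketch below implements that reading with a hypothetical helper `rft_sketch`; it is an assumption inferred from the outputs, not fip's implementation.

```python
import numpy as np

nan = np.nan
# PMI values copied from the interrelated PMIRM shown earlier (empty cells read as NaN).
pmi = np.array([
    [0.0, 0.407555, 0.001027, 0.001027, -0.017424, 0.006142, -0.011459, 0.011469, -0.003363, -0.005334],
    [0.407555, 0.0, 0.003348, 0.003348, -0.004877, 0.002253, -0.002703, 0.002721, 0.004237, 0.003114],
    [0.001027, 0.003348, 0.0, 0.975113, 0.002381, -0.00141, 0.019803, -0.020257, -0.009919, -0.025457],
    [0.001027, 0.003348, 0.975113, 0.0, 0.002381, -0.00141, 0.019803, -0.020257, -0.009919, -0.025457],
    [-0.017424, -0.004877, 0.002381, 0.002381, 0.0, nan, 0.014665, -0.014947, -0.024339, -0.011124],
    [0.006142, 0.002253, -0.00141, -0.00141, nan, 0.0, -0.039396, 0.038683, 0.01639, 0.018374],
    [-0.011459, -0.002703, 0.019803, 0.019803, 0.014665, -0.039396, 0.0, nan, 1.8e-05, 0.009532],
    [0.011469, 0.002721, -0.020257, -0.020257, -0.014947, 0.038683, nan, 0.0, -1.8e-05, -0.009681],
    [-0.003363, 0.004237, -0.009919, -0.009919, -0.024339, 0.01639, 1.8e-05, -1.8e-05, 0.0, 0.022164],
    [-0.005334, 0.003114, -0.025457, -0.025457, -0.011124, 0.018374, 0.009532, -0.009681, 0.022164, 0.0],
])

def rft_sketch(pmi, bintext):
    """Mean PMI over all pairs of 'on' features, ignoring missing values."""
    on = [i for i, bit in enumerate(bintext) if bit == '1']
    sub = pmi[np.ix_(on, on)]  # square submatrix restricted to the 'on' features
    return np.nanmean(sub)

print(rft_sketch(pmi, '1111000011'))  # conforming vector
print(rft_sketch(pmi, '0000111100'))  # non-conforming vector
```

Under this reading, the conforming vector scores higher because its "on" features include the strongly related pairs (0, 1) and (2, 3), whereas the non-conforming vector only combines features whose PMI is near zero or missing.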

RFT can also be computed in the same way against Z-scored PMI matrices (ZPMIRMs), to quantify how well the feature combinations in the evaluated feature vector match the standardized values in the reference ZPMIRM:

In [23]:
interrelated_zpmirm = ZPMIRM.from_PMIRM(interrelated_pmirm)
interrelated_zpmirm.fp_tightness(conforming_feature_vector, fpformat='bintext')

0.30725061184162317

In [24]:
interrelated_zpmirm.fp_tightness(nonconforming_feature_vector, fpformat='bintext')

-0.1305216251551404

Measuring RFT against an interrelation profile with PMI values standardized by Z-score no longer reflects the absolute PMI values, just their position relative to all others in terms of standard deviations from the mean. This makes the resulting measurements easier to compare with those obtained against other interrelation profiles. We call this measurement against a ZPMIRM "ZRFT", to differentiate it from the raw RFT obtained against a PMIRM.

## Comparing interrelation profiles<a class="anchor" id="profile_match"/>

As with the individual feature vectors described above, RFT and ZRFT can also quantify how well the feature co-occurrences within a given set of feature vectors match another interrelation profile. This is done by comparing the feature co-occurrence probabilities observed within the measured set (in the form of a COPRM) to the feature interrelations within a reference interrelation profile (in the form of a PMIRM or a ZPMIRM):

In [25]:
# let's make a new set of random, completely independent feature vectors
# and compare it to the previously created profile with strong feature interrelations
random_feature_vectors2 = [np.random.choice([0, 1], size=(10,)) for i in range(10000)]
random_corm2 = CORM.from_fingerprints(random_feature_vectors2)
random_coprm2 = COPRM.from_CORM(random_corm2)
interrelated_pmirm.tightness(random_coprm2)

0.006865763175914659

In [26]:
# for contrast, let's make a new set of feature vectors that conform to those in the reference matrix
interrelated_corm2 = CORM.from_fingerprints((modify_interrelations(fv) for fv in random_feature_vectors2))
interrelated_coprm2 = COPRM.from_CORM(interrelated_corm2)
interrelated_pmirm.tightness(interrelated_coprm2)

0.013888857314527625

The latter interrelation profile has a higher RFT relative to the profile of the original interdependent set, because it contains the same feature interdependencies. The same goes for ZRFT measurements:

In [27]:
interrelated_zpmirm.tightness(random_coprm2)

-0.000617220177749029

In [28]:
interrelated_zpmirm.tightness(interrelated_coprm2)

0.04227271662610315

For two interrelation profiles A and B, the RFT of A relative to B is not necessarily the same as that of B relative to A. Because the measurement is not symmetric, (Z)RFT cannot be considered a metric:

In [29]:
interrelated_zpmirm.tightness(interrelated_coprm2)

0.04227271662610315

In [30]:
interrelated_zpmirm2 = ZPMIRM.from_COPRM(interrelated_coprm2)
interrelated_zpmirm2.tightness(interrelated_coprm)

0.04501895145619563

It is also possible to measure the (Z)RFT of an interrelation profile against itself to obtain a baseline value to compare with other (Z)RFT measurements:

In [31]:
interrelated_pmirm.tightness(interrelated_coprm)

0.014301152216293129

In [32]:
interrelated_zpmirm.tightness(interrelated_coprm)

0.04411565665503906

In [33]:
interrelated_zpmirm2.tightness(interrelated_coprm2)

0.04342793435703071

In [34]:
random_zpmirm.tightness(random_coprm)

0.002730781528478593