## Comparison of CMAP/Chembl and sentence co-occurrence matrices

This notebook compares the sentence co-occurrence data (the "predictions") to the CMAP/Chembl data (the "truth"). Both matrices are filtered down so that they have the same rows and columns, and then I compare them element by element.
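The filtering step can be sketched as follows. This is a minimal illustration, assuming both matrices are pandas DataFrames indexed by gene with one column per disease. `align_to_shared` and the toy labels are hypothetical and not part of this notebook, which loads CSV files that have already been filtered.

```python
import pandas as pd

def align_to_shared(truth: pd.DataFrame, pred: pd.DataFrame):
    """Restrict both frames to the genes (rows) and diseases (columns) they share."""
    rows = truth.index.intersection(pred.index)
    cols = truth.columns.intersection(pred.columns)
    return truth.loc[rows, cols], pred.loc[rows, cols]

# Toy data: the truth has gene g2 that the predictions lack,
# and the predictions have disease d9 that the truth lacks.
truth = pd.DataFrame([[1, 0], [0, 1], [1, 1]],
                     index=["g1", "g2", "g3"], columns=["d1", "d2"])
pred = pd.DataFrame([[3, 0, 1], [0, 2, 0]],
                    index=["g1", "g3"], columns=["d1", "d2", "d9"])

truth_f, pred_f = align_to_shared(truth, pred)
print(truth_f.shape, pred_f.shape)
```

After alignment both frames have identical row and column labels, so they can be compared cell by cell.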

The rows of both matrices represent genes while the columns represent diseases.

### Data format
- CMAP/Chembl: binary matrix with element {ij} = 1 if there is a relationship between gene i and disease j, and 0 otherwise
- Sentence Co-Occurrence: matrix with element {ij} = number of abstracts in which both gene i and disease j are mentioned
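To put the two formats on the same footing, the co-occurrence counts can be reduced to the same 0/1 form as the CMAP/Chembl matrix. A small sketch with made-up gene and disease labels, assuming any count above 0 is read as a predicted relationship:

```python
import pandas as pd

# Hypothetical co-occurrence counts (number of abstracts per gene-disease pair)
counts = pd.DataFrame([[0, 4], [1, 0]],
                      index=["gene_a", "gene_b"],
                      columns=["disease_x", "disease_y"])

# 1 where at least one abstract mentions both, 0 otherwise
predicted = (counts > 0).astype(int)
print(predicted)
```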

### Comparison measures
I am only comparing whether a relationship is present or not, so the number of abstracts in which the gene and disease are mentioned is not taken into account.
1. Recall: number of correctly predicted relationships / total number of relationships
2. Specificity: number of correctly predicted non-relationships / total number of non-relationships

The denominators in both comparison measures are totals taken from the CMAP/Chembl matrix. In the cells below they are computed from the filtered matrix, not the original one.
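In confusion-matrix terms, the two measures can be written as below. The symbols are shorthand introduced here rather than notation used elsewhere in this notebook: TP counts cells where both matrices show a relationship, FN cells where only CMAP/Chembl shows one, TN cells where neither shows one, and FP cells where only the co-occurrence matrix shows one.

$$
\text{Recall} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}
$$

Here TP + FN is the total number of relationships and TN + FP is the total number of non-relationships in the CMAP/Chembl matrix.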

In [1]:
import pandas as pd

## Comparison measure functions for recall, specificity and precision

These measures are based on the filtered CMAP/Chembl matrices, i.e., the filtered matrices are treated as the "truth".

In [8]:
def compare_recall(df_original, df, number_of_relationships):
    """Print the share of relationships in df_original (truth) that df also shows."""
    count = 0
    for col in df_original.columns:
        for row in df_original.index:
            # both dataframes show a relationship (true positive)
            if df_original[col][row] > 0 and df[col][row] > 0:
                count += 1
    recall = count / number_of_relationships
    print('The recall is: {}'.format(recall))



def compare_specificity(df_original, df, number_of_non_relationships):
    """Print the share of non-relationships in df_original (truth) that df also shows."""
    count = 0
    for col in df_original.columns:
        for row in df_original.index:
            # both dataframes show a non-relationship (true negative)
            if df_original[col][row] == 0 and df[col][row] == 0:
                count += 1
    spec = count / number_of_non_relationships
    print('The specificity is: {}'.format(spec))



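def compare_precision(df_original, df, number_of_elements):
    """Print the share of cells where both dataframes hold the same value.

    Despite the name, this is element-wise agreement (closer to accuracy),
    not precision in the TP / (TP + FP) sense.
    """
    count = 0
    for col in df_original.columns:
        for row in df_original.index:
            # same value in both dataframes (relationship OR non-relationship);
            # if df holds raw counts, a count above 1 never equals a binary 1
            if df_original[col][row] == df[col][row]:
                count += 1
    precision = count / number_of_elements
    print('The precision is: {}'.format(precision))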
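The same counts can be obtained without explicit loops. The sketch below is an alternative, not the code that produced the results in this notebook. `confusion_counts` and `summarize` are hypothetical helpers, any value above 0 is treated as a relationship, and missing values are treated as no relationship. It reports precision in the usual TP / (TP + FP) sense alongside accuracy. The `compare_precision` function above counts equal cells, which is closer to this accuracy than to precision.

```python
import pandas as pd

def confusion_counts(truth: pd.DataFrame, pred: pd.DataFrame) -> dict:
    """Count TP, FN, FP and TN over all gene-disease cells."""
    # Align pred to the truth's labels; labels missing from pred become NaN
    pred = pred.reindex(index=truth.index, columns=truth.columns)
    t = truth.to_numpy() > 0  # NaN > 0 is False, i.e. no relationship
    p = pred.to_numpy() > 0
    return {
        "tp": int((t & p).sum()),
        "fn": int((t & ~p).sum()),
        "fp": int((~t & p).sum()),
        "tn": int((~t & ~p).sum()),
    }

def summarize(c: dict) -> dict:
    """Derive the measures; assumes every denominator is non-zero."""
    return {
        "recall": c["tp"] / (c["tp"] + c["fn"]),
        "specificity": c["tn"] / (c["tn"] + c["fp"]),
        "precision": c["tp"] / (c["tp"] + c["fp"]),
        "accuracy": (c["tp"] + c["tn"]) / sum(c.values()),
    }

# Toy example with made-up labels: binary truth, count-valued predictions
truth = pd.DataFrame([[1, 0, 0], [0, 1, 0]],
                     index=["g1", "g2"], columns=["d1", "d2", "d3"])
pred = pd.DataFrame([[5, 0, 2], [0, 0, 0]],
                    index=["g1", "g2"], columns=["d1", "d2", "d3"])

counts = confusion_counts(truth, pred)
scores = summarize(counts)
print(counts)
print(scores)
```

Computing all four counts once also makes it easy to check that they sum to the total number of cells.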

# CMAP comparison

In [3]:
# importing the filtered matrices and putting them in correct format for comparison
cmap_filtered = pd.read_csv('cmap_filtered.csv').set_index('target').fillna(0)
co_filtered = pd.read_csv('co_occurrence_filtered.csv').set_index('target')

# calculating total number of connections for comparison measures
number_of_elements = len(cmap_filtered.columns)*len(cmap_filtered.index)
number_of_relationships = cmap_filtered.sum().sum()
number_of_non_relationships = number_of_elements - number_of_relationships

In [4]:
print("The proportion of true relationships in the CMAP data is: {}".format(number_of_relationships/number_of_elements))

The proportion of true relationships in the CMAP data is: 0.011943834259378904


### Results

Recall: the "truth" is the filtered CMAP data for now. The results are slightly lower if I use the original CMAP data.

In [5]:
compare_recall(cmap_filtered,co_filtered, number_of_relationships)

The recall is: 0.561846375766318


In [6]:
compare_specificity(cmap_filtered,co_filtered, number_of_non_relationships)

The specificity is: 0.8235177443471362


In [9]:
compare_precision(cmap_filtered,co_filtered, number_of_elements)

The precision is: 0.8148490330361373


# Chembl comparison

In [15]:
# importing the dataframes and putting in the correct format
chembl_filtered = pd.read_csv('chembl_filtered.csv').set_index('target').fillna(0)
co_chembl_filtered = pd.read_csv('parquet_chembl_filtered.csv').set_index('target').fillna(0)

# calculating total number of connections for comparison measures
number_of_elements = len(chembl_filtered.columns)*len(chembl_filtered.index)
number_of_relationships = chembl_filtered.sum().sum()
number_of_non_relationships = number_of_elements - number_of_relationships

In [16]:
print("The proportion of true relationships in the Chembl data is: {}".format(number_of_relationships/number_of_elements))

The proportion of true relationships in the Chembl data is: 0.0328443672392717


### Results

In [10]:
compare_recall(chembl_filtered, co_chembl_filtered, number_of_relationships)

The recall is: 0.37221294002419775


In [11]:
compare_specificity(chembl_filtered, co_chembl_filtered, number_of_non_relationships)

The specificity is: 0.8692685456021757


In [12]:
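# NOTE: this reuses compare_specificity with number_of_elements as the denominator,
# so the printed value is (matching non-relationships) / (all elements), not the specificity above
compare_specificity(chembl_filtered, co_chembl_filtered, number_of_elements)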

The specificity is: 0.8407179702608703
