## DRKG Jaccard score analysis
This notebook analyzes the similarity of the different link types in the DRKG, based on how much their nodes and edges overlap. Specifically, we report the Jaccard score for edges and for nodes, as well as an edge overlap score, for each pair of edge types. For definitions of the scores, see this notebook and the paper. These scores help us assess the quality of the constructed DRKG.

In [1]:
import pandas as pd
import numpy as np
import dgl
import sys
sys.path.insert(1, '../utils')
from utils import download_and_extract
download_and_extract()
drkg_file = '../data/drkg/drkg.tsv'
# drkg.tsv has no header row; header=None keeps the first triplet as data
df = pd.read_csv(drkg_file, sep="\t", header=None)
triplets = df.values.tolist()

Using backend: pytorch


Download finished. Unzipping the file...


Find unique entities

In [2]:
entity_dictionary={}
def insert_entry(entry,ent_type,dic):
    if ent_type not in dic:
        dic[ent_type]={}
    ent_n_id=len(dic[ent_type])
    if entry not in dic[ent_type]:
        dic[ent_type][entry]=ent_n_id
    return dic

for triple in triplets:
    src = triple[0]
    split_src=src.split('::')
    src_type=split_src[0]
    dest = triple[2]
    split_dest=dest.split('::')
    dest_type=split_dest[0]
    insert_entry(src,src_type,entity_dictionary)
    insert_entry(dest,dest_type,entity_dictionary)
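
To make the structure of `entity_dictionary` concrete, here is a minimal sketch of the same id assignment on three made-up triplets. The function name `build_entity_dictionary`, the toy identifiers, and the toy relation names are assumptions for illustration only; they are not part of the DRKG or of the notebook's utilities.

```python
# Toy triplets in the "Type::identifier" format used above (made-up identifiers).
toy_triplets = [
    ["Gene::1", "toy::rel_a::Gene:Gene", "Gene::2"],
    ["Compound::X", "toy::rel_b::Compound:Gene", "Gene::1"],
    ["Gene::2", "toy::rel_a::Gene:Gene", "Gene::3"],
]

def build_entity_dictionary(triplets):
    """Map each entity type to {entity_name: integer id}.

    Ids are assigned per type, in order of first appearance.
    """
    dic = {}
    for src, _, dest in triplets:
        for entity in (src, dest):
            ent_type = entity.split("::")[0]
            ids = dic.setdefault(ent_type, {})
            # setdefault stores a new id only the first time an entity is seen
            ids.setdefault(entity, len(ids))
    return dic

toy_entities = build_entity_dictionary(toy_triplets)
print(toy_entities)
```

Note that the ids restart at 0 for every entity type, so an integer id identifies an entity only together with its type.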

Next, for each edge type, we collect its edges (as pairs of integer node ids) and the set of distinct nodes it touches.

In [3]:
edge_dictionary={}
node_dictionary={}
for triple in triplets:
    src = triple[0]
    split_src=src.split('::')
    src_type=split_src[0]
    dest = triple[2]
    split_dest=dest.split('::')
    dest_type=split_dest[0]
    
    src_int_id=entity_dictionary[src_type][src]
    dest_int_id=entity_dictionary[dest_type][dest]
    
    pair=[(src_int_id,dest_int_id)]
    etype=triple[1]
    if etype in edge_dictionary:
        edge_dictionary[etype]+=pair
    else:
        edge_dictionary[etype]=pair
    if etype in node_dictionary:
        node_dictionary[etype].add(src_int_id)
        node_dictionary[etype].add(dest_int_id)
    else:
        node_dictionary[etype]=set()   
        node_dictionary[etype].add(src_int_id)
        node_dictionary[etype].add(dest_int_id)
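
The same grouping can be written more compactly with `collections.defaultdict`, which removes the explicit `if etype in ...` branches. The sketch below is an alternative formulation run on toy data, not the notebook's own cell; `group_by_edge_type`, the toy triplets, and the hand-written toy entity dictionary are hypothetical names and values.

```python
from collections import defaultdict

def group_by_edge_type(triplets, entity_dictionary):
    """Return (edges, nodes): per edge type, the list of (src_id, dest_id)
    pairs and the set of node ids that appear in those pairs."""
    edges = defaultdict(list)
    nodes = defaultdict(set)
    for src, etype, dest in triplets:
        # The entity type is the prefix before the first '::'
        src_id = entity_dictionary[src.split("::")[0]][src]
        dest_id = entity_dictionary[dest.split("::")[0]][dest]
        edges[etype].append((src_id, dest_id))
        nodes[etype].update((src_id, dest_id))
    return edges, nodes

# Toy input (made-up identifiers and relation names).
toy_entities = {"Gene": {"Gene::1": 0, "Gene::2": 1, "Gene::3": 2}}
toy_triplets = [
    ["Gene::1", "toy::rel_a::Gene:Gene", "Gene::2"],
    ["Gene::2", "toy::rel_a::Gene:Gene", "Gene::3"],
    ["Gene::1", "toy::rel_b::Gene:Gene", "Gene::2"],
]

toy_edges, toy_nodes = group_by_edge_type(toy_triplets, toy_entities)
print(dict(toy_edges))
print(dict(toy_nodes))
```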

## Score Calculation

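Next, for each pair of edge types, we calculate the Jaccard index (https://en.wikipedia.org/wiki/Jaccard_index) of their nodes and of their edges. We also calculate the overlap coefficient (https://en.wikipedia.org/wiki/Overlap_coefficient) of their edges, i.e. the fraction of the smaller edge set that is also contained in the larger one.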
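
As a small worked example of the two definitions, the sketch below computes both scores for two toy edge sets. The Jaccard index is |A ∩ B| / |A ∪ B| and the overlap coefficient is |A ∩ B| / min(|A|, |B|). The helper names `jaccard_index` and `overlap_coefficient` are hypothetical, and returning 0.0 for empty inputs is an assumed convention, not something the notebook defines.

```python
def jaccard_index(a, b):
    """|A ∩ B| / |A ∪ B|; returns 0.0 if both sets are empty (assumed convention)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap_coefficient(a, b):
    """|A ∩ B| / min(|A|, |B|); returns 0.0 if either set is empty (assumed convention)."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0

# Two toy edge sets: the second is fully contained in the first.
edges_1 = {(0, 1), (1, 2), (2, 3)}
edges_2 = {(0, 1), (1, 2)}

jaccard = jaccard_index(edges_1, edges_2)        # 2 shared edges / 3 edges in the union
overlap = overlap_coefficient(edges_1, edges_2)  # 2 shared edges / 2 edges in the smaller set
print(jaccard, overlap)
```

Here the overlap coefficient is 1.0 because every edge of the smaller set also appears in the larger one, while the Jaccard index is only 2/3 because the larger set has one extra edge.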

In [4]:
jacard_pair_info=['edge_type_1\tedge_type_2\tJaccard-edge\tJaccard-node\tPct of edges included in smaller set\n']
keys=list(edge_dictionary.keys())
for i in range(len(keys)):
    for k in range(i+1,len(keys)):
        e1=keys[i]
        e2=keys[k]
        e1_ed=set(edge_dictionary[e1])
        e2_ed=set(edge_dictionary[e2])
        common_edges=e1_ed.intersection(e2_ed)
        union_edges=e1_ed.union(e2_ed)
        jacard_edge=float(len(common_edges)/(len(union_edges)))

        n1_d=node_dictionary[e1]
        n2_d=node_dictionary[e2]
        common_nodes=n1_d.intersection(n2_d)
        union_nodes=n1_d.union(n2_d)           
        jacard_node=float(len(common_nodes)/(len(union_nodes)))

        # Overlap coefficient: fraction of the smaller edge set that also appears in the larger one
        if len(e1_ed)>len(e2_ed):
            max_ed=e1_ed
            min_ed=e2_ed
        else:
            max_ed=e2_ed
            min_ed=e1_ed
        edge_inclusion=float((len(min_ed)-len(min_ed.difference(max_ed)))/len(min_ed))

        jacard_pair_info.append("{}\t{}\t{}\t{}\t{}\n".format(e1, e2, jacard_edge,jacard_node,edge_inclusion))

In [5]:
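# Split each tab-separated line back into its five fields
jacard_triplets=[jacard_pair_in.split('\t') for jacard_pair_in in jacard_pair_info]
# Strip the trailing newline from the last field of every row
jacard_triplets=[jacard_triplet[:-1]+[jacard_triplet[-1].split('\n')[0]] for jacard_triplet in jacard_triplets]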

In [6]:
# Skip the header row and sort by the edge overlap score (column 4), highest first
jacard_triplets_sort=(sorted(jacard_triplets[1:],key=lambda x: float(x[4])))[::-1]
jacard_triplets_store=["{}\t{}\t{}\t{}\t{}\n".format(j[0], j[1], j[2],j[3],j[4]) for j in jacard_triplets_sort]
entity_file = "edge_pair_jaccard_scores_sorted_overlap.tsv"
with open(entity_file, 'w+') as f:
    f.writelines(jacard_triplets_store)

In [7]:
# Skip the header row and sort by the edge Jaccard score (column 2), highest first
jacard_triplets_sort=(sorted(jacard_triplets[1:],key=lambda x: float(x[2])))[::-1]
jacard_triplets_store=["{}\t{}\t{}\t{}\t{}\n".format(j[0], j[1], j[2],j[3],j[4]) for j in jacard_triplets_sort]
entity_file = "edge_pair_jaccard_scores_sorted_jacard.tsv"
with open(entity_file, 'w+') as f:
    f.writelines(jacard_triplets_store)
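
The cells above format the scores as strings, split them again, and then sort. One possible simplification, sketched here on made-up rows, is to keep each row as a tuple with numeric scores and let the standard `csv` module produce the tab-separated text. The function name `to_tsv` and the toy rows are hypothetical; the sketch writes to an in-memory buffer rather than to the output files used above.

```python
import csv
import io

# Hypothetical score rows:
# (edge_type_1, edge_type_2, jaccard_edge, jaccard_node, edge_inclusion)
rows = [
    ("toy::a", "toy::b", 0.10, 0.30, 0.90),
    ("toy::a", "toy::c", 0.50, 0.60, 0.70),
    ("toy::b", "toy::c", 0.25, 0.40, 0.80),
]

def to_tsv(rows, sort_column):
    """Return the rows as TSV text, sorted by the given column, highest first."""
    ordered = sorted(rows, key=lambda r: r[sort_column], reverse=True)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerows(ordered)
    return buffer.getvalue()

by_overlap = to_tsv(rows, sort_column=4)       # sorted by edge inclusion
by_jaccard_edge = to_tsv(rows, sort_column=2)  # sorted by edge Jaccard score
print(by_jaccard_edge)
```

One difference from the cells above: `sorted(..., reverse=True)` keeps rows with equal scores in their original relative order, whereas `sorted(...)[::-1]` reverses the order of such ties.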

In [8]:
print(jacard_triplets_store)

['STRING::REACTION::Gene:Gene\tSTRING::CATALYSIS::Gene:Gene\t0.60764928840311\t0.6564436183395291\t0.8185443610948584\n', 'STRING::REACTION::Gene:Gene\tSTRING::BINDING::Gene:Gene\t0.4130593886549027\t0.481036351098414\t0.6628761377127028\n', 'STRING::CATALYSIS::Gene:Gene\tSTRING::BINDING::Gene:Gene\t0.30688155761585206\t0.47683080197349326\t0.4902002374356945\n', 'bioarx::DrugHumGen:Compound:Gene\tHetionet::CbG::Compound:Gene\t0.2722463231404084\t0.3625282167042889\t0.6670987814363495\n', 'bioarx::DrugHumGen:Compound:Gene\tDRUGBANK::target::Compound:Gene\t0.2605456907752274\t0.5645015560686678\t0.47103037895396177\n', 'STRING::INHIBITION::Gene:Gene\tSTRING::PTMOD::Gene:Gene\t0.24041654939487755\t0.35875968992248064\t0.5652087606696222\n', 'GNBR::E::Compound:Gene\tGNBR::K::Compound:Gene\t0.21505839298207846\t0.4438074438074438\t0.6439448875997099\n', 'DRUGBANK::enzyme::Compound:Gene\tHetionet::CbG::Compound:Gene\t0.188328530259366\t0.3438459391571309\t0.5309770465163518\n', 'DRUGBANK::t