# DRKG Edge Score Analysis
This notebook shows how to analyze the (h, r, t) edge scores. We use $$\mbox{score} = \gamma - ||\mathbf{h}+\mathbf{r}-\mathbf{t}||_{2}$$ as the score function, which is compatible with the training methodology (please refer to Train_embeddings.ipynb for more details). Here $\gamma$ is a constant used when training the TransE model; it is set to 12.0.
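
As a quick illustration of the score function above, here is a minimal sketch in plain Python. The 2-d embeddings are made up for the example and the helper name `transe_l2_score` is hypothetical; the notebook itself computes scores with PyTorch further down.

```python
import math

GAMMA = 12.0  # same constant as stated in the text above


def transe_l2_score(h, r, t, gamma=GAMMA):
    """Return gamma - ||h + r - t||_2 for plain Python sequences."""
    diff = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    return gamma - math.sqrt(sum(d * d for d in diff))


# Toy embeddings, invented for illustration only
h = [1.0, 0.0]
r = [0.0, 2.0]
t_good = [1.0, 2.0]  # h + r lands exactly on t, so the norm is 0
t_bad = [4.0, 6.0]   # h + r - t = (-3, -4), so the norm is 5

perfect_score = transe_l2_score(h, r, t_good)
poor_score = transe_l2_score(h, r, t_bad)
print(perfect_score, poor_score)
```

The score is at most $\gamma$, which is reached when $\mathbf{h}+\mathbf{r}$ coincides with $\mathbf{t}$, and it decreases as the translated head moves away from the tail.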

## Preparing data for Edge Score Analysis

To avoid the possible bias from over-fitting the triplets in the training set, we split the whole DRKG into 10 equal folds and train 10 different models, each using one fold as the test set and the remaining nine folds as the training set.
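
The scheme can be sketched on toy indices as follows. This is an illustration under assumptions: the helper `ten_fold_indices` is hypothetical, and it places fold boundaries proportionally, which differs slightly from the fixed fold size used in the notebook cells.

```python
import random


def ten_fold_indices(num_items, num_folds=10, seed=0):
    """Shuffle item indices and cut them into num_folds contiguous folds."""
    order = list(range(num_items))
    random.Random(seed).shuffle(order)
    # Proportional boundaries keep fold sizes within one item of each other
    bounds = [k * num_items // num_folds for k in range(num_folds + 1)]
    return [order[bounds[k]:bounds[k + 1]] for k in range(num_folds)]


folds = ten_fold_indices(23)

# One (train, test) pair per model: fold i is held out, the rest is training data
splits = []
for i, test_fold in enumerate(folds):
    train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
    splits.append((train, test_fold))

print([len(test) for _, test in splits])
```

Every index lands in exactly one test fold, so each triple ends up scored by a model that never saw it during training.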

Please make sure you have already installed pytorch, dgl and dgl-ke packages.

First, we load the DRKG dataset.

In [None]:
import pandas as pd
import numpy as np
import sys
sys.path.insert(1, '../utils')
from utils import download_and_extract
download_and_extract()
drkg_file = '../data/drkg/drkg.tsv'
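# drkg.tsv has no header row; without header=None the first triple would be read as column names
df = pd.read_csv(drkg_file, sep="\t", header=None)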
triples = df.values.tolist()

Create a directory to store the ten-fold training data.

In [None]:
!mkdir -p train/ten_fold

Split dataset into 10 equal parts.

In [None]:
num_triples = len(triples)
# Shuffle the triple indices so that each fold is a random sample of the graph
seed = np.arange(num_triples)
np.random.shuffle(seed)

# Slightly more than one tenth, so ten folds cover every triple;
# the last fold takes whatever remains
fold_size = int((num_triples + 10) * 0.1)
total = 0
for i in range(10):
    fold_edge_cnt = fold_size if total + fold_size < num_triples else num_triples - total
    fold_edges = seed[total:total+fold_edge_cnt]
    with open("train/ten_fold/part{}.tsv".format(i), 'w+') as f:
        for idx in fold_edges:
            # One tab-separated (head, relation, tail) triple per line
            f.write("{}\t{}\t{}\n".format(triples[idx][0], triples[idx][1], triples[idx][2]))
    total += fold_edge_cnt

Build the ten training datasets.

In [None]:
import os

for i in range(10):
    os.mkdir(os.path.join("./train/ten_fold/", "part{}".format(i)))

In [None]:
from shutil import copyfile

for i in range(10):
    part_dir = os.path.join("./train/ten_fold/", "part{}".format(i))
    # Training set for fold i: the triples of every fold except fold i
    train_triples = []
    for j in range(10):
        if i == j:
            continue
        with open("./train/ten_fold/part{}.tsv".format(j), 'r') as f:
            train_triples.extend(f)

    with open(os.path.join(part_dir, "skip_part{}.tsv".format(i)), 'w+') as f:
        f.writelines(train_triples)
    # Fold i itself is copied next to its training file as the held-out set
    copyfile(os.path.join("./train/ten_fold/", "part{}.tsv".format(i)),
             os.path.join(part_dir, "part{}.tsv".format(i)))

Now we have ten directories under ./train/ten_fold/

In [None]:
!ls  train/ten_fold

Then we need to run the ten training processes, one per fold.

In [None]:
!DGLBACKEND=pytorch dglke_train --dataset DRKG --data_path ./train/ten_fold/part0/ --data_files skip_part0.tsv part0.tsv part0.tsv --format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 2048 \
--neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --gpu 0 1 2 3 4 5 6 7 --num_proc 8 --neg_sample_size_eval 10000 --async_update

In [None]:
!DGLBACKEND=pytorch dglke_train --dataset DRKG --data_path ./train/ten_fold/part1/ --data_files skip_part1.tsv part1.tsv part1.tsv --format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 2048 \
--neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --gpu 0 1 2 3 4 5 6 7 --num_proc 8 --neg_sample_size_eval 10000 --async_update

In [None]:
!DGLBACKEND=pytorch dglke_train --dataset DRKG --data_path ./train/ten_fold/part2/ --data_files skip_part2.tsv part2.tsv part2.tsv --format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 2048 \
--neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --gpu 0 1 2 3 4 5 6 7 --num_proc 8 --neg_sample_size_eval 10000 --async_update

In [None]:
!DGLBACKEND=pytorch dglke_train --dataset DRKG --data_path ./train/ten_fold/part3/ --data_files skip_part3.tsv part3.tsv part3.tsv --format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 2048 \
--neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --gpu 0 1 2 3 4 5 6 7 --num_proc 8 --neg_sample_size_eval 10000 --async_update

In [None]:
!DGLBACKEND=pytorch dglke_train --dataset DRKG --data_path ./train/ten_fold/part4/ --data_files skip_part4.tsv part4.tsv part4.tsv --format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 2048 \
--neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --gpu 0 1 2 3 4 5 6 7 --num_proc 8 --neg_sample_size_eval 10000 --async_update

In [None]:
!DGLBACKEND=pytorch dglke_train --dataset DRKG --data_path ./train/ten_fold/part5/ --data_files skip_part5.tsv part5.tsv part5.tsv --format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 2048 \
--neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --gpu 0 1 2 3 4 5 6 7 --num_proc 8 --neg_sample_size_eval 10000 --async_update

In [None]:
!DGLBACKEND=pytorch dglke_train --dataset DRKG --data_path ./train/ten_fold/part6/ --data_files skip_part6.tsv part6.tsv part6.tsv --format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 2048 \
--neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --gpu 0 1 2 3 4 5 6 7 --num_proc 8 --neg_sample_size_eval 10000 --async_update

In [None]:
!DGLBACKEND=pytorch dglke_train --dataset DRKG --data_path ./train/ten_fold/part7/ --data_files skip_part7.tsv part7.tsv part7.tsv --format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 2048 \
--neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --gpu 0 1 2 3 4 5 6 7 --num_proc 8 --neg_sample_size_eval 10000 --async_update

In [None]:
!DGLBACKEND=pytorch dglke_train --dataset DRKG --data_path ./train/ten_fold/part8/ --data_files skip_part8.tsv part8.tsv part8.tsv --format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 2048 \
--neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --gpu 0 1 2 3 4 5 6 7 --num_proc 8 --neg_sample_size_eval 10000 --async_update

In [None]:
!DGLBACKEND=pytorch dglke_train --dataset DRKG --data_path ./train/ten_fold/part9/ --data_files skip_part9.tsv part9.tsv part9.tsv --format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 2048 \
--neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 --max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv --regularization_coef 1.00E-07 --test --num_thread 1 --gpu 0 1 2 3 4 5 6 7 --num_proc 8 --neg_sample_size_eval 10000 --async_update
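
The ten commands above differ only in the fold index. As a sketch, the same strings can be generated in a loop; the flags are copied verbatim from the cells above, while the names `COMMON_FLAGS` and `build_train_command` are hypothetical. The snippet only builds and prints the commands and does not launch training.

```python
# Flags shared by all ten runs, copied from the commands above
COMMON_FLAGS = (
    "--format 'raw_udd_hrt' --model_name TransE_l2 --batch_size 2048 "
    "--neg_sample_size 256 --hidden_dim 400 --gamma 12.0 --lr 0.1 "
    "--max_step 100000 --log_interval 1000 --batch_size_eval 16 -adv "
    "--regularization_coef 1.00E-07 --test --num_thread 1 "
    "--gpu 0 1 2 3 4 5 6 7 --num_proc 8 --neg_sample_size_eval 10000 --async_update"
)


def build_train_command(fold):
    """Return the dglke_train command line for one fold as a single string."""
    return (
        "DGLBACKEND=pytorch dglke_train --dataset DRKG "
        "--data_path ./train/ten_fold/part{i}/ "
        "--data_files skip_part{i}.tsv part{i}.tsv part{i}.tsv "
    ).format(i=fold) + COMMON_FLAGS


commands = [build_train_command(i) for i in range(10)]
for cmd in commands:
    print(cmd)
```

Each string could then be passed to a shell one at a time, for example with `subprocess.run(cmd, shell=True, check=True)`, assuming the same GPU setup as in the cells above.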

## Loading Entity ID Mapping

In [None]:
import pandas as pd
import numpy as np
import os
import csv
import torch as th
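
Each per-fold cell that follows reads `entities.tsv` and `relations.tsv` with the same loop. A minimal sketch of that loop as a reusable helper is shown here; the name `load_id_mapping` is hypothetical, the column order (name first, integer id second) follows the `fieldnames` used in those cells, and the demo file contents are made up.

```python
import csv
import os
import tempfile


def load_id_mapping(path):
    """Read a two-column TSV (name, integer id) into a dict."""
    mapping = {}
    with open(path, newline='', encoding='utf-8') as f:
        for row in csv.reader(f, delimiter='\t'):
            name, idx = row[0], row[1]
            mapping[name] = int(idx)
    return mapping


# Demo on a throwaway file with invented entity names
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "entities.tsv")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Gene::A\t0\nCompound::B\t1\n")
    demo_mapping = load_id_mapping(demo_path)

print(demo_mapping)
```

The same helper would serve for both the entity and the relation mapping, since the two files share one layout.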

### Edge scores of part0

In [None]:
ids = []
entity2id = {}
with open("./train/ten_fold/part0/entities.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=[ 'entity','id'])
    for row_val in reader:
        id = row_val['id']

        entity2id[row_val['entity']] = int(id)

print(len(entity2id))

rel2id = {}
with open("./train/ten_fold/part0/relations.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['entity','id'])
    for row_val in reader:
        id = row_val['id']

        rel2id[row_val['entity']] = int(id)

print(len(rel2id))

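# Checkpoint of the model trained with part0 held out. The numeric suffix depends on
# how many runs already exist under ckpts/, so adjust it to match your own run.
node_emb = np.load('ckpts/TransE_l2_DRKG_7/DRKG_TransE_l2_entity.npy')
rel_emb = np.load('ckpts/TransE_l2_DRKG_7/DRKG_TransE_l2_relation.npy')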

head_ids = []
rel_ids = []
tail_ids = []
p0_rows = []
with open("./train/ten_fold/part0/part0.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['head', 'rel', 'tail'])
    for row_val in reader:
        head = row_val['head']
        rel = row_val['rel']
        tail = row_val['tail']

        head_id = entity2id[head]
        rel_id = rel2id[rel]
        tail_id = entity2id[tail]
        
        head_ids.append(head_id)
        rel_ids.append(rel_id)
        tail_ids.append(tail_id)
        p0_rows.append((head, rel, tail))
        
head_ids = np.array(head_ids)
rel_ids = np.array(rel_ids)
tail_ids = np.array(tail_ids)
triple_ids = np.arange(head_ids.shape[0])

gamma = 12.0  # constant used when training the TransE model

def transE_l2(head, rel, tail):
    # Edge score: gamma - ||h + r - t||_2, computed row-wise over the last dimension
    score = head + rel - tail
    return gamma - th.norm(score, p=2, dim=-1)

with th.no_grad():
    node_emb = th.tensor(node_emb)
    rel_emb = th.tensor(rel_emb)
    head_ids = th.tensor(head_ids)
    rel_ids = th.tensor(rel_ids)
    tail_ids = th.tensor(tail_ids)

    head_embedding = node_emb[head_ids]
    rel_embedding = rel_emb[rel_ids]
    tail_embedding = node_emb[tail_ids]


p0_l2_score = transE_l2(head_embedding, rel_embedding, tail_embedding)
print(th.max(p0_l2_score))
print(th.min(p0_l2_score))

print(p0_l2_score.shape[0])
p0_rel_score = {}
for i in range(p0_l2_score.shape[0]):
    rel = p0_rows[i][1]
    if p0_rel_score.get(rel, None) is None:
        p0_rel_score[rel] = []
    p0_rel_score[rel].append(p0_l2_score[i])
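
The grouping loop that closes the cell can also be written with `collections.defaultdict`. A minimal sketch on made-up rows and scores follows; the relation names `rel_a` and `rel_b` and the helper `group_scores_by_relation` are hypothetical.

```python
from collections import defaultdict


def group_scores_by_relation(rows, scores):
    """Collect scores into one list per relation, keeping input order."""
    grouped = defaultdict(list)
    for (_, rel, _), score in zip(rows, scores):
        grouped[rel].append(score)
    return dict(grouped)


# Toy (head, relation, tail) rows with one invented score each
toy_rows = [("e1", "rel_a", "e2"), ("e2", "rel_b", "e3"), ("e3", "rel_a", "e1")]
toy_scores = [0.5, -1.25, 2.0]

toy_rel_score = group_scores_by_relation(toy_rows, toy_scores)
print(toy_rel_score)
```

With the notebook's data, `rows` would be `p0_rows` and `scores` would be `p0_l2_score`.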

### Edge scores of part1

In [None]:
ids = []
entity2id = {}
with open("./train/ten_fold/part1/entities.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=[ 'entity','id'])
    for row_val in reader:
        id = row_val['id']

        entity2id[row_val['entity']] = int(id)

print(len(entity2id))

rel2id = {}
with open("./train/ten_fold/part1/relations.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['entity','id'])
    for row_val in reader:
        id = row_val['id']

        rel2id[row_val['entity']] = int(id)

print(len(rel2id))

node_emb = np.load('ckpts/TransE_l2_DRKG_8/DRKG_TransE_l2_entity.npy')
rel_emb = np.load('ckpts/TransE_l2_DRKG_8/DRKG_TransE_l2_relation.npy')

head_ids = []
rel_ids = []
tail_ids = []
p1_rows = []
with open("./train/ten_fold/part1/part1.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['head', 'rel', 'tail'])
    for row_val in reader:
        head = row_val['head']
        rel = row_val['rel']
        tail = row_val['tail']

        head_id = entity2id[head]
        rel_id = rel2id[rel]
        tail_id = entity2id[tail]
        
        head_ids.append(head_id)
        rel_ids.append(rel_id)
        tail_ids.append(tail_id)
        p1_rows.append((head, rel, tail))
        
head_ids = np.array(head_ids)
rel_ids = np.array(rel_ids)
tail_ids = np.array(tail_ids)
triple_ids = np.arange(head_ids.shape[0])

with th.no_grad():
    node_emb = th.tensor(node_emb)
    rel_emb = th.tensor(rel_emb)
    head_ids = th.tensor(head_ids)
    rel_ids = th.tensor(rel_ids)
    tail_ids = th.tensor(tail_ids)

    head_embedding = node_emb[head_ids]
    rel_embedding = rel_emb[rel_ids]
    tail_embedding = node_emb[tail_ids]


p1_l2_score = transE_l2(head_embedding, rel_embedding, tail_embedding)
print(th.max(p1_l2_score))
print(th.min(p1_l2_score))
print(p1_l2_score.shape[0])
p1_rel_score = {}
for i in range(p1_l2_score.shape[0]):
    rel = p1_rows[i][1]
    if p1_rel_score.get(rel, None) is None:
        p1_rel_score[rel] = []
    p1_rel_score[rel].append(p1_l2_score[i])
    

### Edge scores of part2

In [None]:
ids = []
entity2id = {}
with open("./train/ten_fold/part2/entities.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=[ 'entity','id'])
    for row_val in reader:
        id = row_val['id']

        entity2id[row_val['entity']] = int(id)

print(len(entity2id))

rel2id = {}
with open("./train/ten_fold/part2/relations.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['entity','id'])
    for row_val in reader:
        id = row_val['id']

        rel2id[row_val['entity']] = int(id)

print(len(rel2id))

node_emb = np.load('ckpts/TransE_l2_DRKG_9/DRKG_TransE_l2_entity.npy')
rel_emb = np.load('ckpts/TransE_l2_DRKG_9/DRKG_TransE_l2_relation.npy')

head_ids = []
rel_ids = []
tail_ids = []
p2_rows = []
with open("./train/ten_fold/part2/part2.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['head', 'rel', 'tail'])
    for row_val in reader:
        head = row_val['head']
        rel = row_val['rel']
        tail = row_val['tail']

        head_id = entity2id[head]
        rel_id = rel2id[rel]
        tail_id = entity2id[tail]
        
        head_ids.append(head_id)
        rel_ids.append(rel_id)
        tail_ids.append(tail_id)
        p2_rows.append((head, rel, tail))
        
head_ids = np.array(head_ids)
rel_ids = np.array(rel_ids)
tail_ids = np.array(tail_ids)
triple_ids = np.arange(head_ids.shape[0])

with th.no_grad():
    node_emb = th.tensor(node_emb)
    rel_emb = th.tensor(rel_emb)
    head_ids = th.tensor(head_ids)
    rel_ids = th.tensor(rel_ids)
    tail_ids = th.tensor(tail_ids)

    head_embedding = node_emb[head_ids]
    rel_embedding = rel_emb[rel_ids]
    tail_embedding = node_emb[tail_ids]


p2_l2_score = transE_l2(head_embedding, rel_embedding, tail_embedding)
print(th.max(p2_l2_score))
print(th.min(p2_l2_score))

p2_rel_score = {}
for i in range(p2_l2_score.shape[0]):
    rel = p2_rows[i][1]
    if p2_rel_score.get(rel, None) is None:
        p2_rel_score[rel] = []
    p2_rel_score[rel].append(p2_l2_score[i])
    
#for key, rel_score in p2_rel_score.items():
#    print("{}:{}".format(key, len(rel_score)))
#    plt.hist(rel_score)
#    plt.show()

### Edge scores of part3

In [None]:
ids = []
entity2id = {}
with open("./train/ten_fold/part3/entities.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=[ 'entity','id'])
    for row_val in reader:
        id = row_val['id']

        entity2id[row_val['entity']] = int(id)

print(len(entity2id))

rel2id = {}
with open("./train/ten_fold/part3/relations.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['entity','id'])
    for row_val in reader:
        id = row_val['id']

        rel2id[row_val['entity']] = int(id)

print(len(rel2id))


node_emb = np.load('ckpts/TransE_l2_DRKG_10/DRKG_TransE_l2_entity.npy')
rel_emb = np.load('ckpts/TransE_l2_DRKG_10/DRKG_TransE_l2_relation.npy')

head_ids = []
rel_ids = []
tail_ids = []
p3_rows = []
with open("./train/ten_fold/part3/part3.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['head', 'rel', 'tail'])
    for row_val in reader:
        head = row_val['head']
        rel = row_val['rel']
        tail = row_val['tail']

        head_id = entity2id[head]
        rel_id = rel2id[rel]
        tail_id = entity2id[tail]
        
        head_ids.append(head_id)
        rel_ids.append(rel_id)
        tail_ids.append(tail_id)
        p3_rows.append((head, rel, tail))
        
head_ids = np.array(head_ids)
rel_ids = np.array(rel_ids)
tail_ids = np.array(tail_ids)
triple_ids = np.arange(head_ids.shape[0])

with th.no_grad():
    node_emb = th.tensor(node_emb)
    rel_emb = th.tensor(rel_emb)
    head_ids = th.tensor(head_ids)
    rel_ids = th.tensor(rel_ids)
    tail_ids = th.tensor(tail_ids)

    head_embedding = node_emb[head_ids]
    rel_embedding = rel_emb[rel_ids]
    tail_embedding = node_emb[tail_ids]


p3_l2_score = transE_l2(head_embedding, rel_embedding, tail_embedding)
print(th.max(p3_l2_score))
print(th.min(p3_l2_score))

p3_rel_score = {}
for i in range(p3_l2_score.shape[0]):
    rel = p3_rows[i][1]
    if p3_rel_score.get(rel, None) is None:
        p3_rel_score[rel] = []
    p3_rel_score[rel].append(p3_l2_score[i])
    

### Edge scores of part4

In [None]:
ids = []
entity2id = {}
with open("./train/ten_fold/part4/entities.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=[ 'entity','id'])
    for row_val in reader:
        id = row_val['id']

        entity2id[row_val['entity']] = int(id)

print(len(entity2id))

rel2id = {}
with open("./train/ten_fold/part4/relations.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['entity','id'])
    for row_val in reader:
        id = row_val['id']

        rel2id[row_val['entity']] = int(id)

print(len(rel2id))

node_emb = np.load('ckpts/TransE_l2_DRKG_11/DRKG_TransE_l2_entity.npy')
rel_emb = np.load('ckpts/TransE_l2_DRKG_11/DRKG_TransE_l2_relation.npy')

head_ids = []
rel_ids = []
tail_ids = []
p4_rows = []
with open("./train/ten_fold/part4/part4.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['head', 'rel', 'tail'])
    for row_val in reader:
        head = row_val['head']
        rel = row_val['rel']
        tail = row_val['tail']

        head_id = entity2id[head]
        rel_id = rel2id[rel]
        tail_id = entity2id[tail]
        
        head_ids.append(head_id)
        rel_ids.append(rel_id)
        tail_ids.append(tail_id)
        p4_rows.append((head, rel, tail))
        
head_ids = np.array(head_ids)
rel_ids = np.array(rel_ids)
tail_ids = np.array(tail_ids)
triple_ids = np.arange(head_ids.shape[0])

with th.no_grad():
    node_emb = th.tensor(node_emb)
    rel_emb = th.tensor(rel_emb)
    head_ids = th.tensor(head_ids)
    rel_ids = th.tensor(rel_ids)
    tail_ids = th.tensor(tail_ids)

    head_embedding = node_emb[head_ids]
    rel_embedding = rel_emb[rel_ids]
    tail_embedding = node_emb[tail_ids]


p4_l2_score = transE_l2(head_embedding, rel_embedding, tail_embedding)
print(th.max(p4_l2_score))
print(th.min(p4_l2_score))

p4_rel_score = {}
for i in range(p4_l2_score.shape[0]):
    rel = p4_rows[i][1]
    if p4_rel_score.get(rel, None) is None:
        p4_rel_score[rel] = []
    p4_rel_score[rel].append(p4_l2_score[i])


### Edge scores of part5

In [None]:
ids = []
entity2id = {}
with open("./train/ten_fold/part5/entities.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=[ 'entity','id'])
    for row_val in reader:
        id = row_val['id']

        entity2id[row_val['entity']] = int(id)

print(len(entity2id))

rel2id = {}
with open("./train/ten_fold/part5/relations.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['entity','id'])
    for row_val in reader:
        id = row_val['id']

        rel2id[row_val['entity']] = int(id)

print(len(rel2id))

node_emb = np.load('ckpts/TransE_l2_DRKG_12/DRKG_TransE_l2_entity.npy')
rel_emb = np.load('ckpts/TransE_l2_DRKG_12/DRKG_TransE_l2_relation.npy')

head_ids = []
rel_ids = []
tail_ids = []
p5_rows = []
with open("./train/ten_fold/part5/part5.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['head', 'rel', 'tail'])
    for row_val in reader:
        head = row_val['head']
        rel = row_val['rel']
        tail = row_val['tail']

        head_id = entity2id[head]
        rel_id = rel2id[rel]
        tail_id = entity2id[tail]
        
        head_ids.append(head_id)
        rel_ids.append(rel_id)
        tail_ids.append(tail_id)
        p5_rows.append((head, rel, tail))
        
head_ids = np.array(head_ids)
rel_ids = np.array(rel_ids)
tail_ids = np.array(tail_ids)
triple_ids = np.arange(head_ids.shape[0])

with th.no_grad():
    node_emb = th.tensor(node_emb)
    rel_emb = th.tensor(rel_emb)
    head_ids = th.tensor(head_ids)
    rel_ids = th.tensor(rel_ids)
    tail_ids = th.tensor(tail_ids)

    head_embedding = node_emb[head_ids]
    rel_embedding = rel_emb[rel_ids]
    tail_embedding = node_emb[tail_ids]


p5_l2_score = transE_l2(head_embedding, rel_embedding, tail_embedding)
print(th.max(p5_l2_score))
print(th.min(p5_l2_score))

p5_rel_score = {}
for i in range(p5_l2_score.shape[0]):
    rel = p5_rows[i][1]
    if p5_rel_score.get(rel, None) is None:
        p5_rel_score[rel] = []
    p5_rel_score[rel].append(p5_l2_score[i])

### Edge scores of part6

In [None]:
ids = []
entity2id = {}
with open("./train/ten_fold/part6/entities.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=[ 'entity','id'])
    for row_val in reader:
        id = row_val['id']

        entity2id[row_val['entity']] = int(id)

print(len(entity2id))

rel2id = {}
with open("./train/ten_fold/part6/relations.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['entity','id'])
    for row_val in reader:
        id = row_val['id']

        rel2id[row_val['entity']] = int(id)

print(len(rel2id))

node_emb = np.load('ckpts/TransE_l2_DRKG_13/DRKG_TransE_l2_entity.npy')
rel_emb = np.load('ckpts/TransE_l2_DRKG_13/DRKG_TransE_l2_relation.npy')

head_ids = []
rel_ids = []
tail_ids = []
p6_rows = []
with open("./train/ten_fold/part6/part6.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['head', 'rel', 'tail'])
    for row_val in reader:
        head = row_val['head']
        rel = row_val['rel']
        tail = row_val['tail']

        head_id = entity2id[head]
        rel_id = rel2id[rel]
        tail_id = entity2id[tail]
        
        head_ids.append(head_id)
        rel_ids.append(rel_id)
        tail_ids.append(tail_id)
        p6_rows.append((head, rel, tail))
        
head_ids = np.array(head_ids)
rel_ids = np.array(rel_ids)
tail_ids = np.array(tail_ids)
triple_ids = np.arange(head_ids.shape[0])

with th.no_grad():
    node_emb = th.tensor(node_emb)
    rel_emb = th.tensor(rel_emb)
    head_ids = th.tensor(head_ids)
    rel_ids = th.tensor(rel_ids)
    tail_ids = th.tensor(tail_ids)

    head_embedding = node_emb[head_ids]
    rel_embedding = rel_emb[rel_ids]
    tail_embedding = node_emb[tail_ids]


p6_l2_score = transE_l2(head_embedding, rel_embedding, tail_embedding)
print(th.max(p6_l2_score))
print(th.min(p6_l2_score))

p6_rel_score = {}
for i in range(p6_l2_score.shape[0]):
    rel = p6_rows[i][1]
    if p6_rel_score.get(rel, None) is None:
        p6_rel_score[rel] = []
    p6_rel_score[rel].append(p6_l2_score[i])


### Edge scores of part7

In [None]:
ids = []
entity2id = {}
with open("./train/ten_fold/part7/entities.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=[ 'entity','id'])
    for row_val in reader:
        id = row_val['id']

        entity2id[row_val['entity']] = int(id)

print(len(entity2id))

rel2id = {}
with open("./train/ten_fold/part7/relations.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['entity','id'])
    for row_val in reader:
        id = row_val['id']

        rel2id[row_val['entity']] = int(id)

print(len(rel2id))

node_emb = np.load('ckpts/TransE_l2_DRKG_14/DRKG_TransE_l2_entity.npy')
rel_emb = np.load('ckpts/TransE_l2_DRKG_14/DRKG_TransE_l2_relation.npy')

head_ids = []
rel_ids = []
tail_ids = []
p7_rows = []
with open("./train/ten_fold/part7/part7.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['head', 'rel', 'tail'])
    for row_val in reader:
        head = row_val['head']
        rel = row_val['rel']
        tail = row_val['tail']

        head_id = entity2id[head]
        rel_id = rel2id[rel]
        tail_id = entity2id[tail]
        
        head_ids.append(head_id)
        rel_ids.append(rel_id)
        tail_ids.append(tail_id)
        p7_rows.append((head, rel, tail))
        
head_ids = np.array(head_ids)
rel_ids = np.array(rel_ids)
tail_ids = np.array(tail_ids)
triple_ids = np.arange(head_ids.shape[0])

with th.no_grad():
    node_emb = th.tensor(node_emb)
    rel_emb = th.tensor(rel_emb)
    head_ids = th.tensor(head_ids)
    rel_ids = th.tensor(rel_ids)
    tail_ids = th.tensor(tail_ids)

    head_embedding = node_emb[head_ids]
    rel_embedding = rel_emb[rel_ids]
    tail_embedding = node_emb[tail_ids]


p7_l2_score = transE_l2(head_embedding, rel_embedding, tail_embedding)
print(th.max(p7_l2_score))
print(th.min(p7_l2_score))

p7_rel_score = {}
for i in range(p7_l2_score.shape[0]):
    rel = p7_rows[i][1]
    if p7_rel_score.get(rel, None) is None:
        p7_rel_score[rel] = []
    p7_rel_score[rel].append(p7_l2_score[i])


### Edge scores of part8

In [None]:
ids = []
entity2id = {}
with open("./train/ten_fold/part8/entities.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=[ 'entity','id'])
    for row_val in reader:
        id = row_val['id']

        entity2id[row_val['entity']] = int(id)

print(len(entity2id))

rel2id = {}
with open("./train/ten_fold/part8/relations.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['entity','id'])
    for row_val in reader:
        id = row_val['id']

        rel2id[row_val['entity']] = int(id)

print(len(rel2id))

node_emb = np.load('ckpts/TransE_l2_DRKG_15/DRKG_TransE_l2_entity.npy')
rel_emb = np.load('ckpts/TransE_l2_DRKG_15/DRKG_TransE_l2_relation.npy')

head_ids = []
rel_ids = []
tail_ids = []
p8_rows = []
with open("./train/ten_fold/part8/part8.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['head', 'rel', 'tail'])
    for row_val in reader:
        head = row_val['head']
        rel = row_val['rel']
        tail = row_val['tail']

        head_id = entity2id[head]
        rel_id = rel2id[rel]
        tail_id = entity2id[tail]
        
        head_ids.append(head_id)
        rel_ids.append(rel_id)
        tail_ids.append(tail_id)
        p8_rows.append((head, rel, tail))
        
head_ids = np.array(head_ids)
rel_ids = np.array(rel_ids)
tail_ids = np.array(tail_ids)
triple_ids = np.arange(head_ids.shape[0])

with th.no_grad():
    node_emb = th.tensor(node_emb)
    rel_emb = th.tensor(rel_emb)
    head_ids = th.tensor(head_ids)
    rel_ids = th.tensor(rel_ids)
    tail_ids = th.tensor(tail_ids)

    head_embedding = node_emb[head_ids]
    rel_embedding = rel_emb[rel_ids]
    tail_embedding = node_emb[tail_ids]


p8_l2_score = transE_l2(head_embedding, rel_embedding, tail_embedding)
print(th.max(p8_l2_score))
print(th.min(p8_l2_score))

p8_rel_score = {}
for i in range(p8_l2_score.shape[0]):
    rel = p8_rows[i][1]
    if p8_rel_score.get(rel, None) is None:
        p8_rel_score[rel] = []
    p8_rel_score[rel].append(p8_l2_score[i])


### Edge scores of part9

In [None]:
ids = []
entity2id = {}
with open("./train/ten_fold/part9/entities.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=[ 'entity','id'])
    for row_val in reader:
        id = row_val['id']

        entity2id[row_val['entity']] = int(id)

print(len(entity2id))

rel2id = {}
with open("./train/ten_fold/part9/relations.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['entity','id'])
    for row_val in reader:
        id = row_val['id']

        rel2id[row_val['entity']] = int(id)

print(len(rel2id))

node_emb = np.load('ckpts/TransE_l2_DRKG_16/DRKG_TransE_l2_entity.npy')
rel_emb = np.load('ckpts/TransE_l2_DRKG_16/DRKG_TransE_l2_relation.npy')

head_ids = []
rel_ids = []
tail_ids = []
p9_rows = []
with open("./train/ten_fold/part9/part9.tsv", newline='', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile, delimiter='\t', fieldnames=['head', 'rel', 'tail'])
    for row_val in reader:
        head = row_val['head']
        rel = row_val['rel']
        tail = row_val['tail']

        head_id = entity2id[head]
        rel_id = rel2id[rel]
        tail_id = entity2id[tail]
        
        head_ids.append(head_id)
        rel_ids.append(rel_id)
        tail_ids.append(tail_id)
        p9_rows.append((head, rel, tail))
        
head_ids = np.array(head_ids)
rel_ids = np.array(rel_ids)
tail_ids = np.array(tail_ids)
triple_ids = np.arange(head_ids.shape[0])

with th.no_grad():
    node_emb = th.tensor(node_emb)
    rel_emb = th.tensor(rel_emb)
    head_ids = th.tensor(head_ids)
    rel_ids = th.tensor(rel_ids)
    tail_ids = th.tensor(tail_ids)

    head_embedding = node_emb[head_ids]
    rel_embedding = rel_emb[rel_ids]
    tail_embedding = node_emb[tail_ids]


p9_l2_score = transE_l2(head_embedding, rel_embedding, tail_embedding)
print(th.max(p9_l2_score))
print(th.min(p9_l2_score))

p9_rel_score = {}
for i in range(p9_l2_score.shape[0]):
    rel = p9_rows[i][1]
    if p9_rel_score.get(rel, None) is None:
        p9_rel_score[rel] = []
    p9_rel_score[rel].append(p9_l2_score[i])
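
The ten cells above differ only in the fold index and the checkpoint directory: fold `i` is paired with `ckpts/TransE_l2_DRKG_{i+7}`. As a sketch, the paths can be derived from the fold index. The helper `fold_paths` is hypothetical, and the offset of 7 is read off the paths in those cells; it presumably reflects earlier runs in the original environment, so treat it as an assumption to adjust locally.

```python
def fold_paths(fold, ckpt_offset=7):
    """Return the input files used to score one fold, keyed by role."""
    part_dir = "./train/ten_fold/part{}".format(fold)
    # Assumption: checkpoints were written in fold order, starting at ckpt_offset
    ckpt_dir = "ckpts/TransE_l2_DRKG_{}".format(fold + ckpt_offset)
    return {
        "entities": "{}/entities.tsv".format(part_dir),
        "relations": "{}/relations.tsv".format(part_dir),
        "test_triples": "{}/part{}.tsv".format(part_dir, fold),
        "entity_emb": "{}/DRKG_TransE_l2_entity.npy".format(ckpt_dir),
        "relation_emb": "{}/DRKG_TransE_l2_relation.npy".format(ckpt_dir),
    }


all_paths = [fold_paths(i) for i in range(10)]
print(all_paths[0]["entity_emb"])
print(all_paths[9]["test_triples"])
```

Looping over `all_paths` and applying the loading and scoring steps once per fold would replace the ten near-identical cells.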


### Aggregate all score data together

In [None]:
rel_score = {}
# Number of scored edges in part0 (sanity check)
cnt = sum(len(val) for val in p0_rel_score.values())
print(cnt)
part_rel_scores = [p0_rel_score, p1_rel_score, p2_rel_score, p3_rel_score, p4_rel_score,
                   p5_rel_score, p6_rel_score, p7_rel_score, p8_rel_score, p9_rel_score]
for part_rel_score in part_rel_scores:
    for key, val in part_rel_score.items():
        # setdefault covers relations that are absent from earlier folds;
        # extend copies the scores instead of aliasing the per-fold lists
        rel_score.setdefault(key, []).extend(val)

### Edge Score Analysis
We first collect the scores of all relations into a single array.

In [None]:
total = 0
total_score = []
for key, score in rel_score.items():
    total += len(score)
    score = th.stack(score).numpy()
    total_score.append(score)
total_score = np.concatenate(total_score)
print(total)

Then we draw the histogram of all edge scores.

In [None]:
import matplotlib.pyplot as plt
plt.hist(total_score)
plt.xlabel('Edge Scores')
plt.ylabel('Number of Edges')
plt.show()
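
Beyond the overall histogram, the per-relation lists in `rel_score` can be summarized one relation at a time. The sketch below uses only the standard library and made-up scores; the helper `summarize_by_relation` and the relation names are hypothetical. Calling `float()` on each score is meant to also accept the single-element tensors stored in `rel_score`.

```python
import statistics


def summarize_by_relation(rel_to_scores):
    """Return count, mean, median, min and max of the scores of each relation."""
    summary = {}
    for rel, scores in rel_to_scores.items():
        values = [float(s) for s in scores]
        summary[rel] = {
            "count": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "min": min(values),
            "max": max(values),
        }
    return summary


# Invented scores for two toy relations
toy_rel_score = {"rel_a": [1.0, 2.0, 3.0], "rel_b": [-4.0, 0.0]}
toy_summary = summarize_by_relation(toy_rel_score)
for rel, stats in toy_summary.items():
    print(rel, stats)
```

Relations whose median score sits far below the others would be natural candidates for a closer look.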