# Select the final label set that can be used to generate profiles for entities of a desired type
In this notebook, we use the entity embeddings to further narrow down the labels that we previously created and filtered. The result is a final set of labels that can be used to generate "profiles" for entities of a desired type: an entity's profile consists of the labels from the final label set that the entity matches. This notebook asks for the type to profile so that we don't create label sets for every type in the knowledge graph.

### Pre-requisite steps to run this notebook
1. Run the candidate_label_creation, candidate_filter, and HAS_entity_embeddings notebooks first (these have their own pre-reqs). We will use files that were created by those notebooks in this notebook.

In [280]:
import os
import pandas as pd
import numpy as np
import itertools
import seaborn as sns
import json
import random
import matplotlib.pyplot as plt
from utility import rename_cols_and_overwrite_id
from gensim.models import KeyedVectors
from tqdm.notebook import tqdm
import time

### Parameters
**REQUIRED**  
**type_to_profile**: The type that we will create a label set for, and therefore be able to create entity profiles for. This should be a string denoting a Q-node. e.g. if we want to create profiles for beers, we would set type_to_profile="Q44" (Q44=beer).  
**work_dir**: path to the work dir that was specified in the candidate_label_creation, candidate_filter, and HAS_entity_embeddings notebooks. We will use files that were saved by those notebooks, and also save the files created by this notebook here.  
**item_file**: file path for the file that contains entity to entity relationships (e.g. wikibase-item)  
**time_file**: file path for the file that contains entity to time-type values  
**quantity_file**: file path for the file that contains entity to quantity-type values (remember to specify your trimmed quantity file if you ran the optional pre-processing trim_quantity_file notebook)  
**label_file**: file path for the file that contains wikidata labels  
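**store_dir**: path to folder containing the sqlite3.db file that we will use for our queries. We will reuse an existing file if there is one in this folder. Otherwise we will create a new one.  
**embedding_file**: file path for the entity embeddings (gensim KeyedVectors) created by the HAS_entity_embeddings notebook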

**OPTIONAL**  
**In-progress... this is currently not used.**  
*string_file*: file path for the file that contains entity to string-type values  

In [281]:
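# Also set item_file, time_file, quantity_file, and label_file (described above) before running the next cell
type_to_profile = "Q30612"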
work_dir = "./output/wikidata-20210215-dwd"
store_dir = "{}/temp_final_label_sets".format(work_dir)
embedding_file= "./output/wikidata-20210215-dwd/HAS_embeddings/HAS_embeddings_10x10s,200dim,min_count=0.kv"

### Process parameters and set up variables / file names

In [282]:
# Ensure paths are absolute
work_dir = os.path.abspath(work_dir)
store_dir = os.path.abspath(store_dir)
    
# Create directories
output_dir = "{}/final_label_sets".format(work_dir)
if not os.path.exists(output_dir):
    os.makedirs(output_dir)
type_dir = "{}/{}".format(output_dir, type_to_profile)
if not os.path.exists(type_dir):
    os.makedirs(type_dir)
if not os.path.exists(store_dir):
    os.makedirs(store_dir)
    
# adding some environment variables we'll be using frequently
os.environ['ITEM_FILE'] = item_file
os.environ['TIME_FILE'] = time_file
os.environ['QUANTITY_FILE'] = quantity_file
os.environ['LABEL_FILE'] = label_file
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(store_dir)
os.environ['FILTERED_LABELS'] = "{}/candidate_filter".format(work_dir)
os.environ['LABEL_CREATION_DIR'] = "{}/label_creation".format(work_dir)
os.environ['TYPE'] = type_to_profile
os.environ['WORK'] = label_set_work_dir
os.environ['OUT'] = output_dir
os.environ['TYPE_DIR'] = type_dir
os.environ['kgtk'] = "kgtk" # Need to do this for kgtk to be recognized as a command when passing it through a subprocess call

## 1. Get labels that abstract entities of the type we want to profile and find the entities that match each of those labels

**This step could possibly be a separate notebook, or the code could be moved out of the notebook.**

**We could also do this in earlier steps along the way, for example during label creation. For scalability, we might then need to move the type_to_profile decision to that step so that we don't enumerate all labels and their entities (this would blow up when we get to RALs).**

**Since we didn't keep quantities and times separate for RAVLs and RAILs, the final label set can't tell whether the value for these is a time or a quantity. This isn't much of a problem, since the property disambiguates time versus quantity, but it means that we'll have to look in a concatenated quantities/times file when creating a profile for an entity.**

We will later choose from these labels to form the final label set, based on several formulas that take into account the entities that match each label.

We also need each label to have a unique identifier (again, this is something we could have done along the way in earlier notebooks, which would simplify some queries). We'll add a column for this here as well.

In [263]:
!echo "{type_to_profile}"

Q30612


In [264]:
%%time
!kgtk query --debug -i $FILTERED_LABELS/all_entity_labels_filtered.tsv \
-i $FILTERED_LABELS/all_candidate_labels_filtered_in.tsv --graph-cache $STORE \
-o $OUT/entity_labels_for_cur_type.tsv \
--match 'entity_labels: (ent)-[l {type:t, type_label:t_lab, prop:p, property_label:p_lab, value:val, si_units:si, wd_units:wd, lower_bound:lb, upper_bound:ub, type2:t2, type2_label:t2_lab, prop2:p2, property2_label:p2_lab}]->(lab_id), candidate_labels: (type:`'"$TYPE"'`)-[lab_id]->()' \
--return 'ent as node1, "_" as label, lab_id as node2, t as type, t_lab as type_label, p as prop, p_lab as property_label, t2 as type2, t2_lab as type2_label, p2 as prop2, p2_lab as property2_label, val as value, lb as lower_bound, ub as upper_bound, si as si_units, wd as wd_units, l as id'

[2021-06-18 16:29:55 query]: SQL Translation:
---------------------------------------------
  SELECT graph_1_c1."node1" "_aLias.node1", ? "_aLias.label", graph_2_c2."id" "_aLias.node2", graph_1_c1."type" "_aLias.type", graph_1_c1."type_label" "_aLias.type_label", graph_1_c1."prop" "_aLias.prop", graph_1_c1."property_label" "_aLias.property_label", graph_1_c1."type2" "_aLias.type2", graph_1_c1."type2_label" "_aLias.type2_label", graph_1_c1."prop2" "_aLias.prop2", graph_1_c1."property2_label" "_aLias.property2_label", graph_1_c1."value" "_aLias.value", graph_1_c1."lower_bound" "_aLias.lower_bound", graph_1_c1."upper_bound" "_aLias.upper_bound", graph_1_c1."si_units" "_aLias.si_units", graph_1_c1."wd_units" "_aLias.wd_units", graph_1_c1."id" "_aLias.id"
     FROM graph_1 AS graph_1_c1, graph_2 AS graph_2_c2
     WHERE graph_1_c1."lower_bound"=graph_1_c1."lower_bound"
     AND graph_1_c1."prop"=graph_1_c1."prop"
     AND graph_1_c1."prop2"=graph_1_c1."prop2"
     AND graph_1_c1."property2_

## 2. Get all entities of the type we are profiling

In [265]:
%%time
!kgtk query -i $LABEL_CREATION_DIR/type_mapping.tsv -i $LABEL_FILE \
-o $OUT/entities_in_type.tsv --graph-cache $STORE \
--match 'type: (entity)-[]->(type:`'"$TYPE"'`)' \
--return 'distinct entity as entity' \
--order-by 'entity'

CPU times: user 63.2 ms, sys: 1.42 s, total: 1.49 s
Wall time: 6.65 s


In [266]:
%%time
entities_in_type_df = pd.read_csv("{}/entities_in_type.tsv".format(output_dir), delimiter = '\t')
print("number of entities: {}".format(entities_in_type_df.shape[0]))
display(entities_in_type_df.head())

number of entities: 351014


Unnamed: 0,entity
0,Q100157869
1,Q100157924
2,Q100157935
3,Q100158017
4,Q100158036


CPU times: user 258 ms, sys: 8.65 ms, total: 266 ms
Wall time: 262 ms


## 3. Load information needed into Python variables.
1. Load dictionary of all labels found in step 1 above along with their corresponding positive entities (note, performance of this step might be improved if we do it along the way in step 1)
2. Load set of all entities of the type we are profiling (compiled in step 2)
3. Load the keyed vector of entity embeddings (created in HAS_entity_embeddings notebook)
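
As a small illustration of item 1, the sketch below builds the label-to-entities dictionary from a toy edge table. The column layout (`node1` = entity, `node2` = label id) is assumed to mirror `entity_labels_for_cur_type.tsv`, and the `toy_` names and Q-node values are made up for the example.

```python
import pandas as pd

# Toy edge table: node1 = entity, node2 = label id (assumed layout, made-up values)
toy_labels_df = pd.DataFrame({
    "node1": ["Q1", "Q2", "Q2", "Q3"],
    "node2": ["AIL-a", "AIL-a", "RAIL-b", "RAIL-b"],
})

# label id -> set of entities that match the label
toy_label_to_entities = toy_labels_df.groupby("node2")["node1"].apply(set).to_dict()

# Set of all entities of the type (here simply every entity in the toy table)
toy_all_ents = set(toy_labels_df["node1"])

print(toy_label_to_entities)
```

Storing the entities as sets keeps the later union, intersection, and difference operations simple.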

In [267]:
%%time
labels_df = pd.read_csv("{}/entity_labels_for_cur_type.tsv".format(output_dir), delimiter = '\t').fillna("")
label_to_entities = labels_df.groupby("node2")["node1"].apply(set).to_dict()
print("total number of candidate labels for profiling entities of type {}: {}".format(type_to_profile, len(label_to_entities)))




total number of candidate labels for profiling entities of type Q30612: 645
CPU times: user 1min 11s, sys: 10.1 s, total: 1min 21s
Wall time: 1min 21s


In [268]:
%%time
# look up number of entities of the type we are profiling
all_ents_in_type = set(pd.read_csv("{}/entities_in_type.tsv".format(output_dir), delimiter = '\t').entity)
print("Loaded set of all entities of the type we are profiling ({} total)".format(len(all_ents_in_type)))

Loaded set of all entities of the type we are profiling (351014 total)
CPU times: user 221 ms, sys: 0 ns, total: 221 ms
Wall time: 216 ms


In [237]:
%%time
# load embeddings created in the embeddings notebook
entity_embeddings = KeyedVectors.load(embedding_file)
print("Loaded entity embeddings")

Loaded entity embeddings
CPU times: user 2min 20s, sys: 26.5 s, total: 2min 46s
Wall time: 2min 55s


In [142]:
%%time
# load labels from step 1
labels_df = pd.read_csv("{}/labels_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t').fillna("")

# create dictionary of label_id --> set of matching entities
# using groupby is not the most efficient, but more concise
label_to_entities = labels_df.groupby("node1")["node2"].apply(set).to_dict()
print("total number of candidate labels for profiling entities of type {}: {}".format(type_to_profile, len(label_to_entities)))

# look up number of entities of the type we are profiling
all_ents_in_type = set(pd.read_csv("{}/entities_in_type.tsv".format(output_dir), delimiter = '\t').entity)
print("Loaded set of all entities of the type we are profiling ({} total)".format(len(all_ents_in_type)))

# load embeddings created in the embeddings notebook
entity_embeddings = KeyedVectors.load("{}/HAS_embeddings/HAS_embeddings.kv".format(work_dir))
print("Loaded entity embeddings")



total number of candidate labels for profiling entities of type Q5: 587
Loaded set of all entities of the type we are profiling (7958973 total)
CPU times: user 6min 22s, sys: 2min 28s, total: 8min 51s
Wall time: 8min 54s


In [143]:
len(labels_df)

83356633

In [269]:
%%time
from functools import lru_cache
@lru_cache(maxsize = 10000000)
def get_entity_embedding(entity):
    return entity_embeddings[entity]
# the below is faster but only works if all ents are guaranteed to be in the embeddings
# all_ents_in_type_ordered = list(all_ents_in_type)
# all_ents_in_type_ordered_embeddings = entity_embeddings[all_ents_in_type_ordered]
# entity_embeddings_dict = {ent : embed for ent, embed in zip(all_ents_in_type_ordered, all_ents_in_type_ordered_embeddings)}

entity_embeddings_dict = {ent : entity_embeddings[ent] for ent in all_ents_in_type if ent in entity_embeddings}

CPU times: user 1.01 s, sys: 483 ms, total: 1.49 s
Wall time: 1.49 s


In [259]:
len(entity_embeddings_dict)

28022

In [7]:
num_ents_in_type = len(all_ents_in_type)

In [24]:
label_supports={label : len(entities) / num_ents_in_type for label, entities in label_to_entities.items()}

In [26]:
label_supports

{'AIL-quantity-E2358': 0.0010924776349913487,
 'AIL-quantity-E2632': 0.0012697618147467016,
 'AIL-quantity-E3533': 0.002303689182008784,
 'AIL-quantity-E3664': 0.0026186795708441277,
 'AIL-quantity-E3666': 0.002513389604412529,
 'AIL-quantity-E3756': 0.0023044430481168865,
 'AIL-quantity-E3806': 0.002510876717385522,
 'AIL-quantity-E3942': 0.003411746716567577,
 'AIL-quantity-E4451': 0.003750860820862189,
 'AIL-quantity-E4515': 0.004111208820535011,
 'AIL-quantity-E4606': 0.0033980514822703886,
 'AIL-quantity-E4814': 0.0039662152390767,
 'AIL-time.year-E10055': 0.0019391949187414004,
 'AIL-time.year-E10270': 0.0019251227513901604,
 'AIL-time.year-E10353': 0.0020055351362543885,
 'AIL-time.year-E11148': 0.0023006737175763755,
 'AIL-time.year-E11348': 0.002389252985278377,
 'AIL-time.year-E11438': 0.002458231734169723,
 'AIL-time.year-E12161': 0.002375809039683889,
 'AIL-time.year-E12206': 0.0028239824409506103,
 'AIL-time.year-E12362': 0.0026919302276813854,
 'AIL-time.year-E12967': 0.0

In [31]:
np.log(1/(np.array(list(label_supports.values()))))

array([6.8193071 , 6.66892594, 6.07324345, 5.94508507, 5.986123  ,
       6.07291626, 5.9871233 , 5.68053089, 5.58576991, 5.49403818,
       5.68455311, 5.52994298, 6.24548238, 6.25276555, 6.21184435,
       6.07455328, 6.03677452, 6.00831299, 6.04241725, 5.86960718,
       5.91749679, 5.84625673, 5.72526726, 5.68248462, 5.69059833,
       5.64021871, 5.57014875, 5.57789559, 5.50689515, 5.49513899,
       5.52101787, 5.36276706, 5.25841212, 5.24827413, 5.18208175,
       5.04921053, 5.00820364, 4.9218364 , 4.92698874, 4.89715515,
       4.8171901 , 4.78482323, 4.72875727, 4.69557632, 4.72455779,
       4.55481627, 4.52230398, 4.5148606 , 4.41364241, 4.31164054,
       4.26476246, 4.23517719, 4.12385655, 4.07693604, 4.05881686,
       4.03494921, 3.99676512, 3.92708414, 3.8664055 , 3.80643673,
       3.7099484 , 3.6216508 , 3.60805544, 3.5949352 , 3.37174864,
       3.39899137, 3.3433092 , 3.1189981 , 3.04749709, 3.0650395 ,
       2.99326177, 2.94332074, 6.76502805, 6.74967805, 6.76677

In [27]:
with open("{}/label_supports_in_type_(recompute_these_the_right_way).json".format(label_set_final_dir), 'w') as f:
    json.dump(label_supports, f)

## 4. Compute distinctiveness for each label
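Distinctiveness of a label = average similarity among its positive entities minus average similarity between its positive and negative entities. A label's positive entities are the entities of the profiled type that match the label; its negative entities are all other entities of that type.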
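
To make the definition concrete, here is a minimal sketch on hand-made 2-d vectors. It computes the exact pairwise version of the definition with plain NumPy; `cosine` and `toy_distinctiveness` are hypothetical helper names, not functions used elsewhere in this notebook, and the sketch assumes at least two positive entities.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def toy_distinctiveness(pos, neg):
    """Mean pairwise similarity among positives minus mean positive-negative similarity.

    Assumes len(pos) >= 2 and len(neg) >= 1.
    """
    internal = [cosine(pos[i], pos[j])
                for i in range(len(pos)) for j in range(i + 1, len(pos))]
    external = [cosine(p, n) for p in pos for n in neg]
    return float(np.mean(internal) - np.mean(external))

# Positives point along the x axis, negatives along the y axis
pos = np.array([[1.0, 0.0], [1.0, 0.0]])
neg = np.array([[0.0, 1.0], [0.0, 1.0]])

score = toy_distinctiveness(pos, neg)         # internal 1.0, external 0.0
score_same = toy_distinctiveness(pos, pos)    # negatives identical to positives
print(score, score_same)
```

A label whose positives look like each other and unlike the rest of the type scores high; a label whose positives are indistinguishable from the negatives scores near zero.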

In [270]:
%%time
# for human, about 3 mins per label, could try more sampling... i.e. we are doing all ents X sample of other ents to compare. we could sample both sides.
# Sampling params:
internal_sim_window_size = 100 # Internal is faster, so can set this higher
external_sim_sample_size = 50

print("Calculating distinctiveness for {} labels".format(len(label_to_entities)))
count = 0

label_distinctiveness_orig = {}
# todo - use comprehension instead of loop
for label, pos_ents in tqdm(label_to_entities.items()):
    if count > 10:
        break
    count += 1
    
    # average internal similarity
    if len(pos_ents) < 2:
        avg_internal_similarity = 1 # only relevant if we have labels that apply to only a single entity
    else:
        window_size = min(internal_sim_window_size, len(pos_ents) -1)
        pos_ent_embeds = np.array([entity_embeddings_dict[entity] for entity in pos_ents])
        avg_internal_similarity = np.sum([np.sum(entity_embeddings.cosine_similarities(pos_ent_embeds[i], pos_ent_embeds[i+1:i+1+window_size])) for i in range(len(pos_ents))]) / ((len(pos_ents)*window_size) - ((window_size+1)*window_size/2))

    # average external similarity
    neg_ents = all_ents_in_type - pos_ents
    
    num_samples = min(external_sim_sample_size, len(neg_ents))
    neg_ent_embeds = [entity_embeddings_dict[entity] for entity in neg_ents]
    avg_external_similarity = np.sum([np.sum(entity_embeddings.cosine_similarities(pos_ent_embeds[i], random.sample(neg_ent_embeds,num_samples))) for i in range(len(pos_ents))]) / (len(pos_ents)*num_samples)

    label_distinctiveness_orig[label] = avg_internal_similarity - avg_external_similarity
    
# with open("{}/label_distinctiveness_dict.json".format(label_set_work_dir), "w") as f:
#     json.dump(label_distinctiveness, f)

Calculating distinctiveness for 645 labels


  0%|          | 0/645 [00:00<?, ?it/s]

KeyError: 'Q64666439'

In [123]:
label_to_entities = labels_df.groupby("node1")["node2"].apply(set).to_dict()

In [122]:
len(test['RAIL-E4983269'])

823744

In [241]:
len(pos_ent_embeds)

0

In [271]:
%%time
# Sampling params:
max_sample_size = 10000

print("Calculating distinctiveness for {} labels".format(len(label_to_entities)))

label_distinctiveness = {}
count=0
for label, pos_ents in tqdm(label_to_entities.items()):
#     count += 1
#     if count < 121 or count > 131:
#         continue
#     count += 1
#     print("starting iteration")
    # list(): random.sample no longer accepts sets as of Python 3.11
    pos_ents_sample = random.sample(list(pos_ents), min(max_sample_size, len(pos_ents)))
    pos_ent_embeds = [entity_embeddings_dict[entity] for entity in pos_ents_sample if entity in entity_embeddings_dict]
#     print("done looking up pos embeddings")

    # May happen for types with fewer entities if a label has few entities and they are all tail entities
    # and are therefore not in the embeddings
    if len(pos_ent_embeds) == 0:
        label_distinctiveness[label]=0
        continue
    
    pos_centroid = np.mean(pos_ent_embeds, axis=0)
    
    # average internal similarity
    avg_internal_similarity = np.mean(entity_embeddings.cosine_similarities(pos_centroid, pos_ent_embeds))
    
#     print("starting set subtraction op")
    # average external similarity
    neg_ents = all_ents_in_type - pos_ents
#     print("done w set subtraction op")
    # list(): random.sample no longer accepts sets as of Python 3.11
    neg_ents_sample = random.sample(list(neg_ents), min(max_sample_size, len(neg_ents)))
#     print("done w sampling")
    neg_ent_embeds = [entity_embeddings_dict[entity] for entity in neg_ents_sample if entity in entity_embeddings_dict]
#     print("done looking up neg embeddings")
    
    avg_external_similarity = np.mean(entity_embeddings.cosine_similarities(pos_centroid, neg_ent_embeds))
    
    label_distinctiveness[label] = avg_internal_similarity - avg_external_similarity
    
# with open("{}/label_distinctiveness_dict.json".format(type_dir), "w") as f:
#     json.dump(label_distinctiveness, f)

Calculating distinctiveness for 645 labels


  0%|          | 0/645 [00:00<?, ?it/s]

CPU times: user 15min 36s, sys: 1h 7min 32s, total: 1h 23min 8s
Wall time: 1min 18s


In [128]:
count = 0
for label, pos_ents in label_to_entities.items():
    print(f"{count}: {len(pos_ents)}")
    count +=1

0: 8695
1: 10106
2: 18335
3: 20842
4: 20004
5: 18341
6: 19984
7: 27154
8: 29853
9: 32721
10: 27045
11: 31567
12: 15434
13: 15322
14: 15962
15: 18311
16: 19016
17: 19565
18: 18909
19: 22476
20: 21425
21: 23007
22: 25966
23: 27101
24: 26882
25: 28271
26: 30323
27: 30089
28: 32303
29: 32685
30: 31850
31: 37311
32: 41415
33: 41837
34: 44700
35: 51052
36: 53189
37: 57987
38: 57689
39: 59436
40: 64384
41: 66502
42: 70337
43: 72710
44: 70633
45: 83700
46: 86466
47: 87112
48: 96391
49: 106742
50: 111865
51: 115224
52: 128792
53: 134979
54: 137447
55: 140767
56: 146246
57: 156800
58: 166609
59: 176906
60: 194826
61: 212811
62: 215724
63: 218573
64: 273228
65: 265885
66: 281110
67: 351798
68: 377873
69: 371302
70: 398933
71: 419362
72: 9180
73: 9322
74: 9164
75: 9417
76: 11056
77: 9608
78: 9910
79: 10352
80: 9927
81: 10335
82: 10400
83: 10627
84: 10493
85: 10747
86: 10902
87: 10926
88: 11228
89: 11128
90: 11450
91: 11485
92: 13039
93: 12305
94: 11790
95: 12670
96: 13733
97: 13870
98: 14334
99: 8

Testing on 11 labels that have >2M matching entities (these take the longest because there are many positives and many negatives).  
Timings when sampling both positives and negatives:  
no sampling takes 7m 51s (\~43s/label)  
sampling 100,000 takes 37.9s (\~3.4s/label)  
sampling 10,000 takes 21.5s (\~2s/label)  
sampling 1,000 takes 19.4s (\~1.76s/label)  
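
The accuracy side of this trade-off can be sketched on synthetic data. The example below uses random Gaussian vectors as stand-ins for the real embeddings (an assumption; real embeddings will behave differently) and compares the centroid-based distinctiveness computed on all vectors with the value computed on a sample of 1,000 per side. `cos_to_centroid` and `centroid_distinctiveness` are hypothetical helpers written for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def cos_to_centroid(centroid, vectors):
    """Cosine similarity of every row in `vectors` to `centroid`."""
    vectors = np.asarray(vectors)
    return vectors @ centroid / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(centroid))

def centroid_distinctiveness(pos, neg, max_sample_size=None):
    """Mean similarity of positives to their centroid minus that of negatives."""
    if max_sample_size is not None:
        # Sample both sides without replacement, as in the timing experiment
        pos = pos[rng.choice(len(pos), size=min(max_sample_size, len(pos)), replace=False)]
        neg = neg[rng.choice(len(neg), size=min(max_sample_size, len(neg)), replace=False)]
    centroid = pos.mean(axis=0)
    return float(cos_to_centroid(centroid, pos).mean() - cos_to_centroid(centroid, neg).mean())

# Synthetic stand-ins: positives cluster around one direction, negatives around the opposite
pos = rng.normal(loc=1.0, scale=1.0, size=(20000, 16))
neg = rng.normal(loc=-1.0, scale=1.0, size=(20000, 16))

full = centroid_distinctiveness(pos, neg)
sampled = centroid_distinctiveness(pos, neg, max_sample_size=1000)
abs_err = abs(full - sampled)
print(full, sampled, abs_err)
```

On this synthetic data the sampled estimate stays close to the full one; whether the same holds for a given type should be checked against the real embeddings.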

In [140]:
def get_aae(gold, sample):
    # average absolute error; values are matched by key so the result does not depend on dict ordering
    return np.mean([abs(gold[k] - sample[k]) for k in gold])

from tabulate import tabulate
headers = ["label", "no sampling", "100,000 samples", "10,000 samples", "1,000 samples"]
rows = []
for k in label_distinctiveness:
    row=[k, label_distinctiveness[k], label_distinctiveness100000[k], label_distinctiveness10000[k], label_distinctiveness1000[k]]
    rows.append(row)
row=["avg abs err",0,get_aae(label_distinctiveness,label_distinctiveness100000),
     get_aae(label_distinctiveness,label_distinctiveness10000),
     get_aae(label_distinctiveness,label_distinctiveness1000)]
rows.append(row)

print(tabulate(rows,headers=headers))

label             no sampling    100,000 samples    10,000 samples    1,000 samples
--------------  -------------  -----------------  ----------------  ---------------
RAIL-E5804528       0.0847355        0.0851485         0.0841672         0.0907977
RAIL-E5805618       0.0856908        0.0861703         0.0852886         0.0888516
RAIL-E5962378       0.0771669        0.0775624         0.0782194         0.0831838
RAIL-E5962924       0.0807492        0.0804133         0.0810758         0.0786395
RAIL-E5964663       0.0774279        0.0773608         0.0769587         0.0767247
RAIL-E5965672       0.0819716        0.0822412         0.081434          0.0817775
RAIL-E5987042       0.0734252        0.0742237         0.0736925         0.0748734
RAIL-E5989264       0.0744493        0.0742138         0.0759095         0.0773825
RAVL-E10029434      0.0836047        0.083967          0.0846588         0.0931227
RAVL-E10033019      0.0845044        0.0841198         0.0835327         0.0881564
RA

Testing with only the negatives sampled, on the more common, easier cases that have far fewer positives:  
sampling 100 to 10,000 negatives takes about the same time, \~1.6s per label  
no sampling takes more like 15s per label

The original method with sample settings internal=100, external=50 takes ~9s per label

In [64]:
from tabulate import tabulate
headers = ["label", "no sampling", "10000 samples", "1000 samples", "100 samples", "orig method"]
rows = []
for k in label_distinctiveness:
    row=[k, label_distinctiveness[k], label_distinctiveness10000[k], label_distinctiveness1000[k], label_distinctiveness100[k], label_distinctiveness_orig[k]]
    rows.append(row)
print(tabulate(rows,headers=headers))

label                 no sampling    10000 samples    1000 samples    100 samples    orig method
------------------  -------------  ---------------  --------------  -------------  -------------
AIL-quantity-E2358       0.43223          0.432845        0.430157       0.436105      0.295243
AIL-quantity-E2632       0.170978         0.172305        0.175248       0.152981      0.0881574
AIL-quantity-E3533       0.368022         0.366781        0.366724       0.370261      0.226696
AIL-quantity-E3664       0.38903          0.389251        0.38807        0.393451      0.244964
AIL-quantity-E3666       0.359961         0.359174        0.359747       0.358948      0.2202
AIL-quantity-E3756       0.372736         0.372702        0.374167       0.381768      0.227895
AIL-quantity-E3806       0.37342          0.373965        0.367016       0.368123      0.230005
AIL-quantity-E3942       0.342124         0.3432          0.341748       0.33816       0.205633
AIL-quantity-E4451       0.364386      

Let's take a look at the most distinctive labels

In [257]:
print("There are {} labels, and {} distinct values of distinctiveness.".format(len(label_distinctiveness),len(set(label_distinctiveness.values()))))
print("Most distinctive labels:")
most_distinct = sorted(label_distinctiveness.items(), key=lambda item: item[1], reverse=True)[:5]
display(most_distinct)
print("Looking at the top 5:")
cols_to_display = ["type", "type_label", "prop", "prop_label", "value", "value_label", "value_lb", "value_ub", "prop2", "prop2_label", "value2", "value2_lb", "value2_ub", "si_units", "wd_units"]
for i in range(len(most_distinct)):
    label_id = most_distinct[i][0]
    display(labels_df.loc[labels_df["node1"]==label_id, cols_to_display].fillna("").iloc[[0]])
    print("Number of positive entities for this label: {} (out of {} entities of this type)".format(len(label_to_entities[label_id]), len(all_ents_in_type)))
#print("Positive entities for this label: {}".format(", ".join(label_to_entities[label_id])))
# negatives = all_ents_in_type - label_to_entities[label_id]
# print("Negative entities for this label: {}".format(", ".join(negatives)))

There are 273 labels, and 273 distinct values of distinctiveness.
Most distinctive labels:


[('RAIL-E976418', 0.31647640386999143),
 ('RAIL-E976411', 0.3164195152423728),
 ('RAIL-E976422', 0.31641950157321325),
 ('RAIL-E976421', 0.31638429720047456),
 ('RAIL-E976420', 0.31637531900466437)]

Looking at the top 5:


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
2721045,Q11424,'film'@en,P495,'country of origin'@en,Q223832,'dominion of the British Empire'@en,,,P4010,'GDP (PPP)'@en,,5652059000000.0,,,Q550207


Number of positive entities for this label: 23390 (out of 176405 entities of this type)


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
2404370,Q11424,'film'@en,P495,'country of origin'@en,Q223832,'dominion of the British Empire'@en,,,P1198,'unemployment rate'@en,,,5.0,,Q11229


Number of positive entities for this label: 23390 (out of 176405 entities of this type)


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
2427760,Q11424,'film'@en,P495,'country of origin'@en,Q223832,'dominion of the British Empire'@en,,,P2046,'area'@en,,1156670.5,5473562.0,,Q712226


Number of positive entities for this label: 23391 (out of 176405 entities of this type)


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
2814859,Q11424,'film'@en,P495,'country of origin'@en,Q223832,'dominion of the British Empire'@en,,,P8477,'BTI Status Index'@en,,,,,


Number of positive entities for this label: 23390 (out of 176405 entities of this type)


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
2791469,Q11424,'film'@en,P495,'country of origin'@en,Q223832,'dominion of the British Empire'@en,,,P8476,'BTI Governance Index'@en,,,,,


Number of positive entities for this label: 23390 (out of 176405 entities of this type)


## 5. Iteratively choose labels to add to label set using formula
Choose label $l_i = \operatorname{argmax}_{l_i \in L_t^c}{[d(l_i) + \delta * reward(l_i,L_t) - (1- \delta) * penalty(l_i,L_t)]}$

Where...
- $l_i$ is a label
- $L_t^c$ is the set of candidate labels relevant to the type $t$ that we are profiling
- $L_t$ is the set of labels relevant to the type $t$ that we have so far chosen to be in the final label set
- $d(l_i)$ is the distinctiveness of label $l_i$
- $reward(l_i,L_t)$ is a function that captures the potential increase in the total coverage of entities of type $t$ in the KG by the labels in $L_t$ if $l_i$ were to be added to it.
- $penalty(l_i,L_t)$ is a function that captures the potential increase in redundancy of labels in $L_t$ if $l_i$ were to be added to it.
- $\delta$ is a hyperparameter that adjusts if we care more about increasing total coverage versus minimizing redundancy

Functions for computing reward and penalty...

$reward(l_i,L_t) = \cfrac{|\bigcup_{l_j \in (L_t \cup \{l_i\})}{\varepsilon_t^{l_j}}|}{|\varepsilon_t|}$

Where...
- $\varepsilon_t^{l_j}$ is the set of entities of type $t$ that match label $l_j$
- $\varepsilon_t$ is the set of entities of type $t$
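
A minimal sketch of this reward on toy sets, written directly from the formula (union of covered entities over all entities of the type). `toy_reward` and the Q-node values are made up for the example.

```python
def toy_reward(candidate_ents, covered_ents, all_ents):
    """Fraction of entities covered if the candidate label were added."""
    return len(covered_ents | candidate_ents) / len(all_ents)

all_ents = {"Q1", "Q2", "Q3", "Q4"}
covered = {"Q1", "Q2"}  # entities covered by the labels chosen so far

r_new = toy_reward({"Q3"}, covered, all_ents)        # adds one new entity: 3/4
r_redundant = toy_reward({"Q1"}, covered, all_ents)  # adds nothing new: 2/4
print(r_new, r_redundant)
```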

In [100]:
# note params are slightly different to avoid redundant computation
def reward(candidate_label, entities_covered_already):
#     return len(entities_covered_already | label_to_entities[candidate_label]) / len(all_ents_in_type)
    return (len(entities_covered_already) + len(label_to_entities[candidate_label]) - len(entities_covered_already.intersection(label_to_entities[candidate_label]))) / len(all_ents_in_type)
#     return (len(entities_covered_already) + len(label_to_entities[candidate_label] - entities_covered_already)) / len(all_ents_in_type)

#     num_covered = len(entities_covered_already)
#     num_covered += sum([e not in label_to_entities[candidate_label] for e in entities_covered_already])
#     for e in label_to_entities[candidate_label]:
#         if e not in entities_covered_already:
#             num_covered += 1
#     return num_covered / len(all_ents_in_type)

In [102]:
# note params are slightly different to avoid redundant computation
def reward(un_covered_ents_for_cand, entities_covered_already):
    return (len(un_covered_ents_for_cand) + len(entities_covered_already)) / len(all_ents_in_type)

From the paper, "reward is the potential contribution of \[the label\] to the increase of the total coverage of positive entities in the KG". The formula above doesn't exactly do that: a label that increases the coverage of positive entities does get a higher reward, but when we already have good coverage and a label adds nothing new, its reward is still high. If we want a function that does what the paper describes, we could use something like $\cfrac{|\varepsilon_t^{l_i}| - |\varepsilon_t^{L_t} \cap \varepsilon_t^{l_i}|}{|\varepsilon_t|}$, where $\varepsilon_t^{L_t}$ is the set of entities of type $t$ already covered by the labels in $L_t$.
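
The difference between the two reward definitions can be seen on a toy case. This is only a sketch of the alternative suggested above; `union_reward` and `marginal_reward` are hypothetical names.

```python
def union_reward(candidate_ents, covered_ents, all_ents):
    """Reward as defined by the formula: total coverage after adding the label."""
    return len(covered_ents | candidate_ents) / len(all_ents)

def marginal_reward(candidate_ents, covered_ents, all_ents):
    """Alternative: only the newly covered entities count."""
    return len(candidate_ents - covered_ents) / len(all_ents)

all_ents = {"Q1", "Q2", "Q3", "Q4"}
covered = {"Q1", "Q2", "Q3"}  # coverage is already good

redundant = {"Q1", "Q2"}  # adds nothing new
useful = {"Q4"}           # covers the one missing entity

union_redundant = union_reward(redundant, covered, all_ents)        # 3/4
marginal_redundant = marginal_reward(redundant, covered, all_ents)  # 0/4
union_useful = union_reward(useful, covered, all_ents)              # 4/4
marginal_useful = marginal_reward(useful, covered, all_ents)        # 1/4
print(union_redundant, marginal_redundant, union_useful, marginal_useful)
```

At a fixed step the two rewards differ only by the constant $|\varepsilon_t^{L_t}| / |\varepsilon_t|$, so they rank candidates identically. The difference is in the absolute score, which matters for example when comparing the score against a cutoff.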

$penalty(l_i,L_t) = \cfrac{\sum_{l_j \in L_t}{|\varepsilon_t^{l_i} \cap \varepsilon_t^{l_j}|}}{|L_t| * |\varepsilon_t|}$
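
A small worked example of the penalty, written directly from the formula rather than with the incremental numerator used in this notebook. `toy_penalty` and the toy sets are made up for illustration.

```python
def toy_penalty(candidate_ents, chosen_label_ents, all_ents):
    """Average overlap of the candidate with each chosen label, normalized by type size."""
    if not chosen_label_ents:
        return 0.0  # empty label set: avoid dividing by zero
    overlap = sum(len(candidate_ents & ents) for ents in chosen_label_ents)
    return overlap / (len(chosen_label_ents) * len(all_ents))

all_ents = {"Q1", "Q2", "Q3", "Q4"}
chosen = [{"Q1", "Q2"}, {"Q2", "Q3"}]  # entity sets of the labels already chosen

# Candidate shares Q2 with both chosen labels: (1 + 1) / (2 * 4)
p = toy_penalty({"Q2", "Q4"}, chosen, all_ents)
p_empty = toy_penalty({"Q2", "Q4"}, [], all_ents)
print(p, p_empty)
```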

In [66]:
# note params are slightly different to avoid redundant computation
def penalty(numerator, label_set):
    # when label set is empty, avoid divide by zero
    if len(label_set) == 0:
        return 0 
    return numerator / (len(label_set) * len(all_ents_in_type))

Iteratively choose labels
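
Before running the selection on real data, the greedy procedure can be sketched on a toy example. This is a simplified version under stated assumptions: it recomputes reward and penalty from scratch at every step instead of updating them incrementally, it has no cutoff, and ties go to the first candidate rather than a random one. `greedy_label_selection` and the toy labels are hypothetical.

```python
def greedy_label_selection(label_to_ents, distinctiveness, all_ents, delta=0.5):
    """Repeatedly pick the label maximizing d + delta*reward - (1-delta)*penalty."""
    chosen, covered = [], set()
    candidates = list(label_to_ents)

    def score(label):
        ents = label_to_ents[label]
        reward = len(covered | ents) / len(all_ents)
        if chosen:
            overlap = sum(len(ents & label_to_ents[c]) for c in chosen)
            penalty = overlap / (len(chosen) * len(all_ents))
        else:
            penalty = 0.0
        return distinctiveness[label] + delta * reward - (1 - delta) * penalty

    while candidates:
        best = max(candidates, key=score)
        chosen.append(best)
        covered |= label_to_ents[best]
        candidates.remove(best)
    return chosen, covered

all_ents = {"Q1", "Q2", "Q3", "Q4"}
label_to_ents = {"A": {"Q1", "Q2"}, "B": {"Q1", "Q2"}, "C": {"Q3", "Q4"}}
distinctiveness = {"A": 0.30, "B": 0.29, "C": 0.10}

order, covered = greedy_label_selection(label_to_ents, distinctiveness, all_ents)
print(order)
```

Label B is nearly as distinctive as A, but once A is chosen B is fully redundant, so the less distinctive label C, which covers new entities, is picked ahead of it.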

In [67]:
with open("{}/label_distinctiveness_dict.json".format(label_set_work_dir), "r") as f:
    label_distinctiveness = json.load(f)

In [73]:
len(covered_ents)

1643232

In [75]:
len(label_to_entities[candidate_labels[0]])

8695

In [101]:
%%time
for i in range(10):
    _=reward(l, covered_ents)

CPU times: user 3.26 s, sys: 75.3 ms, total: 3.34 s
Wall time: 3.32 s


In [272]:
%%time
# for 54 human labels, looks like 30s/label
d = .5 # can make this a parameter
min_cutoff = 0 # can make this a parameter (though it isn't mentioned in the paper..)
covered_ents = set() # set of all entities covered by label set
label_set = []
candidate_labels = list(label_to_entities.keys())
penalty_numerator_for_label = {l : 0 for l in candidate_labels}
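# copy each set: the in-place "-=" update later in this loop must not modify label_to_entities
un_covered_ents_for_label = {l : set(label_to_entities[l]) for l in candidate_labels}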
print("We have {} candidate labels to choose from".format(len(candidate_labels)))

label_reward={}
label_penalty={}
label_score={}

count=0
start=time.perf_counter()
for i in tqdm(range(len(candidate_labels))):
    # Finding the best label
#     print("computing vals (mostly calling reward())")
    vals = [label_distinctiveness[l] + d*reward(un_covered_ents_for_label[l], covered_ents) - (1-d)*penalty(penalty_numerator_for_label[l], label_set) for l in candidate_labels]
#     print("done computing vals (mostly calling reward())")
    max_val = np.max(vals)
    if max_val <= min_cutoff:
        break
    max_ix = np.random.choice(np.flatnonzero(vals == max_val))
    max_label = candidate_labels[max_ix]
    max_label_ents = label_to_entities[max_label]
    
    # reward() takes the candidate's un-covered entities, not the label id
    label_reward[max_label] = reward(un_covered_ents_for_label[max_label], covered_ents)
    label_penalty[max_label] = penalty(penalty_numerator_for_label[max_label], label_set)
    label_score[max_label] = max_val
    
    # Adding the label to final label set
    label_set.append(max_label)
#     print("updating covered ents")
    covered_ents = covered_ents | max_label_ents
#     print("done updating covered ents")

    # Remove the label from candidate labels
    candidate_labels.pop(max_ix)
    
    # Update penalty numerator for labels that are still in the candidate list
#     print("updating penalty numerators")
    for l in candidate_labels:
        penalty_numerator_for_label[l] += len((label_to_entities[l] & max_label_ents))
#     print("done updating penalty numerators")
    
    # Update un-covered entities for labels that are still in the candidate list
#     print("updating un-covered entities")
    for l in candidate_labels:
        un_covered_ents_for_label[l] -= max_label_ents
#     print("done updating un-covered entities")
    
print("Final label set has {} labels, covering {} / {} entities of type {} in the dataset".format(len(label_set), len(covered_ents), len(all_ents_in_type), type_to_profile))
# with open("{}/ordered_final_label_set.json".format(label_set_work_dir), 'w') as f:
#     json.dump(label_set, f)
# with open("{}/label_reward_dict.json".format(label_set_work_dir), 'w') as f:
#     json.dump(label_reward, f)
# with open("{}/label_penalty_dict.json".format(label_set_work_dir), 'w') as f:
#     json.dump(label_penalty, f)
# with open("{}/label_score_dict.json".format(label_set_work_dir), 'w') as f:
#     json.dump(label_score, f)

We have 645 candidate labels to choose from



Final label set has 645 labels, covering 350898 / 351014 entities of type Q30612 in the dataset
CPU times: user 44 s, sys: 1.31 s, total: 45.3 s
Wall time: 45 s
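
To make the selection loop above easier to follow, here is a minimal sketch of the same greedy idea on toy data. The reward and penalty used here (share of newly covered entities, and share of a label's entities that are already covered) are assumptions chosen for illustration. They are not the `reward()` and `penalty()` functions defined earlier in this notebook, and `greedy_select`, `toy_labels`, and `toy_dist` are hypothetical names.

```python
# Toy sketch of greedy label selection. The scoring terms are illustrative
# assumptions, not this notebook's reward()/penalty() definitions.
def greedy_select(label_to_ents, distinctiveness, d=0.5, min_cutoff=0.0):
    all_ents = set().union(*label_to_ents.values())
    candidates = list(label_to_ents)
    covered, selected = set(), []

    def score(label):
        ents = label_to_ents[label]
        # assumed reward: share of all entities this label would newly cover
        toy_reward = len(ents - covered) / len(all_ents)
        # assumed penalty: share of this label's entities that are already covered
        toy_penalty = len(ents & covered) / len(ents)
        return distinctiveness[label] + d * toy_reward - (1 - d) * toy_penalty

    while candidates:
        best = max(candidates, key=score)
        if score(best) <= min_cutoff:
            break  # no remaining label is worth adding
        selected.append(best)
        covered |= label_to_ents[best]  # covered is our own set, so this is safe
        candidates.remove(best)
    return selected, covered


toy_labels = {"A": {1, 2, 3}, "B": {3, 4}, "C": {1, 2}}
toy_dist = {"A": 0.2, "B": 0.3, "C": 0.0}
selected, covered = greedy_select(toy_labels, toy_dist)
print(selected, covered)
```

In this toy run, label "C" is never selected: once "A" is in the set, "C" covers nothing new and overlaps completely, so its score falls below `min_cutoff` and the loop stops early.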


# TODO - Compare the above results with the original run. They should be the same, since we loaded the previously computed distinctiveness. Also time a full run of the new distinctiveness calculation.

### Format final profiles
We'll create 2 ways to consume these:  
1. a dictionary of {entity : [ordered list of label-ids]} plus a table of labels (ids plus their contents)
2. a single table of entities and their corresponding labels and label content.

First, we'll add the computed distinctiveness, reward, penalty, and re-ranking score to each label's content.

In [273]:
%%time
labels_df["distinctiveness"] = labels_df["node2"].map(label_distinctiveness)
labels_df["reward"] = labels_df["node2"].map(label_reward)
labels_df["penalty"] = labels_df["node2"].map(label_penalty)
labels_df["re-ranking score"] = labels_df["node2"].map(label_score)

CPU times: user 6.28 s, sys: 269 ms, total: 6.55 s
Wall time: 6.53 s
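
One thing to keep in mind with the `map` calls above: when a pandas Series is mapped through a plain dictionary, values that are not keys of the dictionary become `NaN`. So if the selection loop stops early, labels that were never selected will have `NaN` in the reward, penalty, and re-ranking score columns. A small sketch with made-up data (`toy_labels_df` and `toy_reward` are hypothetical names):

```python
import pandas as pd

# Three labels, but only two of them were "selected" and have a stored reward
toy_labels_df = pd.DataFrame({"node2": ["L1", "L2", "L3"]})
toy_reward = {"L1": 0.4, "L2": 0.1}

# Series.map with a dict leaves NaN where the key is missing
toy_labels_df["reward"] = toy_labels_df["node2"].map(toy_reward)
n_unselected = int(toy_labels_df["reward"].isna().sum())
print(toy_labels_df)
print("labels without a reward:", n_unselected)
```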


In [199]:
labels_df.columns

Index(['node1', 'label', 'node2', 'type', 'type_label', 'prop',
       'property_label', 'type2', 'type2_label', 'prop2', 'property2_label',
       'value', 'lower_bound', 'upper_bound', 'si_units', 'wd_units', 'id',
       'distinctiveness', 'reward', 'penalty', 're-ranking score'],
      dtype='object')

In [274]:
%%time
# with open("{}/ordered_final_label_set.json".format(label_set_work_dir), 'r') as f:
#     label_set = json.load(f)

profiles = {}
for label in tqdm(label_set):
    for ent in label_to_entities[label]:
        if ent not in profiles:
            profiles[ent] = []
        profiles[ent].append(label)
for ent in (all_ents_in_type - covered_ents):
    profiles[ent] = []
    
with open("{}/profiles_dict.json".format(type_dir), 'w') as f:
    json.dump(profiles, f)

info_cols = ["node2", "type", "type_label", "prop", "property_label", "type2", "type2_label", "prop2", "property2_label", "value", "lower_bound", "upper_bound", "si_units", "wd_units", "distinctiveness", "reward", "penalty", "re-ranking score"]
labels_info = labels_df.loc[:,info_cols].rename(columns={"node2":"label_id"}).groupby("label_id").first().reset_index()
labels_info.to_csv(path_or_buf = "{}/all_labels_info.csv".format(type_dir), index = False)

label_set_labels_info = labels_info.loc[labels_info.loc[:,"label_id"].isin(label_set), :]
label_set_labels_info.to_csv(path_or_buf = "{}/label_set_labels_info.csv".format(type_dir), index = False)



CPU times: user 25.7 s, sys: 16.4 s, total: 42.1 s
Wall time: 41.9 s
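
The profile dictionary built above is an inversion of the label-to-entities mapping, done in label-set order so that each entity's list keeps the selection order. Below is a minimal sketch of the same idea on toy data, using `collections.defaultdict` as one possible alternative to the explicit membership check. All `toy_*` names are hypothetical, and the JSON round trip is done in memory rather than through a file.

```python
import json
from collections import defaultdict

toy_label_set = ["L2", "L1"]  # ordered: L2 was selected before L1
toy_label_to_entities = {"L1": {"Q1", "Q2"}, "L2": {"Q2"}}
toy_all_ents = {"Q1", "Q2", "Q3"}

# Invert label -> entities into entity -> [labels], preserving selection order
toy_profiles = defaultdict(list)
for label in toy_label_set:
    for ent in toy_label_to_entities[label]:
        toy_profiles[ent].append(label)

# Entities that no selected label covers still get an (empty) profile
for ent in toy_all_ents:
    toy_profiles.setdefault(ent, [])

# Lists serialize to JSON directly; sets would not
round_tripped = json.loads(json.dumps(toy_profiles))
print(round_tripped)
```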


In [275]:
%%time
# profiles_info_cols = ["node1","label", "node2", "node2;label", "type", "type_label", "prop", "prop_label", "value", "value_label", "value_lb", "value_ub", "prop2", "prop2_label", "value2", "value2_lb", "value2_ub", "si_units", "wd_units", "distinctiveness", "reward", "penalty", "re-ranking score", "id"]
profiles_df = labels_df.loc[labels_df.loc[:,"node2"].isin(label_set),:].sort_values(by=['node1'])
profiles_df.to_csv(path_or_buf = "{}/entities_and_their_profiles.tsv".format(type_dir), index = False, sep='\t')

CPU times: user 4min 51s, sys: 18.9 s, total: 5min 10s
Wall time: 5min 10s
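
For the single-table format, a consumer can rebuild per-entity label lists with a `groupby`. The sketch below assumes, as in the cells above, that `node1` holds the entity and `node2` holds the label id. It uses a tiny made-up table and an in-memory buffer instead of the real `entities_and_their_profiles.tsv` file. Note that the row order within an entity is not guaranteed to follow the label selection order.

```python
import io
import pandas as pd

# Made-up stand-in for the entity/label table
toy_profiles_df = pd.DataFrame({
    "node1": ["Q2", "Q1", "Q2"],
    "node2": ["L1", "L1", "L2"],
})

# Write and re-read as tab-separated text, mirroring the to_csv call above
buffer = io.StringIO()
toy_profiles_df.sort_values(by=["node1"]).to_csv(buffer, index=False, sep="\t")
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep="\t")

# Collect the labels that each entity matches
labels_per_entity = reloaded.groupby("node1")["node2"].apply(list).to_dict()
print(labels_per_entity)
```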


In [69]:
display(profiles_df)

Unnamed: 0,node1,label,node2,node2;label,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units,id
16606934,RAIL-E4330978,pos_entity,Q10000001,'Tatyana Kolotilshchikova'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,,,P2046,'area'@en,,2.8902e+06,,,Q712226,E16606935
24828611,RAIL-E5027414,pos_entity,Q10000001,'Tatyana Kolotilshchikova'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,,,P2299,'PPP GDP per capita'@en,,,50391.5,,Q550207,E24828612
63480438,RAIL-E4995496,pos_entity,Q10000001,'Tatyana Kolotilshchikova'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q6256,'country'@en,,,P2855,'VAT-rate'@en,,8.35,26,,Q11229,E63480439
40115327,RAIL-E5479287,pos_entity,Q10000001,'Tatyana Kolotilshchikova'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,,,P571,'inception'@en,,1657.5,,,,E40115328
8449486,RAIL-E5274231,pos_entity,Q10000001,'Tatyana Kolotilshchikova'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,,,P1081,'Human Development Index'@en,,0.626,,,,E8449487
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14519208,RAIL-E5259315,pos_entity,Q999999,'Seraina Boner'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,,,P1198,'unemployment rate'@en,,,21,,Q11229,E14519209
17467236,RAIL-E4330978,pos_entity,Q9999999,'Olympiy Rudakov'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,,,P2046,'area'@en,,2.8902e+06,,,Q712226,E17467237
8449485,RAIL-E4301046,pos_entity,Q9999999,'Olympiy Rudakov'@en,Q5,'human'@en,P19,'place of birth'@en,Q1549591,'big city'@en,,,P2044,'elevation above sea level'@en,,-14,730.5,,Q11573,E8449486
42860548,RAIL-E5479287,pos_entity,Q9999999,'Olympiy Rudakov'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,,,P571,'inception'@en,,1657.5,,,,E42860549


### Now we can look at profiles for the entities. We'll do this in the next notebook - view_profiles