# Select the final label set that can be used to generate profiles for entities of a desired type
In this notebook, we use the entity embeddings created earlier to further narrow down the labels that we previously created and filtered. The result is a final label set that can be used to generate "profiles" for entities of a desired type: an entity's profile consists of the labels from the final set that the entity matches. This notebook asks for the type to profile so that we don't create label sets for every type in the knowledge graph.

### Pre-requisite steps to run this notebook
1. Run the candidate_label_creation, candidate_filter, and HAS_entity_embeddings notebooks first (these have their own pre-reqs). This notebook uses files that those notebooks create.

In [1]:
import os
import pandas as pd
import numpy as np
import itertools
import seaborn as sns
import json
import random
import matplotlib.pyplot as plt
from utility import rename_cols_and_overwrite_id
from gensim.models import KeyedVectors
from tqdm.notebook import tqdm
import time

### Parameters
**REQUIRED**  
**type_to_profile**: The type that we will create a label set for, and therefore be able to create entity profiles for. This should be a string denoting a Q-node. e.g. if we want to create profiles for beers, we would set type_to_profile="Q44" (Q44=beer).  
**work_dir**: path to the work dir that was specified in the candidate_label_creation, candidate_filter, and HAS_entity_embeddings notebooks. We will use files that were saved there by those notebooks, and files created by this notebook will also be saved there.  
**item_file**: file path for the file that contains entity to entity relationships (e.g. wikibase-item)  
**time_file**: file path for the file that contains entity to time-type values  
**quantity_file**: file path for the file that contains entity to quantity-type values (remember to specify your trimmed quantity file if you ran the optional pre-processing trim_quantity_file notebook)  
**label_file**: file path for the file that contains wikidata labels  
**store_dir**: path to folder containing the sqlite3.db file that we will use for our queries. We will reuse an existing file if there is one in this folder. Otherwise we will create a new one.

**OPTIONAL**  
**In-progress... this is currently not used.**  
*string_file*: file path for the file that contains entity to string-type values  

In [2]:
data_dir = "./data/wikidata_humans" # my data files are all in the same directory, so I'll reuse this path prefix

# **REQUIRED**
type_to_profile = "Q5" 
work_dir = "./output/wikidata_humans_v3"
item_file = "{}/claims.wikibase-item.tsv.gz".format(data_dir)
time_file = "{}/claims.time.tsv.gz".format(data_dir)
quantity_file = "{}/claims.quantity_trimmed.tsv.gz".format(data_dir)
label_file = "{}/labels.en.tsv.gz".format(data_dir)
store_dir = "{}/temp".format(work_dir)

# **optional**
string_file = None #"{}/claims.string.tsv.gz".format(data_dir)

### Process parameters and set up variables / file names

In [3]:
# Ensure paths are absolute
item_file = os.path.abspath(item_file)
time_file = os.path.abspath(time_file)
quantity_file = os.path.abspath(quantity_file)
label_file = os.path.abspath(label_file)
work_dir = os.path.abspath(work_dir)
store_dir = os.path.abspath(store_dir)
if string_file:
    string_file = os.path.abspath(string_file)
    
# Create directories (exist_ok avoids a separate existence check for each one)
output_dir = "{}/final_label_sets/{}".format(work_dir, type_to_profile)
label_set_work_dir = "{}/work".format(output_dir)
label_set_final_dir = "{}/final".format(output_dir)
for directory in (label_set_work_dir, label_set_final_dir, store_dir):
    os.makedirs(directory, exist_ok=True)  # also creates output_dir as a parent
    
# adding some environment variables we'll be using frequently
os.environ['ITEM_FILE'] = item_file
os.environ['TIME_FILE'] = time_file
os.environ['QUANTITY_FILE'] = quantity_file
os.environ['LABEL_FILE'] = label_file
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(store_dir)
os.environ['FILTERED_LABELS'] = "{}/candidate_filter".format(work_dir)
os.environ['LABEL_CREATION_DIR'] = "{}/label_creation".format(work_dir)
os.environ['TYPE'] = type_to_profile
os.environ['WORK'] = label_set_work_dir
os.environ['OUT'] = label_set_final_dir
os.environ['kgtk'] = "kgtk" # Need to do this for kgtk to be recognized as a command when passing it through a subprocess call
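
Before running the queries below, it can help to confirm that the inputs from the earlier notebooks are actually present. The following is a minimal sketch, assuming only the environment variables set above; `missing_paths` is a hypothetical helper and is not part of the notebook's `utility` module.

```python
import os

def missing_paths(env_names, environ=os.environ):
    """Return the names of environment variables whose path is unset or absent on disk."""
    missing = []
    for name in env_names:
        path = environ.get(name)
        if not path or not os.path.exists(path):
            missing.append(name)
    return missing

# Input files and directories that the earlier notebooks are expected to provide
required = ["ITEM_FILE", "TIME_FILE", "QUANTITY_FILE", "LABEL_FILE",
            "FILTERED_LABELS", "LABEL_CREATION_DIR"]
problems = missing_paths(required)
if problems:
    print("Missing inputs: {}".format(", ".join(problems)))
```

If anything is reported, re-check the parameters cell and the prerequisite notebooks before continuing.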

## 1. Get labels that abstract entities of the type we want to profile and find the entities that match each of those labels

**Note:** This step could possibly be a separate notebook, or simply code moved out of the notebook.

**Note:** We could also do this earlier, for example during label creation. For scalability, that would likely mean moving the type_to_profile decision to that step as well, so that we don't enumerate all labels and their entities (this would blow up when we get to RALs).

**Note:** Since we didn't keep quantities and times separate for RAVLs and RAILs, the final label set can't tell from the label alone whether a value is a time or a quantity. This isn't much of a problem, because the property disambiguates time versus quantity, but it does mean that we'll have to look in a concatenated quantities/times file when creating a profile for an entity.

We will later choose from these labels to form the final label set, and we will do this based off of several formulas that take into account the entities that match each label.

Also, we need each label to have a unique identifier (again this is something that we could have done along the way in earlier notebooks - would simplify some queries). We'll add a column for this here as well.
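
The queries below build each label's identifier by prefixing the id from the filtered candidate file with the label kind (for example `REL-` or `AIL-quantity-`). A small sketch of why the prefix helps, under the assumption (not stated in the source) that raw ids from different candidate files can repeat; `make_label_id` is a hypothetical name.

```python
def make_label_id(kind, raw_id):
    """Prefix a raw id with its label kind so ids stay unique across kinds."""
    return "{}-{}".format(kind, raw_id)

# Assumed scenario: the same raw id appears in three different candidate files
raw = [("REL", "E12"), ("AIL-quantity", "E12"), ("AIL-time.year", "E12")]
label_ids = [make_label_id(kind, raw_id) for kind, raw_id in raw]
print(label_ids)
```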

### 1.1 RELs

In [31]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_rel_item_filtered.tsv \
-i $ITEM_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv -i $LABEL_FILE \
-o $WORK/RELs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type:`'"$TYPE"'`)-[l1 {label:p, prop_label:p_lab, id:label_id}]->(val), `'"$ITEM_FILE"'`: (e)-[l2 {label:p}]->(val), type: (e)-[]->(type), `'"$LABEL_FILE"'`: (type)-[:label]->(type_lab), `'"$LABEL_FILE"'`: (val)-[:label]->(val_lab), `'"$LABEL_FILE"'`: (e)-[:label]->(ent_lab)' \
--return 'distinct printf("REL-%s", label_id) as node1, "pos_entity" as label, e as node2, ent_lab as `node2;label`, type as type, type_lab as type_label, p as prop, p_lab as prop_label, val as value, val_lab as value_label, "_" as id' \
--where 'type_lab.kgtk_lqstring_lang_suffix = "en" AND val_lab.kgtk_lqstring_lang_suffix = "en" AND ent_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'type, p, value, e'

In [32]:
!kgtk add-id -i $WORK/RELs_and_their_entities.tsv -o $WORK/RELs_and_their_entities_temp.tsv --overwrite-id \
&& mv $WORK/RELs_and_their_entities_temp.tsv $WORK/RELs_and_their_entities.tsv


In [33]:
labels_df = pd.read_csv("{}/RELs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t')
print("number of distinct labels: {}".format(len(labels_df.node1.unique())))
print("number of rows (labels can have multiple matching entities): {}".format(labels_df.shape[0]))
if not labels_df.empty:
    display(labels_df.loc[:10].fillna(""))

number of distinct labels: 3
number of rows (labels can have multiple matching entities): 7649907


Unnamed: 0,node1,label,node2,node2;label,type,type_label,prop,prop_label,value,value_label,id
0,REL-E7388537,pos_entity,Q100017731,'Ziyoung Park'@en,Q5,'human'@en,P106,'occupation'@en,Q1650915,'researcher'@en,E1
1,REL-E7388537,pos_entity,Q100019468,'Marcin Trzmielewski'@en,Q5,'human'@en,P106,'occupation'@en,Q1650915,'researcher'@en,E2
2,REL-E7388537,pos_entity,Q100023549,'Uttam Khanal'@en,Q5,'human'@en,P106,'occupation'@en,Q1650915,'researcher'@en,E3
3,REL-E7388537,pos_entity,Q100038944,'Liv Vadskjær Hjordt'@en,Q5,'human'@en,P106,'occupation'@en,Q1650915,'researcher'@en,E4
4,REL-E7388537,pos_entity,Q100049183,'Christopher A. Armatas'@en,Q5,'human'@en,P106,'occupation'@en,Q1650915,'researcher'@en,E5
5,REL-E7388537,pos_entity,Q100057080,'Jarle Vaage'@en,Q5,'human'@en,P106,'occupation'@en,Q1650915,'researcher'@en,E6
6,REL-E7388537,pos_entity,Q100104271,'Pedro Szekely'@en,Q5,'human'@en,P106,'occupation'@en,Q1650915,'researcher'@en,E7
7,REL-E7388537,pos_entity,Q100107129,'Filip Ilievski'@en,Q5,'human'@en,P106,'occupation'@en,Q1650915,'researcher'@en,E8
8,REL-E7388537,pos_entity,Q100138094,'Alain Harf'@en,Q5,'human'@en,P106,'occupation'@en,Q1650915,'researcher'@en,E9
9,REL-E7388537,pos_entity,Q100142069,'Frida Eggens'@en,Q5,'human'@en,P106,'occupation'@en,Q1650915,'researcher'@en,E10


### 1.2 AVLs - quantities

In [34]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_avl_quantity_filtered.tsv \
-i $QUANTITY_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv -i $LABEL_FILE \
-o $WORK/quantity_AVLs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type:`'"$TYPE"'`)-[l1 {label:p, prop_label:p_lab, id:label_id, si_units:label_si, wd_units:label_wd}]->(label_quantity_num), `'"$QUANTITY_FILE"'`: (e)-[l2 {label:p}]->(val), type: (e)-[]->(type), `'"$LABEL_FILE"'`: (type)-[:label]->(type_lab), `'"$LABEL_FILE"'`: (e)-[:label]->(ent_lab)' \
--return 'distinct printf("AVL-quantity-%s", label_id) as node1, "pos_entity" as label, e as node2, ent_lab as `node2;label`, type as type, type_lab as type_label, p as prop, p_lab as prop_label, label_quantity_num as value, label_si as si_units, label_wd as wd_units, "_" as id' \
--where 'ent_lab.kgtk_lqstring_lang_suffix = "en" AND type_lab.kgtk_lqstring_lang_suffix = "en" AND kgtk_quantity_number(val)=label_quantity_num AND ( (kgtk_quantity_si_units(val) is null AND label_si="") OR (kgtk_quantity_si_units(val)=label_si) ) AND ( (kgtk_quantity_wd_units(val) is null AND label_wd="") OR (kgtk_quantity_wd_units(val)=label_wd) )' \
--order-by 'type, prop, value, si_units, wd_units, e'


In [35]:
!kgtk add-id -i $WORK/quantity_AVLs_and_their_entities.tsv -o $WORK/quantity_AVLs_and_their_entities_temp.tsv --overwrite-id \
&& mv $WORK/quantity_AVLs_and_their_entities_temp.tsv $WORK/quantity_AVLs_and_their_entities.tsv

In [36]:
labels_df = pd.read_csv("{}/quantity_AVLs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t')
print("number of distinct labels: {}".format(len(labels_df.node1.unique())))
print("number of rows (labels can have multiple matching entities): {}".format(labels_df.shape[0]))
if not labels_df.empty:
    display(labels_df.loc[:10].fillna(""))


number of distinct labels: 0
number of rows (labels can have multiple matching entities): 0


### 1.3 AVLs - times

In [37]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_avl_time.year_filtered.tsv \
-i $TIME_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv -i $LABEL_FILE \
-o $WORK/time.year_AVLs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type:`'"$TYPE"'`)-[l1 {label:p, prop_label:p_lab, id:label_id}]->(label_year), `'"$TIME_FILE"'`: (e)-[l2 {label:p}]->(val), type: (e)-[]->(type), `'"$LABEL_FILE"'`: (type)-[:label]->(type_lab), `'"$LABEL_FILE"'`: (e)-[:label]->(ent_lab)' \
--return 'distinct printf("AVL-time.year-%s", label_id) as node1, "pos_entity" as label, e as node2, ent_lab as `node2;label`, type as type, type_lab as type_label, p as prop, p_lab as prop_label, label_year as value, "_" as id' \
--where 'ent_lab.kgtk_lqstring_lang_suffix = "en" AND type_lab.kgtk_lqstring_lang_suffix = "en" AND kgtk_date_year(val)=label_year' \
--order-by 'type, prop, value, e'

In [38]:
!kgtk add-id -i $WORK/time.year_AVLs_and_their_entities.tsv -o $WORK/time.year_AVLs_and_their_entities_temp.tsv --overwrite-id \
&& mv $WORK/time.year_AVLs_and_their_entities_temp.tsv $WORK/time.year_AVLs_and_their_entities.tsv


In [39]:
labels_df = pd.read_csv("{}/time.year_AVLs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t')
print("number of distinct labels: {}".format(len(labels_df.node1.unique())))
print("number of rows (labels can have multiple matching entities): {}".format(labels_df.shape[0]))
if not labels_df.empty:
    display(labels_df.loc[:10].fillna(""))

number of distinct labels: 0
number of rows (labels can have multiple matching entities): 0


### 1.4 AILs - quantities

In [23]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_ail_quantity_filtered.tsv \
-i $QUANTITY_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv -i $LABEL_FILE \
-o $WORK/quantity_AILs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type:`'"$TYPE"'`)-[l1 {label:p, prop_label:p_lab, id:label_id, upper_bound:ub, si_units:label_si, wd_units:label_wd}]->(lb), `'"$QUANTITY_FILE"'`: (e)-[l2 {label:p}]->(val), type: (e)-[]->(type), `'"$LABEL_FILE"'`: (type)-[:label]->(type_lab), `'"$LABEL_FILE"'`: (e)-[:label]->(ent_lab)' \
--return 'distinct printf("AIL-quantity-%s", label_id) as node1, "pos_entity" as label, e as node2, ent_lab as `node2;label`, type as type, type_lab as type_label, p as prop, p_lab as prop_label, lb as value_lb, ub as value_ub, label_si as si_units, label_wd as wd_units, "_" as id' \
--where 'ent_lab.kgtk_lqstring_lang_suffix = "en" AND type_lab.kgtk_lqstring_lang_suffix = "en" AND (not lb OR kgtk_quantity_number(val) >= kgtk_quantity_number(lb)) AND (not ub OR kgtk_quantity_number(ub) >= kgtk_quantity_number(val)) AND ( (kgtk_quantity_si_units(val) is null AND label_si="") OR (kgtk_quantity_si_units(val)=label_si) ) AND ( (kgtk_quantity_wd_units(val) is null AND label_wd="") OR (kgtk_quantity_wd_units(val)=label_wd) )' \
--order-by 'type, prop, value_lb, si_units, wd_units, e'
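
The `--where` clause above keeps an entity when its value falls inside the label's interval, treating a missing bound as open on that side, and when its units agree with the label's units. The same logic as a Python sketch, assuming a missing bound is represented as `None`; `in_interval` and `units_match` are hypothetical names used only for illustration.

```python
def in_interval(value, lower=None, upper=None):
    """True if value lies in [lower, upper]; a bound of None is treated as open."""
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True

def units_match(value_units, label_units):
    """A value without units (None) matches a label whose units are the empty string."""
    if value_units is None:
        return label_units == ""
    return value_units == label_units

# An entity with 2 children and no units, checked against the label [0, 2] with empty units
print(in_interval(2.0, 0.0, 2.0) and units_match(None, ""))
```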


In [24]:
!kgtk add-id -i $WORK/quantity_AILs_and_their_entities.tsv -o $WORK/quantity_AILs_and_their_entities_temp.tsv --overwrite-id \
&& mv $WORK/quantity_AILs_and_their_entities_temp.tsv $WORK/quantity_AILs_and_their_entities.tsv

In [25]:
labels_df = pd.read_csv("{}/quantity_AILs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t')
print("number of distinct labels: {}".format(len(labels_df.node1.unique())))
print("number of rows (labels can have multiple matching entities): {}".format(labels_df.shape[0]))
if not labels_df.empty:
    display(labels_df.loc[:10].fillna(""))

number of distinct labels: 12
number of rows (labels can have multiple matching entities): 264647


Unnamed: 0,node1,label,node2,node2;label,type,type_label,prop,prop_label,value_lb,value_ub,si_units,wd_units,id
0,AIL-quantity-E2632,pos_entity,Q1001168,'Mihály Ficsor'@en,Q5,'human'@en,P1971,'number of children'@en,0.0,2.0,,,E1
1,AIL-quantity-E2632,pos_entity,Q100152426,'Flip Kobler'@en,Q5,'human'@en,P1971,'number of children'@en,0.0,2.0,,,E2
2,AIL-quantity-E2632,pos_entity,Q100152438,'Cindy Marcus'@en,Q5,'human'@en,P1971,'number of children'@en,0.0,2.0,,,E3
3,AIL-quantity-E2632,pos_entity,Q1001652,'Mátyás Firtl'@en,Q5,'human'@en,P1971,'number of children'@en,0.0,2.0,,,E4
4,AIL-quantity-E2632,pos_entity,Q100220,'Susan Cummings'@en,Q5,'human'@en,P1971,'number of children'@en,0.0,2.0,,,E5
5,AIL-quantity-E2632,pos_entity,Q1002295,'Gábor Fodor'@en,Q5,'human'@en,P1971,'number of children'@en,0.0,2.0,,,E6
6,AIL-quantity-E2632,pos_entity,Q100230925,'Vít Vomáčka'@en,Q5,'human'@en,P1971,'number of children'@en,0.0,2.0,,,E7
7,AIL-quantity-E2632,pos_entity,Q100233515,'Precious Chong'@en,Q5,'human'@en,P1971,'number of children'@en,0.0,2.0,,,E8
8,AIL-quantity-E2632,pos_entity,Q100233531,'Wesley Cardinal'@en,Q5,'human'@en,P1971,'number of children'@en,0.0,2.0,,,E9
9,AIL-quantity-E2632,pos_entity,Q100234694,'Franck Brinsolaro'@en,Q5,'human'@en,P1971,'number of children'@en,0.0,2.0,,,E10


### 1.5 AILs - times

In [10]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_ail_time.year_filtered.tsv \
-i $TIME_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv -i $LABEL_FILE \
-o $WORK/time.year_AILs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type:`'"$TYPE"'`)-[l1 {label:p, prop_label:p_lab, id:label_id, upper_bound:ub}]->(lb), `'"$TIME_FILE"'`: (e)-[l2 {label:p}]->(val), type: (e)-[]->(type), `'"$LABEL_FILE"'`: (type)-[:label]->(type_lab), `'"$LABEL_FILE"'`: (e)-[:label]->(ent_lab)' \
--return 'distinct printf("AIL-time.year-%s", label_id) as node1, "pos_entity" as label, e as node2, ent_lab as `node2;label`, type as type, type_lab as type_label, p as prop, p_lab as prop_label, lb as value_lb, ub as value_ub, "_" as id' \
--where 'ent_lab.kgtk_lqstring_lang_suffix = "en" AND type_lab.kgtk_lqstring_lang_suffix = "en" AND (not lb OR kgtk_date_year(val) >= kgtk_quantity_number(lb)) AND (not ub OR kgtk_quantity_number(ub) >= kgtk_date_year(val))' \
--order-by 'type, prop, value_lb, e'


In [20]:
!kgtk add-id -i $WORK/time.year_AILs_and_their_entities.tsv -o $WORK/time.year_AILs_and_their_entities_temp.tsv --overwrite-id \
&& mv $WORK/time.year_AILs_and_their_entities_temp.tsv $WORK/time.year_AILs_and_their_entities.tsv

In [22]:
labels_df = pd.read_csv("{}/time.year_AILs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t')
print("number of distinct labels: {}".format(len(labels_df.node1.unique())))
print("number of rows (labels can have multiple matching entities): {}".format(labels_df.shape[0]))
if not labels_df.empty:
    display(labels_df.loc[:10].fillna(""))

number of distinct labels: 87
number of rows (labels can have multiple matching entities): 7000948


Unnamed: 0,node1,label,node2,node2;label,type,type_label,prop,prop_label,value_lb,value_ub,id
0,AIL-time.year-E7872,pos_entity,Q1000366,'Bucky Pizzarelli'@en,Q5,'human'@en,P2031,'work period (start)'@en,1940,1950,E1
1,AIL-time.year-E7872,pos_entity,Q1001117,'Buddy Banks'@en,Q5,'human'@en,P2031,'work period (start)'@en,1940,1950,E2
2,AIL-time.year-E7872,pos_entity,Q1001175,'Buddy Elias'@en,Q5,'human'@en,P2031,'work period (start)'@en,1940,1950,E3
3,AIL-time.year-E7872,pos_entity,Q1001203,'Buddy Greco'@en,Q5,'human'@en,P2031,'work period (start)'@en,1940,1950,E4
4,AIL-time.year-E7872,pos_entity,Q100122,'Kurt Kreuger'@en,Q5,'human'@en,P2031,'work period (start)'@en,1940,1950,E5
5,AIL-time.year-E7872,pos_entity,Q1001311,'Buddy Starcher'@en,Q5,'human'@en,P2031,'work period (start)'@en,1940,1950,E6
6,AIL-time.year-E7872,pos_entity,Q100424665,'Hilde von Baravalle'@en,Q5,'human'@en,P2031,'work period (start)'@en,1940,1950,E7
7,AIL-time.year-E7872,pos_entity,Q100440,'Larry Hagman'@en,Q5,'human'@en,P2031,'work period (start)'@en,1940,1950,E8
8,AIL-time.year-E7872,pos_entity,Q1005544,'Vittorio Cottafavi'@en,Q5,'human'@en,P2031,'work period (start)'@en,1940,1950,E9
9,AIL-time.year-E7872,pos_entity,Q100705041,'Friederike Tschiedel'@en,Q5,'human'@en,P2031,'work period (start)'@en,1940,1950,E10


### 1.6 RAVLs - quantities and times
Quantities and times are handled together here because they were mixed together in the label creation step.

In [53]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_ravl_filtered.tsv \
-i $LABEL_CREATION_DIR/entity_AVLs_all.tsv -i $ITEM_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv -i $LABEL_FILE \
-o $WORK/RAVLs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type1:`'"$TYPE"'`)-[l1 {label:p1, id:label_id, prop2:p2, prop2_label:p2_lab, val:val, si_units:si, wd_units:wd}]->(type2), `'"$ITEM_FILE"'`: (e1)-[l2 {label:p1}]->(e2), AVLs_all: (type2)-[l3 {label:p2, entity:e2, si_units:si, wd_units:wd}]->(val2), type: (e1)-[]->(type1), `'"$LABEL_FILE"'`: (type1)-[:label]->(type1_lab), `'"$LABEL_FILE"'`: (p1)-[:label]->(p1_lab), `'"$LABEL_FILE"'`: (type2)-[:label]->(type2_lab), `'"$LABEL_FILE"'`: (e1)-[:label]->(ent1_lab)' \
--return 'distinct printf("RAVL-%s", label_id) as node1, "pos_entity" as label, e1 as node2, ent1_lab as `node2;label`, type1 as type, type1_lab as type_label, p1 as prop, p1_lab as prop_label, type2 as value, type2_lab as value_label, p2 as prop2, p2_lab as prop2_label, val as value2, si as si_units, wd as wd_units, "_" as id' \
--where 'ent1_lab.kgtk_lqstring_lang_suffix = "en" AND type1_lab.kgtk_lqstring_lang_suffix = "en" AND p1_lab.kgtk_lqstring_lang_suffix = "en" AND type2_lab.kgtk_lqstring_lang_suffix = "en" AND kgtk_quantity_number(val)=kgtk_quantity_number(val2)' \
--order-by 'type1, p1, type2, p2, value2, si, wd, e1'

In [54]:
labels_df = pd.read_csv("{}/RAVLs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t')
print("number of distinct labels: {}".format(len(labels_df.node1.unique())))
print("number of rows (labels can have multiple matching entities): {}".format(labels_df.shape[0]))
if not labels_df.empty:
    display(labels_df.loc[:10].fillna(""))


number of distinct labels: 10
number of rows (labels can have multiple matching entities): 10743308


Unnamed: 0,node1,label,node2,node2;label,type,type_label,prop,prop_label,value,value_label,prop2,prop2_label,value2,si_units,wd_units,id
0,RAVL-E9755347,pos_entity,Q1000002,'Claus Hammel'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,P2884,'mains voltage'@en,230.0,,Q25250,_
1,RAVL-E9755347,pos_entity,Q1000006,'Florian Eichinger'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,P2884,'mains voltage'@en,230.0,,Q25250,_
2,RAVL-E9755347,pos_entity,Q1000015,'Florian Jahr'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,P2884,'mains voltage'@en,230.0,,Q25250,_
3,RAVL-E9755347,pos_entity,Q1000023,'Wiltraut Rupp-von Brünneck'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,P2884,'mains voltage'@en,230.0,,Q25250,_
4,RAVL-E9755347,pos_entity,Q1000044,'Sylvia Bayr-Klimpfinger'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,P2884,'mains voltage'@en,230.0,,Q25250,_
5,RAVL-E9755347,pos_entity,Q1000045,'Eugen Karl Brammertz'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,P2884,'mains voltage'@en,230.0,,Q25250,_
6,RAVL-E9755347,pos_entity,Q1000048,'Franz Zimmermann'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,P2884,'mains voltage'@en,230.0,,Q25250,_
7,RAVL-E9755347,pos_entity,Q100005,'Tadeusz Borowski'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,P2884,'mains voltage'@en,230.0,,Q25250,_
8,RAVL-E9755347,pos_entity,Q1000050,'Mayk Bullerjahn'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,P2884,'mains voltage'@en,230.0,,Q25250,_
9,RAVL-E9755347,pos_entity,Q1000061,'Valentyn Symonenko'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,P2884,'mains voltage'@en,230.0,,Q25250,_


Note: below we add ids via a temporary file, which we then move over the original, to avoid what seems to be a bug. A GitHub issue has been filed for this.

In [55]:
!kgtk add-id -i $WORK/RAVLs_and_their_entities.tsv -o $WORK/RAVLs_and_their_entities_temp.tsv --overwrite-id \
&& mv $WORK/RAVLs_and_their_entities_temp.tsv $WORK/RAVLs_and_their_entities.tsv
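
The cell above writes the `add-id` output to a `_temp` file and then moves it over the original. A rough Python analogue of that write-then-rename pattern is sketched below; `rewrite_in_place` is a hypothetical helper and says nothing about how `kgtk` itself handles files.

```python
import os
import tempfile

def rewrite_in_place(path, transform_line):
    """Stream path through transform_line into a temp file, then swap it in."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the final rename stays on one filesystem
    fd, temp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as dst, open(path) as src:
        for line in src:
            dst.write(transform_line(line))
    # The original is replaced only after the new contents are fully written
    os.replace(temp_path, path)
```

For example, `rewrite_in_place("edges.tsv", str.upper)` would upper-case every line of a (hypothetical) `edges.tsv` without ever reading and writing the same path at once.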

In [56]:
labels_df = pd.read_csv("{}/RAVLs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t')
print("number of distinct labels: {}".format(len(labels_df.node1.unique())))
print("number of rows (labels can have multiple matching entities): {}".format(labels_df.shape[0]))
if not labels_df.empty:
    display(labels_df.loc[:10].fillna(""))

number of distinct labels: 0
number of rows (labels can have multiple matching entities): 0


### 1.7 RAILs - quantities and times
Quantities and times are handled together here because they were mixed together in the label creation step.

In [57]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_rail_filtered.tsv \
-i $LABEL_CREATION_DIR/entity_AILs_all.tsv -i $ITEM_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv -i $LABEL_FILE \
-o $WORK/RAILs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type1:`'"$TYPE"'`)-[l1 {label:p1, id:label_id, prop2:p2, prop2_label:p2_lab, lower_bound:lb, upper_bound:ub, si_units:si, wd_units:wd}]->(type2), `'"$ITEM_FILE"'`: (e1)-[l2 {label:p1}]->(e2), AILs_all: (type2)-[l3 {label:p2, entity:e2, lower_bound:lb, upper_bound:ub, si_units:si, wd_units:wd}]->(), type: (e1)-[]->(type1), `'"$LABEL_FILE"'`: (type1)-[:label]->(type1_lab), `'"$LABEL_FILE"'`: (p1)-[:label]->(p1_lab), `'"$LABEL_FILE"'`: (type2)-[:label]->(type2_lab), `'"$LABEL_FILE"'`: (e1)-[:label]->(ent1_lab)' \
--return 'distinct printf("RAIL-%s", label_id) as node1, "pos_entity" as label, e1 as node2, ent1_lab as `node2;label`, type1 as type, type1_lab as type_label, p1 as prop, p1_lab as prop_label, type2 as value, type2_lab as value_label, p2 as prop2, p2_lab as prop2_label, lb as value2_lb, ub as value2_ub, si as si_units, wd as wd_units, "_" as id' \
--where 'ent1_lab.kgtk_lqstring_lang_suffix = "en" AND type1_lab.kgtk_lqstring_lang_suffix = "en" AND p1_lab.kgtk_lqstring_lang_suffix = "en" AND type2_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'type1, p1, type2, p2, lb, ub, si, wd, e1'

In [58]:
!kgtk add-id -i $WORK/RAILs_and_their_entities.tsv -o $WORK/RAILs_and_their_entities_temp.tsv --overwrite-id \
&& mv $WORK/RAILs_and_their_entities_temp.tsv $WORK/RAILs_and_their_entities.tsv

In [59]:
labels_df = pd.read_csv("{}/RAILs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t')
print("number of distinct labels: {}".format(len(labels_df.node1.unique())))
print("number of rows (labels can have multiple matching entities): {}".format(labels_df.shape[0]))
if not labels_df.empty:
    display(labels_df.loc[:10].fillna(""))


number of distinct labels: 51
number of rows (labels can have multiple matching entities): 73660531


Unnamed: 0,node1,label,node2,node2;label,type,type_label,prop,prop_label,value,value_label,prop2,prop2_label,value2_lb,value2_ub,si_units,wd_units,id
0,RAIL-E4301046,pos_entity,Q10000001,'Tatyana Kolotilshchikova'@en,Q5,'human'@en,P19,'place of birth'@en,Q1549591,'big city'@en,P2044,'elevation above sea level'@en,-14.0,730.5,,Q11573,E1
1,RAIL-E4301046,pos_entity,Q1000053,'Vasily Nebenzya'@en,Q5,'human'@en,P19,'place of birth'@en,Q1549591,'big city'@en,P2044,'elevation above sea level'@en,-14.0,730.5,,Q11573,E2
2,RAIL-E4301046,pos_entity,Q1000061,'Valentyn Symonenko'@en,Q5,'human'@en,P19,'place of birth'@en,Q1549591,'big city'@en,P2044,'elevation above sea level'@en,-14.0,730.5,,Q11573,E3
3,RAIL-E4301046,pos_entity,Q1000074,'Richard Ellerkmann'@en,Q5,'human'@en,P19,'place of birth'@en,Q1549591,'big city'@en,P2044,'elevation above sea level'@en,-14.0,730.5,,Q11573,E4
4,RAIL-E4301046,pos_entity,Q1000079,'Otto Arndt Liebisch'@en,Q5,'human'@en,P19,'place of birth'@en,Q1549591,'big city'@en,P2044,'elevation above sea level'@en,-14.0,730.5,,Q11573,E5
5,RAIL-E4301046,pos_entity,Q1000087,'Horst Baier'@en,Q5,'human'@en,P19,'place of birth'@en,Q1549591,'big city'@en,P2044,'elevation above sea level'@en,-14.0,730.5,,Q11573,E6
6,RAIL-E4301046,pos_entity,Q1000089,'Ulrike Scheel'@en,Q5,'human'@en,P19,'place of birth'@en,Q1549591,'big city'@en,P2044,'elevation above sea level'@en,-14.0,730.5,,Q11573,E7
7,RAIL-E4301046,pos_entity,Q100013138,'James Gordon Kelly'@en,Q5,'human'@en,P19,'place of birth'@en,Q1549591,'big city'@en,P2044,'elevation above sea level'@en,-14.0,730.5,,Q11573,E8
8,RAIL-E4301046,pos_entity,Q1000180,'Thilo Koch'@en,Q5,'human'@en,P19,'place of birth'@en,Q1549591,'big city'@en,P2044,'elevation above sea level'@en,-14.0,730.5,,Q11573,E9
9,RAIL-E4301046,pos_entity,Q1000183,'Martin Bärenz'@en,Q5,'human'@en,P19,'place of birth'@en,Q1549591,'big city'@en,P2044,'elevation above sea level'@en,-14.0,730.5,,Q11573,E10


### 1.8 Combine into a single labels file

In [60]:
!kgtk cat -i $WORK/RELs_and_their_entities.tsv \
-i $WORK/quantity_AVLs_and_their_entities.tsv \
-i $WORK/time.year_AVLs_and_their_entities.tsv \
-i $WORK/quantity_AILs_and_their_entities.tsv \
-i $WORK/time.year_AILs_and_their_entities.tsv \
-i $WORK/RAVLs_and_their_entities.tsv \
-i $WORK/RAILs_and_their_entities.tsv \
-o $WORK/labels_and_their_entities.tsv \
&& kgtk add-id -i $WORK/labels_and_their_entities.tsv --overwrite-id -o $WORK/labels_and_their_entities_temp.tsv \
&& mv $WORK/labels_and_their_entities_temp.tsv $WORK/labels_and_their_entities.tsv
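
The per-kind files have different columns (for example `value` for RELs versus `value_lb` and `value_ub` for AILs), so the combined file carries the union of those columns. The pandas sketch below illustrates that shape on toy rows; it is only an illustration with made-up data and is not a substitute for the `kgtk cat` and `kgtk add-id` commands above.

```python
import pandas as pd

# Toy rows standing in for two of the per-kind files (made-up ids and values)
rels = pd.DataFrame({"node1": ["REL-E1"], "label": ["pos_entity"], "node2": ["Q1"],
                     "value": ["Q10"]})
ails = pd.DataFrame({"node1": ["AIL-quantity-E1"], "label": ["pos_entity"], "node2": ["Q2"],
                     "value_lb": [0.0], "value_ub": [2.0]})

# Columns missing from one frame are filled with NaN in the combined frame
combined = pd.concat([rels, ails], ignore_index=True, sort=False)
# Re-number the edges so that ids are unique across the combined table
combined["id"] = ["E{}".format(i + 1) for i in range(len(combined))]
print(combined.fillna(""))
```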

## 2. Get all entities of the type we are profiling

In [34]:
!kgtk query -i $LABEL_CREATION_DIR/type_mapping.tsv -i $LABEL_FILE \
-o $WORK/entities_in_type.tsv --graph-cache $STORE \
--match 'type: (entity)-[]->(type:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (entity)-[:label]->(lab)' \
--return 'distinct entity as entity, lab as label' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'entity'

In [35]:
entities_in_type_df = pd.read_csv("{}/entities_in_type.tsv".format(os.environ["WORK"]), delimiter = '\t')
print("number of entities: {}".format(entities_in_type_df.shape[0]))
display(entities_in_type_df.loc[:10].fillna(""))

number of entities: 176405


Unnamed: 0,entity,label
0,Q1000094,'You\\\\'re Dead'@en
1,Q1000174,'Tinko'@en
2,Q1000394,'This Modern Age'@en
3,Q10005695,'Happy Journey'@en
4,Q100057542,'The Two Sights'@en
5,Q10007277,"'Pacho, hybský zbojník'@en"
6,Q1000825,'Jan Dara'@en
7,Q1000826,'Guns of the Magnificent Seven'@en
8,Q100097551,'American Murder: The Family Next Door'@en
9,Q100104941,'Die Filmstadt Hollywood'@en


## 3. Load information needed into Python variables.
1. Load dictionary of all labels found in step 1 above along with their corresponding positive entities (note, performance of this step might be improved if we do it along the way in step 1)
2. Load set of all entities of the type we are profiling (compiled in step 2)
3. Load the keyed vector of entity embeddings (created in HAS_entity_embeddings notebook)

In [68]:
%%time
# load labels from step 1
labels_df = pd.read_csv("{}/labels_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t').fillna("")

# create dictionary of label_id --> set of matching entities
# using groupby is not the most efficient, but more concise
label_to_entities = labels_df.groupby("node1")["node2"].apply(set).to_dict()
print("total number of candidate labels for profiling entities of type {}: {}".format(type_to_profile, len(label_to_entities)))

# look up number of entities of the type we are profiling
all_ents_in_type = set(pd.read_csv("{}/entities_in_type.tsv".format(os.environ['WORK']), delimiter = '\t').entity)
print("Loaded set of all entities of the type we are profiling ({} total)".format(len(all_ents_in_type)))

# load embeddings created in the embeddings notebook
entity_embeddings = KeyedVectors.load("{}/HAS_embeddings/entity_embeddings.kv".format(work_dir))
print("Loaded entity embeddings")



total number of candidate labels for profiling entities of type Q5: 54
Loaded set of all entities of the type we are profiling (7958973 total)
Loaded entity embeddings
CPU times: user 10min 17s, sys: 2min 2s, total: 12min 19s
Wall time: 13min 21s
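
The comment in the cell above notes that `groupby` is concise but not the most efficient way to build the label-to-entities dictionary. As a sketch, here is a single-pass alternative on toy data that produces the same mapping; whether it is actually faster on the full file has not been measured here.

```python
from collections import defaultdict
import pandas as pd

# Toy edges: label id in node1, matching entity in node2 (made-up values)
toy = pd.DataFrame({
    "node1": ["REL-E1", "REL-E1", "AIL-quantity-E2", "REL-E1"],
    "node2": ["Q1", "Q2", "Q1", "Q2"],
})

# Concise version, as in the cell above
by_groupby = toy.groupby("node1")["node2"].apply(set).to_dict()

# Single pass over the two columns
by_loop = defaultdict(set)
for label_id, entity in zip(toy["node1"], toy["node2"]):
    by_loop[label_id].add(entity)

print(by_groupby == dict(by_loop))
```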


## 4. Compute distinctiveness for each label
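The distinctiveness of a label is the average similarity among its positive entities minus the average similarity between its positive and negative entities. Positive entities are those that match the label; negative entities are the remaining entities of the type being profiled.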
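
To make the definition concrete, here is a small sketch on toy two-dimensional embeddings. It compares all pairs exhaustively, whereas the notebook's cell uses a window and random sampling to keep the cost manageable; `cosine` and `distinctiveness` are hypothetical helper names.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def distinctiveness(pos, neg):
    """Mean pairwise similarity within pos minus mean similarity between pos and neg."""
    internal = [cosine(pos[i], pos[j])
                for i in range(len(pos)) for j in range(i + 1, len(pos))]
    external = [cosine(p, n) for p in pos for n in neg]
    # A label with a single positive entity gets internal similarity 1
    avg_internal = np.mean(internal) if internal else 1.0
    return float(avg_internal - np.mean(external))

pos = np.array([[1.0, 0.0], [1.0, 0.1]])   # two similar positive entities
neg = np.array([[0.0, 1.0], [-1.0, 0.0]])  # dissimilar negative entities
print(distinctiveness(pos, neg))
```

A label whose positive entities are close to each other and far from the rest scores high; a label whose positives look like everything else scores near zero or below.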

In [70]:
%%time
# For humans this takes about 3 mins per label. We currently compare all positive entities
# against a sample of other entities; sampling both sides could speed this up.
# Sampling params:
internal_sim_window_size = 100 # Internal is faster, so can set this higher
external_sim_sample_size = 50

print("Calculating distinctiveness for {} labels".format(len(label_to_entities)))
count = 0

label_distinctiveness = {}
# todo - use comprehension instead of loop
for label, pos_ents in tqdm(label_to_entities.items()):
    # Look up the positive embeddings first: both averages below need them,
    # including for labels that match only a single entity
    pos_ent_embeds = entity_embeddings[list(pos_ents)]

    # average internal similarity
    if len(pos_ents) < 2:
        avg_internal_similarity = 1 # only relevant if we have labels that apply to only a single entity
    else:
        window_size = min(internal_sim_window_size, len(pos_ents) -1)
        avg_internal_similarity = np.sum([np.sum(entity_embeddings.cosine_similarities(pos_ent_embeds[i], pos_ent_embeds[i+1:i+1+window_size])) for i in range(len(pos_ents))]) / ((len(pos_ents)*window_size) - ((window_size+1)*window_size/2))

    # average external similarity
    neg_ents = all_ents_in_type - pos_ents
    
    num_samples = min(external_sim_sample_size, len(neg_ents))
    neg_ent_embeds = list(entity_embeddings[list(neg_ents)])
    avg_external_similarity = np.sum([np.sum(entity_embeddings.cosine_similarities(pos_ent_embeds[i], random.sample(neg_ent_embeds,num_samples))) for i in range(len(pos_ents))]) / (len(pos_ents)*num_samples)

    label_distinctiveness[label] = avg_internal_similarity - avg_external_similarity
    
    count += 1
    print("Done with {} labels{}".format(count, "             "), end="\r")
    
with open("{}/label_distinctiveness_dict.json".format(label_set_work_dir), "w") as f:
    json.dump(label_distinctiveness, f)

Calculating distinctiveness for 54 labels



Done with 1 labels             


KeyboardInterrupt: 

Let's take a look at the most distinctive labels

In [257]:
print("There are {} labels, and {} distinct values of distinctiveness.".format(len(label_distinctiveness),len(set(label_distinctiveness.values()))))
print("Most distinctive labels:")
most_distinct = sorted(label_distinctiveness.items(), key=lambda item: item[1], reverse=True)[:5]
display(most_distinct)
print("Looking at the top 5:")
cols_to_display = ["type", "type_label", "prop", "prop_label", "value", "value_label", "value_lb", "value_ub", "prop2", "prop2_label", "value2", "value2_lb", "value2_ub", "si_units", "wd_units"]
for i in range(len(most_distinct)):
    label_id = most_distinct[i][0]
    display(labels_df.loc[labels_df["node1"]==label_id, cols_to_display].fillna("").iloc[[0]])
    print("Number of positive entities for this label: {} (out of {} entities of this type)".format(len(label_to_entities[label_id]), len(all_ents_in_type)))
#print("Positive entities for this label: {}".format(", ".join(label_to_entities[label_id])))
# negatives = all_ents_in_type - label_to_entities[label_id]
# print("Negative entities for this label: {}".format(", ".join(negatives)))

There are 273 labels, and 273 distinct values of distinctiveness.
Most distinctive labels:


[('RAIL-E976418', 0.31647640386999143),
 ('RAIL-E976411', 0.3164195152423728),
 ('RAIL-E976422', 0.31641950157321325),
 ('RAIL-E976421', 0.31638429720047456),
 ('RAIL-E976420', 0.31637531900466437)]

Looking at the top 5:


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
2721045,Q11424,'film'@en,P495,'country of origin'@en,Q223832,'dominion of the British Empire'@en,,,P4010,'GDP (PPP)'@en,,5652059000000.0,,,Q550207


Number of positive entities for this label: 23390 (out of 176405 entities of this type)


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
2404370,Q11424,'film'@en,P495,'country of origin'@en,Q223832,'dominion of the British Empire'@en,,,P1198,'unemployment rate'@en,,,5.0,,Q11229


Number of positive entities for this label: 23390 (out of 176405 entities of this type)


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
2427760,Q11424,'film'@en,P495,'country of origin'@en,Q223832,'dominion of the British Empire'@en,,,P2046,'area'@en,,1156670.5,5473562.0,,Q712226


Number of positive entities for this label: 23391 (out of 176405 entities of this type)


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
2814859,Q11424,'film'@en,P495,'country of origin'@en,Q223832,'dominion of the British Empire'@en,,,P8477,'BTI Status Index'@en,,,,,


Number of positive entities for this label: 23390 (out of 176405 entities of this type)


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
2791469,Q11424,'film'@en,P495,'country of origin'@en,Q223832,'dominion of the British Empire'@en,,,P8476,'BTI Governance Index'@en,,,,,


Number of positive entities for this label: 23390 (out of 176405 entities of this type)


## 5. Iteratively choose labels to add to label set using formula
Choose label $l_i = \operatorname{argmax}_{l_i \in L_t^c}{[d(l_i) + \delta * reward(l_i,L_t) - (1- \delta) * penalty(l_i,L_t)]}$

Where...
- $l_i$ is a label
- $L_t^c$ is the set of candidate labels relevant to the type $t$ that we are profiling
- $L_t$ is the set of labels relevant to the type $t$ that we have so far chosen to be in the final label set
- $d(l_i)$ is the distinctiveness of label $l_i$
- $reward(l_i,L_t)$ is a function that captures the potential increase in the total coverage of entities of type $t$ in the KG by the labels in $L_t$ if $l_i$ were to be added to it.
- $penalty(l_i,L_t)$ is a function that captures the potential increase in redundancy of labels in $L_t$ if $l_i$ were to be added to it.
- $\delta$ is a hyperparameter in $[0, 1]$ that trades off increasing total coverage (higher $\delta$) against minimizing redundancy (lower $\delta$)

Functions for computing reward and penalty...

$reward(l_i,L_t) = \cfrac{|\bigcup_{l_j \in (L_t \cup \{l_i\})}{\varepsilon_t^{l_j}}|}{|\varepsilon_t|}$

Where...
- $\varepsilon_t^{l_j}$ is the set of entities of type $t$ that match label $l_j$
- $\varepsilon_t$ is the set of entities of type $t$

In [72]:
# Note: takes the set of entities already covered by L_t (rather than L_t itself),
# so the union over the chosen labels is not recomputed for every candidate
def reward(candidate_label, entities_covered_already):
    return len(entities_covered_already | label_to_entities[candidate_label]) / len(all_ents_in_type)

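From the paper, "reward is the potential contribution of \[the label\] to the increase of the total coverage of positive entities in the KG". The formula above does not quite capture that: it measures the total coverage after adding the label, so it stays high whenever $L_t$ already has good coverage, even if the label adds nothing new. Within a single iteration this does not change which label wins, because $|\varepsilon_t^{L_t} \cup \varepsilon_t^{l_i}| = |\varepsilon_t^{L_t}| + |\varepsilon_t^{l_i} \setminus \varepsilon_t^{L_t}|$ and the first term is the same for every candidate; it does inflate the score that is compared against `min_cutoff` in the selection loop below. If we want a function that matches the paper's description, we could use the marginal coverage $\cfrac{|\varepsilon_t^{l_i}| - |\varepsilon_t^{L_t} \cap \varepsilon_t^{l_i}|}{|\varepsilon_t|}$, where $\varepsilon_t^{L_t}$ is the set of entities of type $t$ that match at least one label in $L_t$.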
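
To make the difference between the two reward variants concrete, here is a small sketch on toy sets. The function names (`coverage_reward`, `marginal_reward`) and the entity ids are hypothetical; unlike the notebook's `reward`, these take entity sets directly instead of looking them up in `label_to_entities`.

```python
def coverage_reward(candidate_ents, covered_ents, all_ents):
    """Reward as implemented above: fraction of entities covered after adding the candidate."""
    return len(covered_ents | candidate_ents) / len(all_ents)

def marginal_reward(candidate_ents, covered_ents, all_ents):
    """Alternative: fraction of entities that the candidate would newly cover."""
    return len(candidate_ents - covered_ents) / len(all_ents)

toy_all = {"e1", "e2", "e3", "e4"}
toy_covered = {"e1", "e2", "e3"}   # entities matched by the labels chosen so far
redundant = {"e1", "e2"}           # adds nothing new
novel = {"e4"}                     # covers the one missing entity

for name, ents in [("redundant", redundant), ("novel", novel)]:
    print(name, coverage_reward(ents, toy_covered, toy_all), marginal_reward(ents, toy_covered, toy_all))
```

The redundant candidate still gets a coverage reward of 0.75 but a marginal reward of 0, while the novel candidate ranks first under both variants.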

$penalty(l_i,L_t) = \cfrac{\sum_{l_j \in L_t}{|\varepsilon_t^{l_i} \cap \varepsilon_t^{l_j}|}}{|L_t| * |\varepsilon_t|}$

In [73]:
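# Note: takes the precomputed numerator (total overlap between the candidate and the
# labels already chosen) rather than the candidate label, so overlaps can be accumulated incrementally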
def penalty(numerator, label_set):
    # when label set is empty, avoid divide by zero
    if len(label_set) == 0:
        return 0 
    return numerator / (len(label_set) * len(all_ents_in_type))
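
As a sanity check on the penalty formula, the sketch below computes it directly from entity sets on toy data. `overlap_penalty` and the entity ids are hypothetical names for illustration; the notebook's `penalty` instead receives the numerator precomputed.

```python
def overlap_penalty(candidate_ents, chosen_label_ents, all_ents):
    """Average overlap between the candidate and each chosen label, as a fraction of all entities."""
    if not chosen_label_ents:
        return 0.0  # empty label set: nothing to be redundant with
    numerator = sum(len(candidate_ents & ents) for ents in chosen_label_ents)
    return numerator / (len(chosen_label_ents) * len(all_ents))

toy_all = {"e1", "e2", "e3", "e4"}
chosen = [{"e1", "e2"}, {"e2", "e3"}]   # entity sets of the labels already in the label set
candidate = {"e2", "e4"}

toy_penalty = overlap_penalty(candidate, chosen, toy_all)
print(toy_penalty)  # overlaps are 1 and 1, so 2 / (2 * 4)
```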

Iteratively choose labels
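
Before running the full loop, the selection procedure can be sketched end to end on toy data. This is a minimal sketch under stated assumptions: `greedy_label_selection`, the label ids, and the distinctiveness values are hypothetical, and ties are broken by list order rather than randomly as in the notebook cell below.

```python
def greedy_label_selection(label_to_ents, distinctiveness, all_ents, delta=0.5, min_cutoff=0.0):
    """Greedily pick labels by distinctiveness + delta*reward - (1-delta)*penalty."""
    candidates = list(label_to_ents)
    chosen, covered = [], set()
    overlap = {l: 0 for l in candidates}  # running penalty numerators

    def score(l):
        reward = len(covered | label_to_ents[l]) / len(all_ents)
        penalty = overlap[l] / (len(chosen) * len(all_ents)) if chosen else 0.0
        return distinctiveness[l] + delta * reward - (1 - delta) * penalty

    while candidates:
        best = max(candidates, key=score)
        if score(best) <= min_cutoff:
            break
        candidates.remove(best)
        chosen.append(best)
        covered.update(label_to_ents[best])
        # accumulate overlap with the newly chosen label for the remaining candidates
        for l in candidates:
            overlap[l] += len(label_to_ents[l] & label_to_ents[best])
    return chosen, covered

toy_all = {"e1", "e2", "e3", "e4"}
toy_labels = {"A": {"e1", "e2"}, "B": {"e1", "e2"}, "C": {"e3", "e4"}}
toy_distinctiveness = {"A": 0.3, "B": 0.29, "C": 0.1}

toy_chosen, toy_covered = greedy_label_selection(toy_labels, toy_distinctiveness, toy_all)
print(toy_chosen)
```

Label B is nearly as distinctive as A but matches the same entities, so after A is chosen the less distinctive but complementary label C is picked ahead of it.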

In [69]:
with open("{}/label_distinctiveness_dict.json".format(label_set_work_dir), "r") as f:
    label_distinctiveness = json.load(f)

In [77]:
%%time
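# For the 54 human (Q5) labels this takes roughly 30 s per label
d = .5 # delta in the formula above; could be made a notebook parameter
min_cutoff = 0 # stop once the best score is at or below this; could be made a parameter (though it isn't mentioned in the paper)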
covered_ents = set() # set of all entities covered by label set
label_set = []
candidate_labels = list(label_to_entities.keys())
penalty_numerator_for_label = {l : 0 for l in candidate_labels}
print("We have {} candidate labels to choose from".format(len(candidate_labels)))

label_reward={}
label_penalty={}
label_score={}

count=0
start=time.perf_counter()
for i in tqdm(range(len(candidate_labels))):
    # Finding the best label
    vals = [label_distinctiveness[l] + d*reward(l, covered_ents) - (1-d)*penalty(penalty_numerator_for_label[l], label_set) for l in candidate_labels]
    max_val = np.max(vals)
    if max_val <= min_cutoff:
        break
    max_ix = np.random.choice(np.flatnonzero(vals == max_val))
    max_label = candidate_labels[max_ix]
    max_label_ents = label_to_entities[max_label]
    
    label_reward[max_label] = reward(max_label, covered_ents)
    label_penalty[max_label] = penalty(penalty_numerator_for_label[max_label], label_set)
    label_score[max_label] = max_val
    
    # Adding the label to final label set
    label_set.append(max_label)
    covered_ents = covered_ents | max_label_ents

    # Remove the label from candidate labels
    candidate_labels.pop(max_ix)
    
    # Update penalty numerator for labels that are still in the candidate list
    for l in candidate_labels:
        penalty_numerator_for_label[l] += len((label_to_entities[l] & max_label_ents))

    count += 1
    mins_elapsed = (time.perf_counter() - start) / 60
    print("Done with {} labels. Time elapsed: {:.2f} mins{}".format(count, mins_elapsed, "             "), end="\r")
    
print("Final label set has {} labels, covering {} / {} entities of type {} in the dataset".format(len(label_set), len(covered_ents), len(all_ents_in_type), type_to_profile))
with open("{}/ordered_final_label_set.json".format(label_set_work_dir), 'w') as f:
    json.dump(label_set, f)
with open("{}/label_reward_dict.json".format(label_set_work_dir), 'w') as f:
    json.dump(label_reward, f)
with open("{}/label_penalty_dict.json".format(label_set_work_dir), 'w') as f:
    json.dump(label_penalty, f)
with open("{}/label_score_dict.json".format(label_set_work_dir), 'w') as f:
    json.dump(label_score, f)

We have 54 candidate labels to choose from



Done with 3 labels. Time elapsed: 1.25 mins             


KeyboardInterrupt: 

### Format final profiles
We'll create two ways to consume these:  
1. a dictionary of {entity : [ordered list of label-ids]}, plus a table of labels (ids plus their contents)
2. a single table of entities and their corresponding labels and label content.
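
The relationship between the two formats can be sketched on toy data as follows. The helper names (`build_profiles`, `profiles_to_rows`) and the label and entity ids are hypothetical, and the sketch assumes every labelled entity belongs to the full entity set.

```python
def build_profiles(ordered_label_set, label_to_ents, all_ents):
    """Map every entity to the ordered list of labels from the label set that it matches."""
    profiles = {ent: [] for ent in all_ents}  # entities with no matching label keep an empty list
    for label in ordered_label_set:
        for ent in label_to_ents[label]:
            profiles[ent].append(label)
    return profiles

def profiles_to_rows(profiles):
    """Flatten the dictionary into (entity, label) rows, one row per match."""
    return [(ent, label) for ent in sorted(profiles) for label in profiles[ent]]

toy_label_set = ["L2", "L1"]                      # order in which labels were selected
toy_label_to_ents = {"L1": {"e1", "e2"}, "L2": {"e2"}}
toy_all = {"e1", "e2", "e3"}

toy_profiles = build_profiles(toy_label_set, toy_label_to_ents, toy_all)
toy_rows = profiles_to_rows(toy_profiles)
print(toy_profiles["e2"], toy_rows)
```

The dictionary keeps the selection order of the labels and includes uncovered entities, whereas the flat table only has rows for entities that match at least one label.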

First we'll add the computed distinctiveness, reward, and penalty info to the labels' content

In [78]:
labels_df["distinctiveness"] = labels_df["node1"].map(label_distinctiveness)
labels_df["reward"] = labels_df["node1"].map(label_reward)
labels_df["penalty"] = labels_df["node1"].map(label_penalty)
labels_df["re-ranking score"] = labels_df["node1"].map(label_score)

In [21]:
%%time
with open("{}/ordered_final_label_set.json".format(label_set_work_dir), 'r') as f:
    label_set = json.load(f)
profiles = {}
covered_ents = set()
for label in label_set:
    for ent in label_to_entities[label]:
        covered_ents.add(ent)
        if ent not in profiles:
            profiles[ent] = []
        profiles[ent].append(label)
for ent in (all_ents_in_type - covered_ents):
    profiles[ent] = []
    
with open("{}/profiles_dict.json".format(label_set_final_dir), 'w') as f:
    json.dump(profiles, f)

info_cols = ["node1", "type", "type_label", "prop", "prop_label", "value", "value_label", "value_lb", "value_ub", "prop2", "prop2_label", "value2", "value2_lb", "value2_ub", "si_units", "wd_units", "distinctiveness", "reward", "penalty", "re-ranking score"]
labels_info = labels_df.loc[:,info_cols].rename(columns={"node1":"label_id"}).groupby("label_id").first().reset_index()
labels_info.to_csv(path_or_buf = "{}/all_labels_info.csv".format(label_set_work_dir), index = False)

label_set_labels_info = labels_info.loc[labels_info.loc[:,"label_id"].isin(label_set), :]
label_set_labels_info.to_csv(path_or_buf = "{}/label_set_labels_info.csv".format(label_set_final_dir), index = False)

CPU times: user 10.1 ms, sys: 2.43 ms, total: 12.5 ms
Wall time: 2.13 s


In [63]:
profiles_info_cols = ["node1","label", "node2", "node2;label", "type", "type_label", "prop", "prop_label", "value", "value_label", "value_lb", "value_ub", "prop2", "prop2_label", "value2", "value2_lb", "value2_ub", "si_units", "wd_units", "distinctiveness", "reward", "penalty", "re-ranking score", "id"]
profiles_df = labels_df.loc[labels_df.loc[:,"node1"].isin(label_set),profiles_info_cols].sort_values(by=['node2'])
profiles_df.to_csv(path_or_buf = "{}/entities_and_their_profiles.tsv".format(label_set_final_dir), index = False, sep='\t')

In [69]:
display(profiles_df)

Unnamed: 0,node1,label,node2,node2;label,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units,id
16606934,RAIL-E4330978,pos_entity,Q10000001,'Tatyana Kolotilshchikova'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,,,P2046,'area'@en,,2.8902e+06,,,Q712226,E16606935
24828611,RAIL-E5027414,pos_entity,Q10000001,'Tatyana Kolotilshchikova'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,,,P2299,'PPP GDP per capita'@en,,,50391.5,,Q550207,E24828612
63480438,RAIL-E4995496,pos_entity,Q10000001,'Tatyana Kolotilshchikova'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q6256,'country'@en,,,P2855,'VAT-rate'@en,,8.35,26,,Q11229,E63480439
40115327,RAIL-E5479287,pos_entity,Q10000001,'Tatyana Kolotilshchikova'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,,,P571,'inception'@en,,1657.5,,,,E40115328
8449486,RAIL-E5274231,pos_entity,Q10000001,'Tatyana Kolotilshchikova'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,,,P1081,'Human Development Index'@en,,0.626,,,,E8449487
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14519208,RAIL-E5259315,pos_entity,Q999999,'Seraina Boner'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,,,P1198,'unemployment rate'@en,,,21,,Q11229,E14519209
17467236,RAIL-E4330978,pos_entity,Q9999999,'Olympiy Rudakov'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,,,P2046,'area'@en,,2.8902e+06,,,Q712226,E17467237
8449485,RAIL-E4301046,pos_entity,Q9999999,'Olympiy Rudakov'@en,Q5,'human'@en,P19,'place of birth'@en,Q1549591,'big city'@en,,,P2044,'elevation above sea level'@en,,-14,730.5,,Q11573,E8449486
42860548,RAIL-E5479287,pos_entity,Q9999999,'Olympiy Rudakov'@en,Q5,'human'@en,P27,'country of citizenship'@en,Q3624078,'sovereign state'@en,,,P571,'inception'@en,,1657.5,,,,E42860549


### Now we can look at profiles for the entities. We'll do this in the next notebook - view_profiles