# Select the final label set that can be used to generate profiles for entities of a desired type
In this notebook, we use the entity embeddings created earlier to further narrow down the labels that we previously created and filtered. The result is a final label set that can be used to generate "profiles" for entities of a desired type: an entity's profile consists of the labels in the final set that the entity matches. Unlike the earlier notebooks, this one asks for a type to profile, so that we don't create label sets for every type in the knowledge graph.

### Pre-requisite steps to run this notebook
1. Run the candidate_label_creation, candidate_filter, and HAS_entity_embeddings notebooks first (these have their own pre-reqs). We will use files that were created by those notebooks in this notebook.

In [1]:
import os
import pandas as pd
import numpy as np
import itertools
import seaborn as sns
import json
import random
import matplotlib.pyplot as plt
from utility import run_command
from utility import rename_cols_and_overwrite_id
from label_discretization import discretize_labels
from gensim.models import KeyedVectors

### Parameters
**REQUIRED**  
**type_to_profile**: The type that we will create a label set for, and therefore be able to create entity profiles for. This should be a string denoting a Q-node. e.g. if we want to create profiles for beers, we would set type_to_profile="Q44" (Q44=beer).  
**work_dir**: path to the work dir that was specified in the candidate_label_creation, candidate_filter, and HAS_entity_embeddings notebooks. We will use files that were saved there by those notebooks, and also save the files created by this notebook there.  
**item_file**: file path for the file that contains entity to entity relationships (e.g. wikibase-item)  
**time_file**: file path for the file that contains entity to time-type values  
**quantity_file**: file path for the file that contains entity to quantity-type values (remember to specify your trimmed quantity file if you ran the optional pre-processing trim_quantity_file notebook)  
**label_file**: file path for the file that contains wikidata labels  
**store_dir**: path to folder containing the sqlite3.db file that we will use for our queries. We will reuse an existing file if there is one in this folder. Otherwise we will create a new one.

**OPTIONAL**  
**In-progress... this is currently not used.**  
*string_file*: file path for the file that contains entity to string-type values  

In [6]:
# data_dir = "../../Q44/data" # my data files are all in the same directory, so I'll reuse this path prefix

# # **REQUIRED**
# type_to_profile = "Q44"
# work_dir = "../../Q44/profiler_work"
# item_file = "{}/Q44.part.wikibase-item.tsv".format(data_dir)
# time_file = "{}/Q44.part.time.tsv".format(data_dir)
# quantity_file = "{}/Q44.part.quantity.tsv".format(data_dir)
# label_file = "{}/Q44.label.en.tsv".format(data_dir)
# store_dir = "../../Q44"

# # **optional**
# string_file = None #"{}/Q44.part.string.tsv".format(data_dir)

data_dir = "../../wikidata_films/data" # my data files are all in the same directory, so I'll reuse this path prefix

# **REQUIRED**
type_to_profile = "Q11424" # Film
work_dir = "../../wikidata_films/profiler_work"
item_file = "{}/claims.wikibase-item.tsv.gz".format(data_dir)
time_file = "{}/claims.time.tsv.gz".format(data_dir)
quantity_file = "{}/claims.quantity_trimmed.tsv.gz".format(data_dir)
label_file = "{}/labels.en.tsv.gz".format(data_dir)
store_dir = "../../wikidata_films"

# **optional**
string_file = None #"{}/claims.string.tsv.gz".format(data_dir)
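Since every query below depends on these files, it can help to check the paths before running anything expensive. The following is a minimal sketch, not part of the original pipeline; `find_missing_inputs` is a hypothetical helper name, and it only checks that each path exists.

```python
import os

def find_missing_inputs(paths):
    """Return the names of parameters whose path is set but does not exist.

    `paths` maps a parameter name to a file path; None values (e.g. the
    optional string_file) are skipped.
    """
    return [name for name, path in paths.items()
            if path and not os.path.exists(path)]

# Hypothetical usage with the parameters defined above:
# missing = find_missing_inputs({"item_file": item_file, "time_file": time_file,
#                                "quantity_file": quantity_file, "label_file": label_file,
#                                "string_file": string_file})
# if missing:
#     print("Missing input files for: {}".format(", ".join(missing)))
```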

### Process parameters and set up variables / file names

In [7]:
# Ensure paths are absolute
item_file = os.path.abspath(item_file)
time_file = os.path.abspath(time_file)
quantity_file = os.path.abspath(quantity_file)
label_file = os.path.abspath(label_file)
work_dir = os.path.abspath(work_dir)
store_dir = os.path.abspath(store_dir)
if string_file:
    string_file = os.path.abspath(string_file)
    
# Create directories (exist_ok=True makes this a no-op if they already exist)
output_dir = "{}/final_label_sets/{}".format(work_dir, type_to_profile)
label_set_work_dir = "{}/work".format(output_dir)
label_set_final_dir = "{}/final".format(output_dir)
for directory in (output_dir, label_set_work_dir, label_set_final_dir):
    os.makedirs(directory, exist_ok=True)
    
# adding some environment variables we'll be using frequently
os.environ['ITEM_FILE'] = item_file
os.environ['TIME_FILE'] = time_file
os.environ['QUANTITY_FILE'] = quantity_file
os.environ['LABEL_FILE'] = label_file
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(store_dir)
os.environ['FILTERED_LABELS'] = "{}/candidate_filter".format(work_dir)
os.environ['LABEL_CREATION_DIR'] = "{}/label_creation".format(work_dir)
os.environ['TYPE'] = type_to_profile
os.environ['WORK'] = label_set_work_dir
os.environ['OUT'] = label_set_final_dir
os.environ['kgtk'] = "kgtk" # Need to do this for kgtk to be recognized as a command when passing it through a subprocess call

## 1. Get labels that abstract entities of the type we want to profile and find the entities that match each of those labels

**Note: this step could be a separate notebook, or the code could be moved out of the notebook.**

**Alternatively, we could do this in earlier steps along the way, e.g. in label creation. For scalability, that would probably require moving the type_to_profile decision there as well, so that we don't enumerate all labels and their entities (this would blow up when we get to RALs).**

**Since we didn't keep quantities and times separate for RAVLs and RAILs, the final label set can't tell from the label alone whether a value is a time or a quantity. This isn't much of a problem, because the property disambiguates time versus quantity, but it means that we'll have to look in a concatenated quantities/times file when creating a profile for an entity.**

We will later choose from these labels to form the final label set, based on several formulas that take into account the entities that match each label.

Also, we need each label to have a unique identifier (again this is something that we could have done along the way in earlier notebooks - would simplify some queries). We'll add a column for this here as well.

### 1.1 RELs

In [13]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_rel_item_filtered.tsv \
-i $ITEM_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv -i $LABEL_FILE \
-o $WORK/RELs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type:`'"$TYPE"'`)-[l1 {label:p, prop_label:p_lab, id:label_id}]->(val), `'"$ITEM_FILE"'`: (e)-[l2 {label:p}]->(val), type: (e)-[]->(type), `'"$LABEL_FILE"'`: (type)-[:label]->(type_lab), `'"$LABEL_FILE"'`: (val)-[:label]->(val_lab)' \
--return 'distinct printf("REL-%s", label_id) as node1, "pos_entity" as label, e as node2, type as type, type_lab as type_label, p as prop, p_lab as prop_label, val as value, val_lab as value_label, "_" as id' \
--where 'type_lab.kgtk_lqstring_lang_suffix = "en" AND val_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'type, p, value, e'

In [14]:
!kgtk add-id -i $WORK/RELs_and_their_entities.tsv -o $WORK/RELs_and_their_entities_temp.tsv --overwrite-id \
&& mv $WORK/RELs_and_their_entities_temp.tsv $WORK/RELs_and_their_entities.tsv


In [15]:
labels_df = pd.read_csv("{}/RELs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t')
print("number of distinct labels: {}".format(len(labels_df.node1.unique())))
print("number of rows (labels can have multiple matching entities): {}".format(labels_df.shape[0]))
if not labels_df.empty:
    display(labels_df.loc[:10].fillna(""))

number of distinct labels: 7
number of rows (labels can have multiple matching entities): 316974


Unnamed: 0,node1,label,node2,type,type_label,prop,prop_label,value,value_label,id
0,REL-E1297987,pos_entity,Q1000394,Q11424,'film'@en,P136,'genre'@en,Q130232,'drama'@en,E1
1,REL-E1297987,pos_entity,Q10005695,Q11424,'film'@en,P136,'genre'@en,Q130232,'drama'@en,E2
2,REL-E1297987,pos_entity,Q100136948,Q11424,'film'@en,P136,'genre'@en,Q130232,'drama'@en,E3
3,REL-E1297987,pos_entity,Q100136995,Q11424,'film'@en,P136,'genre'@en,Q130232,'drama'@en,E4
4,REL-E1297987,pos_entity,Q100137285,Q11424,'film'@en,P136,'genre'@en,Q130232,'drama'@en,E5
5,REL-E1297987,pos_entity,Q100139868,Q11424,'film'@en,P136,'genre'@en,Q130232,'drama'@en,E6
6,REL-E1297987,pos_entity,Q100152089,Q11424,'film'@en,P136,'genre'@en,Q130232,'drama'@en,E7
7,REL-E1297987,pos_entity,Q100152173,Q11424,'film'@en,P136,'genre'@en,Q130232,'drama'@en,E8
8,REL-E1297987,pos_entity,Q100152200,Q11424,'film'@en,P136,'genre'@en,Q130232,'drama'@en,E9
9,REL-E1297987,pos_entity,Q100152226,Q11424,'film'@en,P136,'genre'@en,Q130232,'drama'@en,E10
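The same three-line summary is repeated for every label type below. As a rough sketch of how it could be factored out (the helper name `summarize_label_file` is hypothetical, and it assumes the file has a header with a `node1` column, as the query outputs here do):

```python
import io
import pandas as pd

def summarize_label_file(path_or_buffer, n_preview=10):
    """Return (number of distinct labels, number of rows, preview DataFrame)."""
    df = pd.read_csv(path_or_buffer, delimiter="\t")
    n_labels = df["node1"].nunique()
    return n_labels, df.shape[0], df.head(n_preview).fillna("")

# Tiny in-memory example standing in for a *_and_their_entities.tsv file
sample = (
    "node1\tlabel\tnode2\tid\n"
    "REL-E1\tpos_entity\tQ1\tE1\n"
    "REL-E1\tpos_entity\tQ2\tE2\n"
    "REL-E2\tpos_entity\tQ3\tE3\n"
)
n_labels, n_rows, preview = summarize_label_file(io.StringIO(sample))
print("number of distinct labels: {}".format(n_labels))
print("number of rows (labels can have multiple matching entities): {}".format(n_rows))
```

Note that `head(n_preview)` returns exactly `n_preview` rows, whereas `loc[:10]` is inclusive of index 10.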


### 1.2 AVLs - quantities

In [16]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_avl_quantity_filtered.tsv \
-i $QUANTITY_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv -i $LABEL_FILE \
-o $WORK/quantity_AVLs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type:`'"$TYPE"'`)-[l1 {label:p, prop_label:p_lab, id:label_id, si_units:label_si, wd_units:label_wd}]->(label_quantity_num), `'"$QUANTITY_FILE"'`: (e)-[l2 {label:p}]->(val), type: (e)-[]->(type), `'"$LABEL_FILE"'`: (type)-[:label]->(type_lab)' \
--return 'distinct printf("AVL-quantity-%s", label_id) as node1, "pos_entity" as label, e as node2, type as type, type_lab as type_label, p as prop, p_lab as prop_label, label_quantity_num as value, label_si as si_units, label_wd as wd_units, "_" as id' \
--where 'type_lab.kgtk_lqstring_lang_suffix = "en" AND kgtk_quantity_number(val)=label_quantity_num AND ( (kgtk_quantity_si_units(val) is null AND label_si="") OR (kgtk_quantity_si_units(val)=label_si) ) AND ( (kgtk_quantity_wd_units(val) is null AND label_wd="") OR (kgtk_quantity_wd_units(val)=label_wd) )' \
--order-by 'type, prop, value, si_units, wd_units, e'


In [17]:
!kgtk add-id -i $WORK/quantity_AVLs_and_their_entities.tsv -o $WORK/quantity_AVLs_and_their_entities.tsv --overwrite-id

In [18]:
labels_df = pd.read_csv("{}/quantity_AVLs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t')
print("number of distinct labels: {}".format(len(labels_df.node1.unique())))
print("number of rows (labels can have multiple matching entities): {}".format(labels_df.shape[0]))
if not labels_df.empty:
    display(labels_df.loc[:10].fillna(""))


number of distinct labels: 0
number of rows (labels can have multiple matching entities): 0


### 1.3 AVLs - times

In [19]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_avl_time.year_filtered.tsv \
-i $TIME_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv -i $LABEL_FILE \
-o $WORK/time.year_AVLs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type:`'"$TYPE"'`)-[l1 {label:p, prop_label:p_lab, id:label_id}]->(label_year), `'"$TIME_FILE"'`: (e)-[l2 {label:p}]->(val), type: (e)-[]->(type), `'"$LABEL_FILE"'`: (type)-[:label]->(type_lab)' \
--return 'distinct printf("AVL-time.year-%s", label_id) as node1, "pos_entity" as label, e as node2, type as type, type_lab as type_label, p as prop, p_lab as prop_label, label_year as value, "_" as id' \
--where 'type_lab.kgtk_lqstring_lang_suffix = "en" AND kgtk_date_year(val)=label_year' \
--order-by 'type, prop, value, e'

In [20]:
!kgtk add-id -i $WORK/time.year_AVLs_and_their_entities.tsv -o $WORK/time.year_AVLs_and_their_entities.tsv --overwrite-id

In [21]:
labels_df = pd.read_csv("{}/time.year_AVLs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t')
print("number of distinct labels: {}".format(len(labels_df.node1.unique())))
print("number of rows (labels can have multiple matching entities): {}".format(labels_df.shape[0]))
if not labels_df.empty:
    display(labels_df.loc[:10].fillna(""))

number of distinct labels: 0
number of rows (labels can have multiple matching entities): 0


### 1.4 AILs - quantities

In [22]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_ail_quantity_filtered.tsv \
-i $QUANTITY_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv -i $LABEL_FILE \
-o $WORK/quantity_AILs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type:`'"$TYPE"'`)-[l1 {label:p, prop_label:p_lab, id:label_id, upper_bound:ub, si_units:label_si, wd_units:label_wd}]->(lb), `'"$QUANTITY_FILE"'`: (e)-[l2 {label:p}]->(val), type: (e)-[]->(type), `'"$LABEL_FILE"'`: (type)-[:label]->(type_lab)' \
--return 'distinct printf("AIL-quantity-%s", label_id) as node1, "pos_entity" as label, e as node2, type as type, type_lab as type_label, p as prop, p_lab as prop_label, lb as value_lb, ub as value_ub, label_si as si_units, label_wd as wd_units, "_" as id' \
--where 'type_lab.kgtk_lqstring_lang_suffix = "en" AND (not lb OR kgtk_quantity_number(val) >= kgtk_quantity_number(lb)) AND (not ub OR kgtk_quantity_number(ub) >= kgtk_quantity_number(val)) AND ( (kgtk_quantity_si_units(val) is null AND label_si="") OR (kgtk_quantity_si_units(val)=label_si) ) AND ( (kgtk_quantity_wd_units(val) is null AND label_wd="") OR (kgtk_quantity_wd_units(val)=label_wd) )' \
--order-by 'type, prop, value_lb, si_units, wd_units, e'


In [23]:
!kgtk add-id -i $WORK/quantity_AILs_and_their_entities.tsv -o $WORK/quantity_AILs_and_their_entities.tsv --overwrite-id

In [24]:
labels_df = pd.read_csv("{}/quantity_AILs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t')
print("number of distinct labels: {}".format(len(labels_df.node1.unique())))
print("number of rows (labels can have multiple matching entities): {}".format(labels_df.shape[0]))
if not labels_df.empty:
    display(labels_df.loc[:10].fillna(""))

number of distinct labels: 0
number of rows (labels can have multiple matching entities): 0


### 1.5 AILs - times

In [25]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_ail_time.year_filtered.tsv \
-i $TIME_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv -i $LABEL_FILE \
-o $WORK/time.year_AILs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type:`'"$TYPE"'`)-[l1 {label:p, prop_label:p_lab, id:label_id, upper_bound:ub}]->(lb), `'"$TIME_FILE"'`: (e)-[l2 {label:p}]->(val), type: (e)-[]->(type), `'"$LABEL_FILE"'`: (type)-[:label]->(type_lab)' \
--return 'distinct printf("AIL-time.year-%s", label_id) as node1, "pos_entity" as label, e as node2, type as type, type_lab as type_label, p as prop, p_lab as prop_label, lb as value_lb, ub as value_ub, "_" as id' \
--where 'type_lab.kgtk_lqstring_lang_suffix = "en" AND (not lb OR kgtk_date_year(val) >= kgtk_quantity_number(lb)) AND (not ub OR kgtk_quantity_number(ub) >= kgtk_date_year(val))' \
--order-by 'type, prop, value_lb, e'


In [26]:
!kgtk add-id -i $WORK/time.year_AILs_and_their_entities.tsv -o $WORK/time.year_AILs_and_their_entities.tsv --overwrite-id

In [27]:
labels_df = pd.read_csv("{}/time.year_AILs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t')
print("number of distinct labels: {}".format(len(labels_df.node1.unique())))
print("number of rows (labels can have multiple matching entities): {}".format(labels_df.shape[0]))
if not labels_df.empty:
    display(labels_df.loc[:10].fillna(""))

number of distinct labels: 0
number of rows (labels can have multiple matching entities): 0


### 1.6 RAVLs - quantities and times
Both quantities and times are handled here because we mixed them together in the label creation step.

In [28]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_ravl_filtered.tsv \
-i $LABEL_CREATION_DIR/entity_AVLs_all.tsv -i $ITEM_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv -i $LABEL_FILE \
-o $WORK/RAVLs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type1:`'"$TYPE"'`)-[l1 {label:p1, id:label_id, prop2:p2, prop2_label:p2_lab, val:val, si_units:si, wd_units:wd}]->(type2), `'"$ITEM_FILE"'`: (e1)-[l2 {label:p1}]->(e2), AVLs_all: (type2)-[l3 {label:p2, entity:e2, si_units:si, wd_units:wd}]->(val), type: (e1)-[]->(type1), `'"$LABEL_FILE"'`: (type1)-[:label]->(type1_lab), `'"$LABEL_FILE"'`: (p1)-[:label]->(p1_lab), `'"$LABEL_FILE"'`: (type2)-[:label]->(type2_lab)' \
--return 'distinct printf("RAVL-%s", label_id) as node1, "pos_entity" as label, e1 as node2, type1 as type, type1_lab as type_label, p1 as prop, p1_lab as prop_label, type2 as value, type2_lab as value_label, p2 as prop2, p2_lab as prop2_label, val as value2, si as si_units, wd as wd_units, "_" as id' \
--where 'type1_lab.kgtk_lqstring_lang_suffix = "en" AND p1_lab.kgtk_lqstring_lang_suffix = "en" AND type2_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'type1, p1, type2, p2, value2, si, wd, e1'

Note: below we add ids via a temporary file to avoid what seems to be a bug (a GitHub issue has been filed for this).

In [29]:
!kgtk add-id -i $WORK/RAVLs_and_their_entities.tsv -o $WORK/RAVLs_and_their_entities_temp.tsv --overwrite-id \
&& mv $WORK/RAVLs_and_their_entities_temp.tsv $WORK/RAVLs_and_their_entities.tsv

In [30]:
labels_df = pd.read_csv("{}/RAVLs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t')
print("number of distinct labels: {}".format(len(labels_df.node1.unique())))
print("number of rows (labels can have multiple matching entities): {}".format(labels_df.shape[0]))
if not labels_df.empty:
    display(labels_df.loc[:10].fillna(""))

number of distinct labels: 0
number of rows (labels can have multiple matching entities): 0


### 1.7 RAILs - quantities and times
Both quantities and times are handled here because we mixed them together in the label creation step.

In [31]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_rail_filtered.tsv \
-i $LABEL_CREATION_DIR/entity_AILs_all.tsv -i $ITEM_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv -i $LABEL_FILE \
-o $WORK/RAILs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type1:`'"$TYPE"'`)-[l1 {label:p1, id:label_id, prop2:p2, prop2_label:p2_lab, lower_bound:lb, upper_bound:ub, si_units:si, wd_units:wd}]->(type2), `'"$ITEM_FILE"'`: (e1)-[l2 {label:p1}]->(e2), AILs_all: (type2)-[l3 {label:p2, entity:e2, lower_bound:lb, upper_bound:ub, si_units:si, wd_units:wd}]->(), type: (e1)-[]->(type1), `'"$LABEL_FILE"'`: (type1)-[:label]->(type1_lab), `'"$LABEL_FILE"'`: (p1)-[:label]->(p1_lab), `'"$LABEL_FILE"'`: (type2)-[:label]->(type2_lab)' \
--return 'distinct printf("RAIL-%s", label_id) as node1, "pos_entity" as label, e1 as node2, type1 as type, type1_lab as type_label, p1 as prop, p1_lab as prop_label, type2 as value, type2_lab as value_label, p2 as prop2, p2_lab as prop2_label, lb as value2_lb, ub as value2_ub, si as si_units, wd as wd_units, "_" as id' \
--where 'type1_lab.kgtk_lqstring_lang_suffix = "en" AND p1_lab.kgtk_lqstring_lang_suffix = "en" AND type2_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'type1, p1, type2, p2, lb, ub, si, wd, e1'

In [32]:
!kgtk add-id -i $WORK/RAILs_and_their_entities.tsv -o $WORK/RAILs_and_their_entities_temp.tsv --overwrite-id \
&& mv $WORK/RAILs_and_their_entities_temp.tsv $WORK/RAILs_and_their_entities.tsv

In [33]:
labels_df = pd.read_csv("{}/RAILs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t')
print("number of distinct labels: {}".format(len(labels_df.node1.unique())))
print("number of rows (labels can have multiple matching entities): {}".format(labels_df.shape[0]))
if not labels_df.empty:
    display(labels_df.loc[:10].fillna(""))


number of distinct labels: 266
number of rows (labels can have multiple matching entities): 12557347


Unnamed: 0,node1,label,node2,type,type_label,prop,prop_label,value,value_label,prop2,prop2_label,value2_lb,value2_ub,si_units,wd_units,id
0,RAIL-E893235,pos_entity,Q1000094,Q11424,'film'@en,P495,'country of origin'@en,Q112099,'island nation'@en,P1081,'Human Development Index'@en,,,,,E1
1,RAIL-E893235,pos_entity,Q100156151,Q11424,'film'@en,P495,'country of origin'@en,Q112099,'island nation'@en,P1081,'Human Development Index'@en,,,,,E2
2,RAIL-E893235,pos_entity,Q100249658,Q11424,'film'@en,P495,'country of origin'@en,Q112099,'island nation'@en,P1081,'Human Development Index'@en,,,,,E3
3,RAIL-E893235,pos_entity,Q100255572,Q11424,'film'@en,P495,'country of origin'@en,Q112099,'island nation'@en,P1081,'Human Development Index'@en,,,,,E4
4,RAIL-E893235,pos_entity,Q100292318,Q11424,'film'@en,P495,'country of origin'@en,Q112099,'island nation'@en,P1081,'Human Development Index'@en,,,,,E5
5,RAIL-E893235,pos_entity,Q100320850,Q11424,'film'@en,P495,'country of origin'@en,Q112099,'island nation'@en,P1081,'Human Development Index'@en,,,,,E6
6,RAIL-E893235,pos_entity,Q1004318,Q11424,'film'@en,P495,'country of origin'@en,Q112099,'island nation'@en,P1081,'Human Development Index'@en,,,,,E7
7,RAIL-E893235,pos_entity,Q1004567,Q11424,'film'@en,P495,'country of origin'@en,Q112099,'island nation'@en,P1081,'Human Development Index'@en,,,,,E8
8,RAIL-E893235,pos_entity,Q1004710,Q11424,'film'@en,P495,'country of origin'@en,Q112099,'island nation'@en,P1081,'Human Development Index'@en,,,,,E9
9,RAIL-E893235,pos_entity,Q100490548,Q11424,'film'@en,P495,'country of origin'@en,Q112099,'island nation'@en,P1081,'Human Development Index'@en,,,,,E10


### 1.8 Combine into a single labels file

In [38]:
!kgtk cat -i $WORK/RELs_and_their_entities.tsv \
-i $WORK/quantity_AVLs_and_their_entities.tsv \
-i $WORK/time.year_AVLs_and_their_entities.tsv \
-i $WORK/quantity_AILs_and_their_entities.tsv \
-i $WORK/time.year_AILs_and_their_entities.tsv \
-i $WORK/RAVLs_and_their_entities.tsv \
-i $WORK/RAILs_and_their_entities.tsv \
-o $OUT/labels_and_their_entities.tsv \
&& kgtk add-id -i $OUT/labels_and_their_entities.tsv --overwrite-id -o $OUT/labels_and_their_entities_temp.tsv \
&& mv $OUT/labels_and_their_entities_temp.tsv $OUT/labels_and_their_entities.tsv

## 2. Get all entities of the type we are profiling

In [34]:
!kgtk query -i $LABEL_CREATION_DIR/type_mapping.tsv -i $LABEL_FILE \
-o $WORK/entities_in_type.tsv --graph-cache $STORE \
--match 'type: (entity)-[]->(type:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (entity)-[:label]->(lab)' \
--return 'distinct entity as entity, lab as label' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'entity'

In [35]:
entities_in_type_df = pd.read_csv("{}/entities_in_type.tsv".format(os.environ["WORK"]), delimiter = '\t')
print("number of entities: {}".format(entities_in_type_df.shape[0]))
display(entities_in_type_df.loc[:10].fillna(""))

number of entities: 176405


Unnamed: 0,entity,label
0,Q1000094,'You\\\\'re Dead'@en
1,Q1000174,'Tinko'@en
2,Q1000394,'This Modern Age'@en
3,Q10005695,'Happy Journey'@en
4,Q100057542,'The Two Sights'@en
5,Q10007277,"'Pacho, hybský zbojník'@en"
6,Q1000825,'Jan Dara'@en
7,Q1000826,'Guns of the Magnificent Seven'@en
8,Q100097551,'American Murder: The Family Next Door'@en
9,Q100104941,'Die Filmstadt Hollywood'@en


## 3. Load information needed into Python variables.
1. Load dictionary of all labels found in step 1 above along with their corresponding positive entities (note, performance of this step might be improved if we do it along the way in step 1)
2. Load set of all entities of the type we are profiling (compiled in step 2)
3. Load the keyed vector of entity embeddings (created in HAS_entity_embeddings notebook)

In [42]:
# load labels from step 1
labels_df = pd.read_csv("{}/labels_and_their_entities.tsv".format(os.environ["OUT"]), delimiter = '\t').fillna("")

# create dictionary of label_id --> set of matching entities
# using groupby is not the most efficient, but more concise
label_to_entities = labels_df.groupby("node1")["node2"].apply(set).to_dict()
print("total number of candidate labels for profiling entities of type {}: {}".format(type_to_profile, len(label_to_entities)))

# look up number of entities of the type we are profiling
all_ents_in_type = set(pd.read_csv("{}/entities_in_type.tsv".format(os.environ['WORK']), delimiter = '\t').entity)
print("Loaded set of all entities of the type we are profiling ({} total)".format(len(all_ents_in_type)))

# load embeddings created in the embeddings notebook
entity_embeddings = KeyedVectors.load("{}/HAS_embeddings/entity_embeddings.kv".format(work_dir))
print("Loaded entity embeddings")


total number of candidate labels for profiling entities of type Q11424: 273
Loaded set of all entities of the type we are profiling (176405 total)
Loaded entity embeddings
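The label-to-entities mapping only needs the `node1` and `node2` columns. A possible lower-memory variant is sketched below, under the assumption that the combined file is large; `load_label_to_entities` is a hypothetical helper, and the full table is still needed later for displaying label details.

```python
import io
import pandas as pd

def load_label_to_entities(path_or_buffer):
    """Map each label id (node1) to the set of entities (node2) matching it."""
    df = pd.read_csv(path_or_buffer, delimiter="\t",
                     usecols=["node1", "node2"], dtype=str)
    return df.groupby("node1")["node2"].apply(set).to_dict()

# Tiny in-memory example standing in for labels_and_their_entities.tsv
sample = (
    "node1\tlabel\tnode2\ttype\tid\n"
    "REL-E1\tpos_entity\tQ1\tQ11424\tE1\n"
    "REL-E1\tpos_entity\tQ2\tQ11424\tE2\n"
    "RAIL-E2\tpos_entity\tQ3\tQ11424\tE3\n"
)
mapping = load_label_to_entities(io.StringIO(sample))
print(mapping)
```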


## 4. Compute distinctiveness for each label
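Distinctiveness of a label = (average similarity among its positive entities) - (average similarity between its positive and negative entities), where the positive entities are the entities of the type that match the label and the negative entities are the entities of the type that do not.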
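To make the definition concrete, here is a small, exhaustive (unsampled) sketch on toy 2-d vectors. It uses plain NumPy rather than the gensim embeddings, the function names are illustrative only, and it assumes at least two positive and one negative vector.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-d vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def distinctiveness(pos, neg):
    """Average pos-pos similarity (pairs i < j) minus average pos-neg similarity."""
    internal = [cosine(pos[i], pos[j])
                for i in range(len(pos)) for j in range(i + 1, len(pos))]
    external = [cosine(p, n) for p in pos for n in neg]
    return float(np.mean(internal) - np.mean(external))

# Positives point the same way; the negative is orthogonal to them
pos = np.array([[1.0, 0.0], [2.0, 0.0]])
neg = np.array([[0.0, 1.0]])
d = distinctiveness(pos, neg)  # internal = 1, external = 0
print(d)
```

A label whose positive entities look no different from the negatives would score close to 0.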

In [83]:
import random

Trying out ways to make this code more scalable...  
Here is a single label and its positive entities:

In [236]:
label, pos_ents = list(label_to_entities.items())[1]
neg_ents = all_ents_in_type - pos_ents

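Computing average internal similarity. The code here is relatively efficient: the list comprehension itself runs sequentially, but the NumPy operations inside it appear to run multi-threaded (note that the CPU time far exceeds the wall time in the output below).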

In [162]:
%%time
pos_ent_embeds = entity_embeddings[pos_ents]
avg_internal_similarity = np.sum([np.sum(entity_embeddings.cosine_similarities(pos_ent_embeds[i], pos_ent_embeds[i+1:])) for i in range(len(pos_ents))]) / len(pos_ents)**2
avg_internal_similarity

CPU times: user 4min 55s, sys: 18min 19s, total: 23min 14s
Wall time: 29.4 s


0.2640026161164113

Note that in the above calculation, the denominator is $n^2$, where $n$ is the number of positive entities (len(pos_ents)). The sum only covers pairs with $i < j$, and there are $n(n-1)/2$ of those, so the denominator is about twice what it should be for an accurate average. This doesn't really matter as long as we are consistent in calculating this internal similarity.
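As a quick check of the pair count: the number of unordered pairs among $n$ items is $n(n-1)/2$, so dividing by $n^2$ instead understates the average by a factor of $2n/(n-1)$, which approaches 2 for large $n$. The value of `n` below is arbitrary.

```python
from itertools import combinations

def pair_count(n):
    """Count unordered pairs (i, j) with i < j among n items by enumeration."""
    return sum(1 for _ in combinations(range(n), 2))

n = 1000
exact = pair_count(n)   # equals n * (n - 1) / 2
ratio = n ** 2 / exact  # factor by which an n**2 denominator is too large
print(exact, ratio)
```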

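Now we will compute the average internal similarity again using almost the same method, except that this time we use a sliding window to limit the computation: each entity is only compared with the next window_size entities. The denominator is now the actual number of pairs compared, so we would expect a result roughly twice the value above if the sampled pairs are representative. Note that the outputs shown here come from cells that were executed out of order (see the execution counts), so they may not be directly comparable.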

In [239]:
%%time
window_size = min(len(pos_ents) - 1, 100) # the window can't be larger than the number of other entities
pos_ent_embeds = entity_embeddings[pos_ents]
avg_internal_similarity = np.sum([np.sum(entity_embeddings.cosine_similarities(pos_ent_embeds[i], pos_ent_embeds[i+1:i+1+window_size])) for i in range(len(pos_ents))]) / ((len(pos_ents)*window_size) - ((window_size+1)*window_size/2))
avg_internal_similarity

CPU times: user 1.15 s, sys: 0 ns, total: 1.15 s
Wall time: 1.15 s


0.3068606054337333
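The denominator used in the sliding-window version, `n*w - (w+1)*w/2`, is the number of pairs actually compared when each of the `n` entities is paired with up to `w` of the entities after it. The sketch below checks this by brute force, assuming `w <= n - 1`; the function names are illustrative only.

```python
def windowed_pair_count(n, w):
    """Count pairs (i, j) with i < j <= i + w among n items by enumeration."""
    return sum(min(w, n - 1 - i) for i in range(n))

def closed_form(n, w):
    """Closed-form count used as the denominator (valid for 1 <= w <= n - 1)."""
    return n * w - (w + 1) * w // 2

# The two agree for every valid window size on small inputs
all_match = all(windowed_pair_count(n, w) == closed_form(n, w)
                for n in range(2, 40) for w in range(1, n))
print(all_match)
```

With `w = n - 1` the closed form reduces to `n*(n-1)/2`, i.e. every pair is compared.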

Average external similarity, efficient code, but no sampling: (Not running this, takes too long)

In [None]:
%%time
neg_ent_embeds = entity_embeddings[neg_ents]
avg_external_similarity = np.sum([np.sum(entity_embeddings.cosine_similarities(pos_ent_embeds[i], neg_ent_embeds)) for i in range(len(pos_ents))]) / (len(pos_ents)*len(neg_ents))
avg_external_similarity

And now the same code as above, except with sampling: each positive entity is compared against a random sample of negative entities. There is no discrepancy between these numbers as there was for internal similarity, because the denominator in the above calculation was already the correct number of pairs.

In [250]:
%%time
num_samples = 50
neg_ent_embeds = list(entity_embeddings[neg_ents])
avg_external_similarity = np.sum([np.sum(entity_embeddings.cosine_similarities(pos_ent_embeds[i], random.sample(neg_ent_embeds,min(num_samples,len(neg_ents))))) for i in range(len(pos_ents))]) / (len(pos_ents)*min(num_samples,len(neg_ents)))
avg_external_similarity


CPU times: user 4.6 s, sys: 96.8 ms, total: 4.7 s
Wall time: 4.7 s


0.2300795204726573
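The sampled external similarity can also be sketched with plain NumPy arrays. This is an illustrative variant rather than the code used in this notebook: it normalizes the vectors once, samples row indices with `numpy.random.default_rng` (so a seed makes runs repeatable), and the function name and `seed` parameter are assumptions.

```python
import numpy as np

def sampled_external_similarity(pos, neg, num_samples, seed=0):
    """Average cosine similarity between each positive vector and a random
    sample (without replacement) of at most num_samples negative vectors."""
    rng = np.random.default_rng(seed)
    k = min(num_samples, len(neg))
    # Normalize rows once so that dot products are cosine similarities
    pos_n = pos / np.linalg.norm(pos, axis=1, keepdims=True)
    neg_n = neg / np.linalg.norm(neg, axis=1, keepdims=True)
    total = 0.0
    for p in pos_n:
        idx = rng.choice(len(neg_n), size=k, replace=False)
        total += float(np.sum(neg_n[idx] @ p))
    return total / (len(pos_n) * k)

pos = np.array([[1.0, 0.0], [2.0, 0.0]])
neg = np.array([[0.0, 3.0], [0.0, 1.0], [0.0, 5.0]])
print(sampled_external_similarity(pos, neg, num_samples=2))  # orthogonal vectors
```

When `num_samples` is at least the number of negatives, every negative is used and the result is the exact average.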

In [251]:
%%time
# Sampling params:
internal_sim_window_size = 100 # Internal is faster, so can set this higher
external_sim_sample_size = 50

print("Calculating distinctiveness for {} labels".format(len(label_to_entities)))
count = 0

label_distinctiveness = {}
# todo - use comprehension instead of loop
for label, pos_ents in label_to_entities.items():
    # embeddings of the positive entities (needed for both internal and external similarity)
    pos_ent_embeds = entity_embeddings[pos_ents]

    # average internal similarity
    if len(pos_ents) < 2:
        avg_internal_similarity = 1 # only relevant if we have labels that apply to only a single entity
    else:
        window_size = min(internal_sim_window_size, len(pos_ents) -1)
        avg_internal_similarity = np.sum([np.sum(entity_embeddings.cosine_similarities(pos_ent_embeds[i], pos_ent_embeds[i+1:i+1+window_size])) for i in range(len(pos_ents))]) / ((len(pos_ents)*window_size) - ((window_size+1)*window_size/2))

    # average external similarity
    neg_ents = all_ents_in_type - pos_ents
    
    num_samples = min(external_sim_sample_size, len(neg_ents))
    neg_ent_embeds = list(entity_embeddings[neg_ents])
    avg_external_similarity = np.sum([np.sum(entity_embeddings.cosine_similarities(pos_ent_embeds[i], random.sample(neg_ent_embeds,num_samples))) for i in range(len(pos_ents))]) / (len(pos_ents)*num_samples)

    label_distinctiveness[label] = avg_internal_similarity - avg_external_similarity
    
    count += 1
    if count % 10 == 0:
        print("Done with {} labels".format(count))

Calculating distinctiveness for 273 labels
Done with 10 labels
Done with 20 labels
Done with 30 labels
Done with 40 labels
Done with 50 labels
Done with 60 labels
Done with 70 labels
Done with 80 labels
Done with 90 labels
Done with 100 labels
Done with 110 labels
Done with 120 labels
Done with 130 labels
Done with 140 labels
Done with 150 labels
Done with 160 labels
Done with 170 labels
Done with 180 labels
Done with 190 labels
Done with 200 labels
Done with 210 labels
Done with 220 labels
Done with 230 labels
Done with 240 labels
Done with 250 labels
Done with 260 labels
Done with 270 labels
CPU times: user 39min 33s, sys: 44.8 s, total: 40min 17s
Wall time: 40min 17s


In [269]:
with open("{}/label_distinctiveness_dict.json".format(label_set_final_dir), "w") as f:
    json.dump(label_distinctiveness, f)

Let's take a look at the most distinctive labels

In [257]:
print("There are {} labels, and {} distinct values of distinctiveness.".format(len(label_distinctiveness),len(set(label_distinctiveness.values()))))
print("Most distinctive labels:")
most_distinct = sorted(label_distinctiveness.items(), key=lambda item: item[1], reverse=True)[:5]
display(most_distinct)
print("Looking at the top 5:")
cols_to_display = ["type", "type_label", "prop", "prop_label", "value", "value_label", "value_lb", "value_ub", "prop2", "prop2_label", "value2", "value2_lb", "value2_ub", "si_units", "wd_units"]
for i in range(len(most_distinct)):
    label_id = most_distinct[i][0]
    display(labels_df.loc[labels_df["node1"]==label_id, cols_to_display].fillna("").iloc[[0]])
    print("Number of positive entities for this label: {} (out of {} entities of this type)".format(len(label_to_entities[label_id]), len(all_ents_in_type)))
#print("Positive entities for this label: {}".format(", ".join(label_to_entities[label_id])))
# negatives = all_ents_in_type - label_to_entities[label_id]
# print("Negative entities for this label: {}".format(", ".join(negatives)))

There are 273 labels, and 273 distinct values of distinctiveness.
Most distinctive labels:


[('RAIL-E976418', 0.31647640386999143),
 ('RAIL-E976411', 0.3164195152423728),
 ('RAIL-E976422', 0.31641950157321325),
 ('RAIL-E976421', 0.31638429720047456),
 ('RAIL-E976420', 0.31637531900466437)]

Looking at the top 5:


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
2721045,Q11424,'film'@en,P495,'country of origin'@en,Q223832,'dominion of the British Empire'@en,,,P4010,'GDP (PPP)'@en,,5652059000000.0,,,Q550207


Number of positive entities for this label: 23390 (out of 176405 entities of this type)


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
2404370,Q11424,'film'@en,P495,'country of origin'@en,Q223832,'dominion of the British Empire'@en,,,P1198,'unemployment rate'@en,,,5.0,,Q11229


Number of positive entities for this label: 23390 (out of 176405 entities of this type)


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
2427760,Q11424,'film'@en,P495,'country of origin'@en,Q223832,'dominion of the British Empire'@en,,,P2046,'area'@en,,1156670.5,5473562.0,,Q712226


Number of positive entities for this label: 23391 (out of 176405 entities of this type)


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
2814859,Q11424,'film'@en,P495,'country of origin'@en,Q223832,'dominion of the British Empire'@en,,,P8477,'BTI Status Index'@en,,,,,


Number of positive entities for this label: 23390 (out of 176405 entities of this type)


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
2791469,Q11424,'film'@en,P495,'country of origin'@en,Q223832,'dominion of the British Empire'@en,,,P8476,'BTI Governance Index'@en,,,,,


Number of positive entities for this label: 23390 (out of 176405 entities of this type)


## 5. Iteratively choose labels to add to label set using formula
At each iteration, choose label $l_i = \arg\max_{l_i \in L_t^c}{[d(l_i) + \delta * reward(l_i,L_t) - (1- \delta) * penalty(l_i,L_t)]}$

Where...
- $l_i$ is a label
- $L_t^c$ is the set of candidate labels relevant to the type $t$ that we are profiling
- $L_t$ is the set of labels relevant to the type $t$ that we have so far chosen to be in the final label set
- $d(l_i)$ is the distinctiveness of label $l_i$
- $reward(l_i,L_t)$ is a function that captures the potential increase in the total coverage of entities of type $t$ in the KG by the labels in $L_t$ if $l_i$ were to be added to it.
- $penalty(l_i,L_t)$ is a function that captures the potential increase in redundancy of labels in $L_t$ if $l_i$ were to be added to it.
- $\delta$ is a hyperparameter that controls the trade-off between the two terms: larger values put more weight on increasing total coverage, smaller values put more weight on minimizing redundancy
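
As a minimal sketch of how the three terms combine into one score per candidate label: the function name `selection_score` and the toy numbers below are illustrative assumptions, not part of this notebook's pipeline.

```python
def selection_score(distinctiveness, reward_value, penalty_value, delta=0.5):
    """Score of one candidate: d + delta * reward - (1 - delta) * penalty."""
    return distinctiveness + delta * reward_value - (1 - delta) * penalty_value

# Toy candidates as (distinctiveness, reward, penalty); values are made up.
toy_candidates = {
    "label_a": (0.30, 0.60, 0.10),
    "label_b": (0.32, 0.40, 0.50),
}
toy_scores = {name: selection_score(*terms) for name, terms in toy_candidates.items()}

# The label with the highest score is the one added to the label set.
best_label = max(toy_scores, key=toy_scores.get)
print(toy_scores, best_label)
```

Here `label_b` is slightly more distinctive, but `label_a` wins because it has the higher reward and the lower penalty.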

Functions for computing reward and penalty...

$reward(l_i,L_t) = \cfrac{|\bigcup_{l_j \in (L_t \cup \{l_i\})}{\varepsilon_t^{l_j}}|}{|\varepsilon_t|}$

Where...
- $\varepsilon_t^{l_j}$ is the set of entities of type $t$ that match label $l_j$
- $\varepsilon_t$ is the set of entities of type $t$

In [258]:
# Note: takes the set of entities already covered by L_t (rather than L_t itself)
# so that the union over L_t is not recomputed for every candidate.
def reward(candidate_label, entities_covered_already):
    return len(entities_covered_already | label_to_entities[candidate_label]) / len(all_ents_in_type)

From the paper, "reward is the potential contribution of \[the label\] to the increase of the total coverage of positive entities in the KG". The formula above does not quite capture that. It is higher for a label that increases coverage, but it measures the total coverage after adding the label rather than the gain, so once coverage is already good it stays high even for a label that adds nothing new. If we want a function that does what the paper describes, we could use something like $\cfrac{|\varepsilon_t^{l_i}| - |\varepsilon_t^{L_t} \cap \varepsilon_t^{l_i}|}{|\varepsilon_t|}$, where $\varepsilon_t^{L_t}$ denotes the set of entities of type $t$ that match at least one label in $L_t$.
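
The alternative reward above can be sketched as follows. The name `marginal_reward` and the toy entity sets are hypothetical; the notebook itself uses the `reward` function defined earlier.

```python
def marginal_reward(candidate_entities, entities_covered_already, n_entities_of_type):
    """Fraction of all entities of the type that the candidate would newly cover."""
    # |E(l_i)| - |E(L_t) & E(l_i)| is the size of the set difference.
    newly_covered = candidate_entities - entities_covered_already
    return len(newly_covered) / n_entities_of_type

toy_covered = {"e1", "e2", "e3"}  # entities matched by the labels chosen so far
toy_n_entities = 5                # total number of entities of the type

redundant_gain = marginal_reward({"e2", "e3"}, toy_covered, toy_n_entities)
useful_gain = marginal_reward({"e3", "e4"}, toy_covered, toy_n_entities)
print(redundant_gain, useful_gain)
```

Within a single iteration the two versions differ only by the constant $|\varepsilon_t^{L_t}| / |\varepsilon_t|$, so they rank the candidates identically. The choice matters mainly for the absolute score, for example when comparing it against a cutoff.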

$penalty(l_i,L_t) = \cfrac{\sum_{l_j \in L_t}{|\varepsilon_t^{l_i} \cap \varepsilon_t^{l_j}|}}{|L_t| * |\varepsilon_t|}$

In [259]:
# Note: takes the precomputed numerator (sum of the candidate's overlaps with the
# labels already chosen) so that it is not recomputed on every iteration.
def penalty(numerator, label_set):
    # when label set is empty, avoid divide by zero
    if len(label_set) == 0:
        return 0 
    return numerator / (len(label_set) * len(all_ents_in_type))
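
For comparison, here is a sketch that evaluates the penalty formula directly from entity sets instead of from a running numerator. `penalty_direct` and the toy sets are assumptions made for illustration; the result should agree with `penalty` when the numerator has been accumulated over the same chosen labels.

```python
def penalty_direct(candidate_entities, chosen_label_entities, n_entities_of_type):
    """Sum of overlaps with each chosen label, divided by |L_t| * |entities of type|."""
    if not chosen_label_entities:
        return 0.0  # empty label set: nothing to be redundant with
    overlap = sum(len(candidate_entities & ents) for ents in chosen_label_entities)
    return overlap / (len(chosen_label_entities) * n_entities_of_type)

toy_chosen = [{"e1", "e2"}, {"e2", "e3", "e4"}]  # entity sets of labels already in L_t
toy_candidate = {"e2", "e4", "e5"}               # entity set of the candidate label

# Overlaps are 1 and 2, so the penalty is 3 / (2 * 5).
toy_penalty = penalty_direct(toy_candidate, toy_chosen, 5)
print(toy_penalty)
```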

Iteratively choose labels

In [260]:
%%time
d = 0.5  # delta in the formula above; could be made a notebook parameter
min_cutoff = 0  # stop once the best score is <= this value; could be a parameter (not mentioned in the paper)
covered_ents = set() # set of all entities covered by label set
label_set = []
candidate_labels = list(label_to_entities.keys())
penalty_numerator_for_label = {l : 0 for l in candidate_labels}
print("We have {} candidate labels to choose from".format(len(candidate_labels)))

for i in range(len(candidate_labels)):
    # Finding the best label
    vals = np.array([label_distinctiveness[l] + d*reward(l, covered_ents) - (1-d)*penalty(penalty_numerator_for_label[l], label_set) for l in candidate_labels])
    max_val = np.max(vals)
    if max_val <= min_cutoff:
        break
    # Break ties between equally scored labels at random
    max_ix = np.random.choice(np.flatnonzero(vals == max_val))
    max_label = candidate_labels[max_ix]
    max_label_ents = label_to_entities[max_label]
    
    # Adding the label to final label set
    label_set.append(max_label)
    covered_ents = covered_ents | max_label_ents

    # Remove the label from candidate labels
    candidate_labels.pop(max_ix)
    
    # Update penalty numerator for labels that are still in the candidate list
    for l in candidate_labels:
        penalty_numerator_for_label[l] += len((label_to_entities[l] & max_label_ents))
        
#     print("Iteration {}:".format(i+1))
#     print("\tmax label: {}".format(max_label))
#     print("\t\tvalue: {}".format(max_val))
#     print("\t\tcovered entities: {}".format(", ".join(np.sort(list(max_label_ents)))))
#     print("\t\tdistinctiveness: {}".format(label_distinctiveness[max_label]))
#     print("\t\treward: {}".format(reward(max_label, covered_ents)))
#     print("\t\tpenalty: {}".format(penalty(penalty_numerator_for_label[max_label], label_set[:-1])))
    
print("Final label set has {} labels, covering {} / {} entities of type {} in the dataset".format(len(label_set), len(covered_ents), len(all_ents_in_type), type_to_profile))

We have 273 candidate labels to choose from
Final label set has 273 labels, covering 167207 / 176405 entities of type Q11424 in the dataset
CPU times: user 8min 10s, sys: 10.1 s, total: 8min 20s
Wall time: 8min 20s


In [268]:
with open("{}/ordered_final_label_set.json".format(label_set_final_dir), 'w') as f:
    json.dump(label_set, f)

## 6. Look at what profiles are created
**Note: it may make sense to eventually move this section to a separate notebook.**


In [261]:
# NOTE - we can make max_labels_in_profile a notebook param
def print_profile_for_entity(entity, max_labels_in_profile = 5):
    print("Generating profile for entity: {}...".format(entity))

    # Performance could be improved by looping and stopping once enough matching labels are found.
    # We build the full list here so we can see how many labels in total apply to the entity.
    matching_label_ids_ordered = [l for l in label_set if entity in label_to_entities[l]]

    print("There are {} labels in the final label set that match this entity.".format(len(matching_label_ids_ordered)))
    print("Choosing the top {}:".format(max_labels_in_profile))

    cols_to_display = ["type", "type_label", "prop", "prop_label", "value", "value_label", "value_lb", "value_ub", "prop2", "prop2_label", "value2", "value2_lb", "value2_ub", "si_units", "wd_units"]
    for label_id in matching_label_ids_ordered[:max_labels_in_profile]:
#         print("\nLabel ID: {}".format(label_id))
        display(labels_df.loc[labels_df["node1"]==label_id,cols_to_display].fillna("").iloc[[0]])
#         print("positives for this label: {}".format(", ".join(label_to_entities[label_id])))
        print("Number of positive entities for this label: {} (out of {} entities of this type)".format(len(label_to_entities[label_id]), len(all_ents_in_type)))
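
The early-stopping variant mentioned in the comments could look roughly like this. The helper name `top_matching_labels` and the toy data are hypothetical; in the notebook the inputs would be `label_set` and `label_to_entities`.

```python
from itertools import islice

def top_matching_labels(entity, ordered_labels, label_entities, max_labels=5):
    """Return the first `max_labels` labels, in selection order, that the entity matches."""
    # The generator is lazy, so scanning stops once enough matches are found.
    matches = (l for l in ordered_labels if entity in label_entities[l])
    return list(islice(matches, max_labels))

toy_order = ["l1", "l2", "l3", "l4"]
toy_label_entities = {"l1": {"a"}, "l2": {"a", "b"}, "l3": {"b"}, "l4": {"a"}}

toy_profile = top_matching_labels("a", toy_order, toy_label_entities, max_labels=2)
print(toy_profile)
```

The trade-off is that this version no longer reports how many labels match the entity in total.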

Let's look at the generated profile for one of the films in the dataset: Q104123

In [262]:
print_profile_for_entity("Q104123")

Generating profile for entity: Q104123...
There are 129 labels in the final label set that match this entity.
Choosing the top 5:


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
5114405,Q11424,'film'@en,P495,'country of origin'@en,Q3624078,'sovereign state'@en,,,P571,'inception'@en,,1675.5,,,


Number of positive entities for this label: 156578 (out of 176405 entities of this type)


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
12009857,Q11424,'film'@en,P495,'country of origin'@en,Q99541706,'historical unrecognized state'@en,,,P2884,'mains voltage'@en,,,,,Q25250


Number of positive entities for this label: 54749 (out of 176405 entities of this type)


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
182721,Q11424,'film'@en,P495,'country of origin'@en,Q1489259,'superpower'@en,,,P1198,'unemployment rate'@en,,,,,Q11229


Number of positive entities for this label: 54749 (out of 176405 entities of this type)


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
7732022,Q11424,'film'@en,P495,'country of origin'@en,Q5255892,'democratic republic'@en,,,P7295,'Gregorian calendar start date'@en,,1667.0,,,


Number of positive entities for this label: 54749 (out of 176405 entities of this type)


Unnamed: 0,type,type_label,prop,prop_label,value,value_label,value_lb,value_ub,prop2,prop2_label,value2,value2_lb,value2_ub,si_units,wd_units
1732323,Q11424,'film'@en,P495,'country of origin'@en,Q1520223,'constitutional republic'@en,,,P4010,'GDP (PPP)'@en,,11367510000000.0,21417810000000.0,,Q550207


Number of positive entities for this label: 54749 (out of 176405 entities of this type)
