### Prerequisite steps to run this notebook
1. If you do not have kgtk installed, or do not have the `kgtk query` command, install it first with `pip install -e <path to local kgtk repo>`.
2. kneed (https://pypi.org/project/kneed/) is a dependency. Install it either through Anaconda with `conda install -c conda-forge kneed` or through pip with `pip install kneed`.
3. You'll need a subset of Wikidata partitioned into different files on your machine. You can create this yourself or, if you have access to the Table_Linker Google Drive, download the Q44 example data here: https://drive.google.com/drive/folders/1U3Tc25rRwu6xy74mPDOG5LIjhUXpbD9A?usp=sharing

In [3]:
import os
import pandas as pd
from utility import run_command
from utility import rename_cols_and_overwrite_id
from label_discretization import discretize_labels

### Parameters
**REQUIRED**  
**type_to_profile**: The type that we will create a label set for, and therefore be able to create entity profiles for. This should be a string denoting a Q-node. For example, to create profiles for beers, set type_to_profile="Q44" (Q44=beer).  
**work_dir**: path to work dir that was specified in the candidate_label_creation, candidate_filter, and HAS_embeddings notebooks. We will utilize files that were saved by those previous notebooks, and also save files created by this notebook here.  
**item_file**: file path for the file that contains entity to entity relationships (e.g. wikibase-item)  
**time_file**: file path for the file that contains entity to time-type values  
**quantity_file**: file path for the file that contains entity to quantity-type values  
**label_file**: file path for the file that contains Wikidata labels  
**store_dir**: path to folder containing the sqlite3.db file that we will use for our queries. We will reuse an existing file if there is one in this folder. Otherwise we will create a new one.

**OPTIONAL**  
**In-progress... this is currently not used.**  
*string_file*: file path for the file that contains entity to string-type values  
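
Since every query below depends on these files, it can help to check the paths before running anything. The following is a minimal sketch, not part of the original workflow; `find_missing_files` is a hypothetical helper, and it treats a parameter set to `None` (such as the optional string_file) as intentionally unset.

```python
import os

def find_missing_files(named_paths):
    """Return the names of parameters whose path is set but is not an existing file.

    named_paths maps a parameter name (e.g. "item_file") to a path or None.
    Parameters set to None are treated as intentionally unset and skipped.
    """
    return [
        name
        for name, path in named_paths.items()
        if path is not None and not os.path.isfile(path)
    ]
```

After the parameter cell has run, this could be called as `find_missing_files({"item_file": item_file, "time_file": time_file, "quantity_file": quantity_file, "label_file": label_file, "string_file": string_file})`, and an empty list would mean all configured files were found.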

In [4]:
data_dir = "../../Q44/data" # my data files are all in the same directory, so I'll reuse this path prefix

# **REQUIRED**
type_to_profile = "Q44"
work_dir = "../../Q44/profiler_work"
item_file = "{}/Q44.part.wikibase-item.tsv".format(data_dir)
time_file = "{}/Q44.part.time.tsv".format(data_dir)
quantity_file = "{}/Q44.part.quantity.tsv".format(data_dir)
label_file = "{}/Q44.label.en.tsv".format(data_dir)
store_dir = "../../Q44"

# **optional**
string_file = None #"{}/Q44.part.string.tsv".format(data_dir)

### Process parameters and set up variables / file names

In [176]:
# Ensure paths are absolute
item_file = os.path.abspath(item_file)
time_file = os.path.abspath(time_file)
quantity_file = os.path.abspath(quantity_file)
label_file = os.path.abspath(label_file)
work_dir = os.path.abspath(work_dir)
store_dir = os.path.abspath(store_dir)
if string_file:
    string_file = os.path.abspath(string_file)
    
# Create directories (exist_ok=True makes a separate existence check unnecessary)
output_dir = "{}/final_label_sets/{}".format(work_dir, type_to_profile)
label_set_work_dir = "{}/work".format(output_dir)
label_set_final_dir = "{}/final".format(output_dir)
for directory in (output_dir, label_set_work_dir, label_set_final_dir):
    os.makedirs(directory, exist_ok=True)
    
# adding some environment variables we'll be using frequently
os.environ['ITEM_FILE'] = item_file
os.environ['TIME_FILE'] = time_file
os.environ['QUANTITY_FILE'] = quantity_file
os.environ['LABEL_FILE'] = label_file
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(store_dir)
os.environ['FILTERED_LABELS'] = "{}/candidate_filter".format(work_dir)
os.environ['LABEL_CREATION_DIR'] = "{}/label_creation".format(work_dir)
os.environ['TYPE'] = type_to_profile
os.environ['WORK'] = label_set_work_dir
os.environ['OUT'] = label_set_final_dir
os.environ['kgtk'] = "kgtk" # Need to do this for kgtk to be recognized as a command when passing it through a subprocess call

## 1. Get labels that abstract entities of the type we want to profile and find the entities that match each of those labels

**Notes:**
1. This step could possibly be a separate notebook, or just code moved out of the notebook.
2. We could instead do this in earlier steps along the way, such as label creation. For scalability, we might then need to move the type_to_profile decision to that step so that we don't enumerate all labels and their entities (this would blow up when we get to RALs).
3. We ran into a problem comparing units: we can't figure out how to check whether kgtk_quantity_si_units() returns null, so we are ignoring unit comparison below. If we do the above, this won't be a problem. However, for final entity profiling we will need to solve it.
4. Since we didn't keep quantities and times separate for RAVLs and RAILs, the final label set won't be able to disambiguate whether the value for these is a time or a quantity. This isn't much of a problem, since the property can disambiguate time versus quantity, but it means that we'll have to look in a concatenated quantities/times file when creating a profile for an entity.

We will later choose from these labels to form the final label set, based on several formulas that take into account the entities that match each label.

Also, we need each label to have a unique identifier. We'll add a column for this here as well.

### 1.1 RELs

In [177]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_rel_item_filtered.tsv \
-i $ITEM_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv \
-o $WORK/candidate_RELs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type:`'"$TYPE"'`)-[l1 {label:p, prop_label:p_lab, id:label_id}]->(val), `'"$ITEM_FILE"'`: (e1)-[l2 {label:p}]->(val), type: (e1)-[]->(type)' \
--return 'distinct type as node1, p as label, p_lab as prop_label, val as node2, printf("REL-%s", label_id) as label_id, e1 as entity, "_" as id' \
--order-by 'type, label, val, e1'

In [178]:
!kgtk add-id -i $WORK/candidate_RELs_and_their_entities.tsv -o $WORK/candidate_RELs_and_their_entities.tsv --overwrite-id

In [179]:
display(pd.read_csv("{}/candidate_RELs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t', nrows=10).fillna(""))

|  | node1 | label | prop_label | node2 | label_id | entity | id |
|---|---|---|---|---|---|---|---|
| 0 | Q44 | P1056 | 'product or material produced'@en | Q93552342 | REL-E22327 | Q12877510 | E1 |
| 1 | Q44 | P1056 | 'product or material produced'@en | Q93557205 | REL-E22328 | Q12877510 | E2 |
| 2 | Q44 | P1056 | 'product or material produced'@en | Q93558270 | REL-E22329 | Q12877510 | E3 |
| 3 | Q44 | P1056 | 'product or material produced'@en | Q93559285 | REL-E22330 | Q12877510 | E4 |
| 4 | Q44 | P1056 | 'product or material produced'@en | Q93560567 | REL-E22331 | Q12877510 | E5 |
| 5 | Q44 | P1056 | 'product or material produced'@en | Q93560976 | REL-E22332 | Q12877510 | E6 |
| 6 | Q44 | P1056 | 'product or material produced'@en | Q97412285 | REL-E22333 | Q12877510 | E7 |
| 7 | Q44 | P112 | 'founded by'@en | Q90449270 | REL-E22334 | Q12877510 | E8 |
| 8 | Q44 | P127 | 'owned by'@en | Q12877510 | REL-E38780 | Q93552342 | E9 |
| 9 | Q44 | P127 | 'owned by'@en | Q12877510 | REL-E38780 | Q93557205 | E10 |


### 1.2 AVLs - quantities

In [180]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_avl_quantity_filtered.tsv \
-i $QUANTITY_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv \
-o $WORK/candidate_quantity_AVLs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type:`'"$TYPE"'`)-[l1 {label:p, prop_label:p_lab, id:label_id, si_units:label_si, wd_units:label_wd}]->(label_quantity_num), `'"$QUANTITY_FILE"'`: (e)-[l2 {label:p}]->(val), type: (e)-[]->(type)' \
--return 'distinct type as node1, p as label, p_lab as prop_label, label_quantity_num as node2, label_si as si_units, label_wd as wd_units, printf("AVL-quantity-%s", label_id) as label_id, e as entity, "_" as id' \
--where 'kgtk_quantity_number(val)=label_quantity_num AND ( (not kgtk_stringify(kgtk_quantity_si_units(val)) AND not kgtk_stringify(label_si)) OR (kgtk_quantity_si_units(val)=label_si) ) AND ( (not kgtk_stringify(kgtk_quantity_wd_units(val)) AND not kgtk_stringify(label_wd)) OR (kgtk_quantity_wd_units(val)=label_wd) )' \
--order-by 'type, label, node2, si_units, wd_units, e'
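
The `--where` clause above is dense, so here is a minimal Python sketch of the condition it appears to express: an entity value matches a quantity AVL when the numbers are equal and, for each unit field, either both sides are empty or both are equal. The dictionary representation and the function names are assumptions made for illustration; they are not part of kgtk.

```python
def units_match(value_units, label_units):
    """Two unit fields match when both are empty, or when they are equal."""
    if not value_units and not label_units:
        return True
    return value_units == label_units

def matches_quantity_avl(value, label):
    """Check an entity's quantity value against a quantity AVL.

    Both arguments are dicts with the (hypothetical) keys
    "number", "si_units" and "wd_units"; empty units may be None or "".
    """
    return (
        value["number"] == label["number"]
        and units_match(value["si_units"], label["si_units"])
        and units_match(value["wd_units"], label["wd_units"])
    )
```

Note that a value with a unit never matches a label without one (and vice versa), which mirrors the two-branch structure of each unit test in the query.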

In [181]:
!kgtk add-id -i $WORK/candidate_quantity_AVLs_and_their_entities.tsv -o $WORK/candidate_quantity_AVLs_and_their_entities.tsv --overwrite-id

In [182]:
display(pd.read_csv("{}/candidate_quantity_AVLs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t', nrows=10).fillna(""))

|  | node1 | label | prop_label | node2 | si_units | wd_units | label_id | entity | id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Q44 | P2226 | 'market capitalization'@en | 3896025.7 |  |  | AVL-quantity-E51037 | Q12877510 | E1 |
| 1 | Q44 | P2665 | 'alcohol by volume'@en | 0.4 |  | Q2080811 | AVL-quantity-E51038 | Q97412285 | E2 |
| 2 | Q44 | P2665 | 'alcohol by volume'@en | 5.4 |  | Q2080811 | AVL-quantity-E51040 | Q93559285 | E3 |
| 3 | Q44 | P2665 | 'alcohol by volume'@en | 5.5 |  | Q2080811 | AVL-quantity-E51041 | Q93560567 | E4 |
| 4 | Q44 | P2665 | 'alcohol by volume'@en | 5.8 |  | Q2080811 | AVL-quantity-E51042 | Q93557205 | E5 |
| 5 | Q44 | P2665 | 'alcohol by volume'@en | 6.3 |  | Q2080811 | AVL-quantity-E51043 | Q93558270 | E6 |
| 6 | Q44 | P6088 | 'beer bitterness'@en | 29.5 |  |  | AVL-quantity-E51046 | Q93557205 | E7 |


### 1.3 AVLs - times

In [75]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_avl_time.year_filtered.tsv \
-i $TIME_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv \
-o $WORK/candidate_time.year_AVLs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type:`'"$TYPE"'`)-[l1 {label:p, prop_label:p_lab, id:label_id}]->(label_year), `'"$TIME_FILE"'`: (e)-[l2 {label:p}]->(val), type: (e)-[]->(type)' \
--return 'distinct type as node1, p as label, p_lab as prop_label, label_year as node2, printf("AVL-time.year-%s", label_id) as label_id, e as entity, "_" as id' \
--where 'kgtk_date_year(val)=label_year' \
--order-by 'type, label, node2, e'
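
The query above matches an entity to a year AVL when the year of its time value equals the label's year. As a rough Python illustration of that comparison, the sketch below uses a regular expression as a stand-in for `kgtk_date_year`. It assumes time values are KGTK date literals shaped like `^2001-05-17T00:00:00Z/11` (leading caret, ISO 8601 date-time, optional precision suffix); the function names are hypothetical.

```python
import re

# Assumed literal shape: "^YYYY-MM-DDThh:mm:ssZ/precision"; only the year is captured.
_KGTK_DATE = re.compile(r"^\^(-?\d+)-\d{2}-\d{2}")

def date_year(literal):
    """Return the year of a KGTK-style date literal, or None if it does not parse."""
    match = _KGTK_DATE.match(literal)
    return int(match.group(1)) if match else None

def matches_year_avl(value_literal, label_year):
    """An entity matches a year AVL when its date's year equals the label's year."""
    return date_year(value_literal) == int(label_year)
```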

In [76]:
!kgtk add-id -i $WORK/candidate_time.year_AVLs_and_their_entities.tsv -o $WORK/candidate_time.year_AVLs_and_their_entities.tsv --overwrite-id

In [77]:
display(pd.read_csv("{}/candidate_time.year_AVLs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t', nrows=10).fillna(""))

|  | node1 | label | prop_label | node2 | label_id | entity | id |
|---|---|---|---|---|---|---|---|
| 0 | Q44 | P571 | 'inception'@en | 1996 | AVL-time.year-E321 | Q12877510 | E1 |
| 1 | Q44 | P571 | 'inception'@en | 1998 | AVL-time.year-E322 | Q93552342 | E2 |
| 2 | Q44 | P571 | 'inception'@en | 2001 | AVL-time.year-E323 | Q93557205 | E3 |
| 3 | Q44 | P571 | 'inception'@en | 2003 | AVL-time.year-E324 | Q93559285 | E4 |
| 4 | Q44 | P571 | 'inception'@en | 2013 | AVL-time.year-E325 | Q93558270 | E5 |
| 5 | Q44 | P571 | 'inception'@en | 2017 | AVL-time.year-E326 | Q93560567 | E6 |
| 6 | Q44 | P571 | 'inception'@en | 2020 | AVL-time.year-E327 | Q97412285 | E7 |


### 1.4 AILs - quantities

In [164]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_ail_quantity_filtered.tsv \
-i $QUANTITY_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv \
-o $WORK/candidate_quantity_AILs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type:`'"$TYPE"'`)-[l1 {label:p, prop_label:p_lab, id:label_id, upper_bound:ub, si_units:label_si, wd_units:label_wd}]->(lb), `'"$QUANTITY_FILE"'`: (e)-[l2 {label:p}]->(val), type: (e)-[]->(type)' \
--return 'distinct type as node1, p as label, p_lab as prop_label, lb as node2, ub as upper_bound, label_si as si_units, label_wd as wd_units, printf("AIL-quantity-%s", label_id) as label_id, e as entity, "_" as id' \
--where '(not lb OR kgtk_quantity_number(val) >= kgtk_quantity_number(lb)) AND (not ub OR kgtk_quantity_number(ub) >= kgtk_quantity_number(val))' \
--order-by 'type, label, node2, si_units, wd_units, e'
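
For interval labels, the `--where` clause treats a missing lower or upper bound as open-ended and otherwise requires the value to fall within the bounds, inclusive on both sides. A minimal Python sketch of that membership test follows; it represents a missing bound as `None`, which is an assumption of this illustration rather than how the query stores it, and `in_interval` is a hypothetical name.

```python
def in_interval(value, lower=None, upper=None):
    """Return True when value lies in [lower, upper]; a None bound is open-ended."""
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True
```

With the beer bitterness label bounded by 17.5 and 23.0, `in_interval(20.0, 17.5, 23.0)` holds while `in_interval(29.5, 17.5, 23.0)` does not. Like the query above, this sketch does not compare units.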


In [167]:
!kgtk add-id -i $WORK/candidate_quantity_AILs_and_their_entities.tsv -o $WORK/candidate_quantity_AILs_and_their_entities.tsv --overwrite-id

In [168]:
display(pd.read_csv("{}/candidate_quantity_AILs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t', nrows=10).fillna(""))

|  | node1 | label | prop_label | node2 | upper_bound | si_units | wd_units | label_id | entity | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Q44 | P2226 | 'market capitalization'@en |  |  |  |  | AIL-quantity-E401 | Q12877510 | E1 |
| 1 | Q44 | P2665 | 'alcohol by volume'@en |  | 2.7 |  | Q2080811 | AIL-quantity-E402 | Q97412285 | E2 |
| 2 | Q44 | P2665 | 'alcohol by volume'@en | 2.7 |  |  | Q2080811 | AIL-quantity-E1254 | Q93552342 | E3 |
| 3 | Q44 | P2665 | 'alcohol by volume'@en | 2.7 |  |  | Q2080811 | AIL-quantity-E1254 | Q93557205 | E4 |
| 4 | Q44 | P2665 | 'alcohol by volume'@en | 2.7 |  |  | Q2080811 | AIL-quantity-E1254 | Q93558270 | E5 |
| 5 | Q44 | P2665 | 'alcohol by volume'@en | 2.7 |  |  | Q2080811 | AIL-quantity-E1254 | Q93559285 | E6 |
| 6 | Q44 | P2665 | 'alcohol by volume'@en | 2.7 |  |  | Q2080811 | AIL-quantity-E1254 | Q93560567 | E7 |
| 7 | Q44 | P6088 | 'beer bitterness'@en |  | 17.5 |  |  | AIL-quantity-E403 | Q93559285 | E8 |
| 8 | Q44 | P6088 | 'beer bitterness'@en | 17.5 | 23.0 |  |  | AIL-quantity-E722 | Q93552342 | E9 |
| 9 | Q44 | P6088 | 'beer bitterness'@en | 17.5 | 23.0 |  |  | AIL-quantity-E722 | Q93558270 | E10 |


### 1.5 AILs - times

In [171]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_ail_time.year_filtered.tsv \
-i $TIME_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv \
-o $WORK/candidate_time.year_AILs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type:`'"$TYPE"'`)-[l1 {label:p, prop_label:p_lab, id:label_id, upper_bound:ub}]->(lb), `'"$TIME_FILE"'`: (e)-[l2 {label:p}]->(val), type: (e)-[]->(type)' \
--return 'distinct type as node1, p as label, p_lab as prop_label, lb as node2, ub as upper_bound, printf("AIL-time.year-%s", label_id) as label_id, e as entity, "_" as id' \
--where '(not lb OR kgtk_date_year(val) >= kgtk_quantity_number(lb)) AND (not ub OR kgtk_quantity_number(ub) >= kgtk_date_year(val))' \
--order-by 'type, label, node2, e'


In [172]:
!kgtk add-id -i $WORK/candidate_time.year_AILs_and_their_entities.tsv -o $WORK/candidate_time.year_AILs_and_their_entities.tsv --overwrite-id

In [173]:
display(pd.read_csv("{}/candidate_time.year_AILs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t', nrows=10).fillna(""))

|  | node1 | label | prop_label | node2 | upper_bound | label_id | entity | id |
|---|---|---|---|---|---|---|---|---|
| 0 | Q44 | P571 | 'inception'@en |  | 2008.0 | AIL-time.year-E75 | Q12877510 | E1 |
| 1 | Q44 | P571 | 'inception'@en |  | 2008.0 | AIL-time.year-E75 | Q93552342 | E2 |
| 2 | Q44 | P571 | 'inception'@en |  | 2008.0 | AIL-time.year-E75 | Q93557205 | E3 |
| 3 | Q44 | P571 | 'inception'@en |  | 2008.0 | AIL-time.year-E75 | Q93559285 | E4 |
| 4 | Q44 | P571 | 'inception'@en | 2008.0 | 2015.0 | AIL-time.year-E25 | Q93558270 | E5 |
| 5 | Q44 | P571 | 'inception'@en | 2015.0 |  | AIL-time.year-E47 | Q93560567 | E6 |
| 6 | Q44 | P571 | 'inception'@en | 2015.0 |  | AIL-time.year-E47 | Q97412285 | E7 |


### 1.6 RAVLs - quantities and times
Both quantities and times are handled here because we mixed them together in the label creation step.

In [191]:
!kgtk query -i $FILTERED_LABELS/candidate_labels_ravl_filtered.tsv \
-i $LABEL_CREATION_DIR/entity_AVLs_all.tsv -i $ITEM_FILE -i $LABEL_CREATION_DIR/type_mapping.tsv \
-o $WORK/candidate_RAVLs_and_their_entities.tsv --graph-cache $STORE \
--match 'filtered: (type1:`'"$TYPE"'`)-[l1 {label:p1, id:label_id, prop2:p2, prop2_label:p2_lab, val:val, si_units:label_si, wd_units:label_wd}]->(type2), `'"$ITEM_FILE"'`: (e1)-[l2 {label:p1}]->(e2), AVLs_all: (e2)-[l3 {label:p2}]->(val), type: (e1)-[]->(type1), type: (e2)-[]->(type2)' \
--return 'distinct type1 as node1, p1 as label, type2 as node2, p2 as prop2, p2_lab as prop2_label, val as val, label_si as si_units, label_wd as wd_units, printf("RAVL-%s", label_id) as label_id, e1 as entity, "_" as id' \
--order-by 'type1, p1, type2, p2, val, si_units, wd_units, e1'

In [192]:
!kgtk add-id -i $WORK/candidate_RAVLs_and_their_entities.tsv -o $WORK/candidate_RAVLs_and_their_entities.tsv --overwrite-id

In [193]:
display(pd.read_csv("{}/candidate_RAVLs_and_their_entities.tsv".format(os.environ["WORK"]), delimiter = '\t', nrows=10).fillna(""))

|  | node1 | label | node2 | prop2 | prop2_label | val | si_units | wd_units | label_id | entity | id |
|---|---|---|---|---|---|---|---|---|---|---|---|


### 1.7 RAVLs - times

### 1.8 RAILs - quantities

### 1.9 RAILs - times

## 2. Load information needed into Python variables.
1. Load dictionary of all labels found in step 1 above along with their corresponding positive entities
2. Look up the number of entities of the type we are profiling.
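
The two loading steps above could be sketched as follows. This is one possible approach, not the notebook's final implementation: `load_label_entities` and `count_entities_of_type` are hypothetical names, the first relies on the `label_id` and `entity` columns produced in step 1, and the second assumes that type_mapping.tsv stores each entity in `node1` and its type in `node2`.

```python
import io
import pandas as pd

def load_label_entities(tsv_sources):
    """Map each label_id to the set of entities that match it.

    tsv_sources is an iterable of file paths (or file-like objects) for the
    candidate_*_and_their_entities.tsv files written in step 1.
    """
    label_to_entities = {}
    for source in tsv_sources:
        df = pd.read_csv(source, delimiter="\t",
                         usecols=["label_id", "entity"], dtype=str)
        for label_id, entities in df.groupby("label_id")["entity"]:
            label_to_entities.setdefault(label_id, set()).update(entities)
    return label_to_entities

def count_entities_of_type(type_mapping_source, type_id):
    """Count distinct entities of the given type.

    Assumes node1 holds the entity and node2 holds its type.
    """
    df = pd.read_csv(type_mapping_source, delimiter="\t",
                     usecols=["node1", "node2"], dtype=str)
    return df.loc[df["node2"] == type_id, "node1"].nunique()

# Small in-memory example using two rows from the RELs output above.
sample = io.StringIO(
    "node1\tlabel\tnode2\tlabel_id\tentity\tid\n"
    "Q44\tP127\tQ12877510\tREL-E38780\tQ93552342\tE9\n"
    "Q44\tP127\tQ12877510\tREL-E38780\tQ93557205\tE10\n"
)
label_to_entities = load_label_entities([sample])
```

Keeping the entities in sets makes later overlap computations between labels straightforward, and a file with a header but no rows (such as the empty RAVLs output) simply contributes nothing.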

## 3. Compute distinctiveness for each label

## 4. Iteratively choose labels to add to label set using formula
Fill in formula here

Functions for computing reward and penalty

Iteratively choose labels