# Re-implementing the procedure outlined in "Entity Profiling in Knowledge Graphs" (Zhang et al.)
# This notebook will implement the candidate label creation step

### Pre-requisite steps to run this notebook
1. If you do not have kgtk installed, or do not have the kgtk query command, first install this with `pip install -e <path to local kgtk repo>`
2. kneed (https://pypi.org/project/kneed/) is a dependency. Install this either through Anaconda with `conda install -c conda-forge kneed`, or through pip with `pip install kneed`
3. You'll need a subset of Wikidata partitioned into different files on your machine. You can create this yourself by following the steps in the KGTK tutorial notebooks, or, if you have access to the Table_Linker Google Drive, you can download the Q44 example data here: https://drive.google.com/drive/folders/1U3Tc25rRwu6xy74mPDOG5LIjhUXpbD9A?usp=sharing
4. (Optional) Consider running the trim_quantity_file notebook as a pre-processing step (see notebook for details).

In [2]:
import os
import pandas as pd
from utility import run_command
from utility import rename_cols_and_overwrite_id
from label_discretization import discretize_labels

### Parameters
**Required**  
*item_file*: file path for the file that contains entity to entity relationships (e.g. wikibase-item)  
*time_file*: file path for the file that contains entity to time-type values  
*quantity_file*: file path for the file that contains entity to quantity-type values (remember to specify the trimmed file if you did the quantity file trimming pre-processing step).  
*label_file*: file path for the file that contains wikidata labels  
*work_dir*: path to folder where files created by this notebook should be stored  
*store_dir*: path to folder containing the sqlite3.db file that we will use for our queries. We will reuse an existing file if there is one in this folder. Otherwise we will create a new one.

**Optional**    
*string_file*: file path for the file that contains entity to string-type values  

In [3]:
data_dir = "../../wikidata_politicians/data" # my data files are all in the same directory, so I'll reuse this path prefix

# **REQUIRED**
item_file = "{}/claims.wikibase-item.tsv.gz".format(data_dir)
time_file = "{}/claims.time.tsv.gz".format(data_dir)
quantity_file = "{}/claims.quantity_trimmed.tsv.gz".format(data_dir)
label_file = "{}/labels.en.tsv.gz".format(data_dir)
work_dir = "../../wikidata_politicians/profiler_work"
store_dir = "../../wikidata_politicians"

# **optional**
string_file = None #"{}/claims.string.tsv.gz".format(data_dir)

### Process parameters and set up variables / file names

In [4]:
# Ensure paths are absolute
item_file = os.path.abspath(item_file)
time_file = os.path.abspath(time_file)
quantity_file = os.path.abspath(quantity_file)
label_file = os.path.abspath(label_file)
work_dir = os.path.abspath(work_dir)
store_dir = os.path.abspath(store_dir)
if string_file:
    string_file = os.path.abspath(string_file)
    
# Create directories
if not os.path.exists(work_dir):
    os.makedirs(work_dir)
output_dir = "{}/label_creation".format(work_dir)
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# adding some environment variables we'll be using frequently
os.environ['ITEM_FILE'] = item_file
os.environ['TIME_FILE'] = time_file
os.environ['QUANTITY_FILE'] = quantity_file
os.environ['LABEL_FILE'] = label_file
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(store_dir)
os.environ['OUT'] = output_dir
os.environ['kgtk'] = "kgtk" # Need to do this for kgtk to be recognized as a command when passing it through a subprocess call

# Outline of procedure:
**Goal**:<br>
We want to create candidate label sets including
- Attribute value labels (type, property, *attribute*)
- Relational entity labels (type, property, *entity*)
- Attribute interval labels (type, property, *range of attribute values*)
- Relational attribute labels (type, property, *attribute or attribute range of another entity*)

To enable subsequent filtering of these labels, we also want to count:
- The number of entities of each type
- The number of entities that match each label (call these "positives")
    
**Steps**:

0. Create type-mapping
1. Count the number of entities of each type
    - *optional future step*: define type with P279 transitive closure in addition to P31. 
2. Create AVLs trivially from attribute files along with counts of the positive entities for each label
    - At this step, we should also contribute to a mapping of entities --> matching attribute labels to facilitate creating RALs in a later step  
3. Create RELs trivially from entity relation files along with counts of positive entities for each label
4. Create AILs by discretizing the AVLs we found, along with counts of positive entities for each label
    - See label_discretization notebook for some exploration of discretization approach that led to the method that is implemented in this notebook
    - At this step, we should also contribute to a mapping of entities --> matching attribute labels to facilitate creating RALs in a later step
5. Create RALs by using the entities --> attribute labels table that we built in steps 2 and 4. Also keep track of counts of positive entities for each label
    
*Misc issues encountered*
- kgtk rename-columns doesn't always work when input file == output file. Getting around this right now by creating temp files... 
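
One way to sidestep the in-place rename problem is to write the renamed file to a temporary path and then swap it in. The sketch below does this with the standard library only. `rename_columns_in_place` is a hypothetical helper written for illustration; it is not the `rename_cols_and_overwrite_id` function from `utility`, whose implementation is not shown in this notebook. It assumes an uncompressed, tab-separated file with a header row.

```python
import os
import shutil
import tempfile


def rename_columns_in_place(path, old_names, new_names):
    """Rename header columns of a TSV file via a temp file (hypothetical helper)."""
    mapping = dict(zip(old_names, new_names))
    dir_name = os.path.dirname(os.path.abspath(path))
    # Keep the temp file in the same directory so the final swap stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(suffix=".tsv", dir=dir_name)
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as dst, \
                open(path, encoding="utf-8", newline="") as src:
            header = src.readline().rstrip("\r\n").split("\t")
            dst.write("\t".join(mapping.get(col, col) for col in header) + "\n")
            shutil.copyfileobj(src, dst)  # copy the data rows unchanged
        os.replace(tmp_path, path)  # swap the renamed file in over the original
    except BaseException:
        # Clean up the temp file if anything went wrong before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Because the input is never opened for writing while it is being read, the "input file == output file" situation does not arise.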

## 0. Create type mapping
Mapping is from entity (Q node) to the entity's type (another Q node). Using P31 only for now, but can add P279* as well later  

In [5]:
!kgtk query -i $ITEM_FILE -i $LABEL_FILE \
-o $OUT/type_mapping.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (e)-[l {label:"P31"}]->(type), `'"$LABEL_FILE"'`: (e)-[:label]->(e_lab), `'"$LABEL_FILE"'`: (type)-[:label]->(type_lab)' \
--return 'distinct l as id, e as node1, l.label as label, type as node2, e_lab as entity_label, type_lab as type_label' \
--where 'e_lab.kgtk_lqstring_lang_suffix = "en" AND type_lab.kgtk_lqstring_lang_suffix = "en"'

In [6]:
display(pd.read_csv("{}/type_mapping.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=5).fillna(""))

Unnamed: 0,id,node1,label,node2,entity_label,type_label
0,P10-P31-Q18610173-85ef4d24-0,P10,P31,Q18610173,'video'@en,'Wikidata property to link to Commons'@en
1,P1000-P31-Q18608871-093affb5-0,P1000,P31,Q18608871,'record held'@en,'Wikidata property for items about people'@en
2,P1001-P31-Q15720608-deeedec9-0,P1001,P31,Q15720608,'applies to jurisdiction'@en,'Wikidata qualifier'@en
3,P1001-P31-Q22984026-8beb0cfe-0,P1001,P31,Q22984026,'applies to jurisdiction'@en,'Wikidata property related to law and justice'@en
4,P1001-P31-Q22997934-1e5b1a96-0,P1001,P31,Q22997934,'applies to jurisdiction'@en,'Wikidata property related to government and s...


## 1. Count number of entities of each type:
Use the entity --> type mapping we created in step 0 to do this

In [7]:
!kgtk query -i $OUT/type_mapping.tsv -i $LABEL_FILE \
-o $OUT/entity_counts_per_type.tsv --graph-cache $STORE \
--match 'type: (n1)-[]->(type), `'"$LABEL_FILE"'`: (type)-[:label]->(lab)' \
--return 'distinct type as type, lab as type_label, count(distinct n1) as count, "_" as id' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc'

In [8]:
rename_cols_and_overwrite_id("$OUT/entity_counts_per_type", ".tsv", "type type_label count", "node1 label node2")

In [9]:
!head -10 $OUT/entity_counts_per_type.tsv | column -t -s $'\t'

node1      label                        node2  id
Q4164871   'position'@en                58903  E1
Q484170    'commune of France'@en       39128  E2
Q2074737   'municipality of Spain'@en   6826   E3
Q5         'human'@en                   3627   E4
Q4167836   'Wikimedia category'@en      3068   E5
Q659103    'commune of Romania'@en      2862   E6
Q294414    'public office'@en           2717   E7
Q21869758  'delegated commune'@en       2321   E8
Q13406463  'Wikimedia list article'@en  1097   E9


## 2. Create AVLs with counts of positive entities
At this step we also want to keep track of entities --> matching attribute labels for future use. This will help when we are creating RALs (step 5)

To accomplish these goals, we will do the following:

For each attribute type (string, time, quantity), we will:
1. Use the entity --> type mapping along with the attribute data file to create an entity_attribute_labels file that maps each entity to the labels applicable to it.
2. Use the entity_attribute_labels file to aggregate labels with counts of matching entities, which we will save in a candidate_labels file.
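
The aggregation part of this step amounts to counting distinct entities per (type, property, value) triple. Here is a minimal plain-Python sketch of that idea. The function name `count_positives` is hypothetical and the rows are made-up placeholders, not data from the Wikidata subset used in this notebook; the notebook itself does this with `kgtk query`.

```python
from collections import defaultdict


def count_positives(rows):
    """Count distinct entities per (type, prop, value) label.

    rows: iterable of (entity, type, prop, value) tuples.
    Returns a list of ((type, prop, value), positives), largest count first.
    """
    entities_per_label = defaultdict(set)
    for entity, type_, prop, value in rows:
        entities_per_label[(type_, prop, value)].add(entity)
    counts = [(label, len(ents)) for label, ents in entities_per_label.items()]
    return sorted(counts, key=lambda item: item[1], reverse=True)


# Toy rows for illustration only.
rows = [
    ("Q_a", "Q_city", "P571", "1630"),
    ("Q_b", "Q_city", "P571", "1630"),
    ("Q_b", "Q_city", "P571", "1630"),  # duplicate row, counted once
    ("Q_c", "Q_city", "P571", "1700"),
]
print(count_positives(rows))
```

Using a set per label is what makes the count "distinct": a repeated row for the same entity does not inflate the number of positives.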

### 2.1 Strings
Creating mapping of entity --> string attribute labels

In [10]:
if not string_file:
    print("No string attribute file was provided in the parameters section, skipping this step.")
else:
    # perform query
    command = "$kgtk query -i $OUT/type_mapping.tsv -i STRING_FILE -i LABEL_FILE \
               -o $OUT/entity_attribute_labels_string.tsv --graph-cache $STORE \
               --match '`STRING_FILE`: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), `LABEL_FILE`: (p)-[:label]->(lab)' \
               --return 'distinct n1 as entity, type as type, p as prop, n2 as value, lab as property_label, \"_\" as id' \
               --where 'lab.kgtk_lqstring_lang_suffix = \"en\"' \
               --order-by 'n1'"
    run_command(command, {"STRING_FILE" : string_file, "LABEL_FILE" : label_file})
    # reformat columns to be in KGTK format
    rename_cols_and_overwrite_id("$OUT/entity_attribute_labels_string", ".tsv", "type prop value", "node1 label node2")
    # view header of result
    run_command("head -5 $OUT/entity_attribute_labels_string.tsv | column -t -s $'\t'")

No string attribute file was provided in the parameters section, skipping this step.


Aggregating distinct labels w/ positive entity counts

In [11]:
if not string_file:
    print("No string attribute file was provided in the parameters section, skipping this step.")
else:
    # perform query
    command = "$kgtk query -i $OUT/entity_attribute_labels_string.tsv \
               -o $OUT/candidate_labels_avl_string.tsv --graph-cache $STORE \
               --match 'labels: (type)-[l {label:prop, property_label:lab, entity:e}]->(val)' \
               --return 'distinct type as type, prop as prop, val as value, count(distinct e) as positives, lab as property_label, \"_\" as id' \
               --order-by 'count(distinct e) desc'"
    run_command(command)
    # reformat columns to be in KGTK format
    rename_cols_and_overwrite_id("$OUT/candidate_labels_avl_string", ".tsv", "type prop value", "node1 label node2")
    # view header of result
    run_command("head -5 $OUT/candidate_labels_avl_string.tsv | column -t -s $'\t'")

No string attribute file was provided in the parameters section, skipping this step.


### 2.2 Times

Looking at what precisions we need to deal with...

In [12]:
!kgtk query -i $TIME_FILE -i $LABEL_FILE \
--graph-cache $STORE \
--match '`'"$TIME_FILE"'`: (n1)-[l {label:p}]->(n2), `'"$LABEL_FILE"'`: (p)-[:label]->(lab)' \
--return 'distinct kgtk_date_precision(n2) as precisions, count(n1) as count' \
--limit 10 \
| column -t -s $'\t'

precisions  count
6           14
7           218
8           45
9           6077
10          212
11          21214


The output above shows several precisions coarser than a year (precision 9). We don't have kgtk type interpretation functions for these granularities, so for now we'll interpret them as years. More generally, we will interpret all times at year granularity for now, including those recorded with finer precision.

Additional work can be done later to create labels with finer time granularity if desired.
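
To make the year-level interpretation concrete, here is a small sketch that pulls the year and precision out of a date literal with a regular expression. It assumes the literal looks like `^2015-01-01T00:00:00Z/11` (a leading `^`, an ISO-style timestamp, and an optional `/precision` suffix). The notebook itself relies on `kgtk_date_year` and `kgtk_date_precision`; `year_and_precision` is a hypothetical helper and may not cover every form the real functions accept.

```python
import re

# Assumed shape of a KGTK date literal: ^YYYY-MM-DD[Thh:mm:ss[Z]][/precision]
_KGTK_DATE = re.compile(
    r"^\^(?P<year>[+-]?\d{4,})-\d{2}-\d{2}(?:T[^/]*)?(?:/(?P<precision>\d+))?$"
)


def year_and_precision(literal):
    """Return (year, precision) for a date literal, or None if it does not match.

    precision is None when the literal carries no /precision suffix.
    """
    m = _KGTK_DATE.match(literal)
    if m is None:
        return None
    precision = m.group("precision")
    return int(m.group("year")), (int(precision) if precision is not None else None)


print(year_and_precision("^2015-01-01T00:00:00Z/11"))
print(year_and_precision("^1630-01-01T00:00:00Z/9"))
```

Whatever the precision suffix says, only the year component is kept, which mirrors the decision to treat every time value at year granularity.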

Creating mapping of entity --> year attribute labels

In [13]:
!kgtk query -i $OUT/type_mapping.tsv -i $TIME_FILE -i $LABEL_FILE \
-o $OUT/entity_attribute_labels_time.year.tsv --graph-cache $STORE \
--match '`'"$TIME_FILE"'`: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), `'"$LABEL_FILE"'`: (p)-[:label]->(p_lab), `'"$LABEL_FILE"'`: (type)-[:label]->(t_lab)' \
--return 'distinct n1 as entity, type as type, p as prop, kgtk_date_year(n2) as value, t_lab as type_label, p_lab as property_label, "_" as id' \
--where 't_lab.kgtk_lqstring_lang_suffix = "en" AND p_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'n1'

In [14]:
rename_cols_and_overwrite_id("$OUT/entity_attribute_labels_time.year", ".tsv", "type prop value", "node1 label node2")

In [15]:
!head -5 $OUT/entity_attribute_labels_time.year.tsv | column -t -s $'\t'

entity  node1      label  node2  type_label                                               property_label   id
P6107   Q18616576  P580   2006   'Wikidata property'@en                                   'start time'@en  E1
Q100    Q1093829   P571   1630   'city of the United States'@en                           'inception'@en   E2
Q100    Q1549591   P571   1630   'big city'@en                                            'inception'@en   E3
Q100    Q21518270  P571   1630   'state or insular area capital in the United States'@en  'inception'@en   E4


Aggregating distinct labels w/ positive entity counts

In [16]:
!kgtk query -i $OUT/entity_attribute_labels_time.year.tsv \
-o $OUT/candidate_labels_avl_time.year.tsv --graph-cache $STORE \
--match 'labels: (n1)-[l {label:p, property_label:lab, entity:e}]->(val)' \
--return 'distinct n1 as type, p as prop, val as value, count(distinct e) as positives, lab as property_label, "_" as id' \
--order-by 'count(distinct e) desc'

In [17]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_avl_time.year", ".tsv", "type prop value", "node1 label node2")

In [18]:
!head -5 $OUT/candidate_labels_avl_time.year.tsv | column -t -s $'\t'

node1      label  node2  positives  property_label                                id
Q4164871   P576   2015   1106       'dissolved, abolished or demolished date'@en  E1
Q484170    P576   2015   1057       'dissolved, abolished or demolished date'@en  E2
Q21869758  P576   2015   995        'dissolved, abolished or demolished date'@en  E3
Q4164871   P576   2016   674        'dissolved, abolished or demolished date'@en  E4


### 2.3 Quantities
Creating mapping of entity --> quantity attribute labels

Note, quantities may have units. We will separate out the quantity value and units into separate columns
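
As a rough illustration of what separating value and units means, the sketch below splits a quantity literal into a number, SI units, and a Wikidata unit node. It assumes literals shaped like `+37427`, `+4Q577`, or `+1.5[-0.1,+0.1]m` (numeral, optional bracketed tolerance, then optional units). This is a simplification for illustration; the notebook uses the `kgtk_quantity_*` functions, and `split_quantity` is a hypothetical helper.

```python
import re

# Assumed shape: numeral, optional [tolerance], then a Q-node or SI unit string.
_KGTK_QUANTITY = re.compile(
    r"^(?P<numeral>[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"
    r"(?:\[[^\]]*\])?"
    r"(?:(?P<wd_units>Q\d+)|(?P<si_units>[A-Za-z].*))?$"
)


def split_quantity(literal):
    """Return (value, si_units, wd_units), or None if the literal does not match.

    Missing units are returned as empty strings, like blank columns in a TSV.
    """
    m = _KGTK_QUANTITY.match(literal)
    if m is None:
        return None
    return (
        float(m.group("numeral")),
        m.group("si_units") or "",
        m.group("wd_units") or "",
    )


print(split_quantity("+200Q11573"))
print(split_quantity("+37427"))
```

Keeping the units in their own columns matters later: two labels with the same number but different units should not be merged into one candidate label.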

In [19]:
test = !kgtk query -i $OUT/type_mapping.tsv -i $QUANTITY_FILE -i $LABEL_FILE \
--graph-cache $STORE \
--match '`'"$QUANTITY_FILE"'`: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), `'"$LABEL_FILE"'`: (p)-[:label]->(p_lab), `'"$LABEL_FILE"'`: (type)-[:label]->(t_lab)' \
--return 'distinct n1 as entity, type as type, p as prop, kgtk_quantity_number(n2) as value, kgtk_quantity_si_units(n2) as si_units, kgtk_quantity_wd_units(n2) as wd_units, t_lab as type_label, p_lab as property_label, "_" as id' \
--where 't_lab.kgtk_lqstring_lang_suffix = "en" AND p_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'n1'


In [20]:
test

['entity\ttype\tprop\tvalue\tsi_units\twd_units\ttype_label\tproperty_label\tid',
 "P1004\tQ19829908\tP4876\t37427\t\t\t'Wikidata property for authority control for places'@en\t'number of records'@en\t_",
 "P1004\tQ24075706\tP4876\t37427\t\t\t'Wikidata property for authority control, with reciprocal use of Wikidata'@en\t'number of records'@en\t_",
 "P1004\tQ27525351\tP4876\t37427\t\t\t'Wikidata property related to music'@en\t'number of records'@en\t_",
 "P1014\tQ27918607\tP4876\t53249\t\t\t'Wikidata property related to art'@en\t'number of records'@en\t_",
 "P1014\tQ43831109\tP4876\t53249\t\t\t'Wikidata property related to architecture'@en\t'number of records'@en\t_",
 "P1014\tQ89560413\tP4876\t53249\t\t\t'Wikidata property related to a thesaurus'@en\t'number of records'@en\t_",
 "P1022\tQ19847637\tP4876\t541\t\t\t'Wikidata property for an identifier'@en\t'number of records'@en\t_",
 "P1022\tQ24043375\tP4876\t541\t\t\t'Wikidata property for occupations'@en\t'number of records'@en\t_",
 

In [21]:
!kgtk query -i $OUT/type_mapping.tsv -i $QUANTITY_FILE -i $LABEL_FILE \
-o $OUT/entity_attribute_labels_quantity.tsv --graph-cache $STORE \
--match '`'"$QUANTITY_FILE"'`: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), `'"$LABEL_FILE"'`: (p)-[:label]->(p_lab), `'"$LABEL_FILE"'`: (type)-[:label]->(t_lab)' \
--return 'distinct n1 as entity, type as type, p as prop, kgtk_quantity_numeral(n2) as value, kgtk_quantity_si_units(n2) as si_units, kgtk_quantity_wd_units(n2) as wd_units, t_lab as type_label, p_lab as property_label, "_" as id' \
--where 't_lab.kgtk_lqstring_lang_suffix = "en" AND p_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'n1'

In [22]:
rename_cols_and_overwrite_id("$OUT/entity_attribute_labels_quantity", ".tsv", "type prop value", "node1 label node2")

In [23]:
display(pd.read_csv("{}/entity_attribute_labels_quantity.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,entity,node1,label,node2,si_units,wd_units,type_label,property_label,id
0,P1004,Q19829908,P4876,37427,,,'Wikidata property for authority control for p...,'number of records'@en,E1
1,P1004,Q24075706,P4876,37427,,,"'Wikidata property for authority control, with...",'number of records'@en,E2
2,P1004,Q27525351,P4876,37427,,,'Wikidata property related to music'@en,'number of records'@en,E3
3,P1014,Q27918607,P4876,53249,,,'Wikidata property related to art'@en,'number of records'@en,E4
4,P1014,Q43831109,P4876,53249,,,'Wikidata property related to architecture'@en,'number of records'@en,E5


Aggregating distinct labels w/ positive entity counts

In [24]:
!kgtk query -i $OUT/entity_attribute_labels_quantity.tsv \
-o $OUT/candidate_labels_avl_quantity.tsv --graph-cache $STORE \
--match 'labels: (n1)-[l {label:p, property_label:lab, entity:e, si_units:si, wd_units:wd}]->(val)' \
--return 'distinct n1 as type, p as prop, val as value, count(distinct e) as positives, lab as property_label, "_" as id, si as si_units, wd as wd_units' \
--order-by 'count(distinct e) desc'

In [25]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_avl_quantity", ".tsv", "type prop value", "node1 label node2")

In [26]:
display(pd.read_csv("{}/candidate_labels_avl_quantity.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,label,node2,positives,property_label,id,si_units,wd_units
0,Q37002670,P4253,0,192,'number of constituencies'@en,E1,,
1,Q4164871,P2097,4,163,'term length of office'@en,E2,,Q577
2,Q484170,P2044,200,136,'elevation above sea level'@en,E3,,Q11573
3,Q484170,P2044,150,131,'elevation above sea level'@en,E4,,Q11573
4,Q6256,P3000,18,114,'marriageable age'@en,E5,,Q24564698


### 2.4 Combining entity --> attribute label mappings to single table

In [27]:
command = "$kgtk cat \
           -i $OUT/entity_attribute_labels_time.year.tsv \
           -i $OUT/entity_attribute_labels_quantity.tsv \
           -o $OUT/entity_AVLs_all.tsv"
if string_file:
    command += " -i $OUT/entity_attribute_labels_string.tsv"

run_command(command)

## 3. Create RELs with counts of positive entities
We do this the same way we created AVLs, except we use the entity to entity relation data file, and we don't need to save the intermediate entity --> labels file since these labels won't contribute to RALs later
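
The join behind this step can be sketched in plain Python: each entity-to-entity edge is paired with every type of its source entity, and distinct source entities are counted per (type, property, target) label. `rel_candidates` is a hypothetical function and the identifiers in the example are placeholders; the notebook performs this with a single `kgtk query`.

```python
from collections import defaultdict


def rel_candidates(edges, types):
    """Count positives for relational entity labels.

    edges: iterable of (entity, prop, target) triples.
    types: iterable of (entity, type) pairs.
    Returns {(type, prop, target): number of distinct entities}.
    """
    types_of = defaultdict(set)
    for entity, type_ in types:
        types_of[entity].add(type_)

    positives = defaultdict(set)
    for entity, prop, target in edges:
        # An entity with several types contributes to one label per type;
        # entities with no known type are dropped, as in an inner join.
        for type_ in types_of.get(entity, ()):
            positives[(type_, prop, target)].add(entity)
    return {label: len(ents) for label, ents in positives.items()}


# Toy data for illustration only.
types = [("Q_a", "Q_city"), ("Q_a", "Q_capital"), ("Q_b", "Q_city")]
edges = [("Q_a", "P17", "Q_x"), ("Q_b", "P17", "Q_x"), ("Q_z", "P17", "Q_x")]
print(rel_candidates(edges, types))
```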

In [28]:
!kgtk query -i $ITEM_FILE -i $OUT/type_mapping.tsv -i $LABEL_FILE \
-o $OUT/candidate_labels_rel_item.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), `'"$LABEL_FILE"'`: (p)-[:label]->(lab)' \
--return 'distinct type as type, p as prop, n2 as value, count(distinct n1) as positives, lab as property_label, "_" as id' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc'

In [29]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_rel_item", ".tsv", "type prop value", "node1 label node2")

In [30]:
!head -10 $OUT/candidate_labels_rel_item.tsv | column -t -s $'\t'

node1     label  node2     positives  property_label             id
Q4164871  P31    Q4164871  58903      'instance of'@en           E1
Q4164871  P17    Q142      39854      'country'@en               E2
Q4164871  P279   Q382617   39179      'subclass of'@en           E3
Q484170   P17    Q142      39128      'country'@en               E4
Q484170   P31    Q484170   39128      'instance of'@en           E5
Q484170   P421   Q6655     31524      'located in time zone'@en  E6
Q484170   P421   Q6723     31519      'located in time zone'@en  E7
Q4164871  P17    Q29       8035       'country'@en               E8
Q4164871  P279   Q5663900  7904       'subclass of'@en           E9


## 4. Create AILs with counts of positive entities
Similar to what we did for AVLs, we also want to keep track of entities --> matching attribute labels for future use in RAL creation (step 5)

We will create attribute *interval* labels from our attribute *value* labels that we previously created. The code that does this is explored in the explore_label_discretization notebook, and implemented in label_discretization.py.

For each entity --> labels file that has a numeric value type (year or quantity) we will:
1. Create a corresponding entity --> bucketed labels file. For example, a label in the input that looks like <country, population, 1,000,000> might get summarized (bucketed) in the output to look like <country, population, (500,000, 2,000,000)>.
2. Use the resulting bucketed entity_attribute_labels file to once again aggregate labels with counts of matching entities. This will give us a candidate_labels_ail file.

*Note on syntax we are using for ranges*: we will define ranges with lower and upper bounds. Ranges may have blank values for the lower and/or upper bounds. A range that only has an upper bound means the bin includes all values <= the upper bound. A range that has neither a lower nor an upper bound denotes a single bin that includes all values. Such ranges may be created when a <type, property> pair has very few data points.
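
The range convention can be expressed as a small membership test. In this sketch `None` stands for a blank bound. The upper bound is inclusive, as stated above; treating the lower bound as exclusive is an assumption made here so that adjacent bins do not overlap, and `in_range` is a hypothetical helper rather than part of `label_discretization`.

```python
def in_range(value, lower_bound=None, upper_bound=None):
    """Return True if value falls in the bin described by the two bounds.

    A bound of None means that side of the range is open-ended.
    """
    if lower_bound is not None and not value > lower_bound:
        return False  # at or below the (assumed exclusive) lower bound
    if upper_bound is not None and not value <= upper_bound:
        return False  # above the inclusive upper bound
    return True


print(in_range(1630, None, 1659.5))    # only an upper bound
print(in_range(1630, 1564.0, 1667.5))  # both bounds
print(in_range(1630))                  # no bounds: single bin with all values
```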

Note that the code prints some output about labels for which it may not be creating good buckets. We silence this output so it doesn't take up too much space when viewing on GitHub. To see this output, comment out the `%%capture` lines.

### 4.1 Years
Create entity --> bucketed labels file

In [31]:
%%capture
avl_file_in = "{}/entity_attribute_labels_time.year.tsv".format(os.environ["OUT"])
ail_file_out = "{}/entity_attribute_labels_time.year_bucketed.tsv".format(os.environ["OUT"])
discretize_labels(avl_file_in, ail_file_out)

In [32]:
display(pd.read_csv("{}/entity_attribute_labels_time.year_bucketed.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=11).fillna(""))


Unnamed: 0,entity,node1,label,node2,type_label,property_label,id,lower_bound,upper_bound
0,P6107,Q18616576,P580,2006,'Wikidata property'@en,'start time'@en,E1,,
1,Q100,Q1093829,P571,1630,'city of the United States'@en,'inception'@en,E2,,1659.5
2,Q100,Q1549591,P571,1630,'big city'@en,'inception'@en,E3,1486.0,
3,Q100,Q21518270,P571,1630,'state or insular area capital in the United S...,'inception'@en,E4,1564.0,1667.5
4,Q1000,Q11042,P571,1960,'culture'@en,'inception'@en,E5,993.0,
5,Q1000,Q1292119,P571,1960,'style'@en,'inception'@en,E6,993.0,
6,Q1000,Q179023,P571,1960,'French colonial empire'@en,'inception'@en,E7,,
7,Q1000,Q3624078,P571,1960,'sovereign state'@en,'inception'@en,E8,1620.5,
8,Q1000,Q6256,P571,1960,'country'@en,'inception'@en,E9,1637.0,
9,Q1000134,Q21869758,P576,2018,'delegated commune'@en,"'dissolved, abolished or demolished date'@en",E10,2017.5,2018.5


Aggregating distinct interval labels with positive entity counts

**NOTE: Below, the lower_bound column is renamed to be node2**.

In [33]:
!kgtk query -i $OUT/entity_attribute_labels_time.year_bucketed.tsv \
-o $OUT/candidate_labels_ail_time.year.tsv \
--graph-cache $STORE \
--match 'labels: (type)-[l {label:prop, property_label:lab, entity:e, lower_bound:lb, upper_bound:ub}]->(val)' \
--return 'type as type, prop as prop, lb as lower_bound, ub as upper_bound, count(e) as positives, lab as property_label, "_" as id' \
--order-by 'count(e) desc'

In [34]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_ail_time.year", ".tsv", "type prop lower_bound", "node1 label node2")

In [35]:
display(pd.read_csv("{}/candidate_labels_ail_time.year.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=15).fillna(""))

Unnamed: 0,node1,label,node2,upper_bound,positives,property_label,id
0,Q4164871,P576,2009.0,,2622,"'dissolved, abolished or demolished date'@en",E1
1,Q5,P569,1850.5,1987.0,2467,'date of birth'@en,E2
2,Q4164871,P571,1784.5,,2346,'inception'@en,E3
3,Q5,P570,1830.5,,1157,'date of death'@en,E4
4,Q484170,P576,2014.5,2015.5,1057,"'dissolved, abolished or demolished date'@en",E5
5,Q484170,P571,1955.5,,999,'inception'@en,E6
6,Q21869758,P576,2014.5,2015.5,995,"'dissolved, abolished or demolished date'@en",E7
7,Q4164871,P576,1958.0,1998.0,849,"'dissolved, abolished or demolished date'@en",E8
8,Q4164871,P576,1802.5,1850.5,700,"'dissolved, abolished or demolished date'@en",E9
9,Q484170,P576,2015.5,2016.5,674,"'dissolved, abolished or demolished date'@en",E10


### 4.2 Quantities
Create entity --> bucketed labels file

In [36]:
%%capture
avl_file_in = "{}/entity_attribute_labels_quantity.tsv".format(os.environ["OUT"])
ail_file_out = "{}/entity_attribute_labels_quantity_bucketed.tsv".format(os.environ["OUT"])
discretize_labels(avl_file_in, ail_file_out)

In [37]:
display(pd.read_csv("{}/entity_attribute_labels_quantity_bucketed.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=10).fillna(""))

Unnamed: 0,entity,node1,label,node2,si_units,wd_units,type_label,property_label,id,lower_bound,upper_bound
0,P1004,Q19829908,P4876,37427.0,,,'Wikidata property for authority control for p...,'number of records'@en,E1,,5377793.0
1,P1004,Q24075706,P4876,37427.0,,,"'Wikidata property for authority control, with...",'number of records'@en,E2,,696863.5
2,P1004,Q27525351,P4876,37427.0,,,'Wikidata property related to music'@en,'number of records'@en,E3,,178376.0
3,P1014,Q27918607,P4876,53249.0,,,'Wikidata property related to art'@en,'number of records'@en,E4,,106726.0
4,P1014,Q43831109,P4876,53249.0,,,'Wikidata property related to architecture'@en,'number of records'@en,E5,22846.0,
5,P1014,Q89560413,P4876,53249.0,,,'Wikidata property related to a thesaurus'@en,'number of records'@en,E6,45624.5,
6,P1022,Q19847637,P4876,541.0,,,'Wikidata property for an identifier'@en,'number of records'@en,E7,,1177725.5
7,P1022,Q24043375,P4876,541.0,,,'Wikidata property for occupations'@en,'number of records'@en,E8,,
8,P1042,Q55999460,P1114,1952404.0,,,'Wikidata property for authority control for t...,'quantity'@en,E9,,
9,P1042,Q57589544,P1114,1952404.0,,,'Wikidata property for authority control for a...,'quantity'@en,E10,,


Aggregating distinct interval labels with positive entity counts

**NOTE: Below, the lower_bound column is renamed to be node2**.

In [38]:
!kgtk query -i $OUT/entity_attribute_labels_quantity_bucketed.tsv \
-o $OUT/candidate_labels_ail_quantity.tsv \
--graph-cache $STORE \
--match 'labels: (type)-[l {label:prop, property_label:lab, si_units:si, wd_units:wd, entity:e, lower_bound:lb, upper_bound:ub}]->(val)' \
--return 'type as type, prop as prop, si as si_units, wd as wd_units, lb as lower_bound, ub as upper_bound, count(e) as positives, lab as property_label, "_" as id' \
--order-by 'count(e) desc'

In [39]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_ail_quantity", ".tsv", "type prop lower_bound", "node1 label node2")

In [40]:
display(pd.read_csv("{}/candidate_labels_ail_quantity.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=10).fillna(""))

Unnamed: 0,node1,label,si_units,wd_units,node2,upper_bound,positives,property_label,id
0,Q484170,P2046,,Q712226,1.28,27.97,32562,'area'@en,E1
1,Q484170,P2044,,Q11573,,441.0,8732,'elevation above sea level'@en,E2
2,Q2074737,P2044,,Q11573,,1172.0,6424,'elevation above sea level'@en,E3
3,Q2074737,P2046,,Q712226,,131.08813,6022,'area'@en,E4
4,Q659103,P2046,,Q712226,11.59,138.015,2452,'area'@en,E5
5,Q21869758,P2046,,Q712226,1.445,28.32,2045,'area'@en,E6
6,Q484170,P2046,,Q712226,46.975,,1436,'area'@en,E7
7,Q659103,P2044,,Q11573,,448.5,989,'elevation above sea level'@en,E8
8,Q33146843,P2044,,Q11573,,762.5,825,'elevation above sea level'@en,E9
9,Q33146843,P2046,,Q712226,,59.7,814,'area'@en,E10


### 4.3 Combining entity --> attribute interval label mappings to single table

In [41]:
!kgtk cat \
-i $OUT/entity_attribute_labels_quantity_bucketed.tsv \
-i $OUT/entity_attribute_labels_time.year_bucketed.tsv \
-o $OUT/entity_AILs_all.tsv

## 5. Create RALs with counts of positive entities
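
Conceptually, a RAL describes an entity through an attribute of one of its neighbors: for an edge from `n1` to `n2`, every attribute label recorded for `n2` becomes a candidate label for each type of `n1`. The plain-Python sketch below shows that two-hop join on toy data. `ral_candidates` and all identifiers in the example are hypothetical; the queries in this section do the same join with `kgtk query` over the combined entity --> attribute label files.

```python
from collections import defaultdict


def ral_candidates(edges, types, entity_labels):
    """Count positives for relational attribute labels.

    edges: iterable of (n1, p1, n2) entity-to-entity triples.
    types: iterable of (entity, type) pairs.
    entity_labels: iterable of (entity, type2, prop2, value) rows, i.e. the
        attribute labels that apply to each entity.
    Returns {(type1, p1, type2, prop2, value): number of distinct n1}.
    """
    types_of = defaultdict(set)
    for entity, type_ in types:
        types_of[entity].add(type_)

    labels_of = defaultdict(set)
    for entity, type2, prop2, value in entity_labels:
        labels_of[entity].add((type2, prop2, value))

    positives = defaultdict(set)
    for n1, p1, n2 in edges:
        for t1 in types_of.get(n1, ()):
            # Every attribute label of the neighbor n2 yields a label for n1.
            for type2, prop2, value in labels_of.get(n2, ()):
                positives[(t1, p1, type2, prop2, value)].add(n1)
    return {label: len(ents) for label, ents in positives.items()}


# Toy data for illustration only.
edges = [("Q_a", "P17", "Q_x"), ("Q_b", "P17", "Q_x"), ("Q_c", "P17", "Q_y")]
types = [("Q_a", "Q_pos"), ("Q_b", "Q_pos"), ("Q_c", "Q_pos")]
entity_labels = [
    ("Q_x", "Q_country", "P2997", "18"),
    ("Q_y", "Q_country", "P2997", "18"),
]
print(ral_candidates(edges, types, entity_labels))
```

Note that the positives are counted over `n1`, the entity being described, not over the neighbor whose attribute supplies the label.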

### 5.1 RALs created from attribute *value* labels:

In [42]:
%%time
!kgtk query -i $ITEM_FILE -i $OUT/type_mapping.tsv -i $LABEL_FILE \
-i $OUT/entity_AVLs_all.tsv -o $OUT/candidate_labels_ravl.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (n1)-[l1 {label:p1}]->(n2), type: (n1)-[]->(t1), entity_AVLs: (t2)-[l2 {label:p2, entity:n2, si_units:si, wd_units:wd}]->(val), `'"$LABEL_FILE"'`: (p2)-[:label]->(lab2)' \
--return 't1 as type1, p1 as prop1, t2 as type2, p2 as prop2, lab2 as prop2_label, val as value, count(distinct n1) as positives, si as si_units, wd as wd_units, "_" as id' \
--order-by "count(distinct n1) desc" \
--where 'lab2.kgtk_lqstring_lang_suffix = "en"'

CPU times: user 1.52 s, sys: 261 ms, total: 1.78 s
Wall time: 1min 37s


In [43]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_ravl", ".tsv", "type1 prop1 type2", "node1 label node2")

In [44]:
display(pd.read_csv("{}/candidate_labels_ravl.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=10).fillna(""))

Unnamed: 0,node1,label,node2,prop2,prop2_label,value,positives,si_units,wd_units,id
0,Q4164871,P17,Q3624078,P2997,'age of majority'@en,18,54112,,Q24564698,E1
1,Q4164871,P17,Q6256,P2997,'age of majority'@en,18,53782,,Q24564698,E2
2,Q4164871,P17,Q6256,P3270,'compulsory education (minimum age)'@en,6,51005,,Q577,E3
3,Q4164871,P17,Q3624078,P3270,'compulsory education (minimum age)'@en,6,51003,,Q577,E4
4,Q4164871,P17,Q6256,P7295,'Gregorian calendar start date'@en,1582,48427,,,E5
5,Q4164871,P17,Q3624078,P7295,'Gregorian calendar start date'@en,1582,48359,,,E6
6,Q4164871,P17,Q51576574,P2997,'age of majority'@en,18,48155,,Q24564698,E7
7,Q4164871,P17,Q3624078,P3271,'compulsory education (maximum age)'@en,16,48056,,Q24564698,E8
8,Q4164871,P17,Q6256,P3271,'compulsory education (maximum age)'@en,16,48056,,Q24564698,E9
9,Q4164871,P17,Q51576574,P3270,'compulsory education (minimum age)'@en,6,48048,,Q577,E10


### 5.2 RALs created from attribute *interval* labels:

In [45]:
%%time
!kgtk query -i $ITEM_FILE -i $OUT/type_mapping.tsv -i $LABEL_FILE \
-i $OUT/entity_AILs_all.tsv -o $OUT/candidate_labels_rail.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (n1)-[l1 {label:p1}]->(n2), type: (n1)-[]->(t1), entity_AILs: (t2)-[l2 {label:p2, entity:n2, lower_bound:lb, upper_bound:ub, wd_units:wd, si_units:si}]->(val), `'"$LABEL_FILE"'`: (p2)-[:label]->(lab2)' \
--return 't1 as type1, p1 as prop1, t2 as type2, p2 as prop2, lab2 as prop2_label, si as si_units, wd as wd_units, lb as lower_bound, ub as upper_bound, count(distinct n1) as positives, "_" as id' \
--order-by "count(distinct n1) desc" \
--where 'lab2.kgtk_lqstring_lang_suffix = "en"'

CPU times: user 1.64 s, sys: 287 ms, total: 1.93 s
Wall time: 1min 44s


In [46]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_rail", ".tsv", "type1 prop1 type2", "node1 label node2")

In [47]:
display(pd.read_csv("{}/candidate_labels_rail.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=10).fillna(""))

Unnamed: 0,node1,label,node2,prop2,prop2_label,si_units,wd_units,lower_bound,upper_bound,positives,id
0,Q4164871,P17,Q3624078,P571,'inception'@en,,,1620.5,,56851,E1
1,Q4164871,P17,Q3624078,P2219,'real gross domestic product growth rate'@en,,Q11229,-4.75,9.05,56804,E2
2,Q4164871,P17,Q6256,P2219,'real gross domestic product growth rate'@en,,Q11229,-4.75,9.05,56508,E3
3,Q4164871,P17,Q3624078,P1081,'Human Development Index'@en,,,0.626,,56477,E4
4,Q4164871,P17,Q6256,P571,'inception'@en,,,1637.0,,56473,E5
5,Q4164871,P17,Q3624078,P2299,'PPP GDP per capita'@en,,Q550207,,50391.5325,55697,E6
6,Q4164871,P17,Q6256,P1081,'Human Development Index'@en,,,0.6695,0.946,55491,E7
7,Q4164871,P17,Q3624078,P2046,'area'@en,,Q712226,,1181392.5,55005,E8
8,Q4164871,P17,Q6256,P2046,'area'@en,,Q712226,,1425608.0,54716,E9
9,Q4164871,P17,Q6256,P2855,'VAT-rate'@en,,Q11229,8.35,26.0,54672,E10
