# Re-implementing the procedure outlined in "Entity Profiling in Knowledge Graphs" (Zhang et al.)
# This notebook will implement the candidate label creation step

### Pre-requisite steps to run this notebook
1. If you do not have kgtk installed, or do not have the kgtk query command, first install this with `pip install -e <path to local kgtk repo>`
2. kneed (https://pypi.org/project/kneed/) is a dependency. Install this either through Anaconda with `conda install -c conda-forge kneed`, or through pip with `pip install kneed`
3. You'll need to have a subset of wikidata partitioned into different files on your machine. You need to create this yourself, or if you have access to the Table_Linker google drive then you can download the Q44 example data here: https://drive.google.com/drive/folders/1U3Tc25rRwu6xy74mPDOG5LIjhUXpbD9A?usp=sharing

In [2]:
import os
import pandas as pd
from utility import run_command
from utility import rename_cols_and_overwrite_id
from label_discretization import discretize_labels

### Parameters
**Required**  
*item_file*: file path for the file that contains entity to entity relationships (e.g. wikibase-item)  
*time_file*: file path for the file that contains entity to time-type values  
*quantity_file*: file path for the file that contains entity to quantity-type values  
*label_file*: file path for the file that contains wikidata labels  
*work_dir*: path to folder where files created by this notebook should be stored  
*store_dir*: path to folder containing the sqlite3.db file that we will use for our queries. We will reuse an existing file if there is one in this folder. Otherwise we will create a new one.

**Optional**    
*string_file*: file path for the file that contains entity to string-type values  

In [3]:
data_dir = "../../Q44/data" # my data files are all in the same directory, so I'll reuse this path prefix

# **REQUIRED**
item_file = "{}/Q44.part.wikibase-item.tsv".format(data_dir)
time_file = "{}/Q44.part.time.tsv".format(data_dir)
quantity_file = "{}/Q44.part.quantity.tsv".format(data_dir)
label_file = "{}/Q44.label.en.tsv".format(data_dir)
work_dir = "../../Q44/profiler_work"
store_dir = "../../Q44"

# **optional**
string_file = None #"{}/Q44.part.string.tsv".format(data_dir)

### Process parameters and set up variables / file names

In [4]:
# Ensure paths are absolute
item_file = os.path.abspath(item_file)
time_file = os.path.abspath(time_file)
quantity_file = os.path.abspath(quantity_file)
label_file = os.path.abspath(label_file)
work_dir = os.path.abspath(work_dir)
store_dir = os.path.abspath(store_dir)
if string_file:
    string_file = os.path.abspath(string_file)
    
# Create directories (exist_ok=True makes a separate existence check unnecessary)
os.makedirs(work_dir, exist_ok=True)
output_dir = "{}/label_creation".format(work_dir)
os.makedirs(output_dir, exist_ok=True)

# adding some environment variables we'll be using frequently
os.environ['ITEM_FILE'] = item_file
os.environ['TIME_FILE'] = time_file
os.environ['QUANTITY_FILE'] = quantity_file
os.environ['LABEL_FILE'] = label_file
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(store_dir)
os.environ['OUT'] = output_dir
os.environ['kgtk'] = "kgtk" # Need to do this for kgtk to be recognized as a command when passing it through a subprocess call

# Outline of procedure:
**Goal**:<br>
We want to create candidate label sets including
- Attribute value labels (type, property, *attribute*)
- Relational entity labels (type, property, *entity*)
- Attribute interval labels (type, property, *range of attribute values*)
- Relational attribute labels (type, property, *attribute or attribute range of another entity*)

To enable subsequent filtering of these labels, we also want to count:
- The number of entities of each type
- The number of entities that match each label (call these "positives")
    
**Steps**:

0. Create type-mapping
1. Count the number of entities of each type
    - *optional future step*: define type with P279 transitive closure in addition to P31. 
2. Create AVLs trivially from attribute files along with counts of the positive entities for each label
    - At this step, we should also contribute to a mapping of entities --> matching attribute labels to facilitate creating RALs in a later step  
3. Create RELs trivially from entity relation files along with counts of positive entities for each label
4. Create AILs by discretizing the AVLs we found, along with counts of positive entities for each label
    - See label_discretization notebook for some exploration of discretization approach that led to the method that is implemented in this notebook
    - At this step, we should also contribute to a mapping of entities --> matching attribute labels to facilitate creating RALs in a later step
5. Create RALs by using the entities --> attribute labels table that we built in steps 2 and 4. Also keep track of counts of positive entities for each label
    
*Misc issues encountered*
- kgtk rename-columns doesn't always work when the input file and the output file are the same. The current workaround is to write the result to a temp file first.
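
The temp-file workaround can be sketched in plain Python. This is a minimal illustration, not the code in `utility.py`: the function name `rename_columns_in_place` is hypothetical, and the `E1, E2, ...` id scheme is only inferred from the outputs shown later in this notebook.

```python
import os
import tempfile

import pandas as pd


def rename_columns_in_place(path, old_names, new_names):
    """Rename TSV columns without reading and writing the same file at once.

    Hypothetical helper: old_names and new_names are space-separated lists,
    mirroring how the notebook passes column names around.
    """
    df = pd.read_csv(path, sep="\t", dtype=str, keep_default_na=False)
    df = df.rename(columns=dict(zip(old_names.split(), new_names.split())))
    if "id" in df.columns:
        # Assumed id scheme: E1, E2, ... in row order.
        df["id"] = ["E{}".format(i) for i in range(1, len(df) + 1)]
    # Write to a temp file in the same directory, then swap it into place.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(suffix=".tsv", dir=directory)
    os.close(fd)
    df.to_csv(tmp_path, sep="\t", index=False)
    os.replace(tmp_path, path)


# Small demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as workdir:
    demo_path = os.path.join(workdir, "demo.tsv")
    with open(demo_path, "w") as f:
        f.write("type\ttype_label\tcount\tid\nQ3957\t'town'@en\t3\t_\n")
    rename_columns_in_place(demo_path, "type type_label count", "node1 label node2")
    with open(demo_path) as f:
        header = f.readline().rstrip("\n").split("\t")
        first_row = f.readline().rstrip("\n").split("\t")
print(header, first_row)
```

Writing the temp file next to the target keeps `os.replace` on one filesystem, so the swap does not leave a half-written file behind.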

## 0. Create type mapping
The mapping is from an entity (Q node) to the entity's type (another Q node). We use P31 only for now, but P279* can be added later.
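
If P279* is added later, each entity's direct P31 types would be extended with every superclass reachable through P279 edges. The sketch below shows that expansion on a toy hierarchy. The function names and the class names (`island_nation`, `country`, `political_entity`) are illustrative placeholders, not Wikidata facts and not part of this notebook's code.

```python
def ancestors(direct_superclasses, start):
    """Return every class reachable from `start` via subclass-of edges."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for parent in direct_superclasses.get(node, ()):
            if parent not in seen:  # the seen-set also guards against cycles
                seen.add(parent)
                stack.append(parent)
    return seen


def expand_types(instance_of, subclass_of):
    """Extend each entity's direct types with all of their superclasses."""
    expanded = {}
    for entity, types in instance_of.items():
        all_types = set(types)
        for t in types:
            all_types |= ancestors(subclass_of, t)
        expanded[entity] = all_types
    return expanded


# Toy data: one entity, a two-level subclass chain.
instance_of = {"Q1011": {"island_nation"}}
subclass_of = {"island_nation": {"country"}, "country": {"political_entity"}}
expanded = expand_types(instance_of, subclass_of)
print(expanded)
```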

In [8]:
!kgtk filter -p ' ; P31 ; ' -i $ITEM_FILE -o $OUT/type_mapping.tsv

In [9]:
!head -5 $OUT/type_mapping.tsv | column -t -s $'\t'

id              node1     label  node2
Q1000597-P31-1  Q1000597  P31    Q3957
Q1011-P31-2     Q1011     P31    Q112099
Q1011-P31-1     Q1011     P31    Q3624078
Q1019-P31-2     Q1019     P31    Q112099
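
For readers less familiar with kgtk filter patterns, the pattern `' ; P31 ; '` keeps edges with any node1, label P31, and any node2. A rough pandas equivalent on a toy edge table (the rows are illustrative) might look like this:

```python
import io

import pandas as pd

# Toy edge file in KGTK column order; the rows are illustrative only.
edges_tsv = (
    "id\tnode1\tlabel\tnode2\n"
    "Q1011-P31-1\tQ1011\tP31\tQ3624078\n"
    "Q1011-P31-2\tQ1011\tP31\tQ112099\n"
    "Q1011-P17-1\tQ1011\tP17\tQ1011\n"
)
edges = pd.read_csv(io.StringIO(edges_tsv), sep="\t")

# Keep only the instance-of (P31) edges; node1 and node2 are unconstrained.
type_mapping = edges[edges["label"] == "P31"].reset_index(drop=True)
print(type_mapping)
```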


## 1. Count number of entities of each type:
Use the entity --> type mapping we created in step 0 to do this

In [9]:
!kgtk query -i $OUT/type_mapping.tsv -i $LABEL_FILE \
-o $OUT/entity_counts_per_type.tsv --graph-cache $STORE \
--match 'type: (n1)-[]->(type), `'"$LABEL_FILE"'`: (type)-[:label]->(lab)' \
--return 'distinct type as type, lab as type_label, count(distinct n1) as count, "_" as id' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc'

In [10]:
rename_cols_and_overwrite_id("$OUT/entity_counts_per_type", ".tsv", "type type_label count", "node1 label node2")

In [11]:
!head -10 $OUT/entity_counts_per_type.tsv | column -t -s $'\t'

node1     label                    node2  id
Q131734   'brewery'@en             87     E1
Q3624078  'sovereign state'@en     69     E2
Q4830453  'business'@en            50     E3
Q6256     'country'@en             26     E4
Q6881511  'enterprise'@en          23     E5
Q7270     'republic'@en            18     E6
Q179164   'unitary state'@en       16     E7
Q1998962  'beer style'@en          16     E8
Q123480   'landlocked country'@en  15     E9
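
The same count can be cross-checked outside of kgtk. The sketch below is an illustration on toy rows rather than the notebook's actual pipeline: it groups a type mapping by type and counts distinct entities, which is what `count(distinct n1)` does in the query above.

```python
import pandas as pd

# Toy type mapping: node1 is the entity, node2 is its type.
type_mapping = pd.DataFrame({
    "node1": ["Q1011", "Q1011", "Q1019", "Q1019", "Q1000597"],
    "label": ["P31"] * 5,
    "node2": ["Q112099", "Q3624078", "Q112099", "Q3624078", "Q3957"],
})

# Count distinct entities per type, largest types first.
entity_counts = (
    type_mapping.groupby("node2")["node1"]
    .nunique()
    .sort_values(ascending=False)
    .rename("count")
    .reset_index()
    .rename(columns={"node2": "type"})
)
print(entity_counts)
```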


## 2. Create AVLs with counts of positive entities
At this step we also want to keep track of entities --> matching attribute labels for future use. This will help when we are creating RALs (step 5)

To accomplish these goals, we will do the following:

For each attribute type (string, time, quantity), we will:
1. Use the entity --> type mapping along with the attribute data file to create an entity_attribute_labels file that has a mapping of entity --> labels applicable to the entity.
2. Use the entity_attribute_labels file to aggregate labels with counts of matching entities, which we will save in a candidate_labels file.

### 2.1 Strings
Creating mapping of entity --> string attribute labels

In [12]:
if not string_file:
    print("No string attribute file was provided in the parameters section, skipping this step.")
else:
    # perform query
    command = "$kgtk query -i $OUT/type_mapping.tsv -i STRING_FILE -i LABEL_FILE \
               -o $OUT/entity_attribute_labels_string.tsv --graph-cache $STORE \
               --match '`STRING_FILE`: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), `LABEL_FILE`: (p)-[:label]->(lab)' \
               --return 'distinct n1 as entity, type as type, p as prop, n2 as value, lab as property_label, \"_\" as id' \
               --where 'lab.kgtk_lqstring_lang_suffix = \"en\"' \
               --order-by 'n1'"
    run_command(command, {"STRING_FILE" : string_file, "LABEL_FILE" : label_file})
    # reformat columns to be in KGTK format
    rename_cols_and_overwrite_id("$OUT/entity_attribute_labels_string", ".tsv", "type prop value", "node1 label node2")
    # view header of result
    run_command("head -5 $OUT/entity_attribute_labels_string.tsv | column -t -s $'\t'")

No string attribute file was provided in the parameters section, skipping this step.
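
The string cells build their command as a Python string that mixes `$`-style environment variables with bare placeholders such as `STRING_FILE`. How `run_command` in `utility.py` resolves these is not shown in this notebook, so the sketch below is only one plausible way to do it. The function name `expand_command` and the variable `DEMO_OUT` are hypothetical.

```python
import os


def expand_command(template, substitutions=None):
    """Fill in bare placeholders, then expand $VARIABLES from the environment.

    Hypothetical helper; not the actual run_command implementation.
    """
    command = template
    for placeholder, value in (substitutions or {}).items():
        command = command.replace(placeholder, value)
    # os.path.expandvars leaves unknown variables untouched.
    return os.path.expandvars(command)


os.environ["DEMO_OUT"] = "/tmp/label_creation"
expanded_command = expand_command(
    "kgtk cat -i $DEMO_OUT/a.tsv -i STRING_FILE",
    {"STRING_FILE": "/data/Q44.part.string.tsv"},
)
print(expanded_command)
```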


Aggregating distinct labels w/ positive entity counts

In [13]:
if not string_file:
    print("No string attribute file was provided in the parameters section, skipping this step.")
else:
    # perform query
    command = "$kgtk query -i $OUT/entity_attribute_labels_string.tsv \
               -o $OUT/candidate_labels_avl_string.tsv --graph-cache $STORE \
               --match 'labels: (type)-[l {label:prop, property_label:lab, entity:e}]->(val)' \
               --return 'distinct type as type, prop as prop, val as value, count(distinct e) as positives, lab as property_label, \"_\" as id' \
               --order-by 'count(distinct e) desc'"
    run_command(command)
    # reformat columns to be in KGTK format
    rename_cols_and_overwrite_id("$OUT/candidate_labels_avl_string", ".tsv", "type prop value", "node1 label node2")
    # view header of result
    run_command("head -5 $OUT/candidate_labels_avl_string.tsv | column -t -s $'\t'")

No string attribute file was provided in the parameters section, skipping this step.


### 2.2 Times

Looking at what precisions we need to deal with...

In [14]:
!kgtk query -i $TIME_FILE -i $LABEL_FILE \
--graph-cache $STORE \
--match '`'"$TIME_FILE"'`: (n1)-[l {label:p}]->(n2), `'"$LABEL_FILE"'`: (p)-[:label]->(lab)' \
--return 'distinct kgtk_date_precision(n2) as precisions, count(n1) as count' \
--limit 10 \
| column -t -s $'\t'

precisions  count
6           12
7           40
8           12
9           697
10          48
11          469


The output above shows several precisions coarser than year (precision 9): 6 (millennium), 7 (century), and 8 (decade). We don't have kgtk type interpretation functions for these granularities, so for now we'll interpret them all as years. Furthermore, we will interpret all times at the year granularity for now, including the finer precisions 10 (month) and 11 (day).

Additional work can be done later to create labels with finer time granularity if desired.
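
To make the year-level interpretation concrete, here is a small sketch that pulls the year and the precision code out of a KGTK date literal. It assumes literals of the form `^YYYY-MM-DDThh:mm:ssZ/precision`. The regular expression and the function name `year_and_precision` are illustrative and are not how `kgtk_date_year` is implemented.

```python
import re

# Assumed literal shape: ^<year>-<month>-<day>T<time>/<precision>
KGTK_DATE = re.compile(
    r"^\^(?P<year>[+-]?\d+)-(?P<month>\d{2})-(?P<day>\d{2})"
    r"(?:T[^/]*)?(?:/(?P<precision>\d+))?$"
)


def year_and_precision(value):
    """Return (year, precision) for a date literal, or None if it does not parse."""
    match = KGTK_DATE.match(value)
    if match is None:
        return None
    precision = match.group("precision")
    year = int(match.group("year"))
    return year, (int(precision) if precision is not None else None)


print(year_and_precision("^1975-07-05T00:00:00Z/11"))
print(year_and_precision("^-3500-01-01T00:00:00Z/9"))
```

Whatever the precision code is, only the year component is kept, which matches the decision above to treat every time value at year granularity.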

Creating mapping of entity --> year attribute labels

In [15]:
!kgtk query -i $OUT/type_mapping.tsv -i $TIME_FILE -i $LABEL_FILE \
-o $OUT/entity_attribute_labels_time.year.tsv --graph-cache $STORE \
--match '`'"$TIME_FILE"'`: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), `'"$LABEL_FILE"'`: (p)-[:label]->(p_lab), `'"$LABEL_FILE"'`: (type)-[:label]->(t_lab)' \
--return 'distinct n1 as entity, type as type, p as prop, kgtk_date_year(n2) as value, t_lab as type_label, p_lab as property_label, "_" as id' \
--where 't_lab.kgtk_lqstring_lang_suffix = "en" AND p_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'n1'

In [16]:
rename_cols_and_overwrite_id("$OUT/entity_attribute_labels_time.year", ".tsv", "type prop value", "node1 label node2")

In [17]:
!head -5 $OUT/entity_attribute_labels_time.year.tsv | column -t -s $'\t'

entity  node1     label  node2  type_label            property_label  id
Q1011   Q112099   P571   1975   'island nation'@en    'inception'@en  E1
Q1011   Q3624078  P571   1975   'sovereign state'@en  'inception'@en  E2
Q1019   Q112099   P571   1960   'island nation'@en    'inception'@en  E3
Q1019   Q3624078  P571   1960   'sovereign state'@en  'inception'@en  E4


Aggregating distinct labels w/ positive entity counts

In [18]:
!kgtk query -i $OUT/entity_attribute_labels_time.year.tsv \
-o $OUT/candidate_labels_avl_time.year.tsv --graph-cache $STORE \
--match 'labels: (n1)-[l {label:p, property_label:lab, entity:e}]->(val)' \
--return 'distinct n1 as type, p as prop, val as value, count(distinct e) as positives, lab as property_label, "_" as id' \
--order-by 'count(distinct e) desc'

In [19]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_avl_time.year", ".tsv", "type prop value", "node1 label node2")

In [20]:
!head -5 $OUT/candidate_labels_avl_time.year.tsv | column -t -s $'\t'

node1     label  node2  positives  property_label  id
Q3624078  P571   1991   8          'inception'@en  E1
Q3624078  P571   1918   7          'inception'@en  E2
Q6256     P571   1918   5          'inception'@en  E3
Q6256     P571   1991   5          'inception'@en  E4
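
The aggregation step amounts to grouping the entity --> labels rows by (type, property, value) and counting distinct entities. A pandas sketch on toy rows (the values are illustrative, and the column names follow the renamed file shown above):

```python
import pandas as pd

# Toy entity --> labels rows: node1 is the type, label the property, node2 the value.
entity_labels = pd.DataFrame({
    "entity": ["Q1011", "Q1011", "Q1019", "Q1019", "Q1027"],
    "node1": ["Q112099", "Q3624078", "Q112099", "Q3624078", "Q3624078"],
    "label": ["P571"] * 5,
    "node2": [1975, 1975, 1960, 1960, 1975],
})

# One row per distinct label, with the number of distinct matching entities.
avl = (
    entity_labels.groupby(["node1", "label", "node2"])["entity"]
    .nunique()
    .rename("positives")
    .reset_index()
    .sort_values("positives", ascending=False)
    .reset_index(drop=True)
)
print(avl)
```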


### 2.3 Quantities
Creating mapping of entity --> quantity attribute labels

Note, quantities may have units. We will separate out the quantity value and units into separate columns
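
As an illustration of this split, the sketch below separates a KGTK quantity literal into its number, SI unit, and Wikidata unit. It assumes literals such as `+0.801`, `+18Q24564698`, or `+230[+220,+240]Q25250` (a number, an optional bracketed tolerance, and an optional unit). The parser is a simplification written for this example, not the implementation of the `kgtk_quantity_*` functions.

```python
import re

# Assumed shape: <number>[<tolerance>]<unit>, where tolerance and unit are optional.
QUANTITY = re.compile(
    r"^(?P<number>[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"
    r"(?:\[[^\]]*\])?"
    r"(?P<unit>.*)$"
)


def split_quantity(value):
    """Return (number, si_units, wd_units), or None if the literal does not parse."""
    match = QUANTITY.match(value)
    if match is None:
        return None
    unit = match.group("unit")
    # A unit that looks like a Q node is treated as a Wikidata unit.
    wd_units = unit if re.fullmatch(r"Q\d+", unit) else ""
    si_units = "" if wd_units else unit
    return float(match.group("number")), si_units, wd_units


print(split_quantity("+18Q24564698"))
print(split_quantity("+0.801"))
```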

In [21]:
!kgtk query -i $OUT/type_mapping.tsv -i $QUANTITY_FILE -i $LABEL_FILE \
-o $OUT/entity_attribute_labels_quantity.tsv --graph-cache $STORE \
--match '`'"$QUANTITY_FILE"'`: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), `'"$LABEL_FILE"'`: (p)-[:label]->(p_lab), `'"$LABEL_FILE"'`: (type)-[:label]->(t_lab)' \
--return 'distinct n1 as entity, type as type, p as prop, kgtk_quantity_number(n2) as value, kgtk_quantity_si_units(n2) as si_units, kgtk_quantity_wd_units(n2) as wd_units, t_lab as type_label, p_lab as property_label, "_" as id' \
--where 't_lab.kgtk_lqstring_lang_suffix = "en" AND p_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'n1'

In [22]:
rename_cols_and_overwrite_id("$OUT/entity_attribute_labels_quantity", ".tsv", "type prop value", "node1 label node2")

In [23]:
display(pd.read_csv("{}/entity_attribute_labels_quantity.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,entity,node1,label,node2,si_units,wd_units,type_label,property_label,id
0,Q1000597,Q3957,P1082,75074.0,,,'town'@en,'population'@en,E1
1,Q1011,Q112099,P1081,0.57,,,'island nation'@en,'Human Development Index'@en,E2
2,Q1011,Q3624078,P1081,0.57,,,'sovereign state'@en,'Human Development Index'@en,E3
3,Q1011,Q112099,P1081,0.572,,,'island nation'@en,'Human Development Index'@en,E4
4,Q1011,Q3624078,P1081,0.572,,,'sovereign state'@en,'Human Development Index'@en,E5


Aggregating distinct labels w/ positive entity counts

In [24]:
!kgtk query -i $OUT/entity_attribute_labels_quantity.tsv \
-o $OUT/candidate_labels_avl_quantity.tsv --graph-cache $STORE \
--match 'labels: (n1)-[l {label:p, property_label:lab, entity:e, si_units:si, wd_units:wd}]->(val)' \
--return 'distinct n1 as type, p as prop, val as value, count(distinct e) as positives, lab as property_label, "_" as id, si as si_units, wd as wd_units' \
--order-by 'count(distinct e) desc'

In [25]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_avl_quantity", ".tsv", "type prop value", "node1 label node2")

In [26]:
display(pd.read_csv("{}/candidate_labels_avl_quantity.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,label,node2,positives,property_label,id,si_units,wd_units
0,Q3624078,P3000,18.0,54,'marriageable age'@en,E1,,Q24564698
1,Q3624078,P2997,18.0,52,'age of majority'@en,E2,,Q24564698
2,Q3624078,P2884,230.0,40,'mains voltage'@en,E3,,Q25250
3,Q3624078,P1279,1.7,25,'inflation rate'@en,E4,,Q11229
4,Q3624078,P1279,2.8,25,'inflation rate'@en,E5,,Q11229


### 2.4 Combining entity --> attribute label mappings to single table

In [27]:
command = "$kgtk cat \
           -i $OUT/entity_attribute_labels_time.year.tsv \
           -i $OUT/entity_attribute_labels_quantity.tsv \
           -o $OUT/entity_AVLs_all.tsv"
if string_file:
    command += " -i $OUT/entity_attribute_labels_string.tsv"

run_command(command)

## 3. Create RELs with counts of positive entities
We do this the same way we created AVLs, except we use the entity to entity relation data file, and we don't need to save the intermediate entity --> labels file since these labels won't contribute to RALs later

In [28]:
!kgtk query -i $ITEM_FILE -i $OUT/type_mapping.tsv -i $LABEL_FILE \
-o $OUT/candidate_labels_rel_item.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), `'"$LABEL_FILE"'`: (p)-[:label]->(lab)' \
--return 'distinct type as type, p as prop, n2 as value, count(distinct n1) as positives, lab as property_label, "_" as id' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc'

In [29]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_rel_item", ".tsv", "type prop value", "node1 label node2")

In [30]:
!head -10 $OUT/candidate_labels_rel_item.tsv | column -t -s $'\t'

node1     label  node2    positives  property_label            id
Q131734   P452   Q869095  77         'industry'@en             E1
Q3624078  P463   Q1065    68         'member of'@en            E2
Q3624078  P463   Q7817    67         'member of'@en            E3
Q3624078  P530   Q183     67         'diplomatic relation'@en  E4
Q3624078  P463   Q376150  66         'member of'@en            E5
Q3624078  P530   Q865     66         'diplomatic relation'@en  E6
Q3624078  P463   Q17495   65         'member of'@en            E7
Q3624078  P463   Q191384  65         'member of'@en            E8
Q3624078  P463   Q656801  65         'member of'@en            E9


## 4. Create AILs with counts of positive entities
Similar to what we did for AVLs, we also want to keep track of entities --> matching attribute labels for future use in RAL creation (step 5)

We will create attribute *interval* labels from our attribute *value* labels that we previously created. The code that does this is explored in the explore_label_discretization notebook, and implemented in label_discretization.py.

For each entity --> labels file that has a numeric value type (year or quantity) we will:
1. Create a corresponding entity --> bucketed labels file. For example, a label in the input that looks like <country, population, 1,000,000> might get summarized (bucketed) in the output to look like <country, population, (500,000, 2,000,000)>.
2. Use the resulting bucketed entity_attribute_labels file to once again aggregate labels with counts of matching entities. This will give us a candidate_labels_ail file.

*Note on syntax we are using for ranges*: we will define ranges with lower and upper bounds. Ranges may have blank values for the lower and/or upper bounds. A range that only has an upper bound means the bin includes all values <= the upper bound. A range that has no lower or upper bound denotes a single bin that includes all values. Such ranges may be created for a <type, property> pair that has very few data points.
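
The range convention can be expressed as a small membership test. One assumption is made here that the note above does not settle: the lower bound is treated as exclusive, so that adjacent bins do not overlap. The function name `in_range` is illustrative.

```python
from typing import Optional


def in_range(value: float, lower: Optional[float], upper: Optional[float]) -> bool:
    """Check whether value falls in a range whose bounds may be blank (None).

    Assumption: lower bound exclusive, upper bound inclusive.
    """
    above_lower = lower is None or value > lower
    below_upper = upper is None or value <= upper
    return above_lower and below_upper


print(in_range(1975, 1615.5, None))   # only a lower bound
print(in_range(0.57, 0.565, 0.5865))  # both bounds
print(in_range(42, None, None))       # single bin that includes all values
```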

Note that the code prints some output about labels for which it may not be creating good buckets. We silence this output so it doesn't take up too much space when viewing on GitHub. To see this output, comment out the `%%capture` lines.

### 4.1 Years
Create entity --> bucketed labels file

In [31]:
%%capture
avl_file_in = "{}/entity_attribute_labels_time.year.tsv".format(os.environ["OUT"])
ail_file_out = "{}/entity_attribute_labels_time.year_bucketed.tsv".format(os.environ["OUT"])
discretize_labels(avl_file_in, ail_file_out)

In [32]:
display(pd.read_csv("{}/entity_attribute_labels_time.year_bucketed.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=11).fillna(""))


Unnamed: 0,entity,node1,label,node2,type_label,property_label,id,lower_bound,upper_bound
0,Q1011,Q112099,P571,1975,'island nation'@en,'inception'@en,E1,619.0,
1,Q1011,Q3624078,P571,1975,'sovereign state'@en,'inception'@en,E2,1615.5,
2,Q1019,Q112099,P571,1960,'island nation'@en,'inception'@en,E3,619.0,
3,Q1019,Q3624078,P571,1960,'sovereign state'@en,'inception'@en,E4,1615.5,
4,Q1020773,Q3957,P571,1892,'town'@en,'inception'@en,E5,,
5,Q1020773,Q902814,P571,1892,'border town'@en,'inception'@en,E6,,
6,Q1027,Q112099,P571,1968,'island nation'@en,'inception'@en,E7,619.0,
7,Q1027,Q2221906,P571,1968,'geographic location'@en,'inception'@en,E8,,
8,Q1027,Q3624078,P571,1968,'sovereign state'@en,'inception'@en,E9,1615.5,
9,Q1027,Q4198907,P571,1968,'parliamentary republic'@en,'inception'@en,E10,,


Aggregating distinct interval labels with positive entity counts

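**NOTE: Below, the lower_bound column is renamed to be node2**. Also note that the query uses `count(e)` rather than `count(distinct e)`, so positives count matching rows: an entity with several values in the same bucket is counted once per value. This is why a label's positives can exceed the number of entities of its type from step 1 (e.g. 71 below vs. 69 for Q3624078).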

In [33]:
!kgtk query -i $OUT/entity_attribute_labels_time.year_bucketed.tsv \
-o $OUT/candidate_labels_ail_time.year.tsv \
--graph-cache $STORE \
--match 'labels: (type)-[l {label:prop, property_label:lab, entity:e, lower_bound:lb, upper_bound:ub}]->(val)' \
--return 'type as type, prop as prop, lb as lower_bound, ub as upper_bound, count(e) as positives, lab as property_label, "_" as id' \
--order-by 'count(e) desc'

In [34]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_ail_time.year", ".tsv", "type prop lower_bound", "node1 label node2")

In [35]:
display(pd.read_csv("{}/candidate_labels_ail_time.year.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=15).fillna(""))

Unnamed: 0,node1,label,node2,upper_bound,positives,property_label,id
0,Q3624078,P571,1615.5,,71,'inception'@en,E1
1,Q4830453,P571,1733.5,,46,'inception'@en,E2
2,Q131734,P571,1922.5,,20,'inception'@en,E3
3,Q51576574,P571,956.5,,20,'inception'@en,E4
4,Q7270,P571,1635.5,,19,'inception'@en,E5
5,Q131734,P571,1797.0,1922.5,16,'inception'@en,E6
6,Q179164,P571,1859.5,,14,'inception'@en,E7
7,Q123480,P571,1529.5,,13,'inception'@en,E8
8,Q3624078,P571,581.5,1370.5,12,'inception'@en,E9
9,Q112099,P571,619.0,,11,'inception'@en,E10
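
When reproducing this aggregation in pandas, keep in mind that `groupby` drops rows whose key is NaN by default, which would discard every open-ended range. The sketch below sidesteps this by replacing blank bounds with -inf and +inf before grouping. That replacement is a choice made for this example, and the rows are toy data. It also reports both the distinct-entity count and the plain row count for each interval label.

```python
import numpy as np
import pandas as pd

# Toy bucketed rows; Q1011 appears twice in the same bucket (two recorded values).
bucketed = pd.DataFrame({
    "entity": ["Q1011", "Q1011", "Q1019", "Q1027", "Q1020773"],
    "node1": ["Q3624078", "Q3624078", "Q3624078", "Q3624078", "Q3957"],
    "label": ["P571"] * 5,
    "lower_bound": [1615.5, 1615.5, 1615.5, 1615.5, np.nan],
    "upper_bound": [np.nan] * 5,
})

# Blank bounds become infinite so that no group key is NaN.
filled = bucketed.assign(
    lower_bound=bucketed["lower_bound"].fillna(-np.inf),
    upper_bound=bucketed["upper_bound"].fillna(np.inf),
)

ail = (
    filled.groupby(["node1", "label", "lower_bound", "upper_bound"])
    .agg(positives=("entity", "nunique"), rows=("entity", "size"))
    .reset_index()
    .sort_values("positives", ascending=False)
    .reset_index(drop=True)
)
print(ail)
```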


### 4.2 Quantities
Create entity --> bucketed labels file

In [36]:
%%capture
avl_file_in = "{}/entity_attribute_labels_quantity.tsv".format(os.environ["OUT"])
ail_file_out = "{}/entity_attribute_labels_quantity_bucketed.tsv".format(os.environ["OUT"])
discretize_labels(avl_file_in, ail_file_out)

In [37]:
display(pd.read_csv("{}/entity_attribute_labels_quantity_bucketed.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=10).fillna(""))

Unnamed: 0,entity,node1,label,node2,si_units,wd_units,type_label,property_label,id,lower_bound,upper_bound
0,Q1000597,Q3957,P1082,75074.0,,,'town'@en,'population'@en,E1,,
1,Q1011,Q112099,P1081,0.57,,,'island nation'@en,'Human Development Index'@en,E2,0.534,0.616
2,Q1011,Q3624078,P1081,0.57,,,'sovereign state'@en,'Human Development Index'@en,E3,0.565,0.5865
3,Q1011,Q112099,P1081,0.572,,,'island nation'@en,'Human Development Index'@en,E4,0.534,0.616
4,Q1011,Q3624078,P1081,0.572,,,'sovereign state'@en,'Human Development Index'@en,E5,0.565,0.5865
5,Q1011,Q112099,P1081,0.575,,,'island nation'@en,'Human Development Index'@en,E6,0.534,0.616
6,Q1011,Q3624078,P1081,0.575,,,'sovereign state'@en,'Human Development Index'@en,E7,0.565,0.5865
7,Q1011,Q112099,P1081,0.585,,,'island nation'@en,'Human Development Index'@en,E8,0.534,0.616
8,Q1011,Q3624078,P1081,0.585,,,'sovereign state'@en,'Human Development Index'@en,E9,0.565,0.5865
9,Q1011,Q112099,P1081,0.589,,,'island nation'@en,'Human Development Index'@en,E10,0.534,0.616


Aggregating distinct interval labels with positive entity counts

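**NOTE: Below, the lower_bound column is renamed to be node2**. As in 4.1, the query uses `count(e)`, so positives count matching rows (one per recorded value) rather than distinct entities, which is why the counts below run into the thousands.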

In [38]:
!kgtk query -i $OUT/entity_attribute_labels_quantity_bucketed.tsv \
-o $OUT/candidate_labels_ail_quantity.tsv \
--graph-cache $STORE \
--match 'labels: (type)-[l {label:prop, property_label:lab, si_units:si, wd_units:wd, entity:e, lower_bound:lb, upper_bound:ub}]->(val)' \
--return 'type as type, prop as prop, si as si_units, wd as wd_units, lb as lower_bound, ub as upper_bound, count(e) as positives, lab as property_label, "_" as id' \
--order-by 'count(e) desc'

In [39]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_ail_quantity", ".tsv", "type prop lower_bound", "node1 label node2")

In [40]:
display(pd.read_csv("{}/candidate_labels_ail_quantity.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=10).fillna(""))

Unnamed: 0,node1,label,si_units,wd_units,node2,upper_bound,positives,property_label,id
0,Q3624078,P2131,,Q4917,,800249400000.0,2752,'nominal GDP'@en,E1
1,Q3624078,P1082,,,,41699680.0,2689,'population'@en,E2
2,Q3624078,P2134,,Q4917,,67399250000.0,2529,'total reserves'@en,E3
3,Q3624078,P2132,,Q4917,,21120.5,2424,'nominal GDP per capita'@en,E4
4,Q3624078,P1081,,,0.595,0.9405,1893,'Human Development Index'@en,E5
5,Q3624078,P1279,,Q11229,-5.6,42.7,1516,'inflation rate'@en,E6
6,Q3624078,P4010,,Q550207,,1305344000000.0,1514,'GDP (PPP)'@en,E7
7,Q3624078,P2299,,Q550207,,31357.38,1463,'PPP GDP per capita'@en,E8
8,Q6256,P2131,,Q4917,,333496000000.0,951,'nominal GDP'@en,E9
9,Q6256,P2132,,Q4917,,29427.0,939,'nominal GDP per capita'@en,E10


### 4.3 Combining entity --> attribute interval label mappings to single table

In [41]:
!kgtk cat \
-i $OUT/entity_attribute_labels_quantity_bucketed.tsv \
-i $OUT/entity_attribute_labels_time.year_bucketed.tsv \
-o $OUT/entity_AILs_all.tsv

## 5. Create RALs with counts of positive entities

### 5.0 Trying out different ways to create these

Here is an idea of what kind of values we should find (creating RALs from scratch for quantity attributes only)

In [42]:
!kgtk query -i $ITEM_FILE -i $QUANTITY_FILE \
-i $OUT/type_mapping.tsv -i $LABEL_FILE --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (n1)-[l1 {label:p1}]->(n2), type: (n1)-[]->(type1), `'"$LABEL_FILE"'`: (p1)-[:label]->(lab1), `'"$QUANTITY_FILE"'`: (n2)-[l2 {label:p2}]->(n3), type: (n2)-[]->(type2), `'"$LABEL_FILE"'`: (p2)-[:label]->(lab2)' \
--return 'distinct type1 as type1, p1 as prop1, type2 as type2, p2 as prop2, n3 as value, count(distinct n1) as positives, lab1 as prop1_label, lab2 as prop2_label' \
--where 'lab1.kgtk_lqstring_lang_suffix = "en"' \
--where 'lab2.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc' \
--limit 5 \
| column -t -s $'\t'

type1     prop1  type2     prop2  value   positives  prop1_label               prop2_label
Q3624078  P530   Q3624078  P1081  +0.801  68         'diplomatic relation'@en  'Human Development Index'@en
Q3624078  P530   Q3624078  P1081  +0.809  68         'diplomatic relation'@en  'Human Development Index'@en
Q3624078  P530   Q3624078  P1081  +0.814  68         'diplomatic relation'@en  'Human Development Index'@en
Q3624078  P530   Q3624078  P1081  +0.824  68         'diplomatic relation'@en  'Human Development Index'@en
Q3624078  P530   Q3624078  P1081  +0.829  68         'diplomatic relation'@en  'Human Development Index'@en


Next, we try to produce the same values from tables we have already built: the REL table and the entities --> attribute labels table from steps 2 and 4. We again keep track of counts of positive entities for each label.

The first attempt reuses the RELs. To count positives this way, we would need to combine the num_pos values of every REL row that resolves to the same label, and it is not clear how to do this in the query. As a result, the query below does not capture the case where type1 --> x1 and type1 --> x2 resolve to the same label, i.e. type1 --> typex --> val. Its results differ from those of the method above, which we are confident in (67 positives here versus 68 above). See further down for an alternative solution.

In [43]:
!kgtk query -i $OUT/candidate_labels_rel_item.tsv -i $OUT/entity_attribute_labels_quantity.tsv \
--graph-cache $STORE \
--match 'rel: (t1)-[l1 {label:p1, positives:num_pos}]->(v1), entity_attribute: (t2)-[l {entity:v1, label:p2}]->(v2)' \
--return 't1 as type, p1 as prop, t2 as value_type, p2 as value_prop, v2 as value_val, num_pos as positives, "_" as id' \
--order-by "kgtk_quantity_number_int(num_pos) desc" \
--limit 5 \
| column -t -s $'\t'

type      prop  value_type  value_prop  value_val  positives  id
Q3624078  P530  Q3624078    P1081       0.801      67         _
Q3624078  P530  Q4209223    P1081       0.801      67         _
Q3624078  P530  Q43702      P1081       0.801      67         _
Q3624078  P530  Q619610     P1081       0.801      67         _
Q3624078  P530  Q63791824   P1081       0.801      67         _
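
A toy example may help show why the REL-level counts cannot simply be carried over or added up. Suppose two target entities, x1 and x2, share the same attribute label. The subject sets below are invented for illustration.

```python
# Distinct subjects of type1 that link (via the same property) to each target.
subjects_by_target = {
    "x1": {"a", "b", "c"},
    "x2": {"b", "c", "d"},
}

# Carrying over one REL's positives undercounts: it ignores the other target.
carried_over = len(subjects_by_target["x1"])

# Summing the REL positives overcounts: b and c are counted twice.
summed = sum(len(subjects) for subjects in subjects_by_target.values())

# The correct count is the size of the union of the subject sets.
distinct = len(set().union(*subjects_by_target.values()))

print(carried_over, summed, distinct)
```

Because the RELs only store counts and not the underlying subject sets, the union cannot be recovered from them, which is why the count has to be taken over the original edges.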


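This approach works: it joins the item edges directly against the entity --> attribute labels table and counts distinct n1 per label. Note that these results match the first method, which we are confident in.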

In [44]:
!kgtk query -i $ITEM_FILE -i $OUT/type_mapping.tsv \
-i $OUT/entity_attribute_labels_quantity.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (n1)-[l1 {label:p1}]->(n2), type: (n1)-[]->(t1), entity_attribute: (t2)-[l2 {label:p2, entity:n2}]->(val)' \
--return 't1 as type1, p1 as prop1, t2 as type2, p2 as prop2, val as value, count(distinct n1) as positives, "_" as id' \
--order-by "count(distinct n1) desc" \
--limit 5 \
| column -t -s $'\t'

type1     prop1  type2     prop2  value  positives  id
Q3624078  P530   Q3624078  P1081  0.801  68         _
Q3624078  P530   Q3624078  P1081  0.809  68         _
Q3624078  P530   Q3624078  P1081  0.814  68         _
Q3624078  P530   Q3624078  P1081  0.824  68         _
Q3624078  P530   Q3624078  P1081  0.829  68         _
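
The join performed by this query can be mirrored in pandas as a cross-check. The sketch below uses toy tables with hypothetical entities (a, b, c, x1, x2) and types (T1, T2). The column names follow the files used above, but this is an illustration rather than a replacement for the kgtk query.

```python
import pandas as pd

# Entity-to-entity edges, the type mapping, and entity --> attribute labels.
item_edges = pd.DataFrame({
    "node1": ["a", "b", "b", "c"],
    "label": ["P530"] * 4,
    "node2": ["x1", "x1", "x2", "x2"],
})
type_mapping = pd.DataFrame({"node1": ["a", "b", "c"], "node2": ["T1", "T1", "T1"]})
entity_labels = pd.DataFrame({
    "entity": ["x1", "x2"],
    "node1": ["T2", "T2"],
    "label": ["P1081", "P1081"],
    "node2": [0.8, 0.8],
})

# n1 -[prop1]-> n2, n1 has type1, and n2 carries the label (type2, prop2, value).
joined = (
    item_edges.rename(columns={"node1": "n1", "label": "prop1", "node2": "n2"})
    .merge(type_mapping.rename(columns={"node1": "n1", "node2": "type1"}), on="n1")
    .merge(
        entity_labels.rename(
            columns={"entity": "n2", "node1": "type2", "label": "prop2", "node2": "value"}
        ),
        on="n2",
    )
)

# Count distinct subjects per relational attribute label.
ral = (
    joined.groupby(["type1", "prop1", "type2", "prop2", "value"])["n1"]
    .nunique()
    .rename("positives")
    .reset_index()
)
print(ral)
```

Entity b links to both x1 and x2, yet it contributes only once to the label's positives because the count is taken over distinct n1.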


And now doing this for all attribute types:

### 5.1 RALs created from attribute *value* labels:

In [45]:
!kgtk query -i $ITEM_FILE -i $OUT/type_mapping.tsv -i $LABEL_FILE \
-i $OUT/entity_AVLs_all.tsv -o $OUT/candidate_labels_ravl.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (n1)-[l1 {label:p1}]->(n2), type: (n1)-[]->(t1), entity_AVLs: (t2)-[l2 {label:p2, entity:n2, si_units:si, wd_units:wd}]->(val), `'"$LABEL_FILE"'`: (p2)-[:label]->(lab2)' \
--return 't1 as type1, p1 as prop1, t2 as type2, p2 as prop2, lab2 as prop2_label, val as value, count(distinct n1) as positives, si as si_units, wd as wd_units, "_" as id' \
--order-by "count(distinct n1) desc" \
--where 'lab2.kgtk_lqstring_lang_suffix = "en"'

In [46]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_ravl", ".tsv", "type1 prop1 type2", "node1 label node2")

In [47]:
display(pd.read_csv("{}/candidate_labels_ravl.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=10).fillna(""))

Unnamed: 0,node1,label,node2,prop2,prop2_label,value,positives,si_units,wd_units,id
0,Q131734,P452,Q8148,P580,'start time'@en,-3500.0,77,,,E1
1,Q3624078,P530,Q3624078,P1081,'Human Development Index'@en,0.801,68,,,E2
2,Q3624078,P530,Q3624078,P1081,'Human Development Index'@en,0.809,68,,,E3
3,Q3624078,P530,Q3624078,P1081,'Human Development Index'@en,0.814,68,,,E4
4,Q3624078,P530,Q3624078,P1081,'Human Development Index'@en,0.824,68,,,E5
5,Q3624078,P530,Q3624078,P1081,'Human Development Index'@en,0.829,68,,,E6
6,Q3624078,P530,Q3624078,P1081,'Human Development Index'@en,0.83,68,,,E7
7,Q3624078,P530,Q3624078,P1081,'Human Development Index'@en,0.834,68,,,E8
8,Q3624078,P530,Q3624078,P1081,'Human Development Index'@en,0.839,68,,,E9
9,Q3624078,P530,Q3624078,P1081,'Human Development Index'@en,0.844,68,,,E10


### 5.2 RALs created from attribute *interval* labels:

In [48]:
!kgtk query -i $ITEM_FILE -i $OUT/type_mapping.tsv -i $LABEL_FILE \
-i $OUT/entity_AILs_all.tsv -o $OUT/candidate_labels_rail.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (n1)-[l1 {label:p1}]->(n2), type: (n1)-[]->(t1), entity_AILs: (t2)-[l2 {label:p2, entity:n2, lower_bound:lb, upper_bound:ub, wd_units:wd, si_units:si}]->(val), `'"$LABEL_FILE"'`: (p2)-[:label]->(lab2)' \
--return 't1 as type1, p1 as prop1, t2 as type2, p2 as prop2, lab2 as prop2_label, si as si_units, wd as wd_units, lb as lower_bound, ub as upper_bound, count(distinct n1) as positives, "_" as id' \
--order-by "count(distinct n1) desc" \
--where 'lab2.kgtk_lqstring_lang_suffix = "en"'

In [49]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_rail", ".tsv", "type1 prop1 type2", "node1 label node2")

In [50]:
display(pd.read_csv("{}/candidate_labels_rail.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=10).fillna(""))

Unnamed: 0,node1,label,node2,prop2,prop2_label,si_units,wd_units,lower_bound,upper_bound,positives,id
0,Q131734,P452,Q8148,P580,'start time'@en,,,,,77,E1
1,Q131734,P17,Q3624078,P1081,'Human Development Index'@en,,,0.595,0.9405,70,E2
2,Q131734,P17,Q3624078,P1279,'inflation rate'@en,,Q11229,-5.6,42.7,70,E3
3,Q131734,P17,Q3624078,P2131,'nominal GDP'@en,,Q4917,,800249000000.0,70,E4
4,Q131734,P17,Q3624078,P2219,'real gross domestic product growth rate'@en,,Q11229,-4.61,,70,E5
5,Q131734,P17,Q3624078,P2250,'life expectancy'@en,,Q577,70.0592,83.0841,70,E6
6,Q131734,P17,Q3624078,P2299,'PPP GDP per capita'@en,,Q550207,,31357.4,70,E7
7,Q131734,P17,Q3624078,P2132,'nominal GDP per capita'@en,,Q4917,,21120.5,69,E8
8,Q131734,P17,Q3624078,P2134,'total reserves'@en,,Q4917,,67399200000.0,69,E9
9,Q131734,P17,Q3624078,P4841,'total fertility rate'@en,,,1.1235,2.1815,69,E10
