# Re-implementing the procedure outlined in "Entity Profiling in Knowledge Graphs" (Zhang et al.)
# This notebook will implement the candidate label creation step

### Pre-requisite steps to run this notebook
1. If you do not have kgtk installed, or do not have the kgtk query command, first install this with `pip install -e <path to local kgtk repo>`
2. kneed (https://pypi.org/project/kneed/) is a dependency. Install this either through Anaconda with `conda install -c conda-forge kneed`, or through pip with `pip install kneed`
3. You'll need a subset of Wikidata partitioned into separate files on your machine. You can create this yourself by following the steps in the KGTK/Tutorial notebooks, or, if you have access to the Table_Linker Google Drive, you can download the Q44 example data here: https://drive.google.com/drive/folders/1U3Tc25rRwu6xy74mPDOG5LIjhUXpbD9A?usp=sharing
4. (Optional) Consider running the trim_quantity_file notebook as a pre-processing step (see notebook for details).

In [2]:
import os
import pandas as pd
from utility import run_command
from utility import rename_cols_and_overwrite_id
from label_discretization import discretize_labels

### Parameters
**Required**  
*item_file*: file path for the file that contains entity to entity relationships (e.g. wikibase-item)  
*time_file*: file path for the file that contains entity to time-type values  
*quantity_file*: file path for the file that contains entity to quantity-type values (remember to specify the trimmed file if you did the quantity file trimming pre-processing step).  
*label_file*: file path for the file that contains wikidata labels  
*work_dir*: path to folder where files created by this notebook should be stored  
*store_dir*: path to folder containing the sqlite3.db file that we will use for our queries. We will reuse an existing file if there is one in this folder. Otherwise we will create a new one.

**Optional**    
*string_file*: file path for the file that contains entity to string-type values  
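
Since every later step depends on these paths, it can help to check them before running any queries. The following is a minimal sketch, not part of the original notebook; `missing_inputs` is a hypothetical helper name, and it assumes each parameter is a plain file path.

```python
import os


def missing_inputs(required, optional=None):
    """Return the names of parameters whose files do not exist.

    `required` and `optional` map parameter names to file paths.
    Optional parameters are only checked when set to a non-empty value.
    """
    to_check = dict(required)
    for name, path in (optional or {}).items():
        if path:  # skip optional parameters that were left blank
            to_check[name] = path
    return [name for name, path in to_check.items() if not os.path.exists(path)]
```

For example, `missing_inputs({"item_file": item_file, "label_file": label_file}, {"string_file": string_file})` returns an empty list when everything is in place.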

In [3]:
data_dir = "../../Q154/data" # my data files are all in the same directory, so I'll reuse this path prefix

# **REQUIRED**
item_file = "{}/claims.wikibase-item.tsv.gz".format(data_dir)
time_file = "{}/claims.time.tsv.gz".format(data_dir)
quantity_file = "{}/claims.quantity_trimmed.tsv.gz".format(data_dir)
label_file = "{}/labels.en.tsv.gz".format(data_dir)
work_dir = "../../Q154/profiler_work"
store_dir = "../../Q154"

# **optional**
string_file = "{}/claims.string.tsv.gz".format(data_dir)

### Process parameters and set up variables / file names

In [4]:
# Ensure paths are absolute
item_file = os.path.abspath(item_file)
time_file = os.path.abspath(time_file)
quantity_file = os.path.abspath(quantity_file)
label_file = os.path.abspath(label_file)
work_dir = os.path.abspath(work_dir)
store_dir = os.path.abspath(store_dir)
if string_file:
    string_file = os.path.abspath(string_file)
    
# Create directories
if not os.path.exists(work_dir):
    os.makedirs(work_dir)
output_dir = "{}/label_creation".format(work_dir)
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

# adding some environment variables we'll be using frequently
os.environ['ITEM_FILE'] = item_file
os.environ['TIME_FILE'] = time_file
os.environ['QUANTITY_FILE'] = quantity_file
os.environ['LABEL_FILE'] = label_file
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(store_dir)
os.environ['OUT'] = output_dir
os.environ['kgtk'] = "kgtk" # Need to do this for kgtk to be recognized as a command when passing it through a subprocess call
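
The cells below pass strings such as `"$OUT/entity_counts_per_type"` to Python helpers. How `utility.py` resolves them is not shown in this notebook; one plausible way is `os.path.expandvars`, sketched here with a throwaway variable name (`DEMO_OUT`) and an illustrative path so that the real `OUT` is left untouched.

```python
import os

# Illustrative value only; the notebook's real output directory is in OUT.
os.environ["DEMO_OUT"] = "/tmp/profiler_work/label_creation"


def resolve(path_template):
    """Expand $VAR references using the current process environment."""
    return os.path.expandvars(path_template)


resolved = resolve("$DEMO_OUT/entity_counts_per_type") + ".tsv"
print(resolved)
```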

# Outline of procedure:
**Goal**:<br>
We want to create candidate label sets including
- Attribute value labels (type, property, *attribute*)
- Relational entity labels (type, property, *entity*)
- Attribute interval labels (type, property, *range of attribute values*)
- Relational attribute labels (type, property, *attribute or attribute range of another entity*)

To enable subsequent filtering of these labels, we also want to count:
- The number of entities of each type
- The number of entities that match each label (call these "positives")
    
**Steps**:

0. Create type-mapping
1. Count the number of entities of each type
    - *optional future step*: define type with P279 transitive closure in addition to P31. 
2. Create AVLs trivially from attribute files along with counts of the positive entities for each label
    - At this step, we should also contribute to a mapping of entities --> matching attribute labels to facilitate creating RALs in a later step  
3. Create RELs trivially from entity relation files along with counts of positive entities for each label
4. Create AILs by discretizing the AVLs we found, along with counts of positive entities for each label
    - See label_discretization notebook for some exploration of discretization approach that led to the method that is implemented in this notebook
    - At this step, we should also contribute to a mapping of entities --> matching attribute labels to facilitate creating RALs in a later step
5. Create RALs by using the entities --> attribute labels table that we built in steps 2 and 4. Also keep track of counts of positive entities for each label
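
To make the four label kinds concrete, here is one possible in-memory representation. This is only a sketch: the notebook itself stores labels as KGTK TSV files, and the class names below are hypothetical. The example values and counts are taken from outputs shown later in this notebook.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class ValueLabel:
    """AVL or REL: (type, property, attribute value or entity)."""
    type: str
    prop: str
    value: str


@dataclass(frozen=True)
class IntervalLabel:
    """AIL: (type, property, range); a bound of None means it is blank."""
    type: str
    prop: str
    lower_bound: Optional[float]
    upper_bound: Optional[float]


@dataclass(frozen=True)
class RelationalLabel:
    """RAL: (type, property, label that holds for the related entity)."""
    type: str
    prop: str
    target: Union[ValueLabel, IntervalLabel]


avl = ValueLabel("Q6256", "P3238", '"0"')          # country, trunk prefix, "0"
rel = ValueLabel("Q282", "P279", "Q1125341")       # wine, subclass of, Q1125341
ail = IntervalLabel("Q6256", "P571", 1640.0, None) # country, inception, lower bound only
ral = RelationalLabel("Q282", "P17", ValueLabel("Q3624078", "P2997", "18"))

# Frozen dataclasses are hashable, so labels can key a table of positive counts.
positives = {avl: 55, rel: 2123, ail: 114, ral: 182}
```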
    
*Misc issues encountered*
- kgtk rename-columns doesn't always work when input file == output file. Getting around this right now by creating temp files... 
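
The temp-file workaround can be sketched in plain Python as follows. This is not the actual `rename_cols_and_overwrite_id` from `utility.py` (which is not shown here); `rename_columns_in_place` is a hypothetical stand-in that assumes a tab-separated file with a header row and regenerates the `id` column as E1, E2, ... to match the outputs in this notebook.

```python
import os
import tempfile


def rename_columns_in_place(path, old_names, new_names):
    """Rename TSV columns and renumber the id column, going through a temp file."""
    mapping = dict(zip(old_names.split(), new_names.split()))
    # Create the temp file next to the target so os.replace stays on one filesystem.
    out_dir = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(suffix=".tsv", dir=out_dir)
    with open(path, encoding="utf-8") as src, os.fdopen(fd, "w", encoding="utf-8") as dst:
        header = src.readline().rstrip("\n").split("\t")
        header = [mapping.get(col, col) for col in header]
        id_idx = header.index("id") if "id" in header else None
        dst.write("\t".join(header) + "\n")
        for n, line in enumerate(src, start=1):
            fields = line.rstrip("\n").split("\t")
            if id_idx is not None:
                fields[id_idx] = "E{}".format(n)
            dst.write("\t".join(fields) + "\n")
    # Only swap the new file in once it has been written completely.
    os.replace(tmp_path, path)
```

Writing to a separate file and swapping it in at the end means the input is never read and overwritten at the same time.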

## 0. Create type mapping
The mapping is from an entity (Q node) to the entity's type (another Q node). Only P31 is used for now; P279* can be added later.

In [5]:
!kgtk filter -p ' ; P31 ; ' -i $ITEM_FILE -o $OUT/type_mapping.tsv

In [6]:
!head -5 $OUT/type_mapping.tsv | column -t -s $'\t'

id                              node1  label  node2      node2;wikidatatype
P10-P31-Q18610173-85ef4d24-0    P10    P31    Q18610173  wikibase-item
P1001-P31-Q15720608-deeedec9-0  P1001  P31    Q15720608  wikibase-item
P1001-P31-Q22984026-8beb0cfe-0  P1001  P31    Q22984026  wikibase-item
P1001-P31-Q22997934-1e5b1a96-0  P1001  P31    Q22997934  wikibase-item


## 1. Count number of entities of each type:
Use the entity --> type mapping we created in step 0 to do this

In [7]:
!kgtk query -i $OUT/type_mapping.tsv -i $LABEL_FILE \
-o $OUT/entity_counts_per_type.tsv --graph-cache $STORE \
--match 'type: (n1)-[]->(type), `'"$LABEL_FILE"'`: (type)-[:label]->(lab)' \
--return 'distinct type as type, lab as type_label, count(distinct n1) as count, "_" as id' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc'

In [8]:
rename_cols_and_overwrite_id("$OUT/entity_counts_per_type", ".tsv", "type type_label count", "node1 label node2")

In [9]:
!head -10 $OUT/entity_counts_per_type.tsv | column -t -s $'\t'

node1      label                    node2  id
Q282       'wine'@en                2276   E1
Q10750129  'First Growth'@en        1078   E2
Q10210     'white wine'@en          734    E3
Q80114014  'Alsace wine'@en         722    E4
Q15075508  'beer brand'@en          683    E5
Q1827      'red wine'@en            361    E6
Q4167836   'Wikimedia category'@en  265    E7
Q12979     'rosé'@en                244    E8
Q44        'beer'@en                242    E9
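
As a cross-check on the query above, the same count can be expressed in pandas. This sketch uses toy rows in the shape of type_mapping.tsv; the entity ids `Qa`, `Qb`, `Qc` are placeholders, not real data.

```python
import pandas as pd

# Toy rows shaped like type_mapping.tsv: node1 = entity, node2 = its type.
# The duplicated Qb row shows why the count must be over distinct entities.
type_mapping = pd.DataFrame({
    "node1": ["Qa", "Qb", "Qb", "Qc"],
    "label": ["P31", "P31", "P31", "P31"],
    "node2": ["Q282", "Q282", "Q282", "Q44"],
})

counts = (
    type_mapping.groupby("node2")["node1"]
    .nunique()                          # count(distinct n1)
    .sort_values(ascending=False)       # order by count desc
    .rename("count")
    .reset_index()
    .rename(columns={"node2": "type"})
)
print(counts)
```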


## 2. Create AVLs with counts of positive entities
At this step we also want to keep track of entities --> matching attribute labels for future use. This will help when we are creating RALs (step 5)

To accomplish these goals, we will do the following:

For each attribute type (string, time, quantity), we will:
1. Use the entity --> type mapping along with the attribute data file to create an entity_attribute_labels file that maps each entity to the labels applicable to it.
2. Use the entity_attribute_labels file to aggregate labels with counts of matching entities, which we save in a candidate_labels file.
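
The two steps can be illustrated on toy data with pandas. This is a sketch of the logic only (the notebook itself uses `kgtk query`); the entity ids are placeholders and the column names are chosen for readability.

```python
import pandas as pd

# Toy inputs: entity --> type mapping and entity --> attribute edges.
types = pd.DataFrame({"entity": ["Qa", "Qb", "Qc"],
                      "type": ["Q6256", "Q6256", "Q6256"]})
attrs = pd.DataFrame({"entity": ["Qa", "Qb", "Qc", "Qc"],
                      "prop": ["P3238", "P3238", "P3238", "P3238"],
                      "value": ['"0"', '"0"', '"1"', '"1"']})

# Step 1: entity --> labels applicable to the entity (one row per distinct match).
entity_labels = attrs.merge(types, on="entity").drop_duplicates()

# Step 2: aggregate each (type, prop, value) label with its count of positives.
candidates = (
    entity_labels.groupby(["type", "prop", "value"])["entity"]
    .nunique()
    .rename("positives")
    .reset_index()
    .sort_values("positives", ascending=False)
)
print(candidates)
```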

### 2.1 Strings
Creating mapping of entity --> string attribute labels

In [10]:
if not string_file:
    print("No string attribute file was provided in the parameters section, skipping this step.")
else:
    # perform query
    command = "$kgtk query -i $OUT/type_mapping.tsv -i STRING_FILE -i LABEL_FILE \
               -o $OUT/entity_attribute_labels_string.tsv --graph-cache $STORE \
               --match '`STRING_FILE`: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), `LABEL_FILE`: (p)-[:label]->(lab)' \
               --return 'distinct n1 as entity, type as type, p as prop, n2 as value, lab as property_label, \"_\" as id' \
               --where 'lab.kgtk_lqstring_lang_suffix = \"en\"' \
               --order-by 'n1'"
    run_command(command, {"STRING_FILE" : string_file, "LABEL_FILE" : label_file})
    # reformat columns to be in KGTK format
    rename_cols_and_overwrite_id("$OUT/entity_attribute_labels_string", ".tsv", "type prop value", "node1 label node2")
    # view header of result
    run_command("head -5 $OUT/entity_attribute_labels_string.tsv | column -t -s $'\t'")

entity  node1      label  node2             property_label                 id
P101    Q56457408  P1282  "Key:craft"       'OpenStreetMap tag or key'@en  E1
P101    Q57955292  P1282  "Key:craft"       'OpenStreetMap tag or key'@en  E2
P1082   Q22984494  P1282  "Key:population"  'OpenStreetMap tag or key'@en  E3
P1216   Q18618628  P1282  "Key:HE_ref"      'OpenStreetMap tag or key'@en  E4



Aggregating distinct labels w/ positive entity counts

In [11]:
if not string_file:
    print("No string attribute file was provided in the parameters section, skipping this step.")
else:
    # perform query
    command = "$kgtk query -i $OUT/entity_attribute_labels_string.tsv \
               -o $OUT/candidate_labels_avl_string.tsv --graph-cache $STORE \
               --match 'labels: (type)-[l {label:prop, property_label:lab, entity:e}]->(val)' \
               --return 'distinct type as type, prop as prop, val as value, count(distinct e) as positives, lab as property_label, \"_\" as id' \
               --order-by 'count(distinct e) desc'"
    run_command(command)
    # reformat columns to be in KGTK format
    rename_cols_and_overwrite_id("$OUT/candidate_labels_avl_string", ".tsv", "type prop value", "node1 label node2")
    # view header of result
    run_command("head -5 $OUT/candidate_labels_avl_string.tsv | column -t -s $'\t'")

node1      label  node2  positives  property_label     id
Q6256      P3238  "0"    55         'trunk prefix'@en  E1
Q3624078   P3238  "0"    54         'trunk prefix'@en  E2
Q51576574  P3238  "0"    11         'trunk prefix'@en  E3
Q7270      P3238  "0"    11         'trunk prefix'@en  E4



### 2.2 Times

Looking at what precisions we need to deal with...

In [12]:
!kgtk query -i $TIME_FILE -i $LABEL_FILE \
--graph-cache $STORE \
--match '`'"$TIME_FILE"'`: (n1)-[l {label:p}]->(n2), `'"$LABEL_FILE"'`: (p)-[:label]->(lab)' \
--return 'distinct kgtk_date_precision(n2) as precisions, count(n1) as count' \
--limit 10 \
| column -t -s $'\t'

precisions  count
6           2
7           31
8           19
9           1104
10          21
11          618


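From the above, several values have a precision coarser than year (precision 9): in Wikidata, precisions 6, 7, and 8 denote millennium, century, and decade. We don't have kgtk type interpretation functions for these granularities, so for now we'll interpret them all as years. Furthermore, we will interpret all times at the year granularity for now, including those with month (10) or day (11) precision.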

Additional work can be done later to create labels with finer time granularity if desired.
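
For intuition, here is a rough Python sketch of pulling the year and precision out of a date literal. It assumes literals of the form `^YYYY-MM-DDThh:mm:ssZ/precision` and is not how `kgtk_date_year` is implemented; `year_and_precision` is a hypothetical helper.

```python
import re

# Assumed literal shape: ^<year>-<MM>-<DD>T<time>[/<precision>]
_KGTK_DATE = re.compile(
    r"^\^(?P<year>[+-]?\d+)-\d{2}-\d{2}T[^/]*(?:/(?P<precision>\d+))?$"
)


def year_and_precision(literal):
    """Return (year, precision); precision is None when the literal omits it."""
    match = _KGTK_DATE.match(literal)
    if match is None:
        raise ValueError("not a recognized date literal: {!r}".format(literal))
    precision = match.group("precision")
    return int(match.group("year")), (int(precision) if precision is not None else None)


print(year_and_precision("^2019-01-01T00:00:00Z/9"))
```

Because every value is reduced to its year, a day-precision date and a decade-precision date in the same year produce the same label.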

Creating mapping of entity --> year attribute labels

In [13]:
!kgtk query -i $OUT/type_mapping.tsv -i $TIME_FILE -i $LABEL_FILE \
-o $OUT/entity_attribute_labels_time.year.tsv --graph-cache $STORE \
--match '`'"$TIME_FILE"'`: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), `'"$LABEL_FILE"'`: (p)-[:label]->(p_lab), `'"$LABEL_FILE"'`: (type)-[:label]->(t_lab)' \
--return 'distinct n1 as entity, type as type, p as prop, kgtk_date_year(n2) as value, t_lab as type_label, p_lab as property_label, "_" as id' \
--where 't_lab.kgtk_lqstring_lang_suffix = "en" AND p_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'n1'

In [14]:
rename_cols_and_overwrite_id("$OUT/entity_attribute_labels_time.year", ".tsv", "type prop value", "node1 label node2")

In [15]:
!head -5 $OUT/entity_attribute_labels_time.year.tsv | column -t -s $'\t'

entity  node1      label  node2  type_label                                            property_label          id
P2847   Q18608871  P2669  2019   'Wikidata property for items about people'@en         'discontinued date'@en  E1
P2847   Q24239898  P2669  2019   'Wikidata property for Wikivoyage listings'@en        'discontinued date'@en  E2
P2847   Q30041186  P2669  2019   'Wikidata property related to online communities'@en  'discontinued date'@en  E3
P2847   Q60457486  P2669  2019   'Wikidata property for a discontinued website'@en     'discontinued date'@en  E4


Aggregating distinct labels w/ positive entity counts

In [16]:
!kgtk query -i $OUT/entity_attribute_labels_time.year.tsv \
-o $OUT/candidate_labels_avl_time.year.tsv --graph-cache $STORE \
--match 'labels: (n1)-[l {label:p, property_label:lab, entity:e}]->(val)' \
--return 'distinct n1 as type, p as prop, val as value, count(distinct e) as positives, lab as property_label, "_" as id' \
--order-by 'count(distinct e) desc'

In [17]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_avl_time.year", ".tsv", "type prop value", "node1 label node2")

In [18]:
!head -5 $OUT/candidate_labels_avl_time.year.tsv | column -t -s $'\t'

node1      label  node2  positives  property_label                                id
Q184188    P576   2015   64         'dissolved, abolished or demolished date'@en  E1
Q6465      P571   1790   43         'inception'@en                                E2
Q18524218  P571   2015   30         'inception'@en                                E3
Q1565828   P571   1936   22         'inception'@en                                E4


### 2.3 Quantities
Creating mapping of entity --> quantity attribute labels

Note that quantities may have units. We will split the quantity value and its units into separate columns.
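
The split can be pictured with the following sketch. It assumes a quantity literal is a number, optionally followed by a bracketed tolerance, optionally followed by either SI units or a Wikidata unit node (`Q...`). It is an illustration only, not the implementation behind `kgtk_quantity_number`, `kgtk_quantity_si_units`, or `kgtk_quantity_wd_units`.

```python
import re

_NUMBER = r"[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"
# Assumed literal shape: <number>[<low>,<high>]<units>, with the last two parts optional.
_QUANTITY = re.compile(
    r"^(?P<number>{n})(?:\[{n},{n}\])?(?P<units>.*)$".format(n=_NUMBER)
)


def split_quantity(literal):
    """Return (value, si_units, wd_units); missing units come back as ''."""
    match = _QUANTITY.match(literal)
    if match is None:
        raise ValueError("not a recognized quantity literal: {!r}".format(literal))
    units = match.group("units")
    is_wd_node = re.fullmatch(r"Q\d+", units) is not None
    si_units = "" if is_wd_node else units
    wd_units = units if is_wd_node else ""
    return float(match.group("number")), si_units, wd_units


print(split_quantity("18Q24564698"))
```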

In [25]:
!kgtk query -i $OUT/type_mapping.tsv -i $QUANTITY_FILE -i $LABEL_FILE \
-o $OUT/entity_attribute_labels_quantity.tsv --graph-cache $STORE \
--match '`'"$QUANTITY_FILE"'`: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), `'"$LABEL_FILE"'`: (p)-[:label]->(p_lab), `'"$LABEL_FILE"'`: (type)-[:label]->(t_lab)' \
--return 'distinct n1 as entity, type as type, p as prop, kgtk_quantity_number(n2) as value, kgtk_quantity_si_units(n2) as si_units, kgtk_quantity_wd_units(n2) as wd_units, t_lab as type_label, p_lab as property_label, "_" as id' \
--where 't_lab.kgtk_lqstring_lang_suffix = "en" AND p_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'n1'

In [26]:
rename_cols_and_overwrite_id("$OUT/entity_attribute_labels_quantity", ".tsv", "type prop value", "node1 label node2")

In [28]:
display(pd.read_csv("{}/entity_attribute_labels_quantity.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,entity,node1,label,node2,si_units,wd_units,type_label,property_label,id
0,Q1000,Q11042,P1081,0.702,,,'culture'@en,'Human Development Index'@en,E1
1,Q1000,Q1292119,P1081,0.702,,,'style'@en,'Human Development Index'@en,E2
2,Q1000,Q179023,P1081,0.702,,,'French colonial empire'@en,'Human Development Index'@en,E3
3,Q1000,Q3624078,P1081,0.702,,,'sovereign state'@en,'Human Development Index'@en,E4
4,Q1000,Q6256,P1081,0.702,,,'country'@en,'Human Development Index'@en,E5


Aggregating distinct labels w/ positive entity counts

In [21]:
!kgtk query -i $OUT/entity_attribute_labels_quantity.tsv \
-o $OUT/candidate_labels_avl_quantity.tsv --graph-cache $STORE \
--match 'labels: (n1)-[l {label:p, property_label:lab, entity:e, si_units:si, wd_units:wd}]->(val)' \
--return 'distinct n1 as type, p as prop, val as value, count(distinct e) as positives, lab as property_label, "_" as id, si as si_units, wd as wd_units' \
--order-by 'count(distinct e) desc'

In [22]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_avl_quantity", ".tsv", "type prop value", "node1 label node2")

In [23]:
display(pd.read_csv("{}/candidate_labels_avl_quantity.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,label,node2,positives,property_label,id,si_units,wd_units
0,Q3624078,P3000,18,64,'marriageable age'@en,E1,,Q24564698
1,Q6256,P2997,18,64,'age of majority'@en,E2,,Q24564698
2,Q6256,P3000,18,64,'marriageable age'@en,E3,,Q24564698
3,Q3624078,P2997,18,63,'age of majority'@en,E4,,Q24564698
4,Q6256,P2884,230,50,'mains voltage'@en,E5,,Q25250


### 2.4 Combining entity --> attribute label mappings to single table

In [24]:
command = "$kgtk cat \
           -i $OUT/entity_attribute_labels_time.year.tsv \
           -i $OUT/entity_attribute_labels_quantity.tsv \
           -o $OUT/entity_AVLs_all.tsv"
if string_file:
    command += " -i $OUT/entity_attribute_labels_string.tsv"

run_command(command)

## 3. Create RELs with counts of positive entities
We do this the same way we created AVLs, except that we use the entity-to-entity relation data file. We don't need to save the intermediate entity --> labels file, since these labels won't contribute to RALs later.

In [25]:
!kgtk query -i $ITEM_FILE -i $OUT/type_mapping.tsv -i $LABEL_FILE \
-o $OUT/candidate_labels_rel_item.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), `'"$LABEL_FILE"'`: (p)-[:label]->(lab)' \
--return 'distinct type as type, p as prop, n2 as value, count(distinct n1) as positives, lab as property_label, "_" as id' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc'

In [26]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_rel_item", ".tsv", "type prop value", "node1 label node2")

In [27]:
!head -10 $OUT/candidate_labels_rel_item.tsv | column -t -s $'\t'

node1      label  node2      positives  property_label    id
Q282       P31    Q282       2276       'instance of'@en  E1
Q282       P279   Q1125341   2123       'subclass of'@en  E2
Q10750129  P31    Q10750129  1078       'instance of'@en  E3
Q10210     P31    Q10210     734        'instance of'@en  E4
Q80114014  P31    Q80114014  722        'instance of'@en  E5
Q15075508  P31    Q15075508  683        'instance of'@en  E6
Q10210     P31    Q10750129  420        'instance of'@en  E7
Q10750129  P31    Q10210     420        'instance of'@en  E8
Q1827      P31    Q1827      361        'instance of'@en  E9


## 4. Create AILs with counts of positive entities
Similar to what we did for AVLs, we also want to keep track of entities --> matching attribute labels for future use in RAL creation (step 5)

We will create attribute *interval* labels from our attribute *value* labels that we previously created. The code that does this is explored in the explore_label_discretization notebook, and implemented in label_discretization.py.

For each entity --> labels file that has a numeric value type (year or quantity) we will:
1. Create a corresponding entity --> bucketed labels file. For example, a label in the input that looks like <country, population, 1,000,000> might get summarized (bucketed) in the output to look like <country, population, (500,000, 2,000,000)>.
2. Use the resulting bucketed entity_attribute_labels file to once again aggregate labels with counts of matching entities. This will give us a candidate_labels_ail file.

*Note on the syntax we use for ranges*: we define ranges with lower and upper bounds, either of which may be blank. A range that has only an upper bound means the bin includes all values <= the upper bound. A range with neither a lower nor an upper bound denotes a single bin that includes all values. Such ranges may be created when a <type, property> pair has very few data points.
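
The range convention can be sketched as a lookup from a value to its (lower, upper) bounds, given a sorted list of cut points. How the cut points are chosen is the job of `label_discretization.py` and is not reproduced here. `bucket_for` is a hypothetical helper, and treating the lower bound as exclusive is an assumption made so that neighbouring bins do not overlap.

```python
from bisect import bisect_left


def bucket_for(value, cut_points):
    """Return (lower, upper) such that lower < value <= upper.

    `cut_points` must be sorted ascending. A bound of None stands for a
    blank bound, i.e. the bin is open on that side.
    """
    i = bisect_left(cut_points, value)  # number of cut points strictly below value
    lower = cut_points[i - 1] if i > 0 else None
    upper = cut_points[i] if i < len(cut_points) else None
    return lower, upper


cuts = [1803.0, 1911.0]
print(bucket_for(1960, cuts))   # only a lower bound
print(bucket_for(1850, cuts))   # both bounds
print(bucket_for(1700, cuts))   # only an upper bound
print(bucket_for(1960, []))     # no cut points: one bin for all values
```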

Note that the code prints some output about labels for which it may not be creating good buckets. We silence this output so it doesn't take up too much space when viewed on GitHub. To see it, comment out the `%%capture` lines.

### 4.1 Years
Create entity --> bucketed labels file

In [28]:
%%capture
avl_file_in = "{}/entity_attribute_labels_time.year.tsv".format(os.environ["OUT"])
ail_file_out = "{}/entity_attribute_labels_time.year_bucketed.tsv".format(os.environ["OUT"])
discretize_labels(avl_file_in, ail_file_out)

In [29]:
display(pd.read_csv("{}/entity_attribute_labels_time.year_bucketed.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=11).fillna(""))


Unnamed: 0,entity,node1,label,node2,type_label,property_label,id,lower_bound,upper_bound
0,P2847,Q18608871,P2669,2019,'Wikidata property for items about people'@en,'discontinued date'@en,E1,,
1,P2847,Q24239898,P2669,2019,'Wikidata property for Wikivoyage listings'@en,'discontinued date'@en,E2,,
2,P2847,Q30041186,P2669,2019,'Wikidata property related to online communiti...,'discontinued date'@en,E3,,
3,P2847,Q60457486,P2669,2019,'Wikidata property for a discontinued website'@en,'discontinued date'@en,E4,,
4,Q1000,Q11042,P571,1960,'culture'@en,'inception'@en,E5,,
5,Q1000,Q1292119,P571,1960,'style'@en,'inception'@en,E6,,
6,Q1000,Q179023,P571,1960,'French colonial empire'@en,'inception'@en,E7,,
7,Q1000,Q3624078,P571,1960,'sovereign state'@en,'inception'@en,E8,1642.0,
8,Q1000,Q6256,P571,1960,'country'@en,'inception'@en,E9,1640.0,
9,Q1000138,Q21869758,P576,2016,'delegated commune'@en,"'dissolved, abolished or demolished date'@en",E10,,


Aggregating distinct interval labels with positive entity counts

**NOTE: Below, the lower_bound column is renamed to be node2**.

In [30]:
!kgtk query -i $OUT/entity_attribute_labels_time.year_bucketed.tsv \
-o $OUT/candidate_labels_ail_time.year.tsv \
--graph-cache $STORE \
--match 'labels: (type)-[l {label:prop, property_label:lab, entity:e, lower_bound:lb, upper_bound:ub}]->(val)' \
--return 'type as type, prop as prop, lb as lower_bound, ub as upper_bound, count(e) as positives, lab as property_label, "_" as id' \
--order-by 'count(e) desc'

In [31]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_ail_time.year", ".tsv", "type prop lower_bound", "node1 label node2")

In [32]:
display(pd.read_csv("{}/candidate_labels_ail_time.year.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=15).fillna(""))

Unnamed: 0,node1,label,node2,upper_bound,positives,property_label,id
0,Q131734,P571,1809.5,,145,'inception'@en,E1
1,Q6256,P571,1640.0,,114,'inception'@en,E2
2,Q3624078,P571,1642.0,,111,'inception'@en,E3
3,Q4830453,P571,1911.0,,72,'inception'@en,E4
4,Q15075508,P571,1849.0,,71,'inception'@en,E5
5,Q184188,P576,,,64,"'dissolved, abolished or demolished date'@en",E6
6,Q4830453,P571,1803.0,1911.0,63,'inception'@en,E7
7,Q282,P571,1957.5,2000.5,62,'inception'@en,E8
8,Q1565828,P571,,1963.0,61,'inception'@en,E9
9,Q44,P571,1815.0,,56,'inception'@en,E10


### 4.2 Quantities
Create entity --> bucketed labels file

In [33]:
%%capture
avl_file_in = "{}/entity_attribute_labels_quantity.tsv".format(os.environ["OUT"])
ail_file_out = "{}/entity_attribute_labels_quantity_bucketed.tsv".format(os.environ["OUT"])
discretize_labels(avl_file_in, ail_file_out)

In [34]:
display(pd.read_csv("{}/entity_attribute_labels_quantity_bucketed.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=10).fillna(""))

Unnamed: 0,entity,node1,label,node2,si_units,wd_units,type_label,property_label,id,lower_bound,upper_bound
0,Q1000,Q11042,P1081,0.702,,,'culture'@en,'Human Development Index'@en,E1,,
1,Q1000,Q1292119,P1081,0.702,,,'style'@en,'Human Development Index'@en,E2,,
2,Q1000,Q179023,P1081,0.702,,,'French colonial empire'@en,'Human Development Index'@en,E3,,
3,Q1000,Q3624078,P1081,0.702,,,'sovereign state'@en,'Human Development Index'@en,E4,0.676,0.7275
4,Q1000,Q6256,P1081,0.702,,,'country'@en,'Human Development Index'@en,E5,0.676,0.7275
5,Q1000,Q11042,P1082,2025137.0,,,'culture'@en,'population'@en,E6,,
6,Q1000,Q1292119,P1082,2025137.0,,,'style'@en,'population'@en,E7,,
7,Q1000,Q179023,P1082,2025137.0,,,'French colonial empire'@en,'population'@en,E8,,
8,Q1000,Q3624078,P1082,2025137.0,,,'sovereign state'@en,'population'@en,E9,,117742000.0
9,Q1000,Q6256,P1082,2025137.0,,,'country'@en,'population'@en,E10,,70141600.0


Aggregating distinct interval labels with positive entity counts

**NOTE: Below, the lower_bound column is renamed to be node2**.

In [35]:
!kgtk query -i $OUT/entity_attribute_labels_quantity_bucketed.tsv \
-o $OUT/candidate_labels_ail_quantity.tsv \
--graph-cache $STORE \
--match 'labels: (type)-[l {label:prop, property_label:lab, si_units:si, wd_units:wd, entity:e, lower_bound:lb, upper_bound:ub}]->(val)' \
--return 'type as type, prop as prop, si as si_units, wd as wd_units, lb as lower_bound, ub as upper_bound, count(e) as positives, lab as property_label, "_" as id' \
--order-by 'count(e) desc'

In [36]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_ail_quantity", ".tsv", "type prop lower_bound", "node1 label node2")

In [37]:
display(pd.read_csv("{}/candidate_labels_ail_quantity.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=10).fillna(""))

Unnamed: 0,node1,label,si_units,wd_units,node2,upper_bound,positives,property_label,id
0,Q6256,P2046,,Q712226,,2581070.0,98,'area'@en,E1
1,Q3624078,P2046,,Q712226,,3033832.0,95,'area'@en,E2
2,Q6256,P2219,,Q11229,-7.8,9.05,94,'real gross domestic product growth rate'@en,E3
3,Q3624078,P2219,,Q11229,-7.8,9.05,93,'real gross domestic product growth rate'@en,E4
4,Q6256,P2131,,Q4917,,1995152000000.0,92,'nominal GDP'@en,E5
5,Q6256,P4010,,Q550207,,3519142000000.0,91,'GDP (PPP)'@en,E6
6,Q3624078,P4010,,Q550207,,3519142000000.0,90,'GDP (PPP)'@en,E7
7,Q3624078,P2131,,Q4917,,1995152000000.0,89,'nominal GDP'@en,E8
8,Q3624078,P2299,,Q550207,,57625.01,89,'PPP GDP per capita'@en,E9
9,Q6256,P2299,,Q550207,,47938.2,83,'PPP GDP per capita'@en,E10


### 4.3 Combining entity --> attribute interval label mappings to single table

In [38]:
!kgtk cat \
-i $OUT/entity_attribute_labels_quantity_bucketed.tsv \
-i $OUT/entity_attribute_labels_time.year_bucketed.tsv \
-o $OUT/entity_AILs_all.tsv

## 5. Create RALs with counts of positive entities

### 5.1 RALs created from attribute *value* labels:

In [39]:
%%time
!kgtk query -i $ITEM_FILE -i $OUT/type_mapping.tsv -i $LABEL_FILE \
-i $OUT/entity_AVLs_all.tsv -o $OUT/candidate_labels_ravl.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (n1)-[l1 {label:p1}]->(n2), type: (n1)-[]->(t1), entity_AVLs: (t2)-[l2 {label:p2, entity:n2, si_units:si, wd_units:wd}]->(val), `'"$LABEL_FILE"'`: (p2)-[:label]->(lab2)' \
--return 't1 as type1, p1 as prop1, t2 as type2, p2 as prop2, lab2 as prop2_label, val as value, count(distinct n1) as positives, si as si_units, wd as wd_units, "_" as id' \
--order-by "count(distinct n1) desc" \
--where 'lab2.kgtk_lqstring_lang_suffix = "en"'

CPU times: user 237 ms, sys: 74.3 ms, total: 311 ms
Wall time: 19.2 s


In [40]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_ravl", ".tsv", "type1 prop1 type2", "node1 label node2")

In [41]:
display(pd.read_csv("{}/candidate_labels_ravl.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=10).fillna(""))

Unnamed: 0,node1,label,node2,prop2,prop2_label,value,positives,si_units,wd_units,id
0,Q10210,P31,Q8024923,P373,'Commons category'@en,White wine,734,,,E1
1,Q10750129,P31,Q8024923,P373,'Commons category'@en,White wine,420,,,E2
2,Q1827,P31,Q8024923,P373,'Commons category'@en,Red wine,361,,,E3
3,Q12979,P31,Q8024923,P373,'Commons category'@en,Rosé wine,244,,,E4
4,Q131734,P452,Q8148,P373,'Commons category'@en,Beer brewing,194,,,E5
5,Q131734,P452,Q8148,P580,'start time'@en,-3500,194,,,E6
6,Q131734,P17,Q6256,P3238,'trunk prefix'@en,0,184,,,E7
7,Q282,P17,Q3624078,P2997,'age of majority'@en,18,182,,Q24564698,E8
8,Q282,P17,Q6256,P2997,'age of majority'@en,18,181,,Q24564698,E9
9,Q282,P17,Q3624078,P3270,'compulsory education (minimum age)'@en,6,180,,Q24564698,E10
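
The join behind this query can be sketched in pandas on toy data: each edge n1 --> n2 is combined with the type of n1 and with the attribute labels that hold for n2, and positives are counted over distinct n1. The entity ids below are placeholders, and the column names mirror the query variables rather than the files on disk.

```python
import pandas as pd

# Toy entity-to-entity edges: n1 -[p1]-> n2.
edges = pd.DataFrame({"n1": ["wine_a", "wine_b", "wine_c"],
                      "p1": ["P17", "P17", "P17"],
                      "n2": ["country_x", "country_x", "country_y"]})
# Type of each source entity.
types = pd.DataFrame({"n1": ["wine_a", "wine_b", "wine_c"],
                      "t1": ["Q282", "Q282", "Q282"]})
# Attribute value labels that hold for each target entity.
avls = pd.DataFrame({"n2": ["country_x", "country_y"],
                     "t2": ["Q6256", "Q6256"],
                     "p2": ["P2997", "P2997"],
                     "value": ["18", "21"]})

ravl = (
    edges.merge(types, on="n1")
    .merge(avls, on="n2")
    .groupby(["t1", "p1", "t2", "p2", "value"])["n1"]
    .nunique()                                  # count(distinct n1)
    .rename("positives")
    .reset_index()
    .sort_values("positives", ascending=False)
)
print(ravl)
```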


### 5.2 RALs created from attribute *interval* labels:

In [42]:
%%time
!kgtk query -i $ITEM_FILE -i $OUT/type_mapping.tsv -i $LABEL_FILE \
-i $OUT/entity_AILs_all.tsv -o $OUT/candidate_labels_rail.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (n1)-[l1 {label:p1}]->(n2), type: (n1)-[]->(t1), entity_AILs: (t2)-[l2 {label:p2, entity:n2, lower_bound:lb, upper_bound:ub, wd_units:wd, si_units:si}]->(val), `'"$LABEL_FILE"'`: (p2)-[:label]->(lab2)' \
--return 't1 as type1, p1 as prop1, t2 as type2, p2 as prop2, lab2 as prop2_label, si as si_units, wd as wd_units, lb as lower_bound, ub as upper_bound, count(distinct n1) as positives, "_" as id' \
--order-by "count(distinct n1) desc" \
--where 'lab2.kgtk_lqstring_lang_suffix = "en"'

CPU times: user 132 ms, sys: 45.5 ms, total: 177 ms
Wall time: 10.4 s


In [43]:
rename_cols_and_overwrite_id("$OUT/candidate_labels_rail", ".tsv", "type1 prop1 type2", "node1 label node2")

In [44]:
display(pd.read_csv("{}/candidate_labels_rail.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=10).fillna(""))

Unnamed: 0,node1,label,node2,prop2,prop2_label,si_units,wd_units,lower_bound,upper_bound,positives,id
0,Q131734,P17,Q6256,P2219,'real gross domestic product growth rate'@en,,Q11229,-7.8,9.05,217,E1
1,Q131734,P17,Q6256,P1081,'Human Development Index'@en,,,0.7275,,215,E2
2,Q131734,P17,Q6256,P571,'inception'@en,,,1640.0,,212,E3
3,Q131734,P17,Q6256,P1198,'unemployment rate'@en,,Q11229,1.45,15.0,207,E4
4,Q131734,P17,Q3624078,P2219,'real gross domestic product growth rate'@en,,Q11229,-7.8,9.05,202,E5
5,Q131734,P17,Q6256,P2046,'area'@en,,Q712226,,2581070.0,202,E6
6,Q131734,P17,Q3624078,P1081,'Human Development Index'@en,,,0.7275,,200,E7
7,Q131734,P17,Q3624078,P571,'inception'@en,,,1642.0,,197,E8
8,Q131734,P17,Q6256,P2855,'VAT-rate'@en,,Q11229,8.25,26.0,196,E9
9,Q131734,P17,Q6256,P3529,'median income'@en,,Q4917,,,196,E10
