# Re-implementing the procedure outlined in "Entity Profiling in Knowledge Graphs" (Zhang et al.)
# This notebook will implement the candidate label creation step

### Pre-requisite steps to run this notebook
1. If you do not have kgtk installed, or do not have the kgtk query command, first install this with `pip install -e <path to local kgtk repo>`
2. kneed (https://pypi.org/project/kneed/) is a dependency. Install this either through Anaconda with `conda install -c conda-forge kneed`, or through pip with `pip install kneed`
3. You'll need to have a subset of Wikidata partitioned into different files on your machine. You can create this yourself by following the steps in the KGTK/Tutorial notebooks, or if you have access to the Table_Linker Google Drive then you can download the Q44 example data here: https://drive.google.com/drive/folders/1U3Tc25rRwu6xy74mPDOG5LIjhUXpbD9A?usp=sharing
4. (Optional) Consider running the trim_quantity_file notebook as a pre-processing step (see notebook for details).

In [1]:
%load_ext autoreload
%autoreload 2

In [4]:
import os
import pandas as pd
from utility import run_command
from utility import rename_cols_and_overwrite_id
from label_discretization import discretize_labels_by_percentile, discretize_labels_fixed_width

### Parameters
**Required**  
*item_file*: file path for the file that contains entity to entity relationships (e.g. wikibase-item)  
*time_file*: file path for the file that contains entity to time-type values  
*quantity_file*: file path for the file that contains entity to quantity-type values (remember to specify the trimmed file if you did the quantity file trimming pre-processing step).  
*label_file*: file path for the file that contains wikidata labels  
*work_dir*: path to folder where files created by this notebook should be stored  
*store_dir*: path to folder containing the sqlite3.db file that we will use for our queries. We will reuse an existing file if there is one in this folder. Otherwise we will create a new one.

**Optional**    
*string_file*: file path for the file that contains entity to string-type values  

In [5]:
data_dir = "./data/wikidata-20210215-dwd" # my data files are all in the same directory, so I'll reuse this path prefix

# **REQUIRED**
item_file = "{}/claims.wikibase-item.tsv.gz".format(data_dir)
time_file = "{}/claims.time.tsv.gz".format(data_dir)
quantity_file = "{}/claims.quantity_trimmed.tsv.gz".format(data_dir)
label_file = "{}/labels.en.tsv.gz".format(data_dir)
work_dir = "./output/wikidata-20210215-dwd/"
store_dir = "{}/temp".format(work_dir)

# **optional**
string_file = None #"{}/claims.string.tsv.gz".format(data_dir)
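
Before running the long queries below, it can help to confirm that the required files actually exist. The following is a minimal sketch, not part of the original notebook: `find_missing_inputs` is a hypothetical helper, and the commented usage assumes the parameter names defined in the cell above.

```python
import os

def find_missing_inputs(required, optional=None):
    """Return the names of parameters whose file path is unset or does not exist."""
    missing = [name for name, path in required.items()
               if not path or not os.path.isfile(path)]
    # Optional files are only checked when a path was actually provided.
    for name, path in (optional or {}).items():
        if path and not os.path.isfile(path):
            missing.append(name)
    return missing

# Hypothetical usage with the parameters from the cell above:
# missing = find_missing_inputs(
#     {"item_file": item_file, "time_file": time_file,
#      "quantity_file": quantity_file, "label_file": label_file},
#     {"string_file": string_file})
# if missing:
#     raise FileNotFoundError("Missing input files: {}".format(missing))
```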

### Process parameters and set up variables / file names

In [6]:
# Ensure paths are absolute
item_file = os.path.abspath(item_file)
time_file = os.path.abspath(time_file)
quantity_file = os.path.abspath(quantity_file)
label_file = os.path.abspath(label_file)
work_dir = os.path.abspath(work_dir)
store_dir = os.path.abspath(store_dir)
if string_file:
    string_file = os.path.abspath(string_file)
    
# Create directories (exist_ok=True makes a separate existence check unnecessary)
os.makedirs(work_dir, exist_ok=True)
output_dir = "{}/label_creation".format(work_dir)
os.makedirs(output_dir, exist_ok=True)
os.makedirs(store_dir, exist_ok=True)

# adding some environment variables we'll be using frequently
os.environ['ITEM_FILE'] = item_file
os.environ['TIME_FILE'] = time_file
os.environ['QUANTITY_FILE'] = quantity_file
os.environ['LABEL_FILE'] = label_file
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(store_dir)
os.environ['OUT'] = output_dir
os.environ['kgtk'] = "kgtk" # Need to do this for kgtk to be recognized as a command when passing it through a subprocess call

# Outline of procedure:
**Goal**:<br>
We want to create candidate label sets including
- Attribute value labels (type, property, *attribute*)
- Relational entity labels (type, property, *entity*)
- Attribute interval labels (type, property, *range of attribute values*)
- Relational attribute labels (type, property, *attribute or attribute range of another entity*)

To enable subsequent filtering of these labels, we also want to count:
- The number of entities of each type
- The number of entities that match each label (call these "positives")
    
**Steps**:

0. Create type-mapping
1. Count the number of entities of each type
    - *optional future step*: define type with P279 transitive closure in addition to P31. 
2. Create AVLs trivially from attribute files along with counts of the positive entities for each label
    - At this step, we should also contribute to a mapping of entities --> matching attribute labels to facilitate creating RALs in a later step  
3. Create RELs trivially from entity relation files along with counts of positive entities for each label
4. Create AILs by discretizing the AVLs we found, along with counts of positive entities for each label
    - See label_discretization notebook for some exploration of discretization approach that led to the method that is implemented in this notebook
    - At this step, we should also contribute to a mapping of entities --> matching attribute labels to facilitate creating RALs in a later step
5. Create RALs by using the entities --> attribute labels table that we built in steps 2 and 4. Also keep track of counts of positive entities for each label
    
*Misc issues encountered*
- kgtk rename-columns doesn't always work when input file == output file. The current workaround is to write the result to a temp file and then move it over the original.
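
The temp-file workaround mentioned above can be sketched in plain Python as follows. This is an illustration only and not the code in `utility.py`: `overwrite_via_temp` and `rename_header` are hypothetical helpers, and the sketch assumes an uncompressed TSV file.

```python
import os
import tempfile

def overwrite_via_temp(path, transform):
    """Run transform(src_path, dst_path) into a temp file, then replace path with it."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    os.close(fd)
    try:
        transform(path, tmp_path)
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the original file untouched and clean up the partial output.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

def rename_header(old_to_new):
    """Build a transform that renames columns in the header row of a TSV file."""
    def transform(src, dst):
        with open(src) as fin, open(dst, "w") as fout:
            header = fin.readline().rstrip("\n").split("\t")
            fout.write("\t".join(old_to_new.get(c, c) for c in header) + "\n")
            fout.writelines(fin)  # copy the remaining rows unchanged
    return transform
```

Because the original file is only replaced after the transform succeeds, a failed run never leaves a half-written file in its place.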

## 0. Create type mapping
Mapping is from entity (Q node) to the entity's type (another Q node). Using P31 only for now, but can add P279* as well later  

In [10]:
!kgtk query -i $ITEM_FILE -i $LABEL_FILE \
-o $OUT/type_mapping.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (e)-[l {label:"P31"}]->(type), `'"$LABEL_FILE"'`: (e)-[:label]->(e_lab), `'"$LABEL_FILE"'`: (type)-[:label]->(type_lab)' \
--return 'distinct l as id, e as node1, l.label as label, type as node2, e_lab as entity_label, type_lab as type_label' \
--where 'e_lab.kgtk_lqstring_lang_suffix = "en" AND type_lab.kgtk_lqstring_lang_suffix = "en"'

^C

Keyboard interrupt in query -i /data/profiling/kgtk/entity_profiling/data/wikidata_humans/claims.wikibase-item.tsv.gz -i /data/profiling/kgtk/entity_profiling/data/wikidata_humans/labels.en.tsv.gz -o /data/profiling/kgtk/entity_profiling/output/wikidata_humans_v2/label_creation/type_mapping.tsv --graph-cache /data/profiling/kgtk/entity_profiling/output/wikidata_humans_v2/temp/wikidata.sqlite3.db --match `/data/profiling/kgtk/entity_profiling/data/wikidata_humans/claims.wikibase-item.tsv.gz`: (e)-[l {label:"P31"}]->(type), `/data/profiling/kgtk/entity_profiling/data/wikidata_humans/labels.en.tsv.gz`: (e)-[:label]->(e_lab), `/data/profiling/kgtk/entity_profiling/data/wikidata_humans/labels.en.tsv.gz`: (type)-[:label]->(type_lab) --return distinct l as id, e as node1, l.label as label, type as node2, e_lab as entity_label, type_lab as type_label --where e_lab.kgtk_lqstring_lang_suffix = "en" AND type_lab.kgtk_lqstring_lang_suffix = "en".


In [7]:
display(pd.read_csv("{}/type_mapping.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=5).fillna(""))

Unnamed: 0,id,node1,label,node2,entity_label,type_label
0,P10-P31-Q18610173-85ef4d24-0,P10,P31,Q18610173,'video'@en,'Wikidata property to link to Commons'@en
1,P1000-P31-Q18608871-093affb5-0,P1000,P31,Q18608871,'record held'@en,'Wikidata property for items about people'@en
2,P1001-P31-Q15720608-deeedec9-0,P1001,P31,Q15720608,'applies to jurisdiction'@en,'Wikidata qualifier'@en
3,P1001-P31-Q22984026-8beb0cfe-0,P1001,P31,Q22984026,'applies to jurisdiction'@en,'Wikidata property related to law and justice'@en
4,P1001-P31-Q22997934-1e5b1a96-0,P1001,P31,Q22997934,'applies to jurisdiction'@en,'Wikidata property related to government and s...


## 1. Count number of entities of each type:
Use the entity --> type mapping we created in step 0 to do this

In [7]:
!kgtk query -i $OUT/type_mapping.tsv -i $LABEL_FILE \
-o $OUT/entity_counts_per_type.tsv --graph-cache $STORE \
--match 'type: (n1)-[]->(type), `'"$LABEL_FILE"'`: (type)-[:label]->(lab)' \
--return 'distinct type as type, lab as type_label, count(distinct n1) as count, "_" as id' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc'

In [8]:
rename_cols_and_overwrite_id("$OUT/entity_counts_per_type", ".tsv", "type type_label count", "node1 label node2")

In [8]:
!head -10 $OUT/entity_counts_per_type.tsv | column -t -s $'\t'

node1     label                               node2    id
Q5        'human'@en                          8187532  E1
Q16521    'taxon'@en                          2809682  E2
Q4167836  'Wikimedia category'@en             2482326  E3
Q4167410  'Wikimedia disambiguation page'@en  705261   E4
Q79007    'street'@en                         522060   E5
Q8502     'mountain'@en                       443306   E6
Q101352   'family name'@en                    376122   E7
Q1931185  'astronomical radio source'@en      353206   E8
Q30612    'clinical trial'@en                 351014   E9


In [204]:
!wc -l $OUT/entity_counts_per_type.tsv

63160 /data02/profiling/kgtk/entity_profiling/output/wikidata-20210215-dwd/label_creation/entity_counts_per_type.tsv
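
As a sanity check, the per-type counts can be recomputed outside of kgtk. The sketch below is an assumption-laden illustration: `count_entities_per_type` is a hypothetical helper and the sample rows are made up. It counts distinct `node1` values per `node2` in an edge file shaped like type_mapping.tsv. Unlike the query above, it does not require the type to have an English label, so totals may differ slightly on real data.

```python
import csv
import io
from collections import defaultdict

def count_entities_per_type(tsv_file):
    """Count distinct node1 values for each node2 in a tab-separated edge file."""
    entities = defaultdict(set)
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        entities[row["node2"]].add(row["node1"])
    # Sort by descending count, then by type id for a stable order.
    return sorted(((t, len(e)) for t, e in entities.items()),
                  key=lambda pair: (-pair[1], pair[0]))

# Made-up sample in the same column layout as type_mapping.tsv.
sample = io.StringIO(
    "id\tnode1\tlabel\tnode2\n"
    "e1\tQ1\tP31\tQ5\n"
    "e2\tQ2\tP31\tQ5\n"
    "e3\tQ3\tP31\tQ16521\n"
)
counts = count_entities_per_type(sample)
```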


## 2. Create AVLs with counts of positive entities
At this step we also want to keep track of entities --> matching attribute labels for future use. This will help when we are creating RALs (step 5)

To accomplish these goals, we will do the following:

For each attribute type (string, time, quantity), we will:
1. Use the entity --> type mapping along with the attribute data file to create an entity_attribute_labels file that maps each entity --> the labels applicable to it.
2. Use the entity_attribute_labels file to aggregate labels with counts of matching entities, which we will save in a candidate_labels file.
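
The two steps described above can be sketched on toy data. This is only an illustration of the idea in plain Python; the notebook itself does this with kgtk queries, and all identifiers in the toy data are made up.

```python
from collections import defaultdict

# Toy (entity, type) pairs and (entity, property, value) attribute edges.
type_mapping = [("Q1", "Q5"), ("Q2", "Q5"), ("Q3", "Q5")]
attribute_edges = [("Q1", "P569", "1950"), ("Q2", "P569", "1950"),
                   ("Q2", "P569", "1950"),  # duplicate statement, must not be counted twice
                   ("Q3", "P569", "1960")]

# Step 1: entity --> labels, where a label is the triple (type, property, value).
types_of = defaultdict(set)
for entity, etype in type_mapping:
    types_of[entity].add(etype)

entity_labels = defaultdict(set)
for entity, prop, value in attribute_edges:
    for etype in types_of[entity]:
        entity_labels[entity].add((etype, prop, value))

# Step 2: label --> number of distinct entities that match it ("positives").
positive_entities = defaultdict(set)
for entity, labels in entity_labels.items():
    for label in labels:
        positive_entities[label].add(entity)
positives = {label: len(entities) for label, entities in positive_entities.items()}
```

Using sets in both steps mirrors the `distinct` and `count(distinct e)` clauses in the queries: an entity contributes at most once to each label.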

### 2.1 Strings
Creating mapping of entity --> string attribute labels

In [10]:
if not string_file:
    print("No string attribute file was provided in the parameters section, skipping this step.")
else:
    # perform query
    command = "$kgtk query -i $OUT/type_mapping.tsv -i STRING_FILE -i LABEL_FILE \
               -o $OUT/entity_attribute_labels_string.tsv --graph-cache $STORE \
               --match '`STRING_FILE`: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), `LABEL_FILE`: (p)-[:label]->(lab)' \
               --return 'distinct n1 as entity, type as type, p as prop, n2 as value, lab as property_label, \"_\" as id' \
               --where 'lab.kgtk_lqstring_lang_suffix = \"en\"' \
               --order-by 'n1'"
    run_command(command, {"STRING_FILE" : string_file, "LABEL_FILE" : label_file})
    # reformat columns to be in KGTK format
    rename_cols_and_overwrite_id("$OUT/entity_attribute_labels_string", ".tsv", "type prop value", "node1 label node2")
    # view header of result
    run_command("head -5 $OUT/entity_attribute_labels_string.tsv | column -t -s $'\t'")

No string attribute file was provided in the parameters section, skipping this step.


Aggregating distinct labels w/ positive entity counts

In [11]:
if not string_file:
    print("No string attribute file was provided in the parameters section, skipping this step.")
else:
    # perform query
    command = "$kgtk query -i $OUT/entity_attribute_labels_string.tsv \
               -o $OUT/candidate_labels_avl_string.tsv --graph-cache $STORE \
               --match 'labels: (type)-[l {label:prop, property_label:lab, entity:e}]->(val)' \
               --return 'distinct type as type, prop as prop, val as value, count(distinct e) as positives, lab as property_label, \"_\" as id' \
               --order-by 'count(distinct e) desc'"
    run_command(command)
    # reformat columns to be in KGTK format
    rename_cols_and_overwrite_id("$OUT/candidate_labels_avl_string", ".tsv", "type prop value", "node1 label node2")
    # view header of result
    run_command("head -5 $OUT/candidate_labels_avl_string.tsv | column -t -s $'\t'")

No string attribute file was provided in the parameters section, skipping this step.


### 2.2 Times

Looking at what precisions we need to deal with...

In [12]:
!kgtk query -i $TIME_FILE -i $LABEL_FILE \
--graph-cache $STORE \
--match '`'"$TIME_FILE"'`: (n1)-[l {label:p}]->(n2), `'"$LABEL_FILE"'`: (p)-[:label]->(lab)' \
--return 'distinct kgtk_date_precision(n2) as precisions, count(n1) as count' \
--limit 10 \
| column -t -s $'\t'

precisions  count
6           14
7           218
8           45
9           6077
10          212
11          21214


The output above shows several precisions coarser than year (precision 9): 6 (millennium), 7 (century), and 8 (decade). We don't have kgtk type interpretation functions for these granularities, so for now we'll interpret them all as years. The finer precisions, 10 (month) and 11 (day), are also reduced to the year, so every time value is handled at year granularity for now.

Additional work can be done later to create labels with finer time granularity if desired.
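
To make the year-level interpretation concrete, here is a small sketch that pulls the year and precision out of a date literal. It assumes literals of the form `^1950-03-12T00:00:00Z/11`; `year_and_precision` is a hypothetical helper and not how `kgtk_date_year` is implemented.

```python
import re

# Wikidata precision codes: 6=millennium, 7=century, 8=decade, 9=year, 10=month, 11=day.
# Group 1 is the (possibly signed) year, group 2 the optional precision suffix.
DATE_RE = re.compile(r"^\^([+-]?\d+)-\d{2}-\d{2}T[^/]*(?:/(\d+))?$")

def year_and_precision(literal):
    """Return (year, precision) for a date literal, or None if it does not parse."""
    match = DATE_RE.match(literal)
    if match is None:
        return None
    precision = int(match.group(2)) if match.group(2) else None
    return int(match.group(1)), precision
```

With this approach a century-precision value and a day-precision value are both reduced to a plain year, which is the simplification described above.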

Creating mapping of entity --> year attribute labels

In [91]:
%%time
!kgtk query -i $OUT/type_mapping.tsv -i $TIME_FILE -i $LABEL_FILE \
-o $OUT/entity_AVLs_time.year.tsv --graph-cache $STORE \
--match '`'"$TIME_FILE"'`: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), `'"$LABEL_FILE"'`: (p)-[:label]->(p_lab), `'"$LABEL_FILE"'`: (type)-[:label]->(t_lab)' \
--return 'distinct n1 as node1, "_" as label, printf("%s_%s_%s",type,p,kgtk_date_year(n2)) as node2, type as type, t_lab as type_label, p as prop, p_lab as property_label, kgtk_date_year(n2) as value, "_" as id' \
--where 't_lab.kgtk_lqstring_lang_suffix = "en" AND p_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'n1'

CPU times: user 4.27 s, sys: 977 ms, total: 5.25 s
Wall time: 4min 47s


In [92]:
!kgtk add-id -i $OUT/entity_AVLs_time.year.tsv \
-o $OUT/entity_AVLs_time.year.temp.tsv --overwrite-id \
&& mv $OUT/entity_AVLs_time.year.temp.tsv $OUT/entity_AVLs_time.year.tsv

In [93]:
display(pd.read_csv("{}/entity_AVLs_time.year.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))

Unnamed: 0,node1,label,node2,type,type_label,prop,property_label,value,id
0,P1841,_,Q55977691_P580_2016,Q55977691,'Wikidata property for authority control for a...,P580,'start time'@en,2016,E1
1,P2847,_,Q18608871_P2669_2019,Q18608871,'Wikidata property for items about people'@en,P2669,'discontinued date'@en,2019,E2
2,P2847,_,Q24239898_P2669_2019,Q24239898,'Wikidata property for Wikivoyage listings'@en,P2669,'discontinued date'@en,2019,E3
3,P2847,_,Q30041186_P2669_2019,Q30041186,'Wikidata property related to online communiti...,P2669,'discontinued date'@en,2019,E4
4,P2847,_,Q60457486_P2669_2019,Q60457486,'Wikidata property for a discontinued website'@en,P2669,'discontinued date'@en,2019,E5


Aggregating distinct labels w/ positive entity counts

In [94]:
%%time
!kgtk query -i $OUT/entity_AVLs_time.year.tsv \
-o $OUT/candidate_AVLs_time.year.tsv --graph-cache $STORE \
--match 'entity_AVLs: (e)-[l {type:t, type_label:t_lab, prop:p, property_label:p_lab,  value:val}]->(label_id)' \
--return 'distinct t as node1, t_lab as type_label, p as label, p_lab as property_label, val as node2, count(distinct e) as positives, label_id as id' \
--order-by 'count(distinct e) desc'

CPU times: user 1.81 s, sys: 392 ms, total: 2.2 s
Wall time: 1min 46s


In [96]:
display(pd.read_csv("{}/candidate_AVLs_time.year.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))

Unnamed: 0,node1,type_label,label,property_label,node2,positives,id
0,Q13433827,'encyclopedic article'@en,P577,'publication date'@en,2017,73623,Q13433827_P577_2017
1,Q5,'human'@en,P569,'date of birth'@en,1950,49166,Q5_P569_1950
2,Q5,'human'@en,P569,'date of birth'@en,2000,48204,Q5_P569_2000
3,Q5,'human'@en,P569,'date of birth'@en,1953,44922,Q5_P569_1953
4,Q5,'human'@en,P569,'date of birth'@en,1960,38937,Q5_P569_1960


### 2.3 Quantities
Creating mapping of entity --> quantity attribute labels

Note that quantities may have units. We will split the quantity value and its units (SI units and Wikidata units) into separate columns.
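
The split into value and units can be illustrated with a small parser. This is a sketch under assumptions: it expects literals such as `+18Q577` or `+37427`, optionally with a bracketed tolerance after the number, and `split_quantity` is a hypothetical helper rather than the logic behind `kgtk_quantity_numeral`, `kgtk_quantity_si_units`, or `kgtk_quantity_wd_units`.

```python
import re

QUANTITY_RE = re.compile(
    r"^([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"  # group 1: signed number
    r"(?:\[[^\]]*\])?"                          # optional tolerance, ignored here
    r"(.*)$"                                    # group 2: whatever follows is the unit
)

def split_quantity(literal):
    """Split a quantity literal into numeral, SI units, and Wikidata units."""
    match = QUANTITY_RE.match(literal)
    if match is None:
        return None
    numeral, units = match.group(1), match.group(2)
    # A unit that looks like a Q-node is treated as a Wikidata unit, anything else as SI.
    is_wd = re.fullmatch(r"Q\d+", units) is not None
    return {"numeral": numeral,
            "si_units": "" if is_wd else units,
            "wd_units": units if is_wd else ""}
```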

In [97]:
!kgtk query -i $OUT/type_mapping.tsv -i $QUANTITY_FILE -i $LABEL_FILE \
-o $OUT/entity_AVLs_quantity.tsv --graph-cache $STORE \
--match '`'"$QUANTITY_FILE"'`: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), `'"$LABEL_FILE"'`: (p)-[:label]->(p_lab), `'"$LABEL_FILE"'`: (type)-[:label]->(t_lab)' \
--return 'distinct n1 as node1, "_" as label, printf("%s_%s_%s_%s_%s",type,p,kgtk_quantity_numeral(n2),kgtk_quantity_si_units(n2),kgtk_quantity_wd_units(n2)) as node2, type as type, t_lab as type_label, p as prop, p_lab as property_label, kgtk_quantity_numeral(n2) as value, kgtk_quantity_si_units(n2) as si_units, kgtk_quantity_wd_units(n2) as wd_units, "_" as id' \
--where 't_lab.kgtk_lqstring_lang_suffix = "en" AND p_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'n1'

In [98]:
!kgtk add-id -i $OUT/entity_AVLs_quantity.tsv \
-o $OUT/entity_AVLs_quantity.temp.tsv --overwrite-id \
&& mv $OUT/entity_AVLs_quantity.temp.tsv $OUT/entity_AVLs_quantity.tsv

In [99]:
display(pd.read_csv("{}/entity_AVLs_quantity.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,label,node2,type,type_label,prop,property_label,value,si_units,wd_units,id
0,P1004,_,Q19829908_P4876_+37427__,Q19829908,'Wikidata property for authority control for p...,P4876,'number of records'@en,37427,,,E1
1,P1004,_,Q24075706_P4876_+37427__,Q24075706,"'Wikidata property for authority control, with...",P4876,'number of records'@en,37427,,,E2
2,P1004,_,Q27525351_P4876_+37427__,Q27525351,'Wikidata property related to music'@en,P4876,'number of records'@en,37427,,,E3
3,P1014,_,Q27918607_P4876_+53249__,Q27918607,'Wikidata property related to art'@en,P4876,'number of records'@en,53249,,,E4
4,P1014,_,Q43831109_P4876_+53249__,Q43831109,'Wikidata property related to architecture'@en,P4876,'number of records'@en,53249,,,E5


Aggregating distinct labels w/ positive entity counts

In [107]:
!kgtk query -i $OUT/entity_AVLs_quantity.tsv \
-o $OUT/candidate_AVLs_quantity.tsv --graph-cache $STORE \
--match 'entity_AVLs: (e)-[l {prop:p, property_label:p_lab, type:t, type_label:t_lab, value:val, si_units:si, wd_units:wd}]->(label_id)' \
--return 'distinct t as node1, t_lab as type_label, p as label, p_lab as property_label, val as node2, si as si_units, wd as wd_units, count(distinct e) as positives, label_id as id' \
--order-by 'count(distinct e) desc'


In [108]:
display(pd.read_csv("{}/candidate_AVLs_quantity.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=5).fillna(""))


Unnamed: 0,node1,type_label,label,property_label,node2,si_units,wd_units,positives,id
0,Q2668072,'collection'@en,P1436,'collection or exhibition size'@en,0,,Q11723795,229933,Q2668072_P1436_+0__Q11723795
1,Q30612,'clinical trial'@en,P2899,'minimum age'@en,18,,Q577,213715,Q30612_P2899_+18__Q577
2,Q2668072,'collection'@en,P1436,'collection or exhibition size'@en,0,,Q59221146,212568,Q2668072_P1436_+0__Q59221146
3,Q2668072,'collection'@en,P7328,'amount cataloged'@en,0,,Q11723795,44981,Q2668072_P7328_+0__Q11723795
4,Q2668072,'collection'@en,P7328,'amount cataloged'@en,0,,Q59221146,30580,Q2668072_P7328_+0__Q59221146


### 2.4 Combining entity --> attribute label mappings to single table (for use in creating RALs)

In [102]:
!kgtk cat \
-i $OUT/entity_AVLs_time.year.tsv \
-i $OUT/entity_AVLs_quantity.tsv \
-o $OUT/entity_AVLs_all.tsv

## 3. Create RELs with counts of positive entities

Creating joined file of entities and labels

In [263]:
%%time
!kgtk query -i $ITEM_FILE -i $OUT/type_mapping.tsv -i $LABEL_FILE \
-o $OUT/entity_RELs.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), `'"$LABEL_FILE"'`: (p)-[:label]->(p_lab), `'"$LABEL_FILE"'`: (type)-[:label]->(t_lab)' \
--return 'distinct n1 as node1, "_" as label, printf("%s_%s_%s",type,p,n2) as node2, type as type, t_lab as type_label, p as prop, p_lab as property_label, n2 as value, "_" as id' \
--where 'p_lab.kgtk_lqstring_lang_suffix = "en" AND t_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'n1'


CPU times: user 55.7 s, sys: 11.6 s, total: 1min 7s
Wall time: 42min 51s


In [None]:
!kgtk add-id -i $OUT/entity_RELs.tsv -o $OUT/entity_RELs.temp.tsv --overwrite-id \
&& mv $OUT/entity_RELs.temp.tsv $OUT/entity_RELs.tsv

In [None]:
display(pd.read_csv("{}/entity_RELs.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Aggregating distinct labels with counts

In [None]:
%%time
!kgtk query -i $OUT/entity_RELs.tsv \
-o $OUT/candidate_RELs.tsv --graph-cache $STORE \
--match 'entity_RELs: (e)-[l {type:t, type_label:t_lab, prop:p, property_label:p_lab, value:val}]->(label_id)' \
--return 'distinct t as node1, t_lab as type_label, p as label, p_lab as property_label, val as node2, count(distinct e) as positives, label_id as id' \
--order-by 'count(distinct e) desc'


In [None]:
display(pd.read_csv("{}/candidate_RELs.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=5).fillna(""))


## 4. Create AILs with counts of positive entities
Similar to what we did for AVLs, we also want to keep track of entities --> matching attribute labels for future use in RAL creation (step 5)

We will create attribute *interval* labels from our attribute *value* labels that we previously created. The code that does this is explored in the explore_label_discretization notebook, and implemented in label_discretization.py.

For each entity --> labels file that has a numeric value type (year or quantity) we will:
1. Create a corresponding entity --> bucketed labels file. For example, a label in the input that looks like <country, population, 1,000,000> might get summarized (bucketed) in the output to look like <country, population, (500,000, 2,000,000)>.
2. Use the resulting bucketed entity_attribute_labels file to once again aggregate labels with counts of matching entities. This will give us a candidate_labels_ail file.

*Note on syntax we are using for ranges*: we will define ranges with lower and upper bounds. Ranges may have blank values for the lower and/or upper bounds. A range that only has an upper bound means the bin includes all values <= the upper bound. A range that has neither a lower nor an upper bound denotes a single bin that includes all values. Such ranges may be created when a <type, property> pair has very few data points.

Note that the code will print some output about labels for which it may not be creating good buckets. We'll silence this output so it doesn't take up too much space when viewing on GitHub. If you would like to see this output, comment out the `%%capture` lines.
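
The range syntax described above can be read as a membership test. The sketch below is a hypothetical helper that treats a blank bound as `None` and assumes both bounds are inclusive; the text above only states inclusiveness for the upper bound, so the lower-bound behavior is an assumption.

```python
from typing import Optional

def in_range(value: float, lower: Optional[float], upper: Optional[float]) -> bool:
    """True if value falls in the range; a bound of None (blank) means unbounded."""
    if lower is not None and value < lower:
        return False
    if upper is not None and value > upper:
        return False
    return True
```

For example, `in_range(5, None, 10)` covers the "upper bound only" case, and `in_range(x, None, None)` is true for every `x`, matching the single catch-all bin.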

### 4.1 Years
Create entity --> bucketed labels file

In [114]:
avl_file_in = "{}/entity_AVLs_time.year.tsv".format(output_dir)
ail_file_out = "{}/entity_AILs_time.year.tsv".format(output_dir)
discretize_labels_fixed_width(avl_file_in, ail_file_out, width=10)
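
For intuition, fixed-width bucketing with `width=10` can be sketched as below. This is not the code in label_discretization.py: `fixed_width_bucket` is a hypothetical helper, and how values that fall exactly on a boundary are assigned is an assumption.

```python
def fixed_width_bucket(value, width=10):
    """Return the (lower, upper) bounds of the fixed-width bin that contains value."""
    # Floor division rounds toward negative infinity, so negative years bucket correctly.
    lower = value // width * width
    return lower, lower + width
```

Under these assumptions, a year such as 2016 maps to the bounds (2010, 2020).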

In [None]:
!kgtk query -i $OUT/entity_AILs_time.year.tsv \
-o $OUT/entity_AILs_time.year.temp.tsv --graph-cache $STORE \
--match 'AILs: (e)-[l {type:t, type_label:t_lab, prop:p, property_label:p_lab, lower_bound:lb, upper_bound:ub}]->()' \
--return 'e as node1, "_" as label, printf("%s_%s_%s-%s",t,p,lb,ub) as node2, t as type, t_lab as type_label, p as prop, p_lab as property_label, lb as lower_bound, ub as upper_bound, l as id' \
--order-by 'e' \
&& mv $OUT/entity_AILs_time.year.temp.tsv $OUT/entity_AILs_time.year.tsv

In [119]:
# !kgtk query -i $OUT/entity_attribute_labels_time.year_bucketed.tsv \
# -o $OUT/entity_AILs_time.year.tsv --graph-cache $STORE \
# --match 'labels: (t)-[l {entity:e, type_label:t_lab, label:p, property_label:p_lab, lower_bound:lb, upper_bound:ub}]->(val)' \
# --return 'distinct e as node1, "_" as label, printf("%s_%s_%s-%s",t,p,lb,ub) as node2, t as type, t_lab as type_label, p as prop, p_lab as property_label, lb as lower_bound, ub as upper_bound, "_" as id' \
# --order-by 'e'\

In [120]:
# !kgtk add-id -i $OUT/entity_AILs_time.year.tsv -o $OUT/entity_AILs_time.year.temp.tsv --overwrite-id \
# && mv $OUT/entity_AILs_time.year.temp.tsv $OUT/entity_AILs_time.year.tsv

In [127]:
display(pd.read_csv(f"{output_dir}/entity_AILs_time.year.tsv", delimiter = '\t', nrows=5).fillna(""))

Unnamed: 0,node1,label,node2,type,type_label,prop,property_label,lower_bound,upper_bound,id
0,P1841,_,Q55977691_P580_2010-2020,Q55977691,'Wikidata property for authority control for a...,P580,'start time'@en,2010,2020,E1
1,P2847,_,Q18608871_P2669_2010-2020,Q18608871,'Wikidata property for items about people'@en,P2669,'discontinued date'@en,2010,2020,E2
2,P2847,_,Q24239898_P2669_2010-2020,Q24239898,'Wikidata property for Wikivoyage listings'@en,P2669,'discontinued date'@en,2010,2020,E3
3,P2847,_,Q30041186_P2669_2010-2020,Q30041186,'Wikidata property related to online communiti...,P2669,'discontinued date'@en,2010,2020,E4
4,P2847,_,Q60457486_P2669_2010-2020,Q60457486,'Wikidata property for a discontinued website'@en,P2669,'discontinued date'@en,2010,2020,E5


Aggregating distinct interval labels with positive entity counts

In [124]:
!kgtk query -i $OUT/entity_AILs_time.year.tsv \
-o $OUT/candidate_AILs_time.year.tsv --graph-cache $STORE \
--match 'AILs: (e)-[l {type:t, type_label:t_lab, prop:p, property_label:p_lab, lower_bound:lb, upper_bound:ub}]->(label_id)' \
--return 'distinct t as node1, t_lab as type_label, p as label, p_lab as property_label, "" as node2, lb as lower_bound, ub as upper_bound, count(e) as positives, label_id as id' \
--order-by 'count(e) desc'

In [126]:
display(pd.read_csv("{}/candidate_AILs_time.year.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))

Unnamed: 0,node1,type_label,label,property_label,node2,lower_bound,upper_bound,positives,id
0,Q5,'human'@en,P569,'date of birth'@en,,1950,1960,387577,Q5_P569_1950-1960
1,Q5,'human'@en,P569,'date of birth'@en,,1960,1970,369605,Q5_P569_1960-1970
2,Q5,'human'@en,P569,'date of birth'@en,,1970,1980,343421,Q5_P569_1970-1980
3,Q5,'human'@en,P569,'date of birth'@en,,1940,1950,335242,Q5_P569_1940-1950
4,Q5,'human'@en,P569,'date of birth'@en,,1980,1990,327021,Q5_P569_1980-1990


### 4.2 Quantities
Create entity --> bucketed labels file

In [4]:
avl_file_in = "{}/entity_AVLs_quantity.tsv".format(os.environ["OUT"])
ail_file_out = "{}/entity_AILs_quantity.tsv".format(os.environ["OUT"])
discretize_labels_by_percentile(avl_file_in, ail_file_out, num_bins=5)
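
Percentile-based bucketing can be sketched with the standard library. Again, this is an illustration and not the code in label_discretization.py: `percentile_bins` is a hypothetical helper, and falling back to a single unbounded bin for fewer than two values is an assumption based on the range syntax note above.

```python
import statistics

def percentile_bins(values, num_bins=5):
    """Return (lower, upper) pairs that split values into roughly equal-count bins."""
    values = sorted(values)
    if len(values) < 2:
        # Too few data points to estimate percentiles: one bin that includes everything.
        return [(None, None)]
    # quantiles() returns the num_bins - 1 interior cut points.
    cuts = statistics.quantiles(values, n=num_bins, method="inclusive")
    edges = [values[0]] + cuts + [values[-1]]
    return list(zip(edges[:-1], edges[1:]))
```

When many values are identical, neighboring edges can coincide and produce degenerate bins whose lower and upper bounds are equal.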

In [None]:
!kgtk query -i $OUT/entity_AILs_quantity.tsv \
-o $OUT/entity_AILs_quantity.temp.tsv --graph-cache $STORE \
--match 'AILs: (e)-[l {type:t, type_label:t_lab, prop:p, property_label:p_lab, lower_bound:lb, upper_bound:ub, si_units:si, wd_units:wd}]->()' \
--return 'e as node1, "_" as label, printf("%s_%s_%s-%s_%s_%s",t,p,lb,ub,si,wd) as node2, t as type, t_lab as type_label, p as prop, p_lab as property_label, lb as lower_bound, ub as upper_bound, si as si_units, wd as wd_units, l as id' \
--order-by 'e' \
&& mv $OUT/entity_AILs_quantity.temp.tsv $OUT/entity_AILs_quantity.tsv

In [128]:
# !kgtk query -i $OUT/entity_attribute_labels_quantity_bucketed.tsv \
# -o $OUT/entity_AILs_quantity.tsv --graph-cache $STORE \
# --match 'labels: (t)-[l {entity:e, type_label:t_lab, label:p, property_label:p_lab, lower_bound:lb, upper_bound:ub, si_units:si, wd_units:wd}]->(val)' \
# --return 'distinct e as node1, "_" as label, printf("%s_%s_%s-%s_%s_%s",t,p,lb,ub,si,wd) as node2, t as type, t_lab as type_label, p as prop, p_lab as property_label, lb as lower_bound, ub as upper_bound, si as si_units, wd as wd_units, l as id' \
# --order-by 'e'\

In [129]:
display(pd.read_csv(f"{output_dir}/entity_AILs_quantity.tsv", delimiter = '\t', nrows=5).fillna(""))

Unnamed: 0,node1,label,node2,type,type_label,prop,property_label,lower_bound,upper_bound,si_units,wd_units,id
0,P1004,_,Q19829908_P4876_7883.0-37427.0__,Q19829908,'Wikidata property for authority control for p...,P4876,'number of records'@en,7883.0,37427.0,,,E1
1,P1004,_,Q24075706_P4876_18000.0-95823.0__,Q24075706,"'Wikidata property for authority control, with...",P4876,'number of records'@en,18000.0,95823.0,,,E2
2,P1004,_,Q27525351_P4876_37427.0-45000.0__,Q27525351,'Wikidata property related to music'@en,P4876,'number of records'@en,37427.0,45000.0,,,E3
3,P1014,_,Q27918607_P4876_38128.0-160203.0__,Q27918607,'Wikidata property related to art'@en,P4876,'number of records'@en,38128.0,160203.0,,,E4
4,P1014,_,Q43831109_P4876_53249.0-53249.0__,Q43831109,'Wikidata property related to architecture'@en,P4876,'number of records'@en,53249.0,53249.0,,,E5


Aggregating distinct interval labels with positive entity counts

In [131]:
!kgtk query -i $OUT/entity_AILs_quantity.tsv \
-o $OUT/candidate_AILs_quantity.tsv --graph-cache $STORE \
--match 'AILs: (e)-[l {type:t, type_label:t_lab, prop:p, property_label:p_lab, lower_bound:lb, upper_bound:ub, si_units:si, wd_units:wd}]->(label_id)' \
--return 'distinct t as node1, t_lab as type_label, p as label, p_lab as property_label, "" as node2, lb as lower_bound, ub as upper_bound, si as si_units, wd as wd_units, count(e) as positives, label_id as id' \
--order-by 'count(e) desc'

In [132]:
display(pd.read_csv("{}/candidate_AILs_quantity.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))

Unnamed: 0,node1,type_label,label,property_label,node2,lower_bound,upper_bound,si_units,wd_units,positives,id
0,Q30612,'clinical trial'@en,P2899,'minimum age'@en,,2.0,18.0,,Q577,246462,Q30612_P2899_2.0-18.0__Q577
1,Q2668072,'collection'@en,P1436,'collection or exhibition size'@en,,0.0,0.0,,Q11723795,229933,Q2668072_P1436_0.0-0.0__Q11723795
2,Q2668072,'collection'@en,P1436,'collection or exhibition size'@en,,0.0,0.0,,Q59221146,212568,Q2668072_P1436_0.0-0.0__Q59221146
3,Q30612,'clinical trial'@en,P1132,'number of participants'@en,,0.0,25.0,,,70937,Q30612_P1132_0.0-25.0__
4,Q1931185,'astronomical radio source'@en,P6257,'right ascension'@en,,0.00042,111.4,,Q28390,70371,Q1931185_P6257_0.00042-111.4__Q28390


### 4.3 Combining entity --> attribute interval label mappings to single table

In [133]:
!kgtk cat \
-i $OUT/entity_AILs_quantity.tsv \
-i $OUT/entity_AILs_time.year.tsv \
-o $OUT/entity_AILs_all.tsv

## 5. Create RALs with counts of positive entities

### 5.1 RALs created from attribute *value* labels:
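
A RAL describes an entity through an attribute label of a related entity. The sketch below illustrates the join on toy data: it follows each relation (e1)-[p1]->(e2), looks up the attribute value labels of e2, and builds a label id in the same `type_prop_labelid` pattern used by the query in this section. All identifiers are made up, and for simplicity each entity has a single type.

```python
from collections import defaultdict

# Toy inputs (illustrative identifiers only).
entity_types = {"Q100": "Q5", "Q101": "Q5"}               # entity --> type
relations = [("Q100", "P27", "Q200"), ("Q101", "P27", "Q200"),
             ("Q999", "P27", "Q200")]                      # Q999 has no known type
entity_AVLs = {"Q200": ["Q6256_P571_1776"]}               # entity --> attribute value label ids

positive_entities = defaultdict(set)
for e1, p1, e2 in relations:
    t1 = entity_types.get(e1)
    if t1 is None:
        continue  # entities without a type cannot form a label
    for value_label_id in entity_AVLs.get(e2, []):
        ral_id = "{}_{}_{}".format(t1, p1, value_label_id)
        positive_entities[ral_id].add(e1)

# Number of distinct entities matching each RAL ("positives").
positives = {ral_id: len(es) for ral_id, es in positive_entities.items()}
```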

In [None]:
%%time
!kgtk query --debug -i $ITEM_FILE -i $OUT/type_mapping.tsv -i $LABEL_FILE \
-i $OUT/entity_AVLs_all.tsv -o $OUT/entity_RAVLs.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (e1)-[l1 {label:p1}]->(e2), type: (e1)-[]->(t1), entity_AVLs: (e2)-[l2]->(value_label_id), `'"$LABEL_FILE"'`: (p1)-[:label]->(p1_lab), `'"$LABEL_FILE"'`: (t1)-[:label]->(t1_lab)' \
--return 'distinct e1 as node1, "_" as label, printf("%s_%s_%s",t1,p1,value_label_id) as node2, t1 as type, t1_lab as type_label, p1 as prop, p1_lab as property_label, value_label_id as value, "_" as id' \
--order-by 'e1' \
--where 'p1_lab.kgtk_lqstring_lang_suffix = "en" AND t1_lab.kgtk_lqstring_lang_suffix = "en"'

In [None]:
!kgtk add-id -i $OUT/entity_RAVLs.tsv -o $OUT/entity_RAVLs.temp.tsv --overwrite-id \
&& mv $OUT/entity_RAVLs.temp.tsv $OUT/entity_RAVLs.tsv

In [None]:
display(pd.read_csv("{}/entity_RAVLs.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Aggregating distinct labels with counts

In [None]:
%%time
!kgtk query -i $OUT/entity_RAVLs.tsv \
-o $OUT/candidate_RAVLs.tsv --graph-cache $STORE \
--match 'entity_RAVLs: (e)-[l {type:t, type_label:t_lab, prop:p, property_label:p_lab, value:val}]->(label_id)' \
--return 'distinct t as node1, t_lab as type_label, p as label, p_lab as property_label, val as node2, count(distinct e) as positives, label_id as id' \
--order-by 'count(distinct e) desc'


In [None]:
display(pd.read_csv("{}/candidate_RAVLs.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=5).fillna(""))


### 5.2 RALs created from attribute *interval* labels:

In [None]:
%%time
!kgtk query -i $ITEM_FILE -i $OUT/type_mapping.tsv -i $LABEL_FILE \
-i $OUT/entity_AILs_all.tsv -o $OUT/entity_RAILs.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (e1)-[l1 {label:p1}]->(e2), type: (e1)-[]->(t1), entity_AILs: (e2)-[l2]->(value_label_id), `'"$LABEL_FILE"'`: (p1)-[:label]->(p1_lab), `'"$LABEL_FILE"'`: (t1)-[:label]->(t1_lab)' \
--return 'e1 as node1, "_" as label, printf("%s_%s_%s",t1,p1,value_label_id) as node2, t1 as type, t1_lab as type_label, p1 as prop, p1_lab as property_label, value_label_id as value, "_" as id' \
--order-by "e1" \
--where 'p1_lab.kgtk_lqstring_lang_suffix = "en" AND t1_lab.kgtk_lqstring_lang_suffix = "en"'

In [None]:
!kgtk add-id -i $OUT/entity_RAILs.tsv -o $OUT/entity_RAILs.temp.tsv --overwrite-id \
&& mv $OUT/entity_RAILs.temp.tsv $OUT/entity_RAILs.tsv

In [None]:
display(pd.read_csv("{}/entity_RAILs.tsv".format(output_dir), delimiter = '\t', nrows=5).fillna(""))


Aggregating distinct labels with counts

In [None]:
!kgtk query -i $OUT/entity_RAILs.tsv \
-o $OUT/candidate_RAILs.tsv --graph-cache $STORE \
--match 'entity_RAILs: (e)-[l {type:t, type_label:t_lab, prop:p, property_label:p_lab, value:val}]->(label_id)' \
--return 'distinct t as node1, t_lab as type_label, p as label, p_lab as property_label, val as node2, count(distinct e) as positives, label_id as id' \
--order-by 'count(distinct e) desc'


In [None]:
display(pd.read_csv("{}/candidate_RAILs.tsv".format(os.environ["OUT"]), delimiter = '\t', nrows=5).fillna(""))
