# Re-implementing the procedure outlined in "Entity Profiling in Knowledge Graphs" (Zhang et al.)
# This notebook will implement the candidate label creation step
Using a subset of Wikidata related to Q44 ("beer")

In [1]:
import os

In [2]:
os.environ['STORE'] = "wikidata.sqlite3.db"
os.environ['DATA'] = "Q44"
os.environ['NAME'] = "Q44"
os.environ["OUT"] = "output"

Helper function for executing commands

In [4]:
import subprocess

def run_command(command, substitution_dictionary=None):
    """Run a templatized command."""
    # Start from the raw template; each key found in it is replaced by its value.
    cmd = command
    for k, v in (substitution_dictionary or {}).items():
        cmd = cmd.replace(k, v)

    print(cmd)
    # Run through the shell and capture stdout/stderr as text.
    output = subprocess.run(cmd, shell=True, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print(output.stdout)
    print(output.stderr)
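
To illustrate the substitution step on its own, the sketch below expands placeholders without executing anything. `expand_template` is a hypothetical name used only for this illustration; it mirrors the replace loop of the helper. Plain `str.replace` substitutes every occurrence of a key, including inside longer words, so placeholder names need to be chosen to avoid collisions.

```python
def expand_template(command, substitutions=None):
    """Return `command` with every placeholder key replaced by its value."""
    cmd = command
    for key, value in (substitutions or {}).items():
        cmd = cmd.replace(key, value)
    return cmd


# Placeholder names DATA and NAME are illustrative choices for this sketch.
template = "kgtk query -i DATA/NAME.part.time.tsv --limit 10"
expanded = expand_template(template, {"DATA": "Q44", "NAME": "Q44"})
print(expanded)
```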

## Examples of using query command:
### Ex - Counting instances of <type, property, value> along with the property label

In [3]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $DATA/$NAME.label.en.tsv --graph-cache $STORE \
--match 'item: (n1)-[l {label: p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab)' \
--return 'distinct type as type, l.label as prop, n2 as value, count(n1) as count, lab as property_label' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'type, count(n1) desc' \
--limit 10

type	prop	value	count	property_label
Q1066984	P1343	Q97879676	3	'described by source'@en
Q1066984	P1343	Q316838	2	'described by source'@en
Q1066984	P17	Q154195	2	'country'@en
Q1066984	P17	Q183	2	'country'@en
Q1066984	P17	Q2415901	2	'country'@en
Q1066984	P17	Q256961	2	'country'@en
Q1066984	P17	Q41304	2	'country'@en
Q1066984	P17	Q43287	2	'country'@en
Q1066984	P17	Q713750	2	'country'@en
Q1066984	P17	Q7318	2	'country'@en


### Ex - Counting property, value pairs for time properties

In [31]:
!kgtk query -i $DATA/$NAME.part.time.tsv -i $DATA/$NAME.label.en.tsv --graph-cache $STORE \
--match 'time: (n1)-[l {label: p}]->(n2), label: (p)-[:label]->(lab)' \
--return 'distinct l.label as prop, lab as property_label, kgtk_date_year(n2) as year, count(n1) as count' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(n1) desc' \
--limit 10

prop	property_label	year	count
P571	'inception'@en	1991	20
P571	'inception'@en	1918	14
P571	'inception'@en	1989	8
P571	'inception'@en	1996	8
P571	'inception'@en	1821	6
P571	'inception'@en	1825	6
P571	'inception'@en	1960	6
P571	'inception'@en	1975	6
P571	'inception'@en	1994	6
P571	'inception'@en	800	4


### Ex - Counting property, value pairs for quantity properties

In [32]:
!kgtk query -i $DATA/$NAME.part.quantity.tsv -i $DATA/$NAME.label.en.tsv --graph-cache $STORE \
--match 'quantity: (n1)-[l {label: p}]->(n2), label: (p)-[:label]->(lab)' \
--return 'distinct l.label as prop, lab as property_label, n2 as value, count(n1) as count' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(n1) desc' \
--limit 10

prop	property_label	value	count
P3000	'marriageable age'@en	+18Q24564698	56
P2997	'age of majority'@en	+18Q24564698	52
P1279	'inflation rate'@en	+1.7Q11229	41
P2884	'mains voltage'@en	+230Q25250	40
P1279	'inflation rate'@en	+1.8Q11229	38
P1279	'inflation rate'@en	+2.1Q11229	35
P1279	'inflation rate'@en	+1.5Q11229	31
P1279	'inflation rate'@en	+2.8Q11229	31
P1279	'inflation rate'@en	+2Q11229	29
P1279	'inflation rate'@en	+3.5Q11229	28


# We want to create candidate label sets along with counts of:
## 1. number of entities of each type
## 2. number of positive entities for each label 

## 1. Count number of entities of each type:

In [23]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $DATA/$NAME.label.en.tsv -o $OUT/$NAME.entity_counts_per_type.tsv --graph-cache $STORE \
--match 'item: (n1)-[:P31]->(type), label: (type)-[:label]->(lab)' \
--return 'distinct type as type, count(distinct n1) as count' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc'

In [24]:
!head $OUT/$NAME.entity_counts_per_type.tsv

type	count
Q131734	87
Q3624078	69
Q4830453	50
Q6256	26
Q6881511	23
Q7270	18
Q1998962	16
Q179164	16
Q123480	15


## TODO - How can I use these counts in future queries? The query below is an attempt to figure out the correct syntax

In [25]:
!kgtk query -i $OUT/$NAME.entity_counts_per_type.tsv --graph-cache $STORE \
--match 'counts: (t:type)-[]->(c:count)' \
--return 't as type, c as count' \
--order-by 'c desc' \
--limit 10

'node1'
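
One possible explanation for the `'node1'` error, offered as an assumption rather than a confirmed diagnosis: the counts file only has `type` and `count` columns, while KGTK edge files are expected to carry `node1`, `label`, and `node2` columns. A minimal sketch of reshaping the counts into that layout follows. The function name and the edge label `entity_count` are hypothetical.

```python
def counts_to_edges(lines, edge_label="entity_count"):
    """Turn 'type<TAB>count' rows into 'node1<TAB>label<TAB>node2' rows."""
    header = lines[0].rstrip("\n").split("\t")
    type_idx, count_idx = header.index("type"), header.index("count")
    edges = ["node1\tlabel\tnode2"]
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        # The type becomes node1, the count becomes node2.
        edges.append("\t".join([fields[type_idx], edge_label, fields[count_idx]]))
    return edges


# First rows of the counts file shown above.
sample = ["type\tcount\n", "Q131734\t87\n", "Q3624078\t69\n"]
edges = counts_to_edges(sample)
print("\n".join(edges))
```

If that assumption holds, the reshaped file could be passed to `kgtk query` and matched with a pattern along the lines of `(t)-[:entity_count]->(c)`.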



## 2. Create labels with counts of positive entities 
A label is defined by <Type, Property, Value>.
A positive entity for a label is an entity that matches that label.
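
The definition above can be sketched in plain Python, independent of KGTK. This is a toy illustration, not the notebook's pipeline: the entity and type IDs (`E1`, `T1`, ...) are made up, and `count_positives` is a hypothetical helper. Collecting entities in sets means duplicate edges do not inflate the counts.

```python
from collections import defaultdict


def count_positives(type_edges, property_edges):
    """Count distinct entities matching each <type, property, value> label."""
    types_of = defaultdict(set)
    for entity, type_ in type_edges:
        types_of[entity].add(type_)

    positives = defaultdict(set)
    for entity, prop, value in property_edges:
        # An entity contributes to one label per type it is an instance of.
        for type_ in types_of.get(entity, ()):
            positives[(type_, prop, value)].add(entity)
    return {label: len(entities) for label, entities in positives.items()}


# Toy data (hypothetical IDs); the last property edge is a deliberate duplicate.
type_edges = [("E1", "T1"), ("E2", "T1"), ("E3", "T2")]
property_edges = [
    ("E1", "P17", "Q183"),
    ("E2", "P17", "Q183"),
    ("E3", "P17", "Q183"),
    ("E1", "P17", "Q183"),
]
counts = count_positives(type_edges, property_edges)
print(counts)
```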
### 2.a. First creating AVLs and RELs...
Note - I am counting _distinct_ n1s for the positives column so that duplicate triples in the edge file are ignored (there appear to be some, since using distinct changes some of the counts).
# TODO: Need to use kgtk type functions to interpret values as the correct type
I am unsure how this can be done automatically. I could pass in the file that holds the time properties and interpret its values with kgtk_date_year(), but how do we know that year is the best granularity for each property in that file? Furthermore, we would like to feed in all of the edge files (perhaps concatenated) and have labels created without manually determining which value types each file holds.
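
One direction worth exploring, sketched under the assumption that KGTK values signal their kind through their first character (double quote for strings, single quote for language-qualified strings, `^` for dates, `@` for coordinates, a sign or digit for numbers and quantities, anything else a symbol). `guess_value_kind` is a hypothetical helper, not a KGTK function, and the date literal in the example is illustrative rather than taken from the data.

```python
def guess_value_kind(value):
    """Guess the kind of a KGTK value from its leading character."""
    if not value:
        return "empty"
    first = value[0]
    if first == '"':
        return "string"
    if first == "'":
        return "language_qualified_string"
    if first == "^":
        return "date"
    if first == "@":
        return "coordinates"
    if first in "+-." or first.isdigit():
        return "quantity"
    return "symbol"


examples = ['"80331"', "'country'@en", "+18Q24564698", "Q183", "^1991-01-01T00:00:00Z/9"]
kinds = [guess_value_kind(v) for v in examples]
print(list(zip(examples, kinds)))
```

With a dispatch like this, concatenated edge files could in principle be routed to type-specific handling per value. If date values carry a trailing precision suffix (such as the `/9` in the illustrative literal), that suffix might also inform the choice of granularity, though this has not been checked against the data here.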

In [26]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $DATA/$NAME.part.string.tsv -i $DATA/$NAME.label.en.tsv \
-o $OUT/$NAME.candidate_labels_avl_rel.tsv --graph-cache $STORE \
--match 'string: (n1)-[l {label:p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab)' \
--return 'distinct type as type, p as prop, n2 as value, count(distinct n1) as positives, lab as property_label' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'type, count(distinct n1) desc'

In [28]:
!head $OUT/$NAME.candidate_labels_avl_rel.tsv

type	prop	value	positives	property_label
Q1066984	P1438	"15521"	1	'Jewish Encyclopedia ID (Russian)'@en
Q1066984	P281	"80331"	1	'postal code'@en
Q1066984	P281	"80802"	1	'postal code'@en
Q1066984	P281	"80805"	1	'postal code'@en
Q1066984	P281	"81377"	1	'postal code'@en
Q1066984	P281	"81379"	1	'postal code'@en
Q1066984	P281	"81730"	1	'postal code'@en
Q1066984	P281	"81735"	1	'postal code'@en
Q1066984	P281	"81737"	1	'postal code'@en


### 2.b. Now creating AILs
# TODO - need to determine how to discretize values
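
As a starting point for this TODO, here is a minimal sketch of equal-width binning. It is only one of several possible discretization schemes and is not claimed to be the one used in the paper. The function names and the default of four bins are assumptions; the years are the inception years from the time-property example above.

```python
def equal_width_intervals(values, n_bins=4):
    """Split the range of `values` into `n_bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return [(lo, hi)]
    width = (hi - lo) / n_bins
    # Use `hi` itself as the final edge to avoid floating-point drift.
    edges = [lo + i * width for i in range(n_bins)] + [hi]
    return list(zip(edges, edges[1:]))


def assign_interval(value, intervals):
    """Return the [start, end) interval containing `value` (last one is closed)."""
    for i, (start, end) in enumerate(intervals):
        is_last = i == len(intervals) - 1
        if start <= value < end or (is_last and value == end):
            return (start, end)
    return None


years = [1991, 1918, 1989, 1996, 1821, 1825, 1960, 1975, 1994, 800]
intervals = equal_width_intervals(years)
print(intervals)
print(assign_interval(1991, intervals))
```

The single outlier (800) stretches the range so that most years fall into the last bin, which suggests that quantile-based or clustering-based intervals may be a better fit for skewed properties.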

### 2.c. Now creating RALs

# Combining generated labels into a single file

## Filtering candidate labels - This can go in a separate notebook
Simple rule-based filter to remove labels that are trivially either unrepresentative or indistinctive
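
A minimal sketch of what such a filter could look like, under stated assumptions: "unrepresentative" is read as low support (positives divided by the number of entities of the type), and "indistinctive" is read as a property-value pair whose positives are spread across types rather than concentrated in one. Both readings, the thresholds, the function name, and the toy IDs are hypothetical.

```python
from collections import defaultdict


def filter_labels(labels, type_counts, min_support=0.2, min_distinctiveness=0.6):
    """Keep labels that are both frequent within their type and concentrated in it.

    labels: list of (type, prop, value, positives) tuples.
    type_counts: dict mapping type -> number of entities of that type.
    """
    # Total positives per (prop, value) across all types. Entities with
    # several types are counted once per type, so this is an approximation.
    totals = defaultdict(int)
    for _type, prop, value, positives in labels:
        totals[(prop, value)] += positives

    kept = []
    for type_, prop, value, positives in labels:
        support = positives / type_counts[type_]
        distinctiveness = positives / totals[(prop, value)]
        if support >= min_support and distinctiveness >= min_distinctiveness:
            kept.append((type_, prop, value, positives))
    return kept


# Toy data with made-up type and value IDs.
type_counts = {"T1": 10, "T2": 10}
labels = [
    ("T1", "P17", "V1", 8),   # common in T1 but equally common in T2
    ("T2", "P17", "V1", 8),
    ("T1", "P281", "V2", 1),  # rare within T1
    ("T1", "P452", "V3", 6),  # frequent in T1 and specific to it
]
kept = filter_labels(labels, type_counts)
print(kept)
```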