# Re-implementing the procedure outlined in "Entity Profiling in Knowledge Graphs" (Zhang et al.)
# This notebook implements the candidate label creation step
Using a subset of Wikidata related to Q44 ("beer")

In [68]:
import os
import subprocess  # needed by the run_command helper below

In [69]:
work_dir = "/Users/nicklein/Documents/grad_school/Research/data"

In [70]:
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(work_dir)
os.environ['DATA'] = "{}/Q44".format(work_dir)
os.environ['NAME'] = "Q44"
os.environ["OUT"] = "{}/Q44_profiler_output".format(work_dir)

Helper function for executing commands

In [71]:
def run_command(command, substitution_dictionary=None):
    """Run a templatized command."""
    # Start from the raw template, then apply each substitution in turn.
    cmd = command
    for k, v in (substitution_dictionary or {}).items():
        cmd = cmd.replace(k, v)

    print(cmd)
    # With shell=True the command is passed as a single string.
    output = subprocess.run(cmd, shell=True, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print(output.stdout)
    print(output.stderr)

## Examples of using the query command:
### Ex - Counting instances of <type, property, value> along with the property label

In [131]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $DATA/$NAME.label.en.tsv --graph-cache $STORE \
--match 'item: (n1)-[l {label: p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab)' \
--return 'distinct type as type, l.label as prop, n2 as value, count(n1) as count, lab as property_label' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'type, count(n1) desc' \
--limit 5

type	prop	value	count	property_label
Q1066984	P1343	Q97879676	3	'described by source'@en
Q1066984	P1343	Q316838	2	'described by source'@en
Q1066984	P17	Q154195	2	'country'@en
Q1066984	P17	Q183	2	'country'@en
Q1066984	P17	Q2415901	2	'country'@en


### Ex - Counting property, value pairs for time properties

In [132]:
!kgtk query -i $DATA/$NAME.part.time.tsv -i $DATA/$NAME.label.en.tsv --graph-cache $STORE \
--match 'time: (n1)-[l {label: p}]->(n2), label: (p)-[:label]->(lab)' \
--return 'distinct l.label as prop, lab as property_label, n2 as val, count(n1) as count' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(n1) desc' \
--limit 5

prop	property_label	val	count
P571	'inception'@en	^1991-00-00T00:00:00Z/9	8
P571	'inception'@en	^1989-00-00T00:00:00Z/9	6
P571	'inception'@en	^1994-00-00T00:00:00Z/9	6
P571	'inception'@en	^1996-00-00T00:00:00Z/9	6
P571	'inception'@en	^1825-01-01T00:00:00Z/9	4


### Ex - Counting property, value pairs for quantity properties

In [133]:
!kgtk query -i $DATA/$NAME.part.quantity.tsv -i $DATA/$NAME.label.en.tsv --graph-cache $STORE \
--match 'quantity: (n1)-[l {label: p}]->(n2), label: (p)-[:label]->(lab)' \
--return 'distinct l.label as prop, lab as property_label, n2 as value, count(n1) as count' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(n1) desc' \
--limit 5

prop	property_label	value	count
P3000	'marriageable age'@en	+18Q24564698	56
P2997	'age of majority'@en	+18Q24564698	52
P1279	'inflation rate'@en	+1.7Q11229	41
P2884	'mains voltage'@en	+230Q25250	40
P1279	'inflation rate'@en	+1.8Q11229	38


# Outline of procedure:
**Goal**:<br>
We want to create candidate label sets including
- Attribute value labels (type, property, *attribute*)
- Relational entity labels (type, property, *entity*)
- Attribute interval labels (type, property, *range of attribute values*)
- Relational attribute labels (type, property, *attribute or attribute range of another entity*)

To enable subsequent filtering of these labels, we also want to count:
- The number of entities of each type
- The number of entities that match each label (call these "positives")
    
**Steps**:
1. Count the number of entities of each type
    - **ISSUE:** I have not figured out how to format the resulting table so that it can be used in later kgtk commands
2. Create AVLs trivially from attribute files along with counts of the positive entities for each label
    - **ISSUE:** need to figure out how to call the right kgtk type-conversion function for each value. Pedro says there is a precision tag that I can use, but I am not sure how.
    - At this step, we should also contribute to a mapping of entities --> matching attribute labels to facilitate creating RALs in a later step
        - **TODO**   
3. Create RELs trivially from entity relation files along with counts of positive entities for each label
    - **TODO**, though this should be very simple.
4. Create AILs by discretizing the AVLs we found, along with counts of positive entities for each label
    - This can be done using the counts that we already have
    - We'll likely try a naive approach at first such as hardcoding a bucket size for each numeric type we care about
    - At this step, we should also contribute to a mapping of entities --> matching attribute labels to facilitate creating RALs in a later step
    - **TODO**
5. Create RALs by using the REL table and the entities --> attribute labels tables that we built in steps 2 and 4. Also keep track of counts of positive entities for each label
    - **Progress:** I've implemented a query that creates RALs from scratch (i.e., without using tables built in previous steps), but I still need to reimplement it on top of those tables.
6. Combine the 4 label tables
    - **TODO**
7. Filter the labels using a simple heuristic based on counts
    - **TODO**

## 1. Count number of entities of each type:

In [98]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $DATA/$NAME.label.en.tsv -o $OUT/$NAME.entity_counts_per_type.tsv --graph-cache $STORE \
--match 'item: (n1)-[:P31]->(type), label: (type)-[:label]->(lab)' \
--return 'distinct type as type, count(distinct n1) as count, lab as entity_label' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc'

In [81]:
!head -5 $OUT/$NAME.entity_counts_per_type.tsv

type	count	entity_label
Q131734	87	'brewery'@en
Q3624078	69	'sovereign state'@en
Q4830453	50	'business'@en
Q6256	26	'country'@en


### Issue - how can I use these counts in future queries? Attempting to figure out correct syntax below

In [78]:
!kgtk query -i $OUT/$NAME.entity_counts_per_type.tsv --graph-cache $STORE \
--match 'counts: (t:type)-[]->(c:count)' \
--return 't as type, c as count' \
--order-by 'c desc' \
--limit 10

'node2'
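
The `'node2'` message looks like a missing-column error: the counts file has the columns `type`, `count`, and `entity_label`, whereas the other inputs in this notebook are edge files queried through `node1`/`label`/`node2`. One possible workaround, sketched here under that assumption, is to rewrite the counts table as edges. The function name `counts_to_edges` and the edge label `entity_count` are hypothetical, chosen only for illustration.

```python
def counts_to_edges(lines, edge_label="entity_count"):
    """Rewrite a (type, count, entity_label) TSV as node1/label/node2 edge lines.

    `edge_label` is a made-up relation name, not a Wikidata property.
    """
    rows = iter(lines)
    header = next(rows).rstrip("\n").split("\t")
    type_idx = header.index("type")
    count_idx = header.index("count")

    yield "node1\tlabel\tnode2"
    for line in rows:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        yield "\t".join([fields[type_idx], edge_label, fields[count_idx]])


if __name__ == "__main__":
    # Two lines copied from the counts output above.
    sample = [
        "type\tcount\tentity_label\n",
        "Q131734\t87\t'brewery'@en\n",
    ]
    for edge_line in counts_to_edges(sample):
        print(edge_line)
```

With a file in this shape, a match clause along the lines of `(t)-[:entity_count]->(c)` may be what the query needs, though that has not been run here.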



## 2. Create AVLs with counts of positive entities
### 2.a. Keep track of entities --> matching attribute labels for future use (TODO)


### Issue: Need to use kgtk type functions to interpret values as the correct type
How do I determine whether a value is a full timestamp or just a year? Pedro says there is a precision tag that I can use, but I am not sure how to do that yet.
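
As a starting point, the time values shown in this notebook (for example `^1991-00-00T00:00:00Z/9`) end in `/9`, which appears to be the precision tag. In Wikidata's time datatype, precision 9 means year, 10 month, and 11 day. The sketch below reads that suffix in plain Python; it assumes every time literal follows the `^<timestamp>/<precision>` shape seen above, and the helper name `parse_time_literal` is hypothetical rather than part of kgtk.

```python
import re

# Wikidata precision codes (subset): 7 century, 8 decade, 9 year, 10 month, 11 day.
PRECISION_NAMES = {7: "century", 8: "decade", 9: "year", 10: "month", 11: "day"}

# '^' marker, timestamp, then an optional '/<precision>' suffix.
TIME_LITERAL = re.compile(r"^\^(?P<ts>[^/]+)(?:/(?P<precision>\d+))?$")


def parse_time_literal(value):
    """Split a time literal into timestamp, year, and precision.

    Returns None when the value does not look like a time literal.
    """
    match = TIME_LITERAL.match(value)
    if match is None:
        return None
    timestamp = match.group("ts")
    raw_precision = match.group("precision")
    precision = int(raw_precision) if raw_precision is not None else None
    year_match = re.match(r"[+-]?\d+", timestamp)
    year = int(year_match.group()) if year_match else None
    return {
        "timestamp": timestamp,
        "year": year,
        "precision": precision,
        "granularity": PRECISION_NAMES.get(precision, "unknown"),
    }


if __name__ == "__main__":
    print(parse_time_literal("^1991-00-00T00:00:00Z/9"))
```

A value with precision 9 could then be treated as a year label, while precision 11 would keep the full date.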

In [134]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $DATA/$NAME.part.string.tsv -i $DATA/$NAME.label.en.tsv \
-o $OUT/$NAME.candidate_labels_avl_string.tsv --graph-cache $STORE \
--match 'string: (n1)-[l {label:p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab)' \
--return 'distinct type as type, p as prop, n2 as value, count(distinct n1) as positives, lab as property_label' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc'

In [135]:
!head -5 $OUT/$NAME.candidate_labels_avl_string.tsv

type	prop	value	positives	property_label
Q3624078	P3238	"0"	34	'trunk prefix'@en
Q3624078	P3238	novalue	14	'trunk prefix'@en
Q6256	P3238	"0"	12	'trunk prefix'@en
Q179164	P3238	"0"	9	'trunk prefix'@en


In [136]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $DATA/$NAME.part.time.tsv -i $DATA/$NAME.label.en.tsv \
-o $OUT/$NAME.candidate_labels_avl_time.tsv --graph-cache $STORE \
--match 'time: (n1)-[l {label:p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab)' \
--return 'distinct type as type, p as prop, n2 as value, count(distinct n1) as positives, lab as property_label' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc'

In [137]:
!head -5 $OUT/$NAME.candidate_labels_avl_time.tsv

type	prop	value	positives	property_label
Q3624078	P571	^1991-00-00T00:00:00Z/9	3	'inception'@en
Q123480	P571	^1991-00-00T00:00:00Z/9	2	'inception'@en
Q131734	P571	^1985-01-01T00:00:00Z/9	2	'inception'@en
Q131734	P571	^1996-00-00T00:00:00Z/9	2	'inception'@en


## 3. Create RELs with counts of positive entities
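This step is still a TODO. As a rough illustration of the counting involved, the sketch below computes, for each (type, property, entity) label, the number of distinct entities that match it. It works on in-memory triples rather than kgtk files; the function name `count_rels` and the entity IDs `brewery_a` and `brewery_b` are made up for the example. Like the first example query, it does not exclude P31 edges themselves.

```python
from collections import defaultdict


def count_rels(item_edges):
    """Count positives per (type, prop, entity) label.

    `item_edges` is an iterable of (node1, prop, node2) item-valued edges.
    """
    edges = list(item_edges)

    # First pass: collect the types (P31 values) of every entity.
    types = defaultdict(set)
    for node1, prop, node2 in edges:
        if prop == "P31":
            types[node1].add(node2)

    # Second pass: collect the distinct entities matching each label.
    positives = defaultdict(set)
    for node1, prop, node2 in edges:
        for type_ in types.get(node1, ()):
            positives[(type_, prop, node2)].add(node1)

    return {label: len(entities) for label, entities in positives.items()}


if __name__ == "__main__":
    edges = [
        ("brewery_a", "P31", "Q131734"),
        ("brewery_b", "P31", "Q131734"),
        ("brewery_a", "P17", "Q183"),
        ("brewery_b", "P17", "Q183"),
    ]
    for label, count in count_rels(edges).items():
        print(label, count)
```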

## 4. Create AILs with counts of positive entities
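This step is also still a TODO. The naive approach from the outline (a hardcoded bucket size per numeric type, reusing the AVL counts) might look like the sketch below. It assumes the AVL values have already been converted to numbers, such as inception years, and that summing positives is acceptable, which overcounts whenever one entity has several values in the same bucket. The function name `bucket_avls` and the bucket size of 10 are illustrative choices.

```python
from collections import Counter


def bucket_avls(avl_rows, bucket_size):
    """Aggregate (type, prop, value, positives) rows into fixed-width intervals.

    Returns a Counter keyed by (type, prop, lower, upper), where the interval
    is [lower, upper).
    """
    counts = Counter()
    for type_, prop, value, positives in avl_rows:
        lower = (value // bucket_size) * bucket_size
        counts[(type_, prop, lower, lower + bucket_size)] += positives
    return counts


if __name__ == "__main__":
    # Years and counts taken from the time AVL output above.
    rows = [
        ("Q3624078", "P571", 1991, 3),
        ("Q131734", "P571", 1985, 2),
        ("Q131734", "P571", 1996, 2),
    ]
    for label, positives in bucket_avls(rows, bucket_size=10).items():
        print(label, positives)
```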
### 4.a. Keep track of entities --> matching attribute labels for future use


## 5. Create RALs with counts of positive entities

In [130]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $DATA/$NAME.part.string.tsv -i $DATA/$NAME.label.en.tsv \
--graph-cache $STORE \
--match 'item: (n1)-[l1 {label:p1}]->(n2), item: (n1)-[:P31]->(type1), label: (p1)-[:label]->(lab1), item: (n2)-[l2 {label:p2}]->(n3), item: (n2)-[:P31]->(type2), label: (p2)-[:label]->(lab2)' \
--return 'distinct type1 as type1, p1 as prop1, type2 as type2, p2 as prop2, n3 as value, count(distinct n1) as positives, lab1 as prop1_label, lab2 as prop2_label' \
--where 'lab1.kgtk_lqstring_lang_suffix = "en" and lab2.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc' \
--limit 10

type1	prop1	type2	prop2	value	positives	prop1_label	prop2_label
Q131734	P452	Q8148	P1056	Q44	77	'industry'@en	'product or material produced'@en
Q131734	P452	Q8148	P1343	Q2041543	77	'industry'@en	'described by source'@en
Q131734	P452	Q8148	P361	Q26705422	77	'industry'@en	'part of'@en
Q131734	P452	Q8148	P452	Q1365365	77	'industry'@en	'industry'@en
Q131734	P452	Q8148	P910	Q8311121	77	'industry'@en	'topic\'s main category'@en
Q131734	P17	Q3624078	P1343	Q19180675	70	'country'@en	'described by source'@en
Q131734	P17	Q3624078	P463	Q1043527	70	'country'@en	'member of'@en
Q131734	P17	Q3624078	P463	Q1065	70	'country'@en	'member of'@en
Q131734	P17	Q3624078	P463	Q170424	70	'country'@en	'member of'@en
Q131734	P17	Q3624078	P463	Q17495	70	'country'@en	'member of'@en


## 6. Combining generated labels into a single file

## 7. Filtering candidate labels - This can go in a separate notebook
A simple rule-based filter to remove labels that are trivially unrepresentative or indistinctive.
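
One way such a filter could look is sketched below. Both rules and both thresholds are assumptions made for illustration, not the heuristic from the paper: a label is treated as unrepresentative when the fraction of its type's entities that match it (`support`) falls below `min_support`, and as indistinctive when its type accounts for less than `min_share` of all positives recorded for the same (property, value) pair. The second rule is only a rough proxy, since entities with several types are counted once per type.

```python
from collections import defaultdict


def filter_labels(labels, type_counts, min_support=0.25, min_share=0.5):
    """Keep (type, prop, value, positives) labels that pass both thresholds.

    `type_counts` maps each type to its number of entities (step 1).
    The default thresholds are arbitrary placeholders.
    """
    # Total positives per (prop, value) across all types.
    totals = defaultdict(int)
    for _type, prop, value, positives in labels:
        totals[(prop, value)] += positives

    kept = []
    for type_, prop, value, positives in labels:
        n_type = type_counts.get(type_)
        if not n_type:
            continue  # no entity count available for this type
        support = positives / n_type
        share = positives / totals[(prop, value)]
        if support >= min_support and share >= min_share:
            kept.append((type_, prop, value, positives))
    return kept


if __name__ == "__main__":
    # Counts taken from the string AVL and entity count outputs above.
    labels = [
        ("Q3624078", "P3238", '"0"', 34),
        ("Q3624078", "P3238", "novalue", 14),
        ("Q6256", "P3238", '"0"', 12),
    ]
    type_counts = {"Q3624078": 69, "Q6256": 26}
    for label in filter_labels(labels, type_counts):
        print(label)
```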