# Re-implementing the procedure outlined in "Entity Profiling in Knowledge Graphs" (Zhang et al.)
# This notebook implements the candidate label creation step
Using a subset of Wikidata related to Q44 ("beer")

In [68]:
import os

In [69]:
work_dir = "/Users/nicklein/Documents/grad_school/Research/data"

In [70]:
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(work_dir)
os.environ['DATA'] = "{}/Q44".format(work_dir)
os.environ['NAME'] = "Q44"
os.environ["OUT"] = "{}/Q44_profiler_output".format(work_dir)

Helper function for executing commands

In [71]:
import subprocess

def run_command(command, substitution_dictionary=None):
    """Run a templatized command."""
    cmd = command
    # Replace each placeholder key with its value before running the command.
    for k, v in (substitution_dictionary or {}).items():
        cmd = cmd.replace(k, v)

    print(cmd)
    output = subprocess.run(cmd, shell=True, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print(output.stdout)
    print(output.stderr)

## Examples of using query command:
### Ex - Counting instances of <type, property, value> along with the property label

In [131]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $DATA/$NAME.label.en.tsv --graph-cache $STORE \
--match 'item: (n1)-[l {label: p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab)' \
--return 'distinct type as type, l.label as prop, n2 as value, count(n1) as count, lab as property_label' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'type, count(n1) desc' \
--limit 5

type	prop	value	count	property_label
Q1066984	P1343	Q97879676	3	'described by source'@en
Q1066984	P1343	Q316838	2	'described by source'@en
Q1066984	P17	Q154195	2	'country'@en
Q1066984	P17	Q183	2	'country'@en
Q1066984	P17	Q2415901	2	'country'@en


### Ex - Counting property, value pairs for time properties

In [165]:
!kgtk query -i $DATA/$NAME.part.time.tsv -i $DATA/$NAME.label.en.tsv --graph-cache $STORE \
--match 'time: (n1)-[l {label: p}]->(n2), label: (p)-[:label]->(lab)' \
--return 'distinct l.label as prop, lab as property_label, kgtk_date_year(n2) as val, count(n1) as count' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(n1) desc' \
--limit 5

prop	property_label	val	count
P571	'inception'@en	1991	20
P571	'inception'@en	1918	14
P571	'inception'@en	1989	8
P571	'inception'@en	1996	8
P571	'inception'@en	1821	6


### Ex- Counting property, value pairs for quantity properties

In [133]:
!kgtk query -i $DATA/$NAME.part.quantity.tsv -i $DATA/$NAME.label.en.tsv --graph-cache $STORE \
--match 'quantity: (n1)-[l {label: p}]->(n2), label: (p)-[:label]->(lab)' \
--return 'distinct l.label as prop, lab as property_label, n2 as value, count(n1) as count' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(n1) desc' \
--limit 5

prop	property_label	value	count
P3000	'marriageable age'@en	+18Q24564698	56
P2997	'age of majority'@en	+18Q24564698	52
P1279	'inflation rate'@en	+1.7Q11229	41
P2884	'mains voltage'@en	+230Q25250	40
P1279	'inflation rate'@en	+1.8Q11229	38


# Outline of procedure:
**Goal**:<br>
We want to create candidate label sets including
- Attribute value labels (type, property, *attribute*)
- Relational entity labels (type, property, *entity*)
- Attribute interval labels (type, property, *range of attribute values*)
- Relational attribute labels (type, property, *attribute or attribute range of another entity*)

To enable subsequent filtering of these labels, we also want to count:
- The number of entities of each type
- The number of entities that match each label (call these "positives")
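
A minimal sketch of how the four label kinds could be represented, to make the shapes concrete. The class and field names are hypothetical (they do not come from the paper or from kgtk), and `None` is assumed to mean an open-ended interval bound.

```python
from typing import NamedTuple, Optional, Union


class AttributeValueLabel(NamedTuple):       # AVL: (type, property, attribute)
    type: str
    prop: str
    value: str


class RelationalEntityLabel(NamedTuple):     # REL: (type, property, entity)
    type: str
    prop: str
    entity: str


class AttributeIntervalLabel(NamedTuple):    # AIL: (type, property, value range)
    type: str
    prop: str
    low: Optional[float]    # assumption: None means unbounded below
    high: Optional[float]   # assumption: None means unbounded above


class RelationalAttributeLabel(NamedTuple):  # RAL: attribute label of another entity
    type: str
    prop: str
    target: Union[AttributeValueLabel, AttributeIntervalLabel]


# Illustrative instances only; the ids are taken from outputs in this notebook.
avl = AttributeValueLabel("Q131734", "P571", "1991")
rel = RelationalEntityLabel("Q131734", "P452", "Q869095")
ail = AttributeIntervalLabel("Q131734", "P571", 1990, 1995)
ral = RelationalAttributeLabel(
    "Q131734", "P17", AttributeValueLabel("Q3624078", "P3238", '"0"')
)
```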
    
**Steps**:
1. Count the number of entities of each type
    - *optional future step*: define type with P279 transitive closure in addition to P31. 
2. Create AVLs trivially from attribute files along with counts of the positive entities for each label
    - **TODO:** create labels for all attribute types that we care about. Currently only strings and year-precision times are covered; the other time precisions and quantities are still missing.
    - At this step, we should also contribute to a mapping of entities --> matching attribute labels to facilitate creating RALs in a later step  
3. Create RELs trivially from entity relation files along with counts of positive entities for each label
4. Create AILs by discretizing the AVLs we found, along with counts of positive entities for each label
    - This can be done using the counts that we already have
    - We'll likely try a naive approach at first such as hardcoding a bucket size for each numeric type we care about
    - At this step, we should also contribute to a mapping of entities --> matching attribute labels to facilitate creating RALs in a later step
    - **Kind-of issue:** the naive bucketing method needs a mod operator, which does not seem to be supported in queries.
    - **TODO:** make a table of type/property --> low_bound/up_bound buckets to use instead of the mod operator.
    - For that table it would be nice to have some -/+ infinity value. Otherwise we need logic like "WHERE low_bound is null OR low_bound < value", which might work.
5. Create RALs by using the REL table and the entities --> attribute labels table that we built in steps 2 and 4. Also keep track of counts of positive entities for each label
    - **Progress:** I've implemented a query that creates RALs from scratch (i.e. not using the tables built in previous steps).
    - I have also implemented RALs using the tables made in previous steps. The results do not seem to match those of the from-scratch query; I am looking into where the problem is.
6. Combine the 4 label tables
    - **TODO**
7. Filter the labels using a simple heuristic based on counts
    - **TODO**
    
*Misc issues encountered*
- Some of the kgtk type interpretation functions aren't recognized (e.g. kgtk_date_year works without problems, but kgtk_quantity_number_int is not recognized)
- Is there a way to make an id by concatenating multiple values in a row? I found stringify, but I'm not sure how to concatenate strings.
    - This could be useful for giving labels ids, so that we could look up an entity's labels by id rather than by type, prop, and value
    - --> yes: kgtk add-id
- kgtk rename-columns doesn't always work when the input file is also the output file. I'm working around this for now by creating extra temp files.
- Is the mod operator not supported in a query?
- Is there a notion of infinity/-infinity that I can put in a kgtk file?
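
For the label-id idea above, here is a small sketch of what a concatenated id and an entity --> label-id lookup might look like if built in Python rather than with kgtk. The function name `make_label_id` and the `|` separator are assumptions for illustration; note that `|` can also occur inside string values, so a real id scheme would need escaping or a different separator.

```python
from collections import defaultdict


def make_label_id(label_type, prop, value, sep="|"):
    """Concatenate the three label fields into a single string id."""
    return sep.join((label_type, prop, str(value)))


# (entity, type, prop, value) rows, shaped like the entity --> label mapping files.
rows = [
    ("Q1000597", "Q3957", "P281", '"DE14"'),
    ("Q1000597", "Q3957", "P473", '"01283"'),
]

# Look up the labels of an entity by id instead of by (type, prop, value).
labels_by_entity = defaultdict(set)
for entity, label_type, prop, value in rows:
    labels_by_entity[entity].add(make_label_id(label_type, prop, value))
```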

## 1. Count number of entities of each type:

In [213]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $DATA/$NAME.label.en.tsv -o $OUT/$NAME.entity_counts_per_type.tsv --graph-cache $STORE \
--match 'item: (n1)-[:P31]->(type), label: (type)-[:label]->(lab)' \
--return 'distinct type as node1_x, count(distinct n1) as node2_x, lab as entity_label, "_" as id' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc'

In [214]:
!kgtk rename-columns -i $OUT/$NAME.entity_counts_per_type.tsv -o $OUT/$NAME.entity_counts_per_type.tsv \
--old-columns node1_x node2_x --new-columns node1 node2 

In [160]:
!head -5 $OUT/$NAME.entity_counts_per_type.tsv

node1	node2	entity_label	id
Q131734	87	'brewery'@en	_
Q3624078	69	'sovereign state'@en	_
Q4830453	50	'business'@en	_
Q6256	26	'country'@en	_


kgtk_quantity_number_int is not recognized below, but otherwise the table above can be used like so:

In [171]:
!kgtk query -i $OUT/$NAME.entity_counts_per_type.tsv --graph-cache $STORE \
--match 'counts: (t)-[]->(c)' \
--return 't as type, c as count' \
--order-by 'kgtk_quantity_number_int(c) desc' \
--limit 5

no such function: kgtk_number
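
Until the numeric function issue is resolved, one possible workaround is to sort the counts outside of the query. This is a sketch that reads the per-type counts table with the standard `csv` module and sorts on the integer value of `node2`; the helper name `read_type_counts` is hypothetical, and the sample rows are copied from the table above in shuffled order.

```python
import csv
import io


def read_type_counts(handle):
    """Read a type-count TSV and return rows sorted by count, largest first."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [(r["node1"], int(r["node2"]), r["entity_label"]) for r in reader]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# In the notebook this would be open(".../Q44.entity_counts_per_type.tsv");
# an in-memory sample keeps the sketch self-contained.
sample = io.StringIO(
    "node1\tnode2\tentity_label\tid\n"
    "Q6256\t26\t'country'@en\t_\n"
    "Q131734\t87\t'brewery'@en\t_\n"
    "Q3624078\t69\t'sovereign state'@en\t_\n"
)
top = read_type_counts(sample)
```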



## 2. Create AVLs with counts of positive entities
### Keep numeric valued labels and non-numeric valued labels separate to help with discretization at a later step
### 2.a. Keep track of entities --> matching attribute labels for future use


**TODO:** use a loop to do this for all types we care about (e.g. day / month / year); currently this is only done for years
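
As a rough sketch of how the other time precisions could be handled, the snippet below truncates a KGTK-style time literal to the fields that its precision supports. It assumes literals of the form `^YYYY-MM-DDThh:mm:ssZ/precision` and uses the Wikidata precision codes 9 (year), 10 (month) and 11 (day); the function name `truncate_time` is hypothetical, and other precisions are skipped here.

```python
import re

# Assumed literal shape: ^1960-01-01T00:00:00Z/9
TIME_RE = re.compile(
    r"^\^(?P<year>[+-]?\d{4,})-(?P<month>\d{2})-(?P<day>\d{2})T[^/]*/(?P<precision>\d+)$"
)

# Wikidata precision code --> fields that are meaningful at that precision.
PRECISION_FIELDS = {
    9: ("year",),
    10: ("year", "month"),
    11: ("year", "month", "day"),
}


def truncate_time(literal):
    """Return (precision, truncated value) or None if the literal is not handled."""
    match = TIME_RE.match(literal)
    if match is None:
        return None
    precision = int(match.group("precision"))
    fields = PRECISION_FIELDS.get(precision)
    if fields is None:
        return None
    return precision, tuple(int(match.group(name)) for name in fields)
```

Looping over the keys of `PRECISION_FIELDS` would then give one entity --> label mapping per precision instead of the single year-only mapping.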

Creating mapping of entity --> string attribute labels

In [241]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $DATA/$NAME.part.string.tsv -i $DATA/$NAME.label.en.tsv \
-o $OUT/$NAME.entity_attribute_labels_string_temp.tsv --graph-cache $STORE \
--match 'string: (n1)-[l {label:p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab)' \
--return 'n1 as entity, type as type, p as prop, n2 as value, lab as property_label, "_" as id' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'n1'

In [242]:
!kgtk rename-columns -i $OUT/$NAME.entity_attribute_labels_string_temp.tsv \
-o $OUT/$NAME.entity_attribute_labels_string.tsv \
--old-columns type prop --new-columns node1 node2

In [243]:
!head -5 $OUT/$NAME.entity_attribute_labels_string.tsv

entity	node1	node2	value	property_label	id
Q1000597	Q3957	P281	"DE14"	'postal code'@en	_
Q1000597	Q3957	P373	"Burton upon Trent"	'Commons category'@en	_
Q1000597	Q3957	P473	"01283"	'local dialing code'@en	_
Q1000597	Q3957	P613	"SK245225"	'OS grid reference'@en	_


Aggregating distinct labels w/ positive entity counts

In [249]:
!kgtk query -i $OUT/$NAME.entity_attribute_labels_string.tsv \
-o $OUT/$NAME.candidate_labels_avl_string_temp.tsv --graph-cache $STORE \
--match 'labels: (n1)-[l {value:val, property_label:lab, entity:e}]->(n2)' \
--return 'distinct n1 as type, n2 as prop, val as value, count(distinct e) as positives, lab as property_label, "_" as id' \
--order-by 'count(distinct e) desc'

In [250]:
!kgtk rename-columns -i $OUT/$NAME.candidate_labels_avl_string_temp.tsv \
-o $OUT/$NAME.candidate_labels_avl_string.tsv \
--old-columns type prop --new-columns node1 node2

In [251]:
!head -5 $OUT/$NAME.candidate_labels_avl_string.tsv

node1	node2	value	positives	property_label	id
Q3624078	P3238	"0"	34	'trunk prefix'@en	_
Q3624078	P3238	novalue	14	'trunk prefix'@en	_
Q6256	P3238	"0"	12	'trunk prefix'@en	_
Q179164	P3238	"0"	9	'trunk prefix'@en	_


Creating mapping of entity --> attribute labels of type time (precision = year)

In [257]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $DATA/$NAME.part.time.tsv -i $DATA/$NAME.label.en.tsv \
-o $OUT/$NAME.entity_attribute_labels_time.year_temp.tsv --graph-cache $STORE \
--match 'time: (n1)-[l {label:p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab)' \
--return 'n1 as entity, type as type, p as prop, kgtk_date_year(n2) as value, lab as property_label, "_" as id' \
--where 'lab.kgtk_lqstring_lang_suffix = "en" AND kgtk_date_precision(n2) = 9' \
--order-by 'n1'

In [258]:
!kgtk rename-columns -i $OUT/$NAME.entity_attribute_labels_time.year_temp.tsv \
-o $OUT/$NAME.entity_attribute_labels_time.year.tsv \
--old-columns type prop --new-columns node1 node2

In [259]:
!head -5 $OUT/$NAME.entity_attribute_labels_time.year.tsv

entity	node1	node2	value	property_label	id
Q1019	Q112099	P571	1960	'inception'@en	_
Q1019	Q112099	P571	1960	'inception'@en	_
Q1019	Q3624078	P571	1960	'inception'@en	_
Q1019	Q3624078	P571	1960	'inception'@en	_


Aggregating distinct labels w/ positive entity counts

In [260]:
!kgtk query -i $OUT/$NAME.entity_attribute_labels_time.year.tsv \
-o $OUT/$NAME.candidate_labels_avl_time.year_temp.tsv --graph-cache $STORE \
--match 'labels: (n1)-[l {value:val, property_label:lab, entity:e}]->(n2)' \
--return 'distinct n1 as type, n2 as prop, val as value, count(distinct e) as positives, lab as property_label, "_" as id' \
--order-by 'count(distinct e) desc'

In [261]:
!kgtk rename-columns -i $OUT/$NAME.candidate_labels_avl_time.year_temp.tsv \
-o $OUT/$NAME.candidate_labels_avl_time.year.tsv \
--old-columns type prop --new-columns node1 node2

In [262]:
!head -5 $OUT/$NAME.candidate_labels_avl_time.year.tsv

node1	node2	value	positives	property_label	id
Q3624078	P571	1991	3	'inception'@en	_
Q123480	P571	1991	2	'inception'@en	_
Q131734	P571	1985	2	'inception'@en	_
Q131734	P571	1996	2	'inception'@en	_


## 3. Create RELs with counts of positive entities

In [272]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $DATA/$NAME.label.en.tsv \
-o $OUT/$NAME.candidate_labels_rel_item_temp.tsv --graph-cache $STORE \
--match 'item: (n1)-[l {label:p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab)' \
--return 'distinct type as type, p as prop, n2 as value, count(distinct n1) as positives, lab as property_label, "_" as id' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc'

In [273]:
!kgtk rename-columns -i $OUT/$NAME.candidate_labels_rel_item_temp.tsv \
-o $OUT/$NAME.candidate_labels_rel_item.tsv \
--old-columns type prop --new-columns node1 node2

In [275]:
!head -5 $OUT/$NAME.candidate_labels_rel_item.tsv

node1	node2	value	positives	property_label	id
Q131734	P452	Q869095	77	'industry'@en	_
Q3624078	P463	Q1065	68	'member of'@en	_
Q3624078	P463	Q7817	67	'member of'@en	_
Q3624078	P530	Q183	67	'diplomatic relation'@en	_


## 4. Create AILs with counts of positive entities
### 4.a. Keep track of entities --> matching attribute labels for future use


Just implementing a naive hard-coded bucket size at first for time.date attribute labels

**Issue:** the mod operator doesn't seem to be supported here.

**TODO:** make a table of type/property --> low_bound/up_bound buckets to be used below instead.

**Issue:** for the table above it would be nice to have some -/+ infinity value. Otherwise we need logic like "WHERE low_bound is null OR low_bound < value", which might work.

In [284]:
!kgtk query -i $OUT/$NAME.entity_attribute_labels_time.year.tsv \
--graph-cache $STORE \
--match 'labels: (n1)-[l {value:val, property_label:lab, entity:e}]->(n2)' \
--return 'e as entity, n1 as type, n2 as prop, (val - (val % 5)) as low_bound, lab as property_label, "_" as id' \
--order-by 'e' \
--limit 5

Unhandled expression type: ('Mod', {'arg1': ('Variable', {'name': 'val'}), 'arg2': ('Literal', {'value': 5})})
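
Since the query engine rejects the mod expression, the bucketing could be done outside of the query for now. The sketch below shows both ideas from the notes: the naive fixed-width bucket (`val - val % 5`) and a bounds check where a missing bound stands in for -/+ infinity. The function names and the half-open interval convention `[low, high)` are assumptions.

```python
import math


def bucket_bounds(value, width=5):
    """Naive fixed-width bucket: return (low_bound, up_bound) containing value."""
    low = value - value % width
    return low, low + width


def in_bucket(value, low=None, high=None):
    """Check low <= value < high, treating None as -infinity / +infinity."""
    low = -math.inf if low is None else low
    high = math.inf if high is None else high
    return low <= value < high
```

With a width of 5, the year 1991 falls into the bucket 1990 to 1995, and a row whose `low_bound` is empty matches every value below its `up_bound`.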



## 5. Create RALs with counts of positive entities

Here is an idea of what kind of values we should find (creating RALs from scratch for string attributes only)

In [130]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $DATA/$NAME.part.string.tsv -i $DATA/$NAME.label.en.tsv \
--graph-cache $STORE \
--match 'item: (n1)-[l1 {label:p1}]->(n2), item: (n1)-[:P31]->(type1), label: (p1)-[:label]->(lab1), item: (n2)-[l2 {label:p2}]->(n3), item: (n2)-[:P31]->(type2), label: (p2)-[:label]->(lab2)' \
--return 'distinct type1 as type1, p1 as prop1, type2 as type2, p2 as prop2, n3 as value, count(distinct n1) as positives, lab1 as prop1_label, lab2 as prop2_label' \
--where 'lab1.kgtk_lqstring_lang_suffix = "en" AND lab2.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc' \
--limit 10

type1	prop1	type2	prop2	value	positives	prop1_label	prop2_label
Q131734	P452	Q8148	P1056	Q44	77	'industry'@en	'product or material produced'@en
Q131734	P452	Q8148	P1343	Q2041543	77	'industry'@en	'described by source'@en
Q131734	P452	Q8148	P361	Q26705422	77	'industry'@en	'part of'@en
Q131734	P452	Q8148	P452	Q1365365	77	'industry'@en	'industry'@en
Q131734	P452	Q8148	P910	Q8311121	77	'industry'@en	'topic\'s main category'@en
Q131734	P17	Q3624078	P1343	Q19180675	70	'country'@en	'described by source'@en
Q131734	P17	Q3624078	P463	Q1043527	70	'country'@en	'member of'@en
Q131734	P17	Q3624078	P463	Q1065	70	'country'@en	'member of'@en
Q131734	P17	Q3624078	P463	Q170424	70	'country'@en	'member of'@en
Q131734	P17	Q3624078	P463	Q17495	70	'country'@en	'member of'@en


Now using the REL table and the entities --> attribute labels table that we built in steps 2 and 4. Also keep track of counts of positive entities for each label

**NOTE:** It looks like having all of the entity --> labels mappings together in one file would be nice here. I don't think there is a need to keep these separate? Maybe keep AVL and AIL separate.

**TODO:** Create RALs for all of the attribute labels (currently done for only one of the two attribute types that we have AVLs for, and will need to do for additional AVLs and AILs once they are created)

In [294]:
!kgtk query -i $OUT/$NAME.candidate_labels_rel_item.tsv -i $OUT/$NAME.entity_attribute_labels_string.tsv \
--graph-cache $STORE \
--match 'rel: (t1)-[l1 {value:v1, positives:num_pos}]->(p1), entity_attribute: (t2)-[l {entity:v1, value:v2}]->(p2)' \
--return 't1 as type, p1 as prop, t2 as value_type, p2 as value_prop, v2 as value_val, int(num_pos) as positives, "_" as id' \
--where 't1 = "Q131734" AND p1 = "P452"' \
--limit 10

no such function: int
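
One thing that may contribute to the mismatch between the two approaches: the REL table stores positives per (type, prop, target entity), while a RAL groups together all target entities that share an attribute label. Copying `positives` from the REL row would then give one row per target entity instead of one count per RAL. The toy sketch below counts distinct source entities per RAL directly; all ids starting with `B` or `C` are made up, and this is only a guess at the cause, not a confirmed diagnosis.

```python
from collections import defaultdict

# (entity, type, prop, target_entity) relation edges; B*/C* ids are made up.
relations = [
    ("B1", "Q131734", "P17", "C1"),
    ("B2", "Q131734", "P17", "C2"),
    ("B3", "Q131734", "P17", "C1"),
]
# (entity, type, prop, value) attribute labels of the target entities.
attributes = [
    ("C1", "Q3624078", "P3238", '"0"'),
    ("C2", "Q3624078", "P3238", '"0"'),
]

attrs_by_entity = defaultdict(set)
for entity, label_type, prop, value in attributes:
    attrs_by_entity[entity].add((label_type, prop, value))

# Join each relation edge with the attribute labels of its target entity.
entities_by_ral = defaultdict(set)
for entity, label_type, prop, target in relations:
    for target_type, target_prop, target_value in attrs_by_entity.get(target, ()):
        ral = (label_type, prop, target_type, target_prop, target_value)
        entities_by_ral[ral].add(entity)

ral_positives = {ral: len(ents) for ral, ents in entities_by_ral.items()}
```

Here the single RAL has 3 positives, whereas the corresponding REL rows would carry 2 (for C1) and 1 (for C2).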



## 6. Combining generated labels into a single file
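
This step is still a TODO. As a starting point, here is a sketch of one way the per-kind tables could be merged, assuming each table has already been read into a list of dicts with the `node1`, `node2`, `value` and `positives` columns used above. The `label_kind` column and the function name are hypothetical additions.

```python
def combine_label_tables(tables):
    """Concatenate label tables, tagging each row with the kind it came from."""
    combined = []
    for kind, rows in tables.items():
        for row in rows:
            combined.append({"label_kind": kind, **row})
    return combined


# Toy input: one row per kind, copied from outputs earlier in the notebook.
tables = {
    "AVL": [{"node1": "Q3624078", "node2": "P571", "value": "1991", "positives": 3}],
    "REL": [{"node1": "Q131734", "node2": "P452", "value": "Q869095", "positives": 77}],
}
combined = combine_label_tables(tables)
```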

## 7. Filtering candidate labels - This can go in a separate notebook
Simple rule-based filter to remove labels that are trivially either unrepresentative or indistinctive
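
A possible shape for such a filter, as a sketch only: compute the fraction of a type's entities that match each label and drop labels whose fraction is very low (unrepresentative) or very high (indistinctive). The thresholds 0.05 and 0.95 and the function name are placeholders, not values from the paper. The sample rows reuse counts shown earlier in this notebook.

```python
def filter_labels(labels, type_counts, min_support=0.05, max_support=0.95):
    """Keep labels whose share of positive entities lies within the thresholds."""
    kept = []
    for label in labels:
        total = type_counts.get(label["node1"])
        if not total:
            continue  # unknown type or zero entities: nothing to compare against
        support = label["positives"] / total
        if min_support <= support <= max_support:
            kept.append(label)
    return kept


type_counts = {"Q131734": 87, "Q3624078": 69}
labels = [
    {"node1": "Q131734", "node2": "P452", "value": "Q869095", "positives": 77},
    {"node1": "Q3624078", "node2": "P463", "value": "Q1065", "positives": 68},
    {"node1": "Q3624078", "node2": "P571", "value": "1991", "positives": 3},
]
kept = filter_labels(labels, type_counts)
```

With these placeholder thresholds, the second label is dropped because 68 of 69 entities match it, and the third because only 3 of 69 do.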