# Re-implementing the procedure outlined in "Entity Profiling in Knowledge Graphs" (Zhang et al.)
# This notebook will implement the candidate label creation step
Using a subset of Wikidata related to Q44 ("beer")

In [72]:
import os

In [73]:
work_dir = "/Users/nicklein/Documents/grad_school/Research/data"

In [74]:
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(work_dir)
os.environ['DATA'] = "{}/Q44".format(work_dir)
os.environ['NAME'] = "Q44"
os.environ["OUT"] = "{}/Q44_profiler_output".format(work_dir)

Helper function for executing commands

In [71]:
import subprocess

def run_command(command, substitution_dictionary=None):
    """Run a templatized shell command and print its stdout and stderr."""
    cmd = command
    # Replace each placeholder key in the template with its value.
    for k, v in (substitution_dictionary or {}).items():
        cmd = cmd.replace(k, v)
    
    print(cmd)
    output = subprocess.run(cmd, shell=True, universal_newlines=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    print(output.stdout)
    print(output.stderr)

## Examples of using query command:
### Ex - Counting instances of <type, property, value> along with the property label

In [131]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $DATA/$NAME.label.en.tsv --graph-cache $STORE \
--match 'item: (n1)-[l {label: p}]->(n2), item: (n1)-[:P31]->(type), label: (p)-[:label]->(lab)' \
--return 'distinct type as type, l.label as prop, n2 as value, count(n1) as count, lab as property_label' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'type, count(n1) desc' \
--limit 5

type	prop	value	count	property_label
Q1066984	P1343	Q97879676	3	'described by source'@en
Q1066984	P1343	Q316838	2	'described by source'@en
Q1066984	P17	Q154195	2	'country'@en
Q1066984	P17	Q183	2	'country'@en
Q1066984	P17	Q2415901	2	'country'@en


### Ex - Counting property, value pairs for time properties

In [165]:
!kgtk query -i $DATA/$NAME.part.time.tsv -i $DATA/$NAME.label.en.tsv --graph-cache $STORE \
--match 'time: (n1)-[l {label: p}]->(n2), label: (p)-[:label]->(lab)' \
--return 'distinct l.label as prop, lab as property_label, kgtk_date_year(n2) as val, count(n1) as count' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(n1) desc' \
--limit 5

prop	property_label	val	count
P571	'inception'@en	1991	20
P571	'inception'@en	1918	14
P571	'inception'@en	1989	8
P571	'inception'@en	1996	8
P571	'inception'@en	1821	6


### Ex - Counting property, value pairs for quantity properties

In [133]:
!kgtk query -i $DATA/$NAME.part.quantity.tsv -i $DATA/$NAME.label.en.tsv --graph-cache $STORE \
--match 'quantity: (n1)-[l {label: p}]->(n2), label: (p)-[:label]->(lab)' \
--return 'distinct l.label as prop, lab as property_label, n2 as value, count(n1) as count' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(n1) desc' \
--limit 5

prop	property_label	value	count
P3000	'marriageable age'@en	+18Q24564698	56
P2997	'age of majority'@en	+18Q24564698	52
P1279	'inflation rate'@en	+1.7Q11229	41
P2884	'mains voltage'@en	+230Q25250	40
P1279	'inflation rate'@en	+1.8Q11229	38


# Outline of procedure:
**Goal**:<br>
We want to create candidate label sets including
- Attribute value labels (type, property, *attribute*)
- Relational entity labels (type, property, *entity*)
- Attribute interval labels (type, property, *range of attribute values*)
- Relational attribute labels (type, property, *attribute or attribute range of another entity*)

To enable subsequent filtering of these labels, we also want to count:
- The number of entities of each type
- The number of entities that match each label (call these "positives")
    
**Steps**:

0. Create type-mapping
1. Count the number of entities of each type
    - *optional future step*: define type with P279 transitive closure in addition to P31. 
2. Create AVLs trivially from attribute files along with counts of the positive entities for each label
    - At this step, we should also contribute to a mapping of entities --> matching attribute labels to facilitate creating RALs in a later step  
3. Create RELs trivially from entity relation files along with counts of positive entities for each label
4. Create AILs by discretizing the AVLs we found, along with counts of positive entities for each label
    - This can be done using the counts that we already have
    - We'll likely try a naive approach at first such as hardcoding a bucket size for each numeric type we care about
    - At this step, we should also contribute to a mapping of entities --> matching attribute labels to facilitate creating RALs in a later step
5. Create RALs by using the REL table and the entities --> attribute labels table that we built in steps 2 and 4. Also keep track of counts of positive entities for each label
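
Step 5 is the only step that needs a join, so a small sketch may help. This is an illustration under assumptions, not the notebook's implementation: `rels`, `entity_avls` and `build_rals` are hypothetical names, and the toy tuples stand in for the REL edges and the entities --> attribute labels table from steps 2 and 4.

```python
from collections import defaultdict

# Toy stand-ins (hypothetical data, not taken from the Q44 subset).
# Relational edges: (entity, entity_type, property, other_entity)
rels = [
    ("brewery_a", "Q131734", "P17", "country_x"),
    ("brewery_b", "Q131734", "P17", "country_x"),
    ("brewery_c", "Q131734", "P17", "country_y"),
]
# Entities --> attribute labels: other_entity -> set of (property, value)
entity_avls = {
    "country_x": {("P2884", "+230Q25250")},
    "country_y": {("P2884", "+120Q25250")},
}


def build_rals(rels, entity_avls):
    """Return {(type, prop, other_prop, other_value): number of positives}."""
    positives = defaultdict(set)
    for entity, etype, prop, other in rels:
        # Every attribute label of the related entity yields one RAL.
        for other_prop, other_value in entity_avls.get(other, ()):
            positives[(etype, prop, other_prop, other_value)].add(entity)
    # Sets ensure each entity is counted once per label.
    return {label: len(entities) for label, entities in positives.items()}


ral_counts = build_rals(rels, entity_avls)
```

Each key reads as "entities of this type whose `prop` points to an entity with this attribute label", and the value is the count of positives for that label.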
    
*Misc issues encountered*
- kgtk rename-columns doesn't always work when input file == output file. Getting around this right now by creating extra temp files... 
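
One possible way to avoid the hand-managed temp files, sketched in plain Python rather than with any KGTK feature: write the renamed table to a temporary file in the same directory and then swap it into place. `rename_columns_in_place` is a hypothetical helper, and it assumes an unquoted tab-separated file whose first line is the header.

```python
import os
import tempfile


def rename_columns_in_place(path, renames):
    """Rename header columns of a TSV file, rewriting `path` via a temp file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as dst, \
                open(path, encoding="utf-8") as src:
            header = src.readline().rstrip("\n").split("\t")
            # Columns missing from `renames` keep their original name.
            dst.write("\t".join(renames.get(name, name) for name in header) + "\n")
            for line in src:
                dst.write(line)
        # Replace the original only after the new file is fully written.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Because the original is replaced only after the copy is complete, a failure part-way through leaves the input file untouched.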

## 0. Create type mapping
Using P31 only for now, but can add P279* as well later

In [361]:
!kgtk filter -p ' ; P31 ; ' -i $DATA/$NAME.part.wikibase-item.tsv -o $OUT/$NAME.type_mapping.tsv

In [638]:
!head -5 $OUT/$NAME.type_mapping.tsv | column -t -s $'\t'

id              node1     label  node2
Q1000597-P31-1  Q1000597  P31    Q3957
Q1011-P31-2     Q1011     P31    Q112099
Q1011-P31-1     Q1011     P31    Q3624078
Q1019-P31-2     Q1019     P31    Q112099


## 1. Count number of entities of each type:

In [589]:
!kgtk query -i $OUT/$NAME.type_mapping.tsv -i $DATA/$NAME.label.en.tsv -o $OUT/$NAME.entity_counts_per_type_temp.tsv --graph-cache $STORE \
--match 'type: (n1)-[]->(type), label: (type)-[:label]->(lab)' \
--return 'distinct type as type, lab as type_label, count(distinct n1) as count, "_" as id' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc' \

In [590]:
!kgtk rename-columns -i $OUT/$NAME.entity_counts_per_type_temp.tsv -o $OUT/$NAME.entity_counts_per_type_temp1.tsv \
--old-columns type type_label count --new-columns node1 label node2 

In [591]:
!kgtk add-id -i $OUT/$NAME.entity_counts_per_type_temp1.tsv \
-o $OUT/$NAME.entity_counts_per_type.tsv --overwrite-id

In [637]:
!head -5 $OUT/$NAME.entity_counts_per_type.tsv | column -t -s $'\t'

node1     label                 node2  id
Q131734   'brewery'@en          87     E1
Q3624078  'sovereign state'@en  69     E2
Q4830453  'business'@en         50     E3
Q6256     'country'@en          26     E4


## 2. Create AVLs with counts of positive entities
At this step we also want to keep track of entities --> matching attribute labels for future use. This will help when we are creating RALs (step 5)

We should also keep numeric valued labels and non-numeric valued labels separate to help with discretization at a later step

### Strings
Creating mapping of entity --> string attribute labels

In [728]:
!kgtk query -i $OUT/$NAME.type_mapping.tsv -i $DATA/$NAME.part.string.tsv -i $DATA/$NAME.label.en.tsv \
-o $OUT/$NAME.entity_attribute_labels_string_temp.tsv --graph-cache $STORE \
--match 'string: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), label: (p)-[:label]->(lab)' \
--return 'distinct n1 as entity, type as type, p as prop, n2 as value, lab as property_label, "_" as id' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'n1'

In [729]:
!kgtk rename-columns -i $OUT/$NAME.entity_attribute_labels_string_temp.tsv \
-o $OUT/$NAME.entity_attribute_labels_string_temp1.tsv \
--old-columns type prop value --new-columns node1 label node2

In [730]:
!kgtk add-id -i $OUT/$NAME.entity_attribute_labels_string_temp1.tsv \
-o $OUT/$NAME.entity_attribute_labels_string.tsv --overwrite-id

In [731]:
!head -5 $OUT/$NAME.entity_attribute_labels_string.tsv | column -t -s $'\t'

entity    node1  label  node2                property_label           id
Q1000597  Q3957  P281   "DE14"               'postal code'@en         E1
Q1000597  Q3957  P373   "Burton upon Trent"  'Commons category'@en    E2
Q1000597  Q3957  P473   "01283"              'local dialing code'@en  E3
Q1000597  Q3957  P613   "SK245225"           'OS grid reference'@en   E4


Aggregating distinct labels w/ positive entity counts

In [732]:
!kgtk query -i $OUT/$NAME.entity_attribute_labels_string.tsv \
-o $OUT/$NAME.candidate_labels_avl_string_temp.tsv --graph-cache $STORE \
--match 'labels: (type)-[l {label:prop, property_label:lab, entity:e}]->(val)' \
--return 'distinct type as type, prop as prop, val as value, count(distinct e) as positives, lab as property_label, "_" as id' \
--order-by 'count(distinct e) desc'

In [733]:
!kgtk rename-columns -i $OUT/$NAME.candidate_labels_avl_string_temp.tsv \
-o $OUT/$NAME.candidate_labels_avl_string_temp1.tsv \
--old-columns type prop value --new-columns node1 label node2

In [734]:
!kgtk remove-columns -c "id" -i $OUT/$NAME.candidate_labels_avl_string_temp1.tsv \
-o $OUT/$NAME.candidate_labels_avl_string_temp2.tsv

In [735]:
!kgtk add-id -i $OUT/$NAME.candidate_labels_avl_string_temp2.tsv -o $OUT/$NAME.candidate_labels_avl_string.tsv

In [736]:
!head -5 $OUT/$NAME.candidate_labels_avl_string.tsv | column -t -s $'\t'

node1     label  node2    positives  property_label     id
Q3624078  P3238  "0"      34         'trunk prefix'@en  E1
Q3624078  P3238  novalue  14         'trunk prefix'@en  E2
Q6256     P3238  "0"      12         'trunk prefix'@en  E3
Q179164   P3238  "0"      9          'trunk prefix'@en  E4
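
To make the meaning of `positives` concrete, here is a rough plain-Python sketch of the same aggregation: the number of distinct entities per (type, property, value) label. The rows are toy values and `count_positives` is a hypothetical name. The duplicate row shows why the query uses `count(distinct e)` rather than a plain count.

```python
from collections import defaultdict

# Rows mirror the columns (entity, node1=type, label=property, node2=value).
rows = [
    ("e1", "Q3624078", "P3238", '"0"'),
    ("e1", "Q3624078", "P3238", '"0"'),  # repeated statement, counted once
    ("e2", "Q3624078", "P3238", '"0"'),
    ("e2", "Q6256", "P3238", '"0"'),
]


def count_positives(rows):
    """Return [((type, prop, value), positives), ...] sorted by positives desc."""
    entities_per_label = defaultdict(set)
    for entity, type_, prop, value in rows:
        entities_per_label[(type_, prop, value)].add(entity)
    counts = {label: len(ents) for label, ents in entities_per_label.items()}
    # Mirrors --order-by 'count(distinct e) desc'.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


positives = count_positives(rows)
```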


### Times

Looking at what precisions we need to deal with...

In [634]:
!kgtk query -i $DATA/$NAME.part.time.tsv -i $DATA/$NAME.label.en.tsv \
--graph-cache $STORE \
--match 'time: (n1)-[l {label:p}]->(n2), label: (p)-[:label]->(lab)' \
--return 'distinct kgtk_date_precision(n2) as precisions, count(n1) as count' \
--limit 10 \
| column -t -s $'\t'

precisions  count
6           12
7           40
8           12
9           697
10          48
11          469


From the above, several values have a precision coarser than a year (year = 9). We don't have kgtk literal functions for interpreting these granularities, so for now we'll interpret them as years. In fact, we will interpret all times at year granularity for now.

Additional work can be done later to create labels with finer time granularity if desired.
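
For reference, the precision codes follow Wikidata's convention (6 = millennium, 7 = century, 8 = decade, 9 = year, 10 = month, 11 = day). The sketch below shows one way to pull the year and precision out of a date literal in plain Python. It assumes literals look like `^1975-07-05T00:00:00Z/11`. The regular expression and `year_and_precision` are illustrative assumptions, not part of KGTK.

```python
import re

# Wikidata precision codes that appear in the output above.
WIKIDATA_PRECISION = {
    6: "millennium",
    7: "century",
    8: "decade",
    9: "year",
    10: "month",
    11: "day",
}

# Assumed literal shape: ^YYYY-MM-DD[Ttime][/precision]
_DATE_RE = re.compile(
    r"^\^(?P<year>[+-]?\d+)-\d{2}-\d{2}(?:T[^/]*)?(?:/(?P<precision>\d+))?$"
)


def year_and_precision(literal):
    """Return (year, precision or None), or None if the literal does not match."""
    match = _DATE_RE.match(literal)
    if match is None:
        return None
    precision = match.group("precision")
    return int(match.group("year")), (int(precision) if precision is not None else None)
```

Treating every value as a year then amounts to keeping only the first element of the returned pair, whatever the precision says.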

#### precision = year or broader
#### note - at this time we will include values that have precision finer than a year here as well
We interpret these as years

In [874]:
!kgtk query -i $OUT/$NAME.type_mapping.tsv -i $DATA/$NAME.part.time.tsv -i $DATA/$NAME.label.en.tsv \
-o $OUT/$NAME.entity_attribute_labels_time.year_temp.tsv --graph-cache $STORE \
--match 'time: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), label: (p)-[:label]->(p_lab), label: (type)-[:label]->(t_lab)' \
--return 'distinct n1 as entity, type as type, p as prop, kgtk_date_year(n2) as value, t_lab as type_label, p_lab as property_label, "_" as id' \
--where 't_lab.kgtk_lqstring_lang_suffix = "en" AND p_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'n1'

In [875]:
!kgtk rename-columns -i $OUT/$NAME.entity_attribute_labels_time.year_temp.tsv \
-o $OUT/$NAME.entity_attribute_labels_time.year_temp1.tsv \
--old-columns type prop value --new-columns node1 label node2

In [876]:
!kgtk remove-columns -c "id" -i $OUT/$NAME.entity_attribute_labels_time.year_temp1.tsv \
-o $OUT/$NAME.entity_attribute_labels_time.year_temp2.tsv

In [877]:
!kgtk add-id -i $OUT/$NAME.entity_attribute_labels_time.year_temp2.tsv \
-o $OUT/$NAME.entity_attribute_labels_time.year.tsv

In [878]:
!head -5 $OUT/$NAME.entity_attribute_labels_time.year.tsv | column -t -s $'\t'

entity  node1     label  node2  type_label            property_label  id
Q1011   Q112099   P571   1975   'island nation'@en    'inception'@en  E1
Q1011   Q3624078  P571   1975   'sovereign state'@en  'inception'@en  E2
Q1019   Q112099   P571   1960   'island nation'@en    'inception'@en  E3
Q1019   Q3624078  P571   1960   'sovereign state'@en  'inception'@en  E4


Aggregating distinct labels w/ positive entity counts

In [879]:
!kgtk query -i $OUT/$NAME.entity_attribute_labels_time.year.tsv \
-o $OUT/$NAME.candidate_labels_avl_time.year_temp.tsv --graph-cache $STORE \
--match 'labels: (n1)-[l {label:p, property_label:lab, entity:e}]->(val)' \
--return 'distinct n1 as type, p as prop, val as value, count(distinct e) as positives, lab as property_label, "_" as id' \
--order-by 'count(distinct e) desc'

In [880]:
!kgtk rename-columns -i $OUT/$NAME.candidate_labels_avl_time.year_temp.tsv \
-o $OUT/$NAME.candidate_labels_avl_time.year_temp1.tsv \
--old-columns type prop value --new-columns node1 label node2

In [881]:
!kgtk remove-columns -c "id" -i $OUT/$NAME.candidate_labels_avl_time.year_temp1.tsv \
-o $OUT/$NAME.candidate_labels_avl_time.year_temp2.tsv

In [882]:
!kgtk add-id -i $OUT/$NAME.candidate_labels_avl_time.year_temp2.tsv \
-o $OUT/$NAME.candidate_labels_avl_time.year.tsv

In [883]:
!head -5 $OUT/$NAME.candidate_labels_avl_time.year.tsv | column -t -s $'\t'

node1     label  node2  positives  property_label  id
Q3624078  P571   1991   8          'inception'@en  E1
Q3624078  P571   1918   7          'inception'@en  E2
Q6256     P571   1918   5          'inception'@en  E3
Q6256     P571   1991   5          'inception'@en  E4


#### ~~precision finer than a year~~
~~we interpret these as dates with day, month, year~~

Note, at this time we will give all time values a granularity of year. The results of the below code are currently not used.

In [737]:
!kgtk query -i $OUT/$NAME.type_mapping.tsv -i $DATA/$NAME.part.time.tsv -i $DATA/$NAME.label.en.tsv \
-o $OUT/$NAME.entity_attribute_labels_time.date_temp.tsv --graph-cache $STORE \
--match 'time: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), label: (p)-[:label]->(lab)' \
--return 'distinct n1 as entity, type as type, p as prop, kgtk_date_date(n2) as value, lab as property_label, "_" as id' \
--where 'lab.kgtk_lqstring_lang_suffix = "en" AND kgtk_date_precision(n2) > 9' \
--order-by 'n1'

In [738]:
!kgtk rename-columns -i $OUT/$NAME.entity_attribute_labels_time.date_temp.tsv \
-o $OUT/$NAME.entity_attribute_labels_time.date_temp1.tsv \
--old-columns type prop value --new-columns node1 label node2

In [739]:
!kgtk remove-columns -c "id" -i $OUT/$NAME.entity_attribute_labels_time.date_temp1.tsv \
-o $OUT/$NAME.entity_attribute_labels_time.date_temp2.tsv

In [740]:
!kgtk add-id -i $OUT/$NAME.entity_attribute_labels_time.date_temp2.tsv \
-o $OUT/$NAME.entity_attribute_labels_time.date.tsv

In [741]:
!head -5 $OUT/$NAME.entity_attribute_labels_time.date.tsv | column -t -s $'\t'

entity    node1     label  node2        property_label  id
Q1011     Q112099   P571   ^1975-07-05  'inception'@en  E1
Q1011     Q3624078  P571   ^1975-07-05  'inception'@en  E2
Q1020773  Q3957     P571   ^1892-10-12  'inception'@en  E3
Q1020773  Q902814   P571   ^1892-10-12  'inception'@en  E4


Aggregating distinct labels w/ positive entity counts

In [742]:
!kgtk query -i $OUT/$NAME.entity_attribute_labels_time.date.tsv \
-o $OUT/$NAME.candidate_labels_avl_time.date_temp.tsv --graph-cache $STORE \
--match 'labels: (n1)-[l {label:p, property_label:lab, entity:e}]->(val)' \
--return 'distinct n1 as type, p as prop, val as value, count(distinct e) as positives, lab as property_label, "_" as id' \
--order-by 'count(distinct e) desc'

In [743]:
!kgtk rename-columns -i $OUT/$NAME.candidate_labels_avl_time.date_temp.tsv \
-o $OUT/$NAME.candidate_labels_avl_time.date_temp1.tsv \
--old-columns type prop value --new-columns node1 label node2

In [744]:
!kgtk remove-columns -c "id" -i $OUT/$NAME.candidate_labels_avl_time.date_temp1.tsv \
-o $OUT/$NAME.candidate_labels_avl_time.date_temp2.tsv

In [745]:
!kgtk add-id -i $OUT/$NAME.candidate_labels_avl_time.date_temp2.tsv \
-o $OUT/$NAME.candidate_labels_avl_time.date.tsv

In [746]:
!head -5 $OUT/$NAME.candidate_labels_avl_time.date.tsv | column -t -s $'\t'

node1     label  node2         positives  property_label                        id
Q1066984  P1249  ^1158-06-14   1          'time of earliest written record'@en  E1
Q1093829  P571   ^1790-07-16   1          'inception'@en                        E2
Q112099   P571   ^-0660-02-11  1          'inception'@en                        E3
Q112099   P571   ^1898-12-10   1          'inception'@en                        E4


### Quantities
We need the units, so rather than reducing each quantity to a bare number and dropping the units, we split it into its numeric value plus separate SI-unit and Wikidata-unit columns.
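
To make the number/unit distinction concrete, here is a rough plain-Python sketch of splitting a quantity literal such as `+18Q24564698` into its number and unit. It handles only the simple shapes seen in this notebook: a signed decimal, an optional bracketed tolerance, and an optional trailing unit. `parse_quantity` is a hypothetical helper, not a KGTK function.

```python
import re

# Simplified shape: signed decimal, optional [tolerance], optional unit.
# Exponents and other variants are deliberately not handled in this sketch.
_QUANTITY_RE = re.compile(
    r"^(?P<number>[+-]?\d+(?:\.\d+)?)(?:\[[^\]]*\])?(?P<unit>.*)$"
)


def parse_quantity(literal):
    """Return (number, unit or None), or None if the literal is not a quantity."""
    match = _QUANTITY_RE.match(literal)
    if match is None:
        return None
    unit = match.group("unit") or None
    return float(match.group("number")), unit
```

Two values with the same number but different units, for example volts versus years, must stay distinct labels, which is why the unit is kept next to the number.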

In [918]:
!kgtk query -i $OUT/$NAME.type_mapping.tsv -i $DATA/$NAME.part.quantity.tsv -i $DATA/$NAME.label.en.tsv \
-o $OUT/$NAME.entity_attribute_labels_quantity_temp.tsv --graph-cache $STORE \
--match 'quantity: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), label: (p)-[:label]->(p_lab), label: (type)-[:label]->(t_lab)' \
--return 'distinct n1 as entity, type as type, p as prop, kgtk_quantity_number(n2) as value, kgtk_quantity_si_units(n2) as si_units, kgtk_quantity_wd_units(n2) as wd_units, t_lab as type_label, p_lab as property_label, "_" as id' \
--where 't_lab.kgtk_lqstring_lang_suffix = "en" AND p_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'n1'

In [919]:
!kgtk rename-columns -i $OUT/$NAME.entity_attribute_labels_quantity_temp.tsv \
-o $OUT/$NAME.entity_attribute_labels_quantity_temp1.tsv \
--old-columns type prop value --new-columns node1 label node2

In [920]:
!kgtk remove-columns -c "id" -i $OUT/$NAME.entity_attribute_labels_quantity_temp1.tsv \
-o $OUT/$NAME.entity_attribute_labels_quantity_temp2.tsv

In [921]:
!kgtk add-id -i $OUT/$NAME.entity_attribute_labels_quantity_temp2.tsv \
-o $OUT/$NAME.entity_attribute_labels_quantity.tsv

In [922]:
!head -5 $OUT/$NAME.entity_attribute_labels_quantity.tsv | column -t -s $'\t'

entity    node1    label  node2  si_units  wd_units  type_label          property_label                id
Q1000597  Q3957    P1082  75074                      'town'@en           'population'@en               E1
Q1011     Q112099  P1081  0.57                       'island nation'@en  'Human Development Index'@en  E2
Q1011     Q112099  P1081  0.572                      'island nation'@en  'Human Development Index'@en  E3
Q1011     Q112099  P1081  0.575                      'island nation'@en  'Human Development Index'@en  E4


Aggregating distinct labels w/ positive entity counts

In [752]:
!kgtk query -i $OUT/$NAME.entity_attribute_labels_quantity.tsv \
-o $OUT/$NAME.candidate_labels_avl_quantity_temp.tsv --graph-cache $STORE \
--match 'labels: (n1)-[l {label:p, property_label:lab, entity:e}]->(val)' \
--return 'distinct n1 as type, p as prop, val as value, count(distinct e) as positives, lab as property_label, "_" as id' \
--order-by 'count(distinct e) desc'

In [753]:
!kgtk rename-columns -i $OUT/$NAME.candidate_labels_avl_quantity_temp.tsv \
-o $OUT/$NAME.candidate_labels_avl_quantity_temp1.tsv \
--old-columns type prop value --new-columns node1 label node2

In [754]:
!kgtk remove-columns -c "id" -i $OUT/$NAME.candidate_labels_avl_quantity_temp1.tsv \
-o $OUT/$NAME.candidate_labels_avl_quantity_temp2.tsv

In [755]:
!kgtk add-id -i $OUT/$NAME.candidate_labels_avl_quantity_temp2.tsv \
-o $OUT/$NAME.candidate_labels_avl_quantity.tsv

In [756]:
!head -5 $OUT/$NAME.candidate_labels_avl_quantity.tsv | column -t -s $'\t'

node1     label  node2         positives  property_label         id
Q3624078  P3000  +18Q24564698  54         'marriageable age'@en  E1
Q3624078  P2997  +18Q24564698  51         'age of majority'@en   E2
Q3624078  P2884  +230Q25250    38         'mains voltage'@en     E3
Q3624078  P1279  +1.7Q11229    25         'inflation rate'@en    E4


### Combining entity --> attribute label mappings to single table

Note: `$OUT/$NAME.entity_attribute_labels_time.date.tsv` is currently omitted from the concatenation below.

In [172]:
!kgtk cat \
-i $OUT/$NAME.entity_attribute_labels_string.tsv \
-i $OUT/$NAME.entity_attribute_labels_time.year.tsv \
-i $OUT/$NAME.entity_attribute_labels_quantity.tsv \
-o $OUT/$NAME.entity_AVLs_all.tsv

## 3. Create RELs with counts of positive entities

In [478]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $OUT/$NAME.type_mapping.tsv -i $DATA/$NAME.label.en.tsv \
-o $OUT/$NAME.candidate_labels_rel_item_temp.tsv --graph-cache $STORE \
--match 'item: (n1)-[l {label:p}]->(n2), type: (n1)-[]->(type), label: (p)-[:label]->(lab)' \
--return 'distinct type as type, p as prop, n2 as value, count(distinct n1) as positives, lab as property_label, "_" as id' \
--where 'lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc'

In [479]:
!kgtk rename-columns -i $OUT/$NAME.candidate_labels_rel_item_temp.tsv \
-o $OUT/$NAME.candidate_labels_rel_item_temp1.tsv \
--old-columns type prop value --new-columns node1 label node2

In [480]:
!kgtk remove-columns -c "id" -i $OUT/$NAME.candidate_labels_rel_item_temp1.tsv \
-o $OUT/$NAME.candidate_labels_rel_item_temp2.tsv

In [481]:
!kgtk add-id -i $OUT/$NAME.candidate_labels_rel_item_temp2.tsv \
-o $OUT/$NAME.candidate_labels_rel_item.tsv

In [627]:
!head -5 $OUT/$NAME.candidate_labels_rel_item.tsv | column -t -s $'\t'

node1     label  node2    positives  property_label            id
Q131734   P452   Q869095  77         'industry'@en             E1
Q3624078  P463   Q1065    68         'member of'@en            E2
Q3624078  P463   Q7817    67         'member of'@en            E3
Q3624078  P530   Q183     67         'diplomatic relation'@en  E4


## 4. Create AILs with counts of positive entities
### 4.a. Keep track of entities --> matching attribute labels for future use


In [141]:
import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from kneed import KneeLocator


# Chooses values for the epsilon and min_samples parameters of DBSCAN to be used for clustering given labels.
def choose_DBSCAN_params(labels_df):
    values = np.array(labels_df.loc[:,"node2"]).reshape(-1, 1)
    
    # edge case only one sample will give an error if we try to compute knn.
    if len(values) < 2:
        print("only one sample for label:\n{}\n".format(labels_df))
        return (1, 1)
    
    # min_samples would ideally be set with some domain-insight.
    # To make this automated, we'll use a heuristic - ln(number of data points)
    min_samples = int(np.floor(np.log(len(values))))
    min_samples = max(min_samples, 1) # don't choose a number less than 1
    
    # epsilon can be chosen by plotting the distances of the k'th nearest neighbor from each point
    # where k is min_samples. Points that belong to clusters should have smaller distances, whereas
    # noise points can have distances that are much farther. We'll look for a knee in this graph to set epsilon.
    neigh = NearestNeighbors(n_neighbors = min_samples + 1) # +1 so we find k'th nearest neighbor not including the point itself.
    values_neigh = neigh.fit(values)
    distances, indices = values_neigh.kneighbors(values)
    distances = np.sort(distances[:,min_samples], axis = 0)
    
    kneedle = KneeLocator(range(len(distances)), distances, S=1, curve='convex', direction='increasing', interp_method = 'polynomial')
    epsilon = kneedle.knee_y
    if epsilon is None:
        print("no knee found for these labels:\n{}".format(labels_df))
        epsilon = distances[len(distances) // 2]
        print("using median distance to k nearest neighbor instead ({})\n".format(epsilon))
    
    return (min_samples, epsilon)


# Uses DBSCAN to assign values to clusters
def DBSCAN_1d_values(values, eps, min_samples):
    # dbscan doesn't like eps == 0. This translates to not doing any bucketing.
    # handling this as an edge case so dbscan doesn't throw an exception
    if eps <= 0:
        return values # same-valued points are labeled the same
    
    # dbscan expects multiple dimensions. Edit new object instead of existing view of df.
    values = np.array(values)
    values = np.append(values.reshape(-1,1), np.zeros((len(values),1)), axis = 1)
        
    db = DBSCAN(eps=eps, min_samples=min_samples).fit(values)

    return (db.labels_)


# Given values that have been assigned to clusters,
# return a list of intervals that are consistent with these clusters
def get_intervals_for_values_based_on_clusters(values, labels):
    
    values = np.array(values)
    labels = np.array(labels)
    indexes = np.arange(len(values))
    
    # sort values and corresponding labels in ascending order
    labels = np.array([l for l, v in sorted(zip(labels, values), key=lambda pair: pair[1])])
    indexes = np.array([i for i, v in sorted(zip(indexes, values), key=lambda pair: pair[1])])
    values.sort()
    
    intervals = [(None, None)] # initially a single interval with no lower or upper bound
    
    # create intervals
    cur_label = labels[0]
    for i in range(len(labels)):
        # if new label, set upper bound of previous interval,
        # and start lower bound of a new interval.
        if labels[i] != cur_label:
            prev_interval_lb = (intervals[-1])[0]
            new_interval_edge = values[i-1] + ((values[i] - values[i-1]) / 2)
            intervals[-1] = (prev_interval_lb, new_interval_edge)
            intervals.append((new_interval_edge, None))
            cur_label = labels[i]
    
    # assign intervals to values
    cur_label = labels[0]
    cur_interval_ix = 0
    intervals_for_values = []
    for i in range(len(labels)):
        if labels[i] != cur_label:
            cur_interval_ix += 1
            cur_label = labels[i]
        intervals_for_values.append(intervals[cur_interval_ix])
    
    # rearrange intervals to original order of values
    intervals_for_values_unscrambled = np.zeros(len(intervals_for_values), dtype=tuple)
    for i in range(len(indexes)):
        intervals_for_values_unscrambled[indexes[i]] = intervals_for_values[i]
    
    return intervals_for_values_unscrambled

# Given a file containing numeric valued attribute labels,
# discretize these labels and create a new file with the resulting attribute interval labels
def discretize_labels(avl_file_in, ail_file_out):
    df = pd.read_csv(avl_file_in, delimiter='\t')
    
    # add lower bound and upper bound columns
    df.insert(loc = len(df.columns), column = "lower_bound", value = ["" for i in range(df.shape[0])])
    df.insert(loc = len(df.columns), column = "upper_bound", value = ["" for i in range(df.shape[0])])
    
    # blank values in units columns are expected as not all values will have units.
    # we want blank units to compare equal to each other, so fill NaN's in as "" in units columns.
    if "si_units" in df.columns and "wd_units" in df.columns:
        df.fillna("", inplace = True)
        
    # we also don't want to consider any string type values
    types = [type(v) for v in df.loc[:,"node2"]]
    non_str_mask = [True if (t != str) else False for t in types]
    df = df.loc[non_str_mask]
    
    # get distinct label types (defined by type and property, as well as si and wd units if we have them)
    if "si_units" in df.columns and "wd_units" in df.columns:
        distinct_labels = df.loc[:, ["node1", "label", "si_units", "wd_units"]].drop_duplicates()
    else:
        distinct_labels = df.loc[:, ["node1", "label"]].drop_duplicates()
        
    # Could probably be improved with a list comprehension
    for index, row in distinct_labels.iterrows():
        # Get subset of labels that match this distinct kind of label
        subset_mask = (df["node1"] == row["node1"]) & (df["label"] == row["label"])
        # if we have units, treat these as part of the kind of label
        if "si_units" in df.columns and "wd_units" in df.columns:
            subset_mask = subset_mask & (df["si_units"] == row["si_units"]) & (df["wd_units"] == row["wd_units"])
        subset = df.loc[subset_mask]
        
        values = subset.loc[:,"node2"]
        if(len(values) == 0):
            print("no values found for subset:\n{}\n".format(subset))
            print("row:\n{}\n".format(row))
            continue  # nothing to cluster for this kind of label
        
        min_samples, epsilon = choose_DBSCAN_params(subset)
        cluster_labels = DBSCAN_1d_values(values, epsilon, min_samples)
        intervals_for_values = get_intervals_for_values_based_on_clusters(values, cluster_labels)
        lbounds = [pair[0] for pair in intervals_for_values]
        ubounds = [pair[1] for pair in intervals_for_values]
        df.loc[subset_mask,"lower_bound"] = lbounds
        df.loc[subset_mask,"upper_bound"] = ubounds
    
    df.to_csv(ail_file_out, sep = '\t', index = False)
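
The interval construction is the least obvious part of the cell above, so here is a simplified, dependency-free sketch of the same idea. Values are sorted, and wherever the cluster label changes between two neighbouring values a boundary is placed halfway between them. `intervals_from_clusters` is a hypothetical name, and this is an illustration rather than a drop-in replacement for `get_intervals_for_values_based_on_clusters`.

```python
def intervals_from_clusters(values, labels):
    """Give each value the (lower, upper) interval of its run of equal labels.

    Boundaries sit halfway between neighbouring sorted values whose labels
    differ. None marks an open end. Results keep the input order.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [None] * len(values)
    lower = None
    start = 0  # position in `order` where the current run begins
    for pos in range(1, len(order) + 1):
        at_end = pos == len(order)
        if not at_end and labels[order[pos]] == labels[order[pos - 1]]:
            continue  # still inside the same run of labels
        if at_end:
            upper = None
        else:
            upper = (values[order[pos - 1]] + values[order[pos]]) / 2
        for i in order[start:pos]:
            result[i] = (lower, upper)
        lower, start = upper, pos
    return result


years = [1990, 1821, 1991, 1825]
clusters = [0, 1, 0, 1]
year_intervals = intervals_from_clusters(years, clusters)
```

As in the cell above, a run of noise points (DBSCAN label `-1`) is treated like any other run of equal labels and gets its own interval.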

In [143]:
avl_file_in = "{}/{}.entity_attribute_labels_time.year.tsv".format(os.environ["OUT"], os.environ["NAME"])
ail_file_out = "{}/{}.entity_attribute_labels_time.year_bucketed.tsv".format(os.environ["OUT"], os.environ["NAME"])
discretize_labels(avl_file_in, ail_file_out)

only one sample for label:
     entity  node1 label  node2 type_label  property_label  id lower_bound  \
4  Q1020773  Q3957  P571   1892  'town'@en  'inception'@en  E5               

  upper_bound  
4              

only one sample for label:
     entity    node1 label  node2        type_label  property_label  id  \
5  Q1020773  Q902814  P571   1892  'border town'@en  'inception'@en  E6   

  lower_bound upper_bound  
5                          

only one sample for label:
  entity     node1 label  node2                type_label  property_label  id  \
7  Q1027  Q2221906  P571   1968  'geographic location'@en  'inception'@en  E8   

  lower_bound upper_bound  
7                          

only one sample for label:
  entity     node1 label  node2                   type_label  property_label  \
9  Q1027  Q4198907  P571   1968  'parliamentary republic'@en  'inception'@en   

    id lower_bound upper_bound  
9  E10                          

no knee found for these labels:
    entity    

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


only one sample for label:
    entity     node1 label  node2            type_label  \
82  Q15180  Q3624078  P576   1991  'sovereign state'@en   

                             property_label   id lower_bound upper_bound  
82  'dissolved, abolished or demolished'@en  E83                          

only one sample for label:
    entity    node1 label  node2             type_label  \
83  Q15180  Q512187  P576   1991  'federal republic'@en   

                             property_label   id lower_bound upper_bound  
83  'dissolved, abolished or demolished'@en  E84                          

only one sample for label:
    entity    node1 label  node2            type_label  \
84  Q15180  Q842112  P576   1991  'socialist state'@en   

                             property_label   id lower_bound upper_bound  
84  'dissolved, abolished or demolished'@en  E85                          

only one sample for label:
    entity    node1 label  node2            type_label  \
85  Q15180  Q849866  P576 

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


only one sample for label:
    entity     node1  label  node2         type_label  \
149  Q1726  Q1187811  P1249   1158  'college town'@en   

                           property_label    id lower_bound upper_bound  
149  'time of earliest written record'@en  E150                          

only one sample for label:
    entity    node1  label  node2             type_label  \
150  Q1726  Q134626  P1249   1158  'district capital'@en   

                           property_label    id lower_bound upper_bound  
150  'time of earliest written record'@en  E151                          

only one sample for label:
    entity      node1  label  node2                     type_label  \
151  Q1726  Q14784328  P1249   1158  'state capital in Germany'@en   

                           property_label    id lower_bound upper_bound  
151  'time of earliest written record'@en  E152                          

only one sample for label:
    entity     node1  label  node2                              type

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


only one sample for label:
    entity    node1  label  node2               type_label  \
230   Q213  Q123480  P1249    900  'landlocked country'@en   

                           property_label    id lower_bound upper_bound  
230  'time of earliest written record'@en  E231                          

no knee found for these labels:
    entity  node1  label  node2    type_label  \
232   Q213  Q6256  P1249    900  'country'@en   
378    Q34  Q6256  P1249   1100  'country'@en   

                           property_label    id lower_bound upper_bound  
232  'time of earliest written record'@en  E233                          
378  'time of earliest written record'@en  E379                          
using median distance to k nearest neighbor instead (200.0)

only one sample for label:
    entity   node1 label  node2                              type_label  \
259   Q227  Q56061  P571   1991  'administrative territorial entity'@en   

     property_label    id lower_bound upper_bound  
259  '

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


only one sample for label:
       entity    node1 label  node2   type_label  property_label    id  \
473  Q5042950  Q836904  P571   1903  'brewer'@en  'inception'@en  E474   

    lower_bound upper_bound  
473                          

only one sample for label:
       entity   node1 label  node2    type_label  property_label    id  \
481  Q5374850  Q83405  P571   1837  'factory'@en  'inception'@en  E482   

    lower_bound upper_bound  
481                          

no knee found for these labels:
       entity    node1 label  node2    type_label  \
482  Q5374850  Q131734  P576   1991  'brewery'@en   
506   Q644625  Q131734  P576   2017  'brewery'@en   

                              property_label    id lower_bound upper_bound  
482  'dissolved, abolished or demolished'@en  E483                          
506  'dissolved, abolished or demolished'@en  E507                          
using median distance to k nearest neighbor instead (26.0)

only one sample for label:
       entity   
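The log above reports a fallback, "using median distance to k nearest neighbor instead", whenever no knee is found. The body of `discretize_labels` is not shown in this part of the notebook, so the following is only a minimal sketch of what such a fallback could look like for one-dimensional values. The function name `median_knn_distance` and the default `k=1` are assumptions; with `k=1` the sketch reproduces the logged values for the small groups shown (200.0, 26.0, 0.0).

```python
from statistics import median

def median_knn_distance(values, k=1):
    """Median over all points of the distance to their k-th nearest neighbor (1-D)."""
    xs = [float(v) for v in values]
    if len(xs) < 2:
        # Mirrors the "only one sample for label" case: there is no neighbor to measure.
        return None
    k = min(k, len(xs) - 1)  # cannot ask for more neighbors than exist
    kth_distances = []
    for i, x in enumerate(xs):
        # Sorted distances from x to every other point (the point itself is excluded).
        dists = sorted(abs(x - y) for j, y in enumerate(xs) if j != i)
        kth_distances.append(dists[k - 1])
    return median(kth_distances)

print(median_knn_distance([900, 1100]))   # 200.0
print(median_knn_distance([1991, 2017]))  # 26.0
print(median_knn_distance([6, 7, 6]))     # 0.0
```

A median of 0.0, as in the last call, means that at least half of the values have an exact duplicate, which is why several of the logged fallbacks report (0.0).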

In [96]:
!head -50 $OUT/$NAME.entity_attribute_labels_time.year_bucketed.tsv | column -t -s $'\t'

entity     node1      label  node2  type_label                           property_label  id   lower_bound  upper_bound
Q1011      Q112099    P571   1975   'island nation'@en                   'inception'@en  E1   619.0
Q1011      Q3624078   P571   1975   'sovereign state'@en                 'inception'@en  E2   1615.5
Q1019      Q112099    P571   1960   'island nation'@en                   'inception'@en  E3   619.0
Q1019      Q3624078   P571   1960   'sovereign state'@en                 'inception'@en  E4   1615.5
Q1020773   Q3957      P571   1892   'town'@en                            'inception'@en  E5
Q1020773   Q902814    P571   1892   'border town'@en                     'inception'@en  E6
Q1027      Q112099    P571   1968   'island nation'@en                   'inception'@en  E7   619.0
Q1027      Q2221906   P571   1968   'geographic location'@en             'inception'@en  E8
Q1027      Q3624078   P571   1968   'sovereign state'@en                 'inception'@en  E9   
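The bucketed file attaches a `lower_bound` and/or `upper_bound` to each edge. As a rough illustration of how a value could be mapped to its interval once cut points are known, here is a sketch using the standard-library `bisect` module. The helper `assign_interval`, the cut points, and the convention that a value equal to a cut point falls into the higher bucket are all assumptions, not taken from the notebook's implementation.

```python
from bisect import bisect_right

def assign_interval(value, cut_points):
    """Return (lower_bound, upper_bound) of the bucket containing value.

    cut_points must be sorted ascending; an empty string marks an open end,
    matching the blank cells in the bucketed TSV.
    """
    i = bisect_right(cut_points, value)
    lower = cut_points[i - 1] if i > 0 else ""
    upper = cut_points[i] if i < len(cut_points) else ""
    return lower, upper

cuts = [581.5, 1370.5, 1615.5]  # hypothetical cut points for one (type, property) pair

print(assign_interval(1975, cuts))  # (1615.5, '')
print(assign_interval(1000, cuts))  # (581.5, 1370.5)
print(assign_interval(300, cuts))   # ('', 581.5)
```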

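Aggregate the distinct interval labels (type, property, lower bound, upper bound) and count the entities that carry each one; this count is the label's number of positives.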

In [148]:
!kgtk query -i $OUT/$NAME.entity_attribute_labels_time.year_bucketed.tsv \
-o $OUT/$NAME.candidate_labels_ail_time.year_temp.tsv \
--graph-cache $STORE \
--match 'labels: (type)-[l {label:prop, property_label:lab, entity:e, lower_bound:lb, upper_bound:ub}]->(val)' \
--return 'type as type, prop as prop, lb as lower_bound, ub as upper_bound, count(e) as positives, lab as property_label, "_" as id' \
--order-by 'count(e) desc'
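For readers less familiar with Kypher, the aggregation in the query above can be sketched in plain Python. This is an illustrative equivalent, not what KGTK runs internally. The sample rows are hypothetical, and which bound is filled in for each row is an assumption. The output keys already use the `node1`/`label`/`node2` names and `E1`, `E2`, ... ids that the notebook produces in separate `rename-columns` and `add-id` steps.

```python
from collections import Counter

# Hypothetical bucketed edges:
# (entity, type, property, lower_bound, upper_bound, property_label)
rows = [
    ("Q1011", "Q112099", "P571", "619.0", "", "'inception'@en"),
    ("Q1019", "Q112099", "P571", "619.0", "", "'inception'@en"),
    ("Q1027", "Q112099", "P571", "619.0", "", "'inception'@en"),
    ("Q1011", "Q3624078", "P571", "1615.5", "", "'inception'@en"),
    ("Q1019", "Q3624078", "P571", "1615.5", "", "'inception'@en"),
]

# Count entities per distinct (type, property, lower_bound, upper_bound, property_label).
counts = Counter(row[1:] for row in rows)

# most_common() orders by descending count, like --order-by 'count(e) desc'.
candidates = [
    {"node1": t, "label": p, "node2": lb, "upper_bound": ub,
     "positives": n, "property_label": lab, "id": f"E{i}"}
    for i, ((t, p, lb, ub, lab), n) in enumerate(counts.most_common(), start=1)
]

for c in candidates:
    print(c)
```

Bounds are kept as strings here so that an empty bound stays a valid grouping key, the same way a blank TSV cell does.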

In [149]:
!kgtk rename-columns -i $OUT/$NAME.candidate_labels_ail_time.year_temp.tsv -o $OUT/$NAME.candidate_labels_ail_time.year_temp1.tsv \
--old-columns type prop lower_bound --new-columns node1 label node2 

In [150]:
!kgtk add-id -i $OUT/$NAME.candidate_labels_ail_time.year_temp1.tsv \
-o $OUT/$NAME.candidate_labels_ail_time.year.tsv --overwrite-id

In [151]:
!head -20 $OUT/$NAME.candidate_labels_ail_time.year.tsv | column -t -s $'\t'

node1      label  node2   upper_bound     positives       property_label  id
Q3624078   P571   1615.5  71              'inception'@en  E1
Q4830453   P571   1733.5  46              'inception'@en  E2
Q131734    P571   1922.5  20              'inception'@en  E3
Q51576574  P571   956.5   20              'inception'@en  E4
Q7270      P571   1635.5  19              'inception'@en  E5
Q131734    P571   1797.0  1922.5          16              'inception'@en  E6
Q179164    P571   1859.5  14              'inception'@en  E7
Q123480    P571   1529.5  13              'inception'@en  E8
Q3624078   P571   581.5   1370.5          12              'inception'@en  E9
Q112099    P571   619.0   11              'inception'@en  E10
Q6256      P571   1887.5  1940.0          11              'inception'@en  E11
Q6881511   P571   1905.5  11              'inception'@en  E12
Q167270    P571   1588.5  10              'inception'@en  E13
Q6881511   P571   1820.5  1905.5          10              'incep

In [142]:
avl_file_in = "{}/{}.entity_attribute_labels_quantity.tsv".format(os.environ["OUT"], os.environ["NAME"])
ail_file_out = "{}/{}.entity_attribute_labels_quantity_bucketed.tsv".format(os.environ["OUT"], os.environ["NAME"])
discretize_labels(avl_file_in, ail_file_out)



no knee found for these labels:
        entity  node1  label  node2 si_units wd_units type_label  \
0     Q1000597  Q3957  P1082  75074                    'town'@en   
1451  Q1020773  Q3957  P1082  64764                    'town'@en   

       property_label     id lower_bound upper_bound  
0     'population'@en     E1                          
1451  'population'@en  E1452                          
using median distance to k nearest neighbor instead (10310.0)

only one sample for label:
    entity    node1  label node2 si_units wd_units          type_label  \
337  Q1011  Q112099  P6897    87            Q11229  'island nation'@en   

         property_label    id lower_bound upper_bound  
337  'literacy rate'@en  E338                          

only one sample for label:
        entity  node1  label node2 si_units wd_units type_label  \
1452  Q1020773  Q3957  P2044   525            Q11573  'town'@en   

                      property_label     id lower_bound upper_bound  
1452  'elevati



only one sample for label:
     entity     node1  label node2 si_units wd_units  \
2080  Q1027  Q2221906  P2219   3.6            Q11229   

                    type_label                                property_label  \
2080  'geographic location'@en  'real gross domestic product growth rate'@en   

         id lower_bound upper_bound  
2080  E2081                          

only one sample for label:
     entity     node1  label node2 si_units wd_units  \
2131  Q1027  Q2221906  P2855    15            Q11229   

                    type_label property_label     id lower_bound upper_bound  
2131  'geographic location'@en  'VAT-rate'@en  E2132                          

only one sample for label:
     entity     node1  label node2 si_units wd_units  \
2132  Q1027  Q2221906  P2884   230            Q25250   

                    type_label      property_label     id lower_bound  \
2132  'geographic location'@en  'mains voltage'@en  E2133               

     upper_bound  
2132             



only one sample for label:
     entity     node1  label node2 si_units wd_units  \
2653  Q1027  Q4198907  P1198     8            Q11229   

                       type_label          property_label     id lower_bound  \
2653  'parliamentary republic'@en  'unemployment rate'@en  E2654               

     upper_bound  
2653              

only one sample for label:
     entity     node1  label node2 si_units wd_units  \
2685  Q1027  Q4198907  P2046  2040           Q712226   

                       type_label property_label     id lower_bound  \
2685  'parliamentary republic'@en      'area'@en  E2686               

     upper_bound  
2685              

only one sample for label:
     entity     node1  label node2 si_units wd_units  \
2818  Q1027  Q4198907  P2219   3.6            Q11229   

                       type_label  \
2818  'parliamentary republic'@en   

                                    property_label     id lower_bound  \
2818  'real gross domestic product growth rate'@en



only one sample for label:
     entity     node1  label node2 si_units wd_units  \
2869  Q1027  Q4198907  P2855    15            Q11229   

                       type_label property_label     id lower_bound  \
2869  'parliamentary republic'@en  'VAT-rate'@en  E2870               

     upper_bound  
2869              

only one sample for label:
     entity     node1  label node2 si_units wd_units  \
2870  Q1027  Q4198907  P2884   230            Q25250   

                       type_label      property_label     id lower_bound  \
2870  'parliamentary republic'@en  'mains voltage'@en  E2871               

     upper_bound  
2870              

only one sample for label:
     entity     node1  label node2 si_units   wd_units  \
2871  Q1027  Q4198907  P3000    18           Q24564698   

                       type_label         property_label     id lower_bound  \
2871  'parliamentary republic'@en  'marriageable age'@en  E2872               

     upper_bound  
2871              

only

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


only one sample for label:
     entity    node1  label node2 si_units wd_units             type_label  \
4367  Q1033  Q512187  P2884   230            Q25250  'federal republic'@en   

          property_label     id lower_bound upper_bound  
4367  'mains voltage'@en  E4368                          

only one sample for label:
     entity    node1  label node2 si_units   wd_units             type_label  \
4368  Q1033  Q512187  P3000    18           Q24564698  'federal republic'@en   

             property_label     id lower_bound upper_bound  
4368  'marriageable age'@en  E4369                          





only one sample for label:
     entity      node1  label        node2 si_units wd_units  \
6157  Q1246  Q15634554  P1082  1.88302e+06                     

                               type_label   property_label     id lower_bound  \
6157  'state with limited recognition'@en  'population'@en  E6158               

     upper_bound  
6157              

only one sample for label:
     entity      node1  label  node2 si_units wd_units  \
6170  Q1246  Q15634554  P2046  10909           Q712226   

                               type_label property_label     id lower_bound  \
6170  'state with limited recognition'@en      'area'@en  E6171               

     upper_bound  
6170              

only one sample for label:
     entity      node1  label node2 si_units wd_units  \
6217  Q1246  Q15634554  P2219   3.6            Q11229   

                               type_label  \
6217  'state with limited recognition'@en   

                                    property_label     id lower_bou



only one sample for label:
         entity     node1  label  node2 si_units wd_units  \
6491  Q12875697  Q1349648  P1082  66919                     

                       type_label   property_label     id lower_bound  \
6491  'municipality of Greece'@en  'population'@en  E6492               

     upper_bound  
6491              

only one sample for label:
         entity    node1  label        node2 si_units wd_units    type_label  \
6492  Q12877510  Q131734  P2226  3.89603e+06                    'brewery'@en   

                  property_label     id lower_bound upper_bound  
6492  'market capitalization'@en  E6493                          

only one sample for label:
         entity      node1  label        node2 si_units wd_units  \
6493  Q12877510  Q15075508  P2226  3.89603e+06                     

           type_label              property_label     id lower_bound  \
6493  'beer brand'@en  'market capitalization'@en  E6494               

     upper_bound  
6493           



only one sample for label:
     entity      node1  label   node2 si_units wd_units           type_label  \
6599   Q142  Q20181813  P1120  593865                    'colonial power'@en   

             property_label     id lower_bound upper_bound  
6599  'number of deaths'@en  E6600                          





no knee found for these labels:
      entity      node1  label node2 si_units wd_units           type_label  \
6877    Q142  Q20181813  P2927   0.3            Q11229  'colonial power'@en   
48497    Q31  Q20181813  P2927   0.8            Q11229  'colonial power'@en   

                      property_label      id lower_bound upper_bound  
6877   'water as percent of area'@en   E6878                          
48497  'water as percent of area'@en  E48498                          
using median distance to k nearest neighbor instead (0.5000000000000001)





no knee found for these labels:
      entity     node1  label   node2 si_units wd_units            type_label  \
7046    Q142  Q3624078  P1120  593865                    'sovereign state'@en   
58740    Q38  Q3624078  P1120  633133                    'sovereign state'@en   

              property_label      id lower_bound upper_bound  
7046   'number of deaths'@en   E7047                          
58740  'number of deaths'@en  E58741                          
using median distance to k nearest neighbor instead (39268.0)





no knee found for these labels:
      entity      node1  label   node2 si_units wd_units  \
7493    Q142  Q51576574  P1120  593865                     
59153    Q38  Q51576574  P1120  633133                     

                       type_label         property_label      id lower_bound  \
7493   'Mediterranean country'@en  'number of deaths'@en   E7494               
59153  'Mediterranean country'@en  'number of deaths'@en  E59154               

      upper_bound  
7493               
59153              
using median distance to k nearest neighbor instead (39268.0)



  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
      entity  node1  label   node2 si_units wd_units     type_label  \
7940    Q142  Q7270  P1120  593865                    'republic'@en   
59979    Q38  Q7270  P1120  633133                    'republic'@en   

              property_label      id lower_bound upper_bound  
7940   'number of deaths'@en   E7941                          
59979  'number of deaths'@en  E59980                          
using median distance to k nearest neighbor instead (39268.0)

no knee found for these labels:
      entity  node1  label node2 si_units   wd_units     type_label  \
8231    Q142  Q7270  P3270     6           Q24564698  'republic'@en   
20870   Q183  Q7270  P3270     5           Q24564698  'republic'@en   
20871   Q183  Q7270  P3270     6           Q24564698  'republic'@en   
60256    Q38  Q7270  P3270     6           Q24564698  'republic'@en   
67930    Q41  Q7270  P3270     5           Q24564698  'republic'@en   
81907   Q948  Q7270  P3270     6           Q



no knee found for these labels:
      entity    node1  label node2 si_units wd_units               type_label  \
9210    Q145  Q202686  P1198     6            Q11229  'Commonwealth realm'@en   
15550    Q16  Q202686  P1198     7            Q11229  'Commonwealth realm'@en   
64843   Q408  Q202686  P1198     6            Q11229  'Commonwealth realm'@en   

               property_label      id lower_bound upper_bound  
9210   'unemployment rate'@en   E9211                          
15550  'unemployment rate'@en  E15551                          
64843  'unemployment rate'@en  E64844                          
using median distance to k nearest neighbor instead (0.0)

no knee found for these labels:
      entity    node1  label node2 si_units wd_units               type_label  \
9414    Q145  Q202686  P2219   1.8            Q11229  'Commonwealth realm'@en   
15753    Q16  Q202686  P2219   1.4            Q11229  'Commonwealth realm'@en   
65052   Q408  Q202686  P2219   2.5            Q11229 

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
      entity    node1  label node2 si_units   wd_units  \
9480    Q145  Q202686  P2997    18           Q24564698   
65117   Q408  Q202686  P2997    18           Q24564698   

                    type_label        property_label      id lower_bound  \
9480   'Commonwealth realm'@en  'age of majority'@en   E9481               
65117  'Commonwealth realm'@en  'age of majority'@en  E65118               

      upper_bound  
9480               
65117              
using median distance to k nearest neighbor instead (0.0)

only one sample for label:
     entity    node1  label node2 si_units   wd_units  \
9481   Q145  Q202686  P2999    16           Q24564698   

                   type_label       property_label     id lower_bound  \
9481  'Commonwealth realm'@en  'age of consent'@en  E9482               

     upper_bound  
9481              

no knee found for these labels:
      entity    node1  label node2 si_units   wd_units  \
9482    Q145  Q202686  P327

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
      entity  node1  label node2 si_units   wd_units    type_label  \
10303   Q145  Q6256  P2999    16           Q24564698  'country'@en   
14594   Q159  Q6256  P2999    16           Q24564698  'country'@en   

            property_label      id lower_bound upper_bound  
10303  'age of consent'@en  E10304                          
14594  'age of consent'@en  E14595                          
using median distance to k nearest neighbor instead (0.0)

no knee found for these labels:
      entity     node1  label node2 si_units wd_units  \
10395   Q148  Q1520223  P1198     5            Q11229   
33286   Q225  Q1520223  P1198    28            Q11229   
46614    Q30  Q1520223  P1198   6.7            Q11229   

                         type_label          property_label      id  \
10395  'constitutional republic'@en  'unemployment rate'@en  E10396   
33286  'constitutional republic'@en  'unemployment rate'@en  E33287   
46614  'constitutional republic'@en  'une



no knee found for these labels:
      entity     node1  label        node2 si_units wd_units  \
10397   Q148  Q1520223  P2046  9.59696e+06           Q712226   
33307   Q225  Q1520223  P2046        51197           Q712226   
46640    Q30  Q1520223  P2046  9.82668e+06           Q712226   

                         type_label property_label      id lower_bound  \
10397  'constitutional republic'@en      'area'@en  E10398               
33307  'constitutional republic'@en      'area'@en  E33308               
46640  'constitutional republic'@en      'area'@en  E46641               

      upper_bound  
10397              
33307              
46640              
using median distance to k nearest neighbor instead (229714.0)

no knee found for these labels:
      entity     node1  label node2 si_units wd_units  \
10552   Q148  Q1520223  P2219   6.7            Q11229   
33377   Q225  Q1520223  P2219   2.5            Q11229   
46815    Q30  Q1520223  P2219   1.6            Q11229   

         

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


only one sample for label:
      entity     node1  label node2 si_units wd_units  \
10601   Q148  Q1520223  P2927   2.8                     

                         type_label                 property_label      id  \
10601  'constitutional republic'@en  'water as percent of area'@en  E10602   

      lower_bound upper_bound  
10601                          

no knee found for these labels:
      entity     node1  label node2 si_units   wd_units  \
10602   Q148  Q1520223  P2997    18           Q24564698   
33422   Q225  Q1520223  P2997    18           Q24564698   

                         type_label        property_label      id lower_bound  \
10602  'constitutional republic'@en  'age of majority'@en  E10603               
33422  'constitutional republic'@en  'age of majority'@en  E33423               

      upper_bound  
10602              
33422              
using median distance to k nearest neighbor instead (0.0)

only one sample for label:
      entity     node1  label node2 



only one sample for label:
      entity    node1  label node2 si_units wd_units            type_label  \
11200   Q148  Q842112  P2219   6.7            Q11229  'socialist state'@en   

                                     property_label      id lower_bound  \
11200  'real gross domestic product growth rate'@en  E11201               

      upper_bound  
11200              

only one sample for label:
      entity    node1  label node2 si_units wd_units            type_label  \
11247   Q148  Q842112  P2855    13            Q11229  'socialist state'@en   

      property_label      id lower_bound upper_bound  
11247  'VAT-rate'@en  E11248                          

only one sample for label:
      entity    node1  label node2 si_units wd_units            type_label  \
11248   Q148  Q842112  P2884   220            Q25250  'socialist state'@en   

           property_label      id lower_bound upper_bound  
11248  'mains voltage'@en  E11249                          

only one sample for labe



only one sample for label:
      entity    node1  label node2 si_units wd_units            type_label  \
11301   Q148  Q842112  P5167    28                    'socialist state'@en   

                          property_label      id lower_bound upper_bound  
11301  'vehicles per thousand people'@en  E11302                          

only one sample for label:
      entity    node1  label node2 si_units wd_units            type_label  \
11302   Q148  Q842112  P7422   -58            Q25267  'socialist state'@en   

                        property_label      id lower_bound upper_bound  
11302  'minimum temperature record'@en  E11303                          

no knee found for these labels:
      entity    node1  label node2 si_units wd_units          type_label  \
11367   Q148  Q859563  P1198     5            Q11229  'secular state'@en   
12676   Q155  Q859563  P1198     7            Q11229  'secular state'@en   

               property_label      id lower_bound upper_bound  
11367  'u

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
      entity    node1  label node2 si_units wd_units          type_label  \
11571   Q148  Q859563  P2855    13            Q11229  'secular state'@en   
15137   Q159  Q859563  P2855    20            Q11229  'secular state'@en   

      property_label      id lower_bound upper_bound  
11571  'VAT-rate'@en  E11572                          
15137  'VAT-rate'@en  E15138                          
using median distance to k nearest neighbor instead (7.0)

only one sample for label:
      entity    node1  label node2 si_units wd_units          type_label  \
11573   Q148  Q859563  P2927   2.8                    'secular state'@en   

                      property_label      id lower_bound upper_bound  
11573  'water as percent of area'@en  E11574                          

no knee found for these labels:
      entity    node1  label node2 si_units   wd_units          type_label  \
11574   Q148  Q859563  P2997    18           Q24564698  'secular state'@en   
1294



no knee found for these labels:
       entity      node1  label node2 si_units wd_units  \
11651  Q14835  Q42744322  P2044   204            Q11573   
18804   Q1726  Q42744322  P2044   519            Q11573   
25219   Q2079  Q42744322  P2044   113            Q11573   

                               type_label                  property_label  \
11651  'urban municipality of Germany'@en  'elevation above sea level'@en   
18804  'urban municipality of Germany'@en  'elevation above sea level'@en   
25219  'urban municipality of Germany'@en  'elevation above sea level'@en   

           id lower_bound upper_bound  
11651  E11652                          
18804  E18805                          
25219  E25220                          
using median distance to k nearest neighbor instead (91.0)

only one sample for label:
       entity    node1  label node2 si_units wd_units              type_label  \
11656  Q14835  Q448801  P2044   204            Q11573  'big district town'@en   

            

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
        entity     node1  label        node2 si_units wd_units  \
11662   Q15180  Q3024240  P1082  2.93048e+08                     
11689  Q153015  Q3024240  P1082  4.80666e+06                     

                    type_label   property_label      id lower_bound  \
11662  'historical country'@en  'population'@en  E11663               
11689  'historical country'@en  'population'@en  E11690               

      upper_bound  
11662              
11689              
using median distance to k nearest neighbor instead (288240910.0)

no knee found for these labels:
        entity     node1  label        node2 si_units wd_units  \
11663   Q15180  Q3024240  P2046  2.24022e+07           Q712226   
11690  Q153015  Q3024240  P2046        14993           Q712226   

                    type_label property_label      id lower_bound upper_bound  
11663  'historical country'@en      'area'@en  E11664                          
11690  'historical country'@en      '



no knee found for these labels:
        entity    node1  label        node2 si_units wd_units    type_label  \
11691  Q153015  Q417175  P1082  4.80666e+06                    'kingdom'@en   
76323  Q756617  Q417175  P1082  5.70725e+06                    'kingdom'@en   

        property_label      id lower_bound upper_bound  
11691  'population'@en  E11692                          
76323  'population'@en  E76324                          
using median distance to k nearest neighbor instead (900590.0)

only one sample for label:
        entity    node1  label  node2 si_units wd_units    type_label  \
11692  Q153015  Q417175  P2046  14993           Q712226  'kingdom'@en   

      property_label      id lower_bound upper_bound  
11692      'area'@en  E11693                          

only one sample for label:
        entity    node1  label node2 si_units wd_units    type_label  \
11693  Q153546  Q131734  P1128   443                    'brewery'@en   

       property_label      id lower_bo

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
      entity     node1  label node2 si_units   wd_units        type_label  \
12504   Q155  Q4209223  P2997    18           Q24564698  'Rechtsstaat'@en   
14047   Q159  Q4209223  P2997    18           Q24564698  'Rechtsstaat'@en   
19496   Q183  Q4209223  P2997    18           Q24564698  'Rechtsstaat'@en   
26823   Q212  Q4209223  P2997    18           Q24564698  'Rechtsstaat'@en   
63586    Q40  Q4209223  P2997    18           Q24564698  'Rechtsstaat'@en   

             property_label      id lower_bound upper_bound  
12504  'age of majority'@en  E12505                          
14047  'age of majority'@en  E14048                          
19496  'age of majority'@en  E19497                          
26823  'age of majority'@en  E26824                          
63586  'age of majority'@en  E63587                          
using median distance to k nearest neighbor instead (0.0)

no knee found for these labels:
      entity     node1  label node2 si_uni

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


only one sample for label:
      entity     node1  label node2 si_units wd_units        type_label  \
12558   Q155  Q4209223  P7422   -14            Q25267  'Rechtsstaat'@en   

                        property_label      id lower_bound upper_bound  
12558  'minimum temperature record'@en  E12559                          

no knee found for these labels:
      entity    node1  label node2 si_units   wd_units          type_label  \
12946   Q155  Q859563  P3271    14           Q24564698  'secular state'@en   
15149   Q159  Q859563  P3271    17           Q24564698  'secular state'@en   

                                property_label      id lower_bound upper_bound  
12946  'compulsory education (maximum age)'@en  E12947                          
15149  'compulsory education (maximum age)'@en  E15150                          
using median distance to k nearest neighbor instead (3.0)

only one sample for label:
      entity    node1  label node2 si_units wd_units          type_label  \
129

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
       entity      node1  label node2 si_units wd_units  \
13016  Q15887  Q13220204  P2044   228            Q11573   
13017  Q15887  Q13220204  P2044   299            Q11573   

                                              type_label  \
13016  'second-level administrative country subdivisi...   
13017  'second-level administrative country subdivisi...   

                       property_label      id lower_bound upper_bound  
13016  'elevation above sea level'@en  E13017                          
13017  'elevation above sea level'@en  E13018                          
using median distance to k nearest neighbor instead (71.0)

only one sample for label:
       entity      node1  label  node2 si_units wd_units  \
13018  Q15887  Q13220204  P2046  197.5           Q712226   

                                              type_label property_label  \
13018  'second-level administrative country subdivisi...      'area'@en   

           id lower_bound upper_bo

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
      entity     node1  label node2 si_units   wd_units  \
13237   Q159  Q1323642  P3270     6           Q24564698   
70631    Q43  Q1323642  P3270     6           Q24564698   

                          type_label                           property_label  \
13237  'transcontinental country'@en  'compulsory education (minimum age)'@en   
70631  'transcontinental country'@en  'compulsory education (minimum age)'@en   

           id lower_bound upper_bound  
13237  E13238                          
70631  E70632                          
using median distance to k nearest neighbor instead (0.0)

no knee found for these labels:
      entity     node1  label node2 si_units   wd_units  \
13238   Q159  Q1323642  P3271    17           Q24564698   
70632    Q43  Q1323642  P3271    18           Q24564698   

                          type_label                           property_label  \
13238  'transcontinental country'@en  'compulsory education (maximum age)'@e



no knee found for these labels:
      entity    node1  label        node2 si_units wd_units        type_label  \
13364   Q159  Q185145  P2046  1.70754e+07           Q712226  'great power'@en   
13365   Q159  Q185145  P2046  1.71252e+07           Q712226  'great power'@en   
13366   Q159  Q185145  P2046  1.71252e+07           Q712226  'great power'@en   

      property_label      id lower_bound upper_bound  
13364      'area'@en  E13365                          
13365      'area'@en  E13366                          
13366      'area'@en  E13367                          
using median distance to k nearest neighbor instead (9.0)

only one sample for label:
      entity    node1  label      node2 si_units wd_units        type_label  \
13451   Q159  Q185145  P2135  3.335e+11             Q4917  'great power'@en   

           property_label      id lower_bound upper_bound  
13451  'total exports'@en  E13452                          

only one sample for label:
      entity    node1  label n



only one sample for label:
      entity    node1  label node2 si_units wd_units        type_label  \
13553   Q159  Q185145  P6591  45.4            Q25267  'great power'@en   

                        property_label      id lower_bound upper_bound  
13553  'maximum temperature record'@en  E13554                          

only one sample for label:
      entity     node1  label      node2 si_units wd_units  \
13724   Q159  Q3624078  P2135  3.335e+11             Q4917   

                 type_label      property_label      id lower_bound  \
13724  'sovereign state'@en  'total exports'@en  E13725               

      upper_bound  
13724              

only one sample for label:
      entity     node1  label      node2 si_units wd_units        type_label  \
13997   Q159  Q4209223  P2135  3.335e+11             Q4917  'Rechtsstaat'@en   

           property_label      id lower_bound upper_bound  
13997  'total exports'@en  E13998                          

only one sample for label:
     

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
      entity     node1  label node2 si_units wd_units        type_label  \
14099   Q159  Q4209223  P6591  45.4            Q25267  'Rechtsstaat'@en   
19542   Q183  Q4209223  P6591  40.3            Q25267  'Rechtsstaat'@en   

                        property_label      id lower_bound upper_bound  
14099  'maximum temperature record'@en  E14100                          
19542  'maximum temperature record'@en  E19543                          
using median distance to k nearest neighbor instead (5.0999999999999766)

only one sample for label:
      entity    node1  label      node2 si_units wd_units         type_label  \
14270   Q159  Q619610  P2135  3.335e+11             Q4917  'social state'@en   

           property_label      id lower_bound upper_bound  
14270  'total exports'@en  E14271                          



  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
      entity    node1  label node2 si_units wd_units         type_label  \
14319   Q159  Q619610  P2884   220            Q25250  'social state'@en   
20178   Q183  Q619610  P2884   230            Q25250  'social state'@en   
22002   Q184  Q619610  P2884   220            Q25250  'social state'@en   
27145   Q212  Q619610  P2884   230            Q25250  'social state'@en   
59839    Q38  Q619610  P2884   230            Q25250  'social state'@en   

           property_label      id lower_bound upper_bound  
14319  'mains voltage'@en  E14320                          
20178  'mains voltage'@en  E20179                          
22002  'mains voltage'@en  E22003                          
27145  'mains voltage'@en  E27146                          
59839  'mains voltage'@en  E59840                          
using median distance to k nearest neighbor instead (0.0)

only one sample for label:
      entity    node1  label node2 si_units   wd_units         type_lab

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
      entity    node1  label node2 si_units wd_units         type_label  \
14372   Q159  Q619610  P6591  45.4            Q25267  'social state'@en   
20226   Q183  Q619610  P6591  40.3            Q25267  'social state'@en   

                        property_label      id lower_bound upper_bound  
14372  'maximum temperature record'@en  E14373                          
20226  'maximum temperature record'@en  E20227                          
using median distance to k nearest neighbor instead (5.0999999999999766)

only one sample for label:
      entity  node1  label      node2 si_units wd_units    type_label  \
14543   Q159  Q6256  P2135  3.335e+11             Q4917  'country'@en   

           property_label      id lower_bound upper_bound  
14543  'total exports'@en  E14544                          





only one sample for label:
      entity      node1  label      node2 si_units wd_units  \
14816   Q159  Q63791824  P2135  3.335e+11             Q4917   

                                    type_label      property_label      id  \
14816  'countries bordering the Baltic Sea'@en  'total exports'@en  E14817   

      lower_bound upper_bound  
14816                          

no knee found for these labels:
      entity      node1  label node2 si_units   wd_units  \
14866   Q159  Q63791824  P2997    18           Q24564698   
20522   Q183  Q63791824  P2997    18           Q24564698   
24864   Q191  Q63791824  P2997    18           Q24564698   
25672   Q211  Q63791824  P2997    18           Q24564698   
52586    Q33  Q63791824  P2997    18           Q24564698   
54325    Q34  Q63791824  P2997    18           Q24564698   
55403    Q35  Q63791824  P2997    18           Q24564698   
57441    Q36  Q63791824  P2997    18           Q24564698   
58180    Q37  Q63791824  P2997    18           Q2456

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
      entity      node1  label node2 si_units   wd_units  \
14876   Q159  Q63791824  P3271    17           Q24564698   
52589    Q33  Q63791824  P3271    15           Q24564698   
54328    Q34  Q63791824  P3271    16           Q24564698   
57444    Q36  Q63791824  P3271    18           Q24564698   

                                    type_label  \
14876  'countries bordering the Baltic Sea'@en   
52589  'countries bordering the Baltic Sea'@en   
54328  'countries bordering the Baltic Sea'@en   
57444  'countries bordering the Baltic Sea'@en   

                                property_label      id lower_bound upper_bound  
14876  'compulsory education (maximum age)'@en  E14877                          
52589  'compulsory education (maximum age)'@en  E52590                          
54328  'compulsory education (maximum age)'@en  E54329                          
57444  'compulsory education (maximum age)'@en  E57445                          
using media

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


only one sample for label:
      entity    node1  label node2 si_units wd_units               type_label  \
15850    Q16  Q202686  P6591    45            Q25267  'Commonwealth realm'@en   

                        property_label      id lower_bound upper_bound  
15850  'maximum temperature record'@en  E15851                          

no knee found for these labels:
      entity    node1  label node2 si_units wd_units               type_label  \
15851    Q16  Q202686  P6897    99            Q11229  'Commonwealth realm'@en   
65177   Q408  Q202686  P6897    99            Q11229  'Commonwealth realm'@en   

           property_label      id lower_bound upper_bound  
15851  'literacy rate'@en  E15852                          
65177  'literacy rate'@en  E65178                          
using median distance to k nearest neighbor instead (0.0)

only one sample for label:
      entity    node1  label node2 si_units wd_units               type_label  \
15852    Q16  Q202686  P7422   -63      



no knee found for these labels:
      entity    node1  label node2 si_units wd_units  \
16398    Q16  Q223832  P2219   1.4            Q11229   
65431   Q408  Q223832  P2219   2.5            Q11229   

                                type_label  \
16398  'dominion of the British Empire'@en   
65431  'dominion of the British Empire'@en   

                                     property_label      id lower_bound  \
16398  'real gross domestic product growth rate'@en  E16399               
65431  'real gross domestic product growth rate'@en  E65432               

      upper_bound  
16398              
65431              
using median distance to k nearest neighbor instead (1.1)

no knee found for these labels:
      entity    node1  label node2 si_units   wd_units  \
16449    Q16  Q223832  P3270     6           Q24564698   
65507   Q408  Q223832  P3270     5           Q24564698   

                                type_label  \
16449  'dominion of the British Empire'@en   
65507  'dominion

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
      entity    node1  label node2 si_units wd_units  \
16496    Q16  Q223832  P6897    99            Q11229   
65556   Q408  Q223832  P6897    99            Q11229   

                                type_label      property_label      id  \
16496  'dominion of the British Empire'@en  'literacy rate'@en  E16497   
65556  'dominion of the British Empire'@en  'literacy rate'@en  E65557   

      lower_bound upper_bound  
16496                          
65556                          
using median distance to k nearest neighbor instead (0.0)

only one sample for label:
      entity    node1  label node2 si_units wd_units  \
16497    Q16  Q223832  P7422   -63            Q25267   

                                type_label                   property_label  \
16497  'dominion of the British Empire'@en  'minimum temperature record'@en   

           id lower_bound upper_bound  
16497  E16498                          

only one sample for label:
        entity



only one sample for label:
        entity    node1  label   node2 si_units wd_units  \
17798  Q168648  Q587089  P2046  169.01           Q712226   

                            type_label property_label      id lower_bound  \
17798  'city with municipal rights'@en      'area'@en  E17799               

      upper_bound  
17798              

only one sample for label:
        entity    node1  label   node2 si_units wd_units  \
17802  Q168648  Q681277  P2046  169.01           Q712226   

                         type_label property_label      id lower_bound  \
17802  'city with county rights'@en      'area'@en  E17803               

      upper_bound  
17802              

only one sample for label:
        entity    node1  label   node2 si_units wd_units        type_label  \
17806  Q168648  Q902814  P2046  169.01           Q712226  'border town'@en   

      property_label      id lower_bound upper_bound  
17806      'area'@en  E17807                          

no knee found for these

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


only one sample for label:
      entity     node1  label   node2 si_units wd_units  \
18587  Q1726  Q1066984  P1540  717308                     

                  type_label        property_label      id lower_bound  \
18587  'financial centre'@en  'male population'@en  E18588               

      upper_bound  
18587              

only one sample for label:
      entity     node1  label node2 si_units wd_units             type_label  \
18588  Q1726  Q1066984  P2044   519            Q11573  'financial centre'@en   

                       property_label      id lower_bound upper_bound  
18588  'elevation above sea level'@en  E18589                          

no knee found for these labels:
      entity     node1  label   node2 si_units wd_units  \
18589  Q1726  Q1066984  P2046  310.71           Q712226   
18590  Q1726  Q1066984  P2046  310.74           Q712226   

                  type_label property_label      id lower_bound upper_bound  
18589  'financial centre'@en      'area'@en



only one sample for label:
      entity     node1  label    node2 si_units wd_units     type_label  \
18615  Q1726  Q1180262  P2130  6.6e+09             Q4916  'residenz'@en   

      property_label      id lower_bound upper_bound  
18615      'cost'@en  E18616                          

only one sample for label:
      entity     node1  label    node2 si_units wd_units     type_label  \
18616  Q1726  Q1180262  P2769  7.2e+09             Q4916  'residenz'@en   

      property_label      id lower_bound upper_bound  
18616    'budget'@en  E18617                          

only one sample for label:
      entity     node1  label   node2 si_units wd_units         type_label  \
18634  Q1726  Q1187811  P1539  754200                    'college town'@en   

               property_label      id lower_bound upper_bound  
18634  'female population'@en  E18635                          

only one sample for label:
      entity     node1  label   node2 si_units wd_units         type_label  \
1863



only one sample for label:
      entity     node1  label   node2 si_units wd_units     type_label  \
18706  Q1726  Q1549591  P1539  754200                    'big city'@en   

               property_label      id lower_bound upper_bound  
18706  'female population'@en  E18707                          

only one sample for label:
      entity     node1  label   node2 si_units wd_units     type_label  \
18707  Q1726  Q1549591  P1540  717308                    'big city'@en   

             property_label      id lower_bound upper_bound  
18707  'male population'@en  E18708                          

only one sample for label:
      entity     node1  label    node2 si_units wd_units     type_label  \
18711  Q1726  Q1549591  P2130  6.6e+09             Q4916  'big city'@en   

      property_label      id lower_bound upper_bound  
18711      'cost'@en  E18712                          

only one sample for label:
      entity     node1  label    node2 si_units wd_units     type_label  \
187



only one sample for label:
      entity     node1  label    node2 si_units wd_units  \
18736  Q1726  Q1637706  P2769  7.2e+09             Q4916   

                                   type_label property_label      id  \
18736  'city with millions of inhabitants'@en    'budget'@en  E18737   

      lower_bound upper_bound  
18736                          

only one sample for label:
      entity    node1  label   node2 si_units wd_units       type_label  \
18754  Q1726  Q200250  P1539  754200                    'metropolis'@en   

               property_label      id lower_bound upper_bound  
18754  'female population'@en  E18755                          

only one sample for label:
      entity    node1  label   node2 si_units wd_units       type_label  \
18755  Q1726  Q200250  P1540  717308                    'metropolis'@en   

             property_label      id lower_bound upper_bound  
18755  'male population'@en  E18756                          

only one sample for label:
     

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
      entity   node1  label node2 si_units wd_units  \
18780  Q1726  Q22865  P2044   519            Q11573   
25201  Q2079  Q22865  P2044   113            Q11573   

                             type_label                  property_label  \
18780  'independent city of Germany'@en  'elevation above sea level'@en   
25201  'independent city of Germany'@en  'elevation above sea level'@en   

           id lower_bound upper_bound  
18780  E18781                          
25201  E25202                          
using median distance to k nearest neighbor instead (406.0)

only one sample for label:
      entity   node1  label    node2 si_units wd_units  \
18783  Q1726  Q22865  P2130  6.6e+09             Q4916   

                             type_label property_label      id lower_bound  \
18783  'independent city of Germany'@en      'cost'@en  E18784               

      upper_bound  
18783              

only one sample for label:
      entity   node1  labe



only one sample for label:
        entity    node1  label   node2 si_units wd_units  \
18859  Q182809  Q209824  P1082  246238                     

                    type_label   property_label      id lower_bound  \
18859  'oblast of Bulgaria'@en  'population'@en  E18860               

      upper_bound  
18859              

only one sample for label:
        entity    node1  label node2 si_units wd_units  \
18860  Q182809  Q209824  P2044   232            Q11573   

                    type_label                  property_label      id  \
18860  'oblast of Bulgaria'@en  'elevation above sea level'@en  E18861   

      lower_bound upper_bound  
18860                          

only one sample for label:
        entity    node1  label node2 si_units wd_units  \
18861  Q182809  Q209824  P2046  5543           Q712226   

                    type_label property_label      id lower_bound upper_bound  
18861  'oblast of Bulgaria'@en      'area'@en  E18862                          

no kn

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
      entity   node1  label node2 si_units wd_units          type_label  \
19837   Q183  Q43702  P2927   2.3            Q11229  'federal state'@en   
49337    Q31  Q43702  P2927   0.8            Q11229  'federal state'@en   

                      property_label      id lower_bound upper_bound  
19837  'water as percent of area'@en  E19838                          
49337  'water as percent of area'@en  E49338                          
using median distance to k nearest neighbor instead (1.4999999999999998)

no knee found for these labels:
      entity   node1  label node2 si_units   wd_units          type_label  \
19838   Q183  Q43702  P2997    18           Q24564698  'federal state'@en   
49338    Q31  Q43702  P2997    18           Q24564698  'federal state'@en   
61793    Q39  Q43702  P2997    18           Q24564698  'federal state'@en   
66254   Q408  Q43702  P2997    18           Q24564698  'federal state'@en   

             property_label      id l

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
      entity   node1  label node2 si_units wd_units          type_label  \
19885   Q183  Q43702  P6794  9.19             Q4916  'federal state'@en   
19886   Q183  Q43702  P6794  9.35             Q4916  'federal state'@en   

          property_label      id lower_bound upper_bound  
19885  'minimum wage'@en  E19886                          
19886  'minimum wage'@en  E19887                          
using median distance to k nearest neighbor instead (0.16000000000003559)

no knee found for these labels:
      entity    node1  label node2 si_units wd_units         type_label  \
19955   Q183  Q619610  P1198     5            Q11229  'social state'@en   
21817   Q184  Q619610  P1198     6            Q11229  'social state'@en   
26971   Q212  Q619610  P1198     8            Q11229  'social state'@en   
59567    Q38  Q619610  P1198    12            Q11229  'social state'@en   

               property_label      id lower_bound upper_bound  
19955  'unemployme

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
      entity      node1  label node2 si_units wd_units  \
20521   Q183  Q63791824  P2927   2.3            Q11229   
25671   Q211  Q63791824  P2927   1.5            Q11229   
57440    Q36  Q63791824  P2927   2.6            Q11229   

                                    type_label                 property_label  \
20521  'countries bordering the Baltic Sea'@en  'water as percent of area'@en   
25671  'countries bordering the Baltic Sea'@en  'water as percent of area'@en   
57440  'countries bordering the Baltic Sea'@en  'water as percent of area'@en   

           id lower_bound upper_bound  
20521  E20522                          
25671  E25672                          
57440  E57441                          
using median distance to k nearest neighbor instead (0.30000000000000127)

no knee found for these labels:
      entity      node1  label node2 si_units wd_units  \
20569   Q183  Q63791824  P6794  9.19             Q4916   
20570   Q183  Q63791824  P6



only one sample for label:
      entity    node1  label  node2 si_units wd_units  \
20946   Q184  Q123480  P1125  0.297                     

                    type_label         property_label      id lower_bound  \
20946  'landlocked country'@en  'Gini coefficient'@en  E20947               

      upper_bound  
20946              

only one sample for label:
      entity    node1  label  node2 si_units wd_units          type_label  \
21236   Q184  Q179164  P1125  0.297                    'unitary state'@en   

              property_label      id lower_bound upper_bound  
21236  'Gini coefficient'@en  E21237                          

only one sample for label:
      entity    node1  label  node2 si_units wd_units         type_label  \
21816   Q184  Q619610  P1125  0.297                    'social state'@en   

              property_label      id lower_bound upper_bound  
21816  'Gini coefficient'@en  E21817                          

only one sample for label:
      entity    nod



only one sample for label:
      entity      node1  label node2 si_units wd_units  \
25237  Q2079  Q61708099  P2044   113            Q11573   

                          type_label                  property_label      id  \
25237  'urban district in Saxony'@en  'elevation above sea level'@en  E25238   

      lower_bound upper_bound  
25237                          

only one sample for label:
      entity      node1  label  node2 si_units wd_units  \
25238  Q2079  Q61708099  P2046  297.8           Q712226   

                          type_label property_label      id lower_bound  \
25238  'urban district in Saxony'@en      'area'@en  E25239               

      upper_bound  
25238              

only one sample for label:
      entity    node1  label node2 si_units wd_units          type_label  \
26183   Q212  Q179164  P3529  9218            Q81893  'unitary state'@en   

           property_label      id lower_bound upper_bound  
26183  'median income'@en  E26184                   



only one sample for label:
      entity     node1  label node2 si_units wd_units            type_label  \
26506   Q212  Q3624078  P3529  9218            Q81893  'sovereign state'@en   

           property_label      id lower_bound upper_bound  
26506  'median income'@en  E26507                          

only one sample for label:
      entity     node1  label node2 si_units wd_units        type_label  \
26829   Q212  Q4209223  P3529  9218            Q81893  'Rechtsstaat'@en   

           property_label      id lower_bound upper_bound  
26829  'median income'@en  E26830                          

only one sample for label:
      entity    node1  label node2 si_units wd_units         type_label  \
27152   Q212  Q619610  P3529  9218            Q81893  'social state'@en   

           property_label      id lower_bound upper_bound  
27152  'median income'@en  E27153                          

only one sample for label:
      entity  node1  label node2 si_units wd_units    type_label  \


  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
      entity    node1  label node2 si_units   wd_units  \
28134   Q213  Q123480  P3270     6           Q24564698   
43799    Q28  Q123480  P3270     3           Q24564698   

                    type_label                           property_label  \
28134  'landlocked country'@en  'compulsory education (minimum age)'@en   
43799  'landlocked country'@en  'compulsory education (minimum age)'@en   

           id lower_bound upper_bound  
28134  E28135                          
43799  E43800                          
using median distance to k nearest neighbor instead (3.0)

no knee found for these labels:
      entity    node1  label node2 si_units   wd_units  \
28135   Q213  Q123480  P3271    15           Q24564698   
43800    Q28  Q123480  P3271    16           Q24564698   

                    type_label                           property_label  \
28135  'landlocked country'@en  'compulsory education (maximum age)'@en   
43800  'landlocked country'@en 

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


no knee found for these labels:
      entity      node1  label node2 si_units wd_units  \
32149   Q222  Q51576574  P8328  5.89                     
32150   Q222  Q51576574  P8328  5.98                     

                       type_label        property_label      id lower_bound  \
32149  'Mediterranean country'@en  'Democracy Index'@en  E32150               
32150  'Mediterranean country'@en  'Democracy Index'@en  E32151               

      upper_bound  
32149              
32150              
using median distance to k nearest neighbor instead (0.08999999999999381)

no knee found for these labels:
      entity  node1  label node2 si_units wd_units     type_label  \
32501   Q222  Q7270  P8328  5.89                    'republic'@en   
32502   Q222  Q7270  P8328  5.98                    'republic'@en   

             property_label      id lower_bound upper_bound  
32501  'Democracy Index'@en  E32502                          
32502  'Democracy Index'@en  E32503                     



only one sample for label:
      entity   node1  label  node2 si_units wd_units  \
34636   Q227  Q56061  P2046  86600           Q712226   

                                   type_label property_label      id  \
34636  'administrative territorial entity'@en      'area'@en  E34637   

      lower_bound upper_bound  
34636                          

only one sample for label:
      entity   node1  label node2 si_units wd_units  \
34718   Q227  Q56061  P2219  -3.8            Q11229   

                                   type_label  \
34718  'administrative territorial entity'@en   

                                     property_label      id lower_bound  \
34718  'real gross domestic product growth rate'@en  E34719               

      upper_bound  
34718              

no knee found for these labels:
      entity   node1  label   node2 si_units wd_units  \
34765   Q227  Q56061  P2573  108600                     
34766   Q227  Q56061  P2573  117058                     
34767   Q227  Q560



only one sample for label:
      entity   node1  label node2 si_units   wd_units  \
34770   Q227  Q56061  P2997    18           Q24564698   

                                   type_label        property_label      id  \
34770  'administrative territorial entity'@en  'age of majority'@en  E34771   

      lower_bound upper_bound  
34770                          

only one sample for label:
      entity   node1  label node2 si_units   wd_units  \
34771   Q227  Q56061  P3000    18           Q24564698   

                                   type_label         property_label      id  \
34771  'administrative territorial entity'@en  'marriageable age'@en  E34772   

      lower_bound upper_bound  
34771                          

only one sample for label:
      entity   node1  label node2 si_units wd_units  \
34814   Q227  Q56061  P6897   100            Q11229   

                                   type_label      property_label      id  \
34814  'administrative territorial entity'@en  'lit



only one sample for label:
      entity    node1  label node2 si_units wd_units         type_label  \
35388   Q228  Q208500  P2046   468           Q712226  'principality'@en   

      property_label      id lower_bound upper_bound  
35388      'area'@en  E35389                          

only one sample for label:
      entity    node1  label node2 si_units wd_units         type_label  \
35485   Q228  Q208500  P2855   4.5            Q11229  'principality'@en   

      property_label      id lower_bound upper_bound  
35485  'VAT-rate'@en  E35486                          

only one sample for label:
      entity    node1  label node2 si_units wd_units         type_label  \
35486   Q228  Q208500  P2884   220            Q25250  'principality'@en   

           property_label      id lower_bound upper_bound  
35486  'mains voltage'@en  E35487                          

only one sample for label:
      entity    node1  label node2 si_units wd_units         type_label  \
35487   Q228  Q208500



only one sample for label:
      entity    node1  label node2 si_units wd_units         type_label  \
35495   Q228  Q208500  P6897   100            Q11229  'principality'@en   

           property_label      id lower_bound upper_bound  
35495  'literacy rate'@en  E35496                          

only one sample for label:
      entity      node1  label node2 si_units wd_units            type_label  \
36323   Q229  Q11396118  P1198    16            Q11229  'divided country'@en   

               property_label      id lower_bound upper_bound  
36323  'unemployment rate'@en  E36324                          

only one sample for label:
      entity      node1  label    node2 si_units wd_units  \
36354   Q229  Q11396118  P2046  9242.45           Q712226   

                 type_label property_label      id lower_bound upper_bound  
36354  'divided country'@en      'area'@en  E36355                          

only one sample for label:
      entity      node1  label node2 si_units wd_uni



no knee found for these labels:
      entity  node1  label node2 si_units wd_units  type_label  \
37611   Q229  Q7275  P2219   2.8            Q11229  'state'@en   
56056    Q35  Q7275  P2219   1.1            Q11229  'state'@en   
64703    Q40  Q7275  P2219   1.5            Q11229  'state'@en   

                                     property_label      id lower_bound  \
37611  'real gross domestic product growth rate'@en  E37612               
56056  'real gross domestic product growth rate'@en  E56057               
64703  'real gross domestic product growth rate'@en  E64704               

      upper_bound  
37611              
56056              
64703              
using median distance to k nearest neighbor instead (0.4000000000000002)

no knee found for these labels:
      entity  node1  label node2 si_units wd_units  type_label  \
37676   Q229  Q7275  P2884   240            Q25250  'state'@en   
56120    Q35  Q7275  P2884   230            Q25250  'state'@en   
64751    Q40  Q727



only one sample for label:
       entity     node1  label node2 si_units wd_units            type_label  \
40833  Q23482  Q2264924  P2044    12            Q11573  'port settlement'@en   

                       property_label      id lower_bound upper_bound  
40833  'elevation above sea level'@en  E40834                          

only one sample for label:
       entity     node1  label   node2 si_units wd_units  \
40834  Q23482  Q2264924  P2046  240.62           Q712226   

                 type_label property_label      id lower_bound upper_bound  
40834  'port settlement'@en      'area'@en  E40835                          

only one sample for label:
       entity    node1  label node2 si_units wd_units              type_label  \
40840  Q23482  Q484170  P2044    12            Q11573  'commune of France'@en   

                       property_label      id lower_bound upper_bound  
40840  'elevation above sea level'@en  E40841                          

only one sample for label:
  



only one sample for label:
        entity    node1  label  node2 si_units wd_units  \
42229  Q241898  Q493522  P1082  32318                     

                         type_label   property_label      id lower_bound  \
42229  'municipality of Belgium'@en  'population'@en  E42230               

      upper_bound  
42229              

only one sample for label:
        entity    node1  label node2 si_units wd_units  \
42230  Q241898  Q493522  P2046  75.9           Q712226   

                         type_label property_label      id lower_bound  \
42230  'municipality of Belgium'@en      'area'@en  E42231               

      upper_bound  
42230              

only one sample for label:
        entity    node1  label node2 si_units wd_units  \
42231  Q241898  Q493522  P5982   124                     

                         type_label                  property_label      id  \
42231  'municipality of Belgium'@en  'annual number of weddings'@en  E42232   

      lower_bound upper



only one sample for label:
      entity   node1  label node2 si_units wd_units              type_label  \
44232   Q283  Q11173  P2107  2200            Q25267  'chemical compound'@en   

                 property_label      id lower_bound upper_bound  
44232  'decomposition point'@en  E44233                          

only one sample for label:
      entity   node1  label   node2 si_units wd_units              type_label  \
44233   Q283  Q11173  P2116  40.656           Q752197  'chemical compound'@en   

                      property_label      id lower_bound upper_bound  
44233  'enthalpy of vaporization'@en  E44234                          

only one sample for label:
      entity   node1  label  node2 si_units  wd_units              type_label  \
44234   Q283  Q11173  P2116  9.717           Q6408112  'chemical compound'@en   

                      property_label      id lower_bound upper_bound  
44234  'enthalpy of vaporization'@en  E44235                          

only one sample



only one sample for label:
      entity      node1  label  node2 si_units   wd_units  \
44281   Q283  Q11723014  P2056  9.069           Q20966455   

                         type_label      property_label      id lower_bound  \
44281  'dihydrogen chalcogenide'@en  'heat capacity'@en  E44282               

      upper_bound  
44281              

only one sample for label:
      entity      node1  label    node2 si_units wd_units  \
44282   Q283  Q11723014  P2067  18.0153           Q483261   

                         type_label property_label      id lower_bound  \
44282  'dihydrogen chalcogenide'@en      'mass'@en  E44283               

      upper_bound  
44282              

only one sample for label:
      entity      node1  label node2 si_units  wd_units  \
44283   Q283  Q11723014  P2068  0.56           Q1463969   

                         type_label             property_label      id  \
44283  'dihydrogen chalcogenide'@en  'thermal conductivity'@en  E44284   

      lower_bou



only one sample for label:
      entity      node1  label    node2 si_units   wd_units  \
44292   Q283  Q11723014  P2118  0.01012           Q26162545   

                         type_label            property_label      id  \
44292  'dihydrogen chalcogenide'@en  'kinematic viscosity'@en  E44293   

      lower_bound upper_bound  
44292                          

only one sample for label:
      entity      node1  label node2 si_units   wd_units  \
44293   Q283  Q11723014  P2240    90           Q21061369   

                         type_label                  property_label      id  \
44293  'dihydrogen chalcogenide'@en  'median lethal dose (LD50)'@en  E44294   

      lower_bound upper_bound  
44293                          

no knee found for these labels:
      entity      node1  label  node2 si_units   wd_units  \
44297   Q283  Q11723014  P3071  188.8           Q20966455   
44298   Q283  Q11723014  P3071   69.9           Q20966455   

                         type_label           



only one sample for label:
      entity   node1  label node2 si_units wd_units  type_label  \
44344   Q283  Q50690  P2101     0            Q25267  'oxide'@en   

           property_label      id lower_bound upper_bound  
44344  'melting point'@en  E44345                          

only one sample for label:
      entity   node1  label node2 si_units wd_units  type_label  \
44345   Q283  Q50690  P2102   100            Q25267  'oxide'@en   

           property_label      id lower_bound upper_bound  
44345  'boiling point'@en  E44346                          

only one sample for label:
      entity   node1  label node2 si_units wd_units  type_label  \
44346   Q283  Q50690  P2107  2200            Q25267  'oxide'@en   

                 property_label      id lower_bound upper_bound  
44346  'decomposition point'@en  E44347                          

only one sample for label:
      entity   node1  label   node2 si_units wd_units  type_label  \
44347   Q283  Q50690  P2116  40.656        



only one sample for label:
      entity     node1  label node2 si_units wd_units  type_label  \
44453    Q29  Q1250464  P1198    25            Q11229  'realm'@en   

               property_label      id lower_bound upper_bound  
44453  'unemployment rate'@en  E44454                          

only one sample for label:
      entity     node1  label   node2 si_units wd_units  type_label  \
44488    Q29  Q1250464  P2046  505990           Q712226  'realm'@en   

      property_label      id lower_bound upper_bound  
44488      'area'@en  E44489                          

only one sample for label:
      entity     node1  label node2 si_units wd_units  type_label  \
44671    Q29  Q1250464  P2219   3.2            Q11229  'realm'@en   

                                     property_label      id lower_bound  \
44671  'real gross domestic product growth rate'@en  E44672               

      upper_bound  
44671              

only one sample for label:
      entity     node1  label node2 si_



only one sample for label:
      entity     node1  label node2 si_units wd_units  type_label  \
44805    Q29  Q1250464  P7422   -30            Q25267  'realm'@en   

                        property_label      id lower_bound upper_bound  
44805  'minimum temperature record'@en  E44806                          

no knee found for these labels:
      entity     node1  label node2 si_units   wd_units            type_label  \
45202    Q29  Q3624078  P2998    18           Q24564698  'sovereign state'@en   
65876   Q408  Q3624078  P2998    18           Q24564698  'sovereign state'@en   
76649   Q801  Q3624078  P2998    21           Q24564698  'sovereign state'@en   

              property_label      id lower_bound upper_bound  
45202  'age of candidacy'@en  E45203                          
65876  'age of candidacy'@en  E65877                          
76649  'age of candidacy'@en  E76650                          
using median distance to k nearest neighbor instead (0.0)

no knee found for t



no knee found for these labels:
      entity     node1  label node2 si_units wd_units       type_label  \
46209    Q30  Q1489259  P1125  34.6                    'superpower'@en   
46210    Q30  Q1489259  P1125  47.7                    'superpower'@en   

              property_label      id lower_bound upper_bound  
46209  'Gini coefficient'@en  E46210                          
46210  'Gini coefficient'@en  E46211                          
using median distance to k nearest neighbor instead (13.100000000000014)

only one sample for label:
      entity     node1  label node2 si_units wd_units       type_label  \
46211    Q30  Q1489259  P1198   6.7            Q11229  'superpower'@en   

               property_label      id lower_bound upper_bound  
46211  'unemployment rate'@en  E46212                          

only one sample for label:
      entity     node1  label        node2 si_units wd_units       type_label  \
46237    Q30  Q1489259  P2046  9.82668e+06           Q712226  'superp



no knee found for these labels:
      entity     node1  label node2 si_units wd_units  \
46612    Q30  Q1520223  P1125  34.6                     
46613    Q30  Q1520223  P1125  47.7                     

                         type_label         property_label      id  \
46612  'constitutional republic'@en  'Gini coefficient'@en  E46613   
46613  'constitutional republic'@en  'Gini coefficient'@en  E46614   

      lower_bound upper_bound  
46612                          
46613                          
using median distance to k nearest neighbor instead (13.100000000000014)

only one sample for label:
      entity     node1  label  node2 si_units wd_units  \
46878    Q30  Q1520223  P3529  43585             Q4917   

                         type_label      property_label      id lower_bound  \
46878  'constitutional republic'@en  'median income'@en  E46879               

      upper_bound  
46878              

only one sample for label:
      entity     node1  label node2 si_units



no knee found for these labels:
      entity     node1  label node2 si_units wd_units  \
47821    Q30  Q5255892  P1125  34.6                     
47822    Q30  Q5255892  P1125  47.7                     

                     type_label         property_label      id lower_bound  \
47821  'democratic republic'@en  'Gini coefficient'@en  E47822               
47822  'democratic republic'@en  'Gini coefficient'@en  E47823               

      upper_bound  
47821              
47822              
using median distance to k nearest neighbor instead (13.100000000000014)

only one sample for label:
      entity     node1  label node2 si_units wd_units  \
47823    Q30  Q5255892  P1198   6.7            Q11229   

                     type_label          property_label      id lower_bound  \
47823  'democratic republic'@en  'unemployment rate'@en  E47824               

      upper_bound  
47823              

only one sample for label:
      entity     node1  label        node2 si_units wd_uni



only one sample for label:
      entity    node1  label   node2 si_units wd_units        type_label  \
50224    Q32  Q165116  P2046  2586.4           Q712226  'grand duchy'@en   

      property_label      id lower_bound upper_bound  
50224      'area'@en  E50225                          

only one sample for label:
      entity    node1  label   node2 si_units wd_units        type_label  \
50283    Q32  Q165116  P2132  101450             Q4917  'grand duchy'@en   

                    property_label      id lower_bound upper_bound  
50283  'nominal GDP per capita'@en  E50284                          

only one sample for label:
      entity    node1  label node2 si_units wd_units        type_label  \
50284    Q32  Q165116  P2219     4            Q11229  'grand duchy'@en   

                                     property_label      id lower_bound  \
50284  'real gross domestic product growth rate'@en  E50285               

      upper_bound  
50284              

only one sample for la



no knee found for these labels:
      entity    node1  label node2 si_units wd_units          type_label  \
50680    Q32  Q179164  P5167   661                    'unitary state'@en   
50681    Q32  Q179164  P5167   662                    'unitary state'@en   

                          property_label      id lower_bound upper_bound  
50680  'vehicles per thousand people'@en  E50681                          
50681  'vehicles per thousand people'@en  E50682                          
using median distance to k nearest neighbor instead (1.0)

only one sample for label:
          entity   node1  label node2 si_units wd_units   type_label  \
50972  Q32241267  Q23442  P2044    42            Q11573  'island'@en   

                       property_label      id lower_bound upper_bound  
50972  'elevation above sea level'@en  E50973                          

no knee found for these labels:
      entity      node1  label node2 si_units wd_units  \
52636    Q33  Q63791824  P7422 -51.5            



only one sample for label:
         entity     node1  label    node2 si_units wd_units       type_label  \
52640  Q3319685  Q6881511  P2403  1.2e+10             Q4917  'enterprise'@en   

          property_label      id lower_bound upper_bound  
52640  'total assets'@en  E52641                          

no knee found for these labels:
      entity    node1  label node2 si_units wd_units          type_label  \
53019    Q34  Q179164  P2834    20            Q11229  'unitary state'@en   
53020    Q34  Q179164  P2834    25            Q11229  'unitary state'@en   

                 property_label      id lower_bound upper_bound  
53019  'individual tax rate'@en  E53020                          
53020  'individual tax rate'@en  E53021                          
using median distance to k nearest neighbor instead (5.0)

no knee found for these labels:
      entity     node1  label node2 si_units wd_units            type_label  \
53452    Q34  Q3624078  P2834    20            Q11229  'sovereig



only one sample for label:
       entity     node1  label node2 si_units   wd_units  \
54377  Q34366  Q5852411  P2999    17           Q24564698   

                    type_label       property_label      id lower_bound  \
54377  'state of Australia'@en  'age of consent'@en  E54378               

      upper_bound  
54377              

only one sample for label:
      entity      node1  label node2 si_units wd_units           type_label  \
54688    Q35  Q20181813  P3270     5                    'colonial power'@en   

                                property_label      id lower_bound upper_bound  
54688  'compulsory education (minimum age)'@en  E54689                          

only one sample for label:
      entity      node1  label node2 si_units wd_units           type_label  \
54736    Q35  Q20181813  P7422 -31.2            Q25267  'colonial power'@en   

                        property_label      id lower_bound upper_bound  
54736  'minimum temperature record'@en  E54737      



only one sample for label:
      entity      node1  label  node2 si_units wd_units  \
55766    Q35  Q66724388  P3529  44360             Q4917   

                                              type_label      property_label  \
55766  'autonomous country within the Kingdom of Denm...  'median income'@en   

           id lower_bound upper_bound  
55766  E55767                          

only one sample for label:
      entity      node1  label node2 si_units wd_units  \
55813    Q35  Q66724388  P7422 -31.2            Q25267   

                                              type_label  \
55813  'autonomous country within the Kingdom of Denm...   

                        property_label      id lower_bound upper_bound  
55813  'minimum temperature record'@en  E55814                          

only one sample for label:
      entity  node1  label node2 si_units wd_units  type_label  \
56124    Q35  Q7275  P3270     5                    'state'@en   

                                property



only one sample for label:
      entity    node1  label node2 si_units wd_units          type_label  \
58581    Q38  Q179164  P2436   230            Q25250  'unitary state'@en   

      property_label      id lower_bound upper_bound  
58581   'voltage'@en  E58582                          

only one sample for label:
      entity     node1  label node2 si_units wd_units            type_label  \
58994    Q38  Q3624078  P2436   230            Q25250  'sovereign state'@en   

      property_label      id lower_bound upper_bound  
58994   'voltage'@en  E58995                          

only one sample for label:
      entity      node1  label node2 si_units wd_units  \
59407    Q38  Q51576574  P2436   230            Q25250   

                       type_label property_label      id lower_bound  \
59407  'Mediterranean country'@en   'voltage'@en  E59408               

      upper_bound  
59407              

only one sample for label:
      entity    node1  label   node2 si_units wd_units 



only one sample for label:
      entity    node1  label node2 si_units wd_units          type_label  \
60776    Q39  Q170156  P1198     4            Q11229  'confederation'@en   

               property_label      id lower_bound upper_bound  
60776  'unemployment rate'@en  E60777                          

only one sample for label:
      entity    node1  label  node2 si_units wd_units          type_label  \
60807    Q39  Q170156  P2046  41285           Q712226  'confederation'@en   

      property_label      id lower_bound upper_bound  
60807      'area'@en  E60808                          

only one sample for label:
      entity    node1  label node2 si_units wd_units          type_label  \
60962    Q39  Q170156  P2219   1.3            Q11229  'confederation'@en   

                                     property_label      id lower_bound  \
60962  'real gross domestic product growth rate'@en  E60963               

      upper_bound  
60962              

only one sample for label:



no knee found for these labels:
      entity    node1  label node2 si_units wd_units               type_label  \
64870   Q408  Q202686  P1689  38.3            Q11229  'Commonwealth realm'@en   
64871   Q408  Q202686  P1689  40.5            Q11229  'Commonwealth realm'@en   

                                         property_label      id lower_bound  \
64870  'central government debt as a percent of GDP'@en  E64871               
64871  'central government debt as a percent of GDP'@en  E64872               

      upper_bound  
64870              
64871              
using median distance to k nearest neighbor instead (2.200000000000033)

only one sample for label:
      entity    node1  label node2 si_units wd_units               type_label  \
64872   Q408  Q202686  P2043  3860           Q828224  'Commonwealth realm'@en   

      property_label      id lower_bound upper_bound  
64872    'length'@en  E64873                          

only one sample for label:
      entity    node1  la



only one sample for label:
      entity   node1  label node2 si_units wd_units          type_label  \
66009   Q408  Q43702  P2043  3860           Q828224  'federal state'@en   

      property_label      id lower_bound upper_bound  
66009    'length'@en  E66010                          

only one sample for label:
      entity   node1  label node2 si_units wd_units          type_label  \
66014   Q408  Q43702  P2049  4000           Q828224  'federal state'@en   

      property_label      id lower_bound upper_bound  
66014     'width'@en  E66015                          

no knee found for these labels:
      entity   node1  label  node2 si_units wd_units          type_label  \
66236   Q408  Q43702  P2547  35877           Q828224  'federal state'@en   
66237   Q408  Q43702  P2547  59736           Q828224  'federal state'@en   

       property_label      id lower_bound upper_bound  
66236  'perimeter'@en  E66237                          
66237  'perimeter'@en  E66238                    



only one sample for label:
      entity  node1  label node2 si_units wd_units     type_label  \
67927    Q41  Q7270  P2997    18                    'republic'@en   

             property_label      id lower_bound upper_bound  
67927  'age of majority'@en  E67928                          

no knee found for these labels:
      entity   node1  label   node2 si_units wd_units  \
69702   Q424  Q41614  P2046  181035           Q712226   
79581   Q869  Q41614  P2046  513120           Q712226   

                         type_label property_label      id lower_bound  \
69702  'constitutional monarchy'@en      'area'@en  E69703               
79581  'constitutional monarchy'@en      'area'@en  E79582               

      upper_bound  
69702              
79581              
using median distance to k nearest neighbor instead (332084.5)

no knee found for these labels:
      entity   node1  label node2 si_units wd_units  \
69807   Q424  Q41614  P2219     7            Q11229   
79756   Q869  Q4



no knee found for these labels:
      entity   node1  label node2 si_units wd_units  \
69858   Q424  Q41614  P2855    10            Q11229   
79809   Q869  Q41614  P2855    10            Q11229   

                         type_label property_label      id lower_bound  \
69858  'constitutional monarchy'@en  'VAT-rate'@en  E69859               
79809  'constitutional monarchy'@en  'VAT-rate'@en  E79810               

      upper_bound  
69858              
79809              
using median distance to k nearest neighbor instead (0.0)

no knee found for these labels:
      entity   node1  label node2 si_units wd_units  \
69859   Q424  Q41614  P2884   230            Q25250   
79810   Q869  Q41614  P2884   220            Q25250   

                         type_label      property_label      id lower_bound  \
69859  'constitutional monarchy'@en  'mains voltage'@en  E69860               
79810  'constitutional monarchy'@en  'mains voltage'@en  E79811               

      upper_bound  
6985



only one sample for label:
      entity     node1  label node2 si_units wd_units  \
70356    Q43  Q1323642  P1198     9            Q11229   

                          type_label          property_label      id  \
70356  'transcontinental country'@en  'unemployment rate'@en  E70357   

      lower_bound upper_bound  
70356                          

no knee found for these labels:
      entity     node1  label node2 si_units   wd_units  \
70629    Q43  Q1323642  P3001    58           Q24564698   
70630    Q43  Q1323642  P3001    60           Q24564698   

                          type_label       property_label      id lower_bound  \
70629  'transcontinental country'@en  'retirement age'@en  E70630               
70630  'transcontinental country'@en  'retirement age'@en  E70631               

      upper_bound  
70629              
70630              
using median distance to k nearest neighbor instead (2.0)

no knee found for these labels:
        entity node1  label node2 si_units 



only one sample for label:
      entity      node1  label   node2 si_units wd_units        type_label  \
73219   Q459  Q15344922  P2046  101.98           Q712226  'oblast seat'@en   

      property_label      id lower_bound upper_bound  
73219      'area'@en  E73220                          

only one sample for label:
      entity      node1  label   node2 si_units wd_units  \
73223   Q459  Q50330360  P1082  384088                     

                     type_label   property_label      id lower_bound  \
73223  'second largest city'@en  'population'@en  E73224               

      upper_bound  
73223              

only one sample for label:
      entity      node1  label node2 si_units wd_units  \
73224   Q459  Q50330360  P2044   164            Q11573   

                     type_label                  property_label      id  \
73224  'second largest city'@en  'elevation above sea level'@en  E73225   

      lower_bound upper_bound  
73224                          

only one sa



no knee found for these labels:
      entity      node1  label node2 si_units wd_units  \
73270    Q55  Q15304003  P1198   3.8            Q11229   
73271    Q55  Q15304003  P1198   4.1            Q11229   
73272    Q55  Q15304003  P1198     7            Q11229   

                                           type_label          property_label  \
73270  'country of the Kingdom of the Netherlands'@en  'unemployment rate'@en   
73271  'country of the Kingdom of the Netherlands'@en  'unemployment rate'@en   
73272  'country of the Kingdom of the Netherlands'@en  'unemployment rate'@en   

           id lower_bound upper_bound  
73270  E73271                          
73271  E73272                          
73272  E73273                          
using median distance to k nearest neighbor instead (0.3000000000000027)

only one sample for label:
      entity      node1  label  node2 si_units wd_units  \
73299    Q55  Q15304003  P2046  41543           Q712226   

                              



only one sample for label:
      entity      node1  label node2 si_units wd_units  \
73541    Q55  Q15304003  P2884   230            Q25250   

                                           type_label      property_label  \
73541  'country of the Kingdom of the Netherlands'@en  'mains voltage'@en   

           id lower_bound upper_bound  
73541  E73542                          

only one sample for label:
      entity      node1  label node2 si_units wd_units  \
73542    Q55  Q15304003  P2927  18.7            Q11229   

                                           type_label  \
73542  'country of the Kingdom of the Netherlands'@en   

                      property_label      id lower_bound upper_bound  
73542  'water as percent of area'@en  E73543                          

only one sample for label:
      entity      node1  label node2 si_units   wd_units  \
73543    Q55  Q15304003  P2997    18           Q24564698   

                                           type_label        property_



no knee found for these labels:
      entity     node1  label node2 si_units wd_units  \
74012    Q61  Q1093829  P2046   177           Q712226   
74013    Q61  Q1093829  P2046   259           Q712226   

                           type_label property_label      id lower_bound  \
74012  'city of the United States'@en      'area'@en  E74013               
74013  'city of the United States'@en      'area'@en  E74014               

      upper_bound  
74012              
74013              
using median distance to k nearest neighbor instead (82.0)

only one sample for label:
      entity     node1  label  node2 si_units wd_units  \
74014    Q61  Q1093829  P2927  10.67            Q11229   

                           type_label                 property_label      id  \
74014  'city of the United States'@en  'water as percent of area'@en  E74015   

      lower_bound upper_bound  
74014                          

only one sample for label:
      entity     node1  label node2 si_units wd_un



only one sample for label:
      entity   node1  label node2 si_units  wd_units             type_label  \
74129   Q663  Q11344  P2068   236           Q1463969  'chemical element'@en   

                  property_label      id lower_bound upper_bound  
74129  'thermal conductivity'@en  E74130                          

only one sample for label:
      entity   node1  label node2 si_units wd_units             type_label  \
74130   Q663  Q11344  P2101  1220            Q42289  'chemical element'@en   

           property_label      id lower_bound upper_bound  
74130  'melting point'@en  E74131                          

only one sample for label:
      entity   node1  label node2 si_units wd_units             type_label  \
74131   Q663  Q11344  P2101   660            Q25267  'chemical element'@en   

           property_label      id lower_bound upper_bound  
74131  'melting point'@en  E74132                          

only one sample for label:
      entity   node1  label node2 si_units



only one sample for label:
      entity    node1  label node2 si_units wd_units          type_label  \
74146   Q663  Q214609  P1108  1.61                    'base material'@en   

               property_label      id lower_bound upper_bound  
74146  'electronegativity'@en  E74147                          

only one sample for label:
      entity    node1  label node2 si_units   wd_units          type_label  \
74150   Q663  Q214609  P2054   2.7           Q13147228  'base material'@en   

      property_label      id lower_bound upper_bound  
74150   'density'@en  E74151                          

only one sample for label:
      entity    node1  label    node2 si_units   wd_units          type_label  \
74151   Q663  Q214609  P2055  3.5e+07           Q20966435  'base material'@en   

                     property_label      id lower_bound upper_bound  
74151  'electrical conductivity'@en  E74152                          

only one sample for label:
      entity    node1  label    node2 



only one sample for label:
       entity      node1  label node2 si_units wd_units  \
74175  Q69007  Q14770218  P2044   593            Q11573   

                                 type_label                  property_label  \
74175  'cantonal capital of Switzerland'@en  'elevation above sea level'@en   

           id lower_bound upper_bound  
74175  E74176                          

only one sample for label:
       entity      node1  label  node2 si_units wd_units  \
74176  Q69007  Q14770218  P2046  28.09           Q712226   

                                 type_label property_label      id  \
74176  'cantonal capital of Switzerland'@en      'area'@en  E74177   

      lower_bound upper_bound  
74176                          

only one sample for label:
       entity      node1  label node2 si_units wd_units  \
74189  Q69007  Q54935504  P2044   593            Q11573   

                     type_label                  property_label      id  \
74189  'city of Switzerland'@en  'eleva



only one sample for label:
        entity     node1  label    node2 si_units wd_units     type_label  \
77073  Q805734  Q4830453  P2295  6.3e+09            Q41044  'business'@en   

        property_label      id lower_bound upper_bound  
77073  'net profit'@en  E77074                          

no knee found for these labels:
        entity     node1  label        node2 si_units wd_units  \
77075  Q805734  Q6881511  P2139  7.45188e+10            Q41044   
77076  Q805734  Q6881511  P2139     8.93e+10            Q41044   

            type_label      property_label      id lower_bound upper_bound  
77075  'enterprise'@en  'total revenue'@en  E77076                          
77076  'enterprise'@en  'total revenue'@en  E77077                          
using median distance to k nearest neighbor instead (14781200000.000013)

only one sample for label:
        entity     node1  label    node2 si_units wd_units       type_label  \
77077  Q805734  Q6881511  P2295  6.3e+09            Q41044  '

  return (a - min(a)) / (max(a) - min(a))
The line is probably not polynomial, try plotting
the difference curve with plt.plot(knee.x_difference, knee.y_difference)
Also check that you aren't mistakenly setting the curve argument


only one sample for label:
      entity     node1  label        node2 si_units wd_units  \
79136   Q869  Q3624078  P1174  3.25883e+07                     

                 type_label          property_label      id lower_bound  \
79136  'sovereign state'@en  'visitors per year'@en  E79137               

      upper_bound  
79136              

only one sample for label:
      entity   node1  label        node2 si_units wd_units  \
79556   Q869  Q41614  P1174  3.25883e+07                     

                         type_label          property_label      id  \
79556  'constitutional monarchy'@en  'visitors per year'@en  E79557   

      lower_bound upper_bound  
79556                          

only one sample for label:
      entity   node1  label node2 si_units wd_units  \
79557   Q869  Q41614  P1198   0.9            Q11229   

                         type_label          property_label      id  \
79557  'constitutional monarchy'@en  'unemployment rate'@en  E79558   

      lower



only one sample for label:
      entity    node1  label node2 si_units wd_units              type_label  \
80193   Q916  Q200464  P6591  43.5            Q25267  'Portuguese Empire'@en   

                        property_label      id lower_bound upper_bound  
80193  'maximum temperature record'@en  E80194                          

only one sample for label:
      entity    node1  label node2 si_units wd_units              type_label  \
80194   Q916  Q200464  P6897    66            Q11229  'Portuguese Empire'@en   

           property_label      id lower_bound upper_bound  
80194  'literacy rate'@en  E80195                          

no knee found for these labels:
          entity      node1  label node2 si_units wd_units       type_label  \
80547  Q93552342  Q15075508  P6088    20                    'beer brand'@en   
80561  Q93557205  Q15075508  P6088  29.5                    'beer brand'@en   
80573  Q93558270  Q15075508  P6088    20                    'beer brand'@en   
80587  Q



no knee found for these labels:
          entity    node1  label node2 si_units wd_units      type_label  \
80551  Q93552342  Q167270  P6088    20                    'trademark'@en   
80565  Q93557205  Q167270  P6088  29.5                    'trademark'@en   
80577  Q93558270  Q167270  P6088    20                    'trademark'@en   
80591  Q93559285  Q167270  P6088    15                    'trademark'@en   
80603  Q93560567  Q167270  P6088    26                    'trademark'@en   

             property_label      id lower_bound upper_bound  
80551  'beer bitterness'@en  E80552                          
80565  'beer bitterness'@en  E80566                          
80577  'beer bitterness'@en  E80578                          
80591  'beer bitterness'@en  E80592                          
80603  'beer bitterness'@en  E80604                          
using median distance to k nearest neighbor instead (3.5)

only one sample for label:
          entity    node1  label node2 si_units  wd_u



only one sample for label:
          entity    node1  label node2 si_units  wd_units       type_label  \
80584  Q93559285  Q131413  P2665   5.4           Q2080811  'wheat beer'@en   

               property_label      id lower_bound upper_bound  
80584  'alcohol by volume'@en  E80585                          

only one sample for label:
          entity    node1  label node2 si_units wd_units       type_label  \
80585  Q93559285  Q131413  P6088    15                    'wheat beer'@en   

             property_label      id lower_bound upper_bound  
80585  'beer bitterness'@en  E80586                          

only one sample for label:
          entity    node1  label node2 si_units  wd_units  type_label  \
80604  Q93560567  Q217825  P2665   5.5           Q2080811  'stout'@en   

               property_label      id lower_bound upper_bound  
80604  'alcohol by volume'@en  E80605                          

only one sample for label:
          entity    node1  label node2 si_units wd
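
The fallback reported in the log above ("using median distance to k nearest neighbor instead") can be sketched as follows. This is only a minimal illustration, not the bucketing code that produced the log: the function name `median_knn_distance` and the default `k=1` are assumptions, chosen because k=1 reproduces the 3.5 reported for the 'beer bitterness' values and the 2.0 reported for 'retirement age'.

```python
from statistics import median


def median_knn_distance(values, k=1):
    """Median, over all points, of the distance to the k-th nearest other point (1-D).

    Hypothetical helper: the name and the default k=1 are assumptions.
    """
    if len(values) < 2:
        raise ValueError("need at least two samples")
    k = min(k, len(values) - 1)  # cannot have more neighbors than other points
    kth_distances = []
    for i, v in enumerate(values):
        # Distances from this point to every other point, smallest first.
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        kth_distances.append(dists[k - 1])
    return median(kth_distances)


# 'beer bitterness' (P6088) values for type 'trademark' taken from the log above.
bitterness = [20, 29.5, 20, 15, 26]
print(median_knn_distance(bitterness))
```

A label with a single sample has no neighbor to measure against, which is presumably why those cases are reported separately as "only one sample for label".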

In [144]:
!head -50 $OUT/$NAME.entity_attribute_labels_quantity_bucketed.tsv | column -t -s $'\t'

entity    node1    label  node2               si_units            wd_units                      type_label  property_label  id     lower_bound  upper_bound
Q1000597  Q3957    P1082  75074.0             'town'@en           'population'@en               E1
Q1011     Q112099  P1081  0.57                'island nation'@en  'Human Development Index'@en  E2          0.534           0.616
Q1011     Q112099  P1081  0.5720000000000001  'island nation'@en  'Human Development Index'@en  E3          0.534           0.616
Q1011     Q112099  P1081  0.575               'island nation'@en  'Human Development Index'@en  E4          0.534           0.616
Q1011     Q112099  P1081  0.585               'island nation'@en  'Human Development Index'@en  E5          0.534           0.616
Q1011     Q112099  P1081  0.589               'island nation'@en  'Human Development Index'@en  E6          0.534           0.616
Q1011     Q112099  P1081  0.5920000000000001  'island nation'@en  'Human Development Index'@en 

Aggregating distinct interval labels with positive entity counts

In [164]:
!kgtk query -i $OUT/$NAME.entity_attribute_labels_quantity_bucketed.tsv \
-o $OUT/$NAME.candidate_labels_ail_quantity_temp.tsv \
--graph-cache $STORE \
--match 'labels: (type)-[l {label:prop, property_label:lab, si_units:si, wd_units:wd, entity:e, lower_bound:lb, upper_bound:ub}]->(val)' \
--return 'type as type, prop as prop, si as si_units, wd as wd_units, lb as lower_bound, ub as upper_bound, count(e) as positives, lab as property_label, "_" as id' \
--order-by 'count(e) desc'
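
One thing to keep in mind with the query above: as far as I can tell, `count(e)` counts one per matching row, so an entity that has several values inside the same interval (as Q1011 does for P1081 in the preview above) contributes more than once. The sketch below contrasts row counts with distinct entity counts. The helper `count_positives` and the entity `Q_other` are made up for illustration.

```python
from collections import defaultdict

# Rows shaped like the bucketed table: (entity, type, prop, lower_bound, upper_bound).
rows = [
    ("Q1011", "Q112099", "P1081", 0.534, 0.616),
    ("Q1011", "Q112099", "P1081", 0.534, 0.616),    # same entity, second value in the same interval
    ("Q_other", "Q112099", "P1081", 0.534, 0.616),  # made-up second entity
]


def count_positives(rows):
    """Return {interval label: (row count, distinct entity count)}."""
    row_counts = defaultdict(int)
    entities = defaultdict(set)
    for entity, type_, prop, lower, upper in rows:
        key = (type_, prop, lower, upper)
        row_counts[key] += 1
        entities[key].add(entity)
    return {key: (row_counts[key], len(entities[key])) for key in row_counts}


result = count_positives(rows)
for label, (n_rows, n_entities) in result.items():
    print(label, "rows:", n_rows, "distinct entities:", n_entities)
```

If the intent is the number of positive entities, `count(distinct e)` in the `--return` and `--order-by` clauses would be the analogous change, matching what the RAL queries further down do with `count(distinct n1)`.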

In [165]:
!kgtk rename-columns -i $OUT/$NAME.candidate_labels_ail_quantity_temp.tsv -o $OUT/$NAME.candidate_labels_ail_quantity_temp1.tsv \
--old-columns type prop lower_bound --new-columns node1 label node2 

In [166]:
!kgtk add-id -i $OUT/$NAME.candidate_labels_ail_quantity_temp1.tsv \
-o $OUT/$NAME.candidate_labels_ail_quantity.tsv --overwrite-id

In [167]:
!head -20 $OUT/$NAME.candidate_labels_ail_quantity.tsv | column -t -s $'\t'

node1     label  si_units    wd_units            node2                         upper_bound                   positives             property_label  id
Q3624078  P2131  Q4917       800249412297.1975   2752                          'nominal GDP'@en              E1
Q3624078  P1082  41699675.5  2689                'population'@en               E2
Q3624078  P2134  Q4917       67399246727.5       2529                          'total reserves'@en           E3
Q3624078  P2132  Q4917       21120.5             2424                          'nominal GDP per capita'@en   E4
Q3624078  P1081  0.595       0.9405000000000001  1893                          'Human Development Index'@en  E5
Q3624078  P1279  Q11229      -5.6                42.7                          1516                          'inflation rate'@en   E6
Q3624078  P4010  Q550207     1305343874679.5     1514                          'GDP (PPP)'@en                E7
Q3624078  P2299  Q550207     31357.3805          1463             

### Combining entity --> attribute interval label mappings into a single table

In [171]:
!kgtk cat \
-i $OUT/$NAME.entity_attribute_labels_quantity_bucketed.tsv \
-i $OUT/$NAME.entity_attribute_labels_time.year_bucketed.tsv \
-o $OUT/$NAME.entity_AILs_all.tsv

## 5. Create RALs with counts of positive entities

Here is an idea of the kind of values we should find (creating RALs from scratch, for string attributes only):

In [626]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $DATA/$NAME.part.string.tsv \
-i $OUT/$NAME.type_mapping.tsv -i $DATA/$NAME.label.en.tsv --graph-cache $STORE \
--match 'item: (n1)-[l1 {label:p1}]->(n2), type: (n1)-[]->(type1), label: (p1)-[:label]->(lab1), string: (n2)-[l2 {label:p2}]->(n3), type: (n2)-[]->(type2), label: (p2)-[:label]->(lab2)' \
--return 'distinct type1 as type1, p1 as prop1, type2 as type2, p2 as prop2, n3 as value, count(distinct n1) as positives, lab1 as prop1_label, lab2 as prop2_label' \
--where 'lab1.kgtk_lqstring_lang_suffix = "en" and lab2.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'count(distinct n1) desc' \
--limit 5 \
| column -t -s $'\t'

type1     prop1  type2     prop2  value           positives  prop1_label               prop2_label
Q131734   P452   Q8148     P373   "Beer brewing"  77         'industry'@en             'Commons category'@en
Q3624078  P530   Q3624078  P3238  "0"             68         'diplomatic relation'@en  'trunk prefix'@en
Q3624078  P530   Q4209223  P3238  "0"             68         'diplomatic relation'@en  'trunk prefix'@en
Q3624078  P530   Q43702    P3238  "0"             68         'diplomatic relation'@en  'trunk prefix'@en
Q3624078  P530   Q619610   P3238  "0"             68         'diplomatic relation'@en  'trunk prefix'@en


**NOTE:** It would be convenient to have all of the entity --> label mappings together in one file here. I don't see a need to keep them separate, except perhaps keeping AVLs and AILs apart.

Now build the RALs using the REL table and the entity --> attribute label tables that we built in steps 2 and 4, keeping track of the count of positive entities for each label.

A first attempt reuses the RELs. To count positives correctly, however, we would need to combine the num_pos values of every REL row that resolves to the same label. I am not sure how to do this, so the query below does not capture the case where type1 --> x1 and type1 --> x2 resolve to the same label, i.e. type1 --> typex --> val. See further down for an alternative solution.
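
The counting problem can be seen on a toy example. All identifiers below are made up: three entities of type `T` point via `p` to `x1` and/or `x2`, and both targets carry the same attribute label `(U, q, "0")`. Reusing REL counts gives one row per target, and simply summing those rows overcounts any entity that points to both targets. Counting distinct source entities gives the number we actually want.

```python
# Toy data (all identifiers are hypothetical).
item_edges = [("a", "p", "x1"), ("b", "p", "x2"), ("c", "p", "x1"), ("c", "p", "x2")]
entity_type = {"a": "T", "b": "T", "c": "T"}
attribute_labels = {"x1": ("U", "q", "0"), "x2": ("U", "q", "0")}

# REL-style counts: positives per (type, prop, value entity).
rel = {}
for n1, p, n2 in item_edges:
    rel.setdefault((entity_type[n1], p, n2), set()).add(n1)
rel_positives = {key: len(ents) for key, ents in rel.items()}

# Reusing RELs: one row per value entity, each carrying its own count.
per_rel_rows = [(t, p, attribute_labels[v], n) for (t, p, v), n in rel_positives.items()]
summed = sum(n for *_, n in per_rel_rows)

# Counting distinct source entities directly from the item edges.
target_label = ("U", "q", "0")
distinct = len({n1 for n1, p, n2 in item_edges if attribute_labels.get(n2) == target_label})

print(per_rel_rows)
print("summed REL counts:", summed, "distinct entities:", distinct)
```

Here neither the per-row counts nor their sum equals the distinct count, because entity `c` reaches the label through both `x1` and `x2`.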

In [624]:
!kgtk query -i $OUT/$NAME.candidate_labels_rel_item.tsv -i $OUT/$NAME.entity_attribute_labels_string.tsv \
--graph-cache $STORE \
--match 'rel: (t1)-[l1 {label:p1, positives:num_pos}]->(v1), entity_attribute: (t2)-[l {entity:v1, label:p2}]->(v2)' \
--return 't1 as type, p1 as prop, t2 as value_type, p2 as value_prop, v2 as value_val, num_pos as positives, "_" as id' \
--order-by "kgtk_quantity_number_int(num_pos) desc" \
--limit 5 \
| column -t -s $'\t'

type      prop  value_type  value_prop  value_val       positives  id
Q131734   P452  Q8148       P373        "Beer brewing"  77         _
Q3624078  P530  Q3624078    P2258       "262"           67         _
Q3624078  P530  Q3624078    P2572       "Deutschland"   67         _
Q3624078  P530  Q3624078    P2979       "211"           67         _
Q3624078  P530  Q3624078    P2979       "218"           67         _


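This way should work. It counts distinct source entities directly from the item edges, and the top rows match the from-scratch query above.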

In [625]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $OUT/$NAME.type_mapping.tsv \
-i $OUT/$NAME.entity_attribute_labels_string.tsv --graph-cache $STORE \
--match 'item: (n1)-[l1 {label:p1}]->(n2), type: (n1)-[]->(t1), entity_attribute: (t2)-[l2 {label:p2, entity:n2}]->(val)' \
--return 't1 as type1, p1 as prop1, t2 as type2, p2 as prop2, val as value, count(distinct n1) as positives, "_" as id' \
--order-by "count(distinct n1) desc" \
--limit 5 \
| column -t -s $'\t'

type1     prop1  type2     prop2  value           positives  id
Q131734   P452   Q8148     P373   "Beer brewing"  77         _
Q3624078  P530   Q3624078  P3238  "0"             68         _
Q3624078  P530   Q4209223  P3238  "0"             68         _
Q3624078  P530   Q43702    P3238  "0"             68         _
Q3624078  P530   Q619610   P3238  "0"             68         _


Now do this for all attribute types.

Attribute value labels:

In [178]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $OUT/$NAME.type_mapping.tsv -i $DATA/$NAME.label.en.tsv \
-i $OUT/$NAME.entity_AVLs_all.tsv -o $OUT/$NAME.candidate_labels_ravl_temp.tsv --graph-cache $STORE \
--match 'item: (n1)-[l1 {label:p1}]->(n2), type: (n1)-[]->(t1), entity_AVLs: (t2)-[l2 {label:p2, entity:n2}]->(val), label: (p2)-[:label]->(lab2)' \
--return 't1 as type1, p1 as prop1, t2 as type2, p2 as prop2, lab2 as prop2_label, val as value, count(distinct n1) as positives, "_" as id' \
--order-by "count(distinct n1) desc" \
--where 'lab2.kgtk_lqstring_lang_suffix = "en"'

In [179]:
!kgtk rename-columns -i $OUT/$NAME.candidate_labels_ravl_temp.tsv -o $OUT/$NAME.candidate_labels_ravl_temp1.tsv \
--old-columns type1 prop1 type2 --new-columns node1 label node2 

In [180]:
!kgtk add-id -i $OUT/$NAME.candidate_labels_ravl_temp1.tsv \
-o $OUT/$NAME.candidate_labels_ravl.tsv --overwrite-id

In [181]:
!head $OUT/$NAME.candidate_labels_ravl.tsv | column -t -s $'\t'

node1     label  node2     prop2  prop2_label                   value           positives  id
Q131734   P452   Q8148     P373   'Commons category'@en         "Beer brewing"  77         E1
Q131734   P452   Q8148     P580   'start time'@en               -3500           77         E2
Q3624078  P530   Q3624078  P1081  'Human Development Index'@en  0.801           68         E3
Q3624078  P530   Q3624078  P1081  'Human Development Index'@en  0.809           68         E4
Q3624078  P530   Q3624078  P1081  'Human Development Index'@en  0.814           68         E5
Q3624078  P530   Q3624078  P1081  'Human Development Index'@en  0.824           68         E6
Q3624078  P530   Q3624078  P1081  'Human Development Index'@en  0.829           68         E7
Q3624078  P530   Q3624078  P1081  'Human Development Index'@en  0.83            68         E8
Q3624078  P530   Q3624078  P1081  'Human Development Index'@en  0.834           68         E9


Attribute interval labels:

In [173]:
!kgtk query -i $DATA/$NAME.part.wikibase-item.tsv -i $OUT/$NAME.type_mapping.tsv -i $DATA/$NAME.label.en.tsv \
-i $OUT/$NAME.entity_AILs_all.tsv -o $OUT/$NAME.candidate_labels_rail_temp.tsv --graph-cache $STORE \
--match 'item: (n1)-[l1 {label:p1}]->(n2), type: (n1)-[]->(t1), entity_AILs: (t2)-[l2 {label:p2, entity:n2, lower_bound:lb, upper_bound:ub, wd_units:wd, si_units:si}]->(val), label: (p2)-[:label]->(lab2)' \
--return 't1 as type1, p1 as prop1, t2 as type2, p2 as prop2, lab2 as prop2_label, si as si_units, wd as wd_units, lb as lower_bound, ub as upper_bound, count(distinct n1) as positives, "_" as id' \
--order-by "count(distinct n1) desc" \
--where 'lab2.kgtk_lqstring_lang_suffix = "en"'

In [174]:
!kgtk rename-columns -i $OUT/$NAME.candidate_labels_rail_temp.tsv -o $OUT/$NAME.candidate_labels_rail_temp1.tsv \
--old-columns type1 prop1 type2 --new-columns node1 label node2 

In [175]:
!kgtk add-id -i $OUT/$NAME.candidate_labels_rail_temp1.tsv \
-o $OUT/$NAME.candidate_labels_rail.tsv --overwrite-id

In [176]:
!head $OUT/$NAME.candidate_labels_rail.tsv | column -t -s $'\t'

node1    label  node2     prop2  prop2_label                                   si_units  wd_units            lower_bound  upper_bound  positives  id
Q131734  P452   Q8148     P580   'start time'@en                               77        E1
Q131734  P17    Q3624078  P1081  'Human Development Index'@en                  0.595     0.9405000000000001  70           E2
Q131734  P17    Q3624078  P1279  'inflation rate'@en                           Q11229    -5.6                42.7         70           E3
Q131734  P17    Q3624078  P2131  'nominal GDP'@en                              Q4917     800249412297.1975   70           E4
Q131734  P17    Q3624078  P2219  'real gross domestic product growth rate'@en  Q11229    -4.609999999999999  70           E5
Q131734  P17    Q3624078  P2250  'life expectancy'@en                          Q577      70.059205           83.084145    70           E6
Q131734  P17    Q3624078  P2299  'PPP GDP per capita'@en                       Q550207   31357.3805  