# Explore entities in input dataset
In this notebook, we take a quick look at the top attributes and relations among entities of a given type. This helps validate the input data and gives a sense of which attributes we can expect to use later when profiling entities of that type.

In [2]:
import os
import pandas as pd
from utility import run_command
from utility import rename_cols_and_overwrite_id
from label_discretization import discretize_labels

### Parameters
**Required**  
*item_file*: file path for the file that contains entity to entity relationships (e.g. wikibase-item)  
*time_file*: file path for the file that contains entity to time-type values  
*quantity_file*: file path for the file that contains entity to quantity-type values (if you ran the quantity-file trimming pre-processing step, point this at the trimmed file)  
*label_file*: file path for the file that contains wikidata labels  
*work_dir*: path to folder where files created by this notebook should be stored  
*store_dir*: path to folder containing the sqlite3.db file that we will use for our queries. We will reuse an existing file if there is one in this folder. Otherwise we will create a new one.

**Optional**    
*string_file*: file path for the file that contains entity to string-type values  
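
Because every later cell depends on these paths, it can help to check them before running any queries. The sketch below is one possible way to do that; `check_inputs` is a hypothetical helper that is not part of this notebook or its `utility` module, and the paths in the usage example are placeholders.

```python
import os

def check_inputs(required, optional=None):
    """Return a list of human-readable problems with the given path parameters.

    `required` and `optional` map parameter names to paths (hypothetical helper).
    """
    problems = []
    for name, path in required.items():
        if not path:
            problems.append("{} is required but not set".format(name))
        elif not os.path.exists(path):
            problems.append("{} not found: {}".format(name, path))
    # Optional parameters are only checked when a path was actually given
    for name, path in (optional or {}).items():
        if path and not os.path.exists(path):
            problems.append("{} not found: {}".format(name, path))
    return problems

# Placeholder paths, for illustration only
for problem in check_inputs(
    {"item_file": "data/claims.wikibase-item.tsv.gz", "label_file": "data/labels.en.tsv.gz"},
    {"string_file": None},
):
    print(problem)
```

An empty list means every path that was set exists; anything else is worth fixing before the longer-running queries start.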

In [84]:
data_dir = "../../wikidata_films/data" # my data files are all in the same directory, so I'll reuse this path prefix

# **REQUIRED**
item_file = "{}/claims.wikibase-item.tsv.gz".format(data_dir)
time_file = "{}/claims.time.tsv.gz".format(data_dir)
quantity_file = "{}/claims.quantity_trimmed.tsv.gz".format(data_dir)
label_file = "{}/labels.en.tsv.gz".format(data_dir)
work_dir = "../../wikidata_films/profiler_work"
store_dir = "../../wikidata_films"

# **OPTIONAL**
string_file = None  # e.g. "{}/claims.string.tsv.gz".format(data_dir)

### Process parameters and set up variables / file names

In [85]:
# Ensure paths are absolute
item_file = os.path.abspath(item_file)
time_file = os.path.abspath(time_file)
quantity_file = os.path.abspath(quantity_file)
label_file = os.path.abspath(label_file)
work_dir = os.path.abspath(work_dir)
store_dir = os.path.abspath(store_dir)
if string_file:
    string_file = os.path.abspath(string_file)
    
# Create directories (output_dir lives inside work_dir, so this creates both if needed)
output_dir = "{}/explore_entities".format(work_dir)
os.makedirs(output_dir, exist_ok=True)

# adding some environment variables we'll be using frequently
os.environ['ITEM_FILE'] = item_file
os.environ['TIME_FILE'] = time_file
os.environ['QUANTITY_FILE'] = quantity_file
os.environ['LABEL_FILE'] = label_file
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(store_dir)
os.environ['OUT'] = output_dir
os.environ['kgtk'] = "kgtk" # Need to do this for kgtk to be recognized as a command when passing it through a subprocess call

## Create type mapping
The mapping goes from each entity (a Q-node) to its type (another Q-node). For now it uses only P31 (instance of); P279* (subclass of, followed transitively) can be added later.

In [86]:
!kgtk query -i $ITEM_FILE \
-o $OUT/type_mapping.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (e)-[l {label:"P31"}]->(type)' \
--return 'distinct l as id, e as node1, l.label as label, type as node2'



In [6]:
!head -5 $OUT/type_mapping.tsv | column -t -s $'\t'

id                              node1  label  node2
P10-P31-Q18610173-85ef4d24-0    P10    P31    Q18610173
P1001-P31-Q15720608-deeedec9-0  P1001  P31    Q15720608
P1001-P31-Q22984026-8beb0cfe-0  P1001  P31    Q22984026
P1001-P31-Q22997934-1e5b1a96-0  P1001  P31    Q22997934
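
For readers less familiar with Kypher, the query above keeps only the P31 edges and writes them out in the `id`, `node1`, `label`, `node2` layout. The following is a rough pandas equivalent on a tiny made-up edge table; the edge ids and Q-nodes are hypothetical, and `build_type_mapping` is an illustrative name, not a function from this project.

```python
import pandas as pd

# Hypothetical toy edges in the KGTK column layout (id, node1, label, node2)
edges = pd.DataFrame(
    [
        ("e1", "Q1", "P31", "Q11424"),
        ("e2", "Q1", "P57", "Q2"),
        ("e3", "Q2", "P31", "Q5"),
        ("e4", "Q3", "P279", "Q11424"),
    ],
    columns=["id", "node1", "label", "node2"],
)

def build_type_mapping(edges):
    """Keep only P31 (instance of) edges, one row per distinct edge."""
    mapping = edges[edges["label"] == "P31"]
    return mapping.drop_duplicates().reset_index(drop=True)

type_mapping = build_type_mapping(edges)
print(type_mapping)
```

On real data the Kypher query is the better choice, since the claims file may not fit comfortably in memory; the sketch is only meant to show what the mapping contains.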


## Look at top attributes amongst entities within a type

In [77]:
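# Change the Q-node type here to look at attributes of entities of a different type.
# Note: the example outputs shown below list Q282 (wine) entities, so they appear to come from a run with that type selected.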
# type_to_profile = "Q44" # beer
# type_to_profile = "Q282" # wine
# type_to_profile = "Q154" # alcoholic beverage
type_to_profile = "Q11424" # film
os.environ["TYPE"] = type_to_profile

Number of entities of this type:

In [78]:
!kgtk filter -p " ; P31 ; $TYPE " -i $OUT/type_mapping.tsv | wc -l | awk '{print $1-1}'

2276


Some examples:

In [79]:
!kgtk filter -p " ; P31 ; $TYPE " -i $OUT/type_mapping.tsv | head | column -t -s $'\t'

id                              node1       label  node2
Q100722103-P31-Q282-e63e3347-0  Q100722103  P31    Q282
Q1032955-P31-Q282-2eca911b-0    Q1032955    P31    Q282
Q1054633-P31-Q282-00df35b1-0    Q1054633    P31    Q282
Q1055673-P31-Q282-c2c5de60-0    Q1055673    P31    Q282
Q1056600-P31-Q282-84f232f2-0    Q1056600    P31    Q282
Q1056603-P31-Q282-e878da1c-0    Q1056603    P31    Q282
Q10728966-P31-Q282-df0316eb-0   Q10728966   P31    Q282
Q1109383-P31-Q282-1d872982-0    Q1109383    P31    Q282
Q1109394-P31-Q282-d6f71779-0    Q1109394    P31    Q282


Number of these entities that have an English label:

In [80]:
!kgtk query -i $OUT/type_mapping.tsv -i $LABEL_FILE \
--graph-cache $STORE \
--match 'type: (e)-[]->(:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (e)-[:label]->(e_lab)' \
--return 'e as entity, e_lab as entity_label' \
--where 'e_lab.kgtk_lqstring_lang_suffix = "en"' \
| wc -l | awk '{print $1-1}'

205
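
The `kgtk_lqstring_lang_suffix` filter used above reads the language tag of a language-qualified string such as `'area'@en`. As a minimal sketch of that idea in plain Python (the helper name `split_lqstring` is hypothetical, and the sketch does not handle escaped characters inside the quoted text):

```python
def split_lqstring(value):
    """Split a language-qualified string like 'area'@en into (text, lang).

    Returns (value, None) when the value does not have that shape.
    """
    text, sep, lang = value.rpartition("@")
    quoted = len(text) >= 2 and text.startswith("'") and text.endswith("'")
    if not sep or not quoted:
        return value, None
    return text[1:-1], lang

text, lang = split_lqstring("'alcohol by volume'@en")
print(text, lang)
```

Splitting on the last `@` keeps any `@` inside the quoted text intact.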


Top quantity attributes amongst entities of this type:

In [81]:
!kgtk query -i $QUANTITY_FILE -i $LABEL_FILE -i $OUT/type_mapping.tsv --graph-cache $STORE \
--limit 10 \
--match '`'"$QUANTITY_FILE"'`: (e)-[l {label:prop}]->(), type: (e)-[]->(:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (prop)-[:label]->(prop_lab)' \
--return 'distinct prop as prop, prop_lab as prop_label, count(distinct e) as ents_with_this_prop' \
--where 'prop_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'ents_with_this_prop desc' \
| column -t -s $'\t'

prop   prop_label              ents_with_this_prop
P2665  'alcohol by volume'@en  13
P2046  'area'@en               1


Top time attributes amongst entities of this type:

In [82]:
!kgtk query -i $TIME_FILE -i $LABEL_FILE -i $OUT/type_mapping.tsv --graph-cache $STORE \
--limit 10 \
--match '`'"$TIME_FILE"'`: (e)-[l {label:prop}]->(), type: (e)-[]->(:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (prop)-[:label]->(prop_lab)' \
--return 'distinct prop as prop, prop_lab as prop_label, count(distinct e) as ents_with_this_prop' \
--where 'prop_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'ents_with_this_prop desc' \
| column -t -s $'\t'

prop  prop_label      ents_with_this_prop
P571  'inception'@en  74


Top relations amongst entities of this type:

In [83]:
!kgtk query -i $ITEM_FILE -i $LABEL_FILE -i $OUT/type_mapping.tsv --graph-cache $STORE \
--limit 10 \
--match '`'"$ITEM_FILE"'`: (e)-[l {label:prop}]->(), type: (e)-[]->(:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (prop)-[:label]->(prop_lab)' \
--return 'distinct prop as prop, prop_lab as prop_label, count(distinct e) as ents_with_this_prop' \
--where 'prop_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'ents_with_this_prop desc' \
| column -t -s $'\t'

prop   prop_label                                             ents_with_this_prop
P31    'instance of'@en                                       2276
P279   'subclass of'@en                                       2144
P17    'country'@en                                           184
P1389  'product certification'@en                             159
P131   'located in the administrative territorial entity'@en  50
P1071  'location of creation'@en                              17
P462   'color'@en                                             17
P176   'manufacturer'@en                                      12
P495   'country of origin'@en                                 11
P366   'use'@en                                               8
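
The three "top ..." queries above share one shape: restrict the claims to entities of the chosen type, count distinct entities per property, attach the property label, and sort in descending order. Below is a small pandas sketch of that logic on made-up data. The function name `top_properties` and all rows are hypothetical, and the sketch assumes the label table already contains only English labels, so it skips the language filter.

```python
import pandas as pd

def top_properties(claims, type_mapping, labels, type_qnode, limit=10):
    """Count distinct entities of `type_qnode` per property, most common first."""
    # Entities whose type edge points at the requested Q-node
    ents = type_mapping.loc[type_mapping["node2"] == type_qnode, "node1"].unique()
    typed = claims[claims["node1"].isin(ents)]
    counts = (
        typed.groupby("label")["node1"].nunique()
        .rename("ents_with_this_prop")
        .reset_index()
        .rename(columns={"label": "prop"})
    )
    # Inner join mirrors the query: properties without a label are dropped
    lab = labels.rename(columns={"node1": "prop", "node2": "prop_label"})
    out = counts.merge(lab[["prop", "prop_label"]], on="prop", how="inner")
    out = out.sort_values(["ents_with_this_prop", "prop"], ascending=[False, True])
    return out[["prop", "prop_label", "ents_with_this_prop"]].head(limit).reset_index(drop=True)

# Hypothetical toy data in the KGTK node1/label/node2 layout
claims = pd.DataFrame(
    [("Q1", "P2047", "+90"), ("Q1", "P2047", "+95"), ("Q2", "P2047", "+120"),
     ("Q2", "P2142", "+1000"), ("Q9", "P2047", "+5")],
    columns=["node1", "label", "node2"],
)
type_mapping = pd.DataFrame(
    [("Q1", "P31", "Q11424"), ("Q2", "P31", "Q11424"), ("Q9", "P31", "Q5")],
    columns=["node1", "label", "node2"],
)
labels = pd.DataFrame(
    [("P2047", "label", "'duration'@en"), ("P2142", "label", "'box office'@en")],
    columns=["node1", "label", "node2"],
)

result = top_properties(claims, type_mapping, labels, "Q11424")
print(result)
```

Counting distinct entities rather than rows matters here: an entity with two values for the same property should still count once, which is what `count(distinct e)` does in the queries.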
