# Explore entities in input dataset
In this notebook, we take a quick look at the top attributes and relations among entities of a given type. This helps validate the input data and gives a sense of which attributes we can expect to use later when profiling entities of that type.

In [2]:
import os
import pandas as pd
from utility import run_command
from utility import rename_cols_and_overwrite_id
from label_discretization import discretize_labels

### Parameters
**Required**  
*item_file*: file path for the file that contains entity to entity relationships (e.g. wikibase-item)  
*time_file*: file path for the file that contains entity to time-type values  
*quantity_file*: file path for the file that contains entity to quantity-type values (use the trimmed file if you ran the quantity-file trimming pre-processing step)  
*label_file*: file path for the file that contains wikidata labels  
*work_dir*: path to folder where files created by this notebook should be stored  
*store_dir*: path to folder containing the sqlite3.db file that we will use for our queries. We will reuse an existing file if there is one in this folder. Otherwise we will create a new one.

**Optional**    
*string_file*: file path for the file that contains entity to string-type values  

In [18]:
data_dir = "../../wikidata_films/data" # my data files are all in the same directory, so I'll reuse this path prefix

# **REQUIRED**
item_file = "{}/claims.wikibase-item.tsv.gz".format(data_dir)
time_file = "{}/claims.time.tsv.gz".format(data_dir)
quantity_file = "{}/claims.quantity_trimmed.tsv.gz".format(data_dir)
label_file = "{}/labels.en.tsv.gz".format(data_dir)
work_dir = "../../wikidata_films/profiler_work"
store_dir = "../../wikidata_films"

# **OPTIONAL**
string_file = "{}/claims.string.tsv.gz".format(data_dir)

### Process parameters and set up variables / file names

In [19]:
# Ensure paths are absolute
item_file = os.path.abspath(item_file)
time_file = os.path.abspath(time_file)
quantity_file = os.path.abspath(quantity_file)
label_file = os.path.abspath(label_file)
work_dir = os.path.abspath(work_dir)
store_dir = os.path.abspath(store_dir)
if string_file:
    string_file = os.path.abspath(string_file)
    
# Create the working and output directories (no-op if they already exist)
os.makedirs(work_dir, exist_ok=True)
output_dir = os.path.join(work_dir, "explore_entities")
os.makedirs(output_dir, exist_ok=True)

# adding some environment variables we'll be using frequently
os.environ['ITEM_FILE'] = item_file
os.environ['TIME_FILE'] = time_file
os.environ['QUANTITY_FILE'] = quantity_file
os.environ['LABEL_FILE'] = label_file
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(store_dir)
os.environ['OUT'] = output_dir
os.environ['kgtk'] = "kgtk" # Need to do this for kgtk to be recognized as a command when passing it through a subprocess call
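Before running any queries, it can help to confirm that the required input files actually exist, since a wrong path otherwise only surfaces later as a kgtk error. The following is a minimal sketch; `find_missing_files` is a hypothetical helper and not part of this notebook's `utility` module.

```python
import os

def find_missing_files(paths):
    """Return the names (sorted) whose path does not point to an existing file.

    `paths` maps a parameter name (e.g. "item_file") to its file path.
    Hypothetical helper for illustration; not part of the notebook's utilities.
    """
    return sorted(name for name, path in paths.items() if not os.path.isfile(path))

# Possible usage inside the notebook, with the variables defined above:
# missing = find_missing_files({
#     "item_file": item_file,
#     "time_file": time_file,
#     "quantity_file": quantity_file,
#     "label_file": label_file,
# })
# if missing:
#     raise FileNotFoundError("Missing required input files: " + ", ".join(missing))
```

The optional `string_file` is left out of the check on purpose, because the notebook already skips the string query when it is not provided.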

## Create type mapping
The mapping goes from each entity (Q-node) to its type (another Q-node). For now we use only P31 (instance of) edges; P279* (subclass of, followed transitively) could be added later.

In [5]:
!kgtk query -i $ITEM_FILE \
-o $OUT/type_mapping.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (e)-[l {label:"P31"}]->(type)' \
--return 'distinct l as id, e as node1, l.label as label, type as node2'

In [6]:
!head -5 $OUT/type_mapping.tsv | column -t -s $'\t'

id                              node1  label  node2
P10-P31-Q18610173-85ef4d24-0    P10    P31    Q18610173
P1000-P31-Q18608871-093affb5-0  P1000  P31    Q18608871
P1001-P31-Q15720608-deeedec9-0  P1001  P31    Q15720608
P1001-P31-Q22984026-8beb0cfe-0  P1001  P31    Q22984026
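To make the content of `type_mapping.tsv` concrete, here is a rough pure-Python sketch of the same filtering idea: keep only the edges whose label is P31 and read them as (entity, type) pairs. It runs on a tiny inline sample rather than the real file, does not deduplicate the way the `distinct` in the query does, and `build_type_mapping` is a hypothetical name used only for illustration.

```python
import csv
import io

# Illustrative edges in KGTK column order. The two P31 rows are copied from the
# outputs in this notebook; the P57 row is made up to show a non-type edge.
SAMPLE_EDGES = (
    "id\tnode1\tlabel\tnode2\n"
    "P10-P31-Q18610173-85ef4d24-0\tP10\tP31\tQ18610173\n"
    "example-edge-0\tQ1000094\tP57\tQ_EXAMPLE_DIRECTOR\n"
    "Q1000094-P31-Q11424-63dc6868-0\tQ1000094\tP31\tQ11424\n"
)

def build_type_mapping(tsv_text, type_property="P31"):
    """Return (entity, type) pairs for every edge labeled with type_property."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row["node1"], row["node2"]) for row in reader if row["label"] == type_property]

type_mapping = build_type_mapping(SAMPLE_EDGES)
print(type_mapping)  # [('P10', 'Q18610173'), ('Q1000094', 'Q11424')]
```

Note that properties such as P10 also carry P31 statements, which is why P-nodes show up as `node1` in the first rows of the mapping.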


## Look at top attributes amongst entities within a type

In [7]:
# Change the Q-node type here to look at attributes of entities of a different type
# type_to_profile = "Q44" # beer
# type_to_profile = "Q282" # wine
# type_to_profile = "Q154" # alcoholic beverage
type_to_profile = "Q11424" # film
os.environ["TYPE"] = type_to_profile

Number of entities of this type:

In [8]:
!kgtk filter -p " ; P31 ; $TYPE " -i $OUT/type_mapping.tsv | wc -l | awk '{print $1-1}'

176668


Some examples:

In [9]:
!kgtk filter -p " ; P31 ; $TYPE " -i $OUT/type_mapping.tsv | head | column -t -s $'\t'

id                                node1       label  node2
Q1000094-P31-Q11424-63dc6868-0    Q1000094    P31    Q11424
Q1000174-P31-Q11424-43e4b2fb-0    Q1000174    P31    Q11424
Q1000394-P31-Q11424-2b44b6cc-0    Q1000394    P31    Q11424
Q10005695-P31-Q11424-d652d3ab-0   Q10005695   P31    Q11424
Q100057542-P31-Q11424-5f3fc57c-0  Q100057542  P31    Q11424
Q10007277-P31-Q11424-2a961c75-0   Q10007277   P31    Q11424
Q1000825-P31-Q11424-a932e4d2-0    Q1000825    P31    Q11424
Q1000826-P31-Q11424-8c07b338-0    Q1000826    P31    Q11424
Q100097551-P31-Q11424-0d5dad84-0  Q100097551  P31    Q11424


Number of these entities that have an English label:

In [10]:
!kgtk query -i $OUT/type_mapping.tsv -i $LABEL_FILE \
--graph-cache $STORE \
--match 'type: (e)-[]->(:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (e)-[:label]->(e_lab)' \
--return 'e as entity, e_lab as entity_label' \
--where 'e_lab.kgtk_lqstring_lang_suffix = "en"' \
| wc -l | awk '{print $1-1}'

176406
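Comparing the two counts gives a quick read on label coverage. The sketch below simply plugs in the numbers printed by the cells above; it assumes each entity contributes exactly one row to the label query, which may not hold if an entity has several English labels or several matching type edges.

```python
# Counts copied from the outputs above (film, Q11424).
total_entities = 176668    # entities of the selected type
labeled_entities = 176406  # rows returned by the English-label query

unlabeled = total_entities - labeled_entities
coverage = labeled_entities / total_entities

print(f"{unlabeled} entities lack an English label ({coverage:.2%} coverage)")
# prints: 262 entities lack an English label (99.85% coverage)
```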


Top quantity attributes amongst entities of this type:

In [13]:
!kgtk query -i $QUANTITY_FILE -i $LABEL_FILE -i $OUT/type_mapping.tsv --graph-cache $STORE \
--limit 10 \
--match '`'"$QUANTITY_FILE"'`: (e)-[l {label:prop}]->(), type: (e)-[]->(:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (prop)-[:label]->(prop_lab)' \
--return 'distinct prop as prop, prop_lab as prop_label, count(distinct e) as ents_with_this_prop' \
--where 'prop_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'ents_with_this_prop desc' \
| column -t -s $'\t'

prop   prop_label                         ents_with_this_prop
P2047  'duration'@en                      59289
P2130  'cost'@en                          2465
P2142  'box office'@en                    1692
P1113  'number of episodes'@en            376
P2769  'budget'@en                        181
P1110  'attendance'@en                    104
P2437  'number of seasons'@en             73
P1104  'number of pages'@en               63
P2635  'number of parts of this work'@en  17
P2043  'length'@en                        8
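The `count(distinct e)` in the query counts each entity at most once per property, even if it has several values for that property. A small pure-Python sketch of the same aggregation follows; the sample edges and the `top_properties` helper are made up for illustration, and ties are broken by property ID here, which is a choice of this sketch rather than something the query guarantees.

```python
from collections import defaultdict

def top_properties(edges, entity_types, target_type, limit=10):
    """Count distinct entities of target_type per property, most common first.

    edges: iterable of (node1, label, node2) tuples.
    entity_types: dict mapping an entity to the set of its types.
    """
    ents_by_prop = defaultdict(set)
    for node1, prop, _ in edges:
        if target_type in entity_types.get(node1, ()):
            ents_by_prop[prop].add(node1)  # a set, so repeated values count once
    counts = [(prop, len(ents)) for prop, ents in ents_by_prop.items()]
    counts.sort(key=lambda pc: (-pc[1], pc[0]))
    return counts[:limit]

# Made-up sample: Q1 has two duration values, Q3 is not a film.
sample_edges = [
    ("Q1", "P2047", "90"),
    ("Q1", "P2047", "95"),
    ("Q2", "P2047", "120"),
    ("Q2", "P2142", "1000"),
    ("Q3", "P2047", "30"),
]
sample_types = {"Q1": {"Q11424"}, "Q2": {"Q11424"}, "Q3": {"Q5"}}

print(top_properties(sample_edges, sample_types, "Q11424"))
# [('P2047', 2), ('P2142', 1)]
```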


Top time attributes amongst entities of this type:

In [14]:
!kgtk query -i $TIME_FILE -i $LABEL_FILE -i $OUT/type_mapping.tsv --graph-cache $STORE \
--limit 10 \
--match '`'"$TIME_FILE"'`: (e)-[l {label:prop}]->(), type: (e)-[]->(:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (prop)-[:label]->(prop_lab)' \
--return 'distinct prop as prop, prop_lab as prop_label, count(distinct e) as ents_with_this_prop' \
--where 'prop_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'ents_with_this_prop desc' \
| column -t -s $'\t'

prop   prop_label                      ents_with_this_prop
P577   'publication date'@en           172408
P1191  'date of first performance'@en  537
P580   'start time'@en                 343
P585   'point in time'@en              233
P571   'inception'@en                  223
P582   'end time'@en                   207
P2754  'production date'@en            90
P2913  'date depicted'@en              22
P3893  'public domain date'@en         11
P2031  'work period (start)'@en        7


Top relations amongst entities of this type:

In [15]:
!kgtk query -i $ITEM_FILE -i $LABEL_FILE -i $OUT/type_mapping.tsv --graph-cache $STORE \
--limit 10 \
--match '`'"$ITEM_FILE"'`: (e)-[l {label:prop}]->(), type: (e)-[]->(:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (prop)-[:label]->(prop_lab)' \
--return 'distinct prop as prop, prop_lab as prop_label, count(distinct e) as ents_with_this_prop' \
--where 'prop_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'ents_with_this_prop desc' \
| column -t -s $'\t'

prop  prop_label                                 ents_with_this_prop
P31   'instance of'@en                           176667
P495  'country of origin'@en                     166166
P364  'original language of film or TV show'@en  143025
P136  'genre'@en                                 140307
P57   'director'@en                              136285
P161  'cast member'@en                           118539
P462  'color'@en                                 107132
P58   'screenwriter'@en                          77204
P86   'composer'@en                              70155
P344  'director of photography'@en               52926


Top string attributes amongst entities of this type:

In [20]:
if not string_file:
    print("No string attribute file was provided in the parameters section, skipping this step.")
else:
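    # Run the same top-attributes query against the string file. The STRING_FILE,
    # LABEL_FILE and TYPE placeholders are expected to be filled in by run_command
    # from the dict passed as its second argument.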
    command = "$kgtk query -i STRING_FILE -i LABEL_FILE -i $OUT/type_mapping.tsv --graph-cache $STORE \
               --limit 10 \
               --match '`STRING_FILE`: (e)-[l {label:prop}]->(), type: (e)-[]->(:`TYPE`), `LABEL_FILE`: (prop)-[:label]->(prop_lab)' \
               --return 'distinct prop as prop, prop_lab as prop_label, count(distinct e) as ents_with_this_prop' \
               --where 'prop_lab.kgtk_lqstring_lang_suffix = \"en\"' \
               --order-by 'ents_with_this_prop desc' \
               | column -t -s $'\t'"
    run_command(command, {"STRING_FILE" : string_file, "LABEL_FILE" : label_file, "TYPE" : type_to_profile})

prop   prop_label                           ents_with_this_prop
P373   'Commons category'@en                7064
P217   'inventory number'@en                108
P444   'review score'@en                    90
P935   'Commons gallery'@en                 71
P1814  'name in kana'@en                    59
P2572  'hashtag'@en                         39
P2093  'author name string'@en              32
P2364  'production code'@en                 28
P2001  'Revised Romanization'@en            26
P1942  'McCune-Reischauer romanization'@en  13

