# Explore entities in input dataset
In this notebook, we take a quick look at the top attributes and relations among entities of a given type. This helps validate the input data and gives a sense of which attributes we can expect to use later when profiling entities of that type.

In [5]:
import os
import pandas as pd
from utility import run_command
from utility import rename_cols_and_overwrite_id

### Parameters
**Required**  
*item_file*: file path for the file that contains entity to entity relationships (e.g. wikibase-item)  
*time_file*: file path for the file that contains entity to time-type values  
*quantity_file*: file path for the file that contains entity to quantity-type values (remember to specify the trimmed file if you did the quantity file trimming pre-processing step).  
*label_file*: file path for the file that contains wikidata labels  
*work_dir*: path to folder where files created by this notebook should be stored  
*store_dir*: path to folder containing the sqlite3.db file that we will use for our queries. We will reuse an existing file if there is one in this folder. Otherwise we will create a new one.

**Optional**    
*string_file*: file path for the file that contains entity to string-type values  

In [6]:
data_dir = "../../wikidata_humans/data" # my data files are all in the same directory, so I'll reuse this path prefix

# **REQUIRED**
item_file = "{}/claims.wikibase-item.tsv.gz".format(data_dir)
time_file = "{}/claims.time.tsv.gz".format(data_dir)
quantity_file = "{}/claims.quantity.tsv.gz".format(data_dir)
label_file = "{}/labels.en.tsv.gz".format(data_dir)
work_dir = "../../wikidata_humans/profiler_work"
store_dir = "../../wikidata_humans"

# **OPTIONAL** (set to None to skip the string-attribute step)
string_file = "{}/claims.string.tsv.gz".format(data_dir)

### Process parameters and set up variables / file names

In [7]:
# Ensure paths are absolute
item_file = os.path.abspath(item_file)
time_file = os.path.abspath(time_file)
quantity_file = os.path.abspath(quantity_file)
label_file = os.path.abspath(label_file)
work_dir = os.path.abspath(work_dir)
store_dir = os.path.abspath(store_dir)
if string_file:
    string_file = os.path.abspath(string_file)
    
# Create directories (makedirs also creates work_dir as a parent if needed)
output_dir = "{}/explore_entities".format(work_dir)
os.makedirs(output_dir, exist_ok=True)

# adding some environment variables we'll be using frequently
os.environ['ITEM_FILE'] = item_file
os.environ['TIME_FILE'] = time_file
os.environ['QUANTITY_FILE'] = quantity_file
os.environ['LABEL_FILE'] = label_file
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(store_dir)
os.environ['OUT'] = output_dir
os.environ['kgtk'] = "kgtk" # Need to do this for kgtk to be recognized as a command when passing it through a subprocess call
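Since the queries below fail late if an input path is wrong, it can help to check the required files up front. The following is a minimal sketch, not part of the original notebook: `find_missing_inputs` is a hypothetical helper, and the paths in the usage example are placeholders rather than the real data files.

```python
import os

def find_missing_inputs(paths):
    """Return the names of inputs whose path does not exist on disk.

    `paths` maps a parameter name to a file or directory path.
    Entries with an empty or None path (e.g. an unset optional file) are skipped.
    """
    return [name for name, path in paths.items()
            if path and not os.path.exists(path)]

# Placeholder inputs for illustration only. In the notebook you would pass
# item_file, time_file, quantity_file, label_file and (optionally) string_file.
example_inputs = {
    "item_file": "/nonexistent/claims.wikibase-item.tsv.gz",
    "current_dir": os.getcwd(),
    "string_file": None,
}
missing = find_missing_inputs(example_inputs)
print("Missing inputs:", missing)
```

Raising an error when the returned list is non-empty would stop the notebook before any long-running query starts.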

## Create type mapping
The mapping goes from an entity (Q node) to the entity's type (another Q node).  
To define type, we can use P31 (instance of) or P279* (subclass of, followed transitively). We may use other properties depending on what we want to profile (e.g. the property that defines whether an entity is a politician is P106, occupation).

In [8]:
type_defining_property = "P31"
os.environ["TYPE_PROP"] = type_defining_property

In [9]:
!kgtk query -i $ITEM_FILE \
-o $OUT/type_mapping.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (e)-[l {label:"'"$TYPE_PROP"'"}]->(type)' \
--return 'distinct l as id, e as node1, l.label as label, type as node2'

In [10]:
!head -5 $OUT/type_mapping.tsv | column -t -s $'\t'

id                              node1  label  node2
P10-P31-Q18610173-85ef4d24-0    P10    P31    Q18610173
P1000-P31-Q18608871-093affb5-0  P1000  P31    Q18608871
P1001-P31-Q15720608-deeedec9-0  P1001  P31    Q15720608
P1001-P31-Q22984026-8beb0cfe-0  P1001  P31    Q22984026
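
The sample above already shows that one entity can map to more than one type (P1001 appears twice). As a rough sketch of how to inspect this with pandas, the snippet below parses the four sample rows from an in-memory string; on real data you would read `type_mapping.tsv` from the output directory instead, which is assumed here to fit in memory.

```python
import io
import pandas as pd

# The four rows printed above, reproduced as a tab-separated string.
sample_tsv = (
    "id\tnode1\tlabel\tnode2\n"
    "P10-P31-Q18610173-85ef4d24-0\tP10\tP31\tQ18610173\n"
    "P1000-P31-Q18608871-093affb5-0\tP1000\tP31\tQ18608871\n"
    "P1001-P31-Q15720608-deeedec9-0\tP1001\tP31\tQ15720608\n"
    "P1001-P31-Q22984026-8beb0cfe-0\tP1001\tP31\tQ22984026\n"
)
type_mapping = pd.read_csv(io.StringIO(sample_tsv), sep="\t", dtype=str)

# Count distinct types per entity and keep the entities with more than one.
types_per_entity = type_mapping.groupby("node1")["node2"].nunique()
multi_typed = types_per_entity[types_per_entity > 1]
print(multi_typed)
```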


## Look at top attributes amongst entities within a type

In [11]:
# Change the Q-node type here to look at attributes of entities of a different type
# type_to_profile = "Q44" # beer
# type_to_profile = "Q282" # wine
# type_to_profile = "Q154" # alcoholic beverage
# type_to_profile = "Q11424" # film
# type_to_profile = "Q82955" # politician
type_to_profile = "Q5" # human
os.environ["TYPE"] = type_to_profile

Number of entities of this type:

In [12]:
!kgtk filter -p " ; $TYPE_PROP ; $TYPE " -i $OUT/type_mapping.tsv | wc -l | awk '{print $1-1}'

7984232


Some examples:

In [13]:
!kgtk filter -p " ; $TYPE_PROP ; $TYPE " -i $OUT/type_mapping.tsv | head | column -t -s $'\t'

id                            node1       label  node2
Q10000001-P31-Q5-cc1c4199-0   Q10000001   P31    Q5
Q1000002-P31-Q5-5c9914ea-0    Q1000002    P31    Q5
Q1000005-P31-Q5-4d5e2a2b-0    Q1000005    P31    Q5
Q1000006-P31-Q5-38728290-0    Q1000006    P31    Q5
Q100000811-P31-Q5-ea39ff32-0  Q100000811  P31    Q5
Q100000814-P31-Q5-183f7bff-0  Q100000814  P31    Q5
Q100000817-P31-Q5-7b91c1ab-0  Q100000817  P31    Q5
Q100000831-P31-Q5-e8c92610-0  Q100000831  P31    Q5
Q100000832-P31-Q5-911e44ee-0  Q100000832  P31    Q5


Number of these entities that have an English label:

In [14]:
!kgtk query -i $OUT/type_mapping.tsv -i $LABEL_FILE \
--graph-cache $STORE \
--match 'type: (e)-[]->(:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (e)-[:label]->(e_lab)' \
--return 'e as entity, e_lab as entity_label' \
--where 'e_lab.kgtk_lqstring_lang_suffix = "en"' \
| wc -l | awk '{print $1-1}'

7958993
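
Comparing this count with the total number of entities of the type gives the label coverage. The arithmetic below uses the two counts printed in this notebook and assumes each entity contributes at most one row to the label query.

```python
# Counts copied from the two cell outputs above.
total_entities = 7984232
entities_with_en_label = 7958993

missing_label = total_entities - entities_with_en_label
coverage = entities_with_en_label / total_entities
print(f"{missing_label} entities lack an English label ({coverage:.2%} coverage)")
```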


Top quantity attributes amongst entities of this type:

In [15]:
!kgtk query -i $QUANTITY_FILE -i $LABEL_FILE -i $OUT/type_mapping.tsv --graph-cache $STORE \
--limit 10 \
--match '`'"$QUANTITY_FILE"'`: (e)-[l {label:prop}]->(), type: (e)-[]->(:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (prop)-[:label]->(prop_lab)' \
--return 'distinct prop as prop, prop_lab as prop_label, count(distinct e) as ents_with_this_prop' \
--where 'prop_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'ents_with_this_prop desc' \
| column -t -s $'\t'

prop   prop_label                                  ents_with_this_prop
P2048  'height'@en                                 154831
P2067  'mass'@en                                   126183
P1971  'number of children'@en                     20348
P1087  'Elo rating'@en                             16976
P1350  'number of matches played/races/starts'@en  10705
P6509  'total goals in career'@en                  10693
P6544  'total points in career'@en                 7079
P6546  'penalty minutes in career'@en              7078
P6545  'total assists in career'@en                7077
P6543  'total shots in career'@en                  6021
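
The query above restricts the claims to entities of the chosen type, then counts distinct entities per property. The pandas sketch below mirrors that logic on a tiny invented dataset; the entity IDs and values are made up for illustration, and the label join is left out for brevity.

```python
import pandas as pd

# Invented type mapping: Q1 and Q2 are humans (Q5), Q3 is a film (Q11424).
type_mapping = pd.DataFrame({
    "node1": ["Q1", "Q2", "Q3"],
    "label": ["P31", "P31", "P31"],
    "node2": ["Q5", "Q5", "Q11424"],
})

# Invented quantity claims; Q1 has two P2048 claims, which should count once.
claims = pd.DataFrame({
    "node1": ["Q1", "Q1", "Q2", "Q3", "Q1"],
    "label": ["P2048", "P2067", "P2048", "P2048", "P2048"],
    "node2": ["180", "70", "170", "2", "181"],
})

# Keep only claims whose subject is of the profiled type.
entities_of_type = type_mapping.loc[type_mapping["node2"] == "Q5", "node1"]
claims_of_type = claims[claims["node1"].isin(entities_of_type)]

# Count distinct entities per property, most common first.
top_props = (
    claims_of_type.groupby("label")["node1"]
    .nunique()
    .sort_values(ascending=False)
    .rename("ents_with_this_prop")
)
print(top_props)
```

Counting distinct entities rather than rows keeps a property with many values per entity from being overstated.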


Top time attributes amongst entities of this type:

In [16]:
!kgtk query -i $TIME_FILE -i $LABEL_FILE -i $OUT/type_mapping.tsv --graph-cache $STORE \
--limit 10 \
--match '`'"$TIME_FILE"'`: (e)-[l {label:prop}]->(), type: (e)-[]->(:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (prop)-[:label]->(prop_lab)' \
--return 'distinct prop as prop, prop_lab as prop_label, count(distinct e) as ents_with_this_prop' \
--where 'prop_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'ents_with_this_prop desc' \
| column -t -s $'\t'

prop   prop_label                               ents_with_this_prop
P569   'date of birth'@en                       4255892
P570   'date of death'@en                       2127385
P2031  'work period (start)'@en                 218548
P2032  'work period (end)'@en                   27567
P1317  'floruit'@en                             20175
P1636  'date of baptism in early childhood'@en  2679
P4602  'date of burial or cremation'@en         1906
P746   'date of disappearance'@en               874
P585   'point in time'@en                       210
P580   'start time'@en                          128


Top relations amongst entities of this type:

In [17]:
!kgtk query -i $ITEM_FILE -i $LABEL_FILE -i $OUT/type_mapping.tsv --graph-cache $STORE \
--limit 10 \
--match '`'"$ITEM_FILE"'`: (e)-[l {label:prop}]->(), type: (e)-[]->(:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (prop)-[:label]->(prop_lab)' \
--return 'distinct prop as prop, prop_lab as prop_label, count(distinct e) as ents_with_this_prop' \
--where 'prop_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'ents_with_this_prop desc' \
| column -t -s $'\t'

prop  prop_label                   ents_with_this_prop
P31   'instance of'@en             7984212
P21   'sex or gender'@en           5984186
P106  'occupation'@en              5767333
P735  'given name'@en              4790117
P27   'country of citizenship'@en  3293085
P734  'family name'@en             2511540
P19   'place of birth'@en          2431392
P69   'educated at'@en             1307297
P641  'sport'@en                   932865
P20   'place of death'@en          928957


Top string attributes amongst entities of this type:

In [18]:
if not string_file:
    print("No string attribute file was provided in the parameters section, skipping this step.")
else:
    # perform query
    command = "$kgtk query -i STRING_FILE -i LABEL_FILE -i $OUT/type_mapping.tsv --graph-cache $STORE \
               --limit 10 \
               --match '`STRING_FILE`: (e)-[l {label:prop}]->(), type: (e)-[]->(:`TYPE`), `LABEL_FILE`: (prop)-[:label]->(prop_lab)' \
               --return 'distinct prop as prop, prop_lab as prop_label, count(distinct e) as ents_with_this_prop' \
               --where 'prop_lab.kgtk_lqstring_lang_suffix = \"en\"' \
               --order-by 'ents_with_this_prop desc' \
               | column -t -s $'\t'"
    run_command(command, {"STRING_FILE" : string_file, "LABEL_FILE" : label_file, "TYPE" : type_to_profile})

prop   prop_label                           ents_with_this_prop
P373   'Commons category'@en                515460
P1814  'name in kana'@en                    127400
P1472  'Commons Creator page'@en            45196
P742   'pseudonym'@en                       33807
P1618  'sport number'@en                    24795
P935   'Commons gallery'@en                 15702
P2001  'Revised Romanization'@en            6412
P1942  'McCune-Reischauer romanization'@en  5888
P528   'catalog code'@en                    5019
P835   'author citation (zoology)'@en       4195
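
The `run_command` helper comes from the `utility` module, which is not shown in this notebook. Purely as an assumption about how such a helper could work, the sketch below substitutes placeholder names in the command string and runs the result through a shell. `run_command_sketch` is a hypothetical name, and the real helper may behave differently.

```python
import subprocess

def run_command_sketch(command, substitutions=None):
    """Hypothetical stand-in for utility.run_command (actual behavior unknown).

    Replaces each placeholder key in `command` with its value, runs the
    command through the shell, and returns its standard output as text.
    """
    for placeholder, value in (substitutions or {}).items():
        command = command.replace(placeholder, value)
    # shell=True lets the shell expand environment variables such as $OUT
    # and handle pipes, as the command string in the cell above relies on.
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout

output = run_command_sketch("echo GREETING", {"GREETING": "hello"})
print(output)
```

Plain string replacement is simple but replaces every occurrence of a key, so placeholder names should not collide with other text in the command.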

