# Explore entities in input dataset
In this notebook, we take a quick look at the most common attributes and relations among entities of a given type. This helps validate the input data and gives a sense of which attributes we can expect to use later when profiling entities of that type.

In [1]:
import os
import pandas as pd
from utility import run_command
from utility import rename_cols_and_overwrite_id

### Parameters
**Required**  
*item_file*: path to the file containing entity-to-entity relationships (e.g. wikibase-item)  
*time_file*: path to the file containing entity to time-type values  
*quantity_file*: path to the file containing entity to quantity-type values (use the trimmed file if you ran the quantity-file trimming pre-processing step)  
*label_file*: path to the file containing Wikidata labels  
*work_dir*: path to the folder where files created by this notebook should be stored  
*store_dir*: path to the folder containing the sqlite3.db file used for our queries. If the folder already contains such a file, it is reused; otherwise a new one is created.

**Optional**    
*string_file*: file path for the file that contains entity to string-type values  

In [36]:
data_dir = "../../wikidata_politicians/data" # my data files are all in the same directory, so I'll reuse this path prefix

# **REQUIRED**
item_file = "{}/claims.wikibase-item.tsv.gz".format(data_dir)
time_file = "{}/claims.time.tsv.gz".format(data_dir)
quantity_file = "{}/claims.quantity.tsv.gz".format(data_dir)
label_file = "{}/labels.en.tsv.gz".format(data_dir)
work_dir = "../../wikidata_films/profiler_work"
store_dir = "../../wikidata_films"

# **optional**
string_file = "{}/claims.string.tsv.gz".format(data_dir)
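
Since every query below depends on these paths, it can be worth checking them before going further. The following is a minimal sketch, not part of the original notebook; the helper name `find_missing_inputs` is hypothetical, and it only checks that each path exists on disk.

```python
import os

def find_missing_inputs(required, optional=None):
    """Return the names of input paths that do not exist on disk.

    `required` and `optional` map a parameter name to its path.
    Optional entries are only checked when a non-empty path is given.
    (Hypothetical helper for illustration.)
    """
    missing = [name for name, path in required.items() if not os.path.exists(path)]
    for name, path in (optional or {}).items():
        if path and not os.path.exists(path):
            missing.append(name)
    return missing
```

In this notebook it could be called as `find_missing_inputs({"item_file": item_file, "time_file": time_file, "quantity_file": quantity_file, "label_file": label_file}, {"string_file": string_file})`; an empty list means all inputs were found.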

### Process parameters and set up variables / file names

In [37]:
# Ensure paths are absolute
item_file = os.path.abspath(item_file)
time_file = os.path.abspath(time_file)
quantity_file = os.path.abspath(quantity_file)
label_file = os.path.abspath(label_file)
work_dir = os.path.abspath(work_dir)
store_dir = os.path.abspath(store_dir)
if string_file:
    string_file = os.path.abspath(string_file)
    
# Create directories (no error if they already exist)
os.makedirs(work_dir, exist_ok=True)
output_dir = "{}/explore_entities".format(work_dir)
os.makedirs(output_dir, exist_ok=True)

# Set environment variables that the shell commands below use frequently
os.environ['ITEM_FILE'] = item_file
os.environ['TIME_FILE'] = time_file
os.environ['QUANTITY_FILE'] = quantity_file
os.environ['LABEL_FILE'] = label_file
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(store_dir)
os.environ['OUT'] = output_dir
os.environ['kgtk'] = "kgtk" # Need to do this for kgtk to be recognized as a command when passing it through a subprocess call

## Create type mapping
The mapping goes from an entity (a Q-node) to the entity's type (another Q-node).  
To define the type, we can use P31 (instance of) or P279* (subclass of, followed transitively). Other properties may be more suitable depending on what we want to profile; for example, P106 (occupation) is the property that tells us whether an entity is a politician.
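
To make the idea concrete, here is a small in-memory sketch of what the type mapping contains. It is only an illustration under simplifying assumptions (edges as plain tuples, a hypothetical helper `build_type_mapping`); the notebook itself builds the mapping with `kgtk query`.

```python
def build_type_mapping(edges, type_property):
    """Keep only edges whose label is the type-defining property.

    `edges` is an iterable of (node1, label, node2) tuples.
    Returns a dict mapping each entity to the set of its types.
    (Hypothetical helper for illustration.)
    """
    mapping = {}
    for node1, label, node2 in edges:
        if label == type_property:
            mapping.setdefault(node1, set()).add(node2)
    return mapping

# Toy edges; the P31 edge is made up for illustration only.
sample_edges = [
    ("Q10007", "P106", "Q1097498"),
    ("Q10007", "P31", "Q_EXAMPLE"),
    ("Q1001159", "P106", "Q82955"),
]
type_mapping = build_type_mapping(sample_edges, "P106")
```

With `type_property = "P106"`, the P31 edge is dropped and each entity maps to its occupation node(s).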

In [20]:
type_defining_property = "P106"
os.environ["TYPE_PROP"] = type_defining_property

In [29]:
!kgtk query -i $ITEM_FILE \
-o $OUT/type_mapping.tsv --graph-cache $STORE \
--match '`'"$ITEM_FILE"'`: (e)-[l {label:"'"$TYPE_PROP"'"}]->(type)' \
--return 'distinct l as id, e as node1, l.label as label, type as node2'

In [30]:
!head -5 $OUT/type_mapping.tsv | column -t -s $'\t'

id                                   node1       label  node2
Q10007-P106-Q1097498-71127567-0      Q10007      P106   Q1097498
Q1000926-P106-Q372436-773c1779-0     Q1000926    P106   Q372436
Q100109193-P106-Q2478141-24bc7030-0  Q100109193  P106   Q2478141
Q10011-P106-Q1097498-2778ac1a-0      Q10011      P106   Q1097498


## Look at top attributes amongst entities within a type

In [31]:
# Change the Q-node type here to look at attributes of entities of a different type
# type_to_profile = "Q44" # beer
# type_to_profile = "Q282" # wine
# type_to_profile = "Q154" # alcoholic beverage
# type_to_profile = "Q11424" # film
type_to_profile = "Q82955" # politician
os.environ["TYPE"] = type_to_profile

Number of entities of this type:

In [32]:
!kgtk filter -p " ; $TYPE_PROP ; $TYPE " -i $OUT/type_mapping.tsv | wc -l | awk '{print $1-1}'

2657
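
The shell pipeline above counts the lines of the filtered file and subtracts one for the header row. A rough Python equivalent is sketched below, assuming the file has the `id`, `node1`, `label`, `node2` columns shown earlier; `count_type_edges` is a hypothetical name, and the sample rows are copied from the outputs in this notebook.

```python
import csv
import io

def count_type_edges(tsv_file, type_prop, type_node):
    """Count edges in a KGTK edge file with the given label and node2.

    `tsv_file` is an open text file (or any iterable of lines) that
    starts with a header row. (Hypothetical helper for illustration.)
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return sum(
        1
        for row in reader
        if row["label"] == type_prop and row["node2"] == type_node
    )

sample = io.StringIO(
    "id\tnode1\tlabel\tnode2\n"
    "Q10007-P106-Q1097498-71127567-0\tQ10007\tP106\tQ1097498\n"
    "Q1001159-P106-Q82955-047f990a-0\tQ1001159\tP106\tQ82955\n"
    "Q100487161-P106-Q82955-1ac1ed0d-0\tQ100487161\tP106\tQ82955\n"
)
sample_count = count_type_edges(sample, "P106", "Q82955")
print(sample_count)  # prints 2
```

Like the shell version, this counts matching edges rather than distinct entities, so an entity with two matching statements would be counted twice.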


Some examples:

In [33]:
!kgtk filter -p " ; $TYPE_PROP ; $TYPE " -i $OUT/type_mapping.tsv | head | column -t -s $'\t'

id                                 node1       label  node2
Q1001159-P106-Q82955-047f990a-0    Q1001159    P106   Q82955
Q100487161-P106-Q82955-1ac1ed0d-0  Q100487161  P106   Q82955
Q100605214-P106-Q82955-ce9366a2-0  Q100605214  P106   Q82955
Q100729152-P106-Q82955-712a5b0b-0  Q100729152  P106   Q82955
Q100731488-P106-Q82955-3a49c4dc-0  Q100731488  P106   Q82955
Q100737619-P106-Q82955-35480579-0  Q100737619  P106   Q82955
Q101092015-P106-Q82955-81e5834b-0  Q101092015  P106   Q82955
Q101436380-P106-Q82955-2654a973-0  Q101436380  P106   Q82955
Q1015710-P106-Q82955-f438772b-0    Q1015710    P106   Q82955


Number of these entities that have an English label:

In [34]:
!kgtk query -i $OUT/type_mapping.tsv -i $LABEL_FILE \
--graph-cache $STORE \
--match 'type: (e)-[]->(:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (e)-[:label]->(e_lab)' \
--return 'e as entity, e_lab as entity_label' \
--where 'e_lab.kgtk_lqstring_lang_suffix = "en"' \
| wc -l | awk '{print $1-1}'

2635
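
The `--where` clause keeps only labels whose language suffix is `en`. Labels in the label file appear as language-qualified strings such as `'height'@en`. The sketch below shows the same check in plain Python; it is a simplified assumption about the format (it does not handle escaped quotes), and the function names are hypothetical.

```python
import re

# Simplified pattern: single-quoted text followed by "@" and a language tag.
_LQ_STRING = re.compile(
    r"^'(?P<text>.*)'@(?P<lang>[a-zA-Z]+(?:-[a-zA-Z0-9]+)*)$", re.DOTALL
)

def split_lq_string(value):
    """Split a string such as 'height'@en into (text, lang).

    Returns None if the value does not look like a language-qualified string.
    """
    match = _LQ_STRING.match(value)
    if match is None:
        return None
    return match.group("text"), match.group("lang")

def has_language(value, lang):
    """Return True if `value` is a language-qualified string in `lang`."""
    parsed = split_lq_string(value)
    return parsed is not None and parsed[1] == lang
```

For example, `has_language("'height'@en", "en")` is true, while a bare node id such as `Q82955` is not a language-qualified string at all.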


Top quantity attributes amongst entities of this type:

In [38]:
!kgtk query -i $QUANTITY_FILE -i $LABEL_FILE -i $OUT/type_mapping.tsv --graph-cache $STORE \
--limit 10 \
--match '`'"$QUANTITY_FILE"'`: (e)-[l {label:prop}]->(), type: (e)-[]->(:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (prop)-[:label]->(prop_lab)' \
--return 'distinct prop as prop, prop_lab as prop_label, count(distinct e) as ents_with_this_prop' \
--where 'prop_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'ents_with_this_prop desc' \
| column -t -s $'\t'

prop   prop_label                  ents_with_this_prop
P1971  'number of children'@en     290
P2048  'height'@en                 27
P2067  'mass'@en                   8
P2218  'net worth'@en              5
P1352  'ranking'@en                1
P2021  'Erdős number'@en           1
P2097  'term length of office'@en  1
P2138  'total liabilities'@en      1
P2139  'total revenue'@en          1
P2295  'net profit'@en             1
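
The quantity, time, relation, and string queries in this section all follow the same pattern: restrict the edges to entities of the chosen type, group by property, and count distinct entities per property. The sketch below mirrors that aggregation in plain Python on made-up toy data. The helper name and the entity ids are hypothetical, the label lookup and English filter are omitted, and breaking ties by property id is an assumption of this sketch.

```python
from collections import defaultdict

def top_properties(edges, entities_of_type, limit=10):
    """Rank properties by how many distinct entities of the type use them.

    `edges` is an iterable of (node1, label, node2) tuples and
    `entities_of_type` is a set of entity ids.
    Returns a list of (property, entity_count), highest count first.
    """
    entities_per_prop = defaultdict(set)
    for node1, label, _node2 in edges:
        if node1 in entities_of_type:
            entities_per_prop[label].add(node1)
    ranked = sorted(
        ((prop, len(ents)) for prop, ents in entities_per_prop.items()),
        key=lambda item: (-item[1], item[0]),  # count descending, then property id
    )
    return ranked[:limit]

# Hypothetical toy data: E1..E3 are placeholder entity ids.
toy_edges = [
    ("E1", "P1971", "2"),
    ("E1", "P1971", "3"),    # the same entity twice still counts once
    ("E2", "P1971", "1"),
    ("E2", "P2048", "180"),
    ("E3", "P2048", "175"),  # E3 is not of the profiled type, so it is ignored
]
toy_ranking = top_properties(toy_edges, {"E1", "E2"})
print(toy_ranking)  # prints [('P1971', 2), ('P2048', 1)]
```

Counting distinct entities rather than edges matters here: a property that one entity uses many times should not outrank a property that many entities use once.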


Top time attributes amongst entities of this type:

In [39]:
!kgtk query -i $TIME_FILE -i $LABEL_FILE -i $OUT/type_mapping.tsv --graph-cache $STORE \
--limit 10 \
--match '`'"$TIME_FILE"'`: (e)-[l {label:prop}]->(), type: (e)-[]->(:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (prop)-[:label]->(prop_lab)' \
--return 'distinct prop as prop, prop_lab as prop_label, count(distinct e) as ents_with_this_prop' \
--where 'prop_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'ents_with_this_prop desc' \
| column -t -s $'\t'

prop   prop_label                               ents_with_this_prop
P569   'date of birth'@en                       2529
P570   'date of death'@en                       1001
P2031  'work period (start)'@en                 52
P4602  'date of burial or cremation'@en         19
P1636  'date of baptism in early childhood'@en  6
P2032  'work period (end)'@en                   4
P1317  'floruit'@en                             3
P571   'inception'@en                           2
P580   'start time'@en                          1
P582   'end time'@en                            1


Top relations amongst entities of this type:

In [40]:
!kgtk query -i $ITEM_FILE -i $LABEL_FILE -i $OUT/type_mapping.tsv --graph-cache $STORE \
--limit 10 \
--match '`'"$ITEM_FILE"'`: (e)-[l {label:prop}]->(), type: (e)-[]->(:`'"$TYPE"'`), `'"$LABEL_FILE"'`: (prop)-[:label]->(prop_lab)' \
--return 'distinct prop as prop, prop_lab as prop_label, count(distinct e) as ents_with_this_prop' \
--where 'prop_lab.kgtk_lqstring_lang_suffix = "en"' \
--order-by 'ents_with_this_prop desc' \
| column -t -s $'\t'

prop  prop_label                   ents_with_this_prop
P106  'occupation'@en              2657
P31   'instance of'@en             2657
P21   'sex or gender'@en           2618
P39   'position held'@en           2572
P27   'country of citizenship'@en  2469
P19   'place of birth'@en          2309
P735  'given name'@en              2094
P102  'political party'@en         1946
P734  'family name'@en             1879
P69   'educated at'@en             1688


Top string attributes amongst entities of this type:

In [41]:
if not string_file:
    print("No string attribute file was provided in the parameters section, skipping this step.")
else:
    # perform query
    command = "$kgtk query -i STRING_FILE -i LABEL_FILE -i $OUT/type_mapping.tsv --graph-cache $STORE \
               --limit 10 \
               --match '`STRING_FILE`: (e)-[l {label:prop}]->(), type: (e)-[]->(:`TYPE`), `LABEL_FILE`: (prop)-[:label]->(prop_lab)' \
               --return 'distinct prop as prop, prop_lab as prop_label, count(distinct e) as ents_with_this_prop' \
               --where 'prop_lab.kgtk_lqstring_lang_suffix = \"en\"' \
               --order-by 'ents_with_this_prop desc' \
               | column -t -s $'\t'"
    run_command(command, {"STRING_FILE" : string_file, "LABEL_FILE" : label_file, "TYPE" : type_to_profile})

prop   prop_label                             ents_with_this_prop
P373   'Commons category'@en                  1480
P935   'Commons gallery'@en                   162
P528   'catalog code'@en                      68
P898   'IPA transcription'@en                 32
P1472  'Commons Creator page'@en              31
P1814  'name in kana'@en                      30
P742   'pseudonym'@en                         12
P1942  'McCune-Reischauer romanization'@en    8
P2001  'Revised Romanization'@en              8
P1438  'Jewish Encyclopedia ID (Russian)'@en  7
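
The `run_command` helper comes from the `utility` module, which is not shown in this notebook. Judging only from how it is called above (a command string plus a dictionary of placeholder names and values), a helper of this kind might look roughly like the sketch below. This is a guess for illustration, not the actual implementation; both function names are hypothetical.

```python
import os
import subprocess

def substitute_placeholders(command, substitutions):
    """Replace each placeholder token in the command with its value."""
    for placeholder, value in substitutions.items():
        command = command.replace(placeholder, value)
    return command

def run_command_sketch(command, substitutions=None):
    """Fill in placeholders, run the command in a shell, and print its output.

    Environment variables such as $OUT and $STORE are expanded by the shell,
    which inherits them from os.environ. Returns the command's exit code.
    """
    filled = substitute_placeholders(command, substitutions or {})
    result = subprocess.run(
        filled, shell=True, capture_output=True, text=True, env=os.environ.copy()
    )
    print(result.stdout, end="")
    if result.returncode != 0:
        print(result.stderr, end="")
    return result.returncode
```

Note that plain string replacement rewrites a placeholder wherever it occurs, including inside a longer token (for example, `TYPE` inside `$TYPE_PROP`), so placeholder names need to be chosen so that they do not collide with other parts of the command.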

