# This notebook implements an optional pre-processing step that can be run before candidate label creation.
We trim a given quantity file by using its associated qualifiers to keep only the most recent value for time-series data. For example, if your quantity file includes the human development index (HDI) of various countries, there may be many HDI values for each country, one per point in time. Keeping only the most recent value avoids confusing results (e.g. labeling every country as having a small population because every country had a smaller population at some point) and reduces the number of candidate labels created in later steps.
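
The trimming idea described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the notebook's actual implementation: the helper name `keep_most_recent` is hypothetical, the rows are made-up toy data, and timestamps are assumed to be equal-length ISO-style strings so that string comparison matches chronological order.

```python
def keep_most_recent(rows, key_cols=("node1", "label"), time_col="time"):
    """Keep, for each (node1, label) pair, only the row with the latest time."""
    latest = {}
    for row in rows:
        key = tuple(row[col] for col in key_cols)
        # Equal-length ISO-style timestamps compare correctly as plain strings.
        if key not in latest or row[time_col] > latest[key][time_col]:
            latest[key] = row
    return list(latest.values())


# Toy rows: the entities, property, and values are made up for illustration.
rows = [
    {"node1": "Q_A", "label": "P_hdi", "node2": "+0.62", "time": "2012-01-01T00:00:00Z"},
    {"node1": "Q_A", "label": "P_hdi", "node2": "+0.70", "time": "2017-01-01T00:00:00Z"},
    {"node1": "Q_B", "label": "P_hdi", "node2": "+0.81", "time": "2015-01-01T00:00:00Z"},
]

trimmed = keep_most_recent(rows)
for row in trimmed:
    print(row["node1"], row["label"], row["node2"], row["time"])
```

Each entity and property pair ends up with exactly one row, so later steps see a single current value instead of the whole time series.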

In [16]:
import os
import pandas as pd
from utility import run_command
from utility import rename_cols_and_overwrite_id

### Parameters
**Required**   
*quantity_file*: file path for the file that contains entity to quantity-type values  
*qualifiers_file*: file path for the file that contains the qualifiers (e.g. P585 point-in-time edges) of the statements in the quantity file  
*out_file*: file path for the file that we will save the trimmed results in   
*store_dir*: path to folder containing the sqlite3.db file that we will use for our queries. We will reuse an existing file if there is one in this folder. Otherwise we will create a new one.

In [17]:
# **REQUIRED**
quantity_file = "../../Q154/data/parts/claims.quantity.tsv.gz"
qualifiers_file = "../../Q154/data/parts/qualifiers.quantity.tsv.gz"
out_file = "../../Q154/data/parts/claims.quantity_trimmed.tsv.gz"
store_dir = "../../Q154"

### Process parameters and set up variables / file names

In [18]:
# Ensure paths are absolute
quantity_file = os.path.abspath(quantity_file)
qualifiers_file = os.path.abspath(qualifiers_file)
out_file = os.path.abspath(out_file)
store_dir = os.path.abspath(store_dir)

# Environment variables for kgtk commands
os.environ['QUANTITY_FILE'] = quantity_file
os.environ['QUALIFIERS_FILE'] = qualifiers_file
os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(store_dir)
os.environ['OUT_FILE'] = out_file
os.environ['kgtk'] = "kgtk" # Need to do this for kgtk to be recognized as a command when passing it through a subprocess call
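
Before running any queries, it can help to confirm that the input files actually exist, since a mistyped path otherwise only surfaces later as a kgtk error. The following is a minimal sketch; `find_missing_inputs` is a hypothetical helper and not part of the `utility` module imported above. The demo uses a temporary directory so that it runs anywhere.

```python
import os
import tempfile


def find_missing_inputs(paths):
    """Return the subset of `paths` that do not point to an existing file."""
    return [path for path in paths if not os.path.isfile(path)]


# Demo with a temporary directory: one file is created, the other is not.
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "claims.quantity.tsv.gz")
    absent = os.path.join(tmp, "qualifiers.quantity.tsv.gz")
    open(present, "wb").close()
    missing = find_missing_inputs([present, absent])
    print(missing)
```

In this notebook the call would be `find_missing_inputs([quantity_file, qualifiers_file])`; an empty list means both inputs were found.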

In [19]:
!gzcat $QUALIFIERS_FILE | head -5

id	node1	label	node2
Q1000-P1081-0d345f-3a33abf5-0-P585-90ede1-0	Q1000-P1081-0d345f-3a33abf5-0	P585	^2004-01-01T00:00:00Z/9
Q1000-P1081-0d345f-6da37c02-0-P585-e4bef8-0	Q1000-P1081-0d345f-6da37c02-0	P585	^2003-01-01T00:00:00Z/9
Q1000-P1081-1100e3-c7631769-0-P585-c03b8d-0	Q1000-P1081-1100e3-c7631769-0	P585	^1992-01-01T00:00:00Z/9
Q1000-P1081-1ada51-7c71c229-0-P585-7131d5-0	Q1000-P1081-1ada51-7c71c229-0	P585	^2002-01-01T00:00:00Z/9
gzcat: error writing to output: Broken pipe
gzcat: /Users/nicklein/Documents/grad_school/Research.nosync/Q154/data/parts/qualifiers.quantity.tsv.gz: uncompress failed
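
The `node2` values in this preview are date literals of the form `^2004-01-01T00:00:00Z/9`: a `^` prefix, an ISO-like timestamp, and a precision suffix (in Wikidata's time model, precision 9 denotes a year). Such values can carry a month and day of `00`, which Python's `datetime` rejects, so the sketch below parses them into plain integers instead. The function name `parse_kgtk_date` and the regular expression are assumptions made for illustration; they cover only the literal shapes seen in this notebook.

```python
import re

# Matches literals such as "^2004-01-01T00:00:00Z/9"; the "/precision" part is optional.
_KGTK_DATE = re.compile(
    r"^\^(?P<year>[+-]?\d{4,})-(?P<month>\d{2})-(?P<day>\d{2})"
    r"T\d{2}:\d{2}:\d{2}Z?(?:/(?P<precision>\d+))?$"
)


def parse_kgtk_date(literal):
    """Return (year, month, day, precision); precision is None when absent."""
    match = _KGTK_DATE.match(literal)
    if match is None:
        raise ValueError("not a recognized date literal: {!r}".format(literal))
    precision = match.group("precision")
    return (
        int(match.group("year")),
        int(match.group("month")),
        int(match.group("day")),
        int(precision) if precision is not None else None,
    )


with_precision = parse_kgtk_date("^2004-01-01T00:00:00Z/9")
zero_month_day = parse_kgtk_date("^2014-00-00T00:00:00Z")
print(with_precision, zero_month_day)
```

Comparing only the first three elements of each tuple, for example `parse_kgtk_date(a)[:3] > parse_kgtk_date(b)[:3]`, orders two literals chronologically without involving the optional precision.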


In [20]:
!gzcat $QUANTITY_FILE | head -5

id	node1	label	node2	node2;wikidatatype
Q1000-P1081-0d345f-3a33abf5-0	Q1000	P1081	+0.641	quantity
Q1000-P1081-0d345f-6da37c02-0	Q1000	P1081	+0.641	quantity
Q1000-P1081-1100e3-c7631769-0	Q1000	P1081	+0.624	quantity
Q1000-P1081-1ada51-7c71c229-0	Q1000	P1081	+0.639	quantity
gzcat: error writing to output: Broken pipe
gzcat: /Users/nicklein/Documents/grad_school/Research.nosync/Q154/data/parts/claims.quantity.tsv.gz: uncompress failed


In [24]:
!kgtk query --graph-cache $STORE \
-i $QUANTITY_FILE -i $QUALIFIERS_FILE -o $OUT_FILE \
--match '`'"$QUANTITY_FILE"'`: (n1)-[l {label:prop}]->(n2), `'"$QUALIFIERS_FILE"'`: (l)-[q {label:P585}]->(t)' \
--return 'distinct n1, prop as label, n2 as node2, q.label as qualifier, kgtk_date_and_time(t) as time, l as id' \
--order-by 'n1, prop, q.label, time desc'
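
To make the join in the query above concrete, here is a plain-Python sketch of the same idea: each qualifier edge is matched to a claim through the claim's `id`, only `P585` qualifiers are kept, and the result is ordered by entity, property, qualifier, and then time, newest first. It uses three of the rows from the file previews earlier in the notebook. The function `join_claims_with_dates` is hypothetical, the `distinct` step is omitted, and stripping the `/precision` suffix only approximates what `kgtk_date_and_time` returns.

```python
def join_claims_with_dates(claims, qualifiers, date_prop="P585"):
    """Attach the date qualifier of each claim and sort the newest rows first."""
    claims_by_id = {claim["id"]: claim for claim in claims}
    joined = []
    for qual in qualifiers:
        claim = claims_by_id.get(qual["node1"])
        if claim is None or qual["label"] != date_prop:
            continue
        joined.append({
            "node1": claim["node1"],
            "label": claim["label"],
            "node2": claim["node2"],
            "qualifier": qual["label"],
            # Approximation: drop the "/precision" suffix from the literal.
            "time": qual["node2"].split("/")[0],
            "id": claim["id"],
        })
    # Two stable sorts: newest first, then by entity, property, and qualifier.
    joined.sort(key=lambda row: row["time"], reverse=True)
    joined.sort(key=lambda row: (row["node1"], row["label"], row["qualifier"]))
    return joined


# Rows copied from the previews of the quantity and qualifier files.
claims = [
    {"id": "Q1000-P1081-1100e3-c7631769-0", "node1": "Q1000", "label": "P1081", "node2": "+0.624"},
    {"id": "Q1000-P1081-0d345f-3a33abf5-0", "node1": "Q1000", "label": "P1081", "node2": "+0.641"},
    {"id": "Q1000-P1081-0d345f-6da37c02-0", "node1": "Q1000", "label": "P1081", "node2": "+0.641"},
]
qualifiers = [
    {"node1": "Q1000-P1081-1100e3-c7631769-0", "label": "P585", "node2": "^1992-01-01T00:00:00Z/9"},
    {"node1": "Q1000-P1081-0d345f-3a33abf5-0", "label": "P585", "node2": "^2004-01-01T00:00:00Z/9"},
    {"node1": "Q1000-P1081-0d345f-6da37c02-0", "label": "P585", "node2": "^2003-01-01T00:00:00Z/9"},
]

result = join_claims_with_dates(claims, qualifiers)
for row in result:
    print(row["node1"], row["label"], row["node2"], row["time"])
```

Like the query, this sketch only orders the rows; every dated value of a property is still present, with the most recent one listed first for each entity and property.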

In [27]:
!gzcat $OUT_FILE | head | column -t -s $'\t'

gzcat: error writing to output: Broken pipe
gzcat: /Users/nicklein/Documents/grad_school/Research.nosync/Q154/data/parts/claims.quantity_trimmed.tsv.gz: uncompress failed
node1  label  node2   qualifier  time                   id
Q1000  P1081  +0.702  P585       ^2017-01-01T00:00:00Z  Q1000-P1081-345681-88a99cab-0
Q1000  P1081  +0.698  P585       ^2016-01-01T00:00:00Z  Q1000-P1081-ca6790-6e605c9e-0
Q1000  P1081  +0.694  P585       ^2015-01-01T00:00:00Z  Q1000-P1081-81998e-cc0171b8-0
Q1000  P1081  +0.693  P585       ^2014-01-01T00:00:00Z  Q1000-P1081-70c922-1b54e63b-0
Q1000  P1081  +0.684  P585       ^2014-00-00T00:00:00Z  Q1000-P1081-703676-2421fe86-0
Q1000  P1081  +0.687  P585       ^2013-01-01T00:00:00Z  Q1000-P1081-f9db15-49b5f867-0
Q1000  P1081  +0.679  P585       ^2013-00-00T00:00:00Z  Q1000-P1081-eec082-fcd7ee29-0
Q1000  P1081  +0.678  P585       ^2012-01-01T00:00:00Z  Q1000-P1081-85a71b-cef7b7e5-0
Q1000  P1081  +0.673  P585       ^2012-00-00T00:00:00Z  Q1000-P1081-563