# Generating Subsets of Wikidata

> **Warning:** This notebook is under construction and it doesn't work yet.

## Purpose

>This notebook creates smaller subgraphs from a larger input Wikidata graph. Notebook users provide a list of Wikidata classes (**QNodes**) to remove and a list of classes to preserve, and the notebook builds the corresponding subset of Wikidata.


### Batch Invocation
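The example below runs a notebook in batch mode with papermill. The second argument is the notebook in which the output is stored; you can open it to follow progress.

**TODO:** Update this example invocation. It still targets the `Wikidata Useful Files` notebook.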


```
papermill Wikidata\ Useful\ Files.ipynb useful-files.out.ipynb \
-p wiki_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz \
-p label_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.label.en.tsv.gz \
-p item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.wikibase-item.tsv.gz \
-p property_item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.property.wikibase-item.tsv.gz \
-p qual_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/qual.tsv.gz \
-p output_path <local folder> \
-p output_folder useful_files_v4 \
-p temp_folder temp.useful_files_v4 \
-p delete_database no \
-p compute_pagerank no \
-p languages es,ru,zh-cn 
```

In [2]:
# Parameters

# Folder on local machine where to create the output and temporary folders
# output_path = "/Users/pedroszekely/Downloads/kypher"
output_path = "/Users/markmann/Downloads/subset"

# The names of the output and temporary folders
output_folder = "output"
temp_folder = "temp.output"

# Classes to remove
#Q34508 - video tape recording
# remove_classes = "Q13442814, Q523, Q16521, Q318, Q7318358, Q7187, Q11173, Q8054, Q5, Q13100073, Q8502, Q3305213, Q4022, Q79007, Q1931185, Q30612, Q101352, Q54050, Q13433827, Q2668072, Q23397, Q3863, Q11424, Q482994, Q47150325, Q16970, Q18593264, Q355304, Q9842, Q7725634, Q27020041, Q56436498, Q2154519, Q61443690, Q49008, Q3331189, Q47521, Q5084, Q19389637, Q21014462, Q4164871, Q11060274, Q5633421, Q39816, Q5185279, Q55488, Q134556, Q22698, Q985488, Q1260524, Q204107, Q2225692, Q215380, Q71963409, Q452237, Q93184, Q12323"
remove_classes = "Q34508"

# The location of input files
# wiki_root_folder = "/Volumes/GoogleDrive/Shared\ drives/KGTK/datasets/wikidata-20200803-v4/"
# wiki_root_folder = "/Users/pedroszekely/Downloads/kypher/wikidataos-v4/"
wiki_root_folder = "/Users/markmann/Google\ Drive/Shared\ drives/KGTK/datasets/wikidataos-v4-mm-2/"

claims_file = "claims.tsv.gz"
label_file = "labels.en.tsv.gz"
alias_file = "aliases.en.tsv.gz"
description_file = "descriptions.en.tsv.gz"
item_file = "claims.wikibase-item.tsv.gz"
qual_file = "qualifiers.tsv.gz"
property_datatypes_file = "metadata.property.datatypes.tsv.gz"
metadata_file = "metadata.types.tsv.gz" #Add
isa_file = "derived.isa.tsv.gz"
p279star_file = "derived.P279star.tsv.gz"

# Useful files Jupyter notebook
useful_files_notebook = "Wikidata Useful Files.ipynb"
notebooks_folder = "/Users/markmann/Desktop/CKG/kgtk_subset/kgtk/examples/"

# Location of the cache database for kypher
# cache_path = "/Users/pedroszekely/Downloads/kypher/wikidataos-v4"
cache_path = f'{output_path}/{output_folder}'

#Additional parameters
delete_database = "no"
compute_pagerank = "no"
languages = ""

### Needs fixing
# Whether to delete the cache database
if delete_database and delete_database.lower().strip() == 'yes':
    delete_database = True
else:
    delete_database = False

### Needs fixing
if compute_pagerank and compute_pagerank.lower().strip() == 'yes':
    compute_pagerank = True
else:
    compute_pagerank = False

if languages:
    languages = languages.split(',')
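
The flag handling above assumes that every parameter arrives as a string, while the `pm.execute_notebook` calls later in this notebook pass real booleans. The following is a minimal sketch of a more tolerant normalization. `parse_flag` and `parse_languages` are hypothetical helper names, not part of KGTK or papermill, and accepting `"true"` in addition to `"yes"` is an assumption.

```python
def parse_flag(value):
    """Interpret a notebook parameter as a boolean.

    Accepts real booleans as well as strings such as "yes" or " No ".
    Anything other than True or a case-insensitive "yes"/"true" is False.
    """
    if isinstance(value, bool):
        return value
    if value is None:
        return False
    return str(value).strip().lower() in {"yes", "true"}


def parse_languages(value):
    """Turn a comma-separated language string (or a list) into a clean list."""
    if not value:
        return []
    if isinstance(value, (list, tuple)):
        return [str(v).strip() for v in value if str(v).strip()]
    return [part.strip() for part in value.split(",") if part.strip()]
```

With these helpers the two `if`/`else` blocks would reduce to `delete_database = parse_flag(delete_database)` and `compute_pagerank = parse_flag(compute_pagerank)`.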

In [3]:
import io
import os
import subprocess
import sys

import numpy as np
import pandas as pd

import altair as alt

import papermill as pm

## Set up variables for files

In [4]:
#Environment variables
if cache_path:
    os.environ['STORE'] = "{}/wikidata.sqlite3.db".format(cache_path)
else:
    os.environ['STORE'] = "{}/{}/wikidata.sqlite3.db".format(output_path, temp_folder)

#Python variables
if cache_path:
    store = "{}/wikidata.sqlite3.db".format(cache_path)
else:
    store = "{}/{}/wikidata.sqlite3.db".format(output_path, temp_folder)

out = "{}/{}".format(output_path, output_folder)
temp = "{}/{}".format(output_path, temp_folder)

claims = wiki_root_folder + claims_file
labels = wiki_root_folder + label_file
aliases = wiki_root_folder + alias_file
descriptions = wiki_root_folder + description_file
items = wiki_root_folder + item_file
quals = wiki_root_folder + qual_file
datatypes = wiki_root_folder + property_datatypes_file
metadata = wiki_root_folder + metadata_file #Add
isa = wiki_root_folder + isa_file
p279star = wiki_root_folder + p279star_file

# shortcuts to commands
kgtk = "time kgtk --debug"
kypher = "kgtk query --debug --graph-cache " + store

In [None]:
#Check files
#Example
#Q34508 - video tape recording

# !gzcat {p279star} | head
# !gzcat {descriptions} | head

# !zgrep 'Q34508' {datatypes}
# !zgrep '\Q.*' {datatypes}
# !zgrep 'Q34508' {claims}

Go to the output directory and create the subfolders for the output files and the temporary files

In [4]:
!cd $output_path
!mkdir {out}
!mkdir {temp}

mkdir: /Users/markmann/Downloads/subset/output: File exists
mkdir: /Users/markmann/Downloads/subset/temp.output: File exists
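
The `File exists` messages are harmless, but they appear on every rerun. Each `!` command also runs in its own subshell, so `!cd` does not change the working directory for the commands that follow. A possible alternative, sketched here in plain Python, creates both folders idempotently. `ensure_folders` is a hypothetical helper name.

```python
import os


def ensure_folders(output_path, output_folder, temp_folder):
    """Create the output and temp folders if they do not exist yet.

    Returns the two folder paths as (out, temp).
    """
    out = os.path.join(output_path, output_folder)
    temp = os.path.join(output_path, temp_folder)
    # exist_ok=True makes repeated runs silent instead of raising an error
    os.makedirs(out, exist_ok=True)
    os.makedirs(temp, exist_ok=True)
    return out, temp
```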


Clean up the output and temp folders before we start

In [5]:
# !rm {out}/*.tsv {out}/*.tsv.gz
# !rm {temp}/*.tsv {temp}/*.tsv.gz

if delete_database:
    !rm {out}/*.tsv {out}/*.tsv.gz
    !rm {temp}/*.tsv {temp}/*.tsv.gz

In [None]:
#COPY REFERENCE FILES
# !cp {datatypes} {out}
# !cp {metadata} {out}

### Preview the input files

It is always good practice to peek at the files to make sure the column headings are what we expect.

In [44]:
!{kypher} -i {claims} \
--match '()-[]->()' \
--limit 10

[2021-02-05 13:51:06 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_1 AS graph_1_c1
     LIMIT ?
  PARAS: [10]
---------------------------------------------
id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video"	normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item
P10-P1659-P1651-c4068028-0	P10	P1659	P1651	normal	wikibase-property
P10-P1659-P18-5e4b9c4f-0	P10	P1659	P18	normal	wikibase-property
P10-P1659-P4238-d21d1ac0-0	P10	P1659	P4238	normal	wikibase-property
P10-P1659-P51-86aca4c5-0	P10	P1659	P51	normal	wikibase-property
P10-P1855-Q7378-555592a4-0	P10	P1855	Q7378	normal	wikibase-item
P10-P31-Q18610173-85ef4d24-0	P10	P31	Q18610173	normal	wikibase-item
P1000-P1629-Q1241356-d5c10f50-0	P1000	P1629	Q1241356	normal	wikibase-item
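
The same header check can be done without loading the file into the kypher cache. The sketch below reads only the first line of a gzipped TSV file and reports which expected columns are missing. `read_header` and `check_columns` are hypothetical helper names, and the file is assumed to be UTF-8 encoded.

```python
import gzip


def read_header(path, sep="\t"):
    """Return the column names from the first line of a gzipped TSV file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        first_line = handle.readline()
    return first_line.rstrip("\n").split(sep)


def check_columns(path, expected):
    """Return the expected column names that are missing from the file."""
    header = read_header(path)
    return [name for name in expected if name not in header]
```

For the claims file, `check_columns(claims_path, ["id", "node1", "label", "node2"])` should return an empty list, where `claims_path` is the plain file path without shell escaping.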


## Creating a list of all the items to remove

In [None]:
remove_list = set()

**Add classes to remove directly here:** <br>
- **NOTE:** This will only remove items that have a P31/P279 relation with the class
- **Example:** Let's remove the class (videotape recording, 'Q34508')

In [None]:
remove_list = set()
classes_to_remove = ['Q34508'] #Parameter
remove_list.update(classes_to_remove)  # add every class to the set

**Compute classes to remove, based on instances here:**
- **Example:** Let's remove the classes of the instances (Fireball, 'Q5451712') and (Bush, 'Q1017471')
- **NOTE:** The expected classes to remove are (whisky, 'Q281') and (beer, 'Q44')

In [96]:
#For each instance, find all classes associated with that instance
instances_to_remove = ['Q5451712', 'Q1017471'] #Parameter
instances = ', '.join([f'"{instance}"' for instance in instances_to_remove])
!{kypher} -i {claims} \
--match '(instance)-[:P31]->(c)' \
--where 'instance in [{instances}]'

#ISSUE: Need to query node2 for all P31/P279 relations of the given list of instances.
#SOLN: Do 2 queries for now, and cat them together
# !{kypher} -i {claims} -o {temp}/instances.p31.remove.tsv.gz \
# --match '(instance)-[r {label: "P31"}]->(c)' \
# --where 'instance in [{instances}]'



[2021-02-05 16:40:08 query]: SQL Translation:
---------------------------------------------
  SELECT *
     FROM graph_1 AS graph_1_c1
     WHERE graph_1_c1."label"=?
     AND (graph_1_c1."node1" IN (?, ?))
  PARAS: ['P31', 'Q5451712', 'Q1017471']
---------------------------------------------
id	node1	label	node2	rank	node2;wikidatatype
Q1017471-P31-Q15075508-61e783df-0	Q1017471	P31	Q15075508	normal	wikibase-item
Q1017471-P31-Q44-7580116c-0	Q1017471	P31	Q44	normal	wikibase-item
Q5451712-P31-Q281-2d4512be-0	Q5451712	P31	Q281	normal	wikibase-item
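
Once the P31 and P279 results have been concatenated, the classes still have to be collected into `remove_list`. The sketch below shows that step on in-memory tuples, using the three edges returned by the query above. `classes_of_instances` is a hypothetical helper name, and reading the edges from the output file is left out.

```python
def classes_of_instances(edges, instances, labels=("P31", "P279")):
    """Collect the node2 values of P31/P279 edges whose node1 is an instance.

    `edges` is an iterable of (node1, label, node2) tuples.
    """
    wanted = set(instances)
    allowed = set(labels)
    found = set()
    for node1, label, node2 in edges:
        if node1 in wanted and label in allowed:
            found.add(node2)
    return found


# The three P31 edges returned by the query above
edges = [
    ("Q1017471", "P31", "Q15075508"),
    ("Q1017471", "P31", "Q44"),
    ("Q5451712", "P31", "Q281"),
]
remove_list = classes_of_instances(edges, ["Q5451712", "Q1017471"])
print(sorted(remove_list))  # ['Q15075508', 'Q281', 'Q44']
```

Note that the result contains `Q15075508` in addition to the two expected classes, so the computed list should be reviewed before it is used for removal.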


In [75]:
#GOAL: Find all instances of beer (Q44) class in claims file
# !{kypher} -i {claims} \
# --match '(n1)-[:P31]->(:Q44)' \
# --limit 10

## Create a list of all items to protect

## Check for conflicts

### Compute the items to be removed

First look at the classes we will remove

In [7]:
cmd = "wd u {}".format(" ".join(remove_classes.split(",")))
!{cmd}

id Q34508
Label videotape recording
Description electronic medium for the recording, copying and broadcasting of moving visual images
instance of (P31): electronic media (Q1209283)
subclass of (P279): electronic media (Q1209283) | moving image (Q10301427) | audiovisual work (Q2431196)


Compose the kypher command to remove the classes

In [8]:
!zcat < {isa} | head | col

zcat: error writing to output: Broken pipe
node1	label	node2
P10	isa	Q18610173
P1000	isa	Q18608871
P1001	isa	Q15720608
P1001	isa	Q22984026
P1001	isa	Q22997934
P1001	isa	Q61719275
P1001	isa	Q70564278
P1002	isa	Q22963600
P1003	isa	Q19595382


Run the command. The items to remove will be written to the file `{temp}/items.remove.tsv.gz`.

In [68]:
# remove_classes = "Q13442814, Q523, Q16521" #Input
classes = ", ".join('"{}"'.format(x) for x in remove_classes.replace(" ", "").split(","))  # yields "x", "y", "z"
!{kypher}  -i {isa} -i {p279star} -o {temp}/items.remove.tsv.gz \
--match 'isa: (n1)-[:isa]->(c), P279star: (c)-[]->(class)' \
--where 'class in [{classes}]' \
--return 'distinct n1, "p31_p279star" as label, class as node2'

"Q13442814", "Q523", "Q16521"


Preview the `items.remove` file and count the number of class instances to remove.

In [50]:
# !zcat < {temp}/items.remove.tsv.gz | head | col

#Count of remove-list class Q34508 in claims file
# !zgrep 'Q34508' {claims} -c #466
# !zgrep '/tQ5/t' {claims} -c

#Count the remove-list class (Q34508) in items.remove.tsv
!zcat < {temp}/items.remove.tsv.gz
# !zgrep 'Q34508' {temp}/items.remove.tsv.gz -c #466

node1	label	node2
Q100000003	p31_p279star	Q34508
Q100000011	p31_p279star	Q34508
Q100000014	p31_p279star	Q34508
Q100000021	p31_p279star	Q34508
Q100000029	p31_p279star	Q34508
Q100000033	p31_p279star	Q34508
Q100000049	p31_p279star	Q34508
Q100328888	p31_p279star	Q34508
Q100370118	p31_p279star	Q34508
Q100431477	p31_p279star	Q34508
Q100477946	p31_p279star	Q34508
Q100982847	p31_p279star	Q34508
Q100982908	p31_p279star	Q34508
Q101077837	p31_p279star	Q34508
Q101079766	p31_p279star	Q34508
Q101094418	p31_p279star	Q34508
Q101243034	p31_p279star	Q34508
Q101246930	p31_p279star	Q34508
Q101246967	p31_p279star	Q34508
Q101568768	p31_p279star	Q34508
Q101771665	p31_p279star	Q34508
Q102013680	p31_p279star	Q34508
Q102046764	p31_p279star	Q34508
Q102117797	p31_p279star	Q34508
Q102163686	p31_p279star	Q34508
Q102163797	p31_p279star	Q34508
Q102225310	p31_p279star	Q34508
Q10267876	p31_p279star	Q34508
Q11931979	p31_p279star	Q34508
Q15527021	p31_p279star	Q34508
Q16014350	p31_p279star	Q

In [35]:
!zcat < {temp}/items.remove.tsv.gz | wc

     467    1401   13994
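
The `wc` output reports 467 lines, which is one header line plus the 466 edges noted in the comments above. The selection that produced these edges can be pictured in plain Python as a join between the `isa` edges and the `P279star` edges. This is an illustrative sketch on in-memory pairs, not the way kypher executes the query. `quote_classes` and `items_to_remove` are hypothetical helper names, and the sketch assumes that the `P279star` file contains a reflexive edge for every class.

```python
def quote_classes(remove_classes):
    """Turn 'Q1, Q2' into '"Q1", "Q2"' for use inside a kypher list."""
    names = [name.strip() for name in remove_classes.split(",") if name.strip()]
    return ", ".join('"{}"'.format(name) for name in names)


def items_to_remove(isa_edges, p279star_edges, classes):
    """Emulate the join: items whose class reaches one of the removed classes.

    isa_edges:      iterable of (item, class) pairs
    p279star_edges: iterable of (class, ancestor) pairs; assumed to include
                    a (class, class) pair for every class
    classes:        the classes to remove
    """
    targets = set(classes)
    ancestors = {}
    for cls, ancestor in p279star_edges:
        ancestors.setdefault(cls, set()).add(ancestor)
    result = set()
    for item, cls in isa_edges:
        # Every removed class reachable from the item's class yields one edge
        for hit in ancestors.get(cls, set()) & targets:
            result.add((item, "p31_p279star", hit))
    return result
```

For example, `quote_classes("Q13442814, Q523, Q16521")` returns the string `"Q13442814", "Q523", "Q16521"`, which is the form expected inside the `--where` clause.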


In [23]:
# !zcat < {temp}/items.remove.tsv.gz | grep 'Q34508'

In [None]:
# !zcat < {temp}/items.remove.tsv.gz | grep 'Q15874936\t'

Collect all the classes of items we will remove, just as a sanity check

In [20]:
!{kypher} -i {temp}/items.remove.tsv.gz \
--match '()-[]->(n2)' \
--return 'distinct n2' \
--limit 10

[2021-02-05 10:24:56 sqlstore]: DROP graph data table graph_4 from /Users/markmann/Downloads/subset/temp.output/items.remove.tsv.gz
[2021-02-05 10:24:56 sqlstore]: IMPORT graph directly into table graph_6 from /Users/markmann/Downloads/subset/temp.output/items.remove.tsv.gz ...
[2021-02-05 10:24:57 query]: SQL Translation:
---------------------------------------------
  SELECT DISTINCT graph_6_c1."node2"
     FROM graph_6 AS graph_6_c1
     LIMIT ?
  PARAS: [10]
---------------------------------------------
node2
Q34508


## Create the reduced edges file

### Remove the items from the claims file and the label, alias and description files
We will be left with `reduced` files whose edges do not mention the unwanted items. The items have to be removed from both the `node1` and the `node2` positions, so we need to run the `ifnotexists` command twice.

Before we start, preview the files to see the column headings and check whether they look sorted.

In [21]:
!$kgtk sort2 -i {temp}/items.remove.tsv.gz -o {temp}/items.remove.sorted.tsv.gz


real	0m0.980s
user	0m0.493s
sys	0m0.193s


In [36]:
!zcat < {temp}/items.remove.sorted.tsv.gz | head | col
# !zgrep 'Q34508' {temp}/items.remove.sorted.tsv.gz -c #466

node1	label	node2
Q100000003	p31_p279star	Q34508
Q100000011	p31_p279star	Q34508
Q100000014	p31_p279star	Q34508
Q100000021	p31_p279star	Q34508
Q100000029	p31_p279star	Q34508
Q100000033	p31_p279star	Q34508
Q100000049	p31_p279star	Q34508
Q100328888	p31_p279star	Q34508
Q100370118	p31_p279star	Q34508


In [23]:
!zcat < {claims} | head -5 | col

zcat: error writing to output: Broken pipe
id	node1	label	node2	rank	node2;wikidatatype
P10-P1628-32b85d-7927ece6-0	P10	P1628	"http://www.w3.org/2006/vcard/ns#Video" normal	url
P10-P1628-acf60d-b8950832-0	P10	P1628	"https://schema.org/video"	normal	url
P10-P1629-Q34508-bcc39400-0	P10	P1629	Q34508	normal	wikibase-item
P10-P1659-P1651-c4068028-0	P10	P1659	P1651	normal	wikibase-property


Remove from the full set of edges those edges that have a `node1` present in `items.remove.sorted.tsv`

In [47]:
!$kgtk ifnotexists -i {claims} -o {temp}/item.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted 


real	4m29.175s
user	4m21.486s
sys	0m2.640s


From the remaining edges, remove those that have a `node2` present in `items.remove.sorted.tsv`

In [51]:
!$kgtk sort2 -i {temp}/item.edges.reduced.tsv.gz -o {temp}/item.edges.reduced.sorted.tsv.gz \
--columns node2 label node1 id


real	7m32.465s
user	3m24.460s
sys	2m55.762s


In [52]:
!$kgtk ifnotexists -i {temp}/item.edges.reduced.sorted.tsv.gz -o {temp}/item.edges.reduced.2.tsv.gz \
--filter-on {temp}/items.remove.sorted.tsv.gz \
--input-keys node2 \
--filter-keys node1 \
--presorted 


real	5m0.397s
user	4m47.034s
sys	0m3.466s
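
Taken together, the two `ifnotexists` passes amount to an anti-join on `node1` followed by an anti-join on `node2`. The sketch below shows the same logic on in-memory tuples. `drop_removed` is a hypothetical helper name. Because it uses a set lookup, it does not depend on sort order, whereas the KGTK commands above are run with `--presorted`.

```python
def drop_removed(edges, removed):
    """Keep only edges whose node1 and node2 are both outside `removed`.

    edges:   iterable of (id, node1, label, node2) tuples
    removed: set of item identifiers, taken from node1 of items.remove
    """
    # Pass 1: drop edges whose subject (node1) is a removed item
    after_node1 = [edge for edge in edges if edge[1] not in removed]
    # Pass 2: drop edges whose object (node2) is a removed item
    return [edge for edge in after_node1 if edge[3] not in removed]
```

Note that `removed` holds the instances listed in the `node1` column of `items.remove.sorted.tsv.gz`, not the class identifier itself.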


Create a file with the labels

In [7]:
!$kgtk ifnotexists -i {labels} -o {temp}/label.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted


real	0m47.814s
user	0m42.495s
sys	0m0.796s


In [None]:
languages = ['en']

In [8]:
#GOAL: Must sort the labels file

#NOT WORKING 
# for lang in languages:
#     cmd = f"kgtk sort2 -i {temp}/label.{lang}.edges.reduced.tsv.gz -o {out}/labels.{lang}.tsv.gz" 
#     !$cmd

#WORKING
!$kgtk sort2 -i {temp}/label.edges.reduced.tsv.gz -o {out}/labels.tsv.gz


real	0m12.035s
user	0m11.791s
sys	0m0.751s


Create a file with the aliases

In [9]:
!$kgtk ifnotexists -i {aliases} -o {temp}/alias.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted


real	0m15.988s
user	0m13.307s
sys	0m0.314s


In [None]:
#NOT WORKING: Languages issue
# for lang in languages:
#     cmd = f"kgtk --debug ifnotexists -i {wiki_root_folder}aliases.{lang}.tsv.gz \
#     -o {temp}/alias.{lang}.edges.reduced.tsv.gz \
#     --filter-on {temp}/items.remove.sorted.tsv.gz \
#     --input-keys node1 \
#     --filter-keys node1 \
#     --presorted"
#     !$cmd

In [None]:
# for lang in languages:
#     cmd = f"kgtk sort2 -i {temp}/alias.{lang}.edges.reduced.tsv.gz -o {out}/aliases.{lang}.tsv.gz" 
#     !$cmd

Create a file with the descriptions

In [10]:
!$kgtk ifnotexists -i {descriptions} -o {temp}/description.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted


real	0m26.461s
user	0m23.693s
sys	0m0.462s


In [None]:
# for lang in languages:
#     cmd = f"kgtk --debug ifnotexists -i {wiki_root_folder}descriptions.{lang}.tsv.gz \
#     -o {temp}/description.{lang}.edges.reduced.tsv.gz \
#     --filter-on {temp}/items.remove.sorted.tsv.gz \
#     --input-keys node1 \
#     --filter-keys node1 \
#     --presorted"
#     !$cmd

In [None]:
# for lang in languages:
#     cmd = f"kgtk sort2 -i {temp}/description.{lang}.edges.reduced.tsv.gz -o {out}/descriptions.{lang}.tsv.gz" 
#     !$cmd

### Produce the output files for claims, labels, aliases and descriptions

In [11]:
!$kgtk sort2 -i {temp}/item.edges.reduced.2.tsv.gz -o {out}/claims.tsv.gz 


real	4m32.234s
user	3m12.227s
sys	1m10.920s


In [12]:
!$kgtk sort2 -i {temp}/label.edges.reduced.tsv.gz -o {out}/labels.en.tsv.gz 


real	0m16.346s
user	0m12.797s
sys	0m0.943s


In [13]:
!$kgtk sort2 -i {temp}/alias.edges.reduced.tsv.gz -o {out}/aliases.en.tsv.gz 


real	0m3.930s
user	0m3.757s
sys	0m0.311s


In [14]:
!$kgtk sort2 -i {temp}/description.edges.reduced.tsv.gz -o {out}/descriptions.en.tsv.gz 


real	0m7.004s
user	0m7.026s
sys	0m0.506s


Sanity checks to see if it looks reasonable

## Tests: Check in each output file whether the class was removed or protected

In [46]:
#test_1/remove-list/Q34508 removed? 
# !zcat < {out}/claims.tsv.gz | head | col
!zgrep 'Q34508' {temp}/item.edges.reduced.2.tsv.gz -c
# !zgrep 'Q34508' {out}/claims.tsv.gz -c #

121
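
A count of 121 does not by itself show that the removal failed. `zgrep -c` counts every line that contains the substring `Q34508`, including edge ids such as `P10-P1629-Q34508-bcc39400-0`. In addition, the filter list holds the instances of the class, so edges from other nodes that point at the class itself may survive. The sketch below counts exact matches per column to tell these cases apart. `count_by_column` is a hypothetical helper name, and the file is assumed to be UTF-8 encoded.

```python
import gzip
from collections import Counter


def count_by_column(path, value, sep="\t"):
    """Count, per column, the rows of a gzipped TSV whose cell equals `value`."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        header = handle.readline().rstrip("\n").split(sep)
        for line in handle:
            cells = line.rstrip("\n").split(sep)
            for name, cell in zip(header, cells):
                if cell == value:
                    counts[name] += 1
    return dict(counts)
```

Calling `count_by_column(path, "Q34508")` on the reduced claims file returns a dictionary keyed by column name. A nonzero `node2` entry would point to edges that reference the class from nodes that were not removed.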


## Create the reduced qualifiers file
We do this by collecting the ids of all edges in the reduced claims file and then selecting the matching qualifiers from the qualifiers file.

We need to join by id, so we need to sort both files by id, node1, label, node2:

- `{quals}` 
- `{out}/claims.tsv.gz` 

In [15]:
!zcat < {quals} | head | column -t -s $'\t' 

zcat: error writing to output: Broken pipe
id                                               node1                             label  node2                          node2;wikidatatype
P10-P1855-Q7378-555592a4-0-P10-8a982d-0          P10-P1855-Q7378-555592a4-0        P10    "Elephants Dream (2006).webm"  commonsMedia
P1000-P1896-f63a36-b84f3cd2-0-P1476-bf511b-0     P1000-P1896-f63a36-b84f3cd2-0     P1476  'FAI records'@en               monolingualtext
P1001-P1855-Q29868931-76b67d84-0-P1001-Q11736-0  P1001-P1855-Q29868931-76b67d84-0  P1001  Q11736                         wikibase-item
P1001-P1855-Q29868931-76b67d84-0-P1001-Q17269-0  P1001-P1855-Q29868931-76b67d84-0  P1001  Q17269                         wikibase-item
P1001-P1855-Q29868931-76b67d84-0-P1001-Q21208-0  P1001-P1855-Q29868931-76b67d84-0  P1001  Q21208                         wikibase-item
P1001-P1855-Q29868931-76b67d84-0-P1001-Q34800-0  P1001-P1855-Q29868931-76b67d84-0  P1001  Q34800                         wikibase-item

Run `ifexists` to select the qualifiers of the remaining edges and write them to `{out}/qualifiers.tsv.gz`. Note that we use `node1` in the qualifier file, matching it to `id` in the `{out}/claims.tsv.gz` file.

In [16]:
!$kgtk ifexists -i {quals} -o {out}/qualifiers.tsv.gz \
--filter-on {out}/claims.tsv.gz \
--input-keys node1 \
--filter-keys id \
--presorted


real	2m21.115s
user	2m8.591s
sys	0m1.904s
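
The `ifexists` step is a semi-join: a qualifier edge is kept only when its `node1` equals the `id` of an edge that survived in the claims file. The sketch below illustrates this on in-memory tuples, using two qualifier rows from the preview above. `keep_qualifiers` is a hypothetical helper name.

```python
def keep_qualifiers(qualifier_edges, claim_ids):
    """Keep qualifier edges whose node1 is the id of a surviving claim.

    qualifier_edges: iterable of (id, node1, label, node2) tuples
    claim_ids:       set of edge ids present in the reduced claims file
    """
    return [edge for edge in qualifier_edges if edge[1] in claim_ids]


qualifiers = [
    ("P10-P1855-Q7378-555592a4-0-P10-8a982d-0", "P10-P1855-Q7378-555592a4-0",
     "P10", '"Elephants Dream (2006).webm"'),
    ("P1000-P1896-f63a36-b84f3cd2-0-P1476-bf511b-0", "P1000-P1896-f63a36-b84f3cd2-0",
     "P1476", "'FAI records'@en"),
]
# Suppose only the first claim survived the reduction
kept = keep_qualifiers(qualifiers, {"P10-P1855-Q7378-555592a4-0"})
print(len(kept))  # 1
```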


Look at the final output for qualifiers

In [17]:
!zcat < {out}/qualifiers.tsv.gz | head | col

zcat: error writing to output: Broken pipe
id	node1	label	node2	node2;wikidatatype
P10-P1855-Q7378-555592a4-0-P10-8a982d-0 P10-P1855-Q7378-555592a4-0	P10	"Elephants Dream (2006).webm"	commonsMedia
P1000-P1896-f63a36-b84f3cd2-0-P1476-bf511b-0	P1000-P1896-f63a36-b84f3cd2-0	P1476	'FAI records'@en	monolingualtext
P1001-P1855-Q29868931-76b67d84-0-P1001-Q11736-0 P1001-P1855-Q29868931-76b67d84-0	P1001	Q11736	wikibase-item
P1001-P1855-Q29868931-76b67d84-0-P1001-Q17269-0 P1001-P1855-Q29868931-76b67d84-0	P1001	Q17269	wikibase-item
P1001-P1855-Q29868931-76b67d84-0-P1001-Q21208-0 P1001-P1855-Q29868931-76b67d84-0	P1001	Q21208	wikibase-item
P1001-P1855-Q29868931-76b67d84-0-P1001-Q34800-0 P1001-P1855-Q29868931-76b67d84-0	P1001	Q34800	wikibase-item
P1001-P1855-Q29868931-76b67d84-0-P1001-Q41079-0 P1001-P1855-Q29868931-76b67d84-0	P1001	Q41079	wikibase-item
P1001-P1855-Q29868931-76b67d84-0-P1001-Q42392-0 P1001-P1855-Q29868931-76b67d84-0	P1001	Q42392	wikibase-item
P1001-P1855-Q29868931-76b67d84-

In [18]:
# kgtk_path = "/Users/pedroszekely/Documents/GitHub/kgtk"
kgtk_path = "/Users/markmann/Desktop/CKG/kgtk_subset/kgtk"
os.environ["EXAMPLES_DIR"] = kgtk_path + "/examples"
os.environ["USECASE_DIR"] = kgtk_path + "/use-cases"
os.environ["TEMP"] = temp
os.environ["OUT"] = out
os.environ["DATATYPES"] = datatypes
os.environ["METADATA"] = metadata

In [19]:
!ls "$TEMP"

Wikidata Useful Files Out.ipynb  item.edges.reduced.tsv.gz
alias.edges.reduced.tsv.gz       items.remove.sorted.tsv.gz
description.edges.reduced.tsv.gz items.remove.tsv.gz
item.edges.reduced.2.tsv.gz      label.edges.reduced.tsv.gz
item.edges.reduced.sorted.tsv.gz partition-wikidata.out.ipynb


In [20]:
!ls "$OUT"

aliases.en.tsv.gz                  metadata.types.tsv.gz
all.tsv.gz                         parts
claims.tsv.gz                      qualifiers.tsv.gz
descriptions.en.tsv.gz             temp.useful_files
labels.en.tsv.gz                   useful_files
labels.tsv.gz                      wikidata.sqlite3.db
metadata.property.datatypes.tsv.gz


In [21]:
!kgtk cat \
-i "$OUT"/aliases.en.tsv.gz \
-i "$OUT"/descriptions.en.tsv.gz \
-i "$OUT"/qualifiers.tsv.gz \
-i "$OUT"/claims.tsv.gz \
-i "$OUT"/labels.en.tsv.gz \
-i "$OUT"/metadata.property.datatypes.tsv.gz \
-i "$OUT"/metadata.types.tsv.gz \
-o "$OUT"/all.tsv.gz

In [22]:
!ls {os.environ["EXAMPLES_DIR"] + "/partition-wikidata.ipynb"}

/Users/markmann/Desktop/CKG/kgtk_subset/kgtk/examples/partition-wikidata.ipynb


In [None]:
pm.execute_notebook(
    os.environ["EXAMPLES_DIR"] + "/partition-wikidata.ipynb",
    os.environ["TEMP"] + "/partition-wikidata.out.ipynb",
    parameters=dict(
        wikidata_input_path = os.environ["OUT"] + "/all.tsv.gz",
        wikidata_parts_path = os.environ["OUT"] + "/parts",
        temp_folder_path = os.environ["OUT"] + "/parts/temp",
        sort_extras = "--buffer-size 30% --temporary-directory $OUT/parts/temp",
        verbose = False
    )
);

In [None]:
pm.execute_notebook(
    os.environ["USECASE_DIR"] + "/Wikidata Useful Files.ipynb",
    os.environ["TEMP"] + "/Wikidata Useful Files Out.ipynb",
    parameters=dict(
        output_path = os.environ["OUT"],
        output_folder = "useful_files",
        temp_folder = "temp.useful_files",
        wiki_root_folder = os.environ["OUT"] + "/parts/",
        cache_path = os.environ["OUT"] + "/temp.useful_files",
        languages = 'en',
        compute_pagerank = True,
        delete_database = False
    )
);

## Sanity checks

- After removing classes, check that the class does not occur in the resulting claims file.
- After protecting classes, check that the class still occurs in the resulting claims file.

Check that the removed class `Q34508` no longer appears in the claims file.

In [None]:
!{kypher} -i {out}/claims.tsv.gz \
--match '(n1:Q34508)-[l]->(n2)' \
--limit 10 \
| col

In [None]:
!{kypher} -i {out}/claims.tsv.gz \
--match '(n1:P131)-[l]->(n2)' \
--limit 10 \
| col

## Compute the derived files using the `Wikidata Useful Files` Jupyter notebook

Compute `claims.wikibase-item.tsv.gz`. The Wikidata partitioner would normally produce this file, but we are not using the partitioner here yet.

In [None]:
!zcat < "{datatypes}" | head | col

In [None]:
!{kypher} -i {out}/claims.tsv.gz -i "{datatypes}" -o {out}/claims.wikibase-item.tsv.gz \
--match 'claims: (n1)-[l {label: p}]->(n2), datatypes: (p)-[:datatype]->(:`wikibase-item`)' \
--return 'l as id, n1 as node1, p as label, n2 as node2' \
--order-by 'l' 

To compute the derived files we use papermill to run the `Wikidata Useful Files` notebook.

In [None]:
pm.execute_notebook(
    notebooks_folder + useful_files_notebook,
    temp + "/useful_files_notebook_output.ipynb",
    parameters=dict(
        output_path=output_path,
        output_folder=output_folder,
        temp_folder=temp_folder,
        wiki_root_folder=wiki_root_folder,
        claims_file="claims.tsv.gz",
        label_file="labels.en.tsv.gz",
        alias_file="aliases.en.tsv.gz",
        description_file="descriptions.en.tsv.gz",
        item_file="claims.wikibase-item.tsv.gz",
        cache_path=cache_path,
        delete_database=delete_database,
        compute_pagerank=compute_pagerank
    )
)

Look at the columns so we know how to construct the kypher query

## Summary of results

In [None]:
!ls -lh {out}/*wikidataos.*

In [None]:
!zcat < {out}/wikidataos.all.tsv.gz | wc

## Verification

The edges file must contain edges for properties; this was not the case on 2020-11-10.


In [None]:
!{kypher} -i "{claims}" \
--match '(:P10)-[l]->(n2)' \
--limit 10

## Concatenate files to get the `all` file

In [None]:
lad = []
if 'en' not in languages:
    languages.append('en')
for lang in languages:
    lad.append(f"{out}/labels.{lang}.tsv.gz")
    lad.append(f"{out}/aliases.{lang}.tsv.gz")
    lad.append(f"{out}/descriptions.{lang}.tsv.gz")
lad_file_list = " ".join(lad)

In [None]:
!kgtk cat -i {out}/claims.tsv.gz \
{lad_file_list} \
{out}/qualifiers.tsv.gz \
{out}/metadata.pagerank.undirected.tsv.gz \
{out}/metadata.pagerank.directed.tsv.gz \
{out}/metadata.in_degree.tsv.gz \
{out}/metadata.out_degree.tsv.gz \
-o {out}/wikidataos.all.tsv.gz

## Concatenate files to get the `all for triples` file


In [None]:
!kgtk cat -i $OUT/wikidataos.all.tsv.gz \
$OUT/derived.P31.tsv.gz \
$OUT/derived.P279.tsv.gz \
$OUT/derived.isa.tsv.gz \
$OUT/derived.P279star.tsv.gz \
-o $OUT/wikidataos.all.for.triples.tsv.gz

## Filter out `novalue`, `somevalue` and `P9`

In [None]:
!kgtk filter -i $OUT/wikidataos.all.for.triples.tsv.gz \
    -o $OUT/wikidataos.all.for.triples.filtered.tsv.gz \
    -p ';;somevalue,novalue,P9' --invert
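
In plain Python, the effect of this filter can be sketched as follows, assuming that the pattern `;;somevalue,novalue,P9` constrains only the third field (`node2`) and that `--invert` keeps the edges that do not match. `filter_out_node2` is a hypothetical helper name.

```python
def filter_out_node2(edges, unwanted=("somevalue", "novalue", "P9")):
    """Drop edges whose node2 is one of the unwanted values.

    edges: iterable of (node1, label, node2) tuples
    """
    blocked = set(unwanted)
    return [edge for edge in edges if edge[2] not in blocked]
```

Edges whose `node1` or `label` happens to equal one of the listed values are left untouched under this reading of the pattern.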

## Add ids for any edge with missing id

In [None]:
!kgtk add-id -i $OUT/wikidataos.all.for.triples.filtered.tsv.gz \
-o $OUT/wikidataos.all.for.triples.filtered.id.tsv.gz \
--id-style wikidata

## Sort by `id`

In [None]:
!kgtk sort2 -i $OUT/wikidataos.all.for.triples.filtered.id.tsv.gz \
-o $OUT/wikidataos.all.for.triples.filtered.id.sorted.tsv.gz \
-c id