# Generating Subsets of Wikidata

>**Warning:** This notebook is under construction and does not work.

## Purpose

>This notebook creates smaller subgraphs from a larger input Wikidata graph. Users provide a list of Wikidata classes (**QNodes**) to remove and a list of classes to preserve, and the notebook produces the corresponding subset of Wikidata.


### Batch Invocation
Example batch command. The second argument is a notebook where the output will be stored. You can load it to see progress.

**TODO:** update the example invocation.


```
papermill Wikidata\ Useful\ Files.ipynb useful-files.out.ipynb \
-p wiki_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/all.tsv.gz \
-p label_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.label.en.tsv.gz \
-p item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.wikibase-item.tsv.gz \
-p property_item_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/part.property.wikibase-item.tsv.gz \
-p qual_file /Volumes/GoogleDrive/Shared\ drives/KGTK-public-graphs/wikidata-20200803-v3/qual.tsv.gz \
-p output_path <local folder> \
-p output_folder useful_files_v4 \
-p temp_folder temp.useful_files_v4 \
-p delete_database no \
-p compute_pagerank no \
-p languages es,ru,zh-cn 
```

In [1]:
# Parameters

# Folder on local machine where to create the output and temporary folders
# output_path = "/Users/pedroszekely/Downloads/kypher"
output_path = "/Users/markmann/Downloads/subset"

# The names of the output and temporary folders
output_folder = "output"
temp_folder = "temp.output"

# Classes to remove
# remove_classes = "Q13442814, Q523, Q16521, Q318, Q7318358, Q7187, Q11173, Q8054, Q5, Q13100073, Q8502, Q3305213, Q4022, Q79007, Q1931185, Q30612, Q101352, Q54050, Q13433827, Q2668072, Q23397, Q3863, Q11424, Q482994, Q47150325, Q16970, Q18593264, Q355304, Q9842, Q7725634, Q27020041, Q56436498, Q2154519, Q61443690, Q49008, Q3331189, Q47521, Q5084, Q19389637, Q21014462, Q4164871, Q11060274, Q5633421, Q39816, Q5185279, Q55488, Q134556, Q22698, Q985488, Q1260524, Q204107, Q2225692, Q215380, Q71963409, Q452237, Q93184, Q12323"

# The location of input files
# wiki_root_folder = "/Volumes/GoogleDrive/Shared\ drives/KGTK/datasets/wikidata-20200803-v4/"
# wiki_root_folder = "/Users/pedroszekely/Downloads/kypher/wikidataos-v4/"
wiki_root_folder = "/Users/markmann/Google\ Drive\ File\ Stream/Shared\ drives/KGTK/datasets/wikidataos-v4-mm-2/"

claims_file = "claims.tsv.gz"
label_file = "labels.en.tsv.gz"
alias_file = "aliases.en.tsv.gz"
description_file = "descriptions.en.tsv.gz"
item_file = "claims.wikibase-item.tsv.gz"
qual_file = "qualifiers.tsv.gz"
property_datatypes_file = "metadata.property.datatypes.tsv.gz"
isa_file = "derived.isa.tsv.gz"
p279star_file = "derived.P279star.tsv.gz"

# Useful files Jupyter notebook
useful_files_notebook = "Wikidata Useful Files.ipynb"
notebooks_folder = "/Users/markmann/Desktop/CKG/kgtk_subset/kgtk/examples/"

# Location of the cache database for kypher
# cache_path = "/Users/pedroszekely/Downloads/kypher/wikidataos-v4"
cache_path = f'{output_path}/{output_folder}'

#Additional parameters
delete_database = "no"
compute_pagerank = "no"
languages = ""

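### Needs fixing
# Whether to delete the cache database.
# Accept "yes"/"no" strings (as passed with papermill -p) as well as real booleans.
def _is_yes(value):
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == 'yes'

delete_database = _is_yes(delete_database)

### Needs fixing
compute_pagerank = _is_yes(compute_pagerank)

# Turn the comma-separated language string into a list (empty string -> empty list)
if isinstance(languages, str):
    languages = [lang.strip() for lang in languages.split(',') if lang.strip()]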

In [2]:
import io
import os
import subprocess
import sys
import re

import numpy as np
import pandas as pd

import altair as alt

import papermill as pm
import gzip
import time
from operator import itemgetter

from utils import Remove_Classes

## Set up variables for files

In [3]:
# Location of the kypher cache database
if cache_path:
    store = "{}/wikidata.sqlite3.db".format(cache_path)
else:
    store = "{}/{}/wikidata.sqlite3.db".format(output_path, temp_folder)

# The environment variable (for shell commands) and the Python variable share one value
os.environ['STORE'] = store

out = "{}/{}".format(output_path, output_folder)
temp = "{}/{}".format(output_path, temp_folder)

claims = wiki_root_folder + claims_file
labels = wiki_root_folder + label_file
aliases = wiki_root_folder + alias_file
descriptions = wiki_root_folder + description_file
items = wiki_root_folder + item_file
quals = wiki_root_folder + qual_file
datatypes = wiki_root_folder + property_datatypes_file
isa = wiki_root_folder + isa_file
p279star = wiki_root_folder + p279star_file

# shortcuts to commands
kgtk = "time kgtk --debug"
kypher = "kgtk query --debug --graph-cache " + store

Create the subfolders for the output files and the temporary files.

In [4]:
# `!cd` runs in its own subshell and does not affect later commands;
# `out` and `temp` already include `output_path`, so no `cd` is needed.
# `-p` avoids the "File exists" error when the folders are already there.
!mkdir -p {out}
!mkdir -p {temp}


Clean up the output and temp folders before we start (only when `delete_database` is set).

In [None]:
# !rm {out}/*.tsv {out}/*.tsv.gz
# !rm {temp}/*.tsv {temp}/*.tsv.gz

if delete_database:
    !rm {out}/*.tsv {out}/*.tsv.gz
    !rm {temp}/*.tsv {temp}/*.tsv.gz

### Preview the input files

It is always good practice to peek at the files to make sure the column headings are what we expect.
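
The same check can also be done from Python. The sketch below is only an illustration: `peek_header` and `missing_columns` are hypothetical helpers, and the expected column names (`id`, `node1`, `label`, `node2`) are an assumption based on the columns this notebook sorts and joins on.

```python
import gzip
from itertools import islice

def peek_header(path, n_rows=3):
    """Return the column names and the first n_rows data rows of a gzipped TSV file."""
    with gzip.open(path, "rt", encoding="utf-8") as fd:
        header = fd.readline().rstrip("\n").split("\t")
        rows = [line.rstrip("\n").split("\t") for line in islice(fd, n_rows)]
    return header, rows

def missing_columns(header, expected=("id", "node1", "label", "node2")):
    """List the expected column names that are absent from the header."""
    return [name for name in expected if name not in header]
```

In this notebook the paths carry shell-style escapes, so a call would look like `peek_header(claims.replace('\\', ''))`.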

In [None]:
# !gzcat {claims} | head
# !zgrep 'Q34508' {claims} #Exists
# !zgrep 'Q34508' {labels} #Exists
# !zgrep 'Q34508' {aliases} #Exists
# !zgrep 'Q34508' {descriptions} #Exists
# !{descriptions}

!zgrep '\tQ34508\t' {claims} -c

In [None]:
!{kypher} -i {claims} \
--match '()-[]->()' \
--limit 10

## Creating a list of all the items we want to remove

### Compute the items to be removed

Use the methods of the `Remove_Classes` helper below to build a list of classes (QNodes) to remove from the Wikidata input files.

In [None]:
class Remove_Classes():
    
    ######### INIT #########
    def __init__(self, temp):
        '''temp - should be the path to temp folder'''
        self.temp = temp
        self.classes_to_remove = set()
        self.classes_to_protect = set()
        
    def count_classes(self, isa, p279star, claims):
        '''Finds all classes from isa and p279star files, then counts instances of classes in claims file'''
        #Get union of classes from isa and p279star files
        self.class_set = self.find_all_classes(isa, p279star)

        #Query the claims file, and get a count of each class
        self.class_counts = self.get_class_counts(claims)
    
    def find_all_classes(self, isa, p279star): #called by count_classes()
        '''Return the set of values found in the third column of the isa and p279star files'''
        class_set = set()
        for file in (isa, p279star):
            # Paths carry shell-style escapes ("\ "); strip them before opening in Python
            with gzip.open(file.replace('\\', ''), 'rt') as fd:
                next(fd, None)  # skip the header row
                for line in fd:
                    class_set.add(line.split('\t')[2].strip())
        return class_set
    
    def get_class_counts(self, claims): #called by count_classes()
        '''Count how often each known class appears in the second or fourth column of the claims file'''
        class_counts = dict()
        with gzip.open(claims.replace('\\', ''), 'rt') as fd:
            next(fd, None)  # skip the header row
            for line in fd:
                fields = line.split('\t')
                for n in (fields[1], fields[3]):
                    if n in self.class_set:
                        class_counts[n] = class_counts.get(n, 0) + 1
        return class_counts
    
    ######### METHODS #########                       
    def add_instances(self, instances, **kwargs):
        '''Identify the set of all classes for a list of instances
        instances - a list of Wikidata instances <list> 
        kwargs: remove - whether to add to the remove_list or protect_list (True/False)'''
        if 'remove' not in kwargs:
            print('Error: Please specify remove parameter; ex: remove=False, remove=True')
            return
        for qnode in instances:
            # !wd u {qnode} > {self.temp}/summary.txt #ipynb
            command = f"wd u {qnode} > {self.temp}/summary.txt" #py
            os.system(command)
            with open(f'{self.temp}/summary.txt', "r") as fd:
                lines = fd.readlines()
            for line in lines:
                if line.split(":")[0] in ['instance of (P31)', 'subclass of (P279)']:
                    classes_raw = line.split(":")[1].split('|')
                    for c in classes_raw:
                        # Keep the text inside the first pair of parentheses; skip entries without one
                        found = re.findall(r'\(.*?\)', c)
                        if not found:
                            continue
                        res = found[0].replace('(','').replace(')','')

                        #Add to remove_list or protect_list based on `remove` setting
                        if kwargs['remove']: 
                            self.classes_to_remove.add(res)
                        else:
                            print('result: ', res)
                            self.add_classes_to_protect([res])
        
    def add_classes_to_remove(self, **kwargs):
        '''Add classes manually to set of classes to remove
        kwargs: classes - a list of Wikidata classes <list>
        kwargs: size - adds classes with # instances <= size'''
        if 'classes' in kwargs:
            if isinstance(kwargs['classes'], list):
                self.classes_to_remove.update(kwargs['classes'])
            else: 
                print('must pass in a list of classes')
        if 'size' in kwargs:
            for key, count in self.class_counts.items():
                if count <= kwargs['size']: 
                    self.classes_to_remove.add(key)

    def add_classes_to_protect(self, classes):
        '''Add classes manually to set of classes to protect
        args: classes - list of Wikidata classes'''
        for c in list(classes):
            # !wdtaxonomy -r {c} -f csv -o {self.temp}/superclass_raw.txt #ipynb
            command = f'wdtaxonomy -r {c} -f csv -o {self.temp}/superclass_raw.txt' #py
            os.system(command)
            with open(f'{self.temp}/superclass_raw.txt', "r") as fd:
                lines = fd.readlines()
            for line in lines[1:]:
                fields = line.split(',')
                # Skip rows without a second column or whose second column is not a QNode
                if len(fields) > 1 and fields[1][:1].lower() == 'q':
                    self.classes_to_protect.add(fields[1])
                    
    def check_conflict(self):
        '''Check if any conflicts exist between classes_to_remove and classes_to_protect.
        If there is a conflict, point out the problematic remove-protect pair.'''
        
        # TODO: not implemented yet
        

In [None]:
#WORKING
rc = Remove_Classes(temp)
rc.count_classes(isa, p279star, claims)

# Test 1a: Remove instances (Q5451712, Fireball)
instances = ['Q2468862']
rc.add_instances(instances, remove=True)
print(len(rc.classes_to_remove)) #>> 1

#Test 1b: Remove classes (Q30612, clinical trial)
classes = ['Q30612']
rc.add_classes_to_remove(classes=classes)
print(len(rc.classes_to_remove)) #>> 2

# Test 1c: Remove classes with size <= 5
rc.add_classes_to_remove(size=5)
print(len(rc.classes_to_remove)) #>> 235179

### Compute the items to be protected

Use the methods of the `Remove_Classes` helper to identify classes (QNodes) to preserve. Also check for conflicts with the list of classes to remove, and display any conflicts to the user.

In [13]:
#TESTING
rc = Remove_Classes(temp)

#Test 2a: Protect instances (Q15874936, Michelob)
instances = ['Q15874936']
rc.add_instances(instances, remove=False)
print(len(rc.classes_to_protect)) #>> 90

#Test 2b-v1: Protect classes (Q44, beer)
# classes = ['Q44']
# rc.add_classes_to_protect(classes)
# print(len(rc.classes_to_protect)) #>> single: size = 31, union_2a: size = 90 (complete overlap)

#Test 2b-v2: Protect classes (Q34508, videotape)
classes = ['Q34508']
rc.add_classes_to_protect(classes)
print(len(rc.classes_to_protect)) #>> single: size = 60, union_2a: size = 103 (partial overlap)

result:  Q15075508
90
103


Check for any conflicts between `classes_to_remove` and `classes_to_protect` and let the user know in `stdout`.
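
No cell implements this check yet. A minimal sketch, assuming a conflict simply means that the same class appears in both sets, could look like the following. The function names are hypothetical, and the sketch does not look for remove/protect pairs that are related only through the subclass hierarchy.

```python
def find_conflicts(classes_to_remove, classes_to_protect):
    """Return the classes that appear in both sets, sorted for stable output."""
    return sorted(set(classes_to_remove) & set(classes_to_protect))

def report_conflicts(classes_to_remove, classes_to_protect):
    """Print each conflicting class to stdout and return the list of conflicts."""
    conflicts = find_conflicts(classes_to_remove, classes_to_protect)
    if conflicts:
        for qnode in conflicts:
            print(f"Conflict: {qnode} is marked both for removal and for protection")
    else:
        print("No conflicts found")
    return conflicts
```

With the helper above it would be called as `report_conflicts(rc.classes_to_remove, rc.classes_to_protect)`.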

### Collect testing metrics

For each file, count the number of classes and instances of examples we are testing. 
We will then remove these classes, and check that the notebook is removing classes as expected. 

In [None]:
files = [('claims', claims), ('labels', labels), ('aliases', aliases), ('descriptions', descriptions)]
test_classes = ["Q281", "Q30612"]

#Test 1a: Count class (Q281, whisky) of given instance (Q5451712, Fireball)
#Test 1b: Count class (Q30612, clinical trial)
file_counts = {'before': dict(), 'after': dict()}
for name, path in files:
    file_counts['before'][name] = {c: 0 for c in test_classes}
    # Stream the file line by line instead of loading it all into memory
    with gzip.open(path.replace('\\', ''), 'rt') as fd:
        for line in fd:
            for c in test_classes:
                if f"\t{c}\t" in line:
                    file_counts['before'][name][c] += 1

#Test 1c: Count the number of classes with <= 5 instances
size = 5
count_classes_small = 0
for key in rc.class_counts.keys():
    if rc.class_counts[key] <= size: 
        count_classes_small += 1
# count_classes_small

In [None]:
# A cell only displays its last expression, so print both values explicitly
print(file_counts) #1a, 1b
print(count_classes_small) #1c

Compose the kypher command to remove the classes

In [None]:
!zcat < {isa} | head | col

Run the command; the items to remove will be written to `{temp}/items.remove.tsv.gz`.

In [None]:
", ".join(list(rc.classes_to_remove))
# classes = ", ".join(list(map(lambda x: '"{}"'.format(x), remove_classes.replace(" ", "").split(","))))

# !{kypher}  -i {isa} -i {p279star} -o {temp}/items.remove.tsv.gz \
# --match 'isa: (n1)-[:isa]->(c), P279star: (c)-[]->(class)' \
# --where 'class in [{classes}]' \
# --return 'distinct n1, "p31_p279star" as label, class as node2'
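
The commented-out query above expects `classes` to be a comma-separated list of double-quoted QNodes. As a hedged sketch, the list could be built from a set such as `rc.classes_to_remove` like this (`quote_classes` and `where_clause` are hypothetical helper names, and sorting is only there to make the output reproducible):

```python
def quote_classes(classes):
    """Build the comma-separated, double-quoted list used inside `class in [...]`."""
    return ", ".join('"{}"'.format(c) for c in sorted(classes))

def where_clause(classes):
    """Build the text of the --where argument for the given classes."""
    return "class in [{}]".format(quote_classes(classes))
```

Note that a very large set of classes produces a very long command line, so this approach is probably only practical for short lists.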

Preview the file

In [None]:
!zcat < {temp}/items.remove.tsv.gz | head | col

In [None]:
!zcat < {temp}/items.remove.tsv.gz | wc

In [None]:
!zcat < {temp}/items.remove.tsv.gz | grep 'Q502268\t'

In [None]:
!zcat < {temp}/items.remove.tsv.gz | grep 'Q15874936\t'

Collect all the classes of items we will remove, just as a sanity check

In [None]:
!{kypher} -i {temp}/items.remove.tsv.gz \
--match '()-[]->(n2)' \
--return 'distinct n2' \
--limit 10

## Create the reduced edges file

### Remove the items from the all.tsv and the label, alias and description files
We will be left with `reduced` files where the edges do not have the unwanted items. We have to remove them from the node1 and node2 positions, so we need to run the ifnotexists commands twice.
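
The two-pass removal can be pictured in plain Python. This is only an illustrative sketch on in-memory edges (the function name and the dict-based edge layout are assumptions), not how `kgtk ifnotexists` is implemented:

```python
def drop_edges_touching(edges, items_to_remove):
    """Keep only the edges whose node1 and node2 are both outside items_to_remove."""
    remove = set(items_to_remove)
    # First pass: drop edges whose node1 is an unwanted item
    pass1 = [e for e in edges if e["node1"] not in remove]
    # Second pass: from what is left, drop edges whose node2 is an unwanted item
    return [e for e in pass1 if e["node2"] not in remove]
```

An edge survives only if neither end is in the remove list, which is why a single pass over `node1` is not enough.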

Before we start, preview the files to see the column headings and check whether they look sorted.

In [None]:
!$kgtk sort2 -i {temp}/items.remove.tsv.gz -o {temp}/items.remove.sorted.tsv.gz

In [None]:
!zcat < {temp}/items.remove.sorted.tsv.gz | head | col

In [None]:
!zcat < "{claims}" | head -5 | col

Remove from the full set of edges those edges that have a `node1` present in `items.remove.sorted.tsv`

In [None]:
!$kgtk ifnotexists -i "{claims}" -o {temp}/item.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted 

From the remaining edges, remove those that have a `node2` present in `items.remove.sorted.tsv`

In [None]:
!$kgtk sort2 -i {temp}/item.edges.reduced.tsv.gz -o {temp}/item.edges.reduced.sorted.tsv.gz \
--columns node2 label node1 id

In [None]:
!$kgtk ifnotexists -i {temp}/item.edges.reduced.sorted.tsv.gz -o {temp}/item.edges.reduced.2.tsv.gz \
--filter-on {temp}/items.remove.sorted.tsv.gz \
--input-keys node2 \
--filter-keys node1 \
--presorted 

Create a file with the labels

In [None]:
!$kgtk ifnotexists -i {labels} -o {temp}/label.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

In [None]:
# Must be a list: iterating over the string 'en' would yield 'e' and 'n'
languages = ['en']

In [None]:
# Mirrors the alias and description loops: build the per-language reduced label files
for lang in languages:
    cmd = f"kgtk --debug ifnotexists -i {wiki_root_folder}labels.{lang}.tsv.gz \
    -o {temp}/label.{lang}.edges.reduced.tsv.gz \
    --filter-on {temp}/items.remove.sorted.tsv.gz \
    --input-keys node1 \
    --filter-keys node1 \
    --presorted"
    !$cmd

In [None]:
for lang in languages:
    cmd = f"kgtk sort2 -i {temp}/label.{lang}.edges.reduced.tsv.gz -o {out}/labels.{lang}.tsv.gz" 
    !$cmd

Create a file with the aliases

In [None]:
!$kgtk ifnotexists -i {aliases} -o {temp}/alias.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

In [None]:
for lang in languages:
    cmd = f"kgtk --debug ifnotexists -i {wiki_root_folder}aliases.{lang}.tsv.gz \
    -o {temp}/alias.{lang}.edges.reduced.tsv.gz \
    --filter-on {temp}/items.remove.sorted.tsv.gz \
    --input-keys node1 \
    --filter-keys node1 \
    --presorted"
    !$cmd

In [None]:
for lang in languages:
    cmd = f"kgtk sort2 -i {temp}/alias.{lang}.edges.reduced.tsv.gz -o {out}/aliases.{lang}.tsv.gz" 
    !$cmd

Create a file with the descriptions

In [None]:
!$kgtk ifnotexists -i {descriptions} -o {temp}/description.edges.reduced.tsv.gz \
--filter-on {temp}/items.remove.sorted.tsv.gz \
--input-keys node1 \
--filter-keys node1 \
--presorted

In [None]:
for lang in languages:
    cmd = f"kgtk --debug ifnotexists -i {wiki_root_folder}descriptions.{lang}.tsv.gz \
    -o {temp}/description.{lang}.edges.reduced.tsv.gz \
    --filter-on {temp}/items.remove.sorted.tsv.gz \
    --input-keys node1 \
    --filter-keys node1 \
    --presorted"
    !$cmd

In [None]:
for lang in languages:
    cmd = f"kgtk sort2 -i {temp}/description.{lang}.edges.reduced.tsv.gz -o {out}/descriptions.{lang}.tsv.gz" 
    !$cmd

### Produce the output files for claims, labels, aliases and descriptions

In [None]:
!$kgtk sort2 -i {temp}/item.edges.reduced.2.tsv.gz -o {out}/claims.tsv.gz 

In [None]:
!$kgtk sort2 -i {temp}/label.edges.reduced.tsv.gz -o {out}/labels.en.tsv.gz 

In [None]:
!$kgtk sort2 -i {temp}/alias.edges.reduced.tsv.gz -o {out}/aliases.en.tsv.gz 

In [None]:
!$kgtk sort2 -i {temp}/description.edges.reduced.tsv.gz -o {out}/descriptions.en.tsv.gz 

### Check test results
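
This section is still empty. One possible check, sketched under the assumption that `file_counts['after']` is filled in the same way as `file_counts['before']` but from the output files, is to list every test class that still shows up after removal. The helper name is hypothetical.

```python
def remaining_after_removal(file_counts):
    """Return {file: {class: count}} for every test class still present after removal."""
    leftovers = {}
    for name, counts in file_counts.get("after", {}).items():
        still_there = {c: n for c, n in counts.items() if n > 0}
        if still_there:
            leftovers[name] = still_there
    return leftovers
```

An empty result would mean that none of the test classes were found in the output files.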

## Create the reduced qualifiers file
We do this by finding all the ids in the reduced edges file and then selecting the matching rows from the qualifiers file.

We need to join by id, so we need to sort both files by id, node1, label, node2:

- `{quals}` 
- `{out}/claims.tsv.gz` 

In [None]:
!zcat < "{quals}" | head | column -t -s $'\t' 

Run `ifexists` to select the qualifiers for the edges in `{out}/claims.tsv.gz` and write them to `{out}/qualifiers.tsv.gz`. Note that we match `node1` in the qualifier file to `id` in the claims file.
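
Conceptually this is a semi-join: a qualifier is kept only if the edge it qualifies survived the reduction. A small in-memory sketch (hypothetical function name and dict-based rows, not the `kgtk ifexists` implementation):

```python
def keep_qualifiers(qualifiers, claim_edges):
    """Keep the qualifier rows whose node1 is the id of a surviving claim edge."""
    claim_ids = {edge["id"] for edge in claim_edges}
    return [q for q in qualifiers if q["node1"] in claim_ids]
```

Because the lookup is by id, both inputs need an `id` column on the claims side and a `node1` column on the qualifier side.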

In [None]:
!$kgtk ifexists -i "{quals}" -o {out}/qualifiers.tsv.gz \
--filter-on {out}/claims.tsv.gz \
--input-keys node1 \
--filter-keys id \
--presorted

Look at the final output for qualifiers

In [None]:
!zcat < {out}/qualifiers.tsv.gz | head | col

In [None]:
kgtk_path = "/Users/pedroszekely/Documents/GitHub/kgtk"
os.environ["EXAMPLES_DIR"] = kgtk_path + "/examples"
os.environ["USECASE_DIR"] = kgtk_path + "/use-cases"
os.environ["TEMP"] = temp
os.environ["OUT"] = out

In [None]:
!ls "$TEMP"

In [None]:
!ls "$OUT"

In [None]:
!echo $kgtk

In [None]:
!kgtk cat \
-i "$OUT"/aliases.en.tsv.gz \
-i "$OUT"/descriptions.en.tsv.gz \
-i "$OUT"/qualifiers.tsv.gz \
-i "$OUT"/claims.tsv.gz \
-i "$OUT"/labels.en.tsv.gz \
-i "$OUT"/metadata.property.datatypes.tsv.gz \
-i "$OUT"/metadata.types.tsv.gz \
-o "$OUT"/all.tsv.gz

In [None]:
!ls {os.environ["EXAMPLES_DIR"] + "/partition-wikidata.ipynb"}

In [None]:
pm.execute_notebook(
    os.environ["EXAMPLES_DIR"] + "/partition-wikidata.ipynb",
    os.environ["TEMP"] + "/partition-wikidata.out.ipynb",
    parameters=dict(
        wikidata_input_path = os.environ["OUT"] + "/all.tsv.gz",
        wikidata_parts_path = os.environ["OUT"] + "/parts",
        temp_folder_path = os.environ["OUT"] + "/parts/temp",
        sort_extras = "--buffer-size 30% --temporary-directory $OUT/parts/temp",
        verbose = False
    )
);

In [None]:
pm.execute_notebook(
    os.environ["USECASE_DIR"] + "/Wikidata Useful Files.ipynb",
    os.environ["TEMP"] + "/Wikidata Useful Files Out.ipynb",
    parameters=dict(
        output_path = os.environ["OUT"],
        output_folder = "useful_files",
        temp_folder = "temp.useful_files",
        wiki_root_folder = os.environ["OUT"] + "/parts/",
        cache_path = os.environ["OUT"] + "/temp.useful_files",
        languages = 'en',
        compute_pagerank = True,
        delete_database = False
    )
);

## Sanity checks

In [None]:
!{kypher} -i {out}/claims.tsv.gz \
--match '(n1:Q368441)-[l]->(n2)' \
--limit 10 \
| col

In [None]:
!{kypher} -i {out}/claims.tsv.gz \
--match '(n1:P131)-[l]->(n2)' \
--limit 10 \
| col

## Compute the derived files using the `Wikidata Useful Files` Jupyter notebook

Compute `claims.wikibase-item.tsv.gz` which would be computed by the Wikidata partitioner, but we are not using it here yet

In [None]:
!zcat < "{datatypes}" | head | col

In [None]:
!{kypher} -i {out}/claims.tsv.gz -i "{datatypes}" -o {out}/claims.wikibase-item.tsv.gz \
--match 'claims: (n1)-[l {label: p}]->(n2), datatypes: (p)-[:datatype]->(:`wikibase-item`)' \
--return 'l as id, n1 as node1, p as label, n2 as node2' \
--order-by 'l' 

To compute the derived files we use papermill to run the `Wikidata Useful Files` notebook.

In [None]:
pm.execute_notebook(
    notebooks_folder + useful_files_notebook,
    temp + "/useful_files_notebook_output.ipynb",
    parameters=dict(
        output_path=output_path,
        output_folder=output_folder,
        temp_folder=temp_folder,
        wiki_root_folder=wiki_root_folder,
        claims_file="claims.tsv.gz",
        label_file="labels.en.tsv.gz",
        alias_file="aliases.en.tsv.gz",
        description_file="descriptions.en.tsv.gz",
        item_file="claims.wikibase-item.tsv.gz",
        cache_path=cache_path,
        delete_database=delete_database,
        compute_pagerank=compute_pagerank
    )
)

Look at the columns so we know how to construct the kypher query

## Summary of results

In [None]:
!ls -lh {out}/*wikidataos.*

In [None]:
!zcat < {out}/wikidataos.all.tsv.gz | wc

## Verification

The edges file must contain edges for properties; this was not the case on 2020-11-10.


In [None]:
!{kypher} -i "{claims}" \
--match '(:P10)-[l]->(n2)' \
--limit 10

## Concatenate files to get the `all` file

In [None]:
lad = []
if 'en' not in languages:
    languages.append('en')
for lang in languages:
    lad.append(f"{out}/labels.{lang}.tsv.gz")
    lad.append(f"{out}/aliases.{lang}.tsv.gz")
    lad.append(f"{out}/descriptions.{lang}.tsv.gz")
lad_file_list = " ".join(lad)

In [None]:
!kgtk cat -i {out}/claims.tsv.gz \
{lad_file_list} \
{out}/qualifiers.tsv.gz \
{out}/metadata.pagerank.undirected.tsv.gz \
{out}/metadata.pagerank.directed.tsv.gz \
{out}/metadata.in_degree.tsv.gz \
{out}/metadata.out_degree.tsv.gz \
-o {out}/wikidataos.all.tsv.gz

## Concatenate files to get the `all for triples` file


In [None]:
!kgtk cat -i $OUT/wikidataos.all.tsv.gz \
$OUT/derived.P31.tsv.gz \
$OUT/derived.P279.tsv.gz \
$OUT/derived.isa.tsv.gz \
$OUT/derived.P279star.tsv.gz \
-o $OUT/wikidataos.all.for.triples.tsv.gz

## Filter out `novalue`, `somevalue` and `P9`

In [None]:
!kgtk filter -i $OUT/wikidataos.all.for.triples.tsv.gz \
    -o $OUT/wikidataos.all.for.triples.filtered.tsv.gz \
    -p ';;somevalue,novalue,P9' --invert
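
The effect of this filter can be sketched in Python. The sketch assumes the pattern `;;somevalue,novalue,P9` is read as `node1;label;node2` with empty fields matching anything, so that with `--invert` every edge whose `node2` is one of the listed values is dropped. The function name and the dict-based edges are hypothetical.

```python
def drop_unwanted_node2(edges, unwanted=("somevalue", "novalue", "P9")):
    """Drop edges whose node2 is one of the unwanted values; keep everything else."""
    unwanted = set(unwanted)
    return [e for e in edges if e["node2"] not in unwanted]
```

Under that reading, `P9` is only matched in the `node2` position, not as an edge label.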

## Add ids for any edge with missing id

In [None]:
!kgtk add-id -i $OUT/wikidataos.all.for.triples.filtered.tsv.gz \
-o $OUT/wikidataos.all.for.triples.filtered.id.tsv.gz \
--id-style wikidata

## Sort by `id`

In [None]:
!kgtk sort2 -i $OUT/wikidataos.all.for.triples.filtered.id.tsv.gz \
-o $OUT/wikidataos.all.for.triples.filtered.id.sorted.tsv.gz \
-c id