# Partitioning a subset of Wikidata

This notebook illustrates how to partition a Wikidata KGTK edges file.

Parameters are set up in the first cell so that we can run this notebook in batch mode. Example invocation command:

```
papermill partition-wikidata.ipynb partition-wikidata.out.ipynb \
-p wikidata_input_path /data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20201130/data/all.tsv.gz \
-p wikidata_parts_path /data3/rogers/kgtk/gd/kgtk_public_graphs/cache/wikidata-20201130/parts
```

Here is a sample of the records that might appear in the input KGTK file:
```
id	node1	label	node2	rank	node2;wikidatatype	lang
Q1-P1036-418bc4-78f5a565-0	Q1	P1036	"113"	normal	external-id	
Q1-P1343-Q19190511-ab132b87-0   Q1      P1343   Q19190511       normal  wikibase-item   
Q1-P18-92a7b3-0dcac501-0        Q1      P18     "Hubble ultra deep field.jpg"   normal  commonsMedia    
Q1-P2386-cedfb0-0fdbd641-0      Q1      P2386   +880000000000000000000000Q828224        normal  quantity        
Q1-P580-a2fccf-63cf4743-0       Q1      P580    ^-13798000000-00-00T00:00:00Z/3 normal  time    
Q1-P920-47c0f2-52689c4e-0       Q1      P920    "LEM201201756"  normal  string  
Q1-P1343-Q19190511-ab132b87-0-P805-Q84065667-0  Q1-P1343-Q19190511-ab132b87-0   P805    Q84065667               wikibase-item   
Q1-P1343-Q88672152-5080b9e2-0-P304-5724c3-0     Q1-P1343-Q88672152-5080b9e2-0   P304    "13-36"         string  
Q1-P2670-Q18343-030eb87e-0-P1107-ce87f8-0       Q1-P2670-Q18343-030eb87e-0      P1107   +0.70           quantity        
Q1-P793-Q273508-1900d69c-0-P585-a2fccf-0        Q1-P793-Q273508-1900d69c-0      P585    ^-13798000000-00-00T00:00:00Z/3         time    
P10-alias-en-282226-0   P10     alias   'gif'@en
P10-description-en      P10     description     'relevant video. For images, use the property P18. For film trailers, qualify with \"object has role\" (P3831)=\"trailer\" (Q622550)'@en                        en
P10-label-en    P10     label   'video'@en                      en
Q1-addl_wikipedia_sitelink-19e42a-0     Q1      addl_wikipedia_sitelink http://enwikiquote.org/wiki/Universe                    en
Q1-addl_wikipedia_sitelink-19e42a-0-language-0  Q1-addl_wikipedia_sitelink-19e42a-0     sitelink-language       en                      en
Q1-addl_wikipedia_sitelink-19e42a-0-site-0      Q1-addl_wikipedia_sitelink-19e42a-0     sitelink-site   enwikiquote                     en
Q1-addl_wikipedia_sitelink-19e42a-0-title-0     Q1-addl_wikipedia_sitelink-19e42a-0     sitelink-title  "Universe"                      en
Q1-wikipedia_sitelink-5e459a-0  Q1      wikipedia_sitelink      http://en.wikipedia.org/wiki/Universe                   en
Q1-wikipedia_sitelink-5e459a-0-badge-Q17437798  Q1-wikipedia_sitelink-5e459a-0  sitelink-badge  Q17437798                       en
Q1-wikipedia_sitelink-5e459a-0-language-0       Q1-wikipedia_sitelink-5e459a-0  sitelink-language       en                      en
Q1-wikipedia_sitelink-5e459a-0-site-0   Q1-wikipedia_sitelink-5e459a-0  sitelink-site   enwiki                  en
Q1-wikipedia_sitelink-5e459a-0-title-0  Q1-wikipedia_sitelink-5e459a-0  sitelink-title  "Universe"                      en
```
Here are some constraints on the contents of the input file:
- The input file starts with a KGTK header record.
  - In addition to the `id`, `node1`, `label`, and `node2` columns, the file may contain the `node2;wikidatatype` column.
  - The `node2;wikidatatype` column is used to partition claims by Wikidata property datatype.
  - If it does not exist, it will be created during the partitioning process and populated using `datatype` relationships.
  - If it does exist, any empty values in the column will be populated using `datatype` relationships.
- The `id` column must contain a nonempty value.
- The first section of an `id` value must be the `node1` value for the record.
  - The qualifier extraction operations depend upon this constraint.
- In addition to the claims and qualifiers, the input file is expected to contain:
  - English language labels for all property entities appearing in the file.
- The input file ought to contain the following:
  - claims records,
  - qualifier records,
  - alias records in appropriate languages,
  - description records in appropriate languages,
  - label records in appropriate languages, and
  - sitelink records in appropriate languages.
  - `datatype` records that map Wikidata property entities to Wikidata property datatypes. These records are required if the input file does not contain the `node2;wikidatatype` column.
- Additionally, this script provides for the appearance of `type` records in the input file.
  - `type` records that list all `entityId` values and identify them as properties or items. These records provide a correctness check on the operation of `kgtk import-wikidata`, and may be deprecated in the future.
- The input file is assumed to be unsorted. If it is already sorted on the (`id` `node1` `label` `node2`) columns, then set the `presorted` parameter to `True` to shorten the execution time of this script.
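
The `id` constraints above can be spot-checked before a long run. The following is a minimal sketch, assuming that the sections of an `id` are separated by `-` as in the sample records; the helper name `id_starts_with_node1` is hypothetical and is not part of KGTK.

```python
def id_starts_with_node1(row):
    """Return True when the row's id begins with its node1 value plus '-'."""
    return row["id"].startswith(row["node1"] + "-")


# Two rows copied from the sample above, plus one deliberately invalid row.
sample_rows = [
    {"id": "Q1-P1036-418bc4-78f5a565-0", "node1": "Q1"},
    {"id": "Q1-P1343-Q19190511-ab132b87-0-P805-Q84065667-0",
     "node1": "Q1-P1343-Q19190511-ab132b87-0"},
    {"id": "", "node1": "Q1"},  # violates the nonempty-id constraint
]

violations = [row for row in sample_rows if not id_starts_with_node1(row)]
print(len(violations))
```

An empty `id` fails the same check, so one pass covers both constraints.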

### Parameters for invoking the notebook

| Parameter | Description | Default |
| --------- | ----------- | ------- |
| `wikidata_input_path` | The Wikidata KGTK edges file to partition. | '/data4/rogers/elicit/cache/datasets/wikidata-20200803/data/all.tsv.gz' |
| `wikidata_parts_path` | A folder to receive the partitioned Wikidata files, such as `claims.wikibase-item.tsv.gz` | '/data4/rogers/elicit/cache/datasets/wikidata-20200803/parts' |
| `temp_folder_path` |    A folder that may be used for temporary files. | wikidata_parts_path + '/temp' |
| `gzip_command` |        The compression command for sorting. | 'pigz'  (Note: use version 2.4 or later)|
| `kgtk_command` |        The kgtk command. | 'time kgtk' |
| `kgtk_options` |        The kgtk command options. | '--debug --timing' |
| `kgtk_extension` |      The file extension for generated KGTK files. Appending `.gz` implies gzip compression. | 'tsv.gz' |
| `presorted` |           When True, the input file is already sorted on the (`id` `node1` `label` `node2`) columns. | 'False' |
| `sort_extras` |         Extra parameters for the sort program.  The default specifies a path for temporary files. Other useful parameters include '--buffer-size' and '--parallel'. | '--parallel 24 --buffer-size 30% --temporary-directory ' + temp_folder_path |
| `use_mgzip` |           When True, use the mgzip program where appropriate for faster compression. | 'True' |
| `verbose` |             When True, produce additional feedback messages. | 'True' |

Note: if `pigz` version 2.4 (or later) is not available on your system, use `gzip`.
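
The fallback from `pigz` to `gzip` could be automated. This is a sketch only: the function name `choose_gzip_command` is hypothetical, and it checks only whether `pigz` is on the `PATH`, not whether the installed version is 2.4 or later.

```python
import shutil


def choose_gzip_command(which=shutil.which):
    """Prefer pigz when it is on the PATH; otherwise fall back to gzip."""
    # NOTE: this does not verify the pigz version (2.4 or later is needed).
    return "pigz" if which("pigz") else "gzip"


selected_gzip_command = choose_gzip_command()
print(selected_gzip_command)
```

The `which` argument exists only so the lookup can be replaced in tests.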


In [1]:
# Parameters
wikidata_input_path = "/Users/markmann/Downloads/subset/output/all.tsv.gz"
wikidata_parts_path = "/Users/markmann/Downloads/subset/output/parts"
temp_folder_path =    wikidata_parts_path + '/temp'
gzip_command =        'gzip' #'pigz'
kgtk_command =        'time kgtk'
kgtk_options =        '--debug --timing'
kgtk_extension =      'tsv.gz'
presorted =           'False'
sort_extras =         '--parallel 24 --buffer-size 30% --temporary-directory ' + temp_folder_path
use_mgzip =           'True'
verbose =             'False'



In [2]:
print('wikidata_input_path = %s' % repr(wikidata_input_path))
print('wikidata_parts_path = %s' % repr(wikidata_parts_path))
print('temp_folder_path = %s' % repr(temp_folder_path))
print('gzip_command = %s' % repr(gzip_command))
print('kgtk_command = %s' % repr(kgtk_command))
print('kgtk_options = %s' % repr(kgtk_options))
print('kgtk_extension = %s' % repr(kgtk_extension))
print('presorted = %s' % repr(presorted))
print('sort_extras = %s' % repr(sort_extras))
print('use_mgzip = %s' % repr(use_mgzip))
print('verbose = %s' % repr(verbose))


wikidata_input_path = '/Users/markmann/Downloads/subset/output/all.tsv.gz'
wikidata_parts_path = '/Users/markmann/Downloads/subset/output/parts'
temp_folder_path = '/Users/markmann/Downloads/subset/output/parts/temp'
gzip_command = 'gzip'
kgtk_command = 'time kgtk'
kgtk_options = '--debug --timing'
kgtk_extension = 'tsv.gz'
presorted = 'False'
sort_extras = '--parallel 24 --buffer-size 30% --temporary-directory /Users/markmann/Downloads/subset/output/parts/temp'
use_mgzip = 'True'
verbose = 'False'


### Create working folders and empty them

In [3]:
!mkdir {wikidata_parts_path}
!mkdir {temp_folder_path}

mkdir: /Users/markmann/Downloads/subset/output/parts: File exists
mkdir: /Users/markmann/Downloads/subset/output/parts/temp: File exists


In [4]:
!rm {wikidata_parts_path}/*.tsv {wikidata_parts_path}/*.tsv.gz
!rm {temp_folder_path}/*.tsv {temp_folder_path}/*.tsv.gz

rm: /Users/markmann/Downloads/subset/output/parts/*.tsv: No such file or directory
rm: /Users/markmann/Downloads/subset/output/parts/temp/*.tsv: No such file or directory
rm: /Users/markmann/Downloads/subset/output/parts/temp/*.tsv.gz: No such file or directory
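
The `File exists` and `No such file or directory` messages above are harmless, but they can be avoided by doing the same work in Python. The sketch below is one possible equivalent; `prepare_folder` is a hypothetical helper, and the demonstration uses a throwaway temporary directory rather than the notebook's real paths.

```python
import glob
import os
import tempfile


def prepare_folder(path, patterns=("*.tsv", "*.tsv.gz")):
    """Create path if needed, then delete files matching the given patterns."""
    os.makedirs(path, exist_ok=True)  # no error when the folder already exists
    removed = []
    for pattern in patterns:
        for file_path in glob.glob(os.path.join(path, pattern)):
            os.remove(file_path)
            removed.append(file_path)
    return removed


# Demonstration in a temporary directory (hypothetical layout).
with tempfile.TemporaryDirectory() as demo_root:
    demo_parts = os.path.join(demo_root, "parts")
    prepare_folder(demo_parts)                         # creates the folder
    open(os.path.join(demo_parts, "stale.tsv"), "w").close()
    removed = prepare_folder(demo_parts)               # removes stale.tsv
    remaining = os.listdir(demo_parts)
```

In the notebook, the calls would be `prepare_folder(wikidata_parts_path)` followed by `prepare_folder(temp_folder_path)`.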


### Sort the Input Data Unless Presorted
Sort the input data file by (id, node1, label, node2).
This may take a while.

In [5]:
if presorted.lower() == "true": 
    print('Using a presorted input file %s.' % repr(wikidata_input_path))
    partition_input_file = wikidata_input_path 
else: 
    print('Sorting the input file %s.' % repr(wikidata_input_path))
    partition_input_file = wikidata_parts_path + '/all.' + kgtk_extension 
    !{kgtk_command} {kgtk_options} sort2 --verbose={verbose} --gzip-command={gzip_command} \
 --input-file {wikidata_input_path} \
 --output-file {partition_input_file} \
 --columns     id node1 label node2 \
 --extra       "{sort_extras}"

Sorting the input file '/Users/markmann/Downloads/subset/output/all.tsv.gz'.
Timing: elapsed=0:12:14.986915 CPU=0:00:00.806604 (  0.1%): sort2 --verbose=False --gzip-command=gzip --input-file /Users/markmann/Downloads/subset/output/all.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/all.tsv.gz --columns id node1 label node2 --extra --parallel 24 --buffer-size 30% --temporary-directory /Users/markmann/Downloads/subset/output/parts/temp

real	12m15.144s
user	11m52.586s
sys	1m30.524s
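
To illustrate what the sort step produces, here is a small in-memory sketch that orders KGTK lines on the same four key columns while keeping the header first. It is not how `kgtk sort2` is implemented, and Python's string ordering may differ from the collation used by the system sort program; `sort_kgtk_lines` is a hypothetical name.

```python
KEY_COLUMNS = ["id", "node1", "label", "node2"]


def sort_kgtk_lines(lines, key_columns=KEY_COLUMNS):
    """Sort tab-separated data lines on key_columns; the header stays first."""
    header = lines[0].split("\t")
    positions = [header.index(name) for name in key_columns]
    rows = [line.split("\t") for line in lines[1:]]
    rows.sort(key=lambda row: [row[p] for p in positions])
    return [lines[0]] + ["\t".join(row) for row in rows]


unsorted_lines = [
    "id\tnode1\tlabel\tnode2",
    "Q1-P920-47c0f2-52689c4e-0\tQ1\tP920\t\"LEM201201756\"",
    "Q1-P18-92a7b3-0dcac501-0\tQ1\tP18\t\"Hubble ultra deep field.jpg\"",
]
sorted_lines = sort_kgtk_lines(unsorted_lines)
```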


### Partition the Claims, Qualifiers, and Entity Data
Split out the entity data (alias, description, label, and sitelinks) and additional metadata (datatype, type).  Separate the qualifiers from the claims.


In [6]:
!{kgtk_command} {kgtk_options} filter --verbose={verbose} --use-mgzip={use_mgzip} --first-match-only \
 --input-file {partition_input_file} \
 -p '; datatype ;'        -o {wikidata_parts_path}/metadata.property.datatypes.{kgtk_extension} \
 -p '; alias ;'           -o {wikidata_parts_path}/aliases.{kgtk_extension} \
 -p '; description ;'     -o {wikidata_parts_path}/descriptions.{kgtk_extension} \
 -p '; label ;'           -o {wikidata_parts_path}/labels.{kgtk_extension} \
 -p '; addl_wikipedia_sitelink,wikipedia_sitelink ;' \
                          -o {wikidata_parts_path}/sitelinks.{kgtk_extension} \
 -p '; sitelink-badge,sitelink-language,sitelink-site,sitelink-title ;' \
                          -o {wikidata_parts_path}/sitelinks.qualifiers.{kgtk_extension} \
 -p '; type ;'            -o {wikidata_parts_path}/metadata.types.{kgtk_extension} \
 --reject-file {temp_folder_path}/claims-and-qualifiers.sorted-by-id.{kgtk_extension}

Timing: elapsed=0:14:28.182973 CPU=0:43:18.381168 (299.3%): filter --verbose=False --use-mgzip=True --first-match-only --input-file /Users/markmann/Downloads/subset/output/parts/all.tsv.gz -p ; datatype ; -o /Users/markmann/Downloads/subset/output/parts/metadata.property.datatypes.tsv.gz -p ; alias ; -o /Users/markmann/Downloads/subset/output/parts/aliases.tsv.gz -p ; description ; -o /Users/markmann/Downloads/subset/output/parts/descriptions.tsv.gz -p ; label ; -o /Users/markmann/Downloads/subset/output/parts/labels.tsv.gz -p ; addl_wikipedia_sitelink,wikipedia_sitelink ; -o /Users/markmann/Downloads/subset/output/parts/sitelinks.tsv.gz -p ; sitelink-badge,sitelink-language,sitelink-site,sitelink-title ; -o /Users/markmann/Downloads/subset/output/parts/sitelinks.qualifiers.tsv.gz -p ; type ; -o /Users/markmann/Downloads/subset/output/parts/metadata.types.tsv.gz --reject-file /Users/markmann/Downloads/subset/output/parts/temp/claims-and-qualifiers.sorted-by-id.tsv.gz

real	14m29.006s
u
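
The routing performed by the filter step can be pictured as a first-match lookup on the `label` column, with everything unmatched going to the reject file. The sketch below mirrors the patterns and output names of the command above; it is an illustration in plain Python, not KGTK's implementation, and `route` is a hypothetical name.

```python
# (labels, output name) pairs in the same order as the -p/-o options above.
ROUTES = [
    ({"datatype"}, "metadata.property.datatypes"),
    ({"alias"}, "aliases"),
    ({"description"}, "descriptions"),
    ({"label"}, "labels"),
    ({"addl_wikipedia_sitelink", "wikipedia_sitelink"}, "sitelinks"),
    ({"sitelink-badge", "sitelink-language", "sitelink-site",
      "sitelink-title"}, "sitelinks.qualifiers"),
    ({"type"}, "metadata.types"),
]
REJECT = "claims-and-qualifiers.sorted-by-id"


def route(label):
    """Return the output name for a row label; the first matching route wins."""
    for labels, name in ROUTES:
        if label in labels:
            return name
    return REJECT


print(route("P1343"))          # a claim or qualifier property
print(route("sitelink-site"))  # a sitelink qualifier
```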

### Sort the claims and qualifiers on Node1
Sort the combined claims and qualifiers file by the node1 column.
This may take a while.

In [7]:
!{kgtk_command} {kgtk_options} sort2 --verbose={verbose} --gzip-command={gzip_command} \
 --input-file {temp_folder_path}/claims-and-qualifiers.sorted-by-id.{kgtk_extension} \
 --output-file {temp_folder_path}/claims-and-qualifiers.sorted-by-node1.{kgtk_extension}\
 --columns     node1 \
 --extra       "{sort_extras}"

Timing: elapsed=0:02:02.890051 CPU=0:00:00.782150 (  0.6%): sort2 --verbose=False --gzip-command=gzip --input-file /Users/markmann/Downloads/subset/output/parts/temp/claims-and-qualifiers.sorted-by-id.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/temp/claims-and-qualifiers.sorted-by-node1.tsv.gz --columns node1 --extra --parallel 24 --buffer-size 30% --temporary-directory /Users/markmann/Downloads/subset/output/parts/temp

real	2m3.074s
user	2m1.539s
sys	0m13.139s


### Split the claims and qualifiers
If row A's `node1` value matches some other row's `id` value, then row A is a qualifier.

In [8]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file {temp_folder_path}/claims-and-qualifiers.sorted-by-node1.{kgtk_extension} \
 --filter-file {temp_folder_path}/claims-and-qualifiers.sorted-by-id.{kgtk_extension} \
 --output-file {temp_folder_path}/qualifiers.sorted-by-node1.{kgtk_extension}\
 --reject-file {temp_folder_path}/claims.sorted-by-node1.{kgtk_extension}\
 --input-keys node1 \
 --filter-keys id

Timing: elapsed=0:04:54.564006 CPU=0:14:03.884440 (286.5%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/temp/claims-and-qualifiers.sorted-by-node1.tsv.gz --filter-file /Users/markmann/Downloads/subset/output/parts/temp/claims-and-qualifiers.sorted-by-id.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/temp/qualifiers.sorted-by-node1.tsv.gz --reject-file /Users/markmann/Downloads/subset/output/parts/temp/claims.sorted-by-node1.tsv.gz --input-keys node1 --filter-keys id

real	4m54.973s
user	7m2.028s
sys	0m7.516s
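
The rule used for the split can be written in a few lines of Python. This sketch holds everything in memory, which the real `ifexists` step avoids by working on presorted files; `split_claims_and_qualifiers` is a hypothetical name, and the two rows are taken from the sample at the top of the notebook.

```python
def split_claims_and_qualifiers(rows):
    """A row whose node1 equals some row's id is a qualifier; others are claims."""
    ids = {row["id"] for row in rows}
    claims, qualifiers = [], []
    for row in rows:
        if row["node1"] in ids:
            qualifiers.append(row)
        else:
            claims.append(row)
    return claims, qualifiers


rows = [
    {"id": "Q1-P1343-Q19190511-ab132b87-0", "node1": "Q1",
     "label": "P1343", "node2": "Q19190511"},
    {"id": "Q1-P1343-Q19190511-ab132b87-0-P805-Q84065667-0",
     "node1": "Q1-P1343-Q19190511-ab132b87-0",
     "label": "P805", "node2": "Q84065667"},
]
claims, qualifiers = split_claims_and_qualifiers(rows)
```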


### Sort the claims by ID
Sort the split claims by id, node1, label, node2.
This may take a while.

In [9]:
!{kgtk_command} {kgtk_options} sort2 --verbose={verbose} --gzip-command={gzip_command} \
 --input-file {temp_folder_path}/claims.sorted-by-node1.{kgtk_extension} \
 --output-file {temp_folder_path}/claims.no-datatype.{kgtk_extension}\
 --columns     id node1 label node2 \
 --extra       "{sort_extras}"

Timing: elapsed=0:02:28.354037 CPU=0:00:00.878512 (  0.6%): sort2 --verbose=False --gzip-command=gzip --input-file /Users/markmann/Downloads/subset/output/parts/temp/claims.sorted-by-node1.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/temp/claims.no-datatype.tsv.gz --columns id node1 label node2 --extra --parallel 24 --buffer-size 30% --temporary-directory /Users/markmann/Downloads/subset/output/parts/temp

real	2m28.535s
user	2m46.376s
sys	0m15.131s


### Merge the Wikidata Property Datatypes into the claims
Merge the Wikidata Property Datatypes into the claims row as `node2;wikidatatype`. This column will be used to partition the claims by Wikidata Property Datatype in a later step. If the claims file already has a `node2;wikidatatype` column, lift only when that column has an empty value.


In [None]:
!gzcat {temp_folder_path}/claims.no-datatype.{kgtk_extension} | head

In [None]:
# Inspect the property datatype records that will be lifted into the claims.
# ISSUE: metadata is empty...
!gzcat {wikidata_parts_path}/metadata.property.datatypes.{kgtk_extension} | head

In [10]:
#ISSUE HERE
!{kgtk_command} {kgtk_options} lift --verbose={verbose} --use-mgzip={use_mgzip} \
 --input-file {temp_folder_path}/claims.no-datatype.{kgtk_extension} \
 --columns-to-lift label \
 --overwrite False \
 --label-file {wikidata_parts_path}/metadata.property.datatypes.{kgtk_extension}\
 --label-value datatype \
 --output-file {wikidata_parts_path}/claims.{kgtk_extension}\
 --columns-to-write 'node2;wikidatatype'

Timing: elapsed=0:03:03.725288 CPU=0:09:43.910892 (317.8%): lift --verbose=False --use-mgzip=True --input-file /Users/markmann/Downloads/subset/output/parts/temp/claims.no-datatype.tsv.gz --columns-to-lift label --overwrite False --label-file /Users/markmann/Downloads/subset/output/parts/metadata.property.datatypes.tsv.gz --label-value datatype --output-file /Users/markmann/Downloads/subset/output/parts/claims.tsv.gz --columns-to-write node2;wikidatatype

real	3m4.163s
user	4m52.053s
sys	0m3.938s
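
The effect of the lift step, including the rule that existing values are not overwritten, can be sketched as follows. This is a plain-Python illustration under the assumption that the `datatype` records have been loaded into a dictionary from property to datatype; `add_datatypes` is a hypothetical name and this is not the `kgtk lift` implementation.

```python
def add_datatypes(claims, property_datatypes, column="node2;wikidatatype"):
    """Fill the datatype column from the claim's property when it is empty."""
    result = []
    for claim in claims:
        updated = dict(claim)
        if not updated.get(column):  # missing or empty: look up the property
            updated[column] = property_datatypes.get(claim["label"], "")
        result.append(updated)
    return result


property_datatypes = {"P18": "commonsMedia", "P920": "string"}
claims = [
    {"id": "Q1-P18-92a7b3-0dcac501-0", "label": "P18"},
    {"id": "Q1-P920-47c0f2-52689c4e-0", "label": "P920",
     "node2;wikidatatype": "string"},
    {"id": "Q1-P580-a2fccf-63cf4743-0", "label": "P580",
     "node2;wikidatatype": ""},
]
lifted = add_datatypes(claims, property_datatypes)
```

A claim whose property has no `datatype` record keeps an empty value, so it would later land in the catch-all partition.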


### Sort the qualifiers by ID
Sort the split qualifiers by id, node1, label, node2.
This may take a while.

In [11]:
!{kgtk_command} {kgtk_options} sort2 --verbose={verbose} --gzip-command={gzip_command} \
 --input-file {temp_folder_path}/qualifiers.sorted-by-node1.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.{kgtk_extension}\
 --columns     id node1 label node2 \
 --extra       "{sort_extras}"

Timing: elapsed=0:00:14.978641 CPU=0:00:00.743402 (  5.0%): sort2 --verbose=False --gzip-command=gzip --input-file /Users/markmann/Downloads/subset/output/parts/temp/qualifiers.sorted-by-node1.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --columns id node1 label node2 --extra --parallel 24 --buffer-size 30% --temporary-directory /Users/markmann/Downloads/subset/output/parts/temp

real	0m15.121s
user	0m14.232s
sys	0m1.139s


### Extract the English aliases, descriptions, labels, and sitelinks.
Aliases, descriptions, and labels are extracted by selecting rows where the `node2` value ends in the language suffix for English (`@en`) in a KGTK language-qualified string. This is an abbreviated pattern; a more general pattern would include the single quotes used to delimit a KGTK language-qualified string. If `kgtk import-wikidata` has executed properly, the abbreviated pattern should be sufficient.

Sitelink rows do not have a language-specific marker in the `node2` value. `kgtk import-wikidata` records the language code ('en' for English) in the additional `lang` column; the cells below select English sitelinks through the `sitelink-language` qualifier rows whose `node2` value is `en`.

In [12]:
!{kgtk_command} {kgtk_options} filter --verbose={verbose} --use-mgzip={use_mgzip} --regex \
 --input-file {wikidata_parts_path}/aliases.{kgtk_extension} \
 -p ';; ^.*@en$' -o {wikidata_parts_path}/aliases.en.{kgtk_extension}

Timing: elapsed=0:00:13.194327 CPU=0:00:25.215834 (191.1%): filter --verbose=False --use-mgzip=True --regex --input-file /Users/markmann/Downloads/subset/output/parts/aliases.tsv.gz -p ;; ^.*@en$ -o /Users/markmann/Downloads/subset/output/parts/aliases.en.tsv.gz

real	0m13.528s
user	0m12.709s
sys	0m0.334s


In [13]:
!{kgtk_command} {kgtk_options} filter --verbose={verbose} --use-mgzip={use_mgzip} --regex \
 --input-file {wikidata_parts_path}/descriptions.{kgtk_extension} \
 -p ';; ^.*@en$' -o {wikidata_parts_path}/descriptions.en.{kgtk_extension}

Timing: elapsed=0:00:19.577638 CPU=0:00:49.065846 (250.6%): filter --verbose=False --use-mgzip=True --regex --input-file /Users/markmann/Downloads/subset/output/parts/descriptions.tsv.gz -p ;; ^.*@en$ -o /Users/markmann/Downloads/subset/output/parts/descriptions.en.tsv.gz

real	0m19.832s
user	0m24.617s
sys	0m0.659s


In [14]:
!{kgtk_command} {kgtk_options} filter --verbose={verbose} --use-mgzip={use_mgzip} --regex \
 --input-file {wikidata_parts_path}/labels.{kgtk_extension} \
 -p ';; ^.*@en$' -o {wikidata_parts_path}/labels.en.{kgtk_extension}

Timing: elapsed=0:00:31.853621 CPU=0:01:31.086664 (286.0%): filter --verbose=False --use-mgzip=True --regex --input-file /Users/markmann/Downloads/subset/output/parts/labels.tsv.gz -p ;; ^.*@en$ -o /Users/markmann/Downloads/subset/output/parts/labels.en.tsv.gz

real	0m32.063s
user	0m45.620s
sys	0m0.836s
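
The difference between the abbreviated pattern used above and a more general one that includes the single quotes can be seen with Python's `re` module. This is a sketch: the stricter pattern is one possible form, not necessarily the one KGTK would use, and the last two values are invented for illustration.

```python
import re

ABBREVIATED = re.compile(r"^.*@en$")   # the pattern used in the filters above
STRICT = re.compile(r"^'.*'@en$")      # also requires the delimiting quotes

values = [
    "'video'@en",
    "'gif'@en",
    "'Universum'@de",   # hypothetical non-English value
    "user@en",          # hypothetical unquoted value ending in @en
]

abbreviated_hits = [v for v in values if ABBREVIATED.search(v)]
strict_hits = [v for v in values if STRICT.search(v)]
print(abbreviated_hits)
print(strict_hits)
```

Only the unquoted value separates the two patterns, which is why the abbreviated form is sufficient when the import has produced well-formed language-qualified strings.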


In [15]:
!{kgtk_command} {kgtk_options} filter --verbose={verbose} --use-mgzip={use_mgzip} \
 --input-file {wikidata_parts_path}/sitelinks.qualifiers.{kgtk_extension} \
 -p '; sitelink-language ; en' -o {temp_folder_path}/sitelinks.language.en.{kgtk_extension}

Timing: elapsed=0:00:00.628267 CPU=0:00:00.687380 (109.4%): filter --verbose=False --use-mgzip=True --input-file /Users/markmann/Downloads/subset/output/parts/sitelinks.qualifiers.tsv.gz -p ; sitelink-language ; en -o /Users/markmann/Downloads/subset/output/parts/temp/sitelinks.language.en.tsv.gz

real	0m0.962s
user	0m0.426s
sys	0m0.163s


In [16]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file {wikidata_parts_path}/sitelinks.{kgtk_extension} \
 --filter-on {temp_folder_path}/sitelinks.language.en.{kgtk_extension} \
 --output-file {wikidata_parts_path}/sitelinks.en.{kgtk_extension} \
 --input-keys  id \
 --filter-keys node1

Timing: elapsed=0:00:00.404702 CPU=0:00:00.582456 (143.9%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/sitelinks.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/temp/sitelinks.language.en.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/sitelinks.en.tsv.gz --input-keys id --filter-keys node1

real	0m0.810s
user	0m0.367s
sys	0m0.125s


In [17]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file {wikidata_parts_path}/sitelinks.qualifiers.{kgtk_extension} \
 --filter-on {temp_folder_path}/sitelinks.language.en.{kgtk_extension} \
 --output-file {wikidata_parts_path}/sitelinks.qualifiers.en.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys node1

Timing: elapsed=0:00:00.435751 CPU=0:00:00.603504 (138.5%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/sitelinks.qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/temp/sitelinks.language.en.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/sitelinks.qualifiers.en.tsv.gz --input-keys node1 --filter-keys node1

real	0m0.898s
user	0m0.376s
sys	0m0.133s
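
The three sitelink cells above amount to: find the sitelinks whose `sitelink-language` qualifier is `en`, then keep those sitelinks and all of their qualifiers. A minimal in-memory sketch follows; `english_sitelinks` is a hypothetical name, and the second sitelink is invented for contrast.

```python
def english_sitelinks(sitelinks, sitelink_qualifiers):
    """Return (sitelinks, qualifiers) restricted to English sitelinks."""
    english_ids = {
        q["node1"] for q in sitelink_qualifiers
        if q["label"] == "sitelink-language" and q["node2"] == "en"
    }
    links = [s for s in sitelinks if s["id"] in english_ids]
    quals = [q for q in sitelink_qualifiers if q["node1"] in english_ids]
    return links, quals


sitelinks = [
    {"id": "Q1-wikipedia_sitelink-5e459a-0", "node1": "Q1"},
    {"id": "Q1-wikipedia_sitelink-000000-0", "node1": "Q1"},  # hypothetical
]
sitelink_qualifiers = [
    {"node1": "Q1-wikipedia_sitelink-5e459a-0",
     "label": "sitelink-language", "node2": "en"},
    {"node1": "Q1-wikipedia_sitelink-5e459a-0",
     "label": "sitelink-site", "node2": "enwiki"},
    {"node1": "Q1-wikipedia_sitelink-000000-0",
     "label": "sitelink-language", "node2": "de"},
]
links, quals = english_sitelinks(sitelinks, sitelink_qualifiers)
```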


### Partition the claims by Wikidata Property Datatype
Wikidata has two names for each Wikidata property datatype: the name that appears in the JSON dump file, and the name that appears in the TTL dump file. `kgtk import-wikidata` currently imports rows from Wikidata JSON dump files, and these are the names that appear below.

The `claims.other` file catches any records that have an unknown Wikidata property datatype. Additional Wikidata property datatypes may occur when processing from certain Wikidata extensions.

In [18]:
!{kgtk_command} {kgtk_options} filter --verbose={verbose} --use-mgzip={use_mgzip} --first-match-only \
 --input-file {wikidata_parts_path}/claims.{kgtk_extension} \
 --obj 'node2;wikidatatype' \
 -p ';; commonsMedia'      -o {wikidata_parts_path}/claims.commonsMedia.{kgtk_extension} \
 -p ';; external-id'       -o {wikidata_parts_path}/claims.external-id.{kgtk_extension} \
 -p ';; geo-shape'         -o {wikidata_parts_path}/claims.geo-shape.{kgtk_extension} \
 -p ';; globe-coordinate'  -o {wikidata_parts_path}/claims.globe-coordinate.{kgtk_extension} \
 -p ';; math'              -o {wikidata_parts_path}/claims.math.{kgtk_extension} \
 -p ';; monolingualtext'   -o {wikidata_parts_path}/claims.monolingualtext.{kgtk_extension} \
 -p ';; musical-notation'  -o {wikidata_parts_path}/claims.musical-notation.{kgtk_extension} \
 -p ';; quantity'          -o {wikidata_parts_path}/claims.quantity.{kgtk_extension} \
 -p ';; string'            -o {wikidata_parts_path}/claims.string.{kgtk_extension} \
 -p ';; tabular-data'      -o {wikidata_parts_path}/claims.tabular-data.{kgtk_extension} \
 -p ';; time'              -o {wikidata_parts_path}/claims.time.{kgtk_extension} \
 -p ';; url'               -o {wikidata_parts_path}/claims.url.{kgtk_extension} \
 -p ';; wikibase-form'     -o {wikidata_parts_path}/claims.wikibase-form.{kgtk_extension} \
 -p ';; wikibase-item'     -o {wikidata_parts_path}/claims.wikibase-item.{kgtk_extension} \
 -p ';; wikibase-lexeme'   -o {wikidata_parts_path}/claims.wikibase-lexeme.{kgtk_extension} \
 -p ';; wikibase-property' -o {wikidata_parts_path}/claims.wikibase-property.{kgtk_extension} \
 -p ';; wikibase-sense'    -o {wikidata_parts_path}/claims.wikibase-sense.{kgtk_extension} \
 --reject-file {wikidata_parts_path}/claims.other.{kgtk_extension}

Timing: elapsed=0:03:25.670821 CPU=0:11:17.735666 (329.5%): filter --verbose=False --use-mgzip=True --first-match-only --input-file /Users/markmann/Downloads/subset/output/parts/claims.tsv.gz --obj node2;wikidatatype -p ;; commonsMedia -o /Users/markmann/Downloads/subset/output/parts/claims.commonsMedia.tsv.gz -p ;; external-id -o /Users/markmann/Downloads/subset/output/parts/claims.external-id.tsv.gz -p ;; geo-shape -o /Users/markmann/Downloads/subset/output/parts/claims.geo-shape.tsv.gz -p ;; globe-coordinate -o /Users/markmann/Downloads/subset/output/parts/claims.globe-coordinate.tsv.gz -p ;; math -o /Users/markmann/Downloads/subset/output/parts/claims.math.tsv.gz -p ;; monolingualtext -o /Users/markmann/Downloads/subset/output/parts/claims.monolingualtext.tsv.gz -p ;; musical-notation -o /Users/markmann/Downloads/subset/output/parts/claims.musical-notation.tsv.gz -p ;; quantity -o /Users/markmann/Downloads/subset/output/parts/claims.quantity.tsv.gz -p ;; string -o /Users/markmann/D
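
Before running the partition, it can be useful to count how many claims would fall into each output file, in particular the catch-all file. The sketch below uses the datatype names from the command above; `partition_file` is a hypothetical helper and the `observed` list is invented data.

```python
from collections import Counter

KNOWN_DATATYPES = {
    "commonsMedia", "external-id", "geo-shape", "globe-coordinate", "math",
    "monolingualtext", "musical-notation", "quantity", "string",
    "tabular-data", "time", "url", "wikibase-form", "wikibase-item",
    "wikibase-lexeme", "wikibase-property", "wikibase-sense",
}


def partition_file(datatype, extension="tsv.gz"):
    """Name the claims file for a datatype; unknown values go to 'other'."""
    name = datatype if datatype in KNOWN_DATATYPES else "other"
    return "claims.%s.%s" % (name, extension)


# Hypothetical node2;wikidatatype values, including one unlisted datatype
# and one empty value.
observed = ["wikibase-item", "quantity", "wikibase-item", "entity-schema", ""]
counts = Counter(partition_file(d) for d in observed)
```

A large count for `claims.other.tsv.gz` would suggest missing `datatype` records rather than genuinely new datatypes.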

### Partition the qualifiers
Extract the qualifier records for each of the Wikidata property datatype partition files.

In [19]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \
 --filter-on   {wikidata_parts_path}/claims.commonsMedia.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.commonsMedia.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys id

Timing: elapsed=0:00:14.310210 CPU=0:00:26.996326 (188.7%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/claims.commonsMedia.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.commonsMedia.tsv.gz --input-keys node1 --filter-keys id

real	0m14.659s
user	0m13.585s
sys	0m0.354s
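
The qualifier cells in this section differ only in the datatype name, so the commands could be generated in a loop. The sketch below only builds command strings, using the same `ifexists` options that appear in the cell above, and does not run them; `qualifier_commands` and the `/tmp/parts` path are hypothetical.

```python
DATATYPES = [
    "commonsMedia", "external-id", "geo-shape", "globe-coordinate", "math",
    "monolingualtext", "musical-notation", "quantity", "string",
    "tabular-data", "time", "url", "wikibase-form", "wikibase-item",
    "wikibase-lexeme", "wikibase-property", "wikibase-sense",
]


def qualifier_commands(parts_path, extension="tsv.gz", kgtk="kgtk"):
    """Build one ifexists command line per datatype partition."""
    commands = []
    for datatype in DATATYPES:
        commands.append(" ".join([
            kgtk, "ifexists", "--presorted",
            "--input-file", "%s/qualifiers.%s" % (parts_path, extension),
            "--filter-on",
            "%s/claims.%s.%s" % (parts_path, datatype, extension),
            "--output-file",
            "%s/qualifiers.%s.%s" % (parts_path, datatype, extension),
            "--input-keys", "node1",
            "--filter-keys", "id",
        ]))
    return commands


commands = qualifier_commands("/tmp/parts")
print(commands[0])
```

Keeping one cell per datatype, as this notebook does, has the advantage that each step's timing is recorded separately.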


In [20]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \
 --filter-on   {wikidata_parts_path}/claims.external-id.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.external-id.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys id

Timing: elapsed=0:00:29.280873 CPU=0:00:59.000042 (201.5%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/claims.external-id.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.external-id.tsv.gz --input-keys node1 --filter-keys id

real	0m29.677s
user	0m29.583s
sys	0m0.717s


In [21]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \
 --filter-on   {wikidata_parts_path}/claims.geo-shape.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.geo-shape.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys id

Timing: elapsed=0:00:11.006226 CPU=0:00:20.982724 (190.6%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/claims.geo-shape.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.geo-shape.tsv.gz --input-keys node1 --filter-keys id

real	0m11.345s
user	0m10.570s
sys	0m0.222s


In [22]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \
 --filter-on   {wikidata_parts_path}/claims.globe-coordinate.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.globe-coordinate.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys id

Timing: elapsed=0:00:15.969536 CPU=0:00:30.665646 (192.0%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/claims.globe-coordinate.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.globe-coordinate.tsv.gz --input-keys node1 --filter-keys id

real	0m16.253s
user	0m15.410s
sys	0m0.345s


In [23]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \
 --filter-on   {wikidata_parts_path}/claims.math.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.math.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys id

Timing: elapsed=0:00:10.927218 CPU=0:00:21.107980 (193.2%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/claims.math.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.math.tsv.gz --input-keys node1 --filter-keys id

real	0m11.188s
user	0m10.629s
sys	0m0.205s


In [24]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \
 --filter-on   {wikidata_parts_path}/claims.monolingualtext.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.monolingualtext.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys id

Timing: elapsed=0:00:15.272820 CPU=0:00:29.418372 (192.6%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/claims.monolingualtext.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.monolingualtext.tsv.gz --input-keys node1 --filter-keys id

real	0m15.604s
user	0m14.796s
sys	0m0.334s


In [25]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \
 --filter-on   {wikidata_parts_path}/claims.musical-notation.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.musical-notation.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys id

Timing: elapsed=0:00:11.592409 CPU=0:00:22.455004 (193.7%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/claims.musical-notation.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.musical-notation.tsv.gz --input-keys node1 --filter-keys id

real	0m11.835s
user	0m11.308s
sys	0m0.220s


In [26]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \
 --filter-on   {wikidata_parts_path}/claims.quantity.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.quantity.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys id

Timing: elapsed=0:00:28.679419 CPU=0:01:04.357568 (224.4%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/claims.quantity.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.quantity.tsv.gz --input-keys node1 --filter-keys id

real	0m28.977s
user	0m32.253s
sys	0m0.557s


In [27]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \
 --filter-on   {wikidata_parts_path}/claims.string.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.string.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys id

Timing: elapsed=0:00:17.197049 CPU=0:00:32.957478 (191.6%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/claims.string.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.string.tsv.gz --input-keys node1 --filter-keys id

real	0m17.470s
user	0m16.561s
sys	0m0.332s


In [28]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \
 --filter-on   {wikidata_parts_path}/claims.tabular-data.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.tabular-data.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys id

Timing: elapsed=0:00:10.783551 CPU=0:00:20.916836 (194.0%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/claims.tabular-data.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tabular-data.tsv.gz --input-keys node1 --filter-keys id

real	0m11.061s
user	0m10.533s
sys	0m0.200s


In [29]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \
 --filter-on   {wikidata_parts_path}/claims.time.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.time.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys id

Timing: elapsed=0:00:17.622693 CPU=0:00:34.273780 (194.5%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/claims.time.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.time.tsv.gz --input-keys node1 --filter-keys id

real	0m17.968s
user	0m17.212s
sys	0m0.370s


In [30]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \
 --filter-on   {wikidata_parts_path}/claims.url.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.url.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys id

Timing: elapsed=0:00:14.893292 CPU=0:00:28.712904 (192.8%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/claims.url.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.url.tsv.gz --input-keys node1 --filter-keys id

real	0m15.134s
user	0m14.433s
sys	0m0.287s


In [31]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \
 --filter-on   {wikidata_parts_path}/claims.wikibase-form.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.wikibase-form.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys id

Timing: elapsed=0:00:10.633202 CPU=0:00:20.672618 (194.4%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/claims.wikibase-form.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.wikibase-form.tsv.gz --input-keys node1 --filter-keys id

real	0m10.893s
user	0m10.412s
sys	0m0.194s


In [32]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \
 --filter-on   {wikidata_parts_path}/claims.wikibase-item.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.wikibase-item.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys id

Timing: elapsed=0:00:57.382729 CPU=0:01:55.416764 (201.1%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/claims.wikibase-item.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.wikibase-item.tsv.gz --input-keys node1 --filter-keys id

real	0m57.675s
user	0m57.796s
sys	0m1.234s


In [33]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \
 --filter-on   {wikidata_parts_path}/claims.wikibase-lexeme.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.wikibase-lexeme.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys id

Timing: elapsed=0:00:11.348215 CPU=0:00:21.624082 (190.6%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/claims.wikibase-lexeme.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.wikibase-lexeme.tsv.gz --input-keys node1 --filter-keys id

real	0m11.768s
user	0m10.906s
sys	0m0.250s


In [34]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \
 --filter-on   {wikidata_parts_path}/claims.wikibase-property.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.wikibase-property.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys id

Timing: elapsed=0:00:10.842485 CPU=0:00:20.931252 (193.0%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/claims.wikibase-property.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.wikibase-property.tsv.gz --input-keys node1 --filter-keys id

real	0m11.178s
user	0m10.542s
sys	0m0.204s


In [35]:
!{kgtk_command} {kgtk_options} ifexists --verbose={verbose} --use-mgzip={use_mgzip} --presorted \
 --input-file  {wikidata_parts_path}/qualifiers.{kgtk_extension} \
 --filter-on   {wikidata_parts_path}/claims.wikibase-sense.{kgtk_extension} \
 --output-file {wikidata_parts_path}/qualifiers.wikibase-sense.{kgtk_extension} \
 --input-keys  node1 \
 --filter-keys id

Timing: elapsed=0:00:10.661392 CPU=0:00:20.554058 (192.8%): ifexists --verbose=False --use-mgzip=True --presorted --input-file /Users/markmann/Downloads/subset/output/parts/qualifiers.tsv.gz --filter-on /Users/markmann/Downloads/subset/output/parts/claims.wikibase-sense.tsv.gz --output-file /Users/markmann/Downloads/subset/output/parts/qualifiers.wikibase-sense.tsv.gz --input-keys node1 --filter-keys id

real	0m10.983s
user	0m10.353s
sys	0m0.191s
