# Wikidata JSON Dumps

> Welcome!
> 
> For this notebook we will not be using [PAWS](https://hub-paws.wmcloud.org/), as the data sizes are too large.

This notebook provides an overview of working with the [Wikidata Dumps](https://www.wikidata.org/wiki/Wikidata:Database_download) using [requests](https://requests.readthedocs.io/en/latest/) and other Python libraries. The functions used in this notebook are defined in the [mismatch generation utils file on GitHub](https://github.com/Wikidata/Purdue-Data-Mine-2024/blob/main/MismatchGeneration/utils.py). The dumps themselves can be found at [dumps.wikimedia.org/other/wikibase/wikidatawiki](https://dumps.wikimedia.org/other/wikibase/wikidatawiki/).

**Note**: If you ever work with the dumps for other Wikimedia projects, we'd suggest using the XML dumps rather than the JSON dumps, as the added structure of XML makes the entries easier to handle. XML is not a good choice for Wikidata, however: the format of the JSON data embedded in the XML dumps is subject to change without notice and may be inconsistent between revisions ([source](https://www.wikidata.org/wiki/Wikidata:Database_download#XML_dumps)).

In [1]:
# pip install pymysql

In [2]:
# pip install tensorflow

In [3]:
# pip install jupyter-black

In [4]:
%load_ext jupyter_black

## Download Dump

In [5]:
import sys

PATH_TO_UTILS = "../MismatchGeneration/"  # change based on your directory structure
sys.path.append(PATH_TO_UTILS)

from utils import download_wikidata_json_dump, parse_wikidata_dump_to_ndjson

In [6]:
# You cannot run this in PAWS as even a 2 GB file is too large for the system.
download_wikidata_json_dump(
    target_dir="../MismatchGeneration/Data", dump_id=False  # get the most recent dump
)

Target Wikidata dump file is 'latest-all.json.bz2'.

The desired dump already exists locally at ../MismatchGeneration/Data/latest-all.json.bz2 (86.29 GBs). Skipping download.


## Total Wikidata Items

We want the total number of Wikidata items so that we have a rough estimate of the number of QIDs for the [tqdm](https://github.com/tqdm/tqdm) progress bars when parsing the data. We're using the estimate we found in Notebook 3 because getting the exact number of items in the dump would require reading through the entire file.

In [7]:
# From the Wiki Replicas section in Notebook 3.
total_wikidata_items = 111351401  # could be used for `input_limit` in a full parse
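
For reference, an exact count can be obtained by streaming the compressed file line by line, though a full pass over a dump of this size is slow. The following is a minimal sketch, assuming the standard dump layout of a single JSON array with one entity per line. `count_dump_entities` is a hypothetical helper for illustration and is not part of `utils.py`.

```python
import bz2


def count_dump_entities(dump_path: str) -> int:
    """Count entity lines in a bz2-compressed Wikidata JSON dump (hypothetical helper)."""
    count = 0
    # Text mode decompresses on the fly, so only one line is held in memory at a time.
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            # Assumption: "[" and "]" sit on their own lines and every
            # other non-empty line holds exactly one entity.
            if line.strip() not in ("", "[", "]"):
                count += 1
    return count
```

For progress bars, the estimate above is usually good enough, so this pass is only worth running when the exact number matters.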

## Parsing Dump

The following call returns the first two entities in the dump that are part of (`P361`) the European Union (`Q458`). Given the order of the dump, these are Belgium and Portugal.

In [8]:
parse_wikidata_dump_to_ndjson(
    pids=["P361"],  # what PIDs should returned entities have
    pid_values=["Q458"],  # what values should the properties have
    pid_value_props=None,  # ex: for passing units or languages
    prop_intersection=True,  # returned entities must have all props
    output_file_path="../MismatchGeneration/Data/test-parse.ndjson",
    input_file_path="../MismatchGeneration/Data/latest-all.json.bz2",
    output_limit=2,
    input_limit=None,
    verbose=True,
)

The output file ../MismatchGeneration/Data/test-parse.ndjson already exists.
This file will be rewritten.


Outputs returned:   0%|          | 0/2 [00:00<?, ?entries/s]

Parsed 9 entries in the JSON dump into an NDJSON file with 2 entities.
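
To make the filtering idea concrete, here is a minimal sketch of how a streaming filter over the dump could work. It is not the implementation in `utils.py`: `iter_dump_entities`, `has_item_value`, and `first_matches` are hypothetical names, and the sketch assumes the dump is a single JSON array with one entity per line and that item values are stored under `claims -> PID -> mainsnak -> datavalue -> value -> id`.

```python
import bz2
import json
from typing import Iterator, List


def iter_dump_entities(dump_path: str) -> Iterator[dict]:
    """Yield entities one at a time from a bz2-compressed Wikidata JSON dump."""
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as dump:
        for line in dump:
            # Entity lines end with a comma; the array brackets have their own lines.
            line = line.strip().rstrip(",")
            if line in ("", "[", "]"):
                continue
            yield json.loads(line)


def has_item_value(entity: dict, pid: str, qid: str) -> bool:
    """Check whether any statement for `pid` points at the item `qid`."""
    for statement in entity.get("claims", {}).get(pid, []):
        # "novalue" and "somevalue" snaks carry no datavalue, hence the defaults.
        value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
        if isinstance(value, dict) and value.get("id") == qid:
            return True
    return False


def first_matches(dump_path: str, pid: str, qid: str, output_limit: int) -> List[str]:
    """Return the QIDs of the first `output_limit` matching entities."""
    matches: List[str] = []
    for entity in iter_dump_entities(dump_path):
        if has_item_value(entity, pid, qid):
            matches.append(entity["id"])
            if len(matches) >= output_limit:
                break  # stop early instead of reading the rest of the dump
    return matches
```

Stopping as soon as the output limit is reached is what allows a call like the one above to finish after reading only a handful of entries.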


The following call returns the entities among the first two entries of the dump that have an area (`P2046`) of `30528` with a unit of square kilometers (`Q712226`). Only Belgium matches.

In [9]:
parse_wikidata_dump_to_ndjson(
    pids=["P2046"],  # what PIDs should returned entities have
    pid_values=["+30528"],  # what values should the properties have
    pid_value_props=[
        "http://www.wikidata.org/entity/Q712226"
    ],  # ex: for passing units or languages
    prop_intersection=True,  # returned entities must have all props
    output_file_path="../MismatchGeneration/Data/test-parse.ndjson",
    input_file_path="../MismatchGeneration/Data/latest-all.json.bz2",
    output_limit=None,
    input_limit=2,
    verbose=True,
)

The output file ../MismatchGeneration/Data/test-parse.ndjson already exists.
This file will be rewritten.


Entries processed:   0%|          | 0/2 [00:00<?, ?entries/s]

Parsed 2 entries in the JSON dump into an NDJSON file with 1 entities.
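
One way to sanity-check the result is to read the NDJSON file back and test the quantity statements directly. The sketch below assumes that each output line holds one full entity and that quantity values are stored as a dictionary with string `amount` and `unit` keys, which is why the amount above is passed as `"+30528"` and the unit as an entity URI. `read_ndjson` and `has_quantity_value` are hypothetical helpers and are not part of `utils.py`.

```python
import json
from typing import List


def read_ndjson(path: str) -> List[dict]:
    """Load an NDJSON file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as ndjson_file:
        return [json.loads(line) for line in ndjson_file if line.strip()]


def has_quantity_value(entity: dict, pid: str, amount: str, unit_uri: str) -> bool:
    """Check whether any statement for `pid` has the given amount and unit."""
    for statement in entity.get("claims", {}).get(pid, []):
        value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
        if not isinstance(value, dict):
            continue  # skip statements without a structured value
        # Amounts are signed strings, so this is an exact string comparison.
        if value.get("amount") == amount and value.get("unit") == unit_uri:
            return True
    return False
```

With the output path used above, a check could look like `all(has_quantity_value(e, "P2046", "+30528", "http://www.wikidata.org/entity/Q712226") for e in read_ndjson("../MismatchGeneration/Data/test-parse.ndjson"))`.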
