# Data Extraction from Wikidata

## What this notebook does
- Attach a label (midasi) to each ID
- Keep only the properties that are needed for each entity
   - After filtering, the data size shrinks from 17GB to 1.3GB


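### Details

Source data
- Multilingual data
- Each JSON entry corresponds to either an entity (ID of the form Q*) or a property (ID of the form P*)
- Of these, only entries that have a Japanese title were extracted
   - 67,719,431 entries (68GB) -> 2,274,894 entries (17GB)
   - Fields
        - `id`
        - `title_ja` (originally `labels->ja`)
        - `title_en`
        - `descriptions_ja` (originally `descriptions->ja`)
        - `descriptions_en`
        - `aliases_ja` (originally `aliases->ja->value`)
        - `properties` (originally `claims`)

- This preprocessing is quite slow when run single-threaded, so the data was split and processed in parallel across multiple cores (using GNU Parallel)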
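
The field mapping listed above could be sketched as a per-entry function along the following lines. This is only an illustration, not the preprocessing script that was actually run: the function name `extract_ja_entry` and the sample entry are hypothetical, and the key names (`labels`, `descriptions`, `aliases`, `claims`) are assumed from the Wikibase JSON format. `title_ja` is kept as the full label object because later cells read `title_ja['value']`.

```python
from typing import Optional


def extract_ja_entry(entry: dict) -> Optional[dict]:
    """Return the reduced record, or None if the entry has no Japanese label."""
    labels = entry.get('labels', {})
    if 'ja' not in labels:
        return None
    descriptions = entry.get('descriptions', {})
    aliases = entry.get('aliases', {})
    return {
        'id': entry['id'],
        'title_ja': labels['ja'],            # {'language': 'ja', 'value': ...}
        'title_en': labels.get('en'),        # None when there is no English label
        'descriptions_ja': descriptions.get('ja'),
        'descriptions_en': descriptions.get('en'),
        'aliases_ja': [a['value'] for a in aliases.get('ja', [])],
        'properties': entry.get('claims', {}),
    }


# Hypothetical, heavily trimmed entry used only to exercise the function.
sample = {
    'id': 'Q131',
    'labels': {'ja': {'language': 'ja', 'value': '土曜日'},
               'en': {'language': 'en', 'value': 'Saturday'}},
    'descriptions': {'en': {'language': 'en', 'value': 'day of the week'}},
    'aliases': {'ja': [{'language': 'ja', 'value': '土曜'}]},
    'claims': {},
}
record = extract_ja_entry(sample)
no_ja = extract_ja_entry({'id': 'Q0', 'labels': {'en': {'language': 'en', 'value': 'x'}}})
print(record['title_ja']['value'], record['aliases_ja'], no_ja)
```

Because the function depends only on a single entry, it can be applied independently to each split of the dump, which is what makes the split-and-parallelize approach straightforward.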


References
- Source data: https://dumps.wikimedia.org/wikidatawiki/entities/
- Source data format: https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON
- Property datatype specification: https://www.wikidata.org/wiki/Special:ListDatatypes
- List of properties: https://www.wikidata.org/wiki/Wikidata:Database_reports/List_of_properties/all

In [1]:
import jsonlines
import json

In [2]:
from pprint import PrettyPrinter

pp = PrettyPrinter()

# Specify the directory containing the data

In [3]:
from pathlib import Path

workdir = Path('./data')

## Qid -> entity label

In [4]:
%%time
import pickle


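# Build the Qid -> Japanese label lookup, parsing each line once
# (the original comprehension called json.loads three times per line).
qid2midasi = {}
with jsonlines.open(workdir / '20191125-split-ja-title.jsonl') as reader:
    for l in reader.iter():
        d = json.loads(l)
        if d['id'].startswith('Q'):
            qid2midasi[d['id']] = d['title_ja']['value']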
# with open('qid2midasi.pkl', 'wb') as fp:
#     pickle.dump(qid2midasi, fp)
#with open('qid2midasi.pkl', 'rb') as fp:
#    qid2midasi = pickle.load(fp)

len(qid2midasi)

CPU times: user 15min 17s, sys: 12.8 s, total: 15min 30s
Wall time: 24min 38s


2272754
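
Since building the lookup takes many minutes, caching it with `pickle` (as the commented-out lines in the cell hint at) may be worthwhile. The following is a minimal sketch of a load-or-build helper; `load_or_build` and `build_toy` are hypothetical names, and the toy builder stands in for the real dictionary construction.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_path, build):
    """Load a pickled object from cache_path, or build and cache it if absent."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with open(cache_path, 'rb') as fp:
            return pickle.load(fp)
    result = build()
    with open(cache_path, 'wb') as fp:
        pickle.dump(result, fp)
    return result


# Demonstration with a toy builder in a temporary directory.
calls = []


def build_toy():
    calls.append(1)  # record how often the expensive step actually runs
    return {'Q131': '土曜日'}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'qid2midasi.pkl'
    first = load_or_build(path, build_toy)   # builds and writes the cache
    second = load_or_build(path, build_toy)  # reads the cache instead

print(first == second, len(calls))
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from untrusted input.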

In [5]:
%%time

# Keep only property entries (IDs starting with 'P'), parsing each line once.
with jsonlines.open(workdir / '20191125-split-ja-title.jsonl') as reader:
    p_dicts = [d for d in map(json.loads, reader.iter()) if d['id'].startswith('P')]
print(len(p_dicts))
with jsonlines.open(workdir / '20191125-split-ja-title_properties.jsonl', 'w') as writer:
    writer.write_all(p_dicts)

p_dicts_title = {d['id']: {'ja': d['title_ja']['value'], 'en': d['title_en']['value']}  for d in p_dicts}

2140
CPU times: user 9min 5s, sys: 13.1 s, total: 9min 18s
Wall time: 9min 19s


## Pid -> property label
### Restrict the extracted property types to `WikibaseItem` and `WikibaseProperty`

Other property types
- String
- Monolingualtext
- GeoShape
- GeoCoordinate
- CommonsMedia
- TabularData
- ExternalId
- Quantity
- Time
- Url

cf.
- Property datatype definitions: https://www.wikidata.org/wiki/Special:ListDatatypes
- List of properties: https://www.wikidata.org/wiki/Wikidata:Database_reports/List_of_properties/all
- Property search: https://tools.wmflabs.org/prop-explorer/

In [7]:
with open(workdir / 'props_ja.json') as fp:
    props_ja = json.load(fp)
with open(workdir / 'props_en.json') as fp:
    props_en = json.load(fp)
print(len(props_ja), len(props_en))
prop_dict = {k: {'ja': v, 'en': props_en[k]} for k, v in props_ja.items()}

def get_property_name(key, prop_dict):
    d = prop_dict[key]
    return {'ja': d['ja']['label'], 'en': d['en']['label'], 'type': d['en']['type']}

target_types = {'WikibaseItem', 'WikibaseProperty'}
prop_dict_filtered = {pid: prop_dict[pid] for pid, d in prop_dict.items() if d['ja']['type'] in target_types}
len(prop_dict), len(prop_dict_filtered)

6941 6941


(6941, 1327)
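
To see which datatypes the filter discards, a per-type count can help. The sketch below uses a handful of toy entries shaped like the `props_en.json` values (`label` and `type` keys); `toy_props` is a hypothetical stand-in for the real property table.

```python
from collections import Counter

# Toy stand-in for the loaded property metadata (illustrative entries only).
toy_props = {
    'P31': {'label': 'instance of', 'type': 'WikibaseItem'},
    'P1647': {'label': 'subproperty of', 'type': 'WikibaseProperty'},
    'P569': {'label': 'date of birth', 'type': 'Time'},
    'P856': {'label': 'official website', 'type': 'Url'},
}
target_types = {'WikibaseItem', 'WikibaseProperty'}

# Count properties per datatype, then list the ones that survive the filter.
type_counts = Counter(d['type'] for d in toy_props.values())
kept = sorted(pid for pid, d in toy_props.items() if d['type'] in target_types)

print(type_counts)
print(kept)
```

Running the same `Counter` over the real table would show how the 6941 properties are distributed across datatypes before 1327 are kept.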

In [8]:
# 'P31': {'ja': '分類', 'en': 'instance of', 'type': 'WikibaseItem'}
prop_dict_filtered['P31']

{'ja': {'label': '分類',
  'type': 'WikibaseItem',
  'description': 'この項目をインスタンスとする種類・概念',
  'aliases': 'クラス, is-a, ∈, is a, インスタンスの元, 以下の実体, 実体の元, 種類'},
 'en': {'label': 'instance of',
  'type': 'WikibaseItem',
  'description': 'that class of which this subject is a particular example and member (subject typically an individual member with a proper name label); different from P279; using this property as a qualifier is deprecated—use P2868 or P3831 instead',
  'aliases': 'member of, type, is a, distinct element of, distinct individual member of, distinct member of, has class, has type, is a particular, is a specific, is a unique, is an, is an example of, is an individual, non-type member of, not a type but is instance of, occurrence of, rdf:type, uninstantiable instance of, unique individual of, unitary element of class, unsubclassable example of, unsubclassifiable member of, unsubtypable particular'}}

In [9]:
def reduce_props(props):
    """Keep only entity-valued claims whose target has a Japanese label."""
    # Restrict to properties whose datatype passed the filter above.
    pids = [pid for pid in props if pid in prop_dict_filtered]
    results = []
    for pid in pids:
        p_midasi = get_property_name(pid, prop_dict_filtered)
        prop_d = {'id': pid, 'property_midasi': p_midasi, 'entities': []}
        for v in props[pid]:
            snak = v['mainsnak']
            # Skip snaks without a value and values that are not entity IDs.
            if snak['snaktype'] == 'value' and snak['datavalue']['type'] == 'wikibase-entityid':
                qid = snak['datavalue']['value']['id']
                q_midasi = qid2midasi.get(qid)
                # Drop targets that have no Japanese label.
                if q_midasi is not None:
                    prop_d['entities'].append({'id': qid, 'entity_midasi': q_midasi})
        # Omit properties that end up with no labelled targets.
        if prop_d['entities']:
            results.append(prop_d)
    return results
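
To make the input and output shapes concrete, here is a self-contained variant of the reduction that takes its lookup tables as arguments instead of reading notebook globals. The name `reduce_claims` and all toy data are hypothetical; the claim structure (`mainsnak`, `snaktype`, `datavalue`) mirrors what `reduce_props` accesses.

```python
def reduce_claims(claims, prop_names, qid2label):
    """Variant of reduce_props with explicit lookups (for illustration)."""
    results = []
    for pid, statements in claims.items():
        if pid not in prop_names:
            continue  # property type was filtered out
        entities = []
        for statement in statements:
            snak = statement['mainsnak']
            if snak['snaktype'] != 'value':
                continue  # 'novalue' / 'somevalue' carry no datavalue
            if snak['datavalue']['type'] != 'wikibase-entityid':
                continue
            qid = snak['datavalue']['value']['id']
            label = qid2label.get(qid)
            if label is not None:
                entities.append({'id': qid, 'entity_midasi': label})
        if entities:
            results.append({'id': pid,
                            'property_midasi': prop_names[pid],
                            'entities': entities})
    return results


# Toy inputs: one usable claim, one claim without a value, one unlisted property.
toy_claims = {
    'P31': [{'mainsnak': {'snaktype': 'value',
                          'datavalue': {'type': 'wikibase-entityid',
                                        'value': {'id': 'Q2257'}}}}],
    'P361': [{'mainsnak': {'snaktype': 'novalue'}}],
    'P569': [{'mainsnak': {'snaktype': 'value',
                           'datavalue': {'type': 'time', 'value': {}}}}],
}
toy_prop_names = {
    'P31': {'ja': '分類', 'en': 'instance of', 'type': 'WikibaseItem'},
    'P361': {'ja': '以下の一部分', 'en': 'part of', 'type': 'WikibaseItem'},
}
toy_labels = {'Q2257': '1月14日'}

reduced = reduce_claims(toy_claims, toy_prop_names, toy_labels)
print(reduced)
```

Only `P31` survives here: `P361` has no value to resolve, and `P569` is not among the retained property types.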


In [11]:
%%time

cnt = 0
lines = []
with jsonlines.open(workdir / '20191125-split-ja-title.jsonl') as reader:
    for l in reader.iter():
        d = json.loads(l)
        if d['id'].startswith('Q'):
            cnt += 1
            d['properties'] = reduce_props(d['properties'])
            d['title_ja'] = d['title_ja']['value']
            if d['title_en'] is not None:
                d['title_en'] = d['title_en']['value']
            if d['descriptions_en'] is not None:
                d['descriptions_en'] = d['descriptions_en']['value']
            if d['descriptions_ja'] is not None:
                d['descriptions_ja'] = d['descriptions_ja']['value']
            lines.append(d)

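            # Write a chunk every 100,000 entities and start a new buffer.
            if cnt % 100000 == 0:
                with jsonlines.open(workdir / f'20191125-reduced-ja-{cnt:08}.jsonl', 'w') as writer:
                    writer.write_all(lines)
                lines = []

# Write the final partial chunk, which the modulo check above never reaches.
if lines:
    with jsonlines.open(workdir / f'20191125-reduced-ja-{cnt:08}.jsonl', 'w') as writer:
        writer.write_all(lines)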


CPU times: user 9min 32s, sys: 24.8 s, total: 9min 57s
Wall time: 10min
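
The cell above writes numbered chunk files, while the usage example reads a single `20191125-reduced-ja.jsonl`. How that file was produced is not shown; one plausible way, assuming it is simply the concatenation of the chunks, is sketched below. `merge_chunks` is a hypothetical helper, and the demonstration runs on toy files in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def merge_chunks(chunk_dir, pattern, out_path):
    """Concatenate chunk files matching `pattern` (sorted by name) into out_path."""
    chunk_paths = sorted(Path(chunk_dir).glob(pattern))
    with open(out_path, 'wb') as out:
        for path in chunk_paths:
            with open(path, 'rb') as src:
                shutil.copyfileobj(src, out)
    return chunk_paths


# Demonstration on two toy chunks.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / '20191125-reduced-ja-00100000.jsonl').write_text('{"id": "Q1"}\n', encoding='utf-8')
    (tmp / '20191125-reduced-ja-00200000.jsonl').write_text('{"id": "Q2"}\n', encoding='utf-8')
    merged = tmp / '20191125-reduced-ja.jsonl'
    used = merge_chunks(tmp, '20191125-reduced-ja-*.jsonl', merged)
    merged_lines = merged.read_text(encoding='utf-8').splitlines()

print(len(used), merged_lines)
```

The zero-padded counter in the chunk names keeps the lexicographic sort in write order, and the glob pattern does not match the merged file's own name.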


# Usage example

In [12]:
# Assumes the chunk files written above have been combined into this single file.
with jsonlines.open(workdir / '20191125-reduced-ja.jsonl') as reader:
    kb = list(reader.iter())

In [14]:
import random

pp.pprint(random.choice(kb))

{'aliases_ja': [],
 'descriptions_en': 'date in Gregorian calendar',
 'descriptions_ja': None,
 'id': 'Q69221511',
 'properties': [{'entities': [{'entity_midasi': '土曜日', 'id': 'Q131'}],
                 'id': 'P2894',
                 'property_midasi': {'en': 'day of week',
                                     'ja': '曜日',
                                     'type': 'WikibaseItem'}},
                {'entities': [{'entity_midasi': '1843年1月', 'id': 'Q16644749'}],
                 'id': 'P361',
                 'property_midasi': {'en': 'part of',
                                     'ja': '以下の一部分',
                                     'type': 'WikibaseItem'}},
                {'entities': [{'entity_midasi': '1月14日', 'id': 'Q2257'}],
                 'id': 'P31',
                 'property_midasi': {'en': 'instance of',
                                     'ja': '分類',
                                     'type': 'WikibaseItem'}}],
 'title_en': '14 January 1843',
 'title_ja': '1843年1月14日'}

In [15]:
entity = random.choice(kb)

# Flatten one entity into (subject, property, object) label triples.
triples = [(entity['title_ja'], pe['property_midasi']['ja'], e['entity_midasi'])
           for pe in entity['properties'] for e in pe['entities']]
triples

[('新潟県立新潟南高等学校', '国', '日本'),
 ('新潟県立新潟南高等学校', '位置する行政区画', '新潟市'),
 ('新潟県立新潟南高等学校', '分類', 'ハイスクール'),
 ('新潟県立新潟南高等学校', '分類', '教育機関')]
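
The same flattening can be wrapped in a small generator so that it applies to every record in the knowledge base. This is a minimal sketch: `iter_triples` is a hypothetical name, and the toy record is abridged from the sample entry printed earlier.

```python
def iter_triples(record, lang='ja'):
    """Yield (subject, property, object) label triples for one reduced record."""
    subject = record['title_ja']
    for prop in record['properties']:
        for target in prop['entities']:
            yield (subject, prop['property_midasi'][lang], target['entity_midasi'])


# Abridged copy of the sample record shown above.
toy_record = {
    'id': 'Q69221511',
    'title_ja': '1843年1月14日',
    'properties': [
        {'id': 'P2894',
         'property_midasi': {'en': 'day of week', 'ja': '曜日', 'type': 'WikibaseItem'},
         'entities': [{'entity_midasi': '土曜日', 'id': 'Q131'}]},
        {'id': 'P31',
         'property_midasi': {'en': 'instance of', 'ja': '分類', 'type': 'WikibaseItem'},
         'entities': [{'entity_midasi': '1月14日', 'id': 'Q2257'}]},
    ],
}

toy_triples = list(iter_triples(toy_record))
print(toy_triples)
```

Over the full list, something like `[t for rec in kb for t in iter_triples(rec)]` would collect all triples, at the cost of holding them in memory.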