# Supervised Training with Wikidata Simple Questions

- Dataset Repository: [askplatypus/wikidata-simplequestions](https://github.com/askplatypus/wikidata-simplequestions)
- Stats (only answerable questions):
    - train: 14894
    - valid: 2210
    - test: 4296

## Step 0. Prerequisites

Install and import dependencies:

```bash
pip install srsly srtk pandas
```

In [None]:
import os

import pandas as pd
import srsly

Download data from GitHub to `data/wikidata-simplequestions/raw` with the following commands:

```bash
mkdir -p data/wikidata-simplequestions/raw
wget https://raw.githubusercontent.com/askplatypus/wikidata-simplequestions/master/annotated_wd_data_test_answerable.txt -P data/wikidata-simplequestions/raw
wget https://raw.githubusercontent.com/askplatypus/wikidata-simplequestions/master/annotated_wd_data_train_answerable.txt -P data/wikidata-simplequestions/raw
wget https://raw.githubusercontent.com/askplatypus/wikidata-simplequestions/master/annotated_wd_data_valid_answerable.txt -P data/wikidata-simplequestions/raw
```


## Step 1. Format the Raw Data for Preprocessing

### 1.1 Inspect raw data

In [None]:
raw_paths = {
    'train': 'data/wikidata-simplequestions/raw/annotated_wd_data_train_answerable.txt',
    'valid': 'data/wikidata-simplequestions/raw/annotated_wd_data_valid_answerable.txt',
    'test': 'data/wikidata-simplequestions/raw/annotated_wd_data_test_answerable.txt',
}
splits = ['train', 'valid', 'test']
intermediate_dir = 'data/wikidata-simplequestions/intermediate' # intermediate data, here it's the scored paths
dataset_dir = 'data/wikidata-simplequestions/dataset'  # preprocessed data
output_model_dir = 'artifacts/models/wd_simple'
retrieved_subgraph_path = 'artifacts/subgraphs/wd_simple.jsonl'

`Rxxx` property identifiers encode the inverse of the Wikidata property `Pxxx`. For example, `R19` encodes the property "born here", i.e. the inverse of `P19` ("birth place").
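
The naming convention can be unpacked mechanically. The following is a minimal sketch, assuming every identifier is a single `P` or `R` followed by digits; `split_relation` is a hypothetical helper for illustration and is not part of `srtk`.

```python
import re


def split_relation(relation_id: str) -> tuple[str, bool]:
    """Return (wikidata_property_id, is_inverse) for a dataset relation id.

    Assumption: ids look like 'P19' or 'R19'; anything else is rejected.
    """
    match = re.fullmatch(r"([PR])(\d+)", relation_id)
    if match is None:
        raise ValueError(f"Unexpected relation id: {relation_id!r}")
    prefix, number = match.groups()
    # 'R19' refers to the same Wikidata property as 'P19', read in reverse.
    return f"P{number}", prefix == "R"


print(split_relation("R19"))   # ('P19', True)
print(split_relation("P364"))  # ('P364', False)
```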

In [None]:
raw_train_path = raw_paths['train']
! head -n 5 $raw_train_path

Q12439	R19	Q6106580	who is a musician born in detroit
Q6817891	P364	Q1568	what is the language in which mera shikar was filmed in
Q1297	R276	Q2888523	Whats the name of a battle that happened in chicago
Q193592	R413	Q5822614	what player plays the position midfielder?
Q6849115	P413	Q336286	what is the position that  mike twellman plays


### 1.2 Remove reverse relations

In [None]:
preserved_data = {}
for split in splits:
    raw_path = raw_paths[split]
    data = pd.read_csv(raw_path, sep='\t', header=None)
    print(f'Full {split} size: {len(data)}')
    before_len = len(data)
    # Remove samples whose relation (column 1, zero-indexed) starts with 'R'
    data = data[~data[1].str.startswith('R')]
    after_len = len(data)
    print(f'{split} size after removing reverse relation: {after_len}')
    print(f'{after_len/before_len*100:.2f}% of {split} data are kept')
    preserved_data[split] = data


Full train size: 19481
train size after removing reverse relation: 14894
76.45% of train data are kept
Full valid size: 2821
valid size after removing reverse relation: 2210
78.34% of valid data are kept
Full test size: 5622
test size after removing reverse relation: 4296
76.41% of test data are kept


### 1.3 Convert to scored path format

The scored path format is a JSONL file in which each line is a dictionary such as:
```json
{
    "id": "train-100",
    "question": "What is the birth place of Barack Obama?",
    "question_entities": ["Q76"],
    "answer_entities": ["Q23513"],
    "paths": [["P19"]],
    "path_scores": [1.0]
}
```

`paths` may hold multiple paths, and each path may have a different length. `path_scores` holds the score of each path; every ground-truth path is assigned the maximum score of 1.0.

In [None]:
# Create folder to store intermediate files
!mkdir -p $intermediate_dir
!mkdir -p $dataset_dir

In [None]:
for split in splits:
    data = preserved_data[split]
    samples = []
    for idx, line in enumerate(data.values):
        question_entity, relation, answer_entity, question = line
        sample = {
            "id": f"{split}-{idx:05d}",
            "question": question,
            "question_entities": [question_entity],
            "answer_entities": [answer_entity],
            "paths": [[relation]],
            "path_scores": [1.0]
        }
        samples.append(sample)
    save_path = os.path.join(intermediate_dir, f'scores_{split}.jsonl')
    srsly.write_jsonl(save_path, samples)
    print(f'Saved {split} scored paths file to {save_path}')

Saved train scored paths file to data/wikidata-simplequestions/intermediate/scores_train.jsonl
Saved valid scored paths file to data/wikidata-simplequestions/intermediate/scores_valid.jsonl
Saved test scored paths file to data/wikidata-simplequestions/intermediate/scores_test.jsonl
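
Before moving on, it can be worth sanity-checking the records that were just written. The sketch below is a hypothetical checker, not an `srtk` utility: it uses the field names from the conversion cell above and assumes that every path needs exactly one score.

```python
REQUIRED_KEYS = {
    "id", "question", "question_entities",
    "answer_entities", "paths", "path_scores",
}


def check_scored_sample(sample: dict) -> list[str]:
    """Return a list of problems found in one scored-path record (empty if none)."""
    missing = REQUIRED_KEYS - set(sample)
    if missing:
        return [f"missing keys: {sorted(missing)}"]
    problems = []
    # Assumption: one score per path.
    if len(sample["paths"]) != len(sample["path_scores"]):
        problems.append("paths and path_scores differ in length")
    # Each path is expected to be a non-empty list of relation ids.
    if not all(isinstance(path, list) and path for path in sample["paths"]):
        problems.append("every path must be a non-empty list of relations")
    return problems


example = {
    "id": "train-00000",
    "question": "what is the language in which mera shikar was filmed in",
    "question_entities": ["Q6817891"],
    "answer_entities": ["Q1568"],
    "paths": [["P364"]],
    "path_scores": [1.0],
}
print(check_scored_sample(example))  # []
```

The same function could be applied to each record yielded by `srsly.read_jsonl(save_path)`.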


## Step 2. Preprocessing

Use the `srtk preprocess` command to create training samples. Do not pass `--search-path`, because the paths are already provided in the dataset. This step mainly involves negative sampling and dataset generation.
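
To give a feel for what negative sampling produces, here is a rough sketch of how a single one-hop path could expand into training records. The record layout (`query`, `positive`, `negatives`, the `[SEP]` separator and the `END OF HOP` label) mirrors the samples this tutorial prints from `train.jsonl`. The function names, the candidate lists, the choice of 15 negatives and sampling with replacement are all assumptions made for illustration, not the actual `srtk` implementation.

```python
import random

END_OF_HOP = "END OF HOP"  # terminal label, as seen in the preprocessed samples


def sample_negatives(positive, candidates, k, rng):
    """Draw k negatives (with replacement) from candidates other than the positive."""
    pool = [c for c in candidates if c != positive]
    return rng.choices(pool, k=k) if pool else []


def build_one_hop_records(question, relation, hop_candidates, end_candidates,
                          k=15, seed=0):
    """Expand one single-relation path into two records: the hop and its end.

    hop_candidates / end_candidates are hypothetical lists of relation labels
    around the question entity and the answer entity respectively.
    """
    rng = random.Random(seed)
    return [
        {"query": f"{question} [SEP] ",
         "positive": relation,
         "negatives": sample_negatives(relation, hop_candidates, k, rng)},
        {"query": f"{question} [SEP] {relation}",
         "positive": END_OF_HOP,
         "negatives": sample_negatives(END_OF_HOP, end_candidates, k, rng)},
    ]


records = build_one_hop_records(
    "who directed this film",
    "director",
    hop_candidates=["director", "composer", "cast member"],
    end_candidates=["occupation", "country of citizenship"],
)
for record in records:
    print(record["query"], "->", record["positive"])
```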

In [None]:
for split in splits:
    print(f'Processing {split} data...')
    scored_path = os.path.join(intermediate_dir, f'scores_{split}.jsonl')
    dataset_path = os.path.join(dataset_dir, f'{split}.jsonl')
    !srtk preprocess -i $scored_path \
        -o $dataset_path \
        -e http://localhost:1234/api/endpoint/sparql \
        -kg wikidata

Processing train data...
Negative sampling: 100%|██████████████████| 14894/14894 [02:49<00:00, 87.90it/s]
Number of training records: 29643
Converting relation ids to labels: 100%|█| 29643/29643 [01:28<00:00, 336.45it/s]
Training samples are saved to data/wikidata-simplequestions/dataset/train.jsonl
Processing valid data...
Negative sampling: 100%|████████████████████| 2210/2210 [00:26<00:00, 84.60it/s]
Number of training records: 4397
Converting relation ids to labels: 100%|███| 4397/4397 [00:13<00:00, 326.37it/s]
Training samples are saved to data/wikidata-simplequestions/dataset/valid.jsonl
Processing test data...
Negative sampling: 100%|████████████████████| 4296/4296 [00:48<00:00, 89.45it/s]
Number of training records: 8539
Converting relation ids to labels: 100%|███| 8539/8539 [00:25<00:00, 331.68it/s]
Training samples are saved to data/wikidata-simplequestions/dataset/test.jsonl


In [None]:
train_dataset_path = os.path.join(dataset_dir, 'train.jsonl')
!head -n 5 $train_dataset_path

{"query":"what is the language in which mera shikar was filmed in [SEP] ","positive":"original language of film or TV show","negatives":["composer","instance of","cast member","director","instance of","composer","country of origin","country of origin","cast member","cast member","color","cast member","instance of","country of origin","cast member"]}
{"query":"what is the language in which mera shikar was filmed in [SEP] original language of film or TV show","positive":"END OF HOP","negatives":["has grammatical case","has grammatical case","described by source","different from","indigenous to","on focus list of Wikimedia project","has grammatical gender","instance of","different from","writing system","subclass of","instance of","country","subclass of","writing system"]}
{"query":"what is the position that  mike twellman plays [SEP] ","positive":"position played on team / speciality","negatives":["country of citizenship","sex or gender","occupation","educated at","family name","sport","

## Step 3. Train a scorer model

Training a retrieval model is as easy as running a single command! Just pass in the pretrained language model name and the training dataset path. We'll handle the rest.

We recommend registering an account at [Weights & Biases (wandb)](https://docs.wandb.ai/quickstart) and logging in from the command line. You will then be able to track your training progress in an online dashboard.

In [None]:
train_dataset_path = os.path.join(dataset_dir, 'train.jsonl')
validation_dataset_path = os.path.join(dataset_dir, 'valid.jsonl')

In [None]:
os.environ['TOKENIZERS_PARALLELISM'] = 'true'

In [None]:
!CUDA_VISIBLE_DEVICES=1 srtk train -t $train_dataset_path \
    -v $validation_dataset_path \
    --model-name-or-path roberta-base \
    --output-dir $output_model_dir \
    --accelerator gpu \
    --learning-rate 1e-5 \
    --batch-size 96 \
    --max-epochs 5

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Found cached dataset json (/home/wiss/liao/.cache/huggingface/datasets/json/default-e38cde1f0c05f759/0.0.0/0f7e3662623656454fcd2b650f34e886a7db4b9104504885bd462096cc7a9f51)
Loading cached processed dataset at /home/wiss/liao/.cache/huggingface/datasets/json/default-e38cde1f0c05f759/0.0.0/0f7e3662

By clicking the wandb link shown in the output, you can easily monitor the training progress in a dashboard. 

![wandb training progress](https://i.imgur.com/4zKmzBT.png)

## Step 4. Evaluate the model

You can evaluate the trained retriever (that is, the scorer) by passing the `--evaluate` argument to the `retrieve` subcommand. The metric is answer recall: the number of samples whose ground-truth answer entities are retrieved, divided by the total number of test samples.
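
The metric itself is simple to reproduce. The sketch below is an illustration rather than the code `srtk` runs: it assumes a sample counts as a hit when at least one ground-truth answer entity appears among the retrieved entities, and the `(answers, retrieved)` input shape is hypothetical.

```python
def answer_recall(samples):
    """Fraction of samples whose answers overlap with the retrieved entities.

    samples: iterable of (answer_entities, retrieved_entities) pairs.
    Assumption: one shared entity is enough to count a sample as a hit.
    """
    samples = list(samples)
    if not samples:
        return 0.0
    hits = sum(1 for answers, retrieved in samples if set(answers) & set(retrieved))
    return hits / len(samples)


toy = [
    (["Q23513"], ["Q23513", "Q30"]),  # answer retrieved
    (["Q1568"], ["Q30"]),             # answer missed
]
print(answer_recall(toy))  # 0.5
```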

The file for evaluation should at least contain the following fields:
```json
{
    "question": "What is the birth place of Barack Obama?",
    "question_entities": ["Q76"],
    "answer_entities": ["Q23513"]
}
```

In [None]:
test_scored_path = os.path.join(intermediate_dir, 'scores_test.jsonl')

In [None]:
!srtk retrieve -i $test_scored_path \
    -o $retrieved_subgraph_path \
    -e http://localhost:1234/api/endpoint/sparql \
    -kg wikidata \
    --scorer-model-path $output_model_dir \
    --beam-width 20 \
    --max-depth 1 \
    --evaluate

Retrieving subgraphs: 100%|█████████████████| 4296/4296 [02:47<00:00, 25.61it/s]
Retrieved subgraphs saved to to artifacts/subgraphs/wd_simple.jsonl
Answer recall: 0.9748603351955307 (4188 / 4296)


## Step 5. (Optional) Share your model

See [Share a model](https://huggingface.co/docs/transformers/model_sharing) in the Hugging Face documentation.