# End-to-end Subgraph Retrieval

This tutorial shows how to use `srtk` to retrieve subgraphs from [Wikidata](https://www.wikidata.org/) for natural language questions.

It contains the following steps:

0. [Install the dependencies](#step-0-preparations)
1. [Link the entities in the question to the entities in Wikidata.](#step-1-entity-linking)
2. [Use a pretrained retriever to retrieve the subgraphs.](#step-2-retrieve-subgraphs)
3. [Visualize the retrieved subgraphs.](#step-3-visualize-the-retrieved-subgraphs)

We will use the [Mintaka](https://huggingface.co/datasets/AmazonScience/mintaka) dataset as an example.

## Step 0. Preparations

Before running this notebook, you should have the entity linking server, the Wikidata SPARQL server, and the wikimapper database prepared. Please refer to [Setup Wikidata](https://srtk.readthedocs.io/en/latest/setups/setup_wikidata.html) for setup instructions. We assume that:

- The REL entity linking server is running at `http://localhost:1235`.
- The Wikidata SPARQL server endpoint is at `http://localhost:1234/api/endpoint/sparql`.
- The wikimapper database file is located at `resources/wikimapper/index_enwiki.db`.

In [None]:
# Install srtk
!pip install srtk

In [None]:
# Import all dependencies
import srsly
from datasets import load_dataset

Define intermediate and output file paths:

In [None]:
question_path = 'data/mintaka-100/question.jsonl'
linked_path = 'data/mintaka-100/linked.jsonl'
retrieved_subgraph_path = 'data/mintaka-100/subgraph.jsonl'
visualization_dir = 'data/mintaka-100/html'

## Step 1. Entity Linking

The Mintaka dataset is in fact already linked to Wikidata. We still perform this step to show how `srtk` applies to datasets that are not linked to Wikidata.

The pipeline steps in `srtk` are connected through files (mostly JSONL files). Therefore, we first need to convert the dataset to a JSONL file, where each line is a JSON object representing a question.
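To make the file format concrete, here is a minimal sketch of a JSONL round trip using only the standard library. The helper names `write_jsonl` and `read_jsonl` are illustrative stand-ins (the notebook itself uses `srsly` for this), and the two records reuse questions that appear later in this tutorial.

```python
import json
import os
import tempfile

# Two question records in the shape this tutorial uses: an id and a question.
samples = [
    {"id": "a9011ddf", "question": "What is the seventh tallest mountain in North America?"},
    {"id": "982450cf", "question": "Who is the youngest current US governor?"},
]


def write_jsonl(path, records):
    """Write one JSON object per line (illustrative helper, not part of srtk)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round trip through a temporary file so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "question.jsonl")
    write_jsonl(demo_path, samples)
    loaded = read_jsonl(demo_path)

print(loaded[0]["question"])
```

Because every line is an independent JSON object, later steps can add fields to each record without changing how the file is read.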

In [None]:

# Load the first 100 samples of mintaka dataset from huggingface datasets 
mintaka = load_dataset("AmazonScience/mintaka", split="train[:100]")
print(mintaka)
# Extract the question and id from the dataset
samples = [{'id': sample['id'], 'question': sample['question']} for sample in mintaka]
srsly.write_jsonl(question_path, samples)

No config specified, defaulting to: mintaka/en
Found cached dataset mintaka (/home/wiss/liao/.cache/huggingface/datasets/AmazonScience___mintaka/en/1.0.0/bb35d95f07aed78fa590601245009c5f585efe909dbd4a8f2a4025ccf65bb11d)


Dataset({
    features: ['id', 'lang', 'question', 'answerText', 'category', 'complexityType', 'questionEntity', 'answerEntity'],
    num_rows: 100
})


Perform entity linking on the questions using the CLI interface. Run `srtk link --help` for more details.

In [None]:
!srtk link --input $question_path \
    --output $linked_path \
    --knowledge-graph wikidata \
    --ground-on question \
    --el-endpoint http://127.0.0.1:1235 \
    --wikimapper-db resources/wikimapper/index_enwiki.db

Entity linking data/mintaka-100/question.jsonl: 100%|█| 100/100 [00:02<00:00, 37
0 / 146 grounded entities not converted to Wikidata qids
Entity linking result saved to data/mintaka-100/linked.jsonl


Check the linking results:

In [None]:
!head -n 5 $linked_path

{"question":"What is the seventh tallest mountain in North America?","question_entities":["Q49"],"spans":[[40,53]],"entity_names":["North_America"],"id":"a9011ddf"}
{"question":"Which actor was the star of Titanic and was born in Los Angeles, California?","question_entities":["Q44578","Q65","Q99"],"spans":[[28,35],[52,63],[65,75]],"entity_names":["Titanic_(1997_film)","Los_Angeles","California"],"id":"2723bb1b"}
{"question":"Which actor starred in Vanilla Sky and was married to Katie Holmes?","question_entities":["Q110278","Q174346"],"spans":[[23,34],[54,66]],"entity_names":["Vanilla_Sky","Katie_Holmes"],"id":"88349c89"}
{"question":"What year was the first book of the A Song of Ice and Fire series published?","question_entities":["Q45875"],"spans":[[36,58]],"entity_names":["A_Song_of_Ice_and_Fire"],"id":"bff78c91"}
{"question":"Who is the youngest current US governor?","question_entities":["Q30"],"spans":[[28,30]],"entity_names":["United_States"],"id":"982450cf"}


## Step 2. Retrieve Subgraphs

The retrieved path consists of a list of relations, based on the idea that a question typically implies a reasoning chain. For instance, "Where is Hakata Ward?" implies "Hakata --**locate in**--> ?".

The retrieval process relies on the similarity between a question and its expanding path, which is formed by the relations along that path. In the example above, the expanding path `locate in` would have an embedding close to the question embedding. For multi-hop relations, each relation is embedded close to the embedding of the question combined with the previous relations. For instance, if a question `q` has the reasoning path `r1 -> r2 -> r3`, then the embedding of `r1` is close to `q`, the embedding of `r2` is close to `q + r1`, and the embedding of `r3` is close to `q + r1 + r2`.
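The idea can be illustrated with a toy example. This is only a sketch: the two-dimensional vectors are hand-picked rather than produced by a trained encoder, the relation names are illustrative, and the "question plus previous relations" query is modeled as a separate lookup entry instead of an encoded text. The function `rank_relations` is hypothetical and not part of `srtk`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hand-picked toy embeddings (assumptions for illustration only).
# Keys are the question followed by the relations already on the path.
query_embeddings = {
    ("q",): [1.0, 0.0],                # q alone
    ("q", "cast member"): [0.0, 1.0],  # q + r1
}
relation_embeddings = {
    "cast member": [0.9, 0.1],
    "spouse": [0.1, 0.9],
    "genre": [-0.5, -0.5],
}


def rank_relations(query_key):
    """Sort candidate relations by similarity to the current query embedding."""
    query = query_embeddings[query_key]
    return sorted(
        relation_embeddings,
        key=lambda rel: cosine(query, relation_embeddings[rel]),
        reverse=True,
    )


first_hop = rank_relations(("q",))
second_hop = rank_relations(("q", "cast member"))
print(first_hop[0], "->", second_hop[0])
```

With these vectors, the first hop prefers `cast member` and, once that relation is on the path, the second hop prefers `spouse`.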

A scorer is used to evaluate the similarity between the question and the expanding path. For this tutorial, a BERT-like model was trained as the scorer, and it is available on the Hugging Face Hub under the name `drt/scorer-mintaka`. To train your own scorer, please refer to [Train a Scorer](https://github.com/happen2me/subgraph-retrieval-toolkit/blob/main/tutorials/3.weak_train_wikidata.ipynb). If your scorer model is saved locally, you can pass the directory containing the model to the `--scorer-model-path` argument.

Note that the [qualifiers](https://www.wikidata.org/wiki/Help:Qualifiers) in Wikidata are ignored during retrieval by default. You may use the `--include-qualifiers` option to include them in the retrieval process. The `drt/scorer-mintaka` model, however, was trained without qualifiers.

You can use `srtk retrieve` to retrieve subgraphs based on the pre-trained model. Internally, it performs two tasks:

1. Executes beam search for possible paths (relation chains) using the trained scorer.
2. Retrieves entities starting from the linked entities in the question, following the relation paths through the Wikidata SPARQL endpoint.

The subgraphs are represented as triplets of (subject, relation, object). These triplets are added to the output JSONL file in the `triplets` field.
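The two tasks above can be sketched on a toy in-memory graph. This is a simplified illustration and not the `srtk` implementation: the score table stands in for the trained scorer, the triples are hand-written rather than fetched from a SPARQL endpoint, path scores are plain sums, and the function names `beam_search_paths` and `follow_path` are hypothetical.

```python
# Hand-written toy graph of (subject, relation, object) triples.
triples = [
    ("Vanilla_Sky", "cast member", "Tom_Cruise"),
    ("Vanilla_Sky", "cast member", "Penelope_Cruz"),
    ("Vanilla_Sky", "genre", "thriller"),
    ("Tom_Cruise", "spouse", "Katie_Holmes"),
]

# Toy stand-in for the scorer: score of a relation given the path so far.
toy_scores = {
    ((), "cast member"): 0.9,
    ((), "genre"): 0.2,
    ((), "spouse"): 0.1,
    (("cast member",), "spouse"): 0.8,
}


def toy_score(path, relation):
    return toy_scores.get((path, relation), 0.0)


def beam_search_paths(score, relations, beam_width, max_depth):
    """Task 1: keep the `beam_width` best relation paths at each depth."""
    beams = [((), 0.0)]
    found = []
    for _ in range(max_depth):
        candidates = [
            (path + (rel,), total + score(path, rel))
            for path, total in beams
            for rel in relations
        ]
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
        found.extend(beams)
    return found


def follow_path(triples, start, path):
    """Task 2: walk a relation path from a start entity and collect triples."""
    frontier = {start}
    collected = []
    for rel in path:
        step = [(s, r, o) for (s, r, o) in triples if s in frontier and r == rel]
        collected.extend(step)
        frontier = {o for _, _, o in step}
    return collected


relations = ["cast member", "genre", "spouse"]
paths = beam_search_paths(toy_score, relations, beam_width=2, max_depth=2)
best_path = max(paths, key=lambda item: item[1])[0]
triplets = follow_path(triples, "Vanilla_Sky", best_path)
print(best_path)
print(triplets)
```

In this sketch, `beam_width` and `max_depth` play the same roles as the `--beam-width` and `--max-depth` options: the first bounds how many candidate paths survive each expansion, the second bounds the number of hops.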

For more information, run `srtk retrieve --help`.

In [None]:
!srtk retrieve --input $linked_path \
    --output $retrieved_subgraph_path \
    --sparql-endpoint http://localhost:1234/api/endpoint/sparql \
    --knowledge-graph wikidata \
    --scorer-model-path drt/scorer-mintaka \
    --beam-width 10 \
    --max-depth 2

Retrieving subgraphs: 100%|███████████████████| 100/100 [00:50<00:00,  1.97it/s]
Retrieved subgraphs saved to to data/mintaka-100/subgraph.jsonl


In [None]:
!sed '3!d' $retrieved_subgraph_path | jq

{
  "question": "Which actor starred in Vanilla Sky and was married to Katie Holmes?",
  "question_entities": [
    "Q110278",
    "Q174346"
  ],
  "spans": [
    [
      23,
      34
    ],
    [
      54,
      66
    ]
  ],
  "entity_names": [
    "Vanilla_Sky",
    "Katie_Holmes"
  ],
  "id": "88349c89",
  "triplets": [
    [
      "Q110278",
      "P1981",
      "Q20644797"
    ],
    [
      "Q49088",

## Step 3. Visualize the retrieved subgraphs

You may use the `srtk visualize` command to easily visualize the retrieved subgraphs. Each subgraph is stored as a webpage file. Run `srtk visualize --help` for more details.

In [None]:
!srtk visualize --input $retrieved_subgraph_path \
    --output-dir $visualization_dir \
    --sparql-endpoint http://localhost:1234/api/endpoint/sparql \
    --knowledge-graph wikidata

Visualizing graphs:   0%|                               | 0/100 [00:00<?, ?it/s]No label for identifier Q49280127.
Visualizing graphs:   4%|▉                      | 4/100 [00:00<00:15,  6.06it/s]No label for identifier Q29045456.
No label for identifier Q20571325.
No label for identifier Q29045433.
Visualizing graphs:  22%|████▊                 | 22/100 [00:03<00:10,  7.66it/s]No label for identifier Q25554668.
Visualizing graphs:  23%|█████                 | 23/100 [00:03<00:11,  6.79it/s]No label for identifier Q43200400.
Visualizing graphs:  25%|█████▌                | 25/100 [00:03<00:11,  6.29it/s]No label for identifier Q22828226.
Visualizing graphs:  37%|████████▏             | 37/100 [00:04<00:06,  9.97it/s]No label for identifier Q11522520.
Visualizing graphs:  57%|████████████▌         | 57/100 [00:06<00:03, 13.73it/s]No label for identifier Q1847223.
No label for identifier Q112289487.
Visualizing graphs:  61%|█████████████▍        | 61/100 [00:06<00:03, 11.89it/s]No label f

A visualization example of question *Which actor starred in Vanilla Sky and was married to Katie Holmes?* is shown below. The question entities are shown in dark blue.

![image.png](https://i.imgur.com/nsYUhGT.png)

Moreover, the visualization script can highlight the answer entities if the answers are known (e.g. in KGQA). To do so, an `answer_entities` field should be present in the subgraph JSONL file. The value of the `answer_entities` field is a list of Wikidata entity IDs.

In the Mintaka dataset, some of the answers are entities and are already known, so we can simply add them to the subgraph JSONL file.

In [None]:
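# Attach the known answer entities from Mintaka to each retrieved subgraph
subgraphs = srsly.read_jsonl(retrieved_subgraph_path)
processed_subgraphs = []
for sample, subgraph in zip(mintaka, subgraphs):
    # The 'name' field of each answer entity is used as its Wikidata ID
    answers = [answer['name'] for answer in sample['answerEntity']]
    subgraph['answer_entities'] = answers
    processed_subgraphs.append(subgraph)
# Overwrite the subgraph file with the augmented records
srsly.write_jsonl(retrieved_subgraph_path, processed_subgraphs)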

# Show the answer entities of the third sample
!sed '3!d' $retrieved_subgraph_path | jq | grep -A 2 answer_entities

  "answer_entities": [
    "Q37079"
  ]


Run the visualization script again. The answer entities are now highlighted in green in the subgraph.

In [None]:
!srtk visualize --input $retrieved_subgraph_path \
    --output-dir $visualization_dir \
    --sparql-endpoint http://localhost:1234/api/endpoint/sparql \
    --knowledge-graph wikidata

Visualizing graphs:   0%|                               | 0/100 [00:00<?, ?it/s]No label for identifier Q49280127.
Visualizing graphs:   3%|▋                      | 3/100 [00:00<00:16,  5.84it/s]No label for identifier Q20571325.
No label for identifier Q29045456.
No label for identifier Q29045433.
Visualizing graphs:   5%|█▏                     | 5/100 [00:00<00:13,  7.18it/s]No label for identifier Q29045456.
Visualizing graphs:  22%|████▊                 | 22/100 [00:03<00:10,  7.72it/s]No label for identifier Q25554668.
Visualizing graphs:  23%|█████                 | 23/100 [00:03<00:10,  7.36it/s]No label for identifier Q43200400.
Visualizing graphs:  24%|█████▎                | 24/100 [00:03<00:11,  6.89it/s]No label for identifier Q22828226.
Visualizing graphs:  38%|████████▎             | 38/100 [00:04<00:05, 11.25it/s]No label for identifier Q11522520.
Visualizing graphs:  57%|████████████▌         | 57/100 [00:05<00:03, 13.83it/s]No label for identifier Q112289487.
No label 

If the answer entities are retrieved, they will be displayed in green. An example is shown below.

![with-answer-entity](https://i.imgur.com/BcC8dde.png)
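Whether the answer entities were retrieved can also be checked programmatically. The sketch below counts, over records that have a non-empty `answer_entities` list, how many contain at least one answer entity among the subjects or objects of their `triplets`. The function name `answer_coverage` is hypothetical, and the demo records are illustrative rather than actual retrieval output.

```python
def answer_coverage(subgraphs):
    """Fraction of answerable samples whose triplets contain an answer entity.

    Samples without any answer entity are skipped. Returns 0.0 if none remain.
    """
    answerable = [s for s in subgraphs if s.get("answer_entities")]
    if not answerable:
        return 0.0
    hits = 0
    for sample in answerable:
        # Collect every subject and object appearing in the subgraph.
        nodes = set()
        for subject, _relation, obj in sample["triplets"]:
            nodes.add(subject)
            nodes.add(obj)
        if nodes & set(sample["answer_entities"]):
            hits += 1
    return hits / len(answerable)


# Illustrative records in the shape of the subgraph JSONL file.
demo_subgraphs = [
    {   # answer entity appears as an object: counted as a hit
        "id": "demo-1",
        "triplets": [["Q110278", "P161", "Q37079"]],
        "answer_entities": ["Q37079"],
    },
    {   # answer entity missing from the triplets: counted as a miss
        "id": "demo-2",
        "triplets": [["Q110278", "P1981", "Q20644797"]],
        "answer_entities": ["Q37079"],
    },
    {   # no known answer entity: skipped
        "id": "demo-3",
        "triplets": [],
        "answer_entities": [],
    },
]

coverage = answer_coverage(demo_subgraphs)
print(f"answer coverage: {coverage:.2f}")
```

To apply this to the real output, the records could be loaded with `srsly.read_jsonl(retrieved_subgraph_path)` and passed as a list.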