# **Fact-Checking with Wikidata**

## **Technique 1**: Wikidata MCP

### Connect Le Chat to Wikidata MCP

* Open Le Chat: *https://chat.mistral.ai*
* Navigate to: *Intelligence* -> *Connectors* -> *+ Add Connector* -> *Custom MCP Connector*
* Add Connector Name: *Wikidata*
* Add Connector Server: *https://wd-mcp.wmcloud.org/mcp*
* Press *Create*


## **Technique 2**: Search & Filter & Classify

### Imports & Configurations

In [None]:
!pip install transformers



In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import requests

HEADERS = {
    'User-Agent': 'Fact-Checker/1.0 (embeddings@wikimedia.de)'
}
LANGUAGE = 'en'
INCLUDE_EXTERNAL_IDS = False

# Define the claim that we need to fact-check
claim = 'Albert Einstein was a theoretical physicist who developed the theory of relativity.'

## Search

### Vector Search for Items

Given a claim, find items from Wikidata that are relevant.

More info on the vector database: https://www.wikidata.org/wiki/Wikidata:Vector_Database

In [None]:
# Search for relevant items with the Wikidata Vector Database
response = requests.get(
    'https://wd-vectordb.wmcloud.org/item/query',
    params={'query': claim, 'lang': LANGUAGE},
    headers=HEADERS,
)
response.raise_for_status()  # Fail early on HTTP errors
items_vectordb = response.json()

In [None]:
items_vectordb_ids = [item['QID'] for item in items_vectordb]
items_vectordb_ids[:5]

['Q937', 'Q43514', 'Q11455', 'Q11452', 'Q1309274']
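
The ID extraction above trusts every entry in the response. A minimal sketch of a more defensive version follows, assuming only what the cell above implies: the response is a list of dicts that carry a `QID` key. The helper name `extract_qids`, the `limit` parameter, and the sample data are hypothetical and used for illustration only.

```python
import re

# Hypothetical sample shaped like the response used above: a list of
# dicts that each carry a 'QID' key. No other keys are assumed.
sample_response = [
    {'QID': 'Q937'},
    {'QID': 'Q43514'},
    {'QID': 'Q937'},       # duplicate hit
    {'QID': 'not-a-qid'},  # malformed entry
    {},                    # entry without a QID
]

QID_PATTERN = re.compile(r'^Q[1-9]\d*$')

def extract_qids(items, limit=None):
    """Return unique, well-formed QIDs in their original ranking order."""
    seen = set()
    qids = []
    for item in items:
        qid = item.get('QID')
        if not isinstance(qid, str) or not QID_PATTERN.match(qid):
            continue  # skip missing or malformed IDs
        if qid in seen:
            continue  # skip duplicates, keep the first (best-ranked) hit
        seen.add(qid)
        qids.append(qid)
    return qids if limit is None else qids[:limit]

print(extract_qids(sample_response))
```

Capping the list with `limit` also keeps the comma-separated ID string passed to later requests short.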

### Fetch Wikidata Statements
Get the statements whose subject is one of the items returned by the vector search. The code below collects every statement of those items, whatever the property.

In [None]:
# Use a microservice to fetch the statements and the labels of their values
# GitHub: https://github.com/philippesaade-wmde/WikidataTextifier

def get_statements_wd_textify(qid):
    params = {
        'id': qid,
        'lang': LANGUAGE,
        'external_ids': INCLUDE_EXTERNAL_IDS,
        'format': 'json'
    }

    results = requests.get(
        "https://wd-textify.toolforge.org",
        params=params,
        headers=HEADERS
    )
    results.raise_for_status()

    return results.json()

items_data = get_statements_wd_textify(','.join(items_vectordb_ids))

In [None]:
# Collect statements from the found items & properties

result_statements = []
for item_qid, item in items_data.items():
    for statement in item['claims']:
        # Get statement data
        statement = {
            'statement': statement,
            'item_label': item['label'],
            'item_qid': item_qid,
        }
        result_statements.append(statement)
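
To make the data flow easier to follow, here is a small offline sketch of the flattening step. The payload is hypothetical: it is shaped only after the fields that the notebook code accesses (`label`, `claims`, `property_label`, `values`, `value`, `qualifiers`) and is not a verbatim response from the service. The helper name `flatten_statements` is also an assumption.

```python
# Hypothetical payload modelled on the fields accessed in this notebook.
# It is NOT a verbatim response from the Textifier service.
mock_items_data = {
    'Q937': {
        'label': 'Albert Einstein',
        'claims': [
            {
                'property_label': 'occupation',
                'values': [{'value': {'label': 'theoretical physicist'}}],
            },
            {
                'property_label': 'award received',
                'values': [{
                    'value': {'label': 'Nobel Prize in Physics'},
                    'qualifiers': [{
                        'property_label': 'point in time',
                        'values': [{'value': {'time': '1921'}}],
                    }],
                }],
            },
        ],
    },
}

def flatten_statements(items_data):
    """Return one record per (item, claim) pair, like the loop above."""
    return [
        {'statement': claim_data, 'item_label': item['label'], 'item_qid': qid}
        for qid, item in items_data.items()
        for claim_data in item['claims']
    ]

records = flatten_statements(mock_items_data)
for record in records:
    print(record['item_qid'], record['item_label'],
          record['statement']['property_label'])
```

Each record keeps the item label and QID next to the raw statement, so later steps can build a readable sentence and still trace it back to its source item.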

## Filter




### Statement to Text

Transform each Wikidata statement into a textual representation that an LLM can process.

In [None]:
# Transform each Wikidata statement into a string

def statement_value_to_string(value):
    if isinstance(value, str):
        return value
    if 'string' in value:
        return value['string']
    elif 'label' in value:
        return value['label']
    elif 'time' in value:
        return value['time']
    return str(value)

def statement_to_string(statement):
    item_label = statement['item_label']
    statement = statement['statement']
    text = f"{item_label} : {statement['property_label']} : "
    for svalue in statement['values']:
        text += f"{statement_value_to_string(svalue['value'])}"

        if 'qualifiers' in svalue:
            for qualifier in svalue['qualifiers']:
                values = ', '.join([
                    statement_value_to_string(v['value']) for v in qualifier['values']
                ])
                text += f" ({qualifier['property_label']} : {values})"

        text += ", "
    return text.strip().rstrip(',')

for i in range(len(result_statements)):
    result_statements[i]['text'] = statement_to_string(result_statements[i])
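
Quantity values fall through to `str(value)` in the function above, so they appear in the text as raw dictionaries such as `{'amount': '+121573', 'unit': 'Swedish krona', ...}`. The sketch below shows one possible variant that formats them more readably. The function name `value_to_string` is hypothetical, and the `amount` and `unit` keys are assumed from the output shown later in this notebook rather than from documentation.

```python
def value_to_string(value):
    """Variant of statement_value_to_string that also formats quantities."""
    if isinstance(value, str):
        return value
    if not isinstance(value, dict):
        return str(value)
    # Same lookup order as the original function.
    for key in ('string', 'label', 'time'):
        if key in value:
            return str(value[key])
    # Assumption: quantities carry 'amount' and optionally 'unit'.
    if 'amount' in value:
        amount = str(value['amount']).lstrip('+')  # drop the explicit plus sign
        unit = value.get('unit')
        return f"{amount} {unit}" if unit else amount
    return str(value)

print(value_to_string({'amount': '+121573', 'unit': 'Swedish krona', 'unit_QID': 'Q122922'}))
print(value_to_string({'label': 'Nobel Prize in Physics'}))
```

Shorter, cleaner text also leaves more room within the token limit of the models used in the next steps.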

### Prepare Reranker LLM

We score the statements using a cross-encoder reranker LLM: https://huggingface.co/jinaai/jina-reranker-v1-turbo-en

The model takes as input both the claim and a statement and outputs a score of relevance between the two.

In [None]:
# Load the model: Model size 38M parameters
reranker_model = AutoModelForSequenceClassification.from_pretrained(
    'jinaai/jina-reranker-v1-turbo-en', num_labels=1, trust_remote_code=True
)

### Ranking & Filtering Statements

Score each Wikidata statement against the original claim, then keep only the statements that are relevant enough.

In [None]:
documents = [result_statements[i]['text'] for i in range(len(result_statements))]
sentence_pairs = [[claim, doc] for doc in documents]

# Predict Reranker Score
statement_scores = reranker_model.compute_score(sentence_pairs)

# Save scores
for i in range(len(result_statements)):
    result_statements[i]['score'] = statement_scores[i]

In [None]:
# Sort statements based on the calculated scores
result_statements = sorted(result_statements, key=lambda x: x['score'], reverse=True)

# Drop statements with a score lower than the threshold
THRESHOLD = 0.25
result_statements = [r for r in result_statements if r['score'] > THRESHOLD]
len(result_statements)

15
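
The threshold step can be wrapped in a small reusable helper. This is a sketch under stated assumptions: the helper name `filter_by_score`, the optional `top_k` cap, and the scores below are hypothetical and do not come from the reranker.

```python
def filter_by_score(statements, threshold=0.25, top_k=None):
    """Keep statements scoring above `threshold`, best first, at most `top_k`."""
    kept = [s for s in statements if s['score'] > threshold]
    kept.sort(key=lambda s: s['score'], reverse=True)
    return kept if top_k is None else kept[:top_k]

# Hypothetical scores, for illustration only.
mock_statements = [
    {'text': 'Albert Einstein : occupation : theoretical physicist', 'score': 0.91},
    {'text': 'Albert Einstein : place of birth : Ulm', 'score': 0.12},
    {'text': 'Albert Einstein : notable work : general relativity', 'score': 0.78},
    {'text': 'Albert Einstein : instrument : violin', 'score': 0.25},
]

for s in filter_by_score(mock_statements, threshold=0.25, top_k=2):
    print(f"{s['score']:.2f}  {s['text']}")
```

The comparison is strict, as in the cell above, so a statement that scores exactly at the threshold is dropped. A `top_k` cap is one way to bound the work done by the larger classifier in the next step.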

## Classify

### Prepare Fact-Checking LLM

The LLM is an open-source NLI (natural language inference) model: https://huggingface.co/MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli

It is a classifier that takes the claim and a Wikidata statement as input and outputs three probabilities:
- **Entailment**: The claim can be inferred from the Wikidata statement, so the statement supports the claim.
- **Contradiction**: The claim contradicts the Wikidata statement.
- **Neutral**: The statement neither supports nor contradicts the claim, for example because the two are unrelated.
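
As a rough illustration of how the three probabilities could be read, the sketch below maps them to a per-statement verdict. The function name `verdict_from_probs`, the verdict strings, and the 50% cutoff are assumptions chosen for this example, not part of the model or of this notebook. Probabilities are expressed as percentages.

```python
def verdict_from_probs(probs, min_confidence=50.0):
    """Map NLI percentages to 'supports', 'refutes', or 'not enough info'."""
    label = max(probs, key=probs.get)  # most probable class
    if label == 'entailment' and probs[label] >= min_confidence:
        return 'supports'
    if label == 'contradiction' and probs[label] >= min_confidence:
        return 'refutes'
    return 'not enough info'

# Hypothetical probabilities, for illustration only.
print(verdict_from_probs({'entailment': 96.0, 'neutral': 3.5, 'contradiction': 0.5}))
print(verdict_from_probs({'entailment': 2.0, 'neutral': 8.0, 'contradiction': 90.0}))
print(verdict_from_probs({'entailment': 40.0, 'neutral': 35.0, 'contradiction': 25.0}))
```

The third example falls back to "not enough info" because no class is confident enough, even though entailment is the largest value.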

In [None]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Download the model, Model size 0.4B parameters
model_name = "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).to(device)
label_names = ["entailment", "neutral", "contradiction"]


def predict_entailment(premises, claim, batch_size=8):
    results = []
    with torch.no_grad():
        for i in range(0, len(premises), batch_size):
            batch_premises = premises[i:i + batch_size]

            enc = tokenizer(
                batch_premises,
                [claim] * len(batch_premises),
                truncation=True,
                padding=True,
                return_tensors="pt",
            )

            enc = {k: v.to(device) for k, v in enc.items()}

            out = model(**enc)
            probs = torch.softmax(out.logits, dim=-1).cpu()

            for row in probs:
                row_dict = {
                    name: round(float(p) * 100, 1)
                    for name, p in zip(label_names, row)
                }
                results.append(row_dict)

    return results
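
Inside `predict_entailment`, the softmax turns the three raw logits of each pair into probabilities that sum to one, which are then reported as percentages. The pure-Python sketch below reproduces that conversion on hypothetical logits so that it can be run without downloading the model. The helper name `softmax_percentages` is an assumption.

```python
import math

def softmax_percentages(logits, labels=("entailment", "neutral", "contradiction")):
    """Convert raw logits to rounded percentages, one per label."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return {name: round(e / total * 100, 1) for name, e in zip(labels, exps)}

# Hypothetical logits, for illustration only.
print(softmax_percentages([4.0, 1.0, -2.0]))
print(softmax_percentages([0.0, 0.0, 0.0]))
```

Because each value is rounded independently, the three percentages may add up to slightly more or less than 100.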

In [None]:
claim

'Albert Einstein was a theoretical physicist who developed the theory of relativity.'

### Calculate Entailment Probability
Classify whether each retrieved Wikidata statement entails, contradicts, or is neutral toward the claim.

In [None]:
premises = [result_statements[i]['text'] for i in range(len(result_statements))]

# Predict Entailment
predictions = predict_entailment(
    premises,
    claim,
    batch_size=8
)

# Save Scores
for i in range(len(result_statements)):
    result_statements[i]['entailment'] = predictions[i]
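
The per-statement predictions still have to be combined into one answer for the claim. The sketch below shows one possible rule, using the strongest supporting and strongest contradicting statement. The function name `aggregate_verdict`, the 80% cutoff, and the mock data are assumptions for illustration; this notebook does not prescribe an aggregation rule.

```python
def aggregate_verdict(statements, min_confidence=80.0):
    """Combine per-statement NLI percentages into one claim-level verdict."""
    best_support = max(
        (s['entailment']['entailment'] for s in statements), default=0.0
    )
    best_refute = max(
        (s['entailment']['contradiction'] for s in statements), default=0.0
    )
    if best_support >= min_confidence and best_support >= best_refute:
        return 'supported', best_support
    if best_refute >= min_confidence:
        return 'refuted', best_refute
    return 'not enough evidence', max(best_support, best_refute)

# Hypothetical predictions, stored under the same keys as in the cell above.
mock_results = [
    {'text': 'Albert Einstein : occupation : theoretical physicist',
     'entailment': {'entailment': 96.0, 'neutral': 3.5, 'contradiction': 0.5}},
    {'text': 'Albert Einstein : place of birth : Ulm',
     'entailment': {'entailment': 1.0, 'neutral': 98.0, 'contradiction': 1.0}},
]

print(aggregate_verdict(mock_results))
```

A single strong statement decides the outcome under this rule, so a claim with several parts may be marked as supported when only one part is covered. Checking each part of the claim separately would be a stricter alternative.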

In [None]:
# Sort by ascending neutral probability, so the statements most relevant to the claim come first
result_statements = sorted(result_statements, key=lambda x: x['entailment']['neutral'])

print(result_statements[0]['text'])
print(result_statements[0]['entailment'])

Albert Einstein : award received : Barnard Medal for Meritorious Service to Science (point in time : 1920), Nobel Prize in Physics (point in time : 1921) (prize money : {'amount': '+121573', 'unit': 'Swedish krona', 'unit_QID': 'Q122922'}) (award rationale : for his services to Theoretical Physics, and especially for his discovery of the law of the photoelectric effect), Gold Medal of the Royal Astronomical Society (point in time : 1926), Prix Jules Janssen (point in time : 1931), Matteucci Medal (point in time : 1921), Max Planck Medal (point in time : 1929), Franklin Medal (point in time : 1935), Copley Medal (point in time : 1925) (award rationale : For his theory of relativity and his contributions to the quantum theory.), Pour le Mérite for Sciences and Arts order, Josiah Willard Gibbs Lectureship (point in time : 1934), Honorary doctorate from the University of Geneva, honorary doctor of the Hebrew University of Jerusalem (point in time : 1949), honorary doctorate from Princeton 