# Dataset Acquisition from Wikidata

## Overview
This notebook acquires entity reconciliation data from Wikidata for early modern Polish personal names.
It retrieves pairs of main labels and alternative labels (aliases) for individuals who:
- Have documented connections to Poland or its historical territories
- Were born between 1264 and 1770 (the period treated here as early modern)
- Have at least one Polish-language alias in Wikidata

## Context
The CHExRISH project at Jagiellonian University seeks to link historical records across multiple institutional databases.
This dataset provides ground truth for evaluating entity reconciliation strategies on historical personal names.

## Output
A dataframe with ~6,000-7,000 label-alias pairs for benchmarking string similarity algorithms.
Each row represents one person-name variant combination.
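
As a rough illustration of what benchmarking on these pairs could look like, the sketch below scores label-alias pairs with `difflib.SequenceMatcher` from the Python standard library. The example pairs and the `similarity` helper are hypothetical stand-ins, not rows taken from the dataset, and `SequenceMatcher` is only one convenient baseline, not necessarily the measure the project evaluates.

```python
from difflib import SequenceMatcher


def similarity(label: str, alias: str) -> float:
    """Return a case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, label.casefold(), alias.casefold()).ratio()


# Hypothetical label-alias pairs, made up for illustration only
example_pairs = [
    ("Jan Kowalski", "Jan Kowalski"),       # identical strings
    ("Jan Kowalski", "Kowalski Jan"),       # same components, reordered
    ("Jan Kowalski", "Johannes Kowalsky"),  # spelling variant
]

# Score every pair and keep the inputs next to the score
scores = [(label, alias, similarity(label, alias)) for label, alias in example_pairs]

for label, alias, score in scores:
    print(f"{label!r} vs {alias!r}: {score:.2f}")
```

Reordered components, as in the second pair, lower a character-based score even though the name is the same. A benchmark over label-alias pairs can expose exactly this kind of case.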

## Step 1: Query Wikidata SPARQL Endpoint

This cell queries the Wikidata SPARQL endpoint to retrieve Polish personal names and their alternative labels.

### Query Logic
- **Entity type**: Humans (Q5)
- **Polish connection**: Either an entry in the Polish Biographical Dictionary (`wdt:P8172`) OR a mention in the genealogical database "The Great Genealogy" (`wdt:P1343` -> `wd:Q1485141`)
- **Time period**: Born 1264-1770 (early modern era)
- **Language**: Polish labels and aliases only
- **Result**: Person ID, primary name label, alternative name label
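
The language and time-period conditions above map onto SPARQL `FILTER` clauses. The sketch below builds them from parameters so the window or language can be varied. The `build_filters` helper is hypothetical, and it writes the date bounds in the full `xsd:dateTime` lexical form (with a time component), which is an assumption made for portability, not necessarily how the notebook's own query is written.

```python
def build_filters(start_year: int, end_year: int, lang: str = "pl") -> str:
    """Build FILTER clauses for label language and birth-date range."""
    if start_year > end_year:
        raise ValueError("start_year must not be later than end_year")

    # Full xsd:dateTime literals covering the whole first and last year
    lower = f'"{start_year:04d}-01-01T00:00:00Z"^^xsd:dateTime'
    upper = f'"{end_year:04d}-12-31T23:59:59Z"^^xsd:dateTime'

    return "\n".join([
        f'FILTER(LANG(?polish_label) = "{lang}")',
        f'FILTER(LANG(?name_variant) = "{lang}")',
        f"FILTER(?birth_date >= {lower} && ?birth_date <= {upper})",
    ])


# Variable names (?polish_label, ?name_variant, ?birth_date) follow the query in this notebook
filters = build_filters(1264, 1770)
print(filters)
```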

In [1]:
import requests
import pandas as pd

# Wikidata SPARQL endpoint for querying structured data
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# SPARQL query to retrieve Polish historical figures with alternative names
# A person qualifies through either of two conditions (combined with UNION):
#   1. wdt:P8172 -> entry in Polish Biographical Dictionary
#   2. wdt:P1343 wd:Q1485141 -> mentioned in 'The Great Genealogy' reference
wikidata_query = """
SELECT ?person ?polish_label ?name_variant
WHERE {
  {
    ?person wdt:P31 wd:Q5 ;           # Human
            wdt:P8172 ?pbd_key .      # In Polish Biographical Dictionary
  }
  UNION
  {
    ?person wdt:P31 wd:Q5 ;           # Human
            wdt:P1343 wd:Q1485141 .   # Mentioned in 'The Great Genealogy'
  }
  
  # Retrieve birth date, primary label, and alternative label
  ?person wdt:P569 ?birth_date ;
          rdfs:label ?polish_label ;
          skos:altLabel ?name_variant .
  
  # Filter for Polish language labels and aliases
  FILTER(LANG(?polish_label) = "pl")
  FILTER(LANG(?name_variant) = "pl")
  
  # Filter for early modern period (1264-1770)
  FILTER (?birth_date >= "1264-01-01"^^xsd:dateTime && 
          ?birth_date <= "1770-12-31"^^xsd:dateTime)
}
"""

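# HTTP headers: request JSON results and identify the client.
# The User-Agent value is a placeholder; replace it with your own tool name and contact.
request_headers = {
    "Accept": "application/sparql-results+json",
    "User-Agent": "polish-names-notebook/0.1 (contact: you@example.org)"
}

# Execute the SPARQL query against Wikidata; the timeout avoids hanging indefinitely
response = requests.get(
    WIKIDATA_SPARQL_ENDPOINT,
    params={"query": wikidata_query},
    headers=request_headers,
    timeout=120
)

# Fail early on HTTP errors (e.g. rate limiting or a server-side query timeout)
response.raise_for_status()

# Parse JSON response
response_data = response.json()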

# Extract the bindings (actual result rows) from the response
result_bindings = response_data["results"]["bindings"]

# Convert to pandas DataFrame for easier manipulation
df = pd.DataFrame(result_bindings)

# Print initial count (may include duplicates)
print(f"Initial result count: {len(df)}")

6960


## Step 2: Extract and Clean Data

This cell processes the raw Wikidata results:
1. Extracts Wikidata IDs (Q-identifiers) from URIs
2. Extracts the actual text values from nested JSON structures
3. Removes duplicate label-alias pairs
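
The three steps above can be tried on a small hand-written sample before running them on the full result. The sketch below uses plain Python so that it runs standalone, without pandas. The `sample_bindings` data, the `make_binding` and `flatten` helpers, and the label and alias strings are all hypothetical, and the nested layout is assumed to follow the SPARQL JSON results format returned by the endpoint.

```python
def make_binding(uri: str, label: str, alias: str) -> dict:
    """Build one result row in the nested SPARQL JSON results layout."""
    return {
        "person": {"type": "uri", "value": uri},
        "polish_label": {"type": "literal", "xml:lang": "pl", "value": label},
        "name_variant": {"type": "literal", "xml:lang": "pl", "value": alias},
    }


# Hand-written sample (not real query output); the second row repeats the first
uri = "http://www.wikidata.org/entity/Q96212877"
sample_bindings = [
    make_binding(uri, "Example Label", "Example Alias"),
    make_binding(uri, "Example Label", "Example Alias"),
    make_binding(uri, "Example Label", "Another Alias"),
]


def flatten(binding: dict) -> tuple:
    """Reduce one binding to (Q-identifier, label, alias)."""
    qid = binding["person"]["value"].rsplit("/", 1)[-1]  # text after the final slash
    return (qid, binding["polish_label"]["value"], binding["name_variant"]["value"])


# dict.fromkeys keeps the first occurrence of each tuple and preserves order
rows = list(dict.fromkeys(flatten(b) for b in sample_bindings))

for row in rows:
    print(row)
```

The resulting list of tuples could be passed to `pd.DataFrame` with explicit column names if a dataframe is preferred.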

In [2]:
# ============================================================================
# Extract entity IDs from Wikidata URIs
# ============================================================================
# Wikidata returns URIs like "http://www.wikidata.org/entity/Q96212877"
# We extract just the Q-identifier (e.g., "Q96212877")

df["person"] = df["person"].apply(
    lambda uri_dict: uri_dict['value'].split("/")[-1]  # Get last part after final slash
)

# ============================================================================
# Extract text values from nested JSON structures
# ============================================================================
# Wikidata returns values as {"value": "actual_text", "type": "literal"}
# We extract just the "value" field

df["name_variant"] = df["name_variant"].apply(
    lambda value_dict: value_dict['value']
)

df["polish_label"] = df["polish_label"].apply(
    lambda value_dict: value_dict['value']
)

# ============================================================================
# Remove duplicate label-alias pairs
# ============================================================================
# Wikidata may return the same person-name combination multiple times
# Keep only unique pairs

df = df.drop_duplicates()

print(f"Final deduplicated count: {len(df)}")
print("\nFirst 10 rows:")
print(df.head(10))

# Display example name variations showing why entity reconciliation is needed
print("\n=== Examples of Name Variations ===")
print("Person Q203808 variations:")
print(df[df['person'] == 'Q203808'][['polish_label', 'name_variant']].drop_duplicates())



## Step 3: Summary Statistics

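This cell summarizes the acquired dataset: pair and entity counts, a random sample of records, and the number of name components (words) in labels versus aliases.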

In [3]:
# ============================================================================
# Dataset Overview
# ============================================================================

print("=== DATASET STATISTICS ===")
print(f"Total label-alias pairs: {len(df)}")
print(f"Unique Wikidata entities: {df['person'].nunique()}")
print("\nDataset structure:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

print("\n=== Example Records ===")
print("\nSample of 5 random pairs:")
print(df.sample(5))

# ============================================================================
# Name Complexity Analysis
# ============================================================================
# Analyze number of name components (words) in labels vs variants

df["label_num_components"] = df["polish_label"].str.split().str.len()
df["variant_num_components"] = df["name_variant"].str.split().str.len()

print("\n=== Name Complexity ===")
print(f"Primary label - Average components: {df['label_num_components'].mean():.2f}")
print(f"Primary label - Components range: {df['label_num_components'].min()}-{df['label_num_components'].max()}")
print(f"\nAlternative variant - Average components: {df['variant_num_components'].mean():.2f}")
print(f"Alternative variant - Components range: {df['variant_num_components'].min()}-{df['variant_num_components'].max()}")