## Update metadata

Update `wdi.meta.yml` from the WDI metadata file. Run this notebook manually and verify every change it makes to the YAML file.
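
One way to review the changes is to compare the YAML text before and after the update. The following is a minimal sketch using the standard-library `difflib`; `yaml_diff` is a hypothetical helper, not part of the ETL, and the two sample strings are invented for illustration.

```python
import difflib


def yaml_diff(old_text: str, new_text: str, path: str = "wdi.meta.yml") -> str:
    """Return a unified diff between two versions of a YAML file's text."""
    diff = difflib.unified_diff(
        old_text.splitlines(keepends=True),
        new_text.splitlines(keepends=True),
        fromfile=f"{path} (before)",
        tofile=f"{path} (after)",
    )
    return "".join(diff)


# Invented sample content, not taken from the real file.
before = "title: GDP (constant 2010 US$)\nunit: constant 2010 US$\n"
after = "title: GDP (constant 2015 US$)\nunit: constant 2015 US$\n"
print(yaml_diff(before, after))
```

In practice `before` would be the file contents read prior to running the update cells, and `after` the result of `ruamel_dump(yml)`.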

In [5]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [6]:
from owid.catalog import Dataset
from etl.paths import DATA_DIR
import os

# The dataset version is the name of the current working directory
version = os.path.basename(os.getcwd())
ds_meadow = Dataset(DATA_DIR / 'meadow/worldbank_wdi' / version / 'wdi')
tb = ds_meadow['wdi']
indicator_codes = [tb[col].m.title for col in tb.columns]

tb_metadata = ds_meadow.read("wdi_metadata", safe_types=False)

In [7]:
from wdi import load_variable_metadata

df_vars = load_variable_metadata(tb_metadata, indicator_codes)
df_vars.head()

2025-06-10 11:41:57 [info     ] wdi.missing_metadata           n_indicators=1


indicator_code,topic,indicator_name,short_definition,long_definition,unit_of_measure,periodicity,base_period,other_notes,aggregation_method,limitations_and_exceptions,...,general_comments,source,statistical_concept_and_methodology,development_relevance,related_source_links,other_web_links,related_indicators,license_type,unit,indicator_code_original
ag_con_fert_pt_zs,Environment: Agricultural production,Fertilizer consumption (% of fertilizer produc...,,Fertilizer consumption measures the quantity o...,,Annual,,The world and regional aggregate series do not...,Weighted average,The FAO has revised the time series for fertil...,...,,"Food and Agriculture Organization, electronic ...",Fertilizer consumption measures the quantity o...,"Factors such as the green revolution, has led ...",,,,CC BY-4.0,,AG.CON.FERT.PT.ZS
ag_con_fert_zs,Environment: Agricultural production,Fertilizer consumption (kilograms per hectare ...,,Fertilizer consumption measures the quantity o...,,Annual,,The world and regional aggregate series do not...,Weighted average,The FAO has revised the time series for fertil...,...,,"Food and Agriculture Organization, electronic ...",Fertilizer consumption measures the quantity o...,"Factors such as the green revolution, has led ...",,,,CC BY-4.0,,AG.CON.FERT.ZS
ag_lnd_agri_k2,Environment: Land use,Agricultural land (sq. km),,Agricultural land refers to the share of land ...,,Annual,,Areas of former states are included in the suc...,Sum,The data are collected by the Food and Agricul...,...,,"Food and Agriculture Organization, electronic ...",Agricultural land constitutes only a part of a...,Agricultural land covers more than one-third o...,,,,CC BY-4.0,,AG.LND.AGRI.K2
ag_lnd_agri_zs,Environment: Land use,Agricultural land (% of land area),,Agricultural land refers to the share of land ...,,Annual,,Areas of former states are included in the suc...,Weighted average,The data are collected by the Food and Agricul...,...,,"Food and Agriculture Organization, electronic ...",Agriculture is still a major sector in many ec...,Agricultural land covers more than one-third o...,,,,CC BY-4.0,,AG.LND.AGRI.ZS
ag_lnd_arbl_ha,Environment: Land use,Arable land (hectares),,Arable land (in hectares) includes land define...,,Annual,,,,The Food and Agriculture Organization (FAO) tr...,...,,"Food and Agriculture Organization, electronic ...",Temporary fallow land refers to land left fall...,Agricultural land covers more than one-third o...,,,,CC BY-4.0,,AG.LND.ARBL.HA


## Replace years in YAML metadata
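
The substitution applied in this section can be sanity-checked on its own. The sketch below re-declares the year pattern from `replace_years` and shows one caveat: the pattern matches any standalone four-digit number from 1000 to 2999, so a non-year quantity in a unit string would be rewritten as well. The sample strings are made up for illustration.

```python
import re

# Same pattern as in `replace_years`: a standalone number from 1000 to 2999.
YEAR_REGEX = re.compile(r"\b([1-2]\d{3})\b")


def replace_years(s: str, year) -> str:
    """Replace every match of YEAR_REGEX in `s` with `year`."""
    return YEAR_REGEX.sub(str(year), s)


# Illustrative strings, not taken from wdi.meta.yml.
print(replace_years("GDP (constant 2010 US$)", 2015))
print(replace_years("constant 2017 international-$", "2021"))
# Caveat: a standalone four-digit quantity is treated as a year too.
print(replace_years("per 1000 people", 2021))
# Shorter digit groups are left alone.
print(replace_years("per 100,000 people", 2021))
```

This is one reason the resulting YAML diff should be read line by line rather than trusted blindly.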

In [None]:
from etl.files import ruamel_dump, ruamel_load

yaml_path = "wdi.meta.yml"

with open(yaml_path, "r") as f:
    yml = ruamel_load(f)

In [None]:
KEEP = {'armed_forces_share_population'}

# Delete variables that are not in the dataset
missing_variables = set(yml['tables']['wdi']['variables'].keys()) - set(tb.columns)
missing_variables = {v for v in missing_variables if not v.startswith('omm_') and v not in KEEP}

print(f"Deleting {len(missing_variables)} variables")
for var in missing_variables:
    del yml['tables']['wdi']['variables'][var]

In [None]:
import re
from typing import Union


def replace_years(s: str, year: Union[int, str]) -> str:
    """Replace every four-digit year (1000-2999) in `s` with `year`.

    Example:

        >>> replace_years("GDP (constant 2010 US$)", 2015)
        'GDP (constant 2015 US$)'
    """
    year_regex = re.compile(r"\b([1-2]\d{3})\b")
    return year_regex.sub(str(year), s)


variables = yml["tables"]["wdi"]["variables"]

for indicator_code in df_vars.index:
    if indicator_code in variables:
        var = variables[indicator_code]
    else:
        var = {}
        variables[indicator_code] = var

    # update titles from metadata file
    try:
        var["title"] = df_vars.loc[indicator_code].indicator_name
    except KeyError:
        continue

    # if title contains year, try to update units too
    year_regex = re.compile(r"\b([1-2]\d{3})\b")
    regex_res = year_regex.search(df_vars.loc[indicator_code].indicator_name)
    if regex_res:
        assert len(regex_res.groups()) == 1
        year = regex_res.groups()[0]

        if "unit" in var:
            var["unit"] = replace_years(var["unit"], year)

        if "short_unit" in var:
            var["short_unit"] = replace_years(var["short_unit"], year)

        # update the display units, if present
        for k in ["unit", "short_unit"]:
            if var.get("display", {}).get(k):
                var["display"][k] = replace_years(var["display"][k], year)

        if "presentation" in var:
            for k in ["title_public", "title_variant"]:
                if k in var["presentation"]:
                    var["presentation"][k] = replace_years(var["presentation"][k], year)

In [None]:
with open(yaml_path, "w") as f:
    f.write(ruamel_dump(yml))

## Replace years in chart configs
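
Before calling the Admin API, it can help to compute the proposed text changes as plain data and review them. The following is a minimal sketch under that assumption; `plan_year_update` is a hypothetical helper, not part of `AdminAPI`, and the sample config is invented.

```python
from typing import Any, Dict, Iterable, Tuple


def plan_year_update(
    chart_config: Dict[str, Any],
    old_year: str,
    new_year: str,
    fields: Iterable[str] = ("subtitle", "note"),
) -> Dict[str, Tuple[str, str]]:
    """Return {field: (old_text, new_text)} for fields that mention old_year.

    The input config is not modified.
    """
    changes = {}
    for field in fields:
        value = chart_config.get(field) or ""  # missing or None -> ""
        if old_year in value:
            changes[field] = (value, value.replace(old_year, new_year))
    return changes


# Invented config for illustration.
config = {
    "title": "GDP per capita",
    "subtitle": "Expressed in international-$ at 2017 prices.",
    "note": None,
}
for field, (old, new) in plan_year_update(config, "2017", "2021").items():
    print(f"{field}:\n  - {old}\n  + {new}")
```

Note that a plain substring replacement also rewrites `2017` where it refers to something other than the price year, for example a survey date in a note, so the printed plan is worth reading before any chart is updated.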

In [None]:
from etl.db import get_engine, read_sql

# get GDP variable
q = """
select id from variables
where name = 'GDP per capita, PPP (constant 2021 international $)'
    and catalogPath = 'grapher/worldbank_wdi/2025-01-24/wdi/wdi#ny_gdp_pcap_pp_kd'
"""
engine = get_engine()
var_id = read_sql(q, engine).id.iloc[0]
print(var_id)

# get all charts using that variable
q = f"""
select chartId from chart_dimensions where variableId = {var_id};
"""
chart_ids = list(read_sql(q, engine)['chartId'])
len(chart_ids)

In [None]:
from apps.chart_sync.admin_api import AdminAPI
from etl.config import OWID_ENV, ENV_GRAPHER_USER_ID

admin_api = AdminAPI(OWID_ENV, grapher_user_id=ENV_GRAPHER_USER_ID)

old_year = "2017"
new_year = "2021"

for chart_id in chart_ids:
    chart_config = admin_api.get_chart_config(chart_id)

    fields = ['subtitle', 'note']

    update = False
    for field in fields:
        # a missing field or a null value is treated as an empty string
        value = chart_config.get(field) or ''
        if old_year in value:
            chart_config[field] = value.replace(old_year, new_year)
            update = True

    if update:
        print(f"Updating chart {chart_id}")
        admin_api.update_chart(chart_id, chart_config)

## Update Sources
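
Because the names in this section are generated by a language model, it is worth checking each response against the batch that was sent before merging it. A minimal sketch, assuming the response follows the `{"sources": [{"rawName": ..., "name": ...}]}` shape requested in the prompt; `check_batch` is a hypothetical helper and the sample data is illustrative.

```python
from typing import Dict, List, Tuple


def check_batch(
    requested: List[str], returned: List[Dict[str, str]]
) -> Tuple[List[str], List[str]]:
    """Compare requested rawNames with those returned by the model.

    Returns (missing, unexpected):
      missing    - requested rawNames absent from the response
      unexpected - returned rawNames that were never requested
                   (for example, the model altered the string)
    """
    returned_names = [s.get("rawName", "") for s in returned]
    requested_set = set(requested)
    returned_set = set(returned_names)
    missing = [n for n in requested if n not in returned_set]
    unexpected = [n for n in returned_names if n not in requested_set]
    return missing, unexpected


# Illustrative batch: the model dropped the trailing period of one rawName.
requested = ["Demographic and Health Surveys.", "International Debt Statistics."]
returned = [
    {"rawName": "Demographic and Health Surveys", "name": "Demographic and Health Surveys, via World Bank"},
    {"rawName": "International Debt Statistics.", "name": "International Debt Statistics - World Bank"},
]
missing, unexpected = check_batch(requested, returned)
print("missing:", missing)
print("unexpected:", unexpected)
```

A non-empty `missing` or `unexpected` list is a signal to retry the batch or fix the entries by hand, since the later merge matches on the exact `rawName` string.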

In [8]:
import json

with open("wdi.sources.json", "r") as f:
    sources = json.load(f)

sources = [s for s in sources if not s["name"].startswith("TODO")]

missing_sources = list(set(df_vars["source"]) - {s["rawName"] for s in sources})
len(missing_sources)

208

In [13]:
GOOD_EXAMPLES = [
  {
    "rawName": "ASPIRE: The Atlas of Social Protection - Indicators of Resilience and Equity, The World Bank. Data are based on national representative household surveys. (datatopics.worldbank.org/aspire/)",
    "dataPublisherSource": "The Atlas of Social Protection Indicators of Resilience and Equity - World Bank",
    "name": "ASPIRE: The Atlas of Social Protection, World Bank"
  },
  {
    "rawName": "Brauer, M. et al. 2017, for the Global Burden of Disease Study 2017.",
    "dataPublisherSource": "Brauer et al. (2017)",
    "name": "Brauer et al. (2017), via World Bank"
  },
  {
    "rawName": "Data collected by the Lancet Commission on Global Surgery (www.lancetglobalsurgery.org); Data collected by WHO Collaborating Centre for Surgery and Public Health at Lund University from various sources including Ministries of Health or equivalent national regulatory bodies, national official entities such as medical councils, Eurostat, OECD, WHO Euro Health For All Database, WHO EURO Technical resources for health Database; BMJ Glob Health.",
    "dataPublisherSource": "Lancet Commission on Global Surgery, World Health Organization Collaborating Centre for Surgery and Public Health at Lund University",
    "name": "Lancet Commission on Global Surgery, WHO, and BMJ Global Health, via World Bank"
  },
  {
    "rawName": "Debt service is the sum of principle repayments and interest actually paid in currency, goods, or services. This series differs from the standard debt to exports series. It covers only long-term public and publicly guaranteed debt and repayments (repurchases and charges) to the IMF. Exports of goods and services include primary income, but do not include workers' remittances.",
    "dataPublisherSource": "International Debt Statistics - World Bank",
    "name": "World Bank"
  },
  {
    "rawName": "Demographic and Health Surveys (DHS)",
    "dataPublisherSource": "Demographic and Health Surveys",
    "name": "Demographic and Health Surveys (DHS), via World Bank",
  },
  {
    "rawName": "Demographic and Health Surveys (DHS), Multiple Indicator Cluster Surveys (MICS), and other surveys",
    "dataPublisherSource": "Demographic and Health Surveys, Multiple Indicator Cluster Surveys, other surveys",
    "name": "Demographic and Health Surveys (DHS), Multiple Indicator Cluster Surveys (MICS), and other surveys, via World Bank"
  },
  {
    "rawName": "Demographic and Health Surveys (DHS).",
    "dataPublisherSource": "Demographic and Health Surveys (DHS)",
    "name": "Demographic and Health Surveys (DHS), via World Bank"
  },
  {
    "rawName": "Demographic and Health Surveys, and UNAIDS.",
    "dataPublisherSource": "Demographic and Health Surveys, UNAIDS",
    "name": "Demographic and Health Surveys (DHS), and UNAIDS, via World Bank"
  },
  {
    "rawName": "Derived using World Bank national accounts data and OECD National Accounts data files, and employment data from International Labour Organization, ILOSTAT database.",
    "dataPublisherSource": "ILOSTAT database - International Labour Organization, National accounts data - World Bank / OECD",
    "name": "World Bank and OECD national accounts, and ILOSTAT"
  },
  {
    "rawName": "World Bank staff estimates based on age distributions of United Nations Population Division's World Population Prospects: 2024 Revision.",
    "dataPublisherSource": "World Bank based on World Population Prospects - UN Population Division (2024)",
    "name": "World Bank based on data from the UN Population Division"
  }
]

In [None]:
import json

from openai import OpenAI

SYSTEM_PROMPT = """
You are tasked with creating short citation names for data sources based on their raw names and data publisher sources.

Rules for creating the "name" field:
1. Create a concise citation of the data producer based on the rawName and dataPublisherSource
2. The name should be clear and professional, suitable for academic citations
3. When the rawName mentions "World Bank" along with a specific database/platform/program, use the format: "{Database/Platform Name} - World Bank"
4. For academic papers/publications, include author name and publication: "Author (Year), via World Bank"
5. For organizations without World Bank in rawName, append ", via World Bank"
6. Preserve full organization names rather than using abbreviations
7. Keep it concise but informative and professional

Input format: You will receive rawName and dataPublisherSource fields.
Output format: Return a JSON object with a "sources" field containing an array of objects with rawName and name fields.

Examples of good names:
""" + json.dumps(GOOD_EXAMPLES, indent=2)


# Limit batch size to control costs and API limits
MAX_BATCH_SIZE = 30

client = OpenAI()
all_new_sources = []
total_cost = 0

# Process all missing sources in batches
for i in range(0, len(missing_sources), MAX_BATCH_SIZE):
    batch_missing_sources = missing_sources[i:i+MAX_BATCH_SIZE]
    print(f"Processing batch {i//MAX_BATCH_SIZE + 1}: {len(batch_missing_sources)} sources (total: {len(missing_sources)})")

    # Create input data for this batch
    missing_sources_data = []
    for raw_name in batch_missing_sources:
        # Find corresponding dataPublisherSource from df_vars
        matching_rows = df_vars[df_vars["source"] == raw_name]
        if not matching_rows.empty:
            data_publisher_source = matching_rows.iloc[0].get("dataPublisherSource", "")
            missing_sources_data.append({
                "rawName": raw_name,
                "dataPublisherSource": data_publisher_source
            })

    if not missing_sources_data:
        continue

    input_text = json.dumps(missing_sources_data, ensure_ascii=False, indent=2)

    messages = [
        {
            "role": "system",
            "content": SYSTEM_PROMPT,
        },
        {
            "role": "user",
            "content": input_text,
        },
    ]

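    # Use GPT-4o for processing
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=messages,
        response_format={"type": "json_object"},
    )

    # Estimate the batch cost from token usage.
    # ASSUMPTION: USD prices per 1M tokens for gpt-4o; check current OpenAI pricing.
    batch_cost = (response.usage.prompt_tokens * 2.50 + response.usage.completion_tokens * 10.00) / 1e6
    total_cost += batch_cost
    print(f"Batch cost: ${batch_cost:.4f}")

    r = json.loads(response.choices[0].message.content)
    all_new_sources.extend(r['sources'])

print(f"\nTotal cost: ${total_cost:.4f}")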
print(f"Processed {len(all_new_sources)} sources across {(len(missing_sources) + MAX_BATCH_SIZE - 1) // MAX_BATCH_SIZE} batches")

# Combine all results
r = {'sources': all_new_sources}
print("\nFirst 5 results:")
for source in r['sources'][:5]:
    print(f"  {source['name']} <- {source['rawName'][:100]}...")

Processing batch 1: 30 sources (total: 158)
Batch cost: $0.0335
Processing batch 2: 30 sources (total: 158)
Batch cost: $0.0341
Processing batch 3: 30 sources (total: 158)
Batch cost: $0.0312
Processing batch 4: 30 sources (total: 158)
Batch cost: $0.0347
Processing batch 5: 30 sources (total: 158)
Batch cost: $0.0329
Processing batch 6: 8 sources (total: 158)
Batch cost: $0.0122

Total cost: $0.1787
Processed 158 sources across 6 batches

First 5 results:
  Enterprise Surveys - World Bank <- World Bank, Enterprise Surveys (http://www.enterprisesurveys.org/)....
  Public Expenditure and Financial Accountability (via World Bank) <- Public Expenditure and Financial Accountab

## Update wdi.sources.json file
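
The update in this section matches entries by `rawName`. Below is a sketch of the same merge on in-memory data, using a dictionary lookup instead of a nested loop. `merge_names` is a hypothetical helper, and the entries are invented, with `TODO` standing in for a name that has not been filled in yet.

```python
from typing import Dict, List


def merge_names(
    sources: List[Dict[str, str]], new_sources: List[Dict[str, str]]
) -> List[Dict[str, str]]:
    """Copy `name` from new_sources into sources, matching on rawName.

    Entries in `sources` are updated in place. Raises ValueError for a
    rawName that is not in `sources`, and drops entries whose name still
    starts with "TODO" from the returned list.
    """
    by_raw_name = {s["rawName"]: s for s in sources}
    for new in new_sources:
        if new["rawName"] not in by_raw_name:
            raise ValueError(f"Source {new['rawName']} not found in existing sources")
        by_raw_name[new["rawName"]]["name"] = new["name"]
    return [s for s in sources if not s["name"].startswith("TODO")]


# Invented entries for illustration.
existing = [
    {"rawName": "Source A.", "name": "TODO"},
    {"rawName": "Source B", "name": "TODO"},
    {"rawName": "Source C", "name": "Source C, via World Bank"},
]
merged = merge_names(existing, [{"rawName": "Source A.", "name": "Source A, via World Bank"}])
for s in merged:
    print(f"{s['name']} <- {s['rawName']}")
```

If the file contained two entries with the same `rawName`, the dictionary would keep only the last one, whereas a nested loop with `break` updates the first; the sketch assumes `rawName` values are unique.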

In [40]:
import json

with open("wdi.sources.json", "r") as f:
    sources = json.load(f)

for new_source in r['sources']:
    for s in sources:
        if s['rawName'] == new_source['rawName']:
            print(f"Updating source:\n  {new_source['name']} <- {s['rawName']}")
            s['name'] = new_source['name']
            break
    else:
        raise ValueError(f"Source {new_source['rawName']} not found in existing sources")

# Remove sources that still have TODO in their name
sources = [s for s in sources if not s['name'].startswith('TODO')]

# Save updated sources back to file
with open("wdi.sources.json", "w") as f:
    json.dump(sources, f, ensure_ascii=False, indent=2)

Updating source:
  Enterprise Surveys - World Bank <- World Bank, Enterprise Surveys (http://www.enterprisesurveys.org/).
Updating source:
  Public Expenditure and Financial Accountability (via World Bank) <- Public Expenditure and Financial Accountability (PEFA). Ministry of Finance (MoF).
Updating source:
  WHO et al. (2023) (via World Bank) <- WHO, UNICEF, UNFPA, World Bank Group, and UNDESA/Population Division. Trends in Maternal Mortality 2000 to 2020. Geneva, World Health Organization, 2023
Updating source:
  Demographic and Health Surveys (via World Bank) <- Demographic and Health Surveys.
Updating source:
  World Bank Staff Estimates <- World Bank staff estimates based data from International Monetary Fund's Direction of Trade database.
Updating source:
  Internal Displacement Monitoring Centre (via World Bank) <- The Internal Displacement Monitoring Centre (http://www.internal-displacement.org/)
Updating source:
  International Telecommunication Union (via World Bank) <- Inter