## Update metadata

Update `wdi.meta.yml` from the WDI metadata file. This notebook is intended to be run manually, and every change it makes to the YAML file must be verified.

In [1]:
%load_ext autoreload
%autoreload 2

In [14]:
from owid.catalog import Dataset
from etl.paths import DATA_DIR
import os

# The dataset version is the name of the directory the notebook runs in
version = os.getcwd().rsplit('/', 1)[1]
ds_meadow = Dataset(DATA_DIR / 'meadow/worldbank_wdi' / version / 'wdi')
tb = ds_meadow['wdi']
indicator_codes = [tb[col].m.title for col in tb.columns]

tb_metadata = ds_meadow.read("wdi_metadata", safe_types=False)

In [15]:
from wdi import load_variable_metadata

df_vars = load_variable_metadata(tb_metadata, indicator_codes)
df_vars.head()

2025-09-08 17:23:01 [info     ] wdi.missing_metadata           n_indicators=0


indicator_code,topic,indicator_name,long_definition,unit_of_measure,periodicity,base_period,other_notes,aggregation_method,limitations_and_exceptions,notes_from_original_source,general_comments,source,statistical_concept_and_methodology,development_relevance,related_source_links,other_web_links,related_indicators,license_type,indicator_code_original
ag_con_fert_pt_zs,Environment: Agricultural production,Fertilizer consumption (% of fertilizer produc...,Fertilizer consumption measures the quantity o...,% (ratio),Annual,,The world and regional aggregate series do not...,Weighted average,The FAO has revised the time series for fertil...,,,"FAO electronic files and web site, Food and Ag...",Methodology: Fertilizer consumption measures t...,"Factors such as the green revolution, has led ...",,,,CC BY-4.0,AG.CON.FERT.PT.ZS
ag_con_fert_zs,Environment: Agricultural production,Fertilizer consumption (kilograms per hectare ...,Fertilizer consumption measures the quantity o...,kg per hectare of arable land,Annual,,The world and regional aggregate series do not...,Weighted average,The FAO has revised the time series for fertil...,,,"FAO electronic files and web site, Food and Ag...",Methodology: Fertilizer consumption measures t...,"Factors such as the green revolution, has led ...",,,,CC BY-4.0,AG.CON.FERT.ZS
ag_lnd_agri_k2,Environment: Land use,Agricultural land (sq. km),Agricultural land refers to the share of land ...,square kilometers (sq. km),Annual,,Areas of former states are included in the suc...,Sum,The data are collected by the Food and Agricul...,,,"FAO electronic files and web site, Food and Ag...",Methodology: Agricultural land constitutes onl...,Agricultural land covers more than one-third o...,,,,CC BY-4.0,AG.LND.AGRI.K2
ag_lnd_agri_zs,Environment: Land use,Agricultural land (% of land area),Agricultural land refers to the share of land ...,% (share) of land area,Annual,,Areas of former states are included in the suc...,Weighted average,The data are collected by the Food and Agricul...,,,"FAO electronic files and web site, Food and Ag...",Methodology: Agriculture is still a major sect...,Agricultural land covers more than one-third o...,,,,CC BY-4.0,AG.LND.AGRI.ZS
ag_lnd_arbl_ha,Environment: Land use,Arable land (hectares),Arable land (in hectares) includes land define...,hectares,Annual,,,,The Food and Agriculture Organization (FAO) tr...,,,"FAO electronic files and web site, Food and Ag...",Methodology: Temporary fallow land refers to l...,Agricultural land covers more than one-third o...,,,,CC BY-4.0,AG.LND.ARBL.HA


## Replace years in YAML metadata

In [16]:
from etl.files import ruamel_dump, ruamel_load

yaml_path = "wdi.meta.yml"

with open(yaml_path, "r") as f:
    yml = ruamel_load(f)

In [None]:
KEEP = {'armed_forces_share_population'}

# Delete variables that are not in the dataset
missing_variables = set(yml['tables']['wdi']['variables'].keys()) - set(tb.columns)
missing_variables = {v for v in missing_variables if not v.startswith('omm_') and v not in KEEP}

print(f"Deleting {len(missing_variables)} variables")
for var in missing_variables:
    del yml['tables']['wdi']['variables'][var]

In [None]:
import re
from typing import Union


def replace_years(s: str, year: Union[int, str]) -> str:
    """Replace every year in the string with `year`.

    Example:

        >>> replace_years("GDP (constant 2010 US$)", 2015)
        'GDP (constant 2015 US$)'
    """
    year_regex = re.compile(r"\b([1-2]\d{3})\b")
    s_new = year_regex.sub(str(year), s)
    return s_new


variables = yml["tables"]["wdi"]["variables"]

for indicator_code in df_vars.index:
    if indicator_code in variables:
        var = variables[indicator_code]
    else:
        var = {}
        variables[indicator_code] = var

    # update titles from metadata file
    try:
        var["title"] = df_vars.loc[indicator_code].indicator_name
    except KeyError:
        continue

    # if title contains year, try to update units too
    year_regex = re.compile(r"\b([1-2]\d{3})\b")
    regex_res = year_regex.search(df_vars.loc[indicator_code].indicator_name)
    if regex_res:
        assert len(regex_res.groups()) == 1
        year = regex_res.groups()[0]

        if "unit" in var:
            var["unit"] = replace_years(var["unit"], year)

        if "short_unit" in var:
            var["short_unit"] = replace_years(var["short_unit"], year)

        # apply the same replacement to the display overrides
        for k in ["name", "unit", "short_unit"]:
            if var.get("display", {}).get(k):
                var["display"][k] = replace_years(var["display"][k], year)

        if "presentation" in var:
            for k in ["title_public", "title_variant"]:
                if k in var["presentation"]:
                    var["presentation"][k] = replace_years(var["presentation"][k], year)
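
The year pattern used above matches any standalone four-digit number from 1000 to 2999, so it helps to check how it behaves on a few strings before trusting the rewritten units. The sketch below re-declares `replace_years` so that it runs on its own; the sample strings are illustrative assumptions and were not taken from `wdi.meta.yml`.

```python
import re

# Same pattern as in the cell above: a standalone number between 1000 and 2999.
YEAR_REGEX = re.compile(r"\b([1-2]\d{3})\b")


def replace_years(s: str, year) -> str:
    """Replace every match of YEAR_REGEX in `s` with `year`."""
    return YEAR_REGEX.sub(str(year), s)


# A single base year is replaced as intended.
print(replace_years("GDP (constant 2010 US$)", 2015))  # GDP (constant 2015 US$)

# Strings without a four-digit number are left alone.
print(replace_years("Population ages 15-64 (% of total population)", 2021))

# A base period written as a range collapses to the same year twice.
print(replace_years("Index (2014-2016 = 100)", 2021))  # Index (2021-2021 = 100)

# Any other four-digit quantity in the range is rewritten too.
print(replace_years("Energy intake of 2500 kcal", 2021))  # Energy intake of 2021 kcal
```

The last two cases are the ones to look for when reviewing the diff of the YAML file: ranges and non-year quantities are rewritten just like genuine base years.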

In [None]:
with open(yaml_path, "w") as f:
    f.write(ruamel_dump(yml))

## Replace years in chart configs

In [None]:
from etl.db import get_engine, read_sql

# get GDP variable
q = """
select id from variables
where name = 'GDP per capita, PPP (constant 2021 international $)'
    and catalogPath = 'grapher/worldbank_wdi/2025-01-24/wdi/wdi#ny_gdp_pcap_pp_kd'
"""
engine = get_engine()
var_id = read_sql(q, engine).id.iloc[0]
print(var_id)

# get all charts using that variable
q = f"""
select chartId from chart_dimensions where variableId = {var_id};
"""
chart_ids = list(read_sql(q, engine)['chartId'])
len(chart_ids)

In [None]:
from apps.chart_sync.admin_api import AdminAPI
from etl.config import OWID_ENV, ENV_GRAPHER_USER_ID

admin_api = AdminAPI(OWID_ENV, grapher_user_id=ENV_GRAPHER_USER_ID)

old_year = "2017"
new_year = "2021"

for chart_id in chart_ids:
    chart_config = admin_api.get_chart_config(chart_id)

    fields = ['subtitle', 'note']

    update = False
    for field in fields:
        if field in chart_config:
            if old_year in (chart_config.get(field, '') or ''):
                chart_config[field] = chart_config[field].replace(old_year, new_year)
                update = True

    if update:
        print(f"Updating chart {chart_id}")
        admin_api.update_chart(chart_id, chart_config)
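
Since the loop above writes straight to the admin API, it can be useful to preview the replacement first. The following is a minimal sketch, assuming a chart config behaves like a plain dict with optional text fields; `replace_year_in_fields` is a hypothetical helper and not part of `AdminAPI`, and the example config is made up.

```python
from typing import Any, Dict, Iterable, List, Tuple


def replace_year_in_fields(
    config: Dict[str, Any], fields: Iterable[str], old_year: str, new_year: str
) -> Tuple[Dict[str, Any], List[str]]:
    """Return a shallow copy of `config` with `old_year` replaced in the given
    text fields, plus the names of the fields that changed."""
    new_config = dict(config)  # top-level copy, the input dict is not modified
    changed = []
    for field in fields:
        value = new_config.get(field) or ""  # treat missing or None as empty text
        if old_year in value:
            new_config[field] = value.replace(old_year, new_year)
            changed.append(field)
    return new_config, changed


# Made-up config for illustration only.
example = {
    "title": "GDP per capita",
    "subtitle": "Measured in constant 2017 international-$.",
    "note": None,
}
updated, changed = replace_year_in_fields(example, ["subtitle", "note"], "2017", "2021")
print(changed)              # ['subtitle']
print(updated["subtitle"])  # Measured in constant 2021 international-$.
```

Printing `changed` for every chart before calling `update_chart` gives a list to review. Note that a plain substring replacement also rewrites a "2017" that refers to something other than the base year.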

## Update Sources


In [6]:
import json

with open("wdi.sources.json", "r") as f:
    sources = json.load(f)

sources = [s for s in sources if not s["name"].startswith("TODO")]

missing_sources = list(set(df_vars["source"]) - {s["rawName"] for s in sources})
len(missing_sources)

228

In [7]:
GOOD_EXAMPLES = [
  {
    "rawName": "ASPIRE: The Atlas of Social Protection - Indicators of Resilience and Equity, The World Bank. Data are based on national representative household surveys. (datatopics.worldbank.org/aspire/)",
    "dataPublisherSource": "The Atlas of Social Protection Indicators of Resilience and Equity - World Bank",
    "name": "ASPIRE: The Atlas of Social Protection, World Bank"
  },
  {
    "rawName": "Brauer, M. et al. 2017, for the Global Burden of Disease Study 2017.",
    "dataPublisherSource": "Brauer et al. (2017)",
    "name": "Brauer et al. (2017), via World Bank"
  },
  {
    "rawName": "Data collected by the Lancet Commission on Global Surgery (www.lancetglobalsurgery.org); Data collected by WHO Collaborating Centre for Surgery and Public Health at Lund University from various sources including Ministries of Health or equivalent national regulatory bodies, national official entities such as medical councils, Eurostat, OECD, WHO Euro Health For All Database, WHO EURO Technical resources for health Database; BMJ Glob Health.",
    "dataPublisherSource": "Lancet Commission on Global Surgery, World Health Organization Collaborating Centre for Surgery and Public Health at Lund University",
    "name": "Lancet Commission on Global Surgery, WHO, and BMJ Global Health, via World Bank"
  },
  {
    "rawName": "Debt service is the sum of principle repayments and interest actually paid in currency, goods, or services. This series differs from the standard debt to exports series. It covers only long-term public and publicly guaranteed debt and repayments (repurchases and charges) to the IMF. Exports of goods and services include primary income, but do not include workers' remittances.",
    "dataPublisherSource": "International Debt Statistics - World Bank",
    "name": "International Debt Statistics - World Bank"
  },
  {
    "rawName": "Demographic and Health Surveys (DHS)",
    "dataPublisherSource": "Demographic and Health Surveys",
    "name": "Demographic and Health Surveys (DHS), via World Bank",
  },
  {
    "rawName": "Demographic and Health Surveys (DHS), Multiple Indicator Cluster Surveys (MICS), and other surveys",
    "dataPublisherSource": "Demographic and Health Surveys, Multiple Indicator Cluster Surveys, other surveys",
    "name": "Demographic and Health Surveys (DHS), Multiple Indicator Cluster Surveys (MICS), and other surveys, via World Bank"
  },
  {
    "rawName": "Demographic and Health Surveys (DHS).",
    "dataPublisherSource": "Demographic and Health Surveys (DHS)",
    "name": "Demographic and Health Surveys (DHS), via World Bank"
  },
  {
    "rawName": "Demographic and Health Surveys, and UNAIDS.",
    "dataPublisherSource": "Demographic and Health Surveys, UNAIDS",
    "name": "Demographic and Health Surveys (DHS), and UNAIDS, via World Bank"
  },
  {
    "rawName": "Derived using World Bank national accounts data and OECD National Accounts data files, and employment data from International Labour Organization, ILOSTAT database.",
    "dataPublisherSource": "ILOSTAT database - International Labour Organization, National accounts data - World Bank / OECD",
    "name": "World Bank and OECD national accounts, and ILOSTAT"
  },
  {
    "rawName": "World Bank staff estimates based on age distributions of United Nations Population Division's World Population Prospects: 2024 Revision.",
    "dataPublisherSource": "World Bank based on World Population Prospects - UN Population Division (2024)",
    "name": "World Bank based on data from the UN Population Division"
  },
  {
    "rawName": "Food and Agriculture Organization of the United Nations (FAO)",
    "dataPublisherSource": "Food and Agriculture Organization of the United Nations",
    "name": "Food and Agriculture Organization of the United Nations, via World Bank"
  },
  {
    "rawName": "Inter-Parliamentary Union (IPU) (www.ipu.org).  For the year of 1998, the data is as of August 10, 1998.",
    "dataPublisherSource": "Inter-Parliamentary Union",
    "name": "Inter-Parliamentary Union (IPU), via World Bank"
  },
  {
    "rawName": "United Nations Population Division, Trends in Total Migrant Stock: 2008 Revision.",
    "dataPublisherSource": "Trends in Total Migrant Stock - UN Population Division (2008)",
    "name": "United Nations Population Division, via World Bank",
  },
  {
    "rawName": "Development Assistance Committee of the Organisation for Economic Co-operation and Development, Geographical Distribution of Financial Flows to Developing Countries, Development Co-operation Report, and International Development Statistics database. Data are available online at: www.oecd.org/dac/stats/idsonline. World Bank gross capital formation estimates are used for the denominator.",
    "dataPublisherSource": "Geographical Distribution of Financial Flows to Developing Countries - OECD Development Assistance Committee, Development Co-operation Report - OECD Development Assistance Committee, International Development Statistics Database - OECD, Gross capital formation estimates - World Bank",
    "name": "Development Assistance Committee - OECD, via World Bank",
  },
  {
    "rawName": "Development Assistance Committee of the Organisation for Economic Co-operation and Development, Geographical Distribution of Financial Flows to Developing Countries, Development Co-operation Report, and International Development Statistics database. Data are available online at: www.oecd.org/dac/stats/idsonline. World Bank imports of good and services estimates are used for the denominator.",
    "dataPublisherSource": "Geographical Distribution of Financial Flows to Developing Countries - OECD Development Assistance Committee, Development Co-operation Report - OECD Development Assistance Committee, International Development Statistics Database - OECD, Imports estimates - World Bank",
    "name": "Development Assistance Committee - OECD, via World Bank",
  },
  {
    "rawName": "Food and Agriculture Organization and World Bank population estimates.",
    "dataPublisherSource": "Food and Agriculture Organization of the United Nations, Population estimates - World Bank",
    "name": "Food and Agriculture Organization of the United Nations and World Bank",
  },
  {
    "rawName": "Food and Agriculture Organization of the United Nations (FAO)",
    "dataPublisherSource": "Food and Agriculture Organization of the United Nations",
    "name": "Food and Agriculture Organization of the United Nations and World Bank",
  },
  {
    "rawName": "Food and Agriculture Organization, AQUASTAT data, and World Bank and OECD GDP estimates.",
    "dataPublisherSource": "AQUASTAT Database - Food and Agriculture Organization of the United Nations, GDP estimates - World Bank / OECD",
    "name": "Food and Agriculture Organization of the United Nations, OECD, and World Bank"
  },
  {
    "rawName": "Food and Agriculture Organization, AQUASTAT data.",
    "dataPublisherSource": "AQUASTAT Database - Food and Agriculture Organization of the United Nations",
    "name": "Food and Agriculture Organization of the United Nations, OECD, and World Bank"
  },
  {
    "rawName": "Food and Agriculture Organization, electronic files and web site.",
    "dataPublisherSource": "Food and Agriculture Organization of the United Nations",
    "name": "Food and Agriculture Organization of the United Nations, via World Bank"
  },
  {
    "rawName": "Household surveys, including Demographic and Health Surveys and Multiple Indicator Cluster Surveys. Largely compiled by United Nations Population Division.",
    "dataPublisherSource": "Demographic and Health Surveys, Multiple Indicator Cluster Surveys, Household surveys, UN Population Division",
    "name": "Demographic and Health Surveys (DHS), Multiple Indicator Cluster Surveys (MICS), and United Nations Population Division, via World Bank",
  },
  {
    "rawName": "International Comparison Program, World Bank | World Development Indicators database, World Bank | Eurostat-OECD PPP Programme.",
    "dataPublisherSource": "International Comparison Program - World Bank, World Development Indicators - World Bank, Eurostat-OECD PPP Programme",
    "name": "Eurostat, OECD, and World Bank",
  },
  {
    "rawName": "The Program in Global Surgery and Social Change (PGSSC) at Harvard Medical School (https://www.pgssc.org/)",
    "dataPublisherSource": "Harvard Medical School Program in Global Surgery and Social Change",
    "name": "Harvard Medical School Program in Global Surgery and Social Change (PGSSC), via World Bank"
  },
  {
    "rawName": "The country data compiled, adjusted and used in the estimation model by the Maternal Mortality Estimation Inter-Agency Group (MMEIG). The country data were compiled from the following sources:  civil registration and vital statistics; specialized studies on maternal mortality; population based surveys and censuses; other available data sources including data from surveillance sites.",
    "dataPublisherSource": "UN Maternal Mortality Estimation Inter-Agency Group",
    "name": "UN Maternal Mortality Estimation Inter-Agency Group, via World Bank"
  },
  {
    "rawName": "UNICEF, State of the World's Children, Childinfo, and Demographic and Health Surveys.",
    "dataPublisherSource": "State of the World's Children - UNICEF, Demographic and Health Surveys",
    "name": "Demographic and Health Surveys (DHS) and UNICEF, via World Bank"
  },
  {
    "rawName": "UNICEF-WHO Low birthweight estimates [data.unicef.org]",
    "dataPublisherSource": "Low Birthweight Estimates - UNICEF / World Health Organization",
    "name": "UNICEF and World Health Organization, via World Bank"
  },
  {
    "rawName": "Understanding Children's Work project based on data from ILO, UNICEF and the World Bank.",
    "dataPublisherSource": "Understanding Children's Work Project - International Labour Organization / UNICEF / World Bank",
    "name": "International Labour Organization, UNICEF, and World Bank"
  },
  {
    "rawName": "United Nations Children's Fund, Division of Data, Analysis, Planning and Monitoring (2019). UNICEF Global Databases on Iodized salt, New York, June 2019",
    "dataPublisherSource": "Global Database on Household Consumption of Iodized Salt - UNICEF",
    "name": "UNICEF, via World Bank"
  },
  {
    "rawName": "United Nations High Commissioner for Refugees (UNHCR) and UNRWA through UNHCR's Refugee Data Finder at https://www.unhcr.org/refugee-statistics/.",
    "dataPublisherSource": "Refugee Data Finder - UN High Commissioner for Refugees",
    "name": "UNHCR and UNRWA, via World Bank"
  },
  {
    "rawName": "World Bank staff estimates based on IMF balance of payments data, and World Bank and OECD GDP estimates.",
    "dataPublisherSource": "World Bank based on Balance of Payments Statistics - International Monetary Fund, GDP estimates - World Bank / OECD",
    "name": "World Bank based on IMF and OECD"
  },
  {
    "rawName": "International Labour Organization. “ILO Modelled Estimates and Projections database (ILOEST)” ILOSTAT. Accessed June 18, 2024. https://ilostat.ilo.org/data/.",
    "dataPublisherSource": "ILO Modelled Estimates and Projections Database (ILOEST) - International Labour Organization",
    "name": "International Labour Organization, ILOSTAT, via World Bank"
  },
]

In [9]:
import os
from openai import OpenAI
import json

SYSTEM_PROMPT = """
You are tasked with creating short citation names for data sources based on their raw names and data publisher sources.

Input format: You will receive rawName and dataPublisherSource fields.
Output format: Return a JSON object with a "sources" field containing an array of objects with rawName and name fields.


Rules for creating the "name" field in addition to what you infer from examples
1. World Bank MUST appear in every citation name

Check out these good examples. Make sure these examples are followed closely.
""" + json.dumps(GOOD_EXAMPLES, indent=2)


# Limit batch size to control costs and API limits
MAX_BATCH_SIZE = 30

client = OpenAI()
all_new_sources = []

# Process all missing sources in batches
for i in range(0, len(missing_sources), MAX_BATCH_SIZE):
    batch_missing_sources = missing_sources[i:i+MAX_BATCH_SIZE]
    print(f"Processing batch {i//MAX_BATCH_SIZE + 1}: {len(batch_missing_sources)} sources (total: {len(missing_sources)})")

    # Create input data for this batch
    missing_sources_data = []
    for raw_name in batch_missing_sources:
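        # Find the corresponding row in df_vars. The df_vars columns shown earlier do not
        # include "dataPublisherSource", so the lookup below falls back to an empty string.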
        matching_rows = df_vars[df_vars["source"] == raw_name]
        if not matching_rows.empty:
            data_publisher_source = matching_rows.iloc[0].get("dataPublisherSource", "")
            missing_sources_data.append({
                "rawName": raw_name,
                "dataPublisherSource": data_publisher_source
            })

    if not missing_sources_data:
        continue

    input_text = json.dumps(missing_sources_data, ensure_ascii=False, indent=2)

    messages = [
        {
            "role": "system",
            "content": SYSTEM_PROMPT,
        },
        {
            "role": "user",
            "content": input_text,
        },
    ]

    # Use gpt-5-nano and request a JSON object so the reply can be parsed directly
    response = client.chat.completions.create(
        model="gpt-5-nano",
        messages=messages,
        response_format={"type": "json_object"},
    )

    r = json.loads(response.choices[0].message.content)
    all_new_sources.extend(r['sources'])

print(f"Processed {len(all_new_sources)} sources across {(len(missing_sources) + MAX_BATCH_SIZE - 1) // MAX_BATCH_SIZE} batches")

# Combine all results
r = {'sources': all_new_sources}
print("\nFirst 5 results:")
for source in r['sources'][:5]:
    print(f"  {source['name']} <- {source['rawName'][:100]}...")

Processing batch 1: 30 sources (total: 228)
Processing batch 2: 30 sources (total: 228)
Processing batch 3: 30 sources (total: 228)
Processing batch 4: 30 sources (total: 228)
Processing batch 5: 30 sources (total: 228)
Processing batch 6: 30 sources (total: 228)
Processing batch 7: 30 sources (total: 228)
Processing batch 8: 18 sources (total: 228)
Processed 228 sources across 8 batches

First 5 results:
  Remittance Prices Worldwide, World Bank <- Remittance Prices Worldwide, World Bank (WB), uri: http://remittanceprices.worldbank.org...
  Yearbook of Tourism Statistics - UN Tourism, IMF, and World Bank <- Yearbook of Tourism Statistics, Compendium of Tourism Statistics and data files, UN Tourism;
IMF imp...
  Labour Market-related SDG Indicators (ILOSTAT) - ILO, via World Bank <- Labour Market-related SDG Indicators database (ILOSDG), International Labour Organization (ILO), uri...
  UN Inter-agency Group for Child Mortality Estimation (UNICEF, WHO, UN Population Division), via Worl
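
The names come from a language model, so a quick consistency check before merging them can catch obvious problems. The sketch below is one possible check under stated assumptions: `validate_new_sources` is a hypothetical helper, the sample data is made up, and the "World Bank" test simply mirrors rule 1 of the system prompt.

```python
from typing import Dict, Iterable, List


def validate_new_sources(
    new_sources: List[Dict[str, str]], requested_raw_names: Iterable[str]
) -> List[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    requested = set(requested_raw_names)
    seen = set()
    problems = []
    for item in new_sources:
        raw_name = item.get("rawName")
        name = item.get("name") or ""
        # The model may alter or invent a rawName; such entries cannot be matched later.
        if raw_name not in requested:
            problems.append(f"unexpected rawName: {raw_name!r}")
        # Rule 1 of the system prompt: World Bank must appear in every citation name.
        if "World Bank" not in name:
            problems.append(f"name lacks 'World Bank': {name!r}")
        seen.add(raw_name)
    # Requested sources for which nothing came back.
    for raw_name in sorted(requested - seen):
        problems.append(f"no name returned for: {raw_name!r}")
    return problems


# Made-up batch: the second entry has an altered rawName and a name without World Bank.
requested = ["Demographic and Health Surveys (DHS)", "Inter-Parliamentary Union (IPU)"]
returned = [
    {"rawName": "Demographic and Health Surveys (DHS)",
     "name": "Demographic and Health Surveys (DHS), via World Bank"},
    {"rawName": "Inter-Parliamentary Union (IPU).",
     "name": "Inter-Parliamentary Union (IPU)"},
]
for problem in validate_new_sources(returned, requested):
    print(problem)
```

In the notebook this could be called as `validate_new_sources(all_new_sources, missing_sources)`. An exact string comparison is deliberately strict, because the update step further down also matches on the exact `rawName`.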

## Update wdi.sources.json file

In [32]:
import json

with open("wdi.sources.json", "r") as f:
    sources = json.load(f)

for new_source in r['sources']:
    for s in sources:
        if s['rawName'] == new_source['rawName']:
            print(f"Updating source:\n  {new_source['name']} <- {s['rawName']}")
            s['name'] = new_source['name']
            break
    else:
        # New source, add it
        # raise ValueError(f"Source {new_source['rawName']} not found in existing sources")
        sources.append({
            "rawName": new_source['rawName'],
            "name": new_source['name'],
            "dataPublisherSource": new_source.get('dataPublisherSource', '')
        })

# Remove sources whose name still starts with TODO
sources = [s for s in sources if not s['name'].startswith('TODO')]

# Save updated sources back to file
with open("wdi.sources.json", "w") as f:
    json.dump(sources, f, ensure_ascii=False, indent=2)
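
The update loop above stops at the first entry with a matching `rawName`, so a second entry with the same `rawName` would keep its old name. A small check for such duplicates is sketched here; `find_duplicate_raw_names` is a hypothetical helper and the sample list is made up, standing in for the contents of `wdi.sources.json`.

```python
from collections import Counter
from typing import Dict, List


def find_duplicate_raw_names(sources: List[Dict[str, str]]) -> List[str]:
    """Return every rawName that occurs more than once, sorted alphabetically."""
    counts = Counter(s["rawName"] for s in sources)
    return sorted(raw_name for raw_name, n in counts.items() if n > 1)


# Made-up entries for illustration only.
sample_sources = [
    {"rawName": "Source A", "name": "A, via World Bank"},
    {"rawName": "Source B", "name": "B, via World Bank"},
    {"rawName": "Source A", "name": "A and World Bank"},
]
print(find_duplicate_raw_names(sample_sources))  # ['Source A']
```

Running the same function on the loaded `sources` list before saving would show whether any `rawName` maps to more than one citation name.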