## Update metadata

Update `wdi.meta.yml` from the WDI metadata file. This notebook is meant to be run manually, and every change it makes to the YAML file must be verified.
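
Since every change has to be checked by hand, it can help to look at a plain-text diff of the file before and after the notebook runs. The following is a minimal sketch using only the standard library; the helper name `yaml_text_diff` and the sample strings are hypothetical and not part of this notebook.

```python
import difflib


def yaml_text_diff(before: str, after: str, name: str = "wdi.meta.yml") -> str:
    """Return a unified diff between two versions of a YAML file's text."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"{name} (before)",
        tofile=f"{name} (after)",
    )
    return "".join(diff)


# Hypothetical before/after snippets, only for illustration.
before = "title: GDP (constant 2010 US$)\nunit: constant 2010 US$\n"
after = "title: GDP (constant 2015 US$)\nunit: constant 2015 US$\n"
print(yaml_text_diff(before, after))
```

In practice the `before` text would be read from the file prior to running the update cells and the `after` text once they have finished; an empty diff means nothing changed.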

In [1]:
from wdi import load_variable_metadata

df_vars = load_variable_metadata()

In [2]:
df_vars.head()

indicator_code,topic,indicator_name,short_definition,long_definition,unit_of_measure,periodicity,base_period,other_notes,aggregation_method,limitations_and_exceptions,notes_from_original_source,general_comments,source,statistical_concept_and_methodology,development_relevance,related_source_links,other_web_links,related_indicators,license_type
ag_agr_trac_no,Environment: Agricultural production,"Agricultural machinery, tractors",,Agricultural machinery refers to the number of...,,Annual,,,Sum,The data are collected by the Food and Agricul...,,,"Food and Agriculture Organization, electronic ...",A tractor provides the power and traction to m...,Agricultural land covers more than one-third o...,,,,CC BY-4.0
ag_con_fert_pt_zs,Environment: Agricultural production,Fertilizer consumption (% of fertilizer produc...,,Fertilizer consumption measures the quantity o...,,Annual,,The world and regional aggregate series do not...,Weighted average,The FAO has revised the time series for fertil...,,,"Food and Agriculture Organization, electronic ...",Fertilizer consumption measures the quantity o...,"Factors such as the green revolution, has led ...",,,,CC BY-4.0
ag_con_fert_zs,Environment: Agricultural production,Fertilizer consumption (kilograms per hectare ...,,Fertilizer consumption measures the quantity o...,,Annual,,The world and regional aggregate series do not...,Weighted average,The FAO has revised the time series for fertil...,,,"Food and Agriculture Organization, electronic ...",Fertilizer consumption measures the quantity o...,"Factors such as the green revolution, has led ...",,,,CC BY-4.0
ag_lnd_agri_k2,Environment: Land use,Agricultural land (sq. km),,Agricultural land refers to the share of land ...,,Annual,,Areas of former states are included in the suc...,Sum,The data are collected by the Food and Agricul...,,,"Food and Agriculture Organization, electronic ...",Agricultural land constitutes only a part of a...,Agricultural land covers more than one-third o...,,,,CC BY-4.0
ag_lnd_agri_zs,Environment: Land use,Agricultural land (% of land area),,Agricultural land refers to the share of land ...,,Annual,,Areas of former states are included in the suc...,Weighted average,The data are collected by the Food and Agricul...,,,"Food and Agriculture Organization, electronic ...",Agriculture is still a major sector in many ec...,Agricultural land covers more than one-third o...,,,,CC BY-4.0


In [3]:
import ruamel.yaml

yaml_path = "wdi.meta.yml"

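# The module-level load() is deprecated in recent ruamel.yaml releases;
# a YAML() instance uses the round-trip loader by default.
with open(yaml_path, "r") as f:
    yml = ruamel.yaml.YAML().load(f)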

In [4]:
import re
from typing import Union


def replace_years(s: str, year: Union[int, str]) -> str:
    """Replace every four-digit year (1000-2999) in `s` with `year`.

    Example:

        >>> replace_years("GDP (constant 2010 US$)", 2015)
        'GDP (constant 2015 US$)'
    """
    year_regex = re.compile(r"\b([1-2]\d{3})\b")
    s_new = year_regex.sub(str(year), s)
    return s_new


variables = yml["tables"]["wdi"]["variables"]

for indicator_code in df_vars.index:
    if indicator_code in variables:
        var = variables[indicator_code]
    else:
        var = {}
        variables[indicator_code] = var

    # update titles from metadata file
    try:
        var["title"] = df_vars.loc[indicator_code].indicator_name
    except KeyError:
        continue

    # if title contains year, try to update units too
    year_regex = re.compile(r"\b([1-2]\d{3})\b")
    regex_res = year_regex.search(df_vars.loc[indicator_code].indicator_name)
    if regex_res:
        assert len(regex_res.groups()) == 1
        year = regex_res.groups()[0]

        if "unit" in var:
            var["unit"] = replace_years(var["unit"], year)

        if "short_unit" in var:
            var["short_unit"] = replace_years(var["short_unit"], year)

        # apply the same replacement to the display overrides, if present
        for k in ["unit", "short_unit"]:
            if var.get("display", {}).get(k):
                var["display"][k] = replace_years(var["display"][k], year)
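
The year handling above rests on a single regular expression, so it is worth checking what it does and does not match. The sketch below repeats `replace_years` so that it runs on its own; the sample strings are made up for illustration and are not taken from `wdi.meta.yml`.

```python
import re
from typing import Union

YEAR_REGEX = re.compile(r"\b([1-2]\d{3})\b")


def replace_years(s: str, year: Union[int, str]) -> str:
    """Same logic as in the notebook cell: swap every matched year for `year`."""
    return YEAR_REGEX.sub(str(year), s)


# A single base year is replaced as intended.
print(replace_years("GDP (constant 2010 US$)", 2015))

# Every match is replaced, so a range collapses to the same year twice.
print(replace_years("average 1990-2000", 2015))

# Digits glued to other word characters are not matched because of \b.
print(replace_years("series12010 and 2010US$", 2015))
```

Because every match is replaced, a unit string that mentions more than one year would need to be checked by hand after the update.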

In [5]:
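# The module-level dump() is deprecated in recent ruamel.yaml releases;
# a YAML() instance uses the round-trip dumper by default.
yaml = ruamel.yaml.YAML()
yaml.width = 120
with open(yaml_path, "w") as f:
    yaml.dump(yml, f)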

## Update sources

In [6]:
import json

with open("wdi.sources.json", "r") as f:
    sources = json.load(f)

sources = [s for s in sources if not s["name"].startswith("TODO")]

missing_sources = list(set(df_vars["source"]) - {s["rawName"] for s in sources})
missing_sources

[]

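The set difference above compares the `source` column with the `rawName` values already present in `wdi.sources.json`. A minimal standard-library sketch of the same idea follows; the helper `find_missing_sources` and the toy data are hypothetical, and skipping non-string entries (such as missing values) is an added assumption rather than something the notebook does.

```python
from typing import Any, Dict, Iterable, List


def find_missing_sources(raw_names: Iterable[Any], sources: List[Dict[str, str]]) -> List[str]:
    """Return the raw names that have no entry in `sources`, sorted for stable output."""
    known = {s["rawName"] for s in sources}
    # Assumption: ignore non-string values (e.g. NaN for indicators without a source).
    candidates = {name for name in raw_names if isinstance(name, str)}
    return sorted(candidates - known)


# Toy data, not taken from the real files.
sources = [{"rawName": "World Bank and UIS", "name": "World Bank and UNESCO Institute for Statistics"}]
raw_names = ["World Bank and UIS", "Some new source", float("nan"), "Some new source"]
print(find_missing_sources(raw_names, sources))  # ['Some new source']
```

Sorting the result keeps the order stable between runs, which a plain `list(set(...))` does not guarantee.
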
In [25]:
import os
from openai import OpenAI
import random

SYSTEM_PROMPT = f"""
You are given a list of examples in JSON format to learn from. Each example has a
rawName; its name and dataPublisherSource fields are derived from that rawName.
I'll give you a list of rawNames, one per line, and you should return a JSON list of those
rawNames with the name and dataPublisherSource fields filled in.

Examples:
{json.dumps(random.sample(sources, 20))}
"""

all_sources = "\n".join(missing_sources)

messages = [
    {
        "role": "system",
        "content": SYSTEM_PROMPT,
    },
    {
        "role": "user",
        "content": all_sources,
    },
]

client = OpenAI()

# 10 missing sources / 5 examples -> 2min
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=messages,
    response_format={"type": "json_object"},
)
print(f"Cost GPT4o: ${response.usage.total_tokens / 1e6 * 7.5:.2f}")
r = json.loads(response.choices[0].message.content)
print(json.dumps(r, ensure_ascii=False, indent=2))

Cost GPT4o: $0.04
{
  "rawNames": [
    {
      "rawName": "UNESCO Institute for Statistics (UIS). UIS.Stat Bulk Data Download Service. Accessed November 27, 2023. https://apiportal.uis.unesco.org/bdds.",
      "name": "UNESCO Institute for Statistics",
      "dataPublisherSource": "UIS.Stat Bulk Data Download Service - UNESCO Institute for Statistics"
    },
    {
      "rawName": "World Bank and UIS",
      "name": "World Bank and UNESCO Institute for Statistics",
      "dataPublisherSource": "World Bank and UNESCO Institute for Statistics"
    },
    {
      "rawName": "UNICEF, WHO, World Bank: Joint child Malnutrition Estimates (JME). Aggregation is based on UNICEF, WHO, and the World Bank harmonized dataset (adjusted, comparable data) and methodology.",
      "name": "UNICEF, WHO, World Bank",
      "dataPublisherSource": "Joint child Malnutrition Estimates - UNICEF, WHO, World Bank"
    },
    {
      "rawName": "International Labour Organization. “Wages and Working Time Statistic