## Generate SDG mapping from goals to variables and datasets

We list all `goal -> variable` mappings used on https://sdg-tracker.org/. The original mapping lived in [this spreadsheet](https://docs.google.com/spreadsheets/d/1n0UrpKKS2JVcXSmth_QVYLThlWev6pzEiHP7HaIq9BY/edit#gid=1284188229), but it is no longer up to date. Instead, we scrape the SDG Tracker page sources from GitHub, enrich the result with metadata from the grapher database, and write a new CSV file for the ETL to use.

In [2]:
import requests

js = requests.get("https://api.github.com/repos/owid/sdg-tracker.org/git/trees/master?recursive=1").json()

In [3]:
from bs4 import BeautifulSoup
from collections import defaultdict

charts = defaultdict(list)

for p in js["tree"]:
    if not p["path"].startswith("pages"):
        continue

    resp = requests.get(f"https://raw.githubusercontent.com/owid/sdg-tracker.org/master/{p['path']}")
    if resp.status_code != 200:
        continue

    soup = BeautifulSoup(resp.text, "html.parser")

    for div_indicator in soup.find_all("div", {"class": "indicator"}):
        if "id" not in div_indicator.attrs:
            raise Exception(f'Page {p["path"]} is missing id=[indicator] in <div class="indicator">')

        for iframe_chart in div_indicator.find_all("iframe"):
            charts[div_indicator.attrs["id"]].append(iframe_chart.attrs["src"])
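
The extraction step can be tried without any network access. The sketch below runs the same `div.indicator` to `iframe` lookup on a small inline HTML string. The HTML snippet, the URLs in it, and the helper name `extract_charts` are made up for illustration; the real page structure on sdg-tracker.org may differ.

```python
from collections import defaultdict

from bs4 import BeautifulSoup

# Made-up page fragment; the real pages are assumed to follow the same pattern.
SAMPLE_HTML = """
<div class="indicator" id="1.1.1">
  <iframe src="https://ourworldindata.org/grapher/chart-a"></iframe>
</div>
<div class="indicator" id="1.2.1">
  <iframe src="https://ourworldindata.org/grapher/chart-b"></iframe>
  <iframe src="https://ourworldindata.org/grapher/chart-c"></iframe>
</div>
<div class="other">
  <iframe src="https://example.org/ignored"></iframe>
</div>
"""


def extract_charts(page_html: str) -> dict:
    """Map each indicator id to the iframe URLs nested inside its div (hypothetical helper)."""
    found = defaultdict(list)
    soup = BeautifulSoup(page_html, "html.parser")
    for div in soup.find_all("div", {"class": "indicator"}):
        for iframe in div.find_all("iframe"):
            found[div.attrs["id"]].append(iframe.attrs["src"])
    return dict(found)


sample_charts = extract_charts(SAMPLE_HTML)
print(sample_charts)
```

Iframes outside a `div` with class `indicator` are ignored, which is why the `example.org` URL does not show up in the result.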

In [4]:
import pandas as pd

df = pd.DataFrame(
    # `urls` avoids shadowing the outer `charts` dict inside the comprehension
    [(indicator, chart) for indicator, urls in charts.items() for chart in urls],
    columns=["indicator", "chart"],
)
df.head()

,indicator,chart
0,15.1.1,https://ourworldindata.org/grapher/forest-area...
1,15.1.2,https://ourworldindata.org/grapher/terrestrial...
2,15.1.2,https://ourworldindata.org/grapher/protected-t...
3,15.1.2,https://ourworldindata.org/grapher/proportion-...
4,15.2.1,https://ourworldindata.org/grapher/forest-area...


In [5]:
df["chart_slug"] = df.chart.str.split("/").str.get(-1)
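
The split above keeps everything after the last `/`, so a URL with a query string or a trailing slash would produce a slug that does not match the slug stored in the database. Whether the scraped iframes actually contain such URLs is an assumption. The sketch below shows one way to normalise them, using a hypothetical helper `chart_slug_from_url` and made-up example URLs.

```python
from urllib.parse import urlparse


def chart_slug_from_url(url: str) -> str:
    """Return the last path segment of a URL, ignoring query string, fragment and trailing slash.

    Hypothetical helper, not part of the original notebook.
    """
    path = urlparse(url).path.rstrip("/")
    return path.split("/")[-1]


# Made-up URLs covering the plain, query-string and trailing-slash cases.
examples = [
    "https://ourworldindata.org/grapher/some-chart",
    "https://ourworldindata.org/grapher/some-chart?tab=map&country=USA",
    "https://ourworldindata.org/grapher/some-chart/",
]
slugs = [chart_slug_from_url(u) for u in examples]
print(slugs)
```

If this turns out to be needed, it could be applied with `df["chart_slug"] = df.chart.map(chart_slug_from_url)`.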

In [6]:
from sqlalchemy import create_engine
from urllib.parse import quote

from dotenv import dotenv_values

env = dotenv_values("../../../../../../.env.prod")

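# build an actual SQLAlchemy engine (the password is URL-quoted in case it contains special characters)
engine = create_engine(
    f"mysql://{env['DB_USER']}:{quote(env['DB_PASS'])}@{env['DB_HOST']}:{env['DB_PORT']}/{env['DB_NAME']}"
)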

# get variable id -> dataset id relationship
q = """
select
    c.id as chart_id,
    v.id as variable_id,
    v.name as variable_name,
    d.id as dataset_id,
    d.name as dataset_name,
    c.config->>"$.slug" as chart_slug
from variables as v
join datasets as d on d.id = v.datasetId
join chart_dimensions as cd on cd.variableId = v.id
join charts as c on c.id = cd.chartId
where c.config->>"$.slug" in %(slugs)s
    and d.isPrivate is false
"""
gf = pd.read_sql(q, engine, params={"slugs": df.chart_slug.tolist()})

In [7]:
df = df.merge(gf, on="chart_slug")
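
The merge above is an inner join, so any scraped chart whose slug is not returned by the query (for example because its dataset is private or the slug has changed) is dropped without warning. A minimal sketch of a check that could run before the merge is shown below. The two toy frames stand in for `df` and `gf`, and their contents are made up.

```python
import pandas as pd

# Toy stand-ins for the scraped charts (df) and the database result (gf).
scraped = pd.DataFrame(
    {
        "indicator": ["1.1.1", "1.2.1", "1.2.1"],
        "chart_slug": ["chart-a", "chart-b", "chart-c"],
    }
)
from_db = pd.DataFrame({"chart_slug": ["chart-a", "chart-b"], "variable_id": [10, 20]})

# Slugs that were scraped but have no row in the database result.
missing = sorted(set(scraped["chart_slug"]) - set(from_db["chart_slug"]))
print(missing)

# Same inner join as in the notebook; rows for the missing slugs disappear here.
merged = scraped.merge(from_db, on="chart_slug")
print(merged)
```

Printing or logging `missing` makes it easier to notice when an indicator ends up with fewer charts than the SDG Tracker page shows.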

In [8]:
from owid.catalog.utils import underscore

df["dataset_name"] = "dataset_" + df.dataset_id.astype(str) + "_" + df.dataset_name.map(underscore)

In [9]:
# bulk_backport command to run locally
"ENV=.env.prod bulk_backport " + "-d " + " -d ".join(list(set(df.dataset_id.map(str))))

'ENV=.env.prod bulk_backport -d 1861 -d 5790 -d 5774 -d 3093 -d 5782 -d 5362 -d 829 -d 5201 -d 5959 -d 943 -d 1070 -d 5708 -d 5895 -d 5821 -d 115 -d 5941 -d 5593 -d 5676 -d 5520 -d 5637 -d 5599 -d 1857 -d 5546 -d 5855 -d 5839 -d 5712 -d 1047 -d 5575 -d 5332'
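
Because the ids pass through `set`, the order of the `-d` flags is not guaranteed to stay the same between runs, which makes the command harder to compare across runs. A small sketch that sorts the ids numerically before building the string follows. The sample ids are a few taken from the output above, with one duplicate added for illustration.

```python
# A few dataset ids from the output above, plus a deliberate duplicate.
dataset_ids = [5790, 1861, 115, 5790, 829]

# Deduplicate, then sort numerically so the command is stable across runs.
unique_ids = sorted(set(dataset_ids))
command = "ENV=.env.prod bulk_backport " + " ".join(f"-d {i}" for i in unique_ids)
print(command)
```

With the notebook's dataframe, `dataset_ids` would be `df.dataset_id.tolist()`.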

In [10]:
# dependencies for DAG file
print("\n".join([f"- backport://backport/owid/latest/{n}" for n in sorted(list(set(df["dataset_name"])))]))

- backport://backport/owid/latest/dataset_1047_world_bank_gender_statistics__gender
- backport://backport/owid/latest/dataset_1070_statistical_capacity_indicator__sci__world_bank_data_on_statistical_capacity
- backport://backport/owid/latest/dataset_115_countries_continents
- backport://backport/owid/latest/dataset_1857_employment
- backport://backport/owid/latest/dataset_1861_earnings_and_labour_cost
- backport://backport/owid/latest/dataset_3093_economic_losses_from_disasters_as_a_share_of_gdp__pielke__2018
- backport://backport/owid/latest/dataset_5201_forest_land__deforestation_and_change__fao__2020
- backport://backport/owid/latest/dataset_5332_water_and_sanitation__who_wash__2021
- backport://backport/owid/latest/dataset_5362_world_bank_edstats_2020
- backport://backport/owid/latest/dataset_5520_united_nations_sustainable_development_goals__united_nations__2022_02
- backport://backport/owid/latest/dataset_5546_democracy__lexical_index
- backport://backport/owid/latest/dataset_557

In [11]:
df.to_csv("sdg_sources.csv", index=False)