# Validate continents against grapher

Make sure continents in our `countries_regions.csv` file are consistent with continent definitions in grapher. Continents are currently in two places in grapher:
- `variable 123` - the source of truth for continents
- `country_name_tool_countrydata` - an older table that is slated for deprecation, **DON'T USE IT FOR NEW CODE**

This notebook checks for differences between these three sources (`countries_regions.csv` and the two grapher sources above). If you find a major difference, update the affected source manually and then rerun the notebook to make sure all of them align.

In [None]:
from init import *

## Load all data

In [None]:
# continents from `country_name_tool_countrydata` table used as an input for `countries-regions` pipeline
q = """
select
    cd.owid_name,
    cd.iso_alpha3,
    ct.continent_name
from country_name_tool_countrydata as cd
join country_name_tool_continent as ct on cd.continent = ct.id
"""
countrydata = pd.read_sql(q, engine)

In [None]:
# continents from grapher
q = """
select
  dv.value as continent,
  e.code as country_code,
  e.name as country_name
from data_values as dv
join entities as e on dv.entityId = e.id
where dv.variableId = 123
"""
variable = pd.read_sql(q, engine)

In [None]:
# continents from countries_regions.csv
import pandas as pd
from owid import catalog

from etl.paths import REFERENCE_DATASET

reference_dataset = catalog.Dataset(REFERENCE_DATASET)
countries_regions = reference_dataset["countries_regions"]

## Difference between `country_name_tool_countrydata` and grapher
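
The check in this section looks up each country's grapher continent by its ISO alpha-3 code and keeps only the rows where the two sources disagree. Here is a minimal sketch of that pattern on made-up toy data. The `toy_*` frames and their values are illustrative assumptions, not real query results:

```python
import pandas as pd

# Toy stand-ins for the `countrydata` and `variable` frames (hypothetical values)
toy_countrydata = pd.DataFrame(
    {
        "iso_alpha3": ["FRA", "EGY", "XXX"],
        "continent_name": ["Europe", "Asia", "Europe"],
    }
)
toy_variable = pd.DataFrame(
    {"country_code": ["FRA", "EGY"], "continent": ["Europe", "Africa"]}
)

# Look up the grapher continent by country code; unknown codes become NaN
toy_countrydata["owid_continent"] = toy_countrydata.iso_alpha3.map(
    toy_variable.set_index("country_code").continent
)

# Keep mismatches, ignoring countries that grapher does not know about
toy_diffs = toy_countrydata[
    toy_countrydata.continent_name != toy_countrydata.owid_continent
].dropna(subset=["owid_continent"])
print(toy_diffs)
```

In this toy case only `EGY` is reported: `FRA` matches, and `XXX` is dropped because it has no grapher entry to compare against.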

In [None]:
countrydata["owid_continent"] = countrydata.iso_alpha3.map(
    variable.set_index("country_code").continent
)
diffs = countrydata[countrydata.continent_name != countrydata.owid_continent].dropna(
    subset=["owid_continent"]
)

# there should be no differences!
if len(diffs) != 0:
    print(diffs)
    raise Exception(
        "Continents in `country_name_tool_countrydata` differ from grapher"
    )

## Difference between `countries_regions` and grapher

The differences below should be empty if everything is consistent.
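
The comparison relies on the `members` column holding a JSON-encoded list of country codes, so membership can be compared with plain set differences. A small sketch of that idea, using hypothetical codes rather than real data:

```python
import json

# Hypothetical example: `members` stored as a JSON-encoded list of country codes
members_json = '["FRA", "DEU", "ESP"]'
# Hypothetical set of codes that grapher assigns to the same continent
grapher_codes = {"FRA", "DEU", "ITA"}

cr_codes = set(json.loads(members_json))

# Codes present in one source but missing from the other
only_in_csv = cr_codes - grapher_codes
only_in_grapher = grapher_codes - cr_codes

print("countries_regions.csv - grapher:", sorted(only_in_csv))
print("grapher - countries_regions.csv:", sorted(only_in_grapher))
```

Both sets are empty when the two sources agree for a continent. A non-empty set names the codes to investigate.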

In [None]:
import json

for continent, df in variable.groupby("continent"):
    cr_countries = json.loads(
        countries_regions[countries_regions.name == continent].iloc[0].members
    )

    print(continent)
    print("countries_regions.csv - grapher:", set(cr_countries) - set(df.country_code))
    print("grapher - countries_regions.csv:", set(df.country_code) - set(cr_countries))
    print()