## Generate Explorers from Dimensions column information on the variables table

Part of the multidimensional indicator investigation of cycle 2024.2 was to check whether the garden
datasets that we currently have could be turned into "small explorers" relatively easily.

When dataframes in garden have breakdowns that go beyond country and year (e.g. age brackets or
sex/gender), we slice them up when we export them into Grapher's data model (i.e. when we create
indicatorid.data.json files and write entries in the variables table in our MySQL database). For
example, a dataframe that has 10 age groups and an indicator "cancer deaths" could be turned into
10 indicators like "cancer deaths - ages 0-9". When we do this slicing, we also store a bit of
information as a JSON blob in the dimensions column of the variables table for these slices, which
can help us restore the original structure.
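
To make the shape of this blob concrete, here is a minimal sketch that parses one. The keys (`filters`, `originalName`, `originalShortName`) follow the structure visible in the example indicator output further down in this notebook; the values are made up for illustration and are not rows taken from the database.

```python
import json

# Illustrative blob: the keys mirror the dimensions column, the values are assumptions.
raw = """
{
  "originalName": "Cancer deaths",
  "originalShortName": "cancer_deaths",
  "filters": [
    {"name": "age", "value": "0-9"},
    {"name": "sex", "value": "female"}
  ]
}
"""

dimensions = json.loads(raw)

# Each filter records which value of which breakdown this slice holds.
slice_filters = {f["name"]: f["value"] for f in dimensions["filters"]}

print(dimensions["originalShortName"])
print(slice_filters)
```

Together, the original short name and the filters are enough to tell which slices came from the same original column and which breakdown value each slice represents.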

This notebook investigates what it would look like if we used this information to create
indicator-based explorers for each of these indicators.

The result is about 900 explorers, many of which have only one dropdown with 3 values (e.g.
"rural", "urban", "rural and urban"). The conclusion of this notebook is that while this works in theory, garden dataframes and columns would still have to be authored in a very specific way to work well as explorers. This means that making indicator-based explorers easier to author is probably more useful than requiring a very specific data layout in our dataframes and auto-generating explorers from that.

Here is an example of an explorer that was constructed in this way:

```
explorerTitle    15-19 years old, current drinkers (%) - Sex: trans
selection        World

graphers
    sex Dropdown    yVariableIds
    both sexes      821603
    female          821604
    male            821605
```

In [1]:
import pandas as pd
import os
import json

pd.set_option('display.max_colwidth', None)

The code below is commented out because it only documents the SQL query that was used to get the information from the MySQL database, and running it requires DB access to be configured. The result of this
query was saved into the "variables_with_dimensions.parquet" file which is loaded below.

In [5]:
# %%mysql -o all

# select id, name, dimensions, catalogPath, datasetId from variables where dimensions is not null



In [None]:
# all.to_parquet("variables_with_dimensions.parquet")

In [2]:
all = pd.read_parquet("variables_with_dimensions.parquet")

Define a bunch of types and helper functions

In [3]:
from dataclasses import dataclass

@dataclass
class Indicator:
    id: int
    catalogPath: str
    catalogPathTruncated: str
    dimensions: dict
    name: str
    datasetId: int

In [4]:
from typing import List


def extract_dimensions(indicators: List[Indicator]):
    """Extracts all dimensions from a list of indicators. Returns a dictionary with the dimension
    name as key and a set of all values as value."""
    dimensions = {}
    for indicator in indicators:
        # "dimension_filter" avoids shadowing the built-in filter()
        for dimension_filter in indicator.dimensions["filters"]:
            name = dimension_filter["name"]
            value = dimension_filter["value"]
            if name not in dimensions:
                dimensions[name] = set()
            dimensions[name].add(value)
    return dimensions

In [5]:
def get_shared_name_fragment(indicators: List[Indicator]):
    """ Given a list of indicators that are part of the same logical multidimensional one,
        extract the common prefix of their names which we'll use as the name of the explorer."""
    names = [indicator.name for indicator in indicators]
    return os.path.commonprefix(names)
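
Note that `os.path.commonprefix` compares strings character by character, so the shared fragment can end in the middle of a name segment. The sketch below shows this with two of the names from the `un_wpp` growth rate group displayed later in this notebook. `trim_dangling_segment` is a hypothetical helper that is not part of the notebook; it assumes segments are always separated by `" - "` and would also drop a complete final segment if the prefix happened to end exactly on one.

```python
import os

names = [
    "Growth rate - Sex: all - Age: all - Variant: constant fertility",
    "Growth rate - Sex: all - Age: all - Variant: low",
]

# Character-wise common prefix: stops right after "Variant: ".
prefix = os.path.commonprefix(names)
print(repr(prefix))  # 'Growth rate - Sex: all - Age: all - Variant: '


def trim_dangling_segment(prefix: str, separator: str = " - ") -> str:
    """Hypothetical helper: drop the last, possibly partial, segment of the prefix."""
    head, sep, _tail = prefix.rpartition(separator)
    # If the separator never occurs, keep the prefix (minus stray whitespace).
    return head if sep else prefix.strip()


trimmed = trim_dangling_segment(prefix)
print(trimmed)  # Growth rate - Sex: all - Age: all
```

Whether such trimming is desirable depends on how the titles are meant to read; the notebook itself uses the raw common prefix.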

In [6]:
@dataclass
class Explorer:
    shared_name_fragment: str
    num_indicators: int
    indicator_ids: List[int]
    num_dimensions: int
    non_overlapping_dimensions: bool
    lines: List[List[str]]

Below is the function that is the workhorse of this notebook. It takes a list of indicators that belong together and creates an indicator-based explorer config from it. The config is returned as a list of lists of strings (rows of cells), not yet in TSV form.

In [7]:
def create_explorer(indicators : List[Indicator]):
    lines : List[List[str]] = []
    shared_name_fragment = get_shared_name_fragment(indicators)
    lines.append(["explorerTitle", shared_name_fragment])
    lines.append(["selection", "World"])
    lines.append([])
    lines.append(["graphers"])
    dimensions = extract_dimensions(indicators)
    # Dimensions are only eligible for an explorer control if they have at least 2 values
    dimensions_to_show = {key: val for key, val in dimensions.items() if len(val) > 1}
    header = [""]
    for dimension_key, dimension_set in dimensions_to_show.items():
        label = f"{dimension_key} Checkbox" if len(dimension_set) == 2 else f"{dimension_key} Dropdown"
        header.append(label)
    header.append("yVariableIds")
    header.append("hasMapTab")
    lines.append(header)

    non_overlapping_dimensions = False
    for i in indicators:
        line = [""]
        filters = { item["name"]: item["value"] for item in i.dimensions["filters"]}
        for dimension_key, dimension_set in dimensions_to_show.items():
            dimension_value = str(filters.get(dimension_key, ""))
            line.append(dimension_value)
            if dimension_value == "":
                non_overlapping_dimensions = True
        line.append(str(i.id))
        line.append("true")
        lines.append(line)

    lines.append([])
    lines.append(["columns"])
    lines.append(["", "variableId"])
    for i in indicators:
        lines.append(["", str(i.id)])

    return Explorer(shared_name_fragment, len(indicators), [i.id for i in indicators], len(dimensions_to_show), non_overlapping_dimensions, lines)


In [8]:
def lines_to_tsv(lines):
    return "\n".join(["\t".join(line) for line in lines])
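
As a quick illustration of the intermediate format, the sketch below renders a hand-written list of rows with the same join logic as `lines_to_tsv`. The variable ids are taken from the example explorer at the top of this notebook; the title and the exact rows are illustrative assumptions, not actual output of `create_explorer`.

```python
def lines_to_tsv(lines):
    # Cells are joined with tabs, rows with newlines; an empty row becomes a blank line.
    return "\n".join("\t".join(line) for line in lines)


lines = [
    ["explorerTitle", "Current drinkers (%)"],  # illustrative title
    ["selection", "World"],
    [],
    ["graphers"],
    ["", "sex Dropdown", "yVariableIds", "hasMapTab"],
    ["", "both sexes", "821603", "true"],
    ["", "female", "821604", "true"],
    ["", "male", "821605", "true"],
]

tsv = lines_to_tsv(lines)
print(tsv)
```

The leading empty cell in the rows under `graphers` produces the one-tab indentation that marks them as belonging to that block.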

In [9]:
indicators = []

for i, row in all.iterrows():
    dimensions = json.loads(row.dimensions)
    indicators.append(Indicator(row["id"], row["catalogPath"], row["catalogPath"].split("__")[0] , dimensions, row["name"], row["datasetId"]))

Group the indicators by dataset id + originalShortName. This is the grouping that we will turn into one small explorer.

In [10]:
grouped = dict()
for i in indicators:
    key = f"{i.datasetId}-{i.dimensions['originalShortName']}"
    if key not in grouped:
        grouped[key] = []
    grouped[key].append(i)
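
The grouping key combines the dataset id with the original short name, so two datasets that happen to reuse a short name end up in separate groups. A minimal sketch of the same idea using `collections.defaultdict`; the first two rows echo the example group shown below, while dataset id 6001 and indicator id 700001 are made up for illustration.

```python
from collections import defaultdict

# (datasetId, originalShortName, indicator id); the last row is hypothetical.
rows = [
    (5743, "growth_rate", 520001),
    (5743, "growth_rate", 520002),
    (6001, "growth_rate", 700001),
]

grouped_ids = defaultdict(list)
for dataset_id, short_name, indicator_id in rows:
    grouped_ids[f"{dataset_id}-{short_name}"].append(indicator_id)

print(dict(grouped_ids))
```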


Display an example indicator group

In [11]:
grouped["5743-growth_rate"]

[Indicator(id=520001, catalogPath='grapher/un/2022-07-11/un_wpp/growth_rate#growth_rate__sex_all__age_all__variant_constant_fertility', catalogPathTruncated='grapher/un/2022-07-11/un_wpp/growth_rate#growth_rate', dimensions={'filters': [{'name': 'sex', 'value': 'all'}, {'name': 'age', 'value': 'all'}, {'name': 'variant', 'value': 'constant fertility'}], 'originalName': 'Growth rate', 'originalShortName': 'growth_rate'}, name='Growth rate - Sex: all - Age: all - Variant: constant fertility', datasetId=5743),
 Indicator(id=520002, catalogPath='grapher/un/2022-07-11/un_wpp/growth_rate#growth_rate__sex_all__age_all__variant_low', catalogPathTruncated='grapher/un/2022-07-11/un_wpp/growth_rate#growth_rate', dimensions={'filters': [{'name': 'sex', 'value': 'all'}, {'name': 'age', 'value': 'all'}, {'name': 'variant', 'value': 'low'}], 'originalName': 'Growth rate', 'originalShortName': 'growth_rate'}, name='Growth rate - Sex: all - Age: all - Variant: low', datasetId=5743),
 Indicator(id=52000

How many indicators will be disregarded because they are in a group of size one?

In [12]:
len([ g for g in grouped if len(grouped[g]) == 1])

142

Finally, create explorer configs for all indicator groups with more than one item and save each resulting explorer config into the explorers subdirectory.

Also save a summary.csv file with a bit of information per explorer: how many indicators it has, how many dimensions, and whether the dimensions are non-overlapping. Non-overlapping means that there is more than one dimension but not all permutations are available. This can happen if we have a breakdown by sex and a breakdown by age individually but not the combinations of age x sex.

In [13]:
from pathlib import Path


def save_explorer(indicators, filename):
    explorer = create_explorer(indicators)
    tsv_content = lines_to_tsv(explorer.lines)
    with open(filename, "w") as f:
        f.write(tsv_content)
    return explorer

summaries = []
Path("explorers").mkdir(exist_ok=True)
for key, indicators in grouped.items():
    if len(indicators) > 1:
        # Use the full group key (dataset id + short name) for the file name so that
        # groups sharing a short name across datasets do not overwrite each other,
        # and so that the file name matches the one recorded in the summary.
        filename = f"explorers/{key}.explorer.tsv"
        explorer = save_explorer(indicators, filename)
        summaries.append({"filename": f"{key}.explorer.tsv", "shared_name_fragment": explorer.shared_name_fragment, "num_indicators": explorer.num_indicators, "num_dimensions": explorer.num_dimensions, "non_overlapping_dimensions": explorer.non_overlapping_dimensions})
# write the summaries into a csv file called explorers/summary.csv
summary_df = pd.DataFrame(summaries)
summary_df.to_csv("explorers/summary.csv", index=False)
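
The non-overlapping case described above can be illustrated with a small sketch. It assumes four hypothetical slices of one indicator, two broken down by sex only and two by age only, and compares the combinations a fully crossed explorer would need with the ones that exist. The final flag mirrors the check in `create_explorer`, where a slice that lacks a value for one of the shown dimensions gets an empty cell.

```python
from itertools import product

# Hypothetical filter sets: sex-only and age-only slices, no age x sex slices.
slices = [
    {"sex": "female"},
    {"sex": "male"},
    {"age": "0-9"},
    {"age": "10-19"},
]

# Collect every value seen per dimension.
values_by_dimension = {}
for filters in slices:
    for name, value in filters.items():
        values_by_dimension.setdefault(name, set()).add(value)

names = sorted(values_by_dimension)
# All combinations a fully crossed explorer would need.
expected = set(product(*(sorted(values_by_dimension[n]) for n in names)))
# Combinations that actually exist; a missing dimension shows up as "".
available = {tuple(f.get(n, "") for n in names) for f in slices}

missing = expected - available
non_overlapping = any("" in combo for combo in available)

print(len(expected), len(missing), non_overlapping)  # 4 4 True
```

In this example none of the four age x sex combinations is backed by an indicator, so an explorer with both an age and a sex control would have no valid selection.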
