
Improve metadata handling #90

Closed
wants to merge 20 commits

Conversation

@pabloarosado (Contributor) commented May 18, 2023

[Work in progress] This PR attempts to add two useful features for working with tables in a data pipeline:

  1. Properly handling metadata.
  2. Having a processing log for each variable.

Metadata handling

We need to preserve the metadata of variables and tables when processing data (whenever possible).

For example, having a table

tb = Table({"a": [1, 2, 3], "b": [4, 5, 6]})
tb.metadata = ...

Where "a" has some sources, and "b" has some other sources, if we create a new variable

tb["c"] = tb["a"] + tb["b"]

we want "c" to have the union of the sources of "a" and "b".

In this branch (WIP) the inheritance of metadata is already achieved for the following operations:

  • Operations with (dunder) methods "+", "-", "*", "**", "/", "//", "%".
  • merge (although there is some further logic that needs to be applied).
  • concat.
  • pivot.
  • melt.
  • ...

NOTE: Even though the metadata combination currently works (and all unit tests pass), the way variable names are handled doesn't feel optimal (via the use of UNNAMED_VARIABLE). Also, more unit tests should be added, and there are several TODOs to be tackled.

Processing log

Now each variable's metadata includes a field called processing_log. This field should ideally contain something like:

# tb["a"].metadata.processing_log
[{"variable": "a", "parents": ["garden/namespace/version/a"], "operation": "load"}]

# tb["b"].metadata.processing_log
[{"variable": "b", "parents": ["garden/namespace/version/b"], "operation": "load"}]

# tb["c"].metadata.processing_log
[{"variable": "a", "parents": ["garden/namespace/version/a"], "operation": "load"}, 
{"variable": "b", "parents": ["garden/namespace/version/b"], "operation": "load"},
{"variable": "c", "parents": ["a", "b"], "operation": "+"}]

The goal would be to have a log entry for each operation done to a variable (e.g. renaming columns, dropping NaNs, etc.).
TODO: When an operation can't handle the processing log automatically, it should be possible to manually insert the log entry.
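
For example, a manual entry could simply be appended in the same format shown above (a sketch; the exact API for this is still open):

# Hypothetical manual log entry after deriving "c" from "a" and "b".
tb["c"].metadata.processing_log.append(
    {"variable": "c", "parents": ["a", "b"], "operation": "rename"}
)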

I started playing around with this, but encountered a few issues:

  • One is that some unit tests were failing (which is probably easy to fix). For now I've commented out some parts to avoid the failing tests.
  • But the main one is that, as it is currently implemented, we can't properly track the variable name. We may need to rethink how to implement this feature.

@Marigold (Collaborator) commented May 19, 2023

Great work, we're almost there! The code looks good; I didn't notice any architectural anti-patterns. I have just one request, and that is to make it optional. If we shipped this, it could have unintended effects on our existing datasets (especially on performance and the size of dataframes; some have thousands of variables, which could be really tricky).

We could do it with an environment variable and a context manager:

import os
from contextlib import contextmanager

@contextmanager
def enable_tracing():
    # Store old value
    old_value = os.environ.get('TRACING', None)

    # Set TRACING to '1'
    os.environ['TRACING'] = '1'

    try:
        # This is where the body of the `with` statement will execute
        yield
    finally:
        # Restore old value
        if old_value is None:
            del os.environ['TRACING']
        else:
            os.environ['TRACING'] = old_value
...
# Read table from garden dataset.
tb_garden = ds_garden["lgbti_policy_index"]

# Enable tracing and the processing log.
with enable_tracing():
    ...  # process tb_garden
class Variable:
    ...
    def __add__(self, other: Union[Scalar, Series, "Variable"]) -> Series:
        if os.environ.get("TRACING") == "1":
            variable = Variable(self.values + other, name=UNNAMED_VARIABLE)  # type: ignore
            variable.metadata = combine_variables_metadata(variables=[self, other], operation="+", name=self.name)
            return variable
        else:
            return super().__add__(other)

(feel free to use a different on/off system than a context manager). With that we can merge this, slowly start using it for new datasets, and see how it behaves. It's not necessary to have all the methods (pivot, melt, ...) implemented; we can add them on the fly as needed.

"""
from .tables import concat, melt, merge, pivot, read_csv, read_excel

__all__ = ["concat", "melt", "merge", "pivot", "read_csv", "read_excel"]
@Marigold (Collaborator) commented:

Cherry on top would be monkey-patching pandas when we are inside the enable_tracing context manager:

import pandas as pd
from contextlib import contextmanager

from owid.catalog import tables as t

@contextmanager
def enable_tracing():
    # Swap in the metadata-aware version of pd.concat (and friends).
    original_concat = pd.concat
    pd.concat = t.concat
    ...
    yield
    ...
    # Restore the original.
    pd.concat = original_concat

It's not necessary though; it might do more harm than good.

@larsyencken (Contributor) left a comment:

This is a mammoth effort, with tons of tricky boilerplate, thanks for the hard work! 🙇

My main architectural question is whether we'd prefer to do all this via inheritance or composition.

Today we do it by inheritance, which means Table inherits from DataFrame, and then we try to plug all the holes in the DataFrame surface area with metadata-friendly methods. The main problem is that pandas is a massive project and the surface area is really large.

The alternative is to do it by composition, which would look more like:

class Table:
    data: pd.DataFrame
    metadata: TableMeta
    ...

Then we could support an opinionated subset of Pandas operations, and everything that we support would preserve metadata. You might still need something not in our supported surface, but then you would be clearly stepping out of it to do that.
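
For instance, each supported operation could delegate to pandas and explicitly carry the metadata over (a sketch; dropna is just an illustrative example):

import pandas as pd
from owid.catalog.meta import TableMeta

class Table:
    def __init__(self, data: pd.DataFrame, metadata: TableMeta):
        self.data = data
        self.metadata = metadata

    def dropna(self, **kwargs) -> "Table":
        # Delegate to pandas, then re-attach the metadata to the result.
        return Table(self.data.dropna(**kwargs), self.metadata)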

I think it's about the same amount of code, but perhaps a neater structure for it. I'd love to discuss it together.

Another meta point is that any change like this really needs to be made alongside the ETL, so that we can see how it interacts with all the code that depends on it right now. I'd love to merge this repo in as a subfolder of the etl repo and work from there. I think you'd find CI much more useful to you then.

@@ -105,6 +105,7 @@ class VariableMeta:
short_unit: Optional[str] = None
display: Optional[Dict[str, Any]] = None
additional_info: Optional[Dict[str, Any]] = None
processing_log: List[Any] = field(default_factory=list)
@larsyencken (Contributor) commented:

Should this be List[str] or List[Dict[str, Any]]?


# TODO: Handle metadata and processing info for each of the following functions.
def pivot(*args, **kwargs) -> Table:
    return Table(pd.pivot(*args, **kwargs))
@larsyencken (Contributor) commented:

In this case, perhaps the table-level metadata should be copied (excluding the primary key).
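
Something along these lines, perhaps (a sketch; treating primary_key as a resettable field of TableMeta is an assumption):

import copy
import pandas as pd

def pivot(table: Table, *args, **kwargs) -> Table:
    result = Table(pd.pivot(table, *args, **kwargs))
    # Copy the table-level metadata, but reset the primary key,
    # since pivoting changes the index.
    result.metadata = copy.deepcopy(table.metadata)
    result.metadata.primary_key = []  # assumed field; illustrative only
    return result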
