
Improve metadata handling #90

Closed
wants to merge 20 commits

Conversation

@pabloarosado (Contributor) commented May 18, 2023

[Work in progress] This PR attempts to add two useful features for working with tables in a data pipeline:

  1. Properly handling metadata.
  2. Having a processing log for each variable.

Metadata handling

We need to preserve the metadata of variables and tables when processing data (whenever possible).

For example, having a table

tb = Table({"a": [1, 2, 3], "b": [4, 5, 6]})
tb.metadata = ...

Where "a" has some sources, and "b" has some other sources, if we create a new variable

tb["c"] = tb["a"] + tb["b"]

we want "c" to have the union of the sources of "a" and "b".

In this branch (WIP) the inheritance of metadata is already achieved for the following operations:

  • Operations with (dunder) methods "+", "-", "*", "**", "/", "//", "%".
  • merge (although there is some further logic that needs to be applied).
  • concat.
  • pivot.
  • melt.
  • ...

NOTE: Even though the metadata combination currently works (and all unit tests pass), the way variable names are handled doesn't feel optimal (via the use of UNNAMED_VARIABLE). Also, more unit tests should be added, and there are several TODOs to be tackled.

Processing log

Now each variable's metadata includes a field called processing_log. This field should ideally contain something like:

# tb["a"].metadata.processing_log
[{"variable": "a", "parents": ["garden/namespace/version/a"], "operation": "load"}]

# tb["b"].metadata.processing_log
[{"variable": "b", "parents": ["garden/namespace/version/b"], "operation": "load"}]

# tb["c"].metadata.processing_log
[{"variable": "a", "parents": ["garden/namespace/version/a"], "operation": "load"}, 
{"variable": "b", "parents": ["garden/namespace/version/b"], "operation": "load"},
{"variable": "c", "parents": ["a", "b"], "operation": "+"}]

The goal would be to have a log entry for each operation done to a variable (e.g. renaming columns, dropping NaNs, etc.).
TODO: When an operation can't handle the processing log automatically, it should be possible to manually insert the log entry.
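
For example, a manual entry could simply be appended in the same format shown above (a sketch; the exact API for this is still open):

# Hypothetical manual log entry after deriving "c" from "a" and "b".
tb["c"].metadata.processing_log.append(
    {"variable": "c", "parents": ["a", "b"], "operation": "rename"}
)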

I started playing around with this, but encountered a few issues:

  • One is that some unit tests were failing (which is probably easy to fix). For now I've commented out some parts to avoid the failing tests.
  • But the main one is that, as it is currently implemented, we can't properly track the variable name. We may need to rethink how to implement this feature.

@Marigold (Collaborator) commented May 19, 2023

Great work, we're almost there! The code looks good; I didn't notice any architectural anti-patterns. I have just one request, and that is to make it optional. If we shipped this, it could have unintended effects on our existing datasets (especially on performance and the size of dataframes; some have thousands of variables, which could be really tricky).

We could do it with an environment variable and a context manager:

import os
from contextlib import contextmanager

@contextmanager
def enable_tracing():
    # Store old value
    old_value = os.environ.get('TRACING', None)

    # Set TRACING to '1'
    os.environ['TRACING'] = '1'

    try:
        # This is where the body of the `with` statement will execute
        yield
    finally:
        # Restore old value
        if old_value is None:
            del os.environ['TRACING']
        else:
            os.environ['TRACING'] = old_value
...
# Read table from garden dataset.
tb_garden = ds_garden["lgbti_policy_index"]

# Enable tracing and the processing log.
with enable_tracing():
    ...  # process tb_garden
class Variable:
    ...
    def __add__(self, other: Union[Scalar, Series, "Variable"]) -> Series:
        if os.environ.get("TRACING") == "1":
            variable = Variable(self.values + other, name=UNNAMED_VARIABLE)  # type: ignore
            variable.metadata = combine_variables_metadata(variables=[self, other], operation="+", name=self.name)
            return variable
        else:
            return super().__add__(other)

(feel free to use a different on/off system than a context manager). With that we can merge this, slowly start using it for new datasets, and see how it behaves. It's not necessary to have all the methods (pivot, melt, ...) implemented; we can add them on the fly as needed.

"""
from .tables import concat, melt, merge, pivot, read_csv, read_excel

__all__ = ["concat", "melt", "merge", "pivot", "read_csv", "read_excel"]
@Marigold (Collaborator) commented:

Cherry on top would be monkey-patching pandas when we are inside the enable_tracing context manager:

import pandas as pd
from contextlib import contextmanager

from owid.catalog import tables as t

@contextmanager
def enable_tracing():
    # Swap in the metadata-aware version of pd.concat (and friends).
    original_concat = pd.concat
    pd.concat = t.concat
    ...
    yield
    ...
    # Restore the original.
    pd.concat = original_concat

It's not necessary though; it might do more harm than good.

@larsyencken (Contributor) left a comment:

This is a mammoth effort, with tons of tricky boilerplate, thanks for the hard work! 🙇

My main architectural question is whether we'd prefer to do all this via inheritance or composition.

Today we do it by inheritance, which means Table inherits from DataFrame, and then we try to plug all the holes in the DataFrame surface area with metadata-friendly methods. The main problem is that pandas is a massive project and the surface area is really large.

The alternative is to do it by composition, which would look more like:

class Table:
    data: pd.DataFrame
    metadata: TableMeta
    ...

Then we could support an opinionated subset of Pandas operations, and everything that we support would preserve metadata. You might still need something not in our supported surface, but then you would be clearly stepping out of it to do that.
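
For instance, each supported operation could delegate to pandas and explicitly carry the metadata over (a sketch; dropna is just an illustrative example):

import pandas as pd
from owid.catalog.meta import TableMeta

class Table:
    def __init__(self, data: pd.DataFrame, metadata: TableMeta):
        self.data = data
        self.metadata = metadata

    def dropna(self, **kwargs) -> "Table":
        # Delegate to pandas, then re-attach the metadata to the result.
        return Table(self.data.dropna(**kwargs), self.metadata)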

I think it's about the same amount of code, but perhaps a neater structure for it. I'd love to discuss it together.

Another meta point is that any change like this really needs to be made alongside the ETL, so that we can see how it interacts with all the code that depends on it right now. I'd love to merge this repo in as a subfolder of the etl repo and work from there. I think you'd find CI much more useful to you then.

@@ -105,6 +105,7 @@ class VariableMeta:
short_unit: Optional[str] = None
display: Optional[Dict[str, Any]] = None
additional_info: Optional[Dict[str, Any]] = None
processing_log: List[Any] = field(default_factory=list)
@larsyencken (Contributor) commented:

Should this be List[str] or List[Dict[str, Any]]?


# TODO: Handle metadata and processing info for each of the following functions.
def pivot(*args, **kwargs) -> Table:
    return Table(pd.pivot(*args, **kwargs))
@larsyencken (Contributor) commented:

In this case, perhaps the table-level metadata should be copied (excluding the primary key).
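
Something along these lines, perhaps (a sketch; treating primary_key as a resettable field of TableMeta is an assumption):

import copy
import pandas as pd

def pivot(table: Table, *args, **kwargs) -> Table:
    result = Table(pd.pivot(table, *args, **kwargs))
    # Copy the table-level metadata, but reset the primary key,
    # since pivoting changes the index.
    result.metadata = copy.deepcopy(table.metadata)
    result.metadata.primary_key = []  # assumed field; illustrative only
    return result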
