This repository has been archived by the owner on Nov 1, 2023. It is now read-only.

feature: Properly track metadata and processing of each variable #91

Open
pabloarosado opened this issue May 25, 2023 · 0 comments

@pabloarosado
Contributor

Very often when we do simple operations on a variable, the metadata disappears. We need to:

  1. Ensure the metadata is inherited properly (when possible), e.g. if tb["c"] = tb["a"] + tb["b"], the new variable c should have the union of sources and licenses of a and b.
  2. Keep a log of all processing done to a variable, e.g. "variable loaded from table ...", "variable c created as the sum of variables a and b", etc.
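
A rough sketch of both points, in plain Python rather than the actual table and variable classes. Everything here (`VariableSketch`, `union_keep_order`, `add_variables`) is a hypothetical name made up for illustration, and the real implementation in the branch may look quite different:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VariableSketch:
    """Hypothetical stand-in for a table column that carries metadata."""

    name: str
    values: List[float]
    sources: List[str] = field(default_factory=list)
    licenses: List[str] = field(default_factory=list)
    processing_log: List[str] = field(default_factory=list)


def union_keep_order(first: List[str], second: List[str]) -> List[str]:
    """Union of two lists, without duplicates, keeping first-seen order."""
    return list(dict.fromkeys(first + second))


def add_variables(a: VariableSketch, b: VariableSketch, new_name: str) -> VariableSketch:
    """Sum two variables, inheriting metadata and logging the operation."""
    return VariableSketch(
        name=new_name,
        values=[x + y for x, y in zip(a.values, b.values)],
        # Point 1: the result gets the union of sources and licenses.
        sources=union_keep_order(a.sources, b.sources),
        licenses=union_keep_order(a.licenses, b.licenses),
        # Point 2: record how the new variable was created.
        processing_log=[
            f"variable {new_name} created as the sum of variables {a.name} and {b.name}"
        ],
    )


a = VariableSketch("a", [1.0, 2.0], sources=["Source 1"], licenses=["CC BY 4.0"])
b = VariableSketch("b", [10.0, 20.0], sources=["Source 1", "Source 2"], licenses=["MIT"])
c = add_variables(a, b, "c")
```

Here `c.sources` contains each source once, and `c.processing_log` has a single entry describing the sum. Whether the log of `c` should also carry over the logs of `a` and `b` is an open design question that this sketch does not settle.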

I started implementing this logic in this branch (and created a PR), but more work is needed to make the changes robust and to add further logic and features.

I also created an etl branch to test these changes on a simple dataset. We may decide to delete this etl branch in the future if things change significantly.

Once these features are implemented, we need to ensure that all active ETL steps still work without any modification (and check that they don't take much longer to run). To migrate to a workflow where we properly handle metadata and keep a processing log, we could start by adding a default processing log to each variable in ETL, with three entries: "variable loaded from table ...", "data processing", "variable saved to table ...". Then, whenever a step is updated, its code could be refactored to build the processing log properly.
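
The default log could be as simple as the following sketch. The function names and the `source_table` / `target_table` parameters are assumptions for illustration only, not existing ETL code:

```python
from typing import List, Optional


def default_processing_log(source_table: str, target_table: str) -> List[str]:
    """Placeholder three-entry log for steps that have not been refactored yet."""
    return [
        f"variable loaded from table {source_table}",
        "data processing",
        f"variable saved to table {target_table}",
    ]


def resolve_processing_log(
    explicit_log: Optional[List[str]], source_table: str, target_table: str
) -> List[str]:
    """Use the log a step built itself, else fall back to the default one."""
    if explicit_log:
        return list(explicit_log)
    return default_processing_log(source_table, target_table)
```

With a fallback like this, steps that have not been touched would still get a (coarse) log, while refactored steps would pass their own detailed one.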

@pabloarosado pabloarosado self-assigned this May 25, 2023