Great work, we're almost there! The code looks good; I didn't notice any architecture anti-patterns. I have just one request, and that is to make it optional. If we shipped this, it could have unintended effects on our existing datasets (especially the performance and size of dataframes; some can have thousands of variables, which could be really tricky). We could do it with an environment variable and a context manager:

```python
import os
from contextlib import contextmanager

@contextmanager
def enable_tracing():
    # Store old value
    old_value = os.environ.get('TRACING', None)
    # Set TRACING to '1'
    os.environ['TRACING'] = '1'
    try:
        # This is where the body of the `with` statement will execute
        yield
    finally:
        # Restore old value
        if old_value is None:
            del os.environ['TRACING']
        else:
            os.environ['TRACING'] = old_value
```

```python
# Read table from garden dataset.
tb_garden = ds_garden["lgbti_policy_index"]

# Enable tracing and processing log
with enable_tracing():
    ...  # process tb_garden
```

```python
class Variable:
    ...
    def __add__(self, other: Union[Scalar, Series, "Variable"]) -> Series:
        if os.environ.get("TRACING") == "1":
            variable = Variable(self.values + other, name=UNNAMED_VARIABLE)  # type: ignore
            variable.metadata = combine_variables_metadata(variables=[self, other], operation="+", name=self.name)
            return variable
        else:
            return super().__add__(other)
```

(Feel free to use a different on/off system than a context manager.) With that we can merge this, slowly start using it for new datasets, and see how it behaves. It's not necessary to have all methods implemented.
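For clarity, here is a minimal, self-contained sketch of the same on/off pattern (stdlib only; the `TRACING` variable name follows the review comment), showing that the flag is set only inside the `with` block and restored afterwards:

```python
import os
from contextlib import contextmanager

@contextmanager
def enable_tracing():
    # Save and restore the previous value so the toggle has no lasting effect.
    old_value = os.environ.get("TRACING")
    os.environ["TRACING"] = "1"
    try:
        yield
    finally:
        if old_value is None:
            os.environ.pop("TRACING", None)
        else:
            os.environ["TRACING"] = old_value

os.environ.pop("TRACING", None)  # start from a clean state
with enable_tracing():
    inside = os.environ.get("TRACING")
outside = os.environ.get("TRACING")
```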
""" | ||
from .tables import concat, melt, merge, pivot, read_csv, read_excel | ||
|
||
__all__ = ["concat", "melt", "merge", "pivot", "read_csv", "read_excel"] |
Cherry on the top would be monkey patching pandas if we are in the `enable_tracing` context manager:

```python
from owid.catalog import tables as t

@contextmanager
def enable_tracing():
    original_concat = pd.concat
    pd.concat = t.concat
    ...
    yield
    ...
    pd.concat = original_concat
```

It's not necessary though; it might do more bad than good.
This is a mammoth effort, with tons of tricky boilerplate, thanks for the hard work! 🙇
My main architectural question is whether we'd prefer to do all this via inheritance or composition.
Today we do it by inheritance, which means `Table` inherits from `DataFrame`, and then we try to plug all the holes in the `DataFrame` surface area with metadata-friendly methods. The main problem is that Pandas is a massive project and the surface is really large.
The alternative is to do it by composition, which would look more like:

```python
class Table:
    data: pd.DataFrame
    metadata: TableMeta
    ...
```
Then we could support an opinionated subset of Pandas operations, and everything that we support would preserve metadata. You might still need something not in our supported surface, but then you would be clearly stepping out of it to do that.
I think it's about the same amount of code, but perhaps a neater structure for it. I'd love to take discussion together on it.
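As a rough illustration, a runnable sketch of the composition approach might look like this (the `TableMeta` fields and the merge-metadata rule are illustrative assumptions, not the actual owid.catalog API):

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

import pandas as pd

@dataclass
class TableMeta:
    short_name: str = ""
    extra: Dict[str, Any] = field(default_factory=dict)

class Table:
    """Wraps a DataFrame instead of inheriting from it."""

    def __init__(self, data: pd.DataFrame, metadata: Optional[TableMeta] = None):
        self.data = data
        self.metadata = metadata or TableMeta()

    def merge(self, other: "Table", **kwargs) -> "Table":
        # Only wrapped operations are part of the supported surface; the
        # metadata rule here is a naive placeholder (keep the left table's).
        return Table(self.data.merge(other.data, **kwargs), copy.deepcopy(self.metadata))

left = Table(pd.DataFrame({"k": [1, 2], "a": [3, 4]}), TableMeta(short_name="demo"))
right = Table(pd.DataFrame({"k": [1, 2], "b": [5, 6]}))
merged = left.merge(right, on="k")
```

Anything not wrapped would require reaching into `.data` explicitly, which makes stepping outside the supported surface visible in the calling code.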
Another meta point is that any change like this really needs to be made alongside the ETL, so that we can see how it interacts with all the code that depends on it right now. I'd love to merge this repo in as a subfolder of the `etl` repo and work from there. I think you'd find that CI was much more useful to you then.
```diff
@@ -105,6 +105,7 @@ class VariableMeta:
     short_unit: Optional[str] = None
     display: Optional[Dict[str, Any]] = None
     additional_info: Optional[Dict[str, Any]] = None
+    processing_log: List[Any] = field(default_factory=list)
```
Should this be `List[str]` or `List[Dict[str, Any]]`?
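For illustration, here is how the same log entry might look under each candidate type (the keys and wording are hypothetical, not part of the PR):

```python
from typing import Any, Dict, List

# Structured form: easy to filter and inspect programmatically.
log_as_dicts: List[Dict[str, Any]] = [
    {"operation": "+", "parents": ["a", "b"], "result": "c"},
]

# Plain-string form: simpler, but the details must be parsed back out.
log_as_strings: List[str] = ["c = a + b"]
```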
```python
# TODO: Handle metadata and processing info for each of the following functions.
def pivot(*args, **kwargs) -> Table:
    return Table(pd.pivot(*args, **kwargs))
```
In this case, perhaps the table-level metadata should be copied (excluding the primary key).
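A hedged sketch of that suggestion, using a toy `TableMeta` stand-in (the real class lives in owid.catalog): copy the table-level metadata but reset the primary key, since the old key columns generally disappear after a pivot.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Tuple

import pandas as pd

# Toy stand-in for owid.catalog's TableMeta, just to illustrate the point.
@dataclass
class TableMeta:
    short_name: str = ""
    primary_key: List[str] = field(default_factory=list)

def pivot(df: pd.DataFrame, meta: TableMeta, **kwargs) -> Tuple[pd.DataFrame, TableMeta]:
    # Copy table-level metadata, but exclude the primary key: after a pivot
    # the old key columns generally no longer exist.
    new_meta = copy.deepcopy(meta)
    new_meta.primary_key = []
    return pd.pivot(df, **kwargs), new_meta

df = pd.DataFrame({"year": [2000, 2000], "country": ["a", "b"], "v": [1, 2]})
meta = TableMeta(short_name="demo", primary_key=["year", "country"])
pivoted, new_meta = pivot(df, meta, index="year", columns="country", values="v")
```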
[Work in progress] This PR attempts to achieve two useful features for working with tables in a data pipeline:

Metadata handling

We need to preserve the metadata of variables and tables when processing data (whenever possible). For example, having a table where "a" has some sources and "b" has some other sources, if we create a new variable "c" from them, we want "c" to have the union of the sources of "a" and "b".
In this branch (WIP) the inheritance of metadata is already achieved for the following operations:

- `merge` (although there is some further logic that needs to be applied).
- `concat`.
- `pivot`.
- `melt`.

NOTE: Even if the metadata combination is currently working (and all unit tests pass), the way variable names are handled doesn't feel optimal (via the use of `UNNAMED_VARIABLE`). Also, more unit tests should be included, and there are several TODOs to be tackled.

Processing log
Now each variable's metadata includes a field called `processing_log`. This field should ideally contain something like:

The goal would be to have a log entry for each operation done to a variable (e.g. renaming columns, dropping nans, etc.).

TODO: When an operation can't handle the processing log automatically, it should be possible to manually insert the log entry.
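A hypothetical helper for that TODO (the function name and entry shape are assumptions, not the PR's API):

```python
from typing import Any, Dict, List

def add_log_entry(processing_log: List[Dict[str, Any]], operation: str, **details: Any) -> None:
    # Manually record an operation that couldn't be logged automatically.
    processing_log.append({"operation": operation, **details})

log: List[Dict[str, Any]] = []
add_log_entry(log, "dropna", columns=["a"])
```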
I started playing around with this, but encountered a few issues: