
✨ metadata handling improvements #1251

Merged
pabloarosado merged 65 commits into master from improve-metadata-handling on Jun 27, 2023

Conversation

pabloarosado
Contributor

@pabloarosado pabloarosado commented Jun 22, 2023

This PR was documented and reviewed in this old owid-catalog-py PR.
There have been some new changes relative to that PR, mainly:

  • I had to make a minor fix in backport/datasync/data_metadata.py so that grapher steps can be executed.
  • Adapted the us_corn_yields data pipeline to the new metadata tools.
  • Improved the table functions (mainly merge, melt, pivot, concat) and fixed some issues.
  • Added many unit tests.
  • Added some useful functions (e.g. to update and amend entries in the processing log).
  • Made many minor tweaks to old etl steps so that they return the same data and metadata as currently in the catalog. There are still some irrelevant differences. The only significant issues (which I'll mention on Slack) are:
    • data://garden/owid/latest/covid
Data matches the remote version if we restrict dates to ~<2022-05-01; otherwise, the remote data seems to be out of date.
    • data://garden/war/2023-01-18/clodfelter_2017 (and other steps in war)
Data is identical; however, members of 'conflict_participants' appear in a different order.
    • data://open_numbers/open_numbers/latest/bp__energy (and all open_numbers steps)
None of the open_numbers steps were found in the catalog, so I couldn't do the comparison.
      UPDATE: The issue was that the version of these steps was None, for some reason. I have now checked that there is no change in any of the open numbers steps.
    • data://garden/energy/2023-06-01/global_primary_energy (and data://grapher/energy/2023-06-01/global_primary_energy)
      The only difference is that the field "additional_info" disappeared. It used to be transferred from the BP backported dataset to the combined dataset Smil + BP. But that field is probably unnecessary.
    • data://grapher/fasttrack/2023-06-16/guinea_worm
      I was unable to load it from the catalog. Is there anything wrong with this dataset?
    • data://meadow/oecd/2016-06-01/regional_wellbeing
      Again, I wasn't able to load it from the catalog (but the error was different).
    • data://meadow/un/2022-07-11/un_wpp (and data://garden/un/2022-07-11/un_wpp)
When I run these steps, the resulting data is different from the data in the catalog. But this also happens when I run the step on the master branch, so there is something odd with this step.

NOTES:

  • By default, the PROCESSING_LOG environment variable is assumed to be False, and therefore nothing is added to the processing log, so this PR will only affect the way metadata is handled (see the sketch after these notes). In a future PR we'll need to work a bit further on handling the processing log and adapting the unit tests to this new attribute of variables.
  • Is there any way to check that the results of --grapher operations are not affected? I ran just one and the result was identical, but maybe there's a way to check more steps.
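
For illustration, a minimal sketch of how the PROCESSING_LOG flag could gate the log; the helper name and the exact attribute are hypothetical, not this PR's actual implementation:

import os

# Hypothetical: logging is enabled only when the environment variable is
# explicitly set to a truthy value; by default nothing is recorded.
PROCESSING_LOG = os.environ.get("PROCESSING_LOG", "0").lower() in ("1", "true", "yes")

def log_processing_step(variable_metadata, entry: dict) -> None:
    # No-op unless the flag opts in, so metadata handling is unaffected by
    # default (assumes the variable metadata carries a processing_log list).
    if not PROCESSING_LOG:
        return
    variable_metadata.processing_log.append(entry)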

@pabloarosado pabloarosado marked this pull request as ready for review June 26, 2023 09:11
@Marigold
Collaborator

@pabloarosado can you increment ETL_EPOCH so that all datasets get refreshed for reviewers? (Tip for other reviewers: I cloned etl to a new folder and build it there, to avoid rewriting data/ back and forth when switching between branches.)

Member

@lucasrodes lucasrodes left a comment


LGTM!

Thanks for your work, Pablo! <3

@lucasrodes lucasrodes changed the title from "Improve metadata handling" to "✨ metadata handling improvements" on Jun 26, 2023
# Update table metadata.
tb.metadata.title = combine_tables_titles(tables=[left, right])
tb.metadata.description = combine_tables_descriptions(tables=[left, right])
tb.metadata.dataset = DatasetMeta(
Collaborator


This is probably not a blocker for merging, but it's definitely an issue we need to address at some point.

TableMeta.dataset is used to reference dataset metadata. The problem is that the reference doesn't point to the same object: ds.metadata and tb.metadata.dataset are different objects. I'm not sure why it was implemented like this, but I've already run into this inconsistency somewhere else, thinking that if I did tb.metadata.dataset.description = "foo", it would change ds.metadata.description.

So what you're doing here is replacing the reference with a new DatasetMeta object, and tb.metadata.dataset isn't a reference anymore (the two are out of sync and might be re-synced elsewhere).

I don't know how it affects your PR yet; I'll have to dig deeper. Anyway, it'd be worth raising this inconsistency as a separate issue and thinking about what we want out of it (and possibly forbidding setting tb.metadata.dataset). @larsyencken mind sharing your thoughts?

Here's an example demonstrating that the two metadata attributes are different objects:

from owid import catalog
from etl.paths import DATA_DIR

ds = catalog.Dataset(DATA_DIR / "garden/regions/2023-01-01/regions")
# The two ids differ: the table's metadata.dataset is a detached copy,
# not a reference to ds.metadata.
id(ds['regions'].metadata.dataset), id(ds.metadata)

Contributor Author


Currently there's no other way to store sources at the table level, and I don't think it would be better to refer to the dataset of one of the input tables. Solving this properly would mean changing the attributes of datasets and tables (which is something to handle in another PR). So, do you think there's anything else to do for this PR?

Collaborator


The reason I'm worried is that tb.metadata.dataset is used when upserting data to MySQL (which we haven't tested much). By assigning to tb.metadata.dataset you're replacing the reference with an object you're using to propagate sources and licenses. I'm not saying it'll necessarily break things, but it's technical debt that will have to be dealt with sooner or later. (I'm cautious here because I spent two hours last week debugging a problem caused by tb.metadata.dataset, and I'm worried about introducing technical debt before redoing our metadata structure.)

Since the sources of a dataset are the union of its indicators' sources, would it be possible to do this at the very end, when calling create_dataset, instead of with every operation? (Ideally, dataset.sources would be calculated dynamically, but we don't have to do that as part of this.) A sketch of the idea follows below.

(After I finish origins, I'll start looking into tb.metadata.dataset and try to make it a read-only property.)
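
A rough sketch of that idea; collect_dataset_sources is a hypothetical name, and it assumes Dataset.table_names and a sources list on each column's metadata, as in owid-catalog:

from typing import List

from owid.catalog import Dataset
from owid.catalog.meta import Source

def collect_dataset_sources(ds: Dataset) -> List[Source]:
    # Union of indicator-level sources across all tables (hypothetical helper).
    unique: List[Source] = []
    for table_name in ds.table_names:
        tb = ds[table_name]
        for column in tb.columns:
            for source in tb[column].metadata.sources:
                # Source is a dataclass, so == compares field values directly.
                if source not in unique:
                    unique.append(source)
    return unique

create_dataset could then set the dataset's sources from this union once, right before saving.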

Collaborator


The reason it's implemented this way is that we chose tables as the core unit of the ETL, since we were trying to build an ideal data science environment, and that's how people typically do data analysis.

That means the format needs to let you load a table and get everything you need. For that reason, the dataset-level metadata is duplicated onto the table any time you add a table to the dataset:

table.metadata.dataset = self.metadata

Since it is only cached at the table level, we should definitely not write to the table's copy directly. The correct way to do this is to store this metadata at the variable level instead. (A sketch of making the table's copy read-only follows below.)
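
One way to enforce that would be to make the table-level copy read-only, so accidental writes fail loudly. This is an illustrative sketch, not the current TableMeta implementation:

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class TableMetaSketch:
    # Internal slot; in this sketch, Dataset.add() would assign to _dataset
    # directly rather than going through the property.
    _dataset: Optional[Any] = None

    @property
    def dataset(self) -> Optional[Any]:
        return self._dataset

    @dataset.setter
    def dataset(self, value: Any) -> None:
        raise AttributeError(
            "tb.metadata.dataset is a cached copy of ds.metadata; "
            "modify the dataset's own metadata instead."
        )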

to_table = to_table.copy()

# Copy the table metadata.
to_table.metadata = copy.deepcopy(from_table.metadata)
Collaborator


FYI: I found that copy.deepcopy is pretty slow. Maybe you can reuse this piece of code (and turn it into a function, as sketched below)?

We can do it after comparing the runtime of the nightly run (which should help us pinpoint the problematic functions).
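
For instance, the repeated pattern above could be centralized in one helper (copy_metadata is a hypothetical name), so that deepcopy can later be replaced with something faster in a single place:

import copy

from owid.catalog import Table

def copy_metadata(from_table: Table, to_table: Table) -> Table:
    # Return a copy of to_table carrying from_table's table metadata
    # (hypothetical helper; isolates the deepcopy so it can be optimized once).
    to_table = to_table.copy()
    to_table.metadata = copy.deepcopy(from_table.metadata)
    return to_table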

Contributor Author

@pabloarosado pabloarosado Jun 27, 2023


You're right, but I'd prefer to do it in a separate PR. I've created an issue.

@Marigold
Collaborator

A couple of additional notes:

  • I've tested the performance and it looked good!
  • I tried to run those problematic datasets and I think they're fine (still looking into un_wpp, but the others are good).
  • I'm running into pylance (=pyright) linting errors in VSCode and type-check issues in general, but we don't have to fix them as part of this PR.

@lucasrodes
Member

lucasrodes commented Jun 26, 2023

An additional note. I see something like the code below repeated often:

old_title = table[column].metadata.title
old_description = table[column].metadata.description
# some operations on table[column] ...
table[column].metadata.title = old_title
table[column].metadata.description = old_description

I understand that automatically propagating the title and description fields in metadata may not apply in all cases (maybe the processing done to the field implies a change in title/description). However, could we have a flag set to True when we want to propagate certain fields?

tb[column].preserve_fields = True  # automatically propagates metadata.title and metadata.description (and possibly others)
# some operations on table[column] ...

Or something in this spirit.
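
One way to express that idea is a small context manager that snapshots selected fields and restores them afterwards; preserve_metadata is a hypothetical name, not an existing helper:

import copy
from contextlib import contextmanager

@contextmanager
def preserve_metadata(table, column, fields=("title", "description")):
    # Snapshot the selected metadata fields of a column, run the enclosed
    # operations, then restore the fields on whatever the column holds afterwards.
    saved = {f: copy.deepcopy(getattr(table[column].metadata, f)) for f in fields}
    try:
        yield
    finally:
        for f, value in saved.items():
            setattr(table[column].metadata, f, value)

with preserve_metadata(tb, column):
    tb[column] = some_operation(tb[column])  # title/description are restored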

@pabloarosado
Contributor Author

> An additional note. I see something like the code below repeated often: […] could we have a flag set to True when we want to propagate certain fields? […] Or something in this spirit.

Yes Lucas, I think we should rethink some of the logic currently implemented in this PR. But maybe it's better to focus on new steps (done with the new tools in mind) rather than old steps. So we could start working with the new tools and, after gathering some experience, decide what logic would be best to implement and make the changes.

@pabloarosado
Contributor Author

Hi @Marigold, @larsyencken, what do you think, is it ready to be merged? If you have concerns, please let me know. Thanks!

Collaborator

@Marigold Marigold left a comment


I'm a bit concerned about the use of tb.metadata.dataset, but otherwise it looks good!

return table


def get_unique_sources_from_table(table: Table) -> List[Source]:
Collaborator


Dataclasses support the equality operator, so instead of converting to dicts and back to objects (which could hurt performance), you could do something like this:

unique_sources = []
for source in sources:
    # Source is a dataclass, so == compares field values directly.
    if source not in unique_sources:
        unique_sources.append(source)

(not urgent, can be done in a different PR)

Contributor Author


New issue: #1270

@pabloarosado
Contributor Author

Thanks @Marigold, I've just removed all the lines where tb.metadata.dataset was overwritten to combine sources and licenses.

@pabloarosado
Contributor Author

I've just checked all the datadiffs; the only problematic case (the life_tables dataset) I checked locally and found no differences. So I think we're ready to merge.

@pabloarosado pabloarosado merged commit 127704c into master Jun 27, 2023
@pabloarosado pabloarosado deleted the improve-metadata-handling branch June 27, 2023 17:05
@larsyencken larsyencken mentioned this pull request Jun 28, 2023