
✨ metadata handling improvements #1251

Merged
pabloarosado merged 65 commits into master from improve-metadata-handling on Jun 27, 2023

Conversation

pabloarosado
Contributor

@pabloarosado pabloarosado commented Jun 22, 2023

This PR was documented and reviewed in this old owid-catalog-py PR.
There have been some new changes relative to that PR, mainly:

  • I had to make a minor fix in backport/datasync/data_metadata.py so that grapher steps can be executed.
  • Adapted the us_corn_yields data pipeline to the new metadata tools.
  • Improved the table functions (mainly merge, melt, pivot, concat) and fixed some issues.
  • Added many unit tests.
  • Added some useful functions (e.g. to update and amend entries in the processing log).
  • Made many minor tweaks to old etl steps so that they return the same data and metadata as currently in the catalog. There are still some irrelevant differences. The only significant issues (which I'll mention on Slack) are:
    • data://garden/owid/latest/covid
Data matches the remote version if we restrict dates to ~<2022-05-01; otherwise, the remote data seems to be out of date.
    • data://garden/war/2023-01-18/clodfelter_2017 (and other steps in war)
Data is identical; however, members of 'conflict_participants' appear in a different order.
    • data://open_numbers/open_numbers/latest/bp__energy (and all open_numbers steps)
None of the open_numbers steps were found in the catalog, so I couldn't do the comparison.
      UPDATE: The issue was that the version of these steps was None, for some reason. I have now checked that there is no change in any of the open numbers steps.
    • data://garden/energy/2023-06-01/global_primary_energy (and data://grapher/energy/2023-06-01/global_primary_energy)
      The only difference is that the field "additional_info" disappeared. It used to be transferred from the BP backported dataset to the combined dataset Smil + BP. But that field is probably unnecessary.
    • data://grapher/fasttrack/2023-06-16/guinea_worm
      I was unable to load it from the catalog. Is there anything wrong with this dataset?
    • data://meadow/oecd/2016-06-01/regional_wellbeing
      Again, I wasn't able to load it from the catalog (but the error was different).
    • data://meadow/un/2022-07-11/un_wpp (and data://garden/un/2022-07-11/un_wpp)
When I run these steps, the resulting data is different from the data in the catalog. But this also happens when I run the step on the master branch, so there is something odd with this step.

NOTES:

  • By default, the PROCESSING_LOG environment variable is assumed to be False, and therefore nothing is added to the processing log, so this PR will only affect the way metadata is handled (see the sketch after these notes). In a future PR we'll need to work a bit further on handling the processing log and adapting the unit tests to this new attribute of variables.
  • Is there any way to check that the results of --grapher operations are not affected? I ran just one and the result was identical, but maybe there's a way to check more steps.
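
For illustration, a minimal sketch of how the PROCESSING_LOG flag could gate the log; the helper name and the exact attribute are hypothetical, not this PR's actual implementation:

import os

# Hypothetical: logging is enabled only when the environment variable is
# explicitly set to a truthy value; by default nothing is recorded.
PROCESSING_LOG = os.environ.get("PROCESSING_LOG", "0").lower() in ("1", "true", "yes")

def log_processing_step(variable_metadata, entry: dict) -> None:
    # No-op unless the flag opts in, so metadata handling is unaffected by
    # default (assumes the variable metadata carries a processing_log list).
    if not PROCESSING_LOG:
        return
    variable_metadata.processing_log.append(entry)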

@pabloarosado pabloarosado marked this pull request as ready for review June 26, 2023 09:11
@Marigold
Collaborator

@pabloarosado can you increment ETL_EPOCH so that all datasets get refreshed for reviewers? (Tip for other reviewers: I cloned etl to a new folder and build it there, to avoid rewriting data/ back and forth when switching between branches.)

Member

@lucasrodes lucasrodes left a comment


LGTM!

Thanks for your work, Pablo! <3

@lucasrodes lucasrodes changed the title from "Improve metadata handling" to "✨ metadata handling improvements" on Jun 26, 2023
# Update table metadata.
tb.metadata.title = combine_tables_titles(tables=[left, right])
tb.metadata.description = combine_tables_descriptions(tables=[left, right])
tb.metadata.dataset = DatasetMeta(
Collaborator


This is probably not a blocker for merging, but it's definitely an issue we need to address at some point.

TableMeta.dataset is used to reference dataset metadata. The problem is that the reference doesn't point to the same object: ds.metadata and tb.metadata.dataset are different objects. I'm not sure why it was implemented like this, but I've already run into this inconsistency somewhere else, thinking that if I did tb.metadata.dataset.description = "foo", it would change ds.metadata.description.

So what you're doing here is replacing the reference with a new DatasetMeta object, and tb.metadata.dataset isn't a reference anymore (the two are out of sync and might be re-synced elsewhere).

I don't know how it affects your PR yet; I'll have to dig deeper. Anyway, it'd be worth raising this inconsistency as a separate issue and thinking about what we want out of it (and possibly forbidding setting tb.metadata.dataset). @larsyencken mind sharing your thoughts?

Here's an example demonstrating that the two metadata attributes are different objects:

from owid import catalog
from etl.paths import DATA_DIR

ds = catalog.Dataset(DATA_DIR / "garden/regions/2023-01-01/regions")
# The two ids differ: the table's metadata.dataset is a detached copy,
# not a reference to ds.metadata.
id(ds['regions'].metadata.dataset), id(ds.metadata)

Contributor Author


Currently there's no other way to store sources at the table level, and I don't think it would be better to refer to the dataset of one of the input tables. Solving this properly would mean changing the attributes of datasets and tables (which is something to handle in another PR). So, do you think there's anything else to do for this PR?

Collaborator


The reason I'm worried is that tb.metadata.dataset is used when upserting data to MySQL (which we haven't tested much). By assigning to tb.metadata.dataset you're replacing the reference with an object you're using to propagate sources and licenses. I'm not saying it'll necessarily break things, but it's technical debt that will have to be dealt with sooner or later. (I'm cautious here because I spent two hours last week debugging a problem caused by tb.metadata.dataset, and I'm worried about introducing technical debt before redoing our metadata structure.)

Since the sources of a dataset are the union of its indicators' sources, would it be possible to do this at the very end, when calling create_dataset, instead of with every operation? (Ideally, dataset.sources would be calculated dynamically, but we don't have to do that as part of this.) A sketch of the idea follows below.

(After I finish origins, I'll start looking into tb.metadata.dataset and try to make it a read-only property.)
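
A rough sketch of that idea; collect_dataset_sources is a hypothetical name, and it assumes Dataset.table_names and a sources list on each column's metadata, as in owid-catalog:

from typing import List

from owid.catalog import Dataset
from owid.catalog.meta import Source

def collect_dataset_sources(ds: Dataset) -> List[Source]:
    # Union of indicator-level sources across all tables (hypothetical helper).
    unique: List[Source] = []
    for table_name in ds.table_names:
        tb = ds[table_name]
        for column in tb.columns:
            for source in tb[column].metadata.sources:
                # Source is a dataclass, so == compares field values directly.
                if source not in unique:
                    unique.append(source)
    return unique

create_dataset could then set the dataset's sources from this union once, right before saving.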

Collaborator


The reason it's implemented this way is that we chose tables as the core unit of the ETL, since we were trying to build an ideal data science environment, and that's how people typically do data analysis.

That means the format needs to let you load a table and get everything you need. For that reason, the dataset-level metadata is duplicated onto the table any time you add a table to the dataset:

table.metadata.dataset = self.metadata

Since it is only cached at the table level, we should definitely not write to the table's copy directly. The correct way to do this is to store this metadata at the variable level instead. (A sketch of making the table's copy read-only follows below.)
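
One way to enforce that would be to make the table-level copy read-only, so accidental writes fail loudly. This is an illustrative sketch, not the current TableMeta implementation:

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class TableMetaSketch:
    # Internal slot; in this sketch, Dataset.add() would assign to _dataset
    # directly rather than going through the property.
    _dataset: Optional[Any] = None

    @property
    def dataset(self) -> Optional[Any]:
        return self._dataset

    @dataset.setter
    def dataset(self, value: Any) -> None:
        raise AttributeError(
            "tb.metadata.dataset is a cached copy of ds.metadata; "
            "modify the dataset's own metadata instead."
        )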

to_table = to_table.copy()

# Copy the table metadata.
to_table.metadata = copy.deepcopy(from_table.metadata)
Collaborator


FYI: I found that copy.deepcopy is pretty slow. Maybe you can reuse this piece of code (and turn it into a function, as sketched below)?

We can do it after comparing the runtime of the nightly run (which should help us pinpoint the problematic functions).
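
For instance, the repeated pattern above could be centralized in one helper (copy_metadata is a hypothetical name), so that deepcopy can later be replaced with something faster in a single place:

import copy

from owid.catalog import Table

def copy_metadata(from_table: Table, to_table: Table) -> Table:
    # Return a copy of to_table carrying from_table's table metadata
    # (hypothetical helper; isolates the deepcopy so it can be optimized once).
    to_table = to_table.copy()
    to_table.metadata = copy.deepcopy(from_table.metadata)
    return to_table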

Contributor Author

@pabloarosado pabloarosado Jun 27, 2023


You're right, but I'd prefer to do it in a separate PR. I've created an issue.

@Marigold
Collaborator

A couple of additional notes:

  • I've tested the performance and it looked good!
  • I tried to run those problematic datasets and I think they're fine (still looking into un_wpp, but the others are good).
  • I'm running into pylance (=pyright) linting errors in VSCode and type-check issues in general, but we don't have to fix them as part of this PR.

@lucasrodes
Member

lucasrodes commented Jun 26, 2023

An additional note. I see something like the code below repeated often:

old_title = table[column].metadata.title
old_description = table[column].metadata.description
# some operations on table[column] ...
table[column].metadata.title = old_title
table[column].metadata.description = old_description

I understand that automatically propagating the title and description fields in metadata may not apply in all cases (maybe the processing done to the field implies a change in title/description). However, could we have a flag set to True when we want to propagate certain fields?

tb[column].preserve_fields = True  # automatically propagates metadata.title and metadata.description (and possibly others)
# some operations on table[column] ...

Or something in this spirit.
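
One way to express that idea is a small context manager that snapshots selected fields and restores them afterwards; preserve_metadata is a hypothetical name, not an existing helper:

import copy
from contextlib import contextmanager

@contextmanager
def preserve_metadata(table, column, fields=("title", "description")):
    # Snapshot the selected metadata fields of a column, run the enclosed
    # operations, then restore the fields on whatever the column holds afterwards.
    saved = {f: copy.deepcopy(getattr(table[column].metadata, f)) for f in fields}
    try:
        yield
    finally:
        for f, value in saved.items():
            setattr(table[column].metadata, f, value)

with preserve_metadata(tb, column):
    tb[column] = some_operation(tb[column])  # title/description are restored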

@pabloarosado
Contributor Author

> An additional note. I see something like the code below repeated often: […] could we have a flag set to True when we want to propagate certain fields? […] Or something in this spirit.

Yes Lucas, I think we should rethink some of the logic currently implemented in this PR. But maybe it's better to focus on new steps (done with the new tools in mind) rather than old steps. So we could start working with the new tools and, after gathering some experience, decide what logic would be best to implement and make the changes.

@pabloarosado
Contributor Author

Hi @Marigold, @larsyencken, what do you think, is it ready to be merged? If you have concerns, please let me know. Thanks!

Collaborator

@Marigold Marigold left a comment


I'm a bit concerned about the use of tb.metadata.dataset, but otherwise it looks good!

return table


def get_unique_sources_from_table(table: Table) -> List[Source]:
Collaborator


Dataclasses support the equality operator, so instead of converting to dicts and back to objects (which could hurt performance), you could do something like this:

unique_sources = []
for source in sources:
    # Source is a dataclass, so == compares field values directly.
    if source not in unique_sources:
        unique_sources.append(source)

(not urgent, can be done in a different PR)

Contributor Author


New issue: #1270

@pabloarosado
Contributor Author

Thanks @Marigold, I've just removed all the lines where tb.metadata.dataset was overwritten to combine sources and licenses.

@pabloarosado
Contributor Author

I've just checked all the datadiffs; the only problematic case (the life_tables dataset) I checked locally and found no differences. So I think we're ready to merge.

@pabloarosado pabloarosado merged commit 127704c into master Jun 27, 2023
@pabloarosado pabloarosado deleted the improve-metadata-handling branch June 27, 2023 17:05
@larsyencken larsyencken mentioned this pull request Jun 28, 2023