[WIP] Add an on-disk format for datasets using Frictionless and feather #38

Closed
larsyencken wants to merge 4 commits

@larsyencken larsyencken commented Sep 15, 2021

A challenge of the data catalog is to identify the metadata we need to get data into Grapher, and to make sure we have a format that can capture it all. This PR makes a first attempt at an on-disk format for datasets.

Frictionless + feather

Note: this is a prototype format meant as a starting point. We expect to be able to change this on-disk format arbitrarily in future.

  • Each dataset gets its own folder; the name is not important
  • The folder contains many tables in Feather format (dir/mytable1.feather, dir/mytable2.feather, ...)
  • Metadata for the dataset as a whole, and for all individual tables, is stored in datapackage.json
  • datapackage.json conforms to the Frictionless data standard, letting us use their tools for validation
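As a rough illustration of the layout above, the following sketch scans a dataset folder for Feather files and writes a minimal datapackage.json that lists each one as a resource. The function name `write_datapackage` and the choice of fields (`name`, `path`, `format`, `title`) are assumptions made for this example, not the PR's actual code; writing the Feather files themselves is left out.

```python
import json
from pathlib import Path


def write_datapackage(dataset_dir: Path, title: str = "") -> dict:
    """Write a minimal datapackage.json listing every .feather file in dataset_dir.

    Hypothetical helper: the descriptor fields used here are an assumed
    minimum, not the full metadata the catalog would need.
    """
    resources = [
        {
            # Resource names are lower-cased to keep them simple identifiers.
            "name": path.stem.lower(),
            # Paths are relative to the dataset folder.
            "path": path.name,
            "format": "feather",
        }
        for path in sorted(dataset_dir.glob("*.feather"))
    ]

    descriptor = {"resources": resources}
    if title:
        # Dataset-level metadata is optional and can be added later.
        descriptor["title"] = title

    (dataset_dir / "datapackage.json").write_text(json.dumps(descriptor, indent=2))
    return descriptor
```

Because the descriptor is rebuilt from whatever tables are present, a helper like this could be rerun each time a table is added to the folder.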

Rich data frames and series

  • Design goals:
    • We can have multiple serialisation formats for datasets
    • At any time, our tables and series behave and feel like pandas DataFrames and Series
    • We can start without any metadata at all, and incrementally add it
  • For datasets, we introduce the Dataset protocol and AboutThisDataset class for metadata
  • For tables, we introduce the RichDataFrame class and AboutThisTable class for metadata
  • For variables, we introduce the RichDataSeries class and AboutThisSeries class for metadata
  • At each of these levels, the metadata object is available as obj.metadata
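One way the table level could look is sketched below, using pandas' documented subclassing hooks (`_metadata` and `_constructor`). The class names come from the description above, but the fields on `AboutThisTable` and the constructor signature are assumptions for illustration only; the PR's real classes may differ.

```python
from dataclasses import dataclass
from typing import Optional

import pandas as pd


@dataclass
class AboutThisTable:
    # Every field is optional, so a table can start with no metadata at all.
    # These particular fields are hypothetical.
    title: Optional[str] = None
    description: Optional[str] = None


class RichDataFrame(pd.DataFrame):
    # Listing the attribute here asks pandas to carry it through
    # operations that call __finalize__.
    _metadata = ["metadata"]

    def __init__(self, *args, metadata: Optional[AboutThisTable] = None, **kwargs):
        super().__init__(*args, **kwargs)
        self.metadata = metadata if metadata is not None else AboutThisTable()

    @property
    def _constructor(self):
        # Make pandas operations return RichDataFrame rather than DataFrame.
        return RichDataFrame


# Usage: behaves like a DataFrame, with metadata filled in incrementally.
table = RichDataFrame({"country": ["FR", "DE"], "value": [1.0, 2.0]})
table.metadata.title = "Example table"
```

Keeping the metadata in a separate dataclass, rather than as loose attributes on the frame, would let the same object be translated into a Frictionless resource when the dataset is saved.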

Todo

  • Get RichDataFrame and RichSeries working and interoperating well
  • Get Datasets serialising and deserialising from Frictionless + Feather
    • Translate RichDataFrame metadata into Frictionless resource format
    • Serialise dataset to Frictionless + Feather
    • Deserialise dataset from Frictionless + Feather
  • Support creating Frictionless + Feather datasets progressively

After this, the plan is to get a review, merge, and then try importing the WHO GHO dataset into this format.

@larsyencken

Closing this PR -- the work ended up forking and becoming the etl repository.

@larsyencken larsyencken closed this Nov 4, 2021