Binary datasets #3513

Closed
benjaminwinger opened this issue May 17, 2024 · 2 comments · Fixed by #3540

benjaminwinger (Collaborator) commented May 17, 2024

Instead of storing binary datasets in the git repository, it might be a better idea to generate them as CI artifacts and share them between pipelines when they are used for compatibility tests.

That would remove the need to regenerate them and would keep large binary artifacts out of the repository (and also make it easier for the compatibility tests to use larger datasets).

On the other hand, it would add some CI pipeline dependencies that may slow down the pipelines overall (the job generating the database would need to finish before the jobs that use it can start, and that job would need to build kuzu first). Most of the build steps currently take only about a minute, though, so it shouldn't be too much of a slowdown.
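
For illustration, a minimal sketch of the two-job layout this would imply, assuming the standard actions/upload-artifact and actions/download-artifact actions (the job names and the generation script are hypothetical):

```yaml
jobs:
  generate-datasets:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make release  # build kuzu before generating the databases
      - run: python scripts/generate-binary-datasets.py  # hypothetical script
      - uses: actions/upload-artifact@v4
        with:
          name: binary-datasets
          path: dataset/binary-demo

  compatibility-tests:
    needs: generate-datasets  # enforces the ordering described above
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: binary-datasets
          path: dataset/binary-demo
      # ...then run the compatibility tests against the downloaded database
```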

Edit: It's also really annoying to deal with the binary database tests locally when making changes that break storage: short of bumping the storage version number on every change, nothing prevents the tests from running against the old database and potentially allocating arbitrarily large amounts of memory when data isn't where it should be. It might be better to disable those tests by default and only run them in CI.

acquamarin (Collaborator) commented

We currently have binary datasets under the dataset folder for extension-testing purposes.
I have two questions if we plan to generate them as CI artifacts instead:

  1. If we want to run the tests on our local machines, how can we generate those CI artifacts?
  2. Those binary datasets are served by a local fileserver in each pipeline. How can we let the local server access the CI artifacts?

benjaminwinger (Collaborator, Author) commented May 23, 2024

> If we want to run the tests on our local machines, how can we generate those CI artifacts?

Well, they wouldn't be CI artifacts locally, but there are a couple of options, I think:

  1. We could disable the tests locally by default, and let you manually enable them and generate the datasets using a script.
  2. We could default to generating the datasets locally, and in CI disable the generation and use the artifact instead (roughly sketched below).
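
On the CI side, option 2 could look something like the following step fragment; the environment variable and the ctest invocation are assumptions, the point is just that the test harness generates the datasets unless it's told where to find pre-built ones:

```yaml
      - uses: actions/download-artifact@v4
        with:
          name: binary-datasets
          path: dataset/binary-demo
      # Hypothetical variable: when set, the tests use the pre-built
      # datasets instead of generating them as a local run would.
      - run: ctest --output-on-failure
        env:
          KUZU_PREBUILT_DATASET_DIR: ${{ github.workspace }}/dataset/binary-demo
```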

That said, for the tinysnb dataset used by the extension, is it necessary that it be shared between different platforms? The binary-demo dataset was added explicitly for testing stability across multiple platforms, but the tinysnb one seems like it could just always be generated as part of the http-server.py script.

> Those binary datasets are served by a local fileserver in each pipeline. How can we let the local server access the CI artifacts?

You download them first; see "Passing data between jobs in a workflow" in the GitHub docs. For the fileserver job, it could be as simple as the sketch below.
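
A rough sketch, assuming the fileserver job just serves a directory (the script path is an assumption):

```yaml
      - uses: actions/download-artifact@v4
        with:
          name: binary-datasets
          path: dataset  # the directory http-server.py serves
      # start the fileserver over the downloaded datasets
      - run: python scripts/http-server.py &
```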
