Binary datasets #3513

Closed
benjaminwinger opened this issue May 17, 2024 · 2 comments · Fixed by #3540

benjaminwinger (Collaborator) commented May 17, 2024

Instead of storing binary datasets in the git repository, it might be a better idea to generate them as CI artifacts and share them between pipelines when they are used for compatibility tests.

That would remove the need to regenerate them and would keep large binary artifacts out of the repository (and also make it easier for the compatibility tests to use larger datasets).

On the other hand, it would add some CI pipeline dependencies that may slow down the pipelines overall (the job generating the database would need to finish before the jobs that use it can start, and that job would need to build kuzu first). Most of the build steps currently take only about a minute, though, so it shouldn't be too much of a slowdown.
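
For illustration, a minimal sketch of the two-job layout this would imply, assuming the standard actions/upload-artifact and actions/download-artifact actions (the job names and the generation script are hypothetical):

```yaml
jobs:
  generate-datasets:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make release  # build kuzu before generating the databases
      - run: python scripts/generate-binary-datasets.py  # hypothetical script
      - uses: actions/upload-artifact@v4
        with:
          name: binary-datasets
          path: dataset/binary-demo

  compatibility-tests:
    needs: generate-datasets  # enforces the ordering described above
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: binary-datasets
          path: dataset/binary-demo
      # ...then run the compatibility tests against the downloaded database
```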

Edit: It's also really annoying to deal with the binary database tests locally when making changes that break storage: short of bumping the storage version number on every change, nothing prevents the tests from running against the old database and potentially allocating arbitrarily large amounts of memory when data isn't where it should be. It might be better to disable those tests by default and only run them in CI.

acquamarin (Collaborator) commented

We currently have binary datasets under the dataset folder for extension-testing purposes.
I have two questions if we plan to generate them as CI artifacts instead:

  1. If we want to run the tests on our local machines, how can we generate those CI artifacts?
  2. Those binary datasets are served by a local fileserver in each pipeline. How can we let the local server access the CI artifacts?

benjaminwinger (Collaborator, Author) commented May 23, 2024

> If we want to run the tests on our local machines, how can we generate those CI artifacts?

Well, they wouldn't be CI artifacts locally, but there are a couple of options, I think:

  1. We could disable the tests locally by default, and let you manually enable them and generate the datasets using a script.
  2. We could default to generating the datasets locally, and in CI disable the generation and use the artifact instead (roughly sketched below).
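
On the CI side, option 2 could look something like the following step fragment; the environment variable and the ctest invocation are assumptions, the point is just that the test harness generates the datasets unless it's told where to find pre-built ones:

```yaml
      - uses: actions/download-artifact@v4
        with:
          name: binary-datasets
          path: dataset/binary-demo
      # Hypothetical variable: when set, the tests use the pre-built
      # datasets instead of generating them as a local run would.
      - run: ctest --output-on-failure
        env:
          KUZU_PREBUILT_DATASET_DIR: ${{ github.workspace }}/dataset/binary-demo
```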

That said, for the tinysnb dataset used by the extension, is it necessary that it be shared between different platforms? The binary-demo dataset was added explicitly for testing stability across multiple platforms, but the tinysnb one seems like it could just always be generated as part of the http-server.py script.

> Those binary datasets are served by a local fileserver in each pipeline. How can we let the local server access the CI artifacts?

You download them first; see "Passing data between jobs in a workflow" in the GitHub docs. For the fileserver job, it could be as simple as the sketch below.
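
A rough sketch, assuming the fileserver job just serves a directory (the script path is an assumption):

```yaml
      - uses: actions/download-artifact@v4
        with:
          name: binary-datasets
          path: dataset  # the directory http-server.py serves
      # start the fileserver over the downloaded datasets
      - run: python scripts/http-server.py &
```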
