
[DB Import] Add presubmit check #77

Merged: 3 commits merged into iree-org:main from add_presubmit on Jul 21, 2023

Conversation

@beckerhe (Collaborator) commented Jul 13, 2023

This PR adds a few related things at the same time (apologies for that):

  1. It introduces a new CLI command, config, which adds two utility functions for handling config files: cli.py config dump dumps the entire config after YAML processing, and cli.py config list_pipelines returns a newline-separated list of all the defined pipelines. (A usage sketch follows the list.)
  2. It adds a Docker image definition for running presubmit checks in a Docker container. The main purpose of the Docker container is to build the bigquery-emulator. (It's a Go tool and needs to be built from source.)
  3. It adds a GitHub Actions workflow that first builds the Docker image and then runs the unit and integration tests.
  4. It adds a dummy pipeline config, which is needed to verify that the integration tests work.
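
To make item 1 concrete, here is a rough usage sketch (the python cli.py invocation is an assumption based on the description above; only the subcommand names come from the PR):

```sh
# Dump the entire config after YAML processing to stdout:
python cli.py config dump

# Print one pipeline name per line (handy for shell loops in CI):
python cli.py config list_pipelines
```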

@beckerhe force-pushed the add_presubmit branch 21 times, most recently from 5834093 to c5d0b0d on July 13, 2023 13:22
@beckerhe marked this pull request as ready for review July 13, 2023 13:36
@beckerhe force-pushed the add_presubmit branch 5 times, most recently from abf78c1 to 23f5030 on July 13, 2023 14:00
@beckerhe requested a review from pzread July 13, 2023 14:08
@pzread left a comment

Sorry for the delay. Mostly looks good. I have some comments on handling Docker images.

(Review threads on .github/workflows/db_import.yml and devtools/db_import/Dockerfile.presubmit, all resolved)
@pzread left a comment

LGTM as there is no immediate issue and I don't want to block this for too long.

Regarding the Docker images, I'm happy to explore a new method to manage them (and it really does improve the development workflow). But I think the P0 is to provide users with the digests to pull Docker images from GCR. So if we don't solve it in this PR, make sure we have an assigned issue to track the follow-up.

(Review thread on devtools/docker/dockerfiles/db_import.Dockerfile, resolved)
@beckerhe (Collaborator, Author)

> But I think the P0 is to provide users with the digests to pull Docker images from GCR. So if we don't solve it in this PR, make sure we have an assigned issue to track the follow-up.

If it has such a high priority, I think we should try to solve it immediately, or at least have a plan before merging this.

If we wanted to use the "have the CI build the containers" approach that I'm proposing here, I see the following options (of course it's also possible to just use the proven approach with prod_digests.txt):

  1. Have the user also build their own Docker images on demand. This would mean we have a script which calls docker buildx build instead of docker pull. Docker would then "build" the image, but since we will have all the cache layers in the registry, this will end up being the same image as if it was pulled. The only difference is if a layer is not in the cache: a docker pull would fail, but a docker buildx build would just build the layer. To me this is the preferred approach because the user will never have an outdated or non-matching Docker image; docker buildx build will always fetch or build the proper version. (A rough sketch of such a wrapper follows the list.)
  2. If we wanted to have tags the user can docker pull from, then one option would be to use commit hashes as image tags. In that case we would have a script that determines the commit hash from the git repo and looks up the corresponding Docker image. This of course doesn't work if HEAD is not a commit on main, but we could fall back to the latest common ancestor commit between HEAD and origin/main.
  3. Or we determine the latest commit that affected the Docker image and use that. This has the disadvantage that the script determining the tag to pull now needs to know what could have affected the image, which is really not its job. But the biggest downside I see is that the user needs to know when they should rebuild the Docker image, because local non-committed changes won't affect the logic.
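
For concreteness, a minimal sketch of what the option-1 wrapper could look like (the image name, registry path, and cache settings are made up for illustration):

```sh
#!/bin/sh
# Build (or effectively pull) the presubmit image, reusing cached layers from the registry.
# If every layer is already in the cache this is equivalent to a pull; otherwise the missing
# layers are rebuilt locally instead of failing the way "docker pull" would.
docker buildx build \
  --file devtools/db_import/Dockerfile.presubmit \
  --cache-from type=registry,ref=gcr.io/example-project/db-import-presubmit:cache \
  --tag db-import-presubmit:local \
  --load \
  .
```

The CI job would run the same command with an additional --cache-to type=registry flag so that local builds keep hitting a warm cache.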

So what I'm basically asking is whether solution 1 would be an option or whether we need prebuilt images. 😅 If we do, then option 2 or 3 might be valid, but I would rather lean towards using prod_digests.txt for now and maybe come back to the BuildKit-based approach later.

WDYT?

pzread commented Jul 19, 2023


I'm actually convinced solution 1 is a very good idea, but there are a few things that need to be done:

  1. Create and use our own GCR as a cache, as the GitHub Actions cache is limited.
  2. A wrapper script to help users run it locally.
  3. Figure out how to build dependencies between Dockerfiles (e.g. https://github.com/openxla/openxla-benchmark/blob/main/devtools/docker/dockerfiles/cuda11.8-cudnn8.9.Dockerfile#L9) if there is no digest reference from GCR. It's another major reason we introduced manage_images.py. Maybe https://www.docker.com/blog/dockerfiles-now-support-multiple-build-contexts/#:~:text=Create%20Build%20Pipelines%20by%20Linking%20bake%20Targets can be a solution (a sketch of that follows the list).
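
A minimal sketch of how the named build contexts mentioned in item 3 could replace the digest reference (the image names and the context name base are made up, and it assumes the child Dockerfile is changed to use a plain FROM base):

```sh
# The child Dockerfile says "FROM base" instead of hard-coding a digest; the caller
# supplies what "base" means via an extra named build context:
docker buildx build \
  --build-context base=docker-image://gcr.io/example-project/base:latest \
  --file devtools/docker/dockerfiles/cuda11.8-cudnn8.9.Dockerfile \
  --tag cuda11.8-cudnn8.9:local \
  .
```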

So maybe let's use prod_digests.txt for now, since we need more time to prepare for solution 1. (Also, if you have time, could you create an issue to switch to solution 1?)

@beckerhe (Collaborator, Author) commented Jul 20, 2023

> So maybe let's use prod_digests.txt for now, since we need more time to prepare for solution 1. (Also, if you have time, could you create an issue to switch to solution 1?)

Yes, sounds reasonable. I changed the PR to use manage_images.py. (Cool tool, by the way.)

I also created an issue and backreferenced this PR for future consideration.

@pzread PTAL - there were some non-trivial changes which I believe warrant another review.

@beckerhe requested a review from pzread July 20, 2023 12:54
@beckerhe (Collaborator, Author)

> Figure out how to build dependencies between Dockerfiles (e.g. https://github.com/openxla/openxla-benchmark/blob/main/devtools/docker/dockerfiles/cuda11.8-cudnn8.9.Dockerfile#L9) if there is no digest reference from GCR. It's another major reason we introduced manage_images.py. Maybe https://www.docker.com/blog/dockerfiles-now-support-multiple-build-contexts/#:~:text=Create%20Build%20Pipelines%20by%20Linking%20bake%20Targets can be a solution.

So Docker multi-stage builds kind of support that already. The only downside is that the base image and the CUDA image need to be defined in the same Dockerfile, which kind of defeats the purpose. There have been discussions about supporting something like an IMPORT statement, but so far that doesn't exist. BuildKit also allows defining your own "Dockerfile" formats, and quite a few already exist; see https://github.com/moby/buildkit#exploring-llb. Some of them support something like IMPORT.

Last time I looked at that, I came to the conclusion that the simplest and most robust solution would be to use some arbitrary templating language like Jinja2 to make one Dockerfile out of two or more Dockerfiles. But by now there might be better options available.
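
As a very rough illustration of that "stitch Dockerfiles together" idea (all file names are made up, and plain concatenation stands in for a real templating step like Jinja2):

```sh
# base.Dockerfile defines a stage named "base"; cuda.Dockerfile starts with
# "FROM base AS cuda". Concatenating them yields one valid multi-stage Dockerfile
# that can then be built per stage via --target:
cat base.Dockerfile cuda.Dockerfile > combined.Dockerfile
docker buildx build --file combined.Dockerfile --target cuda --tag cuda:local .
```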

moby/moby#735 has a huge discussion on the topic.

pzread commented Jul 21, 2023

LGTM, Thanks!

@beckerhe merged commit 634d609 into iree-org:main on Jul 21, 2023
7 checks passed
@beckerhe deleted the add_presubmit branch July 21, 2023 06:57
mariecwhite pushed a commit to mariecwhite/openxla-benchmark that referenced this pull request Jul 23, 2023