
[DB Import] Add presubmit check #77

Merged: 3 commits merged into iree-org:main from add_presubmit on Jul 21, 2023

Conversation

@beckerhe (Collaborator) commented Jul 13, 2023

This PR adds a few related things at the same time (apologies for that):

  1. It introduces a new CLI command, config, which adds two utility functions for handling config files: cli.py config dump dumps the entire config after YAML processing, and cli.py config list_pipelines returns a newline-separated list of all the defined pipelines. (A usage sketch follows the list.)
  2. It adds a Docker image definition for running presubmit checks in a Docker container. The main purpose of the Docker container is to build the bigquery-emulator. (It's a Go tool and needs to be built from source.)
  3. It adds a GitHub Actions workflow that first builds the Docker image and then runs the unit and integration tests.
  4. It adds a dummy pipeline config, which is needed to verify that the integration tests work.
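
To make item 1 concrete, here is a rough usage sketch (the python cli.py invocation is an assumption based on the description above; only the subcommand names come from the PR):

```sh
# Dump the entire config after YAML processing to stdout:
python cli.py config dump

# Print one pipeline name per line (handy for shell loops in CI):
python cli.py config list_pipelines
```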

@beckerhe force-pushed the add_presubmit branch 21 times, most recently from 5834093 to c5d0b0d on July 13, 2023 13:22
@beckerhe marked this pull request as ready for review July 13, 2023 13:36
@beckerhe force-pushed the add_presubmit branch 5 times, most recently from abf78c1 to 23f5030 on July 13, 2023 14:00
@beckerhe requested a review from pzread July 13, 2023 14:08
@pzread left a comment

Sorry for the delay. Mostly looks good. I have some comments on handling Docker images.

(Review threads on .github/workflows/db_import.yml and devtools/db_import/Dockerfile.presubmit, all resolved)
@pzread left a comment

LGTM as there is no immediate issue and I don't want to block this for too long.

Regarding the Docker images, I'm happy to explore a new method to manage them (and it really does improve the development workflow). But I think the P0 is to provide users with the digests to pull Docker images from GCR. So if we don't solve it in this PR, make sure we have an assigned issue to track the follow-up.

(Review thread on devtools/docker/dockerfiles/db_import.Dockerfile, resolved)
@beckerhe (Collaborator, Author)

> But I think the P0 is to provide users with the digests to pull Docker images from GCR. So if we don't solve it in this PR, make sure we have an assigned issue to track the follow-up.

If it has such a high priority, I think we should try to solve it immediately, or at least have a plan before merging this.

If we wanted to use the "have the CI build the containers" approach that I'm proposing here, I see the following options (of course it's also possible to just use the proven approach with prod_digests.txt):

  1. Have the user also build their own Docker images on demand. This would mean we have a script which calls docker buildx build instead of docker pull. Docker would then "build" the image, but since we will have all the cache layers in the registry, this will end up being the same image as if it was pulled. The only difference is if a layer is not in the cache: a docker pull would fail, but a docker buildx build would just build the layer. To me this is the preferred approach because the user will never have an outdated or non-matching Docker image; docker buildx build will always fetch or build the proper version. (A rough sketch of such a wrapper follows the list.)
  2. If we wanted to have tags the user can docker pull from, then one option would be to use commit hashes as image tags. In that case we would have a script that determines the commit hash from the git repo and looks up the corresponding Docker image. This of course doesn't work if HEAD is not a commit on main, but we could fall back to the latest common ancestor commit between HEAD and origin/main.
  3. Or we determine the latest commit that affected the Docker image and use that. This has the disadvantage that the script determining the tag to pull now needs to know what could have affected the image, which is really not its job. But the biggest downside I see is that the user needs to know when they should rebuild the Docker image, because local non-committed changes won't affect the logic.
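
For concreteness, a minimal sketch of what the option-1 wrapper could look like (the image name, registry path, and cache settings are made up for illustration):

```sh
#!/bin/sh
# Build (or effectively pull) the presubmit image, reusing cached layers from the registry.
# If every layer is already in the cache this is equivalent to a pull; otherwise the missing
# layers are rebuilt locally instead of failing the way "docker pull" would.
docker buildx build \
  --file devtools/db_import/Dockerfile.presubmit \
  --cache-from type=registry,ref=gcr.io/example-project/db-import-presubmit:cache \
  --tag db-import-presubmit:local \
  --load \
  .
```

The CI job would run the same command with an additional --cache-to type=registry flag so that local builds keep hitting a warm cache.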

So what I'm basically asking is whether solution 1 would be an option or whether we need prebuilt images. 😅 If we do, then option 2 or 3 might be valid, but I would rather lean towards using prod_digests.txt for now and maybe come back to the BuildKit-based approach later.

WDYT?

pzread commented Jul 19, 2023


I'm actually convinced solution 1 is a very good idea, but there are a few things that need to be done:

  1. Create and use our own GCR as a cache, as the GitHub Actions cache is limited.
  2. A wrapper script to help users run it locally.
  3. Figure out how to build dependencies between Dockerfiles (e.g. https://github.com/openxla/openxla-benchmark/blob/main/devtools/docker/dockerfiles/cuda11.8-cudnn8.9.Dockerfile#L9) if there is no digest reference from GCR. It's another major reason we introduced manage_images.py. Maybe https://www.docker.com/blog/dockerfiles-now-support-multiple-build-contexts/#:~:text=Create%20Build%20Pipelines%20by%20Linking%20bake%20Targets can be a solution (a sketch of that follows the list).
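
A minimal sketch of how the named build contexts mentioned in item 3 could replace the digest reference (the image names and the context name base are made up, and it assumes the child Dockerfile is changed to use a plain FROM base):

```sh
# The child Dockerfile says "FROM base" instead of hard-coding a digest; the caller
# supplies what "base" means via an extra named build context:
docker buildx build \
  --build-context base=docker-image://gcr.io/example-project/base:latest \
  --file devtools/docker/dockerfiles/cuda11.8-cudnn8.9.Dockerfile \
  --tag cuda11.8-cudnn8.9:local \
  .
```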

So maybe let's use prod_digests.txt for now, since we need more time to prepare for solution 1. (Also, if you have time, could you create an issue to switch to solution 1?)

@beckerhe (Collaborator, Author) commented Jul 20, 2023

> So maybe let's use prod_digests.txt for now, since we need more time to prepare for solution 1. (Also, if you have time, could you create an issue to switch to solution 1?)

Yes, sounds reasonable. I changed the PR to use manage_images.py. (Cool tool, by the way.)

I also created an issue and backreferenced this PR for future consideration.

@pzread PTAL - there were some non-trivial changes which I believe warrant another review.

@beckerhe requested a review from pzread July 20, 2023 12:54
@beckerhe (Collaborator, Author)

> Figure out how to build dependencies between Dockerfiles (e.g. https://github.com/openxla/openxla-benchmark/blob/main/devtools/docker/dockerfiles/cuda11.8-cudnn8.9.Dockerfile#L9) if there is no digest reference from GCR. It's another major reason we introduced manage_images.py. Maybe https://www.docker.com/blog/dockerfiles-now-support-multiple-build-contexts/#:~:text=Create%20Build%20Pipelines%20by%20Linking%20bake%20Targets can be a solution.

So Docker multi-stage builds kind of support that already. The only downside is that the base image and the CUDA image need to be defined in the same Dockerfile, which kind of defeats the purpose. There have been discussions about supporting something like an IMPORT statement, but so far that doesn't exist. BuildKit also allows defining your own "Dockerfile" formats, and quite a few already exist; see https://github.com/moby/buildkit#exploring-llb. Some of them support something like IMPORT.

Last time I looked at that, I came to the conclusion that the simplest and most robust solution would be to use some arbitrary templating language like Jinja2 to make one Dockerfile out of two or more Dockerfiles. But by now there might be better options available.
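
As a very rough illustration of that "stitch Dockerfiles together" idea (all file names are made up, and plain concatenation stands in for a real templating step like Jinja2):

```sh
# base.Dockerfile defines a stage named "base"; cuda.Dockerfile starts with
# "FROM base AS cuda". Concatenating them yields one valid multi-stage Dockerfile
# that can then be built per stage via --target:
cat base.Dockerfile cuda.Dockerfile > combined.Dockerfile
docker buildx build --file combined.Dockerfile --target cuda --tag cuda:local .
```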

moby/moby#735 has a huge discussion on the topic.

pzread commented Jul 21, 2023

LGTM, Thanks!

@beckerhe merged commit 634d609 into iree-org:main on Jul 21, 2023
7 checks passed
@beckerhe deleted the add_presubmit branch July 21, 2023 06:57
mariecwhite pushed a commit to mariecwhite/openxla-benchmark that referenced this pull request Jul 23, 2023