Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tests in experimental/regression_suite/ are using a cache incorrectly #18336

Closed
ScottTodd opened this issue Aug 23, 2024 · 9 comments · Fixed by #18344
Closed

Tests in experimental/regression_suite/ are using a cache incorrectly #18336

ScottTodd opened this issue Aug 23, 2024 · 9 comments · Fixed by #18344
Assignees
Labels
bug 🐞 Something isn't working infrastructure/benchmark Relating to benchmarking infrastructure infrastructure Relating to build systems, CI, or testing

Comments

@ScottTodd
Copy link
Member

What happened?

Based on logs at https://github.com/iree-org/iree/actions/runs/10527576602/job/29171566486?pr=18152#step:9:40, it seems that we now have multiple jobs reading and writing into a single persistent cache without watching for collisions.

Steps to reproduce your issue

  1. Start a workflow run on one PR
  2. Start a workflow run on another PR
  3. Watch them conflict as both modify the same .vmfb file paths

What component(s) does this issue relate to?

Other

Version information

Tip of tree (e.g. a0945cc), since we added multiple runners.

Additional context

Test compiles a .vmfb file into the location referenced in the IREE_TEST_FILES env var:

def test_compile_unet_pipeline_cpu(sdxl_unet_pipeline_mlir):
VmfbManager.sdxl_unet_cpu_pipeline_vmfb = iree_compile(
sdxl_unet_pipeline_mlir,
"cpu",
CPU_COMPILE_FLAGS,
)
@functools.cache
def get_artifact_root_dir() -> Path:
root_path = os.getenv("IREE_TEST_FILES", default=str(Path.cwd())) + "/artifacts"
return Path(os.path.expanduser(root_path)).resolve()

Benchmark (which could be running minutes later) assumes that the test already ran and put files in that location

artifacts_dir = os.getenv("IREE_TEST_FILES", default=Path.cwd()) + "/artifacts"
artifacts_dir = Path(os.path.expanduser(artifacts_dir)).resolve()
prompt_encoder_dir = f"{artifacts_dir}/sdxl_clip"
scheduled_unet_dir = f"{artifacts_dir}/sdxl_unet"
vae_decode_dir = f"{artifacts_dir}/sdxl_vae"
"iree-benchmark-module",
f"--device=hip",
"--device_allocator=caching",
f"--module={prompt_encoder_dir}/model.rocm_{rocm_chip}.vmfb",
f"--parameters=model={prompt_encoder_dir}/real_weights.irpa",

These tests/benchmarks need to be refactored to be hermetic. They can read from a shared cache (checking that the files match expected hashes) and could write into a shared cache using carefully constructed keys/namespaces/hashes/subdirectories. What is usually safer is for tests to write into a build/test directory, not a cache.

Having the benchmarks reuse outputs from tests is also sketchy. I suspect those tests/scripts aren't easily runnable by hand if they require some specific sequencing and environment variables, and that is what we should optimize for first.

@ScottTodd ScottTodd added bug 🐞 Something isn't working infrastructure Relating to build systems, CI, or testing infrastructure/benchmark Relating to benchmarking infrastructure labels Aug 23, 2024
@ScottTodd
Copy link
Member Author

Reading through the code more, I have a fix in mind... but it's going to be tricky to test.

We can keep using the ArtifactGroup class, but change the directories used to point at two locations: the persistent cache for FetchedArtifact and a local workspace for ProducedArtifact. The locations of those can come in through environment variables.

Bugs are still possible if we change the remote file contents though. Using a cache implementation like huggingface's (git LFS) would be better - that downloads versions from git hashes and then creates symlinks into the refs.

@ScottTodd
Copy link
Member Author

Alternate approach (closer to huggingface): always use a workspace-relative location, but create symlinks from the cache into that directory before the tests are run.

@benvanik
Copy link
Collaborator

that would be neat - especially if the cache paths are uniqued per hash such that you can be updating the cache with live symlinks to older versions concurrently (similar to what bazel does when linking from the sandbox to its storage)

@ScottTodd
Copy link
Member Author

See what huggingface does here: https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache

<CACHE_DIR>
├─ datasets--glue
│  ├─ refs
│  ├─ blobs
│  ├─ snapshots
...

They have tools for choosing where you want files to appear: https://huggingface.co/docs/huggingface_hub/en/guides/download#download-files-to-a-local-folder . The recommended options use symlinks with all the details about refs and blobs hidden from the user.

@ScottTodd
Copy link
Member Author

ScottTodd commented Aug 23, 2024

Sketches of what a fix could look like here: https://github.com/iree-org/iree/compare/main...ScottTodd:infra-regression-suite-fix?expand=1. This is messy though - needs some deeper thinking. I also haven't tried running these tests locally - setup steps are complicated x_x.

We've been bouncing these tests back and forth between this repo and https://github.com/nod-ai/SHARK-TestSuite/ (https://github.com/iree-org/iree-test-suites could also be an option, if the Azure paths were replaced with something easier to modify for contributors). The repository-specific things are compile flags / spec files and expected dispatch counts / benchmark metrics.

@saienduri
Copy link
Collaborator

We can also just rework this a little bit: https://github.com/iree-org/iree/blob/main/experimental/regression_suite/ireers_tools/fixtures.py. Instead of making it output a produced artifact, it can just save the vmfb in the local dir. We basically don't use ProducedArtifact at all. I can get a fix out for that. Should be pretty easy and clean.

@ScottTodd
Copy link
Member Author

Okay yeah, I think that sounds good.

@saienduri
Copy link
Collaborator

We probably do need a better longterm solution for the mlirs, weights changing in the persistent cache (huggingface seems good), but we can be strategic about when we update the mlirs and weights in Azure for now (also doesn't happen often)

@ScottTodd
Copy link
Member Author

Here's another failure: https://github.com/iree-org/iree/actions/runs/10530365837/job/29180735869#step:6:180

process = <Popen: returncode: 1 args: ['iree-run-module', '--device=hip', '--module=/h...>
stdout = b''
stderr = b"iree/runtime/src/iree/vm/bytecode/archive.c:106: INVALID_ARGUMENT; FlatBuffer data is not present or less than 16 by...iree_tests_cache/artifacts/sdxl_unet/model.rocm_gfx942.vmfb'; loading modules and dependencies; creating run context\n"

Maybe another run started compiling to the same path, overwriting/deleting the file that this run tried to use.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐞 Something isn't working infrastructure/benchmark Relating to benchmarking infrastructure infrastructure Relating to build systems, CI, or testing
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants