Tests in experimental/regression_suite/ are using a cache incorrectly
#18336
Comments
Reading through the code more, I have a fix in mind... but it's going to be tricky to test. We can keep using the cache, but bugs are still possible if we change the remote file contents. Using a cache implementation like huggingface's (git LFS) would be better - that downloads versions keyed by git hashes and then creates symlinks into the refs.
Alternate approach (closer to huggingface): always use a workspace-relative location, but create symlinks from the cache into that directory before the tests are run.
That would be neat - especially if the cache paths are uniqued per hash, such that you can be updating the cache with live symlinks to older versions concurrently (similar to what bazel does when linking from the sandbox to its storage).
See what huggingface does here: https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache
They have tools for choosing where you want files to appear: https://huggingface.co/docs/huggingface_hub/en/guides/download#download-files-to-a-local-folder . The recommended options use symlinks, with all the details about refs and blobs hidden from the user.
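A minimal sketch of that blob + symlink idea, loosely modeled on the huggingface_hub cache layout linked above. The cache root, helper names, and directory layout here are made up for illustration; they are not IREE or huggingface_hub APIs.

```python
import hashlib
import os
from pathlib import Path

# Hypothetical cache root; the real suite currently uses IREE_TEST_FILES / Azure paths.
CACHE_ROOT = Path.home() / ".cache" / "regression_suite"


def store_blob(data: bytes) -> Path:
    """Store bytes under a blob keyed by SHA-256, reusing it if already present."""
    digest = hashlib.sha256(data).hexdigest()
    blob_path = CACHE_ROOT / "blobs" / digest
    if not blob_path.exists():
        blob_path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a per-process temp name and rename atomically so concurrent
        # jobs never observe a partially written blob.
        tmp_path = blob_path.with_name(blob_path.name + f".tmp.{os.getpid()}")
        tmp_path.write_bytes(data)
        os.replace(tmp_path, blob_path)
    return blob_path


def link_into_workspace(blob_path: Path, workspace_dir: Path, name: str) -> Path:
    """Expose a cached blob at a stable workspace-relative path via a symlink."""
    target = workspace_dir / name
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.is_symlink() or target.exists():
        target.unlink()
    target.symlink_to(blob_path)
    return target
```

Because blobs are content-addressed, updating the cache never invalidates symlinks that point at older versions, which is the concurrency property mentioned above.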
Sketches of what a fix could look like here: https://github.com/iree-org/iree/compare/main...ScottTodd:infra-regression-suite-fix?expand=1. This is messy though - needs some deeper thinking. I also haven't tried running these tests locally - setup steps are complicated x_x. We've been bouncing these tests back and forth between this repo and https://github.com/nod-ai/SHARK-TestSuite/ (https://github.com/iree-org/iree-test-suites could also be an option, if the Azure paths were replaced with something easier to modify for contributors). The repository-specific things are compile flags / spec files and expected dispatch counts / benchmark metrics.
We can also just rework this a little bit: https://github.com/iree-org/iree/blob/main/experimental/regression_suite/ireers_tools/fixtures.py. Instead of making it output a produced artifact, it can just save the vmfb in the local dir. We basically don't use ProducedArtifact at all. I can get a fix out for that. Should be pretty easy and clean.
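A hedged sketch of that simpler direction: a pytest fixture that compiles into the test's own temporary directory instead of the shared cache. The `mlir_source` fixture, the target backend, and the bare `iree-compile` invocation are placeholders; the real flags and plumbing live in ireers_tools/fixtures.py and the individual tests.

```python
import subprocess

import pytest


@pytest.fixture
def compiled_vmfb(tmp_path, mlir_source):
    """Compile into a per-test temporary directory, never into the shared cache."""
    out_path = tmp_path / "module.vmfb"
    subprocess.run(
        [
            "iree-compile",
            str(mlir_source),
            # Target chosen only as an example; the suite's real flags differ.
            "--iree-hal-target-backends=llvm-cpu",
            "-o",
            str(out_path),
        ],
        check=True,
    )
    return out_path
```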
Okay yeah, I think that sounds good.
We probably do need a better long-term solution for the mlirs and weights changing in the persistent cache (huggingface seems good), but we can be strategic about when we update the mlirs and weights in Azure for now (it also doesn't happen often).
Here's another failure: https://github.com/iree-org/iree/actions/runs/10530365837/job/29180735869#step:6:180
Maybe another run started compiling to the same path, overwriting/deleting the file that this run tried to use. |
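One hedged sketch of how that collision could be avoided while still sharing a cache: derive the output path from a hash of the inputs, so two runs only write to the same location when they would produce the same artifact anyway. The function and layout are illustrative, not existing suite code.

```python
import hashlib
from pathlib import Path


def unique_output_path(cache_root: Path, source: Path, flags: list[str]) -> Path:
    """Derive a collision-free output location from the inputs that produce it."""
    key = hashlib.sha256(source.read_bytes() + " ".join(flags).encode()).hexdigest()
    # Two jobs only share this path if they compile the same source with the same flags.
    out_dir = cache_root / key[:16]
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / "module.vmfb"
```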
What happened?
Based on logs at https://github.com/iree-org/iree/actions/runs/10527576602/job/29171566486?pr=18152#step:9:40, it seems that we now have multiple jobs reading and writing into a single persistent cache without watching for collisions.
Steps to reproduce your issue
What component(s) does this issue relate to?
Other
Version information
Tip of tree (e.g. a0945cc), since we added multiple runners.
Additional context
Test compiles a .vmfb file into the location referenced in the IREE_TEST_FILES env var:

iree/experimental/regression_suite/shark-test-suite-models/sdxl/test_unet.py (lines 123 to 128 in a0945cc)
iree/experimental/regression_suite/ireers_tools/artifacts.py (lines 40 to 43 in a0945cc)
Benchmark (which could be running minutes later) assumes that the test already ran and put files in that location:

iree/experimental/benchmarks/sdxl/benchmark_sdxl_rocm.py (lines 18 to 22 in a0945cc)
iree/experimental/benchmarks/sdxl/benchmark_sdxl_rocm.py (lines 69 to 73 in a0945cc)
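To illustrate the coupling (this is not the exact code in those files, and the path layout is assumed): the benchmark rebuilds a path from the same env var and simply trusts that the earlier test left a vmfb there.

```python
import os
from pathlib import Path

# Assumed layout for illustration only; the real paths are built in the
# test and benchmark scripts referenced above.
vmfb_path = Path(os.environ["IREE_TEST_FILES"]) / "artifacts" / "sdxl_unet" / "model.vmfb"
assert vmfb_path.exists(), "the benchmark depends on the earlier test having run"
```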
These tests/benchmarks need to be refactored to be hermetic. They can read from a shared cache (checking that the files match expected hashes) and could write into a shared cache using carefully constructed keys/namespaces/hashes/subdirectories. What is usually safer is for tests to write into a build/test directory, not a cache.
Having the benchmarks reuse outputs from tests is also sketchy. I suspect those tests/scripts aren't easily runnable by hand if they require some specific sequencing and environment variables, and making them easy to run by hand is what we should optimize for first.
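A hedged sketch of the "read from a shared cache, verify the hash, write locally" pattern described above. Expected hashes would be checked into the repo next to the tests; the helper name here is hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def fetch_verified(cache_path: Path, expected_sha256: str, test_dir: Path) -> Path:
    """Copy a shared-cache input into the test's own directory after checking its hash."""
    actual = hashlib.sha256(cache_path.read_bytes()).hexdigest()
    if actual != expected_sha256:
        raise RuntimeError(
            f"Cache entry {cache_path} is stale or corrupt: "
            f"expected sha256 {expected_sha256}, got {actual}"
        )
    test_dir.mkdir(parents=True, exist_ok=True)
    local_copy = test_dir / cache_path.name
    shutil.copy2(cache_path, local_copy)
    return local_copy
```

The test then reads only its own copy, so a concurrent cache update can never change or delete a file mid-run.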