Use HSC1k dataset for DataCache ASV benchmark#674

Merged
mtauraso merged 19 commits into main from copilot/update-data-cache-benchmarks
Feb 4, 2026

Conversation

Contributor

Copilot AI commented Feb 4, 2026

Change Description

DataCache benchmarks on main needed a working dataset download path; this switches the benchmark to the HSC1k sample so CI can measure realistic cache preload timing and memory.

Solution Description

  • Benchmark dataset: configure DataCache benchmark to use the HSC1k sample data directory and HSCDataSet.
  • Download gating: only fetch/extract the archive when the extracted folder is missing.
  • Archive constants: centralize the HSC1k URL, hash, and expected extraction directory name.
import pooch  # handles the download, hash verification, and extraction

# Fetch and extract the HSC1k archive only when the extracted
# directory is not already present on disk.
hsc_data_dir = data_dir / HSC1K_EXTRACTED_DIRNAME
if not hsc_data_dir.exists():
    pooch.retrieve(
        url=HSC1K_ARCHIVE_URL,          # archive location
        known_hash=HSC1K_ARCHIVE_HASH,  # verifies download integrity
        fname="hsc_demo_data.zip",      # local name for the downloaded file
        path=data_dir,                  # download destination
        processor=pooch.Unzip(),        # extract the archive after download
    )

Code Quality

  • I have read the Contribution Guide and agree to the Code of Conduct
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation
Original prompt

This section details the original issue you should resolve

<issue_title>Ensure DataCache benchmarks are running on main</issue_title>
<issue_description>Benchmarks were added here: #593

They didn't work on the PR, but there is some indication that asv maybe just needs to run a bit on main and we will get a graph for these, since the failures are in git commands.

This bug is to track ensuring the benchmark actually works.</issue_description>

Comments on the Issue (you are @copilot in this section)

@mtauraso @copilot

You are going to change the data cache asv benchmarks (in benchmarks/data_cache_benchmarks.py) to use the hsc1k dataset instead of the CIFAR dataset. The goal is to check in a working version so we can see how long it takes in CI on a typical github action runner.

Please review tests/hyrax/test_e2e.py for the section where we use pooch to download the hsc1k dataset before beginning your implementation.

AVOID RUNNING THE LINTER OR TESTS THEY DO NOT WORK IN YOUR ENVIRONMENT. Just run asv in --quick mode targeting only the data cache benchmarks if you want to see if it's working. THIS IS A PROOF OF CONCEPT AND WE ARE NOT MAKING IT CLEAN RIGHT NOW.



Copilot AI and others added 17 commits February 4, 2026 19:05
Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>
Copilot AI changed the title [WIP] Ensure DataCache benchmarks are running on main Use HSC1k dataset for DataCache ASV benchmark Feb 4, 2026
Copilot AI requested a review from mtauraso February 4, 2026 19:26
@codecov

codecov Bot commented Feb 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 62.95%. Comparing base (f90813e) to head (54a5407).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #674   +/-   ##
=======================================
  Coverage   62.95%   62.95%           
=======================================
  Files          58       58           
  Lines        5782     5782           
=======================================
  Hits         3640     3640           
  Misses       2142     2142           

☔ View full report in Codecov by Sentry.

@mtauraso mtauraso marked this pull request as ready for review February 4, 2026 22:44
Copilot AI review requested due to automatic review settings February 4, 2026 22:44
Contributor

Copilot AI left a comment


Pull request overview

This PR updates the DataCache ASV benchmarks to use the HSC1k sample dataset instead of CIFAR, enabling realistic cache performance measurements in CI. The change addresses an issue where the original benchmarks weren't working properly on main.

Changes:

  • Switch benchmark dataset from HyraxCifarDataset to HSCDataSet
  • Add pooch download logic with HSC1k archive constants (URL, hash, directory name)
  • Update configuration to use HSC1k data directory and appropriate fields (image only, no label field)
  • Rename benchmark methods from *_cifar to *_hsc1k
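ASV discovers benchmarks by naming convention, so a rename like `*_cifar` to `*_hsc1k` is just renaming `time_*` methods on a plain benchmark class. A minimal illustrative skeleton follows; the class name, method name, and the list standing in for the dataset are assumptions for demonstration, not the PR's actual code.

```python
class DataCachePreloadBenchmarks:
    """Skeleton ASV benchmark class; asv times the body of each time_* method."""

    def setup(self):
        # The real benchmark points the Hyrax config at the HSC1k data
        # directory and constructs an HSCDataSet; a plain list stands in here.
        self.cache = list(range(1_000))

    def time_preload_hsc1k(self):
        # Renamed from a hypothetical time_preload_cifar; asv reports the
        # wall-clock time of this body, with setup() excluded.
        sum(self.cache)
```

Because `setup()` runs outside the timed region, the download-gating work can live there without polluting the cache-preload measurement.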

@github-actions

github-actions Bot commented Feb 4, 2026

Before [f90813e] After [e2e23ed] Ratio Benchmark (Parameter)
3.53±0.2s 3.87±0.3s 1.1 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(256, 'qdrant')
1.94±0.04s 2.12±0.04s 1.09 benchmarks.time_save_to_database_help
1.93±0.07s 2.06±0.05s 1.07 benchmarks.time_visualize_help
1.91±0.02s 2.01±0.04s 1.05 benchmarks.time_rebuild_manifest_help
1.92±0.07s 2.03±0.03s 1.05 benchmarks.time_umap_help
424±5ms 446±7ms 1.05 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(64, 'chromadb')
2.00±0.04s 2.08±0.03s 1.04 benchmarks.time_train_help
9.92±0.04s 10.2±0.06s 1.03 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(16384, 'chromadb')
7.70±0.03s 7.96±0.05s 1.03 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(2048, 'qdrant')
9.26±0.3ms 9.58±0.1ms 1.03 vector_db_benchmarks.VectorDBSearchBenchmarks.time_search_by_vector_many_shards(128, 'chromadb')

Click here to view all benchmarks.

@mtauraso mtauraso force-pushed the copilot/update-data-cache-benchmarks branch from 1901c10 to 54a5407 Compare February 4, 2026 23:34
@mtauraso mtauraso enabled auto-merge (squash) February 4, 2026 23:36
Collaborator

@drewoldag drewoldag left a comment


Looks good, only one small comment.

)

self.h.config["general"]["results_dir"] = str(data_dir)
self.h.config["general"]["data_dir"] = str(hsc_data_dir)
Collaborator


Minor point - if ['general']['data_dir'] isn't otherwise needed, we could step back from using it.

@mtauraso mtauraso merged commit 808dbfa into main Feb 4, 2026
21 checks passed
@mtauraso mtauraso deleted the copilot/update-data-cache-benchmarks branch February 4, 2026 23:57
@drewoldag
Collaborator

General question, is the benefit of using HSC1k just to provide a realistic dataset for benchmarking? Are there additional reasons?

@mtauraso
Collaborator

mtauraso commented Feb 5, 2026

AFAICT, this succeeds simply because the dataset is smaller. I was looking into several other reasons why the old version wasn't resulting in numbers, none of which ultimately led anywhere. The idea to switch CIFAR -> HSC1k was just to make the problem smaller in case the underlying issue was somehow resources.


Development

Successfully merging this pull request may close these issues.

Ensure DataCache benchmarks are running on main

4 participants