Use HSC1k dataset for DataCache ASV benchmark#674

Merged
mtauraso merged 19 commits into main from copilot/update-data-cache-benchmarks
Feb 4, 2026

Conversation

Contributor

Copilot AI commented Feb 4, 2026

Change Description

DataCache benchmarks on main needed a working dataset download path; this switches the benchmark to the HSC1k sample so CI can measure realistic cache preload timing and memory.

Solution Description

  • Benchmark dataset: configure DataCache benchmark to use the HSC1k sample data directory and HSCDataSet.
  • Download gating: only fetch/extract the archive when the extracted folder is missing.
  • Archive constants: centralize the HSC1k URL, hash, and expected extraction directory name.
import pooch  # handles the download, hash verification, and extraction

# Fetch and extract the HSC1k archive only when the extracted
# directory is not already present on disk.
hsc_data_dir = data_dir / HSC1K_EXTRACTED_DIRNAME
if not hsc_data_dir.exists():
    pooch.retrieve(
        url=HSC1K_ARCHIVE_URL,          # archive location
        known_hash=HSC1K_ARCHIVE_HASH,  # verifies download integrity
        fname="hsc_demo_data.zip",      # local name for the downloaded file
        path=data_dir,                  # download destination
        processor=pooch.Unzip(),        # extract the archive after download
    )

Code Quality

  • I have read the Contribution Guide and agree to the Code of Conduct
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation
Original prompt

This section details the original issue you should resolve

<issue_title>Ensure DataCache benchmarks are running on main</issue_title>
<issue_description>Benchmarks were added here: #593

They didn't work on the PR, but there is some indication that asv maybe just needs to run a bit on main and we will get a graph for these, since the failures are in git commands.

This bug is to track ensuring the benchmark actually works.</issue_description>

Comments on the Issue (you are @copilot in this section)

@mtauraso @copilot

You are going to change the data cache asv benchmarks (in benchmarks/data_cache_benchmarks.py) to use the hsc1k dataset instead of the CIFAR dataset. The goal is to check in a working version so we can see how long it takes in CI on a typical github action runner.

Please review tests/hyrax/test_e2e.py for the section where we use pooch to download the hsc1k dataset before beginning your implementation.

AVOID RUNNING THE LINTER OR TESTS THEY DO NOT WORK IN YOUR ENVIRONMENT. Just run asv in --quick mode targeting only the data cache benchmarks if you want to see if it's working. THIS IS A PROOF OF CONCEPT AND WE ARE NOT MAKING IT CLEAN RIGHT NOW.



Copilot AI and others added 17 commits February 4, 2026 19:05
Co-authored-by: mtauraso <31012+mtauraso@users.noreply.github.com>
Copilot AI changed the title [WIP] Ensure DataCache benchmarks are running on main Use HSC1k dataset for DataCache ASV benchmark Feb 4, 2026
Copilot AI requested a review from mtauraso February 4, 2026 19:26
@codecov

codecov Bot commented Feb 4, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 62.95%. Comparing base (f90813e) to head (54a5407).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #674   +/-   ##
=======================================
  Coverage   62.95%   62.95%           
=======================================
  Files          58       58           
  Lines        5782     5782           
=======================================
  Hits         3640     3640           
  Misses       2142     2142           

☔ View full report in Codecov by Sentry.

@mtauraso mtauraso marked this pull request as ready for review February 4, 2026 22:44
Copilot AI review requested due to automatic review settings February 4, 2026 22:44
Contributor

Copilot AI left a comment


Pull request overview

This PR updates the DataCache ASV benchmarks to use the HSC1k sample dataset instead of CIFAR, enabling realistic cache performance measurements in CI. The change addresses an issue where the original benchmarks weren't working properly on main.

Changes:

  • Switch benchmark dataset from HyraxCifarDataset to HSCDataSet
  • Add pooch download logic with HSC1k archive constants (URL, hash, directory name)
  • Update configuration to use HSC1k data directory and appropriate fields (image only, no label field)
  • Rename benchmark methods from *_cifar to *_hsc1k
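ASV discovers benchmarks by naming convention, so a rename like `*_cifar` to `*_hsc1k` is just renaming `time_*` methods on a plain benchmark class. A minimal illustrative skeleton follows; the class name, method name, and the list standing in for the dataset are assumptions for demonstration, not the PR's actual code.

```python
class DataCachePreloadBenchmarks:
    """Skeleton ASV benchmark class; asv times the body of each time_* method."""

    def setup(self):
        # The real benchmark points the Hyrax config at the HSC1k data
        # directory and constructs an HSCDataSet; a plain list stands in here.
        self.cache = list(range(1_000))

    def time_preload_hsc1k(self):
        # Renamed from a hypothetical time_preload_cifar; asv reports the
        # wall-clock time of this body, with setup() excluded.
        sum(self.cache)
```

Because `setup()` runs outside the timed region, the download-gating work can live there without polluting the cache-preload measurement.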

@github-actions

github-actions Bot commented Feb 4, 2026

Before [f90813e] After [e2e23ed] Ratio Benchmark (Parameter)
3.53±0.2s 3.87±0.3s 1.1 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(256, 'qdrant')
1.94±0.04s 2.12±0.04s 1.09 benchmarks.time_save_to_database_help
1.93±0.07s 2.06±0.05s 1.07 benchmarks.time_visualize_help
1.91±0.02s 2.01±0.04s 1.05 benchmarks.time_rebuild_manifest_help
1.92±0.07s 2.03±0.03s 1.05 benchmarks.time_umap_help
424±5ms 446±7ms 1.05 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(64, 'chromadb')
2.00±0.04s 2.08±0.03s 1.04 benchmarks.time_train_help
9.92±0.04s 10.2±0.06s 1.03 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(16384, 'chromadb')
7.70±0.03s 7.96±0.05s 1.03 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(2048, 'qdrant')
9.26±0.3ms 9.58±0.1ms 1.03 vector_db_benchmarks.VectorDBSearchBenchmarks.time_search_by_vector_many_shards(128, 'chromadb')

Click here to view all benchmarks.

@mtauraso mtauraso force-pushed the copilot/update-data-cache-benchmarks branch from 1901c10 to 54a5407 Compare February 4, 2026 23:34
@mtauraso mtauraso enabled auto-merge (squash) February 4, 2026 23:36
Collaborator

@drewoldag drewoldag left a comment


Looks good, only one small comment.

)

self.h.config["general"]["results_dir"] = str(data_dir)
self.h.config["general"]["data_dir"] = str(hsc_data_dir)
Collaborator


Minor point - if ['general']['data_dir'] isn't otherwise needed, we could step back from using it.

@mtauraso mtauraso merged commit 808dbfa into main Feb 4, 2026
21 checks passed
@mtauraso mtauraso deleted the copilot/update-data-cache-benchmarks branch February 4, 2026 23:57
@drewoldag
Collaborator

General question, is the benefit of using HSC1k just to provide a realistic dataset for benchmarking? Are there additional reasons?

@mtauraso
Collaborator

mtauraso commented Feb 5, 2026

AFAICT, this succeeds simply because the dataset is smaller. I was looking into several other reasons why the old version wasn't resulting in numbers, none of which ultimately led anywhere. The idea to switch CIFAR -> HSC1k was just to make the problem smaller in case the underlying issue was somehow resources.


Development

Successfully merging this pull request may close these issues.

Ensure DataCache benchmarks are running on main

4 participants