Moving precache threads to config #673

Merged
mtauraso merged 9 commits into main from mtauraso/precache-config
Feb 24, 2026

Conversation

@mtauraso
Collaborator

@mtauraso mtauraso commented Feb 4, 2026

No description provided.

@mtauraso mtauraso requested a review from drewoldag February 4, 2026 18:25
@mtauraso mtauraso self-assigned this Feb 4, 2026
Copilot AI review requested due to automatic review settings February 4, 2026 18:25
@codecov

codecov Bot commented Feb 4, 2026

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.09%. Comparing base (7ae9a58) to head (77d3bd1).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/hyrax/data_sets/data_cache.py 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #673      +/-   ##
==========================================
- Coverage   64.10%   64.09%   -0.02%     
==========================================
  Files          61       61              
  Lines        5903     5901       -2     
==========================================
- Hits         3784     3782       -2     
  Misses       2119     2119              


Contributor

Copilot AI left a comment


Pull request overview

This PR moves the DataCache preload thread count from a hard-coded value to a runtime configuration option.

Changes:

  • Add data_set.preload_threads (default 50) to hyrax_default_config.toml.
  • Update DataCache to use config["data_set"]["preload_threads"] for ThreadPoolExecutor(max_workers=...) and for the lazy executor’s in-flight futures limit.
  • Remove the previously hard-coded _determine_numprocs_preload() helper.
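
The pattern these changes describe can be sketched in isolation. The config dict and the `preload` worker below are illustrative stand-ins, not Hyrax's actual DataCache code; only the `data_set.preload_threads` key and the default of 50 come from the PR:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-in for the runtime config; only data_set.preload_threads
# mirrors the PR, the surrounding keys are hypothetical.
config = {"data_set": {"use_cache": True, "preload_cache": True, "preload_threads": 50}}

# The thread count now comes from config instead of a hard-coded constant.
preload_threads = config["data_set"]["preload_threads"]

def preload(item):
    # Stand-in for a cache-preload task (e.g. reading one file from storage).
    return item * 2

with ThreadPoolExecutor(max_workers=preload_threads) as executor:
    results = list(executor.map(preload, range(10)))
```

Driving `max_workers` from config means the same knob can later be tuned per filesystem without a code change, which is the point of the PR.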

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
src/hyrax/hyrax_default_config.toml Introduces the preload_threads configuration knob for cache preloading.
src/hyrax/data_sets/data_cache.py Switches preload concurrency to be driven by data_set.preload_threads instead of a hard-coded constant.

Comment thread src/hyrax/data_sets/data_cache.py Outdated
# Save config we need
self._use_cache = config["data_set"]["use_cache"]
self._preload_cache = config["data_set"]["preload_cache"]
self._preload_threads = config["data_set"]["preload_threads"]

Copilot AI Feb 4, 2026


preload_threads is read unconditionally from config["data_set"], which will raise a KeyError for configs created from older saved runs (e.g., InferenceDataSet loads original_dataset_config.toml via ConfigManager.read_runtime_config(...) without merging defaults) or any caller that passes a partial config dict. Since this value is only needed when preload_cache is enabled, either (a) default it with config["data_set"].get("preload_threads", 50) and cast/validate it as a positive int, or (b) only read/validate it inside the if self._preload_cache and self._use_cache branch.

Suggested change
self._preload_threads = config["data_set"]["preload_threads"]
with:
preload_threads = config["data_set"].get("preload_threads", 50)
try:
    preload_threads = int(preload_threads)
except (TypeError, ValueError) as exc:
    raise ValueError(
        "config['data_set']['preload_threads'] must be an integer, "
        f"got {preload_threads!r}"
    ) from exc
if preload_threads <= 0:
    raise ValueError(
        "config['data_set']['preload_threads'] must be a positive integer, "
        f"got {preload_threads!r}"
    )
self._preload_threads = preload_threads
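
Both options from the comment can be exercised against a partial config. The dict below is a hypothetical stand-in for a config loaded from an older saved run that predates the new key:

```python
# Hypothetical partial config, missing preload_threads as an older saved run would be.
partial_config = {"data_set": {"use_cache": True, "preload_cache": False}}

# Option (a): fall back to a default instead of raising KeyError.
preload_threads = int(partial_config["data_set"].get("preload_threads", 50))

# Option (b): only read the key when preloading is actually enabled,
# so a partial config never touches it.
if partial_config["data_set"]["preload_cache"] and partial_config["data_set"]["use_cache"]:
    preload_threads = int(partial_config["data_set"]["preload_threads"])
```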

Comment thread src/hyrax/data_sets/data_cache.py Outdated
Comment on lines +250 to +254
## We use self._preload threads here as both the number of workers and
## the number of in-flight futures that this lazy map executor keeps running
##
## The goal here is actually maximum filesystem performance on a high-latency filesystem
## Currently the defaults are tuned for UW's HYAK Klone filesystem, where 50 threads is
Copy link

Copilot AI Feb 4, 2026


The new block comment is internally inconsistent/incomplete: it refers to self._preload threads (missing _threads) and ends the sentence "where 50 threads is" without finishing the thought. This makes the guidance hard to follow; please correct the attribute name in the comment and complete or remove the unfinished sentence.

Suggested change
## We use self._preload threads here as both the number of workers and
## the number of in-flight futures that this lazy map executor keeps running
##
## The goal here is actually maximum filesystem performance on a high-latency filesystem
## Currently the defaults are tuned for UW's HYAK Klone filesystem, where 50 threads is
## We use self._preload_threads here as both the number of workers and
## the number of in-flight futures that this lazy map executor keeps running.
##
## The goal here is actually maximum filesystem performance on a high-latency filesystem.
## Currently the defaults are tuned for UW's HYAK Klone filesystem, where 50 threads is an
## empirically chosen near-optimal default for preloading performance on that system.

Comment thread src/hyrax/data_sets/data_cache.py Outdated
Comment thread src/hyrax/data_sets/data_cache.py Outdated
Comment thread src/hyrax/data_sets/data_cache.py Outdated
mtauraso and others added 5 commits February 4, 2026 10:31
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@github-actions

github-actions Bot commented Feb 4, 2026

Before [7ae9a58] After [8b57faf] Ratio Benchmark (Parameter)
37.6±0.3ms 38.5±0.1ms 1.02 benchmarks.time_nb_obj_construct
4.03G 4.09G 1.02 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'qdrant')
1.19G 1.21G 1.02 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(2048, 'chromadb')
11.5±0.01s 11.8±0.01s 1.02 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(2048, 'qdrant')
6.25±0.02s 6.39±0s 1.02 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(256, 'qdrant')
443±2ms 454±1ms 1.02 vector_db_benchmarks.VectorDBSearchBenchmarks.time_search_by_vector_many_shards(64, 'qdrant')
112±0.5μs 113±1μs 1.01 data_request_benchmarks.DatasetRequestBenchmarks.time_request_all_data
24.0±0.02s 24.2±0.05s 1.01 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(16384, 'chromadb')
5.62±0s 5.67±0s 1.01 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(2048, 'chromadb')
5.19±0.03s 5.23±0.02s 1.01 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(64, 'qdrant')


Collaborator

@drewoldag drewoldag left a comment


What I see here seems fine. I like that you can easily adjust the threads now. I am a little concerned that a general user reading the comment in the default config might not be able to get out of it what they need.

Comment thread src/hyrax/hyrax_default_config.toml Outdated
Comment on lines +289 to +290
# and your storage layer has high latency or is IOPS limited. Avoid writing your dataset class this way,
# but if you do then a large number of preload threads will maximize io throughput
Collaborator


I'm not sure that general users are going to understand what you mean by these last two lines. How will a user know whether their storage layer has high latency or is IOPS limited? What are they to avoid when writing their dataset class? What is a "large" number of threads?

Collaborator Author


Yeah, you're right. I'll rewrite this and try to give some specific guidance of the form "if pre-caching is slow, try..."
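
One possible shape for that rewritten guidance, purely illustrative and not the wording that was merged; only the key name, the default of 50, and the HYAK Klone tuning note come from this PR:

```toml
[data_set]
# Number of threads used to preload the data cache.
# If pre-caching is slow, try raising this value: high-latency network
# filesystems usually benefit from many concurrent reads, while local SSDs
# typically need far fewer. The default of 50 was tuned on UW's HYAK Klone
# filesystem.
preload_threads = 50
```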

@mtauraso mtauraso enabled auto-merge (squash) February 24, 2026 00:41
@mtauraso mtauraso merged commit 45bd87f into main Feb 24, 2026
6 of 7 checks passed
@mtauraso mtauraso deleted the mtauraso/precache-config branch February 24, 2026 00:47