Moving precache threads to config #673

Merged
mtauraso merged 9 commits into main from mtauraso/precache-config
Feb 24, 2026

Conversation

@mtauraso
Collaborator

@mtauraso mtauraso commented Feb 4, 2026

No description provided.

@mtauraso mtauraso requested a review from drewoldag February 4, 2026 18:25
@mtauraso mtauraso self-assigned this Feb 4, 2026
Copilot AI review requested due to automatic review settings February 4, 2026 18:25
@codecov

codecov Bot commented Feb 4, 2026

Codecov Report

❌ Patch coverage is 0% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.09%. Comparing base (7ae9a58) to head (77d3bd1).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/hyrax/data_sets/data_cache.py 0.00% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #673      +/-   ##
==========================================
- Coverage   64.10%   64.09%   -0.02%     
==========================================
  Files          61       61              
  Lines        5903     5901       -2     
==========================================
- Hits         3784     3782       -2     
  Misses       2119     2119              


Contributor

Copilot AI left a comment


Pull request overview

This PR moves the DataCache preload thread count from a hard-coded value to a runtime configuration option.

Changes:

  • Add data_set.preload_threads (default 50) to hyrax_default_config.toml.
  • Update DataCache to use config["data_set"]["preload_threads"] for ThreadPoolExecutor(max_workers=...) and for the lazy executor’s in-flight futures limit.
  • Remove the previously hard-coded _determine_numprocs_preload() helper.
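
The pattern these changes describe can be sketched in isolation. The config dict and the `preload` worker below are illustrative stand-ins, not Hyrax's actual DataCache code; only the `data_set.preload_threads` key and the default of 50 come from the PR:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative stand-in for the runtime config; only data_set.preload_threads
# mirrors the PR, the surrounding keys are hypothetical.
config = {"data_set": {"use_cache": True, "preload_cache": True, "preload_threads": 50}}

# The thread count now comes from config instead of a hard-coded constant.
preload_threads = config["data_set"]["preload_threads"]

def preload(item):
    # Stand-in for a cache-preload task (e.g. reading one file from storage).
    return item * 2

with ThreadPoolExecutor(max_workers=preload_threads) as executor:
    results = list(executor.map(preload, range(10)))
```

Driving `max_workers` from config means the same knob can later be tuned per filesystem without a code change, which is the point of the PR.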

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
src/hyrax/hyrax_default_config.toml Introduces the preload_threads configuration knob for cache preloading.
src/hyrax/data_sets/data_cache.py Switches preload concurrency to be driven by data_set.preload_threads instead of a hard-coded constant.

Comment thread src/hyrax/data_sets/data_cache.py Outdated
# Save config we need
self._use_cache = config["data_set"]["use_cache"]
self._preload_cache = config["data_set"]["preload_cache"]
self._preload_threads = config["data_set"]["preload_threads"]

Copilot AI Feb 4, 2026


preload_threads is read unconditionally from config["data_set"], which will raise a KeyError for configs created from older saved runs (e.g., InferenceDataSet loads original_dataset_config.toml via ConfigManager.read_runtime_config(...) without merging defaults) or any caller that passes a partial config dict. Since this value is only needed when preload_cache is enabled, either (a) default it with config["data_set"].get("preload_threads", 50) and cast/validate it as a positive int, or (b) only read/validate it inside the if self._preload_cache and self._use_cache branch.

Suggested change
self._preload_threads = config["data_set"]["preload_threads"]
with:
preload_threads = config["data_set"].get("preload_threads", 50)
try:
    preload_threads = int(preload_threads)
except (TypeError, ValueError) as exc:
    raise ValueError(
        "config['data_set']['preload_threads'] must be an integer, "
        f"got {preload_threads!r}"
    ) from exc
if preload_threads <= 0:
    raise ValueError(
        "config['data_set']['preload_threads'] must be a positive integer, "
        f"got {preload_threads!r}"
    )
self._preload_threads = preload_threads
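
Both options from the comment can be exercised against a partial config. The dict below is a hypothetical stand-in for a config loaded from an older saved run that predates the new key:

```python
# Hypothetical partial config, missing preload_threads as an older saved run would be.
partial_config = {"data_set": {"use_cache": True, "preload_cache": False}}

# Option (a): fall back to a default instead of raising KeyError.
preload_threads = int(partial_config["data_set"].get("preload_threads", 50))

# Option (b): only read the key when preloading is actually enabled,
# so a partial config never touches it.
if partial_config["data_set"]["preload_cache"] and partial_config["data_set"]["use_cache"]:
    preload_threads = int(partial_config["data_set"]["preload_threads"])
```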

Comment thread src/hyrax/data_sets/data_cache.py Outdated
Comment on lines +250 to +254
## We use self._preload threads here as both the number of workers and
## the number of in-flight futures that this lazy map executor keeps running
##
## The goal here is actually maximum filesystem performance on a high-latency filesystem
## Currently the defaults are tuned for UW's HYAK Klone filesystem, where 50 threads is
Copy link

Copilot AI Feb 4, 2026


The new block comment is internally inconsistent/incomplete: it refers to self._preload threads (missing _threads) and ends the sentence "where 50 threads is" without finishing the thought. This makes the guidance hard to follow; please correct the attribute name in the comment and complete or remove the unfinished sentence.

Suggested change
## We use self._preload threads here as both the number of workers and
## the number of in-flight futures that this lazy map executor keeps running
##
## The goal here is actually maximum filesystem performance on a high-latency filesystem
## Currently the defaults are tuned for UW's HYAK Klone filesystem, where 50 threads is
## We use self._preload_threads here as both the number of workers and
## the number of in-flight futures that this lazy map executor keeps running.
##
## The goal here is actually maximum filesystem performance on a high-latency filesystem.
## Currently the defaults are tuned for UW's HYAK Klone filesystem, where 50 threads is an
## empirically chosen near-optimal default for preloading performance on that system.

Comment thread src/hyrax/data_sets/data_cache.py Outdated
Comment thread src/hyrax/data_sets/data_cache.py Outdated
Comment thread src/hyrax/data_sets/data_cache.py Outdated
mtauraso and others added 5 commits February 4, 2026 10:31
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@github-actions

github-actions Bot commented Feb 4, 2026

Before [7ae9a58] After [8b57faf] Ratio Benchmark (Parameter)
37.6±0.3ms 38.5±0.1ms 1.02 benchmarks.time_nb_obj_construct
4.03G 4.09G 1.02 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'qdrant')
1.19G 1.21G 1.02 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(2048, 'chromadb')
11.5±0.01s 11.8±0.01s 1.02 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(2048, 'qdrant')
6.25±0.02s 6.39±0s 1.02 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(256, 'qdrant')
443±2ms 454±1ms 1.02 vector_db_benchmarks.VectorDBSearchBenchmarks.time_search_by_vector_many_shards(64, 'qdrant')
112±0.5μs 113±1μs 1.01 data_request_benchmarks.DatasetRequestBenchmarks.time_request_all_data
24.0±0.02s 24.2±0.05s 1.01 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(16384, 'chromadb')
5.62±0s 5.67±0s 1.01 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(2048, 'chromadb')
5.19±0.03s 5.23±0.02s 1.01 vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(64, 'qdrant')


Collaborator

@drewoldag drewoldag left a comment


What I see here seems fine. I like that you can easily adjust the threads now. I am a little concerned that a general user reading the comment in the default config might not be able to get out of it what they need.

Comment thread src/hyrax/hyrax_default_config.toml Outdated
Comment on lines +289 to +290
# and your storage layer has high latency or is IOPS limited. Avoid writing your dataset class this way,
# but if you do then a large number of preload threads will maximize io throughput
Collaborator


I'm not sure that general users are going to understand what you mean by these last two lines. How will a user know whether their storage layer has high latency or is IOPS limited? What are they to avoid when writing their dataset class? What is a "large" number of threads?

Collaborator Author


Yeah, you're right. I'll rewrite this and try to give some specific guidance of the form "if pre-caching is slow, try..."
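
One possible shape for that rewritten guidance, purely illustrative and not the wording that was merged; only the key name, the default of 50, and the HYAK Klone tuning note come from this PR:

```toml
[data_set]
# Number of threads used to preload the data cache.
# If pre-caching is slow, try raising this value: high-latency network
# filesystems usually benefit from many concurrent reads, while local SSDs
# typically need far fewer. The default of 50 was tuned on UW's HYAK Klone
# filesystem.
preload_threads = 50
```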

@mtauraso mtauraso enabled auto-merge (squash) February 24, 2026 00:41
@mtauraso mtauraso merged commit 45bd87f into main Feb 24, 2026
6 of 7 checks passed
@mtauraso mtauraso deleted the mtauraso/precache-config branch February 24, 2026 00:47