
[AUTO_MERGE] Remove if/else from dist_data_loader for collate_func#727

Merged
drewoldag merged 10 commits into main from copilot/remove-if-else-collate-func
Mar 3, 2026
Conversation

Contributor

Copilot AI commented Feb 27, 2026

dist_data_loader had a conditional block to assign the collate function: DataProvider instances used dataset.collate, everything else loaded from config. Since every map-style dataset is wrapped in DataProvider, this branch was dead code.

Change Description

  • pytorch_ignite.py: Replace the if/else block with data_loader_kwargs["collate_fn"] = dataset.collate
  • pytorch_ignite.py: Remove the now-unused load_collate_function helper function and the get_or_load_class import that was only used by it

Before:

```python
if isinstance(dataset, DataProvider):
    collation_func = dataset.collate
else:
    collation_func = load_collate_function(data_loader_kwargs)
data_loader_kwargs["collate_fn"] = collation_func
```

After:

```python
data_loader_kwargs["collate_fn"] = dataset.collate
```

Solution Description

All map-style datasets reaching dist_data_loader are wrapped in DataProvider, which already provides a collate method, so the if/else was dead code and has been removed. With it gone, load_collate_function became unreachable and was removed as well, along with its now-unused get_or_load_class import.
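For context, the pattern this relies on can be sketched as follows. This is a minimal, hypothetical stand-in, not Hyrax's actual implementation: the class and method names mirror the PR description, but the sample data and collate behavior are invented for illustration.

```python
class DataProvider:
    """Hypothetical stand-in for a dataset wrapper that always exposes `collate`."""

    def __init__(self, samples):
        self.samples = samples

    def __getitem__(self, idx):
        return self.samples[idx]

    def __len__(self):
        return len(self.samples)

    def collate(self, batch):
        # Transpose a list of (image, label) pairs into (images, labels).
        images, labels = zip(*batch)
        return list(images), list(labels)


provider = DataProvider([([1, 2], 0), ([3, 4], 1)])

# The simplified assignment from this PR: no isinstance check needed,
# because every dataset reaching dist_data_loader is a DataProvider.
data_loader_kwargs = {}
data_loader_kwargs["collate_fn"] = provider.collate

# A DataLoader would call collate_fn on each batch of samples it draws.
images, labels = data_loader_kwargs["collate_fn"]([provider[0], provider[1]])
```

Because the collate function lives on the wrapper rather than in config, every dataset gets batching behavior defined next to the data it describes.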

Code Quality

  • I have read the Contribution Guide and agree to the Code of Conduct
  • My code follows the code style of this project
  • My code builds (or compiles) cleanly without any errors or warnings
  • My code contains relevant comments and necessary documentation
Original prompt

This section details on the original issue you should resolve

<issue_title>Remove if/else from pytorch_ignite:dist_data_loader for collate_func</issue_title>
<issue_description>At the beginning of the dist_data_loader function, there's an isinstance(dataset, DataProvider)... block. At this point, every dataset that is passed to dist_data_loader should be a DataProvider instance. So this if/else block is superfluous.

The code should be:

```python
dist_loader_kwargs["collage_fn"] = dataset.collate
```
</issue_description>

<agent_instructions>Please make the smallest change possible and update any existing unit tests as needed.</agent_instructions>

## Comments on the Issue (you are @copilot in this section)

<comments>
</comments>



Co-authored-by: drewoldag <47493171+drewoldag@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Remove if/else from dist_data_loader for collate_func" to "Remove if/else from dist_data_loader for collate_func" on Feb 27, 2026
Collaborator

@drewoldag drewoldag left a comment


@copilot please address the feedback in the comments.

Comment thread src/hyrax/downloadCutout/downloadCutout.py
Comment thread src/hyrax/data_sets/random/hyrax_random_dataset.py
Comment thread tests/hyrax/conftest.py
Comment thread src/hyrax/data_sets/random/hyrax_random_dataset.py
…ftest.py

Co-authored-by: drewoldag <47493171+drewoldag@users.noreply.github.com>
@codecov

codecov Bot commented Feb 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 64.15%. Comparing base (a2b525e) to head (071b7d8).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #727   +/-   ##
=======================================
  Coverage   64.14%   64.15%           
=======================================
  Files          61       61           
  Lines        5996     5989    -7     
=======================================
- Hits         3846     3842    -4     
+ Misses       2150     2147    -3     


@drewoldag
Collaborator

@copilot check over this PR again, and review the code once more, now that we've incorporated the latest changes from main to make sure this is the only place that needs to be updated.

Co-authored-by: drewoldag <47493171+drewoldag@users.noreply.github.com>
Contributor Author

Copilot AI commented Feb 27, 2026

> @copilot check over this PR again, and review the code once more, now that we've incorporated the latest changes from main to make sure this is the only place that needs to be update...

Reviewed — the only place needing an update was pytorch_ignite.py:dist_data_loader, which is already done. I also cleaned up the now-unused load_collate_function helper and the get_or_load_class import that was only used by it (f48070d). All 313 non-slow tests pass.

@github-actions

github-actions Bot commented Feb 27, 2026

Before [a2b525e] After [3e4d0a9] Ratio Benchmark (Parameter)
failed failed n/a data_cache_benchmarks.DataCacheBenchmarks.time_preload_cache_hsc1k
failed failed n/a data_cache_benchmarks.DataCacheBenchmarks.track_cache_hsc1k_hyrax_size_undercount
failed failed n/a data_request_benchmarks.DatasetRequestBenchmarks.time_request_all_data
9.50±0.03ms 9.82±0.06ms 1.03 vector_db_benchmarks.VectorDBSearchBenchmarks.time_search_by_vector_many_shards(128, 'chromadb')
431±4ms 445±2ms 1.03 vector_db_benchmarks.VectorDBSearchBenchmarks.time_search_by_vector_many_shards(64, 'qdrant')
1.49G 1.52G 1.02 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'chromadb')
3.74G 3.74G 1.00 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(16384, 'qdrant')
1.44G 1.44G 1.00 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(2048, 'qdrant')
1.07G 1.07G 1.00 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(256, 'chromadb')
1.1G 1.1G 1.00 vector_db_benchmarks.VectorDBInsertBenchmarks.peakmem_load_vector_db(256, 'qdrant')


Collaborator

@drewoldag drewoldag left a comment


This looks pretty reasonable to me. It removes some code that was added in haste to support collate functions for the applecider incubator. And the approach that was used there has been changed to the more uniform approach of defining the collation function in the dataset class and registering it using the DataProvider instance.

@drewoldag drewoldag marked this pull request as ready for review February 27, 2026 20:29
@drewoldag drewoldag requested review from a team, Copilot and wilsonbb and removed request for a team February 27, 2026 20:29
@drewoldag drewoldag requested a review from mtauraso February 27, 2026 20:31
Contributor

Copilot AI left a comment


Pull request overview

This PR simplifies dist_data_loader’s collate function selection logic in src/hyrax/pytorch_ignite.py by removing an if/else branch and deleting the helper that dynamically loaded a collate function from config.

Changes:

  • Always assigns data_loader_kwargs["collate_fn"] = dataset.collate in dist_data_loader.
  • Removes the load_collate_function helper.
  • Removes the now-unused get_or_load_class import.

```diff
-else:
-    collation_func = load_collate_function(data_loader_kwargs)
-data_loader_kwargs["collate_fn"] = collation_func
+data_loader_kwargs["collate_fn"] = dataset.collate
```

Copilot AI Feb 27, 2026


dist_data_loader now unconditionally sets collate_fn to dataset.collate, but non-DataProvider datasets (notably iterable-style datasets) don't appear to define a collate method. This breaks the documented config option config['data_loader']['collate_fn'] (default false in hyrax_default_config.toml, and iterable_dataset_collate is referenced in data_set_registry.py) and will raise AttributeError when an iterable dataset reaches this function (the function already has an if dataset.is_iterable() branch). Consider restoring the previous fallback behavior: use dataset.collate when available (e.g., DataProvider), otherwise load/use the config-specified collate function (or None when false) so iterable datasets can still work.

Suggested change

```diff
-data_loader_kwargs["collate_fn"] = dataset.collate
+# Prefer a dataset-provided collate function when available (e.g., DataProvider),
+# otherwise fall back to the config-specified collate_fn. Treat a TOML `false`
+# value as "not set" and pass None through to the DataLoader.
+collate_from_config = data_loader_kwargs.get("collate_fn", None)
+if hasattr(dataset, "collate"):
+    data_loader_kwargs["collate_fn"] = dataset.collate
+else:
+    if collate_from_config is False:
+        data_loader_kwargs["collate_fn"] = None
```



Is this a reasonable comment for "iterable datasets"? I am not familiar enough with Hyrax to verify it myself.

But otherwise the implementation looks sane (and going through the commit history and comments was interesting to see how the copilot workflow is operating here)

Collaborator


We just merged a PR that should completely remove all traces of iterable-style datasets. So I believe the only relevant portion of this comment is perhaps updating some documentation around config['data_loader']['collate_fn'].

I'll also double check to see if it's possible to simply remove that from the config entirely.
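One wrinkle behind the config discussion above: TOML has no null value, so a config file that wants to express "no collate function" typically uses `false` as a sentinel, which any fallback path would need to map to Python's `None` before handing it to the DataLoader. A hypothetical helper sketching that mapping (`resolve_collate_fn` is illustrative, not Hyrax code):

```python
def resolve_collate_fn(config_value):
    """Map a config-supplied collate value to what a DataLoader expects.

    TOML cannot express None, so `false` in a config file is commonly
    used as a sentinel meaning "no collate function configured".
    """
    return None if config_value is False else config_value
```

With this mapping, `resolve_collate_fn(False)` yields `None` (DataLoader default batching), while a real callable passes through unchanged.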

Collaborator

@mtauraso mtauraso left a comment


We need to make 100% sure that collate_fn instructions make it into our docs and demo notebooks, because this reduces the discoverability of the feature.

@mtauraso
Collaborator

mtauraso commented Mar 3, 2026

> We need to make 100% sure that collate_fn instructions make it into our docs and demo notebooks, because this reduces the discoverability of the feature.

Created #746

@drewoldag drewoldag enabled auto-merge (squash) March 3, 2026 20:32
@drewoldag drewoldag changed the title from "Remove if/else from dist_data_loader for collate_func" to "[AUTO_MERGE] Remove if/else from dist_data_loader for collate_func" on Mar 3, 2026
@drewoldag drewoldag merged commit 84fe8b0 into main Mar 3, 2026
6 of 7 checks passed
@drewoldag drewoldag deleted the copilot/remove-if-else-collate-func branch March 3, 2026 20:37


Development

Successfully merging this pull request may close these issues.

Remove if/else from pytorch_ignite:dist_data_loader for collate_func