DM-53001: Improve ResourcePath implementation for davs:// endpoints#138

Merged

airnandez merged 21 commits into main from tickets/DM-53001 on Mar 23, 2026

Conversation

@airnandez
Contributor

Checklist

  • ran Jenkins
  • added a release note for user-visible changes to doc/changes

Partial reads are sent directly to the backend server after the first
redirection. The URL of the resource at the backend server is
recorded and reused for subsequent partial reads. When the handle is
closed, the backend server is notified so it can release the resources
allocated to serve those partial reads.
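The redirect-caching scheme described above could be sketched like this. All names here are illustrative, not the actual lsst.resources implementation; `fetch(url, start, end)` stands in for an HTTP client that follows redirects and returns the final URL together with the data:

```python
class PartialReadHandle:
    """Illustrative sketch: a read handle that records the backend URL
    learned from the first redirection, so subsequent ranged reads
    bypass the frontend server entirely."""

    def __init__(self, frontend_url, fetch):
        self._frontend_url = frontend_url
        self._fetch = fetch
        self._backend_url = None  # recorded after the first redirection

    def read_range(self, start, end):
        # Use the cached backend URL when we have one; otherwise go
        # through the frontend and remember where it redirected us.
        url = self._backend_url or self._frontend_url
        final_url, data = self._fetch(url, start, end)
        if self._backend_url is None and final_url != url:
            # The final URL after redirection points at the backend
            # server; keep it for future partial reads.
            self._backend_url = final_url
        return data
```

With this, only the first ranged GET pays the frontend round trip; every later read for the same handle goes straight to the backend.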

Reduce the number of requests when creating a directory hierarchy,
if the server allows it.
When doing partial reads, explicitly close the connection to the backend
server to let it know it can shut down. This is needed because PyArrow's
pq.read_table() does not call close() on the handle it opens.
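One way to reduce requests when creating a hierarchy is to try the deepest collection first and only create missing parents when the server reports a conflict. This is a sketch under that assumption, not the actual implementation; `mkcol(url)` is a hypothetical helper that issues a WebDAV MKCOL and returns the HTTP status code:

```python
def make_dirs(path, mkcol):
    """Illustrative sketch: create a directory hierarchy with as few
    MKCOL requests as possible. When all parents already exist, a
    single request suffices; parents are only created on demand."""
    status = mkcol(path)
    if status in (201, 405):  # 201 Created, or 405: collection exists
        return
    if status == 409:  # 409 Conflict: an intermediate collection is missing
        parent = path.rsplit("/", 1)[0]
        if parent:
            make_dirs(parent, mkcol)
        mkcol(path)
```

In the common case where the parent directories were created earlier, this issues one request instead of one per path component.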
The Butler often requests the size of a file it recently imported into
the datastore. The intention of this cache is to reduce the number
of such requests by retrieving the file size from the PUT response
when possible and caching it. Subsequent requests for the size of
that file are served from the cache if they arrive within the
validity period of the cache entry.
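A cache of this kind amounts to a small TTL map keyed by resource URL. This is a minimal sketch with illustrative names, not the actual implementation:

```python
import time

class FileSizeCache:
    """Illustrative sketch of a TTL cache for file sizes learned from
    PUT responses."""

    def __init__(self, ttl: float = 60.0):
        self._ttl = ttl
        self._entries: dict[str, tuple[int, float]] = {}

    def record(self, url: str, size: int) -> None:
        # Called after a successful PUT when the response tells us the size.
        self._entries[url] = (size, time.monotonic() + self._ttl)

    def get(self, url: str):
        entry = self._entries.get(url)
        if entry is None:
            return None
        size, expires = entry
        if time.monotonic() > expires:
            # Stale entry: drop it and let the caller fall back to a
            # request against the server (e.g. HEAD or PROPFIND).
            del self._entries[url]
            return None
        return size
```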
DM-53468 fixes an issue in PyArrow that prevented us from sending requests
directly to the backend server when reading Parquet files. Now that
the issue is fixed, we can implement this feature, which reduces the
number of GET requests going through the frontend server when reading
chunks of the same file. It is also beneficial when reading the
contents of ZIP files.
Refactor, in particular to allow specializing the method that
retrieves metadata about a file or directory according to the
known capabilities of the server (e.g. XRootD or dCache).

The main motivation is that the dCache frontend server leaves
the network connection in a state that makes it unusable by the
client to send subsequent requests. In that situation, the
client must establish a new TLS connection which may be
expensive depending on the kind of request.

For details see:

   dCache/dcache#8005
This configuration setting allows specifying one or more frontend
base URLs; the client randomly selects one to send requests to.

This can be used as a mechanism for balancing load among several
frontend servers, as well as to provide a path for gradually
migrating existing butler repos that have recorded direct
URLs in their registry database.
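Both uses can be sketched in one small helper: pick a frontend at random and graft the path of a recorded URL onto it. The function name is hypothetical and the configuration key is not shown; this only illustrates the idea:

```python
import random
from urllib.parse import urlsplit, urlunsplit

def rewrite_to_frontend(url: str, frontends: list[str]) -> str:
    """Illustrative sketch: select one of the configured frontend base
    URLs at random and rewrite `url` against it. Random selection
    spreads load; rewriting lets direct backend URLs recorded in a
    registry be routed through a frontend again."""
    if not frontends:
        raise ValueError("at least one frontend base URL is required")
    _, _, path, query, fragment = urlsplit(url)
    front = urlsplit(random.choice(frontends))
    return urlunsplit((front.scheme, front.netloc, path, query, fragment))
```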
@codecov

codecov bot commented Mar 17, 2026

Codecov Report

❌ Patch coverage is 88.11475% with 29 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.63%. Comparing base (efc6958) to head (7516530).
⚠️ Report is 22 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
python/lsst/resources/dav.py 82.35% 15 Missing and 9 partials ⚠️
...t/resources/_resourceHandles/_davResourceHandle.py 91.07% 2 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #138      +/-   ##
==========================================
- Coverage   84.92%   83.63%   -1.30%     
==========================================
  Files          36       36              
  Lines        7132     7601     +469     
  Branches      874      916      +42     
==========================================
+ Hits         6057     6357     +300     
- Misses        821      976     +155     
- Partials      254      268      +14     


Contributor

@dhirving dhirving left a comment

I'm about halfway through this -- will finish the rest in the morning. So far only nitpicks/questions/minor concerns.


Caching file sizes helps avoid sending requests to the server to
retrieve the size of recently uploaded files. This is in particular
intended to efficiently serve `Butler` requests for the size of a file it
Contributor

As far as I can tell, the Butler only does that size thing in a single place (here).

I'm assuming you added this cache because you are seeing a performance hit from extra HTTP requests when data processing workers write their outputs to storage (here).

(There are a couple other places we end up in this code path, e.g. Butler.ingest(), but I don't think they happen during data processing.)

It looks like when this happens, the Butler already has the data in memory or on a local disk, so we already know its size and could just record it directly. This would be "safer" anyway, because we would be recording the initial size before any potential corruption from network transfer.

I need to check a couple details with Tim (on vacation until Monday) but it looks like only a few lines of plumbing in daf_butler to fix this -- would you like me to tweak it there instead of maintaining all of the complexity for this cache?

Contributor Author

Yes, we observed that the Butler systematically asks for the size of a file it just uploaded. That means an additional request to the server after each PUT request. Given the number of requests issued against the datastore during processing, it is beneficial to avoid as many requests as we can.

I agree that it would be cleaner to address this at the level of the butler. If the butler already knows the size of the file it uploads, then it would be better for it to use that value. An alternative, or even complementary, solution would be for ResourcePath.write() to return the number of bytes actually written to the data store (it currently returns None). The butler could then store that value or compare it against the value it expects.

But yes, I would be happy to remove this file size caching mechanism when possible.
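The write()-returns-a-count alternative discussed in this thread could look like the following sketch. `resource` here is any hypothetical object whose `write(data)` returns the number of bytes stored; this is not the current ResourcePath API, which returns None:

```python
def write_and_check(resource, data: bytes) -> int:
    """Illustrative sketch: if write() reported the number of bytes
    actually stored, the caller could record it directly and detect
    short or corrupted writes without an extra size request."""
    n = resource.write(data)
    if n != len(data):
        raise OSError(f"short write: expected {len(data)} bytes, stored {n}")
    return n
```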

Contributor

@dhirving dhirving left a comment

Looks good to me, other than the handful of minor comments I sent yesterday.

@airnandez airnandez merged commit 9f54cac into main Mar 23, 2026
21 of 22 checks passed
@airnandez airnandez deleted the tickets/DM-53001 branch March 23, 2026 16:46