DM-53001: Improve ResourcePath implementation for davs:// endpoints (#138)
Conversation
Partial reads are sent directly to the backend server after the first redirection. The URL of the resource at the backend server is recorded and reused for future partial reads. When the handle is closed, the backend server is notified so it can release the resources allocated to serve our partial reads. Also reduce the number of requests when creating a directory hierarchy, if the server allows for it.
When doing partial reads, explicitly close the connection to the backend server to let it know it can shut down. This is needed because PyArrow's pq.read_table() does not call close() on the handle it opens.
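The partial-read strategy described above can be sketched as follows. This is a minimal illustration, not the actual lsst.resources implementation: the class name and the injected `resolve_redirect`, `read_range`, and `notify_close` callables are hypothetical stand-ins for the HTTP machinery.

```python
class PartialReadHandle:
    """Cache the backend URL after the first redirection and reuse it for
    subsequent ranged reads; notify the backend on close()."""

    def __init__(self, frontend_url, resolve_redirect, read_range, notify_close):
        self._frontend_url = frontend_url
        self._resolve_redirect = resolve_redirect  # frontend URL -> backend URL
        self._read_range = read_range              # (url, start, end) -> bytes
        self._notify_close = notify_close          # url -> None
        self._backend_url = None

    def read(self, start, end):
        if self._backend_url is None:
            # First read: follow the frontend redirection once, then record
            # the backend URL so later partial reads skip the frontend.
            self._backend_url = self._resolve_redirect(self._frontend_url)
        return self._read_range(self._backend_url, start, end)

    def close(self):
        if self._backend_url is not None:
            # Let the backend release the resources allocated for our reads
            # (e.g. by explicitly closing the connection), since callers such
            # as pyarrow.parquet.read_table() may never call close().
            self._notify_close(self._backend_url)
            self._backend_url = None
```

With this shape, only the first `read()` touches the frontend; every later range request goes straight to the recorded backend URL.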
…ng to local temporary directory
The Butler often requests the size of a file it recently imported to the datastore. The intention of this cache is to reduce the number of such requests by retrieving the file size from the PUT response when possible and caching it. Subsequent requests for the file size are served from the cache if they arrive within the validity period of the cache entry.
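The cache described above amounts to a small TTL map keyed by URL. A hedged sketch, in which the class name, the TTL value, and the injectable clock are illustrative choices rather than the actual implementation:

```python
import time


class FileSizeCache:
    """Remember file sizes captured from PUT responses for a limited
    validity period, so a follow-up size request can skip the server."""

    def __init__(self, ttl_seconds=60.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # url -> (size, expiry timestamp)

    def record(self, url, size):
        """Store a size obtained from a PUT response."""
        self._entries[url] = (size, self._clock() + self._ttl)

    def get(self, url):
        """Return the cached size, or None if absent or expired."""
        entry = self._entries.get(url)
        if entry is None:
            return None
        size, expiry = entry
        if self._clock() > expiry:
            # Entry outlived its validity period; drop it.
            del self._entries[url]
            return None
        return size
```

Using `time.monotonic` as the default clock avoids surprises from wall-clock adjustments; passing a fake clock makes expiry easy to test.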
DM-53468 fixes an issue in pyarrow that prevented us from sending requests directly to the backend server when reading parquet files. Now that the issue is fixed, we can implement this feature, which reduces the number of GET requests going through the frontend server when reading chunks of the same file. This feature is also beneficial when reading the contents of zip files.
Refactoring, in particular to allow specialization of the method that retrieves metadata about a file or directory according to the known capabilities of the server (e.g. XRootD or dCache). The main motivation is that the dCache frontend server leaves the network connection in a state that makes it unusable by the client for sending subsequent requests. In that situation, the client must establish a new TLS connection, which may be expensive depending on the kind of request. For details see dCache/dcache#8005.
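One way to express the specialization described above is to dispatch on the server's advertised identity and pick a metadata-retrieval strategy per implementation. The class names and the idea of keying on the `Server` response header are assumptions for illustration, not the actual code:

```python
class GenericMetadata:
    """Default strategy: a plain metadata request works fine."""
    name = "generic"


class XRootDMetadata(GenericMetadata):
    """Strategy tuned to XRootD's capabilities."""
    name = "xrootd"


class DCacheMetadata(GenericMetadata):
    """Strategy for dCache, which may leave the connection unusable after
    some requests (dCache/dcache#8005), so it must be prepared to establish
    a fresh TLS connection instead of reusing the old one."""
    name = "dcache"


def metadata_strategy(server_header: str) -> GenericMetadata:
    """Choose a metadata-retrieval strategy from the Server header."""
    header = server_header.lower()
    if "dcache" in header:
        return DCacheMetadata()
    if "xrootd" in header:
        return XRootDMetadata()
    return GenericMetadata()
```

The benefit of isolating this choice in one place is that server-specific workarounds stay out of the generic request path.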
This configuration setting allows for specifying one or more frontend base URLs the client can randomly select to send requests against. This can be used as a mechanism for balancing load among several frontend servers as well as to provide a path for a gradual migration of existing butler repos which have recorded direct URLs in their registry database.
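The random selection among configured frontend base URLs can be as simple as the sketch below; the function name is hypothetical and real code would also need to rewrite recorded direct URLs onto the chosen base.

```python
import random


def pick_frontend(base_urls, rng=random):
    """Randomly pick one of the configured frontend base URLs, spreading
    load across several frontend servers."""
    if not base_urls:
        raise ValueError("at least one frontend base URL must be configured")
    return rng.choice(base_urls)
```

Accepting an `rng` argument (anything with a `choice()` method, e.g. a seeded `random.Random`) keeps the selection deterministic in tests.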
Codecov Report

❌ Patch coverage is

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #138      +/-   ##
==========================================
- Coverage   84.92%   83.63%    -1.30%
==========================================
  Files          36       36
  Lines        7132     7601     +469
  Branches      874      916      +42
==========================================
+ Hits         6057     6357     +300
- Misses        821      976     +155
- Partials      254      268      +14
dhirving
left a comment
I'm about halfway through this -- will finish the rest in the morning. So far only nitpicks/questions/minor concerns.
> Caching file sizes helps preventing sending requests to the server for
> retrieving the size of recently uploaded files. This is in particular
> intended to efficiently serve `Butler` requests for the size of a file it
As far as I can tell, the Butler only does that size thing in a single place (here).
I'm assuming you added this cache because you are seeing a performance hit from extra HTTP requests when data processing workers write their outputs to storage (here).
(There are a couple other places we end up in this code path, e.g. Butler.ingest(), but I don't think they happen during data processing.)
It looks like when this happens, the Butler already has the data in memory or on a local disk, so we already know its size and could just record it directly. This would be "safer" anyway, because we would be recording the initial size before any potential corruption from network transfer.
I need to check a couple details with Tim (on vacation until Monday) but it looks like only a few lines of plumbing in daf_butler to fix this -- would you like me to tweak it there instead of maintaining all of the complexity for this cache?
Yes, we observed that the Butler systematically asks for the size of a file it just uploaded, which means an additional request to the server after each PUT request. Given the number of requests issued against the datastore during processing, it is beneficial to avoid as many requests as we can.
I agree that it would be cleaner to address this at the level of the Butler. If the Butler already knows the size of the file it uploads, it would be better for it to use that value. An alternative, or even complementary, solution would be for ResourcePath.write() to return the number of bytes actually written to the data store (it currently returns None). The Butler could then store that value or compare it against the value it expects.
But yes, I would be happy to remove this file size caching mechanism when possible.
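The alternative suggested in the reply above would look roughly like this. It is only a sketch of the proposal, not the current lsst.resources API (whose `write()` returns None), and the function name is hypothetical:

```python
import io


def write_and_verify(dest, data: bytes) -> int:
    """Write `data` to a binary file-like object and return the byte count,
    raising if fewer bytes were written than expected. Returning the count
    lets the caller record the size directly instead of asking the server."""
    written = dest.write(data)
    if written != len(data):
        raise IOError(f"short write: {written} of {len(data)} bytes")
    return written
```

A caller such as the Butler could then store the returned size in its registry, or compare it against the size it expects, without issuing a follow-up request.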
dhirving
left a comment
Looks good to me, other than the handful of minor comments I sent yesterday.
Checklist
- doc/changes