
walSegmentFind is expensive #2289

Open
rdunklau opened this issue Feb 23, 2024 · 9 comments

@rdunklau

Hello,

While performing some tests involving pgBackRest, I noticed it performs some expensive operations during WAL archiving.
For every WAL file we archive to a cloud storage provider, we call walSegmentFind, which performs a list operation (trying to find all segments that could match the one we want to archive, using a regexp on the checksum and compression extension), and only then actually store the object.

At least for GCS and S3, this list operation is expensive and accounts for a large share of the costs associated with WAL archiving.
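To illustrate why a LIST is needed at all: a minimal Python sketch (illustrative only, not pgBackRest source) of the kind of pattern the find has to match, assuming the archived name embeds the segment's 40-hex checksum plus an optional compression extension:

```python
import re

# Hypothetical sketch: the exact archived name is unknown in advance because
# it embeds the checksum and compression extension, so a presence check must
# LIST by prefix and match a pattern rather than fetch a single known key.
def wal_segment_pattern(segment):
    return re.compile(rf"^{segment}-[0-9a-f]{{40}}(\.(gz|bz2|lz4|zst))?$")

pattern = wal_segment_pattern("0000000100000001000000FF")
found = [
    name
    for name in (
        "0000000100000001000000FF-" + "a" * 40 + ".gz",  # matches
        "0000000100000001000000FE-" + "b" * 40,          # different segment
    )
    if pattern.match(name)
]
```

The extension list here is just an example of common compression suffixes, not an exhaustive enumeration.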

One way to reduce this could be to take advantage of the fact that these object stores allow metadata to be attached to an object. We could keep encoding the checksum and compression scheme in the archived WAL filename on POSIX storage (or other storage that doesn't support metadata), but rely on a unique name for WAL files stored on object storage instead of encoding the metadata in the filename.

This would allow using the API to fetch only the metadata (for example, a HEAD request on S3) to both check for the file's presence and retrieve its checksum, which is much cheaper on at least those two providers.
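A sketch of the proposed metadata-based lookup (an assumed interface, not a real pgBackRest or cloud SDK API): the segment is stored under its bare name, the checksum and compression type live in object metadata, and presence plus checksum come back from one HEAD-style call instead of a LIST:

```python
# Hypothetical scheme: in-memory stand-in for an object store with metadata.
class FakeObjectStore:
    def __init__(self):
        self.objects = {}  # key -> metadata dict

    def head(self, key):
        return self.objects.get(key)  # None when the object is absent

def wal_segment_head(storage, archive_path, segment):
    # One HEAD-style call: presence check and checksum retrieval together.
    meta = storage.head(f"{archive_path}/{segment}")
    if meta is None:
        return None  # not archived yet
    return meta["checksum"], meta.get("compress-type")

store = FakeObjectStore()
store.objects["archive/13-1/0000000100000001000000FF"] = {
    "checksum": "a" * 40,
    "compress-type": "gz",
}
result = wal_segment_head(store, "archive/13-1", "0000000100000001000000FF")
```

The `archive/13-1` path and metadata key names are made up for illustration.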

I understand this is quite a big change, but what do you think about it?

@dwsteele dwsteele self-assigned this Feb 23, 2024
@dwsteele
Member

What version are you running? This should have been addressed in 2.48 by 9d3a605.

@rdunklau
Author

I was running and reading from 2.50.

I may have misunderstood the code, but don't we end up calling walSegmentFind() for every segment we want to push? That at least matches what I've observed, which is quite a lot of LIST operations on the remote storage.

My point is that the optimization for finding one WAL segment could use something other than a list operation, namely metadata storage and retrieval, to ensure both the uniqueness of a WAL file and a canonical key to archive it under. Right now this is not possible because the checksum and compression type are stored in the filename itself.

My apologies if I misunderstood how it works.

@dwsteele
Member

I may have misunderstood the code, but don't we end up calling walSegmentFind() for every segment we want to push?

Yes, but a cache of prior results is kept, so if we are looking for a lot of segments that are already present, the number of LIST commands should be greatly reduced. But if only one segment is found on each call to walSegmentFind(), then a LIST is required each time. In this case we think the answer is to implement some sort of backoff, but we haven't done that yet.
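A tiny sketch of that caching behaviour (hypothetical names, not the actual walSegmentFind() implementation): repeat lookups are served from the last LIST result, so only a cache miss triggers another LIST:

```python
# Hypothetical cache: keep the most recent LIST result and reuse it for
# subsequent lookups; only re-LIST when the wanted segment is not cached.
class WalFindCache:
    def __init__(self, list_fn):
        self.list_fn = list_fn  # callable performing the expensive LIST
        self.cached = None
        self.list_calls = 0

    def find(self, segment):
        prefix = segment + "-"
        if self.cached is not None and any(n.startswith(prefix) for n in self.cached):
            return True  # cache hit, no LIST needed
        self.cached = self.list_fn()  # cache miss: refresh via one LIST
        self.list_calls += 1
        return any(n.startswith(prefix) for n in self.cached)

repo = ["0000000100000001000000FE-" + "b" * 40,
        "0000000100000001000000FF-" + "a" * 40 + ".gz"]
cache = WalFindCache(lambda: list(repo))
cache.find("0000000100000001000000FE")  # first call issues one LIST
cache.find("0000000100000001000000FF")  # served from the cache
```

Note the flip side described above: a lookup for a segment that is absent always falls through to a fresh LIST.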

My point is that the optimization for finding one wal segment could use something else than a list operation, namely metadata

This is a great idea (and we've thought about it), but it would break the repo format, so it needs to be saved for a new repo format. Also, ultimately, it probably won't be practical to save the WAL segment with its exact name. If we do introduce a new repo format, it will likely be lockless, so some suffixes will be required.

@rdunklau
Author

Yes, by definition, when pushing a WAL we expect it not to be archived already (otherwise it's an error, and it's good that pgBackRest detects it).

This is a great idea (and we've thought about it), but it would break the repo format, so it needs to be saved for a new repo format. Also, ultimately, it probably won't be practical to save the WAL segment with its exact name. If we do introduce a new repo format, it will likely be lockless, so some suffixes will be required.

Glad it's under consideration! I'm not sure I understand the point about the required suffixes, but as long as a simple HEAD HTTP request can be performed to make sure a WAL is present, it will greatly reduce the monetary costs associated with archiving.

Do you think an option to remove the pre-push check could be a sensible thing to add in the meantime?

@dwsteele
Member

Glad it's under consideration! I'm not sure I understand the point about the required suffixes, but as long as a simple HEAD HTTP request can be performed to make sure a WAL is present, it will greatly reduce the monetary costs associated with archiving.

Like I said, this might not be possible, depending on what we need to do to safely allow lockless archive-push. But this is all for future consideration.

Do you think an option to remove the pre-push check could be a sensible thing to add in the meantime?

Not sure what you mean by pre-push? You can certainly use archive-check=n, but I wouldn't recommend it in general.

@rdunklau
Author

Not sure what you mean by pre-push? You can certainly use archive-check=n, but I wouldn't recommend it in general.

This is not what I meant, I was talking about the check done here:

walSegmentFile = walSegmentFindOne(storageRepoIdx(repoData->repoIdx), repoData->archiveId, archiveFile, 0);

This means that in the majority of cases we perform a list command only to learn that we can proceed with the push.
To reduce costs, I suggest an option that leaves that check out. The detection of two different WAL segments in the same repo could then be performed outside the archive-push command itself, validating the whole repo in one pass. This doesn't seem so dangerous, since the checksum is part of the WAL filename; it mostly means some cleanup will have to happen if an error is detected after the fact.
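That whole-repo validation could look something like this sketch (a hypothetical helper, not an existing pgBackRest command): one LIST over the archive, flagging any segment name that appears with more than one checksum:

```python
from collections import defaultdict

# Hypothetical validation pass: given one full listing of archived names of
# the form <segment>-<checksum>[.<ext>], report segments archived with more
# than one distinct checksum (i.e. conflicting copies).
def duplicate_segments(names):
    checksums = defaultdict(set)
    for name in names:
        segment, _, rest = name.partition("-")
        checksums[segment].add(rest.split(".")[0])  # drop compression extension
    return {seg for seg, sums in checksums.items() if len(sums) > 1}

dupes = duplicate_segments([
    "0000000100000001000000FF-" + "a" * 40 + ".gz",
    "0000000100000001000000FF-" + "c" * 40,  # conflicting copy
    "0000000100000001000000FE-" + "b" * 40,
])
```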

@dwsteele
Member

This is not what I meant, I was talking about the check done here:

Ah, yes, you were talking about walSegmentFind() above so I was thinking about backup.

Then the detection of two different wal segment on the same repo can be performed outside the archive-push command itself, validating the whole repo in one go.

This is something we've thought about, but it's a lot of work, and an incompatible repo format change. To do this each segment would need to be tagged with its source (as part of the file name) so the user could figure out how to reconcile the unwanted segments.

But for now, removing this safeguard would allow massive pollution of the repo during a split-brain event, with no way to figure out what came from where. As such, I think it is a pretty bad idea to disable it until we are able to do the reconciliation.

@rdunklau
Author

Thanks for your input.

This is something we've thought about, but it's a lot of work, and an incompatible repo format change. To do this each segment would need to be tagged with its source (as part of the file name) so the user could figure out how to reconcile the unwanted segments.

Can't we do that already by using the repo1-storage-tag configuration key?

@dwsteele
Member

Can't we do that already by using the repo1-storage-tag configuration key?

Not on POSIX. Anyway, the entire point is that you could have multiple writers and they would never overwrite each other. That won't work if the filename is the same. Maybe with versioning, but that presents other issues.
