Disk Array Restructuring #3146

Closed

benjaminwinger opened this issue Mar 26, 2024 · 2 comments

Comments

benjaminwinger (Collaborator)

To support bypassing the WAL when writing new pages to disk arrays (an important performance optimization for #2938), it will be necessary to track the end of the file within the file itself. Currently, new pages are added using the file handle's page counter, so allowing writes to bypass the WAL would mean that any pages written as part of an aborted transaction are never re-used. If the page counter is instead stored in the file, then as long as updated pages (which will always include the header) are written to the WAL, we can truncate or overwrite untracked pages, since the page counter isn't modified when pages are appended directly to the end of the file.
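A minimal sketch of the idea (the names and file interface here are assumptions for illustration, not Kùzu's actual API): the collection-level page counter is only advanced through the header, which is written via the WAL, so pages appended directly to the data file stay invisible until the header commits.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <vector>

using page_idx_t = uint32_t;
static constexpr size_t PAGE_SIZE = 4096;

// Hypothetical in-memory stand-in for the data file, just to make the sketch self-contained.
struct DataFile {
    std::vector<uint8_t> bytes;
    void writePage(const uint8_t* data, page_idx_t pageIdx) {
        bytes.resize(std::max(bytes.size(), static_cast<size_t>(pageIdx + 1) * PAGE_SIZE));
        std::memcpy(bytes.data() + pageIdx * PAGE_SIZE, data, PAGE_SIZE);
    }
};

// Collection-level header; updates to it go through the WAL like any other page update.
struct CollectionHeader {
    page_idx_t numPages; // end of the tracked region of the file
};

// Append a page directly to the data file, bypassing the WAL. The page lands past the
// tracked end of the file; only the header update (which does go through the WAL) makes
// it visible. If the transaction aborts, the persisted header still records the old
// numPages, so the appended page can later be truncated or overwritten.
page_idx_t appendPageBypassingWAL(DataFile& file, CollectionHeader& header,
                                  const uint8_t* pageData) {
    auto newPageIdx = header.numPages;
    file.writePage(pageData, newPageIdx);
    header.numPages++; // committed together with the header page via the WAL
    return newPageIdx;
}
```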

I think the best way to handle this is similar to what I did in the disk overflow file (#2857): have a parent collection which stores a header tracking the number of pages.

This also provides an opportunity for some optimizations, as the parent collection can handle writing the headers instead of each disk array needing to write its own header independently. Currently, disk array headers are 48 bytes and are each stored in their own page. This is somewhat inefficient and leads to some bloat (e.g. 255 unnecessary pages in the hash index file: see #2625).
If we cut down what's stored in the disk array headers to the bare minimum, we should be able to fit 256 headers into a single page:

For each array, we need to store:

  • numElements (8 bytes)
  • firstPIPPageIdx (currently stored as an 8-byte int, but it's really a page_idx_t, which is 4 bytes)

All other properties are either derived or, in the case of the element size, known at compile time (it might be a good idea to include the element size anyway as a sanity check; it could be stored in 2 bytes, since elements cannot exceed the page size, or in one byte if we store alignedElementSizeLog2).

Global data:

  • Number of arrays (uint32_t is sufficient given that the number of arrays is constrained by the size of firstPIPPageIdx)
  • Page counter (4 bytes)
  • Page index of the next header page (4 bytes; header pages are always read, so there's no disadvantage to storing them as a linked list).

With 12 bytes of global data and 12 bytes per array, this is sufficient to store headers for (4096 - 12)/12 = 340 arrays per page, and it leaves 3 extra bytes per array header that could be used for additional data while still fitting the 256 hash index headers into a single page (see the sketch below).
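A rough sketch of what such a packed header page could look like (these names and the exact layout are my own illustration, not necessarily what was implemented):

```cpp
#include <cstdint>

using page_idx_t = uint32_t;
static constexpr uint64_t PAGE_SIZE = 4096;

// Per-array header: only what cannot be derived.
struct PackedDiskArrayHeader {
    uint64_t numElements;
    page_idx_t firstPIPPageIdx;
} __attribute__((packed)); // 12 bytes (packed attribute is GCC/Clang-specific)

// Global data shared by all arrays whose headers live in this page.
struct DiskArrayHeaderPage {
    uint32_t numArrays;
    page_idx_t pageCounter;       // end of the tracked region of the file
    page_idx_t nextHeaderPageIdx; // header pages form a linked list
    // (4096 - 12) / 12 = 340 per-array header slots fit after the global data.
    PackedDiskArrayHeader headers[(PAGE_SIZE - 12) / sizeof(PackedDiskArrayHeader)];
} __attribute__((packed));

static_assert(sizeof(PackedDiskArrayHeader) == 12, "per-array header should be 12 bytes");
static_assert(sizeof(DiskArrayHeaderPage) <= PAGE_SIZE, "header page must fit in one page");
```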

Currently, disk arrays are initialized using a header page index. That would simply change to a header index, since it no longer refers to a specific page, but the mechanism remains fundamentally the same.
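As a sketch of that initialization change (class and member names are illustrative only), the disk array would be handed the parent collection plus a slot index within its shared header pages:

```cpp
#include <cstdint>

class DiskArrayCollection; // owns the shared header pages and the page counter

class DiskArray {
public:
    // Before: construction took the page index of a dedicated header page.
    // After: the header index identifies a slot within the collection's header pages.
    DiskArray(DiskArrayCollection& collection, uint32_t headerIdx)
        : collection{collection}, headerIdx{headerIdx} {}

private:
    DiskArrayCollection& collection;
    uint32_t headerIdx;
};
```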

benjaminwinger (Collaborator, Author) commented Mar 26, 2024

On reflection, it might make sense to add WAL records noting that pages have been added to a file other than the WAL, like the page update/insert record but without actually storing or writing the page. That would still have some overhead, but if the record stores a range of pages it could be done with a single record.
The file could then be truncated if the transaction is rolled back.
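For illustration, such a record might look like the following (the record type and the truncation helper are hypothetical, not an existing Kùzu WAL record):

```cpp
#include <cstdint>
#include <functional>

using file_idx_t = uint32_t;
using page_idx_t = uint32_t;

// Hypothetical WAL record: notes that a contiguous range of pages was appended to a
// file other than the WAL, without storing or writing the page contents themselves.
struct PagesAppendedRecord {
    file_idx_t fileIdx;      // which data file grew
    page_idx_t startPageIdx; // first appended page
    page_idx_t numPages;     // length of the appended range
};

// On rollback (or recovery of an uncommitted transaction), the appended range can be
// dropped by truncating the file back to its previous end. `truncateToPages` is an
// assumed helper that shrinks the given file to the given number of pages.
void undoAppendedPages(const PagesAppendedRecord& record,
                       const std::function<void(file_idx_t, page_idx_t)>& truncateToPages) {
    truncateToPages(record.fileIdx, record.startPageIdx);
}
```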

Edit: Given that copy hasn't previously supported rollback or clean recovery, it might be prudent to restrict bypassing the WAL in the disk array to hash index bulk insertions for now (basically what I already have implemented, with a flag to enable bypassing). Work could then be done later to support rolling back and recovering copy/bulk insertions.

benjaminwinger (Collaborator, Author) commented

Fixed by #3557 (with the exception of truncating the file on rollback, but that is a missing feature more broadly).
