Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce the longer-keys feature which sets -DMDB_MAXKEYSIZE=0 #263

Merged
merged 2 commits into from
May 30, 2024

Conversation

tpunder
Copy link
Contributor

@tpunder tpunder commented May 24, 2024

This PR (which does not have a corresponding issue) adds a longer-keys feature to lmdb-master-sys and heed which sets -DMDB_MAXKEYSIZE=0 when compiling LMDB. This allows you to use keys that are larger than the default of 511 bytes long.

The feature is 100% opt-in.

I've added a conditional test to lmdb-master-sys and heed to check that keys larger than 511 can be successfully stored without returning an MDB_BAD_VALSIZE error.

Enabling this feature also allows you to use values larger than 511 in databases with the MDB_DUPSORT flag set (which was the problem I was running into).

Here is the documentation snippet from http://www.lmdb.tech/doc/group__internal.html:

#define MDB_MAXKEYSIZE ((MDB_DEVEL) ? 0 : 511)

The max size of a key we can write, or 0 for computed max.

This macro should normally be left alone or set to 0. Note that a database with big keys or dupsort data cannot be reliably modified by a liblmdb which uses a smaller max. The default is 511 for backwards compat, or 0 when MDB_DEVEL.

Other values are allowed, for backwards compat. However: A value bigger than the computed max can break if you do not know what you are doing, and liblmdb <= 0.9.10 can break when modifying a DB with keys/dupsort data bigger than its max.

Data items in an MDB_DUPSORT database are also limited to this size, since they're actually keys of a sub-DB. Keys and MDB_DUPSORT data items must fit on a node in a regular page.

Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @tpunder 👋
What you've done is very cool! I am shocked that I was not aware of this feature before. Than you for your work.

However, could you rename the feature to something more user-firendly (newbie-friendly), please? Something in the vein of read-txn-no-tls. I am thinking about longer-keys, infinite-key-length or unlimited-key-length. Wat do you think?

I would also like if you could add a description with the advantages, as you describe them in this PR, to the feature of the heed crate (dupsort and stuff), and the limitations and dowsides if there are any (size increase to store length bytes, slowdowns).

BTW @AureliaDolo and @darnuria I don't know if you are aware of this but it fixes your problem with the key-size limits of LMDB when using the DUPSORT feature 🤩

Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the changes!

heed/src/mdb/lmdb_error.rs Show resolved Hide resolved
heed/src/env.rs Outdated Show resolved Hide resolved
@tpunder
Copy link
Contributor Author

tpunder commented May 28, 2024

@Kerollmops I think longer-keys makes sense since you don't get unlimited key sizes. The actual max key length is architecture dependent (related to the page size). I've seen values of 8126 bytes on my M1 Macbook Pro and 1982 bytes on my Intel Macbook Pro.

I've updated the PR to include the rename. I've also hooked up the mdb_env_get_maxkeysize function (as Env::max_key_size()) so people have a way to check what the value is for them. I've also updated the docs a bit.

It's not clear to me that using longer-keys actually increases the overhead (e.g. length bytes) of any data stored on disk. It looks like a short is always used for the key length of a node: https://github.com/LMDB/lmdb/blob/88d0531d3ec3a592cdd63ca77647d31c568c24bc/libraries/liblmdb/mdb.c#L1124

I think the database/file format is the same regardless of the MDB_MAXKEYSIZE.

I think the only real downsize would be that the database files become potentially non-portable across architectures with different page sizes if you are actually storing long keys. For example: If I create a database on my M1 Macbook Pro (key length limit of 8126 bytes) with a key of size 2000 bytes and then try to open that file on my Intel Macbook pro (key length limit 1982) then I think I would have problems. However, if all of my keys are under 1982 bytes then I don't think there is any problem.

…ing LMDB

By default (and for backwards compatibility) LMDB only allows keys up to 511 bytes in length. If you want larger keys then you need to set "-DMDB_MAXKEYSIZE=0" at compile time.

This limit also applies to values when using the MDB_DUPSORT flag on a database.
@tpunder
Copy link
Contributor Author

tpunder commented May 28, 2024

@Kerollmops I added a note about moving databases between different architectures to the feature flag. I think I also got your proposed changes from above.

Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the investigation. Very appreciated 😇

Do you confirm that when using this feature, people will be able to store 2GiB data when opening a database with the DUPSORT feature? Or will the data length always be restricted to 8192/1982 bytes?

Will merge and do a release just after you fix the remaining issues.

@tpunder
Copy link
Contributor Author

tpunder commented May 29, 2024

@Kerollmops No, when using the DUPSORT feature the data length is restricted to 8192/1982 bytes (or whatever it is for your architecture). DUPSORT values are stored as keys of a sub-db which means they must meet any key length limits.

Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @tpunder 👋 I can merge only once the files are well formatted (check the CI). You can run a cargo +nightly fmt on the files you modified, especially heed/src/mdb/lmdb_ffi.rs 😃

…gth.

This calls the LMDB mdb_env_get_maxkeysize function.
@tpunder
Copy link
Contributor Author

tpunder commented May 30, 2024

@Kerollmops I just ran cargo +nightly fmt to fix the formatting in lmdb_ffi.rs. Hopefully all is good now 😀

@Kerollmops Kerollmops merged commit 155ff8e into meilisearch:main May 30, 2024
8 checks passed
@Kerollmops Kerollmops modified the milestones: v0.20.1, v0.20.2 May 31, 2024
@Kerollmops Kerollmops changed the title Add the "mdb_maxkeysize_0" feature which sets "-DMDB_MAXKEYSIZE=0" when compiling lmdb Introduce the longer-keys feature which sets -DMDB_MAXKEYSIZE=0 when compiling LMDB May 31, 2024
@Kerollmops Kerollmops changed the title Introduce the longer-keys feature which sets -DMDB_MAXKEYSIZE=0 when compiling LMDB Introduce the longer-keys feature which sets -DMDB_MAXKEYSIZE=0 May 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants