Skip to content
This repository has been archived by the owner on Jul 16, 2021. It is now read-only.

[Proposal] Add support for optimized binary UUID keys? #645

Closed
sauladam opened this issue Jun 18, 2017 · 32 comments
Closed

[Proposal] Add support for optimized binary UUID keys? #645

sauladam opened this issue Jun 18, 2017 · 32 comments

Comments

@sauladam
Copy link

I wanted to use UUIDs as primary keys in a project and did some research on that. In short, here's what I found out (https://mariadb.com/kb/en/mariadb/guiduuid-performance/ explains it in more detail):

The problem: Using raw UUIDs as primary keys in tables can be challenging, because 1. they are not technically sequential so INSERTing into an index means jumping around a lot, and 2. the 36 characters of a UUID result in a pretty big index footprint, especially in InnoDB where every secondary key also includes the primary key. And the Laravel default of utf8mb4-encoding makes this even worse, making the 36 character string 146bytes big.

The solution: UUID v1 is time-based, so some parts of it are kind of sequential. If we rearrange the UUID and put those parts in the front we can allow for better indexing. And if we strip out the dashes and encode it to binary, we end up with 16bytes instead of 146.

I thought it would be awesome if Laravel / Eloquent offered this functionality out of the box. So I started hacking on it and realized that it's quite a heavy change. I want to query the model with the original UUID, but I want it to to be optimized and binary-encoded automatically before going into the DB and then decoded and rearranged back to the original UUID when coming back from the DB as a model. So far, this works great for $model->find() but it gets trickier when it comes to relationships. But I already got it working for belongsTo() and eager loads.

So, my question especially towards @taylorotwell is: Is this something you would consider for Laravel and should I keep working on it in a way that makes a good PR or is there maybe a reason why it's not supported in Laravel?

@sauladam sauladam changed the title Add support for optimized binary UUID keys? [Proposal] Add support for optimized binary UUID keys? Jun 18, 2017
@michaeldyrynda
Copy link

I did a lot of the legwork for this in a package to handle UUIDs in Eloquent, as well as tackling the efficient storage of them not too long ago.

I generally find it easier to use the UUID in combination with an auto-incrementing integer as the primary key - it is my understanding that the combination is much more efficient for foreign keys anyway.

As far as adding them to the core, it doesn't seem like that level of complexity fits with the philosophy of the framework, being more of an extension but I'd be happy to see them included in some way. One less external dependency to pull in, for sure 👍

@sisve
Copy link

sisve commented Jun 18, 2017

Your second point about the 36 characters required; that is a workaround because Mysql does not support uuid types. If you want uuids, switch to something that supports them, like postgresql or SQL Server, where they are stored as 16 bytes.

The display representation vs the actual type causes all sorts of confusion when people try to allocate space for them in there systems. SQL Server calls the type UniqueIdentifier and PostgreSQL calls the type uuid. Both types occupy 16-bytes of storage. For compatibility reasons with other software or databases, many use some stanardized text representation of them particularly for transport rather than using the native type. The text representation may require anywhere from 32 to 38 bytes (characters) contingent on if you add dashes or {}. Thus the confusion in storage requirements particularly when transporting to databases that don't have an equivalent of this type.

The GUID/UUID is not a text in either PostgreSQL or SQL Server system. They are native typs in both systems and are stored using 16 bytes. PostgreSQL does although have auto casts built in for converting uuids to text where as SQL Server requires an explicit cast to varchar.

Source: http://www.postgresonline.com/journal/archives/179-Universal-Unique-Identifiers-PostgreSQL-SQL-Server-Compare.html

The current implementation of uuids-as-primary-keys requires you to disable Model::$incrementing and in the Model::creating(...) callback generate an uuid for the model. It's just a one character change from Uuid::uuid4() to Uuid::uuid1() if you're using ramsey/uuid.

So, it seems to basically boil down to using a database engine that doesn't properly support your application requirements. You want uuids, and you've picked a database engine that doesn't support them natively. Why should the framework provide you with the required (in my opinion hack-ish) workarounds?

@sisve
Copy link

sisve commented Jun 18, 2017

I would also like to point out that your 128-144 bytes argument is based on using the utf8mb4 collation. Just use another collation that fits the storage requirement, like the ascii collation, and you've reduced the key length to 32-36 bytes (depending on if you store the dashes or not).

@troelskn
Copy link

Surely, storing uuid's in binary form would be preferable, if you want to go that route. See: http://mysqlserverteam.com/storing-uuid-values-in-mysql-tables/

@sauladam
Copy link
Author

Yes @sisve , I'm aware that there are DBMS with native binary UUID support and I understand that it's possible to save the plain text UUID with a different collaction. As I said, that was just a short summary of the problem. All of your points are adressed in the link I provided.

The main point, however, is not that some DBMS lack a dedicated binary UUID data type. It's more that you have to arrange the UUID in a specific way to make it as close to sequential as possible so it can be indexed more efficiently. So you don't just throw a plain UUID at the the DB, you have to adjust it to your needs first.

And yes, there are things like stored procedures and indexed virtual columns. MySql supports them, PostgreSQL, however, doesn't. And yes, you could argue that I should "just use a database that supports my application requirements". But the thing is that actually every DBMS does support it if you massage the data into the right shape. And maybe I use MySql in production but Sqlite for my tests.

So while it's quite hard to achieve optimized binary UUID primary keys on a DB level, there are still two things to keep in mind:

  1. UUIDs can be generated on the client side (i. e. the app) and
  2. all DBMS (at least that I know of) support binary data types.

So there is a way to achieve this regardless of the DBMS. It's quite complex, but it's possible. And as I also mentioned, the real complexity is not how to enable this for one model (I know how to hook into the creating callback...) but to actually make Eloquent aware of it and use it within relationships and eager loads and automatically converting between the original UUID (for the user) and the optimized binary version (for the DB).


Thanks @michaeldyrynda - how could I miss your package while googling for "laravel uuid"? :-D

Yes, using a good old auto-incrementing key under the hood sure makes things a lot easier. And I also read that post on the Percona blog you refered to, but the statement that the combination of auto-incrementing keys and UUIDs is more efficient than UUIDs alone kind of flew over my head (it's hidden in the comments). I'm not quite sure if I really get the point though. I'll cite (what I think is) the relevant part:

In your solution the primary key would be a 16byte binary UUID, and all secondary indexes would be forced to use this meaning much larger index footprints, and I don’t know about you but most of our tables have multiple indexes. You’ve eliminated the IO overhead of inserting UUID’s and re-indexing, but you’ve increased the size of every index in the system as a side effect meaning both more physical and memory footprint.

By using the ‘ai_col’ first column and using it as the primary key, it solves both of those problems. You would then just create a second column with the true UUID’s, add a UNIQUE index on it. Your rows will have a slightly larger size, but your indexes will be MUCH smaller, resulting in less IO, and a smaller footprint overall.

While yes, I get that the index becomes larger in size for UUIDs, but how much is "MUCH smaller" when using auto-incrementing keys? If that's like 50MB instead of 100MB for 1 million records it would be "MUCH smaller" but I don't think that it would really matter. But I'm not overly familiar with the internals of a DBMS and with how indexing works and what has to be taken into account.

Other than that, how exactly does this auto-incrementing-with-uuid thing work? You still would have to reference the auto-incrementing ID as the foreign key on related table, no? If so, you maybe would finally end up in the same situation that you were trying to avoid by switching to UUID primary keys in the first place, like running out of IDs (not very likely in an 'average' project, but surely possible) or making replication / multiple databases easier to maintain.

So yeah... The question remains: Should this be part of Laravel core? I don't want it to be a Jurassic Park thing where they "were so preoccupied with whether or not they could, they didn’t stop to think if they should". However, while it sure adds some complexity, it also seems to address a quite common use case.

@sisve
Copy link

sisve commented Jun 18, 2017

If you choose to store them as binary, and manipulate the bytes stored in them, make sure they copy the functionality of MysSQL 8's UUID_TO_BIN() & BIN_TO_UUID() function so that it would be a seamless transition/upgrade to it in the future.

If swap_flag is 1, the format of the return value differs: The time-low and time-high parts (the first and third groups of hexadecimal digits, respectively) are swapped. This moves the more rapidly varying part to the right and can improve indexing efficiency if the result is stored in an indexed column.

Source: https://dev.mysql.com/doc/refman/8.0/en/miscellaneous-functions.html#function_uuid-to-bin

@sauladam
Copy link
Author

@sisve Yes, I understand the concept and I know that those functions exist in MySql.

I'm just asking if it would make sense for Laravel to include a DBMS-agnostic approach to this out of the box.

@spronkey
Copy link

I think this absolutely makes sense. All the current RDBMSs support storing binary UUIDs in some way, be it a specific UUID column a la postgres, or a binary column a la older MySQL.

@barryvdh
Copy link

@michaeldyrynda Your package just converts it to bytes, not re-arrange the order to be incrementing, or did I overlook something?

ramsey/uuid also has a way to re-arrange the uuid1 bytes to be incrementing with time using the OrderedTimeCoded, see ramsey/uuid#118

@michaeldyrynda
Copy link

Nope, just handles conversion to/from bytes and storing them as such in a (MySQL) database @barryvdh

@steve3d
Copy link

steve3d commented Nov 30, 2017

hope this will be integrated. I personally really need this for a project.

@brendt
Copy link

brendt commented Dec 1, 2017

We just created a package doing exactly what OP is looking for. It's not built into Laravel of course, but maybe it could be a start to solve @sauladam's and other's problem.

https://github.com/spatie/laravel-binary-uuid

The README goes into more detail on how to use the package with models, and also shows some benchmark results. Internally we're using Ramsey's UUID package, with the addition of the bit switching mechanism that @barryvdh added in that package.

@sisve
Copy link

sisve commented Dec 1, 2017

@brendt There's no indexes in your benchmarks except for the binary column. Isn't that a bit unfair? And what about using the ascii collation for the textual checks?
https://github.com/spatie/laravel-binary-uuid/blob/b893db65bd759ea57a4654aba3c556bf14399800/tests/Benchmark/OptimisedUuid.php#L55

@steve3d
Copy link

steve3d commented Dec 1, 2017

this still have another problem, I've already moved my test project to use optimized binary uuid as primary id key, and also I have some relations also use this uuid format.

if I don't cast the relation key to text format at all, then the relation can be found, but once I add the cast for the binary uuid, then all relation depend on this key will not be found. after a little dig , I found that laravel also cast the key value before create the relation binding parameter, that's why all this uuid cast will cause all relation to be null(not found).

And, I'm also confused about why relation key value should be casted before creating the relation parameter binding.

@steve3d
Copy link

steve3d commented Dec 1, 2017

@sisve, you can checkout this https://www.percona.com/blog/2014/12/19/store-uuid-optimized-way/

in this article, it's a real table with some other indexes exists, and also the file size, benchmark result.

@brendt
Copy link

brendt commented Dec 1, 2017

@sisve You're right, and I've changed the benchmarks accordingly. The same result can still be seen, albeit with a less extreme difference.

The blogpost @steve3d mentioned also has more thorough benchmarks, the same result can be observed. The package I linked to was actually inspired by the results in that blogpost.

@sisve
Copy link

sisve commented Dec 1, 2017

I suggest that we split this issue into three parts;

  1. Store uuids in BINARY(16).
  2. Rearrange the bytes in the uuid value
  3. Start calling these rearranged values "sequential uuids" instead of "optimized uuids"

Optimization is a broad term, and has a context. Sequential uuids are suboptimal if I am partitioning on the first part of the uuid, for example. Binary form is suboptimal if I have to keep database compatibility with something else.

Percona's blog entry is focusing on storage space; both in form of data and index space. The last graph suggest that the larger index size is hitting some buffer/cache limits and causing io requests. We're also talking millions of rows where the MERGE_THRESHOLD is relevant but not declared.

The Mysql-blog also acknowledges this:

So, in a sequence of UUIDs continuously generated by a single machine, all UUIDs have different first bytes. Inserting this sequence into an indexed column (in binary or text form) will thus modify a different index page each time, preventing in-memory caching. So it makes sense to re-arrange the UUID, making the rapidly-changing parts go last, before we store into id_bin. Again, note that this idea applies only to UUIDs of version 1.

@brendt's benchmark focuses on execution times.

No benchmark so far is focusing on concurrency. A sequential uuid (based on time) means that all your servers will be writing to the same database page/extent. Would this be a problem?

I guess my point here is to convince you that "optimized uuids" are too vague, they aren't optimized for everything, and rename them to "sequential uuids". It also closer resembles integer's auto_increment flag which has a similar behavior; an incrementally increasing value.

So, there's five scenarios we need to support

  1. Uuid columns stored as text (the current/"old" way).
  2. Uuid columns stored as BINARY(16)
  3. Models using normal/"old" uuids as primary keys.
  4. Models using sequential uuids as primary keys.
  5. A model can have a mix of these; a uuid-as-text column, a uuid-as-binary column and a sequential-uuid (as text?) primary key.

On a related note; wouldn't sequential uuids have similar performance boosts over the "old way" even in text form?

@barryvdh
Copy link

barryvdh commented Dec 1, 2017

Imho, the optimizing/re-arranging is already handled by ramsey/uuid in the form of a different Codec: https://github.com/ramsey/uuid/tree/master/src/Codec (it's called OrderedTimeCodec there). It alsy handles text <-> byte form conversions.

So the only thing required is to wrap it in a package, what https://github.com/spatie/laravel-binary-uuid does.

Is there anything that's not possible to do with that, what wouldn't be possible in the core? Cause that's the only thing we need to worry about here.

@brendt
Copy link

brendt commented Dec 1, 2017

We had to add a "hack" to be able to save real BINARY data. The uuid type saves CHAR by default, and the binary type saves BLOB. We solved it by overriding the typeUuid method in MySqlGrammar and SQLiteGrammar.

From our point of view, there's two things missing in the core:

  • Being able to create real BINARY columns without having to override Grammar classes.
  • Either adding that binary method to the Blueprint class, or provide a way to extend the Blueprint class so you can add custom column types from within an application or package. At first glance you're able to set blueprintResolver on Illuminate\Database\Schema\Builder, but you cannot re-use the same builder using that custom blueprint resolver throughout your application. A new Builder is always re-created in eg. MySqlConnection::getSchemaBuilder.

All the other things required to add sequential UUIDs could be done on project or package level.

And to answer some of @sisve 's questions. It's clear that sequential UUIDs only have a positive performance impact when working with a lot of data. The benchmarks I ran only showed a positive effect from ~300k-400k records or more. Everything below that actually makes it worse.

@paulofreitas
Copy link

paulofreitas commented Dec 3, 2017

From our point of view, there's two things missing in the core:

Being able to create real BINARY columns without having to override Grammar classes.
Either adding that binary method to the Blueprint class, [...]

I've found this "limitation" recently as well, but unfortunately Blueprint has already a binary() column type that maps to different data types for each database.

The BLOB data type was introduced in SQL:2003 and the BINARY and BINARY VARYING (or VARBINARY) data types were introduced later in SQL:2008 standard. They have subtle differences that are usually implementation-defined such as zero-byte padding, trailing bytes removal, etc. Here's a quick example: http://sqlfiddle.com/#!9/2c13f1/1

Let me talk about vendor implementations:

  • MySQL does implement all them: BINARY, VARBINARY and BLOB.
  • PostgreSQL does implement only the non-standard BYTEA data type which is similar to BLOB.
  • SQLite does implement only the BLOB data type.
  • SQL Server does implement only the BINARY and VARBINARY data types.

Usually the BINARY VARYING/VARBINARY data type is similar to BLOB.

Internally, Laravel does map binary() columns as follows:

  • MySQL: BLOB
  • PostgreSQL: BYTEA
  • SQLite: BLOB
  • SQL Server: VARBINARY

It does make a lot sense if you consider the binary() as a synonym to blob(). But as you can see it lacks the true BINARY storage-efficient data type use for both MySQL and SQL Server database drivers.

This could be improved by introducing a new blob() column type that use the existing binary() column type implementation and changing the existing binary() implementation for MySQL and SQL Server to use the BINARY data type instead.

I'm commenting this just for future reference on this subject. 👍

@spronkey
Copy link

spronkey commented Dec 3, 2017

@paulofreitas I don't think I would support a rename of binary() type to blob() as this would probably have compatibility repercussions.

However, the ability to be more specific about whether you want a fixed binary column, or a blob representation, is probably something that should exist, considering at least two of the engines support it, and the others can fallback to blobs.

For UUIDs, I think we really need a specific uuid() column specifier. I think the use case is common enough, and the underlying storage different enough across RDBMS to warrant this.

Internally, this type would map to the following:

  • UUID on Postgres
  • BINARY(16) on MySQL
  • BLOB (or char) in SQLite
  • UNIQUEIDENTIFIER in SQL Server

This is more optimal than anything binary() or blob() could give us, and it could also be used to handle any conversions or optimisations required for that particular underlying engine - such as converting to binary from display or vice versa, or rearranging the UUID bytes for better sequential values.

On a side note, database engines that don't have a dedicated UUID type in 2017 need a kick in the behind!

@sisve
Copy link

sisve commented Dec 3, 2017

I see an obvious problem in defaulting to binary for some engines; your code will no longer be portable. Where is the responsibility to cast the strings that the user has into binary values? Imagine a User::find($request->input('userId')), that would fail on mysql+sqlite if we only changed the fields to binary, while postgres and sqlsrv still worked because they have built-in conversion of strings to uuid.

Using a binary field means that we're treating the values as opaque blobs that the user may use, but not easily view or modify. This works on the database level for primary and foreign keys, but it causes issues as soon as we move out of the database world and into the web. How would these values be converted to a uuid and used in routes? How would model binding take incoming values (large readable strings) and convert them back to binary? And how would the model binding know if they are normal binary, or should have the values swapped around to become sequential?

I disagree with the binary fields for mysql+sqlite because the closest match in functionality is a char(36). Basic queries that work on real uuid types will also work on a char(36). You can select it, update it, filter on it, etc. There's also the issue with breaking existing code by changing the $table->uuid(...) method.

Now, a $table->uuidAsBinary(...) would work on all database engines, and produce binary fields on all database engines. That means that no matter which database you are using; if you're using binary uuids you have to treat them as opaque binary values that needs special case with encoding.

@brendt
Copy link

brendt commented Dec 8, 2017

@sisve

This works on the database level for primary and foreign keys, but it causes issues as soon as we move out of the database world and into the web. How would these values be converted to a uuid and used in routes?

What about something like casts? The easiest for developers would be to be able to just work with textual UUIDs, and not worry about conversions at all.

Now, a $table->uuidAsBinary(...) would work on all database engines, and produce binary fields on all database engines. That means that no matter which database you are using; if you're using binary uuids you have to treat them as opaque binary values that needs special case with encoding.

I'd also opt for a separate method and not changing the binary or uuid blueprint methods.

@sisve
Copy link

sisve commented Dec 9, 2017

Yes, casts sounds like a good idea. We would need two of them; one for the binary format, and one for the sequential format. We would also need to expose the casting logic somewhere so they can be called manually, for example when you use the DB facade that doesn't know about the configured casts rules.

@brendt
Copy link

brendt commented Dec 18, 2017

So what's next? Who eventually decides how the problem must be solved, and who is allowed to make the changes in the core?

@sisve
Copy link

sisve commented Dec 18, 2017

@brendt Are we even sure that there needs to be changes in the core? Perhaps this entire thing can be implemented as a third-party package if we just PR the necessary extension points into the core.

@brendt
Copy link

brendt commented Dec 18, 2017

@sisve As far as I'm concerned, that would be as good a solution.

From our point of view, the most simple solution is to be able to add custom Blueprint functions via eg. a ServiceProvider. Everything else can be done in the package. But from what I gather from the opinions above, some people may not agree to that solution.

What do you consider to be these extension points?

@sisve
Copy link

sisve commented Dec 18, 2017

Binary uuids values in general

  1. The Blueprint class is macroable since Laravel 5.4.29, so we got a solution to introduce uuidAsBinary().
  2. We need to extend Model::castAttribute to handle binary->text conversion.
  3. We need to extend Model::setAttribute to handle text->binary conversion.

I imagine that we need two separate casts/conversions, one for the straight binary<->text conversion, and one for the rearranged uuids that has been discussed. This has to be done at the "casting" layer since we can have models using both the "normal" and the "rearranged" uuids, so we cannot just change what Uuid::uuid1() returns.

Binary uuids as primary keys
We would also need to handle the serialization of model identifiers if we're using binary values as primary keys. This would be Model::getQueueableId and Model::newQueryForRestoration, the later is weirdly enough missing from the QueueableEntity interface.

There's also the model binding; we would need to make sure that Model::getRouteKey() does use the text value (which it seems that it does by reading the code). We would also need to extend Model::resolveRouteBinding to encode the text value (from the route/querystring) into the database format.

I believe all these things can be implemented as opt-in traits.

@steve3d
Copy link

steve3d commented Dec 19, 2017

And there is another problem need to consider, what if someone want to use the optimized uuid as foreign key? Then the core need to distinguish the convert to text value and use the binary value then get for relations

@sisve
Copy link

sisve commented Dec 19, 2017

@steve3d Could you explain that scenario some more? I don't see how that differs from the casts described in the "Binary uuids values in general".

@steve3d
Copy link

steve3d commented Dec 19, 2017

here, for example, I have a ProjectComponent model, which use binary uuid only(as primary key), and I have a ComponentStatus which also use binary uuid only( also as primary key). So a ProjectComponent has many ComponentStatus, which means in ComponentStatus I will have a binary uuid key as component_id.

with this setup, if you just return the text value of uuid in getAttribute function, then of course, there will be no relations between these two models, because the database are tring to match the text value of a uuid (component->id) with a binary value (status->component_id).

and this problem don't exists when the ComponentStatus use a text uuid with field component_id, but why I want to use the text version of uuid instead of binary version to waste my database storage?

and there is a dirty solution for this.

for getting the text value of projectComponent->id, we usually get it from __get or castXXX function of ProjectComponent, then we can add a default parameter to getAttribute function, then the getAttribute function are called from __get or castXXX function, it will return the text value of uuid. In relation key value, it don't need to change, just get the original binary uuid.

Hope you understand the problem here.

@sisve
Copy link

sisve commented Dec 19, 2017

with this setup, if you just return the text value of uuid in getAttribute function, then of course, there will be no relations between these two models, because the database are tring to match the text value of a uuid (component->id) with a binary value (status->component_id).

You lost me here. The text values aren't persisted, they are still binary values in the database. You should be able to add foreign keys as usual.

Could you show some code that would expose the problem? Are you writing custom sql queries? Associating models?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

10 participants