Be consistent in DDL for integer primary keys #517

Closed
jyutzler opened this issue Apr 27, 2020 · 22 comments · Fixed by #528

@jyutzler
Contributor

#515 currently proposes modifying the DDL for the Metadata Extension to be more consistent. I guess we should also sweep all of our DDL (Annex C) for consistency in defining primary keys. It looks like we need to make this fix in multiple places. Are we in agreement that the correct way to define an integer autoincrement primary key is like this:

id INTEGER PRIMARY KEY AUTOINCREMENT

And that everything else is superfluous?

@fjlopez

fjlopez commented Apr 28, 2020

The description of the semantics of INTEGER PRIMARY KEY AUTOINCREMENT in SQLite, for example in section "10.4 Autoincrement" of SQLite Database System Design and Implementation (2016), supports this proposal.

@jyutzler
Contributor Author

jyutzler commented May 5, 2020

After discussing this with the SWG today, it appears that we do not need the AUTOINCREMENT keyword at all (and that there is a penalty in CPU/memory/disk space/IO). https://www.sqlite.org/autoinc.html
IIUC, inserting a row into a table without specifying a key will auto-generate one for you. In summary, AUTOINCREMENT should be allowed but not required, and therefore the DDL should remain informative, not normative.
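
A minimal sketch of that behavior (the table and column names are illustrative, not from the spec):

-- without AUTOINCREMENT, SQLite still assigns a key automatically:
-- an omitted or NULL INTEGER PRIMARY KEY typically becomes max(rowid) + 1
CREATE TABLE example (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO example (name) VALUES ('a');           -- id = 1
INSERT INTO example (id, name) VALUES (NULL, 'b'); -- id = 2
SELECT id, name FROM example;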

@jyutzler jyutzler changed the title Be consistent in DDL for integer autoincrement primary keys Be consistent in DDL for integer primary keys May 5, 2020
@jyutzler
Contributor Author

jyutzler commented May 5, 2020

I confirmed that integer primary key values are generated automatically by SQLite. Our table definition SQL is technically normative (except for user-defined tables, of course), so I added a note stating that minor modifications to the SQL are permitted. https://github.com/opengeospatial/geopackage/pull/521/files#diff-e76f7b8609c8bf2c4d6c52402c6a30a5R5

@jyutzler jyutzler added this to the 1.3.0 milestone May 6, 2020
@jyutzler
Contributor Author

jyutzler commented May 6, 2020

There is still some outside discussion on this topic. The main point of concern is RTE relationships. Consider this scenario:

  1. GeoPackage is made by Party A without using AUTOINCREMENT on gpkg_tile_matrix
  2. Party B updates GeoPackage, implementing the Related Tables Extension and creating some relationships between tiles and other data
  3. Party C goes out into the field and performs some tile updates that include removal of the row with the highest ID and addition of some new tiles (maybe they moved laterally)

Since ROWIDs will be reused in this circumstance, you will now have RTE rows pointing to the wrong thing.
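
A sketch of the reuse, assuming a hypothetical tiles table without AUTOINCREMENT:

CREATE TABLE tiles (id INTEGER PRIMARY KEY, tile_data TEXT);
INSERT INTO tiles (tile_data) VALUES ('t1'), ('t2'), ('t3'); -- ids 1, 2, 3
DELETE FROM tiles WHERE id = 3;              -- Party C removes the highest id
INSERT INTO tiles (tile_data) VALUES ('t4'); -- the new row reuses id 3
-- an RTE mapping row that still references id 3 now silently points at 't4'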

What do you think?

@bosborn
Contributor

bosborn commented May 6, 2020

I agree that is an issue not only with RTE, but with any potential extension that someone or some software is not aware of. The user table ids are basically the only guaranteed row-identifying information, used throughout the spec and extensions. Related tables do not force foreign key constraints. Even if they did, SQLite by default does not enforce them unless they are manually enabled.

Intelligent database design prevents the reuse of unique generated ids so that a record id is guaranteed to only ever represent that record. Any new record should be uniquely identifiable without reusing an id. Other than performance, most arguments against using AUTOINCREMENT in a database encourage the use of serial ids instead. But by omitting AUTOINCREMENT, GeoPackage is instead encouraging non-serial auto-generated ids. Both autoincremented and generated serial ids maintain database integrity, but auto-generated max+1 ids do not.

All implementations should be able to function on the base GeoPackage without risking data integrity. There are so many ways to corrupt data as it is; we should do our best to prevent it when possible. We need to think more broadly about GeoPackages shared between different types of users, software, and usages.

GeoPackages that comply with the requirements in the standard and do not implement vendor-specific extensions are interoperable across all enterprise and personal computing environments.

I see this as a potential issue for data integrity, especially for shared GeoPackages. There is also the case of external id tracking for export scenarios. A record is only externally uniquely identifiable and traceable within a versioned GeoPackage by its table and id.

The estimated 10-12% performance gain from omitting AUTOINCREMENT may not offset the potential downsides for database integrity and traceability.

Point number 4 should be enough to justify the inclusion: https://www.sqlite.org/autoinc.html

  4. If the AUTOINCREMENT keyword appears after INTEGER PRIMARY KEY, that changes the automatic ROWID assignment algorithm to prevent the reuse of ROWIDs over the lifetime of the database. In other words, the purpose of AUTOINCREMENT is to prevent the reuse of ROWIDs from previously deleted rows.

I would at least specify AUTOINCREMENT in the spec and strongly encourage it on generated ids. By default, implementations will still function on tables that omit it.
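
For contrast with the reuse sketch above, the behavior with AUTOINCREMENT (again a hypothetical table):

CREATE TABLE tiles_ai (id INTEGER PRIMARY KEY AUTOINCREMENT, tile_data TEXT);
INSERT INTO tiles_ai (tile_data) VALUES ('t1'), ('t2'), ('t3'); -- ids 1, 2, 3
DELETE FROM tiles_ai WHERE id = 3;              -- remove the highest id
INSERT INTO tiles_ai (tile_data) VALUES ('t4'); -- id = 4; id 3 is never reused
-- SQLite tracks the high-water mark in the internal sqlite_sequence table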

@jyutzler
Contributor Author

jyutzler commented May 6, 2020

One possibility is reversing the guidance - putting AUTOINCREMENT in the table definition SQL but providing a note that it MAY be eliminated for performance reasons. We're just hitting performance bottlenecks in some scenarios and even 10% would move the needle.

@bosborn
Contributor

bosborn commented May 6, 2020

Are you doing batch inserts? Committing in chunks within a transaction instead of committing each row?

@jyutzler
Contributor Author

jyutzler commented May 6, 2020

I can't speak for others but I have been. Batches of 1000 seem to work best.
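
In sqlite3 shell terms, a chunked load looks roughly like this (a sketch with a hypothetical two-column table):

CREATE TABLE t (id INTEGER PRIMARY KEY, val INTEGER);
BEGIN TRANSACTION;
INSERT INTO t (val) VALUES (1);
INSERT INTO t (val) VALUES (2);
-- ... repeat up to the batch size (e.g. 1000 rows) ...
COMMIT; -- one commit (and one fsync) per batch instead of per row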

@jerstlouis
Member

jerstlouis commented May 13, 2020

@jyutzler I would support including AUTOINCREMENT in the reference normative SQL, but not making it a hard requirement, so that those for whom performance and storage size are paramount, and who have a clear understanding of the implications, can omit it.

@fjlopez

fjlopez commented May 14, 2020

Although I agree 100% with #517 (comment), I think that we should support this kind of proposal with an experiment on synthetic or real data. For example, we could create at least two GeoPackages with the same feature table, only one of them with AUTOINCREMENT enabled; populate both with different amounts of data (100, 1,000, 10,000, and 100,000 rows); measure insertion time with different batch sizes; take some random delete/insert performance measurements; and then vacuum both and compare sizes with sqlite3_analyzer.

IMHO, when a GeoPackage design decision is conditioned by a specific behaviour of SQLite and there are concerns related to size/speed/consistency, we should have reference experiments that support the decision. I am worried about using magic numbers to make decisions without experimental support (e.g. the classic "80% of data is geographic").

@jyutzler
Contributor Author

While I agree that it would be nice to have performance testing for this case, I am not aware of anyone who is available to do this work.

@danielbarela
Contributor

While these numbers are not GeoPackage specific, here is a link to an experiment that someone ran comparing AUTOINCREMENT with omitting it.

His conclusion was that AUTOINCREMENT was 8-12% slower, but let us look at the numbers. On an HTC One M8 phone from 2014, it took ~3.5 seconds to insert 10,000 items one at a time with AUTOINCREMENT on, and ~3.1 seconds without. Inserting as one batch took ~680 milliseconds with AUTOINCREMENT and ~600 milliseconds without.

https://stackoverflow.com/questions/45384919/what-are-the-overheads-of-using-autoincrement-for-sqlite-on-android

@jerstlouis
Member

8-12% slower on a process that takes a long time is quite significant...
We are planning to experiment with very large GeoPackages with billions of records.
Also, I am even more concerned about the storage requirements (I would be curious to see some figures for that).

So I would strongly support @jyutzler's suggestion:

One possibility is reversing the guidance - putting AUTOINCREMENT in the table definition SQL but providing a note that it MAY be eliminated for performance reasons.

@danielbarela
Contributor

I would also be curious to see figures for size. Are you able to run your experiment, inserting billions of records, both with AUTOINCREMENT and without, and share the results?

@jerstlouis
Member

@danielbarela This is not our current focus, but I would be happy to when we get around to it.

@rcoup

rcoup commented May 18, 2020

8-12% slower on a process that takes a long time is quite significant...
We are planning to experiment with very large GeoPackages with billions of records.

An HTC One M8 phone from 2014 isn't what you'll be inserting billions of records into GeoPackages on, though? Performance comparisons don't scale across hardware very well: the completely different CPU architectures, CPU caches, memory, storage, OS, etc. make it virtually a random number.

@jerstlouis
Member

jerstlouis commented May 18, 2020

This S/O post sheds more light on this...

https://stackoverflow.com/questions/52700893/sqlite-best-practices-about-using-autoincrement

(It also points to that post with the 8-12% figure -- https://stackoverflow.com/questions/45384919/what-are-the-overheads-of-using-autoincrement-for-sqlite-on-android)

According to this, the only time the automatic ROWID assignment loops back to re-use IDs is when the ROWIDs have gone over 9,223,372,036,854,775,807.

At that point, the difference is that adding a record will FAIL (getting a SQLITE_FULL error).
I don't see how this can possibly be better behavior, also considering the performance, memory, and disk storage penalty.

A VACUUM will change all ROWIDs regardless of whether AUTOINCREMENT is present or not, and should adjust all references to the primary keys...

Unless I am misunderstanding this explanation, I do not believe adding AUTOINCREMENT helps with any of the scenarios that were mentioned here as justification for using it.

So if I understand this right, AUTOINCREMENT adds zero benefit, and whatever performance, memory, or disk usage it costs, whether it's 20, 10, 5, or 1%, is a complete waste.

As https://www.sqlite.org/autoinc.html says very clearly:

The AUTOINCREMENT keyword imposes extra CPU, memory, disk space, and disk I/O overhead and should be avoided if not strictly needed. It is usually not needed.

I believe this is a bit of an understatement; in practice it really is never needed.
That is, unless you really prefer having record additions fail rather than re-use an ID that was used before, and only after all non-reused entries have been exhausted... But if there are still tables pointing to that deleted record, there are already problems in the database before you add this 9,223,372,036,854,775,808th record.

EDIT: I might have misunderstood that... The post talks about max(rowid), so if you delete the row with the biggest rowid, that id might be re-used on the next add. So if a client deletes the largest-rowid row and some referencing table does not get updated, it might go from pointing to a missing row to pointing to a wrong row when a record is added later (without AUTOINCREMENT).

@bosborn
Contributor

bosborn commented May 18, 2020

A VACUUM will change all ROWIDs regardless of AUTOINCREMENT being present or not, and should adjust all reference to the primary keys...

This is only the case when an INTEGER PRIMARY KEY is not declared, which does not apply here. See #516 for a little more insight on this.
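
A sketch of the distinction (hypothetical tables):

-- no INTEGER PRIMARY KEY declared: the table has only an implicit rowid,
-- which VACUUM is free to renumber
CREATE TABLE implicit_ids (data TEXT);
-- INTEGER PRIMARY KEY declared: id is an alias for the rowid
-- and is preserved across VACUUM
CREATE TABLE explicit_ids (id INTEGER PRIMARY KEY, data TEXT);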

And yes, I would say there is a clear benefit as specified in the documentation you referenced.

If the AUTOINCREMENT keyword appears after INTEGER PRIMARY KEY, that changes the automatic ROWID assignment algorithm to prevent the reuse of ROWIDs over the lifetime of the database. In other words, the purpose of AUTOINCREMENT is to prevent the reuse of ROWIDs from previously deleted rows.

Are we really expecting people to insert over nine quintillion features into one GeoPackage table? The problems with memory, speed, and handling that kind of database should be interesting.

@bosborn
Contributor

bosborn commented May 18, 2020

18 quintillion max rows per table in theory, but with the 140 TB file limit, about 10 trillion max rows in practice.
https://www.sqlite.org/limits.html

The theoretical maximum number of rows in a table is 2^64 (18446744073709551616 or about 1.8e+19). This limit is unreachable since the maximum database size of 140 terabytes will be reached first. A 140 terabytes database can hold no more than approximately 1e+13 rows, and then only if there are no indices and if each row contains very little data.

@jyutzler
Contributor Author

Well, we're going to discuss this (again!) during tomorrow's SWG meeting. I'm not sure this is the best use of SWG resources, but the squeaky wheel gets the oil.

@bosborn
Contributor

bosborn commented May 18, 2020

I edited and ran a quick personal laptop performance test on our Java library.

  • 2 column feature table (id and geometry)
  • 50 million features (same polygon inserted each time, created in advance)
  • 1k feature transactions
  • With and without autoincrement (only 2 runs each, edited here)

Omitting AUTOINCREMENT is faster as expected, but the rate varied (1.5% to 17.15% in just a few runs).

So this is neither a large enough test case nor representative of most use cases, but it gives an idea that the performance hit can range from negligible up to about what we thought.

results.txt

Edit:
File sizes:
AUTOINCREMENT: 13,691,183,104 bytes
No AUTOINCREMENT: 13,691,179,008 bytes
Difference: 4,096 bytes (the default page size), accounted for by the added sqlite_sequence table

@fjlopez

fjlopez commented May 18, 2020

I created a simple script that creates this table:

CREATE TABLE t1( a INTEGER PRIMARY KEY, b INTEGER);

And populates it with this statement:

INSERT INTO t1(b) values (0);

I ran it in M transactions of N INSERTs each using the sqlite3 shell. I performed four runs: two without AUTOINCREMENT and two with AUTOINCREMENT. The first two runs are similar to @bosborn's experiment in size and number of transactions. Time is measured with time sqlite3 test.db < script; test.db is deleted before each run. My laptop is a MacBook Pro 15-inch mid 2014, 2.5 GHz Quad-Core Intel Core i7, 16 GB 1600 MHz DDR3. My SQLite version is 3.27.2 2019-02-25 16:06:06.

configuration                                        user     system   cpu   total
50K transactions with 1K INSERTs                     97.83s   34.07s   89%   2:27.39
50K transactions with 1K INSERTs and AUTOINCREMENT   134.18s  35.91s   91%   3:06.64
1 transaction with 50M INSERTs                       91.36s   1.62s    99%   1:33.84
1 transaction with 50M INSERTs and AUTOINCREMENT     129.40s  1.73s    98%   2:12.47

Because it removes the layers added by ORMLite, the Xerial JDBC driver, and Java JNI, this result supports @bosborn's finding and shows that there is a performance cost when bulk loading with AUTOINCREMENT. However, this experiment also shows that there are SQLite "tricks" that can be useful to address it.

Recommending the use of AUTOINCREMENT is a good idea: it ensures a more consistent GeoPackage. Users who may suffer some loss of performance due to this recommendation can improve the speed of their INSERTs by using larger transactions or tweaking PRAGMAs, as sketched below.
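
For example (a sketch; the settings and values are illustrative for a one-off bulk load, not a recommendation from the spec):

-- trade durability for bulk-load speed; restore the defaults afterwards
PRAGMA synchronous = OFF;     -- do not fsync after each commit
PRAGMA journal_mode = MEMORY; -- keep the rollback journal in RAM
PRAGMA cache_size = -64000;   -- ~64 MB page cache (negative value = KiB)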
