Be consistent in DDL for integer primary keys #517

Closed
jyutzler opened this issue Apr 27, 2020 · 22 comments · Fixed by #528

@jyutzler
Contributor

#515 currently proposes modifying the DDL for the Metadata Extension to be more consistent. I guess we should also sweep all of our DDL (Annex C) for consistency in defining primary keys. It looks like we need to make this fix in multiple places. Are we in agreement that the correct way to define an integer autoincrement primary key is like this:

id INTEGER PRIMARY KEY AUTOINCREMENT

And that everything else is superfluous?

@fjlopez

fjlopez commented Apr 28, 2020

The description of the semantics of INTEGER PRIMARY KEY AUTOINCREMENT in SQLite, for example in section "10.4 Autoincrement" of SQLite Database System Design and Implementation (2016), supports this proposal.

@jyutzler
Contributor Author

jyutzler commented May 5, 2020

After discussing this with the SWG today, it appears that we do not need the AUTOINCREMENT keyword at all (and that there is a penalty in CPU/memory/disk space/IO). https://www.sqlite.org/autoinc.html
IIUC, inserting a row into a table without specifying a key will auto-generate one for you. In summary, AUTOINCREMENT should be allowed but not required, and therefore the DDL should remain informative, not normative.
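
A minimal sketch of that behavior (the table and column names are illustrative, not from the spec):

-- without AUTOINCREMENT, SQLite still assigns a key automatically:
-- an omitted or NULL INTEGER PRIMARY KEY typically becomes max(rowid) + 1
CREATE TABLE example (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO example (name) VALUES ('a');           -- id = 1
INSERT INTO example (id, name) VALUES (NULL, 'b'); -- id = 2
SELECT id, name FROM example;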

@jyutzler jyutzler changed the title Be consistent in DDL for integer autoincrement primary keys Be consistent in DDL for integer primary keys May 5, 2020
@jyutzler
Contributor Author

jyutzler commented May 5, 2020

I confirmed that integer primary key values are generated automatically by SQLite. Our table definition SQL is technically normative (except for user-defined tables, of course), so I added a note stating that minor modifications to the SQL are permitted. https://github.com/opengeospatial/geopackage/pull/521/files#diff-e76f7b8609c8bf2c4d6c52402c6a30a5R5

@jyutzler jyutzler added this to the 1.3.0 milestone May 6, 2020
@jyutzler
Contributor Author

jyutzler commented May 6, 2020

There is still some outside discussion on this topic. The main point of concern is RTE relationships. Consider this scenario:

  1. GeoPackage is made by Party A without using AUTOINCREMENT on gpkg_tile_matrix
  2. Party B updates GeoPackage, implementing the Related Tables Extension and creating some relationships between tiles and other data
  3. Party C goes out into the field and performs some tile updates that include removal of the row with the highest ID and addition of some new tiles (maybe they moved laterally)

Since ROWIDs will be reused in this circumstance, you will now have RTE rows pointing to the wrong thing.
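
A sketch of the reuse, assuming a hypothetical tiles table without AUTOINCREMENT:

CREATE TABLE tiles (id INTEGER PRIMARY KEY, tile_data TEXT);
INSERT INTO tiles (tile_data) VALUES ('t1'), ('t2'), ('t3'); -- ids 1, 2, 3
DELETE FROM tiles WHERE id = 3;              -- Party C removes the highest id
INSERT INTO tiles (tile_data) VALUES ('t4'); -- the new row reuses id 3
-- an RTE mapping row that still references id 3 now silently points at 't4'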

What do you think?

@bosborn
Contributor

bosborn commented May 6, 2020

I agree that is an issue not only with RTE, but with any potential extension that someone or some software is not aware of. The user table ids are basically the only guaranteed row-identifying information, used throughout the spec and extensions. Related tables do not force foreign key constraints. Even if they did, SQLite by default does not enforce them unless they are manually enabled.

Intelligent database design prevents the reuse of unique generated ids so that a record id is guaranteed to only ever represent that record. Any new record should be uniquely identifiable without reusing an id. Other than performance, most arguments against using AUTOINCREMENT in a database encourage the use of serial ids instead. But by omitting AUTOINCREMENT, GeoPackage is instead encouraging non-serial auto-generated ids. Both autoincremented and generated serial ids maintain database integrity, but auto-generated max+1 ids do not.

All implementations should be able to function on the base GeoPackage without risking data integrity. There are so many ways to corrupt data as it is; we should do our best to prevent it when possible. We need to think more broadly about GeoPackages shared between different types of users, software, and usages.

GeoPackages that comply with the requirements in the standard and do not implement vendor-specific extensions are interoperable across all enterprise and personal computing environments.

I see this as a potential issue for data integrity, especially for shared GeoPackages. There is also the case of external id tracking for export scenarios. A record is only externally uniquely identifiable and traceable within a versioned GeoPackage by its table and id.

The estimated 10-12% performance gain from omitting AUTOINCREMENT may not offset the potential downsides for database integrity and traceability.

Point number 4 should be enough to justify the inclusion: https://www.sqlite.org/autoinc.html

  4. If the AUTOINCREMENT keyword appears after INTEGER PRIMARY KEY, that changes the automatic ROWID assignment algorithm to prevent the reuse of ROWIDs over the lifetime of the database. In other words, the purpose of AUTOINCREMENT is to prevent the reuse of ROWIDs from previously deleted rows.

I would at least specify AUTOINCREMENT in the spec and strongly encourage it on generated ids. By default, implementations will still function on tables that omit it.
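
For contrast with the reuse sketch above, the behavior with AUTOINCREMENT (again a hypothetical table):

CREATE TABLE tiles_ai (id INTEGER PRIMARY KEY AUTOINCREMENT, tile_data TEXT);
INSERT INTO tiles_ai (tile_data) VALUES ('t1'), ('t2'), ('t3'); -- ids 1, 2, 3
DELETE FROM tiles_ai WHERE id = 3;              -- remove the highest id
INSERT INTO tiles_ai (tile_data) VALUES ('t4'); -- id = 4; id 3 is never reused
-- SQLite tracks the high-water mark in the internal sqlite_sequence table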

@jyutzler
Contributor Author

jyutzler commented May 6, 2020

One possibility is reversing the guidance - putting AUTOINCREMENT in the table definition SQL but providing a note that it MAY be eliminated for performance reasons. We're just hitting performance bottlenecks in some scenarios and even 10% would move the needle.

@bosborn
Contributor

bosborn commented May 6, 2020

Are you doing batch inserts? Committing in chunks within a transaction instead of committing each row?

@jyutzler
Contributor Author

jyutzler commented May 6, 2020

I can't speak for others but I have been. Batches of 1000 seem to work best.
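
In sqlite3 shell terms, a chunked load looks roughly like this (a sketch with a hypothetical two-column table):

CREATE TABLE t (id INTEGER PRIMARY KEY, val INTEGER);
BEGIN TRANSACTION;
INSERT INTO t (val) VALUES (1);
INSERT INTO t (val) VALUES (2);
-- ... repeat up to the batch size (e.g. 1000 rows) ...
COMMIT; -- one commit (and one fsync) per batch instead of per row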

@jerstlouis
Member

jerstlouis commented May 13, 2020

@jyutzler I would support including AUTOINCREMENT in the reference normative SQL, but not making it a hard requirement, so that those for whom performance and storage size are paramount, and who have a clear understanding of the implications, can omit it.

@fjlopez

fjlopez commented May 14, 2020

Although I agree 100% with #517 (comment), I think that we should support this kind of proposal with an experiment on synthetic or real data. For example, we could create at least two GeoPackages with the same feature table, only one of them with AUTOINCREMENT enabled; populate both with different amounts of data (100, 1,000, 10,000, and 100,000 rows); measure insertion time with different batch sizes; take some random delete/insert performance measurements; and then vacuum both and compare sizes with sqlite3_analyzer.

IMHO, when a GeoPackage design decision is conditioned by a specific behaviour of SQLite and there are concerns related to size/speed/consistency, we should have reference experiments that support the decision. I am worried about using magic numbers to make decisions without experimental support (e.g. the classic "80% of data is geographic").

@jyutzler
Contributor Author

While I agree that it would be nice to have performance testing for this case, I am not aware of anyone who is available to do this work.

@danielbarela
Contributor

While these numbers are not GeoPackage specific, here is a link to an experiment that someone ran comparing AUTOINCREMENT with omitting it.

His conclusion was that AUTOINCREMENT was 8-12% slower, but let us look at the numbers. On an HTC One M8 phone from 2014, it took ~3.5 seconds to insert 10,000 items one at a time with AUTOINCREMENT on, and ~3.1 seconds without. Inserting as one batch took ~680 milliseconds with AUTOINCREMENT and ~600 milliseconds without.

https://stackoverflow.com/questions/45384919/what-are-the-overheads-of-using-autoincrement-for-sqlite-on-android

@jerstlouis
Member

8-12% slower on a process that takes a long time is quite significant...
We are planning to experiment with very large GeoPackages with billions of records.
Also, I am even more concerned about the storage requirements (I would be curious to see some figures for that).

So I would strongly support @jyutzler's suggestion:

One possibility is reversing the guidance - putting AUTOINCREMENT in the table definition SQL but providing a note that it MAY be eliminated for performance reasons.

@danielbarela
Contributor

I would also be curious to see figures for size. Are you able to run your experiment, inserting billions of records, both with AUTOINCREMENT and without, and share the results?

@jerstlouis
Member

@danielbarela This is not our current focus, but I would be happy to when we get around to it.

@rcoup

rcoup commented May 18, 2020

8-12% slower on a process that takes a long time is quite significant...
We are planning to experiment with very large GeoPackages with billions of records.

An HTC One M8 phone from 2014 isn't what you'll be inserting billions of records into GeoPackages on, though? Performance comparisons don't scale across hardware very well: the completely different CPU architectures, CPU caches, memory, storage, OS, etc. make it virtually a random number.

@jerstlouis
Member

jerstlouis commented May 18, 2020

This S/O post sheds more light on this...

https://stackoverflow.com/questions/52700893/sqlite-best-practices-about-using-autoincrement

(It also points to that post with the 8-12% figure -- https://stackoverflow.com/questions/45384919/what-are-the-overheads-of-using-autoincrement-for-sqlite-on-android)

According to this, the only time the automatic ROWID assignment loops back to re-use IDs is when the ROWIDs have gone over 9,223,372,036,854,775,807.

At that point, the difference is that adding a record will FAIL (getting a SQLITE_FULL error).
I don't see how this can possibly be better behavior, also considering the performance, memory, and disk storage penalty.

A VACUUM will change all ROWIDs regardless of whether AUTOINCREMENT is present or not, and should adjust all references to the primary keys...

Unless I am misunderstanding this explanation, I do not believe adding AUTOINCREMENT helps with any of the scenarios that were mentioned here as justification for using it.

So if I understand this right, AUTOINCREMENT adds zero benefit, and whatever performance, memory, or disk usage it costs, whether it's 20, 10, 5, or 1%, is a complete waste.

As https://www.sqlite.org/autoinc.html says very clearly:

The AUTOINCREMENT keyword imposes extra CPU, memory, disk space, and disk I/O overhead and should be avoided if not strictly needed. It is usually not needed.

I believe this is a bit of an understatement; in practice it really is never needed.
That is, unless you really prefer having record additions fail rather than re-use an ID that was used before, and only after all non-reused entries have been exhausted... But if there are still tables pointing to that deleted record, there are already problems in the database before you add this 9,223,372,036,854,775,808th record.

EDIT: I might have misunderstood that... The post talks about max(rowid), so if you delete the row with the biggest rowid, that id might be re-used on the next add. So if a client deletes the largest-rowid row and some referencing table does not get updated, it might go from pointing to a missing row to pointing to a wrong row when a record is added later (without AUTOINCREMENT).

@bosborn
Contributor

bosborn commented May 18, 2020

A VACUUM will change all ROWIDs regardless of AUTOINCREMENT being present or not, and should adjust all reference to the primary keys...

This is only the case when an INTEGER PRIMARY KEY is not declared, which does not apply here. See #516 for a little more insight on this.
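
A sketch of the distinction (hypothetical tables):

-- no INTEGER PRIMARY KEY declared: the table has only an implicit rowid,
-- which VACUUM is free to renumber
CREATE TABLE implicit_ids (data TEXT);
-- INTEGER PRIMARY KEY declared: id is an alias for the rowid
-- and is preserved across VACUUM
CREATE TABLE explicit_ids (id INTEGER PRIMARY KEY, data TEXT);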

And yes, I would say there is a clear benefit as specified in the documentation you referenced.

If the AUTOINCREMENT keyword appears after INTEGER PRIMARY KEY, that changes the automatic ROWID assignment algorithm to prevent the reuse of ROWIDs over the lifetime of the database. In other words, the purpose of AUTOINCREMENT is to prevent the reuse of ROWIDs from previously deleted rows.

Are we really expecting people to insert over nine quintillion features into one GeoPackage table? The problems with memory, speed, and handling that kind of database should be interesting.

@bosborn
Contributor

bosborn commented May 18, 2020

18 quintillion max rows per table in theory, but with the 140 TB file limit, about 10 trillion max rows in practice.
https://www.sqlite.org/limits.html

The theoretical maximum number of rows in a table is 2^64 (18446744073709551616 or about 1.8e+19). This limit is unreachable since the maximum database size of 140 terabytes will be reached first. A 140 terabytes database can hold no more than approximately 1e+13 rows, and then only if there are no indices and if each row contains very little data.

@jyutzler
Contributor Author

Well, we're going to discuss this (again!) during tomorrow's SWG meeting. I'm not sure this is the best use of SWG resources, but the squeaky wheel gets the oil.

@bosborn
Contributor

bosborn commented May 18, 2020

I edited and ran a quick personal laptop performance test on our Java library.

  • 2 column feature table (id and geometry)
  • 50 million features (same polygon inserted each time, created in advance)
  • 1k feature transactions
  • With and without autoincrement (only 2 runs each, edited here)

Omitting AUTOINCREMENT is faster as expected, but the rate varied (1.5% to 17.15% in just a few runs).

So this is neither a large enough test case nor representative of most use cases, but it gives an idea that the performance hit can range from negligible up to about what we thought.

results.txt

Edit:
File sizes:
AUTOINCREMENT: 13,691,183,104 bytes
No AUTOINCREMENT: 13,691,179,008 bytes
Difference: 4,096 bytes (the default page size), accounted for by the added sqlite_sequence table

@fjlopez

fjlopez commented May 18, 2020

I created a simple script that creates this table:

CREATE TABLE t1( a INTEGER PRIMARY KEY, b INTEGER);

And populates it with this statement:

INSERT INTO t1(b) values (0);

I ran it in M transactions of N INSERTs each using the sqlite3 shell. I performed four runs: two without AUTOINCREMENT and two with AUTOINCREMENT. The first two runs are similar to @bosborn's experiment in size and number of transactions. Time is measured with time sqlite3 test.db < script; test.db is deleted before each run. My laptop is a MacBook Pro 15-inch mid 2014, 2.5 GHz Quad-Core Intel Core i7, 16 GB 1600 MHz DDR3. My SQLite version is 3.27.2 2019-02-25 16:06:06.

configuration                                        user     system   cpu   total
50K transactions with 1K INSERTs                     97.83s   34.07s   89%   2:27.39
50K transactions with 1K INSERTs and AUTOINCREMENT   134.18s  35.91s   91%   3:06.64
1 transaction with 50M INSERTs                       91.36s   1.62s    99%   1:33.84
1 transaction with 50M INSERTs and AUTOINCREMENT     129.40s  1.73s    98%   2:12.47

Because it removes the layers added by ORMLite, the Xerial JDBC driver, and Java JNI, this result supports @bosborn's finding and shows that there is a performance cost when bulk loading with AUTOINCREMENT. However, this experiment also shows that there are SQLite "tricks" that can be useful to address it.

Recommending the use of AUTOINCREMENT is a good idea: it ensures a more consistent GeoPackage. Users who may suffer some loss of performance due to this recommendation can improve the speed of their INSERTs by using larger transactions or tweaking PRAGMAs, as sketched below.
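
For example (a sketch; the settings and values are illustrative for a one-off bulk load, not a recommendation from the spec):

-- trade durability for bulk-load speed; restore the defaults afterwards
PRAGMA synchronous = OFF;     -- do not fsync after each commit
PRAGMA journal_mode = MEMORY; -- keep the rollback journal in RAM
PRAGMA cache_size = -64000;   -- ~64 MB page cache (negative value = KiB)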
