
creation of rows in knex_migration_lock table is not safe concurrently #3538

Open
bri3d opened this issue Nov 12, 2019 · 5 comments
bri3d commented Nov 12, 2019

Environment

Knex version: 0.19.5
Database + version: XtraDB 5.7
OS: CentOS

Bug

We have encountered a concurrency-related issue around the knex_migration_lock table. The issue we have observed is as follows:

3 nodes concurrently try to execute db.migrate.latest() against a database which has not yet been initialized for use with Knex.

The knex_migration_lock table is created with three entries, rather than only one, because of a data race here: https://github.com/knex/knex/blob/master/lib/migrate/table-creator.js#L30 . The race is between getTable(trxOrKnex, lockTable, schemaName).select('*') returning and the subsequent check of data.length in !data.length && trxOrKnex.into(lockTableWithSchema).insert({ is_locked: 0 }).

Multiple nodes execute the (non-locking) select query simultaneously and all return 0 rows. Next, multiple nodes execute the (non-locking) insert query simultaneously, leading to the creation of multiple sequential locking rows.

After this has happened, no migrations can function until the table is reset, because the check in https://github.com/knex/knex/blob/master/lib/migrate/Migrator.js#L318 verifies that only one row is updated by the query which takes the lock.
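To make the failure mode concrete, here is a rough paraphrase of the two steps described above as plain knex calls (a sketch of the logic only, not the actual knex source; the function names are invented for illustration):

// Sketch of the logic described above, not the real knex source.
// ensureLockRow: the racy check-then-insert in table-creator.js
async function ensureLockRow(db) {
  // Non-locking read: several concurrent processes can all see zero rows here.
  const data = await db('knex_migrations_lock').select('*');
  if (!data.length) {
    // Each of those processes then inserts its own row,
    // so the table ends up with N rows instead of one.
    await db('knex_migrations_lock').insert({ is_locked: 0 });
  }
}

// tryTakeLock: the single-row check in Migrator.js that breaks once N > 1 rows exist
async function tryTakeLock(trx) {
  const rowsUpdated = await trx('knex_migrations_lock')
    .where('is_locked', 0)
    .update({ is_locked: 1 });
  // With multiple unlocked rows this update touches more than one row,
  // so the "exactly one row updated" check can never pass again.
  if (rowsUpdated !== 1) {
    throw new Error('Migration table is already locked');
  }
}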

Perhaps an upsert primitive could be used to insert the locking record into the table initially, or the first indexed lock row could be used as the semaphore for the lock?
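A minimal sketch of the upsert idea, assuming MySQL and assuming the single lock row is given a fixed primary-key value of 1 (neither of which is how knex currently does it); other dialects would need their own equivalent, e.g. ON CONFLICT DO NOTHING on PostgreSQL:

// Sketch only: assumes `index` is the primary key of knex_migrations_lock and
// that the single lock row always uses the fixed value 1 (MySQL dialect).
async function ensureLockRowAtomically(db) {
  // Duplicate-key inserts are silently ignored, so any number of concurrent
  // initializers still leave exactly one lock row behind.
  await db.raw(
    'INSERT IGNORE INTO knex_migrations_lock (`index`, `is_locked`) VALUES (1, 0)'
  );
}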

  1. Error message

Migration table is already locked

  2. Reduced test code, for example in https://npm.runkit.com/knex or, if it needs a real database connection to MySQL or PostgreSQL, a single-file example which initializes the needed data and demonstrates the problem.

Executing db.migrate.latest in multiple truly concurrent processes in a loop will eventually trigger various migration concurrency/race issues, this being the most common one.


bri3d commented Nov 12, 2019

I think this might be #3457 , but against MySQL rather than Postgres.


elhigu commented Nov 13, 2019

Please add a simple single-file reproduction code sample. Explaining the steps is not enough.

It is normal behavior that a "Migration table is already locked" error is thrown if you are running multiple migration calls / processes at the same time.


elhigu commented Nov 13, 2019

The knex_migration_lock table is created with three entries, rather than only one.

That might be a bug (but not necessarily, since the actual locking is done at a different stage). I just read the migration / migration locking code once more and I still feel that it could be better... It also seems to be using implicit transactions for every database, which might cause problems with MySQL, Oracle and SQLite...
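As a hedged illustration of the direction mentioned here (explicit transactions), the check-and-insert could be serialized with a locking read. Note that SELECT ... FOR UPDATE over an empty InnoDB table has its own gap-lock subtleties (concurrent initializers may deadlock and need a retry), so this is only a sketch of the approach, not a proposed patch:

// Sketch: explicit transaction plus a locking read so concurrent initializers
// serialize on the lock table (MySQL/InnoDB semantics assumed).
async function ensureLockRowInTransaction(db) {
  await db.transaction(async (trx) => {
    const rows = await trx('knex_migrations_lock').forUpdate().select('*');
    if (!rows.length) {
      await trx('knex_migrations_lock').insert({ is_locked: 0 });
    }
  });
}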


bri3d commented Nov 13, 2019

"It is normal behavior that Migration table is already locked error is thrown if you are running multiple migration calls / processes at the same time."

Yes, but it is not normal behavior that the lock is never correctly released and migration cannot ever succeed in the future.

Let me be clearer about the observed behavior, and also try to provide a test case. Unfortunately, testing concurrency is challenging in a "single file" type of scenario, but I will do my best to provide something that illustrates the issue we saw. In my reading of the code (linked, line by line, in my original report), the issue is fairly clear to me.

Much like the test case in #3457:

import knex from 'knex';

(async () => {
  const knexConnection = knex({
    client: 'mysql',
    connection: {
      database: 'testme',
      host: '127.0.0.1',
      user: 'root',
    },
  });

  await knexConnection.migrate.latest();
})();

DEBUG=knex* npx babel-node knex-test.js & DEBUG=knex* npx babel-node knex-test.js & DEBUG=knex* npx babel-node knex-test.js &

Please note that this may need to be run many times to observe the behavior, and of course depends on a high number of uncontrollable variables. Concurrency in a single-file test reduction without access to environmental setup is a very fickle thing. I have observed that the issue can be reproduced more readily if there is network latency to the DB server.

Observed behavior: "Migration table is already locked" is thrown from ALL nodes. No node is able to "win" the lock and perform the migration. Restarting the process in a single-concurrency environment does not help either; "Migration table is already locked" continues to be thrown.

Here is the debug output from 3 concurrent processes, where you can observe the race between the SELECT and INSERT causing 3 nodes to insert the rows at the same time:

knex:client acquired connection from pool: __knexUid1 +1ms
  knex:client acquired connection from pool: __knexUid1 +1ms
  knex:client releasing connection to pool: __knexUid1 +3ms
  knex:query select * from `knex_migrations_lock` undefined +5ms
  knex:query select * from `knex_migrations_lock` undefined +5ms
  knex:bindings [] undefined +5ms
  knex:bindings [] undefined +5ms
  knex:client acquired connection from pool: __knexUid1 +1ms
  knex:client releasing connection to pool: __knexUid1 +4ms
  knex:client releasing connection to pool: __knexUid1 +5ms
  knex:client acquired connection from pool: __knexUid1 +1ms
  knex:query select * from `knex_migrations_lock` undefined +6ms
  knex:client acquired connection from pool: __knexUid1 +0ms
  knex:bindings [] undefined +6ms
  knex:query insert into `knex_migrations_lock` (`is_locked`) values (?) undefined +4ms
  knex:bindings [ 0 ] undefined +4ms
  knex:query insert into `knex_migrations_lock` (`is_locked`) values (?) undefined +4ms
  knex:bindings [ 0 ] undefined +4ms
  knex:client releasing connection to pool: __knexUid1 +4ms
  knex:client releasing connection to pool: __knexUid1 +3ms
  knex:client acquired connection from pool: __knexUid1 +1ms
  knex:client acquired connection from pool: __knexUid1 +1ms
  knex:client releasing connection to pool: __knexUid1 +4ms
  knex:query insert into `knex_migrations_lock` (`is_locked`) values (?) undefined +4ms
  knex:client acquired connection from pool: __knexUid1 +0ms
  knex:query select `name` from `knex_migrations` order by `id` asc undefined +3ms
  knex:bindings [ 0 ] undefined +4ms
  knex:bindings [] undefined +3ms

Here is the debug output from a single node trying to migrate after this behavior has occurred, where you can observe the existence of multiple rows in the locking table preventing the locking functionality from ever working again:

❯ DEBUG=knex* npx babel-node knex-test.js
  knex:client acquired connection from pool: __knexUid1 +0ms
  knex:query select * from information_schema.tables where table_name = ? and table_schema = database() undefined +0ms
  knex:bindings [ 'knex_migrations' ] undefined +0ms
  knex:client releasing connection to pool: __knexUid1 +8ms
  knex:client acquired connection from pool: __knexUid1 +1ms
  knex:query select * from information_schema.tables where table_name = ? and table_schema = database() undefined +8ms
  knex:bindings [ 'knex_migrations_lock' ] undefined +7ms
  knex:client releasing connection to pool: __knexUid1 +2ms
  knex:client acquired connection from pool: __knexUid1 +1ms
  knex:query select * from `knex_migrations_lock` undefined +4ms
  knex:bindings [] undefined +4ms
  knex:client releasing connection to pool: __knexUid1 +3ms
  knex:client acquired connection from pool: __knexUid1 +0ms
  knex:query select `name` from `knex_migrations` order by `id` asc undefined +3ms
  knex:bindings [] undefined +3ms
  knex:client releasing connection to pool: __knexUid1 +2ms
  knex:tx trx2: Starting top level transaction +0ms
  knex:client acquired connection from pool: __knexUid1 +1ms
  knex:query BEGIN; trx2 +3ms
  knex:bindings undefined trx2 +3ms
  knex:query update `knex_migrations_lock` set `is_locked` = ? where `is_locked` = ? trx2 +4ms
  knex:bindings [ 1, 0 ] trx2 +4ms
Can't take lock to run migrations: Migration table is already locked
If you are sure migrations are not running you can release the lock manually by deleting all the rows from migrations lock table: knex_migrations_lock
  knex:query ROLLBACK trx2 +5ms
  knex:bindings undefined trx2 +5ms
  knex:tx trx2: releasing connection +11ms
  knex:client releasing connection to pool: __knexUid1 +12ms

On inspection of the database, knex_migrations_lock contains multiple rows:

mysql> SELECT * from knex_migrations_lock;
+-------+-----------+
| index | is_locked |
+-------+-----------+
|     1 |         0 |
|     2 |         0 |
|     3 |         0 |
+-------+-----------+
3 rows in set (0.00 sec)

It is again evident from reading the code at https://github.com/knex/knex/blob/master/lib/migrate/table-creator.js#L30 how these rows were created, and equally evident from the code at https://github.com/knex/knex/blob/master/lib/migrate/Migrator.js#L318 , which verifies that the number of rows updated is equal to one, how having multiple rows in the knex_migrations_lock table prevents migrations from ever running again.
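Until the underlying race is fixed, the recovery suggested by the error message itself can be scripted; a minimal sketch, assuming you can run arbitrary knex calls against the affected database:

// Clear the corrupted lock table, then recreate a single unlocked row.
async function resetMigrationLock(db) {
  await db('knex_migrations_lock').del();                    // drop all stale lock rows
  await db('knex_migrations_lock').insert({ is_locked: 0 }); // single fresh, unlocked row
}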

@HappyZombies

Has there been any update on a "fix" for this, if one is even possible? It's been 2 years since the last comment or status update...
