LOCK TABLES can lead to crashes or locks when used with Galera #27071

Closed
butonic opened this Issue Feb 1, 2017 · 21 comments

Comments

Projects
None yet
7 participants
@butonic
Member

butonic commented Feb 1, 2017

See https://mariadb.com/kb/en/mariadb/lock-tables-and-unlock-tables/#limitations

Also we are seeing installations that have to do a 'repair table' to fix the cluster. The specific problem is

ERROR 1100 (HY000): Table oc_appconfig' was not locked with LOCK TABLES

The documentation only explains:

While a connection holds an explicit lock on a table, it cannot access a non-locked table. If you try, the [...] error will be produced [...]

AFAICT locking is only used for the oc_jobs table: https://github.com/owncloud/core/blob/master/lib/private/BackgroundJob/JobList.php#L191-L210 However a WRITE lock is intended to prevent other connections from updating the table. Galera does not propagate locks ... no Idea what exactly is causing the error.

Introduced with d0a2fa0 ... seems to be released with 9.1.0

Any other installation running 9.1 on galera without this problem?

cc @DeepDiver1975 @felixboehm @PhilippSchaffrath @dercorn @IljaN

@phisch

This comment has been minimized.

Show comment
Hide comment
@phisch

phisch Feb 1, 2017

Contributor

What is weird, is that an instance with this error works fine after a repair table and might randomly fail again after a few hours or a day.

Even if this is caused by jobs, a request to index.php should hold its own connection which should run without any problems. But we could see that no requests at all will work, all requests get the same error. The table is accessible perfectly fine if you connect to the database itself and run the query manual.

Contributor

phisch commented Feb 1, 2017

What is weird, is that an instance with this error works fine after a repair table and might randomly fail again after a few hours or a day.

Even if this is caused by jobs, a request to index.php should hold its own connection which should run without any problems. But we could see that no requests at all will work, all requests get the same error. The table is accessible perfectly fine if you connect to the database itself and run the query manual.

@DeepDiver1975

This comment has been minimized.

Show comment
Hide comment
@DeepDiver1975

DeepDiver1975 Feb 1, 2017

Member

https://mariadb.com/kb/en/mariadb/mariadb-galera-cluster-known-limitations/

Bloody hell ..... is there a docker with galery inside? I'll immendiatly put this into the ci pipeline

Member

DeepDiver1975 commented Feb 1, 2017

https://mariadb.com/kb/en/mariadb/mariadb-galera-cluster-known-limitations/

Bloody hell ..... is there a docker with galery inside? I'll immendiatly put this into the ci pipeline

@PVince81 PVince81 added this to the 9.1.5 milestone Feb 1, 2017

@DeepDiver1975

This comment has been minimized.

Show comment
Hide comment
@DeepDiver1975

DeepDiver1975 Feb 1, 2017

Member

let's try this one

docker pull erkules/galera
Member

DeepDiver1975 commented Feb 1, 2017

let's try this one

docker pull erkules/galera
@Helios07

This comment has been minimized.

Show comment
Hide comment
@Helios07

Helios07 Feb 2, 2017

Which version of MariaDB/MySQL are we talking about? At least in my Galera cluster working with MariaDB 5.5 a repair table is not possible for an InnoDB table (and there will most probably no myISAM tables):

MariaDB [owncloud]> repair table oc_appconfig; +-----------------------+--------+----------+---------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +-----------------------+--------+----------+---------------------------------------------------------+ | owncloud.oc_appconfig | repair | note | The storage engine for the table doesn't support repair | +-----------------------+--------+----------+---------------------------------------------------------+ 1 row in set (0.01 sec)

Even if this works in newer versions, will the repair table be propagated? This would explain to happen the error to occur again at a later time when you might be connected to another database node.

By the way, I have a test installation with this Galera cluster where I do not get the error.

Helios07 commented Feb 2, 2017

Which version of MariaDB/MySQL are we talking about? At least in my Galera cluster working with MariaDB 5.5 a repair table is not possible for an InnoDB table (and there will most probably no myISAM tables):

MariaDB [owncloud]> repair table oc_appconfig; +-----------------------+--------+----------+---------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +-----------------------+--------+----------+---------------------------------------------------------+ | owncloud.oc_appconfig | repair | note | The storage engine for the table doesn't support repair | +-----------------------+--------+----------+---------------------------------------------------------+ 1 row in set (0.01 sec)

Even if this works in newer versions, will the repair table be propagated? This would explain to happen the error to occur again at a later time when you might be connected to another database node.

By the way, I have a test installation with this Galera cluster where I do not get the error.

@phisch

This comment has been minimized.

Show comment
Hide comment
@phisch

phisch Feb 2, 2017

Contributor

AFAIK it says that the engine doesn't support repair, but repair actually does multiple steps and the message comes from one of those.

Contributor

phisch commented Feb 2, 2017

AFAIK it says that the engine doesn't support repair, but repair actually does multiple steps and the message comes from one of those.

@butonic

This comment has been minimized.

Show comment
Hide comment
@butonic

butonic Feb 2, 2017

Member

@Helios07 Indeed, we got the same output with 'The storage engine for the table doesn't support repair' BUT the table worked afterwards.
Reading the docs it would be great to get the output of:

  1. SHOW OPEN TABLES WHERE in_use <> 0
  2. SHOW FULL PROCESSLIST;
  3. from http://www.dbrnd.com/2016/02/mysql-script-to-identify-the-locks-and-blocking-transactions/
SELECT 
    pl.id
    ,pl.user
    ,pl.state
    ,it.trx_id 
    ,it.trx_mysql_thread_id 
    ,it.trx_query AS query
    ,it.trx_id AS blocking_trx_id
    ,it.trx_mysql_thread_id AS blocking_thread
    ,it.trx_query AS blocking_query
FROM information_schema.processlist AS pl 
INNER JOIN information_schema.innodb_trx AS it
	ON pl.id = it.trx_mysql_thread_id
INNER JOIN information_schema.innodb_lock_waits AS ilw
	ON it.trx_id = ilw.requesting_trx_id 
        AND it.trx_id = ilw.blocking_trx_id

We may also try https://github.com/innotop/innotop to monitor the locks.

AFAICT it should be the oc_jobs table. If so, we will know where to look for an alternative solution.

Member

butonic commented Feb 2, 2017

@Helios07 Indeed, we got the same output with 'The storage engine for the table doesn't support repair' BUT the table worked afterwards.
Reading the docs it would be great to get the output of:

  1. SHOW OPEN TABLES WHERE in_use <> 0
  2. SHOW FULL PROCESSLIST;
  3. from http://www.dbrnd.com/2016/02/mysql-script-to-identify-the-locks-and-blocking-transactions/
SELECT 
    pl.id
    ,pl.user
    ,pl.state
    ,it.trx_id 
    ,it.trx_mysql_thread_id 
    ,it.trx_query AS query
    ,it.trx_id AS blocking_trx_id
    ,it.trx_mysql_thread_id AS blocking_thread
    ,it.trx_query AS blocking_query
FROM information_schema.processlist AS pl 
INNER JOIN information_schema.innodb_trx AS it
	ON pl.id = it.trx_mysql_thread_id
INNER JOIN information_schema.innodb_lock_waits AS ilw
	ON it.trx_id = ilw.requesting_trx_id 
        AND it.trx_id = ilw.blocking_trx_id

We may also try https://github.com/innotop/innotop to monitor the locks.

AFAICT it should be the oc_jobs table. If so, we will know where to look for an alternative solution.

@phisch

This comment has been minimized.

Show comment
Hide comment
@phisch

phisch Feb 2, 2017

Contributor

@butonic i tried listing open tables, but at the time the instance was unuseable there was no table in use or locked. I did not check the processlist though.

Contributor

phisch commented Feb 2, 2017

@butonic i tried listing open tables, but at the time the instance was unuseable there was no table in use or locked. I did not check the processlist though.

@dercorn

This comment has been minimized.

Show comment
Hide comment
@dercorn

dercorn Feb 6, 2017

Contributor

@butonic @phisch @DeepDiver1975 any progress here? Or any ideas on how we might isolate the cause of the problem?

Contributor

dercorn commented Feb 6, 2017

@butonic @phisch @DeepDiver1975 any progress here? Or any ideas on how we might isolate the cause of the problem?

@butonic

This comment has been minimized.

Show comment
Hide comment
@butonic

butonic Feb 10, 2017

Member
{"reqId":"uY8\/fpZZIdA964fn8\/hn","remoteAddr":"10.10.2.245","app":"remote","message":"Exception: {\"Exception\":\"Doctrine\\\\DBAL\\\\Exception\\\\DriverException\",\"Message\":\"An exception occurred while executing 'UPDATE `oc_authtoken` SET `last_activity` = ? WHERE `id` = ?' with params [1486376134, 247]:\\n\\nSQLSTATE[HY000]: General error: 1100 Table 'oc_authtoken' was not locked with LOCK TABLES\",\"Code\":0,\"Trace\":\"#0 \\\/var\\\/www\\\/owncloud\\\/3rdparty\\\/doctrine\\\/dbal\\\/lib\\\/Doctrine\\\/DBAL\\\/DBALException.php(116): Doctrine\\\\DBAL\\\\Driver\\\\AbstractMySQLDriver->convertException('An exception oc...', Object(Doctrine\\\\DBAL\\\\Driver\\\\PDOException))\\n#1 \\\/var\\\/www\\\/owncloud\\\/3rdparty\\\/doctrine\\\/dbal\\\/lib\\\/Doctrine\\\/DBAL\\\/Statement.php(174): Doctrine\\\\DBAL\\\\DBALException::driverExceptionDuringQuery(Object(Doctrine\\\\DBAL\\\\Driver\\\\PDOMySql\\\\Driver), Object(Doctrine\\\\DBAL\\\\Driver\\\\PDOException), 'UPDATE `oc_auth...', Array)\\n#2 \\\/var\\\/www
Member

butonic commented Feb 10, 2017

{"reqId":"uY8\/fpZZIdA964fn8\/hn","remoteAddr":"10.10.2.245","app":"remote","message":"Exception: {\"Exception\":\"Doctrine\\\\DBAL\\\\Exception\\\\DriverException\",\"Message\":\"An exception occurred while executing 'UPDATE `oc_authtoken` SET `last_activity` = ? WHERE `id` = ?' with params [1486376134, 247]:\\n\\nSQLSTATE[HY000]: General error: 1100 Table 'oc_authtoken' was not locked with LOCK TABLES\",\"Code\":0,\"Trace\":\"#0 \\\/var\\\/www\\\/owncloud\\\/3rdparty\\\/doctrine\\\/dbal\\\/lib\\\/Doctrine\\\/DBAL\\\/DBALException.php(116): Doctrine\\\\DBAL\\\\Driver\\\\AbstractMySQLDriver->convertException('An exception oc...', Object(Doctrine\\\\DBAL\\\\Driver\\\\PDOException))\\n#1 \\\/var\\\/www\\\/owncloud\\\/3rdparty\\\/doctrine\\\/dbal\\\/lib\\\/Doctrine\\\/DBAL\\\/Statement.php(174): Doctrine\\\\DBAL\\\\DBALException::driverExceptionDuringQuery(Object(Doctrine\\\\DBAL\\\\Driver\\\\PDOMySql\\\\Driver), Object(Doctrine\\\\DBAL\\\\Driver\\\\PDOException), 'UPDATE `oc_auth...', Array)\\n#2 \\\/var\\\/www
@butonic

This comment has been minimized.

Show comment
Hide comment
@butonic

butonic Feb 10, 2017

Member

after a LOCK TABLES oc_jobs WRITE we can see the lock with SHOW OPEN TABLES WHERE in_use <> 0 trying to select from another lable correctly gives us the was not locked with LOCK TABLES error. Nothing unusual. @phisch is going to write a patch that tries to log the locked tables on a \Doctrine\DBAL\Exception\DriverException that has a message containing was not locked with LOCK TABLES so we can narrow down the cause. Logic dictates it is a lock on oc_jobs because that is the only table we lock in core.

Note that other connections are not affected: when we lock the oc_jobs table via mysql CLI the web ui still works as designed because it does not require the oc_jobs table. But the error is seen in the web ui. In theory a web request must have created a lock and then tried to touch another table. How is that possible?

Hm ... their netscale load balancer sends all queries to one of two galera cluster master nodes in active passive mode. maybe a connection is reused that hasn't been closed correctly?

Also if @Helios07 doesn't see this problem in his test instance we may neet to ping people with more galera cluster know how. Someone like @ayurchen or @temeo.

Member

butonic commented Feb 10, 2017

after a LOCK TABLES oc_jobs WRITE we can see the lock with SHOW OPEN TABLES WHERE in_use <> 0 trying to select from another lable correctly gives us the was not locked with LOCK TABLES error. Nothing unusual. @phisch is going to write a patch that tries to log the locked tables on a \Doctrine\DBAL\Exception\DriverException that has a message containing was not locked with LOCK TABLES so we can narrow down the cause. Logic dictates it is a lock on oc_jobs because that is the only table we lock in core.

Note that other connections are not affected: when we lock the oc_jobs table via mysql CLI the web ui still works as designed because it does not require the oc_jobs table. But the error is seen in the web ui. In theory a web request must have created a lock and then tried to touch another table. How is that possible?

Hm ... their netscale load balancer sends all queries to one of two galera cluster master nodes in active passive mode. maybe a connection is reused that hasn't been closed correctly?

Also if @Helios07 doesn't see this problem in his test instance we may neet to ping people with more galera cluster know how. Someone like @ayurchen or @temeo.

@PVince81

This comment has been minimized.

Show comment
Hide comment
@PVince81

PVince81 Apr 7, 2017

Member

Any update on this ? We need to do another RC2 with a fix for this

Member

PVince81 commented Apr 7, 2017

Any update on this ? We need to do another RC2 with a fix for this

@PVince81

This comment has been minimized.

Show comment
Hide comment
@PVince81

PVince81 Apr 7, 2017

Member

Should we try getting rid of any SQL "LOCK" commands and implement a custom locking ? But it sucks if we can't use native DB commands.

Member

PVince81 commented Apr 7, 2017

Should we try getting rid of any SQL "LOCK" commands and implement a custom locking ? But it sucks if we can't use native DB commands.

@butonic

This comment has been minimized.

Show comment
Hide comment
@butonic

butonic Apr 7, 2017

Member

I think we should be able to get around a table lock. I implemented the necessary kind of begin transaction, update and change, select updated value, commit in https://github.com/owncloud/core/pull/25771/files I am currently trying to understand if we can use that kind of atomicity to get rid of the table lock.

A different approach may be #25100.

Member

butonic commented Apr 7, 2017

I think we should be able to get around a table lock. I implemented the necessary kind of begin transaction, update and change, select updated value, commit in https://github.com/owncloud/core/pull/25771/files I am currently trying to understand if we can use that kind of atomicity to get rid of the table lock.

A different approach may be #25100.

@PVince81

This comment has been minimized.

Show comment
Hide comment
@PVince81

PVince81 Apr 7, 2017

Member

#25100 is not backportable, unless you mean extract the locking logic

Member

PVince81 commented Apr 7, 2017

#25100 is not backportable, unless you mean extract the locking logic

@PVince81

This comment has been minimized.

Show comment
Hide comment
@PVince81

PVince81 Apr 7, 2017

Member

What's the next step ? Reopening/porting https://github.com/owncloud/core/pull/25771/files ? Who can work on this ?

Member

PVince81 commented Apr 7, 2017

What's the next step ? Reopening/porting https://github.com/owncloud/core/pull/25771/files ? Who can work on this ?

@PVince81

This comment has been minimized.

Show comment
Hide comment
@PVince81

PVince81 Apr 7, 2017

Member

a web request must have created a lock

Some web requests like deleting a file to trash or overwriting a file might schedule a trashbin/version expiration by inserting a row into oc_jobs. I don't think there is any explicit LOCK command there, but maybe it happens implicitly.

Member

PVince81 commented Apr 7, 2017

a web request must have created a lock

Some web requests like deleting a file to trash or overwriting a file might schedule a trashbin/version expiration by inserting a row into oc_jobs. I don't think there is any explicit LOCK command there, but maybe it happens implicitly.

@butonic

This comment has been minimized.

Show comment
Hide comment
@butonic

butonic Apr 7, 2017

Member

no core/#25771 does not remove the lock. Working on it

Member

butonic commented Apr 7, 2017

no core/#25771 does not remove the lock. Working on it

@butonic butonic assigned butonic and unassigned phisch Apr 7, 2017

@butonic butonic referenced this issue Apr 7, 2017

Merged

[stable9.1] make JobList::next() lock free #27597

3 of 9 tasks complete
@butonic

This comment has been minimized.

Show comment
Hide comment
@butonic

butonic Apr 7, 2017

Member

I took a detour with subselect magic and dark arts of bending the QueryBuilder to do what I want it to do ... didn't work out as expected. Now have a much simpler solution: #27597

Member

butonic commented Apr 7, 2017

I took a detour with subselect magic and dark arts of bending the QueryBuilder to do what I want it to do ... didn't work out as expected. Now have a much simpler solution: #27597

@PVince81

This comment has been minimized.

Show comment
Hide comment
@PVince81

PVince81 Apr 7, 2017

Member

Fix to be released with 9.1.5

Member

PVince81 commented Apr 7, 2017

Fix to be released with 9.1.5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment