LOCK TABLES can lead to crashes or locks when used with Galera #27071

Closed
butonic opened this issue Feb 1, 2017 · 22 comments
Comments

@butonic
Member

butonic commented Feb 1, 2017

See https://mariadb.com/kb/en/mariadb/lock-tables-and-unlock-tables/#limitations

Also we are seeing installations that have to do a 'repair table' to fix the cluster. The specific problem is

ERROR 1100 (HY000): Table 'oc_appconfig' was not locked with LOCK TABLES

The documentation only explains:

While a connection holds an explicit lock on a table, it cannot access a non-locked table. If you try, the [...] error will be produced [...]

AFAICT locking is only used for the oc_jobs table: https://github.com/owncloud/core/blob/master/lib/private/BackgroundJob/JobList.php#L191-L210. However, a WRITE lock is intended to prevent other connections from updating the table, and Galera does not propagate locks ... no idea what exactly is causing the error.

Introduced with d0a2fa0 ... seems to be released with 9.1.0

Any other installation running 9.1 on galera without this problem?

cc @DeepDiver1975 @felixboehm @PhilippSchaffrath @dercorn @IljaN

@phisch
Contributor

phisch commented Feb 1, 2017

What is weird is that an instance with this error works fine after a repair table, then might randomly fail again after a few hours or a day.

Even if this is caused by jobs, a request to index.php should hold its own connection, which should run without any problems. But we could see that no requests work at all; every request gets the same error. The table is perfectly accessible if you connect to the database itself and run the query manually.

@DeepDiver1975
Member

https://mariadb.com/kb/en/mariadb/mariadb-galera-cluster-known-limitations/

Bloody hell ..... is there a Docker image with Galera inside? I'll immediately put this into the CI pipeline.

@PVince81 PVince81 added this to the 9.1.5 milestone Feb 1, 2017
@DeepDiver1975
Member

let's try this one

docker pull erkules/galera

@Helios07

Helios07 commented Feb 2, 2017

Which version of MariaDB/MySQL are we talking about? At least in my Galera cluster running MariaDB 5.5, a repair table is not possible for an InnoDB table (and there will most probably be no MyISAM tables):

MariaDB [owncloud]> repair table oc_appconfig;
+-----------------------+--------+----------+---------------------------------------------------------+
| Table                 | Op     | Msg_type | Msg_text                                                |
+-----------------------+--------+----------+---------------------------------------------------------+
| owncloud.oc_appconfig | repair | note     | The storage engine for the table doesn't support repair |
+-----------------------+--------+----------+---------------------------------------------------------+
1 row in set (0.01 sec)

Even if this works in newer versions, will the repair table be propagated? If not, that would explain why the error occurs again at a later time, when you might be connected to another database node.

By the way, I have a test installation with this Galera cluster where I do not get the error.

@phisch
Contributor

phisch commented Feb 2, 2017

AFAIK it says that the engine doesn't support repair, but repair actually does multiple steps and the message comes from one of those.

@butonic
Member Author

butonic commented Feb 2, 2017

@Helios07 Indeed, we got the same output with 'The storage engine for the table doesn't support repair' BUT the table worked afterwards.
Reading the docs it would be great to get the output of:

  1. SHOW OPEN TABLES WHERE in_use <> 0
  2. SHOW FULL PROCESSLIST;
  3. from http://www.dbrnd.com/2016/02/mysql-script-to-identify-the-locks-and-blocking-transactions/
SELECT
    pl.id
    ,pl.user
    ,pl.state
    ,it.trx_id
    ,it.trx_mysql_thread_id
    ,it.trx_query AS query
    -- blocking columns must come from the blocking transaction (it2), not the waiting one
    ,it2.trx_id AS blocking_trx_id
    ,it2.trx_mysql_thread_id AS blocking_thread
    ,it2.trx_query AS blocking_query
FROM information_schema.processlist AS pl
INNER JOIN information_schema.innodb_trx AS it
    ON pl.id = it.trx_mysql_thread_id
INNER JOIN information_schema.innodb_lock_waits AS ilw
    ON it.trx_id = ilw.requesting_trx_id
-- second join on innodb_trx resolves the transaction that holds the lock
INNER JOIN information_schema.innodb_trx AS it2
    ON it2.trx_id = ilw.blocking_trx_id;

We may also try https://github.com/innotop/innotop to monitor the locks.

AFAICT it should be the oc_jobs table. If so, we will know where to look for an alternative solution.

@phisch
Contributor

phisch commented Feb 2, 2017

@butonic I tried listing open tables, but at the time the instance was unusable there was no table in use or locked. I did not check the processlist though.

@dercorn
Contributor

dercorn commented Feb 6, 2017

@butonic @phisch @DeepDiver1975 any progress here? Or any ideas on how we might isolate the cause of the problem?

@butonic
Member Author

butonic commented Feb 10, 2017

{"reqId":"uY8\/fpZZIdA964fn8\/hn","remoteAddr":"10.10.2.245","app":"remote","message":"Exception: {\"Exception\":\"Doctrine\\\\DBAL\\\\Exception\\\\DriverException\",\"Message\":\"An exception occurred while executing 'UPDATE `oc_authtoken` SET `last_activity` = ? WHERE `id` = ?' with params [1486376134, 247]:\\n\\nSQLSTATE[HY000]: General error: 1100 Table 'oc_authtoken' was not locked with LOCK TABLES\",\"Code\":0,\"Trace\":\"#0 \\\/var\\\/www\\\/owncloud\\\/3rdparty\\\/doctrine\\\/dbal\\\/lib\\\/Doctrine\\\/DBAL\\\/DBALException.php(116): Doctrine\\\\DBAL\\\\Driver\\\\AbstractMySQLDriver->convertException('An exception oc...', Object(Doctrine\\\\DBAL\\\\Driver\\\\PDOException))\\n#1 \\\/var\\\/www\\\/owncloud\\\/3rdparty\\\/doctrine\\\/dbal\\\/lib\\\/Doctrine\\\/DBAL\\\/Statement.php(174): Doctrine\\\\DBAL\\\\DBALException::driverExceptionDuringQuery(Object(Doctrine\\\\DBAL\\\\Driver\\\\PDOMySql\\\\Driver), Object(Doctrine\\\\DBAL\\\\Driver\\\\PDOException), 'UPDATE `oc_auth...', Array)\\n#2 \\\/var\\\/www

@butonic
Member Author

butonic commented Feb 10, 2017

After a LOCK TABLES oc_jobs WRITE we can see the lock with SHOW OPEN TABLES WHERE in_use <> 0. Trying to select from another table on the same connection correctly gives us the "was not locked with LOCK TABLES" error. Nothing unusual. @phisch is going to write a patch that tries to log the locked tables on a \Doctrine\DBAL\Exception\DriverException whose message contains "was not locked with LOCK TABLES", so we can narrow down the cause. Logic dictates it is a lock on oc_jobs, because that is the only table we lock in core.
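The diagnostic idea described above could look roughly like the following sketch. It is written in Python rather than ownCloud's PHP, and `run_logged`, `is_lock_tables_error`, and `StubConnection` are hypothetical names used only for illustration. Only the error text and the `SHOW OPEN TABLES WHERE in_use <> 0` statement come from this thread.

```python
import logging

logger = logging.getLogger("lock_debug")

# Error text quoted in this thread (MySQL/MariaDB error 1100).
LOCK_ERROR_MARKER = "was not locked with LOCK TABLES"


def is_lock_tables_error(exc):
    """Return True if the exception message carries the error 1100 text."""
    return LOCK_ERROR_MARKER in str(exc)


def run_logged(conn, sql):
    """Run sql on conn; on a LOCK TABLES error, log the tables in use.

    conn is assumed to be any object with an execute(sql) method that
    returns rows (a hypothetical interface, not a real driver API).
    """
    try:
        return conn.execute(sql)
    except Exception as exc:
        if is_lock_tables_error(exc):
            in_use = conn.execute("SHOW OPEN TABLES WHERE in_use <> 0")
            logger.error("error 1100 on %r; tables in use: %r", sql, in_use)
        raise  # the caller still sees the original failure


class StubConnection:
    """Stand-in that fails like a session holding a lock on oc_jobs."""

    def execute(self, sql):
        if sql.startswith("SHOW OPEN TABLES"):
            return [("owncloud", "oc_jobs", 1, 0)]
        raise RuntimeError(
            "1100 Table 'oc_authtoken' was not locked with LOCK TABLES"
        )
```

Running `run_logged(StubConnection(), "UPDATE oc_authtoken SET last_activity = 1")` logs the stubbed oc_jobs row and then re-raises the error, which is the kind of evidence the patch is meant to collect.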

Note that other connections are not affected: when we lock the oc_jobs table via mysql CLI the web ui still works as designed because it does not require the oc_jobs table. But the error is seen in the web ui. In theory a web request must have created a lock and then tried to touch another table. How is that possible?

Hm ... their NetScaler load balancer sends all queries to one of two Galera cluster master nodes in active/passive mode. Maybe a connection is reused that hasn't been closed correctly?
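If the reused-connection theory holds, one way it could arise is an exception between LOCK TABLES and UNLOCK TABLES on a connection that outlives the request. The sketch below is in Python with a hypothetical recording stub instead of a real driver, and the UPDATE statement text is illustrative. It shows a guard that always releases the lock, as an illustration of the hypothesis and not as ownCloud's code.

```python
from contextlib import contextmanager


class RecordingConnection:
    """Hypothetical stand-in that only records the statements it receives."""

    def __init__(self):
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)


@contextmanager
def table_write_lock(conn, table):
    """Hold a WRITE lock on table and release it even if the body raises."""
    conn.execute(f"LOCK TABLES {table} WRITE")
    try:
        yield conn
    finally:
        # Without this, a persistent connection would keep the lock, and
        # later queries on other tables in that session would hit error 1100.
        conn.execute("UNLOCK TABLES")


conn = RecordingConnection()
try:
    with table_write_lock(conn, "oc_jobs"):
        conn.execute("UPDATE oc_jobs SET last_checked = 0")
        raise RuntimeError("simulated failure while the lock is held")
except RuntimeError:
    pass  # the lock was still released by the finally block
```

After this runs, the last recorded statement is `UNLOCK TABLES`, so the connection can be handed to the next request without a leftover session lock.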

Also, if @Helios07 doesn't see this problem in his test instance, we may need to ping people with more Galera cluster know-how. Someone like @ayurchen or @temeo.

@PVince81
Contributor

PVince81 commented Apr 7, 2017

Any update on this? We need to do another RC2 with a fix for this.

@PVince81
Contributor

PVince81 commented Apr 7, 2017

Should we try getting rid of any SQL "LOCK" commands and implement custom locking? But it sucks if we can't use native DB commands.

@butonic
Member Author

butonic commented Apr 7, 2017

I think we should be able to get around a table lock. I implemented the necessary sequence (begin transaction, update and change, select updated value, commit) in https://github.com/owncloud/core/pull/25771/files. I am currently trying to understand if we can use that kind of atomicity to get rid of the table lock.
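One lock-free way to reserve a job is a conditional UPDATE whose WHERE clause repeats the value read earlier, followed by a check of the affected row count. The sketch below uses Python's sqlite3 module and a simplified, hypothetical `jobs` table with a `reserved_at` column. It illustrates the general compare-and-set idea under those assumptions and is not the code from either pull request.

```python
import sqlite3


def read_reserved_at(conn, job_id):
    """Return the current reserved_at value for job_id."""
    row = conn.execute(
        "SELECT reserved_at FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    return row[0]


def try_reserve(conn, job_id, seen_reserved_at, now):
    """Reserve the job only if reserved_at is unchanged since it was read."""
    cur = conn.execute(
        "UPDATE jobs SET reserved_at = ? WHERE id = ? AND reserved_at = ?",
        (now, job_id, seen_reserved_at),
    )
    conn.commit()
    # Exactly one affected row means this caller won; zero means another
    # worker changed the row first, so no table lock is needed.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, reserved_at INTEGER NOT NULL)"
)
conn.execute("INSERT INTO jobs (id, reserved_at) VALUES (1, 0)")
conn.commit()

# Two workers read the same unreserved state before either one writes.
seen_a = read_reserved_at(conn, 1)
seen_b = read_reserved_at(conn, 1)

won_a = try_reserve(conn, 1, seen_a, now=100)  # first writer succeeds
won_b = try_reserve(conn, 1, seen_b, now=101)  # stale read, matches no row
```

The worker that loses simply picks another job or retries, which avoids holding a session-level lock that a Galera cluster would not propagate anyway.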

A different approach may be #25100.

@PVince81
Contributor

PVince81 commented Apr 7, 2017

#25100 is not backportable, unless you mean extract the locking logic

@PVince81
Contributor

PVince81 commented Apr 7, 2017

What's the next step? Reopening/porting https://github.com/owncloud/core/pull/25771/files? Who can work on this?

@PVince81
Contributor

PVince81 commented Apr 7, 2017

> a web request must have created a lock

Some web requests like deleting a file to trash or overwriting a file might schedule a trashbin/version expiration by inserting a row into oc_jobs. I don't think there is any explicit LOCK command there, but maybe it happens implicitly.

@butonic
Member Author

butonic commented Apr 7, 2017

No, core/#25771 does not remove the lock. Working on it.

@butonic
Member Author

butonic commented Apr 7, 2017

I took a detour with subselect magic and dark arts of bending the QueryBuilder to do what I want it to do ... didn't work out as expected. Now have a much simpler solution: #27597

@PVince81
Contributor

PVince81 commented Apr 7, 2017

Fix to be released with 9.1.5

@lock

lock bot commented Aug 1, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Aug 1, 2019