handle stalling in mysql driver #990
Comments
So dudes, is this crazy? I'm thinking a very simple way to handle TCP connection breakdown is to stop using the mysql driver's built-in query queue completely. Instead, apply something that has its own queue, is well tested, and where we can easily introduce the ability to cancel jobs if they take too long. node-compute-cluster doesn't have the right name for this application, but I think it could work very nicely and easily, and would also give us a trivial way to configure how many simultaneous DB connections a process should manage. I also think it can help us handle database failover. My first pass at this would be to write an abstraction around the mysql driver which exports the very same interface as the driver itself, but sends queries off to child processes for execution, and cancels them (and the process in which they were run) if they fail. Logging all the way. ho ho ho. First reactions?
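The queue-with-cancellation idea above can be sketched without the driver itself. This is a minimal illustration (all names are hypothetical, and a real version would dispatch to child processes rather than call `runQuery` in-process): jobs go through our own queue, and any job that outlives a deadline is failed and skipped past.

```javascript
// Sketch: wrap an arbitrary query function so calls go through our own
// queue, and a job that exceeds `timeoutMs` is cancelled with an error.
// `runQuery(sql, cb)` stands in for the driver (or a child process).
function makeGuardedQuery(runQuery, timeoutMs) {
  var queue = [];
  var busy = false;

  function next() {
    if (busy || queue.length === 0) return;
    busy = true;
    var job = queue.shift();
    var done = false;
    var timer = setTimeout(function () {
      if (done) return;
      done = true; // give up on this job; move on to the next one
      busy = false;
      job.cb(new Error('query timed out after ' + timeoutMs + 'ms'));
      next();
    }, timeoutMs);
    runQuery(job.sql, function (err, rows) {
      if (done) return; // already timed out; late result is discarded
      done = true;
      clearTimeout(timer);
      busy = false;
      job.cb(err, rows);
      next();
    });
  }

  // Same call shape as the driver's query(sql, cb).
  return function query(sql, cb) {
    queue.push({ sql: sql, cb: cb });
    next();
  };
}
```

In the child-process version described above, "cancel" would additionally mean killing the worker process that ran the query.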
Sounds reasonable. We want to make sure that cancelling a transaction also cancels it on the server side (this should happen in the case of a client-side query timeout, too).
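For the server-side half of that cancellation, MySQL provides the `KILL QUERY <thread_id>` statement, which aborts the running statement without closing the connection (as opposed to `KILL CONNECTION`). A sketch of a helper (hypothetical, not part of the driver) that builds the statement from a connection's numeric thread id:

```javascript
// Sketch: build the server-side cancellation statement for a stalled
// query. The id would come from the stalled connection's thread id, and
// the statement would be issued over a second "control" connection.
function buildKillQuery(threadId) {
  // Validate strictly: the id is interpolated into SQL, so only accept
  // a positive integer.
  if (typeof threadId !== 'number' || !isFinite(threadId) ||
      threadId <= 0 || Math.floor(threadId) !== threadId) {
    throw new Error('invalid mysql thread id: ' + threadId);
  }
  return 'KILL QUERY ' + threadId;
}
```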
I see we have a single connection to mysql per process. This is probably fine, as we have many processes running. Another way to go is to create a db pool, so that a single process is more resilient to slow queries. Having multiple connections in a pool would allow us to tune the timeout to allow for longer queries, since the probability of all connections hitting slow queries is lower under healthy conditions. We'd still be able to kill and reset connections for queries that aren't responding.

Multiple connections should improve throughput, but again this has to be balanced against how many processes are hitting the DB. If a single process is handling dozens of concurrent requests which use the DB (out of the hundreds of requests it's servicing), this could be a win. Probably outside the scope of this bug, but definitely a feature for a more general purpose solution.

Doesn't mysql have slow query logging? (I know Postgres does.) If so, we should turn that on and aggregate timing to drive Issues for poorly written SQL or missing indexes. This isn't a big deal with BrowserID, since it has a small schema that changes infrequently.
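The pool idea above can be sketched in a few lines. This is illustrative only (not the driver's API, and real connections come from the driver, not `createConn`): connections are checked out per query, and a connection whose query outlives the deadline is destroyed and replaced, so one slow statement can't wedge the whole process.

```javascript
// Sketch: a tiny fixed-size pool with a per-query deadline. A stuck
// connection is destroyed and replaced by a fresh one.
function Pool(createConn, size, timeoutMs) {
  this.createConn = createConn;
  this.timeoutMs = timeoutMs;
  this.idle = [];
  this.waiting = []; // jobs queued while all connections are busy
  for (var i = 0; i < size; i++) this.idle.push(createConn());
}

Pool.prototype.query = function (sql, cb) {
  var self = this;
  if (this.idle.length === 0) {
    this.waiting.push([sql, cb]);
    return;
  }
  var conn = this.idle.pop();
  var settled = false;
  var timer = setTimeout(function () {
    if (settled) return;
    settled = true;
    conn.destroy();                  // kill the stuck connection...
    self.release(self.createConn()); // ...and replace it with a fresh one
    cb(new Error('query exceeded ' + self.timeoutMs + 'ms'));
  }, this.timeoutMs);
  conn.query(sql, function (err, rows) {
    if (settled) return; // already timed out; late result is discarded
    settled = true;
    clearTimeout(timer);
    self.release(conn);
    cb(err, rows);
  });
};

Pool.prototype.release = function (conn) {
  this.idle.push(conn);
  if (this.waiting.length > 0) {
    var job = this.waiting.shift();
    this.query(job[0], job[1]);
  }
};
```

With, say, 4 connections and a 10-second deadline, all 4 would have to stall simultaneously before the process stopped serving DB traffic, which is the resilience argument made above.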
QA will test this with Ops in Stage during the 12-03-01 load testing.
Verified as fixed through load testing and Stage breakage today... |
We are seeing sporadic mysql driver timeouts and stalling. The first step in addressing this is to detect the situation and recover: implement application-level detection of slow queries that logs an error, then reconnects and continues running if a query takes longer than N seconds.
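The detect-log-reconnect loop described above can be sketched as a watchdog around the driver's query call. All names here are illustrative (`reconnect` stands in for whatever tears down the wedged connection and opens a fresh one):

```javascript
// Sketch: time every query; if one exceeds maxMs, log an error, swap in
// a fresh connection, and keep serving subsequent queries.
function watchQuery(conn, reconnect, maxMs, logError) {
  return function (sql, cb) {
    var settled = false;
    var timer = setTimeout(function () {
      if (settled) return;
      settled = true;
      logError('query stalled > ' + maxMs + 'ms: ' + sql);
      conn = reconnect(conn); // drop the wedged connection, keep running
      cb(new Error('query stalled'));
    }, maxMs);
    conn.query(sql, function (err, rows) {
      if (settled) return; // already gave up; late result is discarded
      settled = true;
      clearTimeout(timer);
      cb(err, rows);
    });
  };
}
```

The key property is the last step: after a stall is detected, later queries run against the replacement connection, so the process recovers instead of hanging.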