Clustering spends 1/3 of its time waiting for DB to check for cancellation #369

jstray · 2013-03-04T00:26:47Z

Timings on my dev machine, importing drones set:

Current version:
18:02:13.567 INFO WORKER - Clustered documents, time: 31.445 seconds
18:03:13.333 INFO WORKER - Clustered documents, time: 28.492 seconds
18:06:19.625 INFO WORKER - Clustered documents, time: 28.352 seconds
18:08:20.393 INFO WORKER - Clustered documents, time: 30.624 seconds

With cancellation check (j.state == CANCELLED) commented out:
18:13:46.790 INFO WORKER - Clustered documents, time: 25.431 seconds
18:14:44.100 INFO WORKER - Clustered documents, time: 22.024 seconds
18:15:52.829 INFO WORKER - Clustered documents, time: 21.495 seconds
18:18:19.440 INFO WORKER - Clustered documents, time: 21.642 seconds

To confirm, I'm also seeing this in VisualVM where we spend as much time in org.postgresql.core.VisibleBufferedInputStream.readMore() as we do in org.overviewproject.clustering.KMeansDocumentOps.distance()

We should be reading the cancel state on another thread while the clustering is happening. Could be done using actors to wrap DB requests, but not really necessary. See e.g. http://stackoverflow.com/questions/4087696/is-asynchronous-jdbc-call-possible and http://www.gamlor.info/wordpress/2012/05/async-sql-and-akka/

adamhooper · 2013-03-04T02:17:31Z

The links you provided give many uncool suggestions along with good ones.
#1 has many reasoned critiques that you must read before implementing. #2
won't compile and isn't tested.

Yes, it's reasonable to solve this specific issue (polling a boolean field)
with an actor running a query concurrently as the worker writes.

I don't know how far we should generalize, though.

Async only works if we don't care about state--but we usually do. On the
web server, for instance, we want to handle fewer requests at once, or
we'll thrash the DB. Queries aren't running in a vacuum.

Also: timings on dev and prod can vary tons.

Enjoy life,
Adam
On Mar 3, 2013 7:26 PM, "Jonathan Stray" notifications@github.com wrote:

Timings on my dev machine, importing drones set:

Current version:
18:02:13.567 INFO WORKER - Clustered documents, time: 31.445 seconds
18:03:13.333 INFO WORKER - Clustered documents, time: 28.492 seconds
18:06:19.625 INFO WORKER - Clustered documents, time: 28.352 seconds
18:08:20.393 INFO WORKER - Clustered documents, time: 30.624 seconds

With cancellation check (j.state == CANCELLED) commented out:
18:13:46.790 INFO WORKER - Clustered documents, time: 25.431 seconds
18:14:44.100 INFO WORKER - Clustered documents, time: 22.024 seconds
18:15:52.829 INFO WORKER - Clustered documents, time: 21.495 seconds
18:18:19.440 INFO WORKER - Clustered documents, time: 21.642 seconds

We should be reading the cancel state on another thread while the
clustering is happening. Could be done using actors to wrap DB requests,
but not really necessary. See e.g.
http://stackoverflow.com/questions/4087696/is-asynchronous-jdbc-call-possibleand
http://www.gamlor.info/wordpress/2012/05/async-sql-and-akka/

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/369
.

jstray · 2013-03-04T02:52:20Z

Yep, It's Complicated. I don't mean to suggest that we use adbcj, just that it shows the idioms around this.

Re server, there is certainly no point in having more concurrent threads than there are DB connections in the connection pool. Any request past that should simply block. But it's easy to imagine that e.g. the two threads in the server (I believe this is how many we have now?) are both waiting on writes and therefore the server cannot respond to a read which may be on an entirely different table, e.g. client wants to update progress. Or the client might want to view a route which doesn't even hit the DB, such as the main page.

In other words, if implemented well, this would be roughly equivalent to running more server processes. I don't see how the DB would be able to handle an additional server process yet not able to handle a server with more threads.

thejonas · 2013-03-04T11:56:40Z

I suggest we adopt a messaging system and let job cancellations be one of the messages. Then we won't have to worry about polling the db.

jstray · 2013-03-04T16:51:56Z

Indeed Jonas, that is the best answer.

adamhooper · 2015-04-01T20:08:40Z

We have other issues on Pivotal that target this problem. And the polling doesn't eat up 1/3 of the time any more.

ghost assigned thejonas Mar 4, 2013

adamhooper added the wontfix label Apr 1, 2015

adamhooper closed this as completed Apr 1, 2015

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clustering spends 1/3 of its time waiting for DB to check for cancellation #369

Clustering spends 1/3 of its time waiting for DB to check for cancellation #369

jstray commented Mar 4, 2013

adamhooper commented Mar 4, 2013

jstray commented Mar 4, 2013

thejonas commented Mar 4, 2013

jstray commented Mar 4, 2013

adamhooper commented Apr 1, 2015

Clustering spends 1/3 of its time waiting for DB to check for cancellation #369

Clustering spends 1/3 of its time waiting for DB to check for cancellation #369

Comments

jstray commented Mar 4, 2013

adamhooper commented Mar 4, 2013

jstray commented Mar 4, 2013

thejonas commented Mar 4, 2013

jstray commented Mar 4, 2013

adamhooper commented Apr 1, 2015