Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clustering spends 1/3 of its time waiting for DB to check for cancellation #369

Closed
jstray opened this issue Mar 4, 2013 · 5 comments
Closed
Assignees
Labels

Comments

@jstray
Copy link
Contributor

jstray commented Mar 4, 2013

Timings on my dev machine, importing drones set:

Current version:
18:02:13.567 INFO WORKER - Clustered documents, time: 31.445 seconds
18:03:13.333 INFO WORKER - Clustered documents, time: 28.492 seconds
18:06:19.625 INFO WORKER - Clustered documents, time: 28.352 seconds
18:08:20.393 INFO WORKER - Clustered documents, time: 30.624 seconds

With cancellation check (j.state == CANCELLED) commented out:
18:13:46.790 INFO WORKER - Clustered documents, time: 25.431 seconds
18:14:44.100 INFO WORKER - Clustered documents, time: 22.024 seconds
18:15:52.829 INFO WORKER - Clustered documents, time: 21.495 seconds
18:18:19.440 INFO WORKER - Clustered documents, time: 21.642 seconds

To confirm, I'm also seeing this in VisualVM where we spend as much time in org.postgresql.core.VisibleBufferedInputStream.readMore() as we do in org.overviewproject.clustering.KMeansDocumentOps.distance()

We should be reading the cancel state on another thread while the clustering is happening. Could be done using actors to wrap DB requests, but not really necessary. See e.g. http://stackoverflow.com/questions/4087696/is-asynchronous-jdbc-call-possible and http://www.gamlor.info/wordpress/2012/05/async-sql-and-akka/

@ghost ghost assigned thejonas Mar 4, 2013
@adamhooper
Copy link
Member

The links you provided give many uncool suggestions along with good ones.
#1 has many reasoned critiques that you must read before implementing. #2
won't compile and isn't tested.

Yes, it's reasonable to solve this specific issue (polling a boolean field)
with an actor running a query concurrently as the worker writes.

I don't know how far we should generalize, though.

Async only works if we don't care about state--but we usually do. On the
web server, for instance, we want to handle fewer requests at once, or
we'll thrash the DB. Queries aren't running in a vacuum.

Also: timings on dev and prod can vary tons.

Enjoy life,
Adam
On Mar 3, 2013 7:26 PM, "Jonathan Stray" notifications@github.com wrote:

Timings on my dev machine, importing drones set:

Current version:
18:02:13.567 INFO WORKER - Clustered documents, time: 31.445 seconds
18:03:13.333 INFO WORKER - Clustered documents, time: 28.492 seconds
18:06:19.625 INFO WORKER - Clustered documents, time: 28.352 seconds
18:08:20.393 INFO WORKER - Clustered documents, time: 30.624 seconds

With cancellation check (j.state == CANCELLED) commented out:
18:13:46.790 INFO WORKER - Clustered documents, time: 25.431 seconds
18:14:44.100 INFO WORKER - Clustered documents, time: 22.024 seconds
18:15:52.829 INFO WORKER - Clustered documents, time: 21.495 seconds
18:18:19.440 INFO WORKER - Clustered documents, time: 21.642 seconds

We should be reading the cancel state on another thread while the
clustering is happening. Could be done using actors to wrap DB requests,
but not really necessary. See e.g.
http://stackoverflow.com/questions/4087696/is-asynchronous-jdbc-call-possibleand
http://www.gamlor.info/wordpress/2012/05/async-sql-and-akka/


Reply to this email directly or view it on GitHubhttps://github.com//issues/369
.

@jstray
Copy link
Contributor Author

jstray commented Mar 4, 2013

Yep, It's Complicated. I don't mean to suggest that we use adbcj, just that it shows the idioms around this.

Re server, there is certainly no point in having more concurrent threads than there are DB connections in the connection pool. Any request past that should simply block. But it's easy to imagine that e.g. the two threads in the server (I believe this is how many we have now?) are both waiting on writes and therefore the server cannot respond to a read which may be on an entirely different table, e.g. client wants to update progress. Or the client might want to view a route which doesn't even hit the DB, such as the main page.

In other words, if implemented well, this would be roughly equivalent to running more server processes. I don't see how the DB would be able to handle an additional server process yet not able to handle a server with more threads.

@thejonas
Copy link
Contributor

thejonas commented Mar 4, 2013

I suggest we adopt a messaging system and let job cancellations be one of the messages. Then we won't have to worry about polling the db.

@jstray
Copy link
Contributor Author

jstray commented Mar 4, 2013

Indeed Jonas, that is the best answer.

@adamhooper
Copy link
Member

We have other issues on Pivotal that target this problem. And the polling doesn't eat up 1/3 of the time any more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants