finalspy

[JAVA-1125] Change how remove(query) is performed on GridFS to improve performance.

Problem

When a query is used to remove data from GridFS (both the [bucket].files and [bucket].chunks collections), the current driver first issues a select, then loops over the results and runs 2 removes (one on files, one on chunks) on each iteration.
On large result sets, this behavior can result in thousands of requests, each counted twice (files and chunks).

I understand that removing files and chunks one after the other, file by file, is a way to limit data inconsistency. But a risk remains even then.
So, as long as the linked removal of files and chunks isn't managed by the server itself, the client side is responsible for checking that files and chunks are consistent.
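To illustrate what such a client-side check could look like, here is a minimal sketch. It is not part of the driver or of this PR: the class and method names are hypothetical, and it assumes the caller has already fetched the `_id` values from [bucket].files and the `files_id` values from [bucket].chunks.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical helper, not a driver API: detects chunks whose file document is gone. */
final class GridFsConsistencySketch {

    /**
     * Returns the files_id values that appear in chunks but have no
     * matching _id in files, i.e. the owners of orphaned chunks.
     */
    static <T> Set<T> orphanedChunkOwners(Set<T> fileIds, Collection<T> chunkFileIds) {
        Set<T> orphans = new LinkedHashSet<>();
        for (T owner : chunkFileIds) {
            if (!fileIds.contains(owner)) {
                orphans.add(owner);
            }
        }
        return orphans;
    }
}
```

An empty result means every chunk still belongs to an existing file. A non-empty result lists the ids whose chunks were left behind, for example by a removal that failed between the two requests.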

Solution

I updated my previous PR by adding a parameter that keeps the legacy behavior by default while allowing callers to force a "bulk removal".
remove(query) is equivalent to remove(query, true), which is the existing default behavior.
remove(query, false) issues only 3 requests:

  • one select on "files" to collect the ids of the matching files (kept in a list),
  • one remove on "files" using the same query,
  • one remove on "chunks" using a $in clause on the collected ids.

For a single file this is not much of an improvement, but on large sets of files it is worthwhile.

"Legacy" remove(query) = 2 * n remove requests on GridFS (after the initial select).
"Bulk" remove(query, false) = 3 requests in total on GridFS (1 select + 2 removes),
where n = number of files matched by the query.
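The two strategies described above can be compared with a small in-memory simulation. This is only a sketch under stated assumptions: it does not use the MongoDB driver, the class `GridFsRemoveSketch` and its methods are hypothetical, files are reduced to an integer id plus a filename, and each simulated round trip to the server just increments a counter. The counter includes the initial select, so the legacy path reports 1 + 2 * n.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

/** Hypothetical in-memory stand-in for a GridFS bucket that counts simulated requests. */
final class GridFsRemoveSketch {
    final Map<Integer, String> files = new HashMap<>(); // files: _id -> filename
    final List<int[]> chunks = new ArrayList<>();        // chunks: {files_id, n}
    int requests = 0;                                    // simulated server round trips

    void store(int id, String filename, int chunkCount) {
        files.put(id, filename);
        for (int n = 0; n < chunkCount; n++) {
            chunks.add(new int[] {id, n});
        }
    }

    /** One select on "files": returns the ids of the files matching the query. */
    private List<Integer> selectIds(Predicate<String> query) {
        requests++;
        List<Integer> ids = new ArrayList<>();
        for (Map.Entry<Integer, String> e : files.entrySet()) {
            if (query.test(e.getValue())) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }

    /** Legacy behavior: two removes (files, then chunks) per matched file. */
    void removeLegacy(Predicate<String> query) {
        for (Integer id : selectIds(query)) {
            requests++;
            files.remove(id);
            requests++;
            chunks.removeIf(c -> c[0] == id);
        }
    }

    /** Bulk behavior: one remove on files by query, one remove on chunks by id list. */
    void removeBulk(Predicate<String> query) {
        List<Integer> ids = selectIds(query);
        requests++;
        files.values().removeIf(query);                // same query as the select
        requests++;
        chunks.removeIf(c -> ids.contains(c[0]));      // plays the role of the $in clause
    }

    public static void main(String[] args) {
        GridFsRemoveSketch legacy = new GridFsRemoveSketch();
        GridFsRemoveSketch bulk = new GridFsRemoveSketch();
        for (int i = 0; i < 1000; i++) {
            legacy.store(i, "tmp-" + i, 4);
            bulk.store(i, "tmp-" + i, 4);
        }
        legacy.removeLegacy(name -> name.startsWith("tmp-"));
        bulk.removeBulk(name -> name.startsWith("tmp-"));
        System.out.println("legacy requests: " + legacy.requests);
        System.out.println("bulk requests:   " + bulk.requests);
    }
}
```

Both paths leave the store in the same state when nothing fails. The difference is only in the number of round trips, which grows with n in the legacy path and stays constant in the bulk path.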

[Fixes: https://jira.mongodb.org/browse/JAVA-1125]
[Updates/improves PR https://github.com//pull/171 against branch 3.0.x]
[Same as closed PR https://github.com//pull/192, but against master instead of branch 3.0.x]

@finalspy finalspy force-pushed the master branch 2 times, most recently from 71b5048 to 0c68630 Compare March 21, 2015 23:06
@rozza
Member

rozza commented Nov 19, 2015

Hi, apologies for not coming back to this ticket sooner. As you may be aware, much has changed in the codebase with the new and improved 3.x series.

As this ticket interacts with the legacy and effectively deprecated GridFS implementation I'm closing it. There is a new GridFS spec which doesn't have a remove(query) method, due to complexities that can happen if there were to be an error when deleting the data.

If you feel this is a mistake and it should be included for all drivers - then please can I ask you to open a Drivers ticket?

@rozza rozza closed this Nov 19, 2015