Race conditions with batching #46

Open
lukasgraf opened this issue Nov 18, 2015 · 1 comment
Comments

@lukasgraf
Member

(I'm just dumping this here so the conversation in #45 doesn't get too convoluted - for now I see this as low to medium priority.)

Once we implement some sort of batching / pagination, there are some inherent race conditions that can occur:

Imagine a search query. Because fetching a batch page happens in a separate request, the extent and order of the resultset for a given query can change between batch pages if another client modifies the DB in between. When a consumer simply iterates over all entries in all batch pages, this can lead to duplicate entries, or to entries being silently dropped between pages.
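To make that concrete, here's a trivial illustration of naive offset-based paging going wrong, with plain Python lists standing in for the resultset:

# Plain lists standing in for the catalog resultset.
results = ["a", "b", "c", "d"]   # resultset as of the first batch request
page1 = results[0:2]             # -> ["a", "b"]

results.insert(0, "new")         # another client adds a matching object

page2 = results[2:4]             # -> ["b", "c"]: "b" shows up twice
assert page1 + page2 == ["a", "b", "b", "c"]
# Deleting an early entry instead would silently drop one between pages.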

ElasticSearch addresses this in a rather elegant way with its Scroll API:

  • The first request just creates a server-side, persistent search context that has a certain time to live (TTL).
  • That request is answered with a response that basically just contains a _scroll_id that uniquely identifies the resultset created by the query at that point in time.
  • To fetch the results, the client issues subsequent requests for a particular batch page from that search context, referencing it via _scroll_id. On each of those requests the TTL for the search context is reset, so it is kept alive for another $TTL minutes.
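For reference, the client-side flow against Elasticsearch looks roughly like this (a sketch using the requests library; the index name and query are made up, and the exact endpoints may differ between ES versions):

import requests

ES = "http://localhost:9200"

# 1) The initial search creates the scroll context (kept alive for 1 minute).
resp = requests.post(
    ES + "/myindex/_search?scroll=1m",
    json={"query": {"match": {"portal_type": "Document"}}, "size": 20},
).json()
scroll_id = resp["_scroll_id"]

# 2) Subsequent requests page through the frozen resultset; each one
#    also resets the context's TTL.
while True:
    resp = requests.post(
        ES + "/_search/scroll",
        json={"scroll": "1m", "scroll_id": scroll_id},
    ).json()
    hits = resp["hits"]["hits"]
    if not hits:
        break
    scroll_id = resp["_scroll_id"]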

I could see a similar concept working for us in order to provide stable resultsets for batched sequences, particularly search results.


I'm just brainstorming here, but maybe something along these lines could work:

POST /Plone/search

{"portal_type": "Document"}

This would create a server-side, persistent search context. In terms of search results, this could mean persisting a list of brain RIDs [1] for the resultset that matched the query at that point in time.

Returns a response with a scroll_id:

{"scroll_id": "f40dba5"}

The client can then retrieve result batches via GET requests:

GET /Plone/search?scroll_id=f40dba5&page=1&per_page=20

The link to the first batch page can also be provided in a hypermedia fashion as part of the response to the POST that creates the search context.
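The matching GET handler could then slice the persisted RIDs and resolve only the requested page back into brains. Again just a sketch, reusing the SCROLL_CONTEXTS store from above; note that resolving a RID to a brain by indexing the inner catalog relies on a ZCatalog implementation detail:

import json
import time

def get_batch(catalog, scroll_id, page=1, per_page=20):
    # GET /Plone/search?scroll_id=...&page=N&per_page=M
    expiry, rids = SCROLL_CONTEXTS[scroll_id]  # would 404 on unknown/expired ids
    SCROLL_CONTEXTS[scroll_id] = (time.time() + SCROLL_TTL, rids)  # reset TTL

    start = (page - 1) * per_page
    page_rids = rids[start:start + per_page]
    # Resolving RIDs back to brains; catalog._catalog[rid] is a ZCatalog
    # implementation detail, a supported API may be preferable.
    brains = [catalog._catalog[rid] for rid in page_rids]

    body = {"items": [brain.getPath() for brain in brains]}
    if start + per_page < len(rids):
        body["next"] = "/Plone/search?scroll_id=%s&page=%d&per_page=%d" % (
            scroll_id, page + 1, per_page)
    return json.dumps(body)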

Search contexts that have exceeded their TTL would be destroyed with the next POST. In addition, they could be actively cleared by the client using DELETE or PURGE.

Compared to a simple, stateless GET implementation, I see these pros/cons:

Advantages:

  • Stable resultsets
  • Appropriate use of HTTP methods (IMHO)
  • Allows for complex queries by using JSON in POST body
  • Still allows for hypermedia batching links because those requests are GET with query string params

Disadvantages:

  • Stateful - REST / HATEOAS?
  • Requires at least two requests for even the most trivial search
  • DB write for search / query operations
  • The returned metadata from the brains would still be up to date (not frozen in time). This could lead to surprising results if an object that matched at query time is included in the resultset, but has been changed since and, according to its metadata, wouldn't match the query any more.

[1] Is there a way to get the brain RIDs from a catalog resultset (LazyMap) without destroying its laziness? If not, that would at least partly defeat the purpose of batching 😢
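(One possible answer, if I read the ZCatalog Lazy implementation correctly: LazyMap keeps the raw RID sequence it maps over in its private _seq attribute, so something like the following might avoid waking any brains - but it depends on private attributes:)

def lazy_rids(results):
    # LazyMap stores the raw RID sequence in `_seq` - a private attribute,
    # so this is fragile across ZCatalog versions.
    seq = getattr(results, "_seq", None)
    if seq is not None:
        return list(seq)
    # Fallback: calling getRID() on each brain works, but instantiates
    # every brain and thereby defeats the laziness.
    return [brain.getRID() for brain in results]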
