(I'm just dumping this here to not have the conversation in #45 get too convoluted - for now I see this as low to medium priority).
Once we implement some sort of batching / pagination, there are some inherent race conditions that can occur:
Imagine a search query. Because fetching a batch page happens in a separate request, the extent and order of the resultset for a given query can change between retrieving batch pages if another client modifies the DB in between. When a consumer simply iterates over all batch pages, this can lead to duplicate entries or entries that get dropped between pages.
ElasticSearch addresses this in a rather elegant way with its Scroll API:
- The first request just creates a server-side, persistent search context that has a certain time to live (TTL).
- That request is answered with a response that basically just contains a `_scroll_id` that uniquely identifies the resultset created by the query at that point in time.
- To fetch the results, the client issues subsequent requests that reference the search context via its `_scroll_id` and retrieve a particular batch page from it. On each of those requests the TTL for the search context is reset, so it is kept alive for another `$TTL` minutes.
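The scroll mechanics above can be sketched with a small in-memory store (a simplified stand-in for ElasticSearch's server-side state, not its actual implementation; the names `ScrollStore`, `create` and `fetch` are made up for illustration):

```python
import time
import uuid


class ScrollStore:
    """In-memory stand-in for server-side scroll contexts with a TTL."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.contexts = {}  # scroll_id -> [expiry, frozen resultset]

    def create(self, resultset):
        """Freeze a resultset and return a scroll_id referencing it."""
        scroll_id = uuid.uuid4().hex[:7]
        self.contexts[scroll_id] = [time.time() + self.ttl, list(resultset)]
        return scroll_id

    def fetch(self, scroll_id, page, per_page):
        """Return one batch page and reset the context's TTL."""
        self._expire()
        entry = self.contexts.get(scroll_id)
        if entry is None:
            raise KeyError("search context expired or unknown")
        entry[0] = time.time() + self.ttl  # keep-alive: reset the TTL
        results = entry[1]
        start = (page - 1) * per_page
        return results[start:start + per_page]

    def _expire(self):
        """Drop contexts whose TTL has elapsed."""
        now = time.time()
        for sid in [s for s, (exp, _) in self.contexts.items() if exp < now]:
            del self.contexts[sid]
```

Because the resultset is frozen when the context is created, later modifications to the underlying data cannot produce duplicates or drops between batch pages.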
I could see a similar concept working for us in order to provide stable resultsets for batched sequences, particularly search results.
I'm just brainstorming here, but maybe something along these lines could work:
```
POST /Plone/search
{"portal_type": "Document"}
```
This would create a server-side, persistent search context. In terms of search results, this could maybe mean persisting a list of brain RIDs [1] for the resultset that matched the query at that point in time.
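A rough sketch of what that `POST` handler could do, assuming the matching RIDs can be extracted from the catalog resultset (see footnote [1]); `CONTEXTS`, `TTL_SECONDS` and `run_query` are hypothetical names, not existing plone.restapi API:

```python
import time
import uuid

# Hypothetical in-memory storage: scroll_id -> (expiry, frozen RID list)
CONTEXTS = {}
TTL_SECONDS = 300


def create_search_context(query, run_query):
    """Handle the POST: run the catalog query once, persist the matching
    RIDs as a frozen snapshot, and hand back a scroll_id for it."""
    rids = list(run_query(query))
    scroll_id = uuid.uuid4().hex[:7]
    CONTEXTS[scroll_id] = (time.time() + TTL_SECONDS, rids)
    return {"scroll_id": scroll_id}
```

Subsequent `GET` requests would then only slice pages out of the frozen RID list instead of re-running the query.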
Returns a response with a `scroll_id`:

```json
{"scroll_id": "f40dba5"}
```
The client can then retrieve result batches via `GET` requests:

```
GET /Plone/search?scroll_id=f40dba5&page=1&per_page=20
```
The link to the first batch page can also be provided in a hypermedia fashion as part of the response to the `POST` that creates the search context.
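The hypermedia variant of the response could look roughly like this (a sketch only; the `first` key and the helper name are illustrative, not an agreed-upon format):

```python
def search_context_response(base_url, scroll_id, per_page):
    """Build the response body for the POST that created a search context,
    including a hypermedia link to the first batch page."""
    return {
        "scroll_id": scroll_id,
        "first": "%s/search?scroll_id=%s&page=1&per_page=%d" % (
            base_url, scroll_id, per_page),
    }
```

A client that follows the `first` link never has to construct batching URLs itself.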
Search contexts that have exceeded their TTL would be destroyed with the next `POST`. In addition, they could be actively cleared by the client using `DELETE` or `PURGE`.
Compared to a simple, stateless `GET` implementation, I see these pros/cons:

Advantages:

- Stable resultsets
- Appropriate use of HTTP methods (IMHO)
- Allows for complex queries by using JSON in the `POST` body
- Still allows for hypermedia batching links, because those requests are `GET`s with query string params

Disadvantages:

- Stateful - how does this square with REST / HATEOAS?
- Requires at least two requests for even the most trivial search
- DB write for search / query operations
- The returned metadata from the brains would still be up to date (not frozen in time). This could lead to some surprising results: an object that matched at query time stays in the resultset even if it has been changed since, so according to its current metadata it might not match the query any more.
[1] Is there a way to get the brain RIDs from a catalog resultset (`LazyMap`) without destroying its laziness? If not, that would at least partly defeat the purpose of batching 😢
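For context, here is a minimal model of how a `LazyMap` defers work (a simplified stand-in, not Zope's real class): it stores the raw sequence of RIDs and only applies the mapping function, i.e. brain construction, on item access. If the real implementation exposes that raw sequence, the RIDs could be snapshotted without waking a single brain:

```python
class LazyMap:
    """Simplified model of a lazy map over a sequence of RIDs: the
    function (e.g. brain construction) runs only on item access."""

    def __init__(self, func, seq):
        self._func = func
        self._seq = seq  # raw RIDs, untouched until an item is accessed

    def __getitem__(self, index):
        return self._func(self._seq[index])

    def __len__(self):
        return len(self._seq)


calls = []


def make_brain(rid):
    """Stand-in for expensive brain construction; records each call."""
    calls.append(rid)
    return {"rid": rid}


results = LazyMap(make_brain, [11, 22, 33])
rids = list(results._seq)  # snapshot the RIDs without building any brains
assert calls == []         # laziness preserved so far
brain = results[1]         # only now is a single brain constructed
assert calls == [22] and brain["rid"] == 22
```

Whether the actual catalog `LazyMap` exposes its underlying RID sequence like this would need to be verified against the Zope sources.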