Skip to content

Add "batch upload" API to the sync1.5 docs. #60

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 29, 2016
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions source/respcodes.rst
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,8 @@ Server-produced response codes
+------+-----------------------------------------------------------------------------------------------------+
| 16 | Client upgrade required |
+------+-----------------------------------------------------------------------------------------------------+
| 17 | Size limit exceeded |
+------+-----------------------------------------------------------------------------------------------------+

Infrastructure-produced response codes
--------------------------------------
Expand Down
226 changes: 211 additions & 15 deletions source/storage/apis-1.5.rst
Original file line number Diff line number Diff line change
Expand Up @@ -192,6 +192,31 @@ the user's data store as a whole.
the total number of items in each collection.


**GET** **https://<endpoint-url>/info/configuration**

Provides information about the configuration of this storage server
with respect to various protocol and size limits. Returns an object
mapping configuration item names to their values as enforced by this
server. The following configuration items may be present:

- **max_request_bytes**: the maximum size in bytes of the overall
HTTP request body that will be accepted by the server.

- **max_post_records**: the maximum number of records that can be
uploaded to a collection in a single POST request.

- **max_post_bytes**: the maximum combined size in bytes of the
record payloads that can be uploaded to a collection in a single
POST request.

- **max_total_records**: the maximum total number of records that can be
uploaded to a collection as part of a batched upload.

- **max_total_bytes**: the maximum total combined size in bytes of the
record payloads that can be uploaded to a collection as part of
a batched upload.


**DELETE** **https://<endpoint-url>/storage**

Deletes all records for the user. This is URL is provided for backwards-
Expand Down Expand Up @@ -253,7 +278,7 @@ collection.
the **offset** parameter to efficiently skip over the items that have
already been read. See :ref:`syncstorage_paging` for an example.

Two output formats are available for multiple record GET requests.
Two output formats are available for multiple-record GET requests.
They are triggered by the presence of the appropriate format in the
*Accept* request header and are prioritized in the order listed below:

Expand Down Expand Up @@ -315,6 +340,23 @@ collection.
this means that fields not provided in the request body will not be
overwritten on BSOs that already exist.

Two input formats are available for multiple-record POST requests,
selected by the *Content-Type* header of the request:

- **application/json**: the input is a JSON list of objects, one for
for each BSO in the request.

- **application/newlines**: each BSO is sent as a separate JSON object
followed by a newline.

For backwards-compatibility with existing clients, the server will also
treat **text/plain** input as JSON.

Note that the server may impose a limit on the total amount of data
included in the request, and/or may decline to process more than a certain
number of BSOs in a single request. The default limit on the number
of BSOs per request is 100.

Successful responses will contain a JSON object with details of success
or failure for each BSO. It will have the following keys:

Expand All @@ -338,26 +380,65 @@ collection.
Posted BSOs whose ids do not appear in either "success" or "failed"
should be treated as having failed for an unspecified reason.

Two input formats are available for multiple record POST requests,
selected by the *Content-Type* header of the request:
To allow upload of large numbers of items while ensuring that other
clients do not sync down inconsistent data, servers may support combining
several POST requests into a single "batch" so that all modified BSOs appear
to have been submitted at the same time. Batching behaviour is controlled
by the following query parameters:

- **batch**: indicates that uploads should be batched together into a
single conceptual update. To begin a new batch pass the string 'true'.
To add more items to an existing batch pass a previously-obtained batch
identifier. This parameter is ignored by servers that do not support
batching.

- **commit**: indicates that the batch should be committed, and all items
uploaded to that batch made visible to other clients. If present, it
must be the string 'true' and the **batch** query parameter must also
be specified.

When submitting items for inclusion in a multi-request batch upload,
successful responses will have a "202 Accepted" status code, and will
contain a JSON object giving the batch identifier rather than modification
time, alongside individual success or failure status for each item
that was sent.

- **application/json**: the input is a JSON list of objects, one for
for each BSO in the request.
For example::

- **application/newlines**: each BSO is sent as a separate JSON object
followed by a newline.
{
"batch": "OPAQUEBATCHID",
"success": ["GXS58IDC_12", "GXS58IDC_13", "GXS58IDC_15",
"GXS58IDC_16", "GXS58IDC_18", "GXS58IDC_19"],
"failed": {"GXS58IDC_11": "invalid ttl"],
"GXS58IDC_14": "invalid sortindex"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably failures here will only be in record parsing, not in storage, right? Is it possible for a record to transition from success in an intermediate 202 to failed in the final response?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moreover, do success and failure lists appear in the intermediate batch responses, or only at the end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presumably failures here will only be in record parsing, not in storage, right?

Right. The idea would be that if we've reported success at this stage, it'll only fail if the entire batch fails.

do success and failure lists appear in the intermediate batch responses

I think they should, so that we can report individual item failures to parse etc. What I don't think we should do is the current behavior were we tell you that the first 100 items succeeded, but all the others were rejected because you sent too many. That should just fail the entire request when in batch mode. (It should fail the entire request in non-batch mode to, but that would risk preventing old clients who send such items from making progress)

}

For backwards-compatibility with existing clients, the server will also
treat **text/plain** input as JSON.
The returned value of "batch" can be passed in the "batch" query parameter
to add more items to the batch. Items that appear in the "success" list
are guaranteed to become available to other clients if and when the batch
is successfully committed.

Note that the server may impose a limit on the total amount of data
included in the request, and/or may decline to process more than a certain
number of BSOs in a single request. The default limit on the number
of BSOs per request is 100.
If the server does not support batching, it will ignore the **batch** parameter
and return a "200 OK" response without a batch identifier.

The response when committing a batch is identical to that generated by
a non-batched request. Note that the semantics of a request with
**batch=true&commit=true** (i.e. starting a batch and immediately
committing it) are therefore identical to those of a non-batched request.

Note that the server may impose a limit on the total amount of payload data
included in a batch, and/or may decline to process more than a certain
number of BSOs as part of a single batch. If the uploaded items exceed
this limit, the server will produce a **400 Bad Request** response with
response code **17**. Where possible, clients should use the
*X-Weave-Total-Records** and *X-Weave-Total-Bytes* headers to signal
the expected total size of the uploads, so that oversize batches can be rejected
before the items are uploaded.

Potential HTTP error responses include:

- **400 Bad Request:** the user has exceeded their storage quota.
- **400 Bad Request, response code 14:** the user has exceeded their storage quota.
- **400 Bad Request, response code 17:** server size or item-count limit exceeded.
- **413 Request Entity Too Large:** the request contains more data than the
server is willing to process in a single batch.

Expand Down Expand Up @@ -432,6 +513,47 @@ Request Headers
response will be returned.


**X-Weave-Records**

This header may be sent with multi-record uploads, to indicate the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume all these headers only make sense with batch=true (ie, for the first post rather than subsequent ones)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The non-batch ones, I'd be happy to accept for non-batch uploads in order to trigger the fail-the-entire-request behaviour. Right now, we'll accept requests with say 200 items, save the first 100 of them, and tell you that the others failed.

total number of records included in the request. If the server
would not accept an upload containing that many records, then a
**400 Bad Request** response will be returned with response code **17**.


**X-Weave-Bytes**

This header may be sent with multi-record uploads, to indicate the
combined size of payloads in the upload, in bytes. If the server
would not accept an upload containing that many bytes, then a
**400 Bad Request** response will be returned with response code **17**.


**X-Weave-Total-Records**

This header may be included with a POST request using the **batch** query
parameter, to indicate the total number of records in the batch.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wording concern: "batch" is ambiguous — this batch, or the total?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my head a "batch" is a series of consecutive POST requests that will committed as a single unit, so "batch is approximately the same as "total". Would it help if I said "in the entire batch"? Alternately, s/Batch/Total/ here and below to avoid confusion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, see note below!

English is a squirming beast — we upload things "in batches of 100", so each collection of records is a batch… but we're also uploading a collection of identical batches, and it's also reasonable to call that collection of batches a batch! That's particularly true from the server's perspective.

And it's also (again, primarily from the server's perspective) fair to call the entire collection of all uploaded records a batch — after all, they'll be processed together in one go once they leave the staging area.

("a quantity or consignment of goods produced at one time: a batch of cookies | the company undertakes thirty-six separate quality control checks on every batch.")

I don't know what word one would use for a sequential collection of batches of items.

Maybe we should always use "batch" for a collection of records uploaded at one time, and "transaction" or something like it for all of the records uploaded to process together? 'Total' or 'transaction' both work for me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rtilder I'd like to get your take on the above naming discussion, as a fresh set of eyes ^

If the server would not accept a batch containing that many records,
then a **400 Bad Request** response will be returned with response
code **17**.

If the value of this header is not a valid positive integer value, or if
the request is not operating on a batch, then a **400 Bad Request**
response will be returned with response code **1**.

**X-Weave-Total-Bytes**

This header may be included with a POST request using the **batch** query
parameter, to indicate the total combined size of payloads in the batch,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, combined. Perhaps let's use X-Weave-Batch-Whatever for the individual batches, and X-Weave-Total-Whatever for the combination?

in bytes. If the server would not accept a batch containing that many
bytes, then a **400 Bad Request** response will be returned with response
code **17**.

If the value of this header is not a valid positive integer value, or if
the request is not operating on a batch, then a **400 Bad Request**
response will be returned with response code **1**.


Response Headers
================

Expand Down Expand Up @@ -548,8 +670,16 @@ protocol.
body cannot be parsed.

If the response has a *Content-Type* of **application/json** then the body
will be an integer response code as documented in :ref:`respcodes`.
will be an integer response code as documented in :ref:`respcodes`. The
respcodes with particular meaning in this protocol include:

- **6**: JSON parse failure, likely due to badly-formed POST data.
- **8**: invalid BSO, likely due to badly-formed POST data.
- **13**: invalid collection, likely invalid chars incollection name.
- **14**: user has exceeded their storage quota.
- **16**: client is known to be incompatible with the server.
- **17**: server limit exceeded, likely due to too many items or
too large a payload in a POST request.

**401 Unauthorized**

Expand Down Expand Up @@ -795,6 +925,64 @@ collection, this technique should always be combined with the
next_offset = r.headers.get("X-Weave-Next-Offset")


.. _syncstorage_batch_upload:

Example: uploading a large batch of items
-----------------------------------------

The syncstorage server allows several upload requests to be combined into a
single "batch" so that they all become visible to other clients as a single
atomic unit. This is achieved by using the **batch** and **commit** parameters
on the upload request.

Clients should begin by issuing a **POST /storage/<collection>?batch=true**
request, which will accept items for upload and issue a new batch id in the
response body.

To add more items to the batch, make additional **POST** requests to the
collection using the value of **batch** from the response body as the
**batch** query parameter.

When the final items have been uploaded, pass the **commit** query parameter
to the **POST** request. This will finalize the batch and make the uploaded
items visible to other clients. The last-modified time of the collection,
as well as of all items included as part of the batch, will be incremented to
the timestamp of this final **commit** request.

To guard against other clients concurrently committing changes to the
collection, this technique should always be combined with the
**X-If-Unmodified-Since** header as shown below::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pokes at something fairly fundamental. I think this should be phrased as "To guard against other clients concurrently committing changes to the collection" — X-I-U-S doesn't, can't, and shouldn't detect other in-progress batches.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


# Make an initial request to start a batch upload.
# It's possible to send some items here, but not required.
r = server.post("/collection?batch=true", [])
batch_id = r.json_body["batch"]

# Always use X-If-Unmodified-Since to detect conflicts.
last_modified = r.headers["X-Last-Modified"]
headers = {"X-If-Unmodified-Since": last_modified}

for items in split_items_into_smaller_batches():

# Send the items in several smaller batches.
r = server.post("/collection?batch=" + batch_id, items, headers)
if r.status == 412:
raise Exception("COLLECTION WAS MODIFIED WHILE UPLOADING ITEMS")

# The collection will not be modified yet.
assert r.headers['X-Last-Modified'] == last_modified

# Commit the batch once all items are uploaded.
# Again, it's possible to send some final items here, but not required.
r = server.post("/collection?commit=true&batch=" + batch_id, [], headers)
if r.status == 412:
raise Exception("COLLECTION WAS MODIFIED WHILE COMMITTING ITEMS")

# At this point all the uploaded items become visible,
# and the collection appears modified to other clients.
assert r.headers['X-Last-Modified'] > last_modified


Changes from v1.1
=================

Expand Down Expand Up @@ -876,4 +1064,12 @@ The following is a summary of protocol changes from
| | guarantees that will improve overall robustness |
| | of the service. |
+-------------------------------------------+---------------------------------------------------+
| Batch uploads are supported that cross | This is a backwards-compatible API extension that |
| several POST requests. | allows clients to ensure consistency of their |
| | uploaded items. |
+-------------------------------------------+---------------------------------------------------+
| Various server-specific size limits can | This is a backwards-compatible API extension that |
| be read from a new /info/configuration | allows clients to ensure interoperability with |
| endpoint. | configurable server behaviour. |
+-------------------------------------------+---------------------------------------------------+