
Idempotency for API 0.6 #2201

Open
mmd-osm opened this issue Apr 6, 2019 · 57 comments

@mmd-osm
Contributor

commented Apr 6, 2019

API 0.6 has no mechanism to prevent duplicate POST operations, which can happen when a client retries a failed network operation. As per RFC 7231, POST operations are not idempotent, which is particularly unfortunate for diff uploads: assuming the changeset consists of "create" operations only, those objects might be created multiple times in the database.

I've created this issue to discuss different options (such as adding a backward-compatible HTTP header "Idempotency-Key") and to find ways to implement it in the API.

Some pointers:

The linked RESTful API guidelines propose:

  • Should: Consider To Design POST and PATCH Idempotent [229]:

A client specific idempotency key provided via Idempotency-Key header in the request. The key is not part of the resource but stored temporarily pointing to the original response to ensure idempotent behavior when retrying a request (see May: Consider to Support Idempotency-Key Header).
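
For illustration, a client-side sketch in Java (the URL and payload are placeholders, not actual OSM client code): the key is generated once per logical upload, outside any retry loop, and attached to the first attempt and every retry.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.UUID;

public class IdempotentPost {

    // Build a POST request carrying a client-generated Idempotency-Key.
    static HttpRequest buildUploadRequest(String url, String body, String idempotencyKey) {
        return HttpRequest.newBuilder()
                .uri(URI.create(url))
                .header("Idempotency-Key", idempotencyKey)
                .header("Content-Type", "text/xml")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        // One key per logical upload, generated outside the retry loop...
        String key = UUID.randomUUID().toString();
        String url = "http://localhost:3000/api/0.6/changeset/1303/upload"; // placeholder
        // ...so a first attempt and a later retry carry the same key,
        // letting the server recognise the retry as a resent request.
        HttpRequest first = buildUploadRequest(url, "<osmChange/>", key);
        HttpRequest retry = buildUploadRequest(url, "<osmChange/>", key);
        System.out.println(first.headers().firstValue("Idempotency-Key")
                .equals(retry.headers().firstValue("Idempotency-Key"))); // true
    }
}
```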

@tomhughes

Member

commented Apr 6, 2019

I don't understand. Why would they get created more than once?

Are you saying that the client is repeating a POST operation? If so then that client is buggy as POST is, as you say, not idempotent so clients should not repeat it.

Adding some new header is not going to fix a broken client because a broken client that replays POST is just as likely to not implement that extra header correctly!

@tomhughes

Member

commented Apr 6, 2019

I think what you're really asking for here is the asynchronous API that has often been discussed, where the changeset is uploaded and then the client polls for a result. That is effectively what your solution does. I believe there are already tickets for that.

@mmd-osm

Contributor Author

commented Apr 6, 2019

I've seen the asynchronous API and upload queue proposal by Matt in his "Road to API 0.7" blog post. This one is slightly different. It's still synchronous.

In case the client retries the POST operation, the server has a way to figure out that the data has already been processed. It can respond with the previous "diffResult" response, which the client didn't receive the first time.

@mmd-osm

Contributor Author

commented Apr 6, 2019

Why would they get created more than once?

A typical message would, say, create 10 new nodes. All of them have negative ids. Sending that message again because of some network failure will create another 10 nodes with different node ids.

If there's some networking issue, editors will let the user retry (or do the retry on their own). This behavior is pretty common across clients these days.

Example from JOSM (OsmApi.java):

    /**
     * Generic method for sending requests to the OSM API.
     *
     * This method will automatically re-try any requests that are answered with a 5xx
     * error code, or that resulted in a timeout exception from the TCP layer.
     *
     * @param requestMethod The http method used when talking with the server.
 

Of course, clients would have to be adapted to send the additional Header, in which case they would have a chance to avoid duplicate data upload. Today it's basically not feasible to detect and avoid this situation after a network failure.

@tomhughes

Member

commented Apr 6, 2019

Yes what JOSM does is broken and they should stop doing it.

The only difference between this and the upload queue is whether the POST returns immediately or tries to wait - hell you could trivially make that optional so the caller could choose.

Well, that, and the queue version avoids having to reupload everything on future calls; though again it wouldn't matter if you did, it could just ignore it and return the cached result.

@mmd-osm

Contributor Author

commented Apr 6, 2019

The only difference between this and the upload queue is whether the POST returns immediately or tries to wait

Exactly. The only difference is that the queue requires a second call (or even more of them) to figure out what the status on the server was. I thought that's a bit of overkill for a changeset upload that takes seconds rather than minutes. Also, it would probably be more effort on the client side to implement such a queue-based protocol, and it wouldn't be compatible with what we have today.

Well, that, and the queue version avoids having to reupload everything on future calls; though again it wouldn't matter if you did, it could just ignore it and return the cached result.

Well, reuploading doesn't hurt enough to justify the additional complexity of a queue. Otherwise, it's exactly as you say: ignore the message and return the cached response.

@tomhughes

Member

commented Apr 6, 2019

Well, you effectively have the queue on the server anyway - something has to store the idempotency key and the corresponding request and result, and what is that if it's not a queue? It serves the same function and has essentially the same complexity.

You could even argue it's worse, at least while we're using Rails for uploads, as Rails already has all the technology for queues while we would have to build our own solution to cache results!

@mmd-osm

Contributor Author

commented Apr 6, 2019

I don't know exactly how this would be implemented in Rails. One approach I found on the web used some custom Idempotency middleware, and used a db table to manage previous results (or on file / redis / whatever): https://ieftimov.com/post/understand-how-why-add-idempotent-requests-api/

That would still all be synchronous as today, with the only addition of requiring an "Idempotency-Key" http header.

Did you have something similar in mind?

@simonpoole

Contributor

commented Apr 6, 2019

Yes what JOSM does is broken and they should stop doing it.

While the automatic retry is bad, I have to say from an editor POV there is no good solution, as invariably the link between template ids in the editor and what has actually been created in the DB is lost.

So it is one of

  • re-uploading - potentially creating duplicates in the DB
  • refreshing the data in the editor - creates dupes in the editor which may or may not be noticed by the user, but which will, if uploaded, create dupes in the DB
  • replace the data in the editor - potentially (if the upload wasn't actually completed) zapping all the edits by the user (this is what should be done if there was a way to check if the upload was actually successful)

IMHO a possible minimal-impact solution would be to close the changeset automatically after a successful diff upload (which boils down to allowing just one diff upload per changeset). This would allow the editor, after a "perhaps failed" upload, to check whether it actually succeeded by querying whether the changeset is still open.

Or other variants on the same theme, for example an upload counter per changeset. It definitely isn't necessary to create additional complexity by adding completely new concepts to solve this.

@mmd-osm

Contributor Author

commented Apr 6, 2019

I have to say from an editor POV there is no good solution as invariably the link between template ids in the editor and what has actually been created in the DB is lost.

...unless the diffResult response is important enough to store in the database for later retrieval, associated with some kind of unique identifier provided by the client. iD, as an example, doesn't parse the diffResult at all, while JOSM depends on it.

I have to say that I don't find any of the 3 options you mentioned compelling, and that's the main motivation for opening up this issue. JOSM implements option 1, iD either 1 and/or 3 (not exactly sure at the moment), maybe some other editor does 2 already.

Let's try to break down the different alternatives a bit:

Single upload per changeset

Allowing only a single upload per changeset sounds tempting, but it doesn't solve the diffResult issue, and a few JOSM users would probably be unhappy with this approach (iD already supports only a single upload per changeset).

Let's assume someone wants to upload 12'000 changes: the first upload with 10'000 changes runs, the changeset gets closed, yet the diffResult gets lost due to some networking issue on the way back. How would you match the actual object ids to the placeholder ids in this case? This information may be crucial for the remaining 2'000 objects, if they have references to those 10'000 objects.

(By the way, 10'000 is just the maximum changeset size I used as an example. If a user has configured their JOSM to upload data in chunks of e.g. 500 objects, this very same situation can obviously occur with much smaller changesets.)

Track upload via changeset upload counter

Using some kind of changeset counter would allow a heuristic to find out whether a changeset has been uploaded. Like in the previous case, it doesn't solve the diffResult issue. Implementation-wise you would have to

  • extend the changeset table,
  • extend the query changeset API call to return the additional counter field,
  • change the changeset upload to update the counter,
  • issue a separate changeset query call on the client after a failed changeset upload call,
  • download the changeset,
  • implement some heuristic to match actual object ids back to placeholder ids (unless no more data is to be uploaded, in which case you can simply re-download the area in question, like iD does).

@bhousel

Member

commented Apr 6, 2019

I have to say that I don't find any of the 3 options you mentioned compelling, and that's the main motivation for opening up this issue. JOSM implements option 1, iD either 1 and/or 3 (not exactly sure at the moment), maybe some other editor does 2 already.

iD does option 3 currently "replace the data in the editor - potentially zapping all the edits by the user"

The reason for this is: something we've seen users do, which was surprising to me (but not really), is start their upload and then just close the browser tab and open a new one. Obviously in situations like this, iD has no way of knowing whether that upload will ever complete, and offering to "restore" their changes would introduce duplicates, so we just wipe their history as soon as the changeset is in flight.

@simonpoole

Contributor

commented Apr 6, 2019

@mmd-osm

Contributor Author

commented Apr 6, 2019

If successful there is no need to update the local data after the fact, as that is only an optimisation (which iD for example does), which can be replaced by downloading again.

Well, this is how JOSM works today. All local placeholders get automatically updated by the diffResult, and re-downloading the area isn't really needed. In special cases where people use e.g. sparse editing, it might even be impossible to simply re-download an area using the /map call. JOSM handles this case pretty well today solely based on diffResults.

Chunked uploads are problematic, but that isn't covered by the original proposal either.

According to my understanding of how JOSM works, chunked uploads are simply a series of changeset upload API calls. The idea is that the editor temporarily assigns a unique identifier to each chunk before uploading. If there's some issue while uploading a chunk, a retry would simply provide the same id in the upload.

What I take from this discussion is the following alternative:

  • provide a "last diff result" API call

One potential source of issues I see here is in-flight transactions. Assuming t2 is still in flight while the client has lost the connection, a subsequent "last diff result" might still return t1's diffResult and trick the client into resending t2. By construction, the server cannot tell in this case that it's in fact a duplicate to be discarded. This may be a different story if the "last diff result" call needs to acquire a shared lock on the changeset table first, and would simply have to wait for the in-flight transaction t2 to finish. Depending on how this call is implemented, the client won't know exactly when it's safe to call this "last diff result" API endpoint. That's a potential disadvantage of using a relative rather than an absolute term when referring to different uploads.

NB: Both alternatives may be similar enough that they could be served by the same db persistence. After all, storing the "last diff result" along with the latest Idempotency-Key (if provided) for a given changeset might be sufficient for our purposes. This needs to be analyzed in more detail.

@mmd-osm

Contributor Author

commented Apr 7, 2019

A minimum-scope approach for JOSM could be as follows: generate a new Idempotency-Key outside the retry loop, and pass it as an HTTP header on POST requests.

I also added some log file excerpts below to demo a chunked upload. Every single upload has its own Idempotency-Key. Towards the end, I shut down the Rails server, and JOSM did one unsuccessful POST, followed by another 5 unsuccessful retries, each time sending the same key.

Assuming we had some corresponding logic on the backend as well, this would already improve today's situation. Of course there's always room for improvement, but I don't think this will add a massive burden to clients willing to implement this approach.

Index: src/org/openstreetmap/josm/io/OsmApi.java
===================================================================

 import java.io.StringReader;
 import java.util.HashMap;
 import java.util.List;
 import java.util.Map;
+import java.util.UUID;
 import java.util.function.Consumer;

... 

     /**
      * Generic method for sending requests to the OSM API.
      *
     protected final String sendRequest(String requestMethod, String urlSuffix, String requestBody, ProgressMonitor monitor,
             boolean doAuthenticate, boolean fastFail) throws OsmTransferException {
         int retries = fastFail ? 0 : getMaxRetries();
+   
+        // Generate new Idempotency-Key
+        String idempotencyKey = UUID.randomUUID().toString();
 
         while (true) { // the retry loop
             try {
                     addAuth(client);
                 }
 
+                if ("POST".equals(requestMethod)) {
+                    client.setHeader("Idempotency-Key", idempotencyKey);
+                    Logging.info("Using Idempotency-Key " + idempotencyKey);
+                }
+


2019-04-07 13:13:36.106 INFORMATION: OK
2019-04-07 13:13:36.208 INFORMATION: Using Idempotency-Key e8f69ee7-b958-4d95-a7bd-b0d964947cf5
changeset/1303/upload
2019-04-07 13:13:36.208 INFORMATION: POST http://localhost:3000/api/0.6/changeset/1303/upload (159 B) ...
2019-04-07 13:13:36.253 INFORMATION: POST http://localhost:3000/api/0.6/changeset/1303/upload -> 200
2019-04-07 13:13:36.253 INFORMATION: OK
2019-04-07 13:13:36.355 INFORMATION: Using Idempotency-Key e8f40f0b-27c1-4934-93e3-3646f29dc245
changeset/1303/upload
2019-04-07 13:13:36.355 INFORMATION: POST http://localhost:3000/api/0.6/changeset/1303/upload (160 B) ...
2019-04-07 13:13:36.405 INFORMATION: POST http://localhost:3000/api/0.6/changeset/1303/upload -> 200
2019-04-07 13:13:36.405 INFORMATION: OK
2019-04-07 13:13:36.507 INFORMATION: Using Idempotency-Key 973c9c41-fd18-4346-8536-00b55ab4e6e2
changeset/1303/upload
2019-04-07 13:13:36.507 INFORMATION: POST http://localhost:3000/api/0.6/changeset/1303/upload (160 B) ...
2019-04-07 13:13:36.553 INFORMATION: POST http://localhost:3000/api/0.6/changeset/1303/upload -> 200
2019-04-07 13:13:36.553 INFORMATION: OK
2019-04-07 13:13:36.655 INFORMATION: Using Idempotency-Key 7f4c3f5b-3180-4dd9-b13b-f7e8d55246db
changeset/1303/upload
2019-04-07 13:13:36.655 INFORMATION: POST http://localhost:3000/api/0.6/changeset/1303/upload (159 B) ...
2019-04-07 13:13:36.701 INFORMATION: POST http://localhost:3000/api/0.6/changeset/1303/upload -> 200
2019-04-07 13:13:36.701 INFORMATION: OK
2019-04-07 13:13:36.803 INFORMATION: Using Idempotency-Key ee95ef0f-e760-42fb-ac29-f25dcfeaff90
changeset/1303/upload
2019-04-07 13:13:36.803 INFORMATION: POST http://localhost:3000/api/0.6/changeset/1303/upload (160 B) ...
2019-04-07 13:13:36.804 INFORMATION: Using Idempotency-Key ee95ef0f-e760-42fb-ac29-f25dcfeaff90
changeset/1303/upload
2019-04-07 13:13:36.805 INFORMATION: POST http://localhost:3000/api/0.6/changeset/1303/upload (160 B) ...
2019-04-07 13:13:36.805 INFORMATION: Using Idempotency-Key ee95ef0f-e760-42fb-ac29-f25dcfeaff90
changeset/1303/upload
2019-04-07 13:13:36.806 INFORMATION: POST http://localhost:3000/api/0.6/changeset/1303/upload (160 B) ...
2019-04-07 13:13:36.806 INFORMATION: Using Idempotency-Key ee95ef0f-e760-42fb-ac29-f25dcfeaff90
changeset/1303/upload
2019-04-07 13:13:36.807 INFORMATION: POST http://localhost:3000/api/0.6/changeset/1303/upload (160 B) ...
2019-04-07 13:13:36.807 INFORMATION: Using Idempotency-Key ee95ef0f-e760-42fb-ac29-f25dcfeaff90
changeset/1303/upload
2019-04-07 13:13:36.807 INFORMATION: POST http://localhost:3000/api/0.6/changeset/1303/upload (160 B) ...
2019-04-07 13:13:36.808 INFORMATION: Using Idempotency-Key ee95ef0f-e760-42fb-ac29-f25dcfeaff90
changeset/1303/upload
@tomhughes

Member

commented Apr 7, 2019

That's all very nice, but it ignores how complicated this would be to implement on the server.

Even if you had somebody willing to try and implement it on the server, and a plausible way to do so, which it is not at all clear even exists in our current architecture, it would just be wrong on principle.

Just as a matter of principle I don't think we should accept anything that attempts to make an operation which is defined to be non-idempotent behave idempotently.

I'm not saying there isn't a problem that should be solved, just that I don't think this is the way to do it.

@mmd-osm

Contributor Author

commented Apr 7, 2019

That's all very nice, but it ignores how complicated this would be to implement on the server.

Just to finish off what I originally had in mind:

Changeset upload

  1. Acquire an exclusive lock on the changeset
  2. Check if an Idempotency-Key was provided in the request; if so:
    2.1 Fetch (Changeset Id, Idempotency-Key) from DB table changeset_diffresults
    2.2 If an entry exists, check whether the request hash in the DB matches the hash of the current request; possibly return an error if they mismatch (409 Conflict?)
    2.3 If they match, use the cached diffResult from the DB --> jump to step 6
  3. Process the osmChange message
  4. In case of success, and if an Idempotency-Key is available:
    4.1 Store (Changeset Id, Idempotency-Key, Hash, current time, diffResult) in DB table changeset_diffresults
  5. Commit the transaction
  6. Return the diffResult response to the client
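
The lookup-or-process logic above can be sketched as follows (a minimal illustration in Java, with an in-memory map standing in for the changeset_diffresults table; process() is a made-up stand-in for the real diff processing, and locking/commit are omitted):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class IdempotencyCache {

    // One cached row: request hash plus the stored diffResult.
    record Entry(String requestHash, String diffResult) {}

    // Stands in for the changeset_diffresults table, keyed by (changeset, key).
    private final Map<String, Entry> table = new HashMap<>();

    static String hash(String osmChange) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(osmChange.getBytes(StandardCharsets.UTF_8));
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    String upload(long changesetId, String idempotencyKey, String osmChange) {
        String cacheKey = changesetId + "/" + idempotencyKey;        // step 2.1
        String requestHash = hash(osmChange);
        Entry cached = table.get(cacheKey);
        if (cached != null) {
            if (!cached.requestHash().equals(requestHash))           // step 2.2
                throw new IllegalStateException("409 Conflict: same key, different payload");
            return cached.diffResult();                              // steps 2.3 / 6: replay
        }
        String diffResult = process(osmChange);                      // step 3
        table.put(cacheKey, new Entry(requestHash, diffResult));     // step 4.1
        return diffResult;                                           // step 6
    }

    private int counter = 0;
    private String process(String osmChange) {
        // Each real processing run would create objects and yield a new diffResult.
        return "diffResult-" + (++counter);
    }

    public static void main(String[] args) {
        IdempotencyCache server = new IdempotencyCache();
        String first = server.upload(1303, "key-a", "<osmChange/>");
        String retry = server.upload(1303, "key-a", "<osmChange/>");
        System.out.println(first.equals(retry)); // true: the retry replays the cached result
    }
}
```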

Cleanup:

  • Close changeset: remove all entries for changeset id in changeset_diffresults table
  • Periodic job cleans up stale entries for implicitly closed changesets

Table changeset_diffresults

  • Changeset Id (Key)
  • Idempotency-Key (Key)
  • Hash of the osmChange message
  • Timestamp (for cleanup)
  • diffResult message

Example:

                   Table "public.changeset_idempotency_cache"
     Column      |            Type             | Collation | Nullable | Default 
-----------------+-----------------------------+-----------+----------+---------
 id              | bigint                      |           | not null | 
 idempotency_key | character varying           |           |          | 
 hash_value      | character varying           |           |          | 
 timestamp       | timestamp without time zone |           |          | 
 payload         | character varying           |           |          | 
Indexes:
    "changeset_idempotency_cache_pk" PRIMARY KEY, btree (id)
Foreign-key constraints:
    "changeset_idempotency_cache_fk" FOREIGN KEY (id) REFERENCES changesets(id)
  id  |           idempotency_key            |                  hash_value                  |         timestamp          |              payload               
------+--------------------------------------+----------------------------------------------+----------------------------+------------------------------------
 1315 | 4d5a01d6-c936-441c-aefd-18e754ae18c6 | k95cXlyoy+ARqzHKmIfHd0qGfuwcuEKRs/Dtd/4J/Kk= | 2019-04-17 20:17:40.473888 | 1 1 -169012 -169012 5003185586 1 0+
      |                                      |                                              |                            | 
@mmd-osm

Contributor Author

commented Apr 7, 2019

Just as a matter of principle I don't think we should accept anything that attempts to make an operation which is defined to be non-idempotent behave idempotently.

If we were to redesign the interface from the ground up, I would fully agree with you. Most likely we would first request some URL to store a later osmChange message, and use PUT instead of POST. Still, parts of what I outlined earlier would to some extent remain valid; e.g. on top of what we have today, we need to persist the diffResult somehow.

It's an interesting coincidence that the folks who came up with this approach also initially forgot to make their API fault-tolerant (see https://stripe.com/en-US/blog/idempotency). In the context of financial transactions, it's obviously a big no-go to charge your customers twice, and they had to come up with some scheme to fix this.

@tomhughes

Member

commented Apr 7, 2019

See, the problem with your design is that it requires storing potentially enormous diffResult text in the database...

Oh, and that exclusive lock is just going to time out every time if a previous upload is still running...

Also if we're checking a hash of the request why do we even need the idempotency key? I thought the idea was that the client would use a different key when the request changed so that just comparing the key was enough?

@mmd-osm

Contributor Author

commented Apr 7, 2019

Re diffResult size: the worst case scenario would be a changeset with 10k entries. This would amount to roughly 800 kB as an XML message. If it turns out that we can keep just the last upload, as @simonpoole proposed, this shouldn't be too bad? After an hour of inactivity they can be removed at the latest. We shouldn't really accumulate large amounts of data this way.

Re exclusive lock: I wasn't aware of an (exclusive) lock timeout value in production. Is there one, and how large is it?

The hash value may be a bit of overkill, I agree. I thought it might be good to have an additional safeguard in case clients send different content using the same idempotency key due to some programming mistake. We might as well return an error message in that case.

@tomhughes

Member

commented Apr 7, 2019

I've no idea what the timeout is - check the Postgres doco? There might not even be one unless Postgres decides there is a deadlock.

I mean there may be a way to try a lock without blocking, but again you would have to check the defaults. By default I'm pretty sure it will block until the lock is released, or until the lock operation times out or deadlocks.

@mmd-osm

Contributor Author

commented Apr 8, 2019

Well, according to the docs, lock_timeout has a default of 0, which turns this feature off. As we're not changing anything w/r/t locking in this proposal, an upload retry would have to wait until the original upload finishes, pretty much like it's already today.

@gravitystorm

Collaborator

commented Apr 10, 2019

I'm reasonably keen to see the API become fully idempotent. Network requests / responses / connections are lost for all kinds of reasons, and are more likely to happen the worse your internet connection (mobile, rural, far from OSMF's datacentres etc).

I know we've been planning on making a changeset upload queue, in order to avoid having to keep the connection open for significant amounts of time until the diff is finished being processed. It's still worth doing. But that doesn't fully solve the problem, since the response to the initial upload might not be received, and the client won't know whether the diff was put into the queue or not.

We will, eventually, need to go through the API to make every request idempotent, i.e. repeatable as many times as you like with no ill effects. Setting aside changesets for the moment (since they are complex and arouse passions over multiple diffs in a changeset and whatnot), even simple things like creating a note or creating a trace can be changed to become fully idempotent. If we take the view that we shouldn't mess around with non-idempotent HTTP data upload methods like POST and PATCH, that leaves PUT.

But it's not easy to make a PUT request for a new note (or trace) idempotent. One way would be to have a specific url for creating a given note, like PUT /api/x/notes/uuid-goes-here. We can make it so that the client can PUT to that url twice or more, if required, with no ill effects. Of course this still has complexity - the first request received should succeed, the second should also succeed, but PUTing something different should get a 400, and if the trace is now different (e.g. the description was changed in another request or via the web) then PUTing the original data to the original URL should still succeed but without stomping on changes that were made subsequently.

This uuid could be generated by the server (and obtained via a different call) but that's not really necessary - the client could just make one up. And at that point this is just the same as having an idempotency key in a header anyway, which I think is a slightly more standard way to do it.

So in general I support this overall topic, and I don't think it really matters whether we use POST or PUT since, as far as I can tell, creating new resources needs an idempotency key regardless of whether the method is idempotent itself.

(P.S. There's a small opportunity to say that we can make life easier for ourselves for creation by saying that everything needs to be unique, i.e. you can't have two identical notes (or traces), and so you can just PUT without an idempotency key and the server can avoid creating duplicates by assuming duplicate requests are just resent requests. But that still doesn't work, since if the server receives a new trace, and then the owner changes the description 2 seconds later on the website, and the server receives the same creation request 2 seconds after that, it doesn't have enough information to know if it's an intentional recreation or just a resent request.)
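
To make the create-by-PUT idea above concrete, here is a deliberately simplified sketch (the endpoint and names are hypothetical, and it ignores the subsequent-edit subtlety raised in the P.S.): the first PUT creates, an identical replay succeeds without creating anything, and a different body under the same uuid is rejected.

```java
import java.util.HashMap;
import java.util.Map;

public class NoteStore {

    // Stands in for the notes table, keyed by the client-generated uuid
    // from a hypothetical PUT /api/x/notes/{uuid}.
    private final Map<String, String> notes = new HashMap<>();

    // Returns an HTTP-like status code for the PUT.
    int put(String uuid, String body) {
        String existing = notes.get(uuid);
        if (existing == null) {
            notes.put(uuid, body);
            return 201; // created
        }
        if (existing.equals(body)) {
            return 200; // resent request: succeeds, creates nothing new
        }
        return 400;     // same uuid, different content
    }

    public static void main(String[] args) {
        NoteStore store = new NoteStore();
        System.out.println(store.put("uuid-1", "note text"));  // 201
        System.out.println(store.put("uuid-1", "note text"));  // 200
        System.out.println(store.put("uuid-1", "other text")); // 400
    }
}
```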

@pierzen


commented Apr 16, 2019

I'm reasonably keen to see the API become fully idempotent. Network requests / responses / connections are lost for all kinds of reasons, and are more likely to happen the worse your internet connection (mobile, rural, far from OSMF's datacentres etc).

The Reddit internet-speeds-by-country graph doesn't even mention many African countries; they are off the map! Internet connections in Africa, Haiti, etc. are often through cellphones, with low bandwidth and long latencies during the day. Poor internet and lost connections are a daily situation there, and a solution to avoid changeset duplicates would be quite useful.

@mmd-osm

Contributor Author

commented Apr 18, 2019

I think that it makes sense to consider all parts of the API where we have some potential issues with idempotency, but I'd like to see a bit of a prioritization to have a more focused effort.

As an example, I'd like to exclude "PUT /api/0.6/[node|way|relation]/create" in the first step: it doesn't follow idempotency semantics anyway, despite using the "PUT" verb; it's somewhat rarely used these days (none of the major editors use it); and the remaining tools could easily just switch to the diff upload.

I think, besides changeset upload, we should only include POSTing notes, and maybe note comments, in a first iteration, and see how far we get. Only once this is stable and all issues are ruled out should we extend the concept to the remaining bits of the API.

By the way, I added a working prototype to cgimap's changeset upload now, which we could use for further discussion, or maybe even deploy to the dev instance once we have found a good data model for the persistence layer.

@tomhughes

Member

commented Apr 18, 2019

Look I'm sorry, but this data is just totally unsuitable for storing in the database.

Maybe once we have ceph working we can use that or something.

@mmd-osm

Contributor Author

commented Apr 18, 2019

At least for the changeset upload, I had the following design goals in mind:

  • We store at most one I-Key per changeset
  • One entry would have at most 800k in size for the payload (for a 10k changeset)
  • Explicitly closing the changeset removes the I-Key
  • Remaining I-Keys can be deleted after 1 hour, by the time the changeset automatically closes

I'd like to understand a bit better what your design goals would be, and where exactly you see the issues. Is it the number of messages, the size, something else? I'm not saying that we need to use a database for this purpose; I'd just like to understand your concerns.
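
A sketch of the retention policy implied by these design goals (Java, with an in-memory map in place of the db table; all names are made up for illustration): one entry per changeset, removed immediately on explicit close, swept after an hour otherwise.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class IdempotencyKeyRetention {

    // At most one cached (I-Key, diffResult) entry per changeset.
    record Entry(String idempotencyKey, Instant storedAt, String diffResult) {}

    final Map<Long, Entry> byChangeset = new ConcurrentHashMap<>();

    // Storing a new entry replaces any previous one for the changeset.
    void store(long changesetId, String key, Instant now, String diffResult) {
        byChangeset.put(changesetId, new Entry(key, now, diffResult));
    }

    // Explicitly closing the changeset removes the entry immediately.
    void onChangesetClose(long changesetId) {
        byChangeset.remove(changesetId);
    }

    // Periodic job: entries older than one hour have outlived the changeset
    // idle timeout and can be dropped.
    void sweep(Instant now) {
        byChangeset.values().removeIf(
                e -> Duration.between(e.storedAt(), now).toHours() >= 1);
    }

    public static void main(String[] args) {
        IdempotencyKeyRetention cache = new IdempotencyKeyRetention();
        Instant now = Instant.now();
        cache.store(1, "key-old", now.minus(Duration.ofHours(2)), "stale");
        cache.store(2, "key-new", now, "fresh");
        cache.sweep(now);
        System.out.println(cache.byChangeset.size()); // 1: only the fresh entry remains
    }
}
```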

@tomhughes

Member

commented Apr 18, 2019

Storing 800k objects that have a short lifetime is just not what relational databases are designed for. Well I mean acting as a cache in general is not really what they are designed for...

I mean this whole thing is insane anyway but that's just the icing on the cake.

Look I could probably live with the idea of the idempotency key (there are better solutions I think but nobody seems to want to listen to that) but the idea of just straight up caching the actual raw HTTP response is just batshit insane.

The trouble of course is that we can't rebuild the response from the database in response to repeated requests because we don't know which bits of the changeset relate to which upload :-(

The only solutions to that are to limit changesets to a single upload, or to introduce an extra level of object underneath the changeset, but that means a massive database update to tie objects to the "upload" id rather than the changeset id.

Sadly it seems everybody in this thread would rather slap any old nonsense together in the aim of rolling it out yesterday rather than coming up with a sane solution that might actually take time to implement.

@gravitystorm

Collaborator

commented Apr 18, 2019

@tomhughes I'd appreciate it if you were less dismissive of everyone else's ideas - I don't find phrases like "batshit insane" appropriate, really.

I spent a reasonable amount of time reasoning out what I thought on this subject, and I'm not trying to "slap any old nonsense together" as you put it. If you think that my reasoning for needing idempotency keys for object creation is flawed, then please do explain why and explain further what you suggest instead. You've said that there are better solutions but that "nobody seems to want to listen" - I'm listening, as always, so lets discuss them.

There's no rush here, as far as I'm concerned. I'd rather we all work together to find a solid solution, even if it takes a while.

@mmd-osm
Contributor Author
commented Apr 18, 2019

Storing 800k objects that have a short lifetime is just not what relational databases are designed for. Well I mean acting as a cache in general is not really what they are designed for...

Fair point. The actual size will be much smaller on average, but that won't change the transient nature of the data.

I could probably live with the idea of the idempotency key (there are better solutions I think but nobody seems to want to listen to that)

Although it may seem at times that I'm outright rejecting other approaches, the intention is rather to discuss the pros and cons of the different solutions. As an example, I found the input from @simonpoole very helpful to cut down the number of entries we want to store per changeset. I try to formulate ideas down to a level where it's at least somewhat clear how an approach could look end-to-end.

but the idea of just straight up caching the actual raw HTTP response is just batshit insane.

I'm not proposing to do that, and I also think it's a pretty horrid idea. In my prototype, I'm only storing some absolute minimum intermediate data, which is later used to re-generate the XML response. I'm not storing the actual HTTP response as such.

Sadly it seems everybody in this thread is would rather slap any old nonsense together in the aim of rolling it out yesterday rather than coming up with a sane solution that might actually take time to implement.

There's no rush whatsoever. We're still only discussing different options, that's all. Writing prototypes is meant to have a slightly more tangible basis for the discussion. If it's nonsense, scrap it, and try something else.

@tomhughes
Member
commented Apr 18, 2019

Sorry, I just find it really hard to express how weird this whole thing seems to me.

Maybe it's just me, but it really has the feel of a duct-tape-and-baling-wire solution. It's hard to explain, but my engineer sense just kind of goes "urgh" at it - what is sometimes called a "code smell", or what I often like to think of as code aesthetics: something just feels wrong and you know deep down that there has to be a better way.

I think I've been clear from the start that I think the asynchronous API is the correct solution, though to be fair that is also going to need to store queued requests and responses somehow and I don't really have a clear picture of how that would work.

@mmd-osm
Contributor Author
commented Apr 18, 2019

Switching to some kind of asynchronous, queue-based system isn't exactly trivial with today's API calls. I see at least the following three major issues here: (a) sync/async message interdependencies, (b) data dependencies and (c) error handling.

Queue processing would not only require us to spend some time on the changeset upload, but also on creating, and in particular closing, changesets. Once a changeset has been uploaded to be processed asynchronously, we cannot simply send a synchronous "close changeset", as it would directly interfere with the queue handling, and in the worst case close the changeset even before it is due for processing in the queue.

On the data dependency side, I think one thing which has become clear in this issue is that chunked upload implies that message t2 might depend on the results of message t1, so queuing both of them at the same time isn't going to work. A client would have to enqueue message t1, poll for its status until it has been processed, and only then enqueue message t2, based on updated information from t1. This sort of thwarts the whole concept of async processing.

Next would be error handling. Assuming we were to enqueue multiple changeset uploads and there's some error condition, it's not clear how the process would continue. In the synchronous world, we simply tell the user "hey, there's a conflict, because joe mapper deleted the node you were about to add to your way". In the async world, that message lingers somewhere in the queue, and it's not clear how conflict resolution would proceed.

I don't believe we would ever be in the position to have more than a single message in the queue, unless we could come up with a complete overhaul of the protocol to address those issues I mentioned.

@simonpoole
Contributor
commented Apr 18, 2019

We have

  • one app misbehaving. Solution: tell them to stop and document why auto-retrying is a bad idea
  • a general issue with horizontally chunked uploads; these should be discouraged and replaced by referentially complete chunking (note: I do this wrong too, as does likely every other editor dev) and best be completely avoided
  • no method with which an app can determine if an upload succeeded or not; this is needed in any case, as simply suggesting retrying for an hour (and then what?) is not going to cut it. With vertical chunking, retaining the placeholder mapping isn't really needed.
@mmd-osm
Contributor Author
commented Apr 18, 2019

@simonpoole : you're right about the chunked uploads. Referentially complete chunking is something that is only implemented in tools which are otherwise frowned upon: upload.py (that's at least my understanding of how this tool works).

See this example: https://master.apis.dev.openstreetmap.org/history#map=15/49.0478/2.0270 - a super large change file (with way more than 10k changes) was split into multiple changesets, and each changeset contains all nodes, ways and relations for one bbox. Today, JOSM typically uploads all nodes first, even if they span multiple changesets (the infamous 10k changesets with no information whatsoever), followed by ways and relations towards the end.

I was seriously considering suggesting this approach to the JOSM devs.
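To make the difference concrete, here's a rough Python sketch of what "vertical" (referentially complete) chunking could look like. The function name and the input shape are made up for illustration, and it deliberately assumes ways don't share new nodes - a real editor would have to substitute server-assigned ids into later chunks to handle sharing:

```python
def referentially_complete_chunks(ways, max_elements=10_000):
    """Group each new way together with its member nodes, so that every
    chunk resolves its own placeholder ids ("vertical" chunking), instead
    of uploading all nodes first and all ways later ("horizontal").

    `ways` maps a way's placeholder id to the list of placeholder node
    ids it uses.  Assumes ways do not share new nodes.
    """
    chunks, current, size = [], [], 0
    for way_id, node_ids in ways.items():
        unit = len(node_ids) + 1           # the way plus all of its nodes
        if current and size + unit > max_elements:
            chunks.append(current)         # flush: chunk is self-contained
            current, size = [], 0
        current.append((way_id, node_ids))
        size += unit
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then contains a way together with all of its member nodes, so no placeholder id ever crosses a chunk boundary, and any single chunk can be retried or abandoned on its own.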

in any case as simply suggesting retrying for an hour (and then what?) is not going to cut it.

The expectation here was that an upload finishes in a matter of seconds, and in case there's some networking issue, the user would have a chance for a safe retry.

One hour should leave ample time for a retry. If that still doesn't work out, re-uploading could possibly create some duplicates. After one hour, the changeset will have been closed anyway, and the message payload would look different (which means that a simple re-upload will not work). In any case, this would never cause a loss of data, so this might be perfectly acceptable.


one app misbehaving. solution: tell them to stop and document why auto retrying is a bad idea

This is definitely not limited to JOSM, and can just as well occur in any browser under certain conditions. I'd suggest taking a look at the following blog post: https://blogs.oracle.com/ravello/beware-http-requests-automatic-retries

Also see: https://tools.ietf.org/id/draft-nottingham-httpbis-retry-00.html#rfc.section.3.1

The status quo, therefore, is that no Web application can read HTTP’s retry requirements as a guarantee that any given request won’t be retried, even for methods that are not idempotent. As a result, applications that care about avoiding duplicate requests need to build a way to detect not only user retries but also automatic retries into the application “above” HTTP itself.
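For what it's worth, the client side of the Idempotency-Key proposal is tiny. A hedged Python sketch (the header name follows the proposal above; `post` stands in for whatever HTTP call the editor actually makes - none of this is current API behaviour):

```python
import uuid

def upload_with_retry(post, payload, max_retries=3):
    """Send a diff upload, reusing ONE Idempotency-Key across all retries.

    `post` is any callable taking (headers, payload) and returning a
    response, or raising on a network failure.  Because the key is
    generated once per logical upload - not per attempt - the server can
    recognise a retried request as a duplicate and replay its response.
    """
    key = str(uuid.uuid4())
    headers = {"Idempotency-Key": key}
    last_error = None
    for _ in range(max_retries):
        try:
            return post(headers, payload)
        except ConnectionError as exc:   # network failure: safe to retry, same key
            last_error = exc
    raise last_error
```

The crucial detail is that the key lives outside the retry loop; generating a fresh key per attempt would defeat the whole mechanism.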

@mmd-osm
Contributor Author
commented Apr 18, 2019

Storing 800k objects that have a short lifetime is just not what relational databases are designed for. Well I mean acting as a cache in general is not really what they are designed for...

Just to add one more data point:

800k seemed a bit too pessimistic; actual numbers in the prototype are in the 300k range at most, and that's still based on a human-readable format. The string compresses down to 20k-50k (a factor of 6-15).

As a purely binary format, at least 21 bytes are needed to store all the information representing a single change in a diffResult file:

  • operation: 2 bit (values: create/modify/delete)
  • object type: 2 bit (values: node/way/relation)
  • deletion_skipped: 1 bit (true/false)
  • old_id: 64 bit
  • new_id: 64 bit
  • version: 32 bit
    (Note: actual sequence of rows needs to be preserved!)

I guess there's a number of different options to store this kind of data.
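As a sanity check on the arithmetic, the layout above can be expressed with Python's struct module. The exact bit assignments are illustrative, not a proposal for an actual encoding; rows would simply be concatenated in order, which preserves the required sequence:

```python
import struct

# One diffResult row packed into 21 bytes: a flags byte holding
# operation (2 bits), object type (2 bits) and deletion_skipped (1 bit),
# followed by old_id (int64), new_id (int64) and version (int32).
ROW = struct.Struct("<Bqqi")   # '<' disables padding: 1 + 8 + 8 + 4 = 21 bytes

OPS   = {"create": 0, "modify": 1, "delete": 2}
TYPES = {"node": 0, "way": 1, "relation": 2}

def pack_row(op, obj_type, deletion_skipped, old_id, new_id, version):
    flags = OPS[op] | (TYPES[obj_type] << 2) | (int(deletion_skipped) << 4)
    return ROW.pack(flags, old_id, new_id, version)

def unpack_row(buf):
    flags, old_id, new_id, version = ROW.unpack(buf)
    op = {v: k for k, v in OPS.items()}[flags & 0b11]
    obj_type = {v: k for k, v in TYPES.items()}[(flags >> 2) & 0b11]
    return op, obj_type, bool((flags >> 4) & 1), old_id, new_id, version
```

A 10k-change diffResult would then occupy 10,000 × 21 ≈ 205 KiB before any compression.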

@matkoniecz
Contributor
commented Apr 20, 2019

one app misbehaving. solution: tell them to stop and document why auto retrying is a bad idea

So what should JOSM do? I would be happy to open an issue on their bug tracker, but based on #2201 (comment) I see the following solutions:

  • re-upload, currently used by JOSM (leading to potential duplicates)
  • refreshing the data in the editor - while it may be feasible in iD, it is probably not doable in JOSM. Note that JOSM allows loading data for a massive area, which is not reasonable to reload. The data may also be loaded from, for example, Overpass queries (one may easily have 10,000 objects scattered across the world; refreshing areas by downloading data for each such area is not going to work, and re-running the queries will fail as soon as Overpass has a lag greater than a few seconds).
  • replacing the data in the editor - potentially losing all of a user's edits is worse than occasional duplicates that are fairly easy to handle. It also runs into all the problems of the previous solution.

Maybe I am missing something, but I see no way to improve things on the JOSM side.

@mmd-osm
Contributor Author
commented Apr 20, 2019

@matkoniecz

I would be happy to open issue on their bug tracker

Appreciate your feedback. As we're still discussing different options and haven't come to a final conclusion yet, I think it would be prudent not to bother the JOSM devs with every half-baked idea we may have come up with. I think it's just too early to communicate or even recommend certain implementation options.

Maybe I am missing something but I see no way to improve things on a JOSM side.

We've already discussed several options in this issue to improve the situation for JOSM. My suggestion would be to please take the time and read through all of the previous comments. I know this topic is quite tricky; if you have questions, we're happy to help out.

(Edit: reformulated to be more welcoming)

@matkoniecz
Contributor
commented Apr 20, 2019

My suggestion would be to please take the time and read through all of the previous comments.

I did this.

I encountered claims that JOSM is buggy/misbehaving in the current situation. But unless I missed something, there is currently no way for JOSM to change its behavior without causing even worse problems.

@mmd-osm
Contributor Author
commented Apr 20, 2019

@matkoniecz : right, that's exactly the situation today. JOSM simply has no better way of doing uploads with what is currently being provided by the OSM API.

This issue is about evolving the OSM API to provide better mechanisms to editing applications like JOSM. Once they are in place, JOSM can take advantage of them and improve its error handling.

That's also the reason why I mentioned that it doesn't make sense to open an issue with JOSM today. We have to do some work on the OSM API side first.

@simonpoole
Contributor
commented Apr 20, 2019

JOSM could simply ask the user what to do (which, for example, would allow the user to check with a browser whether their changes have actually been uploaded, and -then- decide on a course of action). I know, that's a completely novel concept and completely unheard of, but maybe it is worth a try?

@bhousel
Member
commented Apr 20, 2019

I know, that's a completely novel concept and completely unheard of, but maybe it is worth a try?

Why are you so nasty?

@simonpoole
Contributor
commented Apr 20, 2019

I know, that's a completely novel concept and completely unheard of, but maybe it is worth a try?

Why are you so nasty?

Pointing out a glaringly obvious way to do something better than the current, clearly broken behaviour is nasty?

@mmd-osm
Contributor Author
commented Apr 20, 2019

@simonpoole : I don't know if you read my updated comment in #2201 (comment), if not, I'd recommend to take a look at it. There are indeed situations, in which an automated retry occurs without user intervention (no, I'm not talking about JOSM). So delegating the decision to the user may not really work out.

Also, we're talking about a completely downward-compatible 100% opt-in approach. If you don't implement it, nothing will change at all.

I don't want to bore folks here with lots of data, but in case you're still interested: I tried persisting a diffResult with 10k entries as JSONB. Saving the changeset took 2 seconds in total (that's with cgimap; the Rails code takes way longer). Out of those two seconds, saving the diffResult took 14 milliseconds. I used the SQL statement below to assess the size impact: it's at most 80k.

So the only remaining factors to assess the space requirements are:

  • how many changesets do we have in status "open" at any one time?
  • what is the maximum number of changed objects in that timeframe?

I guess the majority of changesets will have fewer than 500-700 changes - each of those changesets fits on one 2k page in the database, and doesn't require TOASTing.

SELECT *, pg_size_pretty(total_bytes) AS total
     , pg_size_pretty(index_bytes) AS INDEX
     , pg_size_pretty(toast_bytes) AS toast
     , pg_size_pretty(table_bytes) AS TABLE
FROM (
  SELECT *, total_bytes - index_bytes - COALESCE(toast_bytes, 0) AS table_bytes FROM (
    SELECT c.oid, nspname AS table_schema, relname AS TABLE_NAME
         , c.reltuples AS row_estimate
         , pg_total_relation_size(c.oid) AS total_bytes
         , pg_indexes_size(c.oid) AS index_bytes
         , pg_total_relation_size(reltoastrelid) AS toast_bytes
      FROM pg_class c
      LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
     WHERE relkind = 'r'
  ) a
) a WHERE table_name = 'changeset_idempotency_cache';
@simonpoole
Contributor
commented Apr 20, 2019

@mmd-osm yes, so there is a special case of a special case in which retrying automatically is legit from the RFC pov (but obviously, as it breaks the POST semantics, actually doing so is clearly broken too).

If we are so concerned about this specific scenario, then obviously the far simpler solution to the whole issue is to generate a checksum with a suitable method and return an error to the application in question if the repeated upload matches, because while you might get the JOSM devs to use an idempotency-token-whatever, your random web-browser-based application out there is not going to do so.

@mmd-osm
Contributor Author
commented Apr 20, 2019

obviously the far simpler solution to the whole issue is to generate a checksum with a suitable method and return an error to the application in question

I indeed thought about this, too. Sending an error message instead of the diffResult response might not be what editors are expecting, but yes, that's an option for sure. I haven't really thought through some strange corner cases yet, where creating the same object twice would be a legitimate operation from an editor's pov. Maybe that's an entirely theoretical thing.

while you might get the JOSM devs to use a indempotent-token-whatever your random web browser based application out there is not going to do so

This isn't a JOSM-only show. Every editor dev team is free to adopt whatever concept we come up with, if they think it's important enough for them (or their users). Ideally, that concept should be straightforward for a client to implement and use. In the end, it should 'just work' without bothering the end user with technical details.

@mmd-osm
Contributor Author
commented Apr 20, 2019

@simonpoole : I'm a bit curious as to what your experience with OkHttp in Vespucci has been - which, as we all know, is about to be replaced in the near future due to their bold move to Kotlin.

I found a blog post explicitly mentioning retries of HTTP POSTs "on slow and unreliable connections", see https://medium.com/inloopx/okhttp-is-quietly-retrying-requests-is-your-api-ready-19489ef35ace (with the section "Fix the real problem" looking very familiar). Some subsequent pull requests and issues seem to have proposed changes, and I'm not exactly sure if that blog post is still accurate as is.

Some people on Stack Overflow recommended setting retryOnConnectionFailure=false, or having a custom interceptor in place. I couldn't find either by briefly skimming through the Vespucci sources.

Do you have anything special in place to deal with this situation? I would assume poor network connectivity isn't entirely theoretical in places where people use Vespucci.

@simonpoole
Contributor
commented Apr 20, 2019

@mmd-osm nice try, but we never used OkHttp versions prior to 3.3 (auto-retrying on connection setup is obviously quite OK, no reason to disable that), and no, we are not removing OkHttp because of Kotlin (the added bloat would be a reason to do so, though), but because of it de-supporting 150'000'000 devices in a minor version change.

@mmd-osm
Contributor Author
commented Apr 20, 2019

@simonpoole : I thought more in terms of "OkHttp retries when a client attempts to reuse a TCP connection that a server has unexpectedly closed." as mentioned in this issue, only created 3 days ago for 3.12.x. The issue starts out with:
"What they complain about is that data is uploaded to the server but mobile still shows as not uploaded. I have researched a bit and found that okHttp is retrying requests on connection failure and guess it is related to this topic."

This may be some peculiar issue relevant only to certain apps; that's why I'm asking whether you've experienced it before with Vespucci.

And sorry for the mix-up wrt Kotlin and the minor version breakage. I thought the Kotlin story was more of a concern; obviously, both moves aren't exactly helpful for Vespucci.

@simonpoole
Contributor
commented Apr 20, 2019

@mmd-osm given my understanding of the OkHttp issue, it would only be exercised if we (in the case of Vespucci, manually) retried after a failed attempt and the original connection hadn't already been evicted from the cache.

As we are then already retrying in a situation in which we shouldn't be doing so blindly, I kind of fail to see (independent of the app in question) how the net result would differ from retrying without the mechanism in OkHttp. Which further implies that it is practically impossible to determine if that has ever happened. I can see the behaviour being problematic if, after the failure, you try to upload something else, but that is not going to be the case in our usage scenarios.

To recap: I think we should do something about the issue; my only problem with your approach is that it is currently focused on auto-retrying, which might be appropriate behaviour if the failure was due to a short network glitch, but not if something different went wrong (for example, you lost mobile coverage because you are moving).

@woodpeck
Contributor
commented Apr 20, 2019

Do we have an idea of how big this problem actually is? Has anyone gone through the last years of changeset uploads and checked how often the same user uploaded the same objects twice in a row, and how many duplicate objects resulted from that? I have a feeling that the problem might be less of an issue in practice, and perhaps not necessarily of the calibre that requires our best and brightest to spend their valuable time to solve. But I might be wrong, and of course it's the prerogative of our best and brightest to apply their time to any aspect of the API. I'm volunteering to research if it has any bearing on the discourse.

@mmd-osm
Contributor Author
commented Apr 21, 2019

my only problem with your approach is that you are rewarding doing the wrong thing and not supplying a solution for devs that don't want to go that route, forcing us to emulate JOSMs broken behaviour.

That would be a strong case for having some kind of separate API call to query the most recent upload status based on some token, or (permitting incompatible changes) allowing a single upload per changeset only, and then querying the changeset status using the existing API calls.

A compromise could be to provide an additional option in the changeset upload message to "auto-close" a changeset (this would make changesets atomic, btw), again allowing devs to decide if they want to opt in.

I'm sure that would work out pretty well for iD, as they do a single upload per changeset anyway and don't really care about the diffResult response. Querying the changeset status right after the fact would then be a very natural thing to do.

From a technical pov, we probably need to double-check how to deal with in-flight transactions when retrieving the changeset status.
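Assuming the names proposed here (neither the attribute nor the URL parameter exists in the current 0.6 API), an opted-in atomic upload could look roughly like:

```
POST /api/0.6/changeset/1234/upload?close_changeset=true

<osmChange version="0.6" generator="example-editor">
  <create>
    <node id="-1" lat="49.0478" lon="2.0270" changeset="1234"/>
  </create>
</osmChange>
```

Either the whole osmChange message is applied and the changeset is closed in the same step, or a retry against the already-closed changeset fails cleanly (today a closed changeset yields an HTTP 409), so there is no window in which a duplicate upload could be applied.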

Do we have an idea of how big this problem actually is?

I'm with @gravitystorm here: I'm reasonably keen to see the API become fully idempotent. Network requests / responses / connections are lost for all kinds of reasons, and are more likely to happen the worse your internet connection (mobile, rural, far from OSMF's datacentres etc).

Maybe it's not obvious why I'm pushing this topic at all. Changeset uploads have bothered me from time to time in the past, be it that the upload took too long, or that the network had some intermittent issue. It's also reports like (this one) that made me think: yes, we should probably do something here.

The "better overall user experience" epic is split up into 3 major topics:

The idea is, no matter how bad or broken your internet connectivity is, we want everyone to contribute without much hassle.

Some of those ideas have at least some resemblance with what Matt posted a few years ago in http://www.asklater.com/matt/blog/2015/11/18/the-road-to-api-07/, although some details actually differ.

@mvl22
commented Apr 22, 2019

Rather than requiring clients to send a token and to have responsibility for checking it, would one approach, which I suspect would not require substantial changes to the server infrastructure, be:

  1. Whenever a POST comes in (for any API call), create some kind of hash of the incoming data, first omitting anything which is known to be different (e.g. the request time).
  2. Store the hash in a global requests(hash, response) table of incoming hashes and responses, where each entry has a lifetime of, say, one day or a week (as users/clients are unlikely to retry after that).
  3. When the server has processed the request and sends back the response, set the response field to the response XML and continue the response to the client as normal.
  4. If the server receives a repeated request with the same hash, i.e. the client has just re-sent the same request, immediately send back the same response without any further processing. For good practice, send an additional header stating that this is a cached response.

This would seem to me to require no changes to any client, and to eliminate almost all duplication on the server, with the code change essentially being just a global interception of any POST request, implementing the cache at that level.
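A minimal sketch of that interception, assuming an in-memory cache (the class name, TTL and per-user scoping of the hash are illustrative - scoping per user avoids one mapper's upload suppressing an identical upload by somebody else):

```python
import hashlib
import time

class PostDeduplicator:
    """Cache of recently seen POST bodies, keyed by a hash of the payload
    (after stripping volatile fields).  Entries expire, so a deliberate
    identical re-edit much later still goes through.
    """
    def __init__(self, ttl_seconds=24 * 3600, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.cache = {}            # hash -> (expiry, cached response)

    def handle(self, user_id, payload, process):
        key = hashlib.sha256(f"{user_id}\n{payload}".encode()).hexdigest()
        now = self.clock()
        hit = self.cache.get(key)
        if hit and hit[0] > now:
            return hit[1], True    # replay: cached response, no reprocessing
        response = process(payload)
        self.cache[key] = (now + self.ttl, response)
        return response, False
```

The second element of the returned tuple is where the "this is a cached response" header mentioned in step 4 would be derived from.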

This approach admittedly does not solve Andy's point:

But that still doesn't work, since if the server receives a new trace, and then the owner changes the description 2 seconds later on the website, and the server receives the same creation request 2 seconds after that, it doesn't have enough information to know if it's an intentional recreation or just a resent request.

but I don't think that is actually solvable, because a changed description is by definition a change in the request and could constitute a deliberate and separate change to the data.

@simonpoole
Contributor
commented Apr 22, 2019

@mvl22 see #2201 (comment). I would think returning an error in all cases in which no token has been supplied is in order. There are obviously a couple of edge cases in which this might not catch all repeated uploads outside of actual edits, but seriously, I don't think we really care about those.

@androideveloper
commented Apr 23, 2019

We are experiencing the same issue with OkHttp. We are going to use the concept of an X-REQUEST-ID header to solve the duplication issue: https://stackoverflow.com/a/54356305/1635488.
The server should keep a list of ids and check whether a request was processed earlier.

@mmd-osm
Contributor Author
commented May 2, 2019

@simonpoole : I was sort of expecting your feedback on my proposal to include an auto close (-> close_changeset=true) attribute in the osmChange message header (#2201 (comment)).

Several people in this issue seem to suggest that they want atomic changesets, which this attribute would enable. If the upload fails or times out, it would be safe to re-upload: either the osmChange message gets processed, or it is rejected in case the changeset is already closed.

I don't want to introduce one upload per changeset as the default behavior, as it would be an incompatible, breaking change. Adding that attribute is the next best thing. By the way, an additional URL parameter "?close_changeset=true" could serve a similar purpose, if there are concerns about extending the XML structure.

Should we go for this as one option, and continue with the diffResult topic?

@simonpoole
Contributor
commented May 2, 2019

@mmd-osm IMHO the people wanting atomic changesets actually want something else: the ability to upload an essentially unbounded number of changes over an unbounded number of uploads, atomically. That, and a couple of unicorns :-).

Auto-closing the changeset would mostly resolve the "has the upload succeeded or not" question for non-chunked uploads, at the expense of having to re-download, which in this case I would think is acceptable. In general, I would support this if no more general solution is on the horizon.

@mvl22
commented May 2, 2019

@mvl22 see #2201 (comment) I would think returning an error in all cases in which no token has been supplied is in order, there are obviously a couple of edge cases in which this might not be able to catch all repeated uploads outside of actual edits, but seriously, I don't think we really care about those.

I am suggesting, though, that clients should not have to change - instead, the server itself should do the token generation.

Would my suggested approach noted above work?

@mmd-osm
Contributor Author
commented May 3, 2019

@simonpoole : right, this inevitably includes ideas like throwing away the whole changeset unless it is properly closed (link). Clearly, that's something well beyond the scope of this issue. We would immediately inherit all of the issues of the async queues (#2201 (comment)). I'd say this is actually quite similar to having some kind of staging area to stash all changes. Say hello to the world of branches, commits and merge (conflicts), this time with OSM data.

@mvl22 : there are some similarities to #2201 (comment) in the "let's handle all POST requests" approach, except that you're creating the key based on a payload hash - which we discussed in #2201 (comment) as "calculating a checksum". I noted in #2201 (comment) that this may or may not be possible, depending on the endpoint semantics, and needs to be clarified. Keeping the full-blown XML response around in a table was already questioned in #2201 (comment).
