cr-syncer does not retry on failed updates #64

Closed
oliver-goetz opened this issue Jan 26, 2021 · 21 comments
@oliver-goetz
Contributor

We find quite a number of these errors in our cr-syncer logs.

2021/01/26 16:40:16 Syncing key "default/1710.2053438" from queue "downstream" failed: WarehouseOrder default/1710.2053438 @ 184236994: update status failed: Operation cannot be fulfilled on warehouseorders.ewm.sap.com "1710.2053438": the object has been modified; please apply your changes to the latest version and try again

cr-syncer does not retry applying the change as the error message suggests. If there are further update events from the downstream cluster, the CR gets back in sync; but if there are no more change events from downstream for that CR, it remains out of sync.
When a CR is in this state, it does not even sync changes from the upstream cluster anymore.
It remains in this inconsistent state until the cr-syncer is restarted.

@oliver-goetz
Contributor Author

oliver-goetz commented Jan 27, 2021

I made another observation. Sometimes the status of robots is not updated in the cloud cluster, even though the cr-syncer logs give the impression that the status was synced.
I made a little screen-cap video. The upper left terminal shows the cr-syncer logs of mir-robot01, the upper right terminal watches the robot CR of mir-robot01 directly on the robot, and the lower terminal watches the CR of mir-robot01 in the cloud cluster.

You can see that the status on the robot is continuously updating and that the cr-syncer receives the update events, but they are not synced to the cloud cluster.
Additionally, you see some of the "Syncing key" issues mentioned before, but they are probably not related.

Looking into the nginx log of the cloud cluster, there is a 409 error like this for each robot in the meantime, and then no more log entries for those CRs for a while. Thus, the cr-syncer is probably not calling the cloud cluster's kube-apiserver, even though it is still receiving update events.
Syncing recovers after a while, but I don't know the exact point in time. It is a matter of minutes rather than seconds until it recovers.

111.111.111.111 - [111.111.111.111] - - [27/Jan/2021:14:30:42 +0000] "PUT /apis/core.kubernetes/apis/registry.cloudrobotics.com/v1alpha1/namespaces/default/robots/mir-robot01/status HTTP/2.0" 409 374 "-" "cr-syncer.binary/v0.0.0 (linux/amd64) kubernetes/$Format" 968 0.003 [default-kubernetes-443] 222.222.222.222:443 374 0.004 409 61189d593e12555bb2278915a242e3da

It seems that this issue affects all CRs of the Robot CRD: either the CR updates of all robots fail, or all of them succeed.

There might be the same issue for other CRDs too, but it does not necessarily happen at the same time.

@oliver-goetz
Contributor Author

I just looked up the 409 error: the API server returns it when there is a conflict. One reason for this is that the cr-syncer refers to a version of the CR that is too old. I don't know why this happens for the Robot CR, because we only change it manually when changing the labels, but it has already happened several times since yesterday.
https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#error-codes
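For reference, the retry that the apiserver's message asks for is usually done with client-go's RetryOnConflict helper, which re-reads the latest object and reapplies the change before updating again. A minimal sketch (names are illustrative, not the cr-syncer's actual code):

```go
import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/util/retry"
)

// updateStatusWithRetry refetches the latest resourceVersion and reapplies
// the downstream status whenever the apiserver answers 409 Conflict.
// Sketch only; not the cr-syncer's actual code.
func updateStatusWithRetry(ctx context.Context, c dynamic.ResourceInterface, name string, status map[string]interface{}) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		// Always fetch the latest version before updating.
		cur, err := c.Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if err := unstructured.SetNestedMap(cur.Object, status, "status"); err != nil {
			return err
		}
		_, err = c.UpdateStatus(ctx, cur, metav1.UpdateOptions{})
		return err // a Conflict error triggers another attempt
	})
}
```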

@oliver-goetz
Contributor Author

oliver-goetz commented Jan 27, 2021

I was wrong regarding the number of 409 requests. There is not just one request for each CR but ~40 requests within a couple of seconds. Then there is a break of about 10 minutes before requests for this CRD start again; these requests are successful HTTP 200 requests.

Maybe this storm of failed requests leads the rate-limiting interface to back off for a while until it starts making new requests again.
I'm still wondering why we see neither a success nor a failure message in the cr-syncer log.

How about using Patch instead of Update and UpdateStatus? In our case it is clear that upstream is always the origin of the spec and downstream is always the origin of the status. Thus, there is no risk that upstream changed something that downstream is going to update, or vice versa.
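For illustration, a status-only sync via a JSON patch against the status subresource could look roughly like this (a sketch with a dynamic client; names are illustrative and this is not the cr-syncer's code):

```go
import (
	"context"
	"encoding/json"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// patchStatus replaces only /status on the upstream CR. No resourceVersion is
// sent, so a concurrent spec change upstream cannot cause a 409 conflict.
// Sketch only; assumes the object already has a status field.
func patchStatus(ctx context.Context, c dynamic.ResourceInterface, name string, status map[string]interface{}) error {
	patch, err := json.Marshal([]map[string]interface{}{
		{"op": "replace", "path": "/status", "value": status},
	})
	if err != nil {
		return err
	}
	_, err = c.Patch(ctx, name, types.JSONPatchType, patch, metav1.PatchOptions{}, "status")
	return err
}
```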

@drigz
Contributor

drigz commented Jan 28, 2021

Questions for you:

  • Did this issue start with the recent changes or is it something you saw before?
  • Are you blocked? I haven't tried, but I think run-install.sh $project crc-0.1.0/crc-0.1.0+5a43187.tar.gz should install a version before the recent changes.
  • What version of Kubernetes are you running in the cloud cluster? There's a (small) chance that the cloud apiserver's behavior is different between our setups.

Maybe this storm of failed request leads the rate limit interface to back off for a while until it starts with new requests again.

This is plausible. I've not seen the 409 storms, and I only see upstream 409s when a separate process is updating the upstream resource too, which makes me wonder what is different in your setup. The 409s I see are immediately and successfully retried. Given that 409s are quite common, I'm surprised the controller-runtime is handling them so badly here: I would expect to see exponential backoff rather than many-retries-then-a-long-pause.

I'm still wondering why we neither see a success nor a failure message in cr-syncer log.

I can't reproduce this behavior, so something weird is going on. In your video, this was already broken at the start. There are a couple of codepaths that print neither success nor failure.

If we can't come up with a better idea, adding print debugging to the start of syncDownstream and to the silent return paths would narrow this down.

How about using Patch instead of Update and UpdateStatus?

I'm not 100% sure about the mechanics of Patch, but it seems like it would lead to divergence between the upstream and downstream resources. For example, if you add a property downstream and then remove it, you would need some way of crafting the patch so that the property is deleted upstream as well. Update/UpdateStatus have the semantics that we want here. Also, if we can't trust the resourceVersion that we're getting, we can't trust the spec/status either, so this would probably just break differently.

@oliver-goetz
Contributor Author

This is how the issue started on the robot side. You find about 8 "Syncing key ..." errors for downstream when the whole thing started. There seems to be no specific reason for the failure on the robot.

2021/01/27 14:30:35 Copied Robot mir-robot01 status@v5523725 to upstream@v184842987
2021/01/27 14:30:37 Got update event from upstream for WarehouseOrder 1710.2053443@v184842994
2021/01/27 14:30:37 Got update event from upstream for WarehouseOrder 1710.2053715@v184842995
2021/01/27 14:30:37 Got update event from downstream for WarehouseOrder 1710.2053715@v5523728
2021/01/27 14:30:37 Got update event from upstream for WarehouseOrder 1710.2053716@v184842996
2021/01/27 14:30:37 Copied WarehouseOrder 1710.2053715 status@v5523728 to upstream@v184842995
2021/01/27 14:30:37 Got update event from upstream for WarehouseOrder 1710.2053715@v184842997
2021/01/27 14:30:37 Got update event from downstream for WarehouseOrder 1710.2053715@v5523729
2021/01/27 14:30:37 Copied WarehouseOrder 1710.2053715 status@v5523729 to upstream@v184842997
2021/01/27 14:30:37 Got update event from upstream for RobotType freight100@v184843003
2021/01/27 14:30:37 Got update event from downstream for Robot mir-robot01@v5523732
2021/01/27 14:30:37 Got update event from upstream for Robot mir-robot01@v184843008
2021/01/27 14:30:37 Copied Robot mir-robot01 status@v5523732 to upstream@v184843008
2021/01/27 14:30:39 Got update event from downstream for Robot mir-robot01@v5523735
2021/01/27 14:30:39 Copied Robot mir-robot01 status@v5523735 to upstream@v184843025
2021/01/27 14:30:41 Got update event from downstream for Robot mir-robot01@v5523738
2021/01/27 14:30:41 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 184843008: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/27 14:30:42 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 184843008: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/27 14:30:42 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 184843008: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/27 14:30:42 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 184843008: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/27 14:30:42 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 184843008: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/27 14:30:42 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 184843008: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/27 14:30:42 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 184843008: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/27 14:30:43 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 184843008: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/27 14:30:43 Got update event from downstream for Robot mir-robot01@v5523741
2021/01/27 14:30:44 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 184843008: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/27 14:30:45 Got update event from downstream for Robot mir-robot01@v5523744
2021/01/27 14:30:47 Got update event from downstream for Robot mir-robot01@v5523749
2021/01/27 14:30:49 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 184843008: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/27 14:30:49 Got update event from downstream for Robot mir-robot01@v5523752
2021/01/27 14:30:51 Got update event from downstream for Robot mir-robot01@v5523755
2021/01/27 14:30:53 Got update event from downstream for Robot mir-robot01@v5523758

I wondered why the issue occurs simultaneously on all robots, and I may have found the reason in the cloud cluster, which is running version 1.16.15-gke.7800.
In the one minute from 14:30 until 14:31 there are 10000 errors Gone: too old resource version: 184197748 (184349850) on our CR watchers in the cloud cluster, which trigger a reconnect.
I checked the logs and there are millions of such errors within the log retention period. This error probably occurs since we switched to version 12 of the Python Kubernetes API. It includes a PR which raises an exception on 410 errors, and this exception restarts all affected watchers. The watchers have no waiting time before reconnecting, so maybe this is overloading the API server somehow, which causes more 410 errors. Before, we handled the 410 errors ourselves and more gracefully, by just raising the resourceVersion without starting all over again. This does not work anymore because of the exception.
I'll include a waiting period when reconnecting and see if this solves the issue.
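For illustration, the reconnect-with-delay pattern described above looks roughly like this (the watchers in question are Python, but a Go sketch with client-go shows the same idea; names and the 5-second delay are placeholders):

```go
import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/dynamic"
)

// watchWithDelay re-lists to get a fresh resourceVersion and waits before
// reconnecting whenever the watch ends or fails with 410 Gone, instead of
// hammering the apiserver in a tight loop. Illustrative sketch only.
func watchWithDelay(ctx context.Context, c dynamic.ResourceInterface) error {
	for {
		list, err := c.List(ctx, metav1.ListOptions{})
		if err != nil {
			return err
		}
		w, err := c.Watch(ctx, metav1.ListOptions{ResourceVersion: list.GetResourceVersion()})
		if err != nil {
			if apierrors.IsGone(err) || apierrors.IsResourceExpired(err) {
				time.Sleep(5 * time.Second) // back off before asking again
				continue
			}
			return err
		}
		for range w.ResultChan() {
			// handle events here
		}
		time.Sleep(5 * time.Second) // watch ended; wait before reconnecting
	}
}
```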

@drigz
Contributor

drigz commented Jan 28, 2021

That explanation sounds right on the money. The logs show:

2021/01/27 14:30:37 Got update event from upstream for Robot mir-robot01@v184843008
[snip]
2021/01/27 14:30:39 Copied Robot mir-robot01 status@v5523735 to upstream@v184843025
[snip]
2021/01/27 14:30:41 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 184843008: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again

The cr-syncer never receives the update event from upstream@v184843025, so it keeps trying to update @v184843008. The retry behavior is not that sensible (I'll look into exponential backoff) but it would help in this case, as the update event v184843025 hasn't arrived by the end of that log excerpt (14s after v184843025 was created).

@oliver-goetz
Contributor Author

That explanation sounds right on the money. The logs show:

2021/01/27 14:30:37 Got update event from upstream for Robot mir-robot01@v184843008
[snip]
2021/01/27 14:30:39 Copied Robot mir-robot01 status@v5523735 to upstream@v184843025
[snip]
2021/01/27 14:30:41 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 184843008: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again

The cr-syncer never receives the update event from upstream@v184843025, so it keeps trying to update @v184843008. The retry behavior is not that sensible (I'll look into exponential backoff) but it would help in this case, as the update event v184843025 hasn't arrived by the end of that log excerpt (14s after v184843025 was created).

Maybe an exponential backoff is not the best choice for the cr-syncer. As you can see here, there is a series of very fast retries with a high probability that the key is not in sync yet, followed by a very long backoff time. A more linear approach would be better for this case.
There is an ItemFastSlowRateLimiter in the same workqueue package. Imho it would make sense to use it with fast and slow delays that are reasonable for remote connections, like 5 x 0.5 seconds fast delay followed by a 5-second slow delay, as in the sketch below.
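A minimal sketch of that suggestion, using the workqueue package from client-go (the queue name and the delays are just the values proposed above, not the cr-syncer's actual queue construction):

```go
import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newSyncQueue returns a work queue that retries a failed key 5 times with a
// 0.5s delay and then every 5s, instead of backing off exponentially.
func newSyncQueue(name string) workqueue.RateLimitingInterface {
	limiter := workqueue.NewItemFastSlowRateLimiter(500*time.Millisecond, 5*time.Second, 5)
	return workqueue.NewNamedRateLimitingQueue(limiter, name)
}
```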

Btw. I fixed our "too old resource version" bug, but the cr-syncer problem is still there.

@oliver-goetz
Contributor Author

oliver-goetz commented Jan 28, 2021

I changed the limiters in this commit.
Unfortunately it does not solve the problem. The slowDelay time is now 5 seconds, which is more reasonable for a syncer than those huge exponential backoff times. But it is not working, because there are sometimes no events from upstream to refresh the resourceVersion for minutes.

2021/01/28 16:41:19 Got update event from downstream for Robot mir-robot01@v5688231
2021/01/28 16:41:21 Got update event from downstream for Robot mir-robot01@v5688235
2021/01/28 16:41:22 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 185506126: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/28 16:41:23 Got update event from downstream for Robot mir-robot01@v5688238
2021/01/28 16:41:25 Got update event from downstream for Robot mir-robot01@v5688242
2021/01/28 16:41:27 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 185506126: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/28 16:41:27 Got update event from downstream for Robot mir-robot01@v5688245
2021/01/28 16:41:29 Got update event from downstream for Robot mir-robot01@v5688249
2021/01/28 16:41:31 Got update event from downstream for Robot mir-robot01@v5688252
2021/01/28 16:41:32 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 185506126: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/28 16:41:33 Got update event from downstream for Robot mir-robot01@v5688255
2021/01/28 16:41:35 Got update event from downstream for Robot mir-robot01@v5688259
2021/01/28 16:41:37 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 185506126: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/28 16:41:37 Got update event from downstream for Robot mir-robot01@v5688262
2021/01/28 16:41:39 Got update event from downstream for Robot mir-robot01@v5688267
2021/01/28 16:41:41 Got update event from downstream for Robot mir-robot01@v5688270
2021/01/28 16:41:42 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 185506126: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/28 16:41:43 Got update event from downstream for Robot mir-robot01@v5688273
2021/01/28 16:41:45 Got update event from downstream for Robot mir-robot01@v5688277
2021/01/28 16:41:47 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 185506126: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/28 16:41:47 Got update event from downstream for Robot mir-robot01@v5688280
2021/01/28 16:41:50 Got update event from downstream for Robot mir-robot01@v5688284
2021/01/28 16:41:52 Got update event from downstream for Robot mir-robot01@v5688287
2021/01/28 16:41:52 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 185506126: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/28 16:41:54 Got update event from downstream for Robot mir-robot01@v5688290
2021/01/28 16:41:56 Got update event from downstream for Robot mir-robot01@v5688294
2021/01/28 16:41:57 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 185506126: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/28 16:41:58 Got update event from downstream for Robot mir-robot01@v5688297
2021/01/28 16:42:00 Got update event from downstream for Robot mir-robot01@v5688301
2021/01/28 16:42:02 Got update event from downstream for Robot mir-robot01@v5688304
2021/01/28 16:42:02 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 185506126: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/28 16:42:04 Got update event from downstream for Robot mir-robot01@v5688307
2021/01/28 16:42:06 Got update event from downstream for Robot mir-robot01@v5688311
2021/01/28 16:42:07 Syncing key "default/mir-robot01" from queue "downstream" failed: Robot default/mir-robot01 @ 185506126: update status failed: Operation cannot be fulfilled on robots.registry.cloudrobotics.com "mir-robot01": the object has been modified; please apply your changes to the latest version and try again
2021/01/28 16:42:08 Got update event from downstream for Robot mir-robot01@v5688314
2021/01/28 16:42:10 Got update event from downstream for Robot mir-robot01@v5688318

@oliver-goetz
Contributor Author

Anyhow, it is running better, because it at least refreshes every 5 minutes when the resyncPeriod is over. If we reduce this to 1 minute, that is at least a workaround.
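For reference, the resync period is just the interval handed to the informer factory when it is created, so a sketch of that workaround could look like this (the factory construction here is an assumption, not the cr-syncer's actual code):

```go
import (
	"time"

	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
)

// newInformerFactory builds dynamic informers that re-deliver all cached
// objects every minute, so a stuck key is re-queued after at most ~1 minute
// instead of 5. Sketch only.
func newInformerFactory(client dynamic.Interface) dynamicinformer.DynamicSharedInformerFactory {
	const resyncPeriod = 1 * time.Minute
	return dynamicinformer.NewDynamicSharedInformerFactory(client, resyncPeriod)
}
```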

@drigz
Contributor

drigz commented Jan 28, 2021 via email

@oliver-goetz
Contributor Author

I just checked the logs and the issue appeared before too, but far less often than now. Additionally, it rarely appeared on the Robot CRD, but rather on other CRDs which already had a status subresource.
Maybe this issue is related to the activation of the status subresource in the Robot CRD. Robot CRs are almost always changed on the robot side and at a relatively high frequency (every 2 seconds), whereas for other CRs the update frequency is not that high and the message flow is more of a back and forth between upstream and downstream. I can see the last status updates of the robots on a UI, which makes errors here much more transparent.

My guess is that this is somehow related to the activation of status subresources in CRDs, and we notice it right now because of the special circumstances with the Robot CRD.

@drigz
Contributor

drigz commented Jan 28, 2021

My guess is that this is somehow related to the activation of status subresources in CRDs

Thanks, that seems most likely. I would guess that the apiserver is misbehaving. I will try updating my cluster to 1.16.15-gke.7800 to see if I can reproduce the issue, although I was already on 1.16.15-gke.6000, so the difference is probably due to the differing usage patterns (I was doing frequent status updates but hadn't been doing them for a long time).

I guess the easiest workaround for you would be to either revert to the older release, or to disable the status subresource on the Robot CR and set CR_SYNCER_RBAC=false.

@oliver-goetz
Contributor Author

During my last test, when I just disabled the status subresource, it went well for a while. Then the errors appeared again; I don't know why.
That's why I decided to replace Update and UpdateStatus with Patch using JSONPatchType.
This was a kind of breakthrough regarding those errors in downstream: 4 robots have been running for more than an hour without any downstream issue. You find my implementation here.

It supports both the status and a status subtree; it adds the part which is owned by the robot (removing manually created items in this section) and, in the case of a status subtree, leaves the rest alone.

Yet, there are open points:

  • cr-syncer tests are currently failing, but I tested it in our cluster.
  • it should be implemented for upstream too, for the case where the robot owns the spec.

This solution looks promising to me. The concept of the federation layer describes pretty clearly which properties are owned by the cloud and which by the robots. This could be handled pretty well by patching CRs, so I don't see much benefit in messing around with keys that are not completely synced.

What do you think?
If you like it, I could implement the missing upstream part and fix the tests.

@drigz
Contributor

drigz commented Jan 29, 2021

I might be missing something, but it seems that patching doesn't address the root cause. As I understand it:

  • The cr-syncer's watch on the upstream cluster is failing. The upstream cluster gets updated by the cr-syncer, but the cr-syncer never gets the corresponding update event from upstream. In fact, it doesn't seem to get update events from upstream at all.
  • This means that, regardless of backoff strategy, the cr-syncer gets 409s because it uses the wrong upstream resource version.

Patching avoids the 409s because it means that the status can still be updated after the upstream watch fails. However, the cr-syncer will still fail to sync the spec, as it won't get update events. If the upstream spec changes, the downstream spec will not be updated. This is not so bad for the Robot (as spec changes are rare) but would presumably be bad for WarehouseOrders or ChartAssignments.

Do you think that's the case? Or is there a reason that the spec syncing would behave differently after this change?

@oliver-goetz
Contributor Author

Last week on Friday I had some time and finalized the patch version of the cr-syncer for up- and downstream, including fixing the tests and some cleanup. You can find the result here, if you would like to have a look at it.
Updating CRs in the cluster worked pretty smoothly from that point on, but you are right with your assumption that the delay in syncing the spec from cloud to robot has its downsides too. These are things like robots standing around with nothing to do, because the WarehouseOrder is in the cloud cluster but not yet synced to the robot cluster. Still, the sync from upstream to downstream was at least taking place every 5 minutes; from downstream to upstream, the exponential backoff of the DefaultControllerRateLimiter made the problem even worse. More thoughts about that later.

First the good thing: I finally found the root cause and "solved" it. The reason for all this was a bit unexpected, so let me summarize what went wrong:

  • Prometheus prometheus-kube-prometheus had an error with its persistent volume and was in a crash loop because of it.
    • Error messages looked like this: WAL segment loaded" segment=12220 maxSegment=12229.
    • I did not spend too much time on research. The first explanation I found was that there were too many files on the filesystem, which could make sense because it had been running for quite some time.
  • Each Prometheus restart and crash led to a reload of the nginx configuration.
  • Reloading the nginx configuration seems to break the long-running GET call for watching CRs.
    • It seems to be broken until the resync period of the informer is over.
    • Other GET, POST, PATCH and PUT calls to nginx were not affected. I thought reloading the configuration would not affect already open connections, but maybe there is a timeout in nginx for those cases and it closed the cr-syncer GET call after a while.
  • After I deleted the prometheus AppRollout, things got back to normal.

That could be a kind of pitfall, since any crashing app with an ingress configuration could break the robot communication. Maybe there is a way to change the nginx configuration to mitigate that risk. From my perspective it is worth investing some time here.

Even if it was not the root cause, there are some points in the cr-syncer that we might be able to improve:

  • The DefaultControllerRateLimiter is probably not the best choice for our scenario. I would suggest using NewItemFastSlowRateLimiter instead.
    • The exponential backoff made things a lot worse: while it is retrying at a high frequency the issue is probably not solved yet, and by the time it is solved (because of the resync) the cr-syncer is backing off. We had outages of hours with that. With NewItemFastSlowRateLimiter the outages would only last for the resyncPeriod.
    • This does not only apply to this error. A bad connection from robot to cloud could cause the same issue, and that's the case I'm more worried about.
  • Lowering the resyncPeriod could reduce the outage times in case of errors, at the cost of some additional load when everything is running smoothly.
  • Using Patch instead of Update would not solve the problem completely, but at least the communication from robot to cloud would be more resilient.
    • In case we find a way to improve the resilience of the cr-syncer, I would be fine with sticking with Update. On the other hand, I don't see a downside to using Patch.

@drigz What do you think?

@ensonic
Contributor

ensonic commented Feb 3, 2021

Why did "each Prometheus restart and crash leaded to a reload of nginx configuration"? Or was the nginx killed to free up memory due to Prometheus spiking in memory use?

@drigz
Contributor

drigz commented Feb 4, 2021

Amazing work Oliver! I can reproduce this by deleting the prometheus pod in the GKE cluster.

Why did "each Prometheus restart and crash leaded to a reload of nginx configuration"?

When it crashes ingress-nginx rewrites the nginx config to return a 503 for /prometheus/. When the pod comes back up, the working config is restored. Reloading the config causes nginx to shut down the old workers, terminating the existing WATCH connections. By testing with curl, it appears to be getting a TLS close notify, so I don't know why client-go doesn't start a new WATCH.

Changing the rate limiter and/or resync period to avoid extreme backoff seems sensible but not sufficient in my opinion. Config updates are not rare, and as you say, brief network outages could cause similar behavior.

Things I would like to try first:

  • working out why client-go doesn't notice the connection dropping
  • setting ReadIdleTimeout to detect dropped connections

@drigz
Contributor

drigz commented Feb 5, 2021

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.19.md#changelog-since-v1194: I believe upgrading the Kubernetes client to v1.19.4 will improve the behavior if the network drops out, by setting a timeout of 30s. We might want to apply this config manually, as upgrading to v1.19.4 may not be trivial.
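Applying that config manually would mean enabling the HTTP/2 health check on the client transport, roughly like this (a sketch using the golang.org/x/net/http2 knobs; the wiring into the cr-syncer's rest.Config is omitted, and the 15s ping timeout is an assumption):

```go
import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

// enableHTTP2HealthCheck turns on HTTP/2 pings on the transport so that a
// silently dropped connection is detected and closed after ~30s, instead of
// the watch hanging until the kernel's TCP keepalive gives up. Sketch only.
func enableHTTP2HealthCheck(t *http.Transport) error {
	t2, err := http2.ConfigureTransports(t)
	if err != nil {
		return err
	}
	t2.ReadIdleTimeout = 30 * time.Second // send a ping after 30s without frames
	t2.PingTimeout = 15 * time.Second     // close the connection if the ping gets no reply
	return nil
}
```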

I still haven't worked out why client-go doesn't notice the graceful close, though: curl sees a TLS close-notify with and without --http2, and the Go TLS client seems to explicitly handle and return EOF on close-notify. However, if I run the cr-syncer with GODEBUG=http2debug=2, I don't see anything at 14:38:10 (just after the upstream connection is closed) until 30 seconds later, when the transport is closed for being idle (but even that doesn't prompt client-go to reestablish the watch):

2021/02/05 14:38:10 Copied Robot dev-rodrigoq-muc status@v132415 to upstream@v1741976
2021/02/05 14:39:40 http2: Transport closing idle conn 0xc000e00a80 (forSingleUse=false, maxStream=15)
2021/02/05 14:39:40 http2: Transport readFrame error on conn 0xc000e00a80: (*net.OpError) read tcp 192.168.9.18:50316->35.240.89.64:443: use of closed network connection

drigz added a commit to drigz/core that referenced this issue Feb 8, 2021
This limits the effect of a connection dropout by detecting it and
reconnecting after ~30s. This is added to Kubernetes in v1.19.4, but
I think we need to switch to Go modules before updating to that.

Fixes googlecloudrobotics#64.

Change-Id: I709b6bcecc35fd28bd904e6047bf479b0609117d
drigz added a commit to drigz/core that referenced this issue Feb 8, 2021
The newer version will gracefully close the connection to the cr-syncer
when the config reloads, so this goes some way towards fixing googlecloudrobotics#64
(although the cr-syncer should handle dropped connections better).

The Deployment is based on upstream's deploy.yaml, but without RBAC
policies, the namespace and the service account, to avoid changing too
much. The newer Deployment drops privileges and has more graceful
shutdown behavior.
https://github.com/kubernetes/ingress-nginx/blob/master/deploy/static/provider/cloud/deploy.yaml

This requires adjusting the ingress due to the following breaking
changes (which, incidentally, despite the "breaking changes" category in
the changelog, were not listed as breaking...):

- the secure-backends annotation has been removed
- path rewriting now uses regex groups rather than prefixes

While we're here, we can stop using the deprecated extensions/v1beta1
Ingress.

Change-Id: I24c266f21e489a5c9079e9f884699c84d6be6a0e
@drigz
Contributor

drigz commented Feb 8, 2021

Good news, I think I got to the bottom of this. curl behaves differently because it is only doing one streaming request. When two or more streaming requests are going through a single HTTP/2 connection, the ingress shuts down differently: it immediately sends GOAWAY (Go seems to ignore this) and then drops the connection (without a TLS close-notify or TCP FIN) ten seconds later. 30s later, Go's TCP keepalive notices this, but that doesn't seem to propagate through the http2 layer, and so client-go doesn't see it and restart the watch.

Upgrading nginx is sufficient to improve the behavior, although that won't help when the connection is dropped for real due to lost connectivity.

I'll send the following changes for internal review:

cloud-robotics-github-robot pushed a commit that referenced this issue Feb 8, 2021
The newer version will gracefully close the connection to the cr-syncer
when the config reloads, so this goes some way towards fixing #64
(although the cr-syncer should handle dropped connections better).

The Deployment is based on upstream's deploy.yaml, but without RBAC
policies, the namespace and the service account, to avoid changing too
much. The newer Deployment drops privileges and has more graceful
shutdown behavior.
https://github.com/kubernetes/ingress-nginx/blob/master/deploy/static/provider/cloud/deploy.yaml

This requires adjusting the ingress due to the following breaking
changes (which, incidentally, despite the "breaking changes" category in
the changelog, were not listed as breaking...):

- the secure-backends annotation has been removed
- path rewriting now uses regex groups rather than prefixes

While we're here, we can stop using the deprecated extensions/v1beta1
Ingress.

Change-Id: I24c266f21e489a5c9079e9f884699c84d6be6a0e
GitOrigin-RevId: 564dd80
@oliver-goetz
Contributor Author

@drigz good work 👍
I just updated our main test cluster and its robots to the latest version and killed prometheus several times. The problem is solved; the cr-syncer on all robots kept running without any service interruption.
Thank you for fixing this 😄

@drigz
Contributor

drigz commented Feb 11, 2021

Couldn't have figured it out without your help, thank you too!
