Periodic drops in Locate healthy instance count #149
The CPU utilization shown on App Engine does not seem to go above 75%. It does seem to show the periodic scaling events, which would cause a lot of the Heartbeat instances to reconnect to the new Locate instances. It also looks like some of the health transmissions from the Heartbeat and the writes to Memorystore from the Locate are taking a bit longer than expected (not sure I see any periodic patterns in the former, though). So, what I think is happening is that high latency for some of these operations is causing the entries to expire, but the connection is not being terminated because the operations are succeeding (no error is returned). This would explain why the number of heartbeat connections recovers quickly, but the number of healthy instances drops and does not recover until the connection is killed and the registration is re-sent.
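A minimal sketch of that suspected failure mode, assuming the health/registration entry lives in Memorystore as a key with a TTL that each update refreshes (the key name, TTL value, and go-redis usage are illustrative assumptions, not the actual Locate/Heartbeat code):

```go
package heartbeat

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// registrationTTL is an assumed expiration; the real value may differ.
const registrationTTL = 30 * time.Second

// refreshHealth rewrites the instance's entry and resets its TTL. If this
// write is delayed longer than registrationTTL (e.g. during an App Engine
// scaling event or a slow Memorystore round trip), the previous entry has
// already expired and the instance briefly counts as unhealthy, yet Set
// still returns nil, so the caller sees success and the heartbeat
// connection is never torn down.
func refreshHealth(ctx context.Context, rdb *redis.Client, hostname string, health []byte) error {
	return rdb.Set(ctx, "heartbeat:health:"+hostname, health, registrationTTL).Err()
}
```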
Thank you for the update! Ooh, okay; so the registration is removed but health updates continue? Is it possible to make health updates without a remaining registration an error case (a rough sketch of that check follows this comment)? Other questions/thoughts:
Also a general thought / observation: the unintended interaction between user requests and platform requests (the latter being vulnerable to the former) seems like an anti-pattern. Ideally, these would be independent of one another. This could be an argument for splitting the request handling for the different cases into separate services.
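One possible shape of that error case, sketched against a hypothetical store interface (the interface and method names are assumptions, not the actual Locate API):

```go
package locate

import (
	"context"
	"errors"
	"fmt"
)

// ErrNotRegistered signals that a health update arrived for an instance
// whose registration entry has expired or was never written.
var ErrNotRegistered = errors.New("health update for unregistered instance")

// Store is a hypothetical view of the Memorystore-backed registration store.
type Store interface {
	RegistrationExists(ctx context.Context, hostname string) (bool, error)
	PutHealth(ctx context.Context, hostname string, health []byte) error
}

// UpdateHealth rejects health updates that have no backing registration,
// turning the silent-expiry case into an explicit error.
func UpdateHealth(ctx context.Context, s Store, hostname string, health []byte) error {
	ok, err := s.RegistrationExists(ctx, hostname)
	if err != nil {
		return fmt.Errorf("checking registration: %w", err)
	}
	if !ok {
		return ErrNotRegistered
	}
	return s.PutHealth(ctx, hostname, health)
}
```

With a check like this, a delayed refresh that lets the registration expire would surface as an error on the next health update, and the Heartbeat could re-register or reconnect instead of silently continuing.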
The recovery time has been reduced from ~1 hour to 5 mins.
I see mostly durations in the 0-2 seconds bucket now (though there are a few in the 30+ one).
Here is a query to identify clients with simple time-of-day schedules. I excluded from the query a large number (thousands) of Windows clients that run 1 or 2 tests per day against many servers. I am guessing this is a connectivity monitoring tool and that the tests are not synchronized with other clients (I saw this behavior in other beacon data). Let me know if there are other scoring properties that might be helpful. See the last subquery for documentation on the columns.
The following query joins 3-hour periodic client data from
The integration responsible for the 3-hour scheduled requests has been identified as the one using the "ndt7-client-go-cmd/0.5.0 ndt7-client-go-cmd/0.5.0" user agent string (graph). The following queries show that the requests are coming from 7876 different IPs, with a maximum of only 28 requests per IP per minute. So it is not practical to simply limit a specific set of IPs using this user agent; a more generic rate-limiting solution will need to be implemented (e.g., setting a limit on the number of requests per minute for any client).
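A generic sketch of that kind of limit, using golang.org/x/time/rate with one token bucket per client key (the key choice, rates, and wiring here are assumptions, not the change that was eventually made):

```go
package ratelimit

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// Limiter keeps one token bucket per client key (here, the remote address).
type Limiter struct {
	mu    sync.Mutex
	perIP map[string]*rate.Limiter
	rps   rate.Limit
	burst int
}

func New(rps rate.Limit, burst int) *Limiter {
	return &Limiter{perIP: make(map[string]*rate.Limiter), rps: rps, burst: burst}
}

// limiter returns (creating if needed) the bucket for a given key.
// In production the map would need periodic eviction to stay bounded.
func (l *Limiter) limiter(key string) *rate.Limiter {
	l.mu.Lock()
	defer l.mu.Unlock()
	lim, ok := l.perIP[key]
	if !ok {
		lim = rate.NewLimiter(l.rps, l.burst)
		l.perIP[key] = lim
	}
	return lim
}

// Middleware rejects requests that exceed the per-client budget with 429.
// r.RemoteAddr includes the port and, behind App Engine, is the proxy; a
// real deployment would key on the forwarded client IP instead.
func (l *Limiter) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.limiter(r.RemoteAddr).Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```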
Number of tests sent per day by the integration during the 10-minute slots of periodic requests every 3 hours: 76,497 (~2% of total tests).
In total, the "ndt7-client-go-cmd/0.5.0 ndt7-client-go/0.5.0" user agent string accounts for about 80,000 tests per day.
I wrote a query to detect all ndt7 clients that don't seem to be randomizing their run times (e.g., a large fraction of their tests fall in roughly the same minute).
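For illustration, the same heuristic sketched in Go rather than SQL (the thresholds and the minute bucketing are assumptions, not the query's actual logic): bucket each client's test start times by minute of the day and flag the client when the most common minute holds a large share of its tests.

```go
package schedule

import "time"

// looksScheduled reports whether a client's tests cluster in one minute of
// the day. fracThreshold and minTests are illustrative values. Bucketing by
// t.Minute() alone (minute within the hour) would also catch clients on
// hourly or 3-hour schedules.
func looksScheduled(testTimes []time.Time, fracThreshold float64, minTests int) bool {
	if len(testTimes) < minTests {
		return false
	}
	counts := make(map[int]int)
	for _, t := range testTimes {
		minuteOfDay := t.UTC().Hour()*60 + t.UTC().Minute()
		counts[minuteOfDay]++
	}
	peak := 0
	for _, c := range counts {
		if c > peak {
			peak = c
		}
	}
	return float64(peak)/float64(len(testTimes)) >= fracThreshold
}
```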
Here is more information on the BigQuery query. It is intended to have some additional (manually curated) filtering applied and then be used to generate a pattern matcher to block requests early in the process. The output columns are:
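Separately, as a rough illustration of the "block requests early" idea (not the actual change; the example regexp, package, and wiring are assumptions), a matcher generated from such a curated list could be installed as middleware ahead of the Locate handlers:

```go
package block

import (
	"net/http"
	"regexp"
)

// blockedUA would be generated from the curated BigQuery output; the
// pattern here is only an example.
var blockedUA = regexp.MustCompile(`^ndt7-client-go-cmd/0\.5\.0 `)

// Middleware rejects matching user agents before any Locate work is done.
func Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if blockedUA.MatchString(r.UserAgent()) {
			http.Error(w, "request rate limited", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```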
Change #162 was released yesterday at EOD. We did not see any periodic drops in the Locate's healthy instance count today.
As of July 20th, we have been observing sharp, periodic drops in the Locate's healthy instance count since about June 6th.
The drops affect all experiments equally and are only visible in production (not in sandbox or staging).
The drops seem to happen on roughly a ~3 hour schedule, which lines up with the 3-hour synchronous clients that target the platform.