Periodic drops in Locate healthy instance count #149
The CPU utilization shown on App Engine does not seem to go above 75%. It does seem to show the periodic scaling events, which would cause a lot of the Heartbeat instances to reconnect to the new Locate instances. It also looks like some of the health transmissions from the Heartbeat and the writes to Memorystore from the Locate are taking a bit longer than expected (not sure I see any periodic patterns in the former, though). So, what I think is happening is that high latency for some of these operations is causing the entries to expire, but the connection is not being terminated because the operations are succeeding (no error is returned). This would explain why the number of heartbeat connections recovers quickly, but the number of healthy instances drops and does not recover until the connection is killed and the registration is re-sent.
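A minimal sketch of that suspected failure mode, assuming the health/registration entry lives in Memorystore as a key with a TTL that each update refreshes (the key name, TTL value, and go-redis usage are illustrative assumptions, not the actual Locate/Heartbeat code):

```go
package heartbeat

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// registrationTTL is an assumed expiration; the real value may differ.
const registrationTTL = 30 * time.Second

// refreshHealth rewrites the instance's entry and resets its TTL. If this
// write is delayed longer than registrationTTL (e.g. during an App Engine
// scaling event or a slow Memorystore round trip), the previous entry has
// already expired and the instance briefly counts as unhealthy, yet Set
// still returns nil, so the caller sees success and the heartbeat
// connection is never torn down.
func refreshHealth(ctx context.Context, rdb *redis.Client, hostname string, health []byte) error {
	return rdb.Set(ctx, "heartbeat:health:"+hostname, health, registrationTTL).Err()
}
```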
Thank you for the update! Ooh, okay; so the registration is removed but health updates continue? Is it possible to make health updates without a remaining registration an error case (a rough sketch of that check follows this comment)? Other questions/thoughts:
Also a general thought / observation: the unintended interaction between user requests and platform requests (the latter being vulnerable to the former) seems like an anti-pattern. Ideally, these would be independent of one another. This could be an argument for splitting the request handling for the different cases into separate services.
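One possible shape of that error case, sketched against a hypothetical store interface (the interface and method names are assumptions, not the actual Locate API):

```go
package locate

import (
	"context"
	"errors"
	"fmt"
)

// ErrNotRegistered signals that a health update arrived for an instance
// whose registration entry has expired or was never written.
var ErrNotRegistered = errors.New("health update for unregistered instance")

// Store is a hypothetical view of the Memorystore-backed registration store.
type Store interface {
	RegistrationExists(ctx context.Context, hostname string) (bool, error)
	PutHealth(ctx context.Context, hostname string, health []byte) error
}

// UpdateHealth rejects health updates that have no backing registration,
// turning the silent-expiry case into an explicit error.
func UpdateHealth(ctx context.Context, s Store, hostname string, health []byte) error {
	ok, err := s.RegistrationExists(ctx, hostname)
	if err != nil {
		return fmt.Errorf("checking registration: %w", err)
	}
	if !ok {
		return ErrNotRegistered
	}
	return s.PutHealth(ctx, hostname, health)
}
```

With a check like this, a delayed refresh that lets the registration expire would surface as an error on the next health update, and the Heartbeat could re-register or reconnect instead of silently continuing.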
The recovery time has been reduced from ~1 hour to 5 mins.
I see mostly durations in the 0-2 seconds bucket now (though there are a few in the 30+ one).
Here is a query to identify clients with simple time-of-day schedules. I excluded from the query a large number (thousands) of Windows clients that run 1 or 2 tests per day against many servers. I am guessing this is a connectivity monitoring tool and that the tests are not synchronized with other clients (I saw this behavior in other beacon data). Let me know if there are other scoring properties that might be helpful. See the last subquery for documentation on the columns.
The following query joins 3-hour periodic client data from
The integration responsible for the 3-hour scheduled requests has been identified as the one using the "ndt7-client-go-cmd/0.5.0 ndt7-client-go-cmd/0.5.0" user agent string (graph). The following queries show that the requests are coming from 7876 different IPs, with a maximum of only 28 requests per IP per minute. So it is not practical to simply limit a specific set of IPs using this user agent; a more generic rate-limiting solution will need to be implemented (e.g., setting a limit on the number of requests per minute for any client).
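A generic sketch of that kind of limit, using golang.org/x/time/rate with one token bucket per client key (the key choice, rates, and wiring here are assumptions, not the change that was eventually made):

```go
package ratelimit

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// Limiter keeps one token bucket per client key (here, the remote address).
type Limiter struct {
	mu    sync.Mutex
	perIP map[string]*rate.Limiter
	rps   rate.Limit
	burst int
}

func New(rps rate.Limit, burst int) *Limiter {
	return &Limiter{perIP: make(map[string]*rate.Limiter), rps: rps, burst: burst}
}

// limiter returns (creating if needed) the bucket for a given key.
// In production the map would need periodic eviction to stay bounded.
func (l *Limiter) limiter(key string) *rate.Limiter {
	l.mu.Lock()
	defer l.mu.Unlock()
	lim, ok := l.perIP[key]
	if !ok {
		lim = rate.NewLimiter(l.rps, l.burst)
		l.perIP[key] = lim
	}
	return lim
}

// Middleware rejects requests that exceed the per-client budget with 429.
// r.RemoteAddr includes the port and, behind App Engine, is the proxy; a
// real deployment would key on the forwarded client IP instead.
func (l *Limiter) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !l.limiter(r.RemoteAddr).Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```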
Number of tests sent per day by the integration during the 10-minute slots of periodic requests every 3 hours: 76,497 (~2% of total tests).
In total, the "ndt7-client-go-cmd/0.5.0 ndt7-client-go/0.5.0" user agent string accounts for about 80,000 tests per day.
I wrote a query to detect all ndt7 clients that don't seem to be randomizing their run times (e.g., a large fraction of their tests fall in roughly the same minute).
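For illustration, the same heuristic sketched in Go rather than SQL (the thresholds and the minute bucketing are assumptions, not the query's actual logic): bucket each client's test start times by minute of the day and flag the client when the most common minute holds a large share of its tests.

```go
package schedule

import "time"

// looksScheduled reports whether a client's tests cluster in one minute of
// the day. fracThreshold and minTests are illustrative values. Bucketing by
// t.Minute() alone (minute within the hour) would also catch clients on
// hourly or 3-hour schedules.
func looksScheduled(testTimes []time.Time, fracThreshold float64, minTests int) bool {
	if len(testTimes) < minTests {
		return false
	}
	counts := make(map[int]int)
	for _, t := range testTimes {
		minuteOfDay := t.UTC().Hour()*60 + t.UTC().Minute()
		counts[minuteOfDay]++
	}
	peak := 0
	for _, c := range counts {
		if c > peak {
			peak = c
		}
	}
	return float64(peak)/float64(len(testTimes)) >= fracThreshold
}
```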
Here is more information on the BigQuery query. It is intended to have some additional (manually curated) filtering applied and then be used to generate a pattern matcher to block requests early in the process. The output columns are:
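Separately, as a rough illustration of the "block requests early" idea (not the actual change; the example regexp, package, and wiring are assumptions), a matcher generated from such a curated list could be installed as middleware ahead of the Locate handlers:

```go
package block

import (
	"net/http"
	"regexp"
)

// blockedUA would be generated from the curated BigQuery output; the
// pattern here is only an example.
var blockedUA = regexp.MustCompile(`^ndt7-client-go-cmd/0\.5\.0 `)

// Middleware rejects matching user agents before any Locate work is done.
func Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if blockedUA.MatchString(r.UserAgent()) {
			http.Error(w, "request rate limited", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```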
Change #162 was released yesterday at EOD. We did not see any periodic drops in the Locate's healthy instance count today.
As of July 20th, we have been observing sharp, periodic drops in the Locate's healthy instance count since about June 6th.
The drops affect all experiments equally and are only visible in production (not in sandbox or staging).
The drops seem to happen on roughly a ~3 hour schedule, which lines up with the 3-hour synchronous clients that target the platform.