Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Periodic drops in Locate healthy instance count #149

Open
cristinaleonr opened this issue Jul 21, 2023 · 15 comments
Open

Periodic drops in Locate healthy instance count #149

cristinaleonr opened this issue Jul 21, 2023 · 15 comments
Assignees

Comments

@cristinaleonr
Copy link
Contributor

On July 20th, we observed sharp, periodic drops in the Locate's healthy instance count since about June 6th.

Screenshot 2023-07-21 4 35 50 PM

Screenshot 2023-07-21 4 37 14 PM

The drops affect all experiments equally and are only visible in production (not in sandbox or staging).

Screenshot 2023-07-21 4 41 40 PM

The drops seem to happen on roughly a ~3 hour schedule, which lines up with the 3-hour synchronous clients that target the platform.

Screenshot 2023-07-21 4 45 27 PM

Screenshot 2023-07-21 4 45 38 PM

@stephen-soltesz
Copy link
Contributor

stephen-soltesz commented Aug 1, 2023

Thinking about this, I realized in our first investigation we did not look at the response of these flash crowds to the number of App Engine instances. We're clearly hitting the upper limit for instance count (40) - And, I'm again struck by how the total recovery time takes about an hour, which is the same as App Engine's longest connection time. And, that the number of reported heartbeat connections doesn't have the same long recovery as health status does. What could cause this mismatch?

count (locate_health_status{experiment=~"$experiment.*"})
Screen Shot 2023-07-31 at 10 42 59 PM

min by (experiment) (locate_health_status{experiment=~"$experiment.*"})
Screen Shot 2023-07-31 at 10 44 39 PM

sum by (experiment) (locate_current_heartbeat_connections{experiment=~"$experiment.*"})
Screen Shot 2023-07-31 at 10 44 28 PM

Zoomed in on one of the spikes, the healthy count drops before the instance count rises.
Screen Shot 2023-07-31 at 10 48 58 PM

@stephen-soltesz
Copy link
Contributor

I'm suspecting this might just be cpu exhaustion again. The config uses a 2 CPU VM. I recommend estimating the cpu time per request during steady state and seeing if the request rate during synchronous scheduling exceeds the expected capacity.

Screen Shot 2023-08-01 at 8 46 06 AM

@stephen-soltesz
Copy link
Contributor

Roughly about 30% of the time (one hour every 3 hours), we have about ~5-10% reduction in reported available servers for up to an hour, with instantaneous availability dropping as low as ~50% over shorter 4-5min windows.

Screen Shot 2023-08-01 at 6 17 17 PM

@cristinaleonr
Copy link
Contributor Author

The CPU utilization shown on App Engine does not seem to go above 75%.

Screenshot 2023-08-02 10 44 50 AM

It does seem to show the periodic scaling events, which would cause a lot of the Heartbeat instances to reconnect to the new Locate instances.

Screenshot 2023-08-02 11 09 52 AM

It also looks like some of the health transmissions from the Heartbeat and the writes to Memorystore from the Locate are taking a bit longer than expected (not sure I see any periodic patterns in the former, though).

Screenshot 2023-08-02 10 48 10 AM

Screenshot 2023-08-02 10 49 51 AM

So, what I think is happening is that high latency for some of these operations is causing the entries to expire, but the connection is not being terminated because the operations are succeeding (no error is returned). This would explain why the number of heartbeat connections recovers quickly, but the number of healthy instances drops and does not recover until the connection is killed and the registration is re-sent.

@stephen-soltesz
Copy link
Contributor

Thank you for the update! Ooh, okay; so the registration is removed but health updates continue? Is it possible to make health updates without a remaining registration an error case?

Other questions/thoughts:

  • I'm skeptical of the average cpu utilization. I can see individual instances with much higher utilization with prometheus metrics. Could the flash crowd be impacting a subset of the instances (and this why a small percentage of hosts are affected)?
  • Why would more instances cause established heartbeat connections to reconnect?
  • Is the top line of the "Heartbeat health transmission time" panel correct? It's filled with over 58sec events.

@stephen-soltesz
Copy link
Contributor

Also a general thought / observation, the unintended interaction between user-requests and platform-requests (the latter being vulnerable to the former) seems like an anti-pattern. Ideally, these would be independent of one another. This could be an argument for separating the request handling in different services for the different cases.

@cristinaleonr
Copy link
Contributor Author

The recovery time has been reduced from ~1 hour to 5 mins.
Screenshot 2023-08-17 9 45 51 AM

  • As App Engine scales, I would imagine it reallocates the sessions to balance the load. There is some documentation that suggest this is the case.

  • The query for the "Heartbeat health transmission time" panel seems correct. Let me know if you see a problem with it.

sum by(le) (increase(heartbeat_health_transmission_duration_bucket{workload=~"$experiment.*"}[5m]))

I see mostly durations in the 0-2 seconds bucket now (though there are a few in the 30+ one).
Screenshot 2023-08-17 9 46 26 AM

  • Agreed, we should prioritize separating user requests from platform ones.

@mattmathis
Copy link

Here is a query to identify clients with simple Time-of-Day schedules.
https://console.cloud.google.com/bigquery?sq=754187384106:6748707322b04693a2d74358ea40a7fd
Clients near the top of the list are likely to be causing us harm, clients further down not so much. It does not check to see if the tests are run in the same minute of each hour. I could add this (and other) properties if they would help.

I excluded from the query a large number (thousands) of windows clients that run 1 or 2 tests per day against many servers. I am guessing this is a connectivity monitoring tool and that the tests are not synchronized with other clients (I saw this behavior in other beacon data), LMK if there are other scoring properties that might be helpful.

See the last subquery for documentation on the columns.

@cristinaleonr
Copy link
Contributor Author

cristinaleonr commented Aug 30, 2023

The following query joins 3-hour periodic client data from measurement-lab.ndt.ndt7 with user agent data from the Locate ("2023-08-28").

WITH periodic AS (
  SELECT raw.ClientIP, COUNT(*) AS testCount, TIMESTAMP_TRUNC(a.TestTime, HOUR) AS roundTime
  FROM `measurement-lab.ndt.ndt7` 
  WHERE date = "2023-08-28"
  AND (TIMESTAMP_DIFF(a.TestTime, TIMESTAMP "2023-08-28 01:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(a.TestTime, TIMESTAMP "2023-08-28 04:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(a.TestTime, TIMESTAMP "2023-08-28 07:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(a.TestTime, TIMESTAMP "2023-08-28 10:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(a.TestTime, TIMESTAMP "2023-08-28 13:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(a.TestTime, TIMESTAMP "2023-08-28 16:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(a.TestTime, TIMESTAMP "2023-08-28 19:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(a.TestTime, TIMESTAMP "2023-08-28 22:00:00+00", SECOND) BETWEEN 1 AND 600
  )
  GROUP BY roundTime, ClientIP
  ORDER BY testCount DESC
),

windows AS (
  SELECT ClientIP, COUNT(*) AS windowCount, SUM(testCount) AS testSum, ARRAY_AGG(STRUCT(roundTime, testCount)) AS periods
  FROM periodic
  GROUP BY ClientIP
  ORDER BY windowCount DESC
),

locate AS (
  SELECT httpRequest.remoteIp, ARRAY_AGG(DISTINCT httpRequest.userAgent IGNORE NULLS) AS userAgent
  FROM `mlab-ns.locate_exports.appengine_googleapis_com_nginx_request_20230828`
  GROUP BY httpRequest.remoteIp
)

SELECT ClientIP, testSum, userAgent, periods
FROM windows INNER JOIN locate ON (ClientIP = locate.remoteIp)
WHERE windowCount = 8
ORDER BY testSum DESC

@cristinaleonr
Copy link
Contributor Author

The integration responsible for the 3-hour scheduled requests has been identified as that using the "ndt7-client-go-cmd/0.5.0 ndt7-client-go-cmd/0.5.0" user agent string (graph).

Screenshot 2023-09-07 5 31 48 PM

The following queries show that the requests are coming from 7876 different IPs, with a maximum of only 28 requests per IP/minute. So, it's not possible to just limit a specific set of IPs using this user agent. A more generic rate limiting solution will need to be implemented (e.g., setting a limit on the number of requests/minute for any client).

SELECT httpRequest.remoteIp AS metric, COUNT(*) AS requests
FROM `mlab-ns.locate_exports.appengine_googleapis_com_nginx_request_20230830`
WHERE httpRequest.userAgent = "ndt7-client-go-cmd/0.5.0 ndt7-client-go/0.5.0"
GROUP BY httpRequest.remoteIp
ORDER BY requests DESC
SELECT TIMESTAMP_TRUNC(timestamp, MINUTE) AS time, httpRequest.remoteIp AS metric, COUNT(*) AS requests
FROM `mlab-ns.locate_exports.appengine_googleapis_com_nginx_request_20230830`
WHERE httpRequest.userAgent = "ndt7-client-go-cmd/0.5.0 ndt7-client-go/0.5.0"
GROUP BY time, httpRequest.remoteIp
ORDER BY requests DESC

@cristinaleonr
Copy link
Contributor Author

cristinaleonr commented Oct 1, 2023

Number of tests sent per day by the integration during the 10 minute slots of periodic requests every 3 hours: 76,497 (~2% of total tests).

SELECT COUNT(*) AS numTests
  FROM `mlab-ns.locate_exports.appengine_googleapis_com_nginx_request_20230927`
  WHERE httpRequest.userAgent = "ndt7-client-go-cmd/0.5.0 ndt7-client-go/0.5.0"
  AND (TIMESTAMP_DIFF(timestamp, TIMESTAMP "2023-09-27 01:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(timestamp, TIMESTAMP "2023-09-27 04:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(timestamp, TIMESTAMP "2023-09-27 07:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(timestamp, TIMESTAMP "2023-09-27 10:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(timestamp, TIMESTAMP "2023-09-27 13:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(timestamp, TIMESTAMP "2023-09-27 16:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(timestamp, TIMESTAMP "2023-09-27 19:00:00+00", SECOND) BETWEEN 1 AND 600
    OR TIMESTAMP_DIFF(timestamp, TIMESTAMP "2023-09-27 22:00:00+00", SECOND) BETWEEN 1 AND 600
  )

@cristinaleonr
Copy link
Contributor Author

In total, tests with the "ndt7-client-go-cmd/0.5.0 ndt7-client-go/0.5.0" user agent string represent about 80,000 tests per day.

SELECT COUNT(*) AS numTests
  FROM `mlab-ns.locate_exports.appengine_googleapis_com_nginx_request_20230927`
  WHERE httpRequest.userAgent = "ndt7-client-go-cmd/0.5.0 ndt7-client-go/0.5.0"

@mattmathis
Copy link

I wrote a query to detect all ndt7 clients that don't seem to be randomizing their run times (e.g. a large fraction of their tests are in roughly the same minute).
BigQuery
Client amd64,ndt7-client-go,0.4.1,ndt7-client-go-cmd,windows,0.4.1 is not just hitting hard, it is also hitting 9-11 different sites. There are lots of other clients that are also using poorly randomized timers

@mattmathis
Copy link

mattmathis commented Oct 3, 2023

Here is more information on the BigQuery

It intended to has some additional (manually curated) filtering and then be used to generate a pattern matcher, to block requests early in the process. The Output columns are:

  • ClientIP - yes that IP address
  • clientFingerPrint - Concatenated client metadata
  • totalTests per ClientIP*clientFingerPrint
    • (For ist and (none),this is likely to include more than one actual client)
  • minuteBin - Minute of the tests rounded to the nearest 5 minutes (-1, +4)
    • all subsequent columns are grouped by ClientIPclientFingerPrintminuteBin
  • hours - number of active hours per day
  • hourBits - Bitwise encoding of active hours. 0x1 is 0:00-0:59 UTC
  • tests - count
  • Metros - Number metros tested
  • Site - Number of sites tested
  • exampleSite - one randomly chosen exampleSite (not stable)

@cristinaleonr
Copy link
Contributor Author

Change #162 was released yesterday at EOD. We did not see any periodic drops in the Locate's healthy instance count today.

Screenshot 2023-10-03 3 49 55 PM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants