ysera load is jumping up/down #685
Comments
It's paired with culebre, which is showing normal load.

Relevant VCL:

director european_servers_0 random {
    { .backend = F_Odin___Amsterdam; .weight = 10; }
    { .backend = F_Nidhogg___Umea; .weight = 30; }
}
director european_servers_1 random {
    { .backend = F_Ysera___Slough; .weight = 10; }
    { .backend = F_Culebre___Dublin; .weight = 30; }
}

declare local var.x_mod INTEGER;
if (server.region == "APAC" || server.region == "Asia" || server.region ~ "^Asia-") {
    set req.backend = australia;
} else if ((server.region == "North-America" || server.region ~ "^(US|SA)-")
    && server.datacenter != "IAD" && server.datacenter != "EWR" && server.datacenter != "LGA" && server.datacenter != "YYZ" && server.datacenter != "MIA" && server.datacenter != "DFW" ) {
    set req.backend = america;
} else {
    if (req.url.path ~ "^/([1-9]?[0-9]+)/([0-9]+)/([0-9]+)\.png") {
        // Compute the X metatile and split into two groups
        set var.x_mod = std.atoi(re.group.2);
        set var.x_mod /= 8;
        set var.x_mod %= 2;
        // Assign to different servers based on groups
        if (var.x_mod == 0 ) {
            set req.backend = europe_0;
        } else {
            set req.backend = europe_1;
        }
    } else {
        set req.backend = europe_0;
    }
} |
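For anyone who wants to check which group a given tile lands in, here is a minimal Python sketch of the even/odd metatile split in the VCL above. It is not part of the deployed config; the function name `europe_group` and the constant `METATILE_SIZE` are made up for illustration, and the regex and the divide-by-8 are copied from the VCL.

```python
import re

# Same pattern as the VCL: /{zoom}/{x}/{y}.png
TILE_RE = re.compile(r"^/([1-9]?[0-9]+)/([0-9]+)/([0-9]+)\.png")

METATILE_SIZE = 8  # taken from the "/= 8" step in the VCL


def europe_group(path: str) -> str:
    """Return the backend name the European branch of the VCL would select."""
    match = TILE_RE.match(path)
    if match is None:
        # Non-tile URLs always go to the first group
        return "europe_0"
    x = int(match.group(2))
    # Integer-divide to get the metatile column, then alternate between groups
    if (x // METATILE_SIZE) % 2 == 0:
        return "europe_0"
    return "europe_1"
```

With this split, tile columns 0 to 7 map to europe_0, columns 8 to 15 to europe_1, columns 16 to 23 back to europe_0, and so on, so all tiles of one metatile stay on the same pair of servers.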
Turned off ysera at the CDN. No major impact on traffic. odin and nidhogg are spiking. Suspecting that there's someone using the tile server and bypassing the CDN. |
Contacted Fastly support. Unsure what's going on. |
Removed odin from all_servers and european_servers_all groups to see if it has any impact. |
No impact, as expected. Now removing odin completely from the CDN. |
Removing it from the CDN removes all load, as expected. Bringing it back in with minimal weight. |
Bringing it back with a weight of 1 has brought the traffic up to the same level as when it had a weight of 10. |
I'm wondering if it's healthcheck related somehow. If nidhogg failed its healthcheck, it would divert all of its traffic to odin, regardless of weighting. Trying an adjustment by making the healthcheck only want 1/10 of checks to pass, with checks every 5s. |
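To illustrate the hypothesis, here is a rough Python sketch, not Fastly's implementation. It assumes the random director splits traffic in proportion to weight among only those backends currently considered healthy, and that a backend counts as healthy when at least `threshold` of its last `window` probes passed. `HealthWindow` and `traffic_share` are hypothetical names.

```python
from collections import deque


class HealthWindow:
    """Sliding window of recent probe results for one backend."""

    def __init__(self, window: int = 10, threshold: int = 1):
        # Only the newest `window` results are kept
        self.results = deque(maxlen=window)
        # Number of passing probes needed to count as healthy
        self.threshold = threshold

    def record(self, passed: bool) -> None:
        self.results.append(passed)

    def healthy(self) -> bool:
        return sum(self.results) >= self.threshold


def traffic_share(weights, healthy):
    """Expected share of requests per backend, counting only healthy ones."""
    eligible = {name: w for name, w in weights.items() if name in healthy}
    total = sum(eligible.values())
    if total == 0:
        return {name: 0.0 for name in weights}
    return {name: eligible.get(name, 0) / total for name in weights}
```

Under these assumptions, odin gets a quarter of the group's traffic at weight 10 while nidhogg is healthy, but the whole group's traffic whenever nidhogg is marked unhealthy, whether odin's weight is 10 or 1. That would match the observation that lowering the weight made no difference.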
Confirmed that servers were failing health checks - at least pyrene, nidhogg, and ysera. A cluster-wide apache restart has been completed, and that appears to have fixed it. |
A few things to investigate

What was wrong that an Apache restart fixed?
The error

What happened to Pyrene?
Pyrene stopped responding to all traffic at 2022-07-17T00:03:15Z. This is suspiciously close to midnight UTC.

What monitoring do we need?
We had no monitoring that indicated that health checks were failing, and when I loaded tiles manually, everything worked, so it wasn't immediately obvious what went wrong.

What alarms do we need?
Probably some apache ones, but perhaps also on monitoring of health checks |
Failover from US to Europe doesn't work as intended. Each region needs its own even/odd split defined, even if the even/odd servers are the same. They can then fall back to the right servers in Europe |
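One way to picture the intended behaviour is the Python sketch below. It assumes each region gets its own pair of groups named `<region>_0` and `<region>_1` (hypothetical names, e.g. `america_0`), and that a request falls back to the European group with the same parity when its regional group is unavailable. `pick_group` is an illustrative helper, not something in the current VCL.

```python
def pick_group(region, x, available):
    """Pick the regional even/odd group for tile column x, else the matching European one."""
    parity = (x // 8) % 2  # same metatile split as the European branch
    for candidate in (f"{region}_{parity}", f"europe_{parity}"):
        if candidate in available:
            return candidate
    return None  # nothing with the right parity is available
```

For example, a US request for tile column 8 would use america_1 while it is up and fall back to europe_1, not europe_0, when it is down, so the fallback keeps hitting the servers that already hold that half of the tiles.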
Published in openstreetmap/owg-website@6fd72e7 |
I considered that this might be a sampling artifact where conversion to a rate results in oddities, but the same pattern is present on the CPU graphs.