
ysera load is jumping up/down #685

Closed · pnorman opened this issue Jul 18, 2022 · 17 comments
Labels: incident-report, service:tiles (The raster map on tile.openstreetmap.org)


pnorman commented Jul 18, 2022

[Graph: ysera load average]

I considered that this might be a sampling artifact, where conversion to a rate produces oddities, but the same pattern is present on the CPU graphs:

[Graph: ysera CPU usage]

pnorman added the service:tiles label on Jul 18, 2022

pnorman commented Jul 18, 2022

It's paired with culebre, which is showing normal load.

Relevant VCL:

director european_servers_0 random {
  { .backend = F_Odin___Amsterdam; .weight = 10; }
  { .backend = F_Nidhogg___Umea; .weight = 30; }
}
director european_servers_1 random {
  { .backend = F_Ysera___Slough; .weight = 10; }
  { .backend = F_Culebre___Dublin; .weight = 30; }
}
declare local var.x_mod INTEGER;

if (server.region == "APAC" || server.region == "Asia" || server.region ~ "^Asia-") {
  set req.backend = australia;
} else if ((server.region == "North-America" || server.region ~ "^(US|SA)-")
&& server.datacenter != "IAD" && server.datacenter != "EWR" && server.datacenter != "LGA" && server.datacenter != "YYZ" && server.datacenter != "MIA" && server.datacenter != "DFW" ) {
  set req.backend = america;
} else {
  if (req.url.path ~ "^/([1-9]?[0-9]+)/([0-9]+)/([0-9]+)\.png") {
    // Compute the X metatile and split into two groups
    set var.x_mod = std.atoi(re.group.2);
    set var.x_mod /= 8;
    set var.x_mod %= 2;
    
    // Assign to different servers based on groups
    if (var.x_mod == 0 ) {
      set req.backend = europe_0;
    } else {
      set req.backend = europe_1;
    }
  } else {
    set req.backend = europe_0;
  }
}
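
As a worked example (derived from the config above, not from the original comment): a request for /13/4100/2766.png has X = 4100, so var.x_mod becomes 4100 / 8 = 512 by integer division, then 512 % 2 = 0, and the request is routed to europe_0; a request with X = 4104 gives 513 % 2 = 1 and goes to europe_1. Dividing by 8 first keeps every tile column of an 8-wide metatile in the same group, presumably matching mod_tile's 8×8 metatiles, and within each group the random director splits traffic 10:30, so odin and ysera should each see about 25% of their group's European traffic.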


pnorman commented Jul 18, 2022

Turned off ysera at the CDN. No major impact on traffic. odin and nidhogg are spiking. I suspect someone is using the tile servers directly, bypassing the CDN.


pnorman commented Jul 18, 2022

Restored the CDN to the config it was running over the weekend, but it hasn't changed anything.

Odin is badly off. It started at about 07:00 UTC.
[Graph: odin load]


pnorman commented Jul 18, 2022

Contacted Fastly support. Unsure what's going on.


pnorman commented Jul 18, 2022

Graph of per-second MISS counts hitting odin, from:

SELECT COUNT(*), time FROM logs.fastly_logs_v18
WHERE year = 2022 AND month = 07 AND day = 18 AND hour = 08
  AND render = 'odin.openstreetmap.org' AND cachehit = 'MISS'
GROUP BY time ORDER BY time;

[Graph: fastly-odin-log-bysecond]


pnorman commented Jul 18, 2022

Removed odin from all_servers and european_servers_all groups to see if it has any impact.


pnorman commented Jul 18, 2022

> Removed odin from all_servers and european_servers_all groups to see if it has any impact.

No impact, as expected. Now removing odin completely from the CDN.


pnorman commented Jul 18, 2022

Removing it from the CDN removes all load, as expected. Bringing it back in with minimal weight.


pnorman commented Jul 18, 2022

Bringing it back with a weight of 1 has brought the traffic back up to the same level as when it had a weight of 10.


pnorman commented Jul 18, 2022

I'm wondering if it's healthcheck-related somehow. If nidhogg failed its healthcheck, all of its traffic would be diverted to odin, regardless of weighting.

Trying an adjustment: making the healthcheck require only 1 of the last 10 checks to pass, with checks every 5s. See the sketch below.
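
For illustration, a minimal Varnish-style probe expressing those settings; on Fastly the healthcheck is normally configured through the service settings rather than handwritten like this, and the hostname and path here are hypothetical:

backend F_Nidhogg___Umea {
  .host = "nidhogg.example.org";   # hypothetical hostname
  .probe = {
    .url = "/";                    # hypothetical healthcheck path
    .interval = 5s;                # run a check every 5 seconds
    .window = 10;                  # consider the last 10 checks
    .threshold = 1;                # only 1 of those 10 must pass to stay healthy
    .initial = 1;                  # count 1 check as passed at startup
  }
}

With a 1/10 threshold over the window, a backend is only marked unhealthy after sustained failures, which should make the routing far less sensitive to the intermittent check failures suspected here.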


pnorman commented Jul 18, 2022

I changed the health check to a 1/10 threshold and window, 1 initial, 5000ms interval, and the behavior changed.

The behavior is still not correct: with weights of 10 and 30 there should be a 1:3 ratio between odin traffic and nidhogg traffic. But the fact that it changed at all indicates some connection with the health checks.

[Graph: fastly-odin-nidhogg-healthcheck change]


pnorman commented Jul 18, 2022

Confirmed that servers were failing health checks: at least pyrene, nidhogg, and ysera.

A cluster-wide Apache restart has been completed, and that appears to have fixed it.


pnorman commented Jul 18, 2022

A few things to investigate:

What was wrong that an Apache restart fixed?

The following error was observed:

[Mon Jul 18 09:38:37.894876 2022] [mpm_event:error] [pid 2239:tid 140189126043520] AH03490: scoreboard is full, not at MaxRequestWorkers.Increase ServerLimit.

What happened to Pyrene?

Pyrene stopped responding to all traffic at 2022-07-17T00:03:15Z. This is suspiciously close to midnight UTC.

What monitoring do we need?

We had no monitoring indicating that health checks were failing, and when I loaded tiles manually everything worked, so it wasn't immediately obvious what had gone wrong.

What alarms do we need?

Probably some Apache ones, and perhaps also alarms on the health checks themselves.


pnorman commented Jul 18, 2022

Failover from US to Europe doesn't work as intended. america is defined as a fallback from pyrene to all_servers, which will put US traffic on the wrong even/odd servers. The same is true for Australia.

Each region needs its own even/odd split defined, even if the even/odd servers are the same. They can then fall back to the right servers in Europe, along the lines of the sketch below.
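
A minimal sketch of that shape, assuming one US backend and that a director can be listed as the backend of a fallback director; the names below are illustrative, not the production config:

director us_servers_0 fallback {
  { .backend = F_Pyrene___US; }        # primary US backend for the even group (name illustrative)
  { .backend = european_servers_0; }   # falls back to the matching even split in Europe
}
director us_servers_1 fallback {
  { .backend = F_Pyrene___US; }        # primary US backend for the odd group
  { .backend = european_servers_1; }   # falls back to the matching odd split in Europe
}

vcl_recv would then select us_servers_0 or us_servers_1 with the same var.x_mod even/odd test used for Europe, instead of sending all US traffic through a single america director.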


pnorman commented Jul 19, 2022

The tile service experienced degraded service from 2022-07-18T07:30:00Z to approximately 2022-07-18T10:00:00Z, impacting an average of 4% of traffic during the event.

See graph of the ratio of 404/5xx responses to all requests:

sum(rate(fastly_rt_status_code_total{service_name=~"OSM Tiles CDN",status_code=~"404|5.."}[$__rate_interval])) / sum(rate(fastly_rt_requests_total{service_name=~"OSM Tiles CDN"}[$__rate_interval]))

[Graph: CDN error ratio during the incident]

pnorman self-assigned this on Jul 24, 2022

pnorman commented Jul 25, 2022

Published in openstreetmap/owg-website@6fd72e7

pnorman closed this as completed on Jul 25, 2022