From b11b9ad29ad704c0c5f0f075bebf20f14820a574 Mon Sep 17 00:00:00 2001
From: Nat Welch
Date: Wed, 14 Oct 2020 18:24:20 +0000
Subject: [PATCH 1/5] init docs

---
 docs/playbooks/Elevated_5xx.md                    |  0
 .../playbooks/Elevated_Latency_Greater_than_2s.md |  0
 docs/playbooks/Elevated_Rate_Limited_Count.md     |  0
 docs/playbooks/Host_Down.md                       |  0
 ..._Token_Capacity_Utilization_Above_Threshold.md |  0
 docs/playbooks/index.md                           | 15 +++++++++++++++
 6 files changed, 15 insertions(+)
 create mode 100644 docs/playbooks/Elevated_5xx.md
 create mode 100644 docs/playbooks/Elevated_Latency_Greater_than_2s.md
 create mode 100644 docs/playbooks/Elevated_Rate_Limited_Count.md
 create mode 100644 docs/playbooks/Host_Down.md
 create mode 100644 docs/playbooks/Realm_Token_Capacity_Utilization_Above_Threshold.md
 create mode 100644 docs/playbooks/index.md

diff --git a/docs/playbooks/Elevated_5xx.md b/docs/playbooks/Elevated_5xx.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/playbooks/Elevated_Latency_Greater_than_2s.md b/docs/playbooks/Elevated_Latency_Greater_than_2s.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/playbooks/Elevated_Rate_Limited_Count.md b/docs/playbooks/Elevated_Rate_Limited_Count.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/playbooks/Host_Down.md b/docs/playbooks/Host_Down.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/playbooks/Realm_Token_Capacity_Utilization_Above_Threshold.md b/docs/playbooks/Realm_Token_Capacity_Utilization_Above_Threshold.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/playbooks/index.md b/docs/playbooks/index.md
new file mode 100644
index 000000000..ae570e8bf
--- /dev/null
+++ b/docs/playbooks/index.md
@@ -0,0 +1,15 @@
+# Playbooks
+
+This folder contains documents for playbooks both for responding to alerts and common interactions with the verification server in production on GCP. All of our responses here are specific to GCP.
+
+## Alerts
+
+ - [Host Down](Host_Down.md)
+ - [Elevated 5xx](Elevated_5xx.md)
+ - [Elevated Rate Limited Count](Elevated_Rate_Limited_Count.md)
+ - [Elevated Latency Greater than 2s](Elevated_Latency_Greater_than_2s.md)
+ - [Realm Token Capacity Utilization Above Threshold](Realm_Token_Capacity_Utilization_Above_Threshold.md)
+
+## Common Actions
+
+ - New Realm Admin

From caf59363cd9a9ed72e5f20d7fe7e1a973ec3f0e5 Mon Sep 17 00:00:00 2001
From: Nat Welch
Date: Mon, 26 Oct 2020 16:31:59 -0400
Subject: [PATCH 2/5] Add docs from meeting with marilia

---
 docs/playbooks/Elevated_5xx.md | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/docs/playbooks/Elevated_5xx.md b/docs/playbooks/Elevated_5xx.md
index e69de29bb..78a54bda6 100644
--- a/docs/playbooks/Elevated_5xx.md
+++ b/docs/playbooks/Elevated_5xx.md
@@ -0,0 +1,31 @@
+Elevated 5xx Alert
+
+This alert fires when greater than 2 requests per second returning status code of 500 -> 599 for greater than five minutes.
+
+First check if the site is up
+Check if https://encv.org/ loads (or appropriate domain for this environment)
+Check if you can login
+If you can, admins aren't affected
+If you can't login everyone is affected
+Post in chat that you've got the alert
+Communicate to your team that you are actively looking at this alert to lower confusion.
+Look at the metrics dashboard
+Load https://console.cloud.google.com/monitoring/dashboards
+Open the dashboard titled "Verification Server", and look at the top left graph
+Or query the Metrics explorer with the following MQL
+```
+cloud_run_revision::run.googleapis.com/request_count
+| align rate()
+| every 1m
+| [resource.service_name, metric.response_code_class]
+```
+Look for servers with elevated 5xx
+Look at request logs, you can navigate by hand or use the following query
+```
+resource.type="cloud_run_revision"
+resource.labels.service_name="e2e-runner"
+severity=ERROR
+```
+
+
+

From faa235716cda8dbdf4078707bac7616067efec15 Mon Sep 17 00:00:00 2001
From: Nat Welch
Date: Mon, 26 Oct 2020 16:33:50 -0400
Subject: [PATCH 3/5] with formatting

---
 docs/playbooks/Elevated_5xx.md | 31 ++++++++++++++-----------------
 1 file changed, 14 insertions(+), 17 deletions(-)

diff --git a/docs/playbooks/Elevated_5xx.md b/docs/playbooks/Elevated_5xx.md
index 78a54bda6..1da8b7e59 100644
--- a/docs/playbooks/Elevated_5xx.md
+++ b/docs/playbooks/Elevated_5xx.md
@@ -1,31 +1,28 @@
-Elevated 5xx Alert
+# Elevated 5xx Alert
 
 This alert fires when greater than 2 requests per second returning status code of 500 -> 599 for greater than five minutes.
 
-First check if the site is up
-Check if https://encv.org/ loads (or appropriate domain for this environment)
-Check if you can login
-If you can, admins aren't affected
-If you can't login everyone is affected
-Post in chat that you've got the alert
-Communicate to your team that you are actively looking at this alert to lower confusion.
-Look at the metrics dashboard
-Load https://console.cloud.google.com/monitoring/dashboards
-Open the dashboard titled "Verification Server", and look at the top left graph
-Or query the Metrics explorer with the following MQL
+* First check if the site is up
+  * Check if https://encv.org/ loads (or appropriate domain for this environment)
+  * Check if you can login
+    * If you can, admins aren't affected
+    * If you can't login everyone is affected
+* Post in chat that you've got the alert
+  * Communicate to your team that you are actively looking at this alert to lower confusion.
+* Look at the metrics dashboard
+  * Load https://console.cloud.google.com/monitoring/dashboards
+  * Open the dashboard titled "Verification Server", and look at the top left graph
+  * Or query the Metrics explorer with the following MQL
 ```
 cloud_run_revision::run.googleapis.com/request_count
 | align rate()
 | every 1m
 | [resource.service_name, metric.response_code_class]
 ```
-Look for servers with elevated 5xx
-Look at request logs, you can navigate by hand or use the following query
+  * Look for servers with elevated 5xx
+* Look at request logs, you can navigate by hand or use the following query
 ```
 resource.type="cloud_run_revision"
 resource.labels.service_name="e2e-runner"
 severity=ERROR
 ```
-
-
-

From c6212d6a0b7c13931c086cbb9b239e618e4d0d9c Mon Sep 17 00:00:00 2001
From: Nat Welch
Date: Wed, 28 Oct 2020 17:32:33 +0000
Subject: [PATCH 4/5] init 5xx

---
 docs/playbooks/Elevated_5xx.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/docs/playbooks/Elevated_5xx.md b/docs/playbooks/Elevated_5xx.md
index e69de29bb..cbf8fe2f2 100644
--- a/docs/playbooks/Elevated_5xx.md
+++ b/docs/playbooks/Elevated_5xx.md
@@ -0,0 +1,3 @@
+# Elevated 5xx
+
+##
\ No newline at end of file

From a8916b4b5bb87012ae0db92e8e0b91842347037b Mon Sep 17 00:00:00 2001
From: Nat Welch
Date: Wed, 28 Oct 2020 17:58:11 +0000
Subject: [PATCH 5/5] Add docs from today

---
 .../Elevated_Latency_Greater_than_2s.md       | 36 +++++++++++++++++++
 ...en_Capacity_Utilization_Above_Threshold.md | 30 ++++++++++++++++
 2 files changed, 66 insertions(+)

diff --git a/docs/playbooks/Elevated_Latency_Greater_than_2s.md b/docs/playbooks/Elevated_Latency_Greater_than_2s.md
index e69de29bb..cd5baef95 100644
--- a/docs/playbooks/Elevated_Latency_Greater_than_2s.md
+++ b/docs/playbooks/Elevated_Latency_Greater_than_2s.md
@@ -0,0 +1,36 @@
+# Elevated Latency Greater than 2s
+
+This alert fires when any loadbalancing.googleapis.com/https/backend_latencies stream is above a threshold of 2s for greater than 5 minutes.
+
+Check:
+
+```
+fetch https_lb_rule
+| metric 'loadbalancing.googleapis.com/https/backend_latencies'
+| filter
+    (resource.backend_name != 'NO_BACKEND_SELECTED'
+     && resource.forwarding_rule_name == 'verification-server-https')
+| align delta(1m)
+| every 1m
+| group_by [resource.backend_target_name],
+    [value_backend_latencies_percentile:
+       percentile(value.backend_latencies, 99)]
+| condition val() > 2000 'ms'
+```
+
+Troubleshooting:
+
+* Check which backend is having problems
+* Check Logs Explorer for requests to this server using the following query
+
+```
+resource.type="cloud_run_revision"
+resource.labels.configuration_name="BACKEND NAME"
+severity=INFO
+(httpRequest.latency>"TIMEms" OR
+jsonPayload.latency>"TIMEms")
+```
+
+* Check with the team whether there are any recent or ongoing changes
+* Check the Google Cloud Support Status Dashboard for ongoing incidents
+  * Open a support case.
diff --git a/docs/playbooks/Realm_Token_Capacity_Utilization_Above_Threshold.md b/docs/playbooks/Realm_Token_Capacity_Utilization_Above_Threshold.md
index e69de29bb..dc62391fa 100644
--- a/docs/playbooks/Realm_Token_Capacity_Utilization_Above_Threshold.md
+++ b/docs/playbooks/Realm_Token_Capacity_Utilization_Above_Threshold.md
@@ -0,0 +1,30 @@
+# Realm Token Capacity Utilization Above Threshold
+
+This alert fires when we are out of Token quota. We fire this alert to notify system admins because we can't notify realm admins.
+
+To triage, check the following:
+
+* Metric for realm capacity
+* Contact the realm admin to confirm that this spike is expected
+
+```
+generic_task :: custom.googleapis.com/opencensus/en-verification-server/api/issue/realm_token_capacity_latest | align rate() | every 1m | group_by [metric.realm], [max(value.realm_token_capacity_latest)]
+```
+
+Note that metrics only show the realm ID (an integer) because realm names are PII.
+
+TODO(marilia): Add way to turn Realm ID into Realm Name.
+
+To mitigate:
+
+Once you know what realm it is and that this is legitimate use:
+
+
+* Log in to the realm (Click Join Realm in System Admin)
+* Under your name click "Manage Realm - Settings"
+* Then select the subheader "Abuse prevention"
+* Add to the quota by giving a temporary burst of the amount you think this realm needs.
+
+
+TODO(thegrinch): Add logging of new burst capacity.
+
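
Review note on PATCH 5/5: the triage query in the Realm Token Capacity playbook groups `realm_token_capacity_latest` by `metric.realm` and takes the maximum per realm. A minimal Python sketch of that aggregation, which may help when reasoning about exported samples offline, could look like the following. The sample data, the 0.9 threshold, and the function names are hypothetical; the playbook does not state the alert threshold, and treating the metric value as a utilization fraction is an assumption.

```python
from typing import Dict, Iterable, List, Tuple


def max_by_realm(samples: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Return the largest sample value seen for each realm ID.

    Mirrors the `group_by [metric.realm], [max(...)]` step of the query.
    """
    peaks: Dict[int, float] = {}
    for realm_id, value in samples:
        if realm_id not in peaks or value > peaks[realm_id]:
            peaks[realm_id] = value
    return peaks


def realms_over_threshold(
    samples: Iterable[Tuple[int, float]], threshold: float
) -> List[int]:
    """List realm IDs whose peak value is strictly above the threshold."""
    peaks = max_by_realm(samples)
    return sorted(realm for realm, peak in peaks.items() if peak > threshold)


if __name__ == "__main__":
    # Hypothetical (realm ID, utilization) samples; realm names are not
    # available in metrics, so only integer IDs appear here.
    example = [(1, 0.2), (2, 0.95), (1, 0.5), (3, 0.91), (2, 0.4)]
    print(realms_over_threshold(example, threshold=0.9))
```

The output is a list of integer realm IDs only, which matches the playbook's note that metrics expose the realm ID rather than the realm name.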