google · google-oss-robot · Oct 29, 2020 · Oct 14, 2020 · Oct 26, 2020 · Oct 26, 2020
@@ -0,0 +1,31 @@
+# Elevated 5xx Alert
+
+This alert fires when more than 2 requests per second return a status code in the 500-599 range for more than five minutes.
+
+* First check if the site is up
+   * Check if https://encv.org/ loads (or appropriate domain for this environment)
+   * Check if you can log in
+   * If you can log in, admins aren't affected
+   * If you can't log in, everyone is affected
+* Post in chat that you've got the alert
+   * Tell your team that you are actively looking at this alert, to reduce confusion.
+* Look at the metrics dashboard
+   * Load https://console.cloud.google.com/monitoring/dashboards
+   * Open the dashboard titled "Verification Server" and look at the top-left graph
+   * Or query Metrics Explorer with the following MQL:
+
+```
+cloud_run_revision::run.googleapis.com/request_count
+| align rate()
+| every 1m
+| [resource.service_name, metric.response_code_class]
+```
+
+   * Look for servers with elevated 5xx rates
+   * Look at the request logs. You can navigate by hand or use the following query:
+
+```
+resource.type="cloud_run_revision"
+resource.labels.service_name="e2e-runner"
+severity=ERROR
+```
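+
+The query above matches on log severity. As a variation, here is a minimal sketch that filters request logs on the HTTP status code instead, assuming the request logs populate `httpRequest.status` (`SERVICE NAME` is a placeholder for the affected service, not a real service name):
+
+```
+resource.type="cloud_run_revision"
+resource.labels.service_name="SERVICE NAME"
+httpRequest.status>=500
+```
+
+This should narrow the results to the 5xx responses counted by the alert rather than every error-level log entry.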
@@ -0,0 +1,36 @@
+# Elevated Latency Greater than 2s
+
+This alert fires when any `loadbalancing.googleapis.com/https/backend_latencies` stream stays above a threshold of 2s for more than 5 minutes.
+
+Check:
+
+```
+fetch https_lb_rule
+| metric 'loadbalancing.googleapis.com/https/backend_latencies'
+| filter
+    (resource.backend_name != 'NO_BACKEND_SELECTED'
+     && resource.forwarding_rule_name == 'verification-server-https')
+| align delta(1m)
+| every 1m
+| group_by [resource.backend_target_name],
+    [value_backend_latencies_percentile:
+       percentile(value.backend_latencies, 99)]
+| condition val() > 2000 'ms'
+```
+
+Troubleshooting:
+
+* Check which backend is having problems
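+* Check Logs Explorer for slow requests to this backend with the following query, replacing `BACKEND NAME` with the backend and `TIME` with a latency threshold in milliseconds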
+
+```
+resource.type="cloud_run_revision"
+resource.labels.configuration_name="BACKEND NAME"
+severity=INFO
+(httpRequest.latency>"TIMEms" OR
+jsonPayload.latency>"TIMEms")
+```
+
+* Check with the team whether there are any ongoing or recent changes
+* Check the Google Cloud Status Dashboard for ongoing incidents
+  * Open a support case.
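+
+As an illustration, the log query above with its placeholders filled in might look like the following. The backend name `example-backend` is hypothetical, and the `2000ms` value is only chosen to match the 2s alert threshold:
+
+```
+resource.type="cloud_run_revision"
+resource.labels.configuration_name="example-backend"
+severity=INFO
+(httpRequest.latency>"2000ms" OR
+jsonPayload.latency>"2000ms")
+```
+
+Lowering the latency value (for example to `1000ms`) may help surface requests that are slow but not yet over the alert threshold.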
@@ -0,0 +1,30 @@
+# Realm Token Capacity Utilization Above Threshold
+
+This alert fires when we are out of token quota. We fire this alert to notify system admins because we can't notify realm admins.
+
+To triage, check the following:
+
+* The metric for realm capacity (see the query below)
+* Contact the realm admin to confirm that this spike is expected
+
+```
+generic_task :: custom.googleapis.com/opencensus/en-verification-server/api/issue/realm_token_capacity_latest
+| align rate()
+| every 1m
+| group_by [metric.realm], [max(value.realm_token_capacity_latest)]
+```
+
+Note that metrics only show the realm ID (an integer) because realm names are PII.
+
+TODO(marilia): Add way to turn Realm ID into Realm Name.
+
+To mitigate:
+
+Once you know which realm it is and that the usage is legitimate:
+
+
+* Log in to the realm (click "Join Realm" in System Admin)
+* Under your name, click "Manage Realm - Settings"
+* Then select the subheader "Abuse prevention"
+* Add to the quota by giving a temporary burst of the amount you think this realm needs.
+
+
+TODO(thegrinch): Add logging of new burst capacity.
+
@@ -0,0 +1,15 @@
+# Playbooks
+
+This folder contains playbooks for responding to alerts and for common interactions with the verification server in production on GCP. All of the responses here are specific to GCP.
+
+## Alerts
+
+ - [Host Down](Host_Down.md)
+ - [Elevated 5xx](Elevated_5xx.md)
+ - [Elevated Rate Limited Count](Elevated_Rate_Limited_Count.md)
+ - [Elevated Latency Greater than 2s](Elevated_Latency_Greater_than_2s.md)
+ - [Realm Token Capacity Utilization Above Threshold](Realm_Token_Capacity_Utilization_Above_Threshold.md)
+
+## Common Actions
+
+ - New Realm Admin