This repository was archived by the owner on Jul 12, 2023. It is now read-only.
Merged
31 changes: 31 additions & 0 deletions docs/playbooks/Elevated_5xx.md
@@ -0,0 +1,31 @@
# Elevated 5xx Alert

This alert fires when more than 2 requests per second return a status code in the 500-599 range for more than five minutes.
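
As a rough illustration of the condition described above, the following Python sketch checks a series of 5xx request rates for a sustained breach. The function name, the one-sample-per-minute assumption, and the "five consecutive samples" reading of the duration are assumptions made for illustration; the actual alert policy defines the exact semantics.

```python
def alert_fires(rates_per_second, threshold=2.0, sustained_samples=5):
    """Return True if the rate stays above `threshold` for
    `sustained_samples` consecutive samples.

    Assumes one sample per minute, so 5 samples approximate 5 minutes.
    """
    run = 0  # length of the current streak of samples above the threshold
    for rate in rates_per_second:
        run = run + 1 if rate > threshold else 0
        if run >= sustained_samples:
            return True
    return False
```

A brief dip below the threshold resets the streak, which is why short bursts of 5xx responses do not page anyone under this reading.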

* First check if the site is up
  * Check if https://encv.org/ loads (or the appropriate domain for this environment)
  * Check if you can log in
    * If you can, admins aren't affected
    * If you can't log in, everyone is affected
* Post in chat that you've got the alert
  * Tell your team that you are actively looking at this alert, to reduce confusion.
* Look at the metrics dashboard
  * Load https://console.cloud.google.com/monitoring/dashboards
  * Open the dashboard titled "Verification Server", and look at the top left graph
  * Or query the Metrics explorer with the following MQL

```
cloud_run_revision::run.googleapis.com/request_count
| align rate()
| every 1m
| [resource.service_name, metric.response_code_class]
```

* Look for servers with elevated 5xx
* Look at the request logs; you can navigate by hand or use the following query

```
resource.type="cloud_run_revision"
resource.labels.service_name="e2e-runner"
severity=ERROR
```
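
The first step of this playbook (checking whether the site loads) can be scripted. The sketch below is one possible approach using only the Python standard library; the function name `site_is_up` and the injectable `opener` parameter are hypothetical, and treating any non-5xx status as "up" is an assumption, not a rule from this playbook. It does not cover the login check.

```python
import urllib.error
import urllib.request


def site_is_up(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return (ok, detail); ok is True if the URL answers with a non-5xx status."""
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; the status is on the exception.
        status = err.code
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection, timeout, and similar.
        return False, f"no response: {err}"
    return status < 500, f"HTTP {status}"
```

Calling `site_is_up("https://encv.org/")` (or the appropriate domain for the environment) returns a pair such as `(True, "HTTP 200")` when the page loads.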
36 changes: 36 additions & 0 deletions docs/playbooks/Elevated_Latency_Greater_than_2s.md
@@ -0,0 +1,36 @@
# Elevated Latency Greater than 2s

This alert fires when any `loadbalancing.googleapis.com/https/backend_latencies` stream is above a threshold of 2s for more than 5 minutes.

Check:

```
fetch https_lb_rule
| metric 'loadbalancing.googleapis.com/https/backend_latencies'
| filter
    (resource.backend_name != 'NO_BACKEND_SELECTED'
     && resource.forwarding_rule_name == 'verification-server-https')
| align delta(1m)
| every 1m
| group_by [resource.backend_target_name],
    [value_backend_latencies_percentile:
       percentile(value.backend_latencies, 99)]
| condition val() > 2000 'ms'
```
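
To make the condition in the query above concrete, here is a small Python sketch that applies a 99th-percentile check to a list of raw latencies. It is only an illustration: the function names are hypothetical, and it uses the nearest-rank method on raw samples, whereas Cloud Monitoring computes percentiles from a latency distribution, so exact values will differ.

```python
def nearest_rank_percentile(values, percent):
    """Return the nearest-rank percentile of `values` (percent is an int, 1-100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of percent * n / 100, kept at a minimum rank of 1.
    rank = max(1, (percent * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_condition_met(latencies_ms, threshold_ms=2000, percent=99):
    """Mirror `condition val() > 2000 'ms'` on the chosen percentile."""
    return nearest_rank_percentile(latencies_ms, percent) > threshold_ms
```

With 100 samples, the condition is met only when at least two of them exceed 2000 ms, which shows why a single slow request does not trip a p99 threshold.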

Troubleshooting:

* Check which backend is having problems
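* Check Logs Explorer for slow requests to this backend using the following query, replacing `BACKEND NAME` and `TIME` with the affected backend and a latency threshold in milliseconds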

```
resource.type="cloud_run_revision"
resource.labels.configuration_name="BACKEND NAME"
severity=INFO
(httpRequest.latency>"TIMEms" OR
jsonPayload.latency>"TIMEms")
```

* Check with the team whether there have been any recent or ongoing changes
* Check the Google Cloud status dashboard for ongoing incidents
* Open a support case.
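
The Logs Explorer filter in the troubleshooting steps above has two placeholders. A small helper like the following can fill them in consistently; the function name is hypothetical, and the backend name in the usage note is an example value, not one taken from this playbook.

```python
def slow_request_filter(backend_name, min_latency_ms):
    """Build the Logs Explorer filter from this playbook for one backend."""
    threshold = f"{min_latency_ms}ms"
    return "\n".join([
        'resource.type="cloud_run_revision"',
        f'resource.labels.configuration_name="{backend_name}"',
        "severity=INFO",
        f'(httpRequest.latency>"{threshold}" OR',
        f'jsonPayload.latency>"{threshold}")',
    ])
```

For example, `print(slow_request_filter("apiserver", 2000))` prints a filter that can be pasted into Logs Explorer, assuming a backend named `apiserver` exists in the environment.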
Empty file.
Empty file added docs/playbooks/Host_Down.md
Empty file.
@@ -0,0 +1,30 @@
# Realm Token Capacity Utilization Above Threshold

This alert fires when we are out of token quota. It notifies system admins because we can't notify realm admins.

To triage, check the following:

* Check the metric for realm capacity, using the MQL query below
* Contact the realm admin to confirm that this spike is expected

```
generic_task :: custom.googleapis.com/opencensus/en-verification-server/api/issue/realm_token_capacity_latest
| align rate()
| every 1m
| group_by [metric.realm], [max(value.realm_token_capacity_latest)]
```

Note that metrics only show the realm ID (an integer) because realm names are PII.

TODO(marilia): Add way to turn Realm ID into Realm Name.

To mitigate:

Once you know what realm it is and that this is legitimate use:


* Log in to the realm (click "Join Realm" in System Admin)
* Under your name click "Manage Realm - Settings"
* Then select the subheader "Abuse prevention"
* Add to the quota by giving a temporary burst of the amount you think this realm needs.


TODO(thegrinch): Add logging of new burst capacity.

15 changes: 15 additions & 0 deletions docs/playbooks/index.md
@@ -0,0 +1,15 @@
# Playbooks

This folder contains playbooks for responding to alerts and for common interactions with the verification server in production on GCP. All of our responses here are specific to GCP.

## Alerts

- [Host Down](Host_Down.md)
- [Elevated 5xx](Elevated_5xx.md)
- [Elevated Rate Limited Count](Elevated_Rate_Limited_Count.md)
- [Elevated Latency Greater than 2s](Elevated_Latency_Greater_than_2s.md)
- [Realm Token Capacity Utilization Above Threshold](Realm_Token_Capacity_Utilization_Above_Threshold.md)

## Common Actions

- New Realm Admin