Add support to ignore gRPC error codes for alerts #54

Merged
5 changes: 3 additions & 2 deletions charts/generic-service/README.md
@@ -154,14 +154,15 @@ app:
 | `alerting.cpu.maxThrottleFactor` | `0.01` | The maximum fraction of the container's execution time during which it experiences CPU throttling |
 | `alerting.cpu.quotaBufferFactor` | `1.0` | Multiplied with `resources.*.cpu` to determine minimum allowed unused CPU quota in namespace |
 | `alerting.http.sampleInterval` | `20m` | The time interval in which to measure HTTP responses for triggering alerts |
-| `alerting.http.referenceInterval` | `1w` | The time interval to to compare with the sample interval to detect changes |
+| `alerting.http.referenceInterval` | `1w` | The time interval to compare with the sample interval to detect changes |
 | `alerting.http.maxSlowdown` | `2.5` | The maximum HTTP response slowdown in the sample interval compared to the reference interval |
 | `alerting.http.max4xxRatio` | `2.5` | The maximum HTTP 4xx ratio increase in the sample interval compared to the reference interval |
 | `alerting.http.max5xxCount` | `0` | The maximum number of HTTP 5xx responses (except 504) in the sample interval |
 | `alerting.http.maxTimeoutCount` | `0` | The maximum number of HTTP gateway timeout responses (504) in the sample interval |
 | `alerting.grpc.requestsMetric` | `grpc_server_handled_total` | The name of the Prometheus metric counting gRPC requests |
+| `alerting.grpc.ignoreErrorCodes` | `[]` | Which non-successful gRPC status codes will be ignored for triggering alerts |
 | `alerting.grpc.sampleInterval` | `20m` | The time interval in which to measure gRPC responses |
-| `alerting.grpc.referenceInterval` | `1w` | The time interval to to compare with the sample interval to detect changes |
+| `alerting.grpc.referenceInterval` | `1w` | The time interval to compare with the sample interval to detect changes |
 | `alerting.grpc.maxErrorRatio` | `2.5` | The maximum gRPC error ratio increase in the sample interval compared to the reference interval |
 | `alerting.grpc.errorDuration` | | The duration for which the gRPC error rate has to remain elevated before triggering an alert |
 | `alerting.grpc.maxCriticalErrors` | `0` | The maximum number of critical gRPC errors responses in the sample interval |
5 changes: 3 additions & 2 deletions charts/generic-service/templates/alerts.yaml
@@ -197,10 +197,11 @@ spec:

 {{- if or (eq .Values.ingress.protocol "grpc") (eq .Values.ingress.protocol "grpcs") }}
 {{- if .Values.alerting.grpc.referenceInterval }}
+{{ $ignoreCodes := prepend .Values.alerting.grpc.ignoreErrorCodes "OK" }}
 - alert: GrpcErrors
 expr: |
-(sum(rate({{ .Values.alerting.grpc.requestsMetric }}{namespace="{{ .Release.Namespace }}",release="{{ .Release.Name }}",grpc_code!="OK"}[{{ .Values.alerting.grpc.sampleInterval }}])) / sum(rate({{ .Values.alerting.grpc.requestsMetric }}{namespace="{{ .Release.Namespace }}",release="{{ .Release.Name }}"}[{{ .Values.alerting.grpc.sampleInterval }}]))) /
-(sum(rate({{ .Values.alerting.grpc.requestsMetric }}{namespace="{{ .Release.Namespace }}",release="{{ .Release.Name }}",grpc_code!="OK"}[{{ .Values.alerting.grpc.referenceInterval }}])) / sum(rate({{ .Values.alerting.grpc.requestsMetric }}{namespace="{{ .Release.Namespace }}",release="{{ .Release.Name }}"}[{{ .Values.alerting.grpc.referenceInterval }}])))
+(sum(rate({{ .Values.alerting.grpc.requestsMetric }}{namespace="{{ .Release.Namespace }}",release="{{ .Release.Name }}",grpc_code!~"{{ $ignoreCodes | join "|" }}"}[{{ .Values.alerting.grpc.sampleInterval }}])) / sum(rate({{ .Values.alerting.grpc.requestsMetric }}{namespace="{{ .Release.Namespace }}",release="{{ .Release.Name }}"}[{{ .Values.alerting.grpc.sampleInterval }}]))) /
+(sum(rate({{ .Values.alerting.grpc.requestsMetric }}{namespace="{{ .Release.Namespace }}",release="{{ .Release.Name }}",grpc_code!~"{{ $ignoreCodes | join "|" }}"}[{{ .Values.alerting.grpc.referenceInterval }}])) / sum(rate({{ .Values.alerting.grpc.requestsMetric }}{namespace="{{ .Release.Namespace }}",release="{{ .Release.Name }}"}[{{ .Values.alerting.grpc.referenceInterval }}])))
 > {{ .Values.alerting.grpc.maxErrorRatio }}
 {{- if .Values.alerting.grpc.errorDuration }}
 for: {{ .Values.alerting.grpc.errorDuration }}
6 changes: 6 additions & 0 deletions charts/generic-service/values.schema.json
@@ -907,6 +907,12 @@
 "default": "grpc_server_handled_total",
 "description": "The name of the Prometheus metric counting gRPC requests"
 },
+"ignoreErrorCodes": {
+"type": "array",
+"items": {"type": "string"},
+"default": [],
+"description": "Which non-successful gRPC status codes will be ignored for triggering alerts"
+},
 "sampleInterval": {
 "type": "string",
 "default": "15m",
1 change: 1 addition & 0 deletions charts/generic-service/values.yaml
@@ -176,6 +176,7 @@ alerting:
 maxTimeoutCount: 0
 grpc:
 requestsMetric: grpc_server_handled_total
+ignoreErrorCodes: []
 sampleInterval: 20m
 referenceInterval: 1w
 maxErrorRatio: 2.5
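Usage sketch: the override below illustrates how the new `alerting.grpc.ignoreErrorCodes` value might be set for a release. The two codes listed are assumed examples, not values taken from this PR, and they only work if they match the `grpc_code` label values your metric actually emits.

```yaml
# values.yaml override (illustrative; the two codes below are assumed examples)
ingress:
  protocol: grpc          # the GrpcErrors rule is only rendered for grpc/grpcs
alerting:
  grpc:
    ignoreErrorCodes:     # added to the always-excluded "OK" code
      - NotFound
      - AlreadyExists
```

With these values the template prepends `OK` to the list and joins the entries with `|`, so the error-side selector should render as `grpc_code!~"OK|NotFound|AlreadyExists"`. Prometheus regex matchers are fully anchored, so only those exact codes are excluded from the error count. With the default empty list the selector renders as `grpc_code!~"OK"`, which matches the same series as the previous `grpc_code!="OK"`.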