Skip to content

Commit

Permalink
fix(OMA): readding doc I accidentally sent to the shadow realm
Browse files Browse the repository at this point in the history
  • Loading branch information
jeff-colucci committed Apr 26, 2024
1 parent a2b9884 commit 04a4a8a
Showing 1 changed file with 175 additions and 0 deletions.
175 changes: 175 additions & 0 deletions src/content/docs/tutorial-create-alerts/manage-alert-quality.mdx
Original file line number Diff line number Diff line change
@@ -0,0 +1,175 @@
---
title: "Manage your alert quality"
metaDescription: "How to manage the quality of your alerts with New Relic"
redirects:
- /docs/new-relic-solutions/observability-maturity/uptime-performance-reliability/alert-quality-management-guide/
- /docs/new-relic-solutions/best-practices-guides/alerts-applied-intelligence/alert-quality-management-tutorial/
- /docs/new-relic-solutions/observability-maturity/aqm-implementation-guide
- /docs/new-relic-solutions/observability-maturity/uptime-performance-reliability/aqm-implementation-guide
freshnessValidatedDate: 2023-07-20
---

When teams receive too many alerts or too many false alarms, alert fatigue begins to occur. As either factor increases, that fatigue begins to have serious, negative consequences. Overwhelmed incident responders become used to false alerts, and the prioritize ones that are easier to resolve quickly instead of more serious issues. Worse, they often begin to simply close unresolved incidents to stay within response time targets. This means real alerts become lost in the noise while incident response times and severe outage occurances increase.

To fix alert fatigue and prevent it from occurring in the future, you must improve the quality of your alerts. Adopting a policy of alert quality management (AQM) focuses on reducing the number of nuisance incidents so that you focus only on <InlinePopover type="alerts" /> with true business impact. This reduces alert fatigue and ensures that you and your team focus your attention on the right places at the right times.

You're a good candidate for AQM if:

* You have too many alerts.
* You have alerts that stay open for long time periods.
* You have a lot of alerts aren't relevant.
* Your customers discover your issues before your monitoring tools do.

<Callout variant="tip">
Want to try a hands on learning approach before you start implementing this in your account? Check out the [alert quality management lab](https://learn.newrelic.com/hands-on-lab-alert-quality-management).
</Callout>

## Why use alert quality management? [#why-aqm]

When adopting practices based on alert quality management, you'll decrease response time and increase awareness of critical events. As you improve your alert signal-to-noise ratio, you'll lessen confusion and be able to rapidly identify and isolate the root cause of your problems. The goal is to reduce less valuable alerts while creating easier ways to identify when more valuable incidents occur. This results in:

* Increased uptime and availability.
* Reduced mean time to resolution (MTTR).
* Decreased alert volume.
* The ability to easily identify alerts that are not valuable, so you can either make them valuable or remove them.

## Using key performance indicators [#kpi]

Using the right key performance indicators (KPIs) helps you to find the noisiest and least valuable alerts so you can improve their value or remove them. You'll use the AQM process to collect and measure incident volume and engagement KPIs, then use them to identify trends to fix issues that create serious problems. Below, you'll find information on all the KPIs, as well as a NRQL query for each one to help you monitor them from anywhere in the New Relic UI.

### Incident volume [#volume]

You should treat incidents (with or without alerts) like a queue of tasks. Just like a queue, the number of alerts should always be as close to zero as possible. Each incident should trigger an investigatory or corrective action to resolve the condition. If an alert doesn't result in some sort of action, then you should question the value of the alert condition.

In particular, if you see specific incidents that are frequently triggered, then you should question whether you're in a constant state of meaningful impact or if you simply have a large volume of noise. The incident volume KPIs help you answer those questions and measure progress towards a healthy state of high quality alerting.

<CollapserGroup>
<Collapser
id="kpi-incident-count"
title="Incident count KPI"
>
This is the number of incidents generated over a period of time. Typically you should compare the current and previous weeks.

<DoNotTranslate>**Goal:**</DoNotTranslate> Reduce the number of low value and nuisance incidents.

<DoNotTranslate>**Best practices:**</DoNotTranslate>
* Ensure condition settings are intended to detect real business impact.
* Ensure condition settings are detecting abnormal behavior.
* Use the incident details `Acknowledge` feature to help measure the value of alerts. See [percentage incident acknowledge KPI](#kpi-user-engagement).
* Report AQM KPIs to all stakeholders.

```sql
FROM NrAiIncident SELECT count(*) AS 'Incident Count' WHERE event = 'open' AND priority = 'critical' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO
```
</Collapser>

<Collapser
id="kpi-incident-duration"
title="Accumulated incident duration KPI"
>
This is the total sum of minutes that all incidents have accumulated over a period of time. Typically you should compare the current and previous weeks.

<DoNotTranslate>**Goal:**</DoNotTranslate> Reduce the total accumulated minutes of incidents.

<DoNotTranslate>**Best practices:**</DoNotTranslate>
* Don't manually close incidents, as doing so can warp the accuracy of this KPI.
* Remove alerts that don't result in any remediation actions from the recipients.
* Improve <DoNotTranslate>**percent investigated**</DoNotTranslate> and <DoNotTranslate>**mean-time-to-investigate**</DoNotTranslate> KPIs by communicating their importance in improving detection and response times.
* Report AQM KPIs to all stakeholders.

```sql
FROM NrAiIncident SELECT sum(durationSeconds)/60 AS 'Incident Minutes' WHERE event = 'close' AND priority = 'critical' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO
```
</Collapser>

<Collapser
id="kpi-mttc"
title="Mean time to close (MTTC) KPI"
>
This is the average duration of incidents within the period of time measured. You want this number to be as low as possible.

<DoNotTranslate>**Goal:**</DoNotTranslate> Reduce MTTC

<DoNotTranslate>**Best practices:**</DoNotTranslate>
* Don't manually close incidents, as doing so can warp the accuracy of this KPI.
* Improve reliability engineering skills.
* Report AQM KPIs to all stakeholders.

```sql
FROM NrAiIncident SELECT average(durationSeconds/60) AS 'Incident MTTC (minutes)' WHERE event = 'close' AND priority = 'critical' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO
```
</Collapser>

<Collapser
id="kpi-pct-under-five"
title="Percent under 5 minutes KPI"
>
This is the percentage of incidents with a total duration less than five minutes. This can indicate an incident changing state too frequently, which obscures the cause and severity of the incident. This state is known as <DoNotTranslate>**incident flapping**</DoNotTranslate>.

<DoNotTranslate>**Goal:**</DoNotTranslate> Minimize percentage of incidents with short durations.

<DoNotTranslate>**Best practices:**</DoNotTranslate>
* Ensure that conditions are detecting legitimate anomalies with meaningful impact on your system.
* Understand service level management.
```sql
FROM NrAiIncident SELECT percentage(count(*), WHERE durationSeconds <= 5*60) AS '% Under 5min' WHERE event = 'close' AND priority = 'critical' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO
```
</Collapser>
</CollapserGroup>

### User engagement [#user]

You should measure the value of an incident by the amount of attention it receives. The amount of engagement an individual alert receives is a direct measurement of its value. More engagement implies a valuable alert, while less (or zero) engagement implies an alert may simply be noisy and should be modified or disabled.

There's a significant difference between measuring the moment of incident awareness and acknowledging when resolution activity begins. If you're using an integration with New Relic alerts, be sure that the `Acknowledge` event sent to New Relic triggers when resolution activity begins, not when the incident is sent to the external incident management tool.

<CollapserGroup>
<Collapser
id="kpi-pct-ack"
title="Percentage acknowledged KPI"
>
This identifies the percentage of incidents that with a `true` acknowledgement flag. You should compare the current and previous weeks.

<DoNotTranslate>**Goal:**</DoNotTranslate> Increase the percentage of incident engagement.

<DoNotTranslate>**Best practices:**</DoNotTranslate>
* Ensure that your DevOps team knows when it's appropriate to acknowledge an incident alert, if applicable.
* Gamify alert acknowledgement to drive usage.
* Discourage mass acknowledgement exercises.

```sql
FROM NrAiIssue SELECT filter(count(*), WHERE event='acknowledge')/filter(count(*), WHERE event='create')*100 AS '% Investigated' WHERE priority='CRITICAL' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO
```
</Collapser>

<Collapser
id="kpi-mtti"
title="Mean time to investigate (MTTI) KPI"
>
This identifies the average time it takes for you to acknowledge an incident. Typically you should compare the current and previous weeks.

<DoNotTranslate>**Goal:**</DoNotTranslate> Reduce the mean time to investigate.

<DoNotTranslate>**Best practices:**</DoNotTranslate>
* Work at building incident responder's confidence in alerts.
* Ensure that valuable alerts are acknowledged.
* Incentivize response teams to respond quickly to alerts.

```sql
FROM NrAiIssue SELECT average(acknowledgeTime - activateTime) / 60000 AS 'Incident MTTI (minutes)' WHERE event = 'acknowledge' SINCE 1 WEEK AGO COMPARE WITH 1 WEEK AGO
```
</Collapser>
</CollapserGroup>

## What's next? [#next]

Once you implement the AQM process from the [previous doc](/docs/tutorial-create-alerts/improve-with-alerts/), you'll see significant reductions in the volume of alerts while maintaining reliability and stability. Your AQM KPIs can provide accurate inforation on these improvements when you follow the best practices as listed above.

Once you've finished implementing AQM, you can also look into improving and managing other aspects of your platform, such as:
* [Service level management](/docs/new-relic-solutions/observability-maturity/uptime-performance-reliability/slm-implementation-guide/)
* [Reliability engineering](/docs/new-relic-solutions/observability-maturity/uptime-performance-reliability/diagnostics-beginner-guide/)
* [Customer experience](/docs/new-relic-solutions/observability-maturity/customer-experience/quality-foundation-implementation-guide/)

<UserJourneyControls
previousStep={{path: "/docs/tutorial-create-alerts/improve-with-alerts/", title: "Previous step", body: "Learn how to improve your stack with alerts"}}
/>

0 comments on commit 04a4a8a

Please sign in to comment.