SLO Burn rate monitoring is incorrect #41

Closed
lswith opened this issue Feb 1, 2022 · 5 comments

lswith commented Feb 1, 2022

Hi team, I've been using your tool extensively and I am loving it!

I have come across an issue with the monitoring of an SLO.

My current alerting configuration is as follows:

apiVersion: openslo/v1alpha
kind: SLO
metadata:
  displayName: xxx
  name: xxx
spec:
  service: xxx
  budgetingMethod: Occurrences
  objectives:
    - ratioMetrics:
        total:
          source: sumologic
          queryType: Logs
          query: |
            xxx
        good: 
          source: sumologic
          queryType: Logs
          query: 'xxx'
        incremental: true
      displayName: xxx
      target: 0.99
alerts:
  burnRate:
    - shortWindow: '10m'
      shortLimit: 14
      longWindow: '1h'
      longLimit: 14
      notifications:
        - connectionType: 'Email'
          recipients:
            - 'xxx@xxx'
          triggerFor:
            - Warning
            - ResolvedWarning
    - shortWindow: '30m'
      shortLimit: 6
      longWindow: '6h'
      longLimit: 6
      notifications:
        - connectionType: 'Email'
          recipients:
            - 'xxx@xxx'
          triggerFor:
            - Warning
            - ResolvedWarning
    - shortWindow: '6h'
      shortLimit: 1
      longWindow: '24h'
      longLimit: 1
      notifications:
        - connectionType: 'Email'
          recipients:
            - 'xxx@xxx'
          triggerFor:
            - Warning
            - ResolvedWarning

When I evaluate the SLO over a 24h period, it is currently at 98.41897%, which is below the 99% target.

I would have expected to receive at least one email stating that this SLO is not being met, but none of the generated monitors are being triggered.

I'm wondering if the calculation of one of these items may be incorrect?


Current version: There is no slogen command to output the version, but I'm pointing to the latest of the main branch.

lswith commented Feb 1, 2022

I'm wondering what happens when tmCount is 0 on this line: https://github.com/SumoLogic-Labs/slogen/blob/59d58d9c9ac440a88675755e333d1b800f9bde43/libs/monitor.go#L66

I think it should be guarded against.
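
A minimal sketch of such a guard, written in Python purely for illustration (the real code is a query template in monitor.go). The function name and the choice to treat an empty timeslice as zero burn are assumptions, not necessarily what slogen does.

```python
def slice_burn_rate(good: float, total: float, target: float) -> float:
    """Burn rate for one timeslice, guarding against an empty slice."""
    if total == 0:
        # Assumption: a slice with no events consumes no error budget.
        return 0.0
    bad = total - good
    return (bad / total) / (1 - target)


# An empty slice no longer divides by zero.
print(slice_burn_rate(0, 0, 0.99))
# A slice with 2% errors against a 99% target burns budget at about 2x.
print(slice_burn_rate(98, 100, 0.99))
```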

lswith commented Feb 1, 2022

Looking at my specific monitoring query it seems that the .Budget isn't being filled with 0.9 but is 0 instead.

_view=xxx
| timeslice 6h 
| sum(sliceGoodCount) as tmGood, sum(sliceTotalCount) as tmCount  group by _timeslice
| fillmissing timeslice(1m)
| tmGood/tmCount as tmSLO 
| (tmCount-tmGood) as tmBad 
| total tmCount as totalCount  
--> HERE --> | totalCount*(1-0) as errorBudget
--> HERE --> | ((tmBad/tmCount)/(1-0)) as sliceBurnRate
| if(queryEndTime() - _timeslice <= 6h,sliceBurnRate, 0  )  as latestBurnRate 
| sum(tmGood) as totalGood, max(totalCount) as totalCount, max(latestBurnRate) as latestBurnRate 
| (1-(totalGood/totalCount))/(1-0) as longBurnRate
| if (longBurnRate > 1 , 1,0) as long_burn_exceeded
| if ( latestBurnRate > 1, 1,0) as short_burn_exceeded
| long_burn_exceeded + short_burn_exceeded as combined_burn

Filling in the correct value would fix the query, and I can see that the burn rate would then exceed 1.
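
The effect of the unset parameter can be sketched with the `longBurnRate` expression from the query. The event counts here are hypothetical, chosen only to reproduce the reported 98.41897%, and the sketch assumes the rendered parameter should equal the 0.99 target from the config.

```python
def long_burn_rate(total_good: float, total_count: float, param: float) -> float:
    # Mirrors the query line: (1-(totalGood/totalCount))/(1-<param>) as longBurnRate
    return (1 - total_good / total_count) / (1 - param)


# Hypothetical counts that give an SLO of 98.41897%.
total_count = 10_000_000
total_good = 9_841_897

buggy = long_burn_rate(total_good, total_count, 0)     # parameter rendered as 0
fixed = long_burn_rate(total_good, total_count, 0.99)  # parameter set to the target

print(f"rendered as 0:    {buggy:.4f} -> exceeded: {buggy > 1}")
print(f"rendered as 0.99: {fixed:.4f} -> exceeded: {fixed > 1}")
```

With the parameter rendered as 0, the expression reduces to the raw error rate, which can never exceed 1, so the `longBurnRate > 1` check in this query cannot pass regardless of how badly the SLO is missed.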

agaurav commented Feb 1, 2022

> Hi team, I've been using your tool extensively and I am loving it!

Thanks @lswith, it's very encouraging for us to know that it's being useful.

> I'm wondering what happens when tmCount is 0 on this line:
> https://github.com/SumoLogic-Labs/slogen/blob/59d58d9c9ac440a88675755e333d1b800f9bde43/libs/monitor.go#L66
>
> I think it should be guarded against.

> Looking at my specific monitoring query it seems that the .Budget isn't being filled with 0.9 but is 0 instead.

> This would actually fix the query and I can see that the burn rate would exceed the value of 1.

Great catch on both. I will fix them, test the fixes, and create a new release by the end of the day.

agaurav added the bug label Feb 1, 2022
agaurav self-assigned this Feb 1, 2022
agaurav closed this as completed in 30d9aad Feb 1, 2022
agaurav added a commit that referenced this issue Feb 1, 2022: fix #41 : bug in monitor query template and param

agaurav commented Feb 1, 2022

Hey @lswith, I made an attempt to fix both issues (.Budget not being set and tmCount being 0) in v0.7.10 and tested it with a few configs.

Please let me know if you still face the issue after upgrading to the new version.
And many thanks for reporting this critical bug along with its cause :)

lswith commented Feb 2, 2022

Just confirmed. This fixed the issues with alerting! Thanks again

This issue was closed.