
Add rules diff verbosity flag #199

Merged

Conversation

justinTM
Contributor

No description provided.

@justinTM
Contributor Author

justinTM commented Jul 19, 2021

@pracucci any status update? `make images` is failing locally for my forked repo, so I'm relying on this repo for a production ticket.

@justinTM
Contributor Author

I built the Docker image locally from my forked repo but still am not seeing the rules diff. Hoping to get this merged here this week if possible? @jtlisi @gotjosh @gouthamve

@gotjosh
Collaborator

gotjosh commented Jul 20, 2021

Thank you very much for your contribution @justinTM, this is on my list for this week.

@justinTM
Contributor Author

@gotjosh not to spam, but any status update? If we could set up a super quick 15-minute call, maybe I can get a local Docker image built with your help? I wouldn't mind writing short documentation afterwards for contributors to build locally, too.

Member

@gouthamve gouthamve left a comment


LGTM

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
@gouthamve
Member

gouthamve commented Aug 6, 2021

➜  cortex-tools git:(20210716_add_rules_diff_verbosity_flag) ✗ ./cortextool rules diff --id=10428 --key=$GCOM_TOKEN --address=https://xxxxxx.grafana.net --namespaces=ops-us-east-0.metamonitoring rules.yml
Changes are indicated with the following symbols:
  + updated

The following changes will be made if the provided rule set is synced:
~ Namespace: ops-us-east-0.metamonitoring
  ~ Group: alertmanager.rules

Diff Summary: 0 Groups Created, 1 Groups Updated, 0 Groups Deleted
➜  cortex-tools git:(20210716_add_rules_diff_verbosity_flag) ✗ ./cortextool rules diff --id=10428 --key=$GCOM_TOKEN --address=https://xxxxxxxx.grafana.net --namespaces=ops-us-east-0.metamonitoring rules.yml --verbose
Changes are indicated with the following symbols:
  + updated

The following changes will be made if the provided rule set is synced:
~ Namespace: ops-us-east-0.metamonitoring
  ~ Group: alertmanager.rules
+ name: alertmanager.rules
+ rules:
+     - alert: CHANGEDDDD
+       expr: max_over_time(alertmanager_config_last_reload_successful{job="global-alertmanager/alertmanager"}[5m]) == 0
+       for: 10m
+       labels:
+         severity: critical
+       annotations:
+         description: Configuration has failed to load for {{$labels.instance}} in {{$labels.cluster}}.
+         summary: Reloading an Alertmanager configuration has failed.
+     - alert: AlertmanagerMembersInconsistent
+       expr: max_over_time(alertmanager_cluster_members{job="global-alertmanager/alertmanager"}[5m]) < on(job, namespace, cluster) group_left() count by(job, namespace, cluster) (max_over_time(alertmanager_cluster_members{job="global-alertmanager/alertmanager"}[5m]))
+       for: 10m
+       labels:
+         severity: critical
+       annotations:
+         description: Alertmanager {{$labels.instance}} in {{$labels.cluster}} has only found {{ $value }} members of the {{$labels.job}} cluster.
+         summary: A member of an Alertmanager cluster has not found all other cluster members.
+     - alert: AlertmanagerFailedToSendAlerts
+       expr: (rate(alertmanager_notifications_failed_total{job="global-alertmanager/alertmanager"}[5m]) / rate(alertmanager_notifications_total{job="global-alertmanager/alertmanager"}[5m])) > 0.01
+       for: 5m
+       labels:
+         severity: warning
+       annotations:
+         description: Alertmanager {{$labels.instance}} in {{$labels.cluster}} failed to send {{ $value | humanizePercentage }} of notifications to {{ $labels.integration }}.
+         summary: An Alertmanager instance failed to send notifications.
+     - alert: AlertmanagerClusterFailedToSendAlerts
+       expr: min by(job, namespace, integration, cluster) (rate(alertmanager_notifications_failed_total{integration=~"pagerduty",job="global-alertmanager/alertmanager"}[5m]) / rate(alertmanager_notifications_total{integration=~"pagerduty",job="global-alertmanager/alertmanager"}[5m])) > 0.01
+       for: 5m
+       labels:
+         severity: critical
+       annotations:
+         description: The minimum notification failure rate to {{ $labels.integration }} sent from any instance in the {{$labels.job}} cluster is {{ $value | humanizePercentage }}.
+         summary: All Alertmanager instances in a cluster failed to send notifications to a critical integration.
+     - alert: AlertmanagerClusterFailedToSendAlerts
+       expr: min by(job, namespace, integration, cluster) (rate(alertmanager_notifications_failed_total{integration!~"pagerduty",job="global-alertmanager/alertmanager"}[5m]) / rate(alertmanager_notifications_total{integration!~"pagerduty",job="global-alertmanager/alertmanager"}[5m])) > 0.01
+       for: 5m
+       labels:
+         severity: warning
+       annotations:
+         description: The minimum notification failure rate to {{ $labels.integration }} sent from any instance in the {{$labels.job}} cluster is {{ $value | humanizePercentage }}.
+         summary: All Alertmanager instances in a cluster failed to send notifications to a non-critical integration.
+     - alert: AlertmanagerConfigInconsistent
+       expr: count by(job, namespace, cluster) (count_values by(job, namespace, cluster) ("config_hash", alertmanager_config_hash{job="global-alertmanager/alertmanager"})) != 1
+       for: 20m
+       labels:
+         severity: critical
+       annotations:
+         description: Alertmanager instances within the {{$labels.job}} cluster have different configurations.
+         summary: Alertmanager instances within the same cluster have different configurations.
+     - alert: AlertmanagerClusterDown
+       expr: (count by(job, namespace, cluster) (avg_over_time(up{job="global-alertmanager/alertmanager"}[5m]) < 0.5) / count by(job, namespace, cluster) (up{job="global-alertmanager/alertmanager"})) >= 0.5
+       for: 5m
+       labels:
+         severity: critical
+       annotations:
+         description: '{{ $value | humanizePercentage }} of Alertmanager instances within the {{$labels.job}} cluster have been up for less than half of the last 5m.'
+         summary: Half or more of the Alertmanager instances within the same cluster are down.
+     - alert: AlertmanagerClusterCrashlooping
+       expr: (count by(job, namespace, cluster) (changes(process_start_time_seconds{job="global-alertmanager/alertmanager"}[10m]) > 4) / count by(job, namespace, cluster) (up{job="global-alertmanager/alertmanager"})) >= 0.5
+       for: 5m
+       labels:
+         severity: critical
+       annotations:
+         description: '{{ $value | humanizePercentage }} of Alertmanager instances within the {{$labels.job}} cluster have restarted at least 5 times in the last 10m.'
+         summary: Half or more of the Alertmanager instances within the same cluster are crashlooping.
+
+ name: alertmanager.rules
+ rules:                                                                                                                                                                                                         
+     - alert: AlertmanagerFailedReload                                                                                                                                                                          
+       expr: max_over_time(alertmanager_config_last_reload_successful{job="global-alertmanager/alertmanager"}[5m]) == 0                                                                                         
+       for: 10m                                                                                                                                                                                                 
+       labels:                                                                                                                                                                                                  
+         severity: critical                                                                                                                                                                                     
+       annotations:                                                                                                                                                                                             
+         description: Configuration has failed to load for {{$labels.instance}} in {{$labels.cluster}}.                                                                                                         
+         summary: Reloading an Alertmanager configuration has failed.                                                                                                                                           
+     - alert: AlertmanagerMembersInconsistent                                                                                                                                                                   
+       expr: max_over_time(alertmanager_cluster_members{job="global-alertmanager/alertmanager"}[5m]) < on(job, namespace, cluster) group_left() count by(job, namespace, cluster) (max_over_time(alertmanager_cluster_members{job="global-alertmanager/alertmanager"}[5m]))                                                                                                                                                      
+       for: 10m                                                                                                                                                                                                 
+       labels:                                                                                                                                                                                                  
+         severity: critical                                                                                                                                                                                     
+       annotations:                                                                                                                                                                                             
+         description: Alertmanager {{$labels.instance}} in {{$labels.cluster}} has only found {{ $value }} members of the {{$labels.job}} cluster.                                                              
+         summary: A member of an Alertmanager cluster has not found all other cluster members.                                                                                                                  
+     - alert: AlertmanagerFailedToSendAlerts                                                                                                                                                                    
+       expr: (rate(alertmanager_notifications_failed_total{job="global-alertmanager/alertmanager"}[5m]) / rate(alertmanager_notifications_total{job="global-alertmanager/alertmanager"}[5m])) > 0.01            
+       for: 5m                                                                                                                                                                                                  
+       labels:                                                                                                                                                                                                  
+         severity: warning                                                                                                                                                                                      
+       annotations:                                                                                                                                                                                             
+         description: Alertmanager {{$labels.instance}} in {{$labels.cluster}} failed to send {{ $value | humanizePercentage }} of notifications to {{ $labels.integration }}.                                  
+         summary: An Alertmanager instance failed to send notifications.
+     - alert: AlertmanagerClusterFailedToSendAlerts
+       expr: min by(job, namespace, integration, cluster) (rate(alertmanager_notifications_failed_total{integration=~"pagerduty",job="global-alertmanager/alertmanager"}[5m]) / rate(alertmanager_notifications_total{integration=~"pagerduty",job="global-alertmanager/alertmanager"}[5m])) > 0.01
+       for: 5m
+       labels:
+         severity: critical
+       annotations:
+         description: The minimum notification failure rate to {{ $labels.integration }} sent from any instance in the {{$labels.job}} cluster is {{ $value | humanizePercentage }}.
+         summary: All Alertmanager instances in a cluster failed to send notifications to a critical integration.
+     - alert: AlertmanagerClusterFailedToSendAlerts
+       expr: min by(job, namespace, integration, cluster) (rate(alertmanager_notifications_failed_total{integration!~"pagerduty",job="global-alertmanager/alertmanager"}[5m]) / rate(alertmanager_notifications_total{integration!~"pagerduty",job="global-alertmanager/alertmanager"}[5m])) > 0.01
+       for: 5m
+       labels:
+         severity: warning
+       annotations:
+         description: The minimum notification failure rate to {{ $labels.integration }} sent from any instance in the {{$labels.job}} cluster is {{ $value | humanizePercentage }}.
+         summary: All Alertmanager instances in a cluster failed to send notifications to a non-critical integration.
+     - alert: AlertmanagerConfigInconsistent
+       expr: count by(job, namespace, cluster) (count_values by(job, namespace, cluster) ("config_hash", alertmanager_config_hash{job="global-alertmanager/alertmanager"})) != 1
+       for: 20m
+       labels:
+         severity: critical
+       annotations:
+         description: Alertmanager instances within the {{$labels.job}} cluster have different configurations.
+         summary: Alertmanager instances within the same cluster have different configurations.
+     - alert: AlertmanagerClusterDown
+       expr: (count by(job, namespace, cluster) (avg_over_time(up{job="global-alertmanager/alertmanager"}[5m]) < 0.5) / count by(job, namespace, cluster) (up{job="global-alertmanager/alertmanager"})) >= 0.5
+       for: 5m
+       labels:
+         severity: critical
+       annotations:
+         description: '{{ $value | humanizePercentage }} of Alertmanager instances within the {{$labels.job}} cluster have been up for less than half of the last 5m.'
+         summary: Half or more of the Alertmanager instances within the same cluster are down.
+     - alert: AlertmanagerClusterCrashlooping
+       expr: (count by(job, namespace, cluster) (changes(process_start_time_seconds{job="global-alertmanager/alertmanager"}[10m]) > 4) / count by(job, namespace, cluster) (up{job="global-alertmanager/alertmanager"})) >= 0.5
+       for: 5m
+       labels:
+         severity: critical
+       annotations:
+         description: '{{ $value | humanizePercentage }} of Alertmanager instances within the {{$labels.job}} cluster have restarted at least 5 times in the last 10m.'
+         summary: Half or more of the Alertmanager instances within the same cluster are crashlooping.
+ 

Diff Summary: 0 Groups Created, 1 Groups Updated, 0 Groups Deleted

The second part, even though it has +, was shown in red for me. I fixed that output in a commit I pushed to the branch. I think we're good to go!
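The coloring fix above amounts to styling each diff line by its leading symbol instead of coloring whole sections. A minimal sketch of that idea in Go (this is an illustration, not the actual cortex-tools code; `colorizeDiffLine` is a hypothetical helper):

```go
package main

import (
	"fmt"
	"strings"
)

// ANSI SGR escape codes for terminal colors.
const (
	colorGreen = "\033[32m"
	colorRed   = "\033[31m"
	colorReset = "\033[0m"
)

// colorizeDiffLine styles a single diff output line: added lines ("+")
// print green, deleted lines ("-") print red, and anything else (context,
// "~" headers) is left unstyled.
func colorizeDiffLine(line string) string {
	switch {
	case strings.HasPrefix(line, "+"):
		return colorGreen + line + colorReset
	case strings.HasPrefix(line, "-"):
		return colorRed + line + colorReset
	default:
		return line
	}
}

func main() {
	for _, l := range []string{
		"~ Namespace: ops-us-east-0.metamonitoring",
		"+ name: alertmanager.rules",
		"- name: alertmanager.rules",
	} {
		fmt.Println(colorizeDiffLine(l))
	}
}
```

Keying the color off the per-line prefix keeps the symbol and the color consistent, which is exactly the mismatch (a + line rendered in red) the commit addressed.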

@gouthamve gouthamve merged commit d35da62 into grafana:main Aug 6, 2021
@gouthamve
Member

@justinTM Can you open an issue for the Docker images with clear steps and the errors you were seeing? This'll help us debug the issue, and if it's not obvious, I'll be happy to jump on a call with you!

simonswine pushed a commit to grafana/mimir that referenced this pull request Jan 12, 2022
* add rules diff verbosity flag

* Make the deleted lines -

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

Co-authored-by: Justin Mai <justin.mai@drillinginfo.com>
Co-authored-by: Goutham Veeramachaneni <gouthamve@gmail.com>