
Add rules diff verbosity flag #199

Merged

Conversation

justinTM
Contributor

No description provided.

@justinTM
Contributor Author

justinTM commented Jul 19, 2021

@pracucci any status update? `make images` is failing locally for my forked repo, so I'm relying on this repo for a production ticket.

@justinTM
Contributor Author

I built the Docker image locally from my forked repo but still am not seeing the rules diff. Hoping to get this merged here this week if possible? @jtlisi @gotjosh @gouthamve

@gotjosh
Collaborator

gotjosh commented Jul 20, 2021

Thank you very much for your contribution @justinTM, this is on my list for this week.

@justinTM
Contributor Author

@gotjosh not to spam, but any status update? If we could set up a super quick 15-minute call, maybe I can get a local Docker image built with your help? I wouldn't mind writing short documentation afterwards for contributors to build locally, too.

Member

@gouthamve gouthamve left a comment


LGTM

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>
@gouthamve
Member

gouthamve commented Aug 6, 2021

➜  cortex-tools git:(20210716_add_rules_diff_verbosity_flag) ✗ ./cortextool rules diff --id=10428 --key=$GCOM_TOKEN --address=https://xxxxxx.grafana.net --namespaces=ops-us-east-0.metamonitoring rules.yml
Changes are indicated with the following symbols:
  + updated

The following changes will be made if the provided rule set is synced:
~ Namespace: ops-us-east-0.metamonitoring
  ~ Group: alertmanager.rules

Diff Summary: 0 Groups Created, 1 Groups Updated, 0 Groups Deleted
➜  cortex-tools git:(20210716_add_rules_diff_verbosity_flag) ✗ ./cortextool rules diff --id=10428 --key=$GCOM_TOKEN --address=https://xxxxxxxx.grafana.net --namespaces=ops-us-east-0.metamonitoring rules.yml --verbose
Changes are indicated with the following symbols:
  + updated

The following changes will be made if the provided rule set is synced:
~ Namespace: ops-us-east-0.metamonitoring
  ~ Group: alertmanager.rules
+ name: alertmanager.rules
+ rules:
+     - alert: CHANGEDDDD
+       expr: max_over_time(alertmanager_config_last_reload_successful{job="global-alertmanager/alertmanager"}[5m]) == 0
+       for: 10m
+       labels:
+         severity: critical
+       annotations:
+         description: Configuration has failed to load for {{$labels.instance}} in {{$labels.cluster}}.
+         summary: Reloading an Alertmanager configuration has failed.
+     - alert: AlertmanagerMembersInconsistent
+       expr: max_over_time(alertmanager_cluster_members{job="global-alertmanager/alertmanager"}[5m]) < on(job, namespace, cluster) group_left() count by(job, namespace, cluster) (max_over_time(alertmanager_cluster_members{job="global-alertmanager/alertmanager"}[5m]))
+       for: 10m
+       labels:
+         severity: critical
+       annotations:
+         description: Alertmanager {{$labels.instance}} in {{$labels.cluster}} has only found {{ $value }} members of the {{$labels.job}} cluster.
+         summary: A member of an Alertmanager cluster has not found all other cluster members.
+     - alert: AlertmanagerFailedToSendAlerts
+       expr: (rate(alertmanager_notifications_failed_total{job="global-alertmanager/alertmanager"}[5m]) / rate(alertmanager_notifications_total{job="global-alertmanager/alertmanager"}[5m])) > 0.01
+       for: 5m
+       labels:
+         severity: warning
+       annotations:
+         description: Alertmanager {{$labels.instance}} in {{$labels.cluster}} failed to send {{ $value | humanizePercentage }} of notifications to {{ $labels.integration }}.
+         summary: An Alertmanager instance failed to send notifications.
+     - alert: AlertmanagerClusterFailedToSendAlerts
+       expr: min by(job, namespace, integration, cluster) (rate(alertmanager_notifications_failed_total{integration=~"pagerduty",job="global-alertmanager/alertmanager"}[5m]) / rate(alertmanager_notifications_total{integration=~"pagerduty",job="global-alertmanager/alertmanager"}[5m])) > 0.01
+       for: 5m
+       labels:
+         severity: critical
+       annotations:
+         description: The minimum notification failure rate to {{ $labels.integration }} sent from any instance in the {{$labels.job}} cluster is {{ $value | humanizePercentage }}.
+         summary: All Alertmanager instances in a cluster failed to send notifications to a critical integration.
+     - alert: AlertmanagerClusterFailedToSendAlerts
+       expr: min by(job, namespace, integration, cluster) (rate(alertmanager_notifications_failed_total{integration!~"pagerduty",job="global-alertmanager/alertmanager"}[5m]) / rate(alertmanager_notifications_total{integration!~"pagerduty",job="global-alertmanager/alertmanager"}[5m])) > 0.01
+       for: 5m
+       labels:
+         severity: warning
+       annotations:
+         description: The minimum notification failure rate to {{ $labels.integration }} sent from any instance in the {{$labels.job}} cluster is {{ $value | humanizePercentage }}.
+         summary: All Alertmanager instances in a cluster failed to send notifications to a non-critical integration.
+     - alert: AlertmanagerConfigInconsistent
+       expr: count by(job, namespace, cluster) (count_values by(job, namespace, cluster) ("config_hash", alertmanager_config_hash{job="global-alertmanager/alertmanager"})) != 1
+       for: 20m
+       labels:
+         severity: critical
+       annotations:
+         description: Alertmanager instances within the {{$labels.job}} cluster have different configurations.
+         summary: Alertmanager instances within the same cluster have different configurations.
+     - alert: AlertmanagerClusterDown
+       expr: (count by(job, namespace, cluster) (avg_over_time(up{job="global-alertmanager/alertmanager"}[5m]) < 0.5) / count by(job, namespace, cluster) (up{job="global-alertmanager/alertmanager"})) >= 0.5
+       for: 5m
+       labels:
+         severity: critical
+       annotations:
+         description: '{{ $value | humanizePercentage }} of Alertmanager instances within the {{$labels.job}} cluster have been up for less than half of the last 5m.'
+         summary: Half or more of the Alertmanager instances within the same cluster are down.
+     - alert: AlertmanagerClusterCrashlooping
+       expr: (count by(job, namespace, cluster) (changes(process_start_time_seconds{job="global-alertmanager/alertmanager"}[10m]) > 4) / count by(job, namespace, cluster) (up{job="global-alertmanager/alertmanager"})) >= 0.5
+       for: 5m
+       labels:
+         severity: critical
+       annotations:
+         description: '{{ $value | humanizePercentage }} of Alertmanager instances within the {{$labels.job}} cluster have restarted at least 5 times in the last 10m.'
+         summary: Half or more of the Alertmanager instances within the same cluster are crashlooping.
+
+ name: alertmanager.rules
+ rules:                                                                                                                                                                                                         
+     - alert: AlertmanagerFailedReload                                                                                                                                                                          
+       expr: max_over_time(alertmanager_config_last_reload_successful{job="global-alertmanager/alertmanager"}[5m]) == 0                                                                                         
+       for: 10m                                                                                                                                                                                                 
+       labels:                                                                                                                                                                                                  
+         severity: critical                                                                                                                                                                                     
+       annotations:                                                                                                                                                                                             
+         description: Configuration has failed to load for {{$labels.instance}} in {{$labels.cluster}}.                                                                                                         
+         summary: Reloading an Alertmanager configuration has failed.                                                                                                                                           
+     - alert: AlertmanagerMembersInconsistent                                                                                                                                                                   
+       expr: max_over_time(alertmanager_cluster_members{job="global-alertmanager/alertmanager"}[5m]) < on(job, namespace, cluster) group_left() count by(job, namespace, cluster) (max_over_time(alertmanager_cluster_members{job="global-alertmanager/alertmanager"}[5m]))                                                                                                                                                      
+       for: 10m                                                                                                                                                                                                 
+       labels:                                                                                                                                                                                                  
+         severity: critical                                                                                                                                                                                     
+       annotations:                                                                                                                                                                                             
+         description: Alertmanager {{$labels.instance}} in {{$labels.cluster}} has only found {{ $value }} members of the {{$labels.job}} cluster.                                                              
+         summary: A member of an Alertmanager cluster has not found all other cluster members.                                                                                                                  
+     - alert: AlertmanagerFailedToSendAlerts                                                                                                                                                                    
+       expr: (rate(alertmanager_notifications_failed_total{job="global-alertmanager/alertmanager"}[5m]) / rate(alertmanager_notifications_total{job="global-alertmanager/alertmanager"}[5m])) > 0.01            
+       for: 5m                                                                                                                                                                                                  
+       labels:                                                                                                                                                                                                  
+         severity: warning                                                                                                                                                                                      
+       annotations:                                                                                                                                                                                             
+         description: Alertmanager {{$labels.instance}} in {{$labels.cluster}} failed to send {{ $value | humanizePercentage }} of notifications to {{ $labels.integration }}.                                  
+         summary: An Alertmanager instance failed to send notifications.
+     - alert: AlertmanagerClusterFailedToSendAlerts
+       expr: min by(job, namespace, integration, cluster) (rate(alertmanager_notifications_failed_total{integration=~"pagerduty",job="global-alertmanager/alertmanager"}[5m]) / rate(alertmanager_notifications_total{integration=~"pagerduty",job="global-alertmanager/alertmanager"}[5m])) > 0.01
+       for: 5m
+       labels:
+         severity: critical
+       annotations:
+         description: The minimum notification failure rate to {{ $labels.integration }} sent from any instance in the {{$labels.job}} cluster is {{ $value | humanizePercentage }}.
+         summary: All Alertmanager instances in a cluster failed to send notifications to a critical integration.
+     - alert: AlertmanagerClusterFailedToSendAlerts
+       expr: min by(job, namespace, integration, cluster) (rate(alertmanager_notifications_failed_total{integration!~"pagerduty",job="global-alertmanager/alertmanager"}[5m]) / rate(alertmanager_notifications_total{integration!~"pagerduty",job="global-alertmanager/alertmanager"}[5m])) > 0.01
+       for: 5m
+       labels:
+         severity: warning
+       annotations:
+         description: The minimum notification failure rate to {{ $labels.integration }} sent from any instance in the {{$labels.job}} cluster is {{ $value | humanizePercentage }}.
+         summary: All Alertmanager instances in a cluster failed to send notifications to a non-critical integration.
+     - alert: AlertmanagerConfigInconsistent
+       expr: count by(job, namespace, cluster) (count_values by(job, namespace, cluster) ("config_hash", alertmanager_config_hash{job="global-alertmanager/alertmanager"})) != 1
+       for: 20m
+       labels:
+         severity: critical
+       annotations:
+         description: Alertmanager instances within the {{$labels.job}} cluster have different configurations.
+         summary: Alertmanager instances within the same cluster have different configurations.
+     - alert: AlertmanagerClusterDown
+       expr: (count by(job, namespace, cluster) (avg_over_time(up{job="global-alertmanager/alertmanager"}[5m]) < 0.5) / count by(job, namespace, cluster) (up{job="global-alertmanager/alertmanager"})) >= 0.5
+       for: 5m
+       labels:
+         severity: critical
+       annotations:
+         description: '{{ $value | humanizePercentage }} of Alertmanager instances within the {{$labels.job}} cluster have been up for less than half of the last 5m.'
+         summary: Half or more of the Alertmanager instances within the same cluster are down.
+     - alert: AlertmanagerClusterCrashlooping
+       expr: (count by(job, namespace, cluster) (changes(process_start_time_seconds{job="global-alertmanager/alertmanager"}[10m]) > 4) / count by(job, namespace, cluster) (up{job="global-alertmanager/alertmanager"})) >= 0.5
+       for: 5m
+       labels:
+         severity: critical
+       annotations:
+         description: '{{ $value | humanizePercentage }} of Alertmanager instances within the {{$labels.job}} cluster have restarted at least 5 times in the last 10m.'
+         summary: Half or more of the Alertmanager instances within the same cluster are crashlooping.
+ 

Diff Summary: 0 Groups Created, 1 Groups Updated, 0 Groups Deleted

The second part, even though it has +, was shown in red for me. I fixed that output in a commit I pushed to the branch. I think we're good to go!
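The coloring fix above amounts to styling each diff line by its leading symbol instead of coloring whole sections. A minimal sketch of that idea in Go (this is an illustration, not the actual cortex-tools code; `colorizeDiffLine` is a hypothetical helper):

```go
package main

import (
	"fmt"
	"strings"
)

// ANSI SGR escape codes for terminal colors.
const (
	colorGreen = "\033[32m"
	colorRed   = "\033[31m"
	colorReset = "\033[0m"
)

// colorizeDiffLine styles a single diff output line: added lines ("+")
// print green, deleted lines ("-") print red, and anything else (context,
// "~" headers) is left unstyled.
func colorizeDiffLine(line string) string {
	switch {
	case strings.HasPrefix(line, "+"):
		return colorGreen + line + colorReset
	case strings.HasPrefix(line, "-"):
		return colorRed + line + colorReset
	default:
		return line
	}
}

func main() {
	for _, l := range []string{
		"~ Namespace: ops-us-east-0.metamonitoring",
		"+ name: alertmanager.rules",
		"- name: alertmanager.rules",
	} {
		fmt.Println(colorizeDiffLine(l))
	}
}
```

Keying the color off the per-line prefix keeps the symbol and the color consistent, which is exactly the mismatch (a + line rendered in red) the commit addressed.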

@gouthamve gouthamve merged commit d35da62 into grafana:main Aug 6, 2021
@gouthamve
Member

@justinTM Can you open an issue for the Docker images with clear steps and the errors you were seeing? This'll help us debug the issue, and if it's not obvious, I'll be happy to jump on a call with you!

simonswine pushed a commit to grafana/mimir that referenced this pull request Jan 12, 2022
* add rules diff verbosity flag

* Make the deleted lines -

Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

Co-authored-by: Justin Mai <justin.mai@drillinginfo.com>
Co-authored-by: Goutham Veeramachaneni <gouthamve@gmail.com>