ISPN-14556 Document how to monitor cross-site replication (#10813)
Co-authored-by: Pedro Ruivo <pruivo@redhat.com>
1 parent c330c12, commit f823327
Showing 5 changed files with 192 additions and 0 deletions.
Binary file added (+25 KB)
documentation/src/main/asciidoc/topics/images/prometheus_xsite_alert.png
82 changes: 82 additions & 0 deletions
documentation/src/main/asciidoc/topics/proc_xsite_monitoring.adoc
@@ -0,0 +1,82 @@
[id='monitor-xsite-replication']
= Monitoring the status of cross-site replication

Monitor the site status of your backup locations to detect interruptions in the communication between the sites.
When a remote site status changes to `offline`, {brandname} stops replicating your data to the backup location.
Your data becomes out of sync and you must fix the inconsistencies before bringing the clusters back online.

Monitoring cross-site events is necessary for early problem detection.
Use one of the following monitoring strategies:

* link:#monitoring-cross-site-rest[Monitoring cross-site replication with the REST API]
* link:#monitoring-cross-site-prometheus[Monitoring cross-site replication with the Prometheus metrics] or any other monitoring system

[[monitoring-cross-site-rest]]
[discrete]
== Monitoring cross-site replication with the REST API
Monitor the status of cross-site replication for all caches using the REST endpoint.
You can implement a custom script to poll the REST endpoint or use the following example.

.Prerequisites
* Enable cross-site replication.

.Procedure
. Implement a script to poll the REST endpoint.
+
The following example demonstrates how you can use a Python script to poll the site status every five seconds.

[source,python,options="nowrap",subs=attributes+]
----
include::python/monitor_site_status.py[]
----

When a site status changes from `online` to `offline` or vice versa, the function `on_event` is invoked.

If you want to use this script, you must specify the following variables:

* `USERNAME` and `PASSWORD`: The username and password of a {brandname} user with permission to access the REST endpoint.
* `POLL_INTERVAL_SEC`: The number of seconds between polls.
* `SERVERS`: The list of {brandname} Servers at this site.
The script only requires a single valid response, but the list is provided to allow failover.
* `REMOTE_SITES`: The list of remote sites to monitor on these servers.
* `CACHES`: The list of cache names to monitor.

[role="_additional-resources"]
.Additional resources
* link:{rest_docs}#rest_v2_cache_manager_site_status_rest[REST API: Getting status of backup locations]

[[monitoring-cross-site-prometheus]]
[discrete]
== Monitoring cross-site replication with the Prometheus metrics

Prometheus, and other monitoring systems, let you configure alerts to detect when a site status changes to `offline`.

TIP: Monitoring cross-site latency metrics can help you to discover potential issues.

.Prerequisites
* Enable cross-site replication.

.Procedure
. Configure {brandname} metrics.
. Configure alerting rules using the Prometheus metrics format.
* For the site status, use `1` for `online` and `0` for `offline`.
* For the `expr` field, use the following format: +
`vendor_cache_manager_default_cache_<cache name>_x_site_admin_<site name>_status`.
+
In the following example, Prometheus alerts you when the *NYC* site goes `offline` for the caches named `work` or `sessions`.
+
[source,yaml,options="nowrap",subs=attributes+]
----
include::yaml/prometheus_xsite_rules.yml[]
----
+
The following image shows an alert that the *NYC* site is `offline` for cache `work`.
+
image::prometheus_xsite_alert.png[align="center",title="Prometheus Alert"]

[role="_additional-resources"]
.Additional resources
* link:{server_docs}#configuring-metrics_statistics-jmx[Configuring {brandname} metrics]
* link:https://prometheus.io/docs/alerting/latest/overview/[Prometheus Alerting Overview]
* link:https://grafana.com/docs/grafana/latest/alerting/[Grafana Alerting Documentation]
* link:https://docs.openshift.com/container-platform/latest/monitoring/managing-alerts.html#creating-alerting-rules-for-user-defined-projects_managing-alerts[OpenShift Managing Alerts]
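The metric naming convention described in the Prometheus procedure above can be sketched in Python. This is a minimal illustration and not part of the commit: `xsite_status_metric` and `decode_status` are hypothetical helper names, and lowercasing the site name is an assumption based on the example, where the *NYC* site appears as `nyc` in the metric name.

```python
# Hypothetical helpers for illustration; they are not part of Infinispan or this commit.
STATUS_BY_VALUE = {1: 'online', 0: 'offline'}


def xsite_status_metric(cache: str, site: str) -> str:
    # Assumption: the site name appears in lowercase, as `nyc` does in the example rule.
    return f'vendor_cache_manager_default_cache_{cache}_x_site_admin_{site.lower()}_status'


def decode_status(value: float) -> str:
    # The documented convention is 1 for online and 0 for offline.
    if value not in STATUS_BY_VALUE:
        raise ValueError(f'unexpected site status value: {value!r}')
    return STATUS_BY_VALUE[value]


if __name__ == '__main__':
    for cache_name in ('work', 'sessions'):
        print(xsite_status_metric(cache_name, 'NYC'))
```

Cache or site names that contain characters other than letters, digits, and underscores may be rendered differently in real metric names, so it is worth checking the names that the metrics endpoint actually exposes before relying on a helper like this.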
102 changes: 102 additions & 0 deletions
documentation/src/main/asciidoc/topics/python/monitor_site_status.py
@@ -0,0 +1,102 @@
#!/usr/bin/python3
import time
import requests
from requests.auth import HTTPDigestAuth


class InfinispanConnection:

    def __init__(self, server: str = 'http://localhost:11222', cache_manager: str = 'default',
                 auth: tuple = ('admin', 'change_me')) -> None:
        super().__init__()
        self.__url = f'{server}/rest/v2/cache-managers/{cache_manager}/x-site/backups/'
        self.__auth = auth
        self.__headers = {
            'accept': 'application/json'
        }

    def get_sites_status(self):
        try:
            rsp = requests.get(self.__url, headers=self.__headers, auth=HTTPDigestAuth(self.__auth[0], self.__auth[1]))
            if rsp.status_code != 200:
                return None
            return rsp.json()
        except (requests.RequestException, ValueError):  # request failed or body is not valid JSON
            return None


# Specify credentials for {brandname} user with permission to access the REST endpoint
USERNAME = 'admin'
PASSWORD = 'change_me'
# Set an interval between cross-site status checks
POLL_INTERVAL_SEC = 5
# Provide a list of servers
SERVERS = [
    InfinispanConnection('http://127.0.0.1:11222', auth=(USERNAME, PASSWORD)),
    InfinispanConnection('http://127.0.0.1:12222', auth=(USERNAME, PASSWORD))
]
# Specify the names of remote sites
REMOTE_SITES = [
    'nyc'
]
# Provide a list of caches to monitor
CACHES = [
    'work',
    'sessions'
]


def on_event(site: str, cache: str, old_status: str, new_status: str):
    # TODO implement your handling code here
    print(f'site={site} cache={cache} Status changed {old_status} -> {new_status}')


def __handle_mixed_state(state: dict, site: str, site_status: dict):
    if site not in state:
        state[site] = {c: 'online' if c in site_status['online'] else 'offline' for c in CACHES}
        return

    for cache in CACHES:
        __update_cache_state(state, site, cache, 'online' if cache in site_status['online'] else 'offline')


def __handle_online_or_offline_state(state: dict, site: str, new_status: str):
    if site not in state:
        state[site] = {c: new_status for c in CACHES}
        return

    for cache in CACHES:
        __update_cache_state(state, site, cache, new_status)


def __update_cache_state(state: dict, site: str, cache: str, new_status: str):
    old_status = state[site].get(cache)
    if old_status != new_status:
        on_event(site, cache, old_status, new_status)
        state[site][cache] = new_status


def update_state(state: dict):
    rsp = None
    for conn in SERVERS:
        rsp = conn.get_sites_status()
        if rsp:
            break
    if rsp is None:
        print('Unable to fetch site status from any server')
        return

    for site in REMOTE_SITES:
        site_status = rsp.get(site, {})
        new_status = site_status.get('status')
        if new_status == 'mixed':
            __handle_mixed_state(state, site, site_status)
        else:
            __handle_online_or_offline_state(state, site, new_status)


if __name__ == '__main__':
    _state = {}
    while True:
        update_state(_state)
        time.sleep(POLL_INTERVAL_SEC)
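How the script resolves a `mixed` site status into per-cache states can be tried without a running server. The sketch below is illustrative only: `per_cache_status` is a hypothetical helper, `lon` is a made-up second site, and the sample payload is an assumed response shape inferred from the fields the script reads (`status` and `online`), not output captured from a real server.

```python
from typing import Dict, List, Optional


def per_cache_status(site_status: dict, caches: List[str]) -> Dict[str, Optional[str]]:
    """Resolve one site's status entry into a status for each monitored cache."""
    status = site_status.get('status')
    if status == 'mixed':
        # In a mixed state, only the caches listed under 'online' count as online.
        online = set(site_status.get('online', []))
        return {c: 'online' if c in online else 'offline' for c in caches}
    # 'online' or 'offline' applies to every cache; None means the site was not reported.
    return {c: status for c in caches}


# Assumed payload shape, for illustration only.
SAMPLE_RESPONSE = {
    'nyc': {'status': 'mixed', 'online': ['work'], 'offline': ['sessions']},
    'lon': {'status': 'online'},
}

if __name__ == '__main__':
    for site_name in ('nyc', 'lon'):
        entry = SAMPLE_RESPONSE.get(site_name, {})
        print(site_name, per_cache_status(entry, ['work', 'sessions']))
```

Keeping this step as a pure function of the response makes it straightforward to unit test the status handling separately from the polling loop.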
7 changes: 7 additions & 0 deletions
documentation/src/main/asciidoc/topics/yaml/prometheus_xsite_rules.yml
@@ -0,0 +1,7 @@
groups:
- name: Cross Site Rules
  rules:
  - alert: Cache Work and Site NYC
    expr: vendor_cache_manager_default_cache_work_x_site_admin_nyc_status == 0
  - alert: Cache Sessions and Site NYC
    expr: vendor_cache_manager_default_cache_sessions_x_site_admin_nyc_status == 0
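With more caches or sites, writing one rule per pair by hand becomes repetitive. The following Python sketch renders rules in the same pattern as the file above. It is an assumption-laden illustration: `render_xsite_rules` is a hypothetical helper, the alert naming simply copies the wording of the two example rules, and the output is built with plain string formatting, so names containing characters that are special in YAML would need quoting.

```python
from typing import Iterable


def render_xsite_rules(caches: Iterable[str], sites: Iterable[str]) -> str:
    """Render one offline alert per (site, cache) pair as Prometheus rule YAML text."""
    cache_names = list(caches)  # materialize so the list can be reused for every site
    lines = ['groups:', '- name: Cross Site Rules', '  rules:']
    for site in sites:
        for cache in cache_names:
            # Assumption: the site name is lowercase in the metric name, as in the example rules.
            metric = f'vendor_cache_manager_default_cache_{cache}_x_site_admin_{site.lower()}_status'
            lines.append(f'  - alert: Cache {cache.capitalize()} and Site {site.upper()}')
            lines.append(f'    expr: {metric} == 0')
    return '\n'.join(lines) + '\n'


if __name__ == '__main__':
    print(render_xsite_rules(['work', 'sessions'], ['nyc']), end='')
```

Generating the file this way keeps the rule list in step with the `CACHES` and `REMOTE_SITES` values used by the polling script, at the cost of one more build step.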