ISPN-14556 Document how to monitor cross-site replication (#10813)
Co-authored-by: Pedro Ruivo <pruivo@redhat.com>
domiborges and pruivo committed Apr 13, 2023
1 parent c330c12 commit f823327
Showing 5 changed files with 192 additions and 0 deletions.
@@ -19,6 +19,7 @@ include::{topics}/proc_enabling_jmx_port.adoc[leveloffset=+2]
include::{topics}/ref_jmx_mbeans.adoc[leveloffset=+2]
include::{topics}/proc_registering_jmx_mbean_servers.adoc[leveloffset=+2]
include::{topics}/proc_exporting_metrics_state_transfer.adoc[leveloffset=+1]
include::{topics}/proc_xsite_monitoring.adoc[leveloffset=+1]

// Restore the parent context.
ifdef::parent-context[:context: {parent-context}]
82 changes: 82 additions & 0 deletions documentation/src/main/asciidoc/topics/proc_xsite_monitoring.adoc
@@ -0,0 +1,82 @@
[id='monitor-xsite-replication']
= Monitoring the status of cross-site replication

Monitor the site status of your backup locations to detect interruptions in the communication between the sites.
When a remote site status changes to `offline`, {brandname} stops replicating your data to the backup location.
Your data becomes out of sync, and you must fix the inconsistencies before bringing the clusters back online.

Monitoring cross-site events is necessary for early problem detection.
Use one of the following monitoring strategies:

* link:#monitoring-cross-site-rest[Monitoring cross-site replication with the REST API]
* link:#monitoring-cross-site-prometheus[Monitoring cross-site replication with the Prometheus metrics] or any other monitoring system

[[monitoring-cross-site-rest]]
[discrete]
== Monitoring cross-site replication with the REST API
Monitor the status of cross-site replication for all caches using the REST endpoint.
You can implement a custom script to poll the REST endpoint or use the following example.

.Prerequisites
* Enable cross-site replication.

.Procedure
. Implement a script to poll the REST endpoint.
+
The following example demonstrates how you can use a Python script to poll the site status every five seconds.
+
[source,python,options="nowrap",subs=attributes+]
----
include::python/monitor_site_status.py[]
----

When a site status changes from `online` to `offline` or vice versa, the script invokes the `on_event` function.
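
The `on_event` function in the script only prints the transition. As a minimal sketch of a handler, assuming you want log output with a higher severity for `offline` transitions, you could replace it with something like the following. The logger name `xsite-monitor` and the returned message are illustrative choices, not part of the script.

```python
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s %(levelname)s %(message)s')
# Hypothetical logger name, chosen for this sketch only
log = logging.getLogger('xsite-monitor')


def on_event(site: str, cache: str, old_status: str, new_status: str) -> str:
    """Log a status transition and return the logged message."""
    message = f'site={site} cache={cache} status changed {old_status} -> {new_status}'
    # Replication to an offline site has stopped, so log that case as an error
    level = logging.ERROR if new_status == 'offline' else logging.INFO
    log.log(level, message)
    return message


# Example call with the same arguments the polling script passes
on_event('nyc', 'work', 'online', 'offline')
```

Because the handler keeps the `(site, cache, old_status, new_status)` signature, it can stand in for the placeholder without other changes to the script.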

If you want to use this script, you must specify the following variables:

* `USERNAME` and `PASSWORD`: The username and password of a {brandname} user with permission to access the REST endpoint.
* `POLL_INTERVAL_SEC`: The number of seconds between polls.
* `SERVERS`: The list of {brandname} Servers at this site.
The script requires only a single valid response, but the list allows failover to another server.
* `REMOTE_SITES`: The list of remote sites to monitor on these servers.
* `CACHES`: The list of cache names to monitor.
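
The script treats a site as `online`, `offline`, or `mixed`, where `mixed` means that only some caches are online. The following sketch shows that resolution step in isolation. The function name `resolve_cache_status` and the sample response are hypothetical; the sample follows the response shape that the script reads (a `status` field and an `online` list for each site) and is not captured from a real server.

```python
def resolve_cache_status(site_status: dict, caches: list) -> dict:
    """Map each monitored cache to 'online' or 'offline' for one remote site."""
    status = site_status.get('status')
    if status == 'mixed':
        # In a mixed state, only the caches listed under 'online' are online
        online = set(site_status.get('online', []))
        return {c: 'online' if c in online else 'offline' for c in caches}
    # Otherwise the site-wide status applies to every cache
    return {c: status for c in caches}


# Hypothetical response body for one remote site
sample = {
    'nyc': {'status': 'mixed', 'online': ['work'], 'offline': ['sessions']},
}
print(resolve_cache_status(sample['nyc'], ['work', 'sessions']))
# -> {'work': 'online', 'sessions': 'offline'}
```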

[role="_additional-resources"]
.Additional resources
* link:{rest_docs}#rest_v2_cache_manager_site_status_rest[REST API: Getting status of backup locations]

[[monitoring-cross-site-prometheus]]
[discrete]
== Monitoring cross-site replication with the Prometheus metrics

Prometheus, and other monitoring systems, let you configure alerts to detect when a site status changes to `offline`.

TIP: Monitoring cross-site latency metrics can help you to discover potential issues.

.Prerequisites
* Enable cross-site replication.

.Procedure
. Configure {brandname} metrics.
. Configure alerting rules using the Prometheus metrics format.
* For the site status, use `1` for `online` and `0` for `offline`.
* For the `expr` field, use the following format: +
`vendor_cache_manager_default_cache_<cache name>_x_site_admin_<site name>_status`.
+
In the following example, Prometheus alerts you when the *NYC* site goes `offline` for the cache named `work` or `sessions`.
+
[source,yaml,options="nowrap",subs=attributes+]
----
include::yaml/prometheus_xsite_rules.yml[]
----
+
The following image shows an alert that the *NYC* site is `offline` for cache `work`.
+
image::prometheus_xsite_alert.png[align="center",title="Prometheus Alert"]
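
If you monitor many caches or sites, you can generate the `expr` values instead of writing them by hand. The following sketch builds the metric name from the format given in the procedure. The helper names `site_status_metric` and `offline_expr` are hypothetical, and the sketch assumes that cache and site names appear in the metric name exactly as you pass them.

```python
def site_status_metric(cache: str, site: str) -> str:
    """Build the site status metric name for one cache and one remote site."""
    # Assumption: names are inserted as given, without any escaping
    return f'vendor_cache_manager_default_cache_{cache}_x_site_admin_{site}_status'


def offline_expr(cache: str, site: str) -> str:
    """Build an alert expression that fires when the site is offline (status 0)."""
    return f'{site_status_metric(cache, site)} == 0'


for cache in ('work', 'sessions'):
    print(offline_expr(cache, 'nyc'))
# -> vendor_cache_manager_default_cache_work_x_site_admin_nyc_status == 0
# -> vendor_cache_manager_default_cache_sessions_x_site_admin_nyc_status == 0
```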

[role="_additional-resources"]
.Additional resources
* link:{server_docs}#configuring-metrics_statistics-jmx[Configuring {brandname} metrics]
* link:https://prometheus.io/docs/alerting/latest/overview/[Prometheus Alerting Overview]
* link:https://grafana.com/docs/grafana/latest/alerting/[Grafana Alerting Documentation]
* link:https://docs.openshift.com/container-platform/latest/monitoring/managing-alerts.html#creating-alerting-rules-for-user-defined-projects_managing-alerts[OpenShift Managing Alerts]
102 changes: 102 additions & 0 deletions documentation/src/main/asciidoc/topics/python/monitor_site_status.py
@@ -0,0 +1,102 @@
#!/usr/bin/python3
import time
import requests
from requests.auth import HTTPDigestAuth


class InfinispanConnection:

    def __init__(self, server: str = 'http://localhost:11222', cache_manager: str = 'default',
                 auth: tuple = ('admin', 'change_me')) -> None:
        super().__init__()
        self.__url = f'{server}/rest/v2/cache-managers/{cache_manager}/x-site/backups/'
        self.__auth = auth
        self.__headers = {
            'accept': 'application/json'
        }

    def get_sites_status(self):
        try:
            rsp = requests.get(self.__url, headers=self.__headers, auth=HTTPDigestAuth(self.__auth[0], self.__auth[1]))
            if rsp.status_code != 200:
                return None
            return rsp.json()
        except (requests.RequestException, ValueError):
            # Connection failure or a body that is not valid JSON: let the caller try the next server
            return None


# Specify credentials for {brandname} user with permission to access the REST endpoint
USERNAME = 'admin'
PASSWORD = 'change_me'
# Set an interval between cross-site status checks
POLL_INTERVAL_SEC = 5
# Provide a list of servers
SERVERS = [
    InfinispanConnection('http://127.0.0.1:11222', auth=(USERNAME, PASSWORD)),
    InfinispanConnection('http://127.0.0.1:12222', auth=(USERNAME, PASSWORD))
]
# Specify the names of remote sites
REMOTE_SITES = [
    'nyc'
]
# Provide a list of caches to monitor
CACHES = [
    'work',
    'sessions'
]


def on_event(site: str, cache: str, old_status: str, new_status: str):
    # TODO implement your handling code here
    print(f'site={site} cache={cache} Status changed {old_status} -> {new_status}')


def __handle_mixed_state(state: dict, site: str, site_status: dict):
    # First poll for this site: record the initial status without raising events
    if site not in state:
        state[site] = {c: 'online' if c in site_status['online'] else 'offline' for c in CACHES}
        return

    for cache in CACHES:
        __update_cache_state(state, site, cache, 'online' if cache in site_status['online'] else 'offline')


def __handle_online_or_offline_state(state: dict, site: str, new_status: str):
    # First poll for this site: record the initial status without raising events
    if site not in state:
        state[site] = {c: new_status for c in CACHES}
        return

    for cache in CACHES:
        __update_cache_state(state, site, cache, new_status)


def __update_cache_state(state: dict, site: str, cache: str, new_status: str):
    old_status = state[site].get(cache)
    if old_status != new_status:
        on_event(site, cache, old_status, new_status)
        state[site][cache] = new_status


def update_state(state: dict):
    # Use the first server that returns a valid response
    rsp = None
    for conn in SERVERS:
        rsp = conn.get_sites_status()
        if rsp:
            break
    if rsp is None:
        print('Unable to fetch site status from any server')
        return

    for site in REMOTE_SITES:
        site_status = rsp.get(site, {})
        new_status = site_status.get('status')
        if new_status == 'mixed':
            __handle_mixed_state(state, site, site_status)
        else:
            __handle_online_or_offline_state(state, site, new_status)


if __name__ == '__main__':
    _state = {}
    while True:
        update_state(_state)
        time.sleep(POLL_INTERVAL_SEC)
@@ -0,0 +1,7 @@
groups:
- name: Cross Site Rules
  rules:
  - alert: Cache Work and Site NYC
    expr: vendor_cache_manager_default_cache_work_x_site_admin_nyc_status == 0
  - alert: Cache Sessions and Site NYC
    expr: vendor_cache_manager_default_cache_sessions_x_site_admin_nyc_status == 0
