A library that supports pull-based health checks of Onyx peers
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
.circleci
resources
scripts
src/onyx
test/onyx
.gitignore
CHANGES.MD
LICENSE
README.md
project.clj

README.md

onyx-peer-http-query

Onyx Peer HTTP Query provides an inbuilt HTTP server to service replica and cluster queries that can be directed at Onyx nodes. One use case is to provide a health check for your Onyx nodes, as it becomes easy to determine what a node's view of the cluster is.

HTTP Server

This library exposes an HTTP server to service replica and cluster queries across languages.

To use it, add onyx-peer-http-query to your dependencies:

[org.onyxplatform/onyx-peer-http-query "0.14.1.0"]

Require onyx.http-query in your peer bootup namespace:

(:require [onyx.http-query])

And add the following lines to Onyx's peer-config

 :onyx.query/server? true
 :onyx.query.server/port 8080

In addition, you can optionally add the IP to listen on with

 :onyx.query.server/ip "127.0.0.1"

JMX selectors can, and should be whitelisted/queried via the peer-config: e.g.

 :onyx.query.server/metrics-selectors ["org.onyxplatform:*" "com.amazonaws.management:*"]

The default behaviour is

 :onyx.query.server/metrics-selectors ["*:*"]

Individual metrics tags can be blacklisted via the peer-config:

 :onyx.query.server/metrics-blacklist [#"blacklisted_tag1" #"blacklistregex.*"]

Accessing the HTTP server

Then query it to get a view of that nodes understanding of the cluster:

$ http --json http://localhost:8080/replica/peers
HTTP/1.1 200 OK
Content-Length: 197
Content-Type: application/json
Date: Tue, 23 Feb 2016 03:35:08 GMT
Server: Jetty(9.2.10.v20150310)

{
    "as-of-entry": 12,
    "as-of-timestamp": 1456108757818,
    "result": [
        "e52df81d-38c9-44e6-9e3d-177d3e83292b",
        "fd4725f9-3429-49eb-840d-6c3e29cecc41",
        "fc933dda-7260-4547-93fc-241a02ca599a"
    ],
    "status": "success"
}

Note as-of-entry and as-of-timestamp. By comparing as-of-entry between nodes, you can discover whether a node is lagging behind the cluster.

Further API endpoints are described here.

Endpoints

The Replica Query Server has a number of endpoints for accessing the information about a running Onyx cluster. Below we display the HTTP method, the URI, the docstring for the route, and any associated parameters that it takes in its query string.

Summary

  • /health
  • /peergroup/heartbeat
  • /peergroup/stuckpeers
  • /peergroup/health
  • /network/media-driver/active
  • /metrics
  • /state
  • /job/catalog
  • /job/flow-conditions
  • /job/lifecycles
  • /job/task
  • /job/triggers
  • /job/windows
  • /job/workflow
  • /job/exception
  • /replica
  • /replica/completed-jobs
  • /replica/job-allocations
  • /replica/job-scheduler
  • /replica/jobs
  • /replica/killed-jobs
  • /replica/peer-site
  • /replica/peer-state
  • /replica/peers
  • /replica/task-allocations
  • /replica/allocation-version
  • /replica/task-scheduler
  • /replica/tasks


Route

[:get] /health

Query Params Schema

{"threshold" java.lang.Long}

Docstring

A single health check call to check whether the following statuses are healthy: /network/media-driver/active, /peergroup/heartbeat, and /peergroup/stuckpeers. Considers the peer group dead if timeout is greater than ?threshold=VALUE. Returns status 200 if healthy, 500 if unhealthy. Use this route for failure monitoring, automatic rebooting, etc.

--

Route

[:get] /peergroup/heartbeat

Query Params Schema

{}

Docstring

Returns the number of milliseconds since the peer group last heartbeated.


Route

[:get] /peergroup/stuckpeers

Query Params Schema

{}

Docstring

Returns the number of milliseconds that a peer has been stuck while being shutdown, indicating a stuck thread.


Route

[:get] /peergroup/health

Query Params Schema

{}

Docstring

A health check call to check whether the peer group has heartbeated more recently than a threshold. Considers the peer group dead if timeout is greater than ?threshold=VALUE. Returns status 200 if healthy, 500 if unhealthy. Use this route for failure monitoring, automatic rebooting, etc.

Route

[:get] /network/media-driver

Query Params Schema

{}

Docstring

Returns a map describing the media driver status. e.g.

{:active true, 
 :driver-timeout-ms 10000, 
 :log "INFO: Aeron directory /var/folders/c5/2t4q99_53mz_c1h9hk12gn7h0000gn/T/aeron-lucas exists
       INFO: Aeron CnC file /var/folders/c5/2t4q99_53mz_c1h9hk12gn7h0000gn/T/aeron-lucas/cnc.dat exists
       INFO: Aeron toDriver consumer heartbeat is 687 ms old"}

Route

[:get] /network/media-driver/active

Query Params Schema

{}

Docstring

Returns a boolean for whether the media driver is active and has heartbeated within driver-timeout-ms milliseconds.


Route

[:get] /metrics

Query Params Schema

{}

Docstring

Returns any numeric JMX metrics contained in this VM, converted to prometheus tags.


Route

[:get] /state

Query Params Schema

{"job-id" java.lang.String "task-id" java.lang.String "slot-id" java.lang.Long "window-id" java.lang.String "allocation-version" java.lang.Long ;; optional "start-time" java.lang.Long ;; optional "end-time" java.lang.Long ;; optional "groups" [Any]}

Docstring

Retrieve a task's window state for a particular job. Must supply the :allocation-version for the job. The allocation version can be looked up via the /replica/allocation-version, or by subscribing to the log and looking up the [:allocation-version job-id].

If groups is supplied, only the state for the groups supplied will be retrieved.


Route

[:get] /job/catalog

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns catalog for this job.


Route

[:get] /job/flow-conditions

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns flow conditions for this job.


Route

[:get] /job/lifecycles

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns lifecycles for this job.


Route

[:get] /job/task

Query Params Schema

{"job-id" java.lang.String, "task-id" java.lang.String}

Docstring

Given a job id and task id, returns catalog entry for this task.


Route

[:get] /job/triggers

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns triggers for this job.


Route

[:get] /job/windows

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns windows for this job.


Route

[:get] /job/workflow

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns workflow for this job.


Route

[:get] /job/exception

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns the exception that killed this job, if one exists.


Route

[:get] /replica

Query Params Schema

``

Docstring

Derefences the replica as an immutable value.


Route

[:get] /replica/completed-jobs

Query Params Schema

``

Docstring

Lists all the job ids that have been completed.


Route

[:get] /replica/job-allocations

Query Params Schema

``

Docstring

Returns a map of job id -> task id -> peer ids, denoting which peers are assigned to which tasks.


Route

[:get] /replica/job-scheduler

Query Params Schema
Docstring

Returns the job scheduler for this tenancy of the cluster.


Route

[:get] /replica/jobs

Query Params Schema

``

Docstring

Lists all non-killed, non-completed job ids.


Route

[:get] /replica/killed-jobs

Query Params Schema

``

Docstring

Lists all the job ids that have been killed.


Route

[:get] /replica/peer-site

Query Params Schema

{"peer-id" java.lang.String}

Docstring

Given a peer id, returns the Aeron hostname and port that this peer advertises to the rest of the cluster.


Route

[:get] /replica/peer-state

Query Params Schema

{"peer-id" java.lang.String}

Docstring

Given a peer id, returns its current execution state (e.g. :idle, :active, etc).


Route

[:get] /replica/peers

Query Params Schema

``

Docstring

Lists all the peer ids.


Route

[:get] /replica/task-allocations

Query Params Schema

``

Docstring

Given a job id, returns a map of task id -> peer ids, denoting which peers are assigned to which tasks for this job only.


Route

[:get] /replica/allocation-version

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns the replica-version at which the job last rescheduled. This is important because the replica-version forms part of the vector clock that is used to determine ordering/validity of messages in the cluster, along with the barrier epoch.


Route

[:get] /replica/task-scheduler

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns the task scheduler for this job.


Route

[:get] /replica/tasks

Query Params Schema

{"job-id" java.lang.String}

Docstring

Given a job id, returns all the task ids for this job.

License

Copyright © 2016 Distributed Masonry Inc.

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.