Ankit Gupta edited this page Mar 15, 2022 · 35 revisions

ServiceQ

ServiceQ is a fault-tolerant HTTP load balancer. During a partial cluster shutdown (a few nodes down), incoming HTTP requests are routed to the cluster by a probabilistic algorithm; during a total cluster shutdown (all nodes down), they are buffered. Probabilistic routing ensures that a down node temporarily receives fewer requests, which reduces failures. Buffering assures clients that their requests will be executed irrespective of the present state of the cluster.

The graph below shows the routing probability (P) of a down node (D) in an 8-node cluster with respect to the number of requests (r). Notice how quickly the probability drops once incoming requests on D start to fail. Depending on the request rate, it takes only a few seconds (sometimes even milliseconds) to move requests away from D, so more requests are routed to healthier nodes. Note that even while requests keep failing on D (however few), ServiceQ retries them on other nodes until they succeed.

If a request fails on all the nodes, ServiceQ buffers it and periodically retries it on the cluster until it succeeds. If the buffer is full, all incoming requests are rejected. Below is the state transition diagram of the system with respect to an incoming request.

R: (Incoming Request)
G: (ServiceQ Entry Point), Q: (ServiceQ Buffer)
PCS: (Partial Cluster Shutdown), TCS: (Total Cluster Shutdown)

Configuration


All configuration is handled in the file sq.properties (it closely resembles a typical INI file). After installation, the permanent location of this file is /usr/local/serviceq/config/sq.properties. There are 4 mandatory properties: LISTENER_PORT, PROTO, ENDPOINTS and CONCURRENCY_PEAK. The rest are optional or can be left at their defaults. The default LISTENER_PORT is 5252 and the default PROTO is http. Every property goes on a separate line and can be commented out with a # prefix.

LISTENER_PORT=5252

PROTO=http (for both http and https)
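
As a rough illustration of the file format described above, the sketch below reads KEY=value lines, skips # comments and reports which of the four mandatory properties are missing. The names parse_properties and missing_required are hypothetical, and this is not ServiceQ's own loader.

```python
REQUIRED = ("LISTENER_PORT", "PROTO", "ENDPOINTS", "CONCURRENCY_PEAK")


def parse_properties(text):
    """Parse INI-like KEY=value lines, ignoring blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a KEY=value line; skipped in this sketch
        props[key.strip()] = value.strip()
    return props


def missing_required(props):
    """Return the mandatory property names that are absent from props."""
    return [key for key in REQUIRED if key not in props]


sample = """
# listener settings
LISTENER_PORT=5252
PROTO=http
ENDPOINTS=http://my.server1.com:8080
"""
props = parse_properties(sample)
print(props["LISTENER_PORT"])
print(missing_required(props))
```

A check like this can be run against a copy of sq.properties before restarting the service.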

Service Endpoints


A group of upstream services, deployed on a set of servers/ports, can be added as a comma-separated list to ENDPOINTS. Make sure every endpoint includes a scheme; the port is optional. If no port is provided, ServiceQ assumes that http and https endpoints run on ports 80 and 443 respectively. Although it is not recommended, the endpoint list can mix services running on http and https schemes.

ENDPOINTS=http://my.server1.com:8080,http://my.server2.com:8080,http://my.server3.com:8080
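
The default-port rule can be sketched as follows. This is an illustration only: parse_endpoints is a hypothetical helper built on Python's standard urllib.parse, not code taken from ServiceQ.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def parse_endpoints(value):
    """Split an ENDPOINTS value into (scheme, host, port) tuples."""
    endpoints = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        parts = urlsplit(item)
        # Every endpoint must carry an explicit http or https scheme.
        if parts.scheme not in DEFAULT_PORTS or not parts.hostname:
            raise ValueError(f"endpoint needs an http/https scheme: {item!r}")
        # Fall back to 80/443 when no port is given.
        port = parts.port or DEFAULT_PORTS[parts.scheme]
        endpoints.append((parts.scheme, parts.hostname, port))
    return endpoints


print(parse_endpoints("http://my.server1.com:8080,https://my.server2.com"))
```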

Concurrency Peak


ServiceQ can handle and distribute concurrent requests well and is limited only by the load-handling capabilities of the upstream services. It is therefore encouraged to load test the services (any load tester such as ab, wrk, jmeter or httpress will do) and find the maximum load the cluster can take. Determining this at cluster level is important because the bottleneck is usually a central database, message queue or throttled third-party service that all services access. Both the active connections and the deferred queue are governed by this limit.

CONCURRENCY_PEAK=2048

Once the concurrency peak is reached, ServiceQ responds with 429 Too Many Requests. If ServiceQ is blocked for any reason and cannot accept new connections, they remain queued in the Linux receive buffer, and clients may experience slowness. It is therefore advisable to set client timeouts for such scenarios.
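
One way to picture the limit is a non-blocking gate in front of the forwarding step. The sketch below is an assumption about how such a gate could look, not ServiceQ's implementation; ConcurrencyGate and handle are hypothetical names, and the 429 body is the one ServiceQ is documented to return.

```python
import threading


class ConcurrencyGate:
    """Admits at most `peak` requests at a time without blocking callers."""

    def __init__(self, peak):
        self._slots = threading.BoundedSemaphore(peak)

    def try_enter(self):
        # Returns False immediately when all slots are taken.
        return self._slots.acquire(blocking=False)

    def leave(self):
        self._slots.release()


def handle(gate, forward):
    """Run forward() if a slot is free, otherwise answer with 429."""
    if not gate.try_enter():
        return 429, '{"sq_msg": "Request Discarded"}'
    try:
        return forward()
    finally:
        gate.leave()


gate = ConcurrencyGate(peak=2048)
print(handle(gate, lambda: (200, "ok")))
```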

Deferred Request Queue


If n out of n nodes are down, requests are queued up. They are forwarded in FIFO order as soon as any node becomes available again. Although the system places no restriction, unless asked to, on the kind of requests that can be queued and forwarded, it is important to understand the implications. ServiceQ responds with 503 Service Unavailable to the client if all nodes are down. Deferred requests are therefore desirable when the request uses an HTTP method that changes the state of the system and the client's workflow does not depend on the response. For example, a fire-and-forget PUT request sent while all services are down will still update the destination system, albeit at a later point in time. On the other hand, if the client has exited after firing a GET request and ServiceQ fetches the response on next availability, the result of the GET is lost and the retry is pure overhead for the system. This should be avoided. The user therefore controls whether queueing is enabled and which kinds of requests are considered for queueing (for example, only POST/PUT/PATCH/DELETE on specific routes).

Enable/Disable deferred queue

ENABLE_DEFERRED_Q=true

Format of the requests for which the deferred queue is enabled. The first token must always be the method, followed by an optional route and an optional exclamation mark (!). Suffix '!' to disable queueing for a single method+route combination. A few examples:

Q_REQUEST_FORMATS=ALL (buffer all)
Q_REQUEST_FORMATS=POST,PUT,PATCH,DELETE (buffer these methods)
Q_REQUEST_FORMATS=POST /orders !,POST,PUT /orders,PATCH,DELETE (buffer POST except POST /orders, buffer PUT only for /orders, buffer PATCH and DELETE)
Q_REQUEST_FORMATS=POST /orders,PATCH (buffer POST /orders, PATCH)
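
The examples above suggest the following reading of the format, sketched here under stated assumptions: a bare method matches every route of that method, "METHOD /route" matches only that combination, a trailing "!" excludes a combination, and routes are compared exactly. ServiceQ's real matching (for instance prefix matching) may differ; parse_formats and should_buffer are hypothetical names.

```python
def parse_formats(value):
    """Turn a Q_REQUEST_FORMATS value into (allow_all, methods, routes, excluded)."""
    allow_all = False
    methods, routes, excluded = set(), set(), set()
    for entry in value.split(","):
        entry = entry.strip()
        negated = entry.endswith("!")
        if negated:
            entry = entry[:-1]
        tokens = entry.split()
        if not tokens:
            continue
        method = tokens[0].upper()
        route = tokens[1] if len(tokens) > 1 else None
        if method == "ALL":
            allow_all = True
        elif route is None:
            methods.add(method)  # whole method is eligible
        elif negated:
            excluded.add((method, route))  # this combination is never buffered
        else:
            routes.add((method, route))  # only this combination is eligible
    return allow_all, methods, routes, excluded


def should_buffer(rules, method, path):
    """Decide whether a request is eligible for queueing (exact route match)."""
    allow_all, methods, routes, excluded = rules
    key = (method.upper(), path)
    if key in excluded:
        return False
    return allow_all or key[0] in methods or key in routes


rules = parse_formats("POST /orders !,POST,PUT /orders,PATCH,DELETE")
for method, path in [("POST", "/orders"), ("POST", "/users"), ("PUT", "/orders"), ("PUT", "/users")]:
    print(method, path, should_buffer(rules, method, path))
```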

Upfront Request Queue


ServiceQ can also be used to process requests asynchronously right from the moment a request lands. In this mode the incoming request is put in an upfront queue and ServiceQ immediately returns to the client. The request is then processed from the queue rather than synchronously. The formats defined in the Q_REQUEST_FORMATS list apply to upfront queueing as well.

Enable/Disable upfront queue

ENABLE_UPFRONT_Q=true

Routing Algorithm


ServiceQ's approach to routing builds an error feedback loop on top of a randomized + round-robin algorithm. The routing algorithm takes into account the retry attempt and the current and past service state in order to choose the best possible route. Because of the error feedback, the probability of choosing an unhealthy node reduces over time. If all nodes are up, there is <5% deviation, and it increases if a few nodes are deemed unhealthy. If all nodes are down and ServiceQ fails to process the request, the request is queued to be processed later (if eligible).
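
To make the feedback idea concrete, here is a minimal sketch that uses plain weighted random selection in which each recorded failure lowers a node's weight. The weighting formula 1 / (1 + errors), the class FeedbackRouter and its methods are all assumptions made for illustration; ServiceQ's actual algorithm and constants are not described here, and this sketch omits the round-robin component.

```python
import random


class FeedbackRouter:
    """Weighted random routing where failures lower a node's weight."""

    def __init__(self, nodes, rng=None):
        self.errors = {node: 0 for node in nodes}
        self.rng = rng or random.Random()

    def weight(self, node):
        # Hypothetical weighting: each recorded failure shrinks the weight.
        return 1.0 / (1 + self.errors[node])

    def probability(self, node):
        total = sum(self.weight(n) for n in self.errors)
        return self.weight(node) / total

    def pick(self, exclude=()):
        # `exclude` holds nodes already tried for this request (retry attempts).
        candidates = [n for n in self.errors if n not in exclude]
        if not candidates:
            return None  # every node was tried; the caller may buffer the request
        weights = [self.weight(n) for n in candidates]
        return self.rng.choices(candidates, weights=weights, k=1)[0]

    def report(self, node, ok):
        # Feedback loop: failures add to the error count, successes reduce it.
        if ok:
            self.errors[node] = max(0, self.errors[node] - 1)
        else:
            self.errors[node] += 1


router = FeedbackRouter([f"node{i}" for i in range(8)])
before = router.probability("node0")
for _ in range(7):
    router.report("node0", ok=False)
after = router.probability("node0")
print(f"P(node0): {before:.3f} -> {after:.3f}")
```

With equal weights every node in an 8-node cluster starts at probability 1/8; repeated failures on one node push its share down while the others absorb the traffic.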

Cluster State and Behaviour


n-node cluster healthy, all nodes up

Active connections are forwarded to one of the nodes in the cluster. The node is chosen by the routing algorithm. The maximum number of active connections is governed by the CONCURRENCY_PEAK setting.

n-node cluster unhealthy, [1:n-1] nodes down

The process is the same as above, except that the error rate increases. The error rate is stored in a hashset against the service address and logged to disk with the appropriate error code.

n-node cluster unhealthy, n nodes down

The error rate shoots to 100% and the request is buffered in a FIFO queue. The buffered request remains in the queue until at least one service is available again. If active connections are being accepted at that moment, they are forwarded concurrently with the buffered requests. There is no precedence logic here.
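
The buffering behaviour can be pictured as a bounded FIFO queue that is drained once a node answers again. The following is only a sketch: DeferredQueue, its capacity argument and the choice to stop draining at the first failed delivery are assumptions, not details taken from ServiceQ.

```python
from collections import deque


class DeferredQueue:
    """Bounded FIFO buffer for requests that could not be delivered."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = deque()

    def offer(self, request):
        """Buffer a request; return False (reject) when the buffer is full."""
        if len(self._items) >= self.capacity:
            return False
        self._items.append(request)
        return True

    def drain(self, send):
        """Retry buffered requests oldest first; stop at the first failure."""
        delivered = 0
        while self._items:
            if not send(self._items[0]):
                break  # cluster still down; keep the request at the head
            self._items.popleft()
            delivered += 1
        return delivered


queue = DeferredQueue(capacity=2)
queue.offer("PUT /orders/1")
queue.offer("PUT /orders/2")
sent = []
print(queue.drain(lambda request: sent.append(request) or True), sent)
```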

Outgoing Request Timeout


It is good practice to set timeouts on outgoing requests so that long-running requests are cut short, connections are freed up and latency is kept in check. To enable this, ServiceQ adds a timeout to every outgoing request to the cluster. The default value is 5s; it should be kept low so that retries happen faster.

# Timeout (s) added to each outgoing request to endpoints; existing timeouts are overridden; a value of -1 means no timeout
OUTGOING_REQUEST_TIMEOUT=5
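
For comparison, this is how the same setting might be applied in a small Python forwarder using the standard urllib.request module. The helpers effective_timeout and forward are hypothetical; the only behaviour taken from the text is the 5 second default and -1 meaning no timeout.

```python
import urllib.request


def effective_timeout(configured):
    """Map an OUTGOING_REQUEST_TIMEOUT value to a socket timeout; -1 disables it."""
    seconds = float(configured)
    return None if seconds == -1 else seconds


def forward(url, configured="5"):
    """Send a GET to an endpoint, giving up after the configured timeout."""
    timeout = effective_timeout(configured)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read()


print(effective_timeout("5"), effective_timeout("-1"))
```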

Custom Response Headers


ServiceQ can add custom headers to client responses. These are added as a pipe-separated list to CUSTOM_RESPONSE_HEADERS. It is recommended to test the headers thoroughly before adding them, as some of them can adversely affect the client.

CUSTOM_RESPONSE_HEADERS=Connection: keep-alive|Server

Client Responses


If upstream and clients are both alive, ServiceQ simply tunnels the response from upstream to client. In case of failures, relevant responses are provided to help the client recover. For example -

Upstream Connected            - Tunneled Response
Concurrent Conn Limit Exceed  - 429 Too Many Requests ({"sq_msg": "Request Discarded"})
All Nodes Are Down            - 503 Service Unavailable ({"sq_msg": "Request Buffered"})
Request Timed Out             - 504 Gateway Timeout
Request Malformed             - 400 Bad Request
Undeterministic Error         - 502 Bad Gateway
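
On the client side, the table above can drive a simple recovery policy. The sketch below is one possible policy, not a recommendation from ServiceQ: the function client_action and the action names are hypothetical, while the status codes and the sq_msg values come from the table.

```python
import json


def client_action(status, body=""):
    """Suggest what a client might do with a ServiceQ response (illustrative policy)."""
    if status == 429:
        return "back_off_and_retry"  # ServiceQ discarded the request
    if status == 503:
        try:
            message = json.loads(body).get("sq_msg", "")
        except (ValueError, AttributeError):
            message = ""
        # A buffered request will be replayed by ServiceQ, so resending may duplicate it.
        return "do_not_resend" if message == "Request Buffered" else "retry_later"
    if status == 504:
        return "retry_if_idempotent"
    if status == 400:
        return "fix_request"
    if status == 502:
        return "retry_later"
    return "use_response"  # tunneled upstream response


print(client_action(503, '{"sq_msg": "Request Buffered"}'))
print(client_action(429, '{"sq_msg": "Request Discarded"}'))
```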

Error Detection and Logging


ServiceQ detects and logs three types of errors: ServiceQ Flooded (Error Code 601), Service Unavailability (Error Code 701) and HTTP Request (Error Code 702) errors. Error Code 702 covers upstream timeouts, malformed requests and unexpected connection loss. Errors are logged to /usr/local/serviceq/logs/serviceq_error.log in this format:

ServiceQ: 2020/06/18 14:10:28 Error detected on https://api.server0.org:8001 [Code: 601, SERVICEQ_FLOODED]
ServiceQ: 2020/06/18 14:10:28 Error detected on https://api.server0.org:8001 [Code: 701, UPSTREAM_DOWN]
ServiceQ: 2020/06/18 14:11:12 Error detected on https://api.server1.org:8002 [Code: 702, UPSTREAM_TIMED_OUT]
ServiceQ: 2020/06/18 14:13:33 Error detected on https://api.server1.org:8002 [Code: 702, UPSTREAM_TIMED_OUT]

Data-related errors returned by upstream are tunneled directly to the client without being logged.
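
For anyone post-processing the error log, the line format shown above can be matched with a regular expression. This is a sketch based only on the four sample lines; parse_error_line is a hypothetical helper, and real logs may contain variations it does not cover.

```python
import re
from datetime import datetime

LOG_LINE = re.compile(
    r"^ServiceQ: (?P<ts>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"Error detected on (?P<endpoint>\S+) "
    r"\[Code: (?P<code>\d+), (?P<name>[A-Z_]+)\]$"
)


def parse_error_line(line):
    """Return the fields of one error log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "time": datetime.strptime(match["ts"], "%Y/%m/%d %H:%M:%S"),
        "endpoint": match["endpoint"],
        "code": int(match["code"]),
        "name": match["name"],
    }


entry = parse_error_line(
    "ServiceQ: 2020/06/18 14:11:12 Error detected on "
    "https://api.server1.org:8002 [Code: 702, UPSTREAM_TIMED_OUT]"
)
print(entry["code"], entry["endpoint"])
```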

HTTPS Support


ServiceQ comes with complete TLS/SSL support, making proxy connections more secure. It can act as either a TLS-terminating proxy or a TLS-forwarding proxy. This means that even when the connection between client and ServiceQ is secured by configuring ServiceQ with TLS, you can still configure ENDPOINTS with either HTTP or HTTPS URLs and ServiceQ will forward the request.

By default, SSL is disabled. It can be enabled by setting SSL_ENABLE to true.

SSL_ENABLE=true

The SSL handshake requires an SSL certificate and a private key. There are two ways to add them to ServiceQ:

Automatic

ServiceQ can automatically obtain and store an SSL certificate and private key from the designated CA (Let's Encrypt). This can be enabled by setting SSL_AUTO_ENABLE to true.

SSL_AUTO_ENABLE=true

For the issuance process to succeed, the user also needs to configure the information below:

# Any path with appropriate read/write permissions will work
SSL_AUTO_CERTIFICATE_DIR=/etc/ssl/certs 
# Any email
SSL_AUTO_EMAIL=me@mydomain.com
# Domain pointing to whichever IP/port serviceq is running
SSL_AUTO_DOMAIN_NAMES=myservice.com
# Renew before x days of expiration
SSL_AUTO_RENEW_BEFORE=30

Note that SSL_AUTO_ENABLE=true is only considered if SSL_ENABLE=true.

Manual

Obtain the SSL certificate and private key files from a CA yourself and configure them as below:

# Any path with appropriate read permissions will work
SSL_CERTIFICATE_FILE=/usr/certs/cert.pem
SSL_PRIVATE_KEY_FILE=/usr/certs/key.pem

Note that manual mode is only considered if SSL_ENABLE=true and SSL_AUTO_ENABLE=false.
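
The two notes on precedence can be summarised in a few lines of code. The function ssl_mode and its return values are hypothetical; the logic only restates the rules above, namely that SSL_AUTO_ENABLE is considered only when SSL_ENABLE=true and that manual mode applies when SSL_ENABLE=true and SSL_AUTO_ENABLE=false.

```python
def ssl_mode(props):
    """Return 'disabled', 'automatic' or 'manual' for a dict of properties."""

    def flag(key):
        # Missing keys count as false, matching "By default, SSL is disabled".
        return props.get(key, "false").strip().lower() == "true"

    if not flag("SSL_ENABLE"):
        return "disabled"
    return "automatic" if flag("SSL_AUTO_ENABLE") else "manual"


print(ssl_mode({"SSL_ENABLE": "true", "SSL_AUTO_ENABLE": "true"}))
print(ssl_mode({"SSL_ENABLE": "true"}))
print(ssl_mode({"SSL_AUTO_ENABLE": "true"}))
```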

To improve SSL performance, it is advisable to add a keep-alive header to the CUSTOM_RESPONSE_HEADERS key. When using keep-alive, an optional timeout can be set, after which the persistent TCP connection is dropped and the client has to re-establish the connection.

KEEP_ALIVE_TIMEOUT=120