Post release #196

Merged
merged 2 commits into from Mar 10, 2017
2 changes: 1 addition & 1 deletion build.sbt
@@ -28,7 +28,7 @@ val commonSettings = Seq(
lazy val docsSettings = Seq(
docsMappingsAPIDir := "api",
micrositeName := "Kanaloa",
micrositeDescription := "For your resiliency - traffic control with adaptive concurrency throttling.",
micrositeDescription := "Kanaloa - resiliency against traffic oversaturation",
micrositeAuthor := "Kanaloa contributors",
micrositeHighlightTheme := "atom-one-light",
micrositeHomepage := "http://iheartradio.github.io/kanaloa",
10 changes: 10 additions & 0 deletions docs/src/main/tut/docs/colophon.md
@@ -0,0 +1,10 @@
---
layout: page
title: Colophon
section: colophon
position: 8
---

### Special Thanks

The autothrottle algorithm was first suggested by @richdougherty. @ktoso kindly reviewed this library and provided valuable feedback.
7 changes: 7 additions & 0 deletions docs/src/main/tut/docs/guide/monitor.md
@@ -46,3 +46,10 @@ kanaloa {
}
}
```


#### Visualize with Grafana

If you use Grafana for StatsD visualization, we provide a [Grafana dashboard](https://github.com/iheartradio/kanaloa/blob/master/grafana/dashboard.json).
We also provide a [Docker image](https://github.com/iheartradio/docker-grafana-graphite) that lets you quickly get a StatsD server and visualization web app up and running; follow the instructions there.

24 changes: 14 additions & 10 deletions docs/src/main/tut/docs/theories.md
@@ -32,13 +32,16 @@ Fig 2. Positive traffic oversaturation - responses per second
Obviously, if the incoming traffic remained at 250 requests/second, i.e. if the high latencies did not slow down the incoming traffic, the latencies would continue to grow dramatically.

### Negative oversaturation
*Negative oversaturation* refers to the ones that occur when the capacity is negatively affected to below the incoming traffic. Again, let's use a simulation to demonstrate this scenario. In this one the mock service has a capacity of 220 requests/second. It was serving a traffic of 180 requests/second when it's capactity degrades by 30% capacity to 150 requests/second at roughly 15:32, which means that the capacity is 30 requests/second short. In a matter of seconds, the latency grew to over 10 seconds. Then due to this high latency the incoming traffic falls back to 110 requests/second. But the latencies continued to grow. At roughly 15:39 the capacity recovered. But due to the contention and queue the previous traffic overflow already caused, the latencies remains dreadful.
*Negative oversaturation* occurs when the service's capacity degrades to below the incoming traffic. Again, let's use a simulation to demonstrate this scenario. In this one the mock service starts with a capacity of 220 requests/second while serving traffic of 180 requests/second. Then its capacity degrades by roughly 30% to 150 requests/second at around 15:32, which means the capacity falls 30 requests/second short of the incoming traffic.

![baseline_response](../img/Negative_Gatling_response_time_baseline.png){: .center-image}

Fig 3. Negative traffic oversaturation - response time
{: .caption .center}

In a matter of seconds, the latency grew to over 10 seconds. Due to this high latency, the incoming traffic fell back to 110 requests/second, but the latencies continued to grow. At roughly 15:39 the capacity recovered, yet due to the contention and queueing that the earlier traffic overflow had already caused, the latencies remained dreadful.


![baseline_response](../img/Negative_Gatling_response_num_baseline.png){: .center-image}

Fig 4. Negative traffic oversaturation - responses per second
@@ -147,33 +150,33 @@ Fig 6. Positive traffic oversaturation with kanaloa - responses per second

When the incoming traffic exceeds the capacity at 200 requests/second, we first see a spike of latencies to roughly 4 seconds; they then fall back to roughly 1-1.5 seconds. The spike is due to the burst mode enabled in kanaloa: in burst mode, kanaloa will not reject traffic, hence the high latencies. Burst mode is limited to a short period of time, and as soon as kanaloa exits it, kanaloa starts to reject excessive traffic to keep latency under control. The latency gradually improves afterwards as kanaloa optimizes the throttle. It is much easier to see the difference kanaloa makes by comparing Fig 1, 2 with Fig 5, 6.

Now let's take a look how Kanaloa achieve this. Kanaloa provides real-time monotoring which gives us good insights into how it works.
Now let's take a look at how Kanaloa achieves this. Kanaloa provides real-time monitoring, which gives us good insight into how it works.

![kanaloa_response_num](../img/Positive_Grafana_inbound.png){: .center-image}

Fig 7. Positive traffic oversaturation with Kanaloa - inbound traffic
{: .caption .center}

Kanaloa achieves low latency by only allow a portion of incoming traffic that is within the capacity of the service and reject the excessive portion. As indicated by the charts above, after the incoming traffic exceeds the capacity which is 200 requests/second, kanaloa starts to reject portion of traffic. With the incoming traffic at 250 requests/second, kanaloa rejects 50 requests/second, which leaves 200 requests/second to pass through the service which is exactly it's capactity.
Kanaloa achieves low latency by letting through only the portion of incoming traffic that is within the capacity of the service and rejecting the excess. As indicated by the charts above, once the incoming traffic exceeds the capacity of 200 requests/second, kanaloa starts to reject a portion of the traffic. With the incoming traffic at 250 requests/second, kanaloa rejects 50 requests/second, which leaves 200 requests/second to pass through to the service - exactly its capacity.

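The arithmetic behind the rejection rate is simply the excess of inbound traffic over capacity. A minimal sketch using the figures quoted above (250 requests/second inbound, 200 requests/second capacity):

```scala
// Back-of-the-envelope check: once the throttle is saturated, the steady-state
// rejection rate is the inbound rate minus the service capacity.
val inboundRps  = 250.0                                   // inbound traffic in the simulation
val capacityRps = 200.0                                   // mock service capacity
val rejectedRps = math.max(0.0, inboundRps - capacityRps) // 50 requests/second rejected
val passedRps   = inboundRps - rejectedRps                // 200 requests/second pass through
```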
Kanaloa rejects traffic by applying two measures. First, it uses its concurrency throttle to cap the number of concurrent requests the service handles.
![kanaloa_response_num](../img/Positive_Grafana_pool_process.png){: .center-image}

Fig 8. Positive traffic oversaturation with Kanaloa - concurrency throttle
{: .caption .center}

The *utlized* metric in the upper chart is the concurrent requests the service is handling. When the incoming traffic ramps up, this number slowly increases as well, but when the traffic oversaturation happens at around 16:14, it quickly reaches the maximum cap Kanaloa is configured to allow, which is 60. This means that out of all the concurrent requests Kanaloa received at that moment, 60 go into the service. The time they spent in the service is indicated in the lower chart titled as "Process Time". As indicated in the chart, as traffic oversaturation happes, the procss time quickly increases alone with the number of concurrent requests in service. Thanks to the cap kanaloa imposes on concurrent requests, the process time also caps at 300ms.
The *utilized* metric in the upper chart is the number of concurrent requests the service is handling. When the incoming traffic ramps up, this number slowly increases as well, but when the traffic oversaturation happens at around 16:14, it quickly reaches the maximum cap Kanaloa is configured to allow, which is 60. This means that out of all the concurrent requests Kanaloa received at that moment, 60 go into the service. The time they spend in the service is shown in the lower chart, titled "Process Time". As the chart indicates, when traffic oversaturation happens, the process time quickly increases along with the number of concurrent requests in the service. Thanks to the cap kanaloa imposes on concurrent requests, the process time also caps at 300ms.

As mentioned previously, Kanaloa's conccurrent throtle is adaptive. The mock service can handle 20 concurrent requests in parallel, sending more requests to it will only have them sitting in a queue waiting and causes some contention. Kanaloa monitors the performance, i.e. throughput and process time, and gradually managed to set the throttle at around 20 concurrent requests - exactly the number the mock service can handle in parallel. This results in the process time shrinking to 100ms which is close to before traffic oversaturation.
As mentioned previously, Kanaloa's concurrency throttle is adaptive. The mock service can handle 20 concurrent requests in parallel; sending more requests to it only leaves them sitting in a queue waiting and causes some contention. Kanaloa monitors the performance, i.e. throughput and process time, and gradually manages to set the throttle at around 20 concurrent requests - exactly the number the mock service can handle in parallel. This results in the process time shrinking to 100ms, close to what it was before the traffic oversaturation.

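Little's law ties these numbers together: concurrency ≈ throughput × time in service. A small sketch with the figures above, assuming the throughput stays pinned at roughly 200 requests/second:

```scala
// Little's law: L = lambda * W (concurrency = throughput * time in service).
val throughputRps    = 200.0                // service throughput at saturation
val processTimeAtCap = 60 / throughputRps   // 60 concurrent requests -> 0.3 s, i.e. the 300ms cap
val optimalThrottle  = throughputRps * 0.1  // 100ms process time -> ~20 concurrent requests
```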
Now that the requests going into the services is capped, the exccessive requests go into a queue inside Kanaloa. See blow:
Now that the requests going into the service are capped, the excessive requests go into a queue inside Kanaloa. See below:

![kanaloa_response_num](../img/Positive_Grafana_queue_wait.png){: .center-image}

Fig 9. Positive traffic oversaturation with Kanaloa - kanaloa queue
{: .caption .center}

When the traffic oversaturation happens at around 16:14, kanaloa first enters burst mode - all the excessive requets go into the queue; with roughly 700 requests in the queue (as seen in the "Queue Length" chart), the wait time, defined as the time requests stay in this queue before they can be sent to the service, reaches 3-4 seconds as well. This is the main part of the initial latency spike we saw in Fig 5.
When the traffic oversaturation happens at around 16:14, kanaloa first enters burst mode - all the excessive requests go into the queue; with roughly 700 requests in the queue (as seen in the "Queue Length" chart), the wait time, defined as the time requests stay in this queue before they can be sent to the service, reaches 3-4 seconds as well. This is the main part of the initial latency spike we saw in Fig 5.

Very quickly Kanaloa exits burst mode, and PIE starts to increase the drop rate at which Kanaloa rejects traffic, as seen in Fig 7.

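To illustrate the idea behind PIE (this is a sketch, not Kanaloa's actual implementation; the names and constants below are made up for illustration): the drop probability is periodically nudged based on how far the estimated wait time, derived from queue length and dequeue rate via Little's law, is from a target, and on whether it is trending up or down.

```scala
// Illustrative PIE-style controller; hypothetical constants, not kanaloa's code.
final case class PieState(dropProbability: Double, lastWaitTime: Double)

// Little's law again: estimated wait ≈ queue length / dequeue rate.
// Assumes dequeueRatePerSec > 0.
def estimatedWaitTime(queueLength: Int, dequeueRatePerSec: Double): Double =
  queueLength / dequeueRatePerSec

def updateDropProbability(
  state: PieState,
  queueLength: Int,
  dequeueRatePerSec: Double,
  targetWait: Double = 0.1, // target wait time in seconds (illustrative)
  alpha: Double = 0.125,    // weight of the deviation from the target
  beta: Double = 1.25       // weight of the wait-time trend
): PieState = {
  val wait = estimatedWaitTime(queueLength, dequeueRatePerSec)
  val p = state.dropProbability +
    alpha * (wait - targetWait) +        // how far we are from the target wait
    beta * (wait - state.lastWaitTime)   // whether the wait is growing or shrinking
  PieState(p.max(0.0).min(1.0), wait)
}
```

Each incoming request would then be rejected with probability `dropProbability`; Kanaloa's actual parameters and update cadence differ.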
@@ -209,9 +212,10 @@ Fig 12. Negative traffic oversaturation with Kanaloa - inbound vs throughput

In the above charts, the throughput (depicted in the upper chart) is the capacity of the service. As it degrades to around 150 requests/second, Kanaloa starts to reject requests at roughly 30 requests/second out of the 180 incoming requests/second. This keeps the latencies low vs. the 10+ second latencies we saw in Fig 3. Just like in the positive oversaturation scenario, Kanaloa does this traffic rejection through the combination of its adaptive concurrency throttle and PIE.

## Summary
## Conclusion

Kanaloa protects your service against oversaturated traffic by adaptively throttles the concurrent requests the service handles and regulate the incoming traffic using a Little's law based algorithm called PIE that drops requests based on estimated wait time. The main advantages of Kanaloa are:
1. it requires little knowledge of the service capacity beforehand - it learns it on the fly.
Without backpressure, services become unusable or even collapse when incoming traffic exceeds capacity. This is more prone to happen, and harder to mitigate, in complex systems where capacity has a higher risk of being negatively impacted to an unknown level.
Kanaloa protects your service against oversaturated traffic by adaptively throttling the concurrent requests the service handles and regulating the incoming traffic using a Little's law based algorithm called PIE that drops requests based on estimated wait time. The main advantages of Kanaloa are:
1. it requires little knowledge of the service capacity beforehand — it learns it on the fly.
2. it's adaptive to the dynamic capacity of the service, and is thus suited to dealing with both positive and negative traffic oversaturation.
3. it's a reverse proxy in front of the service, so no changes are needed on the service side.
12 changes: 11 additions & 1 deletion docs/src/main/tut/index.md
@@ -6,12 +6,22 @@ technologies:
- third: ["Graphite", "Kanaloa uses Graphite to provide realtime monitoring"]
---

# Kanaloa
# ![icon](./img/navbar_brand2x.png) Kanaloa

[![Join the chat at https://gitter.im/iheartradio/kanaloa](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/iheartradio/kanaloa?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
[![Build Status](https://travis-ci.org/iheartradio/kanaloa.svg)](https://travis-ci.org/iheartradio/kanaloa)
[![Coverage Status](https://coveralls.io/repos/github/iheartradio/kanaloa/badge.svg?branch=0.5.x)](https://coveralls.io/github/iheartradio/kanaloa?branch=0.5.x)
[![Download](https://api.bintray.com/packages/iheartradio/maven/kanaloa-core/images/download.svg)](https://bintray.com/iheartradio/maven/kanaloa-core/_latestVersion)



Kanaloa is a library that makes a service more resilient by sitting in front of it as a reverse proxy, providing:


1. [the ability to exert backpressure during traffic oversaturation (incoming traffic exceeding capacity)](./docs/guide/backpressure.html)
2. [circuit breaker](./docs/guide/circuit-breaker.html)
3. [At-least-once delivery](./docs/guide/at-least-once.html)
4. [real-time monitor](./docs/guide/monitor.html)
5. [proportional load balancing (where appropriate)](./docs/guide/load-balancing.html)

For the motivation and methodologies of this library, go to [theories](./docs/theories.html).
37 changes: 37 additions & 0 deletions stress/gatling/src/test/scala/kanaloa/research/ResearchTests.scala
@@ -55,6 +55,23 @@ class KanaloaNegativeOverflowSimulation extends NegativeOverflowSimulation("clus

class BaselineNegativeOverflowSimulation extends NegativeOverflowSimulation("round_robin")

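/**
 * Baseline overhead gauge: drives the unthrottled path directly (no kanaloa),
 * as a reference point for the kanaloa overhead gauge.
 */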
class BaselineOverheadGaugeSimulation extends Simulation {
setUp(
Users(
numOfUsers = 20,
path = "straight_unthrottled",
throttle = Some(500000),
rampUp = 1.seconds
)
).protocols(http.disableCaching)
.assertions(
global.requestsPerSec.gte(800),
global.responseTime.percentile3.lte(4),
global.successfulRequests.percent.gte(100)
)
}


/**
* Baseline LB without kanaloa
*/
@@ -86,3 +103,23 @@ class KanaloaOverheadGaugeSimulation extends Simulation {
global.successfulRequests.percent.gte(100)
)
}


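/**
 * Baseline LB (round robin, without kanaloa) with one node going
 * unresponsive at 30s and coming back online at 55s.
 */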
class BaselineLoadBalanceOneNodeUnresponsiveSimulation extends Simulation {
setUp(
Users(
numOfUsers = 900,
path = "round_robin",
throttle = Some(200),
rampUp = 1.seconds
),
CommandSchedule(Command("unresponsive"), services(1), 30.seconds),
CommandSchedule(Command("back-online"), services(1), 55.seconds)
).protocols(http.disableCaching)
.assertions(
global.requestsPerSec.gte(140),
global.responseTime.percentile3.lte(3000),
global.successfulRequests.percent.gte(90),
global.failedRequests.count.lte(100)
)
}
37 changes: 1 addition & 36 deletions stress/gatling/src/test/scala/kanaloa/stress/Simulations.scala
@@ -36,7 +36,7 @@ class KanaloaLocalOverflowSimulation extends Simulation {
).protocols(http.disableCaching)
.assertions(
global.requestsPerSec.gte(250),
global.responseTime.percentile2.lte(1000),
global.responseTime.percentile2.lte(1200),
global.responseTime.percentile3.lte(5000), //mainly due to the burst mode
global.successfulRequests.percent.gte(60)
)
@@ -60,23 +60,6 @@ class KanaloaLoadBalanceOverflowSimulation extends Simulation {
}



class BaselineOverheadGaugeSimulation extends Simulation {
setUp(
Users(
numOfUsers = 20,
path = "straight_unthrottled",
throttle = Some(500000),
rampUp = 1.seconds
)
).protocols(http.disableCaching)
.assertions(
global.requestsPerSec.gte(800),
global.responseTime.percentile3.lte(4),
global.successfulRequests.percent.gte(100)
)
}

/**
* Kanaloa LB most basic without stress.
*/
@@ -152,24 +135,6 @@ class KanaloaLoadBalanceOneNodeSlowThroughputSimulation extends Simulation {
)
}

class BaselineLoadBalanceOneNodeUnresponsiveSimulation extends Simulation {
setUp(
Users(
numOfUsers = 900,
path = "round_robin",
throttle = Some(200),
rampUp = 1.seconds
),
CommandSchedule(Command("unresponsive"), services(1), 30.seconds),
CommandSchedule(Command("back-online"), services(1), 55.seconds)
).protocols(http.disableCaching)
.assertions(
global.requestsPerSec.gte(140),
global.responseTime.percentile3.lte(3000),
global.successfulRequests.percent.gte(90),
global.failedRequests.count.lte(100)
)
}

class KanaloaLoadBalanceOneNodeJoiningSimulation extends Simulation {
setUp(