Merge pull request #19 from nabeken/multi-queue-2

Add multiqueue package to support multiple region deployment (part 2)
nabeken · Mar 6, 2021 · 299f176 · 299f176
2 parents 2cc91ff + 9ab920d
commit 299f176
Show file tree

Hide file tree

Showing 8 changed files with 759 additions and 5 deletions.
diff --git a/README.md b/README.md
@@ -1,13 +1,14 @@
 # aws-go-sqs
 
-[![GoDoc](http://img.shields.io/badge/godoc-reference-blue.svg)](http://godoc.org/github.com/nabeken/aws-go-sqs/queue)
+![Go](https://github.com/nabeken/aws-go-sqs/workflows/Go/badge.svg)
+[![PkgGoDev](https://pkg.go.dev/badge/github.com/nabeken/aws-go-sqs/v3)](https://pkg.go.dev/github.com/nabeken/aws-go-sqs/v3)
 [![MIT License](http://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
 
 `aws-go-sqs` is a SQS wrapper library for [aws/aws-sdk-go](https://github.com/aws/aws-sdk-go).
 
 # Usage
 
-`v3` and later requires Go Modules support to import.
+`v3` and later require Go Modules support to import this package.
 ```go
 import "github.com/nabeken/aws-go-sqs/v3/queue"
 ```
@@ -17,11 +18,90 @@ From v3 train (and `master` branch), we no longer use `gopkg.in`.
 - We have [v1 branch](https://github.com/nabeken/aws-go-sqs/tree/v1) so you can import it from `gopkg.in/nabeken/aws-go-sqs.v1`.
 - We have [v2 branch](https://github.com/nabeken/aws-go-sqs/tree/v2) so you can import it from `gopkg.in/nabeken/aws-go-sqs.v2`.
 
-## Example
+# Multi-queue implementation
 
-See examples in [GoDoc](http://godoc.org/github.com/nabeken/aws-go-sqs/queue#pkg-examples).
+v3 has [multiqueue](multiqueue/) package to address multi-queue (region) deployment of SQS. SQS is a crucial messaging component but it's still possible to become unavailable for several hours. We experienced that incident at [2020-04-20](https://status.aws.amazon.com/rss/sqs-ap-northeast-1.rss).
 
-## Testing
+Since SQS is just a message bus between components, deploying SQS to the multiple regions for availability works even the system isn't fully deployed to the multiple regions.
+
+## How `multiqueue` works
+
+`multiqueue.Dispatcher` maintains:
+- a slice of all of the `*queue.Queue`
+- a slice of "available" `*queue.Queue`
+- circuit breakers for each `*queue.Queue`
+
+When the circuit state transitions to open, the dispatcher will remove the associated queue from the "available" slice so that it won't be used. The circuit breaker will update the state to half-opened then the dispatcher will push it back to "available" slice. If there is no further error, the circuit will be closed.
+
+When there is no queue in the "available" slice, the dispatcher returns a queue from all of the `*queue.Queue` by random.
+
+Example code:
+```go
+// Create SQS instance
+s := sqs.New(session.Must(session.NewSession()))
+
+// Create Queue instances
+q1 := queue.MustNew(s, "example-queue1")
+q2 := queue.MustNew(s, "example-queue2")
+
+// https://godoc.org/github.com/mercari/go-circuitbreaker#Options
+cbOpts := &circuitbreaker.Options{
+    Interval:   1 * time.Minute,
+    ShouldTrip: circuitbreaker.NewTripFuncFailureRate(100, 0.7),
+}
+
+d := multiqueue.New(cbOpts, q1, q2).
+    WithOnStateChange(func(q *queue.Queue, from, to circuitbreaker.State) {
+        log.Printf("%s: state has been changed from %s to %s", *q.URL, from, to)
+    })
+
+// exec will be q1 or q2
+exec := d.Dispatch()
+
+// calling the existing method in *queue.Queue via Do involves the circuit breaker
+_, err := exec.Do(ctx, func() (interface{}, error) {
+  return exec.SendMessage("MESSAGE BODY)
+})
+...
+```
+
+You can find [example/test-multiqueue](example/test-multiqueue) for the full example.
+
+## Consideration for ReceiveMessage via `multiqueue`
+
+There is a case which causes the performance issue when there are no messages in one of the registered queues.
+
+Let's say you have two queues: Queue A has 1000 messages and Queue B has no messages.
+
+`Dispatcher.Dispatch()` will return one of the registered queues. The problem is when the B is returned. SQS will block until new message arrives for 20 seconds in default (WaitReceiveTime). You spend 20 seconds even you have 1000 messages in the A. If, unfortunately, the dispatcher returns B again, you will spend additional 20 seconds.
+
+To avoid this situation, you should poll all the registered queueus in parallel. `Dispatcher.GetExecutors()` will return all of the registered queue wrapped with `Executor` so that you still can call the API over the circuit breaker.
+
+## Design note
+
+When it comes to multi-region deployment, you may think about *primary* and *secondary* and use secondary when the primary becomes unavailable. Such failover codepath won't be called until the primary becomes unavailable so you have a rare chance to get it tested in production. You may agree that such code may contain a bug or the secondary queue configuration may not be up-to-date since mostly the secondary queue is not used at all. Any hidden problems would be triggered when you're on fire.
+
+Let's use all of the available queues all the time and stop sending requests to an unavailable queue until it recovers. You can have the confidence that your system always works with all of the available queues.
+
+AWS recommends this patterns in Well-Architected Framework.
+> https://wa.aws.amazon.com/wat.question.REL_11.en.html says:
+>
+> Use static stability to prevent bimodal behavior: Bimodal behavior is when your workload exhibits different behavior under normal and failure modes, for example, relying on launching new instances if an Availability Zone fails. You should instead build workloads that are statically stable and operate in only one mode. In this case, provision enough instances in each Availability Zone to handle the workload load if one AZ were removed and then use Elastic Load Balancing or Amazon Route 53 health checks to shift load away from the impaired instances.
+>
+> https://d1.awsstatic.com/whitepapers/architecture/AWS-Reliability-Pillar.pdf says:
+>
+> **Test disaster recovery implementation to validate the implementation**: Regularly test failover to DR to
+ensure that RTO and RPO are met.
+>
+> A pattern to avoid is developing recovery paths that are rarely executed. For example, you might have a secondary data store that is used for read-only queries.
+> When you write to a data store and the primary fails, you might want to fail over to the secondary data store.
+> If you don’t frequently test this failover, you might find that your assumptions about the capabilities of the secondary data store are incorrect.
+> The capacity of the secondary, which might have been sufficient when you last tested, may be no longer be able to tolerate the load under this scenario.
+> Our experience has shown that the only error recovery that works is the path you test frequently. This is why having a small number of recovery paths is best.
+> You can establish recovery patterns and regularly test them. If you have a complex or critical recovery path, you still need to regularly execute that failure in production to convince yourself that the recovery path works.
+> In the example we just discussed, you should fail over to the standby regularly, regardless of need.
+
+# Testing
 
 We have some integration tests in `queue/queue_test.go`.
 

diff --git a/example/test-multiqueue/.gitignore b/example/test-multiqueue/.gitignore
@@ -0,0 +1 @@
+test-multiqueue
diff --git a/example/test-multiqueue/README.md b/example/test-multiqueue/README.md
@@ -0,0 +1,87 @@
+# test-multiqueue - A command-line tool to emulate the circuit breaker state transition
+
+We need a tool to be able to observe a real-time behavior of the `multiqueue` implementation and its performance. `test-multiqueue` command allows you to:
+- run `SendMessage` and `ReceiveMessage` API under the circuit breaker.
+- inject errors under the certain error rate and duration via HTTP
+- show the total number of {unique,} messages (remember: SQS is at-least-once delivery)
+
+```
+go build && ./test-multiqueue -h
+Usage of ./test-multiqueue:
+  -concurrency int
+    	specify concurrency (default 1)
+  -count int
+    	number of messages (default 10000)
+  -drain
+    	drain
+  -queue1 string
+    	specify SQS queue name 1
+  -queue2 string
+    	specify SQS queue name 2
+  -region1 string
+    	specify a region for queue1 (default "ap-northeast-1")
+  -region2 string
+    	specify a region for queue2 (default "ap-southeast-1")
+```
+
+## Example
+
+Send 100,0000 messages to queues in ap-northeast-1 and ap-southeast-1 and receive the messages from the both queues:
+```sh
+./test-multiqueue \
+  -queue1 example-ap-northeast-1 \
+  -queue2 example-ap-southeast-1 \
+  -concurrency 10 \
+  -count 100000
+```
+
+While the tool is working, you can inject errors via HTTP:
+```
+# set 80% error rate for 5 minutes for queue1
+curl "http://127.0.0.1:9003/?index=0&duration=5m&error_rate=0.8" | jq -r .
+[
+  {
+    "URL": "https://sqs.ap-northeast-1.amazonaws.com/EXAMPLE/example-ap-northeast-1",
+    "Until": "2020-05-07T23:52:13.096187+09:00",
+    "ErrRate": 0.8
+  },
+  {
+    "URL": "https://sqs.ap-southeast-1.amazonaws.com/EXAMPLE/example-ap-southeast-1",
+    "Until": "0001-01-01T00:00:00Z",
+    "ErrRate": 0
+  }
+]
+```
+
+# Test scenario
+
+```sh
+run_mq_test() {
+  # start
+  echo "$(date) Start"
+  sleep 120
+
+  echo "$(date) 80% error rate for queue 1 for 3m"
+  curl -sSL "http://127.0.0.1:9003/?index=0&duration=3m&error_rate=0.8" | jq -r .
+  sleep 180
+
+  # wait for 60 seconds to converge
+  sleep 60
+
+  echo "$(date) 80% error rate for queue 2 for 3m"
+  curl -sSL "http://127.0.0.1:9003/?index=1&duration=3m&error_rate=0.8" | jq -r .
+  sleep 180
+
+  # wait for 30 seconds
+  sleep 30
+
+  echo "$(date) 80% error rate for queue 1 and 2 for 3m and 4m"
+  curl -sSL "http://127.0.0.1:9003/?index=0&duration=3m&error_rate=0.8" | jq -r .
+  curl -sSL "http://127.0.0.1:9003/?index=1&duration=4m&error_rate=0.8" | jq -r .
+  sleep 120
+}
+
+run_mq_test
+```
+
+Happy testing.