Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multiqueue package to support multiple region deployment (part 2) #19

Merged
merged 32 commits into from Mar 6, 2021
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
32 commits
Select commit Hold shift + click to select a range
47d6f2c
options: Fix #8 Add WaitTimeSeconds
nabeken May 4, 2020
2d9cc6c
multiqueue: initial import
nabeken May 4, 2020
5d1d106
test-multiqueue: initial import
nabeken May 4, 2020
e2a73e6
update the import path to v3
nabeken May 4, 2020
2dccb06
Update README
nabeken May 4, 2020
a2377dc
update readme
nabeken May 4, 2020
5a84b40
update the design note
nabeken May 4, 2020
6eec36d
update the desing note
nabeken May 4, 2020
bab0f5e
wording
nabeken May 4, 2020
b9a7337
multiqueue: fix deadlock when there is no available queue
nabeken May 6, 2020
3e9f33d
test-multiqueue: support more complex test scenario
nabeken May 6, 2020
5f75c73
multiqueue: Add GetExecutors() to return all of the registered queues
nabeken May 6, 2020
7cfc1a6
test-multiqueue: run receiver for all queues in parallel
nabeken May 6, 2020
3260976
Update README
nabeken May 6, 2020
167752f
test-multiqueue: count in total and unique total
nabeken May 6, 2020
2ccdfd1
remove v3 suffix since we do not change the API at all
nabeken May 7, 2020
adc69ea
multiqueue: add more tests
nabeken May 7, 2020
a020e05
test-multiqueue: add README how to use test-multiqueue
nabeken May 7, 2020
127b2a1
README: refer to Well-Architected Framework
nabeken Nov 21, 2020
35845cb
Add a blank line
nabeken Nov 21, 2020
b9c7455
README: cleanup
nabeken Nov 23, 2020
c2d9970
README: tweak
nabeken Nov 23, 2020
f0f2ab7
update import path again
nabeken Nov 23, 2020
0d0dd3c
update go-circuitbreaker
nabeken Nov 23, 2020
45039a8
mq: use go-circuitbreaker's OnStateChange hook
nabeken Nov 23, 2020
b346387
README: Use pkg.go.dev
nabeken Nov 23, 2020
d6db8d1
README: fix typo
nabeken Nov 23, 2020
588d1ca
README: Add a badge for Github Actions
nabeken Nov 23, 2020
25d63b0
test-multiqueue: remove the old state hook
nabeken Nov 23, 2020
5715853
test-multiqueue: add the test scenario
nabeken Nov 23, 2020
739e02c
Add -sSL to the curl
nabeken Nov 23, 2020
9ab920d
test-multiqueue: update the scenario
nabeken Nov 23, 2020
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
90 changes: 85 additions & 5 deletions README.md
@@ -1,13 +1,14 @@
# aws-go-sqs

[![GoDoc](http://img.shields.io/badge/godoc-reference-blue.svg)](http://godoc.org/github.com/nabeken/aws-go-sqs/queue)
![Go](https://github.com/nabeken/aws-go-sqs/workflows/Go/badge.svg)
[![PkgGoDev](https://pkg.go.dev/badge/github.com/nabeken/aws-go-sqs/v3)](https://pkg.go.dev/github.com/nabeken/aws-go-sqs/v3)
[![MIT License](http://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

`aws-go-sqs` is a SQS wrapper library for [aws/aws-sdk-go](https://github.com/aws/aws-sdk-go).

# Usage

`v3` and later requires Go Modules support to import.
`v3` and later require Go Modules support to import this package.
```go
import "github.com/nabeken/aws-go-sqs/v3/queue"
```
Expand All @@ -17,11 +18,90 @@ From v3 train (and `master` branch), we no longer use `gopkg.in`.
- We have [v1 branch](https://github.com/nabeken/aws-go-sqs/tree/v1) so you can import it from `gopkg.in/nabeken/aws-go-sqs.v1`.
- We have [v2 branch](https://github.com/nabeken/aws-go-sqs/tree/v2) so you can import it from `gopkg.in/nabeken/aws-go-sqs.v2`.

## Example
# Multi-queue implementation

See examples in [GoDoc](http://godoc.org/github.com/nabeken/aws-go-sqs/queue#pkg-examples).
v3 has [multiqueue](multiqueue/) package to address multi-queue (region) deployment of SQS. SQS is a crucial messaging component but it's still possible to become unavailable for several hours. We experienced that incident at [2020-04-20](https://status.aws.amazon.com/rss/sqs-ap-northeast-1.rss).

## Testing
Since SQS is just a message bus between components, deploying SQS to the multiple regions for availability works even the system isn't fully deployed to the multiple regions.

## How `multiqueue` works

`multiqueue.Dispatcher` maintains:
- a slice of all of the `*queue.Queue`
- a slice of "available" `*queue.Queue`
- circuit breakers for each `*queue.Queue`

When the circuit state transitions to open, the dispatcher will remove the associated queue from the "available" slice so that it won't be used. The circuit breaker will update the state to half-opened then the dispatcher will push it back to "available" slice. If there is no further error, the circuit will be closed.

When there is no queue in the "available" slice, the dispatcher returns a queue from all of the `*queue.Queue` by random.

Example code:
```go
// Create SQS instance
s := sqs.New(session.Must(session.NewSession()))

// Create Queue instances
q1 := queue.MustNew(s, "example-queue1")
q2 := queue.MustNew(s, "example-queue2")

// https://godoc.org/github.com/mercari/go-circuitbreaker#Options
cbOpts := &circuitbreaker.Options{
Interval: 1 * time.Minute,
ShouldTrip: circuitbreaker.NewTripFuncFailureRate(100, 0.7),
}

d := multiqueue.New(cbOpts, q1, q2).
WithOnStateChange(func(q *queue.Queue, from, to circuitbreaker.State) {
log.Printf("%s: state has been changed from %s to %s", *q.URL, from, to)
})

// exec will be q1 or q2
exec := d.Dispatch()

// calling the existing method in *queue.Queue via Do involves the circuit breaker
_, err := exec.Do(ctx, func() (interface{}, error) {
return exec.SendMessage("MESSAGE BODY)
})
...
```

You can find [example/test-multiqueue](example/test-multiqueue) for the full example.

## Consideration for ReceiveMessage via `multiqueue`

There is a case which causes the performance issue when there are no messages in one of the registered queues.

Let's say you have two queues: Queue A has 1000 messages and Queue B has no messages.

`Dispatcher.Dispatch()` will return one of the registered queues. The problem is when the B is returned. SQS will block until new message arrives for 20 seconds in default (WaitReceiveTime). You spend 20 seconds even you have 1000 messages in the A. If, unfortunately, the dispatcher returns B again, you will spend additional 20 seconds.

To avoid this situation, you should poll all the registered queueus in parallel. `Dispatcher.GetExecutors()` will return all of the registered queue wrapped with `Executor` so that you still can call the API over the circuit breaker.

## Design note

When it comes to multi-region deployment, you may think about *primary* and *secondary* and use secondary when the primary becomes unavailable. Such failover codepath won't be called until the primary becomes unavailable so you have a rare chance to get it tested in production. You may agree that such code may contain a bug or the secondary queue configuration may not be up-to-date since mostly the secondary queue is not used at all. Any hidden problems would be triggered when you're on fire.

Let's use all of the available queues all the time and stop sending requests to an unavailable queue until it recovers. You can have the confidence that your system always works with all of the available queues.

AWS recommends this patterns in Well-Architected Framework.
> https://wa.aws.amazon.com/wat.question.REL_11.en.html says:
>
> Use static stability to prevent bimodal behavior: Bimodal behavior is when your workload exhibits different behavior under normal and failure modes, for example, relying on launching new instances if an Availability Zone fails. You should instead build workloads that are statically stable and operate in only one mode. In this case, provision enough instances in each Availability Zone to handle the workload load if one AZ were removed and then use Elastic Load Balancing or Amazon Route 53 health checks to shift load away from the impaired instances.
>
> https://d1.awsstatic.com/whitepapers/architecture/AWS-Reliability-Pillar.pdf says:
>
> **Test disaster recovery implementation to validate the implementation**: Regularly test failover to DR to
ensure that RTO and RPO are met.
>
> A pattern to avoid is developing recovery paths that are rarely executed. For example, you might have a secondary data store that is used for read-only queries.
> When you write to a data store and the primary fails, you might want to fail over to the secondary data store.
> If you don’t frequently test this failover, you might find that your assumptions about the capabilities of the secondary data store are incorrect.
> The capacity of the secondary, which might have been sufficient when you last tested, may be no longer be able to tolerate the load under this scenario.
> Our experience has shown that the only error recovery that works is the path you test frequently. This is why having a small number of recovery paths is best.
> You can establish recovery patterns and regularly test them. If you have a complex or critical recovery path, you still need to regularly execute that failure in production to convince yourself that the recovery path works.
> In the example we just discussed, you should fail over to the standby regularly, regardless of need.

# Testing

We have some integration tests in `queue/queue_test.go`.

Expand Down
1 change: 1 addition & 0 deletions example/test-multiqueue/.gitignore
@@ -0,0 +1 @@
test-multiqueue
87 changes: 87 additions & 0 deletions example/test-multiqueue/README.md
@@ -0,0 +1,87 @@
# test-multiqueue - A command-line tool to emulate the circuit breaker state transition

We need a tool to be able to observe a real-time behavior of the `multiqueue` implementation and its performance. `test-multiqueue` command allows you to:
- run `SendMessage` and `ReceiveMessage` API under the circuit breaker.
- inject errors under the certain error rate and duration via HTTP
- show the total number of {unique,} messages (remember: SQS is at-least-once delivery)

```
go build && ./test-multiqueue -h
Usage of ./test-multiqueue:
-concurrency int
specify concurrency (default 1)
-count int
number of messages (default 10000)
-drain
drain
-queue1 string
specify SQS queue name 1
-queue2 string
specify SQS queue name 2
-region1 string
specify a region for queue1 (default "ap-northeast-1")
-region2 string
specify a region for queue2 (default "ap-southeast-1")
```

## Example

Send 100,0000 messages to queues in ap-northeast-1 and ap-southeast-1 and receive the messages from the both queues:
```sh
./test-multiqueue \
-queue1 example-ap-northeast-1 \
-queue2 example-ap-southeast-1 \
-concurrency 10 \
-count 100000
```

While the tool is working, you can inject errors via HTTP:
```
# set 80% error rate for 5 minutes for queue1
curl "http://127.0.0.1:9003/?index=0&duration=5m&error_rate=0.8" | jq -r .
[
{
"URL": "https://sqs.ap-northeast-1.amazonaws.com/EXAMPLE/example-ap-northeast-1",
"Until": "2020-05-07T23:52:13.096187+09:00",
"ErrRate": 0.8
},
{
"URL": "https://sqs.ap-southeast-1.amazonaws.com/EXAMPLE/example-ap-southeast-1",
"Until": "0001-01-01T00:00:00Z",
"ErrRate": 0
}
]
```

# Test scenario

```sh
run_mq_test() {
# start
echo "$(date) Start"
sleep 120

echo "$(date) 80% error rate for queue 1 for 3m"
curl -sSL "http://127.0.0.1:9003/?index=0&duration=3m&error_rate=0.8" | jq -r .
sleep 180

# wait for 60 seconds to converge
sleep 60

echo "$(date) 80% error rate for queue 2 for 3m"
curl -sSL "http://127.0.0.1:9003/?index=1&duration=3m&error_rate=0.8" | jq -r .
sleep 180

# wait for 30 seconds
sleep 30

echo "$(date) 80% error rate for queue 1 and 2 for 3m and 4m"
curl -sSL "http://127.0.0.1:9003/?index=0&duration=3m&error_rate=0.8" | jq -r .
curl -sSL "http://127.0.0.1:9003/?index=1&duration=4m&error_rate=0.8" | jq -r .
sleep 120
}

run_mq_test
```

Happy testing.