feat: configurable NATS connection retries #86

Merged: 9 commits, Oct 31, 2023


TylerGillson
Contributor

@TylerGillson TylerGillson commented Oct 17, 2023

Add configuration options & logic for NATS connection retries. Fixes #87.

In our use case, the NATS worker may occasionally boot before the NATS leader, so these options would be very useful!

Signed-off-by: Tyler Gillson <tyler.gillson@gmail.com>
@maxpert
Owner

maxpert commented Oct 20, 2023

Thanks for the build, but I anticipate that there should be a way to configure this in the NATS configuration file, so people can do it that way rather than adding bloat to the marmot configuration. Can you check whether that is possible via the NATS configuration file?

@TylerGillson
Contributor Author

> Thanks for the build, but I anticipate that there should be a way to configure this in the NATS configuration file, so people can do it that way rather than adding bloat to the marmot configuration. Can you check whether that is possible via the NATS configuration file?

@maxpert My interpretation of the NATS client library is that the reconnection options are only applicable after an initial connection is established.

And I don't see anything promising in the server options either. I will nevertheless toy around with the reconnection options and let you know.

@TylerGillson
Contributor Author

TylerGillson commented Oct 20, 2023

@maxpert your anticipation was correct 😁

However, I still don't see a client configuration file, or a parser for one, in the NATS library.

What do you think about the changes now (heavily influenced by this example)? Only one new configuration option.

Or WDYT about just enabling reconnections by default with some sane total wait time?

@maxpert
Owner

maxpert commented Oct 20, 2023

You might be right. Let me explore myself too.

@TylerGillson
Contributor Author

> You might be right. Let me explore myself too.

Ok - I did validate this code though, FWIW. My earlier comment was uninformed. The client retries work fine on the initial connection with these changes.

cfg/config.go (outdated, resolved)
stream/nats.go (outdated, resolved)
stream/nats.go (outdated, resolved)
Owner

@maxpert maxpert left a comment


Change the time units; I will have to give it another pass once that's done. I think this PR makes sense.

Signed-off-by: Tyler Gillson <tyler.gillson@gmail.com>
Owner

@maxpert maxpert left a comment


A couple more things to fine-tune.

cfg/config.go (outdated, resolved)
stream/nats.go (outdated, resolved)
stream/nats.go (outdated, resolved)
stream/nats.go (outdated, resolved)
Signed-off-by: Tyler Gillson <tyler.gillson@gmail.com>
@TylerGillson
Contributor Author

TylerGillson commented Oct 27, 2023

@maxpert apologies for the mixup but I was testing with the wrong binary 🤦🏼‍♂️ Had to make a few more changes.

If the initial connection fails, nats.Connect() returns a *nats.Conn whose private conn field is nil, along with a nil error.

So I've reverted this PR back to the original approach where I handle the retry logic on the marmot side. I'd appreciate any insight or opinions you have from your own testing.
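
As a rough illustration of what retry logic on the marmot side could look like, here is a minimal sketch using only the standard library. The names `connectWithRetries` and `flakyDial` are hypothetical and not taken from this PR; `dial` stands in for whatever call actually establishes the NATS connection, and `retries` is assumed to mean extra attempts after the first one.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// connectWithRetries is a hypothetical helper: it calls dial once and then up
// to `retries` more times, sleeping `wait` after each failed attempt except
// the last. It returns nil on the first success, otherwise the last error
// wrapped with the number of attempts made.
func connectWithRetries(dial func() error, retries int, wait time.Duration) error {
	if retries < 0 {
		retries = 0 // always make at least one attempt
	}
	var err error
	for attempt := 0; attempt <= retries; attempt++ {
		if err = dial(); err == nil {
			return nil
		}
		if attempt < retries {
			time.Sleep(wait)
		}
	}
	return fmt.Errorf("connect failed after %d attempts: %w", retries+1, err)
}

// flakyDial returns a dial function that fails its first `failures` calls and
// counts every call in *calls. It exists only to exercise the helper without
// a real server.
func flakyDial(failures int, calls *int) func() error {
	return func() error {
		*calls++
		if *calls <= failures {
			return errors.New("server not ready")
		}
		return nil
	}
}
```

For example, `connectWithRetries(flakyDial(2, &calls), 5, time.Second)` would succeed on the third call, which matches the case where the worker boots shortly before the leader.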

Seeing the following behaviour without the marmot-side retries:

[root@localhost marmot]# cat marmot.toml
# Path to target SQLite database
seq_map_path="/etc/kubernetes/marmot-sm.cbor"
db_path="state.db"

[nats]
# address of the nats leader
urls=[
  "nats://fakefakefakenotanats:4222"
]
# Number of retries if establishing a NATS connection
connect_retries=30

# Console STDOUT configurations
[logging]
# Configure console logging
verbose=true
# "console" | "json"
format="console"

[root@localhost marmot]# ./marmot -config marmot.toml
11:07AM DBG Opening database node_id=15460159781391886746 path=state.db
11:07AM DBG Forcing WAL checkpoint node_id=15460159781391886746
11:08AM PNC Unable to initialize snapshot storage error="dial tcp: lookup fakefakefakenotanats: Try again" node_id=15460159781391886746
panic: Unable to initialize snapshot storage

goroutine 1 [running]:
github.com/rs/zerolog/log.Panic.(*Logger).Panic.func1({0x107f170?, 0x0?})
        /root/.gvm/pkgsets/go1.21.3/global/pkg/mod/github.com/rs/zerolog@v1.29.1/log.go:376 +0x27
github.com/rs/zerolog.(*Event).msg(0xc0000a4480, {0x107f170, 0x25})
        /root/.gvm/pkgsets/go1.21.3/global/pkg/mod/github.com/rs/zerolog@v1.29.1/event.go:156 +0x2c2
github.com/rs/zerolog.(*Event).Msg(...)
        /root/.gvm/pkgsets/go1.21.3/global/pkg/mod/github.com/rs/zerolog@v1.29.1/event.go:108
main.main()
        /root/marmot/marmot.go:66 +0x70a
[root@localhost marmot]#

This happens because we aren't checking the status of the nats.Conn returned by nats.Connect().
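
One way to guard against a nil error paired with an unusable connection is to check the connection state explicitly before proceeding. The sketch below is an assumption about how that could be done, not the code from this PR: `connStatus`, `verifyConnected`, `errNotConnected`, and `fakeConn` are hypothetical names, and the interface mirrors the `IsConnected() bool` method that `*nats.Conn` provides so the example stays free of external dependencies.

```go
package main

import "errors"

// connStatus is a hypothetical interface covering the single method this
// sketch needs. *nats.Conn has an IsConnected method with this shape.
type connStatus interface {
	IsConnected() bool
}

// errNotConnected is a hypothetical sentinel for "no error, but no link".
var errNotConnected = errors.New("nats: connection not established")

// verifyConnected treats "no error but not connected" as a failure, so the
// caller can retry instead of using a half-initialized connection.
func verifyConnected(c connStatus, err error) error {
	if err != nil {
		return err
	}
	if c == nil || !c.IsConnected() {
		return errNotConnected
	}
	return nil
}

// fakeConn is a stand-in used only to demonstrate the check.
type fakeConn struct{ up bool }

func (f fakeConn) IsConnected() bool { return f.up }
```

A caller would wrap the connect call, for instance `verifyConnected(conn, err)`, and feed a non-nil result into its retry loop rather than handing the connection to later setup steps such as snapshot storage.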

Signed-off-by: Tyler Gillson <tyler.gillson@gmail.com>
@TylerGillson
Contributor Author

@maxpert - it is working as expected now:
image

@maxpert
Owner

maxpert commented Oct 27, 2023

Perfect, I will run a couple of tests and try it in my sandbox this coming week.

Owner

@maxpert maxpert left a comment


Tested; there are a few minor nits I want ironed out.

stream/nats.go Outdated Show resolved Hide resolved
stream/nats.go Outdated Show resolved Hide resolved
Signed-off-by: Tyler Gillson <tyler.gillson@gmail.com>
@TylerGillson
Contributor Author

@maxpert I added ReconnectWaitSeconds per your request, but haven't exposed timeout because I was trying to avoid configuration bloat. Currently the default timeout of 2s is set for the dialer for each connection attempt.

Would you like me to expose that option too?
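
To get a feel for how the retry count, the wait between attempts, and the per-attempt dial timeout combine, here is a small back-of-the-envelope sketch. The `retrySettings` struct and `worstCaseWait` function are hypothetical, and the sketch assumes the retry count means additional attempts after the first and that every attempt uses the full dial timeout; the actual marmot implementation may count differently.

```go
package main

import "time"

// retrySettings is a hypothetical grouping of the two options discussed in
// this thread; the field names are illustrative.
type retrySettings struct {
	ConnectRetries       int // extra attempts after the first one (assumed)
	ReconnectWaitSeconds int // pause between attempts, in seconds
}

// worstCaseWait estimates how long startup could block if every attempt times
// out: each attempt costs dialTimeout, and a pause follows every attempt
// except the last.
func worstCaseWait(s retrySettings, dialTimeout time.Duration) time.Duration {
	attempts := s.ConnectRetries + 1
	if attempts < 1 {
		attempts = 1
	}
	pause := time.Duration(s.ReconnectWaitSeconds) * time.Second
	return time.Duration(attempts)*dialTimeout + time.Duration(attempts-1)*pause
}
```

Under these assumptions, 30 retries with an example wait of 5 seconds and the 2s dial timeout gives 31 × 2s + 30 × 5s = 212 seconds before startup gives up.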

@maxpert maxpert merged commit 1b32206 into maxpert:master Oct 31, 2023
10 checks passed
@TylerGillson TylerGillson deleted the feat-nats-connect-retries branch October 31, 2023 17:31
Successfully merging this pull request may close these issues.

No support for NATS connection retries
2 participants