Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrade go-msgpack to 2.1.1 #577

Merged
merged 2 commits into from
Nov 15, 2023
Merged

Upgrade go-msgpack to 2.1.1 #577

merged 2 commits into from
Nov 15, 2023

Conversation

swenson
Copy link

@swenson swenson commented Oct 18, 2023

We add a parameter that allows the behavior of the time encoding to be changed by NetworkTransportConfig.MsgpackUseNewTimeFormat. The default value is false, which keeps the current behavior of time encoding is compatible with with go-msgpack v0.5.5 (the version in go.mod prior to this change).

However, users who upgrade to this version of Raft will no longer have their own version of pre-v2 go-msgpack honored, which could cause unexpected binary compatibility problems.

This will be a problem for HashiCorp products who might blindly upgrade:

  • Vault (go-msgpack v1.1.5)
  • Nomad (go-msgpack v1.1.5).

I don't believe this will be a problem for Consul (go-msgpack v0.5.5).

For the sake of non-HashiCorp products using this library, I'd recommend merging this and making a release with a full v2 so that people are not surprised.

I'll follow-up this PR and release with PRs to upgrade Vault, Consul, and Nomad.

I tested this branch with Vault using
NetworkTransportConfig.MsgpackUseNewTimeFormat set to true and verified it was able to replicate to Vault 1.15.0 (which uses go-msgpack v1.1.5). (I also tested that the same setting of false would fail to replicate to Vault 1.15.0 with a time decoding error, as expected).

I also added a quick benchmark in bench_test.go, and did not notice any significant performance differences before and after this change (which is not surprising, since I don't assume msgpack is a significant cost for snapshots).

@swenson swenson requested a review from a team as a code owner October 18, 2023 17:44
@swenson swenson requested review from jmurret and removed request for a team October 18, 2023 17:44
swenson pushed a commit to hashicorp/raft-boltdb that referenced this pull request Oct 19, 2023
This adds the option to specify what version of the `time.Time` encoding
format is desired. The default option is false, which preserves
compatibility with the current dependency of
`hashicorp/go-msgpack 0.5.5` in go.mod, but users of this library will
probably want to override the the setting.

While we're here, upgrade the go.mod file.

I tested that this appears to work with Nomad.

This needs to be updated and merged after
hashicorp/raft#577
swenson pushed a commit to hashicorp/nomad that referenced this pull request Oct 19, 2023
And set the `time.Time` option to use the go-msgpack-1.1.5-compatible
encoding in all the places, since that is the (now previous) version
in `go.mod`.

v2 2.1.1 was specifically designe to honor backwards compatibility
with 1.1.5 and 0.5.5, and to clean up the code base to be more
maintainable. There may performance lost with the 1.1.5 to 2.1.1
migration since the fastpath code was removed, but the increased safety
is probably worth it. See
[the release notes for go-msgkack 2.1.0](https://github.com/hashicorp/go-msgpack/releases/tag/v2.1.0)
for more details.

I tested this by running this code, and booting up a cluster with
a node also running the prior version of Nomad (before the upgrade).
Before I made the changes to set the right `time.Time` option, the
previous-version node would throw a bunch of time-decoding errors.
After fixing the option, the node came up smoothly, even after
changing leadership between them.

This relies on
- [ ] hashicorp/serf#705
- [ ] hashicorp/raft-boltdb#38
- [ ] hashicorp/raft#577
- [ ] hashicorp/memberlist#292

and maybe
- [ ] hashicorp/net-rpc-msgpackrpc#12
swenson pushed a commit to hashicorp/vault that referenced this pull request Oct 19, 2023
And set the `time.Time` option to use the go-msgpack-1.1.5-compatible
encoding in all the places, since that is the (now previous) version in
`go.mod`.

v2 2.1.1 was specifically designed to honor backwards compatibility
with 1.1.5 and 0.5.5, and to clean up the code base to be more
maintainable. There may performance lost with the 1.1.5 to 2.1.1
migration since the fastpath code was removed, but the increased safety
is probably worth it. See
[the release notes for go-msgkack 2.1.0](https://github.com/hashicorp/go-msgpack/releases/tag/v2.1.0)
for more details.

I tested this by running this code, and booting up a cluster with a node
also running Vault 1.15.0 (before the upgrade). Before I made the
changes to set the right `time.Time` option, the previous-version node
would throw a bunch of time-decoding errors. After fixing the option,
the node came up smoothly, even after changing leadership between them.

This relies on
- [ ] hashicorp/raft-boltdb#38
- [ ] hashicorp/raft#577

and will need to be updated after they are merged to get the `go.mod`
fixes removed.
Copy link
Member

@banks banks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @swenson.

I thought through whether there is anything else that will help us validate that this doesn't break things but I'm don't think there is. I had one suggestion inline that I don't expect to change anything but would give maybe a tiny bit of extra confidence.

// Do a snapshot
NoErr(WaitFuture(raft.raft.Snapshot()), b)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this benchmark used to check before and after performance of the msgpack encoding? What were the results?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Differences were basically non-existent, as I suspect the disk I/O / sync was the dominant cost.

t.Errorf("Expected time %s to encode as %+v but got %+v", stamp, expected, buf.Bytes())
}
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth having a similar test that encodes an entire Log populated with some arbitrary data? My thinking is that that struct is really the main thing msgpack is used for other than some minor RPC framing.

As far as I know the only part of the encoding that caused issues due to changes upstream was the time.Time in the AppendedAt field which this covers, but it might provided added confidence that there are no other unknown incompatibilities too?

Not essential, just a thought.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, there are some tests like that in go-msgpack already: https://github.com/hashicorp/go-msgpack/blob/main/codec/raft_test.go which links to a bunch of data I captured from running Raft with various versions of go-msgpack https://github.com/hashicorp/go-msgpack/tree/main/codec/internal/testdata

@banks
Copy link
Member

banks commented Oct 25, 2023

However, users who upgrade to this version of Raft will no longer have their own version of pre-v2 go-msgpack honored, which could cause unexpected binary compatibility problems.

🤔 It took me a while to parse this and think through the ramifications. Is this correct @swenson ?

  1. If you currently build against go-msgpack 1.1.5 then the problem is that your raft messages are currently encoded on the wire with new time format while this PR will default to old format. While the new node can accept messages from the old leader, if the new node becomes leader while non-upgraded servers are still present it will cause them to crash or error. Fix: ensure that you set the new NetworkTransportConfig.MsgpackUseNewTimeFormat to true and test mixed scenarios
  2. If you currently build against go-msgpack 0.5.5 or earlier and you use go-msgpack in your app outside of raft, then you should be OK because raft will now import the new msgpack version as v2 and leave your app importing v0.5.5. If you don't use it outside of raft you are also OK.

For the sake of non-HashiCorp products using this library, I'd recommend merging this and making a release with a full v2 so that people are not surprised.

I'm not sure who this advice would be aimed at though? Is that if you import raft but are also a library with downstream consumers?

@swenson
Copy link
Author

swenson commented Oct 25, 2023

However, users who upgrade to this version of Raft will no longer have their own version of pre-v2 go-msgpack honored, which could cause unexpected binary compatibility problems.

🤔 It took me a while to parse this and think through the ramifications. Is this correct @swenson ?

  1. If you currently build against go-msgpack 1.1.5 then the problem is that your raft messages are currently encoded on the wire with new time format while this PR will default to old format. While the new node can accept messages from the old leader, if the new node becomes leader while non-upgraded servers are still present it will cause them to crash or error. Fix: ensure that you set the new NetworkTransportConfig.MsgpackUseNewTimeFormat to true and test mixed scenarios
  2. If you currently build against go-msgpack 0.5.5 or earlier and you use go-msgpack in your app outside of raft, then you should be OK because raft will now import the new msgpack version as v2 and leave your app importing v0.5.5. If you don't use it outside of raft you are also OK.

Exactly!

For the sake of non-HashiCorp products using this library, I'd recommend merging this and making a release with a full v2 so that people are not surprised.

I'm not sure who this advice would be aimed at though? Is that if you import raft but are also a library with downstream consumers?

This isn't strictly necessary. But, I assume there might be non-HashiCorp libraries or servers using this library? And if so, the next release breaking compatibility unless you specify a certain flag, depending on what version you were using before and what version of go-msgpack you might have in your go.mod, might be surprising. So, I thought doing a v2 release might help alleviate problems.

(With HashiCorp stuff, I think we have it mostly handled.)

@swenson
Copy link
Author

swenson commented Oct 25, 2023

(If we are doing a release, we could also put the upgrade notes in the release notes)

swenson pushed a commit to hashicorp/nomad that referenced this pull request Oct 25, 2023
And set the `time.Time` option to use the go-msgpack-1.1.5-compatible
encoding in all the places, since that is the (now previous) version
in `go.mod`.

v2 2.1.1 was specifically designe to honor backwards compatibility
with 1.1.5 and 0.5.5, and to clean up the code base to be more
maintainable. There may performance lost with the 1.1.5 to 2.1.1
migration since the fastpath code was removed, but the increased safety
is probably worth it. See
[the release notes for go-msgkack 2.1.0](https://github.com/hashicorp/go-msgpack/releases/tag/v2.1.0)
for more details.

I tested this by running this code, and booting up a cluster with
a node also running the prior version of Nomad (before the upgrade).
Before I made the changes to set the right `time.Time` option, the
previous-version node would throw a bunch of time-decoding errors.
After fixing the option, the node came up smoothly, even after
changing leadership between them.

This relies on
- [ ] hashicorp/serf#705
- [ ] hashicorp/raft-boltdb#38
- [ ] hashicorp/raft#577
- [ ] hashicorp/memberlist#292

and maybe
- [ ] hashicorp/net-rpc-msgpackrpc#12
swenson pushed a commit to hashicorp/vault that referenced this pull request Oct 25, 2023
And set the `time.Time` option to use the go-msgpack-1.1.5-compatible
encoding in all the places, since that is the (now previous) version in
`go.mod`.

v2 2.1.1 was specifically designed to honor backwards compatibility
with 1.1.5 and 0.5.5, and to clean up the code base to be more
maintainable. There may performance lost with the 1.1.5 to 2.1.1
migration since the fastpath code was removed, but the increased safety
is probably worth it. See
[the release notes for go-msgkack 2.1.0](https://github.com/hashicorp/go-msgpack/releases/tag/v2.1.0)
for more details.

I tested this by running this code, and booting up a cluster with a node
also running Vault 1.15.0 (before the upgrade). Before I made the
changes to set the right `time.Time` option, the previous-version node
would throw a bunch of time-decoding errors. After fixing the option,
the node came up smoothly, even after changing leadership between them.

This relies on
- [ ] hashicorp/raft-boltdb#38
- [ ] hashicorp/raft#577

and will need to be updated after they are merged to get the `go.mod`
fixes removed.
@HouseK
Copy link

HouseK commented Oct 26, 2023

Note: This might handle #473

@swenson
Copy link
Author

swenson commented Nov 13, 2023

@banks pinging this again

@swenson
Copy link
Author

swenson commented Nov 13, 2023

(Oops, I missed that you approved this @banks. Thanks! I'll wait to merge in case there is something else you wanted to bring up.)

@banks
Copy link
Member

banks commented Nov 14, 2023

@swenson thanks. Still OK with merging this, my only pause is about how and where we document the nuances I tried to spell out above - do you have a plan for that? I guess in the raft library changelog CHANGELOG.md (although I'm not sure how consistent we've been at maintaining that...) but also maybe on the releases page when we draft that? Maybe an internal email to other teams within HC who use this? Do you think it's worth drafting that and getting feedback as part of the PR?

@swenson
Copy link
Author

swenson commented Nov 14, 2023

@banks will do. I'll draft something in a comment later today.

@swenson
Copy link
Author

swenson commented Nov 14, 2023

@banks how about:

Email

Hi developers,

If you aren't a developer who uses the https://github.com/hashicorp/raft library, you can stop reading now.

tl;dr The next Raft release (1.6.0) will switch the underlying hashicorp/go-msgpack library from v0.5.5 to v2.1.1. On its own, v2.1.1 should have the same default encoding as v0.5.5.

However, some users of the library have overridden the go-msgpack library with other versions that may be use an incompatible binary encoding.

It is possible in go-msgpack v2.1.1 and Raft 1.6.0 to specify which binary encoding is preferred at runtime.

We have PRs out for Consul, Nomad, and Vault to facilitate these upgrades and set the options appropriately:

More details are available in the PR #577 and in the release notes (TBD).

If you have further questions, feel free to reach out in #tech-raft.

Thanks,
--The Raft maintainers

Release notes for 1.6.0

Upgrade hashicorp/go-msgpack to v2, with go.mod upgraded from v0.5.5 to v2.1.1.

go-msgpack v2.1.1 is by default binary compatible with v0.5.5 ("non-builtin" encoding of time.Time), but can decode messages produced by v1.1.5 as well ("builtin" encoding of time.Time).

However, if users of this libary overrode the version of go-msgpack (especially to v1), this could break compatibility if raft nodes are running a mix of versions.

This compatibility can be configured at runtime in Raft using NetworkTransportConfig.MsgpackUseNewTimeFormat -- the default is false, which maintains compatibility with go-msgpack v0.5.5, but if set to true,
will be compatible with go-msgpack v1.1.5.

@banks
Copy link
Member

banks commented Nov 15, 2023

Those look awesome @swenson thanks. I think we can maybe duplicate the Release Notes on both the Releases page in GH and in CHANGELOG.md - we normally update that by hand it seems and on main. If you remember to do that before cutting the release so it's also in the tag then even better but I don't think we are very consistent about that!

Christopher Swenson added 2 commits November 15, 2023 09:40
We add a parameter that allows the behavior of the time encoding to be
changed by `NetworkTransportConfig.MsgpackUseNewTimeFormat`.
The default value is `false`, which keeps the current behavior of time
encoding is compatible with with `go-msgpack` v0.5.5 (the version
in `go.mod` prior to this change).

However, users who upgrade to this version of Raft will no longer
have their own version of pre-v2 `go-msgpack` honored, which could
cause unexpected binary compatibility problems.

This will be a problem for HashiCorp products who might blindly
upgrade:
* [Vault](https://github.com/hashicorp/vault/blob/f2f532ec802d356346c2302e9c38734f2a594f3f/go.mod#L98)
  (`go-msgpack` v1.1.5)
* [Nomad](https://github.com/hashicorp/nomad/blob/cb2363f2fbbb771af9c8a62ed09a60250df647a2/go.mod#L64)
  (`go-msgpack` v1.1.5).

I don't believe this will be a problem for [Consul](https://github.com/hashicorp/consul/blob/8eb074e7c17a0a2077436637c06f521b036407f5/go.mod#L199)
(`go-msgpack` v0.5.5).

For the sake of non-HashiCorp products using this library, I'd recommend
merging this and making a release with a full `v2` so that people are
not surprised.

I'll follow-up this PR and release with PRs to upgrade Vault, Consul,
and Nomad.

I tested this branch with Vault using
`NetworkTransportConfig.MsgpackUseNewTimeFormat` set to `true` and
verified it was able to replicate to Vault 1.15.0 (which uses
`go-msgpack` v1.1.5). (I also tested that the same setting of `false`
would fail to replicate to Vault 1.15.0 with a time decoding error,
as expected).

I also added a quick benchmark in `bench_test.go`, and did not notice
any significant performance differences before and after this change
(which is not surprising, since I don't assume msgpack is a significant
cost for snapshots).
@swenson
Copy link
Author

swenson commented Nov 15, 2023

Thanks!

@swenson swenson merged commit b96f998 into main Nov 15, 2023
6 checks passed
@swenson swenson deleted the vault-19781/msgpack-upgrade branch November 15, 2023 17:44
swenson pushed a commit to hashicorp/vault that referenced this pull request Nov 15, 2023
And set the `time.Time` option to use the go-msgpack-1.1.5-compatible
encoding in all the places, since that is the (now previous) version in
`go.mod`.

v2 2.1.1 was specifically designed to honor backwards compatibility
with 1.1.5 and 0.5.5, and to clean up the code base to be more
maintainable. There may performance lost with the 1.1.5 to 2.1.1
migration since the fastpath code was removed, but the increased safety
is probably worth it. See
[the release notes for go-msgkack 2.1.0](https://github.com/hashicorp/go-msgpack/releases/tag/v2.1.0)
for more details.

I tested this by running this code, and booting up a cluster with a node
also running Vault 1.15.0 (before the upgrade). Before I made the
changes to set the right `time.Time` option, the previous-version node
would throw a bunch of time-decoding errors. After fixing the option,
the node came up smoothly, even after changing leadership between them.

This relies on
- [ ] hashicorp/raft-boltdb#38
- [ ] hashicorp/raft#577

and will need to be updated after they are merged to get the `go.mod`
fixes removed.
swenson pushed a commit to hashicorp/vault that referenced this pull request Dec 22, 2023
And set the `time.Time` option to use the go-msgpack-1.1.5-compatible
encoding in all the places, since that is the (now previous) version in
`go.mod`.

v2 2.1.1 was specifically designed to honor backwards compatibility
with 1.1.5 and 0.5.5, and to clean up the code base to be more
maintainable. There may performance lost with the 1.1.5 to 2.1.1
migration since the fastpath code was removed, but the increased safety
is probably worth it. See
[the release notes for go-msgkack 2.1.0](https://github.com/hashicorp/go-msgpack/releases/tag/v2.1.0)
for more details.

I tested this by running this code, and booting up a cluster with a node
also running Vault 1.15.0 (before the upgrade). Before I made the
changes to set the right `time.Time` option, the previous-version node
would throw a bunch of time-decoding errors. After fixing the option,
the node came up smoothly, even after changing leadership between them.

This relies on
- [ ] hashicorp/raft-boltdb#38
- [ ] hashicorp/raft#577

and will need to be updated after they are merged to get the `go.mod`
fixes removed.
swenson pushed a commit to hashicorp/vault that referenced this pull request Jan 8, 2024
And set the `time.Time` option to use the go-msgpack-1.1.5-compatible
encoding in all the places, since that is the (now previous) version in
`go.mod`.

v2 2.1.1 was specifically designed to honor backwards compatibility
with 1.1.5 and 0.5.5, and to clean up the code base to be more
maintainable. There may performance lost with the 1.1.5 to 2.1.1
migration since the fastpath code was removed, but the increased safety
is probably worth it. See
[the release notes for go-msgkack 2.1.0](https://github.com/hashicorp/go-msgpack/releases/tag/v2.1.0)
for more details.

I tested this by running this code, and booting up a cluster with a node
also running Vault 1.15.0 (before the upgrade). Before I made the
changes to set the right `time.Time` option, the previous-version node
would throw a bunch of time-decoding errors. After fixing the option,
the node came up smoothly, even after changing leadership between them.

This relies on
- [ ] hashicorp/raft-boltdb#38
- [ ] hashicorp/raft#577

and will need to be updated after they are merged to get the `go.mod`
fixes removed.
swenson pushed a commit to hashicorp/vault that referenced this pull request Jan 8, 2024
And set the `time.Time` option to use the go-msgpack-1.1.5-compatible
encoding in all the places, since that is the (now previous) version in
`go.mod`.

v2 2.1.1 was specifically designed to honor backwards compatibility
with 1.1.5 and 0.5.5, and to clean up the code base to be more
maintainable. There may performance lost with the 1.1.5 to 2.1.1
migration since the fastpath code was removed, but the increased safety
is probably worth it. See
[the release notes for go-msgkack 2.1.0](https://github.com/hashicorp/go-msgpack/releases/tag/v2.1.0)
for more details.

I tested this by running this code, and booting up a cluster with a node
also running Vault 1.15.0 (before the upgrade). Before I made the
changes to set the right `time.Time` option, the previous-version node
would throw a bunch of time-decoding errors. After fixing the option,
the node came up smoothly, even after changing leadership between them.

This relies on
- hashicorp/raft-boltdb#38
- hashicorp/raft#577

I did a simple pair of benchmarks (one with a final sync, one without)
and ran them before and after on both my Mac (M2 Max) laptop and my
Linux (AMD Threadripper 3970X) desktop.

tl;dr There was no performance difference for this benchmark.

```
goos: darwin
goarch: arm64
pkg: github.com/hashicorp/vault/physical/raft
                   │    a.txt    │            b.txt             │
                   │   sec/op    │   sec/op     vs base         │
RaftWithNetwork-10   58.65m ± 2%   58.62m ± 2%  ~ (p=0.937 n=6)
```

```
goos: linux
goarch: amd64
pkg: github.com/hashicorp/vault/physical/raft
cpu: AMD Ryzen Threadripper 3970X 32-Core Processor
                   │    c.txt    │            d.txt             │
                   │   sec/op    │   sec/op     vs base         │
RaftWithNetwork-64   5.861m ± 1%   5.837m ± 0%  ~ (p=0.240 n=6)
```
@ncabatoff ncabatoff mentioned this pull request Feb 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants