
Engine identity validation #902

Merged
merged 13 commits into from
Aug 4, 2023

Conversation

ejweber
Contributor

@ejweber ejweber commented May 11, 2023

longhorn/longhorn#5845

This PR provides the base mechanisms to allow for engine identity verification during normal Longhorn operation (assuming the higher level components like longhorn-instance-manager and longhorn-manager make use of them).

@ejweber ejweber changed the title 5845 engine identity Engine identity validation May 11, 2023
@ejweber ejweber force-pushed the 5845-engine-identity branch 3 times, most recently from 55a33f1 to ccef6d6 Compare May 26, 2023 17:07
@ejweber ejweber force-pushed the 5845-engine-identity branch 6 times, most recently from 01bf625 to fac97c8 Compare June 26, 2023 14:58
@ejweber ejweber marked this pull request as ready for review June 26, 2023 17:50
@ejweber ejweber requested a review from a team as a code owner June 26, 2023 17:50
@ejweber
Contributor Author

ejweber commented Jun 26, 2023

This PR is ready for review.

I ran the Longhorn manager core tests on it with the expectation that all tests would pass (all changes should be "opt in"). There were four failures related to the fact that there is no test image like longhornio/longhorn-test:upgrade-test... with the new API version these changes specify:

  • tests.test_backing_image.test_engine_live_upgrade_with_backing_image
  • tests.test_engine_upgrade.test_engine_offline_upgrade
  • tests.test_engine_upgrade.test_engine_live_upgrade
  • tests.test_engine_upgrade.test_engine_live_upgrade_with_intensive_data_writing

It looks like the code to generate a new image lives here, but it is a GitHub bot or some automation that normally executes it. Is this a process I should kick off somehow?

@PhanLe1010
Contributor

PhanLe1010 commented Jun 26, 2023

It looks like the test-engine image is built by this Terraform script. Do you know how to trigger that script, @yangchiu @chriscchien?

@yangchiu
Member

I triggered the pipeline from https://ci.longhorn.io/job/private/job/build-engine-test-images/, but it said the latest engine test images have already been published and didn't do anything.

@chriscchien Can you help to elaborate more on how to set this right? Thanks!

shuo-wu
shuo-wu previously approved these changes Jun 27, 2023
Contributor

@shuo-wu shuo-wu left a comment


In general LGTM

Comment on lines 119 to 122
// The --volume-name flag for the replica command (added before the identity validation feature) and the
// --volume-name global flag are redundant. Since both are optional, proceed with either one, but do not proceed if
// they are different.
volumeName, err := resolveVolumeName(c.String("volume-name"), c.GlobalString("volume-name"))
Contributor

Can you ignore the --volume-name flag in the replica command and rely on the global option only?
Same below. Then there is no need to add resolveVolumeName().

Contributor Author

I was hesitant to do this for backwards compatibility reasons, though I will admit I'm not sure I have a perfect handle on exactly what those reasons are.

Older versions of longhorn-manager will want to pass the replica-command-specific --volume-name flag. New versions of longhorn-manager will be aware of the global --volume-name flag. The resolveVolumeName function ensures that we can correctly respond to both.
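The behavior described here can be sketched as follows. This is a minimal illustration of the stated rule (both flags optional, either one usable, mismatches rejected), not the actual implementation from the PR:

```go
package main

import "fmt"

// resolveVolumeName accepts a volume name from either the replica-command-specific
// flag or the global flag. Both are optional, but if both are set they must agree.
func resolveVolumeName(commandFlag, globalFlag string) (string, error) {
	if commandFlag != "" && globalFlag != "" && commandFlag != globalFlag {
		return "", fmt.Errorf("command flag %q and global flag %q do not match", commandFlag, globalFlag)
	}
	if commandFlag != "" {
		return commandFlag, nil
	}
	return globalFlag, nil
}

func main() {
	name, err := resolveVolumeName("vol-1", "")
	fmt.Println(name, err)
}
```

An old longhorn-manager setting only the command-specific flag and a new one setting only the global flag both resolve cleanly; only contradictory input fails.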

  1. Is there any scenario in which an older longhorn-manager will talk to an updated longhorn-engine? (I think the answer is no, since longhorn-manager always upgrades first and then longhorn-engine is upgraded asynchronously, but this was my concern.)
  2. Should I remove the replica-command-specific --volume-name flag entirely?

I have not made any changes in response to this comment yet.

Contributor

I don't think we need to consider older-version longhorn-manager pods. Unlike the instance manager or engine image, all old Longhorn manager pods will be evicted (the daemonset rolling upgrade percentage is 100%) before the first new longhorn manager pod launches. And the new instance manager launch and the volume upgrade happen after the new longhorn manager is ready.
As long as the new longhorn managers are aware of using different flags (the global flag for new engines, the command-specific flag for old engines) for different versions of engines, it should work.
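That per-engine-version flag selection might be sketched like this. The CLI API version that introduces the global flag (9 below, one above the 8 reported for master-head later in this thread) is an assumption, not taken from the PR:

```go
package main

import "fmt"

// replicaArgs picks the flag placement a new longhorn-manager would use for a
// given engine version. The cutoff version 9 is an assumed value for when the
// global --volume-name flag becomes available.
func replicaArgs(engineCLIAPIVersion int, volumeName string) []string {
	if engineCLIAPIVersion >= 9 {
		// New engines: global --volume-name flag precedes the subcommand.
		return []string{"--volume-name", volumeName, "replica"}
	}
	// Old engines: replica-command-specific --volume-name flag.
	return []string{"replica", "--volume-name", volumeName}
}

func main() {
	fmt.Println(replicaArgs(9, "vol-1"))
	fmt.Println(replicaArgs(8, "vol-1"))
}
```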

Contributor Author

Makes sense. I removed the function and ignored the command-specific flag for my new use case. Once I have a longhorn-manager PR merged that only uses the global flag for the new engine, I will come back and remove the command-specific flag entirely (per the code comments).

proto/ptypes/interceptor.go (two outdated review threads, resolved)
@ejweber ejweber force-pushed the 5845-engine-identity branch 2 times, most recently from 4188f8a to 666a256 Compare June 28, 2023 21:34
@ejweber
Contributor Author

ejweber commented Jun 29, 2023

I triggered the pipeline from https://ci.longhorn.io/job/private/job/build-engine-test-images/, but it said the latest engine test images have already been published and didn't do anything.

@chriscchien Can you help to elaborate more on how to set this right? Thanks!

It looks like that automation only works against the https://github.com/longhorn/longhorn-engine repo and accepts an optional commit ID (defaulting to the tip of master). Since these changes aren't in that repo yet, it seems we cannot run the automation until after merge.

util.GetAddresses(volumeName, address, types.DataServerProtocol(dataServerProtocol))
// TODO (5845): Remove replica command-specific volume-name flag when its use is completely replaced in
// longhorn-manager.
util.GetAddresses(c.String("volume-name"), address, types.DataServerProtocol(dataServerProtocol))
Contributor

Why will you use c.String("volume-name") rather than variable volumeName here?

Contributor Author

My intention is for this to be temporary. Until the longhorn-manager changes eventually merge, longhorn-manager on master-head will not be aware of the global flag and will continue to provide the replica-command-specific volume-name flag. Removing it now might cause annoying test failures, etc. I am thinking we can merge this longhorn-engine PR, allowing me to merge the longhorn-instance-manager and longhorn-manager PRs. Then I can come back and completely remove the command-specific flag per the code comments.

Contributor

For this kind of change that involves multiple repos and compatibility issues, we will merge all PRs at the same time. There is no need to worry about a temporary mismatch between the engine PR merge and the manager PR merge. So please directly use the global flags in this PR.

Contributor Author

I currently import the longhorn-engine commits from my own branch in my longhorn-instance-manager and longhorn-manager PRs. This allows me to test those changes, since the commits aren't in the longhorn/longhorn-engine repo yet. My intention was to fix these imports after the longhorn-engine changes merged and before the longhorn-instance-manager and longhorn-manager changes were approved. Please let me know what the best practice here is.

Contributor

@PhanLe1010 PhanLe1010 Jul 5, 2023

Agree with @shuo-wu about using the global flags in this PR directly.

In general, since longhorn-engine is imported by longhorn-manager and longhorn-instance-manager, I think we can use this flow:

  1. Update the longhorn-engine PR as if longhorn-manager and longhorn-instance-manager are already updated
  2. In the longhorn-manager and longhorn-instance-manager PRs, use the replace directive in go.mod to point to the in-progress engine PR to test the in-progress changes
  3. When the longhorn-engine, longhorn-manager, and longhorn-instance-manager PRs are all approved and ready to merge:
    1. The longhorn-engine PR is merged first
    2. In the longhorn-manager and longhorn-instance-manager PRs, edit go.mod to point to the correct longhorn-engine commits
    3. Merge the longhorn-manager and longhorn-instance-manager PRs
    4. This merging process should happen within a few hours, so there is no need to worry about a temporary mismatch
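Step 2 might look like this in the longhorn-manager PR's go.mod. The fork path and pseudo-version below are illustrative placeholders, not the actual values from these PRs:

```go
// go.mod (fragment) in the in-progress longhorn-manager PR branch.
// The fork path and pseudo-version are illustrative placeholders.
replace github.com/longhorn/longhorn-engine => github.com/ejweber/longhorn-engine v0.0.0-20230628000000-666a256aeca5
```

In step 3.2, the replace directive is dropped (or repointed) and the require line is updated to the merged longhorn-engine commit.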

@chriscchien

chriscchien commented Jul 3, 2023

I triggered the pipeline from https://ci.longhorn.io/job/private/job/build-engine-test-images/, but it said the latest engine test images have already been published and didn't do anything.
@chriscchien Can you help to elaborate more on how to set this right? Thanks!

It looks like that automation only works against the https://github.com/longhorn/longhorn-engine repo and accepts an optional commit ID (defaulting to the tip of master). Since these changes aren't in that repo yet, it seems we cannot run the automation until after merge.

Yes, the engine upgrade test images are built by the pipeline based on the latest longhorn-engine information:

$ sudo docker run longhornio/longhorn-engine:master-head longhorn version --client-only
{
	"clientVersion": {
		"version": "d1d6d308",
		"gitCommit": "d1d6d308d37a5b465bc1c2824da2e0702830de7d",
		"buildDate": "2023-06-27T15:51:32+00:00",
		"cliAPIVersion": 8,
		"cliAPIMinVersion": 3,
		"controllerAPIVersion": 4,
		"controllerAPIMinVersion": 3,
		"dataFormatVersion": 1,
		"dataFormatMinVersion": 1
	},
	"serverVersion": null
}

And it creates the below test images for the e2e cases to use, via this script:

  • longhornio/longhorn-test:upgrade-test.8-3.4-3.1-1
  • longhornio/longhorn-test:version-test.7-2.4-3.1-1
  • longhornio/longhorn-test:version-test.9-4.4-3.1-1
  • longhornio/longhorn-test:version-test.2-2.4-3.1-1
  • longhornio/longhorn-test:version-test.8-3.4-3.1-1
  • longhornio/longhorn-test:version-test.9-9.4-3.1-1

Our e2e tests can specify a custom test engine image. If that engine has different API versions than master-head, the related engine upgrade test cases cannot pass because the upgrade test images do not exist. Thank you.
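Comparing the version output above with the image list, the upgrade-test tag appears to encode the engine's version fields as pairs of version and min version. This derivation is inferred from the examples in this thread, not taken from the pipeline code:

```go
package main

import "fmt"

// upgradeTestTag builds the upgrade-test image tag from the engine version
// fields, pairing each API version with its minimum supported version.
// Inferred from the example output above (cliAPIVersion 8, cliAPIMinVersion 3,
// controllerAPIVersion 4, controllerAPIMinVersion 3, dataFormatVersion 1,
// dataFormatMinVersion 1 -> upgrade-test.8-3.4-3.1-1).
func upgradeTestTag(cli, cliMin, controller, controllerMin, data, dataMin int) string {
	return fmt.Sprintf("upgrade-test.%d-%d.%d-%d.%d-%d",
		cli, cliMin, controller, controllerMin, data, dataMin)
}

func main() {
	// Matches longhornio/longhorn-test:upgrade-test.8-3.4-3.1-1 above.
	fmt.Println(upgradeTestTag(8, 3, 4, 3, 1, 1))
}
```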

@ejweber
Contributor Author

ejweber commented Jul 3, 2023

I have never observed it on my local machine, but the new integration tests executed by drone have failed a couple of times because the engine and/or replica server are not yet listening on the expected socket. I added a commit that takes advantage of existing functions with built-in retries.

Contributor

@PhanLe1010 PhanLe1010 left a comment

I drew this graph to assist myself while reviewing this PR. I thought it might be useful to share here as well. Please correct me if you see anything wrong.

[diagram: longhorn_grpc_communication]

From the graph, there are some points that I notice for this PR:

  1. I think the SyncAgentGRPCServer should also have this identity verification, similar to the ControllerGRPCServer and the ReplicaGRPCServer, because the sync agent also does tasks related to the volume's snapshots and data. It would be dangerous for the sync agent to perform an operation on the wrong data.
  2. The SyncAgentServer should also include the volume name and instance name in the gRPC metadata when it talks to a ReplicaServer, because we don't want to talk to the wrong ReplicaServer. For example, operations like this, this, this, this, this, and this

@ejweber
Contributor Author

ejweber commented Jul 5, 2023

  1. I think the SyncAgentGRPCServer should also have this identity verification, similar to the ControllerGRPCServer and the ReplicaGRPCServer, because the sync agent also does tasks related to the volume's snapshots and data. It would be dangerous for the sync agent to perform an operation on the wrong data.

Good catch. We already initialize the SyncAgentServiceClient with the client-side interceptor here, but I missed the server-side interceptor. Fixed in the latest commit.

The SyncAgentServer should also include the volume name and instance name in the gRPC metadata when it talks to a ReplicaServer, because we don't want to talk to the wrong ReplicaServer. For example, operations like this, this, this, this, this, and this

Another good catch. I missed these dials. Fixed in latest commit.

I made these changes but am adding integration tests to confirm them. I didn't get this done before wrapping up today.

  • Add integration tests that use sync agent server/client.
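The identity check that these interceptors enforce can be sketched as a plain function over incoming metadata. The metadata key name and the "absent metadata is allowed" behavior are assumptions based on the PR description (the changes are described as opt-in); the real code lives in proto/ptypes/interceptor.go:

```go
package main

import "fmt"

// validateIdentity is a sketch of the server-side check an identity-validation
// interceptor might perform against incoming gRPC metadata. The "volume-name"
// key and the backward-compatible handling of absent metadata are assumptions.
func validateIdentity(md map[string][]string, serverVolumeName string) error {
	v, ok := md["volume-name"]
	if !ok || len(v) == 0 || v[0] == "" {
		// No identity sent (e.g. an old client): allow, since the feature is opt-in.
		return nil
	}
	if v[0] != serverVolumeName {
		return fmt.Errorf("incorrect volume name %s; expected %s", v[0], serverVolumeName)
	}
	return nil
}

func main() {
	md := map[string][]string{"volume-name": {"vol-A"}}
	fmt.Println(validateIdentity(md, "vol-A"))
	fmt.Println(validateIdentity(md, "vol-B"))
}
```

A server whose identity does not match rejects the call before the handler runs, which is what surfaces as the "invalid gRPC error" mentioned later in this thread when volume names disagree.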

if err == nil {
	err = finishErr
} else {
	err = errors.Wrap(err, finishErr.Error())
}
Contributor Author

While developing an integration test, I discovered a problem with the way we were handling errors in this defer block. Even if an error occurred during the operation, as long as FinishRebuild returned successfully, we would return a nil error. This led to the error below and a useless error message at the client.

ERROR: 2018/08/12 13:43:07 grpc: server failed to encode response:  rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil
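The fix can be sketched with Go's named-return-plus-defer pattern. Function names are illustrative, and fmt.Errorf with %w stands in for errors.Wrap:

```go
package main

import (
	"errors"
	"fmt"
)

// finishRebuild stands in for the real cleanup call; here it succeeds.
func finishRebuild() error { return nil }

// doOperation demonstrates the corrected defer block: a successful
// FinishRebuild must no longer clobber an error from the main operation.
func doOperation() (err error) {
	defer func() {
		finishErr := finishRebuild()
		if finishErr == nil {
			return // Keep the operation's err; the bug was returning nil here.
		}
		if err == nil {
			err = finishErr
		} else {
			err = fmt.Errorf("%v: %w", finishErr, err)
		}
	}()
	return errors.New("operation failed") // Simulate a failure in the operation.
}

func main() {
	fmt.Println(doOperation())
}
```

Because err is a named return value, the deferred function can inspect and combine both errors after the body returns, so neither failure is silently dropped.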

@ejweber
Contributor Author

ejweber commented Jul 25, 2023

There is a new commit that fixes cloning, which is not covered by the core tests. My code mistakenly assumed that the volume name we cloned from and the one we cloned to would be the same. This caused us to try to communicate with a "from" replica using the "to" volume name (triggering an invalid gRPC error). Please re-review when you are able.

cc @PhanLe1010 @shuo-wu @james-munson

Cloning tests passed.

Contributor

@shuo-wu shuo-wu left a comment

In general LGTM. The comments are about some nitpicky issues. The main noticeable one is: should we cache and output replicaName for the replica list of the Controller?

if err != nil {
return err
}

toDisks, _, err := GetReplicaDisksAndHead(address)
toDisks, _, err := GetReplicaDisksAndHead(address, c.Name, instanceName)
Contributor

Can we add the instanceName in types.Replica? It seems that I had mentioned this in another PR.
BTW, it's better to rename c.Name to c.VolumeName to avoid confusion.

Contributor Author

@ejweber ejweber Aug 1, 2023

You mentioned it in the instance-manager PR: longhorn/longhorn-instance-manager#229 (comment).

This is something @PhanLe1010 and I have discussed quite a bit, including in #902 (comment).

Adding instanceName to types.Replica would be one thing, but we would need to make quite a few changes to controller client methods (which would propagate up the Longhorn stack) in order to make it useful. Because of the added complexity, and the limited practical benefit, we have steered away from doing this. If we do proceed that direction, I suggest we do it in a separate set of PRs.

Changing to VolumeName in the latest commit. The Name field originally belongs to the Controller type and is referenced in a number of locations.

Contributor

Because of the added complexity, and the limited practical benefit, we have steered away from doing this. If we do proceed that direction, I suggest we do it in a separate set of PRs.

Sometimes the lack of the replica name makes debugging the engine logs more difficult. And now, with this info, we can make the identity validation between engines and replicas more convenient. To be honest, I prefer to have this change. WDYT? cc @innobead

Contributor

But YES, if we want to address this, it's better to have a separate ticket for it. The compatibility handling may take some effort.

Contributor Author

@PhanLe1010 mentioned this too. It seems a valid reason to proceed.

I played a bit with adding it again yesterday afternoon. It's going to be quite awkward in places, since we are used to passing just replica addresses. (For example, we currently pass a comma-separated list of addresses in the CLI controller command and a list of addresses in the gRPC VolumeStart method.) Doable, of course, but more disruptive than my changes so far have been.

  1. replicas := c.StringSlice("replica")
  2. type VolumeStartRequest struct {
         ReplicaAddresses     []string `protobuf:"bytes,1,rep,name=replicaAddresses,proto3" json:"replicaAddresses,omitempty"`
         Size                 int64    `protobuf:"varint,2,opt,name=size,proto3" json:"size,omitempty"`
         CurrentSize          int64    `protobuf:"varint,3,opt,name=currentSize,proto3" json:"currentSize,omitempty"`
         XXX_NoUnkeyedLiteral struct{} `json:"-"`
         XXX_unrecognized     []byte   `json:"-"`
         XXX_sizecache        int32    `json:"-"`
     }

I'll leave it to you, @innobead, and others with more seniority on the team to decide how to prioritize it.

Contributor

YES. After adding it, we should input <replica name>=<replica address> for the CLI and a map for the gRPC VolumeStart API. It's non-trivial and complicated.
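Parsing the suggested CLI form might look like the sketch below. The name=address format is a suggestion from this review thread, not merged code, and the helper name is made up:

```go
package main

import (
	"fmt"
	"strings"
)

// parseReplica splits the proposed <replica name>=<replica address> CLI
// argument. It splits on the first '=' only, so addresses may contain one.
func parseReplica(arg string) (name, address string, err error) {
	parts := strings.SplitN(arg, "=", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("invalid replica %q; expected <name>=<address>", arg)
	}
	return parts[0], parts[1], nil
}

func main() {
	name, addr, err := parseReplica("r-abc123=tcp://10.0.0.5:10000")
	fmt.Println(name, addr, err)
}
```

On the gRPC side, the equivalent change would replace the ReplicaAddresses []string field with a name-to-address map, which is what makes the change ripple through the stack.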

Contributor

@PhanLe1010 PhanLe1010 Aug 2, 2023

Agree that we should handle it in a new improvement ticket, since this PR already satisfies its goal, which is eliminating the possibility of talking to a replica of a different volume.

pkg/sync/sync.go (outdated review thread, resolved)
PhanLe1010
PhanLe1010 previously approved these changes Jul 31, 2023
Contributor

@PhanLe1010 PhanLe1010 left a comment

LGTM

(reviewed the newly added snapshot clone commit)

Signed-off-by: Eric Weber <eric.weber@suse.com>
@PhanLe1010 PhanLe1010 merged commit 4d54af9 into longhorn:master Aug 4, 2023
5 checks passed
@ejweber
Contributor Author

ejweber commented Aug 18, 2023

@mergify backport v1.5.x v1.4.x

@mergify

mergify bot commented Aug 18, 2023

backport v1.5.x v1.4.x

❌ Command disallowed due to command restrictions in the Mergify configuration.

  • sender-permission>=write

@PhanLe1010
Contributor

@mergify backport v1.5.x v1.4.x

@mergify

mergify bot commented Aug 18, 2023

backport v1.5.x v1.4.x

✅ Backports have been created
