
Engine identity validation #902

Merged
merged 13 commits into from
Aug 4, 2023

Conversation

ejweber
Contributor

@ejweber ejweber commented May 11, 2023

longhorn/longhorn#5845

This PR provides the base mechanisms to allow for engine identity verification during normal Longhorn operation (assuming the higher level components like longhorn-instance-manager and longhorn-manager make use of them).

@ejweber ejweber changed the title 5845 engine identity Engine identity validation May 11, 2023
@ejweber ejweber force-pushed the 5845-engine-identity branch 3 times, most recently from 55a33f1 to ccef6d6 Compare May 26, 2023 17:07
@ejweber ejweber force-pushed the 5845-engine-identity branch 6 times, most recently from 01bf625 to fac97c8 Compare June 26, 2023 14:58
@ejweber ejweber marked this pull request as ready for review June 26, 2023 17:50
@ejweber ejweber requested a review from a team as a code owner June 26, 2023 17:50
@ejweber
Contributor Author

ejweber commented Jun 26, 2023

This PR is ready for review.

I ran the Longhorn manager core tests on it with the expectation that all tests would pass (all changes should be "opt in"). There were four failures related to the fact that there is no test image like longhornio/longhorn-test:upgrade-test... with the new API version these changes specify:

  • tests.test_backing_image.test_engine_live_upgrade_with_backing_image
  • tests.test_engine_upgrade.test_engine_offline_upgrade
  • tests.test_engine_upgrade.test_engine_live_upgrade
  • tests.test_engine_upgrade.test_engine_live_upgrade_with_intensive_data_writing

It looks like the code to generate a new image lives here, but it is a GitHub bot or some automation that normally executes it. Is this a process I should kick off somehow?

@PhanLe1010
Contributor

PhanLe1010 commented Jun 26, 2023

It looks like the test-engine image is built by this Terraform script. Do you know how to trigger that script, @yangchiu @chriscchien?

@yangchiu
Member

I triggered the pipeline from https://ci.longhorn.io/job/private/job/build-engine-test-images/, but it said the latest engine test images have already been published and didn't do anything.

@chriscchien Can you help to elaborate more on how to set this right? Thanks!

shuo-wu
shuo-wu previously approved these changes Jun 27, 2023
Contributor

@shuo-wu shuo-wu left a comment


In general LGTM

Comment on lines 119 to 122
// The --volume-name flag for the replica command (added before the identity validation feature) and the
// --volume-name global flag are redundant. Since both are optional, proceed with either one, but do not proceed if
// they are different.
volumeName, err := resolveVolumeName(c.String("volume-name"), c.GlobalString("volume-name"))
Contributor

Can you ignore the --volume-name flag in the replica command and rely on the global option only?
Same below. Then there is no need to add resolveVolumeName().

Contributor Author

I was hesitant to do this for backwards compatibility reasons, though I will admit I'm not sure I have a perfect handle on exactly what those reasons are.

Older versions of longhorn-manager will want to pass the replica-command-specific --volume-name flag. New versions of longhorn-manager will be aware of the global --volume-name flag. The resolveVolumeName function ensures that we can correctly respond to both.
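The behavior described here can be sketched as follows. This is a minimal illustration of the stated rule (both flags optional, either one usable, mismatches rejected), not the actual implementation from the PR:

```go
package main

import "fmt"

// resolveVolumeName accepts a volume name from either the replica-command-specific
// flag or the global flag. Both are optional, but if both are set they must agree.
func resolveVolumeName(commandFlag, globalFlag string) (string, error) {
	if commandFlag != "" && globalFlag != "" && commandFlag != globalFlag {
		return "", fmt.Errorf("command flag %q and global flag %q do not match", commandFlag, globalFlag)
	}
	if commandFlag != "" {
		return commandFlag, nil
	}
	return globalFlag, nil
}

func main() {
	name, err := resolveVolumeName("vol-1", "")
	fmt.Println(name, err)
}
```

An old longhorn-manager setting only the command-specific flag and a new one setting only the global flag both resolve cleanly; only contradictory input fails.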

  1. Is there any scenario in which an older longhorn-manager will talk to an updated longhorn-engine? (I think the answer is no, since longhorn-manager always upgrades first and then longhorn-engine is upgraded asynchronously, but this was my concern.)
  2. Should I remove the replica-command-specific --volume-name flag entirely?

I have not made any changes in response to this comment yet.

Contributor

I don't think we need to consider older-version longhorn-manager pods. Unlike the instance manager or engine image, all old Longhorn manager pods will be evicted (the daemonset rolling upgrade percentage is 100%) before the first new longhorn manager pod launches. And the new instance manager launch and the volume upgrade happen after the new longhorn manager is ready.
As long as the new longhorn managers are aware of using different flags (the global flag for new engines, the command-specific flag for old engines) for different versions of engines, it should work.
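That per-engine-version flag selection might be sketched like this. The CLI API version that introduces the global flag (9 below, one above the 8 reported for master-head later in this thread) is an assumption, not taken from the PR:

```go
package main

import "fmt"

// replicaArgs picks the flag placement a new longhorn-manager would use for a
// given engine version. The cutoff version 9 is an assumed value for when the
// global --volume-name flag becomes available.
func replicaArgs(engineCLIAPIVersion int, volumeName string) []string {
	if engineCLIAPIVersion >= 9 {
		// New engines: global --volume-name flag precedes the subcommand.
		return []string{"--volume-name", volumeName, "replica"}
	}
	// Old engines: replica-command-specific --volume-name flag.
	return []string{"replica", "--volume-name", volumeName}
}

func main() {
	fmt.Println(replicaArgs(9, "vol-1"))
	fmt.Println(replicaArgs(8, "vol-1"))
}
```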

Contributor Author

Makes sense. I removed the function and ignored the command-specific flag for my new use case. Once I have a longhorn-manager PR merged that only uses the global flag for the new engine, I will come back and remove the command-specific flag entirely (per the code comments).

proto/ptypes/interceptor.go (two outdated review threads, resolved)
@ejweber ejweber force-pushed the 5845-engine-identity branch 2 times, most recently from 4188f8a to 666a256 Compare June 28, 2023 21:34
@ejweber
Contributor Author

ejweber commented Jun 29, 2023

I triggered the pipeline from https://ci.longhorn.io/job/private/job/build-engine-test-images/, but it said the latest engine test images have already been published and didn't do anything.

@chriscchien Can you help to elaborate more on how to set this right? Thanks!

It looks like that automation only works against the https://github.com/longhorn/longhorn-engine repo and accepts an optional commit ID (defaulting to the tip of master). Since these changes aren't in that repo yet, it seems we cannot run the automation until after merge.

util.GetAddresses(volumeName, address, types.DataServerProtocol(dataServerProtocol))
// TODO (5845): Remove replica command-specific volume-name flag when its use is completely replaced in
// longhorn-manager.
util.GetAddresses(c.String("volume-name"), address, types.DataServerProtocol(dataServerProtocol))
Contributor

Why will you use c.String("volume-name") rather than variable volumeName here?

Contributor Author

My intention is for this to be temporary. Until the longhorn-manager changes eventually merge, longhorn-manager on master-head will not be aware of the global flag and will continue to provide the replica-command-specific volume-name flag. Removing it now might cause annoying test failures, etc. I am thinking we can merge this longhorn-engine PR, allowing me to merge the longhorn-instance-manager and longhorn-manager PRs. Then I can come back and completely remove the command-specific flag per the code comments.

Contributor

For this kind of change that involves multiple repos and compatibility issues, we will merge all PRs at the same time. There is no need to worry about a temporary mismatch between the engine PR merge and the manager PR merge. So please directly use the global flags in this PR.

Contributor Author

I currently import the longhorn-engine commits from my own branch in my longhorn-instance-manager and longhorn-manager PRs. This allows me to test those changes, since the commits aren't in the longhorn/longhorn-engine repo yet. My intention was to fix these imports after the longhorn-engine changes merged and before the longhorn-instance-manager and longhorn-manager changes were approved. Please let me know what the best practice here is.

Contributor

@PhanLe1010 PhanLe1010 Jul 5, 2023

Agree with @shuo-wu about using the global flags in this PR directly.

In general, since longhorn-engine is imported by longhorn-manager and longhorn-instance-manager, I think we can use this flow:

  1. Update the longhorn-engine PR as if longhorn-manager and longhorn-instance-manager are already updated
  2. In the longhorn-manager and longhorn-instance-manager PRs, use the replace directive in go.mod to point to the in-progress engine PR to test the in-progress changes
  3. When the longhorn-engine, longhorn-manager, and longhorn-instance-manager PRs are all approved and ready to merge:
    1. The longhorn-engine PR is merged first
    2. In the longhorn-manager and longhorn-instance-manager PRs, edit go.mod to point to the correct longhorn-engine commits
    3. Merge the longhorn-manager and longhorn-instance-manager PRs
    4. This merging process should happen within a few hours, so there is no need to worry about a temporary mismatch
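Step 2 might look like this in the longhorn-manager PR's go.mod. The fork path and pseudo-version below are illustrative placeholders, not the actual values from these PRs:

```go
// go.mod (fragment) in the in-progress longhorn-manager PR branch.
// The fork path and pseudo-version are illustrative placeholders.
replace github.com/longhorn/longhorn-engine => github.com/ejweber/longhorn-engine v0.0.0-20230628000000-666a256aeca5
```

In step 3.2, the replace directive is dropped (or repointed) and the require line is updated to the merged longhorn-engine commit.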

@chriscchien

chriscchien commented Jul 3, 2023

I triggered the pipeline from https://ci.longhorn.io/job/private/job/build-engine-test-images/, but it said the latest engine test images have already been published and didn't do anything.
@chriscchien Can you help to elaborate more on how to set this right? Thanks!

It looks like that automation only works against the https://github.com/longhorn/longhorn-engine repo and accepts an optional commit ID (defaulting to the tip of master). Since these changes aren't in that repo yet, it seems we cannot run the automation until after merge.

Yes, the engine upgrade test images are built by the pipeline based on the latest longhorn-engine information:

$ sudo docker run longhornio/longhorn-engine:master-head longhorn version --client-only
{
	"clientVersion": {
		"version": "d1d6d308",
		"gitCommit": "d1d6d308d37a5b465bc1c2824da2e0702830de7d",
		"buildDate": "2023-06-27T15:51:32+00:00",
		"cliAPIVersion": 8,
		"cliAPIMinVersion": 3,
		"controllerAPIVersion": 4,
		"controllerAPIMinVersion": 3,
		"dataFormatVersion": 1,
		"dataFormatMinVersion": 1
	},
	"serverVersion": null
}

And it creates the below test images for the e2e cases to use, via this script:

  • longhornio/longhorn-test:upgrade-test.8-3.4-3.1-1
  • longhornio/longhorn-test:version-test.7-2.4-3.1-1
  • longhornio/longhorn-test:version-test.9-4.4-3.1-1
  • longhornio/longhorn-test:version-test.2-2.4-3.1-1
  • longhornio/longhorn-test:version-test.8-3.4-3.1-1
  • longhornio/longhorn-test:version-test.9-9.4-3.1-1

Our e2e tests can specify a custom test engine image. If that engine has different API versions than master-head, the related engine upgrade test cases cannot pass because the upgrade test images do not exist. Thank you.
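Comparing the version output above with the image list, the upgrade-test tag appears to encode the engine's version fields as pairs of version and min version. This derivation is inferred from the examples in this thread, not taken from the pipeline code:

```go
package main

import "fmt"

// upgradeTestTag builds the upgrade-test image tag from the engine version
// fields, pairing each API version with its minimum supported version.
// Inferred from the example output above (cliAPIVersion 8, cliAPIMinVersion 3,
// controllerAPIVersion 4, controllerAPIMinVersion 3, dataFormatVersion 1,
// dataFormatMinVersion 1 -> upgrade-test.8-3.4-3.1-1).
func upgradeTestTag(cli, cliMin, controller, controllerMin, data, dataMin int) string {
	return fmt.Sprintf("upgrade-test.%d-%d.%d-%d.%d-%d",
		cli, cliMin, controller, controllerMin, data, dataMin)
}

func main() {
	// Matches longhornio/longhorn-test:upgrade-test.8-3.4-3.1-1 above.
	fmt.Println(upgradeTestTag(8, 3, 4, 3, 1, 1))
}
```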

@ejweber
Contributor Author

ejweber commented Jul 3, 2023

I have never observed it on my local machine, but the new integration tests executed by drone have failed a couple of times because the engine and/or replica server are not yet listening on the expected socket. I added a commit that takes advantage of existing functions with built-in retries.

Contributor

@PhanLe1010 PhanLe1010 left a comment

I drew this graph to assist myself while reviewing this PR. I thought it might be useful to share here as well. Please correct me if you see anything wrong.

[diagram: longhorn_grpc_communication]

From the graph, there are some points that I notice for this PR:

  1. I think the SyncAgentGRPCServer should also have this identity verification, similar to the ControllerGRPCServer and the ReplicaGRPCServer, because the sync agent also does tasks related to the volume's snapshots and data. It would be dangerous for the sync agent to perform an operation on the wrong data.
  2. The SyncAgentServer should also include the volume name and instance name in the gRPC metadata when it talks to a ReplicaServer, because we don't want to talk to the wrong ReplicaServer. For example, operations like this, this, this, this, this, and this

@ejweber
Contributor Author

ejweber commented Jul 5, 2023

  1. I think the SyncAgentGRPCServer should also have this identity verification, similar to the ControllerGRPCServer and the ReplicaGRPCServer, because the sync agent also does tasks related to the volume's snapshots and data. It would be dangerous for the sync agent to perform an operation on the wrong data.

Good catch. We already initialize the SyncAgentServiceClient with the client-side interceptor here, but I missed the server-side interceptor. Fixed in the latest commit.

The SyncAgentServer should also include the volume name and instance name in the gRPC metadata when it talks to a ReplicaServer, because we don't want to talk to the wrong ReplicaServer. For example, operations like this, this, this, this, this, and this

Another good catch. I missed these dials. Fixed in latest commit.

I made these changes but am adding integration tests to confirm them. I didn't get this done before wrapping up today.

  • Add integration tests that use sync agent server/client.
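The identity check that these interceptors enforce can be sketched as a plain function over incoming metadata. The metadata key name and the "absent metadata is allowed" behavior are assumptions based on the PR description (the changes are described as opt-in); the real code lives in proto/ptypes/interceptor.go:

```go
package main

import "fmt"

// validateIdentity is a sketch of the server-side check an identity-validation
// interceptor might perform against incoming gRPC metadata. The "volume-name"
// key and the backward-compatible handling of absent metadata are assumptions.
func validateIdentity(md map[string][]string, serverVolumeName string) error {
	v, ok := md["volume-name"]
	if !ok || len(v) == 0 || v[0] == "" {
		// No identity sent (e.g. an old client): allow, since the feature is opt-in.
		return nil
	}
	if v[0] != serverVolumeName {
		return fmt.Errorf("incorrect volume name %s; expected %s", v[0], serverVolumeName)
	}
	return nil
}

func main() {
	md := map[string][]string{"volume-name": {"vol-A"}}
	fmt.Println(validateIdentity(md, "vol-A"))
	fmt.Println(validateIdentity(md, "vol-B"))
}
```

A server whose identity does not match rejects the call before the handler runs, which is what surfaces as the "invalid gRPC error" mentioned later in this thread when volume names disagree.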

if err == nil {
	err = finishErr
} else {
	err = errors.Wrap(err, finishErr.Error())
}
Contributor Author

While developing an integration test, I discovered a problem with the way we were handling errors in this defer block. Even if an error occurred during the operation, as long as FinishRebuild returned successfully, we would return a nil error. This led to the error below and a useless error message at the client.

ERROR: 2018/08/12 13:43:07 grpc: server failed to encode response:  rpc error: code = Internal desc = grpc: error while marshaling: proto: Marshal called with nil
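The fix can be sketched with Go's named-return-plus-defer pattern. Function names are illustrative, and fmt.Errorf with %w stands in for errors.Wrap:

```go
package main

import (
	"errors"
	"fmt"
)

// finishRebuild stands in for the real cleanup call; here it succeeds.
func finishRebuild() error { return nil }

// doOperation demonstrates the corrected defer block: a successful
// FinishRebuild must no longer clobber an error from the main operation.
func doOperation() (err error) {
	defer func() {
		finishErr := finishRebuild()
		if finishErr == nil {
			return // Keep the operation's err; the bug was returning nil here.
		}
		if err == nil {
			err = finishErr
		} else {
			err = fmt.Errorf("%v: %w", finishErr, err)
		}
	}()
	return errors.New("operation failed") // Simulate a failure in the operation.
}

func main() {
	fmt.Println(doOperation())
}
```

Because err is a named return value, the deferred function can inspect and combine both errors after the body returns, so neither failure is silently dropped.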

@ejweber
Contributor Author

ejweber commented Jul 25, 2023

There is a new commit that fixes cloning, which is not covered by the core tests. My code mistakenly assumed that the volume name we cloned from and the one we cloned to would be the same. This caused us to try to communicate with a "from" replica using the "to" volume name (triggering an invalid gRPC error). Please re-review when you are able.

cc @PhanLe1010 @shuo-wu @james-munson

Cloning tests passed.

Contributor

@shuo-wu shuo-wu left a comment

In general LGTM. The comments are about some nitpicky issues. The main noticeable one is: should we cache and output replicaName for the replica list of the Controller?

if err != nil {
return err
}

toDisks, _, err := GetReplicaDisksAndHead(address)
toDisks, _, err := GetReplicaDisksAndHead(address, c.Name, instanceName)
Contributor

Can we add the instanceName in types.Replica? It seems that I had mentioned this in another PR.
BTW, it's better to rename c.Name to c.VolumeName to avoid confusion.

Contributor Author

@ejweber ejweber Aug 1, 2023

You mentioned it in the instance-manager PR: longhorn/longhorn-instance-manager#229 (comment).

This is something @PhanLe1010 and I have discussed quite a bit, including in #902 (comment).

Adding instanceName to types.Replica would be one thing, but we would need to make quite a few changes to controller client methods (which would propagate up the Longhorn stack) in order to make it useful. Because of the added complexity, and the limited practical benefit, we have steered away from doing this. If we do proceed that direction, I suggest we do it in a separate set of PRs.

Changing to VolumeName in the latest commit. The Name field originally belongs to the Controller type and is referenced in a number of locations.

Contributor

Because of the added complexity, and the limited practical benefit, we have steered away from doing this. If we do proceed that direction, I suggest we do it in a separate set of PRs.

Sometimes the lack of the replica name makes debugging the engine logs more difficult. And now, with this info, we can make the identity validation between engines and replicas more convenient. To be honest, I prefer to have this change. WDYT? cc @innobead

Contributor

But YES, if we want to address this, it's better to have a separate ticket for it. The compatibility handling may take some effort.

Contributor Author

@PhanLe1010 mentioned this too. It seems a valid reason to proceed.

I played a bit with adding it again yesterday afternoon. It's going to be quite awkward in places, since we are used to passing just replica addresses. (For example, we currently pass a comma-separated list of addresses in the CLI controller command and a list of addresses in the gRPC VolumeStart method.) Doable, of course, but more disruptive than my changes so far have been.

  1. replicas := c.StringSlice("replica")
  2. type VolumeStartRequest struct {
         ReplicaAddresses     []string `protobuf:"bytes,1,rep,name=replicaAddresses,proto3" json:"replicaAddresses,omitempty"`
         Size                 int64    `protobuf:"varint,2,opt,name=size,proto3" json:"size,omitempty"`
         CurrentSize          int64    `protobuf:"varint,3,opt,name=currentSize,proto3" json:"currentSize,omitempty"`
         XXX_NoUnkeyedLiteral struct{} `json:"-"`
         XXX_unrecognized     []byte   `json:"-"`
         XXX_sizecache        int32    `json:"-"`
     }

I'll leave it to you, @innobead, and others with more seniority on the team to decide how to prioritize it.

Contributor

YES. After adding it, we should input <replica name>=<replica address> for the CLI and a map for the gRPC VolumeStart API. It's non-trivial and complicated.
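Parsing the suggested CLI form might look like the sketch below. The name=address format is a suggestion from this review thread, not merged code, and the helper name is made up:

```go
package main

import (
	"fmt"
	"strings"
)

// parseReplica splits the proposed <replica name>=<replica address> CLI
// argument. It splits on the first '=' only, so addresses may contain one.
func parseReplica(arg string) (name, address string, err error) {
	parts := strings.SplitN(arg, "=", 2)
	if len(parts) != 2 || parts[0] == "" || parts[1] == "" {
		return "", "", fmt.Errorf("invalid replica %q; expected <name>=<address>", arg)
	}
	return parts[0], parts[1], nil
}

func main() {
	name, addr, err := parseReplica("r-abc123=tcp://10.0.0.5:10000")
	fmt.Println(name, addr, err)
}
```

On the gRPC side, the equivalent change would replace the ReplicaAddresses []string field with a name-to-address map, which is what makes the change ripple through the stack.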

Contributor

@PhanLe1010 PhanLe1010 Aug 2, 2023

Agree that we should handle it in a new improvement ticket, since this PR already satisfies its goal, which is eliminating the possibility of talking to a replica of a different volume.

pkg/sync/sync.go (outdated review thread, resolved)
PhanLe1010
PhanLe1010 previously approved these changes Jul 31, 2023
Contributor

@PhanLe1010 PhanLe1010 left a comment

LGTM

(reviewed the newly added snapshot clone commit)

Signed-off-by: Eric Weber <eric.weber@suse.com>
@PhanLe1010 PhanLe1010 merged commit 4d54af9 into longhorn:master Aug 4, 2023
5 checks passed
@ejweber
Contributor Author

ejweber commented Aug 18, 2023

@mergify backport v1.5.x v1.4.x

@mergify

mergify bot commented Aug 18, 2023

backport v1.5.x v1.4.x

❌ Command disallowed due to command restrictions in the Mergify configuration.

  • sender-permission>=write

@PhanLe1010
Contributor

@mergify backport v1.5.x v1.4.x

@mergify

mergify bot commented Aug 18, 2023

backport v1.5.x v1.4.x

✅ Backports have been created
