Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle nomad exec termination events in order #10657

Merged
merged 4 commits into from May 25, 2021
Merged

Conversation

notnoop
Copy link
Contributor

@notnoop notnoop commented May 25, 2021

Ensure that websocket channels are closed gracefully and that exit codes are properly captured, to avoid the spurious "unexpected EOF" failures even when command completes successfully.

Examining the paths, there are two bugs contributing to the cause.

First, the http handler that bridge the internal streaming RPC to websocket connection had a goroutine that could returned early before emitting the websocket.CloseNormalClosure message.

Second, on the api consumer side, if we got an "unexpected EOF" error, the api may prioritize handling it before emitting the last output or exit code. In theory, we can safely ignore any error occurring after receiving the exit code message, as the exit code message is the last message.

In this PR, we fix the goroutine early termination, and ensure that all receive messages and errors are processed in order.

Fixes #9892

Mahmood Ali added 3 commits May 24, 2021 13:35
In this loop, we ought to close the websocket connection gracefully when
the StreamErrWrapper reaches EOF.

Previously, it's possible that that we drop the last few events or skip sending
the websocket closure. If `handler(handlerPipe)` returns and `cancel` is called,
before the loop here completes processing streaming events, the loop exits
prematurely without propagating the last few events.

Instead here, the loop continues until we hit `httpPipe` EOF (through
`decoder.Decode`), to ensure we process the events to completion.
refactor the api handling of `nomad exec`, and ensure that we process
all received events before handling websocket closing.

The exit code should be the last message received, and we ought to
ignore any websocket close error we receive afterwards.

Previously, we used two channels: one for websocket frames and another
for handling errors. This raised the possibility that we processed the
error before processing the frames, resulting into an "unexpected EOF"
error.
@notnoop notnoop self-assigned this May 25, 2021
Copy link
Member

@shoenig shoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; just the one question / suggestion

}

func (s *execSession) startConnection() (*websocket.Conn, error) {
nodeClient, _ := s.client.GetNodeClientWithTimeout(s.alloc.NodeID, ClientConnTimeout, s.q)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be helpful to have a comment explaining why nodeClient being nil here is acceptable - also, would it make sense to switch on the error here? E.g. we could get back NodeDownErr which seems like a case to bail early?

https://github.com/hashicorp/nomad/blob/main/api/api.go#L492

@vercel vercel bot temporarily deployed to Preview – nomad May 25, 2021 18:25 Inactive
@notnoop notnoop merged commit e694000 into main May 25, 2021
@notnoop notnoop deleted the b-alloc-exec-closing branch May 25, 2021 18:51
@tgross tgross added this to the 1.1.1 milestone Jun 3, 2021
notnoop pushed a commit that referenced this pull request Jun 9, 2021
Handle `nomad exec` termination events in order
@surajthakur
Copy link

surajthakur commented Jul 23, 2021

Hi Team,

I raise this discussion a few days ago https://discuss.hashicorp.com/t/nomad-exec-error-failed-to-exec-into-task-unexpected-eof/26539

Does this change requires access to NodeID which a non admin ACL token does not have access to? Which it will return 403 ?

namespace "default" {
  policy = "read"
  capabilities = ["submit-job","dispatch-job","read-logs","alloc-exec"]
}

Because the exec seems to be fine admin token.

Thanks

@notnoop
Copy link
Contributor Author

notnoop commented Jul 27, 2021

Hi @surajthakur ! I'm sorry I missed this comment earlier. I've investigated a related issue in #10922 . Can you try out the suggestions in #10922 (comment) and see if the change addresses the issue you are seeing. If not, I'd encourage you to file a new GitHub issue and I will follow up closer.

I just tried reproducing the issue with permissions but without much luck.

@surajthakur
Copy link

surajthakur commented Jul 28, 2021

Hey @notnoop, The above mentioned issue is be a different one.

My issue was related to ACL when performing exec. If I provided an admin ACL, then the exec was fine, as I suppose it would have easily fetched the nodeID.

Where as given the below ACL, doesn't have nodeID fetch access, so when i tried exec with the below policy nomad server would return 403. and exec returning EOF exec error

namespace "default" {
  policy = "read"
  capabilities = ["submit-job","dispatch-job","read-logs","alloc-exec"]
}

There is special case here: If I try to make exec request bypassing load balancer, using http://nc-server-1:4646 the exec request is success. But in case of https it goes through load balancer(haproxy), it returns 403.

I got solved it giving read permission for nodeID. So this is the new ACL permission

namespace "default" {
  policy = "read"
  capabilities = ["submit-job","dispatch-job","read-logs","alloc-exec"]
}
node {
  policy = "read"
}

I checked in the earlier version which nomad 1.0.4, there was an 403 in /v1/node/* but exec probably just ignored it where was in nomad 1.1.2 it is kind of enforced and gives error if nodeID is returned 403

If you want I can share logs.

if you still want I can create a github issue on this

@notnoop
Copy link
Contributor Author

notnoop commented Jul 28, 2021

I suspect there are multiple issues at play - mind if you try the binaries in the issue I linked? The bit about how exec succeeds when bypassing the load balancer makes me suspect it's the issue there and not permission. I suspect that when you are using the root token, the CLI attempts and succeeds at connecting to the node directly bypassing the ELB for actual exec call. The CLI attempts to connect directly to the node to avoid overloading servers with unnecessary traffic. When the token lacks node permissions, it will fallback to execing through the ELB and because of the http/2 issue, nomad 1.1.2 fails.

For record, here is my attempt with nomad 1.1.2:

$ nomad-1.1.2 version
Nomad v1.1.2 (60638a086ef9630e2a9ba1e237e8426192a44244)
$ nomad-1.1.2 acl token self
Accessor ID  = 276633aa-2cb5-4380-7fc3-f6218e634b52
Secret ID    = 61da6c98-eebb-cc78-fe1c-88b81ba3d6a6
Name         = mytest
Type         = client
Global       = false
Policies     = [default]
Create Time  = 2021-07-28 01:38:51.481141562 +0000 UTC
Create Index = 18
Modify Index = 18
$ nomad-1.1.2 acl policy info default
Name        = default
Description = <none>
Rules       = namespace "default" {
  policy = "read"
  capabilities = ["submit-job","dispatch-job","read-logs","alloc-exec"]
}
CreateIndex = 17
ModifyIndex = 17
$ nomad-1.1.2 alloc exec 69b5df67 uname -a
Linux 8787cc5ae398 5.4.0-1048-aws #50~18.04.1-Ubuntu SMP Tue May 4 17:40:02 UTC 2021 x86_64 GNU/Linux

On the server side, I see the node request failing followed by exec request:

2021-07-28T01:44:40.768Z [DEBUG] http: request complete: method=GET path=/v1/allocations?prefix=69b5df67 duration=350.024µs
2021-07-28T01:44:40.846Z [DEBUG] http: request complete: method=GET path=/v1/allocation/69b5df67-6c19-34b9-1df7-b088a13f7340?namespace=default duration=383.697µs
2021-07-28T01:44:40.910Z [DEBUG] http: request failed: method=GET path=/v1/node/d1bb81ee-9f5a-4a19-2d51-7f7286014c51 error="Permission denied" code=403
2021-07-28T01:44:40.910Z [DEBUG] http: request complete: method=GET path=/v1/node/d1bb81ee-9f5a-4a19-2d51-7f7286014c51 duration=235.152µs
2021-07-28T01:44:41.125Z [DEBUG] http: alloc exec channel closed with error: error=<nil>
2021-07-28T01:44:41.125Z [DEBUG] http: request complete: method=GET path=/v1/client/allocation/69b5df67-6c19-34b9-1df7-b088a13f7340/exec?command=%5B%22uname%22%2C%22-a%22%5D&task=redis&tty=true duration=154.664218ms

@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 17, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

api/exec may return "unexpected EOF"
4 participants