Nomad failing to create templating sandbox on Windows; appears to leak DACL entries on nomad.exe #20585

Open
tomqwpl opened this issue May 14, 2024 · 4 comments
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/platform-windows theme/template type/bug

Comments

@tomqwpl

tomqwpl commented May 14, 2024

Nomad version

Nomad v1.7.5
BuildDate 2024-02-13T15:10:13Z
Revision 5f5d464

Operating system and Environment details

Windows 11

Issue

Repeated errors of the form:

10:21.868+0100 [ERROR] client.alloc_runner.task_runner.task_hook.template: failed to create template manager: alloc_id=03b05e98-3048-9dd0-c4ff-b6e10ef27059 task=Main error="could not create platform sandbox: could not grant object access: could not create new DACL for \"c:\\path-to-installation-of\\nomad.exe\": The parameter is incorrect."
    2024-05-13T18:
...
    2024-05-13T18:10:21.925+0100 [INFO]  client.alloc_runner.task_runner: not restarting task: alloc_id=03b05e98-3048-9dd0-c4ff-b6e10ef27059 task=Main reason="Policy allows no restarts"
    2024-05-13T18:10:21.925+0100 [INFO]  client.alloc_runner.task_runner: Task event: alloc_id=03b05e98-3048-9dd0-c4ff-b6e10ef27059 task=Main type="Not Restarting" msg="Policy allows no restarts" failed=true

Reproduction steps

Hard to give specific reproduction steps at this time. I have one instance of Nomad running on a laptop; this isn't a "production" environment.

This occurred when I was performing some load testing on a solution that utilises Nomad. I submitted 1000 jobs in quick succession, all of which were fairly small. Many succeeded, but very many failed with this error. To me it feels like a load issue.
Note that I've deliberately configured the jobs not to allow restarts (I don't want them to be rerun just because they return a non-zero exit code).

A possible reproduction would be to submit a very large number of jobs, each of which has default resource requirements, and each of which is a "raw_exec" task doing something simple like "dir". Nomad will attempt to schedule maybe 100 at once (32GB memory, default resource requirements are 300MB).
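
As a rough check on the "maybe 100 at once" figure, here is a minimal sketch of the arithmetic. It assumes memory is the only limiting resource and ignores CPU and any memory the client reserves for itself; maxConcurrentByMemory is a hypothetical helper, not part of Nomad.

```go
package main

import "fmt"

// maxConcurrentByMemory estimates how many allocations fit in memory at once.
// Hypothetical helper for illustration; Nomad's scheduler also considers CPU
// and reserved resources, which this ignores.
func maxConcurrentByMemory(totalMB, perTaskMB int) int {
	if perTaskMB <= 0 {
		return 0
	}
	return totalMB / perTaskMB
}

func main() {
	totalMB := 32 * 1024 // 32GB, as on the laptop described above
	perTaskMB := 300     // default memory requirement quoted above
	fmt.Println(maxConcurrentByMemory(totalMB, perTaskMB))
}
```

With these numbers the estimate comes out at 109, in line with the rough figure of 100.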

Actual Result

It looks like Nomad isn't cleaning up the access control lists properly, possibly always, possibly only in error circumstances. Having run the experiment above, the only explanation I could think of was that the ACL had got too big, so I used PowerShell to run Get-Acl nomad.exe | Format-List. There were many hundreds of entries in the list. So it would appear that the "The parameter is incorrect" error above means "enough already, that DACL is way too big". Nomad appears not always to undo the ACL modifications it makes, so the ACL grows until it can grow no longer. Incidentally, I tried looking at the access control list using Windows Explorer (Properties, Security tab), and Windows Explorer crashed because the ACL was too big for it to display.

@tomqwpl
Author

tomqwpl commented May 14, 2024

On the face of it, this appears to be a fairly major bug on Windows.
I haven't yet tried reproducing this in isolation with standalone Nomad, so there's a chance that the issue is the particular way we are using Nomad. However, my testing appears to show that each time I submit just a single job, the ACL on nomad.exe has an ACE added to it that is never removed. Just one job in isolation, no further load required.

That is:

  • Get-Acl nomad.exe | Format-List: see that it has the standard expected entries it would inherit from the directory and so on.
  • Run a Nomad job.
  • Get-Acl nomad.exe | Format-List: see that the ACL has grown by an entry of the form S-1-15-2-2427714051-1004092651-189808086-1284803327-1995376635-766556020-4219361275 Allow FullControl.
  • Each time a job is run, a new entry is added.
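
To make the before/after comparison less error-prone than eyeballing a long list, the check can be sketched in Go as below. This is a minimal sketch under assumptions: it takes the text output of Get-Acl nomad.exe | Format-List as a string and counts entries of the form quoted above (an S-1-15-2-... SID followed by "Allow"). countSandboxACEs and the sample input are hypothetical and only illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

// sandboxACE matches an access entry of the form quoted in this report:
// an S-1-15-2-... SID followed by whitespace and "Allow". Requiring "Allow"
// keeps a SID that merely appears elsewhere in the text from being counted.
var sandboxACE = regexp.MustCompile(`S-1-15-2(?:-\d+)+\s+Allow\b`)

// countSandboxACEs returns how many such entries appear in aclText.
// Hypothetical helper: it only inspects text, it does not read the ACL itself.
func countSandboxACEs(aclText string) int {
	return len(sandboxACE.FindAllString(aclText, -1))
}

func main() {
	// Illustrative input only, not real output from the affected machine.
	before := "Access : BUILTIN\\Administrators Allow  FullControl\n"
	after := before +
		"         S-1-15-2-2427714051-1004092651-189808086-1284803327-1995376635-766556020-4219361275 Allow  FullControl\n"

	fmt.Println("entries before job:", countSandboxACEs(before))
	fmt.Println("entries after job: ", countSandboxACEs(after))
}
```

If the count goes up by one for every job run and never comes back down, that matches the behaviour described here.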

The implication would be that once a certain number of jobs have been run, that's it: Nomad will no longer work and will be unable to launch new jobs.

Using the disable_file_sandbox client option stops this problem from occurring.

I will experiment with vanilla nomad to verify I can create a simple reproduction scenario.

@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation May 14, 2024
@tomqwpl
Author

tomqwpl commented May 14, 2024

Reproduction:

  • Run Nomad with nomad agent --dev
  • Print the ACL for nomad.exe using PowerShell: Get-Acl nomad.exe | Format-List
  • Run the Golang program below
  • Print the ACL again and note the increase in its length

Note that the problem only occurs (obviously) if there's a Template in the nomad task.

Golang driver program:

package main

import (
	"log"

	nomadapi "github.com/hashicorp/nomad/api"
)

func stringptr(s string) *string {
	return &s
}

func intptr(i int) *int {
	return &i
}

func main() {
	client, err := nomadapi.NewClient(&nomadapi.Config{
		Address: "http://localhost:4646",
	})
	if err != nil {
		log.Fatal(err)
	}

	job := nomadapi.Job{
		ID:   stringptr("job-1"),
		Type: stringptr("batch"),
		TaskGroups: []*nomadapi.TaskGroup{
			{
				Name: stringptr("main"),
				Tasks: []*nomadapi.Task{
					{
						Name:   "Task1",
						Driver: "raw_exec",
						Config: map[string]interface{}{
							"command": `c:\windows\cmd.exe`,
							"args":    []string{"/c", "dir"},
						},
						Resources: &nomadapi.Resources{
							Cores: intptr(8),
						},
						// The template is what triggers the problem; without one the ACL does not grow.
						Templates: []*nomadapi.Template{
							{
								EmbeddedTmpl: stringptr("hello world"),
								DestPath:     stringptr("./foo.txt"),
							},
						},
					},
				},
			},
		},
	}
	_, _, err = client.Jobs().Register(&job, &nomadapi.WriteOptions{})
	if err != nil {
		log.Fatal(err)
	}

}

I suspect what is actually run isn't important, and I suspect the fact that I use raw_exec here isn't especially important either, but I haven't tried it with anything else.
I get the same result with the latest Nomad, 1.7.7.

@tgross
Member

tgross commented May 17, 2024

Hi @tomqwpl just a heads up that we're digging into this as part of working on #20034

@tgross tgross added the stage/accepted Confirmed, and intend to work on. No timeline committment though. label May 17, 2024
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage May 17, 2024
@NucaChance

Just hit this in production, so glad to see it being worked on. If more logs/examples are needed I can help with that.
