Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node driver could not register on Windows in a little possibility #143

Closed
andyzhangx opened this issue Apr 30, 2021 · 13 comments
Closed

Node driver could not register on Windows in a little possibility #143

andyzhangx opened this issue Apr 30, 2021 · 13 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/windows Categorizes an issue or PR as relevant to SIG Windows.

Comments

@andyzhangx
Copy link
Member

andyzhangx commented Apr 30, 2021

On Windows, we found that this node-driver-registrar hang on below statement forever in a little possibility, and then after restart registrar manually, it could register correctly, question: is there any way to add timeout and retry in below statement?

// Registers kubelet plugin watcher api.
registerapi.RegisterRegistrationServer(grpcServer, registrar)

  • node-driver-registrar hang
I0429 19:12:33.856777    5500 main.go:113] Version: v2.1.0-0-g80d42f24
E0429 19:12:35.055779    5500 processstarttime.go:43] Could not get process start time, could not read /proc: CreateFile /proc: The system cannot find the file specified.
E0429 19:12:35.055779    5500 processstarttime.go:43] Could not get process start time, could not read /proc: CreateFile /proc: The system cannot find the file specified.
I0429 19:12:35.055779    5500 main.go:137] Attempting to open a gRPC connection with: "unix://C:\\\\csi\\\\csi.sock"
I0429 19:12:35.055779    5500 connection.go:153] Connecting to unix://C:\\csi\\csi.sock
I0429 19:12:44.501633    5500 main.go:144] Calling CSI driver to discover driver name
I0429 19:12:44.511914    5500 main.go:154] CSI driver name: "disk.csi.azure.com"
I0429 19:12:44.512939    5500 node_register.go:52] Starting Registration Server at: /registration/disk.csi.azure.com-reg.sock
I0429 19:12:44.523926    5500 node_register.go:61] Registration Server started at: /registration/disk.csi.azure.com-reg.sock
I0429 19:12:44.524941    5500 node_register.go:83] Skipping healthz server because HTTP endpoint is set to: ""
  • After restart driver(run node-driver-registrar), it works
0429 19:12:34.188716    5480 main.go:113] Version: v2.1.0-0-g80d42f24
E0429 19:12:35.839739    5480 processstarttime.go:43] Could not get process start time, could not read /proc: CreateFile /proc: The system cannot find the file specified.
E0429 19:12:35.840735    5480 processstarttime.go:43] Could not get process start time, could not read /proc: CreateFile /proc: The system cannot find the file specified.
I0429 19:12:35.840735    5480 main.go:137] Attempting to open a gRPC connection with: "unix://C:\\\\csi\\\\csi.sock"
I0429 19:12:35.840735    5480 connection.go:153] Connecting to unix://C:\\csi\\csi.sock
I0429 19:12:45.261075    5480 main.go:144] Calling CSI driver to discover driver name
I0429 19:12:45.264589    5480 main.go:154] CSI driver name: "disk.csi.azure.com"
I0429 19:12:45.264589    5480 node_register.go:52] Starting Registration Server at: /registration/disk.csi.azure.com-reg.sock
I0429 19:12:45.266595    5480 node_register.go:61] Registration Server started at: /registration/disk.csi.azure.com-reg.sock
I0429 19:12:45.268093    5480 node_register.go:83] Skipping healthz server because HTTP endpoint is set to: ""
I0429 19:12:45.616080    5480 main.go:80] Received GetInfo call: &InfoRequest{}
I0429 19:12:47.292740    5480 main.go:90] Received NotifyRegistrationStatus call: &RegistrationStatus{PluginRegistered:true,Error:,}

/kind bug
/sig windows

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. sig/windows Categorizes an issue or PR as relevant to SIG Windows. labels Apr 30, 2021
@andyzhangx andyzhangx changed the title Node driver could not register on Windows Node driver could not register on Windows in a little possibility Apr 30, 2021
@jingxu97
Copy link
Contributor

cc @mauriciopoppe

@mauriciopoppe
Copy link
Member

mauriciopoppe commented Jun 17, 2021

Following up my comments on #149, after I read how the kubelet plugin registration works, after the gRPC server is started we should get a notification from the kubelet through the callback

func (e registrationServer) GetInfo(ctx context.Context, req *registerapi.InfoRequest) (*registerapi.PluginInfo, error) {

If we don't get the callback then as a workaround we can force a container restart, I tried a similar approach in kubernetes-csi/external-snapshotter#504, the idea is to crash the container if we don't get to a valid state after some seconds to force a restart, we could use race timer, in pseudocode it'd be something like:

// Registered is set to true once the driver receives an ack from the kubelet
var Registered = false

func (e registrationServer) GetInfo(ctx context.Context, req *registerapi.InfoRequest) (*registerapi.PluginInfo, error) {  
	klog.Infof("Received GetInfo call: %+v", req)
    Registered = true
  	return &registerapi.PluginInfo{
		Type:              registerapi.CSIPlugin,
		Name:              e.driverName,
		Endpoint:          e.endpoint,
		SupportedVersions: e.version,
	}, nil
}

// ...

func nodeRegister(csiDriverName, httpEndpoint string) {
	// Registers kubelet plugin watcher api.
	registerapi.RegisterRegistrationServer(grpcServer, registrar)

    go func() {
       time.Sleep(time.Second) // or a timer with exp backoff
       if !Registered {
         os.Exit(1)
       }
    }()

	go healthzServer(socketPath, httpEndpoint)
	go removeRegSocket(csiDriverName)
	// Starts service
	if err := grpcServer.Serve(lis); err != nil {
		klog.Errorf("Registration Server stopped serving: %v", err)
		os.Exit(1)
	}
	// If gRPC server is gracefully shutdown, exit
	os.Exit(0)
}

In the external-snapshotter PR I also made changes to the manifests to make sure that a pod is only considered after some time, also this approach worked in the snapshotter because it's a Deployment and it's the only container.

Another idea is to use the http endpoint already exposed and a readiness probe, that endpoint would check the value of Registered to see if the pod is alive, if not then it'd get restarted

These ideas are workarounds though, the fact that sometimes we get a callback and sometimes we don't is something that we should ask in sig-windows

cc @jingxu97 @lizhuqi

@mauriciopoppe
Copy link
Member

cc @lizhuqi

@lizhuqi
Copy link

lizhuqi commented Jun 19, 2021

+1 to ask in sig-windows. I will ping sig-windows to take a look in the slack.

@andyzhangx
Copy link
Member Author

I would vote for force restart if registrar does not get response from kubelet.

@aramase
Copy link
Contributor

aramase commented Jul 16, 2021

We just upgraded our e2e tests to run against 1.21 windows for secrets-store-csi-driver and most of the e2e runs are failing with this error. There is no InfoRequest from kubelet only for windows.

https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_secrets-store-csi-driver/638/pull-secrets-store-csi-driver-e2e-windows/1416135481476780032/artifacts/2021-07-16T211148/node-driver-registrar.log

@mauriciopoppe
Copy link
Member

sig-windows asked to check the output of sc query afunix, if you see this error again please try to run this command in your nodes.

@aramase
Copy link
Contributor

aramase commented Aug 4, 2021

@msau42 Thank you for merging #152. When is the next node-driver-registrar planned for?

@msau42
Copy link
Collaborator

msau42 commented Aug 4, 2021

When is the next node-driver-registrar planned for?

Within the next two weeks.

@andyzhangx
Copy link
Member Author

@andyzhangx
Copy link
Member Author

I have tried with --mode=kubelet-registration-probe with v2.3.0, it would fail on both Linux and Windows:

F0814 00:55:19.223354       1 main.go:157] Kubelet plugin registration hasn't succeeded yet, file=/var/lib/kubelet/plugins/file.csi.azure.com/registration doesn't exist.
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc00000e001, 0xc0000ca600, 0xaa, 0xfe)
        /mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:1026 +0xb9
k8s.io/klog/v2.(*loggingT).output(0xf9d840, 0xc000000003, 0x0, 0x0, 0xc00016d030, 0x0, 0xcd3ca3, 0x7, 0x9d, 0x0)
        /mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:975 +0x1e5
k8s.io/klog/v2.(*loggingT).printf(0xf9d840, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0xb32465, 0x48, 0xc0002626e0, 0x1, ...)
        /mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:753 +0x19a
k8s.io/klog/v2.Fatalf(...)
        /mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:1514
main.main()
        /mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/cmd/csi-node-driver-registrar/main.go:157 +0xfc6

goroutine 7 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0xf9d840)
        /mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:1169 +0x8b
created by k8s.io/klog/v2.init.0
        /mnt/vss/_work/1/go/src/github.com/kubernetes-csi/node-driver-registrar/vendor/k8s.io/klog/v2/klog.go:420 +0xdf

@andyzhangx andyzhangx reopened this Aug 14, 2021
@mauriciopoppe
Copy link
Member

@andyzhangx
Copy link
Member Author

@andyzhangx I wrote how to use it here: https://github.com/kubernetes-csi/node-driver-registrar/blob/master/README.md#health-check-with-an-exec-probe, what's your setup?

@mauriciopoppe thanks, it works well with that config!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/windows Categorizes an issue or PR as relevant to SIG Windows.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants