Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add NPD endpoints: /debug/pprof, /healthz, /conditions. #83

Merged
merged 1 commit into from
Feb 4, 2017

Conversation

Random-Liu
Copy link
Member

@Random-Liu Random-Liu commented Jan 25, 2017

For #76.

This PR:

  • Added /healthz for health checking. Currently it always returns 200.
  • Added /conditions to list all internal conditions inside the NPD. This is mainly used for debugging.
  • Added /debug/pprof pprof endpoint. This will be used for following performance benchmark and optimization.

Note that to get more debug information, we removed the -w ldflag. The binary size only increased from 40MB+ to 60MB+, which should be fine.


This change is Reviewable

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 25, 2017
@Random-Liu
Copy link
Member Author

Random-Liu commented Jan 25, 2017

@andyxning I need the pprof endpoint to work on the performance stuff, so send the PR first. Hope you haven't started working on that.

@Random-Liu Random-Liu mentioned this pull request Jan 25, 2017
10 tasks
@andyxning
Copy link
Member

@Random-Liu Yes, i have not spare some time to do this. Since you have done this, i will make a review on this. :)

@Random-Liu
Copy link
Member Author

@andyxning Thanks. I know it's Spring Festival soon. Enjoy yourself! :)

@Random-Liu
Copy link
Member Author

@andyxning Add you as the reviewer. Thanks a lot!

@andyxning
Copy link
Member

Review status: 0 of 6 files reviewed at latest revision, 7 unresolved discussions, some commit checks failed.


Makefile, line 34 at r2 (raw file):

./bin/node-problem-detector: $(PKG_SOURCES)
	GOOS=linux go build -o bin/node-problem-detector \
	     -ldflags '-extldflags "-static" -X $(PKG)/pkg/version.version=$(VERSION)' \

SGTM


cmd/node_problem_detector.go, line 44 at r2 (raw file):

	printVersion            = flag.Bool("version", false, "Print version information and quit")
	hostnameOverride        = flag.String("hostname-override", "", "Custom node name used to override hostname")
	serverPort              = flag.Int("server-port", 10254, "The port to bind the node problem detector server. Use 0 to disable.")
  • maybe port and address command line argument are more briefness, as currently we do not have any other http service.
  • I have grep -r 10254 . in kubernetes, heapster and contrib and find that gcr.io/google_containers/nginx-ingress-controller has used that port. Maybe we should choose another one. BTW, does kubernetes has some standards one port usage.
  • IIUC, 0 is mainly used to select a port randomly. So, maybe we need to make another command line argument, like http-disabled, to stop npd from starting the http service and the default value for this command line option is false. This seems to be clearer.

cmd/node_problem_detector.go, line 45 at r2 (raw file):

	hostnameOverride        = flag.String("hostname-override", "", "Custom node name used to override hostname")
	serverPort              = flag.Int("server-port", 10254, "The port to bind the node problem detector server. Use 0 to disable.")
	serverAddress           = flag.String("server-address", "127.0.0.1", "The address to bind the node problem detector server.")

Any special considerations for using 127.0.0.1 instead of 0.0.0.0.


cmd/node_problem_detector.go, line 96 at r2 (raw file):

		err := http.ListenAndServe(net.JoinHostPort(*serverAddress, strconv.Itoa(*serverPort)), nil)
		if err != nil {
			glog.Errorf("Failed to start server: %v", err)
  • glog.Fatalf makes failure fail faster instead of just logging. This error is lost of functionality. :)
  • Any reasons we need to put http server generating logic to a loop? IIUC, this will success on first try and will last or fail when bind and listen fails where we need to make npd fail fast in order to make users realize that npd can not start and one major functionality is not woking.

cmd/node_problem_detector.go, line 116 at r2 (raw file):

	// Start http server.
	if *serverPort > 0 {
if !httpDisabled {
    startHTTPServer(p)
}


pkg/condition/manager.go, line 62 at r2 (raw file):

type conditionManager struct {
	// The lock protecting the object. Only 2 fields will be accessed by more

we need to refactor the code since #79 has been merged.


pkg/util/http.go, line 23 at r2 (raw file):

	"net/http"
)

SGTM


Comments from Reviewable

@andyxning
Copy link
Member

@Random-Liu Sorry for the delay. It is Spring Festival holidays. :)

@andyxning
Copy link
Member

Review status: 0 of 6 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


pkg/util/http.go, line 24 at r2 (raw file):

ReturnHTTPJson is a helper function to generate json http response.

ReturnHTTPJson generates json http response.


pkg/util/http.go, line 36 at r2 (raw file):

ReturnHTTPError is a helper function to generate error http response.

ReturnHTTPError generates error http response.


Comments from Reviewable

@Random-Liu
Copy link
Member Author

No problem at all. Enjoy your holiday! :)


Comments from Reviewable

@Random-Liu
Copy link
Member Author

Review status: 0 of 6 files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


cmd/node_problem_detector.go, line 44 at r2 (raw file):

maybe port and address command line argument are more briefness, as currently we do not have any other http service.

Will do.

I have grep -r 10254 . in kubernetes, heapster and contrib and find that gcr.io/google_containers/nginx-ingress-controller has used that port. Maybe we should choose another one. BTW, does kubernetes has some standards one port usage.

Good catch. As far as I know, there is no standards for port usage for now. I choose 10254 because it is not used in https://github.com/kubernetes/kubernetes/blob/master/pkg/master/ports/ports.go
I'll use 10256 instead then.

IIUC, 0 is mainly used to select a port randomly. So, maybe we need to make another command line argument, like http-disabled, to stop npd from starting the http service and the default value for this command line option is false. This seems to be clearer.

Actually I copy the code from https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/options/options.go#L106. I don't think we want to select port randomly, let's just keep the behavior the same with kubelet for now. :)


cmd/node_problem_detector.go, line 45 at r2 (raw file):

Previously, andyxning (Ning Xie) wrote…

Any special considerations for using 127.0.0.1 instead of 0.0.0.0.

Actually, kube-proxy and kubelet use 127.0.0.1 as healthz address, 0.0.0.0 as server address.

Hm, let's use 127.0.0.1 as default, if we need remote access, we can always start NPD with 0.0.0.0. :)


cmd/node_problem_detector.go, line 96 at r2 (raw file):

Previously, andyxning (Ning Xie) wrote…
  • glog.Fatalf makes failure fail faster instead of just logging. This error is lost of functionality. :)
  • Any reasons we need to put http server generating logic to a loop? IIUC, this will success on first try and will last or fail when bind and listen fails where we need to make npd fail fast in order to make users realize that npd can not start and one major functionality is not woking.

You are right. Will change to fatal and remove the loop.


Comments from Reviewable

@Random-Liu
Copy link
Member Author

Review status: 0 of 6 files reviewed at latest revision, 9 unresolved discussions.


cmd/node_problem_detector.go, line 44 at r2 (raw file):

Previously, Random-Liu (Lantao Liu) wrote…

maybe port and address command line argument are more briefness, as currently we do not have any other http service.

Will do.

I have grep -r 10254 . in kubernetes, heapster and contrib and find that gcr.io/google_containers/nginx-ingress-controller has used that port. Maybe we should choose another one. BTW, does kubernetes has some standards one port usage.

Good catch. As far as I know, there is no standards for port usage for now. I choose 10254 because it is not used in https://github.com/kubernetes/kubernetes/blob/master/pkg/master/ports/ports.go
I'll use 10256 instead then.

IIUC, 0 is mainly used to select a port randomly. So, maybe we need to make another command line argument, like http-disabled, to stop npd from starting the http service and the default value for this command line option is false. This seems to be clearer.

Actually I copy the code from https://github.com/kubernetes/kubernetes/blob/master/cmd/kubelet/app/options/options.go#L106. I don't think we want to select port randomly, let's just keep the behavior the same with kubelet for now. :)

Done.


cmd/node_problem_detector.go, line 45 at r2 (raw file):

Previously, Random-Liu (Lantao Liu) wrote…

Actually, kube-proxy and kubelet use 127.0.0.1 as healthz address, 0.0.0.0 as server address.

Hm, let's use 127.0.0.1 as default, if we need remote access, we can always start NPD with 0.0.0.0. :)

Done.


cmd/node_problem_detector.go, line 96 at r2 (raw file):

Previously, Random-Liu (Lantao Liu) wrote…

You are right. Will change to fatal and remove the loop.

Done.


pkg/condition/manager.go, line 62 at r2 (raw file):

Previously, andyxning (Ning Xie) wrote…

we need to refactor the code since #79 has been merged.

Done.


pkg/util/http.go, line 24 at r2 (raw file):

Previously, andyxning (Ning Xie) wrote…

ReturnHTTPJson is a helper function to generate json http response.

ReturnHTTPJson generates json http response.

Done.


pkg/util/http.go, line 36 at r2 (raw file):

Previously, andyxning (Ning Xie) wrote…

ReturnHTTPError is a helper function to generate error http response.

ReturnHTTPError generates error http response.

Done.


Comments from Reviewable

@Random-Liu
Copy link
Member Author

@andyxning Addressed comments. PTAL~

@andyxning
Copy link
Member

Review status: 0 of 6 files reviewed at latest revision, 6 unresolved discussions.


cmd/node_problem_detector.go, line 116 at r2 (raw file):

Previously, andyxning (Ning Xie) wrote…
if !httpDisabled {
    startHTTPServer(p)
}

Since we need to keep up with kubelet, this should not be modified. :)

LGTM


pkg/condition/manager.go, line 62 at r2 (raw file):

Previously, Random-Liu (Lantao Liu) wrote…

Done.

nit: The lock protecting the object. is useless and can be deleted. WDYT. @Random-Liu


pkg/condition/manager_test.go, line 85 at r3 (raw file):

func TestGetConditions(t *testing.T) {
	m, _, _ := newTestManager()

LGTM


pkg/problemdetector/problem_detector.go, line 81 at r3 (raw file):

	// Add the handler to serve condition http request.
	http.HandleFunc("/conditions", func(w http.ResponseWriter, r *http.Request) {
		util.ReturnHTTPJson(w, p.conditionManager.GetConditions())

LGTM


pkg/util/http.go, line 23 at r2 (raw file):

Previously, andyxning (Ning Xie) wrote…

SGTM

LGTM


Comments from Reviewable

@andyxning
Copy link
Member

Reviewed 1 of 6 files at r1, 1 of 1 files at r2, 3 of 3 files at r3.
Review status: 5 of 6 files reviewed at latest revision, 6 unresolved discussions.


Comments from Reviewable

@andyxning
Copy link
Member

@Random-Liu just one nit. Overall LGTM.

@Random-Liu
Copy link
Member Author

pkg/condition/manager.go, line 62 at r2 (raw file):

Previously, andyxning (Ning Xie) wrote…

nit: The lock protecting the object. is useless and can be deleted. WDYT. @Random-Liu

Done.


Comments from Reviewable

@Random-Liu
Copy link
Member Author

@andyxning Thanks for reviewing! Apply LGTM based on #83 (comment).

@Random-Liu Random-Liu added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 3, 2017
@andyxning
Copy link
Member

Reviewed 1 of 1 files at r4.
Review status: all files reviewed at latest revision, 5 unresolved discussions.


Comments from Reviewable

@andyxning
Copy link
Member

/lgtm


Review status: all files reviewed at latest revision, 5 unresolved discussions.


Comments from Reviewable

@Random-Liu Random-Liu merged commit 0876f61 into kubernetes:master Feb 4, 2017
@BiancaTofan
Copy link

Will a readiness probe be added in the future? I can see that some changes haven't been merged yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants