Rewrite finalURLTemplate used only for metrics because of dynamic client change #68530

Merged
merged 1 commit into from Sep 17, 2018

Conversation

wenjiaswe
Contributor

@wenjiaswe wenjiaswe commented Sep 12, 2018

What this PR does / why we need it:
When the new, easier-to-use dynamic client was introduced in #62913, the name and namespace started being appended to the URL and ending up in request.pathPrefix. As a result, the original finalURLTemplate function could no longer replace namespace names with the string "{namespace}", which caused the overwhelming number of metrics reported in #68115.

This PR changes the finalURLTemplate function so that, for the dynamic client, the namespace name in the pathPrefix is replaced with the string "{namespace}".
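A minimal sketch of the idea in the paragraph above, working on the path string only. The helper name `replaceNamespaceSegment` is hypothetical and this is not the PR's code: it rewrites the segment that follows each literal `namespaces` segment to `{namespace}`, so requests to different namespaces collapse into a single metric label. The review thread below discusses why a single-segment check like this is not sufficient on its own.

```go
package main

import (
	"fmt"
	"strings"
)

// replaceNamespaceSegment is a hypothetical helper: it rewrites the segment
// that follows each literal "namespaces" segment to "{namespace}".
func replaceNamespaceSegment(p string) string {
	segs := strings.Split(p, "/")
	for i := 0; i+1 < len(segs); i++ {
		if segs[i] == "namespaces" {
			segs[i+1] = "{namespace}"
			i++ // skip the segment that was just rewritten
		}
	}
	return strings.Join(segs, "/")
}

func main() {
	// Example path taken from the test case discussed in the review.
	fmt.Println(replaceNamespaceSegment("/pre1/namespaces/ns/r1"))
	// prints: /pre1/namespaces/{namespace}/r1
}
```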

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #68115

Special notes for your reviewer:

  1. As mentioned in issue Histogram metrics generating overwhelming number of prometheus metrics #68115,

This seems to be a pretty critical bug. Will mark it for v1.12. Please triage.

Please help to review the PR asap if possible. Thanks very much in advance!

  2. This fix only addresses the namespace. The name is also affected, since it should be replaced by {name} in finalURLTemplate. However, the name string is appended to the URL directly, without a marker segment like /name/, unlike namespaces (/namespaces/ACTUALNAMESPACE), so handling it might involve more than just parsing request.pathPrefix. Since it does not impact the metrics, and considering the time sensitivity during the code freeze, could we fix it later? Please let me know if it actually has an impact big enough to affect the 1.12 release.
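To illustrate the difference described in the second note, a hypothetical sketch (the helper `namespaceByKeyword` is illustrative only, not from the PR): the namespace can be found by searching for its `namespaces` keyword, whereas the object name has no marker segment and can only be identified by its position relative to the resource segment.

```go
package main

import (
	"fmt"
	"strings"
)

// namespaceByKeyword is a hypothetical helper that finds the namespace by
// looking for the "namespaces" keyword segment that precedes it.
func namespaceByKeyword(segs []string) (string, bool) {
	for i := 0; i+1 < len(segs); i++ {
		if segs[i] == "namespaces" {
			return segs[i+1], true
		}
	}
	return "", false
}

func main() {
	segs := strings.Split("/apis/apps/v1/namespaces/foo/deployments/bar", "/")
	if ns, ok := namespaceByKeyword(segs); ok {
		fmt.Println("namespace:", ns) // located via the "namespaces" keyword
	}
	// No segment labels "bar" as a name; it is known to be the name only
	// because it comes right after the resource segment "deployments".
	fmt.Println("name:", segs[len(segs)-1])
}
```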

@saad-ali what do you think?

Release note:

Fix a bug where an overwhelming number of Prometheus metrics were generated because $NAMESPACE was not replaced by the string "{namespace}".

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 12, 2018
@k8s-ci-robot
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Sep 12, 2018
@wenjiaswe
Contributor Author

/cc @deads2k @caesarxuchao @roycaihw

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Sep 12, 2018
@wenjiaswe
Contributor Author

/kind bug

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Sep 12, 2018
@roycaihw
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 12, 2018
@wenjiaswe
Contributor Author

/retest

@wenjiaswe
Contributor Author

/milestone v1.12

@k8s-ci-robot
Contributor

@wenjiaswe: You must be a member of the kubernetes/kubernetes-milestone-maintainers github team to set the milestone.

In response to this:

/milestone v1.12


Member

@roycaihw roycaihw left a comment

Thanks for the PR! I left some comments. Given that finalURLTemplate is only used by metrics, and Prometheus metrics seem to prune the resource {name} from the URL, I'm okay with fixing {namespace} first and handling {name} as a follow-up after code freeze. I'm not familiar with metrics, so I can't tell the impact.

func replaceNamespace4DynamicClient(oldPath string) string {
segments := strings.Split(oldPath, "/")
for i := 0; i < len(segments)-1; i++ {
if segments[i] == "namespaces" {
Member

One concern is:

  • if the resource name is "namespaces" (e.g. a node called "namespaces");
  • if r.baseURL.Path contains "/namespaces/"

I'm not sure if we have constraints (e.g. resources cannot be named "namespaces") that could help us rule out the two cases above. If not, you could check whether r.pathPrefix contains r.baseURL.Path and ignore the baseURL part during parsing. The rest of pathPrefix should follow the Kubernetes API convention format (legacy group vs. named group; cluster-scoped vs. namespaced).

This also helps you parse the {name} segment.
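A minimal sketch of the suggestion above, assuming the base path is available as a plain string (the helper `stripBasePath` is hypothetical, not a client-go function): remove the base path from the request path only at a segment boundary, so that a base path which happens to contain `/namespaces/` is never parsed as part of the API path. The base path used here is the one from the PR's test cases.

```go
package main

import (
	"fmt"
	"strings"
)

// stripBasePath is a hypothetical helper: it removes basePath from the front
// of fullPath, but only when it ends on a "/" segment boundary. Otherwise
// fullPath is returned unchanged.
func stripBasePath(fullPath, basePath string) string {
	base := strings.TrimSuffix(basePath, "/")
	if base == "" {
		return fullPath // nothing to strip
	}
	if fullPath == base {
		return "/"
	}
	if strings.HasPrefix(fullPath, base+"/") {
		return fullPath[len(base):] // keeps the leading "/"
	}
	return fullPath
}

func main() {
	fmt.Println(stripBasePath("/some/base/url/path/apis/apps/v1/deployments", "/some/base/url/path"))
	// prints: /apis/apps/v1/deployments
}
```

Checking for `base+"/"` rather than plain `HasPrefix(fullPath, base)` avoids treating `/basement/...` as if it started with the base path `/base`.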

Member

Another case is the namespace kind, where the url is /api/v1/namespaces/{name}

Contributor Author

I checked in my local cluster, and it seems I can create a namespace called "namespace" with no problem. But in that case, the replaceNamespace4DynamicClient function can still handle it by blindly replacing whatever segment follows /namespaces/ with {namespace}. r.baseURL and pathPrefix are separate in rest.Request, so I don't see how checking r.pathPrefix and r.baseURL.Path would help.

For example, in the current test case, if the full URL is "http://localhost/pre1/namespaces/ns/r1", then r.pathPrefix is "/pre1/namespaces/ns/r1" and r.baseURL is "http://localhost".

Member

I am a little skeptical of this approach, but maybe you can make it work. :)

You at least need to check more than one path segment. Here are some test cases off the top of my head.

  • /api/v1/namespaces/kube-system/services/monitoring-heapster is about a service; it should match.
  • /api/v1/namespaces/kube-system is about a namespace; it should NOT match (kube-system is the name of the resource in this case, which is not inside a namespace).
  • /apis/apps/v1/namespaces/foo/deployments/bar is about resource bar in namespace foo, it should match.
  • /apis/apps/v1/namespaces/namespaces/deployments/namespaces is about resource namespaces in namespace namespaces, it should match.
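A sketch of what checking more than one segment could look like, assuming the standard `/api/v1/...` and `/apis/{group}/{version}/...` layouts. The function `matchesNamespaced` is hypothetical, not the PR's code. It reproduces the expected match/no-match result for the four cases listed above, including the Namespace-kind case `/api/v1/namespaces/{name}` raised earlier. A path string alone can still be ambiguous when pathPrefix, resource, and resourceName are all set, so this remains a heuristic.

```go
package main

import (
	"fmt"
	"strings"
)

// matchesNamespaced is a hypothetical check that looks at several segments.
// After the "/api/v1" or "/apis/{group}/{version}" prefix, a path counts as
// namespace-scoped only when it reads "namespaces/{ns}/{resource}...".
// "namespaces/{name}" alone (or followed only by the status or finalize
// subresource) addresses a Namespace object itself, so it does not match.
func matchesNamespaced(p string) bool {
	segs := strings.Split(strings.Trim(p, "/"), "/")
	var start int
	switch {
	case len(segs) >= 2 && segs[0] == "api":
		start = 2
	case len(segs) >= 3 && segs[0] == "apis":
		start = 3
	default:
		return false // not a recognizable Kubernetes API path
	}
	rest := segs[start:]
	if len(rest) < 3 || rest[0] != "namespaces" {
		return false
	}
	if len(rest) == 3 && (rest[2] == "status" || rest[2] == "finalize") {
		return false // subresource of a Namespace object
	}
	return true
}

func main() {
	for _, p := range []string{
		"/api/v1/namespaces/kube-system/services/monitoring-heapster",
		"/api/v1/namespaces/kube-system",
		"/apis/apps/v1/namespaces/foo/deployments/bar",
		"/apis/apps/v1/namespaces/namespaces/deployments/namespaces",
	} {
		fmt.Println(matchesNamespaced(p), p)
	}
}
```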

Contributor Author

@wenjiaswe wenjiaswe Sep 12, 2018

I just pushed a new commit after talking to @roycaihw offline. It parses request.pathPrefix based on the type of API group: the /api/v1... format for the legacy group or the /apis/apps/v1... format for named groups.

Based on @lavalamp's comment, I think there is no naming rule that prevents this... Regarding Daniel's concern, the first, third, and fourth cases should all be fine. However, both Haowei and Daniel mentioned that if we have a dynamic resource client with no namespace but with a resource called "namespaces", the current fix would blindly replace the segment after the resource name with "{namespace}", which is wrong...

In that case, we would need to pass the dynamicResourceClient struct to rest/request.go so it knows exactly what the structure is, instead of relying on blind string manipulation. That would be more than "only have to do the analysis if you have no hits for the namespace being set." as @deads2k originally suggested, so let me do more analysis and see whether it would impact parts other than metrics.
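A minimal sketch of the group-aware parsing described in this comment, under stated assumptions: the base URL path has already been removed, paths follow the conventional core-group and named-group layouts, and only `status` and `finalize` are recognized as subresources. The name `templatePath` and the `/{prefix}` fallback are hypothetical; this is an illustration, not the merged code. Because it decides by segment position rather than by the literal word `namespaces`, it can also template `{name}`, and it gives the expected results for the four test cases listed earlier in the thread. A cluster-scoped resource called "namespaces" with some other subresource would still be misread, which is the kind of ambiguity discussed next.

```go
package main

import (
	"fmt"
	"strings"
)

// templatePath is a hypothetical sketch, not the merged client-go code. It
// assumes p has no base URL path and follows one of these layouts:
//   /api/v1/[namespaces/{ns}/]{resource}[/{name}[/{subresource}]]
//   /apis/{group}/{version}/[namespaces/{ns}/]{resource}[/{name}[/{subresource}]]
// and replaces the namespace and name with placeholders.
func templatePath(p string) string {
	segs := strings.Split(strings.Trim(p, "/"), "/")
	var start int
	switch {
	case len(segs) >= 2 && segs[0] == "api":
		start = 2 // legacy (core) group: /api/v1/...
	case len(segs) >= 3 && segs[0] == "apis":
		start = 3 // named group: /apis/{group}/{version}/...
	default:
		return "/{prefix}" // unknown layout: collapse to one label
	}
	rest := segs[start:] // shares storage with segs, so edits show up in segs
	isSub := func(s string) bool { return s == "status" || s == "finalize" }
	switch {
	case len(rest) == 2: // {resource}/{name}
		rest[1] = "{name}"
	case len(rest) == 3 && isSub(rest[2]): // {resource}/{name}/{subresource}
		rest[1] = "{name}"
	case len(rest) == 3: // namespaces/{ns}/{resource}
		rest[1] = "{namespace}"
	case len(rest) >= 4: // namespaces/{ns}/{resource}/{name}[/...]
		rest[1] = "{namespace}"
		rest[3] = "{name}"
	}
	return "/" + strings.Join(segs, "/")
}

func main() {
	fmt.Println(templatePath("/apis/apps/v1/namespaces/foo/deployments/bar"))
	fmt.Println(templatePath("/api/v1/namespaces/kube-system"))
}
```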

Member

@roycaihw roycaihw Sep 12, 2018

/api/v1/namespaces/kube-system is about a namespace; it should NOT match

Given that a user can set pathPrefix, resource, and resourceName at the same time, that same pathPrefix can also be part of a /api/v1/namespaces/kube-system/services/monitoring-heapster request, in which case it should match.

We either need to involve more fields of Request in the detection logic, or use some other approach (e.g. modify the dynamic client).
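One way to avoid parsing altogether is to keep the pieces separate until the template is built. The sketch below is hypothetical (`requestParts`, its fields, and the `example.com` group are not client-go names): because the namespace and name are stored as fields, a cluster-scoped resource literally named "namespaces" can no longer be confused with a namespace segment.

```go
package main

import (
	"fmt"
	"path"
)

// requestParts is a hypothetical structure that keeps the URL pieces
// separate instead of joining them into one pathPrefix string.
type requestParts struct {
	prefix      string // e.g. "/api/v1" or "/apis/apps/v1"
	namespace   string // empty for cluster-scoped requests
	resource    string
	name        string // empty for collection requests
	subresource string // e.g. "status"; may be empty
}

// template builds the metrics path, replacing the high-cardinality parts
// (namespace and name) with fixed placeholders.
func (r requestParts) template() string {
	parts := []string{r.prefix}
	if r.namespace != "" {
		parts = append(parts, "namespaces", "{namespace}")
	}
	parts = append(parts, r.resource)
	if r.name != "" {
		parts = append(parts, "{name}")
	}
	if r.subresource != "" {
		parts = append(parts, r.subresource)
	}
	return path.Join(parts...) // path.Join skips empty elements
}

func main() {
	svc := requestParts{prefix: "/api/v1", namespace: "kube-system",
		resource: "services", name: "monitoring-heapster"}
	fmt.Println(svc.template()) // /api/v1/namespaces/{namespace}/services/{name}

	// A cluster-scoped resource named "namespaces" stays unambiguous.
	cr := requestParts{prefix: "/apis/example.com/v1", resource: "namespaces", name: "namespaces"}
	fmt.Println(cr.template()) // /apis/example.com/v1/namespaces/{name}
}
```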

}
actualURL := r.finalURLTemplate()
actual := actualURL.String()
expected := testCase.ExpectedFinalURL
Member

nit: you don't need to redefine expected here

Contributor Author

Thanks! Addressed.

Request: NewRequest(nil, "DELETE", uri, "", ContentConfig{GroupVersion: &schema.GroupVersion{Group: "test"}}, Serializers{}, nil, nil, 0).
Prefix("/apis/namespaces/namespaces/namespaces/namespaces"),
ExpectedFullURL: "http://localhost/some/base/url/path/apis/namespaces/namespaces/namespaces/namespaces",
ExpectedFinalURL: "http://localhost/some/base/url/path/apis/namespaces/namespaces/namespaces/%7Bname%7D",
Member

I would have expected apis/namespaces/{namespace}/namespaces/{name}?

Member

wait no I got confused :)

This is correct.
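The `%7Bname%7D` in the expected strings is simply the percent-encoded form of `{name}`. A small sketch of how Go's `net/url` renders such a path (the helper `templatedURL` is hypothetical); the comment inside explains why, under the group-aware parsing discussed earlier in the thread, only the last `namespaces` segment is templated.

```go
package main

import (
	"fmt"
	"net/url"
)

// templatedURL is a hypothetical helper showing how a templated path is
// rendered: url.URL.String percent-encodes "{" and "}" in the path.
func templatedURL(host, p string) string {
	u := url.URL{Scheme: "http", Host: host, Path: p}
	return u.String()
}

func main() {
	// Under group-aware parsing, the segments after "/apis" read as
	// {group}/{version}/{resource}/{name}; group, version, resource and name
	// are all the literal "namespaces", so only the last one is templated.
	fmt.Println(templatedURL("localhost",
		"/some/base/url/path/apis/namespaces/namespaces/namespaces/{name}"))
	// prints: http://localhost/some/base/url/path/apis/namespaces/namespaces/namespaces/%7Bname%7D
}
```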

Request: NewRequest(nil, "DELETE", uri, "", ContentConfig{GroupVersion: &schema.GroupVersion{Group: "test"}}, Serializers{}, nil, nil, 0).
Prefix("/apis/namespaces/namespaces/namespaces/namespaces/status"),
ExpectedFullURL: "http://localhost/some/base/url/path/apis/namespaces/namespaces/namespaces/namespaces/status",
ExpectedFinalURL: "http://localhost/some/base/url/path/apis/namespaces/namespaces/namespaces/%7Bname%7D/status",
Member

I'd expect {namespace} here too?

Member

Scratch this, it's correct.

@lavalamp
Member

/lgtm
/approve

Test cases seem thorough :)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 17, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, wenjiaswe

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 17, 2018
@k8s-ci-robot k8s-ci-robot merged commit 817d420 into kubernetes:master Sep 17, 2018
@wenjiaswe wenjiaswe deleted the 68115 branch September 17, 2018 19:46
@alvaroaleman
Member

@wenjiaswe This is actually a pretty big issue; is there a chance this fix can get cherry-picked/backported to 1.11?

@wenjiaswe
Contributor Author

@alvaroaleman thanks for the reminder! Let me see if I can get the final fix #68690 done soon enough; if so, I will backport the final fix. Otherwise I will backport this one first.

@lavalamp
Member

lavalamp commented Sep 19, 2018 via email

@wenjiaswe
Contributor Author

@lavalamp you are right, the concerns we had about this temporary fix don't apply to 1.11 anyway. I opened a PR for the backport. Thanks.

mvladev pushed a commit to mvladev/gardener that referenced this pull request Oct 8, 2018
After v1.11, the amount of metrics generated by kube-controller-manager
increased several times. See
kubernetes/kubernetes#68530

We drop all those metrics, but it takes extra time for Prometheus to
scrape this endpoint.
k8s-ci-robot added a commit that referenced this pull request Oct 10, 2018
…530-upstream-release-1.11

Automated cherry pick of #68530: Rewrite finalURLTemplate used only for metrics of dynamic client
mrIncompetent pushed a commit to kubermatic/kubermatic that referenced this pull request Oct 10, 2018
In kubernetes/kubernetes#68530 a bug will be
fixed with v1.12 of kubernetes, which floods the metrics created by
the controller-manager. This fix will drop all rest_* series from
the controller-manager when v1.11.x is used.
mrIncompetent pushed a commit to kubermatic/kubermatic that referenced this pull request Oct 11, 2018
* Mitigate prometheus RAM flooding

In kubernetes/kubernetes#68530 a bug will be
fixed with v1.12 of kubernetes, which floods the metrics created by
the controller-manager. This fix will drop all rest_* series from
the controller-manager when v1.11.x is used.

* fix tests

* update fixtures

* restrict prometheus drop rule to 1.11.0-1.11.3 as it got fixed in the versions above

* fix test
mrIncompetent pushed a commit to kubermatic/kubermatic that referenced this pull request Oct 11, 2018
* Mitigate prometheus RAM flooding

In kubernetes/kubernetes#68530 a bug will be
fixed with v1.12 of kubernetes, which floods the metrics created by
the controller-manager. This fix will drop all rest_* series from
the controller-manager when v1.11.x is used.

* fix tests

* update fixtures

* restrict prometheus drop rule to 1.11.0-1.11.3 as it got fixed in the versions above

* fix test

(cherry picked from commit 20aeed6)
mrIncompetent added a commit to kubermatic/kubermatic that referenced this pull request Oct 11, 2018
* Mitigate prometheus RAM flooding

In kubernetes/kubernetes#68530 a bug will be
fixed with v1.12 of kubernetes, which floods the metrics created by
the controller-manager. This fix will drop all rest_* series from
the controller-manager when v1.11.x is used.

* fix tests

* update fixtures

* restrict prometheus drop rule to 1.11.0-1.11.3 as it got fixed in the versions above

* fix test

(cherry picked from commit 20aeed6)
richardyuwen pushed a commit to richardyuwen/gardener that referenced this pull request Mar 26, 2019
After v1.11, the amount of metrics generated by kube-controller-manager
increased several times. See
kubernetes/kubernetes#68530

We drop all those metrics, but it takes extra time for Prometheus to
scrape this endpoint.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Histogram metrics generating overwhelming number of prometheus metrics
8 participants