
added: support for metrics configuration, periodic metrics cleanup and selective namespace whitelisting and blacklisting with respect to metrics registration #2288

Conversation

@yashvardhan-kukreja (Contributor) commented Aug 21, 2021

Signed-off-by: Yashvardhan Kukreja yash.kukreja.98@gmail.com

Related issue

closes #2268

Milestone of this PR

What type of PR is this

/kind feature

Proposed Changes

With this change, values.yaml also encapsulates the configuration for metrics exposure:

config:
  metricsConfig:
    namespaces: {
      "include": [], # list of namespaces to capture metrics for. Default: metrics are captured for all namespaces except those listed in "exclude".
      "exclude": []  # list of namespaces to NOT capture metrics for. Default: []
    }
    metricsRefreshInterval: 24h # interval at which metrics are reset to limit the memory footprint of Kyverno's metrics. Default: null (no refresh of metrics)
  # Or provide an existing metrics ConfigMap by uncommenting the line below.
  # Refer to ./templates/metricsconfigmap.yaml for the structure of the metrics ConfigMap.
  # existingMetricsConfig: sample-metrics-configmap
  • namespaces.include - the namespaces for which metrics will be collected. Default: metrics are collected for all namespaces.
  • namespaces.exclude - the namespaces for which metrics will NOT be collected. Default: no namespaces are excluded.
  • metricsRefreshInterval - the interval at which Kyverno's in-memory metrics are reset. This helps reduce the memory footprint associated with metrics collection and exposure in Kyverno. The cleanup does not lose any data, because the metrics are persistently stored on the end user's Prometheus server anyway. A sample override is sketched after this list.
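
For illustration, here is a minimal override file a user could pass to Helm to exclude a couple of namespaces and enable a 12-hour refresh. The file name and the excluded namespaces are hypothetical; only the structure mirrors the values.yaml block above.

# values-metrics.yaml (hypothetical file name)
# Collect metrics for every namespace except kube-system and kyverno,
# and reset the in-memory metrics every 12 hours.
config:
  metricsConfig:
    namespaces: {
      "include": [],
      "exclude": ["kube-system", "kyverno"]
    }
    metricsRefreshInterval: 12h

This could then be applied with something like helm install kyverno kyverno/kyverno -n kyverno -f values-metrics.yaml (the chart reference and namespace here are assumptions, not taken from this PR).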

Proof Manifests

Checklist

  • [ ] I have read the contributing guidelines.
  • [ ] I have added tests that prove my fix is effective or that my feature works.
  • [ ] My PR contains new or altered behavior to Kyverno and
    • [ ] I have added or changed the documentation myself in an existing PR and the link is:
    • [ ] I have raised an issue in kyverno/website to track the doc update and the link is:
    • [ ] I have read the PR documentation guide and followed the process including adding proof manifests to this PR.

Further Comments

@yashvardhan-kukreja force-pushed the issue-2268/selective-metric-exposure branch 3 times, most recently from 8bb7a37 to 8cf8f78 on August 22, 2021 11:04
@realshuting (Member)

Hi @yashvardhan-kukreja - have you tested it locally? Seems like the Kyverno Pod never enters running & ready state. Can you please verify locally?

@yashvardhan-kukreja (Contributor, Author) commented Aug 23, 2021

Yes, I tested both cases: with a definite metrics refresh interval and with no metrics refresh. It was running perfectly fine on my end.

I'll test it again. Meanwhile, do you mind sending the logs and the output of kubectl describe for the Kyverno pod?

Also, can you share the values.yaml file against which you are testing this PR?

@realshuting (Member)

Here's the log. I think it uses the direct manifests to install Kyverno, and the Pod was in an Error state:

Run echo ">>> Install Kyverno"
  echo ">>> Install Kyverno"
  sed 's/imagePullPolicy:.*$/imagePullPolicy: IfNotPresent/g' ${GITHUB_WORKSPACE}/definitions/install.yaml | kubectl apply -f -
  kubectl apply -f ${GITHUB_WORKSPACE}/definitions/github/rbac.yaml
  chmod a+x ${GITHUB_WORKSPACE}/scripts/verify-deployment.sh
  sleep 50
  echo ">>> Check kyverno"
  kubectl get pods -n kyverno
  ${GITHUB_WORKSPACE}/scripts/verify-deployment.sh -n kyverno  kyverno
  sleep 20
  
  echo ">>> Expose the Kyverno's service's metric server to the host"
  kubectl port-forward svc/kyverno-svc-metrics -n kyverno 8000:8000 &
  echo ">>> Run Kyverno e2e test"
  make test-e2e
  kubectl delete -f ${GITHUB_WORKSPACE}/definitions/install.yaml
  shell: /usr/bin/bash -e {0}
  env:
    GOROOT: /opt/hostedtoolcache/go/1.16.7/x64
    CT_CONFIG_DIR: /opt/hostedtoolcache/ct/v3.3.0/x86_64/etc
    VIRTUAL_ENV: /opt/hostedtoolcache/ct/v3.3.0/x86_64/venv
>>> Install Kyverno
namespace/kyverno created
customresourcedefinition.apiextensions.k8s.io/clusterpolicies.kyverno.io created
customresourcedefinition.apiextensions.k8s.io/clusterpolicyreports.wgpolicyk8s.io created
customresourcedefinition.apiextensions.k8s.io/clusterreportchangerequests.kyverno.io created
customresourcedefinition.apiextensions.k8s.io/generaterequests.kyverno.io created
customresourcedefinition.apiextensions.k8s.io/policies.kyverno.io created
customresourcedefinition.apiextensions.k8s.io/policyreports.wgpolicyk8s.io created
customresourcedefinition.apiextensions.k8s.io/reportchangerequests.kyverno.io created
serviceaccount/kyverno-service-account created
clusterrole.rbac.authorization.k8s.io/kyverno:admin-policies created
clusterrole.rbac.authorization.k8s.io/kyverno:admin-policyreport created
clusterrole.rbac.authorization.k8s.io/kyverno:admin-reportchangerequest created
clusterrole.rbac.authorization.k8s.io/kyverno:customresources created
clusterrole.rbac.authorization.k8s.io/kyverno:generatecontroller created
clusterrole.rbac.authorization.k8s.io/kyverno:leaderelection created
clusterrole.rbac.authorization.k8s.io/kyverno:policycontroller created
clusterrole.rbac.authorization.k8s.io/kyverno:userinfo created
clusterrole.rbac.authorization.k8s.io/kyverno:webhook created
clusterrolebinding.rbac.authorization.k8s.io/kyverno:customresources created
clusterrolebinding.rbac.authorization.k8s.io/kyverno:generatecontroller created
clusterrolebinding.rbac.authorization.k8s.io/kyverno:leaderelection created
clusterrolebinding.rbac.authorization.k8s.io/kyverno:policycontroller created
clusterrolebinding.rbac.authorization.k8s.io/kyverno:userinfo created
clusterrolebinding.rbac.authorization.k8s.io/kyverno:webhook created
configmap/init-config created
service/kyverno-svc created
service/kyverno-svc-metrics created
deployment.apps/kyverno created
clusterrole.rbac.authorization.k8s.io/kyverno:userinfo configured
clusterrole.rbac.authorization.k8s.io/kyverno:customresources configured
clusterrole.rbac.authorization.k8s.io/kyverno:policycontroller configured
clusterrole.rbac.authorization.k8s.io/kyverno:generatecontroller configured
>>> Check kyverno
NAME                       READY   STATUS   RESTARTS   AGE
kyverno-64d6687bc9-jbvdl   0/1     Error    2          43s
Waiting for deployment of kyverno in namespace kyverno with a timeout 60 seconds
Expected generation for deployment kyverno: 1
Observed expected generation: 1
Specified replicas: 1
current/updated/available replicas: 1/1/, waiting
current/updated/available replicas: 1/1/, waiting
current/updated/available replicas: 1/1/, waiting

@yashvardhan-kukreja (Contributor, Author) commented Aug 23, 2021

Hi @yashvardhan-kukreja - have you tested it locally? Seems like the Kyverno Pod never enters running & ready state. Can you please verify locally?

Shuting, I also looked into this. I installed Kyverno around 10-15 times, and the issue occurred only once: the Pod entered the "Running" state but showed 0/1 ready containers, meaning the Kyverno container was failing. Upon describing it, I found that the container had failed the liveness probe check, which is why Kyverno was registered as unhealthy.
However, the logs of that container looked perfectly fine, so the root cause was clear: the liveness probe checks ran before Kyverno's webhook server had started (a pretty rare event).

As a resolution, I increased the initialDelaySeconds of the Kyverno container from 10 seconds to 15 seconds, to ensure Kyverno gets enough time to start the webhook server before the liveness probe checks begin.
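
For context, this is roughly the shape of the change on the deployment's container spec. It is a minimal sketch only: the probe path, port, and the other probe fields below are placeholders, not the exact values from Kyverno's manifest; the relevant change is the bumped initialDelaySeconds.

livenessProbe:
  httpGet:
    path: /health/liveness   # placeholder health endpoint
    port: 9443               # placeholder port
    scheme: HTTPS
  initialDelaySeconds: 15    # raised from 10 so the webhook server can come up before the first check
  periodSeconds: 30
  failureThreshold: 2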

@yashvardhan-kukreja (Contributor, Author)

Another thing, Shuting: although I had programmed the default-case handling for the namespaces and metricsRefreshInterval fields under config.metricsConfig in values.yaml, the Helm install of Kyverno still crashed if the user explicitly removed or commented out the entire config.metricsConfig block. So, in the second-to-last commit, I programmed the default-case handling for that as well.
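
For illustration, this is the kind of guard a Helm template can use so a missing config.metricsConfig block falls back to sane defaults instead of failing the render. It is a minimal sketch of the technique, not the actual Kyverno chart template.

{{- /* Illustrative guard only: fall back to empty values when config.metricsConfig
      has been removed or commented out in values.yaml. */}}
{{- $config := .Values.config | default (dict) }}
{{- $metricsConfig := $config.metricsConfig | default (dict) }}
namespaces: {{ $metricsConfig.namespaces | default (dict "include" (list) "exclude" (list)) | toJson | quote }}
{{- with $metricsConfig.metricsRefreshInterval }}
metricsRefreshInterval: {{ . | quote }}
{{- end }}

Guarding each lookup with default means a removed block degrades to an empty value instead of a nil-pointer error during helm install.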

@yashvardhan-kukreja (Contributor, Author) commented Aug 23, 2021

Here's the log, I think it uses the direct manifest to install Kyverno, and Pod was in an Error state:

I think I've found the issue. I forgot to update the metrics-config-related manifests and kustomize definitions under definitions/k8s-resources/. Working on it!

@yashvardhan-kukreja (Contributor, Author)

All done @realshuting :)

@yashvardhan-kukreja (Contributor, Author) commented Sep 2, 2021

@realshuting once you're done reviewing this PR, please do not merge it yet. Please review and merge #2351 first. Once that gets merged into the main branch, I'll make a very tiny change to this PR and rebase it; then it should be in a mergeable state.

@realshuting self-assigned this Sep 2, 2021

@realshuting (Member) left a comment


/lgtm

@yashvardhan-kukreja - do we have issues tracking website updates? We need to document how to customize ConfigMap using Helm and direct manifests.
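
For reference, a hedged sketch of what such an existing metrics ConfigMap could look like when managed via direct manifests. The ConfigMap name is taken from the existingMetricsConfig example in the values.yaml above, and the data keys are assumptions mirroring that structure; see ./templates/metricsconfigmap.yaml in the chart for the real layout.

apiVersion: v1
kind: ConfigMap
metadata:
  name: sample-metrics-configmap   # name reused from the values.yaml example above
  namespace: kyverno
data:
  namespaces: '{"include": [], "exclude": ["kube-system"]}'   # assumed key layout
  metricsRefreshInterval: 24h                                 # assumed key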

@yashvardhan-kukreja (Contributor, Author)

/lgtm

@yashvardhan-kukreja - do we have issues tracking website updates? We need to document how to customize ConfigMap using Helm and direct manifests.

Yeah Shuting, I'll get onto documenting that.
Meanwhile, I just have to make one last commit to this PR, commenting out the following line in values.yaml:

metricsRefreshInterval: 24h

That way, by default, Kyverno's metrics exporter won't reset and clean up the metrics in its buffer every 24 hours.
If users want that behavior, they can still uncomment the line and feed it to Helm to make Kyverno perform the metrics refresh.
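
Roughly, the resulting default block in values.yaml would look like the sketch below (illustrative, not an exact copy of the chart):

config:
  metricsConfig:
    namespaces: {
      "include": [],
      "exclude": []
    }
    # Left commented out by default, so metrics are never reset unless the user opts in:
    # metricsRefreshInterval: 24h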

@yashvardhan-kukreja force-pushed the issue-2268/selective-metric-exposure branch 3 times, most recently from 657c424 to bfb149a on September 4, 2021 03:01
@yashvardhan-kukreja (Contributor, Author)

@realshuting do you mind running the e2e tests corresponding to this branch on your end? I tried running them on my end and they passed, so I am not able to figure out what the concern is here and why the e2e tests in GitHub Actions are reporting a failure.

@realshuting (Member)

@yashvardhan-kukreja - verified locally, all looked good!

Can you please rebase onto the main branch? And sorry for the late response.

…d selective namespace whitelisting and blacklisting for metrics

Signed-off-by: Yashvardhan Kukreja <yash.kukreja.98@gmail.com>
@realshuting (Member)

Thank you @yashvardhan-kukreja !

@realshuting merged commit 5fcd9b8 into kyverno:main on Sep 10, 2021
@realshuting (Member)

Hi @yashvardhan-kukreja - following up on the doc update, do we have an issue logged to track it?

Merging this pull request closes issue #2268: support for configurable automatic refresh of metrics and selective exposure of metrics at namespace-level.