The grafana dashboard generated by the Grafana plugin is missing some key metrics. #3651

Closed
lingdie opened this issue Oct 9, 2023 · 17 comments · Fixed by #3690
Labels
  • kind/feature: Categorizes issue or PR as related to a new feature.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.
  • triage/needs-information: Indicates an issue needs more information in order to work on it.

Comments

@lingdie

lingdie commented Oct 9, 2023

What do you want to happen?

As mentioned, the default generated dashboard is too simple.

ref to: #2183.

Extra Labels

No response

@lingdie added the kind/feature label Oct 9, 2023
@camilamacedo86
Member

Hi @lingdie,

Could you please describe how you would like to improve this?
Which dashboards do you think should be generated by default?
What changes would you like to suggest?

c/c @Kavinjsir

@lingdie
Author

lingdie commented Oct 10, 2023

Hi @camilamacedo86,
We can incorporate monitoring metrics such as the depth of the workqueue. Its peak is highly correlated with the decline in controller performance. This often happens when a large number of unrelated items queue up in a short period of time, such as during an informer resync or initialization, both of which can lead to an increase in the depth of the workqueue.

@camilamacedo86
Member

Hi @lingdie,

Please feel free to raise a pull request with all your suggestions.
Also, it would be great if you could describe in detail what exactly you think should be added and why.

@camilamacedo86 added the triage/needs-information label Oct 19, 2023
@Kavinjsir
Contributor

Kavinjsir commented Nov 2, 2023

@camilamacedo86 @lingdie I think it's a great idea to enrich the dashboard. Panels showing work queue status sound like a common use case. Based on the metrics exposed by controller-runtime, I guess we may consider:

  • Create a "WorkQueue Depth" panel based on workqueue_depth
  • Create a statistical panel "Seconds for Items Stay in Queue" (P50 P90 P99) based on workqueue_queue_duration_seconds_bucket
  • Create a "Longest Running Processor" panel based on workqueue_longest_running_processor_seconds
  • Create an "Unfinished Seconds" panel based on workqueue_unfinished_work_seconds
  • Create a statistical panel "Seconds Processing Items From Queue" (P50 P90 P99) based on workqueue_work_duration_seconds_bucket
  • Create a "Workqueue Retries" panel based on workqueue_retries_total
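
As a rough illustration of how the panels above might map to queries, here is a minimal sketch that builds PromQL expressions for them. The 5m rate window, the grouping by the `name` label, and the helper names `quantileExpr` and `percentileTargets` are assumptions made for this example; they are not taken from the plugin's actual manifests.

```javascript
// Build a histogram_quantile expression over a *_bucket metric.
// Assumption: a 5m rate window and grouping by the workqueue "name" label.
function quantileExpr(quantile, bucketMetric, window = '5m') {
  return `histogram_quantile(${quantile}, sum(rate(${bucketMetric}[${window}])) by (name, le))`;
}

// One Grafana-style query target per percentile (P50, P90, P99).
function percentileTargets(bucketMetric) {
  return [0.5, 0.9, 0.99].map((q) => ({
    expr: quantileExpr(q, bucketMetric),
    legendFormat: `P${Math.round(q * 100)} {{name}}`,
  }));
}

// Panel titles follow the list above; the expressions are illustrative only.
const panels = {
  'WorkQueue Depth': [{ expr: 'sum(workqueue_depth) by (name)', legendFormat: '{{name}}' }],
  'Seconds for Items Stay in Queue': percentileTargets('workqueue_queue_duration_seconds_bucket'),
  'Seconds Processing Items From Queue': percentileTargets('workqueue_work_duration_seconds_bucket'),
  'Workqueue Retries': [{ expr: 'sum(rate(workqueue_retries_total[5m])) by (name)', legendFormat: '{{name}}' }],
};

// Print every panel title followed by its query expressions.
for (const [title, targets] of Object.entries(panels)) {
  console.log(title);
  for (const target of targets) console.log(`  ${target.expr}`);
}
```

The percentile panels share one helper because they differ only in the bucket metric they read from.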

We may also utilize other metrics to enrich the dashboard. Here is a reference from my previous proposal:

image


Other than that, I'd also love to see if we can explore using new features of Grafana. The current manifests are backward compatible with Grafana v5, which I think is an old version.

@varshaprasad96 added the triage/accepted label Nov 2, 2023
@varshaprasad96
Member

Marking this as triage accepted based on discussion with @Kavinjsir in the community meeting. Covering the default metrics that controller-runtime provides should be a good start for the plugin. It is safe to assume that:

  • anyone using this plugin has exported the default metrics.
  • the metric names are close to stable, as some of them come from client-go and controller-runtime.

@camilamacedo86
Member

IMHO it would be great if we could add a GitHub Actions job that lets us check and preview the Grafana dashboard, so that we can validate its changes more easily.

Could we address this one?
WDYT @Kavinjsir

@Kavinjsir
Contributor

@camilamacedo86 That is ideal! Well, to have a "vivid" dashboard presented, are you suggesting that GitHub Actions could provide a link to view Grafana, similar to how we preview docs when they are updated?
In that case, I'm curious if we need to:
i) set up a k8s cluster
ii) deploy an operator
iii) scale the corresponding CR
iv) push the metrics to a certain Grafana endpoint somewhere

@camilamacedo86
Member

Hi @Kavinjsir

> @camilamacedo86 That is ideal! Well, to have a "vivid" dashboard presented, are you suggesting that GitHub Actions could provide a link to view Grafana, similar to how we preview docs when they are updated?
> In that case, I'm curious if we need to:
> i) set up a k8s cluster
> ii) deploy an operator
> iii) scale the corresponding CR
> iv) push the metrics to a certain Grafana endpoint somewhere

I think we need to ensure that we will be able to visualize the data. But doing these steps is not so hard; see the PR that tests out the new e2e tests in the sample: https://github.com/kubernetes-sigs/kubebuilder/pull/3670/files#diff-995aca61b8f0f31632d3326ec143cc3bceb4ab730a97654d620cf7555d11eb42

@Kavinjsir
Contributor

> Hi @Kavinjsir
>
> > @camilamacedo86 That is ideal! Well, to have a "vivid" dashboard presented, are you suggesting that GitHub Actions could provide a link to view Grafana, similar to how we preview docs when they are updated?
> > In that case, I'm curious if we need to:
> > i) set up a k8s cluster
> > ii) deploy an operator
> > iii) scale the corresponding CR
> > iv) push the metrics to a certain Grafana endpoint somewhere
>
> I think we need to ensure that we will be able to visualize the data. But doing these steps is not so hard; see the PR that tests out the new e2e tests in the sample: https://github.com/kubernetes-sigs/kubebuilder/pull/3670/files#diff-995aca61b8f0f31632d3326ec143cc3bceb4ab730a97654d620cf7555d11eb42

@camilamacedo86 Looks good to me 👍🏼
One approach may be to use the Grafana HTTP API to add dashboards during the e2e tests, though I'm also wondering if there are better ideas.
Would this be expected as a follow-up after #3670 gets merged?
Would you like to have a separate issue to discuss validating the Grafana manifests?
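
For reference, the HTTP API approach mentioned above could look roughly like the following sketch. It targets Grafana's `POST /api/dashboards/db` endpoint; the base URL, the credentials, and the helper names `buildImportRequest` and `importDashboard` are placeholders assumed for this example, not something the e2e tests currently contain.

```javascript
// Build the fetch() arguments for importing a dashboard through Grafana's HTTP API.
// Assumption: Grafana accepts basic auth; user and password are placeholders.
function buildImportRequest(baseUrl, dashboard, { user, password }) {
  const auth = Buffer.from(`${user}:${password}`).toString('base64');
  return {
    url: new URL('/api/dashboards/db', baseUrl).toString(),
    options: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Basic ${auth}`,
      },
      // id: null asks Grafana to create the dashboard; overwrite lets a rerun
      // replace a dashboard that was imported earlier.
      body: JSON.stringify({ dashboard: { ...dashboard, id: null }, overwrite: true }),
    },
  };
}

// Send the request. Relies on the fetch() built into Node 18 and newer.
async function importDashboard(baseUrl, dashboard, credentials) {
  const { url, options } = buildImportRequest(baseUrl, dashboard, credentials);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Grafana returned HTTP ${res.status}`);
  return res.json();
}
```

An e2e step could read the generated controller-runtime-metrics.json, pass the parsed object to `importDashboard`, and fail the test when Grafana rejects it, which would at least validate the manifest's structure.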

@camilamacedo86
Member

@Kavinjsir

IMHO, if we could have a preview like the one we have for the docs, that would be amazing!!!
But if you have better ideas, please feel free!!!

@lingdie
Author

lingdie commented Nov 10, 2023

Implementing graph previews in CI could be quite challenging and may add complexity to our existing workflow. We need to balance the benefits of this feature against the potential increase in complexity it might introduce.

Additionally, there could be issues with the CI preview not generating the appropriate data or changes due to lack of load on the controllers, which might also affect the final presentation. This is another factor we need to consider when evaluating the feasibility of this feature.
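
One possible way to put some load on the controllers in CI is to create a batch of custom resources before capturing anything. The sketch below only generates the manifests; the `example.com/v1` API version, the `Sample` kind, and the helper name `sampleResources` are hypothetical, and a real CRD would likely need a populated `spec`.

```javascript
// Generate a Kubernetes List holding `count` custom resources.
// Assumption: apiVersion and kind are placeholders for the scaffolded CRD.
function sampleResources(count, { apiVersion, kind, namespace = 'default' }) {
  return {
    apiVersion: 'v1',
    kind: 'List',
    items: Array.from({ length: count }, (_, i) => ({
      apiVersion,
      kind,
      metadata: { name: `load-test-${i}`, namespace },
      spec: {}, // fill in whatever the real CRD requires
    })),
  };
}

// Print the List as JSON so it can be piped into `kubectl apply -f -`.
const list = sampleResources(3, { apiVersion: 'example.com/v1', kind: 'Sample' });
console.log(JSON.stringify(list, null, 2));
```

Whether a few hundred such objects produce visible changes in the workqueue panels would still need to be checked on a real cluster.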

@lingdie
Author

lingdie commented Nov 10, 2023

@Kavinjsir For now, maybe you can provide us with a website or a video to preview this PR?

@camilamacedo86
Member

camilamacedo86 commented Nov 10, 2023

Hi @lingdie,

> Implementing graph previews in CI could be quite challenging

I do not think so. How are you running the tests locally?
Could we not do the same in CI?

  • install kind
  • install Prometheus and Grafana
  • deploy a sample project using the scaffold generated with the Grafana plugin
  • output the link to Grafana in the CI
  • so that we can copy and paste it to access the dashboard

See: https://github.com/kubernetes-sigs/kubebuilder/blob/master/.github/workflows/test-sample-go.yml
Could we not do something like this one?
Is there any step that must be done via the UI?
If not, I think we can achieve that.

@Kavinjsir
Contributor

> Hi @lingdie,
>
> > Implementing graph previews in CI could be quite challenging
>
> I do not think so. How are you running the tests locally? Could we not do the same in CI?
>
>   • install kind
>   • install Prometheus and Grafana
>   • deploy a sample project using the scaffold generated with the Grafana plugin
>   • output the link to Grafana in the CI
>   • so that we can copy and paste it to access the dashboard
>
> See: https://github.com/kubernetes-sigs/kubebuilder/blob/master/.github/workflows/test-sample-go.yml Could we not do something like this one? Is there any step that must be done via the UI? If not, I think we can achieve that.

@camilamacedo86 To enable a link to a Grafana UI in CI, I have come up with two ways:

  1. Install the Prometheus operator including Grafana, so that the Grafana web service runs inside a kind cluster. Then, I'm wondering how we can access the Grafana web service running in the kind cluster.
  2. Alternatively, we may have to set up a standalone Grafana service. Then, I'm not sure how we would configure the Grafana data source, which should be the Prometheus metrics from the kind cluster.

I'm not sure if we are on the same page. Could you give some more context if I'm missing anything? Thx!

@camilamacedo86
Member

camilamacedo86 commented Nov 14, 2023

Hi @Kavinjsir,

First of all, I agree that this is not trivial and is a challenge.
I thought that we could expose the Grafana UI link, but that might bring some security concerns.
Alternatively, we could capture screenshots. We might be able to do something like:

To capture and upload screenshots in the GitHub Action:

We can add a new job in our GitHub Actions workflow specifically for capturing screenshots of the Grafana dashboard. This process involves:

  1. Setting Up Node.js Environment:
    Utilize actions/setup-node to prepare the Node.js environment in the GitHub Actions runner.

  2. Installing and Using Puppeteer:
    Puppeteer, a Node library, will be used to control a headless browser for capturing the screenshot.

  3. Node.js Script for Capturing the Screenshot:
    Write and execute a script to navigate to the Grafana dashboard and take a screenshot.

  4. Uploading the Screenshot as an Artifact:
    Finally, upload the captured screenshot to the workflow run as an artifact.

Here's an example workflow snippet along with the Node.js script:

jobs:
  capture-dashboard:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20' # recent Puppeteer releases need Node 18 or newer

      - name: Install Puppeteer
        run: npm install puppeteer

      - name: Capture Grafana Dashboard Screenshot
        run: node capture-screenshot.js

      - name: Upload Screenshot
        uses: actions/upload-artifact@v4 # v2 and v3 of this action are deprecated
        with:
          name: grafana-dashboard
          path: grafana-dashboard.png

And the Node.js script (capture-screenshot.js):

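const puppeteer = require('puppeteer');

(async () => {
  // Launch a headless browser
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Replace with your Grafana URL; wait for network activity to settle so panels can render
    await page.goto('http://localhost:3000', { waitUntil: 'networkidle2' });
    await page.screenshot({ path: 'grafana-dashboard.png' });
  } finally {
    // Close the browser even if navigation or the screenshot fails
    await browser.close();
  }
})();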

To configure Grafana and Prometheus, we might need to do something like the following:

  1. Deploy both:
- name: Deploy Prometheus and Grafana
  run: |
    kubectl apply -f path/to/prometheus-deployment.yaml
    kubectl apply -f path/to/grafana-deployment.yaml

  2. Configure Prometheus as a Grafana data source with a ConfigMap.

Example grafana-datasource-config.yaml:

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  # Grafana provisioning expects YAML files (.yaml/.yml) in this directory;
  # the JSON body below is also valid YAML.
  prometheus-datasource.yaml: |-
    {
      "apiVersion": 1,
      "datasources": [
        {
          "access": "proxy",
          "editable": true,
          "name": "Prometheus",
          "orgId": 1,
          "type": "prometheus",
          "url": "http://prometheus-service:9090",
          "version": 1
        }
      ]
    }

  3. Apply the ConfigMap to the cluster:
kubectl apply -f grafana-datasource-config.yaml # to push the config

  4. Update the Grafana deployment to use the ConfigMap:

In our Grafana deployment YAML, we will need to mount the ConfigMap as a volume. Grafana will automatically detect and use any datasource configurations found in /etc/grafana/provisioning/datasources.

Example snippet to add to the Grafana deployment YAML:

# under the pod spec (spec.template.spec)
volumes:
  - name: grafana-datasources
    configMap:
      name: grafana-datasources
# under the Grafana container (spec.template.spec.containers[])
volumeMounts:
  - name: grafana-datasources
    mountPath: /etc/grafana/provisioning/datasources

This mounts the ConfigMap in the Grafana pod at the directory where Grafana expects to find data source configurations.

  5. Lastly, port-forward the Grafana service:
- name: Port-forward Grafana
  run: kubectl port-forward svc/grafana 3000:80 &

So, in the end, it would be something like this, for example:

jobs:
  setup-and-capture:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up kind Cluster
        run: |
          # Add commands to set up kind cluster

      - name: Set up sample with Grafana plugin and deploy it
        run: |
          # Add commands to set up sample with Grafana plugin and deploy it

      - name: Setup Prometheus as Data source
        run: |
          # Add the code snippet for configuring Prometheus as a data source

      - name: Deploy Prometheus and Grafana
        run: |
          kubectl apply -f path/to/prometheus-deployment.yaml
          kubectl apply -f path/to/grafana-deployment.yaml

      - name: Port-forward Grafana
        run: |
          kubectl port-forward svc/grafana 3000:80 &
          sleep 10 # Waits for port-forward to establish

      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20' # recent Puppeteer releases need Node 18 or newer

      - name: Install Puppeteer
        run: npm install puppeteer

      - name: Capture Grafana Dashboard Screenshot
        run: node capture-screenshot.js

      - name: Upload Screenshot
        uses: actions/upload-artifact@v4 # v2 and v3 of this action are deprecated
        with:
          name: grafana-dashboard
          path: grafana-dashboard.png
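
The fixed `sleep 10` after the port-forward could be replaced by polling until Grafana actually answers. Below is a minimal sketch of such a retry helper; the name `waitUntilReady`, the retry counts, and the use of Grafana's `/api/health` endpoint on localhost:3000 are assumptions for illustration.

```javascript
// Call check() repeatedly until it resolves to a truthy value.
// Resolves with the number of attempts used; rejects once attempts run out.
async function waitUntilReady(check, { attempts = 30, delayMs = 1000 } = {}) {
  for (let i = 1; i <= attempts; i++) {
    try {
      if (await check()) return i;
    } catch (_) {
      // Errors such as "connection refused" just mean "not ready yet".
    }
    if (i < attempts) await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  throw new Error(`not ready after ${attempts} attempts`);
}

// Example check against Grafana's health endpoint.
// Assumption: the port-forward listens on localhost:3000; needs Node 18+ for fetch().
const grafanaHealthy = async () => (await fetch('http://localhost:3000/api/health')).ok;
```

In capture-screenshot.js this could run as `await waitUntilReady(grafanaHealthy)` before opening the page, so a slow cluster leads to a clear timeout error instead of an empty screenshot.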

@lingdie
Author

lingdie commented Nov 15, 2023

Using screenshots can indeed bypass issues related to domains and servers, but there are some additional details that need to be addressed:

  1. How can we increase the load on ctrl-runtime to make changes in the Grafana dashboard?
  2. The Grafana dashboard requires login, and this logic needs to be implemented before taking the screenshot.
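
For the login point, one option that avoids scripting the login form is to send basic-auth credentials with every request, or to enable anonymous access in the CI-only Grafana instance (for example through the `GF_AUTH_ANONYMOUS_ENABLED=true` environment variable). The sketch below covers the first option; the helper names, the `admin`/`admin` credentials, and the dashboard uid are placeholders assumed for this example.

```javascript
// Build an Authorization header for HTTP basic auth.
// Assumption: basic auth is enabled in the Grafana instance used by CI.
function basicAuthHeaders(user, password) {
  const token = Buffer.from(`${user}:${password}`).toString('base64');
  return { Authorization: `Basic ${token}` };
}

// Build a direct link to a dashboard with an explicit time range.
// Assumption: `uid` is the dashboard's uid; the range defaults are arbitrary.
function dashboardUrl(baseUrl, uid, { from = 'now-15m', to = 'now' } = {}) {
  const url = new URL(`/d/${encodeURIComponent(uid)}`, baseUrl);
  url.searchParams.set('from', from);
  url.searchParams.set('to', to);
  return url.toString();
}

console.log(dashboardUrl('http://localhost:3000', 'example-uid'));
console.log(basicAuthHeaders('admin', 'admin'));
```

With Puppeteer, the headers could be passed to `page.setExtraHTTPHeaders(...)` before calling `page.goto(...)` with the generated URL; whether this is enough depends on how the Grafana instance is configured.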

@Kavinjsir
Contributor

> @Kavinjsir For now, maybe you can provide us with a website or a video to preview this PR?

@lingdie Would it be helpful to preview the changes with the following resources from the PR?

  1. The updated docs, available for preview: https://deploy-preview-3690--kubebuilder.netlify.app/plugins/grafana-v1-alpha#unfinished-seconds
  2. The updated Grafana manifest in the testdata project: https://github.com/kubernetes-sigs/kubebuilder/blob/65b1201a83c5b48fd532140e43ed555a8deeb025/testdata/project-v4-with-grafana/grafana/controller-runtime-metrics.json

4 participants