
[Bug] Dashboards constantly refreshing in operator #1045

Closed
redminion0 opened this issue May 10, 2023 · 11 comments · Fixed by #1051
Labels
bug/critical Bug with a critical severity, breaking functionality triage/accepted Indicates an issue or PR is ready to be actively worked on. v5 A v5 specific issue/feature
Milestone

Comments

@redminion0

redminion0 commented May 10, 2023

Describe the bug

When using the v5.0.0-uid image for dashboard UID support, if multiple dashboards are imported, the operator gets stuck in a loop, constantly updating all dashboards.

Version
quay.io/weisdd/grafana-operator:v5.0.0-uid (#1027)
To Reproduce
Steps to reproduce the behavior:

Load the quay.io/weisdd/grafana-operator:v5.0.0-uid image,
then load in a large number of gzipJson dashboards (in our case 36).

or

Load dashboards one at a time (using the resource), then force the pods to move to another node.

Expected behavior
Dashboards are not constantly updated.


Screenshots
(screenshot attached)

  • Grafana Operator Version: v5.0.0-uid (test build, see #1027)
  • Environment: K8s


@redminion0 redminion0 added bug Something isn't working needs triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 10, 2023
@redminion0
Author

When comparing the JSON, the only change Grafana notices is the version field.
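To illustrate the observation above (a sketch, not code from the operator): two dashboard models that differ only in the server-managed version field compare as equal once that field is ignored. The payloads and helper name here are hypothetical.

```python
# Hypothetical dashboard payloads, identical except for the "version"
# field that Grafana bumps on every save.
stored = {"uid": "abc123", "title": "Node Exporter", "version": 7}
incoming = {"uid": "abc123", "title": "Node Exporter", "version": 8}

def same_except_version(a: dict, b: dict) -> bool:
    """Compare two dashboard models while ignoring the server-managed version."""
    strip = lambda d: {k: v for k, v in d.items() if k != "version"}
    return strip(a) == strip(b)

print(same_except_version(stored, incoming))  # True: only the version differs
```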

@NissesSenap
Collaborator

@weisdd seems like we have a fun UID overwrite issue.

@NissesSenap NissesSenap added the v5 A v5 specific issue/feature label May 11, 2023
@NissesSenap
Collaborator

@redminion0 thanks for reporting this issue. Could you provide an example of a dashboard that we can use to verify this?

@NissesSenap NissesSenap added this to the Version 5.0 milestone May 11, 2023
@weisdd
Collaborator

weisdd commented May 11, 2023

@redminion0 have you applied the CRDs from that PR as well? (There's a new status field to store UID)

@redminion0
Author

Hi @weisdd, yes, I have applied the CRDs.
After some more investigation, it goes like this:

adding dashboards one at a time

  1. Start from 0 dashboards in the cluster.
  2. Load dashboards one at a time (I left around 2 minutes per dashboard), in my case up to 36.
  3. Dashboards get the UIDs from their JSON (in this case gzipped and base64'd) correctly.
  4. Dashboards are resynced only at the resync period; all seems fine.
  5. Pods restart, and the operator goes into a loop of constantly reconciling/updating the dashboards.
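Step 3 mentions dashboards supplied as gzipped, base64-encoded JSON. A minimal sketch of what producing and unpacking such a payload looks like (the dashboard model is a placeholder; the encoding steps are the general gzip + base64 pattern, not code taken from the operator):

```python
import base64
import gzip
import json

# Hypothetical minimal dashboard model; real payloads are full Grafana dashboard JSON.
dashboard = {"uid": "node-exporter", "title": "Node Exporter"}

# Serialize, gzip, then base64-encode -- the shape of a gzipped dashboard payload.
encoded = base64.b64encode(gzip.compress(json.dumps(dashboard).encode())).decode()

# Decoding reverses the steps and recovers the UID embedded in the JSON.
decoded = json.loads(gzip.decompress(base64.b64decode(encoded)))
print(decoded["uid"])  # node-exporter
```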

mass adding dashboards

  1. Start from 0 dashboards in the cluster.
  2. Apply all 36 dashboards, and the loop starts immediately.

The error doesn't happen if smaller numbers are mass loaded at once.

If the operator is left in this state, we eventually get an error of:
dial tcp <IP>:3000: connect: cannot assign requested address

I will attempt to find the number of dashboards that causes the problem.

@redminion0
Author

redminion0 commented May 11, 2023

It seems to happen at around 18 dashboards (although I'm also thinking it might be size rather than volume). I can easily recreate it by applying all the dashboards from https://github.com/prometheus-operator/kube-prometheus/blob/main/manifests/grafana-dashboardDefinitions.yaml (although I have updated some of the panels, e.g. graph to time series).

@weisdd
Collaborator

weisdd commented May 11, 2023

@redminion0 Great to hear it's reproducible. Could you share it in the form of a step-by-step guide plus an archive with the manifests? (A full set of manifests would be useful for us to have the same resync timers and other settings.)

@weisdd weisdd added bug/critical Bug with a critical severity, breaking functionality and removed bug Something isn't working labels May 13, 2023
@weisdd
Collaborator

weisdd commented May 13, 2023

Alright, I think the repro in the archive should be enough to investigate it further:
dashboards-loop.zip

@weisdd weisdd added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels May 13, 2023
@weisdd
Collaborator

weisdd commented May 14, 2023

@redminion0 I've prepared a fix in #1051, the test image is here: quay.io/weisdd/grafana-operator:v5.0.0-correct-uid. Sorry for the bug and thanks for reporting it :)

@evgenii-denisov

evgenii-denisov commented Jun 11, 2023

Seems this problem still exists even in v5.0.0 with the Simple Dashboard from the basic example:

      image: ghcr.io/grafana-operator/grafana-operator:v5.0.0
The operator constantly creates a new version of the dashboard every 30 seconds. And it's the same problem: only the version changed. (screenshots attached)

Could you describe what to check, and what I should provide so you can reproduce it on your side?

@weisdd
Collaborator

weisdd commented Jun 11, 2023

@evgenii-denisov The behaviour you're seeing is not a bug and has nothing to do with the original issue that was fixed in #1051. The basic example contains resyncPeriod: 30s in its spec, which is why the operator re-uploads the dashboard every 30 seconds.
The mechanism itself is needed to cope with dashboard spec drift (e.g. due to users changing panels through the UI). Calculating diffs in this case would have been over-engineering (keys can come in a different order with different indentation, so everything would need to be represented in the same way before the "original" dashboard and the one that exists in Grafana could be compared), so the operator simply re-uploads the dashboard. The default interval is 5 minutes, though you're free to choose any other value or even disable resync :)
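For reference, the resync interval is set per dashboard resource. A sketch of where the field lives (the apiVersion and field name follow the v5 operator's GrafanaDashboard kind; the metadata, selector labels, and JSON body are placeholders):

```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaDashboard
metadata:
  name: simple-dashboard
spec:
  resyncPeriod: 5m        # the default interval; the basic example sets 30s
  instanceSelector:
    matchLabels:
      dashboards: "grafana"
  json: >
    {
      "title": "Simple Dashboard"
    }
```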
