
[Bug] Go panic when syncing many dashboards #1498

Closed
smuda opened this issue Apr 17, 2024 · 4 comments · Fixed by #1504
Labels
bug (Something isn't working), triage/accepted (Indicates an issue or PR is ready to be actively worked on)

Comments

@smuda
Contributor

smuda commented Apr 17, 2024

Describe the bug

Sometimes there is a Go panic which crashes the pod. It only seems to occur when there are many dashboards to sync (>60). When the pod is automatically restarted, a number of the dashboards are already synced and the sync can then be completed.

The panic log
2024-04-17T07:59:38Z	INFO	GrafanaDashboardReconciler	found matching Grafana instances for dashboard	{"controller": "grafanadashboard", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDashboard", "GrafanaDashboard": {"name":"xxxxxxxxxx","namespace":"f"}, "namespace": "xxxxxxx", "name": "xxxxxxxxxx", "reconcileID": "e616b68b-b337-468e-a3c1-57fe1b3052c2", "count": 1}
2024-04-17T08:01:53Z	INFO	Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler	{"controller": "grafanadashboard", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDashboard", "GrafanaDashboard": {"name":"yyyyyyyy","namespace":"xxxxxxxxxx"}, "namespace": "xxxxxxxxxx", "name": "yyyyyyy", "reconcileID": "778ab291-2559-47e8-aa34-bf2b96f0f94f"}
2024-04-17T08:01:53Z	ERROR	Reconciler error	{"controller": "grafanadashboard", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDashboard", "GrafanaDashboard": {"name":"yyyyyyyy","namespace":"xxxxxxxxxx"}, "namespace": "xxxxxxxxxx", "name": "yyyyyyyy", "reconcileID": "778ab291-2559-47e8-aa34-bf2b96f0f94f", "error": "Operation cannot be fulfilled on grafanas.grafana.integreatly.org \"grafana-internal\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227
2024-04-17T08:01:53Z	INFO	Warning: Reconciler returned both a non-zero result and a non-nil error. The result will always be ignored if the error is non-nil and the non-nil error causes reqeueuing with exponential backoff. For more details, see: https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/reconcile#Reconciler	{"controller": "grafanadashboard", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDashboard", "GrafanaDashboard": {"name":"overview.json","namespace":"xxxxxxxxxx"}, "namespace": "xxxxxxxxxx", "name": "overview.json", "reconcileID": "64637408-0770-44a9-a2ce-8e9b436787d5"}
2024-04-17T08:01:53Z	ERROR	Reconciler error	{"controller": "grafanadashboard", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDashboard", "GrafanaDashboard": {"name":"overview.json","namespace":"xxxxxxxxxx"}, "namespace": "xxxxxxxxxx", "name": "overview.json", "reconcileID": "64637408-0770-44a9-a2ce-8e9b436787d5", "error": "Operation cannot be fulfilled on grafanas.grafana.integreatly.org \"grafana-internal\": the object has been modified; please apply your changes to the latest version and try again"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227
2024-04-17T08:01:53Z	INFO	Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference	{"controller": "grafanadashboard", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDashboard", "GrafanaDashboard": {"name":"daotimes.json","namespace":"xxxxxxxxxx"}, "namespace": "xxxxxxxxxx", "name": "daotimes.json", "reconcileID": "ad67afed-50b1-4992-a36d-28a2a94e18ff"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x194ada2]

goroutine 219 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:116 +0x1e5
panic({0x1bac840?, 0x33e1ad0?})
	runtime/panic.go:914 +0x21f
github.com/grafana/grafana-openapi-client-go/client/dashboards.(*GetDashboardByUIDOK).GetPayload(...)
	github.com/grafana/grafana-openapi-client-go@v0.0.0-20240215164046-eb0e60d27cb7/client/dashboards/get_dashboard_by_uid_responses.go:117
github.com/grafana/grafana-operator/v5/controllers.(*GrafanaDashboardReconciler).onDashboardDeleted(0xc0006b0800, {0x22762b8, 0xc003f25bf0}, {0xc000d04660, 0xb}, {0xc000d04650, 0xd})
	github.com/grafana/grafana-operator/v5/controllers/dashboard_controller.go:312 +0x2e2
github.com/grafana/grafana-operator/v5/controllers.(*GrafanaDashboardReconciler).Reconcile(0xc0006b0800, {0x22762b8, 0xc003f25bf0}, {{{0xc000d04660?, 0x5?}, {0xc000d04650?, 0xc000c55d08?}}})
	github.com/grafana/grafana-operator/v5/controllers/dashboard_controller.go:183 +0x2ef
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x22798e8?, {0x22762b8?, 0xc003f25bf0?}, {{{0xc000d04660?, 0xb?}, {0xc000d04650?, 0x0?}}})
	sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:119 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc00043cdc0, {0x22762f0, 0xc0002d8c30}, {0x1c86260?, 0xc009584cc0?})
	sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:316 +0x3cc
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc00043cdc0, {0x22762f0, 0xc0002d8c30})
	sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266 +0x1af
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 83
	sigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:223 +0x565

Version
5.8.0

To Reproduce

It seems hard to reproduce, but having many dashboards helps. To trigger the problem, I restart the Grafana pod (so it starts empty) and then restart the operator pod after a few seconds. This triggers a full sync of all the dashboards and datasources.

Expected behavior
Not a crashing pod. :-)

Suspect component/Location where the bug might be occurring
The panic points to github.com/grafana/grafana-operator/v5/controllers/dashboard_controller.go:312, but I'd assume the problem originates somewhere else, perhaps even in github.com/grafana/grafana-openapi-client-go.
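For reference, this is the generic Go failure mode that matches the trace: a pointer-receiver method like GetPayload dereferences its receiver, so calling it on a nil response pointer panics. The snippet below is a minimal, self-contained illustration with made-up types, not the actual client or operator code.

```go
package main

import "fmt"

// response is a stand-in for a generated client response type such as
// GetDashboardByUIDOK; the names here are illustrative only.
type response struct {
	payload string
}

// GetPayload dereferences its receiver, so calling it on a nil *response
// panics with "invalid memory address or nil pointer dereference".
func (r *response) GetPayload() string {
	return r.payload
}

func main() {
	var resp *response             // e.g. the client returned (nil, err) and err was not handled
	fmt.Println(resp.GetPayload()) // panics here
}
```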

Runtime (please complete the following information):

  • Grafana Operator Version 5.8.0
  • Environment: Openshift 4.12
  • Deployment type: deployed
  • Grafana is running in its own deployment and grafana-operator treats it as an external Grafana
@smuda added the bug and needs triage labels on Apr 17, 2024
@smuda changed the title from "[Bug] Go panic" to "[Bug] Go panic when syncing many dashboards" on Apr 18, 2024
@weisdd
Collaborator

weisdd commented Apr 23, 2024

That's certainly something interesting to investigate, thanks for reporting!

@pb82
Collaborator

pb82 commented Apr 23, 2024

The issue could be the error handling here: if the error is GetDashboardByUIDNotFound, we don't return, but instead continue on to call resp.GetPayload().

This happens in onDashboardDeleted, so if the dashboard isn't found, we can return without an error here.
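A minimal sketch of the fix described above, using made-up stand-in types (the real client types and the operator's onDashboardDeleted differ): bail out on the NotFound error before touching the response, which is nil in that case.

```go
package main

import (
	"errors"
	"fmt"
)

// notFoundErr stands in for dashboards.GetDashboardByUIDNotFound; illustrative only.
type notFoundErr struct{}

func (e *notFoundErr) Error() string { return "dashboard not found" }

// response stands in for dashboards.GetDashboardByUIDOK.
type response struct{ payload string }

func (r *response) GetPayload() string { return r.payload }

// getDashboardByUID simulates the client call failing with NotFound.
func getDashboardByUID(uid string) (*response, error) {
	return nil, &notFoundErr{}
}

// onDashboardDeleted sketches the fix: return early on NotFound instead of
// falling through to GetPayload on a nil response.
func onDashboardDeleted(uid string) error {
	resp, err := getDashboardByUID(uid)
	if err != nil {
		var nf *notFoundErr
		if errors.As(err, &nf) {
			return nil // dashboard is already gone, nothing to clean up
		}
		return err
	}
	// Only reached when err == nil, so resp is non-nil here.
	fmt.Println(resp.GetPayload())
	return nil
}

func main() {
	fmt.Println(onDashboardDeleted("some-uid")) // prints <nil>
}
```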

@pb82 added the triage/accepted label and removed the needs triage label on Apr 23, 2024
@smuda
Contributor Author

smuda commented Apr 23, 2024

Probably a stupid question, but since this occurred during the initial population of the dashboards into Grafana, why is onDashboardDeleted being called?

@weisdd
Collaborator

weisdd commented Apr 23, 2024

@smuda Well, it's hard to say without seeing it in a lab, but it could potentially happen in at least two cases:

  • the dashboard uid got updated; in that case you should see the log message "dashboard uid got updated, deleting dashboards with the old uid";
  • delays in etcd synchronization (= stale reads) if you have multiple replicas. When the dashboard controller receives an event, it tries to fetch the CR from the API server, and if the API server replies that the CR does not exist, onDashboardDeleted gets called to clean up the dashboard from the Grafana instances (see the sketch after this list). etcd has a concept of quorum reads, and I'm not sure whether the API server relies on it when interacting with etcd.
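A minimal, self-contained sketch of the routing described in the second bullet (an assumed simplification, not the operator's actual Reconcile code): a NotFound answer from the API server is what sends the controller down the onDashboardDeleted path.

```go
package main

import (
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

// getCR stands in for fetching the GrafanaDashboard CR from the API server;
// here it always answers NotFound, e.g. after a deletion or a stale read.
func getCR(namespace, name string) error {
	gr := schema.GroupResource{Group: "grafana.integreatly.org", Resource: "grafanadashboards"}
	return apierrors.NewNotFound(gr, name)
}

// reconcile sketches the routing described above: a NotFound answer from the
// API server is what triggers the onDashboardDeleted cleanup path.
func reconcile(namespace, name string) error {
	if err := getCR(namespace, name); err != nil {
		if apierrors.IsNotFound(err) {
			fmt.Println("CR not found -> onDashboardDeleted cleans it up in Grafana")
			return nil
		}
		return err
	}
	// ... normal create/update path ...
	return nil
}

func main() {
	_ = reconcile("monitoring", "overview.json")
}
```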
