[Bug] Setting a proxy can break the reconciler #1295

Closed · bheading opened this issue Nov 5, 2023 · 9 comments · Fixed by #1300

Labels: bug (Something isn't working) · triage/accepted (Indicates an issue or PR is ready to be actively worked on.)

bheading commented Nov 5, 2023

Description of the bug
When a proxy is set in the operator (i.e. by setting HTTP_PROXY/HTTPS_PROXY/NO_PROXY in the Operator deployment), the Operator's interactions with the Grafana API may be routed through the proxy, causing reconciliation to fail. This happens when the hostname of the Grafana API endpoint is not matched by the NO_PROXY setting.

Typically, users in disconnected environments (where using a proxy to reach the internet is mandatory) will have a standard set of proxy settings that they apply to each of their namespaces. Since the Grafana operator does not use the FQDN of the Grafana service when accessing the Grafana API, it is unlikely that the NO_PROXY setting will exempt those API calls from proxying.
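
For illustration, the matching behaviour can be reproduced with Go's golang.org/x/net/http/httpproxy package, which implements the same NO_PROXY rules as http.ProxyFromEnvironment. A minimal sketch using the values from this report (the snippet is only a demonstration, not operator code):

package main

import (
	"fmt"
	"net/url"

	"golang.org/x/net/http/httpproxy"
)

func main() {
	// Same NO_PROXY matching rules as http.ProxyFromEnvironment.
	cfg := &httpproxy.Config{
		HTTPProxy: "http://192.168.10.36:3128",
		NoProxy:   ".svc.cluster.local,10.0.0.0/8",
	}
	proxy := cfg.ProxyFunc()

	for _, raw := range []string{
		"http://grafana-a-service.grafana:3000/api/datasources",                   // short service name
		"http://grafana-a-service.grafana.svc.cluster.local:3000/api/datasources", // FQDN
	} {
		u, _ := url.Parse(raw)
		p, _ := proxy(u)
		// The short name does not match ".svc.cluster.local" and is proxied;
		// the FQDN matches the suffix and bypasses the proxy (p is nil).
		fmt.Printf("%s -> proxy=%v\n", raw, p)
	}
}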

This problem appears to have been introduced with the integration of #1286.

Version
v5.4.2

To Reproduce
I originally observed the issue on an OpenShift cluster, but separately investigated how to reproduce it using kind. I'm using Fedora 38.

  1. Clone this repo.
  2. Set up a suitable proxy service to test with - I installed the standard squid package and tweaked the configuration to accept incoming requests from the kind cluster. I configured mine to listen on 192.168.10.36:3128.
  3. Create the kind cluster: kind create cluster
  4. Set up the operator: kubectl create -k deploy/kustomize/overlays/namespace_scoped. This creates the grafana namespace and installs the operator/CRDs there.
  5. Change namespace for convenience: kubectl config set-context --current --namespace grafana
  6. Configure the operator's proxy settings as follows (the values may need to be adjusted for your local config):
kubectl set env deploy grafana-operator-controller-manager NO_PROXY=".svc.cluster.local,10.0.0.0/8"
kubectl set env deploy grafana-operator-controller-manager HTTP_PROXY="http://192.168.10.36:3128"
kubectl set env deploy grafana-operator-controller-manager HTTPS_PROXY="http://192.168.10.36:3128"
  7. Finally, create the default Grafana resources: kubectl -n grafana apply -k config/samples

Observed behaviour

The proxy (which is running outside the cluster) begins receiving Grafana API calls, as follows:

1699223641.075     33 172.18.0.2 TCP_MISS_ABORTED/503 4269 GET http://grafana-a-service.grafana:3000/api/datasources - HIER_NONE/- text/html
1699223641.075     65 172.18.0.2 TCP_MISS_ABORTED/503 4285 GET http://grafana-a-service.grafana:3000/api/folders? - HIER_NONE/- text/html

The Operator produces logs as follows (I've edited the output for brevity/readability, as the squid proxy returns a chunk of HTML):

2023-11-05T22:34:01Z	ERROR	GrafanaDatasourceReconciler	error reconciling datasource	{"controller": 
"grafanadatasource", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDatasource", "GrafanaDatasource": 
{"name":"grafanadatasource-sample","namespace":"grafana"}, "namespace": "grafana", "name": "grafanadatasource-sample", 
"reconcileID": "47b3f19a-e581-4eba-ac79-d63db58b5fd8", "datasource": "grafanadatasource-sample", "grafana": "grafana-a", 
"error": "status: 503, body: <!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01//EN\" \"http://www.w3.org/TR/html4/strict.dtd
\">\n<html><head>\n<meta type=\"copyright\" 

The following error was encountered while trying to retrieve the URL: <a href=\"http://grafana-a-service.grafana:3000
/api/datasources\">http://grafana-a-service.grafana:3000/api/datasources</a></p>\n\n<blockquote id=\"error\">\n<p>
<b>Unable to determine IP address from host name <q>grafana-a-service.grafana</q></b></p>\n</blockquote>\n\n<p>The DNS 
server returned:</p>\n<blockquote id=\"data\">\n<pre>Name Error: The domain name does not exist.</pre>\n</blockquote>
\n\n<p>This means that the cache was not able to resolve the hostname presented in the URL. Check if the address is correct.
</p>\n\

Expected behavior
The Operator should not route Grafana API calls via the proxy.

If the running version of the operator is reverted to v5.3.0, the problem goes away, most likely because the proxy support is not present in that version.

One workaround is to set NO_PROXY to include the namespace (e.g. .grafana in the above example). However, this does not seem satisfactory: if the proxy settings are common and/or global, every namespace will need to be included.

Alternatively, the code could be changed so that Grafana API calls are never proxied. That will work provided there are no use cases where the Grafana API lives outside the cluster.
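
A minimal sketch of what that option could look like, assuming the operator builds its own http.Transport for the Grafana client (an illustration, not the operator's actual code); leaving Transport.Proxy nil means HTTP(S)_PROXY is never consulted:

import (
	"crypto/tls"
	"net/http"
	"time"
)

// newGrafanaHTTPClient is a hypothetical helper: with Proxy left nil,
// the transport never consults HTTP_PROXY/HTTPS_PROXY.
func newGrafanaHTTPClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			Proxy:           nil, // never route Grafana API calls via a proxy
			TLSClientConfig: &tls.Config{},
		},
		Timeout: 10 * time.Second,
	}
}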

A third option would be to use the FQDN of the Grafana service, so that the URL in the above example would become http://grafana-a-service.grafana.svc.cluster.local:3000. The .svc, .svc.cluster and .svc.cluster.local suffixes are more likely to be present in a globally-configured NO_PROXY setting. However, that won't work for clusters where the suffix has been changed from the default; some way to detect the suffix might help here.
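
A sketch of how the FQDN could be built with an overridable suffix; CLUSTER_DOMAIN is a hypothetical environment variable name used purely for illustration:

import (
	"fmt"
	"os"
)

// grafanaAPIURL is a hypothetical helper that builds the service FQDN,
// letting clusters with a non-default domain override it.
func grafanaAPIURL(service, namespace string, port int) string {
	domain := os.Getenv("CLUSTER_DOMAIN") // hypothetical override
	if domain == "" {
		domain = "cluster.local"
	}
	return fmt.Sprintf("http://%s.%s.svc.%s:%d", service, namespace, domain, port)
}

For the example above, grafanaAPIURL("grafana-a-service", "grafana", 3000) would yield http://grafana-a-service.grafana.svc.cluster.local:3000.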

Suspect component/Location where the bug might be occurring

Introducing proxy support is likely to have led to this problem, see #1286.

Runtime (please complete the following information):

  • OS: Linux (Fedora 38)
  • Grafana Operator Version: v5.4.2
  • Environment: kind; also observed on OpenShift when the operator auto-updated to v5.4.2
  • Deployment type: deployed
bheading added the bug (Something isn't working) and needs triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels on Nov 5, 2023
bheading (Author) commented Nov 5, 2023

Noted that the risk of this problem occurring was mentioned by @weisdd in the discussion of #1286.

The workaround of setting NO_PROXY in each namespace where the Grafana operator is used is not a great one, especially on clusters where the proxy setting is propagated to each namespace via global configuration.

I wonder if it might be workable to add an env var, e.g. NO_PROXY_GRAFANA_API=true|false?
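
A minimal sketch of how such an opt-out could be wired in, assuming the transport's proxy function is chosen at client construction time; NO_PROXY_GRAFANA_API is the variable proposed above, not an existing operator setting:

import (
	"net/http"
	"net/url"
	"os"
	"strconv"
)

// proxyFunc returns nil when the proposed opt-out is set, so the
// transport never consults the proxy environment variables.
func proxyFunc() func(*http.Request) (*url.URL, error) {
	if skip, _ := strconv.ParseBool(os.Getenv("NO_PROXY_GRAFANA_API")); skip {
		return nil // nil Proxy means: direct connection, no proxy
	}
	return http.ProxyFromEnvironment
}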

m-kay commented Nov 7, 2023

I think the best option is to call the API via the URL with the .svc.cluster.local suffix. For clusters that have a different suffix configured, an env variable could be provided to override the default suffix.

It would be great to get a fix for this soon, because users like me, who install the operator from operatorhub.io, have no way to go back to the old version and are blocked at the moment.

bheading (Author) commented Nov 7, 2023

@m-kay, a couple of ideas:

  1. Specify the suffix, as you say.
  2. Discover the suffix. I don't know if this is possible, but I would have thought so.
  3. We could simply disable the proxy in the Grafana API client if isExternal() is false (sketched below). I can't think of any valid use case where you'd have an internal Grafana instance that you want to control via a proxy.

2 and 3 seem cleaner to me, as the user doesn't have to change anything.
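
A minimal sketch of option 3, assuming a helper builds the transport per instance; isExternal mirrors the check mentioned above, and the exact wiring is an assumption, not the operator's actual code:

import (
	"crypto/tls"
	"net/http"
)

// newTransport only consults the proxy environment variables when the
// Grafana instance is external to the cluster.
func newTransport(isExternal bool) *http.Transport {
	t := &http.Transport{TLSClientConfig: &tls.Config{}} // Proxy nil: direct connection
	if isExternal {
		t.Proxy = http.ProxyFromEnvironment
	}
	return t
}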

BTW if you are blocked, you can override HTTP(S)_PROXY in your operator config to "". That should get you past the problem.

m-kay commented Nov 7, 2023

Unfortunately I'm not able to override HTTP(S)_PROXY when installing the operator via operatorhub because the global proxy configuration is overriding my operator configuration.

Option 3 would definitely fix the problem. However, after reading the k8s docs, I think the correct DNS name would be the one including the suffix. If it is possible to discover that suffix somehow, it would certainly be cleaner to use the discovered suffix. Maybe the mounted /etc/resolv.conf file could be used for that.
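
A sketch of what that discovery could look like, assuming the pod's /etc/resolv.conf contains the usual in-cluster search line (e.g. "search grafana.svc.cluster.local svc.cluster.local cluster.local"); this is an illustration, not code from the operator:

import (
	"fmt"
	"os"
	"strings"
)

// clusterDomain derives the cluster DNS suffix from the pod's resolv.conf
// by looking for the "svc.<domain>" entry on the search line.
func clusterDomain() (string, error) {
	data, err := os.ReadFile("/etc/resolv.conf")
	if err != nil {
		return "", err
	}
	for _, line := range strings.Split(string(data), "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 || fields[0] != "search" {
			continue
		}
		for _, d := range fields[1:] {
			// e.g. "svc.cluster.local" carries the suffix we want.
			if rest, ok := strings.CutPrefix(d, "svc."); ok {
				return rest, nil // e.g. "cluster.local"
			}
		}
	}
	return "", fmt.Errorf("no svc search domain in /etc/resolv.conf")
}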

pb82 added the triage/accepted label (Indicates an issue or PR is ready to be actively worked on.) and removed the needs triage label on Nov 7, 2023
weisdd (Collaborator) commented Nov 7, 2023

@bheading @m-kay Please take a look at #1300 and let us know if it meets your expectations. It basically follows what @bheading suggested in point 3 (use the proxy for external instances).

A test image is published at quay.io/weisdd/grafana-operator:v5.4.2-proxy-external; you can also build it from the branch, e.g.:

export KO_DOCKER_REPO=quay.io/weisdd/grafana-operator
ko build --sbom=none --bare --platform linux/arm64,linux/arm/v7,linux/amd64 -t v5.4.2-proxy-external

weisdd self-assigned this on Nov 7, 2023
m-kay commented Nov 7, 2023

Thanks @weisdd, yes this looks good to me.

bheading (Author) commented Nov 7, 2023

In principle it looks good, @weisdd. I hope it's OK that I left one comment on a minor coding-style point on the MR; feel free to reject it if it's not appropriate.

However, when I tried your container, it crashed (see the stack trace below). I am guessing this is because the TLSClientConfig field is not initialized. That problem should be solved if you accept my coding-style change ;)
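
For illustration, a minimal Go sketch of the suspected failure mode; this is an assumption based on the trace, not the operator's exact code:

import (
	"crypto/tls"
	"net/http"
)

func demo() {
	t := &http.Transport{} // TLSClientConfig is nil at this point
	// Dereferencing the nil field panics with SIGSEGV:
	// t.TLSClientConfig.InsecureSkipVerify = true // nil pointer dereference
	t.TLSClientConfig = &tls.Config{}           // initialize first...
	t.TLSClientConfig.InsecureSkipVerify = true // ...then mutate safely
}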


2023-11-07T21:33:53Z	INFO	Starting workers	{"controller": "grafanadatasource", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDatasource", "worker count": 1}
2023-11-07T21:33:53Z	INFO	Starting workers	{"controller": "grafanafolder", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaFolder", "worker count": 1}
2023-11-07T21:33:53Z	INFO	Starting workers	{"controller": "grafanadashboard", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDashboard", "worker count": 1}
2023-11-07T21:33:54Z	INFO	GrafanaDashboardReconciler	found matching Grafana instances for dashboard	{"controller": "grafanadashboard", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDashboard", "GrafanaDashboard": {"name":"grafanadashboard-sample","namespace":"grafana"}, "namespace": "grafana", "name": "grafanadashboard-sample", "reconcileID": "c814fce3-0b8c-4437-9981-0db7af73c204", "count": 1}
2023-11-07T21:33:54Z	INFO	GrafanaDatasourceReconciler	found matching Grafana instances for datasource	{"controller": "grafanadatasource", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDatasource", "GrafanaDatasource": {"name":"grafanadatasource-sample","namespace":"grafana"}, "namespace": "grafana", "name": "grafanadatasource-sample", "reconcileID": "f61966b3-e28e-4a3d-aea9-6a6a3291b5ce", "count": 1}
2023-11-07T21:33:54Z	INFO	Observed a panic in reconciler: runtime error: invalid memory address or nil pointer dereference	{"controller": "grafanadatasource", "controllerGroup": "grafana.integreatly.org", "controllerKind": "GrafanaDatasource", "GrafanaDatasource": {"name":"grafanadatasource-sample","namespace":"grafana"}, "namespace": "grafana", "name": "grafanadatasource-sample", "reconcileID": "f61966b3-e28e-4a3d-aea9-6a6a3291b5ce"}
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0xa0 pc=0x142309c]

goroutine 258 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:116 +0x1e5
panic({0x16e5c00?, 0x27c7130?})
	runtime/panic.go:914 +0x21f
github.com/grafana-operator/grafana-operator/v5/controllers/client.NewInstrumentedRoundTripper({0xc00075e3e0, 0x9}, 0xc000124038, 0x0)
	github.com/grafana-operator/grafana-operator/v5/controllers/client/round_tripper.go:26 +0x7c
github.com/grafana-operator/grafana-operator/v5/controllers/client.NewGrafanaClient({0x1bb6128?, 0xc000a8e810?}, {0x1bbe280?, 0xc0001287e0?}, 0xc0005b21e0)
	github.com/grafana-operator/grafana-operator/v5/controllers/client/grafana_client.go:146 +0xb9
github.com/grafana-operator/grafana-operator/v5/controllers.(*GrafanaDatasourceReconciler).onDatasourceCreated(0xc0005d2210, {0x1bb6128, 0xc000a8e810}, 0xc0005b21e0, 0xc0002808c0, 0xc00073bba0, {0xc000278f00, 0x40})
	github.com/grafana-operator/grafana-operator/v5/controllers/datasource_controller.go:319 +0xbc
github.com/grafana-operator/grafana-operator/v5/controllers.(*GrafanaDatasourceReconciler).Reconcile(0xc0005d2210, {0x1bb6128, 0xc000a8e810}, {{{0xc0007c4be6?, 0x5?}, {0xc00080c1f8?, 0xc000511d48?}}})
	github.com/grafana-operator/grafana-operator/v5/controllers/datasource_controller.go:242 +0xfae
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x1bb9048?, {0x1bb6128?, 0xc000a8e810?}, {{{0xc0007c4be6?, 0xb?}, {0xc00080c1f8?, 0x0?}}})
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:119 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0005c03c0, {0x1bb6160, 0xc000358d20}, {0x1763980?, 0xc0002680a0?})
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:316 +0x3c5
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0005c03c0, {0x1bb6160, 0xc000358d20})
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:266 +0x1c9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:227 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 186
	sigs.k8s.io/controller-runtime@v0.16.3/pkg/internal/controller/controller.go:223 +0x565

HVBE closed this as completed in #1300 on Nov 8, 2023
m-kay commented Nov 8, 2023

Is there a chance to get this released as a hotfix?

weisdd (Collaborator) commented Nov 8, 2023

@bheading thanks, I fixed it in #1302.
