
[cinder-csi-plugin] Apparently-completed connections to Keystone and Cinder fail with "i/o timeout" #1387

Closed
jralbert opened this issue Jan 26, 2021 · 23 comments · Fixed by #1399

Comments

@jralbert
Contributor

/kind bug

What happened:
Some of the time, requests to the OpenStack APIs are successful, and with plugin debug logging turned way up I can see that the plugin is sometimes able to connect to Keystone, get a token, and then connect to Cinder to list and create volumes. However, most of the time connections to either Keystone or Cinder will appear to retry several times at 10-second intervals, and eventually fail with a message like:

I0120 17:39:17.616376 1 openstack.go:581] OpenStack connection error, retries exhausted. Aborting
W0120 17:39:17.616511 1 main.go:108] Failed to GetOpenStackProvider: Post "https://<CLOUD_DOMAIN>:5000/v3/auth/tokens": OpenStack connection error, retries exhausted. Aborting. Last error was: dial tcp <CLOUD_IP>:5000: i/o timeout

What you expected to happen:
For connections to the OpenStack API services to succeed as they do with other OpenStack clients, even with cURL from inside the cinder-csi-plugin containers.

How to reproduce it:
Just starting the plugin will cause the failure condition, almost immediately. This will result in containers exiting and going into CrashLoopBackoff, although enough connections succeed that if left to restart itself for long enough, volume actions will eventually get performed.

Anything else we need to know?:
This result occurs with both POSTs and GETs, and requests that often fail still do sometimes succeed.

Since I'm also in control of the OpenStack cloud infrastructure, I've been able to confirm in the API server logs that the requests being retried have in fact been received and responded to by the API servers, all with 2xx or 3xx response codes and with response times well under 10 seconds (most GETs are under 1 second), which leads me to think that there's not a genuine timeout condition here.

Basic connectivity testing with cURL inside the container where the plugin is running confirms that there's no difficulty or delay in querying the API server endpoints; tcpdump on the Kubernetes worker node hosting the container shows complete TCP conversations between the container and the API servers, with no indication of RSTs or broken connections.

Because I initially thought this must be an issue with the underlying gophercloud implementation that cinder-csi-plugin relies on, I first reported this bug on that project (gophercloud/gophercloud#2100) and had some good conversation with the project team about where this problem might arise, which ultimately ended up in the suggestion to open an issue on this repository.

Environment:

  • openstack-cloud-controller-manager (or other related binary) version: 1.20
  • OpenStack version: Stein
@kayrus
Contributor

kayrus commented Jan 26, 2021

relates to #1207

@kayrus
Contributor

kayrus commented Jan 27, 2021

Proposal:

  • introduce new config options to control transport timeouts
  • introduce a more fine-grained connection tracer for the HTTP client

@jichenjc
Contributor

Some of the time, requests to the OpenStack APIs are successful, and with plugin debug logging turned way up I can see that the plugin is sometimes able to connect to Keystone, get a token, and then connect to Cinder to list and create volumes. However, most of the time connections to either Keystone or Cinder will appear to retry several times at 10-second intervals, and eventually fail with a message like:

Just to confirm my understanding: the code normally works fine, but with debug logging enabled you can see that from time to time the timeout issue occurs and leads to retries? And the suggestion is therefore to make the transport timeouts, etc., configurable?

@jralbert
Contributor Author

No, the code produces the frequent-failure condition at all times, and will return dial tcp <CLOUD_IP>:5000: i/o timeout in the container logs and enter CrashLoopBackoff state frequently; with debug logging turned up I'm able to see that there are some successful operations in between the failures.

I'm hopeful that @kayrus's suggestion of improving visibility into the HTTP client and allowing tuning of the transport timeouts will allow me both to see in more detail what's causing this error and to tune the plugin to avoid it.

@jichenjc
Contributor

ok, let's wait for additional help and comments from @kayrus :), thanks for the info~

@ramineni
Contributor

@jralbert Thanks for reporting the issue.
A few questions were left unanswered on the gophercloud issue as well, so I'm raising them here to better understand the scenario:

  1. Does this behaviour appear as soon as the plugin starts, or does it take some time to begin? In other words, if you restart the plugin, are 100% of the connections successful up until a certain point in time? And if so, approximately how long does it take to exhibit this behaviour?
  2. If you're hitting HAProxy, is it possible that the failed requests are going to a specific node (or nodes) configured differently from the other nodes? In other words, is the issue seen only when a request hits a particular node? As you mentioned, it sometimes works and sometimes doesn't, which makes me wonder about that.
  3. Have you ruled out HAProxy interference? When you say you have tested with curl, were you hitting the HAProxy service or going directly to an API server?
    Are you able to have Kubernetes bypass HAProxy? This would be the best way to rule out HAProxy interfering.

@jralbert
Contributor Author

Hi @ramineni, thanks for following up! Answering in order:

  1. The behavior appears as soon as the plugin starts. Sometimes some requests succeed (i.e., sometimes the Keystone connection will succeed but then the Cinder connections will fail), and very occasionally all the requests for a run of the plugin will succeed such that an outstanding mount or volume creation finally happens, but failures will continue to happen thereafter.
  2. I operate the OpenStack cloud and its HAProxy. I can confirm that requests are going to all of the backend API servers, and reviewing the logs at the API servers I can confirm that those servers are successfully receiving and processing the requests (success HTTP codes for each one).
  3. When testing with cURL, I tried to replicate what the plugin would experience, making connections to the HAProxy front end of the Keystone and Cinder services. I'm not able to bypass HAProxy, as the backend nodes are intentionally not reachable from the outside, but I have thousands of unique users operating various OpenStack clients (python-openstack, Terraform, etc) through that same front-end and have had no indication of similar issues elsewhere, so I don't think HAProxy is causing the issue.

@ramineni
Contributor

ramineni commented Feb 2, 2021

@jralbert Thanks for the info.
Is the livenessprobe enabled in the manifests? If yes, could you disable it and check whether everything works fine? Alternatively, could you increase the timeout of the liveness probe and check.
https://github.com/kubernetes/cloud-provider-openstack/blob/master/manifests/cinder-csi-plugin/cinder-csi-controllerplugin.yaml#L121
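For illustration, increasing the probe timeouts amounts to relaxing the livenessProbe block of the plugin container along these lines. This is a minimal sketch: the field names are standard Kubernetes probe fields, but the baseline values and the probe port name depend on the manifest/chart version in use.

```yaml
# Sketch only: relax the liveness probe so slow OpenStack API calls
# don't get the container restarted mid-retry. Values are illustrative.
livenessProbe:
  httpGet:
    path: /healthz
    port: healthz          # port name/number may differ per manifest version
  failureThreshold: 50     # e.g. roughly 10x the shipped default
  timeoutSeconds: 30       # give the probe endpoint more time to respond
  periodSeconds: 100       # probe less frequently
  initialDelaySeconds: 10
```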

@jralbert
Contributor Author

jralbert commented Feb 2, 2021

@ramineni We're installing from the Helm chart, and the livenessprobe is not configurable from values, so we've been running using the settings which are defined in the chart (https://github.com/kubernetes/cloud-provider-openstack/blob/master/charts/cinder-csi-plugin/templates/controllerplugin-statefulset.yaml#L114 and https://github.com/kubernetes/cloud-provider-openstack/blob/master/charts/cinder-csi-plugin/templates/nodeplugin-daemonset.yaml#L81).

I pulled a local copy of the chart to install from, and manually updated the templates to just add a zero after each of the failure_threshold, timeout_seconds, and period_seconds, then re-installed from the updated chart - and the plugin has been running stably for 30 minutes, which is 30 minutes longer than I ever got it to run previously. I'll do some further testing to actually create/modify volumes, but this does appear to have solved the problem!

If you're willing to indulge me a bit further, I'd love to understand why that's happened: my understanding is that the livenessprobe checks if the CSI socket is ready, and if not triggers a restart of the container; but what I saw was the plugin program itself throwing errors and eventually exiting with a non-zero RC. Having changed the livenessprobe settings, I no longer see the errors, and the plugin runs without exiting. Am I misunderstanding the relationship between these components?

As a separate point, it would be wonderful to be able to configure the livenessprobe timeouts from Helm chart values; if that's something the project maintainers would consider, I'd be happy to create a pull request to implement it.

@jichenjc
Contributor

jichenjc commented Feb 3, 2021

As a separate point, it would be wonderful to be able to configure the livenessprobe timeouts from Helm chart values; if that's something the project maintainers would consider, I'd be happy to create a pull request to implement it.

I think it's reasonable to make them configurable, thanks for taking this on.

If you're willing to indulge me a bit further, I'd love to understand why that's happened: my understanding is that the livenessprobe checks if the CSI socket is ready, and if not triggers a restart of the container; but what I saw was the plugin program itself throwing errors and eventually exiting with a non-zero RC. Having changed the livenessprobe settings, I no longer see the errors, and the plugin runs without exiting. Am I misunderstanding the relationship between these components?

https://github.com/kubernetes-csi/livenessprobe is what performs the liveness check. I am curious as well: it checks the CSI container itself, so why it would be related to the OpenStack call seems weird to me.
Maybe we need additional pod logs pasted here to see what happened there.

@ramineni
Contributor

ramineni commented Feb 3, 2021

@ramineni We're installing from the Helm chart, and the livenessprobe is not configurable from values, so we've been running using the settings which are defined in the chart (https://github.com/kubernetes/cloud-provider-openstack/blob/master/charts/cinder-csi-plugin/templates/controllerplugin-statefulset.yaml#L114 and https://github.com/kubernetes/cloud-provider-openstack/blob/master/charts/cinder-csi-plugin/templates/nodeplugin-daemonset.yaml#L81).

@jralbert That's a miss on our side then; the values should be configurable.

I pulled a local copy of the chart to install from, and manually updated the templates to just add a zero after each of the failure_threshold, timeout_seconds, and period_seconds, then re-installed from the updated chart - and the plugin has been running stably for 30 minutes, which is 30 minutes longer than I ever got it to run previously. I'll do some further testing to actually create/modify volumes, but this does appear to have solved the problem!

If you're willing to indulge me a bit further, I'd love to understand why that's happened: my understanding is that the livenessprobe checks if the CSI socket is ready, and if not triggers a restart of the container; but what I saw was the plugin program itself throwing errors and eventually exiting with a non-zero RC. Having changed the livenessprobe settings, I no longer see the errors, and the plugin runs without exiting. Am I misunderstanding the relationship between these components?

@jralbert In addition, it also checks whether the plugin is able to connect to OpenStack Cinder with a basic API call. If Cinder is not reachable, the plugin is considered not ready/unhealthy.

As a separate point, it would be wonderful to be able to configure the livenessprobe timeouts from Helm chart values; if that's something the project maintainers would consider, I'd be happy to create a pull request to implement it.

ya, please do :) Appreciate it.

@jralbert
Contributor Author

jralbert commented Feb 3, 2021

@ramineni Just planning for the PR, is it appropriate to add these configurable values to the csi.livenessprobe dict in values.yml, or would you prefer they be organized elsewhere in the file? This will be my first contribution to a Kubernetes project, and I understand there's a process to follow, so please excuse what will no doubt be many questions. :)

@jichenjc
Contributor

jichenjc commented Feb 4, 2021

I think using values is a good way. As you can see,
.Values.timeout, .Values.csi.resizer.image.pullPolicy, etc. are widely used, so you might want to add a csi.livenessprobe section.
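Purely as an illustration of the shape being discussed (the key names here are hypothetical; the actual names were settled in the PR that eventually closed this issue, #1399), a csi.livenessprobe section in values.yaml could look like:

```yaml
# Hypothetical values.yaml sketch; actual key names are defined by the chart.
csi:
  livenessprobe:
    failureThreshold: 5
    initialDelaySeconds: 10
    timeoutSeconds: 10
    periodSeconds: 60
```

The controllerplugin and nodeplugin templates would then reference these via .Values.csi.livenessprobe.* instead of hard-coded numbers.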

@ramineni
Contributor

ramineni commented Feb 4, 2021

@ramineni Just planning for the PR, is it appropriate to add these configurable values to the csi.livenessprobe dict in values.yml, or would you prefer they be organized elsewhere in the file?

@jralbert Feel free to propose a PR in whatever way you think is best. I'm OK with adding them to csi.livenessprobe, but you'll receive comments from everyone once you propose the PR.

This will be my first contribution to a Kubernetes project, and I understand there's a process to follow, so please excuse what will no doubt be many questions. :)

@jralbert Welcome to the community :) No problem, please reach out to us with any questions. You can also reach us on the #provider-openstack Slack channel.

@Gbotttt

Gbotttt commented Jul 12, 2023

Hi Team,

I'm getting the error below in the cinder-csi-plugin container. Can you please help me troubleshoot why it occurs?

I0712 04:27:58.811070 1 driver.go:74] Driver: cinder.csi.openstack.org
I0712 04:27:58.811126 1 driver.go:75] Driver version: 1.3.3@v1.22.0
I0712 04:27:58.811134 1 driver.go:76] CSI Spec version: 1.3.0
I0712 04:27:58.811151 1 driver.go:106] Enabling controller service capability: LIST_VOLUMES
I0712 04:27:58.811159 1 driver.go:106] Enabling controller service capability: CREATE_DELETE_VOLUME
I0712 04:27:58.811163 1 driver.go:106] Enabling controller service capability: PUBLISH_UNPUBLISH_VOLUME
I0712 04:27:58.811166 1 driver.go:106] Enabling controller service capability: CREATE_DELETE_SNAPSHOT
I0712 04:27:58.811169 1 driver.go:106] Enabling controller service capability: LIST_SNAPSHOTS
I0712 04:27:58.811172 1 driver.go:106] Enabling controller service capability: EXPAND_VOLUME
I0712 04:27:58.811175 1 driver.go:106] Enabling controller service capability: CLONE_VOLUME
I0712 04:27:58.811178 1 driver.go:106] Enabling controller service capability: LIST_VOLUMES_PUBLISHED_NODES
I0712 04:27:58.811181 1 driver.go:106] Enabling controller service capability: GET_VOLUME
I0712 04:27:58.811184 1 driver.go:118] Enabling volume access mode: SINGLE_NODE_WRITER
I0712 04:27:58.811188 1 driver.go:128] Enabling node service capability: STAGE_UNSTAGE_VOLUME
I0712 04:27:58.811192 1 driver.go:128] Enabling node service capability: EXPAND_VOLUME
I0712 04:27:58.811195 1 driver.go:128] Enabling node service capability: GET_VOLUME_STATS
I0712 04:27:58.811395 1 openstack.go:90] Block storage opts: {0 false true}
W0712 04:28:02.912605 1 main.go:108] Failed to GetOpenStackProvider: Post "http://x.us.y.com:5000/v3/auth/tokens": dial tcp: lookup x.us.y.com on 169.254.25.10:53: server misbehaving

@jichenjc
Contributor

As a workaround, try replacing x.us.y.com with its IP address.
The DNS server can't resolve the name x.us.y.com. This seems to be a common problem, though I'm not sure why the CSI drivers in CPO hit it so often.
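For illustration, assuming the common setup where the plugin reads its cloud config from a Secret with a cloud.conf key (the Secret name, namespace, and IP below are placeholders, not taken from this issue), the workaround amounts to swapping the hostname for the IP in auth-url:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cloud-config        # placeholder name; match your deployment
  namespace: kube-system
stringData:
  cloud.conf: |
    [Global]
    # Workaround: use the Keystone endpoint IP directly, since cluster DNS
    # cannot resolve the hostname (203.0.113.10 is a placeholder address).
    auth-url=http://203.0.113.10:5000/v3
    # remaining auth options (username, password, project, etc.) unchanged
```

Fixing the DNS resolution itself (see the dnsPolicy discussion below) is the cleaner long-term fix, since hard-coded IPs break if the endpoint moves or if TLS certificates are issued for the hostname.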

@molimolinari

molimolinari commented Apr 12, 2024

Hi team!
I had the same problem but I found a solution:

I set the dnsPolicy of the pod in cinder-csi-controllerplugin.yaml to "Default".

Default: Pod inherits worker node’s DNS configuration.

PS: The tricky part of this parameter is that the "Default" value is not the default...
Although it seems Default should be the default, ClusterFirst is actually the default option. In other words, if you have not set this field, ClusterFirst will be used.
More info:
https://itnext.io/kubernetes-hacks-and-tricks-5-pod-dns-policy-and-dns-config-options-c176f4a581c3
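A minimal sketch of that change in the controller plugin pod template (only the relevant field is shown):

```yaml
# Excerpt from cinder-csi-controllerplugin.yaml; only the relevant field shown.
spec:
  template:
    spec:
      dnsPolicy: Default   # inherit the worker node's resolv.conf
```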

@rptaylor
Contributor

@molimolinari that is an interesting finding. In general I think it would be fair to say that the Openstack cloud is always external to the k8s cluster, so using cluster DNS will not be helpful. Opening an issue or MR to change the CSI pod to use "Default" could be a good idea.
That being said, if the cluster DNS is failing to resolve the OpenStack API, or causing delays in DNS resolution that lead to a timeout, it sort of implies the cluster DNS is broken.

@jichenjc
Contributor

This is a good suggestion. I read the doc, and it seems ClusterFirstWithHostNet might be better and more aligned with the existing flow?
So: ClusterFirst (same as now), but combined with host networking (the proposed change)?

@rptaylor
Contributor

@jichenjc
Hmm yes I noticed the CSI pods are using hostNetwork. https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy

For Pods running with hostNetwork, you should explicitly set its DNS policy to "ClusterFirstWithHostNet". Otherwise, Pods running with hostNetwork and "ClusterFirst" will fallback to the behavior of the "Default" policy.

So it sounds like in practice, due to using hostNetwork, the current behaviour would have been the same as "Default". But wouldn't that mean the pods could not resolve k8s services?

I am not sure about the details of why hostNetwork is needed/used, but that being the case, it seems like ClusterFirstWithHostNet could be better. But this doesn't explain how the recent comment #1387 (comment) could fix a problem, so I don't know.
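For comparison, a sketch of the alternative being discussed here, pairing hostNetwork (which the plugin pods already use) with ClusterFirstWithHostNet so cluster-internal Services stay resolvable (illustrative excerpt, not the actual manifest):

```yaml
spec:
  template:
    spec:
      hostNetwork: true
      # With hostNetwork, plain ClusterFirst silently falls back to Default;
      # ClusterFirstWithHostNet keeps cluster DNS in the pod's resolv.conf.
      dnsPolicy: ClusterFirstWithHostNet
```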

@jichenjc
Contributor

But this doesn't explain how the recent comment #1387 (comment) could fix a problem, so I don't know.

Um... I don't know either :( Maybe there is some magic between Default and hostNetwork that we are not aware of.
But the host and the OpenStack hypervisor should be the same in our context, so I think the result is the same.

@dulek
Contributor

dulek commented Apr 23, 2024

I'm guessing it all depends on the configuration of the nodes? If nodes can reach OpenStack APIs, then inheriting configuration from them by setting Default is a good idea.

@jichenjc
Contributor

@molimolinari Can you help check the #2574 comments? Thanks
