hostPort iptables rule is lost after node restarts #17464

Closed
vfreex opened this issue Nov 27, 2017 · 25 comments
Labels
component/networking, kind/bug, lifecycle/rotten, priority/P2

Comments

@vfreex

vfreex commented Nov 27, 2017

The hostPort mapping created by a DaemonSet disappears after the node (or just docker) restarts.
I am not sure whether the problem is still present in the latest version.

Version

OpenShift Origin v3.6.1+008f2d5

Steps To Reproduce
  1. Create a DaemonSet with a hostPort, for example https://gist.github.com/vfreex/fc768e2ecdd6c18047bb9be5e5e707aa (a minimal sketch is included after this list).
  2. An iptables rule is added to a KUBE-HP-* chain of the nat table.
  3. Restart docker on a particular node.
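
A minimal DaemonSet sketch along the lines of the linked gist (the name, image, and port numbers here are illustrative, not copied from the gist):

# Create an nginx DaemonSet that maps container port 80 to host port 8080.
oc create -f - <<'EOF'
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: nginx
spec:
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
          hostPort: 8080
EOF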
Current Result

After several minutes, the hostPort on that node becomes unreachable and the iptables rule in the KUBE-HP-* chain disappears.
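
One way to watch the rule disappear, assuming the illustrative host port 8080 from the sketch above (KUBE-HP-* is the chain prefix written by kubelet's hostport handling; the hash suffix varies per pod):

# Before the restart the mapping is present in a KUBE-HP-* chain of the nat table.
iptables -t nat -S | grep -E 'KUBE-HP|8080'

# Restart docker on the node, wait a few minutes, and check again;
# on an affected node the KUBE-HP-* rule for port 8080 is gone.
systemctl restart docker
sleep 300
iptables -t nat -S | grep -E 'KUBE-HP|8080'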

Expected Result

The hostPort should be mapped to the new pod that replaces the old one after the restart.

Additional Information
  1. If the iptables rule is added to the DOCKER chain instead, this bug does not happen, although I don't know how OpenShift/Kubernetes decides which chain to use (see the quick check below).
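
A quick way to see which chain actually holds the mapping (again using the illustrative port 8080); as far as I can tell, rules in the DOCKER chain come from Docker publishing the port itself, while KUBE-HP-* rules come from kubelet's hostport handling when a CNI plugin manages the pod network:

# Look for the mapping in Docker's own chain versus the kubelet-managed chains.
iptables -t nat -S DOCKER | grep 8080
iptables -t nat -S | grep KUBE-HP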
@mfojtik added the kind/bug and priority/P2 labels on Nov 27, 2017
@vfreex
Author

vfreex commented Jan 18, 2018

@knobunc Any updates? I can also help by providing an environment that can reproduce this issue.

@DanyC97
Contributor

DanyC97 commented Jan 26, 2018

Any chance this issue's priority can be bumped, please?

It happens on 1.5/3.6/3.7 when the origin-node service is restarted.

@smarterclayton any chance you are aware of a related Kubernetes bug? I couldn't find anything.

@DanyC97
Contributor

DanyC97 commented Feb 1, 2018

@knobunc @vfreex any chance more info can be provided, or is anyone looking at this?

@vfreex
Author

vfreex commented Feb 5, 2018

@DanyC97 I think the information provided here is sufficient to reproduce this issue. It seems to me that the priority of this issue is low.

@DanyC97
Contributor

DanyC97 commented Feb 5, 2018

@sdodson any chance you can help increase the priority of this issue, please?

@DanyC97
Contributor

DanyC97 commented Feb 8, 2018

I guess I had no luck grabbing anyone's attention, no matter how much I tried... oh well.

@dcbw
Member

dcbw commented Feb 9, 2018

Investigated and was able to reproduce locally (at least a variant of the issue) using the nginx daemonset and restarting docker. Analysis:

  1. when docker is restarted, kubelet notices the sandbox has died and terminates it. That termination clears the hostport chains too
  2. kubelet tries to start a new sandbox, but docker is down and this fails
  3. kubelet tries to start a 3rd sandbox, which works, and gets into the CNI network plugin for setup
  4. hostport rules get added
  5. PLEG requests pod sandbox status, which of course returns an empty IP because the sandbox is not ready yet. This state gets cached in the status manager and a SyncPod is queued.
  6. Sandbox creation finishes and the IP is assigned and available
  7. SyncPod() runs using the cached status from step (5)
  8. Because none of the sandboxes from the cached status have an IP address, SyncPod() thinks the sandboxes are all dead and starts another one. This of course kills the one from step (6).
  9. repeat

It's currently unclear what should be done about this; it's a completely upstream problem. We've fixed a number of upstream issues with PLEG status racing with SyncPod in the past.
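
A rough way to observe the loop described in steps 5-9 on an affected node (assuming the Docker runtime and the nginx DaemonSet from the reproducer; container and service names are illustrative):

# Each SyncPod pass that believes the sandbox is dead creates a fresh infra
# (pause) container and kills the previous one, so the k8s_POD_ entries churn.
watch -n 2 "docker ps -a --format '{{.ID}} {{.Status}} {{.Names}}' | grep k8s_POD_nginx"

# Correlate with the kubelet messages quoted later in this thread.
journalctl -b -u origin-node | grep -E 'has no IP address|Need to start a new one'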

@dcbw
Member

dcbw commented Feb 9, 2018

One question for the reporter: does the container ever become ready and start but without the hostport rules? Or does the container never become ready?

@vfreex
Author

vfreex commented Feb 9, 2018

@dcbw Thanks for your investigation. Regarding your question, I need time to confirm; I will follow up later.

@DanyC97
Contributor

DanyC97 commented Feb 9, 2018

@dcbw in my case it is the former situation: the pod is up, but there is no iptables rule.

@dcbw
Member

dcbw commented Feb 9, 2018

Do either of you see a line like this in your openshift-node process logs (e.g. journalctl -b -u openshift-node)?

Sandbox for pod "nginx-rx4vn_default(dc407878-0d25-11e8-ba8d-0242aa43501a)" has no IP address. Need to start a new one

@DanyC97
Contributor

DanyC97 commented Feb 10, 2018

@dcbw yes, I do:

Feb 10 21:20:25 370-node3 origin-node[6548]: I0210 21:20:25.552489    6548 kuberuntime_manager.go:371] No sandbox for pod "nginx-8nf6l_andy(325d1baa-0ea8-11e8-bb11-005056b87bb3)" can be found. Need to start a new one

@DanyC97
Contributor

DanyC97 commented Feb 13, 2018

@dcbw any luck, or do you need more info from my side?

@DanyC97
Contributor

DanyC97 commented Feb 19, 2018

@dcbw sorry to keep nudging you, but is there any chance you can spare some time to get to the bottom of this, please? (I know this might be on your own time, so the extra mile would be much appreciated.)

@DanyC97
Contributor

DanyC97 commented Feb 24, 2018

@danwinship @smarterclayton @dcbw @liggitt I don't know who else to tag; I'm screaming for help. Please, can one of you keep looking into this issue? It hurts me a lot running Origin internally.

I'm very surprised this issue doesn't get more attention, and I wonder whether it simply doesn't happen on OCP. It is a very common scenario that can trigger a disaster in production, since there is currently no solution that monitors whether the iptables rules are still present.

@dcbw
Member

dcbw commented Mar 15, 2018

@DanyC97 does the node eventually settle down and start the container, or does it never happen?

Also, if you can run openshift-node with --loglevel=5, reproduce the issue, and then mail the output of "journalctl -b -u openshift-node" (the service might be called "atomic-openshift-node" instead) to me at [dcbw at redhat dot com], I can analyze it and see whether your issue is the same as the one I've diagnosed.
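
For anyone else gathering the same data, one possible way to do it on an RPM-installed node (the sysconfig file and service may be named origin-node or atomic-openshift-node depending on the install; the sed pattern assumes the usual OPTIONS=--loglevel=N line):

# Bump the node log level to 5 and restart the node service.
sed -i 's/--loglevel=[0-9]*/--loglevel=5/' /etc/sysconfig/origin-node
systemctl restart origin-node

# Reproduce the issue, then capture the logs since boot for the node service.
journalctl -b -u origin-node > origin-node.log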

@DanyC97
Contributor

DanyC97 commented Mar 17, 2018

@dcbw I'll email all the info to you next week with a reproducible test case. Thanks a bunch!!

@kincl

kincl commented Apr 27, 2018

This appears to be an issue in 3.9 as well; I am seeing it with the Prometheus DaemonSet.

@DanyC97
Contributor

DanyC97 commented May 7, 2018

@dcbw FYI, I just sent you an email; again, sorry for the very long delay.

@DanyC97
Contributor

DanyC97 commented May 24, 2018

@dcbw let me know if I should resend the email. It would be great if you had a few minutes to spare to look into it, so we know if and when this will get fixed. I don't want to move the entire prod environments to 3.9 and find out we have the same issue @kincl mentioned; that would be a disaster for me anyway.

@vfreex
Author

vfreex commented Jul 30, 2018

Hi, what is the status of this?

@vfreex changed the title from "hostPort iptables rule missing after node restarts" to "hostPort iptables rule is lost after node restarts" on Sep 5, 2018
@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label on Dec 11, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jan 10, 2019
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
