Pods are sometimes deployed without internal network connectivity #14092
Comments
|
Can you please get the logs from the node where the failing pod is running? |
|
The issue only happens after a day or two of running. I had suspended all cron jobs when it started bringing down their services; I'm re-enabling some and will post the logs back when it happens again. |
|
I was able to replicate it again. The issue doesn't seem to relate particularly to CronJobs, but it happens mostly on projects that are running them, most of the time right after a build succeeds and a new deployment is created. Below are the errors. To clarify: running CronJobs on their own doesn't seem to break the project networking; it seems to happen mostly when a build finishes and a new deployment starts. But the issue only happens on projects that are running CronJobs, so perhaps it's some sort of race condition if the two happen at the same time? I have the full log dump of one of the nodes where this happened. |
|
More relevant logs
|
|
I have been able to repeatedly replicate this issue on two different clusters. It seems to happen when a new deployment starts at the same time the CronJob executes; it happens every 2-3 builds. If I delete the CronJob and all completed Jobs and keep triggering builds, the issue no longer occurs. |
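Roughly the loop I use to reproduce it (project and build config names are from my setup, so adjust to yours; this is just a sketch, not an exact script):

```bash
oc project myproject

# Keep the CronJob firing every minute while repeatedly triggering builds.
for i in 1 2 3; do
  oc start-build myapp --follow   # new deployment rolls out when the build finishes
  sleep 60                        # give the CronJob a chance to fire around the rollout
done

# Within 2-3 builds the freshly deployed pod usually loses connectivity
# to the other pods in the project.
oc get pods -o wide
oc exec <new-app-pod> -- ping -c 3 <other-pod-ip>
```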
|
After further tests, it doesn't appear to be related to CronJobs at all; it just occurred more frequently on projects that had them. |
|
I have experienced very similar issues after upgrading to 1.5. I haven't had any CronJobs though, so I can confirm that it is probably unrelated to CronJobs.

Occasionally, when a Pod starts up, it does not have network connectivity. It happens in about one of every 5 newly created pods. Restarting (deleting) the Pod, or restarting the origin-node or openvswitch service on the node the Pod is running on, fixes the problem.

I have tried to investigate a bit further, and I noticed that in this setup: (project A / pod1) <-> Node1 <-> Node2 <-> (project A / pod2), if pod1 is experiencing this issue and I ping pod2 from pod1, the packets actually arrive at pod2, and it sends ICMP replies, which come back encapsulated in vxlan to Node1, but they never show up on pod1's virtual interface. So it seemed the issue was with the vxlan termination of incoming traffic to the Pod. At this point I started looking at OpenFlow rules, but I couldn't determine anything; I didn't find anything trivially missing, so I restarted origin-node on Node1, which resolved the issue. But it's still an annoyance, as it happens fairly frequently. |
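For reference, the checks on Node1 looked roughly like this (assuming the default openshift-sdn bridge name br0; adjust interface names and pod IPs to your environment):

```bash
# The ICMP replies from pod2 do arrive on Node1 over the VXLAN tunnel...
tcpdump -nn -i any icmp and host <pod1-ip>

# ...but nothing reaches pod1's veth, so the drop is between the tunnel and
# the pod. Dump the OpenFlow rules on the SDN bridge and compare with a
# healthy node/pod:
ovs-ofctl -O OpenFlow13 dump-flows br0

# Check that pod1's veth is actually attached to br0:
ovs-vsctl show

# Restarting the node service rewrites the flows and clears the problem:
systemctl restart origin-node
```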
That's the problem; something is going wrong in the networking code and causing it to drop OVS rules for a namespace when, apparently, the namespace is still in use. Could you restart the node with --loglevel=5, run until the error shows up, and then post the logs (from just the atomic-openshift-node service) somewhere? (It will probably be way too large to post here but you could use a pastebin and post just a link, or you could email it to me (danw at redhat)) |
where by "atomic-openshift-node" I probably mean "origin-node" I guess... |
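Concretely, something like this should do it (paths assume an RPM install of origin; adjust if your install differs):

```bash
# Make sure OPTIONS in /etc/sysconfig/origin-node includes --loglevel=5,
# e.g. OPTIONS="--loglevel=5", then restart the node service:
systemctl restart origin-node

# Once the failure reproduces, collect just the node service logs:
journalctl -u origin-node --since "1 hour ago" > origin-node.log
```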
|
I'll try to replicate this again on a test cluster so I don't need to update all the nodes to loglevel=5. @mbalazs, curious if you are also running on AWS? |
|
@andrewklau No, it's on a private OpenNebula installation. Some additional info that might help in reproducing:
|
|
@danwinship I am emailing the logs to you now. I was able to replicate this fairly quickly:
Continued to |
|
Please note that we're having the exact same issue, and it's not related to CronJobs. |
|
Can you tell me if the broken pod has an IP address? Can it ping the node it is on? Can the node ping it? (I'm wondering if this is an ARP cache problem) |
|
@knobunc In my case, the broken pod (php) has an IP address, and it can ping other nodes in the cluster. It can also ping the docker registry pod, and I can ping the broken pod from the cluster nodes and from the registry pod. However, it cannot ping the other pod (mysql) in the same project, and vice-versa. The other pod can also ping all the nodes and the registry. So the only thing that seems to be broken is the communication between these two pods. |
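For completeness, the checks I ran were along these lines (pod names here are placeholders for my php and mysql pods):

```bash
# Both pods have IPs and can reach the nodes and the registry...
oc get pods -o wide
oc exec php-1-abcde   -- ping -c 3 <node-ip>
oc exec php-1-abcde   -- ping -c 3 <registry-pod-ip>
oc exec mysql-1-fghij -- ping -c 3 <node-ip>

# ...but pod-to-pod within the project times out in both directions:
oc exec php-1-abcde   -- ping -c 3 <mysql-pod-ip>
oc exec mysql-1-fghij -- ping -c 3 <php-pod-ip>
```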
|
@knobunc I have a 2-node test cluster running that I created to replicate the problem. It has hit the problem, so let me know if you would like access. It's also running at --loglevel=5. |
|
@knobunc @danwinship are there any additional logs or access I can provide to further identify what might be the cause? |
|
From that pod, can you ping other pods that are on the same node? What plugin are you using for networking? |
|
It seems to only happen with the multitenant plugin. |
|
Yes: multitenant |
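For anyone else checking which plugin they are on, it is in the node and master configs (default paths from an RPM install; containerized setups may differ):

```bash
# redhat/openshift-ovs-multitenant vs redhat/openshift-ovs-subnet
grep networkPluginName /etc/origin/node/node-config.yaml
grep networkPluginName /etc/origin/master/master-config.yaml
```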
|
This is starting to happen more frequently. Were you able to get the logs from @weliang1's environment, or should I reproduce it and gather cluster logs again? |
|
@weliang1 has the logs we need... we're trying to track down the race. Thanks! |
|
Is the problem in openvswitch or the node? I'm wondering if it'd be possible to run a downgraded openvswitch v1.4.1 just so this stops causing app outages in the interim. |
Can someone do the test? :)
|
Are you close to having a fix, or is it worth downgrading now? |
|
it's not in openvswitch |
|
@andrewklau We have downgraded our nodes to 1.4.1 and the issue seems to be gone. All pods are starting directly now. We're waiting a few days before declaring victory, but so far so good. |
|
I just switched the node container image. We haven't had any network loss so far.
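For others wanting the same workaround, roughly what we did (from memory of an ansible-based containerized install; variable and package names may differ on your setup, and v1.4.1 is just the version we went back to):

```bash
# Containerized node: point the node at the older image and restart it.
sed -i 's/^IMAGE_VERSION=.*/IMAGE_VERSION=v1.4.1/' /etc/sysconfig/origin-node
systemctl restart origin-node

# RPM-based nodes would instead be something like:
# yum downgrade origin-node-1.4.1 origin-sdn-ovs-1.4.1
```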
|
|
I'm not sure if this is related, but I noticed this pop up in one of the events for a 1.5 node: |
|
Is this going to be backported to 1.5? |
|
Backporting this to 1.5 would be great! |
|
Any news regarding a backport? This issue is still blocking our nodes from upgrading to 1.5 :( |
|
Same here.
Btw, is anybody building non-official packages/images?
|
|
The fix was merged in #14801 |
|
yeah, the fix is on the release-1.5 branch but AFAIK we don't do regular point releases of origin |
|
There was already a point release for origin, so why not do another? With this issue fixed, it would be great to have another point release, no? |
|
Hi guys, it would be really nice to have a backport of this in the 1.5 branch. Thanks |
|
We've been waiting for a patch version for almost a month now, anyone? |
|
I'm pretty sure there won't be any point release for 1.5. But it's very easy to build a release on your own: https://github.com/openshift/origin/blob/release-1.5/HACKING.md -> just use the |
|
Yes, we can build our own images, but if we can't use official images anymore, that pretty much means the end of OpenShift for us :(
Why? Origin is not formally supported by Red Hat anyway. Literally the only difference between "official" images and images you built yourself is who typed "make release". |
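Roughly, following the HACKING.md linked above (double-check the doc, the exact targets may have changed since):

```bash
git clone https://github.com/openshift/origin.git
cd origin
git checkout release-1.5   # branch that contains the fix from #14801
make release               # builds the binaries and local docker images
```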
|
Yes, but some do trust CentOS (which provides the origin packages), and not some randomly built packages.
Projects that are running CronJobs seem to eventually lose internal network connectivity.
For example, with a PHP + MySQL setup: run a cron job every minute (a PHP script that connects to MySQL); by the end of the day, the PHP web server is unable to connect to the MySQL server (connection timeout).
A similar case happens with Redis and NodeJS, so it doesn't appear to be technology specific.
These are the same ScheduledJobs that were being run on 1.4 (exported and re-imported as CronJobs). Even after cleaning up the completed jobs (jobs older than 2 hours are deleted), the issue still occurs.
Projects without CronJobs don't appear to be affected.
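For reference, the CronJob involved is nothing special; a minimal sketch of that kind of job (image, command, and names are placeholders; batch/v2alpha1 is the CronJob apiVersion on Kubernetes 1.5):

```bash
oc create -f - <<'EOF'
apiVersion: batch/v2alpha1
kind: CronJob
metadata:
  name: php-db-check
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: check
            image: myregistry/php-app:latest      # placeholder image
            command: ["php", "/opt/app/cron.php"] # placeholder script that connects to MySQL
EOF
```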
Version
openshift v1.5.0+031cbe4
kubernetes v1.5.2+43a9be4