Randomly detaching volumes #431
I will close this for now as I realize I didn't redeploy the longhorn driver deployer as you recommended previously when the kubernetes version changed. I also moved to K8S 1.13.4. If it occurs again, I'll re-raise this issue. |
This is still happening. Same as above.
Kubernetes 1.13.4 Although I changed K8S versions during upgrade of Rancher, I moved back to 1.13.4 so the only difference is Rancher moving from 2.1.6 to 2.1.7. I deleted the deployments and re-deployed longhorn driver deployment after last K8S upgrade. Any ideas? |
@paulmorabito It seems to be caused by network issues in your setup:
time="2019-03-15T11:08:26Z" level=debug msg="read tcp 10.42.0.223:9500->10.42.0.216:35814: read: connection reset by peer" id=8185369120697835514 type=events
time="2019-03-15T11:08:26Z" level=warning msg="HTTP handling error write tcp 10.42.0.223:9500->10.42.0.216:35814: write: broken pipe"
time="2019-03-15T11:08:26Z" level=error msg="Error in request: write tcp 10.42.0.223:9500->10.42.0.216:35814: write: broken pipe"
2019/03/15 11:08:26 http: response.WriteHeader on hijacked connection
2019/03/15 11:08:26 http: response.Write on hijacked connection
time="2019-03-15T11:08:26Z" level=error msg="Failed to write err: write tcp 10.42.0.223:9500->10.42.0.216:35814: write: broken pipe"
time="2019-03-15T11:09:11Z" level=debug msg="read tcp 10.42.0.223:9500->10.42.0.216:33118: read: connection reset by peer" id=3019472168369352208 type=nodes
time="2019-03-15T11:11:01Z" level=debug msg="read tcp 10.42.0.223:9500->10.42.0.216:33120: read: connection reset by peer" id=7721074465402278285 type=volumes
time="2019-03-15T11:11:01Z" level=warning msg="HTTP handling error write tcp 10.42.0.223:9500->10.42.0.216:33120: write: broken pipe"
time="2019-03-15T11:11:01Z" level=error msg="Error in request: write tcp 10.42.0.223:9500->10.42.0.216:33120: write: broken pipe"
2019/03/15 11:11:01 http: response.WriteHeader on hijacked connection
2019/03/15 11:11:01 http: response.Write on hijacked connection
Can you check if the connections between different nodes (pods across the nodes) are stable? |
How can I do that? Nothing has physically changed with the network, and they're wired connections. The only major system change was the upgrade. I've not noticed any other issues elsewhere either. |
I've written the following in #428, and we're working on a tool to show the network instability (#430). We can run three small pods, one per node (by setting nodeName in the pod), and make them ping the Longhorn manager on all the nodes. Then, when a network outage happens, we can look at those logs to see whether only Longhorn or the whole Kubernetes network on the node goes down. Also, on each node, use screen to run ping towards the other two nodes, write the output to a file, and check the result when the outage happens. If only the test in the Kubernetes pod failed, that's a Kubernetes overlay network issue; if both failed, it's a vendor VM network issue. |
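For reference, a minimal sketch of that test. Node names, pod names, and IPs below are placeholders; the longhorn-manager pod IPs come from your own cluster:

```sh
# Find the longhorn-manager pod IPs on each node:
kubectl -n longhorn-system get pods -o wide | grep longhorn-manager

# Pin a small test pod to a specific node via spec.nodeName and have it
# ping a longhorn-manager pod on another node:
kubectl run netcheck-node1 --image=busybox --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"node1"}}' \
  -- ping <longhorn-manager-pod-ip-on-node2>

# On each host itself (outside Kubernetes), log pings to the other node
# inside a detached screen session for later inspection:
screen -dmS netcheck sh -c 'ping <node2-ip> >> /var/tmp/ping-node2.log 2>&1'
```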
OK, I will give this a try. This is a home setup. I have two nodes; one of them is a QNAP NAS running an Intel J3455, so it's underpowered, relatively speaking. This seems to be the one that is struggling to run Longhorn, as my other node is working fine. I'm finding that over time I get limited node redundancy (with replicas only on my non-NAS node). I put this down to the above, but possibly it's also causing the network issue because of delays/latency in responses when the CPU or IO spikes? Would it be better to disable scheduling on the NAS node for now and run Longhorn on only one node? I'm planning on getting better hardware shortly anyhow.
|
Network latency sounds like the reason. You can disable scheduling to the NAS box in the node setup, but that only applies to replicas. If Kubernetes decides to schedule a workload to the NAS box, the engine would need to run there, so it can still cause problems. You can limit the pod to schedule only to the other box to avoid that. |
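For example, a sketch (node and deployment names are placeholders) of keeping a workload off the NAS box with a label and a nodeSelector:

```sh
# Label the stronger node, then constrain the workload to it:
kubectl label nodes <non-nas-node> longhorn-workload=allowed

# Patch the workload's pod template so Kubernetes only schedules it there:
kubectl patch deployment <my-app> -p \
  '{"spec":{"template":{"spec":{"nodeSelector":{"longhorn-workload":"allowed"}}}}}'
```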
Is there anything in particular that is taxing? Volume size, or how frequently it's read/written to? I'll try to reduce the number of volumes as a short-term measure and monitor as you recommend.
|
More volumes will likely use more resources. Volume size or frequency of access won't matter much compared to the number of volumes. |
After testing for a few days, there seems to be nothing specific in the
kubelet logs or pings (though ping response time varies).
I've reduced the number of volumes and I am still finding volumes
disconnecting randomly. I am down to half the volumes I was using previously
and it is still occurring. Previously, I was running many more volumes and
although sometimes I had issues with node redundancy on my slower node, I
never had disconnections until I upgraded from Rancher 2.1.6 to 2.1.7. It
may be a strange coincidence but it is very odd. Is there anything
different between the two that could have increased the sensitivity?
If not, I'll wait until I have better hardware and test again.
|
@paulmorabito Are there any lost pings? If you look at the time the volume became faulted and the time the ping failed, are they correlated? Also, what's the last stable version for you (Rancher version and Kubernetes version)? And there may be something in canal's logs. Take a look at kubectl -n kube-system logs canal-xxx; those are the overlay networking components Kubernetes uses. Also, you can send me the latest support bundle. |
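One way to make that correlation easier to spot (a sketch; the node IP is a placeholder) is to timestamp every ping reply so it can be matched against the time the volume faulted:

```sh
# Timestamp each ping line so it lines up with the Longhorn event times:
ping <other-node-ip> | while read -r line; do
  echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') $line"
done >> /var/tmp/ping-timestamps.log

# Overlay network logs (the canal pod name suffix differs per cluster):
kubectl -n kube-system logs canal-xxx
```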
I'll take a look at canal also. I can't see anything out of the ordinary so far, but the logging history only goes back a few minutes. The last stable version was Rancher 2.1.6 with Kubernetes 1.13.4. Now I am using Rancher 2.1.7 and Kubernetes 1.13.4. If there is any correlation, it is that the ping times are longer around the volume disconnect. As best as I can see, it looks like CPU/disk IO is spiking (and the node is hitting swap etc.). As the node itself has an Intel J3455 processor, it is not responding fast enough, so Longhorn disconnects. If I try to hit the Rancher UI during this time, it's slow to load pages, but nothing times out. I'll let you know if I find anything more.
|
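To confirm the CPU/disk-IO-spike theory, one simple approach is to record timestamped system stats on the NAS node and compare them against the disconnect times. A sketch (the 5-second interval is arbitrary; requires procps and sysstat on the host):

```sh
# vmstat -t appends a timestamp column; iostat -t prints one per sample:
vmstat -t 5 >> /var/tmp/vmstat.log &
iostat -xt 5 >> /var/tmp/iostat.log &
```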
We will add an experimental feature to guarantee the engine's CPU in v0.5.0. We hope this will improve Longhorn stability under high load. Though, due to the potential resource constraint, attaching a volume may fail with the feature enabled (that's why it's experimental). Let me know if it improves the situation you're experiencing. We are working on a better solution after v0.5.0. |
OK, thank you for the update. So far, since I've moved to better hardware, I've not had an issue with volumes disconnecting, but I would be happy to help test this out.
|
I'm getting a similar thing in v0.8.0; the dashboard is showing a count of 260. Logs: |
It's happening to all three volumes I have created. |
I have a few |
Running Kubernetes 1.17 in kind. |
@cameronbraid Do you have open-iscsi installed on the node? If not, you would see related error messages in the logs. Also, kind is not officially supported by Longhorn currently. |
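A quick way to verify the open-iscsi prerequisite on a node (a sketch assuming a systemd-based host; the service name can vary by distribution):

```sh
# Check that the iscsi client tooling is installed and the daemon is up:
iscsiadm --version
systemctl status iscsid
```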
Yeah, open-iscsi is installed on the node. My node container is:
I am using kind to run my developer cluster, which I was setting up to be as similar to my staging/live cluster as possible. kind seemed to be the way to do that. I just don't have the resources to run 3 VMs on my dev machine. |
@cameronbraid did you manage to resolve this? |
Hi,
I'm running Longhorn 0.4.0 on Rancher 2.1.7, Kubernetes version 1.12.6.
I upgraded to the latest Rancher (from 2.1.6) and Kubernetes at the same time.
Since then, I've noticed volumes randomly detaching. The error in longhorn-manager logs for the latest is:
This has never happened prior to the upgrade so I am a little unsure what is causing it. If I try to re-deploy, I get an error:
The only way around this seems to be to restore from a backup but it seems to repeatedly happen, so far to multiple containers.
And another example, just happened: