Randomly detaching volumes #431

Open
paulmorabito opened this issue Mar 15, 2019 · 21 comments
Labels: area/kubernetes, kind/bug

paulmorabito commented Mar 15, 2019

Hi,

I'm running Longhorn 0.4.0 on Rancher 2.1.7, Kubernetes version 1.12.6.

I upgraded to the latest Rancher (from 2.1.6) and Kubernetes at the same time.

Since then, I've noticed volumes randomly detaching. The error in the longhorn-manager logs for the latest occurrence is:

0315 06:26:41.314459       1 engine_controller.go:597] fail to update status for engine wallabag-config-e-9f0952ee: failed to list replicas from controller 'wallabag-config': Failed to execute: /var/lib/rancher/longhorn/engine-binaries/rancher-longhorn-engine-v0.4.0/longhorn [--url http://10.42.0.234:9501 ls], output time="2019-03-15T06:26:41Z" level=fatal msg="Error running ls command: Get http://10.42.0.234:9501/v1/replicas: dial tcp 10.42.0.234:9501: connect: connection refused" 
, error exit status 1 
time="2019-03-15T06:26:43Z" level=warning msg="instance wallabag-config-e-9f0952ee state is failed/unknown, pod state Failed" 
time="2019-03-15T06:26:43Z" level=warning msg="instance wallabag-config-e-9f0952ee crashed on node 192.168.1.113, log: \n2019-03-14T07:25:58.726298391Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"launcher: controller fe296e7f-87c5-48a7-95db-72830b59fe62 started\"\n2019-03-14T07:25:58.745036509Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"Starting with replicas [\\\"tcp://10.42.1.227:9502\\\" \\\"tcp://10.42.0.233:9502\\\" \\\"tcp://10.42.1.225:9502\\\"]\" \n2019-03-14T07:25:58.745440608Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"Connecting to remote: 10.42.1.227:9502\" \n2019-03-14T07:25:58.749815387Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"Opening: 10.42.1.227:9502\" \n2019-03-14T07:25:58.807327488Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"Adding backend: tcp://10.42.1.227:9502\" \n2019-03-14T07:25:58.807398943Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"Set replica tcp://10.42.1.227:9502 to mode RW\" \n2019-03-14T07:25:58.80741908Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"Connecting to remote: 10.42.0.233:9502\" \n2019-03-14T07:25:58.80786323Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"Start monitoring tcp://10.42.1.227:9502\" \n2019-03-14T07:25:58.814177723Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"Opening: 10.42.0.233:9502\" \n2019-03-14T07:25:59.113799554Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Adding backend: tcp://10.42.0.233:9502\" \n2019-03-14T07:25:59.113924242Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Set replica tcp://10.42.0.233:9502 to mode RW\" \n2019-03-14T07:25:59.113947574Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Connecting to remote: 10.42.1.225:9502\" \n2019-03-14T07:25:59.114004128Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Start monitoring tcp://10.42.0.233:9502\" \n2019-03-14T07:25:59.119344194Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Opening: 10.42.1.225:9502\" \n2019-03-14T07:25:59.172317064Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Adding backend: tcp://10.42.1.225:9502\" \n2 
019-03-14T07:25:59.220914683Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Set replica tcp://10.42.1.225:9502 to mode RW\" \n2019-03-14T07:25:59.221051515Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Start monitoring tcp://10.42.1.225:9502\" \n2019-03-14T07:25:59.221077003Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Get backend tcp://10.42.1.227:9502 revision counter 37\" \n2019-03-14T07:25:59.22109641Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Get backend tcp://10.42.0.233:9502 revision counter 37\" \n2019-03-14T07:25:59.221115434Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Get backend tcp://10.42.1.225:9502 revision counter 37\" \n2019-03-14T07:26:01.761834162Z time=\"2019-03-14T07:26:01Z\" level=info msg=\"New data socket connnection established\" \n2019-03-14T07:26:03.898401346Z time=\"2019-03-14T07:26:03Z\" level=info msg=\"Creating device /dev/longhorn/wallabag-config 8:0\"\n2019-03-14T07:26:03.898492224Z time=\"2019-03-14T07:26:03Z\" level=info msg=\"launcher: Device /dev/longhorn/wallabag-config is ready\"\n2019-03-14T07:26:03.898513263Z time=\"2019-03-14T07:26:03Z\" level=info msg=\"launcher: SCSI device /dev/sda created\"\n2019-03-14T07:26:03.905052332Z time=\"2019-03-14T07:26:03Z\" level=info msg=\"Listening on 0.0.0.0:9501\" \n" 
time="2019-03-15T06:26:43Z" level=debug msg="Stop monitoring wallabag-config-e-9f0952ee" 
time="2019-03-15T06:26:43Z" level=error msg="Engine of volume wallabag-config dead unexpectedly, detach the volume" 
time="2019-03-15T06:26:43Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"wallabag-config\", UID:\"79d11738-40d3-11e9-a8e8-d050998bc684\", APIVersion:\"longhorn.rancher.io/v1alpha1\", ResourceVersion:\"8412370\", FieldPath:\"\"}): type: 'Warning' reason: 'Faulted' Engine of volume wallabag-config dead unexpectedly, detach the volume" 
time="2019-03-15T06:26:43Z" level=debug msg="Requeue longhorn-system/wallabag-config-e-9f0952ee due to conflict" 
time="2019-03-15T06:26:43Z" level=warning msg="instance wallabag-config-e-9f0952ee state is failed/unknown, pod state Failed" 
time="2019-03-15T06:26:43Z" level=warning msg="instance wallabag-config-e-9f0952ee crashed on node 192.168.1.113, log: \n2019-03-14T07:25:58.726298391Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"launcher: controller fe296e7f-87c5-48a7-95db-72830b59fe62 started\"\n2019-03-14T07:25:58.745036509Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"Starting with replicas [\\\"tcp://10.42.1.227:9502\\\" \\\"tcp://10.42.0.233:9502\\\" \\\"tcp://10.42.1.225:9502\\\"]\" \n2019-03-14T07:25:58.745440608Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"Connecting to remote: 10.42.1.227:9502\" \n2019-03-14T07:25:58.749815387Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"Opening: 10.42.1.227:9502\" \n2019-03-14T07:25:58.807327488Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"Adding backend: tcp://10.42.1.227:9502\" \n2019-03-14T07:25:58.807398943Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"Set replica tcp://10.42.1.227:9502 to mode RW\" \n2019-03-14T07:25:58.80741908Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"Connecting to remote: 10.42.0.233:9502\" \n2019-03-14T07:25:58.80786323Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"Start monitoring tcp://10.42.1.227:9502\" \n2019-03-14T07:25:58.814177723Z time=\"2019-03-14T07:25:58Z\" level=info msg=\"Opening: 10.42.0.233:9502\" \n2019-03-14T07:25:59.113799554Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Adding backend: tcp://10.42.0.233:9502\" \n2019-03-14T07:25:59.113924242Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Set replica tcp://10.42.0.233:9502 to mode RW\" \n2019-03-14T07:25:59.113947574Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Connecting to remote: 10.42.1.225:9502\" \n2019-03-14T07:25:59.114004128Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Start monitoring tcp://10.42.0.233:9502\" \n2019-03-14T07:25:59.119344194Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Opening: 10.42.1.225:9502\" \n2019-03-14T07:25:59.172317064Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Adding backend: tcp://10.42.1.225:9502\" \n2 
019-03-14T07:25:59.220914683Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Set replica tcp://10.42.1.225:9502 to mode RW\" \n2019-03-14T07:25:59.221051515Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Start monitoring tcp://10.42.1.225:9502\" \n2019-03-14T07:25:59.221077003Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Get backend tcp://10.42.1.227:9502 revision counter 37\" \n2019-03-14T07:25:59.22109641Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Get backend tcp://10.42.0.233:9502 revision counter 37\" \n2019-03-14T07:25:59.221115434Z time=\"2019-03-14T07:25:59Z\" level=info msg=\"Get backend tcp://10.42.1.225:9502 revision counter 37\" \n2019-03-14T07:26:01.761834162Z time=\"2019-03-14T07:26:01Z\" level=info msg=\"New data socket connnection established\" \n2019-03-14T07:26:03.898401346Z time=\"2019-03-14T07:26:03Z\" level=info msg=\"Creating device /dev/longhorn/wallabag-config 8:0\"\n2019-03-14T07:26:03.898492224Z time=\"2019-03-14T07:26:03Z\" level=info msg=\"launcher: Device /dev/longhorn/wallabag-config is ready\"\n2019-03-14T07:26:03.898513263Z time=\"2019-03-14T07:26:03Z\" level=info msg=\"launcher: SCSI device /dev/sda created\"\n2019-03-14T07:26:03.905052332Z time=\"2019-03-14T07:26:03Z\" level=info msg=\"Listening on 0.0.0.0:9501\" \n" 
time="2019-03-15T06:26:43Z" level=error msg="Engine of volume wallabag-config dead unexpectedly, detach the volume" 
time="2019-03-15T06:26:43Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"wallabag-config\", UID:\"79d11738-40d3-11e9-a8e8-d050998bc684\", APIVersion:\"longhorn.rancher.io/v1alpha1\", ResourceVersion:\"8412370\", FieldPath:\"\"}): type: 'Warning' reason: 'Faulted' Engine of volume wallabag-config dead unexpectedly, detach the volume" 
time="2019-03-15T06:26:43Z" level=warning msg="instance wallabag-config-e-9f0952ee state is failed/unknown, pod state Failed" 
time="2019-03-15T06:26:43Z" level=debug msg="Requeue longhorn-system/wallabag-config due to conflict" 
time="2019-03-15T06:26:43Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"wallabag-config-e-9f0952ee\", UID:\"7a1a066c-40d3-11e9-a8e8-d050998bc684\", APIVersion:\"longhorn.rancher.io/v1alpha1\", ResourceVersion:\"8723026\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops wallabag-config-e-9f0952ee" 
time="2019-03-15T06:26:43Z" level=warning msg="instance wallabag-config-e-9f0952ee crashed on node 192.168.1.113, but cannot get log, error the server could not find the requested resource (get pods wallabag-config-e-9f0952ee)" 
time="2019-03-15T06:26:44Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"wallabag-config-r-cddec082\", UID:\"45fed95f-4146-11e9-a8e8-d050998bc684\", APIVersion:\"longhorn.rancher.io/v1alpha1\", ResourceVersion:\"8723036\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops wallabag-config-r-cddec082" 
time="2019-03-15T06:26:44Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"wallabag-config-r-238b83c5\", UID:\"c91100ca-4568-11e9-a8e8-d050998bc684\", APIVersion:\"longhorn.rancher.io/v1alpha1\", ResourceVersion:\"8723037\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops wallabag-config-r-238b83c5" 
time="2019-03-15T06:26:46Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"wallabag-config-r-812b1bab\", UID:\"0e0f6fef-442b-11e9-a8e8-d050998bc684\", APIVersion:\"longhorn.rancher.io/v1alpha1\", ResourceVersion:\"8723041\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops wallabag-config-r-812b1bab" 

This has never happened prior to the upgrade, so I'm a little unsure what is causing it. If I try to redeploy, I get an error:

FailedMount	Unable to mount volumes for pod "wallabag-574dd648-qvw62_apps(1369cca5-46f2-11e9-b537-d050998bc684)": timeout expired waiting for volumes to attach or mount for pod "apps"/"wallabag-574dd648-qvw62". list of unmounted volumes=[wallabag-config]. list of unattached volumes=[wallabag-config default-token-wpfq7]	2 minutes ago
Warning	FailedMount	MountVolume.SetUp failed for volume "wallabag-config" : rpc error: code = Internal desc = exit status 1

The only way around this seems to be restoring from a backup, but it keeps happening, so far to multiple containers.

And another example, just happened:

time="2019-03-15T07:30:53Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T07:31:14Z" level=debug msg="websocket: close 1006 (abnormal closure): unexpected EOF" id=4831395596437772646 type=events
time="2019-03-15T07:31:14Z" level=debug msg="websocket: close 1006 (abnormal closure): unexpected EOF" id=4741500764654383297 type=volumes
time="2019-03-15T07:31:14Z" level=debug msg="websocket: close 1006 (abnormal closure): unexpected EOF" id=7582563204632809049 type=settings
time="2019-03-15T07:31:14Z" level=debug msg="websocket: close 1006 (abnormal closure): unexpected EOF" id=6300763577289731937 type=nodes
time="2019-03-15T07:31:14Z" level=debug msg="websocket: close 1006 (abnormal closure): unexpected EOF" id=3639549196677473240 type=engineimages
time="2019-03-15T07:31:23Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T07:31:45Z" level=warning msg="instance pihole-config-e-2d769a35 state is failed/unknown, pod state Failed"
time="2019-03-15T07:31:45Z" level=warning msg="instance pihole-config-e-2d769a35 crashed on node 192.168.1.113, log: \n2019-03-15T00:23:10.96623998Z time=\"2019-03-15T00:23:10Z\" level=info msg=\"launcher: controller 0ca6bf38-471f-4ac6-8b40-a39160945156 started\"\n2019-03-15T00:23:11.035864659Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Starting with replicas [\\\"tcp://10.42.1.245:9502\\\" \\\"tcp://10.42.0.245:9502\\\" \\\"tcp://10.42.1.246:9502\\\"]\" \n2019-03-15T00:23:11.035944547Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Connecting to remote: 10.42.1.245:9502\" \n2019-03-15T00:23:11.035965404Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Opening: 10.42.1.245:9502\" \n2019-03-15T00:23:11.05689144Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Adding backend: tcp://10.42.1.245:9502\" \n2019-03-15T00:23:11.084079701Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Set replica tcp://10.42.1.245:9502 to mode RW\" \n2019-03-15T00:23:11.08590774Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Connecting to remote: 10.42.0.245:9502\" \n2019-03-15T00:23:11.088138901Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Start monitoring tcp://10.42.1.245:9502\" \n2019-03-15T00:23:11.10909417Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Opening: 10.42.0.245:9502\" \n2019-03-15T00:23:11.146252431Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Adding backend: tcp://10.42.0.245:9502\" \n2019-03-15T00:23:11.146316526Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Set replica tcp://10.42.0.245:9502 to mode RW\" \n2019-03-15T00:23:11.146336827Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Connecting to remote: 10.42.1.246:9502\" \n2019-03-15T00:23:11.160947639Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Start monitoring tcp://10.42.0.245:9502\" \n2019-03-15T00:23:11.194153494Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Opening: 10.42.1.246:9502\" \n2019-03-15T00:23:11.244320499Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Adding backend: tcp://10.42.1.246:9502\" \n2019-
03-15T00:23:11.244399024Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Set replica tcp://10.42.1.246:9502 to mode RW\" \n2019-03-15T00:23:11.244460383Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Start monitoring tcp://10.42.1.246:9502\" \n2019-03-15T00:23:11.264967718Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Get backend tcp://10.42.1.245:9502 revision counter 0\" \n2019-03-15T00:23:11.265038133Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Get backend tcp://10.42.0.245:9502 revision counter 0\" \n2019-03-15T00:23:11.265058637Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Get backend tcp://10.42.1.246:9502 revision counter 0\" \n2019-03-15T00:23:12.660298813Z time=\"2019-03-15T00:23:12Z\" level=info msg=\"New data socket connnection established\" \n2019-03-15T00:23:13.389425645Z time=\"2019-03-15T00:23:13Z\" level=warning msg=\"Device /dev/longhorn/pihole-config already exists, clean it up\"\n2019-03-15T00:23:13.389499557Z time=\"2019-03-15T00:23:13Z\" level=info msg=\"Creating device /dev/longhorn/pihole-config 8:48\"\n2019-03-15T00:23:13.389520425Z time=\"2019-03-15T00:23:13Z\" level=info msg=\"launcher: Device /dev/longhorn/pihole-config is ready\"\n2019-03-15T00:23:13.38954013Z time=\"2019-03-15T00:23:13Z\" level=info msg=\"launcher: SCSI device /dev/sdd created\"\n2019-03-15T00:23:13.415594741Z time=\"2019-03-15T00:23:13Z\" level=info msg=\"Listening on 0.0.0.0:9501\" \n2019-03-15T00:23:47.437218006Z time=\"2019-03-15T00:23:47Z\" level=error msg=\"Write timeout on replica 10.42.0.245:9503 seq= 90 size= 1280 (kB)\" \n2019-03-15T00:23:47.437316345Z time=\"2019-03-15T00:23:47Z\" level=error msg=\"Retry 1 on replica 10.42.0.245:9503 seq= 90 size= 1280 (kB)\" \n"
time="2019-03-15T07:31:45Z" level=debug msg="Stop monitoring pihole-config-e-2d769a35"
time="2019-03-15T07:31:45Z" level=debug msg="Requeue longhorn-system/pihole-config-e-2d769a35 due to conflict"
time="2019-03-15T07:31:45Z" level=warning msg="instance pihole-config-e-2d769a35 state is failed/unknown, pod state Failed"
time="2019-03-15T07:31:45Z" level=error msg="Engine of volume pihole-config dead unexpectedly, detach the volume"
time="2019-03-15T07:31:45Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pihole-config\", UID:\"31664fc2-46b8-11e9-b537-d050998bc684\", APIVersion:\"longhorn.rancher.io/v1alpha1\", ResourceVersion:\"8641054\", FieldPath:\"\"}): type: 'Warning' reason: 'Faulted' Engine of volume pihole-config dead unexpectedly, detach the volume"
time="2019-03-15T07:31:45Z" level=warning msg="instance pihole-config-e-2d769a35 crashed on node 192.168.1.113, log: \n2019-03-15T00:23:10.96623998Z time=\"2019-03-15T00:23:10Z\" level=info msg=\"launcher: controller 0ca6bf38-471f-4ac6-8b40-a39160945156 started\"\n2019-03-15T00:23:11.035864659Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Starting with replicas [\\\"tcp://10.42.1.245:9502\\\" \\\"tcp://10.42.0.245:9502\\\" \\\"tcp://10.42.1.246:9502\\\"]\" \n2019-03-15T00:23:11.035944547Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Connecting to remote: 10.42.1.245:9502\" \n2019-03-15T00:23:11.035965404Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Opening: 10.42.1.245:9502\" \n2019-03-15T00:23:11.05689144Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Adding backend: tcp://10.42.1.245:9502\" \n2019-03-15T00:23:11.084079701Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Set replica tcp://10.42.1.245:9502 to mode RW\" \n2019-03-15T00:23:11.08590774Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Connecting to remote: 10.42.0.245:9502\" \n2019-03-15T00:23:11.088138901Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Start monitoring tcp://10.42.1.245:9502\" \n2019-03-15T00:23:11.10909417Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Opening: 10.42.0.245:9502\" \n2019-03-15T00:23:11.146252431Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Adding backend: tcp://10.42.0.245:9502\" \n2019-03-15T00:23:11.146316526Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Set replica tcp://10.42.0.245:9502 to mode RW\" \n2019-03-15T00:23:11.146336827Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Connecting to remote: 10.42.1.246:9502\" \n2019-03-15T00:23:11.160947639Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Start monitoring tcp://10.42.0.245:9502\" \n2019-03-15T00:23:11.194153494Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Opening: 10.42.1.246:9502\" \n2019-03-15T00:23:11.244320499Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Adding backend: tcp://10.42.1.246:9502\" \n2019-0
3-15T00:23:11.244399024Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Set replica tcp://10.42.1.246:9502 to mode RW\" \n2019-03-15T00:23:11.244460383Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Start monitoring tcp://10.42.1.246:9502\" \n2019-03-15T00:23:11.264967718Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Get backend tcp://10.42.1.245:9502 revision counter 0\" \n2019-03-15T00:23:11.265038133Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Get backend tcp://10.42.0.245:9502 revision counter 0\" \n2019-03-15T00:23:11.265058637Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Get backend tcp://10.42.1.246:9502 revision counter 0\" \n2019-03-15T00:23:12.660298813Z time=\"2019-03-15T00:23:12Z\" level=info msg=\"New data socket connnection established\" \n2019-03-15T00:23:13.389425645Z time=\"2019-03-15T00:23:13Z\" level=warning msg=\"Device /dev/longhorn/pihole-config already exists, clean it up\"\n2019-03-15T00:23:13.389499557Z time=\"2019-03-15T00:23:13Z\" level=info msg=\"Creating device /dev/longhorn/pihole-config 8:48\"\n2019-03-15T00:23:13.389520425Z time=\"2019-03-15T00:23:13Z\" level=info msg=\"launcher: Device /dev/longhorn/pihole-config is ready\"\n2019-03-15T00:23:13.38954013Z time=\"2019-03-15T00:23:13Z\" level=info msg=\"launcher: SCSI device /dev/sdd created\"\n2019-03-15T00:23:13.415594741Z time=\"2019-03-15T00:23:13Z\" level=info msg=\"Listening on 0.0.0.0:9501\" \n2019-03-15T00:23:47.437218006Z time=\"2019-03-15T00:23:47Z\" level=error msg=\"Write timeout on replica 10.42.0.245:9503 seq= 90 size= 1280 (kB)\" \n2019-03-15T00:23:47.437316345Z time=\"2019-03-15T00:23:47Z\" level=error msg=\"Retry 1 on replica 10.42.0.245:9503 seq= 90 size= 1280 (kB)\" \n"
time="2019-03-15T07:31:45Z" level=warning msg="instance pihole-config-e-2d769a35 state is failed/unknown, pod state Failed"
time="2019-03-15T07:31:46Z" level=warning msg="instance pihole-config-e-2d769a35 crashed on node 192.168.1.113, log: \n2019-03-15T00:23:10.96623998Z time=\"2019-03-15T00:23:10Z\" level=info msg=\"launcher: controller 0ca6bf38-471f-4ac6-8b40-a39160945156 started\"\n2019-03-15T00:23:11.035864659Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Starting with replicas [\\\"tcp://10.42.1.245:9502\\\" \\\"tcp://10.42.0.245:9502\\\" \\\"tcp://10.42.1.246:9502\\\"]\" \n2019-03-15T00:23:11.035944547Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Connecting to remote: 10.42.1.245:9502\" \n2019-03-15T00:23:11.035965404Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Opening: 10.42.1.245:9502\" \n2019-03-15T00:23:11.05689144Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Adding backend: tcp://10.42.1.245:9502\" \n2019-03-15T00:23:11.084079701Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Set replica tcp://10.42.1.245:9502 to mode RW\" \n2019-03-15T00:23:11.08590774Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Connecting to remote: 10.42.0.245:9502\" \n2019-03-15T00:23:11.088138901Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Start monitoring tcp://10.42.1.245:9502\" \n2019-03-15T00:23:11.10909417Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Opening: 10.42.0.245:9502\" \n2019-03-15T00:23:11.146252431Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Adding backend: tcp://10.42.0.245:9502\" \n2019-03-15T00:23:11.146316526Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Set replica tcp://10.42.0.245:9502 to mode RW\" \n2019-03-15T00:23:11.146336827Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Connecting to remote: 10.42.1.246:9502\" \n2019-03-15T00:23:11.160947639Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Start monitoring tcp://10.42.0.245:9502\" \n2019-03-15T00:23:11.194153494Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Opening: 10.42.1.246:9502\" \n2019-03-15T00:23:11.244320499Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Adding backend: tcp://10.42.1.246:9502\" \n2019-
03-15T00:23:11.244399024Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Set replica tcp://10.42.1.246:9502 to mode RW\" \n2019-03-15T00:23:11.244460383Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Start monitoring tcp://10.42.1.246:9502\" \n2019-03-15T00:23:11.264967718Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Get backend tcp://10.42.1.245:9502 revision counter 0\" \n2019-03-15T00:23:11.265038133Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Get backend tcp://10.42.0.245:9502 revision counter 0\" \n2019-03-15T00:23:11.265058637Z time=\"2019-03-15T00:23:11Z\" level=info msg=\"Get backend tcp://10.42.1.246:9502 revision counter 0\" \n2019-03-15T00:23:12.660298813Z time=\"2019-03-15T00:23:12Z\" level=info msg=\"New data socket connnection established\" \n2019-03-15T00:23:13.389425645Z time=\"2019-03-15T00:23:13Z\" level=warning msg=\"Device /dev/longhorn/pihole-config already exists, clean it up\"\n2019-03-15T00:23:13.389499557Z time=\"2019-03-15T00:23:13Z\" level=info msg=\"Creating device /dev/longhorn/pihole-config 8:48\"\n2019-03-15T00:23:13.389520425Z time=\"2019-03-15T00:23:13Z\" level=info msg=\"launcher: Device /dev/longhorn/pihole-config is ready\"\n2019-03-15T00:23:13.38954013Z time=\"2019-03-15T00:23:13Z\" level=info msg=\"launcher: SCSI device /dev/sdd created\"\n2019-03-15T00:23:13.415594741Z time=\"2019-03-15T00:23:13Z\" level=info msg=\"Listening on 0.0.0.0:9501\" \n2019-03-15T00:23:47.437218006Z time=\"2019-03-15T00:23:47Z\" level=error msg=\"Write timeout on replica 10.42.0.245:9503 seq= 90 size= 1280 (kB)\" \n2019-03-15T00:23:47.437316345Z time=\"2019-03-15T00:23:47Z\" level=error msg=\"Retry 1 on replica 10.42.0.245:9503 seq= 90 size= 1280 (kB)\" \n"
time="2019-03-15T07:31:46Z" level=warning msg="instance pihole-config-e-2d769a35 state is failed/unknown, pod state Failed"
time="2019-03-15T07:31:46Z" level=info msg="Event(v1.ObjectReference{Kind:\"Engine\", Namespace:\"longhorn-system\", Name:\"pihole-config-e-2d769a35\", UID:\"31829984-46b8-11e9-b537-d050998bc684\", APIVersion:\"longhorn.rancher.io/v1alpha1\", ResourceVersion:\"8738247\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pihole-config-e-2d769a35"
time="2019-03-15T07:31:46Z" level=warning msg="instance pihole-config-e-2d769a35 crashed on node 192.168.1.113, but cannot get log, error the server could not find the requested resource (get pods pihole-config-e-2d769a35)"
time="2019-03-15T07:31:46Z" level=debug msg="Requeue longhorn-system/pihole-config-e-2d769a35 due to conflict"
time="2019-03-15T07:31:46Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pihole-config-r-7d23f5df\", UID:\"31a452c0-46b8-11e9-b537-d050998bc684\", APIVersion:\"longhorn.rancher.io/v1alpha1\", ResourceVersion:\"8738258\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pihole-config-r-7d23f5df"
time="2019-03-15T07:31:46Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pihole-config-r-9d34fe99\", UID:\"319e02d8-46b8-11e9-b537-d050998bc684\", APIVersion:\"longhorn.rancher.io/v1alpha1\", ResourceVersion:\"8738260\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pihole-config-r-9d34fe99"
time="2019-03-15T07:31:46Z" level=info msg="Event(v1.ObjectReference{Kind:\"Replica\", Namespace:\"longhorn-system\", Name:\"pihole-config-r-de19446f\", UID:\"31b012b5-46b8-11e9-b537-d050998bc684\", APIVersion:\"longhorn.rancher.io/v1alpha1\", ResourceVersion:\"8738262\", FieldPath:\"\"}): type: 'Normal' reason: 'Stop' Stops pihole-config-r-de19446f"
time="2019-03-15T07:31:47Z" level=debug msg="Requeue longhorn-system/pihole-config-r-7d23f5df due to conflict"
time="2019-03-15T07:31:48Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T07:31:54Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T07:31:55Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T07:31:55Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pihole-config\", UID:\"31664fc2-46b8-11e9-b537-d050998bc684\", APIVersion:\"longhorn.rancher.io/v1alpha1\", ResourceVersion:\"8738251\", FieldPath:\"\"}): type: 'Normal' reason: 'Detached' volume pihole-config has been detached"
time="2019-03-15T07:47:54Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T07:56:55Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T08:08:25Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T08:11:54Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T08:15:23Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T08:15:50Z" level=debug msg="websocket: open" id=4128074707144928693 type=nodes
time="2019-03-15T08:15:50Z" level=debug msg="websocket: open" id=6764860400748216452 type=volumes
time="2019-03-15T08:15:50Z" level=debug msg="websocket: open" id=6648466556511782731 type=settings
time="2019-03-15T08:15:50Z" level=debug msg="websocket: open" id=4493170291527913011 type=events
time="2019-03-15T08:15:59Z" level=debug msg="websocket: close 1001 (going away)" id=4128074707144928693 type=nodes
time="2019-03-15T08:15:59Z" level=debug msg="websocket: close 1001 (going away)" id=6764860400748216452 type=volumes
time="2019-03-15T08:15:59Z" level=debug msg="websocket: close 1001 (going away)" id=6648466556511782731 type=settings
time="2019-03-15T08:15:59Z" level=debug msg="websocket: close 1001 (going away)" id=4493170291527913011 type=events
time="2019-03-15T08:16:00Z" level=debug msg="websocket: open" id=4181549130489787303 type=nodes
time="2019-03-15T08:16:00Z" level=debug msg="websocket: open" id=5752050787147960179 type=volumes
time="2019-03-15T08:16:00Z" level=debug msg="websocket: open" id=6237091507046503631 type=settings
time="2019-03-15T08:16:00Z" level=debug msg="websocket: open" id=5573154643070934735 type=events
time="2019-03-15T08:16:06Z" level=debug msg="websocket: open" id=7145142990055357628 type=engineimages
time="2019-03-15T08:35:59Z" level=debug msg="websocket: close 1001 (going away)" id=7145142990055357628 type=engineimages
time="2019-03-15T08:35:59Z" level=debug msg="websocket: close 1001 (going away)" id=5573154643070934735 type=events
time="2019-03-15T08:35:59Z" level=debug msg="websocket: close 1001 (going away)" id=6237091507046503631 type=settings
time="2019-03-15T08:35:59Z" level=debug msg="websocket: close 1001 (going away)" id=5752050787147960179 type=volumes
time="2019-03-15T08:35:59Z" level=debug msg="websocket: close 1001 (going away)" id=4181549130489787303 type=nodes
time="2019-03-15T08:42:14Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T08:42:21Z" level=debug msg="websocket: open" id=1398672874488986839 type=nodes
time="2019-03-15T08:42:21Z" level=debug msg="websocket: open" id=2037380050782661346 type=engineimages
time="2019-03-15T08:42:21Z" level=debug msg="websocket: open" id=8140573680880219386 type=volumes
time="2019-03-15T08:42:21Z" level=debug msg="websocket: open" id=5183662064263719791 type=settings
time="2019-03-15T08:42:27Z" level=debug msg="websocket: open" id=6635645598637089984 type=events
time="2019-03-15T08:42:30Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T08:43:21Z" level=debug msg="websocket: close 1000 (normal)" id=2037380050782661346 type=engineimages
time="2019-03-15T08:43:21Z" level=debug msg="websocket: close 1000 (normal)" id=5183662064263719791 type=settings
time="2019-03-15T08:43:25Z" level=debug msg="websocket: open" id=74602158249045416 type=settings
time="2019-03-15T08:43:25Z" level=debug msg="websocket: open" id=4851548346893322878 type=engineimages
time="2019-03-15T08:44:51Z" level=debug msg="websocket: close 1000 (normal)" id=8140573680880219386 type=volumes
time="2019-03-15T08:44:53Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T08:44:54Z" level=debug msg="websocket: open" id=1105444861851866387 type=volumes
time="2019-03-15T08:45:24Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T08:46:21Z" level=debug msg="websocket: close 1000 (normal)" id=1105444861851866387 type=volumes
time="2019-03-15T08:46:24Z" level=debug msg="websocket: open" id=5323766511069271790 type=volumes
time="2019-03-15T08:46:58Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T08:47:02Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T08:47:05Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T08:47:21Z" level=debug msg="websocket: close 1000 (normal)" id=5323766511069271790 type=volumes
time="2019-03-15T08:47:25Z" level=debug msg="websocket: open" id=1289587634874238 type=volumes
time="2019-03-15T08:51:24Z" level=debug msg="websocket: close 1001 (going away)" id=1398672874488986839 type=nodes
time="2019-03-15T08:51:24Z" level=debug msg="websocket: close 1001 (going away)" id=6635645598637089984 type=events
time="2019-03-15T08:51:24Z" level=debug msg="websocket: close 1001 (going away)" id=4851548346893322878 type=engineimages
time="2019-03-15T08:51:24Z" level=debug msg="websocket: close 1001 (going away)" id=1289587634874238 type=volumes
time="2019-03-15T08:51:24Z" level=debug msg="websocket: close 1001 (going away)" id=74602158249045416 type=settings
time="2019-03-15T08:51:24Z" level=debug msg="websocket: open" id=9139432551256950817 type=nodes
time="2019-03-15T08:51:24Z" level=debug msg="websocket: open" id=7316449847013065679 type=volumes
time="2019-03-15T08:51:24Z" level=debug msg="websocket: open" id=3966634769506155410 type=settings
time="2019-03-15T08:51:24Z" level=debug msg="websocket: open" id=8007276293101714326 type=events
time="2019-03-15T08:51:54Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T08:52:24Z" level=debug msg="websocket: close 1000 (normal)" id=7316449847013065679 type=volumes
time="2019-03-15T08:52:28Z" level=debug msg="websocket: open" id=1383811817073807746 type=volumes
time="2019-03-15T08:53:44Z" level=debug msg="websocket: close 1000 (normal)" id=3966634769506155410 type=settings
time="2019-03-15T08:53:51Z" level=debug msg="websocket: open" id=8187151353233396656 type=settings
time="2019-03-15T08:59:36Z" level=debug msg="read tcp 10.42.0.223:9500->10.42.0.216:40992: read: connection reset by peer" id=9139432551256950817 type=nodes
time="2019-03-15T09:03:23Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T09:03:43Z" level=debug msg="read tcp 10.42.0.223:9500->10.42.0.216:41604: read: connection reset by peer" id=1383811817073807746 type=volumes
time="2019-03-15T09:03:43Z" level=warning msg="HTTP handling error write tcp 10.42.0.223:9500->10.42.0.216:41604: write: broken pipe"
time="2019-03-15T09:03:43Z" level=error msg="Error in request: write tcp 10.42.0.223:9500->10.42.0.216:41604: write: broken pipe"
2019/03/15 09:03:43 http: response.WriteHeader on hijacked connection
2019/03/15 09:03:43 http: response.Write on hijacked connection
time="2019-03-15T09:03:43Z" level=error msg="Failed to write err: write tcp 10.42.0.223:9500->10.42.0.216:41604: write: broken pipe"
time="2019-03-15T09:10:03Z" level=debug msg="websocket: close 1006 (abnormal closure): unexpected EOF" id=8187151353233396656 type=settings
time="2019-03-15T09:10:10Z" level=debug msg="websocket: close 1006 (abnormal closure): unexpected EOF" id=8007276293101714326 type=events
time="2019-03-15T09:13:24Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T09:18:54Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T09:18:54Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T09:19:54Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T09:31:24Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T09:31:53Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T09:54:53Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T09:57:24Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T09:59:11Z" level=debug msg="websocket: open" id=3688657621062867182 type=nodes
time="2019-03-15T09:59:11Z" level=debug msg="websocket: open" id=8675113931133888859 type=volumes
time="2019-03-15T09:59:11Z" level=debug msg="websocket: open" id=4041442603163985588 type=settings
time="2019-03-15T09:59:11Z" level=debug msg="websocket: open" id=5323680298748747052 type=events
time="2019-03-15T09:59:23Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T10:00:41Z" level=debug msg="websocket: close 1000 (normal)" id=8675113931133888859 type=volumes
time="2019-03-15T10:00:45Z" level=debug msg="websocket: open" id=100342649986862393 type=volumes
time="2019-03-15T10:00:55Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T10:01:11Z" level=debug msg="websocket: close 1000 (normal)" id=4041442603163985588 type=settings
time="2019-03-15T10:01:15Z" level=debug msg="websocket: open" id=8098185332466814955 type=settings
@paulmorabito (Author)

I will close this for now, as I realize I didn't redeploy the Longhorn driver deployer as you recommended previously when the Kubernetes version changed. I also moved to K8s 1.13.4. If it occurs again, I'll re-raise this issue.

paulmorabito commented Mar 15, 2019

This is still happening. Same as above.

time="2019-03-15T11:08:26Z" level=debug msg="read tcp 10.42.0.223:9500->10.42.0.216:35814: read: connection reset by peer" id=8185369120697835514 type=events
time="2019-03-15T11:08:26Z" level=warning msg="HTTP handling error write tcp 10.42.0.223:9500->10.42.0.216:35814: write: broken pipe"
time="2019-03-15T11:08:26Z" level=error msg="Error in request: write tcp 10.42.0.223:9500->10.42.0.216:35814: write: broken pipe"
2019/03/15 11:08:26 http: response.WriteHeader on hijacked connection
2019/03/15 11:08:26 http: response.Write on hijacked connection
time="2019-03-15T11:08:26Z" level=error msg="Failed to write err: write tcp 10.42.0.223:9500->10.42.0.216:35814: write: broken pipe"
time="2019-03-15T11:09:11Z" level=debug msg="read tcp 10.42.0.223:9500->10.42.0.216:33118: read: connection reset by peer" id=3019472168369352208 type=nodes
time="2019-03-15T11:11:01Z" level=debug msg="read tcp 10.42.0.223:9500->10.42.0.216:33120: read: connection reset by peer" id=7721074465402278285 type=volumes
time="2019-03-15T11:11:01Z" level=warning msg="HTTP handling error write tcp 10.42.0.223:9500->10.42.0.216:33120: write: broken pipe"
time="2019-03-15T11:11:01Z" level=error msg="Error in request: write tcp 10.42.0.223:9500->10.42.0.216:33120: write: broken pipe"
2019/03/15 11:11:01 http: response.WriteHeader on hijacked connection
2019/03/15 11:11:01 http: response.Write on hijacked connection
time="2019-03-15T11:11:01Z" level=error msg="Failed to write err: write tcp 10.42.0.223:9500->10.42.0.216:33120: write: broken pipe"
time="2019-03-15T11:11:11Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:14:45Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:15:11Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:16:14Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:17:14Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:18:59Z" level=debug msg="websocket: close 1006 (abnormal closure): unexpected EOF" id=4725361201971631803 type=engineimages
time="2019-03-15T11:19:05Z" level=debug msg="websocket: close 1006 (abnormal closure): unexpected EOF" id=238036254512907690 type=settings
time="2019-03-15T11:20:42Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:21:45Z" level=warning msg="HTTP handling error write tcp 10.42.0.223:9500->10.42.0.216:59428: write: connection reset by peer"
time="2019-03-15T11:21:45Z" level=error msg="Error in request: write tcp 10.42.0.223:9500->10.42.0.216:59428: write: connection reset by peer"
2019/03/15 11:21:45 http: response.WriteHeader on hijacked connection
time="2019-03-15T11:21:45Z" level=debug msg="websocket: close 1006 (abnormal closure): unexpected EOF" id=6169431783094076595 type=nodes
2019/03/15 11:21:45 http: response.Write on hijacked connection
time="2019-03-15T11:21:45Z" level=error msg="Failed to write err: write tcp 10.42.0.223:9500->10.42.0.216:59428: write: connection reset by peer"
time="2019-03-15T11:22:45Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:22:45Z" level=warning msg="HTTP handling error write tcp 10.42.0.223:9500->10.42.0.216:59434: write: connection reset by peer"
time="2019-03-15T11:22:45Z" level=error msg="Error in request: write tcp 10.42.0.223:9500->10.42.0.216:59434: write: connection reset by peer"
2019/03/15 11:22:45 http: response.WriteHeader on hijacked connection
2019/03/15 11:22:45 http: response.Write on hijacked connection
time="2019-03-15T11:22:45Z" level=error msg="Failed to write err: write tcp 10.42.0.223:9500->10.42.0.216:59434: write: connection reset by peer"
time="2019-03-15T11:22:45Z" level=debug msg="websocket: close 1006 (abnormal closure): unexpected EOF" id=5168485637911515866 type=events
time="2019-03-15T11:23:41Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:24:45Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:25:15Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:27:12Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:30:42Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:31:15Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:32:45Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:34:41Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:35:42Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:37:15Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:40:05Z" level=debug msg="websocket: open" id=1430955949357187458 type=nodes
time="2019-03-15T11:40:07Z" level=debug msg="websocket: open" id=1100066232393237368 type=events
time="2019-03-15T11:40:10Z" level=debug msg="websocket: close 1001 (going away)" id=1100066232393237368 type=events
time="2019-03-15T11:40:10Z" level=debug msg="websocket: close 1001 (going away)" id=1430955949357187458 type=nodes
time="2019-03-15T11:40:10Z" level=debug msg="websocket: close 1001 (going away)" id=1779197760252059838 type=settings
time="2019-03-15T11:40:10Z" level=debug msg="websocket: close 1001 (going away)" id=7287566523553878274 type=volumes
time="2019-03-15T11:40:10Z" level=debug msg="websocket: open" id=1514645244813189170 type=nodes
time="2019-03-15T11:40:10Z" level=debug msg="websocket: open" id=888831306678476749 type=volumes
time="2019-03-15T11:40:10Z" level=debug msg="websocket: open" id=999993769452696059 type=settings
time="2019-03-15T11:40:10Z" level=debug msg="websocket: open" id=9084276990810808601 type=events
time="2019-03-15T11:40:17Z" level=debug msg="websocket: open" id=7860987994654670230 type=engineimages
time="2019-03-15T11:43:45Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:45:11Z" level=debug msg="Requeue longhorn-system/192.168.1.113 due to conflict"
time="2019-03-15T11:46:56Z" level=debug msg="websocket: close 1000 (normal)" id=9084276990810808601 type=events
time="2019-03-15T11:47:09Z" level=debug msg="websocket: open" id=504605418748729395 type=events
time="2019-03-15T11:50:13Z" level=debug msg="websocket: close 1000 (normal)" id=504605418748729395 type=events
time="2019-03-15T11:50:17Z" level=debug msg="websocket: open" id=8077085466658919124 type=events
time="2019-03-15T11:50:21Z" level=debug msg="websocket: close 1000 (normal)" id=7860987994654670230 type=engineimages
time="2019-03-15T11:50:24Z" level=debug msg="websocket: open" id=8644144950573174000 type=engineimages
time="2019-03-15T11:50:24Z" level=debug msg="websocket: open" id=6415339020896519545 type=nodes
time="2019-03-15T11:50:24Z" level=debug msg="websocket: open" id=8388094500921628307 type=engineimages
time="2019-03-15T11:50:24Z" level=debug msg="websocket: open" id=2334813102760982254 type=volumes
time="2019-03-15T11:50:24Z" level=debug msg="websocket: open" id=833443658099127517 type=settings

Kubernetes 1.13.4
Rancher 2.1.7
Bare metal.

Although I changed K8s versions during the Rancher upgrade, I moved back to 1.13.4, so the only difference is Rancher moving from 2.1.6 to 2.1.7. I deleted the deployments and re-deployed the Longhorn driver deployer after the last K8s upgrade.

Any ideas?

paulmorabito reopened this Mar 15, 2019
yasker commented Mar 15, 2019

@paulmorabito It seems to be caused by network issues in your setup:

time="2019-03-15T11:08:26Z" level=debug msg="read tcp 10.42.0.223:9500->10.42.0.216:35814: read: connection reset by peer" id=8185369120697835514 type=events
time="2019-03-15T11:08:26Z" level=warning msg="HTTP handling error write tcp 10.42.0.223:9500->10.42.0.216:35814: write: broken pipe"
time="2019-03-15T11:08:26Z" level=error msg="Error in request: write tcp 10.42.0.223:9500->10.42.0.216:35814: write: broken pipe"
2019/03/15 11:08:26 http: response.WriteHeader on hijacked connection
2019/03/15 11:08:26 http: response.Write on hijacked connection
time="2019-03-15T11:08:26Z" level=error msg="Failed to write err: write tcp 10.42.0.223:9500->10.42.0.216:35814: write: broken pipe"
time="2019-03-15T11:09:11Z" level=debug msg="read tcp 10.42.0.223:9500->10.42.0.216:33118: read: connection reset by peer" id=3019472168369352208 type=nodes
time="2019-03-15T11:11:01Z" level=debug msg="read tcp 10.42.0.223:9500->10.42.0.216:33120: read: connection reset by peer" id=7721074465402278285 type=volumes
time="2019-03-15T11:11:01Z" level=warning msg="HTTP handling error write tcp 10.42.0.223:9500->10.42.0.216:33120: write: broken pipe"
time="2019-03-15T11:11:01Z" level=error msg="Error in request: write tcp 10.42.0.223:9500->10.42.0.216:33120: write: broken pipe"
2019/03/15 11:11:01 http: response.WriteHeader on hijacked connection
2019/03/15 11:11:01 http: response.Write on hijacked connection

Can you check whether the connection between different nodes (pods across the nodes) is stable?

paulmorabito commented Mar 15, 2019 via email

yasker commented Mar 15, 2019

I've written the following in #428, and we're working on a tool to show network instability: #430

We can run three small pods, one per node (by setting nodeName in the pod), and make them ping the Longhorn manager on all the nodes. Then, when a network outage happens, we can look at those logs to see whether only Longhorn is affected or the whole Kubernetes network on the node goes down.

Also, on each node, use screen to run ping against the other two nodes, write the output to a file, and check the result when the outage happens.

If only the test in the Kubernetes pod fails, that's a Kubernetes overlay network issue; if both fail, it's a vendor VM/network issue.
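
A minimal sketch of both checks, assuming busybox for the in-cluster pods; the node names, screen session name, and peer node IP below are placeholders, and 10.42.0.223 is the longhorn-manager pod IP seen in the logs above:

```sh
# 1) In-cluster check: a small pod pinned to one node (repeat per node),
#    continuously pinging the longhorn-manager pod on another node.
kubectl run netcheck-node1 --image=busybox --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"node1"}}' \
  -- ping 10.42.0.223
kubectl logs -f netcheck-node1        # review this when a volume faults

# 2) Host-level check: on each node, log a continuous ping to the other nodes
#    from inside a screen session.
screen -dmS ping-node2 sh -c 'ping 192.168.1.114 > /root/ping-node2.log 2>&1'
```

Comparing the timestamps in both logs against the faulted-volume events should show whether an outage is limited to the overlay network or affects the host network too.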

paulmorabito commented Mar 16, 2019 via email

yasker commented Mar 16, 2019

Network latency sounds like the reason.

You can disable scheduling to the NAS box in the node setup, but that only applies to replicas. If k8s decides to schedule the workload onto the NAS box, the engine would need to run there, so it can still cause problems. To avoid that, you can limit the pod to scheduling only on the other box, as sketched below.
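
A rough sketch of that last step, assuming the workload is the wallabag Deployment in the apps namespace (as in the mount error above); the node name and label are hypothetical:

```sh
# Label the node(s) the workload may run on (i.e. not the NAS box).
kubectl label node <other-node> longhorn-workload=allowed

# Pin the Deployment to those nodes with a plain nodeSelector, so neither the
# pod nor the Longhorn engine ends up on the NAS box.
kubectl -n apps patch deployment wallabag --type merge -p \
  '{"spec":{"template":{"spec":{"nodeSelector":{"longhorn-workload":"allowed"}}}}}'
```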

paulmorabito commented Mar 16, 2019 via email

yasker commented Mar 16, 2019

More volumes will likely use more resources. Volume size or frequency of access won't matter much compared to the number of volumes.

yasker added the kind/bug, area/driver, and area/kubernetes labels Mar 19, 2019
yasker added this to the v0.5.0 milestone Mar 19, 2019
paulmorabito commented Mar 20, 2019 via email

yasker commented Mar 20, 2019

@paulmorabito Are there any lost pings? If you look at the time the volume became faulted and the time the ping failed, are they correlated?

Also, what was the last stable version for you? Rancher version and Kubernetes version.

And there may be something in canal's logs. Take a look at kubectl -n kube-system logs canal-xxx; those are the overlay networking components Kubernetes uses.
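
For example (canal-xxx is a placeholder pod name; the canal pod has multiple containers, hence --all-containers, or use -c <container> on older kubectl):

```sh
kubectl -n kube-system get pods -o wide | grep canal    # find the pod on the affected node
kubectl -n kube-system logs canal-xxx --all-containers --tail=200
```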

Also, you can send me the latest support bundle.

paulmorabito commented Mar 21, 2019 via email

yasker commented Apr 30, 2019

We will add an experimental feature, GuaranteedEngineCPU, to guarantee the engine's CPU in v0.5.0. We hope this will improve Longhorn's stability under high load. However, due to the potential resource constraint, attaching a volume may fail with the feature enabled (that's why it's experimental).

Let me know if it improves the situation you're experiencing.

We are working on a better solution after v0.5.0.

yasker modified the milestones: v0.5.0, v0.6.0 Apr 30, 2019
paulmorabito commented May 3, 2019 via email

yasker modified the milestones: v0.6.0, v0.7.0 Jul 29, 2019
yasker removed this from the v0.7.0 milestone Sep 21, 2019
@cameronbraid

I'm getting a similar thing in v0.8.0

The dashboard is showing count 260: Engine of volume pvc-757fea95-d1c4-4cae-a87b-a240ae1489e4 dead unexpectedly, reattach the volume

Logs: logs-from-replica-manager-in-instance-manager-r-edd60c6a.txt

@cameronbraid

It's happening to all three volumes I have created.

@cameronbraid

I have a few tgtd -f processes running, each using 100% of a CPU.

@cameronbraid

Running Kubernetes 1.17 in kind.

yasker commented Apr 6, 2020

@cameronbraid Do you have open-iscsi installed on the node? You would see some error messages regarding iscsiadm, etc., in the manager's log.
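
A couple of quick checks, assuming the default longhorn-system namespace; the app=longhorn-manager label is the usual one but may differ in your install:

```sh
# On the node (or inside the kind node container):
iscsiadm --version                                   # fails if open-iscsi is missing

# Grep the longhorn-manager logs for iscsiadm-related errors:
kubectl -n longhorn-system logs -l app=longhorn-manager | grep -i iscsiadm
```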

Also, kind is not officially supported by Longhorn currently. local-path-provisioner (which is included in newer versions of kind) should work better for your case if you're using a single node.

@cameronbraid

Yeah, open-iscsi is installed on the node.

My node container is:

FROM kindest/node:v1.17.2
RUN apt-get update && apt-get -y install less iputils-ping dnsutils open-iscsi

I am using kind to run my developer cluster, which I was setting up to be as similar to my staging/live cluster as possible. kind seemed to be the way to do that; I just don't have the resources to run 3 VMs on my dev machine.
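
For reference, a minimal sketch of building that image and starting a kind cluster with it (the image tag is just an example):

```sh
docker build -t kindest/node:v1.17.2-iscsi .
kind create cluster --image kindest/node:v1.17.2-iscsi
```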

@wackazong

@cameronbraid did you manage to resolve this?
