Randomly detaching volumes #431
I will close this for now as I realize I didn't redeploy the longhorn driver deployer as you recommended previously when the kubernetes version changed. I also moved to K8S 1.13.4. If it occurs again, I'll re-raise this issue. |
This is still happening. Same as above.
Kubernetes 1.13.4 Although I changed K8S versions during upgrade of Rancher, I moved back to 1.13.4 so the only difference is Rancher moving from 2.1.6 to 2.1.7. I deleted the deployments and re-deployed longhorn driver deployment after last K8S upgrade. Any ideas? |
@paulmorabito It seems to be caused by network issues in your setup:
time="2019-03-15T11:08:26Z" level=debug msg="read tcp 10.42.0.223:9500->10.42.0.216:35814: read: connection reset by peer" id=8185369120697835514 type=events
time="2019-03-15T11:08:26Z" level=warning msg="HTTP handling error write tcp 10.42.0.223:9500->10.42.0.216:35814: write: broken pipe"
time="2019-03-15T11:08:26Z" level=error msg="Error in request: write tcp 10.42.0.223:9500->10.42.0.216:35814: write: broken pipe"
2019/03/15 11:08:26 http: response.WriteHeader on hijacked connection
2019/03/15 11:08:26 http: response.Write on hijacked connection
time="2019-03-15T11:08:26Z" level=error msg="Failed to write err: write tcp 10.42.0.223:9500->10.42.0.216:35814: write: broken pipe"
time="2019-03-15T11:09:11Z" level=debug msg="read tcp 10.42.0.223:9500->10.42.0.216:33118: read: connection reset by peer" id=3019472168369352208 type=nodes
time="2019-03-15T11:11:01Z" level=debug msg="read tcp 10.42.0.223:9500->10.42.0.216:33120: read: connection reset by peer" id=7721074465402278285 type=volumes
time="2019-03-15T11:11:01Z" level=warning msg="HTTP handling error write tcp 10.42.0.223:9500->10.42.0.216:33120: write: broken pipe"
time="2019-03-15T11:11:01Z" level=error msg="Error in request: write tcp 10.42.0.223:9500->10.42.0.216:33120: write: broken pipe"
2019/03/15 11:11:01 http: response.WriteHeader on hijacked connection
2019/03/15 11:11:01 http: response.Write on hijacked connection
Can you check if the connections between different nodes (pods across the nodes) are stable? |
How can I do that? Nothing has physically changed with the network, and they're wired connections. The only major system change was the upgrade. I've not noticed any other issues elsewhere either. |
I've written the following in #428, and we're working on a tool to show the network instability (#430). We can run three small pods, one per node (by setting nodeName in the pod), and make them ping the Longhorn manager on all the nodes. Then, when a network outage happens, we can look at those logs to see whether only Longhorn or the whole Kubernetes network on the node goes down. Also, on each node, use screen to run ping towards the other two nodes, write the output to a file, and check the result when the outage happens. If only the test in the Kubernetes pod failed, that's a Kubernetes overlay network issue; if both failed, it's a vendor VM network issue. |
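For reference, a minimal sketch of that test. Node names, pod names, and IPs below are placeholders; the longhorn-manager pod IPs come from your own cluster:

```sh
# Find the longhorn-manager pod IPs on each node:
kubectl -n longhorn-system get pods -o wide | grep longhorn-manager

# Pin a small test pod to a specific node via spec.nodeName and have it
# ping a longhorn-manager pod on another node:
kubectl run netcheck-node1 --image=busybox --restart=Never \
  --overrides='{"apiVersion":"v1","spec":{"nodeName":"node1"}}' \
  -- ping <longhorn-manager-pod-ip-on-node2>

# On each host itself (outside Kubernetes), log pings to the other node
# inside a detached screen session for later inspection:
screen -dmS netcheck sh -c 'ping <node2-ip> >> /var/tmp/ping-node2.log 2>&1'
```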
OK, I will give this a try. This is a home setup. I have two nodes; one of them is a QNAP NAS running an Intel J3455, so it's underpowered, relatively speaking. This seems to be the one that is struggling to run Longhorn, as my other node is working fine. I'm finding that over time I get limited node redundancy (with replicas only on my non-NAS node). I put this down to the above, but possibly it's also causing the network issue because of delays/latency in responses when the CPU or IO spikes? Would it be better to disable scheduling on the NAS node for now and run Longhorn on only one node? I'm planning on getting better hardware shortly anyhow.
|
Network latency sounds like the reason. You can disable scheduling to the NAS box in the node setup, but that only applies to replicas. If Kubernetes decides to schedule a workload to the NAS box, the engine would need to run there, so it can still cause problems. You can limit the pod to schedule only to the other box to avoid that. |
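For example, a sketch (node and deployment names are placeholders) of keeping a workload off the NAS box with a label and a nodeSelector:

```sh
# Label the stronger node, then constrain the workload to it:
kubectl label nodes <non-nas-node> longhorn-workload=allowed

# Patch the workload's pod template so Kubernetes only schedules it there:
kubectl patch deployment <my-app> -p \
  '{"spec":{"template":{"spec":{"nodeSelector":{"longhorn-workload":"allowed"}}}}}'
```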
Is there anything in particular that is taxing? Volume size, or how frequently it's read/written to? I'll try to reduce the number of volumes as a short-term measure and monitor as you recommend.
|
More volumes will likely use more resources. Volume size or frequency of access won't matter much compared to the number of volumes. |
After testing for a few days, there seems to be nothing specific in the
kubelet logs or pings (though ping response time varies).
I've reduced the number of volumes and I am still finding volumes
disconnecting randomly. I am down to half the volumes I was using previously
and it is still occurring. Previously, I was running many more volumes and
although sometimes I had issues with node redundancy on my slower node, I
never had disconnections until I upgraded from Rancher 2.1.6 to 2.1.7. It
may be a strange coincidence but it is very odd. Is there anything
different between the two that could have increased the sensitivity?
If not, I'll wait until I have better hardware and test again.
|
@paulmorabito Are there any lost pings? If you look at the time the volume became faulted and the time the ping failed, are they correlated? Also, what's the last stable version for you (Rancher version and Kubernetes version)? And there may be something in canal's logs. Take a look at kubectl -n kube-system logs canal-xxx; those are the overlay networking components Kubernetes uses. Also, you can send me the latest support bundle. |
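One way to make that correlation easier to spot (a sketch; the node IP is a placeholder) is to timestamp every ping reply so it can be matched against the time the volume faulted:

```sh
# Timestamp each ping line so it lines up with the Longhorn event times:
ping <other-node-ip> | while read -r line; do
  echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') $line"
done >> /var/tmp/ping-timestamps.log

# Overlay network logs (the canal pod name suffix differs per cluster):
kubectl -n kube-system logs canal-xxx
```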
I'll take a look at canal also. I can't see anything out of the ordinary so far, but the logging history only goes back a few minutes. The last stable version was Rancher 2.1.6 with Kubernetes 1.13.4. Now I am using Rancher 2.1.7 and Kubernetes 1.13.4. If there is any correlation, it is that the ping times are longer around the volume disconnect. As best as I can see, it looks like CPU/disk IO is spiking (and the node is hitting swap etc.). As the node itself has an Intel J3455 processor, it is not responding fast enough, so Longhorn disconnects. If I try to hit the Rancher UI during this time, it's slow to load pages, but nothing times out. I'll let you know if I find anything more.
|
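To confirm the CPU/disk-IO-spike theory, one simple approach is to record timestamped system stats on the NAS node and compare them against the disconnect times. A sketch (the 5-second interval is arbitrary; requires procps and sysstat on the host):

```sh
# vmstat -t appends a timestamp column; iostat -t prints one per sample:
vmstat -t 5 >> /var/tmp/vmstat.log &
iostat -xt 5 >> /var/tmp/iostat.log &
```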
We will add an experimental feature to guarantee the engine's CPU in v0.5.0. We hope this will improve Longhorn stability under high load. Though, due to the potential resource constraint, attaching a volume may fail with the feature enabled (that's why it's experimental). Let me know if it improves the situation you're experiencing. We are working on a better solution after v0.5.0. |
OK, thank you for the update. So far, since I've moved to better hardware, I've not had an issue with volumes disconnecting, but I would be happy to help test this out.
|
I'm getting a similar thing in v0.8.0; the dashboard is showing a count of 260. Logs: |
It's happening to all three volumes I have created. |
I have a few |
Running Kubernetes 1.17 in kind. |
@cameronbraid Do you have open-iscsi installed on the node? If not, you would see related error messages in the logs. Also, kind is not officially supported by Longhorn currently. |
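A quick way to verify the open-iscsi prerequisite on a node (a sketch assuming a systemd-based host; the service name can vary by distribution):

```sh
# Check that the iscsi client tooling is installed and the daemon is up:
iscsiadm --version
systemctl status iscsid
```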
Yeah, open-iscsi is installed on the node. My node container is:
I am using kind to run my developer cluster, which I was setting up to be as similar to my staging/live cluster as possible. kind seemed to be the way to do that. I just don't have the resources to run 3 VMs on my dev machine. |
@cameronbraid did you manage to resolve this? |
Hi,
I'm running Longhorn 0.4.0 on Rancher 2.1.7, Kubernetes version 1.12.6.
I upgraded to the latest Rancher (from 2.1.6) and Kubernetes at the same time.
Since then, I've noticed volumes randomly detaching. The error in longhorn-manager logs for the latest is:
This has never happened prior to the upgrade so I am a little unsure what is causing it. If I try to re-deploy, I get an error:
The only way around this seems to be to restore from a backup but it seems to repeatedly happen, so far to multiple containers.
And another example, just happened: