
windows pods are not reachable on a hybrid Linux/ windows cluster #103

Open
llyons opened this issue Oct 8, 2020 · 22 comments
Labels: lifecycle/frozen (indicates that an issue or PR should not be auto-closed due to staleness)
@llyons commented Oct 8, 2020

Hi,

We have a custom "bare metal" Kubernetes cluster with two Linux nodes and one Windows node; we're just getting familiar with how it all works. We have the NGINX ingress controller and the MetalLB load balancer running. Both of those controllers run only on the Linux nodes and the master; they don't run on the Windows node (I was told this wasn't needed). We have deployed a number of Linux containers on the Linux nodes and they work. The two Windows containers running on the Windows node start and are running, but they are not reachable on the ClusterIP or the provisioned service IP. The containers are running on the Windows node, since I can see them and even docker exec -it into them. Doing a kubectl get svc I have these values for the clientportal app:

clientportal LoadBalancer 10.110.61.103 10.243.0.39 80:30875/TCP 21h app=clientportal

I am able to get to the app at http://windows-node-ip:30875,
but I can't get to it at http://10.243.0.39.
I can't curl the app from one of the Linux nodes using the ClusterIP or the service IP; I can with the actual node IP.

I do notice some errors like this in the /var/log/kubelet log file on the Windows node:

cni_windows.go:59] error while adding to cni network: error while GETHNSNewtorkByName(flannel.4096): Network flannel.4096 not found

file.go:104] Unable to read config path "C:\\var\\lib\\kubelet\\etc\\kubernetes\\manifests": path does not exist, ignoring

kubectl get nodes shows the Windows node is Ready.
kubectl get pods shows running pods on the Windows node.
docker container ls on the Windows node shows the containers are running and scheduled.

We did upgrade kubelet and kubeadm to 1.19.2, but it looks like we have had this issue for some time.

C:\k>kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", BuildDate:"2020-09-16T13:38:53Z", GoVersion:"go1.15", Compiler:"gc", Platform:"windows/amd64"}

Attached are some of the kubelet log files.

Also, here are the logs from kubectl -n kube-system logs kube-flannel-ds-windows-amd64-vrsg9:

Mode LastWriteTime Length Name


d----- 8/20/2020 4:35 PM serviceaccount
WARNING: The names of some imported commands from the module 'hns' include unapproved verbs that might make them less
discoverable. To find the commands with unapproved verbs, run the Import-Module command again with the Verbose
parameter. For a list of approved verbs, type Get-Verb.
Invoke-HnsRequest : @{Error=An adapter was not found. ; ErrorCode=2151350278; Success=False}
At C:\k\flannel\hns.psm1:233 char:16

+ ... return Invoke-HnsRequest -Method POST -Type networks -Data $Json ...
+             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Write-Error], WriteErrorException
    + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Invoke-HNSRequest

I1008 10:46:56.667299 8008 main.go:518] Determining IP address of default interface
I1008 10:46:57.507782 8008 main.go:531] Using interface with name Ethernet0 2 and address 10.243.1.202
I1008 10:46:57.507782 8008 main.go:548] Defaulting external address to interface address (10.243.1.202)
I1008 10:46:57.545795 8008 kube.go:119] Waiting 10m0s for node controller to sync
I1008 10:46:57.545795 8008 kube.go:306] Starting kube subnet manager
I1008 10:46:58.551449 8008 kube.go:126] Node controller sync successful
I1008 10:46:58.551449 8008 main.go:246] Created subnet manager: Kubernetes Subnet Manager - aabrw-kuber03
I1008 10:46:58.551449 8008 main.go:249] Installing signal handlers
I1008 10:46:58.551449 8008 main.go:390] Found network config - Backend type: vxlan
I1008 10:46:58.551449 8008 vxlan_windows.go:127] VXLAN config: Name=flannel.4096 MacPrefix=0E-2A VNI=4096 Port=4789 GBP=false
DirectRouting=false
I1008 10:46:58.619205 8008 device_windows.go:116] Attempting to create HostComputeNetwork &{ flannel.4096 Overlay [] {[]} { [
] [] []} [{Static [{192.168.2.0/24 [[123 34 84 121 112 101 34 58 34 86 83 73 68 34 44 34 83 101 116 116 105 110 103 115 34 58 12
3 34 73 115 111 108 97 116 105 111 110 73 100 34 58 52 48 57 54 125 125]] [{192.168.2.1 0.0.0.0/0 0}]}]}] 8 {2 0}}
E1008 10:46:59.972661 8008 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.243.1.202:50491
->10.243.1.212:6443: wsarecv: An established connection was aborted by the software in your host machine.
E1008 10:46:59.973662 8008 reflector.go:304] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to watch *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=10547296&timeoutSeconds=582&watch=true: http2: no cached connection was
available
E1008 10:47:01.036947 8008 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
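As an aside, the byte array in the Attempting to create HostComputeNetwork line above is just ASCII-encoded JSON for the subnet's VSID policy; decoding it confirms the isolation ID matches the VNI 4096 from the VXLAN config line. A quick sketch:

```python
import json

# The ASCII codes printed in the HostComputeNetwork log line,
# joined back together (they wrap across two log lines).
policy_bytes = bytes([
    123, 34, 84, 121, 112, 101, 34, 58, 34, 86, 83, 73, 68, 34, 44,
    34, 83, 101, 116, 116, 105, 110, 103, 115, 34, 58, 123, 34, 73,
    115, 111, 108, 97, 116, 105, 111, 110, 73, 100, 34, 58, 52, 48,
    57, 54, 125, 125,
])

policy = json.loads(policy_bytes)
# policy == {"Type": "VSID", "Settings": {"IsolationId": 4096}}
```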

kubelet.exe.logs.zip

any feedback or guidance would be appreciated.

@jsturtevant (Contributor)
The issue is that flannel is starting before the external network is created. The workaround is to restart the flannel pod.

The network is created and then flannel is started here:

wins cli process run --path /k/flannel/setup.exe --args "--mode=overlay --interface=Ethernet"
wins cli route add --addresses 169.254.169.254
wins cli process run --path /k/flannel/flanneld.exe --args "--kube-subnet-mgr --kubeconfig-file /k/flannel/kubeconfig.yml" --envs "POD_NAME=$env:POD_NAME POD_NAMESPACE=$env:POD_NAMESPACE"

The external network doesn't finish creating before flannel is started, causing the bad loop.
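In other words, the ordering fix amounts to not launching flanneld until the external HNS network is actually visible. A minimal sketch of that wait loop (the helper and the get_network lookup are hypothetical stand-ins for a Get-HNSNetwork query, not part of the actual scripts):

```python
import time

def wait_for_network(get_network, name, timeout=60.0, interval=1.0, sleep=time.sleep):
    """Poll until get_network(name) returns something, i.e. until the
    named HNS network exists, before starting flanneld.

    get_network is a hypothetical stand-in for a Get-HNSNetwork lookup;
    it should return None while the network is absent.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_network(name) is not None:
            return True
        sleep(interval)
    return False
```

In the real scripts this kind of check would sit between the setup.exe call that creates the external network and the wins call that launches flanneld.exe.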

@llyons (Author) commented Oct 9, 2020

Do we restart those pods by running the commands above, or by deleting the pods and having them restart?

@jsturtevant (Contributor)
They run in a DaemonSet so you can kubectl delete pod or run kubectl rollout restart daemonset

@llyons (Author) commented Oct 9, 2020

So after trying this, it seems like we still have the same core issue. The simple Windows web apps running on the Windows node are not reachable from the Linux machines (curl the ClusterIP or the service IP). The Linux apps on Linux nodes are reachable. The Windows container on the Windows node can be reached via the Windows node IP:port.

The new info in the logs is this:

Mode LastWriteTime Length Name


d----- 10/9/2020 11:38 AM flannel

Directory: C:\host\k\flannel\var\run\secrets\kubernetes.io

Mode LastWriteTime Length Name


d----- 10/9/2020 10:46 AM serviceaccount
WARNING: The names of some imported commands from the module 'hns' include unapproved verbs that might make them less
discoverable. To find the commands with unapproved verbs, run the Import-Module command again with the Verbose
parameter. For a list of approved verbs, type Get-Verb.
Invoke-HnsRequest : @{Error=An adapter was not found. ; ErrorCode=2151350278; Success=False}
At C:\k\flannel\hns.psm1:233 char:16

+ ... return Invoke-HnsRequest -Method POST -Type networks -Data $Json ...
+             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Write-Error], WriteErrorException
    + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Invoke-HNSRequest

I1009 12:34:16.131480 7100 main.go:518] Determining IP address of default interface
I1009 12:34:17.551409 7100 main.go:531] Using interface with name Ethernet0 2 and address 10.243.1.202
I1009 12:34:17.551409 7100 main.go:548] Defaulting external address to interface address (10.243.1.202)
I1009 12:34:17.597426 7100 kube.go:119] Waiting 10m0s for node controller to sync
I1009 12:34:17.597426 7100 kube.go:306] Starting kube subnet manager
I1009 12:34:18.612968 7100 kube.go:126] Node controller sync successful
I1009 12:34:18.612968 7100 main.go:246] Created subnet manager: Kubernetes Subnet Manager - ssssssss
I1009 12:34:18.612968 7100 main.go:249] Installing signal handlers
I1009 12:34:18.612968 7100 main.go:390] Found network config - Backend type: vxlan
I1009 12:34:18.612968 7100 vxlan_windows.go:127] VXLAN config: Name=flannel.4096 MacPrefix=0E-2A VNI=4096 Port=4789 GBP=false
DirectRouting=false
I1009 12:34:19.445335 7100 device_windows.go:116] Attempting to create HostComputeNetwork &{ flannel.4096 Overlay [] {[]} { [
] [] []} [{Static [{192.168.2.0/24 [[123 34 84 121 112 101 34 58 34 86 83 73 68 34 44 34 83 101 116 116 105 110 103 115 34 58 12
3 34 73 115 111 108 97 116 105 111 110 73 100 34 58 52 48 57 54 125 125]] [{192.168.2.1 0.0.0.0/0 0}]}]}] 8 {2 0}}
E1009 12:34:21.947399 7100 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.243.1.202:50521
->10.243.1.212:6443: wsarecv: An established connection was aborted by the software in your host machine.
E1009 12:34:21.947399 7100 reflector.go:304] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to watch *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=10782832&timeoutSeconds=582&watch=true: http2: no cached connection was
available
E1009 12:34:23.075801 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
E1009 12:34:24.125142 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
E1009 12:34:25.203474 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
I1009 12:34:25.966698 7100 device_windows.go:124] Waiting to get ManagementIP from HostComputeNetwork flannel.4096
E1009 12:34:26.214764 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
I1009 12:34:26.522847 7100 device_windows.go:136] Waiting to get net interface for HostComputeNetwork flannel.4096 (10.243.1.
202)
E1009 12:34:27.219039 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
E1009 12:34:28.246212 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
I1009 12:34:29.233905 7100 device_windows.go:145] Created HostComputeNetwork flannel.4096
E1009 12:34:29.277916 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
I1009 12:34:29.341929 7100 main.go:313] Changing default FORWARD chain policy to ACCEPT
I1009 12:34:29.343930 7100 main.go:321] Wrote subnet file to /run/flannel/subnet.env
I1009 12:34:29.343930 7100 main.go:325] Running backend.
I1009 12:34:29.344933 7100 main.go:343] Waiting for all goroutines to exit
I1009 12:34:29.344933 7100 vxlan_network_windows.go:63] Watching for new subnet leases
E1009 12:34:30.281059 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
E1009 12:34:31.287135 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
E1009 12:34:32.303437 7100 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available

Should I attempt to add the node again? Not sure what to do next.

@llyons (Author) commented Oct 9, 2020

One thing I noticed is that our operating system is Windows Server 1809 and we have Kubernetes 1.19.2.

Do we need Kubernetes 1.18 to make this work?

@llyons (Author) commented Oct 14, 2020

So we tried to make sure our OS version (1809) did not include the September patch that has been shown to cause these issues, as identified in microsoft/Windows-Containers#61.

After setting this up we restarted the Windows node and made sure the pods were back to Running, and we still have the same issues as before.

Our OS version is now Microsoft Windows [Version 10.0.17763.1397]

The logs still show the same issues. We can't access the simple web apps on the Windows node through the service IP or ClusterIP, but we can get to the running containers on the Windows node using NodeIP:port.

The Windows node is running and all the pods are running. Linux web apps on Linux nodes are accessible.

Here are the contents of the log files.

c:\var\logs\kubelet on the Windows node:

kubelet.exe.AABRW-KUBER03.OLH_AABRW-KUBER03$.log.ERROR.20201014-120556.zip

Results of kubectl -n kube-system logs kube-flannel-ds-windows-amd64-fw8k9:

Mode LastWriteTime Length Name


d----- 10/13/2020 11:53 AM serviceaccount
WARNING: The names of some imported commands from the module 'hns' include unapproved verbs that might make them less
discoverable. To find the commands with unapproved verbs, run the Import-Module command again with the Verbose
parameter. For a list of approved verbs, type Get-Verb.
Invoke-HnsRequest : @{Error=An adapter was not found. ; ErrorCode=2151350278; Success=False}
At C:\k\flannel\hns.psm1:233 char:16

+ ... return Invoke-HnsRequest -Method POST -Type networks -Data $Json ...
+             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : NotSpecified: (:) [Write-Error], WriteErrorException
    + FullyQualifiedErrorId : Microsoft.PowerShell.Commands.WriteErrorException,Invoke-HNSRequest

I1014 12:07:44.490869 10480 main.go:518] Determining IP address of default interface
I1014 12:07:46.139972 10480 main.go:531] Using interface with name Ethernet0 2 and address 10.243.1.202
I1014 12:07:46.139972 10480 main.go:548] Defaulting external address to interface address (10.243.1.202)
I1014 12:07:46.158979 10480 kube.go:119] Waiting 10m0s for node controller to sync
I1014 12:07:46.158979 10480 kube.go:306] Starting kube subnet manager
I1014 12:07:47.192086 10480 kube.go:126] Node controller sync successful
I1014 12:07:47.192086 10480 main.go:246] Created subnet manager: Kubernetes Subnet Manager - aabrw-kuber03
I1014 12:07:47.192086 10480 main.go:249] Installing signal handlers
I1014 12:07:47.192086 10480 main.go:390] Found network config - Backend type: vxlan
I1014 12:07:47.192086 10480 vxlan_windows.go:127] VXLAN config: Name=flannel.4096 MacPrefix=0E-2A VNI=4096 Port=4789 GBP=false
DirectRouting=false
I1014 12:07:47.416559 10480 device_windows.go:116] Attempting to create HostComputeNetwork &{ flannel.4096 Overlay [] {[]} { [
] [] []} [{Static [{192.168.2.0/24 [[123 34 84 121 112 101 34 58 34 86 83 73 68 34 44 34 83 101 116 116 105 110 103 115 34 58 12
3 34 73 115 111 108 97 116 105 111 110 73 100 34 58 52 48 57 54 125 125]] [{192.168.2.1 0.0.0.0/0 0}]}]}] 8 {2 0}}
E1014 12:07:48.689549 10480 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.243.1.202:50567
->10.243.1.212:6443: wsarecv: An established connection was aborted by the software in your host machine.
E1014 12:07:48.690549 10480 reflector.go:304] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to watch *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=11876190&timeoutSeconds=582&watch=true: http2: no cached connection was
available
E1014 12:07:49.767699 10480 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
E1014 12:07:50.781831 10480 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
E1014 12:07:51.809827 10480 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
I1014 12:07:52.464903 10480 device_windows.go:124] Waiting to get ManagementIP from HostComputeNetwork flannel.4096
E1014 12:07:52.827943 10480 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
I1014 12:07:53.036966 10480 device_windows.go:136] Waiting to get net interface for HostComputeNetwork flannel.4096 (10.243.1.
202)
E1014 12:07:53.846512 10480 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
I1014 12:07:54.091535 10480 device_windows.go:145] Created HostComputeNetwork flannel.4096
I1014 12:07:54.175544 10480 main.go:313] Changing default FORWARD chain policy to ACCEPT
I1014 12:07:54.183543 10480 main.go:321] Wrote subnet file to /run/flannel/subnet.env
I1014 12:07:54.183543 10480 main.go:325] Running backend.
I1014 12:07:54.183543 10480 main.go:343] Waiting for all goroutines to exit
I1014 12:07:54.183543 10480 vxlan_network_windows.go:63] Watching for new subnet leases
E1014 12:07:54.850768 10480 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available
E1014 12:07:55.853629 10480 reflector.go:201] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to list *v1.Node: Get
https://10.243.1.212:6443/api/v1/nodes?resourceVersion=0: http2: no cached connection was available

@vitaliy-leschenko (Contributor)
I rolled back my servers (on my test cluster) to 10.0.17763.1294 and can confirm that all Windows pods are reachable after a reboot.

PS C:\Users\v.leschenko> kubectl get nodes -owide
NAME           STATUS   ROLES    AGE     VERSION   INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                  KERNEL-VERSION       CONTAINER-RUNTIME
k8s            Ready    master   197d    v1.19.0   192.168.2.20   <none>        Ubuntu 18.04.5 LTS        4.15.0-118-generic   docker://19.3.12
k8s-us1804-a   Ready    <none>   197d    v1.19.0   192.168.2.25   <none>        Ubuntu 18.04.5 LTS        4.15.0-118-generic   docker://19.3.12
k8s-ws1809-a   Ready    <none>   2d21h   v1.19.0   192.168.2.21   <none>        Windows Server Standard   10.0.17763.1294      docker://19.3.12
k8s-ws1809-b   Ready    <none>   2d11h   v1.19.0   192.168.2.22   <none>        Windows Server Standard   10.0.17763.1294      docker://19.3.12
k8s-ws1809-c   Ready    <none>   3d      v1.19.0   192.168.2.23   <none>        Windows Server Standard   10.0.17763.1457      docker://19.3.12

k8s-ws1809-c will wait for the fix without a reboot, to verify that the fix works.

@llyons (Author) commented Oct 14, 2020

Another related piece of info, with this svc (Windows app on the Windows node):

clientportal LoadBalancer 10.110.61.103 10.243.0.39 80:30875/TCP 8d

I am not able to curl 10.243.0.39 from the Linux master, but I CAN curl it from the Linux worker. clientportal is one of the apps running on the Windows node.

With this service (web app on the Linux master):

frontend LoadBalancer 10.110.169.97 10.243.0.36 80:30889/TCP 32d

A similar scenario: the web app (frontend) running on the Linux master is reachable from the master Linux node with curl 10.243.0.36.
However, from the actual Linux worker node I am not able to curl 10.243.0.36.

@jsturtevant (Contributor)
I did some digging on the error messages above and I think there are a few things happening:

HNS issue

Invoke-HnsRequest : @{Error=An adapter was not found. ; ErrorCode=2151350278; Success=False}
At C:\k\flannel\hns.psm1:233 char:16

This is happening because you are passing Ethernet0 2 to the HNS module. While this should work, it is being passed through wins.exe:

wins cli process run --path /k/flannel/setup.exe --args "--mode=overlay --interface=Ethernet"
(from our Slack convo I know you've replaced Ethernet with Ethernet0 2 as described in the documentation)

The issue is wins.exe splits arguments on spaces:

https://github.com/rancher/wins/blob/7c2d5528151cb63355615e1ee02bd59380c1c1e2/cmd/client/process/run.go#L75
https://github.com/rancher/wins/blob/7c2d5528151cb63355615e1ee02bd59380c1c1e2/cmd/cmds/flags/list_value.go#L11-L13
https://github.com/rancher/wins/blob/7c2d5528151cb63355615e1ee02bd59380c1c1e2/cmd/cmds/flags/list_value.go#L30

This causes only Ethernet0 to be passed through, and therefore the error.
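The effect of that splitting can be sketched in a few lines (a simplified model of the list-flag parsing linked above, not the actual Go code):

```python
# Simplified model of how wins' list-valued --args flag splits on
# spaces: an adapter name containing a space is broken into two tokens.
def parse_args_flag(raw: str) -> list[str]:
    return raw.split(" ")

tokens = parse_args_flag("--mode=overlay --interface=Ethernet0 2")
# tokens == ["--mode=overlay", "--interface=Ethernet0", "2"]
# setup.exe therefore looks for an adapter named just "Ethernet0",
# and HNS reports "An adapter was not found."
```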

Another aspect of the An adapter was not found. error: it can also occur when a network and vSwitch are already attached to the adapter. One workaround for the next issue (Failed to watch) is to restart the flannel pod. On the first Attempting to create HostComputeNetwork, the flannel.4096 network is created and a switch is attached to the adapter, which can also cause the An adapter was not found. error.

Flannel Network creation and Failed to list

The creation of the external network by HNS isn't strictly needed since this PR went into flannel. This is why things start to work after restarting the flannel pod: flannel creates the network attached to the correct adapter (I1014 12:07:46.139972 10480 main.go:531] Using interface with name Ethernet0 2 and address 10.243.1.202).

There does seem to be a timing issue with flannel creating the network, though:

I1014 12:07:47.416559 10480 device_windows.go:116] Attempting to create HostComputeNetwork &{ flannel.4096 Overlay [] {[]} { [
] [] []} [{Static [{192.168.2.0/24 [[123 34 84 121 112 101 34 58 34 86 83 73 68 34 44 34 83 101 116 116 105 110 103 115 34 58 12
3 34 73 115 111 108 97 116 105 111 110 73 100 34 58 52 48 57 54 125 125]] [{192.168.2.1 0.0.0.0/0 0}]}]}] 8 {2 0}}
E1014 12:07:48.689549 10480 streamwatcher.go:109] Unable to decode an event from the watch stream: read tcp 10.243.1.202:50567
->10.243.1.212:6443: wsarecv: An established connection was aborted by the software in your host machine.
E1014 12:07:48.690549 10480 reflector.go:304] github.com/coreos/flannel/subnet/kube/kube.go:307: Failed to watch *v1.Node: Get

Here we see it creating the network (flannel.4096) and starting to list the Kubernetes nodes via the Go client. The flannel network takes some time to create and causes a network hiccup when the VM switch is first created (see this comment). The network blip puts the connection to the apiserver into a bad cached state, as described in flannel-io/flannel#1272.

Workarounds

By creating the external network first via HNS you avoid this issue completely, because there is no network disconnect during the flannel network creation. One option is to create the external network during node setup, before deploying flannel; this should resolve the issue.

Fixing this in the Docker image requires some extra work, since wins.exe is no longer taking issues (issues look to be disabled on the repository). The workarounds for the arguments being split are not too elegant: either encode the space and decode it in the setup binary, or pass the value to the setup binary via a file. I have a working version which I will clean up some more before submitting a PR: https://github.com/kubernetes-sigs/sig-windows-tools/compare/master...jsturtevant:wait-for-network?expand=1

Ultimately the fix should go into flannel, to reset connections properly or to wait until the network is fully stable. There is a long-standing open issue in the Kubernetes Go client blocking this that could potentially fix it as well: kubernetes/client-go#374

@jsturtevant (Contributor)
/assign

@jsturtevant (Contributor)
Looks like the earlier creation of the external network was trying to solve this issue, but the problem with wins args not parsing properly keeps it around: #37

@jsturtevant (Contributor)
> To fix this in the Docker image requires some extra work since it looks like wins.exe is no longer taking issues (issues look to be disabled on the repository).

I got connected with folks who work on wins.exe and they are working on a fix. Will open a PR to update this once we have a new package.

FYI, for future wins issues (from a Slack conversation):

> Submit all issues to the rancher/rancher repo if you find repos like wins that have issues turned off. Rancher PMs/engineers watch that repo and will find the right people to do the work.

@llyons (Author) commented Oct 20, 2020

So we now have the latest October CU installed and we still have the same issue as described above.
Is there anything else to try?

@jsturtevant (Contributor)
The Windows update was not the issue here. It is flannel-io/flannel#1272.

Until flannel is fixed, wins.exe needs to be updated to be able to create external networks on adapters that have spaces in their names, like Ethernet0 2.

The workaround until wins.exe or flannel is fixed is to manually create the external network before starting flannel.

@llyons (Author) commented Oct 21, 2020

Are there any instructions on how to create the external network before starting flannel? I did try renaming the Ethernet adapter to just Ethernet2 and that didn't seem to fix the issue.

I noticed that the wins commands refer to a setup.exe and a flanneld.exe which don't exist in the /k/flannel folder on the Windows node.

wins cli process run --path /k/flannel/setup.exe --args "--mode=overlay --interface=Ethernet2"
wins cli route add --addresses 169.254.169.254
wins cli process run --path /k/flannel/flanneld.exe --args "--kube-subnet-mgr --kubeconfig-file /k/flannel/kubeconfig.yml" --envs "POD_NAME=$env:POD_NAME POD_NAMESPACE=$env:POD_NAMESPACE"

@jsturtevant (Contributor)
@llyons the setup.exe source is here:

$network = Get-HNSNetwork | ? Name -eq "External";
if ($network -eq $null) {
    New-HNSNetwork -Type Overlay -AddressPrefix "192.168.255.0/30" -Gateway "192.168.255.1" -Name "External" -AdapterName "%s" -SubnetPolicies @(@{Type = "VSID"; VSID = 9999; });
} elseif ($network.Type -ne "Overlay") {
    Write-Warning "'External' network already exists but has wrong type: $($network.Type)."
}

and includes the powershell to create the external network:

New-HNSNetwork -Type Overlay -AddressPrefix "192.168.255.0/30" -Gateway "192.168.255.1" -Name "External" -AdapterName "Ethernet0 2" -SubnetPolicies @(@{Type = "VSID"; VSID = 9999; });

Note that this will fail if flannel already created a network on the NIC, in which case you will need to remove the flannel network first.

@llyons (Author) commented Oct 22, 2020

If we set up our CIDR as 192.168.0.0/16, should we (after first removing the existing network) run those PowerShell commands to set up a network in the 192.168.0.0/16 range with a gateway of 192.168.0.1?

Also, if it's already set up, do we want to remove the flannel.4096 network or the External one?

It looks like we have External (tied to Ethernet2), nat, flannel.4096 and
89b601bd3b8b4850bc7711537882a6c9aa3788b6f7c11854518dc4733d686c0e

Is the network we are creating part of the actual Ethernet network allowing connectivity, or is it a new network? If I try to run the above New-HNSNetwork command on the existing Ethernet2 adapter, it says the network already exists.

@jsturtevant (Contributor)
Which CIDR are you referring to? You will want your node/pod CIDR not to overlap with the external network's CIDR. My understanding is that this external network is for creating the vSwitch, which enables network connectivity via the adapter. It isn't really needed except for the bug in flannel (flannel-io/flannel#1272). @ksubrmnn might be able to explain better.

> If I try to run the above New-HNSNetwork command on the existing Ethernet2 adapter, it says it already exists.

Yes, this should be done on a fresh node, or you will need to clean up all the different networks that might have been created.

@llyons (Author) commented Oct 23, 2020

I might not have said this properly above. Any help or guidance on this would be appreciated @ksubrmnn.

If I have this output from Get-HNSNetwork:


ActivityId             : 0481DD58-698B-4829-8FF7-02407876752E
AdditionalParams       :
CurrentEndpointCount   : 0
Extensions             : {@{Id=E7C3B2F0-F3C5-48DF-AF2B-10FED6D72E7A; IsEnabled=False; Name=Microsoft Windows Filtering
                         Platform}, @{Id=E9B59CFA-2BE1-4B21-828F-B6FBDBDDC017; IsEnabled=False; Name=Microsoft Azure
                         VFP Switch Extension}, @{Id=EA24CD6C-D17A-4348-9190-09F0D5BE83DD; IsEnabled=True;
                         Name=Microsoft NDIS Capture}}
Flags                  : 0
Health                 : @{AddressNotificationMissedCount=0; AddressNotificationSequenceNumber=0;
                         InterfaceNotificationMissedCount=0; InterfaceNotificationSequenceNumber=0; LastErrorCode=0;
                         LastUpdateTime=132478756989051774; RouteNotificationMissedCount=0;
                         RouteNotificationSequenceNumber=0}
ID                     : 777F0851-EF37-4D73-BAE3-8F3464294CCB
IPv6                   : False
LayeredOn              : 85D8CB85-C25B-4B8E-82A7-A81110A9EB91
MacPools               : {@{EndMacAddress=00-15-5D-C4-FF-FF; StartMacAddress=00-15-5D-C4-F0-00}}
MaxConcurrentEndpoints : 0
Name                   : nat
NatName                : ICSB758DD0D-1851-4C13-A6A8-3630CEBD4726
Policies               : {}
Resources              : @{AdditionalParams=; AllocationOrder=2; Allocators=System.Object[]; Health=;
                         ID=0481DD58-698B-4829-8FF7-02407876752E; PortOperationTime=0; State=1; SwitchOperationTime=0;
                         VfpOperationTime=0; parentId=BDD4F023-90C0-43FF-BA5F-F9920C901B5C}
State                  : 1
Subnets                : {@{AdditionalParams=; AddressPrefix=172.27.176.0/20; GatewayAddress=172.27.176.1; Health=;
                         ID=FE5C7CE2-3D7D-43EC-9E21-6D217F7C1106; Policies=System.Object[]; State=0}}
TotalEndpoints         : 0
Type                   : nat
Version                : 38654705667

ActivityId             : 7104552C-95E1-49BF-939F-D12E40B386B8
AdditionalParams       :
CurrentEndpointCount   : 1
Extensions             : {@{Id=E7C3B2F0-F3C5-48DF-AF2B-10FED6D72E7A; IsEnabled=False; Name=Microsoft Windows Filtering
                         Platform}, @{Id=E9B59CFA-2BE1-4B21-828F-B6FBDBDDC017; IsEnabled=False; Name=Microsoft Azure
                         VFP Switch Extension}, @{Id=EA24CD6C-D17A-4348-9190-09F0D5BE83DD; IsEnabled=True;
                         Name=Microsoft NDIS Capture}}
Flags                  : 0
Health                 : @{AddressNotificationMissedCount=0; AddressNotificationSequenceNumber=0;
                         InterfaceNotificationMissedCount=0; InterfaceNotificationSequenceNumber=0; LastErrorCode=0;
                         LastUpdateTime=132478756998524596; RouteNotificationMissedCount=0;
                         RouteNotificationSequenceNumber=0}
ID                     : DE8B02C3-F5E4-436A-B2AD-86D3D34A4B12
IPv6                   : False
LayeredOn              : 85D8CB85-C25B-4B8E-82A7-A81110A9EB91
MacPools               : {@{EndMacAddress=00-15-5D-2D-1F-FF; StartMacAddress=00-15-5D-2D-10-00}}
MaxConcurrentEndpoints : 2
Name                   : 69746ee3532666b83adb8edea7f2b9d49d4ea191a7ef620c9bf95b17f5d170d7
NatName                : ICS35158D80-1C7D-4937-AF77-002858685E7D
Policies               : {}
Resources              : @{AdditionalParams=; AllocationOrder=2; Allocators=System.Object[]; Health=;
                         ID=7104552C-95E1-49BF-939F-D12E40B386B8; PortOperationTime=0; State=1; SwitchOperationTime=0;
                         VfpOperationTime=0; parentId=BDD4F023-90C0-43FF-BA5F-F9920C901B5C}
State                  : 1
Subnets                : {@{AdditionalParams=; AddressPrefix=172.22.192.0/20; GatewayAddress=172.22.192.1; Health=;
                         ID=EF24497B-1AAE-4DC6-94AB-937035687E18; Policies=System.Object[]; State=0}}
TotalEndpoints         : 2
Type                   : nat
Version                : 38654705667

ActivityId             : 8A7055F0-550E-4A89-8951-F83DA1A54EC4
AdditionalParams       :
CurrentEndpointCount   : 2
DNSServerCompartment   : 7
DrMacAddress           : 00-15-5D-36-09-79
Extensions             : {@{Id=E7C3B2F0-F3C5-48DF-AF2B-10FED6D72E7A; IsEnabled=False; Name=Microsoft Windows Filtering
                         Platform}, @{Id=E9B59CFA-2BE1-4B21-828F-B6FBDBDDC017; IsEnabled=True; Name=Microsoft Azure
                         VFP Switch Extension}, @{Id=EA24CD6C-D17A-4348-9190-09F0D5BE83DD; IsEnabled=True;
                         Name=Microsoft NDIS Capture}}
Flags                  : 8
Health                 : @{LastErrorCode=0; LastUpdateTime=132478760658733588}
ID                     : A9A09EB9-F565-4E92-B4E0-72CA273F7EF6
IPv6                   : False
InterfaceConstraint    : @{InterfaceGuid=00000000-0000-0000-0000-000000000000}
LayeredOn              : 68C26E4B-B00A-4097-A7A0-5236D358B510
MacPools               : {@{EndMacAddress=00-15-5D-4E-4F-FF; StartMacAddress=00-15-5D-4E-40-00}}
ManagementIP           : 10.243.1.202
MaxConcurrentEndpoints : 2
Name                   : flannel.4096
Policies               : {@{Type=HostRoute}, @{DestinationPrefix=192.168.1.0/24;
                         DistributedRouterMacAddress=6a:60:9e:b2:c9:50; IsolationId=4096;
                         ProviderAddress=10.243.1.213; Type=RemoteSubnetRoute}, @{DestinationPrefix=192.168.0.0/24;
                         DistributedRouterMacAddress=42:72:18:81:ac:6f; IsolationId=4096;
                         ProviderAddress=10.243.1.212; Type=RemoteSubnetRoute}}
Resources              : @{AdditionalParams=; AllocationOrder=1; Allocators=System.Object[]; Health=;
                         ID=8A7055F0-550E-4A89-8951-F83DA1A54EC4; PortOperationTime=0; State=1; SwitchOperationTime=0;
                         VfpOperationTime=0; parentId=D81E13D4-58B9-4D5F-8972-4606BBE27C41}
State                  : 1
Subnets                : {@{AdditionalParams=; AddressPrefix=192.168.2.0/24; GatewayAddress=192.168.2.1; Health=;
                         ID=0B389370-6A1B-4DE8-B4CA-D50A153284CF; ObjectType=5; Policies=System.Object[]; State=0}}
TotalEndpoints         : 5
Type                   : Overlay
Version                : 38654705667

And here is an image of my current network adapters (attached above).

How should I proceed? Assuming the cluster CIDR is 192.168.0.0/16, with MetalLB also serving IP addresses from a pool, would I do this?

```powershell
New-HNSNetwork -Type Overlay -AddressPrefix "192.168.0.0/16" -Gateway "192.168.0.1" -Name "External" -AdapterName "Ethernet" -SubnetPolicies @(@{Type = "VSID"; VSID = 9999; });
```
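Before creating a new network by hand, it might be worth confirming what flannel actually created, since the kubelet error complains that `flannel.4096` cannot be found. A minimal diagnostic sketch, assuming the `hns.psm1` helper module from the microsoft/SDN repository is loaded (which provides `Get-HNSNetwork`):

```powershell
# Import the HNS helper module from microsoft/SDN (path is an assumption --
# adjust to wherever hns.psm1 was downloaded on the Windows node).
Import-Module C:\k\hns.psm1

# List the overlay network flannel expects, if it exists at all.
# An empty result here would match the "Network flannel.4096 not found" kubelet error.
Get-HNSNetwork |
    Where-Object { $_.Name -eq "flannel.4096" } |
    Select-Object Name, Type, ManagementIP,
        @{Name = "Subnet"; Expression = { $_.Subnets[0].AddressPrefix }}
```

If the network exists only intermittently, that can point at flanneld on the Windows node recreating it after the kubelet has already tried to attach.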

I am pretty desperate to get the Windows portion working.

Remember, we can curl the Windows pod's service IP and cluster IP from a Linux node, but we can't get that service IP or cluster IP exposed outside of the two Linux cluster nodes. Apps running on the Linux nodes are exposed and do render outside the cluster using their service IPs.

@llyons

llyons commented Nov 9, 2020

We were able to determine that, in our configuration, MetalLB (which provides an IP from a pool) did not have a speaker on the Windows node, which prevented the provisioned IP from being reachable from outside.

Instead, we put the Windows containers on the Windows node behind an Ingress resource and set up an ingress service of type LoadBalancer to handle this. So in essence we are getting access through the ingress controllers running on the Linux portion of the cluster.
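For anyone hitting the same issue, a minimal sketch of that workaround (hostname, labels, and ports here are hypothetical, not taken from our actual manifests): a plain ClusterIP Service fronts the Windows pods, and an Ingress handled by the nginx ingress controllers on the Linux nodes routes traffic to it, so no MetalLB speaker is needed on the Windows node.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: clientportal
spec:
  type: ClusterIP            # ClusterIP only; external access comes via the Ingress
  selector:
    app: clientportal
  ports:
    - port: 80
      targetPort: 80
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: clientportal
spec:
  ingressClassName: nginx
  rules:
    - host: clientportal.example.com   # hypothetical hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: clientportal
                port:
                  number: 80
```

The nginx ingress controller itself is the only thing exposed through the MetalLB LoadBalancer service, and it runs exclusively on the Linux nodes.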

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 7, 2021
@jsturtevant

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 23, 2021