flannel: hook failed: "cni-relation-changed" for flannel:cni #160

Closed
spikebike opened this Issue Dec 16, 2016 · 12 comments

spikebike commented Dec 16, 2016

I tried using juju to deploy kubernetes and it seemed mostly to work, except:

Unit                      Workload  Agent  Machine  Public address  Ports     Message
easyrsa/0*                active    idle   0        10.1.1.237                Certificate Authority connected.
etcd/0*                   active    idle   1        10.1.1.240      2379/tcp  Healthy with 3 known peers.
etcd/1                    active    idle   2        10.1.1.202      2379/tcp  Healthy with 3 known peers.
etcd/2                    active    idle   3        10.1.1.236      2379/tcp  Healthy with 3 known peers.
kubeapi-load-balancer/0*  active    idle   4        10.1.1.218      443/tcp   Loadbalancer ready.
kubernetes-master/0*      active    idle   5        10.1.1.212      6443/tcp  Kubernetes master services ready.
  flannel/0*              error     idle            10.1.1.212                hook failed: "cni-relation-changed" for flannel:cni
kubernetes-worker/0*      waiting   idle   6        10.1.1.210                Waiting for kubelet to start.
  flannel/3               error     idle            10.1.1.210                hook failed: "cni-relation-changed" for flannel:cni
kubernetes-worker/1       waiting   idle   7        10.1.1.208                Waiting for kubelet to start.
  flannel/1               error     idle            10.1.1.208                hook failed: "cni-relation-changed" for flannel:cni
kubernetes-worker/2       waiting   idle   8        10.1.1.222                Waiting for kubelet to start.
  flannel/2               error     idle            10.1.1.222                hook failed: "cni-relation-changed" for flannel:cni

lazypower commented Dec 16, 2016

Hi spikebike, was this deployed to lxd perchance?

spikebike commented Dec 16, 2016

Indeed, yes it was deployed to lxd.

lazypower commented Dec 16, 2016

Ah, did you simply juju deploy or did you use conjure-up to deploy the bundle?

There's a callout under 'Alternative Deployment Methods' (admittedly a bit of a rough spot for that information to live) that outlines the specific profile edits and modifications that need to happen on the system you're deploying the Canonical Distribution of Kubernetes to.

It's not a straightforward edit, so my suggestion would be to destroy that model, install conjure-up, and attempt re-deployment via:

add-apt-repository ppa:conjure-up/next
apt-get update
apt-get install conjure-up
conjure-up canonical-kubernetes

It will perform the required steps when deploying to an LXD host to ensure you get the necessary functional tweaks. We're still discussing how we can make this experience better and more obvious to users.

As you're the third user to hit this flaw in our README hiding the steps for a local development experience, I'll take a work item to ensure that's more prominent in the README.

lazypower commented Dec 16, 2016

Duplicate of #151

lazypower commented Jan 3, 2017

@spikebike this issue is starting to age, and I wanted to follow up to see if conjure-up resolved your issue or if we need to do deeper introspection.

lazypower commented Jan 9, 2017

Closing due to inactivity. Please re-open if this continues to be an issue.

lazypower closed this Jan 9, 2017

rene00 commented Jan 13, 2017

This continues to be an issue for me.

My goal is to have canonical-kubernetes installed using some sort of headless mode. I'm using lxd.

On a pristine xenial ec2 instance, I run something like:

sudo apt-add-repository ppa:juju/stable
sudo apt -y update
sudo apt-get -y upgrade
sudo apt install -y zfsutils-linux lxd
sudo lxd init \
    --auto \
    --storage-backend zfs \
    --storage-create-device /dev/xvdb \
    --storage-pool lxd
sudo usermod -aG lxd ubuntu

I then renamed lxdbr0 and disabled IPv6 as per the first comment in juju/juju#5871.
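
For reference, the lxd package on xenial reads its bridge settings from /etc/default/lxd-bridge, so the rename plus the IPv6 disable amounts to roughly the following (a sketch only; the bridge name and values here are placeholders, not copied from that issue):

# edit /etc/default/lxd-bridge, then restart the bridge service
sudo sed -i 's/^LXD_BRIDGE=.*/LXD_BRIDGE="lxdbr1"/' /etc/default/lxd-bridge        # rename from the default lxdbr0
sudo sed -i 's/^LXD_IPV6_ADDR=.*/LXD_IPV6_ADDR=""/' /etc/default/lxd-bridge        # blank the IPv6 address to disable it
sudo sed -i 's/^LXD_IPV6_PROXY=.*/LXD_IPV6_PROXY="false"/' /etc/default/lxd-bridge
sudo systemctl restart lxd-bridge                                                  # pick up the new bridge settings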

I then bootstrap a localhost controller.

juju bootstrap lxd localhost

juju deploy ubuntu works as expected.

I then run juju deploy canonical-kubernetes

$ juju deploy canonical-kubernetes
Located bundle "cs:bundle/canonical-kubernetes-19"
Deploying charm "cs:~containers/easyrsa-5"
added resource easyrsa
Deploying charm "cs:~containers/etcd-21"
added resource snapshot
Deploying charm "cs:~containers/flannel-7"
added resource flannel
Deploying charm "cs:~containers/kubeapi-load-balancer-5"
application kubeapi-load-balancer exposed
Deploying charm "cs:~containers/kubernetes-master-10"
added resource kubernetes
Deploying charm "cs:~containers/kubernetes-worker-12"
added resource kubernetes
application kubernetes-worker exposed
Related "kubernetes-master:kube-api-endpoint" and "kubeapi-load-balancer:apiserver"
Related "kubernetes-master:loadbalancer" and "kubeapi-load-balancer:loadbalancer"
Related "kubernetes-master:cluster-dns" and "kubernetes-worker:kube-dns"
Related "kubernetes-master:certificates" and "easyrsa:client"
Related "etcd:certificates" and "easyrsa:client"
Related "kubernetes-master:etcd" and "etcd:db"
Related "kubernetes-worker:certificates" and "easyrsa:client"
Related "kubernetes-worker:kube-api-endpoint" and "kubeapi-load-balancer:website"
Related "kubeapi-load-balancer:certificates" and "easyrsa:client"
Related "flannel:etcd" and "etcd:db"
Related "flannel:cni" and "kubernetes-master:cni"
Related "flannel:cni" and "kubernetes-worker:cni"
Deploy of bundle completed.

and end up with this:

$ juju status
Model    Controller  Cloud/Region   Version
default  localhost   lxd/localhost  2.0.2

App                    Version  Status   Scale  Charm                  Store       Rev  OS      Notes
easyrsa                3.0.1    active       1  easyrsa                jujucharms    5  ubuntu  
etcd                   2.2.5    active       3  etcd                   jujucharms   21  ubuntu  
flannel                0.6.1    error        4  flannel                jujucharms    7  ubuntu  
kubeapi-load-balancer  1.10.0   active       1  kubeapi-load-balancer  jujucharms    5  ubuntu  exposed
kubernetes-master      1.5.1    active       1  kubernetes-master      jujucharms   10  ubuntu  
kubernetes-worker      1.5.1    waiting      3  kubernetes-worker      jujucharms   12  ubuntu  exposed
ubuntu                 16.04    active       1  ubuntu                 jujucharms   10  ubuntu  

Unit                      Workload  Agent  Machine  Public address  Ports     Message
easyrsa/0*                active    idle   1        10.138.86.163             Certificate Authority connected.
etcd/0*                   active    idle   2        10.138.86.242   2379/tcp  Healthy with 3 known peers.
etcd/1                    active    idle   3        10.138.86.12    2379/tcp  Healthy with 3 known peers.
etcd/2                    active    idle   4        10.138.86.220   2379/tcp  Healthy with 3 known peers.
kubeapi-load-balancer/0*  active    idle   5        10.138.86.20    443/tcp   Loadbalancer ready.
kubernetes-master/0*      active    idle   6        10.138.86.253   6443/tcp  Kubernetes master services ready.
  flannel/0*              error     idle            10.138.86.253             hook failed: "cni-relation-joined" for flannel:cni
kubernetes-worker/0       waiting   idle   7        10.138.86.37              Waiting for kubelet to start.
  flannel/2               error     idle            10.138.86.37              hook failed: "cni-relation-changed" for flannel:cni
kubernetes-worker/1       waiting   idle   8        10.138.86.158             Waiting for kubelet to start.
  flannel/3               error     idle            10.138.86.158             hook failed: "cni-relation-joined" for flannel:cni
kubernetes-worker/2*      waiting   idle   9        10.138.86.216             Waiting for kubelet to start.
  flannel/1               error     idle            10.138.86.216             hook failed: "cni-relation-joined" for flannel:cni
ubuntu/0*                 active    idle   0        10.138.86.4               ready

Machine  State    DNS            Inst id        Series  AZ
0        started  10.138.86.4    juju-5a6a66-0  xenial  
1        started  10.138.86.163  juju-5a6a66-1  xenial  
2        started  10.138.86.242  juju-5a6a66-2  xenial  
3        started  10.138.86.12   juju-5a6a66-3  xenial  
4        started  10.138.86.220  juju-5a6a66-4  xenial  
5        started  10.138.86.20   juju-5a6a66-5  xenial  
6        started  10.138.86.253  juju-5a6a66-6  xenial  
7        started  10.138.86.37   juju-5a6a66-7  xenial  
8        started  10.138.86.158  juju-5a6a66-8  xenial  
9        started  10.138.86.216  juju-5a6a66-9  xenial  

Relation           Provides               Consumes               Type
certificates       easyrsa                etcd                   regular
certificates       easyrsa                kubeapi-load-balancer  regular
certificates       easyrsa                kubernetes-master      regular
certificates       easyrsa                kubernetes-worker      regular
cluster            etcd                   etcd                   peer
etcd               etcd                   flannel                regular
etcd               etcd                   kubernetes-master      regular
cni                flannel                kubernetes-master      regular
cni                flannel                kubernetes-worker      regular
loadbalancer       kubeapi-load-balancer  kubernetes-master      regular
kube-api-endpoint  kubeapi-load-balancer  kubernetes-worker      regular
cni                kubernetes-master      flannel                subordinate
kube-dns           kubernetes-master      kubernetes-worker      regular
cni                kubernetes-worker      flannel                subordinate

Digging a bit deeper, all machines with flannel have this within /var/log/juju/unit-flannel-0.log:

2017-01-13 03:35:45 INFO juju.worker.uniter resolver.go:100 awaiting error resolution for "relation-joined" hook
2017-01-13 03:35:45 INFO juju-log cni:11: Reactive main running for hook cni-relation-joined
2017-01-13 03:35:45 INFO juju-log cni:11: Invoking reactive handler: hooks/relations/kubernetes-cni/requires.py:11:changed
2017-01-13 03:35:46 INFO juju-log cni:11: Invoking reactive handler: reactive/flannel.py:167:ready
2017-01-13 03:35:46 INFO cni-relation-joined Traceback (most recent call last):
2017-01-13 03:35:46 INFO cni-relation-joined   File "/var/lib/juju/agents/unit-flannel-0/charm/hooks/cni-relation-joined", line 19, in <module>
2017-01-13 03:35:46 INFO cni-relation-joined     main()
2017-01-13 03:35:46 INFO cni-relation-joined   File "/usr/local/lib/python3.5/dist-packages/charms/reactive/__init__.py", line 78, in main
2017-01-13 03:35:46 INFO cni-relation-joined     bus.dispatch()
2017-01-13 03:35:46 INFO cni-relation-joined   File "/usr/local/lib/python3.5/dist-packages/charms/reactive/bus.py", line 434, in dispatch
2017-01-13 03:35:46 INFO cni-relation-joined     _invoke(other_handlers)
2017-01-13 03:35:46 INFO cni-relation-joined   File "/usr/local/lib/python3.5/dist-packages/charms/reactive/bus.py", line 417, in _invoke
2017-01-13 03:35:46 INFO cni-relation-joined     handler.invoke()
2017-01-13 03:35:46 INFO cni-relation-joined   File "/usr/local/lib/python3.5/dist-packages/charms/reactive/bus.py", line 291, in invoke
2017-01-13 03:35:46 INFO cni-relation-joined     self._action(*args)
2017-01-13 03:35:46 INFO cni-relation-joined   File "/var/lib/juju/agents/unit-flannel-0/charm/reactive/flannel.py", line 171, in ready
2017-01-13 03:35:46 INFO cni-relation-joined     status_set('active', 'Flannel subnet ' + get_flannel_subnet())
2017-01-13 03:35:46 INFO cni-relation-joined   File "/var/lib/juju/agents/unit-flannel-0/charm/reactive/flannel.py", line 223, in get_flannel_subnet
2017-01-13 03:35:46 INFO cni-relation-joined     with open('/run/flannel/subnet.env') as f:
2017-01-13 03:35:46 INFO cni-relation-joined FileNotFoundError: [Errno 2] No such file or directory: '/run/flannel/subnet.env'
2017-01-13 03:35:46 ERROR juju.worker.uniter.operation runhook.go:107 hook "cni-relation-joined" failed: exit status 1
2017-01-13 03:35:46 INFO juju.worker.uniter resolver.go:100 awaiting error resolution for "relation-joined" hook

I'm going to continue poking around.
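
A couple of quick sanity checks along these lines confirm what the traceback is saying (assuming the flanneld systemd unit is simply called flannel):

juju ssh flannel/0 'ls -l /run/flannel/'        # expect this to be empty or missing entirely
juju ssh flannel/0 'systemctl status flannel'   # is flanneld itself actually running?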

rene00 commented Jan 13, 2017

Just found juju-solutions/charm-flannel#26 which I'm guessing should resolve this issue once deployed.

Cynerva commented Jan 13, 2017

@rene00 The fix in juju-solutions/charm-flannel#26 should give you a different symptom, but likely won't fix the problem entirely.

The hook failure you're seeing occurs because /run/flannel/subnet.env doesn't exist when the charm expects it to - which means flannel isn't working right for one reason or another.

If I remember right, flannel doesn't work under the default LXD profile because of missing files in /proc/sys/net/ipv4/neigh/flannel.1. You might see an error along those lines if you do journalctl -u flannel on one of the failed units.
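For example, something like this on one of the failed units (the unit name below is just taken from the status output above):

juju ssh kubernetes-worker/0 'journalctl -u flannel --no-pager -n 20'
juju ssh kubernetes-worker/0 'ls /proc/sys/net/ipv4/neigh/flannel.1'   # missing under the default LXD profile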

If you deploy with conjure-up as mentioned above in #160 (comment), I believe it should take care of the LXD profile changes and give you better results.

lazypower commented Jan 13, 2017

What Cynerva has called out is correct. Using conjure-up would resolve the issue where flannel is unable to start and is therefore unable to create /run/flannel/subnet.env (an artifact left on disk while flanneld is running, containing the CIDR and MTU of the device).
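
When flanneld is healthy, that file typically looks something like this (the addresses below are just illustrative placeholders):

FLANNEL_NETWORK=10.1.0.0/16
FLANNEL_SUBNET=10.1.34.1/24
FLANNEL_MTU=1410
FLANNEL_IPMASQ=false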

Using conjure-up edits the LXD profile to grant the permissions and device nodes flannel needs to run. You can see what it's doing here:

https://github.com/conjure-up/spells/blob/master/canonical-kubernetes/steps/lxd-profile.yaml
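
If you'd rather apply the changes by hand instead of using conjure-up, they boil down to loosening the LXD profile Juju uses for the model, roughly along these lines (a sketch only: the profile name and the exact keys are assumptions here, the spell linked above is the authoritative list):

lxc profile list                                    # find the profile Juju created for the model, e.g. juju-default
lxc profile set juju-default security.privileged true
lxc profile set juju-default linux.kernel_modules ip_tables,ip6_tables,netlink_diag,nf_nat,overlay
lxc profile show juju-default                       # double-check before redeploying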
