This repository has been archived by the owner on Sep 30, 2020. It is now read-only.

Add networking-daemonsets feature #1195

Merged

Conversation

davidmccormick
Contributor

Deploy Calico and Flannel networking as a daemonset with the kubernetes API as the backing store. Removes the need for nodes to connect to etcd and frees up node podCIDR leases faster, addressing the cluster role issue: flannel-io/flannel#954.

Experimental feature, disabled by default.
Kubernetes Controllers become responsible for allocating node cidrs.
Switch between calico+flannel (canal) or flannel.
Fast roll out into existing clusters with minimal disruption.
Optional calico typha service for easing load on apiservers in large clusters.

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Mar 21, 2018
@davidmccormick
Contributor Author

I need to update the tests with the expectation of the NetworkingDaemonSets type which gets set up with default values.

@codecov-io

codecov-io commented Mar 22, 2018

Codecov Report

Merging #1195 into master will increase coverage by 0.22%.
The diff coverage is 78.94%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1195      +/-   ##
==========================================
+ Coverage    36.2%   36.42%   +0.22%     
==========================================
  Files          63       63              
  Lines        3823     3846      +23     
==========================================
+ Hits         1384     1401      +17     
- Misses       2224     2229       +5     
- Partials      215      216       +1
Impacted Files                       Coverage Δ
model/userdata.go                    60% <0%> (-1.02%) ⬇️
core/controlplane/config/config.go   62.38% <83.33%> (+0.39%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 548bb38...2c4232a. Read the comment docs.

@mumoshu mumoshu left a comment
Contributor

Some comments on naming, but code LGTM overall!
Thank you very much for your efforts 🎉

type NetworkingDaemonSets struct {
	Enabled         bool        `yaml:"enabled"`
	Typha           bool        `yaml:"typha"`
	CalicoNodeImage model.Image `yaml:"calico-node-image"`
Contributor

Super nit but would you mind changing yaml keys to lowerCamelCase for consistency with other existing keys?

Contributor Author

Sure - no problem

- name: kubelet.service
  command: start
  runtime: true
  content: |
    [Unit]
    {{ if not .Experimental.NetworkingDaemonSets.Enabled -}}
Contributor

  • How about renaming NetworkingDaemonSets to e.g. SelfHostedCanal? After reading the canal doc, I now believe we can safely call it that.
  • Can we move the new configuration (NetworkingDaemonSets, before my suggested rename) out of .Experimental? Sorry for the confusion, but we're basically avoiding the .Experimental key from now on. We're in pre-1.0 and there is no strict rule about what's considered experimental, when it graduates from experimental, and so on.
    • However, a note about the experimental feature in the cluster.yaml comments would be welcome!

Contributor Author

I didn't want to name it Canal as you can have self-hosted flannel if you don't supply UseCalico: true. Perhaps another name better captures the hosted-in-kubernetes, removed-from-coreos-supplied-flannel-ness of the change? Shall we call it KubeHostedNetworking, or maybe SelfHostedNetworkFabric?

I'll remove it from Experimental and make it top level?

Contributor

Thanks for pointing it out! That makes a lot of sense.

How about Kubernetes.Networking.SelfHosting then?

In the future, we can probably also consider migrating other network-related keys like useCalico under it, for more cohesion.

Contributor Author

Ok, I'll make those changes and re-push

@davidmccormick davidmccormick Mar 27, 2018
Contributor Author

Err, did you want me to create a Kubernetes stanza? If so, should I move kubeDns, kubeProxy, kubernetesDashboard, plus lots of others into it? (Feels like quite a big change.)

Or create it just for my networking settings for now?

#kubernetes:
#  networking:
#    selfHosting:
#      enabled: false
#      typha: false
#      calicoNodeImage:
#        repo: quay.io/calico/node
#        tag: v3.0.3
#      calicoCniImage:
#        repo: quay.io/calico/cni
#        tag: v2.0.1
#      flannelImage:
#        repo: quay.io/coreos/flannel
#        tag: v0.9.1
#      flannelCniImage:
#        repo: quay.io/coreos/flannel-cni
#        tag: v0.3.0
#      typhaImage:
#        repo: quay.io/calico/typha
#        tag: v0.6.2

Or could I get away with KubernetesNetworkingSelfHosting? e.g.

#kubernetesNetworkingSelfHosting:
#  enabled: false
#  typha: false
#  calicoNodeImage:
#    repo: quay.io/calico/node
#    tag: v3.0.3
#  calicoCniImage:
#    repo: quay.io/calico/cni
#    tag: v2.0.1
#  flannelImage:
#    repo: quay.io/coreos/flannel
#    tag: v0.9.1
#  flannelCniImage:
#    repo: quay.io/coreos/flannel-cni
#    tag: v0.3.0
#  typhaImage:
#    repo: quay.io/calico/typha
#    tag: v0.6.2

@mumoshu
Contributor

mumoshu commented Mar 27, 2018

> Or create it just for my networking settings for now?

This is what I meant!
I'll create separate issues for migrating others.

…osting and squash a few bugs hampering migrations

rebuild templates after rebase and merge from master
@davidmccormick
Contributor Author

Hi, I've updated the settings to kubernetes.networking.selfhosting, squashed a couple of bugs, and merged the latest changes. I have new versions of the template files, but these seem to be conflicting and I'm not sure how to clear the conflicts.

Is there a way to select my versions? Can you do that?

@@ -270,6 +270,7 @@ coreos:
{{end -}}

[Service]
Type=simple
Contributor Author

etcd rollout stalls at the first etcd server in a new cluster when the etcdadm-check service (etcd disaster recovery enabled) is enabled. The etcd-member service is type 'notify' by default, but etcd won't send the notify until the cluster is healthy, so systemd reports etcd as starting rather than started even though etcd is running fine. etcdadm-check won't start until etcd-member has started, so no cfn-signal can be sent to start the subsequent etcd servers, and the rollout gets stuck at the first etcd server. By changing the type of etcd-member to 'simple', etcdadm-check can start and the CloudFormation signal can go out. There is no side effect from changing the type: as soon as etcd has started, everything can continue.

Contributor

@davidmccormick Good point!

But I take this change to actually have a side-effect: we would no longer postpone the rolling update of etcd nodes when etcd-member fails to join the existing cluster. I think this is better explained in #1206 (comment).

Not saying this change is no good. However, at the least, this change should be discussed and made separately!
That being said, would you mind reverting this?

Contributor Author

Removed it.

"ToPort": 5473
},
"Type": "AWS::EC2::SecurityGroupIngress"
},
Contributor

Note to self: This rule should be enabled only when typha is enabled.
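That gating could look like the following sketch in the stack template (the resource name and security-group references here are hypothetical, for illustration only):

```
{{/* Hypothetical sketch: open typha's port 5473 only when typha is enabled. */}}
{{if .Kubernetes.Networking.SelfHosting.Typha}}
"SecurityGroupIngressTyphaFromWorker": {
  "Type": "AWS::EC2::SecurityGroupIngress",
  "Properties": {
    "GroupId": {"Ref": "SecurityGroupController"},
    "SourceSecurityGroupId": {"Ref": "SecurityGroupWorker"},
    "IpProtocol": "tcp",
    "FromPort": 5473,
    "ToPort": 5473
  }
},
{{end}}
```

This way the ingress rule exists only in clusters that actually deploy the typha service.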

# ISSUES1 - calico iptables rules remain on the nodes; recycle nodes to remove them.
# ISSUES2 - Existing pod ip addresses are not tracked and can cause clashing ips - please cordon all nodes whilst
# performing the upgrade until all workers/nodes have been recycled - then it will be ok.
# ISSUES3 - Rolling back to old networking will require all nodes to be cycled/rolled, and the network is disrupted throughout.
Contributor

I really appreciate the comprehensive and detailed documentation here!

enable: false
content: |
[Unit]
Description=Perform actions which help when migrating from legacy to selfhosted networking.
Contributor

Shouldn't this be run when .Kubernetes.Networking.SelfHosting.Enabled is true?
The proposed implementation seems to be doing the opposite of that, or perhaps I'm missing something?

Contributor Author

Hi, no - the migration helper is intended as a safeguard for rolling the new networking into an existing cluster. It should only be deployed to legacy networking hosts, not to new nodes where SelfHostedNetworking has been enabled (i.e. no local flannel). What it does is stop the local flannel service and remove its devices so that a more rapid migration can be achieved. This minimises cluster disruption in the window between the first new controller coming up and adding the new daemonsets, and all of the nodes having been recycled. I did find an issue in my testing where the kubelet was still taking ip addresses from the old flannel rather than the new one, so I thought it would make sense to stop the old flannel so this can't happen.

Contributor

@davidmccormick I think I'm starting to understand the benefit of handling the migration from old to new networking for minimal downtime.

But on the other hand, I'm still unable to understand how this implementation helps with that. To me, this systemd unit seems to run "whenever a node w/ legacy networking starts", but not "before a node w/ new networking comes up".

Also, how does a node w/ legacy networking start successfully if you disable non-self-hosted flanneld this way?

Probably I'm still missing something...

@davidmccormick davidmccormick Apr 3, 2018
Contributor Author

Hi, sorry - I haven't explained it all that well and it is a little complicated. The net-migration-helper is deployed to all nodes when the self-hosting option is off/disabled - i.e. all legacy style nodes. On its own it won't do anything: it will only trigger if someone writes the file /etc/kubernetes/cni/net.d/net-migration, which I'm using as a kind of semaphore/trigger file. This file gets written when either the self-hosting canal or flannel daemonset is deployed (effectively signalling all legacy nodes that a migration is happening so they can shut down their local flannel). Migrated nodes don't need to worry about this file: they don't have a local flannel, so this service isn't deployed to them.

The whole goal of this is to minimise the disruption when rolling a legacy cluster to the new networking (if every node has to be rolled before the network is good this would cause lengthy disruption on large clusters). This way the cluster can reconfigure its networking with only a minute or less of disruption as it rolls out (in fact in some rolls there is no disruption at all).

As soon as the user selects self-hosting in their cluster.yaml and rolls their cluster, once the first new controller comes up and runs install-kube-system, it will deploy the canal or flannel daemonsets to ALL nodes - so for a while we have legacy nodes which have their previously running flannel on them AND the new flannel deployed by daemonset. By writing the semaphore file /etc/kubernetes/cni/net.d/net-migration as part of the daemonset, we have a way of getting the legacy nodes to stop their local flannel and prevent any clashes that can occur from having flannel running twice. Once all the nodes have rolled and there are no legacy nodes left in the cluster, the file has no purpose any longer and just gets ignored.

I saw some strange behaviour in testing that led me to believe that the safest way to deploy into an existing cluster is to roll the cluster so the net-migration helper goes out first on legacy nodes. I'm not convinced that it is needed in all cases, but it does give me a bit more security for performing a roll out into a large cluster.
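The trigger mechanism described above can be sketched as a systemd path/service pair (unit contents abbreviated and hypothetical; the real units are generated from the cloud-config templates):

```ini
# net-migration-helper.path -- watches for the semaphore file written
# out by the self-hosted canal/flannel daemonset.
[Path]
PathExists=/etc/kubernetes/cni/net.d/net-migration

[Install]
WantedBy=multi-user.target
```

```ini
# net-migration-helper.service -- started by systemd when the path unit
# fires; stops the node-local flannel so two flannels never manage the
# node at once during the migration.
[Service]
Type=oneshot
ExecStart=/usr/bin/systemctl stop flanneld.service
```

Because the two units share the net-migration-helper name, systemd starts the .service automatically as soon as the .path condition is met.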

Contributor Author

Do you need any more docs/testing?

@mumoshu mumoshu Apr 6, 2018
Contributor

Totally understood! Thank you so much for the clarification.

For anyone interested, a few points that help in understanding how this is implemented:

  • net-migration-helper.path is implicitly paired with net-migration-helper.service, so that once the file is created at the path, the .service counterpart is started by systemd.
  • A canal daemonset deployed via the first controller node rolled via kube-aws update to enable canal writes the file to all the controller and worker nodes, so that net-migration-helper.path is triggered on every legacy node for faster migration. Implementation: https://github.com/kubernetes-incubator/kube-aws/pull/1195/files#diff-ef25536c536667a40b993d4d24ab7567R1241

@cknowles
Contributor

cknowles commented Apr 5, 2018

Awesome! I would heart this PR more if I could. I believe this will resolve #704 as well, which I would need to work on soon. Might be worth commenting it as experimental in cluster.yaml?

@@ -284,8 +317,7 @@ coreos:
--container-runtime={{.ContainerRuntime}} \
--rkt-path=/usr/bin/rkt \
--rkt-stage1-image=coreos.com/rkt/stage1-coreos \
--node-labels node-role.kubernetes.io/master{{if .NodeLabels.Enabled}},{{.NodeLabels.String}} \
{{end}} \
--node-labels=node-role.kubernetes.io/master="",kubernetes.io/role=master,{{if .NodeLabels.Enabled}},{{.NodeLabels.String}}{{end}} \
Contributor

Not sure if it actually affects anything, but you might unexpectedly get a double comma when NodeLabels is enabled: kubernetes.io/role=master,,{{.NodeLabels.String}}

Contributor Author

Ah yes let me send a small correction

@mumoshu
Contributor

mumoshu commented Apr 6, 2018

@davidmccormick Thank you so much for all your awesome work. I've reviewed this again, and am happy to get this merged 👍

@mumoshu mumoshu added this to the v0.9.10.rc-1 milestone Apr 6, 2018
@mumoshu mumoshu merged commit 07eb6b0 into kubernetes-retired:master Apr 6, 2018
@jorge07
Contributor

jorge07 commented Apr 10, 2018

Did someone test this?

@davidmccormick
Contributor Author

I did lots of testing of it - are you having a particular issue?

@davidmccormick davidmccormick deleted the feature/networking-daemonsets branch April 10, 2018 14:37
@jorge07
Contributor

jorge07 commented Apr 10, 2018

We're having issues creating a cluster from master and it looks related to canal. I'll try to collect as many logs as possible and post them here asap.

@davidmccormick
Contributor Author

davidmccormick commented Apr 10, 2018

Ok - let me know what you have and what settings you are using in cluster.yaml.
Also please let me know if it is a new cluster or an existing one - and if existing, what was the configured networking: flannel, or flannel + calico?

@rastut

rastut commented Apr 11, 2018

Hey, I am a colleague of @jorge07. Yesterday we were doing some testing with the master branch and we found that the cluster was not able to run containers correctly. We are running the cluster without self-hosting the network. We have been looking through the logs, and the problem was that kubelet was not able to locate the cni binaries.

Apr 11 10:42:10 ip-10-0-2-59.us-east-2.compute.internal kubelet-wrapper[1410]: E0411 10:42:10.546050    1410 cni.go:259] Error adding network: failed to find plugin "loopback" in path [/opt/loopback/bin /opt/cni/bin]
Apr 11 10:42:10 ip-10-0-2-59.us-east-2.compute.internal kubelet-wrapper[1410]: E0411 10:42:10.546706    1410 cni.go:220] Error while adding to cni lo network: failed to find plugin "loopback" in path [/opt/loopback/bin /opt/cni/bin]
Apr 11 10:42:10 ip-10-0-2-59.us-east-2.compute.internal kubelet-wrapper[1410]: E0411 10:42:10.547302    1410 cni.go:259] Error adding network: failed to find plugin "loopback" in path [/opt/loopback/bin /opt/cni/bin]
Apr 11 10:42:10 ip-10-0-2-59.us-east-2.compute.internal kubelet-wrapper[1410]: E0411 10:42:10.551897    1410 cni.go:220] Error while adding to cni lo network: failed to find plugin "loopback" in path [/opt/loopback/bin /opt/cni/bin]

After taking a look, I found that the kubelet systemd unit mounts the host path /opt/cni/bin inside the kubelet container; this overrides the container's path with the host path, so the kubelet is not able to find the cni binaries.

https://github.com/kubernetes-incubator/kube-aws/blob/3f116956373f6fa817618b257b4479a8cb464b45/core/controlplane/config/templates/cloud-config-worker#L329-L330

This is the output of one of the nodes that currently are not working correctly:

core@ip-10-0-2-59 /opt/cni/bin $ rkt list                                                                                                  
UUID		APP		IMAGE NAME				STATE		CREATED		STARTED		NETWORKS
9d5945e6	awscli		quay.io/coreos/awscli:master		exited garbage	1 hour ago	1 hour ago	
9f0dfd23	flannel		quay.io/coreos/flannel:v0.10.0		exited		1 hour ago	1 hour ago	
eadd1353	flannel		quay.io/coreos/flannel:v0.10.0		running		1 hour ago	1 hour ago	
f1dcc1fc	hyperkube-amd64	k8s.gcr.io/hyperkube-amd64:v1.9.1	running		59 minutes ago	59 minutes ago	
core@ip-10-0-2-59 /opt/cni/bin $ sudo rkt enter f1dcc1fc                                                                                   
enter: no command specified, assuming "/bin/bash"
root@ip-10-0-2-59:/# ls /opt/cni/bin
root@ip-10-0-2-59:/# 

If we delete the argument from the unit then the containers begin to run correctly, and inside the kubelet container we can see the binaries:

core@ip-10-0-1-182 /usr/lib64/systemd $ sudo rkt enter 4cd0af93                                                                  
enter: no command specified, assuming "/bin/bash"
root@ip-10-0-1-182:/# ls /opt/cni/bin
bridge	dhcp  flannel  host-local  ipvlan  loopback  macvlan  portmap  ptp  sample  tuning  vlan
root@ip-10-0-1-182:/#

@davidmccormick
Contributor Author

davidmccormick commented Apr 11, 2018

Hi, thanks for the info - I will have a look at making the mount dependent on the networking selected. I'm guessing that you are not running with calico, as that would have installed the cni binaries as part of its bootstrap.

The quickest fix, whilst I work on testing a proper one, would be to enable the kubernetes networking self-hosting feature with type 'flannel' (which would also allow more testing of this feature). Apologies for the bug; I'll open a pull request with a fix as soon as I am happy with it.

@ktateish
Contributor

Hi all,
I've opened an issue #1232 without noticing this discussion. (Thanks @kevtaylor for notifying me)

I'll keep #1232 open for users who search issues with ContainerCreating, so please include Close #1232 in your commit message when you create a fix commit for this problem in the future.

I'll also keep investigating it.

@rastut

rastut commented Apr 11, 2018

Yep, not using calico. I have just tried some changes, modifying the template, but then I get a strange failure. If I go through the templates and modify the mount, I get an error relating to signature verification of the image.

Apr 11 13:05:52 ip-10-0-2-229.us-east-2.compute.internal kubelet-wrapper[2258]: run: signature verification for docker images is not supported (try --insecure-options=image)

@davidmccormick
Contributor Author

Thanks, it looks like I introduced this bug with cni after my flannel testing. I'm testing an update to the cloud-config that copies the binaries into the host /opt/cni/bin directory. I need to keep this mounted so that users can switch from legacy flannel/calico mode to the self-hosting daemonsets with minimal cluster disruption.
