This repository has been archived by the owner on Apr 17, 2019. It is now read-only.

Implement kube-master HA for multiple masters #761

Closed

Conversation

gitschaub
Contributor

@gitschaub commented Apr 12, 2016

We need to cover HA for the kube-master components; see the Kubernetes HA cluster guide [1]. #725

The user specifies multiple nodes under the master role [2]. Ansible should seamlessly create a cluster with a collection of masters, with the API service appropriately load balanced and the other kube-master components leader-elected.

(optional, good to have)

[1] http://kubernetes.io/docs/admin/high-availability/
[2] https://github.com/kubernetes/contrib/blob/master/ansible/inventory.example.ha
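For illustration, a multi-master inventory patterned after [2] might look like the following. Hostnames are placeholders, and the group names simply mirror the groups['masters'] / groups['etcd'] references used elsewhere in this PR; the node group name ([nodes] or [minions]) should match whatever the repository's example inventory uses:

[masters]
kube-master-1
kube-master-2
kube-master-3

[etcd]
kube-etcd-1
kube-etcd-2
kube-etcd-3

[nodes]
kube-node-1
kube-node-2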




@rutsky
Contributor

rutsky commented Apr 12, 2016

@adamschaub if this is WIP, can you describe what your plan is and what is left to do to achieve a K8s HA deployment?

@@ -4,4 +4,4 @@
# default config should be adequate

# Add your own!
-KUBE_SCHEDULER_ARGS="--kubeconfig={{ kube_config_dir }}/scheduler.kubeconfig"
+KUBE_SCHEDULER_ARGS="--kubeconfig={{ kube_config_dir }}/scheduler.kubeconfig {% if groups['masters']|length > 1 %}--leader-elect=true{% endif %}"
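For reference, with more than one host in the masters group the template above renders to a line like the one below (assuming kube_config_dir is /etc/kubernetes, which is only illustrative here); with a single master the {% if %} block drops out and only --kubeconfig remains:

KUBE_SCHEDULER_ARGS="--kubeconfig=/etc/kubernetes/scheduler.kubeconfig --leader-elect=true"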
Contributor

Does it break things to do 'leader-elect=true' even if there is only one?

Contributor Author

Very good point; I'll look it up and test it out. I naively assumed that its being 'off' by default had some purpose.

@rutsky
Contributor

rutsky commented Apr 13, 2016

Related issue: kubernetes/kubernetes#18174

@danehans

@rutsky keep in mind that @adamschaub's work will include a load-balancer role within contrib/ansible, so kubernetes/kubernetes#18174 should not be an issue, as kubelet/proxy will connect to a VIP. Let me know if I missed anything with the related issue.

@gitschaub
Contributor Author

Sorry for the messy commits, but I wanted to get my changes out there for visibility. I've separated the master tasks into single-master and ha-master groups, based on the number of nodes in the 'masters' group.

No load balancing yet, and the HA tasks are CoreOS dependent at the moment.

Each master node runs an instance of kube-apiserver (via hyperkube), and kube-scheduler/kube-controller-manager are leader elected using podmaster. I'm finishing up load-balancing configuration with haproxy+keepalived pods.
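A minimal sketch of how that single-master/ha-master split could be wired up in the role's task file; the file names and layout are illustrative, not necessarily what this PR uses:

# roles/kube-master/tasks/main.yml (hypothetical excerpt)
- include: single-master.yml
  when: groups['masters']|length == 1

- include: ha-master.yml
  when: groups['masters']|length > 1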

@@ -14,12 +14,15 @@

- name: restart apiserver
  service: name=kube-apiserver state=restarted
  when: groups['masters']|length == 1


I would prefer this syntax if it works, because it's used elsewhere in the project:

groups['masters'][0]

Contributor Author

I don't follow. groups['masters'][0] is used elsewhere to delegate a task to just one master (generating tokens and certs is one example). In this case, we only want to restart the apiserver service when there is only one master.
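For anyone reading along, the two patterns being contrasted do different things; a rough sketch (the token-generation task is only an illustrative example):

# Run on every master, but only when the cluster has a single master:
- name: restart apiserver
  service: name=kube-apiserver state=restarted
  when: groups['masters']|length == 1

# Run exactly once, delegated to the first master (e.g. token/cert generation):
- name: generate tokens
  command: /usr/bin/generate-tokens.sh   # hypothetical script
  run_once: true
  delegate_to: "{{ groups['masters'][0] }}"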


got it. thanks.

@danehans

@adamschaub looks like you're making good progress.

@eparis @rutsky I would appreciate you taking time to review this WIP PR to add Master HA based on the k8s ha guide: http://kubernetes.io/docs/admin/high-availability/

@gitschaub
Contributor Author

@danehans @eparis @rutsky I've removed the haproxy/keepalived bits. After discussion with @danehans, we've decided to add haproxy/keepalived under a separate PR.

I've tested this with CentOS 7 and CoreOS (stable) in a VirtualBox environment. If you add more than one master node, the kubelet is installed on all masters, and the master components run in pods and are leader-elected as required.

Each master node runs an apiserver pod. They are not currently load balanced/tied to a floating IP, so clobbering the first master in the list will cause communication issues with the other pieces.

@gitschaub force-pushed the kube-master-ha branch 2 times, most recently from 0248a4a to 535ff74 on April 22, 2016 at 16:40
@danehans

Note: the separate haproxy/keepalived PR will address the limitation that @adamschaub describes:

Each master node runs an apiserver pod. They are not currently load balanced/tied to a floating IP, so clobbering the first master in the list will cause communication issues with the other pieces.

@ingvagabund
Contributor

Sorry guys for being my victim again :)

@ingvagabund
Contributor

@redhat-k8s-bot test this

@redhat-k8s-bot

Can one of the admins verify this patch?

@ingvagabund
Contributor

@redhat-k8s-bot test please

@redhat-k8s-bot

RH ansible test passed for commit a4b9c92

@danehans

@ingvagabund all looks good with our testing of this patch. @atosatto since you're interested in this work, have you tested it?

@atosatto
Contributor

I'll give it a spin this evening and let you know if it works for me. I already tested it a few days ago and everything was working fine.

Can someone share how they are testing the PR? How many masters are you using? And minions? Which OS? Which virtualization provider? Have you performed any additional operations to check that Kubernetes is running properly?

I'm asking all these questions because I want to find out what the current test case is, so that I can provide more consistent feedback.


@gitschaub
Contributor Author

gitschaub commented May 27, 2016

@atosatto I've been testing with Vagrant and the VirtualBox provider. I either modify the Vagrantfile directly or pull in your multi-master addition to include at least 2 masters, 3 etcds, and 1 node. Once the install is complete, I hop onto a master node and verify that the master service pods are running using docker ps (there should be a kube-apiserver on each master, and the scheduler/controller-manager should each be running on exactly one master). Use etcdctl get [/scheduler or /controller] to see which master each is assigned to. Then I take any generic rc manifest and create it using kubectl. Once it is successfully assigned to a node and running, all should be shipshape.
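Roughly, that verification flow boils down to the following commands, run on one of the masters (assuming etcdctl is on the PATH and my-rc.yml is any generic ReplicationController manifest; both are illustrative):

docker ps | grep -E 'apiserver|scheduler|controller-manager'  # apiserver on every master; scheduler/controller-manager on exactly one
etcdctl get /scheduler    # shows which master currently holds the scheduler
etcdctl get /controller   # shows which master currently holds the controller-manager
kubectl create -f my-rc.yml
kubectl get pods          # pods should be scheduled to a node and reach Running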

@gitschaub
Contributor Author

@atosatto, any updates?

@atosatto
Contributor

atosatto commented Jun 8, 2016

@adamschaub sorry, Adam, for the very late response, but I had no time to check this until yesterday. I was able to make things work with your setup, but I had issues with a greater number of master nodes. I will try to dig into this. Maybe we can ping each other on HipChat so that you can drill down into the issue.

@smugcloud

Hey @adamschaub, do you have any update on this PR? After pulling down this PR, and incorporating the changes from #711, I am still getting the error below. This is a CoreOS cluster in AWS with 5 hosts (2 Masters).

fatal: [10.178.154.89]: FAILED! => {"changed": false, "failed": true, "invocation": {"module_args": {"dest": "/etc/kubernetes/manifests/kube-api-hyperkube.yml", "src": "kube-api-hyperkube.yml.j2"}, "module_name": "template"}, "msg": "AnsibleUndefinedVariable: ERROR! 'dict object' has no attribute 'etcd_interface'"}

I'll dig into this more and submit a PR if I locate the issue.

@gitschaub
Contributor Author

@smugcloud Thanks. I haven't touched this in a while... etcd_interface is set in the etcd role. I think there may be a requirement that all nodes in the masters group also be in the etcd group; try adding them to etcd. If that's the case, I definitely need to document it.
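If that turns out to be the requirement, the fix on the inventory side is simply to list every master host under [etcd] as well; a sketch with placeholder hostnames:

[masters]
kube-master-1
kube-master-2

[etcd]
kube-master-1
kube-master-2

[nodes]
kube-node-1
kube-node-2
kube-node-3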

@k8s-github-robot

Labelling this PR as size/XL

@k8s-github-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Aug 17, 2016
@k8s-github-robot

@adamschaub PR needs rebase

@test-foxish

recomputing cla status...

@k8s-github-robot

[CLA-PING] @adamschaub

Thanks for your pull request. It looks like this may be your first contribution to a CNCF open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://identity.linuxfoundation.org/projects/cncf to sign.

Once you've signed, please reply here (e.g. "I signed it!") and we'll verify. Thanks.


@k8s-github-robot added the cncf-cla: no label (Indicates the PR's author has not signed the CNCF CLA.) on Sep 22, 2016
@k8s-github-robot

[CLA-PING] @adamschaub

Thanks for your pull request. It looks like this may be your first contribution to a CNCF open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://identity.linuxfoundation.org/projects/cncf to sign.

Once you've signed, please reply here (e.g. "I signed it!") and we'll verify. Thanks.


1 similar comment

@k8s-github-robot

[APPROVALNOTIFIER] The Following OWNERS Files Need Approval:

  • ansible

We suggest the following people:
cc @eparis
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@fejta-bot

Issues go stale after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Dec 19, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label on Jan 18, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
