This repository has been archived by the owner on Apr 17, 2019. It is now read-only.

Implement kube-master HA for multiple masters #761

Closed

Conversation

gitschaub
Contributor

@gitschaub commented Apr 12, 2016

We need to cover HA for the kube-master components; see the Kubernetes HA cluster guide [1]. #725

The user specifies multiple nodes under the master role [2]. Ansible should seamlessly create a cluster with a collection of masters, with the API service appropriately load balanced and the other kube-master components leader-elected.

(optional, good to have)

[1] http://kubernetes.io/docs/admin/high-availability/
[2] https://github.com/kubernetes/contrib/blob/master/ansible/inventory.example.ha
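For illustration, a multi-master inventory patterned after [2] might look like the following. Hostnames are placeholders, and the group names simply mirror the groups['masters'] / groups['etcd'] references used elsewhere in this PR; the node group name ([nodes] or [minions]) should match whatever the repository's example inventory uses:

[masters]
kube-master-1
kube-master-2
kube-master-3

[etcd]
kube-etcd-1
kube-etcd-2
kube-etcd-3

[nodes]
kube-node-1
kube-node-2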




@rutsky
Contributor

rutsky commented Apr 12, 2016

@adamschaub if this is WIP, can you describe what your plan is and what is left to do to achieve a K8s HA deployment?

@@ -4,4 +4,4 @@
# default config should be adequate

# Add your own!
-KUBE_SCHEDULER_ARGS="--kubeconfig={{ kube_config_dir }}/scheduler.kubeconfig"
+KUBE_SCHEDULER_ARGS="--kubeconfig={{ kube_config_dir }}/scheduler.kubeconfig {% if groups['masters']|length > 1 %}--leader-elect=true{% endif %}"
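For reference, with more than one host in the masters group the template above renders to a line like the one below (assuming kube_config_dir is /etc/kubernetes, which is only illustrative here); with a single master the {% if %} block drops out and only --kubeconfig remains:

KUBE_SCHEDULER_ARGS="--kubeconfig=/etc/kubernetes/scheduler.kubeconfig --leader-elect=true"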
Contributor

Does it break things to do 'leader-elect=true' even if there is only one?

Contributor Author

Very good point; I'll look it up and test it out. I naively assumed that its being 'off' by default had some purpose.

@rutsky
Contributor

rutsky commented Apr 13, 2016

Related issue: kubernetes/kubernetes#18174

@danehans

@rutsky keep in mind that @adamschaub's work will include a load-balancer role within contrib/ansible, so kubernetes/kubernetes#18174 should not be an issue, as kubelet/proxy will connect to a VIP. Let me know if I missed anything with the related issue.

@gitschaub
Contributor Author

Sorry for the messy commits, but I wanted to get my changes out there for visibility. I've separated the master tasks into single-master and ha-master groups, based on the number of nodes in the 'masters' group.

No load balancing yet, and the HA tasks are CoreOS dependent at the moment.

Each master node runs an instance of kube-apiserver (via hyperkube), and kube-scheduler/kube-controller-manager are leader elected using podmaster. I'm finishing up load-balancing configuration with haproxy+keepalived pods.
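A minimal sketch of how that single-master/ha-master split could be wired up in the role's task file; the file names and layout are illustrative, not necessarily what this PR uses:

# roles/kube-master/tasks/main.yml (hypothetical excerpt)
- include: single-master.yml
  when: groups['masters']|length == 1

- include: ha-master.yml
  when: groups['masters']|length > 1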

@@ -14,12 +14,15 @@

- name: restart apiserver
  service: name=kube-apiserver state=restarted
  when: groups['masters']|length == 1


I would prefer this syntax if it works, because it's used elsewhere in the project:

groups['masters'][0]

Contributor Author

I don't follow. groups['masters'][0] is used elsewhere to delegate a task to just one master (generating tokens and certs is one example). In this case, we only want to restart the apiserver service when there is only one master.
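For anyone reading along, the two patterns being contrasted do different things; a rough sketch (the token-generation task is only an illustrative example):

# Run on every master, but only when the cluster has a single master:
- name: restart apiserver
  service: name=kube-apiserver state=restarted
  when: groups['masters']|length == 1

# Run exactly once, delegated to the first master (e.g. token/cert generation):
- name: generate tokens
  command: /usr/bin/generate-tokens.sh   # hypothetical script
  run_once: true
  delegate_to: "{{ groups['masters'][0] }}"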


got it. thanks.

@danehans

@adamschaub looks like you're making good progress.

@eparis @rutsky I would appreciate you taking time to review this WIP PR to add Master HA based on the k8s ha guide: http://kubernetes.io/docs/admin/high-availability/

@gitschaub
Contributor Author

@danehans @eparis @rutsky I've removed the haproxy/keepalived bits. After discussion with @danehans, we've decided to add haproxy/keepalived under a separate PR.

I've tested this with CentOS 7 and CoreOS (stable) in a VirtualBox environment. If you add more than one master node, the kubelet is installed on all masters, and the master components run in pods and are leader-elected as required.

Each master node runs an apiserver pod. They are not currently load balanced/tied to a floating IP, so clobbering the first master in the list will cause communication issues with the other pieces.

@gitschaub force-pushed the kube-master-ha branch 2 times, most recently from 0248a4a to 535ff74 on April 22, 2016 at 16:40
@danehans

Note: the separate haproxy/keepalived PR will address the limitation that @adamschaub describes:

Each master node runs an apiserver pod. They are not currently load balanced/tied to a floating IP, so clobbering the first master in the list will cause communication issues with the other pieces.

@ingvagabund
Contributor

Sorry guys for being my victim again :)

@ingvagabund
Contributor

@redhat-k8s-bot test this

@redhat-k8s-bot

Can one of the admins verify this patch?

@ingvagabund
Contributor

@redhat-k8s-bot test please

@redhat-k8s-bot

RH ansible test passed for commit a4b9c92

@danehans

@ingvagabund all looks good with our testing of this patch. @atosatto since you're interested in this work, have you tested it?

@atosatto
Contributor

I'll give it a spin this evening and let you know if it works for me. I already tested it a few days ago and everything was working fine.

Can someone share how they are testing the PR? How many masters are you using? And minions? Which OS? Which virtualization provider? Have you performed any additional operations to check that Kubernetes is running properly?

I'm asking all these questions because I want to find out what the current test case is, so that I can provide more consistent feedback.


@gitschaub
Contributor Author

gitschaub commented May 27, 2016

@atosatto I've been testing with Vagrant and the VirtualBox provider. I either modify the Vagrantfile directly or pull in your multi-master addition to include at least 2 masters, 3 etcds, and 1 node. Once the install is complete, I hop onto a master node and verify that the master service pods are running using docker ps (there should be a kube-apiserver on each master, and the scheduler/controller-manager should each be running on exactly one master). Use etcdctl get [/scheduler or /controller] to see which master each is assigned to. Then I take any generic rc manifest and create it using kubectl. Once it is successfully assigned to a node and running, all should be shipshape.
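Roughly, that verification flow boils down to the following commands, run on one of the masters (assuming etcdctl is on the PATH and my-rc.yml is any generic ReplicationController manifest; both are illustrative):

docker ps | grep -E 'apiserver|scheduler|controller-manager'  # apiserver on every master; scheduler/controller-manager on exactly one
etcdctl get /scheduler    # shows which master currently holds the scheduler
etcdctl get /controller   # shows which master currently holds the controller-manager
kubectl create -f my-rc.yml
kubectl get pods          # pods should be scheduled to a node and reach Running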

@gitschaub
Contributor Author

@atosatto, any updates?

@atosatto
Contributor

atosatto commented Jun 8, 2016

@adamschaub sorry, Adam, for the very late response, but I had no time to check this until yesterday. I was able to make things work with your setup, but I had issues with a greater number of master nodes. I will try to dig into this. Maybe we can ping each other on HipChat so that you can drill down into the issue.

@smugcloud

Hey @adamschaub, do you have any update on this PR? After pulling down this PR, and incorporating the changes from #711, I am still getting the error below. This is a CoreOS cluster in AWS with 5 hosts (2 Masters).

fatal: [10.178.154.89]: FAILED! => {"changed": false, "failed": true, "invocation": {"module_args": {"dest": "/etc/kubernetes/manifests/kube-api-hyperkube.yml", "src": "kube-api-hyperkube.yml.j2"}, "module_name": "template"}, "msg": "AnsibleUndefinedVariable: ERROR! 'dict object' has no attribute 'etcd_interface'"}

I'll dig into this more and submit a PR if I locate the issue.

@gitschaub
Contributor Author

@smugcloud Thanks. I haven't touched this in a while... etcd_interface is set in the etcd role. I think there may be a requirement that all nodes in the masters group also be in the etcd group; try adding them to etcd. If that's the case, I definitely need to document it.
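If that turns out to be the requirement, the fix on the inventory side is simply to list every master host under [etcd] as well; a sketch with placeholder hostnames:

[masters]
kube-master-1
kube-master-2

[etcd]
kube-master-1
kube-master-2

[nodes]
kube-node-1
kube-node-2
kube-node-3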

@k8s-github-robot

Labelling this PR as size/XL

@k8s-github-robot added the needs-rebase label (Indicates a PR cannot be merged because it has merge conflicts with HEAD.) on Aug 17, 2016
@k8s-github-robot

@adamschaub PR needs rebase

@test-foxish

recomputing cla status...

@k8s-github-robot

[CLA-PING] @adamschaub

Thanks for your pull request. It looks like this may be your first contribution to a CNCF open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://identity.linuxfoundation.org/projects/cncf to sign.

Once you've signed, please reply here (e.g. "I signed it!") and we'll verify. Thanks.


@k8s-github-robot added the cncf-cla: no label (Indicates the PR's author has not signed the CNCF CLA.) on Sep 22, 2016
@k8s-github-robot

[CLA-PING] @adamschaub

Thanks for your pull request. It looks like this may be your first contribution to a CNCF open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://identity.linuxfoundation.org/projects/cncf to sign.

Once you've signed, please reply here (e.g. "I signed it!") and we'll verify. Thanks.


1 similar comment

@k8s-github-robot

[APPROVALNOTIFIER] The Following OWNERS Files Need Approval:

  • ansible

We suggest the following people:
cc @eparis
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@fejta-bot

Issues go stale after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label (Denotes an issue or PR has remained open with no activity and has become stale.) on Dec 19, 2017
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle rotten
/remove-lifecycle stale

@k8s-ci-robot added the lifecycle/rotten label (Denotes an issue or PR that has aged beyond stale and will be auto-closed.) and removed the lifecycle/stale label on Jan 18, 2018
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
