
Add etcd scaleup playbook #3043

Merged: 1 commit merged into openshift:master on Aug 2, 2017

Conversation

@jkhelil (Contributor) commented Jan 5, 2017

An etcd scaleup playbook is needed for these reasons (a usage sketch follows the list):

  • the master scaleup playbook does not scale up etcd
  • if a master VM breaks and we lose it, master recovery does not restore the etcd cluster size; each master rebuild shrinks the etcd cluster by one member. This is particularly troublesome for upgrades, whenever someone upgrades the master nodes by destroying and recreating them one at a time.
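
As a usage sketch, assuming this follows the existing node/master scaleup conventions (the new_etcd group name and the playbook path below are my assumptions, not confirmed by this PR):

# hypothetical inventory additions, mirroring [new_nodes]/[new_masters]
[OSEv3:children]
masters
nodes
etcd
new_etcd

[new_etcd]
etcd4.example.com

# then run the scaleup playbook against that inventory
ansible-playbook -i hosts playbooks/byo/openshift-etcd/scaleup.yml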

@openshift-bot

Can one of the admins verify this patch?
I understand the following commands:

  • bot, add author to whitelist
  • bot, test pull request
  • bot, test pull request once

@sdodson (Member) commented Jan 10, 2017

bot, test pull request

@sdodson (Member) commented Jan 10, 2017

aos-ci-test

@jkhelil force-pushed the scaleup_etcd branch 2 times, most recently from 28cdc53 to 0604c6c on January 12, 2017
@jkhelil changed the title from "add etcd scaleup playbook" to "[WIP] add etcd scaleup playbook" on Jan 12, 2017
@jkhelil (Contributor, Author) commented Jan 12, 2017

This is not complete yet; I ran into some problems with it and need to validate it first.

@jkhelil force-pushed the scaleup_etcd branch 2 times, most recently from b1af6fe to d08e21e on January 19, 2017
@jkhelil (Contributor, Author) commented Jan 19, 2017

I've updated the PR; the playbooks are working now.

@jkhelil changed the title from "[WIP] add etcd scaleup playbook" to "Add etcd scaleup playbook" on Jan 27, 2017
@jkhelil (Contributor, Author) commented Jan 27, 2017

I am working on fixing the yamllint issues (the Travis errors).

@jkhelil (Contributor, Author) commented Jan 27, 2017

Travis errors fixed.

@jkhelil (Contributor, Author) commented Jan 30, 2017

@sdodson PTAL

@sdodson requested a review from abutcher on January 30, 2017
@sdodson (Member) commented Jan 30, 2017

Thanks, assigned someone to review who is more familiar with the tricky bits of cert work around etcd.

@abutcher (Member) commented Feb 3, 2017

@jkhelil Started testing this afternoon and will be reviewing soon! Sorry for the delay on this one.

@jkhelil (Contributor, Author) commented Feb 28, 2017

@abutcher Hi Andrew, did you have time to take a look? Do you have any feedback about this?

@abutcher (Member) commented Mar 9, 2017

@jkhelil I've tested the playbook starting with multiple etcd instances and it works great. I also think the changes here look good.

I didn't have success starting with a single etcd node and moving to multiple due to the way we configure advertised client urls. I haven't dug into this too deeply but it appears that with a single external etcd instance, the advertised client urls only contain localhost which causes the member add operation to fail. I expect there is some configuration we can change to make the advertised client urls contain the protocol:ip:port for a single node etcd cluster. Have you investigated this or would you expect starting with a single etcd instance to work?

[root@master1 ~]# /usr/bin/etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt -C https://master1.abutcher.com:2379 member add master2.abutcher.com https://192.168.122.53:2380
2017-03-08 16:48:09.107878 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
client: etcd cluster is unavailable or misconfigured; error #0: dial tcp [::1]:2379: getsockopt: connection refused
[root@master1 ~]# curl --cacert /etc/etcd/ca.crt --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key -X GET https://192.168.122.186:2379/v2/members
{"members":[{"id":"16d01a6198b0837b","name":"default","peerURLs":["http://192.168.122.186:2380"],"clientURLs":["https://localhost:2379"]}]}

As opposed to:

[root@master1 ~]# curl --cacert /etc/etcd/ca.crt --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key -X GET https://192.168.122.186:2379/v2/members
{"members":[{"id":"f2bf02afcb3a87c","name":"master1.abutcher.com","peerURLs":["https://192.168.122.186:2380"],"clientURLs":["https://192.168.122.186:2379"]},{"id":"81c1d1344f4a5c06","name":"master2.abutcher.com","peerURLs":["https://192.168.122.53:2380"],"clientURLs":["https://192.168.122.53:2379"]}]}

@sdodson (Member) commented Mar 9, 2017

We probably need to do something to convert the single etcd instance from 'localhost' to hostnames and then re-initialize the cluster with that config before scaling up.
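
A rough sketch of that conversion (hypothetical, untested commands, reusing the member ID and addresses from the single-node output above; --no-sync keeps etcdctl from chasing the bad localhost client URLs):

# look up the lone member's ID
/usr/bin/etcdctl --no-sync --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt -C https://192.168.122.186:2379 member list
# re-point its peer URL from the http default to the routable https address
/usr/bin/etcdctl --no-sync --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt -C https://192.168.122.186:2379 member update 16d01a6198b0837b https://192.168.122.186:2380
# align ETCD_NAME and ETCD_ADVERTISE_CLIENT_URLS in /etc/etcd/etcd.conf to match, then
systemctl restart etcd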

@abutcher (Member) commented Mar 9, 2017

I was able to start with a single instance by making these changes to the etcd config in order to set a single node up for future clustering.

diff --git a/roles/etcd/templates/etcd.conf.j2 b/roles/etcd/templates/etcd.conf.j2
index 64c14a0..1095e76 100644
--- a/roles/etcd/templates/etcd.conf.j2
+++ b/roles/etcd/templates/etcd.conf.j2
@@ -8,12 +8,8 @@
 {% endfor -%}
 {% endmacro -%}
 
-{% if etcd_peers | default([]) | length > 1 %}
 ETCD_NAME={{ etcd_hostname }}
 ETCD_LISTEN_PEER_URLS={{ etcd_listen_peer_urls }}
-{% else %}
-ETCD_NAME=default
-{% endif %}
 ETCD_DATA_DIR={{ etcd_data_dir }}
 #ETCD_SNAPSHOT_COUNTER=10000
 ETCD_HEARTBEAT_INTERVAL=500
@@ -23,7 +19,6 @@ ETCD_LISTEN_CLIENT_URLS={{ etcd_listen_client_urls }}
 #ETCD_MAX_WALS=5
 #ETCD_CORS=
 
-{% if etcd_peers | default([]) | length > 1 %}
 #[cluster]
 ETCD_INITIAL_ADVERTISE_PEER_URLS={{ etcd_initial_advertise_peer_urls }}
 {% if initial_etcd_cluster is defined and initial_etcd_cluster %}
@@ -37,7 +32,6 @@ ETCD_INITIAL_CLUSTER_TOKEN={{ etcd_initial_cluster_token }}
 #ETCD_DISCOVERY_SRV=
 #ETCD_DISCOVERY_FALLBACK=proxy
 #ETCD_DISCOVERY_PROXY=
-{% endif %}
 ETCD_ADVERTISE_CLIENT_URLS={{ etcd_advertise_client_urls }}
 
 #[proxy]
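
With the guards removed like this, a freshly installed single member should register under its hostname rather than "default", which a quick member list (same etcdctl flags as earlier in this thread) ought to confirm:

/usr/bin/etcdctl --cert-file /etc/etcd/peer.crt --key-file /etc/etcd/peer.key --ca-file /etc/etcd/ca.crt -C https://master1.abutcher.com:2379 member list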

@openshift-bot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Mar 10, 2017
@ingvagabund (Member)

RFE for adding/removing etcd members here: #1772

@ingvagabund (Member) commented May 4, 2017

Use cases tested:

  • 1-member rpm-based cluster: failed as expected; the etcd cluster was not able to elect a leader
  • 3-member rpm-based cluster: the cluster scaled up and was healthy
  • 1-member container-based cluster: failed, same as the rpm-based one
  • 3-member container-based cluster: the new etcd_container service failed and the new member hangs in [unstarted] state indefinitely. However, when I first run the rpm-based etcd on the host, it connects to the cluster and the fourth member is correctly added. If I then stop the etcd binary and start the etcd_container service, the service runs as expected. The initial error of the etcd docker container is:
# /usr/bin/docker run --name etcd_container --rm -v /var/lib/etcd/:/var/lib/etcd/:z -v /etc/etcd:/etc/etcd:ro --env-file=/etc/etcd/etcd.conf --net=host --entrypoint=/usr/bin/etcd registry.access.redhat.com/rhel7/etcd
2017-05-04 13:55:00.376606 I | pkg/flags: recognized and used environment variable ETCD_ADVERTISE_CLIENT_URLS=https://172.16.186.68:2379
2017-05-04 13:55:00.376871 I | pkg/flags: recognized and used environment variable ETCD_CA_FILE=/etc/etcd/ca.crt
2017-05-04 13:55:00.376876 I | pkg/flags: recognized and used environment variable ETCD_CERT_FILE=/etc/etcd/server.crt
2017-05-04 13:55:00.376892 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd/
2017-05-04 13:55:00.376908 I | pkg/flags: recognized and used environment variable ETCD_ELECTION_TIMEOUT=2500
2017-05-04 13:55:00.376915 I | pkg/flags: recognized and used environment variable ETCD_HEARTBEAT_INTERVAL=500
2017-05-04 13:55:00.376923 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_ADVERTISE_PEER_URLS=https://172.16.186.68:2380
2017-05-04 13:55:00.376933 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER="172.16.186.198=https://172.16.186.198:2380,172.16.186.19=https://172.16.186.19:2380,10.8.173.198=https://172.16.186.68:2380,172.16.186.196=https://172.16.186.196:2380"
2017-05-04 13:55:00.376941 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_STATE=existing
2017-05-04 13:55:00.376945 I | pkg/flags: recognized and used environment variable ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-1
2017-05-04 13:55:00.376950 I | pkg/flags: recognized and used environment variable ETCD_KEY_FILE=/etc/etcd/server.key
2017-05-04 13:55:00.376958 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_CLIENT_URLS=https://172.16.186.68:2379
2017-05-04 13:55:00.376964 I | pkg/flags: recognized and used environment variable ETCD_LISTEN_PEER_URLS=https://172.16.186.68:2380
2017-05-04 13:55:00.376976 I | pkg/flags: recognized and used environment variable ETCD_NAME=10.8.173.198
2017-05-04 13:55:00.376985 I | pkg/flags: recognized and used environment variable ETCD_PEER_CA_FILE=/etc/etcd/ca.crt
2017-05-04 13:55:00.376991 I | pkg/flags: recognized and used environment variable ETCD_PEER_CERT_FILE=/etc/etcd/peer.crt
2017-05-04 13:55:00.376998 I | pkg/flags: recognized and used environment variable ETCD_PEER_KEY_FILE=/etc/etcd/peer.key
2017-05-04 13:55:00.377038 I | etcdmain: etcd Version: 3.1.3
2017-05-04 13:55:00.377043 I | etcdmain: Git SHA: 21fdcc6
2017-05-04 13:55:00.377046 I | etcdmain: Go Version: go1.7.4
2017-05-04 13:55:00.377048 I | etcdmain: Go OS/Arch: linux/amd64
2017-05-04 13:55:00.377052 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2017-05-04 13:55:00.377165 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2017-05-04 13:55:00.377180 I | embed: peerTLS: cert = /etc/etcd/peer.crt, key = /etc/etcd/peer.key, ca = /etc/etcd/ca.crt, trusted-ca = , client-cert-auth = false
2017-05-04 13:55:00.378145 I | embed: listening for peers on https://172.16.186.68:2380
2017-05-04 13:55:00.378191 I | embed: listening for client requests on 172.16.186.68:2379
2017-05-04 13:55:00.382675 I | warning: ignoring ServerName for user-provided CA for backwards compatibility is deprecated
2017-05-04 13:55:00.402022 W | etcdserver: could not get cluster response from https://172.16.186.196:2380": Get https://172.16.186.196:2380"/members: dial tcp: unknown port tcp/2380"
2017-05-04 13:55:00.431901 C | etcdmain: error validating peerURLs {ClusterID:d712d882974aa376 Members:[&{ID:7915f55aacdb8b03 RaftAttributes:{PeerURLs:[https://172.16.186.198:2380]} Attributes:{Name:172.16.186.198 ClientURLs:[https://172.16.186.198:2379]}} &{ID:8912ec64d58a0782 RaftAttributes:{PeerURLs:[https://172.16.186.19:2380]} Attributes:{Name:172.16.186.19 ClientURLs:[https://172.16.186.19:2379]}} &{ID:8dba527c7e7ddbcb RaftAttributes:{PeerURLs:[https://172.16.186.68:2380]} Attributes:{Name: ClientURLs:[]}} &{ID:f7ddfdabf342e6f3 RaftAttributes:{PeerURLs:[https://172.16.186.196:2380]} Attributes:{Name:172.16.186.196 ClientURLs:[https://172.16.186.196:2379]}}] RemovedMemberIDs:[]}: unmatched member while checking PeerURLs

Tried that with both SELinux enabled and disabled; no difference.

@ingvagabund (Member)

Related issue for the 1-member case: etcd-io/etcd#7820

@jkhelil (Contributor, Author) commented Jun 15, 2017

@abutcher Hi Andrew, were you able to do this?

"With the changes in the PR I'm able to go from one etcd instance to multiple using the scaleup playbook 👍 but not if the cluster was created prior to introducing these config changes. There may be a way to reconfig with these changes and then scaleup. I will test this today."

@abutcher (Member)

> @abutcher Hi Andrew, were you able to do this?
> "With the changes in the PR I'm able to go from one etcd instance to multiple using the scaleup playbook 👍 but not if the cluster was created prior to introducing these config changes. There may be a way to reconfig with these changes and then scaleup. I will test this today."

Yes, I tested this, but not successfully.

@jkhelil (Contributor, Author) commented Jun 15, 2017

I am working on it; I'll keep you informed.

@jkhelil changed the title from "Add etcd scaleup playbook" to "[WIP]Add etcd scaleup playbook" on Jun 15, 2017
@jkhelil (Contributor, Author) commented Jun 19, 2017

I managed to get it working with a cluster created with the old config, and I was able to scale up etcd on a new node using the rpm-based config (not the dockerized one), but I am encountering this error:
TASK [etcd_common : Fail if invalid r_etcd_common_action provided] *************
task path: /home/ansible/openshift-ansible/roles/etcd_common/tasks/main.yml:2
fatal: [ose3-int-a-node1.node.new-gen-1a-eu-central-1.acs]: FAILED! => {
"failed": true,
"msg": "The conditional check 'o' failed. The error was: error while evaluating conditional (o): 'o' is undefined\n\nThe error appears to have been in '/home/ansible/openshift-ansible/roles/etcd/tasks/main.yml': line 123, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- include_role:\n ^ here\n"
}

This seems unrelated to etcd scaleup; it is reported in #4121.

@jkhelil (Contributor, Author) commented Jun 20, 2017

@abutcher Can you give me explicit detail about what exactly is not working? I was able to add an etcd node using these playbooks on a stack created with the 3.2 playbooks, apart from the error I mentioned, which seems to be a bug in the latest openshift-ansible.

@openshift-bot

Can one of the admins verify this patch?
I understand the following commands:

  • bot, add author to whitelist
  • bot, test pull request
  • bot, test pull request once

@abutcher (Member) commented Jul 28, 2017

@jkhelil The only case that isn't working for me is to create a cluster with a single instance in the [etcd] group with the master branch and then attempt to scale it up with this branch. However, if the cluster was initially created with this branch then scaling up from a single instance works fine.

I also encountered some issues with overridden hostnames which I was able to work around with this commit abutcher@54d3b45.

@sdodson (Member) commented Aug 1, 2017

aos-ci-test

@sdodson (Member) commented Aug 1, 2017

[merge]

@abutcher (Member) commented Aug 1, 2017

bot, retest this please

@sdodson changed the title from "[WIP]Add etcd scaleup playbook" to "Add etcd scaleup playbook" on Aug 1, 2017
@openshift-bot

error: aos-ci-jenkins/OS_3.6_containerized for cd269f0 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for cd269f0 (logs)

@abutcher (Member) commented Aug 2, 2017

aos-ci-test

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_NOT_containerized, aos-ci-jenkins/OS_3.6_NOT_containerized_e2e_tests" for cd269f0 (logs)

@openshift-bot

success: "aos-ci-jenkins/OS_3.6_containerized, aos-ci-jenkins/OS_3.6_containerized_e2e_tests" for cd269f0 (logs)

@sdodson (Member) commented Aug 2, 2017

[merge]

@sdodson merged commit 1765ce2 into openshift:master on Aug 2, 2017
@ganhuang (Contributor)

@sdodson Do you think we need a card to track this feature? I'm afraid it will be missed by QE, and by the documentation, if it's intended to land in OCP.

@sdodson (Member) commented Aug 11, 2017

@ganhuang There's one already; I've moved it to complete: https://trello.com/c/EESwIsuW/171-5-support-for-scaling-up-etcd
