kubeadm: wait for the etcd cluster to be available when growing it #72984

Merged: k8s-ci-robot merged 1 commit into kubernetes:master from ereslibre:wait-for-etcd-when-growing on Jan 20, 2019.
Changes from all commits
```diff
@@ -21,6 +21,7 @@ import (
 	"crypto/tls"
 	"fmt"
 	"net"
+	"net/url"
 	"path/filepath"
 	"strconv"
 	"strings"
```
```diff
@@ -73,7 +74,7 @@ func New(endpoints []string, ca, cert, key string) (*Client, error) {
 	return &client, nil
 }

-// NewFromCluster creates an etcd client for the the etcd endpoints defined in the ClusterStatus value stored in
+// NewFromCluster creates an etcd client for the etcd endpoints defined in the ClusterStatus value stored in
 // the kubeadm-config ConfigMap in kube-system namespace.
 // Once created, the client synchronizes client's endpoints with the known endpoints from the etcd membership API (reality check).
 func NewFromCluster(client clientset.Interface, certificatesDir string) (*Client, error) {
```
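The "reality check" mentioned in the doc comment means the endpoint list stored in the kubeadm-config ConfigMap is not trusted blindly: the client rebuilds its endpoints from what the etcd membership API actually reports. A minimal standalone sketch of that idea (the function name and plain-slice input are illustrative, not kubeadm's actual code):

```go
package main

import (
	"fmt"
	"sort"
)

// endpointsFromMembers rebuilds the authoritative endpoint list from the
// client URLs reported by the etcd membership API, deduplicated and sorted.
func endpointsFromMembers(memberClientURLs [][]string) []string {
	seen := map[string]bool{}
	var endpoints []string
	for _, urls := range memberClientURLs {
		for _, u := range urls {
			if !seen[u] {
				seen[u] = true
				endpoints = append(endpoints, u)
			}
		}
	}
	sort.Strings(endpoints)
	return endpoints
}

func main() {
	// Two members, one of them advertising two client URLs.
	members := [][]string{
		{"https://10.0.0.1:2379"},
		{"https://10.0.0.2:2379", "https://127.0.0.1:2379"},
	}
	fmt.Println(endpointsFromMembers(members))
}
```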
```diff
@@ -146,7 +147,15 @@ type Member struct {
 }

 // AddMember notifies an existing etcd cluster that a new member is joining
-func (c Client) AddMember(name string, peerAddrs string) ([]Member, error) {
+func (c *Client) AddMember(name string, peerAddrs string) ([]Member, error) {
+	// Parse the peer address, required to add the client URL later to the list
+	// of endpoints for this client. Parsing as a first operation to make sure that
+	// if this fails no member addition is performed on the etcd cluster.
+	parsedPeerAddrs, err := url.Parse(peerAddrs)
+	if err != nil {
+		return nil, errors.Wrapf(err, "error parsing peer address %s", peerAddrs)
+	}
+
 	cli, err := clientv3.New(clientv3.Config{
 		Endpoints:   c.Endpoints,
 		DialTimeout: 20 * time.Second,
```

Review note on this hunk: as a note/TODO, in a separate PR we need to make all the client methods use pointers.
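The switch from a value receiver to a pointer receiver matters because AddMember now mutates c.Endpoints (see the next hunk): with a value receiver the method would operate on a copy of the Client and the appended endpoint would be silently lost. A minimal standalone sketch of that distinction (the type and method names are illustrative, not the kubeadm API):

```go
package main

import "fmt"

// Client is a stand-in for the kubeadm etcd client type.
type Client struct {
	Endpoints []string
}

// addEndpointByValue receives a copy of Client, so the append
// is only visible inside the method and is discarded on return.
func (c Client) addEndpointByValue(ep string) {
	c.Endpoints = append(c.Endpoints, ep)
}

// addEndpointByPointer mutates the caller's Client, so the
// appended endpoint persists.
func (c *Client) addEndpointByPointer(ep string) {
	c.Endpoints = append(c.Endpoints, ep)
}

func main() {
	c := &Client{Endpoints: []string{"https://10.0.0.1:2379"}}
	c.addEndpointByValue("https://10.0.0.2:2379")
	fmt.Println(len(c.Endpoints)) // 1: the copy's append was lost
	c.addEndpointByPointer("https://10.0.0.2:2379")
	fmt.Println(len(c.Endpoints)) // 2: the pointer receiver persisted it
}
```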
```diff
@@ -176,6 +185,9 @@ func (c Client) AddMember(name string, peerAddrs string) ([]Member, error) {
 		}
 	}

+	// Add the new member client address to the list of endpoints
+	c.Endpoints = append(c.Endpoints, GetClientURLByIP(parsedPeerAddrs.Hostname()))
+
 	return ret, nil
 }
```
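GetClientURLByIP itself is not shown in this diff. A plausible sketch of what such a helper does, and how it connects to the peer address parsed at the top of AddMember (2379 is etcd's default client port; the exact kubeadm implementation may differ):

```go
package main

import (
	"fmt"
	"net/url"
)

// getClientURLByIP mirrors what a helper like GetClientURLByIP plausibly
// does: build an HTTPS client URL on etcd's default client port (2379).
func getClientURLByIP(ip string) string {
	return fmt.Sprintf("https://%s:2379", ip)
}

func main() {
	// The peer address (default peer port 2380) is parsed first; its
	// hostname is then reused to derive the client URL that gets
	// appended to c.Endpoints.
	peer, err := url.Parse("https://10.0.0.2:2380")
	if err != nil {
		panic(err)
	}
	fmt.Println(getClientURLByIP(peer.Hostname())) // https://10.0.0.2:2379
}
```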
```diff
@@ -255,7 +267,7 @@ func (c Client) WaitForClusterAvailable(retries int, retryInterval time.Duration
 			fmt.Printf("[util/etcd] Waiting %v until next retry\n", retryInterval)
 			time.Sleep(retryInterval)
 		}
-		fmt.Printf("[util/etcd] Attempting to see if all cluster endpoints are available %d/%d\n", i+1, retries)
+		klog.V(2).Infof("attempting to see if all cluster endpoints (%s) are available %d/%d", c.Endpoints, i+1, retries)
 		resp, err := c.ClusterAvailable()
 		if err != nil {
 			switch err {
```
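For context on the retry discussion below, the waiter follows a simple fixed-interval polling pattern. A self-contained sketch of that shape (the function name and availability check are illustrative, not kubeadm's exact code; the discussion below converges toward 5-second intervals with 8 retries, i.e. roughly a 40-second budget):

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitForAvailable polls check() up to retries times, sleeping
// retryInterval between attempts, and returns nil on the first success.
func waitForAvailable(retries int, retryInterval time.Duration, check func() (bool, error)) error {
	for i := 0; i < retries; i++ {
		if i > 0 {
			fmt.Printf("Waiting %v until next retry\n", retryInterval)
			time.Sleep(retryInterval)
		}
		fmt.Printf("attempting to see if the cluster is available %d/%d\n", i+1, retries)
		if ok, err := check(); err == nil && ok {
			return nil
		}
	}
	return errors.New("timeout waiting for the cluster to become available")
}

func main() {
	attempts := 0
	// Demo with a short interval: succeed on the third attempt.
	err := waitForAvailable(8, 100*time.Millisecond, func() (bool, error) {
		attempts++
		return attempts >= 3, nil
	})
	fmt.Println(err) // <nil>
}
```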
Review discussion:

- I don't like this function printing. IMO that output should be removed (or converted into log messages) in order to be consistent with all the other waiters in kubeadm. However, considering that this requires adding "This can take up to ..." in every place where WaitForClusterAvailable is used, it goes out of the scope of this PR, so please open an issue to track this as a todo/good first issue.
- I proposed to use klog here too, but @rosti didn't want to address this change in this PR, only changing the one potentially long line with the endpoints. I agree with your point of view though, @fabriziopandini. @rosti, wdyt? Should I change this now that @fabriziopandini has also raised this issue?
- Marked as resolved as per discussion with @fabriziopandini, leaving it as it was, as @rosti proposed.
- I do think that we need some sort of indication of the reason we wait another 5 seconds. This is tightly coupled with the UX of end users who run kubeadm directly on the command line, so I am not a fan of klogging this; in my opinion it should go out via print. On the other hand, we can certainly reduce the output here to a single, more descriptive message per retry. However, as @fabriziopandini mentioned, this will require changes in a few more places, so it may be better done in another PR. We can file a backlog issue for now.
- I wanted to get more feedback on the 5 seconds and the interval of 20 tries. If we can get a check faster than 5 seconds on average, could we possibly reduce the value? Also, 20 tries is a lot; in reality, we might get the failed state much sooner.
- In the last run it took 4 retries (at a 5-second interval), and this one was way off the charts, with a clean environment :(
- I've mentioned this on Slack:
- @fabriziopandini @rosti please give your stamp of approval for the above comment.
- I like the 40 seconds idea, but let's keep the steps at 5 sec. Bear in mind that we have just written out the static pod spec, so the kubelet needs to detect it and spin it up, and etcd has to become responsive. On some systems it's easy for this to take more than 2 seconds.
- ok, @rosti is voting for 5 sec / 8 retries. @fabriziopandini?