Enable etcd cluster scaling for userclusters #5571
Conversation
if etcdClusterSize > replicas {
	return replicas + 1
}
return replicas - 1
Tested with a pre-existing, running cluster with 3 replicas. The statefulset ended up being reduced to 2 replicas, I assume here (.Spec.EtcdClusterSize not set => etcdClusterSize:=0; replicas:=3; isEtcdHealthy:=true => return 2). The etcd launcher then refuses to launch a 2-node cluster because that's smaller than the defaultClusterSize (3), leading to a permanently non-running cluster. Is the idea that the admin must always set .Spec.EtcdClusterSize?
No, it's defaulted here, which runs as part of the API server. I am not 100% sure if it will default existing clusters as well. But it is a good point, we should be more defensive here.
Confirmed that an existing cluster is migrated to the new launcher properly if .spec.etcdClusterSize is set to 3 first. Maybe we should just treat an unspecified (==0) value for the field as if the field was set to 3.
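The suggested fallback ("treat an unspecified (==0) value as if the field was set to 3") can be sketched as a small helper. The function name `effectiveClusterSize` is hypothetical, not code from the PR:

```go
package main

import "fmt"

const defaultClusterSize = 3

// effectiveClusterSize is a hypothetical helper for the suggestion above:
// an unset (zero) .spec.etcdClusterSize falls back to the default, so
// pre-existing clusters are never scaled below the minimum.
func effectiveClusterSize(specSize int) int {
	if specSize == 0 {
		return defaultClusterSize
	}
	return specSize
}

func main() {
	fmt.Println(effectiveClusterSize(0)) // unset field falls back to the default
	fmt.Println(effectiveClusterSize(5)) // explicit values are respected
}
```

This keeps the defensive behavior in the launcher itself, independent of whether the API-server defaulting ran for a given cluster.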
No, it's defaulted here, which runs as part of the API server. I am not 100% sure if it will default existing clusters as well. But it is a good point, we should be more defensive here.
Yeah the defaulting I saw too, but I'm pretty sure that's only called when new clusters are created.
@multi-io I just pushed a fix for it. Can you please confirm it's working?
Confirmed. Existing etcd cluster was migrated without having to manually set .spec.etcdClusterSize or anything else in the cluster resource.
cmd/etcd-launcher/main.go
Outdated
// not required, will leave it for now.
os.Setenv(initialStateEnvName, "new")
os.Setenv(initialClusterEnvName, initialMemberList(config.clusterSize, config.namespace))
os.Setenv(initialStateEnvName, e.config.initialState)
This line and the line underneath it have no error check. I know the original code had no error check either, but it would make sense to add some checks here.
cmd/etcd-launcher/main.go
Outdated
os.Setenv(initialStateEnvName, "new")
os.Setenv(initialClusterEnvName, initialMemberList(config.clusterSize, config.namespace))
os.Setenv(initialStateEnvName, e.config.initialState)
os.Setenv(initialClusterEnvName, strings.Join(initialMembers, ","))
One other thing regarding this line and the one above: are you using them for debugging? Normally env vars are populated via the job instead of being set explicitly in the code.
cmd/etcd-launcher/main.go
Outdated
return fmt.Sprintf("%s.etcd.%s.svc.cluster.local:2380", e.config.podName, e.config.namespace)
}

func (e *etcdCluster) getConfigFromEnv() error {
The way we actually pass configs to the bin/cmd is via the args/options pattern. In other words, we pass those configs as flags to the cmd instead of env vars. It works as follows:
- From outside (e.g. the job runner), env vars are created and passed to the container.
- Inside the container/job, the configs are passed as flags.
For more information, take a look at the user-cluster-controller-manager and how those args are read via the hack script.
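The args/options pattern described above could look roughly like this with the stdlib `flag` package; the flag names and the `config` struct are illustrative, not the ones the PR ended up using:

```go
package main

import (
	"flag"
	"fmt"
)

// config mirrors the kind of options the launcher needs; the fields and
// flag names here are assumptions for illustration.
type config struct {
	clusterSize int
	namespace   string
}

// parseFlags reads the launcher configuration from command-line flags
// rather than from env vars, matching the args/options pattern.
func parseFlags(args []string) (*config, error) {
	c := &config{}
	fs := flag.NewFlagSet("etcd-launcher", flag.ContinueOnError)
	fs.IntVar(&c.clusterSize, "etcd-cluster-size", 3, "number of etcd members")
	fs.StringVar(&c.namespace, "namespace", "", "cluster namespace")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return c, nil
}

func main() {
	c, err := parseFlags([]string{"-etcd-cluster-size=5", "-namespace=cluster-x"})
	if err != nil {
		panic(err)
	}
	fmt.Println(c.clusterSize, c.namespace)
}
```

The job spec can still expose env vars and have the container entrypoint expand them into these flags.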
cmd/etcd-launcher/main.go
Outdated
etcdCmd(config),
os.Environ())
// setup and start etcd command
cmd := exec.Command("/usr/local/bin/etcd", etcdCmd(e.config)...)
Shouldn't we make sure that the etcd command exists before doing any calls or preparation? E.g. if os.Stat reports that it does not exist, return an error as early as possible?
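A sketch of the suggested early existence check, using `os.Stat` from the stdlib (presumably what the comment's "os.Status" refers to); `checkCommand` and the path are illustrative:

```go
package main

import (
	"fmt"
	"os"
)

// checkCommand returns an error early if the etcd binary is missing,
// so the launcher can fail fast instead of doing any preparation first.
func checkCommand(path string) error {
	if _, err := os.Stat(path); os.IsNotExist(err) {
		return fmt.Errorf("can't find command %q: %v", path, err)
	}
	return nil
}

func main() {
	if err := checkCommand("/usr/local/bin/etcd"); err != nil {
		fmt.Println(err)
	}
}
```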
cmd/etcd-launcher/main.go
Outdated
// just get a key from etcd, this is how `etcdctl endpoint health` works!
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
_, err = client.Get(ctx, "healthy")
cancel()
Nit: maybe use defer cancel() so the context is always released, and then check the error?
cmd/etcd-launcher/main.go
Outdated
// just get a key from etcd, this is how `etcdctl endpoint health` works!
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
_, err := e.client.Get(ctx, "healthy")
cancel()
Nit: maybe use defer cancel() so the context is always released, and then check the error?
return cmd
}

func (e *etcdCluster) getClient() error {
if e.client != nil {
I don't understand this one: when the client is not nil you are returning nil. Shouldn't you return the client here?
client is set in the etcdCluster struct. If it's nil, I create it and assign it to e.client instead of passing it around.
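The lazy-initialization pattern being described might look like this in isolation; the `client` type here is a stand-in for the real clientv3.Client, and the endpoint is made up:

```go
package main

import "fmt"

// client stands in for the real etcd clientv3.Client.
type client struct{ endpoint string }

type etcdCluster struct {
	client *client
}

// getClient lazily initializes and caches the client on the struct:
// callers only check the returned error and then use e.client directly,
// instead of passing a client value around.
func (e *etcdCluster) getClient() error {
	if e.client != nil {
		return nil // already initialized, nothing to do
	}
	e.client = &client{endpoint: "localhost:2379"}
	return nil
}

func main() {
	e := &etcdCluster{}
	fmt.Println(e.getClient(), e.client != nil)
}
```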
cmd/etcd-launcher/main.go
Outdated
// if existing, we need to reconcile
var healthy bool
for i := 0; i < 5; i++ {
So here you are trying to make sure that the storage is healthy by retrying up to 5 times. In general that's valid, of course, but what about adding some semantics, something similar to this:
var (
	err      error
	tries    int
	maxTries = 5
	healthy  bool
)
for tries < maxTries {
	if healthy, err = e.isHealthy(); !healthy || err != nil {
		log.Error("...")
		time.Sleep(500 * time.Millisecond)
		tries++
		continue
	}
	log.Info("...")
	break
}
if !healthy || err != nil {
	log.Fatal("cluster never became healthy")
}
Or you can try out this lovely package:
https://godoc.org/k8s.io/apimachinery/pkg/util/wait#Poll
I think the cluster size field should be .spec.etcd.clusterSize instead of .spec.etcdClusterSize, i.e. have a spec.etcd map in there that can accommodate future etcd-related settings.
for { // reconcile dead members
members, err := e.listMembers()
if err != nil {
time.Sleep(10 * time.Second)
Shouldn't we log the error in the for loop, on this line and the lines underneath it?
cmd/etcd-launcher/main.go
Outdated
continue
}
}
} else { // new etcd member, need to join hte cluster
typo: s/hte/the
cmd/etcd-launcher/main.go
Outdated
}

config.clusterSize = defaultClusterSize
if s := os.Getenv("ECTD_CLUSTER_SIZE"); s != "" {
if config.clusterSize, err = strconv.Atoi(s); err != nil {
return nil, fmt.Errorf("failed to read ECTD_CLUSTER_SIZE: %v", err)
return fmt.Errorf("failed to read ECTD_CLUSTER_SIZE: %v", err)
typo: s/ECTD/ETCD
cmd/etcd-launcher/main.go
Outdated
if config.clusterSize > defaultClusterSize {
return nil, fmt.Errorf("ECTD_CLUSTER_SIZE is smaller then %d", defaultClusterSize)
if config.clusterSize < defaultClusterSize {
return fmt.Errorf("ECTD_CLUSTER_SIZE is smaller then %d", defaultClusterSize)
typo: s/ECTD/ETCD
typo: s/then/than
cmd/etcd-launcher/main.go
Outdated
if err = e.getLocalClient(); err != nil {
return false, err
}
var resp *clientv3.StatusResponse
This can be declared inside the for loop instead, since you are not using it outside of it:
resp, err = e.localClient.Status(context.Background(), e.endpoint())
cmd/etcd-launcher/main.go
Outdated
continue
}
if !healthy {
if _, err := e.client.MemberRemove(context.Background(), member.ID); err != nil {
You don't need the check here, and you are not wrapping the error, so please remove the if statement and return the call directly:
return e.client.MemberRemove(context.Background(), member.ID)
Force-pushed from e95e6f8 to 5780ffd (Compare)
/retest
/retest
cmd/etcd-launcher/main.go
Outdated
"strings"
"time"

kubermaticlog "github.com/kubermatic/kubermatic/pkg/log"
Please regroup the imports respectively:
- Go SDK imports (e.g. "context")
- third-party imports (e.g. "go.uber.org/zap")
- kubermatic/machine-controller imports
- k8s imports
cmd/etcd-launcher/main.go
Outdated
@@ -88,30 +89,40 @@ func main() {
for { // reconcile dead members
members, err := e.listMembers()
if err != nil {
log.Warnf("failed to list memebers: %v ", err)
Please use contextual logging with error wrapping:
log.Warnw("failed to list memebers ", zap.Error(err))
cmd/etcd-launcher/main.go
Outdated
if _, err := os.Stat(etcdCommandPath); os.IsNotExist(err) {
log.Fatalf("can't find etcd command [%s]: %v", etcdCommandPath, err)
Please follow the pattern that we use regarding contextual logs:
log.Fatalw("can't find command","command-path", etcdCommandPath, zap.Error(err))
cmd/etcd-launcher/main.go
Outdated
break
}
// to avoide race conditions, we will run only on the cluster leader
leader, err := e.isLeader()
if err != nil || !leader {
if err != nil {
You don't need to check the error two times; instead, make the log message more general:
log.Warnw("failed to remove member, error occurred or didn't get the current leader", zap.Error(err))
In this case, if there was an error, it will be printed out as an error.
cmd/etcd-launcher/main.go
Outdated
time.Sleep(10 * time.Second)
continue
}
if err := e.removeDeadMembers(); err != nil {
if err := e.removeDeadMembers(log); err != nil {
log.Warnf("failed to remove member: %v", err)
Please adjust.
log.Warnw("failed to remove member", zap.Error(err))
cmd/etcd-launcher/main.go
Outdated
if err = wait.Poll(1*time.Second, 30*time.Second, func() (bool, error) {
return e.isEndpointHealthy(member.PeerURLs[0])
}); err != nil {
log.Infof("member [%s] is not responding, removing from cluster", member.Name)
Please use:
log.Infow("member is not responding, removing from cluster", "member-name", member.Name)
if err != nil {
log.Fatalf("failed to get launcher configuration: %v", err)
}

logOpts := kubermaticlog.NewDefaultOptions()
rawLog := kubermaticlog.New(logOpts.Debug, logOpts.Format)
log := rawLog.Sugar()
This variable collides with the imported "log" package.
The imported package is aliased, so it should be ok?
@@ -82,6 +82,21 @@ func (r *Reconciler) syncHealth(ctx context.Context, cluster *kubermaticv1.Clust
if err != nil {
return err
}
// set ClusterConditionEtcdClusterInitialized, this should be don't only once
I don't quite understand the comment, can you please make it clearer :-)
pkg/resources/etcd/statefulset.go
Outdated
},
// {
If this is not needed can you please remove it?
pkg/resources/etcd/statefulset.go
Outdated
return etcdClusterSize
}
replicas := int(*set.Spec.Replicas)
isEtcdHealthy := data.Cluster().Status.ExtendedHealth.Etcd == kubermaticv1.HealthStatusUp
Can you please move this line to 341, as this line is evaluated and not used if etcdClusterSize == replicas.
/test pre-kubermatic-e2e-aws-flatcar-1.18
/retest
/retest
/approve
LGTM label has been added. Git tree hash: 95798e4abc12a40a47b037e6c9551941f4531c5c
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: moadqassem, moelsayed
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #5545 (Support configurable user cluster etcd cluster size)
Special notes for your reviewer:
Documentation:
Does this PR introduce a user-facing change?: