refactor cluster controller #3380
Conversation
/cc @zryfish
Force-pushed from da42065 to 81aa760 (Compare)
Codecov Report

@@            Coverage Diff             @@
##           master    #3380      +/-   ##
==========================================
+ Coverage   11.87%   11.89%   +0.02%
==========================================
  Files         226      226
  Lines       42658    42605      -53
==========================================
+ Hits         5065     5068       +3
+ Misses      36809    36757      -52
+ Partials      784      780       -4
==========================================

Flags with carried forward coverage won't be shown.
Continue to review the full report at Codecov.
@@ -348,6 +345,80 @@ func (c *clusterController) reconcileHostCluster() error {
	return err
}

func (c *clusterController) judgeIfClusterIsReady() error {
How about changing the name to probeClusters?
Agreed.
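For context, here is a minimal sketch of what such a probe (whatever its final name) could look like, assuming the controller keeps each member cluster's kubeconfig and builds a short-lived client from it. The package, function and parameter names are illustrative, not the actual implementation:

package cluster

import (
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// probeCluster treats a successful round trip to the member cluster's
// kube-apiserver as "ready". The kubeconfig bytes come from the Cluster object.
func probeCluster(kubeconfig []byte, timeout time.Duration) error {
	config, err := clientcmd.RESTConfigFromKubeConfig(kubeconfig)
	if err != nil {
		return err
	}
	// Bound the probe so one unreachable cluster cannot stall the whole pass.
	config.Timeout = timeout

	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		return err
	}
	// ServerVersion is a cheap request that exercises TLS, auth and connectivity.
	_, err = clientset.Discovery().ServerVersion()
	return err
}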
	klog.Error(err)
	continue
}
config.Timeout = 10 * time.Second
What if there are lots of clusters, say 50, and each cluster takes 9s to finish probing? That would be 450s, longer than the resyncPeriod.
10 seconds seems too long. But the case you are describing is very rare; it's unlikely that every cluster connection takes 9s. In most cases a connection takes several milliseconds. How about changing the timeout to 3s?
If there are network issues on the node where the ks-controller-manager pod resides, it's possible.
What I did before was put the cluster back into the work queue every resyncPeriod and check its readiness in the main sync loop.
I didn't see "put cluster back to working queue every resyncPeriod", but I did see "check its readiness on main sync loop". I think we don't need to put the cluster back into the work queue every resyncPeriod manually; the cluster informer does that automatically. The reason I check the cluster readiness separately is to check the readiness of all clusters, not only the proxy connection. What you did before only checks whether the proxied cluster has an agent-available status, then updates the cluster status to ready or not. By using the kubeconfig, I think it's more reliable (e.g. it also catches a direct connection whose kube-apiserver is unreachable).
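To illustrate the kind of readiness check being described, here is a hypothetical sketch of how a probe result could be turned into a Ready-style condition, regardless of whether the cluster connects directly or through the agent proxy. The condition type and reason strings are assumptions; the real Cluster API may define its own condition type:

package cluster

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// clusterReadyCondition maps the outcome of a kubeconfig-based probe onto a
// Ready condition: a reachable kube-apiserver means ready, anything else
// records the error for operators to inspect.
func clusterReadyCondition(probeErr error) metav1.Condition {
	cond := metav1.Condition{
		Type:               "Ready",
		Status:             metav1.ConditionTrue,
		Reason:             "KubeAPIServerReachable",
		LastTransitionTime: metav1.Now(),
	}
	if probeErr != nil {
		cond.Status = metav1.ConditionFalse
		cond.Reason = "KubeAPIServerUnreachable"
		cond.Message = probeErr.Error()
	}
	return cond
}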
If it is put in the main sync loop, the check will run whenever a cluster is created/updated/deleted, which may be too frequent. On the other hand, the check may take too long and affect the sync loop. What is your suggestion?
That's true, but we need to update cluster.status.configz every resyncPeriod too. So I suggest making config.timeout shorter and probing in the main loop.
Currently we update cluster.status.configz every resyncPeriod at the end of the main loop. That update didn't change.
OK, better to make config.timeout shorter.
config.Timeout has been set to 3s by default. We can merge this PR now.
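For reference, a sketch of the agreed default, assuming a helper that builds the REST config from the stored kubeconfig; the helper name and the choice to override only an unset timeout are assumptions:

package cluster

import (
	"time"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// defaultProbeTimeout caps how long a single cluster probe may take so that a
// full pass over many clusters stays well under the resync period.
const defaultProbeTimeout = 3 * time.Second

// restConfigForProbe builds a client config from kubeconfig bytes and applies
// the 3s default only when the kubeconfig does not set its own timeout.
func restConfigForProbe(kubeconfig []byte) (*rest.Config, error) {
	config, err := clientcmd.RESTConfigFromKubeConfig(kubeconfig)
	if err != nil {
		return nil, err
	}
	if config.Timeout == 0 {
		config.Timeout = defaultProbeTimeout
	}
	return config, nil
}

With this bound, the worst case from the earlier example drops from 50 x 9s = 450s to 50 x 3s = 150s for a sequential pass.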
@@ -79,5 +79,5 @@ func (o *Options) AddFlags(fs *pflag.FlagSet, s *Options) {
		"This field is used when generating deployment yaml for agent.")

	fs.DurationVar(&o.ClusterControllerResyncSecond, "cluster-controller-resync-second", s.ClusterControllerResyncSecond,
-		"Cluster controller resync second to sync cluster resource.")
+		"Cluster controller resync second to sync cluster resource. e.g. 30s 60s 120s...")
Better to start with 2m, 5m, 10m; a small resync period increases load.
Agree with that. I will update the comment.
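As a usage sketch of the duration-valued flag, here is a hypothetical way the resync period could drive a periodic probe loop outside the event-driven sync handler; the wiring and names are illustrative, not how the controller is actually structured:

package cluster

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/klog"
)

// runProbeLoop re-runs the cluster probe every resyncPeriod (e.g. 2m, 5m or 10m,
// as parsed from the --cluster-controller-resync-second duration flag) until
// stopCh is closed.
func runProbeLoop(probe func() error, resyncPeriod time.Duration, stopCh <-chan struct{}) {
	wait.Until(func() {
		if err := probe(); err != nil {
			klog.Error(err)
		}
	}, resyncPeriod, stopCh)
}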
Signed-off-by: yuswift <yuswiftli@yunify.com>
Force-pushed from 81aa760 to 194d054 (Compare)
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yuswift, zryfish

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Signed-off-by: yuswift <yuswiftli@yunify.com>
What type of PR is this?
/kind design
What this PR does / why we need it:
Reduce the complexity between the tower server and the cluster-controller. Remove the
- port allocation
- proxy creation
- token generation
steps, and add a cluster ready detection step.

Which issue(s) this PR fixes:
Fixes #3234