
enable concurrency for pp and cpp #3511

Merged: 1 commit into karmada-io:master on May 17, 2023

Conversation

@zach593 (Contributor) commented May 10, 2023

What type of PR is this?

/kind feature

What this PR does / why we need it:

Recently, we encountered a slow restart issue where it took 15 minutes to clear the items in queues. We suspect that the number of workers in pp/cpp could be one of the reasons for this problem. Therefore, it would be reasonable to add a concurrency configuration for this.

Although the pp/cpp reconcile function appears to be capable of handling concurrency, the number of workers is hardcoded as 1. Please let me know if there is any particular reason for this.
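For readers skimming the diff later in the thread: the change boils down to replacing a hardcoded worker count with a configurable one when draining the reconcile queue. A minimal, self-contained sketch of the general pattern (illustrative only, not Karmada's actual code):

```go
package main

import (
	"fmt"
	"sync"
)

// reconcile stands in for the PP/CPP reconcile logic; it must be safe
// to call from multiple goroutines for added concurrency to help.
func reconcile(key string) {
	fmt.Println("reconciled", key)
}

// runWorkers drains the queue with the given number of workers.
// With workers == 1 this behaves like the previously hardcoded setup.
func runWorkers(queue <-chan string, workers int) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range queue {
				reconcile(key)
			}
		}()
	}
	wg.Wait()
}

func main() {
	queue := make(chan string, 3)
	for _, key := range []string{"default/pp-nginx", "default/pp-redis", "cpp-cluster-wide"} {
		queue <- key
	}
	close(queue)
	// e.g. --concurrent-propagation-policy-syncs=2
	runWorkers(queue, 2)
}
```

With `runWorkers(queue, 1)` the behavior matches the old hardcoded setup; raising the count only helps if `reconcile` is safe to run concurrently, which the PR description argues it is.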

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

`karmada-controller-manager`: Introduced `--concurrent-propagation-policy-syncs`/`--concurrent-cluster-propagation-policy-syncs` flags to specify concurrent syncs for PropagationPolicy and ClusterPropagationPolicy.

@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label May 10, 2023
@karmada-bot karmada-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label May 10, 2023
@codecov-commenter commented May 10, 2023

Codecov Report

Merging #3511 (cac02e0) into master (17eae7d) will increase coverage by 0.00%.
The diff coverage is 0.00%.


@@           Coverage Diff           @@
##           master    #3511   +/-   ##
=======================================
  Coverage   52.64%   52.64%           
=======================================
  Files         213      213           
  Lines       19581    19583    +2     
=======================================
+ Hits        10308    10310    +2     
  Misses       8721     8721           
  Partials      552      552           
| Flag | Coverage Δ |
|---|---|
| unittests | 52.64% <0.00%> (+<0.01%) ⬆️ |


| Impacted Files | Coverage Δ |
|---|---|
| cmd/controller-manager/app/options/options.go | 0.00% <0.00%> (ø) |
| pkg/detector/detector.go | 0.00% <0.00%> (ø) |

... and 1 file with indirect coverage changes

@XiShanYongYe-Chang (Member) left a comment

@zach593, thanks for your feedback; I think it's a reasonable update.

cmd/controller-manager/app/controllermanager.go (review thread outdated, resolved)
Signed-off-by: zach593 <zach_li@outlook.com>
@XiShanYongYe-Chang (Member) left a comment

/lgtm

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label May 10, 2023
@XiShanYongYe-Chang (Member) commented

Hi @zach593, can you help add a release note to describe the update?

@RainbowMango (Member) left a comment

/assign
I'll take a look tomorrow.

@@ -202,6 +206,8 @@ func (o *Options) AddFlags(flags *pflag.FlagSet, allControllers, disabledByDefau
flags.IntVar(&o.ConcurrentResourceBindingSyncs, "concurrent-resourcebinding-syncs", 5, "The number of ResourceBindings that are allowed to sync concurrently.")
flags.IntVar(&o.ConcurrentWorkSyncs, "concurrent-work-syncs", 5, "The number of Works that are allowed to sync concurrently.")
flags.IntVar(&o.ConcurrentNamespaceSyncs, "concurrent-namespace-syncs", 1, "The number of Namespaces that are allowed to sync concurrently.")
flags.IntVar(&o.ConcurrentPropagationPolicySyncs, "concurrent-propagation-policy-syncs", 1, "The number of PropagationPolicy that are allowed to sync concurrently.")
flags.IntVar(&o.ConcurrentClusterPropagationPolicySyncs, "concurrent-cluster-propagation-policy-syncs", 1, "The number of ClusterPropagationPolicy that are allowed to sync concurrently.")
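Roughly, the intent behind the diff above is: each flag fills an Options field, and the detector uses that value as the worker count instead of the previously hardcoded 1. A hedged, self-contained sketch of that plumbing (the AsyncWorker interface, field names, and Start method here are simplified stand-ins, not necessarily Karmada's exact API):

```go
package main

import "fmt"

// Options mirrors the controller-manager options populated by AddFlags above
// (simplified; the real struct has many more fields).
type Options struct {
	ConcurrentPropagationPolicySyncs        int
	ConcurrentClusterPropagationPolicySyncs int
}

// AsyncWorker is a stand-in for the worker abstraction used by the detector.
type AsyncWorker interface {
	Run(workerNumber int, stopCh <-chan struct{})
}

type fakeWorker struct{ name string }

func (w fakeWorker) Run(workerNumber int, stopCh <-chan struct{}) {
	fmt.Printf("%s started with %d workers\n", w.name, workerNumber)
}

// ResourceDetector is a simplified stand-in for the PP/CPP reconciler host.
type ResourceDetector struct {
	ConcurrentPropagationPolicySyncs        int
	ConcurrentClusterPropagationPolicySyncs int
	policyReconcileWorker                   AsyncWorker
	clusterPolicyReconcileWorker            AsyncWorker
}

func (d *ResourceDetector) Start(stopCh <-chan struct{}) {
	// Before this PR the worker number here was effectively hardcoded to 1.
	d.policyReconcileWorker.Run(d.ConcurrentPropagationPolicySyncs, stopCh)
	d.clusterPolicyReconcileWorker.Run(d.ConcurrentClusterPropagationPolicySyncs, stopCh)
}

func main() {
	opts := Options{ConcurrentPropagationPolicySyncs: 1, ConcurrentClusterPropagationPolicySyncs: 1}
	d := &ResourceDetector{
		ConcurrentPropagationPolicySyncs:        opts.ConcurrentPropagationPolicySyncs,
		ConcurrentClusterPropagationPolicySyncs: opts.ConcurrentClusterPropagationPolicySyncs,
		policyReconcileWorker:                   fakeWorker{name: "propagationPolicy reconciler"},
		clusterPolicyReconcileWorker:            fakeWorker{name: "clusterPropagationPolicy reconciler"},
	}
	d.Start(make(chan struct{}))
}
```

Keeping two separate fields matches the two flags, which is what allows PP and CPP worker counts to be tuned independently, as discussed below.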
Member

Do CPP and PP need to be configured separately? @XiShanYongYe-Chang

Member

For users, the usage of PP and CPP may differ, so separate control is a little more fine-grained.

@@ -202,6 +206,8 @@ func (o *Options) AddFlags(flags *pflag.FlagSet, allControllers, disabledByDefau
flags.IntVar(&o.ConcurrentResourceBindingSyncs, "concurrent-resourcebinding-syncs", 5, "The number of ResourceBindings that are allowed to sync concurrently.")
flags.IntVar(&o.ConcurrentWorkSyncs, "concurrent-work-syncs", 5, "The number of Works that are allowed to sync concurrently.")
flags.IntVar(&o.ConcurrentNamespaceSyncs, "concurrent-namespace-syncs", 1, "The number of Namespaces that are allowed to sync concurrently.")
flags.IntVar(&o.ConcurrentPropagationPolicySyncs, "concurrent-propagation-policy-syncs", 1, "The number of PropagationPolicy that are allowed to sync concurrently.")
Member

If it's slow by default, I'd prefer a bigger default value, like 3?

Contributor Author

Before this pull request, the concurrency was hardcoded to 1. I believe it would be better to keep the default the same as before the change, so that users who do not set this flag see no behavior change. If you think it's necessary, we can increase this value.

@RainbowMango (Member) left a comment

I'd say it makes sense to add a flag to specify the concurrency. Before looking into the code, I'd like to ask some questions to better understand this issue.

Recently, we encountered a slow restart issue where it took 15 minutes to clear the items in queues.

I wonder which queue we are talking about here, and how do you know it is this queue that is blocking startup? Are there any logs or metrics?
How many resource templates are in your system?

@zishen (Member) left a comment

Good to add concurrency configuration for pp and cpp.

@Poor12 (Member) commented May 12, 2023

Hi @zach593, you can get metrics from karmada-controller-manager to show that the work queue has a heavy backlog of items, e.g.:

workqueue_adds_total{name="propagationPolicy reconciler"} 
workqueue_depth{name="propagationPolicy reconciler"}
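If Prometheus scrapes karmada-controller-manager, hedged example queries (assuming the standard workqueue metric names above and that the reconciler label matches your setup) would show the add rate per minute alongside the backlog:

rate(workqueue_adds_total{name="propagationPolicy reconciler"}[5m]) * 60
max_over_time(workqueue_depth{name="propagationPolicy reconciler"}[5m])

The depth returning to zero marks when the restart backlog has been cleared.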

@RainbowMango (Member) commented

Good to add concurrency configuration for pp and cpp

Yes, I totally agree with allowing people to specify the concurrency.

But I'm curious about the root cause of the issue:

Recently, we encountered a slow restart issue where it took 15 minutes to clear the items in queues.

We need to identify the root cause of the problem, develop solutions to fix it, and ultimately be able to verify it.

@zach593 (Contributor, Author) commented May 15, 2023

[Grafana screenshot]

This is monitoring data on Grafana from the last time we restarted the karmada-controller-manager. You can see the PP reconciler took about 3-4 minutes to clear the queue items.

The PP/CPP reconciler is upstream of the resource detector, and the output (workload template) of the upstream reconciler (PP/CPP) is added to the downstream (resource detector) queue. Therefore, if the upstream reconciler works slowly, the downstream will need to reconcile some items twice. The same situation can also occur between detector -> RB/CRB and RB/CRB -> Work, so the slow processing speed of the upstream amplifies the number of reconciles the downstream needs to do.

Therefore, adding concurrency for PP/CPP is reasonable. It will reduce the total number of items that karmada-controller-manager needs to reconcile after a restart.

@RainbowMango @Poor12

@zach593 (Contributor, Author) commented May 15, 2023

Hi @zach593, can you help add a release note to describe the update?

User can control the concurrency of PropagationPolicyWorker and ClusterPolicyReconcileWorker by using the flags `concurrent-propagation-policy-syncs` and `concurrent-cluster-propagation-policy-syncs`.

done @XiShanYongYe-Chang

@RainbowMango (Member) commented

The Grafana graph is awesome!

Have you tested this patch? Has the consumption speed of the item queue improved? It would be great if we had two graphs from which we can see the difference in their effects.

@XiShanYongYe-Chang (Member) commented

Hi @zach593, can you help add a release note to describe the update?

User can control the concurrency of PropagationPolicyWorker and ClusterPolicyReconcileWorker by using the flags `concurrent-propagation-policy-syncs` and `concurrent-cluster-propagation-policy-syncs`.

done @XiShanYongYe-Chang

How about updating it like this:

karmada-controller-manager: add `concurrent-propagation-policy-syncs`, `concurrent-cluster-propagation-policy-syncs` flags to adjust pp/cpp syncs speed.

@RainbowMango RainbowMango added this to the v1.6 milestone May 16, 2023
@zach593 (Contributor, Author) commented May 16, 2023

Have you tested this patch? Has the consumption speed of the item queue improved? It would be great if we had two graphs from which we can see the difference in their effects.

Yes, we have tested it, but the monitoring data is not convincing enough yet.

Before the patch:
[Grafana screenshot]

Patched (10 workers):
[Grafana screenshot]

It seems reconciles per minute have increased from 3-4k to 5.6k.

Because the CPU was already saturated, the effect of this patch may not have been fully realized. And because the monitoring system's service discovery is not timely, some monitoring data is missing.

@zach593 (Contributor, Author) commented May 16, 2023

How about updating it like this:

karmada-controller-manager: add `concurrent-propagation-policy-syncs`, `concurrent-cluster-propagation-policy-syncs` flags to adjust pp/cpp syncs speed.

Cool, done.

@RainbowMango (Member) left a comment

/approve

Many thanks to @zach593; we can go with the default concurrency (1), and I appreciate all your effort on it. If you have any evidence that we need to increase the concurrency, please feel free to send a PR. Thanks in advance. :)

@karmada-bot (Collaborator) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RainbowMango

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 17, 2023
@karmada-bot karmada-bot merged commit 0789c83 into karmada-io:master May 17, 2023
11 checks passed