Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement keepalived load balancer #4344

Merged
merged 12 commits into from
May 7, 2024

Conversation

juanluisvaladas
Copy link
Contributor

@juanluisvaladas juanluisvaladas commented Apr 29, 2024

Description

For easier review I split the PR in 5 commits:
1- Add API types and autogenerated code
2- Controller changes to make keepalived work
3- Updated inttest
4- Encapsulate keepalived config in a new struct
5- Documentations

I also found a typo so I sneaked in a tiny 1 line commit

This is part of #4181

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update

How Has This Been Tested?

  • Manual test
  • Auto test added

Checklist:

  • My code follows the style guidelines of this project
  • My commit messages are signed-off
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules
  • I have checked my code and corrected any misspellings

@@ -232,11 +232,16 @@ func (c *command) start(ctx context.Context) error {
if c.SingleNode {
return errors.New("control plane load balancing cannot be used in a single-node cluster")
}
if cplb.Type != v1beta1.CPLBTypeKeepalived {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO we should do this in config validation already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this is done in a separate commit because rewriting the git history was quite complex...

docs/cplb.md Outdated
@@ -3,17 +3,23 @@
For clusters that don't have an [externally managed load balancer](high-availability.md#load-balancer) for the k0s
control plane, there is another option to get a highly available control plane called control plane load balancing (CPLB).

CPLB allows automatic assigned of predefined IP addresses using VRRP across masters.
CPLB has two features that often will be combined, but normally will be used together: VRRP Instances, which allows
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CPLB has two features that often will be combined, but normally will be used together

This doesn't sound right 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't 😅 . Fixed

docs/cplb.md Outdated
Comment on lines 55 to 56
* If `VirtualServers` are used, the cluster configuration doesn't specify a non-empty
[`spec.api.externalAddress`][specapi]. `VRRPInstances` are compatible.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, I don't really get what this means. Could you rephrase this a bit

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rephrased it, please review it again, I think it's better now but I'm not quite convinced to be honest


func (r *CPLBReconciler) watchAPIServers(ctx context.Context) {
// Before starting check that the API is actually responding
for {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could utilize our internal watch helper here? IMO would be much simpler and that already handles retries etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this as a separate commit. Can be squashed.

pkg/component/controller/cplb_reconciler.go Show resolved Hide resolved
if err := k.configureDummy(); err != nil {
return fmt.Errorf("failed to configure dummy interface: %w", err)
}
if err := k.Config.ValidateVRRPInstances(nil); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, IMO we need to hook validation to general config validation "phase"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it's added as a separate commit because rewriting the git history wasn't easy...

pkg/component/controller/cplb_unix.go Outdated Show resolved Hide resolved
pkg/component/controller/cplb_unix.go Outdated Show resolved Hide resolved
pkg/apis/k0s/v1beta1/cplb.go Outdated Show resolved Hide resolved
@juanluisvaladas juanluisvaladas marked this pull request as draft May 2, 2024 12:03
@juanluisvaladas
Copy link
Contributor Author

Converting to draft while I address the concerns.

pkg/component/controller/cplb_unix.go Show resolved Hide resolved
pkg/apis/k0s/v1beta1/cplb.go Outdated Show resolved Hide resolved
pkg/apis/k0s/v1beta1/cplb.go Show resolved Hide resolved
pkg/apis/k0s/v1beta1/cplb.go Show resolved Hide resolved
pkg/apis/k0s/v1beta1/cplb.go Outdated Show resolved Hide resolved
pkg/component/controller/cplb_reconciler.go Outdated Show resolved Hide resolved
pkg/component/controller/cplb_reconciler.go Show resolved Hide resolved
docs/cplb.md Show resolved Hide resolved
pkg/component/controller/cplb_unix.go Show resolved Hide resolved
pkg/supervisor/supervisor.go Show resolved Hide resolved
Comment on lines 316 to 332
// Wait for the supervisor to start keepalived before
// watching for endpoint changes
process := k.supervisor.GetProcess()
for process == nil {
k.log.Info("Waiting for keepalived to start")
time.Sleep(5 * time.Second)
process = k.supervisor.GetProcess()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might work without this loop, if the nil check would be moved into the update loop.

Copy link
Contributor Author

@juanluisvaladas juanluisvaladas May 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you feel strongly about it, I will change it, but I prefer it this way.

Even if the process is dead, s.GetProcess won't return nil once keepalived is started the first time, it will just return a process with an invalid PID. Adding it into the loop means we're doing a check that will always return false. Also we don't care if the PID is of an older dead process because the PID is obtained AFTER writing the template which means the new process will be using the new file.

But if you feel very strongly about it I'll move the nil check inside, there aren't big consecuences anyway, maybe faster reloads when the cluster is bootstrapping.

Copy link
Member

@twz123 twz123 May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you feel strongly about it, I will change it, but I prefer it this way.

Currently, the loop cannot be cancelled externally. That's why I figured that we could just inline the nil check below. Would be less code, too. I'm fine with keeping it, as long as it can be cancelled.

it will just return a process with an invalid PID

Right. I'd argue that this is a current shortcoming of Supervisor, though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@twz123 I accidentally editted your comment with my reply 🤦

The reply is:

I added a limit of 6 times, that's 30 seconds and should be way more than enough time to start keepalived. It gets cancelled eventually, it's not the prettiest but should fix the problem.

Unfortunatelly I deleted your comment saying that it had to be possible to cancel it from the outside.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunatelly I deleted your comment saying that it had to be possible to cancel it from the outside.

Restored it from the history 🙃

Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <jvaladas@mirantis.com>
@juanluisvaladas juanluisvaladas force-pushed the k0s-lb-2 branch 2 times, most recently from dd92d37 to 7eff376 Compare May 6, 2024 14:56
@juanluisvaladas juanluisvaladas marked this pull request as ready for review May 6, 2024 15:19
pkg/apis/k0s/v1beta1/cplb.go Outdated Show resolved Hide resolved
pkg/apis/k0s/v1beta1/cplb.go Outdated Show resolved Hide resolved
pkg/apis/k0s/v1beta1/cplb.go Outdated Show resolved Hide resolved
pkg/apis/k0s/v1beta1/cplb.go Outdated Show resolved Hide resolved
pkg/apis/k0s/v1beta1/cplb.go Outdated Show resolved Hide resolved
TUNLBKind KeepalivedLBKind = "TUN"
)

type RealServer struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are they? Still not able to spot them 🙈

@@ -108,6 +140,10 @@ func (k *Keepalived) Start(_ context.Context) error {
DataDir: k.K0sVars.DataDir,
UID: k.uid,
}

if len(k.Config.VirtualServers) > 0 {
go k.watchReconcilerUpdates()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot spot it. Can you verify?

Comment on lines 316 to 332
// Wait for the supervisor to start keepalived before
// watching for endpoint changes
process := k.supervisor.GetProcess()
for process == nil {
k.log.Info("Waiting for keepalived to start")
time.Sleep(5 * time.Second)
process = k.supervisor.GetProcess()
}
Copy link
Member

@twz123 twz123 May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you feel strongly about it, I will change it, but I prefer it this way.

Currently, the loop cannot be cancelled externally. That's why I figured that we could just inline the nil check below. Would be less code, too. I'm fine with keeping it, as long as it can be cancelled.

it will just return a process with an invalid PID

Right. I'd argue that this is a current shortcoming of Supervisor, though.

pkg/apis/k0s/v1beta1/cplb.go Outdated Show resolved Hide resolved
@juanluisvaladas
Copy link
Contributor Author

Everything should be addressed now

Copy link
Member

@twz123 twz123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last thing™. I think we can just omit the delay_loop if it's zero.

{{ if gt (len $RealServers) 0 }}
{{ range .VirtualServers }}
virtual_server {{ .IPAddress }} {{ $APIServerPort }} {
delay_loop {{ .DelayLoop.Seconds }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
delay_loop {{ .DelayLoop.Seconds }}
{{- if gt .DelayLoop.Seconds 0.0 }}
delay_loop {{ .DelayLoop.Seconds }}
{{- end }}

Copy link
Contributor Author

@juanluisvaladas juanluisvaladas May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't. In keepalived it defaults to 60. I set it to 0 because I think it doesn't make sense to delay it at all in CPLB because it's added after kubernetes.default.svc is reconciled and hence has passed all the relevant health local health checks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't in keepalived it defaults to 60.

I see. Apparently I found yet another delay_loop in the keepalived codebase 😅

I set it to 0 because I think it doesn't make sense to delay it at all in CPLB because it's added after kubernetes.default.svc is reconciled and hence has passed all the relevant health local health checks.

I don't fully understand the implications, but a delay_loop of 0 is not a thing in keepalived (it will use the default of 60, then):

# /var/lib/k0s/bin/keepalived --dont-fork --use-file /run/k0s/keepalived.conf --no-syslog --log-console --log-detail --dump-conf -t
(/run/k0s/keepalived.conf: Line 33) number '0' outside range [0.000001, 18446744073709.551614]
(/run/k0s/keepalived.conf: Line 33) virtual server delay loop '0' invalid - ignoring

But letz address this in a subsequent PR 😄

pkg/apis/k0s/v1beta1/cplb.go Outdated Show resolved Hide resolved
docs/configuration.md Outdated Show resolved Hide resolved
pkg/apis/k0s/v1beta1/cplb.go Outdated Show resolved Hide resolved
pkg/apis/k0s/v1beta1/cplb.go Outdated Show resolved Hide resolved
juanluisvaladas and others added 10 commits May 7, 2024 16:47
And move validation to clusterconfig.

Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <jvaladas@mirantis.com>
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <jvaladas@mirantis.com>
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <jvaladas@mirantis.com>
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <jvaladas@mirantis.com>
The field wasn't required and didn't serve any actual purpose, so remove
it and auto generate it always.

Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <jvaladas@mirantis.com>
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <jvaladas@mirantis.com>
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <jvaladas@mirantis.com>
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <jvaladas@mirantis.com>
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <jvaladas@mirantis.com>
Co-authored-by: Tom Wieczorek <twz123@users.noreply.github.com>
Signed-off-by: Juan-Luis de Sousa-Valadas Castaño <jvaladas@mirantis.com>
@juanluisvaladas juanluisvaladas merged commit bb42d03 into k0sproject:main May 7, 2024
78 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants