
Swap protokube's gossip implementation from weaveworks/mesh to memberlist #7436

Closed
jacksontj opened this issue Aug 19, 2019 · 16 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@jacksontj
Contributor

1. Describe IN DETAIL the feature/behavior/change you would like to see.
I just finished spending a few days looking into some weird scale issues with weaveworks/mesh (#7427) -- and after doing so I see a LOT of problems that make me question its viability as a kops component. Specifically, the issue I've hit is that it starts to struggle once you hit ~200 nodes in the cluster -- which isn't all that large. Other projects have actually moved off of mesh (e.g. Alertmanager).

I imagine that we'd need to actually add support for both and have flags to swap between gossip implementations, but that is all doable. So if this is something people are open to, I would be up for spending some time to make it happen.
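
For a quick sense of what the memberlist side involves, here's a minimal hashicorp/memberlist sketch in Go -- the node name, port, and peer address are made up, and this is just the library usage, not the actual protokube wiring:

package main

import (
    "fmt"
    "log"

    "github.com/hashicorp/memberlist"
)

func main() {
    // DefaultLANConfig is tuned for LAN-sized clusters; protokube would
    // derive these values from its own flags/config instead.
    cfg := memberlist.DefaultLANConfig()
    cfg.Name = "node-a"      // must be unique per node
    cfg.BindAddr = "0.0.0.0"
    cfg.BindPort = 4000      // example port only, not a kops default

    list, err := memberlist.Create(cfg)
    if err != nil {
        log.Fatalf("failed to start memberlist: %v", err)
    }

    // Join via any existing peer; the rest of the membership is gossiped.
    if _, err := list.Join([]string{"10.0.0.1:4000"}); err != nil {
        log.Printf("join failed (fine if this is the first node): %v", err)
    }

    for _, m := range list.Members() {
        fmt.Printf("member %s at %s\n", m.Name, m.Addr)
    }
}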

@jacksontj
Contributor Author

I have a WIP implementation on my branch (which is based on the fork we use) -- https://github.com/wish/kops/compare/release-1.12_fork...jacksontj:gossip_dns?expand=1

As of now you can create a cluster with the new gossip setup, and I have tested up to 800 peers in the cluster with <3% CPU utilization (compared to 60-100%) and ~60MB of RAM (compared to ~4GB).

At this point the main piece missing is the config plumbing. Since this is an entirely different gossip protocol, it needs to run on different ports etc. So I'm thinking the easiest mechanism would be to add a flag for which gossip implementation to use (we'd probably have to keep the default on mesh for now -- not sure how we'd change a default like that).

As for migrating a cluster, we have 2 options: (1) we make protokube spawn N gossips from config -- so you could add the second and then remove the first -- or (2) we just document a somewhat manual procedure where you start protokube on the box a second time with different flags to do the migration. I imagine the first is preferable -- but it is significantly more work.
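
To make option (1) a little more concrete, I'm picturing roughly this shape -- purely illustrative, none of these are real protokube types or functions:

// Illustrative only -- not the actual protokube types or function names.
package main

import (
    "fmt"
    "time"
)

type gossipSpec struct {
    Protocol string // "mesh" or "memberlist"
    Listen   string // e.g. "0.0.0.0:3999"
}

// runGossips starts every configured gossip backend in its own goroutine;
// in the real thing, only the primary spec's view would get wired into
// /etc/hosts, while the secondary just keeps the other protocol alive.
func runGossips(specs []gossipSpec, start func(gossipSpec) error) error {
    errs := make(chan error, len(specs))
    for _, s := range specs {
        s := s
        go func() { errs <- start(s) }()
    }
    return <-errs // block until any backend exits
}

func main() {
    specs := []gossipSpec{
        {Protocol: "mesh", Listen: "0.0.0.0:3999"},
        {Protocol: "memberlist", Listen: "0.0.0.0:4000"},
    }
    // Stand-in for starting a real gossip; just pretends to run forever.
    fake := func(s gossipSpec) error {
        fmt.Println("starting", s.Protocol, "on", s.Listen)
        time.Sleep(time.Hour)
        return nil
    }
    _ = runGossips(specs, fake)
}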

Any feedback would be greatly appreciated :)

@zetaab
Member

zetaab commented Aug 24, 2019

@justinsb what is your opinion on this? Do you see any possible problems?

@jacksontj
Contributor Author

After thinking some more, I think I'll have to make protokube support 2 gossips at a time. I'm thinking basically of adding the following flags (names aren't set, just conveying the idea):

  • primary-gossip-type: define mesh or memberlist as the primary mechanism, meaning it will get wired up to /etc/hosts
  • secondary-gossip-type: optional flag to enable a second gossip (to enable switching)
  • (primary|secondary)-gossip-port: set ports separately

This way the switch would be: (1) add the second gossip to masters, (2) switch the primary gossip on masters, (3) swap the nodes' primary, (4) remove the secondary from masters.

The alternatives all seem to end up requiring a lot of manual hand-holding, which would make the upgrade process more painful. Unfortunately this approach adds a bunch more options, but it is probably more likely to get people to upgrade.
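
For illustration, the flag plumbing itself could be as small as this -- the flag names are the placeholders from above and explicitly not final:

package main

import (
    "flag"
    "fmt"
)

func main() {
    // Placeholder flag names from the proposal above -- not the final names.
    primaryType := flag.String("primary-gossip-type", "mesh", "primary gossip implementation (mesh|memberlist); its view is wired up to /etc/hosts")
    secondaryType := flag.String("secondary-gossip-type", "", "optional second gossip implementation, used while switching")
    primaryPort := flag.Int("primary-gossip-port", 3999, "listen port for the primary gossip")
    secondaryPort := flag.Int("secondary-gossip-port", 4000, "listen port for the secondary gossip")
    flag.Parse()

    fmt.Printf("primary=%s:%d secondary=%s:%d\n",
        *primaryType, *primaryPort, *secondaryType, *secondaryPort)
}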

@shrinandj
Contributor

@jacksontj, I see that the implementation of this is checked in. The protokube config options are gossip-protocol and gossip-protocol-secondary.

Can you list the exact set of steps that could be used to migrate an existing cluster from mesh to memberlist?

I can test this for you and report the results here.

@jacksontj
Contributor Author

Definitely -- I was planning on writing up some docs, but as you have noticed, I haven't had the time to do that nicely yet :)

So here are the raw notes I used when upgrading:

// Step 1: Enable double gossip
// apply to masters, then apply to dns-controller
{
  "gossipConfig": {
    "protocol": "mesh",
    "listen": "0.0.0.0:3999",
    "secondary": {
      "protocol": "memberlist",
      "listen": "0.0.0.0:4000"
    }
  },
  "dnsControllerGossipConfig": {
    "protocol": "mesh",
    "listen": "0.0.0.0:3998",
    "seed": "127.0.0.1:3999",
    "secondary": {
      "protocol": "memberlist",
      "listen": "0.0.0.0:3993"
    }
  }
}

// Step 2: swap primary gossip
// apply to masters, dns-controller, nodes
{
  "gossipConfig": {
    "protocol": "memberlist",
    "listen": "0.0.0.0:4000",
    "secondary": {
      "protocol": "mesh",
      "listen": "0.0.0.0:3999"
    }
  },
  "dnsControllerGossipConfig": {
    "protocol": "memberlist",
    "listen": "0.0.0.0:3993",
    "seed": "127.0.0.1:4000",
    "secondary": {
      "protocol": "mesh",
      "listen": "0.0.0.0:3998"
    }
  }
}

// Step 3: remove mesh
// apply to nodes, masters, dns-controller
{
  "gossipConfig": {
    "protocol": "memberlist",
    "listen": "0.0.0.0:4000"
  },
  "dnsControllerGossipConfig": {
    "protocol": "memberlist",
    "listen": "0.0.0.0:3993",
    "seed": "127.0.0.1:4000"
  }
}
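
In case it's useful, here's a rough Go mirror of that config shape for sanity-checking the JSON before rolling it out -- the field names are inferred from the snippets above and may not match the actual kops/protokube types exactly:

package main

import (
    "encoding/json"
    "fmt"
    "os"
)

// Inferred from the snippets above -- not necessarily the exact kops types.
type gossip struct {
    Protocol  string  `json:"protocol"`
    Listen    string  `json:"listen"`
    Seed      string  `json:"seed,omitempty"`
    Secondary *gossip `json:"secondary,omitempty"`
}

type config struct {
    GossipConfig              *gossip `json:"gossipConfig"`
    DNSControllerGossipConfig *gossip `json:"dnsControllerGossipConfig"`
}

func main() {
    // Usage (hypothetical filenames): go run check.go < step3.json
    var c config
    if err := json.NewDecoder(os.Stdin).Decode(&c); err != nil {
        fmt.Fprintln(os.Stderr, "invalid gossip config:", err)
        os.Exit(1)
    }
    fmt.Println("primary gossip protocol:", c.GossipConfig.Protocol)
}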

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 28, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 27, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@olemarkus
Member

/reopen
/remove-lifecycle rotten
/kind office-hours

@jacksontj are you still around and want to get this into shape? :)

@k8s-ci-robot
Contributor

@olemarkus: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten
/kind office-hours

@jacksontj are you still around and want to get this into shape? :)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this Sep 19, 2022
@k8s-ci-robot removed the lifecycle/rotten label Sep 19, 2022
@justinsb
Member

justinsb commented Dec 2, 2022

We discussed this in office hours. The suggestion is to investigate whether we can simplify the gossip stack by using an approach inspired by the no-DNS work: #14711

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 13, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 13, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned Jun 12, 2023
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
