
Swap protokube's gossip implementation from weaveworks/mesh to memberlist #7436

Closed
jacksontj opened this issue Aug 19, 2019 · 16 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@jacksontj
Contributor

1. Describe IN DETAIL the feature/behavior/change you would like to see.
I just finished spending a few days looking into some weird scale issues with weaveworks/mesh (#7427) -- and after doing so I see a LOT of problems that make me question its viability as a kops component. Specifically, the issue I've hit is that it starts to struggle once you hit ~200 nodes in the cluster -- which isn't all that large. Other projects have actually moved off of mesh (e.g. Alertmanager).

I imagine that we'd need to actually add support for both and have flags to swap between gossip implementations, but that is all doable. So if this is something people are open to, I would be up for spending some time to make it happen.
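
For a quick sense of what the memberlist side involves, here's a minimal hashicorp/memberlist sketch in Go -- the node name, port, and peer address are made up, and this is just the library usage, not the actual protokube wiring:

package main

import (
    "fmt"
    "log"

    "github.com/hashicorp/memberlist"
)

func main() {
    // DefaultLANConfig is tuned for LAN-sized clusters; protokube would
    // derive these values from its own flags/config instead.
    cfg := memberlist.DefaultLANConfig()
    cfg.Name = "node-a"      // must be unique per node
    cfg.BindAddr = "0.0.0.0"
    cfg.BindPort = 4000      // example port only, not a kops default

    list, err := memberlist.Create(cfg)
    if err != nil {
        log.Fatalf("failed to start memberlist: %v", err)
    }

    // Join via any existing peer; the rest of the membership is gossiped.
    if _, err := list.Join([]string{"10.0.0.1:4000"}); err != nil {
        log.Printf("join failed (fine if this is the first node): %v", err)
    }

    for _, m := range list.Members() {
        fmt.Printf("member %s at %s\n", m.Name, m.Addr)
    }
}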

@jacksontj
Contributor Author

I have a WIP implementation on my branch (which is based on the fork we use) -- https://github.com/wish/kops/compare/release-1.12_fork...jacksontj:gossip_dns?expand=1

As of now you can create a cluster with the new gossip setup, and I have tested up to 800 peers in the cluster with <3% CPU utilization (compared to 60-100%) and ~60MB of RAM (compared to ~4GB).

At this point the main piece missing is the config plumbing. Since this is an entirely different gossip protocol, it needs to run on different ports etc. So I'm thinking the easiest mechanism would be to add a flag for which gossip implementation to use (we'd probably have to keep the default on mesh for now -- not sure how we'd change a default like that).

As for migrating a cluster, we have 2 options: (1) we make protokube spawn N gossips from config -- so you could add the second and then remove the first -- or (2) we just document a somewhat manual procedure where you start protokube on the box a second time with different flags to do the migration. I imagine the first is preferable -- but it is significantly more work.
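
To make option (1) a little more concrete, I'm picturing roughly this shape -- purely illustrative, none of these are real protokube types or functions:

// Illustrative only -- not the actual protokube types or function names.
package main

import (
    "fmt"
    "time"
)

type gossipSpec struct {
    Protocol string // "mesh" or "memberlist"
    Listen   string // e.g. "0.0.0.0:3999"
}

// runGossips starts every configured gossip backend in its own goroutine;
// in the real thing, only the primary spec's view would get wired into
// /etc/hosts, while the secondary just keeps the other protocol alive.
func runGossips(specs []gossipSpec, start func(gossipSpec) error) error {
    errs := make(chan error, len(specs))
    for _, s := range specs {
        s := s
        go func() { errs <- start(s) }()
    }
    return <-errs // block until any backend exits
}

func main() {
    specs := []gossipSpec{
        {Protocol: "mesh", Listen: "0.0.0.0:3999"},
        {Protocol: "memberlist", Listen: "0.0.0.0:4000"},
    }
    // Stand-in for starting a real gossip; just pretends to run forever.
    fake := func(s gossipSpec) error {
        fmt.Println("starting", s.Protocol, "on", s.Listen)
        time.Sleep(time.Hour)
        return nil
    }
    _ = runGossips(specs, fake)
}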

Any feedback would be greatly appreciated :)

@zetaab
Member

zetaab commented Aug 24, 2019

@justinsb what is your opinion on this? Do you see any possible problems?

@jacksontj
Contributor Author

After thinking some more, I think I'll have to make protokube support 2 gossips at a time. I'm thinking basically of adding the following flags (names aren't set, just conveying the idea):

  • primary-gossip-type: define mesh or memberlist as the primary mechanism, meaning it will get wired up to /etc/hosts
  • secondary-gossip-type: optional flag to enable a second gossip (to enable switching)
  • (primary|secondary)-gossip-port: set ports separately

This way the switch would be: (1) add the second gossip to masters, (2) switch the primary gossip on masters, (3) swap the nodes' primary, (4) remove the secondary from masters.

The alternatives all seem to end up requiring a lot of manual hand-holding, which would make the upgrade process more painful. Unfortunately this approach adds a bunch more options, but it is probably more likely to get people to upgrade.
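
For illustration, the flag plumbing itself could be as small as this -- the flag names are the placeholders from above and explicitly not final:

package main

import (
    "flag"
    "fmt"
)

func main() {
    // Placeholder flag names from the proposal above -- not the final names.
    primaryType := flag.String("primary-gossip-type", "mesh", "primary gossip implementation (mesh|memberlist); its view is wired up to /etc/hosts")
    secondaryType := flag.String("secondary-gossip-type", "", "optional second gossip implementation, used while switching")
    primaryPort := flag.Int("primary-gossip-port", 3999, "listen port for the primary gossip")
    secondaryPort := flag.Int("secondary-gossip-port", 4000, "listen port for the secondary gossip")
    flag.Parse()

    fmt.Printf("primary=%s:%d secondary=%s:%d\n",
        *primaryType, *primaryPort, *secondaryType, *secondaryPort)
}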

@shrinandj
Contributor

@jacksontj, I see that the implementation of this is checked in. The protokube config options are gossip-protocol and gossip-protocol-secondary.

Can you list the exact set of steps that could be used to migrate an existing cluster from mesh to memberlist?

I can test this for you and report the results here.

@jacksontj
Contributor Author

Definitely -- I was planning on writing up some docs, but as you have noticed, I haven't had the time to do that nicely yet :)

So here are the raw notes I used when upgrading:

// Step 1: Enable double gossip
// apply to masters, then apply to dns-controller
{
  "gossipConfig": {
    "protocol": "mesh",
    "listen": "0.0.0.0:3999",
    "secondary": {
      "protocol": "memberlist",
      "listen": "0.0.0.0:4000"
    }
  },
  "dnsControllerGossipConfig": {
    "protocol": "mesh",
    "listen": "0.0.0.0:3998",
    "seed": "127.0.0.1:3999",
    "secondary": {
      "protocol": "memberlist",
      "listen": "0.0.0.0:3993"
    }
  }
}

// Step 2: swap primary gossip
// apply to masters, dns-controller, nodes
{
  "gossipConfig": {
    "protocol": "memberlist",
    "listen": "0.0.0.0:4000",
    "secondary": {
      "protocol": "mesh",
      "listen": "0.0.0.0:3999"
    }
  },
  "dnsControllerGossipConfig": {
    "protocol": "memberlist",
    "listen": "0.0.0.0:3993",
    "seed": "127.0.0.1:4000",
    "secondary": {
      "protocol": "mesh",
      "listen": "0.0.0.0:3998"
    }
  }
}

// Step 3: remove mesh
// apply to nodes, masters, dns-controller
{
  "gossipConfig": {
    "protocol": "memberlist",
    "listen": "0.0.0.0:4000"
  },
  "dnsControllerGossipConfig": {
    "protocol": "memberlist",
    "listen": "0.0.0.0:3993",
    "seed": "127.0.0.1:4000"
  }
}
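
In case it's useful, here's a rough Go mirror of that config shape for sanity-checking the JSON before rolling it out -- the field names are inferred from the snippets above and may not match the actual kops/protokube types exactly:

package main

import (
    "encoding/json"
    "fmt"
    "os"
)

// Inferred from the snippets above -- not necessarily the exact kops types.
type gossip struct {
    Protocol  string  `json:"protocol"`
    Listen    string  `json:"listen"`
    Seed      string  `json:"seed,omitempty"`
    Secondary *gossip `json:"secondary,omitempty"`
}

type config struct {
    GossipConfig              *gossip `json:"gossipConfig"`
    DNSControllerGossipConfig *gossip `json:"dnsControllerGossipConfig"`
}

func main() {
    // Usage (hypothetical filenames): go run check.go < step3.json
    var c config
    if err := json.NewDecoder(os.Stdin).Decode(&c); err != nil {
        fmt.Fprintln(os.Stderr, "invalid gossip config:", err)
        os.Exit(1)
    }
    fmt.Println("primary gossip protocol:", c.GossipConfig.Protocol)
}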

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 28, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 27, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@olemarkus
Member

/reopen
/remove-lifecycle rotten
/kind office-hours

@jacksontj are you still around and want to get this into shape? :)

@k8s-ci-robot
Contributor

@olemarkus: Reopened this issue.

In response to this:

/reopen
/remove-lifecycle rotten
/kind office-hours

@jacksontj are you still around and want to get this into shape? :)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this Sep 19, 2022
@k8s-ci-robot removed the lifecycle/rotten label Sep 19, 2022
@justinsb
Member

justinsb commented Dec 2, 2022

We discussed this in office hours. The suggestion is to investigate whether we can simplify the gossip stack by using an approach inspired by the no-DNS work: #14711

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 13, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 13, 2023
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot closed this as not planned Jun 12, 2023
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
