BGP with enterprise VPNs use case #60
Conversation
minimal changes
Thinking about this, assets should be in a subdirectory. Two possibilities:
I beat you to it
I think we'll undo the markdown linting there. We can add it separately. Also, I cocked it up and it doesn't run.
Force-pushed from e00575d to ce32158.
(removed, despite the record in the review comments)
lgtm
Yeah, I think we're now down to the abstract 'is this how we want a user story to look' - for which we could use the template for the next step.
Interesting starting point. In addition to BGP's allergy to NAT, there is also the convergence issue, and the time it takes for BGP to get to a "happy" state upon standup or recovery.
This LGTM, but as a side question, is there a possible future where we say BGP isn't a good candidate for this style of packaging and hosting? I'm picking on it here solely because it's the first use case, but I'm curious... do we have a threshold for how much we try to get a use case to work before we say "maybe this is a bad idea?", or do we keep engineering until we beat BGP into submission?
It's an interesting point. We discuss recovery from failure, but the consequences of failure have to be considered in this. What we're really saying with more conventional apps (e.g. web services) is that there are no meaningful consequences if a component fails, as long as we're ready to accept another request. Here, we have different consequences for failure, and we have to see if it matters.
I think we can keep judgement out of this and say with an even hand 'this is the best that this allows us to do, take it or leave it'. It may not suit the use cases it used to; it might be better for other ones (e.g. much bigger RIBs).

Bear in mind, with BGP, that it does fail when whatever it runs on dies, and it always has - by 'fail', I mean 'withdraw all routes'. There's nothing new about that. The point about clouds that we perhaps forget is that failures are more likely - because we (theoretically) buy cheaper servers and use more fragile equipment, there are more points of failure, and we also expect ops activities like upgrades to be more frequent and more disruptive (killing containers during an upgrade is fine, e.g.). We are supposed to use the new tools to minimise the consequences of them. But failures are still monthly, not hourly, so they might be acceptable without doing more.

The overall resilience would depend on several factors. We can get a replacement BGP server running very quickly, even if the original dies - that's new. We can make the RIB live in a distributed database, so a new process can use GR as it subs in for an old one - that's new too. And no-one builds their network to rely on one BGP server always running, so BGP never had to be 100% even in the before times.

We can use this sort of judgement method with whatever we're doing. This is how it was before, this is how it is now - better in some ways, worse in others - and now you-the-user decide if that's worth having. Using cloud native does have benefits; they're just not as straightforward, so we have to consider this a bit more closely. [I'm wondering whether this discussion wants putting somewhere where people will find it.]
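To make the 'RIB in a distributed database' idea concrete, here is a minimal sketch (invented for illustration, not from this PR): the RIB lives in a shared store rather than in the speaker process, so a replacement speaker can attach to the same store and pick up where the old one left off, which is roughly the state that graceful restart needs. A plain dict stands in for the distributed database; all class and method names are made up.

```python
# Hypothetical sketch: BGP speaker state held outside the process.
# `store` stands in for a distributed database (e.g. a KV store);
# every name here is invented for illustration.

class ExternalRib:
    """RIB kept in a shared store so it survives the speaker process dying."""
    def __init__(self, store):
        self.store = store  # any dict-like shared store

    def add_route(self, prefix, next_hop):
        self.store[prefix] = next_hop

    def withdraw(self, prefix):
        self.store.pop(prefix, None)

    def routes(self):
        return dict(self.store)


class Speaker:
    """Stateless-ish speaker: everything worth keeping is in the RIB."""
    def __init__(self, rib):
        self.rib = rib

    def learn(self, prefix, next_hop):
        self.rib.add_route(prefix, next_hop)


# First speaker learns a route, then "dies"; a replacement attaches to
# the same store and still sees the full RIB.
shared = {}
old = Speaker(ExternalRib(shared))
old.learn("10.0.0.0/24", "192.0.2.1")
del old  # simulate the original process failing

new = Speaker(ExternalRib(shared))
print(new.rib.routes())  # {'10.0.0.0/24': '192.0.2.1'}
```

The point of the sketch is only the judgement call above: the replacement can come up without re-learning everything, which is better than the before times in some ways, and no better in others (peers still see the session drop unless GR is negotiated).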
This use case covers different alternatives to address how to implement a BGP server in a kube-native way, but it doesn't refer to CNI multiplexers like Multus, DANM or NSM. I was just wondering if this was on purpose.
I'm trying to separate requirements from design - and, I admit, not being 100% consistent at it while we work out what the rules should be, so feedback on that would be useful.

What I've tried to say here is: if we assume only what we'd get from a standard k8s install - that's default CNI behaviour without extensions (as in, literally, 'what all CNIs are documented to do', without having to choose one that implements CNI-and-then-some) - and we want to solve this use case, then we have shortcomings and they're worth writing up. And then I've stopped.

Applying Multus or DANM to this would then be the next step - a design question, separate from the use case and its implied requirements, and a means to test whether that system design actually solves the requirements of this use case - and that belongs somewhere else. It needs to be somewhere else because there could be more than just those two solutions to consider. I could apply other technologies to it (NSM, appropriately trained cockroaches moving packets in little envelopes, other solutions we haven't considered or written yet) and, just the same, measure whether they do well or badly for this use case.

Ultimately our best practice, if we choose one, should be the one that ticks the most boxes for this and the other use cases; and we get to document the shortcomings too, because they should become clear. Thus: use case -> unsatisfied requirements -> bunch of design proposals -> best practice.
Agreed - not only does this highlight the unsatisfied requirements, it also promotes the portability of CNFs. On the other hand, it's tricky to define a standard K8s deployment, given that Multus implements the CNI methods (ADD, DEL, CHECK and VERSION) but relies on the pre-creation of overlay networks to operate properly. So maybe the criterion for a standard K8s deployment is what we can get from SPs, isn't it?
Just an idea: there are a lot of acronyms in there (and in networking in general). For someone coming with a k8s background but not a networking one, it may look like alphabet soup. Should we require that all acronyms are defined at their first instance in each doc, or have a separate glossary? I would actually be in favour of the former, as it makes for a more continuous reading experience, and a glossary is extra work where most people would just google it.
Low-rent .gitignore file. Ignores typical in-edit and backup files for text editors.
A multi-VRF BGP speaker use case describing a reasonably common use case where two isolated networks are in use and the BGP protocol is the aim of the network function.
Co-authored-by: Victor Morales <chipahuac@hotmail.com>
Restructure into a subdirectory to keep document with its assets
Provide a reference section including any acronyms that we use that might throw people, and a discussion of what we can reasonably expect from any Kubernetes platform that has not been specifically tailored for NFV.
Force-pushed from 82426bf to ff252aa.
Rebased, added a 'context' section (none of the rest is changed, but this should address both the acronyms and the 'what are we comparing with' questions; they might be common to all use cases, but we can deal with the consequences of that later).
Markdown usage issue. Without blank lines, it runs into one big paragraph.
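For illustration (an invented example, not the actual text from the PR), this is the Markdown behaviour the commit refers to: consecutive lines with no blank line between them are joined into a single paragraph, so separate points need separating blank lines.

```markdown
These two lines render as one paragraph
because there is no blank line between them.

A blank line, like the one above, starts a new paragraph.
```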
Good problem statement
So the thing about CNIs is that what you say is true; this is their interface to the platform that they have to provide to be a CNI, and they must all implement it. But there's also documented expected behaviour they offer to the platform consumer, which is the more important thing to us. Things like Calico, Cilium, Flannel and Weave offer - mostly - just that base behaviour. If you want to write a portable app, you would stick to the core functionality that they all have in common.

Multus also offers that behaviour, but most of what it does that is interesting for our use case is an extension specific to Multus. And Multus (or equivalent functionality) is not going to be found in a k8s deployment you pick at random, so we're not really judging the suitability of what we can reasonably guarantee to find. It's not common best practice to deploy Multus.

So Multus is not a great place to start from a comparison perspective, but it's a great thing to bring in at the design step. "If we said it was a CNF best practice to expect a Multus-enabled platform, then..." And then we can test this theory against other designs with other conditions.
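As a concrete anchor for that design-step discussion (illustrative only; the use case itself deliberately stops short of this), the Multus-specific extension is typically expressed as a NetworkAttachmentDefinition that a pod then requests by annotation. The name, interface and subnet below are invented.

```yaml
# Hypothetical secondary network for one VRF; all values are invented.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: vrf-a
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "eth1",
    "ipam": { "type": "host-local", "subnet": "198.51.100.0/24" }
  }'
```

A pod would then request the attachment with the annotation `k8s.v1.cni.cncf.io/networks: vrf-a` - exactly the kind of Multus-specific behaviour that a randomly chosen k8s deployment won't offer, which is the comparison problem described above.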
No use case template yet, so this will want updating when it's committed.