BGP with enterprise VPNs use case #60

Merged
merged 8 commits into from
Feb 22, 2021

Conversation

iawells
Collaborator

@iawells iawells commented Feb 8, 2021

No use case template yet, so this will want updating when it's committed.

Collaborator

@electrocucaracha electrocucaracha left a comment

minimal changes

use-case/bgp-customer.md (outdated, resolved)
use-case/bgp-customer.md (outdated, resolved)
@taylor taylor self-requested a review February 8, 2021 16:20
@xmulligan xmulligan requested review from vukg and removed request for taylor February 8, 2021 16:20
@iawells
Collaborator Author

iawells commented Feb 8, 2021

Thinking about this, assets should be in a subdirectory.

Two possibilities:

  • An .md in the main folder and a subdirectory without the .md extension.
  • An index.md in the subfolder along with its assets

use-case/bgp-enterprise/README.md (outdated, resolved)
Collaborator Author

@iawells iawells left a comment

I beat you to it

@iawells
Collaborator Author

iawells commented Feb 8, 2021

I think we'll undo the markdown linting, there. We can add it separately. Also, I cocked it up and it doesn't run.

@iawells
Collaborator Author

iawells commented Feb 8, 2021

(removed, despite the record in the review comments)

Member

@taylor taylor left a comment

lgtm

@iawells
Collaborator Author

iawells commented Feb 8, 2021

Yeah, I think we're now down to the abstract 'is this how we want a user story to look' - for which we could use the template for the next step.

Collaborator

@jeffsaelens jeffsaelens left a comment

Interesting starting point. In addition to BGP's allergy to NAT, there is also the convergence issue, and the time it takes for BGP to get to a "happy" status upon standup or recovery.

This LGTM, but as a side question, is there a possible future where we say BGP isn't a good candidate for this style of packaging and hosting? I'm picking here solely because it's the first use case, but I'm curious... do we have a threshold for how much we try to get a use case to work before we say "maybe this is a bad idea?", or do we keep engineering until we beat BGP into submission?

@iawells
Collaborator Author

iawells commented Feb 9, 2021

Interesting starting point. In addition to BGP's allergy to NAT, there is also the convergence issue, and the time it takes for BGP to get to a "happy" status upon standup or recovery.

It's an interesting point. We discuss recovery from failure, but the consequences of failure have to be considered in this. What we're really saying with more conventional apps (e.g. web services) is that there are no meaningful consequences if a component fails, as long as we're ready to accept another request. Here, we have different consequences for failure and we have to see if it matters.

is there a possible future where we say BGP isn't a good candidate for this style of packaging and hosting ?[...] do we have a threshold for how much we try and get a use case to work before we say "maybe this is a bad idea?"

I think we can keep judgement out of this and say with an even hand 'this is the best that this allows us to do, take it or leave it'. It may not suit the use cases it used to; it might be better for other ones (e.g. much bigger RIBs).

Bear in mind, with BGP, that it does fail when whatever it runs on dies, and it always has - by 'fail', I mean 'withdraw all routes'. There's nothing new about that.

The point about clouds that we perhaps forget, is that failures are more likely - because we (theoretically) buy cheaper servers and use more fragile equipment, there are more points of failure, and we also expect ops activities like upgrades to be more frequent and more disruptive (killing containers during an upgrade is fine, e.g.). We are supposed to use the new tools to minimise the consequences of them. But failures are still monthly, not hourly, so they might be acceptable without doing more.

The overall resilience would depend on several factors. We can get a replacement BGP server running very quickly, even if the original dies - that's new. We can make the RIB live in a distributed database, so a new process can use GR (graceful restart) as it subs in for an old one - that's new too. And no-one builds their network to rely on one BGP server always running, so BGP never had to be 100% even in the before times.
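As a rough illustration of the "RIB lives outside the process" idea, here is a toy sketch. It does not implement BGP or graceful restart; it only shows that a replacement process can see the routes its predecessor learned when the route state is held elsewhere. `SharedRibStore` and `Speaker` are hypothetical names, the prefixes and next hops are documentation-range placeholders, and a plain dict stands in for the distributed database.

```python
class SharedRibStore:
    """Stand-in for a distributed database mapping prefix -> next hop."""

    def __init__(self):
        self._routes = {}

    def put(self, prefix, next_hop):
        self._routes[prefix] = next_hop

    def snapshot(self):
        # Return a copy so callers cannot mutate the stored state.
        return dict(self._routes)


class Speaker:
    """A speaker-like process that keeps no route state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def learn(self, prefix, next_hop):
        self.store.put(prefix, next_hop)

    def routes(self):
        return self.store.snapshot()


store = SharedRibStore()

old = Speaker("speaker-0", store)
old.learn("10.0.0.0/24", "192.0.2.1")
old.learn("10.0.1.0/24", "192.0.2.2")

del old  # the original process dies; the store is unaffected

replacement = Speaker("speaker-1", store)
recovered = replacement.routes()
print(recovered)
```

In a real deployment the interesting questions are the ones this sketch skips: how quickly the replacement starts, and whether peers hold the routes for long enough while it does.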

We can use this sort of judgement method with whatever we're doing. This is how it was before, this is how it is now - better in some ways, worse in others - and now you-the-user decide if that's worth having. Using cloud native does have benefits; they're just not as straightforward, so we have to consider this a bit more closely.

[I'm wondering whether this discussion wants putting somewhere where people will find it.]

@electrocucaracha
Collaborator

This use case covers different alternatives for implementing a BGP server in a kube-native way, but it doesn't refer to CNI multiplexers like Multus, DANM or NSM. I'm just wondering if this was on purpose.

@iawells
Copy link
Collaborator Author

iawells commented Feb 9, 2021

I'm trying to separate requirements from design - and, I admit, not being 100% consistent at it while we work out what the rules should be, so feedback on that would be useful. What I've tried to say here is: if we assume only what we'd get from a standard k8s install - so that's default CNI behaviour without extensions (as in, literally, 'what all CNIs are documented to do', without having to choose one that implements CNI-and-then-some) and we want to solve this use case, then we have shortcomings and they're worth writing up. And then I've stopped.

Applying Multus or DANM to this would then be the next step - a design question, separate from the use case and its implied requirements, and a means to test if that system design actually solves the requirements of this use case - and that belongs somewhere else.

It needs to be somewhere else because there could be more than just those two solutions to consider. I could apply other technologies to it (NSM, appropriately trained cockroaches moving packets in little envelopes, other solutions we haven't considered or written yet) and, just the same, measure if they do well or badly for this use case.

Ultimately our best practice, if we choose one, should be the one that ticks the most boxes for this and the other use cases; and we get to document the shortcomings too, because they should become clear.

Thus: use case -> unsatisfied requirements -> bunch of design proposals -> best practice.
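A minimal sketch of that pipeline, assuming each design proposal can be judged against a flat list of requirements. Every name and every "satisfied" entry below is a made-up placeholder (`req-1`, `design-a`, and so on), not an assessment of any real technology.

```python
# Unsatisfied requirements drawn from the use cases (placeholder names).
requirements = ["req-1", "req-2", "req-3"]

# Which requirements each candidate design is judged to satisfy (made-up data).
designs = {
    "design-a": {"req-1"},
    "design-b": {"req-1", "req-3"},
    "design-c": {"req-2"},
}


def evaluate(requirements, designs):
    """Record, per design, which requirements are met and which are not."""
    report = {}
    for name, satisfied in designs.items():
        report[name] = {
            "met": [r for r in requirements if r in satisfied],
            "unmet": [r for r in requirements if r not in satisfied],
        }
    return report


def best_practice(report):
    """Pick the design that ticks the most boxes."""
    return max(report, key=lambda name: len(report[name]["met"]))


report = evaluate(requirements, designs)
choice = best_practice(report)
print(choice, "leaves unmet:", report[choice]["unmet"])
```

The "unmet" list matters as much as the winner: it is the documented shortcomings of whichever design becomes the best practice.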

@electrocucaracha
Collaborator

Agreed. It's not only a way to highlight the unsatisfied requirements; it also promotes the portability of CNFs. On the other hand, it's tricky to define a standard K8s deployment, given that Multus implements the CNI methods (ADD, DEL, CHECK and VERSION) but relies on the pre-creation of overlay networks to operate properly. So maybe the criterion for a standard K8s deployment is what we can get from SPs, isn't it?

@xmulligan
Contributor

Just an idea: there are a lot of acronyms in there (and in networking in general). For someone coming with a k8s but not a networking background, it may look like alphabet soup. Should we require that all acronyms are defined at their first use in each doc, or have a separate glossary? I would actually be in favor of the former, as it makes for a more continuous reading experience, and a glossary is extra work where most people would just google it.

iawells and others added 7 commits February 11, 2021 08:31
Low-rent .gitignore file.

Ignores typical in-edit and backup files for text editors.
A multi-VRF BGP speaker use case describing a reasonably common use case
where two isolated networks are in use and the BGP protocol is the
aim of the network function.
Co-authored-by: Victor Morales <chipahuac@hotmail.com>
Co-authored-by: Victor Morales <chipahuac@hotmail.com>
Restructure into a subdirectory to keep document with its assets
Provide a reference section including any acronyms that we use that might
throw people, and a discussion of what we can reasonably expect from
any Kubernetes platform that has not been specifically tailored for NFV.
@iawells
Collaborator Author

iawells commented Feb 11, 2021

Rebased, added a 'context' section (none of the rest is changed but this should address both the acronyms and the 'what are we comparing with' questions; they might be common to all use cases, but we can deal with the consequences of that later).

Markdown usage issue.
Without blank lines, it runs into one big paragraph.
Collaborator

@rannyh rannyh left a comment

Good problem statement

@iawells
Collaborator Author

iawells commented Feb 12, 2021

Multus implements CNI methods (ADD, DEL, CHECK and VERSION)

So the thing about CNIs is that what you say is true; this is their interface to the platform, which they have to provide to be a CNI, and they must all implement it. But there's documented expected behaviour they offer to the platform consumer, which is the more important thing to us.

Things like Calico, Cilium, Flannel and Weave offer - mostly - just that base behaviour. If you want to write a portable app, you would stick to the core functionality that they all have in common.

Multus also offers that behaviour, but most of what it does that is interesting for our use case is an extension specific to Multus. And Multus (or equivalent functionality) is not going to be found in a k8s deployment you pick at random, so we're not really judging the suitability of what we can reasonably guarantee to find. It's not common best practice to deploy Multus.

So Multus is not a great place to start from with a comparison perspective, but it's a great thing to bring in at the design step. "If we said it was a CNF best practice to expect a Multus-enabled platform, then..." And then we can test this theory against other designs with other conditions.
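For readers less familiar with the platform-facing side mentioned above, here is a minimal sketch of how a CNI plugin dispatches the four commands. As I understand the CNI specification, the runtime sets the `CNI_COMMAND` environment variable and passes the network configuration as JSON on stdin; the handler bodies, the function names `handle` and `run`, and the version string are placeholders for illustration, not any real plugin's code.

```python
import json

SKETCH_VERSION = "0.4.0"  # assumed spec version for this sketch


def handle(command, net_config):
    """Dispatch one invocation; return a result dict, or None for no output."""
    if command == "VERSION":
        return {"cniVersion": SKETCH_VERSION, "supportedVersions": [SKETCH_VERSION]}
    if command == "ADD":
        # A real plugin would create and configure an interface here.
        return {
            "cniVersion": net_config.get("cniVersion", SKETCH_VERSION),
            "interfaces": [],
            "ips": [],
        }
    if command in ("DEL", "CHECK"):
        # Success is signalled by exiting cleanly with no result body.
        return None
    raise ValueError(f"unsupported CNI_COMMAND: {command!r}")


def run(environ, stdin_text):
    """Emulate one plugin invocation given an environment and stdin contents."""
    command = environ["CNI_COMMAND"]
    net_config = json.loads(stdin_text) if stdin_text.strip() else {}
    result = handle(command, net_config)
    return "" if result is None else json.dumps(result)


example = run(
    {"CNI_COMMAND": "ADD"},
    '{"cniVersion": "0.4.0", "name": "demo", "type": "sketch"}',
)
print(example)
```

Everything a plugin does beyond this contract, such as attaching extra interfaces, is the kind of extension the comment above separates from the portable baseline.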

@xmulligan xmulligan self-requested a review February 22, 2021 16:19
@xmulligan xmulligan merged commit 221405c into lfn-cnti:master Feb 22, 2021
@iawells iawells deleted the usecase-bgp-enterprise branch February 22, 2021 16:23
@xmulligan xmulligan mentioned this pull request Mar 16, 2021
7 participants