Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cni ipvlan vpc k8s support #4762

Merged
merged 12 commits into from
Nov 21, 2018

Conversation

chris-h-phillips
Copy link
Contributor

We want to try out the cni plugin Lyft has put out https://github.com/lyft/cni-ipvlan-vpc-k8s
It has some desirable characteristics like no extra pods to run, less overhead by using ipvlan etc.

I've been on a journey that started with just baking the extra config and binaries on top of the kops ami which was less than savory since I had to do acrobatics to try and get --node-ip into the kubelet args and get the templating of the cni conflist done at just the right time during instance startup. It worked but was quite unsatisfying.
I considered porting that work over to using the additional user-data / hooks capabilities but that also left me feeling unsatisfied with basically doing the same acrobatics in a harder to work with format and prone to being broken by upstream changes in the future
I decided then to undertake adding support for the plugin in kops using the AmazonVPC work as a starting point.

I've arrived at this and it works great for our clusters.

This does break slightly with the other CNI plugins in that it's not really an addon. Which I think is a positive thing overall.

There are some things that I've done in the code that might not be in quite the right place. I'll call them out inline

I wanted to get this work out there to solicit feedback and hopefully get this to a state that can be adopted so we don't have to maintain this in a fork forever. I did some googling and don't see any mention of it anywhere but welcome pointers to any previous discussion of the topic.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 22, 2018
Kuberouter *KuberouterNetworkingSpec `json:"kuberouter,omitempty"`
Romana *RomanaNetworkingSpec `json:"romana,omitempty"`
AmazonVPC *AmazonVPCNetworkingSpec `json:"amazonvpc,omitempty"`
AmazonVPCIPVlan *AmazonVPCIPVlanNetworkingSpec `json:"amazonvpcipvlan,omitempty"`
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name cni-ipvlan-vpc-k8s doesn't really roll off the tongue so AmazonVPCIPVlan was the best I could come up with

@@ -61,6 +60,17 @@ func (b *NetworkBuilder) Build(c *fi.ModelBuilderContext) error {
}
}

if networking.AmazonVPCIPVlan != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hooked downloading of the binaries into this ModelBuilder. It seems a little awkward and I'm wondering if treating the download as another asset the same as the other cni binaries is the right way to go. I had to modify the Archive task to support gzip archives which seems like is already handled in the asset store.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that other CNI providers install binaries by mounting the hostPath and copying them in, possibly from the image. It's a more generic solution, in that it works on any cluster. I'd probably be more in favor of that approach, but we can also try this (designating it as alpha in the docs) and see how it goes. But also I don't know if they copy binaries or just config files.

I do agree that an asset might probably be better than introducing another code path.

That said, I think that getting a manifest would likely be best of all - going to comment on that at top level though

@@ -127,6 +127,10 @@ func buildNodeupTags(role api.InstanceGroupRole, cluster *api.Cluster, clusterTa
tags.Insert("_networking_cni")
}

if networking.AmazonVPCIPVlan != nil {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hooked the cni conflist template into the Loader with this tag. I'm not sure if that is the right place for it but it seemed to follow established patterns in the code.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: We're trying to get rid of tags in general, because a function tends to be clearer. I don't think this actually does anything (it looks like you're just using it on a resource, where I don't think tags act as a filter, but I could very well be wrong)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The treewalker ignores tag directories that are unknown

I spent a fair bit of time looking at how templates get handled in the code and this seemed to be the way it's done in nodeup, though there do appear to be other ways in other places. Is there another example of doing templates in nodeup without the tree walker / tags that I've overlooked and could emulate?

@@ -241,6 +246,14 @@ func (c *NodeUpCommand) Run(out io.Writer) error {
loader.Builders = append(loader.Builders, &model.CalicoBuilder{NodeupModelContext: modelContext})
}

loader.TemplateFunctions["MapJson"] = func(m map[string]string) (string, error) {
Copy link
Contributor Author

@chris-h-phillips chris-h-phillips Mar 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my text/template foo is not very strong and this is what I came up with to get my template to render correctly. I'm sure there's a better approach. This function seems out of place here

@tvi
Copy link
Contributor

tvi commented Mar 23, 2018

Hi, I saw this PR and I am also looking into deploying lyft cni with kops.
I was wondering, if it would not be easier to deploy it as daemonset the same as amazons cni and have the container copy those cni binaries.

@spiffxp
Copy link
Member

spiffxp commented Mar 23, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 23, 2018
@chris-h-phillips
Copy link
Contributor Author

I considered it and I definitely see an argument for it. To me though, the do something once semantics didn't feel like a good fit for daemonset. Having the containers that had done the downloading just hanging around felt not right and somewhat counter to one of the designed benefits of the plugin. But the thing that really tipped the scales for me personally was, having the thing configured before the kubelet even starts means there's no waiting around for cni containers to be up and running.

@tvi
Copy link
Contributor

tvi commented Mar 23, 2018

yeah, I am thinking about adding prometheus+jaeger agent to the deamon set so we get more visibility into it in production + it would kind of simplify this work you are doing

  • doing upgrades with kubectl apply is much easier than copying binaries

@chris-h-phillips chris-h-phillips force-pushed the cni-ipvlan-vpc-k8s branch 4 times, most recently from a658dbb to 75c6027 Compare March 29, 2018 05:18
@@ -872,6 +872,8 @@ func RunCreateCluster(f *util.Factory, out io.Writer, c *CreateClusterOptions) e
cluster.Spec.Networking.AmazonVPC = &api.AmazonVPCNetworkingSpec{}
case "cilium":
cluster.Spec.Networking.Cilium = &api.CiliumNetworkingSpec{}
case "amazonvpcipvlan":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are going to need to figure out names that don't cause people confusion :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lyft should be in the name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how bout LyftVPC ? it's short and sweet

@@ -61,6 +60,17 @@ func (b *NetworkBuilder) Build(c *fi.ModelBuilderContext) error {
}
}

if networking.AmazonVPCIPVlan != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that other CNI providers install binaries by mounting the hostPath and copying them in, possibly from the image. It's a more generic solution, in that it works on any cluster. I'd probably be more in favor of that approach, but we can also try this (designating it as alpha in the docs) and see how it goes. But also I don't know if they copy binaries or just config files.

I do agree that an asset might probably be better than introducing another code path.

That said, I think that getting a manifest would likely be best of all - going to comment on that at top level though

Romana *RomanaNetworkingSpec `json:"romana,omitempty"`
AmazonVPC *AmazonVPCNetworkingSpec `json:"amazonvpc,omitempty"`
Cilium *CiliumNetworkingSpec `json:"cilium,omitempty"`
AmazonVPCIPVlan *AmazonVPCIPVlanNetworkingSpec `json:"amazonvpcipvlan,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the JSON here should probably be amazonVpcIpvlan, but on the other hand none of the others have done that (I'm thinking kuberouter should have been kubeRouter and amazonvpc should have been amazonVpc)

@@ -127,6 +127,10 @@ func buildNodeupTags(role api.InstanceGroupRole, cluster *api.Cluster, clusterTa
tags.Insert("_networking_cni")
}

if networking.AmazonVPCIPVlan != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: We're trying to get rid of tags in general, because a function tends to be clearer. I don't think this actually does anything (it looks like you're just using it on a resource, where I don't think tags act as a filter, but I could very well be wrong)

var extractArgs = "xf"

if e.Gzip {
extractArgs = "xzf"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine. Some alternatives you might want to consider: What we did in assetstore.go is we look at the extension, and try to guess whether we should expand it: if strings.HasSuffix(file, ".tar.gz") || strings.HasSuffix(file, ".tgz"). I also think tar can usually figure it out, but we might not want to rely on that. Also because we are in nodeup and so we have the file at this point, we could detect the file type (e.g. run file).

But this is good enough for now, unless you're incline to change it!

// LyftIpVlanNetworkingSpec declares that we want to use the cni-ipvlan-vpc-k8s CNI networking
type AmazonVPCIPVlanNetworkingSpec struct {
// Tags that subnets must have to be used for pod IPs
SubnetTags map[string]string `json:"subnetTags,omitempty"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So do we need all of these in order to start trying it out? I'm not sure how far along you are, but the challenge is that once we put them in the API, they're part of our contract. Can we default these or live without them, at least for now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SecurityGroupIds is set to the sgs on the primary ENI if nothing overrides it so we could just move that into a template function and not allow overrides to start
InterfaceIndex can just be 1. It's the suggested way of doing it
RouteToVPCPeers could just be true. I imagine there are situations where people want to not have this turned on
The SubnetTags could just do Type = pod. Then in the future maybe if kops creates subnets for the pod network it could add this tag to them.
BinariesDownloadURL will get removed altogether since it's handled as an asset now.

@justinsb
Copy link
Member

justinsb commented Apr 2, 2018

This is nice work - I had a few comments. In general I'd like to get to a model where you're able to specify a manifest for the CNI provider which nodeup/protokube then just runs kubectl apply on. But the network providers are particularly complicated, especially this one which starts referencing security groups etc.

I'm a bit worried about establishing an API that we can't continue to offer in future, so any fields we can default (at least to start) would be great.

A few arguments for a manifest:

  • Apparently there will be modes for a one-shot Daemonset coming soon, although the resources used by a bash script which sleeps and loops are probably negligible
  • You could source the CNI binary from the docker image
  • Being able to update the manifest URL separately from a kops release is good both for kops and for the plugin users.

Some arguments against:

  • I'd expect that eventually users would expect kops to create the subnets & security groups for the CNI provider to use, so we can't totally be oblivious to what's going on in the manifest

@chrislovecnm
Copy link
Contributor

I see we are downloading stuff. Does lyft not have it in a daemonset?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 9, 2018
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 9, 2018
@chris-h-phillips
Copy link
Contributor Author

@chrislovecnm Lyft doesn't have a daemonset for this. It's kind of one of the touted features that you don't have to run one.

We also wanted the system to be host-local with minimal moving components and state; our network stack contains no network services or daemons. As AWS instances boot, CNI plugins communicate with AWS networking APIs to provision network resources for Pods.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 10, 2018
@dermyrddin
Copy link

@polarbizzle thanx for the update. Just looked into the commit - shouldn't cniVersion in 10-cni-ipvlan-vpc-k8s.conflist.template also be set to "0.4.2"?

@chris-h-phillips
Copy link
Contributor Author

@merlineus 0.3.1 is actually the version of the cni spec

@dermyrddin
Copy link

@polarbizzle oh, I see, sorry for that :)

@chris-h-phillips chris-h-phillips force-pushed the cni-ipvlan-vpc-k8s branch 3 times, most recently from edcc309 to 1ceaa45 Compare November 7, 2018 16:51
@chris-h-phillips
Copy link
Contributor Author

@justinsb , @chrislovecnm could you have another look and perhaps we could land this one?

@@ -964,7 +966,7 @@ func RunCreateCluster(f *util.Factory, out io.Writer, c *CreateClusterOptions) e

case api.TopologyPrivate:
if !supportsPrivateTopology(cluster.Spec.Networking) {
return fmt.Errorf("Invalid networking option %s. Currently only '--networking kopeio-vxlan (or kopeio)', '--networking weave', '--networking flannel', '--networking calico', '--networking canal', '--networking kube-router', '--networking romana', '--networking amazon-vpc-routed-eni' are supported for private topologies", c.Networking)
return fmt.Errorf("Invalid networking option %s. Currently only '--networking kopeio-vxlan (or kopeio)', '--networking weave', '--networking flannel', '--networking calico', '--networking canal', '--networking kube-router', '--networking romana', '--networking amazon-vpc-routed-eni' '--networking lyftvpc' are supported for private topologies", c.Networking)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really need to generate this list :-)

networking := cluster.Spec.Networking

if networking == nil {
return nil, fmt.Errorf("Networking is not set, and should not be nil here")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure you needed to do this - it probably would have sufficed to just do

if cluster.Spec.Networking != nil && cluster.Spec.Networking.LyftVPC != nil {

}

if networking.LyftVPC != nil {
tags.Insert("_lyft_vpc_cni")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably aren't using this tag - we're trying to get rid of tags altogether... but it's harmless

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This tag is how the cni config file template gets rendered

/upup/models/nodeup/resources/_lyft_vpc_cni/files/etc/cni/net.d/10-cni-ipvlan-vpc-k8s.conflist.template

if you can point me to a better place to handle rendering that template I could refactor accordingly

return nil, fmt.Errorf("error querying ec2 metadata service (for az/region): %v", err)
}

sgNames, err := metadata.GetMetadata("security-groups")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something, but we can compute the security groups for any instance, because we set them up. I'll try to do a PoC so we can discuss, unless you know that I'm missing something :-)

@justinsb
Copy link
Member

Sorry for the delay @polarbizzle ... it's additive, and we've delayed on this for a long time, so I'm going to approve it! Hopefully should make it easier for others to make progress on it also (e.g. I want to try to compute the security groups)!

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 21, 2018
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: justinsb, polarbizzle

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 21, 2018
@k8s-ci-robot k8s-ci-robot merged commit f401240 into kubernetes:master Nov 21, 2018
@dermyrddin
Copy link

Hi @polarbizzle, thanks for your work!
I see that there is no documentation for option "--networking lyftvpc", but it works fine.
Is it ready for production use?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants