Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support antrea as network policy provider in kube-up #100736

Closed
wants to merge 2 commits into from

Conversation

anfernee
Copy link
Member

@anfernee anfernee commented Apr 1, 2021

  • Linux Support

Use the following environment to enable antrea on linux for kube-up:

export NETWORK_POLICY_PROVIDER=antrea

Currently, antrea is not supported on cos yet, so use ubuntu instead.
For example, the following environments:

export KUBE_NODE_OS_DISTRIBUTION=ubuntu
export KUBE_GCE_NODE_IMAGE=ubuntu-2004-focal-v20200423
export KUBE_GCE_NODE_PROJECT=ubuntu-os-cloud
  • Windows support

Use the following environment to enable antrea on windows for kube-up:

export WINDOWS_NETWORK_POLICY_PROVIDER=antrea

This commit also allows to specify which windows image to use in the
cluster by

export WINDOWS_NODE_IMAGE_PROJECT=<project_name>
export WINDOWS_NODE_IMAGE=<image_name>

What type of PR is this?

/kind feature

What this PR does / why we need it:

The PR allows to create antrea-enabled windows nodes with kube-up.sh. It only allows to create windows nodes now, because there is some problem running antrea on COS based images.

Support antrea as network policy provider in kube-up

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

n/a

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 1, 2021
@k8s-ci-robot
Copy link
Contributor

@anfernee: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Apr 1, 2021
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: anfernee
To complete the pull request process, please assign dims after the PR has been reviewed.
You can assign the PR to them by writing /assign @dims in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/provider/gcp Issues or PRs related to gcp provider sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. labels Apr 1, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Apr 1, 2021
@anfernee
Copy link
Member Author

anfernee commented Apr 1, 2021

cc @antoninbas @lzhecheng

@anfernee
Copy link
Member Author

anfernee commented Apr 1, 2021

/retest

@dims
Copy link
Member

dims commented Apr 1, 2021

@anfernee would it be possible to first get a linux / antrea change in? (before we try to do windows?). I'd expect a PR that introduces antrea on linux with network policy switched on, followed by a periodic CI job that exercises these changes and runs the network policy tests (and the results show up in test-grid).

Then we figure out how to do the same with windows. WDYT?

@anfernee
Copy link
Member Author

anfernee commented Apr 1, 2021

I tried linux version as well, but it failed to start on COS images. I don't know why yet (created antrea-io/antrea#2021). The change will help us testing antrea-windows compatibility, which could be used to build a verification pipeline for antrea. It would be very helpful for that task if this can be merged separately.

@dims
Copy link
Member

dims commented Apr 1, 2021

@anfernee you don't need to use COS images, you can use kube-up with ubuntu images too!

example : https://github.com/kubernetes/test-infra/blob/7affdf2921a038c6d78e4a1f2f018f53c959893c/config/jobs/kubernetes/sig-cloud-provider/gcp/gcp-gce.yaml#L307-L310

@dims
Copy link
Member

dims commented Apr 1, 2021

@anfernee /cluster is a directory we are trying to deprecate/remove for a while, while we allow small/targeted things to merge, we do not encourage a large change like what is being proposed here ( see #78995 for context ).

So let's see if there is a limited set of changes that will help run network policy tests on linux first.

cc @spiffxp @BenTheElder

@@ -0,0 +1,208 @@
apiVersion: apps/v1
kind: DaemonSet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/cc @jkh52
There are concerns with the total number of DaemonSets we are running.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the commit references windows but I notice this says linux in the path.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I repurposed this to support linux as well. @jkh52 what is the concern about the number of DaemonSets?

Copy link
Contributor

@jkh52 jkh52 Apr 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my experience with DaemonSets, when there are enough nodes (example: hundreds or more) and HA clusters (3 masters) the proxy-server could have runaway memory ballooning (at times like rolling master restarts, when we expect many agent grpc connections). I found the root cause to be: the agent authentication path in proxy-server was getting throttled by client-go and accumulating resources.

I found stability improvements by tuning:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there node startup scripts that this could be placed in? Or is this solving a chicken/egg problem where Antrea network fabric is not known to be selected before the node boots?

@k8s-ci-robot
Copy link
Contributor

@cheftako: GitHub didn't allow me to request PR reviews from the following users: jkh52.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @jkh52
There are concerns with the total number of DaemonSets we are running.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -566,3 +573,18 @@ export WINDOWS_ENABLE_DSR="${WINDOWS_ENABLE_DSR:-false}"
# TLS_CIPHER_SUITES defines cipher suites allowed to be used by kube-apiserver.
# If this variable is unset or empty, kube-apiserver will allow its default set of cipher suites.
export TLS_CIPHER_SUITES=""

# Optional: URL to download antrea-cni.exe for Windows node
export WINDOWS_ANTREA_CNI_BINARY_URL="${WINDOWS_ANTREA_CNI_BINARY_URL:-https://github.com/vmware-tanzu/antrea/releases/download/v0.13.1/antrea-cni-windows-x86_64.exe}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want CNI and Agent at the same version? If so it may make sense to use a common version or URL base.

Copy link
Member Author

@anfernee anfernee Apr 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most likely, but I am not assuming that. It could be different .gke-x versions in the future, so I would vote keep the flexibility there. Also, the URL from github.com and gcr.io might be using different patterns.

cluster/gce/util.sh Outdated Show resolved Hide resolved
# Download ovs installer
$tmp_dir = 'C:\k8s_tmp'
New-Item -Force -ItemType 'directory' $tmp_dir | Out-Null
$filename = 'Install-OVS.ps1'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This won't work for GKE with Google-signed OVS.
This script only works for self-signed OVS.
We have to change the script.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the latest version of the script, we added a new command-line option, ImportCertificate (https://github.com/vmware-tanzu/antrea/blob/002379983cbf685ffb1623cfb953d4443b18ef14/hack/windows/Install-OVS.ps1#L21). You can set it to False when you bring your own signed binaries.

Copy link

@mmottaghi mmottaghi Apr 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Sure, that could be a good solution, but even then, my other comment (line 483) still applies: input arguments must change.
  • The other issue we have is that this file is not the right place to become so specific about the network policy component.
    Here we have to only decide if we do or don't want to install a nwk-policy component, and if yes, we have to delegate the control to a more component-specific file (which would be Antrea installer in this case) to take care of the installation for us. If we directly install Antrea here, over time as more and more network components emerge this file will get polluted and messy with all the new installation calls.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll keep it as is. @mmottaghi can you add a patch to support officially signed ovs?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do

cluster/gce/windows/k8s-node-setup.psm1 Show resolved Hide resolved
@jsturtevant
Copy link
Contributor

fyi @jayunit100

@jsturtevant jsturtevant added this to InProgress (v1.22) in SIG-Windows Apr 22, 2021
@jsturtevant
Copy link
Contributor

/cc @jayunit100

@BenTheElder
Copy link
Member

So let's see if there is a limited set of changes that will help run network policy tests on linux first.

Don't we already run network policy tests with kube-up on linux with calico? cc @aojea

@aojea
Copy link
Member

aojea commented Apr 28, 2021

So let's see if there is a limited set of changes that will help run network policy tests on linux first.

Don't we already run network policy tests with kube-up on linux with calico? cc @aojea

yes we do, https://testgrid.k8s.io/sig-network-gce#network-policies,%20google-gce , it is Calico because the code was there since the beginning and we needed a way to test the new testing implemented in the e2e framework for network policies

I thought this was only for windows, I didn't know the goal was to set this for linux too ...

I think we shouldn't build a CNI testing matrix in k/k ... there are more than 20 CNI plugins in the web https://kubernetes.io/docs/concepts/cluster-administration/networking/ , what happen if a new plugin wants to add its job too?

@aojea
Copy link
Member

aojea commented Apr 28, 2021

also, this has 26 fields in v1alpha1, is this stable enough to gate code in k/k?

@anfernee
Copy link
Member Author

antrea recently reached 1.0.0 milestone. @antoninbas Is there a plan to promote the api to v1? Regardless the tag, I believe the API should be pretty stable.

@anfernee
Copy link
Member Author

So let's see if there is a limited set of changes that will help run network policy tests on linux first.

Don't we already run network policy tests with kube-up on linux with calico? cc @aojea

yes we do, https://testgrid.k8s.io/sig-network-gce#network-policies,%20google-gce , it is Calico because the code was there since the beginning and we needed a way to test the new testing implemented in the e2e framework for network policies

I thought this was only for windows, I didn't know the goal was to set this for linux too ...

I think we shouldn't build a CNI testing matrix in k/k ... there are more than 20 CNI plugins in the web https://kubernetes.io/docs/concepts/cluster-administration/networking/ , what happen if a new plugin wants to add its job too?

Original purpose is to support windows. There was a comment to enable that on Linux nodes as well, because Linux nodes are required, and it's strange to leave it behind. I think it makes sense, so added both.

@antoninbas
Copy link
Contributor

antrea recently reached 1.0.0 milestone. @antoninbas Is there a plan to promote the api to v1? Regardless the tag, I believe the API should be pretty stable.

We do intend to promote the APIs to Beta and eventually GA, but we make a decision for each CRD separately for each new Antrea release. The Antrea policy API is still going through changes, even though we consider the implementation itself stable. Note that these Antrea-specific APIs are of course not exercised by any upstream test…

@dims
Copy link
Member

dims commented Apr 29, 2021

/hold

#100736 (comment)

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 29, 2021
@aojea
Copy link
Member

aojea commented Apr 29, 2021

Let me suggest a different direction, why don't you use the new repo that the network policy folks are going to create?
kubernetes/community#5747 (comment)

You can use prow from there and we don't keep growing the tech debt in k/k ,

I think we shouldn't build a CNI testing matrix in k/k ... there are more than 20 CNI plugins in the web https://kubernetes.io/docs/concepts/cluster-administration/networking/ , what happen if a new plugin wants to add its job too?

Also, we avoid this ^^^, Kubernetes network policy is the same as Ingress, it only defines the API not the implementation

@anfernee
Copy link
Member Author

also, this has 26 fields in v1alpha1, is this stable enough to gate code in k/k?

I've removed the v1alpha1 APIs and changed the config to disable other features that relies on the alpha apis.

* Linux Support

Use the following environment to enable antrea on linux for kube-up:
```
export NETWORK_POLICY_PROVIDER=antrea
```

Currently, antrea is not supported on cos yet, so use ubuntu instead.
For example, the following environments:
```
export KUBE_NODE_OS_DISTRIBUTION=ubuntu
export KUBE_GCE_NODE_IMAGE=ubuntu-2004-focal-v20200423
export KUBE_GCE_NODE_PROJECT=ubuntu-os-cloud
```

* Windows support

Use the following environment to enable antrea on windows for kube-up:
```
export WINDOWS_NETWORK_POLICY_PROVIDER=antrea
```

This commit also allows to specify which windows image to use in the
cluster by
```
export WINDOWS_NODE_IMAGE_PROJECT=<project_name>
export WINDOWS_NODE_IMAGE=<image_name>
```
These CRDs are not required for basic functionality, so remove them.
@k8s-ci-robot
Copy link
Contributor

k8s-ci-robot commented Apr 30, 2021

@anfernee: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-kubernetes-e2e-gce-100-performance fcce3e7 link /test pull-kubernetes-e2e-gce-100-performance
pull-kubernetes-e2e-gce-ubuntu-containerd fcce3e7 link /test pull-kubernetes-e2e-gce-ubuntu-containerd

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@anfernee
Copy link
Member Author

Can someone take another look at this :)

@dims
Copy link
Member

dims commented May 1, 2021

repeating again, please feel to experiment in this PR. but we will not merge this as mentioned several times earlier in this PR.

@anfernee
Copy link
Member Author

anfernee commented May 1, 2021

Hey @dims this is already the smallest change to support antrea. What do you suggest to make this mergeable?

hostNetwork: true
containers:
- name: node-init
image: gcr.io/google-containers/startup-script:v1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use v2? v1 was built in 2016.

@@ -0,0 +1,208 @@
apiVersion: apps/v1
kind: DaemonSet
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there node startup scripts that this could be placed in? Or is this solving a chicken/egg problem where Antrea network fabric is not known to be selected before the node boots?

valueFrom:
fieldRef:
fieldPath: spec.nodeName
image: projects.registry.vmware.com/antrea/antrea-ubuntu:v1.0.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these images be hosted from a canonical Kubernetes repository?

@@ -167,7 +167,7 @@ export ENABLE_DOCKER_REGISTRY_CACHE=true

# Optional: Deploy a L7 loadbalancer controller to fulfill Ingress requests:
# glbc - CE L7 Load Balancer Controller
export ENABLE_L7_LOADBALANCING="${KUBE_ENABLE_L7_LOADBALANCING:-glbc}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bad merge? Why is this getting deleted? Or is the comment above meant to be preserved?

@@ -83,17 +83,20 @@ function set-linux-node-image() {
# Requires:
# WINDOWS_NODE_OS_DISTRIBUTION
# Sets:
# WINDOWS_NODE_IMAGE_PROJECT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the set-windows-node-image function being changed? I wouldn't expect that to be updated.

@@ -117,6 +117,25 @@ write_files:
[Install]
WantedBy=multi-user.target

- path: /lib/systemd/system/google-network-daemon.service
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will google-network-daemon be installed as a systemd service on all COS/GCI images going forward even when Antrea isn't installed or is there a flag somewhere that ensures this is only active when Antrea is enabled.

AntreaPolicy: false
Traceflow: false
clientConnection:
kubeconfig: \etc\kubernetes\antrea.kubeconfig
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefix with C: or use / delimiters.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jeremyje \etc\kubernetes\antrea.kubeconfig works.
This is the way the Antrea team uses it.

@@ -966,6 +1140,9 @@ function Prepare-CniNetworking {
Install_Cni_Binaries
Configure_Dockerd_CniNetworking
}
if (${env:WINDOWS_NETWORK_POLICY_PROVIDER} -eq "antrea") {
Configure_Antrea_CniNetworking
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this tested with containerd as well?

Copy link
Member

@liggitt liggitt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried linux version as well, but it failed to start on COS images. I don't know why yet (created antrea-io/antrea#2021). The change will help us testing antrea-windows compatibility, which could be used to build a verification pipeline for antrea. It would be very helpful for that task if this can be merged separately.

In my opinion, new addons and callers should not be added to this directory. Every new thing added is one more obstacle to cleaning up it and removing it.

Comment on lines +87 to +110
- apiGroups:
- core.antrea.tanzu.vmware.com
resources:
- clustergroups
verbs:
- get
- list
- watch
- create
- update
- patch
- delete
- apiGroups:
- crd.antrea.io
resources:
- clustergroups
verbs:
- get
- list
- watch
- create
- update
- patch
- delete
Copy link
Member

@liggitt liggitt May 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this addon doing with these permissions (and all the CRUD permissions below). this seems like a very wide set of permissions

@k8s-ci-robot
Copy link
Contributor

@anfernee: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 6, 2021
@anfernee anfernee marked this pull request as draft May 12, 2021 22:01
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 12, 2021
@anfernee
Copy link
Member Author

There is ongoing discussion where this change should land, so mark it as draft.

@anfernee
Copy link
Member Author

anfernee commented Jun 9, 2021

@ALL It looks like this cluster folder as well as all the cloud provider specific change is going to phase out from this repo. All the new features are forbidden. As a result, this change won't be suitable here. In the long run, https://github.com/kubernetes/cloud-provider-gcp is a better place for this change. That repo is still a little bit behind, but it will catch up. We will keep working with antrea and kubernetes community to make them work better together. Thanks a lot for the reviewers who have put a lot of time on this change.

@anfernee anfernee closed this Jun 9, 2021
SIG-Windows automation moved this from InProgress (v1.22) to Done (v1.21) Jun 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/provider/gcp Issues or PRs related to gcp provider cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet