New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update cluster/gce/ scripts to support Windows nodes. #73442

Merged
merged 1 commit into from Feb 1, 2019

Conversation

@pjh
Copy link
Contributor

pjh commented Jan 29, 2019

/kind feature

What this PR does / why we need it: This PR introduces support for Windows nodes into the cluster bringup scripts for GCE. Windows node support is targeting Kubernetes 1.14 for GA.

There are two main parts to the PR: changes to the existing scripts under cluster/gce and the addition of several PowerShell scripts that run on the Windows nodes to connect them to the cluster. These scripts currently work against Windows Server version 1803.

A version of these changes is already being tested at https://testgrid.k8s.io/google-windows#windows-prototype. Another test job (pending kubernetes/test-infra#11010) targeting these changes will be added once the PR is merged.

The full history of these changes can be seen at https://github.com/pjh/kubernetes/tree/windows-up. Thanks @yujuhong and @mtaufen for their contributions.

Introduced support for Windows nodes into the cluster bringup scripts for GCE.
@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Jan 29, 2019

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Jan 29, 2019

Hi @pjh. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@yujuhong

This comment has been minimized.

Copy link
Member

yujuhong commented Jan 29, 2019

/ok-to-test

@yujuhong

This comment has been minimized.

Copy link
Member

yujuhong commented Jan 29, 2019

The PR is huge, but the majority of it is adding a new, isolated cluster/gce/win1803 package.

/cc @roberthbailey @zmerlynn @mikedanese
Does any of you have time to time to help review the changes in cluster/ and cluster/gce/?

@yujuhong yujuhong added this to the v1.14 milestone Jan 29, 2019

@pjh pjh force-pushed the pjh:gce-windows-cluster branch from 9176042 to 4a1b3fc Jan 29, 2019

@pjh

This comment has been minimized.

Copy link
Contributor Author

pjh commented Jan 29, 2019

CLA signed and commit amended with proper e-mail address.

@pjh

This comment has been minimized.

Copy link
Contributor Author

pjh commented Jan 29, 2019

CC some folks from sig-windows: @PatrickLang @michmike @benmoss @astrieanna

@michmike
Copy link

michmike left a comment

why is 1803 the OS of choice? SIG-Windows plans to GA with windows node in kubernetes v1.14, but we will only support windows server 2019 (win1809)


# Taint Windows nodes by default to prevent Linux workloads from being
# scheduled onto them.
WINDOWS_NODE_TAINTS="${WINDOWS_NODE_TAINTS:-node.kubernetes.io/os=windows:NoSchedule}"

This comment has been minimized.

@michmike

michmike Jan 29, 2019

We would prefer if the taint was specific to the operating system of windows since there has to be a direct correllation between host OS version and container OS version.
so the taint should be Windows1803 in this case.
Note that 1803 is not supported by sig-windows for GA

This comment has been minimized.

@yujuhong

yujuhong Jan 29, 2019

Member

Yes, I'm aware of this. That's why I brought this up in the Windows GA KEP :-)
I'd like some clear recommendation of how OS version matching is going to work. For example, should it be win1809 or win2019? Is 1809 going to be compatible with 1903? I think this may eventually lead to more confusion to the users.

BTW, this is a temporary taint we choose for GCE. We can (and will) change this later. The main purpose of this PR is to get the code in tree, so we can start testing against k8s and make incremental changes.

This comment has been minimized.

@pjh

pjh Jan 29, 2019

Author Contributor

I agree - for the taint name we can wait for the approach to be finalized in the KEP / elsewhere, then we can easily change this.

This comment has been minimized.

@michmike

michmike Jan 29, 2019

Microsoft publishes a clear compatibility guide here. https://docs.microsoft.com/en-us/virtualization/windowscontainers/deploy-containers/version-compatibility. we document this in the KEP and don't have any ownership or influence here. it is what it is. 1809 and 1903 are incompatible for Windows server containers.

the name of the taint is not as important as long as it is consistent between containers and container host and it describes the OS version it is compatible with.

@pjh

This comment has been minimized.

Copy link
Contributor Author

pjh commented Jan 30, 2019

Sadly, it looks like changing some of the "internal" environment variable names is more dangerous than I anticipated. For example, cluster/gce/config-test.sh sets this variable:

NODE_OS_DISTRIBUTION=${KUBE_NODE_OS_DISTRIBUTION:-${KUBE_OS_DISTRIBUTION:-gci}}

I thought it would be ok to change the "internal" name (within the cluster/ scripts) of this variable to LINUX_NODE_OS_DISTRIBUTION, leaving the "external" name KUBE_NODE_OS_DISTRIBUTION unchanged to remain compatible with existing users. Unfortunately it turns out there are other, less-obvious users of the internal name too - hack/ginkgo-e2e.sh uses NODE_OS_DISTRIBUTION directly, not KUBE_NODE_OS_DISTRIBUTION. This is causing pull-kubernetes-e2e-gce to run tests against this PR that should be skipped.

Similarly, a quick grep for NUM_NODES in the entire tree turns up a million uses, so introducing LINUX_NUM_NODES instead and using it throughout cluster/ is still potentially dangerous. It might be possible to hunt down all the uses in the tree but even then who knows what else is consuming these variables.

I'm working on updating the PR so that no existing variable names are changed at all. It's unfortunate that we'll have e.g. NUM_NODES and NUM_WINDOWS_NODES, instead of NUM_LINUX_NODES and NUM_WINDOWS_NODES.

@yujuhong

This comment has been minimized.

Copy link
Member

yujuhong commented Jan 30, 2019

I'm working on updating the PR so that no existing variable names are changed at all. It's unfortunate that we'll have e.g. NUM_NODES and NUM_WINDOWS_NODES, instead of NUM_LINUX_NODES and NUM_WINDOWS_NODES.

Alternatively, we could still keep NUM_NODES as the number of total nodes, and using NUM_WINDOWS_NODES and NUM_LINUX_NODES for configuring nodes in cluster/ .
E.g.,

  • if NUM_NODES is set to 5, cluster creates 5 linux nodes
  • if NUM_NODES is set to 5, and NUM_WINDOWS_NODES is set to 2, the script automatically set NUM_LINUX_NODES to 3.

Internally, everything uses NUM_WINDOWS_NODES and NUM_LINUX_NODES. Externally, NUM_NODES can be configured normally when NUM_WINDOWS_NODES is not set (defaut:0).
What do you think?

@pjh

This comment has been minimized.

Copy link
Contributor Author

pjh commented Jan 30, 2019

Right, that approach might make sense and I think we might want to use it later... but for this PR I think we can stick with having "disjoint" NUM_NODES and NUM_WINDOWS_NODES. At some point as we start exercising more code paths with NUM_WINDOWS_NODES set > 0 we may find that switching to setting NUM_LINUX_NODES internally will make sense.

@pjh

This comment has been minimized.

Copy link
Contributor Author

pjh commented Jan 30, 2019

I'm working on updating the PR so that no existing variable names are changed at all. It's unfortunate that we'll have e.g. NUM_NODES and NUM_WINDOWS_NODES, instead of NUM_LINUX_NODES and NUM_WINDOWS_NODES.

This is done and I've tested that cluster bringup still works.

@michmike did you want to take another look at this?

@pjh

This comment has been minimized.

Copy link
Contributor Author

pjh commented Jan 31, 2019

/test pull-kubernetes-e2e-gce-100-performance

@yujuhong

This comment has been minimized.

Copy link
Member

yujuhong commented Jan 31, 2019

/lgtm
/approve

@spiffxp did a pass of review.

@roberthbailey @zmerlynn @mikedanese, does anyone of you have time to review and/or approve?

@k8s-ci-robot k8s-ci-robot added the lgtm label Jan 31, 2019

@mikedanese
Copy link
Member

mikedanese left a comment

Ok, I looked at everything outside of cluster/gce/win1803 and it's reasonable. I had a couple style nits. Can you cleanup git history?

# WINDOWS_NODE_IMAGE_FAMILY
# WINDOWS_NODE_IMAGE_PROJECT
function set-windows-node-image() {
WINDOWS_NODE_IMAGE_PROJECT="windows-cloud"

This comment has been minimized.

@mikedanese

mikedanese Jan 31, 2019

Member

Theoretically, indentation should be 2 spaces:

https://google.github.io/styleguide/shell.xml#Indentation

This comment has been minimized.

@pjh

pjh Jan 31, 2019

Author Contributor

Done.


# Sets KUBELET_ARGS with the kubelet flags for Windows nodes.
function construct-windows-kubelet-flags {
local flags=$(construct-common-kubelet-flags)

This comment has been minimized.

@mikedanese

This comment has been minimized.

@pjh

pjh Jan 31, 2019

Author Contributor

Done, thanks for the style guide link.


# Turn off kernel memory cgroup notification.
flags+=" --experimental-kernel-memcg-notification=false"

KUBELET_ARGS="${flags}"
}

# $1: if 'true', we're rendering config for a master, else a node
function build-kubelet-config {
local master=$1

This comment has been minimized.

@mikedanese

mikedanese Jan 31, 2019

Member

Quote argument variables.

This comment has been minimized.

@pjh

pjh Jan 31, 2019

Author Contributor

Done.

@@ -950,6 +1136,10 @@ $(echo "${CUSTOM_CALICO_NODE_DAEMONSET_YAML:-}" | sed -e "s/'/''/g")
CUSTOM_TYPHA_DEPLOYMENT_YAML: |
$(echo "${CUSTOM_TYPHA_DEPLOYMENT_YAML:-}" | sed -e "s/'/''/g")
EOF
if [[ "${master}" == "false" ]]; then

This comment has been minimized.

@mikedanese

mikedanese Jan 31, 2019

Member

Why is this needed?

This comment has been minimized.

@pjh

pjh Jan 31, 2019

Author Contributor

I suspect there used to be something in this block but then it was removed. I've deleted it, cc @mtaufen in case there's some reason for it.

--zone "${ZONE}" \
--base-instance-name "${group_name}" \
--size "${this_mig_size}" \
--template "$template_name" || true;

This comment has been minimized.

@pjh

pjh Jan 31, 2019

Author Contributor

Done.

--project "${PROJECT}" \
--timeout "${MIG_WAIT_UNTIL_STABLE_TIMEOUT}" || true;
done
wait

This comment has been minimized.

@mikedanese

mikedanese Jan 31, 2019

Member

What are you waiting for? I don't see any subprocesses spawned.

This comment has been minimized.

@pjh

pjh Jan 31, 2019

Author Contributor

You're right. This was copied from create-nodes() just above... which also doesn't seem to create any subprocesses. (shrug)

Removed.

@pjh pjh force-pushed the pjh:gce-windows-cluster branch from 5edcc3f to 283a77a Jan 31, 2019

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jan 31, 2019

@pjh pjh force-pushed the pjh:gce-windows-cluster branch from 283a77a to f0f7829 Jan 31, 2019

@pjh
Copy link
Contributor Author

pjh left a comment

Thanks for the comments @mikedanese!

# WINDOWS_NODE_IMAGE_FAMILY
# WINDOWS_NODE_IMAGE_PROJECT
function set-windows-node-image() {
WINDOWS_NODE_IMAGE_PROJECT="windows-cloud"

This comment has been minimized.

@pjh

pjh Jan 31, 2019

Author Contributor

Done.


# Sets KUBELET_ARGS with the kubelet flags for Windows nodes.
function construct-windows-kubelet-flags {
local flags=$(construct-common-kubelet-flags)

This comment has been minimized.

@pjh

pjh Jan 31, 2019

Author Contributor

Done, thanks for the style guide link.


# Turn off kernel memory cgroup notification.
flags+=" --experimental-kernel-memcg-notification=false"

KUBELET_ARGS="${flags}"
}

# $1: if 'true', we're rendering config for a master, else a node
function build-kubelet-config {
local master=$1

This comment has been minimized.

@pjh

pjh Jan 31, 2019

Author Contributor

Done.

@@ -950,6 +1136,10 @@ $(echo "${CUSTOM_CALICO_NODE_DAEMONSET_YAML:-}" | sed -e "s/'/''/g")
CUSTOM_TYPHA_DEPLOYMENT_YAML: |
$(echo "${CUSTOM_TYPHA_DEPLOYMENT_YAML:-}" | sed -e "s/'/''/g")
EOF
if [[ "${master}" == "false" ]]; then

This comment has been minimized.

@pjh

pjh Jan 31, 2019

Author Contributor

I suspect there used to be something in this block but then it was removed. I've deleted it, cc @mtaufen in case there's some reason for it.

--zone "${ZONE}" \
--base-instance-name "${group_name}" \
--size "${this_mig_size}" \
--template "$template_name" || true;

This comment has been minimized.

@pjh

pjh Jan 31, 2019

Author Contributor

Done.

--project "${PROJECT}" \
--timeout "${MIG_WAIT_UNTIL_STABLE_TIMEOUT}" || true;
done
wait

This comment has been minimized.

@pjh

pjh Jan 31, 2019

Author Contributor

You're right. This was copied from create-nodes() just above... which also doesn't seem to create any subprocesses. (shrug)

Removed.

@pjh

This comment has been minimized.

Copy link
Contributor Author

pjh commented Jan 31, 2019

Ok, I looked at everything outside of cluster/gce/win1803 and it's reasonable. I had a couple style nits. Can you cleanup git history?

I've synced with head and squashed the commits back into one.

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Jan 31, 2019

@pjh: The following test failed, say /retest to rerun them all:

Test name Commit Details Rerun command
pull-kubernetes-e2e-kops-aws 4069f19 link /test pull-kubernetes-e2e-kops-aws

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@michmike

This comment has been minimized.

Copy link

michmike commented Jan 31, 2019

overall these changes lgtm. i won't approve, but i will provide my lgtm
/lgtm

@pjh

This comment has been minimized.

Copy link
Contributor Author

pjh commented Jan 31, 2019

/retest

@pjh

This comment has been minimized.

Copy link
Contributor Author

pjh commented Jan 31, 2019

All tests have passed, @mikedanese please let me know if you have any other feedback.

@yujuhong

This comment has been minimized.

Copy link
Member

yujuhong commented Jan 31, 2019

/assign @mikedanese
Thanks for the review!

@mikedanese

This comment has been minimized.

Copy link
Member

mikedanese commented Feb 1, 2019

/approve

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

k8s-ci-robot commented Feb 1, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mikedanese, pjh, yujuhong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 244795a into kubernetes:master Feb 1, 2019

16 checks passed

cla/linuxfoundation pjh authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-cross Skipped
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-godeps Skipped
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped
pull-kubernetes-local-e2e-containerized Skipped
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
tide In merge pool.
Details

SIG-Windows automation moved this from Backlog (v1.14) to Done (v.1.14) Feb 1, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment