Fix windows mounting bug-1090 #1189

Merged


torredil
Member

torredil commented Mar 21, 2022

Signed-off-by: Eddie Torres <torredil@amazon.com>

Is this a bug fix or adding new feature?

  • Bug fix. Customers can encounter this bug when mounting a disk with offline status on a Windows node; the pod then stays stuck in the ContainerCreating stage indefinitely. See issue: Windows "volume id empty" #1090

What is this PR about? / Why do we need it?

  • This PR fixes the bug by ensuring the disk is online before it is mounted using the CSI-Proxy API.
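
The idea of bringing the disk online before mounting can be sketched roughly as follows. This is a minimal illustration only, not the code from this PR: the `DiskAPI` interface, its method names, and the `fakeDisk` type are hypothetical stand-ins and are not the CSI-Proxy client API. The fake only imitates the symptom reported in #1090 so the ordering can be shown without a Windows node.

```go
package main

import (
	"errors"
	"fmt"
)

// DiskAPI is a hypothetical stand-in for the disk operations the driver
// performs on a Windows node. It is NOT the real CSI-Proxy client interface.
type DiskAPI interface {
	IsOnline(diskNumber uint32) (bool, error)
	SetOnline(diskNumber uint32) error
	FormatAndMount(diskNumber uint32, target string) error
}

// fakeDisk simulates one disk in memory so the sketch runs anywhere.
type fakeDisk struct {
	online bool
	mounts []string
}

func (f *fakeDisk) IsOnline(uint32) (bool, error) { return f.online, nil }

func (f *fakeDisk) SetOnline(uint32) error {
	f.online = true
	return nil
}

// FormatAndMount fails while the disk is offline, imitating the
// "volume id empty" symptom from the issue (simulated, not the real cause).
func (f *fakeDisk) FormatAndMount(_ uint32, target string) error {
	if !f.online {
		return errors.New("volume id empty")
	}
	f.mounts = append(f.mounts, target)
	return nil
}

// mountEnsuringOnline brings the disk online if needed, then mounts it.
func mountEnsuringOnline(api DiskAPI, diskNumber uint32, target string) error {
	online, err := api.IsOnline(diskNumber)
	if err != nil {
		return fmt.Errorf("query state of disk %d: %w", diskNumber, err)
	}
	if !online {
		if err := api.SetOnline(diskNumber); err != nil {
			return fmt.Errorf("bring disk %d online: %w", diskNumber, err)
		}
	}
	return api.FormatAndMount(diskNumber, target)
}

func main() {
	disk := &fakeDisk{online: false}
	// Mounting an offline disk directly fails in this simulation.
	fmt.Println("direct mount error:", disk.FormatAndMount(5, `C:\mnt\vol`))
	// The guarded path brings the disk online first, then mounts it.
	fmt.Println("guarded mount error:", mountEnsuringOnline(disk, 5, `C:\mnt\vol`))
}
```

Checking the state first keeps the call a no-op for disks that are already online, so the guard is safe to run on every mount.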

What testing is done?

  1. Build and upload container images to a registry.
  2. Change the image in the helm chart.
  3. Test mounting functionality using the new helm chart.

Signed-off-by: Eddie Torres <torredil@amazon.com>
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 21, 2022
@k8s-ci-robot
Contributor

Hi @torredil. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Mar 21, 2022
@wongma7
Contributor

wongma7 commented Mar 22, 2022

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 22, 2022
@wongma7
Contributor

wongma7 commented Mar 23, 2022

/lgtm
/approve
/hold

Since Windows is not exercised by CI, I'll play CI and run a basic Windows mount test as well, then cancel the hold.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 23, 2022
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 23, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: torredil, wongma7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 23, 2022
@wongma7
Contributor

wongma7 commented Mar 23, 2022

Documenting my test for my future self, based on your notes, since it seems I need to set up a Windows cluster every few months in the absence of CI:

  1. Create an EKS cluster with Windows support and SSH/RDP access: https://eksctl.io/usage/windows-worker-nodes
  2. Install csi-proxy by RDPing into the node. (I think this part would be difficult to automate or document in a foolproof way; it would be better to use an AMI with csi-proxy already installed, but I don't have such an AMI ID handy.)
# fish shell: allow RDP (TCP 3389) to the Windows node from a prefix list, then look up the node's public DNS name
set PREFIX_LIST $MY_PREFIX_LIST
set INSTANCE_ID (basename (k get node --selector kubernetes.io/os=windows -o jsonpath='{.items[0].spec.providerID}'))
echo $INSTANCE_ID
set GROUP_ID (aws ec2 describe-instances --instance-ids $INSTANCE_ID | jq -r '.Reservations[0].Instances[0].SecurityGroups[] | select(.GroupName | contains("nodegroup")).GroupId')
echo $GROUP_ID
aws ec2 authorize-security-group-ingress --group-id=$GROUP_ID --ip-permissions FromPort=3389,ToPort=3389,IpProtocol=TCP,PrefixListIds=[{PrefixListId=$PREFIX_LIST}]
set PUBLIC_DNS_NAME (aws ec2 describe-instances --instance-ids $INSTANCE_ID | jq -r '.Reservations[0].Instances[0].NetworkInterfaces[0].Association.PublicDnsName')
echo $PUBLIC_DNS_NAME

# copy csi-proxy.exe to C:\ (using the shared folder feature of the RDP client or something similar)

# then run in PowerShell on the node:
    $flags = "-windows-service -log_file=C:\csi-proxy.log -logtostderr=false"
    sc.exe create csiproxy start= "auto" binPath= "C:\csi-proxy.exe $flags"
    sc.exe failure csiproxy reset= 0 actions= restart/10000
    sc.exe start csiproxy
  3. Install the driver: helm upgrade --install aws-ebs-csi-driver --namespace kube-system ./charts/aws-ebs-csi-driver --set node.enableWindows=true --values ./charts/aws-ebs-csi-driver/values.yaml --set controller.serviceAccount.create=false
  4. Create the Windows example: k create -f examples/kubernetes/windows/specs/
  5. As soon as the disk shows up in the Windows Disk Management GUI, take it offline. (I'm not sure how a disk becomes offline at this stage in a real-world situation, but doing it manually is sufficient for the purpose of this test.)
  6. Driver 1.5.1/1.5.2 fails with:
     k logs -n kube-system ebs-csi-node-windows-cxszt ebs-plugin
     E0323 21:36:24.421267 9244 driver.go:119] GRPC error: rpc error: code = Internal desc = could not format "5" and mount it at "\\var\\lib\\kubelet\\plugins\\kubernetes.io\\csi\\pv\\pvc-03a6422f-b621-42df-99d9-852ff6ef63ac\\globalmount": rpc error: code = Unknown desc = volume id empty
  7. Upgrade the driver to a build that contains this bugfix: helm upgrade --install aws-ebs-csi-driver --namespace kube-system ./charts/aws-ebs-csi-driver --set node.enableWindows=true --values ./charts/aws-ebs-csi-driver/values.yaml --set controller.serviceAccount.create=false --set image.repository=$ACCOUNT_ID.dkr.ecr.us-west-2.amazonaws.com/aws-ebs-csi-driver --set image.tag=mar22
  8. The driver with this bugfix succeeds:
     k get po
     default windows-server-iis-7c5fc8f6c5-dvxmv 1/1 Running 0 8m41s

@wongma7
Contributor

wongma7 commented Mar 23, 2022

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 23, 2022
@k8s-ci-robot k8s-ci-robot merged commit c95a14a into kubernetes-sigs:master Mar 23, 2022
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

3 participants