Failure cluster [5fafdf38...] nfs related tests are failing since 2/7 #123236

Closed
dims opened this issue Feb 11, 2024 · 11 comments
Labels
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/storage: Categorizes an issue or PR as relevant to SIG Storage.

Comments

@dims
Member

dims commented Feb 11, 2024

Failure cluster 5fafdf38fb34e960736d

Error text:
[FAILED] waiting for pod with inline volume: Timed out after 900.001s.
Expected Pod to be in <v1.PodPhase>: "Running"
Got instead:
    <*v1.Pod | 0xc00128bb08>: 
        metadata:
          creationTimestamp: "2024-02-07T20:44:06Z"
          generateName: inline-volume-tester-
          labels:
            app: inline-volume-tester
          managedFields:
          - apiVersion: v1
            fieldsType: FieldsV1
            fieldsV1:
              f:metadata:
                f:generateName: {}
                f:labels:
                  .: {}
                  f:app: {}
              f:spec:
                f:containers:
                  k:{"name":"csi-volume-tester"}:
                    .: {}
                    f:command: {}
                    f:image: {}
                    f:imagePullPolicy: {}
                    f:name: {}
                    f:resources: {}
                    f:terminationMessagePath: {}
                    f:terminationMessagePolicy: {}
                    f:volumeMounts:
                      .: {}
                      k:{"mountPath":"/mnt/test-0"}:
                        .: {}
                        f:mountPath: {}
                        f:name: {}
                f:dnsPolicy: {}
                f:enableServiceLinks: {}
                f:restartPolicy: {}
                f:schedulerName: {}
                f:securityContext: {}
                f:terminationGracePeriodSeconds: {}
                f:volumes:
                  .: {}
         

Recent failures:

2/11/2024, 8:26:41 AM ci-containerd-e2e-ubuntu-gce
2/11/2024, 8:25:45 AM ci-cos-containerd-e2e-ubuntu-gce
2/11/2024, 7:25:26 AM ci-containerd-e2e-ubuntu-gce
2/11/2024, 7:24:36 AM ci-cos-containerd-e2e-ubuntu-gce
2/11/2024, 6:24:24 AM ci-containerd-e2e-ubuntu-gce

/kind failing-test

Also see:

/sig storage

@k8s-ci-robot added the kind/failing-test, sig/storage and needs-triage labels on Feb 11, 2024
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dims
Member Author

dims commented Feb 11, 2024

cc @msau42 @xing-yang @jsafrane

@kannon92
Contributor

xref #123195 (comment)

@jsafrane
Member

From this ci-cos-containerd-e2e-ubuntu-gce run:

I0211 13:40:08.837076 10227 dump.go:53] At 2024-02-11 13:34:52 +0000 UTC - event for nfss2rc2: {persistentvolume-controller } ExternalProvisioning: Waiting for a volume to be created either by the external provisioner 'example.com/nfs-provisioning-7426' or manually by the system administrator. If volume creation is delayed, please verify that the provisioner is running and correctly registered.
I0211 13:40:08.837083 10227 dump.go:53] At 2024-02-11 13:34:56 +0000 UTC - event for nfss2rc2: {example.com/nfs-provisioning-7426_external-provisioner-dv6dl_51d05d30-cbe1-4864-9f19-4de5b2ceda0d } Provisioning: External provisioner is provisioning volume for claim "provisioning-7426/nfss2rc2"
I0211 13:40:08.837090 10227 dump.go:53] At 2024-02-11 13:34:56 +0000 UTC - event for nfss2rc2: {example.com/nfs-provisioning-7426_external-provisioner-dv6dl_51d05d30-cbe1-4864-9f19-4de5b2ceda0d } ProvisioningSucceeded: Successfully provisioned volume pvc-f01c410c-718f-4518-8c86-a14744c412ea
I0211 13:40:08.837098 10227 dump.go:53] At 2024-02-11 13:34:59 +0000 UTC - event for pod-subpath-test-dynamicpv-q4h6: {kubelet bootstrap-e2e-minion-group-ld0w} FailedMount: MountVolume.SetUp failed for volume "pvc-f01c410c-718f-4518-8c86-a14744c412ea" : mount failed: exit status 1
Mounting command: /home/kubernetes/containerized_mounter/mounter
Mounting arguments: mount -t nfs -o vers=4.1 10.64.3.13:/export/pvc-f01c410c-718f-4518-8c86-a14744c412ea /var/lib/kubelet/pods/74f01fe0-c831-416c-bb5c-988366f5b6fa/volumes/kubernetes.io~nfs/pvc-f01c410c-718f-4518-8c86-a14744c412ea
Output: Mount failed: mount failed: exit status 32
Mounting command: chroot
Mounting arguments: [/home/kubernetes/containerized_mounter/rootfs mount -t nfs -o vers=4.1 10.64.3.13:/export/pvc-f01c410c-718f-4518-8c86-a14744c412ea /var/lib/kubelet/pods/74f01fe0-c831-416c-bb5c-988366f5b6fa/volumes/kubernetes.io~nfs/pvc-f01c410c-718f-4518-8c86-a14744c412ea]
Output: mount.nfs: Protocol not supported

Did anything change in the OS image recently? It looks like NFS suddenly stopped supporting v4.1.
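
For anyone who wants to poke at a node directly, a rough diagnostic sketch (the node name, server address and mount target below are placeholders standing in for the values in the log above; this assumes SSH access to the worker and is not something the CI job itself runs):

# Check whether the NFS client modules are present on the node at all:
ssh <node> 'grep nfs /proc/filesystems; lsmod | grep nfs'
# Retry the failing mount by hand inside the containerized mounter chroot,
# first with vers=4.1 (what kubelet tried), then with an older protocol version:
ssh <node> 'chroot /home/kubernetes/containerized_mounter/rootfs mount -t nfs -o vers=4.1 <server-ip>:/export/<pvc-name> /mnt'
ssh <node> 'chroot /home/kubernetes/containerized_mounter/rootfs mount -t nfs -o vers=3 <server-ip>:/export/<pvc-name> /mnt'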

@dims
Member Author

dims commented Feb 12, 2024

@jsafrane I had collected the OS info initially, as I thought this was a containerd/ulimit issue, so I saved the info here: containerd/containerd#9799 (comment)

Please take a look.

@jsafrane
Member

I'm trying to run the test on my own GCE account, but I can't set up a cluster. From the job logs I can see CI uses something like this:

kubetest --dump=/home/jsafrane/project/test-infra/scenarios/_artifacts '--up'  '--down'  '--test'  '--provider=gce'  '--cluster=bootstrap-e2e'  '--gcp-network=bootstrap-e2e'  '--check-leaked-resources'    '--gcp-node-image=ubuntu'  '--gcp-zone=us-west1-b'  '--ginkgo-parallel=30' '--image-family=pipeline-1-29' '--image-project=ubuntu-os-gke-cloud'  '--test_args=--ginkgo.skip=\\[Driver:.gcepd\\]|\\[Slow\\]|\\[Serial\\]|\\[Disruptive\\]|\\[Flaky\\]|\\[Feature:.+\\]|\\[NodeFeature:RuntimeHandler\\]' '--timeout=50m' --gcp-project=openshift-gce-devel

But the cluster does not come up; the apiserver fails with:

$ crictl logs <SHA of kube-apiserver pod>
2024/02/13 14:33:34 Running command:
Command env: (log-file=/var/log/kube-apiserver.log, also-stdout=false, redirect-stderr=true)
Run from directory: 
Executable path: /usr/local/bin/kube-apiserver
Args (comma-delimited): /usr/local/bin/kube-apiserver,--allow-privileged=true,--v=4,--runtime-config=extensions/v1beta1,scheduling.k8s.io/v1alpha1,--delete-collection-workers=1,--cloud-config=/etc/gce.conf,--allow-privileged=true,--cloud-provider=external,--client-ca-file=/etc/srv/kubernetes/pki/ca-certificates.crt,--etcd-servers=https://127.0.0.1:2379,--etcd-cafile=/etc/srv/kubernetes/pki/etcd-apiserver-ca.crt,--etcd-certfile=/etc/srv/kubernetes/pki/etcd-apiserver-client.crt,--etcd-keyfile=/etc/srv/kubernetes/pki/etcd-apiserver-client.key,--etcd-servers-overrides=/events#http://127.0.0.1:4002,--storage-backend=etcd3,--secure-port=443,--tls-cert-file=/etc/srv/kubernetes/pki/apiserver.crt,--tls-private-key-file=/etc/srv/kubernetes/pki/apiserver.key,--kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname,--requestheader-client-ca-file=/etc/srv/kubernetes/pki/aggr_ca.crt,--requestheader-allowed-names=aggregator,--requestheader-extra-headers-prefix=X-Remote-Extra-,--requestheader-group-headers=X-Remote-Group,--requestheader-username-headers=X-Remote-User,--proxy-client-cert-file=/etc/srv/kubernetes/pki/proxy_client.crt,--proxy-client-key-file=/etc/srv/kubernetes/pki/proxy_client.key,--enable-aggregator-routing=true,--kubelet-client-certificate=/etc/srv/kubernetes/pki/apiserver-client.crt,--kubelet-client-key=/etc/srv/kubernetes/pki/apiserver-client.key,--service-account-key-file=/etc/srv/kubernetes/pki/serviceaccount.crt,--token-auth-file=/etc/srv/kubernetes/known_tokens.csv,--service-cluster-ip-range=10.0.0.0/16,--service-account-issuer=https://kubernetes.default.svc.cluster.local,--api-audiences=https://kubernetes.default.svc.cluster.local,--service-account-signing-key-file=/etc/srv/kubernetes/pki/serviceaccount.key,--audit-policy-file=/etc/audit_policy.config,--audit-log-path=/var/log/kube-apiserver-audit.log,--audit-log-maxage=0,--audit-log-maxbackup=0,--audit-log-maxsize=2000000000,--audit-log-mode=batch,--audit-log-truncate-enabled=true,--enable-admission-plugins=NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,NodeRestriction,Priority,StorageObjectInUseProtection,PersistentVolumeClaimResize,RuntimeClass,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota,--admission-control-config-file=/etc/srv/kubernetes/admission_controller_config.yaml,--min-request-timeout=300,--runtime-config=batch/v2alpha1=true,--advertise-address=34.168.220.132,--authorization-mode=Node,RBAC,--egress-selector-config-file=/etc/srv/kubernetes/egress_selector_configuration.yaml
2024/02/13 14:33:34 Now listening for interrupts
2024/02/13 14:33:35 running command: exit status 1

The image is registry.k8s.io/kube-apiserver-amd64:v1.29.1. I'd expect the API server to be more talkative at v=4.
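
One thing worth noting for anyone else chasing this: the wrapper's "Command env" line above says the apiserver output is redirected to a log file rather than stdout, so the actual error should be in /var/log/kube-apiserver.log on the control-plane VM. A rough way to get at it (the instance name is a guess based on the minion names above, not verified):

gcloud compute ssh bootstrap-e2e-master --zone us-west1-b
sudo tail -n 100 /var/log/kube-apiserver.log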

@kannon92
Contributor

@dims I was hoping #123362 would fix these failures (https://testgrid.k8s.io/sig-node-containerd#containerd-e2e-ubuntu), but it doesn't seem to.

@dims
Member Author

dims commented Feb 19, 2024

@kannon92 Yep, me as well :) Trying to see if we have a presubmit where we can iterate - #123390

I don't see it on any of the AWS/EKS/EC2 jobs, so we need to find a way to recreate it somewhere first.

@dims
Member Author

dims commented Feb 22, 2024

This should recover when #123423 lands.

@dims
Member Author

dims commented Feb 22, 2024

We can use pull-cos-containerd-e2e-ubuntu-gce to recreate as needed!
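
For reference, assuming that job is wired up as a presubmit in kubernetes/test-infra, it can be run against an open kubernetes/kubernetes PR with the usual Prow trigger comment:

/test pull-cos-containerd-e2e-ubuntu-gce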

@dims dims closed this as completed Feb 22, 2024