
Flake: periodic-kubernetes-containerd-conformance-test-ppc64le, e2e conformance test in /e2e/network/service.go related to session affinity #112442

Closed
Rajalakshmi-Girish opened this issue Sep 14, 2022 · 53 comments
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@Rajalakshmi-Girish
Contributor

Which jobs are flaking?

https://prow.k8s.io/job-history/gs/ppc64le-kubernetes/logs/periodic-kubernetes-containerd-conformance-test-ppc64le

Which tests are flaking?

  • [sig-network] Services should be able to switch session affinity for NodePort service [LinuxOnly] [Conformance]
  • [sig-network] Services should have session affinity timeout work for NodePort service [LinuxOnly] [Conformance]
  • [sig-network] Services should have session affinity work for NodePort service [LinuxOnly] [Conformance]

Since when has it been flaking?

Probably for about a fortnight.

Testgrid link

https://k8s-testgrid.appspot.com/ibm-conformance-ppc64le#Periodic%20Conformance%20on%20ppc64le%20with%20containerd%20as%20runtime

Reason for failure (if possible)

No response

Anything else we need to know?

Below is the failure:

{Sep  9 07:46:59.187: service is not reachable within 2m0s timeout on endpoint 192.168.159.221:30141 over TCP protocol failed test/e2e/network/service.go:3814
k8s.io/kubernetes/test/e2e/network.execAffinityTestForSessionAffinityTimeout(0xc000d506e0, {0x1753f508, 0xc002548f00}, 0xc002a25680)
	test/e2e/network/service.go:3814 +0x6a4
k8s.io/kubernetes/test/e2e/network.glob..func25.30()
	test/e2e/network/service.go:2263 +0x90}

Relevant SIG(s)

/sig testing

@Rajalakshmi-Girish Rajalakshmi-Girish added the kind/flake Categorizes issue or PR as related to a flaky test. label Sep 14, 2022
@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Sep 14, 2022
@k8s-ci-robot
Contributor

@Rajalakshmi-Girish: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 14, 2022
@Rajalakshmi-Girish
Contributor Author

@iXinqi any idea on this failure?
I ask you as you seem to have submitted a test case to the same file recently.

@Rajalakshmi-Girish
Contributor Author

@mkumatag ^^

@aojea
Member

aojea commented Sep 14, 2022

service is not reachable within 2m0s timeout

Can it be an under-resourced environment?
We've seen in the other bug that etcd didn't boot in time; this environment seems very slow.

@Rajalakshmi-Girish
Contributor Author

Rajalakshmi-Girish commented Sep 19, 2022

Can it be an under-resourced environment?

We are running the conformance suite on a multi-node cluster (one master and two workers) with 32 GB of memory and 8 vCPUs per node.
I am not sure whether this is under-resourced for the conformance suite. @aojea, can you please help me understand the resource requirements for running the conformance suite against ci/latest in k8s-release-dev?

We've seen in the other bug that etcd didn't boot in time; this environment seems very slow.

@aojea do you mean bug #112412?
That job runs the k8s unit tests on a test pod in the Prow infrastructure. I think the environments for these two issues are quite different.

@aojea
Member

aojea commented Sep 19, 2022

@aojea do you mean bug #112412?
That job runs the k8s unit tests on a test pod in the Prow infrastructure. I think the environments for these two issues are quite different.

My bad, sorry, I should not have assumed that

@aojea
Member

aojea commented Sep 19, 2022

Those tests are completely green on the other jobs, and they tend to be affected by environment-constraint problems; that is why I wrongly assumed it was the same environment.

This job installs external components like Calico, these tests only seem to fail in this job, and ppc64le is not officially supported... I can't help much here, but I'm happy to answer questions (if I can help).

@mkumatag
Member

Here is the exact error for this failure:

Sep 19 10:49:57.861: INFO: Running '/home/prow/go/src/github.com/ppc64le-cloud/kubetest2-plugins/_rundir/ab4a552e-3803-11ed-a942-f2778d4b64c2/kubectl --kubeconfig=/home/prow/go/src/github.com/ppc64le-cloud/kubetest2-plugins/config1-1663582512/kubeconfig --namespace=services-4493 exec execpod-affinity29fkj -- /bin/sh -x -c echo hostName | nc -v -t -w 2 192.168.160.102 32615'
Sep 19 10:50:02.468: INFO: rc: 1
Sep 19 10:50:02.468: INFO: Service reachability failing with error: error running /home/prow/go/src/github.com/ppc64le-cloud/kubetest2-plugins/_rundir/ab4a552e-3803-11ed-a942-f2778d4b64c2/kubectl --kubeconfig=/home/prow/go/src/github.com/ppc64le-cloud/kubetest2-plugins/config1-1663582512/kubeconfig --namespace=services-4493 exec execpod-affinity29fkj -- /bin/sh -x -c echo hostName | nc -v -t -w 2 192.168.160.102 32615:
Command stdout:

stderr:
+ + echo hostName
nc -v -t -w 2 192.168.160.102 32615
nc: connect to 192.168.160.102 port 32615 (tcp) timed out: Operation in progress
command terminated with exit code 1

error:
exit status 1

It seems to be timing out while reaching the NodePort service on port 32615 on node 192.168.160.102.
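For anyone reproducing this by hand, the same probe the test issues can be re-run from inside the exec pod; this is just a sketch using the names and addresses from the log above (substitute the values from whichever run is failing):

NS=services-4493            # namespace from the failing run
POD=execpod-affinity29fkj   # exec pod from the failing run
NODE_IP=192.168.160.102     # node under test
NODE_PORT=32615             # NodePort under test

# Same check the e2e framework runs: write "hostName" to the NodePort with a 2s timeout.
kubectl --namespace="${NS}" exec "${POD}" -- \
  /bin/sh -x -c "echo hostName | nc -v -t -w 2 ${NODE_IP} ${NODE_PORT}"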

@aojea
Member

aojea commented Sep 19, 2022

Yeah, the underlay network or a constrained environment are the usual suspects, but those tend to hit randomly. If it is consistent, there was a bug on IBM platforms and OpenShift CI related to security groups dropping packets... just throwing out ideas.

@mkumatag
Member

Yeah, the underlay network or a constrained environment are the usual suspects, but those tend to hit randomly. If it is consistent, there was a bug on IBM platforms and OpenShift CI related to security groups dropping packets... just throwing out ideas.

It's not consistent, and that's what is driving me crazy. Security groups are not involved here; it's all on the same private network.

I guess something is wrong with the inter-pod connection; I will debug this further.

@aojea
Member

aojea commented Sep 19, 2022

It's not consistent, and that's what is driving me crazy. Security groups are not involved here; it's all on the same private network.

I guess something is wrong with the inter-pod connection; I will debug this further.

Those are the most difficult ones... I can't see component logs on the job; if you have kube-proxy logs, that may give us a hint.

@mkumatag
Member

mkumatag commented Sep 19, 2022

Those are the most difficult ones... I can't see component logs on the job; if you have kube-proxy logs, that may give us a hint.

Unfortunately we haven't implemented DumpClusterLogs for the kubetest2 plugin we wrote :( Will try to repro manually and get the logs.

@aojea
Member

aojea commented Sep 19, 2022

I suggest discarding the obvious first; last time you had a problem it was solved by updating Calico, #106264 (comment)

@mkumatag
Member

I suggest discarding the obvious first; last time you had a problem it was solved by updating Calico, #106264 (comment)

good point, @Rajalakshmi-Girish ^^

@Rajalakshmi-Girish
Contributor Author

if you have kube-proxy logs, that may give us a hint

kube-proxy-logs.txt
kubelet.txt

@aojea any clue from these logs, please?
We were able to retain an environment where this flakiness can be reproduced!
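(For reference, a minimal way to collect these without DumpClusterLogs, assuming the kubeadm default k8s-app=kube-proxy label; kubelet logs still have to be pulled on each node:)

# Dump kube-proxy logs from every node into a single file.
for p in $(kubectl -n kube-system get pods -l k8s-app=kube-proxy -o name); do
  echo "===== ${p} ====="
  kubectl -n kube-system logs "${p}"
done > kube-proxy-logs.txt

# On each node, kubelet typically runs as a systemd unit:
#   journalctl -u kubelet --no-pager > kubelet.txt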

@aojea
Member

aojea commented Sep 20, 2022

@aojea any clue from these logs, please?

Nothing suspicious in the logs.

If it is an intermittent network problem, you have to repro it and trace the whole path from origin to destination to check where it fails. It seems all the tests involve session affinity for NodePort, and those are tricky: the CNI, the underlay network, or iptables are all possible candidates.
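For that tracing, a rough first pass on a worker node might look like the following (a sketch, assuming kube-proxy in iptables mode, where ClientIP session affinity is implemented with the iptables recent match; the namespace and port are taken from the failing run above, adjust as needed):

# Service chains kube-proxy programmed for the e2e namespace (rule comments embed the namespace/name).
sudo iptables-save -t nat | grep 'services-4493'
# Affinity-related rules use the recent match:
sudo iptables-save -t nat | grep -E 'recent|KUBE-SVC|KUBE-SEP' | less
# Conntrack entries involving the NodePort while nc runs from the exec pod (conntrack tool assumed installed):
sudo conntrack -L 2>/dev/null | grep 32615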

@Rajalakshmi-Girish
Contributor Author

I suggest discarding the obvious first; last time you had a problem it was solved by updating Calico, #106264 (comment)

@aojea I tried running this suite with the latest version of Calico (v3.24.0), but it is flaking there too!
https://prow.ppc64le-cloud.org/view/gs/ppc64le-kubernetes/logs/test-periodic-kubernetes-containerd-conformance-test-ppc64le/1572596481464995840

@aojea
Member

aojea commented Sep 22, 2022

@Rajalakshmi-Girish can you try to run with this patch?

#112663

@Rajalakshmi-Girish
Contributor Author

@Rajalakshmi-Girish can you try to run with this patch?

@aojea It is failing even with this patch.
I re-created the e2e.test binary with this change. Steps I followed:

git clone https://github.com/aojea/kubernetes
cd kubernetes/
git checkout 909a08d
make WHAT="test/e2e/e2e.test"

ginkgo --nodes=4 --until-it-fails e2e.test -- --kubeconfig= --ginkgo.focus="Services\ should\ have\ session\ affinity" --report-dir=/root/artifacts

@aojea
Member

aojea commented Sep 22, 2022

What is the error output with that patch?

@Rajalakshmi-Girish
Contributor Author

What is the error output with that patch?

stderr:
    + nc -v -z -w 2 192.168.155.114 32753
    nc: connect to 192.168.155.114 port 32753 (tcp) timed out: Operation in progress
    command terminated with exit code 1

    error:
    exit status 1
    Retrying...
    Sep 22 16:55:00.490: INFO: Unexpected error:
        <*errors.errorString | 0xc000ab6250>: {
            s: "service is not reachable within 2m0s timeout on endpoint 192.168.155.114:32753 over TCP protocol",
        }
    Sep 22 16:55:00.490: FAIL: service is not reachable within 2m0s timeout on endpoint 192.168.155.114:32753 over TCP protocol

@aojea Please find the complete output attached:
flake_test_output.txt

@BenTheElder
Member

BenTheElder commented Sep 22, 2022

/remove-sig testing

SIG Testing owns the Test Framework, CI infrastructure, test tools, etc. We are not responsible for individual test cases or external CI.
All of these test cases are [sig-network]

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Sep 22, 2022
@aojea
Member

aojea commented Sep 26, 2022

/assign @aojea

Now that I found a problem, I'm curious whether we have something wrong in the tests.

@aojea
Member

aojea commented Sep 28, 2022

It seems most tests are skipped because there are no nodes available; conformance requires 2 nodes minimum:

Requires at least 2 nodes (not -1)

@Rajalakshmi-Girish
Contributor Author

It seems most tests are skipped because there are no nodes available; conformance requires 2 nodes minimum:

Requires at least 2 nodes (not -1)

My bad! I assumed they ran well as the cluster had 2 workers and one master.
I shall look into it and re-run it.

@aojea
Member

aojea commented Sep 28, 2022

No worries. You can filter out the SCTP tests; those are not going to work, based on the logs.

@Rajalakshmi-Girish
Contributor Author

No worries. You can filter out the SCTP tests; those are not going to work, based on the logs.

OK, I shall run the SCTP tests and update here.
But I am not sure why it said INFO: Requires at least 2 nodes (not -1) in spite of having two worker nodes in the Ready state.
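(Side note, as an assumption on my part: the -1 reads like a default node-count value inside the framework rather than a real count, so it may be worth double-checking what the cluster actually exposes, e.g.:)

kubectl get nodes -o wide
# Schedulable (untainted) nodes, which is roughly what the e2e framework counts (jq assumed available):
kubectl get nodes -o json | jq -r '.items[]
  | select(.spec.unschedulable != true)
  | .metadata.name + "  taints=" + ((.spec.taints // []) | map(.key) | join(","))'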

@Rajalakshmi-Girish
Contributor Author

@aojea Yes, 10 out of 16 tests filtered with the SCTP regex failed. Please find the log attached.
SCTP_test_run.txt

@aojea
Member

aojea commented Sep 28, 2022

Heh, sorry, I meant just the opposite: do not run the SCTP tests 🙃

focus="[sig-network]\ Networking\ Granular\ Checks:"
skip=SCTP
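Wired into the same kind of invocation used earlier, that would look roughly like this (the kubeconfig path is a placeholder, and the brackets are escaped because the focus string is a regex):

ginkgo --nodes=4 e2e.test -- \
  --kubeconfig=/path/to/kubeconfig \
  --ginkgo.focus='\[sig-network\]\ Networking\ Granular\ Checks:' \
  --ginkgo.skip='SCTP' \
  --report-dir=/root/artifacts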

@Rajalakshmi-Girish
Contributor Author

Heh, sorry, I meant just the opposite: do not run the SCTP tests 🙃

focus="[sig-network]\ Networking\ Granular\ Checks:" skip=SCTP

Granular_network_serial.txt
Those skips that said INFO: Requires at least 2 nodes (not -1) ran fine when the suite was run serially.

@Rajalakshmi-Girish
Contributor Author

Granular_network_serial.txt
Those skips that said INFO: Requires at least 2 nodes (not -1) ran fine when the suite was run serially.

@aojea Did you get a chance to look at the run results? These tests are still flaking in our jobs.

@aojea
Member

aojea commented Oct 7, 2022

No time, sorry. This kind of problem is usually solved when you can reproduce it on demand, so you can check the whole network path and all the components.

@thockin
Member

thockin commented Nov 9, 2022

is this still alive?

@mkumatag
Member

is this still alive?

Yes, and one of our network people is debugging it.

@aojea
Member

aojea commented Dec 1, 2022

@mkumatag @Rajalakshmi-Girish can you paste the kube-proxy config?

@aojea
Member

aojea commented Dec 1, 2022

I'm mostly interested in knowing the value of minSyncPeriod, xref #114171 (comment)

@Rajalakshmi-Girish
Contributor Author

I have fetched the config details from the ConfigMap named kube-proxy in the cluster.
It shows minSyncPeriod: 0s:

config.conf: |-
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    bindAddress: 0.0.0.0
    bindAddressHardFail: false
    clientConnection:
      acceptContentTypes: ""
      burst: 0
      contentType: ""
      kubeconfig: /var/lib/kube-proxy/kubeconfig.conf
      qps: 0
    clusterCIDR: 172.20.0.0/16
    configSyncPeriod: 0s
    conntrack:
      maxPerCore: null
      min: null
      tcpCloseWaitTimeout: null
      tcpEstablishedTimeout: null
    detectLocal:
      bridgeInterface: ""
      interfaceNamePrefix: ""
    detectLocalMode: ""
    enableProfiling: false
    healthzBindAddress: ""
    hostnameOverride: ""
    iptables:
      masqueradeAll: false
      masqueradeBit: null
      minSyncPeriod: 0s
      syncPeriod: 0s
    ipvs:
      excludeCIDRs: null
      minSyncPeriod: 0s
      scheduler: ""
      strictARP: false
      syncPeriod: 0s
      tcpFinTimeout: 0s
      tcpTimeout: 0s
      udpTimeout: 0s
    kind: KubeProxyConfiguration
    metricsBindAddress: ""
    mode: ""
    nodePortAddresses: null
    oomScoreAdj: null
    portRange: ""
    showHiddenMetricsForVersion: ""
    udpIdleTimeout: 0s
    winkernel:
      enableDSR: false
      forwardHealthCheckVip: false
      networkName: ""
      rootHnsEndpointName: ""
      sourceVip: ""

@aojea is this the information you were looking for?
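(For completeness, one way to pull the config above, assuming a kubeadm-style deployment where kube-proxy's configuration lives in the kube-proxy ConfigMap in kube-system:)

kubectl -n kube-system get configmap kube-proxy -o yaml
# or only the embedded KubeProxyConfiguration:
kubectl -n kube-system get configmap kube-proxy -o jsonpath='{.data.config\.conf}'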

@aojea
Member

aojea commented Dec 1, 2022

minSyncPeriod: 0s means it is using the default, which is 1 second, so it doesn't seem related.

If you add verbosity level 2 to kube-proxy, you should be able to find the exact value in the logs, to rule out the flags overriding the value.
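A rough way to do that on a kubeadm-style cluster (the DaemonSet name and command lines below are assumptions; adjust to whatever the manifests actually contain):

# Add --v=2 to the kube-proxy container command in the DaemonSet, then let it roll out.
kubectl -n kube-system edit daemonset kube-proxy
#   - command:
#     - /usr/local/bin/kube-proxy
#     - --config=/var/lib/kube-proxy/config.conf
#     - --v=2          # added
kubectl -n kube-system rollout status daemonset kube-proxy
kubectl -n kube-system logs -l k8s-app=kube-proxy | grep -i "sync params"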

@Rajalakshmi-Girish
Contributor Author

Rajalakshmi-Girish commented Dec 1, 2022

If you add verbosity level 2 to kube-proxy, you should be able to find the exact value in the logs, to rule out the flags overriding the value.

@aojea
It is 1 second, and no flag seems to override this default value for minSyncPeriod.
I see it in the logs after adding verbosity level 2:

I1201 17:00:27.800529       1 proxier.go:294] "Iptables sync params" ipFamily=IPv4 minSyncPeriod="1s" syncPeriod="30s" burstSyncs=2
I1201 17:00:27.800579       1 proxier.go:304] "Iptables supports --random-fully" ipFamily=IPv4
I1201 17:00:27.800595       1 proxier.go:234] "Setting route_localnet=1, use nodePortAddresses to filter loopback addresses for NodePorts to skip it https://issues.k8s.io/90259"

@aojea
Member

aojea commented Dec 1, 2022

Nah, not the same thing :/

@Rajalakshmi-Girish
Contributor Author

I happened to create a new cluster today with the current ci/latest version of k8s (v1.27.0-alpha.0.54+79cba170b55bd0).

Though the test [sig-network] Services should have session affinity work for NodePort service [LinuxOnly] [Conformance] PASSED on this cluster, there is an nc failure (nc -v -z -w 2 <node IP> <node port>) from the execpod to one of the nodes that the test didn't pick while running. The test would have failed if it had picked the node to which nc is not successful!

Because the lines https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/service/jig.go#L969:#L970 consider only two nodes for testing, the test PASSES when the issue is with just one node (sometimes the nc failure occurs to more than one node from the execpod) and that node isn't picked by the test case.

Thus, I would like to reiterate that these tests have been flaking in our environment for more than a couple of months :|
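A way to confirm that on a retained cluster is to probe the NodePort on every node from the exec pod, not just the two the framework samples; a sketch with placeholder names:

NS=<test-namespace>      # placeholder: namespace of the retained test
POD=<execpod-name>       # placeholder: the execpod-affinity* pod
NODE_PORT=<nodeport>     # placeholder: the service's NodePort

for ip in $(kubectl get nodes \
    -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'); do
  echo "--- ${ip}:${NODE_PORT}"
  kubectl -n "${NS}" exec "${POD}" -- /bin/sh -c "nc -v -z -w 2 ${ip} ${NODE_PORT}" \
    || echo "FAILED: ${ip}"
done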

@aojea
Member

aojea commented Dec 2, 2022

there is an nc failure (nc -v -z -w 2 <node IP> <node port>) from the execpod to one of the nodes that the test didn't pick while running

Can you elaborate on that: how did the exec fail and the test not pick it up? I can see that it returns an error if it fails:

err := testEndpointReachability(internalAddr, sp.NodePort, sp.Protocol, pod)
if err != nil {
	return err
}

Thus, I would like to reiterate that these tests have been flaking in our environment for more than a couple of months :|

This is a downstream environment; there is not much the community can do beyond offering suggestions without access to the environment, and there are a lot of external factors that can influence the problem. You mentioned that there was a network person on your side debugging it; what are their findings? Maybe that can give us some leads.

@Rajalakshmi-Girish
Contributor Author

Can you elaborate on that: how did the exec fail and the test not pick it up? I can see that it returns an error if it fails

As mentioned in https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/service/jig.go#L969:#L970, the function testReachabilityOverNodePorts is exercised against only two nodes. In our cluster (with 2 workers and 1 master), when the node with the nc failure is not chosen, the test case PASSES; otherwise it FAILS.

You mentioned that there was a network person on your side debugging it; what are their findings?

Debugging is in progress. I will share here once there are any findings.

@aojea
Member

aojea commented Dec 2, 2022

when the node with the nc failure is not chosen, the test case PASSES; otherwise it FAILS.

Sorry, this is the part I'm not getting: is the test running that nc, or are you running that nc in parallel... or how is this happening?

@Rajalakshmi-Girish
Contributor Author

is the test running that nc, or are you running that nc in parallel...

Yes, the test runs the nc commands at https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/service/jig.go#L905:#L908.

When the func testReachabilityOverNodePorts calls func testEndpointReachability at https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/service/jig.go#L867, it passes the internalAddr of only the 2 nodes fetched from https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/service/jig.go#L969:#L970.

Thus, if there is an nc failure via the third node, the test still PASSES.

@aojea
Member

aojea commented Dec 2, 2022

Thus, if there is an nc failure via the third node, the test still PASSES.

but is this something you are seeing or a hypothesis?

@Rajalakshmi-Girish
Contributor Author

Rajalakshmi-Girish commented Dec 2, 2022

but is this something you are seeing or a hypothesis?

We are seeing this on clusters created with the ci/latest version of k8s in our environment.

Though the test run against such an environment PASSES, there is still an nc command failure from the execpod to the node that the test did not exercise (we see this failure when the nc command is run manually from the execpod).

If the test happens to choose the node IP that has the failure, the test FAILS.
Hence these tests are flaking.

@aojea
Member

aojea commented Dec 3, 2022

We can't test every nodePort on every node of a cluster because that doesn't scale; these tests run in clusters with 5k, 10k and 15k nodes.

There are some trade-offs, and this approach has worked well for a long time. We can always revisit it if we detect a considerable rate of false positives, but if that is happening in your environment then you have to investigate why: dump the iptables rules on all nodes and see whether they differ, and whether there is a bug in the code or an environmental problem with that specific node.
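A sketch of that comparison (node names and SSH access are illustrative; some per-node differences are expected, but a node missing the service's KUBE-SVC/KUBE-SEP chains would stand out):

for node in worker-1 worker-2; do
  ssh "${node}" 'sudo iptables-save -t nat' | grep '^-A KUBE-' | sort > "nat-${node}.txt"
done
diff nat-worker-1.txt nat-worker-2.txt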

@thockin
Member

thockin commented Dec 21, 2022

Should we close this, then?

@Rajalakshmi-Girish
Contributor Author

Is https://issues.redhat.com/browse/OCPBUGS-4503 related to this?

@aojea
Member

aojea commented Dec 23, 2022

Is https://issues.redhat.com/browse/OCPBUGS-4503 related to this?

Some of the logs say you are using Calico; I didn't check recent runs, but if you are using ovn-kubernetes, then it seems so.

@thockin thockin closed this as completed Jan 5, 2023