
Flake: periodic-kubernetes-containerd-conformance-test-ppc64le, e2e conformance test in /e2e/network/service.go related to session affinity #112442

Closed
Rajalakshmi-Girish opened this issue Sep 14, 2022 · 53 comments
Assignees
Labels
kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/network Categorizes an issue or PR as relevant to SIG Network.

Comments

@Rajalakshmi-Girish
Contributor

Which jobs are flaking?

https://prow.k8s.io/job-history/gs/ppc64le-kubernetes/logs/periodic-kubernetes-containerd-conformance-test-ppc64le

Which tests are flaking?

  • [sig-network] Services should be able to switch session affinity for NodePort service [LinuxOnly] [Conformance]
  • [sig-network] Services should have session affinity timeout work for NodePort service [LinuxOnly] [Conformance]
  • [sig-network] Services should have session affinity work for NodePort service [LinuxOnly] [Conformance]

Since when has it been flaking?

Probably for about a fortnight.

Testgrid link

https://k8s-testgrid.appspot.com/ibm-conformance-ppc64le#Periodic%20Conformance%20on%20ppc64le%20with%20containerd%20as%20runtime

Reason for failure (if possible)

No response

Anything else we need to know?

Below is the failure:

{Sep  9 07:46:59.187: service is not reachable within 2m0s timeout on endpoint 192.168.159.221:30141 over TCP protocol failed test/e2e/network/service.go:3814
k8s.io/kubernetes/test/e2e/network.execAffinityTestForSessionAffinityTimeout(0xc000d506e0, {0x1753f508, 0xc002548f00}, 0xc002a25680)
	test/e2e/network/service.go:3814 +0x6a4
k8s.io/kubernetes/test/e2e/network.glob..func25.30()
	test/e2e/network/service.go:2263 +0x90}

Relevant SIG(s)

/sig testing

@Rajalakshmi-Girish Rajalakshmi-Girish added the kind/flake Categorizes issue or PR as related to a flaky test. label Sep 14, 2022
@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Sep 14, 2022
@k8s-ci-robot
Contributor

@Rajalakshmi-Girish: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Sep 14, 2022
@Rajalakshmi-Girish
Contributor Author

@iXinqi any idea on this failure?
I ask you as you seem to have submitted a test case to the same file recently.

@Rajalakshmi-Girish
Contributor Author

@mkumatag ^^

@aojea
Member

aojea commented Sep 14, 2022

service is not reachable within 2m0s timeout

Can it be an under-resourced environment?
We've seen in the other bug that etcd didn't boot in time; this environment seems very slow.

@Rajalakshmi-Girish
Contributor Author

Rajalakshmi-Girish commented Sep 19, 2022

Can it be an under-resourced environment?

We are running the conformance suite on a multi-node cluster (one master and two workers) with 32 GB of memory and 8 vCPUs per node.
I am not sure whether this is under-resourced for the conformance suite. @aojea, can you please help me understand the resource requirements for running the conformance suite against ci/latest in k8s-release-dev?

We've seen in the other bug that etcd didn't boot in time; this environment seems very slow.

@aojea do you mean bug #112412?
That job runs the k8s unit tests on a test pod in the Prow infrastructure. I think the environments for these two issues are quite different.

@aojea
Member

aojea commented Sep 19, 2022

@aojea do you mean bug #112412?
That job runs the k8s unit tests on a test pod in the Prow infrastructure. I think the environments for these two issues are quite different.

My bad, sorry, I should not have assumed that

@aojea
Member

aojea commented Sep 19, 2022

Those tests are completely green on the other jobs, and they tend to be affected by environment-constraint problems; that is why I wrongly assumed it was the same environment.

This job installs external components like Calico, these tests only seem to fail in this job, and ppc64le is not officially supported... I can't help much here, but I'm happy to answer questions (if I can help).

@mkumatag
Member

Here is the exact error for this failure:

Sep 19 10:49:57.861: INFO: Running '/home/prow/go/src/github.com/ppc64le-cloud/kubetest2-plugins/_rundir/ab4a552e-3803-11ed-a942-f2778d4b64c2/kubectl --kubeconfig=/home/prow/go/src/github.com/ppc64le-cloud/kubetest2-plugins/config1-1663582512/kubeconfig --namespace=services-4493 exec execpod-affinity29fkj -- /bin/sh -x -c echo hostName | nc -v -t -w 2 192.168.160.102 32615'
Sep 19 10:50:02.468: INFO: rc: 1
Sep 19 10:50:02.468: INFO: Service reachability failing with error: error running /home/prow/go/src/github.com/ppc64le-cloud/kubetest2-plugins/_rundir/ab4a552e-3803-11ed-a942-f2778d4b64c2/kubectl --kubeconfig=/home/prow/go/src/github.com/ppc64le-cloud/kubetest2-plugins/config1-1663582512/kubeconfig --namespace=services-4493 exec execpod-affinity29fkj -- /bin/sh -x -c echo hostName | nc -v -t -w 2 192.168.160.102 32615:
Command stdout:

stderr:
+ + echo hostName
nc -v -t -w 2 192.168.160.102 32615
nc: connect to 192.168.160.102 port 32615 (tcp) timed out: Operation in progress
command terminated with exit code 1

error:
exit status 1

It seems to be timing out while reaching the NodePort service on port 32615 on node 192.168.160.102.
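For anyone reproducing this by hand, the same probe the test issues can be re-run from inside the exec pod; this is just a sketch using the names and addresses from the log above (substitute the values from whichever run is failing):

NS=services-4493            # namespace from the failing run
POD=execpod-affinity29fkj   # exec pod from the failing run
NODE_IP=192.168.160.102     # node under test
NODE_PORT=32615             # NodePort under test

# Same check the e2e framework runs: write "hostName" to the NodePort with a 2s timeout.
kubectl --namespace="${NS}" exec "${POD}" -- \
  /bin/sh -x -c "echo hostName | nc -v -t -w 2 ${NODE_IP} ${NODE_PORT}"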

@aojea
Member

aojea commented Sep 19, 2022

Yeah, the underlay network or a constrained environment are the usual suspects, but those tend to hit randomly. If it is consistent, there was a bug on IBM platforms and OpenShift CI related to security groups dropping packets... just throwing out ideas.

@mkumatag
Member

Yeah, the underlay network or a constrained environment are the usual suspects, but those tend to hit randomly. If it is consistent, there was a bug on IBM platforms and OpenShift CI related to security groups dropping packets... just throwing out ideas.

It's not consistent, and that's what is driving me crazy. Security groups are not involved here; it's all on the same private network.

I guess something is wrong with the inter-pod connection; I will debug this further.

@aojea
Member

aojea commented Sep 19, 2022

It's not consistent, and that's what is driving me crazy. Security groups are not involved here; it's all on the same private network.

I guess something is wrong with the inter-pod connection; I will debug this further.

Those are the most difficult ones... I can't see component logs on the job; if you have kube-proxy logs, that may give us a hint.

@mkumatag
Member

mkumatag commented Sep 19, 2022

Those are the most difficult ones... I can't see component logs on the job; if you have kube-proxy logs, that may give us a hint.

Unfortunately we haven't implemented DumpClusterLogs for the kubetest2 plugin we wrote :( Will try to repro manually and get the logs.

@aojea
Member

aojea commented Sep 19, 2022

I suggest discarding the obvious first; last time you had a problem it was solved by updating Calico, #106264 (comment)

@mkumatag
Member

I suggest discarding the obvious first; last time you had a problem it was solved by updating Calico, #106264 (comment)

good point, @Rajalakshmi-Girish ^^

@Rajalakshmi-Girish
Contributor Author

if you have kube-proxy logs, that may give us a hint

kube-proxy-logs.txt
kubelet.txt

@aojea any clue from these logs, please?
We were able to retain an environment where this flakiness can be reproduced!
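(For reference, a minimal way to collect these without DumpClusterLogs, assuming the kubeadm default k8s-app=kube-proxy label; kubelet logs still have to be pulled on each node:)

# Dump kube-proxy logs from every node into a single file.
for p in $(kubectl -n kube-system get pods -l k8s-app=kube-proxy -o name); do
  echo "===== ${p} ====="
  kubectl -n kube-system logs "${p}"
done > kube-proxy-logs.txt

# On each node, kubelet typically runs as a systemd unit:
#   journalctl -u kubelet --no-pager > kubelet.txt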

@aojea
Member

aojea commented Sep 20, 2022

@aojea any clue from these logs, please?

Nothing suspicious in the logs.

If it is an intermittent network problem, you have to repro it and trace the whole path from origin to destination to check where it fails. It seems all the tests involve session affinity for NodePort, and those are tricky: the CNI, the underlay network, or iptables are all possible candidates.
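For that tracing, a rough first pass on a worker node might look like the following (a sketch, assuming kube-proxy in iptables mode, where ClientIP session affinity is implemented with the iptables recent match; the namespace and port are taken from the failing run above, adjust as needed):

# Service chains kube-proxy programmed for the e2e namespace (rule comments embed the namespace/name).
sudo iptables-save -t nat | grep 'services-4493'
# Affinity-related rules use the recent match:
sudo iptables-save -t nat | grep -E 'recent|KUBE-SVC|KUBE-SEP' | less
# Conntrack entries involving the NodePort while nc runs from the exec pod (conntrack tool assumed installed):
sudo conntrack -L 2>/dev/null | grep 32615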

@Rajalakshmi-Girish
Contributor Author

I suggest discarding the obvious first; last time you had a problem it was solved by updating Calico, #106264 (comment)

@aojea I tried running this suite with the latest version of Calico (v3.24.0), but it is flaking there too!
https://prow.ppc64le-cloud.org/view/gs/ppc64le-kubernetes/logs/test-periodic-kubernetes-containerd-conformance-test-ppc64le/1572596481464995840

@aojea
Member

aojea commented Sep 22, 2022

@Rajalakshmi-Girish can you try to run with this patch?

#112663

@Rajalakshmi-Girish
Contributor Author

@Rajalakshmi-Girish can you try to run with this patch?

@aojea It is failing even with this patch.
I re-created the e2e.test binary with this change. Steps I followed:

git clone https://github.com/aojea/kubernetes
cd kubernetes/
git checkout 909a08d
make WHAT="test/e2e/e2e.test"

ginkgo --nodes=4 --until-it-fails e2e.test -- --kubeconfig= --ginkgo.focus="Services\ should\ have\ session\ affinity" --report-dir=/root/artifacts

@aojea
Member

aojea commented Sep 22, 2022

What is the error output with that patch?

@Rajalakshmi-Girish
Contributor Author

What is the error output with that patch?

stderr:
    + nc -v -z -w 2 192.168.155.114 32753
    nc: connect to 192.168.155.114 port 32753 (tcp) timed out: Operation in progress
    command terminated with exit code 1

    error:
    exit status 1
    Retrying...
    Sep 22 16:55:00.490: INFO: Unexpected error:
        <*errors.errorString | 0xc000ab6250>: {
            s: "service is not reachable within 2m0s timeout on endpoint 192.168.155.114:32753 over TCP protocol",
        }
    Sep 22 16:55:00.490: FAIL: service is not reachable within 2m0s timeout on endpoint 192.168.155.114:32753 over TCP protocol

@aojea Please find the complete output attached:
flake_test_output.txt

@BenTheElder
Member

BenTheElder commented Sep 22, 2022

/remove-sig testing

SIG Testing owns the Test Framework, CI infrastructure, test tools, etc. We are not responsible for individual test cases or external CI.
All of these test cases are [sig-network]

/sig network

@k8s-ci-robot k8s-ci-robot added sig/network Categorizes an issue or PR as relevant to SIG Network. and removed sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Sep 22, 2022
@aojea
Member

aojea commented Sep 26, 2022

/assign @aojea

Now that I found a problem, I'm curious whether we have something wrong in the tests.

@aojea
Member

aojea commented Sep 28, 2022

It seems most tests are skipped because there are no nodes available; conformance requires 2 nodes minimum:

Requires at least 2 nodes (not -1)

@Rajalakshmi-Girish
Contributor Author

It seems most tests are skipped because there are no nodes available; conformance requires 2 nodes minimum:

Requires at least 2 nodes (not -1)

My bad! I assumed they ran well as the cluster had 2 workers and one master.
I shall look into it and re-run it.

@aojea
Member

aojea commented Sep 28, 2022

No worries. You can filter out the SCTP tests; those are not going to work, based on the logs.

@Rajalakshmi-Girish
Contributor Author

No worries. You can filter out the SCTP tests; those are not going to work, based on the logs.

OK, I shall run the SCTP tests and update here.
But I am not sure why it said INFO: Requires at least 2 nodes (not -1) in spite of having two worker nodes in the Ready state.
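(Side note, as an assumption on my part: the -1 reads like a default node-count value inside the framework rather than a real count, so it may be worth double-checking what the cluster actually exposes, e.g.:)

kubectl get nodes -o wide
# Schedulable (untainted) nodes, which is roughly what the e2e framework counts (jq assumed available):
kubectl get nodes -o json | jq -r '.items[]
  | select(.spec.unschedulable != true)
  | .metadata.name + "  taints=" + ((.spec.taints // []) | map(.key) | join(","))'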

@Rajalakshmi-Girish
Contributor Author

@aojea Yes, 10 out of 16 tests filtered with the SCTP regex failed. Please find the log attached.
SCTP_test_run.txt

@aojea
Member

aojea commented Sep 28, 2022

Heh, sorry, I meant just the opposite: do not run the SCTP tests 🙃

focus="[sig-network]\ Networking\ Granular\ Checks:"
skip=SCTP
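Wired into the same kind of invocation used earlier, that would look roughly like this (the kubeconfig path is a placeholder, and the brackets are escaped because the focus string is a regex):

ginkgo --nodes=4 e2e.test -- \
  --kubeconfig=/path/to/kubeconfig \
  --ginkgo.focus='\[sig-network\]\ Networking\ Granular\ Checks:' \
  --ginkgo.skip='SCTP' \
  --report-dir=/root/artifacts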

@Rajalakshmi-Girish
Contributor Author

Heh, sorry, I meant just the opposite: do not run the SCTP tests 🙃

focus="[sig-network]\ Networking\ Granular\ Checks:" skip=SCTP

Granular_network_serial.txt
Those skips that said INFO: Requires at least 2 nodes (not -1) ran fine when the suite was run serially.

@Rajalakshmi-Girish
Contributor Author

Granular_network_serial.txt
Those skips that said INFO: Requires at least 2 nodes (not -1) ran fine when the suite was run serially.

@aojea Did you get a chance to look at the run results? These tests are still flaking in our jobs.

@aojea
Member

aojea commented Oct 7, 2022

No time, sorry. This kind of problem is usually solved when you can reproduce it on demand, so you can check the whole network path and all the components.

@thockin
Member

thockin commented Nov 9, 2022

is this still alive?

@mkumatag
Member

is this still alive?

Yes, and one of our network people is debugging it.

@aojea
Member

aojea commented Dec 1, 2022

@mkumatag @Rajalakshmi-Girish can you paste the kube-proxy config?

@aojea
Member

aojea commented Dec 1, 2022

I'm mostly interested in knowing the value of minSyncPeriod, xref #114171 (comment)

@Rajalakshmi-Girish
Contributor Author

I have fetched the config details from the ConfigMap named kube-proxy in the cluster.
It shows minSyncPeriod: 0s:

config.conf: |-
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    bindAddress: 0.0.0.0
    bindAddressHardFail: false
    clientConnection:
      acceptContentTypes: ""
      burst: 0
      contentType: ""
      kubeconfig: /var/lib/kube-proxy/kubeconfig.conf
      qps: 0
    clusterCIDR: 172.20.0.0/16
    configSyncPeriod: 0s
    conntrack:
      maxPerCore: null
      min: null
      tcpCloseWaitTimeout: null
      tcpEstablishedTimeout: null
    detectLocal:
      bridgeInterface: ""
      interfaceNamePrefix: ""
    detectLocalMode: ""
    enableProfiling: false
    healthzBindAddress: ""
    hostnameOverride: ""
    iptables:
      masqueradeAll: false
      masqueradeBit: null
      minSyncPeriod: 0s
      syncPeriod: 0s
    ipvs:
      excludeCIDRs: null
      minSyncPeriod: 0s
      scheduler: ""
      strictARP: false
      syncPeriod: 0s
      tcpFinTimeout: 0s
      tcpTimeout: 0s
      udpTimeout: 0s
    kind: KubeProxyConfiguration
    metricsBindAddress: ""
    mode: ""
    nodePortAddresses: null
    oomScoreAdj: null
    portRange: ""
    showHiddenMetricsForVersion: ""
    udpIdleTimeout: 0s
    winkernel:
      enableDSR: false
      forwardHealthCheckVip: false
      networkName: ""
      rootHnsEndpointName: ""
      sourceVip: ""

@aojea is this the information you were looking for?
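(For completeness, one way to pull the config above, assuming a kubeadm-style deployment where kube-proxy's configuration lives in the kube-proxy ConfigMap in kube-system:)

kubectl -n kube-system get configmap kube-proxy -o yaml
# or only the embedded KubeProxyConfiguration:
kubectl -n kube-system get configmap kube-proxy -o jsonpath='{.data.config\.conf}'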

@aojea
Member

aojea commented Dec 1, 2022

minSyncPeriod: 0s means it is using the default, which is 1 second, so it doesn't seem related.

If you add verbosity level 2 to kube-proxy, you should be able to find the exact value in the logs, to rule out the flags overriding the value.
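A rough way to do that on a kubeadm-style cluster (the DaemonSet name and command lines below are assumptions; adjust to whatever the manifests actually contain):

# Add --v=2 to the kube-proxy container command in the DaemonSet, then let it roll out.
kubectl -n kube-system edit daemonset kube-proxy
#   - command:
#     - /usr/local/bin/kube-proxy
#     - --config=/var/lib/kube-proxy/config.conf
#     - --v=2          # added
kubectl -n kube-system rollout status daemonset kube-proxy
kubectl -n kube-system logs -l k8s-app=kube-proxy | grep -i "sync params"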

@Rajalakshmi-Girish
Contributor Author

Rajalakshmi-Girish commented Dec 1, 2022

If you add verbosity level 2 to kube-proxy, you should be able to find the exact value in the logs, to rule out the flags overriding the value.

@aojea
It is 1 second, and no flag seems to override this default value for minSyncPeriod.
I see it in the logs after adding verbosity level 2:

I1201 17:00:27.800529       1 proxier.go:294] "Iptables sync params" ipFamily=IPv4 minSyncPeriod="1s" syncPeriod="30s" burstSyncs=2
I1201 17:00:27.800579       1 proxier.go:304] "Iptables supports --random-fully" ipFamily=IPv4
I1201 17:00:27.800595       1 proxier.go:234] "Setting route_localnet=1, use nodePortAddresses to filter loopback addresses for NodePorts to skip it https://issues.k8s.io/90259"

@aojea
Member

aojea commented Dec 1, 2022

Nah, not the same thing :/

@Rajalakshmi-Girish
Contributor Author

I happened to create a new cluster today with the current ci/latest version of k8s (v1.27.0-alpha.0.54+79cba170b55bd0).

Though the test [sig-network] Services should have session affinity work for NodePort service [LinuxOnly] [Conformance] PASSED on this cluster, there is an nc failure (nc -v -z -w 2 <node IP> <node port>) from the execpod to one of the nodes that the test didn't pick while running. The test would have failed if it had picked the node to which nc is not successful!

Because the lines https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/service/jig.go#L969:#L970 consider only two nodes for testing, the test PASSES when the issue is with just one node (sometimes the nc failure occurs to more than one node from the execpod) and that node isn't picked by the test case.

Thus, I would like to reiterate that these tests have been flaking in our environment for more than a couple of months :|
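A way to confirm that on a retained cluster is to probe the NodePort on every node from the exec pod, not just the two the framework samples; a sketch with placeholder names:

NS=<test-namespace>      # placeholder: namespace of the retained test
POD=<execpod-name>       # placeholder: the execpod-affinity* pod
NODE_PORT=<nodeport>     # placeholder: the service's NodePort

for ip in $(kubectl get nodes \
    -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'); do
  echo "--- ${ip}:${NODE_PORT}"
  kubectl -n "${NS}" exec "${POD}" -- /bin/sh -c "nc -v -z -w 2 ${ip} ${NODE_PORT}" \
    || echo "FAILED: ${ip}"
done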

@aojea
Member

aojea commented Dec 2, 2022

there is an nc failure (nc -v -z -w 2 <node IP> <node port>) from the execpod to one of the nodes that the test didn't pick while running

Can you elaborate on that: how did the exec fail and the test not pick it up? I can see that it returns an error if it fails:

err := testEndpointReachability(internalAddr, sp.NodePort, sp.Protocol, pod)
if err != nil {
	return err
}

Thus, I would like to reiterate that these tests have been flaking in our environment for more than a couple of months :|

This is a downstream environment; there is not much the community can do beyond offering suggestions without access to the environment, and there are a lot of external factors that can influence the problem. You mentioned that there was a network person on your side debugging it; what are their findings? Maybe that can give us some leads.

@Rajalakshmi-Girish
Contributor Author

Can you elaborate on that: how did the exec fail and the test not pick it up? I can see that it returns an error if it fails

As mentioned in https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/service/jig.go#L969:#L970, the function testReachabilityOverNodePorts is exercised against only two nodes. In our cluster (with 2 workers and 1 master), when the node with the nc failure is not chosen, the test case PASSES; otherwise it FAILS.

You mentioned that there was a network person on your side debugging it; what are their findings?

Debugging is in progress. I will share here once there are any findings.

@aojea
Member

aojea commented Dec 2, 2022

when the node with the nc failure is not chosen, the test case PASSES; otherwise it FAILS.

Sorry, this is the part I'm not getting: is the test running that nc, or are you running that nc in parallel... or how is this happening?

@Rajalakshmi-Girish
Contributor Author

is the test running that nc, or are you running that nc in parallel...

Yes, the test runs the nc commands at https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/service/jig.go#L905:#L908.

When the func testReachabilityOverNodePorts calls func testEndpointReachability at https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/service/jig.go#L867, it passes the internalAddr of only the 2 nodes fetched from https://github.com/kubernetes/kubernetes/blob/master/test/e2e/framework/service/jig.go#L969:#L970.

Thus, if there is an nc failure via the third node, the test still PASSES.

@aojea
Member

aojea commented Dec 2, 2022

Thus, if there is an nc failure via the third node, the test still PASSES.

but is this something you are seeing or a hypothesis?

@Rajalakshmi-Girish
Contributor Author

Rajalakshmi-Girish commented Dec 2, 2022

but is this something you are seeing or a hypothesis?

We are seeing this on clusters created with the ci/latest version of k8s in our environment.

Though the test run against such an environment PASSES, there is still an nc command failure from the execpod to the node that the test did not exercise (we see this failure when the nc command is run manually from the execpod).

If the test happens to choose the node IP that has the failure, the test FAILS.
Hence these tests are flaking.

@aojea
Member

aojea commented Dec 3, 2022

We can't test every nodePort on every node of a cluster because that doesn't scale; these tests run in clusters with 5k, 10k and 15k nodes.

There are some trade-offs, and this approach has worked well for a long time. We can always revisit it if we detect a considerable rate of false positives, but if that is happening in your environment then you have to investigate why: dump the iptables rules on all nodes and see whether they differ, and whether there is a bug in the code or an environmental problem with that specific node.
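A sketch of that comparison (node names and SSH access are illustrative; some per-node differences are expected, but a node missing the service's KUBE-SVC/KUBE-SEP chains would stand out):

for node in worker-1 worker-2; do
  ssh "${node}" 'sudo iptables-save -t nat' | grep '^-A KUBE-' | sort > "nat-${node}.txt"
done
diff nat-worker-1.txt nat-worker-2.txt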

@thockin
Member

thockin commented Dec 21, 2022

Should we close this, then?

@Rajalakshmi-Girish
Contributor Author

Is https://issues.redhat.com/browse/OCPBUGS-4503 related to this?

@aojea
Member

aojea commented Dec 23, 2022

Is https://issues.redhat.com/browse/OCPBUGS-4503 related to this?

Some of the logs say you are using Calico; I didn't check recent runs, but if you are using ovn-kubernetes, then it seems so.

@thockin thockin closed this as completed Jan 5, 2023