Make upgrade testing consistent #56787

Open
foxish opened this Issue Dec 4, 2017 · 23 comments

@foxish (Member) commented Dec 4, 2017

After GCE cluster upgrade, the nodes talk to the master using the in-cluster IP.

Reasoning:

api-server.log from gce upgrade cluster

I1201 13:56:34.287956 5 wrap.go:42] PATCH /api/v1/nodes/bootstrap-e2e-minion-group-hv6p/status: (18.100082ms) 200 [[node-problem-detector/v1.4.0 (linux/amd64) kubernetes/$Format] 10.128.0.4:53766]
I1201 13:56:34.515042 5 wrap.go:42] PATCH /api/v1/nodes/bootstrap-e2e-master/status: (4.327563ms) 200 [[kubelet/v1.9.0 (linux/amd64) kubernetes/e067596] 10.128.0.2:41898]

api-server.log from gce serial

I1201 15:59:46.863961 5 wrap.go:42] GET /api/v1/nodes/test-34cf3ed1e3-minion-group-zr99?resourceVersion=0: (926.753µs) 200 [[kubelet/v1.9.0 (linux/amd64) kubernetes/e067596] 104.154.254.154:40220]
I1201 15:59:46.881810 5 wrap.go:42] PATCH /api/v1/nodes/test-34cf3ed1e3-minion-group-zr99/status: (10.157704ms) 200 [[kubelet/v1.9.0 (linux/amd64) kubernetes/e067596] 104.154.254.154:40220]

cc @zmerlynn @krousey @mikedanese @kubernetes/test-infra-maintainers

@foxish (Member, Author) commented Dec 4, 2017

@krousey, this is a blocker for network partition tests on GCE which rely on a consistent network interface being used for testing.

@krousey (Member) commented Dec 4, 2017

I haven't been following the network features too closely this release cycle. Did GCE enable private IP and switch it on by default? Or is this something else? If we did switch the master to private IP by default, the upgrade should respect that the cluster had a public IP and keep it. However, I don't think that's what is happening here because clients would probably break. Can someone from sig-network advise?

cc @kubernetes/sig-network-misc

@dnardo (Contributor) commented Dec 4, 2017

@krousey Short answer no. Nothing sets this to the private IP by default that I know of. I need more info. Is this GKE or GCE?

@krousey (Member) commented Dec 4, 2017

@dnardo GCE according to the first post. I think @foxish (correct me if I'm wrong) is referring to this test suite: https://k8s-testgrid.appspot.com/sig-release-1.9-all#gke-1.8-1.9-upgrade-cluster-new&sort-by-failures=

Looks like the network partition tests aren't happy because something about the network is changing (at least, that's what this issue claims).

@foxish (Member, Author) commented Dec 4, 2017

This is happening only in the GCE environment in cluster-upgrade tests.

@foxish (Member, Author) commented Dec 4, 2017

The GKE failures (and the new runs of the network partition tests) are due to a botched fix (#56789) that I tried. I've linked two offending runs in the issue itself.

Explaining further:
The network partition tests insert iptables REJECT rules to simulate network partitions. They assume that the node is talking to the master on the external IP and temporarily insert a rule blocking that communication. The tests fail after the GCE upgrade (and only in that case) because the master and the nodes are now communicating over the in-cluster IP, so the REJECT rule has no effect.
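For illustration, here is a minimal sketch of that approach; the helper names and the master IP are placeholders, not the actual e2e framework code (which runs the commands on the node over SSH):

```go
// Hypothetical sketch of how a network-partition test simulates a partition:
// insert an iptables REJECT rule on the node for traffic addressed to the
// master, then delete it again to "heal" the partition. Here we only build
// and print the commands.
package main

import "fmt"

// blockMasterCommand returns an iptables command that rejects traffic from the
// node to the given master address. If the node is actually reaching the master
// over a different address (e.g. the in-cluster IP after an upgrade), this rule
// never matches and the "partition" has no effect.
func blockMasterCommand(masterIP string) string {
	return fmt.Sprintf("sudo iptables --insert OUTPUT --destination %s --jump REJECT", masterIP)
}

// unblockMasterCommand removes the rule, restoring connectivity.
func unblockMasterCommand(masterIP string) string {
	return fmt.Sprintf("sudo iptables --delete OUTPUT --destination %s --jump REJECT", masterIP)
}

func main() {
	const masterExternalIP = "203.0.113.10" // placeholder external master IP
	fmt.Println(blockMasterCommand(masterExternalIP))
	fmt.Println(unblockMasterCommand(masterExternalIP))
}
```

If the kubelet reaches the master over an address other than the one named in the rule, the rule never matches, which is exactly the failure mode described above.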

k8s-github-robot pushed a commit that referenced this issue Dec 4, 2017

Kubernetes Submit Queue
Merge pull request #56790 from foxish/disable-gce-target
Automatic merge from submit-queue (batch tested with PRs 56790, 56638). If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Disable GCE target for network partition tests

Disabling until #56787 is addressed.

@dnardo (Contributor) commented Dec 4, 2017

I know the cherry-picks I added on Friday are not in yet. It would be good to see if these pass after they are merged.

@dnardo (Contributor) commented Dec 5, 2017

Cherry-picks have been merged; let me know if that clears things up.

@foxish (Member, Author) commented Dec 6, 2017

@dnardo, the cluster setup in the upgrade clusters still looks the same. I don't think the cherry-picks solved it.

@fejta-bot commented Mar 13, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@fejta-bot commented Apr 12, 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

@fejta-bot commented May 12, 2018

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@pmoust commented Sep 21, 2018

I guess this should be reopened; it's still a thing.

@Huang-Wei (Member) commented Nov 5, 2018

I do think so. I ran into this issue in #70627.

@Huang-Wei (Member) commented Nov 5, 2018

/reopen

@k8s-ci-robot (Contributor) commented Nov 5, 2018

@Huang-Wei: Reopening this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@AishSundar (Contributor) commented Nov 5, 2018

@foxish @dnardo @zmerlynn @justinsb @kubernetes/test-infra-maintainers can we have someone look into the current state of this issue? It is now blocking an e2e test for a 1.13 beta feature from passing on the upgrade dashboard.

@AishSundar (Contributor) commented Nov 5, 2018

/remove lifecycle-rotten
/kind bug
/priority important-soon
/milestone v1.13

k8s-ci-robot added this to the v1.13 milestone Nov 5, 2018

@jberkus commented Nov 5, 2018

fixing tags:

/remove-lifecycle rotten
/lifecycle frozen
/kind flake

@justinsb (Member) commented Nov 6, 2018

Why not block traffic to both the internal and external IPs?

justinsb added a commit to justinsb/kubernetes that referenced this issue Nov 6, 2018

e2e: block all master addresses
This way we can be sure that the kubelet can't communicate with the master, even if it falls back to the internal/external IP (which seems to be the case with DNS)

Issue kubernetes#56787

@justinsb (Member) commented Nov 6, 2018

Giving a try to blocking all IPs in #70681
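
For illustration, a minimal sketch of what "blocking all master addresses" could look like, extending the sketch earlier in the thread; the helper name and the example addresses are placeholders, not the actual change in #70681:

```go
// Hypothetical sketch of the "block all master addresses" idea: build one
// REJECT rule per known master address (external and internal) so the
// partition holds even if the kubelet falls back to another address.
package main

import "fmt"

// blockAllMasterCommands returns one iptables command per master address.
func blockAllMasterCommands(masterAddresses []string) []string {
	cmds := make([]string, 0, len(masterAddresses))
	for _, addr := range masterAddresses {
		cmds = append(cmds, fmt.Sprintf("sudo iptables --insert OUTPUT --destination %s --jump REJECT", addr))
	}
	return cmds
}

func main() {
	// Placeholder addresses only: one external, one internal.
	for _, cmd := range blockAllMasterCommands([]string{"203.0.113.10", "10.0.0.2"}) {
		fmt.Println(cmd)
	}
}
```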

justinsb added a commit to justinsb/kubernetes that referenced this issue Nov 6, 2018

e2e: block all master addresses

@AishSundar (Contributor) commented Nov 8, 2018

@justinsb I see that your PR #70681 merged and the test is passing in the latest run (https://testgrid.k8s.io/sig-release-master-upgrade#gce-new-master-upgrade-cluster-new). Thanks much! We will wait for a few more runs and then close this.

@Huang-Wei (Member) commented Nov 8, 2018

Thanks a lot @justinsb!

goodluckbot added a commit to goodluckbot/kubernetes that referenced this issue Nov 11, 2018

e2e: block all master addresses

phenixblue added a commit to phenixblue/kubernetes that referenced this issue Jan 24, 2019

e2e: block all master addresses