
*: Assorted fixes to get e2e-aws working again (Calico -> Flannel, etc.) #151

Merged
merged 6 commits into from
Aug 23, 2018

Conversation

wking
Member

@wking wking commented Aug 20, 2018

Patterned on the existing worker_ingress_kubelet_insecure_from_master from b620c16 (coreos/tectonic-installer#264).

This should address errors like:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/150/pull-ci-origin-installer-e2e-aws/546/artifacts/e2e-aws/nodes/ip-10-0-52-134.ec2.internal/journal.gz | zcat >journal
$ journalread journal | grep 'current config label' | head -n3
2018-08-20T20:30:08.000827895Z I0820 20:30:08.826840       1 tnc.go:375] Node ip-10-0-134-147.ec2.internal does not have a current config label
2018-08-20T20:30:08.00082814Z  I0820 20:30:08.826860       1 tnc.go:375] Node ip-10-0-153-195.ec2.internal does not have a current config label
2018-08-20T20:30:08.000828371Z I0820 20:30:08.826866       1 tnc.go:375] Node ip-10-0-166-239.ec2.internal does not have a current config label

on the master node:

$ journalread journal | grep -A15 'Starting Ignition' | grep -v INFO
2018-08-20T20:21:40.00097323Z  Starting Ignition (files)...
2018-08-20T20:21:40.000991225Z DEBUG    : parsed url from cmdline: ""
2018-08-20T20:21:41.000010266Z DEBUG    : parsing config: {
2018-08-20T20:21:41.0000122Z   "ignition": {
2018-08-20T20:21:41.000014165Z "config": {
2018-08-20T20:21:41.000016186Z "append": [
2018-08-20T20:21:41.000018193Z {
2018-08-20T20:21:41.000023296Z "source": "http://ci-op-imi5mbig-68485-tnc.origin-ci-int-aws.dev.rhcloud.com:80/config/master",

which were resulting in:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/150/pull-ci-origin-installer-e2e-aws/546/build-log.txt | grep Ginkgo
   |  Ginkgo timed out waiting for all parallel nodes to report back!  |
 Ginkgo ran 1 suite in 10m6.626061944s

Inbound 10250 is the kubelet API used by the control plane. @smarterclayton suspects the e2e-aws tests are trying to get metrics from the kubelets, and hanging on the etcd kubelet because this rule was missing. I'm not clear why we've only been seeing this issue for the last week though.

The third commit in this PR adds the new rule. The previous two commits pivot from inline ingress and egress rules to stand-alone aws_security_group_rule resources, finishing a transition away from inline ingress/egress rules begun by coreos/tectonic-installer#264. More details in the first two commit messages.
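A stand-alone rule along these lines would open the kubelet API port; this is a hedged sketch of the pattern, not the PR's exact code (the resource name, security-group references, and 0.11-era interpolation syntax are illustrative):

```hcl
# Sketch of a master -> etcd kubelet-API ingress rule as a stand-alone
# aws_security_group_rule, patterned on the existing
# worker_ingress_kubelet_insecure_from_master.  Names are illustrative.
resource "aws_security_group_rule" "etcd_ingress_kubelet_secure_from_master" {
  type      = "ingress"
  protocol  = "tcp"
  from_port = 10250
  to_port   = 10250

  # The group the rule is attached to (etcd nodes) and the group
  # allowed to connect (master nodes).
  security_group_id        = "${aws_security_group.etcd.id}"
  source_security_group_id = "${aws_security_group.master.id}"
}
```

Unlike inline `ingress`/`egress` blocks on `aws_security_group`, stand-alone rules can be added or removed independently without Terraform recreating the whole group's rule set, which is the motivation for finishing the transition.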

@openshift-ci-robot openshift-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 20, 2018
@wking wking force-pushed the etcd-kublet-ingress branch 2 times, most recently from 735c4ae to 1ea04c7 Compare August 20, 2018 23:05
@wking
Member Author

wking commented Aug 21, 2018

The e2e-aws error was:

Waiting for API at https://ci-op-ng25jj4j-68485-api.origin-ci-int-aws.dev.rhcloud.com:6443 to respond ...
...
Waiting for API at https://ci-op-ng25jj4j-68485-api.origin-ci-int-aws.dev.rhcloud.com:6443 to respond ...
Interrupted

I don't know if that's further along than the Ginkgo error or not, but I'll check the node logs later.

@wking
Member Author

wking commented Aug 21, 2018

> I don't know if that's further along than the Ginkgo error or not, but I'll check the node logs later.

Unfortunately, it looks like job 553 failed to capture node logs, so I don't know if this has addressed the "does not have a current config label" issue or not. On the off chance that the failed-log-capture was a flake, I'll try the test again:

/retest

self = true
}
protocol = "tcp"
from_port = 10250
Member

Are all of the 'from' correct? Don't most things trying to connect to here use a random high port?

Member Author

> Are all of the 'from' correct? Don't most things trying to connect to here use a random high port?

Yeah, that doesn't make sense to me either. But it's what we have had in master since forever; see, for example, here. Still, I'll drop them and see if that helps.

Member Author

@wking wking Aug 21, 2018

> > Are all of the 'from' correct? Don't most things trying to connect to here use a random high port?
>
> Yeah, that doesn't make sense to me either...

Ah, from_port and to_port are for a range of ports on a single host, not the ports for both hosts involved in the connection. So having from_port makes sense.
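To make that concrete, here is a hedged illustration (resource names and CIDR are made up): `from_port`/`to_port` bound a destination port range on the receiving hosts, and the client's ephemeral source port is never constrained by these fields.

```hcl
# from_port/to_port describe a destination port range on the instances
# in security_group_id, not a (source, destination) port pair.  A range
# wider than one port makes this obvious, e.g. a NodePort-style range.
resource "aws_security_group_rule" "example_nodeport_range" {
  type              = "ingress"
  protocol          = "tcp"
  from_port         = 30000  # low end of the destination port range
  to_port           = 32767  # high end of the destination port range
  security_group_id = "${aws_security_group.worker.id}"
  cidr_blocks       = ["10.0.0.0/16"]
}
```

For a single port like the kubelet API, `from_port` and `to_port` are simply both 10250.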

@wking
Member Author

wking commented Aug 21, 2018

I got the same error again with build 555, so probably not a flake. Still not sure what's going on there...

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 21, 2018
@wking
Member Author

wking commented Aug 21, 2018

I've spun off some tangential changes into #155 and #156. I don't think they were making semantic changes, but they were at least touching the API ingress rule (on port 6443) and therefore might have been causing our API timeouts.

@wking
Member Author

wking commented Aug 21, 2018

> I've spun off some tangential changes into #155 and #156.

And we're still getting:

Waiting for API at https://ci-op-hpjfcvwz-68485-api.origin-ci-int-aws.dev.rhcloud.com:6443 to respond ...
...
Waiting for API at https://ci-op-hpjfcvwz-68485-api.origin-ci-int-aws.dev.rhcloud.com:6443 to respond ...
Interrupted

@openshift-ci-robot openshift-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 22, 2018
@wking
Member Author

wking commented Aug 22, 2018

Looks like I may have broken the Terraform ignition setup? The e2e-aws test failed with:


* module.assets_base.data.ignition_config.etcd: 3 error(s) occurred:

* module.assets_base.data.ignition_config.etcd[2]: data.ignition_config.etcd.2: unexpected EOF
* module.assets_base.data.ignition_config.etcd[0]: data.ignition_config.etcd.0: unexpected EOF
* module.assets_base.data.ignition_config.etcd[1]: data.ignition_config.etcd.1: unexpected EOF
* module.assets_base.module.bootkube.data.ignition_file.bootkube_sh: 1 error(s) occurred:

* module.assets_base.module.bootkube.data.ignition_file.bootkube_sh: data.ignition_file.bootkube_sh: unexpected EOF

@wking
Member Author

wking commented Aug 22, 2018

> Looks like I may have broken the Terraform ignition setup?

Ah, I'd missed a worker -> etcd replacement in a copy/paste. Fixed with bcb86ea -> 7fc19bd.

derekwaynecarr and others added 2 commits August 22, 2018 13:22
This includes the fixes needed to work with SELinux.
@wking
Member Author

wking commented Aug 22, 2018

And we're back to:

Ginkgo timed out waiting for all parallel nodes to report back!

Which is where we were before this PR, and distinct from the:

Waiting for API at https://ci-op-ng25jj4j-68485-api.origin-ci-int-aws.dev.rhcloud.com:6443...

we saw during earlier versions of this PR (that just tried to open master -> etcd:10250).

@wking wking force-pushed the etcd-kublet-ingress branch 2 times, most recently from f2665fd to 81373c3 Compare August 22, 2018 19:50
@wking
Member Author

wking commented Aug 22, 2018

I've fixed a:

* aws_security_group_rule.worker_ingress_flannel_from_worker: [WARN] A duplicate Security Group rule was found on (sg-008d3915f65c9e488). This may be
a side effect of a now-fixed Terraform issue causing two security groups with
identical attributes but different source_security_group_ids to overwrite each
other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
information and instructions for recovery. Error message: the specified rule "peer: sg-03a1212a84b765ddf, UDP, from port: 4789, to port: 4789, ALLOW" already exists

reported by @smarterclayton with 7fc19bd -> 81373c3. Looks like I had accidentally pasted some worker stuff into the master rules.

@wking
Member Author

wking commented Aug 22, 2018

And we're still getting Ginkgo timeouts with 81373c3.

@wking
Member Author

wking commented Aug 23, 2018

And we're still getting Ginkgo timeouts with cd83ddb.

@wking
Member Author

wking commented Aug 23, 2018

Hooray, with b17dedc we're off the Ginkgo timeouts, and are only getting:

• Failure [34.901 seconds]
[Conformance][Area:Networking][Feature:Router]
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/headers.go:21
  The HAProxy router
  /tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/headers.go:40
    should set Forwarded headers appropriately [Suite:openshift/conformance/parallel] [It]
    /tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/headers.go:41

    Aug 23 18:38:47.491: Unexpected header: '10.2.6.0' (expected 10.0.192.68); All headers: http.Header{"X-Forwarded-Port":[]string{"8080"}, "X-Forwarded-Proto":[]string{"http"}, "Forwarded":[]string{"for=10.2.6.0;host=router-headers.example.com;proto=http"}, "X-Forwarded-For":[]string{"10.2.6.0"}, "User-Agent":[]string{"curl/7.61.0"}, "Accept":[]string{"*/*"}, "X-Forwarded-Host":[]string{"router-headers.example.com"}}

    /tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/headers.go:103

With b17dedc -> 2f1d81c, I've:

  • Touched up the Calico -> Flannel commit to also update some fixtures,
  • Returned to respecting user AMI overrides, and
  • Rebased this on top of Fix perm errors with selinux enabled #134 to pick up the SELinux changes needed to work with new RHCOS AMIs.

@wking wking force-pushed the etcd-kublet-ingress branch 2 times, most recently from 13ad9b1 to 59691f4 Compare August 23, 2018 19:24
We're having trouble accessing service IPs from pods with host network
namespaces, which was keeping the metrics API from coming up (I think
that's what the problem is ;) and eventually blocking namespace
deletion.  Defaulting to Flannel fixes that issue.
@wking
Member Author

wking commented Aug 23, 2018

I've pushed 2f1d81c -> b67f809 to leave the Calico fixtures alone. They seem to be for internal unit tests unaffected by the default change, or (for tests/smoke/aws/vars/aws-basic.yaml) they may be dead code (I'm dropping tests/smoke/aws in #143).

@wking
Member Author

wking commented Aug 23, 2018

And we've hit our quotas in the Jenkins account...

Error: Error applying plan:

7 error(s) occurred:

* module.vpc.aws_nat_gateway.nat_gw[1]: 1 error(s) occurred:

* aws_nat_gateway.nat_gw.1: Error creating NAT Gateway: NatGatewayLimitExceeded: Performing this operation would exceed the limit of 5 NAT gateways
	status code: 400, request id: fe819d69-b3d5-44da-b71d-5d0113f5d7fd
* module.vpc.aws_elb.api_internal: 1 error(s) occurred:

* aws_elb.api_internal: TooManyLoadBalancers: Exceeded quota of account 846518947292
	status code: 400, request id: 34db82f2-a70b-11e8-b742-172fdc8dbc5d
* module.vpc.aws_elb.api_external: 1 error(s) occurred:

* aws_elb.api_external: TooManyLoadBalancers: Exceeded quota of account 846518947292
	status code: 400, request id: 34e016ce-a70b-11e8-ac02-dd9cffd48ed9
* module.vpc.aws_nat_gateway.nat_gw[0]: 1 error(s) occurred:

* aws_nat_gateway.nat_gw.0: Error creating NAT Gateway: NatGatewayLimitExceeded: Performing this operation would exceed the limit of 5 NAT gateways
	status code: 400, request id: a6ac1e8c-e85d-4578-8966-57c92850e5f9
* module.vpc.aws_nat_gateway.nat_gw[2]: 1 error(s) occurred:

* aws_nat_gateway.nat_gw.2: Error creating NAT Gateway: NatGatewayLimitExceeded: Performing this operation would exceed the limit of 5 NAT gateways
	status code: 400, request id: a409d983-ed38-4b80-83fe-1796e5f383c0
* module.vpc.aws_elb.tnc: 1 error(s) occurred:

* aws_elb.tnc: TooManyLoadBalancers: Exceeded quota of account 846518947292
	status code: 400, request id: 351fde8c-a70b-11e8-8aa8-937689137735
* module.vpc.aws_elb.console: 1 error(s) occurred:

* aws_elb.console: TooManyLoadBalancers: Exceeded quota of account 846518947292
	status code: 400, request id: 352695ff-a70b-11e8-85bf-3f8eb4d4beb0

This happened before here. I'll see about reaping leaked resources in that account.

@wking
Member Author

wking commented Aug 23, 2018

The most recent e2e-aws job (based on b67f809) got:


• Failure [43.888 seconds]
[Conformance][Area:Networking][Feature:Router]
/tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/headers.go:21
  The HAProxy router
  /tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/headers.go:40
    should set Forwarded headers appropriately [Suite:openshift/conformance/parallel] [It]
    /tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/headers.go:41

    Aug 23 19:47:24.715: Unexpected header: '10.2.6.0' (expected 10.0.161.5); All headers: http.Header{"X-Forwarded-Proto":[]string{"http"}, "Forwarded":[]string{"for=10.2.6.0;host=router-headers.example.com;proto=http"}, "X-Forwarded-For":[]string{"10.2.6.0"}, "User-Agent":[]string{"curl/7.61.0"}, "Accept":[]string{"*/*"}, "X-Forwarded-Host":[]string{"router-headers.example.com"}, "X-Forwarded-Port":[]string{"8080"}}

    /tmp/openshift/build-rpms/rpm/BUILD/origin-3.11.0/_output/local/go/src/github.com/openshift/origin/test/extended/router/headers.go:103

again. Do we need to tweak something in the release repo to catch up with Calico -> Flannel?

@smarterclayton
Contributor

Yay! This is an actual bug in the way that openshift/installer configures the router, because the source IP isn't preserved.

I'll probably switch to another test in the job definition.

@smarterclayton
Contributor

I switched to a kube conformance test that should pass and added a new job that runs all tests but is optional.

/retest

@wking
Member Author

wking commented Aug 23, 2018

> I'll see about reaping leaked resources in that account.

I've removed a lot of cruft from our Jenkins smoke-test account. Let's kick that off again:

retest this please

@wking
Member Author

wking commented Aug 23, 2018

> I switched to a kube conformance test that should pass...

Cross-linking openshift/release#1271.

@eparis
Member

eparis commented Aug 23, 2018

/me hopes and hopes and hopes...

@smarterclayton
Contributor

> 2018/08/23 21:12:28 Container test in pod e2e-aws completed successfully

!!!!

@wking wking changed the title modules/aws/vpc/sg-etcd: Add ingress 10250 from master *: Assorted fixes to get e2e-aws working again (Calico -> Flannel, etc.) Aug 23, 2018
@wking
Member Author

wking commented Aug 23, 2018

e2e-aws is green :) (although the smoke tests are still running). Someone want to drop an /lgtm onto this?

@eparis
Member

eparis commented Aug 23, 2018

/lgtm
assuming I count

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Aug 23, 2018
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: eparis, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking
Member Author

wking commented Aug 23, 2018

> assuming I count

I think all members of GitHub's openshift org count for /lgtm.

@eparis
Member

eparis commented Aug 23, 2018

Ohhh snap! Thank you for sticking with this @wking
