
leave oauth on jenkins extended test, use token for http-level access #12440

Merged

Conversation

@gabemontero

@bparees PTAL

of course, this should not merge until the jenkins centos image is updated, but I'm assuming that happens before we finish reviewing this change and the merge queue frees up from the rebase.

@gabemontero

Need the image updated before the extended test will pass

@bparees commented Jan 11, 2017

oooh, nice!
lgtm pending success.

@gabemontero commented Jan 11, 2017 via email

@gabemontero

OK ... the extended tests pass locally for me using the official docker.io jenkins images (vs. running with my test image, which I had to do during development of this pull). Going to assume there was just a timing issue between when the image was available on docker.io and when the test system was able to update its local version of the image. Will trigger another run momentarily.

@gabemontero commented Jan 11, 2017

Hmm ... now getting an error on the vagrant-openshift setup for this pull:

*****Locally Merging Pull Request: https://github.com/openshift/origin/pull/12440
+ test_pull_requests --local_merge_pull_request 12440 --repo origin --config /var/lib/jenkins/.test_pull_requests_origin.json

  Checking if current base repo commit ID matches what we expect
  Deleting comment #271917118
Base repository commit ID 6468143888ef9a8d30cc72b6d3a59be896c8588f doesn't match evaluated commit ID 4330ed72d5baff5578d080771ce2aca2f4c751b6
Build step 'Execute shell' marked build as failure

Did not see any existing flakes ... deleted/reposted the extended test comment in case it somehow was using the earlier commit ID from before I rebased this PR.

@gabemontero

Yep - deleting / reposting the test comment seemed to do the trick.

@gabemontero commented Jan 11, 2017

Ah .... there is a new set of tests in test/extended/builds/pipeline.go (it is the one @csrwng introduced recently). My test spec of "openshift pipeline builds" (which I copied / pasted from @csrwng 's PR) hits pipeline.go. "openshift pipeline plugin" would have triggered the jenkins_plugin.go tests, which are what I was focused on.

In any event, the pipeline.go stuff is still hitting 403's on direct http accesses. Those tests still have ENABLE_OAUTH set to false on the oc new-app <jenkins template> call. I'll push an update for that in a bit.

Ultimately, the extended test focus should be "openshift pipeline" to capture both pipeline.go and jenkins_plugin.go.
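
For illustration, here is a minimal sketch of flipping that template parameter, assuming the jenkins-ephemeral template and an oc binary on the PATH (the extended tests drive this through their own CLI helpers rather than os/exec):

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// ENABLE_OAUTH=true switches the Jenkins image from its fixed
	// admin credentials to the OpenShift OAuth integration; the
	// pipeline.go tests had been passing false here.
	out, err := exec.Command("oc", "new-app", "jenkins-ephemeral",
		"-p", "ENABLE_OAUTH=true").CombinedOutput()
	fmt.Println(string(out))
	if err != nil {
		fmt.Println("new-app failed:", err)
	}
}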

@gabemontero

OK, the test is still in flight, but the PR run is still getting 403's on direct http access, even though both pipeline.go and jenkins_plugin.go are passing for me locally now with the docker.io jenkins image.

Maybe there is some sort of user permission difference when running in the PR tester vs. running locally? Or something is messed up with docker on these test systems and we have stale jenkins images?

In either case, I'm going to have to push some temporary debug up to the PR to better nail down what is going on. I'll report back when I have some findings.

@gabemontero

Oooooh ... this time the jenkins pod dump had something I've never seen before, which torpedoes any OAuth-based access to Jenkins:

com.google.api.client.http.HttpResponseException: 500 Internal Server Error
This request caused apisever to panic. Look in log for details.
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1061)
	at org.openshift.jenkins.plugins.openshiftlogin.OpenShiftOAuth2SecurityRealm.getOpenShiftOAuthProvider(OpenShiftOAuth2SecurityRealm.java:474)
	at org.openshift.jenkins.plugins.openshiftlogin.OpenShiftOAuth2SecurityRealm.populateDefaults(OpenShiftOAuth2SecurityRealm.java:346)
	at org.openshift.jenkins.plugins.openshiftlogin.OpenShiftOAuth2SecurityRealm.<init>(OpenShiftOAuth2SecurityRealm.java:274)
	at org.openshift.jenkins.plugins.openshiftlogin.OpenShiftSetOAuth.setOauth(OpenShiftSetOAuth.java:69)
	at org.openshift.jenkins.plugins.openshiftlogin.OpenShiftSetOAuth.setOauth(OpenShiftSetOAuth.java:46)
	at org.openshift.jenkins.plugins.openshiftlogin.OpenShiftItemListener.onLoaded(OpenShiftItemListener.java:41)
	at jenkins.model.Jenkins.<init>(Jenkins.java:960)
	at hudson.model.Hudson.<init>(Hudson.java:85)
	at hudson.model.Hudson.<init>(Hudson.java:81)
	at hudson.WebAppMain$3.run(WebAppMain.java:231)

@enj - FYI - The above exception happened when we tried to hit the /.well-known/oauth-authorization-server OAuth http endpoint on the master. I'll have to wait for the extended test run to complete and see what appears in the master log. Given I git-rebased this PR against the newly rebased k8s/origin level on master, I wonder if we've hit a newly introduced issue with the just-finished rebase.
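
For reference, a minimal sketch of what that endpoint lookup amounts to, assuming a dev master at 127.0.0.1:8443 with self-signed certs (the plugin itself goes through the google-http-client, not Go):

package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Self-signed certs on a dev master, hence InsecureSkipVerify.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Get("https://127.0.0.1:8443/.well-known/oauth-authorization-server")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	// A 500 here is what surfaced as the HttpResponseException above.
	fmt.Println("status:", resp.Status)

	var provider struct {
		Issuer                string `json:"issuer"`
		AuthorizationEndpoint string `json:"authorization_endpoint"`
		TokenEndpoint         string `json:"token_endpoint"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&provider); err == nil {
		fmt.Printf("%+v\n", provider)
	}
}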

@gabemontero

Sure enough, there is a panic noted in the master log (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin_extended/973/s3/download/test-extended/core/logs/openshift.log) wrt the GET on /.well-known/oauth-authorization-server. I'll open an origin issue.

It looks like:

E0111 14:05:33.258571   12471 panics.go:37] APIServer panic'd on GET /.well-known/oauth-authorization-server: runtime error: invalid memory address or nil pointer dereference
goroutine 115106 [running]:
runtime/debug.Stack(0x8d3d0e0, 0xc42e9b5f80, 0x42897e6)
	/usr/local/go/src/runtime/debug/stack.go:24 +0x79
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters.WithPanicRecovery.func1.1(0x3a88d80, 0xc4200100d0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters/panics.go:37 +0x74
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime.HandleCrash(0xc42dc09ec8, 0x1, 0x1)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/util/runtime/runtime.go:52 +0xe5
panic(0x3a88d80, 0xc4200100d0)
	/usr/local/go/src/runtime/panic.go:458 +0x243
github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Container).dispatch.func2(0xc422be2fc0, 0xc42dc08e28)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/emicklei/go-restful/container.go:206 +0x62
panic(0x3a88d80, 0xc4200100d0)
	/usr/local/go/src/runtime/panic.go:458 +0x243
github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Container).dispatch.func4(0xc422be2fc0, 0xc42dc08da8, 0xc42fbbc098, 0xc43112d180, 0xc436b0cff0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/emicklei/go-restful/container.go:241 +0x74
github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Container).dispatch(0xc422be2fc0, 0x8d3ca20, 0xc42fbbc090, 0xc436b0cff0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/emicklei/go-restful/container.go:242 +0x170
github.com/openshift/origin/vendor/github.com/emicklei/go-restful.(*Container).(github.com/openshift/origin/vendor/github.com/emicklei/go-restful.dispatch)-fm(0x8d3ca20, 0xc42fbbc090, 0xc436b0cff0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/emicklei/go-restful/container.go:120 +0x48
net/http.HandlerFunc.ServeHTTP(0xc4229111f0, 0x8d3ca20, 0xc42fbbc090, 0xc436b0cff0)
	/usr/local/go/src/net/http/server.go:1726 +0x44
net/http.(*ServeMux).ServeHTTP(0xc4210b3f50, 0x8d3ca20, 0xc42fbbc090, 0xc436b0cff0)
	/usr/local/go/src/net/http/server.go:2022 +0x7f
github.com/openshift/origin/pkg/cmd/server/origin.(*MasterConfig).authorizationFilter.func1(0x8d3ca20, 0xc42fbbc090, 0xc436b0cff0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/cmd/server/origin/handlers.go:103 +0x171
net/http.HandlerFunc.ServeHTTP(0xc420ef8f60, 0x8d3ca20, 0xc42fbbc090, 0xc436b0cff0)
	/usr/local/go/src/net/http/server.go:1726 +0x44
github.com/openshift/origin/pkg/cmd/server/origin.(*MasterConfig).impersonationFilter.func1(0x8d3ca20, 0xc42fbbc090, 0xc436b0cff0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/cmd/server/origin/handlers.go:305 +0x2413
net/http.HandlerFunc.ServeHTTP(0xc420ef8f80, 0x8d3ca20, 0xc42fbbc090, 0xc436b0cff0)
	/usr/local/go/src/net/http/server.go:1726 +0x44
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/apiserver/filters.WithAudit.func1(0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/apiserver/filters/audit.go:124 +0xa04
net/http.HandlerFunc.ServeHTTP(0xc4210aee80, 0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/usr/local/go/src/net/http/server.go:1726 +0x44
github.com/openshift/origin/pkg/cmd/server/origin.authenticationHandlerFilter.func1(0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/cmd/server/origin/auth.go:786 +0x2ba
net/http.HandlerFunc.ServeHTTP(0xc4210aeec0, 0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/usr/local/go/src/net/http/server.go:1726 +0x44
github.com/openshift/origin/pkg/cmd/server/origin.namespacingFilter.func1(0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/cmd/server/origin/handlers.go:183 +0xd2
net/http.HandlerFunc.ServeHTTP(0xc420c5fb00, 0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/usr/local/go/src/net/http/server.go:1726 +0x44
github.com/openshift/origin/pkg/cmd/server/origin.cacheControlFilter.func1(0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/cmd/server/origin/handlers.go:151 +0xc2
net/http.HandlerFunc.ServeHTTP(0xc420c5fce0, 0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/usr/local/go/src/net/http/server.go:1726 +0x44
github.com/openshift/origin/vendor/github.com/gorilla/context.ClearHandler.func1(0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/gorilla/context/context.go:141 +0x8b
net/http.HandlerFunc.ServeHTTP(0xc42136f660, 0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/usr/local/go/src/net/http/server.go:1726 +0x44
net/http.(*ServeMux).ServeHTTP(0xc42010fd70, 0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/usr/local/go/src/net/http/server.go:2022 +0x7f
net/http.(*ServeMux).ServeHTTP(0xc421e02ed0, 0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/usr/local/go/src/net/http/server.go:2022 +0x7f
github.com/openshift/origin/pkg/cmd/server/origin.WithPatternsHandler.func1(0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/cmd/server/origin/master.go:945 +0xcd
net/http.HandlerFunc.ServeHTTP(0xc421deb000, 0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/usr/local/go/src/net/http/server.go:1726 +0x44
github.com/openshift/origin/pkg/cmd/server/origin.WithAssetServerRedirect.func1(0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/cmd/server/origin/handlers.go:297 +0x7f
net/http.HandlerFunc.ServeHTTP(0xc421e03770, 0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/usr/local/go/src/net/http/server.go:1726 +0x44
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters.WithCORS.func1(0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters/cors.go:77 +0x1a2
net/http.HandlerFunc.ServeHTTP(0xc4207b9f80, 0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/usr/local/go/src/net/http/server.go:1726 +0x44
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters.WithPanicRecovery.func1(0x8d3d0e0, 0xc42e9b5f80, 0xc436b0cff0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters/panics.go:75 +0x24a
net/http.HandlerFunc.ServeHTTP(0xc421e038c0, 0x7f9cd818b180, 0xc4326ef7d0, 0xc436b0cff0)
	/usr/local/go/src/net/http/server.go:1726 +0x44
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters.(*timeoutHandler).ServeHTTP.func1(0xc421e6f840, 0x8d461a0, 0xc4326ef7d0, 0xc436b0cff0, 0xc426baa240)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters/timeout.go:78 +0x8d
created by github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters.(*timeoutHandler).ServeHTTP
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters/timeout.go:80 +0x1db


@liggitt commented Jan 11, 2017

@sttts for apiserver rewiring

@enj commented Jan 11, 2017

Slightly easier to read stack trace:

debug      stack.go:24      Stack(#3, #6, 0x42897e6)
filters    panics.go:37     WithPanicRecovery.func1.1(Handler(#1))
runtime    runtime.go:52    HandleCrash(func(0xc42dc09ec8), func(0x1))
           panic.go:458     panic(#1, #4)
go-restful container.go:206 (*Container).dispatch.func2(*Container(#5), ResponseWriter(0xc42dc08e28))
           panic.go:458     panic(#1, #4)
go-restful container.go:241 (*Container).dispatch.func4(*Container(#5), ResponseWriter(0xc42dc08da8), *Request(0xc43112d180), #9)
go-restful container.go:242 (*Container).dispatch(*Container(#5), ResponseWriter(#2), *Request(#9))
go-restful container.go:120 dispatch)-fm(*Container(#2), *WebService(#7), *ServeMux(#9))
http       server.go:1726   HandlerFunc.ServeHTTP(0xc4229111f0, #2, #7, #9)
http       server.go:2022   (*ServeMux).ServeHTTP(0xc4210b3f50, #2, #7, #9)
origin     handlers.go:103  (*MasterConfig).authorizationFilter.func1(*MasterConfig(#2), Handler(#7))
http       server.go:1726   HandlerFunc.ServeHTTP(0xc420ef8f60, #2, #7, #9)
origin     handlers.go:305  (*MasterConfig).impersonationFilter.func1(*MasterConfig(#2), Handler(#7))
http       server.go:1726   HandlerFunc.ServeHTTP(0xc420ef8f80, #2, #7, #9)
filters    audit.go:124     WithAudit.func1(Handler(#3), RequestAttributeGetter(#9))
http       server.go:1726   HandlerFunc.ServeHTTP(0xc4210aee80, #3, #6, #9)
origin     auth.go:786      authenticationHandlerFilter.func1(Handler(#3), Request(#9))
http       server.go:1726   HandlerFunc.ServeHTTP(0xc4210aeec0, #3, #6, #9)
origin     handlers.go:183  namespacingFilter.func1(Handler(#3), RequestContextMapper(#9))
http       server.go:1726   HandlerFunc.ServeHTTP(0xc420c5fb00, #3, #6, #9)
origin     handlers.go:151  cacheControlFilter.func1(Handler(#3), string(#9, len=0))
http       server.go:1726   HandlerFunc.ServeHTTP(0xc420c5fce0, #3, #6, #9)
context    context.go:141   ClearHandler.func1(Handler(#3), #9)
http       server.go:1726   HandlerFunc.ServeHTTP(0xc42136f660, #3, #6, #9)
http       server.go:2022   (*ServeMux).ServeHTTP(0xc42010fd70, #3, #6, #9)
http       server.go:2022   (*ServeMux).ServeHTTP(0xc421e02ed0, #3, #6, #9)
origin     master.go:945    WithPatternsHandler.func1(Handler(#3), Handler(#9))
http       server.go:1726   HandlerFunc.ServeHTTP(0xc421deb000, #3, #6, #9)
origin     handlers.go:297  WithAssetServerRedirect.func1(Handler(#3), string(#9, len=0))
http       server.go:1726   HandlerFunc.ServeHTTP(0xc421e03770, #3, #6, #9)
filters    cors.go:77       WithCORS.func1(Handler(#3), []string(#9 len=0 cap=0))
http       server.go:1726   HandlerFunc.ServeHTTP(0xc4207b9f80, #3, #6, #9)
filters    panics.go:75     WithPanicRecovery.func1(Handler(#3), RequestContextMapper(#9))
http       server.go:1726   HandlerFunc.ServeHTTP(0xc421e038c0, 0x7f9cd818b180, #8, #9)
filters    timeout.go:78    (*timeoutHandler).ServeHTTP.func1(*timeoutHandler(0xc421e6f840), ResponseWriter(0x8d461a0), *Request(#9), 0xc426baa240)

@gabemontero

With #12453, the use of OAuth and direct HTTP access via token has already worked successfully multiple times with the currently running extended test.

Will report back with any relevant analysis once the test completes.
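
For illustration, a minimal sketch of the token-based HTTP access in question: put a bearer token on the request rather than relying on an interactive OAuth login. The route host and the JENKINS_TOKEN env var are stand-ins:

package main

import (
	"crypto/tls"
	"fmt"
	"io/ioutil"
	"net/http"
	"os"
	"strings"
)

func main() {
	// e.g. the jenkins service account's token
	token := strings.TrimSpace(os.Getenv("JENKINS_TOKEN"))
	req, _ := http.NewRequest("GET", "https://jenkins-myproject.example.com/api/json", nil)
	req.Header.Set("Authorization", "Bearer "+token)

	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // dev certs
	}}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	body, _ := ioutil.ReadAll(resp.Body)
	// With OAuth left on and a valid token, this should be 200 rather than 403.
	fmt.Println(resp.Status, len(body), "bytes")
}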

@gabemontero

k8s plugin does not seem happy ...

@bparees commented Jan 15, 2017 via email

@gabemontero

@bparees - That thread could be part of it, but there are some more fundamental issues going on (which are totally independent of the token-based http access this PR is focused on).

Several of the tests are encountering permissions issues with the jenkins service account. I've seen elements of this both with the openshift-restclient that our jenkins-plugin uses, as well as with the k8s plugin (which uses the fabric8 client under the covers).

An example from our jenkins-plugin and the openshift-restclient:

com.openshift.restclient.authorization.ResourceForbiddenException: User "system:serviceaccount:extended-test-jenkins-plugin-nr9lo-dtkq8-jenkins:jenkins" cannot "get" on "/swaggerapi/oapi/v1" User "system:serviceaccount:extended-test-jenkins-plugin-nr9lo-dtkq8-jenkins:jenkins" cannot "get" on "/swaggerapi/oapi/v1"

An example from the k8s plugin and fabric8:

SEVERE: Failed to load initial Builds: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://kubernetes.default/oapi/v1/namespaces/extended-test-jenkins-plugin-nr9lo-ysea9-jenkins/builds?fieldSelector=status%3DNew. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked..
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://kubernetes.default/oapi/v1/namespaces/extended-test-jenkins-plugin-nr9lo-ysea9-jenkins/builds?fieldSelector=status%3DNew. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked..

I'm seeing the similar errors in our overnight test jobs as well ... search for ResourceForbidden in https://ci.openshift.redhat.com/jenkins/job/origin_extended_image_tests/818/consoleFull for example.

Something underneath us has broken recently. I don't know yet if it is extended-test specific or more general. I'll try mimicking some of these test cases manually in a local jenkins env I set up and see what transpires.

All this said, I think turning OAuth on for the extended tests proves out, and technically speaking this PR could be merged. But I'm fine if you want to wait as well.
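
For context, the class of remedy these forbidden errors point at is a role grant to the jenkins service account in the test namespace; a hypothetical sketch only (not necessarily what the eventual fix does), assuming oc is logged in to the affected project:

package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// -z targets a service account in the current project; "edit" is just an
	// example role broad enough to cover the GETs failing above.
	out, err := exec.Command("oc", "policy", "add-role-to-user", "edit",
		"-z", "jenkins").CombinedOutput()
	fmt.Println(string(out))
	if err != nil {
		fmt.Println("grant failed:", err)
	}
}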

@bparees commented Jan 15, 2017

yeah i'd like to hold this out until we get some stability back into the existing codebase.

@gabemontero

At least the permission issues in jenkins_plugin.go are getting fixed once #12508 merges

@gabemontero

We minimally need openshift/jenkins#231 and https://ci.openshift.redhat.com/jenkins/view/Image%20Verification/job/push_images_s2i/6747/ to complete to see about the k8s plugin tests passing again.

@gabemontero

OK we are getting close. The only failure with this PR's ext test run was #12479

Note, the orchestration pipeline ext test passed locally for me just now, but the blue-green test failed.

New regressions beneath us notwithstanding, it is time to focus on pipeline.go, and add some full dumps of the jenkins master and slave pods as needed.

@gabemontero

Both pipeline.go and jenkins_plugin.go passed overnight, and are passing for me locally this morning. I'm going to push an update momentarily to add some more debug to pipeline.go when failures occur, and kick off another extended test run.

@gabemontero

So in the run this time, I got an intermittent hiccup with the Orchestration test case from pipeline.go.

I'm circling through the added debug to see if I can discern anything. @csrwng - would you have any cycles to help expedite diagnosis of these failures? They are certainly intermittent: the test passed in the overnight build and for me locally today, but failed locally for me last night, in recent overnight runs (per @PI-Victor, see https://ci.openshift.redhat.com/jenkins/job/test_pr_origin_extended/985/consoleFull#123465095256cbb9a5e4b02b88ae8c2f77), and in this PR so far.

@csrwng commented Jan 17, 2017

@gabemontero I can take a look at it later today/early tomorrow morning

@gabemontero force-pushed the turnOnOAuthJenkinsExtTest branch 3 times, most recently from c236737 to 60d2c20 on January 24, 2017 23:25
@gabemontero

OK, in this last run, only the orchestration pipeline failed. This time for debug I:

  • dumped any failed maven containers .... there were none
  • added a polling memory dump that ran jstat -gcutil to get a sense of GC activity (see the sketch after this list) ... there was no heavy GC activity
  • hit a similar timeout-looking situation with the mapsapp / mlbparks relationship as I saw last night. The mlbparks deployment of mlbparks-1 is still running after 10 minutes. Perhaps we just didn't wait long enough, but that seems like a crazy long wait.
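
A rough sketch of the kind of GC polling described in the second bullet, assuming oc on the PATH; the pod name is hypothetical and the in-tree helper may differ in the details:

package main

import (
	"fmt"
	"os/exec"
	"time"
)

func main() {
	const pid = "1" // Jenkins typically runs as PID 1 inside the container
	for i := 0; i < 5; i++ {
		// jstat -gcutil prints heap-occupancy percentages plus GC
		// counts/times for each sample.
		out, err := exec.Command("oc", "exec", "jenkins-1-abcde", "--",
			"jstat", "-gcutil", pid).CombinedOutput()
		if err != nil {
			fmt.Println("jstat failed:", err)
			return
		}
		fmt.Print(string(out))
		time.Sleep(10 * time.Second)
	}
}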

In any event, per the discussion between @csrwng, @bparees, and myself after scrum, I'm just going to comment out the orchestration pipeline for now and get enough consistency in the test run for this PR to get it merged.

Per our post-scrum discussion, let's use #12479 for @csrwng to spend some time curating this test, sort out these prolonged timing windows, and then re-integrate.

I'll trim the debug a bit, comment out the test, and if we get some consistent success, re-ask for the merge.

@gabemontero force-pushed the turnOnOAuthJenkinsExtTest branch 3 times, most recently from db626aa to 711b54d on January 25, 2017 18:09
@gabemontero

OK, the extended tests have passed multiple times in a row now with the orchestration pipeline test commented out.

@bparees - please revisit the changes in the PR and let's redo the comments->merge loop.

thanks

@bparees left a comment

couple questions, mostly looks good.

@@ -13,36 +16,98 @@ import (
"github.com/openshift/origin/test/extended/util/jenkins"
)

func debugAnyJenkinsFailure(br *exutil.BuildResult, name string, oc *exutil.CLI, dumpMaster bool) {
if !br.BuildSuccess {
@bparees:

this seems slightly weird. Why not just remove the br argument and only call debugAnyJenkinsFailure when desired (namely when the build fails)?

@gabemontero:

The intent was to just code up the if once vs. coding up the if check in all the places I added calls to debugAnyJenkinsFailure. I can switch it up if you like.

@bparees:

I guess it's fine to leave it, but if we end up wanting to reuse this for debugging other failures, we're going to end up refactoring it.


if os.Getenv(jenkins.DisableJenkinsMemoryStats) == "" {
g.By("start jenkins gc tracking")
ticker = jenkins.StartJenkinsGCTracking(oc, oc.Namespace())
@bparees:

why does this test only track GC (not memory)?

@gabemontero:

The memory tracking debug is very verbose. For the purposes of the debug I did the last few days in this pull, I simply needed to prove that the heap was NOT too small and that we were NOT under GC duress.

@gabemontero:

Hence, I made the choice of which debug you want a bit more granular and selectable.

@bparees:

should this check be based on a different env variable name then?

@gabemontero:

That is more in line with the whole granular motif I've been espousing. I'll make that change.

@gabemontero:

update pushed

}
}()
if os.Getenv(jenkins.DisableJenkinsMemoryStats) == "" {
ticker = jenkins.StartJenkinsMemoryTracking(oc, jenkinsNamespace)
@bparees:

and this test only tracks memory and not gc?

@gabemontero:

Correct, and the memory analysis here is really centered more on the native memory aspects of the JVM than the heap. If I recall correctly, in the original problem @jupierce chased down, the heap itself was not constrained; there was not a GC issue with that one. So yeah, again, I chose to make the debug tools more granular, and have applied only the ones that have so far been deemed necessary for each set of tests.

@bparees:

if this is very verbose as you say above, do we want it on by default?

@gabemontero:

Yeah, it is a question of confidence that we aren't hitting the native memory issue any more. If it proved to be very intermittent, and we didn't have this on when it happens again, that would be a bummer.

I'll defer to you and/or @jupierce on that.

@bparees:

fair enough, let's leave it on for now, if we get to 32bit JVM and are stable for a while, maybe we can turn it off then.


o.Expect(err).NotTo(o.HaveOccurred())

if os.Getenv(jenkins.DisableJenkinsMemoryStats) == "" {
@bparees:

the assumption is a developer would set this locally when running extended tests? I assume it's not being set in our jenkins extended test runs today?

@gabemontero:

Correct and Correct. And in case it wasn't clear, we already merged in the use of this env var. I simply moved it from a private var in jenkins_plugin.go to a public var in monitor.go since it is being leveraged in different places now.

}
cleanup = func() {
if os.Getenv(jenkins.DisableJenkinsGCSTats) == "" {
g.By("stop jenkins memory tracking")
@bparees:

s/memory/gc/

@gabemontero:

update pushed

@bparees left a comment

one final nit and lgtm.

@bparees commented Jan 25, 2017

[merge]

@openshift-bot

Evaluated for origin merge up to a6b94b4

@openshift-bot

[Test]ing while waiting on the merge queue

@openshift-bot

Evaluated for origin testextended up to a6b94b4

@gabemontero

in the test run, test/cmd/status.sh failed; certainly unrelated to this extended test change

@bparees commented Jan 25, 2017

@gabemontero please tag flakes anyway just so we can help identify frequency and get attention on them.

in this case it was flake #12667
[test]

@openshift-bot

Evaluated for origin test up to a6b94b4

@openshift-bot

continuous-integration/openshift-jenkins/test FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/13318/) (Base Commit: adc5ee3)

@openshift-bot

continuous-integration/openshift-jenkins/testextended FAILURE (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin_extended/1036/) (Base Commit: 59e57b1) (Extended Tests: core(openshift pipeline))

@bparees commented Jan 26, 2017

looks like the extended test got hung?

@openshift-bot commented Jan 26, 2017

continuous-integration/openshift-jenkins/merge SUCCESS (https://ci.openshift.redhat.com/jenkins/job/test_pr_origin/13327/) (Base Commit: 50300d1) (Image: devenv-rhel7_5783)

@openshift-bot openshift-bot merged commit fa592b8 into openshift:master Jan 26, 2017
@gabemontero gabemontero deleted the turnOnOAuthJenkinsExtTest branch January 26, 2017 15:18
@gabemontero

Yeah, the blue-green test failed this time. I've seen that fail sometimes, though not as frequently as the orchestration pipeline test we commented out. I suspect it falls under the same sort of flake category as the orchestration tests (a long delay or problem in pods getting started). Perhaps @csrwng 's upcoming investigation / rework of the orchestration test will have some carry-over to blue-green. Certainly if the flakes start coming up more regularly in the overnight runs, temporarily disabling it is a consideration.

I do have a theory on the way the test ended. I ended up putting a g.GinkgoRecover() call in the go thread in that test, because of a go panic warning that arose if an assert happened on that thread. I suspect that call is interfering with the defer cleanup() I have on the main thread.

If that theory holds water with others here, I'll open a new pull for reworking the monitoring piece to get invoked from the go threads themselves.
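
For illustration, a minimal sketch of the pattern in question, using ginkgo/gomega as the extended tests do (the spec body itself is invented):

package pipeline_test

import (
	"time"

	g "github.com/onsi/ginkgo"
	o "github.com/onsi/gomega"
)

var _ = g.Describe("sketch: asserting from a spawned goroutine", func() {
	g.It("monitors in the background", func() {
		cleanup := func() { /* stop tickers, dump pods, etc. */ }
		defer cleanup()

		done := make(chan struct{})
		go func() {
			// Without GinkgoRecover, a failed assertion on this goroutine
			// panics the whole process instead of failing the spec; the
			// theory above is that this recover interacts badly with the
			// main goroutine's deferred cleanup.
			defer g.GinkgoRecover()
			defer close(done)
			o.Expect(true).To(o.BeTrue())
		}()

		select {
		case <-done:
		case <-time.After(30 * time.Second):
		}
	})
})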
