Bug 1925524: bump k8s plugin to 1.30 to enable 1.0.48 release of openshift sync plugin #1297
Conversation
Force-pushed from a8faf78 to 9f8fff0
/retest
The job seems to have issues when trying to delete e2e namespaces.
/retest
/test e2e-aws-jenkins
The e2e failures are not obvious CI flakes to me. I'm doing a deep dive into them, including performing test runs of the job in openshift/origin, as well as possible test PRs off of openshift/jenkins.
/retitle Bug 1925524: bump k8s plugin to 1.30 to enable 1.0.46 release of openshift sync plugin
@akram: This pull request references Bugzilla bug 1925524, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
Requesting review from QA contact. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I'll also note that the sync plugin is already listed at 1.0.46: https://github.com/openshift/jenkins/blob/master/2/contrib/openshift/base-plugins.txt#L30 Now, I do see a 1.0.47 upstream: https://github.com/jenkinsci/openshift-sync-plugin/tree/openshift-sync-1.0.47 Are we now bumping k8s in prep for that?
Officially subscribing relevant team members while Akram is on PTO. /assign @jkhelil
@@ -27,7 +26,7 @@ mercurial:2.12
metrics:4.0.2.6
openshift-client:1.0.35
openshift-login:1.0.26
openshift-sync:1.0.46
duh it is trying to go to 1.0.47 now :-)
/retitle Bug 1925524: bump k8s plugin to 1.30 to enable 1.0.47 release of openshift sync plugin
So yeah, e2e-aws-jenkins passed over in my test PR https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_jenkins/1301/pull-ci-openshift-jenkins-master-e2e-aws-jenkins/1412801356313399296 So that points to this failure stemming from one or both of these plugin updates @jkhelil @waveywaves. When I circle back to this I'll see about trying to zero in on what the issue from these plugin bumps might be. That might entail me manually installing those plugins in a Jenkins instance and running the test manually.
Examining the first of the failures:
Perhaps an opportunity to bolster the extended tests. But for now I'm just going to reproduce this manually.
e2e-aws-jenkins also passed in my openshift/origin test PR https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26313/pull-ci-openshift-origin-master-e2e-aws-jenkins/1412827959114338304
Well, after bumping my manually deployed Jenkins to the new versions of the k8s and sync plugins specified here, I get this nasty looking stack trace on startup (which has come up in other CI analysis I've been a part of recently):
But IIRC that stack trace was deemed benign in those past discussions, and my manual running of the bluegreen pipeline and samplepipeline-wth-envvars worked fine. Deletes were handled too. Will /retest one more time while I set up the openshift/origin e2e's to run against my manually deployed cluster. I'll see if the instructions I put in https://github.com/openshift/jenkins/blob/master/CONTRIBUTING_TO_OPENSHIFT_JENKINS_IMAGE_AND_PLUGINS.md#extended-tests back in July of 2019 still work :-) It will probably be Thursday/tomorrow from this comment @jkhelil @waveywaves before I get to this.
OK, made some progress debugging this @akram @jkhelil @waveywaves @jitendar-singh .... definitely tricky to diagnose with the current tests going, but running the tests against my local cluster and tracking progress live was helpful. I was then able to cross reference with that latest CI run at https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_jenkins/1297/pull-ci-openshift-jenkins-master-e2e-aws-jenkins/1412872236422926336 where I see the same thing.
See how it failed after exactly 1hr30m0s. That means the test timed out. Proof of that followed.
The "waiting to schedule task" state, where we don't have an agent with a given label, certainly could be affected by the k8s plugin bump; the newer version of the k8s plugin probably handles agent labeling differently. Bottom line: it is not finding our java/maven agent, but it takes a long time to fail.
In the CI run, the build had just started getting processed and there was no build URI yet. In my local run, it just stopped in the middle of pulling an image.
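The hang-until-timeout behavior above can be illustrated with a minimal scripted pipeline (a hedged sketch, not taken from the failing test; the `maven` label and the 10-minute bound are assumptions):

```groovy
// If no pod template advertises the 'maven' label after the k8s plugin bump,
// node() blocks in "waiting to schedule task" indefinitely. Wrapping the
// whole run in an explicit timeout surfaces the missing-label problem fast
// instead of letting the e2e job burn its full 1h30m budget.
timeout(time: 10, unit: 'MINUTES') {
    node('maven') {          // blocks here if the label matches no agent
        sh 'mvn -v'          // trivial step; never reached when no agent exists
    }
}
```

In scripted pipeline the `timeout` step wraps agent acquisition as well as the body, which is why it is placed outside `node` here.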
Next steps:
Next update: of course the openshift-jee-sample with configmap and imagestream pod templates worked for me locally with k8s 1.30 and sync 1.0.47. So I went back and compared debug output again. The only difference I see is the last line of this log from the failed CI run:
I don't see that locally. It is unclear to me, without diving into the k8s code, whether that is a benign log or really indicative of something wrong. OK, now on to rebuilding the image at k8s plugin 1.29.7. We'll then compare notes.
Then I see this in the pipeline log:
No idea yet what from those 1.0.47 changes would have caused this. I'll see what I can uncover today and post an update when I know something.
@akram @jkhelil @waveywaves @jitendar-singh @adambkaplan turns out this test is faulty, and the checks there artificially let it pass with the old plugin, based on the timing of when events happened. v1.0.47 changed the timing enough that the checks that were there now catch the failure. This test is an older one that creates a local git repo ... however, unless you do that in the same pod as jenkins, it cannot access the locally created repo. Short term, I'm going to comment out the test in openshift/origin. After that, I'm going to move the local git repo to one of our test repos at https://github.com/openshift, and also use that as an opportunity next week to move in earnest on migrating the jenkins e2e's to the jenkins repos. We also have that client plugin test we marked for removal during the ARM/remove-mongodb foray from Yaakov a few weeks ago. So net, I believe the only v1.0.47 regression is the imagestreamtag pod templates noted earlier here.
One small follow up: it "works" when I use https://github.com/gabemontero/test-jenkins-bc-env-var-override and manually run
but v1.0.47 is significantly slower than v1.0.46. It takes several minutes for 1.0.47 to find the build that
Created https://bugzilla.redhat.com/show_bug.cgi?id=1981957 to track the "slow start of pipelines" issue @gabemontero identified. I am inclined to set the "blocker +" flag on this regression. |
@gabemontero @akram @waveywaves @jkhelil we have an escalation on this PR. I think for the immediate term we need to document the restriction on imagestream tags and move on. |
/retest |
Hard to tell if the imagestreamtag test caused the other failures, but there are multiple failures. I'll do a combo tomorrow of seeing what test(s) I have to comment out against 1.0.47 to get a clean run, along with investigating the slow start time and seeing how pervasive that is.
Of lesser concern, I have a thought on the "change" needed to get imagestreamtag to work, but it needs a test modification. So short term comment it out, then add it back in afterward. Again, the super slow pipeline run start, and how pervasive it is, is the meets-min for determining whether we go with 1.0.47 or if we really should craft 1.0.48.
Also, I was about to add to the customer BZ that you can disable each of the ConfigMap/Secret/ImageStream/BuildConfig/Build watches individually to reduce load. However, our config panel seems to be broken as well. I only see the list interval config option displayed. Not sure if just making that super long reduces the api server load. Unless one of you guys already knows about that, something else to look into.
Maybe @jkhelil @waveywaves we can divide and conquer on Friday, and one of you two could look into the config panel, or maybe somebody can look into #1297 (comment) on Europe / India time and get somewhere with it before I log on Friday AM US Eastern time?
Hey @gabemontero. I just ran a few pipelines with sync plugin 1.0.47 and saw failures. After running https://github.com/sclorg/nodejs-ex/tree/master/openshift/pipeline I am seeing the below output in the console log and seeing the nasty stack trace.
The example you have provided with the one with env vars works but isn't really using any openshift libraries.
I would have to see the pipeline you are using @waveywaves, but that looks like you have a typo in your Jenkinsfile pipeline. In any event, it is a red herring wrt what we want to debug. You seem to have executed the pipeline OK based on the other output. My example simply proves that the env var substitution function of the sync plugin works. The thing we are trying to debug is the delay. Were you able to get anywhere with that? I've pinged you on slack about joining a video conf. If you are still around, please join it. Otherwise, I'll pursue it myself, I suppose.
That said, I do have some insight on the delay @akram @jkhelil @waveywaves @adambkaplan .... it is an old friend / timing bug that has resurfaced with Akram's rewrite. The key log is:
So when we new-app, it creates the new BC and Build at the same time. If the Build event comes first, you see the "no job at this time" message. But it is supposed to put this on a list, and then when the BC event comes in, it is supposed to fire any Builds that came in before the BC. That is not happening. Hence it is dependent on the relist. So,
But next steps:
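The intended behavior described above (park Build events that arrive before their owning BuildConfig, then fire them once the BuildConfig event creates the Jenkins job) can be sketched roughly like this. This is a hedged illustration, not the actual sync-plugin code; the class and method names here are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the intended event-ordering behavior: a Build event that races
// ahead of its BuildConfig is parked rather than dropped, so it fires as
// soon as the BuildConfig event lands instead of waiting for the relist.
class BuildEventBuffer {
    // builds parked because no Jenkins job exists yet, keyed by BuildConfig name
    private final Map<String, List<String>> pending = new HashMap<>();
    private final Set<String> knownConfigs = new HashSet<>();
    private final List<String> fired = new ArrayList<>();

    void onBuildEvent(String bcName, String buildName) {
        if (knownConfigs.contains(bcName)) {
            fired.add(buildName); // job already exists: trigger immediately
        } else {
            // the "no job at this time" case: park instead of relying on relist
            pending.computeIfAbsent(bcName, k -> new ArrayList<>()).add(buildName);
        }
    }

    void onBuildConfigEvent(String bcName) {
        knownConfigs.add(bcName);
        // fire any builds that raced ahead of their BuildConfig
        List<String> parked = pending.remove(bcName);
        if (parked != null) {
            fired.addAll(parked);
        }
    }

    List<String> firedBuilds() {
        return fired;
    }
}
```

When this buffering is missing, the parked build only starts at the next relist interval, which matches the multi-minute delay observed here.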
OK, so the persistent volume test also breaks with v1.0.47 of the sync plugin. This is where we
Lots of customers leverage persistent volumes and expect state like this to keep functioning (i.e. being able to reconcile after a restart). At first blush, I would consider this a blocker to releasing 1.0.47 @akram @waveywaves @jkhelil @adambkaplan. But we can certainly reconvene on Monday when everyone is available. There is one more failure I need to look into next. Will report back when I have data.
The remaining failure was in the bluegreen test like I noted last week with #1297 (comment). It still halts in the middle of its run. Since the test employs I have not tested it yet, but I coded up an initial fix attempt for that while these tests have been running. Going to leave the PV test disabled locally for me right now, and see if this fix covers things with local testing. If it does, I'll open a sync plugin PR and we'll go from there. Then I'll circle back to the PV failure.
Correction: it is not the PV tests that are broken, it is the recognizing of deleted builds and deleting the corresponding jobs in jenkins. It is also broken with the ephemeral template. I'll see if I can fix that quickly and add it to my fix for the event timing issue, which I've verified.
OK, I have everything passing for me again @akram @waveywaves @jkhelil @adambkaplan with my soon-to-be-a-PR sync plugin updates, except for the imagestreamtag regression. I'll be pushing an openshift/origin PR shortly with that test commented out. However, I'll next take a stab at addressing the imagestreamtag pod template when both the imagestream is labeled and the tag is annotated. If that works as is, great; I'll just update our openshift/origin TC to do both. If sync plugin updates are needed to accommodate that, then we will move forward with that test commented out. I'll include whatever sync plugin updates are needed to make that work, if containable, under my upcoming sync plugin PR. Then we'll re-enable the test in openshift/origin where we do both the label and annotation, and confirm it works.
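For reference, the labeled-imagestream-plus-annotated-tag combination being tested might look like the following manifest. This is a hedged sketch: the stream name, image reference, and the `role: jenkins-slave` marker are assumptions based on the sync plugin's documented pod-template conventions, not taken from the failing test:

```yaml
apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: custom-maven-agent          # hypothetical name
  labels:
    role: jenkins-slave             # stream-level marker the sync plugin watches
spec:
  tags:
    - name: latest
      annotations:
        role: jenkins-slave         # tag-level marker for imagestreamtag pod templates
      from:
        kind: DockerImage
        name: quay.io/example/maven-agent:latest   # hypothetical image
```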
Force-pushed from 9f8fff0 to 744998e
AWS install failure on the last e2e.
@@ -17,8 +17,7 @@ htmlpublisher:1.21
jira:3.0.17
job-dsl:1.77
junit:1.30
kubernetes:1.29.7
kubernetes-client-api:4.13.3-1
kubernetes:1.30.0
@akram I noticed in my recent testing there is now a version 1.30.1 of the k8s plugin.
Perhaps not required, but something to go ahead and move to.
/retest
@gabemontero I will test with the 1.30.1 on a separate branch then. |
Sounds good @akram .... I'm fine merging this as is and following up. I see the tests are green!! While the iron is hot: /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: akram, gabemontero. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@akram: All pull requests linked via external trackers have merged: Bugzilla bug 1925524 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
https://issues.redhat.com/browse/ART-3173 for bumping RPMs for official image |
In preparation for releasing openshift-sync plugin 1.0.46