
Bug 1925524: bump k8s plugin to 1.30 to enable 1.0.48 release of openshift sync plugin #1297

Merged

Conversation

akram
Contributor

@akram akram commented Jun 24, 2021

In preparation to release openshift-sync plugin 1.0.46

@openshift-ci openshift-ci bot requested review from jkhelil and otaviof June 24, 2021 09:54
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 24, 2021
@akram akram force-pushed the bump-kubernetes-plugin-to-1.30 branch from a8faf78 to 9f8fff0 Compare June 24, 2021 17:40
@akram
Contributor Author

akram commented Jun 25, 2021

/retest

@akram
Contributor Author

akram commented Jun 28, 2021

seems to have issues when trying to delete e2e namespaces.

STEP: Destroying namespace "e2e-test-jenkins-pipeline-2kf9x" for this suite.
fail [github.com/openshift/origin/test/extended/builds/pipeline_origin_bld.go:516]: Expected
    <bool>: false
to be true

/retest

@jkhelil
Contributor

jkhelil commented Jul 6, 2021

/test e2e-aws-jenkins

@gabemontero
Contributor

the e2e failures are not obvious CI flakes to me

I'm doing a deep dive into them, including performing test runs of the job in openshift/origin, as well as possible test PRs off of openshift/jenkins

@gabemontero
Contributor

/retitle Bug 1925524: bump k8s plugin to 1.30 to enable 1.0.46 release of openshift sync plugin

@openshift-ci openshift-ci bot changed the title Bump kubernetes plugin to 1.30 Bug 1925524: bump k8s plugin to 1.30 to enable 1.0.46 release of openshift sync plugin Jul 7, 2021
@openshift-ci openshift-ci bot added the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label Jul 7, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jul 7, 2021

@akram: This pull request references Bugzilla bug 1925524, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @jitendar-singh

In response to this:

Bug 1925524: bump k8s plugin to 1.30 to enable 1.0.46 release of openshift sync plugin

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jul 7, 2021
@gabemontero
Contributor

I'll also note that the sync plugin is already listed at 1.0.46: https://github.com/openshift/jenkins/blob/master/2/contrib/openshift/base-plugins.txt#L30

Now, I do see a 1.0.47 upstream: https://github.com/jenkinsci/openshift-sync-plugin/tree/openshift-sync-1.0.47

Are we now bumping k8s in prep for that?

@gabemontero
Contributor

officially subscribing relevant team members while Akram is on PTO

/assign @jkhelil
/assign @waveywaves
/assign @gabemontero

@@ -27,7 +26,7 @@ mercurial:2.12
metrics:4.0.2.6
openshift-client:1.0.35
openshift-login:1.0.26
openshift-sync:1.0.46
Contributor

duh it is trying to go to 1.0.47 now :-)

@gabemontero
Contributor

/retitle Bug 1925524: bump k8s plugin to 1.30 to enable 1.0.47 release of openshift sync plugin

@openshift-ci openshift-ci bot changed the title Bug 1925524: bump k8s plugin to 1.30 to enable 1.0.46 release of openshift sync plugin Bug 1925524: bump k8s plugin to 1.30 to enable 1.0.47 release of openshift sync plugin Jul 7, 2021
@gabemontero
Contributor

So yeah e2e-aws-jenkins passed over in my test PR https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_jenkins/1301/pull-ci-openshift-jenkins-master-e2e-aws-jenkins/1412801356313399296

So that points to this failure stemming from one or both of these plugin updates @jkhelil @waveywaves

When I circle back to this I'll see about trying to zero in on what the issue from these plugin bumps might be. Might entail me manually installing those plugins in a jenkins instance and running the test manually.

@gabemontero
Contributor

Examining the first of the failures:

  • the build bluegreen-pipeline-1 has a failed condition of true
  • unfortunately, the way the failure fell, we did not get a dump of the jenkins pod logs and pipeline console logs

Perhaps an opportunity to bolster the extended tests. But for now I'm just going to reproduce this manually.

@gabemontero
Contributor

Well, after bumping my manually deployed jenkins to the new versions of k8s and sync plugins specified here, I get this nasty looking stacktrace on startup (which has come up in other CI analysis I've been a part of recently):

java.lang.NoSuchMethodError: 'java.lang.Object io.fabric8.kubernetes.client.dsl.WatchAndWaitable.watch(java.lang.Object)'
	at io.fabric8.jenkins.openshiftsync.ImageStreamWatcher$1.doRun(ImageStreamWatcher.java:75)
	at hudson.triggers.SafeTimerTask.run(SafeTimerTask.java:91)
	at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:67)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

But IIRC that stack trace was deemed benign in those past discussions, and my manual runs of the bluegreen pipeline and samplepipeline-with-envvars worked fine. Deletes were handled too.

Will

/retest

one more time while I set up the openshift/origin e2e's to run against my manually deployed cluster.

I'll see if the instructions I put in https://github.com/openshift/jenkins/blob/master/CONTRIBUTING_TO_OPENSHIFT_JENKINS_IMAGE_AND_PLUGINS.md#extended-tests back in July of 2019 still work :-)

Probably will be Thursday/tomorrow from this comment @jkhelil @waveywaves before I get to this.

@gabemontero
Contributor

OK made some progress debugging this @akram @jkhelil @waveywaves @jitendar-singh .... definitely tricky to diagnose with the current tests going, but running the tests against my local cluster and tracking progress live was helpful. I was then able to cross-reference with the latest CI run at https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_jenkins/1297/pull-ci-openshift-jenkins-master-e2e-aws-jenkins/1412872236422926336 where I see the same thing.

  1. the first clue from the CI run was the first message in that run:

Test started yesterday at 4:32 PM failed after 1h30m0s

See how it failed after exactly 1h30m0s. That means the test timed out.

Proof of that followed.

  2. the root cause culprit is
Timed out waiting for build "openshift-jee-sample-1" to complete

 e2e-test-jenkins-pipeline-xprfz-openshift-jee-sample job log:
OpenShift Build e2e-test-jenkins-pipeline-xprfz/openshift-jee-sample-1
Running in Durability level: MAX_SURVIVABILITY
[Pipeline] Start of Pipeline
[Pipeline] timeout
Timeout set to expire in 20 min
[Pipeline] {
[Pipeline] node
Still waiting to schedule task
‘Jenkins’ doesn’t have label ‘jenkins-slave’

the "waiting to schedule task" state, where no agent has the requested label, certainly could be affected by the k8s plugin bump; the newer version of the k8s plugin probably handles agent labeling differently (see the pipeline sketch at the end of this comment)

but bottom line, it is not finding our java/maven agent

and it takes a long time to fail

  3. the subsequent tests, like the bluegreen one, "failed", but when I looked at the debug logs it was clear the processing had just stopped in the middle of things.

in the CI run, the build had just started getting processed and there was no build uri yet

in my local run, it just stopped in the middle of pulling an image

[Pipeline] _OcAction
[logs:build/nodejs-postgresql-example-2] Cloning "https://github.com/openshift/nodejs-ex.git" ...
[logs:build/nodejs-postgresql-example-2] 	Commit:	7b9f57949786059a3fab03b8493279c945770fb0 (Merge pull request #249 from multi-arch/master)
[logs:build/nodejs-postgresql-example-2] 	Author:	Honza Horak <hhorak@redhat.com>
[logs:build/nodejs-postgresql-example-2] 	Date:	Wed Sep 23 10:52:52 2020 +0200
[logs:build/nodejs-postgresql-example-2] Caching blobs under "/var/cache/blobs".
[logs:build/nodejs-postgresql-example-2] Getting image source signatures
[logs:build/nodejs-postgresql-example-2] Copying blob sha256:f8c518873786e7b92236e6abb32fb16c6ba09040dafd2b860ea79ebdcf10beaf
[logs:build/nodejs-postgresql-example-2] Copying blob sha256:9b0c218cbfb1a6d2db936bec28fd826b10399ecb4e3db3cf3bf69ebb4c37bfa3
[logs:build/nodejs-postgresql-example-2] Copying blob sha256:93156a512b9854bb32007f4b7f7bb31f3ec271c86f5f20b7617a1b9e3e62577b
[logs:build/nodejs-postgresql-example-2] Copying blob sha256:d97064154091d2ee58f5ef88d8b27739257a3f10c9629399b166d49eb3d73bfe
[logs:build/nodejs-postgresql-example-2] Copying blob sha256:661ebd06511fe10ae7965ff645e0dc34dad28d7195e3f34bddc7d785238ac513


 END debugAnyJenkinsFailure

Next steps:

  1. Although @akram's PR title certainly seems to imply that sync plugin 1.0.47 needs k8s plugin 1.30, I'm going to try a run of 1.0.47 with 1.29.7 minimally for comparison and see where we land

  2. Then if need be, I'll debug our openshift-jee-sample test manually and see if I can get the labels etc. right so it runs, then try and apply that change to this PR
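Going back to the agent-label point above: the pipelines in these tests schedule their work on an agent by label with something along these lines (a minimal sketch for illustration, not the exact test Jenkinsfile):

node('jenkins-slave') {   // the label the failing run above was asking for
    stage('build') {
        // if no cloud / pod template advertises this label after the plugin bump,
        // the run sits at "Still waiting to schedule task" until the enclosing
        // timeout expires
        sh 'mvn --version'
    }
}

If the k8s plugin bump changed how pod template labels are derived or matched, a node() block like this is exactly what would hang.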

@gabemontero
Contributor

Next update:

of course the openshift-jee-sample with configmap and imagestream pod templates worked for me locally with k8s 1.30 and sync 1.0.47

So I went back and compared the debug logs again.

the only difference I see is the last line of this log from the failed CI run:

2021-07-07 21:35:02 INFO    io.fabric8.jenkins.openshiftsync.BuildSyncRunListener onStarted Build cause for the run is: io.fabric8.jenkins.openshiftsync.BuildCause@1062cbf7
2021-07-07 21:35:02 INFO    io.fabric8.jenkins.openshiftsync.BuildSyncRunListener onStarted starting polling build job/e2e-test-jenkins-pipeline-xprfz/job/e2e-test-jenkins-pipeline-xprfz-openshift-jee-sample/1/
2021-07-07 21:35:02 INFO    io.fabric8.jenkins.openshiftsync.BuildInformer onUpdate Build informer received update event for: {} to: {}34760 34762
2021-07-07 21:35:04 INFO    io.fabric8.jenkins.openshiftsync.BuildSyncRunListener upsertBuild Setting build status values to: openshift-jee-sample-1:[ Running ]: 2021-07-07T21:35:02Z->null
2021-07-07 21:35:04 INFO    io.fabric8.jenkins.openshiftsync.BuildInformer onUpdate Build informer received update event for: {} to: {}34762 34782
2021-07-07 21:35:05 INFO    hudson.slaves.NodeProvisioner lambda$update$6 jenkins-slave-08k0c provisioning successfully completed. We have now 2 computer(s)
2021-07-07 21:35:05 INFO    org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch Created Pod: openshift e2e-test-jenkins-pipeline-xprfz/jenkins-slave-08k0c
2021-07-07 21:35:08 INFO    org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch Pod is running: openshift e2e-test-jenkins-pipeline-xprfz/jenkins-slave-08k0c
2021-07-07 21:35:08 INFO    hudson.TcpSlaveAgentListener$ConnectionHandler run Accepted JNLP4-connect connection #2 from /10.129.2.22:59546
2021-07-07 21:35:12 INFO    org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave _terminate Terminating Kubernetes instance for agent jenkins-slave-08k0c
2021-07-07 21:35:12 INFO    org.jenkinsci.plugins.workflow.job.WorkflowRun finish e2e-test-jenkins-pipeline-xprfz/e2e-test-jenkins-pipeline-xprfz-openshift-jee-sample #1 completed: SUCCESS
2021-07-07 21:35:12 INFO    io.fabric8.jenkins.openshiftsync.BuildSyncRunListener upsertBuild Setting build status values to: openshift-jee-sample-1:[ Complete ]: 2021-07-07T21:35:02Z->2021-07-07T21:35:12Z
Terminated Kubernetes instance for agent e2e-test-jenkins-pipeline-xprfz/jenkins-slave-08k0c
Disconnected computer jenkins-slave-08k0c
2021-07-07 21:35:12 INFO    org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave deleteSlavePod Terminated Kubernetes instance for agent e2e-test-jenkins-pipeline-xprfz/jenkins-slave-08k0c
2021-07-07 21:35:12 INFO    org.csanchez.jenkins.plugins.kubernetes.KubernetesSlave _terminate Disconnected computer jenkins-slave-08k0c
2021-07-07 21:35:12 INFO    jenkins.slaves.DefaultJnlpSlaveReceiver channelClosed Computer.threadPoolForRemoting [#70] for jenkins-slave-08k0c terminated: java.nio.channels.ClosedChannelException
2021-07-07 21:35:12 INFO    hudson.remoting.Request$2 run Failed to send back a reply to the request hudson.remoting.Request$2@25b090ea: hudson.remoting.ChannelClosedException: Channel "hudson.remoting.Channel@50e4e330:JNLP4-connect connection from ip-10-129-2-22.us-east-2.compute.internal/10.129.2.22:59546": channel is already closed

I don't see that hudson.remoting.Request$2 run Failed to send back a reply to the request log in my local run.

Unclear to me without diving into the k8s code if that is a benign log or really indicative of something wrong.

OK, now on to rebuilding the image at k8s plugin 1.29.7

We'll then compare notes.

@gabemontero
Contributor

the Pipeline with env vars and git repo source e2e also appears to be broken now with 1.0.47 of the sync plugin @akram @jkhelil @waveywaves @jitendar-singh @adambkaplan

I see this in the pipeline log:

Cloning the remote Git repository
Cloning repository /tmp/test-build-app-pipeline544874952/test-build-app-pipeline.git
 > git init /var/lib/jenkins/jobs/e2e-test-jenkins-pipeline-9bmz9/jobs/e2e-test-jenkins-pipeline-9bmz9-test-build-app-pipeline/workspace@script # timeout=10
Fetching upstream changes from /tmp/test-build-app-pipeline544874952/test-build-app-pipeline.git
 > git --version # timeout=10
 > git --version # 'git version 2.27.0'
 > git fetch --tags --force --progress -- /tmp/test-build-app-pipeline544874952/test-build-app-pipeline.git +refs/heads/*:refs/remotes/origin/* # timeout=10
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git fetch --tags --force --progress -- /tmp/test-build-app-pipeline544874952/test-build-app-pipeline.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: fatal: '/tmp/test-build-app-pipeline544874952/test-build-app-pipeline.git' does not appear to be a git repository
fatal: Could not read from remote repository.

No idea yet what from those 1.0.47 changes would have caused this. I'll see what I can uncover today and post an update when I know something.

@gabemontero
Contributor

the Pipeline with env vars and git repo source e2e also appears to be broken now with 1.0.47 of the sync plugin @akram @jkhelil @waveywaves @jitendar-singh @adambkaplan

I see this in the pipeline log:

Cloning the remote Git repository
Cloning repository /tmp/test-build-app-pipeline544874952/test-build-app-pipeline.git
 > git init /var/lib/jenkins/jobs/e2e-test-jenkins-pipeline-9bmz9/jobs/e2e-test-jenkins-pipeline-9bmz9-test-build-app-pipeline/workspace@script # timeout=10
Fetching upstream changes from /tmp/test-build-app-pipeline544874952/test-build-app-pipeline.git
 > git --version # timeout=10
 > git --version # 'git version 2.27.0'
 > git fetch --tags --force --progress -- /tmp/test-build-app-pipeline544874952/test-build-app-pipeline.git +refs/heads/*:refs/remotes/origin/* # timeout=10
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git fetch --tags --force --progress -- /tmp/test-build-app-pipeline544874952/test-build-app-pipeline.git +refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: fatal: '/tmp/test-build-app-pipeline544874952/test-build-app-pipeline.git' does not appear to be a git repository
fatal: Could not read from remote repository.

No idea yet what from those 1.0.47 changes would have caused this. I'll see what I can uncover today and post an update when I know something.

@akram @jkhelil @waveywaves @jitendar-singh @adambkaplan turns out this test is faulty, and the checks there artificially let it pass with the old plugin, based on the timing of when events happened. v1.0.47 changed the timing enough that the checks that were there now catch the failure.

this test is an older one that creates a local git repo ... however, unless you do that in the same pod as jenkins, jenkins cannot access the locally created repo

short term, I'm going to comment out the test in openshift/origin

after that, I'm going to move the local git repo to one of our test repos at https://github.com/openshift, and also use that as an opportunity next week to start moving jenkins e2e's to the jenkins repos in earnest

we also have that client plugin test we marked for removal during the ARM/remove mongodb foray from Yaakov a few weeks ago.

So net, I believe the only v1.0.47 regression is the imagestreamtag pod templates noted earlier here.

@gabemontero
Contributor

gabemontero commented Jul 9, 2021

One small follow up: it "works" when I use https://github.com/gabemontero/test-jenkins-bc-env-var-override and manually run

oc new-app https://github.com/gabemontero/test-jenkins-bc-env-var-override.git --strategy=pipeline --build-env=FOO1=BAR1

but v1.0.47 is significantly slower than v1.0.46. It takes several minutes for 1.0.47 to find the build that oc new-app immediately creates. With v1.0.46 that build that oc new-app creates is found instantly.
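The Jenkinsfile in that repo is essentially just echoing the build env vars; roughly this shape (my reconstruction from the job output, not the exact file):

node {
    // FOO1 is set on the BuildConfig via --build-env=FOO1=BAR1; the sync plugin
    // exposes the BuildConfig/Build env vars to the pipeline, so this prints "FOO1 is BAR1"
    echo "FOO1 is ${env.FOO1}"
    // FOO2 was never set, so this prints "FOO2 is null"
    echo "FOO2 is ${env.FOO2}"
}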

@adambkaplan
Contributor

Created https://bugzilla.redhat.com/show_bug.cgi?id=1981957 to track the "slow start of pipelines" issue @gabemontero identified. I am inclined to set the "blocker +" flag on this regression.

@adambkaplan
Contributor

@gabemontero @akram @waveywaves @jkhelil we have an escalation on this PR. I think for the immediate term we need to document the restriction on imagestream tags and move on.

@adambkaplan
Contributor

/retest

@gabemontero
Contributor

Hard to tell if the imagestreamtag test caused the other failures, but there are multiple failures.

I'll do a combo tomorrow of seeing what test(s) I have to comment out against 1.0.47 to get a clean run, along with investigating the slow start time and seeing how pervasive that is.

Of lesser concern, I have a thought on the "change" needed to get imagestreamtag to work, but it needs a test modification. So short term, comment it out, then add it back in afterward.

Again, the super slow pipeline run start, and how pervasive it is, is the meets min for determining whether we go with 1.0.47 or if we really should craft 1.0.48.

Also, I was about to add to the customer BZ that you can disable each of the ConfigMap/Secret/ImageStream/BuildConfig/Build watches individually to reduce load. However, our config panel seems to be broken as well. I only see the list interval config option displayed. Not sure if just making that super long reduces the api server load.

Unless one of you guys already knows about that, something else to look into. Maybe @jkhelil @waveywaves we can divide and conquer on Friday, and one of you two could look into the config panel, or maybe somebody can look into #1297 (comment) on Europe / India time and get somewhere with it before I log on Friday AM US Eastern time?

@waveywaves
Contributor

Hey @gabemontero. I just ran a few pipelines with sync plugin 1.0.47 and saw failures. After running https://github.com/sclorg/nodejs-ex/tree/master/openshift/pipeline I am seeing the below output in the console log, with the nasty stack trace.

java.lang.NoSuchMethodError: No such DSL method 'openshiftBuild' found among steps [_OcAction, _OcContextInit, _OcWatch, archive, bat, build, catchError, checkout, container, containerLog, deleteDir, dir, dockerFingerprintFrom, dockerFingerprintRun, echo, envVarsForTool, error, fileExists, findFiles, getContext, git, input, isUnix, jiraComment, jiraIssueSelector, jiraSearch, junit, library, libraryResource, load, lock, mail, milestone, node, nodesByLabel, parallel, podTemplate, powershell, properties, publishHTML, pwd, readCSV, readFile, readJSON, readManifest, readMavenPom, readProperties, readTrusted, readYaml, resolveScm, retry, script, sh, sha1, sleep, stage, stash, step, svn, tee, timeout, tm, tool, touch, unarchive, unstable, unstash, unzip, validateDeclarativePipeline, waitUntil, warnError, withContext, withCredentials, withDockerContainer, withDockerRegistry, withDockerServer, withEnv, wrap, writeCSV, writeFile, writeJSON, writeMavenPom, writeYaml, ws, zip] or symbols [all, allOf, always, ant, antFromApache, antOutcome, antTarget, any, anyOf, apiToken, architecture, archiveArtifacts, artifactManager, authorizationMatrix, batchFile, bitbucket, booleanParam, branch, buildButton, buildDiscarder, buildDiscarders, buildingTag, caseInsensitive, caseSensitive, certificate, changeRequest, changelog, changeset, checkoutToSubdirectory, choice, choiceParam, clock, command, configFile, configFileProvider, configMapVolume, containerEnvVar, containerLivenessProbe, containerTemplate, credentials, cron, crumb, default, defaultFolderConfiguration, defaultView, demand, disableConcurrentBuilds, disableResume, docker, dockerCert, dockerfile, downstream, dumb, durabilityHint, dynamicPVC, emptyDirVolume, emptyDirWorkspaceVolume, envVar, envVars, envVarsFilter, environment, equals, expression, file, fileParam, filePath, fingerprint, fingerprints, frameOptions, freeStyle, freeStyleJob, fromScm, fromSource, git, gitBranchDiscovery, gitHubBranchDiscovery, gitHubBranchHeadAuthority, gitHubExcludeArchivedRepositories, gitHubForkDiscovery, gitHubPullRequestDiscovery, gitHubSshCheckout, gitHubTagDiscovery, gitHubTrustContributors, gitHubTrustEveryone, gitHubTrustNobody, gitHubTrustPermissions, gitTagDiscovery, github, githubPush, globalConfigFiles, headRegexFilter, headWildcardFilter, hostPathVolume, hostPathWorkspaceVolume, hyperlink, hyperlinkToModels, inheriting, inheritingGlobal, installSource, isRestartedRun, jdk, jdkInstaller, jgit, jgitapache, jnlp, jobBuildDiscarder, jobDsl, jobName, kubeconfig, kubernetes, label, lastDuration, lastFailure, lastGrantedAuthorities, lastStable, lastSuccess, legacy, legacySCM, list, local, location, logRotator, loggedInUsersCanDoAnything, mailer, masterBuild, maven, maven3Mojos, mavenErrors, mavenGlobalConfig, mavenMojos, mavenWarnings, merge, modernSCM, myView, never, newContainerPerStage, nfsVolume, nfsWorkspaceVolume, node, nodeProperties, nonInheriting, none, not, oc, onFailure, override, overrideIndexTriggers, paneStatus, parallelsAlwaysFailFast, parameters, password, pattern, permanent, persistentVolumeClaim, persistentVolumeClaimWorkspaceVolume, pipeline-model, pipeline-model-docker, pipelineTriggers, plainText, plugin, podAnnotation, podEnvVar, podLabel, pollSCM, portMapping, preserveStashes, projectNamingStrategy, proxy, pruneTags, queueItemAuthenticator, quietPeriod, rateLimitBuilds, resourceRoot, retainOnlyVariables, run, runParam, schedule, scmRetryCount, script, scriptApproval, scriptApprovalLink, search, secretEnvVar, secretVolume, security, shell, 
simpleBuildDiscarder, skipDefaultCheckout, skipStagesAfterUnstable, slave, sourceRegexFilter, sourceWildcardFilter, sshPublicKey, sshUserPrivateKey, standard, status, string, stringParam, swapSpace, tag, teamSlugFilter, text, textParam, timezone, tmpSpace, toolLocation, triggeredBy, unsecured, upstream, url, userSeed, usernameColonPassword, usernamePassword, viewsTabBar, weather, withAnt, zip] or globals [currentBuild, docker, env, fileLoader, openshift, params, pipeline, scm]
	at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:216)
	at org.jenkinsci.plugins.workflow.cps.CpsScript.invokeMethod(CpsScript.java:122)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:93)
	at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1213)
	at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1022)
	at org.codehaus.groovy.runtime.callsite.PogoMetaClassSite.call(PogoMetaClassSite.java:42)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:48)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:113)
	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:163)
	at org.kohsuke.groovy.sandbox.GroovyInterceptor.onMethodCall(GroovyInterceptor.java:23)
	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.SandboxInterceptor.onMethodCall(SandboxInterceptor.java:157)
	at org.kohsuke.groovy.sandbox.impl.Checker$1.call(Checker.java:161)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:165)
	at org.kohsuke.groovy.sandbox.impl.Checker.checkedCall(Checker.java:135)
	at com.cloudbees.groovy.cps.sandbox.SandboxInvoker.methodCall(SandboxInvoker.java:17)
	at WorkflowScript.run(WorkflowScript:3)
	at ___cps.transform___(Native Method)
	at com.cloudbees.groovy.cps.impl.ContinuationGroup.methodCall(ContinuationGroup.java:86)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.dispatchOrArg(FunctionCallBlock.java:113)
	at com.cloudbees.groovy.cps.impl.FunctionCallBlock$ContinuationImpl.fixArg(FunctionCallBlock.java:83)
	at jdk.internal.reflect.GeneratedMethodAccessor246.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
	at com.cloudbees.groovy.cps.impl.CollectionLiteralBlock$ContinuationImpl.dispatch(CollectionLiteralBlock.java:55)
	at com.cloudbees.groovy.cps.impl.CollectionLiteralBlock$ContinuationImpl.item(CollectionLiteralBlock.java:45)
	at jdk.internal.reflect.GeneratedMethodAccessor449.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at com.cloudbees.groovy.cps.impl.ContinuationPtr$ContinuationImpl.receive(ContinuationPtr.java:72)
	at com.cloudbees.groovy.cps.impl.ConstantBlock.eval(ConstantBlock.java:21)
	at com.cloudbees.groovy.cps.Next.step(Next.java:83)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:174)
	at com.cloudbees.groovy.cps.Continuable$1.call(Continuable.java:163)
	at org.codehaus.groovy.runtime.GroovyCategorySupport$ThreadCategoryInfo.use(GroovyCategorySupport.java:129)
	at org.codehaus.groovy.runtime.GroovyCategorySupport.use(GroovyCategorySupport.java:268)
	at com.cloudbees.groovy.cps.Continuable.run0(Continuable.java:163)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.access$001(SandboxContinuable.java:18)
	at org.jenkinsci.plugins.workflow.cps.SandboxContinuable.run0(SandboxContinuable.java:51)
	at org.jenkinsci.plugins.workflow.cps.CpsThread.runNextChunk(CpsThread.java:185)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.run(CpsThreadGroup.java:400)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup.access$400(CpsThreadGroup.java:96)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:312)
	at org.jenkinsci.plugins.workflow.cps.CpsThreadGroup$2.call(CpsThreadGroup.java:276)
	at org.jenkinsci.plugins.workflow.cps.CpsVmExecutorService$2.call(CpsVmExecutorService.java:67)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at hudson.remoting.SingleLaneExecutorService$1.run(SingleLaneExecutorService.java:139)
	at jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
	at jenkins.security.ImpersonatingExecutorService$1.run(ImpersonatingExecutorService.java:68)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Finished: FAILURE

The example you provided, the one with env vars, works but isn't really using any openshift libraries.

[Pipeline] {
[Pipeline] echo
FOO1 is BAR1
[Pipeline] echo
FOO2 is null
[Pipeline] echo
FOO3 is null
[Pipeline] echo
FOO4 is null
[Pipeline] }
[Pipeline] // node
[Pipeline] End of Pipeline
Finished: SUCCESS

@gabemontero
Contributor

I would have to see the pipeline you are using @waveywaves but that looks like you have a typo in your Jenkinsfile pipeline

In any event, it is a red herring wrt what we want to debug.

You seem to have executed the pipeline OK based on the other output.

My example simply proves that the env var substitution function of the sync plugin works.

The thing we are trying to debug is the delay. Were you able to get anywhere with that?

I've pinged you on slack about joining a video conf. If you are still around, please join it.

Otherwise, I'll pursue I suppose.

@gabemontero
Contributor

that said, I do have some insight on the delay @akram @jkhelil @waveywaves @adambkaplan .... it is an old friend / timing bug that has resurfaced with Akram's rewrite.

The key log is: 2021-07-16 12:38:00 INFO io.fabric8.jenkins.openshiftsync.BuildManager addEventToJenkinsJobRun skipping watch event for build test-jenkins-bc-env-var-override-2 no job at this time

So when we run oc new-app, it creates the new BC and Build at the same time.

If the Build event comes first, you see the "no job at this time" message.

But the plugin is supposed to put that build on a list, and then when the BC event comes in, it is supposed to fire any Builds that arrived before the BC.

That is not happening.

Hence it is dependent on the relist.
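In other words, the intended flow is roughly the following (a loose sketch of the pattern only; these are not the plugin's actual classes or method names):

class PendingBuilds {
    // builds whose BuildConfig has no Jenkins job yet, keyed by BuildConfig name
    private final Map<String, List<String>> pending = [:]

    synchronized void onBuildEvent(String bcName, String buildName, Closure jobExists, Closure triggerRun) {
        if (!jobExists(bcName)) {
            // the "no job at this time" case: remember the build instead of
            // relying solely on the periodic relist to pick it up later
            pending.computeIfAbsent(bcName) { [] } << buildName
            return
        }
        triggerRun(bcName, buildName)
    }

    synchronized void onBuildConfigJobCreated(String bcName, Closure triggerRun) {
        // the BC's job now exists; fire any builds that arrived before it
        pending.remove(bcName)?.each { triggerRun(bcName, it) }
    }
}

The regression is that the second half of that handshake is not firing, so an early Build just sits there until the next relist.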

So,

  1. how easy is it to fix? I'll know more when I dive into his new code. But presumably I could have something ready today.
  2. Would we block releasing 1.0.47 for this? ....... I lean toward yes, but we'll see how much progress we make today.

But next steps:

  1. I have asked @waveywaves to sort out our missing fields on the config panel. I'd like to have disabling some of the watches as an option. It may not be sufficient for all customers, but it could be for some. @jkhelil says @akram looked into this some, but it may have not helped entirely. He was not clear on all the details. We'll just have to wait until @akram is back on Monday to sort that out.

  2. First, I'm going to see how many test cases I have to disable to get the e2e-aws-jenkins job above to work. Based on how bad it is, we'll decide if that is even a viable option.

@gabemontero
Contributor

OK, so the persistent volume test also breaks with v1.0.47 of the sync plugin.

This is where we

  1. create 5 builds
  2. bring down jenkins
  3. delete a few of the builds via oc
  4. bring jenkins back up
  5. see that sync plugin is able to reconcile state and delete the job runs associated with the builds we deleted

Lots of customers leverage persistent volumes and expect things to function wrt state like this, being able to reconcile, after a restart.
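To make step 5 above concrete, the reconcile on startup amounts to roughly this (again just a sketch of the expected behavior, not the plugin's code):

// On Jenkins startup: remove job runs whose backing OpenShift Build no longer exists.
// runsByBuild maps a build name to the Jenkins run ids created for it.
void reconcileRuns(Map<String, List<String>> runsByBuild, Closure buildStillExists, Closure deleteRun) {
    runsByBuild.each { buildName, runs ->
        if (!buildStillExists(buildName)) {
            // the build was deleted (e.g. via oc delete build) while Jenkins was down
            runs.each { deleteRun(it) }
        }
    }
}

That reconciliation is what step 5 exercises.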

First blush, I would consider this a blocker to releasing 1.0.47 @akram @waveywaves @jkhelil @adambkaplan

But we can certainly reconvene on Monday when everyone is available.

There is one more failure I need to look into next. Will report back when I have data.

@gabemontero
Contributor

The remaining failure was in the bluegreen test like I noted last week with #1297 (comment)

It still halts in the middle of its run.

Since the test employs oc new-app and waits for the initial build created by oc new-app to complete, I believe this is the same thing I noted in #1297 (comment) with the delayed start where @adambkaplan has opened tracking bug https://bugzilla.redhat.com/show_bug.cgi?id=1981957

I have not tested it yet, but I coded up an initial fix attempt for that while these tests have been running.

Going to leave the PV test disabled locally for me right now, and see if this fix covers things with local testing. If it does, I'll open a sync plugin PR and we'll go from there.

Then I'll circle back to the PV failure.

@gabemontero
Contributor

OK, so the persistent volume test also breaks with v1.0.47 of the sync plugin.

This is where we

1. create 5 builds

2. bring down jenkins

3. delete a few of the builds via oc

4. bring jenkins back up

5. see that sync plugin is able to reconcile state and delete the job runs associated with the builds we deleted

Lots of customers leverage persistent volumes and expect things to function wrt state like this, being able to reconcile, after a restart.

First blush, I would consider this a blocker to releasing 1.0.47 @akram @waveywaves @jkhelil @adambkaplan

But we can certainly reconvene on Monday when everyone is available.

There is one more failure I need to look into next. Will report back when I have data.

correction - it is not the PV tests that are broken; it is the recognition of deleted builds and deletion of the corresponding jobs in jenkins.

it is also broken with the ephemeral template.

I'll see if I can fix that quick and add it to my fix for the event timing issue, which I've verified.

@gabemontero
Contributor

OK I have everything passing for me again @akram @waveywaves @jkhelil @adambkaplan with my soon-to-be-a-PR sync plugin updates, except for the imagestreamtag regression

I'll be pushing an openshift/origin PR shortly with that test commented out

However, I'll next take a stab at addressing the imagestreamtag pod template case where both the imagestream is labeled and the tag is annotated.

If that works as is, great. I'll just update our openshift/origin TC to do both.

If sync plugin updates are needed to accommodate that, then we will move forward with that test commented out. I'll include whatever sync plugin updates are needed to make that work, if containable, under my upcoming sync plugin PR. Then we'll re-enable the test in openshift/origin where we do both the label and annotation, and confirm it works.

@akram akram force-pushed the bump-kubernetes-plugin-to-1.30 branch from 9f8fff0 to 744998e Compare July 21, 2021 07:28
@akram akram changed the title Bug 1925524: bump k8s plugin to 1.30 to enable 1.0.47 release of openshift sync plugin Bug 1925524: bump k8s plugin to 1.30 to enable 1.0.48 release of openshift sync plugin Jul 21, 2021
@gabemontero
Contributor

aws install failure on last e2e

@@ -17,8 +17,7 @@ htmlpublisher:1.21
jira:3.0.17
job-dsl:1.77
junit:1.30
-kubernetes:1.29.7
-kubernetes-client-api:4.13.3-1
+kubernetes:1.30.0
Contributor

@akram I noticed in my recent testing that there is now a version 1.30.1 of the k8s plugin

perhaps not required, but something to go ahead and move to

@gabemontero
Contributor

/retest

@akram
Contributor Author

akram commented Jul 21, 2021

@gabemontero I will test with 1.30.1 on a separate branch then.

@gabemontero
Contributor

sounds good @akram .... I'm fine merging this as is and following up

I see the tests are green!!

while the iron is hot

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jul 21, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jul 21, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: akram, gabemontero

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit b45f39e into openshift:master Jul 21, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jul 21, 2021

@akram: All pull requests linked via external trackers have merged:

Bugzilla bug 1925524 has been moved to the MODIFIED state.

In response to this:

Bug 1925524: bump k8s plugin to 1.30 to enable 1.0.48 release of openshift sync plugin

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@gabemontero
Contributor

https://issues.redhat.com/browse/ART-3173 for bumping RPMs for official image
