
[JENKINS-19022] rebuild BuildData objects instead of storing them in the build.xml #578

Merged
merged 15 commits into from Jan 21, 2019

Conversation

jacob-keller
Contributor

@jacob-keller jacob-keller commented Mar 16, 2018

The BuildData objects are serialized into the build.xml for every build.
This is normally not a problem, as their contents are small. However,
for certain types of builds (gerrit-triggered projects, really any git
build which uses a large number of refs) the BuildData objects can
become enormous. Worse, in those cases the data it saves is nearly
worthless.

Unfortunately, complete removal of the BuildData objects is not
possible, as we rely on this data to correctly predict change logs, and
to determine what revisions to build. Disentangling all of these
operations from the BuildData is tricky, if not impossible. This is
especially so for extension points such as the BuildChoosers.

Instead, let's opt for a different approach: generate the BuildData
instead of serializing the buildsByBranchName map to the build.xml for
every single build.

Do this by implementing a simpler BuildDetails structure, which contains
most of the same information, but does not contain the
buildsByBranchName map.
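
For illustration, here is a minimal sketch of what such a structure might look like; the field names are assumptions inferred from this description, not the PR's actual code:

import hudson.model.Action;
import hudson.plugins.git.util.Build;
import java.io.Serializable;
import java.util.Collection;

// Hypothetical sketch: holds the single Build for this run plus a little
// metadata, but no buildsByBranchName map.
public class BuildDetailsSketch implements Action, Serializable {
    private final Build build;
    private final String scmName;
    private final Collection<String> remoteUrls;

    public BuildDetailsSketch(Build build, String scmName, Collection<String> remoteUrls) {
        this.build = build;
        this.scmName = scmName;
        this.remoteUrls = remoteUrls;
    }

    public Build getBuild() { return build; }
    public String getScmName() { return scmName; }
    public Collection<String> getRemoteUrls() { return remoteUrls; }

    @Override public String getIconFileName() { return null; } // no sidebar entry in this sketch
    @Override public String getDisplayName() { return "Git Build Details"; }
    @Override public String getUrlName() { return null; }
}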

Stop storing the BuildData object into the build Actions. Instead,
search for BuildDetails and regenerate the BuildData structure from
this. As a fallback, allow finding old BuildData actions and using
those, if they still exist.
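
A hedged sketch of that regeneration walk, reusing the hypothetical BuildDetailsSketch above (the real lookup in the plugin differs in detail):

import hudson.model.Run;
import hudson.plugins.git.util.Build;
import hudson.plugins.git.util.BuildData;
import java.util.ArrayDeque;
import java.util.Deque;

static BuildData rebuildBuildData(Run<?, ?> newest) {
    Deque<Build> replay = new ArrayDeque<>();
    BuildData data = new BuildData();
    // Walk backwards through the build history, queueing each build's details.
    for (Run<?, ?> run = newest; run != null; run = run.getPreviousBuild()) {
        BuildDetailsSketch details = run.getAction(BuildDetailsSketch.class);
        if (details != null) {
            replay.addFirst(details.getBuild());
            continue;
        }
        // Fallback: an old serialized BuildData already covers the older history.
        BuildData legacy = run.getAction(BuildData.class);
        if (legacy != null) {
            data = legacy; // a real implementation should copy rather than mutate the action
            break;
        }
    }
    // Replay oldest-to-newest so the newest entry wins per branch, rebuilding
    // the buildsByBranchName map in memory only.
    for (Build b : replay) {
        data.saveBuild(b);
    }
    return data;
}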

Jenkins administrators may wish to purge this data from their build.xml
files if they do not need the historical data, but for now, keep the
plugin as backwards compatible as possible.

With this change, we now only store a single Build object per build
instead of storing the entire branch map, which should significantly
reduce build.xml bloat.

If the time cost of rebuilding the BuildData structure is too high,
(such as when there are a large number of historical builds), we could
add some sort of limit on the number of past builds to check. However,
this would require some thought, as it would not produce exact replicas
of the current BuildData objects.

See JENKINS-19022 and JENKINS-47789 for more details on the issues
the BuildData objects can cause.

Signed-off-by: Jacob Keller jacob.e.keller@intel.com

TODO:

  • Determine impact with a large number of jobs
  • Ensure we have enough test cases to cover things
  • Determine proper documentation required?
  • Maybe provide a method of removing the old data? Or automate it within the plugin?
  • Maybe add some limiter on how far back to dig, i.e. only dig back a few hundred builds maximum?

Checklist

  • I have read the CONTRIBUTING doc
  • I have referenced the Jira issue related to my changes in one or more commit messages
  • I have added tests that verify my changes
  • Unit tests pass locally with my changes
    (I'm seeing some weird timeout issues when I run the tests... It may be because I'm behind a proxy?)
  • I have added documentation as necessary
    (We definitely want to document this as a change since others may be reading the build data)
  • No Javadoc warnings were introduced with my changes
  • No findbugs warnings were introduced with my changes
  • I have interactively tested my changes
    (Not yet, it's on my TODO list)
  • Any dependent changes have been merged and published in upstream modules (like git-client-plugin)

Types of changes

What types of changes does your code introduce? Put an x in the boxes that apply

  • Bug fix (breaking change which fixes an issue)

I'm not sure if this constitutes a breaking change or not. We no longer export BuildData, instead exporting BuildDetails in the Actionable items list. This could impact consumers of the data, but they can learn to rebuild the data just like we do in core.

Further comments

This approach was the easiest I've found so far. If we actually try to remove BuildData, it turns out a lot of the core code depends on being able to look some of this data up. It's possible we really want to limit the number of builds we search, but I don't actually have data on how slow it is right now.

I welcome thoughts and criticism of this approach!

@tsuna

tsuna commented Mar 19, 2018

How well is this going to scale for jobs that have >>10k prior builds?

@MarkEWaite
Contributor

I'm not sure it could scale any worse than the current implementation. I haven't performed any specific benchmarks, but I have bug reports of bad results when many builds are retained.

@jacob-keller
Contributor Author

I'm not sure it could scale any worse than the current implementation. I haven't performed any specific benchmarks, but I have bug reports of bad results when many builds are retained.

It depends. If the builds have only one branch, then it will scale VERY poorly compared to the current solution (i.e., currently the map will be one branch per build, and would be very small to serialize from the builds). However, I don't know if that scaling will be poor in practice. I think we could get away with some sort of depth limit, where we say "the map will only be accurate back, say, 100 builds" rather than being accurate over the full history. For how we use it, I think this trade-off would be acceptable, assuming that digging through 10k builds actually is a problem.

For builds with many branches (one per build), such as gerrit change-id based builds, this will be pretty much the same as the current behavior (the map will contain one entry per build), unless loading each build takes longer?

I'd like to get actual results for this soon but I'm not sure when or how best to test that.

@jacob-keller
Contributor Author

I'll take a look at fixing those findbugs warnings later this evening.

@jacob-keller
Contributor Author

Huh, well I fixed the first set of test failures, and that broke something else... I'm still not able to get the tests to run locally :(

@jacob-keller
Contributor Author

Finally figured out what I was doing wrong when constructing the build data objects. "getBuildData()" expects to be able to dig farther back into history if the top build doesn't contain any build information, and I wasn't allowing that.

I also finally got my tests to resolve properly on my local system!

@MarkEWaite
Contributor

@jacob-keller I don't understand why the tests don't run locally for you. I'm running the tests locally on

  • Windows 7
  • Windows 10
  • CentOS 6
  • CentOS 7
  • Debian 8
  • Debian 9
  • Ubuntu 14
  • Ubuntu 16
  • FreeBSD 10

Sometimes the tests are flaky on one or more of those platforms, but they definitely run. Can you share more information about the failure mode of the tests when you run them in your environment?

@jacob-keller
Contributor Author

Sometimes the tests are flaky on one or more of those platforms, but they definitely run. Can you share more information about the failure mode of the tests when you run them in your environment?

I should clarify: they run, but several tests fail.

I'll attach the test output of my mvn test run here as git-plugin-test-output.txt

git-plugin-test-output.txt

I suspect this is because it tries to contact remote repositories and fails to use the proxy settings when talking to them.

@jacob-keller
Contributor Author

Additionally, because of the 2-minute timeouts, this causes the test suite to take an extremely long time to finish.

@MarkEWaite
Contributor

MarkEWaite commented Mar 20, 2018

@jacob-keller I merged your changes with the contents of the require-jdk-8 branch and the daniel-beck-make-matrix-optional branch and installed it into my Jenkins environment for testing.

Unfortunately, one of the earliest jobs I checked reported a new null pointer exception. I see it in many of my jobs. Before I spend time describing the detailed steps to reproduce the problem, could you try your build of the plugin in an environment that uses a Pipeline shared library?

Checking out Revision 7d1d5e7a0561458f33b01b1b25eaf52467723347 (master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 7d1d5e7a0561458f33b01b1b25eaf52467723347
Commit message: "Merge pull request #27 from daniel-beck/json"
java.lang.NullPointerException
	at io.jenkins.blueocean.autofavorite.FavoritingScmListener.onCheckout(FavoritingScmListener.java:70)
	at org.jenkinsci.plugins.workflow.steps.scm.SCMStep.checkout(SCMStep.java:127)
	at org.jenkinsci.plugins.workflow.libs.SCMSourceRetriever.doRetrieve(SCMSourceRetriever.java:112)
	at org.jenkinsci.plugins.workflow.libs.SCMSourceRetriever.retrieve(SCMSourceRetriever.java:84)
	at org.jenkinsci.plugins.workflow.libs.LibraryAdder.retrieve(LibraryAdder.java:153)
	at org.jenkinsci.plugins.workflow.libs.LibraryAdder.add(LibraryAdder.java:134)
	at org.jenkinsci.plugins.workflow.libs.LibraryDecorator$1.call(LibraryDecorator.java:125)
	at org.codehaus.groovy.control.CompilationUnit.applyToPrimaryClassNodes(CompilationUnit.java:1065)
	at org.codehaus.groovy.control.CompilationUnit.doPhaseOperation(CompilationUnit.java:603)
	at org.codehaus.groovy.control.CompilationUnit.processPhaseOperations(CompilationUnit.java:581)
	at org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:558)
	at groovy.lang.GroovyClassLoader.doParseClass(GroovyClassLoader.java:298)
	at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:268)
	at groovy.lang.GroovyShell.parseClass(GroovyShell.java:688)
	at groovy.lang.GroovyShell.parse(GroovyShell.java:700)
	at org.jenkinsci.plugins.workflow.cps.CpsGroovyShell.doParse(CpsGroovyShell.java:133)
	at org.jenkinsci.plugins.workflow.cps.CpsGroovyShell.reparse(CpsGroovyShell.java:127)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.parseScript(CpsFlowExecution.java:557)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.start(CpsFlowExecution.java:518)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun.run(WorkflowRun.java:290)
	at hudson.model.ResourceController.execute(ResourceController.java:97)
	at hudson.model.Executor.run(Executor.java:429)

GitHub has been notified of this commit's build result

org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
WorkflowScript: Loading libraries failed

1 error

	at org.codehaus.groovy.control.ErrorCollector.failIfErrors(ErrorCollector.java:310)
	at org.codehaus.groovy.control.CompilationUnit.applyToPrimaryClassNodes(CompilationUnit.java:1085)
	at org.codehaus.groovy.control.CompilationUnit.doPhaseOperation(CompilationUnit.java:603)
	at org.codehaus.groovy.control.CompilationUnit.processPhaseOperations(CompilationUnit.java:581)
	at org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:558)
	at groovy.lang.GroovyClassLoader.doParseClass(GroovyClassLoader.java:298)
	at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:268)
	at groovy.lang.GroovyShell.parseClass(GroovyShell.java:688)
	at groovy.lang.GroovyShell.parse(GroovyShell.java:700)
	at org.jenkinsci.plugins.workflow.cps.CpsGroovyShell.doParse(CpsGroovyShell.java:133)
	at org.jenkinsci.plugins.workflow.cps.CpsGroovyShell.reparse(CpsGroovyShell.java:127)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.parseScript(CpsFlowExecution.java:557)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.start(CpsFlowExecution.java:518)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun.run(WorkflowRun.java:290)
	at hudson.model.ResourceController.execute(ResourceController.java:97)
	at hudson.model.Executor.run(Executor.java:429)
Finished: FAILURE

@jacob-keller
Contributor Author

Unfortunately, one of the earliest jobs I checked reported a new null pointer exception. I see it in many of my jobs. Before I spend time describing the detailed steps to reproduce the problem, could you try your build of the plugin in an environment that uses a Pipeline shared library?

Yeah, I'll take a look and see if I can spot it.

@jacob-keller
Contributor Author

@MarkEWaite I'm pretty sure this is just a breaking change because the blueocean autofavorites plugin is searching for BuildData and expecting it to return non-null.

I suspect it's reasonably considered a bug in the FavoritingScmListener code, which needs to be null-tolerant about the missing BuildData, but previously it was "always there" so it never had issues...

This is an example of the "breaking change" I mentioned before. We no longer store BuildData actions (because they bloat the build.xml with potentially unbounded hashes), so of course it's going to fail.
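
For reference, the kind of null-tolerant lookup a consumer now needs looks roughly like this (a sketch, not the actual autofavorites fix):

import hudson.model.Run;
import hudson.plugins.git.util.BuildData;

// BuildData is no longer guaranteed to be attached to the run, so treat its
// absence as "no git information" instead of dereferencing it blindly.
static String builtSha1OrNull(Run<?, ?> run) {
    BuildData data = run.getAction(BuildData.class);
    if (data == null || data.getLastBuiltRevision() == null) {
        return null;
    }
    return data.getLastBuiltRevision().getSha1String();
}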

@jacob-keller
Contributor Author

I'll have a pull request up for blueocean-autofavorites shortly to fix the null pointer exception.

@jacob-keller
Contributor Author

I also added stapler/jelly code for displaying the build details now (sorry, I forgot that the first time!)

and I fixed a small issue in getBySHA1 regarding failure to parse every action of a build.

@jacob-keller
Contributor Author

Apparently something I changed in the last update broke a test in a weird way; I hope to investigate tomorrow.

@jacob-keller
Contributor Author

The failed test here doesn't fail for me locally, so I'm not exactly sure why it failed in the pipeline.

@jacob-keller
Contributor Author

@MarkEWaite Is it possible to re-trigger the remote tests? They don't fail for me on the command line. I'm not sure what would be different about them on that system to cause a failure: Specifically, it seems something is timing out...

Stacktrace:

org.junit.runners.model.TestTimedOutException: test timed out after 180 seconds
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:502)
	at org.jvnet.hudson.reactor.Reactor.execute(Reactor.java:267)
	at jenkins.InitReactorRunner.run(InitReactorRunner.java:44)
	at jenkins.model.Jenkins.executeReactor(Jenkins.java:916)
	at jenkins.model.Jenkins.<init>(Jenkins.java:815)
	at hudson.model.Hudson.<init>(Hudson.java:85)
	at org.jvnet.hudson.test.JenkinsRule.newHudson(JenkinsRule.java:611)
	at org.jvnet.hudson.test.JenkinsRule.before(JenkinsRule.java:384)
	at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:537)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
	at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.lang.Thread.run(Thread.java:748)

@jacob-keller
Contributor Author

@jacob-keller I merged your changes with the contents of the require-jdk-8 branch and the daniel-beck-make-matrix-optional branch and installed it into my Jenkins environment for testing.

Unfortunately, one of the earliest jobs I checked reported a new null pointer exception. I see it in many of my jobs. Before I spend time describing the detailed steps to reproduce the problem, could you try your build of the plugin in an environment that uses a Pipeline shared library?

I locally built a pipeline library and set it to use modern SCM, and it worked for me. Note that the exception you pasted above is fixed by jenkinsci/blueocean-autofavorite-plugin#16 which was just merged.

I still do not understand why the CI test fails, though... I can't get it to fail like that on my machines.

@jacob-keller
Contributor Author

I'm pretty sure the test failures here are due to the lack of entropy on the server (SSHD fails to start in time while blocking on /dev/random, and the tests time out). I don't believe anything that remains is a problem in my code here. Is there a way to re-run the checks?
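
(For anyone who hits the same thing: a common workaround for entropy-starved test machines is to point the JVM's SecureRandom at the non-blocking pool, e.g. mvn test -DargLine="-Djava.security.egd=file:/dev/./urandom", assuming the POM doesn't already set argLine. That's a guess at the cause here, not something verified against this CI server.)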

@jiangty-addepar

I'm excited for this to be released!

@jacob-keller
Contributor Author

jacob-keller commented Apr 19, 2018

I'm excited for this to be released!

Yep. There's still a fair amount of work to do here before that can happen. More interactive testing is important, but I think we've resolved the test failures, and I've run some of my own interactive testing and haven't found issues yet. Unfortunately, the best places to check for failures are on jobs with a large amount of history...

@jacob-keller
Contributor Author

I've interactively tested this on a small job and things seem to be OK, but I don't know a good way to test with thousands of builds in the history.

@jiangty-addepar

You could try running small jobs with a really tight cron schedule?

@MarkEWaite MarkEWaite modified the milestone: 4.0 Dec 11, 2018
@jacob-keller
Contributor Author

I have no idea how the XmlFile thing would work, and I suspect it would take significant effort to implement, as well as to verify. It would mean either a significant rework of this series, or scrapping it and starting over from master.

@jglick how strongly do you feel about this? What about the smaller changes?

My own testing plus the testing from @MarkEWaite seems to indicate that this behaves well, and we haven't noticed significant issues so far (obviously we could have missed something). Note that we scan the history during SCM polling (I believe) and at the beginning of a build, not during web page rendering. (It actually reduces how much data is in the build.xml, because we only store one revision object per build instead of copying the entire build data into every build.)

I'd like to know more thoughts given that there was previously hope to have this in 4.0 beta.

I see a few paths forward here, and I'd like to know which one we actually want to go with.

  • We take this as is (or with the minor fixups you mentioned earlier)
  • We decide that removing BuildData (and all the features associated with it) entirely is acceptable in a big release
  • We use an XmlFile implementation which adds a separate file to the project directory
  • We do nothing
  • Other???

I don't know enough about the ramifications of removing BuildData entirely (which is why I invented this approach instead!)

I don't understand how to implement the XmlFile approach; maybe it is actually simple and I'm not understanding it very well.

I don't like the idea of doing nothing, since this is one of the significant issues with the git plugin as it stands today.

I'd like more input so that we can make a good decision.

@MarkEWaite
Contributor

@jacob-keller I prefer we have you apply the minor fix-ups that Jesse requested. I'm continuing my testing and have enlisted additional help from others to test as well. I included this change in the beta-3.0 branch and delivered it as part of git client plugin 3.0.0-beta6.

There's no reason to implement a fixNull for strings, since Util already
provides one.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Currently no one actually calls clone on a BuildDetails. Remove the
clone interface. If we do end up needing to implement copying of
a BuildDetails, we should implement a copy constructor.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Also mark the build field as final so that it cannot be modified after
object creation.

Update the callers to use the getters when accessing the now private
fields.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
These constructors were carried over as part of copying BuildData, but
are unnecessary. Let's just remove them. Future code can add new
constructors if/when they become useful.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
It's best to just let real translators handle the work of translation to
ensure that it is accurate.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Instead of implementing readResolve to convert a null remotes
collection, just mark the remote collection as @nonnull

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
@jacob-keller
Contributor Author

Ok, I think I covered all the basic things that @jglick suggested. If there are other small fixups I missed, please let me know :)

@MarkEWaite
Contributor

@jacob-keller I've included your latest changes in the beta-4.0 branch. I did not deliver a new beta release of the plugin because I didn't think those changes were especially risky. Let me know if you think I should deliver a new beta release with those changes.

I'm continuing the testing of various parts of the git plugin 4.0 beta with this code.

@jacob-keller
Contributor Author

Since you already included the bulk of changes in the earlier beta, I think that's fine. It wouldn't hurt to get more testing, but I think if we're confident then there isn't a huge reason to delay. Do you think a 4.0 beta would get a wider audience than the last beta did?

@MarkEWaite
Contributor

It is getting a slightly larger audience than the earlier beta because @kshultzCB is testing it. We've had others testing as well (as evidenced by JENKINS-55387).

Not enough testing yet, but enough to be discovering new issues that need to be addressed. The new issues have not been related to BuildData. When an issue is related to BuildData, I'll post it here.

@jacob-keller
Contributor Author

Ok sounds great!

@jglick
Member

jglick commented Jan 3, 2019

@jglick how strongly do you feel about this?

Not strongly. Do not consider me an official reviewer—just trying to offer some very quick hints in case they are helpful.

you would most naturally just define a new XmlFile in the project root directory

I have no idea how to do that

You would create something along the lines of

new XmlFile(build.getParent().getRootDir(), "buildData.xml")

and then read / write a POJO defining the per-job metadata you wished to keep, with whatever internal structure makes sense—ArrayLists, HashMaps, Jenkins specifics like Build, etc. Much like storing a per-build Action except you are responsible for explicitly loading and saving the object.

How would this handle multiple builds running at the same time?

Well, that would be one of the tricky parts (along with cleaning up stale metadata after builds are deleted). I suppose you would most simply synchronize on the Job or just have a single (Jenkins-wide) static lock.

@jacob-keller
Contributor Author

@jglick how strongly do you feel about this?

Not strongly. Do not consider me an official reviewer—just trying to offer some very quick hints in case they are helpful.

Ok

you would most naturally just define a new XmlFile in the project root directory

I have no idea how to do that

You would create something along the lines of

new XmlFile(build.getParent().getRootDir(), "buildData.xml")

and then read / write a POJO defining the per-job metadata you wished to keep, with whatever internal structure makes sense—ArrayLists, HashMaps, Jenkins specifics like Build, etc. Much like storing a per-build Action except you are responsible for explicitly loading and saving the object.

Ok, that makes some sense.

How would this handle multiple builds running at the same time?

Well, that would be one of the tricky parts (along with cleaning up stale metadata after builds are deleted). I suppose you would most simply synchronize on the Job or just have a single (Jenkins-wide) static lock.

The old mechanism of BuildData in each build.xml never actually cleaned itself up either. (So in some sense it's a behavior change that the new method will clean up when old builds are deleted.)

In terms of locking, synchronizing on the Job makes sense to me, as the file is in the Job's root dir anyway.

If I have time in the next few days I might see how difficult this is to implement.
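
A minimal sketch of what that might look like, assuming a hypothetical GitBuildHistory POJO and the Job-level locking suggested above (not shipped plugin code):

import hudson.XmlFile;
import hudson.model.Job;
import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-job metadata POJO; the real structure would hold whatever
// parts of BuildData we still need, e.g. a branch-name -> last-built-SHA1 map.
class GitBuildHistory {
    Map<String, String> lastBuiltShaByBranch = new HashMap<>();
}

class GitBuildHistoryStore {
    static GitBuildHistory load(Job<?, ?> job) throws IOException {
        synchronized (job) { // serialize concurrent builds of the same job
            XmlFile file = new XmlFile(new File(job.getRootDir(), "buildData.xml"));
            return file.exists() ? (GitBuildHistory) file.read() : new GitBuildHistory();
        }
    }

    static void save(Job<?, ?> job, GitBuildHistory history) throws IOException {
        synchronized (job) {
            new XmlFile(new File(job.getRootDir(), "buildData.xml")).write(history);
        }
    }
}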

The PreBuildMerge extension attempted to extend the BuildData and save
an indication that the build failed on the newly merged revision. However,
this is now incorrect and would continue to perpetuate the BuildData
action.

Fix this by having the extension store a BuildDetails instead.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
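
Roughly, that fix amounts to something like the following (a sketch reusing the hypothetical names from earlier, with the marked/merged revision handling simplified):

import hudson.model.Result;
import hudson.model.Run;
import hudson.plugins.git.Revision;
import hudson.plugins.git.util.Build;
import java.util.Collection;

// On a failed pre-build merge, record a failed Build through a
// BuildDetails-style action instead of attaching a full BuildData.
static void recordFailedMerge(Run<?, ?> run, Revision marked, Revision merged,
                              String scmName, Collection<String> remoteUrls) {
    Build failed = new Build(marked, merged, run.getNumber(), Result.FAILURE);
    run.addAction(new BuildDetailsSketch(failed, scmName, remoteUrls));
}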
@jacob-keller
Contributor Author

For reference, I started a WIP pull request against this pull request at jacob-keller#1

Based on initial work, I think it's somewhat doable to switch to an XmlFile approach, but I'm not sure it's worth it, unless testing of the current implementation proves that loading all the old builds is a performance or memory issue.

vivek added a commit to jenkinsci/blueocean-plugin that referenced this pull request Jan 16, 2019
vivek added a commit to jenkinsci/blueocean-plugin that referenced this pull request Jan 16, 2019
@MarkEWaite MarkEWaite merged commit d77f380 into jenkinsci:master Jan 21, 2019
@MarkEWaite MarkEWaite removed this from the 4.0 milestone Oct 17, 2019
@jacob-keller
Contributor Author

Since we found that it is indeed a performance issue with many thousands of builds, I think we should seriously investigate the XmlFile approach, where we convert BuildData to a single file stored in the job directory (rather than a copy in every build). That way the cost scales linearly with the number of builds instead of the roughly n^2 behavior we have now.

I haven't had any time to figure that approach out recently, but I do think it's the best approach to resolving this issue without losing any capability.
