Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIXED JENKINS-19022] Add option to disable saving build/branch map. #163

Closed
wants to merge 9 commits into
base: master
from

Conversation

Projects
None yet
8 participants
@HedAurabesh
Copy link

HedAurabesh commented Aug 1, 2013

This commit fixes the issues reported in JENKINS-19022.
https://issues.jenkins-ci.org/browse/JENKINS-19022

It introduces an optional configuration flag that will remove all
previous information about which branches were built in Jenkins
before attaching it to the current build as an Action.

If you work with GIT repos with several thousand unmerged branches, this
can easily save several megabytes of both RAM and disk space per build.
With enough builds being run, we're talking about saving GIGAabytes.

The flag is optional and defaults to being disabled, so as to not to
introduce unwanted behaviour changes to existing installations.

The reason for this conservativeness is that it appears possible to us,
that enabling this option might break the "merge-before-build"
feature (we do not know enough about that feature to test it).
Other than that, our tests have not shown any issues with this patch.

[FIXES JENKINS-19022] Add option to disable saving build/branch map.
This commit fixes the issues reported in JENKINS-19022. It introduces an
optional configuration flag that will remove all previous information
about which branches were built in Jenkins before attaching it to the
current build as an Action.

If you work with GIT repos with several thousand unmerged branches, this
can easily save several megabytes of both RAM and disk space per build.
With enough builds being run, we're talking about saving GIGAabytes.

The flag is optional and defaults to being disabled, so as to not to
introduce unwanted behaviour changes to existing installations.

The reason for this conservativeness is that it appears possible to us,
that enabling this option *might* break the "merge-before-build"
feature.
@ndeloof

This comment has been minimized.

Copy link
Member

ndeloof commented Aug 1, 2013

disabling BuildData is only a workaround to JENKINS-19022 and will break buildChooser "to be built" commit detection. I'd prefer as a fix to get the branch:build map to be stored as a singleton project action in replacement to build action

... but this would have the side effect to polute the job configuration history (consider jobConfigHistory or scm-sync-config plugins)

Maybe BuildData could be implemented as a chain, only storing the changes with previous build's BuildData. Will then need to re-aggregate last BuildData when a build is deleted / build history is purged

@buildhive

This comment has been minimized.

Copy link

buildhive commented Aug 1, 2013

Jenkins » git-plugin #293 SUCCESS
This pull request looks good
(what's this?)

@jenkinsadmin

This comment has been minimized.

Copy link
Member

jenkinsadmin commented Aug 1, 2013

Thank you for a pull request! Please check this document for how the Jenkins project handles pull requests

@HedAurabesh

This comment has been minimized.

Copy link
Author

HedAurabesh commented Aug 1, 2013

I admit that the potential side-effects of removing the data can be problematic -- after all, there is a reason why they are there to begin with.

I think that one possible solution to this is the following: Instead of saving all the data in either the build or the project, I would propose to centrally store the data in the SCM plug-in and only save a small metadata snippet in the build -- just enough to correctly identify the essential data to be retrieved for each build.

Of course, this will only solve our problem (remember: thousands of branches in GIT!), if the plug-in does not need to track the specific state of all branch/build mappings for a particular build as a "snapshot in time".
From what I've seen in the code, you only refer to the very last build of a particular project anyway, so it should be no problem to simply keep one branch/build mapping in memory that all builds share -- instead one for every build of every project.

Am I right in this assumption? If so, I'll take a look at improving my pull-request to do the necessary changes.

@ndeloof

This comment has been minimized.

Copy link
Member

ndeloof commented Aug 1, 2013

would indeed be enough to only keep one BuildData instance for all builds, but need to ensure this one isn't lost if build history is purged/build deleted, etc.
This issue seems more related to a Git misuse. Having thousand branches in same repo is huge. Maybe using more repositories (remember, Git is DISTRIBUTED) or pruning old branches could help

@HedAurabesh

This comment has been minimized.

Copy link
Author

HedAurabesh commented Aug 1, 2013

Okay, I'll take a look into "globalizing" this data. Although I have to admit, that I do not quite understand why the branch history has to be kept forever -- after all, why is information needed about a branch that does not exist anymore or has been merged?

As for our use of GIT:
In our case, the GIT repositories are fully automated (due to a code transition from one SCM system to GIT). They are not used by users directly and do not serve a versioning purpose. As such, the large number of branches is not only sensible, but wanted by design; as the branches can be cleaned-up easily while the "mainline" stays small and clean.

In our case, almost every test creates a new branch (on which multiple different builds are subsequently spawned). Since we spawn several tens of thousands of these tests over a week, even aggressive culling of the history leaves a huge "paper-trail" in our Jenkins servers, for little to no gain to us.

Additionally, let me point out that having that many branches is not a misuse of GIT itself, as GIT does not care very much if something is a regular commit, a branch or a tag. It's all the same for it. Isn't the unofficial motto of GIT not even "Branch often, branch early?" :-)

Also something to keep in mind: Tools like Gerrit (Code Review tool) also spawn quite a lot of branches that can linger for quite some time.

@ndeloof

This comment has been minimized.

Copy link
Member

ndeloof commented Aug 1, 2013

sur, not telling you're doing things wrong, but I don't want some exotic use-cases to pollute the git plugin. As you tell git "do not serve a versioning purpose" don't let me think this is a mainstream usage. We got too many such "advanced" options, and KK worked on refactoring branch to isolate them in specialized "custom behaviors" and make code (a little more) maintainable.
I'd prefer you create the required extension point to mane BuildData factory plugable, and create a custom plugin to match your use-case.

In current implementation, each build invokes copyBuildData() so that whatever build you delete the builddata history is kept. If you switch to some distributed data around builds, you need to manage this manually.

@HedAurabesh

This comment has been minimized.

Copy link
Author

HedAurabesh commented Aug 1, 2013

The more I look at the code, the more I realize that this issue is not limited to our specific use-case; our use only runs earlier into this issue.

So far, I have not found a code segment that actually causes branches to be dropped from this map. All code paths only add branches, but never remove them. So even when you use GIT in the "intended" way of not creating all that many branches and always merging or deleting them -- this map still never stops growing.

It's only a question of time until you have (like we do) thousands of branches in the map. Since each branch-name is associated with a build-ID and SHA1, we're looking at at least 50-60 Bytes per branch. With a history of a thousand branches, that's ~50kB of additional data. This is stored in each and every build in the system. So 20 projects with just 500 builds each mean that this data alone wastes 500MB of disk space and main memory!

And these are, I believe, optimistic numbers for most Jenkins and GIT deployments.

Remember: Our company has ~3000 developers working on the code. It's inevitable that there will be lots of branches even once we switch over to a non-automated GIT implementation.

Not storing this map for each build will be a great first step, but eventually, the plug-in has to deal with the growing history of branches in one way or another.

@ndeloof

This comment has been minimized.

Copy link
Member

ndeloof commented Aug 1, 2013

Fully agree with your analysis, but can't find how to deal with branch deletion.
Probably need

  1. some way to centralize BuildData
  2. detect branches deleted from repo then purge related buildData

not a trivial refactoring...

@HedAurabesh

This comment has been minimized.

Copy link
Author

HedAurabesh commented Aug 1, 2013

Agreed; a full solution is tricky. In any case, I am working on writing a patch that at least solves problem n° 1, by centralizing the builds-to-branch map (which makes up 99.995% of the BuildData object size).

I made good progress so far, but need to test the solution first, before updating this push request.

HedAurabesh added some commits Aug 1, 2013

[FIXES JENKINS-19022] Remove branch-to-build map from build objects
This commit fixes the issues reported in JENKINS-19022. It removes the
need to store the branch-to-build map in the builds themselves. Instead,
the data is now stored centrally, thus avoiding the excessive overhead
of attaching an eternally growing history of branches to each build.

This can easily save several hundred megabytes of RAM and disk-space on
"grown" Jenkins/GIT deployments.
@HedAurabesh

This comment has been minimized.

Copy link
Author

HedAurabesh commented Aug 2, 2013

Okay, I added commits to this push request, which undo the changes from the initial commit and instead make it so that the branch/builds map is stored centrally in the "hudson.plugins.git.GitSCM.branches.xml.gz" file.

As you can see, I compress the file, as this can easily reduce the size of it by up to 90% (a factor of ten) for only a tiny amount of additional CPU time spent on the server.

My tests show that the new solution is working quite nicely; always displaying the most recent branches-to-builds map. I did not discard the entire BuildData class, as all the other fields only add a few kB to each build, instead of growing boundlessly like the map can.

Of course, as I said above, this still does not fix the fact that the branches list will grow eternally; so a later patch has to find a strategy to discard them.

@buildhive

This comment has been minimized.

Copy link

buildhive commented Aug 2, 2013

Jenkins » git-plugin #298 UNSTABLE
Looks like there's a problem with this pull request
(what's this?)

@HedAurabesh

This comment has been minimized.

Copy link
Author

HedAurabesh commented Aug 2, 2013

Okay, just noticed that lots of testcases fail; will take a close look at why they fail and then submit a fix.

@buildhive

This comment has been minimized.

Copy link

buildhive commented Aug 2, 2013

Jenkins » git-plugin #299 UNSTABLE
Looks like there's a problem with this pull request
(what's this?)

@buildhive

This comment has been minimized.

Copy link

buildhive commented on 5dfeb53 Aug 2, 2013

Jenkins » git-plugin #299 UNSTABLE
Looks like this commit caused a build failure
(what's this?)

Fixed test-issue when tests did not submit a proper Run object.
This does not yet fix the issue with other tests failing with messages
like these:

"scm polling should not detect any more changes after build"
"The build should have only one culprit expected:<1> but was:<0>"
"commitFile1 file not found in workspace"

These will be fixed as soon as I have identified the root-cause of them.
@buildhive

This comment has been minimized.

Copy link

buildhive commented Aug 2, 2013

Jenkins » git-plugin #300 UNSTABLE
Looks like there's a problem with this pull request
(what's this?)

Fixed remaining test issues.
All 129 tests now succeed. The failing "slave" tests were due to the
fact that the SCM and BuildData objects are serialised and sent to the
remote machine for execution. Since the fields were static/transient,
the test failed as no data was transmitted.

Now, the BuildData class contains a serialisable reference to the static
object stored in the GitSCM class. It is not saved to disk via XStream,
but can be copied via the network to the build host. Since the structure
is only read, but not changed on the build host, that is acceptable.

Do note though, that the structure will still grow with each branch,
thus eventually slowing down the test execution.
@HedAurabesh

This comment has been minimized.

Copy link
Author

HedAurabesh commented Aug 5, 2013

Hi.

I've fixed the last issues that caused build problems. All 129 test cases pass and an installation on a real server also worked nicely. Please read the commit message above for more details.

What do you think? Can this -- admittedly quite large -- alteration to the code admitted into the mainline? What needs to be done/changed to allow it?

@buildhive

This comment has been minimized.

Copy link

buildhive commented Aug 5, 2013

Jenkins » git-plugin #301 SUCCESS
This pull request looks good
(what's this?)

@buildhive

This comment has been minimized.

Copy link

buildhive commented Aug 5, 2013

Jenkins » git-plugin #302 SUCCESS
This pull request looks good
(what's this?)

@buildhive

This comment has been minimized.

Copy link

buildhive commented Aug 5, 2013

Jenkins » git-plugin #303 SUCCESS
This pull request looks good
(what's this?)

@ndeloof

This comment has been minimized.

Copy link
Member

ndeloof commented Aug 6, 2013

Using a centralized for all jobs Map has some major drawbacks: need to manage project deletion, rename, move inside distinct ItemGroup.
I'd prefer a buildByBranch Map per project / GitSCM instance. Maybe not trivial as GitSCM can't access it's "parent" project to retrieve job directory.

@HedAurabesh

This comment has been minimized.

Copy link
Author

HedAurabesh commented Aug 6, 2013

The latter point -- plus the fact that storing it in the config of each job/project would mean that it would drive plug-ins like the SCM-Sync plug-in insane was the reason why I decided against a project-local map.

I also indeed overlooked the part that a project deletion/renaming means that we need to adjust the map; but thankfully, Jenkins offers an extension point for that, so that's hardly difficult to do. I will add this feature tomorrow and submit another commit with the changes in place.

The only thing I don't understand is what you mean with "move inside distinct ItemGroup". I don't know why that would influence the branch-to-builds map, as long as the name of the project remains unique, the same and accessible via Jobs.getName()?

@ndeloof

This comment has been minimized.

Copy link
Member

ndeloof commented Aug 6, 2013

jobs inside ItemGroups (ie cloudbees folder plugin) are uniquely identified by fullName

Added logic for renaming/deleting projects.
Also switched to use the full name of a project as an identifier.
@buildhive

This comment has been minimized.

Copy link

buildhive commented Aug 7, 2013

Jenkins » git-plugin #307 UNSTABLE
Looks like there's a problem with this pull request
(what's this?)

@HedAurabesh

This comment has been minimized.

Copy link
Author

HedAurabesh commented Aug 7, 2013

Hi Nicolas.

I committed the needed changes. The plug-in will now use an ItemListener to be informed when a project is renamed or deleted. It will also use the full name instead of the relative name to address the content of the map.

I've also taken a look at ItemGroups, and if you rename a project inside an item group, it behaves just like normal renaming without any context item group. So the function can do the rename just fine.

Furthermore, I did do a quick test with the Cloudbees folder plug-in, and moving/copying jobs from one folder to another behaves just like as if you created the job anew at its new target and deleted it at the current one. The methods from the ItemListener extension point should therefore also handle this correctly.

@buildhive

This comment has been minimized.

Copy link

buildhive commented Aug 7, 2013

Jenkins » git-plugin #308 SUCCESS
This pull request looks good
(what's this?)

@HedAurabesh

This comment has been minimized.

Copy link
Author

HedAurabesh commented Aug 20, 2013

Hi Nick.

Have you had the chance to examine the latest commit?

@ndeloof

This comment has been minimized.

Copy link
Member

ndeloof commented Aug 20, 2013

Need more time, and would like @kohsuke advise on this change

@HedAurabesh

This comment has been minimized.

Copy link
Author

HedAurabesh commented Sep 24, 2013

Hi Nick.

Have you talked to Kohsuke (@kohsuke) and has he advised you on this change?

I am just asking because on one of our servers which has not had this change enabled, the "build.xml" grows to over 10MB in size nowadays.

Additionally, in our own tests, we have not seen any issues with the latest changes I uploaded to this pull request.

Thanks!

@ndeloof

This comment has been minimized.

Copy link
Member

ndeloof commented Sep 24, 2013

We indeed discussed this issue and consider some way to get it fixed, partially based on your approach. Will have a look after javaOne

@jglick

This comment has been minimized.

Copy link
Member

jglick commented Oct 3, 2013

My suggestion was to continue to save the map by default, but store it in a separate file in the job root, rather than in each build.

Also when there is a Git provider for branch-api (IIRC this is still in the refactoring branch), the SCMSource could suppress the branch map since it would probably not be used for multibranch projects; if anyone does by chance call the associated API, a linear search could satisfy the request.

@ndeloof

This comment has been minimized.

Copy link
Member

ndeloof commented Oct 3, 2013

I'm currently trying to implement your suggestion.

Also have to investigate scm-api

@ndeloof

This comment has been minimized.

Copy link
Member

ndeloof commented Oct 7, 2013

working on https://github.com/jenkinsci/git-plugin/tree/builddata

Need to investigate how to handle repositoryURL in (now centralized) BuildData

@HedAurabesh

This comment has been minimized.

Copy link
Author

HedAurabesh commented Dec 4, 2013

Hi Nick.

I've seen that you have not committed anything on the "builddata" branch for two months. Do you have any updates and/or blockers in that respect?

We would really like to get an "official" solution to this problem as soon as possible. :)

Thanks!

@ndeloof

This comment has been minimized.

Copy link
Member

ndeloof commented Dec 4, 2013

No much time to invest on this now.
Main issue I have is that legacy data can match distinct repo URL. This makes things hard to migrate.

@ndeloof ndeloof force-pushed the jenkinsci:master branch from cf964ea to 70b7c8d Sep 22, 2014

@MikeMcQuaid

This comment has been minimized.

Copy link

MikeMcQuaid commented Jul 1, 2015

Any news here?

@ndeloof

This comment has been minimized.

Copy link
Member

ndeloof commented Jul 1, 2015

Didn't made progress on this topic. branch-api and multi-branch project types are the best option imho.

@MikeMcQuaid

This comment has been minimized.

Copy link

MikeMcQuaid commented Jul 1, 2015

@ndeloof Sorry, I'm missing context here. I guess I'm wondering why the branch can't be merged as-is to fix the issue. I'm also in the situation that the build.xml files alone are taking up 16GB on my Jenkins box.

@MarkEWaite

This comment has been minimized.

Copy link

MarkEWaite commented Jul 1, 2015

@MikeMcQuaid the code as it sits today won't merge successfully. Even if it did merge successfully with the current branch, the work is incomplete (at least with regard to the volume of testing needed to verify this caliber of change). Delivering an incomplete solution to 60 000 installations of the git plugin will create far more pain than it solves.

You might consider building your own version of the git plugin for use on your server. You could then choose the plugin version you prefer, merge this change onto it, and provide valuable feedback on the completeness of this proposed change for your use case.

@ndeloof

This comment has been minimized.

Copy link
Member

ndeloof commented Jul 2, 2015

I'm 👎 on this approach. For sure it will fix your issue but is just hides the underlying design issue in a limited context and we hardly understand the possible side effects

@MikeMcQuaid

This comment has been minimized.

Copy link

MikeMcQuaid commented Jul 2, 2015

Even if it did merge successfully with the current branch, the work is incomplete (at least with regard to the volume of testing needed to verify this caliber of change). Delivering an incomplete solution to 60 000 installations of the git plugin will create far more pain than it solves.

@MarkEWaite @ndeloof It'd be good if you could elaborate a bit on what's required for this to get merged (beyond fixing the conflicts). It's a known, unresolved issue and there doesn't seem to be any work happening on it beyond this PR. If this is a "only someone with commit access can fix this" that's OK; it'd just be better to be explicit.

@ndeloof

This comment has been minimized.

Copy link
Member

ndeloof commented Jul 2, 2015

I won't accept this PR as it does not address the underlying issue, just provides a workaround for a specific context, and can introduce major issues when user later switch to a branch-based model that would require BuildData well set. BuildData is just a terrible idea and should never have been implemented (at least, not this way) so we have to find another way to handle this and provide migration path ... not a trivial development (I already tried 3 times to implement this without success)

Main issue is BuildData is exposed to BuildChooser, so any refactoring do interact with other plugins. Using a centralized branch Map doesn't work in this context, as some plugin do rely on BuildData history to compare current Build vs previous one at various point in build execution. Imho we need to significantly change git-plugin design and especially drop BuildChooser in favor for some more isolated API to let external plugin control the workspace setup / identify commit to build.

@MikeMcQuaid

This comment has been minimized.

Copy link

MikeMcQuaid commented Jul 2, 2015

@ndeloof Perhaps it should be closed, then?

For anyone else reading this issue: https://wiki.jenkins-ci.org/display/JENKINS/Remove+Git+Plugin+BuildsByBranch+BuildData is a slight workaround.

@ndeloof ndeloof closed this Jul 2, 2015

@HedAurabesh

This comment has been minimized.

Copy link
Author

HedAurabesh commented Jul 13, 2015

Hi everyone.

We had used a forked version of the GIT plugin with the centralized Branch data changes in place for a bit over a year without problems.

Nowadays, we have abandoned the use of the GIT plugin entirely, so we don't mind you closing this issue.

I just want to highlight that this is essentially a ticking time-bomb for all ~60.000 installations using this plugin. It is not a question IF it will become a problem for a particular user; it is only a question of WHEN.

Because, eventually, even the smallest growth in data -- if it grows monotonically like this one here -- will grow out of even the largest bounds.

@ndeloof

This comment has been minimized.

Copy link
Member

ndeloof commented Jul 13, 2015

I fully agree and this is my major concern with git-plugin.

@TheFriendlyCoder

This comment has been minimized.

Copy link

TheFriendlyCoder commented Jul 27, 2016

I can't speak to the difficulties with implementing a fix for this, but I can attest to the impact being significant and the use cases mentioned are not unique and isolated. In fact, for anyone using the MultiJob plugin the effects of this defect are compounded. This is due to the fact that each sub-job managed by a multi-job have their BuildData embedded in the build log for the parent. Also, this behaviour is recursive, so multi-jobs that manage other multi-jobs have their transitive descendants BuildData included in the parents as well. In fact, in our use case some build logs have grown to 5-10MB, 90% of which are contained within the BuildData nodes in the build.xml.

Easy or hard, I think a fix is desperately needed for this sooner rather than later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.