Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable stacktraces from ExportTable entries by default #441

Merged
merged 3 commits into from
Apr 17, 2021
Merged

Disable stacktraces from ExportTable entries by default #441

merged 3 commits into from
Apr 17, 2021

Conversation

res0nance
Copy link
Contributor

Not certain if this is a good idea but the idea spawned from JENKINS-63975 as stacktraces seem to take a considerable amount of memory usage in Jenkins and as far as I understand it, it is entirely for debugging.

Although I'm not certain how often this workflow is used in the real world, if its very rare this ever happens perhaps its better to disable by default.

@jeffret-b
Copy link
Contributor

This looks like a good idea. I don't think we get much value from retaining these stack traces. I'll have to look at it more deeply.

@basil
Copy link
Member

basil commented Feb 19, 2021

While I have not used it myself, this infrastructure smells like what Brendan Gregg would call a crisis tool: a tool which you would need to get to root cause in a crisis situation and which you would not be able to easily install or enable in that same crisis situation. If such tools are not installed or enabled by default, then these crises often pass without getting to root cause. Installing or enabling the crisis tool after the fact may not help if the problem is not easily reproducible, or if inducing another failure for the purposes of root cause analysis is prohibitively expensive. The philosophy of debugging from first failure is best served by having such crisis tools installed and enabled by default.

@res0nance
Copy link
Contributor Author

While I have not used it myself, this infrastructure smells like what Brendan Gregg would call a crisis tool: a tool which you would need to get to root cause in a crisis situation and which you would not be able to easily install or enable in that same crisis situation. If such tools are not installed or enabled by default, then these crises often pass without getting to root cause. Installing or enabling the crisis tool after the fact may not help if the problem is not easily reproducible, or if inducing another failure for the purposes of root cause analysis is prohibitively expensive. The philosophy of debugging from first failure is best served by having such crisis tools installed and enabled by default.

That was an informative watch, although in this case compared to having tools installed this has a cost associated in terms of memory which directly affects scalability. In some controllers StacktraceElement can be one of the largest consumers of memory. One of my instances has it pegged at the second highest consumer of memory. Depending on how useful this information is to remoting perhaps an argument can be made to disable it by default.

@jeffret-b
Copy link
Contributor

Yes, I'm sure this was intended as some sort of debugging tool in problematic situations. But, I don't know that there was really a justification for it or someone years ago thought it might be useful sometime. I'm not personally aware of it every being useful.

Most of the failures associated with Remoting these days originate from outside -- system, environment, or networking issues that cause the channel to close. Or weird plugin interactions that end up looking like something in Remoting. The stack traces from inside rarely provide interesting information in these cases.

Unless we can come up with a meaningful scenario where these stack traces help, removing them would be better given the cost.

@basil
Copy link
Member

basil commented Feb 25, 2021

I'm not personally aware of it ever being useful.

That doesn't necessarily mean this functionality is pointless, just that nobody has used it recently (which could be for any number of reasons). If this functionality is now pointless and has not been recently used, then it would indeed be clearly desirable to disable it by default or even to remove it entirely. If this functionality still has a legitimate purpose but has not been recently used, then making it harder to use during crisis situations would be a small regression in debuggability. Whether that small regression is worth the gain in scalability seems hard to quantify without fully understanding the original use case, something I don't personally understand and which isn't explained in this PR.

I think it would be helpful to research the original use case from which this functionality originated and to reason explicitly about whether or not that use case remains legitimate, regardless of whether or not anyone has used this functionality recently. Ideally a solution could be found that improves scalability without decreasing debuggability. It is unclear to me whether such a solution exists, but I think it would be worth thinking about. At minimum, going through this thought exercise would give us more clarity about the cost (to be weighed against the obvious scalability benefit).

If we have looked for such a solution but have not found one, then we have to choose which of the two is more important in this situation: scalability or debuggability. That decision is in part based on the cost-benefit calculus described above but also comes down to the community's values. I personally value debuggability over scalability, but I recognize that I am on the periphery of this community and that the community's values may not exactly match my own (which is fine).

@jeffret-b
Copy link
Contributor

This diagnostic capability was added 7 years ago in an attempt to gather more information for tracking down some specific information. I'm not seeing any indication that it was helpful in diagnosing the problem or that the problem has retained any importance. While it's hard to say that it no longer exists, I can't find any further reports. There are only 4 people watching it.

This PR does include a way to reintroduce the logging, if it is needed. Given the thin history of this capability, even that may be excessive.

Given the cost of this feature and the very dubious value, it's probably best to at least disable it.

Copy link
Contributor

@jeffret-b jeffret-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm inclined to go with this cleanup. I haven't seen cases where this information has been useful. If it does prove useful somewhere this change includes a flag to turn it back on.

@jeffret-b
Copy link
Contributor

Additional reviews and opinions on this one are definitely welcome. I've taken basil's into account and am still inclined to accept it as mentioned in my approval.

Copy link
Member

@oleg-nenashev oleg-nenashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be documented as a system property. No objections w.r.t disabling it by default, but maybe we need some means to enable it in runtime for the controller and agents. Otherwise debugging will become even more complicated

Will request reviews from @jenkinsci/core to get more feedback

@oleg-nenashev oleg-nenashev requested a review from a team April 5, 2021 21:36
Copy link
Member

@jglick jglick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably OK. If heap rather than CPU is the main issue, might be preferable to retain traces for recently created entries, but I am not sure if that would even be appropriate.

src/main/java/hudson/remoting/ExportTable.java Outdated Show resolved Hide resolved
@@ -11,6 +11,7 @@
*/
public class ExportTableTest extends TestCase {
public void testDiagnosis() throws Exception {
ExportTable.EXPORT_TRACES = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restore previous value in a finally block, or @After etc. (See FlagRule for JTH-based tests.)

Copy link
Member

@timja timja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm apart from Jesse's comments

res0nance and others added 2 commits April 6, 2021 19:51
Co-authored-by: Jesse Glick <jglick@cloudbees.com>
@oleg-nenashev oleg-nenashev self-requested a review April 6, 2021 12:00
@jeffret-b jeffret-b added ready-to-merge removed Functionality removed or otherwise cleaned up labels Apr 6, 2021
@jeffret-b
Copy link
Contributor

Looks like this has received enough approvals so I marked it as ready-to-merge. Other reviews are still welcome. We'll plan on getting this into an upcoming release.

Copy link
Member

@oleg-nenashev oleg-nenashev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@oleg-nenashev oleg-nenashev merged commit 6f30821 into jenkinsci:master Apr 17, 2021
@jeffret-b jeffret-b added the enhancement For changelog: An enhancement providing new capability. label Apr 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement For changelog: An enhancement providing new capability. ready-to-merge removed Functionality removed or otherwise cleaned up
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants