
constrain an agent's memory #380

Merged
merged 2 commits into jenkinsci:master from constrain-agent-memory on Jan 26, 2022

Conversation

jtnord (Member) commented Jan 19, 2022

Noticed when looking at the memory usage of pipeline-model-definition
that the agents created in the tests did not have a constrained memory
limit set. This means that the tests would be platform dependent, or
flaky depending on what GC happens at various times.

The number 512 is relatively arbitrary, but k8s agents use 256Mi as a
request, so 512 is double that, and the agent process is supposed to be
lightweight.

Whilst this could reveal a few issues in plugins when they upgrade, it
seems to me that any such issue would be a bug in the plugin, and so I
have not implemented a way to create an online slave with a specific
amount of memory. If that is indeed required, it can be added later, as
JTH updates are not forced on plugins.
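
For context, here is a minimal sketch of the kind of harness-created agent this affects, using the existing JenkinsRule.createOnlineSlave API (the test class and its contents are illustrative only, not part of this PR):

```java
import hudson.slaves.DumbSlave;
import org.junit.Rule;
import org.junit.Test;
import org.jvnet.hudson.test.JenkinsRule;

public class AgentMemoryExampleTest {

    @Rule
    public JenkinsRule j = new JenkinsRule();

    @Test
    public void buildRunsOnHarnessCreatedAgent() throws Exception {
        // With this change, the agent JVM launched here starts with a constrained
        // heap (512 MB) rather than inheriting a platform-dependent default.
        DumbSlave agent = j.createOnlineSlave();
        // ... run whatever builds the plugin under test needs on this agent
    }
}
```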

Here is a before from a pipeline-model-definition test (nearly 3 GB claimed for the heap, and there are 4 agents in this test!):

[screenshot: JVM heap usage before the change]

And after:

[screenshot: JVM heap usage after the change]


jglick added the bug label Jan 19, 2022
jglick enabled auto-merge January 19, 2022 13:15
basil (Member) commented Jan 19, 2022

I think a realistic test of this PR would involve the following:

  1. (Regression test) Running the core test suite with the incremental from this PR
  2. (Regression test) Running a plugin's test suite with the incremental from this PR and a Jenkinsfile, assuming that:
    a) The plugin has test cases that create 1-2 agents
    b) The plugin is running on our standard buildPlugin(useContainerAgent: true) setup
    c) The plugin's tests currently pass
  3. (Fix test) Find a plugin with a test case that had to be disabled on Windows ACI container agents due to the fact that it created too many agents and used up too much memory (ExtendedEmailPublisherMatrixTest is a good one), enable the test on Windows ACI container agents to observe that it still fails, and then confirm that with this fix the test case now uses less memory and passes on Windows ACI container agents

basil disabled auto-merge January 19, 2022 13:19
jglick (Member) commented Jan 19, 2022

Yes we could do it that way rather than just waiting to see if there are any regressions and reverting if so. Any volunteers?

jtnord (Member Author) commented Jan 19, 2022

> Yes we could do it that way rather than just waiting to see if there are any regressions and reverting if so. Any volunteers?

I am creating some PRs

timja (Member) commented Jan 19, 2022

> Yes we could do it that way rather than just waiting to see if there are any regressions and reverting if so. Any volunteers?

> I am creating some PRs

good luck while jenkins-infra/helpdesk#2733 is a thing

jtnord (Member Author) commented Jan 19, 2022

The CI build failed due to a timeout (50+ minutes before even starting the tests, which looks like the Artifactory issue), so there is no incremental to attempt to use at the moment 😢

jtnord added a commit to jtnord/jenkins that referenced this pull request Jan 19, 2022
jtnord (Member Author) commented Jan 19, 2022

I was thinking this would most likely break the maven-plugin, as that does horrible things with Maven in the agent process, but we are in luck - it copies the JTH code 🥳
/cc @aheritier in case he wants to make similar changes in the job.

jtnord (Member Author) commented Jan 21, 2022

@basil

> (Regression test) Running the core test suite with the incremental from this PR

jenkinsci/jenkins#6192 exists and has 2 failures; both have been observed before this change.
Also, given jenkinsci/jenkins#6198 (review), where the previous release has been reverted, it is unclear how to demonstrate this further in Jenkins core given the current state.

> (Regression test) Running a plugin's test suite with the incremental from this PR and a Jenkinsfile, assuming that:

pipeline-model-definition does that and there is a PR that fixes a lot of tests.

Also, given the extended issue with Artifactory, it is not clear whether I can demonstrate this any further before getting a release.

timja (Member) commented Jan 21, 2022

Shouldn't we identify the cause of the regression and revert the fix forward before adding more changes to this module?

jtnord (Member Author) commented Jan 21, 2022

> Shouldn't we identify the cause of the regression and revert the fix forward before adding more changes to this module?

I assumed, given Basil reverted it, that he is working on it somewhere. There are so few things in the change that the only thing that looked applicable was the JUnit 5 stuff (#375).

The Jenkins build is also no longer happy regardless of the JTH version (Windows ACI agent issue), and given this does not seem to impact everyone, merging this whilst the above investigation goes on should not be a blocker, should it? (No one is forcing this update on anyone - it is purely opt-in and, given the changelog, should not impact any reverting or fixing due to conflicts.)

Also of note: at least one plugin (that is pretty important) is getting much less test coverage than it should because of this issue: jenkinsci/pipeline-model-definition-plugin#477

Do we really think an agent should have more than 512 MB in production, let alone in testing (save the Maven project type?), such that we think this is too risky to accept?

basil (Member) commented Jan 21, 2022

> it is unclear how to demonstrate this further in Jenkins core given the current state.

You could demonstrate this PR in core by rebasing it on top of the older version of jenkins-test-harness that core is currently using.

You have also not addressed my third point about Windows ACI tests like ExtendedEmailPublisherMatrixTest.

> Shouldn't we identify the cause of the regression and revert the fix forward before adding more changes to this module?

Agreed. The reason jenkins-test-harness is in a bad state is that previous PRs were merged without sufficient testing. The bad state should be corrected by reverting the problematic PR, and we should avoid getting into a bad state going forward by doing more thorough testing of PRs (like this one) prior to merge.

> I assumed, given Basil reverted it, that he is working on it somewhere.

Do not make assumptions about what I am working on. I am not working on stabilizing jenkins-test-harness.

jtnord (Member Author) commented Jan 21, 2022

> You have also not addressed my third point about Windows ACI tests like ExtendedEmailPublisherMatrixTest.

which was:

> (Fix test) Find a plugin with a test case that had to be disabled on Windows ACI container agents due to the fact that it created too many agents and used up too much memory (ExtendedEmailPublisherMatrixTest is a good one), enable the test on Windows ACI container agents to observe that it still fails, and then confirm that with this fix the test case now uses less memory and passes on Windows ACI container agents

See jenkinsci/pipeline-model-definition-plugin#480 - the tests were disabled on Windows and it uses ACI agents. I did not choose the plugin you suggested, but this does, as far as I am aware, cover the ask.
Granted, I did not re-enable the tests to show they failed first, but you only disabled them in a PR that was merged the same day, so I figured that would be overkill.
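
For reference, the usual pattern by which such a test is skipped on Windows looks roughly like the sketch below (illustrative only; the pipeline-model-definition and email-ext tests may use a different mechanism, and the test name and message are made up):

```java
import static org.junit.Assume.assumeFalse;

import hudson.Functions;
import hudson.slaves.DumbSlave;
import org.junit.Rule;
import org.junit.Test;
import org.jvnet.hudson.test.JenkinsRule;

public class WindowsSkipSketchTest {

    @Rule
    public JenkinsRule j = new JenkinsRule();

    @Test
    public void heavyMultiAgentScenario() throws Exception {
        // Skip on Windows ACI agents, where several agent JVMs used to exhaust the
        // container's memory; with the 512 MB heap cap this skip may become unnecessary.
        assumeFalse("multiple agents exhaust memory on Windows ACI containers",
                Functions.isWindows());
        DumbSlave agent = j.createOnlineSlave();
        // ... create further agents and run the memory-hungry scenario here
    }
}
```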

jtnord closed this Jan 21, 2022
jtnord reopened this Jan 21, 2022
jtnord merged commit 6b1e34c into jenkinsci:master Jan 26, 2022
jtnord deleted the constrain-agent-memory branch January 26, 2022 12:20