Change cluster naming convention for e2e CI/PR jobs #7682
Conversation
/cc @rmmh
Force-pushed 7180b60 to 9fa0b98 (compare)
@rmmh Could you PTAL? The rationale for this change is explained here: #7673 (comment)
jenkins/bootstrap.py
Outdated
@@ -657,9 +657,10 @@ def pr_paths(base, repos, job, build):
    # Batch merges are those with more than one PR specified.
    pr_nums = pull_numbers(pull)
    if len(pr_nums) > 1:
        pull = os.path.join(prefix, 'batch')
        os.environ[PULL_ENV] = 'batch'
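For context on the batch branch in the hunk above: a sketch of how the PR numbers can be recovered from a Prow pull ref. The helper name and the exact refs format (`branch:base-sha,pr:pull-sha,...`) are assumptions here, not test-infra's actual API.

```python
def pull_numbers_sketch(pull):
    # Hypothetical stand-in for bootstrap's pull parsing, assuming Prow's
    # "branch:base-sha,pr:pull-sha,..." refs format; only the numeric
    # leading components are PR numbers.
    return [p.split(':')[0] for p in str(pull).split(',')
            if p.split(':')[0].isdigit()]

# A batch ref names more than one PR, which is what flips pr_paths
# (and PULL_ENV) to 'batch':
print(pull_numbers_sketch('master:deadbeef,7682:aaaa,7673:bbbb'))  # ['7682', '7673']
```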
don't change bootstrap.py -- PULL_NUMBER is set by Prow.
Done - PTAL.
# This ensures no conflict across runs of different jobs (see #7592).
# For PR jobs, we use PR number instead of build number to ensure the
# name is constant across different runs of the presubmit on the PR.
# This helps clean potentially leaked resources from earlier run that
if the cluster already exists, it will be deleted first, right?
That's right - as part of k/k's e2e-up.sh script.
Force-pushed 9fa0b98 to cbb1934 (compare)
@rmmh Fixed the comments - PTAL.
/hold
scenarios/kubernetes_e2e.py
Outdated
# name is constant across different runs of the presubmit on the PR.
# This helps clean potentially leaked resources from earlier run that
# could've got evicted midway (see #7673).
suffix = os.getenv('BUILD_NUMBER', 0)
Make this dispatch explicit based on JOB_TYPE, and reduce the amount of magic happening in getenv:
job_type = os.getenv('JOB_TYPE')
if job_type == 'batch':
    suffix = 'batch'
elif job_type == 'presubmit':
    suffix = '%s' % os.environ['PULL_NUMBER']
else:
    suffix = 'b%s' % os.getenv('BUILD_NUMBER', 0)
if len(suffix) > 10:
    suffix = hashlib.md5(suffix).hexdigest()[:10]
job_hash = hashlib.md5(os.getenv('JOB_NAME', '')).hexdigest()[:5]
return 'e2e-%s-%s' % (suffix, job_hash)
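The suggested scheme can be exercised standalone. A minimal sketch, assuming a hypothetical helper that takes the environment as a dict (not test-infra's actual API); note that on Python 3 `hashlib.md5` needs bytes, hence the `.encode()` calls:

```python
import hashlib

def cluster_name_sketch(env):
    # Hypothetical helper mirroring the suggestion above; the function
    # name and env-dict parameter are illustrative only.
    job_type = env.get('JOB_TYPE')
    if job_type == 'batch':
        suffix = 'batch'
    elif job_type == 'presubmit':
        suffix = '%s' % env['PULL_NUMBER']
    else:
        suffix = 'b%s' % env.get('BUILD_NUMBER', 0)
    if len(suffix) > 10:
        suffix = hashlib.md5(suffix.encode()).hexdigest()[:10]
    job_hash = hashlib.md5(env.get('JOB_NAME', '').encode()).hexdigest()[:5]
    return 'e2e-%s-%s' % (suffix, job_hash)
```

The key property: for presubmits, the name depends only on the PR number and the job name, so it is stable across re-runs of the same presubmit on the same PR.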
Done.
my concern here: you have a PR:
@krzyzacy With this change, we wouldn't need to set a grace-period for the scalability jobs (and in fact for any job that's creating a k8s cluster) :) [EDIT: To clarify, reaping the leaked resources would be part of the next run's setup scripts]
Force-pushed cbb1934 to 256b8d8 (compare)
Force-pushed 256b8d8 to 0e0e047 (compare)
we should probably retain a long enough grace period for the test logs to upload
/lgtm
This may not be possible currently for our scalability presubmits due to the need for various kinds of quota (unless we can somehow change boskos to account for that).
Boskos supports multiple resource pools; GPU-related testing has its own collection of projects with special quota. We can do something similar for scalability?
@shyamjvs there are ways to manage quota for internal projects; you don't have to do it manually
I see, thanks. If we can somehow ensure that our jobs land on specific project(s), that should work.
By "manage quota" do you mean manage increasing/decreasing quota for our scalability projects, or manage allocation of projects to our jobs by test-infra?
Yeah, for gpu jobs we have the flag
Can we get this PR in if there are no other concerns?
If there are some discussions that can happen independent of this PR (like moving scale jobs to boskos) - I'd prefer not blocking this on those.
Does it SG?
For now, I manually cleaned up the leaked resources from so many runs. With this PR in, there should be far fewer leaks.
/lgtm
I would prefer we not check in new code using a deprecated environment variable, FWIW.
Flipping AllowCancellations should not make a difference because sinker is deleting the pods anyway currently. Basically we are always allowing cancellations. The fix that I am suggesting would just fix the
@BenTheElder If the problem is using BUILD_NUMBER (which is being deprecated) instead of BUILD_ID, then I can easily change that :)
Please do. We should minimize the number of environment variables in use where reasonable.
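The migration being asked for amounts to preferring BUILD_ID with a legacy fallback. A hedged sketch (the helper name is illustrative, not a real test-infra function):

```python
import os

def build_id(env=None):
    # Sketch of the migration discussed above: prefer Prow's BUILD_ID and
    # fall back to the deprecated BUILD_NUMBER only for legacy jobs.
    env = os.environ if env is None else env
    return env.get('BUILD_ID') or env.get('BUILD_NUMBER', '0')
```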
Force-pushed cca348b to b82a294 (compare)
@BenTheElder Done, PTAL
job_type = os.getenv('JOB_TYPE')
if job_type == 'batch':
    suffix = 'batch-%s' % os.getenv('BUILD_ID', 0)
elif job_type == 'presubmit':
why aren't we just using BUILD_ID for all cases?
That's the reason to have this PR in the first place :)
See #7673 (comment) for the rationale.
If we use BUILD_ID for the presubmit runs, then the whole purpose (of having the same cluster name for a given presubmit+PR pair) is defeated - and we end up being in pretty much the same state that we already are in.
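The stability property being defended here can be shown with a small model of the merged dispatch. All names below are hypothetical, for illustration only:

```python
import hashlib

def name_for(job_type, build_id, pull_number, job_name='ci-e2e'):
    # Illustrative model of the dispatch in this PR: presubmits key the
    # suffix on the PR number; batch and periodic runs key it on BUILD_ID.
    if job_type == 'batch':
        suffix = 'batch-%s' % build_id
    elif job_type == 'presubmit':
        suffix = pull_number
    else:
        suffix = 'b%s' % build_id
    job_hash = hashlib.md5(job_name.encode()).hexdigest()[:5]
    return 'e2e-%s-%s' % (suffix, job_hash)

# Two presubmit runs on the same PR collide on purpose, so the next run's
# setup can tear down whatever an earlier, possibly evicted, run leaked:
assert name_for('presubmit', '101', '7682') == name_for('presubmit', '102', '7682')
# Periodic runs remain unique per build:
assert name_for('periodic', '101', None) != name_for('periodic', '102', None)
```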
Er so we're going to depend on cluster naming for the same project + PR number + job and the side effect that kubetest will tear it down ... ? 😯
This feels like a brittle workaround that will come back to bite us. In the future this will get refactored... We don't want to need the "scenarios" long term.
/lgtm
/hold cancel
I'm merging this because we have very real problems with the affected jobs right now, but for the record I don't think this is a strong solution, new projects should be managed by boskos and we should improve the boskos janitor redundancy.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: BenTheElder, krzyzacy, rmmh, shyamjvs
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Thanks Ben. That does sound like a reasonable long-term plan.
Ref #7673
Does this look reasonable to you?
/cc @krzyzacy @BenTheElder @cjwagner