Show warning when restarting job and assets are missing #2676

Merged

Martchus
Contributor

@Martchus Martchus commented Jan 20, 2020

See #2676 (comment) for the current state.


Related issue: https://progress.opensuse.org/issues/34783

This is WIP, as it turns out not to be as easy as expected:

  • The required assets need to be parsed from the settings again because the assets of the job might have already been deleted from the database.
  • "Hidden" assets such as repositories need to be ignored.
  • So far the error is thrown in the duplicate function which means that this change would prevent any job duplication unless the assets are present.

Note that this is only about restarting jobs, to prevent the creation of a clone in the first place.

Member

@okurz okurz left a comment

Interesting. Our change to look in the config in the new location broke other tests, e.g.

t/ui/18-tests-details.t ................ Can't call method "config" on an undefined value at /home/squamata/project/lib/OpenQA/Schema/Result/Assets.pm line 172.
t/ui/18-tests-details.t ................ 1/? 
#   Failed test 'on main page'
#   at t/ui/18-tests-details.t line 100.
#          got: ''
#     expected: 'openQA'

I would like to avoid that we need to add the boiler-plate fake app code in all affected test modules though. Maybe we should fall back to a default value in the code if the whole "config" cannot be found, or change the "app->config" lookup altogether so that this becomes easier both in general and for testing. As we already discussed the topic of the "global $app variable" yesterday, I wonder what a better approach would be, e.g. having a method that returns the app as a first step?
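The fallback idea above could look roughly like the following. This is only a minimal Python sketch, not the openQA (Perl) code: `FakeApp`, `hidden_asset_types` and the flat config layout are hypothetical names used for illustration, and the default of `repo` is taken from the `hide_asset_types` default mentioned later in this conversation.

```python
from typing import Optional, Set

# Assumed default, based on the "hide_asset_types" default of 'repo'
# mentioned later in this conversation.
DEFAULT_HIDE_ASSET_TYPES = "repo"


class FakeApp:
    """Hypothetical stand-in for an application object that carries a config."""

    def __init__(self, config: dict):
        self.config = config


def hidden_asset_types(app: Optional[FakeApp] = None) -> Set[str]:
    """Return the hidden asset types, falling back to the default without an app."""
    config = app.config if app is not None else {}
    raw = config.get("hide_asset_types", DEFAULT_HIDE_ASSET_TYPES)
    return set(raw.split())


def is_type_hidden(asset_type: str, app: Optional[FakeApp] = None) -> bool:
    """Check one asset type; works even when no app/config is available."""
    return asset_type in hidden_asset_types(app)
```

With a fallback like this, a test module that never sets up an app would still get the default behaviour instead of failing on an undefined value.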

@@ -164,13 +164,18 @@ sub refresh_size {
     return $new_size;
 }
 
-sub hidden {
+sub is_type_hidden {
Member

Seems weird to have this as a public function in a result class.

Member

If OpenQA::Setup wasn't so chaotic, I would say just put it there as a config utility function.

Contributor Author

Yes, it would be good to find a better place for this function. But for now this is likely the least of the problems with this draft.

@Martchus
Contributor Author

I would like to avoid that we need to add the boiler-plate fake app code in all affected test modules though.

Me, too. And yes, it would make sense to let the code handle this instead of providing an app within all tests.

@Martchus
Contributor Author

I suppose it is better to make this a warning as a first step to avoid potential problems.

Then we can decide what we actually want to do if an asset is missing:

  1. There's one ticket which suggests the download should be restarted
  2. and another ticket which suggests the restart should be prevented (that's what I started to implement here).

The check for missing assets introduced here would be useful regardless of whether we implement option 1 or 2 in the end.

@Martchus Martchus force-pushed the prevent-restarting-job-if-asset-missing branch from ffdc09d to 4f14198 Compare February 4, 2020 16:58
@Martchus Martchus force-pushed the prevent-restarting-job-if-asset-missing branch from 4f14198 to 1ed83a6 Compare February 17, 2020 09:19
@Martchus Martchus force-pushed the prevent-restarting-job-if-asset-missing branch from 1ed83a6 to 9f4890e Compare February 20, 2020 12:11
@Martchus Martchus marked this pull request as ready for review February 20, 2020 16:43
@Martchus Martchus changed the title Prevent restarting job if asset missing Show warning when restarting job and assets are missing Feb 20, 2020
@codecov

codecov bot commented Feb 20, 2020

Codecov Report

Merging #2676 into master will decrease coverage by 0.04%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2676      +/-   ##
==========================================
- Coverage   92.03%   91.98%   -0.05%     
==========================================
  Files         190      190              
  Lines       11764    11779      +15     
==========================================
+ Hits        10827    10835       +8     
- Misses        937      944       +7

Impacted Files                                 Coverage Δ
lib/OpenQA/WebAPI/Controller/API/V1/Job.pm     87.92% <100%> (+0.03%) ⬆️
lib/OpenQA/Schema/Result/Assets.pm             98.41% <100%> (+0.02%) ⬆️
lib/OpenQA/Resource/Jobs.pm                    100% <100%> (ø) ⬆️
lib/OpenQA/Schema/Result/Jobs.pm               95.76% <100%> (+0.04%) ⬆️
lib/OpenQA/Scheduler/Model/Jobs.pm             88.66% <0%> (-2.84%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1c8a7e6...742afe5.

@Martchus Martchus merged commit 8597525 into os-autoinst:master Feb 21, 2020
@Martchus Martchus deleted the prevent-restarting-job-if-asset-missing branch February 21, 2020 11:05
@AdamWill
Contributor

"There's one ticket which suggests the download should be restarted"

I'd quite like this, FWIW. Also what about the case where the asset was generated by a chained parent but has now been garbage-collected? Restart the parent?

@okurz
Member

okurz commented Mar 20, 2020

Yes, if the asset was provided by the parent, this is what I suggest for now: prevent the restart of the child. The parent can still be restarted, which would generate the necessary asset and also trigger the child.

@marmarek
Contributor

This (and the later change making it an error) breaks restarting a job that has its hard disk image created by the parent job. I get:

Errors occurred when restarting jobs:

    Job 7639 misses the following mandatory assets: hdd/disk_unencrypted_updated.qcow2
    You may try to retrigger the parent job that should create the assets and will implicitly retrigger this job as well.

The asset is there, but is named 00007628-disk_unencrypted_updated.qcow2 (because it is created by the parent job).
Example affected job: https://openqa.qubes-os.org/tests/7639

PS I'd comment on progress.opensuse.org, but my user is not allowed there ("marmarek was authenticated but needs to be created and/or activated in Redmine first.").

@Martchus
Contributor Author

I'll look into it on Monday.

@AdamWill
Contributor

AdamWill commented Apr 18, 2020

oh, yeah, the name check will need to do the same as the code that modifies asset names on upload I guess. I think that's this bit here.

@AdamWill
Contributor

So, looking a bit more: Schema/Result/Jobs.pm already has a _asset_find which takes into account the possibility of modified asset names. It looks for files on disk rather than database entries, but could this not just use that?

@AdamWill
Contributor

AdamWill commented Apr 18, 2020

I also don't understand why this ignores hidden assets. Hidden assets are hidden from the web UI. It is still going to be a problem if the job needs to access them but they don't exist - why are we ignoring them? In Fedora openQA we set ISOs and disk images to be "hidden" because we don't want people to be able to download them from the web UI (our server doesn't really have the bandwidth for that), so that means this check will just ignore them and be more or less useless for us. Did you really want to exclude repos from the check, and this just 'works' because the default config is that only repos are hidden?

@okurz
Member

okurz commented Apr 20, 2020

@AdamWill The point of "hidden assets" is that these are "hidden dependencies" which we don't want to (try to) clone with openqa-clone-job. For example, take openqa-clone-job <$id> of an installation job that has access to a 4.7GB installation DVD plus 120GB of online repos: we want to clone the DVD, but trying to clone 120GB of online repos is too much in most cases, and hopefully these assets are not necessary locally. This of course very much depends on test design, because at best the tests do not rely on these "hidden assets", aka online repos, which hopefully still reside on the clone-from openQA instance. Seems like you (ab-)use the feature for something different :)

For your particular use case I suggest not letting openQA handle the "prevent download from our poor-bandwidth openQA server" part, but using your web proxy, e.g. apache or nginx, to either forbid the download or redirect to another instance that can handle the load. openSUSE uses the same approach to prevent downloads of GM images for openSUSE Leap from openqa.o.o, either forbidding them with 403 or redirecting to the static filename on download.opensuse.org. Example entries for apache:

<VirtualHost *:80>

    ServerName openqa.opensuse.org
    ServerAlias openqa.infra.opensuse.org
…
    RewriteEngine on
    RewriteCond %{REMOTE_ADDR}       !^192\.168\.112\.
    RewriteRule "/assets/iso/fixed/openSUSE-Leap-(42.[23]|15.[0-9])-(DVD|NET)-x86_64.iso" "http://download.opensuse.org/distribution/leap/$1/iso/openSUSE-Leap-$1-$2-x86_64.iso"

    # Uncomment during GMC phase of Leap
    #RewriteCond %{REMOTE_ADDR}       !^192\.168\.112\.
    #RewriteRule "/assets/iso/openSUSE-Leap-(42.[23]|15.[0-9])-(.*).iso" "http://download.opensuse.org/distribution/leap/$1/iso/openSUSE-Leap-$1-$2.iso"

    RewriteCond %{REMOTE_ADDR}       !^192\.168\.112\.
    RewriteRule "/assets/.*/([wW]in.*)" "https://www.microsoft.com/software-download/windows10ISO"
</VirtualHost>

Martchus added a commit to Martchus/openQA that referenced this pull request Apr 20, 2020
* Otherwise the check for missing assets is unable to find the asset under
  its adjusted name
* See os-autoinst#2676 (comment)
@Martchus
Contributor Author

@marmarek It turns out that this is not that trivial to implement because the job settings don't have the (parent) job ID which is added on asset creation. It would be possible to apply the same logic as in register_assets_from_settings so it would try the IDs for all parents. However, that's a little bit complicated and wouldn't work for cloned jobs without further hacks in the clone job script.

Another idea is to simply adjust the job settings when naming the asset: #2961
Not sure what side-effects this change will have. Besides, it might only work for the first restart.

I could also add a "Force restart" button to make it at least not so inconvenient to restart the job when a false alert happens.

@AdamWill
Contributor

AdamWill commented Apr 20, 2020

@okurz we are not "abusing it for something else". The hidden assets feature you're using is specifically designed for concealing assets from display as downloadable by the web UI. Once again I can tell you this because I wrote it. :)

At the time it was written it was only used for this purpose, so if openqa-clone-job came along and adopted this feature for some other purpose, then it is the one abusing it.

@Martchus
Contributor Author

Mh... your commit makes repos the only default "hidden asset". That's why I thought of repos as an example of a hidden asset: something we cannot download locally and don't display on the web page. Apparently it was only about the last property, though. (The commit message makes that clear.)

But the clone script didn't even adopt the feature. It has a hard-coded exception for repos only for the "can not download locally part": next if $type eq 'repo'; # we can't download repos

So maybe we should open yet another category for non-clonable assets if we want to avoid hardcoding in the clone script. The check for missing assets could then ignore these assets instead of all hidden assets.

The good thing is that this is not a regression for you. It simply means you can not benefit from the new feature to prevent starting jobs with missing assets. But that feature still has its problems anyways.

@AdamWill
Contributor

AdamWill commented Apr 20, 2020

right, in practice I'm happy that bug exists, but it's still a bug =)

Before my commit, this was similarly hardcoded for the web UI: it just hardcoded "don't show repo assets". My commit made that a config setting instead. I left the default value as just repos, because that's what SUSE wants (wanted?), but for Fedora, ever since that commit we've had that setting set to also hide ISOs and hard disk images. And yeah, I agree it makes sense to abstract out the concept of "non-clonable assets", it just needs to be different from "not-displayed-in-the-web-UI assets".

@marmarek
Contributor

I could also add a "Force restart" button to make it at least not so inconvenient to restart the job when a false alert happens.

Ideally a proper solution would be better. But in the meantime, is there a single line I can comment out to work around the issue? I'm unable to restart most of the jobs right now...

@Martchus
Contributor Author

You can still force the restart via the API (using the query parameter force=1). You could also simply insert a return 0; at the beginning of sub missing_assets {.
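For illustration, the forced restart request could be addressed as sketched below. This is a hedged Python sketch that only builds the URL: the `/api/v1/jobs/<id>/restart` route is an assumption here (only the `force=1` query parameter is confirmed above), `restart_url` is a hypothetical helper, and the actual request would additionally need to be an authenticated POST.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def restart_url(base_url: str, job_id: int, force: bool = False) -> str:
    """Build the URL for restarting a job, optionally with force=1.

    The route layout is an assumption for illustration purposes.
    """
    parts = urlsplit(base_url)
    path = f"{parts.path.rstrip('/')}/api/v1/jobs/{job_id}/restart"
    query = urlencode({"force": 1}) if force else ""
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))


if __name__ == "__main__":
    # Hypothetical host name; job ID taken from the example earlier in the thread.
    print(restart_url("https://openqa.example.org", 7639, force=True))
```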

@marmarek
Contributor

I didn't know about force=1, that works as a workaround for me, thanks!

marmarek added a commit to marmarek/openQA that referenced this pull request Apr 22, 2020
Inspired by register_assets_from_settings, check using _asset_find()
instead of a database query. What matters here is not really looking
for the actual file, but that _asset_find() also considers private assets
named after the parent job id.
For this purpose, also query the parent job ids in missing_assets().

Related to os-autoinst#2676
marmarek added a commit to marmarek/openQA that referenced this pull request Apr 23, 2020
When checking for missing assets, also consider private assets uploaded
by the parent job (with names prefixed by the parent id).
For this purpose, factor out _parent_job_ids() from
register_assets_from_settings() and use it in missing_assets() too.

Related to os-autoinst#2676

Co-authored-by: Marius Kittler <mkittler@suse.de>
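
The lookup described in this commit message can be sketched as follows. This is a minimal Python illustration, not the actual Perl implementation: `candidate_asset_names` and this `missing_assets` signature are hypothetical, the set of existing assets stands in for whatever `_asset_find()` checks, and the 8-digit zero-padded prefix is inferred from the `00007628-disk_unencrypted_updated.qcow2` example earlier in the thread.

```python
from typing import Iterable, List, Set, Tuple

Asset = Tuple[str, str]  # (asset type, asset name)


def candidate_asset_names(name: str, parent_job_ids: Iterable[int]) -> List[str]:
    """Return the plain name plus one parent-prefixed variant per parent job."""
    # Assumption: private assets are prefixed with the zero-padded parent job ID.
    return [name] + [f"{job_id:08d}-{name}" for job_id in parent_job_ids]


def missing_assets(required: Iterable[Asset], existing: Set[Asset],
                   parent_job_ids: List[int]) -> List[str]:
    """List required assets that exist under neither their plain nor a prefixed name."""
    missing = []
    for asset_type, name in required:
        candidates = candidate_asset_names(name, parent_job_ids)
        if not any((asset_type, candidate) in existing for candidate in candidates):
            missing.append(f"{asset_type}/{name}")
    return missing
```

Without the parent job IDs, the check would report `hdd/disk_unencrypted_updated.qcow2` as missing even though the prefixed file exists, which is the false alert reported above.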
@AdamWill
Contributor

ok, so that fixes the file name calculation, but is someone planning to fix the incorrect use of is_type_hidden?

@okurz
Member

okurz commented Apr 25, 2020

Not sure what incorrect use you mean. But there is #2978 to enable forced restart over the UI as well, so is this only about getting the internal semantics right, or is there a broken feature behind it that you want to see fixed?

@AdamWill
Contributor

@okurz I'm talking about the discussion we had last week where we determined that this code uses is_type_hidden incorrectly, thinking it means "non-clonable-assets" when it really means "not-to-be-displayed-in-the-web-UI-assets".

AdamWill added a commit to AdamWill/openQA that referenced this pull request Apr 29, 2020
As discussed in
os-autoinst#2676 (comment)
and follow-ups, this check misunderstands what the "hidden" type
means. The code assumes that "hidden" assets are the same thing
as "assets we don't really want to copy down when cloning jobs",
but they are not. The "hidden" attribute was written (by me) to
mean "asset types not to be shown for downloading in the web UI".

For SUSE, it happens to be the case that ["repo"] would be the
right array of both "not-to-be-shown-in-the-web-ui assets" and
"non-clonable assets", so this bug wasn't apparent, as SUSE
deployments leave the `hide_asset_types` config setting at its
default value of just 'repo'. But on Fedora deployments, this
setting is changed to 'repo iso hdd' (because we don't want to
show those asset types for download in the web UI), so this code
also ignored ISO and HDD assets when checking for "missing"
assets, which we don't want.

As discussed in the pull request we could potentially make this
a configurable attribute and have the clone_job script use it
too, but doing that is a bit harder, and I don't think for now
anyone wants a different definition of "non-clonable assets", so
it doesn't seem really necessary.

Signed-off-by: Adam Williamson <awilliam@redhat.com>
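
The difference between the two notions can be made concrete with a small sketch. This is illustrative Python rather than the Perl code of the commit: the function names are hypothetical, and the asset type sets simply mirror the `hide_asset_types` values quoted in the commit message.

```python
from typing import Iterable, List, Set

# Values of the "hide_asset_types" setting as described in the commit message.
HIDDEN_TYPES_DEFAULT = {"repo"}
HIDDEN_TYPES_FEDORA = {"repo", "iso", "hdd"}

# Hardcoded set of asset types that are not expected to be present locally.
NON_CLONABLE_TYPES = {"repo"}


def checked_types_using_hidden(required: Iterable[str], hidden: Set[str]) -> List[str]:
    """Old behaviour: skip every type that is hidden in the web UI."""
    return [asset_type for asset_type in required if asset_type not in hidden]


def checked_types_using_non_clonable(required: Iterable[str]) -> List[str]:
    """New behaviour: skip only the hardcoded non-clonable types."""
    return [asset_type for asset_type in required if asset_type not in NON_CLONABLE_TYPES]
```

With the Fedora setting, the old variant checks nothing at all for a job that needs `iso`, `hdd` and `repo` assets, whereas the hardcoded variant still checks `iso` and `hdd` regardless of what the web UI hides.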
AdamWill added a commit to AdamWill/openQA that referenced this pull request Apr 29, 2020
As discussed in
os-autoinst#2676 (comment)
and follow-ups, this check misunderstands the "hidden" attribute.
The code assumes that "hidden" assets are the same thing as
"assets we don't really want to copy down when cloning jobs",
but they are not. The "hidden" attribute was written (by me) to
mean "asset types not to be shown for downloading in the web UI".

For SUSE, it happens to be the case that ["repo"] would be the
right array of both "not-to-be-shown-in-the-web-ui assets" and
"non-clonable assets", so this bug wasn't apparent, as SUSE
deployments leave the `hide_asset_types` config setting at its
default value of just 'repo'. But on Fedora deployments, this
setting is changed to 'repo iso hdd' (because we don't want to
show those asset types for download in the web UI), so this code
also ignored ISO and HDD assets when checking for "missing"
assets, which we don't want.

As discussed in the pull request we could potentially make this
a configurable attribute and have the clone_job script use it
too, but doing that is a bit harder, and I don't think for now
anyone wants a different definition of "non-clonable assets", so
it doesn't seem really necessary.

Signed-off-by: Adam Williamson <awilliam@redhat.com>
@AdamWill
Contributor

I sent #3017 which just hardcodes 'repo', for now, since at least that's correct. I looked at adding a clonable attribute or something, but ran into the problem of figuring out where to put it exactly (as discussed in the review above, adding another public function to the asset class seems weird).

AdamWill added a commit to AdamWill/openQA that referenced this pull request May 5, 2020
Wabri added a commit to Wabri/openQA that referenced this pull request Mar 14, 2025
The retry commit can now be reverted thanks to the merged PR of Dominik Heidler os-autoinst#2676

This will revert the changes made by the PRs that mitigate the problem: os-autoinst#6244 and os-autoinst#6115

related: https://progress.opensuse.org/issues/175060
Wabri added a commit to Wabri/openQA that referenced this pull request Mar 17, 2025
Wabri added a commit to Wabri/openQA that referenced this pull request Mar 18, 2025
asdil12 pushed a commit to asdil12/openQA that referenced this pull request Apr 1, 2025
Wabri added a commit to Wabri/openQA that referenced this pull request Apr 1, 2025
Wabri added a commit to Wabri/openQA that referenced this pull request Apr 2, 2025
Wabri added a commit to Wabri/openQA that referenced this pull request Apr 3, 2025
Wabri added a commit to Wabri/openQA that referenced this pull request Apr 7, 2025