openqa-investigate: Provide support for multi-machine scenarios #170

Merged: 4 commits into os-autoinst:master from parallel-2 on Jul 11, 2022

@Martchus Martchus commented Jun 24, 2022

Latest commit: Sync investigation of parallel clusters via openQA comment

  • Instead of only considering parallel parents, just do the investigation
    for any job with parallel dependencies
    • Avoid having to run the investigation script for all job results
    • Sync via an openQA comment instead (to avoid running the same
      investigation twice; abort if a concurrent job already does the
      investigation of the cluster)
    • Use --max-depth 0 to clone all jobs in the parallel cluster,
      regardless of whether we're starting from a parallel parent or child
      • Has no effect on other dependency types since we're
        • using --skip-chained-deps anyway
        • not using --clone-children
        • still excluding directly chained dependencies
  • Write an investigation comment on the job we're actually investigating
    and on the first job in the cluster (for the synchronization)
  • See https://progress.opensuse.org/issues/95783#note-58
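
To illustrate the cluster handling described in the list above, here is a minimal Python sketch. It is not the script's actual code. It assumes a hypothetical in-memory mapping `parallel_links` (job ID to directly linked parallel job IDs), and it assumes that "the first job in the cluster" means the job with the lowest ID; neither detail is confirmed by this PR.

```python
from collections import deque


def parallel_cluster(start_job, parallel_links):
    """Return the sorted IDs of all jobs reachable via parallel links.

    parallel_links is a hypothetical dict: job ID -> iterable of job IDs
    that are directly linked as parallel parent or child.
    """
    # Make the links symmetric so the traversal works the same way
    # whether we start from a parallel parent or a parallel child.
    neighbours = {}
    for job, linked in parallel_links.items():
        for other in linked:
            neighbours.setdefault(job, set()).add(other)
            neighbours.setdefault(other, set()).add(job)

    seen = {start_job}
    queue = deque([start_job])
    while queue:
        job = queue.popleft()
        for other in neighbours.get(job, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return sorted(seen)


def sync_job(start_job, parallel_links):
    # Assumption: the "first job in the cluster" is the one with the lowest ID.
    return parallel_cluster(start_job, parallel_links)[0]
```

Starting from either job of a two-job cluster yields the same cluster and the same synchronization job, which is the property the description relies on.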

@Martchus Martchus force-pushed the parallel-2 branch 2 times, most recently from ef4eec2 to 6d73d5b Compare July 4, 2022 15:58
@Martchus Martchus marked this pull request as ready for review July 5, 2022 12:21

Martchus commented Jul 6, 2022

Pushed a new commit. It is still WIP and I need to adjust tests but feedback is appreciated. See the commit message for details.

@Martchus Martchus force-pushed the parallel-2 branch 2 times, most recently from 904b536 to 37c0869 Compare July 7, 2022 14:03

Martchus commented Jul 7, 2022

It works in production on a simple case, see https://openqa.opensuse.org/tests/2456131#comments. It also skips it correctly on the 2nd run:

$ echo 'https://openqa.opensuse.org/tests/2456131' | env exclude_group_regex='foobar'  ./openqa-investigate 
{"id":285792}
$ echo 'https://openqa.opensuse.org/tests/2456131' | env exclude_group_regex='foobar'  ./openqa-investigate 
Skipping investigation of job 2456131: job cluster is already being investigated, see comment on job 2456131

It also worked on a failed parallel child, see https://openqa.opensuse.org/tests/2456116#comments. That restarted the cluster correctly (no chained parents, just the cluster as expected). It also created the comment on the first job in the cluster and edited it later, see https://openqa.opensuse.org/tests/2456115#comments. A second run was also skipped as expected, as well as a run on the parent:

$ echo 'https://openqa.opensuse.org/tests/2456116' | env dry_run= exclude_group_regex='foobar'  ./openqa-investigate 
Cloning parents of opensuse-Tumbleweed-DVD-aarch64-Build20220706-salt-minion@aarch64
Cloning children of opensuse-Tumbleweed-DVD-aarch64-Build20220706-salt-master@aarch64
Cloning parents of opensuse-Tumbleweed-DVD-aarch64-Build20220706-salt-minion@aarch64
Cloning children of opensuse-Tumbleweed-DVD-aarch64-Build20220706-salt-master@aarch64
Cloning parents of opensuse-Tumbleweed-DVD-aarch64-Build20220628-salt-minion@aarch64
Cloning children of opensuse-Tumbleweed-DVD-aarch64-Build20220628-salt-master@aarch64
Cloning parents of opensuse-Tumbleweed-DVD-aarch64-Build20220628-salt-minion@aarch64
Cloning children of opensuse-Tumbleweed-DVD-aarch64-Build20220628-salt-master@aarch64
{"id":285794}
{"id":285795}
$ echo 'https://openqa.opensuse.org/tests/2456116' | env dry_run= exclude_group_regex='foobar'  ./openqa-investigate 
Skipping investigation of job 2456116: job cluster is already being investigated, see comment on job 2456115
$ echo 'https://openqa.opensuse.org/tests/2456115' | env dry_run= exclude_group_regex='foobar'  ./openqa-investigate 
Skipping investigation of job 2456115: job cluster is already being investigated, see comment on job 2456115

(I've cancelled all investigation jobs again manually.)
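
The skip behaviour shown in the transcript above can be modelled roughly as follows. This is a sketch under assumptions: comments are kept in a plain dict instead of being read and written via the openQA API, and the marker text `Starting investigation for job` is hypothetical. Only the wording of the skip message is taken from the output above.

```python
MARKER = "Starting investigation for job"  # hypothetical marker text


def claim_investigation(job_id, first_job_id, comments_by_job):
    """Try to claim the investigation of a cluster.

    comments_by_job is a hypothetical dict: job ID -> list of comment texts.
    Returns (claimed, message); message is the skip notice if not claimed.
    """
    existing = comments_by_job.setdefault(first_job_id, [])
    if any(text.startswith(MARKER) for text in existing):
        return False, (
            f"Skipping investigation of job {job_id}: job cluster is already "
            f"being investigated, see comment on job {first_job_id}"
        )
    # No marker yet: leave one on the first job so later runs skip.
    existing.append(f"{MARKER} {job_id}")
    return True, ""
```

Note that checking and then writing is atomic in this in-memory model only; against a real server two concurrent runs could both pass the check before either comment is written.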


I could still extend the unit tests to cover everything (not just the individual functions).


@okurz okurz left a comment

It works in production on a simple case, see https://openqa.opensuse.org/tests/2456131#comments. It also skips it correctly on the 2nd run:

$ echo 'https://openqa.opensuse.org/tests/2456131' | env exclude_group_regex='foobar'  ./openqa-investigate 
{"id":285792}
$ echo 'https://openqa.opensuse.org/tests/2456131' | env exclude_group_regex='foobar'  ./openqa-investigate 
Skipping investigation of job 2456131: job cluster is already being investigated, see comment on job 2456131

It also worked on a failed parallel child, see https://openqa.opensuse.org/tests/2456116#comments. That restarted the cluster correctly (no chained parents, just the cluster as expected). It also created the comment on the first job in the cluster and edited it later, see https://openqa.opensuse.org/tests/2456115#comments.

Looking at https://openqa.opensuse.org/tests/2456116#comment-285795 the first job pair is:

  • salt-minion:investigate:retry: https://openqa.opensuse.org/t2457333 https://openqa.opensuse.org/t2457334
    Of those two, the second is correctly named "opensuse-Tumbleweed-DVD-aarch64-salt-minion:investigate:retry@aarch64" and is not within any job group, but the first still has the original name "opensuse-Tumbleweed-DVD-aarch64-Build20220706-salt-master@aarch64" and is within the original job group and build. That should be avoided as we don't want to pollute production builds.
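
A check for this kind of pollution could look like the sketch below. It assumes each cloned job is available as a dict with `name` and `group_id` keys; these field names and the group ID used in the usage example are assumptions for illustration, not something this thread confirms.

```python
def polluting_clones(jobs):
    """Return the names of clones that would still appear in a production build.

    A clone counts as clean only if its name carries the ":investigate:"
    marker and it is not assigned to any job group.
    """
    return [
        job["name"]
        for job in jobs
        if ":investigate:" not in job["name"] or job.get("group_id") is not None
    ]
```

For the pair discussed above, only the salt-master clone would be reported:

```python
jobs = [
    # group_id 1 is a made-up value standing in for "the original job group"
    {"name": "opensuse-Tumbleweed-DVD-aarch64-Build20220706-salt-master@aarch64", "group_id": 1},
    {"name": "opensuse-Tumbleweed-DVD-aarch64-salt-minion:investigate:retry@aarch64", "group_id": None},
]
print(polluting_clones(jobs))
```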


Martchus commented Jul 8, 2022

I suppose I need to enable parental inheritance for that. However, then we have the same problem as before with the restart approach: we cannot have job-specific settings. For TEST we could use `+=`, and if OPENQA_INVESTIGATE_ORIGIN is always the same we can likely live with that. However, for CASEDIR that doesn't necessarily make sense.


Martchus commented Jul 8, 2022

I suppose we can also live with using the same CASEDIR everywhere. So I've added another commit to apply the job settings for the whole cluster, see its commit message.

I also moved the previously created jobs out of the group/build.
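
To make the trade-off around cluster-wide settings concrete, the sketch below assembles a clone command line. `--skip-chained-deps` and `--max-depth 0` come from the PR description; the flag name `--parental-inheritance` and the exact shape of the settings are assumptions about how this would be passed to `openqa-clone-job`, and the helper `clone_args` itself is hypothetical.

```python
def clone_args(job_url, suffix, extra_settings=()):
    """Build an argument list for cloning a whole parallel cluster.

    Every setting is applied to all jobs in the cluster, so only values
    that make sense for each job may be passed.
    """
    args = [
        "openqa-clone-job",
        "--skip-chained-deps",
        "--max-depth", "0",
        "--parental-inheritance",  # assumption: applies settings to parents too
        job_url,
        # "+=" appends to each job's own TEST value instead of replacing it,
        # so one argument yields a distinct name per job in the cluster.
        f"TEST+=:investigate:{suffix}",
        # The same origin for every job in the cluster.
        f"OPENQA_INVESTIGATE_ORIGIN={job_url}",
    ]
    args.extend(extra_settings)
    return args
```

A plain assignment such as `CASEDIR=...` passed via `extra_settings` would give every job in the cluster the same value, which is the limitation accepted in the comment above.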

When checking `openqa.ini` on OSD I noticed that the hook scripts for
incomplete jobs and timeouts only invoke openqa-label-known-issues.
So for consistency with when we currently trigger the investigation, I make
the investigate script skip anything but failures.
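
Reading this commit message as "only failed jobs are investigated", the filter amounts to the following sketch. The result strings other than `failed` are assumed openQA result names used for illustration, and `should_investigate` is a hypothetical helper, not a function from the script.

```python
def should_investigate(result):
    """Return True only for jobs whose result is a plain failure."""
    return result == "failed"


# Illustrative results; only the failure triggers an investigation.
for result in ("failed", "incomplete", "timeout_exceeded", "passed"):
    print(result, should_investigate(result))
```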
* Instead of only considering parallel parents, just do the investigation
  for any job with parallel dependencies
    * Avoid having to run the investigation script for all job results
    * Sync via an openQA comment instead (to avoid running the same
      investigation twice; abort if a concurrent job already does the
      investigation of the cluster)
    * Use `--max-depth 0` to clone all jobs in the parallel cluster,
      regardless of whether we're starting from a parallel parent or child
        * Has no effect on other dependency types since we're
            * using `--skip-chained-deps` anyway
            * *not* using `--clone-children`
            * still excluding directly chained dependencies
* Write an investigation comment on the job we're actually investigating
  and on the first job in the cluster (for the synchronization)
* See https://progress.opensuse.org/issues/95783#note-58

okurz commented Jul 10, 2022

I suppose we can also live with using the same CASEDIR everywhere. So I've added another commit to apply the job settings for the whole cluster, see its commit message.

I also moved the previously created jobs out of the group/build.

Can you trigger a new run on o3 with openqa-investigate and investigation jobs please?

@Martchus

I did, see the second comment on https://openqa.opensuse.org/tests/2456116#comment-286104. (To be able to trigger it I temporarily edited the first comment.)

@mergify mergify bot merged commit f8dbc5e into os-autoinst:master Jul 11, 2022
@Martchus Martchus deleted the parallel-2 branch July 11, 2022 09:52