Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[lit] Echo full RUN lines in case of external shells #65267

Merged

Conversation

jdenny-ornl
Copy link
Collaborator

Before https://reviews.llvm.org/D154984 and https://reviews.llvm.org/D156954, lit reported full RUN lines in a Script: section. Now, in the case of lit's internal shell, it's the execution trace that includes them. However, if lit is configured to use an external shell (e.g., bash, windows cmd), they aren't reported at all.

A fix was requested at the following:

This patch does not correctly address the case when the external shell is windows cmd. As discussed at #65242, it's not clear whether that's a use case that people still care about, and it seems to be generally broken anyway.

commands[j] += f": {shlex.quote(command.lstrip())} >&2 " \
f"&& {command}"
else:
commands[j] += " has no command after substitutions"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a testcase for this codepath? Thanks!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, especially given that I forgot the stderr redirection in that case. I pushed a commit.

Copy link
Contributor

@banach-space banach-space left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thank you!

@mikaelholmen
Copy link
Collaborator

I like this. Thank you!

Copy link
Contributor

@RoboTux RoboTux left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM too.

Before <https://reviews.llvm.org/D154984> and
<https://reviews.llvm.org/D156954>, lit reported full RUN lines in a
`Script:` section.  Now, in the case of lit's internal shell, it's the
execution trace that includes them.  However, if lit is configured to
use an external shell (e.g., bash, windows `cmd`), they aren't
reported at all.

A fix was requested at the following:

* <https://reviews.llvm.org/D154984#4627605>
* <https://discourse.llvm.org/t/rfc-improving-lits-debug-output/72839/35?u=jdenny-ornl>

This patch does not correctly address the case when the external shell
is windows `cmd`.  As discussed at
<llvm#65242>, it's not clear
whether that's a use case that people still care about, and it seems
to be generally broken anyway.
@jdenny-ornl jdenny-ornl force-pushed the lit-external-shell-echo-run-lines branch from 21444d6 to 9191ba7 Compare September 5, 2023 14:48
@jdenny-ornl jdenny-ornl merged commit 19b44c2 into llvm:main Sep 5, 2023
1 of 2 checks passed
@jdenny-ornl
Copy link
Collaborator Author

Thanks for the reviews.

@jsji
Copy link
Member

jsji commented Sep 5, 2023

jdenny-ornl added a commit that referenced this pull request Sep 5, 2023
Buildbots failed after this landed, as reported at:

<#65267 (comment)>

This reverts commit 9191ba7.
@jdenny-ornl
Copy link
Collaborator Author

I pushed a revert (efec733). I cannot tell from the logs what happened.

@jdenny-ornl
Copy link
Collaborator Author

@jsji Thanks for alerting me to the bot fails. The revert seems to have fixed at list one bot. Did you see any additional clue in the logs about what happened? All I see is:

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]

I don't know how to debug that.

@jsji
Copy link
Member

jsji commented Sep 5, 2023

Sorry, no idea about why either.

@dyung
Copy link
Collaborator

dyung commented Sep 6, 2023

It's probably not much help, but this affected one of my build bots, and when I ran check-all manually on the machine, it died with an error "User defined signal 2":

PASS: libFuzzer-x86_64-default-Linux :: recommended-dictionary.test (72990 of 75935)
PASS: libFuzzer-x86_64-default-Linux :: cleanse.test (72991 of 75935)                                                                                                                                                                             
User defined signal 2

@petrhosek
Copy link
Member

We saw the same issue on our bots as well, see for example:

https://luci-milo.appspot.com/ui/p/fuchsia/builders/toolchain.ci/clang-linux-x64/b8770760294040862593/overview

We did some investigation but haven't found the culprit either, but we saw trap faults in the kernel log.

@jdenny-ornl
Copy link
Collaborator Author

jdenny-ornl commented Sep 6, 2023

Thanks for the info, and sorry for all the trouble. I am now able to reproduce on a local system: ninja check-fuzzer kills my terminal window [edit: well, the shell running in it] with signal 12. I'll keep poking at it as I have time. If you see any additional clues, please post them here.

@uweigand
Copy link
Member

uweigand commented Sep 6, 2023

I'm running into this issue on the s390x builder as well. It seems the problem happens in the ./compiler-rt/test/fuzzer/fork-sigusr.test case here:

RUN: %run %t/ForkSIGUSR -fork=3 -ignore_crashes=1 2>%t/log & export PID=$!
RUN: sleep 3
RUN: kill -SIGUSR2 $PID
RUN: sleep 6
RUN: cat %t/log | FileCheck %s --dump-input=fail

Normally, the PID variable is set to the pid of the ForkSIGUSR process, which then gets killed with SIGUSR2 and terminates cleanly.

However, with this patch applied, the PID variable get set to the pid of a shell process executing

  echo "..." && ../ForkSIGUSR ...

The shell eats the SIGUSR2 and prints a message, but the ForkSIGUSR process never gets the signal and keeps running indefinitely.

This somehow causes the buildbot connection to terminate - not sure exactly why but possibly some bad reaction to that shell getting the signal?

Unfortunately whenever I re-start the build bot, it thinks it need to re-try the previous build since that had the failed connection - but that previous build still has this patch applied and therefore terminates again :-( I'm trying to work around this now.

@jdenny-ornl
Copy link
Collaborator Author

@uweigand Thanks! That was the clue I needed.

I found there are four tests that have a similar pattern and cause trouble on my local system:

  • fork-sigusr.test: Seemed to hang, but I never saw it kill the shell that was running check-fuzzer. That appears to be a different result than @uweigand saw, so maybe there's something non-deterministic, or maybe our shells are different.
  • merge-sigusr.test: Killed the shell.
  • sigint.test: Seemed to hang.
  • sigusr.test: Killed the shell.

If I disable those tests, check-fuzzer behaves as at the parent commit.

I have a simple lit fix I've pushed to this branch. It also makes check-fuzzer behave as at the parent commit.

I'm hesitant to land the new version as last time was a bit of a disaster. Based on @uweigand's comments, some bots are still recovering. Sorry! Who knows what other trouble might be lurking on the next attempt.

Any advice on how to proceed?

@jdenny-ornl
Copy link
Collaborator Author

Here's the fix: jdenny-ornl@61e272e

Unfortunately, I made the mistake of pushing the revert directly to git (old workflow) instead of clicking the revert button here in the PR. I'm not sure how to reopen the PR now. I can start a new PR if that's desirable.

@dyung
Copy link
Collaborator

dyung commented Sep 7, 2023

Unfortunately whenever I re-start the build bot, it thinks it need to re-try the previous build since that had the failed connection - but that previous build still has this patch applied and therefore terminates again :-( I'm trying to work around this now.

@uweigand if you are still trying to recover your buildbot, this is the process I used which I was able to recover mine. Thanks to the info everyone found here, it seems there are 4 affected tests fork-sigusr.test, merge-sigusr.test, sigint.test and sigusr.test. With the worker offline, I manually edited the files to add a , * to the UNSUPPORTED line to get the run to skip the tests so it wouldn't fail. I then stashed the change and brought the worker back online. When the job rerun had finished the sync and was doing the build, I went to the local git repo and unstashed the test changes and then let the build/test complete. Then before the next run is executed, just revert the local test changes. The tests were not exactly skipped (they ended up being "UNRESOLVED" because of the new syntax required for the UNSUPPORTED lines), but the net result was the run completed and moved on to the next job which then passed. Hope this helps you to recover your bot.

"Completed" job: https://lab.llvm.org/buildbot/#/builders/247/builds/8680

@uweigand
Copy link
Member

uweigand commented Sep 7, 2023

Thanks for the tips. I've managed to get the bot going again by clearing the build queues in the web interface, so it would reset to current mainline. All s390x bots are now green again.

jdenny-ornl added a commit that referenced this pull request Sep 7, 2023
avillega pushed a commit to avillega/llvm-project that referenced this pull request Sep 11, 2023
Before <https://reviews.llvm.org/D154984> and
<https://reviews.llvm.org/D156954>, lit reported full RUN lines in a
`Script:` section. Now, in the case of lit's internal shell, it's the
execution trace that includes them. However, if lit is configured to use
an external shell (e.g., bash, windows `cmd`), they aren't reported at
all.

A fix was requested at the following:

* <https://reviews.llvm.org/D154984#4627605>
*
<https://discourse.llvm.org/t/rfc-improving-lits-debug-output/72839/35?u=jdenny-ornl>

This patch does not correctly address the case when the external shell
is windows `cmd`. As discussed at
<llvm#65242>, it's not clear
whether that's a use case that people still care about, and it seems to
be generally broken anyway.
avillega pushed a commit to avillega/llvm-project that referenced this pull request Sep 11, 2023
Buildbots failed after this landed, as reported at:

<llvm#65267 (comment)>

This reverts commit 9191ba7.
avillega pushed a commit to avillega/llvm-project that referenced this pull request Sep 11, 2023
@jdenny-ornl
Copy link
Collaborator Author

I'm trying to figure out how to resurrect this PR after rebasing onto main and applying fixes. It will need additional review. Is it best to just to create a new PR, or is there some way to reopen this one?

@jdenny-ornl
Copy link
Collaborator Author

Based on the information I found, it's not possible to reopen this PR, so I created a new one: #66408

qihangkong pushed a commit to rvgpu/llvm that referenced this pull request Apr 18, 2024
Buildbots failed after this landed, as reported at:

<llvm/llvm-project#65267 (comment)>

This reverts commit 9191ba7.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants