
[RFC] Fix some failing tests. #4718

Merged 3 commits into neovim:master on May 9, 2016

Conversation

@jszakmeister (Contributor) commented May 7, 2016

There are a couple of issues here causing the tests to hang because nvim is prompting for the user to hit enter. It'd really be nice to find a more robust way of dealing with this, and for folks to stop ignoring it when the tests hang on QB but not Travis. :-( I also found that one tempfile_spec test was failing periodically because of cruft left behind by other tests. I fixed that too.

I believe this fixes #4637 too.

With longer directory paths, the test can hang waiting for the user to hit
enter to continue.  Let's use the :silent prefix to avoid this.
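
A rough sketch of what that looks like in a busted-style spec (the helper names and require path here are assumptions, not the exact code from this PR):

-- Assumed functional-test helpers; names may differ across versions.
local helpers = require('test.functional.helpers')
local clear, execute = helpers.clear, helpers.execute

it('opens a file under a long path without prompting', function()
  clear()
  -- :silent suppresses the hit-enter prompt that a long path would
  -- otherwise trigger, hanging the test under CI.
  execute('silent edit ' .. ('d/'):rep(40) .. 'file.txt')
end)
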
While trying to debug an issue, I discovered that the tests for illegal
arguments depended on the prior suite having run and started a session.
Let's remove that unintentional dependency by starting our own session
before each test.
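
Sketched out, the fix looks roughly like this (clear is the standard helper that starts a fresh nvim session; the require path is assumed from the repo's test layout at the time):

local helpers = require('test.functional.helpers')
local clear = helpers.clear

describe('illegal arguments', function()
  -- Start our own session for every test instead of relying on one
  -- left running by whichever suite happened to execute earlier.
  before_each(clear)

  it('rejects the bad argument', function()
    -- the test body now owns its session
  end)
end)
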
It's possible for the first test to encounter a temp directory with
leftover files from a previous test, causing it to fail.  Let's clean up
both before and after each test to make sure the temp area stays
pristine.
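
As a sketch (cleartempdir is a hypothetical helper that removes everything under the suite's temp directory, not necessarily what this PR adds):

describe('tempfile', function()
  -- Clean both before and after: "before" protects this suite from
  -- cruft left by earlier tests, "after" protects whoever runs next.
  before_each(cleartempdir)
  after_each(cleartempdir)

  it('starts with an empty temp area', function()
    -- the "directory is empty" assertion now holds regardless of
    -- what ran before this suite
  end)
end)
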
@KillTheMule (Contributor)

I agree there's no use in a build if the results aren't resolved, but I've found the QB logs very hard to read, which makes it pretty much impossible for me to debug these issues. Could you point out some documentation that I could read up on? Or maybe write up a sketch of how you went about it, so there are some footsteps to follow.

@jszakmeister (Contributor, Author)

but I've found the QB logs very hard to read, which makes it pretty much impossible for me to debug these issues

Can you be more specific?

I go to the step status:

[screenshot: step status page]

And then click on the little document icon next to the test run to see the logs associated with it:

[screenshot: log icon next to the test run]

There is also the log link on the side panel where you can see the log for the entire run.

For issues where the test is hanging, there's nothing better or worse on QB than what you get on Travis: stuff hangs and the test runner is terminated. The step is marked as cancelled (because you exceeded the time limit):

[screenshot: step marked as cancelled]

And it looks this way in the logs for that step:

13:05:48,017 INFO  - Terminating launched command gracefully...
13:05:48,019 DEBUG - Executing command: sh /usr/home/quickbuild/buildagent/framework/configuration/org.eclipse.osgi/com/pmease/quickbuild/execution/kill-tree.sh 48872
13:05:48,019 DEBUG - Command working directory: /usr/home/quickbuild/buildagent/bin
13:05:48,030 WARN  - Vim: Caught deadly signal 'SIGTERM'
13:05:48,030 WARN  -
13:05:48,032 WARN  - Vim: Finished.
13:05:48,034 WARN  - Terminated
13:05:48,036 INFO  - ok 358 - string() function used to represent lists dumps nested
13:05:48,036 INFO  - Executing post-execute action...
13:05:48,037 ERROR - Step 'master>buildall>build-node?testNode=freebsd-64>build-and-run-tests>build-and-run-tests-parameterized?buildType=Debug>run-tests>run-functional-tests' is failed.
    java.lang.RuntimeException: java.lang.InterruptedException
        at com.pmease.quickbuild.execution.Commandline.execute(Commandline.java:372)
        at com.pmease.quickbuild.plugin.basis.CommandBuildStep.run(CommandBuildStep.java:215)
        at com.pmease.quickbuild.plugin.basis.CommandBuildStep$$EnhancerByCGLIB$$7751e9d.CGLIB$run$0(<generated>)
        at com.pmease.quickbuild.plugin.basis.CommandBuildStep$$EnhancerByCGLIB$$7751e9d$$FastClassByCGLIB$$ec8c8f91.invoke(<generated>)
        at net.sf.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:215)
        at com.pmease.quickbuild.DefaultScriptEngine$Interpolator.intercept(DefaultScriptEngine.java:261)
        at com.pmease.quickbuild.plugin.basis.CommandBuildStep$$EnhancerByCGLIB$$7751e9d.run(<generated>)
        at com.pmease.quickbuild.stepsupport.Step.execute(Step.java:548)
        at com.pmease.quickbuild.stepsupport.StepExecutionJob.executeStepAwareJob(StepExecutionJob.java:31)
        at com.pmease.quickbuild.stepsupport.StepAwareJob.executeBuildAwareJob(StepAwareJob.java:45)
        at com.pmease.quickbuild.BuildAwareJob.execute(BuildAwareJob.java:60)
        at com.pmease.quickbuild.grid.GridJob.run(GridJob.java:106)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.InterruptedException
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at java.lang.UNIXProcess.waitFor(UNIXProcess.java:401)
        at com.pmease.quickbuild.execution.Commandline.execute(Commandline.java:350)
        ... 16 more

The messages could be better here, but what is happening is that we exceeded the time limit and QB has come behind us to kill the running process and try to clean up.

Things look a little bit different when it's simply a failed test:

[screenshot: failed publish step]

You can see the publish step is marked as failed. This is generally not because it failed to publish, as you might think; it's because there were failed tests found in the report. Occasionally, it's because the test report is incomplete, which usually means something died a horrible death and failed to emit the rest of the report, so it's invalid.

If you have a failed test, I think it's easiest to go to the step status page and look through the logs for the test run on that platform. But you can also see the results on the overview page and dig into the failure there. The nice thing about using the TAP report here is that what you see in the logs is likely what you'll see on the test report page:

[screenshot: build overview with test report link]

Click on "View Report", and you'll see:

[screenshot: test report summary]

Click the "errors" in "1 errors" that is part of the test run summary and you get this:

[screenshot: list of failed tests]

Click on the little document, and you get the detailed test results for the test:

[screenshot: detailed test results for the failing test]

Which is exactly what you'd find in the log file too.

So there are several ways to go about it, but in the end I think you get the same information you do from Travis. shrug

There are also several reasons why things fail on QB and not Travis, with the vast majority of the issues being path length. The build is done under a path like /home/user/buildagent/workspace/root/neovim/master (vs /home/user/neovim), so you bump up against instances where what is being done causes Neovim to prompt under QB and not Travis. I consider this a feature of the QB tests: as a user/developer, I'd expect the tests to run successfully when checking out the project, even if my path causes something in a test to exceed 80 characters.

Another reason why things fail under QB generally has to do with ordering. Sometimes the test order is just different (busted doesn't necessarily sort; it picks up things in whatever order the directory listing API gives them). Also, because two other platforms are involved, you'll run into cases where things like enumerating dictionaries or tables lead to different results--but we shouldn't be counting on them being in a specific order anyway (see the sketch below).
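
As a concrete sketch: Lua's pairs() makes no ordering guarantee, so an order-independent assertion sorts first instead of expecting a fixed enumeration sequence (assert.same is busted's deep-equality assertion):

-- Collect and sort the (string) keys of a table so tests don't depend
-- on enumeration order, which varies across platforms and runs.
local function sorted_keys(t)
  local keys = {}
  for k in pairs(t) do
    keys[#keys + 1] = k
  end
  table.sort(keys)
  return keys
end

-- assert.same({'a', 'b', 'c'}, sorted_keys(result)) is stable;
-- asserting on raw enumeration order is not.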

Occasionally, the QB build does fail for some other reason. The most common of these is that a failed test run has generated a several-gigabyte nvimlog file and filled the partition. At that point, I generally have to go in and fix something up, though I think I've got that issue taken care of with some extra steps in QB that always run. I have had power events cause problems too, but those kinds of issues have been few and far between.

I don't know if that helps, and this information almost certainly belongs somewhere else. I, for one, don't believe QB's interface is so obtuse that you can't get to the information. Nearly every time I've looked into these issues, I've found a bug in either the failing test or Neovim itself.

@jszakmeister jszakmeister changed the title Fix some failing tests. [RFC] Fix some failing tests. May 7, 2016
@marvim marvim added the RFC label May 7, 2016
@KillTheMule (Contributor)

Wow, yeah, that helps a lot, thank you! I agree with you about the long-path issue and everything that stems from different OSes/distributions.

It was just easy for me on Travis to find the info I needed, but with QB I somehow never managed to click the right buttons, so yeah, thanks again. It should be on the wiki, imho; if you want, I'll put it there when I'm back home tomorrow.

@KillTheMule (Contributor) commented May 8, 2016

Ok, so I see some QB failures on my PRs and I'm trying to debug them. Only, it's not happening locally, so I need to modify the tests, change the PR, and wait for the QB output. Is there a way to get some output from the tests, even if they don't fail? I'd use the good ol' "PRINT" strategy, but is that captured somewhere? I see I can browse the file system; would a good strategy be to write to a file that I can inspect after the run?

(e) I see that this PR supposedly fixes what I'm seeing 👍 Still, it would be good to know how to proceed if something like this happens again.

@jszakmeister (Contributor, Author)

@KillTheMule What would you do if Travis were failing? Many of the same strategies work here too. Printing from the Lua tests is a fine strategy; you'll just have to look in the logs for your output (on the run-functional-tests or run-unittests step). Printing to stderr would probably be best. You can also look at the entire log (by clicking the Log link on the sidebar). It's pretty much what you would see from Travis. A minimal sketch of the stderr approach is below.
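
For instance (a minimal sketch; it is busted's test function, and the messages land in the step log even when the test passes):

it('reaches both checkpoints', function()
  -- Write to stderr so the output shows up in the CI logs regardless
  -- of whether the assertions pass or fail.
  io.stderr:write('checkpoint A reached\n')
  -- ... test body ...
  io.stderr:write('checkpoint B reached\n')
end)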

You can output to a file too, though I don't know if anonymous people have the ability to browse the workspace.

And yes, this PR is meant to address the current failures on master. :-)

@KillTheMule (Contributor)

@jszakmeister Thanks :) Well, I don't know what I would do about Travis; I've not yet seen a failure there that I'd investigate non-locally. I'm just getting the knack of CI.... I've just seen that locally "print" does nothing for me. I can use printing to stderr for failing tests, but it would not be printed on hanging tests, so I assumed that's the same for QB/Travis.

@jszakmeister (Contributor, Author)

I'm just getting the knack of CI.... I've just seen that locally "print" does nothing for me. I can use printing to stderr for failing tests, but it would not be printed on hanging tests, so I assumed that's the same for QB/Travis.

You know, there is a small difference between QB and Travis: I run the tests explicitly in QB, not through make tests, so the output is visible in the logs. IOW, printing works in QB but not Travis. It's not so much a function of the CI, though; it's more a function of how we have to run things. CMake doesn't have the built-in ability to tee the output, and we need to remain cross-platform. :-( But I really wanted the ability to see things as they happen in QB, so I did it a little differently. Also, in QB, on the build page there is an "overview" link on the sidebar. If you click on that, at the top of the page are artifacts collected by the build. Those artifacts include the output of test runs, so you can examine them locally if the builds fail.

@jszakmeister (Contributor, Author)

My only concern with this PR is 69d1bc1. I'm not sure if clearing the directory before the test is enough, or if we should add the expectation that the named socket might be there, or if we should stop asserting that the directory is empty altogether. A rough sketch of the second option follows. Thoughts?
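
Something like this, say (list_dir is a hypothetical helper returning the directory's entries, and the socket-name pattern is an assumption, not necessarily what the suite uses):

-- Filter out the session's named socket so the "directory is empty"
-- assertion tolerates it instead of failing.
local function entries_except_socket(dir)
  local rest = {}
  for _, name in ipairs(list_dir(dir)) do
    if not name:match('^nvim') then  -- assumed socket-name prefix
      rest[#rest + 1] = name
    end
  end
  return rest
end

-- assert.same({}, entries_except_socket(tempdir)) then passes even
-- while the session's socket still exists.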

@justinmk (Member) commented May 9, 2016

Cleaning the directory before each test is reasonable. Thanks for this!

@justinmk justinmk merged commit 31373e4 into neovim:master May 9, 2016
@justinmk justinmk removed the RFC label May 9, 2016
@justinmk (Member) commented May 9, 2016

@jszakmeister Thanks for those detailed steps, just knowing that "step status" is the best place to start removes a lot of confusion. I've added a blurb at https://github.com/neovim/neovim/wiki/Contributing#quickbuild-failures and I will incorporate your detailed notes in a more significant doc update I'm working on.

By the way, is it possible to change the "default" name of the build to something like "quickbuild"? This helps with communication about the various builds we have.

@jszakmeister (Contributor, Author)

By the way, is it possible to change the "default" name of the build to something like "quickbuild"? This helps with communication about the various builds we have.

Unfortunately, no, it is not possible at the moment. I filed a ticket with PMEase several months ago, but they aren't ready to update the required libraries and expose that particular knob yet. I've worked with these guys for 10+ years, so I'm sure they'll get to it eventually.
