Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible confusing logging on ProxyOperationManager.cs #2379

Closed
MarcoRossignoli opened this issue Mar 29, 2020 · 14 comments · Fixed by #3161
Closed

Possible confusing logging on ProxyOperationManager.cs #2379

MarcoRossignoli opened this issue Mar 29, 2020 · 14 comments · Fixed by #3161

Comments

@MarcoRossignoli
Copy link
Contributor

MarcoRossignoli commented Mar 29, 2020

Usually when we've some issue with coverlet msbuild or .NET tool driver we enable logging to understand if vstest plat killed process after famous 100ms.
BTW seem that logging is wrong and also in case of clean shutdown timeout log is emitted

public virtual void Close()
{
try
{
// do not send message if host did not launch
if (this.testHostLaunched)
{
this.RequestSender.EndSession();
// We want to give test host a chance to safely close.
// The upper bound for wait should be 100ms.
var timeout = 100;
EqtTrace.Verbose("ProxyOperationManager.Close: waiting for test host to exit for {0} ms", timeout);
this.testHostExited.Wait(timeout);
}
}
catch (Exception ex)
{
// Error in sending an end session is not necessarily a failure. Discovery and execution should be already
// complete at this time.
EqtTrace.Warning("ProxyOperationManager: Failed to end session: " + ex);
}
finally
{
this.initialized = false;
EqtTrace.Warning("ProxyOperationManager: Timed out waiting for test host to exit. Will terminate process.");
// please clean up test host.
this.testHostManager.CleanTestHostAsync(CancellationToken.None).Wait();
this.testHostManager.HostExited -= this.TestHostManagerHostExited;
this.testHostManager.HostLaunched -= this.TestHostManagerHostLaunched;
}
}

I think that log emitted in line 228 should be emitted only if testHostExited event was not signaled.

@nohwnd
Copy link
Member

nohwnd commented Mar 30, 2020

Why famous 100ms are there other issues connected to this?

@MarcoRossignoli
Copy link
Contributor Author

The main issue to move to collectors from msbuild/.net tool version of coverlet is that issue https://github.com/tonerdo/coverlet/blob/master/Documentation/KnownIssues.md#1-vstest-stops-process-execution-earlydotnet-test

@nohwnd
Copy link
Member

nohwnd commented Mar 30, 2020

ah gotcha, it only applies to .net fmw. :)

Do you want to PR this change?

@MarcoRossignoli
Copy link
Contributor Author

MarcoRossignoli commented Mar 30, 2020

ah gotcha, it only applies to .net fmw. :)

What do you mean?The issue happens also on .net core afaik.

@MarcoRossignoli
Copy link
Contributor Author

@nohwnd @AbhitejJohn I have also another proposal, can we increase the timeout to 30 seconds?
I mean if that timeout is here to avoid that somebody freeze host process(legit) why don't use a more "standard" timeout and give a realistic time to allow to host process(usually user test classes, think for instance a class subscribed to process exit that cleanup something and needs more than 100ms) to cleanup?
100ms is very few time, 30 second is standard time if process will be blocked for more that 30 seconds likely there is something wrong with it, but the timeout protection will work.

@nohwnd
Copy link
Member

nohwnd commented May 4, 2020

@jakubch1 could you take this on? I am not that aware of what are the possibilities around coverlet and in-proc out-proc data collectors. I don't think increasing the timeout to 30s is the right way to go. But maybe to 1s (or max 3s) would mitigate the issue a bit, and would not add too much teardown time. I thought this was solved for dotnet core based on the link above, but I probably misunderstood it.

@jakubch1
Copy link
Member

jakubch1 commented May 4, 2020

@MarcoRossignoli did you try to use MemoryMappedFile for msbuild case? We discussed this already. This is the best approach here and by this we don't need to increase timeout.

@MarcoRossignoli
Copy link
Contributor Author

MarcoRossignoli commented May 4, 2020

Not yet, discussed yesterday with @tonerdo, the work will be huge because I think we cannot use the default memory mapped file class provided with .net runtime but to support System.Private.Corelib.dll coverage we need to re-wrap all memory related API for every SO, as we cannot have access to that object for this specific dll.
Take a look here coverlet-coverage/coverlet#808 (comment)

We put it on roadmap on low priority for now https://github.com/coverlet-coverage/coverlet/blob/master/Documentation/Roadmap.md

@KirillOsenkov
Copy link
Member

I just wasted four days and found all sorts of issues just because of this 100ms timeout.

@nohwnd I want to dig a bit deeper into your reservations around increasing the timeout to 30 seconds. In the success case, when the process finishes quickly enough, no problem, we're not waiting long.

But suppose the process is doing some expensive teardown, or there's just a fluke (like hard drive/antivirus delay, or random noise from other processes on the machine). Suppose the short (100ms, or 1s) timeout expires. Let's look carefully at what happens. We kill the process. The message that we killed the process is logged into --diag file, which is off by default. The user-facing log only displays "Test host process crashed" with no further information. Suppose I even go and manage to collect a dump (which seems like an absolutely impossible endeavour at the moment). The dump would just show that the process was randomly interrupted, and not have any exception or any clue that would help. So users would be left clueless, wasting their time, living with flaky CI. I'm actually investigating this for a partner team who have been living with flaky CI, about half their CI builds fail randomly, and the poor people are just used to requeuing and hoping it works next time. Nobody knows how to investigate this. Honestly, in light of days or weeks of lost productivity of people worldwide, flaky CI builds, saving on this 100ms timeout looks really strange from a business perspective.

I see absolutely no harm about increasing the timeout to 30 seconds. It will make the worst case successful, and won't impact the regular success case. Remember that books are written about timeouts in computers, and how you can never rely on 100ms actually being 100ms. Your entire process may be paused by the OS for any amount of time randomly. Heck, a random GC can easily create a timeout longer than that.

Additionally, this bug should track better user-facing logging in case that timeout does expire. See related:
#2952

It should be immediately obvious to the user that what happened isn't that a particular test failed, but we timed out waiting for the host process to terminate.

Actually the best thing would be to use procdump in that case after the timeout, forcibly grab a dump of the process and make sure it's published as an AzDO artifact, so users can see what the process was doing that caused a 30 seconds timeout.

At the risk of repeating myself, I can't underscore enough how important it is that we make this set of improvements to the dotnet test experience on CI. It will result in more stable pipelines, remove the flakiness and noise and make failures easier to diagnose. It's hard to measure the amount of pain the current unfortunate constellation of defects is causing to engineers using dotnet test in CI.

@nohwnd if you feel like any of this work doesn't fit in your current schedule, please ping me and I will ensure this gets escalated to whatever level is necessary to prioritize this as an absolute top priority.

@KirillOsenkov
Copy link
Member

I also think that this line of code will be run if the process has exited before the timeout, which seems wrong:

EqtTrace.Warning("ProxyOperationManager: Timed out waiting for test host to exit. Will terminate process.");

@KirillOsenkov
Copy link
Member

It should check the result of the .Wait() call and take different actions based on whether the timeout was reached or not. If we kill the process after the timeout, we need to take a full dump, and log a user-facing message saying "we killed the process because it took too long to exit"

@nohwnd
Copy link
Member

nohwnd commented Jun 30, 2021

I see absolutely no harm about increasing the timeout to 30 seconds. It will make the worst case successful, and won't impact the regular success case. Remember that books are written about timeouts in computers, and how you can never rely on 100ms actually being 100ms. Your entire process may be paused by the OS for any amount of time randomly. Heck, a random GC can easily create a timeout longer than that.

Assuming the worst case actually is 30 seconds. 🙂 I am aware of 100 ms not being exactly 100 ms for a thread. And I even agree with this in a pipeline scenario. There it might be reasonable to wait 30s for a process to terminate, to give it a better chance of terminating. Unfortunately 30s is forever in interactive scenario like when you click Stop button in VS TestExplorer. There you don't want to be waiting 30s before seeing anything happen. That is why I was recommending to increase to 1-3s. This gives the process a lot of time to teardown, while keeping the interactive user on the verge of "this is stuck".

It would also be nice to give the extensions a better infra where they could report what is happening, and we would show it in case of abort. Problem is how do we get it from the process without spending a lot of resources on "just" logging. Would be awesome if we could have a post-mortem view, that would be more verbose than what you would normally be interested in.

This 100ms should also not be problem anymore for coverlet (which I think was crashing your build as well, together I think with the Debug.Assert coming from the code) because newer coverlet is now an in-proc data collector to avoid this problem.

I also think that this line of code will be run if the process has exited before the timeout, which seems wrong:

You are right. Building a fix.

@nohwnd
Copy link
Member

nohwnd commented Jun 30, 2021

Test host process crashed <- yeah that is a whole another story, both blame and vstest console should report back to the reporter that they killed testhost. I originally attempted to get this fixed in the blame, but doing it would have required more changes that I was comfortable with at that time. IIRC.

@KirillOsenkov
Copy link
Member

I see, now that I realize that this runs in interactive and IDE scenarios, 30 seconds is of course, way too much. I was missing this detail.

Still, the CI case needs to be absolutely robust. You saw how flaky the CI can get because of this issue. Perhaps this should be configurable and CI should pass a custom timeout like 1 minute.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants