Possible confusing logging on ProxyOperationManager.cs #2379

MarcoRossignoli · 2020-03-29T17:46:15Z

Usually when we've some issue with coverlet msbuild or .NET tool driver we enable logging to understand if vstest plat killed process after famous 100ms.
BTW seem that logging is wrong and also in case of clean shutdown timeout log is emitted

vstest/src/Microsoft.TestPlatform.CrossPlatEngine/Client/ProxyOperationManager.cs

Lines 202 to 236 in d10bcbb

    
           public virtual void Close() 
        
           { 
        
               try 
        
               { 
        
                   // do not send message if host did not launch 
        
                   if (this.testHostLaunched) 
        
                   { 
        
                       this.RequestSender.EndSession(); 
        
                       // We want to give test host a chance to safely close. 
        
                       // The upper bound for wait should be 100ms. 
        
                       var timeout = 100; 
        
                       EqtTrace.Verbose("ProxyOperationManager.Close: waiting for test host to exit for {0} ms", timeout); 
        
                       this.testHostExited.Wait(timeout); 
        
                   } 
        
               } 
        
               catch (Exception ex) 
        
               { 
        
                   // Error in sending an end session is not necessarily a failure. Discovery and execution should be already 
        
                   // complete at this time. 
        
                   EqtTrace.Warning("ProxyOperationManager: Failed to end session: " + ex); 
        
               } 
        
               finally 
        
               { 
        
                   this.initialized = false; 
        
                   EqtTrace.Warning("ProxyOperationManager: Timed out waiting for test host to exit. Will terminate process."); 
        
                   // please clean up test host. 
        
                   this.testHostManager.CleanTestHostAsync(CancellationToken.None).Wait(); 
        
                   this.testHostManager.HostExited -= this.TestHostManagerHostExited; 
        
                   this.testHostManager.HostLaunched -= this.TestHostManagerHostLaunched; 
        
               } 
        
           }

I think that log emitted in line 228 should be emitted only if testHostExited event was not signaled.

The text was updated successfully, but these errors were encountered:

nohwnd · 2020-03-30T12:11:32Z

Why famous 100ms are there other issues connected to this?

MarcoRossignoli · 2020-03-30T12:26:18Z

The main issue to move to collectors from msbuild/.net tool version of coverlet is that issue https://github.com/tonerdo/coverlet/blob/master/Documentation/KnownIssues.md#1-vstest-stops-process-execution-earlydotnet-test

nohwnd · 2020-03-30T12:51:10Z

ah gotcha, it only applies to .net fmw. :)

Do you want to PR this change?

MarcoRossignoli · 2020-03-30T13:36:40Z

ah gotcha, it only applies to .net fmw. :)

What do you mean?The issue happens also on .net core afaik.

MarcoRossignoli · 2020-04-04T18:02:05Z

@nohwnd @AbhitejJohn I have also another proposal, can we increase the timeout to 30 seconds?
I mean if that timeout is here to avoid that somebody freeze host process(legit) why don't use a more "standard" timeout and give a realistic time to allow to host process(usually user test classes, think for instance a class subscribed to process exit that cleanup something and needs more than 100ms) to cleanup?
100ms is very few time, 30 second is standard time if process will be blocked for more that 30 seconds likely there is something wrong with it, but the timeout protection will work.

nohwnd · 2020-05-04T11:16:46Z

@jakubch1 could you take this on? I am not that aware of what are the possibilities around coverlet and in-proc out-proc data collectors. I don't think increasing the timeout to 30s is the right way to go. But maybe to 1s (or max 3s) would mitigate the issue a bit, and would not add too much teardown time. I thought this was solved for dotnet core based on the link above, but I probably misunderstood it.

jakubch1 · 2020-05-04T11:23:53Z

@MarcoRossignoli did you try to use MemoryMappedFile for msbuild case? We discussed this already. This is the best approach here and by this we don't need to increase timeout.

MarcoRossignoli · 2020-05-04T11:36:48Z

Not yet, discussed yesterday with @tonerdo, the work will be huge because I think we cannot use the default memory mapped file class provided with .net runtime but to support System.Private.Corelib.dll coverage we need to re-wrap all memory related API for every SO, as we cannot have access to that object for this specific dll.
Take a look here coverlet-coverage/coverlet#808 (comment)

We put it on roadmap on low priority for now https://github.com/coverlet-coverage/coverlet/blob/master/Documentation/Roadmap.md

KirillOsenkov · 2021-06-30T02:35:41Z

I just wasted four days and found all sorts of issues just because of this 100ms timeout.

@nohwnd I want to dig a bit deeper into your reservations around increasing the timeout to 30 seconds. In the success case, when the process finishes quickly enough, no problem, we're not waiting long.

But suppose the process is doing some expensive teardown, or there's just a fluke (like hard drive/antivirus delay, or random noise from other processes on the machine). Suppose the short (100ms, or 1s) timeout expires. Let's look carefully at what happens. We kill the process. The message that we killed the process is logged into --diag file, which is off by default. The user-facing log only displays "Test host process crashed" with no further information. Suppose I even go and manage to collect a dump (which seems like an absolutely impossible endeavour at the moment). The dump would just show that the process was randomly interrupted, and not have any exception or any clue that would help. So users would be left clueless, wasting their time, living with flaky CI. I'm actually investigating this for a partner team who have been living with flaky CI, about half their CI builds fail randomly, and the poor people are just used to requeuing and hoping it works next time. Nobody knows how to investigate this. Honestly, in light of days or weeks of lost productivity of people worldwide, flaky CI builds, saving on this 100ms timeout looks really strange from a business perspective.

I see absolutely no harm about increasing the timeout to 30 seconds. It will make the worst case successful, and won't impact the regular success case. Remember that books are written about timeouts in computers, and how you can never rely on 100ms actually being 100ms. Your entire process may be paused by the OS for any amount of time randomly. Heck, a random GC can easily create a timeout longer than that.

Additionally, this bug should track better user-facing logging in case that timeout does expire. See related:
#2952

It should be immediately obvious to the user that what happened isn't that a particular test failed, but we timed out waiting for the host process to terminate.

Actually the best thing would be to use procdump in that case after the timeout, forcibly grab a dump of the process and make sure it's published as an AzDO artifact, so users can see what the process was doing that caused a 30 seconds timeout.

At the risk of repeating myself, I can't underscore enough how important it is that we make this set of improvements to the dotnet test experience on CI. It will result in more stable pipelines, remove the flakiness and noise and make failures easier to diagnose. It's hard to measure the amount of pain the current unfortunate constellation of defects is causing to engineers using dotnet test in CI.

@nohwnd if you feel like any of this work doesn't fit in your current schedule, please ping me and I will ensure this gets escalated to whatever level is necessary to prioritize this as an absolute top priority.

KirillOsenkov · 2021-06-30T02:50:06Z

I also think that this line of code will be run if the process has exited before the timeout, which seems wrong:

vstest/src/Microsoft.TestPlatform.CrossPlatEngine/Client/ProxyOperationManager.cs

Line 228 in d10bcbb

    
           EqtTrace.Warning("ProxyOperationManager: Timed out waiting for test host to exit. Will terminate process.");

KirillOsenkov · 2021-06-30T02:53:32Z

It should check the result of the .Wait() call and take different actions based on whether the timeout was reached or not. If we kill the process after the timeout, we need to take a full dump, and log a user-facing message saying "we killed the process because it took too long to exit"

nohwnd · 2021-06-30T13:16:43Z

I see absolutely no harm about increasing the timeout to 30 seconds. It will make the worst case successful, and won't impact the regular success case. Remember that books are written about timeouts in computers, and how you can never rely on 100ms actually being 100ms. Your entire process may be paused by the OS for any amount of time randomly. Heck, a random GC can easily create a timeout longer than that.

Assuming the worst case actually is 30 seconds. 🙂 I am aware of 100 ms not being exactly 100 ms for a thread. And I even agree with this in a pipeline scenario. There it might be reasonable to wait 30s for a process to terminate, to give it a better chance of terminating. Unfortunately 30s is forever in interactive scenario like when you click Stop button in VS TestExplorer. There you don't want to be waiting 30s before seeing anything happen. That is why I was recommending to increase to 1-3s. This gives the process a lot of time to teardown, while keeping the interactive user on the verge of "this is stuck".

It would also be nice to give the extensions a better infra where they could report what is happening, and we would show it in case of abort. Problem is how do we get it from the process without spending a lot of resources on "just" logging. Would be awesome if we could have a post-mortem view, that would be more verbose than what you would normally be interested in.

This 100ms should also not be problem anymore for coverlet (which I think was crashing your build as well, together I think with the Debug.Assert coming from the code) because newer coverlet is now an in-proc data collector to avoid this problem.

I also think that this line of code will be run if the process has exited before the timeout, which seems wrong:

You are right. Building a fix.

nohwnd · 2021-06-30T13:24:15Z

Test host process crashed <- yeah that is a whole another story, both blame and vstest console should report back to the reporter that they killed testhost. I originally attempted to get this fixed in the blame, but doing it would have required more changes that I was comfortable with at that time. IIRC.

KirillOsenkov · 2021-07-01T03:25:19Z

I see, now that I realize that this runs in interactive and IDE scenarios, 30 seconds is of course, way too much. I was missing this detail.

Still, the CI case needs to be absolutely robust. You saw how flaky the CI can get because of this issue. Perhaps this should be configurable and CI should pass a custom timeout like 1 minute.

MarcoRossignoli mentioned this issue Apr 6, 2020

Add possibility to run coverlet in standalone mode coverlet-coverage/coverlet#781

Closed

KirillOsenkov mentioned this issue Jun 30, 2021

"Test host process crashed" error hard to diagnose #2952

Open

MarcoRossignoli mentioned this issue Nov 23, 2021

Complete the attachment processors extension #3161

Merged

MarcoRossignoli closed this as completed in #3161 Jan 12, 2022

MarcoRossignoli mentioned this issue Feb 1, 2022

Add option for timeout and custom procdump #2976

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Possible confusing logging on ProxyOperationManager.cs #2379

Possible confusing logging on ProxyOperationManager.cs #2379

MarcoRossignoli commented Mar 29, 2020 •

edited

Loading

nohwnd commented Mar 30, 2020

MarcoRossignoli commented Mar 30, 2020

nohwnd commented Mar 30, 2020

MarcoRossignoli commented Mar 30, 2020 •

edited

Loading

MarcoRossignoli commented Apr 4, 2020

nohwnd commented May 4, 2020

jakubch1 commented May 4, 2020

MarcoRossignoli commented May 4, 2020 •

edited

Loading

KirillOsenkov commented Jun 30, 2021

KirillOsenkov commented Jun 30, 2021

KirillOsenkov commented Jun 30, 2021

nohwnd commented Jun 30, 2021

nohwnd commented Jun 30, 2021

KirillOsenkov commented Jul 1, 2021

Possible confusing logging on ProxyOperationManager.cs #2379

Possible confusing logging on ProxyOperationManager.cs #2379

Comments

MarcoRossignoli commented Mar 29, 2020 • edited Loading

nohwnd commented Mar 30, 2020

MarcoRossignoli commented Mar 30, 2020

nohwnd commented Mar 30, 2020

MarcoRossignoli commented Mar 30, 2020 • edited Loading

MarcoRossignoli commented Apr 4, 2020

nohwnd commented May 4, 2020

jakubch1 commented May 4, 2020

MarcoRossignoli commented May 4, 2020 • edited Loading

KirillOsenkov commented Jun 30, 2021

KirillOsenkov commented Jun 30, 2021

KirillOsenkov commented Jun 30, 2021

nohwnd commented Jun 30, 2021

nohwnd commented Jun 30, 2021

KirillOsenkov commented Jul 1, 2021

MarcoRossignoli commented Mar 29, 2020 •

edited

Loading

MarcoRossignoli commented Mar 30, 2020 •

edited

Loading

MarcoRossignoli commented May 4, 2020 •

edited

Loading