Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Windows large pages commit - modern version instantly crashing #2677

Closed
Coolchessguykevin opened this issue May 13, 2020 · 22 comments
Closed

Comments

@Coolchessguykevin
Copy link

I just downloaded the latest Stockfish build ("Add support for Windows large pages" commit) from the abrok.eu site.

I run the engine exe-file in the command prompt (just open exe-file without adding it to any GUI) and see the following:

Stockfish 130520 64 POPCNT by T. Romstad, M. Costalba, J. Kiiski, G. Linscott
info string Hash table allocation: Windows large pages not used.

Of course I did not change the system policy to use page locking.

Bench command is working well.

But when I try to close the command prompt window, the engine stops working (with Windows appropriate message and log information in the event viewer). This happens only when the latest modern build is launched (my processor is quite old and is not supports BMI command), all the previous builds until Lock pages commit work without any problems.

System: Win 10 Pro x64 (November 2019 update).

@Coolchessguykevin
Copy link
Author

Updated: I checked the latest build on another, more modern computer (BMI2 - Haswell CPU - Stockfish version). The problem is exactly the same.

At the moment, I did not add the newest build to any GUI, as it is obvious that the problem after Lock pages commit exists.

@CoffeeOne
Copy link

What do you mean exactly with "But when I try to close the command prompt window, the engine stops working"?

@Coolchessguykevin
Copy link
Author

@CoffeeOne I mean that if I try to close the running engine window as normal way - engine crashing with Windows application error report is following. It is not good. Such was never happened before the latest commit.

@MichaelB7
Copy link
Contributor

MichaelB7 commented May 13, 2020

I tried to duplicate, with and without Large Pages from abrok , to no avail.

Can you post system information ( search "system information: in search bar and run app.)
Output will look something like this:

OS Name	Microsoft Windows 10 Pro
Version	10.0.18363 Build 18363
Other OS Description 	Not Available
OS Manufacturer	Microsoft Corporation
System Name	VM-894787
System Manufacturer	System manufacturer
System Model	System Product Name
System Type	x64-based PC
System SKU	SKU
Processor	AMD Ryzen Threadripper 3970X 32-Core Processor, 3693 Mhz, 32 Core(s), 64 Logical Processor(s)
BIOS Version/Date	American Megatrends Inc. 0807, 1/30/2020
SMBIOS Version	3.2
Embedded Controller Version	255.255
BIOS Mode	UEFI
BaseBoard Manufacturer	ASUSTeK COMPUTER INC.
BaseBoard Product	PRIME TRX40-PRO
BaseBoard Version	Rev 1.xx
Platform Role	Desktop
Secure Boot State	Off
PCR7 Configuration	Binding Not Possible
Windows Directory	C:\Windows
System Directory	C:\Windows\system32
Boot Device	\Device\HarddiskVolume7
Locale	United States
Hardware Abstraction Layer	Version = "10.0.18362.752"
User Name	VM-894787\MichaelB7
Time Zone	Eastern Daylight Time
Installed Physical Memory (RAM)	128 GB
Total Physical Memory	128 GB
Available Physical Memory	116 GB
Total Virtual Memory	136 GB
Available Virtual Memory	114 GB
Page File Space	8.00 GB
Page File	C:\pagefile.sys
Kernel DMA Protection	Off
Virtualization-based security	Not enabled
Device Encryption Support	Reasons for failed automatic device encryption: TPM is not usable, PCR7 binding is not supported, Hardware Security Test Interface failed and device is not Modern Standby, Un-allowed DMA capable bus/device(s) detected, TPM is not usable
Hyper-V - VM Monitor Mode Extensions	Yes
Hyper-V - Second Level Address Translation Extensions	Yes
Hyper-V - Virtualization Enabled in Firmware	Yes
Hyper-V - Data Execution Protection	Yes

@MichaelB7
Copy link
Contributor

One thing I would try, is to run wndows updates until it does not update anymore. Standard procedure when an app that always worked before stops working.

@silversolver1
Copy link

silversolver1 commented May 14, 2020

Not sure if this is related but I compiled the latest version (large pages), on Windows, (with large pages off) and while I'm not getting any crashes or errors, the executable takes noticeably longer to close (around 2 to 3 seconds) while all prior executables close pretty much instantaneously when the "X" button is clicked. This is repeatable too, and I'm not doing anything differently than I normally would.

I checked and this behavior isn't present/hasn't been present in compiles that go only up to the TBTables::Entry commit, so I'm fairly certain it has something to do with the latest Large Pages commit. I also find it strange that this effect is even occurring given that I'm not even using the functionality

@Coolchessguykevin
Copy link
Author

Not sure if this is related but I compiled the latest version (large pages), on Windows, (with large pages off) and while I'm not getting any crashes or errors, the executable takes noticeably longer to close (around 2 to 3 seconds) while all prior executables close pretty much instantaneously when the "X" button is clicked. This is repeatable too, and I'm not doing anything differently than I normally would.

@silversolver1 Can you please check Event log - Application. I assume you will find out Stockfish crashes after those 2-3 seconds of freezing. This is exactly behavior that I got on the second machine. On the first one there were Windows crashes messages after seconds of freezing, but on the second machine only freezing was visible, but I guessed to look into the event log and found those same crashes.

@Coolchessguykevin
Copy link
Author

@MichaelB7 The reason is definitely not in the hardware, as this problem occurs on two different computers at once. And I did not update Windows before the crash issue occurred. It appeared exclusively in the latest Stockfish build with Large pages commit.

Right now I am running any previous Stockfish builds on both PCs and none of them have a similar crashes problem.

@MichaelB7
Copy link
Contributor

MichaelB7 commented May 14, 2020 via email

@silversolver1
Copy link

silversolver1 commented May 14, 2020

@Coolchessguykevin Yes, I checked and you were right it shows up with event name APPCRASH.

Some of the seemingly relevant details in the crash log are:

P4: ntdll.dll
P7: c0000005
P8: 00000000000407d3

https://stackoverflow.com/questions/9497384/how-to-interpret-windows-appcrash-mysterious-log

From there,
P4 = Module name
P7 = Exception code
P8 = Exception offset from the start of the module

According to google, exception code c0000005 indicates an access violation

The only real thing I could notice about this is that in tt.cpp's TranspositionTable::resize, Here, variable mem is not declared, defined or initialized in any way that I can tell. It does not seem to have a type, is not passed in as a parameter in any way, is not a global variable and does not point to anything either (so I am not sure how it can be used. although i could of course be missing something ;) )

@vondele
Copy link
Member

vondele commented May 14, 2020

@skiminki could you have a look ?

From the description is seems that the TT.resize(0) at the end might be causing troubles.

@vondele
Copy link
Member

vondele commented May 14, 2020

@silversolver1 mem is a variable of the TranspositionTable class (see tt.h line 95).

This is certainly not related to the TBTables::Entry commit

@vondele
Copy link
Member

vondele commented May 14, 2020

@silversolver1 could you try the following, it is a bit a shot in the dark, but it would be useful to exclude it as a cause:

diff --git a/src/main.cpp b/src/main.cpp
index c7cf2c6f2..f0b7f240b 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -49,6 +49,7 @@ int main(int argc, char* argv[]) {
 
   UCI::loop(argc, argv);
 
+  Threads.main()->wait_for_search_finished();
   TT.resize(0);
   Threads.set(0);
   return 0;

@silversolver1
Copy link

@vondele Sure, I compiled a version with just the line Threads.main()->wait_for_search_finished(); added in that particular spot on top of the Large Pages commit. It seems the problem persists with the same error log, including the same exception code and exception offset.

@vondele
Copy link
Member

vondele commented May 14, 2020

OK, thanks. So that's not an issue. The windows binary (including the abrok one) runs fine under wine on Linux. @silversolver1 could you reproduce this crash in a debugger and get the location of the crash confirmed?

Also as asked previously, I guess this will be Windows version specific, so can you provide that as well.

@dorzechowski
Copy link

It happens only on SIGINT sent to the process (so Ctrl-C or just closing terminal window with SF running). Command 'quit' works with no problem.

@skiminki
Copy link
Contributor

I think I have reproduced this. I get this behavior + entry in the application log if I close the stockfish terminal window with the "close window" button.

However, I do not see this issue if I use command 'quit'.

My reproduction steps precisely:

  1. Click 'stockfish.exe' icon to open a terminal Window
  2. Click 'X' button in top-right to close the window
  3. It takes maybe 2 secs to close the window, and there are 3 new events in the event log

However, the following works without issues

  1. Click 'stockfish.exe' icon to open a terminal Window
  2. type 'quit' to close the engine gracefully
  3. Terminal window closes instantly

Yeah, I think this has to do with SIGINT and possibly about that we didn't get a chance to close VirtualAlloc's. I'll post the events in a follow-up comment.

@skiminki
Copy link
Contributor

This is about NTDLL.DLL crashing because we didn't close VirtualAlloc's in ctrl+c exit path. I get the same behavior with 'quit' exit with the following patch:

diff --git a/src/tt.cpp b/src/tt.cpp
index 6ee63138..4b2c45d0 100644
--- a/src/tt.cpp
+++ b/src/tt.cpp
@@ -63,6 +63,12 @@ void TranspositionTable::resize(size_t mbSize) {
 
   Threads.main()->wait_for_search_finished();
 
+  if (!mbSize)
+  {
+      // aligned_ttmem_free(mem); mem = nullptr; -- no crash if this line commented out
+      exit(0);
+  }
+
   if (mem)
       aligned_ttmem_free(mem);

With this patch, the following gives a crash:

  1. Click 'stockfish.exe' icon to open a terminal Window
  2. type 'quit'
  3. It takes maybe 2 secs to close the window, and there are 2 new events in the event log (error + info)

Actually, I don't think we need the crash events here. I'll investigate a bit further.

@skiminki
Copy link
Contributor

skiminki commented May 14, 2020

Ok, the crash is most likely because of a left-over. That is, TranspositionTable destructor frees mem if it's non-null. The fix is to delete it, because we free transposition table now manually--or rather, maybe we should delete that resize(0) stuff and replace free(mem) with aligned_ttmem_free(mem) call. This was a bug in my patch. I'll send a PR soon.

diff --git a/src/tt.h b/src/tt.h
index 8b70f797..94fb9f61 100644
--- a/src/tt.h
+++ b/src/tt.h
@@ -75,7 +75,7 @@ class TranspositionTable {
   static_assert(sizeof(Cluster) == 32, "Unexpected Cluster size");
 
 public:
- ~TranspositionTable() { free(mem); }
+  ~TranspositionTable() { }
   void new_search() { generation8 += 8; } // Lower 3 bits are used by PV flag and Bound
   TTEntry* probe(const Key key, bool& found) const;
   int hashfull() const;

So with this patch, I don't see crashes. TBH, I'm a bit surprised that Windows calls the static destructors on window close after aborting the program.

@skiminki
Copy link
Contributor

skiminki commented May 14, 2020

Candidate PR to fix this: #2679

@silversolver1
Copy link

silversolver1 commented May 14, 2020

I haven't yet tried anything with a debugger but as I commented in 2679, the PR seems to work to remedy the problem and I no longer get any application crashes or errors in the Windows Event Viewer log (running Windows 10). The hang on closing using the Ctrl-C and "X" methods also seems to be gone now, so it looks like this worked

@vondele
Copy link
Member

vondele commented May 14, 2020

yes, will be fixed after merging.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants