While troubleshooting long STW pauses in production code, we noticed that they often coincide with reading big files. This seems to be visible in Go traces: execution stops on all but two processors. One of them runs a goroutine that reads a file using the standard library call ioutil.ReadFile, and the other is in the STW phase, seemingly waiting for the file-reading goroutine to finish. As the files we read are relatively big, this stops execution of the program on the other CPU cores for a rather long time, in the worst cases up to 100ms.
The following very simple program demonstrates the problem:
// dat-1G is a 1G file created with the command 'fallocate -l 1G dat-1G'
_, err := ioutil.ReadFile("dat-1G")
// forces GC to trigger STW
runtime.GC()
If you run it and then look at the trace, you will see that the STW phase lasts until the code finishes reading the file. My possibly incomplete understanding of the problem is that internally ioutil.ReadFile tries to read the file with a single 'read' syscall, which Go cannot preempt, so the STW phase cannot pause the goroutine that performs the very long file read operation.
The ioutil.ReadFile implementation should probably be changed to read the file in blocks using multiple 'read' syscalls, so that Go can preempt it when necessary.
> If you run it and then look at the trace, you will see that the STW phase lasts until the code finishes reading the file.
I don't think this is what is happening. If you wait for the goroutine doing ReadFile to finish (before the main function exits), it will take a lot longer. From your trace it seems it took ~40 ms to preempt that goroutine and stop the world, but it didn't wait for the whole ReadFile to finish.
(~40 ms is still kind of long. I guess it is probably allocating and zeroing the memory of the big slice, which is currently nonpreemptible (#36365))
@cherrymui You are right. I replaced ioutil.ReadFile with a manual allocation of the big slice followed by explicitly reading the file via os.Open() and Read(), and I can see in the trace that the long STW pause is due to the big slice allocation, as you wrote.