
Erratic performance for Win32 emulation of select with pipes #7935

Open
vicuna opened this issue Mar 4, 2019 · 5 comments

@vicuna (Collaborator) commented Mar 4, 2019

Original bug ID: 7935
Reporter: @alainfrisch
Status: new
Resolution: open
Priority: normal
Severity: minor
Category: platform support (windows, cross-compilation, etc)
Monitored by: @nojb jmeber @yakobowski

Bug description

I'm attaching a small program that exhibits an erratic performance profile under Windows. I strongly suspect OCaml's emulation of select for pipes:

    void read_pipe_poll (HANDLE hStop, void *_data)
    {
      DWORD res;
      DWORD event;
      DWORD n;
      LPSELECTQUERY iterQuery;
      LPSELECTDATA lpSelectData;
      DWORD i;
      DWORD wait;

      /* Poll pipe */
      event = 0;
      n = 0;
      lpSelectData = (LPSELECTDATA)_data;
      wait = 1;

      DEBUG_PRINT("Checking data pipe");
      while (lpSelectData->EState == SELECT_STATE_NONE)
      {
        for (i = 0; i < lpSelectData->nQueriesCount; i++)
        {
          iterQuery = &(lpSelectData->aQueries[i]);
          res = PeekNamedPipe(
              iterQuery->hFileDescr,
              NULL,
              0,
              NULL,
              &n,
              NULL);
          if (check_error(lpSelectData,
                          (res == 0) &&
                          (GetLastError() != ERROR_BROKEN_PIPE)))
          {
            break;
          };
          if ((n > 0) || (res == 0))
          {
            lpSelectData->EState = SELECT_STATE_SIGNALED;
            select_data_result_add(lpSelectData, iterQuery->EMode,
                                   iterQuery->lpOrigIdx);
          };
        };

        /* Alas, nothing except polling seems to work for pipes.
           Check the state & stop_worker_event every 10 ms
         */
        if (lpSelectData->EState == SELECT_STATE_NONE)
        {
          event = WaitForSingleObject(hStop, wait);
          /* Fast start: begin to wait 1, 2, 4, 8 and then 10 ms.
           * If we are working with the output of a program there is
           * a chance that one of the 4 first calls succeed.
           */
          wait = 2 * wait;
          if (wait > 10)
          {
            wait = 10;
          };
          if (event == WAIT_OBJECT_0
              || check_error(lpSelectData, event == WAIT_FAILED))
          {
            break;
          }
        }
      }
      DEBUG_PRINT("Finish checking data on pipe");
    }

This works by first checking whether some data is available on the pipe and, if not, waiting 1 ms and checking again, then 2 ms, then 4 ms, then 8 ms, and finally looping with a 10 ms wait between checks until data is available (so at most 15 ms of "fast start" waiting before settling into the steady 10 ms period).

However, the accuracy of those wait operations is limited by the current resolution of the system clock, which can be, I believe, as coarse as 16 ms. I didn't confirm it experimentally, but I wouldn't be surprised if the 1, 2, 4, 8 ms waits were rounded to 0 (no wait), and the 10 ms wait rounded to 16 ms. This means that if data is not available almost immediately, the code will wait for at least 16 ms, which is very long in some contexts.
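
A minimal standalone sketch (not part of the attached program; plain Win32 C) that times short WaitForSingleObject calls against the high-resolution performance counter can confirm this rounding experimentally:

    #include <windows.h>
    #include <stdio.h>

    /* Time WaitForSingleObject for the timeouts used by read_pipe_poll;
       the event is never signaled, so each wait runs to its timeout. */
    int main(void)
    {
      HANDLE ev = CreateEvent(NULL, TRUE, FALSE, NULL);
      DWORD timeouts[] = { 1, 2, 4, 8, 10 };
      LARGE_INTEGER freq, t0, t1;
      int i, j;

      QueryPerformanceFrequency(&freq);
      for (i = 0; i < 5; i++) {
        double total = 0.0;
        for (j = 0; j < 20; j++) {
          QueryPerformanceCounter(&t0);
          WaitForSingleObject(ev, timeouts[i]);
          QueryPerformanceCounter(&t1);
          total += (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
        }
        printf("timeout %2lu ms -> observed avg %6.2f ms\n",
               (unsigned long)timeouts[i], 1000.0 * total / 20.0);
      }
      CloseHandle(ev);
      return 0;
    }

If the hypothesis is right, on a machine with the default timer the observed averages should cluster near multiples of ~16 ms rather than near the requested values.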

The benchmark consists of spawning a worker process and synchronizing with it through its stdin/stdout channels (connected to pipes). Namely, the main process sends a single character to the worker, does a tiny bit of computation, then performs a select on the worker's stdout and reads one character; this is repeated 1000 times. The worker waits for a character (with Unix.read) from the main process, does a bit of computation (the amount being proportional to a parameter N), and sends one character back to the main process. We do that 5 times for each value of N from 1 to 50, and we observe that for N large enough, the total time for the whole interaction with the worker starts exhibiting erratic behavior, with apparently two "extreme" points (e.g. between 1.21 s and 11.06 s for N=50).

See results below for runs under MSVC64 and Linux (under WSL) on the same machine and the same OS. The machine has 2 physical cores, and I wouldn't be surprised if the two "performance modes" depend on whether the two processes run on the same core or not. If they do, the main process is slowed down enough that the next ping from the worker has already arrived when it starts waiting for it. If they don't, the main process is ready too quickly, and enters the waiting strategy described above, resulting in a very large slowdown (around 10 seconds for the entire 1000 iterations).

Additional information

The results of the benchmark with the MSVC64 port:

1:  0.04 0.05 0.04 0.05 0.04
2:  0.03 0.04 0.03 0.04 0.05
3:  0.07 0.05 0.03 0.03 0.04
4:  0.05 0.04 0.03 0.04 0.05
5:  0.05 0.05 0.05 0.03 0.04
6:  0.05 0.04 0.03 0.04 0.05
7:  0.04 0.05 0.04 0.03 0.04
8:  0.05 0.05 0.05 0.04 0.03
9:  0.05 0.05 0.20 0.12 0.06
10:  0.14 0.07 0.06 0.03 0.03
11:  0.19 0.07 0.25 0.08 0.08
12:  0.07 0.08 0.04 0.03 0.08
13:  0.18 0.15 0.05 0.05 0.08
14:  0.17 0.26 0.18 0.22 0.08
15:  0.13 0.11 0.09 0.15 0.13
16:  0.14 0.11 0.12 0.16 0.16
17:  0.10 0.14 0.17 0.32 0.06
18:  0.06 0.07 0.08 0.33 0.07
19:  0.12 0.12 0.25 0.25 0.22
20:  0.23 0.14 0.36 0.28 0.25
21:  0.30 0.30 0.25 0.25 0.33
22:  0.12 0.22 0.27 0.20 0.33
23:  0.22 0.33 0.43 0.15 0.12
24:  0.24 0.26 0.33 0.47 0.22
25:  0.28 0.43 0.30 0.30 0.35
26:  0.25 0.34 0.27 0.35 0.37
27:  0.30 0.22 0.38 0.50 0.30
28:  0.32 0.36 0.39 0.38 0.39
29:  0.35 0.35 0.44 0.40 0.29
30:  0.31 0.38 0.42 0.41 0.36
31:  0.50 0.42 0.83 0.43 0.56
32:  0.54 0.82 0.48 0.45 0.52
33:  1.04 0.50 0.82 0.59 0.75
34:  0.63 0.60 0.60 0.72 0.45
35:  0.93 1.69 1.56 0.75 0.49
36:  0.53 0.50 2.93 1.25 0.98
37:  1.13 5.82 1.20 0.82 4.05
38:  0.81 0.77 6.06 6.77 5.30
39:  0.87 0.91 6.09 7.70 4.72
40:  0.95 3.14 8.33 7.26 0.93
41:  0.98 8.77 9.16 0.98 0.89
42:  8.43 9.47 1.12 0.99 9.02
43:  7.62 2.18 0.99 7.88 10.12
44:  1.15 1.48 9.67 8.01 1.03
45:  3.54 10.02 5.63 1.05 6.43
46:  10.80 2.12 1.04 10.34 7.94
47:  1.12 4.39 10.53 4.18 1.06
48:  7.89 10.51 1.13 4.32 11.41
49:  3.82 1.16 10.90 7.46 1.16
50:  6.87 11.39 1.21 2.78 11.96

and on the same machine, with the Linux port under WSL:

1:  0.02 0.03 0.02 0.03 0.02
2:  0.02 0.03 0.02 0.02 0.02
3:  0.02 0.02 0.03 0.02 0.02
4:  0.02 0.02 0.02 0.03 0.02
5:  0.02 0.03 0.03 0.03 0.03
6:  0.02 0.02 0.02 0.02 0.03
7:  0.02 0.03 0.03 0.02 0.02
8:  0.03 0.02 0.02 0.02 0.02
9:  0.03 0.03 0.03 0.02 0.02
10:  0.02 0.02 0.03 0.03 0.03
11:  0.03 0.02 0.02 0.03 0.03
12:  0.02 0.02 0.03 0.03 0.03
13:  0.03 0.03 0.04 0.03 0.03
14:  0.03 0.03 0.03 0.03 0.03
15:  0.03 0.03 0.03 0.03 0.03
16:  0.03 0.03 0.03 0.03 0.03
17:  0.03 0.03 0.03 0.03 0.03
18:  0.03 0.03 0.03 0.03 0.03
19:  0.03 0.03 0.03 0.03 0.03
20:  0.03 0.03 0.03 0.03 0.03
21:  0.03 0.03 0.03 0.03 0.03
22:  0.03 0.03 0.03 0.03 0.03
23:  0.03 0.03 0.03 0.03 0.03
24:  0.03 0.03 0.03 0.03 0.03
25:  0.03 0.03 0.03 0.03 0.03
26:  0.03 0.03 0.03 0.03 0.03
27:  0.03 0.03 0.03 0.03 0.03
28:  0.03 0.04 0.03 0.03 0.04
29:  0.03 0.03 0.03 0.03 0.03
30:  0.03 0.04 0.04 0.03 0.03
31:  0.03 0.04 0.03 0.03 0.03
32:  0.03 0.03 0.03 0.03 0.04
33:  0.03 0.04 0.03 0.03 0.03
34:  0.03 0.03 0.03 0.03 0.03
35:  0.03 0.04 0.03 0.03 0.03
36:  0.03 0.03 0.03 0.03 0.03
37:  0.04 0.03 0.03 0.03 0.03
38:  0.03 0.03 0.03 0.03 0.03
39:  0.04 0.04 0.04 0.03 0.03
40:  0.04 0.04 0.04 0.04 0.04
41:  0.04 0.04 0.04 0.04 0.03
42:  0.03 0.04 0.04 0.04 0.05
43:  0.05 0.05 0.04 0.05 0.04
44:  0.05 0.05 0.05 0.05 0.05
45:  0.05 0.05 0.05 0.04 0.05
46:  0.05 0.05 0.04 0.05 0.04
47:  0.05 0.05 0.05 0.05 0.05
48:  0.05 0.05 0.05 0.05 0.05
49:  0.05 0.05 0.05 0.04 0.04
50:  0.04 0.04 0.04 0.04 0.04

File attachments

@vicuna (Collaborator) commented Mar 4, 2019

Comment author: @alainfrisch

Another test, to check the clock resolution:

    for i = 0 to 20 do
      let n = float i /. 1000. in
      Printf.printf "%i: %!" i;
      for _j = 1 to 5 do
        let t0 = Unix.gettimeofday () in
        (* 100 selects per sample; the printed value is therefore the
           average duration of one select, in milliseconds *)
        for _i = 1 to 100 do
          Unix.select [] [] [] n
        done;
        let t1 = Unix.gettimeofday () in
        Printf.printf " %.02f %!" (10. *. (t1 -. t0))
      done;
      print_newline ()
    done

The call to Unix.select ends up calling Sleep (with its timeout argument converted from seconds to milliseconds, i.e. multiplied by 1000).

Under WSL, with a good clock accuracy, I get:

0:  0.00  0.00  0.01  0.00  0.00
1:  1.20  1.27  1.26  1.26  1.26
2:  2.28  2.27  2.27  2.27  2.27
3:  3.28  3.23  3.28  3.28  3.29
4:  4.29  4.30  4.29  4.83  4.81
5:  5.80  5.77  5.30  5.30  5.31
6:  6.30  6.30  6.58  6.31  6.41
7:  7.32  7.42  7.42  7.38  7.40
8:  8.63  8.40  8.37  8.35  8.37
9:  9.44  9.74  9.40  9.45  9.50
10:  10.63  10.75  10.56  10.43  10.43
11:  11.44  11.45  11.43  11.42  12.05
...

but under Windows, runs are very erratic; here is one example:

0:  0.00  0.00  0.00  0.00  0.00
1:  16.69  16.74  16.75  16.69  16.75
2:  16.76  16.69  16.76  7.81  2.83
3:  3.78  3.72  3.67  3.74  14.72
4:  16.60  16.69  16.76  16.22  16.76
5:  16.68  16.76  16.60  16.69  16.30
6:  6.76  6.70  7.22  16.76  16.69
7:  16.75  16.76  16.70  16.75  17.07
8:  16.76  16.69  16.92  14.27  8.69
9:  10.72  16.69  16.75  16.75  16.69
10:  16.76  16.76  16.69  16.74  16.76
11:  16.91  14.79  11.69  15.28  16.92
12:  16.75  17.06  16.69  16.75  16.76
13:  16.69  16.76  16.92  16.76  13.65
14:  15.78  16.75  16.69  16.74  16.75
15:  16.84  16.75  16.76  16.70  17.06
16:  26.58  19.15  33.44  29.29  31.93
17:  33.47  32.75  24.14  24.46  33.13
18:  33.44  33.45  33.14  32.90  18.55
19:  32.94  33.52  33.14  33.14  33.14
20:  26.35  27.88  33.14  33.44  33.13

It's as if the system clock resolution were temporarily changed by another process, from the usual coarse setting (around 16 ms) to a finer one (around 1 ms or less), for short periods of time. It could be something like:

https://bugs.chromium.org/p/chromium/issues/detail?id=153139
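
One way to observe this directly is the NtQueryTimerResolution call from ntdll (a sketch; the function is undocumented, and its behavior here is assumed from common usage, with values reported in 100 ns units):

    #include <windows.h>
    #include <stdio.h>

    /* Undocumented ntdll entry point; resolved dynamically. */
    typedef LONG (WINAPI *NtQueryTimerResolution_t)(PULONG, PULONG, PULONG);

    int main(void)
    {
      NtQueryTimerResolution_t query = (NtQueryTimerResolution_t)
        GetProcAddress(GetModuleHandleA("ntdll.dll"),
                       "NtQueryTimerResolution");
      ULONG min, max, cur;

      if (query != NULL && query(&min, &max, &cur) == 0)
        /* Values are in 100 ns units; "min" is the coarsest period. */
        printf("timer: coarsest %.2f ms, finest %.2f ms, current %.2f ms\n",
               min / 10000.0, max / 10000.0, cur / 10000.0);
      return 0;
    }

Running this in a loop while opening and closing Chrome should show the current resolution moving between ~15.6 ms and ~1 ms.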

@vicuna (Collaborator) commented Mar 4, 2019

Comment author: @alainfrisch

Indeed, the previous program gives the following under Windows, without Chrome running:

0:  0.00  0.00  0.00  0.00  0.00
1:  16.76  16.69  16.76  16.75  16.69
2:  16.76  16.75  16.72  16.42  16.69
3:  16.44  16.69  16.45  16.69  16.45
4:  16.53  16.75  16.69  16.92  16.75
5:  16.37  16.91  16.75  16.40  16.75
6:  16.54  16.76  16.76  16.69  16.76
7:  16.75  16.68  16.76  16.76  16.69
8:  16.92  16.76  16.69  16.76  16.75
9:  16.91  16.70  16.76  17.07  16.76
10:  16.69  16.75  16.75  16.69  16.75
11:  16.75  16.68  16.91  16.75  16.69
12:  16.75  16.76  16.69  16.91  16.76
13:  16.76  16.84  16.76  16.76  16.69
14:  16.76  16.75  16.92  16.69  16.76
15:  16.75  16.69  16.75  16.76  16.69
16:  33.52  33.13  33.13  23.12  33.12
17:  33.13  33.14  33.14  33.44  32.82
18:  33.44  33.45  32.82  33.46  33.16
19:  33.34  33.32  33.45  33.45  33.52
20:  33.45  33.45  33.52  33.13  33.13

This suggests that timeouts are rounded up, and that the select emulation will end up waiting 16 ms as soon as data is not immediately available, whenever the clock resolution is left untouched by other processes.
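
For reference, the range of timer periods the system will accept can be queried with timeGetDevCaps (a sketch; link against winmm):

    #include <windows.h>
    #include <timeapi.h>   /* timeGetDevCaps; link with winmm.lib */
    #include <stdio.h>

    int main(void)
    {
      TIMECAPS tc;
      /* wPeriodMin is the finest period timeBeginPeriod can request. */
      if (timeGetDevCaps(&tc, sizeof(tc)) == MMSYSERR_NOERROR)
        printf("supported timer period: %u ms .. %u ms\n",
               tc.wPeriodMin, tc.wPeriodMax);
      return 0;
    }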

With Chrome open, scrolling through tabs:

0:  0.00  0.00  0.00  0.00  0.00
1:  2.01  2.28  2.08  1.92  1.98
2:  2.85  3.21  3.08  3.20  3.14
3:  4.24  4.18  4.06  4.63  4.67
4:  5.14  5.12  6.02  6.71  6.50
5:  7.78  8.02  7.66  8.82  8.65
6:  8.65  8.83  8.86  9.66  9.54
7:  9.70  10.80  9.85  11.30  10.85
8:  11.21  11.65  9.13  9.12  11.40
9:  11.82  11.61  12.80  11.53  12.27
10:  12.86  14.95  16.52  16.24  16.23
11:  16.44  16.93  11.75  13.29  16.36
12:  16.53  16.75  16.37  16.65  16.69
13:  16.53  16.76  16.90  17.06  15.42
14:  15.28  16.59  16.81  16.70  17.55
15:  18.19  17.66  18.95  17.62  18.03
16:  19.66  17.28  19.41  20.89  20.67
17:  21.31  21.26  21.77  21.57  21.96
18:  19.73  18.61  18.58  18.51  18.42
19:  19.59  19.47  19.52  19.45  19.51
20:  20.54  20.68  31.06  31.44  30.94

(I stopped using Chrome between rows 10 and 12.)

This suggests that our select emulation (and in particular the "active" waiting for pipes) is sensitive to other processes changing the clock resolution.

Btw, I noticed this problem in a real application; initially, it materialized as the application running much slower when the computer was otherwise idle, and faster when the user played with other applications (which seemed counter-intuitive, but now makes sense).

@vicuna (Collaborator) commented Mar 4, 2019

Comment author: @alainfrisch

Cf https://docs.microsoft.com/en-us/windows/desktop/api/timeapi/nf-timeapi-timebeginperiod :

This function affects a global Windows setting. Windows uses the lowest value
(that is, highest resolution) requested by any process.
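
A short sketch of that effect on Sleep (again linking against winmm; a 1 ms period is typically but not universally available, so a robust version would consult timeGetDevCaps first):

    #include <windows.h>
    #include <timeapi.h>   /* timeBeginPeriod; link with winmm.lib */
    #include <stdio.h>

    /* Average 20 calls to Sleep(1), in milliseconds. */
    static double time_sleep_1ms(void)
    {
      LARGE_INTEGER freq, t0, t1;
      double total = 0.0;
      int i;
      QueryPerformanceFrequency(&freq);
      for (i = 0; i < 20; i++) {
        QueryPerformanceCounter(&t0);
        Sleep(1);
        QueryPerformanceCounter(&t1);
        total += (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
      }
      return 1000.0 * total / 20.0;
    }

    int main(void)
    {
      printf("Sleep(1), default period: avg %.2f ms\n", time_sleep_1ms());
      timeBeginPeriod(1);            /* raise the global resolution to 1 ms */
      printf("Sleep(1), 1 ms period:    avg %.2f ms\n", time_sleep_1ms());
      timeEndPeriod(1);              /* always pair with timeBeginPeriod */
      return 0;
    }

Because the setting is global, the second measurement also reflects what any other process (such as Chrome above) imposes on the whole system.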

@vicuna (Collaborator) commented Mar 4, 2019

Comment author: @alainfrisch

Tentative fix at alainfrisch@289bc0c

The previous implementation of the select emulation on pipes would
often wait for 16ms when no data is immediately available (or less, if
another process happens to request a finer clock resolution).

This can lead to poor reactivity of programs that rely on select.  This
commit attempts to fix this by switching to busy polling for about
64ms (while making sure to yield to other threads/processes).  After
this initial period, we go back to looping with 16ms timeouts.  Thus,
the time "wasted" is at most 1/4 of the total wait time.  The downside
is that the process can eat 100% of one CPU core if data comes fast
enough.

An alternative might be to do a blocking read on the pipe (in a thread),
with a zero buffer size.
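
For illustration, here is a sketch of the strategy that commit message describes, with hypothetical helper names and a simplified error path (the actual patch is in the commit linked above): busy-poll with a yield for roughly 64 ms, then fall back to timed waits:

    #include <windows.h>

    #define BUSY_POLL_MS 64

    /* Sketch only: hPipe stands for iterQuery->hFileDescr and hStop for
       the stop event in read_pipe_poll. Returns TRUE when data is ready. */
    static BOOL wait_for_pipe_data(HANDLE hPipe, HANDLE hStop)
    {
      DWORD start = GetTickCount();
      DWORD n = 0;

      for (;;) {
        /* The PeekNamedPipe check from read_pipe_poll; a failed peek
           (e.g. broken pipe) also counts as signaled, as there. */
        if (!PeekNamedPipe(hPipe, NULL, 0, NULL, &n, NULL) || n > 0)
          return TRUE;

        if (GetTickCount() - start < BUSY_POLL_MS) {
          /* Busy phase: yield to other runnable threads instead of
             sleeping, so no timer-resolution rounding applies. */
          if (WaitForSingleObject(hStop, 0) == WAIT_OBJECT_0)
            return FALSE;              /* asked to stop */
          SwitchToThread();
        } else {
          /* Steady phase: timed waits, subject to the ~16 ms clock;
             by now at most ~1/4 of the total wait time is wasted. */
          DWORD ev = WaitForSingleObject(hStop, 10);
          if (ev == WAIT_OBJECT_0 || ev == WAIT_FAILED)
            return FALSE;
        }
      }
    }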

@vicuna (Collaborator) commented Mar 4, 2019

Comment author: @alainfrisch

One could also, of course, request a finer clock resolution in our emulation of select, but I feel it is bad style to affect a global system setting (well, it's not as if others hesitate to do it, apparently).

@nojb referenced this issue Mar 15, 2019: [RFC/WIP] Threadless scheduler #1883 (Closed)

@nojb added the windows label Mar 15, 2019
