100% nginx usage #80

Closed
eustas opened this issue Sep 3, 2019 · 51 comments · Fixed by #100

Comments

@eustas (Collaborator) commented Sep 3, 2019

See eustas#30

@moderndeveloperllc commented:

@eustas Does this mean that you've officially taken over maintenance of this at Google? Great news if so. I think some other developers might have been waiting for the "official" repo to be considered active before progressing.

@eustas (Collaborator) commented Sep 23, 2019

Yes, finally the brotli team is able to support ngx_brotli =) Sorry for the super-late response; I thought I had already answered, but perhaps that was in another issue =)

@eustas (Collaborator) commented Sep 23, 2019

@kapouer @HansVanEijsden @hroost does it still happen after #83?

@HansVanEijsden commented:

> @kapouer @HansVanEijsden @hroost does it still happen after #83?

Currently I'm testing it on two production servers. I compiled and enabled it 15 minutes ago. Up and running with no issues yet. I'll let it run for a few hours and report back here. Thanks!

@HansVanEijsden commented Sep 23, 2019

Unfortunately, I still have the problem. After ~18 and ~30 minutes Nginx was hanging at 100% on one of the four cores, and as a consequence 25% of the web requests were hanging too, without getting any response from Nginx on that core. After 40 seconds the CPU usage went back to normal.
About 20 minutes later the CPU usage went to 100% on one core again, and after ~40 seconds it went back to normal. As I'm typing, 5 minutes later, CPU is at 100% again, so I will have to disable ngx_brotli in my Nginx immediately.

My Nginx error log shows this:

2019/09/23 21:45:58 [alert] 17909#17909: worker process 17910 exited on signal 9
2019/09/23 21:48:32 [alert] 28839#28839: worker process 28840 exited on signal 9
2019/09/23 22:01:26 [alert] 29478#29478: worker process 29480 exited on signal 9
2019/09/23 22:24:19 [alert] 5144#5144: *4129 open socket #64 left in connection 27
2019/09/23 22:24:19 [alert] 5144#5144: aborting
2019/09/23 22:24:22 [alert] 5143#5143: worker process 5147 exited on signal 9
2019/09/23 22:28:16 [alert] 17471#17471: worker process 17473 exited on signal 9

The 100% CPU problems appeared exactly on those timestamps.

My Nginx Brotli config:

	brotli on;
	brotli_static on;
	brotli_types text/css text/x-component application/x-javascript application/javascript text/javascript text/x-js text/richtext image/svg+xml text/plain text/xsd text/xsl text/xml image/bmp application/java application/msword application/vnd.ms-fontobject application/x-msdownload image/x-icon image/webp application/json application/vnd.ms-access video/webm application/vnd.ms-project application/x-font-otf application/vnd.ms-opentype application/vnd.oasis.opendocument.database application/vnd.oasis.opendocument.chart application/vnd.oasis.opendocument.formula application/vnd.oasis.opendocument.graphics application/vnd.oasis.opendocument.spreadsheet application/vnd.oasis.opendocument.text audio/ogg application/pdf application/vnd.ms-powerpoint application/x-shockwave-flash image/tiff application/x-font-ttf audio/wav application/vnd.ms-write application/font-woff application/font-woff2 application/vnd.ms-excel;

@chipitsine commented:

Do you have core dumps enabled?

It can easily be achieved by using

working_directory /root; # or some other writable directory
worker_rlimit_core 99999M;

@HansVanEijsden commented Sep 23, 2019

> Do you have core dumps enabled?
>
> It can easily be achieved by using
>
> working_directory /root; # or some other writable directory
> worker_rlimit_core 99999M;

Yes I have, thanks for the suggestion. It's strange: no core dumps are being generated.

worker_rlimit_core 99999M;
working_directory   /tmp/;

Nothing. Also dmesg -T doesn't show anything: no Nginx crashes and/or segfaults. Nginx just keeps on hanging at 100% CPU, and/or quits (with the other workers going back to normal) after ~40 seconds without errors.

Do you have any other suggestions?

@PiotrSikora (Contributor) commented:

@chipitsine /root/ isn't writable by the worker processes, so you wouldn't get core dumps there.

@HansVanEijsden dmesg wouldn't show NGINX crashes, but you can see them in the logs:

2019/09/23 21:45:58 [alert] 17909#17909: worker process 17910 exited on signal 9

(signal 9 is SIGKILL; the worker was killed)

@HansVanEijsden commented:

> @HansVanEijsden dmesg wouldn't show NGINX crashes, but you can see them in the logs:
>
> 2019/09/23 21:45:58 [alert] 17909#17909: worker process 17910 exited on signal 9
>
> (signal 9 is SIGKILL; the worker was killed)

That's good to know!
Another thing: I will compile Nginx with --with-debug and try again. Let's see if that will generate the core dumps. 😉

@chipitsine commented:

It depends on the user directive (user xxx;).

If nginx is started as root, it can write dumps to /root.

If nginx is started as another user, then another writable folder should be chosen.

@HansVanEijsden commented:

My Nginx runs as www-data with user www-data; and I have set working_directory /tmp/;. /tmp is writable by the www-data user.
I've built Nginx with --with-debug but still no core dumps in /tmp, although I got signal 9 again.
I'm out of ideas, but I'll look for other ways to get more information.
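
Two quick checks that often explain missing core dumps (a hedged sketch; the pid-file path is an assumption, adjust to your setup):

# If core_pattern is a pipe (e.g. systemd-coredump or apport), no file will
# appear in working_directory; the dump goes to that handler instead.
cat /proc/sys/kernel/core_pattern

# Check the core-file size limit actually inherited by the running process
# (0 means dumps are disabled for it despite worker_rlimit_core).
grep -i "core file" /proc/$(cat /run/nginx.pid)/limits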

@HansVanEijsden commented:

Another update: Monit was running on my servers and was automatically restarting Nginx after those ~40 seconds. That was probably what produced the signal 9 entries in the Nginx log.

My syslog:

Sep 24 14:45:42 www monit[543]: 'nginx' failed protocol test [HTTP] at [localhost]:80/ [TCP/IP] -- HTTP: Error receiving data -- Resource temporarily unavailable
Sep 24 14:45:42 www monit[543]: 'nginx' trying to restart
Sep 24 14:45:42 www monit[543]: 'nginx' stop: '/bin/systemctl stop nginx'
Sep 24 14:45:42 www systemd[1]: Stopping Nginx...
Sep 24 14:45:50 www systemd[1]: nginx.service: Succeeded.
Sep 24 14:45:50 www systemd[1]: Stopped Nginx.
Sep 24 14:45:50 www monit[543]: 'nginx' start: '/bin/systemctl start nginx'
Sep 24 14:45:50 www systemd[1]: Starting Nginx...
Sep 24 14:45:50 www systemd[1]: Started Nginx.

So, I disabled Monit and tried again. Nginx keeps hanging forever at 100%, without segfaults, without core dumps and without any other log entries.
I thought about enabling Nginx debug error logging but I'm afraid it generates too much log data on a production server.

@chipitsine commented:

Is it a production environment?

If not (i.e. a test env), can you provide SSH access to it?

@jdylag commented Sep 24, 2019

We have the same issue: Nginx 1.16.1 and libbrotli 1.0.7. We deployed ngx_brotli on ~150 servers (with 70k domains) and noticed 100% CPU usage on 3 servers. All servers have the same configuration.

We tried to find the domain / request responsible (the problem usually starts 1-2 hours after a restart) and managed to identify the domain; the last request processed by the hanging worker was always a 404.

If I repeat the query that resulted in the error, it never happens; the hang occurs on a random 404 after some time. This suggests the problem is linked to 404 handling. We are using custom error pages generated with SSI using the "include" directive (debug logs show SSI being processed just before brotli); see the sketch below.

Removing the "include" directive from the error page fixed the problem on all 3 servers; the other 147 servers are fine with the "include" directive and ngx_brotli.

Error log during problem (debug log level) shows repeated:

[debug] 22111#22111: *10257 http write filter 0000000000000000
[debug] 22111#22111: *10257 http write filter: l:0 f:0 s:0
[debug] 22111#22111: *10257 http write filter limit 0
[debug] 22111#22111: *10257 http write filter 0000000000000000
[debug] 22111#22111: *10257 http write filter: l:0 f:0 s:0
[debug] 22111#22111: *10257 http write filter limit 0
[debug] 22111#22111: *10257 http write filter 0000000000000000
[debug] 22111#22111: *10257 http write filter: l:0 f:0 s:0
[debug] 22111#22111: *10257 http write filter limit 0
[debug] 22111#22111: *10257 http write filter 0000000000000000
[debug] 22111#22111: *10257 http write filter: l:0 f:0 s:0
[debug] 22111#22111: *10257 http write filter limit 0
[debug] 22111#22111: *10257 http write filter 0000000000000000
[debug] 22111#22111: *10257 http write filter: l:0 f:0 s:0
[debug] 22111#22111: *10257 http write filter limit 0

I can provide the full nginx error log if needed.

@HansVanEijsden commented Sep 24, 2019

@chipitsine I cannot reproduce it in a test environment; it needs some traffic to trigger, as far as I know. Currently I'm reproducing it on a VPS with only one website (a local news website).
I can provide you with the complete Nginx configuration, just let me know.
I would love to provide SSH access, but root access only for people working directly on Nginx, Google or this project, because of my responsibilities.

I managed to capture an Nginx debug log while one of the processes was hanging at 100% CPU. I can't find anything special in that log though, just the normal epoll and thread things.

The only thing I could find, right before it went to 100% CPU, was this (an epoll_wait() error?):

2019/09/24 16:53:04 [debug] 24703#24703: worker cycle
2019/09/24 16:53:04 [debug] 24703#24703: epoll timer: 16256
2019/09/24 16:53:04 [debug] 24700#24700: epoll: fd:86 ev:0005 d:00007FCF8E1D2310
2019/09/24 16:53:04 [debug] 24700#24700: timer delta: 3712
2019/09/24 16:53:04 [debug] 24700#24700: worker cycle
2019/09/24 16:53:04 [debug] 24700#24700: epoll timer: 52032
2019/09/24 16:53:05 [debug] 24703#24703: epoll: fd:19 ev:0001 d:00007FCF8E1D1F30
2019/09/24 16:53:05 [debug] 24703#24703: timer delta: 1332
2019/09/24 16:53:05 [debug] 24703#24703: worker cycle
2019/09/24 16:53:05 [debug] 24703#24703: epoll timer: 14924
2019/09/24 16:53:05 [debug] 24703#24703: epoll: fd:26 ev:0001 d:00007FCF8E1D3671
2019/09/24 16:53:05 [debug] 24703#24703: shmtx lock
2019/09/24 16:53:05 [debug] 24703#24703: slab alloc: 171 slot: 5
2019/09/24 16:53:05 [debug] 24703#24703: slab alloc: 00007FCF86EF4F00
2019/09/24 16:53:05 [debug] 24703#24703: slab alloc: 128 slot: 4
2019/09/24 16:53:05 [debug] 24703#24703: slab alloc: 00007FCF86EF3800
2019/09/24 16:53:05 [debug] 24703#24703: shmtx unlock
2019/09/24 16:53:05 [debug] 24703#24703: timer delta: 16
2019/09/24 16:53:05 [debug] 24703#24703: worker cycle
2019/09/24 16:53:05 [debug] 24703#24703: epoll timer: 14908
2019/09/24 16:53:05 [debug] 24703#24703: epoll: fd:26 ev:0001 d:00007FCF8E1D3671
2019/09/24 16:53:05 [debug] 24703#24703: timer delta: 8
2019/09/24 16:53:05 [debug] 24703#24703: worker cycle
2019/09/24 16:53:05 [debug] 24703#24703: epoll timer: 14900
2019/09/24 16:53:05 [debug] 24703#24703: epoll: fd:26 ev:0001 d:00007FCF8E1D3671
2019/09/24 16:53:05 [debug] 24703#24703: timer delta: 8
2019/09/24 16:53:05 [debug] 24703#24703: worker cycle
2019/09/24 16:53:05 [debug] 24703#24703: epoll timer: 14892
2019/09/24 16:53:05 [debug] 24703#24703: epoll: fd:45 ev:2019 d:00007FCF8E1D2DB9
2019/09/24 16:53:05 [debug] 24700#24700: epoll: fd:163 ev:201D d:00007FCF8E1D2BC8
2019/09/24 16:53:05 [debug] 24703#24703: epoll_wait() error on fd:45 ev:2019
2019/09/24 16:53:05 [debug] 24700#24700: epoll_wait() error on fd:163 ev:201D
2019/09/24 16:53:05 [debug] 24703#24703: timer delta: 72
2019/09/24 16:53:05 [debug] 24703#24703: worker cycle
2019/09/24 16:53:05 [debug] 24703#24703: epoll timer: 14820
2019/09/24 16:53:05 [debug] 24700#24700: timer delta: 780
2019/09/24 16:53:05 [debug] 24700#24700: worker cycle
2019/09/24 16:53:05 [debug] 24700#24700: epoll timer: 51252
2019/09/24 16:53:05 [debug] 24700#24700: epoll: fd:162 ev:2019 d:00007FCF8E1D2DB8
2019/09/24 16:53:05 [debug] 24700#24700: epoll_wait() error on fd:162 ev:2019
2019/09/24 16:53:05 [debug] 24700#24700: *3397 http2 idle handler
2019/09/24 16:53:05 [debug] 24700#24700: *3397 reusable connection: 0
2019/09/24 16:53:05 [debug] 24700#24700: *3397 posix_memalign: 00007FCF8FE67000:4096 @16
2019/09/24 16:53:05 [debug] 24700#24700: *3397 http2 read handler
2019/09/24 16:53:05 [debug] 24700#24700: *3397 SSL_read: -1
2019/09/24 16:53:05 [debug] 24700#24700: *3397 SSL_get_error: 5
2019/09/24 16:53:05 [debug] 24700#24700: *3397 peer shutdown SSL cleanly
2019/09/24 16:53:05 [debug] 24700#24700: *3397 close http connection: 162
2019/09/24 16:53:05 [debug] 24700#24700: *3397 SSL_shutdown: 1
2019/09/24 16:53:05 [debug] 24700#24700: *3397 event timer del: 162: 1095055796
2019/09/24 16:53:05 [debug] 24700#24700: *3397 reusable connection: 0
2019/09/24 16:53:05 [debug] 24700#24700: *3397 run cleanup: 00007FCF8FE25BD0
2019/09/24 16:53:05 [debug] 24700#24700: *3397 free: 00007FCF8FE67000, unused: 4016
2019/09/24 16:53:05 [debug] 24700#24700: *3397 free: 0000000000000000
2019/09/24 16:53:05 [debug] 24700#24700: *3397 free: 00007FCF8FE61000
2019/09/24 16:53:05 [debug] 24700#24700: *3397 free: 00007FCF8FE25E00
2019/09/24 16:53:05 [debug] 24700#24700: *3397 free: 00007FCF8E450740
2019/09/24 16:53:05 [debug] 24700#24700: *3397 free: 00007FCF8FE25A00, unused: 8
2019/09/24 16:53:05 [debug] 24700#24700: *3397 free: 00007FCF8FE25C00, unused: 8
2019/09/24 16:53:05 [debug] 24700#24700: *3397 free: 00007FCF8FE27000, unused: 224
2019/09/24 16:53:05 [debug] 24700#24700: timer delta: 0
2019/09/24 16:53:05 [debug] 24700#24700: worker cycle
2019/09/24 16:53:05 [debug] 24700#24700: epoll timer: 52760
2019/09/24 16:53:05 [debug] 24703#24703: epoll: fd:35 ev:2019 d:00007FCF8E1D2310
2019/09/24 16:53:05 [debug] 24703#24703: epoll_wait() error on fd:35 ev:2019
2019/09/24 16:53:05 [debug] 24703#24703: *3396 http2 idle handler
2019/09/24 16:53:05 [debug] 24703#24703: *3396 reusable connection: 0
2019/09/24 16:53:05 [debug] 24703#24703: *3396 posix_memalign: 00007FCF8FE5D000:4096 @16
2019/09/24 16:53:05 [debug] 24703#24703: *3396 http2 read handler
2019/09/24 16:53:05 [debug] 24703#24703: *3396 SSL_read: -1
2019/09/24 16:53:05 [debug] 24703#24703: *3396 SSL_get_error: 5
2019/09/24 16:53:05 [debug] 24703#24703: *3396 peer shutdown SSL cleanly
2019/09/24 16:53:05 [debug] 24703#24703: *3396 close http connection: 35
2019/09/24 16:53:05 [debug] 24703#24703: *3396 SSL_shutdown: 1
2019/09/24 16:53:05 [debug] 24703#24703: *3396 event timer del: 35: 1095055796
2019/09/24 16:53:05 [debug] 24703#24703: *3396 reusable connection: 0
2019/09/24 16:53:05 [debug] 24703#24703: *3396 run cleanup: 00007FCF8FE4DBD0
2019/09/24 16:53:05 [debug] 24703#24703: *3396 free: 00007FCF8FE5D000, unused: 4016
2019/09/24 16:53:05 [debug] 24703#24703: *3396 free: 0000000000000000
2019/09/24 16:53:05 [debug] 24703#24703: *3396 free: 00007FCF8E4FDD00
2019/09/24 16:53:05 [debug] 24703#24703: *3396 free: 00007FCF8FE4DA00, unused: 8
2019/09/24 16:53:05 [debug] 24703#24703: *3396 free: 00007FCF8FE4DC00, unused: 144
2019/09/24 16:53:05 [debug] 24703#24703: timer delta: 0
2019/09/24 16:53:05 [debug] 24703#24703: worker cycle
2019/09/24 16:53:05 [debug] 24703#24703: epoll timer: 14820
2019/09/24 16:53:06 [debug] 24703#24703: epoll: fd:19 ev:0001 d:00007FCF8E1D1F30
2019/09/24 16:53:06 [debug] 24703#24703: timer delta: 968
2019/09/24 16:53:06 [debug] 24703#24703: worker cycle
2019/09/24 16:53:06 [debug] 24703#24703: epoll timer: 13852
2019/09/24 16:53:06 [debug] 24703#24703: epoll: fd:26 ev:0001 d:00007FCF8E1D2311
2019/09/24 16:53:06 [debug] 24703#24703: shmtx lock
2019/09/24 16:53:06 [debug] 24703#24703: slab alloc: 165 slot: 5
2019/09/24 16:53:06 [debug] 24703#24703: slab alloc: 00007FCF86EF5000
2019/09/24 16:53:06 [debug] 24703#24703: slab alloc: 128 slot: 4
2019/09/24 16:53:06 [debug] 24703#24703: slab alloc: 00007FCF86EF3880
2019/09/24 16:53:06 [debug] 24703#24703: shmtx unlock
2019/09/24 16:53:06 [debug] 24703#24703: timer delta: 104
2019/09/24 16:53:06 [debug] 24703#24703: worker cycle
2019/09/24 16:53:06 [debug] 24703#24703: epoll timer: 13748
2019/09/24 16:53:06 [debug] 24703#24703: epoll: fd:26 ev:0001 d:00007FCF8E1D2311
2019/09/24 16:53:06 [debug] 24703#24703: timer delta: 96
2019/09/24 16:53:06 [debug] 24703#24703: worker cycle

I don't know if it's something usable though.

Nginx 1.17.3 (and also 1.17.4, today's release) and libbrotli 1.0.7. An up-to-date Debian Stretch system.

# uname -a
Linux vps 4.19.0-0.bpo.5-amd64 #1 SMP Debian 4.19.37-5+deb10u2~bpo9+1 (2019-08-16) x86_64 GNU/Linux

@chipitsine commented:

@HansVanEijsden, please contact me in private (either email or Skype: chipitsine@gmail.com).
I'd like to get a core dump (if that is safe for your environment).

@HansVanEijsden commented:

@chipitsine done!

@PiotrSikora (Contributor) commented:

@HansVanEijsden 100% CPU won't result in a segfault (so those are 2 separate issues), but usually indicates that there is an infinite loop somewhere.

If you run NGINX under gdb, then you can hit ctrl+c once the process is at 100% CPU, and it should break in the middle of the infinite loop.
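
For example (a hedged sketch; the worker PID is an assumption, adjust to your setup):

# attach gdb to the spinning worker; attaching stops it immediately
gdb -p 24703

# or, if nginx was started under gdb, hit ctrl+c while it spins; then:
(gdb) bt          # backtrace of the loop the worker is stuck in
(gdb) continue    # resume the worker when done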

@dilyanpalauzov commented:

Does upgrading to nginx 1.16.1 help? See http://nginx.org/en/CHANGES-1.16

@DocumentRoot commented:

I tried using ngx_brotli half a year ago (from the eustas repo) and came across this bug with 100% CPU.

Can you clarify whether this is resolved?

@kapouer commented Nov 28, 2019

AFAICT the bug is still in nginx 1.16.1.
I'll check again.

@gzzz commented Dec 25, 2019

Same here.

Nginx 1.17.6, brotli 1.0.7.

@carlesjavierre mentioned this issue Jan 3, 2020

@awused (Contributor) commented Jan 9, 2020

I have run into this with the fancyindex module in a couple of large directories on my server. On those directories it reproduces 100% of the time with some clients and HTTP/2, and the partial truncated responses the browser receives always end at the same point regardless of specific compression settings.

Firefox is the only client I've found that works with brotli compression and HTTP/2. Chromium and curl do not, even when curl is sending the same headers as Firefox. When downgrading to HTTP/1.1 (whether on the client or by disabling it on the server) all clients work in every case I can test.

@awused (Contributor) commented Jan 23, 2020

I took a stab at it, but what I thought was an incompatibility between filters turned out to be a trickier bug. The filter doesn't handle the case where the input is too small to be compressed by brotli, and assumes that flushes are guaranteed to produce output. BrotliEncoderCompress falls back internally to MakeUncompressedStream for input that is too small, but there isn't a clean way to do the same with BrotliEncoderCompressStream. ngx_brotli only sets flush/last_buf on output buffers when there is output from the brotli library, so it can end up unable to make progress, leading to spinning (see the sketch below).

It ended up not being a simple or clean fix, so I'll leave it to the maintainer. I've seen this same bug crop up in other wrappers around brotli, now that I know the cause and symptoms. The BrotliEncoderCompressStream API seems very difficult to use safely when the input may be small.
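
To illustrate the failure mode described above, here is a standalone sketch (not ngx_brotli's actual code; it only models the pattern of a wrapper that waits for encoder output before it forwards a flush):

/* cc brotli_flush_demo.c -lbrotlienc -lbrotlicommon */
#include <brotli/encode.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    BrotliEncoderState *enc = BrotliEncoderCreateInstance(NULL, NULL, NULL);
    const uint8_t *next_in = (const uint8_t *)"x";  /* tiny input chunk */
    size_t avail_in = 1;
    uint8_t out[4096];

    for (int round = 0; round < 2; round++) {
        /* round 0: flush with one byte of input; round 1: flush with no new input */
        uint8_t *next_out = out;
        size_t avail_out = sizeof(out);
        BrotliEncoderCompressStream(enc, BROTLI_OPERATION_FLUSH,
                                    &avail_in, &next_in,
                                    &avail_out, &next_out, NULL);
        size_t produced = sizeof(out) - avail_out;
        /* A filter that only marks its out_buf with flush when produced > 0
         * (or when BrotliEncoderHasMoreOutput() is true) makes no progress
         * once a flush yields zero bytes, and keeps spinning in its loop. */
        printf("round %d: produced=%zu has_more=%d\n", round, produced,
               (int)BrotliEncoderHasMoreOutput(enc));
    }
    BrotliEncoderDestroyInstance(enc);
    return 0;
}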

@eustas (Collaborator) commented Jan 23, 2020

Thanks for the investigation. Will try to reproduce / fix when I get back from vacation.
I have high hopes that this long-standing blocker can be fixed at last.

@ktecho commented Feb 19, 2020

@eustas Any news on this one? I was thinking of putting this on our server, but this is a bit scary.

Thanks!

@eustas (Collaborator) commented Mar 3, 2020

Hi. I'm finally back =) Will work on that this week

@eustas (Collaborator) commented Mar 4, 2020

Added basic http2 tests. Still cannot reproduce (with curl --http2-prior-knowledge -s -H 'Accept-encoding: br').
Could you share a minimised repro configuration, please?

@awused (Contributor) commented Mar 4, 2020

The only consistent repro I have happens to involve too much private data. Throwing random data of sufficient length at it should be enough to hit it eventually; the bug is such that any server running this on dynamic data can expect to run into it. But you can probably trigger it artificially with test cases more easily than you can generate random data to reproduce it (I guess there are no unit tests). It's the same bug as #81 but on flush instead of end_of_input; instead of truncating the output it enters an infinite loop.

Code inspection is enough to tell that the code does not behave properly when a flush is requested but BrotliEncoderHasMoreOutput remains false after writing all input up to the flush request. The flush flag on the out_buf only gets set when output is read from the compressor.

@eustas (Collaborator) commented Mar 4, 2020

Hmmm... Brotli obeys flushing, so if there was some input (after pushing the last output block) it will produce output again. On the other hand, it won't produce more output if no more input has arrived and just another flush is requested... Will try to reproduce that. Thanks.

@eustas (Collaborator) commented Mar 4, 2020

> The flush flag on the out_buf only gets set when output is read from the compressor.

So, brotli should be a second-level compressor then? ngx_brotli marks a buffer with flush only if the corresponding input buffer was marked so. So which nginx module could help me simulate this?

(alternatively, I could try to activate some scripting, e.g. php, and try to force a flush out there)
(also, perhaps http2 multiplexing somehow triggers that; not sure how to check...)

@eustas (Collaborator) commented Mar 5, 2020

Was able to set up http2 + ssl + perl in docker.
The Perl script forces the flush. Everything works fine.
Consequently, there is something about the filter stack: some filters do not collaborate as expected.

Will try to set up fancyindex...

PS:
nginx.conf:

load_module modules/ngx_http_brotli_filter_module.so;
load_module modules/ngx_http_perl_module.so;

events {
  worker_connections 4;
}

daemon on;

http {
  gzip on;
  gzip_comp_level 1;
  gzip_types text/plain text/css;

  brotli on;
  brotli_comp_level 3;
  brotli_types text/plain text/css;

  perl_require hello.pm;

  server {
    listen                     8443 ssl http2;
    server_name                localhost;

    ssl                        on;
    ssl_protocols              TLSv1 TLSv1.1 TLSv1.2;
    ssl_certificate            nginx.crt;
    ssl_certificate_key        nginx.key;

    root ./;

    index index.html;

    location / {
      perl hello::handler;
    }
  }
}

hello.pm:

package hello;

use nginx;

sub handler {
    my $r = shift;

    $r->send_http_header("text/html");
    return OK if $r->header_only;

    $r->print("head\n<br/>");
    for (my $i=0; $i <= 9; $i++) {
        $r->print("body\n<br/>");
        $r->flush();
    }
    $r->print("tail\n<br/>");

    return OK;
}

1;
__END__

@eustas (Collaborator) commented Mar 5, 2020

(though I believe filters below compression should not affect the workflow, as we've seen with perl; most likely it is something above the compression; @awused, what other filters / modules are configured?)

@HansVanEijsden commented Mar 5, 2020

@eustas thanks for looking into it.
For what it's worth, this is my nginx -V on a production machine which gives these problems when Brotli is enabled:

nginx version: nginx/1.17.9 built by gcc 6.3.0 20170516 (Debian 6.3.0-18+deb9u1) built with OpenSSL 1.1.1d 10 Sep 2019 TLS SNI support enabled configure arguments: --prefix=/opt/nginx --user=www-data --group=www-data --with-http_v2_module --with-http_ssl_module --with-http_v2_hpack_enc --with-openssl=/usr/local/src/openssl-1.1.1d --with-openssl-opt='enable-ec_nistp_64_gcc_128 enable-rfc3779 enable-cms no-tests -DCFLAGS='-O3 -march=native -flto -fuse-linker-plugin'' --with-pcre-jit --with-file-aio --with-http_flv_module --with-http_geoip_module --with-http_gunzip_module --with-http_mp4_module --with-http_realip_module --with-http_stub_status_module --with-threads --with-libatomic --with-zlib=/usr/local/src/zlib-cloudflare --with-zlib-opt='-O3 -march=native -flto -fuse-linker-plugin' --with-http_gzip_static_module --add-module=/usr/local/src/headers-more-nginx-module --add-module=/usr/local/src/echo-nginx-module --add-module=/usr/local/src/ngx_http_substitutions_filter_module --add-module=/usr/local/src/srcache-nginx-module --add-module=/usr/local/src/redis2-nginx-module --add-module=/usr/local/src/ngx_http_redis-0.3.9 --add-module=/usr/local/src/ngx_devel_kit --add-module=/usr/local/src/set-misc-nginx-module --add-module=/usr/local/src/ngx_brotli --with-cc-opt='-DTCP_FASTOPEN=23 -march=native -flto -O3 -fuse-linker-plugin -Wno-error=strict-aliasing -fstack-protector-strong -D_FORTIFY_SOURCE=2' --with-ld-opt='-lrt -z relro -fstack-protector-strong'

System (uname -a):
Linux www391 4.19.0-0.bpo.6-amd64 #1 SMP Debian 4.19.67-2+deb10u2~bpo9+1 (2019-11-12) x86_64 GNU/Linux

@awused (Contributor) commented Mar 6, 2020

I played around with it some more. The particular location block where I have a reproduction is also running http_image_filter. If I disable that filter it changes where in the response it happens to spin but doesn't change the nature of the spin. The only other filters running are fancyindex and brotli.

What I do see is that brotli is calling ngx_http_next_body_filter while holding an out_buf it believes to be of non-zero size but http_write_filter_module thinks it has been called with a chain of size 0. This is repeated in a tight loop.

2020/03/05 21:27:03 [debug] 7323#101520: *3 brotli out_buf: 000000080C9FA7C8, size:87482
2020/03/05 21:26:34 [debug] 7317#101644: *2 http write filter limit 0
2020/03/05 21:27:03 [debug] 7323#101520: *3 http write filter: l:0 f:0 s:0
2020/03/05 21:26:34 [debug] 7317#101644: *2 http write filter 0000000000000000
2020/03/05 21:27:03 [debug] 7323#101520: *3 http write filter limit 0

I still believe there's a problem with empty output compressor flushes, but that might not be the cause of this error.

@eustas (Collaborator) commented Mar 6, 2020

That is interesting: limit == 0 -> either sendfile_max_chunk == 0, or r->limit_rate * (ngx_time() - r->start_sec + 1) == c->sent - r->limit_rate_after (unlikely).

@eustas (Collaborator) commented Mar 6, 2020

NB: also, there is no place in the code that logs like "brotli out_buf: X, size: Y", only "brotli out: X, size: Y"

@awused (Contributor) commented Mar 6, 2020

That was something I added immediately before the call to ngx_http_next_body_filter.
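
For context, such a one-off debug line could look roughly like this (a hedged guess, not the actual patch; the local names out_buf and out are assumptions about the filter's variables):

/* inside ngx_http_brotli_body_filter, right before handing off downstream */
ngx_log_debug2(NGX_LOG_DEBUG_HTTP, r->connection->log, 0,
               "brotli out_buf: %p, size:%uz",
               out_buf, (size_t) ngx_buf_size(out_buf));
rc = ngx_http_next_body_filter(r, &out);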

@eustas (Collaborator) commented Mar 6, 2020

Please add flags to that output: ctx->output_ready and ctx->output_busy. Depending on them, either the buffer or NULL is passed.

Tried https + h2 + fancyindex. Works well on folders with 10k, 20k, 30k, 40k and 50k empty files with UUID names.

@eustas (Collaborator) commented Mar 6, 2020

If it is an infinite loop with ctx->output_busy set, then the "above" filters do not consume the brotli output and hold the buffer. The brotli filter follows the "zero copy" paradigm, so it cannot proceed until the previous output is fully consumed.

If that is the case, we could check which filter in the filter stack holds the buffer. If those cannot be fixed, then we could request a flush in situations like that and see if this helps.

@splitice (Contributor) commented:

I've done some testing and:

  1. This issue still occurs with nginx/1.17.10
  2. I was only able to replicate the issue when there is blocking due to output pressure.

Setup for testing: openresty + brotli (over http 1.1). brunzip compiled in but not enabled.

@eustas it sounds like you were on to something on Mar 6. Do you have anything I can test? I'd be happy to help in that way.

eustas pushed a commit that referenced this issue Apr 16, 2020
eustas reopened this Apr 16, 2020
@splitice (Contributor) commented:

Thank you @eustas I will test in the next few days (tomorrow maybe) and report back. Crazy busy at the moment (in part due to trying to fix similar issues with my ngx_brunzip) but I promise I'll get back to you ASAP.

@eustas (Collaborator) commented Apr 16, 2020

Thanks to @awused, we've made some progress on this issue.

Hey @splitice, @HansVanEijsden, @ktecho, @kapouer, @gzzz, @jdylag, I would very much appreciate your feedback on whether the problem is solved now!

@HansVanEijsden commented:

Pushing it to production now. Will keep you up-to-date! 👍🏻 Thanks!

@HansVanEijsden commented:

@eustas, after having it in production for 24 hours, with production traffic, all seems to be stable and functioning. Thanks @awused for your commit, it looks like it did the trick! 🎉

@eustas (Collaborator) commented Apr 17, 2020

Woo-hoo =)

@eustas (Collaborator) commented Apr 17, 2020

Nice moment to cut a release #90 =)

@splitice (Contributor) commented:

This does not fix the issue.

A perf result after the upgrade, taken on a spinning worker, shows that it's still brotli doing the spinning.

    55.89%  nginx    nginx              [.] ngx_http_brotli_body_filter
    23.65%  nginx    nginx              [.] ngx_http_gzip_body_filter
    18.31%  nginx    nginx              [.] ngx_http_write_filter
     2.15%  nginx    nginx              [.] ngx_http_chunked_body_filter

Our workload typically means output pressure as the backends are fast.
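
A profile like the one above can be captured with perf (a sketch; the worker PID is an assumption):

# sample the spinning worker for ~10 seconds, then print the hottest functions
perf record -g -p 24703 -- sleep 10
perf report --stdio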

@gzzz commented Apr 20, 2020

nginx/1.17.10 with the latest ngx_brotli in filter mode: 3 days running without problems.

@HansVanEijsden commented:

@gzzz I can confirm the same in my case: all the servers in production (even without restarts or reloads of nginx) have been running continuously without problems.

@splitice (Contributor) commented:

Actually, ignore my report; I incorrectly cherry-picked the commit. Everything has been good in testing for 12 hours now.

eustas closed this as completed Apr 23, 2020