gzip compression leaking half closed sockets and consuming excess CPU #327

Closed
cspargo opened this issue Mar 19, 2018 · 7 comments

cspargo commented Mar 19, 2018

When I use the gzip compression option between two carbon-c-relays (version 3.2), the receiving relay eventually starts consuming excessive CPU (all CPU cores on a 24-core host).

On the receiving side, I see sockets in the lsof output listed as "can't identify protocol":

# lsof -p 22682 -nP | egrep sock\|TCP
carbon-c- 22682 root    0u  IPv4    4781151      0t0      TCP *:9997 (LISTEN)
carbon-c- 22682 root    3u  IPv6    4781152      0t0      TCP *:9997 (LISTEN)
carbon-c- 22682 root    4u  sock        0,6      0t0 23657600 can't identify protocol
carbon-c- 22682 root    5u  sock        0,6      0t0 23661545 can't identify protocol
carbon-c- 22682 root    6u  sock        0,6      0t0 23671203 can't identify protocol
carbon-c- 22682 root    9u  sock        0,6      0t0 23652483 can't identify protocol
carbon-c- 22682 root   10u  sock        0,6      0t0 23675464 can't identify protocol
carbon-c- 22682 root   11u  sock        0,6      0t0 23668344 can't identify protocol
carbon-c- 22682 root   12u  sock        0,6      0t0 23663455 can't identify protocol
carbon-c- 22682 root   13u  sock        0,6      0t0 23665000 can't identify protocol
carbon-c- 22682 root   14u  sock        0,6      0t0 23679405 can't identify protocol
carbon-c- 22682 root   15u  sock        0,6      0t0 23684928 can't identify protocol
carbon-c- 22682 root   16u  sock        0,6      0t0 23687407 can't identify protocol
carbon-c- 22682 root   17u  sock        0,6      0t0 23706969 can't identify protocol
carbon-c- 22682 root   18u  sock        0,6      0t0 23683831 can't identify protocol
carbon-c- 22682 root   19u  sock        0,6      0t0 53190358 can't identify protocol
carbon-c- 22682 root   20u  sock        0,6      0t0 53197318 can't identify protocol
carbon-c- 22682 root   21u  sock        0,6      0t0 53201568 can't identify protocol
carbon-c- 22682 root   22u  sock        0,6      0t0 53188454 can't identify protocol
carbon-c- 22682 root   23u  sock        0,6      0t0 53195391 can't identify protocol
carbon-c- 22682 root   24u  sock        0,6      0t0 53203650 can't identify protocol
carbon-c- 22682 root   25u  sock        0,6      0t0 53205028 can't identify protocol
carbon-c- 22682 root   26u  sock        0,6      0t0 53210324 can't identify protocol
carbon-c- 22682 root   27u  sock        0,6      0t0 53254293 can't identify protocol
carbon-c- 22682 root   28u  IPv4   62627321      0t0      TCP x.x.x.x:9997->y.y.y.y:54162 (ESTABLISHED)
carbon-c- 22682 root   29u  IPv4   62627328      0t0      TCP 127.0.0.1:32990->127.0.0.1:2013 (ESTABLISHED)
carbon-c- 22682 root   30u  IPv4   62627344      0t0      TCP x.x.x.x:55240->a.a.a.a:2013 (ESTABLISHED)
carbon-c- 22682 root   31u  sock        0,6      0t0 82855677 can't identify protocol
carbon-c- 22682 root   32u  sock        0,6      0t0 82860276 can't identify protocol
carbon-c- 22682 root   33u  sock        0,6      0t0 82864219 can't identify protocol
carbon-c- 22682 root   34u  sock        0,6      0t0 82870051 can't identify protocol
carbon-c- 22682 root   35u  sock        0,6      0t0 82873013 can't identify protocol
carbon-c- 22682 root   36u  sock        0,6      0t0 82895669 can't identify protocol

These sockets are not shown in netstat:

# netstat -anp | grep 22682/
tcp        0      0 0.0.0.0:9997            0.0.0.0:*               LISTEN      22682/carbon-c-rela
tcp        0      0 x.x.x.x:55240      a.a.a.a:2013        ESTABLISHED 22682/carbon-c-rela
tcp        0      0 127.0.0.1:32990         127.0.0.1:2013          ESTABLISHED 22682/carbon-c-rela
tcp        0      0 x.x.x.x:9997       y.y.y.y:54162      ESTABLISHED 22682/carbon-c-rela
tcp6       0      0 :::9997                 :::*                    LISTEN      22682/carbon-c-rela

I am also seeing some corrupt metrics when this happens. If I switch back to not using gzip, then these issues go away.

grobian (Owner) commented Mar 19, 2018

Hmm, thanks, I'll have a look; that indeed doesn't look healthy.

grobian (Owner) commented Mar 20, 2018

Are you sure these are stale handles? Does the list grow over time?

cspargo (Author) commented Mar 22, 2018

The list does grow over time, and it corresponds with the CPU load increasing, which seems to happen in big increments. This graph shows it increasing roughly daily:

[image: cpuload graph]

If I strace one of the threads, it seems to be constantly cycling through the FDs doing reads. I guess that's what's consuming all the CPU:

10:47:16.184448 read(7, "", 7677)       = 0
10:47:16.184574 read(10, "", 7677)      = 0
10:47:16.184614 read(12, "", 7506)      = 0
10:47:16.184649 read(13, "", 8128)      = 0
10:47:16.184682 read(14, "", 8188)      = 0
10:47:16.184715 read(16, "", 7677)      = 0
10:47:16.184743 read(17, "", 7677)      = 0
10:47:16.184772 read(18, "", 7677)      = 0
10:47:16.184803 read(21, "", 7677)      = 0
10:47:16.184836 read(22, "", 7582)      = 0
10:47:16.184871 read(23, "", 8176)      = 0
10:47:16.184898 read(24, "", 8175)      = 0
10:47:16.184937 read(28, "", 7997)      = 0
10:47:16.184973 read(29, "", 8155)      = 0
10:47:16.185004 read(30, "", 7582)      = 0
10:47:16.185033 read(31, "", 8176)      = 0
10:47:16.185065 read(32, "", 8175)      = 0
10:47:16.185091 read(33, "", 7997)      = 0
10:47:16.185116 read(34, "", 8155)      = 0
10:47:16.185140 read(35, "", 7677)      = 0
10:47:16.185164 read(36, "", 7582)      = 0
10:47:16.185189 read(37, "", 8176)      = 0
10:47:16.185214 read(38, "", 8175)      = 0
10:47:16.185239 read(42, "", 8188)      = 0
10:47:16.185360 read(7, "", 7677)       = 0
10:47:16.185385 read(8, 0x7f1a700091c9, 8191) = -1 EAGAIN (Resource temporarily unavailable)
10:47:16.185423 read(10, "", 7677)      = 0
10:47:16.185448 read(12, "", 7506)      = 0
10:47:16.185473 read(13, "", 8128)      = 0
10:47:16.185497 read(14, "", 8188)      = 0
10:47:16.185522 read(16, "", 7677)      = 0
10:47:16.185548 read(20, "", 8155)      = 0
10:47:16.185579 read(21, "", 7677)      = 0
10:47:16.185624 read(22, "", 7582)      = 0

All of these FDs show up in lsof with "can't identify protocol":

...
carbon-c- 17207 root   20u  sock        0,6      0t0 2775789023 can't identify protocol
carbon-c- 17207 root   21u  sock        0,6      0t0 2775798366 can't identify protocol
carbon-c- 17207 root   22u  sock        0,6      0t0 2775809336 can't identify protocol
carbon-c- 17207 root   23u  sock        0,6      0t0 2775816698 can't identify protocol
...

and in /proc/17207/fd

...
lrwx------ 1 root root 64 Mar 22 10:15 20 -> socket:[2775789023]
lrwx------ 1 root root 64 Mar 22 10:15 21 -> socket:[2775798366]
lrwx------ 1 root root 64 Mar 22 10:15 22 -> socket:[2775809336]
lrwx------ 1 root root 64 Mar 22 10:15 23 -> socket:[2775816698]
...
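
The strace pattern above, where read() keeps returning 0 on the same descriptors, is what a busy loop on half-closed sockets looks like: a return of 0 from read() means the peer has closed its side, and if that return is not treated as end-of-stream the event loop just keeps retrying. The following is a minimal, illustrative sketch of that failure mode in C; it is deliberately simplified and is not carbon-c-relay's actual dispatcher code.

/* Illustrative sketch only (not the relay's real dispatcher code):
 * on a non-blocking TCP socket, read() returning 0 means the peer has
 * closed its side.  If the loop only stops on EAGAIN and never reacts
 * to the 0 return, it re-reads the same half-closed FDs forever, which
 * matches the read(fd, "", ...) = 0 churn in the strace above. */
#include <errno.h>
#include <unistd.h>

static int drain_socket(int fd)
{
    char buf[8192];

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n > 0)
            continue;            /* data is now in buf; a real dispatcher
                                    would hand it to the parser here */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return 0;            /* nothing more right now; wait for the
                                    next poll/event before reading again */
        /* n == 0 (EOF) or a hard error: the connection is gone.
         * Skipping this branch is what leaves the FD spinning forever. */
        close(fd);
        return -1;
    }
}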

grobian (Owner) commented Mar 22, 2018

Can you share your config? That might give me some clues.

cspargo (Author) commented Mar 26, 2018

OK, what I worked out is that one of the source carbon-c-relays sending to the receiver was passing through a haproxy TCP proxy (due to some firewall/network challenges). I was recently able to replace that proxy with a carbon-c-relay, and the problem went away.

So the flow was: data sources -> carbon-c-relay -> haproxy TCP proxy -> carbon-c-relay

I'm not sure why that caused the symptoms I was seeing, but I'm going to close this issue as it's not a problem now.

cspargo closed this as completed Mar 26, 2018
grobian (Owner) commented Mar 26, 2018

Hmm, ok, so it is the closing behaviour of haproxy that the relay doesn't handle very well. I'll see if I can find something using that approach.

grobian added a commit that referenced this issue on Mar 26, 2018:

Ensure EOF is visible to the main dispatcher loop, else we keep trying to read and decompress, but in vain, just as in issue #327.
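
The commit message points at the gzip read path: if the compression layer swallows the EOF from the underlying read(), the dispatcher never learns the connection is gone and keeps trying to read and decompress nothing. Below is a hedged sketch of that idea, not the actual patch; struct conn, its eof flag, and gzip_read are illustrative names, not the relay's real structures or API.

/* Hedged sketch of the approach the commit message describes: record
 * EOF from the underlying read() on the connection state so the main
 * dispatcher loop can see it and close the socket, instead of calling
 * inflate() on an already-closed stream over and over.  All names here
 * are made up for illustration. */
#include <unistd.h>
#include <zlib.h>

struct conn {
    int fd;
    z_stream z;        /* initialised elsewhere, e.g. via inflateInit2() */
    int eof;           /* set once the peer has closed its side */
    char in[8192];
    char out[8192];
};

static ssize_t gzip_read(struct conn *c)
{
    ssize_t n = read(c->fd, c->in, sizeof(c->in));
    int zret;

    if (n == 0) {
        c->eof = 1;    /* make the EOF visible to the dispatcher */
        return 0;
    }
    if (n < 0)
        return -1;     /* EAGAIN or a real error; the caller decides */

    c->z.next_in = (Bytef *)c->in;
    c->z.avail_in = (uInt)n;
    c->z.next_out = (Bytef *)c->out;
    c->z.avail_out = sizeof(c->out);
    zret = inflate(&c->z, Z_NO_FLUSH);
    if (zret != Z_OK && zret != Z_STREAM_END)
        return -1;

    return (ssize_t)(sizeof(c->out) - c->z.avail_out);
}

/* dispatcher side: if (c->eof) { inflateEnd(&c->z); close(c->fd); } */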
grobian (Owner) commented Mar 26, 2018

I think I found the culprit; thanks for your details/analysis.
