Graphite TCP input -- questions about performance #3741

Closed · pauldix opened this issue Aug 19, 2015 · 13 comments

pauldix commented Aug 19, 2015

The UDP listener will take a batch of lines: https://github.com/influxdb/influxdb/blob/master/services/graphite/service.go#L213-L220

The TCP listener should do the same; otherwise, performance will be much poorer.
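For reference, the linked UDP path roughly does the following: read a whole datagram, split it into lines, parse each line, and hand every resulting point to the batcher before the next read. A minimal, hypothetical Go sketch of that pattern (the Point type, parse function, and batcher channel are placeholders, not the actual service types):

```go
package graphitesketch

import (
	"net"
	"strings"
)

// Point stands in for a parsed Graphite data point (placeholder type).
type Point struct{ Raw string }

// udpHandleSketch mirrors the batched UDP path: one ReadFromUDP yields a
// whole datagram, which is split into lines; every successfully parsed
// point is handed to the batcher before the next read.
func udpHandleSketch(conn *net.UDPConn, parse func(string) (Point, error), batch chan<- Point) {
	buf := make([]byte, 65536)
	for {
		n, _, err := conn.ReadFromUDP(buf)
		if err != nil {
			return
		}
		for _, line := range strings.Split(string(buf[:n]), "\n") {
			if line == "" {
				continue
			}
			// Malformed lines are simply skipped.
			if p, perr := parse(line); perr == nil {
				batch <- p
			}
		}
	}
}
```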

otoolep commented Aug 20, 2015

I can take a look at this; it should be very easy.

otoolep self-assigned this Aug 20, 2015
otoolep commented Aug 20, 2015

@pauldix and I took a look at this code, and there doesn't actually appear to be an issue.

Whether the code loops reading the buffer and injects a point into the batcher on every read, or splits the buffer and loops over the resulting lines, injecting each point into the batcher, there should be no difference. For high performance, batching must be enabled, i.e. batch sizes greater than 1, ideally in the hundreds or thousands.
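To make the batching idea concrete, here is a minimal sketch of a size-or-timeout batcher. This is an illustration only (it is not the actual tsdb.PointBatcher); it reuses the placeholder Point type from the sketch in the first comment and needs the standard time package.

```go
// batchSketch collects points from in and emits a slice downstream either
// when size points have accumulated or when the flush interval elapses
// with a partial batch. Illustration only, not the real InfluxDB batcher.
func batchSketch(in <-chan Point, out chan<- []Point, size int, interval time.Duration) {
	batch := make([]Point, 0, size)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	flush := func() {
		if len(batch) > 0 {
			out <- batch
			batch = make([]Point, 0, size)
		}
	}

	for {
		select {
		case p, ok := <-in:
			if !ok {
				flush() // input closed: emit whatever is left
				return
			}
			batch = append(batch, p)
			if len(batch) >= size {
				flush() // size threshold reached
			}
		case <-ticker.C:
			flush() // interval elapsed: emit a partial batch rather than sit on it
		}
	}
}
```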

otoolep commented Aug 20, 2015

#3772 is intended to help us understand how the system is behaving.

otoolep commented Aug 20, 2015

We don't believe there is an issue here, and I have confirmed that batching is at 1,000 by default.

otoolep closed this as completed Aug 20, 2015
pkittenis commented Aug 21, 2015

@otoolep, @pauldix -- it looks like the root cause is the way TCP connections are handled in the graphite plugin, not batching.

If the TCP handler loops over the message line by line and there is a large number of lines in the message, won't the TCP connection be blocked for the duration of line parsing? The batch size in the plugin does not matter; connection handling happens before messages are put into the batcher.

The blocked connections will time out and be put into a [FIN/CLOSE]_WAIT state by the kernel, causing client connection failures. If clients auto-reconnect, influxdb eventually runs out of sockets, effectively DoSing itself until the kernel cleans up the waiting sockets.

This does not happen with UDP, as there is only one UDP socket, which does not need closing.
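To make the concern concrete, here is a hypothetical sketch of the pattern being described (not the actual plugin code; it assumes net, bufio, and strings imports and the placeholder Point type from the first sketch): the goroutine that owns the connection only gets back to reading after it has parsed and enqueued the previous line, so a slow parse/batch path stalls the socket.

```go
// tcpHandleSketch illustrates the blocking concern: parsing and batching
// happen inline in the read loop, so while they are in progress nothing
// is read from the socket and the client's sends back up.
func tcpHandleSketch(conn net.Conn, parse func(string) (Point, error), batch chan<- Point) {
	defer conn.Close()
	r := bufio.NewReader(conn)
	for {
		line, err := r.ReadString('\n')
		if err != nil {
			return // EOF or read error: drop the connection
		}
		p, perr := parse(strings.TrimSpace(line))
		if perr != nil {
			continue // skip malformed lines
		}
		batch <- p // if the batcher backs up, the read loop blocks here
	}
}
```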

To replicate, with the default batch size: run the graphite plugin listening on TCP port 2013, a relay (carbon-c-relay: https://github.com/grobian/carbon-c-relay) on port 2003 forwarding to influxdb (so you can see the connection failures), and the load test client (https://gist.github.com/pkittenis/3ef48fc062902840d749):

python graphite_test_client.py -c 200 -d0 --port 2003 127.0.0.1 3600

# On another terminal, check connections to the influxdb listener
$ netstat -an | grep 2013 | wc -l
30

# Connections in wait state. This increases as more connections are opened to replace the timed out ones.
$ netstat -an | grep 2013 | grep '_WAIT' | wc -l
81
$ netstat -an | grep 2013 | grep '_WAIT' | wc -l
97

The client sees connection failures, and more sockets go into the FIN_WAIT1 state.

[2015-08-21 10:56:26] failed to write() to 127.0.0.1:2013: uncomplete write
[2015-08-21 10:56:26] server 127.0.0.1:2013: OK
[2015-08-21 10:56:29] failed to write() to 127.0.0.1:2013: uncomplete write
[2015-08-21 10:56:30] server 127.0.0.1:2013: OK
[2015-08-21 10:56:33] failed to write() to 127.0.0.1:2013: uncomplete write
[2015-08-21 10:56:34] server 127.0.0.1:2013: OK
[2015-08-21 10:56:37] failed to write() to 127.0.0.1:2013: uncomplete write
[2015-08-21 10:56:38] server 127.0.0.1:2013: OK
[2015-08-21 10:56:41] failed to write() to 127.0.0.1:2013: uncomplete write
[2015-08-21 10:56:42] server 127.0.0.1:2013: OK
[2015-08-21 10:56:44] failed to write() to 127.0.0.1:2013: uncomplete write
[2015-08-21 10:56:44] server 127.0.0.1:2013: OK
[2015-08-21 10:56:47] failed to write() to 127.0.0.1:2013: uncomplete write
[2015-08-21 10:56:48] server 127.0.0.1:2013: OK
[2015-08-21 10:56:51] failed to write() to 127.0.0.1:2013: uncomplete write
[2015-08-21 10:56:52] server 127.0.0.1:2013: OK
[2015-08-21 10:56:55] failed to write() to 127.0.0.1:2013: uncomplete write
[2015-08-21 10:56:56] server 127.0.0.1:2013: OK

strace also shows connection timeouts on the influxdb process:

[pid 31834] select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
[pid 31842] futex(0xf021f8, FUTEX_WAKE, 1 <unfinished ...>
[pid 31833] futex(0xf021f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 31843] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 31833] futex(0xf021f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 31834] <... select resumed> )      = 0 (Timeout)
[pid 31834] select(0, NULL, NULL, NULL, {0, 10000}) = 0 (Timeout)
[pid 31842] futex(0xf021f8, FUTEX_WAKE, 1 <unfinished ...>
[pid 31834] select(0, NULL, NULL, NULL, {0, 10000} <unfinished ...>
[pid 31842] <... futex resumed> )       = 1
[pid 31833] <... futex resumed> )       = 0
[pid 31833] epoll_wait(5, {}, 128, 0)   = 0
[pid 31833] futex(0xf021f8, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 31834] <... select resumed> )      = 0 (Timeout)
[pid 31834] select(0, NULL, NULL, NULL, {0, 10000} <unfinished ...>
[pid 31843] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)

Hope this helps. Can raise a new ticket for this if you'd like.

otoolep commented Aug 21, 2015

Ok, thanks for the detailed analysis - we will take a look at this.


pkittenis commented Aug 21, 2015

Much appreciated :)

otoolep commented Aug 21, 2015

We should have seen this earlier, as we hit this before with collectd. We need to decouple parsing from reading bytes on the connection, for both UDP and TCP. It's an easy fix using a channel and a goroutine.

Thanks again @pkittenis
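
A minimal sketch of the decoupling described above, with a buffered channel between the socket read loop and a separate parsing goroutine (hypothetical names, reusing the placeholder Point type; the channel capacity of 1024 is an arbitrary illustrative value):

```go
// handleDecoupledSketch keeps the socket read loop lean: raw lines go onto
// a buffered channel, and a separate goroutine does the parsing and
// batching. Hypothetical sketch of the proposed fix, not the actual code.
func handleDecoupledSketch(conn net.Conn, parse func(string) (Point, error), batch chan<- Point) {
	defer conn.Close()
	lines := make(chan string, 1024) // buffer absorbs bursts of parse latency

	go func() {
		for line := range lines {
			if p, err := parse(line); err == nil {
				batch <- p
			}
		}
	}()

	r := bufio.NewReader(conn)
	for {
		line, err := r.ReadString('\n')
		if err != nil {
			close(lines) // stop the parser goroutine
			return
		}
		lines <- strings.TrimSpace(line)
	}
}
```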


otoolep commented Aug 21, 2015

Technically, we should decouple parsing from handling the socket, since there is no connection with UDP.


otoolep reopened this Aug 21, 2015
otoolep commented Aug 21, 2015

@pkittenis -- is there any special way the relay needs to be run? Default config?

otoolep commented Aug 21, 2015

@DanielMorsing

Related issues:

#2428
#2528

I decoupled parsing from reading from the TCP connection and saw about a 10% speed-up. We can implement this, but after further analysis and discussion, we're still not sure whether this really solves the problem, or whether there actually is one.

It should be noted that the code wraps the connection in bufio.NewReader which, as we understand it, reads kilobytes of data at a time, thereby ensuring that the network is serviced efficiently by the program. Most of the time, getting data from the Reader is very fast.
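For reference, bufio.NewReader uses a 4096-byte internal buffer by default, so one read syscall on the socket can satisfy many subsequent line reads; a larger buffer can be requested explicitly. Illustrative fragment only (the 64 KB figure is an arbitrary example, not what the plugin uses):

```go
// Default buffered reader: a 4 KB internal buffer, so a single Read from
// the socket typically serves many ReadString calls from memory.
r := bufio.NewReader(conn)

// A larger buffer can be requested explicitly if lines are long or the
// connection is very busy (64 KB here is arbitrary, for illustration).
r = bufio.NewReaderSize(conn, 64*1024)

line, err := r.ReadString('\n') // returns one newline-terminated metric line
```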

My understanding of TCP is as follows; please correct me if I am wrong.

If the server-side code processing a given TCP connection cannot keep up with the incoming traffic, then the client will see slow performance. But why would the system drop the connection just because upstream is not accepting the data as fast as the client would like to send it? The client will simply block. Connection failures have nothing to do with existing connections being serviced by the system. Furthermore, acceptance of new connections is completely decoupled in the code, because handleTCPConnection is run in a separate goroutine. If connection failures are occurring, they have nothing to do with servicing existing connections (except insofar as the server has no processing capacity left to accept new connections because it is busy servicing current ones).

@pkittenis -- you mention "blocked connections will time out and get put in a [fin/close]_wait state" -- can you provide more references about this?

Also, the performance of TCP was contrasted with the performance of UDP. How was this measured? Are you sure every single UDP packet made it into the system? The kernel has buffers that may overflow if the UDP socket is not serviced quickly enough. These drops can be detected using netstat (see issue #2528 for details). So UDP performance may look better, but that may be because data is being dropped.

No matter what we do, there will be a limit to how fast the system can service a given TCP connection. At some point that limit will be reached. We can make some improvements (perhaps 10%), but they will not be huge.

If I am missing something here, please let me know. The smoking gun here would be demonstrable evidence that UDP is working way faster -- sending exactly the same amount of data over both protocols, and having that data actually returned in queries so we know it all made it. @pkittenis -- did you see that?

otoolep changed the title from "Graphite TCP listener should be able to take multiple lines" to "Graphite TCP input -- questions about performance" on Aug 21, 2015
pkittenis commented:

| you mention "blocked connections will time out and get put in a [fin/close]_wait state" -- can you provide more references about this?

From Wikipedia:

| FIN-WAIT-1 (both server and client): represents waiting for a connection termination request from the remote TCP, or an acknowledgment of the connection termination request previously sent.

Basically, the client sees the unresponsive connection as a sign that the server is... unresponsive, and closes it on the client side (per the client-side timeout setting).

Since the connection is still open and being processed on the server side, it goes into a half-closed state.

The kernel will forcefully close such connections after some time -- that time, in seconds, being the value of /proc/sys/net/ipv4/tcp_fin_timeout on Linux.

I agree this is probably just an effect of slow write throughput and large batches of metrics sent by the client. There is not much the server can do if it's getting that much data and can't process it fast enough.

I did demonstrate UDP working faster than TCP on the mailing list, with a test client and relay. That was on 0.9.3-rc1; since then, however, performance has improved across the board. You will find a full test setup with the relay config at https://groups.google.com/forum/#!topic/influxdb/T6mF4_WeNkM

Still, a 10% speed-up is quite significant.

Thanks for the investigation.

otoolep commented Sep 1, 2015

FWIW, I am still working on this issue. Right now I am adding stats and diags to the system, and specifically instrumenting the Graphite input first. See #3916 for more details.

Once this is in place, I hope it will bring more quantitative details to the discussion about Graphite performance.
