
Input Beats in much slower than the LSF #45

Closed
ph opened this issue Jan 25, 2016 · 59 comments

@ph
Contributor

commented Jan 25, 2016

We have reports on Discuss and from other users that the beats input is much slower than its predecessor, LSF. See https://discuss.elastic.co/t/insufficient-throughput-from-filebeat/39564/11

@urso did some testing on his side to isolate the issue, and many factors point in the direction of the input.

The following tests were made with no filter and with the null output, on a Hetzner test machine (4-core i7 3.4 GHz, 32 GB of RAM).

Setup                EPS
LSF -> Lumberjack    30k lines/sec
Filebeat -> Beats    18k lines/sec

What changed drastically from LSF to Beats:

  • The protocol now uses a JSON frame which contains multiple events, and it's probably slower than our binary protocol.
  • We use partial ACKs: we ACK after every 20% of the events received, and this call is a blocking call (see the sketch below).
  • 2.1.X introduces more object allocations, but we have seen reports of the performance issue with previous versions too, which don't use the refactored code.
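
For illustration, a minimal sketch of the partial-ACK idea (ACK after every 20% of the window size). The connection object and send_ack method are hypothetical; this is not the plugin's actual code.

# Sketch: ACK after every 20% of the window, plus a final ACK at the end.
class PartialAcker
  ACK_RATIO = 0.2

  def initialize(connection, window_size)
    @connection = connection
    @window_size = window_size
    @ack_every = [(window_size * ACK_RATIO).ceil, 1].max
  end

  # Called once per decoded event; sequence is the event's sequence number.
  def event_received(sequence)
    return unless (sequence % @ack_every).zero? || sequence == @window_size
    @connection.send_ack(sequence) # blocking call, as noted above
  end
end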

This is what we can try in the short term:

  • Make sense of the JRuby profiling reports from the tests.
  • Remove the partial ACK.
  • Measure the JSON performance on large payloads.
  • An async model might be necessary to get more performance?

JRuby profiling report

flat: https://gist.github.com/ph/716c8876efd5ab760c15
graph: https://gist.github.com/
graph in json: https://gist.github.com/ph/0e05e51cff97d251b60c

cc @tsg @urso

(These tests were made with the metrics filter instead of pv in order to correlate data with a user, but pv shows a similar slowdown.)

@ph

Contributor Author

commented Jan 25, 2016

I was helping a user and did the following to test the current performance.
Run for ~2 minutes.

LSF -> Lumberjack

rate: 7037.071025600132
rate: 7839.787441594143
rate: 8447.083098833273
rate: 9198.067285770032
rate: 9852.754230029077
rate: 10505.9470487211
rate: 10992.305117164353

Filebeat 1.0.1 -> Beats 2.1.2

rate: 3342.7377205107396
rate: 3753.21868427974
rate: 4092.5007213291456
rate: 4458.67325797694
rate: 4789.075861578294
rate: 5079.868260014546
rate: 5379.920123024201

The log file is a normal Apache log.

Config files:
https://gist.github.com/ph/9f76ccbda387d4597c7a

@ph

Contributor Author

commented Jan 25, 2016

Updated with the configs.

@ph

Contributor Author

commented Jan 25, 2016

Results from beats 2.0.3, which is slower than 2.1.2:

rate: 3223.5530186008205
rate: 3354.396094926669
rate: 3418.105017582448
rate: 3537.8221143742203
rate: 3647.9351783789803
rate: 3749.260079011355
rate: 3842.4994790181167
rate: 3928.267877100653
rate: 3995.329194956261
rate: 4015.21181473813
rate: 4087.18688803451
@tsg

Contributor

commented Jan 26, 2016

@ph Can the logstash version be relevant here (i.e. 2.1 vs 2.2), or is it just the plugin version that matters?

@ph

Contributor Author

commented Jan 26, 2016

@tsg all the tests were done with the same Logstash version, 2.1.1, so for me the problem lies in the plugin and the core doesn't matter.

@ph

Contributor Author

commented Jan 27, 2016

This is what I know from yesterday's experiments:

  1. The refactoring of 2.1.X did not cause the performance regression.
  2. Beats input 2.0.3 and earlier show the slowdown.
  3. logstash-input-lumberjack 2.0.5 and 2.0.3 have almost the same logic code in the plugins.
  4. logstash-input-lumberjack 2.0.5 and 2.0.3 have the same TCP socket handling logic.
  5. I also tried changing some of the options on the TCP socket, like TCP_NODELAY, with no improvements (see the sketch below).
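
For reference, this is roughly how TCP_NODELAY can be toggled on a plain Ruby TCP socket (a generic sketch, not the plugin's code; the port number is made up):

require "socket"

# Sketch: disable Nagle's algorithm on an accepted connection.
server = TCPServer.new("0.0.0.0", 5555)
client = server.accept
client.setsockopt(Socket::IPPROTO_TCP, Socket::TCP_NODELAY, 1)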

What is different on the server side:

  • JSON frame
  • Partial ACK

What is different on the client side:

  • filebeat does a slow start.
  • Supports partial ACK.
  • Better retry logic.

A JRuby profile in callgrind format shows that most of the time is spent reading the socket.

[screenshot: JRuby callgrind profile, 2016-01-27 09:48]

I am trying the following:

  • Add traces to the parsed data to check whether the window size is stable.
@tsg

Contributor

commented Jan 27, 2016

Are those CPU or real times? In other words, does the IO#sysread time include the time it waits for a new packet to be available?

It would be interesting to compare it with a profile generated with LSF as the client.
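
To illustrate the CPU vs real time distinction, a small generic sketch (assuming Ruby 2.1+ for Process.clock_gettime): a socket read that waits on the network accumulates wall-clock time but almost no CPU time.

# Sketch: measure wall-clock vs CPU time around a blocking sysread.
def timed_sysread(io, bytes)
  wall0 = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  cpu0  = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID)
  data  = io.sysread(bytes)
  wall  = Process.clock_gettime(Process::CLOCK_MONOTONIC) - wall0
  cpu   = Process.clock_gettime(Process::CLOCK_PROCESS_CPUTIME_ID) - cpu0
  warn format("sysread: wall=%.4fs cpu=%.4fs", wall, cpu)
  data
end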

@ph

Contributor Author

commented Jan 27, 2016

@tsg +1 on that, it will be my next check; it's CPU time IIRC.

@ph

Contributor Author

commented Jan 27, 2016

I have traced the communication in both of the plugins and the window/ack logic seems OK.

I will profile the other plugin.

@jordansissel

Contributor

commented Jan 27, 2016

@ph are you sure the profile data uses CPU time? I have doubts and think it uses real time; 84% of time spent in IO#sysread sounds like real time, not CPU time.

@ph

Contributor Author

commented Jan 27, 2016

☕️ ☕️ Yes, it's real time. ☕️ ☕️

It's reeeeal time

@jordansissel

Contributor

commented Jan 27, 2016

Given that, it does seem like it's something network or protocol related, maybe: for most of the time we are blocked waiting on reads from the network?

@ph

Contributor Author

commented Jan 27, 2016

@jordansissel Yup, it certainly feels slower in the trace. I am rerunning the test with a fixed amount of logs (200k of the same line).

@ph

Contributor Author

commented Jan 27, 2016

I went through the code and compared multiple different versions, but I don't see anything really obvious.

@urso Is it possible to do a simple test with filebeat and remove the acking strategy in the code? I am curious to see the throughput without the ACK. Can you make a version available for me?

@urso

Contributor

commented Jan 27, 2016

@ph you mean no ACK at all? You can change the ACK strategy at will, as long as the last ACK is sent.

@ph

Contributor Author

commented Jan 28, 2016

On my quest to find the slowdown I have tested a lot of different scenarios and I would like to share the numbers with you.
The numbers were generated with pv -War, using the dots codec on stdout.

Let's take LSF -> Lumberjack input as the baseline.

The test machine was a:
MacBook Pro Retina, 3 GHz Intel Core i7, 16 GB of RAM

Baseline numbers

Time    Sender          Receiver          Performance
3 mins  LSF 0.4.0       Lumberjack 2.0.5  [17.6KiB/s] [18.1KiB/s]
3 mins  Filebeat 1.1.0  Beats 2.1.2       [9.98KiB/s] [10.1KiB/s]
3 mins  Filebeat 1.1.0  Beats 2.0.3       [10.8KiB/s] [9.89KiB/s]
3 mins  netcat          TCP + Codec Json  ~25KiB

Experiments to find the bottleneck.

After looking at the profiling I didn't find anything obvious.

Time    Sender                            Receiver                                                                              Performance
3 mins  Filebeat 1.1.0                    Beats 2.1.2, no partial ACK                                                           [9.99KiB/s] [9.55KiB/s]
3 mins  Filebeat 1.1.0 (compression off)  Beats 2.1.2                                                                           [8.8KiB/s] [9.28KiB/s]
3 mins  Filebeat 1.1.0                    Beats 2.1.2, no JSON deserialization (same event re-emitted), ACK at end of window    [9.99KiB/s] [10.3KiB/s]
3 mins  Filebeat 1.1.0 (compression off)  Beats 2.1.2, no JSON deserialization (same event re-emitted), ACK at end of window    [11.2KiB/s] [9.53KiB/s]

I was not able to use the special build from @urso with ACK disabled; it gives me a zlib error when receiving events in the beats input.

I think the last two scenarios are interesting: no deserialization of the string is done on our side. The library will output the same hash over and over again (possibly JITting it).
The only waiting we have is the read on the socket and/or decompressing the payload.

Not sure what to think; it's like we are receiving or reading the events slower than before? The stdout + dots codec is the fastest output we can have, so nothing should block on the queue and this shouldn't block any IO reads.

The socket-reading logic that we had for LSF is pretty similar to what we use for beats.
The only thing that is different is the frames: they should be a bit bigger with the JSON overhead, though I haven't gathered any numbers for that.

Beats                                            LSF
read the whole compressed frame from the socket  read the whole compressed frame from the socket
read JSON bytes                                  read key/value until end of event
parse JSON                                       ----
output event                                     output event
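
As a rough way to quantify the extra work in the beats path above (inflate plus a JSON parse per event, versus inflate alone as the LSF-like lower bound), here is a micro-benchmark sketch; the sample event and the 1024-event window are assumptions taken from the tests above, not measurements from the plugins themselves.

require "benchmark"
require "json"
require "zlib"

# Build a fake 1024-event window of newline-delimited JSON, then compress it.
event = { "message" => "71.141.244.242 - kurt [18/May/2011:01:48:10 -0700] \"GET /admin HTTP/1.1\" 301 566",
          "beat" => { "hostname" => "sashimi" }, "offset" => 976_398_840 }
frame = Zlib::Deflate.deflate(Array.new(1024) { event.to_json }.join("\n"))

Benchmark.bm(22) do |b|
  b.report("inflate only") { 100.times { Zlib::Inflate.inflate(frame) } }
  b.report("inflate + JSON.parse") do
    100.times { Zlib::Inflate.inflate(frame).each_line { |line| JSON.parse(line) } }
  end
end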

Config Used:

filebeat:
  spool_size: 1024
  prospectors:
    -
      paths:
        - /tmp/multi11.log
      input_type: log
      document_type: diffusion
  registry_file: /tmp/test_registry
output:
  logstash:
    hosts: ["localhost:5555"]
    bulk_max_size: 1024 # Same default value of LSF.
    #compression_level: 0
shipper:
logging:
  level: info
  files:
    rotateeverybytes: 10485760
    keepfiles: 3
{
  "network": {
    "servers": [ "localhost:6782" ],
#    "ssl ca": "/tmp/lumberjack.crt",
    "ssl ca": "/Users/ph/es/logstash-config/certificates/logstash-forwarder.crt",
    "timeout": 15
  },
  "files": [
    {
      "paths": [
      "/tmp/multi*log"
      ]
    }
  ]
}
@ph

Contributor Author

commented Jan 28, 2016

If you have any ideas or suggestions I am all ears!

@untergeek


commented Jan 28, 2016

Is there a chance that Beats adding/using @metadata has an effect?

@ph

Contributor Author

commented Jan 28, 2016

@untergeek The content of the metadata itself is a bit bigger than what we were sending before,
so yes, it should take more time to send and compress.

Beat metadata

{
       "message" => "\#{i} 71.141.244.242 - kurt [18/May/2011:01:48:10 -0700] \"GET /admin HTTP/1.1\" 301 566 \"-\" \"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3",
      "@version" => "1",
    "@timestamp" => "2016-01-28T15:25:36.872Z",
          "beat" => {
        "hostname" => "sashimi",
            "name" => "sashimi"
    },
         "count" => 1,
        "fields" => nil,
    "input_type" => "log",
        "offset" => 976398840,
        "source" => "/tmp/multi11.log",
          "type" => "diffusion",
          "host" => "sashimi",
          "tags" => [
        [0] "beats_input_codec_plain_applied"
    ],
     "@metadata" => {
        "beat" => "filebeat",
        "type" => "diffusion"
    }
}

LSF metadata

{
       "message" => "\#{i} 71.141.244.242 - kurt [18/May/2011:01:48:10 -0700] \"GET /admin HTTP/1.1\" 301 566 \"-\" \"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3",
      "@version" => "1",
    "@timestamp" => "2016-01-28T15:24:45.372Z",
          "file" => "/tmp/multi11.log",
          "host" => "sashimi",
        "offset" => "575646480"
}
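
To put a number on that, here is a small sketch that compares the compressed size of a 1024-event window for the two event shapes above; the hashes are trimmed-down versions of the events shown, so the absolute values are only indicative.

require "json"
require "zlib"

MESSAGE = '71.141.244.242 - kurt [18/May/2011:01:48:10 -0700] "GET /admin HTTP/1.1" 301 566'.freeze

beats_event = {
  "message" => MESSAGE, "@timestamp" => "2016-01-28T15:25:36.872Z",
  "beat" => { "hostname" => "sashimi", "name" => "sashimi" },
  "count" => 1, "input_type" => "log", "offset" => 976_398_840,
  "source" => "/tmp/multi11.log", "type" => "diffusion", "host" => "sashimi"
}
lsf_event = {
  "message" => MESSAGE, "@timestamp" => "2016-01-28T15:24:45.372Z",
  "file" => "/tmp/multi11.log", "host" => "sashimi", "offset" => "575646480"
}

# Compressed size of one window of newline-delimited JSON events.
def compressed_window_size(event, window = 1024)
  Zlib::Deflate.deflate(Array.new(window) { event.to_json }.join("\n")).bytesize
end

puts "beats: #{compressed_window_size(beats_event)} bytes"
puts "lsf:   #{compressed_window_size(lsf_event)} bytes"
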
@ph

Contributor Author

commented Jan 28, 2016

Updated with the same generated data frame. You can exclude the tags in the beats format; they are added by the input.

@ph

Contributor Author

commented Jan 28, 2016

Beats 1024 events window

2016-01-28 10:32:45 -0500 - compressed payload: 9225
2016-01-28 10:32:47 -0500 - compressed payload: 9260
2016-01-28 10:32:48 -0500 - compressed payload: 9462
2016-01-28 10:32:49 -0500 - compressed payload: 9118
2016-01-28 10:32:50 -0500 - compressed payload: 9233
2016-01-28 10:32:52 -0500 - compressed payload: 8979

LSF 1024 events window

2016-01-28 10:30:52 -0500 - compressed payload: 7088
2016-01-28 10:30:53 -0500 - compressed payload: 6899
2016-01-28 10:30:53 -0500 - compressed payload: 6900
2016-01-28 10:30:54 -0500 - compressed payload: 7252
2016-01-28 10:30:54 -0500 - compressed payload: 7253
2016-01-28 10:30:55 -0500 - compressed payload: 7245
2016-01-28 10:30:55 -0500 - compressed payload: 7246
@colinsurprenant


commented Jan 28, 2016

@ph did you have a chance to look at the potential scenario we discussed yesterday? This is a hypothesis without having looked into the protocol code, but: if the acking mechanism is synchronous on the sender end (filebeat), and sending frame N requires the reception of ACK N-1, then the throughput will be throttled by the acking latency. If this is the case, maybe the protocol could be tweaked so the sender keeps sending frames and processes ACKs asynchronously?
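
A back-of-the-envelope model of that hypothesis: with synchronous acking, each window costs roughly one round trip plus the receiver's processing time before the next window can go out, so throughput is capped by that latency. The numbers below are purely illustrative, not measured.

# Sketch: throughput ceiling for a sync-ack sender.
def sync_ack_throughput(window_size:, rtt_s:, processing_s:)
  window_size / (rtt_s + processing_s) # events per second
end

puts sync_ack_throughput(window_size: 1024, rtt_s: 0.001, processing_s: 0.050) # ~20k eps
puts sync_ack_throughput(window_size: 1024, rtt_s: 0.001, processing_s: 0.030) # ~33k eps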

@ph

Contributor Author

commented Jan 28, 2016

@colinsurprenant @urso has a branch that supports pipelining, which sends events asynchronously, increasing performance at the cost of increased CPU and memory usage on both the sender and the receiver side. @urso could this be a flag that people turn on?

Filebeat (current implementation) and LSF were both doing sync ACK.

But with the change in the format and the increased data frame size, the beats input effectively sends the ACK later than LSF did, which increases the latency of the sending flow.

@ph

Contributor Author

commented Jan 28, 2016

One thing I did not consider in the tests is testing the beats input with SSL turned on.

@urso

Contributor

commented Jan 28, 2016

@colinsurprenant I ran some tests a while ago with filebeat in asynchronous mode, sending the next batch of frames before receiving an ACK.

filebeat workers  compression pipelining | cpu(%)   mem-res(MB)  throughput(1k lines/s)
---------------------------------------------------------------------------------------
 master     1        y             n     |   66       24             18.2
  exp       1        y             n     |   71       44             18.6
  exp       1        n             n     |   76       45             19.9
  exp       1        n             y     |   97       65             24.7
                                         |
 master     2        y             n     |   66       25             18.2
  exp       2        y             n     |  128       60             30.8
  exp       2        n             n     |  126       59             32.6
  exp       2        n             y     |  156       91             38.0
                                         |
 master     4        y             n     |   66       26             18.0
  exp       4        y             n     |  190       84             44.3
  exp       4        n             n     |  185       76             45.9
  exp       4        n             y     |  195      130             47.4

The 'pipelining' column means async mode enabled. It somewhat improves the situation, but is still slower than LSF doing sync ACK (30k lines/s).

For async sending (or without ACK), compression must be disabled in beats so as not to trigger a bug in the input plugin's buffer handling.

@ph

Contributor Author

commented Jan 28, 2016

@urso the tests we discussed:

Time    Sender              Receiver         Performance
3 mins  LSF 0.4.0           Beats 2.1.2 TLS  [21.4KiB/s] [21KiB/s]
3 mins  Filebeat 1.1.0 TLS  Beats 2.1.2 TLS  [12.1KiB/s] [11.9KiB/s]

It's a bit faster; not sure it's really meaningful.

@colinsurprenant


commented Jan 28, 2016

@urso in the single-worker scenario, enabling pipelining increases throughput by ~25%, which is not negligible, but it is intriguing that it is not more efficient.

What is the multiple-workers scenario? Is that multiple internal workers in filebeat sending to the same LS instance?

@ph

Contributor Author

commented Jan 28, 2016

@colinsurprenant pipelining + fixing the zlib problem will probably make it faster, but still.

@urso

Contributor

commented Jan 28, 2016

@colinsurprenant workers is the number of workers with dedicated TCP connections doing load balancing in filebeat.

Some more tests from my side with generatorbeat and a fake lumberjack server in Go (https://gist.github.com/urso/a676fb14269c49f78749). The fake server parses JSON-encoded + compressed data frames.

generatorbeat + fakelumberjack = ~58k events/s
generatorbeat + logstash = ~18k events/s

@colinsurprenant


commented Jan 28, 2016

@urso interesting. The tests with the fake server, that was without pipelining, right?

@colinsurprenant


commented Feb 9, 2016

curious, have we made any progress on this?

@urso

Contributor

commented Feb 9, 2016

Not much. I tested with logstash 2.2.0 + the most recent beats plugin this week and throughput improved from 18k to 20k lines/s.

Testing with PR #47 and pipelining in logstash I get about 25.5k lines/s. Adding a second worker I get about 43k lines/s. For comparison, 4 workers were required in the past to get somewhat similar throughput.

@ph

Contributor Author

commented Feb 10, 2016

@colinsurprenant Not yet. I reviewed @urso's PR, which is an improvement, and will try to release it ASAP.

@berglh


commented Feb 22, 2016

I just wanted to chime in. I have a pretty complex multi-line log that we're shipping. I had been using log-courier and noticed the new multi-line support in filebeat, so I decided to try it out.

My setup was pretty straightforward:

filebeat 1.1.0 => logstash 2.2.0 with logstash-input-beats 2.1.2

In my example I had 32 multi-line log files (4 GB size total). The log entries have a relatively high cardinality. There were no filters in the logstash pipeline and the output was written to fast spinning disk as a file. The log sample was actually the same log entry repeated over and over again.

In my testing of filebeat 1.1.0 I maxed out at 3,500 events/s. With log-courier 1.8, I was seeing 30,000 events/s.

Adjusting the number of workers didn't seem to make any difference to my test. The only difference was the input plugin and the shipper. I haven't tried shipping the data from filebeat straight to /dev/null to ensure that the multi-line capabilities of filebeat are not the limitation, but this thread seemed to back up my experience.

I'll upgrade all versions, try the test again, and report any changes. I saw someone report libbeatLogstashPublishedAndAckedEvents; how do I get that output from filebeat?

@urso

Contributor

commented Feb 22, 2016

@berglh can you try with filebeat master, as this is where we are making more changes? How did you measure throughput?

Next to spool_size in filebeat, set publish_async: true. This enables the async publisher pipeline, enabling load balancing in logstash. You can try playing with spool_size and bulk_max_size to increase throughput. Sometimes setting these to 4096 gave me slightly better results.

Try with console output to /dev/null to get some baseline stats, especially for multiline. For benchmarking the filebeat->logstash connection it might be beneficial to disable multiline support. Comparing with and without multiline enabled might be kind of interesting (since, due to the JSON-based transport, log lines need to be escaped and unescaped).

You can get the throughput from beats (variable libbeatLogstashPublishedAndAckedEvents) via github.com/urso/govarbeat or the Python script posted to the original discussion (see the link in the issue description); a sketch of the same approach follows below.
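
A sketch of that idea, assuming filebeat was started with the -httpprof flag so that the Go expvars (including libbeatLogstashPublishedAndAckedEvents) are served as JSON at /debug/vars; the port is an assumption.

require "json"
require "net/http"
require "uri"

# Sketch: poll filebeat's expvar endpoint once per second and print the
# rate of published-and-ACKed events.
uri  = URI("http://localhost:6060/debug/vars")
last = nil

loop do
  vars  = JSON.parse(Net::HTTP.get(uri))
  acked = vars["libbeatLogstashPublishedAndAckedEvents"].to_i
  puts "rate: #{acked - last}/s" if last
  last = acked
  sleep 1
end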

@suyograo suyograo added the bug label Mar 3, 2016

@berglh


commented Mar 8, 2016

@urso As I mentioned, I had filebeat sending directly to logstash, which wrote the output to a file. I had a simple script that watched for the creation of the file, then did wc -l on the file and divided the value by the seconds since creation, as each event was a single line in the output file. It wasn't perfect, but gave a reasonable value.
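
A minimal sketch of that measurement (the path is hypothetical): wait for the output file to appear, then periodically divide its line count by the seconds elapsed.

# Sketch: approximate events/s from Logstash's file output, one event per line.
path = "/tmp/logstash-output.log"
sleep 0.5 until File.exist?(path)
started = Time.now

loop do
  sleep 10
  lines = File.foreach(path).count
  puts "avg rate: #{(lines / (Time.now - started)).round(1)} events/s"
end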

I already had publish_async: true and did experiment with adjusting spool_size up to 20480. bulk_max_size seemed to be relevant only to the elasticsearch output.

I finally managed to do some testing of both single-line and multi-line logs.

publish_async: true was set in all runs and spool_size had no consistent impact on the throughput, so I stuck with the default in the end. All of these tests were performed on the same machine with the same source log files and used localhost for the network addresses to avoid transient network conditions interfering with the results.

Single Event Lines Results

filebeat 1.1.1 => /dev/null
~414,072 Eps (390K to 415K)

filebeat 1.1.1 => logstash 2.2.2 (logstash-input-beats 2.1.3) => /dev/null
~10,453 Eps

log-courier 1.8.2 => logstash 2.2.2 (logstash-input-courier 1.9.0) => /dev/null
~36,471 Eps

Multi-line Event Detection Results

filebeat 1.1.1 => /dev/null
~15,680 Eps (20K to 10K, slowing over time)

filebeat 1.1.1 => logstash 2.2.2 (logstash-input-beats 2.1.3) => /dev/null
~5,844 Eps

log-courier 1.8.2 => logstash 2.2.2 (logstash-input-courier 1.9.0) => /dev/null
~21,854-24,106 Eps

Observations

  • logstash-input-courier 1.9.0 is around ~4 times faster than logstash-input-beats 2.1.3 consistently
  • filebeat 1.1.1 writing directly to /dev/null is still about ~25% slower than log-courier 1.8.2 sending to logstash-input-courier 1.9.0 on logstash 2.2.2 for multi-line
  • Single-line event performance of filebeat 1.1.1 to /dev/null is very impressive.

Postulations

  • logstash-input-beats 2.1.3 could be more efficient for both single and multi-line events from filebeat 1.1.1
  • filebeat 1.1.1 multi-line detection could also use some performance improvement, using log-courier 1.8.2 as a benchmark
@ruflin


commented Mar 8, 2016

@berglh Thanks a lot for taking the time to measure all the different setups. We will definitely take this into account. It's interesting to see that the multiline implementation in filebeat is so much slower compared to log-courier.

@ph

Contributor Author

commented Mar 8, 2016

@berglh Thanks for the description. I am going back to tackle the issue, probably next week.

@jay-dihenkar


commented Apr 11, 2016

Any update on the timeline for resolving this bug?

@berglh


commented Apr 19, 2016

@ph @ruflin any chance you've had a look into this yet? The beats have some great features that solve a lot of problems for us, but we are unable to use them due to the slow plugin performance. Let me know if there is any other benchmarking I can do for you!

@ruflin


commented Apr 21, 2016

@berglh No specific update here yet, but we definitely have it on our list to be improved. Could you perhaps open an issue in the Beats repo with the data for multiline, so we can directly link to the issue when we fix it?

@matajaz


commented Apr 24, 2016

Hi,
I tried to use an ELK server (logstash 2.2.4 and beats 1.2.1) to collect debug logs from my cloud applications, but I see really poor performance even though I am generating events at a much slower rate than the test results above.
My app peaks at 1383.10 KB/s of event generation, 8941 trace lines/s. A certain important event shows up 230 times/minute in the original application log.
Kibana only detects this trace 10 times per minute.

I have also noticed that starting topbeat on the same machine where filebeat is running reduces the number of traces shown in Kibana (both filebeat and topbeat are configured to send to logstash).

The Logstash log is full of beats, logstash and elasticsearch problems:
:timestamp=>"2016-04-23T21:17:13.999000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :level=>:warn}
{:timestamp=>"2016-04-23T21:17:14.499000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :level=>:warn}
{:timestamp=>"2016-04-23T21:17:15.000000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :level=>:warn}
{:timestamp=>"2016-04-23T21:17:15.500000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :level=>:warn}
{:timestamp=>"2016-04-23T21:17:15.632000+0000", :message=>"Attempted to send a bulk request to Elasticsearch configured at '["http://localhost:9200/"]', but Elasticsearch appears to be unreachable or down!", :error_message=>"Connection refused", :class=>"Manticore::SocketException", :client_config=>{:hosts=>["http://localhost:9200/"], :ssl=>nil, :transport_options=>{:socket_timeout=>0, :request_timeout=>0, :proxy=>nil, :ssl=>{}}, :transport_class=>Elasticsearch::Transport::Transport::HTTP::Manticore, :logger=>nil, :tracer=>nil, :reload_connections=>false, :retry_on_failure=>false, :reload_on_failure=>false, :randomize_hosts=>false}, :level=>:error}
{:timestamp=>"2016-04-23T21:17:16.001000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :level=>:warn}
{:timestamp=>"2016-04-23T21:17:16.501000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :level=>:warn}
{:timestamp=>"2016-04-23T21:17:17.002000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :level=>:warn}
{:timestamp=>"2016-04-23T21:17:17.055000+0000", :message=>"Connection refused", :class=>"Manticore::SocketException", :backtrace=>["/opt/logstash/vendor/bundle/jruby/1.9/gems/manticore-0.5.5-java/lib/manticore/response.rb:37:in initialize'", "org/jruby/RubyProc.java:281:incall'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/manticore-0.5.5-java/lib/manticore/response.rb:79:in call'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/manticore-0.5.5-java/lib/manticore/response.rb:256:incall_once'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/manticore-0.5.5-java/lib/manticore/response.rb:153:in code'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/elasticsearch-transport-1.0.15/lib/elasticsearch/transport/transport/http/manticore.rb:71:inperform_request'", "org/jruby/RubyProc.java:281:in call'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/elasticsearch-transport-1.0.15/lib/elasticsearch/transport/transport/base.rb:201:inperform_request'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/elasticsearch-transport-1.0.15/lib/elasticsearch/transport/transport/http/manticore.rb:54:in perform_request'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/elasticsearch-transport-1.0.15/lib/elasticsearch/transport/transport/sniffer.rb:32:inhosts'", "org/jruby/ext/timeout/Timeout.java:147:in timeout'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/elasticsearch-transport-1.0.15/lib/elasticsearch/transport/transport/sniffer.rb:31:inhosts'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/elasticsearch-transport-1.0.15/lib/elasticsearch/transport/transport/base.rb:76:in reload_connections!'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.5.5-java/lib/logstash/outputs/elasticsearch/http_client.rb:72:insniff!'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.5.5-java/lib/logstash/outputs/elasticsearch/http_client.rb:60:in start_sniffing!'", "org/jruby/ext/thread/Mutex.java:149:insynchronize'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.5.5-java/lib/logstash/outputs/elasticsearch/http_client.rb:60:in start_sniffing!'", "org/jruby/RubyKernel.java:1479:inloop'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-2.5.5-java/lib/logstash/outputs/elasticsearch/http_client.rb:59:in `start_sniffing!'"], :level=>:error}
{:timestamp=>"2016-04-23T21:17:17.502000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :level=>:warn}

@tsg

Contributor

commented Apr 24, 2016

@matajaz Your issue seems different from the one in this thread. Please open a topic on https://discuss.elastic.co/ so as not to derail this issue. We'll help you there. Thanks!

@suyograo suyograo added the P1 label Apr 26, 2016

@ph ph self-assigned this May 13, 2016

@ph

Contributor Author

commented May 13, 2016

Going back to this issue and investigating with @urso what the actual problem is. Stay tuned.

@urso

Contributor

commented May 13, 2016

Did some more tests today on the Hetzner machine.

For comparison, LSF -> logstash (beats plugin) = 30k eps.

Tests use:

  • generatorbeat to create and push events as fast as possible. Elasticsearch output bulk API has been modified to not parse JSON response and handle errors
  • test-lj lumberjack test server written in golang (go-lumber project)
  • logstash 2.3.2:
    • beats input plugin
    • http input plugin + es_bulk codec
generatorbeat ->       no zlib   zlib     HTTP
tst-lj (go)            64.4k     59.2k
logstash lumberjack    23k       22.6k
logstash es_bulk                          19.8k

Adding experimental pipelining to generatorbeat I get

generatorbeat ->       no zlib   zlib
tst-lj (go)            105k      94.6k
logstash lj            26.5k

just for fun generatorbeat (4 workers) -> tst-lj (go, 1 worker handling input) = 170k eps

@ph

Contributor Author

commented May 13, 2016

@urso let's get these numbers into #45 (comment). I think it will make things a bit more fair, since the golang server doesn't have the pipeline overhead.

@ph

Contributor Author

commented May 13, 2016

generatorbeat + ruby_fake_lumberjack (with json parser) = ~19.5k lines/s

This is the number we had last time, and it should be an equivalent setup to go-lumber.

@suyograo suyograo added P2 and removed P1 labels May 17, 2016

@ph

Contributor Author

commented May 18, 2016

OK, I have made some progress here. I have created a pure netty implementation of the protocol and the initial benchmarks look promising.

Initial benchmarks with this library outside logstash give 55K EPS on my personal machine, which is a MacBook Retina. This performance is similar to the numbers I get from the go-lumber implementation on the same machine.

Setup was:

Setup                        EPS
generator beats -> no zlib   56748.4502119/s (avg: 61625.3987965/s)
generator beats -> zlib      56722.2105926/s (avg: 55904.2689266/s)
@ph

Contributor Author

commented May 18, 2016

OK, the beats input will be faster with the Java rewrite, but something else is definitely a bottleneck inside logstash. On my machine I get 32K, which is almost 2.5 times the speed I had before (12417.6067793/s).

@ph

Contributor Author

commented May 31, 2016

So on Hetzner, with the netty implementation and a null output, we reached ~100K EPS.

@ph ph referenced this issue Jun 9, 2016

Closed

Rewrite the beats plugin in pure java #92

14 of 14 tasks complete
@ph

Contributor Author

commented Aug 5, 2016

work done in #92
