
inefficiency caused by double event creation and extra field lookup #17

Closed
colinsurprenant opened this Issue Jan 29, 2016 · 13 comments


colinsurprenant commented Jan 29, 2016

The current codec implementation leverages the line codec to do the line parsing. The problem is that the line codec's decode method is called, which produces an event, and then a new event is created from the JSON deserialization of the previous event's message field.

This means that for each input line, an extra and superfluous event is created and an extra event field lookup is done.

Since the line parsing logic is so simple, we should just use the buffered tokenizer directly here and not rely on the line codec.
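The proposed approach can be sketched as follows. This is a simplified, self-contained stand-in, not the actual plugin code: `SimpleTokenizer` is a hypothetical illustration of what `FileWatch::BufferedTokenizer` does, and plain `JSON.parse` stands in for `LogStash::Json.load` feeding `LogStash::Event.new`.

```ruby
require "json"

# Minimal stand-in for the buffered tokenizer (the real codec would use
# FileWatch::BufferedTokenizer): accumulates partial lines across calls
# and only hands back complete, delimiter-terminated lines.
class SimpleTokenizer
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @buffer = ""
  end

  # Returns the complete lines found so far, keeping any trailing
  # partial line buffered for the next call.
  def extract(data)
    @buffer << data
    lines = @buffer.split(@delimiter, -1)
    @buffer = lines.pop # last element is the (possibly empty) remainder
    lines
  end

  # Drains and returns whatever partial line is still buffered.
  def flush
    remainder = @buffer
    @buffer = ""
    remainder
  end
end

# Decode JSON lines directly: one JSON parse per input line, with no
# intermediate "line" event and no extra event field lookup.
def decode(tokenizer, data)
  events = []
  tokenizer.extract(data).each do |line|
    events << JSON.parse(line)
  end
  events
end
```

The key point is that each complete line goes straight from the tokenizer into the JSON parser, instead of first being wrapped in a line event and then re-read via `event["message"]`.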


jordansissel commented Jan 29, 2016

+1 to your proposal.

On Friday, January 29, 2016, Colin Surprenant wrote:

> The current codec implementation leverages the line codec to do the line parsing. The problem is that the line codec's decode method is called
>
> `@lines.decode(data) do |event|`
>
> which produces an event, and then a new event is created from the JSON deserialization of the previous event's message field:
>
> `LogStash::Event.new(LogStash::Json.load(event["message"]))`
>
> This means that for each input line, an extra and superfluous event is created and an extra event field lookup is done.
>
> Since the line parsing logic is so simple, we should just use the buffered tokenizer directly and not rely on the line codec.


guyboertje commented Feb 4, 2016

Are we handling remainder flushing on close at all in the current implementation?


jsvd commented Feb 4, 2016

I made some arguments against this through another channel on the overall choice of buftok vs the line codec:

Implementation-wise, I don't know if using the line codec makes the code more complex, which is my only concern with my suggestion (to use codec line instead of buftok).

That aside, I see multiple benefits:
* reusing code
* one less place where we have to maintain the contract of an external dependency
* using the codec this way helps us reinforce what one day might become the line tokenizer

Also, the performance gain is unknown and, for me, doesn't justify hardwiring buftok into json_lines (and multiline).

That said, for coherence's sake and dismissing the performance argument, I'm +1 on going with this proposal if logstash-plugins/logstash-codec-multiline#26 is accepted.


colinsurprenant commented Feb 4, 2016

Given the simplicity of this implementation (without the boilerplate, it is literally one line: `@buffer.extract(data).each do |line|`), I think DRYness is not really a concern and I'd rather opt for performance.


guyboertje commented Feb 4, 2016

+1 to use buftok and to fix the "flush if remainder" problem.


guyboertje commented Feb 4, 2016

+1 to improve performance too.


guyboertje commented Feb 5, 2016

There is also a flavour mismatch here if json_lines is used with the file input: because there is no flush, one gets absolutely no events generated at all, not even incorrect ones.


colinsurprenant commented Feb 5, 2016

@guyboertje codec flush is meant as a final flush; why are events not generated when used with the file input?


suyograo commented Feb 5, 2016

Fixed in #18

suyograo closed this Feb 5, 2016


guyboertje commented Feb 5, 2016

@colinsurprenant

> codec flush is meant as a final flush; why are events not generated when used with the file input?

Because json_lines does not implement a flush method! But on reflection, unless the remainder is a parseable JSON snippet there is not much point; then again, maybe an event with a parse_failure tag is better than nothing at all.


colinsurprenant commented Feb 7, 2016

@guyboertje but you said «if json_lines is used with the file input, and because there is no flush one gets absolutely no events generated at all».

So are absolutely no events generated at all, or is there potentially just a missing last event because there is no final flush?

It is important that this be clarified, because if no events at all are generated due to a missing flush when used with the new file input, we have a serious problem and need to update all/most codecs.


colinsurprenant commented Feb 9, 2016

After discussion, we agree a flush method is required for the shutdown case where an incomplete line was buffered. It is pretty obvious this incomplete line will end up generating a JSON parser error, but at least there will be a log trace for it.
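The agreed-upon behaviour can be sketched like this. This is a hypothetical standalone illustration, not the actual fix: `flush_remainder` stands in for a codec `flush` that drains the tokenizer's buffered remainder, plain `JSON.parse` stands in for `LogStash::Json.load`, and the `_jsonparsefailure` tag mirrors the convention used by the Logstash JSON codecs.

```ruby
require "json"

# Sketch of a codec-style final flush: at shutdown, drain whatever
# partial line is still buffered and try to parse it. An incomplete
# line will raise a JSON parse error; instead of silently dropping the
# data we return a tagged event so there is a trace of it downstream.
# `remainder` stands in for the tokenizer's buffered leftover bytes.
def flush_remainder(remainder)
  return nil if remainder.nil? || remainder.empty?
  begin
    JSON.parse(remainder)
  rescue JSON::ParserError
    # Keep the raw text and tag the failure so it is visible, rather
    # than losing the trailing partial line entirely.
    { "message" => remainder, "tags" => ["_jsonparsefailure"] }
  end
end
```

In the rare case where the remainder happens to be a complete JSON document (e.g. a final line with no trailing newline), this still yields a proper event instead of dropping it.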


colinsurprenant commented Feb 9, 2016

Follow-up issue in #20.
