Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parsing error - value including a space followed by a token with a dot in it #54

Closed
MattCarothers opened this issue Jul 13, 2018 · 5 comments
Assignees

Comments

@MattCarothers
Copy link

This line:

CEF:0|TheVendor|TheProduct|1|1|TheName|1| query_string=key1\=value1&key2\=value3 aa.bc&key3\=value4

Should result in this field:

          "query_string" => "key1=value1&key2=value3 aa.bc&key3=value4"

But instead it results in these two fields:

          "aa.bc&key3\\" => "value4",
          "query_string" => "key1=value1&key2=value3",

The issue is this regex on line 174 of cef.rb:

 message = message.gsub((/(\s+(\w+\.[^\s]\w+[^\|\s\.\=]+\=))/),'|^^^\2')

It breaks for the case where a value has a space in it followed by a token with a dot in it. That line of code seems specific to one ArcSight mode. Perhaps add a configuration flag for people who aren't using ArcSight to skip it?

@yaauie yaauie self-assigned this Aug 16, 2018
@yaauie
Copy link
Contributor

yaauie commented Aug 16, 2018

The parser is pretty naive, doing a whole lot of superfluous splitting and replacing and then more work to put things back how it found them when it hits escapes and stuff, all of which could be done more simply with a scanning parser that has simple lookaheads that account for escaping.

This is quite similar to recent work that I did in the KF Filter Plugin, so I'm assigning this ticket to myself.

@MattCarothers
Copy link
Author

You may find this helpful. I haven't tested it extensively, but I replaced everything from line 167to line 188 with one message.split():

    # Strip any whitespace from the message
    if not message.nil? and message.include? '='
      message = message.strip

      # If the last KVP has no value, add an empty string, this prevents hash errors below
      if message.end_with?('=')
        message = message + ' ' unless message.end_with?('\=')
      end

      message.split(/\s+(?=[^=\\\s]*[^\\]=)/).each{ |kv_pair|
        key, value = kv_pair.gsub(/\\=/, '=').split(/=/, 2)
        event.set(MAPPINGS[key] || key, value)
      }
    end

As far as I can tell, it works for everything except one corner case where the key has an escaped = sign in it.

yaauie added a commit to yaauie/logstash-codec-cef that referenced this issue Aug 17, 2018
@breml
Copy link
Collaborator

breml commented Aug 18, 2018

@yaauie I thought a little bit about your above comment as well as the comment on the PR 55 and I have to admit, that I feel offended and I think that those comments are not in good accordance with the Elastic Community Code of Conduct. It is ok to propose improvements and we are all happy, if the code becomes better (bug-free, maintainable, more performant). But there is no value add (nor need) in judging the existing code in such a negative way (e.g. "naive" and "superfluous").
I don't feel, that these coments are very respectful towards the developers who helped to improve the codec to the state we have today. You have to keep in mind, that this codec already served well in a lot of cases (Elastic alone released a 6 part series of blog posts using this codec, starting here). Additonally, most of the developers built the existing state of the codec in their free/spare time, where it looks like you are on the payroll of Elastic. In my opinion, this should even raise the bar for the way you communicate with the community.
I am one of the developers who helped to improve the codec to the current state. When I joined, only one side of encode/decode even existed as well as there have been lots of incompatibilities to the specification I helped to remove. And last but not least, the unit tests, that are in place, allowed you to implement your scanner based approach in a short period of time in the first place.

CC: @jordansissel, @suyograo

@yaauie
Copy link
Contributor

yaauie commented Aug 19, 2018

@breml I did not mean to insult you or the other contributors to this project, and appreciate that you were willing to call out that the effect of my words was offensive. I am sorry.

You are absolutely right that this codec has been useful to many people, and that the extensive tests that you and others put in the effort to maintain are what allowed me to confidently build a scanning parser that would be non-breaking.

In retrospect, my tone was not respectful of the other contributors and their efforts; in future I will be more intentional with how I communicate.

@breml
Copy link
Collaborator

breml commented Aug 20, 2018

@yaauie thanks for the apology, I really appreciate.
And also thanks for your work on the scanning parser, this will clearly improve this codec.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants