Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: added parser_options for more control over XML parsing #68

Merged
merged 4 commits into from Jan 28, 2020

Conversation

kares
Copy link
Contributor

@kares kares commented Jan 22, 2020

users sometimes throw non valid XML data at this plugin, which is 馃挬 but still
in certain cases (e.g. when the document is huge) this could lead to leaks as the parser by default is very forgiving and does not fail on errors but instead tries to build smt out of the provided content really hard.

in those cases internal errors (such as SAXParseException) are stored in a list at the native end.
(to be mapped and retrievable as doc.errors when parsing completes)

having more control of the parsing process would allow us to fail early instead of piling up errors.

the proposal supports xml { parse_options => 'strict' }

resolves #66

@elasticsearch-bot elasticsearch-bot self-assigned this Jan 22, 2020

def xml_parse_options
return Nokogiri::XML::ParseOptions::DEFAULT_XML unless @parse_options # (RECOVER | NONET)
@xml_parse_options ||= begin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a question @kares, today in the docs we say that supported options are:

  • strict
  • no_warning
  • no_error
    the ones provided by Nokojiri parser.
    In the future if we update the parser version, and the parser changes/updates the list of flag without we synch the documentation, then we have a misalignment. Do you think is best to have strict check of the flags here instead of relay on the validation done by the parser library, also if it means repeating it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's the reasoning: initially I thought about only having strict => true option.
but than the options allow finer control, for whatever scenarios we might run into ...
(e.g. might wan to allow network access for the parser to download schemas etc.)

so there's more parser options we can set, I purposefully only documented these few.
maybe even no_error and no_warning might be a bit too much as what is a warning/error
depends on the parser (might not be clear).

being able to (unoficially) set all of the internal ones is the reason why I decided not to validate against a fixed set of options (that would be enumerated here).

going to remove no_error and no_warning from the docs and only document strict for now.
and checking that the official parser_options work ('strict') will be covered by specs.

does that sound okay or do you have concerns?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok, go with the strict as parser_options. My thoughts was based on the assumption that everything we provide should be documented, if we let the door open so that someone can tinkering with undocumented behaviours, then in case we change library we potentially break more pipeline configurations than expected.

@andsel andsel assigned kares and unassigned elasticsearch-bot Jan 22, 2020
By default the parser is not strict and thus accepts some invalid content.
Currently supported options are:

- _strict_ - forces the parser to fail early when content is not valid xml
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karenzone could you please check these for me.
we might add more to the list later (unofficially other options work) but for now only strict is to be documented

Copy link
Contributor

@karenzone karenzone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of suggestions for your consideration. Otherwise, LGTM

===== `parser_options`

* Value type is <<string,string>>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding default value here as a bullet, since it sounds like strict is not enabled by default. I'm not sure how to notate that. Would something like * Default value is [] work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, did the same as with rest ... to mention there's no default

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We documented parser_options, but the actual parameter is parse_options in the code. I can submit a PR to fix the doc but is it ok to keep parse_options?

By default the parser is not strict and thus accepts some invalid content.
Currently supported options are:

- _strict_ - forces the parser to fail early when content is not valid xml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- _strict_ - forces the parser to fail early when content is not valid xml
- `strict` - forces the parser to fail early when content is not valid xml

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"forces the parser to fail early when content is not valid xml"

Maybe include the alternative. Something like "forces the parser to fail early instead of accumulating errors when content is not valid xml."

@kares kares merged commit 14863ea into logstash-plugins:master Jan 28, 2020
it 'does fail parsing' do
subject.filter(event)
expect( event.get("tags") ).to_not be nil
expect( event.get("tags") ).to include '_xmlparsefailure'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we document this tag? In other plugins such as grok, it can be changed/customized.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to the docs the tag isn't documented.
we usually try to keep PRs such as this one minimalistic, so I think you should open an issue as a feature request to be able to configure the failure tag, on top of that (as the feature is delivered) documentation will get improved ...

@uzairarif
Copy link

Hi, can we get the documentation on the website updated, it seems the setting has to be mentioned like below:

filter
{
xml {
source => "sourceMessage"
target => "XMLtargetMessage"
parse_options => "strict"
}
}

But the documenation on the website states that the setting to be: parser_options (snapshot attached below)

xml-filer-plugin

@karenzone
Copy link
Contributor

Hi @uzairarif. Thanks for this comment. I've opened #75 to get this change vetted and merged.

If you see other inconsistencies such as this, please open a github issue in the appropriate plugin repo. We rarely interact with PRs after they have been merged, and comments in merged PRs are likely to get lost. It would be a shame for valuable comments such as this to be lost.

Thanks for helping us make our plugins and documentation better.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

parsing/transforming xml could lead to leaks
6 participants