Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: added parser_options for more control over XML parsing #68

Merged
merged 4 commits into from Jan 28, 2020
Merged
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,6 @@
## 4.1.0
- Feat: added parser_options for more control over XML parsing [#68](https://github.com/logstash-plugins/logstash-filter-xml/pull/68)

## 4.0.7
- Fixed creation of empty arrays when xpath failed [#59](https://github.com/logstash-plugins/logstash-filter-xml/pull/59)

Expand Down
14 changes: 14 additions & 0 deletions docs/index.asciidoc
Expand Up @@ -34,6 +34,7 @@ This plugin supports the following configuration options plus the <<plugins-{typ
| <<plugins-{type}s-{plugin}-force_array>> |<<boolean,boolean>>|No
| <<plugins-{type}s-{plugin}-force_content>> |<<boolean,boolean>>|No
| <<plugins-{type}s-{plugin}-namespaces>> |<<hash,hash>>|No
| <<plugins-{type}s-{plugin}-parser_options>> |<<string,string>>|No
| <<plugins-{type}s-{plugin}-remove_namespaces>> |<<boolean,boolean>>|No
| <<plugins-{type}s-{plugin}-source>> |<<string,string>>|Yes
| <<plugins-{type}s-{plugin}-store_xml>> |<<boolean,boolean>>|No
Expand Down Expand Up @@ -87,6 +88,19 @@ filter {
}
}

[id="plugins-{type}s-{plugin}-parser_options"]
===== `parser_options`

* Value type is <<string,string>>

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding default value here as a bullet, since it sounds like strict is not enabled by default. I'm not sure how to notate that. Would something like * Default value is [] work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, did the same as with rest ... to mention there's no default

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We documented parser_options, but the actual parameter is parse_options in the code. I can submit a PR to fix the doc but is it ok to keep parse_options?

Setting XML parser options allows for more control of the parsing process.
By default the parser is non strict and thus accepts some invalid content.
Multiple options are separated by a comma (e.g. `'strict,no_warning'`),
currently supported options are:

- _strict_ - forces the parser to fail early when content is not valid xml
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karenzone could you please check these for me.
we might add more to the list later (unofficially other options work) but for now only strict is to be documented

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- _strict_ - forces the parser to fail early when content is not valid xml
- `strict` - forces the parser to fail early when content is not valid xml

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"forces the parser to fail early when content is not valid xml"

Maybe include the alternative. Something like "forces the parser to fail early instead of accumulating errors when content is not valid xml."

- _no_warning_ - allows to parse content when there are only warnings
- _no_error_ - allows to parse content on non fatal parser errors

[id="plugins-{type}s-{plugin}-remove_namespaces"]
===== `remove_namespaces`
Expand Down
34 changes: 33 additions & 1 deletion lib/logstash/filters/xml.rb
Expand Up @@ -58,6 +58,13 @@ class LogStash::Filters::Xml < LogStash::Filters::Base
#
config :xpath, :validate => :hash, :default => {}

# Supported XML parsing options are 'strict', 'no_error' and 'no_warning'.
# - strict mode turns on strict parsing rules (non-compliant xml will fail)
# - no_error and no_warning can be used to suppress errors/warnings
config :parse_options, :validate => :string
# NOTE: technically we support more but we purposefully do not document those.
# e.g. setting "strict|recover" will not turn on strict as they're conflicting

# By default the filter will store the whole parsed XML in the destination
# field as described above. Setting this to false will prevent that.
config :store_xml, :validate => :boolean, :default => true
Expand Down Expand Up @@ -110,6 +117,7 @@ def register
:error => "When the 'store_xml' configuration option is true, 'target' must also be set"
)
end
xml_parse_options # validates parse_options => ...
end

def filter(event)
Expand Down Expand Up @@ -141,11 +149,13 @@ def filter(event)

if @xpath
begin
doc = Nokogiri::XML(value, nil, value.encoding.to_s)
doc = Nokogiri::XML::Document.parse(value, nil, value.encoding.to_s, xml_parse_options)
rescue => e
event.tag(XMLPARSEFAILURE_TAG)
@logger.warn("Error parsing xml", :source => @source, :value => value, :exception => e, :backtrace => e.backtrace)
return
else
doc.errors.any? && @logger.debug? && @logger.debug("Parsed xml with #{doc.errors.size} errors")
end
doc.remove_namespaces! if @remove_namespaces

Expand Down Expand Up @@ -194,4 +204,26 @@ def filter(event)
filter_matched(event) if matched
@logger.debug? && @logger.debug("Event after xml filter", :event => event)
end

private

def xml_parse_options
return Nokogiri::XML::ParseOptions::DEFAULT_XML unless @parse_options # (RECOVER | NONET)
@xml_parse_options ||= begin
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only a question @kares, today in the docs we say that supported options are:

  • strict
  • no_warning
  • no_error
    the ones provided by Nokojiri parser.
    In the future if we update the parser version, and the parser changes/updates the list of flag without we synch the documentation, then we have a misalignment. Do you think is best to have strict check of the flags here instead of relay on the validation done by the parser library, also if it means repeating it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's the reasoning: initially I thought about only having strict => true option.
but than the options allow finer control, for whatever scenarios we might run into ...
(e.g. might wan to allow network access for the parser to download schemas etc.)

so there's more parser options we can set, I purposefully only documented these few.
maybe even no_error and no_warning might be a bit too much as what is a warning/error
depends on the parser (might not be clear).

being able to (unoficially) set all of the internal ones is the reason why I decided not to validate against a fixed set of options (that would be enumerated here).

going to remove no_error and no_warning from the docs and only document strict for now.
and checking that the official parser_options work ('strict') will be covered by specs.

does that sound okay or do you have concerns?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok, go with the strict as parser_options. My thoughts was based on the assumption that everything we provide should be documented, if we let the door open so that someone can tinkering with undocumented behaviours, then in case we change library we potentially break more pipeline configurations than expected.

parse_options = @parse_options.split(/,|\|/).map do |opt|
name = opt.strip.tr('_', '').upcase
if name.empty?
nil
else
begin
Nokogiri::XML::ParseOptions.const_get(name)
rescue NameError
raise LogStash::ConfigurationError, "unsupported parse option: #{opt.inspect}"
end
end
end
parse_options.compact.inject(0, :|) # e.g. NOERROR | NOWARNING
end
end

end
2 changes: 1 addition & 1 deletion logstash-filter-xml.gemspec
@@ -1,7 +1,7 @@
Gem::Specification.new do |s|

s.name = 'logstash-filter-xml'
s.version = '4.0.7'
s.version = '4.1.0'
s.licenses = ['Apache License (2.0)']
s.summary = "Parses XML into fields"
s.description = "This gem is a Logstash plugin required to be installed on top of the Logstash core pipeline using $LS_HOME/bin/logstash-plugin install gemname. This gem is not a stand-alone program"
Expand Down
49 changes: 49 additions & 0 deletions spec/filters/xml_spec.rb
Expand Up @@ -418,4 +418,53 @@
end
end
end

describe "parsing invalid xml" do
subject { described_class.new(options) }
let(:options) { ({ 'source' => 'xmldata', 'store_xml' => false }) }
let(:xmldata) { "<xml> <sample attr='foo' attr=\"bar\"> <invalid> </sample> </xml>" }
let(:event) { LogStash::Event.new(data) }
let(:data) { { "xmldata" => xmldata } }

before { subject.register }
after { subject.close }

it 'does not fail (by default)' do
subject.filter(event)
expect( event.get("tags") ).to be nil
end

context 'strict option' do
let(:options) { super.merge({ 'parse_options' => 'strict' }) }

it 'does fail parsing' do
subject.filter(event)
expect( event.get("tags") ).to_not be nil
expect( event.get("tags") ).to include '_xmlparsefailure'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we document this tag? In other plugins such as grok, it can be changed/customized.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to the docs the tag isn't documented.
we usually try to keep PRs such as this one minimalistic, so I think you should open an issue as a feature request to be able to configure the failure tag, on top of that (as the feature is delivered) documentation will get improved ...

end
end
end

describe "parse_options" do
subject { described_class.new(options) }
let(:options) { ({ 'source' => 'xmldata', 'store_xml' => false, 'parse_options' => parse_options }) }

context 'valid' do
let(:parse_options) { 'no_error,NOWARNING' }

it 'registers filter' do
subject.register
expect( subject.send(:xml_parse_options) ).
to eql Nokogiri::XML::ParseOptions::NOERROR | Nokogiri::XML::ParseOptions::NOWARNING
end
end

context 'invalid' do
let(:parse_options) { 'strict,invalid0' }

it 'fails to register' do
expect { subject.register }.to raise_error(LogStash::ConfigurationError, 'unsupported parse option: "invalid0"')
end
end
end
end