Error caused by empty xml elements being represented as an empty hash and then output to elasticsearch #24

Closed
ChrisMagnuson opened this issue Jan 28, 2016 · 1 comment

Comments

@ChrisMagnuson

Quick summary

The default behavior of the xml plugin is to represent empty xml elements as an empty hash, {}.

When used with the elasticsearch output plugin, this results in empty xml elements being mapped as properties of type object.

When a subsequent document that has that same xml element populated with a value is output, you get the following error: "object mapping for [fieldName] tried to parse field [null] as object, but found a concrete value".

I have submitted pull request #23 with the code changes, including unit tests, that add a suppress_empty option to resolve this issue.

Detailed summary

The default behavior of the xml plugin is to represent empty xml elements as an empty hash, {}.

When you use the elasticsearch output plugin with logstash- as the prefix of the index name, the plugin applies its default index template when it creates the index.

Using this logstash configuration:

input {
    stdin {
    }
}
filter {
    xml {
        target => ParsedXML
        source => message
    }
}
output {
    stdout{ codec => rubydebug }

    elasticsearch { 
        hosts => localhost
        index => "logstash-test"
    }
}

and then pasting this xml sample into the terminal window followed by hitting enter once:

<Address><AddressLine1>555 Some Address</AddressLine1><AddressLine2></AddressLine2></Address>
<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Address>

results in this error message:

Failed action.  {:status=>400, :action=>["index", {:_id=>nil, :_index=>"logstash-test", :_type=>"logs", :_routing=>
nil}, #<LogStash::Event:0x68f2a46a @metadata_accessors=#<LogStash::Util::Accessors:0x68905b47 @store={}, @lut={}>, @canc
elled=false, @data={"message"=>"<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</Addres
sLine2></Address>\r", "@version"=>"1", "@timestamp"=>"2016-01-28T14:18:43.967Z", "host"=>"cmagnuson-lt", "ParsedXML"=>{"
AddressLine1"=>["333 Some Address"], "AddressLine2"=>["Apartment 12"]}}, @metadata={}, @accessors=#<LogStash::Util::Acce
ssors:0x466367b @store={"message"=>"<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</Ad
dressLine2></Address>\r", "@version"=>"1", "@timestamp"=>"2016-01-28T14:18:43.967Z", "host"=>"cmagnuson-lt", "ParsedXML"
=>{"AddressLine1"=>["333 Some Address"], "AddressLine2"=>["Apartment 12"]}}, @lut={"host"=>[{"message"=>"<Address><Addre
ssLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Address>\r", "@version"=>"1", "@timest
amp"=>"2016-01-28T14:18:43.967Z", "host"=>"cmagnuson-lt", "ParsedXML"=>{"AddressLine1"=>["333 Some Address"], "AddressLi
ne2"=>["Apartment 12"]}}, "host"], "message"=>[{"message"=>"<Address><AddressLine1>333 Some Address</AddressLine1><Addre
ssLine2>Apartment 12</AddressLine2></Address>\r", "@version"=>"1", "@timestamp"=>"2016-01-28T14:18:43.967Z", "host"=>"cm
agnuson-lt", "ParsedXML"=>{"AddressLine1"=>["333 Some Address"], "AddressLine2"=>["Apartment 12"]}}, "message"], "Parsed
XML"=>[{"message"=>"<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Add
ress>\r", "@version"=>"1", "@timestamp"=>"2016-01-28T14:18:43.967Z", "host"=>"cmagnuson-lt", "ParsedXML"=>{"AddressLine1
"=>["333 Some Address"], "AddressLine2"=>["Apartment 12"]}}, "ParsedXML"], "type"=>[{"message"=>"<Address><AddressLine1>
333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Address>\r", "@version"=>"1", "@timestamp"=>"2
016-01-28T14:18:43.967Z", "host"=>"cmagnuson-lt", "ParsedXML"=>{"AddressLine1"=>["333 Some Address"], "AddressLine2"=>["
Apartment 12"]}}, "type"]}>>], :response=>{"create"=>{"_index"=>"logstash-test", "_type"=>"logs", "_id"=>"AVKImcTNxEzXzs
2VmWC5", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"object mapping for [ParsedXML.AddressLi
ne2] tried to parse field [null] as object, but found a concrete value"}}}, :level=>:warn}

The most important part of this message is the error itself: "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"object mapping for [ParsedXML.AddressLine2] tried to parse field [null] as object, but found a concrete value"}.

The first xml record passed to elasticsearch resulted in the following properties because of the elasticsearch index template used by the elasticsearch output plugin:

 "properties": {
      "ParsedXML": {
        "properties": {
          "AddressLine2": {
            "type": "object"
          },
          "AddressLine1": {
            "fielddata": {
              "format": "disabled"
            },
            "norms": {
              "enabled": false
            },
            "type": "string",
            "fields": {
              "raw": {
                "ignore_above": 256,
                "index": "not_analyzed",
                "type": "string"
              }
            }
          }
        }
      }

When this first record is output to stdout with the rubydebug codec, it looks like this:

       "message" => "<Address><AddressLine1>555 Some Address</AddressLine1><AddressLine2></AddressLine2></Address>\r",
      "@version" => "1",
    "@timestamp" => "2016-01-28T14:18:43.078Z",
          "host" => "cmagnuson-lt",
     "ParsedXML" => {
        "AddressLine1" => [
            [0] "555 Some Address"
        ],
        "AddressLine2" => [
            [0] {}
        ]
    }
}

You can see that AddressLine2 is represented as an empty hash, {}, and that the resulting property in elasticsearch is "type": "object".

When the next xml record is sent to elasticsearch, it results in an error because AddressLine2 now has a string value and elasticsearch cannot change the mapping of an existing property from object to string.
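
To make the sequence of events concrete, here is a minimal Python sketch of the conflict. It is a toy model only, not how elasticsearch actually implements dynamic mapping; the names infer_type, index_document and MapperParsingError are hypothetical and exist only for this illustration. It assumes the first value seen for a field fixes that field's type.

```python
# Toy model of dynamic mapping, NOT elasticsearch's real implementation.
# All names here are hypothetical and used only for illustration.


class MapperParsingError(Exception):
    """Raised when a value contradicts the type already recorded for a field."""


def infer_type(value):
    """Guess a field type from a value, loosely imitating dynamic mapping."""
    if isinstance(value, list):
        # An array takes the type of its first element; an empty array has none.
        return infer_type(value[0]) if value else None
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        return "string"
    return type(value).__name__


def index_document(mapping, fields):
    """Record a type for each new field and reject conflicting values."""
    for name, value in fields.items():
        seen = infer_type(value)
        if seen is None:
            continue
        known = mapping.setdefault(name, seen)
        if known != seen:
            raise MapperParsingError(
                f"field [{name}] is mapped as {known}, but found {seen}"
            )


mapping = {}

# First record: the empty element arrives as [{}], so the field becomes an object.
index_document(mapping, {"AddressLine1": ["555 Some Address"], "AddressLine2": [{}]})

# Second record: the same field now carries a string, which conflicts.
conflict = None
try:
    index_document(
        mapping,
        {"AddressLine1": ["333 Some Address"], "AddressLine2": ["Apartment 12"]},
    )
except MapperParsingError as err:
    conflict = str(err)

print(mapping)
print(conflict)
```

The point of the sketch is that the failure is decided by the first document: once the empty hash has fixed AddressLine2 as an object, every later string value for that field is rejected until the index (and its mapping) is recreated.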

The underlying XmlSimple library has an option to suppress empty elements so that they don't show up in the output.

I have updated the xml filter to support a suppress_empty boolean property that allows for the following logstash configuration:

input {
    stdin {
    }
}
filter {
    xml {
        target => ParsedXML
        source => message
        suppress_empty => true
    }
}
output {
    stdout{ codec => rubydebug }

    elasticsearch { 
        hosts => localhost
        index => "logstash-test"
    }
}
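
The effect of the option on the parsed structure can be sketched in Python. This is an illustration only: it uses Python's xml.etree.ElementTree rather than the Ruby library the filter relies on, the function xml_to_dict is hypothetical, and it handles a single level of child elements, which is enough for the sample records in this issue.

```python
import xml.etree.ElementTree as ET


def xml_to_dict(xml_text, suppress_empty=False):
    """Convert the root's direct children into {tag: [values]} (illustrative only)."""
    root = ET.fromstring(xml_text)
    result = {}
    for child in root:
        text = (child.text or "").strip()
        if text:
            result.setdefault(child.tag, []).append(text)
        elif not suppress_empty:
            # Default behaviour described in this issue: empty element becomes {}.
            result.setdefault(child.tag, []).append({})
        # With suppress_empty=True the empty element is left out entirely.
    return result


SAMPLE = (
    "<Address><AddressLine1>555 Some Address</AddressLine1>"
    "<AddressLine2></AddressLine2></Address>"
)

default_result = xml_to_dict(SAMPLE)
suppressed_result = xml_to_dict(SAMPLE, suppress_empty=True)

print(default_result)
print(suppressed_result)
```

With the flag off, AddressLine2 carries an empty hash; with the flag on, the key is absent, so no mapping is created for it until a record with a real value arrives.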

Now, after deleting the index to get rid of the incorrect mapping, processing the same xml records again produces the following output with no errors:

Logstash startup completed
<Address><AddressLine1>555 Some Address</AddressLine1><AddressLine2></AddressLine2></Address>
<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Address>{
       "message" => "<Address><AddressLine1>555 Some Address</AddressLine1><AddressLine2></AddressLine2></Address>\r",
      "@version" => "1",
    "@timestamp" => "2016-01-28T15:25:31.657Z",
          "host" => "cmagnuson-lt",
     "ParsedXML" => {
        "AddressLine1" => [
            [0] "555 Some Address"
        ]
    }
}

{
       "message" => "<Address><AddressLine1>333 Some Address</AddressLine1><AddressLine2>Apartment 12</AddressLine2></Address>\r",
      "@version" => "1",
    "@timestamp" => "2016-01-28T15:25:32.623Z",
          "host" => "cmagnuson-lt",
     "ParsedXML" => {
        "AddressLine1" => [
            [0] "333 Some Address"
        ],
        "AddressLine2" => [
            [0] "Apartment 12"
        ]
    }
}

Both records were properly parsed and stored as documents in elasticsearch.

The resulting properties now look like you would expect:

"properties": {
      "ParsedXML": {
        "properties": {
          "AddressLine2": {
            "fielddata": {
              "format": "disabled"
            },
            "norms": {
              "enabled": false
            },
            "type": "string",
            "fields": {
              "raw": {
                "ignore_above": 256,
                "index": "not_analyzed",
                "type": "string"
              }
            }
          },
          "AddressLine1": {
            "fielddata": {
              "format": "disabled"
            },
            "norms": {
              "enabled": false
            },
            "type": "string",
            "fields": {
              "raw": {
                "ignore_above": 256,
                "index": "not_analyzed",
                "type": "string"
              }
            }
          }
        }
      },

Pull request #23 has been submitted to add this feature and resolve this error.

As an aside, I think something other than an empty hash should be the default. I would not expect to have to configure anything special in order to output xml to elasticsearch when some documents have empty elements and others do not.

@wiibaa
Contributor

wiibaa commented May 24, 2016

Fixed by #32 and #34

@suyograo can you close please
