Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace xmlhash with Nokogiri #10085

Open
dmarcoux opened this issue Aug 27, 2020 · 9 comments
Open

Replace xmlhash with Nokogiri #10085

dmarcoux opened this issue Aug 27, 2020 · 9 comments
Labels
Frontend Things related to the OBS RoR app Refactoring 🏭

Comments

@dmarcoux
Copy link
Contributor

We currently have 2 gems for parsing XML. We either use xmlhash or Nokogiri. We should only use Nokogiri since, unlike xmlhash, it is widely adopted by the Ruby community. It's the de facto gem for parsing XML.

@dmarcoux dmarcoux added Frontend Things related to the OBS RoR app Refactoring 🏭 labels Aug 27, 2020
@dmarcoux
Copy link
Contributor Author

dmarcoux commented Nov 3, 2020

On top of parsing XML, we sometimes need to convert a XML document to a hash. In order to do this, I see two possible solutions which need to be benchmarked for performance:

  1. Set the backend of ActiveSupport::XmlMini to use Nokogiri and rely on the to_hash method on a Nokogiri document:
# Setup in an initializer under `src/api/app/config/initializers`
ActiveSupport::XmlMini.backend = 'Nokogiri'

# Usage example
my_hash = Nokogiri::XML("<some><thing></thing></some>").to_hash
  1. Install the libxml-ruby gem,set the backend of ActiveSupport::XmlMini to use libxml-ruby and rely on Hash.from_xml:
# Setup in an initializer under `src/api/app/config/initializers`
ActiveSupport::XmlMini.backend = 'LibXML'

# Usage example
my_hash = Hash.from_xml("<some><thing></thing></some>")

@dmarcoux
Copy link
Contributor Author

dmarcoux commented Apr 4, 2022

Here's a benchmark:

# 1. Add `libxml-ruby` and `ox` to the Gemfile, then run `bundle install`
# 2. osc api /source/openSUSE:Factory:Staging/dashboard/_history > src/api/history.xml
# 3. Put this file under `src/api/` and run the benchmarks with `ruby src/api/this_file.rb`

require 'benchmark'
require_relative './config/environment'

xml = File.read('history.xml')

Benchmark.bm do |benchmark|
  benchmark.report("Nokogiri") do
    ActiveSupport::XmlMini.backend = 'Nokogiri'

    10.times do
      Nokogiri::XML(xml).to_hash
    end
  end

  benchmark.report("NokogiriSAX") do
    ActiveSupport::XmlMini.backend = 'NokogiriSAX'

    10.times do
      Nokogiri::XML(xml).to_hash
    end
  end


  benchmark.report("LibXML") do
    ActiveSupport::XmlMini.backend = 'LibXML'

    10.times do
      Hash.from_xml(xml)
    end
  end

  benchmark.report("LibXMLSAX") do
    ActiveSupport::XmlMini.backend = 'LibXMLSAX'

    10.times do
      Hash.from_xml(xml)
    end
  end

  benchmark.report("Xmlhash") do
    10.times do
      Xmlhash.parse(xml)
    end
  end

  benchmark.report("Ox") do
    10.times do
      Ox.load(xml, mode: :hash)
    end
  end

  # Read and convert to a Hash and core class objects only without capturing attributes.
  # It might be interesting if we don't need the attributes in some specific cases.
  benchmark.report("Ox without attributes") do
    10.times do
      Ox.load(xml, mode: :hash_no_attrs) 
    end
  end
end

The results:

                       user       system     total     real
Nokogiri               35.284381   0.262906  35.547287 ( 35.551956)
NokogiriSAX            35.691117   0.197636  35.888753 ( 35.892601)
LibXML                 38.660122   0.154310  38.814432 ( 38.819391)
LibXMLSAX              35.097896   0.149209  35.247105 ( 35.250829)
Xmlhash                8.235180    0.078005  8.313185  (  8.313496)
Ox                     11.775726   0.129731  11.905457 ( 11.905924)
Ox without attributes  1.469755    0.015284  1.485039  (  1.485318)

@coolo
Copy link
Member

coolo commented Apr 4, 2022

If you already have nokogiri around, using it instead of the default rexml backend is surely wise - but for completness: I get different results for your XML and there are 2 more xml_mini backends to try:

            user     system      total        real
Nokogiri     1.677316   0.009611   1.686927 (  1.687014)
NokogiriSAX  1.504316   0.001340   1.505656 (  1.505848)
LibXML       1.283016   0.004645   1.287661 (  1.287808)
LibXMLSAX    1.243031   0.004480   1.247511 (  1.247673)
Xmlhash      0.346985   0.000037   0.347022 (  0.347025)

But let's not compare artificial XML, let's benchmark the parsing of /source/openSUSE:Factory:Staging/dashboard/_history (10 times)

Nokogiri 29.944858   0.273983  30.218841 ( 30.221594)
LibXML   30.535395   0.048856  30.584251 ( 30.586223)
Xmlhash   7.832307   0.071991   7.904298 (  7.904410)

So for both measures, XmlMini is four to give times slower.

Btw: your Nokogiri::XML(xml).to_hash is a completely different datastructure, it's still Hash.from_xml(xml) even with nokogiri as backend.

@dmarcoux
Copy link
Contributor Author

dmarcoux commented Apr 4, 2022

Would you be so kind to share the benchmarks themselves, so the actual Ruby script? Because numbers by themselves don't mean much.

@coolo
Copy link
Member

coolo commented Apr 4, 2022

I just extended yours.

benchmark.report("Nokogiri") do
    ActiveSupport::XmlMini.backend = 'Nokogiri'

    10_000.times do
      Nokogiri::XML(xml).to_hash
    end
  end

    benchmark.report("NokogiriSAX") do
    ActiveSupport::XmlMini.backend = 'NokogiriSAX'

    10_000.times do
      Nokogiri::XML(xml).to_hash
    end
  end


  benchmark.report("LibXML") do
    ActiveSupport::XmlMini.backend = 'LibXML'

    10_000.times do
      Hash.from_xml(xml)
    end
  end

    benchmark.report("LibXMLSAX") do
    ActiveSupport::XmlMini.backend = 'LibXMLSAX'

    10_000.times do
      Hash.from_xml(xml)
    end
  end

  benchmark.report("Xmlhash") do
    10_000.times do
      Xmlhash.parse(xml)
    end
  end

And for the history parsing, xml = File.open('history.xml').read() (with osc api /source/openSUSE:Factory:Staging/dashboard/_history > history.xml) - but 10.000 is way too much for that, so went with 10

@dmarcoux
Copy link
Contributor Author

dmarcoux commented Apr 4, 2022

I updated my comment with the benchmarks to update the script and the results. I've found another possible alternative Ox which is now in the benchmarks.

@coolo
Copy link
Member

coolo commented Apr 4, 2022

BTW: my interest in the dashboard _history parsing: visiting https://build.opensuse.org/package/show/openSUSE:Factory:Staging/dashboard loads and parses it for whatever reason

@coolo
Copy link
Member

coolo commented Apr 5, 2022

Ox's returned hash is puzzling (as we experimented further with xmlhash I wanted to look how others handle it):

irb(main):007:0> Ox.load('<a><download test="hallo?hallo&amp;a=1"/><download/></a>', mode: :hash)
=> {:a=>{:download=>[[{:test=>"hallo?hallo&a=1"}], nil]}}

With xmlhash you do

Xmlhash.parse(xml).elements('download') { |e| puts e.value('test') }
hallo?hallo&a=1
              

With Ox's hash if you iterate you have to cope with the nils. So adopting the xmlhash usage to Ox would be quite involved I fear.

@coolo
Copy link
Member

coolo commented Apr 5, 2022

#12406 for the big history parsing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Frontend Things related to the OBS RoR app Refactoring 🏭
Projects
None yet
Development

No branches or pull requests

2 participants