<br> tags are not interpreted as whitespace when converting HTML to plaintext #83
This is very apparent on Tantek's autoformatted posts. Compare the
Here's a bit of code that might be useful and/or spark some ideas. It uses the aforementioned page on Tantek's website and assumes we're only interested in
Save the following to a file (e.g.
require 'net/http'
require 'nokogiri'

@doc = Nokogiri::XML(Net::HTTP.get(URI('http://tantek.com/2018/061/t2/improving-test-suite-home-pages')))

@doc.css('.e-content br').each do |node|
  node.replace(Nokogiri::XML::Text.new("\n", @doc))
end

puts @doc.css('.e-content').text
The output should look like:
That's a pretty gnarly bit of code to drop in a bunch of places, so it might be worth adding a
What do you think? Seem like a workable solution?