<br> tags are not interpreted as whitespace when converting HTML to plaintext #83

Open
aaronpk opened this Issue Mar 3, 2018 · 2 comments

aaronpk commented Mar 3, 2018

Similar to microformats/mf2py#51 and microformats/php-mf2#69, the Ruby parser is stripping <br> tags rather than converting them to newlines when converting HTML to plaintext.

This is very apparent on Tantek's autoformatted posts. Compare the name and content.value for one of his posts:

@aaronpk aaronpk added microformats and removed microformats labels Mar 3, 2018

dissolve added a commit that referenced this issue Mar 29, 2018


dissolve commented Mar 29, 2018

This is pretty tricky, since it's how Nokogiri itself converts HTML to text, so fixing it basically means rewriting the HTML-to-text conversion :(

maybe another library does this better


jgarber623 commented Jun 21, 2018

@aaronpk @dissolve I have a possible solution to this issue, but I don't know enough about the codebase to know where to make the changes.

Here's a bit of code that might be useful and/or spark some ideas. It uses the aforementioned page on Tantek's website and assumes we're only interested in .e-content. That's a narrowed use case for demonstration purposes, of course.

Save the following to a file (e.g. ~/to_plaintext.rb) and run ruby ~/to_plaintext.rb in a Terminal:

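require 'net/http'
require 'nokogiri'

# Fetch the page and parse it with the HTML parser (the page is HTML, not generic XML).
@doc = Nokogiri::HTML(Net::HTTP.get(URI('http://tantek.com/2018/061/t2/improving-test-suite-home-pages')))

# Swap every <br> inside .e-content for a text node holding a newline.
@doc.css('.e-content br').each do |node|
  node.replace(Nokogiri::XML::Text.new("\n", @doc))
end

# With the <br> tags replaced, #text keeps the line breaks.
puts @doc.css('.e-content').text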

The output should look like:

Appreciate the explanation and link to the source file; makes sense.

However there is still a fundamental usability problem of the discoverability of how to file issues and suggested improvements for CSS module test suites.

I would like to suggest improving the generated test suite home pages themselves (e.g. http://test.csswg.org/suites/css-cascade-3_dev/nightly-unstable/) to link directly to https://github.com/w3c/web-platform-tests/ and suggest searching it for any source files one might want to file issues (or contribute patches) for, as you demonstrated in your comment (which I will do shortly for the cascade-import-002.htm source file specifically, thanks for the pointer. Update, done: https://github.com/w3c/web-platform-tests/issues/9910).

Note: the existing text of "More information on the contribution process and test guidelines is available on the wiki page." is not really useful, as the "wiki page" that is linked to (http://wiki.csswg.org/test) has A TON of links (a maze of passages that all appear alike if you will), none of which contain the precisely useful advice that you gave in your comment! Nor is it readily obvious how to fix http://wiki.csswg.org/test as it seems to serve many purposes, and the two likely links "How to Contribute" and "Reviewing Tests" both say on their pages: "This page has been deprecated and is no longer being maintained." with a top-level link to http://web-platform-tests.org/ which is also not useful, and that's already three clicks deep (if you guessed right which links to click) with still no answer as to how to contribute to this specific module's test suite.

Where should I file an issue and/or patch for the template or generation of the home pages of CSS module test suites like the specific page http://test.csswg.org/suites/css-cascade-3_dev/nightly-unstable/?

Thanks!

☝️ Note the line breaks, which under the hood are \n characters. Success!

That's a pretty gnarly bit of code to drop in a bunch of places, so it might be worth adding a to_plaintext method (private or not) in one of the gem's classes (FormatParser, maybe?) so it can be used more frequently (akin to to_hash, to_json, etc.).

What do you think? Seem like a workable solution?
