New algorithm for plain text values #168

Zegnat · 2018-03-26T10:30:01Z

Per the change control, I previously opened an issue to standardise textContent and proposed a resolution. This PR implements that resolution in the parser, so it can be tested and iterated on before possibly being included in the specification.

This PHP parser was already using a special innerText method, but no other parser adopted it, nor did anyone seem to want to write it out as part of the microformats parsing specification. The method was based on a text function of microformat-shiv, which in turn emulated Internet Explorer behaviour.

Things of note:

This replaces the old textContent and innerText methods. There is no replacement for innerText; the new textContent is the public method for extracting a plain text value from an element.

The second new method, elementToString, is private, as it should not be called outside of textContent. It exists as a separate method only so that it can call itself recursively.
Whenever textContent is called it is no longer wrapped in a unicodeTrim call. Trimming is handled by the algorithm itself. If it turns out the current trimming in the algorithm isn’t sufficient in practice, we should revise the algorithm.
The new PlainTextTest currently validates all 9 examples from aaronpk/microformats-whitespace-tests.
This broke 3 parser tests, which have been resolved:
1. ParseImpliedTest::testParsesImpliedNameConsistentWithPName expected a line break in the name property. With the new algorithm, line breaks are collapsed into spaces, the same way browsers do.
2. ParserTest::testParseEResolvesRelativeLinks expected two spaces in the plain text value of the content property. With the new algorithm, consecutive spaces are collapsed into a single one, the same way browsers do.
3. ParserTest::testHtmlEncodesImpliedProperties was… just wrong? It expected only the string <name> as the value of the name property through implied rules. And somehow it had to sidestep the <img> element completely to do so. I don’t know why the previous parsing even allowed that.

Zegnat · 2018-03-26T10:32:15Z

Let's first get 0.4.2 out before considering this for merge.

It would be interesting if @aaronpk could test this branch out in his reader prior to merging, and get some data on how well it performs with posts that previously troubled him.

aaronpk · 2018-03-26T16:20:58Z

If we publish this as 0.4.3-alpha I can relatively easily run it on Aperture for a while to see how it works.

gRegorLove · 2018-03-26T20:01:25Z

Awesome work, @Zegnat! 🎉

Zegnat added 2 commits March 26, 2018 11:35

Introduce new way of getting plain text values from HTML elements

e1de3ea

Fix three tests that failed with the new algorithm

7dbe03d

aaronpk merged commit e8da04f into microformats:master Mar 29, 2018

gRegorLove mentioned this pull request Jul 7, 2018

Fix <img> handling in implied p-name #180

Closed
