Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What should mf2 textContent parsing result in? User expectation vs. DOM specification. #15

Open
Zegnat opened this issue Jan 12, 2018 · 14 comments

Comments

Projects
None yet
7 participants
@Zegnat
Copy link
Member

commented Jan 12, 2018

Summary

At several points the parsing specification says to return the textContent, but it never defines what this means. I personally always assumed the DOM textContent property for the current element, but this does not seem to match with what parsers have been doing.

Discussion

@aaronpk wrote a blogpost today containing the following, emphasis mine:

I think my only solution for this is going to be to create my own plaintext value out of the sanitized HTML. Unfortunately, that is not a straightforward process, as demonstrated by this relatively long function that does this in the PHP parser. However that might be the technically better option anyway, since XRay can’t be sure exactly what method was used to generate the plaintext value from the original HTML anyway.

I replied to the emphasised statement in chat:

DOM’s textContent should be used, IIRC, else the parser is broken.

This started a discussion in the #indieweb-dev chat that is best read in the chat logs. The discussion continued in the #micoformats chat. The important take-away is that the PHP parser includes its own text extraction implementation, after an issue was filed by a user that was missing expected white space in the output.

It turned out that the JavaScript parser (glennjones/microformat-shiv) was already doing something like that.

The important part here is user expectation. The user who opened the issue on the PHP parser was expecting to see a line break in the plain text value where a <br> used to be. It is also what aaronpk would expect. From chat:

no, I would definitely expect newlines in the plaintext
given that's how a browser will render it
and if you copypaste the text from the browser it will have newlines

I don’t have any real personal preference. I do feel that the parsing specification should define what it wants to guarantee compatibility between parsers.

If we end up defining our own textContent algorithm for HTML→plain-text, I do think we should take a good look at what browsers are doing. Especially plain text browsers such as lynx and w3m.

Parser behaviour

Test:

<div class="h-entry"><p>Wow<br><span>This</span></p><p>Is Interesting</p></div>

Tested through microformats.io. Output shortened to only the affected h-entry. Node and Ruby were not available for testing.

PHP

        {
            "type": [
                "h-entry"
            ],
            "properties": {
                "name": [
                    "Wow\nThis Is Interesting"
                ]
            }
        }

Python

  {
   "type": [
    "h-entry"
   ], 
   "properties": {
    "name": [
     "WowThisIs Interesting"
    ]
   }
  }

Go

    {
      "type": [
        "h-entry"
      ],
      "properties": {
        "name": [
          "WowThisIs Interesting"
        ]
      }
    }

@Zegnat Zegnat changed the title Define the value of textContent. What should mf2 textContent parsing result in? User expectation vs. DOM specification. Jan 12, 2018

@gRegorLove

This comment has been minimized.

Copy link
Member

commented Jan 12, 2018

Node

via https://glennjones.net/tools/microformats/ ("Experimental ‐ Text white-space collapsing" option not checked, though even when it's checked it does not seem to change with this example.)

{
    "type": ["h-entry"],
    "properties": {
        "name": ["WowThisIs Interesting"]
    }
}
@aaronpk

This comment has been minimized.

Copy link
Member

commented Jan 12, 2018

The actual result I would expect is what is rendered by a browser:

Wow
This

Is Interesting

Here it is in Lynx

screenshot 2018-01-12 11 00 17

https://pin13.net/mf2/?id=20180112185913831

@Zegnat

This comment has been minimized.

Copy link
Member Author

commented Mar 23, 2018

It only took 2 months, but I have written a draft specification for handling textContent in the microformats context.

The algorithm combines:

I have also implemented it in JavaScript so it can be live tested and hereby announce I am willing to implement it into php-mf2 ASAP.

Please have a look at the text content from HTML page for the live test and the algorithm.

I will probably be moving it to the microformats wiki Soon™. It can then be linked to from other specs. E.g. #20 could be fixed simply by having the vcp spec point at the text content algorithm for its “innertext”.

@Zegnat

This comment has been minimized.

Copy link
Member Author

commented Mar 25, 2018

While implementing this in PHP, I ran into a little snag where a line break was being preserver at the start of the resulting string. So I have updated the algorithm to strip “leading and trailing ASCII whitespace from output” instead of removing “any leading and trailing U+0020 SPACE code points from output.”

I am now throwing more tests at it to see if I should just use ASCII whitespace more often than limiting actions to just spaces etc.

@Zegnat

This comment has been minimized.

Copy link
Member Author

commented Mar 29, 2018

Thought: @sknebel just wondered if this should be tweaked so whitespace is not collapsed within PRE elements. I wonder what the user expectation is there.

@kartikprabhu

This comment has been minimized.

Copy link
Member

commented Apr 1, 2018

How does one deal with <pre> elements
Example:

<article class="h-entry">
    <div class="e-content">
    <p>Hello<br>
    World</p>
    <pre>
    this is some pre formatted text

	this is more pre formatting
    </pre>
    </div>
</article>

what should be the content>value?

@Zegnat

This comment has been minimized.

Copy link
Member Author

commented Apr 3, 2018

According to my browser’s innerText method the plain text of e-content in that example is:

Hello
World

    this is some pre formatted text

	this is more pre formatting
    

Or the following after JSON.stringify (to clearly show whitespace):

"Hello\nWorld\n\n    this is some pre formatted text\n\n\tthis is more pre formatting\n    "
@jgarber623

This comment has been minimized.

Copy link
Member

commented Jul 5, 2018

Adding a link to issue #83 on indieweb/microformats-ruby filed by @aaronpk back in March of this year, which is related to whitespace parsing.

@snarfed

This comment has been minimized.

Copy link
Member

commented Jul 21, 2018

@Zegnat this is great! what's the status? time to move this to the microformats wiki?

@gRegorLove

This comment has been minimized.

Copy link
Member

commented Jul 25, 2018

In #microformats today there was some discussion about <pre> and whitespace. The results of this algorithm don't match with @aaronpk's whitespace-test #11 (from what I can tell that test came after the algorithm).

@snarfed's example as parsed by latest php-mf2:

<div class="h-entry">
  <div class="e-content p-name">
    Hello World
    <pre>
      one
      two
      three
    </pre>
  </div>
</div>
"items": [
    {
        "type": [
            "h-entry"
        ],
        "properties": {
            "name": [
                "Hello World one two three"
            ],
            "content": [
                {
                    "html": "Hello World\n    <pre>\n      one\n      two\n      three\n    <\/pre>",
                    "value": "Hello World one two three"
                }
            ]
        }
    }
]

I don't have a strong opinion about what's "right" here, but at a glance that p-name and e-content.value without newlines and tabs looks nice from a (hypothetical) consumer perspective.

@snarfed

This comment has been minimized.

Copy link
Member

commented Jul 25, 2018

hrm. overriding pre in name might be ok, but almost certainly not in content.value. whitespace inside pre is meaningful the same way tags like br are meaningful, and need to be preserved, as lots of people have argued both here and in applications like bridgy.

@sknebel

This comment has been minimized.

Copy link
Member

commented Jul 25, 2018

Whitespace in <pre> IMHO has to be preserved, with the exception of stripping newlines at the start if it is at the start of the property value and stripping whitespace at the end if it is at the end.

Copying comment I made on a PR regarding <pre>:

I believe there should be a newline before and after <pre>, because <pre> is by default in browsers and in the use cases I can think of in actual posts (code, ASCII art) styled as display: block, meaning it does not stand in line with other content. From my testing, these newlines get collapsed if the <pre>content already starts/ends with one.

@Zegnat

This comment has been minimized.

Copy link
Member Author

commented Jul 25, 2018

@snarfed:

@Zegnat this is great! what's the status? time to move this to the microformats wiki?

Yes, I want to move it to the wiki ASAP, and update it with some of @kartikprabhu’s work. So we can do further iteration of the algorithm there, and have its history preserved. I had planned to have it done already, but currently on holiday and internet connectivity has been spotty.

I should be back this coming weekend and will be catching up on all things microformats next week!

@gRegorLove

This comment has been minimized.

Copy link
Member

commented Jul 25, 2018

@snarfed @sknebel Good points. To keep things simple, I would prefer not adding an exception for p-name. +1 for preserving whitespace inside <pre>

sknebel pushed a commit to sknebel-forks/microformats-whitespace-tests that referenced this issue Jul 25, 2018

Sven Knebel
Add tests showing <pre> on same line as other content
Per recent discussions (microformats/microformats2-parsing#15 (comment)) and my proposal there to treat <pre> similar to <p>/display: inline-block as in browsers
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.