Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue237 html5 polyglot: Make default.html5 polyglot markup conformant. #3473

Merged
merged 6 commits into from Mar 4, 2017
Merged

Issue237 html5 polyglot: Make default.html5 polyglot markup conformant. #3473

merged 6 commits into from Mar 4, 2017

Conversation

JohnLukeBentley
Copy link
Contributor

@JohnLukeBentley JohnLukeBentley commented Feb 27, 2017

This is a pull request (pr) to make default.html5 polyglot markup conformant. default.html4 is left alone.

Issue jgm/pandoc-templates#237, "default.html5. Polyglot Markup", (the first two commit messages wrongly reference #237 in jgm/pandoc) contains a prior discussion although the resulting conclusions are expressed here, in the pr comments and commit messages. The following commits are intended to be pushed to this PR. Those ticked have been pushed to this PR:

  • html5. Polyglot. html element. Add xml namespace.
  • html5. Polyglot. Explicitly close meta tags.
  • html5. Polyglot. html element. Add xml:lang. lang values default to blank.
  • html5. Polyglot. Add default title value: "My Title".
  • html5. Polyglot. Add default title value: "Untitled".

In-principle consensus has been reached in issue jgm/pandoc-templates#237 on the changes contained in the first two commits. Consensus has not yet been achieved on the changes contained in the last two commit (yet to be pushed).

Polyglot Markup is a W3C standard: https://www.w3.org/TR/html-polyglot/. Polyglot Markup essentially specifies a syntax that facilitates switching between serving a web document either as HTML or as XHTML, the later requiring a document to be "well-formed" in XML terms. Without these changes default.html5 currently results in a web file that can't be served as XHTML.

Although Polyglot Markup results in a XHTML syntax the specification is not "Use XHTML syntax over HTML syntax", rather it is (using my own words):

Use markup, and other constraints, common to both XHTML syntax and HTML syntax such that you can serve the same document with either a XHTML mime type or a HTML mime type.

(In practice the mime type is generally set by changing the file extension of your (x)html file: your web server generally looks at the file extension and sends the corresponding mime type in the response header).

These commits either include changes that:

  • Are explicitly required by the Polyglot Markup spec to make the output from default.html5 polyglot markup conformant; or
  • Where the Polyglot Markup spec is silent, ambiguous, or wrong: make the output from default.html5 polyglot markup conformant if the spec was corrected.

This is my first in-production git and github pull request.

References:

Edit 01: Struck out "the first two commit messages wrongly reference #237 in jgm/pandoc" due to rebase.

Edit02: Changed last commit message first line to "html5. Polyglot. Add default title value: "Untitled"."

Edit03: struck out "html5. Polyglot. Add default title value: "Untitled"."

@JohnLukeBentley
Copy link
Contributor Author

From https://github.com/jgm/pandoc-templates/pull/248

Thanks! We've changed things so that this repo is now a subtree of pandoc rather than a submodule. Hence changes should be made on jgm/pandoc rather than here.

Your PR will also need to adjust some test output in pandoc (tests will fail and show you what needs changing).

It would be great if you could make the PR against jgm/pandoc instead; sorry for the trouble.

No worries. I suppose part of your motivation for the change it to be able to offer the continuous integration tests against the templates (as for the rest of pandoc).

As you can see I've modified the some of the test files (writer.html5, lhs-test.html+lhs, lhs-test.html) in response to a previous run of your CI setup. All the tests now pass except for number 6, under "Allowed Failures". I'll presume "allowed failures" applies in this case until you tell me otherwise.

Thinking out aloud, I wonder:

  • Whether I ought install Haskell and cabal locally (I don't know Haskell and so I'm going to be unmotivated to contribute to the main pandoc source).
  • Whether I ought write some tests myself that hook into your testing suite, under pandoc\test\ (I have my own local tests, outside of the pandoc directory).

One question for you:

  • Is it correct that the three test files I've changed (writer.html5, lhs-test.html+lhs, lhs-test.html) are only run only against default.html5 and not default.html4?

If "All checks have passed" is reliable in this situation you'd be welcome to now assess my (content) commits before I push the remaining content and concomitant test file commits.

I'm off for the night.

@jgm
Copy link
Owner

jgm commented Feb 27, 2017

This is my first in-production git and github pull request.

Congrats! Excellent PR so far (I'll wait to merge until you make the other changes -- or if you like I can merge now and you can do a new PR for the others).

Your assumptions are correct, the lhs-test.html and lhs-test.html+lhs are run against html5 (since that is now the default 'html'). (These date from back when we only had 'html'.)

The commit messages are excellent, I appreciate the links and detail. Please keep them hard-wrapped to 80 columns, though. (No big deal if it's hard to change that now, but going forward.)

I'm okay with the lang attribute proposal, but I'm not sold on the title proposal. I'd rather just leave it empty (as with the lang attribute).

If you think you might make more contributions, it's probably worth installing Haskell locally. I recommend installing stack, which will give you everything you need (assuming you have lots of disk space). stack setup in the pandoc directory will even download an appropriately version of the compiler. After that stack test will build with tests.

Who knows? You might get into Haskell. The org writer was written by someone with no prior exposure, who just looked at the code for another writer.

@JohnLukeBentley
Copy link
Contributor Author

Thanks for all that!

Yeah, keep the PR open and I'll push more commits. I'll push the lang commit (in a day's time) and get you to look at it.

I know that by convention the first line of commit messages ought be less than 50 chars. (and you mention this in CONTRIBUTING.md) ... by my overstepping that rule I'm reminded of it.

I wasn't aware of the 80 char hard wrap convention of the commit message overall. So thanks for mentioning it. Having been alerted to it, I see why that's a good convention when I git log master..head from the command line. My Git GUI client was soft wrapping my commit messages for me, and the commit message body is soft wrapped on Github, so I was unaware of the need for hard wrapping.

I'm disinclined to correct the current commit messages given my recent investigations into the problems in doing so (with an interactive rebase and force push): http://stackoverflow.com/questions/42405984/is-git-push-force-ever-harmful-when-applied-to-your-origin-contributor-fork (feel free not to read any of that, or correct any misunderstanding there). On the other hand I'm quite able to interactively rebase and force push, and I suppose we can be reasonably confident no one has pulled down JohnLukeBentley:issue237-html5-polyglot (if you haven't), so if you'd rather I clean up my messages I'd be happy to make the attempt (that might teach me something). Either way I intend to hard-wrap in the future.

The title commit message (not yet pushed) is intended to be as follows. Is it persuasive?

html5. Polyglot. Add default title value: "My Title".

One in a series of commits to make default.html5 polyglot markup conformant. This changes default.html5 only.

If an author provides a title in their source document, that value will be used by default.html5 otherwise a non empty default, I've chosen "My Title", will be output.

The Polyglot Markup spec lists its "Minimal HTML template" with an empty <title> element. https://www.w3.org/TR/html-polyglot/#minimal-polyglot-html-document

As in:

<head>
  <title></title>
</head>

However, by the HTML 5 spec <title> can't contain inter-element whitespace (which entails an empty element). See https://www.w3.org/TR/html5/document-metadata.html#the-title-element

Content model:
Text that is not inter-element whitespace.

https://validator.w3.org/nu/ enforces that rule. That is, ...

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
  <title></title>
</head>
<body>
</body>
</html>

... will return ...

Error: Element title must not be empty.

Therefore, if the Polyglot Markup spec was correctly written it would require a title value.


In addition to having a default title value I recommend that, when an author does not provide a title value: on pandoc command execution a warning message like the following be provided:

The HTML 5 spec stipulates a content model for the title element as "Text that is not inter-element whitespace" (https://www.w3.org/TR/html/document-metadata.html#the-title-element). It is therefore recommended you specify a title in your source document. Otherwise "My Title" will be output.


Issue jgm/pandoc-templates#237, "default.html5. Polyglot Markup", contains a prior discussion although the relevant resulting conclusion is expressed here (and in the pull request).

@JohnLukeBentley
Copy link
Contributor Author

Without wanting to hurry you along (take all the time you need) I thought I clarify that I'm waiting on you to:

  1. Either:
    a. To confirm that you've not fetched this pull request (and we can be reasonably confident no one else has) to facilitate my rebase and force push of existing commits, to clean up their messages. If, that is, you are happy to take the risk that my doing this won't cause non-recoverable difficulties due to the perils of force pushing; or
    b. If I'm misunderstanding the best way to clean up those commit messages for your re-education.
  2. Review the proposed commit message for 'html5. Polyglot. Add default title value: "My Title".', above.

@jgm
Copy link
Owner

jgm commented Mar 2, 2017

Go ahead and rebase, that's no problem for me.

I don't know about the title issue, but I really don't like "My Title" as a default.

I think I'd prefer to have pandoc emit a warning message if they've not specified a title. (This would be done in the HTML writer, which contructs a page-title variable and could issue a warning if it's empty.)

The first in a series of commits to make default.html5 polyglot markup
conformant. This changes default.html5 only.

You must have the default xhtml namespace on the root html element, as in ...

    <html xmlns="http://www.w3.org/1999/xhtml">

"Polyglot markup declares the default namespaces on the root HTML element,
html," https://www.w3.org/TR/html-polyglot/#element-level-namespaces

This is also reflected in the Polyglot "Minimal HTML document".
https://www.w3.org/TR/html-polyglot/#minimal-polyglot-html-document

Issue jgm/pandoc-templates#237, "default.html5. Polyglot Markup", contains a
prior discussion although the relevant resulting conclusion is expressed
here (and in the pull request).

This is my first in-production git and github commit.
One in a series of commits to make default.html5 polyglot markup conformant.
This changes default.html5 only.

Make all `meta` elements self closing. E.g. Change:

  * `<meta charset="utf-8">` to
  * `<meta charset="utf-8" />`

Although the HTML5 spec permits void elements that aren't self closing, as
in `<meta charset="utf-8">`, this is not permitted in Polyglot Markup ....

> In the HTML syntax, void elements are elements that always are empty and
never have an end tag. All elements listed as void in the HTML specification
or in an extension spec, MUST in polyglot markup have the syntactic form of
an XML empty-element tag (`<foo/>`).
https://www.w3.org/TR/html-polyglot/#empty-elements

Without the suggested change my current pandoc converted .xhtml file, using
default.html5, will choke in the browser (Firefox 50.1.0) with an error "XML
Parsing Error: mismatched tag. Expected: `</meta>` ...", because it can't
handle  `<meta charset="utf-8">` .

In addition I keep a space between then closing forward slash and the last
character. That is, I recommend  `<meta charset="utf-8" />` rather than
`<meta charset="utf-8"/>`. This is generally unnecessary these days as
modern browsers handle either form. However, the former guarantees backward
compatibility (at no cost).

Issue jgm/pandoc-templates#237, "default.html5. Polyglot Markup", contains a
prior discussion although the relevant resulting conclusion is expressed
here (and in the pull request).
…lank.

One in a series of commits to make default.html5 polyglot markup conformant.
This changes default.html5 only.

This commit changes, where the document author provides a language value, an
output html element from something like:

`<html xmlns="http://www.w3.org/1999/xhtml" lang="en">`

to

`<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">`

This commit changes, where the document author DOES NOT provide a language
value, an output html element from something like:

`<html xmlns="http://www.w3.org/1999/xhtml" lang="">`

to

`<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">`

The change entails two considerations:

1. Add `xml:lang=""` to conform to the Polyglot Markup spec ...

    > When specifying the language mapping of an element, polyglot markup
uses both the lang and the xml:lang attributes. Neither attribute is to
be used without the other, and polyglot markup maintains identical
values for both lang and xml:lang.
https://www.w3.org/TR/html-polyglot/#language-attributes

2. That `lang="" xml:lang=""` will be output when the document author does
not provide a language value. That is, the language values default to blank.

    The key justification for having language values default to blank: it
turns out the HTML5 spec requires it (as I read it).

    Under [the HTML5 spec, section "3.2.5.3. The lang and xml:lang
attributes"](
https://www.w3.org/TR/html/dom.html#the-lang-and-xmllang-attributes),
providing attributes with blank contents both:

    * Has meaning, "unknown", and
    * Is a MUST (written as "must") if a language value is not provided ...

    > The lang attribute (in no namespace) specifies the primary language
for the element's contents and for any of the element's attributes that
contain text. Its value must be a valid BCP 47 language tag, or the
empty string. Setting the attribute to the empty string indicates that
the primary language is unknown.

    In short, it seems that where a language value is not provided then a
blank value MUST be provided for Polyglot Markup conformance, because
the HTML5 spec stipulates a "must". So although the Polyglot Markup spec
is unclear on this issue it would seem that if it was correctly written,
it would therefore require blank attributes.

    Further justifications are found at
https://github.com/jgm/pandoc-templates/issues/237#issuecomment-275584181
 (but the HTML5 spec justification given above would seem to be the
clincher).

---

In addition to having lang-values-default-to-blank I recommend that, when an
author does not provide a lang value, then upon on pandoc command execution
a warning message like the following be provided:

  > Polyglot markup stipulates that 'The root element SHOULD always specify
the language'. It is therefore recommended you specify a language value in
your source document. See
https://www.w3.org/International/articles/language-tags/ for valid
language values.
This adds `lang="" xml:lang=""` to the test files.
@JohnLukeBentley
Copy link
Contributor Author

Thanks. So, as you can see, the rebase was done (and looks good from my side). I then pushed "html5. Polyglot. html element. Add xml:lang. lang values default to blank." and the concomitant test file modification commit.

On title, I'll suggest a slight change to the above proposal and break it down a little differently to facilitate consensus or disagreement.

  • The resulting html file must have a non empty value, for polyglot conformance (because HTML5 requires this). Are you persuaded on this much at least?

  • Then there are two cases to consider:

    1. The source document contains a title. In this case the title goes through default.html5 to the output (as is the situation now). I'd suggest this creates no difficulty.
    2. The source document doesn't contain a title. In this case the html writer, in addition to sending out a warning that no title was supplied, could take the basename of the source document as the title that gets fed to default.html5. That way if the author ignores the warning (which might occur if they are in a drafting phase and don't want to care about meta data just yet) the resulting html has a rather identifying title.

Under that scheme the resulting html will always have a non empty title.

What remains is the question of whether to adjust default.html5. There are up to three decisions to make:

A. To leave it as is, and trust the html writer (after being coded) will always supply a title; or have a default value in case the html writer breaks; and
B. If we are to have a default value what should be that value?
C. Should default.htm5 include a comment that notes the html writer will always supply a value (and overwrite any defaults).

On A, I'd suggest it's good defensive programming to have a default value even though we expect it to be overwritten.

On B, if your objection to "My Title" is based on an aesthetic objection to seeing "My" anywhere then I sympathize with that. I too have some sort of intuitive distaste for "My" (I was gladdened, for example, when windows switched from calling the folder "My Documents" to just "Documents"). But my idea was to have a self documenting value that indicated to the author that this is a value they have to supply (which beginner HTML folk might not be aware of). But alternatives could be simply "Title" or "There is something wrong with the HTML writer: it should always supply a pagetitle".

On C, perhaps something like:

<!-- $pagetitle$ is always supplied by the HTML writer, either: 
        - from the author's title meta field; or
        - the basename of the source file -->   
<title>$if(title-prefix)$$title-prefix$ – $endif$$pagetitle$</title>

So tying one set of those options together I'll suggest, for the sake of something for you to push against:

<!-- $pagetitle$ is always supplied by the HTML writer, either: 
        - from the author's title meta field; or
        - the basename of the source file -->   
<title>$if(title-prefix)$$title-prefix$ – $endif$$if(pagetitle)$$pagetitle$$else$Title$endif$</title>

But each of the decisions A, B, and C are probably largely arbitrary. So if you stipulated an answer for each of those decisions I'd be surprised if I would have any serious counterpoint.

@jgm
Copy link
Owner

jgm commented Mar 4, 2017

I had been thinking that if the result is not valid because of a missing title, that's on the user, not pandoc (especially if pandoc issues a warning). We can't guarantee that the document pandoc produces is valid HTML5, anyway, since raw HTML can be inserted by the user.

But I guess I wouldn't mind too much inserting the word Untitled as the title. (That seems better to me than Title, as it alerts people that no title has been provided.)

I'd rather just handle this in the writer, to avoid making the template more complex than it already is.

So, I'll merge your changes and look into making some further changes in the writer.

@jgm jgm merged commit 07d51d9 into jgm:master Mar 4, 2017
@JohnLukeBentley
Copy link
Contributor Author

If we are to have a default title then Untitled is vastly superior to Title or My Title.

I had been thinking that if the result is not valid because of a missing title, that's on the user, not pandoc (especially if pandoc issues a warning).

Yes there is a line of thinking from that point that might change my mind. Perhaps, indeed, we ought have it on the user. If the user doesn't supply a title in the source doc, and they ignore the pandoc warning, then we might want a blank title value in order to let the resulting HTML(5) be invalid. That way if a user validates their HTML(5) they'll get an error message that would otherwise be suppressed with "Untitled".

If the is right way to go: we keep the default.html5 title as it is but add the pandoc html writer warning.

I'm unsure. What do you think?

@JohnLukeBentley JohnLukeBentley deleted the issue237-html5-polyglot branch March 4, 2017 09:40
@jgm
Copy link
Owner

jgm commented Mar 4, 2017

If the is right way to go: we keep the default.html5 title as it is but add the pandoc html writer warning.

Yes, that's what I've done.

@jgm
Copy link
Owner

jgm commented Mar 4, 2017

Still need to add a warning about lang, but I think I'll explore a more general mechanism that also checks for the right format.

@JohnLukeBentley
Copy link
Contributor Author

Yes, that's what I've done.

I meant: have the warning but don't even pass Untitled along from the HTML writer to default.html5. Let the resulting html output fail as invalid. (This was suggested as a possibility not as something I was convinced about).

Noted your lang warning comment.

Incidentally, from my end I've merged in your repo:master with my:master and deleted my topic branch. My commit history looks like a noodly mess.

  • Is this just an artefact of the rebase or have I done something (else) that is wrong?
  • Is this something you'd want to clean up (or have me clean up); or is it a matter of leaving the mess as it is?

@jgm
Copy link
Owner

jgm commented Mar 6, 2017 via email

@JohnLukeBentley
Copy link
Contributor Author

I'd describe the github workflow slightly differently, but we might have had the same sort of thing in mind:

  1. Have upstream (project owner), origin (contributor fork), and local (contributor clone of fork) repos.
  2. (As you mention) Create a local topic branch. Make one or more commits. Rebase as needed.
  3. Push topic-branch to origin. Open pull request. Don't rebase if others might have merged from your origin.
  4. After merge delete the topic branch.
  5. Pull from the upstream master into your local master.

To fix the current issue I acted on your helpful tip. I deleted the local repo and got a fresh clone of my origin master.

The result: the noodly history has gone. But now my individual commits are missing. The following command ...

PS> git log --graph --decorate --pretty=oneline --abbrev-commit 19bba3a4..450fad81

... used to return ...

* 450fad81 Mod test files for default.html5 polyglot. Add lang stuff.
* f2adf324 html5. Polyglot. html element. Add xml:lang. lang values default to blank.
* 0dd40fe1 Mod test files for default.html5 polyglot change: explicitly close meta tags.
* 1d12cbfd Mod test files for default.html5 polyglot change: Add xml namespace.
* 036a7f8f html5. Polyglot. Explicitly close meta tags. (rebased)
* 0071e6f8 html5. Polyglot. html element. Add xml namespace. (rebased)

... now it returns an error.

I can currently see, however, the commit representing the pull request.

PS> git log --oneline 07d51d9e -1
07d51d9e Make default.html5 polyglot markup conformant. (#3473)

Is it standard that when you, as project owner, merge in a pull request all the individual commits that comprise it are lost (possibly because you, the project owner, chose to rebase when you merge, rather than merge-commit, for a simpler history)?

@jgm
Copy link
Owner

jgm commented Mar 6, 2017 via email

@JohnLukeBentley
Copy link
Contributor Author

That makes sense. The revert consideration is good to point to. I suppose, then, you (project owner) could just as well in another case be wanting to preserve the commits within a (multi-commit) pull request, if you anticipated the possibility of reverting only some of those commits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants