Update <pre class="name"><code> to HTML5 <pre><code class="language-name"> #3858

Open
marc-medley opened this issue Aug 21, 2017 · 18 comments

@marc-medley

marc-medley commented Aug 21, 2017

When converting Markdown to HTML with the --no-highlight option and the fenced_code_attributes extension enabled, pandoc generates <pre class="name"><code> tags.

This request is to change that output to the syntax used in the W3C HTML5 Recommendation's example, <pre><code class="language-name">.

For example, <pre class="markdown"><code> would become HTML5 <pre><code class="language-markdown">.

W3C HTML5 Recommendation: code element

Authors who wish to mark code elements with the language used, e.g. so that syntax highlighting scripts can use the right rules, can use the class attribute, e.g. by adding a class prefixed with "language-" to the element.

Code Example:

The following example shows how a block of code could be marked up using the pre and code elements.

<pre><code class="language-pascal">var i: Integer;
begin
   i := 1;
end.</code></pre>

Prism.js Basic Usage also illustrates the same HTML5 Recommendation example syntax:

Therefore, it only works with <code> elements, since marking up code without a <code> element is semantically invalid. According to the HTML5 spec, the recommended way to define a code language is a language-xxxx class, which is what Prism uses.

@mb21
Collaborator

mb21 commented Aug 21, 2017

Can confirm on pandoc 1.19.2.1

$ echo -e '```html\nfoo\n```' | pandoc
<div class="sourceCode">
  <pre class="sourceCode html">
    <code class="sourceCode html">foo</code>
  </pre>
</div>

Yet:

$ echo -e '```html\nfoo\n```' | pandoc --no-highlight
<pre class="html"><code>foo</code></pre>

@jgm
Owner

jgm commented Aug 22, 2017

When you do

``` foo
bar
```

in pandoc, it's exactly equivalent to

``` {.foo}
bar
```

so foo is just a class. We don't know whether it's meant to be the name of a language syntax or something else entirely.

So adding the language- prefix to all the classes of a code block certainly wouldn't be the right thing to do. We could, I suppose, add the prefix to class names that correspond to known language names, i.e. to language names that pandoc's own highlighter is aware of.
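
A minimal sketch of that second idea, written in Python rather than pandoc's Haskell: only classes found in a set of known language names get the language- prefix, and everything else passes through unchanged. The KNOWN_LANGUAGES set and the function name are hypothetical; a real list would come from pandoc's own highlighter.

```python
# Hypothetical stand-in for the language names pandoc's highlighter knows.
KNOWN_LANGUAGES = {"c", "css", "haskell", "html", "markdown", "pascal", "python"}


def prefix_known_languages(classes, known=KNOWN_LANGUAGES):
    """Return classes with 'language-' added only to known language names."""
    result = []
    for cls in classes:
        if cls.lower() in known:
            result.append("language-" + cls)
        else:
            # Not a language name: leave the class exactly as written.
            result.append(cls)
    return result


print(prefix_known_languages(["html", "numberLines"]))
print(prefix_known_languages(["foo"]))
```

With this approach a class such as foo is never rewritten, which addresses the concern that a class is not necessarily a language name.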

@marc-medley
Author

marc-medley commented Aug 22, 2017

We could, I suppose, add the prefix to class names that correspond to known language names, i.e. to language names that pandoc's own highlighter is aware of.

For my use case, this would be an OK approach.

Noting that, semantically, the html <code> tag seems to be an appropriate place for a language- class attribute.

@jgm
Owner

jgm commented Aug 22, 2017

@marc-medley
Author

marc-medley commented Aug 22, 2017

There are two distinct code highlighting use cases:

  1. Use Case default: Pandoc provides the complete code highlighting in html output.

  2. Use Case --no-highlight: Pandoc code highlighter is disabled. Pandoc produces an "intermediate" html. An external highlighter such as prism.js or highlight.js is later applied to the "intermediate" html when loaded into a viewing browser.

This particular issue is only intended to apply to the --no-highlight use case.

So, yes, the default use case should continue with what works for the Pandoc highlighter, e.g. use <pre> for background color.

Yet, when the --no-highlight option is used then possible downstream highlighters should be considered.

For example, both highlight.js and prism.js can consume the following clean, simple, maintainable html and produce various colored backgrounds along with full syntax highlighting.

<pre><code class="language-css">p { color: red }</code></pre>

Please see highlight.js usage and demo (supports language-abc, lang-abc and abc)
Please see prism.js basic usage and examples

In both the highlight.js and prism.js examples, the <pre> tag does not have any additional attributes.

So, in the --no-highlight use case, the language- class placed on the <code> tag is sufficient and complete for downstream highlighters such as prism.js and highlight.js to also provide background coloring in the final html delivered to the viewing browser.

@bpj

bpj commented Oct 21, 2017

Couldn't this be handled with a filter which adds the language- prefix to the first class, if any, of all CodeBlock elements, and overrides the builtin HTML rendering of code blocks? It would be very easy with Pandoc::Filter:

#!/usr/bin/env perl
use strict;
use warnings;
use Pandoc::Filter;
use Pandoc::Elements;
use HTML::Entities qw[ encode_entities ];
 
pandoc_filter 'CodeBlock' => sub {
    my $attrs = stringify_attrs($_); # here $_ is a reference to element object
    return unless length $attrs;     # default rendering OK
    my $content = encode($_->content);   # the code
    return RawBlock html => qq(<pre><code $attrs>$content</code></pre>);
};

sub stringify_attrs {
    my($elem) = @_;
    my $kv = $elem->keyvals; # get Hash::MultiValue object
    my @attrs;
    if ( my @classes = $kv->get_all('class') ) {
        $kv->remove('class');
        @classes = map {; encode($_) } @classes; # shouldn't be needed!
        push @attrs, qq(class="language-@classes");
    }
    ATTR:
    for my $attr ( sort keys %$kv ) {
        my @values = $kv->get_all($attr);
        next ATTR unless @values;
        push @attrs, map{ $_ = encode($_); qq($attr="$_"); } sort @values;
    }
    return "@attrs"; # array items as space-separated string
}

sub encode { encode_entities $_[0], '<>&"' }

Note that this requires that the author has the discipline to make sure that it always is appropriate, or at least doesn't break anything, to prefix language- to any first class of a codeblock.

Note also that I'm writing this on my tablet, so the code is untested but it should do the right thing.

@gkjpettet

gkjpettet commented Jul 5, 2018

This seems to still be an issue with the current version of pandoc. Even using the --no-highlight option I'm still seeing the class added to the <pre> tag and no class added to the <code> tag.

@averms
Contributor

averms commented Nov 13, 2018

Here is a lua filter to do this. It passes through any classes that don't match a programming language name and all ids. Attributes are stripped, but I'm not sure too many people use them anyways.

Just use --lua-filter standard-code.lua

@mrchypark

mrchypark commented Nov 28, 2018

How is this issue going?

#3858 (comment) <- this option looks good to me because I want to use highlight.js.

@jgm
Owner

jgm commented Nov 29, 2018

There are a couple of possibilities here:

  1. Change the HTML writer so that, when --no-highlight is used (i.e., writerHighlightStyle opts == Nothing), pandoc produces a language-LANG class on the code elements in both inline and block code.

    A question is how the language is identified. (A code span or block may have a number of classes, only one of which is the language -- or it may be that none of the classes are languages.) One possibility would be to check a list of known languages. We could, perhaps, include the list that highlighting-js currently supports.

    Anyway, on this approach you could write

    ``` C
    int i = 0;
    ```
    

    and it would be rendered

    <pre><code class="language-C">int i = 0;
    </code></pre>
  2. Another approach would involve a much more minimal modification. This would simply move any class beginning with language- to the code tag instead of the pre tag, in rendering HTML. It would be insensitive to the setting of --no-highlight. With this approach you'd write

    ``` language-C
    int i = 0;
    ```
    
  3. Another idea would be to always add language- to a single word after the opening code backticks, so that

    ``` C
    int i = 0;
    ```
    

    would be parsed as a code block with class language-C rather than C. The logic for highlighting could be modified so that we first check the classes for language-X, then for known languages (so a class C would also work). The main drawback of this approach is that it could break some current setups that are assuming that the class name will be C.
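
The lookup order described in possibility 3 could look roughly like the following. This is a Python sketch under stated assumptions, not pandoc's implementation: find_language and the known set are hypothetical names introduced only for illustration.

```python
def find_language(classes, known):
    """Pick the highlighting language from a code block's classes.

    An explicit language-X class wins; otherwise fall back to a bare
    class that is a known language name. Returns None if neither exists.
    """
    prefix = "language-"
    for cls in classes:
        if cls.startswith(prefix) and len(cls) > len(prefix):
            return cls[len(prefix):]
    for cls in classes:
        if cls.lower() in known:
            return cls
    return None


known = {"c", "haskell", "python"}  # hypothetical list of known languages
print(find_language(["language-C"], known))        # explicit prefix
print(find_language(["numberLines", "C"], known))  # bare known name
print(find_language(["foo"], known))               # no language found
```

The fallback loop is what would keep existing documents that use a bare class C working.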

@marc-medley
Author

marc-medley commented Nov 29, 2018

@jgm I would go with possibility 3., with 1. as a second choice, based on the following notes…

Possibility 1. any class after opening code backticks

Performance & Maintenance Issue: Looks up each class against some ever evolving language name list, like PrismJs⇗ or highlight.js⇗ supported languages.

Possibility 2. use language-LANG in markdown

Breaks Markdown Editing Highlights Issue: Breaks source and preview highlighting in many markdown editing environments. Widely used markdown fenced code syntax uses just the language name: c, java, swift, etc as the first word after the opening code fence.

Here is an example from editing markdown in Atom:

[Image: markdowncodefencing]

Note: Requiring language- in markdown code fences breaks thousands of markdown files in my use case.

Possibility 3. use first word after opening code backticks

In my use case, the first word (if present) is the code language name.

Always add language- to a single word after the opening code backticks

Fenced C code in markdown input:

[Image: fencedc]

Renders HTML5 recommendation compliant output:

<pre><code class="language-c">int i = 0;
</code></pre>

Note: pandoc may need to recognize no-highlight (in markdown) as a case for not adding any language highlight class when multiple classes are used after an opening code fence. (Just mentioned for completeness, for use cases which also have non-language classes, although this is not my current use case.)

@kiwi0fruit

kiwi0fruit commented Jan 31, 2019

I guess it's too late to worry about adding yet another command line option. So the best approach is

  1. Maybe use --no-highlight:
    • move classes to <code> instead of <pre> when --no-highlight is given,
  2. Add a new CLI option --language-prefix that adds language-* to the first class.

At the moment it's not fixable via pandoc filters: I need to iterate via beautiful soup to move class... Nope.

Both Highlight.js and Prism.js work with attributes set on <pre>

PS: By the way, if there is something to worry about with the CLI options, it is that --help does not list them in alphabetical order.

UPD: simple pandoc filter like this solves the issue.

nacnudus added a commit to ukgovdatascience/govdown that referenced this issue Jan 7, 2020
@jrtechs

jrtechs commented Aug 1, 2020

I "fixed" this on my website using some hacky regex operators on the HTML produced by pandoc. However, it would be nice if pandoc added a flag to fix this.

// result is the HTML from pandoc
var re = /<pre class=".*?"><code>/;
while (result.search(re) != -1)
{
    var preTag = result.match(re)[0];
    var finishIndex = preTag.split('"', 2).join('"').length;
    var lang = preTag.substring(12, finishIndex); // text after '<pre class="'
    var newHTML = `<pre><code class="language-${lang}">`;
    var original = `<pre class="${lang}"><code>`;
    result = result.split(original).join(newHTML);
}
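
For comparison, the same post-processing idea can be done in a single pass with a substitution callback. This is a Python sketch, not something from the thread; move_class_to_code is a hypothetical name, and it assumes the first class is the language name. Like any regex over HTML, it is fragile and only targets the exact <pre class="..."><code> shape shown earlier.

```python
import re

# Matches the <pre class="..."><code> pair pandoc emits with --no-highlight.
PRE_CODE = re.compile(r'<pre class="([^"]*)"><code>')


def move_class_to_code(html):
    """Rewrite <pre class="x"><code> as <pre><code class="language-x">."""

    def replace(match):
        classes = match.group(1).split()
        if not classes:
            return "<pre><code>"
        # Assumption: the first class is the language name.
        classes[0] = "language-" + classes[0]
        return '<pre><code class="{}">'.format(" ".join(classes))

    return PRE_CODE.sub(replace, html)


print(move_class_to_code('<pre class="html"><code>foo</code></pre>'))
```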

@krontzo

krontzo commented Oct 1, 2020

Both Highlight.js and Prism.js work with attributes set on <pre>

I am trying to get line numbers in the latest version of reveal.js. The included version of highlight.js supports line numbers, but only on the <code> tag.

So, I have two questions:

  • Is there a way to write a filter to move attributes from <pre> to <code>?

  • I saw the demo lua writer https://github.com/jgm/pandoc/blob/master/data/sample.lua, it puts the CodeBlock attributes to the <code> tag instead of the <pre>, see line 241: return "<pre><code" .. attributes(attr) .. ">" .. escape(s) ... Is a new writer the only way?

@jgm
Owner

jgm commented Oct 1, 2020

You could do it with a filter, by replacing each CodeBlock element with a RawBlock (Format "html") and building the HTML yourself. A bit tedious, and you'd need to be careful about escaping, but not too hard.
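
A rough sketch of such a filter, written against pandoc's JSON AST in plain Python. The function names are hypothetical, the sample document and its API version number are illustrative only, and prefixing the first class with language- is an assumption that may not suit every document.

```python
import html
import json


def render_code_block(attr, code):
    """Build <pre><code ...> HTML with every attribute on the code tag."""
    identifier, classes, keyvals = attr
    parts = []
    if identifier:
        parts.append('id="{}"'.format(html.escape(identifier)))
    if classes:
        # Assumption: the first class names the language.
        names = ["language-" + classes[0]] + list(classes[1:])
        parts.append('class="{}"'.format(html.escape(" ".join(names))))
    for key, value in keyvals:
        parts.append('{}="{}"'.format(html.escape(key), html.escape(value)))
    attrs = "".join(" " + part for part in parts)
    body = html.escape(code, quote=False)  # escape &, < and > in the code
    return "<pre><code{}>{}</code></pre>".format(attrs, body)


def walk(node):
    """Recursively replace CodeBlock nodes with RawBlock html nodes."""
    if isinstance(node, list):
        return [walk(item) for item in node]
    if isinstance(node, dict):
        if node.get("t") == "CodeBlock":
            attr, code = node["c"]
            return {"t": "RawBlock", "c": ["html", render_code_block(attr, code)]}
        return {key: walk(value) for key, value in node.items()}
    return node


# Illustrative document; as a real filter this would instead be
#   json.dump(walk(json.load(sys.stdin)), sys.stdout)
sample = {
    "pandoc-api-version": [1, 22],
    "meta": {},
    "blocks": [{"t": "CodeBlock", "c": [["", ["css"], []], "p { color: red }"]}],
}
print(json.dumps(walk(sample)["blocks"][0]))
```

Saved as an executable script that reads stdin and writes stdout as noted in the comment, it could be passed to pandoc with --filter.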

@krontzo

krontzo commented Oct 1, 2020

Thank you for the information and your quick response. I'll give it a try.

@tarleb
Collaborator

tarleb commented Oct 1, 2020

I believe this should do the trick: https://github.com/pandoc/lua-filters/tree/master/revealjs-codeblock

@krontzo

krontzo commented Oct 1, 2020

Thank you very much for the information. It works for me.
