Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Add CSS classes to fenced code blocks #7

Merged
merged 6 commits into from

6 participants

@coreyworrell

Not sure about the newline callback, but it seemed to not be adding the necessary
tag.

@coreyworrell coreyworrell Fixing fenced code blocks to allow classes to be added. Also the newl…
…ine callback regex seemed to not be working, so I removed the ^ character from before \n.
4a48b46
@michelf
Owner

By the way, all you have to do if you want to update the pull request is to put your changes in the same branch. But there's no harm in creating a new pull request if you want.

Thanks for your work. I'll definitely merge this once it's ready.

@simensen

What is the expected fenced codeblock going to look like? Will it be something like:

~~~php
<?php /* no dot, just php */ ?>
~~~

or

~~~.php
<?php /* with a . */ ?>
~~~

... or will it support both?

@michelf
Owner

@simensen only with a .. Is there a good reason to accept it without the dot?

@simensen

@michelf Depends on what you find to be a good reason. :)

GFM actually support both with and without the ., but it is only documented to need the language name. Many projects, including redcarpet, seem to have implemented fenced code blacks based on the example set forth by PHP-Markdown Extra but have already started to support the language hint without needing a leading ..

I know that it won't make sense to mirror the GFM, sundown, redcarpet, etc. parsers feature by feature, but if it is possible to make the leading . optional (or at least configurable?) I think that could be pretty useful since there are already a number of Markdown implementation that handle languages hinting in this way.

Are there good reasons that the . should be required?

@michelf
Owner

Making the . mandatory leaves more room for future extensions. For instance you could allow setting a class with .class and an id with #id. I'll have to think about it a little more.

@coreyworrell

I don't really see a good reason. But I like the idea of the code block using whatever you pass it as the class name, instead of other versions where it only allows the language. Because, for example, Google's Prettify uses the class "prettyprint lang-php" to work. So this way you can specify that easily.

@coreyworrell

I hit submit too soon. That's a good reason I think. Maybe we could still make it optional and if no . or # is there then it will just default as a class. But still might be easier to just require a .

@simensen

Great discussion.

I can see how it would be nice to be able to support something like .prettyprint.lang-php or .php#snippet-334 to add classes and ID's to the <code> element.

If it is possible to make the . optional, that would be great if only for compatibility with existing implementations. From a parsing/future standpoint, I don't see why it needs to be required.

In any event, I'm excited to see this support coming here. :)

@coreyworrell

Just pushed some changes that allows both. You can specify IDs, classes, and any other attributes you like.

Examples

~~~ #myid.prettyprint.lang-php[lang=php][rel="something"]
Code block
~~~

<pre id="myid" class="prettyprint lang-php" lang="php" rel="something"><code>Code block</code></pre>

~~~
Code block
~~~ prettyprint lang-php

<pre class="prettyprint lang-php"><code>Code block</code></pre>
@simensen

@coreyworrell very cool. :)

@waylan

I would like to point out that the HTML5 spec provides a suggestion (not a requirement) for how language is to be indicated in both inline code and code blocks. I note a few things you are doing that do not follow that suggestion here:

You are putting the classes, etc on the pre tag, not the code tag. If you think about it, the language applies to the "code" and not all pre tags necessarily have code in them, so that makes sense. I realize many (most) JavaScript libs expect the language to be set on the pre tag but IMO those libs are wrong and they should change to match the HTML5 spec's suggestion. It seems to me like Markdown should do the right thing here from the start.

The second is that the language should be indicated as class="language-php" where the actual language is prefaced with "language-". My suggestion would be to treat classes set with a dot to be added as classes as-is, but if a single word without a dot is defined, assume that is a language and automatically preface with "language-". That gives flexibility If the author is lazy and sets a language only (no dot) they get output following the spec. However, if they are using one of the many JavaScript code highlighting libraries (I'm aware of at least 10) which all use different ways to designate the language, they can set the class and/or id to whatever they want by prefacing with a dot.

Thirdly, if you are going to support more than a dot for classes and # for ids, then I would suggest following Maruku's Attribute List feature which is already supported by multiple (I'm aware of three) markdown implementations. That's how I'm intending to implement upcoming improvements to Python-Markdown.

@coreyworrell

@waylan good suggestions. I agree with them, but it's the Javascript libraries, like you said, that almost always use classes on the <pre> and not on the <code> elements. And every library handles it differently. Some just want the language php, some want language-php, so we can't account for everything. This way, just being specific and maybe typing a few extra characters handles every case. And it's always best to put classes and identifiers on the outermost element, because you can always select its child with CSS.

Maruku's attribute list feature looks interesting, and might fit better with Markdown, but using the standard CSS selectors is also very common. We'll have to get opinions on this.

@michelf
Owner

I'm still on the fence with all this. ;-) Here are some thoughts...

Regarding the Github flavor of fenced code blocks incorporating the language name, they aren't using the same syntax at all, and I don't see the point in trying to accommodate its syntax only in half. Is there any other Markdown variant implementing fenced code block using ~~~? If so, any of those having a syntax for setting a class name?

I know that Markdown usually try to let people express the same thing in different ways. This is why there are two emphasis markers and in Extra two code block syntaxes. But duplicating features should be done only where there is a strong justification for doing so. I don't think it's a good idea to make ~~~ php and ~~~ .php synonymous. People will see that, wonder "what's the difference", and get the feeling there's something subtle they are missing.

Neither do I think it is a good idea to automatically translate ~~~ php to class="language-php". It's a good idea in theory, but in practice I expect it'll mostly cause surprise and confusion -- "help, markdown is adding 'language-' and nothing works because of it!". Perhaps this will change in the future, but right now it doesn't add anything except for the good feeling of following some not-quite-a-standard syntax nobody is using. (Or is someone really using that in the wild?)

Where to put the attribute. <pre> or <code>? Semantically it makes more sense for it to be on <code>, because theoretically that's something that could work on code spans too. I mean by that that a syntax highlighter could colorize code in code spans. I understand that putting it on <pre> however let you style the box for the code block more easily. I'm not sure yet.

Finally when was the last time anyone here had to put an attribute other than class on a code block? Is that feature of adding arbitrary attributes justified? Because if no one use it, it's just another vector for bugs to creep in and something totally useless to write documentation about. And it is even more useless if you can't control whether the attribute will fall on the <pre> or the <code> element.

@waylan

Michel,

I can't argue much with the points that you make except to say that my personal opinion differs on a few.

The better styling hook verses more semantically correct issue is something I've gone back and forth on, but when I saw the suggestion in the HTML5 spec, that convinced me. I realize it won't convince everyone and it seems to have not affected any of the JavaScript lib authors. Personally, I find it appalling the number of different (and sometimes non-validating) ways that JS lib authors have implemented language labeling. If there was any chance that a standard defined by markdown could force their hand, I'd be all for it, but I highly doubt that would work.

Regarding other implementations, Python-Markdown has a working implementation (which only currently works on document root) and the dot is optional (not documented), but does nothing to change the behavior. Just the other day I accepted a pull request which also adds Github's syntax as an alternative (still need to document that). While we support Maruku's Attribute List syntax elsewhere, that does not currently work on fenced code blocks although we optionally allow the curly brackets which is consistent with that syntax.

Speaking of Github's syntax, I protested when they introduced their new syntax, and they responded by implementing (but not documenting) the existing syntax and saying they didn't care what Markdown's syntax actually was. I don't think they even knew prior art existed before their implementation.

@michelf
Owner

Regarding the actual code of this pull request: I just realized that the HTML block parser will need to become aware of any change to the fence syntax here and here. Otherwise fenced code blocks using the new feature will have a hard time containing any block-level html tags.

@will123195

It is intuitive that ~~~php should translate to <code class="php">. I would be unhappy to see language- added automatically in this case since it would break a lot of my code. It should be one's choice to use the HTML5 spec, ie. ~~~language-php.

Also, Highlight.js is one of the best code highlighters available and uses <code class="php">.

Perhaps the middle ground is to have an optional setting like:

@define( 'CSS_CLASS_PREFIX', 'language-' ); // default is null

Also it seems correct to have the class on the code element. But if it's important to have the pre option make it another optional setting like:

@define( 'CSS_CLASS_TAG', 'pre' ); // default is 'code'
@waylan

I think @will123195's proposal is a nice compromise.

I also prefer Highlight.js and that is what I use personally. In fact to my knowledge it is the only such tool which allows you to pass in a DOM element as a code block, so you can use any type of markup you want. Of course, this same tool is the one that has the most sensible default of all of them which negates the need for that feature in the first place. You do have to use their method of defining the language used, although it is a much less objectionable method than many.

I will also note that while I would prefer using language- as the default prefix on @will123195's proposal, the fact that I use Highlight.js would mean even I wouldn't use that default. So defaulting to no prefix makes sense. I am also aware of other libs which require a different prefix, so allowing one to be set programmicly makes sense in general. I don't want to have to go back and manually edit every markdown document when I change JS libs in the future. I'd much rather be able to change a setting in one place that effects all my documents at once.

Regarding the option to set the language on either the pre or code elements, this is probably necessary as some tools (like Highlight.js) use code, while others use pre. Why should we exclude some and favor others. Set the default as the semantically correct option and make it easy to override with a setting. Perhaps even include a link to the HTML5 spec in the docs/comments to help curb those requests to change the default because their tool works the other way.

Regarding @michelf's question regarding the need to support other attributes beside class and id, there are a few JS libs which use other non-standard (and non-validating) attributes to define the language. If you want to support those tools, then any arbitrary key=value combo needs to be allowed. Personally, I would never recommend anyone use those libs, so I wouldn't feel bad if they were not supported, but is may be something to consider.

@michelf
Owner

It seems that Github supports these forms:

~~~ php
~~~ .php
~~~ {.php}
~~~ {.php #id}

If I understand well, this is what is supported in Python Markdown too (although I'm not sure about the id). I think we should standardize on those. Also, if we allow multiple class names and #id it should work in any order, with or without a space separating them.

I wouldn't support arbitrary attributes. Supporting arcane highlighting libraries using custom attributes should not be encouraged, so I'd leave those out.

As for adding the "language-" prefix, I just wouldn't. I think giving ~~~ php and ~~~ .php and ~~~ {.php} the same meaning except in non-default configuration isn't going to work well unless all writers understand the difference, which will rarely be the case. So I wouldn't even support it.

As for <pre> vs. <code>, let's add this at the top of the file and make it work:

# User-specified class attribute for code blocks goes on the "code" tag,
# but setting this to true will put attributes on the "pre" tag instead
@define( 'MARKDOWN_ATTR_ON_PRE', true );
@coreyworrell

@michelf I wasn't sure about the block hashing and what needs to be changed there? Is that only for using fenced code blocks inside HTML elements?

@michelf
Owner

@coreyworrell

The HTML block parser hashes everything that looks like a non-indented HTML block so HTML blocks don't get processed by the regexes that follows. Indented HTML blocks are standard Markdown code blocks so they are ignored by the HTML block parser. It also does the same for fenced code blocks, so it must detect where those blocks begin and ends.

So there's in reality two fenced code block parsers: the first to run is inside the HTML block parser which detects fenced code blocks and leaves the block and its content intact for the second "real" fenced code block parser. I recognize it's a little silly, but this is how things have evolved. Ideally we'd parse all block-level constructs in one pass similarly to how it is done in the HTML block parser. It'd help fix a few issues like this one where both fenced code block parsers are a little out of sync.

Anyway, the HTML block parser since it needs to handle HTML tags, indented code blocks, and fenced code blocks -- and to some extent code spans -- all together works using a tokenizing approach. There's a giant regex for going to the next "token" (the beginning of any of those) and a couple of "ifs" following to check what the token is and act appropriately. What you need is to change the regex for the opening fence (in the giant regex) and the closing fence (in the token handler code).

@coreyworrell

Ok thanks, that makes sense. Can you give me some example code and the expected output to test on?

I am running the current code on my dev site, and it seems to be working fine right now, even w/out changing those two regexs. Here's the markdown I'm using and the HTML output:

This is a paragraph.

~~~ php
<div>
    <h1>Hello</h1>
</div>
~~~

And that correctly prints out:

<p>This is a paragraph.</p>

<pre><code class="php">&lt;div&gt;
    &lt;h1&gt;Hello&lt;/h1&gt;
&lt;/div&gt;
</code></pre>
@michelf
Owner

@coreyworrell Sorry for taking so long to answer. Regarding potential problems with the HTML block parser, you should test with this:

~~~ html
<div>
~~~

Some *markdown* `formatting`.

~~~
</div>
~~~ html

... and other similar variants you can come with. I expect it'll give strange results.

@coreyworrell

Thanks. I tested against that, and all I had to change was the fenced code block opening marker regex for it to work. But it still fails against the following (which I think should all work?), these also don't work with the current version of php-markdown extra (I removed the class names from the fence block of course before testing):

~~~ html
<div>
~~~

Some *markdown* `formatting`.

~~~
</div>
~~~ html

Some *markdown*

~~~
<div>
    <html>
~~~

~~~
function test();
~~~

<div markdown="1">
    <html>
        <title>
</div>

<div markdown="1">
~~~
<html>
    <title>
~~~
</div>

You can see the test output here: http://dev.coreyworrell.com/first-article

@michelf
Owner

@coreyworrell You found another bug, but I'd say it's a separate issue. So I think this patch is ready to be merged. I'll do some tests of my own and merge it some time before next release. Good work!

@ghost

@michelf After reading all the comments above, I am a tad-bit confused, and would love to have some clarification -- has this syntax been finalized yet? Are there any known bugs?

Also, please check if I am using the right syntax in the example below.

Syntax for Adding Classes, IDs and Custom Attributes to code blocks

~~~ #myid .myclass .another-class [attribute1=value1] [attribute2=value2]
Code block
~~~

Gives this:

<pre id="myid" class="myclass another-class" attribute1="value1" attribute2="value2"><code>Code block</code></pre>

Considering that I am using PHP Markdown Extra 1.2.5 provided here, is that correct, or am I missing something?

@michelf
Owner

@geekpanth3r This pull request has not been merged yet. I looked at the patch and I'm confident it is good, but I want to test it myself before merging it in the official version, and I hadn't had time for that yet. My goal is to do a new release this summer that'll include those changes.

The idea currently is to support id and class, but not arbitrary attributes.

@ghost

@michelf I am using SyntaxHighlighter by Alex Gorbatchev, and it requires that I add a class to it (to define the language). As of now, is there a standard way for me to do this? I mean, adding class to a pre tag. If I use direct HTML, I should escape the code, which isn't a good idea (it can break, not standard).

@coreyworrell

@geekpanth3r No, not as of now. You can use my fork of the repo, or wait until this pull request is merged. Or use googles prettyprint, which auto-detects code block languages, thus not requiring any classes.

@ghost

@coreyworrell Are you aware of any issues with your fork? And even if there are, do you believe it won't break things when @michelf fixes them this summer? I would love to use it!! :D

Or use googles prettyprint, which auto-detects code block languages, thus not requiring any classes.

I have tried that, and am currently using an even better solution (Highlight.js). Both are great, but they are not able to recognize "small" code blocks, and large code blocks with multiple languages, well.

I will still continue to use Highlight.js though, but want to define the languages myself. :)

@coreyworrell

Everything I tested works fine with the fenced code blocks. There is some interesting behavior when including code blocks inside of divs and the like, but that already existed. So it doesn't introduce any new issues or anything from what I can tell.

@ghost

@coreyworrell and @michelf : I am using Corey's forked version of PHP Markdown Extra, and looks like there's a glitch. When I put this fenced code block in my post, nothing shows up, just a blank container:

~~~ .php
<?php echo fimage(); ?>
~~~

Shows up like this in the front-end (yeah, no code, just the empty code container:

Empty Container

Am I doing something wrong, or is this a bug?

UPDATE: The source code of my page shows this -- clearly, it's being parsed as a comment, no idea why.

Source Code

PS: This appears to happen even with the simple fence (~~~) without the class; but it's fine when I indent it (4 spaces).

@michelf
Owner

"clearly, it's being parsed as a comment, no idea why"

Looks like the HTML block parser doesn't see the code block in this case and treats the content as if it was an HTML block (which is the correct thing to do when outside a code block).

@geekpanth3r If you remove the ".php" after the opening fence, does the same thing happen?

@ghost

PS: This appears to happen even with the simple fence (~~~) without the class; but it's fine when I indent it (4 spaces).

@michelf So yeah, like I said earlier, this happens even without the .php after the opening fence.

@coreyworrell

@geekpanth3r would you mind trying the same code (without the .php after the fence) on @michelf s fork? If it works fine with his, then I guess I need to fix the HTML block parser.

@ghost

@coreyworrell Oops! Looks like the issue is NOT with PHP Markdown Extra, but your fork. Yes, looks like you need to fix the HTML block parser.

@jackmcdade

Is this going to make it in by chance?

@michelf michelf merged commit e90b912 into michelf:extra
@michelf
Owner

Ok, so I just merged this. While merging, I did some refactoring:

  1. Now using the same attribute parser as is used for headers (added yesterday) when parsing id and class attributes in braces.
  2. Adjusted regular expressions to be less forgiving about the actual content after the fence: the exact format expected is part of the expression.
  3. Removed support for putting attributes on the closing brace. (I know it was my idea, sorry for wasting some of your time on this).
  4. Changed the name of one configuration variable, added instance variables for those settings so you can override them for the parser on a per-instance basis.

Now its time to check all the issues that have been reported for that fork to see if they still apply. Thank you for this @coreyworrell. Even though I ended up changing a few things, its pointed me at everything that needed to be changed, it's a very clean patch.

@coreyworrell coreyworrell deleted the unknown repository branch
@coreyworrell

Awesome. I know the edits I made didn't work quite right, but I'm glad you were able to use them to build off of!

@waylan waylan referenced this pull request in waylan/Python-Markdown
Merged

Add feature for emphasizing some lines in a code block. #274

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Jan 13, 2012
  1. @coreyworrell

    Fixing fenced code blocks to allow classes to be added. Also the newl…

    coreyworrell authored
    …ine callback regex seemed to not be working, so I removed the ^ character from before \n.
Commits on Jan 15, 2012
  1. @coreyworrell

    Fixing CSS class regex.

    coreyworrell authored
  2. @coreyworrell
Commits on Jan 16, 2012
  1. @coreyworrell
Commits on Jan 28, 2012
  1. @coreyworrell
Commits on Feb 20, 2012
  1. @coreyworrell
This page is out of date. Refresh to see the latest.
Showing with 69 additions and 7 deletions.
  1. +69 −7 markdown.php
View
76 markdown.php
@@ -34,6 +34,13 @@
@define( 'MARKDOWN_FN_LINK_CLASS', "" );
@define( 'MARKDOWN_FN_BACKLINK_CLASS', "" );
+# Fenced code block class prefix
+@define( 'MARKDOWN_CODE_CLASS_PREFIX', "" );
+
+# User-specified class attribute for code blocks goes on the "code" tag,
+# but setting this to true will put attributes on the "pre" tag instead
+@define( 'MARKDOWN_ATTR_ON_PRE', false );
+
#
# WordPress settings:
@@ -1885,7 +1892,7 @@ function _hashHTMLBlocks_inMarkdown($text, $indent = 0,
|
# Fenced code block marker
(?> ^ | \n )
- [ ]{0,'.($indent).'}~~~+[ ]*\n
+ [ ]{0,'.($indent).'}~~~+[^\n]*\n
' : '' ). ' # End (if not is span).
)
}xs';
@@ -2552,35 +2559,90 @@ function doFencedCodeBlocks($text) {
(
~{3,} # Marker: three tilde or more.
)
- [ ]* \n # Whitespace and newline following marker.
- # 2: Content
+ # 2: Attributes or whitespace
+ ([^\n]*)
+
+ \n # Newline following marker.
+
+ # 3: Content
(
(?>
- (?!\1 [ ]* \n) # Not a closing marker.
+ (?!\1 [^\n]* \n) # Not a closing marker.
.*\n+
)+
)
# Closing marker.
- \1 [ ]* \n
+ \1
+
+ # 4: Attributes or whitespace
+ ([^\n]*)
+
+ # End of line
+ \n
}xm',
array(&$this, '_doFencedCodeBlocks_callback'), $text);
return $text;
}
function _doFencedCodeBlocks_callback($matches) {
- $codeblock = $matches[2];
+ $codeblock = $matches[3];
$codeblock = htmlspecialchars($codeblock, ENT_NOQUOTES);
$codeblock = preg_replace_callback('/^\n+/',
array(&$this, '_doFencedCodeBlocks_newlines'), $codeblock);
- $codeblock = "<pre><code>$codeblock</code></pre>";
+
+ $attrs = ! empty($matches[2]) ? $matches[2] : ( ! empty($matches[4]) ? $matches[4] : '');
+ $attrs = $this->_doFencedCodeBlocks_parseAttrs($attrs);
+ $pre_attrs = MARKDOWN_ATTR_ON_PRE ? $attrs : '';
+ $code_attrs = MARKDOWN_ATTR_ON_PRE ? '' : $attrs;
+ $codeblock = "<pre$pre_attrs><code$code_attrs>$codeblock</code></pre>";
+
return "\n\n".$this->hashBlock($codeblock)."\n\n";
}
function _doFencedCodeBlocks_newlines($matches) {
return str_repeat("<br$this->empty_element_suffix",
strlen($matches[0]));
}
+ function _doFencedCodeBlocks_parseAttrs($attrs) {
+ $attrs = trim($attrs, ' {}');
+
+ if ( ! preg_match_all('/([#\.][^#\.]+)/', $attrs, $matches) AND ! empty($attrs))
+ return ' class="'.MARKDOWN_CODE_CLASS_PREFIX.$attrs.'"';
+
+ if (isset($matches[0]) AND ! empty($matches[0])) {
+ $list = array();
+ $list['class'] = '';
+
+ foreach ($matches[0] as $match) {
+ switch ($match[0]) {
+ case '#':
+ $list['id'] = trim($match, ' #');
+ break;
+ case '.':
+ $list['class'] .= trim($match, ' .').' ';
+ break;
+ }
+ }
+
+ if ( ! empty($list['class'])) {
+ $list['class'] = MARKDOWN_CODE_CLASS_PREFIX.$list['class'];
+ }
+
+ $attrs = '';
+
+ foreach ($list as $attr => $value) {
+ if ( ! empty($value)) {
+ $attrs .= $attr.'="'.trim($this->encodeAttribute($value)).'" ';
+ }
+ }
+
+ if ( ! empty($attrs))
+ return ' '.trim($attrs);
+ }
+
+ return '';
+ }
#
Something went wrong with that request. Please try again.