smart punctuation option? #119

sodiumjoe · 2017-12-15T00:30:34Z

Is there a way to parse with smart punctuation?

ErichDonGubler · 2019-05-02T19:24:47Z

Let me offer some more detail -- seems like this issue doesn't have much detail to act on.

One of the features I thought would be interesting from using Jekyll is "smart" typography, which comes from the SmartyPants spec implemented as an extension in the RedCarpet markdown parser, as noted in Jekyll docs.

To give some examples, that would mean turning text like this:

"some curly quotes" 'like this' -- they're pretty awesome --- yo

...into this:

“some curly quotes” ‘like this’ – they're pretty awesome — yo

By no means do I think this should be a default, but it would be interesting to add -- I love {en,em}-dashes in particular. :)

Keats · 2019-08-09T13:25:45Z

@marcusklaas it looks like smart punctuation exists as an extension in the C reference parser, along with tests: https://github.com/commonmark/cmark/blob/master/test/smart_punct.txt

marcusklaas · 2019-08-09T13:37:29Z

That's very good to know! This is a very good candidate for addition to pulldown as an extension.

The tricky part is the implementation. Smart punctuation would add a number of special tokens, quotes and dashes, that we'd need to scan for. That could have a non-trivial performance impact, since the main character scan routine is the costliest part of the parse. This cost would even be apparent when the option is disabled, since many optimizations rely on this character set being static.

Do you have ideas on how we could solve this efficiently, @raphlinus? I recall you mentioned this problem when we added table support and its pipe character to the "new" pulldown.

cormacrelf · 2019-11-10T03:15:20Z

Ok, I think I've solved the perf issue.

First, I added ', " and - to the special char lookup tables and crdt_parse was ~2-3% slower. About what I expected, and if I'm not mistaken you don't pay for much more if you condition the character match arm on if self.options.contains(ENABLE_SMART) => {...}.

However, I expected to see no change if crdt.md was stripped of these three characters before running the baseline and subsequent benchmarks, but there was still a 1.3% slowdown. Here's the complete diff of cargo asm pulldown_cmark::parse::FirstPass::parse_line:

diff --git a/parse_line_baseline b/parse_line_master
index 0b6d3c9..e4cc529 100644
--- a/parse_line_baseline
+++ b/parse_line_master
@@ -26,16 +26,18 @@
  mov     byte, ptr, [rbp, -, 558], 1
  mov     word, ptr, [rbp, -, 557], 0
  mov     byte, ptr, [rbp, -, 555], 1
+ movabs  rax, 1103806595072
+ mov     qword, ptr, [rbp, -, 539], rax
  mov     qword, ptr, [rbp, -, 554], 0
  mov     qword, ptr, [rbp, -, 546], 0
- mov     dword, ptr, [rbp, -, 539], 0
- mov     byte, ptr, [rbp, -, 535], 1
- mov     dword, ptr, [rbp, -, 534], 0
- mov     byte, ptr, [rbp, -, 530], 1
- mov     dword, ptr, [rbp, -, 529], 16777216
- mov     qword, ptr, [rbp, -, 525], 0
- mov     qword, ptr, [rbp, -, 517], 0
- mov     word, ptr, [rbp, -, 509], 256
+ mov     byte, ptr, [rbp, -, 531], 0
+ mov     dword, ptr, [rbp, -, 530], 257
+ mov     byte, ptr, [rbp, -, 526], 1
+ mov     word, ptr, [rbp, -, 525], 0
+ mov     byte, ptr, [rbp, -, 523], 1
+ mov     qword, ptr, [rbp, -, 516], 0
+ mov     qword, ptr, [rbp, -, 522], 0
+ mov     byte, ptr, [rbp, -, 508], 1
  mov     qword, ptr, [rbp, -, 485], 0
  mov     qword, ptr, [rbp, -, 491], 0
  mov     qword, ptr, [rbp, -, 499], 0

So iterate_special_bytes is being inlined, as expected. However, the end result is creating the LUT on the stack inside parse_line, on every call, which seems to be because the LUT is a const when it should actually be a static.

I changed special_bytes to a static and for the case with no smart characters in crdt.md, crdt_parse was between no-change and 1% (barely over the noise threshold) faster. So that's fixed. But importantly, here's the comparison for the original crdt.md between origin/master and my revision that also scans for quote characters:

⌁  g/pulldown-cmark ╍ (master)+ cargo bench --bench html_rendering -- --baseline baseline crdt_parse
    Finished bench [optimized] target(s) in 0.13s
     Running target/release/deps/html_rendering-09cbd52ae1f5c0d2
crdt_parse              time:   [337.29 us 337.75 us 338.24 us]
                        change: [-2.1103% -1.8468% -1.5824%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)

Yep. It's almost 2% faster even with the extra LUT entries. 3.3% faster without them. SIMD wasn't impacted by the static change, probably because the _mm_loadu_si128 call was itself inlined into a bunch of setup instructions based on the const. So SIMD got 1.6% slower with static + the extra characters, which is the same delta as non-SIMD.

if you're worried about the perf of non-extended CommonMark mode, then you can add one (100% predictable) branch at the start of parse_line to choose a different LUT when there are options enabled, and this won't destroy inlining, it will just change one pointer to a different static location. Even one LUT per permutation of options, though one that handles all of them would probably suffice. I will PR the static change in a moment. But otherwise, implement away!

It was being recreated on the stack each time parse_line was called. This will make it faster when more special chars are added, and to swap out LUTs for different parse options without affecting the performance of the first pass when those options are not enabled. See pulldown-cmark#119 (comment)

poupryc · 2020-01-11T16:31:09Z

Hello, is there any news on this issue?

camelid · 2020-03-29T22:58:30Z

Will this be implemented? It would be a really useful feature for me.

jackwherry · 2020-04-05T00:12:29Z

If you're like me and want a quick visual fix in lieu of a proper solution to this problem at the Markdown parser stage, you can try Smartquotes.js. You can see it in action at my website. I imagine that this is a decently common problem for Zola websites. While it means that your website now depends on JavaScript, it's not a huge deal since it still works just fine without JS and the script is pretty tiny.

Just pop the JavaScript file in static/smartquotes.js and add this to the bottom of your <body>:

<script src="smartquotes.js"></script>
<script>smartquotes()</script>

Edit: I now have my own fork that also supports turning --- into —. I was previously doing this in a different part of my Zola config (through a polish macro) that would clobber any dashes used for line-drawing in <pre> or <code> tags, like the illustration here. Now that this is done correctly in JavaScript, it's smart enough to avoid a subset of tags that would cause issues.

marcusklaas · 2020-08-26T14:03:51Z

I've picked up this issue. In the spirit of @cormacrelf's work, I'm using the dynamic lookup table idea. I'm hoping to reuse much of the work he has already done, especially on smart quotes. Hopefully smart punctuation can make it into the 0.8 release of pulldown.

marcusklaas · 2020-09-01T12:48:47Z

Smart punctuation landed on master!

Libbum mentioned this issue Mar 4, 2018

Solution to pandoc extensions Libbum/AxiomaticSemantics#2

Closed

Keats mentioned this issue Jul 7, 2019

“Smart” typographic punctuation getzola/zola#740

Closed

cormacrelf mentioned this issue Nov 10, 2019

Change the special character lookup tables to statics #409

Closed

marcusklaas self-assigned this Aug 26, 2020

marcusklaas mentioned this issue Aug 27, 2020

Implement smart punctuation #470

Merged

marcusklaas closed this as completed Sep 1, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

smart punctuation option? #119

smart punctuation option? #119

sodiumjoe commented Dec 15, 2017

ErichDonGubler commented May 2, 2019

Keats commented Aug 9, 2019

marcusklaas commented Aug 9, 2019 •

edited

cormacrelf commented Nov 10, 2019

poupryc commented Jan 11, 2020

camelid commented Mar 29, 2020

jackwherry commented Apr 5, 2020 •

edited

marcusklaas commented Aug 26, 2020

marcusklaas commented Sep 1, 2020

smart punctuation option? #119

smart punctuation option? #119

Comments

sodiumjoe commented Dec 15, 2017

ErichDonGubler commented May 2, 2019

Keats commented Aug 9, 2019

marcusklaas commented Aug 9, 2019 • edited

cormacrelf commented Nov 10, 2019

poupryc commented Jan 11, 2020

camelid commented Mar 29, 2020

jackwherry commented Apr 5, 2020 • edited

marcusklaas commented Aug 26, 2020

marcusklaas commented Sep 1, 2020

marcusklaas commented Aug 9, 2019 •

edited

jackwherry commented Apr 5, 2020 •

edited