Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

smart punctuation option? #119

Closed
sodiumjoe opened this issue Dec 15, 2017 · 9 comments
Closed

smart punctuation option? #119

sodiumjoe opened this issue Dec 15, 2017 · 9 comments
Assignees

Comments

@sodiumjoe
Copy link

Is there a way to parse with smart punctuation?

@ErichDonGubler
Copy link

Let me offer some more detail -- seems like this issue doesn't have much detail to act on.

One of the features I thought would be interesting from using Jekyll is "smart" typography, which comes from the SmartyPants spec implemented as an extension in the RedCarpet markdown parser, as noted in Jekyll docs.

To give some examples, that would mean turning text like this:

"some curly quotes" 'like this' -- they're pretty awesome --- yo

...into this:

“some curly quotes” ‘like this’ – they're pretty awesome — yo

By no means do I think this should be a default, but it would be interesting to add -- I love {en,em}-dashes in particular. :)

@Keats
Copy link
Contributor

Keats commented Aug 9, 2019

@marcusklaas it looks like smart punctuation exists as an extension in the C reference parser, along with tests: https://github.com/commonmark/cmark/blob/master/test/smart_punct.txt

@marcusklaas
Copy link
Collaborator

marcusklaas commented Aug 9, 2019

That's very good to know! This is a very good candidate for addition to pulldown as an extension.

The tricky part is the implementation. Smart punctuation would add a number of special tokens, quotes and dashes, that we'd need to scan for. That could have a non-trivial performance impact, since the main character scan routine is the costliest part of the parse. This cost would even be apparent when the option is disabled, since many optimizations rely on this character set being static.

Do you have ideas on how we could solve this efficiently, @raphlinus? I recall you mentioned this problem when we added table support and its pipe character to the "new" pulldown.

@cormacrelf
Copy link

Ok, I think I've solved the perf issue.

First, I added ', " and - to the special char lookup tables and crdt_parse was ~2-3% slower. About what I expected, and if I'm not mistaken you don't pay for much more if you condition the character match arm on if self.options.contains(ENABLE_SMART) => {...}.

However, I expected to see no change if crdt.md was stripped of these three characters before running the baseline and subsequent benchmarks, but there was still a 1.3% slowdown. Here's the complete diff of cargo asm pulldown_cmark::parse::FirstPass::parse_line:

diff --git a/parse_line_baseline b/parse_line_master
index 0b6d3c9..e4cc529 100644
--- a/parse_line_baseline
+++ b/parse_line_master
@@ -26,16 +26,18 @@
  mov     byte, ptr, [rbp, -, 558], 1
  mov     word, ptr, [rbp, -, 557], 0
  mov     byte, ptr, [rbp, -, 555], 1
+ movabs  rax, 1103806595072
+ mov     qword, ptr, [rbp, -, 539], rax
  mov     qword, ptr, [rbp, -, 554], 0
  mov     qword, ptr, [rbp, -, 546], 0
- mov     dword, ptr, [rbp, -, 539], 0
- mov     byte, ptr, [rbp, -, 535], 1
- mov     dword, ptr, [rbp, -, 534], 0
- mov     byte, ptr, [rbp, -, 530], 1
- mov     dword, ptr, [rbp, -, 529], 16777216
- mov     qword, ptr, [rbp, -, 525], 0
- mov     qword, ptr, [rbp, -, 517], 0
- mov     word, ptr, [rbp, -, 509], 256
+ mov     byte, ptr, [rbp, -, 531], 0
+ mov     dword, ptr, [rbp, -, 530], 257
+ mov     byte, ptr, [rbp, -, 526], 1
+ mov     word, ptr, [rbp, -, 525], 0
+ mov     byte, ptr, [rbp, -, 523], 1
+ mov     qword, ptr, [rbp, -, 516], 0
+ mov     qword, ptr, [rbp, -, 522], 0
+ mov     byte, ptr, [rbp, -, 508], 1
  mov     qword, ptr, [rbp, -, 485], 0
  mov     qword, ptr, [rbp, -, 491], 0
  mov     qword, ptr, [rbp, -, 499], 0

So iterate_special_bytes is being inlined, as expected. However, the end result is creating the LUT on the stack inside parse_line, on every call, which seems to be because the LUT is a const when it should actually be a static.

I changed special_bytes to a static and for the case with no smart characters in crdt.md, crdt_parse was between no-change and 1% (barely over the noise threshold) faster. So that's fixed. But importantly, here's the comparison for the original crdt.md between origin/master and my revision that also scans for quote characters:

⌁  g/pulldown-cmark ╍ (master)+ cargo bench --bench html_rendering -- --baseline baseline crdt_parse
    Finished bench [optimized] target(s) in 0.13s
     Running target/release/deps/html_rendering-09cbd52ae1f5c0d2
crdt_parse              time:   [337.29 us 337.75 us 338.24 us]
                        change: [-2.1103% -1.8468% -1.5824%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)

Yep. It's almost 2% faster even with the extra LUT entries. 3.3% faster without them. SIMD wasn't impacted by the static change, probably because the _mm_loadu_si128 call was itself inlined into a bunch of setup instructions based on the const. So SIMD got 1.6% slower with static + the extra characters, which is the same delta as non-SIMD.

if you're worried about the perf of non-extended CommonMark mode, then you can add one (100% predictable) branch at the start of parse_line to choose a different LUT when there are options enabled, and this won't destroy inlining, it will just change one pointer to a different static location. Even one LUT per permutation of options, though one that handles all of them would probably suffice. I will PR the static change in a moment. But otherwise, implement away!

cormacrelf added a commit to cormacrelf/pulldown-cmark that referenced this issue Nov 10, 2019
It was being recreated on the stack each time parse_line was called.

This will make it faster when more special chars are added, and to swap out LUTs for different parse options without affecting the performance of the first pass when those options are not enabled.

See pulldown-cmark#119 (comment)
cormacrelf added a commit to cormacrelf/pulldown-cmark that referenced this issue Nov 25, 2019
It was being recreated on the stack each time parse_line was called.

This will make it faster when more special chars are added, and to swap out LUTs for different parse options without affecting the performance of the first pass when those options are not enabled.

See pulldown-cmark#119 (comment)
@poupryc
Copy link

poupryc commented Jan 11, 2020

Hello, is there any news on this issue?

@camelid
Copy link
Contributor

camelid commented Mar 29, 2020

Will this be implemented? It would be a really useful feature for me.

@jackwherry
Copy link

jackwherry commented Apr 5, 2020

If you're like me and want a quick visual fix in lieu of a proper solution to this problem at the Markdown parser stage, you can try Smartquotes.js. You can see it in action at my website. I imagine that this is a decently common problem for Zola websites. While it means that your website now depends on JavaScript, it's not a huge deal since it still works just fine without JS and the script is pretty tiny.

Just pop the JavaScript file in static/smartquotes.js and add this to the bottom of your <body>:

<script src="smartquotes.js"></script>
<script>smartquotes()</script>

Edit: I now have my own fork that also supports turning --- into —. I was previously doing this in a different part of my Zola config (through a polish macro) that would clobber any dashes used for line-drawing in <pre> or <code> tags, like the illustration here. Now that this is done correctly in JavaScript, it's smart enough to avoid a subset of tags that would cause issues.

@marcusklaas
Copy link
Collaborator

I've picked up this issue. In the spirit of @cormacrelf's work, I'm using the dynamic lookup table idea. I'm hoping to reuse much of the work he has already done, especially on smart quotes. Hopefully smart punctuation can make it into the 0.8 release of pulldown.

@marcusklaas
Copy link
Collaborator

Smart punctuation landed on master!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants