Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Typographer (Replacing ©, ®, ™, ... etc) #4

Merged
merged 13 commits into from
Dec 15, 2022

Conversation

black-puppydog
Copy link
Contributor

@black-puppydog black-puppydog commented Nov 13, 2022

This is a straight-up re-implementation of lib/rules_core/replacements.js from the JS library.

There are still a few corner cases I'm not sure about, but this is progressed enough that I'd like feedback if anyone is willing to give. 🙂

I'm especially unsure about the linkify.rs modification. I needed this because the JS implementation checks for auto-linkified links, and doesn't touch the text in those. I tried multiple ways to do this.

The "easiest" would have been to be able to walk up the tree of nodes, to determine whether one of my parents is Linkified. We don't have that though, right?

The next attempt was to modify walk_mut so that I could pass a second closure that would be called when exiting the node. That way I could keep track of the linkification status, similar to how the JS lib does it. However, I never managed to get this past the borrow checker, because I was using a single linkification_level: i32 which I needed to mut-pass to both the entry and exit closure. It might well be that there would be a trivial fix for this (I'm still quite new to Rust and my stdlib-foo is not great) but in any case that would mean a bunch of boiler-plate to insert into all four walk[_post][_mut] methods and everywhere they're used.

So my solution is to attach an empty LinkifyMarker struct to the nodes when they get auto-linkified. I then just check if the text node I'm dealing with has that marker, and ignore it if it's the case.

Side note: the Error I'm encountering in #3 is trivial to fix with this branch: just add the typographer plugin to the parser during test, and it seems work!

To do

  • Documentation
  • Decide on where/whether this should be included in any default values in the crate

@black-puppydog
Copy link
Contributor Author

black-puppydog commented Nov 13, 2022

Okay after adding another test (fixture) and cross-checking that one with the JS implementation as well, I'm now reasonably confident that this works as intended. :)

Cargo.toml Outdated Show resolved Hide resolved
src/plugins/extra/typographer.rs Outdated Show resolved Hide resolved
// Note: the regex docs state that multi-line regexes (with "(?m)" that
// is) are not unicode-aware, meaning they *only* match '\n' as the
// newline and not any other newline characters which... apparently
// exist... Oh well...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in markdown-it here \n is the only possible newline, other newlines will get normalized for simplicity

so I don't think any problem with newlines exists

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just checked this by inserting this test into markdown-it-typographer.rs:

#[test]
fn can_handle_unicode_line_tabulation() {
    run("--\u{000B}---", "<p>– —</p>");
}

This uses the line tabulation character which println! does interpret as a newline (but preserving the horizontal indentation as expected).

The typography works fine, it produces "<p>–\u{b}—</p>\n" and the same happens in the JS lib (I checked). Other newlines like \u{0085} aka next line get completely ignored by JS, while the rust version would produce \u{85} in the output.

So for the typography I will actually remove the comment since it works fine (matching these more exotic newlines works just as fine as matching $ or ^) but I'm not sure what you mean by "other newlines will get normalized". Is that something that is planned, or a missing feature, or should it already work and it' somehow broken in these tests?

]);

static ref RARE_RE: Regex = Regex::new(r"\+-|\.\.|\?\?\?\?|!!!!|,,|--").unwrap();
static ref SCOPED_RE: Regex = Regex::new(r"\((C|c|TM|tm|R|r)\)").unwrap();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Case-insensitive flag in Rust looks something like this:
Regex::new(r"(?i)\((c|tm|r)\)").unwrap()

In your version, (Tm) will not get replaced, which differs from JS.

But I guess that was intentional to simplify replace_abbreviation below?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was actually not intentional; I can add a test for that, and thanks for the pointer towards (?i)! :)

@rlidwka
Copy link
Collaborator

rlidwka commented Nov 19, 2022

The "easiest" would have been to be able to walk up the tree of nodes, to determine whether one of my parents is Linkified. We don't have that though, right?

In theory, custom walk function will do the job (i.e. not using walk_mut, but writing your own). Not ideal, I know.

So my solution is to attach an empty LinkifyMarker struct to the nodes when they get auto-linkified. I then just check if the text node I'm dealing with has that marker, and ignore it if it's the case.

Actually, as far as I remember, text nodes that shouldn't be replaced should have TextSpecial type instead of Text (which is not done in JS for historical reasons). I'll look into fixing that.

Decide on where/whether this should be included in any default values in the crate

People usually use replacements together with smartquotes. Since smartquotes aren't implemented, I don't think this feature is complete, thus don't enable it by default yet.

@rlidwka
Copy link
Collaborator

rlidwka commented Nov 19, 2022

I'm especially unsure about the linkify.rs modification. I needed this because the JS implementation checks for auto-linkified links, and doesn't touch the text in those. I tried multiple ways to do this.

I believe this is the best way to fix it: 547505e

@codecov-commenter
Copy link

Codecov Report

Base: 95.13% // Head: 95.32% // Increases project coverage by +0.19% 🎉

Coverage data is based on head (547505e) compared to base (c9b087a).
Patch coverage: 100.00% of modified lines in pull request are covered.

❗ Current head 547505e differs from pull request most recent head 7fa1847. Consider uploading reports for the commit 7fa1847 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master       #4      +/-   ##
==========================================
+ Coverage   95.13%   95.32%   +0.19%     
==========================================
  Files          58       54       -4     
  Lines        2302     2226      -76     
==========================================
- Hits         2190     2122      -68     
+ Misses        112      104       -8     
Impacted Files Coverage Δ
src/plugins/cmark/inline/autolink.rs 100.00% <100.00%> (ø)
src/plugins/extra/linkify.rs 96.72% <100.00%> (+0.11%) ⬆️
examples/ferris/main.rs
examples/ferris/inline_rule.rs
examples/ferris/block_rule.rs
examples/ferris/core_rule.rs

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@black-puppydog
Copy link
Contributor Author

Thanks very much for the feedback, I'll look into it once I find a few quiet minutes. :)

People usually use replacements together with smartquotes. Since smartquotes aren't implemented, I don't think this feature is complete

Indeed, that's a good point. I'm already looking into porting the smartquotes next, although the algorithm used in JS is annoying to implement in rust due to borrow checking. But I'm sure I'll get there :D

black-puppydog and others added 9 commits November 20, 2022 01:56
This will only apply to test fixtures when they get re-generated.
The test fixture is copied from markdown-it, with the rust code being
just a copy of the tables file with modified path and plugin set.

The core rule is to be filled; for now it is a no-op.
@rlidwka pointed out that this is how the JS implementation handles this.
Confirmed by adding a unit test to the JS lib.
The `plus_minus` test works!
The rest should be a walk in the park now. 😆
`cargo test -- --test fixtures_markdown_it_typographer_txt::ellipsis`

# Conflicts:
#	src/plugins/extra/typographer.rs
# Conflicts:
#	src/plugins/extra/typographer.rs
# Conflicts:
#	src/plugins/extra/typographer.rs
@black-puppydog
Copy link
Contributor Author

This became quite a comprehensive rebase so that lazy_static doesn't get used anywhere in the commit history. Thanks to your commit, the whole linkify modification seems to have become unnecessary. Yay for simplicity! 🎉

The linebreak question is a separate issue, if any.
I also added a bit of documentation, especially also noting that this is not enabled with extras by default.

@black-puppydog
Copy link
Contributor Author

Just found time to add one last thing here; the way I had used once_cell was the "simple but verbose" way. I just added a commit that takes advantage of once_cell::Lazy which simplifies the code a bit.

@black-puppydog black-puppydog mentioned this pull request Nov 27, 2022
This one turned out to be super difficult because we don't want to be
making these replacements inside linkified URLs.
Luckily, @rlidwka had the preceding commit ready to prevent that from
happening.

This fixes the last tests from the markdown-it fixture.
I had somehow missed that pattern when reading the docs on `once_cell` but
now that I know about it, this is _the_ obvious way to do it.

# Conflicts:
#	src/plugins/extra/typographer.rs
Thanks to @rlidwka who provided this piece in an act of kind teaching. :)
@black-puppydog
Copy link
Contributor Author

Following the comments in #5 I went ahead and made the changes you indicated. Thanks again for the feedback, this is a great learning experience for me! I appreciate that it takes a lot of time to do these reviews.

unreachable instead of panic
Cow and if let instead of to_string and while
✅ closure instead of impl Replacer

@rlidwka rlidwka merged commit fbc43e9 into markdown-it-rust:master Dec 15, 2022
@black-puppydog black-puppydog deleted the typographer branch December 15, 2022 09:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants