[Emphasis/strong] Consolidate issues #1036

Feder1co5oave · 2018-01-25T21:46:58Z

Emphasis syntax regression (?) #641 nested em and strong
Emphasis with underscore #1010 Emphasized URL with underscores #872 em words with underscores
The double underscores at the beginning of the word break the html layout #583 incorrectly detects right-flanking strong delimiter
Emphasis and underscores in the middle of a word #343 underscores are interpreted incorrectly in GFM #544 underscore with brace not rendered correctly #494 italic html incorrectly rendered? #216 incorrectly detects opening em delimiter:
GFM underscore parsing is broken #491 (comment) incorrectly detects opening and closing delimiters
Multiple underscores in words #59 incorrectly detects opening strong delimiter
Bug with asterisks (if this is the current version of GFM on GitHub) #520 (comment) incorrectly treats escaped * as delimiter
not allow empty content in em regular expression #735 empty (whitespace) em

Feder1co5oave · 2018-01-26T00:47:23Z

I apologize in advance for the following rant.

Generally speaking, our emphasis parsing is quite broken with respect to commonmark. The fact is, the spec is really a p. of s. because they wrote it with the goal of parsing any possible combination of multiple/nested/escaped/unescaped underscores and/or asterisks any monkey could type on a keyboard. So much so that they came up with idiotic examples such as:

_(_foo_)_
***foo** bar*
**foo*bar*baz**
***foo**
foo __\___
____foo_
*foo __bar *baz bim__ bam*
_a `_`_

So, really, really enlightening stuff.

/sarcasm

We have a couple alternatives:

we totally refuse to try to "correctly" parse all that crap and limit ourselves to a more limited range of possibilities. Such as
- no em inside em; no strong inside strong;
- no more than 3 consecutive underscores;
- no more than 3 consecutive asterisks;
- no unescaped underscores/asterisks inside underscores/asterisks emphasis (or strong)
I like the "political" connotation of this one. I feel like the commonmark people should get a grip on reality and stop writing neverending specs that are longer than the html5 syntax and made up of obscure (computer science-driven) language. Fun fact: even the word "minimized" shows up in rule 13, referring to the number of nestings of the different interpretations.

Yes, the spec was written in order to make syntax non-ambiguous for parser developers. But markdown is user-centric, it is beautiful because you can read it and easily get the meaning of the syntax. What's the meaning behind foo __\___? What is the goal in writing syntax rules that are not understandable by authors who will use that syntax?
we try to adhere as much as we can to the standard.

Now, marked was built with speed in mind. The bulk of the parsing is done by regex rules and it should be as "greedy" as possible. So, no backtracking, very little to no manual fiddling with the source. If we did alternative 2, we'll have to break this "rule". There's no way to parse emphasis as the spec dictates using only regular expressions. My guess is a recursive algorithm would be needed. I can't really predict the performance of that, but I don't want to write it and then scrape it because it's too slow, or causes stack overflow on pathologically long inputs.

joshbruce · 2018-01-26T02:01:54Z

Agreed. Emphasis and strong are the tough ones because of the overemphasis possibilities from the HTML spec not translating well to the plain text world. Like you point out. I mean what is this supposed to end up as:

****Hello****

<em><em><em><em>Hello</em></em></em></em>

// or

<strong><em>*Hello*</em></strong>

I also agree on the beauty and elegance of Markdown and, honestly, I can't imagine someone designing a written document that would have more than three consecutive asterisks or underscores. Bold, italic, bold + italic.

*Hello*

<em>Hello</em>

**Hello**

<strong>Hello</strong>

***Hello***

<strong><em>Hello</em></strong>

At which point, maybe we can take a note from one of Kevlin Henney's FizzBuzz examples??

Do all the asterisks or underscores as em - then any double em convert to strong. Feels a little like a hack though.

***Hello***

<em><em><em>Hello</em></em></em>

<strong><em>Hello</em></strong>

I'm not a big regex person; so, not sure how to help here. I thought regex had a recurse-type option to it. For this, I'm definitely leaning toward solving the most likely cases (one, two, three) and giving a caveat in the docs...if someone can solve that problem while maintaining the spirit and principles of Marked - great; otherwise, some of those cases in the spec just seem like someone making things complicated based on the HTML spec.

Feder1co5oave · 2018-01-26T17:45:03Z

At which point, maybe we can take a note from one of Kevlin Henney's FizzBuzz examples??

?what's this about? 🤔

I'm not a big regex person; so, not sure how to help here. I thought regex had a recurse-type option to it.

Well, there are different implementations of regexes, and almost none of them support recursion. I'd like to give the possible recursive algorithm some more thought and then decide whethere to implement it, or just monkey patch regexes until all test cases we deemed interesting pass. (but I prefer having a clean recursive algorithm than an ugly 2-rows long regex that no one can understand, not even me).

Now that I think about this? When are we gonna tackle performance? Have you got a milestone in mind for that? I'd do it after fixing issues and improving the architecture, but not much late in the process. I think our goal should be to take on commonmark.js and especially markdown-it from the point of view of performance and extensiblity. They're both fast, but with #975 we still have an edge on them, even after my work on compliance with commonmark, that usually complicates rules. After that, I think it should be worth doing some work on profiling code and analyzing hot spots.

joshbruce · 2018-01-26T18:40:19Z

Roman Numerals - not FizzBuzz - sorry: https://youtu.be/nrVIlhtoE3Y?t=38m18s (was thinking of the last one where you get all the ones and convert in passes...sarcasm was heavy in that)

The question of peformance and things is a good one. Think once we get into developer exerience on extensibility the milestones will go away and the general idea will be to always be doing all of them continuously - fixing defects, making minor architecture changes, and improving the developer experience, which I put performance in with...faster processing == happier developers. :)

(We do have that one PR we've been sitting on that has an apparent nice performance boost - just want to make sure it's working consistently as expected.)

Thanks for all the work you've been putting into this by the way. You're awesome.

joshbruce · 2018-01-26T18:45:46Z

ps. Would be nice if our benches could display the call stack in its entirety with the times for each thing...that way we can actually target areas for optimization instead of guessing at it. (A bit more empirical.)

Maybe regex isn't the way to go in some cases because they are slower than a method-based approach, for example. Roping in @UziTech on that score since it's in the realm of testing and whatnot.

As to taking someone else on - I tend to not worry about other people and just make myself the best version I can using the knowledge and skills I acquire over time. (I lean heavily on the collaborative side of Game Theory as opposed to the competitive. Not saying everyone else should but, between me wanting to help make Marked awesome by itself and others wanting to make it better compared to others...we should be good to go.) :)

Feder1co5oave · 2018-01-26T18:55:37Z

By "taking on" I meant trying to do better than them. Probably a language mishap on my part there :)

joshbruce · 2018-01-26T19:12:49Z

lol - I think I was picking up what you were putting down there. Maybe the language failing is on my side (despite language used in #963). Think the comparison piece is more an after-the-fact discovery from just trying to make Marked the greatest parser it can be - not something I target others for during the process - it's not the thing that drives me. Making marked awesome and making the team, such as it is more awesome every day. :)

Feder1co5oave · 2018-01-26T19:34:46Z

Yep. I believe I'm not trying to get into a "competition" on who can do more and faster, but I like having a baseline to compare my work with. When I look at the alternatives I see a bunch of people doing an amazing work, so if I'm willing to keep contributing to marked, I think it should be worth my time.

joshbruce · 2018-01-26T20:10:10Z

Agree wholeheartedly!

UziTech · 2018-01-26T20:36:42Z

Sorry I haven't been as active in this project lately. I'm finishing up a big project at work and you know how it goes (90% of the project takes 90% of the time and the other 10% of the project takes up the other 90% of the time) so I should have more time in the near future.

As far as benchmarks go, I think it would be a great idea to drill down into each benchmark test for some more insight into specific cases.

Here is my idea of priorities:

Security; Make sure there are no security concerns for our dependents. (Which I think we have already taken care of)
Bugs/Specs; Make sure marked is compliant with the specifications (GFM, CommonMark, maybe some others?) and has tests for each spec so changing things doesn't regress.
Benchmarks; Make sure additions/changes don't make the code slower.
Modularization; Make the code base easier to edit (including linting and automated testing. maybe code coverage)
Extensibility; Make it as easy as possible for users to change the render process. (I think this is already pretty good.)

I feel like we are still on #2 trying to make things spec compliant.

I really like @Feder1co5oave's comment on #659 (comment) about not adding additional features that can be implemented by overriding the renderer.

I would also like to see some of the options removed in v1.0 that could be implemented by overriding the rederer (e.g. baseUrl, xhtml, etc.) so marked can become self sufficient and be somewhat of a blunt tool that users can mold instead of trying to be a swiss army knife that needs constant maintenance.

joshbruce · 2018-01-27T00:18:38Z

@UziTech - The three of us are definitely on the same page, I think.

Not sure how far out of the loop you've been - totally understandable (had a hiccup like that myself) - so, we have the three milestones; 0.4.0 is definitely number two in the list you provided. I do try to make sure the tests pass and the benches don't change dramatically before merging.

The 0.5.0 milestone is the 4 and 5 in the list. Think this is when we would start implementing some of these things - if we can even do it; I know it's possible, just not sure with the overall setup.

Definitely agree that we should consider deprecating some of these added features in favor of a more coherent extensibility strategy. My inner product owner (and UX brain) would love to find out how many people are using which features...make sure we have a transition strategy laid out for them.

The notion of deep-dives into the benches would allow us to also empirically demostrate to the community of users why we are making some of these decisions; thereby, reducing feelings that we're just doing it because we want to.

Think we would be using the 0.6.0 milestone for a lot of the performance-related decisions moving forward. Then, once things have calmed down, go ahead and make a 1.0 release. By then we should have the biggest hurdles overcome, a stable architecture going into the foreseeable future that is easy to change in future, and a crystalized product vision and scope.

Thank you both so much. This is actually proving to be one of my favorite projects. I'm glad we've been able to revive it and work pretty seamlessly. (Was thinking it might be worthwhile to put together a Skype or Hangouts session going into 0.5.0 - gonna be some major changes there, might be good to use something other than text-based communication. Could tie in @Nalinc as well from the #746 issue. Just a thought and no pressure.)

robertwbradford · 2018-01-30T17:58:37Z

I was looking at an issue we were having with a BoldBoldItalicItalic combination. That is, "Bold" next to "BoldItalic" next to "Italic" or **Bold**_**BoldItalic**__Italic_. @DSMann pointed out that it may be an issue with the em regex using word boundaries (\b) and the fact that underscores are treated as word characters. (See table in http://www.ecma-international.org/ecma-262/5.1/#sec-15.10.2.6).

The following em regex seemed to correctly handle the **Bold**_**BoldItalic**__Italic_ string:

em: /^_((?:\\_|[\s\S])+?)_|^\*((?:\\\*|[\s\S])+?)\*/

Feder1co5oave · 2018-01-30T18:48:35Z

@robertwbradford thanks for reporting this, it is one of those cases where commonmark parsers get it right and more legacy markdown.pl-inspired ones get it wrong: try it.

However, just replacing the em rule with the regex you suggest seems to fail the nested_em test:

bin/marked
*test **test** test*

_test __test__ test_

<p><em>test </em><em>test</em><em> test</em></p>
<p><em>test </em><em>test</em><em> test</em></p>

whereas it should be

<p><em>test <strong>test</strong> test</em></p>
<p><em>test <strong>test</strong> test</em></p>

joshbruce · 2018-02-11T16:56:27Z

#956

Feder1co5oave · 2018-02-11T16:57:40Z

No need to post that on every single issue.

joshbruce · 2018-02-11T17:10:01Z

@Feder1co5oave: Agreed.

I'm actually trying to work up the case for moving the main repo to a different owner and demonstrate how effectively we are working as a team to make Marked awesome again...this gives the players involved a central starting place - I just wish I hadn't marked some of the other issues that aren't directly related to our efforts as a team.

joshbruce added L2 - annoying Similar to L1 - broken but there is a known workaround available for the issue category: inline elements labels Jan 25, 2018

joshbruce mentioned this issue Jan 25, 2018

Allow <b> and <i> tags #434

Closed

joshbruce added the good first issue Something easy to get started with label Feb 28, 2018

This was referenced Feb 28, 2018

Auto-linking - cm_autolinks test #1092

Closed

You're friend here is only mostly dead... #1106

Closed

joshbruce added epic A large goal or objective for a major release. and removed L2 - annoying Similar to L1 - broken but there is a known workaround available for the issue labels Apr 4, 2018

joshbruce added this to the v0.x milestone Apr 4, 2018

joshbruce removed the good first issue Something easy to get started with label Apr 9, 2018

erkannt mentioned this issue Nov 10, 2020

Markdown is processed incorrectly sciety/sciety#317

Closed

42 tasks

UziTech closed this as completed Dec 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Emphasis/strong] Consolidate issues #1036

[Emphasis/strong] Consolidate issues #1036

Feder1co5oave commented Jan 25, 2018 •

edited

Loading

Feder1co5oave commented Jan 26, 2018 •

edited

Loading

joshbruce commented Jan 26, 2018

Feder1co5oave commented Jan 26, 2018

joshbruce commented Jan 26, 2018

joshbruce commented Jan 26, 2018

Feder1co5oave commented Jan 26, 2018 via email •

edited

Loading

joshbruce commented Jan 26, 2018

Feder1co5oave commented Jan 26, 2018 via email •

edited

Loading

joshbruce commented Jan 26, 2018

UziTech commented Jan 26, 2018 •

edited

Loading

joshbruce commented Jan 27, 2018 •

edited

Loading

robertwbradford commented Jan 30, 2018

Feder1co5oave commented Jan 30, 2018 •

edited

Loading

joshbruce commented Feb 11, 2018

Feder1co5oave commented Feb 11, 2018 via email

joshbruce commented Feb 11, 2018

[Emphasis/strong] Consolidate issues #1036

[Emphasis/strong] Consolidate issues #1036

Comments

Feder1co5oave commented Jan 25, 2018 • edited Loading

Feder1co5oave commented Jan 26, 2018 • edited Loading

joshbruce commented Jan 26, 2018

Feder1co5oave commented Jan 26, 2018

joshbruce commented Jan 26, 2018

joshbruce commented Jan 26, 2018

Feder1co5oave commented Jan 26, 2018 via email • edited Loading

joshbruce commented Jan 26, 2018

Feder1co5oave commented Jan 26, 2018 via email • edited Loading

joshbruce commented Jan 26, 2018

UziTech commented Jan 26, 2018 • edited Loading

joshbruce commented Jan 27, 2018 • edited Loading

robertwbradford commented Jan 30, 2018

Feder1co5oave commented Jan 30, 2018 • edited Loading

joshbruce commented Feb 11, 2018

Feder1co5oave commented Feb 11, 2018 via email

joshbruce commented Feb 11, 2018

Feder1co5oave commented Jan 25, 2018 •

edited

Loading

Feder1co5oave commented Jan 26, 2018 •

edited

Loading

Feder1co5oave commented Jan 26, 2018 via email •

edited

Loading

Feder1co5oave commented Jan 26, 2018 via email •

edited

Loading

UziTech commented Jan 26, 2018 •

edited

Loading

joshbruce commented Jan 27, 2018 •

edited

Loading

Feder1co5oave commented Jan 30, 2018 •

edited

Loading