Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Emphasis/strong] Consolidate issues #1036

Closed
7 of 8 tasks
Feder1co5oave opened this issue Jan 25, 2018 · 16 comments
Closed
7 of 8 tasks

[Emphasis/strong] Consolidate issues #1036

Feder1co5oave opened this issue Jan 25, 2018 · 16 comments
Labels
category: inline elements epic A large goal or objective for a major release.
Milestone

Comments

@Feder1co5oave
Copy link
Contributor

Feder1co5oave commented Jan 25, 2018

@joshbruce joshbruce added L2 - annoying Similar to L1 - broken but there is a known workaround available for the issue category: inline elements labels Jan 25, 2018
@Feder1co5oave
Copy link
Contributor Author

Feder1co5oave commented Jan 26, 2018

I apologize in advance for the following rant.

Generally speaking, our emphasis parsing is quite broken with respect to commonmark. The fact is, the spec is really a p. of s. because they wrote it with the goal of parsing any possible combination of multiple/nested/escaped/unescaped underscores and/or asterisks any monkey could type on a keyboard. So much so that they came up with idiotic examples such as:

_(_foo_)_
***foo** bar*
**foo*bar*baz**
***foo**
foo __\___
____foo_
*foo __bar *baz bim__ bam*
_a `_`_

So, really, really enlightening stuff.

/sarcasm

We have a couple alternatives:

  1. we totally refuse to try to "correctly" parse all that crap and limit ourselves to a more limited range of possibilities. Such as

    • no em inside em; no strong inside strong;
    • no more than 3 consecutive underscores;
    • no more than 3 consecutive asterisks;
    • no unescaped underscores/asterisks inside underscores/asterisks emphasis (or strong)

    I like the "political" connotation of this one. I feel like the commonmark people should get a grip on reality and stop writing neverending specs that are longer than the html5 syntax and made up of obscure (computer science-driven) language. Fun fact: even the word "minimized" shows up in rule 13, referring to the number of nestings of the different interpretations.

    Yes, the spec was written in order to make syntax non-ambiguous for parser developers. But markdown is user-centric, it is beautiful because you can read it and easily get the meaning of the syntax. What's the meaning behind foo __\___? What is the goal in writing syntax rules that are not understandable by authors who will use that syntax?

  2. we try to adhere as much as we can to the standard.

Now, marked was built with speed in mind. The bulk of the parsing is done by regex rules and it should be as "greedy" as possible. So, no backtracking, very little to no manual fiddling with the source. If we did alternative 2, we'll have to break this "rule". There's no way to parse emphasis as the spec dictates using only regular expressions. My guess is a recursive algorithm would be needed. I can't really predict the performance of that, but I don't want to write it and then scrape it because it's too slow, or causes stack overflow on pathologically long inputs.

@joshbruce
Copy link
Member

Agreed. Emphasis and strong are the tough ones because of the overemphasis possibilities from the HTML spec not translating well to the plain text world. Like you point out. I mean what is this supposed to end up as:

****Hello****

<em><em><em><em>Hello</em></em></em></em>

// or

<strong><em>*Hello*</em></strong>

I also agree on the beauty and elegance of Markdown and, honestly, I can't imagine someone designing a written document that would have more than three consecutive asterisks or underscores. Bold, italic, bold + italic.

*Hello*

<em>Hello</em>

**Hello**

<strong>Hello</strong>

***Hello***

<strong><em>Hello</em></strong>

At which point, maybe we can take a note from one of Kevlin Henney's FizzBuzz examples??

Do all the asterisks or underscores as em - then any double em convert to strong. Feels a little like a hack though.

***Hello***

<em><em><em>Hello</em></em></em>

<strong><em>Hello</em></strong>

I'm not a big regex person; so, not sure how to help here. I thought regex had a recurse-type option to it. For this, I'm definitely leaning toward solving the most likely cases (one, two, three) and giving a caveat in the docs...if someone can solve that problem while maintaining the spirit and principles of Marked - great; otherwise, some of those cases in the spec just seem like someone making things complicated based on the HTML spec.

@Feder1co5oave
Copy link
Contributor Author

At which point, maybe we can take a note from one of Kevlin Henney's FizzBuzz examples??

?what's this about? 🤔

I'm not a big regex person; so, not sure how to help here. I thought regex had a recurse-type option to it.

Well, there are different implementations of regexes, and almost none of them support recursion. I'd like to give the possible recursive algorithm some more thought and then decide whethere to implement it, or just monkey patch regexes until all test cases we deemed interesting pass. (but I prefer having a clean recursive algorithm than an ugly 2-rows long regex that no one can understand, not even me).

Now that I think about this? When are we gonna tackle performance? Have you got a milestone in mind for that? I'd do it after fixing issues and improving the architecture, but not much late in the process. I think our goal should be to take on commonmark.js and especially markdown-it from the point of view of performance and extensiblity. They're both fast, but with #975 we still have an edge on them, even after my work on compliance with commonmark, that usually complicates rules. After that, I think it should be worth doing some work on profiling code and analyzing hot spots.

@joshbruce
Copy link
Member

Roman Numerals - not FizzBuzz - sorry: https://youtu.be/nrVIlhtoE3Y?t=38m18s (was thinking of the last one where you get all the ones and convert in passes...sarcasm was heavy in that)

The question of peformance and things is a good one. Think once we get into developer exerience on extensibility the milestones will go away and the general idea will be to always be doing all of them continuously - fixing defects, making minor architecture changes, and improving the developer experience, which I put performance in with...faster processing == happier developers. :)

(We do have that one PR we've been sitting on that has an apparent nice performance boost - just want to make sure it's working consistently as expected.)

Thanks for all the work you've been putting into this by the way. You're awesome.

@joshbruce
Copy link
Member

ps. Would be nice if our benches could display the call stack in its entirety with the times for each thing...that way we can actually target areas for optimization instead of guessing at it. (A bit more empirical.)

Maybe regex isn't the way to go in some cases because they are slower than a method-based approach, for example. Roping in @UziTech on that score since it's in the realm of testing and whatnot.

As to taking someone else on - I tend to not worry about other people and just make myself the best version I can using the knowledge and skills I acquire over time. (I lean heavily on the collaborative side of Game Theory as opposed to the competitive. Not saying everyone else should but, between me wanting to help make Marked awesome by itself and others wanting to make it better compared to others...we should be good to go.) :)

@Feder1co5oave
Copy link
Contributor Author

Feder1co5oave commented Jan 26, 2018 via email

@joshbruce
Copy link
Member

lol - I think I was picking up what you were putting down there. Maybe the language failing is on my side (despite language used in #963). Think the comparison piece is more an after-the-fact discovery from just trying to make Marked the greatest parser it can be - not something I target others for during the process - it's not the thing that drives me. Making marked awesome and making the team, such as it is more awesome every day. :)

@Feder1co5oave
Copy link
Contributor Author

Feder1co5oave commented Jan 26, 2018 via email

@joshbruce
Copy link
Member

Agree wholeheartedly!

@UziTech
Copy link
Member

UziTech commented Jan 26, 2018

Sorry I haven't been as active in this project lately. I'm finishing up a big project at work and you know how it goes (90% of the project takes 90% of the time and the other 10% of the project takes up the other 90% of the time) so I should have more time in the near future.

As far as benchmarks go, I think it would be a great idea to drill down into each benchmark test for some more insight into specific cases.

Here is my idea of priorities:

  1. Security; Make sure there are no security concerns for our dependents. (Which I think we have already taken care of)
  2. Bugs/Specs; Make sure marked is compliant with the specifications (GFM, CommonMark, maybe some others?) and has tests for each spec so changing things doesn't regress.
  3. Benchmarks; Make sure additions/changes don't make the code slower.
  4. Modularization; Make the code base easier to edit (including linting and automated testing. maybe code coverage)
  5. Extensibility; Make it as easy as possible for users to change the render process. (I think this is already pretty good.)

I feel like we are still on #2 trying to make things spec compliant.

I really like @Feder1co5oave's comment on #659 (comment) about not adding additional features that can be implemented by overriding the renderer.

I would also like to see some of the options removed in v1.0 that could be implemented by overriding the rederer (e.g. baseUrl, xhtml, etc.) so marked can become self sufficient and be somewhat of a blunt tool that users can mold instead of trying to be a swiss army knife that needs constant maintenance.

@joshbruce
Copy link
Member

joshbruce commented Jan 27, 2018

@UziTech - The three of us are definitely on the same page, I think.

Not sure how far out of the loop you've been - totally understandable (had a hiccup like that myself) - so, we have the three milestones; 0.4.0 is definitely number two in the list you provided. I do try to make sure the tests pass and the benches don't change dramatically before merging.

The 0.5.0 milestone is the 4 and 5 in the list. Think this is when we would start implementing some of these things - if we can even do it; I know it's possible, just not sure with the overall setup.

Definitely agree that we should consider deprecating some of these added features in favor of a more coherent extensibility strategy. My inner product owner (and UX brain) would love to find out how many people are using which features...make sure we have a transition strategy laid out for them.

The notion of deep-dives into the benches would allow us to also empirically demostrate to the community of users why we are making some of these decisions; thereby, reducing feelings that we're just doing it because we want to.

Think we would be using the 0.6.0 milestone for a lot of the performance-related decisions moving forward. Then, once things have calmed down, go ahead and make a 1.0 release. By then we should have the biggest hurdles overcome, a stable architecture going into the foreseeable future that is easy to change in future, and a crystalized product vision and scope.

Thank you both so much. This is actually proving to be one of my favorite projects. I'm glad we've been able to revive it and work pretty seamlessly. (Was thinking it might be worthwhile to put together a Skype or Hangouts session going into 0.5.0 - gonna be some major changes there, might be good to use something other than text-based communication. Could tie in @Nalinc as well from the #746 issue. Just a thought and no pressure.)

@robertwbradford
Copy link

I was looking at an issue we were having with a BoldBoldItalicItalic combination. That is, "Bold" next to "BoldItalic" next to "Italic" or **Bold**_**BoldItalic**__Italic_. @DSMann pointed out that it may be an issue with the em regex using word boundaries (\b) and the fact that underscores are treated as word characters. (See table in http://www.ecma-international.org/ecma-262/5.1/#sec-15.10.2.6).

The following em regex seemed to correctly handle the **Bold**_**BoldItalic**__Italic_ string:

em: /^_((?:\\_|[\s\S])+?)_|^\*((?:\\\*|[\s\S])+?)\*/

@Feder1co5oave
Copy link
Contributor Author

Feder1co5oave commented Jan 30, 2018

@robertwbradford thanks for reporting this, it is one of those cases where commonmark parsers get it right and more legacy markdown.pl-inspired ones get it wrong: try it.

However, just replacing the em rule with the regex you suggest seems to fail the nested_em test:

bin/marked
*test **test** test*

_test __test__ test_

<p><em>test </em><em>test</em><em> test</em></p>
<p><em>test </em><em>test</em><em> test</em></p>

whereas it should be

<p><em>test <strong>test</strong> test</em></p>
<p><em>test <strong>test</strong> test</em></p>

@joshbruce
Copy link
Member

#956

@Feder1co5oave
Copy link
Contributor Author

Feder1co5oave commented Feb 11, 2018 via email

@joshbruce
Copy link
Member

@Feder1co5oave: Agreed.

I'm actually trying to work up the case for moving the main repo to a different owner and demonstrate how effectively we are working as a team to make Marked awesome again...this gives the players involved a central starting place - I just wish I hadn't marked some of the other issues that aren't directly related to our efforts as a team.

@joshbruce joshbruce added the good first issue Something easy to get started with label Feb 28, 2018
@joshbruce joshbruce added epic A large goal or objective for a major release. and removed L2 - annoying Similar to L1 - broken but there is a known workaround available for the issue labels Apr 4, 2018
@joshbruce joshbruce added this to the v0.x milestone Apr 4, 2018
@joshbruce joshbruce removed the good first issue Something easy to get started with label Apr 9, 2018
@UziTech UziTech closed this as completed Dec 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
category: inline elements epic A large goal or objective for a major release.
Projects
None yet
Development

No branches or pull requests

4 participants