Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add baseLangage to language grammars #2275

Closed
joshgoebel opened this issue Nov 12, 2019 · 11 comments · Fixed by #2844
Closed

Add baseLangage to language grammars #2275

joshgoebel opened this issue Nov 12, 2019 · 11 comments · Fixed by #2844
Labels
auto-detect Issue with auto detection of language type discuss/propose Proposal for a new feature/direction enhancement An enhancement or new feature parser

Comments

@joshgoebel
Copy link
Member

joshgoebel commented Nov 12, 2019

From my tweaked language checker:

┌────────────────────┬────────────────────┬──────────┬────────────────────┬──────────┬────────────────────┐
│ expected           │ actual             │ score    │ 2nd best           │ score    │ info               │
├────────────────────┼────────────────────┼──────────┼────────────────────┼──────────┼────────────────────┤
│ cpp                │ cpp                │ 16       │ arduino            │ 16       │ Relevance match.   │
├────────────────────┼────────────────────┼──────────┼────────────────────┼──────────┼────────────────────┤
│ cpp                │ cpp                │ 25       │ arduino            │ 25       │ Relevance match.   │
├────────────────────┼────────────────────┼──────────┼────────────────────┼──────────┼────────────────────┤
│ xml                │ xml                │ 46       │ django             │ 46       │ Relevance match.   │
├────────────────────┼────────────────────┼──────────┼────────────────────┼──────────┼────────────────────┤
│ xml                │ xml                │ 7        │ django             │ 7        │ Relevance match.   │
├────────────────────┼────────────────────┼──────────┼────────────────────┼──────────┼────────────────────┤
│ xml                │ xml                │ 3        │ django             │ 3        │ Relevance match.   │

These aren't surprising, but also aren't very helpful. They would be reported as errors except for the load order that's enforced by the requires... and the fact that we magically prefer the "first match" even if the scores are the same.

I'd like to add an explicit key to explain these relationships to the parser.

So that Arduino would be a superset: 'cpp' and Django would be a superset: 'xml'. I'm open to better naming. If two languages are detected to be related in this way then the auto-detect would simply ignore the superset in cases of EQUAL matches, only preferred it if it had a HIGHER match (which is what you'd expect if it was the superset, and not the original).

Also if you're looking to correct an incorrect match it's not helpful to know the second match to cpp was Arduino, they are essentially the same (not going to color differently, etc). It would be useful to know the second match (OTHER than Arduino) was say Pascal.

That's the change I'm suggested here if no one is opposed. Should only be a few lines of code and adding the appropriate attribute to a few grammars.

@joshgoebel joshgoebel added parser auto-detect Issue with auto detection of language type enhancement An enhancement or new feature labels Nov 12, 2019
@joshgoebel
Copy link
Member Author

joshgoebel commented Nov 12, 2019

And when we're thinking about naming it might be helpful to keep things like CoffeeScript and LiveScript in mind also:

LiveScript is a fork of Coco and an indirect descendant of CoffeeScript, with which it has much compatibility.

This is another tie:

│ coffeescript │ coffeescript │ 26 │ livescript │ 26 │ Relevance match. │

We're only saved from error because of alphabetical ordering. But our CoffeeScript snippet actually compiles as-is to LiveScript... so really I think the same type of relationship could apply here. If the parser gets "confused" about the difference between a snippet of CoffeeScript and LiveScript it should be smart enough to know that's "OK".

Perhaps:

LiveScript verySimilarTo: 'livescript'

And that would allow matches to be generally interchangeable between the two. It's also possible the tests themselves might be a better place to encode "similar to", but then when I think about seamless integration with 3rd party grammars (perhaps including running a much larger set of auto detect tests) I lean a little towards embedding this in the grammar itself (since it's so small a detail).

@egor-rogov
Copy link
Collaborator

If two languages are detected to be related in this way then the auto-detect would simply ignore the superset in cases of EQUAL matches, only preferred it if it had a HIGHER match (which is what you'd expect if it was the superset, and not the original).

Hmm... C++ is a superset for Arduino (Arduino adds some new keywords to C++). Hence, Arduino relevance match is always greater or equal to C++ relevance match, for any code snippet. Hence, in case of equal scores you should assume C++ (superset), not Arduino. Am I wrong?

@joshgoebel
Copy link
Member Author

joshgoebel commented Nov 13, 2019

Yes. That’s why I’m saying the specialized case should be safe to ignore if none of it’s rules result in a more specialized match. Though I’m really mixing two things here a bit.

First: There is autodetect in a very general/philosophical sense. “This short code snippet looks like C++ but it also looks very much like Arduino. It could possibly be either.” I suppos sometimes that might be useful. But really if it has no Arduino-isms is it Ardunio code in a meaningful sense for us? I think the same thinking applies to XML and Django.

If this first level is actually useful from an API standpoint I am more than happy to think about how to preserve this.

Secondly: I’m trying to make the auto detect/tests a lot smarter about actual “conflicts” between languages. If you have a matching relevance score for C++ or Arduino that’s NOT a real problem. It honestly doesn’t matter which won. It should be expected and not result in test failure.

Meanwhile, a matching relevance score between say JavaScript and SQL might actually be a big problem - poor auto detection, etc. those are the things I want to get better at focusing on.

@egor-rogov
Copy link
Collaborator

I think I'm a bit lost.

First: There is autodetect in a very general/philosophical sense. “This short code snippet looks like C++ but it also looks very much like Arduino. It could possibly be either.” I suppos sometimes that might be useful. But really if it has no Arduino-isms is it Ardunio code in a meaningful sense for us?

For me, the answer is "No, it is not Arduino, give me C++ please." But, as you mentioned:

If you have a matching relevance score for C++ or Arduino that’s NOT a real problem. It honestly doesn’t matter which won.

So, what is superset for then?

Meanwhile, a matching relevance score between say JavaScript and SQL might actually be a big problem - poor auto detection, etc. those are the things I want to get better at focusing on.

Right. And how superset can help?

@joshgoebel
Copy link
Member Author

joshgoebel commented Nov 20, 2019

So, what is superset for then?

To resolve the ambiguity of cases like C++ score:20 vs Arduino score:20. Whichever comes last or first alphabetically is a pretty terrible way to do it (and that's what we do now). Our tests passing or failing should NOT depend solely on the ordering of the data or naming of the languages.

If we decided to rename "XML" to "HTML" right now a whole bunch of tests might suddenly fail based on the renaming alone. This should not be the case.

Right. And how superset can help?

It's not designed to solve that problem. One thing at a time.

@egor-rogov
Copy link
Collaborator

Aha. Ok then. It makes sense.

@joshgoebel
Copy link
Member Author

We could also call it requires or dependsOn and express it that way... and then the "base" class would always win in cases of exact matches... I told you I wasn't sure on the naming. :-)

@egor-rogov
Copy link
Collaborator

egor-rogov commented Nov 21, 2019

I personally have no problems with superset, but it requires a bit of mathematical background, so probably another name will serve better. Both requires and depensOn don't look good to me, because they tell about implementation details, not about relationship between languages. base or baseLang seems better in that respect. Also parent comes to mind.

@joshgoebel
Copy link
Member Author

I like baseLanguage.

@joshgoebel joshgoebel changed the title Add superset to language grammar Add baseLangage to language grammars Nov 22, 2019
@joshgoebel
Copy link
Member Author

joshgoebel commented Nov 22, 2019

@egor-rogov How would you feel if as part of this we dinged a language instead of just removing it... So I'd add a scoredMatches key (or similar) and you would have a list of languages you could introspect. We could also add a base: true key or similar (for the results) to let someone process them as made the most sense to them... this would pretty easily allow for those use cases where someone might actually want to know "This could be Arduino or CPP". (maybe for a UI or something)

[
{ score: 46, language: "XML", ... }
{ score: 45.9, language: "Django", base: "XML", ... } // been dinged because it's not a base language
{ score: 45.9, language: "Handlebars", base: "XML", ... } // been dinged because it's not a base language
{ score: 45.9, language: "ERB", base: "XML",... } // been dinged because it's not a base language
]

The existing top match/second match would just become the first two items from this new list.

@egor-rogov
Copy link
Collaborator

I didn't quite understand how you're going to use it, but it's fine with me.

@joshgoebel joshgoebel added this to the 10.0 milestone Dec 7, 2019
@joshgoebel joshgoebel modified the milestones: 10.0, 10.1 Dec 24, 2019
joshgoebel added a commit to joshgoebel/highlight.js that referenced this issue Apr 21, 2020
- *technically* exact relevance ties in auto-detect are bad
- HLJS resolves ties in _alphabetic order_ and the fact
  that they currently work is more a matter of luck
  than anything else
- this now bubbles up such issues when the checker tool is run

Eventually something like highlightjs#2275 should improve matters.
@joshgoebel joshgoebel removed this from the 10.1 milestone Apr 21, 2020
@joshgoebel joshgoebel added the discuss/propose Proposal for a new feature/direction label Apr 30, 2020
joshgoebel added a commit that referenced this issue Nov 13, 2020
- Entire library now no longer uses any run-time dependencies (no grammar will break because another is missing)
- Add docs for fixMarkup deprecation
- Add docs for requireLanguage deprecation
- Remove `Requires:` meta-data used only for run-time dependencies.
- Gzip size is only ~230 bytes larger than before 
- Fix auto-detect tests now that the equivalency between `cpp` and `arduino` is no longer hidden by the load order. 
- adds `supersetOf` to resolve conflicts between very similar languages (C++, Arduino).  The base language wins in the case of a tie. ie `arduino` is now a `supersetOf` `cpp`. (closes #2275)

Note: `Requires: ` still stays overall as it's used to clue the build process about "loose" (non-breaking) sublanguage dependencies between languages.  This dependency is purposely loose and therefore still a thing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-detect Issue with auto detection of language type discuss/propose Proposal for a new feature/direction enhancement An enhancement or new feature parser
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants