Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enh(autodetect) multiple autodetect fixes #2745

Merged
merged 30 commits into from Nov 15, 2020

Conversation

joshgoebel
Copy link
Member

@joshgoebel joshgoebel commented Oct 8, 2020

These changes were made using a modified test suite from https://github.com/andreasjansson/language-detection.el

Before:

67% correct. 80% first or second correct.

After:

72% detection is correct.  84% first OR second detected language is correct.

So a 4-5% measurable improvement and there are probably also additional improvements "hidden" as in matches that are still not correct, but likely to be better than they were before - though that's a hard one to measure. (it's very hard to tell the difference between Lisp-like languages, etc)

There were a few common areas of focus:

  • Removing incorrect illegal to allow auto-detect to work for languages that were being flagged as illegal
  • Remove double scoring element of many beginKeywords rules... I'm wondering if this shouldn't be done at the parser level itself.
  • Add illegal in a few places to make false positives less likely.
  • Increase match precision of some rules (also guard against false positives)
  • Reduce relevancy of rules that could potentially match almost anything (name rules for Lisp like languages, || in Ruby matching params when it could be OR or a concat operator in SQL, etc...)

@joshgoebel joshgoebel changed the title enh(autodetect) multiple autodetect fixes WIP: enh(autodetect) multiple autodetect fixes Oct 8, 2020
@joshgoebel joshgoebel added this to the 10.3 milestone Oct 14, 2020
@joshgoebel joshgoebel removed this from the 10.3 milestone Oct 16, 2020
@joshgoebel joshgoebel added this to the 10.4 milestone Oct 16, 2020
@joshgoebel
Copy link
Member Author

@joshgoebel joshgoebel commented Oct 16, 2020

Moving this to 10.4 also since it'll probably conflict with the UTF8 stuff and I still want to merge in our long lost CSS branch also...

@joshgoebel joshgoebel added the WIP label Oct 20, 2020
@joshgoebel joshgoebel changed the title WIP: enh(autodetect) multiple autodetect fixes enh(autodetect) multiple autodetect fixes Oct 29, 2020
@joshgoebel joshgoebel removed the WIP label Oct 29, 2020
@joshgoebel joshgoebel requested a review from egor-rogov Oct 29, 2020
@joshgoebel joshgoebel requested a review from allejo Oct 29, 2020
@joshgoebel joshgoebel force-pushed the autodetect_fixes branch 2 times, most recently from 6478b01 to e18cd94 Compare Nov 3, 2020
@joshgoebel
Copy link
Member Author

@joshgoebel joshgoebel commented Nov 3, 2020

This is definitely going to make more sense if you view the commits one at a time vs as a whole.

@joshgoebel joshgoebel added this to Remaining in 10.4 Nov 14, 2020
Copy link
Collaborator

@egor-rogov egor-rogov left a comment

Extensive work! Generally looks good, see some comments.

src/languages/bash.js Outdated Show resolved Hide resolved
@@ -49,7 +49,7 @@ export default function(hljs) {
var AT_PROPERTY_RE = /@-?\w[\w]*(-\w+)*/ // @-webkit-keyframes
var IDENT_RE = '[a-zA-Z-][a-zA-Z0-9_-]*';
var RULE = {
begin: /(?:[A-Z_.-]+|--[a-zA-Z0-9_-]+)\s*:/, returnBegin: true, end: ';', endsWithParent: true,
begin: /([*]\s?)?(?:[A-Z_.\-\\]+|--[a-zA-Z0-9_-]+)\s*(\/\*\*\/)?:/, returnBegin: true, end: ';', endsWithParent: true,
Copy link
Collaborator

@egor-rogov egor-rogov Nov 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What /**/ is meant for?

Copy link
Member Author

@joshgoebel joshgoebel Nov 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Common css hack from the old days.

src/languages/php.js Outdated Show resolved Hide resolved
begin: '\\$(\\{)?(::)?[a-zA-Z_]((::)?[a-zA-Z0-9_])*\\(([a-zA-Z0-9_])*\\)',
end: '[^a-zA-Z0-9_\\}\\$]'
begin: '\\$(::)?[a-zA-Z_]+((::)?[a-zA-Z0-9_]+)*' + regex.optional(ARRAY_ACCESS),
},
{
begin: '\\$(\\{)?(::)?[a-zA-Z_]((::)?[a-zA-Z0-9_])*',
end: '(\\))?[^a-zA-Z0-9_\\}\\$]'
begin: '\\$\\{(::)?[a-zA-Z_]((::)?[a-zA-Z0-9_])*' + regex.optional(ARRAY_ACCESS) + '\\}',
Copy link
Collaborator

@egor-rogov egor-rogov Nov 14, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... This doesn't look right to me. Consider e. g. $foo($bar): array index can be a full blown expression. You can have more fun:

set foo bar 
set ${foo}(1) 42 ; # this is not captured by the grammar
puts $bar(1)

It has nothing to do with autodection fixes though.

Copy link
Member Author

@joshgoebel joshgoebel Nov 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you actually know TCL are you just looked it up? I’ll go back and review this again because I think maybe I see some issues now but I’m not sure. It be really helpful to have some additional test to show what supposed to be possible here. On some of these languages I’m really shooting blind just looking at the examples I have and what the existing rules seem to do.

Copy link
Collaborator

@egor-rogov egor-rogov Nov 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used to knew it... wrote several programs in Tck/Tk long ago. Now I can hardly remember anything.

src/languages/vbscript.js Outdated Show resolved Hide resolved
joshgoebel added 13 commits Nov 15, 2020
- can start with `*` (css hacks)
- can include a comment after attribute name before : (css hacks)
- `value` is too common variable name to score points as keyword
- reduce 2x relevance for beginKeywords
- bump csharp relevance slightly
- operators get 0 relevance (consistency: no other grammars score them)
- "name" gets 0 relevance since almost any identifier will match

This reduces false positives in the language-detection.el rosetta data
set significantly.
- Add relevance for groovy shebang line
- Ternary should not grant extra relevance
- "name" gets 0 relevance since almost any identifier will match

Applying same logic as used with Clojure.
- only count => in `fn` context
- prevent beginKeywords double relevancy
- reduce relevance of `match`
joshgoebel added 17 commits Nov 15, 2020
- add `__FILE__` to keywords
- add `proc` and `lambda` Kernel methods to build_ins
- stricter rule for identifying method definition
- highlight variables
- `|` style params now get no relevance (can be too many other things)
- add SHEBANG rule
- make Ruby REPL matching a little stricter
- built-ins should only match if they are a call
- fix function detection
For languages with $ident and @Ident style variables this attempts
to prevent positives for $ident$ and @Ident@ type expressions, which
are likely something else entirely.

- bash
- perl
- php
- ruby
- I looked but couldn't find any reference to this.
- There is no reason to do this every other language gets
  credit for simple strings.
- This is found in other langauges and isn't a strong signal.
- also fix pgsql markup test
@joshgoebel joshgoebel merged commit 8acfeeb into highlightjs:master Nov 15, 2020
11 checks passed
@joshgoebel joshgoebel deleted the autodetect_fixes branch Nov 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
10.4
Remaining
Development

Successfully merging this pull request may close these issues.

None yet

2 participants