Misdetected Octave/Matlab code snippet #1747

Closed
MayeulC opened this issue May 15, 2018 · 7 comments
Labels
auto-detect Issue with auto detection of language type

Comments

@MayeulC

MayeulC commented May 15, 2018

Hi,

As far as I can tell, this code gets no syntax highlighting (at least on my Gitea install, which uses v9.11.0), yet it is a valid Octave script:

%% function cellarray = removeregexp(cellarray, expression_array)
% Removes matching expressions from cell array strings.
% If a string in the cell array is or becomes empty, it is removed from the cell
% array.
function cellarray = removeregexp(cellarray, expression_array)
  cellarray = regexprep(cellarray, expression_array, '');
  cellarray(cellfun(@isempty, cellarray)) = [];

The leading %% should give it away (though TeX might be a contender). The closing end/endfunction is optional and is omitted here. It also appears that functions that do not have leading comments are more easily misdetected.

@joshgoebel
Member

We don't tend to score comments higher than anything else, so assuming they are counted (they seem to be), %% just gets the same points as %. The problem is that the rest of this code doesn't look that different from many other languages...

It also appears that functions that do not have leading comments are more easily misdetected.

That's because % is an unusual comment character (at least in my experience), so the more comments there are, the more relevancy points the sample gets for "looking like" Matlab code rather than some other code.

Not sure there is anything to be done here unless we wanted to boost the relevancy of %% a bit, but that isn't going to help in larger samples of code where the ratio of %% to code is small and the code still looks pretty generic.
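To make the ratio argument concrete, here is a toy scoring sketch. It is not highlight.js's actual relevance algorithm: the weights, the function names, and the idea of normalizing by line count are all assumptions made for illustration.

```python
# Hypothetical weights; these are NOT highlight.js's real relevance values.
PLAIN_COMMENT_WEIGHT = 1
SECTION_COMMENT_WEIGHT = 1  # raise this to "boost" %% lines


def comment_points(source, section_weight=SECTION_COMMENT_WEIGHT):
    """Sum toy relevance points for %-style comment lines."""
    points = 0
    for line in source.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("%%"):
            points += section_weight
        elif stripped.startswith("%"):
            points += PLAIN_COMMENT_WEIGHT
    return points


def points_per_line(source, section_weight=SECTION_COMMENT_WEIGHT):
    """Divide by the non-blank line count so samples of any size compare."""
    lines = [line for line in source.splitlines() if line.strip()]
    return comment_points(source, section_weight) / max(len(lines), 1)
```

With the section weight raised to 5, a three-line sample containing one %% line and one % line averages 2 points per line, while a 40-line sample with a single %% line averages only 0.125. Under these assumptions the boost mostly helps tiny snippets.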

@joshgoebel
Member

@MayeulC Ruby's ERB also uses %%. What are the rules for %% in Matlab? Does it always have to start on a newline?

@joshgoebel joshgoebel added the auto-detect Issue with auto detection of language type label Oct 7, 2019
@MayeulC
Author

MayeulC commented Oct 8, 2019

@yyyc514 There aren't that many. Basically, in Matlab, those serve as section delimiters.
Sections can be executed individually, and the section you are currently editing is highlighted. Comments that follow %% are emboldened, so this is often used cosmetically to delimit code blocks, and thus often appears at the top of scripts, where it serves as a "title". It does need to be at the start of a new line.
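The line-start rule is easy to express as a multiline regular expression. The sketch below is only an illustration in Python, not the highlight.js grammar; `SECTION_RE` and `section_titles` are hypothetical names, and two details are assumptions rather than facts from this thread: the marker must sit in column one (no indentation), and a run such as %%% is not treated as a section.

```python
import re

# "%%" at the very start of a line, followed either by the end of the line
# or by whitespace and an optional title. Indented markers and "%%%" are
# deliberately not matched (assumption).
SECTION_RE = re.compile(r"^%%(?:[ \t]+(.*))?$", re.MULTILINE)


def section_titles(source):
    """Return the title of every line-leading %% marker ('' if untitled)."""
    return [(m.group(1) or "").strip() for m in SECTION_RE.finditer(source)]
```

A trailing `x = 1; %% note` is ignored because the marker does not start the line, which is exactly the property that makes this pattern rarer than a plain % comment.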

I would like to take back a bit of what I said. Although %% is encountered quite often in Matlab code, it is not necessarily a giveaway. And I do get your point: depending on the code base, the ratio of %% lines is generally around 1 in 10, 1 in 20, 1 in 40, or even zero.

Matlab has a lot of builtins, since everything is imported in the scope by default, and you can shadow functions. Here, regexprep, isempty, cellfun are builtins.

@ is an operator to get a function handle (can be used for lambdas as well). It is in my experience quite commonly used to prefix builtins.
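As a rough sketch of how those two signals could be counted, assuming a detector had a builtin list at hand: the set below holds only the three builtins named in this comment, and `builtin_hits` and `handle_targets` are hypothetical helper names. A real check would also need to skip comments and strings, which this does not do.

```python
import re

# Only the three builtins named in this thread; a real list is far longer.
KNOWN_BUILTINS = {"regexprep", "isempty", "cellfun"}

IDENTIFIER_RE = re.compile(r"[A-Za-z_]\w*")
# "@name" is a handle to a named function; "@(x) ..." is an anonymous
# function and is intentionally not captured here.
HANDLE_RE = re.compile(r"@\s*([A-Za-z_]\w*)")


def builtin_hits(source):
    """Count identifiers that appear in the known-builtin set."""
    return sum(1 for name in IDENTIFIER_RE.findall(source) if name in KNOWN_BUILTINS)


def handle_targets(source):
    """List the function names that follow an @ operator."""
    return HANDLE_RE.findall(source)
```

On the two body lines of the snippet from the first comment, this counts three builtin hits and one handle target, `isempty`.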


I had a further look at the matching script. Would it help to know the generic function syntax for matlab?

Here it is (< > denotes optional parts). I also saw that the script doesn't try to find =.

function < < [ list, of,> returnvalue < ,matching ] > = > func_name <(argument, list)>

This is what the XDG MIME database uses for detection (of course, the rules there are very simple).
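The template above can be turned into a regular expression fairly directly. The following Python sketch is one possible reading of it, not the pattern used by highlight.js or by the XDG MIME database; `FUNCTION_RE` and `parse_function_header` are hypothetical names.

```python
import re

FUNCTION_RE = re.compile(
    r"""^[ \t]*function[ \t]+
        (?:                                        # optional outputs and "="
            (?P<outputs>\[[^\]\n]*\]|[A-Za-z_]\w*)
            [ \t]*=[ \t]*
        )?
        (?P<name>[A-Za-z_]\w*)                     # function name
        [ \t]*
        (?:\((?P<args>[^)\n]*)\))?                 # optional argument list
    """,
    re.MULTILINE | re.VERBOSE,
)


def _split_names(text):
    """Split 'a, b' or '[a, b]' into ['a', 'b']."""
    return [part for part in re.split(r"[,\s]+", text.strip("[] \t")) if part]


def parse_function_header(line):
    """Return name, outputs and args of a function header, or None."""
    match = FUNCTION_RE.search(line)
    if match is None:
        return None
    return {
        "name": match.group("name"),
        "outputs": _split_names(match.group("outputs") or ""),
        "args": _split_names(match.group("args") or ""),
    }
```

Because the pattern is anchored to the start of a line, the commented-out signature on the first line of the original snippet is not mistaken for a declaration, while the real header on the fifth line is.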


GNU Octave can be considered a dialect, with a few modifications, among which:

  • %% no longer serves a special purpose
  • # can be used for comments (and is thus quite often used in a #!/usr/bin/env octave shebang).
  • endif, endfunction, endwhile, endfor, do ... until are keywords.
  • A number of toolboxes and domain-specific functions are split into dedicated packages, and pkg is introduced as a builtin to load them.
  • Special %! unit tests are available: %!assert, %!error, %!demo.
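
The Octave-specific items in this list could serve as cheap markers for telling the two dialects apart. The sketch below is an illustration under assumptions, not an existing highlight.js feature: the pattern names and `octave_markers` are hypothetical, do ... until is left out for brevity, and nothing here skips strings or comments.

```python
import re

# Each pattern targets one Octave-only feature from the list above.
OCTAVE_ONLY_PATTERNS = {
    "shebang": re.compile(r"\A#!.*\boctave\b"),
    "hash_comment": re.compile(r"^[ \t]*#(?!!)", re.MULTILINE),
    "end_keyword": re.compile(r"\b(?:endif|endfunction|endwhile|endfor)\b"),
    "unit_test": re.compile(r"^%!(?:assert|error|demo)\b", re.MULTILINE),
    "pkg_load": re.compile(r"^[ \t]*pkg[ \t]+load\b", re.MULTILINE),
}


def octave_markers(source):
    """Return the sorted names of Octave-specific markers found in source."""
    return sorted(
        name for name, pattern in OCTAVE_ONLY_PATTERNS.items()
        if pattern.search(source)
    )
```

A script that uses only the shared Matlab subset, such as one closed with a bare `end`, yields an empty list, so these markers can raise confidence in Octave but cannot rule it out.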

List of Octave builtins

@joshgoebel
Member

joshgoebel commented Oct 9, 2019

Does it always have to start on a newline?

Not sure you answered this. This would be one easy way to perhaps increase the relevancy of Matlab a bit over other things as this would be a rather unusual thing I think.

Auto-detection is something I'm very curious about, and it's a tough problem, especially given that many of our parsers are intended to be "simple" rather than complete. That simplicity is what allows us to detect/color a LOT of languages without having a huge size.

Would it help to know the generic function syntax for matlab?

Not sure this helps that much... From 10 miles high it looks a lot like expressions in other languages with < ( identifiers, etc...

@joshgoebel
Member

Long-term I'd like to see us get to a place (with tools, metrics, tests) where someone interested in this (improving detection for Matlab say) could get involved and play around and see if anything "sticks" and if they can improve the detection in "meaningful ways"...

Right now it's kind of hard/tricky to do that...

@joshgoebel
Member

joshgoebel commented Oct 16, 2019

@MayeulC If you'd like to submit a PR that bumps the relevancy of %% by just a little when it starts a new line, and that doesn't break the brittle balance of auto-detect, that might be useful here.

But first you might want to start by seeing how far apart you currently even are between matching Matlab and whatever else it's matching. You have loaded the matlab grammar JS file - or bundled it, yes? I realize your initial message says "code doesn't appear to have syntax highlighting" rather than that it was highlighted incorrectly.

Otherwise, not sure how we can improve this much. Your code sample is just hard to identify without more context (as small code samples often are).

@joshgoebel
Member

Closing as resolved/answered.
