add an extension point to run per-lang setups #661

capfredf · 2023-05-30T15:29:46Z

It works, but I think @greghendershott should have a better idea of how the code is organized.

Notes on the implementation:

A lang name is extracted textually in the Racket backend. I feel this piece of information should eventually be provided by DrRacket or some related library.

Also, it looks to me that a per-language configuration is quite desirable.

This commit is a squash of my old commits from over a year ago on the old "lang-lexer" branch. This was my starting point for the new "hash-lang" branch.

Change the token-map struct into a hash-lang% class. We need to supply an object% with certain methods for drracket:indentation functions, so, we may as well just make the whole thing a class instead of a struct. (Maybe this could be refactored into smaller pieces via mixins or something because it's starting to feel a little "fat".) Abandon our old strategies for parens and whitespace. Just go with the latest flow from drracket/expeditor.

Including: Implement a skip-whitespace method that produces same result as that of racket:text%. However {backward forward}-match and don't pass. Start to sketch out an alternative to forward-sexp-function handling, using drracket:grouping-position.

Although it's not baked into the token struct anymore, we create similar values when notifying, using paren-matches.

Rename some things from "lexindent" and "token-map" to hash-lang. Delete many files that are now N/A compared to the approach from a year ago, including steps toward making some of the code its own "token-map" package. For now, just focus on this being code for Racket Mode. Although I want to contribute the efficient interval-map idea for use in expeditor and other tools, some of that is still a moving target, and anyway I want to continue validating/refining things for use in Emacs. When my piece settles down, as well as some things in expeditor, I can take another look at how best to make some of this available for reuse.

Also note:
- This needs Racket 6.12+ for interval-map-ref/bounds.
- A few tests need `framework` and are skipped on headless Racket -- e.g. CI like GitHub Actions -- but do run locally.

This is only necessary due to hash-lang.rkt test using a racket:text% object from the framework collection.

Eliminate bounds+token struct; just use list. Eliminate -get-tokens; use get-tokens in tests. Initial implementation to use drracket:range-indent -- but not yet actually tested with a lang that supplies it, like e.g. rhombus.

Implement failure token bounds as needed by forward-sexp-function. Implement racket-hash-lang-{up down} commands. I'm still not 100% sure these are needed vs. standard Emacs {up down}-list commands which use forward-sexp, now that the previous item is implemented. But I think they are good to have for now, for development/testing. Re-implement position-paragraph and paragraph-{start end}-position methods. These still do a linear search from the start, which isn't ideal, but I'm not yet convinced it's worth trying to do a trickier incremental update like we do for the tokens interval-map. Start to add tests exercising #lang rhombus. Some of the indentation tests don't yet pass, for reasons I don't yet understand.

Incorporate some expeditor commits: racket/expeditor@e017835 racket/expeditor@5b1a374 Also update tests. At this point all the tests pass except for some shrubbery indentation tests on a small number of lines in the shrubbery demo.rkt file.

Also adjust shrubbery test to retrieve https://raw.githubusercontent.com/mflatt/shrubbery-rhombus-0/master/demo.rkt

racket/expeditor@2cd0a74

Handle it being available (with new-enough versions of Racket and/or syntax-color-lib) or not. After all, a goal of Racket Mode is for it to work to the extent possible on older versions of Racket. This is a good opportunity to move the tests to their own file, since they need to be skipped when color-textoid is not available. Anyway, they were starting to become unwieldy in the same file; it was becoming more tests than tested. Some of this will become N/A when we move it to syntax-color-lib itself; then hash-lang-bridge.rkt will need to do a similar test. But moving the tests to their own file is good prep, as those will end up in syntax-color-test not syntax-color-lib.

From racket/expeditor#10 it sounds like the "traditional" approach isn't desirable after all; remove it. Make the paragraph numbering "moderately" optimized: On do-update! we invalidate it, and then the paragraph methods recalculate it on demand. This is a mid point on the design spectrum between "naively scan from 0 every single time" and "do some minimal rebuild on every update". Note that I'm not 100% confident about the concurrency safety. But I'm committing anyway, for now, because I plan to take a hard look at concurrency for the class, soon, including this as well as the potential for update generations arriving on command threads out of order and needing to be queued something like TCP packets.
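To make the "invalidate on update, recalculate on demand" idea concrete, here is a minimal sketch in Python rather than Racket. It is not the hash-lang% code: the class name, the 0-based positions, and the `rebuilds` counter are all assumptions made for illustration, and it ignores the concurrency questions raised above.

```python
from bisect import bisect_right


class ParagraphIndex:
    """Hypothetical sketch: paragraph (line) starts cached lazily."""

    def __init__(self, text=""):
        self._text = text
        self._starts = None   # None means "invalidated"
        self.rebuilds = 0     # only here to make the laziness observable

    def update(self, pos, old_len, new_str):
        """Replace old_len chars at pos with new_str; just invalidate."""
        self._text = self._text[:pos] + new_str + self._text[pos + old_len:]
        self._starts = None

    def _ensure(self):
        # Rebuild the whole table at most once per update, on first use.
        if self._starts is None:
            starts = [0]
            for i, ch in enumerate(self._text):
                if ch == "\n":
                    starts.append(i + 1)
            self._starts = starts
            self.rebuilds += 1
        return self._starts

    def position_paragraph(self, pos):
        """Return the 0-based paragraph number containing pos."""
        return bisect_right(self._ensure(), pos) - 1

    def paragraph_start_position(self, para):
        """Return the position where paragraph number para begins."""
        return self._ensure()[para]
```

Any number of updates in a row cost nothing beyond the edit itself; the first paragraph query afterwards pays for one full scan, and later queries are binary searches.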

Make hash-lang.rkt itself less idiosyncratic to how we want to use it in Emacs. Instead of supplying the class an async-channel, now it takes an optional on-notify procedure. Something using the class only for indent might not need changed-token notifications at all. And even something that needs them can choose how to handle them (although some non-blocking technique like an async-channel is important). Similarly, move the code that massaged the notifications into the desired format for Emacs, from hash-lang.rkt to hash-lang-bridge.rkt. By supplying paren-matches as the first argument to on-notify, we give it the ability to massage parenthesis tokens. Also, since on-notify gets the full token struct, it can handle the case where token-type is a hash-table instead of a symbol, as with module-lexer*. Although I'm not sure this is everything that needs to be done, or that all of these details are just right, it's at least a first step.
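The split between a core class with an optional on-notify procedure and a bridge that massages notifications might look roughly like the following. This is a Python sketch under stated assumptions, not the Racket code: the `Tokenizer` class, the `(beg, end, kind, lexeme)` token shape, the `"type"` key inside a dict-valued token type, and the `"open"`/`"close"` output kinds are all hypothetical.

```python
import queue

# Assumed stand-in for the lang's paren-matches value.
PAREN_MATCHES = [("(", ")"), ("[", "]"), ("{", "}")]


class Tokenizer:
    """Hypothetical core class: knows nothing about Emacs or channels."""

    def __init__(self, on_notify=None):
        self.on_notify = on_notify   # optional; indent-only users omit it
        self.tokens = []

    def update(self, new_tokens):
        # List membership (not a set) because a token kind may be a dict.
        changed = [t for t in new_tokens if t not in self.tokens]
        self.tokens = list(new_tokens)
        if self.on_notify is not None and changed:
            self.on_notify(PAREN_MATCHES, changed)


def make_bridge(out_queue):
    """Hypothetical bridge: reshape tokens and enqueue without blocking."""

    def on_notify(paren_matches, changed):
        opens = {o for o, _ in paren_matches}
        closes = {c for _, c in paren_matches}
        for beg, end, kind, lexeme in changed:
            if isinstance(kind, dict):        # e.g. a hash-table token type
                kind = kind.get("type", "other")
            if kind == "parenthesis":
                if lexeme in opens:
                    kind = "open"
                elif lexeme in closes:
                    kind = "close"
            out_queue.put_nowait((beg, end, kind))

    return on_notify
```

A caller interested only in indentation would construct `Tokenizer()` with no callback, while a front end would pass `make_bridge(some_queue)` and drain the queue on its own thread.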

This test allows ours to take <= 3X the time. Although I'm not sure that's good enough, it sets an initial bar. (Although 3X isn't superb, the test example is range-indents rhombus/demo.rkt, a 600 line file -- and does so 10 times. The total time is around 54 ms for us vs. 24 ms for racket:text%, i.e. roughly 2.25X.) (To some extent this is measuring how "fancy" we are being in supporting the position-paragraph method, which effectively is position->line-number.)

Some things broke in commit 45197ed. Furthermore, now that hash-lang% uses an on-notify procedure (instead of a channel), we can simplify things: have our procedure transform and put things directly to the token-notify-channel; we no longer need a channel per hash-lang% in hash-lang-bridge.rkt.

Upon RET, the electric-indent stuff does an indent-line for the original line, as well as an indent-line for the new line. The former doesn't work well with hash-lang indenters. Although maybe I should figure out why, it also seems reasonable simply to rebind RET to newline-and-indent, which this commit does.

In real use in Emacs, I saw this happen sometimes, which caused a blocking command like indent to timeout and things to be left in an inconsistent state.

This commit just updates a bunch of variables "tm" (after the old token-map struct) to "o" (objects of the hash-lang% class).

Only check for new #lang when a change position is <= the end of the last lang spec we read. When we have a changed lang, notify the front end. So now, there is a 'hash-lang notification of varieties 'lang or 'token. The notification includes lang-info information like paren-matches, quote-matches, and booleans for whether the lang supplies a grouping-position, line-indenter, and range-indenter. The idea here is to supply enough information to appropriately configure things like the Emacs syntax-table, as well as indent-line-function, indent-region-function, and forward-sexp-function. That will be the next commit. The idea of this commit is to get enough back end stuff sorted out for this that I can focus on the Emacs end for awhile.
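As an illustration of the "only re-check when the change is at or before the end of the last lang spec" rule, here is a small Python sketch. It is an assumption-laden stand-in: the regexp replaces whatever the back end really uses to read the lang, positions are 0-based, and `LangTracker`, `checks`, and `on_lang_change` are hypothetical names.

```python
import re

# Crude stand-in for really reading the #lang line.
LANG_RE = re.compile(r"^#lang +(\S+)", re.MULTILINE)


class LangTracker:
    """Hypothetical sketch of gating the #lang re-check by position."""

    def __init__(self, on_lang_change):
        self.on_lang_change = on_lang_change
        self.lang = None
        self.lang_end = 0   # end position of the last lang spec we read
        self.checks = 0     # counts re-reads, to make the gating visible

    def after_change(self, text, change_pos):
        # Edits strictly after the lang spec cannot change the lang.
        if self.lang is not None and change_pos > self.lang_end:
            return
        self.checks += 1
        m = LANG_RE.search(text)
        new_lang = m.group(1) if m else None
        self.lang_end = m.end() if m else 0
        if new_lang != self.lang:
            self.lang = new_lang
            self.on_lang_change(new_lang)   # the 'lang notification
```

The comparison is `<=` rather than `<` so that typing directly at the end of the spec, for example turning `racket` into `racket/base`, still triggers a re-check.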

Back end: Handle the case of inserting in the middle of an existing token; split it before expanding the interval-map to make the old/new compare work correctly and emit minimal changed-token notifications. Add a message when racket-hash-lang-mode is used with a #lang that supplies nothing special beyond a color-lexer -- no grouping-position or indent functions. In that case plain old racket-mode would suffice, unless a user prefers the "more boring" (or "less garish", depending on your opinion) coloring. Similarly append symbols to the mode-line lighter when each of these is supplied by a lang. (Although I'm not sure this is really the right UX, it's information that at least is helpful to me now when dogfooding and trying to understand what is supposed to work and why.)

Front end: Stop setting forward-sexp-function. Instead supply navigation commands. When the lang supplies no grouping-position at all, or when it does but a specific call returns "use s-expression", we use the default Emacs commands. Switch from cl-case to pcase.

greghendershott · 2023-08-25T18:26:08Z

Looking at this again, I'm coming around to your original suggestion, mostly.

I do think it's fine to rely on Emacs auto-mode-alist as the mechanism to say, "Given some file extension, what Emacs major mode to use?" So for example all of .rkt, .scrbl, .rhm, etc. can use racket-mode, and, the user racket-mode-hook can enable the racket-hash-lang-mode minor mode. All these extensions are handled the same way with the hash langs.
Things like comment-start seem like a hole in the lang info spec. Whether it's DrRacket or Emacs or vscode, any editor that wants to offer comment/uncomment commands needs to know this. So stuff like this seems like something where Robby and Matthew and I could/should coordinate to add an info key.
Having said that, there may be miscellaneous config that a user wants to do based on the module language. Like in your example, you want M-q to fill-paragraph (not reindent) when in scribble lang. In that vein:
- It turns out that a language's "info" function can support a 'module-language key, as discussed for the #:info option of syntax/module-reader. (It's possible that older langs might not use this, but maybe the best answer there is we submit PRs to update those?) We don't need to re-implement read-language with regexps.
- I could define an Emacs hook, say racket-hash-lang-module-language, which is called with the mod lang value whenever that changes (from loading a file or from user editing). Users can add hook functions. This could be the point to do stuff like tweak M-q.
So this could handle "all other" config. Even in cases where we think adding a new lang info key is the ultimate Right Way, this could help in the meantime.
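The hook idea described above can be sketched as follows. This is Python for illustration only, not Emacs Lisp and not Racket Mode's code: `ModuleLanguageHook`, `set_module_language`, and the toy `keymap` dict are hypothetical names, and the "scribble means fill-paragraph on M-q" rule is just the example from this thread.

```python
class ModuleLanguageHook:
    """Hypothetical sketch: run user functions when the module lang changes."""

    _UNSET = object()   # sentinel so the first value always counts as a change

    def __init__(self):
        self._functions = []
        self._current = self._UNSET

    def add(self, fn):
        self._functions.append(fn)

    def set_module_language(self, mod_lang):
        """Called on file load and after edits; returns True if hooks ran."""
        if mod_lang == self._current:
            return False
        self._current = mod_lang
        for fn in self._functions:
            fn(mod_lang)
        return True


# A user's hook function, tweaking what M-q does per module language.
keymap = {"M-q": "reindent"}


def tweak_m_q(mod_lang):
    if mod_lang is not None and mod_lang.startswith("scribble"):
        keymap["M-q"] = "fill-paragraph"
    else:
        keymap["M-q"] = "reindent"
```

Because the hook receives only the module language value, the same user function works whether the change came from visiting a file or from editing the #lang line.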

Any thoughts?

I've been sketching this out and doing some initial testing, so I'm not asking you to update your PR.

This is roughly in the same spirit as PR #661, but using the module-language key supported by new lang's info function. Also this defines a hook as the means for users to customize. A default hook function sets comment-start for a few popular langs (although ultimately comment-start should become a new "official" lang info key).

greghendershott · 2023-08-25T20:42:06Z

I pushed a commit to the hash-lang branch. When you have a chance, let me know if it seems OK?

Although we keep the racket-hash-lang-module-language-hook added in the previous commit, for end users, we no longer use a default hook to set comment-{start end padding}. Instead have the back end provide those values. The goal is to have a new info key, e.g. "drracket:comments", for langs to supply this. Meanwhile, any fallbacks live down there (although a user could still use the hook to add their own fallback or work-around). For background on this commit and the previous commit see PR #661.

greghendershott · 2023-08-26T15:26:33Z

I pushed commit 2075184 which moves some stuff down to the back end with a view toward adding a new info key for langs to supply.

capfredf · 2023-08-27T23:59:59Z

@greghendershott Thank you. I will give it a shot and get back to you in a week or so.

Use token class to decide whether to do prog-indent-sexp, fill-paragraph, fill-comment, or nothing. This should avoid users needing to do the kind of configuration as in issue #661. Furthermore it can vary smartly within e.g. a Scribble doc to support three possible behaviors based on location. (In my dog-fooding so far this works well.)
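A rough sketch of that dispatch, in Python for illustration. The token kinds and the exact mapping (text to fill-paragraph, comment to fill-comment, white-space to nothing, everything else to prog-indent-sexp) are assumptions about how the decision could be made, and the `(beg, end, kind)` token shape with half-open ranges is hypothetical.

```python
def token_at(tokens, pos):
    """Return the kind of the token covering pos, or None if there is none."""
    for beg, end, kind in tokens:
        if beg <= pos < end:
            return kind
    return None


def dwim_action(kind):
    """Pick a command name from the token kind under point (assumed mapping)."""
    if kind is None or kind == "white-space":
        return None                  # nothing sensible to do here
    if kind == "text":
        return "fill-paragraph"      # prose, e.g. Scribble body text
    if kind == "comment":
        return "fill-comment"
    return "prog-indent-sexp"        # any code-like token


# Toy token list for a Scribble-like buffer: prose, a code form, a comment.
tokens = [
    (0, 20, "text"),
    (20, 21, "white-space"),
    (21, 22, "parenthesis"),
    (22, 28, "symbol"),
    (28, 29, "parenthesis"),
    (29, 45, "comment"),
]
```

With this toy buffer, point inside the prose selects fill-paragraph, point on the code form selects prog-indent-sexp, and point in the trailing comment selects fill-comment, which is the three-way behavior described for a Scribble doc.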

greghendershott · 2023-09-05T13:04:20Z

This isn't a nudge; on the contrary it's a summary for when you do have time to catch up:

A lang can now supply a drracket:comment-delimiters info key. racket-hash-lang-mode will use this to set comment-xxx variables.

Proposal: New info key drracket:comments racket/drracket#634 tracks the progress. But even now, before my PRs for scribble and rhombus are merged to supply this, the Racket Mode back end supplies fallbacks for those.
A new command racket-mode-C-M-q-dwim is bound to C-M-q by default. Based on the lang lexer's token under point, it does a prog-indent-sexp or fill-paragraph or fill-comment.

This has worked well for me so far editing a .scrbl file -- it fills in text section, but indents in racketblock code examples. But let me know of any problems/omissions.
Although the previous points address your configuration motivation (IIUC), there is also the new racket-hash-lang-module-language-hook for other configuration.

Finally I think I might go ahead and merge the hash-lang branch by the end of this week. (I might slap an "experimental" caveat in the docs. But these days it's probably better for this to live on the main branch, to get more use and improvement.)

capfredf · 2023-09-05T13:25:01Z

I haven't tried out the latest change yet. But out of curiosity, what do you have in mind when it comes to font locking?


greghendershott · 2023-09-05T14:44:10Z

That's an open ended question. 😄 A couple answers:

Apparently the set of tokens a lexer may return is open.
- If a relatively popular lang like rhombus or scribble adds a new token type, then I'll probably update this to look for that and map it to a specific face. e.g. I did this for rhombus "at" and "operator" tokens.
- At the same time, there should probably be a config var alist for users to add/override those choices. That's still TO-DO.
Overall, the font-lock is rather "plain" compared to classic racket-mode. Much like DrRacket. E.g. there aren't regexp rules to highlight popular functions, or variable names in let or define, etc. Of course that also means there aren't buggy corner cases, because regexps. My current thinking is:
- The lang lexer and racket-hash-lang-mode is about "syntactic" highlighting. By design that's somewhat basic... but also guaranteed to be correct.
- Sometimes people refer to "semantic" highlighting. This is much of the value-add from the "gaudy" regexp rules in classic Racket Mode. But here something like racket-xp-mode, based on check-syntax analysis, could do a good, and more-correct job. e.g. Highlight everything that's a variable. Or give font-lock-keyword-face to things imported from certain modules like racket/base. etc. I think that would give back much of the classic variety, but again with fewer regexp gotchas? Probably, but TBD.
And in fact this is another reason I'd like to merge to racket-hash-lang-mode. My other long-running project is a check-syntax db, "pdb". With those on separate branches, it's awkward to experiment with this mix of lexer highlighting and semantic highlighting.
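For the still-TO-DO idea of letting users add or override the token-type-to-face choices, one possible shape is a default table plus a user alist consulted first. The sketch below is Python for illustration; the face names are standard Emacs faces, but which token type maps to which face here is invented, as are `DEFAULT_FACES` and `face_for`.

```python
# Hypothetical defaults; the real mapping in Racket Mode may differ.
DEFAULT_FACES = {
    "comment": "font-lock-comment-face",
    "string": "font-lock-string-face",
    "constant": "font-lock-constant-face",
    "keyword": "font-lock-keyword-face",
    "at": "font-lock-doc-face",
    "operator": "font-lock-variable-name-face",
}


def face_for(token_type, user_overrides=()):
    """Look up a face; user (type, face) pairs win over the defaults."""
    for kind, face in user_overrides:
        if kind == token_type:
            return face
    # Unknown token types get no face, since the set of types is open.
    return DEFAULT_FACES.get(token_type)
```

Checking the user list first means a user can both re-face a known token type and give a face to a type the defaults have never heard of, without the defaults needing to change.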

That's my little brain dump. If you were actually asking some other, third question, please let me know. 😄