Unicode affects Check Syntax #638
Comments
I see the expected alignment: Note that it's shown with a strikeout face, which is the default for In your screenshot, it's a red wavy underline. This makes me think of This makes me wonder if you're reporting behavior from flycheck and/or its Maybe you just customized |
Sorry for the confusion. I have
in my config file. |
My speculation is that this might be due to the Racket version difference. My Racket is essentially at the HEAD, which has the grapheme clustering change recently added. EDITED: I confirm that when I set my PATH to Racket 8.5 (and regenerate env in Emacs), the position is correct. (I haven't tried 8.6, but it does already suggest that this is related to Racket version) |
@rfindler said he would be making changes, but my understanding was that it would be only up in the DrRacket application, not down at the level of the positions reported by A quick scan of recent commits in the But I guess that must not be the case, something did recently change/break somehow. It looks like I haven't rebuilt Racket from source since commit 1db3e1829d circa late May. I'll try to find time to do that, confirm it doesn't work for me either with that, see if I can help narrow down what changed/broke, and point it out to Robby. |
Yes, what @greghendershott says is right. (There could be a bug of course :). Also, stuff is currently in flux and @mflatt has another approach to handling how things work with editors that will cause other changes and might break/unbreak something here. But more generally, it should be the case that positions coming from the |
@greghendershott if you wanted to try things out today, it might make sense to use this and this instead of the drracket and gui packages that you'd get by default. Those are the future, I believe. |
Although maybe there are edge cases of which I've been blissfully ignorant, historically everything already has "just worked" between So as we discussed I'm definitely a fan of "don't change that". I'm glad that's still the intent. Of course things might be in flux on the main branches and there might be bugs, which seems to be the case here (?). I'll see what I can figure out. Thanks for the links; although it seems like |
Assuming that Emacs really does do graphemes = positions then I think nothing should have to change for you but it is fantastic that you're offering to help debug. (But that code can certainly break stuff if it is wrong :) |
@sorawee Unsurprisingly I can reproduce this using Racket built from commit 0cd6f5631e plus the versions of drracket etc. that it pulls in. |
Continuing to be stumped by the lack of apparent change in TL;DR it seems that Consider this little program to exercise
#lang racket/base
(require racket/path
         syntax/modread
         racket/match)
(define (string->syntax code-str)
  (define path (build-path (current-directory) "foo.rkt"))
  (define dir (path-only path))
  (parameterize ([current-load-relative-directory dir]
                 [current-directory dir])
    (with-module-reading-parameterization
      (λ ()
        (define in (open-input-string code-str path))
        (port-count-lines! in)
        (match (read-syntax path in)
          [(? eof-object?) #'""]
          [stx stx])))))
(define stx (string->syntax "\"☠️\""))
stx
(syntax-span stx)
When run by older Rackets (I tried 8.0 and 8.4 BC because I happened to have them handy):
But run by Racket built from source as of today:
In other words they both read the same string (which looks weird to my eyes but I'm a Unicode noob). However Although this program doesn't exercise/show it, I'm guessing this also throws off the This seems like a change down in |
Yes, this is an (intentional) change at the port-counting layer. Ports used to count in terms of Unicode code points, but now they count in terms of graphemes (assuming you call
produces 1, 2, 3 in older versions of racket and 1, 1, 2 in git head. This change then gets into the syntax objects, and the spans and whatnot that you're seeing in the example above. And this is how we get grapheme-based counting in the new racket, as opposed to code-point-based counting of the older versions. If there aren't any fancy emojis (or maybe other things) then there isn't a difference, but if there are, then hopefully life should be an improvement due to this change. And it means that, in Emacs, you'll want to use methods that accept positions in the buffer to be counting based on graphemes, not based on other things. I took a quick look at Emacs and it wasn't obvious to me if this is easy or hard ... |
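To see which counting scheme a given Racket uses, something along these lines can help. This is only a sketch and not the snippet from the comment above; `positions-after-each-char` is a name made up for the illustration. It reads a string one character at a time from a port with `port-count-lines!` enabled and collects the position reported by `port-next-location` after each read.

```racket
#lang racket/base

;; Read STR one character at a time from a port with line counting
;; enabled, and collect the position the port reports after each read.
;; (Hypothetical helper, for illustration only.)
(define (positions-after-each-char str)
  (define in (open-input-string str))
  (port-count-lines! in)
  (let loop ([acc '()])
    (define c (read-char in))
    (cond
      [(eof-object? c) (reverse acc)]
      [else
       ;; port-next-location returns line, column, and position.
       (define-values (line col pos) (port-next-location in))
       (loop (cons pos acc))])))

;; Plain ASCII: one position per character under either counting scheme.
(positions-after-each-char "abc")

;; Skull and crossbones followed by a variation selector (two code points).
;; The positions reported here depend on the Racket version in use.
(positions-after-each-char
 (string (integer->char #x2620) (integer->char #xFE0F)))
```

For ASCII-only input the result should be the same everywhere; only input containing multi-code-point clusters should differ between code-point counting and grapheme counting.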
Hmm, could there be a way (or a parameter, or something) to convert back to codepoints? My linter/autograder scripts produce output in terms of syntax object spans, which is then fed to Codemirror to highlight, and Codemirror expects to work in terms of codepoints. I can't prevail on Codemirror to change its habits, and it already has some quirks in how it handles graphemes. But I might have a hope here... :) (Also -- is this change in HEAD only, or in 8.6 as well?) |
@blerner maybe we should open a separate issue for that one? I think the change has not yet been in a release. |
Feel free to open it -- I'm not sure which repo would be best for it ;-) In the meantime, I'm relieved to hear it's not in 8.6 yet, so I can punt on dealing with it until next year :) |
A parameter could work. I had earlier considered adding an option in The parameter I have in mind would affect only built-in line-counting and any port implementation that opts to pay attention to the parameters. It would apply when |
The scenario I have is
so as long as the port from |
Yes, the parameter would apply in that case. |
Would setting the environment variable potentially break programs running in DrRacket (or DrRacket itself)? |
It could cause source locations to not be in sync with editor content, but I think that's no different than a |
About it being intentional: Yes, after posting, I had taken a break, come back, and discovered racket/racket@48fda3e and adjacent commits. The more I thought about that, the more discouraged I was feeling, wondering how even to translate from graphemes back to codepoints, which I'm pretty sure Emacs needs (and it looks like some other tools need, too). It would be a huge relief to have a parameter to support the previous, codepoint-based positions. I could try to |
Maybe there should be a parameter that affects only ports opened as part of |
Maybe I don't have the right example in mind, but in an environment where non-grapheme counting is needed, it seems you probably want it pervasively. I've added the parameter as |
My thought was a situation where someone wants to work with their program in Emacs but their program also opens and manipulates grapheme-containing files. So they do the Emacs equivalent of the Run button in DrRacket. If there is an error, we want the error message to come out in code-point-based positions, but if there is no error, we want the file that they open and manipulate to report positions in graphemes. (One example of this might be running DrRacket from inside Emacs, although that doesn't seem super common :) |
I just tried an experiment and its results suggest that Emacs wants grapheme-based counting. This is a very naive experiment, however :). I created a file with these bytes: Is racket-mode using goto-char and related operations that accept positions like its to move the insertion point around and highlight things? |
Add our own codepoint-port-count-lines! which calls port-count-lines! in a parameterization disabling grapheme counting, which becomes the default after Racket 8.6. We still want codepoint counting, to match Emacs. Use it for the reported bug with check-syntax and racket-xp-mode. Also proactively use it elsewhere, in case that avoids similar problems not yet discovered.
@mflatt Thanks!! Using that parameter fixes this problem, from a quick test. I do want to marinate in this a bit before merging. |
@rfindler Racket Mode uses I'm not super confident in my understanding of all the intricacies of Unicode, and characters vs. codepoints vs. graphemes vs. glyphs. I want to keep believing many falsehoods about text as long as possible. 😄 I doubt I can explain your experiment in a way where you'd give me a passing grade, much less an A. 😄 But I think part of what's going on here is that @sorawee's original example In an Emacs buffer it actually results in three navigable positions within the quotation marks. The skull and crossbones per se is sandwiched between two zero-width characters. With point at each one of the three, you can C-u C-x = to get a buffer showing details. These are:
And this representation in Emacs has aligned (AFAIK) with Racket's codepoint-based port counting. |
Thanks, @greghendershott ! I also definitely do not know what's going on, so instead of using @sorawee 's example, I went back to the pirate flag example that I've been using in the framework and drr test suites. Here's what I see: if you put these bytes into a file:
then you should get a file that has four places you can navigate to on the first line (before the "a", before the pirate flag, before the "b" and after the "b"). The file has three grapheme thingies in its first line, but it has more Unicode code points than that (I think the pirate flag is five Unicode code points? I'm not sure). So: here's a program to demonstrate that:
In git-head racket, I get 5 back which makes sense to me. We start at position 1 (that's what In racket v8.6, with the code above, I see 8, and that is about what the codepoint number seems to be. Furthermore in Emacs v27.1, I also see a similar kind of behavior, where I have some empty space surrounding the pirate flag and the flag looks wrong (I'm attaching the two emacs screenshots both opening the same file.) So: could I dare hope that racket-mode could check to see if someone is using racket v8.7 and Emacs 28.1 and then not set the world-killing flag [*] that Matthew just added? And if someone isn't using this combination and reports grapheme problems, we ask them to upgrade? Does that seem remotely plausible? [*] the one that disables grapheme counting for |
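For reference, the sizes involved can be checked directly. This is a minimal sketch, not the program from the comment above: it assumes the flag is the standard pirate flag emoji sequence (U+1F3F4 U+200D U+2620 U+FE0F), and `pirate-line` is a name made up for the illustration.

```racket
#lang racket/base

;; "a", the pirate flag emoji sequence, then "b".
;; Assumed sequence: U+1F3F4 (black flag), U+200D (zero width joiner),
;; U+2620 (skull and crossbones), U+FE0F (variation selector-16).
(define pirate-line
  (list->string
   (map integer->char '(#x61 #x1F3F4 #x200D #x2620 #xFE0F #x62))))

;; Racket strings are sequences of code points, so string-length
;; counts code points, not graphemes or bytes.
(define code-points (string-length pirate-line))
(define utf-8-bytes (bytes-length (string->bytes/utf-8 pirate-line)))

(printf "code points: ~a, UTF-8 bytes: ~a\n" code-points utf-8-bytes)
;; prints "code points: 6, UTF-8 bytes: 15"
```

Under that assumption the flag is four code points rather than five. With a trailing newline the line is seven code points, which would be consistent with the 1-based position of 8 seen with code-point counting, and with 5 when the flag counts as a single grapheme.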
Using this program to create a file with your bytes:
I open it in the oldest Emacs supported by Racket Mode, as well as the newest (Emacs built from source recently).
Keep in mind that in Emacs, there is not necessarily a 1:1 correspondence between characters in the buffer and columns on the screen:
Emacs 25.2.2 from Ubuntu 18.04
Emacs 29.0.50 (Emacs built from source recently) |
Thank you @greghendershott ! I see the same thing as you when I put the insertion point on the "b" and do C-u C-x =. My confusion stemmed from the assumption that the right-arrow key would move forward only one position, but that doesn't seem to be the case when I type it interactively. That is, when I type the right arrow key three times, I move from the "a" to the flag to the "b". My version 28.1 emacs reports that "<right> runs the command right-char" (via C-h c) but, unfortunately, when I do I see what you mean about the display property, too. When I do C-u C-x = on the pirate flag, I get this response:
Overall, I am left wondering if there is a way to ask the buffer to go to the nth grapheme instead of the nth position, or to somehow convert between grapheme counts and position counts. FWIW, that was the state we were in before Matthew's most recent big change -- positions in a text% were still in terms of code points but there were conversion functions you could call ( Perhaps there is a way to coax Emacs to take these display properties into account when using positions? Functions like the It is also kind of a mystery why typing Here's another idea we could try: we could add something to the |
Selfish observation: So far this whole change (i.e. not just using the parameter to disable the change) seems like a solution to a problem I don't have. In fact it feels like it would (if I didn't use the parameter) create new problems I would need to solve just to keep things working the same as before. To avoid breaking, as opposed to improving. General observations: I'm kind of unclear on the motivation wrt I think (waves hands) it is about having compositions like pirate flags be counted as 1 thing not multiple things. To move up to a higher level of abstraction. I do wonder about the case where someone (e.g. me in reality, above) lacks a font with a glyph for the pirate flag. Emacs displays the two components, a don't-have-it box followed by the skull+crossbones glyph. What would DrRacket do? If it needs to show 2 items, what happens to grapheme counting and that nice 1:1 correspondence? Similarly, I get the impression that manual (de)composition is part of how people sometimes handle some languages (e.g. diacritics). So sometimes the composed "single" thing on the screen can be split into 2 or more, at least temporarily. What if someone is building a Idea: Maybe Emacs works this way not just because it's old and hairy and too concerned with backward compatibility (although Emacs definitely can be all those things! 😄). Maybe it really does make sense for there to exist a layer where positions are non-composed, and then also a display layer where they might be composed (when possible)? To be clear I'm not arguing this for sure -- sometimes Emacs is just "bad old" not "good old", and anyway I don't know the character encoding and display representation space well enough to have a valuable opinion. I'm throwing it out just as a question. |
Your observations and questions are all good ones and I don't have answers for them. Maybe someone wiser or more knowledgeable than me will be able to offer some insight. From where I sit, the problem to be solved is that changing the parameter can break programs. That is, I believe that some racket programs will be written on the assumption that One way to avoid that problem is for the |
Ok, it looks like I've mostly just wasted everyone's time here. The goal was to bring racket source-location reporting up-to-date by paying attention to graphemes, particularly in the context of Rhombus where column counting affects parsing. But, in fact, existing editors that handle graphemes don't do that. I didn't pay enough attention to the column information reported in other editors that I tried (VSCode, Xcode, and now Emacs 28.1), where even though a right-arrow key will move past a whole grapheme, the column count increments by the width in Unicode characters. So the change made Racket inconsistent with editors, the opposite of the intent. It looks like the way forward is
|
I just tried sublime text and it also jumps over the grapheme cluster with a single right-arrow keystroke, but increments the column count EDIT: Sublime, Emacs, and XCode don't agree, however. With the bytes in the file as defined by |
@rfindler Thanks for the correction and clarification. So, it turns out that Emacs and Xcode don't actually count by code points, and whatever those do, Racket source locations are just not going to be compatible for a stream that has certain characters. As one extra data point, VSCode behaves like Sublime and counts code points. I’m still inclined to revert the Racket backward-incompatible changes to counting and |
In a file with this:
where, when rendered with a duospace font, the |
Are you sure? What's an example of that? In all the examples so far: The code-points correspond to Emacs "characters" actually existing in the buffer -- independent of what's displayed (don't rely on that or motion commands, use There is a lot of variety and uncertainty related to how things get displayed:
Despite all that variety, if there's an error at Maybe you know some counter-example that I don't, but that's why I'm a fan of the old status quo. In fact, the more I think about it, the more I feel like code-point units are better. They are more precise, avoid varying ideas about "columns" and motion, and avoid the problem that not every user system can display every glyph. |
I should have been more clear that I was talking about "columns" as shown, for example, by I understand that positions correspond to the internal thing in Emacs, and so finding things by position and span will work. That's great, and we're back to the status quo. But if a human (or a tool that parses error messages) is trying to navigate to code based on a printed source location in line—column format, then that's the thing that isn't going to always work. Probably not a big deal. |
results in:
Notice the "no bound occurrences" underline is in the incorrect position.
Note that DrRacket doesn't have this issue (this is essentially a test in the DrRacket repo, so I think there might have been an issue before, and it got fixed)