Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode affects Check Syntax #638

Closed
sorawee opened this issue Aug 22, 2022 · 37 comments
Closed

Unicode affects Check Syntax #638

sorawee opened this issue Aug 22, 2022 · 37 comments
Labels

Comments

@sorawee
Copy link
Contributor

sorawee commented Aug 22, 2022

#lang racket/base

(define aaa 1)

(values "‍☠️")

(define bbb 1)

results in:

Screen Shot 2022-08-21 at 8 37 57 PM

Notice the "no bound occurrences" underline is in the incorrect position.

Note that DrRacket doesn't have this issue (this is essentially a test in the DrRacket repo, so I think there might be an issue before, and it got fixed)

((alist-get 'racket-mode package-alist))
((emacs-version "28.1")
 (system-type darwin)
 (x-gtk-use-system-tooltips UNDEFINED)
 (major-mode racket-mode)
 (racket--el-source-dir "/Users/soraweep/.emacs.d/.local/straight/build-28.1/racket-mode/")
 (racket--rkt-source-dir "/Users/soraweep/.emacs.d/.local/straight/build-28.1/racket-mode/racket/")
 (racket-program "racket")
 (racket-command-timeout 10)
 (racket-path-from-emacs-to-racket-function UNDEFINED)
 (racket-path-from-racket-to-emacs-function UNDEFINED)
 (racket-browse-url-function racket-browse-url-using-temporary-file)
 (racket-documentation-search-location "https://docs.racket-lang.org/search/index.html?q=%s")
 (racket-xp-after-change-refresh-delay 1)
 (racket-xp-mode-lighter
  (:eval
   (racket--xp-mode-lighter)))
 (racket-xp-highlight-unused-regexp "^[^_]")
 (racket-repl-buffer-name-function nil)
 (racket-submodules-to-run
  ((test)
   (main)))
 (racket-memory-limit 2048)
 (racket-error-context medium)
 (racket-repl-history-directory "~/.emacs.d/.local/cache/racket-mode/")
 (racket-history-filter-regexp "\\`\\s *\\'")
 (racket-images-inline t)
 (racket-imagemagick-props nil)
 (racket-images-keep-last 100)
 (racket-images-system-viewer "open")
 (racket-pretty-print t)
 (racket-use-repl-submit-predicate nil)
 (racket-pretty-print t)
 (racket-indent-curly-as-sequence t)
 (racket-indent-sequence-depth 0)
 (racket-pretty-lambda nil)
 (racket-smart-open-bracket-enable nil)
 (racket-module-forms "\\s(\\(?:module[*+]?\\|library\\)")
 (racket-logger-config
  ((cm-accomplice . warning)
   (GC . info)
   (module-prefetch . warning)
   (optimizer . info)
   (racket/contract . error)
   (racket-mode-debugger . info)
   (sequence-specialization . info)
   (* . fatal)))
 (racket-show-functions
  (racket-show-pseudo-tooltip)))
(enabled-minor-modes
 (+popup-mode)
 (auto-composition-mode)
 (auto-compression-mode)
 (auto-encryption-mode)
 (auto-fill-mode)
 (auto-save-mode)
 (better-jumper-local-mode)
 (better-jumper-mode)
 (centaur-tabs-mode)
 (column-number-mode)
 (company-mode)
 (delete-selection-mode)
 (display-fill-column-indicator-mode)
 (display-line-numbers-mode)
 (doom-modeline-mode)
 (dtrt-indent-mode)
 (editorconfig-mode)
 (electric-indent-mode)
 (evil-escape-mode)
 (evil-goggles-mode)
 (evil-local-mode)
 (evil-mode)
 (evil-snipe-local-mode)
 (evil-snipe-mode)
 (evil-snipe-override-local-mode)
 (evil-snipe-override-mode)
 (evil-surround-mode)
 (file-name-shadow-mode)
 (flycheck-mode)
 (flycheck-popup-tip-mode)
 (font-lock-mode)
 (gcmh-mode)
 (general-override-mode)
 (git-gutter-mode)
 (global-company-mode)
 (global-eldoc-mode)
 (global-evil-surround-mode)
 (global-flycheck-mode)
 (global-font-lock-mode)
 (global-git-commit-mode)
 (global-hl-line-mode)
 (global-so-long-mode)
 (global-undo-fu-session-mode)
 (global-visual-line-mode)
 (highlight-indent-guides-mode)
 (highlight-numbers-mode)
 (highlight-quoted-mode)
 (hl-line-mode)
 (hl-todo-mode)
 (hs-minor-mode)
 (ivy-mode)
 (ivy-rich-mode)
 (ivy-rich-project-root-cache-mode)
 (line-number-mode)
 (lispy-mode)
 (lispyville-mode)
 (mouse-wheel-mode)
 (ns-auto-titlebar-mode)
 (persp-mode)
 (projectile-mode)
 (racket-xp-mode)
 (rainbow-delimiters-mode)
 (ranger-override-dired-mode)
 (recentf-mode)
 (save-place-mode)
 (savehist-mode)
 (semantic-minor-modes-format)
 (shell-dirtrack-mode)
 (show-paren-mode)
 (show-smartparens-global-mode)
 (show-smartparens-mode)
 (size-indication-mode)
 (smartparens-global-mode)
 (smartparens-mode)
 (solaire-global-mode)
 (text-scale-mode)
 (transient-mark-mode)
 (undo-fu-mode)
 (undo-fu-session-mode)
 (vi-tilde-fringe-mode)
 (visual-line-mode)
 (which-key-mode)
 (whitespace-mode)
 (windmove-mode)
 (window-divider-mode)
 (winner-mode)
 (ws-butler-global-mode)
 (ws-butler-mode)
 (yas-global-mode)
 (yas-minor-mode))
(disabled-minor-modes
 (+emacs-lisp-ert-mode)
 (+javascript-gulp-mode)
 (+javascript-npm-mode)
 (+lsp-optimization-mode)
 (+popup-buffer-mode)
 (+web-angularjs-mode)
 (+web-django-mode)
 (+web-jekyll-mode)
 (+web-phaser-mode)
 (+web-react-mode)
 (+web-wordpress-mode)
 (abbrev-mode)
 (amx-debug-mode)
 (amx-mode)
 (auto-fill-function)
 (auto-revert-mode)
 (auto-revert-tail-mode)
 (auto-save-visited-mode)
 (avy-linum-mode)
 (blink-cursor-mode)
 (buffer-face-mode)
 (buffer-read-only)
 (button-mode)
 (centaur-tabs-local-mode)
 (cl-old-struct-compat-mode)
 (company-search-mode)
 (compilation-minor-mode)
 (compilation-shell-minor-mode)
 (completion-in-region-mode)
 (context-menu-mode)
 (counsel-mode)
 (counsel-projectile-mode)
 (cursor-intangible-mode)
 (cursor-sensor-mode)
 (dash-fontify-mode)
 (defining-kbd-macro)
 (diff-auto-refine-mode)
 (diff-minor-mode)
 (dired-hide-details-mode)
 (doom-docs-mode)
 (dtrt-indent-global-mode)
 (edebug-mode)
 (eldoc-mode)
 (electric-layout-mode)
 (electric-quote-mode)
 (evil-collection-magit-toggle-text-minor-mode)
 (flyspell-mode)
 (general-override-local-mode)
 (git-commit-mode)
 (global-auto-revert-mode)
 (global-dash-fontify-mode)
 (global-display-fill-column-indicator-mode)
 (global-display-line-numbers-mode)
 (global-git-gutter-mode)
 (global-hl-todo-mode)
 (global-prettify-symbols-mode)
 (global-reveal-mode)
 (global-semantic-highlight-edits-mode)
 (global-semantic-highlight-func-mode)
 (global-semantic-show-parser-state-mode)
 (global-semantic-show-unmatched-syntax-mode)
 (global-semantic-stickyfunc-mode)
 (global-vi-tilde-fringe-mode)
 (global-whitespace-mode)
 (global-whitespace-newline-mode)
 (horizontal-scroll-bar-mode)
 (ibuffer-auto-mode)
 (indent-tabs-mode)
 (isearch-mode)
 (ispell-minor-mode)
 (jit-lock-debug-mode)
 (lispy-goto-mode)
 (lispy-other-mode)
 (lock-file-mode)
 (magit-auto-revert-mode)
 (magit-blame-mode)
 (magit-blame-read-only-mode)
 (magit-blob-mode)
 (magit-gitflow-mode)
 (magit-popup-help-mode)
 (magit-todos-mode)
 (magit-wip-after-apply-mode)
 (magit-wip-after-save-local-mode)
 (magit-wip-after-save-mode)
 (magit-wip-before-change-mode)
 (magit-wip-initial-backup-mode)
 (magit-wip-mode)
 (mail-abbrevs-mode)
 (menu-bar-mode)
 (mml-mode)
 (next-error-follow-minor-mode)
 (org-capture-mode)
 (org-cdlatex-mode)
 (org-list-checkbox-radio-mode)
 (org-src-mode)
 (org-table-follow-field-mode)
 (org-table-header-line-mode)
 (orgtbl-mode)
 (outline-minor-mode)
 (overwrite-mode)
 (paragraph-indent-minor-mode)
 (pcre-mode)
 (prettify-symbols-mode)
 (racket-smart-open-bracket-mode)
 (rectangle-mark-mode)
 (reveal-mode)
 (rxt--read-pcre-mode)
 (rxt-global-mode)
 (rxt-mode)
 (semantic-highlight-edits-mode)
 (semantic-highlight-func-mode)
 (semantic-mode)
 (semantic-show-parser-state-mode)
 (semantic-show-unmatched-syntax-mode)
 (semantic-stickyfunc-mode)
 (server-mode)
 (sh-electric-here-document-mode)
 (shell-command-with-editor-mode)
 (smartparens-global-strict-mode)
 (smartparens-strict-mode)
 (smerge-mode)
 (so-long-minor-mode)
 (solaire-mode)
 (tab-bar-history-mode)
 (tab-bar-mode)
 (temp-buffer-resize-mode)
 (tool-bar-mode)
 (tooltip-mode)
 (transient-resume-mode)
 (unify-8859-on-decoding-mode)
 (unify-8859-on-encoding-mode)
 (url-handler-mode)
 (use-hard-newlines)
 (view-mode)
 (visible-mode)
 (which-function-mode)
 (whitespace-newline-mode)
 (with-editor-mode)
 (xref-etags-mode))
@sorawee sorawee added the bug label Aug 22, 2022
@greghendershott
Copy link
Owner

greghendershott commented Aug 22, 2022

I see the expected alignment:

image

Note that it's shown with a strikeout face, which is the default for racket-xp-unused-face.

In your screenshot, it's a red wavy underline. This makes me think of flycheck-mode. In your supplied details, I see you have flycheck-mode enabled as well as racket-xp-mode.

This makes me wonder if you're reporting behavior from flycheck and/or its raco wrapper. However AFAICT that doesn't report unused definitions; at least I can't make it do so without trying too hard.

Maybe you just customized racket-xp-unused-face to be a red wavy underline, and flycheck is a red herring. But. What happens if you try this after M-x flycheck-mode to disable in the buffer, then make some edit to cause a refresh of racket-xp-mode check-syntax?

@sorawee
Copy link
Contributor Author

sorawee commented Aug 22, 2022

Sorry for the confusion. I have

(set-face-attribute 'racket-xp-unused-face nil
                        :strike-through nil
                        :underline '(:color "orange" :style wave))

in my config file.

@sorawee
Copy link
Contributor Author

sorawee commented Aug 22, 2022

My speculation is that this might be due to the Racket version difference. My Racket is essentially at the HEAD, which has the grapheme clustering change recently added.

EDITED: I confirm that when I set my PATH to Racket 8.5 (and regenerate env in Emacs), the position is correct. (I haven't tried 8.6, but it does already suggest that this is related to Racket version)

Screen Shot 2022-08-22 at 8 48 08 AM

@greghendershott
Copy link
Owner

@rfindler said he would be making changes, but my understanding was that it would be only up in the DrRacket application, not down at the level of the positions reported by drracket-tool-text-lib (as used by Racket Mode's back end).

A quick scan of recent commits in the racket/drracket repo seems to match my understanding; they seem to be solely for the DrRacket app or tests?

But I guess that must not be the case, something did recently change/break somehow.


It looks like I haven't rebuilt Racket from source since commit 1db3e1829d circa late May. I'll try to find time to do that, confirm it doesn't work for me either with that, see if I can help narrow down what changed/broke, and point it out to Robby.

@rfindler
Copy link

Yes, what @greghendershott says is right. (There could be a bug of course :). Also, stuff is currently in flux and @mflatt has another approach to handling how things work with editors that will cause other changes and might break/unbreak something here.

But more generally, it should be the case that positions coming from the racket/drracket to the Emacs mode code should be grapheme based (as the positions in a port are (when port-count-lines! is turned on). So hopefully that works out on the Emacs side too.

@rfindler
Copy link

@greghendershott if you wanted to try things out today, it might make sense to use this and this instead of the drracket and gui packages that you'd get by default. Those are the future, I believe.

@greghendershott
Copy link
Owner

Although maybe there are edge cases of which I've been blissfully ignorant, historically everything already has "just worked" between drracket/check-syntax and Emacs. Assuming port-count-lines! is used, the syntax-position and -span values reflected by check-syntax correspond to what Emacs thinks are character ("grapheme"?) positions. They even both agree that positions are numbered from 1 not 0. 😄

So as we discussed I'm definitely a fan of "don't change that". I'm glad that's still the intent. Of course things might be flux on the main branches and there might be bugs, which seems to be the case here (?).

I'll see what I can figure out. Thanks for the links; although it seems like drracket and racket/gui would be "off the path" and N/A for dracket-tool-text-lib <=> Emacs, I'll take a look anyway to hopefully better understand the motivation for the overall changes.

@rfindler
Copy link

I'll see what I can figure out. Thanks for the links; although it seems like drracket and racket/gui would be "off the path" and N/A for dracket-tool-text-lib <=> Emacs, I'll take a look anyway to hopefully better understand the motivation for the overall changes.

Assuming that Emacs really does do graphemes = positions then I think nothing should have to change for you but it is fantastic that you're offering to help debug. (But that code can certainly break stuff if it is wrong :)

@greghendershott
Copy link
Owner

@sorawee Unsurprisingly I can reproduce this using Racket built from commit 0cd6f5631e plus the versions of drracket etc. that it pulls in.

@greghendershott
Copy link
Owner

Continuing to be stumped by the lack of apparent change in drracket-tool-text-lib, I had a crazy idea that it might be an even lower level change.

TL;DR it seems that syntax-span now returns a different value.

Consider this little program to exercise read-syntax:

#lang racket/base
(require racket/path
         syntax/modread
         racket/match)
(define (string->syntax code-str)
  (define path (build-path (current-directory) "foo.rkt"))
  (define dir (path-only path))
  (parameterize ([current-load-relative-directory dir]
                 [current-directory               dir])
    (with-module-reading-parameterization
      (λ ()
        (define in (open-input-string code-str path))
        (port-count-lines! in)
        (match (read-syntax path in)
          [(? eof-object?) #'""]
          [stx stx])))))
(define stx (string->syntax "\"‍☠️\""))
stx
(syntax-span stx)

When run by older Rackets (I tried 8.0 and 8.4 BC because I happened to have them handy):

#<syntax:foo.rkt:1:0 "\u200D☠️">
5

But run by Racket built from source as of today:

#<syntax:foo.rkt:1:0 "\u200D☠️">
3

In other words they both read the same string (which looks weird to my eyes but I'm a Unicode noob).

However syntax-span for read-syntax of "\"‍☠️\"" used to be 5 in older versions of Racket --- but is now 3.

Although this program doesn't exercise/show it, I'm guessing this also throws of the syntax-position of subsequent pieces of read syntax? (And (waves hands) leads to the behavior seen via check-syntax.)

This seems like a change down in read-syntax, so maybe @mflatt would know?

@rfindler
Copy link

Yes, this is a (n intentional) change at the port-counting layer. Ports used to count in terms of unicode code points, but now it counts in terms of graphemes (assuming you call port-count-lines! -- otherwise it is all bytes all the time). This program:

#lang racket
(define sp (open-input-string "☠️"))
(port-count-lines! sp)
(void (read-line sp))
(port-next-location sp)

produces 1, 2, 3 in older versions of racket and 1, 1, 2 in git head.

This change then gets into the syntax objects, and the spans and whatnot that you're seeing in the example above. And this is how we get grapheme-based counting in the new racket, as opposed to code-point-based counting of the older versions.

If there aren't any fancy emojis (or maybe other things) then there isn't a difference, but if there are, then hopefully life should be an improvement due to this change. And it means that, in Emacs, you'll want to use methods that accept positions in the buffer to be counting based on graphemes, not based on other things. I took a quick look at Emacs and it wasn't obvious to me if this is easy or hard ...

@blerner
Copy link

blerner commented Aug 22, 2022

Hmm, could there be a way (or a parameter, or something) to convert back to codepoints? My linter/autograder scripts produces outputs in terms of syntax object spans, which are then fed to Codemirror to highlight, and Codemirror expects to work in terms of codepoints. I can't prevail on Codemirror to change its habits, and it already has some quirks in how it handles graphemes. But I might have a hope here... :)

(Also -- is this change in HEAD only, or in 8.6 as well?)

@rfindler
Copy link

@blerner maybe we should open a separate issue for that one? I think the change has not yet been in a release.

@blerner
Copy link

blerner commented Aug 22, 2022

Feel free to open it -- I'm not sure which repo would be best for it ;-) In the meantime, I'm relieved to hear it's not in 8.6 yet, so I can punt on dealing with it until next year :)

@mflatt
Copy link

mflatt commented Aug 22, 2022

A parameter could work. I had earlier considered adding an option in port-count-lines!, but propagating that choice through the make-input-port and make-output-port APIs seemed too painful to be worthwhile. In retrospect, a parameter is a good choice (and maybe I didn't think of it just because my reflex is to avoid parameters).

The parameter I have in mind would affect only built-in line-counting and any port implementation that opts to pay attention to the parameters. It would apply when port-count-lines! is called, and the setting would stick to a port at the point where line counting is enabled. Probably it would make sense to have an environment variable that can initialize the parameter on startup, but the default still would be grapheme-based. I think DrRacket and text% generally would not try to adapt to the parameter, though, so it wouldn't make any more sense to change the parameter in DrRacket than to change the text encoding.

@blerner
Copy link

blerner commented Aug 22, 2022

The scenario I have is

 (with-handlers ([exn:fail? (λ (x) #f)])  
    ... (parameterize ([read-accept-reader #t]
                                [read-accept-lang #t])
                   (call-with-input-file source
                     (λ(port)
                       (port-count-lines! port)
                       ...
                       (define stx-first (read-syntax source port))
                       ... work with stx-first ...))))

so as long as the port from call-with-input-file recognizes and accommodates this parameter, I should be fine. Does that match your thinking?

@mflatt
Copy link

mflatt commented Aug 22, 2022

Does that match your thinking?

Yes, the parameter would apply in that case.

@rfindler
Copy link

Probably it would make sense to have an environment variable that can initialize the parameter on startup, but the default still would be grapheme-based. I think DrRacket and text% generally would not try to adapt to the parameter, though, so it wouldn't make any more sense to change the parameter in DrRacket than to change the text encoding.

Would setting the environment variable potentially break programs running in DrRacket (or DrRacket itself)?

@mflatt
Copy link

mflatt commented Aug 23, 2022

Would setting the environment variable potentially break programs running in DrRacket (or DrRacket itself)?

It could cause source locations to not be in sync with editor content, but I think that's no different than a #lang that creates weird source locations.

@greghendershott
Copy link
Owner

About it being intentional: Yes, after posting, I had taken a break, come back, and discovered racket/racket@48fda3e and adjacent commits.

The more I thought about that, the more discouraged I was feeling, wondering how even to translate from graphemes back to codepoints, which I'm pretty sure Emacs needs (and it looks like some other tools need, too).

It would be a huge relief to have a parameter to support the previous, codepoint-based positions. I could try to dynamic-require that, and if it exists because it's a newer Racket, I could set it. All versions of Racket thereby supported smoothly.

@rfindler
Copy link

Maybe there should be a parameter that affects only ports opened as part of require (clearly something in a longer-named function, maybe current-load/use-compiled or something) instead of all ports in some dynamic extent?

@mflatt
Copy link

mflatt commented Aug 23, 2022

Maybe I don't have the right example in mind, but in an environment where non-grapheme counting is needed, it seems you probably want it pervasively.

I've added the parameter as port-count-graphemes-enabled with environment variable PLT_PORT_COUNT_CHARS, and we can see how those work it out. I also added use-char-as-grapheme as part of the text% changes.

@rfindler
Copy link

My thought was a situation where someone wants to work with their program in Emacs but their program also opens and manipulates grapheme-containing files. So they do the Emacs equivalent of the Run button in DrRacket. If there is an error, we want the error message to come out in code-point-based positions, but if there is no error, we want the file that they open and manipulate to report positions in graphemes. (One example of this might be running DrRacket from inside Emacs, although that doesn't seem super common :)

@rfindler
Copy link

I just tried an experiment and its results suggest that Emacs wants grapheme-based counting. This is a very naive experiment, however :). I created a file with these bytes: #"a\342\230\240b\357\270\217\n" (that's an "a" followed by "☠️" followed by "c") and then opened it in Emacs. I then did M-: an evaluated (goto-char 1), (goto-char 2) and (goto-char 3), which sent the insertion point to just before the "a", just before the "☠️", and just before the "c".

Is racket-mode using goto-char and related operations that accept positions like its to move the insertion point around and highlight things?

greghendershott added a commit that referenced this issue Aug 23, 2022
Add our own codepoint-port-count-lines! which calls port-count-lines!
in a parameterization disabling grapheme counting, which becomes the
default after Racket 8.6. We still want codepoint counting, to match
Emacs.

Use it for the reported bug with check-syntax and racket-xp-mode.

Also proactively use it elsewhere, in case that avoids similar
problems not yet discovered.
@greghendershott
Copy link
Owner

@mflatt Thanks!! Using that parameter fixes this problem, from a quick test. I do want to marinade in this a bit before merging.

@greghendershott
Copy link
Owner

@rfindler Racket Mode uses goto-char etc. Also it uses Emacs "text properties", which are arbitrary key/values that can be associated with position spans, which is the mechanism being used here.

I'm not super confident in my understanding of all the intricacies of Unicode, and characters vs. codepoints vs. graphemes vs. glphys. I want to keep believing many falsehoods about text as long as possible. 😄 I doubt I can explain your experiment in a way where you'd give me a passing grade, much less an A. 😄

But I think part of what's going on here is that @sorawee's original example "‍☠️" is a grapheme cluster or combination (not sure the exact terminology)?

In an Emacs buffer it actually results in three navigable positions within the quotation marks. The skull and crossbones per se is sandwiched between two zero-width characters. With point at each one of the three, you can C-u C-x = to get a buffer showing details. These are:

             position: 45 of 66 (67%), column: 9
            character: ‍ (displayed as ‍) (codepoint 8205, #o20015, #x200d)
    preferred charset: unicode (Unicode (ISO10646))
code point in charset: 0x200D
               script: symbol
               syntax: . 	which means: punctuation
             to input: type "C-x 8 RET 200d" or "C-x 8 RET ZERO WIDTH JOINER"
          buffer code: #xE2 #x80 #x8D
            file code: #xE2 #x80 #x8D (encoded by coding system utf-8-unix)
              display: by this font (glyph code)
    xft:-HL  -Khmer OS-normal-normal-normal-*-20-*-*-*-*-0-iso10646-1 (#x2D1)

Character code properties: customize what to show
  name: ZERO WIDTH JOINER
  general-category: Cf (Other, Format)
  decomposition: (8205) ('‍')

There are text properties here:
  face                 font-lock-string-face
  fontified            t
             position: 46 of 66 (68%), column: 9
            character: ☠ (displayed as ☠) (codepoint 9760, #o23040, #x2620)
    preferred charset: unicode (Unicode (ISO10646))
code point in charset: 0x2620
               script: symbol
               syntax: w 	which means: word
             category: .:Base
             to input: type "C-x 8 RET 2620" or "C-x 8 RET SKULL AND CROSSBONES"
          buffer code: #xE2 #x98 #xA0
            file code: #xE2 #x98 #xA0 (encoded by coding system utf-8-unix)
              display: by this font (glyph code)
    xft:-PfEd-DejaVu Sans Mono-normal-normal-normal-*-20-*-*-*-m-0-iso10646-1 (#xA77)

Character code properties: customize what to show
  name: SKULL AND CROSSBONES
  general-category: So (Symbol, Other)
  decomposition: (9760) ('☠')

There are text properties here:
  face                 font-lock-string-face
  fontified            t

             position: 47 of 66 (70%), column: 10
            character: ️ (displayed as ️) (codepoint 65039, #o177017, #xfe0f)
    preferred charset: unicode (Unicode (ISO10646))
code point in charset: 0xFE0F
               syntax: w 	which means: word
             category: ^:Combining
             to input: type "C-x 8 RET fe0f" or "C-x 8 RET VARIATION SELECTOR-16"
          buffer code: #xEF #xB8 #x8F
            file code: #xEF #xB8 #x8F (encoded by coding system utf-8-unix)
              display: by this font (glyph code)
    xft:-SIL -Abyssinica SIL-normal-normal-normal-*-20-*-*-*-*-0-iso10646-1 (#x3AE)

Character code properties: customize what to show
  name: VARIATION SELECTOR-16
  general-category: Mn (Mark, Nonspacing)
  decomposition: (65039) ('️')

There are text properties here:
  face                 font-lock-string-face
  fontified            t

And this representation in Emacs has aligned (AFAIK) with Racket's codepoint-based port counting.

greghendershott added a commit that referenced this issue Aug 23, 2022
Add our own codepoint-port-count-lines! which calls port-count-lines!
in a parameterization disabling grapheme counting, which becomes the
default after Racket 8.6. We still want codepoint counting, to match
Emacs.

Use it for the reported bug with check-syntax and racket-xp-mode.

Also proactively use it elsewhere, in case that avoids similar
problems not yet discovered.
@rfindler
Copy link

Thanks, @greghendershott ! I also definitely do not know what's going on, so instead of using @sorawee 's example, I went back to the pirate flag example that I've been using in the framework and drr test suites. Here's what I see: if you put these bytes into a file:

#"a\360\237\217\264\342\200\215\342\230\240\357\270\217b\n"

then you should get a file that has four places you can navigate to on the first line (before the "a", before the pirate flag, before the "b" and after the "b"). The file has three grapheme thingies in it's first line, but it has more unicode code points than that (I think the pirate flag is five unicode code points? I'm not sure).

So: here's a program to demonstrate that:

#lang racket
(define bts #"a\360\237\217\264\342\200\215\342\230\240\357\270\217b\n")
(define p (open-input-bytes bts))
(define-values (_1 _2 start-pos) (port-next-location p))
(port-count-lines! p)
(read-line p)
(define-values (_3 _4 end-pos) (port-next-location p))
end-pos

In git-head racket, I get 5 back which makes sense to me. We start at position 1 (that's what start-pos is) then get the "a", the flag, the "b", and then the newline, for a total of 4 more spots so end-pos comes out to be 5. This also matches what I see in emacs v28.1 (which is one that this blog post reports as the one where grapheme support got added).

In racket v8.6, with the code above, I see 8, and that is about what the codepoint number seems to be. Furthermore in Emacs v27.1, I also see a similar kind of behavior, where I have some empty space surrounding the pirate flag and the flag looks wrong (I'm attaching the two emacs screenshots both opening the same file.)

So: could I dare hope that racket-mode could check to see if someone is using racket v8.7 and Emacs 28.1 and then not set the world-killing flag [*] that Matthew just added? And if someone isn't using this combination and reports grapheme problems, we ask them to upgrade? Does that seem remotely plausible?

e27

e28

[*] the one that disables grapheme counting for port-count-lines! that I wish we wouldn't have to support in the future .... but I'm probably just being lazy with this thought :)

@greghendershott
Copy link
Owner

Using this program to create a file with your bytes:

#lang racket/base

(define rfindler-bytes #"a\360\237\217\264\342\200\215\342\230\240\357\270\217b\n")
(bytes->string/utf-8 rfindler-bytes)
(call-with-output-file* "issue-638-3.rkt" #:mode 'binary #:exists 'replace
  (λ (out) (display rfindler-bytes out)))

I open it in the oldest Emacs supported by Racket Mode, as well as the newest (Emacs built from source recently).

  • In both cases, I don't get the pirate flag because I lack a font supplying that on my Ubuntu system.

  • In both cases, I get the same buffer contents.

  • Note that the letter "b" is at position 6 of 7 in both cases -- that is, Emacs didn't change the position units from codepoints to graphemes.


Keep in mind that in Emacs, there is not necessarily a 1:1 correspondence between characters in the buffer and columns on the screen:

  • There is a 'display text property that can cause any span of buffer text to be displayed as any other text of any size. This doesn't change the positions of the characters in the buffer.

  • Furthermore, IIUC some composition may automatically be done for speed down at the display engine level. (But again that doesn't impact the buffer position units: In my previous example above notice that positions 45, 46, 47 are all reported as "column 9".)

Emacs 25.2.2 from Ubuntu 18.04

image

Emacs 29.0.50 (Emacs built from source recently)

image

@rfindler
Copy link

rfindler commented Aug 24, 2022

Thank you @greghendershott ! I see the same thing as you when I put the insertion point on the "b" and do C-u C-x =.

My confusion stemmed from the assumption that the right-arrow key would move forward only one position, but that doesn't seem to be the case when I type it interactively. That is, when I type the right arrow key three times, I move from the "a" to the flag to the "b". My version 28.1 emacs reports that "<right> runs the command right-char" (via C-h c) but, unfortunately, when I do M-: (right-char) three times from the start of the buffer I end up in the middle of the flag, as you and the position=codepoint understanding would have predicted. Similarly, goto-char sends me to the "b" only when I pass 6 to it and not any smaller numbers.

I see what you mean about the display property, too. When I do C-u C-x = on the pirate flag, I get this response:

             position: 2 of 7 (14%), column: 1
            character: 🏴 (displayed as 🏴) (codepoint 127988, #o371764, #x1f3f4)
              charset: unicode (Unicode (ISO10646))
code point in charset: 0x1F3F4
               script: emoji
               syntax: w 	which means: word
             category: .:Base
             to input: type "C-x 8 RET 1f3f4" or "C-x 8 RET WAVING BLACK FLAG"
          buffer code: #xF0 #x9F #x8F #xB4
            file code: #xF0 #x9F #x8F #xB4 (encoded by coding system utf-8-unix)
              display: composed to form "🏴‍☠️" (see below)

Composed with the following character(s) "‍☠️" using this font:
  mac-ct:-*-Apple Color Emoji-normal-normal-normal-*-12-*-*-*-p-0-iso10646-1
by these glyphs:
  [0 3 0 912 16 -1 16 12 3 nil]
with these character(s):
  ‍ (#x200d) ZERO WIDTH JOINER
  ☠ (#x2620) SKULL AND CROSSBONES
  ️ (#xfe0f) VARIATION SELECTOR-16

Character code properties: customize what to show
  name: WAVING BLACK FLAG
  general-category: So (Symbol, Other)
  decomposition: (127988) ('🏴')

There are text properties here:
  fontified            t

Overall, I am left wondering if there is a way to ask the buffer to go to the nth grapheme instead of the nth position, or to somehow convert between grapheme counts and position counts. FWIW, that was the state we were in before Matthew's most recent big change -- positions in a text% were still in terms of code points but there were conversion functions you could call (position-grapheme and grapheme-position which are still there but are useful in different situations now).

Perhaps there is a way to coax Emacs to take these display properties into account when using positions? Functions like the text% methods position-grapheme and grapheme-position but using Emacs's 'display property? I tried using object-intervals but the display property doesn't show up there. I was trying to use these docs, which are pointed to from this explanation of the 'display property doing things like this: (get-text-property 3 'display (find-buffer-visiting "issue-638-3.rkt")) also with variations that have numbers other than 3 in there and having no luck.

It is also kind of a mystery why typing <right> works differently than calling the function bound to <right>.

Here's another idea we could try: we could add something to the drracket/syncheck library so that you could ask for the positions that come back from it to be in code-point numbers instead of in grapheme numbers. It could call the text% position-grapheme functions on the numbers before returning them. This would introduce a dependency on racket/gui/base if done in a simple way, but I guess there is a more complex way where we take the code that's been added to text% to build the position<->grapheme mapping and reuse it. What do you think of that approach?

@greghendershott
Copy link
Owner

greghendershott commented Aug 24, 2022

Selfish observation: So far this whole change (i.e. not just using the parameter to disable the change) seems like a solution to a problem I don't have.

In fact it feels like it would (if I didn't use the parameter) create new problems I would need to solve just to keep things working the same as before. To avoid breaking, as opposed to improving.


General observations:

I'm kind of unclear on the motivation wrt to text% and DrRacket, i.e. what problems this change is trying to solve.

I think (waves hands) it is about having compositions like pirate flags be counted as 1 thing not multiple things. To move up to a higher level of abstraction.

I do wonder about the case where someone (e.g. me in reality, above) lacks a font with a glyph for the pirate flag. Emacs displays the two components, a don't-have-it box followed by the skull+crossbones glpyh. What would DrRacket do? If it needs to show 2 items, what happens to grapheme counting and that nice 1:1 correspondence?

Similarly, I get the impression that manual (de)composition is part of how people sometimes handle some languages (e.g. diacritics). So sometimes the composed "single" thing on the screen can be split into 2 or more, at least temporarily. What if someone is building a text% app and for some reason needs port counting to correspond to that? (Maybe this is unlikely beyond DrRacket. Maybe it's N/A because at any given moment that something like a check-syntax would run, the composed-or-not state would be consistent.)


Idea: Maybe Emacs works this way not just because it's old and hairy and too concerned with backward compatibility (although Emacs definitely can be all those things! 😄). Maybe it really does make sense for there to exist a layer where positions are non-composed, and then also a display layer where they might be composed (when possible)? To be clear I'm not arguing this for sure -- sometimes Emacs is just "bad old" not "good old", and anyway I don't know the character encoding and display representation space well enough to have a valuable opinion. I'm throwing it out just as a question.

@rfindler
Copy link

Your observations and questions are all good ones and I don't have answers for them. Maybe someone wiser or more knowledgeable than me will be able to offer some insight.

From where I sit, the problem to be solved is that changing the parameter can break programs. That is, I believe that some racket programs will be written on the assumption that port-count-lines! does one specific thing (whether this is implicit by looking at the spans of its own syntax objects (in a test suite say) or explicit by working with ports directly). When that program is run in racket-mode it will do a different thing than when the program is run in drracket (or command-line racket). That is the situation I'm trying to avoid -- although I do agree that there will not be many programs like that and perhaps I am wasting my time! Especially because this situation comes up only for folks using grapheme clusters.

One way to avoid that problem is for the drracket/syncheck library to offer a mode where it reports positions in a way that matches what racket-mode wants. I do realize (now), tho, that this introduces more pain for you as you'd need to call into the drracket/syncheck library in (yet another) version-specific way. Are you using show-content? Maybe I could make it default to code-point based positions and have an option to turn them into grapheme-based ones? Then you would need to change nothing (I think)?

@mflatt
Copy link

mflatt commented Aug 24, 2022

Ok, it looks like I've mostly just wasted everyone's time here.

The goal was to bring racket source-location reporting up-to-date by paying attention to graphemes, particularly in the context of Rhombus where column counting affects parsing. But, in fact, existing editors that handle graphemes don't do that. I didn't pay enough attention to the column information reported in other editors that I tried (VSCode, Xcode, and now Emacs 28.1), where even though a right-arrow key will move past a whole grapheme, the column count increments by the width in Unicode characters. So the change made Racket inconsistent with editors, the opposite of the intent.

It looks like the way forward is

  • revert location counting in Racket to be character-based and not grapheme-based (except that CRLF is still a special case)
  • get rid of the new parameter that controls counting
  • change text% back to always working in character-position mode
  • keep the new grapheme string and drawing operations
  • keep editor support that aligns snips and graphemes, including position-grapheme and grapheme-position (but those are not needed for things like syntax coloring or error highlighting)

@rfindler
Copy link

rfindler commented Aug 25, 2022

I just tried sublime text and it also jumps over the grapheme cluster with a single right-arrow keystroke, but increments the column count in a way that matches codepoints by more than one. 😞

EDIT: Sublime, Emacs, and XCode don't agree, however. With the bytes in the file as defined by rfindler-bytes above, as I move from the start of the file to the right, I see the column counts: 0, 1, 4, 5 in Emacs and 1, 2, 6, 7 in Sublime Text and 1, 2, 7, 8 in XCode. Three different answers?!? Maybe your way is in their future, @mflatt .

@mflatt
Copy link

mflatt commented Aug 25, 2022

@rfindler Thanks for the correction and clarification. So, it turns out that Emacs and Xcode don't actually count by code points, and whatever those do, Racket source locations are just not going to be compatible for a stream that has certain characters. As one extra data point, VSCode behaves like Sublime and counts code points.

I’m still inclined to revert the Racket backward-incompatible changes to counting and text%, which overall will make DrRacket count the way Sublime and VSCode do when "Count columns numbers from one" is enabled.

@AlexKnauth
Copy link

In a file with this:

a🏴‍☠️b
1234

where, when rendered with a duospace font, the b at the end of the first line aligns with the 4 at the end of the second line, I would expect to see the column counts 0, 1, 3, 4. More than the number of Grapheme clusters, but less than the number of Unicode scalars.

@greghendershott
Copy link
Owner

greghendershott commented Aug 25, 2022

So, it turns out that Emacs and Xcode don't actually count by code points, and whatever those do, Racket source locations are just not going to be compatible for a stream that has certain characters.

Are you sure? What's an example of that?

In all the examples so far: The code-points correspond to Emacs "characters" actually existing in the buffer -- independent of what's displayed (don't rely on that or motion commands, use C-u C-x = to inspect the real position values). The Emacs position values correspond exactly to the old Racket syntax-position values.

There is a lot of variety and uncertainty related to how things get displayed:

  • The code-points might compose into different displayable graphemes that would look like just 1 character in 1 column.

  • I lack the pirate flag glyph in any installed font, so I get two graphemes in two columns.

  • forward-char never, ever skips more than one character for me, although you report it does for you.

Despite all that variety, if there's an error at syntax-position 42 with a syntax-span of 5, and I tell Emacs to highlight that -- it highlights the correct stuff on the screen. Emacs knows how the underlying positions map onto the displayed presentation.

Maybe you know some counter-example that I don't, but that's why I'm a fan of the old status quo.

In fact, the more I think about it, the more I feel like code-point units are better. They are more precise, avoid varying ideas about "columns" and motion, and avoid the problem that not every user system can display every glyph.

@mflatt
Copy link

mflatt commented Aug 25, 2022

So, it turns out that Emacs and Xcode don't actually count by code points

I should have been more clear that I was talking about "columns" as shown, for example, by column-number-mode and with the values that Robby reported.

I understand that positions correspond to the internal thing in Emacs, and so finding things by position and span will work. That's great, and we're back to the status quo.

But if a human (or a tool that parses error messages) is trying to navigate to code based on a printed source location in line—column format, then that's the thing that isn't going to always work. Probably not a big deal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

6 participants