Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Difficulties trying to fine tune skipping of LaTeX macros with regexps #40

Closed
gusbrs opened this issue Apr 14, 2023 · 15 comments
Closed

Comments

@gusbrs
Copy link

gusbrs commented Apr 14, 2023

I'm trying to fine tune the skipping of LaTeX macros for jinx with regexps and have been facing some difficulties. As far as I can tell, they boil down to point placement when the regexp predicate is called. But they actually stem from two different sources. The first is the order of the predicates in jinx--predicates. Since jinx--face-ignored-p comes before jinx--regexp-ignored-p, when we try to ignore a macro with a regexp it's too late, because the macro itself has already been skipped because font-latex-sedate-face is part of jinx-exclude-faces for tex-mode (I'm using AUCTeX's latex-mode, so I'm assuming the faces it uses, not sure about tex-mode.el). Hence, when the regexp predicate kicks in, we are actually looking at the argument, and can no longer retrieve the macro without looking back (which jinx--regexp-ignored-p does not do). Second, even if we reorder the predicates in jinx--predicates to make the regexp one comes before the face one, the point at which the regexp predicate is called is after the backslash, so it is hard to distinguish between a macro and a regular word. Or, at least without looking backwards for the backslash which, again, jinx--regexp-ignored-p does not do.

Consider the following document:

\documentclass{article}

\newcommand*{\mymacro}[1]{Hi!}
\newcommand*{\mymacronoargs}{Hey!}

\begin{document}

\mymacro{fooarg} \mymacro{bararg} \mymacro{bazarg}

\mymacronoargs{} \mymacronoargs\ \mymacronoargs, \mymacronoargs.

These are not macros: mymacro mymacronoargs, they are misspellings.

\end{document}
% Local Variables:
% jinx-languages: "en_US"
% End:

The default results are the following:

01  defaults

We'd like to skip the arguments of \mymacro by adding them to a tex-mode entry of jinx-exclude-regexps. I've tried things such as "mymacro{[^}]*}" and "\\\\mymacro{[^}]*}", without success.

So I redefined jinx--regexp-ignored-p to visualize where the regexps were being called:

(defun jinx--regexp-ignored-p (start)
  "Return non-nil if word at START matches ignore regexps."
  (save-excursion
    (goto-char start)
    (let ((hl (make-overlay start (1+ start))))
      (overlay-put hl 'face 'highlight)
      (overlay-put hl 'category 'gb/test))
    (when (and jinx--exclude-regexp (looking-at-p jinx--exclude-regexp))
      (save-match-data
        (looking-at jinx--exclude-regexp)
        (match-end 0)))))

And the results were:

02  default regexp ignores

So, indeed, the regexp predicate comes too late for a LaTeX macro, and we can no longer get the whole structure without looking backwards.

Let's try things by reordering the predicates:

(setq jinx--predicates
      (list #'jinx--regexp-ignored-p
            #'jinx--face-ignored-p
            #'jinx--word-valid-p))

We now get:

03  regexp ignores reorderded predicates

Now we can see the macros when the regexp predicate is called, and we can add something like the following to our setup:

(add-to-list 'jinx-exclude-regexps
             '(tex-mode
               "mymacro{[^}]*}"
               "mymacronoargs"
               ))

With results:

04  with exclude regexps

So that works for the arguments of \mymacro, but it does unduly ignore the word "mymacronoargs". Of course, in this case, if we excluded the "mymacronoargs" from jinx-exclude-regexps things would work because the macro would be ignored later by the face predicate. But the point is more general, given the point at which it is called, jinx--regexp-ignored-p cannot distinguish between a regular word and a macro. The argument specification helps a little, but relying just on that is bound to be fragile. The predicate would have to, at least, (looking-back "\\\\" (1- (point))) to distinguish the cases.

Now, with that diagnosis in hand, some thoughts.

The first problem is much trickier than the second. And, in part, it is is related to #25 being a case of novel difficulty arising from ignoring words based on faces. Of course, it is obvious why you've chosen the cheapest predicate to run first, but as shown, it may complicate the job of later predicates in recognizing syntactical structures of some languages.

You can have a good example of what would mean to recognize if we are in a macro argument by looking backwards with the gb/tex-flyspell-bibkey-p function I shared at #9 (comment). It was so done because citation commands have optional arguments which should not be ignored coming before mandatory arguments which should be ignored, and can have multiple sets of them. And the regexp based predicates of flyspell and ispell, as far as I know, ignore the whole match, so it was not possible to not ignore only the optional arguments. And, as far as I know you, I doubt you'd ever want anything similar to gb/tex-flyspell-bibkey-p near jinx.

One alternative would be to invert the order of the predicates. In which case you may want to consider having a mode specific jinx--predicates. Another would be to exclude font-latex-sedate-face from jinx-exclude-faces for tex-mode (but how good would this be for the multiple derived modes?). In which case probably a dedicated regexp predicate for tex-mode would be in order (it could look backwards for the backslash, eventually check font-latex-sedate-face if there's no regexp match, etc.).

But, I know, both of these ideas seem like slippery slopes. The issue is tricky indeed. And, even if I'm not acquainted with other markup languages (except for Org and markdown, which are much simpler), I'd be willing to bet this kind of problem is not specific to LaTeX.

@minad
Copy link
Owner

minad commented Apr 14, 2023

As I understand, the solution is to just create a custom predicate which runs before all the other predicates only for tex-mode. The existing regexp and face predicates seem too weak for what you are trying to achieve.

There are other ideas we could consider, which may help solving general issues. For this we need more testing with different document formats.

  1. Adding an additional jinx--early-regexp-ignored-p called before the face predicate.
  2. Extend jinx--regexp-ignore-p to also use looking-back (if e.g., a pair of regexps is specified in the exclusion list). But looking-back is better avoided since it is slow. Otoh the looking-back would only happen rarely. Alternatively we could just look-back for a literal string, but that would be much more limited of course.

How do other systems solve this? I am not familiar with the Flyspell, Ispell and Spell-fu exclusion logic. Maybe we can find some good ideas there?

But, I know, both of these ideas seem like slippery slopes. The issue is tricky indeed. And, even if I'm not acquainted with other markup languages (except for Org and markdown, which are much simpler), I'd be willing to bet this kind of problem is not specific to LaTeX.

The problem is that we need more testing of other document formats to understand on how we could improve the general situation. It is not sufficient to look only at LaTeX. LaTeX is quite problematic since you can arbitrarily extend the language, think TikZ etc.

@gusbrs
Copy link
Author

gusbrs commented Apr 14, 2023

As I understand, the solution is to just create a custom predicate which runs before all the other predicates only for tex-mode. The existing regexp and face predicates seem too weak for what you are trying to achieve.

My first impression is that a dedicated predicate for LaTeX (or LaTeX macros) would be a good idea. But...

The problem is that we need more testing of other document formats to understand on how we could improve the general situation. It is not sufficient to look only at LaTeX. LaTeX is quite problematic since you can arbitrarily extend the language, think TikZ etc.

... you are right there. I'm not sure if just trying to solve this "for each major mode" is the best long term strategy. Alas, I'm reporting the LaTeX case, because that's the (tricky) one I use heavily and am more acquainted with. Still, my guess is that LaTeX will probably be close to the "worst case scenario" for spell checking.

There are other ideas we could consider, which may help solving general issues. For this we need more testing with different document formats.

  1. Adding an additional jinx--early-regexp-ignored-p called before the face predicate.

  2. Extend jinx--regexp-ignore-p to also use looking-back (if e.g., a pair of regexps is specified in the exclusion list). But looking-back is better avoided since it is slow. Otoh the looking-back would only happen rarely. Alternatively we could just look-back for a literal string, but that would be much more limited of course.

Perhaps "3."?: include \ to wordchars for LaTeX. That seems "smart" at first sight, but I'd be surprised if we wouldn't find out pretty fast why we shouldn't have. ;-)

Regarding looking-back, why do you say it would only happen rarely? I don't see why. And, indeed it is a problem that it is slow.

How do other systems solve this? I am not familiar with the Flyspell, Ispell and Spell-fu exclusion logic. Maybe we can find some good ideas there?

I'm more familiar with flyspell, but even there I have only really dwelt in the user facing side of things, and never dug that deep into the machinery. My setup at #9 shows some common(?) techniques. I'd guess flyspell-tex-command-regexp is likely to be the most frequent use case for many people. I also use there flyspell-generic-check-word-predicate. I did have some similar (less developed) old setup for ispell and I can try to find it here if you want.

Anyway, agreed, they are likely good resources of ideas for this problem.

@minad
Copy link
Owner

minad commented Apr 14, 2023

Perhaps "3."?: include \ to wordchars for LaTeX. That seems "smart" at first sight, but I'd be surprised if we wouldn't find out pretty fast why we shouldn't have. ;-)

We should definitely not do that.

Regarding looking-back, why do you say it would only happen rarely? I don't see why. And, indeed it is a problem that it is slow.

Since it would only trigger after the looking-at-p check.

@gusbrs
Copy link
Author

gusbrs commented Apr 14, 2023

We should definitely not do that.

😄

Since it would only trigger after the looking-at-p check.

Good point, for some reason I hadn't thought of that.

Indeed, thinking further, in a dedicated predicate for ignoring LaTeX macros, we could actually use the face to tell if we are at a macro or not, instead of looking backwards. And, instead of skipping them right away, then we check the regexps to handle the arguments.

And, let me ask you a technical question on a related issue. I had some ideas last night on the task of ignoring LaTeX macros (which I'm yet to explore and, if they work, I'll share), but whether they stand a chance depends on how jit-lock flags regions as pending. Do you know what exactly jit-lock marks as pending when a given point is edited in a buffer? Current line? More? Less? Depends?

@minad
Copy link
Owner

minad commented Apr 14, 2023

Do you know what exactly jit-lock marks as pending when a given point is edited in a buffer? Current line? More? Less? Depends?

It depends.

@gusbrs
Copy link
Author

gusbrs commented Apr 14, 2023

It depends.

Thanks. I'll play with it and see what it does, and if it works well for what I have in mind.

@minad
Copy link
Owner

minad commented Apr 14, 2023

Btw. for TeX one basic problem is to filter out environments, e.g., \begin{equation}. This is currently not handled by Jinx. AucTeX provides texmathp but I suspect it will be too slow, since there is no caching.

@gusbrs
Copy link
Author

gusbrs commented Apr 14, 2023

Yes indeed, I haven't even looked into environments, but something is bound to be needed there.

Taking a look at this from ispell's and flyspell's experience. ispell has a whole tex-ispell.el. flyspell offers variables: flyspell-tex-command-regexp (used by flyspell-tex-command-p), flyspell-check-tex-math-command (uses texmathp under the hood), flyspell-prog-text-faces (it uses faces too! I didn't know that). Also tex-mode-flyspell-verify (as flyspell-mode-predicate), flyspell-word checks inside if (eq ispell-parser 'tex) for some tasks. There's probably more. Hunspell also has a latex parser (https://github.com/hunspell/hunspell/blob/master/src/parsers/latexparser.cxx). I take from all this that they felt the need to treat (La)TeX specially.

@minad
Copy link
Owner

minad commented Apr 14, 2023

I take from all this that they felt the need to treat (La)TeX specially.

Every mode needs special treatment and a whole parsing infrastructure. Only then can we rely on the faces for spell-checking. Jinx owes its simplicity to font locking. For example org-block can only be trusted, since the Org font locking (and parsing) is sophisticated enough. For programming modes the whole infrastructure is in the process of being switched over to tree sitter, which in turn will yield reliable font locking by doing real parsing.

The problem with TeX is that the infrastructure is incoherent, with some of the code built into Emacs and the rest in AucTeX. Ispell does its own thing, Flyspell does its own thing, and so on.

@gusbrs
Copy link
Author

gusbrs commented Apr 16, 2023

@minad I was playing with that little idea I had had, and I think it came out interesting, at least in principle. Actually, the basic concept is the one you used for inhibiting the overlay at the current word, I'm just applying it for the purpose of selective regexp ignore rules. To do so I used as syntax that a numbered group indicates a submatch for which spell checking should be retried. And I'm passing the match data to jinx--check-region (that's why I had to tamper with it a little). So, once there is a match found, the whole match is skipped, but the numbered groups are marked to be rechecked. This way there's no backwards looking (except for the backslash there, but limited to a forward match existing and limited to one char). I don't think either flyspell or ispell are able to operate this way.

Also, if the basic concept is sound, things could get fancier. Like, skipping the arguments with sexps, and so on. It'd be just a matter of passing the same information around to jinx--check-region.

Here it goes:

(with-eval-after-load 'jinx

  (add-hook 'LaTeX-mode-hook 'gb/jinx-latex-setup)
  (defun gb/jinx-latex-setup ()
    ;; Just this predicate for testing purposes only, to isolate the effects.
    (setq-local jinx--predicates
                (list #'gb/jinx-latex-macro-ignored-p
                      ;; #'jinx--face-ignored-p
                      ;; #'jinx--regexp-ignored-p
                      #'jinx--word-valid-p))
    (setq gb/jinx--latex-macros-regexp
          (when gb/jinx-latex-macro-regexps
            (mapconcat (lambda (r) (format "\\(?:%s\\)" r))
                       gb/jinx-latex-macro-regexps "\\|"))))

  (defun gb/jinx-latex-macro-ignored-p (start)
    (save-excursion
      (goto-char start)
      (when (and gb/jinx--latex-macros-regexp
                 (looking-at-p gb/jinx--latex-macros-regexp)
                 (looking-back "\\\\" (1- (point))))
        (save-match-data
          (looking-at gb/jinx--latex-macros-regexp)
          (let* ((matches (remove nil (match-data)))
                 (submatchesp (length> matches 2)))
            (if submatchesp
                (append (list 'retry (match-end 0)) (cddr matches))
              (match-end 0)))))))

  (defvar-local gb/jinx--latex-macros-regexp nil)
  (defvar gb/jinx-latex-macro-regexps
    '("mymacro{[^}]*}"
      "mymacroii{[^}]*}{\\([^}]*\\)}"
      "mymacroiii{[^}]*}{\\([^}]*\\)}{\\([^}]*\\)}"))

  (defun jinx--check-region (start end &optional retry)
    "Check region between START and END.
Optionally RETRY word at given position.  Return updated END
position."
    (let ((st (syntax-table)) case-fold-search
          retry-list)
      (unwind-protect
          (with-silent-modifications
            (save-excursion
              (save-match-data
                ;; Use dictionary-dependent syntax table
                (set-syntax-table jinx--syntax-table)
                ;; Ensure that region starts and ends at word boundaries
                (goto-char start)
                (unless (looking-at-p "\\<")
                  (re-search-backward "\\<\\|^")
                  (setq start (match-beginning 0)))
                (goto-char end)
                (unless (looking-at-p "\\>")
                  (re-search-forward "\\>\\|$")
                  (setq end (match-beginning 0)))
                (jinx--delete-overlays start end)
                (goto-char start)
                (while (re-search-forward "\\<\\w+\\>" end t)
                  (let ((word-start (match-beginning 0))
                        (word-end (match-end 0)))
                    ;; No quote or apostrophe at start or end
                    (while (and (< word-start word-end)
                                (let ((c (char-after word-start)))
                                  (or (= c 39) (= c 8217))))
                      (cl-incf word-start))
                    (while (and (< word-start word-end)
                                (let ((c (char-before word-end)))
                                  (or (= c 39) (= c 8217))))
                      (cl-decf word-end))
                    (while (< word-start word-end)
                      (let ((subword-end word-end))
                        (when jinx--camel
                          (goto-char word-start)
                          (when (looking-at "\\([[:upper:]]?[[:lower:]]+\\)\\(?:[[:upper:]][[:lower:]]+\\)+\\>")
                            (setq subword-end (match-end 1))))
                        (goto-char subword-end)
                        (pcase (run-hook-with-args-until-success 'jinx--predicates word-start)
                          ((and (pred integerp) skip)
                           (goto-char (max subword-end (min end skip))))
                          ((and (pred (lambda (p) (and (listp p) (equal (car p) 'retry))))
                                retlst)
                           (goto-char (max subword-end (min end (cadr retlst))))
                           (setq retry-list (append retry-list (cddr retlst))))
                          ('nil
                           (if (and retry (<= word-start retry subword-end))
                               (setq retry-list (append retry-list (list word-start subword-end))
                                     retry nil)
                             (overlay-put (make-overlay word-start subword-end) 'category 'jinx))))
                        (setq word-start subword-end)))))
                (remove-list-of-text-properties start end '(jinx--pending))
                (while retry-list
                  (put-text-property (pop retry-list) (pop retry-list) 'jinx--pending t)))))
        (set-syntax-table st)))
    end))

And this is what it looks like:

Screenshot from 2023-04-16 16-52-48

@minad
Copy link
Owner

minad commented Apr 16, 2023

Actually, the basic concept is the one you used for inhibiting the overlay at the current word, I'm just applying it for the purpose of selective regexp ignore rules.

This is not great for two reasons:

  1. The approach used to skip the current word is a hack. I considered also to filter out the current word early on in jinx--check-pending, such that only the region before and after the word is checked. It is slightly more complicated to do this, but it may have been better.
  2. The idea to retry makes only sense if you want to delay rechecking, not as a mechanism to implement parsing/filtering.

@gusbrs
Copy link
Author

gusbrs commented Apr 16, 2023

This is not great for two reasons:

  1. The approach used to skip the current word is a hack. I considered also to filter out the current word early on in jinx--check-pending, such that only the region before and after the word is checked. It is slightly more complicated to do this, but it may have been better.

  2. The idea to retry makes only sense if you want to delay rechecking, not as a mechanism to implement parsing/filtering.

Well, that was the idea I had had. Back to square 1, it seems.

@minad
Copy link
Owner

minad commented Apr 16, 2023

Yes, I am just being honest. :)

@gusbrs
Copy link
Author

gusbrs commented Apr 16, 2023

Yes, I am just being honest. :)

That is fine, of course. And it wouldn't be you otherwise. ;-)

@minad
Copy link
Owner

minad commented Apr 17, 2023

Closing, see #25 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants