Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Infinite loop when saving org file #177

Open
pp5x opened this issue Mar 24, 2023 · 70 comments
Open

Infinite loop when saving org file #177

pp5x opened this issue Mar 24, 2023 · 70 comments
Labels
help wanted Extra attention is needed

Comments

@pp5x
Copy link

pp5x commented Mar 24, 2023

What?

It seems org-transclusion has a bug when saving file and rendering. When the file is saved for the first time on a freshly launched emacs : no issue.

Then when I try to save the file a second (without modification) : emacs goes in an infinite loop and becomes unresponsive.

I managed to make it stop by using pkill --signal USR2 emacs and got a backtrace of what it was doing (see screenshot). The backtrace seems to show an interaction with org-element / org-transclusion was writing an infinite amount of time #+transclude: (see the number of lines. My file is originally 200 lines long). Looks like it's related to the way org-transclusion is saving files (that was mentioned in #109)

I suspect the bug is a bad interaction with org-element--cache-active-p which grows very very quickly. I noticed that running org-element-cache-reset can help when the file is opened again. Emacs also becomes unresponsive when i just move the cursor around the #+transclude: lines... 💥

Screenshot from 2023-03-24 01-12-26

Doom Emacs

I am running doom-emacs:

generated  mars 24, 2023 01:26:09
system     NixOS 22.11.3299.9ef6e7727f4 (Raccoon) Linux 5.15.103 x86_64
emacs      28.2 ~/.dotfiles/emacs/.emacs.d/
doom       3.0.0-pre PROFILE=_@0 HEAD, master 4e105a95a 2023-03-22 18:29:38 -0400 ~/.doom.d/
shell      /run/current-system/sw/bin/bash
features   CAIRO DBUS FREETYPE GIF GLIB GMP GNUTLS GSETTINGS HARFBUZZ JPEG JSON LIBOTF
	  LIBSELINUX LIBSYSTEMD LIBXML2 M17N_FLT MODULES NATIVE_COMP NOTIFY INOTIFY PDUMPER
	  PNG RSVG SECCOMP SOUND THREADS TIFF TOOLKIT_SCROLL_BARS X11 XAW3D XDBE XIM XPM
	  LUCID ZLIB
traits     batch server-running custom-file
modules    :config use-package :completion company vertico :ui doom doom-dashboard doom-quit
	  (emoji +unicode) hl-todo modeline ophints (popup +defaults) treemacs vc-gutter
	  vi-tilde-fringe workspaces zen :editor (evil +everywhere) file-templates fold
	  (format +onsave) snippets :emacs dired electric undo vc :term vterm :checkers
	  syntax (spell +flyspell) grammar :tools direnv editorconfig (eval +overlay) lookup
	  lsp magit make :lang (cc +lsp) emacs-lisp markdown nix (org +journal +pretty) rst
	  sh yaml :config (default +bindings +smartparens)
packages   (org-auto-tangle) (gift-mode :recipe (:host github :repo csrhodes/gift-mode :files
	  (gift-mode.el))) (org-transclusion :recipe (:host github :repo
	  nobiot/org-transclusion :branch main :files (*.el)))
elpa       vterm

Is there any chance we could solve this issue? I would love to be able to use this package to design my programming courses!
Thanks!

@nobiot
Copy link
Owner

nobiot commented Mar 24, 2023

I’d love to fix the issue. Two questions:

  • Can you reproduce it without Doom?
  • What is the exact step I can take from vanilla Emacs to reproduce it (and install Doom)

I suspect there is something specific to Doom Emacs. I do not use it and I don’t know how to use it

@nobiot
Copy link
Owner

nobiot commented Mar 24, 2023

Also do you use the latest commit of org-transclusion? I can see you use the main branch but I do not see which commit.

@nobiot
Copy link
Owner

nobiot commented Mar 25, 2023

I tried doom emacs. I am not sure if this is helpful, but I cannot reproduce the issue. Sorry, it's really hard for me to understand what is going on on your end.

@devcarbon-com
Copy link
Contributor

I've ran into this a couple times myself. I don't use doom, so I don't think it's exclusive to that.

I don't know exactly what caused it, but I suspect it had something to do with undo. (Or maybe undo-fu in particular).

If I can find a way to trigger this consistently I'll let you know.

@nobiot
Copy link
Owner

nobiot commented Mar 29, 2023

I don't know exactly what caused it, but I suspect it had something to do with undo. (Or maybe undo-fu in particular).

If I can find a way to trigger this consistently I'll let you know.

Thank you. I really struggle with this error; I'd really appreciate it if we can find a reproducible procedure.

I don't use undo-fu either.

I have encountered the error when I was developing features. For my experience, it was caused by the incorrect way transclusion is removed and the origianl #+transclude: line is inserted back (related to the beg/end marks). I think I have managed to eliminate all the cases I have encountered -- this is reflected in the current code.

@pp5x
Copy link
Author

pp5x commented Mar 30, 2023

Hi @nobiot Sorry I did not try to dig more in this issue. I would need to setup a blank emacs with the faulty file I've been using. If I can, I'll for sure try to improve this issue by identifying the reproduction steps.

I wanted to let you know of the issue, which is for now too problematic for me to adopt transclusion in my org production (as I have to almost kill emacs to get out of the infinite loop). I'd love to see this package included in doom distribution, it's a great community which could help you get lots of traction (they are on Discord to help).

To answer your previous question : I was on main 'up to date' 2d0502f.

@devcarbon-com
Copy link
Contributor

@nobiot Could you point me to where you in the code (or the commit) you fixed this occurring on the occasions you know about?

I'm thinking to debug this we could throw in a check that sees if the buffer size is growing since save was triggered and if so throw an error so we can debug-on-error instead of freezing emacs.

@devcarbon-com
Copy link
Contributor

So looking a little closer this should be simple to do. We already have org-transclusion-before-save-buffer and I can throw the check into org-transclusion-remove to capture local variables and experiment to help find the cause.

nobiot added a commit that referenced this issue May 10, 2023
Reported in GitHub issues #109 #177.

I cannot reproduce the issue myself so far. I am put in place (1) small
preventive measure and (2) heuristics to defect and break the infinite
loop on save-buffer.

(1) Org-transclusion (OT) tries not to save the transcluded buffer
content and instead save only the #+transclude keyword line to the file.
To achieve this, OT uses 'before-' and 'after-save-hook' to remove-all
the transclusions and then add-all them. This operation relies on the
returned value of the point from 'org-transclusion-remove' function. In
this commit, the point (integer) is changed to marker. This way, any
arbitrary buffer change between these remove-all and add-all processes
can have less impact on the moving points of reference -- makers
automatically move to adopt to the new buffer state. I suspect something
like 'whitespace-cleanup` put in 'before-save-buffer' might dislocate
the positions in some situations. This preventive measure hopefully
preempt the issues.

(2) The heuristics is simple but should work if there is an unexpected
number loop happens. Since it is simply compare the length of a list,
and the 'dolist' loops for the same list, logically this should be
redundant; however, since the infinite loop itself to me is anomaly,
this heuristics might catch the issue and break the loop.

As you can see, both attempts are not based on causal analysis but rather
"stabbing in the dark" heuristics.
@nobiot
Copy link
Owner

nobiot commented May 10, 2023

In order to contain the issue of infinite loop, I have pushed commit 43c478c. I'd appreciate it if anyone has bumped into the error message: "org-transclusion: Aborting. You may be in an infinite loop".

fix: heuristics to identify & break infinite loop on save

Reported in GitHub issues #109 #177.

I cannot reproduce the issue myself so far. I am put in place (1) small
preventive measure and (2) heuristics to defect and break the infinite
loop on save-buffer.

(1) Org-transclusion (OT) tries not to save the transcluded buffer
content and instead save only the #+transclude keyword line to the file.
To achieve this, OT uses 'before-' and 'after-save-hook' to remove-all
the transclusions and then add-all them. This operation relies on the
returned value of the point from 'org-transclusion-remove' function. In
this commit, the point (integer) is changed to marker. This way, any
arbitrary buffer change between these remove-all and add-all processes
can have less impact on the moving points of reference -- makers
automatically move to adopt to the new buffer state. I suspect something
like 'whitespace-cleanup` put in 'before-save-buffer' might dislocate
the positions in some situations. This preventive measure hopefully
preempt the issues.

(2) The heuristics is simple but should work if there is an unexpected
number loop happens. Since it is simply compare the length of a list,
and the 'dolist' loops for the same list, logically this should be
redundant; however, since the infinite loop itself to me is anomaly,
this heuristics might catch the issue and break the loop.

As you can see, both attempts are not based on causal analysis but rather
"stabbing in the dark" heuristics.

@japhir
Copy link

japhir commented Jan 3, 2024

Hi nobiot, thanks for the great package!

I'm also running into this issue on 1.3.2, likely because I use evil-mode and have set:
(add-hook 'evil-insert-state-exit-hook 'my-save-if-bufferfilename)

Right now it's unfortunately unusable for me, as it repeats the links infinitely and I can't break out other than killing the emacs session. How can I try out this fix to see if it helps?

@nobiot
Copy link
Owner

nobiot commented Jan 6, 2024

@japhir Thank you.

I don't use evil mode and I can't reproduce the issue any longer.

The key is to reproduce the issue reliably so that we can analyze the code that causes the infinite loop.

What I can suggest is:

  1. Remove 'my-save-if-bufferfilename from the hook and see if you can reproduce the issue reliably

  2. Remove evil mode and see if you can reliably reproduce the issue reliably

  3. Remove everything but evil mode and org-transcluion ans see if you can reproduce the issue reliably

I have begun to suspect there may be some insistency with evil mode... but I can't use vim keybinding so I am no use here.

@devcarbon-com
Copy link
Contributor

@nobiot I've observed this behavior a few times, and I'm not using evil.

It seems to trigger when undoing just the right amount in the org file with the transclusions and then save.

I'll see if I can reproduce with emacs -Q.

@nobiot
Copy link
Owner

nobiot commented Jan 6, 2024

@devcarbon-com Thank you. I see, undoing never occurred to me. I am really curious to see if anyone can repro reliably with emacs -q...

Have you seen the infinite loop after commit 43c478c I mentioned above? It was merged in May 2023.

@devcarbon-com
Copy link
Contributor

It was after that, only a couple months ago, but I'm not entirely sure if my version was current at the time.

@devcarbon-com
Copy link
Contributor

devcarbon-com commented Jan 6, 2024

Okay, I just reproduced this bug.

Emacs version:

GNU Emacs 29.1 (build 1, aarch64-apple-darwin21.6.0, NS appkit-2113.60
 Version 12.6.6 (Build 21G646)) of 2023-08-16

org-transclusion version: 1.3.2 installed via package-install after fresh install of emacs.

org file:

* OT test
#+transclude: [[./code.el::bar][bar]] :thing-at-point defun  :src elisp

code.el:

(defun bar ()
  (interactive)
  (message "hello"))

Steps to reproduce:

  1. Enable OT mode via org-transclusion-mode.
  2. org-transclusion-add with cursor at transclusion link.
  3. Save buffer.
  4. Undo exactly once.
  5. Save buffer.
  6. Bug is triggered, upon c-g org file looks like this:
#+transclude: [[./code.el::bar][bar]]  :src elisp
... (many more identical lines)
#+transclude: [[./code.el::bar][bar]]  :src elisp
#+begin_src elisp
(defun bob ()
  (interactive)
  (message "hello"))
#+end_src

@devcarbon-com
Copy link
Contributor

(edited to be accurate)

@nobiot
Copy link
Owner

nobiot commented Jan 7, 2024

Steps to reproduce:

1. Enable OT mode via `org-transclusion-mode`.

2. `org-transclusion-add` with cursor at transclusion link.

3. Save buffer.

4. Undo exactly once.

5. Save buffer.

6. Bug is triggered, upon `c-g` org file looks like this:

@devcarbon-com Thank you for the steps. It's the first time I have seen a concrete step to repro. I feel one step closer to know what's really happening.

Now I have tried following the steps with emacs -q. I cannot follow exactly your steps (see below). I tried a little differently; no infinite loop triggered.

  1. In terminal, type emacs -q. Emacs opens a splash screen
  2. Press "q" to move to the "scratch" buffer.
  3. CTRL-Y to yank the following setup script and evaluate the buffer
(add-to-list 'load-path "~/.config/emacs/elpa/org-transclusion-1.3.2.0.20230819.63913")
(load-library "org-transclusion")
(define-key global-map (kbd "<f12>") #'org-transclusion-add)
(define-key global-map (kbd "C-z") #'undo)
  1. C-x C-f to visit the test org file (infinite-loop.org)
    image

infinite-loop.org is identical with yours.
code.el is identical with yours.
setup.el is the setup script above to record the setup.

  1. M-x org-transclusion-mode. This already adds the transclusion (default, expected behaviour). Calling org-transclusion-add does not do anything
  2. Step 5 does not flag the buffer to "modified"; you cannot save the buffer (no change) -- this is expected
  3. C-z to call undo exactly once. The transclusion gets removed.
  4. You cannot save the buffer because there is nothing registered in undo (expected).

I cannot exactly follow your steps, so I tried this:

In the buffer org-transclusion-mode is already active. Transclusion has been removed.

  1. Change "* OT Test" to "* OT Test-changed" so that the buffer is marked as modified (ensure the buffer-modified is "on" while do the rest)
  2. Move the cursor to the line of #+transclusion.
  3. F12 (org-transclusion-add) to add
  4. Undo exactly once; the transclusion gets removed
  5. Ensure the modified-buffer has not changed and buffer-save (C-x C-s)

I cannot reproduce the infinite loop.

GNU Emacs 29.1.50 (build 1, x86_64-pc-linux-gnu, GTK+ Version 3.24.33, cairo version 1.16.0) of 2023-07-30

Org mode version 9.6.6 (release_9.6.6 @ /usr/local/share/emacs/29.1.50/lisp/org/)

I use Ubuntu with Wayland (default), so I compile my own Emacs with pgtk (pure gtk) feature on -- instead of the default gtk for Xorg

@nobiot
Copy link
Owner

nobiot commented Jan 7, 2024

  1. Enable OT mode via org-transclusion-mode.
  2. org-transclusion-add with cursor at transclusion link.

You can do these steps only when you have customizing org-transclusion-add-all-on-activate to nil. The default is t. Did you do the steps with emacs -q or emacs -Q?

@devcarbon-com
Copy link
Contributor

devcarbon-com commented Jan 8, 2024

Ah, yes, I see where I was not clear now. Following your steps, I get the exact same behavior as you do, and do not get an infinite loop. The key difference seems to be using a file that is already written, vs. writing one from scratch. Perhaps also the timing of activating org-transclusion-mode. Undo amalgamation may also pay a factor. At first I could not get consistent results, until I added an 'intentional mistake'.

  1. start emacs with emacs -q (sorry, -Q was by mistake earlier, not the command I ran).
  2. q from splash to scratch buffer.
  3. paste in your setup code and eval-buffer.
  4. find-file to new .org file.
  5. new file is empty. Enable org-transclusion-mode.
  6. type in (not paste) contents of inifinite-loop.org, but intentionally "forget" the colon after #+transclude.
  7. org-transclusion-add. (user-error: Not at a transclude keyword or transclusion in a block at point). (intentional).
  8. Add colon.
  9. org-transclusion-add.
  10. save-buffer.
  11. undo. (If transclusion disappears, error will not be triggered. I get this when I don't add the intentional mistake at times).
  12. save-buffer.
  13. Loop is triggered, cancel via c-g

@devcarbon-com
Copy link
Contributor

Note that saving the buffer in step 10 seems to also be a key step. If you leave this out, it works without issue. (no loop, and transclution is removed on undo.)

@nobiot
Copy link
Owner

nobiot commented Jan 8, 2024

@devcarbon-com Thank you for the detail! I can reproduce the infinite loop now. I need to spend some time to get my head around it, though. It's a great leap forward!

@japhir
Copy link

japhir commented Jan 9, 2024

Omg, devcarbon-com managed to reproduce exactly what I was doing at the time that I encountered the crash, but I just didn't recall it anymore!

short description of what I was doing at the time

I was working on some data analysis in R, using org-mode and org-babel. At some point I started to accumulate a few too many functions, so I tangled the functions to separate files in the R subdirectory, so I could make a package out of it. After that, I wanted to work in those R files directly to make debugging and potential duplication errors easier to deal with.
But I also liked having direct access to the functions from my org-mode file, so I wanted to transclude each of the functions that I had just tangled to a separate file.

It was the first time in a long time using org-transclusion, so I typed out the new #+transclude [[file:R/a_function_for_this_R_package.R]] :src R lines manually. Then realized I had forgotten the colon, and added them with some automatic intermediate saving. Maybe I hit undo at an unfortunate time and this caused the crash.

I would not have been able to make a reproducible example but your description triggered my memory 😄

@nobiot nobiot pinned this issue Apr 20, 2024
nobiot added a commit that referenced this issue Apr 20, 2024
It is a common mistake for users to omit the colon. It is a workaround to
minimize the chance for users experience the known infinite issue. Refer to
issue #177 on the GitHub repository:
#177."
@nobiot
Copy link
Owner

nobiot commented Apr 20, 2024

FYI
With #177, I have added another workaround to minimize the possibility of infinite loop occurring.

The workaround is this:

  • When you call the command org-transclusion-add to an individual #+transclusde line
  • If you have a misspellig -- ie if #+translclude is missing the colon ":" at the end, org-transclusion-add automatically corrects it by adding a colon. You will get a message telling this has been done for you.
  • Then transclusion works as normal.
  • This auto-correction does not happen when org-transclusion-add-all is used.

This workaround is motivated by the observation that the infinite loop issue happens mostly because of the tiny misspelling by missing the colon ":". The hope is to minimize the experience of infinite loop.

-- I realize it's not fixing the root cause. I have gone back to it but now I can only intermittently reproduce the problem (but more reliably than before, but not always). When I see the infinite loop happening, I am not able to determine the root cause... It would be great if anyone out there has experience in fixing this type of issue in Emacs and can help us.

@nobiot nobiot added the help wanted Extra attention is needed label Apr 20, 2024
@akashpal-21
Copy link

akashpal-21 commented May 11, 2024

@nobiot

I was using this

(defun org-transclusion-save-buffer ()
  "Save buffer protocols. Ensures file on disk is cleaned of transclusions;
Before writing to disk run `org-transclusion-before-save-buffer' which removes
active transclusions and generates a list of transclusions that were removed;
after writing to disk is complete, re-enable transclusions as and how they were
by running `org-transclusion-after-save-buffer' over the previously generated
list.

Run within `with-silent-modifications' so that none of this is recorded in the
`buffer-undo-list' of the buffer concerned."
  (with-silent-modifications
    (save-restriction
      (widen)
      (org-transclusion-before-save-buffer)
      (write-region (point-min) (point-max) buffer-file-name nil t)
      (org-transclusion-after-save-buffer)))
  t)

With write-region nil t that is visit set to `t' I can see no problem with this implementation as is from my side.

@akashpal-21
Copy link

akashpal-21 commented May 11, 2024

@nobiot @josephmturner

can we get a report on the values of the text-props as it passes through org-transclusion-remove
such as

++ text-props during save as it passes through `#'org-transclusion-remove`
Variables:
beg: 1
end: 17
keyword-plist: (:link [[file:source.org]] :current-indentation 0)
indent: 0
keyword: #+transclude: [[file:source.org]]
tc-pair-ov: #<overlay from 1 to 17 in source.org>

etc?

The infinite recursion happens when beg and end goes missing.

What I am trying to say is that the root cause of the infinite recursion is that text-properties of the overlay may be get corrupted for x,y,z number of reasons and if the beg and end of a text prop is corrupted then the org-transclusion-remove-all falls in problem -- so our real solution should be some sort of self correction mechanism for the remove function.

@akashpal-21
Copy link

This solution will not take care of the infinite recursion once and for all -- but this improvement is warranted nevertheless since users might want to save their buffer-undo-lists and the save-hooks should definitely not touch them.

But for the infinite recursion - if and how it exits - it has to be solved in the org-transclusion-remove function by making it fault tolerant - and detect corruptions when beg and end are equal.

@akashpal-21
Copy link

@ALL

One final hypothesis I wish to present for the enduring problem that is faced by all of us trying to replicate the issue - some of us can replicate consistently on a certain recipe while some of cannot - it is because we are fundamentally dealing with a non-deterministic system. EMACS GARBAGE COLLECTION - for more certainty we need more data and collaboration that cannot be done over the internet,

It may be those of us that fall into this problem through undo list corruption have a certain characteristic to our machines - maybe in the domain of available memory or what not? This is my instinct -- youd have to determine whether this is a pure conspiracy theory because elisp is by design deterministic or there is some truth to the claim.

@nobiot We should defer the PR to a completely different discussion - that is general improvement to the save protocol and distinguish it from this problem altogether.

For the problem in my opinion has to be solved permanently in the org-transclusion-remove function and not anywhere else -- but this should be above and all superflous - an insurance mechanism -- a fault tolerance protocol.

@nobiot
Copy link
Owner

nobiot commented May 11, 2024

@nobiot We should defer the PR to a completely different discussion - that is general improvement to the save protocol and distinguish it from this problem altogether.

Sounds reasonable to me. Agree.

I just re-compiled Emacs 29.1.90; now I cannot reproduce with devcarbon-com's procedure. So I agree it's not deterministic... or there is something we don't know yet.

For the problem in my opinion has to be solved permanently in the org-transclusion-remove function and not anywhere else -- but this should be above and all superflous - an insurance mechanism -- a fault tolerance protocol.

In testing 29.1.90 this time, beg and end passed to remove look fine.

@nobiot
Copy link
Owner

nobiot commented May 11, 2024

@akashpal-21

My latest understanding is that:

  1. buffer-undo-list is NOT a direct cause of the infinite recursion.
  2. The direct cause is org-transclusion-remove (more specifically the beg and end variables in it).
  3. The "corruption" of buffer-undo-list may cause 2., but we do not know exactly how this happens (you suspect it may be a combination of buffer-undo-list and GC in someway).

Is this correct?

This understanding is from my reading of how you phrase #245 and this from your previous comment:

@nobiot We should defer the PR to a completely different discussion - that is general improvement to the save protocol and distinguish it from this problem altogether.

For the problem in my opinion has to be solved permanently in the org-transclusion-remove function and not anywhere else -- but this should be above and all superflous - an insurance mechanism -- a fault tolerance protocol.

Thank you.

@akashpal-21
Copy link

@nobiot Yes that is my understanding - buffer-undo-list corruption is simply one way in which text-properties regarding BEG and END may get corrupted and be set as equal -- in which case the org-transclusion-remove falls in a problem.

But there may be N number of ways in which the text-property may get corrupted - and for a permanent fix - we should make org-transclusion-remove fault tolerant -- detect explicitly when BEG and END have been made equal by whatever process and refuse to process as is--

Maybe even try to recreate the BEG and END from the keyword-plist since the overlay still exists it just points wrong --

However that property may get corrupted.

@josephmturner
Copy link
Contributor

I tried reproducing the issue by let-binding gc-cons-threshold to a low value (99) around the snippet I shared above. I also tried inserting some explicit calls to garbage-collect in the snippet. Still no infinite loop.

can we get a report on the values of the text-props as it passes through org-transclusion-remove
such as...

@akashpal-21 I was not ever able to reproduce the issue. If it would still be helpful for me to report on these values, would you please send an Elisp snippet for me to run in emacs -Q which returns the relevant values?

@akashpal-21
Copy link

@josephmturner Hmm that's good to hear - I didn't know how to emulate GC circumstances.

To get the values we can debug the org-transclusion-remove function or just use describe-text-properties on the overlay.

For example I cannot reproduce the bug myself - but I can report on partial corruption of one property -- that I talked about earlier that does not result in the infinite loop but causes corruption still. For infinite loop the beg and end of the overlay should equate - this causes the org-transclusion-remove to fall in problem.

Please allow me a minute to attach a screen record to show the partial corruption still - I cannot reproduce the full corruption myself and therefore get the infinite loop -

@akashpal-21
Copy link

@josephmturner Just when I told you I cannot replicate it - I fell into it - now I forgot how to exit when such a situation arises I cannot quit emacs - it wont let me quit in any way

Screenshot_2024-05-12_02-34-49

Allow me a minute to recover.

@akashpal-21
Copy link

untitled.mp4

Partial corruption case as noted earlier - I cannot now replicate the infinite loop since beg and end wont coincide - but it did then -- now I dont even know what to say ?

@akashpal-21
Copy link

Ok recreated it again !! Lmao - Please see the video

untitled.mp4

@josephmturner
Copy link
Contributor

Thank you @akashpal-21! I can now reproduce the issue.

I was able to stop the hung Emacs with kill -SIGUSR2 <PID>, which then allowed
me to examine the backtrace:

Debugger entered--Lisp error: (quit)
  insert-before-markers("#+transclude: [[./org-transclusion-test-codebYgV7g...")
  org-transclusion-remove()
  org-transclusion-remove-all()
  org-transclusion-before-save-buffer()
  run-hooks(before-save-hook)
  basic-save-buffer(t)
  save-buffer(1)
  funcall-interactively(save-buffer 1)
  command-execute(save-buffer)

Furthermore, I was able to go back to the transclusion buffer, put point on the
transclusion and verify that the markers are in the same location:

(equal
  (get-text-property (point) 'org-transclusion-beg-mkr)
  (get-text-property (point) 'org-transclusion-end-mkr))
  ;;   => t

josephmturner added a commit to josephmturner/org-transclusion that referenced this issue May 12, 2024
Resolves nobiot#177 by making `org-transclusion-add` and `org-transclusion-remove` not
affect the buffer undo history.
@josephmturner
Copy link
Contributor

josephmturner commented May 12, 2024

Here's a recipe for reproducing the infinite loop on the current main branch (b23ead2) with less manual action.

In emacs -Q, evaluate the following form:

(let ((source-file (make-temp-file "org-transclusion-test-source"))
      (org-file (make-temp-file "org-transclusion-test-org" nil ".org")))
  ;; *Change to location on your machine where org-transclusion is installed.*
  (add-to-list 'load-path "~/.local/src/org-transclusion/")
  (load-library "org-transclusion")
  
  (with-temp-file source-file (insert "foobar"))
  (with-current-buffer (find-file org-file)
    (insert (format "#+transclude: [[./%s]]" (file-relative-name source-file)))
    (org-transclusion-add)))

Then in the org-mode buffer which appears containing the transcluded word "foobar", repeat the following steps until Emacs hangs:

  1. save the buffer with C-x C-s
  2. undo the buffer history as far as it will go with C-/ C-/ C-/ C-/ C-/ (until No further undo information error)
  3. save the buffer with C-x C-s
  4. redo the buffer history as far as it will go with C-? C-? C-? C-? C-? (until No undone changes to redo error)

On my machine, this reliably reproduces the infinite loop after repeating these steps a few times.

For some reason I don't yet understand, I wasn't able to reproduce the loop with non-interactive calls to save-buffer, undo, and undo-redo (even using call-interactively), so these instructions still involve some manual interaction.

@josephmturner
Copy link
Contributor

I thought not modifying the buffer-undo-list when adding/removing transclusions would solve the problem, but I was wrong. Please ignore the accidental commit above which claims to resolve this issue.

@nobiot
Copy link
Owner

nobiot commented May 12, 2024

I have tested different Emacs versions from compiling it from source: 29.1.90, 29.1, 29.2, 29.3; I have not been able to reliably reproduce the infinite loop with @devcarbon-com's procedure -- I used to be able to, but no longer.

I have managed to make it happen a couple of times (I cannot record the exact steps) with 29.1 and 29.3.

Based on @akashpal-21's analysis, I have pushed a new branch and commit to force the infinite loop to occur -- I still do not know exactly when we will arrive at this condition in real use of transclusions, but I think the branch can be used as a test harness to craft a preventive measure and test it.

I also modified @josephmturner's automation code as below to be able to reproduce the infinite loop easily with a command.

Instruction:

  1. (prep) Disable auto-save-visited-mode or auto-save-mode (better to have a full control of buffer-save)
  2. (prep) Checkout the repro-infinite branch
  3. (prep) Adjust the location of ~/src/org-transclusion/ in the command below
  4. (prep) Evaluate test/infinite-loop command
  5. Call test/infinite-loop command
  6. --- STOP HERE TO UNDERSTAND HOW TO UNDO THE INFINITE LOOP :)
  7. Manually save the buffer
  8. (alternative to step 6). Manually remove the transclusion
  9. (After the test) Once test is done, check out main branch and evaluate org-transclusion-remove to reset the test harness...

Infinite loop starts as soon as you save the buffer. Stop it immediately with C-g -- even with my old machine, I get about 2000 lines added within a fraction of a second. You need to delete the transclusion completely in order for you to properly kill the buffer (because of kill-buffer-hook).

(defun test/infinite-loop ()
  (interactive)
  (let ((code-file (make-temp-file "org-transclusion-test-code" nil ".el"))
        (org-file (make-temp-file "org-transclusion-test-org" nil ".org")))
    ;; *Change to location on your machine where org-transclusion is installed.*
    (add-to-list 'load-path "~/src/org-transclusion/")
    (load-library "org-transclusion")

    (with-temp-file code-file
      (insert "(defun bar ()\n")
      (insert "  (interactive)\n")
      (insert "  (message \"hello\"))"))

    (with-current-buffer (find-file org-file)
      ;; Inhibit read-only so that we can easily remove the problem for
      ;; testing purposes.
      (setq-local inhibit-read-only t)
      (org-transclusion-mode +1)
      (insert "* OT test\n")
      ;; Colon after #+transclude intentionally omitted
      (insert
       (format "#+transclude: [[./%s::bar][bar]] :thing-at-point defun  :src elisp"
               (file-relative-name code-file)))
      (org-transclusion-add))))

@nobiot
Copy link
Owner

nobiot commented May 12, 2024

@josephmturner, I posted the comment above before noticing your latest comment about recipe for repro. I will come back to it later. Thank you.

@akashpal-21
Copy link

I still do not know exactly when we will arrive at this condition in real use of transclusions,

I also think so - particularly because org-transclusion can never generate this result -- it is impossible for a null file to be transcluded

Nobody should under any normal circumstances trip upon this bug, possibly triggered by users when testing the program's functionality. Requires very rare alignment of known circumstances.

The problem is inherited from the environment it is functioning in - not under normal operations but exceedingly rare alignment of circumstances, the problem really isn't that the problem exists - but that when - in a 1:100000 circumstance it is reached - it results in a catastrophe. The user is imprisoned until they figure out a way to exit. Should the remove protocol refuse to entertain the impossible case of getting a 0 length overlay - it doesn't even need to try to rectify errors - but it should give the user the choice to manually delete the overlays and be allowed to save and quit.

nobiot added a commit that referenced this issue May 12, 2024
Collective analysis efforts have found that infinite loop occurs when variables
`beg` and `end` have an identical value in `org-transclusion-remove`.

It's at the top of the function and looks like this:

(beg (marker-position (get-char-property (point)
                                         'org-transclusion-beg-mkr)))
(end (marker-position (get-char-property (point)
                                         'org-transclusion-end-mkr)))

These values are used to know the size of the transclusion to remove it. The
size of overlay on the source cannot be used because filter can alter the text
size. `text-property-search-forward` can be a reliable way to do this.

The message "(org-transclusion) Something is off" is still a WIP version. I
think we need to come up with a better message if the method above is deemed
viable as a preventive measure.
@nobiot
Copy link
Owner

nobiot commented May 12, 2024

My first attempt at a preventive measure.

The reproduce recipe from @josephmturner works on my end but only intermittently. So I used 7ad7936 to force infinite loop and to test the preventive measure. It seems to work with multiple transclusions.

@akashpal-21
Copy link

akashpal-21 commented May 12, 2024

@nobiot It is always a pleasure to see you come up with solutions, I introduced the change to the org-transclusion-content-insert as you suggested to force beg and end to coincide and checked the same with describe-text-properties - the remove works flawlessly. Thank you so much. It is working on my end.

@josephmturner
Copy link
Contributor

@nobiot Your patch-infinite-loop workaround successfully avoids the infinite loop on my machine. Would it be appropriate to say something like

[org-transclusion] Recovering from error: incorrect values for transclusion BEG and END

We could use warn instead of message? This could be an opportunity to ask the user to report the issue, if we have an idea of more data the user could share with us to help figure out a true solution.

Thank you!

@nobiot
Copy link
Owner

nobiot commented May 13, 2024

I have managed to reproduce the infinite loop 100% of times on my end -- code in the "details" toggle below.

The theory of what happens is this, and the the code supports the theory.

[Fact / design of org-transclusion]

  • Each transclusion has text-properties org-transclusion-beg-mkr and org-transclusion-end-mkr.
  • They hold markers to keep track of where the transclusion begins and ends.

[Now what happens]

  • In some combination of undo and buffer-save with transclusions, the markers can temporarily point to non-existing locations in the buffer.

  • If the garbage collection happens to run at this moment, it will sweep these pointers. Now they end up pointing to start of the buffer (point 1).

Well, my description using a human language may not be precise and accurate, but based on this theory, I have come up with the code. Now I can reproduce the infinite loop condition 100% of my attempts.

If this theory is confirmed, I think I need to work to get the current "workaround" to be the way forward, removing the use of makers in the current way.

Let me know how you go...

To use the code provided below, adjust the src location, evaluate the snippet, and call the command test/combine. You should see this, along the temporary file as @josephmturner's code does. See the both beg and end markers point to the same location.

image

(defun test/repro-loop ()
  (interactive)
  (let ((source-file (make-temp-file "org-transclusion-test-source"))
        (org-file (make-temp-file "org-transclusion-test-org" nil ".org")))
    ;; *Change to location on your machine where org-transclusion is installed.*
    (add-to-list 'load-path "~/src/org-transclusion/")
    (load-library "org-transclusion")

    (with-temp-file source-file (insert "foobar"))
    (pop-to-buffer (find-file org-file))
    (insert (format "#+transclude: [[./%s]]" (file-relative-name source-file)))
    (org-transclusion-add)
    (save-buffer)))

(defun test/force-gc ()
  (interactive)
  (undo)
  ;; This save-buffer is the key. If you comment it out, the infinite loop won't
  ;; happen.
  (save-buffer)
  ;; Garbage collect is also the key
  (garbage-collect)
  (undo-redo)
  (describe-text-properties 1))

(defun test/combine ()
  (interactive)
  (test/repro-loop)
  ;; `undo-boundary' is necessary to get undo to work through calling the test
  ;; functions in Elisp.
  (undo-boundary)
  (test/force-gc))

@akashpal-21
Copy link

My opps at silicon valley have hidden the comment but feels good to know my intuition was correct

#177 (comment)
#177 (comment)

We finally have a confirmed fix to this issue. Both by simulating the conditions in a deterministic manner and solving it using a more general solution -- rather than specific.

@nobiot
Copy link
Owner

nobiot commented May 13, 2024

Can you reproduce the problem on your end with the code, too?

@akashpal-21
Copy link

Can you reproduce the problem on your end with the code, too?

Positively yes.

Screenshot_2024-05-14_03-11-51

@josephmturner
Copy link
Contributor

josephmturner commented May 13, 2024

Can you reproduce the problem on your end with the code, too?

Me too.

[Now what happens]

* In some combination of `undo` and `buffer-save` with transclusions, the markers can temporarily point to non-existing locations in the buffer.

* If the garbage collection happens to run at this moment, it will sweep these pointers. Now they end up pointing to start of the buffer (point 1).

This is very clear. Thank you, @nobiot!

IIUC, org-transclusion-beg-mkr and org-transclusion-end-mkr always point to the current buffer (the buffer containing the text they are applied to). Instead of markers, we could store position numbers in org-transclusion-remember-transclusions by converting each marker to a number in org-transclusion-before-save-buffer. Then, in org-transclusion-after-save-buffer, we'd make a new marker for each number. This may avoid garbage collection at the cost of unnecessary memory usage, code complexity, and potential bugs (what happens if a different before-save-hook or after-save-hook inserts some text and now each number is wrong?). It's a hack -- I'm just thinking "out loud" :)

It would be nice if we could mark certain markers so that GC doesn't collect them. Like specifying "weakness" in make-hash-table.

@nobiot
Copy link
Owner

nobiot commented May 15, 2024

Thank you all -- just letting you know that it looks like I won't have much capacity until mid-late June -- I will try to come back earlier to this work here, but I cannot promise. I wanted to let you know before I "disappear". I will try not to go completely radio silent but I may. Talk to you soon =)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

6 participants