Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack integrity during EGC #2881

Closed
TerryE opened this issue Aug 7, 2019 · 6 comments
Closed

Stack integrity during EGC #2881

TerryE opened this issue Aug 7, 2019 · 6 comments
Assignees

Comments

@TerryE
Copy link
Collaborator

TerryE commented Aug 7, 2019

The issue

If a new resource is allocated, then the VM adds an extra step in the malloc process. If the malloc fails returning NULL, then the VM carries out a GC sweep to free up garbage memory and retries the malloc before aborting the allocation request with an out of memory error thrown.

Expected behaviour

This process should maintain internal integrity and either succeed or correctly throw an out-of-memory error which can be handled and reported by the application. It should also be economic as practical -- that is avoid throwing an error when GC could make sufficient available.

Actual behaviour

The complication arise here if collectable objects have an associated _gc methamethod. This means the the GC can end up calling Lua functions during the GC sweep. Such calls can trigger a Lua stack resize. This occurs very rarely and when it does the Lua VM internally handles all of the housekeeping to update the stack references in the lua_State, etc. preserving data integrity. Issues can occur if the API code does something like:

  StkId o = index2addr(L, idx);          /* Cache a TValue pointer to the Lua stack */
  funcWhichAllocatesResource(L, ...);    /* this can result in a stack resize */
  /* any use of (o) now points to the wrong location and can crater the VM */

We have hit and tracked down one example of this in a live DiUS application. The manifestation is that the VM will throw a H/W exception when this happens. This is very occasionally and seemingly randomly, and so is very hard to reproduce systematically and therefore is hard to identify.

A specific example is the use of the seterrorobj() function in the lua53/ldo.c:

  StkId oldtop = restorestack(L, ci->extra);
  seterrorobj(L, status, oldtop);

Here seterrobj() can push and error message as a TString to the stack when can trigger a GC and if it does then oldtop is still pointing to the old stack and not the new one. If seterrobj used a ptrdiff_t as its third argument seterrorobj(L, status, oldtop - L->stack) then this offset reference is stack relative and won't cause an issue if the stack is resized and relocated.

So what am I asking here?

Standard Lua doesn't experience this because it is working in a large RAM heap and recreating this issue is almost impossible, IMO. This isn't the case when you've got 10s of Kb RAM available. The eLua patch addressed on of the cases but missed the one which tripped up DiUS. So what our approach to this architectural issue. I think that we have three broad responses:

  1. Ignore it and hope that it goes away.
  2. Add a layer of complexity to the GC so that emergency sweeps bypass collecting resources which trigger a __gc method. This is going to hard to implement well, I think.
  3. Recode to avoid using spanning StkId pointers and use stack-offsets where necessary.

My preference is for (3) and to do this as a separate commit pass so that it is more easily reviewable. That means that in implementation (1) and (3) are the same in the first instance.

Comments?

@nwf
Copy link
Member

nwf commented Aug 7, 2019

How sure are you about the complexity of (2)? It would seem, naively, fairly simple to leave __gc-bearing garbage as garbage during a sweep?

I think it's likely going to be complicated to explain which C API functions can cause egc passes and so need special handling w.r.t. local C variables.

@TerryE
Copy link
Collaborator Author

TerryE commented Aug 7, 2019

It would seem, naively, fairly simple to leave __gc-bearing garbage as garbage during a sweep?

The issue is that this involves some changes to the sweep algo which might need us to add an extra pass, and we also need to add extra state to the sweep process. The eLua approach attempted this and missed some -- hence the DiUS issue. The API changes are reasonably small and internal to the Lua components: lapi.c, lauxlib.c, lgc.c and ldo.c. The rule is quite simple: don't cache pointers to stack locations across any API call which allocated resources; if you need to cache then use a stack offset instead. Because these issues aren't exposed at a module level in the public API (lua_State is an anonymous struct), then we don't need to explain them to module developers. It's only the committers who are involved with or reviewing the Lua53 port that need to understand the options and to buy into the chosen one. Hence my poking you and Johny.

@jmattsson
Copy link
Member

Option 3 would seem to be the "correct" approach, and one which we might get traction with upstream as well(?).

@stale
Copy link

stale bot commented Aug 2, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Aug 2, 2020
@jmattsson jmattsson removed the stale label Aug 2, 2020
@TerryE
Copy link
Collaborator Author

TerryE commented Aug 5, 2020

This addressed architecturally in Lua53 in that finalisers are not called by the GC during an emergency sweep, so it makes sense to close this IMO.

@TerryE TerryE closed this as completed Aug 5, 2020
@jmattsson
Copy link
Member

In that case I think we should apply at least one of the workarounds to 5.1 before we freeze it (if that didn't already happen when I wasn't paying attention?)

Here at work we're still using our initial workaround so it won't bother us (as we're hoping to do the switch to 5.3 anyway next time we get to rebase), but it's such a nasty little surprise I don't want anyone else having to debug it unnecessarily!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants