Skip to content

Memory leak in implementation #2

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
pauldix opened this issue Oct 19, 2018 · 10 comments
Open

Memory leak in implementation #2

pauldix opened this issue Oct 19, 2018 · 10 comments

Comments

@pauldix
Copy link
Owner

pauldix commented Oct 19, 2018

The Function and Environment structs in the object module reference each other and have circular references. I'm using Rc here to make things work, but I have to figure out a way to clean all these things up after calls to Monkey functions happen.

The Monkey language has closures which cause a new environment to be created for the function, but an outer environment to be referenced.

I'm not quite sure how to structure this to clean things up. In the book, Thorsten glossed over this and relied on Go's garbage collector to do the work. Does this mean that I'll have to implement a GC for the language to get all of this to work without memory leaks?

Without further reading and research I've hit the limits of my knowledge so any ideas or pointers would be greatly appreciated.

@Diggsey
Copy link

Diggsey commented Oct 22, 2018

From what I can tell the monkey language is garbage collected, and so you would need to implement a garbage collector/cycle detector in order to interpret it.

A simple GC should be easy for an interpreted language, as you should already know whats on the stack/in globals (the roots) and your data structures are easily traceable, so you just need to start from the roots, find everything reachable from them, and free anything you didn't encounter.

@boomshroom
Copy link

If you know you're going to be having back-pointers, I suggest taking a look at std::rc::Weak. Weak pointers allow access to a reference counted object without forcing it to stay alive, making them well suited for things like doubly-linked lists and parent pointers in a tree structure.

@pauldix
Copy link
Owner Author

pauldix commented Oct 22, 2018

@boomshroom I tried Weak at some point, but I had a tricky thing where I didn't have a least one strong reference so it dropped everything while the closure still needed to be around. I should take another look at that to see if I can make it work.

@Diggsey
Copy link

Diggsey commented Oct 22, 2018

Using Weak can't work in general for the language, because it has no clear ownership tree: functions own (should keep alive) their environment, and environments should own their functions. You can't even split them, because it's possible for functions to capture the same environment which owns them. You either have to introduce restrictions to the monkey language so that this is no longer possible, or implement some form of GC.

@kestred
Copy link

kestred commented Oct 23, 2018

A big rust contributor has recently been experimenting with garbage collected pointers written in rust--- it's not production ready yet; but if you can't use it directly it might at least provide inspiration.

It provides a Gc type which is a substitute for Rc in most cases.

Repo: https://github.com/withoutboats/shifgrethor
Blog: https://boats.gitlab.io/blog/post/shifgrethor-i/

@pandulaDW
Copy link

pandulaDW commented Nov 12, 2022

I recently finished implementing monkey in Rust myself as well. (https://github.com/pandulaDW/interpreter-in-rust). And from the look of it, I've taken an approach somewhat similar to yours. Sorry for my ignorance, but shouldn't the new environments that gets created during function calls be dropped at the end of each call evaluation?, which in turn would also drop the RC references attached to its outer environments?

@pauldix
Copy link
Owner Author

pauldix commented Nov 12, 2022

@pandulaDW It's been years since I thought about this code, but I think what I was referencing was that if you had a long running Monkey script, it would leak memory. If you just fire one-off Monkey scripts within a long running Rust process, then yes, that would be freed once the individual script invocation is done (I think).

@aabbtree77
Copy link

This talk is from 2016, but it has some focus on implementing closures with Rust:

"Rusty Runtimes: Building Languages In Rust" by Aditya Siram

It looks complicated to me. Note that many compilers written in C do not clean up their resources, and a lot of interpreters (the very first Lisp, perhaps the C interpreters of 1970s, Tcl?) have been used long before they got their first GC/reference counting.

@aabbtree77
Copy link

DeepThink R1 (185s.)

Okay, so I'm trying to figure out how to handle memory management in my Rust implementation of the Monkey language, especially with the circular references between Function and Environment structs. The user mentioned using Rc, which is Rust's reference-counted smart pointer, but they're worried about cleaning up after Monkey function calls to avoid memory leaks. Since closures in Monkey create new environments that reference outer ones, there's a potential for reference cycles that Rc alone can't handle because it doesn't do cycle detection.

First, I remember that Rc in Rust is a non-atomic reference count, which is good for single-threaded scenarios. But when you have circular references, like two Rc pointers pointing to each other, the reference counts never drop to zero, leading to memory leaks. So in this case, Function and Environment structs reference each other, creating such a cycle.

In the book "Writing an Interpreter in Go," the author relies on Go's garbage collector to clean up, which automatically handles cycles. But Rust doesn't have a built-in GC, so we need another approach here.

One possible solution is to use Weak references. Weak pointers don't increase the reference count, so they don't keep the value alive. If a struct has a reference to another that should not keep it alive, using Weak can break the cycle. For example, if an Environment has a reference to its outer Environment, maybe that outer reference can be a Weak pointer. But wait, in the case of closures, the function's environment might need to reference the outer environment strongly, because the closure could outlive the outer scope. Hmm, this needs careful consideration.

Alternatively, maybe the Function struct holds an Rc, and the Environment holds a list of Functions, but if those Functions also hold Rc, that creates a cycle. So perhaps the Environment's reference to the Function should be a Weak pointer. Or maybe the Function's reference to Environment is a Weak? But that might not be safe because the environment needs to exist as long as the function does.

Wait, maybe the Function's environment should be a strong Rc, and the Environment's functions should be stored as Weak. Because if the Function is dropped, the Environment can still exist, but the Functions in the Environment shouldn't keep the Environment alive. Or maybe the Environment's functions are stored with Rc, but then when the Environment is dropped, it can clear those functions. But that seems complicated.

Alternatively, maybe using a different smart pointer structure. For example, maybe the outer environment in a closure is a Weak reference. Because the outer environment is created outside the closure and might be part of a parent scope. If the closure's environment holds a Weak reference to the outer, then when the outer environment is no longer referenced strongly elsewhere, it can be dropped, breaking the cycle. But closures in Monkey need to capture variables from the outer environment, so if the outer is Weak, when it's dropped, the closure's environment would have a dangling reference. That's a problem.

Wait, but in Rust, a Weak pointer can be upgraded to an Rc, which returns an Option<Rc>. So when the Environment is accessed via a Weak, if the outer environment is still alive, you can get it; otherwise, it's gone. But in the context of a programming language's environment, you don't want the outer environment to disappear while a closure still exists that references it. So perhaps the outer environment must be kept alive as long as any closures referencing it exist.

Therefore, the closure's environment must hold a strong Rc to its outer environment. But that creates a chain of strong references. For example, if you have a closure inside a closure, each environment references the outer one via Rc. Then, as long as the innermost closure exists, all the outer environments up the chain are kept alive. But when the closures are no longer referenced, all environments are dropped. That's acceptable, but if there's a cycle between Function and Environment, that's a problem.

So the original problem is that Function and Environment reference each other. Let's say the Function struct has an Rc to capture its closure environment. The Environment struct might have a list of Functions that were declared in it. If those Functions are Rc, then each Function holds an Rc, and the Environment holds Rc, creating a cycle. This cycle would prevent both from being dropped, even when they're no longer reachable from the root.

To break this cycle, one of the references needs to be a Weak. The question is which one. If the Environment holds Weak references, then the Functions can be dropped when there are no other strong references, and the Environment's Weak pointers won't keep them alive. But if the Functions are part of the Environment (like functions declared in that scope), maybe the Environment should own them. But if Functions can close over their Environment, which includes themselves, that could be a problem.

Wait, maybe the Environment's functions are not directly stored in the Environment. Or perhaps the Functions are stored in the Environment with Rc, but the Function's environment is a Weak reference to the Environment. Wait, that might not work because the Function needs the Environment to exist as long as the Function does.

Alternatively, maybe the Function's environment is an Rc, and the Environment contains functions as Weak. That way, the Environment doesn't keep the Functions alive, but the Functions keep their Environment alive. So if all references to the Function are dropped, the Function is destroyed, which releases the Rc on the Environment. If the Environment had Weak references to the Function, then the cycle is broken. But that depends on whether the Environment needs to hold onto Functions beyond their usage.

Wait, in Monkey, functions are values that can be assigned to variables or passed around. So when a function is defined inside an Environment, the Environment might not need to hold a strong reference to it. The variables in the Environment would hold the Function values. If those variables are Rc, then the Environment's variables keep the Functions alive. The Function in turn references the Environment via Rc. So if a Function is still referenced elsewhere, its Environment is kept alive. But if the Environment's variables are the only owners of the Function, then when the Environment is dropped, the Function's Rc count decreases. But if the Function is referenced elsewhere, the Environment stays alive.

But if the Environment is part of a closure's environment chain, then even if the Function is kept alive, it's possible that the Environment is kept alive through other closures. Hmm.

Alternatively, maybe the Environment should not directly own the Functions. Instead, Functions are created and stored in variables, which are managed by the Environment's variables table. The variables would hold Rc where Object can be a Function. So the Environment's variables are the ones that hold strong references to the Functions. When the Environment is dropped, those variables are dropped, decrementing the Rc counts. If the Function has a reference back to the Environment (as part of its closure), that's where the cycle could happen.

So if Function A has a reference to Environment X, and Environment X's variables have a reference to Function A, that's a cycle between Function A and Environment X. Both are Rc pointers, so their counts will never reach zero. This is the crux of the problem.

To break this cycle, one of the references needs to be a Weak pointer. Since the Function needs the Environment to live as long as the Function exists (because when the Function is called, it needs to access its environment), the Function should hold a strong Rc. However, the Environment's variable that holds the Function should perhaps hold a Weak<Rc>. Wait, but variables in the environment need to keep the value alive. If the variable is a Weak pointer, then the Function could be dropped even if it's in the Environment's variables. That's not correct. The variable should own the Function, so it should be a strong Rc.

But this brings us back to the cycle. So perhaps the Function should not hold a strong reference to the Environment. Wait, but the Function's closure needs to capture the Environment, so when the Function is called, it can access variables from that Environment. If the Function's Environment is a Weak pointer, then when the Environment is dropped, the Function's Environment becomes invalid. But in reality, the Environment should be kept alive as long as the Function exists because the Function's closure depends on it.

Therefore, the Function must hold a strong Rc, and the Environment's variables hold a strong Rc. Thus creating a cycle. To break this cycle, the Environment's reference to the Function can be a Weak pointer. But then the Environment's variables don't keep the Function alive. But in reality, the variable should own the Function. So that approach might not work.

Alternatively, maybe the Function's reference to the Environment can be a Weak pointer. But then, if the Environment is dropped, the Function's Environment becomes invalid, which would be unsafe. Unless we check when the Function is called whether the Environment is still alive. But that's not feasible in Rust because Weak pointers can be upgraded to Rc, but if the Environment is gone, the Function would have no Environment to use, which is a problem.

Hmm. This seems like a tricky situation. In garbage-collected languages, this is handled automatically, but in Rust, we need to manage it manually.

Another approach is to structure the code in such a way that cycles cannot occur. For example, maybe Functions don't need to reference the Environment directly. Wait, but closures in Monkey need to capture their environment. So when a closure is created, it captures the current environment, which is stored in the Function struct. Therefore, the Function must have a reference to the Environment.

Alternatively, perhaps the Environment can be structured as a tree, where each child environment has a strong reference to its parent, and parents don't have references to children. That way, there's no cycle between parent and child environments. But the problem here is between Function and Environment, not between parent and child environments.

Wait, the Function holds an Environment, and the Environment holds variables, which can hold Functions. So each Function in the Environment's variables that captures the Environment creates a cycle.

This seems like a classic reference cycle problem. The solution in Rust would be to use Weak in one direction. So perhaps, when a Function is stored in an Environment's variable, the Environment holds a Weak reference to the Function, while the Function holds a strong Rc to the Environment. That way, the Function is kept alive by other references (like if it's assigned to a variable elsewhere or returned from a function), and the Environment's variable's Weak reference doesn't contribute to the reference count. When the last strong reference to the Function is dropped, the Function is destroyed, which in turn drops the strong reference to the Environment, possibly allowing the Environment to be destroyed as well.

But then, the problem is that the Environment's variables are supposed to own the values they hold. If they hold Weak pointers, then the values might be dropped even if they're still in the Environment. That's not correct. The variables need to own their values. So this approach might not work.

Alternatively, perhaps the Function does not need to store a reference to the Environment. Wait, but how else would the closure capture the environment? The closure needs to have access to the variables in the environment when it's called. So the Function struct must include the Environment it was created in.

Hmm. Maybe the solution here is to use an arena or a garbage collection library. Since implementing a GC in Rust is non-trivial, perhaps using an existing library like rust-gc could help. The gc crate provides garbage-collected pointers, which can handle cycles. However, using such a crate might have performance implications and adds a dependency.

Looking at the gc crate (https://crates.io/crates/gc), it allows you to allocate objects in a garbage-collected heap, and the collector will handle cycles. So, perhaps the Function and Environment structs can be wrapped in Gc pointers instead of Rc. Then, even if there's a cycle between them, the GC will detect and collect them when they're no longer reachable.

But using a GC in Rust is somewhat against the usual ownership model, and might complicate things. However, for a language interpreter, it might be necessary. The alternative is to carefully structure the code to avoid cycles using Weak where possible.

Let me think again. If the Function holds an Rc, and the Environment holds its variables in a HashMap<String, Rc>, where Object::Function(Function), then when a Function is stored in the Environment, it creates a cycle: Function's environment is Rc, and Environment's variable is Rc.

So breaking this cycle requires one of these links to be a Weak. Which one?

If the Environment's variables store Weak<Rc>, then the variable doesn't keep the Function alive. But that's not correct because the variable should own the value. If the variable is the only owner, then storing a Weak pointer would allow the Function to be dropped, which is wrong. So that's not the solution.

If the Function's environment is a Weak<Rc>, then the Function doesn't keep the Environment alive. But the Environment is needed as long as the Function exists. So if the Function is alive, the Environment must be alive. Therefore, the Function must hold a strong reference to the Environment.

So this seems like a dead end. The cycle is unavoidable with Rc, leading to memory leaks.

In this case, using a garbage collector that can handle cycles might be the only way. So the answer might be: yes, you need some form of garbage collection to handle cycles, which Rc and Weak can't do on their own. Therefore, using the gc crate or implementing a simple mark-and-sweep GC for your interpreter might be necessary.

Alternatively, rethink the data structures to avoid cycles. Maybe the Function doesn't need to reference the Environment directly. Wait, but closures need their environment. So perhaps the Function contains the captured variables as a copy, but that's not how closures work. Closures capture their environment by reference, so the Function must have a reference to the Environment.

Alternatively, maybe the Environment is not a struct that's reference-counted, but instead uses a different approach. For example, using indices into an arena. So instead of Rc, the Function holds an index into an arena managed by the interpreter. The arena would own all environments, and when the interpreter is done, it can clear the arena. But this approach would require that the lifetime of environments is tied to the interpreter's execution, which might not be feasible if functions can outlive the scope in which they were created (like if a closure is returned from a function and stored elsewhere).

Hmm. That's a possibility. Using an arena allocator, where all environments are allocated in an arena and their lifetimes are managed by the arena. But in a language with closures, closures can escape their original scope, so the arena approach would require that all environments live as long as the arena, which might be the entire duration of the interpreter. That could lead to memory bloat, as environments are not freed until the end.

Alternatively, using a generational arena or some other memory management strategy. But this is getting complicated.

Another thought: in many interpreters, environments are hierarchical and form a tree. When a function is called, a new environment is created pointing to the outer one. If there are no references to a child environment, it can be safely dropped. But when closures are created, they hold references to their environment. So the closure's environment must be kept alive as long as the closure exists.

If the closure's environment is reference-counted, and the closure (Function) holds an Rc to it, and the environment does not hold any references back to the Function, then there's no cycle. Wait, but the environment's variables can hold the Function, which would create a cycle if the Function is stored in its own environment.

Wait, for example, consider a recursive function. The function is stored in the environment as a variable, and the function's closure references the same environment. So Function A is stored in Environment E, and Function A's closure references Environment E. So E has a variable holding Rc, and Function A holds Rc. This creates a cycle between E and Function A.

In this case, unless one of the references is weak, the cycle will leak.

So for recursive functions, this is a problem. Therefore, the only way to handle this without a GC is to avoid such cycles by not allowing functions to reference their own environment. But that's not feasible for closures.

Therefore, perhaps the conclusion is that in Rust, to handle such cycles, you need a garbage collector that can detect and collect cycles. Thus, using the gc crate or implementing a simple GC is necessary.

The gc crate's documentation shows examples where Gc pointers can form cycles and are collected when unreachable. So replacing Rc with Gc for Function and Environment structs might resolve the memory leaks caused by cycles.

Alternatively, if the user wants to avoid a GC, they might need to redesign the data structures. For instance, perhaps the environment does not store the functions directly. But in Monkey, functions are first-class values, so they can be assigned to variables, which are stored in the environment. Thus, the environment must hold references to the functions.

Another approach: instead of having the Function hold an Rc, maybe use a different type of reference. For example, if the Environment is owned by a parent structure, like the interpreter's call stack, and the Function holds a non-owning reference, like a borrowed reference. But this is complicated in Rust due to lifetimes, especially when closures can escape their original scope.

This seems to circle back to the original problem: without a GC, handling such cycles in Rust is difficult. Therefore, the answer might be that yes, implementing a GC or using an existing GC library is necessary to clean up the circular references between Function and Environment structs, similar to how Go's garbage collector handles it automatically.

So, in summary, the options are:

  1. Use Weak references in one direction to break the cycle. This requires careful analysis to determine which references can be weak without causing dangling pointers.

  2. Use a garbage-collected pointer type, such as those provided by the gc crate, to allow cycles and let the GC handle collection.

  3. Redesign the data structures to avoid cycles altogether, which might not be feasible given the requirements of closures in the Monkey language.

Given the constraints, option 2 might be the most straightforward solution, as it mirrors the approach taken in garbage-collected languages like Go, which the original book uses.

To address the circular references between Function and Environment structs in your Rust implementation of the Monkey language, you have a few options, each with trade-offs:

  1. Use Weak References to Break Cycles
    Approach: Convert one of the Rc references to a Weak to avoid creating a reference cycle. For example, if an Environment holds Weak<Rc> references to its functions, while Function holds a strong Rc for its closure environment, this breaks the cycle.

Considerations:

Ensure that Weak references are upgraded to Rc only when necessary (e.g., during function calls) to avoid dangling references.

Variables in the environment should generally hold strong references to values. Use Weak only where ownership isn't required (e.g., parent environments if they can be safely dropped earlier).

  1. Use a Garbage Collection Crate
    Approach: Replace Rc with garbage-collected pointers like Gc from the gc crate. This allows cycles to be automatically collected when they become unreachable.

Implementation:

use gc::{Gc, Trace, Finalize};

#[derive(Trace, Finalize)]
struct Environment {
    functions: Vec<Gc<Function>>,
    outer: Option<Gc<Environment>>,
}

#[derive(Trace, Finalize)]
struct Function {
    env: Gc<Environment>,
}

Benefits: Mimics Go’s garbage collector, handling cycles without manual intervention.

Drawbacks: Introduces a runtime overhead and a dependency on the gc crate.

  1. Redesign Data Structures to Avoid Cycles
    Approach: Structure the code to avoid circular ownership. For example:

Arenas: Allocate Environment and Function in an arena that owns all instances, using indices instead of pointers. This avoids cycles but complicates lifetime management.

Hierarchical Ownership: Ensure child environments don’t reference parents strongly (use Weak for outer environments if possible).

Recommended Solution
Given Monkey’s closure semantics and the need for environments to outlive their scope when captured by closures, using the gc crate is the most straightforward approach. It directly addresses cyclic references without invasive redesigns, similar to how Go’s garbage collector works. Here’s how to apply it:

Add the gc crate to your Cargo.toml:

[dependencies]
gc = "0.4"

Wrap Cyclic Structs with Gc:
Modify Function and Environment to use Gc and implement Trace and Finalize:

use gc::{Gc, Trace, Finalize};

#[derive(Trace, Finalize)]
struct Environment {
    outer: Option<Gc<Environment>>,
    functions: Vec<Gc<Function>>,
    // Other fields...
}

#[derive(Trace, Finalize)]
struct Function {
    env: Gc<Environment>,
    // Other fields...
}

Initialize and Collect:
Use Gc::new to allocate instances, and rely on the crate’s garbage collector to reclaim unused cycles:

let env = Gc::new(Environment { ... });
let func = Gc::new(Function { env: env.clone(), ... });

Conclusion
While Rust’s ownership model makes manual cycle management challenging, using the gc crate provides a practical solution that aligns with the original Go implementation’s approach. This avoids memory leaks from circular references while maintaining the flexibility required for closures in Monkey. If minimizing dependencies is critical, consider restructuring with Weak references, but this requires careful design to ensure safety and correctness.

@aabbtree77
Copy link

I doubt this will help, but since DeepSeek is all over the radio today, this could be amusing a bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants