Memory leak in arena with `Atom`s #1803

overlookmotel · 2023-12-24T20:04:08Z

OXC appears to leak memory when source includes any strings over 24 bytes in length.

A demonstration here: https://github.com/overlookmotel/oxc/tree/leak-test

Checkout that branch, and cargo run -p test_leak.

The problem

I believe the problem is this:

Strings are stored as Atoms.
Atom wraps CompactString from compact_str.
CompactString allocates to the heap if string is longer than 24 bytes.
Atoms may appear anywhere in the AST, likely within a Box.
Implementation of Box used for the arena has no Drop impl.
So when parsed Program is dropped, Atoms don't get dropped, and the heap allocations CompactString made are not freed.

In the demo above, I've added impl Drop for Atom which logs when an Atom is dropped. It never gets called, demonstrating that the heap memory is leaked.

Possible solutions

1. Add `impl Drop` to `Box`

This would be like bumpalo::collections::Box.

Problem is that this would make dropping the AST more expensive, negating a major benefit of using an arena.

2. Track and drop `Atom`s manually

Store pointers to all non-inline Atoms as they're created. A Drop impl on the arena could then drop them when the arena is dropped.

3. Replace `CompactString`

Replace CompactString with a similar "small string" implementation which stores long strings in the bumpalo arena, instead of the general heap.

As well as solving the leak, I believe this would also be more performant:

Less heap allocations.
Strings would be stored close to the AST nodes containing them, producing better cache locality.

CompactString is quite complicated, but OXC only uses a small fraction of CompactString's capabilities, so creating a custom small string implementation in OXC I don't think would be too tricky.

In addition:

CompactString is built as a replacement for std's String, and includes a capacity field to allow resizing. But OXC's Atom is immutable, so the capacity field is not required. All that's needed is a Box<str> equivalent.

So a new "small string" type could lose the capacity field and be reduced in size to 16 bytes, rather than the current 24. Obviously, only strings up to 16 bytes could then be stored inline, but still most JS identifiers are under 16 characters, and if string content is stored in the arena, storing longer strings "out of line" has less performance penalty.

Further optimizations

For identifers without escapes (pretty much all identifiers) and string literals without escape sequences (most strings), Atom could just be a pointer to the string in source code, rather than copying the string into the arena.

Secondly, when lexing strings and identifiers containing escape sequences, AutoCow creates a string in the arena. That string can be pointed to by Atom rather than copied to another location in the arena.

In fact, escaped strings for some reason currently are stored in the arena three times, though I'm not sure why. Dump of arena contents after parsing x = 'ABCDE\n';:

|0c000000 00000000 486d602e 01000000| ........Hm`..... 00000000
|00000000 0e000000 00000000 00000000| ................ 00000010
|00000000 00000000 b06d602e 01000000| .........m`..... 00000020
|05000000 00000000 806d602e 01000000| .........m`..... 00000030
|00000000 0d000000 0098aa6f 01000000| ...........o.... 00000040
|04000000 0d000000 41424344 450a0000| ........ABCDE... 00000050
|00000000 00000000 00000000 000000c6| ................ 00000060
|f0010000 00004142 4344450a 42434445| ......ABCDE.BCDE 00000070
|00000000 01000000 00000000 01000000| ................ 00000080
|78000000 00000000 00000000 00000000| x............... 00000090
|00000000 000000c1 0090aa6f 01000000| ...........o.... 000000a0

The text was updated successfully, but these errors were encountered:

overlookmotel · 2023-12-24T20:06:58Z

For similar reasons, BigIntLiteral also leaks memory because num_bigint::BigInt stores data on the heap.

Any other structures which store data on the heap will also leak, though I couldn't see any other cases at first glance.

Boshen · 2023-12-25T03:21:53Z

Thank you for the in-depth research.

For strings, my previous attempt was #451, I think we can look into this problem space a bit more to find a good solution.

For big ints, the usages are rare - perhaps we could box it into the arena?

@overlookmotel Would you like to explore and find a way to mitigate these memory leak problems with us? Thank you in advance 🙇

Boshen · 2023-12-25T03:23:52Z

Let's start with option 3 (replace compact string).

I still want to try the hack from https://github.com/rust-lang/rust/blob/f9097f87c9c094f80826fb60a1a624b5f9f1ed82/compiler/rustc_span/src/symbol.rs#L1999-L2007.

Boshen · 2023-12-25T03:26:40Z

In fact, escaped strings for some reason currently are stored in the arena three times, though I'm not sure why. Dump of arena contents after parsing x = 'ABCDE\n';:

This is going to be fun bug ... AutoCow is copied from jsparagus https://github.com/mozilla-spidermonkey/jsparagus/blob/212f6bdbc2cae909e7d5cfebf36284560c3c4ef4/crates/parser/src/lexer.rs#L2256

Maybe there is a usage problem in our lexer code?

overlookmotel · 2023-12-27T02:01:01Z

First go at a new Atom implementation here: https://github.com/overlookmotel/oxc/tree/new-atom/crates/oxc_atom
(lots of TODO comments)

A few questions:

Mutablility

Are there plans/potential needs in future for Atoms to be mutable?

What I've done so far assumes not, and therefore can reduce size of Atom to 16 bytes, and make the implementation much simpler and more performant.

Use references?

Should all strings be added to the arena? Or if there's an existing reference to the string's content whose lifetime is long enough, can Atom just store a reference to it e.g. if the string in the source code being parsed?

Advantages of storing references:

Lower memory usage.
Parsing should be faster as the majority of strings and identifiers can just be pointed to in source code, so copying strings into the arena will be much reduced.
Cloning Atom becomes very cheap (no copying string content).

Advantage of putting everything in the arena is that strings will always be close in arena to the AST nodes which use them, so better cache locality. Then again, strings also bloat the arena, so possibly result in more swapping in and out of the cache while traversing an AST.

My current WIP is based on storing references.

Support for 32-bit systems

How important is support for 32-bit systems?

What I've done so far produces a maximum cap on length of any individual string of 512 MiB on 32-bit machines (no practical limit for 64-bit). CompactString has a workaround for that limit, but implementing it here would be horrible, as it doesn't mesh well with arenas.

Personally, I imagine 512 MiB is big enough for almost all use cases. And does anyone actually use 32-bit machines any more?

overlookmotel · 2023-12-27T02:21:16Z

On your other points:

I'm not sure I understand the point of the hack you mentioned in this context. As far as I can see, the reason that code is tricking the compiler that strings have a 'static lifetime is in order to use them as keys in a hash map.

It needs to do that because it's de-duplicating strings. But I thought you'd found that approach was slower due to the locks and hash table lookups, and discounted it?

Or am I missing the point?

I figured out why string content appears 3 times in the arena. It's because bumpalo has the unfortunate property that it grows downward. i.e. in each chunk the first allocation is at end of the chunk, the next allocation before it, and so on.

So when AutoCow pushes to the bumpalo::collections::Vec, each time the Vec has to resize, it can't grow forwards and instead creates a new allocation earlier in the chunk, copies the data earlier, and carries on. The 3rd incidence of the string I noticed was the leftover from that process.

I imagine this same effect also occurs when building any other Vec in the parser e.g. Vec<Statement>. It probably eats a lot of space in the arena, and results in a lot of memory copying, since Vecs are so common in ASTs.

My first thought is this could be improved on by using a separate "scratch space" allocation which can grow forwards.

Boshen · 2023-12-27T14:46:22Z

Are there plans/potential needs in future for Atoms to be mutable?

No, I haven't encountered a usecase where it needs to be mutable.

Should all strings be added to the arena? Or if there's an existing reference to the string's content whose lifetime is long enough, can Atom just store a reference to it

I wanted to do this for a long time. But handling out &'a str or String<'a> to my downstream tools complicates their code - adding 'a to their code is too anonoying, so I went for a lifetimeless implementation.

If we can hack it and unsafe the 'a, then I'm all in.

how important is support for 32-bit systems?

Your reasoning is solid around 512 MiB 👍

I'm not sure I understand the point of the hack

It's basically this comment:

// SAFETY: we can extend the arena allocation to `'static` because we
// only access these while the arena is still alive.

removing the lifetime will make all downstream code easier to write, since people don't need to carry around the 'a everywhere.

Although we can force an allocation for external use ... so everything is still safe Rust.

I figured out why string content appears 3 times in the arena.

TIL. We can create a separate discussion around this topic. Could be fun to optimize it, or perhaps wasting a lot of time 😅

overlookmotel · 2023-12-27T17:00:14Z

Thanks for coming back.

I wanted to do this for a long time. But handling out &'a str or String<'a> to my downstream tools complicates their code - adding 'a to their code is too anonoying, so I went for a lifetimeless implementation.

If we can hack it and unsafe the 'a, then I'm all in.

Hmm. Well yes we could hack it, but then wouldn't it be unsound? You could drop the String containing the source code, and continue using the AST which contains references to the string's contents, which would be a use-after-free.

I'm also a little confused as to why Atom<'a> having a lifetime is problematic. Most AST node types have an 'a lifetime already, so why is Atom<'a>'s lifetime more annoying than Statement<'a>'s?

Also, NumberLiteral's raw field is a &'a str.

If the problem is that the source code is stored separately from the arena, so it produces a 2nd lifetime to deal with, how about Allocator storing an immutable reference to the source code buffer along with the bumpalo arena? That'd produce an invariant that the source code buffer must live longer than the arena, and then a consumer only has 1 lifetime to deal with. i.e.:

let src = "x = 123".to_string();
let alloc = Allocator::from_src(&src);
// `src` must live as long as `alloc`

I think it'd be a shame to leave this potential performance gain unrealised, as it could be significant for any code containing long strings (which would include a lot of UI code).

Out of interest, what downstream tools are you talking about? Other OXC crates, or external consumers?

I figured out why string content appears 3 times in the arena.

TIL. We can create a separate discussion around this topic. Could be fun to optimize it, or perhaps wasting a lot of time 😅

Solving this problem in AutoCow should not be difficult. The broader problem with Vec is more tricky, and yes, separate issue.

Boshen · 2023-12-28T14:32:53Z

How about this - we can expose different APIs for different usages, i.e.

-> String
-> &'a str
-> CompactString
... maybe other variations

Out of interest, what downstream tools are you talking about? Other OXC crates, or external consumers?

All crates other than the AST crate are considered external crates:

oxc_linter reads strings everywhere
oxc_semantic exposes a bunch of public facing APIs with Atoms - https://github.com/oxc-project/oxc/blob/main/crates/oxc_syntax/src/module_record.rs
The Rolldown bundler (which isn't public yet), reads these Atoms and stores them into their own data structures

I love the overall direction this issue is going! @overlookmotel do you wanna find a way to hack some of the code so we can get a PR going, and observe performance changes?

Both performance and memory consumption should be improved in theory.

As stated in the policy https://oxc-project.github.io/docs/contribute/rules.html#development-policy

All performance issues (runtime and compilation speed) are considered as bugs in this project.

;-)

Boshen · 2023-12-28T14:36:09Z

I think I've provided enough context for you to make good judgements, you've gained my trust.

Let's break things and have fun!

overlookmotel · 2023-12-29T01:23:14Z

you've gained my trust.

Thanks. Or maybe I've just worn you down with my questions!

But yes, I think I have a fairly clear picture now.

Just one clarification, and 1 last question:

-> String

-> &'a str

-> CompactString

To clarify, you mean that Atom<'a> can have a lifetime, but it should provide methods for easy conversion to lifetime-less structures for consumers who need them
e.g. impl Atom<'a> { fn to_string(&self) -> String }?

And the last question:

Am I right in thinking that OXC places a limit on size of a source file to max u32::MAX bytes (4 GiB)? (I'm gathering that from Span's start and end being u32s). Therefore string length can be capped at u32::MAX too?

That'd free 4 bytes in Atom for additional metadata (which I have some ideas how to use).

Boshen · 2023-12-29T02:25:48Z

To clarify, you mean that Atom<'a> can have a lifetime, but it should provide methods for easy conversion to lifetime-less structures for consumers who need them
e.g. impl Atom<'a> { fn to_string(&self) -> String }?

Yes, I think we should allow the user (other crates) to pick for their use case so everybody can make their own tradeoffs, as well as stay in safe Rust.

Boshen · 2023-12-29T02:26:30Z

Am I right in thinking that OXC places a limit on size of a source file to max u32::MAX bytes (4 GiB)? (I'm gathering that from Span's start and end being u32s). Therefore string length can be capped at u32::MAX too?

Yes!

oxc/crates/oxc_span/src/span.rs

Line 14 in 1957770

    
           /// NOTE: `u32` is sufficient for "all" reasonable programs. Larger than u32 is a 4GB JS file.

overlookmotel · 2023-12-29T03:07:15Z

OK great. Thanks for the answers.

I have a WIP that ticks most of the boxes, but now that holiday is coming to an end I'll have much less time to work on it. Please excuse me if I don't come back with a PR for a while.

Boshen · 2023-12-29T03:41:21Z

OK great. Thanks for the answers.

I have a WIP that ticks most of the boxes, but now that holiday is coming to an end I'll have much less time to work on it. Please excuse me if I don't come back with a PR for a while.

Take your time, I'm on holiday as well 😁

overlookmotel · 2023-12-30T02:20:25Z

Sorry, one more question: How much should we care about 32-bit systems? Are they a primary target for OXC?

Reason I ask is implementation would be simpler if could make Atom 12 bytes on 32-bit systems (rather than 8 bytes).

On 64-bit systems it'll be 16 bytes regardless.

Boshen · 2023-12-30T05:04:09Z

Sorry, one more question: How much should we care about 32-bit systems? Are they a primary target for OXC?

Reason I ask is implementation would be simpler if could make Atom 12 bytes on 32-bit systems (rather than 8 bytes).

On 64-bit systems it'll be 16 bytes regardless.

Not that much. Not a primary target.

overlookmotel · 2023-12-30T12:36:30Z

Thanks. I'll get on with the implementation. Will come back if more questions, but hopefully I have all the info I need now.

Boshen · 2024-01-05T16:09:01Z

Why do we need Atom<'a> in the first place if we replace everything to just &'a str on all the AST nodes? Maybe I over-designed it ... we don't actually need to intern nor a lifetimeless string form 🤔

If downstream tool needs a lifetimeless string, they should be able to use CompactStr::from?

overlookmotel · 2024-01-06T02:57:11Z

Yes, I was thinking along similar lines last night.

&'a str is certainly simpler, and would be faster in the parser, as no mucking about figuring out if whether to store a string inline or not.

But I can still see a few advantages to an Atom<'a> type which stores short strings inline:

Faster reading strings

When printing an AST back to Javascript, when the printer holds an Atom, if if the string is short enough to be stored inline, then no further memory lookup is required to access it.

This would be the case for most identifiers, as typically they'll be under 16 bytes.

Faster comparisons

During e.g. scope analysis, I'd imagine there's a lot of comparing one identifier's name to another's. When those identifiers are 16 chars or less, that should be faster if they're stored inline because:

a. The data is right there inline in the Atom. Likely to be in cache.
b. You don't need to worry about reading out of bounds the way you do with a &str, so comparing two inline Atoms is equivalent to comparing two u128s. With AVX, that's only 3 instructions (https://godbolt.org/z/Gvoqc7PYd)

Space for extra data

&str uses 8 bytes for string length. In the context of OXC, we only need 4 bytes, as a string can't be larger than 4 GiB.

A custom Atom type can use those 4 spare bytes for something else. They could, for example, contain a flag for "is ASCII", or the string length in UTF16 characters (to help with #959).

(1) and (2) are the main considerations. Basically, it's a trade-off between speed of creating strings and of reading them. Maybe &str will turn out to be the better choice after all, but maybe not.

I'm most of the way there with the inline Atom implementation, so if you can wait the month or so it'll take me to finish it up (this month is hellish at work, but things will calm down in Feb), we can benchmark the 2 options and see what that reveals.

Does that sound like a reasonable plan?

Boshen · 2024-01-06T03:50:37Z

Does that sound like a reasonable plan?

Yes yes! I totally forgot that all we want is 😅

enum Atom {
  Inline(..),
  ArenaAllocated(..)
}

This PR partially fixes #1803 and is part of #1800. The BigInt value is now boxed, so it can be dropped along with the AST. It is also removed from the `Token` value, so that the token size can be reduced once we removed all the variants.

overlookmotel · 2024-01-08T13:33:26Z

Thanks!

aapoalas · 2024-01-14T16:03:43Z

👍 I just ran into this with parsing a trivial script containing only 3n using MIRI.

Boshen · 2024-01-15T02:36:26Z

👍 I just ran into this with parsing a trivial script containing only 3n using MIRI.

@aapoalas Do you mind sharing your miri setup? I'd like to add a CI for it.

My previous attempt wasn't robust enough https://github.com/oxc-project/oxc/blob/main/.github/workflows/miri.yml

aapoalas · 2024-01-15T09:22:08Z

👍 I just ran into this with parsing a trivial script containing only 3n using MIRI.

@aapoalas Do you mind sharing your miri setup? I'd like to add a CI for it.

My previous attempt wasn't robust enough https://github.com/oxc-project/oxc/blob/main/.github/workflows/miri.yml

Unfortunately I only just ran the rustup script to add nightly toolchain, and then ran cargo +nightly miri test, nothing more than that. It doesn't seem particularly more robust than your workflow.

Boshen · 2024-01-15T12:56:46Z

Weird, I tried cargo miri test -p oxc_parser -p oxc_ast -p oxc_allocator -p oxc_semantic in CI and it passed 🤔 https://github.com/oxc-project/oxc/actions/runs/7529087020/job/20492604930

If I run the whole test suite it throws something from crossbeam

   --> /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/crossbeam-epoch-0.9.17/src/atomic.rs:204:11
    |
204 |         &*(ptr as *const T)
    |           ^^^^^^^^^^^^^^^^^ integer-to-pointer cast
    |
    = help: This program is using integer-to-pointer casts or (equivalently) `ptr::from_exposed_addr`,
    = help: which means that Miri might miss pointer bugs in this program.
    = help: See https://doc.rust-lang.org/nightly/std/ptr/fn.from_exposed_addr.html for more details on that operation.
    = help: To ensure that Miri does not miss bugs in your program, use Strict Provenance APIs (https://doc.rust-lang.org/nightly/std/ptr/index.html#strict-provenance, https://crates.io/crates/sptr) instead.
    = help: You can then pass the `-Zmiri-strict-provenance` flag to Miri, to ensure you are not relying on `from_exposed_addr` semantics.
    = help: Alternatively, the `-Zmiri-permissive-provenance` flag disables this warning.

https://github.com/oxc-project/oxc/actions/runs/7528975555/job/20492273972

overlookmotel · 2024-01-15T14:17:42Z

Is it possible that the tests don't create any AST which contains either BigInts, or strings/identifiers longer than 24 bytes? BigInt always heap allocates, but CompactStr stores strings of up to 24 bytes inline, and only heap allocates for strings longer than that.

If the tests don't do either of these things, they won't trigger the memory leaks.

I imagine running the parser on the files used for the conformance tests would allow miri to find the leaks.

Boshen · 2024-01-15T15:05:17Z

Thank you both. Replicated

https://github.com/oxc-project/oxc/actions/runs/7530649712/job/20497529584

error: memory leaked: alloc57077 (Rust heap, size: 8, align: 8), allocated here:
    --> /home/runner/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:98:9
     |
98   |         __rust_alloc(layout.size(), layout.align())
     |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     |
     = note: inside `std::alloc::alloc` at /home/runner/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:98:9: 98:52
     = note: inside `std::alloc::Global::alloc_impl` at /home/runner/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:181:73: 181:86
     = note: inside `<std::alloc::Global as std::alloc::Allocator>::allocate` at /home/runner/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:241:9: 241:39
     = note: inside `alloc::raw_vec::RawVec::<u64>::allocate_in` at /home/runner/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/raw_vec.rs:199:45: 199:67
     = note: inside `alloc::raw_vec::RawVec::<u64>::with_capacity_in` at /home/runner/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/raw_vec.rs:145:9: 145:69
     = note: inside `std::vec::Vec::<u64>::with_capacity_in` at /home/runner/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/mod.rs:672:20: 672:61
     = note: inside `std::vec::Vec::<u64>::with_capacity` at /home/runner/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/mod.rs:481:9: 481:49
     = note: inside `num_bigint::biguint::convert::from_radix_digits_be` at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/num-bigint-0.4.4/src/biguint/convert.rs:122:20: 122:74
     = note: inside `num_bigint::biguint::convert::<impl num_traits::Num for oxc_ast::BigUint>::from_str_radix` at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/num-bigint-0.4.4/src/biguint/convert.rs:269:13: 269:44
     = note: inside `num_bigint::bigint::convert::<impl num_traits::Num for num_bigint::BigInt>::from_str_radix` at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/num-bigint-0.4.4/src/bigint/convert.rs:39:18: 39:51
     = note: inside `num_bigint::BigInt::parse_bytes` at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/num-bigint-0.4.4/src/bigint.rs:675:9: 675:41
note: inside `lexer::number::parse_big_int`
    --> crates/oxc_parser/src/lexer/number.rs:97:5
     |
97   |     BigInt::parse_bytes(s.as_bytes(), radix).ok_or("invalid bigint")
     |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Boshen · 2024-01-17T07:36:21Z

Found some arcane code from ratel, where we can use for our atom:

https://github.com/ratel-rust/ratel-core/blob/4929f0ecdcd470346c9dba5d7f570eb960cc312d/core/src/ast/ident.rs#L128-L135

overlookmotel · 2024-01-17T14:05:54Z

Here's where I'm up to with Atom so far:
https://github.com/overlookmotel/oxc/tree/new-atom/crates/oxc_atom/src

It's heavily based on CompactStr, and I've gone fairly deep figuring out how all their optimizations work (and also finding a couple of my own).

But thanks for link to Ratel's implementation - I'll see if anything we can use from it which is specific to strings in JS.

Working on Atom has also thrown up some thoughts which are a bit broader about the parser/lexer overall, and would be good to resolve before locking Atom's API. For various reasons, I think API should be more constrained than what I have now. I'll raise issues about these questions when I get time.

One question for now: How relaxed are you about breaking changes?

I mean: Is it better to work out API design thorougly before implementing, so the breaking changes happens all in one go? Or is it OK to merge an initial implementation (including breaking changes) and then modify API again later (more breaking changes)?

overlookmotel · 2024-01-17T14:19:16Z

Also, a naming thing: Should we change the name from Atom to just String?

The name Atom suggest that identical strings are deduplicated. That used to be that case, but it isn't any more, so the name is now misleading.

Boshen · 2024-01-18T13:10:53Z

Also, a naming thing: Should we change the name from Atom to just String?

Can do since we are not 1.0. But there are many projects that dependents on oxc now, so we'll need to introduce some type alias and the #[deprecated] macro to smooth out the transition.

aapoalas · 2024-01-18T14:07:08Z

Agree that a deprecation period is nice. Our JS engine uses Atom frankly a bit too much. We're just a hobby project so it doesn't really matter that we break though.

overlookmotel · 2024-01-18T14:23:57Z

OK, understood. How about breaking changes to the methods of the type?

The simplest starting point is to have these methods to create an Atom/String:

fn new(s: &'a str) -> Atom<'a>;
fn new_in(s: &'a str, alloc: &'a Allocator) -> Atom<'a>;
// Limited to 16 bytes max
const fn new_const(s: &'static str) -> Atom<'static>;

But I think in future it'd be better to remove new() and replace with:

fn new_from_source(SourceSlice<'a>) -> Atom<'a>;

SourceSlice here is a string slice which is guaranteed to be part of the source code. Constraining Atoms to only be able to reference string content which is either in source code or stored in the arena will have benefits both in performance and functionality.

So my question is: Is it a problem to do the initial implementation first, and then have the breaking change of removing new() later as a further breaking change? (NB deprecation would not be possible on that change, as it would be fundamental to Atom's internals).

Boshen · 2024-01-19T03:20:00Z

Is it a problem to do the initial implementation first, and then have the breaking change of removing new() later as a further breaking change?

Not a problem. We'll need to keep a change log from now on :-)

overlookmotel · 2024-01-19T04:28:26Z

OK great. I'm working on something else at the moment, but will come back to this soon.

overlookmotel · 2024-01-20T21:15:04Z

See comment on #2101:

49% of the entire run time lexing checker.ts is spent in bumpalo::collections::string::String::push.

If we can replace usage of Bumpalo's String with something else, that'd produce an 10-15% speed-up in parsing some inputs.

overlookmotel · 2024-01-21T01:20:34Z

Previous comment turned out to be erroneous. But discovered a couple of other things along the way which are relevant to this issue:

AutoCow is inefficient and could be improved by pushing non-escaped chunks as string slices, rather than pushing many chars 1 by 1.
Bumpalo's String::push_str is extremely slow for similar reasons - it pushes each byte of the string 1 by 1, rather than copying a big chunk all at once.

relates #1803

…_allocator closes #1803 This `String` is currently unsafe, but I won't to get miri working before introducing more changes.

…_allocator (#2294) closes #1803 This string is currently unsafe, but I want to get miri working before introducing more changes. I want to make a progress from memory leak to unsafe then to safety. It's harder to do the steps in one go.

This PR partially fixes oxc-project#1803 and is part of oxc-project#1880. BigInt is removed from the `Token` value, so that the token size can be reduced once we removed all the variants. `Token` is now also `Copy`, which removes all the `clone` and `drop` calls. This yields 5% performance improvement for the parser.

…_allocator (oxc-project#2294) closes oxc-project#1803 This string is currently unsafe, but I want to get miri working before introducing more changes. I want to make a progress from memory leak to unsafe then to safety. It's harder to do the steps in one go.

Boshen added the A-parser Area - Parser label Dec 25, 2023

overlookmotel mentioned this issue Jan 3, 2024

Reduce size of Token? #1880

Closed

Boshen mentioned this issue Jan 7, 2024

refactor(parser): remove TokenValue::BigInt from Token #1924

Merged

Boshen reopened this Jan 8, 2024

Boshen added a commit that referenced this issue Jan 29, 2024

feat(allocator,ast): add BoxWithDrop<BigInt>

fb3d877

relates #1803

Boshen mentioned this issue Jan 29, 2024

feat(allocator,ast): add BoxWithDrop<BigInt> #2201

Closed

Boshen added a commit that referenced this issue Feb 4, 2024

feat(span): fix memory leak by implementing inlineable string for oxc…

8d4b0ef

…_allocator closes #1803 This `String` is currently unsafe, but I won't to get miri working before introducing more changes.

Boshen mentioned this issue Feb 4, 2024

feat(span): fix memory leak by implementing inlineable string for oxc_allocator #2294

Merged

Boshen added a commit that referenced this issue Feb 4, 2024

feat(span): fix memory leak by implementing inlineable string for oxc…

ba1e860

…_allocator closes #1803 This `String` is currently unsafe, but I won't to get miri working before introducing more changes.

Boshen added a commit that referenced this issue Feb 4, 2024

feat(span): fix memory leak by implementing inlineable string for oxc…

9099d5c

…_allocator closes #1803 This `String` is currently unsafe, but I won't to get miri working before introducing more changes.

Boshen closed this as completed in #2294 Feb 4, 2024

Boshen mentioned this issue Feb 4, 2024

Make Atom safe and faster #2295

Closed

overlookmotel mentioned this issue Feb 27, 2024

Proper string type for rolldown rolldown/rolldown#427

Open

Memory leak in arena with Atoms #1803

Memory leak in arena with Atoms #1803

Comments

overlookmotel commented Dec 24, 2023 • edited

The problem

Possible solutions

1. Add impl Drop to Box

2. Track and drop Atoms manually

3. Replace CompactString

Further optimizations

overlookmotel commented Dec 24, 2023 • edited

Boshen commented Dec 25, 2023

Boshen commented Dec 25, 2023

Boshen commented Dec 25, 2023

overlookmotel commented Dec 27, 2023 • edited

Mutablility

Use references?

Support for 32-bit systems

overlookmotel commented Dec 27, 2023

Boshen commented Dec 27, 2023

overlookmotel commented Dec 27, 2023 • edited

Boshen commented Dec 28, 2023

Boshen commented Dec 28, 2023

overlookmotel commented Dec 29, 2023 • edited

Boshen commented Dec 29, 2023

Boshen commented Dec 29, 2023 • edited

overlookmotel commented Dec 29, 2023 • edited

Boshen commented Dec 29, 2023

overlookmotel commented Dec 30, 2023 • edited

Boshen commented Dec 30, 2023

overlookmotel commented Dec 30, 2023

Boshen commented Jan 5, 2024

overlookmotel commented Jan 6, 2024

Boshen commented Jan 6, 2024

overlookmotel commented Jan 8, 2024

aapoalas commented Jan 14, 2024

Boshen commented Jan 15, 2024

aapoalas commented Jan 15, 2024

Boshen commented Jan 15, 2024

overlookmotel commented Jan 15, 2024

Boshen commented Jan 15, 2024

Boshen commented Jan 17, 2024

overlookmotel commented Jan 17, 2024

overlookmotel commented Jan 17, 2024

Boshen commented Jan 18, 2024

aapoalas commented Jan 18, 2024

overlookmotel commented Jan 18, 2024 • edited

Boshen commented Jan 19, 2024

overlookmotel commented Jan 19, 2024

overlookmotel commented Jan 20, 2024 • edited

overlookmotel commented Jan 21, 2024 • edited

Memory leak in arena with `Atom`s #1803

Memory leak in arena with `Atom`s #1803

overlookmotel commented Dec 24, 2023 •

edited

1. Add `impl Drop` to `Box`

2. Track and drop `Atom`s manually

3. Replace `CompactString`

overlookmotel commented Dec 24, 2023 •

edited

overlookmotel commented Dec 27, 2023 •

edited

overlookmotel commented Dec 27, 2023 •

edited

overlookmotel commented Dec 29, 2023 •

edited

Boshen commented Dec 29, 2023 •

edited

overlookmotel commented Dec 29, 2023 •

edited

overlookmotel commented Dec 30, 2023 •

edited

overlookmotel commented Jan 18, 2024 •

edited

overlookmotel commented Jan 20, 2024 •

edited

overlookmotel commented Jan 21, 2024 •

edited