Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak in arena with Atoms #1803

Closed
overlookmotel opened this issue Dec 24, 2023 · 39 comments · Fixed by #1924 or #2294
Closed

Memory leak in arena with Atoms #1803

overlookmotel opened this issue Dec 24, 2023 · 39 comments · Fixed by #1924 or #2294
Labels
A-parser Area - Parser

Comments

@overlookmotel
Copy link
Collaborator

overlookmotel commented Dec 24, 2023

OXC appears to leak memory when source includes any strings over 24 bytes in length.

A demonstration here: https://github.com/overlookmotel/oxc/tree/leak-test

Checkout that branch, and cargo run -p test_leak.

The problem

I believe the problem is this:

  • Strings are stored as Atoms.
  • Atom wraps CompactString from compact_str.
  • CompactString allocates to the heap if string is longer than 24 bytes.
  • Atoms may appear anywhere in the AST, likely within a Box.
  • Implementation of Box used for the arena has no Drop impl.
  • So when parsed Program is dropped, Atoms don't get dropped, and the heap allocations CompactString made are not freed.

In the demo above, I've added impl Drop for Atom which logs when an Atom is dropped. It never gets called, demonstrating that the heap memory is leaked.

Possible solutions

1. Add impl Drop to Box

This would be like bumpalo::collections::Box.

Problem is that this would make dropping the AST more expensive, negating a major benefit of using an arena.

2. Track and drop Atoms manually

Store pointers to all non-inline Atoms as they're created. A Drop impl on the arena could then drop them when the arena is dropped.

3. Replace CompactString

Replace CompactString with a similar "small string" implementation which stores long strings in the bumpalo arena, instead of the general heap.

As well as solving the leak, I believe this would also be more performant:

  1. Less heap allocations.
  2. Strings would be stored close to the AST nodes containing them, producing better cache locality.

CompactString is quite complicated, but OXC only uses a small fraction of CompactString's capabilities, so creating a custom small string implementation in OXC I don't think would be too tricky.

In addition:

CompactString is built as a replacement for std's String, and includes a capacity field to allow resizing. But OXC's Atom is immutable, so the capacity field is not required. All that's needed is a Box<str> equivalent.

So a new "small string" type could lose the capacity field and be reduced in size to 16 bytes, rather than the current 24. Obviously, only strings up to 16 bytes could then be stored inline, but still most JS identifiers are under 16 characters, and if string content is stored in the arena, storing longer strings "out of line" has less performance penalty.

Further optimizations

For identifers without escapes (pretty much all identifiers) and string literals without escape sequences (most strings), Atom could just be a pointer to the string in source code, rather than copying the string into the arena.

Secondly, when lexing strings and identifiers containing escape sequences, AutoCow creates a string in the arena. That string can be pointed to by Atom rather than copied to another location in the arena.

In fact, escaped strings for some reason currently are stored in the arena three times, though I'm not sure why. Dump of arena contents after parsing x = 'ABCDE\n';:

|0c000000 00000000 486d602e 01000000| ........Hm`..... 00000000
|00000000 0e000000 00000000 00000000| ................ 00000010
|00000000 00000000 b06d602e 01000000| .........m`..... 00000020
|05000000 00000000 806d602e 01000000| .........m`..... 00000030
|00000000 0d000000 0098aa6f 01000000| ...........o.... 00000040
|04000000 0d000000 41424344 450a0000| ........ABCDE... 00000050
|00000000 00000000 00000000 000000c6| ................ 00000060
|f0010000 00004142 4344450a 42434445| ......ABCDE.BCDE 00000070
|00000000 01000000 00000000 01000000| ................ 00000080
|78000000 00000000 00000000 00000000| x............... 00000090
|00000000 000000c1 0090aa6f 01000000| ...........o.... 000000a0
@overlookmotel
Copy link
Collaborator Author

overlookmotel commented Dec 24, 2023

For similar reasons, BigIntLiteral also leaks memory because num_bigint::BigInt stores data on the heap.

Any other structures which store data on the heap will also leak, though I couldn't see any other cases at first glance.

@Boshen
Copy link
Member

Boshen commented Dec 25, 2023

Thank you for the in-depth research.

For strings, my previous attempt was #451, I think we can look into this problem space a bit more to find a good solution.

For big ints, the usages are rare - perhaps we could box it into the arena?

@overlookmotel Would you like to explore and find a way to mitigate these memory leak problems with us? Thank you in advance 🙇

@Boshen
Copy link
Member

Boshen commented Dec 25, 2023

Let's start with option 3 (replace compact string).

I still want to try the hack from https://github.com/rust-lang/rust/blob/f9097f87c9c094f80826fb60a1a624b5f9f1ed82/compiler/rustc_span/src/symbol.rs#L1999-L2007.

@Boshen
Copy link
Member

Boshen commented Dec 25, 2023

In fact, escaped strings for some reason currently are stored in the arena three times, though I'm not sure why. Dump of arena contents after parsing x = 'ABCDE\n';:

This is going to be fun bug ... AutoCow is copied from jsparagus https://github.com/mozilla-spidermonkey/jsparagus/blob/212f6bdbc2cae909e7d5cfebf36284560c3c4ef4/crates/parser/src/lexer.rs#L2256

Maybe there is a usage problem in our lexer code?

@Boshen Boshen added the A-parser Area - Parser label Dec 25, 2023
@overlookmotel
Copy link
Collaborator Author

overlookmotel commented Dec 27, 2023

First go at a new Atom implementation here: https://github.com/overlookmotel/oxc/tree/new-atom/crates/oxc_atom
(lots of TODO comments)

A few questions:

Mutablility

Are there plans/potential needs in future for Atoms to be mutable?

What I've done so far assumes not, and therefore can reduce size of Atom to 16 bytes, and make the implementation much simpler and more performant.

Use references?

Should all strings be added to the arena? Or if there's an existing reference to the string's content whose lifetime is long enough, can Atom just store a reference to it e.g. if the string in the source code being parsed?

Advantages of storing references:

  1. Lower memory usage.
  2. Parsing should be faster as the majority of strings and identifiers can just be pointed to in source code, so copying strings into the arena will be much reduced.
  3. Cloning Atom becomes very cheap (no copying string content).

Advantage of putting everything in the arena is that strings will always be close in arena to the AST nodes which use them, so better cache locality. Then again, strings also bloat the arena, so possibly result in more swapping in and out of the cache while traversing an AST.

My current WIP is based on storing references.

Support for 32-bit systems

How important is support for 32-bit systems?

What I've done so far produces a maximum cap on length of any individual string of 512 MiB on 32-bit machines (no practical limit for 64-bit). CompactString has a workaround for that limit, but implementing it here would be horrible, as it doesn't mesh well with arenas.

Personally, I imagine 512 MiB is big enough for almost all use cases. And does anyone actually use 32-bit machines any more?

@overlookmotel
Copy link
Collaborator Author

On your other points:

I'm not sure I understand the point of the hack you mentioned in this context. As far as I can see, the reason that code is tricking the compiler that strings have a 'static lifetime is in order to use them as keys in a hash map.

It needs to do that because it's de-duplicating strings. But I thought you'd found that approach was slower due to the locks and hash table lookups, and discounted it?

Or am I missing the point?

I figured out why string content appears 3 times in the arena. It's because bumpalo has the unfortunate property that it grows downward. i.e. in each chunk the first allocation is at end of the chunk, the next allocation before it, and so on.

So when AutoCow pushes to the bumpalo::collections::Vec, each time the Vec has to resize, it can't grow forwards and instead creates a new allocation earlier in the chunk, copies the data earlier, and carries on. The 3rd incidence of the string I noticed was the leftover from that process.

I imagine this same effect also occurs when building any other Vec in the parser e.g. Vec<Statement>. It probably eats a lot of space in the arena, and results in a lot of memory copying, since Vecs are so common in ASTs.

My first thought is this could be improved on by using a separate "scratch space" allocation which can grow forwards.

@Boshen
Copy link
Member

Boshen commented Dec 27, 2023

Are there plans/potential needs in future for Atoms to be mutable?

No, I haven't encountered a usecase where it needs to be mutable.

Should all strings be added to the arena? Or if there's an existing reference to the string's content whose lifetime is long enough, can Atom just store a reference to it

I wanted to do this for a long time. But handling out &'a str or String<'a> to my downstream tools complicates their code - adding 'a to their code is too anonoying, so I went for a lifetimeless implementation.

If we can hack it and unsafe the 'a, then I'm all in.

how important is support for 32-bit systems?

Your reasoning is solid around 512 MiB 👍

I'm not sure I understand the point of the hack

It's basically this comment:

// SAFETY: we can extend the arena allocation to `'static` because we
// only access these while the arena is still alive.

removing the lifetime will make all downstream code easier to write, since people don't need to carry around the 'a everywhere.

Although we can force an allocation for external use ... so everything is still safe Rust.

I figured out why string content appears 3 times in the arena.

TIL. We can create a separate discussion around this topic. Could be fun to optimize it, or perhaps wasting a lot of time 😅

@overlookmotel
Copy link
Collaborator Author

overlookmotel commented Dec 27, 2023

Thanks for coming back.

I wanted to do this for a long time. But handling out &'a str or String<'a> to my downstream tools complicates their code - adding 'a to their code is too anonoying, so I went for a lifetimeless implementation.

If we can hack it and unsafe the 'a, then I'm all in.

Hmm. Well yes we could hack it, but then wouldn't it be unsound? You could drop the String containing the source code, and continue using the AST which contains references to the string's contents, which would be a use-after-free.

I'm also a little confused as to why Atom<'a> having a lifetime is problematic. Most AST node types have an 'a lifetime already, so why is Atom<'a>'s lifetime more annoying than Statement<'a>'s?

Also, NumberLiteral's raw field is a &'a str.

If the problem is that the source code is stored separately from the arena, so it produces a 2nd lifetime to deal with, how about Allocator storing an immutable reference to the source code buffer along with the bumpalo arena? That'd produce an invariant that the source code buffer must live longer than the arena, and then a consumer only has 1 lifetime to deal with. i.e.:

let src = "x = 123".to_string();
let alloc = Allocator::from_src(&src);
// `src` must live as long as `alloc`

I think it'd be a shame to leave this potential performance gain unrealised, as it could be significant for any code containing long strings (which would include a lot of UI code).

Out of interest, what downstream tools are you talking about? Other OXC crates, or external consumers?

I figured out why string content appears 3 times in the arena.

TIL. We can create a separate discussion around this topic. Could be fun to optimize it, or perhaps wasting a lot of time 😅

Solving this problem in AutoCow should not be difficult. The broader problem with Vec is more tricky, and yes, separate issue.

@Boshen
Copy link
Member

Boshen commented Dec 28, 2023

How about this - we can expose different APIs for different usages, i.e.

  • -> String
  • -> &'a str
  • -> CompactString
  • ... maybe other variations

Out of interest, what downstream tools are you talking about? Other OXC crates, or external consumers?

All crates other than the AST crate are considered external crates:


I love the overall direction this issue is going! @overlookmotel do you wanna find a way to hack some of the code so we can get a PR going, and observe performance changes?

Both performance and memory consumption should be improved in theory.

As stated in the policy https://oxc-project.github.io/docs/contribute/rules.html#development-policy

All performance issues (runtime and compilation speed) are considered as bugs in this project.

;-)

@Boshen
Copy link
Member

Boshen commented Dec 28, 2023

I think I've provided enough context for you to make good judgements, you've gained my trust.

Let's break things and have fun!

@overlookmotel
Copy link
Collaborator Author

overlookmotel commented Dec 29, 2023

you've gained my trust.

Thanks. Or maybe I've just worn you down with my questions!

But yes, I think I have a fairly clear picture now.

Just one clarification, and 1 last question:

  • -> String
  • -> &'a str
  • -> CompactString

To clarify, you mean that Atom<'a> can have a lifetime, but it should provide methods for easy conversion to lifetime-less structures for consumers who need them
e.g. impl Atom<'a> { fn to_string(&self) -> String }?

And the last question:

Am I right in thinking that OXC places a limit on size of a source file to max u32::MAX bytes (4 GiB)? (I'm gathering that from Span's start and end being u32s). Therefore string length can be capped at u32::MAX too?

That'd free 4 bytes in Atom for additional metadata (which I have some ideas how to use).

@Boshen
Copy link
Member

Boshen commented Dec 29, 2023

To clarify, you mean that Atom<'a> can have a lifetime, but it should provide methods for easy conversion to lifetime-less structures for consumers who need them
e.g. impl Atom<'a> { fn to_string(&self) -> String }?

Yes, I think we should allow the user (other crates) to pick for their use case so everybody can make their own tradeoffs, as well as stay in safe Rust.

@Boshen
Copy link
Member

Boshen commented Dec 29, 2023

Am I right in thinking that OXC places a limit on size of a source file to max u32::MAX bytes (4 GiB)? (I'm gathering that from Span's start and end being u32s). Therefore string length can be capped at u32::MAX too?

Yes!

/// NOTE: `u32` is sufficient for "all" reasonable programs. Larger than u32 is a 4GB JS file.

@overlookmotel
Copy link
Collaborator Author

overlookmotel commented Dec 29, 2023

OK great. Thanks for the answers.

I have a WIP that ticks most of the boxes, but now that holiday is coming to an end I'll have much less time to work on it. Please excuse me if I don't come back with a PR for a while.

@Boshen
Copy link
Member

Boshen commented Dec 29, 2023

OK great. Thanks for the answers.

I have a WIP that ticks most of the boxes, but now that holiday is coming to an end I'll have much less time to work on it. Please excuse me if I don't come back with a PR for a while.

Take your time, I'm on holiday as well 😁

@overlookmotel
Copy link
Collaborator Author

overlookmotel commented Dec 30, 2023

Sorry, one more question: How much should we care about 32-bit systems? Are they a primary target for OXC?

Reason I ask is implementation would be simpler if could make Atom 12 bytes on 32-bit systems (rather than 8 bytes).

On 64-bit systems it'll be 16 bytes regardless.

@Boshen
Copy link
Member

Boshen commented Dec 30, 2023

Sorry, one more question: How much should we care about 32-bit systems? Are they a primary target for OXC?

Reason I ask is implementation would be simpler if could make Atom 12 bytes on 32-bit systems (rather than 8 bytes).

On 64-bit systems it'll be 16 bytes regardless.

Not that much. Not a primary target.

@overlookmotel
Copy link
Collaborator Author

Thanks. I'll get on with the implementation. Will come back if more questions, but hopefully I have all the info I need now.

@Boshen
Copy link
Member

Boshen commented Jan 5, 2024

Why do we need Atom<'a> in the first place if we replace everything to just &'a str on all the AST nodes? Maybe I over-designed it ... we don't actually need to intern nor a lifetimeless string form 🤔

If downstream tool needs a lifetimeless string, they should be able to use CompactStr::from?

@overlookmotel
Copy link
Collaborator Author

Yes, I was thinking along similar lines last night.

&'a str is certainly simpler, and would be faster in the parser, as no mucking about figuring out if whether to store a string inline or not.

But I can still see a few advantages to an Atom<'a> type which stores short strings inline:

  1. Faster reading strings

When printing an AST back to Javascript, when the printer holds an Atom, if if the string is short enough to be stored inline, then no further memory lookup is required to access it.

This would be the case for most identifiers, as typically they'll be under 16 bytes.

  1. Faster comparisons

During e.g. scope analysis, I'd imagine there's a lot of comparing one identifier's name to another's. When those identifiers are 16 chars or less, that should be faster if they're stored inline because:

a. The data is right there inline in the Atom. Likely to be in cache.
b. You don't need to worry about reading out of bounds the way you do with a &str, so comparing two inline Atoms is equivalent to comparing two u128s. With AVX, that's only 3 instructions (https://godbolt.org/z/Gvoqc7PYd)

  1. Space for extra data

&str uses 8 bytes for string length. In the context of OXC, we only need 4 bytes, as a string can't be larger than 4 GiB.

A custom Atom type can use those 4 spare bytes for something else. They could, for example, contain a flag for "is ASCII", or the string length in UTF16 characters (to help with #959).

(1) and (2) are the main considerations. Basically, it's a trade-off between speed of creating strings and of reading them. Maybe &str will turn out to be the better choice after all, but maybe not.

I'm most of the way there with the inline Atom implementation, so if you can wait the month or so it'll take me to finish it up (this month is hellish at work, but things will calm down in Feb), we can benchmark the 2 options and see what that reveals.

Does that sound like a reasonable plan?

@Boshen
Copy link
Member

Boshen commented Jan 6, 2024

Does that sound like a reasonable plan?

Yes yes! I totally forgot that all we want is 😅

enum Atom {
  Inline(..),
  ArenaAllocated(..)
}

Boshen added a commit that referenced this issue Jan 7, 2024
This PR partially fixes #1803 and is part of #1800.

The BigInt value is now boxed, so it can be dropped along with the AST.

It is also removed from the `Token` value, so that the token size can be
reduced once we removed all the variants.
Boshen added a commit that referenced this issue Jan 7, 2024
This PR partially fixes #1803 and is part of #1800.

The BigInt value is now boxed, so it can be dropped along with the AST.

It is also removed from the `Token` value, so that the token size can be
reduced once we removed all the variants.
Boshen added a commit that referenced this issue Jan 7, 2024
This PR partially fixes #1803 and is part of #1800.

The BigInt value is now boxed, so it can be dropped along with the AST.

It is also removed from the `Token` value, so that the token size can be
reduced once we removed all the variants.
Boshen added a commit that referenced this issue Jan 7, 2024
This PR partially fixes #1803 and is part of #1800.

The BigInt value is now boxed, so it can be dropped along with the AST.

It is also removed from the `Token` value, so that the token size can be
reduced once we removed all the variants.
Boshen added a commit that referenced this issue Jan 7, 2024
This PR partially fixes #1803 and is part of #1800.

The BigInt value is now boxed, so it can be dropped along with the AST.

It is also removed from the `Token` value, so that the token size can be
reduced once we removed all the variants.
Boshen added a commit that referenced this issue Jan 7, 2024
This PR partially fixes #1803 and is part of #1800.

The BigInt value is now boxed, so it can be dropped along with the AST.

It is also removed from the `Token` value, so that the token size can be
reduced once we removed all the variants.
@Boshen Boshen reopened this Jan 8, 2024
@overlookmotel
Copy link
Collaborator Author

Thanks!

@aapoalas
Copy link

👍 I just ran into this with parsing a trivial script containing only 3n using MIRI.

@Boshen
Copy link
Member

Boshen commented Jan 15, 2024

👍 I just ran into this with parsing a trivial script containing only 3n using MIRI.

@aapoalas Do you mind sharing your miri setup? I'd like to add a CI for it.

My previous attempt wasn't robust enough https://github.com/oxc-project/oxc/blob/main/.github/workflows/miri.yml

@aapoalas
Copy link

👍 I just ran into this with parsing a trivial script containing only 3n using MIRI.

@aapoalas Do you mind sharing your miri setup? I'd like to add a CI for it.

My previous attempt wasn't robust enough https://github.com/oxc-project/oxc/blob/main/.github/workflows/miri.yml

Unfortunately I only just ran the rustup script to add nightly toolchain, and then ran cargo +nightly miri test, nothing more than that. It doesn't seem particularly more robust than your workflow.

@Boshen
Copy link
Member

Boshen commented Jan 15, 2024

Weird, I tried cargo miri test -p oxc_parser -p oxc_ast -p oxc_allocator -p oxc_semantic in CI and it passed 🤔 https://github.com/oxc-project/oxc/actions/runs/7529087020/job/20492604930

If I run the whole test suite it throws something from crossbeam

   --> /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/crossbeam-epoch-0.9.17/src/atomic.rs:204:11
    |
204 |         &*(ptr as *const T)
    |           ^^^^^^^^^^^^^^^^^ integer-to-pointer cast
    |
    = help: This program is using integer-to-pointer casts or (equivalently) `ptr::from_exposed_addr`,
    = help: which means that Miri might miss pointer bugs in this program.
    = help: See https://doc.rust-lang.org/nightly/std/ptr/fn.from_exposed_addr.html for more details on that operation.
    = help: To ensure that Miri does not miss bugs in your program, use Strict Provenance APIs (https://doc.rust-lang.org/nightly/std/ptr/index.html#strict-provenance, https://crates.io/crates/sptr) instead.
    = help: You can then pass the `-Zmiri-strict-provenance` flag to Miri, to ensure you are not relying on `from_exposed_addr` semantics.
    = help: Alternatively, the `-Zmiri-permissive-provenance` flag disables this warning.

https://github.com/oxc-project/oxc/actions/runs/7528975555/job/20492273972

@overlookmotel
Copy link
Collaborator Author

Is it possible that the tests don't create any AST which contains either BigInts, or strings/identifiers longer than 24 bytes? BigInt always heap allocates, but CompactStr stores strings of up to 24 bytes inline, and only heap allocates for strings longer than that.

If the tests don't do either of these things, they won't trigger the memory leaks.

I imagine running the parser on the files used for the conformance tests would allow miri to find the leaks.

@Boshen
Copy link
Member

Boshen commented Jan 15, 2024

Thank you both. Replicated

https://github.com/oxc-project/oxc/actions/runs/7530649712/job/20497529584

error: memory leaked: alloc57077 (Rust heap, size: 8, align: 8), allocated here:
    --> /home/runner/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:98:9
     |
98   |         __rust_alloc(layout.size(), layout.align())
     |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     |
     = note: inside `std::alloc::alloc` at /home/runner/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:98:9: 98:52
     = note: inside `std::alloc::Global::alloc_impl` at /home/runner/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:181:73: 181:86
     = note: inside `<std::alloc::Global as std::alloc::Allocator>::allocate` at /home/runner/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/alloc.rs:241:9: 241:39
     = note: inside `alloc::raw_vec::RawVec::<u64>::allocate_in` at /home/runner/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/raw_vec.rs:199:45: 199:67
     = note: inside `alloc::raw_vec::RawVec::<u64>::with_capacity_in` at /home/runner/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/raw_vec.rs:145:9: 145:69
     = note: inside `std::vec::Vec::<u64>::with_capacity_in` at /home/runner/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/mod.rs:672:20: 672:61
     = note: inside `std::vec::Vec::<u64>::with_capacity` at /home/runner/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/alloc/src/vec/mod.rs:481:9: 481:49
     = note: inside `num_bigint::biguint::convert::from_radix_digits_be` at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/num-bigint-0.4.4/src/biguint/convert.rs:122:20: 122:74
     = note: inside `num_bigint::biguint::convert::<impl num_traits::Num for oxc_ast::BigUint>::from_str_radix` at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/num-bigint-0.4.4/src/biguint/convert.rs:269:13: 269:44
     = note: inside `num_bigint::bigint::convert::<impl num_traits::Num for num_bigint::BigInt>::from_str_radix` at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/num-bigint-0.4.4/src/bigint/convert.rs:39:18: 39:51
     = note: inside `num_bigint::BigInt::parse_bytes` at /home/runner/.cargo/registry/src/index.crates.io-6f17d22bba15001f/num-bigint-0.4.4/src/bigint.rs:675:9: 675:41
note: inside `lexer::number::parse_big_int`
    --> crates/oxc_parser/src/lexer/number.rs:97:5
     |
97   |     BigInt::parse_bytes(s.as_bytes(), radix).ok_or("invalid bigint")
     |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@Boshen
Copy link
Member

Boshen commented Jan 17, 2024

Found some arcane code from ratel, where we can use for our atom:

https://github.com/ratel-rust/ratel-core/blob/4929f0ecdcd470346c9dba5d7f570eb960cc312d/core/src/ast/ident.rs#L128-L135

@overlookmotel
Copy link
Collaborator Author

Here's where I'm up to with Atom so far:
https://github.com/overlookmotel/oxc/tree/new-atom/crates/oxc_atom/src

It's heavily based on CompactStr, and I've gone fairly deep figuring out how all their optimizations work (and also finding a couple of my own).

But thanks for link to Ratel's implementation - I'll see if anything we can use from it which is specific to strings in JS.

Working on Atom has also thrown up some thoughts which are a bit broader about the parser/lexer overall, and would be good to resolve before locking Atom's API. For various reasons, I think API should be more constrained than what I have now. I'll raise issues about these questions when I get time.

One question for now: How relaxed are you about breaking changes?

I mean: Is it better to work out API design thorougly before implementing, so the breaking changes happens all in one go? Or is it OK to merge an initial implementation (including breaking changes) and then modify API again later (more breaking changes)?

@overlookmotel
Copy link
Collaborator Author

Also, a naming thing: Should we change the name from Atom to just String?

The name Atom suggest that identical strings are deduplicated. That used to be that case, but it isn't any more, so the name is now misleading.

@Boshen
Copy link
Member

Boshen commented Jan 18, 2024

Also, a naming thing: Should we change the name from Atom to just String?

Can do since we are not 1.0. But there are many projects that dependents on oxc now, so we'll need to introduce some type alias and the #[deprecated] macro to smooth out the transition.

@aapoalas
Copy link

Agree that a deprecation period is nice. Our JS engine uses Atom frankly a bit too much. We're just a hobby project so it doesn't really matter that we break though.

@overlookmotel
Copy link
Collaborator Author

overlookmotel commented Jan 18, 2024

OK, understood. How about breaking changes to the methods of the type?

The simplest starting point is to have these methods to create an Atom/String:

fn new(s: &'a str) -> Atom<'a>;
fn new_in(s: &'a str, alloc: &'a Allocator) -> Atom<'a>;
// Limited to 16 bytes max
const fn new_const(s: &'static str) -> Atom<'static>;

But I think in future it'd be better to remove new() and replace with:

fn new_from_source(SourceSlice<'a>) -> Atom<'a>;

SourceSlice here is a string slice which is guaranteed to be part of the source code. Constraining Atoms to only be able to reference string content which is either in source code or stored in the arena will have benefits both in performance and functionality.

So my question is: Is it a problem to do the initial implementation first, and then have the breaking change of removing new() later as a further breaking change? (NB deprecation would not be possible on that change, as it would be fundamental to Atom's internals).

@Boshen
Copy link
Member

Boshen commented Jan 19, 2024

Is it a problem to do the initial implementation first, and then have the breaking change of removing new() later as a further breaking change?

Not a problem. We'll need to keep a change log from now on :-)

@overlookmotel
Copy link
Collaborator Author

OK great. I'm working on something else at the moment, but will come back to this soon.

@overlookmotel
Copy link
Collaborator Author

overlookmotel commented Jan 20, 2024

See comment on #2101:

49% of the entire run time lexing checker.ts is spent in bumpalo::collections::string::String::push.

If we can replace usage of Bumpalo's String with something else, that'd produce an 10-15% speed-up in parsing some inputs.

@overlookmotel
Copy link
Collaborator Author

overlookmotel commented Jan 21, 2024

Previous comment turned out to be erroneous. But discovered a couple of other things along the way which are relevant to this issue:

  1. AutoCow is inefficient and could be improved by pushing non-escaped chunks as string slices, rather than pushing many chars 1 by 1.
  2. Bumpalo's String::push_str is extremely slow for similar reasons - it pushes each byte of the string 1 by 1, rather than copying a big chunk all at once.

Boshen added a commit that referenced this issue Jan 29, 2024
Boshen added a commit that referenced this issue Feb 4, 2024
…_allocator

closes #1803

This `String` is currently unsafe, but I won't to get miri working
before introducing more changes.
Boshen added a commit that referenced this issue Feb 4, 2024
…_allocator

closes #1803

This `String` is currently unsafe, but I won't to get miri working
before introducing more changes.
Boshen added a commit that referenced this issue Feb 4, 2024
…_allocator

closes #1803

This `String` is currently unsafe, but I won't to get miri working
before introducing more changes.
Boshen added a commit that referenced this issue Feb 4, 2024
…_allocator (#2294)

closes #1803

This string is currently unsafe, but I want to get miri working before
introducing more changes.

I want to make a progress from memory leak to unsafe then to safety.
It's harder to do the steps in one go.
IWANABETHATGUY pushed a commit to IWANABETHATGUY/oxc that referenced this issue May 29, 2024
This PR partially fixes oxc-project#1803 and is part of oxc-project#1880.

BigInt is removed from the `Token` value, so that the token size can be
reduced once we removed all the variants.

`Token` is now also `Copy`, which removes all the `clone` and `drop`
calls.

This yields 5% performance improvement for the parser.
IWANABETHATGUY pushed a commit to IWANABETHATGUY/oxc that referenced this issue May 29, 2024
…_allocator (oxc-project#2294)

closes oxc-project#1803

This string is currently unsafe, but I want to get miri working before
introducing more changes.

I want to make a progress from memory leak to unsafe then to safety.
It's harder to do the steps in one go.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-parser Area - Parser
Projects
None yet
3 participants