Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Atom safe and faster #2295

Closed
Boshen opened this issue Feb 4, 2024 · 11 comments
Closed

Make Atom safe and faster #2295

Boshen opened this issue Feb 4, 2024 · 11 comments
Assignees

Comments

@Boshen
Copy link
Member

Boshen commented Feb 4, 2024

Atom had a memory leak which is not desirable: #1803

I removed the memory and miri test to the parser: #2294

The change introduced some performance regressions and a unsafe

// SAFETY: It is unsafe to use this string after the allocator is dropped.
Self(AtomImpl::Arena(unsafe { std::mem::transmute(s) }))

The next goal is to remove this unsafe.

@overlookmotel
Copy link
Collaborator

Can I tackle this? I already have a WIP which steals a lot of the efficient code of CompactString. @Boshen I know you're aware of that, but obviously you ran out of patience waiting for me (I don't blame you). I seem to have got side-tracked by trying to optimize the lexer!

But now that the memory leak is solved, perhaps it's a little less urgent?

@Boshen
Copy link
Member Author

Boshen commented Feb 4, 2024

I want to enable as many good engineering practices as possible so I was eager to make the code memory-leak free so we can turn on miri.

It wasn't about impatience 😅

@overlookmotel
Copy link
Collaborator

Sorry, that wasn't criticism or a barbed comment. It was more of an apology that I said I'd do it, and then I haven't. I started hacking on the lexer just to gain familiarity with the codebase, but have ended up going down a performance-chasing rabbit hole! It's too much fun.

Anyway, I think I'm getting towards the end of that work now. #2288 is the major building block for lexer optimization, and I should have the PR which builds on it to speed up lexing identifiers (the one that gives the 10% parser speed up) ready in a few days. So should be able to turn my attention to Atom pretty soon.

That sound OK to you?

@Boshen
Copy link
Member Author

Boshen commented Feb 5, 2024

That sound OK to you?

I can't wait! But take your time and enjoy the fun!

@Boshen
Copy link
Member Author

Boshen commented Feb 23, 2024

Instead of exposing a single Atom type, I'm thinking along the lines of:

  • ArenaStr<'a> with the functionaries of inlined string + arena allocated str
  • CompactStr with the functionaries of inlined string + heap allocated str

For maximum performance, ArenaStr<'a> should be the size of u32, with only a pointer and a length (sans capacity).

The ArenaStr<'a> will have to_compact_str and to_string for downstream usages, the user has the choice of choosing the most convenient one for them.

@overlookmotel
Copy link
Collaborator

This is really nice - nice combination of efficiency and usability.

Just to check a couple of things:

ArenaStr<'a> is what will be used in the AST? to_compact_str / to_string are for downstream users, if they don't want to deal with lifetimes in their own code, or if they want to mutate the string?

I'm unclear how ArenaStr<'a> can be size of a u32. I was imagining ArenaStr<'a> being 16 bytes:

  • Bytes 0-7: Pointer
  • Bytes 8-11: Length (u32)
  • Bytes 12-15: Unused (though could find uses for this e.g. index into Vec of escaped strings, or flags for "String is in source code" / "String is escaped" / "String is Unicode").

We could probably squeeze it down smaller using pointer compression, but:

  1. That optimization could also apply to Vecs and Boxs in the AST, so I was thinking of it as a broader issue.
  2. 16 is probably a good size for making majority of strings be stored inline (most JS identifiers are under 16 bytes).

Or have I misunderstood, and you have something else in mind?

@Boshen
Copy link
Member Author

Boshen commented Feb 25, 2024

ArenaStr<'a> is what will be used in the AST?

Yes.

ArenaStr<'a> can be size of a u32. I was imagining ArenaStr<'a> being 16 bytes

16 bytes would be even better! If you can shrink it down :-)

@Boshen
Copy link
Member Author

Boshen commented Feb 25, 2024

Alright, we are going to solve this problem in a few steps:

  • refactor the Atom API without changing the implementation, i.e. add CompactStr and to_compact_str. @Boshen
  • Change some of the downstream usages to use CompactStr, i.e. all crates other than the AST and parser crate should use CompactStr @Boshen
  • Change Atom to ArenaStr<'a> within the ast and parser crates @Boshen
  • Rework the ArenaStr<'a> implementation to make it inlinable. @overlookmotel

@overlookmotel
Copy link
Collaborator

Sounds like a plan!

Do we need CompactStr to be mutable? If not, we could make our own version of CompactStr which has same size and layout as ArenaStr. It would only differ in where it stores out-of-line content for strings over 16 bytes (heap instead of reference to source text/arena).

Advantage of that would be that for strings under 16 bytes, converting from ArenaStr<'a> to CompactStr (or back) would be free - just transmute.

But if CompactStr needs to be mutable, this wouldn't work, as it'd require a capacity field.

@overlookmotel
Copy link
Collaborator

overlookmotel commented Feb 25, 2024

16 bytes would be even better! If you can shrink it down :-)

Yes, this is totally possible. Same size as Box<str>, and use compact_str's trick of smuggling inline/arena discriminant + inline length in the last byte (only byte values 0-191 are legal for last byte of a UTF-8 string, so you have values 192-255 free to represent other states).

Boshen added a commit that referenced this issue Feb 26, 2024
Part of #2295

This PR splits the `Atom` type into `Atom<'a>` and `CompactString`.

All the AST node strings now use `Atom<'a>` instead of `Atom` to signify
it belongs to the arena.

It is now up to the user to select which form of the string to use.

This PR essentially removes the really unsafe code 


https://github.com/oxc-project/oxc/blob/93742f89e91bc7c45491b6d2b49a134fa65f2b5c/crates/oxc_span/src/atom.rs#L98-L107

which can lead to 

![image](https://github.com/oxc-project/oxc/assets/1430279/8c513c4f-19b0-4b63-b61c-e07c187c95b5)
@Boshen
Copy link
Member Author

Boshen commented Feb 26, 2024

Continue on #2516

@Boshen Boshen closed this as completed Feb 26, 2024
IWANABETHATGUY pushed a commit to IWANABETHATGUY/oxc that referenced this issue May 29, 2024
…oject#2497)

Part of oxc-project#2295

This PR splits the `Atom` type into `Atom<'a>` and `CompactString`.

All the AST node strings now use `Atom<'a>` instead of `Atom` to signify
it belongs to the arena.

It is now up to the user to select which form of the string to use.

This PR essentially removes the really unsafe code 


https://github.com/oxc-project/oxc/blob/93742f89e91bc7c45491b6d2b49a134fa65f2b5c/crates/oxc_span/src/atom.rs#L98-L107

which can lead to 

![image](https://github.com/oxc-project/oxc/assets/1430279/8c513c4f-19b0-4b63-b61c-e07c187c95b5)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants