Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src: simplify Javascript code embedding #27095

Closed
wants to merge 3 commits into from

Conversation

refack
Copy link
Contributor

@refack refack commented Apr 4, 2019

Based on #25518 so only 9a5b688 is new code
This simplifies how we embed the javascript into the binary:

  1. All files are stored as uint16_t[] (no need for UnionBytes anymore)
  2. Raw arrays are wrapped in UInt16Span that is independent of V8 (reduced the amount of definition need to be included)
  3. Turned ToStringChecked into a static class method that is the only place that creates UInt16SpanResource
Checklist
  • make -j4 test (UNIX), or vcbuild test (Windows) passes
  • documentation is changed or added
  • commit message follows commit guidelines

@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. lib / src Issues and PRs related to general changes in the lib or src directory. labels Apr 4, 2019
@refack refack requested a review from joyeecheung April 4, 2019 23:20
@refack refack self-assigned this Apr 4, 2019
@refack
Copy link
Contributor Author

refack commented Apr 4, 2019

/CC @nodejs/c++

@refack refack requested a review from devsnek April 4, 2019 23:22
@nodejs-github-bot
Copy link
Collaborator

@joyeecheung
Copy link
Member

All files are stored as uint16_t[] (no need for UnionBytes anymore)

Can you elaborate on why? Encoding files without multibyte characters with one byte strings gives us some memory savings.

@refack
Copy link
Contributor Author

refack commented Apr 4, 2019

Can you elaborate on why? Encoding files without multibyte characters with one byte strings gives us some memory savings.

  1. This is not "real memory" since it comes from the data segment, not the stack or the
    heap (i.e. file mapped).
  2. IIUC it gives us speed improvement since V8 always convert code to TwoByte Strings.
  3. Reduces amount of code, and makes it a little bit simpler.

Hopefully we can follow V8 and eliminate this completely in favor of a code snapshot.

@joyeecheung
Copy link
Member

joyeecheung commented Apr 5, 2019

This is not "real memory" since it comes from the data segment, not the stack or the
heap.

I believe there's still a difference when the code are parsed and stored in v8?

IIUC it gives us speed improvement since V8 always convert code to TwoByte Strings.

Um, I don't think so? At least they are not stored that way, and AstRawString is usually in one bytes in the parsers (because --print-ast etc. does not handle multibyte characters well).

diff --git a/src/objects.cc b/src/objects.cc
index f27f067624..e6e20de9f3 100644
--- a/src/objects.cc
+++ b/src/objects.cc
@@ -5278,6 +5278,7 @@ Handle<Object> SharedFunctionInfo::GetSourceCode(
   if (!shared->HasSourceCode()) return isolate->factory()->undefined_value();
   Handle<String> source(String::cast(Script::cast(shared->script())->source()),
                         isolate);
+  DCHECK(!String::IsOneByteRepresentationUnderneath(*source));
   return isolate->factory()->NewSubString(source, shared->StartPosition(),
                                           shared->EndPosition());
 }
$ out.gn/x64.debug/d8 --allow-natives-syntax -e "print(%FunctionGetSourceCode((a, b) => { a + b; }));"


#
# Fatal error in ../../src/objects.cc, line 5281
# Debug check failed: !String::IsOneByteRepresentationUnderneath(*source).

$ out.gn/x64.debug/d8 --allow-natives-syntax -e "print(%FunctionGetSourceCode((a, b) => { 测试; }));"
(a, b) => { 测试; }

Reduces amount of code, and makes it a little bit simpler.

While I generally agree with this, we usually trade simplicity for size savings and optimization, not the other way around, so someone could just reverse this patch in the name of saving size.

Hopefully we can follow V8 and eliminate this completely in favor of a code snapshot.

I don't think we can eliminate this completely? For one most builtins are not loaded during bootstrap and are not very likely to be used in normal use cases (e.g. scripts in deps). And a lot of builtins are not side-effect-free modules but scripts that can e.g. exit the process when executed so we cannot snapshot them.

@joyeecheung
Copy link
Member

Maybe we should have benchmarks for binary sizes and RSS/heap sizes to determine the actual impact in cases like this..

@refack
Copy link
Contributor Author

refack commented Apr 5, 2019

Maybe we should have benchmarks for binary sizes and RSS/heap sizes to determine the actual impact in cases like this..

I'll do that.

As for code scanning, I was going by this:
image

https://v8.dev/blog/scanner

If I got it the wrong way around, we could (1) save everything as UTF-8 encoded uint8_t, or (2) eliminate code points that are > 255 from our code base (I believe there are ~10)

As for size/speed AFAICT we tend to prefer speed over size (within reason of course):

'FavorSizeOrSpeed': 1, # /Ot, favor speed over size

@refack
Copy link
Contributor Author

refack commented Apr 5, 2019

Ping @bmeurer @hashseed ?

@refack
Copy link
Contributor Author

refack commented Apr 5, 2019

BTW 657c979 is all the unicode code-point in our codebase

@joyeecheung
Copy link
Member

I remember we explicitly leave multibyte characters in our code base to make sure our code compiles that way? (but I have no context anymore, maybe @Fishrock123 @bnoordhuis would know)

@joyeecheung
Copy link
Member

joyeecheung commented Apr 5, 2019

@refack I believe in our case, the implementation that the article talks about translates to https://cs.chromium.org/chromium/src/v8/src/parsing/scanner-character-streams.cc?type=cs&q=BufferedCharacterStream&sq=package:chromium&g=0&l=250 ? It's a utf16 view of a uint8_t buffer, the data is not actually converted EDIT: looks like they are copied in chunks. If there is speed up, you should be able to see from the startup benchmark though.

@joyeecheung
Copy link
Member

Lets see what comes out of the startup benchmark: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/325/

@joyeecheung
Copy link
Member

confidence improvement accuracy (*)   (**)  (***)
09:45:03  misc/startup.js mode='process' script='benchmark/fixtures/require-cachable' dur=1                 0.85 %       ±0.88% ±1.17% ±1.53%
09:45:03  misc/startup.js mode='process' script='test/fixtures/semicolon' dur=1                             0.68 %       ±0.85% ±1.13% ±1.47%
09:45:03  misc/startup.js mode='worker' script='benchmark/fixtures/require-cachable' dur=1           *      1.43 %       ±1.31% ±1.75% ±2.30%
09:45:03  misc/startup.js mode='worker' script='test/fixtures/semicolon' dur=1                             -0.33 %       ±1.43% ±1.92% ±2.52%

From the benchmark

@refack
Copy link
Contributor Author

refack commented Apr 5, 2019

File size has for Windows increased by 8%, so I'd say it's probably not worth it:

D:\refael\Downloads>dir /s node.exe
 Directory of D:\refael\Downloads\node-both-x64
2019-04-04  21:38        26,054,656 node.exe
 Directory of D:\refael\Downloads\node-utf16-x64
2019-04-04  21:28        27,846,144 node.exe

 Directory of D:\refael\Downloads\node-both.exe
2019-04-04  21:23        21,798,912 node.exe
 Directory of D:\refael\Downloads\node-utf16.exe
2019-04-04  21:25        23,717,888 node.exe

@devsnek
Copy link
Member

devsnek commented Apr 5, 2019

i'm not as knowledgeable as joyee is about the compile pipeline but moving to u16 without any significant performance improvement seems like a 👎

@nodejs-github-bot
Copy link
Collaborator

@nodejs-github-bot
Copy link
Collaborator

@refack
Copy link
Contributor Author

refack commented Apr 5, 2019

I've taken this is a different direction. Replaced UnionBytes with std::string_view to see what it takes to get this compiling on all of our CI cluster.

So this is blocked until we get gcc-6 level compilers on all platforms.

@bnoordhuis
Copy link
Member

I remember we explicitly leave multibyte characters in our code base to make sure our code compiles that way? (but I have no context anymore, maybe @Fishrock123 @bnoordhuis would know)

That's correct, see #10673 for background.

@refack refack added the blocked PRs that are blocked by other issues or PRs. label Apr 5, 2019
@Fishrock123
Copy link
Contributor

Yes, some code comments in timers use multi-byte characters (the ascii diagram).

Those characters could be safely discarded in a build if necessary, I think.

@devsnek
Copy link
Member

devsnek commented Apr 5, 2019

Those characters could be safely discarded in a build if necessary, I think.

as long as the discard method preserves the proper string length (because of stack traces) this should be fine.

@jasnell
Copy link
Member

jasnell commented Jun 25, 2020

This stalled and hasn't been moved forward in over a year. Closing but it can be reopened if it is picked back up again

@jasnell jasnell closed this Jun 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked PRs that are blocked by other issues or PRs. c++ Issues and PRs that require attention from people who are familiar with C++. lib / src Issues and PRs related to general changes in the lib or src directory. stalled Issues and PRs that are stalled.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants