
vm.Script could be used to hide the source by shipping only bytecode. #11842

Closed
hashseed opened this issue Mar 14, 2017 · 13 comments
Labels
discuss Issues opened for discussions and feedbacks. vm Issues and PRs related to the vm subsystem.

Comments

@hashseed
Member

hashseed commented Mar 14, 2017

I was playing with this idea this morning, and @bmeck asked me to put this into words.

vm.Script can be used to produce a cache into a buffer, and also used to load from existing cache produced earlier. With Ignition (bytecode interpreter) launched in V8, we could "abuse" it to only ship bytecode and hide the source.

The thing with Ignition is: once a function has been compiled, we don't need the source code anymore. The optimizing compiler can construct its graph from bytecode alone, so a script can be shipped entirely as bytecode. There are a couple of things missing though. For a proof of concept these issues can be hacked around before actually thinking about changing the V8 API to accommodate them.

  • Eager compilation: every function must already be compiled to bytecode. V8 doesn't do that out of the box, but there is a command-line flag called --serialize_eager that you can turn on to force eager compilation when a code cache is being created.
  • The source: vm.Script always expects the script source to be provided. With Ignition we don't actually need it, but when deserializing we have a checksum to check that the source matches expectations. The checksum is simply the script length at this point, so an empty string of the same length would do.
  • Platform dependency: V8's serializer simply walks and serializes the object graph. In the case of a code cache, we walk the object graph of the function (SharedFunctionInfo). The object layout differs between 32-bit and 64-bit platforms, so the code cache would look different. I'm not 100% sure whether x64 and arm64 would produce the same code cache, either.
  • Version dependency: V8's bytecode is purely internal and not versioned, so for a different version of V8 the bytecode needs to be recompiled.
  • Function.prototype.toString() would just show a window into whatever dummy source was provided. Duh.

Once these issues are solved, you could ship bytecode and hide the source, without worrying about crashing the optimizing compiler.

Oh and this would only work on versions where V8 uses Ignition. For example at this shameless plug.

@bmeck
Member

bmeck commented Mar 14, 2017

@hashseed it sounds like we can provide the source so that things like debuggers can show the source though? I think showing the source can be useful, but avoiding extra parsing and compilation costs would be good.

@hashseed
Member Author

I was just pointing out the possibility of hiding the source, if required by a use case. If the source is available, then there is not much difference from what vm.Script already does now, except for maybe forced eager compilation.

@addaleax addaleax added the vm Issues and PRs related to the vm subsystem. label Mar 14, 2017
@davidmarkclements
Member

davidmarkclements commented Mar 14, 2017

The checksum is really important here I think, for transparency. Say in an open-source situation, you publish the bytecode with the original code; a collision-free checksum provides a guarantee that the bytecode is true to the source. Is there a way to do this without a dummy checksum, using a strong hash for a legit checksum?

@hashseed
Member Author

The header of the code cache contains a bunch of different fields that have to match: V8 version, source length, command line flags, etc. There is also a checksum over the payload, but that's intended for error correction, not security. It uses Fletcher's checksum, so it's fairly easy to find a collision.

@DemiMarie

What about switching to a Blake2b hash? That’s very fast (faster than MD5) and as hard to find collisions in as SHA2 (i.e., impossible).

@refack
Contributor

refack commented Apr 14, 2017

I'll just put this here and walk away...

.pyc

@hashseed
Member Author

What about switching to a Blake2b hash? That’s very fast (faster than MD5) and as hard to find collisions in as SHA2 (i.e., impossible).

Might be worth experimenting with. But you'd still need a safe way to store/transmit the checksum.

As mentioned, the current checksum is to detect accidental data corruption only.

.pyc

What I'm pointing out here is precisely how someone could implement something similar to .pyc for Node.

@bmeck
Member

bmeck commented Apr 14, 2017 via email

@hashseed
Member Author

JS code is usually the smallest representation. Bytecode takes less space than native code, but is still larger than JS source on average.

Code caching for individual files was implemented in V8 about two years ago. Before bytecode, however, the source still needed to be available for parsing when code was recompiled for optimization. Turbofan can create its graph from bytecode though, so the source is no longer necessary.

I may be wrong, but I think @indutny's experiments were way before the code cache, and were about putting code into V8's startup snapshot. However, the startup serializer/deserializer had many limitations back then, which were fine for V8's default startup snapshot but did not work for arbitrary code.

@Trott
Member

Trott commented Aug 2, 2017

Should this remain open?

@bmeck
Member

bmeck commented Aug 2, 2017

@Trott No bandwidth currently to move it, but still relevant and comes up on social media somewhat often

@Eric24

Eric24 commented Sep 25, 2017

Another application of this would be the ability to send pre-compiled code between processes via IPC. Of course the "cached code object" would need to be serialized, but even so, it would likely be faster than passing the original source code to the target process and recompiling it there (unfortunately, there's no practical way to test this that I can think of, given the way vm.Script currently works).

@TimothyGu TimothyGu added the discuss Issues opened for discussions and feedbacks. label Feb 1, 2018
@TimothyGu
Member

The discussion seems to have quieted down a bit. Closing.

We can reopen this some time later.
