Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Class wrapper for Java serialization friendliness #736

Merged
merged 27 commits into from Feb 16, 2018

Conversation

Projects
None yet
2 participants
@alexarchambault
Copy link
Collaborator

alexarchambault commented Jan 2, 2018

The goal of this PR is to make Ammonite behave well wrt Java serialization. This is a quite big PR, hard to split in smaller ones, that roughly adds a failing test (SerializationTests), then makes it pass.

Its goal, in sessions like this one

@ val n = 2
n: Int = 2

@ val f: Int => Int = { p =>
    p + n
  }
f: Int => Int = <function1>

is to be able to serialize f with Java serialization, without its deserialization to re-trigger the calculation of the values f depends on (n) - these should be serialized / deserialized along with f.

Currently, the user code is wrapped in a singleton. Singletons don't serialize their fields, and re-compute them if needed upon deserialization. So here, n would be re-computed when deserializing f.

To address that, this PR adds a class-based code wrapper, that wraps the user code in classes, and that references the previous commands via instances of these classes (no singleton involved). That way, only class instances are involved during serialization, and the scenario above works fine.

This kind of scenario (serializing closures with Java serialization) is used in Spark (e.g. when mapping a method on RDDs, when creating UDFs, … - where a method is serialized on one machine, then sent to a bunch of machines that deserialize it), but the same kind of thing also happens with other big data frameworks like flink or scio, … Being able to use these frameworks from Ammonite makes it worth the effort IMO.

In more detail, this PR successively:

The first two commits were also sent in separate PRs (#734, #735). The last commit makes the tests run for both the object and class based code wrappers, so could be discarded (less things would be tested). Apart from that, it's quite hard to split it in more atomic PRs…

@alexarchambault alexarchambault force-pushed the alexarchambault:topic/classwrap branch 2 times, most recently from 973ffb6 to ca28fe0 Jan 2, 2018

@alexarchambault

This comment has been minimized.

Copy link
Collaborator Author

alexarchambault commented Jan 2, 2018

Thinking this might still be quite a lot of changes… @lihaoyi If this is too much for you, please review #734 and #735 first, then I might be able to send a0c6c13 and 701044c in a separate PR, then the other commits in separate PRs too… :/

@lihaoyi

This comment has been minimized.

Copy link
Owner

lihaoyi commented Jan 2, 2018

@alexarchambault I think this is small enough to review as is. Some questions:

  • What is that tree.txt thing used for?
  • Why do you need to check isObjDef within your code wrapper?
  • What are the requiredVals and usedThings?
  • As part of the PR description, can you give a short example of how your code-wrapper wraps a snippet of code?
@alexarchambault

This comment has been minimized.

Copy link
Collaborator Author

alexarchambault commented Jan 2, 2018

FYI There are some answers to your questions in some commit messages (the messages of 71b506f, 1f857b5, 96e4a0b, in particular).

@lihaoyi

This comment has been minimized.

Copy link
Owner

lihaoyi commented Jan 2, 2018

Hmm ok. Can you add abbreviated versions of those commit messages to the relevant parts of the codebase? e.g. if we squash/merge through the github UI those commit messages are lost.

I still don't really understand the purpose of the tree.txt file: This is then used to not keep references to unused previous commands, so that these don't hinder serialization of the current one.. Could you explain it in more detail?

@alexarchambault alexarchambault force-pushed the alexarchambault:topic/classwrap branch from ca28fe0 to 47339a6 Jan 3, 2018

@alexarchambault

This comment has been minimized.

Copy link
Collaborator Author

alexarchambault commented Jan 3, 2018

(I should have time to add comments and explain stuff later today.)

@alexarchambault

This comment has been minimized.

Copy link
Collaborator Author

alexarchambault commented Jan 3, 2018

Travis errors seem transient.

@alexarchambault alexarchambault force-pushed the alexarchambault:topic/classwrap branch 2 times, most recently from a27d643 to 97a9228 Jan 3, 2018

@alexarchambault

This comment has been minimized.

Copy link
Collaborator Author

alexarchambault commented Jan 3, 2018

About the -tree.txt files, these are used to know whether the current command actually uses things from the previous ones (like if cmd2-tree.txt contains cmd2.this.cmd0, we know that cmd2 directly uses things from cmd0). This is then loaded via ammonite.interp.Preprocessor.CodeClassWrapper.UsedThings, and used to null-ify the references to the commands that are unused. That way, the unused commands don't hinder serialization.

At the end, code wrapping actually looks like

package ammonite
package $sess

object cmd2{
  val wrapper = new cmd2
  val instance = new wrapper.Helper
  def $main() = instance.$main()
}

final class cmd2 extends java.io.Serializable {

  @_root_.scala.transient private val __amm_usedThings =
    new _root_.ammonite.interp.Preprocessor.CodeClassWrapper.UsedThings(this)

  override def toString = """cmd2"""

  final val cmd0: _root_.ammonite.$sess.cmd0.instance.type = if (__amm_usedThings("""cmd0""")) _root_.ammonite.$sess.cmd0.instance else null
  final val cmd1: _root_.ammonite.$sess.cmd1.instance.type = if (__amm_usedThings("""cmd1""")) _root_.ammonite.$sess.cmd1.instance else null

  // [predef imports removed]

  import cmd0.{
    n
  }
  import cmd1.{
    res1
  }


  final class Helper extends java.io.Serializable{
    val res2 = n + 2
    def $main() = {
      ammonite.repl.ReplBridge.value.Internal.combinePrints(
        _root_.ammonite
              .repl
              .ReplBridge
              .value
              .Internal
              .print(res2, "res2", _root_.scala.None)
      )
    }

    override def toString = "cmd2"

  }
}

This corresponds to the last command in

@ val n = 2
n: Int = 2

@ n + 1
res1: Int = 3

@ n + 2
res2: n = 4

cmd0 and cmd1 reference previous commands. As cmd1 is unused, cmd1 will actually be null above.

@lihaoyi

This comment has been minimized.

Copy link
Owner

lihaoyi commented Jan 4, 2018

@alexarchambault It seems to me that the need for UsedThings,null-ing-out-references, etc. boils down to these lines:

  final val cmd0: _root_.ammonite.$sess.cmd0.instance.type = if (__amm_usedThings("""cmd0""")) _root_.ammonite.$sess.cmd0.instance else null
  final val cmd1: _root_.ammonite.$sess.cmd1.instance.type = if (__amm_usedThings("""cmd1""")) _root_.ammonite.$sess.cmd1.instance else null

However, why do you assign the previous commands to a local final val, rather than importing them:

import _root_.ammonite.$sess.cmd0.{instance => cmd0}
import _root_.ammonite.$sess.cmd1.{instance => cmd1}

import cmd0.{
  n
}
import cmd1.{
  res1
}

Or even importing their contents directly:

import _root_.ammonite.$sess.cmd0.instance.{
  n
}
import _root_.ammonite.$sess.cmd1.instance.{
  res1
}

This would bring them into scope the same way, but avoid the problems with keeping references since imports do not create references, and thus avoid the need for null-ing out un-used ones later.

Now, this might require some changes to the import-generation logic to happen, but seems a lot more straightforward than this roundable unparsed-ASTs-as-string thing that we're currently doing in this PR

alexarchambault added some commits Dec 28, 2017

Add Java serialization test (fails with object wrapping)
This adds a test, serializing a closure (`closure`), that itself
references a value computed in a previous command (`a`).

This is similar to what happens when [mapping a function on an RDD](https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.rdd.RDD@map[U](f:T=%3EU)(implicitevidence$3:scala.reflect.ClassTag[U]):org.apache.spark.rdd.RDD[U])
or using a [UDF](https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.expressions.UserDefinedFunction) in Spark: a closure gets
serialized this way, and is sent to a bunch of "executors" that each
deserialize it.

This kind of closure can reference the results of heavy computations. We
want these results to be serialized with the closure, rather than re-computed
on each machine after deserialization.

In the test, `a` acts as the result of the heavy computation. We do not want
the calculation of `a` to re-run when deserializing `closure`.

After creating the closure, the test serializes it, then deserializes it
via a classloader that can load the exact same bytecode as the classloader of
the current session. We call the deserialized closure by reflection, and check
that the heavy computation was not run again.
Wrap user code in classes rather than objects
This commit poses the bases of class wrapping. For the sake of simplicity,
it simply replaces the former object wrapping with class wrapping.
Object wrapping is added back in later commits.

The idea of class wrapping is that the user code:
- should be in a class rather than a singleton,
- should see the previous commands results via instances of these classes, not referencing singletons along the way.

Roughly, code wrapping now looks like
```scala
package ammonite
package $sess

object cmd1 {
  val wrapper = new cmd1
  val instance = new wrapper.Helper
  def $main() = instance.$main()
}

final class cmd1 extends java.io.Serializable {

  override def toString = "cmd1"

  // Reference previous command via the instance of its class
  final val cmd0: _root_.ammonite.$sess.cmd0.instance.type = _root_.ammonite.$sess.cmd0.instance

  // Removed predef imports for clarity…

  // Import from a previous command, via the instance of its class
  import cmd0.{
    n
  }

  final class Helper extends java.io.Serializable {
    val res1 = n + 1
    def $main() = {
      ammonite.repl.ReplBridge.value.Internal.combinePrints(
        _root_.ammonite
          .repl
          .ReplBridge
          .value
          .Internal
          .print(res1, "res1", _root_.scala.None)
      )
    }
    override def toString = "cmd1"
  }
}
```

User code of this command was `n + 1`, with `n` defined just before it with `val n = 2`. The `n + 1` is run and kept (`res1`) in a class, itself referencing `cmd0` via the instance of its class. When we have a `cmd1` instance, at runtime, it only reaches previous commands via class instances too, no singleton is involved.

Only dealing with class instances at runtime, rather than singletons, behaves well wrt Java serialization. Singletons don't write their fields during serialization, and re-compute them when deserialized. On the other hand, class instances serialize and de-serialize their fields, as expected.
SerializationTests passes!
Clear references to unused commands thanks to the `*-tree.txt` resources, eagerly get the used ones.

Thanks to that, the commands only hold references to the previous ones they actually use. Unused ones
don't hinder Java serialization.
Add raw code as resource
This commit modifies `AmmonitePlugin`, so that it passes the raw string representation of the tree of the user code to Ammonite. These strings
look like
```scala
List(def <init>(): cmd0.this.Helper = {
  Helper.super.<init>();
  ()
}, private[this] val n: Int = 2, <stable> <accessor> def n: Int = Helper.this.n, def $main(): Iterator[String] = ammonite.repl.ReplBridge.value.Internal.combinePrints(ammonite.repl.ReplBridge.value.Internal.print[Int](Helper.this.n, "n", scala.None)((pprint.TPrint.lambda[Int](((cfg$macro$2: pprint.TPrintColors) => "".+(cfg$macro$2.typeColor.apply(fansi.this.Str.implicitApply("Int")).render).+(""))): pprint.TPrint[Int]), ammonite.repl.ReplBridge.value.tprintColorsImplicit, (ClassTag.Int: scala.reflect.ClassTag[Int]))), override def toString: String = "cmd0")
```

It is then written in a resource file (`cmd1-tree.txt` if the wrapper name is `cmd1`). This allows wrappers to process this tree at runtime. This is done
for class wrapping in later commits. E.g. checking for substrings such as `cmd1.this.cmd0` in this string says whether `cmd1` uses things
from `cmd0`. This is then used to not keep references to unused previous commands, so that these don't hinder serialization of the current one.
Abstract away path to wrapped code
This is a first step towards adding back object wrapper, along with the current class wrapper
Add back object wrapping for… objects
If the user code solely consists of a singleton, like
```scala
object Foo {
  val n = 2
}
```
then it is wrapped in an object rather than a class.

This allows users to wrap some code in objects rather than classes if needed.

In particular, this is useful for macro code (than can't be defined in classes that we instantiate ourselves), or for definitions that are later processed by macros (definitions from classes are harder to handle than those in objects)
Abstract away user code nesting level
For class wrapping, the user code is a bit more "nested" than with object
wrapping (`class cmd0 { class Helper { /* user code */ } }` versus `object cmd0 { /* user code */ }`), so both can't be handled the same way.

Ideally, the structure nesting user code could be annotated, and the plugin
would find it via this annotation, rather than by various assumptions about
where it is in the wrapper output and this "nesting level" parameter.
Add DualTestRepl
Allows to run tests with several TestRepl. By default, runs them with
two TestRepl, one using object code wrapping, and one with class code
wrapping.
@alexarchambault

This comment has been minimized.

Copy link
Collaborator Author

alexarchambault commented Jan 5, 2018

(Ok, I think I understand your previous answer, I guess I read it too quickly yesterday)

So I pushed an extra commit relying on tree traversing, rather than string grepping, to find the previous command uses.

Don't unnecessarily pass wrapper name around
It was added as an argument here or there in f2d6792, this removes it a bit.

@alexarchambault alexarchambault force-pushed the alexarchambault:topic/classwrap branch from 811545e to 9bfe332 Jan 5, 2018

@lihaoyi
Copy link
Owner

lihaoyi left a comment

This should be the last round of comments/changes-I'd-like-you-to-make. Otherwise it looks good to me!

@@ -105,7 +163,7 @@ object AmmonitePlugin{

def rec(expr: g.Tree): List[(g.Name, g.Symbol)] = {
expr match {
case s @ g.Select(lhs, name) => (name -> s.symbol) :: rec(lhs)
case s @ g.Select(lhs, _) => (s.symbol.name -> s.symbol) :: rec(lhs)

This comment has been minimized.

Copy link
@lihaoyi

lihaoyi Jan 6, 2018

Owner

Is this change intentional?

This comment has been minimized.

Copy link
@alexarchambault

alexarchambault Jan 8, 2018

Author Collaborator

Yep, that was to make ImportHookTests.repl.file.inline and ImportHookTests.scripts.deepImport pass with class wrap (name was - subtly - different IIRC)

val uses0 = stats
.flatMap(t =>
t.collect {
case g.Select(g.This(g.TypeName(`rawWrapperName`)), g.TermName(name)) =>

This comment has been minimized.

Copy link
@lihaoyi

lihaoyi Jan 6, 2018

Owner

Rather than checking the rawWrapperName, could we check that the symbol on the g.This is the same as the symbol on unit.body.children.last.children.last? That would be more robust.

This comment has been minimized.

Copy link
@alexarchambault

alexarchambault Jan 8, 2018

Author Collaborator

I tried, but IIRC, one corresponded to a term, the other to a type. I'll have a look at it again.

This comment has been minimized.

Copy link
@alexarchambault

alexarchambault Jan 9, 2018

Author Collaborator

I'm not sure that can be done… g.This.unapply only extracts a g.TypeName, no symbol seems involved :/

This comment has been minimized.

Copy link
@lihaoyi

lihaoyi Jan 10, 2018

Owner

Can you bind

g.TypeName(`rawWrapperName`) @ node

and then get the symbol of the node via node.sym?

This comment has been minimized.

Copy link
@alexarchambault

alexarchambault Jan 11, 2018

Author Collaborator

Yep. Pushed an extra commit for it.

)
}

val uses =

This comment has been minimized.

Copy link
@lihaoyi

lihaoyi Jan 6, 2018

Owner

Let's do:

val uses = userCodeNestingLevel match{
  case 1 => ...
  case 2 => ...
}

If someone does something funky and has a userCodeNestingLevel of 0 or 3, we probably want to blow up and force them to handle it explicitly rather than blindly continuing with this logic

This comment has been minimized.

Copy link
@alexarchambault

alexarchambault Jan 8, 2018

Author Collaborator

Ok

@_root_.scala.transient private val __amm_usedThings =
new _root_.ammonite.interp.Preprocessor.CodeClassWrapper.UsedThings(this)
override def toString = $q$q$q${indexedWrapperName.encoded}$q$q$q

This comment has been minimized.

Copy link
@lihaoyi

lihaoyi Jan 6, 2018

Owner

Let's us $tq instead of $q$q$q to be consistent with the rest of the codebase

This comment has been minimized.

Copy link
@alexarchambault

alexarchambault Jan 8, 2018

Author Collaborator

Or maybe use simple quote here… I think I used triple-quotes in doubt, but single ones should be ok for an encoded name.

final class ${indexedWrapperName.backticked} extends java.io.Serializable {
@_root_.scala.transient private val __amm_usedThings =
new _root_.ammonite.interp.Preprocessor.CodeClassWrapper.UsedThings(this)

This comment has been minimized.

Copy link
@lihaoyi

lihaoyi Jan 6, 2018

Owner

Rather than serializing the usedThings data into a classpath file which gets loaded at runtime and deserialized, can we just expose the structured data directly in a repl.usedEarlierDefinitions property? Then your conditionals can just be

if (repl.usedEarlierDefinitions.contains("foo")) ... else null

repl already exposes repl.imports and repl.history, which depend on the current command and the exact order of commands that have run so far, so exposing a repl.usedEarlierDefinitions will fit right in

This comment has been minimized.

Copy link
@alexarchambault

alexarchambault Jan 8, 2018

Author Collaborator

Will do, that's better this way!

This comment has been minimized.

Copy link
@alexarchambault

alexarchambault Jan 11, 2018

Author Collaborator

@lihaoyi Wouldn't something like having the usedEarlierDefinitions rely on the resource be ok? I tried to pass the used commands everywhere around, it was quite heavyweight (I had to add parameters for it to a bunch of methods, add an extra element in the tuples returned by others, plus it has to be cached then - so change the cache api, …)

This comment has been minimized.

Copy link
@alexarchambault

alexarchambault Jan 11, 2018

Author Collaborator

The advantage of the resource approach, is that they are passed along with the generated classes, so only require little changes.


val uses =
if (userCodeNestingLevel <= 1)
Map.empty[String, Seq[String]]

This comment has been minimized.

Copy link
@lihaoyi

lihaoyi Jan 6, 2018

Owner

Let's add a comment here saying we could support computing uses (which we should probably name something longer such as usedEarlierDefinitions) for userCodeNestingLevel == 1, but explicitly choose not to do so because it's currently only used in the class-based-wrapper which has userCodeNestingLevel == 2

This comment has been minimized.

Copy link
@alexarchambault

alexarchambault Jan 8, 2018

Author Collaborator

Ok


def interps = repls.map(_.interp)

def session(sess: String): Unit =

This comment has been minimized.

Copy link
@lihaoyi

lihaoyi Jan 6, 2018

Owner

I'm a bit suspicious of whether or not these are actually running the tests as desired on both wrapping levels, because the CI times haven't doubled as I would expect. But if you've tested it manually and it's doing the right thing that's good enough for me

This comment has been minimized.

Copy link
@alexarchambault

alexarchambault Jan 8, 2018

Author Collaborator

That's good news! The tests are run for both wrappers. In the CI output, one can find twice the code of the tests of ammRepl (e.g. look for repl.load("val x = 1") in https://travis-ci.org/lihaoyi/Ammonite/jobs/325386928).

@alexarchambault

This comment has been minimized.

Copy link
Collaborator Author

alexarchambault commented Jan 8, 2018

I should have time to update the PR tomorrow.

@lihaoyi

This comment has been minimized.

Copy link
Owner

lihaoyi commented Jan 17, 2018

@alexarchambault let me know when you'd like me to take another look, Github doesn't notify me when you push new commits

@alexarchambault

This comment has been minimized.

Copy link
Collaborator Author

alexarchambault commented Jan 19, 2018

@lihaoyi I commented a bit on #736 (comment). I can still try to use the approach you're proposing in that comment, but this implem I came with for it was a bit heavyweight IMO, so it's still a resource for now…

@lihaoyi

This comment has been minimized.

Copy link
Owner

lihaoyi commented Jan 19, 2018

Huh i thought i replied to that; I guess Github ate my comment.

Yeah it's a bit annoying to pass it around, but I do not think it's better to sneak things through in opaque unstructured binary blobs.

If you can't find a convenient place to attach an additional field to some case class, just plumb the new value through where-ever necessary. We can tidy up the code later

@alexarchambault

This comment has been minimized.

Copy link
Collaborator Author

alexarchambault commented Jan 19, 2018

Gonna try that again then. IIRC, the use data must be kept in Frame, so that things are rolled out properly when tangling with sessions. Which seemed quite cumbersome to implement…

@lihaoyi

This comment has been minimized.

Copy link
Owner

lihaoyi commented Jan 19, 2018

I haven't looked in depth but I'd guess use data should be kept where-ever the classfile data is.

We already keep an imports0 in each Frame, which are generated exactly the same way the use data is: via compiler plugins. It seems perfectly fitting to add another field to Frame to represent the new bit of structured we're extracting out during compilation.

@alexarchambault

This comment has been minimized.

Copy link
Collaborator Author

alexarchambault commented Jan 29, 2018

I'll update that PR soon. I've been thinking about the next (and I think last) PR, to have spark work from Ammonite, in the mean time…

@alexarchambault alexarchambault force-pushed the alexarchambault:topic/classwrap branch from 3c23068 to 921e9e8 Feb 8, 2018

alexarchambault added some commits Feb 8, 2018

Make used earlier definitions available via ReplAPI
and use it from the class-based code wrapper

@alexarchambault alexarchambault force-pushed the alexarchambault:topic/classwrap branch from 539aabd to 294bd54 Feb 9, 2018

@alexarchambault

This comment has been minimized.

Copy link
Collaborator Author

alexarchambault commented Feb 9, 2018

@lihaoyi So PR updated…

@lihaoyi

This comment has been minimized.

Copy link
Owner

lihaoyi commented Feb 11, 2018

Looks good to me. @alexarchambault I gave you commit access, so feel free to merge this whenever you're happy with it.

Going forward let's keep code-reviewing the big PRs like this one, but feel free to directly commit any small fixes & tweaks or merge any PRs you judge uncontroversial

@alexarchambault

This comment has been minimized.

Copy link
Collaborator Author

alexarchambault commented Feb 12, 2018

I'm going to try to move the settings.Ydelambdafy.value = "inline" line to the tests before merging, if that can be done.

@alexarchambault

This comment has been minimized.

Copy link
Collaborator Author

alexarchambault commented Feb 15, 2018

I think I'll merge as soon as the CI is green.

@alexarchambault alexarchambault merged commit 9db2c21 into lihaoyi:master Feb 16, 2018

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@alexarchambault alexarchambault deleted the alexarchambault:topic/classwrap branch Feb 16, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.