Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace Java serialization with Kryo #2681

Merged
merged 22 commits into from
Dec 12, 2018
Merged

Replace Java serialization with Kryo #2681

merged 22 commits into from
Dec 12, 2018

Conversation

abyrd
Copy link
Member

@abyrd abyrd commented Nov 22, 2018

To be completed by pull request submitter:

  • issue: Replace Java serialization #2643
  • roadmap: This is not formally on the roadmap but (de)serialization speed has been a central concern for a few years.
  • tests: Tests have been added to ensure that the object graph remains identical after a round-trip through serialization and deserialization. See org.opentripplanner.routing.graph.GraphSerializationTest. This uses an object graph comparison tool in R5 that is intended specifically for checking that serialization works properly. The comparison framework is itself covered by pretty thorough tests.

To be completed by @opentripplanner/plc:

  • reviews and approvals by 2 members, ideally from different organizations
  • before merging: add a bullet point to the changelog file with description and link to the linked issue
  • after merging: update the relevant card on the roadmap

This copies in a recusrive object differ from @csolem from Entur fork https://github.com/entur/OpenTripPlanner/tree/protostuff_poc
I began adding some tests to the differ which show that it produces some false negatives (e.g. doesn't see differences in map values)
The problem with the differ causing the test to fail was identified.
It does not check whether the outermost objects to be compared are Collections or Maps.
This enables reliable tests that serialization / deserialization perfectly restores a Graph.
In some ways this simplifies the differ, but it also ensures that it performs a very deep recursive comparison of almost every object.
Also added tests that can compare two references to the same object, which provide strong evidence that the differ actually works.
Also refactored Graph serialization tests a bit.
This pulls in a stable version of our testing framework for Kryo serialization.
@abyrd abyrd requested a review from a team as a code owner November 22, 2018 06:29
@abyrd
Copy link
Member Author

abyrd commented Nov 22, 2018

Some tests are still failing, specifically GraphSerializationTest.testRoundTrip (fails assertion) and GraphServiceTest (errors, no graphs loaded). I will need to fix these before we can really test this branch.

This makes it similar to the registration used in R5 - all primitive hashmaps are set to use their Externalizable implementation, unless we have a specific serializer for them.
Apparently this was a necessary step for Java serialization, but is no longer relevant for Kryo serialization.
@abyrd
Copy link
Member Author

abyrd commented Nov 23, 2018

This is bizarre. The serialization round trip test fails on Travis CI (which uses Oracle Java 1.8_151) but I can't get it to fail locally even when running on exactly that JVM version. It builds with no errors and passes all tests, in both IntelliJ and under command line Maven, on both Oracle Java 1.8_151 and 1.8_192 (the most recent).

I can however reproduce test failures on an EC2 instance running Amazon Linux 2, Maven 3.6.0, and OpenJDK 1.8.0_191. I get three mismatches: Primitive Long value mismatch: 1542957457381 vs 1542957457382 in testEmptyGraphs, and
No-entry value differs between two maps. One reference was null but not the other. in testRoundTrip.

There may be good reasons for this - minute differences nested deep within internal details the OTP object graph, or unhandled edge cases in the comparisons.

The cached timezone in the graph is transient and lazy-initialized.
Previous tests will sometimes cause a timezone to be cached.
This seems to depend on non-deterministic test execution order.
@abyrd
Copy link
Member Author

abyrd commented Nov 23, 2018

With latest commits to OTP and to the ObjectDiffer code in R5, build is now passing on both Travis CI and the EC2 + OpenJDK combination.

NL graph on the EC2 machine, 30GB memory: 24 minutes to build.
Writing graph: 1 minute 36 seconds
Graph size on disk: 1.5GB
Loading graph: 1 minute 11 seconds + 25 seconds indexing

Comparing to OTP master on the same machine: 23 minutes to build.
Writing graph: 1 minute 45 seconds
Graph size on disk: 2.1GB
Loading graph: 4 minutes 57 seconds + 25 seconds indexing

So this PR provides:

  • 29% improvement in graph size
  • 76% improvement in load time

@abyrd
Copy link
Member Author

abyrd commented Nov 23, 2018

@t2gran @gmellemstrand this is now ready for review and testing. It seems to work well in my testing.

@gmellemstrand
Copy link
Contributor

We have tested this and all our tests pass. We are currently running it in production.

@optionsome
Copy link
Member

Quickly tried to merge this into our fork but some tests were failing and didn't have time to look at it at that time. We will most likely take a better look at it in the near future and I don't yet know if all the issues are just specific to our fork but at least all the tests passed when I ran this branch on my computer.

@abyrd
Copy link
Member Author

abyrd commented Nov 29, 2018

Thanks for the feedback @optionsome. If your fork adds any new classes to the graph, it's possible that they are not serialized correctly, or that they are just not compared correctly in the test. You might need to register a new serializer or add exceptions to the object tree comparator. If you get stuck at any point I may be able to provide guidance on how to get the comparator or serializer working with your data structures.

@abyrd abyrd requested review from fpurcell and removed request for fpurcell November 29, 2018 15:56
@abyrd
Copy link
Member Author

abyrd commented Nov 29, 2018

@irees @fpurcell if you have a moment to spin up this PR, some corroborating test results would be helpful. Basically there's nothing special to do, just build a graph, load it, and make sure the routing is identical to previous versions. So theoretically could just mean running a batch of automated tests.

@fpurcell
Copy link
Member

@irees @fpurcell if you have a moment to spin up this PR, some corroborating test results would be helpful...

Hey Andrew.

I built a graph and ran the TriMet tests against your PR, and all tests passed: http://maps7.trimet.org/p/otp_report.html.

So you got my 1/2 thumb up. Here are the artifacts from that build: http://maps7.trimet.org/p/

Take care,
Frank

@abyrd
Copy link
Member Author

abyrd commented Nov 30, 2018

Thanks @fpurcell. I was hoping for a full thumbs up so the PR would be approved but if you aren't ready to give it we can wait on more input from others.

I see that your build report begins with a reassuring "All tests are PASSING" in green, but there are seven lines that say "failed" in red. Are those failures normal?

@fpurcell
Copy link
Member

Hey Andrew.

Sorry to confuse with my poor attempt at humor. I was thinking that Ian + myself building 2 graphs and running tests were equal to a 1 together.

But that said, I can't do much more at the moment. Didn't look at the code changes I just tested. I do know, all tests passed..that's a big confidence boost.

Take care,
Frank

@abyrd
Copy link
Member Author

abyrd commented Nov 30, 2018

@fpurcell what I meant is that following our new process, we need two official Github PR approvals (via the green review changes button at the top right) before it will let us merge. So if you consider your tests a success, please give it an official approval.

I also just wanted to make sure we were sure everything actually passed, because your report page says "fail" on it. But I'll take your word for it :)

@pailakka
Copy link
Member

pailakka commented Dec 4, 2018

I tested this with out fork (same setup as @optionsome). It looks like that for some reason tje ObjectDiffer throws exception when comparing TreeMaps

java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer

	at java.lang.Integer.compareTo(Integer.java:52)
	at java.util.TreeMap.getEntry(TreeMap.java:352)
	at java.util.TreeMap.get(TreeMap.java:278)
	at com.conveyal.r5.diff.StandardMapWrapper.get(StandardMapWrapper.java:40)
	at com.conveyal.r5.diff.ObjectDiffer.compareMaps(ObjectDiffer.java:262)
	at com.conveyal.r5.diff.ObjectDiffer.compareTwoObjects(ObjectDiffer.java:182)
        ...

additionally there is comparison differences especially with LuceneIndex in roundTrip test. I tried to revert our changes compared to upstream master but no joy. I'll try to find time to dig deeper

Copy link
Member

@fpurcell fpurcell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passes the TriMet test suite.

@abyrd
Copy link
Member Author

abyrd commented Dec 12, 2018

@pailakka looking at the lines cited in your stack trace, I see that the problem lies in the code that checks whether missing entries are represented the same way in both maps. This is only particularly important for Trove primitive maps, but I implemented it in a way that checks every kind of map, using an unlikely integer as a key. Because of type erasure in Java and the fact that the collections predate generics, the methods on maps will generally accept any object as a key even if they are keyed on some specific other class. This worked fine for all the Map implementations I tried, but it does not work for TreeMaps because they use a Comparator to establish the ordering of they keys, so all keys must be mutually comparable.

I suppose I should switch to an alternate implementation in which the map wrappers include methods that fetch the representation of a missing value from their wrapped maps. Indeed I started doing that at one point but thought it more robust to use this more general technique, which would be impervious to bugs or omissions in the map wrapper classes.

I'll make a ticket for this problem in the R5 library. In the mean time you can just exclude TreeMaps from comparisons in your tests, either by excluding the entire class or excluding TreeMap fields by name.

The LuceneIndex was an experimental, unsupported feature, and is only lazy-initialized after the geocoder API endpoint is hit. It is not designed to be serialized and reloaded - in fact much of its data is saved on disk in separate files. You can also exclude that class or field from the graph comparison, though I don't even understand how the Lucene index is getting initialized before you perform the graph comparison. It should still be null. That index is inside Graph.index which is marked transient so will not be automatically restored upon graph load.

@abyrd
Copy link
Member Author

abyrd commented Dec 18, 2018

@pailakka please see OTP PR #2700 which should fix your TreeMap problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants