Add an offheap indexmap implementation using LinkedIn PalDB as an option. Resolving Issue #4 #17
Conversation
 *
 * @author yizhou
 */
trait IndexMap extends Map[String, Int] with java.io.Serializable {
- Why do we extend this trait with java.io.Serializable here? If this trait already extends scala.collection.immutable.Map, doesn't it become serializable automatically?
- Nitpick: it might be helpful to extend this trait with scala.collection.Map (via import scala.collection.Map) rather than scala.collection.immutable.Map (type Map[A, +B] = scala.collection.immutable.Map[A, B]), because the former is more generic, and the "toMap" function in Scala converts the underlying collection into a "scala.collection.Map", not a "scala.collection.immutable.Map".
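For concreteness, a minimal sketch of the two suggestions above; the two members shown are hypothetical, added only to make the trait self-contained:

```scala
import scala.collection.Map

// Sketch of both suggestions: drop the explicit java.io.Serializable
// and extend the more generic scala.collection.Map instead.
trait IndexMap extends Map[String, Int] {
  // Hypothetical members, for illustration only (not the actual API).
  def getIndex(featureName: String): Int
  def getFeatureName(idx: Int): Option[String]
}
```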
Hi Xianxing, those are two very good points. You are right, extending java.io.Serializable is no longer necessary. And I'll make it extend scala.collection.Map instead of scala.collection.immutable.Map. I'll modify this interface accordingly.
@XianXing, I'd suggest we still stick with scala.collection.immutable.Map for now.
I get the following issues with the other map:
/home/yizhou/Dev/opensource/new-github/photon-ml/photon-ml/src/main/scala/com/linkedin/photon/ml/Driver.scala:210: type mismatch;
found : com.linkedin.photon.ml.util.IndexMap
required: Map[String,Int]
new SystemReport(suite.featureKeyToIdMap, params, summaryOption),
^
/home/yizhou/Dev/opensource/new-github/photon-ml/photon-ml/src/main/scala/com/linkedin/photon/ml/Driver.scala:345: type mismatch;
found : com.linkedin.photon.ml.util.IndexMap
required: Map[String,Int]
val varImportanceDiagnostic = new VarianceFeatureImportanceDiagnostic(suite.featureKeyToIdMap)
^
/home/yizhou/Dev/opensource/new-github/photon-ml/photon-ml/src/main/scala/com/linkedin/photon/ml/Driver.scala:346: type mismatch;
found : com.linkedin.photon.ml.util.IndexMap
required: Map[String,Int]
val meanImportanceDiagnostic = new ExpectedMagnitudeFeatureImportanceDiagnostic(suite.featureKeyToIdMap)
^
/home/yizhou/Dev/opensource/new-github/photon-ml/photon-ml/src/main/scala/com/linkedin/photon/ml/Driver.scala:375: type mismatch;
found : com.linkedin.photon.ml.util.IndexMap
required: Map[String,Int]
Error occurred in an application involving default arguments.
val bootstrapDiagnostic = new BootstrapTrainingDiagnostic(suite.featureKeyToIdMap)
^
/home/yizhou/Dev/opensource/new-github/photon-ml/photon-ml/src/main/scala/com/linkedin/photon/ml/Driver.scala:430: type mismatch;
found : com.linkedin.photon.ml.util.IndexMap
required: Map[String,Int]
suite.featureKeyToIdMap,
^
/home/yizhou/Dev/opensource/new-github/photon-ml/photon-ml/src/main/scala/com/linkedin/photon/ml/io/GLMSuite.scala:201: non-variable type argument Map[String,Any] in type pattern List[Map[String,Any]](the underlying of List[Map[String,Any]]) is unchecked since it is eliminated by erasure
case Some(parsed: List[Map[String, Any]]) =>
^
one warning found
5 errors found
So the immutable Map already prevails in many places in our code. I'd suggest we do another round of clean-up to remove the immutable.Map interface in a separate PR.
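For illustration, a minimal reproduction of the mismatch above (names are hypothetical): `Map[String, Int]` in a signature resolves through `Predef` to `scala.collection.immutable.Map`, so a value typed as the generic `scala.collection.Map` is rejected there:

```scala
object MapMismatchDemo {
  // Predef.Map is an alias for scala.collection.immutable.Map,
  // so this signature requires the immutable variant.
  def report(featureKeyToIdMap: Map[String, Int]): Int = featureKeyToIdMap.size

  val generic: scala.collection.Map[String, Int] = scala.collection.Map("f1" -> 0)

  // report(generic)  // type mismatch: found scala.collection.Map, required Map[String,Int]
  val ok: Int = report(generic.toMap)  // explicit conversion compiles
}
```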
@@ -286,6 +304,9 @@ protected[ml] class Driver(
    }
  }

  /* TODO: we potentially have an excessive memory usage issue at this step;
   * the 2M-feature sponsor ads dataset will fail here due to OOM
This feels like sensitive information and I'd suggest modifying it to just mention the number of features at which it would fail. I don't think the specifics about which dataset it is should be disclosed here.
Good catch. I'll remove the mention of the dataset.
This RB looks good to me, although I raised many questions. Will go back and review more later after lunch. One general feeling I have after reading the code is that some refactoring of GLMSuite is necessary in a separate PR, as it is taking on much more responsibility than I ever imagined one and a half years ago.
      res
    }.mapPartitions { iter =>
      // Step 2. map features to (hashCode, featureName)
      val set = new scala.collection.mutable.HashSet[String]()
Nitpick: a better practice might be to import scala.collection.mutable first, then use mutable.HashSet[String] directly.
Some folks might argue that we should use the builder for HashSet (HashSet.newBuilder) here for immutability, but I think either way is fine.
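For concreteness, a small sketch of both styles mentioned:

```scala
import scala.collection.mutable

// Suggested style: import the package once, then qualify the type.
val set = mutable.HashSet.empty[String]
set += "featureA"

// Builder alternative, if an immutable result is preferred.
val builder = Set.newBuilder[String]
builder += "featureA"
val frozen: Set[String] = builder.result()
```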
…with PalDBIndexMap's storeReader
      }
    }

println(s"Partition [${idx}] total record number: ${i}") |
Use PhotonLogger / spark.logInfo() if this is important, or delete it.
I'll delete this one. Actually, it's not correct to use PhotonLogger here anyway, since this is an executor log rather than the driver's. I'll add one more log line at the end of the job indicating how many records we have indexed.
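A hedged sketch of that end-of-job log line, assuming an `indexedRecords` RDD and a driver-side `logger` (both names illustrative):

```scala
// Count on the driver rather than printing per-partition totals on executors.
val totalIndexed = indexedRecords.count()
logger.info(s"Feature indexing job finished; total records indexed: $totalIndexed")
```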
    val config = new Configuration()
    config.set(Configuration.CACHE_ENABLED, "true")
    // Allow 200MB in-memory cache in total
    config.set(Configuration.CACHE_BYTES, String.valueOf(209715200L / _partitionsNum))
This is interesting. How is 200MB determined? Does it relate to the number of features in any way?
It's not related to the number of features; it's more about the upper bound we'd like to allow the executor's JVM memory to spend on the off-heap map's LRU cache. I'm just being conservative here; 200MB doesn't hurt anyway.
Actually, even turning it off is okay. A TODO item would be to allow config customization in the future, but I don't think we need that unless a real use case comes up.
Do you have numbers for performance with/without the cache? Since there is almost no serialization overhead and memory mapping is on, the cache might actually not be worth it.
If we're going to keep this, we should make it configurable, or extract it into a constant with a comment explaining the value.
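A sketch of the constant-extraction option, reusing the PalDB `Configuration` calls from the diff above; the object name is hypothetical, and `_partitionsNum` is the field from the surrounding class:

```scala
import com.linkedin.paldb.api.Configuration

object PalDBIndexMapDefaults {
  // Conservative total LRU cache budget shared across all store partitions;
  // 200MB is an empirical "safe" bound, not derived from the feature count.
  val TotalCacheBytes: Long = 200L * 1024L * 1024L
}

val config = new Configuration()
config.set(Configuration.CACHE_ENABLED, "true")
config.set(Configuration.CACHE_BYTES,
  String.valueOf(PalDBIndexMapDefaults.TotalCacheBytes / _partitionsNum))
```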
Folks, please DON'T click the merge button until we have confirmed that everything (including integTest) runs. It takes too long to run integTest every time, so I'll just run it a few times once we've accumulated enough changes. The last run would be right before we merge.
Sure, I will make sure all the tests pass before merging, and as we discussed offline last time, it should be the responsibility of the person who merges the PR to check whether all tests pass.
 * An off-heap index map implementation using PalDB.
 *
 * The internal implementation assumes the following things:
 * 1. One DB storage is partitioned into multiple pieces we call partitions. It should be generated and controlled by
What's the motivation for partitioning the store? Do you only ever ship a subset of the partitions?
It's mainly to speed up the index-building process: we can partition the full feature space into pieces by hash code and then build them in parallel. This isn't useful when the feature space is small, but it speeds up index building when the feature space is very large.
So this implementation always assumes that we load the full set of partitions as a whole.
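Roughly, the scheme described above could look like this; a sketch only, with `featureNames` and `numPartitions` as illustrative names:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Route each feature name to a store partition by hash code, then build the
// per-partition PalDB stores in parallel, one per Spark task.
def buildPartitions(featureNames: RDD[String], numPartitions: Int): Unit = {
  featureNames
    .map(name => ((name.hashCode & Int.MaxValue) % numPartitions, name)) // mask sign bit
    .partitionBy(new HashPartitioner(numPartitions))
    .foreachPartition { iter =>
      // Placeholder: each task would open a PalDB writer here and persist its
      // slice of the feature space as one partition file.
      iter.foreach { case (_, _) => () }
    }
}
```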
Makes sense. I wonder what kind of sizes you'd need to make this difference significant. (As a bonus though, now you aren't constrained by PalDB's 2GB index size limit.)
Nice work @timyitong! I've left some comments in-line.
…ame and improve the query efficiency via binarySearch
@XianXing for index building, I'm reverting back to the distinct-and-then-groupByKey approach. reduceByKey has a minor hash-collision issue that loses a few features (I lost 7 features in one run over a 200k-feature dataset). For now, even building the index map against 2M features takes less than 5 minutes, so we probably don't need to optimize this part until we reach a much larger scale.
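To illustrate the collision issue (names here are illustrative, not the actual job code): keying by hash code and reducing collapses distinct names whose hashes collide, whereas deduping on the full name keeps every feature.

```scala
import org.apache.spark.rdd.RDD

// Keying by hashCode silently drops one name per hash collision.
def lossyNames(featureNames: RDD[String]): RDD[String] =
  featureNames
    .map(name => (name.hashCode, name))
    .reduceByKey((left, _) => left) // keeps only one name per colliding hash
    .values

// distinct() compares the full names, so nothing is lost.
def safeNames(featureNames: RDD[String]): RDD[String] =
  featureNames.distinct()
```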
… in a few hash collisions
@XianXing @joshvfleming @namitk for the index map pull request, I've addressed most of the comments. A few things remain to do, but I'd suggest we don't further complicate this PR:
I think his suggestion about separating the (name -> index) and (index -> name) stores to save memory-mapping overheads also makes sense. But considering our current situation, we might not need to optimize this much, since a 2M-feature store only takes around 150MB, which is relatively small. Plus, currently the only use case for the reverse mapping (index -> name) is at the model-writing stage, so I think a more proper improvement is to iterate directly over the full map to write out models instead of hitting the getFeatures method; in that case we don't even need the reverse feature mapping (see the sketch below). Do these sound reasonable? Please suggest what we should address next in this request.
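A sketch of that model-writing path; `indexMap`, `coefficients`, and `writer` are illustrative names, not the actual API:

```scala
// Iterate the forward (name -> index) map directly; no reverse lookup needed.
indexMap.foreach { case (featureName, idx) =>
  writer.write(s"$featureName\t${coefficients(idx)}\n")
}
```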
LGTM. Let me merge it. For the reverse index mapping, I agree that it's actually not required for writing the Photon model, because the Photon model can be written to HDFS locally on the driver, and there is no need to broadcast the reverse feature index map to the executors. The current way of writing the model to HDFS as an RDD is cumbersome; I will open an issue to discuss and address the problem. However, when we write GAME models, such a broadcast/PalDB-based reverse index map is necessary and critical, and the separate map logic suggested by @mtth would make a lot of sense there. Since GAME is beyond the scope of this PR, I'd suggest we keep an eye on it but address the problem later.
This PR introduces an off-heap feature IndexMap and an independent FeatureIndexingJob to bypass the scalability limitations of the previous feature index map.
See Issue #4 for some early thoughts we've discussed.
As I proceeded with this improvement, I found multiple interesting things. I will create separate issues for them.
What this PR has done:
What this PR is not doing:
Why did I use PalDB?
Four candidate embeddable off-heap storage options came to mind:
How does it perform?
Indexing normally takes only about 5 minutes for 200k features. 2M features takes longer, but I also only tried a fairly harsh setting (50 executors x 2g memory). With the current implementation, the index-building job is definitely linearly scalable.
For the training job, the off-heap map is, surprisingly, even a bit faster than the default map (though that's just a single-data-point comparison). This looks very promising to me; I suggest we do a thorough perf study later.
With the off-heap map, we can handle 2M features with ease, and I'm not even using a very aggressive setting (only 10g or even 5g of driver memory, and 2g per executor). Though at the model-selection stage, I think it will fail for other reasons.