
Build a fat jar that works with spark-submit #269

Merged: darabos merged 5 commits into main from darabos-spark-submit on Aug 30, 2022

Conversation

@darabos (Contributor) commented Aug 24, 2022

This is a substantial change to how LynxKite runs. We pack LynxKite (including Sphynx) into a single jar file that can be started with spark-submit. This makes it way easier to deploy in a Hadoop environment. We're open-sourcing this from a private repo (sorry) where it has already proven useful in two enterprise environments.
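For a sense of what deployment looks like, here is a hypothetical launch; the jar name, main class, and master are placeholders, not taken from this PR:

  # Sketch of launching the fat jar on a Hadoop cluster (all names illustrative).
  spark-submit \
    --master yarn \
    --deploy-mode client \
    --class com.example.LynxKiteMain \
    lynxkite.jar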

@@ -32,7 +31,10 @@ object ExecuteSQL extends OpFromJson {
   val functionRegistry = FunctionRegistry.builtin
   val reg = UDFHelper.udfRegistration(functionRegistry)
   UDF.register(reg)
-  val catalog = new SessionCatalog(new InMemoryCatalog, functionRegistry, sqlConf)
+  val catalog = new spark.sql.catalyst.catalog.SessionCatalog(

This class is abstract on Databricks. The seemingly meaningless change here makes it easier to replace the class name when building for Databricks.

.set("spark.eventLog.compress", "true")
// Progress bars are not great in logs.
.set("spark.ui.showConsoleProgress", "false")
if (LoggedEnvironment.envOrElse("KITE_CONFIGURE_SPARK", "yes") == "yes") {

This is just an indentation change.
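For context: the KITE_CONFIGURE_SPARK check wrapping this block means these Spark settings only apply while the variable keeps its default of "yes". A sketch of opting out, based only on the variable visible in the diff:

  # Let the environment own all Spark settings instead of LynxKite's defaults.
  export KITE_CONFIGURE_SPARK=no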

@@ -11,6 +11,7 @@ scalacOptions ++= Seq(
"-feature",
"-deprecation",
"-unchecked",
"-target:jvm-1.8",

Building for Java 8 has been necessary in multiple environments. And we haven't seen any downsides so far. Let us know if you see any!
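If you want to verify a build, class file major version 52 corresponds to Java 8. A quick check (jar path and class name are placeholders):

  # "major version: 52" means the class targets Java 8.
  javap -verbose -cp lynxkite.jar some.packaged.ClassName | grep 'major version'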

"-source", "1.8",
)

version := Option(System.getenv("VERSION")).getOrElse("0.1-SNAPSHOT")

Since we plan to use this fat jar now as a release artifact, we have to stop putting 0.1-SNAPSHOT in the file name. 😅
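So a release build can stamp the artifact like this (version number illustrative):

  # The build reads VERSION from the environment, falling back to 0.1-SNAPSHOT.
  VERSION=1.0.0 sbt assembly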

"io.grpc" % "grpc-netty" % "1.41.0",
"io.grpc" % "grpc-protobuf" % "1.48.0",
"io.grpc" % "grpc-stub" % "1.48.0",
"io.grpc" % "grpc-netty-shaded" % "1.48.0",

Not really necessary, but I upgraded these while troubleshooting an issue.

Comment on lines +100 to 105
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("com.typesafe.config.**" -> "lynxkite_shaded.com.typesafe.config.@1").inAll,
ShadeRule.rename("com.google.inject.**" -> "lynxkite_shaded.com.google.inject.@1").inAll,
ShadeRule.rename("com.google.common.**" -> "lynxkite_shaded.com.google.common.@1").inAll,
ShadeRule.rename("com.google.protobuf.**" -> "lynxkite_shaded.com.google.protobuf.@1").inAll,
)

This shading is motivated by running on Databricks, where we ran into conflicting versions. It's probably important in other environments as well.
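A quick sanity check on the assembled jar (jar name is a placeholder):

  # The renamed packages should appear under the lynxkite_shaded prefix.
  jar tf lynxkite.jar | grep '^lynxkite_shaded/' | head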

 // We put the local Spark installation on the classpath for compilation and testing instead of using
 // it from Maven. The version on Maven pulls in an unpredictable (old) version of Hadoop.
 def sparkJars(version: String) = {
   val home = System.getenv("HOME")
-  val jarsDir = new java.io.File(s"$home/spark/spark-$version/jars")
+  val jarsDir = new java.io.File(
+    Option(System.getenv("SPARK_JARS_DIR")).getOrElse(s"$home/spark/spark-$version/jars"))

Set SPARK_JARS_DIR to build against jars harvested from your destination environment.
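For example (path illustrative):

  # Compile and test against the exact jars of the destination cluster.
  export SPARK_JARS_DIR=/opt/cloudera/parcels/CDH/jars
  sbt assembly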

@@ -61,7 +61,6 @@ GET /downloadCSV com.lynxanalytics.biggraph.serving.ProductionJsonServer.d

GET /getLogFiles com.lynxanalytics.biggraph.serving.ProductionJsonServer.getLogFiles
GET /downloadLogFile com.lynxanalytics.biggraph.serving.ProductionJsonServer.downloadLogFile
-POST /forceLogRotate com.lynxanalytics.biggraph.serving.ProductionJsonServer.forceLogRotate

Our log setup was chaos anyway. We hereby leave it all for the environment to manage.
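In a Hadoop deployment the platform's own tooling covers this. For example, YARN aggregates the logs of a spark-submit application (application id is a placeholder):

  # Fetch the aggregated logs of a finished application.
  yarn logs -applicationId application_1661846400000_0001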

-keydir := flag.String(
-	"keydir", "", "directory of cert.pem and private-key.pem files (for encryption)")
-flag.Parse()
+keydir := os.Getenv("SPHYNX_CERT_DIR")

Everything else was passed through environment variables, but this one was a flag for some reason. Environment variables are nicer in that they are passed on automatically to child processes.
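That inheritance is standard Unix behavior: export the variable once and it reaches every process the launcher spawns, including Sphynx. (Path illustrative.)

  # Visible to spark-submit and to the Sphynx process it starts.
  export SPHYNX_CERT_DIR=/etc/lynxkite/certs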

# Package it for sbt-assembly.
cd .build
cp -R ../python lynxkite-sphynx/
LIBS=$(ldd lynxkite-sphynx/lynxkite-sphynx | sed -n 's/.*=> \(.*anaconda3.*\) (0x.*)/\1/p')

This is probably not the final word, but it works for now.
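For reference, the sed above keeps only the resolved Anaconda library paths from ldd output, whose matching lines look roughly like this (illustrative):

  libgomp.so.1 => /home/user/anaconda3/lib/libgomp.so.1 (0x00007f9e2c000000)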

@darabos commented Aug 24, 2022

Wow, the BigQuery library from #245 brought in a lot of pinned dependencies. I reshuffled them to get it to build. We'll have to check if the BigQuery import still works.

The BigQuery library also brings a newer version of Jackson. I tried upgrading Play Framework, since play-json is already on Jackson 2.13 on master, but the latest Play release is still on 2.10. Hopefully it works fine with 2.13, because that's what I pinned now.

Yeah, lots of testing needed before putting this in a release.

Comment on lines +46 to +47
// The file stores timestamps as instants. Then getTimestamp puts them in the default time zone.
// To get a reliable test we need to put it in a fixed time zone.

@darabos commented Aug 29, 2022

Another test failure that I'm not sure is specific to this PR:

[info]   File "/home/runner/work/lynxkite/lynxkite/sphynx/python/tsne.py", line 8, in <module>
[info]     z = TSNE(n_components=dim, perplexity=op.params['perplexity']).fit_transform(x)
[info]   File "/usr/share/miniconda/envs/test/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py", line 1122, in fit_transform
[info]     self._check_params_vs_input(X)
[info]   File "/usr/share/miniconda/envs/test/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py", line 793, in _check_params_vs_input
[info]     raise ValueError("perplexity must be less than n_samples")

I'll fix it anyway. I don't want to merge this with failing tests.

@darabos commented Aug 30, 2022

I think I fixed that, but now I want to merge it with broken tests anyway, because I have a follow-up PR upgrading to Spark 3.3.0 for #270. (And I like Spark 3.3.0 anyway!) We'll do the big testing after that.

@darabos merged commit 99509eb into main on Aug 30, 2022
@darabos deleted the darabos-spark-submit branch on Aug 30, 2022 at 09:41