
Build a fat jar that works with spark-submit #269

Merged: darabos merged 5 commits into main from darabos-spark-submit on Aug 30, 2022

Conversation

@darabos (Contributor) commented Aug 24, 2022

This is a substantial change to how LynxKite runs. We pack LynxKite (including Sphynx) into a single jar file that can be started with spark-submit. This makes it way easier to deploy in a Hadoop environment. We're open-sourcing this from a private repo (sorry) where it has already proven useful in two enterprise environments.
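For a sense of what deployment looks like, here is a hypothetical launch; the jar name, main class, and master are placeholders, not taken from this PR:

  # Sketch of launching the fat jar on a Hadoop cluster (all names illustrative).
  spark-submit \
    --master yarn \
    --deploy-mode client \
    --class com.example.LynxKiteMain \
    lynxkite.jar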

@@ -32,7 +31,10 @@ object ExecuteSQL extends OpFromJson {
   val functionRegistry = FunctionRegistry.builtin
   val reg = UDFHelper.udfRegistration(functionRegistry)
   UDF.register(reg)
-  val catalog = new SessionCatalog(new InMemoryCatalog, functionRegistry, sqlConf)
+  val catalog = new spark.sql.catalyst.catalog.SessionCatalog(

This class is abstract on Databricks. The seemingly meaningless change here makes it easier to replace the class name when building for Databricks.

.set("spark.eventLog.compress", "true")
// Progress bars are not great in logs.
.set("spark.ui.showConsoleProgress", "false")
if (LoggedEnvironment.envOrElse("KITE_CONFIGURE_SPARK", "yes") == "yes") {

This is just an indentation change.
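For context: the KITE_CONFIGURE_SPARK check wrapping this block means these Spark settings only apply while the variable keeps its default of "yes". A sketch of opting out, based only on the variable visible in the diff:

  # Let the environment own all Spark settings instead of LynxKite's defaults.
  export KITE_CONFIGURE_SPARK=no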

@@ -11,6 +11,7 @@ scalacOptions ++= Seq(
"-feature",
"-deprecation",
"-unchecked",
"-target:jvm-1.8",

Building for Java 8 has been necessary in multiple environments. And we haven't seen any downsides so far. Let us know if you see any!
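If you want to verify a build, class file major version 52 corresponds to Java 8. A quick check (jar path and class name are placeholders):

  # "major version: 52" means the class targets Java 8.
  javap -verbose -cp lynxkite.jar some.packaged.ClassName | grep 'major version'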

"-source", "1.8",
)

version := Option(System.getenv("VERSION")).getOrElse("0.1-SNAPSHOT")

Since we plan to use this fat jar now as a release artifact, we have to stop putting 0.1-SNAPSHOT in the file name. 😅
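So a release build can stamp the artifact like this (version number illustrative):

  # The build reads VERSION from the environment, falling back to 0.1-SNAPSHOT.
  VERSION=1.0.0 sbt assembly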

"io.grpc" % "grpc-netty" % "1.41.0",
"io.grpc" % "grpc-protobuf" % "1.48.0",
"io.grpc" % "grpc-stub" % "1.48.0",
"io.grpc" % "grpc-netty-shaded" % "1.48.0",

Not really necessary, but I upgraded these while troubleshooting an issue.

Comment on lines +100 to 105
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("com.typesafe.config.**" -> "lynxkite_shaded.com.typesafe.config.@1").inAll,
ShadeRule.rename("com.google.inject.**" -> "lynxkite_shaded.com.google.inject.@1").inAll,
ShadeRule.rename("com.google.common.**" -> "lynxkite_shaded.com.google.common.@1").inAll,
ShadeRule.rename("com.google.protobuf.**" -> "lynxkite_shaded.com.google.protobuf.@1").inAll,
)

This shading is motivated by running on Databricks, where we ran into conflicting versions. It's probably important in other environments as well.
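A quick sanity check on the assembled jar (jar name is a placeholder):

  # The renamed packages should appear under the lynxkite_shaded prefix.
  jar tf lynxkite.jar | grep '^lynxkite_shaded/' | head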

 // We put the local Spark installation on the classpath for compilation and testing instead of using
 // it from Maven. The version on Maven pulls in an unpredictable (old) version of Hadoop.
 def sparkJars(version: String) = {
   val home = System.getenv("HOME")
-  val jarsDir = new java.io.File(s"$home/spark/spark-$version/jars")
+  val jarsDir = new java.io.File(
+    Option(System.getenv("SPARK_JARS_DIR")).getOrElse(s"$home/spark/spark-$version/jars"))

Set SPARK_JARS_DIR to build against jars harvested from your destination environment.
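For example (path illustrative):

  # Compile and test against the exact jars of the destination cluster.
  export SPARK_JARS_DIR=/opt/cloudera/parcels/CDH/jars
  sbt assembly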

@@ -61,7 +61,6 @@ GET /downloadCSV com.lynxanalytics.biggraph.serving.ProductionJsonServer.d

GET /getLogFiles com.lynxanalytics.biggraph.serving.ProductionJsonServer.getLogFiles
GET /downloadLogFile com.lynxanalytics.biggraph.serving.ProductionJsonServer.downloadLogFile
-POST /forceLogRotate com.lynxanalytics.biggraph.serving.ProductionJsonServer.forceLogRotate

Our log setup was chaos anyway. We hereby leave it all for the environment to manage.
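In a Hadoop deployment the platform's own tooling covers this. For example, YARN aggregates the logs of a spark-submit application (application id is a placeholder):

  # Fetch the aggregated logs of a finished application.
  yarn logs -applicationId application_1661846400000_0001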

-keydir := flag.String(
-	"keydir", "", "directory of cert.pem and private-key.pem files (for encryption)")
-flag.Parse()
+keydir := os.Getenv("SPHYNX_CERT_DIR")

Everything else was passed through environment variables, but this one was a flag for some reason. Environment variables are nicer in that they are passed on automatically to child processes.
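That inheritance is standard Unix behavior: export the variable once and it reaches every process the launcher spawns, including Sphynx. (Path illustrative.)

  # Visible to spark-submit and to the Sphynx process it starts.
  export SPHYNX_CERT_DIR=/etc/lynxkite/certs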

# Package it for sbt-assembly.
cd .build
cp -R ../python lynxkite-sphynx/
LIBS=$(ldd lynxkite-sphynx/lynxkite-sphynx | sed -n 's/.*=> \(.*anaconda3.*\) (0x.*)/\1/p')

This is probably not the final word, but it works for now.
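For reference, the sed above keeps only the resolved Anaconda library paths from ldd output, whose matching lines look roughly like this (illustrative):

  libgomp.so.1 => /home/user/anaconda3/lib/libgomp.so.1 (0x00007f9e2c000000)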

@darabos commented Aug 24, 2022

Wow, the BigQuery library from #245 brought in a lot of pinned dependencies. I reshuffled them to get it to build. We'll have to check if the BigQuery import still works.

The BigQuery library also brings a newer version of Jackson. I tried upgrading Play Framework, since play-json is already on Jackson 2.13 on master, but the latest Play release is still on 2.10. Hopefully it works fine with 2.13, because that's what I pinned now.

Yeah, lots of testing needed before putting this in a release.

Comment on lines +46 to +47
// The file stores timestamps as instants. Then getTimestamp puts them in the default time zone.
// To get a reliable test we need to put it in a fixed time zone.

@darabos commented Aug 29, 2022

Another test failure that I'm not sure is specific to this PR:

[info]   File "/home/runner/work/lynxkite/lynxkite/sphynx/python/tsne.py", line 8, in <module>
[info]     z = TSNE(n_components=dim, perplexity=op.params['perplexity']).fit_transform(x)
[info]   File "/usr/share/miniconda/envs/test/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py", line 1122, in fit_transform
[info]     self._check_params_vs_input(X)
[info]   File "/usr/share/miniconda/envs/test/lib/python3.9/site-packages/sklearn/manifold/_t_sne.py", line 793, in _check_params_vs_input
[info]     raise ValueError("perplexity must be less than n_samples")

I'll fix it anyway. I don't want to merge this with failing tests.

@darabos commented Aug 30, 2022

I think I fixed that, but now I want to merge it with broken tests anyway, because I have a follow-up PR upgrading to Spark 3.3.0 for #270. (And I like Spark 3.3.0 anyway!) We'll do the big testing after that.

@darabos merged commit 99509eb into main on Aug 30, 2022
@darabos deleted the darabos-spark-submit branch on Aug 30, 2022 at 09:41