Build a fat jar that works with spark-submit #269
@@ -32,7 +31,10 @@ object ExecuteSQL extends OpFromJson {
   val functionRegistry = FunctionRegistry.builtin
   val reg = UDFHelper.udfRegistration(functionRegistry)
   UDF.register(reg)
-  val catalog = new SessionCatalog(new InMemoryCatalog, functionRegistry, sqlConf)
+  val catalog = new spark.sql.catalyst.catalog.SessionCatalog(
This class is abstract on Databricks. The seemingly meaningless change here makes it easier to replace the class name when building for Databricks.
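For context, a hedged sketch of what the full call plausibly looks like (the argument list is reconstructed from the removed line above and may differ by Spark version; it assumes `org.apache.spark` is imported as `spark`):

```scala
// Sketch only: spelling out the package path on a line of its own means a
// Databricks build can substitute its own catalog class by rewriting just
// this one name.
val catalog = new spark.sql.catalyst.catalog.SessionCatalog(
  new spark.sql.catalyst.catalog.InMemoryCatalog, // arguments assumed from the old call
  functionRegistry,
  sqlConf)
```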
.set("spark.eventLog.compress", "true") | ||
// Progress bars are not great in logs. | ||
.set("spark.ui.showConsoleProgress", "false") | ||
if (LoggedEnvironment.envOrElse("KITE_CONFIGURE_SPARK", "yes") == "yes") { |
This is just an indentation change.
@@ -11,6 +11,7 @@ scalacOptions ++= Seq(
   "-feature",
   "-deprecation",
   "-unchecked",
+  "-target:jvm-1.8",
Building for Java 8 has been necessary in multiple environments. And we haven't seen any downsides so far. Let us know if you see any!
"-source", "1.8", | ||
) | ||
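To make the Java 8 targeting concrete, here is a hedged sbt sketch using standard sbt keys (not necessarily verbatim from this build file):

```scala
// Emit Java 8 compatible bytecode from both compilers: the scalac flag is
// the one added in the diff above, and javacOptions covers any Java sources.
scalacOptions += "-target:jvm-1.8"
javacOptions ++= Seq("-source", "1.8", "-target", "1.8")
```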
|
version := Option(System.getenv("VERSION")).getOrElse("0.1-SNAPSHOT") |
Since we now plan to use this fat jar as a release artifact, we have to stop putting `0.1-SNAPSHOT` in the file name. 😅
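For illustration, a hypothetical pairing with sbt-assembly's standard `assemblyJarName` key (the naming pattern is made up, not the project's actual setting):

```scala
// With VERSION exported by the release pipeline, the artifact becomes e.g.
// lynxkite-4.2.0.jar rather than lynxkite-0.1-SNAPSHOT.jar.
version := Option(System.getenv("VERSION")).getOrElse("0.1-SNAPSHOT")
assemblyJarName in assembly := s"lynxkite-${version.value}.jar" // hypothetical pattern
```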
"io.grpc" % "grpc-netty" % "1.41.0", | ||
"io.grpc" % "grpc-protobuf" % "1.48.0", | ||
"io.grpc" % "grpc-stub" % "1.48.0", | ||
"io.grpc" % "grpc-netty-shaded" % "1.48.0", |
Not really necessary, but I upgraded these while troubleshooting an issue.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.typesafe.config.**" -> "lynxkite_shaded.com.typesafe.config.@1").inAll,
  ShadeRule.rename("com.google.inject.**" -> "lynxkite_shaded.com.google.inject.@1").inAll,
  ShadeRule.rename("com.google.common.**" -> "lynxkite_shaded.com.google.common.@1").inAll,
  ShadeRule.rename("com.google.protobuf.**" -> "lynxkite_shaded.com.google.protobuf.@1").inAll,
)
This shading is motivated by running on Databricks, where we find conflicting versions of these libraries. The rename rules relocate our copies under the `lynxkite_shaded` prefix at assembly time, so they can't collide with whatever versions the host environment ships. It's probably important in other environments as well.
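A quick illustrative check of the effect (not project code): after assembly, the relocated classes resolve under the new prefix when loaded from the fat jar.

```scala
// Sketch only: sbt-assembly rewrote the bytecode references, so our Guava
// copy answers to the shaded name, leaving com.google.common.* free for
// whatever version the host environment (e.g. Databricks) provides.
val shaded = Class.forName("lynxkite_shaded.com.google.common.collect.ImmutableList")
println(shaded.getName) // lynxkite_shaded.com.google.common.collect.ImmutableList
```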
// We put the local Spark installation on the classpath for compilation and testing instead of using
// it from Maven. The version on Maven pulls in an unpredictable (old) version of Hadoop.
def sparkJars(version: String) = {
  val home = System.getenv("HOME")
-  val jarsDir = new java.io.File(s"$home/spark/spark-$version/jars")
+  val jarsDir = new java.io.File(
+    Option(System.getenv("SPARK_JARS_DIR")).getOrElse(s"$home/spark/spark-$version/jars"))
Set `SPARK_JARS_DIR` to build against jars harvested from your destination environment.
@@ -61,7 +61,6 @@ GET /downloadCSV com.lynxanalytics.biggraph.serving.ProductionJsonServer.downloadCSV

GET /getLogFiles com.lynxanalytics.biggraph.serving.ProductionJsonServer.getLogFiles
-GET /downloadLogFile com.lynxanalytics.biggraph.serving.ProductionJsonServer.downloadLogFile
POST /forceLogRotate com.lynxanalytics.biggraph.serving.ProductionJsonServer.forceLogRotate
Our log setup was chaos anyway. We hereby leave it all for the environment to manage.
-keydir := flag.String(
-	"keydir", "", "directory of cert.pem and private-key.pem files (for encryption)")
-flag.Parse()
+keydir := os.Getenv("SPHYNX_CERT_DIR")
Everything else was passed through environment variables, but this one was a flag for some reason. Environment variables are nicer in that they are passed automatically to child processes.
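A minimal Scala sketch of that point (illustrative only, not project code):

```scala
// Illustrative only: variables in the process builder's environment (or
// inherited from the parent process) are visible to the child with no extra
// plumbing, unlike a command-line flag, which must be forwarded explicitly.
val pb = new ProcessBuilder("lynxkite-sphynx")
pb.environment().put("SPHYNX_CERT_DIR", "/path/to/certs") // the child sees this
pb.inheritIO().start()
```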
# Package it for sbt-assembly.
cd .build
cp -R ../python lynxkite-sphynx/
LIBS=$(ldd lynxkite-sphynx/lynxkite-sphynx | sed -n 's/.*=> \(.*anaconda3.*\) (0x.*)/\1/p')
This is probably not the final word, but it works for now: ldd lists the shared libraries the Sphynx binary links against, and the sed filter keeps the Anaconda ones so they can be bundled alongside it.
New commits pushed: 5b09b18 to e3befc3.
Wow, the BigQuery library from #245 brought in a lot of pinned dependencies. I changed things around to get it to build. We'll have to check if the BigQuery import still works. The BigQuery library also brings a newer version of Jackson, which is why I tried upgrading Play Framework. Yeah, lots of testing is needed before putting this in a release.
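For illustration, a hypothetical sbt sketch of the kind of version pinning this situation calls for (example coordinates and versions, not the ones actually used):

```scala
// Hypothetical: force a single Jackson version across the build when a
// transitive dependency (here the BigQuery connector) drags in a newer one
// than Play Framework expects.
dependencyOverrides ++= Seq(
  "com.fasterxml.jackson.core" % "jackson-databind" % "2.13.3",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.13.3",
)
```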
// The file stores timestamps as instants. Then getTimestamp puts them in the default time zone.
// To get a reliable test we need to put it in a fixed time zone.
Another test failure that I'm not sure is specific to this PR. I'll fix it anyway; I don't want to merge this with failing tests.
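A hedged sketch of the usual shape of such a fix (assumed, not the project's actual test code): pin the JVM default time zone around the assertions so getTimestamp is deterministic.

```scala
import java.util.TimeZone

// Sketch only: getTimestamp renders stored instants in the JVM default time
// zone, so the test pins it to UTC and restores the previous value afterwards.
val saved = TimeZone.getDefault
TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
try {
  // ... read the file and assert on the timestamp values here ...
} finally {
  TimeZone.setDefault(saved)
}
```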
I think I fixed that, but now I want to merge it with broken tests anyway, because I have a follow-up PR upgrading to Spark 3.3.0 for #270. (And I like Spark 3.3.0 anyway!) We'll do the big testing after that.
This is a substantial change to how LynxKite runs. We pack LynxKite (including Sphynx) into a single jar file that can be started with `spark-submit`. This makes it way easier to deploy in a Hadoop environment. We're open-sourcing this from a private repo (sorry) where it has already proven useful in two enterprise environments.