Spark 3.0.0 incompatibilities #313
Comments
Thanks for sharing this report, @Wadaboa. This is the correct place to report issues like this. I haven't tested Flintrock myself with Spark 3.0 yet, so I'm not surprised there are issues.
This looks interesting. Flintrock currently only ensures that Java 8 is available, not Java 11. Spark 3.0 should still work with Java 8, though. Something is causing Spark to look for Java 11. Are you sure you've built your application against Java 8? Another possibility is that there is some setting we need to provide to tell Spark 3.0 to use Java 8. The migration guide doesn't mention anything like that, but I haven't looked closely. Maybe try setting JAVA_HOME explicitly.
I'm sure I've built my Scala application using Java 8, since running sbt prints [info] welcome to sbt 1.3.13 (AdoptOpenJDK Java 1.8.0_252) (I'm using jenv to manage different Java versions). Moreover, building the application on my local machine with Java 8 or Java 11 makes no difference. I also tried building the application with Java 11 and installing Java 11 on the master node, but nothing changed. I assume that to make this "hack" work I would also have to install Java 11 on each slave node, but I was expecting at least some kind of different output/error when launching spark-submit. One thing I noticed while looking for Spark's brew formula is that it depends on a Homebrew-installed OpenJDK (Java 11).
I would try setting JAVA_HOME explicitly on each cluster node.
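For reference, a minimal sketch of applying that suggestion cluster-wide via flintrock run-command; the spark/conf location and the /usr/lib/jvm/jre Java home are assumptions taken from paths mentioned elsewhere in this thread, not verified defaults:

```sh
# Sketch only: append a Java 8 JAVA_HOME to spark-env.sh on every node.
# Both paths are assumptions based on this thread, not confirmed Flintrock defaults.
flintrock run-command <cluster-name> \
  'echo "export JAVA_HOME=/usr/lib/jvm/jre" >> spark/conf/spark-env.sh'
```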
I tried setting export JAVA_HOME="/usr/lib/jvm/jre" on each cluster node, but the error stayed the same. By the way, I also had another try in which I added the following two lines to spark-env.sh on each cluster node, but everything remained the same:

```sh
export JAVA_HOME="/usr/lib/jvm/jre"
export PATH="$JAVA_HOME/bin:$PATH"
```

I also tried restarting the master and the slave nodes using the scripts provided by Spark, with no difference.
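For context, a hedged sketch of what such a restart could look like; the ~/spark install location is an assumption based on the /home/ec2-user/spark paths that appear later in this thread:

```sh
# Sketch only: restart the standalone master and workers with Spark's own sbin scripts,
# run from the master node. The ~/spark location is an assumption, not a verified default.
~/spark/sbin/stop-all.sh
~/spark/sbin/start-all.sh
```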
This is strange. I will give things a shot myself later this week and see if I can get to the bottom of what's going on.
I managed to make it work by doing something sketchy. First of all, I changed my custom Flintrock configuration to be:

```yaml
provider: ec2

services:
  spark:
    version: 3.0.0
  hdfs:
    version: 2.8.5

launch:
  num-slaves: 1
  install-spark: True
  install-hdfs: True

providers:
  ec2:
    key-name: my-key-pair
    identity-file: my-key-pair.pem
    instance-type: t2.xlarge
    region: us-east-1
    ami: ami-0f84e2a3635d2fac9
    user: ec2-user
    instance-profile-name: ec2-s3

debug: true
```

Then, after launching the cluster, I ran the following commands:

```sh
flintrock run-command <cluster-name> "sudo amazon-linux-extras install -y java-openjdk11"
flintrock run-command <cluster-name> "yes 2 | sudo alternatives --config java"
flintrock run-command <cluster-name> "sudo mkdir -p /usr/local/opt/openjdk@11/bin/"
flintrock run-command <cluster-name> "sudo ln -s /usr/lib/jvm/jre/bin/java /usr/local/opt/openjdk@11/bin/"
```

Obviously, this is not the go-to solution, but at least it's something. My biggest doubt is still about why the driver looks for Java at /usr/local/opt/openjdk@11 in the first place.
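As a quick sanity check on a workaround like this, something along these lines could confirm that the path the driver complains about now resolves to a runnable Java on every node (sketch only, reusing the commands and paths already shown above):

```sh
# Sketch only: confirm the symlinked path resolves to a working Java on all nodes
flintrock run-command <cluster-name> "/usr/local/opt/openjdk@11/bin/java -version"
```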
Thank you for looking into this and sharing your workaround. I still plan to investigate this as well; sorry I haven't been able to do it yet.
Just did some testing. I cannot reproduce the error you reported with PySpark (REPL or spark-submit) or with the Scala REPL. Everything seems to work fine. So I guess the problem is triggered by a combination of sbt-built jobs and Spark 3.0. I know that sbt made some breaking changes in the jump from 0.13 to 1.0. I see you're using sbt 1.3. How about if you try building your application with sbt 0.13? Or some other version of sbt < 1.3?
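If it helps, one hedged way to test that suggestion: sbt reads its version from project/build.properties, so the build can be pinned to a 0.13.x release and the JAR rebuilt. The 0.13.18 value here is just an example of a late 0.13.x release, not something prescribed in this thread:

```sh
# Sketch only: pin the sbt launcher to a 0.13.x release and rebuild the application JAR
echo "sbt.version=0.13.18" > project/build.properties
sbt clean package
```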
I tried building my application with an older version of sbt, but nothing changed. While digging around, I also noticed that Spark's Homebrew installation on my local machine includes the following wrapper script:

```sh
#!/bin/bash
JAVA_HOME="/usr/local/opt/openjdk@11" exec "/usr/local/Cellar/apache-spark/3.0.0/libexec/bin/load-spark-env.sh" "$@"
```

I tried removing that initial JAVA_HOME assignment, but nothing changed.
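A hedged way to check whether the locally resolved Spark scripts are the Homebrew wrappers with the hard-coded JAVA_HOME shown above (paths will differ per machine; this is a sketch, not a prescribed diagnostic):

```sh
# Sketch only: see which spark-submit is first on PATH and inspect it for a JAVA_HOME override
which spark-submit
cat "$(which spark-submit)"
```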
Oh, hmm... I'm not a Scala guy so humor me here: Is it possible that when you build your Spark application, the build process somehow references the Homebrew-installed Spark, which includes a dependency on JDK 11? I don't mean to make so much work for you. 😄 If JDK 11 is going to be frequently used with Spark 3.0, perhaps Flintrock should just support this out of the box somehow. The challenge will be to figure out how to make Flintrock gracefully support both users who use JDK 8 and those who use JDK 11. Or, at the very least, provide some new option or note in the README that JDK 11 users can reference.
Actually, I'm not much of a Scala guy either; I'm currently learning the language. Anyway, I just did some more testing. I tried to build Spark 3.0.0 on my local machine using a custom Homebrew formula that targets Java 8; in this case the error message changed accordingly, which suggests that something related to the driver is not fully executed on the cluster. So, I really don't understand how the /usr/local/opt/openjdk@11 path ends up being referenced on the cluster at all.

Now, I'm no longer using the hack of forcing Java 11 onto the cluster nodes. Instead, I submit the job by running spark-submit on the master over ssh:

```sh
ssh -i <key-pair> <ec2-user>@<ec2-master-name> -t "spark-submit --master spark://<ec2-master-name>:7077 ..."
```

where the placeholders are filled in with my key pair and my cluster's master address. Another thing that I noticed is that the command above works, while the following Flintrock command does not:

```sh
flintrock run-command --master-only <ec2-cluster-name> "spark-submit --master spark://<ec2-master-name>:7077 ..."
```

where, again, the placeholders are filled in with my cluster's details. Ideally, the two commands should be doing the same thing, shouldn't they?

About out-of-the-box support for Java 11 in Flintrock, what about providing a new option inside the Flintrock configuration file to let the user pick the Java version?
They should. Do you get an error when you run that Flintrock command, or does it just hang?
That could work, though I am more inclined to just have Flintrock automatically make things work whether Java 8 or Java 11 is being used, if that's not too complicated.
Hi Nicholas, I just noticed this issue. I have been routinely using Flintrock to run Spark 3.0 on Java 11 for many months. My multi-jdk-support branch is still a bit ugly, and I am wary of expanding the support matrix. BTW, I generally run Spark from the master node rather than my desktop, FWIW.
@sfcoy - That's good to know, and I appreciate the restraint about adding more supported configurations. Do you know if there is any reason to install Java 8 if the cluster is being launched with Spark 3+? If apps built with Java 8 can be deployed to a Java 11 runtime, perhaps we can just install Java 11 automatically when Spark is set to 3.0 or newer (a rough sketch of this idea follows below). That saves us from needing to add a user-facing option.

@Wadaboa - That definitely sounds like a bug. Does that happen just with spark-submit, or with other commands as well? Also, are you sure the command hangs, or is it just that it doesn't return any output unless there is an error? See #135 for a potentially relevant issue.
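A rough sketch of the idea being floated here, expressed as the shell commands that might run on each node. This is an illustration only, not how Flintrock actually implements anything; the Java 11 command is the one already used in this thread, and the Java 8 package name is an assumption for Amazon Linux:

```sh
# Sketch only: choose the JDK install command from the requested Spark major version
SPARK_VERSION="3.0.0"
if [ "${SPARK_VERSION%%.*}" -ge 3 ]; then
  sudo amazon-linux-extras install -y java-openjdk11   # same command used in the workaround above
else
  sudo yum install -y java-1.8.0-openjdk               # assumed Java 8 package name on Amazon Linux
fi
```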
Hi @nchammas, as far as Java 11 goes, we cannot use it if the user chooses to use HDFS. Hadoop 3.3 (released July 2020) is required for Java 11, and as yet the Apache Spark team have not discussed publicly (that I have noticed) when they will build releases that include it. Of course, it is trivial to build one's own release. I'm not sure it is a good idea to auto-select the JDK, because some users may have their own libraries that are not JDK 11 compatible.
@nchammas - You are right, it works, it just doesn't return any output unless there is an error. I wasn't aware of how the command actually works. Anyway, I agree with issue #135 that it could be valuable to actually see the output produced by flintrock run-command.
@Wadaboa - I think [...]

@sfcoy - Would you like to submit a PR that adds an option to Flintrock allowing the user to specify the Java version they want deployed to the cluster? I took a quick look at your branch and it looks like a very good start to me. I think with a few refinements we can get this in and release it as part of Flintrock 1.1.
Sure thing. I also think that "no JDK" should be an option. That way users can choose an AMI which includes the required JDK.
Hi,
I'm using Flintrock to deploy a Scala/Spark application to a simple EC2 cluster (master + 1 slave), but I'm having some problems. In particular, I'm using the following custom Flintrock configuration to launch the cluster:
As you can see, I'm using Spark 3.0.0. Once I have packaged my Scala application (using sbt), I perform a Flintrock copy-file operation. Then, when I want to run the JAR using spark-submit, I have issues:

- If I run spark-submit from my local machine using the options --master spark://<ec2-master-name>:7077 and --deploy-mode cluster, I get a Java error saying Cannot run program "/usr/local/opt/openjdk@11/bin/java" (in directory "/home/ec2-user/spark/work/driver-20200718133018-0005") (a sketch of this kind of submission follows below).
- If I run spark-submit from inside the master node (using either Flintrock's run-command --master-only or login commands), the job gets correctly submitted, but the Spark UI is not reporting it (at <ec2-master-name>:8080). Instead, the application execution can be viewed at <ec2-master-name>:4040, even though in the Executors page only the master seems to be active.

I can't wrap my head around this problem. I don't know if it is some kind of Flintrock compatibility issue with Spark 3.0.0, or if I'm doing something wrong.
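For reference, a hedged reconstruction of the kind of local submission described in the first bullet above; the class and JAR names are placeholders, not values taken from this report, and the target path is only the conventional sbt output location:

```sh
# Sketch only: cluster-deploy-mode submission from the local machine, with placeholder names
spark-submit \
  --master spark://<ec2-master-name>:7077 \
  --deploy-mode cluster \
  --class <main-class> \
  target/scala-2.12/<app-name>.jar   # typical sbt output path; adjust to the actual build
```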
I apologize if this is the wrong place to post this kind of question.
However, my environment is the following:

- Flintrock: 1.0.0
- Python: 3.8.4
- OS: macOS Catalina 10.15.5
- sbt: 1.3.13
- Scala: 2.12.10
- Java: 1.8