Add Spark multi-user support for standalone mode #750

Closed

@jerryshao
Contributor

This patch adds multi-user support for standalone mode.

Currently the Spark ExecutorBackend does not distinguish the user who submitted the app; instead it talks to HDFS as the user who started the Spark cluster, which introduces file permission issues. This patch addresses the issue in two ways:

  1. The ExecutorBackend accesses HDFS as the app's user, so files keep the same permissions as the app's user.
  2. On a secure HDFS cluster, the client driver obtains a delegation token and distributes it to the cluster; the ExecutorBackend uses the delegation token to authenticate and access HDFS.
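Sketched in Scala, the two points above amount to roughly the following (the `runAsUser` shape and the token hand-off are illustrative, not the patch's exact code; the `UserGroupInformation` and `Token` calls are real Hadoop security APIs):

```scala
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.hdfs.security.token.delegation.DelegationTokenIdentifier
import org.apache.hadoop.security.UserGroupInformation
import org.apache.hadoop.security.token.Token

// Illustrative sketch: run executor-side work as the user who submitted
// the app, attaching the HDFS delegation token (passed in URL-safe
// string form) when security is enabled.
object RunAsUserSketch {
  def runAsUser(user: String, encodedToken: Option[String])(body: => Unit) {
    val ugi = UserGroupInformation.createRemoteUser(user)
    if (UserGroupInformation.isSecurityEnabled) {
      encodedToken.foreach { encoded =>
        val token = new Token[DelegationTokenIdentifier]()
        token.decodeFromUrlString(encoded)
        ugi.addToken(token) // executor authenticates to HDFS with this token
      }
    }
    // All HDFS access inside `body` now happens as `user`.
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = body
    })
  }
}
```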

I've tested it on CDH 4.1.2 and Hadoop 1.0.4, with and without security enabled, but I cannot cover other versions.

This patch does not solve the delegation token renewal issue when communicating with secure Hadoop; for long-running apps like Spark Streaming or Shark Server, access will fail when the delegation token expires.

Any advice on a renewal mechanism is really appreciated; I will add it to this patch.

Thanks
Jerry

@AmplabJenkins

Thank you for your pull request. An admin will review this request soon.

@markhamstra
Contributor

There's a lot of code duplication among the hadoop1, hadoop2 and yarn utils. Can you DRY this out, please?

@andyk
Member
andyk commented Jul 30, 2013

Jenkins, ok to test.

@jerryshao
Contributor

Hi @markhamstra, thanks for your advice. Spark uses conditional compilation to handle hadoop1, hadoop2 and yarn separately. If I extract the duplicated code, a common package is needed, but where to put this package is a problem. My concern is that adding it to Spark core may pollute Spark with different versions of Hadoop; if you have a good solution, please let me know.

Thanks
Jerry

@velvia
Contributor
velvia commented Aug 2, 2013

Hi Jerry, I believe someone suggested using the hadoop-client jar to abstract away talking to different Hadoop versions.
http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client/1.0.1

It seems this is affecting packaging and other issues, so maybe we should accelerate this work.

@jerryshao
Contributor

Hi @velvia, could you be more specific? I don't fully follow your meaning.

@tgravescs tgravescs commented on the diff Aug 5, 2013
.../src/hadoop1/scala/spark/deploy/SparkHadoopUtil.scala
@@ -31,9 +41,38 @@ object SparkHadoopUtil {
}
def runAsUser(func: (Product) => Unit, args: Product) {
+ runAsUser(func, args, getUserNameFromEnvironment)
+ }
+
+ def runAsUser(func: (Product) => Unit, args: Product, user: String) {
+ val ugi = UserGroupInformation.createRemoteUser(user)
+ if (UserGroupInformation.isSecurityEnabled) {
+ Option(System.getenv(HDFS_TOKEN_KEY)) match {
@tgravescs
tgravescs Aug 5, 2013

An environment variable isn't very secure for passing the token, as anyone on that machine could simply run ps and read it. Perhaps this is OK for a first cut if you are limiting access to the machines, but I think this will eventually need to be made more secure.

@tgravescs
tgravescs Aug 5, 2013

Sorry, I was wrong: environment variables are readable only by the user owning the process.

@tgravescs

@velvia I also don't follow what you are asking. The Hadoop API has changed between versions. They have backpedaled a bit in the latest Hadoop 2 releases (2.1.0-beta) to try to make most of the APIs compatible with Hadoop 1, so perhaps once Spark moves to a later Hadoop 2 version, some of these shims will not be needed.

@velvia
Contributor
velvia commented Aug 5, 2013

Ok, let me try to be more clear.

Currently, Spark is compiled directly against a specific version of a
Hadoop jar. As you discovered, this leads to problems because you have to
manually recompile Spark against different Hadoop versions.

There has been talk recently that we should try building against the
hadoop-client jar. This may allow us to have a Hadoop-version-independent
build of Spark so that you no longer need to build against a specific
Hadoop version. It would also remove a huge chain of dependencies from
the distribution.

I personally don't have experience with hadoop-client so I can't vouch for whether it would work, but it's worth trying.


@tgravescs

The hadoop2-yarn profile definitely uses APIs that do not exist in Hadoop 1. I'm not sure about the hadoop1 and hadoop2 profiles.

@AmplabJenkins

Thank you for your pull request. An admin will review this request soon.

@jerryshao
Contributor

@velvia and @tgravescs, thanks for your comments.

Passing the token between parent and child processes via an environment variable is a simple, easy-to-implement approach for now, but as you said it is not fully secure: people can read the token under /proc/. However, that relies on someone being able to log in to the machine as the same user as the process owner, so it was a deliberate compromise in my implementation.
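For what it's worth, the encoding side of that compromise is simple: the token is an opaque byte string, so it has to be turned into printable text before it can survive an environment variable. A stdlib-only sketch (the variable name SPARK_HDFS_TOKEN is made up for illustration):

```scala
// Sketch: pass opaque delegation-token bytes from a parent process to a
// child through the environment by base64-encoding them. The variable
// name is hypothetical.
object TokenEnvSketch {
  val HdfsTokenKey = "SPARK_HDFS_TOKEN"

  // Parent side: produce the (name, value) pair to put in the child's env.
  def encode(tokenBytes: Array[Byte]): (String, String) =
    (HdfsTokenKey, java.util.Base64.getEncoder.encodeToString(tokenBytes))

  // Child side: recover the raw token bytes, if the variable is present.
  def decode(env: Map[String, String]): Option[Array[Byte]] =
    env.get(HdfsTokenKey).map(s => java.util.Base64.getDecoder.decode(s))
}
```

A round trip through encode/decode returns the original bytes; the child would then feed them to the Hadoop security layer.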

As for the hadoop-client jar, I'm not familiar with it and am not sure it can solve API compatibility across versions, so I will investigate whether it is feasible.

@mateiz
Member
mateiz commented Aug 6, 2013

@jerryshao looks like the environment var can only be read by that job's user according to @tgravescs's last comment, so hopefully that's fine?

Jey Kottalam from the AMP Lab has been working on a hadoop-client version of Spark's build system, so we probably don't need to worry about that here. The key is to just make sure this works in all the versions of the Hadoop code. However, I am curious as to whether the code for this will be the same across Hadoop versions or not. Has the security API changed between Hadoop 1 and 2?

@jerryshao
Contributor

@mateiz It's not the job's user but the process owner who can read the environment var; I think they are different, and currently that's OK, as I described in my last comment.

Most of the security API is the same between Hadoop 1 and 2, except that some methods are deprecated, but it is quite different in Hadoop YARN.

@mateiz
Member
mateiz commented Aug 11, 2013

Okay, thanks. We might wait until #803 is merged to merge this, since that changes some of the ways we interact with Hadoop, but this is definitely something we want in 0.8. CCing @jey to take a look at this too.

@jey
Contributor
jey commented Aug 20, 2013

@mateiz, is this targeted at 0.8? If so, I can look at updating it to work with our current master that has #838 (hadoop agnostic builds) merged.

@mateiz
Member
mateiz commented Aug 20, 2013

Yeah, this would be nice to add in if you don't mind taking a look.

@jerryshao
Contributor

Hi @jey, do you have any design doc about the Hadoop integration? So much code has changed that I need to figure out how to update my patch.

@jey
Contributor
jey commented Aug 21, 2013

@jerryshao, I'm happy to take care of updating this PR, but had a question: does your patch provide the same functionality under YARN and has it been tested with YARN? Thanks.

@jey
Contributor
jey commented Aug 21, 2013

Here's my branch with an initial conversion of your patch. I haven't tested it against an HDFS install with security enabled yet.

https://github.com/jey/spark/tree/hdfs-auth

@jerryshao
Contributor

Hi @jey, thanks for your help. YARN already provides multi-user support and HDFS auth, added by @tgravescs, so my patch only implements this functionality for standalone mode.

I will check out your branch and run it on my secure cluster to see if it is OK.

@jerryshao
Contributor

Hi @jey, I checked out your branch and tested it on a CDH 4.1.2 cluster with and without security enabled; it seems fine. All the unit tests pass, too.

BTW, there's a problem when I run sbt/sbt gen-idea or sbt/sbt eclipse to create project files:

[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  ::          UNRESOLVED DEPENDENCIES         ::
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  :: org.apache.hadoop#hadoop-yarn-api;2.0.0-mr1-cdh4.1.2: not found
[warn]  :: org.apache.hadoop#hadoop-yarn-common;2.0.0-mr1-cdh4.1.2: not found
[warn]  :: org.apache.hadoop#hadoop-yarn-client;2.0.0-mr1-cdh4.1.2: not found
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::

Also, when I changed the Hadoop version to 1.0.4, the same problem occurred:

[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  ::          UNRESOLVED DEPENDENCIES         ::
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::
[warn]  :: org.apache.hadoop#hadoop-yarn-api;1.0.4: not found
[warn]  :: org.apache.hadoop#hadoop-yarn-common;1.0.4: not found
[warn]  :: org.apache.hadoop#hadoop-yarn-client;1.0.4: not found
[warn]  ::::::::::::::::::::::::::::::::::::::::::::::

sbt still tries to build the yarn module even when yarn is not enabled, and the above hadoop-yarn-* dependencies do not exist for these Hadoop versions. I'm not sure whether I'm misusing the command or whether something should be changed in SparkBuild.scala to treat the yarn module separately.
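One plausible shape for such a change (a hypothetical fragment, not the actual SparkBuild.scala) is to gate the yarn project on a flag, so its hadoop-yarn-* artifacts are only resolved when YARN support is requested:

```scala
// Hypothetical fragment for project/SparkBuild.scala: include the yarn
// module in the aggregate build only when an env flag is set, so sbt
// does not try to resolve hadoop-yarn-* artifacts for non-YARN builds.
// `core`, `repl`, `examples` and `yarn` stand for the build's existing
// sub-project definitions.
val isYarnEnabled = sys.env.contains("SPARK_WITH_YARN")

lazy val maybeYarnRef: Seq[ProjectReference] =
  if (isYarnEnabled) Seq(yarn) else Seq.empty

lazy val root = Project("root", file("."))
  .aggregate(Seq[ProjectReference](core, repl, examples) ++ maybeYarnRef: _*)
```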

@jey
Contributor
jey commented Aug 23, 2013

Hi @jerryshao, thanks for catching that bug. I just submitted #860 with a fix for it.

@rxin
Member
rxin commented Sep 22, 2013

Hi @jey and @jerryshao - did you guys decide on who is going to continue this patch?

@jerryshao
Contributor

Hi @rxin, I think jey's updated patch is fine; I've already tested it under CDH 4.1.2 with and without security enabled. That said, I have some remaining concerns:

  1. I don't have access to enough Hadoop versions; I've only tested under CDH 4.1.2 and Apache 1.0.4, so I'm not sure it is OK on other versions.
  2. Passing the HDFS delegation token from the worker to the executor backend via an environment variable, as in my implementation, is not elegant. I could instead send the HDFS token over Akka after the executor registers.
  3. There is no HDFS delegation token renewal mechanism. The delegation token expires after 7 days by default, so we should renew it before expiration; otherwise applications like Shark Server and Spark Streaming will fail.
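A renewal loop for that last concern could look roughly like this (a sketch only: it assumes Hadoop 2's Token#renew(Configuration) API, and the daily interval is an arbitrary value comfortably inside the 7-day token lifetime):

```scala
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.token.{Token, TokenIdentifier}

// Sketch: periodically renew the HDFS delegation token so long-running
// apps (Shark Server, Spark Streaming) outlive the default 7-day token
// lifetime. Error handling is deliberately minimal.
class TokenRenewerSketch(token: Token[_ <: TokenIdentifier], conf: Configuration) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start() {
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run() {
        try {
          token.renew(conf) // returns the new expiration time
        } catch {
          case e: Exception => // log and retry on the next tick
        }
      }
    }, 1, 1, TimeUnit.DAYS)
  }

  def stop() { scheduler.shutdownNow() }
}
```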

Generally, I will continue this patch after jey's work lands, to make it more robust.

Thanks
Jerry

@jey
Contributor
jey commented Sep 22, 2013

I've only rebased the patch and don't know anything about HDFS security and related issues, so I think it would make sense for @jerryshao to continue the patch.

@rxin
Member
rxin commented Sep 26, 2013

Hi @jerryshao - can you take this over and submit a new pr to the asf repo?

@jerryshao
Contributor

OK, I will refactor this patch and submit it to the ASF repo.

@jerryshao jerryshao closed this Sep 27, 2013