Tests fail on spark-1.6.X branch with pyspark 1.6.3 #5

Closed
robertjrodger opened this issue May 31, 2017 · 8 comments

@robertjrodger

Hello,

The Maven build, as outlined in the README, goes fine, but the suggested test fails with the following output:

======================================================================
ERROR: testWorkflow (jpmml_sparkml.tests.JPMMLTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/rodgerr/dev/jpmml-sparkml-package/src/main/python/jpmml_sparkml/tests/__init__.py", line 41, in testWorkflow
    pmmlBytes = toPMMLBytes(self.sc, df, pipelineModel)
  File "/Users/rodgerr/dev/jpmml-sparkml-package/src/main/python/jpmml_sparkml/__init__.py", line 22, in toPMMLBytes
    return javaConverter.toPMMLByteArray(javaSchema, javaPipelineModel)
  File "/Users/rodgerr/miniconda2/lib/python2.7/site-packages/py4j/java_gateway.py", line 1154, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/opt/apache-spark@1.6/libexec/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/Users/rodgerr/miniconda2/lib/python2.7/site-packages/py4j/protocol.py", line 320, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling z:org.jpmml.sparkml.ConverterUtil.toPMMLByteArray.
: java.lang.NoClassDefFoundError: com/google/common/collect/Iterables
	at org.jpmml.sparkml.FeatureMapper.getOnlyFeature(FeatureMapper.java:216)
	at org.jpmml.sparkml.feature.StringIndexerModelConverter.encodeFeatures(StringIndexerModelConverter.java:42)
	at org.jpmml.sparkml.FeatureMapper.append(FeatureMapper.java:71)
	at org.jpmml.sparkml.feature.RFormulaModelConverter.encodeFeatures(RFormulaModelConverter.java:60)
	at org.jpmml.sparkml.FeatureMapper.append(FeatureMapper.java:71)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:117)
	at org.jpmml.sparkml.ConverterUtil.toPMMLByteArray(ConverterUtil.java:213)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:209)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Iterables
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 18 more

In case it's relevant, I'm using py4j-0.10.5, which was released after the most recent branch commit.

@vruusmann
Member

I've tested JPMML-SparkML-Package with Apache Spark 1.6.0, 1.6.1, and 1.6.2, but not with 1.6.3. It must be the case that the Google Guava dependency was relocated between versions 1.6.2 and 1.6.3.

You can bypass the tests like this:

$ mvn -Dmaven.test.skip=true clean install

At runtime, simply add the Google Guava dependency (com.google.guava:guava:[16.0, 20.0]) to your application classpath.
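
For example, a sketch assuming the application is launched with spark-submit (the script name my_app.py is just a placeholder; any Guava version in the [16.0, 20.0] range should do):

$ spark-submit --packages com.google.guava:guava:19.0 my_app.py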

I will investigate potential fixes. One option is to introduce a build profile that builds a "fat" JAR (which includes Guava) for Apache Spark version 1.6.3, and a "thin" JAR (which excludes Guava) for all earlier versions.
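
A rough sketch of what such a profile might look like (hypothetical; the profile id and the use of maven-shade-plugin are illustrative only, and Guava would have to be moved out of provided scope for shading to pick it up):

<profiles>
	<profile>
		<id>fat-jar</id>
		<build>
			<plugins>
				<plugin>
					<groupId>org.apache.maven.plugins</groupId>
					<artifactId>maven-shade-plugin</artifactId>
					<executions>
						<execution>
							<phase>package</phase>
							<goals>
								<goal>shade</goal>
							</goals>
							<configuration>
								<artifactSet>
									<includes>
										<!-- Pull Guava classes into the JAR -->
										<include>com.google.guava:guava</include>
									</includes>
								</artifactSet>
							</configuration>
						</execution>
					</executions>
				</plugin>
			</plugins>
		</build>
	</profile>
</profiles>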

@robertjrodger
Author

I removed Apache Spark 1.6.3 and installed 1.6.0; again, the Maven build succeeds, but the nosetests fail with the same traceback.

@vruusmann
Member

Believe it or not, everything works as advertised on my computer:

$ export SPARK_HOME=/opt/spark-1.6.2/
$ export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python
$ mvn -Ppyspark clean install
$ cd src/main/python
$ nosetests

End of the output:

17/05/31 18:36:45 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
.
----------------------------------------------------------------------
Ran 1 test in 12.075s

OK
17/05/31 18:36:46 INFO util.ShutdownHookManager: Shutdown hook called
17/05/31 18:36:46 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-afe222bc-6d29-4220-b50c-01abaf059e51

@vruusmann
Member

Maybe it's some Apache Spark packaging issue? What is the name of your distribution: is it the "with Hadoop" or the "without Hadoop" edition?

@robertjrodger
Author

robertjrodger commented May 31, 2017

Spark 1.6.0 Pre-built for Apache Hadoop 2.6, tarball downloaded from spark.apache.org/downloads.html; I get the same result with Spark 1.6.2 Pre-built for Apache Hadoop 2.6. Using Spark 2.0.0 Pre-built for Apache Hadoop 2.7, the tests pass on both branches.

Could it have to do with the jpmml-sparkml Maven JAR? Which Spark distribution do you use?

@asnare

asnare commented Jun 16, 2017

Classpath misery, for the win.

I've just been trying to help Robert understand what's going on here. I must confess I'm a little lost:

  • Guava is declared (upstream) in jpmml-sparkml as version 13.0 with provided scope.
  • The Spark 1.6 binary packages from the Apache project don't include Guava in a usable form. (They have a shaded version.)

So I'm a bit perplexed about where the runtime Guava dependency is supposed to come from.

@vruusmann
Member

So I'm a bit perplexed about where the runtime Guava dependency is supposed to come from.

In your application project directory, execute the Apache Maven command mvn dependency:tree and look for occurrences of "guava".
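
If Guava is on the classpath, the output should contain a line along these lines (the version and scope will vary; 13.0/provided is what jpmml-sparkml itself declares):

[INFO] +- com.google.guava:guava:jar:13.0:provided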

The availability of Guava depends on the Apache Spark version (1.6.X vs 2.0.X) and the packaging ("with Hadoop" or "without Hadoop").

In Robert's application environment (#5 (comment)), there is no Guava dependency available (as indicated by java.lang.NoClassDefFoundError: com/google/common/collect/Iterables). Therefore, add the following to your pom.xml, rebuild, and redeploy:

<dependency>
	<groupId>com.google.guava</groupId>
	<artifactId>guava</artifactId>
	<version>19.0</version>
</dependency>

@robertjrodger
Author

With your suggested addition to the pom.xml, the new build passes the tests. Thank you for looking into this! Additionally, I will open a pull request with the revised pom.xml since, as you mention, there is no guarantee that the user's Spark distribution includes Guava.

vruusmann added a commit to jpmml/jpmml-sparkml that referenced this issue Jun 25, 2017