org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection has grown past JVM limit of 0xFFFF #485
We started getting this error on wide datasets after upgrading to the latest SW 2.2.3. It was not happening on the previous SW release, 2.2.2.

Code:

This error happens on a dataframe with ~3k variables, but doesn't happen on a dataframe with ~800 columns, for example. Again, SW 2.2.2 did not have this problem on the same data and the same code.
Comments
This seems to be https://issues.apache.org/jira/browse/SPARK-18016? I wonder why we didn't hit this on 2.2.2.
@jakubhava I opened a support case with Cloudera.
Hi @Tagar, thanks for the investigation! The logic for converting a Spark DataFrame into an H2OFrame lives right here: https://github.com/h2oai/sparkling-water/blob/master/core/src/main/scala/org/apache/spark/h2o/converters/SparkDataFrameConverter.scala
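For context, the user-facing call that exercises this converter from Python looks roughly like the sketch below. This is a hedged reconstruction, not the reporter's actual code: the dataset path and the `spark` session name are assumptions, and it relies on the standard PySparkling entry points.

```python
from pysparkling import H2OContext

# Assumes an existing SparkSession bound to the name `spark`.
hc = H2OContext.getOrCreate(spark)

# Read a wide dataset and convert it; the conversion below is the step
# that goes through SparkDataFrameConverter on the Scala side.
spark_df = spark.read.parquet("/path/to/wide_dataset.parquet")  # hypothetical path
h2o_frame = hc.as_h2o_frame(spark_df)
```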
@Tagar @jakubhava it is an interesting problem. I tracked the differences between 2.2.2..2.2.3 but did not find any reasonable explanation. There are several potential changes, like this one, but I do not see a reason for them triggering https://issues.apache.org/jira/browse/SPARK-18016. @jakubhava WDYT?
Thank you for looking into this, @jakubhava and @mmalohlava. Cloudera Support confirms it is directly related to SPARK-18016. I also asked users to try setting spark.sql.codegen.wholeStage to false.
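For anyone who wants to try that workaround, the setting can be flipped per session. A minimal sketch, assuming an existing SparkSession named `spark`:

```python
# Disabling whole-stage code generation makes Spark fall back to the
# interpreted execution path, avoiding some oversized generated classes
# at the cost of slower query execution.
spark.conf.set("spark.sql.codegen.wholeStage", "false")
```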
@Tagar Could you try Spark 2.2.1? Apparently a lot of those 64KB JVM bytecode limit bugs are fixed now. The limit on the number of columns you experienced sounds a lot like what I ran into with MLlib: GLM with 500 variables ran fine, but with 2k variables it errored out. See https://issues.apache.org/jira/browse/SPARK-22761, one of those 64KB bytecode limit bugs that is apparently fixed in Spark 2.2.1.
@axiomoixa We use Cloudera's Spark 2.2 build - they sometimes remove certain patches and, on the other hand, backport certain other fixes. I have updated my Cloudera case and asked whether those 2.2.1 fixes for the "64KB JVM bytecode limit" made their way into Cloudera's build - thank you for pointing that out.
I just had a quick look and I couldn't find any particular change in Sparkling Water which would cause such a dramatic increase in the number of columns, @Tagar.
@axiomoixa I can see some 64KB bytecode fixes already in 2.2.1, at least as stated in the release notes: https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315420&version=12340470
@Tagar @axiomoixa I think it was caused by this change in Sparkling Water: https://0xdata.atlassian.net/browse/SW-499. It is a good change, though. Before, BinaryType was simply ignored (no error thrown), but now it is properly handled: when we have, for example, an Array[Byte] in a Spark DataFrame, it is expanded into a lot of new columns, which is probably the cause of the exception. Could you please share the schema of the data you are converting, @Tagar? Or at least let us know whether any of the fields is of BinaryType.
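To illustrate what such a field looks like: a minimal sketch of a DataFrame with a BinaryType column (the Spark SQL type backing Array[Byte]); the column names and values here are made up.

```python
from pyspark.sql.types import StructType, StructField, StringType, BinaryType

schema = StructType([
    StructField("id", StringType()),
    StructField("payload", BinaryType()),  # Array[Byte] on the JVM side
])
df = spark.createDataFrame([("a", bytearray(b"\x01\x02\x03"))], schema)
df.printSchema()
# root
#  |-- id: string (nullable = true)
#  |-- payload: binary (nullable = true)
```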
@jakubhava thank you for this information. Yes, it seems to be a good improvement.
What would be an example of such a datatype? Does it mean SW-499 creates an enum-like structure for categorical features? We use PySpark primarily.
I will upload the schema to H2O ticket https://support.h2o.ai/support/tickets/91559 if that's okay with you, as I can't share the schema in the public domain.
@jakubhava I updated the H2O case with the complete schema.
So I am not sure where a BinaryType would be coming from. Thank you.
@Tagar BinaryType is the type used to represent Array[Byte]. If the dataframe contains only simple types such as numeric and string columns, this change should not affect it.
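A quick way to check a dataframe for such fields from PySpark, as a sketch (assuming the dataframe in question is bound to `df`):

```python
from pyspark.sql.types import BinaryType

# Lists every column whose Spark SQL type is BinaryType; an empty list
# means the SW-499 expansion cannot be the source of extra columns.
binary_cols = [f.name for f in df.schema.fields
               if isinstance(f.dataType, BinaryType)]
print(binary_cols)
```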
My last candidate is this change: #429, particularly this line: sparkling-water/core/src/main/scala/org/apache/spark/h2o/utils/H2OSchemaUtils.scala, line 111 at commit 0fa5510.
During each conversion we call this method, newly added in 2.2.3, to create a new dataframe with possibly renamed columns. Spark, however, internally calls this method with needsConversion set to true: it creates a projection and then builds a Dataset out of the converted data. That projection might be what triggers the exception above.
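In PySpark terms, the suspicious step behaves like renaming every column of an already wide dataframe in one go, which makes Catalyst generate a projection over all of them. A rough sketch (the renaming rule here is hypothetical, for illustration only):

```python
# Renaming all columns at once produces a single Project node; for a
# dataframe with thousands of columns, the generated
# SpecificUnsafeProjection class can exceed JVM size limits.
sanitized = [c.replace(".", "_") for c in df.columns]  # hypothetical rule
renamed_df = df.toDF(*sanitized)
```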
Kuba
@Tagar I think this change https://github.com/h2oai/sparkling-water/pull/497/files might actually help in your case; however, I still need to test it. If you know how to build Sparkling Water and want to give it a try as well, feel free to build it from this PR: https://github.com/h2oai/sparkling-water/pull/497/files
Closing this issue as it is fixed by #497. Note, however, that this is just an optimisation of our code so that it does not create additional dataframes/columns. The original issue still exists in Spark and can be reproduced with a really large number of columns; without upgrading Spark, there is currently not much we can do.
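For completeness, the underlying Spark limitation can be reproduced without Sparkling Water by projecting a sufficiently wide dataframe. A sketch (the column count needed to trip the limit depends on the Spark version and the expressions involved):

```python
from pyspark.sql import functions as F

n_cols = 4000  # wide enough that the generated projection may blow up
wide = spark.range(1).select(
    *[F.lit(float(i)).alias("c%d" % i) for i in range(n_cols)]
)

# A per-column transformation forces code generation of an
# UnsafeProjection over every column at once.
wide.select(*[(F.col(c) * 2).alias(c) for c in wide.columns]).collect()
```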
Users confirm this issue is fixed now. Also, the root cause, https://issues.apache.org/jira/browse/SPARK-18016, was fixed and committed to Spark 2.3 today. Thanks a lot.
Hi @Tagar, a new Sparkling Water release is out with this and additional fixes.
Thank you @jakubhava! We will upgrade to 2.2.6 tonight. |