Using a generated case class with Encoders.bean #4

Closed
david-2012 opened this Issue Mar 6, 2017 · 4 comments

david-2012 commented Mar 6, 2017

Hi @julianpeeters,
I used this project to generate case classes from the CSV content loaded into a Spark DataFrame, and I'd like to convert the DataFrame to a Dataset, which requires an Encoder (since Encoders support basic types and case classes).
However, it always throws an exception like the following. Do you have any comments or thoughts on this? Much appreciated!

java.lang.UnsupportedOperationException: Cannot infer type for class models.Station because it is not bean-compliant
at org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$serializerFor(JavaTypeInference.scala:416)
at org.apache.spark.sql.catalyst.JavaTypeInference$.serializerFor(JavaTypeInference.scala:327)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:82)
at org.apache.spark.sql.Encoders$.bean(Encoders.scala:141)
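
For reference, "bean-compliant" here means a class with a public no-arg constructor and matching get/set accessor pairs, which is what JavaTypeInference reflects on. A minimal sketch of such a class in Scala, assuming Spark 2.x (the StationBean name is illustrative, not from my actual code):

import scala.beans.BeanProperty
import org.apache.spark.sql.Encoders

// Bean-compliant: public no-arg constructor plus getStcd/setStcd and friends,
// generated here by @BeanProperty on mutable fields.
class StationBean {
  @BeanProperty var stcd: String = _
  @BeanProperty var stnm: String = _
  @BeanProperty var lgtd: Double = _
  @BeanProperty var lttd: Double = _
}

// This is the shape Encoders.bean can infer a schema for.
val stationBeanEncoder = Encoders.bean(classOf[StationBean])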

My code sample is attached below:

import scala.reflect.runtime.universe.typeOf
import org.apache.spark.sql.Encoders

val hydrologyStationDF = spark.sqlContext.read.format("csv").option("header", true).csv(filePath)

val valueMembers: List[FieldData] = List(
  FieldData("STCD", typeOf[String]), FieldData("STNM", typeOf[String]),
  FieldData("LGTD", typeOf[Double]), FieldData("LTTD", typeOf[Double]))
val classData = new ClassData(ClassNamespace(Some("models")), ClassName("Station"), ClassFieldData(valueMembers))
val dcc = new DynamicCaseClass(classData)
val record1 = dcc.runtimeInstance

type MyRecord = record1.type
import spark.implicits._
val points2 = hydrologyStationDF.as(Encoders.bean(record1.getClass()))
Owner

julianpeeters commented Mar 8, 2017

Hi David,

I've used Spark with Avro before, but never Spark SQL, and never with CSV files, so I don't have a good answer. I'll note a few things, though:

  1. These DynamicCaseClasses are very strange runtime beasts. They are a quick fix for tech debt or a bad design, and they are interesting for experiments, but they are almost never the right thing to use in a normal situation (like this one appears to be).

  2. Aside from the features described in the readme, DCCs can essentially only be accessed via reflection, so I wouldn't expect passing one into "normal" Spark machinery to work. It looks like JavaTypeInference may be trying to reflect, but (a) Scala is not Java and its fields are private, and (b) DCCs are Frankensteined together at runtime and may be missing a piece that JavaTypeInference needs.

  3. Pardon me for not fully understanding your code before I comment, but it looks like you know quite a bit about your data at compile time, so I wonder: why use this project? Maybe those names are just stand-ins for this question and you truly need to look into the files to define the classes, but if you know about the data at compile time, why not define the classes by hand instead of as a DCC (see the sketch after this list)? If you really do need to look into the file in order to define the class, then why not do it at compile time, since you know the path at compile time? (Conceded, perhaps it's impractical to fire up a small Spark job at compile time, but that's out of my ken right now.)
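
To make (2) and (3) concrete: a hand-written case class exposes Scala-style accessors (STCD()) rather than the bean getters (getSTCD()) that JavaTypeInference looks for, and with a compile-time class the implicit product encoder applies directly. An untested sketch, assuming Spark 2.x; the app name, file path, and column types are all illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("stations").master("local[*]").getOrCreate()
import spark.implicits._

// A hand-written case class matching the fields in the snippet above.
case class Station(STCD: String, STNM: String, LGTD: Double, LTTD: Double)

// (2a) No bean getters: case-class fields compile to private JVM fields
// with Scala accessors, so getSTCD does not exist but STCD() does.
classOf[Station].getMethods.exists(_.getName == "getSTCD") // false
classOf[Station].getMethods.exists(_.getName == "STCD")    // true

// (3) With a compile-time class, spark.implicits supplies the encoder;
// inferSchema is assumed here so LGTD/LTTD load as doubles.
val df = spark.read.option("header", true).option("inferSchema", true).csv("stations.csv")
val points = df.as[Station]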

Sorry I can't offer any better insight. Good luck!

david-2012 commented Mar 8, 2017

Hi @julianpeeters,

Thanks for your kind, detailed explanation.
Sorry for the unclear description of the issue; let me elaborate.

  1. The CSV file schema changes when different data is loaded. For the hydrologyStation DataFrame I know the CSV schema well and can predefine a case class to convert the loaded Dataset[Row] to a Dataset[SomeCaseClass], which allows more flexible operations. However, when another CSV file with a different schema is loaded, the previously defined case class won't work, and there is no specific encoder to convert that Dataset[Row] to a Dataset[SomeCaseClass].
  2. I can see some differences between the DynamicCaseClasses in this project and normal Scala case classes, so I changed my mind. My new solution, which may not be so beautiful (sketched below):
    a. define some general case classes
    b. rename the columns of the loaded Dataset[Row] to align with the case class
    c. use the matching case class as the encoder to convert the renamed Dataset[Row] to a Dataset[SomeCaseClass]
    d. after all the operations, rename the columns of the Dataset back
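
A rough sketch of steps (a)-(d), assuming a generic two-string/two-double shape; all names here are placeholders, and the column types are assumed to already match the case class:

import spark.implicits._

// (a) a general case class for any two-string/two-double schema
case class Str2Dbl2(s1: String, s2: String, d1: Double, d2: Double)

// (b) rename the loaded columns to align with the case class
val renamed = hydrologyStationDF.toDF("s1", "s2", "d1", "d2")

// (c) use the matching case class as the encoder
val typed = renamed.as[Str2Dbl2]

// ... flexible typed operations on Dataset[Str2Dbl2] ...

// (d) rename the columns back when done
val restored = typed.toDF("STCD", "STNM", "LGTD", "LTTD")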
david-2012 commented Mar 8, 2017

So I guess this issue could be closed.

Owner

julianpeeters commented Mar 8, 2017

👍
