Handling columns with null values #44

Open
malathit opened this issue Jun 26, 2018 · 6 comments

Comments

@malathit

malathit commented Jun 26, 2018

Exception in thread "main" java.lang.IllegalArgumentException: Field a1 has valid values [b, a]
	at org.jpmml.converter.PMMLEncoder.toCategorical(PMMLEncoder.java:189)
	at org.jpmml.sparkml.feature.VectorIndexerModelConverter.encodeFeatures(VectorIndexerModelConverter.java:98)
	at org.jpmml.sparkml.FeatureConverter.registerFeatures(FeatureConverter.java:48)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:96)
	at org.jpmml.sparkml.ConverterUtil.toPMML(ConverterUtil.java:68)

I get the above exception when the column has null values. Any ideas on how to resolve this? Please comment if further details are needed.

@vruusmann
Member

> I get the above exception when the column has null values. Any ideas on how to resolve this?

Apply org.apache.spark.ml.feature.Imputer to this column first?

What is your Apache Spark version? How does Apache Spark handle columns with missing values - AFAIK it should also crash sooner or later.

@malathit
Author

Hi,

Thanks for the quick reply. AFAIK the org.apache.spark.ml.feature.Imputer class can be used only on float or double columns. The column that gives me the error is of String type.

I am using Apache Spark 2.2.0.
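
Since Imputer is limited to numeric columns, one option for a String column is to replace nulls with an explicit placeholder before any feature stage runs. The following is a minimal sketch, not a confirmed fix for this exception: the object name, the helper `fillMissing`, the placeholder `__missing__`, and the toy data are all assumptions made for illustration. Only `DataFrame.na.fill` is a Spark API.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FillStringNulls {
  // Hypothetical helper: replace nulls in one String column with a placeholder label.
  def fillMissing(df: DataFrame, column: String, placeholder: String = "__missing__"): DataFrame =
    df.na.fill(placeholder, Seq(column))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("fill-string-nulls").getOrCreate()
    import spark.implicits._

    // Toy stand-in for the real data: "a1" is a String column containing a null.
    val df = Seq(
      (Some("a"), 1.0),
      (Some("b"), 2.0),
      (None, 3.0)
    ).toDF("a1", "a2")

    val filled = fillMissing(df, "a1")
    filled.show()

    // After filling, no null remains in "a1".
    assert(filled.filter($"a1".isNull).count() == 0)
    spark.stop()
  }
}
```

With the nulls turned into an ordinary category, a StringIndexer would see the placeholder as just another label, so `setHandleInvalid("keep")` might not be needed for this column. Whether that avoids the converter error has not been verified here.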

@malathit
Author

malathit commented Jun 26, 2018

> How does Apache Spark handle columns with missing values - AFAIK it should also crash sooner or later.

In Apache Spark, null values in a String column can be handled by calling StringIndexer's setHandleInvalid method with the value "keep". Let me put together simplified code that reproduces the issue and share it.
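
To make the "keep" behaviour concrete, here is a small sketch on toy data (the object name and the data are assumptions, not taken from the real pipeline). It fits a StringIndexer on a column containing a null and prints the learned labels next to the transformed rows.

```scala
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

object KeepInvalidDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("keep-invalid-demo").getOrCreate()
    import spark.implicits._

    // Toy data: one null in the String column "a1".
    val df = Seq(
      (Some("a"), 1.0),
      (Some("a"), 2.0),
      (Some("b"), 3.0),
      (None, 4.0)
    ).toDF("a1", "a2")

    val indexerModel = new StringIndexer()
      .setInputCol("a1")
      .setOutputCol("a1Indexed")
      .setHandleInvalid("keep") // put invalid values into an extra bucket instead of failing
      .fit(df)

    // The labels are learned from the non-null values only.
    println(indexerModel.labels.mkString("[", ", ", "]"))

    // The null row receives an additional index beyond the learned labels.
    indexerModel.transform(df).show()
    spark.stop()
  }
}
```

The point to notice is that the indexed column can contain one more distinct value than there are labels, which matters for any later stage that counts categories.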

@malathit
Author

[Image: random-forest]
@vruusmann This is the code that produces the issue.

@vruusmann
Member

@malathit90 Sorry, I don't have time to debug images.

@malathit
Author

malathit commented Jun 27, 2018

Here is the snippet giving the error @vruusmann

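```scala
import java.io.FileOutputStream

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}
import org.jpmml.model.MetroJAXBUtil
import org.jpmml.sparkml.ConverterUtil

// Index the String column "a1"; nulls go into an extra "keep" bucket.
val a1Idx = new StringIndexer().setInputCol("a1").setOutputCol("a1Indexed").setHandleInvalid("keep")

val featureAssembler = new VectorAssembler().setInputCols(Array("a1Indexed", "a2")).setOutputCol("features")

// The label indexer is fitted up front because its labels are needed by IndexToString below.
val labelIndexer = new StringIndexer().setInputCol("a16").setOutputCol("labelIndexed").fit(zeroFilledData)

val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("featuresIndexed").setMaxCategories(15)

val classifier = new RandomForestClassifier().setLabelCol("labelIndexed").setFeaturesCol("featuresIndexed").setImpurity("gini").setPredictionCol("predictionIndexed")

val labelConverter = new IndexToString().setInputCol("predictionIndexed").setOutputCol("prediction").setLabels(labelIndexer.labels)

val pipeline = new Pipeline().setStages(Array(a1Idx, labelIndexer, featureAssembler, featureIndexer, classifier, labelConverter))

val model = pipeline.fit(zeroFilledData)

// The exception is thrown here, during conversion to PMML.
MetroJAXBUtil.marshalPMML(ConverterUtil.toPMML(df.schema, model), new FileOutputStream("/tmp/out.pmml"))
```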
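
A possible way to investigate the message "Field a1 has valid values [b, a]" is to compare what the StringIndexer and the VectorIndexer each learned for that column. The sketch below is a diagnostic on toy data, under the assumption (not confirmed in this thread) that the "keep" bucket makes the VectorIndexer see one more category than the StringIndexer has labels. The object name, the data, and `setMaxCategories(3)` are choices made for the example.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel, VectorAssembler, VectorIndexer, VectorIndexerModel}
import org.apache.spark.sql.SparkSession

object InspectCategories {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("inspect-categories").getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for zeroFilledData: "a1" has two labels and one null.
    val data = Seq(
      (Some("a"), 1.0),
      (Some("a"), 2.0),
      (Some("b"), 3.0),
      (None, 4.0)
    ).toDF("a1", "a2")

    val a1Idx = new StringIndexer().setInputCol("a1").setOutputCol("a1Indexed").setHandleInvalid("keep")
    val assembler = new VectorAssembler().setInputCols(Array("a1Indexed", "a2")).setOutputCol("features")
    // maxCategories = 3 keeps "a2" (four distinct values) continuous in this toy example.
    val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("featuresIndexed").setMaxCategories(3)

    val model = new Pipeline().setStages(Array(a1Idx, assembler, featureIndexer)).fit(data)

    // Labels known to the StringIndexer stage.
    val labels = model.stages(0).asInstanceOf[StringIndexerModel].labels
    // Category values the VectorIndexer stage found per feature index.
    val categoryMaps = model.stages(2).asInstanceOf[VectorIndexerModel].categoryMaps

    println(s"StringIndexer labels: ${labels.mkString("[", ", ", "]")}")
    println(s"VectorIndexer categories for feature 0: ${categoryMaps.get(0).map(_.keys.toSeq.sorted)}")
    spark.stop()
  }
}
```

If the two printed lists differ in size for the real data as well, that mismatch would be worth including in the report, together with the exact JPMML-SparkML version.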
