Java RExpParser exception #6

Open
svazzole opened this issue Sep 19, 2017 · 5 comments

@svazzole

svazzole commented Sep 19, 2017

Hi,
thanks a lot for your work!
I noticed a problem when working with JPMML-R: if I have a feature matrix whose feature names contain certain special characters (such as &), the package throws an exception connected to RExpParser. By contrast, the JPMML-Sklearn package is not affected by this behaviour: it creates an XML file in which the character "&" in the names is correctly replaced by "&amp;".
Do you think this is a problem? If so, can you fix it?
Best,
Simon

@vruusmann
Member

if I have a feature matrix whose feature names contain certain special characters (such as &), the package throws an exception connected to RExpParser.

Can you paste the full stack trace of this exception here?

Better yet, can you provide a reproducible example (a toy dataset and an R script) that I could play with?

@svazzole
Author

svazzole commented Sep 19, 2017

Here is the output of the command.
I will provide a precise example as soon as possible.

D:\jpmml-r-master>java -Xms4G -Xmx16G -jar target/converter-executable-1.2-SNAPSHOT.jar --rds-input LibSVMAnomalyFormulaReq.rds --pmml-output model.pmml
set 19, 2017 4:59:39 PM org.jpmml.rexp.Main run
INFORMAZIONI: Parsing RDS..
Exception in thread "main" java.lang.StackOverflowError
        at java.io.DataInputStream.readInt(Unknown Source)
        at org.jpmml.rexp.XDRInput.readInt(XDRInput.java:62)
        at org.jpmml.rexp.RExpParser.readInt(RExpParser.java:481)
        at org.jpmml.rexp.RExpParser.readRExp(RExpParser.java:67)
        at org.jpmml.rexp.RExpParser.readPairList(RExpParser.java:155)
        at org.jpmml.rexp.RExpParser.readRExp(RExpParser.java:74)
        at org.jpmml.rexp.RExpParser.readFunctionCall(RExpParser.java:218)
        at org.jpmml.rexp.RExpParser.readRExp(RExpParser.java:82)
        at org.jpmml.rexp.RExpParser.readPairList(RExpParser.java:155)
        at org.jpmml.rexp.RExpParser.readRExp(RExpParser.java:74)
        at org.jpmml.rexp.RExpParser.readFunctionCall(RExpParser.java:218)
        at org.jpmml.rexp.RExpParser.readRExp(RExpParser.java:82)
        at org.jpmml.rexp.RExpParser.readPairList(RExpParser.java:155)
        at org.jpmml.rexp.RExpParser.readRExp(RExpParser.java:74)
        at org.jpmml.rexp.RExpParser.readFunctionCall(RExpParser.java:218)
        at org.jpmml.rexp.RExpParser.readRExp(RExpParser.java:82)
        at org.jpmml.rexp.RExpParser.readPairList(RExpParser.java:155)
        at org.jpmml.rexp.RExpParser.readRExp(RExpParser.java:74)
        at org.jpmml.rexp.RExpParser.readFunctionCall(RExpParser.java:218)
        at org.jpmml.rexp.RExpParser.readRExp(RExpParser.java:82)
        at org.jpmml.rexp.RExpParser.readPairList(RExpParser.java:155)
        at org.jpmml.rexp.RExpParser.readRExp(RExpParser.java:74)
        at org.jpmml.rexp.RExpParser.readFunctionCall(RExpParser.java:218)
        at org.jpmml.rexp.RExpParser.readRExp(RExpParser.java:82)

@vruusmann
Member

Very interesting - the RDS parser component appears to go into an infinite loop.
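For illustration only: the repeating readRExp / readPairList / readFunctionCall frames in the trace are what a recursive reader produces on either a cyclic or a very deeply nested structure. The sketch below is a Python analogy, not JPMML-R code. The helpers `nest`, `count_recursive` and `count_iterative` are hypothetical, and the idea that the RDS file holds a deeply nested call (for example a formula with very many terms) is an unconfirmed assumption. It shows a recursive walk failing where an explicit-stack walk over the same data succeeds.

```python
def nest(depth):
    """Build a left-nested call like (((x0 + x1) + x2) + ...), as tuples."""
    node = "x0"
    for i in range(1, depth + 1):
        node = ("+", node, f"x{i}")
    return node


def count_recursive(node):
    """Count leaf symbols with one Python call per nesting level."""
    if isinstance(node, str):
        return 1
    _, left, right = node
    return count_recursive(left) + count_recursive(right)


def count_iterative(node):
    """Count leaf symbols with an explicit stack instead of recursion."""
    stack = [node]
    leaves = 0
    while stack:
        current = stack.pop()
        if isinstance(current, str):
            leaves += 1
        else:
            # Skip the operator at index 0, push both operands.
            stack.extend(current[1:])
    return leaves


deep = nest(50_000)
leaf_count = count_iterative(deep)  # works regardless of nesting depth

try:
    count_recursive(deep)
    recursion_failed = False
except RecursionError:
    # Python's counterpart of the Java StackOverflowError in the trace.
    recursion_failed = True
```

Note that -Xms and -Xmx in the command size the JVM heap; the thread stack is sized separately with -Xss, so raising it may be worth a try if the nesting is deep but finite.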

A reproducible example would be much appreciated. Can you share your LibSVMAnomalyFormulaReq.rds RDS file, which is very nicely broken?

In your R script, can you temporarily work around this issue by escaping variable names? For example, try surrounding them with backticks as suggested here:
https://stackoverflow.com/questions/3574385/can-i-escape-characters-in-variable-names

@svazzole
Author

Ok, I will try to explain myself better.
Unfortunately I cannot send you the data (for privacy reasons).
I will try to build a toy model with the same errors.
What I can tell you is that the feature names contain 4-grams of Apache logs (so something like "GET ", "ET /", "T /g" and so on).
I'm trying to do anomaly detection on the requests, so I'm building a One-Class SVM (both in R and Python).
When I use Python there are no problems with the variable names, while in R I had to use the following trick: I changed all the variable names to "X1X", "X2X", "X3X" and so on. This fixed the problem, and the jpmml-r package correctly performed the rds --> pmml conversion. Then I changed the variable names back in the PMML file, taking into account that "&" --> "&amp;". This created the correct model, and the results agreed with the Python one.
Here I have another question: I'm trying to use the generated PMML files inside a Scala program. While the results from R and Python agree (as I said before), the results from the Scala One-Class SVM model are quite different. Do you have any ideas about this? Could this be an issue with Scala (I'm thinking about machine precision) or something with the One-Class SVM (and libsvm)?
Thanks for your time.
Best,
Simon
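
The renaming workaround described above can be sketched roughly as follows. This is a minimal Python sketch, not part of JPMML-R: the function names `placeholder_names` and `restore_names` are hypothetical, and it assumes the placeholders appear in the PMML text only as whole tokens of the form X1X, X2X and so on. The original names are XML-escaped on the way back in, so "&" becomes "&amp;".

```python
import re
from xml.sax.saxutils import escape


def placeholder_names(original_names):
    """Map safe placeholders X1X, X2X, ... to the original feature names."""
    return {f"X{i}X": name for i, name in enumerate(original_names, start=1)}


def restore_names(pmml_text, placeholders):
    """Swap placeholders in PMML text back to XML-escaped original names."""
    pattern = re.compile(r"\bX\d+X\b")

    def substitute(match):
        token = match.group(0)
        if token not in placeholders:
            return token  # leave unknown tokens untouched
        # Escape &, <, > and double quotes so the value is safe in an attribute.
        return escape(placeholders[token], {'"': "&quot;"})

    return pattern.sub(substitute, pmml_text)


# Made-up feature names and a made-up PMML fragment for illustration.
mapping = placeholder_names(["GET ", "a&b"])
fragment = '<DataField name="X2X"/><DataField name="X1X"/>'
restored = restore_names(fragment, mapping)
```

In R, the placeholders would be assigned to the column names before fitting the model; the substitution would then be run once on the converted PMML file.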

@vruusmann
Member

The PMML standard (and the JPMML implementation of it) does not have a concept of reserved symbols/keywords. For example, the string & would be a perfectly acceptable field name. There is no need to escape it as \& or &amp; - honey badger don't care.

The problem is specific to the R platform, because R has the concept of reserved symbols/keywords. The problem would probably be resolved by escaping variable names properly - did you try using backticks as suggested above? It is no wonder that the RDS parser gets confused when the RDS model file contains incorrect RDS strings. Sure, it would be nice if the RDS parser were able to detect and recover from such a situation, but you as an R end user can prevent this situation from happening in the first place.
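
A small sketch of the point about "&": the "&amp;" seen in the file is only the XML serialization of the character, and an XML parser turns it back into "&" when the file is read. The example uses Python's standard xml.etree.ElementTree rather than JPMML, and the field name "a&b" is made up. DataField and its name, optype and dataType attributes are taken from the PMML schema.

```python
import xml.etree.ElementTree as ET

field_name = "a&b"  # made-up field name containing an ampersand

# Build a PMML-style DataField element with the raw, unescaped name.
data_field = ET.Element(
    "DataField",
    {"name": field_name, "optype": "continuous", "dataType": "double"},
)

# Serializing escapes the ampersand in the attribute value.
xml_text = ET.tostring(data_field, encoding="unicode")

# Parsing reverses the escaping, so the field name itself is unchanged.
round_tripped = ET.fromstring(xml_text).get("name")
```

So a tool that edits the PMML file as plain text has to write "&amp;", while a tool that goes through an XML library just sets the name to "&".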
