RandomForest: need correlation table before using factors? #584
Comments
You can use real values directly. |
Does it work for any type? When I use strings, I get the following error when fitting my RF model:
If I use integer values (1, 2, 3, 11, 12, 13), I get the following error message:
So it seems they must be between 0 and n (n being the number of different "real values" minus 1)? |
Note that I use 'formula = Formula.lhs(varToModel)' as the formula (varToModel being the variable containing the column name of the response variable). |
"Real values" means numeric values. Strings must be converted to factors. |
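For illustration, here is a minimal plain-Java sketch (no Smile dependency; the class name is made up) of what converting strings to factors amounts to: each distinct string gets an integer level, and the resulting level table is what later lets you map factor indices back to the original values.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class FactorizeSketch {
    public static void main(String[] args) {
        String[] type = {"grass", "corn", "corn", "grass", "forest"};

        // Assign each distinct string a level index, in order of first appearance
        Map<String, Integer> levels = new LinkedHashMap<>();
        int[] codes = new int[type.length];
        for (int i = 0; i < type.length; i++) {
            levels.putIfAbsent(type[i], levels.size());
            codes[i] = levels.get(type[i]);
        }

        System.out.println(levels);                 // {grass=0, corn=1, forest=2}
        System.out.println(Arrays.toString(codes)); // [0, 1, 1, 0, 2]
    }
}
```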
OK for strings. Is the behavior I describe for integers expected?
|
@j3r3m1 you could convert |
Could you reopen the issue? I did not get an answer to my last question.
It seems like strange behavior, since you said that
|
It is not clear how you create |
@j3r3m1 build a dataframe from a ResultSet with the DataFrame.of method. Is there a short method to change the StructField in the df? Something like df.schema().field("columnName").nominal() |
@ebocher |
|
Thanks, I will test it. |
It works like a charm. |
OK, it works for the creation of the random forest. We did the following:

```groovy
def formula = Formula.lhs(varToModel)
def dfFactorized = df.factorize(varToModel)
dfFactorized = dfFactorized.omitNullRows()
// Create the randomForest
def model = RandomForest.fit(formula, dfFactorized, ntrees, mtry, splitRule, maxDepth, maxNodes, nodeSize, subsample)
```

However, when we apply the randomForest to a new dataset, the resulting vector is factorized (values from 0 to 16). Is there a way to get back to the original values (from 1 to 10 and then from 101 to 107)?

```groovy
int[] prediction = Validation.test(model, df_var)
// We need to remove the initial predicted variable in order not to have duplicates
df = df.drop(var2model)
df = df.merge(IntVector.of(var2model, prediction))
```
|
Smile translates the prediction back to the label range; see L#500 in the source code. If the value is not what you want, something is wrong in how you prepare the data. |
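Should you ever end up with raw factor indices anyway, translating them back is just an array lookup over the level table built when the response was factorized. A minimal plain-Java sketch (the level table here is hypothetical, matching the 105/106/107 example discussed later in this thread):

```java
import java.util.Arrays;

public class DecodePrediction {
    public static void main(String[] args) {
        // Hypothetical level table saved when the response column was factorized:
        // factor 0 -> 105, factor 1 -> 106, factor 2 -> 107
        int[] levelToLabel = {105, 106, 107};

        // Raw factor indices, as a classifier might return them
        int[] prediction = {0, 2, 2, 0, 1};

        // Translate each index back to its original label
        int[] labels = Arrays.stream(prediction).map(p -> levelToLabel[p]).toArray();

        System.out.println(Arrays.toString(labels)); // [105, 107, 107, 105, 106]
    }
}
```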
OK, good news. However, I'm missing the link to L#500 in your code; which class is it? |
OK, I have found it, thanks.
|
OK, but the problem is still not solved. Let's look at the problem with the following example:

```groovy
// Dataset used
def data = [
    [105, 'grass', 12, 20], [107, 'corn', 12, 30], [106, 'corn', 18, 20],
    [105, 'grass', 12, 30], [105, 'corn', 16, 20], [106, 'grass', 12, 20], [107, 'forest', 12, 20],
    [106, 'grass', 12, 20], [106, 'grass', 16, 20], [106, 'grass', 16, 20], [107, 'corn', 16, 20],
    [107, 'corn', 2, 20], [107, 'corn', 16, 50], [107, 'corn', 18, 40]
]
DataFrame df = DataFrame.of(data, "LCZ", "TYPE", "TEMPERATURE", "WIND")
// Now we need to factorize the columns "LCZ" and "TYPE" in order to use the random forest
def dfFactorized = df.factorize("LCZ", "TYPE")
// Then we define the characteristics of the randomForest:
def formula = Formula.lhs("LCZ")
def splitRule = SplitRule.valueOf("GINI")
def ntrees = 2
def mtry = 2
def maxDepth = 2
def maxNodes = 5
def nodeSize = 3
def subsample = 1.0
// At this point we have the "LCZ" column as the predicted variable, defined as nominal. No problem here,
// since we still have the correspondence between our "LCZ" values (105, 106, 107) and 0, 1, 2.
// Then we create the randomForest
RandomForest model = RandomForest.fit(formula, dfFactorized, ntrees, mtry, splitRule, maxDepth, maxNodes, nodeSize, subsample)
// Finally, when we apply the random forest
int[] prediction = Validation.test(model, dfFactorized)
```

Results should be 105, 106 or 107, but we get 0, 1, 2. The correspondence between 105, 106, 107 and 0, 1, 2 is only conserved in the formula, so we have to fetch it (in Groovy) with the following command:

```groovy
model.formula().@binding.inputSchema.field("LCZ").measure.value2level
```

BUT this information is not conserved by default in the model, so if we save the model we lose it. Do you have an alternative solution to get back to 105, 106 and 107? |
Note that this method doesn't work when the model is loaded after an XStream serialization. |
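One defensive option, if serialization drops the level table, is to persist the mapping yourself next to the serialized model so it survives whatever serializer is used. A minimal plain-Java sketch (file name and values are made up for illustration):

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;

public class LevelTableIO {
    public static void main(String[] args) throws IOException {
        // Hypothetical level table for the response column, kept outside the model
        int[] levelToLabel = {105, 106, 107};

        // Write it next to the serialized model (here: a temp file for the demo)
        File f = File.createTempFile("lcz-levels", ".txt");
        try (PrintWriter out = new PrintWriter(f)) {
            for (int label : levelToLabel) out.println(label);
        }

        // Later, after deserializing the model, reload the table
        List<Integer> loaded = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(f))) {
            String line;
            while ((line = in.readLine()) != null) loaded.add(Integer.parseInt(line));
        }
        System.out.println(loaded); // [105, 106, 107]
    }
}
```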
Is it Groovy? I am not familiar with it. Note that Java doesn't allow mixed types in an array; I guess the underlying JVM type of your data is Object. Note that we chose Java because it is a strongly typed language, but we don't have a proper data type in this case. BTW, another workaround is to define each column individually by calling the corresponding proper-typed class (e.g. IntVector, StringVector). |
Thank you for your answer. Let's look at the problem in Java, then. The following example leads to the same problem. Do you have a solution to solve it within Smile?

```java
@Test
void test() {
    // Dataset used
    List<Tuple> data = new ArrayList<>();
    List<StructField> fields = new ArrayList<>();
    fields.add(new StructField("LCZ", DataType.of(Integer.class)));
    fields.add(new StructField("TYPE", DataType.of(String.class)));
    fields.add(new StructField("TEMPERATURE", DataType.of(Integer.class)));
    fields.add(new StructField("WIND", DataType.of(Integer.class)));
    StructType structType = new StructType(fields);
    data.add(Tuple.of(new Object[]{105, "grass", 12, 20}, structType));
    data.add(Tuple.of(new Object[]{107, "corn", 12, 30}, structType));
    data.add(Tuple.of(new Object[]{106, "corn", 18, 20}, structType));
    data.add(Tuple.of(new Object[]{105, "grass", 12, 30}, structType));
    data.add(Tuple.of(new Object[]{105, "corn", 16, 20}, structType));
    data.add(Tuple.of(new Object[]{106, "grass", 12, 20}, structType));
    data.add(Tuple.of(new Object[]{107, "forest", 12, 20}, structType));
    data.add(Tuple.of(new Object[]{106, "grass", 12, 20}, structType));
    data.add(Tuple.of(new Object[]{106, "grass", 16, 20}, structType));
    data.add(Tuple.of(new Object[]{107, "corn", 16, 20}, structType));
    data.add(Tuple.of(new Object[]{107, "corn", 2, 20}, structType));
    data.add(Tuple.of(new Object[]{107, "corn", 16, 50}, structType));
    data.add(Tuple.of(new Object[]{107, "corn", 18, 40}, structType));
    DataFrame df = DataFrame.of(data, structType);
    // Now we need to factorize the columns "LCZ" and "TYPE" in order to use the random forest
    DataFrame dfFactorized = df.factorize("LCZ", "TYPE");
    // Then we define the characteristics of the randomForest:
    Formula formula = Formula.lhs("LCZ");
    SplitRule splitRule = SplitRule.valueOf("GINI");
    int ntrees = 2;
    int mtry = 2;
    int maxDepth = 2;
    int maxNodes = 5;
    int nodeSize = 3;
    double subsample = 1.0;
    // At this point we have the "LCZ" column as the predicted variable, defined as nominal. No problem here,
    // since we still have the correspondence between our "LCZ" values (105, 106, 107) and 0, 1, 2.
    // Then we create the randomForest
    RandomForest model = RandomForest.fit(formula, dfFactorized, ntrees, mtry, splitRule, maxDepth, maxNodes, nodeSize, subsample);
    // Finally, when we apply the random forest
    int[] prediction = Validation.test(model, dfFactorized);
    // We get [0, 2, 2, 0, 2, 0, 0, 0, 0, 2, 2, 2, 2] and we want [105, 107, 107, 105, 107, ...]
    System.out.println(Arrays.toString(prediction));
}
```
|
You didn't really understand my previous comment. DO NOT use
It should be |
Here is an updated version of the test using vectors (is this usage OK?) instead of tuples:

```java
@Test
void test() {
    // Dataset used
    BaseVector[] bv = new BaseVector[]{
        IntVector.of("LCZ", new int[]{105, 107, 106, 105, 105, 106, 107, 106, 106, 107, 107, 107, 107}),
        StringVector.of("TYPE", new String[]{"grass", "corn", "corn", "grass", "corn", "grass", "forest", "grass", "grass", "corn", "corn", "corn", "corn"}),
        IntVector.of("TEMPERATURE", new int[]{12, 12, 18, 12, 16, 12, 12, 12, 16, 16, 2, 16, 18}),
        IntVector.of("WIND", new int[]{20, 30, 20, 30, 20, 20, 20, 20, 20, 20, 20, 50, 40}),
    };
    DataFrame df = DataFrame.of(bv);
    // Now we need to factorize the columns "LCZ" and "TYPE" in order to use the random forest
    DataFrame dfFactorized = df.factorize("LCZ", "TYPE");
    // Then we define the characteristics of the randomForest:
    Formula formula = Formula.lhs("LCZ");
    SplitRule splitRule = SplitRule.valueOf("GINI");
    int ntrees = 2;
    int mtry = 2;
    int maxDepth = 2;
    int maxNodes = 5;
    int nodeSize = 3;
    double subsample = 1.0;
    // At this point we have the "LCZ" column as the predicted variable, defined as nominal. No problem here,
    // since we still have the correspondence between our "LCZ" values (105, 106, 107) and 0, 1, 2.
    // Then we create the randomForest
    RandomForest model = RandomForest.fit(formula, dfFactorized, ntrees, mtry, splitRule, maxDepth, maxNodes, nodeSize, subsample);
    // Finally, when we apply the random forest
    int[] prediction = Validation.test(model, dfFactorized);
    // We get [0, 2, 2, 0, 2, 0, 0, 0, 0, 2, 2, 2, 2] and we want [105, 107, 107, 105, 107, ...]
    System.out.println(Arrays.toString(prediction));
}
```

and we still have the same output. |
Just do
|
So:

```java
@Test
void test() {
    // Dataset used
    BaseVector[] bv = new BaseVector[]{
        IntVector.of("LCZ", new int[]{105, 107, 106, 105, 105, 106, 107, 106, 106, 107, 107, 107, 107}),
        StringVector.of("TYPE", new String[]{"grass", "corn", "corn", "grass", "corn", "grass", "forest", "grass", "grass", "corn", "corn", "corn", "corn"}),
        IntVector.of("TEMPERATURE", new int[]{12, 12, 18, 12, 16, 12, 12, 12, 16, 16, 2, 16, 18}),
        IntVector.of("WIND", new int[]{20, 30, 20, 30, 20, 20, 20, 20, 20, 20, 20, 50, 40}),
    };
    DataFrame df = DataFrame.of(bv);
    // Now we only factorize the column "TYPE" in order to use the random forest
    DataFrame dfFactorized = df.factorize("TYPE");
    // Then we define the characteristics of the randomForest:
    Formula formula = Formula.lhs("LCZ");
    SplitRule splitRule = SplitRule.valueOf("GINI");
    int ntrees = 2;
    int mtry = 2;
    int maxDepth = 2;
    int maxNodes = 5;
    int nodeSize = 3;
    double subsample = 1.0;
    // Then we create the randomForest
    RandomForest model = RandomForest.fit(formula, dfFactorized, ntrees, mtry, splitRule, maxDepth, maxNodes, nodeSize, subsample);
    // Finally, when we apply the random forest
    int[] prediction = Validation.test(model, dfFactorized);
    // We get [0, 2, 2, 0, 2, 0, 0, 0, 0, 2, 2, 2, 2] and we want [105, 107, 107, 105, 107, ...]
    System.out.println(Arrays.toString(prediction));
}
```

still does not work. It returns:

```
java.lang.ArrayIndexOutOfBoundsException: 105
	at smile.classification.RandomForest.fit(RandomForest.java:312)
	at smile.classification.RandomForest.fit(RandomForest.java:244)
	at smile.classification.RandomForest.fit(RandomForest.java:214)
	at smile.classification.RandomForest$fit.call(Unknown Source)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:125)
```
|
I revised the code to make both of your examples work. Please build the master branch. |
The fix is in release 2.5.3. Please give it a try. Thanks. |
Thanks we will test it. |
Thanks, both examples work well. |
Describe the bug
When using the randomForest classifier, it seems we need a "correlation table" to recover the real values of our predicted variable.
Expected behavior
It would be easier to be able to directly use the real values of the predicted variable.
Actual behavior
We need to convert the predicted variable values into factors ourselves in order to keep the correlation table between factors and "real values", then use the model, and then convert the resulting factors back to the "real values".
Do I understand correctly?