
RandomForest: need correlation table before using factors ? #584

Closed
j3r3m1 opened this issue Jul 23, 2020 · 29 comments
@j3r3m1

j3r3m1 commented Jul 23, 2020

Describe the bug
When using the randomForest classifier, it seems we need a "correlation table" to recover the real values of our predicted variable.

Expected behavior
It would be easier to be able to use the real values of the predicted variable directly.

Actual behavior
We need to convert the predicted variable values into factors ourselves in order to keep the correlation table between factors and "real values", then use the model, and then convert the resulting factors back to the "real values".
Is my understanding correct?

@haifengl
Owner

You can use real values directly.

@j3r3m1
Author

j3r3m1 commented Jul 23, 2020

Does this work for any type? When I use strings, I get the following error when fitting my RF model:

[main] ERROR org.orbisgis.orbisdata.processmanager.process.Process - Error while executing the process.
java.lang.UnsupportedOperationException: LCZ:String
at smile.data.vector.VectorImpl.toIntArray(VectorImpl.java:156)
at smile.classification.ClassLabels.fit(ClassLabels.java:121)
at smile.classification.RandomForest.fit(RandomForest.java:293)
at smile.classification.RandomForest.fit(RandomForest.java:244)
at smile.classification.RandomForest.fit(RandomForest.java:214)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
at org.codehaus.groovy.runtime.callsite.StaticMetaMethodSite$StaticMetaMethodSiteNoUnwrap.invoke(StaticMetaMethodSite.java:131)
at org.codehaus.groovy.runtime.callsite.StaticMetaMethodSite.call(StaticMetaMethodSite.java:89)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:125)

If I use integer values (1, 2, 3, 11, 12, 13), I get the following error message:

[main] ERROR org.orbisgis.orbisdata.processmanager.process.Process - Error while executing the process.
java.lang.ArrayIndexOutOfBoundsException: Index 11 out of bounds for length 6

So it seems they must be between 0 and n (n being the number of distinct "real values" minus 1)?
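That reading is consistent with the error: a classifier given a raw int[] of labels with no nominal metadata typically expects them encoded as 0..k-1. A minimal plain-Java sketch of that encoding step (illustrative only — this is not Smile code, and the class/method names are made up):

```java
import java.util.Arrays;

// Illustrative only (not Smile code): map arbitrary integer labels such as
// {1, 2, 3, 11, 12, 13} to the contiguous 0..k-1 codes a classifier can
// index into, while keeping the level array as the "correlation table".
public class LabelCodec {

    // distinct labels in ascending order, e.g. [1, 2, 3, 11, 12, 13]
    static int[] levels(int[] y) {
        return Arrays.stream(y).distinct().sorted().toArray();
    }

    // encode each label as its index in the sorted level array (0..k-1)
    static int[] encode(int[] y, int[] levels) {
        int[] codes = new int[y.length];
        for (int i = 0; i < y.length; i++) {
            codes[i] = Arrays.binarySearch(levels, y[i]);
        }
        return codes;
    }

    public static void main(String[] args) {
        int[] y = {1, 2, 3, 11, 12, 13};
        int[] lv = levels(y);
        // prints [0, 1, 2, 3, 4, 5]: a contiguous 0..k-1 range
        System.out.println(Arrays.toString(encode(y, lv)));
    }
}
```

The level array must be kept around to translate predictions back, which is exactly the bookkeeping the original question is about.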

@j3r3m1
Author

j3r3m1 commented Jul 24, 2020

Note that I use 'formula = Formula.lhs(varToModel)' as the formula (varToModel being the variable containing the column name of the response variable).

@haifengl
Owner

"Real values" means numeric values. Strings must be converted to factors.

@j3r3m1
Author

j3r3m1 commented Jul 26, 2020

OK for strings. Is the behavior I describe for integers expected?

If I use integer values (1, 2, 3, 11, 12, 13), I get the following error message:

[main] ERROR org.orbisgis.orbisdata.processmanager.process.Process - Error while executing the process.
java.lang.ArrayIndexOutOfBoundsException: Index 11 out of bounds for length 6

So it seems they must be between 0 and n (n being the number of distinct "real values" minus 1)?

@rayeaster
Contributor

@j3r3m1 you could convert strings to nominal values; check NominalScale.

@j3r3m1
Author

j3r3m1 commented Aug 6, 2020

Could you reopen the issue? I did not get an answer to my last question:

OK for strings. Is the behavior I describe for integers expected?

If I use integer values (1, 2, 3, 11, 12, 13), I get the following error message:

[main] ERROR org.orbisgis.orbisdata.processmanager.process.Process - Error while executing the process.
java.lang.ArrayIndexOutOfBoundsException: Index 11 out of bounds for length 6

So it seems they must be between 0 and n (n being the number of distinct "real values" minus 1)?

This seems like strange behavior, since you said that

You can use real values directly.

@haifengl
Owner

It is not clear how you create the DataFrame that is used in the model fitting. If you parse the data with our parsers, they will create the proper metadata (StructType) to handle noncontinuous categorical data. If you create the metadata yourself programmatically, you should set the measure of the column's StructField to NominalScale. Check out NominalScale's constructor, which can take a list of strings or arbitrary integers.

@ebocher

ebocher commented Aug 18, 2020

@j3r3m1 builds a DataFrame from a ResultSet with the DataFrame.of method.
It seems that to solve this issue, the StructField of the column that contains the predicted labels must be changed in the DataFrame, as @haifengl proposes.

Is there a short method to change the StructField in the df? Something like:

df.schema().field("columnName").nominal()

@haifengl
Owner

@ebocher DataFrame is immutable.

@haifengl
Owner

DataFrame.factorize() returns a new data frame with String (or Object) columns converted to nominal values.

@ebocher

ebocher commented Aug 18, 2020

Thanks, I will test it.

@ebocher

ebocher commented Aug 18, 2020

It works like a charm.
Thanks

@j3r3m1
Author

j3r3m1 commented Sep 1, 2020

OK, this works for the creation of the random forest. We did the following:

def formula = Formula.lhs(varToModel)
def dfFactorized = df.factorize(varToModel);
dfFactorized = dfFactorized.omitNullRows()
// Create the randomForest
def model = RandomForest.fit(formula, dfFactorized, ntrees, mtry, splitRule, maxDepth, maxNodes, nodeSize, subsample)

However, when we apply the random forest to a new dataset, the resulting vector is factorized (values from 0 to 16). Is there a way to get back to the original values (values from 1 to 10 and then from 101 to 107)?
What we did:

int[] prediction = Validation.test(model, df_var)
// We need to remove the initial predicted variable in order to not have duplicates
df=df.drop(var2model)
df=df.merge(IntVector.of(var2model, prediction))

@haifengl
Owner

haifengl commented Sep 4, 2020

Smile translates the prediction back to the label range. See L#500 in the source code. If the value is not what you want, something is wrong in how you prepare the data.
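For intuition, the translation described here amounts to indexing back into the array of original label levels. A plain-Java sketch (illustrative only — not Smile's actual implementation, and the names are made up):

```java
import java.util.Arrays;

// Illustrative sketch (not Smile code): once the label levels are known,
// a 0..k-1 class index is translated back to the original value by simple
// array indexing — conceptually what a call like labels.valueOf(index)
// does on the classifier's side.
public class LabelDecode {

    // map each raw class index back to its original label value
    static int[] decode(int[] codes, int[] levels) {
        return Arrays.stream(codes).map(c -> levels[c]).toArray();
    }

    public static void main(String[] args) {
        int[] levels = {105, 106, 107};     // original label values
        int[] prediction = {0, 2, 2, 0, 1}; // raw class indices
        // prints [105, 107, 107, 105, 106]
        System.out.println(Arrays.toString(decode(prediction, levels)));
    }
}
```

If the model's metadata carries the levels (e.g. via a NominalScale measure), this mapping happens internally; if it does not, the raw 0..k-1 indices are what comes out, which matches the behavior reported below.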

@j3r3m1
Author

j3r3m1 commented Sep 7, 2020

OK, good news. However, I can't find the L#500 in your code; which class is it?

@j3r3m1
Author

j3r3m1 commented Sep 7, 2020

OK I have found it thanks.

return labels.valueOf(MathEx.whichMax(y));

@j3r3m1
Author

j3r3m1 commented Sep 8, 2020

OK, but the problem is still not solved. Let's look at the problem using the following example:

// Dataset used
def data = [
    [105, 'grass', 12, 20], [107, 'corn', 12, 30], [106, 'corn', 18, 20],
    [105, 'grass', 12, 30], [105, 'corn', 16, 20], [106, 'grass', 12, 20], [107, 'forest', 12, 20],
    [106, 'grass', 12, 20], [106, 'grass', 16, 20], [106, 'grass', 16, 20], [107, 'corn', 16, 20],
    [107, 'corn', 2, 20], [107, 'corn', 16, 50], [107, 'corn', 18, 40]
]
DataFrame df = DataFrame.of(data, "LCZ", "TYPE", "TEMPERATURE", "WIND");

// Now we need to factorize the columns "LCZ" and "TYPE" in order to use the random forest
def dfFactorized = df.factorize("LCZ", "TYPE")

// Then we define the characteristics of the randomForest:
def formula = Formula.lhs("LCZ")
def splitRule = SplitRule.valueOf("GINI")
def ntrees              = 2
def mtry                =2
def maxDepth            =2
def maxNodes            =5
def nodeSize            = 3
def subsample           = 1.0

// At this point we have the "LCZ" column as predicted variable, defined as nominal. No problem here since we still have the correspondence between our "LCZ" values (105, 106, 107) and 0, 1, 2.
// Then we create the randomForest
RandomForest model = RandomForest.fit(formula, dfFactorized, ntrees, mtry, splitRule, maxDepth, maxNodes, nodeSize, subsample)

// Finally, when we apply the random forest
int[] prediction = Validation.test(model, dfFactorized)

Results should be 105, 106 or 107, but we get 0, 1, 2. The correspondence between 105, 106, 107 and 0, 1, 2 is only preserved in the formula, so we have to get it (in Groovy) with the following command:

model.formula().@binding.inputSchema.field("LCZ").measure.value2level

BUT this information is not preserved by default in the model, so if we save the model we lose it. Do you have an alternative solution to get back to the 105, 106 and 107 values?

@ebocher

ebocher commented Sep 8, 2020

Note that this method:

model.formula().@binding.inputSchema.field("LCZ").measure.value2level

doesn't work when the model is loaded after an XStream serialization.

@haifengl
Owner

haifengl commented Sep 8, 2020

Is it Groovy? I am not familiar with it. Note that Java doesn't allow mixed types in an array. I guess that the underlying JVM type of data is Object[][]. Therefore, DataFrame.of() converts all columns to the Object type (ObjectVector). Then factorize converts them to String and then NominalScale.

Note that we chose Java because it is a strongly typed language. But we don't have a proper data type in this case.

BTW, DataFrame.of() can also take a generic List<T>. If you can define your data as a class/struct, Smile will figure out the data types by reflection.

Another workaround is to define each column individually using the corresponding properly typed vector class (e.g. IntVector, DoubleVector in package smile.data.vector). Then you can create a DataFrame from these vectors. Good luck.

@j3r3m1
Author

j3r3m1 commented Sep 8, 2020

Thank you for your answer. Let's look at the problem in Java then. The following example leads to the same problem. Do you have a solution to solve it within Smile?

@Test
    void test() {
        // Dataset used
        List<Tuple> data = new ArrayList<>();
        List<StructField> fields = new ArrayList<>();
        fields.add(new StructField("LCZ", DataType.of(Integer.class)));
        fields.add(new StructField("TYPE", DataType.of(String.class)));
        fields.add(new StructField("TEMPERATURE", DataType.of(Integer.class)));
        fields.add(new StructField("WIND", DataType.of(Integer.class)));
        StructType structType = new StructType(fields);
        data.add(Tuple.of(new Object[]{105, "grass", 12, 20}, structType));
        data.add(Tuple.of(new Object[]{107, "corn", 12, 30}, structType));
        data.add(Tuple.of(new Object[]{106, "corn", 18, 20}, structType));
        data.add(Tuple.of(new Object[]{105, "grass", 12, 30}, structType));
        data.add(Tuple.of(new Object[]{105, "corn", 16, 20}, structType));
        data.add(Tuple.of(new Object[]{106, "grass", 12, 20}, structType));
        data.add(Tuple.of(new Object[]{107, "forest" ,12, 20}, structType));
        data.add(Tuple.of(new Object[]{106, "grass" ,12, 20}, structType));
        data.add(Tuple.of(new Object[]{106, "grass" ,16, 20}, structType));
        data.add(Tuple.of(new Object[]{107, "corn", 16, 20}, structType));
        data.add(Tuple.of(new Object[]{107, "corn", 2, 20}, structType));
        data.add(Tuple.of(new Object[]{107, "corn", 16, 50}, structType));
        data.add(Tuple.of(new Object[]{107, "corn", 18, 40}, structType));
        DataFrame df = DataFrame.of(data, structType);

// Now we need to factorize the columns "LCZ" and "TYPE" in order to use the random forest
        DataFrame dfFactorized = df.factorize("LCZ", "TYPE");

// Then we define the characteristics of the randomForest:
        Formula formula = Formula.lhs("LCZ");
        SplitRule splitRule = SplitRule.valueOf("GINI");
        int ntrees = 2;
        int mtry = 2;
        int maxDepth = 2;
        int maxNodes = 5;
        int nodeSize = 3;
        double subsample = 1.0;

// At this point we have the "LCZ" column as predicted variable, defined as nominal. No problem here since we still have the correspondence between our "LCZ" values (105, 106, 107) and 0, 1, 2.
// Then we create the randomForest
        RandomForest model = RandomForest.fit(formula, dfFactorized, ntrees, mtry, splitRule, maxDepth, maxNodes, nodeSize, subsample);

// Finally, when we apply the random forest
        int[] prediction = Validation.test(model, dfFactorized);
// We get [0, 2, 2, 0, 2, 0, 0, 0, 0, 2, 2, 2, 2] and we want [105, 107, 107, 105, 107, ...]
        System.out.println(Arrays.toString(prediction));
    }

@haifengl
Owner

haifengl commented Sep 8, 2020

You don't really understand my previous comment. DO NOT use Object[] as your row type. Also, the data type in the schema is WRONG:

fields.add(new StructField("LCZ", DataType.of(Integer.class)));

It should be DataTypes.IntegerType. DataTypes.IntegerType and DataType.of(Integer.class) are different things.

@SPalominos

Here is an updated version of the test using vectors (is this usage OK?) instead of tuples:

    @Test
    void test() {
        // Dataset used
        BaseVector[] bv = new BaseVector[]{
                IntVector.of("LCZ", new int[]{105, 107, 106, 105, 105, 106, 107, 106, 106, 107, 107, 107, 107}),
                StringVector.of("TYPE", new String[]{"grass", "corn", "corn", "grass", "corn", "grass", "forest", "grass", "grass", "corn", "corn", "corn", "corn"}),
                IntVector.of("TEMPERATURE", new int[]{12, 12, 18, 12, 16, 12, 12, 12, 16, 16, 2, 16, 18}),
                IntVector.of("WIND", new int[]{20, 30, 20, 30, 20, 20, 20, 20, 20, 20, 20, 50, 40}),
        };
        DataFrame df = DataFrame.of(bv);

// Now we need to factorize the columns "LCZ" and "TYPE" in order to use the random forest
        DataFrame dfFactorized = df.factorize("LCZ", "TYPE");

// Then we define the characteristics of the randomForest:
        Formula formula = Formula.lhs("LCZ");
        SplitRule splitRule = SplitRule.valueOf("GINI");
        int ntrees = 2;
        int mtry = 2;
        int maxDepth = 2;
        int maxNodes = 5;
        int nodeSize = 3;
        double subsample = 1.0;

// At this point we have the "LCZ" column as predicted variable, defined as nominal. No problem here since we still have the correspondence between our "LCZ" values (105, 106, 107) and 0, 1, 2.
// Then we create the randomForest
        RandomForest model = RandomForest.fit(formula, dfFactorized, ntrees, mtry, splitRule, maxDepth, maxNodes, nodeSize, subsample);

// Finally, when we apply the random forest
        int[] prediction = Validation.test(model, dfFactorized);
// We get [0, 2, 2, 0, 2, 0, 0, 0, 0, 2, 2, 2, 2] and we want [105, 107, 107, 105, 107, ...]
        System.out.println(Arrays.toString(prediction));
    }

and we still get [0, 2, 2, 0, 2, 0, 0, 0, 0, 2, 2, 2, 2] as output, but we need instead [105, 107, 107, 105, 107, 105, 105, 105, 105, 107, 107, 107, 107] (the original values from the LCZ int vector).

@haifengl
Owner

haifengl commented Sep 9, 2020

Just do

DataFrame dfFactorized = df.factorize("TYPE");

@ebocher

ebocher commented Sep 9, 2020

So

@Test
    void test() {
        // Dataset used
        BaseVector[] bv = new BaseVector[]{
                IntVector.of("LCZ", new int[]{105, 107, 106, 105, 105, 106, 107, 106, 106, 107, 107, 107, 107}),
                StringVector.of("TYPE", new String[]{"grass", "corn", "corn", "grass", "corn", "grass", "forest", "grass", "grass", "corn", "corn", "corn", "corn"}),
                IntVector.of("TEMPERATURE", new int[]{12, 12, 18, 12, 16, 12, 12, 12, 16, 16, 2, 16, 18}),
                IntVector.of("WIND", new int[]{20, 30, 20, 30, 20, 20, 20, 20, 20, 20, 20, 50, 40}),
        };
        DataFrame df = DataFrame.of(bv);

// Now we need to factorize the columns "LCZ" and "TYPE" in order to use the random forest
        DataFrame dfFactorized = df.factorize("TYPE");

// Then we define the characteristics of the randomForest:
        Formula formula = Formula.lhs("LCZ");
        SplitRule splitRule = SplitRule.valueOf("GINI");
        int ntrees = 2;
        int mtry = 2;
        int maxDepth = 2;
        int maxNodes = 5;
        int nodeSize = 3;
        double subsample = 1.0;

// At this point we have the "LCZ" column as predicted variable, defined as nominal. No problem here since we still have the correspondence between our "LCZ" values (105, 106, 107) and 0, 1, 2.
// Then we create the randomForest
        RandomForest model = RandomForest.fit(formula, dfFactorized, ntrees, mtry, splitRule, maxDepth, maxNodes, nodeSize, subsample);

// Finally, when we apply the random forest
        int[] prediction = Validation.test(model, dfFactorized);
// We get [0, 2, 2, 0, 2, 0, 0, 0, 0, 2, 2, 2, 2] and we want [105, 107, 107, 105, 107, ...]
        System.out.println(Arrays.toString(prediction));
    }

it still does not work. It returns:

java.lang.ArrayIndexOutOfBoundsException: 105

	at smile.classification.RandomForest.fit(RandomForest.java:312)
	at smile.classification.RandomForest.fit(RandomForest.java:244)
	at smile.classification.RandomForest.fit(RandomForest.java:214)
	at smile.classification.RandomForest$fit.call(Unknown Source)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:125)

@haifengl
Owner

haifengl commented Sep 9, 2020

I revised the code to make both of your examples work. Please build the master branch.

@haifengl
Owner

The fix is in release 2.5.3. Please give it a try. Thanks.

@ebocher

ebocher commented Sep 24, 2020

Thanks, we will test it.

@ebocher

ebocher commented Sep 27, 2020

Thanks, both examples work well.
