
RandomForest: need correlation table before using factors ? #584

Closed
j3r3m1 opened this issue Jul 23, 2020 · 29 comments
@j3r3m1

j3r3m1 commented Jul 23, 2020

Describe the bug
When using the randomForest classifier, it seems we need a "correlation table" to recover the real values of our predicted variable.

Expected behavior
It would be easier to be able to use the real values of the predicted variable directly.

Actual behavior
We need to convert the predicted variable values into factors ourselves in order to keep the correlation table between factors and "real values", then use the model, and then convert the resulting factors back to the "real values".
Is my understanding correct?

@haifengl
Owner

You can use real values directly.

@j3r3m1
Author

j3r3m1 commented Jul 23, 2020

Does this work for any type? When I use strings, I get the following error when fitting my RF model:

[main] ERROR org.orbisgis.orbisdata.processmanager.process.Process - Error while executing the process.
java.lang.UnsupportedOperationException: LCZ:String
at smile.data.vector.VectorImpl.toIntArray(VectorImpl.java:156)
at smile.classification.ClassLabels.fit(ClassLabels.java:121)
at smile.classification.RandomForest.fit(RandomForest.java:293)
at smile.classification.RandomForest.fit(RandomForest.java:244)
at smile.classification.RandomForest.fit(RandomForest.java:214)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:107)
at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:323)
at org.codehaus.groovy.runtime.callsite.StaticMetaMethodSite$StaticMetaMethodSiteNoUnwrap.invoke(StaticMetaMethodSite.java:131)
at org.codehaus.groovy.runtime.callsite.StaticMetaMethodSite.call(StaticMetaMethodSite.java:89)
at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:125)

If I use integer values (1, 2, 3, 11, 12, 13), I get the following error message:

[main] ERROR org.orbisgis.orbisdata.processmanager.process.Process - Error while executing the process.
java.lang.ArrayIndexOutOfBoundsException: Index 11 out of bounds for length 6

So it seems they must be between 0 and n (n being the number of distinct "real values" minus 1)?
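That reading is consistent with the error: a classifier given a raw int[] of labels with no nominal metadata typically expects them encoded as 0..k-1. A minimal plain-Java sketch of that encoding step (illustrative only — this is not Smile code, and the class/method names are made up):

```java
import java.util.Arrays;

// Illustrative only (not Smile code): map arbitrary integer labels such as
// {1, 2, 3, 11, 12, 13} to the contiguous 0..k-1 codes a classifier can
// index into, while keeping the level array as the "correlation table".
public class LabelCodec {

    // distinct labels in ascending order, e.g. [1, 2, 3, 11, 12, 13]
    static int[] levels(int[] y) {
        return Arrays.stream(y).distinct().sorted().toArray();
    }

    // encode each label as its index in the sorted level array (0..k-1)
    static int[] encode(int[] y, int[] levels) {
        int[] codes = new int[y.length];
        for (int i = 0; i < y.length; i++) {
            codes[i] = Arrays.binarySearch(levels, y[i]);
        }
        return codes;
    }

    public static void main(String[] args) {
        int[] y = {1, 2, 3, 11, 12, 13};
        int[] lv = levels(y);
        // prints [0, 1, 2, 3, 4, 5]: a contiguous 0..k-1 range
        System.out.println(Arrays.toString(encode(y, lv)));
    }
}
```

The level array must be kept around to translate predictions back, which is exactly the bookkeeping the original question is about.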

@j3r3m1
Author

j3r3m1 commented Jul 24, 2020

Note that I use 'formula = Formula.lhs(varToModel)' as the formula (varToModel being the variable containing the column name of the response variable).

@haifengl
Owner

"Real values" means numeric values. Strings must be converted to factors.

@j3r3m1
Author

j3r3m1 commented Jul 26, 2020

OK for strings. Is the behavior I describe for integers expected?

If I use integer values (1, 2, 3, 11, 12, 13), I get the following error message:

[main] ERROR org.orbisgis.orbisdata.processmanager.process.Process - Error while executing the process.
java.lang.ArrayIndexOutOfBoundsException: Index 11 out of bounds for length 6

So it seems they must be between 0 and n (n being the number of distinct "real values" minus 1)?

@rayeaster
Contributor

@j3r3m1 you could convert strings to nominal values; check NominalScale.

@j3r3m1
Author

j3r3m1 commented Aug 6, 2020

Could you reopen the issue? I did not get an answer to my last question:

OK for strings. Is the behavior I describe for integers expected?

If I use integer values (1, 2, 3, 11, 12, 13), I get the following error message:

[main] ERROR org.orbisgis.orbisdata.processmanager.process.Process - Error while executing the process.
java.lang.ArrayIndexOutOfBoundsException: Index 11 out of bounds for length 6

So it seems they must be between 0 and n (n being the number of distinct "real values" minus 1)?

This seems like strange behavior, since you said that

You can use real values directly.

@haifengl
Owner

It is not clear how you create the DataFrame that is used in the model fitting. If you parse the data with our parsers, they will create the proper metadata (StructType) to handle noncontinuous categorical data. If you create the metadata yourself programmatically, you should set the measure of the column's StructField to NominalScale. Check out NominalScale's constructor, which can take a list of strings or arbitrary integers.

@ebocher

ebocher commented Aug 18, 2020

@j3r3m1 builds a DataFrame from a ResultSet with the DataFrame.of method.
It seems that to solve this issue, the StructField of the column that contains the predicted labels must be changed in the DataFrame, as @haifengl proposes.

Is there a short method to change the StructField in the df? Something like:

df.schema().field("columnName").nominal()

@haifengl
Owner

@ebocher DataFrame is immutable.

@haifengl
Owner

DataFrame.factorize() returns a new data frame with String (or Object) columns converted to nominal values.

@ebocher

ebocher commented Aug 18, 2020

Thanks, I will test it.

@ebocher

ebocher commented Aug 18, 2020

It works like a charm.
Thanks

@j3r3m1
Author

j3r3m1 commented Sep 1, 2020

OK, this works for the creation of the random forest. We did the following:

def formula = Formula.lhs(varToModel)
def dfFactorized = df.factorize(varToModel);
dfFactorized = dfFactorized.omitNullRows()
// Create the randomForest
def model = RandomForest.fit(formula, dfFactorized, ntrees, mtry, splitRule, maxDepth, maxNodes, nodeSize, subsample)

However, when we apply the random forest to a new dataset, the resulting vector is factorized (values from 0 to 16). Is there a way to get back to the original values (values from 1 to 10 and then from 101 to 107)?
What we did:

int[] prediction = Validation.test(model, df_var)
// We need to remove the initial predicted variable in order to not have duplicates
df=df.drop(var2model)
df=df.merge(IntVector.of(var2model, prediction))

@haifengl
Owner

haifengl commented Sep 4, 2020

Smile translates the prediction back to the label range. See L#500 in the source code. If the value is not what you want, something is wrong in how you prepare the data.
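For intuition, the translation described here amounts to indexing back into the array of original label levels. A plain-Java sketch (illustrative only — not Smile's actual implementation, and the names are made up):

```java
import java.util.Arrays;

// Illustrative sketch (not Smile code): once the label levels are known,
// a 0..k-1 class index is translated back to the original value by simple
// array indexing — conceptually what a call like labels.valueOf(index)
// does on the classifier's side.
public class LabelDecode {

    // map each raw class index back to its original label value
    static int[] decode(int[] codes, int[] levels) {
        return Arrays.stream(codes).map(c -> levels[c]).toArray();
    }

    public static void main(String[] args) {
        int[] levels = {105, 106, 107};     // original label values
        int[] prediction = {0, 2, 2, 0, 1}; // raw class indices
        // prints [105, 107, 107, 105, 106]
        System.out.println(Arrays.toString(decode(prediction, levels)));
    }
}
```

If the model's metadata carries the levels (e.g. via a NominalScale measure), this mapping happens internally; if it does not, the raw 0..k-1 indices are what comes out, which matches the behavior reported below.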

@j3r3m1
Author

j3r3m1 commented Sep 7, 2020

OK, good news. However, I can't find the L#500 in your code; which class is it?

@j3r3m1
Author

j3r3m1 commented Sep 7, 2020

OK I have found it thanks.

return labels.valueOf(MathEx.whichMax(y));

@j3r3m1
Author

j3r3m1 commented Sep 8, 2020

OK, but the problem is still not solved. Let's look at the problem using the following example:

// Dataset used
def data = [
    [105, 'grass', 12, 20], [107, 'corn', 12, 30], [106, 'corn', 18, 20],
    [105, 'grass', 12, 30], [105, 'corn', 16, 20], [106, 'grass', 12, 20], [107, 'forest', 12, 20],
    [106, 'grass', 12, 20], [106, 'grass', 16, 20], [106, 'grass', 16, 20], [107, 'corn', 16, 20],
    [107, 'corn', 2, 20], [107, 'corn', 16, 50], [107, 'corn', 18, 40]
]
DataFrame df = DataFrame.of(data, "LCZ", "TYPE", "TEMPERATURE", "WIND");

// Now we need to factorize the columns "LCZ" and "TYPE" in order to use the random forest
def dfFactorized = df.factorize("LCZ", "TYPE")

// Then we define the characteristics of the randomForest:
def formula = Formula.lhs("LCZ")
def splitRule = SplitRule.valueOf("GINI")
def ntrees              = 2
def mtry                =2
def maxDepth            =2
def maxNodes            =5
def nodeSize            = 3
def subsample           = 1.0

// At this point we have the "LCZ" column as predicted variable, defined as nominal. No problem here since we still have the correspondence between our "LCZ" values (105, 106, 107) and 0, 1, 2.
// Then we create the randomForest
RandomForest model = RandomForest.fit(formula, dfFactorized, ntrees, mtry, splitRule, maxDepth, maxNodes, nodeSize, subsample)

// Finally, when we apply the random forest
int[] prediction = Validation.test(model, dfFactorized)

Results should be 105, 106 or 107, but we get 0, 1, 2. The correspondence between 105, 106, 107 and 0, 1, 2 is only preserved in the formula, so we have to get it (in Groovy) with the following command:

model.formula().@binding.inputSchema.field("LCZ").measure.value2level

BUT this information is not preserved by default in the model, so if we save the model we lose it. Do you have an alternative solution to get back to the 105, 106 and 107 values?

@ebocher

ebocher commented Sep 8, 2020

Note that this method:

model.formula().@binding.inputSchema.field("LCZ").measure.value2level

doesn't work when the model is loaded after an XStream serialization.

@haifengl
Owner

haifengl commented Sep 8, 2020

Is it Groovy? I am not familiar with it. Note that Java doesn't allow mixed types in an array. I guess that the underlying JVM type of data is Object[][]. Therefore, DataFrame.of() converts all columns to the Object type (ObjectVector). Then factorize converts them to String and then NominalScale.

Note that we chose Java because it is a strongly typed language. But we don't have a proper data type in this case.

BTW, DataFrame.of() can also take a generic List<T>. If you can define your data as a class/struct, Smile will figure out the data types by reflection.

Another workaround is to define each column individually using the corresponding properly typed vector class (e.g. IntVector, DoubleVector in package smile.data.vector). Then you can create a DataFrame from these vectors. Good luck.

@j3r3m1
Author

j3r3m1 commented Sep 8, 2020

Thank you for your answer. Let's look at the problem in Java then. The following example leads to the same problem. Do you have a solution to solve it within Smile?

@Test
    void test() {
        // Dataset used
        List<Tuple> data = new ArrayList<>();
        List<StructField> fields = new ArrayList<>();
        fields.add(new StructField("LCZ", DataType.of(Integer.class)));
        fields.add(new StructField("TYPE", DataType.of(String.class)));
        fields.add(new StructField("TEMPERATURE", DataType.of(Integer.class)));
        fields.add(new StructField("WIND", DataType.of(Integer.class)));
        StructType structType = new StructType(fields);
        data.add(Tuple.of(new Object[]{105, "grass", 12, 20}, structType));
        data.add(Tuple.of(new Object[]{107, "corn", 12, 30}, structType));
        data.add(Tuple.of(new Object[]{106, "corn", 18, 20}, structType));
        data.add(Tuple.of(new Object[]{105, "grass", 12, 30}, structType));
        data.add(Tuple.of(new Object[]{105, "corn", 16, 20}, structType));
        data.add(Tuple.of(new Object[]{106, "grass", 12, 20}, structType));
        data.add(Tuple.of(new Object[]{107, "forest" ,12, 20}, structType));
        data.add(Tuple.of(new Object[]{106, "grass" ,12, 20}, structType));
        data.add(Tuple.of(new Object[]{106, "grass" ,16, 20}, structType));
        data.add(Tuple.of(new Object[]{107, "corn", 16, 20}, structType));
        data.add(Tuple.of(new Object[]{107, "corn", 2, 20}, structType));
        data.add(Tuple.of(new Object[]{107, "corn", 16, 50}, structType));
        data.add(Tuple.of(new Object[]{107, "corn", 18, 40}, structType));
        DataFrame df = DataFrame.of(data, structType);

// Now we need to factorize the columns "LCZ" and "TYPE" in order to use the random forest
        DataFrame dfFactorized = df.factorize("LCZ", "TYPE");

// Then we define the characteristics of the randomForest:
        Formula formula = Formula.lhs("LCZ");
        SplitRule splitRule = SplitRule.valueOf("GINI");
        int ntrees = 2;
        int mtry = 2;
        int maxDepth = 2;
        int maxNodes = 5;
        int nodeSize = 3;
        double subsample = 1.0;

// At this point we have the "LCZ" column as predicted variable, defined as nominal. No problem here since we still have the correspondence between our "LCZ" values (105, 106, 107) and 0, 1, 2.
// Then we create the randomForest
        RandomForest model = RandomForest.fit(formula, dfFactorized, ntrees, mtry, splitRule, maxDepth, maxNodes, nodeSize, subsample);

// Finally, when we apply the random forest
        int[] prediction = Validation.test(model, dfFactorized);
// We get [0, 2, 2, 0, 2, 0, 0, 0, 0, 2, 2, 2, 2] and we want [105, 107, 107, 105, 107, ...]
        System.out.println(Arrays.toString(prediction));
    }

@haifengl
Owner

haifengl commented Sep 8, 2020

You don't really understand my previous comment. DO NOT use Object[] as your row type. Also, the data type in the schema is WRONG:

fields.add(new StructField("LCZ", DataType.of(Integer.class)));

It should be DataTypes.IntegerType. DataTypes.IntegerType and DataType.of(Integer.class) are different things.

@SPalominos

Here is an updated version of the test using vectors (is this usage OK?) instead of tuples:

    @Test
    void test() {
        // Dataset used
        BaseVector[] bv = new BaseVector[]{
                IntVector.of("LCZ", new int[]{105, 107, 106, 105, 105, 106, 107, 106, 106, 107, 107, 107, 107}),
                StringVector.of("TYPE", new String[]{"grass", "corn", "corn", "grass", "corn", "grass", "forest", "grass", "grass", "corn", "corn", "corn", "corn"}),
                IntVector.of("TEMPERATURE", new int[]{12, 12, 18, 12, 16, 12, 12, 12, 16, 16, 2, 16, 18}),
                IntVector.of("WIND", new int[]{20, 30, 20, 30, 20, 20, 20, 20, 20, 20, 20, 50, 40}),
        };
        DataFrame df = DataFrame.of(bv);

// Now we need to factorize the columns "LCZ" and "TYPE" in order to use the random forest
        DataFrame dfFactorized = df.factorize("LCZ", "TYPE");

// Then we define the characteristics of the randomForest:
        Formula formula = Formula.lhs("LCZ");
        SplitRule splitRule = SplitRule.valueOf("GINI");
        int ntrees = 2;
        int mtry = 2;
        int maxDepth = 2;
        int maxNodes = 5;
        int nodeSize = 3;
        double subsample = 1.0;

// At this point we have the "LCZ" column as predicted variable, defined as nominal. No problem here since we still have the correspondence between our "LCZ" values (105, 106, 107) and 0, 1, 2.
// Then we create the randomForest
        RandomForest model = RandomForest.fit(formula, dfFactorized, ntrees, mtry, splitRule, maxDepth, maxNodes, nodeSize, subsample);

// Finally, when we apply the random forest
        int[] prediction = Validation.test(model, dfFactorized);
// We get [0, 2, 2, 0, 2, 0, 0, 0, 0, 2, 2, 2, 2] and we want [105, 107, 107, 105, 107, ...]
        System.out.println(Arrays.toString(prediction));
    }

and we still get [0, 2, 2, 0, 2, 0, 0, 0, 0, 2, 2, 2, 2] as output, but we need instead [105, 107, 107, 105, 107, 105, 105, 105, 105, 107, 107, 107, 107] (the original values from the LCZ int vector).

@haifengl
Owner

haifengl commented Sep 9, 2020

Just do

DataFrame dfFactorized = df.factorize("TYPE");

@ebocher

ebocher commented Sep 9, 2020

So

@Test
    void test() {
        // Dataset used
        BaseVector[] bv = new BaseVector[]{
                IntVector.of("LCZ", new int[]{105, 107, 106, 105, 105, 106, 107, 106, 106, 107, 107, 107, 107}),
                StringVector.of("TYPE", new String[]{"grass", "corn", "corn", "grass", "corn", "grass", "forest", "grass", "grass", "corn", "corn", "corn", "corn"}),
                IntVector.of("TEMPERATURE", new int[]{12, 12, 18, 12, 16, 12, 12, 12, 16, 16, 2, 16, 18}),
                IntVector.of("WIND", new int[]{20, 30, 20, 30, 20, 20, 20, 20, 20, 20, 20, 50, 40}),
        };
        DataFrame df = DataFrame.of(bv);

// Now we need to factorize the columns "LCZ" and "TYPE" in order to use the random forest
        DataFrame dfFactorized = df.factorize("TYPE");

// Then we define the characteristics of the randomForest:
        Formula formula = Formula.lhs("LCZ");
        SplitRule splitRule = SplitRule.valueOf("GINI");
        int ntrees = 2;
        int mtry = 2;
        int maxDepth = 2;
        int maxNodes = 5;
        int nodeSize = 3;
        double subsample = 1.0;

// At this point we have the "LCZ" column as predicted variable, defined as nominal. No problem here since we still have the correspondence between our "LCZ" values (105, 106, 107) and 0, 1, 2.
// Then we create the randomForest
        RandomForest model = RandomForest.fit(formula, dfFactorized, ntrees, mtry, splitRule, maxDepth, maxNodes, nodeSize, subsample);

// Finally, when we apply the random forest
        int[] prediction = Validation.test(model, dfFactorized);
// We get [0, 2, 2, 0, 2, 0, 0, 0, 0, 2, 2, 2, 2] and we want [105, 107, 107, 105, 107, ...]
        System.out.println(Arrays.toString(prediction));
    }

it still does not work. It returns:

java.lang.ArrayIndexOutOfBoundsException: 105

	at smile.classification.RandomForest.fit(RandomForest.java:312)
	at smile.classification.RandomForest.fit(RandomForest.java:244)
	at smile.classification.RandomForest.fit(RandomForest.java:214)
	at smile.classification.RandomForest$fit.call(Unknown Source)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:125)

@haifengl
Owner

haifengl commented Sep 9, 2020

I revised the code to make both of your examples work. Please build the master branch.

@haifengl
Owner

The fix is in release 2.5.3. Please give it a try. Thanks.

@ebocher

ebocher commented Sep 24, 2020

Thanks, we will test it.

@ebocher

ebocher commented Sep 27, 2020

Thanks, both examples work well.
