RandomForest Example #144

Closed
Mega4alik opened this issue Jan 17, 2017 · 22 comments
@Mega4alik

Hello,
I'm trying to train a RandomForest model, but I get the same result for every test sample (about 300 entries).
Here's the Java code:

RandomForest model;
model = new RandomForest(X, Y, 500);

For the same data, Python's sklearn works as expected:

clf = RandomForestClassifier(n_estimators=500)
clf = clf.fit(X, Y)

What am I missing?

Thanks

@leo-xin

leo-xin commented Jan 17, 2017

I am not the author. From reading the source, I think the reason is that you have not set the parameter subsample, which by default is set to 1.0 in Smile. In that way, there is no randomness in building the trees.

@Mega4alik
Author

@leo-xin Can you give an example? Thanks in advance

@leo-xin

leo-xin commented Jan 17, 2017

I am sorry, after reading more carefully I think I gave the wrong hint. There are two sources of randomness in building the forest: both the features and the data are randomly selected. I have no idea what is causing your problem. Can you show your detailed code?

@Mega4alik
Author

Sure,
Training Part

double[][] X = gson.fromJson(g.getFileContent(g.Path + "/temp/X.txt"), double[][].class);
int[] Y = gson.fromJson(g.getFileContent(g.Path + "/temp/Y.txt"), int[].class);
RandomForest model;
model = new RandomForest(X, Y, 500);

Prediction part

int getPrediction(double[] doubles) {
    int qIdx = model.predict(doubles);
    return qIdx;
}

@Mega4alik
Author

@haifengl can you help?

@haifengl
Owner

Hi @Mega4alik, what do you mean by "getting same result for each test"? Thanks!

@Mega4alik
Author

@haifengl Thanks for the response.
It means that the model always predicts the same class.

@haifengl
Owner

Do you have missing values in your input? And what's the range of Y?

@Mega4alik
Author

No, I don't.
There are 1500 possible classes.
As I mentioned before, Python's sklearn RandomForestClassifier works as expected on this dataset.

@haifengl
Owner

Is your Y in range of [0, 1499]? How many samples do you have?

@Mega4alik
Author

Oh no, sorry.
X has 1500 binary features (indices [0, 1499]) and the range of Y is [0, 323].
There are about 1300 samples.

@Mega4alik
Author

  • Size of X: 1500 (fixed)

@haifengl
Owner

I guess that you are using one-hot encoding, which is what sklearn expects. If you don't specify the attribute type (as in your code), Smile will assume the values are numeric. That is why it doesn't work. Instead, you should pass an array of NominalAttribute. See the details in smile.data.NominalAttribute.

Besides, if you are using one-hot encoding, please don't. Smile can handle categorical variables directly (and much more efficiently). Suppose you actually have only 10 attributes, each taking a discrete value from 0 to 149; just pass those 10 attributes to us. You will see that we are way faster than sklearn. Thanks!
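
To make the categorical route concrete, here is a minimal sketch following the hypothetical 10-attribute / 150-level example above; the toy data, class name, and variable names are made up for illustration, not your actual X and Y:

import java.util.Random;

import smile.classification.RandomForest;
import smile.data.NominalAttribute;

public class CategoricalSketch {
    public static void main(String[] args) {
        // Toy data matching the hypothetical shapes above:
        // 10 categorical attributes, each taking a value in 0..149.
        int n = 1300, p = 10, levelsPerAttr = 150, numClasses = 5;
        Random rnd = new Random(42);
        double[][] X = new double[n][p];
        int[] Y = new int[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < p; j++) X[i][j] = rnd.nextInt(levelsPerAttr);
            Y[i] = i % numClasses; // contiguous labels 0..numClasses-1, every class present
        }

        // Declare every column as nominal so the trees split on categories
        // instead of treating the indices as numeric thresholds.
        String[] levels = new String[levelsPerAttr];
        for (int v = 0; v < levelsPerAttr; v++) levels[v] = String.valueOf(v);
        NominalAttribute[] attrs = new NominalAttribute[p];
        for (int j = 0; j < p; j++) attrs[j] = new NominalAttribute("attr" + j, levels);

        RandomForest model = new RandomForest(attrs, X, Y, 500);
        System.out.println("Prediction for first row: " + model.predict(X[0]));
    }
}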

@Mega4alik
Author

OK!
I did not understand how to use NominalAttribute. Can you please give a code example?
I tried this, but I get a java.lang.ArrayIndexOutOfBoundsException:

NominalAttribute ars[] = new NominalAttribute[10];
for (int i=0;i<ars.length;i++) ars[i] = new NominalAttribute("atr"+i);             
model = new RandomForest(ars, X, Y, 500);

Thanks)

@haifengl
Owner

Suppose you are still using the 1500 one-hot encoded features.

NominalAttribute ars[] = new NominalAttribute[1500];
String[] values = {"0", "1"};
for (int i=0;i<ars.length;i++) ars[i] = new NominalAttribute("atr"+i, values);
model = new RandomForest(ars, X, Y, 500);

@Mega4alik
Author

Mega4alik commented Jan 18, 2017

Thanks, it compiles, but the results are still the same.
Actually, X is not a one-hot encoded vector.
It is a vector of binary values that marks which features are present.
Example: [0, 0, 1, 0, 1, 0, ...]

@haifengl
Owner

Can you provide some sample data? Both X and Y. Thanks!

@Mega4alik
Author

Sure
X
Y

@haifengl
Owner

Thanks for sharing the data! I don't have time to try it yet, but it seems really, really sparse: very few ones and a large number of classes. Each class has only around three samples. Our random forest does stratified sampling, which is very good when the data is unbalanced. But given your samples-per-class ratio, each class may get only one sample during training, which is clearly not good for machine learning.
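
As a side note, a few lines like these would confirm the samples-per-class picture (a sketch, assuming Java 8 and that Y is loaded as in the gson snippet earlier in the thread):

// Count samples per class to see how thin each class is.
java.util.Map<Integer, Integer> classCounts = new java.util.TreeMap<>();
for (int label : Y) classCounts.merge(label, 1, Integer::sum);
int minCount = java.util.Collections.min(classCounts.values());
int maxCount = java.util.Collections.max(classCounts.values());
System.out.println(classCounts.size() + " classes, between " + minCount
        + " and " + maxCount + " samples per class");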

@haifengl
Owner

BTW, you have fewer samples than features, which makes the problem underdetermined. Machine learning generally solves overdetermined systems, for example the basic least squares method. For an underdetermined system, the solution is not unique and it is not clear how to assess the model quality.

I don't know how sklearn handles this situation. But what's the accuracy under cross-validation or the bootstrap?
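
For the cross-validation question, here is a hand-rolled k-fold sketch that uses only the constructor and predict call already shown in this thread (Smile also ships validation utilities, but this avoids assuming their exact API; note that with only about three samples per class, a plain random split can drop whole classes from a training fold, which is itself part of the problem):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

import smile.classification.RandomForest;
import smile.data.NominalAttribute;

public class CrossValidationSketch {
    // attrs, X, and Y are assumed to be built/loaded as in the snippets above.
    static double kFoldAccuracy(NominalAttribute[] attrs, double[][] X, int[] Y, int k) {
        int n = X.length;
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < n; i++) order.add(i);
        Collections.shuffle(order, new Random(42));

        int correct = 0;
        for (int fold = 0; fold < k; fold++) {
            List<Integer> trainIdx = new ArrayList<>();
            List<Integer> testIdx = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                (i % k == fold ? testIdx : trainIdx).add(order.get(i));
            }

            // Materialize the training fold.
            double[][] trainX = new double[trainIdx.size()][];
            int[] trainY = new int[trainIdx.size()];
            for (int i = 0; i < trainIdx.size(); i++) {
                trainX[i] = X[trainIdx.get(i)];
                trainY[i] = Y[trainIdx.get(i)];
            }

            RandomForest model = new RandomForest(attrs, trainX, trainY, 500);
            for (int i : testIdx) {
                if (model.predict(X[i]) == Y[i]) correct++;
            }
        }
        return (double) correct / n;
    }
}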

@Mega4alik
Author

Thank you for the detailed answer.
The test set shows 84% accuracy using sklearn RandomForest and 77% using Smile SVM.

@haifengl
Owner

Thanks! SVM is not that sensitive to high dimensionality, as all it cares about is the kernel matrix. However, whether you use a one-vs-one or one-vs-all strategy, you have only a few samples on one (or both) side(s). Basically they will all be support vectors, and the classifier is essentially template matching.

For random forest, there are many parameters you should choose carefully given your very unusual data, for example mtry, nodeSize, maxNodes, and subsample. Currently you use the default values, which are probably not good in this case. Also, there may be some logic in the code that doesn't consider use cases like yours, such as only one or two samples per class. That may be the reason the model always predicts the same class (very likely some NaN value is generated in the model). Is the predicted class always 0 (or the last class)?
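
A quick way to answer that last question (a sketch, assuming Java 8 and the model and X from the training snippet above):

// Histogram of predicted classes over the training rows: if it collapses to a
// single key, the model really is predicting one class for everything.
java.util.Map<Integer, Integer> predCounts = new java.util.TreeMap<>();
for (double[] row : X) predCounts.merge(model.predict(row), 1, Integer::sum);
System.out.println("Distinct predicted classes: " + predCounts.size());
System.out.println("Prediction histogram: " + predCounts);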

Anyway, I strongly suggest that you collect more data if possible. I wouldn't put much faith in these accuracy numbers if I were you. Thanks!
