RandomForest Example #144

Closed
Mega4alik opened this issue Jan 17, 2017 · 22 comments
@Mega4alik

Hello,
I'm trying to train a RandomForest model, but I get the same result for every test sample (about 300 entries).
Here's the Java code:

RandomForest model;
model = new RandomForest(X, Y, 500);

For the same data, Python's sklearn works as expected:

clf = RandomForestClassifier(n_estimators=500)
clf = clf.fit(X, Y)

What am I missing?

Thanks

@leo-xin

leo-xin commented Jan 17, 2017

I am not the author. From reading the source, I think the reason is that you have not set the parameter subsample, which by default is set to 1.0 in Smile. In that way, there is no randomness in building the trees.

@Mega4alik
Author

@leo-xin Can you give an example? Thanks in advance

@leo-xin

leo-xin commented Jan 17, 2017

I am sorry, after reading more carefully I think I gave the wrong hint. There are two sources of randomness in building the forest: both the features and the data are randomly selected. I have no idea what is causing your problem. Can you show your detailed code?

@Mega4alik
Author

Sure,
Training Part

double[][] X = gson.fromJson(g.getFileContent(g.Path + "/temp/X.txt"), double[][].class);
int[] Y = gson.fromJson(g.getFileContent(g.Path + "/temp/Y.txt"), int[].class);
RandomForest model;
model = new RandomForest(X, Y, 500);

Prediction part

int getPrediction(double[] doubles) {
    int qIdx = model.predict(doubles);
    return qIdx;
}

@Mega4alik
Author

@haifengl can you help?

@haifengl
Owner

Hi @Mega4alik, what do you mean by "getting same result for each test"? Thanks!

@Mega4alik
Author

@haifengl Thanks for the response.
It means that the model always predicts the same class.

@haifengl
Owner

Do you have missing values in your input? And what's the range of Y?

@Mega4alik
Author

No, I don't.
There are 1500 possible classes.
As I mentioned before, Python's sklearn RandomForestClassifier works as expected on this dataset.

@haifengl
Owner

Is your Y in range of [0, 1499]? How many samples do you have?

@Mega4alik
Author

Oh no, sorry.
X has 1500 binary features (indices [0, 1499]) and the range of Y is [0, 323].
There are about 1300 samples.

@Mega4alik
Author

  • Size of X: 1500 (fixed)

@haifengl
Owner

I guess that you are using one-hot encoding, which is what sklearn expects. If you don't specify the attribute type (as in your code), Smile will assume the values are numeric. That is why it doesn't work. Instead, you should pass an array of NominalAttribute. See the details in smile.data.NominalAttribute.

Besides, if you are using one-hot encoding, please don't. Smile can handle categorical variables directly (and much more efficiently). Suppose you actually have only 10 attributes, each taking a discrete value from 0 to 149; just pass those 10 attributes to us. You will see that we are way faster than sklearn. Thanks!
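
To make the categorical route concrete, here is a minimal sketch following the hypothetical 10-attribute / 150-level example above; the toy data, class name, and variable names are made up for illustration, not your actual X and Y:

import java.util.Random;

import smile.classification.RandomForest;
import smile.data.NominalAttribute;

public class CategoricalSketch {
    public static void main(String[] args) {
        // Toy data matching the hypothetical shapes above:
        // 10 categorical attributes, each taking a value in 0..149.
        int n = 1300, p = 10, levelsPerAttr = 150, numClasses = 5;
        Random rnd = new Random(42);
        double[][] X = new double[n][p];
        int[] Y = new int[n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < p; j++) X[i][j] = rnd.nextInt(levelsPerAttr);
            Y[i] = i % numClasses; // contiguous labels 0..numClasses-1, every class present
        }

        // Declare every column as nominal so the trees split on categories
        // instead of treating the indices as numeric thresholds.
        String[] levels = new String[levelsPerAttr];
        for (int v = 0; v < levelsPerAttr; v++) levels[v] = String.valueOf(v);
        NominalAttribute[] attrs = new NominalAttribute[p];
        for (int j = 0; j < p; j++) attrs[j] = new NominalAttribute("attr" + j, levels);

        RandomForest model = new RandomForest(attrs, X, Y, 500);
        System.out.println("Prediction for first row: " + model.predict(X[0]));
    }
}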

@Mega4alik
Author

OK!
I did not understand how to use NominalAttribute. Can you please give a code example?
I tried this, but I get a java.lang.ArrayIndexOutOfBoundsException:

NominalAttribute ars[] = new NominalAttribute[10];
for (int i=0;i<ars.length;i++) ars[i] = new NominalAttribute("atr"+i);             
model = new RandomForest(ars, X, Y, 500);

Thanks)

@haifengl
Owner

Suppose you are still using the 1500 one-hot encoded features.

NominalAttribute ars[] = new NominalAttribute[1500];
String[] values = {"0", "1"};
for (int i=0;i<ars.length;i++) ars[i] = new NominalAttribute("atr"+i, values);
model = new RandomForest(ars, X, Y, 500);

@Mega4alik
Author

Mega4alik commented Jan 18, 2017

Thanks, it compiles, but the results are still the same.
Actually, X is not a one-hot encoded vector.
It is a vector of binary values that marks which features are present.
Example: [0, 0, 1, 0, 1, 0, ...]

@haifengl
Owner

Can you provide some sample data? Both X and Y. Thanks!

@Mega4alik
Author

Sure
X
Y

@haifengl
Owner

Thanks for sharing the data! I don't have time to try it yet, but it seems really, really sparse: very few ones and a large number of classes. Each class has only around three samples. Our random forest does stratified sampling, which is very good when the data is unbalanced. But given your samples-per-class ratio, each class may get only one sample during training, which is clearly not good for machine learning.
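
As a side note, a few lines like these would confirm the samples-per-class picture (a sketch, assuming Java 8 and that Y is loaded as in the gson snippet earlier in the thread):

// Count samples per class to see how thin each class is.
java.util.Map<Integer, Integer> classCounts = new java.util.TreeMap<>();
for (int label : Y) classCounts.merge(label, 1, Integer::sum);
int minCount = java.util.Collections.min(classCounts.values());
int maxCount = java.util.Collections.max(classCounts.values());
System.out.println(classCounts.size() + " classes, between " + minCount
        + " and " + maxCount + " samples per class");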

@haifengl
Owner

BTW, you have fewer samples than features, which makes the problem underdetermined. Machine learning generally solves overdetermined systems, for example the basic least squares method. For an underdetermined system, the solution is not unique and it is not clear how to assess the model quality.

I don't know how sklearn handles this situation. But what's the accuracy under cross-validation or the bootstrap?
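
For the cross-validation question, here is a hand-rolled k-fold sketch that uses only the constructor and predict call already shown in this thread (Smile also ships validation utilities, but this avoids assuming their exact API; note that with only about three samples per class, a plain random split can drop whole classes from a training fold, which is itself part of the problem):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

import smile.classification.RandomForest;
import smile.data.NominalAttribute;

public class CrossValidationSketch {
    // attrs, X, and Y are assumed to be built/loaded as in the snippets above.
    static double kFoldAccuracy(NominalAttribute[] attrs, double[][] X, int[] Y, int k) {
        int n = X.length;
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < n; i++) order.add(i);
        Collections.shuffle(order, new Random(42));

        int correct = 0;
        for (int fold = 0; fold < k; fold++) {
            List<Integer> trainIdx = new ArrayList<>();
            List<Integer> testIdx = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                (i % k == fold ? testIdx : trainIdx).add(order.get(i));
            }

            // Materialize the training fold.
            double[][] trainX = new double[trainIdx.size()][];
            int[] trainY = new int[trainIdx.size()];
            for (int i = 0; i < trainIdx.size(); i++) {
                trainX[i] = X[trainIdx.get(i)];
                trainY[i] = Y[trainIdx.get(i)];
            }

            RandomForest model = new RandomForest(attrs, trainX, trainY, 500);
            for (int i : testIdx) {
                if (model.predict(X[i]) == Y[i]) correct++;
            }
        }
        return (double) correct / n;
    }
}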

@Mega4alik
Author

Thank you for the detailed answer.
The test set shows 84% accuracy using sklearn RandomForest and 77% using Smile SVM.

@haifengl
Owner

Thanks! SVM is not that sensitive to high dimensionality, as all it cares about is the kernel matrix. However, whether you use a one-vs-one or one-vs-all strategy, you have only a few samples on one (or both) side(s). Basically they will all be support vectors, and the classifier is essentially template matching.

For random forest, there are many parameters you should choose carefully given your very unusual data, for example mtry, nodeSize, maxNodes, and subsample. Currently you use the default values, which are probably not good in this case. Also, there may be some logic in the code that doesn't consider use cases like yours, such as only one or two samples per class. That may be the reason the model always predicts the same class (very likely some NaN value is generated in the model). Is the predicted class always 0 (or the last class)?
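
A quick way to answer that last question (a sketch, assuming Java 8 and the model and X from the training snippet above):

// Histogram of predicted classes over the training rows: if it collapses to a
// single key, the model really is predicting one class for everything.
java.util.Map<Integer, Integer> predCounts = new java.util.TreeMap<>();
for (double[] row : X) predCounts.merge(model.predict(row), 1, Integer::sum);
System.out.println("Distinct predicted classes: " + predCounts.size());
System.out.println("Prediction histogram: " + predCounts);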

Anyway, I strongly suggest that you collect more data if possible. I wouldn't put much faith in these accuracy numbers if I were you. Thanks!
