RandomForest Example #144
I am not the author, but from reading the source, I think the reason is that you have not set the parameter subsample, which by default is 1.0 in Smile. In that case, there is no randomness in building the trees.
@leo-xin Can you give an example? Thanks in advance
I am sorry; after reading more carefully, I think I gave the wrong hint. There are two sources of randomness in building the forest: features and data are both randomly sampled. I have no idea about your problem. Can you show your detailed code?
Sure,
Prediction part
@haifengl can you help?
Hi @Mega4alik, what do you mean by "getting the same result for each test"? Thanks!
@haifengl Thanks for the response,
Do you have missing values in your input? And what's the range of Y?
No, I don't.
Is your Y in the range [0, 1499]? How many samples do you have?
Oh no, sorry |
I guess that you are using one-hot encoding, which is what sklearn expects. If you don't specify the attribute types (as in your code), Smile will assume they are numeric values; that is why it doesn't work. Instead, you should pass an array of NominalAttribute. See the details in smile.data.NominalAttribute. Besides, if you are using one-hot encoding, please don't: Smile can handle categorical variables directly (and much more efficiently). Suppose you really have only 10 attributes (each taking a discrete value from 0 to 149); just pass those 10 attributes to us. You will see that we are way faster than sklearn. Thanks!
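To illustrate the point above, here is a self-contained sketch (plain Java, not Smile API code; the class and layout are made up for illustration) showing that 10 nominal columns carry exactly the same information as 1500 one-hot columns — each block of 150 one-hot entries collapses to a single level index:

```java
import java.util.Arrays;

public class EncodingDemo {
    // Collapse a one-hot row laid out as numFeatures blocks of numLevels
    // entries each back into numFeatures level indices.
    static double[] oneHotToNominal(double[] oneHot, int numFeatures, int numLevels) {
        double[] nominal = new double[numFeatures];
        for (int f = 0; f < numFeatures; f++) {
            for (int level = 0; level < numLevels; level++) {
                if (oneHot[f * numLevels + level] == 1.0) {
                    nominal[f] = level;
                    break;
                }
            }
        }
        return nominal;
    }

    public static void main(String[] args) {
        int numFeatures = 10, numLevels = 150;
        double[] oneHot = new double[numFeatures * numLevels]; // 1500 columns
        // Made-up row: feature f takes level (f * 7) % 150.
        for (int f = 0; f < numFeatures; f++) {
            oneHot[f * numLevels + (f * 7) % numLevels] = 1.0;
        }
        double[] nominal = oneHotToNominal(oneHot, numFeatures, numLevels);
        // 10 values instead of 1500 columns, with no information lost.
        System.out.println(Arrays.toString(nominal));
    }
}
```

Feeding the 10-column representation (with nominal attribute metadata) to the library lets a tree split on "feature f == level" directly instead of scanning 1500 near-constant binary columns.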
Ok!
Thanks!
Suppose you are still using 1500 one-hot encoded features:
NominalAttribute[] attrs = new NominalAttribute[1500];
for (int i = 0; i < attrs.length; i++) attrs[i] = new NominalAttribute("V" + (i + 1));
Thanks, it compiles. But the results are still the same.
Can you provide some sample data? Both X and Y. Thanks!
Thanks for sharing the data! I don't have time to try it yet, but it seems really, really sparse: very few ones and a large number of classes. Each class has only around three samples. Our random forest does stratified sampling, which is very good when the data is unbalanced. But given your samples-per-class ratio, each class may get only one sample during training, which is clearly not good for machine learning.
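The arithmetic behind that concern can be sketched as follows. This is an illustration only, assuming a simple truncating per-class draw; it is not a claim about Smile's exact stratified-sampling implementation:

```java
public class StrataSamplingDemo {
    // Number of samples drawn from one class under stratified subsampling,
    // assuming truncation (illustration only, not Smile's exact rule).
    static int drawnPerClass(int samplesInClass, double subsample) {
        return (int) (samplesInClass * subsample);
    }

    public static void main(String[] args) {
        int samplesPerClass = 3; // roughly the ratio described in this thread
        // With the default subsample of 1.0 each class contributes its 3 samples...
        System.out.println(drawnPerClass(samplesPerClass, 1.0)); // 3
        // ...but any fractional subsample leaves almost nothing per class.
        System.out.println(drawnPerClass(samplesPerClass, 0.5)); // 1
    }
}
```

With one sample per class in a tree's training set, the tree can only memorize that sample, which is consistent with the degenerate predictions observed.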
BTW, you have fewer samples than features, which makes the system underdetermined. Machine learning in general solves overdetermined systems, for example the basic least squares method. For an underdetermined system, the solution is not unique and it is not clear how to assess model quality. I don't know how sklearn handles this situation. But what's the accuracy under cross validation or bootstrap?
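The non-uniqueness is easy to see with a toy linear model (made-up numbers, one sample and two features): many different weight vectors fit the training data exactly, so the data alone cannot pick between them.

```java
public class UnderdeterminedDemo {
    // A linear model w1*x1 + w2*x2 with one training sample x = (1, 1), y = 2.
    // Every w with w1 + w2 = 2 fits perfectly, so the fit is not unique.
    static double predict(double[] w, double[] x) {
        return w[0] * x[0] + w[1] * x[1];
    }

    public static void main(String[] args) {
        double[] x = {1.0, 1.0};
        double y = 2.0;
        double[] wA = {2.0, 0.0};
        double[] wB = {0.5, 1.5};
        System.out.println(predict(wA, x) == y); // true
        System.out.println(predict(wB, x) == y); // true: both fit the single sample exactly
    }
}
```

The two models agree on the training sample but disagree everywhere else, which is why training accuracy says nothing about quality in the underdetermined regime.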
Thank you for the detailed answer.
Thanks! SVM is not that sensitive to high dimensions, as all it cares about is the kernel matrix. However, whether you use a one-vs-one or one-vs-all strategy, you have only a few samples on one (or both) side(s). Basically they will all be support vectors, and the classifier is essentially template matching. For Random Forest, there are many parameters you should choose carefully given your very special data, for example mtry, nodeSize, maxNodes, and subsample. Currently you use the default values, which are probably not good in this case. Also, there may be some logic in the code that doesn't consider use cases like yours, such as only one or two samples per class. That may be why the model always predicts the same class (very likely some NaN value is generated in the model). Is the predicted class always 0 (or the last class)? Anyway, I strongly suggest that you collect more data if possible. I wouldn't put any faith in these accuracy numbers if I were you. Thanks!
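As a starting point for the mtry parameter mentioned above, a common rule of thumb for classification forests is mtry ≈ sqrt(p), where p is the number of features. This is a general heuristic, not a statement about Smile's defaults, and the helper name here is made up:

```java
public class MtryHeuristicDemo {
    // Rule-of-thumb mtry for a classification random forest:
    // roughly the square root of the feature count, at least 1.
    static int suggestedMtry(int numFeatures) {
        return Math.max(1, (int) Math.floor(Math.sqrt(numFeatures)));
    }

    public static void main(String[] args) {
        // 1500 one-hot columns vs. the 10 nominal attributes suggested earlier:
        System.out.println(suggestedMtry(1500)); // 38
        System.out.println(suggestedMtry(10));   // 3
    }
}
```

Note how the heuristic itself changes drastically between the two encodings: with 1500 mostly-zero one-hot columns, 38 randomly chosen columns per split are very likely all constant, which is another reason to prefer the 10-attribute nominal representation.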
Hello,
I'm trying to train a RandomForest model, but I'm getting the same result for every test sample (about 300 entries).
Here's the Java code:
For this data, Python's sklearn works as expected.
What am I missing?
Thanks