Adaptive Random Forest Regressor/Hoeffding Tree Regressor splitting gives AttributeError #393
Comments
@smastelini I guess this is for you :)
Hi @JPalm1! Thanks for the detailed explanation. I'll check your provided example to try reproducing the same error. From your description, it does sound like a bug.
Hi again :) I've created #394 to fix the issue. If I may, let me put in my two cents regarding this situation:
@JPalm1, could you confirm that the fix in #394 solves your issue?
Hey, @smastelini - thanks for looking into it! That explanation makes sense. Just to note: I'm not fully set up on my end yet. The changes you have made in #394 seem to have solved the initial exception, but pushed the exception to a different location: this time, line 223. Let me know if you need any more info - I'm having a dig around too, to try to further my own understanding. :)
My bad! I missed one additional place to protect the trees. The fix is in place. Anyhow, my point is that the error appears because only valid labels should be passed to the learning models. I am assuming that, in your delayed evaluation setup, some invalid targets were reaching the model.
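To make that point concrete, here is a minimal sketch of such a guard (a hypothetical helper, not part of river): skip the model update while the target is missing or NaN, e.g. during the delay period of a delayed prequential evaluation.

```python
import math

# Hypothetical helper, not part of river: only update the model
# when a valid, non-NaN target is available.
def safe_learn_one(model, x, y):
    if y is None or (isinstance(y, float) and math.isnan(y)):
        return model  # no valid label yet: leave the trees untouched
    return model.learn_one(x, y)
```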
Yes, you're right - sorry, I was being slow. Thanks for sorting that - the ARFR now runs fine and doesn't trip up on my input data. I do get an exception when I run the HAT, though, at line 208.
Thanks for reporting that!
That's interesting, indeed. From my understanding, the division error happens because the list of found nodes is empty. So, in summary, your solution is correct :)
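A simplified illustration of the guard being discussed, with hypothetical names (not river's actual code): in a Hoeffding Adaptive Tree, an instance may be filtered to several leaves whose predictions are averaged, so if no leaf is reached the division has to be avoided.

```python
# Hypothetical names, not river's actual code: if the instance reaches
# no leaf (e.g. because of missing features), dividing by
# len(found_nodes) would raise ZeroDivisionError, so handle that first.
def aggregate_prediction(found_nodes, x):
    if not found_nodes:  # the length guard discussed in this thread
        return 0.0       # fall back to a default prediction
    return sum(leaf.predict_one(x) for leaf in found_nodes) / len(found_nodes)
```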
I am struggling to understand how that list could end up empty, though. Well, I might have a crazy idea about why the problem is happening, haha. Just a wild guess: are some of your features categorical? In this case, if you call the model with a category value it has never seen before, that code path could be reached. Could you please confirm whether or not this is the case with your data? If so, I will also add a comment to warn future contributors about the reason for adding this seemingly unnecessary check.
Hm, that is interesting. I'm also confused as to why the length is zero... Regarding 2., is it possible for the list to end up empty there?

For my data, I have encoded the categorical variables as numerical values, but the feature set is quite sparse. There are about 250 features that are shared between all instances, and then up to ~500 more, depending on the incoming data. Most of the features are grouped, such that if condition X is met, there will be around ~250 extra features on top of the existing 250 core features, and 250 more if condition Y is met. In the case where condition X or Y is not met, the corresponding features are NaN and excluded from the input dictionary. I will double-check tomorrow morning, but I'm pretty sure that before the point this exception is hit, some data points have had both conditions X and Y be true.

The features are all numerical, and the majority are scaled. After processing each data type, I concatenate the three dictionaries to use in the model. (As a side note: is there an existing feature, or planned feature, that would allow you to assign feature types in the input dictionary, which would then allow the different data types to be processed separately but in the same pipeline? I.e., some features get sent to one transformer, others to another.)

My plan is to step through tomorrow to get a better understanding, as I can imagine all this info may be a bit vague to you right now, but thanks for your help!
Hi @smastelini, just had a step through, and it seems my hypothesis yesterday is correct: in my case the list can indeed end up empty. Again, I'm unsure whether this is a bug or whether my setup is causing the issue; putting in the quick length fix sorts that problem, but trips me up again later, at line 247. This seems to be the issue you hypothesised, as the clause this line is in is commented as handling the case where the instance contains a categorical value previously unseen by the split node. Which is weird, as inspecting my input dictionary, all values are numerical, as mentioned previously. Would having some inputs as floats and others as ints be an issue? I will try converting them all to floats and see what happens.
Hi @JPalm1!
Yes, it is! The tree module automatically treats integers and floats as numerical features and anything else as categorical. Perhaps the occurrence of missing features is tripping the split tests.
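As a rough illustration of that inference rule (a sketch, not the module's actual code):

```python
import numbers

# Sketch of the rule described above: Python numbers count as numerical
# features, everything else is treated as categorical.
def is_numerical(value):
    return isinstance(value, numbers.Number)

print(is_numerical(3), is_numerical(3.5), is_numerical("red"))
# True True False
```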
You can check the documentation for the details; please also check the examples section of iSOUPTreeRegressor to see that in action.
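To make the side-note answer concrete, here is a sketch of routing different feature groups through different transformers within one pipeline, assuming river's compose utilities (the feature names are made up):

```python
from river import compose, preprocessing, tree

# Route two made-up feature groups through their own scalers, then union
# the results and feed them to the tree.
numeric_core = compose.Select('core_1', 'core_2') | preprocessing.StandardScaler()
numeric_extra = compose.Select('extra_1') | preprocessing.MinMaxScaler()

model = (numeric_core + numeric_extra) | tree.HoeffdingTreeRegressor()

x, y = {'core_1': 1.0, 'core_2': 2.0, 'extra_1': 0.5}, 10.0
model = model.learn_one(x, y)
print(model.predict_one(x))
```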
I think I found your culprit :D `river/river/tree/_attribute_test/numeric_attribute_binary_test.py`, lines 13 to 15, at 0f77d96.
The branching returns an invalid branch when the split feature is missing from the instance. I am working on a solution. As soon as I finish it, I'll let you know :D
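A simplified sketch of the failure mode and the kind of guard that avoids it (hypothetical names, not the file quoted above):

```python
# Hypothetical names, not river's actual code: if the split feature is
# absent from the sparse instance, signal that no branch can be chosen
# instead of producing an invalid child lookup (and a NoneType error).
class NumericBinaryTest:
    def __init__(self, feature, threshold):
        self.feature = feature
        self.threshold = threshold

    def branch_for_instance(self, x: dict) -> int:
        if self.feature not in x:
            return -1  # no branch: the caller must handle this case
        return 0 if x[self.feature] <= self.threshold else 1
```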
Stepping through, the x at the tree level seems to be all numerical (I remove the None variables before interacting with the model). But could this still be the case if the models assume that features which are not present in the dictionary, but have been present before, are None? It's interesting that numerical features are also classed as categorical, which explains the error I'm seeing here.
Will hold off on worrying about this one until your fix is in, though - thanks for looking into it!
Yes! This corner case did not happen before in skmultiflow because we always assumed all features were present. I added a fix in #394. Since what started with ARFReg ended up being related to missing data handling and sparse representations, I will update the description of the PR. @JPalm1, could you please check if the fix works for you?
The HATR is now running smoothly for me with no errors - thanks for taking the time to look into this, @smastelini! It's not giving me the best predictions, but that's on me to investigate... :)
No worries @JPalm1! It was a nice adventure :) Just keep in mind: in your case, a feature that does not appear very often might be very relevant for your problem. In the future, and that's a possible research venue, one could explore some alternatives for dealing with such sporadic features. The impact of those choices is a compelling research subject to investigate! The original Hoeffding Tree framework does not deal with such a situation, as far as I am aware. @jacobmontiel, @hmgomes, and professor @abifet, this might be an interesting problem to consider in the future. Any comments?
Well, that was quite some reading. Thanks to all involved in the interesting discussion. Missing data is unavoidable in real-world applications. My comments are on two main aspects:
Thanks for the replies @smastelini, @jacobmontiel. For my specific case, having played around with the model over the past day or so, the accuracy seems to be lower than that of other models I have implemented/baselines I have calculated - so potentially the sparsity of the data, and thus not utilizing the full feature set, is causing me issues, as you suspected. I do need to tune hyperparameters though, which may improve things. I will also try imputing values for the features that are non-existent and see if that improves things - thanks both for the suggestions!
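One way to experiment with imputation on a stream, assuming river provides `StatImputer` in its preprocessing module (the exact location and API may vary by version; the feature names are made up):

```python
from river import preprocessing, stats, tree

# Sketch: replace a missing value of the made-up feature 'extra_1' with
# its running mean before the instance reaches the tree.
model = (
    preprocessing.StatImputer(('extra_1', stats.Mean()))
    | tree.HoeffdingTreeRegressor()
)

for x, y in [
    ({'core': 1.0, 'extra_1': 3.0}, 10.0),
    ({'core': 2.0, 'extra_1': None}, 12.0),  # missing value gets imputed
]:
    model.predict_one(x)           # test-then-train: predict first...
    model = model.learn_one(x, y)  # ...then learn
```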
Hi @smastelini, after pulling the changes, there doesn't appear to be any major accuracy improvement in the HT. Thanks for letting me know though! Will keep an eye out for any improvements. As a side note, do any of the changes you have implemented affect the HST anomaly detection tree? Previously, I was able to input dictionaries of changing lengths (due to some inputs being NaN) - now when I do that, I get the exception from line 36. Note: the algorithm does work when I keep the NaN values in the input dictionary.
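For reference, a sketch of that workaround (made-up feature names, not the reporter's actual code): keep every known key in the input dict so the half-space trees always see the same feature set.

```python
from river import anomaly

FEATURES = ['core_1', 'core_2', 'extra_1']  # made-up feature names

# Pad absent features with NaN so the input dict never shrinks.
def pad(x):
    return {f: x.get(f, float('nan')) for f in FEATURES}

hst = anomaly.HalfSpaceTrees(seed=42)
x = {'core_1': 0.5}  # sparse instance missing two features
score = hst.score_one(pad(x))
hst = hst.learn_one(pad(x))
```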
Hi @JPalm1, this question is out of the scope of the decision tree module. Could you please open another issue to directly tackle this problem? Meanwhile, @MaxHalford, do you have any idea of what might be happening?
Yes, the half-space trees don't work if a feature that is used in a node split is missing. It's a bit the same problem. I could cook something up :). I'll create an issue to keep track of this. On a side note, @smastelini, I noticed that you put all the base tree code from creme into the anomaly submodule.
Hey @MaxHalford. During the merge, I kept the basic tree structure shared by the previous decision tree and the HST. I moved it to the anomaly submodule because that is the only place where this structure is used now. Ideally, we should use a single shared tree representation. It's in my plans to bring another family of tree learners to river soon, and they do not fit the current conventions defined by BaseHoeffdingTree. So I am very interested in such a discussion.
Versions
river version: 0.1.0
Python version: 3.7.1
Operating system: Windows 10 Enterprise v1803
Describe the issue
I have been playing around with River/Creme for a couple of weeks now, and it's super useful for a project I am currently working on. That being said, I'm still pretty new to the workings of the algorithms, so I'm unsure whether this is a bug or an issue with my setup.
When I call `learn_one` on the ARFR or HTR, I receive `AttributeError: 'NoneType' object has no attribute '_left'` from line 113 in `best_evaluated_split_suggestion.py`.

I have implemented a delayed prequential evaluation algorithm, and inspecting the loop, the error seems to be thrown once the first delay period has been exceeded - i.e. when the model can first make a prediction that isn't zero. Before this point, `learn_one` doesn't throw this error.

Currently, I am using the default ARFR as in the example, with a simple linear regression as the `leaf_model`. The linear regression model itself has been working with my data when not used in the ARFR. I want to try the ensemble/tree models with the data to see if accuracy is improved due to the drift detection methods that are included.
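For context, a minimal sketch of a delayed prequential loop of this kind (synthetic stream, fixed delay, and default parameters; illustrative rather than the reporter's actual code):

```python
from river import ensemble, linear_model, metrics

# Synthetic stream of (features dict, target) pairs.
stream = [({'a': float(i), 'b': float(i % 3)}, 2.0 * i) for i in range(20)]

model = ensemble.AdaptiveRandomForestRegressor(
    leaf_model=linear_model.LinearRegression(), seed=42
)
metric = metrics.MAE()
delay, pending = 3, []  # (x, y, y_pred) triples awaiting label release

for x, y in stream:
    pending.append((x, y, model.predict_one(x)))  # label withheld for now
    if len(pending) > delay:  # the oldest label is now "available"
        x_old, y_old, y_pred = pending.pop(0)
        metric.update(y_old, y_pred)           # score the stored prediction...
        model = model.learn_one(x_old, y_old)  # ...then train on it

print(metric)
```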
Has anyone else seen this error being thrown, or does anyone know what causes it? Let me know if more information is needed! Thanks.