Finding Outliers #8
Comments
Wrapping up: I used something like stacking on the validation set to get a lower MAE.

The first time, I used Excel Solver to iterate through cutoff values and rainfall estimates, which gave a cutoff around 0.3 and a replacement value of 2800. This lowered MAE on the validation set from 23.5 to around 18. However, on the LB, MAE went up quite a bit 😖 The second time around, I played it "safe" and used 0.5 as the cutoff with 1000 as the alternative value. I even applied the model to a double-secret holdout set and got good results (around 21). But again, the LB did not like it. Double 😖!

So anyway, that was my experience yesterday. I don't know if it's a simple mistake, an error in approach/logic/assumptions, or a data peculiarity on the LB. In any case, there is a lot of potential here, so I'm hoping we can make headway with it! Signing off for now; I'll probably resurface in a couple of days. Glad to respond, etc., in the meantime.
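For reference, here is a minimal sketch of the cutoff/replacement search described above, done in code rather than Excel Solver. It assumes you already have validation-set arrays for the true target (`y_true`), the baseline regression predictions (`y_pred`), and an outlier-classifier probability (`p_outlier`); all names and the candidate grids are illustrative, not from the actual pipeline.

```python
import numpy as np

def tune_cutoff(y_true, y_pred, p_outlier,
                cutoffs=np.arange(0.1, 1.0, 0.05),
                replacements=(500, 1000, 2000, 2800)):
    """Grid-search the probability cutoff and replacement value that
    minimize validation MAE when suspected outliers are overwritten."""
    best = (None, None, np.inf)  # (cutoff, replacement, mae)
    for c in cutoffs:
        for r in replacements:
            # Overwrite predictions where the classifier flags an outlier.
            adjusted = np.where(p_outlier > c, r, y_pred)
            mae = np.mean(np.abs(y_true - adjusted))
            if mae < best[2]:
                best = (c, r, mae)
    return best
```

The risk this thread is running into is visible in the code: the cutoff and replacement are tuned directly against one validation set, so they can overfit it, which would explain the lab-vs-LB gap.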
Yup, you're right on the approach. The best finish Thakur and I had together was one where we handled the problem like this. In that case, he built a classifier that returned amazingly accurate predictions of whether a value was other than 0, and I had a regression to figure out the number between 0 and 100. Knowing that he was going to send me only those above 0 (with an F1 of 0.99 or so, I think), I trained on only the subset of the original data that was above 0. It worked out great.

So what is going on here? First, I'm not positive. Everything seems good, especially the double-blind. It's possible that 0.95 isn't high enough; after all, our initial GBM tried to solve these as well, so it is jointly guessing on these and on the regulars. And deep XGBoost models are quite good at naturally separating the problem. One thing is to ensure the blinds are working. As you say, it's just like stacking, so hopefully you're doing the stacking correctly. I imagine so, but just in case....

If we want to reverse course and go far simpler, the data size is big enough that we can probably get away with a single model. That means we just do a 90/10 or 80/20 split for the first level, use that model to predict on the held-out 10 or 20 as well as on the full set, and then we only have 10% or 20% to "train" the model selector. But for such a simple task, that might well be enough.

Nice work. Hopefully the infrastructure is good and we can spot some simple tweaks to get a good submission out of it.
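For concreteness, here is a sketch of the two-stage scheme described in that anecdote: a classifier gates which rows get a nonzero prediction, and the regressor is trained only on the positive subset. It uses xgboost's scikit-learn API; the hyperparameters, the 0.5 threshold, and the assumption of numpy inputs are all illustrative, not the actual models from that competition.

```python
import numpy as np
from xgboost import XGBClassifier, XGBRegressor

def two_stage_fit_predict(X_train, y_train, X_test, threshold=0.5):
    # Stage 1: classify zero vs. non-zero targets.
    is_pos = (y_train > 0).astype(int)
    clf = XGBClassifier(n_estimators=300, max_depth=6)
    clf.fit(X_train, is_pos)

    # Stage 2: regress on the positive subset only, mirroring
    # "I trained on only the subset ... that was above 0".
    pos = y_train > 0
    reg = XGBRegressor(n_estimators=300, max_depth=6)
    reg.fit(X_train[pos], y_train[pos])

    # Route test rows: predict 0 unless the classifier says otherwise.
    p_pos = clf.predict_proba(X_test)[:, 1]
    return np.where(p_pos > threshold, reg.predict(X_test), 0.0)
```

The design point is the same one made above: because the regressor never sees the zero rows, it is not "jointly guessing" on outliers and regulars the way a single GBM would.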
So I've started looking at the "problem within a problem", which is to identify IDs in the test set likely to have unreasonably high Expected (rainfall) values. As pointed out in the forums, these outliers are largely noise and are likely responsible for most of the MAE. Here is the contribution from a validation set using our best XGB model:

[chart: per-sample MAE contribution on the validation set, best XGB model]
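A quick way to reproduce that kind of breakdown is to split the validation error into the share coming from suspected outliers versus the rest. The sketch below assumes numpy arrays of truths and predictions; the threshold of 300 for flagging an "unreasonably high" Expected value is an arbitrary illustration, not the cutoff actually used here.

```python
import numpy as np

def mae_contribution(y_true, y_pred, outlier_threshold=300):
    """Report how much of total absolute error comes from rows
    whose true Expected value exceeds the outlier threshold."""
    err = np.abs(y_true - y_pred)
    outlier_mask = y_true > outlier_threshold
    return {
        "overall_mae": err.mean(),
        "outlier_share_of_error": err[outlier_mask].sum() / err.sum(),
        "outlier_fraction_of_rows": outlier_mask.mean(),
    }
```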
Identifying even a subset of the outliers could significantly reduce MAE! Here's where my approach stands so far: I've had good results in the lab, but no progress at all on the public LB, and I can't say why. I'll include details of the work to date in separate comments.