
Sample weight support #17

Open
kmedved opened this issue Apr 15, 2021 · 8 comments


@kmedved

kmedved commented Apr 15, 2021

Would it be possible or sensible to add support for sample weights (at the observation level) to this package? Most scikit-learn estimators allow the user to pass sample weights into the .fit() call (e.g., linear regression or LightGBM). This is a key feature of many regression problems.

Normally this would be pretty simple to implement: just allow the user to pass a sample_weight into the fit call. But given that the Bayesian bootstrap already relies on weighting rather than resampling, maybe this doesn't make sense in this context.
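For reference, here's the kind of pattern I mean in scikit-learn (the data below is made up, just to illustrate the API):

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: three observations; the second is backed by twice as much playing time.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.1, 2.3, 2.9])
sample_weight = np.array([1.0, 2.0, 1.0])  # e.g., minutes or games played

model = LinearRegression()
model.fit(X, y, sample_weight=sample_weight)  # rows with larger weights count more in the fit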

Tagging @JulianWgs in case he has any thoughts.

Thanks!

@lmc2179
Owner

lmc2179 commented Apr 16, 2021

Hi! Can you give an example of the kind of problem you'd like to solve with this? It's not immediately clear what you have in mind, but an example would make it clearer to me.

@kmedved
Author

kmedved commented Apr 16, 2021

Sure.

I work a lot with sports data (and in fact found this package through this post: http://savvastjortjoglou.com/nfl-bayesian-bootstrap.html), where it is common for each row to represent a single game or season by a player. Since players may play different numbers of games or minutes, I don't want to weight all observations equally when fitting a model. Typically I handle this by passing a sample_weight into the model's .fit call, where the weight is the number of games or minutes played.

Other uses include environmental sensor data, where some sensors collect aggregated data every X days, but X varies between sensors. In such cases, rows representing fewer days should receive proportionately less weight in the loss than rows representing more days.

Finally, it's fairly common to use sample weights in time-series analysis, applying an exponentially decaying weight to older observations so that they are downweighted without being dropped entirely. This would likewise normally be done by passing a sample_weight into the fit call.
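As a sketch of that last scheme (the half-life below is an arbitrary choice):

import numpy as np

# Ages of observations in days (0 = most recent); values are illustrative.
ages = np.array([0, 30, 90, 365])
half_life = 180  # assumed half-life in days
sample_weight = 0.5 ** (ages / half_life)  # each weight halves every half_life days
print(sample_weight)  # approximately [1.0, 0.89, 0.71, 0.25]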

Under the hood, what scikit-learn estimators are doing is just multiplying the loss associated with each row by the corresponding sample weight, so the math is very simple. But I am somewhat uncertain how this would interact with the sampling the Bayesian bootstrap is already doing.
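In other words, the weighted loss looks something like this (a generic sketch, not any particular library's internals):

import numpy as np

def weighted_mse(y_true, y_pred, sample_weight):
    # Each row's squared error is scaled by its weight before averaging.
    sq_err = (y_true - y_pred) ** 2
    return np.sum(sample_weight * sq_err) / np.sum(sample_weight)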

Thanks for the quick reply; I appreciate any help.

@JulianWgs
Contributor

I think the idea is very good, but I haven't found a clean way to represent it mathematically. The obvious approach would be to multiply the Bayesian bootstrap weights by the model weights and then rescale the result to sum to 1. Is this sound? Are there any papers out there describing something similar? I wouldn't want to implement something that ought to work but isn't backed by the math or the literature.

import numpy as np

# One draw of Bayesian bootstrap weights, plus fixed per-observation model weights.
bb_weights = np.array([0.25, 0.5, 0.25])
model_weights = np.array([0.25, 0.65, 0.1])

# Both weight vectors are normalized (use a tolerant comparison for floats).
assert np.isclose(np.sum(bb_weights), 1)
assert np.isclose(np.sum(model_weights), 1)

# The elementwise product no longer sums to 1...
weights = bb_weights * model_weights
assert not np.isclose(np.sum(weights), 1)
print(np.sum(weights))  # 0.4125

# ...so rescale it back onto the simplex.
weights /= np.sum(weights)
print(weights)  # approximately [0.152, 0.788, 0.061]

@kmedved
Author

kmedved commented Apr 21, 2021

That was my instinct as well, @JulianWgs: the same multiplication could be done on top of the bootstrap weights, and I don't see any obvious issues.

But I am unaware of any literature on this point either.

@lmc2179
Owner

lmc2179 commented Apr 21, 2021

This makes sense to me, I think. In the usual nonparametric bootstrap case, it seems reasonable to resample your data points along with their weights, and the simple scheme here should give you a smoothed-out version of that (and the BB is just a smoothed-out standard bootstrap). It seems like a reasonable approach from my point of view - but I also don't know of any literature that specifically addresses this topic.
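To make the analogy concrete, here's a sketch of the classical-bootstrap version of this, resampling rows in proportion to their weights (data and weights are made up):

import numpy as np

rng = np.random.default_rng(0)
data = np.array([1.0, 2.0, 3.0])
sample_weight = np.array([0.25, 0.5, 0.25])  # normalized so they can serve as probabilities

# One classical bootstrap replicate: resample indices with probability proportional to weight.
idx = rng.choice(len(data), size=len(data), p=sample_weight)
replicate_mean = data[idx].mean()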

@kmedved
Author

kmedved commented Apr 21, 2021

Just found a discussion of this issue here: https://stats.stackexchange.com/questions/88615/reweighting-importance-weighted-samples-in-bayesian-bootstrap.

If I'm reading it correctly (and I may not be; I'm a bit out of my depth here), then I think @JulianWgs' solution is correct. (You need to read down to the answer's 'Edit', where the author explains the solution.)
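For what it's worth, here's a minimal sketch of that combined scheme for a posterior over the mean (data and weights made up):

import numpy as np

rng = np.random.default_rng(0)
data = np.array([1.0, 2.0, 3.0])
model_weights = np.array([0.25, 0.65, 0.10])  # fixed per-observation weights

n_replicates = 10_000
# Bayesian bootstrap: one flat-Dirichlet weight vector per replicate.
bb_weights = rng.dirichlet(np.ones(len(data)), size=n_replicates)

# Multiply by the model weights and renormalize each replicate to sum to 1.
combined = bb_weights * model_weights
combined /= combined.sum(axis=1, keepdims=True)

posterior_means = combined @ data  # one weighted mean per replicate
print(posterior_means.mean(), posterior_means.std())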

@JulianWgs
Contributor

JulianWgs commented Apr 21, 2021 via email

@kmedved
Author

kmedved commented Apr 28, 2021

Good ideas. Approach 2 seems to be the most straightforward to test. I will give it a try and see what it looks like with a high enough number of samples.
