
QuantileDifferenceReason and StandardDeviationReason #28

Closed
FBruzzesi opened this issue Dec 17, 2021 · 31 comments
@FBruzzesi
Contributor

Hey! I was wondering whether it would make sense to add two more reasons for regression tasks, namely something like HighLeveragePointReason and HighStudentizedResidualReason.

Citing Wikipedia:

  • Leverage is a measure of how far away the independent variable values of an observation are from those of the other observations. High-leverage points, if any, are outliers with respect to the independent variables (link)
  • A studentized residual is the quotient resulting from the division of a residual by an estimate of its standard deviation. [...] This is an important technique in the detection of outliers. (link)
@koaning
Owner

koaning commented Dec 17, 2021

Do you have an example (preferably something semi-real life) that demonstrates the utility of this technique?

@FBruzzesi
Contributor Author

  • Regarding studentized residuals:

    The first thing that comes to mind is that in regression problems:

    • Absolute difference may require domain knowledge to set a threshold;
    • Relative difference can be very misleading when dealing with true values close to zero.

    Therefore, by standardizing/studentizing residuals, one can use a default threshold.

    Remark that the magnitude of the diagonal elements H_ii of the hat matrix H = X @ np.linalg.inv(X.T @ X) @ X.T, which enter the computation, decreases quickly as X grows (on the order of p/n, where n, p = X.shape); see the sketch below this list.
    Hence, for simplicity, the z-score with mean zero is often used instead, since model errors should be zero-centered, and a threshold of 3 can be a good default.

  • Regarding high leverage:

    As with the previous point, it may be hard to compute H for large values of n, and better outlier detection methods can be used in most cases.
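
For concreteness, a minimal numpy sketch of the quantities involved (only illustrative; the intercept handling and the function name are my own assumptions):

```python
import numpy as np

def studentized_residuals(X, y, y_pred):
    """Internally studentized residuals via the diagonal of the hat matrix.

    Only the diagonal h_ii is computed (never the full n x n hat matrix),
    so memory stays at O(n * p) instead of O(n^2).
    """
    X = np.column_stack([np.ones(len(X)), X])  # assumes an OLS fit with intercept
    residuals = np.asarray(y) - np.asarray(y_pred)
    n, p = X.shape

    # h_ii = diag(X (X'X)^-1 X'), computed row-wise without forming H
    XtX_inv = np.linalg.inv(X.T @ X)
    leverage = np.sum((X @ XtX_inv) * X, axis=1)

    # residual standard error, then studentize
    sigma = np.sqrt(residuals @ residuals / (n - p))
    return residuals / (sigma * np.sqrt(1.0 - leverage))
```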

@koaning
Owner

koaning commented Dec 17, 2021

@FBruzzesi ah yeah that makes it more clear.

On the studentized residuals ... I think a bell-curve assumption for the error might work for some instances, but not all. I'm wondering if it makes sense to introduce a QuantileDifferenceReason and a StandardDeviationReason for this realm of use-cases. Any concerns with using these two?

Regarding HighLeveragePointReason I'm leaning towards asking users to implement an outlier detection system for their use-case. How to detect an outlier tends to be very use-case specific.

@FBruzzesi
Contributor Author

FBruzzesi commented Dec 17, 2021

I believe those two Reason(s) you are proposing should cover the majority of cases.
It also totally makes sense to keep a custom outlier detection model, especially since there is an OutlierReason already implemented.

@koaning koaning changed the title High leverage and studentized residual reasons QuantileDifferenceReason and StandardDeviationReason Dec 17, 2021
@koaning
Owner

koaning commented Dec 17, 2021

I've changed the title of this issue to reflect this.

I'm not sure when I'll have time to work on this feature though. Part of me is also wondering if we should first find a representative dataset such that we might have a valid demo for these tools. Any suggestions for a dataset are very welcome.

@FBruzzesi
Contributor Author

I can work on it and try to find a toy dataset where it applies

@koaning
Owner

koaning commented Dec 17, 2021

Grand! Let me know if you'd appreciate any support/review.

My advice might be to first try to run the problem on the dataset before worrying too much about implementation. It's much easier to tackle the theoretical part of a problem once there's a practical example in place.

@FBruzzesi
Contributor Author

FBruzzesi commented Dec 17, 2021

Hey @koaning, I have a few questions/observations:

  • StandardDeviationReason doesn't feel like a great name for the feature, as I would relate it to the overall model rather than to single-point predictions. How about something like AbsoluteDifferenceStdReason or StandardizedErrorReason?
  • Just want to make sure we agree on what QuantileDifferenceReason means. I implemented a check for residuals to be within the [q1 - 1.5 IQR, q3 + 1.5 IQR] range, where q1, q3 and IQR are the first quartile, third quartile and interquartile range respectively; more generally, within [quantiles[0] - multiplier * IQR, quantiles[1] + multiplier * IQR].
  • As a styling question: can I check value validity with assert statements (e.g. positive threshold and quantiles within the 0-1 range)?
  • Finally, how should I proceed? I tested on the diabetes toy dataset from sklearn, yielding a few examples for both Reasons.

Please let me know if something isn't clear; I may comment here with some code snippets as well if needed.
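
In the meantime, to make the second point concrete, here is a minimal sketch of the check described above (the function name and defaults are only illustrative, not the proposed API):

```python
import numpy as np

def iqr_doubt_mask(residuals, quantiles=(0.25, 0.75), multiplier=1.5):
    """Flag residuals outside [q_low - multiplier * IQR, q_high + multiplier * IQR]."""
    q_low, q_high = np.quantile(residuals, quantiles)
    iqr = q_high - q_low
    lower, upper = q_low - multiplier * iqr, q_high + multiplier * iqr
    return (residuals < lower) | (residuals > upper)
```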

@koaning
Owner

koaning commented Dec 17, 2021

  1. StandardDeviationReason doesn't feel like a great name for the feature

I think StandardizedErrorReason sounds good for now. I'll noodle on it a bit.

  2. Just want to make sure we agree on what QuantileDifferenceReason means.

I was thinking that we sort the residuals and allow the user to say something like "assign doubt to all rows where the error is larger than the 95% quantile" (see the sketch at the end of this comment).

  3. As a styling question: can I check value validity with assert statements

I usually resort to assert, sometimes in combination with np.all or np.isclose.

  4. ... yielding a few examples for both Reasons

Did these yield the wrong labels? One thing you might want to try is to flip a few labels randomly upfront and see if you can retrieve the flipped labels with this trick. It's not a perfect proxy, but it's a plausible demo.
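
To illustrate the quantile interpretation from point 2 (plain numpy, just a sketch of the idea rather than the eventual implementation):

```python
import numpy as np

def quantile_doubt_mask(y_true, y_pred, quantile=0.95):
    """Doubt the rows whose absolute error exceeds the given quantile of all errors."""
    abs_error = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    return abs_error > np.quantile(abs_error, quantile)
```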

@FBruzzesi
Contributor Author

I was thinking that we sort the residuals and allow the user to say something like "assign doubt to all rows where the error is larger than the 95% quantile".

This looks very deterministic, meaning that for any given model you will doubt 5% of the results. On the other hand, using the usual boxplot ranges mentioned above (or any other user-favourite quantiles/multipliers) may or may not result in doubt. Imagine an error distribution that is zero-centered and "very" symmetrical: the former would still doubt some results, while the latter wouldn't.

Did these yield the wrong labels?

As this is a regression task, I am not even sure what flipping labels exactly means. I am trying to add/multiply the feature matrix by random noise, then checking whether the rows I get back from DoubtEnsemble are the most perturbed ones.

@koaning
Owner

koaning commented Dec 18, 2021

Let me try to explain the "flipping labels experiment". Suppose we have a dataset X, y in a dataframe. Let's take, say, 10% of all rows and designate these to be shuffled.

Next, we take the y values that are designated to be shuffled and we shuffle these such that the original value is replaced by another value.


We now have a dataset where we know some of the y values to be false. We can then ask "does our approach find the bad labels?". It's a bit of a hacky way to go about it, since the way we simulate bad labels may not resemble reality. But it's a proxy, if nothing else, that suggests whether we're able to find bad labels, and it should at least give us a hint of how reliable some of our doubt reasons might be.
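
Something along these lines is what I have in mind (a rough sketch using sklearn's diabetes dataset; not a utility that exists in the library, and the 95%-quantile rule at the end is just a placeholder for whichever reason we test):

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X, y = load_diabetes(return_X_y=True)

# designate 10% of the rows and shuffle their y values among themselves
n_flip = int(0.10 * len(y))
flip_idx = rng.choice(len(y), size=n_flip, replace=False)
y_shuffled = y.copy()
y_shuffled[flip_idx] = rng.permutation(y_shuffled[flip_idx])

# train *after* shuffling, then look at the residuals
model = LinearRegression().fit(X, y_shuffled)
residuals = y_shuffled - model.predict(X)

# does a simple residual rule recover the shuffled rows?
doubted = np.abs(residuals) > np.quantile(np.abs(residuals), 0.95)
print("share of shuffled rows that get doubted:", doubted[flip_idx].mean())
```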

@koaning
Owner

koaning commented Dec 18, 2021

This reminds me, we may want to have a utility submodule to make these kinds of experiments easy.

@FBruzzesi
Contributor Author

FBruzzesi commented Dec 19, 2021

While working on such a test, I found that

We now have a dataset where we know some of the y values to be false.

is kind of misleading, as the predicted values are not influenced by the shuffle; however, by random chance, a few shuffled y values may end up closer to the predicted values y_hat, reducing the magnitude of the residual.

Focusing solely on those data points satisfying both of the following conditions:

  • Shuffled data
  • Larger residual than the original (note that this can still be small in relative terms)

then testing on the diabetes toy dataset from sklearn with 1000 different random states yields the following (rates computed as in the sketch at the end of this comment):

  • For StandardDeviationReason
    • True positive rate of ~30%
    • False positive rate of ~2%
  • For QuantileDifferenceReason (with the above mentioned boxplot method)
    • True positive rate of ~13%
    • False positive rate of ~0.2%
  • For QuantileDifferenceReason (by just sorting the residuals and doubting those predictions with residual > 0.95-quantile)
    • True positive rate of ~25%
    • False positive rate of ~3.5%
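
For reference, the rates above are computed roughly like this (a simplified sketch, not my exact benchmarking code), where `bad_label` marks the rows satisfying the two conditions above:

```python
import numpy as np

def tpr_fpr(doubted, bad_label):
    """True/false positive rate of a doubt mask against a known bad-label mask."""
    doubted, bad_label = np.asarray(doubted), np.asarray(bad_label)
    tpr = (doubted & bad_label).sum() / bad_label.sum()
    fpr = (doubted & ~bad_label).sum() / (~bad_label).sum()
    return tpr, fpr
```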

@koaning
Owner

koaning commented Dec 19, 2021

Cool!

Just to confirm, could you verify the precision/recall values?

Also, when are you training your model, before or after the shuffling? If we're to match reality, we should train the model after we've shuffled.

@FBruzzesi
Contributor Author

FBruzzesi commented Dec 19, 2021

Here are some of the stats:

| reason | recall | precision | fpr |
| --- | --- | --- | --- |
| QuantileDifferenceReason(quantile=0.95) | 0.23 | 0.39 | 0.035 |
| BoxplotReason(multiplier=1.5) (*) | 0.099 | 0.54 | 0.002 |
| StandardDeviationReason(threshold=2.) | 0.28 | 0.62 | 0.017 |

Yes, shuffling and training are done in that order.

(*) Any better name for this one? Should we keep all three of these reasons?

@koaning
Owner

koaning commented Dec 19, 2021

One final question before we move on (although the results themselves are pretty interesting!). Could you check if these numbers change much if you flip more or fewer labels? I imagine that 1%, 5%, 10% label errors might yield different results.

@FBruzzesi
Contributor Author

FBruzzesi commented Dec 19, 2021

The following results are mean scores across 500 different random states per reason/%shuffled pair:

| reason | recall | precision | fpr | %shuffled |
| --- | --- | --- | --- | --- |
| QuantileDifferenceReason | 0.31 | 0.05 | 0.051 | 1% |
| QuantileDifferenceReason | 0.40 | 0.24 | 0.042 | 5% |
| QuantileDifferenceReason | 0.37 | 0.42 | 0.033 | 10% |
| QuantileDifferenceReason | 0.29 | 0.62 | 0.023 | 20% |
| BoxplotReason | 0.10 | 0.07 | 0.004 | 1% |
| BoxplotReason | 0.13 | 0.34 | 0.003 | 5% |
| BoxplotReason | 0.11 | 0.47 | 0.002 | 10% |
| BoxplotReason | 0.06 | 0.52 | 0.0007 | 20% |
| StandardDeviationReason | 0.29 | 0.072 | 0.037 | 1% |
| StandardDeviationReason | 0.34 | 0.296 | 0.029 | 5% |
| StandardDeviationReason | 0.30 | 0.461 | 0.023 | 10% |
| StandardDeviationReason | 0.26 | 0.703 | 0.014 | 20% |

@koaning
Owner

koaning commented Dec 19, 2021

Nicely done! It's interesting to see that the StandardDeviationReason seems to outperform the other two reasons.

As far as I'm concerned a PR for StandardDeviationReason can get started.

If you happen to have any benchmarking code to share I might consider saving that for the documentation as well.

@FBruzzesi
Contributor Author

@koaning I just found an error in the QuantileDifferenceReason implementation; I am updating the table above. I want to mention again that such a reason will doubt a certain percentage of results no matter what.

Regarding some sample code, I'm not sure where I should/could share it.

@koaning
Owner

koaning commented Dec 20, 2021

@FBruzzesi if it's a notebook you can put it in a Github gist if that's easier for you.

koaning added a commit that referenced this issue Dec 21, 2021
Issue #28, StandardizedErrorReason class
@koaning
Owner

koaning commented Dec 21, 2021

I've just merged #29. Before making a new release though I'm wondering if it makes sense to add the QuantileDifferenceReason as well. @FBruzzesi would you prefer to add it?

@koaning
Owner

koaning commented Dec 21, 2021

Actually ... the new method is listed on the readme so I should release a patch. Lemme do that real quick.

@koaning
Owner

koaning commented Dec 21, 2021

Done! I'll also make an announcement tomorrow for it. Got a twitter handle? If so I can give you a shoutout.

@FBruzzesi
Contributor Author

I feel like you are not actually convinced by these other methods! I will make a notebook illustrating them as soon as I have the time, and maybe we can discuss whether to add them afterwards.

@FBruzzesi
Contributor Author

FBruzzesi commented Dec 21, 2021

Also, you should be able to find me on twitter as @BruzzesiFr

@koaning
Owner

koaning commented Dec 21, 2021

Just to be explicit; I very much appreciate the work you're doing here! But what method are you referring to now? The BoxplotReason?

I figured moving on to the QuantileDifferenceReason made sense because of its performance on your initial benchmark. I'll gladly consider other options but I do prefer a benchmark that backs up the reasoning.

Am looking forward to your notebook 👍

@FBruzzesi
Contributor Author

@koaning I finally found the time to write a notebook; you can find it here.

@koaning
Owner

koaning commented Dec 27, 2021

Interesting!

I've added utility methods to the main branch that allow folks to play around with "flipping" labels in a subset. I'll likely also add some plotting functionality around it so we can get some "precision_at_k" and "recall_at_k" plots to compare approaches. My impression so far is that for some dataset/model/reason combinations it's very easy to find bad labels, while for others it's barely better than random sorting.
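
To be explicit about what I mean by "precision_at_k" here (a rough sketch of the idea, not the utility that will land in the library):

```python
import numpy as np

def precision_recall_at_k(doubt_score, bad_label, k):
    """Precision/recall when doubting only the k highest-scoring rows."""
    bad_label = np.asarray(bad_label)
    top_k = np.argsort(doubt_score)[::-1][:k]  # indices of the k most doubted rows
    hits = bad_label[top_k].sum()              # how many of them are truly bad labels
    return hits / k, hits / bad_label.sum()
```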

@koaning
Owner

koaning commented Dec 27, 2021

I'll likely merge the plotting tonight and I'll also push a new version.

Out of curiosity, since you've given the library a spin already, are there any features missing in your opinion with regards to plotting?

@FBruzzesi
Contributor Author

As you may have noticed, I work much more with regression problems than with classification tasks. There is a lot of custom plotting I do when it comes to checking results/predictions, and I am currently working on a (still private) library to standardize a few of these checks.

That said, I'm not sure what you could integrate here; maybe something as simple as a residual plot with different colors for doubted/non-doubted points, similar to what I tried to do in the notebook I just shared (a minimal sketch below). Feel free to assign me such a task if needed.
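
Something like this is what I have in mind (a minimal matplotlib sketch; the variable names are just placeholders, expecting numpy arrays and a boolean doubt mask):

```python
import matplotlib.pyplot as plt

def residual_plot(y_pred, residuals, doubted):
    """Scatter residuals against predictions, coloring doubted points differently."""
    fig, ax = plt.subplots()
    ax.scatter(y_pred[~doubted], residuals[~doubted], alpha=0.5, label="kept")
    ax.scatter(y_pred[doubted], residuals[doubted], color="red", label="doubted")
    ax.axhline(0.0, linestyle="--", color="gray")
    ax.set_xlabel("predicted value")
    ax.set_ylabel("residual")
    ax.legend()
    return ax
```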

@FBruzzesi
Contributor Author

@koaning should we proceed to close this issue?

@koaning koaning closed this as completed Dec 29, 2021