PUBDEV-8899: Tweedie Variance Power Maximum Likelihood estimation #6676
Conversation
h2o-docs/src/product/code-examples/explain-code-examples/Explain-wine-example.ipynb
@@ -2244,7 +2260,176 @@ private void fitIRLSMML(Solver s) {
        }
      } catch (NonSPDMatrixException e) {
        Log.warn(LogMsg("Got Non SPD matrix, stopped."));
-       throw e; // TODO: Decide if we want to return a bad model or no model at all.
+       warn("Regression with MLE training:", "Got Non SPD matrix, stopped.");
I'm not sure about this behavior. I think it might be better to end up with some model than none, but the user has to be warned that the model didn't finish training properly. If we decide to keep it, we should probably improve the message.
When you have a Non SPD matrix, it cannot be avoided. It means the dataset is bad and the gram matrix cannot be inverted because it is not SPD (symmetric positive definite).
Sure, this is fine.
…sion, and saddlepoint method
…and inversion methods
… removal of the link power estimation
Force-pushed from f4e8256 to 19b9a46.
I tried your code and it works really well for most cases. It has problems for power = 2.1 and power = 2.6. I don't think setting the bounds of power to between 1 and 2 is a good idea; I know the reason you chose to do it. I have attached several datasets where the algorithm failed to arrive at a good estimate. You can access them here: s3://h2o-public-test-data/smalldata/glm_test/tweedie_p2p1_disp3_5Cols_10kRows_est1p89.csv

These datasets do contain y <= 0, and the algo somehow locks the estimate within the 1 and 2 range and will not consider the outside range at all. I think a good remedy here is to put into the documentation that if the user believes that the variance power is > 2, they should remove all rows that have resp <= 0 (see the sketch below). However, this does not seem to work for s3://h2o-public-test-data/smalldata/glm_test/tweedie_p2p6_disp2_5Cols_10krows_est1p94.csv. I will slack you my code.
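For reference, the suggested remedy might look like the minimal sketch below in the h2o Python client (assumptions: the response column is named `resp`, and the s3 bucket is reachable via its public https mirror):

```python
import h2o

h2o.init()
train = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/"
    "tweedie_p2p1_disp3_5Cols_10kRows_est1p89.csv")
# If the variance power is believed to be > 2, keep only rows with a
# strictly positive response before fitting.
train = train[train["resp"] > 0, :]
```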
Thank you @wendycwong for the thorough testing. I have tried both datasets with the response above 0. I can remove the logic that sets the range to (1, 2) when there are zeros (or lower numbers) in the response; however, it will not change anything. I have also tried ignoring zeros in the response, but that increases the likelihood of […]. I wonder if the difference can be just from the definition: if you want to have a non-zero chance of having a zero in the response, you have to allocate some non-zero probability to it, so there is less probability left for the pdf (equations (1) and (2)). And this tiny difference is added with every single row, so it might get big enough to influence the results.

I looked in more detail at the case with true p value = 2.6. Note: the vertical lines denote the optimum found by H2O; all the points are, however, calculated using only R's tweedie package. Combined - filled points are from the dataset with zeros:
pyunit_utils.run_tests([
    test_tweedie_var_power_estimation_2p1_disp2_est,
    test_tweedie_var_power_estimation_2p6_disp2_est,
Can you separate out the tests that take a longer time and move them to ***_large.py? Also, the dataset they use should then be found in bigdata/laptop/glm_test as well. This will help reduce our test timeout.
…er_ml_estimation_large.py
https://h2oai.atlassian.net/browse/PUBDEV-8899
NOTE: If you see a comma in a math expression and it is not within a polygamma function ($\psi(a,b)$), it is just a misrendered `\,` (small space; 3/18 of a quad).

Implementation
Optimization
- Bracket a region where $p$ maximizes the likelihood ($\mathcal{L}$).
- Once the bracket is small enough ($\Delta p = 0.1$), try to use Newton's method, which should converge much faster.
This optimization procedure is further optimized to speed up the convergence a bit: evaluate `p` from the previous iteration together with `p-0.01` and `p+0.01`; if `p` has the highest likelihood, search only in `[p-0.01, p+0.01]`. This can eliminate multiple evaluations if `p >> 2`, and when it doesn't fail it eliminates ~5 evaluations for p in (1, 3] in each iteration. It tends to fail only in the first couple of iterations. A minimal sketch of the bracket-then-Newton loop follows.
Likelihood, gradient and hessian calculation

Likelihood is calculated by the best method available depending on the values of $y, \mu, \varphi$, using the $\xi\text{-based}$ heuristic proposed in section 8 in [1]. I also tried to use a heuristic from R's `tweedie::dtweedie`, which improved results for $p=1.2$, but overall the results worsened and weren't as stable - the starting Tweedie variance power significantly influenced (±0.05) the estimated Tweedie var. power, and at the same time it utilized the inversion method more often for cases that took a long time to estimate, which increased the runtime ~10x.

I also tried several different combinations of those two heuristics, but it seems that using the one from [1] is the best. It has trouble with $p < 1.3$, and sometimes when $p$ is close to 2 the likelihood gets noisy and for $p \in (2, 2.1)$ it often ends up in $-\infty$; for this part I force the likelihood during the optimization to be estimated using the inversion method, which provides more stable results, and use the golden ratio search. The problem with this is that there are slight differences between the two methods that can add up to a difference that would influence the optimization procedure; for that reason the inversion method is forced to be used only if the bracket is small enough (lower bound > 1.95 && upper bound < 2.1).
The gradient and Hessian are calculated only by the series method. This results in a more robust likelihood calculation that can be used to correct the optimization procedure when using Newton's method.
Details
What methods are used to estimate likelihood?
Depending on $p$, $y$, $\phi$, different methods are used for log likelihood estimation. Let

$$\xi = \frac{\phi}{y^{2-p}}$$
- If $p = 2$, use the log likelihood of the gamma distribution.
- If $p = 3$, use the inverse Gaussian distribution.
- If $p < 2$ and $\xi \leq 0.01$, use the Fourier inversion method [1], briefly described below.
- If $p > 2$ and $\xi \geq 1$, use the Fourier inversion method.
- Else use the Series method [4] (also described in more detail below). If the series method fails (outputs NaN), try to use the Fourier inversion method.
- If the Series method and the Fourier inversion method failed, or if the Fourier inversion method was chosen based on the $\xi$ criterion and it failed, estimate the log likelihood using the Saddlepoint approximation (also described below).

During the optimization we sometimes want to force usage of the Fourier inversion method; this happens when the whole bracket is within [1.95, 2.1]. A sketch of this dispatch follows.
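A minimal sketch of the dispatch logic above; the five log-density routines are passed in as callables and are placeholders here (assumptions), only the selection logic mirrors the description:

```python
import math

def tweedie_loglik(y, mu, phi, p, *, series, inversion, saddlepoint,
                   gamma, inverse_gaussian, force_inversion=False):
    """Pick the evaluation method based on p and xi = phi / y**(2 - p)."""
    if force_inversion:
        # During optimization, inversion is forced when the whole
        # bracket lies within [1.95, 2.1].
        return inversion(y, mu, phi, p)
    if p == 2:
        return gamma(y, mu, phi)
    if p == 3:
        return inverse_gaussian(y, mu, phi)
    xi = phi / y ** (2 - p)
    if (p < 2 and xi <= 0.01) or (p > 2 and xi >= 1):
        ll = inversion(y, mu, phi, p)
    else:
        ll = series(y, mu, phi, p)
        if math.isnan(ll):
            ll = inversion(y, mu, phi, p)
    if math.isnan(ll):
        ll = saddlepoint(y, mu, phi, p)  # last resort
    return ll
```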
Series Method [4]
NOTE: $\psi(x)$ is the digamma function and $\psi(1,x)$ is the trigamma function.
1<p<2

Log likelihood for Y = 0:

$$\log \mathrm{P}(y=0) = \frac{\mu^{-p + 2}\, w}{\phi\, {\left(p - 2\right)}}$$
First derivative with respect to $p$:
$\displaystyle \frac{\partial \log \mathrm{P}(y=0)}{\partial p} = -\frac{\frac{\mu^{-p + 2} w \log\left(\mu\right)}{p - 2} + \frac{\mu^{-p + 2} w}{{\left(p - 2\right)}^{2}}}{\phi}$
Second derivative with respect to $p$:
$\displaystyle \frac{\partial^2 \log \mathrm{P}(y=0)}{\partial p^2} = \frac{\frac{\mu^{-p + 2} w \log\left(\mu\right)^{2}}{p - 2} + \frac{2\, \mu^{-p + 2} w \log\left(\mu\right)}{{\left(p - 2\right)}^{2}} + \frac{2\, \mu^{-p + 2} w}{{\left(p - 2\right)}^{3}}}{\phi}$
Log likelihood for Y > 0 (with $W_j$ consistent with the derivative below):

$$W_j = \frac{\left({\left(p - 1\right)}^{\frac{p - 2}{p - 1}} w^{\left(\frac{1}{p - 1}\right)}\right)^{j}}{\left(-{\left(p - 2\right)} \phi^{\left(\frac{1}{p - 1}\right)} y^{\frac{p - 2}{p - 1}}\right)^{j} \Gamma\left(j + 1\right) \Gamma\left(-\frac{j {\left(p - 2\right)}}{p - 1}\right)}$$

$$\log\mathrm{P}(y) = -\log(y) + \log\left(\sum_{j=1}^{+\infty} W_j\right) + \frac{w}{\phi}\left(y\theta - \kappa(\theta)\right)$$

First derivative with respect to $p$:

$$\frac{\partial W_j}{\partial p} = \frac{{\left(p {\left(\log\left(\frac{{\left(p - 1\right)} \phi}{w y}\right) - 2\right)} + {\left(p - 2\right)} \psi\left(-\frac{j {\left(p - 2\right)}}{p - 1}\right) + 2\, \log\left(\frac{w y}{{\left(p - 1\right)} \phi}\right) + 3\right)} \left({\left(p - 1\right)}^{\frac{p - 2}{p - 1}} w^{\left(\frac{1}{p - 1}\right)}\right)^{j} j}{{\left(p^{3} - 4\, p^{2} + 5\, p - 2\right)} \left(-{\left(p - 2\right)} \phi^{\left(\frac{1}{p - 1}\right)} y^{\frac{p - 2}{p - 1}}\right)^{j} \Gamma\left(j + 1\right) \Gamma\left(-\frac{j {\left(p - 2\right)}}{p - 1}\right)}$$

$$\frac{\partial \log\mathrm{P}(y)}{\partial p} = \dfrac{\sum_{j=1}^{+\infty} \cfrac{\partial W_j}{\partial p}}{\sum_{j=1}^{+\infty} W_j } + \frac{{\left(\frac{\mu^{-p + 1} y \log\left(\mu\right)}{p - 1} - \frac{\mu^{-p + 2} \log\left(\mu\right)}{p - 2} + \frac{\mu^{-p + 1} y}{{\left(p - 1\right)}^{2}} - \frac{\mu^{-p + 2}}{{\left(p - 2\right)}^{2}}\right)} w}{\phi}$$
Second derivative with respect to $p$: (breaks math rendering in GitHub)
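A numeric sketch of this series for 1 < p < 2 (y > 0), written from the compound Poisson-gamma representation it derives from; the function name, stopping rule, and tolerances are my assumptions, and production code also needs the derivative series above to drive Newton's method:

```python
import math

def loglik_series_1to2(y, mu, phi, p, w=1.0, max_terms=10_000, rtol=1e-14):
    alpha = (2.0 - p) / (p - 1.0)  # gamma shape contributed per Poisson count
    # log of the Poisson rate: lambda = w * mu**(2-p) / (phi * (2-p))
    log_lam = (math.log(w) + (2.0 - p) * math.log(mu)
               - math.log(phi) - math.log(2.0 - p))
    # log of the gamma scale: (phi / w) * (p-1) * mu**(p-1)
    log_scale = (math.log(phi / w) + math.log(p - 1.0)
                 + (p - 1.0) * math.log(mu))
    log_sum, prev = -math.inf, math.inf
    for j in range(1, max_terms + 1):
        # log W_j; all terms are positive for 1 < p < 2
        log_wj = (j * (log_lam + alpha * (math.log(y) - log_scale))
                  - math.lgamma(j + 1.0) - math.lgamma(j * alpha))
        top, bot = max(log_sum, log_wj), min(log_sum, log_wj)
        log_sum = top + math.log1p(math.exp(bot - top))  # running logsumexp
        decreasing, prev = log_wj < prev, log_wj
        if decreasing and log_wj < log_sum + math.log(rtol):
            break  # remaining terms are negligible
    theta = mu ** (1.0 - p) / (1.0 - p)   # canonical parameter
    kappa = mu ** (2.0 - p) / (2.0 - p)   # cumulant function
    return -math.log(y) + log_sum + (w / phi) * (y * theta - kappa)
```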
p > 2

Undefined for y = 0. For y > 0:
Log likelihood:
$$V_k = \frac{\left(-1\right)^{k} {\left(p - 1\right)}^{\frac{k {\left(p - 2\right)}}{p - 1}} \phi^{k {\left(\frac{p - 2}{p - 1} - 1\right)}} \Gamma\left(\frac{k {\left(p - 2\right)}}{p - 1} + 1\right) \sin\left(-\frac{\pi k p}{p - 1} + \frac{2\, \pi k}{p - 1}\right)}{{\left(p - 2\right)}^{k} w^{k {\left(\frac{p - 2}{p - 1} - 1\right)}} y^{\frac{k {\left(p - 2\right)}}{p - 1}} \Gamma\left(k + 1\right)}$$
$$\log\mathrm{P}(y) = - \log(\pi y) + \log\left(\sum_{k=1}^{+\infty}V_k\right)+\frac{w}{\phi} (y\theta - \kappa(\theta))$$
First derivative with respect to $p$:
$$\frac{\partial \log\mathrm{P}(y)}{\partial p} = \dfrac{\sum_{k=1}^{+\infty} \cfrac{\partial V_k}{\partial p}}{\sum_{k=1}^{+\infty} V_k} +\frac{{\left(\frac{\mu^{-p + 1} y \log\left(\mu\right)}{p - 1} - \frac{\mu^{-p + 2} \log\left(\mu\right)}{p - 2} + \frac{\mu^{-p + 1} y}{{\left(p - 1\right)}^{2}} - \frac{\mu^{-p + 2}}{{\left(p - 2\right)}^{2}}\right)} w}{\phi}$$
Second derivative with respect to $p$: (breaks math rendering in GitHub)
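A numeric sketch of this alternating series (y > 0); the $V_k$ alternate in sign, so log magnitudes and signs are tracked separately and rescaled by the largest term before summing. Names and the stopping rule are my assumptions; a NaN return signals the fallback to the other methods described above:

```python
import math

def loglik_series_pgt2(y, mu, phi, p, w=1.0, max_terms=10_000, rtol=1e-14):
    alpha = (p - 2.0) / (p - 1.0)
    base = (alpha * math.log(p - 1.0) + (alpha - 1.0) * math.log(phi / w)
            - math.log(p - 2.0) - alpha * math.log(y))
    logs, signs = [], []
    peak, prev = -math.inf, math.inf
    for k in range(1, max_terms + 1):
        s = math.sin(-math.pi * k * alpha) * (-1.0) ** k  # signed sin factor
        if s == 0.0:
            continue
        log_vk = (k * base + math.lgamma(k * alpha + 1.0)
                  - math.lgamma(k + 1.0) + math.log(abs(s)))
        logs.append(log_vk)
        signs.append(math.copysign(1.0, s))
        peak = max(peak, log_vk)
        decreasing, prev = log_vk < prev, log_vk
        if decreasing and log_vk < peak + math.log(rtol):
            break  # remaining terms are negligible
    total = sum(sg * math.exp(lv - peak) for sg, lv in zip(signs, logs))
    if total <= 0.0:
        return math.nan  # catastrophic cancellation: let the caller fall back
    theta = mu ** (1.0 - p) / (1.0 - p)
    kappa = mu ** (2.0 - p) / (2.0 - p)
    return (-math.log(math.pi * y) + peak + math.log(total)
            + (w / phi) * (y * theta - kappa))
```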
Fourier Inversion method
Implemented as method 3 described in [1]:
(3)

$$f(y; \mu, \phi) = b(y, \phi)\exp\left\{-\frac{1}{2\phi} d(y, \mu) \right\}$$

where $d(y, \mu)$ is the deviance.
(6)

$$f(y;\mu, \phi) = c\,f(cy;\, c\mu,\, c^{2-p}\phi)$$
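Identity (6) is consistent with the homogeneity of the Tweedie unit deviance, $d(cy, c\mu) = c^{2-p}\, d(y, \mu)$, which a short check can confirm (a minimal sketch; the deviance formula is the standard unit deviance for $p \notin \{1, 2\}$, and the test values are arbitrary):

```python
import math

def tweedie_unit_deviance(y, mu, p):
    """Standard Tweedie unit deviance for p not in {1, 2}, y > 0."""
    return 2.0 * (y ** (2.0 - p) / ((1.0 - p) * (2.0 - p))
                  - y * mu ** (1.0 - p) / (1.0 - p)
                  + mu ** (2.0 - p) / (2.0 - p))

y, mu, p, c = 2.5, 2.0, 1.7, 10.0
lhs = tweedie_unit_deviance(c * y, c * mu, p)
rhs = c ** (2.0 - p) * tweedie_unit_deviance(y, mu, p)
assert abs(lhs - rhs) < 1e-12 * abs(rhs)
```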
1<p<2
Two conditional densities (y=0, y>0) are merged together.

For y=0: $\mathrm{P}(y=0) = \exp\left(\frac{\mu^{-p + 2}\, w}{\phi\, (p - 2)}\right)$

For y > 0:

Resulting formula:
p > 2
Saddlepoint approximation method [3]
(Used only if the previous methods produce NaN.)

Log likelihood:
$$\log f(y;\, \mu, \phi) = -\frac{1}{2} \left(\log(2 \pi \phi) + p \log(y)\right) - \frac{d(y, \mu)}{2 \phi}$$
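A minimal sketch of this approximation (function names are assumptions; $d$ is the same unit deviance as in (3)):

```python
import math

def tweedie_unit_deviance(y, mu, p):
    """Standard Tweedie unit deviance for p not in {1, 2}, y > 0."""
    return 2.0 * (y ** (2.0 - p) / ((1.0 - p) * (2.0 - p))
                  - y * mu ** (1.0 - p) / (1.0 - p)
                  + mu ** (2.0 - p) / (2.0 - p))

def loglik_saddlepoint(y, mu, phi, p):
    """Saddlepoint approximation [3]: -0.5*log(2*pi*phi*y**p) - d/(2*phi)."""
    return (-0.5 * (math.log(2.0 * math.pi * phi) + p * math.log(y))
            - tweedie_unit_deviance(y, mu, p) / (2.0 * phi))
```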
Usage

For all cases I assume the dispersion method is set to ML.

- With `fix_tweedie_variance_power = True` and `fix_dispersion_parameter = True|False` I don't optimize either, but the likelihood is calculated properly, so it can be used for external optimization of both p and phi (using `init_dispersion_parameter`), e.g., Bayesian optimisation with a Gaussian process with a squared exponential kernel could work (having nice smooth estimates could mitigate the noisiness of the log likelihood; but I did not try this approach).
- With `fix_tweedie_variance_power = False` and `fix_dispersion_parameter = True` it will optimize the Tweedie variance power as described in the Optimization section of this PR. As the dispersion (phi, $\varphi$), the value set in `init_dispersion_parameter` is used.
- With `fix_tweedie_variance_power = False` and `fix_dispersion_parameter = False` it will in the future optimize both the Tweedie variance power (p) and the dispersion parameter ($\varphi$).

A hypothetical usage sketch follows.
TODO

- test on more datasets
- test with link power estimation - doesn't work so I removed the related code

Current Limitations
When the response contains zeros, the Tweedie variance power (p) will be estimated only between 1 and 2.
When skipping the likelihood calculation for zeros with $p \geq 2$, we can see an artifact created by this around 2:
![skipping_zeros](https://user-images.githubusercontent.com/61695433/230599594-62536ef9-b0f4-41a6-a6dd-4ece613958f6.png)
If we use the same dataset but filter out the rows with $\text{response} \leq 0$, we can estimate some $p$, but the value is different than when using just the proper likelihood, where the log likelihood of a response of 0 or lower turns to $-\infty$ for $p \geq 2$:
![zeros_filtered](https://user-images.githubusercontent.com/61695433/230599628-15023e95-b0cf-474d-97ee-cb77a8b79330.png)
Properly calculated log likelihood:
![zero_to_inf](https://user-images.githubusercontent.com/61695433/230599660-eb8ec593-f30b-49bd-a1e3-ef217e0b1613.png)
Not estimating Poisson

As can be seen from the previous plot, $p = 1$, which corresponds to Poisson (when $\varphi = 1$), looks very likely, but there are 2 problems:
Problems

- There might be a faster, interpolation-based density evaluation (R's `tweedie::dtweedie.interp`) but I wasn't able to try it yet.
- Using the old calculation as the new one caused `NonSPDMatrixException` in `pyunit_pubdev_8685_tweedie_dispersion_factor_exceed2`.
References
[1] DUNN, Peter K. and SMYTH, Gordon K., 2008. Evaluation of Tweedie exponential dispersion model densities by Fourier inversion.
[2] OHLSSON, Esbjörn and JOHANSSON, Björn, 2006. Exact Credibility and Tweedie Models. ASTIN Bulletin: The Journal of the IAA. May 2006. Vol. 36, no. 1, p. 121-133.
[3] DUNN, Peter K. and SMYTH, Gordon K., 2001. Tweedie Family Densities: Methods of Evaluation.
[4] DUNN, Peter K. and SMYTH, Gordon K., 2005. Series evaluation of Tweedie exponential dispersion model densities.
Attachments
Sage notebooks.zip