Thoughts on _jacobi.py #1
Conversation
@HDembinski Last year, you opened scipy/scipy#17059. I'd be interested in moving something like that forward, although I think I'd like to work it into the framework of mdhaber/scipy#106. Here are a few thoughts I had about _jacobi.py.
If you don't have time to pursue this right now or are no longer interested, that's fine, but if you have a moment, I was hoping you could answer two questions about lines 157/158 (the even powers of h used in np.polyfit) and line 164 (no appearance of Student's t distribution). Images are from D'Errico's writeup. Thanks for your help!
maxiter : int, optional
    Maximum number of iterations of the algorithm.
maxgrad : int, optional
    Maximum grad of the extrapolation polynomial.
Consider expressing in terms of "degree" of the polynomial. Perhaps "grad" refers to the highest derivative that is nonzero?
Yes, that should be "degree". I don't know why I called it maxgrad; usually I try to follow established names used in established libraries. That may be a bit of German language influence: we call "degree" "Grad".
Ah, makes sense.
return (-3 * f0 + 4 * f1 - f2) * (0.5 / h)

def _first(method, f0, f, x, i, h, args):
It looks like the purpose of this function is to detect whether NaNs are encountered and, if so, choose the appropriate one-sided formula. If this were to be added to SciPy, I'd suggest that this function just determine which formula should be used, then call _derive (_differentiate?), possibly passing the values at which f has already been evaluated (to eliminate the duplicate calculation). The purpose is to avoid repeating the formulas.
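A rough sketch of the kind of split described here (the names _select_method, the cache dict, and this _derive signature are illustrative stand-ins, not the actual _jacobi.py or SciPy API):

# Rough sketch only; `_select_method`, the cache dict, and `_derive` here
# are illustrative stand-ins, not the actual _jacobi.py functions.
import numpy as np

def _select_method(f, x, h):
    """Probe both sides of x and pick central (0), forward (1), or
    backward (-1) differences depending on where NaNs show up."""
    fm, fp = f(x - h), f(x + h)
    if np.isnan(fm) and not np.isnan(fp):
        return 1, {x + h: fp}            # forward only
    if np.isnan(fp) and not np.isnan(fm):
        return -1, {x - h: fm}           # backward only
    return 0, {x - h: fm, x + h: fp}     # central

def _derive(f, x, h, method, cache):
    """Apply the chosen formula, reusing any values already in the cache."""
    def ev(xi):
        if xi not in cache:
            cache[xi] = f(xi)
        return cache[xi]
    if method == 0:
        return (ev(x + h) - ev(x - h)) / (2 * h)
    s = method                            # +1 forward, -1 backward
    return s * (-3 * ev(x) + 4 * ev(x + s * h) - ev(x + 2 * s * h)) / (2 * h)

def f(t):
    # hypothetical function that is only defined for t > 0
    return np.log(t) if t > 0 else np.nan

method, cache = _select_method(f, 0.05, 0.1)
print(method)                                # 1: forward side chosen, since x - h < 0
print(_derive(f, 0.05, 0.1, method, cache))  # one-sided estimate of f'(0.05), coarse at this large h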
I agree that sounds cleaner, but I remember I chose this design to avoid a duplicate calculation.
I figured. I'd just suggest that there are other ways to avoid the duplicate calculation. I will probably work on that in my PR and then I'll have something more concrete to say.
else:
    method = 0
if method == 0:
    return method, None, (f1 - f2) * norm
Suggested change:
-    return method, None, (f1 - f2) * norm
+    return method, None, (f2 - f1) * norm
I think this produces the negative of the derivative estimate as-is.
No, this should be correct. I always compute f1 - f2 and h = (x1 - x2) * factor consistently everywhere, so h and norm should also be negative for x1 < x2. I hope that one of the unit tests breaks if you make that change.
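A quick numeric illustration of that sign convention (simplified: norm is taken here as 1 / (x1 - x2), ignoring the step factor used in the real code):

# Simplified illustration of the sign convention above: norm is taken as
# 1 / (x1 - x2) here, ignoring the step factor used in the real code.
def f(x):
    return x ** 2

x1, x2 = 1.0, 1.2            # x1 < x2, so x1 - x2 (and hence "h") is negative
f1, f2 = f(x1), f(x2)
norm = 1.0 / (x1 - x2)       # negative as well
print((f1 - f2) * norm)      # ~2.2, close to f'(1.1) = 2.2 -- the signs cancel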
I'd suggest taking a look; there are only a few lines to follow, starting at 121 and then jumping into this function. I noticed the numerical effect before finding the cause here: I saw that the _first derivative estimate changes sign depending on whether it's doing central or one-sided differences.
I haven't run the tests because I didn't install this properly (just copy-pasted), but I wouldn't be surprised if they didn't catch it (unless you have tests that confirm that polynomials only require the expected number of iterations). It only affects the first estimate, because every other iteration uses _derive. It could affect the number of iterations, but it has little effect on the final result.
start = i - (grad + 1)
stop = i + 1
q, c = np.polyfit(
    h[start:stop] ** 2, fd[start:], grad, rcond=None, cov=True
)
That's a good point. I think you are right and this should only be done for the central derivatives. Forward and backward derivatives should use a polynomial in h instead of h^2.
This may explain why the performance of the forward/backward derivatives was always worse than that of the central derivative. I don't see a fundamental reason for that in this scheme.
Well, I'd think that they will require more iterations because you have twice as many error terms to cancel to get to a given order. But yes, I think that adjusting this will help.
Yes, but on the other hand in each iteration you do half as many calculations, assuming that f0 is computed only once in the beginning. It should roughly cancel out.
Ah, theoretically that is possible, yes. I hadn't noticed that when h is reduced, two of the three one-sided points can be the same as in the previous iteration, so you only need to evaluate the function once per iteration with one-sided differences. Nice.
I don't think the code currently takes advantage of that, though; it performs two function evaluations in _derive whether using a one-sided or two-sided stencil. This would work if you used f(x0), f(x0 + h), and f(x0 + h/factor) as the evaluation points. For example, with factor=0.5, you could use f(x0), f(x0 + h), and f(x0 + 2*h) so that you only need to evaluate the function at one new point per step size.
Currently, though, you use f(x0), f(x0 + h), and f(x0 + 2*h) with a default factor of 1/4. You would need to adjust the weights based on factor if that is user-adjustable.
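A minimal sketch of the reuse idea with factor=0.5 (an illustration of the bookkeeping only, not what _jacobi.py currently does):

# Sketch of the point-reuse idea with factor = 0.5: after the first step,
# each halving of h requires only one new function evaluation, because
# x0 and x0 + h were already evaluated at the previous step size.
# Illustration only -- not the current _jacobi.py behaviour.
import numpy as np

def one_sided_with_reuse(f, x0, h0, steps=4):
    cache = {0.0: f(x0)}                    # values keyed by offset from x0
    estimates = []
    h = h0
    for _ in range(steps):
        for off in (h, 2 * h):
            if off not in cache:
                cache[off] = f(x0 + off)
        f0, f1, f2 = cache[0.0], cache[h], cache[2 * h]
        estimates.append((-3 * f0 + 4 * f1 - f2) / (2 * h))  # 2nd-order forward
        h *= 0.5                             # next step reuses x0 and x0 + h
    return estimates, len(cache)

est, n_eval = one_sided_with_reuse(np.exp, 1.0, 0.1)
print(n_eval)        # 6 distinct evaluations for 4 step sizes (2 + steps)
print(est[-1])       # approaches e = 2.718...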
Ah, I forgot about that. I have to check, but then the h^2 powers in the polynomial fit may be correct for the forward/backward formula, since it is second order like the central difference formula. At least that was the idea I had there.
No, I don't think the order of the formula you use controls the power on h. It's a matter of cancelling out error terms. With central difference, there are only even-powered error terms, so you can use h**2. But for one-sided difference, all powers of h remain and need to be cancelled.
When these things are adjusted, each new function evaluation will reduce the order of the error by one whether you are using one-sided or central difference.
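For reference, these are the standard Taylor expansions behind that statement (the one-sided line uses the same second-order forward formula quoted earlier):

% Central difference: only even powers of h appear in the error,
\[
\frac{f(x+h) - f(x-h)}{2h}
  = f'(x) + \frac{h^2}{6} f'''(x) + \frac{h^4}{120} f^{(5)}(x) + \dots
\]
% Second-order forward difference: error terms at every power of h from h^2 on,
\[
\frac{-3 f(x) + 4 f(x+h) - f(x+2h)}{2h}
  = f'(x) - \frac{h^2}{3} f'''(x) - \frac{h^3}{4} f^{(4)}(x) - \dots
\]
% so extrapolating in powers of h^2 is only justified for the central formula.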
I'll be doing the math on this today.
On one hand, if I mentally do Richardson extrapolation on two second-order approximations that have both odd and even order error terms, I can only hope to solve for coefficients that satisfy two criteria: the first derivative itself remains, and the second-order error term is cancelled. Symbolically, this would appear to leave the third-order error terms if I know nothing about the coefficients.
On the other hand, I know that if I have five function evaluations, I can solve for coefficients that will give me a fourth order approximation of the derivative.
So there is definitely a way to use the n+1 function evaluations to get an nth-order approximation. I don't think you can get that in general by extrapolating from n/2 second-order approximations, because in forming the second-order approximations you've locked some degrees of freedom you could have used to cancel out extra error terms.
For instance, suppose you have evaluated f at five points. You could choose five weights to satisfy five constraints: cancelling out all the terms of the Taylor series expansion through fourth order except the first derivative itself. Instead, you choose three weights and you get one second-order approximation, and then another second-order approximation like it. You get to choose two more coefficients, which is not enough to cancel all of the remaining error terms.
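Here is one concrete instance of that counting argument, assuming the second-order forward expansion above and the code's default factor of 1/4 (so the two stencils share only the point x):

% D(h) denotes the second-order forward estimate with step h, with
% D(h) = f'(x) - (h^2/3) f'''(x) - (h^3/4) f^{(4)}(x) + O(h^4).
\[
\frac{16\,D(h/4) - D(h)}{15}
  = f'(x) + \frac{h^3}{80}\, f^{(4)}(x) + O(h^4)
\]
% Five distinct evaluations (x, x+h/4, x+h/2, x+h, x+2h) give only a
% third-order estimate this way, while choosing all five weights freely
% on those same points would admit a fourth-order formula.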
OK, here's something a bit more concrete. I modified the Jacobi code as follows:
- Correct the sign error discussed above (see #1 (comment)).
- Allow step sizes larger than 0.5 (so we can see convergence before running into numerical garbage).
Here's some code that lets us plot the error as a function of the initial step size while we play with the initial step size and maxgrad (while maxiter stays fixed).
# Imports added for completeness; the jacobi import assumes the modified
# _jacobi.py is importable as the `jacobi` package.
import numpy as np
import matplotlib.pyplot as plt
from jacobi import jacobi

f = np.exp
df = np.exp
x = 1
df_ref = df(x)
maxiter = 5
maxgrad = 0
hs = np.logspace(np.log10(1), np.log10(5), 30)
err_central = []
err_oneside = []
for h in hs:
    step = (h, 0.25)
    df_approx, _ = jacobi(f, x, maxiter=maxiter, step=step, diagonal=True,
                          method=0, maxgrad=maxgrad)
    err_central.append(abs(df_approx - df_ref))
    df_approx, _ = jacobi(f, x, maxiter=maxiter, step=step, diagonal=True,
                          method=1, maxgrad=maxgrad)
    err_oneside.append(abs(df_approx - df_ref))
err_central = np.array(err_central)
err_oneside = np.array(err_oneside)
order_central = np.polyfit(np.log10(hs), np.log10(err_central), 1)[0]
order_oneside = np.polyfit(np.log10(hs), np.log10(err_oneside), 1)[0]
print(f'Central difference convergence order: {order_central}')
print(f'One-sided difference convergence order: {order_oneside}')
plt.loglog(hs, err_central, label=f'Central difference (order: {order_central})')
plt.loglog(hs, err_oneside, label=f'One-sided difference (order: {order_oneside})')
plt.legend()
plt.xlabel('Initial step size')
plt.ylabel('Error')
When maxgrad=0, we would expect both central difference and one-sided difference to give us 2nd-order convergence. Great!
When maxgrad=1, we would hope that both central difference and one-sided difference result in a fourth-order estimate, but:
If you adjust np.polyfit to look for a fit in h for the one-sided method instead of h**2, you can start to see convergence of order maxgrad + 1. I think this makes sense, because maxgrad is the degree of the polynomial, and that would correspond with the highest-order error term you can cancel out. (For central difference, you get order 2*(maxgrad+1), which also makes sense for the same reason.)
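For concreteness, the adjustment might look roughly like this (a hedged sketch, not the actual patch; it assumes method == 0 denotes central differences, as in the benchmark code above):

# Hedged sketch of the adjustment: choose the extrapolation abscissa based
# on the difference scheme, then reuse the same polynomial fit.
# Assumes method == 0 denotes central differences.
import numpy as np

def extrapolate(h, fd, grad, method):
    """Fit the extrapolation polynomial in h**2 for central differences,
    but in h for one-sided differences (where odd error terms survive)."""
    x_fit = h ** 2 if method == 0 else h
    q, c = np.polyfit(x_fit, fd, grad, rcond=None, cov=True)
    return q, c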
Very cool, thanks for investigating and fixing this.
# pulls have roughly unit variance, however,
# the pull distribution is not gaussian and looks
# more like student's t
rei = c[-1, -1] ** 0.5
I am not following the original recipe, because the Student's t distribution is not exact for this case, since it assumes Gaussian fluctuations for the residuals. The residuals in this case come from both round-off error and truncation error, which are not Gaussian; their probability distribution is not well defined.
In light of this ambiguity, I chose an empirical approach and tried out several examples. I could not find an estimate of the error that is generally satisfactory, so I chose the simplest possible calculation (which is also the fastest) and just return the standard error reported by the fit. The error is more an order-of-magnitude estimate than an interval with a well-defined statistical coverage of X%.
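In isolation, the quantity being returned looks like this (a toy example, not the _jacobi.py internals; it relies only on np.polyfit ordering coefficients from highest power to the constant term):

# Toy illustration of the standard error being returned: q[-1] is the
# extrapolation of the derivative estimates to h -> 0, and c[-1, -1] is
# the variance that the least-squares fit assigns to it.
import numpy as np

h = np.array([0.4, 0.2, 0.1, 0.05, 0.025, 0.0125])
fd = (np.exp(1.0 + h) - np.exp(1.0)) / h   # forward-difference estimates of exp'(1)
q, c = np.polyfit(h, fd, 2, rcond=None, cov=True)
derivative = q[-1]                         # extrapolated derivative at h = 0
standard_error = c[-1, -1] ** 0.5          # the "rei" quantity in the excerpt
print(derivative, standard_error)          # close to e = 2.718..., with a small reported error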
OK. Yeah, I thought the multivariate t idea might have been practical, but I prefer not to abuse statistics either. Just checking.
@mdhaber Sorry for the silence. I still would very much like to move this to SciPy. I can only work on this in my free time, and I am currently handling another private project. If someone (maybe you) is interested in helping me with this, I would be happy to work with them. That would be a way to move this forward.
Great. Yes, I can work on it with you. How about this: I want/need derivatives of scalar-valued functions soon for my work on scipy.stats, so I will finish up mdhaber/scipy#106 and then open a PR on the main SciPy repo. It's going to be a private function for now, so we don't need to worry about backwards compatibility if we want to change things later. Then, as time permits, we can work on incorporating some of the features you have here, and compare the merits of the two approaches.
I know you already have these features here, so the other approach would be to mostly copy what you have. However, I have a pretty different initial set of requirements, so it made sense for me to start fresh. The other benefit is that by working on my own version, I get more familiarity with the subject. After that, we'll both have experience, and then we can combine the best of both to put together a public function. It will be good to finally have numerical differentiation in SciPy - it's been discussed for decades!
Sounds like a good plan; I am happy to work with you on this. I am also OK with your idea of starting again from scratch. I had a quick look at mdhaber/scipy#106 and I saw that you calculate the weights for different orders (I think). I am convinced that with the DERIVEST approach, it makes no sense to calculate anything that's higher order than the central derivative, since you effectively get the benefits of that from the polynomial extrapolation.
Yes.
Maybe. You might still need/want to calculate weights for a one-sided stencil to save function evaluations; see #1 (comment).
I didn't use the polynomial extrapolation. IIUC, if you were using it with a polynomial of exactly the right degree, these approaches must be equivalent if they are both producing approximations of the same order. The downside of exactly matching the degree to the number of points is the noise sensitivity. But again,
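As an aside on computing stencil weights generically: the Taylor-series conditions can be solved directly for any set of offsets (a sketch only, not the code in mdhaber/scipy#106):

# Generic finite-difference weights for an arbitrary stencil, obtained by
# solving the Taylor-series conditions. Sketch only -- not the actual
# implementation in mdhaber/scipy#106.
import numpy as np
from math import factorial

def fd_weights(offsets, deriv=1):
    """Weights w such that sum(w * f(x + offsets*h)) / h**deriv ~ f^(deriv)(x)."""
    offsets = np.asarray(offsets, dtype=float)
    n = len(offsets)
    powers = np.vander(offsets, n, increasing=True).T   # powers[k, j] = offsets[j]**k
    factorials = np.array([factorial(k) for k in range(n)], dtype=float)
    A = powers / factorials[:, None]      # row k: coefficient of f^(k)(x)
    b = np.zeros(n)
    b[deriv] = 1.0                        # keep only the requested derivative
    return np.linalg.solve(A, b)

# The one-sided three-point stencil reproduces the formula quoted earlier:
print(fd_weights([0, 1, 2]))   # [-1.5, 2.0, -0.5] == (-3*f0 + 4*f1 - f2) / 2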
The nice thing about having both versions is that we can compare the performance differences! One set of benchmark functions is the CDFs of all of SciPy's distributions, because we can compare the results against the PDFs. Do you know of any other existing benchmark set?
Update: I used the SciPy distribution CDFs as a benchmark.
Code:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from scipy.optimize import _zeros_py as zeros
from time import perf_counter
# import added for completeness; assumes the jacobi package is installed
from jacobi import jacobi

times1 = []
times2 = []
acc1 = []
acc2 = []
err_ratio1 = []
err_ratio2 = []
nits = []
rtol = np.sqrt(np.finfo(np.float64).eps)
for case in stats._distr_params.distcont:
    distname, params = case
    print(distname)
    dist = getattr(stats, distname)(*params)
    x = dist.median() + 0.1
    # x = dist.ppf(rng.random(size=100))
    ref = dist.pdf(x)
    tic = perf_counter()
    res = zeros._differentiate(dist.cdf, x, rtol=rtol)
    toc = perf_counter()
    nits.append(res.nit)
    true_error = abs(res.df-ref)/ref
    times1.append(toc-tic)
    acc1.append(true_error)
    err_ratio1.append(res.error/true_error)
    tic = perf_counter()
    res, error = jacobi(dist.cdf, x, diagonal=True, rtol=rtol)
    toc = perf_counter()
    true_error = abs(res-ref)/ref
    times2.append(toc-tic)
    acc2.append(true_error)
    err_ratio2.append(error/true_error)
plt.hist(np.log10(times1), bins=np.linspace(-4, 0, 25), label='differentiate', alpha=0.5)
plt.hist(np.log10(times2), bins=np.linspace(-4, 0, 25), label='jacobi', alpha=0.5)
plt.legend()
plt.title('log10 of execution time (s)')
plt.show()
# acc1 = np.asarray(acc1)
# acc2 = np.asarray(acc2)
# acc1[acc1 == 0] = 1e-16
# acc2[acc2 == 0] = 1e-16
# plt.hist(np.log10(acc1), label='differentiate', alpha=0.5, bins=np.linspace(-16, -2, 25))
# plt.hist(np.log10(acc2), label='jacobi', alpha=0.5, bins=np.linspace(-16, -2, 25))
# plt.legend()
# plt.title('log10 of true error')
# plt.show()
# err_ratio1 = np.asarray(err_ratio1)
# err_ratio2 = np.asarray(err_ratio2)
#
# err_ratio1 = err_ratio1[np.isfinite(err_ratio1)]
# err_ratio2 = err_ratio2[np.isfinite(err_ratio2)]
# plt.hist(np.log10(err_ratio1), label='differentiate', alpha=0.5, bins=np.linspace(-5, 5, 25))
# plt.hist(np.log10(err_ratio2), label='jacobi', alpha=0.5, bins=np.linspace(-5, 5, 25))
# plt.legend()
# plt.title('log10 of error ratio')
# plt.show()
Results:
With relative tolerance set at the square root of floating point epsilon (~1.5e-8):
Same, but with relative error tolerance at 1e-12:
Same, but taking the derivative at 100 points in a single vectorized call.
Not too much difference, really!