
Thoughts on _jacobi.py #1

Closed (wants to merge 1 commit)

Conversation

@mdhaber (Owner) commented Jun 23, 2023

No description provided.

@mdhaber (Owner, Author) left a comment

@HDembinski Last year, you opened scipy/scipy#17059. I'd be interested in moving something like that forward, although I think I'd like to work it into the framework of mdhaber/scipy#106. Here are a few thoughts I had about _jacobi.py.

If you don't have time to pursue this right now or are no longer interested, that's fine, but if you have a moment, I was hoping you could answer two questions about lines 157/158 (the even powers of h used in np.polyfit) and line 164 (no appearance of Student's t distribution). Images are from D'Errico's write-up. Thanks for your help!

src/jacobi/_jacobi.py
maxiter : int, optional
    Maximum number of iterations of the algorithm.
maxgrad : int, optional
    Maximum grad of the extrapolation polynomial.
@mdhaber (Owner, Author)

Consider expressing this in terms of the "degree" of the polynomial. Perhaps "grad" refers to the highest derivative that is nonzero?

@HDembinski Jun 26, 2023

Yes, that should be "degree". I don't know why I called it maxgrad; usually I try to follow established names used in established libraries. That may be a bit of German-language influence: in German, "degree" is "Grad".

@mdhaber (Owner, Author)

Ah, makes sense.

return (-3 * f0 + 4 * f1 - f2) * (0.5 / h)


def _first(method, f0, f, x, i, h, args):
@mdhaber (Owner, Author)

It looks like the purpose of this function is to detect whether NaNs are encountered and, if so, choose the appropriate one-sided formula. If this were to be added to SciPy, I'd suggest that this function just determine which formula should be used, then call _derive (_differentiate?), possibly passing the values at which f has already been evaluated (to eliminate the duplicate calculation). The purpose is to avoid repeating the formulas.
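Roughly what I have in mind; this is only a sketch with hypothetical names and simplified signatures (the real _first and _derive in _jacobi.py take more arguments):

import numpy as np

# Sketch: _first only detects NaNs and picks the method; the difference
# formulas live in one shared routine (a hypothetical _estimate here),
# so they are not repeated.
def _estimate(method, f0, f, x, h):
    if method == 0:                      # central difference
        return (f(x + h) - f(x - h)) * (0.5 / h)
    h = h if method == 1 else -h         # forward (+h) or backward (-h)
    f1, f2 = f(x + h), f(x + 2 * h)
    return (-3 * f0 + 4 * f1 - f2) * (0.5 / h)

def _first(method, f0, f, x, h):
    if method is None:
        fp, fm = f(x + h), f(x - h)      # probe both sides once
        if np.any(np.isnan(fm)):
            method = 1                   # backward side failed -> forward
        elif np.any(np.isnan(fp)):
            method = -1                  # forward side failed -> backward
        else:
            method = 0                   # central differences
        # fp and fm could be passed to _estimate to avoid re-evaluating f
    return method, _estimate(method, f0, f, x, h)

def f(t):
    return np.sqrt(t) if t >= 0 else np.nan   # undefined left of 0

method, _ = _first(None, f(0.0), f, 0.0, 1e-3)
print(method)   # 1: the forward formula is selected at the boundary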

@HDembinski

I agree that sounds like a cleaner design, but I remember I chose this design to avoid a duplicate calculation.

@mdhaber (Owner, Author)

I figured. I'd just suggest that there are other ways to avoid the duplicate calculation. I will probably work on that in my PR and then I'll have something more concrete to say.

else:
    method = 0
if method == 0:
    return method, None, (f1 - f2) * norm
@mdhaber (Owner, Author)

Suggested change:
- return method, None, (f1 - f2) * norm
+ return method, None, (f2 - f1) * norm

I think this produces the negative of the derivative estimate as-is.

@HDembinski Jun 26, 2023

No, this should be correct. I always compute f1 - f2 and h = (x1 - x2) * factor consistently everywhere, so h and norm should also be negative for x1 < x2. I hope that one of the unit tests breaks if you make that change.

@mdhaber (Owner, Author) Jun 26, 2023

I'd suggest taking a look; there are only a few lines to follow, starting at line 121 and then jumping into this function. I noticed the numerical effect before finding the cause here: the derivative estimate from _first changes sign depending on whether it's doing central or one-sided differences.

I haven't run the tests because I didn't install this properly (just copy-pasted), but I wouldn't be surprised if they didn't catch it. (Unless you have tests that confirm that polynomials only require the expected number of iterations.) It only affects the first estimate because every other iteration uses _derive. It could affect the number of iterations, but has little effect on the final result.

src/jacobi/_jacobi.py
start = i - (grad + 1)
stop = i + 1
q, c = np.polyfit(
    h[start:stop] ** 2, fd[start:], grad, rcond=None, cov=True
)
@mdhaber (Owner, Author)

Re: [screenshot from D'Errico's write-up on fitting only even powers of h]

I can see why the odd power terms are not included when the derivative is computed with central differences. Is that still true when the derivative is estimated with one-sided differences?

@HDembinski Jun 26, 2023

That's a good point. I think you are right and this should only be done for the central derivatives. Forward and backward derivatives should use a polynomial in h instead of h^2.

This may explain why the performance of the forward/backward derivatives was always worse than the central derivative. I don't see a fundamental reason for that in this scheme.

@mdhaber (Owner, Author)

Well, I'd think that they will require more iterations because you have twice as many error terms to cancel to get to a given order. But yes, I think that adjusting this will help.

@HDembinski

Yes, but on the other hand in each iteration you do half as many calculations, assuming that f0 is computed only once in the beginning. It should roughly cancel out.

@mdhaber (Owner, Author) Jun 26, 2023

Ah, theoretically that is possible, yes. I hadn't noticed that when h is reduced, two of the three one-sided points can be the same as in the previous iteration, so you only need to evaluate the function once per iteration with one-sided differences. Nice.

I don't think the code currently takes advantage of that, though; it performs two function evaluations in _derive whether using a one-sided or two-sided stencil. This would work if you use f(x0), f(x0 + h), f(x0 + h/factor) as the evaluation points. For example, with factor=0.5, you could use f(x0), f(x0 + h), f(x0 + 2*h) so that you only need to evaluate the function at one new point per step size.

Currently, though, you use f(x0), f(x0 + h), and f(x0 + 2h) with a default factor of 1/4. You would need to adjust the weights based on the factor if that is user-adjustable.
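To make the reuse bookkeeping concrete, here's a tiny standalone check (not taken from the package) counting how many new evaluation points the one-sided stencil {x0, x0 + h, x0 + 2h} needs per iteration for step factors 1/2 and 1/4:

x0, h0 = 1.0, 0.1

def new_points_per_iteration(factor, iters=4):
    seen = {x0}
    counts = []
    h = h0
    for _ in range(iters):
        points = {x0, x0 + h, x0 + 2 * h}
        counts.append(len(points - seen))   # evaluations not done before
        seen |= points
        h *= factor
    return counts

print(new_points_per_iteration(0.5))   # [2, 1, 1, 1]
print(new_points_per_iteration(0.25))  # [2, 2, 2, 2]

With factor 1/2, the new x0 + h coincides with the previous iteration's x0 + 2h, so only one fresh evaluation is needed per step size.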

@HDembinski

Ah, I forgot about that. I have to check, but then the h^2 powers in the polynomial fit may be correct for the forward/backward formula, since it is second order like the central difference formula. At least that was the idea I had there.

@mdhaber (Owner, Author) Jun 28, 2023

No, I don't think the order of the formula you use controls the power on h. It's a matter of cancelling out error terms. With central difference, there are only even-powered error terms, so you can use h**2. But for one-sided difference, all powers of h remain and need to be cancelled.

When these things are adjusted, each new function evaluation will reduce the order of the error by one whether you are using one-sided or central difference.
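For reference, the expansions behind this, using the one-sided formula from the code above:

$\frac{f(x+h)-f(x-h)}{2h} = f'(x) + \frac{h^2}{6}f'''(x) + \frac{h^4}{120}f^{(5)}(x) + \cdots$

$\frac{-3f(x)+4f(x+h)-f(x+2h)}{2h} = f'(x) - \frac{h^2}{3}f'''(x) - \frac{h^3}{4}f^{(4)}(x) - \cdots$

The central estimate is a series in $h^2$, while the one-sided estimate contains every power of $h$ from $h^2$ on, so extrapolating the one-sided estimates cancels one power of $h$ at a time, while extrapolating the central estimates cancels one power of $h^2$ at a time.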

@mdhaber (Owner, Author) Jun 29, 2023

I'll be doing the math on this today.

On one hand, mentally if I do Richardson extrapolation on two second order approximations that have both odd and even order error terms, I can only hope to solve for coefficients that satisfy two criteria: the first derivative itself remains, and the second order error term is cancelled. Symbolically, this would appear to leave the third order error terms if I know nothing about the coefficients.

On the other hand, I know that if I have five function evaluations, I can solve for coefficients that will give me a fourth order approximation of the derivative.

So there is definitely a way to use the n+1 function evaluations to get an nth order approximation. I don't think you can get that in general by extrapolating from n/2 second order approximations because in forming the second order approximations, you've locked some degrees of freedom you could have used to cancel out extra error terms.

For instance, suppose you have evaluated $f(x)$, $f(x+h/4)$, $f(x+h/2)$, $f(x+h)$, and $f(x+2h)$. (This is what you're currently doing with a step factor of 1/4). You can get a 4th order formula by choosing five weights $c_i$.

$c_0 f(x) + c_1 f(x+h/4) + c_2 f(x+h/2) + c_3 f(x+h) + c_4 f(x+2h) = hf'(x) + h\,O(h^4)$

to satisfy five constraints: canceling out all the terms of the Taylor series expansion through $f^{(4)}(x)$ except the one with $f'(x)$.

Instead, you choose three weights $d_i$ to get a second-order approximation:

$d_0 f(x) + d_1 f(x+h) + d_2 f(x+2h) = hf'(x) + h\,O(h^2)$

and you get another second-order approximation like:

$d_0 f(x) + d_1 f(x+h/4) + d_2 f(x+h/2) = \frac{h}{4}f'(x) + \frac{h}{4}O(h^2)$

You get to choose two more coefficients, $w_1$ and $w_2$, to weight your two lower-order approximations to get a higher-order approximation, so you can only satisfy two new constraints. You have to preserve the first derivative, so you can only cancel one higher-order term, leaving you with a third-order approximation. Only if you're lucky would you happen to cancel an extra term and get a fourth-order approximation, and I don't think you're lucky with this stencil and step factor. We could probably choose the step factor to allow for the cancellation of the extra error term, but then there's no guarantee that two steps of the same ratio will let you get a sixth-order approximation and so on.
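Here's a quick numerical check of that claim; it's a sketch with plain numpy, not anything from _jacobi.py. It solves the Taylor-matching conditions for five weights on the $\{0, h/4, h/2, h, 2h\}$ stencil and confirms the error of the resulting one-sided formula scales like $h^4$:

import numpy as np

# offsets of the evaluation points relative to x, in units of h
t = np.array([0.0, 0.25, 0.5, 1.0, 2.0])

# Taylor-matching conditions: sum_i c_i * t_i**k = (1 if k == 1 else 0)
# for k = 0..4, so that sum_i c_i f(x + t_i h) = h f'(x) + O(h**5).
A = np.vander(t, increasing=True).T   # A[k, i] = t_i**k
b = np.zeros(len(t))
b[1] = 1.0
c = np.linalg.solve(A, b)

# check on f = exp at x = 1: halving h should shrink the error about 16x
f, x = np.exp, 1.0
for h in (1e-1, 5e-2, 2.5e-2):
    approx = c @ f(x + t * h) / h
    print(h, abs(approx - f(x)))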

@mdhaber (Owner, Author) Jun 29, 2023

OK, here's something a bit more concrete. I modified the Jacobi code slightly, and here's some code that allows us to plot the error as a function of the initial step size while we play with the initial step size and maxgrad (while maxiter stays fixed):

import numpy as np
import matplotlib.pyplot as plt
from jacobi import jacobi

f = np.exp
df = np.exp
x = 1
df_ref = df(x)
maxiter = 5
maxgrad = 0

hs = np.logspace(np.log10(1), np.log10(5), 30)
err_central = []
err_oneside = []
for h in hs:
    step = (h, 0.25)

    df_approx, _ = jacobi(f, x, maxiter=maxiter, step=step, diagonal=True,
                          method=0, maxgrad=maxgrad)
    err_central.append(abs(df_approx - df_ref))

    df_approx, _ = jacobi(f, x, maxiter=maxiter, step=step, diagonal=True,
                          method=1, maxgrad=maxgrad)
    err_oneside.append(abs(df_approx - df_ref))

err_central = np.array(err_central)
err_oneside = np.array(err_oneside)

order_central = np.polyfit(np.log10(hs), np.log10(err_central), 1)[0]
order_oneside = np.polyfit(np.log10(hs), np.log10(err_oneside), 1)[0]
print(f'Central difference convergence order: {order_central}')
print(f'One-sided difference convergence order: {order_oneside}')

plt.loglog(hs, err_central, label=f'Central difference (order: {order_central})')
plt.loglog(hs, err_oneside, label=f'One-sided difference (order: {order_oneside})')
plt.legend()
plt.xlabel('Initial step size')
plt.ylabel('Error')
plt.show()

When maxgrad=0, we would expect both central-difference and one-sided difference to give us 2nd-order convergence.

[plot: error vs. initial step size, maxgrad=0]

Great! When maxgrad=1, we would hope that both central-difference and one-sided difference result in a fourth-order estimate, but:
[plot: error vs. initial step size, maxgrad=1]

maxgrad=2: [plot of error vs. initial step size]

maxgrad=3: [plot of error vs. initial step size]

If you adjust np.polyfit to look for a fit in h for the one-sided method instead of h**2, you can start to see convergence of order maxgrad + 1. I think this makes sense, because maxgrad is the degree of the polynomial, and that would correspond with the highest order error term you can cancel out. (For central difference, you get order 2*(maxgrad+1), which also makes sense for the same reason.)
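For anyone who wants to see the effect in isolation, here's a self-contained toy version of the extrapolation step (plain forward differences plus np.polyfit; it is not the actual _jacobi.py code):

import numpy as np

f, df, x = np.exp, np.exp, 1.0
hs = 0.1 * 0.5 ** np.arange(4)                        # shrinking step sizes
fd = np.array([(f(x + h) - f(x)) / h for h in hs])    # forward differences

# extrapolate the estimates to zero step size with a degree-3 polynomial,
# once as a function of h**2 (as for central differences) and once of h
for label, arg in [('fit in h**2', hs**2), ('fit in h', hs)]:
    p = np.polyfit(arg, fd, deg=len(hs) - 1)
    print(label, abs(np.polyval(p, 0.0) - df(x)))

The fit in h should land several orders of magnitude closer to the true derivative, consistent with the maxgrad + 1 behavior described above.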

@HDembinski

Very cool, thanks for investigating and fixing this.

Comment on lines +161 to +164
# pulls have roughly unit variance, however,
# the pull distribution is not gaussian and looks
# more like student's t
rei = c[-1, -1] ** 0.5
@mdhaber (Owner, Author)

It seems like you're using this idea:

[screenshot from D'Errico's write-up describing the Student's-t-based confidence interval]

so I was expecting the use of Student's t distribution, but I don't see it. Is the idea that you're just reporting the standard error instead of the size of a confidence interval?

@HDembinski Jun 26, 2023

I am not following the original recipe because the Student's t distribution is not exact for this case: it assumes Gaussian fluctuations for the residuals. The residuals here come from both round-off error and truncation error, which are not Gaussian. Their probability distribution is not well-defined.

In light of this ambiguity, I chose an empirical approach and tried out several examples. I could not find an estimate for the error which is generally satisfactory, so I chose the simplest possible calculation (which is also the fastest) and just return the standard error reported by the fit. The error is more an order-of-magnitude estimate than an interval with a well-defined statistical coverage of X%.
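For reference (a gloss on what this means in terms of the code shown above, not part of the original comment): the derivative estimate is the constant coefficient of the polynomial fit, i.e. the fitted polynomial evaluated at zero step size, and rei = c[-1, -1] ** 0.5 is the standard error that np.polyfit's covariance matrix assigns to that coefficient. A confidence interval in the spirit of the write-up would instead be roughly $\hat{f}'(x) \pm t_{1-\alpha/2,\,\nu} \cdot \mathtt{rei}$, with $\nu$ presumably the number of fitted step sizes minus $(\mathrm{grad}+1)$; that extra step is what relies on the Gaussian-residual assumption being avoided here.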

@mdhaber (Owner, Author) Jun 26, 2023

OK. Yeah, I thought the multivariate t idea might have been practical, but I also prefer not to abuse statistics. Just checking.

@mdhaber changed the title from "Update _jacobi.py" to "Thoughts on _jacobi.py" on Jun 23, 2023
@HDembinski commented Jun 26, 2023

@mdhaber Sorry for the silence. I still would very much like to move this to SciPy. I can only work on this in my free time, and I am currently handling another private project. If someone (maybe you) is interested in helping me with this, I would be happy to work with them; that would be a way to move this forward.

@mdhaber (Owner, Author) commented Jun 26, 2023

Great. Yes, I can work on it with you. How about this.

I want/need derivatives of scalar-valued functions soon for my work on scipy.stats, so I will finish up mdhaber/scipy#106 and then open a PR on the main SciPy repo. It's going to be a private function for now, so we don't need to worry about backwards compatibility if we want to change things later.

Then, as time permits, we can work on incorporating some of the features you have here. For instance, we can compare the merits of the np.polyfit approach to what I have and switch, if desired. And of course, we would want to add support for multivariate, vector-valued functions for it to be more broadly useful.

I know you already have these features here, so the other approach would be to mostly copy what you have. However, I have a pretty different initial set of requirements, so it made sense for me to start fresh. The other benefit is that by working on my own version, I get more familiarity with the subject. After that, we'll both have experience, and then we can combine the best of both to put together a public function. It will be good to finally have numerical differentiation in SciPy - it's been discussed for decades!

@HDembinski commented Jun 26, 2023

Sounds like a good plan; I am happy to work with you on this. I am also OK with your idea of starting again from scratch.

I had a quick look at mdhaber/scipy#106 and I saw that you calculate the weights for different orders (I think). I am convinced that with the DERIVEST approach, it makes no sense to calculate anything that's higher order than the central derivative, since you get the benefits of that effectively from the polynomial extrapolation.

@mdhaber (Owner, Author) commented Jun 26, 2023

> I saw that you calculate the weights for different orders (I think)

Yes.

> I am convinced that with the DERIVEST approach, it makes no sense to calculate anything that's higher order than the central derivative

Maybe. You might still need/want to calculate weights for a one-sided stencil to save function evaluations; see #1 (comment).

> since you get the benefits of that effectively from the polynomial extrapolation.

I didn't use polyfit, though. A call to solve is faster than a call to polyfit, and it only has to be done once per import (assuming the same sequence of step sizes) because it's cached. So after that is done, this really replaces a call to polyfit with a dot product (@). Considering that this is for use on an elementwise function and we can reuse the weights, it's bound to be faster for the intended use case.
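Roughly the shape of what I mean, as a sketch (this is not the code in mdhaber/scipy#106): solve for the weights once per offset sequence, cache the result, and apply it with a dot product on each call.

from functools import lru_cache
import numpy as np

@lru_cache
def _fd_weights(offsets):
    # same Taylor-matching system as in the earlier sketch:
    # sum_i w_i * t_i**k = (1 if k == 1 else 0)
    t = np.array(offsets)
    A = np.vander(t, increasing=True).T
    b = np.zeros(len(t))
    b[1] = 1.0
    return np.linalg.solve(A, b)

def derivative(f, x, h, offsets=(0.0, 0.25, 0.5, 1.0, 2.0)):
    w = _fd_weights(offsets)                      # solved once, then cached
    return w @ f(x + h * np.array(offsets)) / h   # a dot product per call

print(abs(derivative(np.exp, 1.0, 1e-2) - np.exp(1.0)))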

IIUC, if you were using a polynomial of exactly the right degree, these approaches must be equivalent if they are both producing approximations of the same order. The downside of exactly matching the degree to the number of points is noise sensitivity. But again,

> we can compare the merits of the np.polyfit approach to what I have and switch, if desired

The nice thing about having both versions is that we can compare the performance differences! One set of benchmark functions is the CDFs of all of SciPy's distributions because we can compare the results against the PDFs. Do you know of any other existing benchmark set?

Update: I used the SciPy distribution CDFs as a benchmark

Code:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from scipy.optimize import _zeros_py as zeros
from time import perf_counter
from jacobi import jacobi

times1 = []
times2 = []
acc1 = []
acc2 = []
err_ratio1 = []
err_ratio2 = []
nits = []
rtol = np.sqrt(np.finfo(np.float64).eps)

for case in stats._distr_params.distcont:
    distname, params = case
    print(distname)
    dist = getattr(stats, distname)(*params)
    x = dist.median() + 0.1
    # x = dist.ppf(rng.random(size=100))
    ref = dist.pdf(x)

    tic = perf_counter()
    res = zeros._differentiate(dist.cdf, x, rtol=rtol)
    toc = perf_counter()

    nits.append(res.nit)

    true_error = abs(res.df-ref)/ref
    times1.append(toc-tic)
    acc1.append(true_error)
    err_ratio1.append(res.error/true_error)

    tic = perf_counter()
    res, error = jacobi(dist.cdf, x, diagonal=True, rtol=rtol)
    toc = perf_counter()

    true_error = abs(res-ref)/ref
    times2.append(toc-tic)
    acc2.append(true_error)
    err_ratio2.append(error/true_error)

plt.hist(np.log10(times1), bins=np.linspace(-4, 0, 25), label='differentiate', alpha=0.5)
plt.hist(np.log10(times2), bins=np.linspace(-4, 0, 25), label='jacobi', alpha=0.5)
plt.legend()
plt.title('log10 of execution time (s)')
plt.show()

# acc1 = np.asarray(acc1)
# acc2 = np.asarray(acc2)
# acc1[acc1 == 0] = 1e-16
# acc2[acc2 == 0] = 1e-16
# plt.hist(np.log10(acc1), label='differentiate', alpha=0.5, bins=np.linspace(-16, -2, 25))
# plt.hist(np.log10(acc2), label='jacobi', alpha=0.5, bins=np.linspace(-16, -2, 25))
# plt.legend()
# plt.title('log10 of true error')
# plt.show()

# err_ratio1 = np.asarray(err_ratio1)
# err_ratio2 = np.asarray(err_ratio2)
#
# err_ratio1 = err_ratio1[np.isfinite(err_ratio1)]
# err_ratio2 = err_ratio2[np.isfinite(err_ratio2)]
# plt.hist(np.log10(err_ratio1), label='differentiate', alpha=0.5, bins=np.linspace(-5, 5, 25))
# plt.hist(np.log10(err_ratio2), label='jacobi', alpha=0.5, bins=np.linspace(-5, 5, 25))
# plt.legend()
# plt.title('log10 of error ratio')
# plt.show()

Results:

With relative tolerance set at the square root of floating point epsilon (~1e-8), and taking the derivative of a scalar-valued function at a single (scalar) point:

log10 of the execution time (s): [histogram]

log10 of the actual relative error (relative to the true derivative): [histogram]

log10 of the ratio of error estimate to true error (positive means overestimate): [histogram]

Same but with relative error tolerance at 1e-12: [the same three histograms]

Same but taking the derivative at 100 points in a single vectorized call (for jacobi, this means diagonal=True): [histogram of execution times]

Not too much difference really! _differentiate is a tad faster, but it overestimates the error at the loose rtol. Both get about the same accuracy at a given rtol setting, and both have pretty accurate error estimates for tighter rtol.
