In [1]:
import numpy as np

# Estimating the $L_1$ Lipschitz constant of a function

In many problems of interest in optimization and other fields, target functions $f: R^N \to R$ must have bounded slopes, in the sense that for any $x,y \in R^N$ $\exists L_1 >0$ so that

$$||\nabla f(x) - \nabla f(y)|| \leq L_1 ||x-y||,$$

where $||\cdot||$ denotes the usual Euclidean norm in $R^N$; i.e., $||x||^2=\sum_{i=1}^N x_i^2, x\in R^N$. We can call the Euclidean norm by np.linalg.norm(). Such a constant $L_1$ is called the Liptschitz constant of $f$, and we will write $L_1^*$ to denote the supremum or least upper bound of all possible values $L_1$ for which the inequality in question may hold. We need to determine $L_1^*$ to find out what the tightest bound in variations of $\nabla f(\cdot)$ are. 

In this notebook, we numerically find a value for an estimator $\hat{L_1^*}$ using approximations to derivatives in our noisy, black-box setting using a paper and results from More and Wild. Their main result provides an optimal value $h$ in a forward difference approximation to the directional derivative $f', p \in R^N$ with $$f' \approx \frac{f(x+hp)-f(x)}{h}.$$ In fact, More and Wild show that with certain assumptions, using their optimum $h$ value, this approximation is the best, in the sense of minimizing a norm in the function space $\mathcal{L}_1$; importantly the authors assume the presence of additive noise.

### Example 1

Let $f: R^N \to R$ equal the sphere function; i.e., $$f(x; \xi)=\sum_{i=1}^N x_i^2 + \epsilon (\xi),$$ where $\epsilon \sim U[-k,k]$ is stoachastic additive noise with zero mean and bounded variance. Then we have $\nabla f(x)=2x$ and one can show that $L_1^*=2$. This means that the value of $||\nabla f(x) -\nabla f(y)||$ will never be larger than twice the value of $2||x-y||$ which is obvious because $||\nabla f(x) -\nabla f(y)||$ literally equals $2||x-y||$.

In [30]:
k=1e-5

def sphere(x):
    return np.sum(x**2, axis=0) + k*(2*np.random.rand(1) - 1)

F=sphere

print(F(np.array((10,10))))

[199.99999046]


#### Analytically check $L_1$ (just to make sure we aren't doing anything crazy)

In [92]:
# Draw two random points in a large hypercube:

N=10

x_0=100*(2*np.random.rand(N,1) - np.ones((N,1)))
y_0=100*(2*np.random.rand(N,1) - np.ones((N,1)))

d_1=np.linalg.norm(x_0-y_0)
print(d_1)

301.66208860006196


In [93]:
# Now compute the difference in their derivatives
d_2=np.linalg.norm(2*(x_0-y_0))

print(d_2)

603.3241772001239


In [94]:
print(d_2/d_1)

2.0


#### Numerical Approach

We will compute ratios at samples $x_k, x_j \in R^N$ $$\frac{||\hat{\nabla f(x_j)}-\hat{\nabla f(x_k)}||- 2 \epsilon^*}{||x_j - x_k||},j \neq k$$  where $\hat{\nabla f (x_k)}$ denotes a numerical approximation to $\nabla f(x_k)$. Here $\epsilon^* := \sup_{\xi} \epsilon (\xi),$ a technique mentioned in a Callies paper as a slightly-better-than-ad-hoc way of estimating $L_1$: slightly better since we are trying to remove the average effect of the noise. In our example, we have $\epsilon^* \approx \bar{\epsilon} + 3\sigma$, where $\bar{\epsilon}$ is the mean of the noise and $\sigma^2$ is the variance of the noise. We estimate $\sigma$ numerically in another notebook, but here we know the mean and variance of our noise. The mean is 0 and variance is $k^2/3$. Adding three times the standard deviation to the mean is a slightly optimistic upper bound on the average contribution of noise, but should suffice.

In [95]:
std=k/np.sqrt(3)
eps_star=3*std
eps_star

1.7320508075688777e-05

Now we need a small heuristic estimate for $||f''(x_0)||.$ The following is directly adapted from Algorithm 5.1 in More and Wild.

In [96]:
tao_1=100
tao_2=0.1

x_0=100*(2*np.random.rand(N,1) - np.ones((N,1)))

unit_v=np.ones(N)/(N**(1/N))

def Delta(h):
    F_m=F(x_0 - h*unit_v)
    F_0=F(x_0)
    F_p=F(x_0 + h*unit_v)
    return np.array((abs(F_m - 2*F_0 + F_p), F_m, F_0, F_p))

h_a=std**0.25
DD=Delta(h_a)
D_h_a=DD[0][0]
F_m_a=DD[1][0]
F_0_a=DD[2][0]
F_p_a=DD[3][0]

D_h_a

0.030305457428767113

In [97]:
mu_a=D_h_a/h_a**2

LHS_1=abs(F_p_a-F_0_a)
LHS_2=abs(F_m_a-F_0_a)

RHS_1=tao_2*max(abs(F_0_a),abs(F_p_a))
RHS_2=tao_2*max(abs(F_0_a),abs(F_m_a))

if D_h_a/std>=tao_1 and LHS_1<=RHS_1 and LHS_2<=RHS_2:
    mu_f2=mu_a
else:
    h_b=(std/mu_a)**0.25
    mu_f2=Delta(H_b)[0]/h_b**2
    
print(mu_f2)

12.612499362410437


Now we need estimates for the gradients at sample points, which are denoted $\hat{\nabla f (x_k)}$. We do so by evaluating the forward difference for in $i$-th component of a sample $x$ to approximate the $i$-th partial derivate of $f$ at x, $$\frac{\partial f}{\partial x_i}(x)\approx \frac{f(x+ h^* \cdot e_i)-f(x)}{h^*} \quad h^*:=8^{1/4}\left(\frac{\sigma}{\mu_{f''}}\right)^{1/2} \quad \mu_{f''} \approx \max |f''| \quad e_i:=(0,\ldots,0,1,0,\ldots,0).$$ In our case, $\mu_{f''}=2$ since the Hessian of $f$, $\nabla^2 f$ is a $10 \times 10$ diagonal matrix with entries of 2 along the diagonal, and we are using the standard Euclidean norm. Note that the Frobenius norm equals $\sqrt{40}$.

Chen and More show that $h^*$ yields the best estimates to the partial derivates in an $\mathcal{L}_1$ sense.

In [98]:
M=6

x_vals=np.zeros((N,1)) # matrix of the vectors we randomly make

grads=np.zeros((N,1)) # matrix of the gradients we approximates

ratios=np.zeros((M,1)) # vector storing the ratios we are intersted in!

#mu_f2=2
h_star=(8**0.25)*np.sqrt(std/mu_f2)

F=sphere

for j in range(0,M):

    x=100*(2*np.random.rand(N,1) - np.ones((N,1)))
    y=100*(2*np.random.rand(N,1) - np.ones((N,1)))
    x_vals=np.hstack((x_vals,x))
    x_vals=np.hstack((x_vals,y))

    approx_grad_x=np.zeros((N,1))
    for i in range(0,N):
        e = np.zeros((N,1))
        e[i] = 1.0
        approx_grad_x[i] = (F(x + h_star*e) - F(x))/h_star
    
    grads=np.hstack((grads,approx_grad_x))
    
    approx_grad_y=np.zeros((N,1))
    for p in range(0,N):
        e = np.zeros((N,1))
        e[p] = 1.0
        approx_grad_y[p] = (F(y + h_star*e) - F(y))/h_star
    
    grads=np.hstack((grads,approx_grad_y))
    
    diff_1=np.linalg.norm(approx_grad_x - approx_grad_y) - 2*eps_star
    diff_2=np.linalg.norm(x-y)
    r=diff_1/diff_2
    
    ratios[j]=r
    
x_vals=x_vals[:,1:]
grads=grads[:,1:]

In [99]:
average=np.sum(ratios,axis=0)/M

print(average)

[2.00000551]
