ref:

http://patwie.com/2017/01/04/optimization_with_tf.html

https://cs.stackexchange.com/questions/75090/can-deep-learning-be-applied-to-nonlinear-parameter-estimation-problems

https://github.com/Mazecreator/tensorflow-hints/blob/master/maximize/Maximize%20Test.ipynb

In [None]:
import numpy as np
import tensorflow as tf
sess=tf.InteractiveSession()

$$f(x) = \frac{1}{2}x^\top \left(\begin{matrix}9 & 2\\ 2 & 10\end{matrix}\right)x - \left(\begin{matrix}5 \\ 6\end{matrix}\right)^\top x + 42, \quad x\in\mathbb{R}^2$$

#### 1. Steepest Descent (gradient, Hessian, alpha)

$$f(x) = \frac{1}{2}x^\top \left(\begin{matrix}4 & -2\\ -2 & 2\end{matrix}\right)x + \left(\begin{matrix}2 \\ -2\end{matrix}\right)^\top x + 0, \quad x\in\mathbb{R}^2$$

$$f(x) = \frac{1}{2}x^\top \left(\begin{matrix}2 & -0.4\\ -0.4 & 2\end{matrix}\right)x + \left(\begin{matrix}-2.2 \\ 2.2\end{matrix}\right)^\top x + 2.2, \quad x\in\mathbb{R}^2$$

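All of these share the general form

$$f(x)=\frac{1}{2}x^\top Ax + b^\top x + c$$

with a symmetric matrix $A$. The code below uses $A = \left(\begin{matrix}4 & -2\\ -2 & 2\end{matrix}\right)$, $b = \left(\begin{matrix}2 \\ -2\end{matrix}\right)$ and $c = 0$.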

In [None]:
# we just declare all variables:
x = tf.Variable(np.random.rand(2,1), dtype=tf.float32, name="x")
# we already make clear, that we are not going to optimize these variables
b = tf.Variable([[2],[-2]], dtype=tf.float32, trainable=False, name="b")
A = tf.Variable([[4,-2],[-2,2]], dtype=tf.float32, trainable=False, name="A")
sess.run(tf.global_variables_initializer())

There are three ways to get the derivative:

- the elegant way: by hand (I propose doing that)
- the clever way: look it up in the so-called “Matrix Cookbook”
- the modern way: use http://www.tensortree.org



$$\frac{\partial}{\partial x}f(x) = Ax + b$$
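
For readers who take the "by hand" route, here is a short sketch of the derivation. It assumes $A$ is symmetric, which holds for every example matrix in this notebook. Term by term:

$$\frac{\partial}{\partial x}\left(\frac{1}{2}x^\top A x\right) = \frac{1}{2}\left(A + A^\top\right)x = Ax, \qquad \frac{\partial}{\partial x}\left(b^\top x\right) = b, \qquad \frac{\partial}{\partial x}c = 0$$

Adding the three terms gives $Ax + b$. Setting this to zero yields the first-order condition

$$Ax + b = 0 \quad\Longleftrightarrow\quad Ax = -b$$

For a non-symmetric $A$ the first term would stay $\frac{1}{2}(A + A^\top)x$ and the shortcut $Ax$ would no longer be valid.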

This is great, as it lets us solve the minimization problem directly: setting the gradient to zero gives the linear system $Ax = -b$, and because $A$ is symmetric positive definite here, its solution is the unique minimizer. Solving it is straightforward:

In [None]:
# solving Ax = -b for x is as easy as:
xstar = tf.matrix_solve_ls(A, -b)
# now we just print the solution
print( "x=", sess.run(xstar))
objective = 0.5 * tf.matmul(tf.matmul(tf.transpose(xstar), A), xstar) + tf.matmul(tf.transpose(b), xstar) + 0
print( "f(x)=", sess.run(objective))

If you do not trust tf.matrix_solve_ls, you can check the result:

In [None]:
# test A*x + b = 0 ?
sess.run(tf.matmul(A, xstar) + b)
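
As an independent cross-check outside TensorFlow, the same system can be solved with NumPy. This is only a sketch: it assumes the same `A` and `b` as above and uses `np.linalg.solve` in place of `tf.matrix_solve_ls`. It also confirms that `A` is positive definite, which is what makes the stationary point a minimum.

```python
import numpy as np

# Same A and b as in the TensorFlow cells above.
A = np.array([[4.0, -2.0], [-2.0, 2.0]])
b = np.array([[2.0], [-2.0]])

# All eigenvalues positive means A is positive definite,
# so the stationary point is the unique minimum.
eigenvalues = np.linalg.eigvalsh(A)

# Stationary point: A x = -b.
x_star = np.linalg.solve(A, -b)
f_star = (0.5 * x_star.T @ A @ x_star + b.T @ x_star).item()

print("eigenvalues:", eigenvalues)
print("x* =", x_star.ravel())
print("f(x*) =", f_star)
```

Solving the two equations by hand gives $x^\star = (0, 1)^\top$ and $f(x^\star) = -1$, which is what both the TensorFlow and the NumPy versions should report up to rounding.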

A very basic but very powerful iterative method is gradient descent:

$$x \gets x - \eta \cdot \frac{\partial}{\partial x}f(x)$$

The gradient always points in the direction of steepest increase. Hence, if we move our current guess $x$ in the opposite direction $-\frac{\partial}{\partial x}f(x)$ with a tiny step $\eta=0.01$, we should eventually approach a stationary point, which for this convex objective is the minimum.

A short version of this optimization procedure is:

### Known Gradient

In [None]:
# destroy previous session and graph and create new session
tf.reset_default_graph()
sess = tf.InteractiveSession()

# define variables in the problem
x = tf.Variable(np.random.rand(2,1), dtype=tf.float32, name="x")
b = tf.Variable([[2],[-2]], dtype=tf.float32, trainable=False, name="b")
A = tf.Variable([[4,-2],[-2,2]], dtype=tf.float32, trainable=False, name="A")
sess.run(tf.global_variables_initializer())

# define expressions
objective = 0.5 * tf.matmul(tf.matmul(tf.transpose(x), A), x) + tf.matmul(tf.transpose(b), x) + 0
grad = tf.matmul(A, x) + b              # this is new
optimize_op = x.assign(x - 0.01 * grad) # this is new

In [None]:
# start at some random point again
sess.run(x.assign(np.random.rand(2,1)))
# optimize
for _ in range(300):
    sess.run(optimize_op)
print( "x=", sess.run(x))
print( "f(x)=", sess.run(objective))

This looks conspicuously similar to our closed-form solution from the beginning.
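
The match is close rather than exact, and the reason can be made visible with a small NumPy sketch. It repeats the same update rule with a fixed start point (an assumption made here for reproducibility, since the notebook starts from a random point) and compares the remaining error to the contraction factor predicted by the eigenvalues of `A`.

```python
import numpy as np

A = np.array([[4.0, -2.0], [-2.0, 2.0]])
b = np.array([2.0, -2.0])
eta, n_steps = 0.01, 300

x_star = np.linalg.solve(A, -b)      # closed-form minimizer
x = np.array([1.0, 0.0])             # fixed start point (assumption)
initial_error = np.linalg.norm(x - x_star)

for _ in range(n_steps):
    x = x - eta * (A @ x + b)        # same update as optimize_op

final_error = np.linalg.norm(x - x_star)

# Each step scales the error along an eigenvector of A by (1 - eta * lambda),
# so the smallest eigenvalue determines the slowest contraction.
lam_min, lam_max = np.linalg.eigvalsh(A)
slowest_factor = (1.0 - eta * lam_min) ** n_steps

print("error before:", initial_error)
print("error after: ", final_error)
print("slowest contraction factor:", slowest_factor)
print("largest stable step size 2/lam_max:", 2.0 / lam_max)
```

With $\eta = 0.01$ the slow direction shrinks by only about a factor of ten over 300 steps, so a visible gap to the closed-form solution remains. A larger step, as long as it stays below $2/\lambda_{\max}$, would close that gap faster.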


### Unknown Gradient
Let’s assume we have no clue what the derivative looks like. We only assume that the objective function is continuously differentiable. Using reverse-mode automatic differentiation, TensorFlow can compute these gradients for us; see tf.gradients.

In [None]:
grad = tf.gradients(objective, x)[0]   # gradient of the objective w.r.t. x
optimize_op = x.assign(x - 0.01 * grad)

Note that tf.gradients computes all necessary gradients for us. The optimization loop is the same as before:

In [None]:
# start at some random point again
sess.run(x.assign(np.random.rand(2,1)))
# optimize
for _ in range(300):
    sess.run(optimize_op)
print( "x=", sess.run(x))
print( "f(x)=", sess.run(objective))
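
When a gradient comes from an automatic tool, it is reassuring to compare it against an independent estimate. The sketch below uses central finite differences in NumPy. This is not what tf.gradients does internally (that is reverse-mode differentiation); it is merely a cheap sanity check. The helper name `numerical_gradient` and the test point are made up for illustration.

```python
import numpy as np

A = np.array([[4.0, -2.0], [-2.0, 2.0]])
b = np.array([2.0, -2.0])

def objective(x):
    """Quadratic objective 0.5 * x^T A x + b^T x for a 1-D vector x."""
    return 0.5 * x @ A @ x + b @ x

def numerical_gradient(f, x, h=1e-5):
    """Central-difference estimate of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad

x = np.array([0.3, -0.7])                  # arbitrary test point
analytic = A @ x + b                       # gradient derived by hand
numeric = numerical_gradient(objective, x)

print("analytic:", analytic)
print("numeric: ", numeric)
```

For a quadratic objective the central difference is exact up to rounding, so the two vectors should agree to many digits.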

As this entire routine is a pretty typical pattern, TensorFlow provides an easy way to run the optimization:

In [None]:
# define expressions
objective = 0.5 * tf.matmul(tf.matmul(tf.transpose(x), A), x) + tf.matmul(tf.transpose(b), x) + 0
optimize_op = tf.train.GradientDescentOptimizer(0.01).minimize(objective)

In [None]:
# start at some random point again
sess.run(x.assign(np.random.rand(2,1)))
# optimize
for _ in range(300):
    sess.run(optimize_op)
print( sess.run(objective))

print( "xstar=", sess.run(x))
print( "fstar(x)=", sess.run(objective))

So, basically, two lines (define the objective + get the optimization step) are enough to solve such an optimization problem.

The Rosenbrock function is a notoriously ugly objective that is supposed to be a nightmare for numerical optimization:

$$f(x,y)= (1-x)^2 + 100 (y-x^2)^2$$
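
To see why this function is hard, it helps to look at its derivatives. The following NumPy sketch writes out the gradient and Hessian by hand (the function names are chosen here for illustration), checks that the gradient vanishes at $(1, 1)$, and computes the condition number of the Hessian there.

```python
import numpy as np

def rosenbrock(x, y, a=1.0, b=100.0):
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosenbrock_grad(x, y, a=1.0, b=100.0):
    """Hand-derived partial derivatives of the Rosenbrock function."""
    dfdx = -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2)
    dfdy = 2.0 * b * (y - x ** 2)
    return np.array([dfdx, dfdy])

def rosenbrock_hessian(x, y, b=100.0):
    """Hand-derived matrix of second derivatives."""
    dxx = 2.0 - 4.0 * b * (y - x ** 2) + 8.0 * b * x ** 2
    dxy = -4.0 * b * x
    dyy = 2.0 * b
    return np.array([[dxx, dxy], [dxy, dyy]])

grad_at_min = rosenbrock_grad(1.0, 1.0)
condition_number = np.linalg.cond(rosenbrock_hessian(1.0, 1.0))

print("f(1, 1) =", rosenbrock(1.0, 1.0))
print("gradient at (1, 1):", grad_at_min)
print("Hessian condition number at (1, 1):", condition_number)
```

The condition number comes out at roughly 2500, meaning the curvature across the valley is thousands of times larger than along it. A single fixed step size that is safe for the steep direction is therefore painfully small for the flat one.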


TensorFlow ships several heuristics that adapt the step size $\eta$ and try to guess a good step width for each update:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt


def rosenbrock(x, y):
    a, b = 1., 100.
    f = (a - x)**2 + b *(y - x**2 )**2
    x_solution = (a, a*a)
    return  f, x_solution

# just for visualization
xx, yy = np.meshgrid(np.linspace(-1.3, 1.3, 31), np.linspace(-0.9, 1.7, 31))
zz, solution = rosenbrock(xx, yy)

# destroy previous session and graph and create new session
tf.reset_default_graph()
sess = tf.InteractiveSession()

x0 = (-0.5, 0.9)

x = tf.Variable(0, dtype=tf.float64, name="x")
y = tf.Variable(0, dtype=tf.float64, name="y")
objective, _ = rosenbrock(x,y)

optimizer = []
optimizer.append(tf.train.RMSPropOptimizer(0.02).minimize(objective))
optimizer.append(tf.train.GradientDescentOptimizer(0.002).minimize(objective))
optimizer.append(tf.train.AdamOptimizer(0.3).minimize(objective))
optimizer.append(tf.train.MomentumOptimizer(0.002, 0.9).minimize(objective))
optimizer.append(tf.train.AdadeltaOptimizer(0.1).minimize(objective))
optimizer.append(tf.train.AdagradOptimizer(0.1).minimize(objective))


sess.run(tf.global_variables_initializer())

In [None]:
fig, ax = plt.subplots()
ax.contourf(xx, yy, zz, np.logspace(-5, 3, 60), cmap="YlGn_r");
for opt_op in optimizer:
    steps = [x0]
    sess.run([x.assign(x0[0]), y.assign(x0[1])])
    for i in range(100): # change epochs to get close to solution
        sess.run(opt_op)
        steps.append(sess.run([x, y]))

    steps = np.array(steps)
    ax.plot(steps[:,0], steps[:,1])

ax.plot((x0[0]), (x0[1]), 'o', color='y', label = "Starting Point")
ax.plot(solution[0], solution[1], 'o', color='r', label = "Solution Point");
# legend entries must follow the order in which the optimizers were appended
ax.legend(['RMSprop', 'GradDesc', 'Adam', 'Momentum', 'AdaDelta', 'AdaGrad'],
          bbox_to_anchor=(1.4, 0.7));

Apparently, they behave differently. Which optimizer works best depends entirely on your problem. As a rule of thumb, tf.train.AdamOptimizer seems to give good results on many practical problems.
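
To give a feeling for what such a heuristic does, here is a hand-written NumPy sketch of the momentum idea applied to the earlier quadratic problem. It assumes the classic heavy-ball form (accumulate gradients into a velocity, then step along the velocity); TensorFlow's own optimizers may differ in details, so treat this as an illustration and not as a reimplementation. The start point is fixed for reproducibility.

```python
import numpy as np

A = np.array([[4.0, -2.0], [-2.0, 2.0]])
b = np.array([2.0, -2.0])
x_star = np.linalg.solve(A, -b)            # closed-form minimizer

def grad(x):
    return A @ x + b

def plain_gd(x, lr=0.01, steps=300):
    """Plain gradient descent with a fixed step size."""
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

def momentum_gd(x, lr=0.01, momentum=0.9, steps=300):
    """Heavy-ball momentum: step along a running sum of past gradients."""
    velocity = np.zeros_like(x)
    for _ in range(steps):
        velocity = momentum * velocity + grad(x)
        x = x - lr * velocity
    return x

x0 = np.array([1.0, 0.0])                  # fixed start point (assumption)
plain_error = np.linalg.norm(plain_gd(x0) - x_star)
momentum_error = np.linalg.norm(momentum_gd(x0) - x_star)

print("plain gradient descent error:", plain_error)
print("momentum error:              ", momentum_error)
```

With the same learning rate and step count, the momentum variant ends up several orders of magnitude closer to the minimizer on this problem, because the accumulated velocity speeds up progress along the flat direction.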

## New example

In [None]:
# destroy previous session and graph and create new session
tf.reset_default_graph()
sess = tf.InteractiveSession()
#exercise611
def f(x, y):
    return 2*x**2 - 2*x*y + y**2 + 2*x - 2*y

x = tf.Variable(0, dtype=tf.float64, name="x")
y = tf.Variable(0, dtype=tf.float64, name="y")

objective = f(x, y)
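
This polynomial is the first quadratic example from the beginning, only written out in scalar form. A quick NumPy sketch, assuming the `A` and `b` used in the earlier cells, confirms that both forms agree at a few random points:

```python
import numpy as np

A = np.array([[4.0, -2.0], [-2.0, 2.0]])
b = np.array([2.0, -2.0])

def f_polynomial(x, y):
    return 2 * x**2 - 2 * x * y + y**2 + 2 * x - 2 * y

def f_matrix(x, y):
    """Same function written as 0.5 * v^T A v + b^T v."""
    v = np.array([x, y])
    return 0.5 * v @ A @ v + b @ v

rng = np.random.default_rng(0)
points = rng.uniform(-2.0, 2.0, size=(5, 2))
max_difference = max(abs(f_polynomial(x, y) - f_matrix(x, y)) for x, y in points)

print("largest difference between the two forms:", max_difference)
```

Consequently the minimizer is again $(x, y) = (0, 1)$ with objective value $-1$.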

### Using TensorFlow gradients to solve it

In [None]:
# solve via the first-order condition (FOC)
# let TensorFlow compute the gradients
xgrad = tf.gradients(objective, x)[0]   # gradient of the objective w.r.t. x
ygrad = tf.gradients(objective, y)[0]   # gradient of the objective w.r.t. y

In [None]:
## Straightforward gradient descent with 0.1 learning rate
x_optimize_op = x.assign(x - 0.1 * xgrad)
y_optimize_op = y.assign(y - 0.1 * ygrad)

In [None]:

# start at the origin again
sess.run(x.assign(0))
sess.run(y.assign(0))

In [None]:
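# optimize; the two assigns run one after the other, so the y update
# already sees the x value updated in the same iteration
for _ in range(1000):
    sess.run(x_optimize_op)
    sess.run(y_optimize_op)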
print( "x=", sess.run(x))
print( "y=", sess.run(y))
print( "f(x,y)=", sess.run(objective))
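
Because the loop runs the x update and the y update in separate `sess.run` calls, the y step uses the already-updated x. The plain-Python sketch below, with hand-derived partial derivatives and illustrative function names, compares this sequential order with a strictly simultaneous update.

```python
def grad_f(x, y):
    # Partial derivatives of f(x, y) = 2x^2 - 2xy + y^2 + 2x - 2y.
    return 4 * x - 2 * y + 2, -2 * x + 2 * y - 2

def simultaneous(x, y, lr=0.1, steps=1000):
    for _ in range(steps):
        gx, gy = grad_f(x, y)              # both gradients from the same point
        x, y = x - lr * gx, y - lr * gy
    return x, y

def sequential(x, y, lr=0.1, steps=1000):
    for _ in range(steps):
        x = x - lr * grad_f(x, y)[0]       # update x first ...
        y = y - lr * grad_f(x, y)[1]       # ... then y sees the new x
    return x, y

x_sim, y_sim = simultaneous(0.0, 0.0)
x_seq, y_seq = sequential(0.0, 0.0)

print("simultaneous:", x_sim, y_sim)
print("sequential:  ", x_seq, y_seq)
```

For this well-behaved quadratic both orderings settle at the same point, $(0, 1)$, so the distinction does not change the result here. On harder problems the two schemes can follow noticeably different paths.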


#### Now letting TensorFlow handle the gradient descent

In [None]:


%matplotlib inline
import matplotlib.pyplot as plt

#exercise611
def f(x, y):
    return 2*x**2 - 2*x*y + y**2 + 2*x - 2*y

# known minimizer (x, y) of f
solution = [0, 1]

# just for visualization
xx, yy = np.meshgrid(np.linspace(-1.3, 1.3, 31), np.linspace(-0.9, 1.7, 31))
#zz = 2*xx**2 - 2*xx*yy +yy**2 +2*xx-2*yy
zz = f(xx,yy)
# destroy previous session and graph and create new session
tf.reset_default_graph()
sess = tf.InteractiveSession()

x0 = (-0.5, 0.9)

x = tf.Variable(0, dtype=tf.float64, name="x")
y = tf.Variable(0, dtype=tf.float64, name="y")
objective_value = f(x,y)

optimizer = []
optimizer.append(tf.train.RMSPropOptimizer(0.02).minimize(objective_value))
optimizer.append(tf.train.GradientDescentOptimizer(0.002).minimize(objective_value))
optimizer.append(tf.train.AdamOptimizer(0.3).minimize(objective_value))
optimizer.append(tf.train.MomentumOptimizer(0.002, 0.9).minimize(objective_value))
optimizer.append(tf.train.AdadeltaOptimizer(0.1).minimize(objective_value))
optimizer.append(tf.train.AdagradOptimizer(0.1).minimize(objective_value))


sess.run(tf.global_variables_initializer())

In [None]:
fig, ax = plt.subplots()
#ax.contourf(xx, yy, zz, np.logspace(-5, 3, 60), cmap="YlGn_r");
ax.contourf(xx, yy, zz,64, alpha=.75, cmap="YlGn_r");
ax.set_xlim(-1,0.5)
ax.set_ylim(0,1.5)
for opt_op in optimizer:
    steps = [x0]
    sess.run([x.assign(x0[0]), y.assign(x0[1])])
    for i in range(100):
        sess.run(opt_op)
        steps.append(sess.run([x, y]))

    steps = np.array(steps)
    ax.plot(steps[:,0], steps[:,1])

ax.plot((x0[0]), (x0[1]), 'o', color='y')
ax.plot(solution[0], solution[1], 'o', color='r');
# legend entries must follow the order in which the optimizers were appended
ax.legend(['RMSprop', 'GradDesc', 'Adam', 'Momentum', 'AdaDelta', 'AdaGrad'],
          bbox_to_anchor=(1.4, 0.7));