<h1>Optimization Package</h1><br>
<img src="https://img.icons8.com/external-itim2101-lineal-color-itim2101/344/external-professor-life-style-avatar-itim2101-lineal-color-itim2101.png" alt="Instructor" width=50> In this section we look at some higher-level tools for running gradient-based optimization. We will also discuss more advanced versions of gradient descent and some of the theoretical ideas behind each one.

<img src="https://docs.scipy.org/doc/scipy/_static/logo.svg" width=80> <b><font size=20>SciPy</font></b><br>
SciPy is a package for scientific computing. It contains modules for optimization, linear algebra, integration, interpolation, special functions, FFT, signal and image processing, ODE solvers, and other tasks common in science and engineering. SciPy is built on top of NumPy. Some functionality overlaps between the two, but much of it is available only in SciPy.<br>
To install SciPy, use: ```conda install -c anaconda scipy```

In [None]:
import scipy
import numpy as np # loading NumPy to pass arrays to SciPy

In [None]:
# Let's do some examples before jumping to optimization
# SciPy provides a solid implementation of linear algebra routines
from scipy import linalg
a = np.array([[3, 2, 0], [1, -1, 0], [0, 5, 1]])
b = np.array([2, 4, -1])
linalg.inv(a)

In [None]:
linalg.solve(a,b) # The solution to linear system aX=b

In [None]:
# Determinant
linalg.det(a)

In [None]:
# Eigen decomposition
linalg.eig(a)

In [None]:
# Singular value decomposition (SVD)
U, s, Vh = linalg.svd(a)
display(U,  s,  Vh)

<img src="https://img.icons8.com/color/344/light.png" width=70> There is a ton of useful and advanced linear algebra functionality inside `linalg`. Also, `scipy.stats` has very useful statistical tools. We have only seen the tip of the iceberg. When you have time, go and check the documentation.

<img src="https://img.icons8.com/external-itim2101-lineal-color-itim2101/344/external-professor-life-style-avatar-itim2101-lineal-color-itim2101.png" alt="Instructor" width=50> `scipy.optimize` provides the optimization functions. There are only a few functions, but each supports a variety of methods, selected by passing an argument to the function. There are also dedicated methods for minimizing single-variable functions. Let's review those first.

In [None]:
from scipy import optimize

<img src="https://img.icons8.com/external-itim2101-lineal-color-itim2101/344/external-professor-life-style-avatar-itim2101-lineal-color-itim2101.png" alt="Instructor" width=50> For a single-variable function, use `optimize.minimize_scalar`. Its default method is based on <a href="https://en.wikipedia.org/wiki/Brent%27s_method"> Brent's</a> method, which combines golden-section search with parabolic interpolation to locate the minimum.

In [None]:
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy import special # This module has many special mathematical functions such as Bessel, Hankel, Gamma, ...

objective = special.j0 # This is the Bessel 0 function - Check here https://en.wikipedia.org/wiki/Bessel_function

x = np.linspace(0,15, 1000)
y = objective(x)
plt.plot(x,y)

In [None]:
x_opt = optimize.minimize_scalar(objective)
x_opt

In [None]:
type(x_opt)

<img src="https://img.icons8.com/color/344/light.png" width=50> <b>OptimizeResult:</b> <br>
It holds the outcome of the optimization. Depending on the method it can have more or fewer fields, but it generally contains:<br>
<ul>
    <li> <b>x</b>: The optimal point. </li>
    <li><b>fun</b>: The value of the objective at the optimal point.</li>
    <li><b>success</b>: Whether the optimizer converged to an answer or not.</li>
    <li><b> nit</b>: Number of iterations.</li>
    <li><b> nfev</b>: Number of times the objective function was evaluated.</li>
    </ul>
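
As a small illustration of these fields, the sketch below minimizes the Bessel function J0 and reads the result attributes one by one. The three-point bracket `(2, 4, 6)` is an assumption chosen for this example so that the search settles in the first dip of the curve.

```python
from scipy import optimize, special

# Three-point bracket (assumed for this example): j0(4) is lower than j0(2) and j0(6)
res = optimize.minimize_scalar(special.j0, bracket=(2, 4, 6))

print("x       =", res.x)        # location of the minimum
print("fun     =", res.fun)      # objective value at the minimum
print("success =", res.success)  # convergence flag
print("nit     =", res.nit)      # number of iterations
print("nfev    =", res.nfev)     # number of objective evaluations
```

The result object also behaves like a dictionary, so `res["x"]` and `res.keys()` work as well.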

In [None]:
optimize.minimize_scalar(objective, method="Golden") # https://en.wikipedia.org/wiki/Golden-section_search

In [None]:
optimize.minimize_scalar(objective, method="bounded", bounds=(8, 12)) # bounded implementation of Brent’s algorithm.

In [None]:
# The other methods (Brent and Golden) take a bracket instead of bounds.
optimize.minimize_scalar(objective, method="Golden", bracket=(8,12))

In [None]:
# But a bracket is not a hard constraint. Depending on the function, the answer can fall outside the bracket.
optimize.minimize_scalar(objective, method="Brent", bracket=(6,8))

<img src="https://img.icons8.com/external-itim2101-lineal-color-itim2101/344/external-professor-life-style-avatar-itim2101-lineal-color-itim2101.png" alt="Instructor" width=50> <font size=8>Multi Variable Optimization</font><br>
The `optimize.minimize` function implements a variety of optimization methods, selected with the `method` argument. We will review them one by one here. This function works with functions of one or more variables. It also covers constrained optimization, which we will review later.

<h2> Simplex Search</h2>
The first method is a heuristic, gradient-free method called the Nelder-Mead method. It is also known as Simplex Search. A simplex is a shape with n+1 vertices in n dimensions; in 2D, for example, it is a triangle. An initial simplex is chosen at the start (SciPy builds one around `x0` unless you pass `initial_simplex` in `options`), and it evolves with each iteration. Based on the value of the function at the simplex vertices, we always try to replace the worst vertex with a new point, which is computed from the remaining vertices according to a set of rules. These rules are reflection, expansion, contraction, and collapse (also called shrink), and they are depicted in the following picture. Which rule is applied depends on the function values at the simplex vertices. Convergence is checked at each step. One stopping condition is to check whether the standard deviation of the function values at the vertices has dropped below a tolerance.
<center><table><tr><td><img src="images/simplex_operation.jpeg" width=300></td><td><img src="https://upload.wikimedia.org/wikipedia/commons/d/de/Nelder-Mead_Himmelblau.gif" alt="demo" width=400></td><td></td></tr></table></center>
Check the following page for more details
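
The rules above can be sketched in a few lines of NumPy. This is a simplified teaching version, not SciPy's implementation: the function name `nelder_mead_sketch`, the axis-aligned initial simplex with `step=0.5`, the coefficients (1, 2, 0.5, 0.5), and the use of inside contraction only are all assumptions made for illustration.

```python
import numpy as np

def nelder_mead_sketch(f, x0, step=0.5, max_iter=500, tol=1e-10):
    """Simplified Nelder-Mead search (illustrative, hypothetical helper)."""
    x0 = np.asarray(x0, dtype=float)
    # Initial simplex: x0 plus one vertex shifted along each axis (n + 1 vertices)
    simplex = [x0.copy()]
    for i in range(len(x0)):
        vertex = x0.copy()
        vertex[i] += step
        simplex.append(vertex)

    for _ in range(max_iter):
        simplex.sort(key=f)                       # best vertex first, worst last
        values = [f(v) for v in simplex]
        if np.std(values) < tol:                  # function values have levelled off
            break
        worst = simplex[-1]
        centroid = np.mean(simplex[:-1], axis=0)  # centroid of all vertices but the worst
        reflected = centroid + (centroid - worst)
        f_reflected = f(reflected)
        if f_reflected < values[0]:
            # Reflection beat the best vertex: try going further (expansion)
            expanded = centroid + 2.0 * (centroid - worst)
            simplex[-1] = expanded if f(expanded) < f_reflected else reflected
        elif f_reflected < values[-2]:
            # Better than the second-worst vertex: accept the reflection
            simplex[-1] = reflected
        else:
            # Reflection did not help: pull the worst vertex toward the centroid
            contracted = centroid + 0.5 * (worst - centroid)
            if f(contracted) < values[-1]:
                simplex[-1] = contracted
            else:
                # Collapse (shrink) every vertex toward the best one
                best = simplex[0]
                simplex = [best + 0.5 * (v - best) for v in simplex]
    return min(simplex, key=f)

# Himmelblau's function, which has four minima with value 0
himmelblau = lambda x: (x[0]**2 + x[1] - 11)**2 + (x[0] + x[1]**2 - 7)**2
x_best = nelder_mead_sketch(himmelblau, [1.0, 1.0])
print(x_best, himmelblau(x_best))
```

In practice `optimize.minimize(..., method='Nelder-Mead')` is the better choice; the sketch is only meant to make the four rules concrete.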

In [None]:
# Himmelblau's function
objective = lambda x: (x[0]**2 +x[1] -11)**2 + (x[0] +x[1]**2 -7)**2
x0 = np.array([1,1],dtype=np.float32)
optimize.minimize(objective, method='Nelder-Mead', x0=x0)

In [None]:
optimize.minimize(objective, method='Nelder-Mead', x0=x0, options={'maxiter':1000, 'xatol' : 1.E-6})

<h2> Powell Method</h2>
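Powell's method extends one-dimensional minimization, such as Brent's method, to multi-variable functions: it is a gradient-free method that minimizes the function along one direction at a time and updates the set of search directions as it goes.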
<a href="https://en.wikipedia.org/wiki/Powell%27s_method">Check here for more details</a>

In [None]:
optimize.minimize(objective, method='Powell', x0=x0)

<h2> Conjugate Gradient</h2>
As the name suggests, this method is based on the gradient, but it is not the same as gradient descent. CG was originally a method for solving systems of linear equations (<a href="https://en.wikipedia.org/wiki/Conjugate_gradient_method">See here for details</a>). For optimization, you can think of CG as finding a solution of the equation ∇f(x) = 0. Instead of following the raw gradient at each step, the method builds search directions that are conjugate to the previous ones, so progress made along earlier directions is not undone. Check <a href="https://www.youtube.com/watch?v=h4cG8jLGmKg"> this video</a> to learn about the math under the hood of this method. SciPy's `CG` method is a nonlinear conjugate gradient variant for unconstrained problems; on non-convex functions it converges to a local minimum.

In [None]:
optimize.minimize(objective, method="CG", x0=x0)

<img src="https://img.icons8.com/color/344/light.png" width=50> SciPy has a built-in implementation of numerical differentiation, which gradient-based methods fall back on when no gradient is supplied. Check this function for more details: `scipy.optimize.approx_fprime`
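
To make the role of the gradient concrete, here is a minimal sketch that compares a hand-derived gradient of Himmelblau's function with the finite-difference estimate from `approx_fprime`, and then passes the analytic gradient to CG through the `jac` argument. The names `himmelblau` and `himmelblau_grad` and the step size `1e-8` are assumptions made for this example.

```python
import numpy as np
from scipy import optimize

def himmelblau(x):
    return (x[0]**2 + x[1] - 11)**2 + (x[0] + x[1]**2 - 7)**2

def himmelblau_grad(x):
    # Chain rule applied to the two squared terms a**2 + b**2
    a = x[0]**2 + x[1] - 11
    b = x[0] + x[1]**2 - 7
    return np.array([4 * x[0] * a + 2 * b, 2 * a + 4 * x[1] * b])

x0 = np.array([1.0, 1.0])

numeric = optimize.approx_fprime(x0, himmelblau, 1e-8)  # forward differences
analytic = himmelblau_grad(x0)
print(numeric, analytic)

# Supplying jac avoids the extra function evaluations used for finite differences
res = optimize.minimize(himmelblau, x0, method="CG", jac=himmelblau_grad)
print(res.x, res.fun, res.nfev)
```

Checking an analytic gradient against `approx_fprime` before using it is a cheap way to catch sign and indexing mistakes.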

<img src= "https://img.icons8.com/external-flaticons-flat-flat-icons/344/external-question-100-most-used-icons-flaticons-flat-flat-icons.png" alt="Tip" width=60> __Question__:<br>
There is a set of standard functions for comparing optimization algorithms. Check the following Wikipedia page for a list of some of them:<br>
<a href="https://en.wikipedia.org/wiki/Test_functions_for_optimization"> Test functions for optimization</a> <br>
Do the following: <br>
<ul>
    <li>From the above wiki page, find the definition of the <em>Beale function</em>.</li>
    <li> Write a function for it.</li>
    <li> Find its minimum using the methods we have discussed so far. Use a random initialization.</li>
    <li> Compare the accuracy of the methods.</li>
 </ul>

<h1> Second Order Methods:</h1>
Second-order methods use the Hessian matrix in addition to the gradient vector. This category contains Newton and quasi-Newton methods. They are practical when the number of optimization parameters is small. (For example, for a DNN model with 50K parameters, the Hessian is 50K x 50K and has 2.5 billion elements!) Newton's method chooses the step size very efficiently, so it can be fast, but it requires inverting the Hessian matrix. Quasi-Newton methods reduce this cost by building an estimate of the inverse Hessian from gradient information.

<h2>Newton Method</h2>
Newton's method is based on a second-order Taylor approximation of the function: at each step it jumps to the stationary point of that approximation. See the following figure for clarification.

<table><tr><td><img src="images/newton_method.jpeg" width=300></td><td> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  </td><td><img  src="https://wikimedia.org/api/rest_v1/media/math/render/svg/90e32b708ca17d5659fdc482fe3c9f88996361ba"></td></tr></table><br>
For multi-dimensional optimization, the update equation is: <br>
<img src="images/newton_multid.png" width=500> <br>
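The function ```optimize.newton``` implements the Newton-Raphson method for finding a root of a scalar function, so for minimization it has to be applied to the derivative of the objective.
<br>For multi-variable problems, a better option is a Hessian-free implementation called *Truncated Newton*.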
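
Since `optimize.newton` is a root finder, one way to use it for minimization is to search for a root of the derivative. The sketch below does this for the Bessel function J0, using the identities J0'(x) = -J1(x) and J1'(x) = J0(x) - J1(x)/x. The starting point `4.0` and the helper names are assumptions made for this example.

```python
from scipy import optimize, special

# First derivative of J0: its roots are the stationary points of J0
d_objective = lambda x: -special.j1(x)
# Second derivative of J0, from J1'(x) = J0(x) - J1(x)/x
d2_objective = lambda x: -(special.j0(x) - special.j1(x) / x)

# Newton-Raphson on the derivative, starting near the first dip of J0
x_star = optimize.newton(d_objective, x0=4.0, fprime=d2_objective)

print(x_star, special.j0(x_star))
print("second derivative:", d2_objective(x_star))  # positive at a minimum
```

A root of the derivative can also be a maximum or an inflection point, so checking the sign of the second derivative at the result is a sensible last step.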

In [None]:
# Truncated Newton -> Hessian Free
optimize.minimize(objective, method="TNC", x0=x0)


<img src="https://img.icons8.com/color/344/light.png" width=50> There is a combination of the Conjugate Gradient and Newton algorithms called Newton-CG.<br> You have to provide the Jacobian (the gradient of the objective) for this method.<br>
To use this method, run:  ```optimize.minimize(objective, method="Newton-CG", jac = jac, x0=x0)```
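
A minimal sketch of such a call for Himmelblau's function is shown below. The gradient function `himmelblau_grad` is hand-derived for this example and is not part of SciPy.

```python
import numpy as np
from scipy import optimize

def himmelblau(x):
    return (x[0]**2 + x[1] - 11)**2 + (x[0] + x[1]**2 - 7)**2

def himmelblau_grad(x):
    # Gradient of a**2 + b**2 with a, b the two inner expressions
    a = x[0]**2 + x[1] - 11
    b = x[0] + x[1]**2 - 7
    return np.array([4 * x[0] * a + 2 * b, 2 * a + 4 * x[1] * b])

x0 = np.array([1.0, 1.0])

# The gradient is required; the Hessian is optional for this method
res = optimize.minimize(himmelblau, x0, method="Newton-CG", jac=himmelblau_grad)
print(res.x, res.fun)
```

If an exact Hessian is available, it can be passed through the `hess` argument as well.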

<h2>Levenberg–Marquardt</h2>
The LM algorithm is one of the most widely used algorithms for nonlinear least-squares problems (for example, in the context of traditional neural networks). It uses a regularized (damped) version of the Hessian to compute the step. The update rule is:<br>
<img src="images/lm.png" width =400><br>
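For λ → 0, LM approaches the Newton-type (Gauss-Newton) step, and for λ → ∞ it approaches gradient descent with a small step size. To use LM, call <code>optimize.least_squares</code> with <code>method="lm"</code>; the default method is <code>"trf"</code>.<br>
The function expects a vector of residuals rather than a scalar objective, and <code>"lm"</code> needs at least as many residuals as variables.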

In [None]:
# Note: this runs the default "trf" method and treats the scalar objective as a single residual
optimize.least_squares(objective, x0=x0)
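
To actually run Levenberg-Marquardt, the problem can be written as a vector of residuals. Himmelblau's function is a sum of two squares, so the two inner expressions serve as residuals. This is a sketch; the function name `residuals` is an assumption made for this example.

```python
import numpy as np
from scipy import optimize

def residuals(x):
    # Himmelblau's function equals r[0]**2 + r[1]**2
    return np.array([x[0]**2 + x[1] - 11, x[0] + x[1]**2 - 7])

x0 = np.array([1.0, 1.0])

# method="lm" selects Levenberg-Marquardt (two residuals, two variables)
res = optimize.least_squares(residuals, x0, method="lm")

print(res.x)     # point where both residuals vanish
print(res.cost)  # 0.5 * sum of squared residuals at the solution
```

Note that `res.cost` is half the sum of squares, so it is half the value of Himmelblau's function at the solution.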

<h1>BFGS Algorithm</h1>
The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm is one of the most important methods implemented in SciPy. BFGS is a quasi-Newton method for unconstrained nonlinear optimization. Each iteration is cheaper than a Newton iteration (O(n^2) compared to O(n^3)), because it updates an estimate of the inverse Hessian instead of solving with the exact Hessian. Check <a href="https://archive.org/details/practicalmethods0000flet">this</a> for the mathematics of this algorithm.

In [None]:
optimize.minimize(objective, method="BFGS", x0=x0)
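
The inverse-Hessian estimate that BFGS builds along the way is returned with the result. The short sketch below prints it for Himmelblau's function; the variable names are assumptions made for this example.

```python
import numpy as np
from scipy import optimize

himmelblau = lambda x: (x[0]**2 + x[1] - 11)**2 + (x[0] + x[1]**2 - 7)**2
x0 = np.array([1.0, 1.0])

res = optimize.minimize(himmelblau, x0, method="BFGS")

print(res.x, res.fun)
print(res.hess_inv)  # n x n estimate of the inverse Hessian at the solution
```

The estimate is built from gradient differences only, so it is an approximation and need not match the exact inverse Hessian closely.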

<h2>L-BFGS-B Algorithm</h2>
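L-BFGS-B is a limited-memory version of BFGS that also supports bound constraints. When no `method` is given, `minimize` uses BFGS for unconstrained problems, L-BFGS-B when bounds are given, and SLSQP when constraints are given. Check this <a href="https://en.wikipedia.org/wiki/Limited-memory_BFGS"> wiki page</a> for the details.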

In [None]:
optimize.minimize(objective, method="L-BFGS-B", x0=x0)

<img src="https://img.icons8.com/external-flaticons-lineal-color-flat-icons/344/external-coffee-cup-bakery-flaticons-lineal-color-flat-icons.png" alt="Takehome" width=60> <font size=6>Takehome:</font><br>
There are a few more methods and functions that we have not tried here. You can research them on your own.<br>
<ul>
    <li> <code>optimize.minimize(objective, method="COBYLA", x0=x0)</code></li>
    <li> <code>optimize.minimize(objective, method="SLSQP", x0=x0)</code></li>
    <li><code>optimize.dual_annealing</code></li>
    <li><code>optimize.basinhopping</code></li>
    </ul>

<img src="https://img.icons8.com/external-kosonicon-lineal-color-kosonicon/344/external-lab-tool-back-to-school-kosonicon-lineal-color-kosonicon.png" alt="Lab" width=80 > <font size=6>Lab (Elastic Demand):</font><br>
In an elastic demand scenario, the price of a product is a function of supply: more supply lowers the price, and vice versa. See the following figure.
<br><img src="images/ElasticDemand.png" width=400><br>
Let's suppose that a product has the following demand equation:<br>
<img src="images/demand.png" width=250><br>
Also, the cost of building a unit of the product is a function of quantity. Let's say that for our product, the cost has the following equation:<br>
<img src="images/cost.png" width=300><br>
Write a program to answer the following questions:
    <ul>
        <li> At what price is the revenue maximized?</li>
        <li> At what price is the profit maximized?</li>
        <li> At what price is the cost minimized?</li>
    </ul>


<h1> Constrained Optimization</h1>
So far we have studied unconstrained optimization, in which the solution can be any point in the domain of the objective function. In real life, however, we often have constraints, and these limit the choice of optimization algorithm. We can have equality and inequality constraints. The general optimization problem has the following form:<br>
<img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/7b8beab031562d937314a4894ec449189f179219">


<table><tr><td><img src="https://upload.wikimedia.org/wikipedia/commons/4/46/%D0%9B%D0%B0%D0%B3%D1%80%D0%B0%D0%BD%D0%B6.jpg" width=200></td><td><h2> Lagrange Multipliers for Constrained Optimization</h2>
<br> There is a mathematical technique that converts an equality-constrained optimization problem into an unconstrained one. It is attributed to Lagrange. We replace the objective function with a new function called the Lagrangian, which adds each constraint multiplied by a new parameter; these parameters are called Lagrange multipliers. The candidate solutions are then the stationary points of the Lagrangian. <br>
Check this document for more details: <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange_multiplier</a></td></tr></table>
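
As a worked example of the idea, consider minimizing x² + y² subject to x + y = 1. The Lagrangian is L(x, y, λ) = x² + y² + λ(x + y - 1), and setting its partial derivatives to zero gives a linear system in (x, y, λ). The toy problem and the variable names below are assumptions chosen for illustration; the sketch solves the system directly and cross-checks it against SciPy's SLSQP solver.

```python
import numpy as np
from scipy import linalg, optimize

# Stationarity of the Lagrangian:
#   dL/dx   = 2x + lam   = 0
#   dL/dy   = 2y + lam   = 0
#   dL/dlam = x + y - 1  = 0
K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [1.0, 1.0, 0.0]])
rhs = np.array([0.0, 0.0, 1.0])
x, y, lam = linalg.solve(K, rhs)
print(x, y, lam)

# Cross-check with a numerical constrained solver
res = optimize.minimize(
    lambda v: v[0]**2 + v[1]**2,
    x0=[0.0, 0.0],
    method="SLSQP",
    constraints={"type": "eq", "fun": lambda v: v[0] + v[1] - 1},
)
print(res.x)
```

Both routes should agree on the point (0.5, 0.5), the point on the line closest to the origin.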

Many of the methods discussed above support constrained optimization. Let's review SciPy's capabilities for constrained optimization.

In [None]:
# Defining bounds for the solution
x0 = np.array([0,0], dtype=np.float32)
bounds = [(-4, 1), (-4, 1)] 
optimize.minimize(objective, x0=x0, bounds=bounds)

In [None]:
# Check how sensitive it is to initial condition
x0 = np.array([-1,-1], dtype=np.float32) 
bounds = [(-4, 1), (-4, 1)] 
optimize.minimize(objective, x0=x0, bounds=bounds)

<h4> Defining linear constraints</h4>
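We can have linear constraints. SciPy's `optimize.LinearConstraint(A, lb, ub)` encodes `lb <= Ax <= ub`, and setting `lb` equal to `ub` gives an equality constraint. An example of linear equality constraints: <br>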
<img src="images/linear_constraints.png" width = 400>
This can be written as Ax=0 where:<br>
<img src="images/linear_constraints_mat.png" width=300>

In [None]:
x0 = np.array([-1,-1], dtype=np.float32) 
lin_cond = np.array([[1, 1],[2, 1]], dtype=np.float32)
lower_bound = (-4, -4)
upper_bound = (1,1)
lin_const = optimize.LinearConstraint(lin_cond, lower_bound, upper_bound)
optimize.minimize(objective, x0=x0, constraints=lin_const)
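
The cell above uses different lower and upper bounds, so it describes inequality constraints. For an equality constraint, the two bounds are set to the same value. The sketch below forces x[0] + x[1] = 0 for Himmelblau's function; the constraint, the starting point, and the explicit choice of `method="SLSQP"` are assumptions made for this example.

```python
import numpy as np
from scipy import optimize

himmelblau = lambda x: (x[0]**2 + x[1] - 11)**2 + (x[0] + x[1]**2 - 7)**2

# lb == ub turns lb <= A @ x <= ub into the equality x[0] + x[1] = 0
A = np.array([[1.0, 1.0]])
eq_const = optimize.LinearConstraint(A, [0.0], [0.0])

x0 = np.array([1.0, -1.0])  # starting point that already satisfies the constraint
res = optimize.minimize(himmelblau, x0=x0, method="SLSQP", constraints=eq_const)

print(res.x, res.fun)
print("constraint value:", A @ res.x)  # should be close to 0
```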

<h4>Arbitrary Constraint</h4>
You can define a function as a constraint. Inequality constraints can be covered this way too: for <code>"type": "ineq"</code>, the function must be non-negative at the solution.

In [None]:
# Equality constraint: fun(x) must equal 0
cons = {"type":"eq", "fun":lambda x: 4*x[0] + 3*x[1]}
x0 = np.array([-1,-1], dtype=np.float32) 

optimize.minimize(objective, x0=x0, constraints=cons)

In [None]:
# "eq" means fun(x) == 0 and "ineq" means fun(x) >= 0
cons = [{"type":"eq", "fun":lambda x: 4*x[0] + 3*x[1]}, {"type":"ineq", "fun":lambda x:x[0]*x[1]}]
x0 = np.array([-1,-1], dtype=np.float32) 

optimize.minimize(objective, x0=x0, constraints=cons)
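
As one more illustration of a nonlinear inequality constraint, the sketch below keeps the solution inside a disc of radius 2 around the origin. All four minima of Himmelblau's function lie outside that disc, so the constrained optimum is expected on its boundary. The disc constraint, the starting point, and the explicit `method="SLSQP"` are assumptions made for this example.

```python
import numpy as np
from scipy import optimize

himmelblau = lambda x: (x[0]**2 + x[1] - 11)**2 + (x[0] + x[1]**2 - 7)**2

# "ineq" constraints must satisfy fun(x) >= 0, so this encodes x0**2 + x1**2 <= 4
disc = {"type": "ineq", "fun": lambda x: 4.0 - x[0]**2 - x[1]**2}

x0 = np.array([1.0, 1.0])  # feasible starting point inside the disc
res = optimize.minimize(himmelblau, x0=x0, method="SLSQP", constraints=disc)

print(res.x, res.fun)
print("squared radius:", res.x @ res.x)  # should not exceed 4
```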