<h1> Gradient Descent - A Detailed Discussion</h1>
Gradient descent (GD) is one of the most widely used optimization algorithms. When evaluating the objective is expensive, or when there are many parameters to optimize, GD is often the only practical approach. Good implementations of GD for parallel computing also exist. In this chapter, we review the TensorFlow optimizers and study different flavors of GD.

<img src="https://www.gstatic.com/devrel-devsite/prod/vee468f4e10aa470182a016132769d1277f3b792f56b19f433715afc734e9c71d/tensorflow/images/lockup.svg" width=200>  <img src="https://keras.io/img/logo.png" width=150><br>
TensorFlow is a free, open-source library for machine learning. It was released to the public by Google and has a large developer community. TensorFlow is one of the major tools for training and evaluating deep learning (DL) models, and it includes optimizers that are used for training them. Keras is another open-source project that acts as an interface for TensorFlow and makes it easier to use. Keras also supports other platforms. TensorFlow ships with an internal Keras module, but Keras can also be installed separately.  <br>
To install TensorFlow, use the following command: ```conda install -c conda-forge tensorflow```<br>
To install Keras, use this command: ```conda install -c conda-forge keras```<br><br>
In this chapter we focus on ```tensorflow.keras.optimizers```.<br>
Let's have a refresher first.

<h2> A Tensorflow Refresher</h2>

In [None]:
import tensorflow as tf

In [None]:
x = tf.constant([[1., 2.],
                 [3., 4.]])
print("x: ", x)
print("shape: ", x.shape)
print("dtype: ", x.dtype)

In [None]:
# convert it to numpy
x.numpy()

In [None]:
x[0,0]

In [None]:
x[:,0]

In [None]:
x[-1,-1]

In [None]:
# tensor add
x + x

In [None]:
# Scalar tensor multiply
6 * x

In [None]:
# tensor transpose
tf.transpose(x)

In [None]:
# matrix(tensor) multiplication
x @ x

In [None]:
# tensor concat
tf.concat([x, x, x], axis=0)

In [None]:
import numpy as np
tf.sin(np.pi/2), tf.cos(np.pi/2)

In [None]:
tf.reduce_sum(x)

In [None]:
tf.reduce_max(x)

In [None]:
tf.nn.softmax(x, axis=0)

In [None]:
# reshape a tensor
x = tf.constant([[1., 2.], [3., 4.]])
x2= tf.reshape(x, shape=(1,4))
x2

<img src="https://img.icons8.com/color/344/light.png" width=70><font size=4>Ragged Tensors:</font><br>
A tensor with variable numbers of elements along some axis is called "ragged".<br>
<img src="https://www.tensorflow.org/guide/images/tensor/ragged.png" alt="A 2-axis ragged tensor, each row can have a different length.">

In [None]:
ragged_list = [
    [0, 1, 2, 3],
    [4, 5],
    [6, 7, 8],
    [9]]

In [None]:
# This line raises an exception since a regular tensor has to be rectangular
tensor = tf.constant(ragged_list)

In [None]:
# Instead use the ragged tensor
ragged_tensor = tf.ragged.constant(ragged_list)
ragged_tensor

In [None]:
tensor_of_strings = tf.constant(["Alice", "Bob","Mark"])
tensor_of_strings  # Elements print with a b prefix: they are stored as byte strings, not Unicode strings

<img src="https://img.icons8.com/color/344/light.png" width=70><font size=4>Sparse Tensors:</font><br>
If your tensor is sparse (having many zero elements) it is a good idea to store it as a sparse representation rather than dense. <br>
<img src="https://www.tensorflow.org/guide/images/tensor/sparse.png">

In [None]:
sparse_tensor = tf.sparse.SparseTensor(indices=[[0, 0], [1, 2]],
                                       values=[1, 2],
                                       dense_shape=[3, 4])
sparse_tensor

In [None]:
tf.sparse.to_dense(sparse_tensor)

<img src="https://img.icons8.com/external-itim2101-lineal-color-itim2101/344/external-professor-life-style-avatar-itim2101-lineal-color-itim2101.png" alt="Instructor" width=50><font size=4>Variables:</font>TensorFlow has variables. A variable is a wrapper around a tensor. Variables are mutable, whereas tensors are immutable. Moreover, TensorFlow treats variables specially in the computations needed for training a model; for example, a gradient tape watches trainable variables automatically. In the end, a variable is a piece of memory that can be accessed by the CPU or GPU.

In [None]:
var = tf.Variable([[0.0, 0.0, 0.0], [1., 2., 3.]])
print(var)

In [None]:
var.assign([[2., 3., 4.], [5., 6., 7.]])

In [None]:
var.assign_add([[1., 1., 1.], [2., -1., 1.]])

In [None]:
# This will raise an exception. The shape has to be the same
var.assign([1, 2, 3])

In [None]:
def f(x):
    y = x**2 + 2*x - 5
    return y

In [None]:
x = tf.Variable(3.0)
f(x)

In [None]:
# tf.reshape returns a new tensor (not a variable); var itself is unchanged.
var2= tf.reshape(var, shape=[1,6])
var2

In [None]:
# You can name variables. Names help with tracking and debugging and are preserved when saving and restoring. Names do not have to be unique.
x = tf.Variable([1., 2., 3.], name="MyVariable")
x

In [None]:
# also, there is a trainable argument inside the variable init.
x = tf.Variable([1., 2.], trainable=False)
x

<html>
<img src="https://img.icons8.com/color/344/light.png" width=70><font size=4>Placing on specific Backend:</font><br>
TensorFlow tries to place each variable on the most efficient available backend (device). On a platform with multiple devices, you can override this rule and place a variable on a specific device, for example:
<code>
with tf.device('CPU:0'):
    a = tf.Variable([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    c = tf.matmul(a, b)
print(c)
</code></html>

<h2>Gradient Tape: </h2>  TensorFlow can calculate gradients for you through its own automatic differentiation. To do that, open a tape that records the operations applied to the values it __watches__, then ask TensorFlow to calculate the gradient from that recording by calling ```GradientTape.gradient(target, sources)```.

In [None]:
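# Trainable variables are watched automatically, so the tape can differentiate f with respect to x.
x = tf.Variable(3.0)  # redefine x: the previous x was created with trainable=False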
with tf.GradientTape() as tape:
    y = f(x)

g_x = tape.gradient(y, x)  # g(x) = dy/dx
g_x

In [None]:
# Trainable variable and non-trainable
# A trainable variable
x0 = tf.Variable(3.0, name='x0')
# Not trainable
x1 = tf.Variable(3.0, name='x1', trainable=False)
# Not a Variable: A variable + tensor returns a tensor.
x2 = tf.Variable(2.0, name='x2', trainable=True) + 1.0
# Not a variable
x3 = tf.constant(3.0, name='x3')

with tf.GradientTape() as tape:
    y = (x0**2) + (x1**3) + (x2**4) + (x3**5)


grad = tape.gradient(y, [x0, x1, x2, x3])

for g in grad:
    print(g)

In [None]:
# Check all watched values in the tape
[var.name for var in tape.watched_variables()]

In [None]:
# You can watch tensors and non-trainable variables manually
x0 = tf.Variable(3.0, name='x0')
# Not trainable but can be watched
x1 = tf.Variable(3.0, name='x1', trainable=False)
# A tensor can be watched
x2 = tf.Variable(2.0, name='x2') + 1.0
# A constant is not going to be watched
x3 = tf.constant(3.0, name='x3')

with tf.GradientTape() as tape:
    tape.watch(x0)
    tape.watch(x1)
    tape.watch(x2)
    y = (x0**2) + (x1**3) + (x2**4)

grad = tape.gradient(y, [x0, x1, x2, x3])

for g in grad:
    print(g)

In [None]:
# let's run this example one more time here
x = tf.Variable(4.)
with tf.GradientTape() as tape:
    y =  x**3
    z = x**4

In [None]:
# This line runs fine
tape.gradient(y,x)

In [None]:
# Running this raises an error: a non-persistent tape releases its resources
# after the first call to gradient().
tape.gradient(z,x)

In [None]:
x = tf.Variable(4.)
with tf.GradientTape(persistent=True) as tape:
    y = x**3
    z = x**4
print(tape.gradient(y,x))
print(tape.gradient(z,x))
del tape # Free the resource manually

In [None]:
# With a persistent tape we can nest gradient calculations. Calling gradient() inside the tape's context raises a performance warning.
x = tf.Variable(1.)
with tf.GradientTape(persistent=True) as tape:
    y = x**3
    z = tape.gradient(y,x)
w = tape.gradient(z, x)
print(y)
print(z)
print(w)

In [None]:
# It is better to nest two tapes for higher-order gradients
x = tf.Variable(1.)
with tf.GradientTape() as outer_tape:
    with tf.GradientTape() as inner_tape:
        y = x**3
    z = inner_tape.gradient(y,x)
w = outer_tape.gradient(z, x)
print(y)
print(z)
print(w)

In [None]:
# You can temporarily stop recording
x = tf.Variable(2.0)
y = tf.Variable(3.0)

with tf.GradientTape() as t:
    x_sq = x * x
    with t.stop_recording():
        y_sq = y * y
    z = x_sq + y_sq

grad = t.gradient(z, {'x': x, 'y': y})  # You can pass a dict of named sources instead of a list.

print('dz/dx:', grad['x'])  # 2*x => 4
print('dz/dy:', grad['y'])  # None: y*y was computed while recording was stopped

<img src= "https://img.icons8.com/external-flaticons-flat-flat-icons/344/external-question-100-most-used-icons-flaticons-flat-flat-icons.png" alt="Tip" width=70>  For the function f(x, y) = sin((x+y)^2) + cos(x*y/2) and the points {(x,y)} = {(pi/2, pi/3), (pi/4,pi/5), (pi/6,pi/7)}, write a program to do the following:<br>
<ul>
    <li> Evaluate the function at each point.</li>
    <li> Compute the first, second, and third derivatives of the function at each point.</li>
    </ul>

In [None]:
x = tf.linspace(1.0, 10.0, 11)
delta = tf.Variable(0.0)

with tf.GradientTape() as tape:
    y = tf.nn.sigmoid(x+delta)

dy_dx = tape.jacobian(y, delta)

In [None]:
dy_dx

In [None]:
# Control flow and gradient
x = tf.constant(1.0)
y = tf.Variable(2.0)
z = tf.Variable(2.0)

with tf.GradientTape(persistent=True) as tape:
    tape.watch(x)
    if x > 0.0:
        w = y
    else:
        w = z**2 

dv0, dv1 = tape.gradient(w, [y, z])

print(dv0)
print(dv1)  # None: the branch that was taken never uses z, so w is not connected to z

In [None]:
# Gradient of multiple function (not using jacobian)
x = tf.Variable(2.0)
with tf.GradientTape(persistent=True) as tape:
    y0 = x**2
    y1 = 1 / x

y = [y0.numpy(), y1.numpy()]
g_y = [tape.gradient(y0, x).numpy(), tape.gradient(y1, x).numpy()]
print(y)
print(g_y)
del tape

<img src="https://img.icons8.com/external-itim2101-lineal-color-itim2101/344/external-professor-life-style-avatar-itim2101-lineal-color-itim2101.png" alt="Instructor" width=50> <font size=4> Gradient of Non-Scalar Targets:</font><br>
A gradient is fundamentally defined for a scalar target. If you ask TensorFlow for the gradient of a non-scalar target, it returns the gradient of the sum of the target's elements. This is by design: it makes it easy to calculate the gradient of a sum of elements (such as the sum of errors in linear regression). If you need the derivative of each element of a vector-valued function, use the Jacobian instead. Let's demonstrate it using examples:

In [None]:
# Gradient on non-scalar target example 1
x = tf.Variable(2.0)
with tf.GradientTape() as tape:
    y0 = x**2
    y1 = 1 / x
print(tape.gradient({'y0': y0, 'y1': y1}, x).numpy())

In [None]:
# Gradient on non-scalar target example 2
x = tf.Variable(2.)
with tf.GradientTape() as tape:
    y = x * [3., 4.]
print(tape.gradient(y, x).numpy())
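
To make the difference between the summed gradient and the per-element derivatives concrete, the sketch below reuses the target from example 2 and calls `tape.jacobian` next to `tape.gradient`. The names `grad_sum` and `jac` are illustrative, and the persistent tape is used only so that both calls can share one recording.

```python
import tensorflow as tf

x = tf.Variable(2.0)
with tf.GradientTape(persistent=True) as tape:
    y = x * tf.constant([3.0, 4.0])   # vector-valued target: [3x, 4x]

grad_sum = tape.gradient(y, x)   # derivative of sum(y) with respect to x
jac = tape.jacobian(y, x)        # one derivative per element of y
del tape                         # free the persistent tape

print(grad_sum.numpy())  # 7.0
print(jac.numpy())       # [3. 4.]
```

The gradient collapses the two derivatives into their sum, while the Jacobian keeps them apart.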

In [None]:
opt = tf.keras.optimizers.Adam(learning_rate=0.1)
var1 = tf.Variable(10.0)
loss = lambda: (var1 ** 2)/2.0       # d(loss)/d(var1) == var1
step_count = opt.minimize(loss, [var1]).numpy()
# The first step is `-learning_rate*sign(grad)`
var1.numpy()

<img src="https://img.icons8.com/external-itim2101-lineal-color-itim2101/344/external-professor-life-style-avatar-itim2101-lineal-color-itim2101.png" alt="Instructor" width=50> <font size=4> Gradient of Non-TensorFlow Calculations:</font><br>
As you may guess, the gradient tape cannot calculate gradients through non-TensorFlow operations. TensorFlow also cannot calculate gradients for integer or string tensors. Let's demonstrate it using an example:

In [None]:
x = tf.Variable([[1.0, 2.0],
                 [3.0, 4.0]], dtype=tf.float32)
with tf.GradientTape() as tape:
    x2 = x**2

    # This step is calculated with NumPy
    y = np.mean(x2, axis=0)

    # Like most ops, reduce_mean will cast the NumPy array to a constant tensor
    # using `tf.convert_to_tensor`.
    y = tf.reduce_mean(y, axis=0)
print(tape.gradient(y, x))

In [None]:
# It is not possible to calculate the gradient of int.
x = tf.constant(10)
with tf.GradientTape() as g:
    g.watch(x)
    y = x * x

print(g.gradient(y, x))

<img src="https://img.icons8.com/external-itim2101-lineal-color-itim2101/344/external-professor-life-style-avatar-itim2101-lineal-color-itim2101.png" alt="Instructor" width=50> <font size=4> Gradient of Stateful Object:</font><br>
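By default, TensorFlow cannot propagate gradients through an update of a stateful object. When the tape reads from a stateful object such as a variable, it sees only the current state, not the history that led to it, so the gradient stops there. Let's look at an example: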

In [None]:
x0 = tf.Variable(3.0)
x1 = tf.Variable(0.0)

with tf.GradientTape() as tape:
    # Update x1 = x1 + x0.
    x1.assign_add(x0)
    # The tape starts recording from x1.
    y = x1**2   # y = (x1 + x0)**2

# This doesn't work.
print(tape.gradient(y, x0))   #dy/dx0 = 2*(x1 + x0)

In [None]:
# Training Data Prep
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

a = tf.constant(3.)
b = tf.constant(2.)
noise_std = tf.constant(0.1)
x = tf.random.uniform(shape=(1,100))
y = a * x + b + noise_std * tf.random.normal(shape=(1,100))

_=plt.scatter(x.numpy(), y.numpy())

In [None]:
# Find a and b using SSE and optimization
eta = 0.001
a_hat = 1.
b_hat = 1.
for i in range(400):
    v_a = tf.Variable(a_hat, dtype=tf.float32)
    v_b = tf.Variable(b_hat, dtype=tf.float32)
    with tf.GradientTape() as tape:
        loss = tf.reduce_sum(((v_a * x + v_b) - y)**2.)
    g_a, g_b = tape.gradient(loss, [v_a, v_b])
    a_hat = a_hat - eta * g_a.numpy()
    b_hat = b_hat - eta * g_b.numpy()

a_hat, b_hat

In [None]:
plt.scatter(x.numpy(), y.numpy())
x_plt = np.array([x.numpy().min(), x.numpy().max()])
plt.plot(x_plt, a_hat * x_plt + b_hat, color='r')  # fitted line, using the estimated intercept b_hat

In [None]:
# Another way is using Keras which makes it easier
opt = tf.keras.optimizers.SGD(learning_rate=0.001)
a_hat = tf.Variable(1.)
b_hat = tf.Variable(1.)
theta = [a_hat, b_hat]
eta = tf.constant(0.1)
for i in range(400):
    with tf.GradientTape() as tape:
        loss = ((a_hat * x + b_hat) - y)**2.  # reduce_sum is omitted on purpose: the tape sums a non-scalar target.
    g_theta = tape.gradient(loss, theta)
    opt.apply_gradients(zip(g_theta, theta)) # Remember: we can't directly 
a_hat, b_hat

<img src="https://img.icons8.com/color/344/light.png" width=50> __Tip:__ <br> <code>tf.keras.optimizers.SGD</code> is one of the Keras optimizers. We will discuss it in detail later. Here we used <code>apply_gradients</code>. The optimizer also has a <code>minimize</code> function, which is easier to use because you don't need to work with the tape. However, <code>apply_gradients</code> is the right choice if you wish to post-process the gradients before applying them, which is very useful for gradient clipping. We will come back to clipping later in an example.

<h1> Gradient Descent Using Keras</h1>
In chapter 2, we reviewed vanilla gradient descent as below:
<img src="images/vanilla_gd.png" width=250> <br>
In many optimization problems, the objective function f has the following form:<br>
<img src="images/sum_err.png" width=450> 
An example of this form of objective function is SSE error in a linear regression:<br>
<img src="images/sum_err_ex.png" width=300>
To calculate the gradient, you have to iterate over all N terms (for example, all training samples). This can take a while, and at the end you make only one update step. This is called full-batch gradient descent. An alternative is to update the current point using just one sample, which gives an estimate of the gradient; this is called stochastic gradient descent (SGD). It is attractive because one update does not require a pass through the whole dataset. However, a single-sample estimate can be very noisy, especially near the optimum, where precision matters. A good compromise is to use a subsample of the dataset called a mini-batch (say, 50 samples) and calculate the gradient on it. This gives a more accurate estimate than a single sample at a much lower cost than the full batch. This method is called stochastic mini-batch gradient descent. Here we use the terms mini-batch GD and stochastic GD interchangeably to refer to this method.
<img src="images/mini_batch.png" width=200>
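
The mini-batch idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the chapter's implementation: the synthetic data, the mean-squared-error loss, the helper `mse_grad`, and all constants (learning rate, batch size, number of epochs) are chosen for the example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for y = 3x + 2 plus a little noise (illustrative values).
n = 200
x = rng.uniform(0.0, 1.0, size=n)
y = 3.0 * x + 2.0 + 0.1 * rng.normal(size=n)

def mse_grad(a, b, xb, yb):
    """Gradient of the mean squared error over the batch (xb, yb)."""
    err = a * xb + b - yb
    return 2.0 * np.mean(err * xb), 2.0 * np.mean(err)

a, b = 0.0, 0.0
eta = 0.1          # learning rate
batch_size = 50    # batch_size = n is full-batch GD, batch_size = 1 is pure SGD

for epoch in range(300):
    order = rng.permutation(n)                 # reshuffle once per epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        g_a, g_b = mse_grad(a, b, x[idx], y[idx])
        a -= eta * g_a                         # one update per mini-batch
        b -= eta * g_b

print(a, b)   # should land close to 3 and 2
```

Changing `batch_size` moves the sketch between the three regimes described above without touching the update rule.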

In [None]:
# Regardless of your objective type you can use the following command to implement SGD
opt = tf.keras.optimizers.SGD(learning_rate=0.1)
x = tf.Variable(1.0)
y = tf.Variable(2.0)
loss = lambda: (x**2 + y**2)/2.0     
for i in range(100):
    step_count = opt.minimize(loss, [x,y]).numpy()
# Step is `- learning_rate * grad`
x.numpy(), y.numpy()


<img src="https://img.icons8.com/external-soft-fill-juicy-fish/344/external-maths-school-soft-fill-soft-fill-juicy-fish.png" alt="Math Tip" width=50> <font size=5> Momentum: </font>
If the objective function is much steeper in one dimension than in the others, GD jumps back and forth across the steep direction, which slows convergence. <br>
<img src="images/momentum.png" width =400><br>
This limits the choice of learning rate: if it is too large, the iterations can diverge. Overall, tuning the learning rate becomes very hard in this situation. One remedy is to add a term to GD called momentum.<br>
<img src="images/momentum_eq.png" width=300><br>
In TensorFlow you can add momentum to optimizers that accept a <code>momentum</code> argument, such as <code>SGD</code> and <code>RMSprop</code>. This argument plays the role of gamma, and it is itself a hyperparameter.

In [None]:
opt = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.1)
x = tf.Variable(1.0)
y = tf.Variable(2.0)   
for i in range(100):
    loss = lambda: (x**2 + y**2)/2.0  
    step_count = opt.minimize(loss, [x,y]).numpy()
# With momentum: velocity = momentum * velocity - learning_rate * grad; w = w + velocity
x.numpy(), y.numpy()
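
The effect of momentum can also be seen without any library. The sketch below is a plain-Python illustration that assumes the common form of the update, v ← γ·v + η·∇f followed by θ ← θ − v; the figure's notation may differ. The test function, the starting point, and the helper `run` are made up for the example.

```python
def grad(x, y):
    """Gradient of f(x, y) = 0.5 * (x**2 + 25 * y**2): shallow in x, steep in y."""
    return x, 25.0 * y

def run(eta, gamma, steps=100):
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = gamma * vx + eta * gx      # accumulate a decaying sum of past gradients
        vy = gamma * vy + eta * gy
        x, y = x - vx, y - vy
    return (x * x + y * y) ** 0.5       # distance to the minimum at the origin

plain = run(eta=0.03, gamma=0.0)          # gamma = 0 is plain gradient descent
with_momentum = run(eta=0.03, gamma=0.9)
print(plain, with_momentum)
```

The steep `y` direction caps the usable learning rate, so plain GD crawls along `x`; with the same learning rate, the momentum run ends closer to the minimum.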

<img src="https://img.icons8.com/external-soft-fill-juicy-fish/344/external-maths-school-soft-fill-soft-fill-juicy-fish.png" alt="Math Tip" width=50> <font size=5> Nesterov Accelerated Gradient: </font>
To improve on momentum, Nesterov Accelerated Gradient (NAG) uses the gradient at a look-ahead point instead of at the current point. This can speed up convergence. <br>
<table><tr><td><img src="images/nag_eq.png" width=300></td><td><img src="images/nag_concept.png" width=600></td></tr></table><br>
Check the following illustrative example:<br><br>
<img src="images/nag_opt.png" width=600><br>
In TensorFlow, pass <code>nesterov=True</code> to <code>tf.keras.optimizers.SGD</code> to use the Nesterov momentum update.

In [None]:
opt = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.1, nesterov=True)
x = tf.Variable(1.0)
y = tf.Variable(1.0)     
for i in range(100):
    loss = lambda: (x**2 + y**2)/2.0
    step_count = opt.minimize(loss, [x,y]).numpy()
# With Nesterov: velocity = momentum * velocity - learning_rate * grad; w = w + momentum * velocity - learning_rate * grad
x.numpy(), y.numpy()
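
The look-ahead step is easy to see in plain Python. This sketch assumes the textbook form of NAG, v ← γ·v + η·∇f(θ − γ·v) followed by θ ← θ − v, which may be written differently in the figure and is not the exact formula Keras uses internally. The test function and the helper `nag` are illustrative.

```python
def grad(x, y):
    """Gradient of f(x, y) = 0.5 * (x**2 + 25 * y**2)."""
    return x, 25.0 * y

def nag(eta=0.03, gamma=0.9, steps=100):
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        # Evaluate the gradient at the look-ahead point, not at (x, y).
        gx, gy = grad(x - gamma * vx, y - gamma * vy)
        vx = gamma * vx + eta * gx
        vy = gamma * vy + eta * gy
        x, y = x - vx, y - vy
    return x, y

x_nag, y_nag = nag()
print(x_nag, y_nag)   # both close to 0
```

The only change from classical momentum is where the gradient is evaluated.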

<img src="https://img.icons8.com/external-kosonicon-lineal-color-kosonicon/344/external-lab-tool-back-to-school-kosonicon-lineal-color-kosonicon.png" alt="Lab" width=80 > __Polynomial Model:__ We are interested in fitting a model of the form ax^2 + bx + c. Assume a = -1, b = 2, c = 3. Generate training samples (with Gaussian noise of standard deviation 0.04), then fit the model using the SSE criterion. For the optimization, use SGD with momentum and Nesterov.

<b>Hint:</b> Start with a small number of iterations (epochs) and try to adjust the learning rate. Then, iterate for a good number of iterations.


<img src="https://img.icons8.com/external-soft-fill-juicy-fish/344/external-maths-school-soft-fill-soft-fill-juicy-fish.png" alt="Math Tip" width=50> <font size=5> Adaptive Learning Rate: </font><br>
As you may have seen in the lab above, tuning the learning rate is important. A large learning rate makes fast progress but can diverge; a small learning rate will not diverge but can take too long to converge. A good solution is to start with large steps and take smaller steps as we get close to the optimum. This is called an adaptive learning rate. Here, we review several well-known algorithms for adapting the learning rate:<br>
<h2>Adagrad</h2>
The adaptive gradient algorithm (AdaGrad) adapts the learning rate separately for each parameter. It performs larger updates for infrequently updated parameters and smaller updates for frequently updated ones, which makes it well suited to sparse datasets. (It was originally proposed by Duchi, Hazan, and Singer in 2011.)
<img src="images/adagrad.png" width=250><br>
You can use <code>tf.keras.optimizers.Adagrad(learning_rate, initial_accumulator_value, epsilon)</code> to create an Adagrad optimizer.
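
The per-parameter scaling can be sketched in plain Python. This is an illustration under assumptions: the update is taken to be θ ← θ − η·g / (√G + ε) with G the running sum of squared gradients, the accumulator starts at 0 for clarity (Keras starts it at `initial_accumulator_value`), and the test function and the helper `adagrad` are made up for the example.

```python
import math

def grad(x, y):
    """Gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    return [x, 100.0 * y]

def adagrad(theta, eta=0.1, eps=1e-8, steps=200):
    accum = [0.0] * len(theta)          # running sum of squared gradients
    history = [list(theta)]
    for _ in range(steps):
        g = grad(*theta)
        for i in range(len(theta)):
            accum[i] += g[i] ** 2
            theta[i] -= eta * g[i] / (math.sqrt(accum[i]) + eps)
        history.append(list(theta))
    return history

history = adagrad([1.0, 1.0])
print(history[1])    # after one step: both coordinates moved by about eta
print(history[-1])   # after 200 steps
```

Although the gradient in `y` is 100 times larger than in `x`, dividing by the accumulated magnitude gives both coordinates almost the same trajectory.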
<h2>Adadelta</h2>
Adadelta is an extension of Adagrad. It restricts the window of accumulated past gradients to a fixed size $w$. Instead of inefficiently storing the last $w$ gradients, it uses an exponentially decaying average.
<table><tr><td><img src="images/adadelta_update.png" width=300></td><td> Where:</td><td> <img src="images/adadelta_adaptive.png" width=300></td></tr></table>
You can use <code>tf.keras.optimizers.Adadelta(learning_rate, rho, epsilon)</code> to create an Adadelta optimizer, where rho is the beta in the equations above.<br>
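
The two decaying averages used by Adadelta can be sketched in plain Python. This follows the form in the original Adadelta paper with a learning rate of 1, which is an assumption here: the Keras optimizer additionally scales the step by `learning_rate`. The one-dimensional objective and the helper `adadelta` are illustrative.

```python
import math

def adadelta(x0, rho=0.95, eps=1e-6, steps=100):
    """Run Adadelta on f(x) = 0.5 * x**2 and return the final x."""
    x = x0
    avg_sq_grad = 0.0    # decaying average of g**2
    avg_sq_step = 0.0    # decaying average of (delta x)**2
    for _ in range(steps):
        g = x                                            # f'(x) = x
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        step = -math.sqrt(avg_sq_step + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_step = rho * avg_sq_step + (1 - rho) * step * step
        x += step
    return x

x_final = adadelta(5.0)
print(x_final)
```

Because the step accumulator starts at zero, the first steps in this sketch are tiny and grow only gradually.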

If you omit these arguments, TensorFlow fills them with default values, which generally follow those suggested in the research papers.

<img src="https://img.icons8.com/external-kosonicon-lineal-color-kosonicon/344/external-lab-tool-back-to-school-kosonicon-lineal-color-kosonicon.png" alt="Lab" width=50 > <font size=5>Paper Box Problem:</font>
Given a square piece of paper with side length L = 30, find the cut size x that maximizes the volume of the box.
<img src="http://jwilson.coe.uga.edu/EMT725/Box/Image20.jpg" width=500><br>
After writing the problem solve it using Adagrad and Adadelta.

<h2> RMSProp</h2>
RMSProp is similar to Adagrad, but it uses a leaky sum (an exponentially decaying average) of squared gradients to adjust the learning rates.<br>
<img src="images/rmsprop.png" width=300><br>
In TensorFlow you can use RMSProp as <code>tf.keras.optimizers.RMSprop(learning_rate, rho, momentum)</code>.
The following animation compares the methods discussed so far on one objective function.
<img src="https://miro.medium.com/max/1240/1*Y2KPVGrVX9MQkeI8Yjy59Q.gif" width=500>
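
The RMSProp update itself fits in a few lines of plain Python. This sketch assumes the basic form without momentum, E[g²] ← ρ·E[g²] + (1 − ρ)·g² followed by θ ← θ − η·g / (√E[g²] + ε); the one-dimensional objective and the helper `rmsprop` are made up for the illustration.

```python
import math

def rmsprop(x0, grad, eta=0.01, rho=0.9, eps=1e-7, steps=500):
    x, avg_sq = x0, 0.0
    for _ in range(steps):
        g = grad(x)
        avg_sq = rho * avg_sq + (1 - rho) * g * g    # leaky sum of squared gradients
        x -= eta * g / (math.sqrt(avg_sq) + eps)     # step size normalized by recent gradient scale
    return x

x_final = rmsprop(1.0, grad=lambda x: x)   # f(x) = 0.5 * x**2, so the gradient is x
print(x_final)
```

Unlike Adagrad's ever-growing sum, the leaky average forgets old gradients, so the effective step size does not shrink toward zero over time.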

<img src="https://img.icons8.com/external-flaticons-lineal-color-flat-icons/344/external-case-study-social-media-agency-flaticons-lineal-color-flat-icons-2.png" width=50> <font size=5> Case Study:</font><br>
Here, we want to build a predictive model for a stock price using gradient descent. We use General Mills stock (ticker: GIS). The data was downloaded from Yahoo Finance and stored in the file "GIS.csv". Load the data and build an autoregressive (AR) model for it.

In [None]:
# Predicting General Mills Stocks using an AR model
import pandas as pd
import seaborn as sns
gis = pd.read_csv("GIS.csv") # This data is downloaded from Yahoo finance.
gis = gis[["Date", "Close"]] # Keeping closing value and date.
fig = plt.figure(figsize=(18,9))
x_ticks = gis["Date"][range(1,gis.shape[0], 8)]
ax = sns.lineplot(data=gis, x="Date", y="Close")
_=plt.xticks(x_ticks, rotation=30)

In [None]:
def add_lags(df, p):
    close = df["Close"].to_list()
    df["const"] = 1.
    for lag in range(1,p):
        col_name = "lag_" + str(lag)
        df[col_name] = np.array([0] * lag + close[:-lag])
       
    return df.iloc[p:]

p = 4
gis_cpy = gis.copy()
gis_cpy = add_lags(gis_cpy, p)
#fig = plt.figure(figsize=(18,9))
#ax = sns.lineplot(data=gis_cpy, x="Date", y="Close")
#for i in range(1,p ):
#    sns.lineplot(data=gis_cpy, x="Date", y="lag_" + str(i))
#_=plt.xticks(x_ticks, rotation=30)
y = tf.constant(gis_cpy["Close"].to_numpy(), dtype=tf.double)
x_cols = ["const"] + [c for c in gis_cpy.columns if c.startswith("lag")]
x = tf.constant(gis_cpy[x_cols].to_numpy(), dtype=tf.double)
t = gis_cpy["Date"]

In [None]:
# Define the model parameters -> y[n] = a_0 + a_1 * y[n-1] + ... + a_(p-1) * y[n-(p-1)]  (add_lags builds a constant plus p-1 lag columns)
a = [tf.Variable(np.random.rand(1)[0], dtype=tf.double, name="a_" + str(i)) for i in range(p)]

In [None]:
# Optimize the SSE loss for finding optimal a_i's
opt = tf.keras.optimizers.RMSprop(learning_rate=0.003, rho=0.9,momentum=0.1)
for i in range(1000):
    loss = lambda: tf.reduce_sum((tf.linalg.matvec(x, a) - y)**2)
    step_count = opt.minimize(loss, a)
print("Loss: ", loss())
print("Coefficients: ", a)

In [None]:
y_hat = tf.linalg.matvec(x, a).numpy()
fig = plt.figure(figsize=(18,9))
sns.lineplot(x=t, y=y.numpy(), label="Actual")
sns.lineplot(x=t, y=y_hat, label="Forecast")
plt.setp(plt.gca().get_legend().get_texts(), fontsize='22')
_=plt.xticks(x_ticks, rotation=30)

<h2>Adam</h2>
Adam essentially combines momentum with RMSprop. It is widely used across applications and is one of the most effective variants of gradient descent. Adam applies bias correction to its running averages of the gradient (momentum) and of the squared gradient (decay). As t grows, the effect of these corrections fades.<br>
<img src="images/adam.png" width=250><br>
In tensorflow you can use <code>tf.keras.optimizers.Adam(learning_rate, beta_1, beta_2)</code>

In [None]:
# as an example let's solve the stock market prediction using Adam
a = [tf.Variable(np.random.rand(1)[0], dtype=tf.double, name="a_" + str(i)) for i in range(p)]
opt = tf.keras.optimizers.Adam(learning_rate=0.1,beta_1=0.9, beta_2=0.999)
for i in range(800):
    loss = lambda: tf.reduce_sum((tf.linalg.matvec(x, a) - y)**2)
    step_count = opt.minimize(loss, a)
print("Loss: ", loss())
print("Coefficients: ", a)
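
To show what the bias correction does, here is a plain-Python sketch of the Adam update on a one-dimensional problem. It assumes the standard form with bias-corrected first and second moments; the objective, the starting point, and the helper `adam` are illustrative and are not taken from the case study.

```python
import math

def adam(x0, grad, eta=0.1, beta_1=0.9, beta_2=0.999, eps=1e-7, steps=500):
    x, m, v = x0, 0.0, 0.0
    path = [x]
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta_1 * m + (1 - beta_1) * g        # running mean of gradients
        v = beta_2 * v + (1 - beta_2) * g * g    # running mean of squared gradients
        m_hat = m / (1 - beta_1 ** t)            # bias corrections: strong for small t,
        v_hat = v / (1 - beta_2 ** t)            # negligible as t grows
        x -= eta * m_hat / (math.sqrt(v_hat) + eps)
        path.append(x)
    return path

path = adam(10.0, grad=lambda x: x)   # loss = x**2 / 2, so the gradient is x
print(path[1])    # the first step moves by about eta
print(path[-1])
```

Without the two corrections, `m` and `v` would start close to zero and the first updates would be badly scaled; with them, the first step has size close to `eta` regardless of the gradient's magnitude.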

<img src="https://img.icons8.com/color/344/light.png" width=50>__Tip__: TensorFlow has two variants of Adam, called Adamax and Nadam. Adamax is Adam based on the infinity (max) norm. Nadam uses Nesterov momentum instead of classical momentum.

<img src="https://img.icons8.com/color/344/light.png" width=50>__Tip__: TensorFlow has another variant of GD called FTRL (Follow The Regularized Leader). FTRL was developed at Google for click-through rate prediction in the early 2010s. It is most suitable for shallow models with large and sparse feature spaces. 

<img src="https://img.icons8.com/external-flaticons-lineal-color-flat-icons/344/external-coffee-cup-bakery-flaticons-lineal-color-flat-icons.png" alt="Takehome" width=50><font size=4>Burg Method:</font><br>
In the stock market case study, we minimized the forward prediction error, that is, the error of predicting x[n] from x[n-1] ... x[n-p]. It is also possible to define a backward error, which is the error of predicting x[n-p] from x[n-p+1] ... x[n]. In the Burg method for AR models, both the forward and backward errors are minimized. Although both the forward and the Burg objectives have closed-form solutions, here we try to solve them using gradient descent. Write a program that uses Adam to minimize the Burg objective for the case study.

<img src="https://img.icons8.com/external-soft-fill-juicy-fish/344/external-maths-school-soft-fill-soft-fill-juicy-fish.png" alt="Math Tip" width=50><font size=4> Gradient Clipping:</font><br>
One issue with gradient descent is the <em>exploding gradient</em>, where the gradient becomes too large. This causes instability and divergence, and it can also overflow the range of floating-point numbers. A remedy is <em>gradient clipping</em>, which bounds the magnitude of the gradient while preserving its direction. There are different ways of clipping; check the TensorFlow docs for more details. Let's check the concept using an example:

In [None]:
a = [tf.Variable(np.random.rand(1)[0], dtype=tf.double, name="a_" + str(i)) for i in range(p)]
opt = tf.keras.optimizers.RMSprop(learning_rate=0.003, rho=0.9,momentum=0.1)
for i in range(1000):
    with tf.GradientTape() as tape:
        loss = tf.reduce_sum((tf.linalg.matvec(x, a) - y)**2)
    grads = tape.gradient(loss, a)
    grads = [tf.clip_by_norm(g, 100.) for g in grads]
    opt.apply_gradients(zip(grads, a))
    
print("Loss: ", loss)
print("Coefficients: ", a)
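
One detail is worth a small experiment: when every parameter is a separate scalar variable, clipping each gradient on its own bounds each component but can change the direction of the overall update. The sketch below contrasts that with `tf.clip_by_global_norm`, which rescales all gradients by a common factor. The gradient values are made up for the illustration.

```python
import tensorflow as tf

grads = [tf.constant(3.0), tf.constant(4.0)]   # global norm = 5

# Clipping each scalar separately bounds every component on its own.
per_tensor = [tf.clip_by_norm(g, 1.0) for g in grads]

# Global-norm clipping rescales all gradients by the same factor.
clipped, global_norm = tf.clip_by_global_norm(grads, clip_norm=1.0)

print([float(g) for g in per_tensor])   # [1.0, 1.0]
print([float(g) for g in clipped])      # about [0.6, 0.8]
print(float(global_norm))               # 5.0
```

The per-tensor result points along (1, 1), while the globally clipped result keeps the original 3:4 ratio between the components.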

<img src="https://img.icons8.com/external-flaticons-lineal-color-flat-icons/344/external-coffee-cup-bakery-flaticons-lineal-color-flat-icons.png" alt="Takehome" width=50><font size=5> Constrained Optimization Takehome:</font>
What we have discussed so far concerns unconstrained optimization. TensorFlow was originally built for training DNNs, where constrained optimization is less relevant. However, there is an extension called TensorFlow Constrained Optimization (TFCO). <br>
To install TFCO, use: <code>pip install tensorflow-constrained-optimization</code>
We don't cover TFCO here as it is not very mature, but we leave it to you as a take-home. The notebook `Takehome_TFCO_Oscillation_compas` contains an example using TFCO; go and check it on your own.

<img src="https://img.icons8.com/color/344/light.png" width=50><font size=5> Tips for Performance:</font><br>
TensorFlow is a high-performance distributed computing library. It can run code on multiple CPUs, GPUs, and TPUs. Check the TensorFlow docs for <code>tf.distribute.Strategy</code>, which lets you define your distributed computing strategy. <br>
Also, so far we have used the eager API, which runs tensors and operations immediately, like ordinary Python. TensorFlow also supports a lazy evaluation API called graphs, which makes exporting the problem to other platforms easier.
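
As a small illustration of graph execution, the sketch below wraps the quadratic from the refresher in `tf.function`, which traces the Python function into a graph on its first call. The function body and the input value are chosen for the example; gradients still work through the traced function.

```python
import tensorflow as tf

@tf.function
def f(x):
    # Traced into a graph on the first call; later calls with a
    # compatible input reuse the traced graph.
    return x**2 + 2.0 * x - 5.0

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = f(x)

print(y.numpy())                    # 10.0
print(tape.gradient(y, x).numpy())  # 8.0
```

For a function this small the difference is negligible; graphs tend to pay off when the same computation is executed many times, as in a training loop.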