# Backprop Examples
While troubleshooting some exploding gradients, I decided to walk through backpropagation for the first step, line by line. Here are a couple of examples for two different loss functions: squared error and Huber loss.

## Squared error

The squared error is given by:
    
    L = np.sum((target_Q - Q)**2)
   
And thus the gradient with respect to Q becomes:

    dL/dQ = -2 * (target_Q - Q)

Because Q is given by:

    Q_i = b_i + w_i,1 * x_i,1 + w_i,2 * x_i,2 + ... + w_i,n * x_i,n

where `x_i` represents the input from the previous (fully-connected) layer, the gradients with respect to the biases and weights of Q are:

    dL/db_i = dL/dQ_i * 1 = -2 * (target_Q_i - Q_i)
    dL/dw_i,j = dL/dQ_i * x_i,j = -2 * (target_Q_i - Q_i) * x_i,j
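One way to sanity-check these expressions is to compare them against finite differences. The sketch below does this in plain Python for a single output unit. The inputs, weights, bias, and target are made-up illustrative values, and `q_value` and `loss` are hypothetical helpers, none of them taken from the network discussed here.

```python
# Hypothetical single-unit example; all numbers are illustrative only.
x = [0.5, -1.0, 2.0]   # inputs from the previous layer
w = [0.1, 0.2, -0.3]   # weights of one output unit
b = 0.05               # bias of that unit
target_q = 1.0


def q_value(w, b, x):
    """Q = b + sum_j w_j * x_j for one output unit."""
    return b + sum(wj * xj for wj, xj in zip(w, x))


def loss(w, b):
    """Squared error for this single unit."""
    return (target_q - q_value(w, b, x)) ** 2


# Analytic gradients from the chain rule.
dL_dq = -2.0 * (target_q - q_value(w, b, x))
grad_b = dL_dq * 1.0
grad_w = [dL_dq * xj for xj in x]

# Central finite differences as an independent check.
eps = 1e-6
num_grad_b = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)
num_grad_w = []
for j in range(len(w)):
    w_plus = list(w)
    w_plus[j] += eps
    w_minus = list(w)
    w_minus[j] -= eps
    num_grad_w.append((loss(w_plus, b) - loss(w_minus, b)) / (2 * eps))

print("analytic:", grad_b, grad_w)
print("numeric: ", num_grad_b, num_grad_w)
```

The two printed rows should agree to several decimal places. A larger mismatch usually points to a sign or indexing error in the analytic gradient.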

Let's run through an example. After initializing the variables of a DQN and copying them to a holder network, we can calculate q(s, a) and target_q(s, a) via our primary and target networks, respectively.

    Q
    [[ 5.34987879 -2.93507671 -1.3103404   1.01032877]]

    target_Q
    [[ 4.39664125 -2.93507671 -1.3103404   1.01032877]]

    loss
    0.908662

    sum_grad
    [[ 1.  1.  1.  1.]]

    square_grad
    [[-1.90647507  0.          0.          0.        ]]

    sub_grad
    [[ 1.90647507  0.          0.          0.        ]]

    Q_grad
    [(array([ 1.90647507,  0.        ,  0.        ,  0.        ]

Note that because the parameters of the two networks are initially equal, target_Q matches Q for every action in the initial state except the one that was chosen, in this case at index 0. The update for target_Q is:

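    # Highest Q-value the target network predicts for the next state s2
    q2 = np.max(target_network.get_q_values(s2))
    # Start from the target network's Q-values for the current state s1
    target_q = target_network.get_q_values(s1)
    # Overwrite only the chosen action's entry with the one-step target
    target_q[a] = r + gamma * (1 - isterminal) * q2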

By summing over the squared differences between Q and target_Q for each element, we arrive at the loss.
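The printed loss and gradient can be reproduced by hand from the Q and target_Q values. The snippet below is a minimal check in plain Python using the numbers copied from the output above. Small differences in the last digit are to be expected, presumably because the network works in single precision.

```python
# Values copied from the printed output in this section.
Q = [5.34987879, -2.93507671, -1.3103404, 1.01032877]
target_Q = [4.39664125, -2.93507671, -1.3103404, 1.01032877]

# Squared error: sum over actions of (target_Q - Q)^2.
loss = sum((t - q) ** 2 for t, q in zip(target_Q, Q))

# dL/dQ = -2 * (target_Q - Q), written as 2 * (Q - target_Q).
Q_grad = [2.0 * (q - t) for t, q in zip(target_Q, Q)]

print("loss:", loss)
print("Q_grad:", Q_grad)
```

Only the entry at index 0 contributes, since the other three entries of Q and target_Q are identical.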

## Huber loss

    # Batch size 5
    Q:
     [[-0.53266817 -0.0878141   2.45806932  3.23200798]
     [-0.8812443   1.07983756  0.65692121  1.23270094]
     [ 0.48478067  0.3618294   1.32550728  2.08765173]
     [ 0.65240127  0.50296509  2.48937154  1.34428322]
     [-0.55477738  0.12505805  2.41784     2.32180619]]
    a:
     [[0 2]
     [1 1]
     [2 3]
     [3 3]
     [4 1]]
    target_Q:
     [-0.79592603 -4.67121124 -1.58115745  0.19765809  1.02122998]
    loss:
     [ 2.75399542  5.25104904  3.16880918  0.64662516  0.40156206]
    grads:
     [ 0.          0.10382807  1.          2.        ]

The output above was obtained by placing the following code in `Network.learn`, between `feed_dict` and `train_step`:

    # Evaluate the Q-values and the per-sample loss for this batch
    q, loss_ = self.sess.run([self.q, self.loss], feed_dict=feed_dict)
    print("Q:\n", q)
    print("a:\n", a)
    print("target_Q:\n", target_q)
    print("loss:\n", loss_)
    # Collect this network's trainable variables
    var_list = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                 scope=self.scope)
    optimizer = self.graph_dict["optimizer"][0]
    # Evaluate the (gradient, variable) pairs for the batch
    gvs = self.sess.run(optimizer.compute_gradients(self.loss, var_list=var_list),
                        feed_dict=feed_dict)
    # Print the gradient of the last trainable variable
    print("grads:\n", gvs[-1][0])
    input("Press enter...")
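The printed numbers can be checked by hand. The sketch below recomputes the per-sample loss and a summed gradient in plain Python under two assumptions that are not stated above: that the Huber threshold is 1.0, and that `gvs[-1]` is the bias of the output layer, so its gradient is dL/dQ summed over the batch for each action. `huber` and `huber_grad` are hypothetical helper names, not functions from the network code.

```python
# Values copied from the printed output above.
Q = [
    [-0.53266817, -0.0878141, 2.45806932, 3.23200798],
    [-0.8812443, 1.07983756, 0.65692121, 1.23270094],
    [0.48478067, 0.3618294, 1.32550728, 2.08765173],
    [0.65240127, 0.50296509, 2.48937154, 1.34428322],
    [-0.55477738, 0.12505805, 2.41784, 2.32180619],
]
a = [[0, 2], [1, 1], [2, 3], [3, 3], [4, 1]]  # (batch row, action) pairs
target_Q = [-0.79592603, -4.67121124, -1.58115745, 0.19765809, 1.02122998]

DELTA = 1.0  # assumed Huber threshold; not stated in the text


def huber(err, delta=DELTA):
    """Quadratic for |err| <= delta, linear beyond it."""
    if abs(err) <= delta:
        return 0.5 * err ** 2
    return delta * (abs(err) - 0.5 * delta)


def huber_grad(err, delta=DELTA):
    """Derivative of huber w.r.t. err: err clipped to [-delta, delta]."""
    return max(-delta, min(delta, err))


loss = []
bias_grad = [0.0] * len(Q[0])  # assumed to correspond to gvs[-1][0]
for (row, action), target in zip(a, target_Q):
    err = Q[row][action] - target
    loss.append(huber(err))
    # Only the chosen action's output receives a gradient.
    bias_grad[action] += huber_grad(err)

print("loss:", loss)
print("bias_grad:", bias_grad)
```

Under these assumptions the results agree with the printed `loss` and `grads` to about six decimal places. The first four samples have errors larger than 1, so their gradients are clipped to 1. Action 1 receives one clipped gradient of 1 and one unclipped gradient of about -0.896, which sum to roughly 0.104. That clipping is what keeps the Huber gradients bounded where the squared-error gradients would grow with the error.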