# Backprop Examples
While troubleshooting some exploding gradients, I decided to walk through backpropagation for the first step, line by line. Here are a couple of examples for two different loss functions: squared error and Huber loss.

## Squared error

The squared error is given by:
    
    L = np.sum((target_Q - Q)**2)
   
And thus the gradient with respect to Q becomes:

    dL/dQ = -2 * (target_Q - Q)
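
A cheap way to sanity-check an analytic gradient like this is to compare it against a central finite-difference estimate. Below is a minimal NumPy sketch. The helper names `squared_error`, `analytic_grad`, and `numerical_grad` are hypothetical and do not come from the original notebook, and the random inputs are made up for illustration.

```python
import numpy as np

def squared_error(q, target_q):
    # L = sum((target_Q - Q)**2); note ** is exponentiation in Python, ^ is XOR
    return np.sum((target_q - q) ** 2)

def analytic_grad(q, target_q):
    # dL/dQ = -2 * (target_Q - Q)
    return -2.0 * (target_q - q)

def numerical_grad(f, q, eps=1e-6):
    # Central differences: perturb one element of q at a time
    grad = np.zeros_like(q)
    for i in range(q.size):
        step = np.zeros_like(q)
        step.flat[i] = eps
        grad.flat[i] = (f(q + step) - f(q - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
q = rng.normal(size=4)
target_q = rng.normal(size=4)
num = numerical_grad(lambda x: squared_error(x, target_q), q)
print(np.allclose(num, analytic_grad(q, target_q), atol=1e-5))  # True
```

Because the loss is quadratic in Q, the central difference matches the analytic gradient up to floating-point rounding.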

Because Q is given by:

    Q_i = b_i + w_i,1 * x_1 + w_i,2 * x_2 + ... + w_i,n * x_n

where `x_j` is the j-th input from the previous (fully connected) layer, the gradients with respect to the biases and weights of Q are:

    dL/db_i   = dL/dQ_i * 1   = -2 * (target_Q_i - Q_i)
    dL/dw_i,j = dL/dQ_i * x_j = -2 * (target_Q_i - Q_i) * x_j
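
Chaining dL/dQ through the fully connected layer gives the bias gradient (dL/dQ itself) and the weight gradient (an outer product with the input). Below is a minimal NumPy sketch. It assumes a single input vector `x` of length n and a weight matrix `W` of shape (num_actions, n), so that `Q = b + W @ x`. These shapes, the function names `dense_forward`/`dense_backward`, and the random values are assumptions for illustration, not the notebook's actual network.

```python
import numpy as np

def dense_forward(W, b, x):
    # Q_i = b_i + sum_j W[i, j] * x[j]
    return b + W @ x

def dense_backward(dL_dQ, x):
    # dL/db_i = dL/dQ_i * 1
    dL_db = dL_dQ.copy()
    # dL/dW[i, j] = dL/dQ_i * x[j], i.e. an outer product
    dL_dW = np.outer(dL_dQ, x)
    return dL_dW, dL_db

rng = np.random.default_rng(1)
n_actions, n_inputs = 4, 3
W = rng.normal(size=(n_actions, n_inputs))
b = rng.normal(size=n_actions)
x = rng.normal(size=n_inputs)
target_q = rng.normal(size=n_actions)

q = dense_forward(W, b, x)
dL_dQ = -2.0 * (target_q - q)
dL_dW, dL_db = dense_backward(dL_dQ, x)

def loss_fn(W_):
    # Squared-error loss as a function of the weights only
    return np.sum((target_q - dense_forward(W_, b, x)) ** 2)

# Finite-difference check on a single weight, W[0, 1]
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
num = (loss_fn(W_plus) - loss_fn(W_minus)) / (2 * eps)
print(np.isclose(num, dL_dW[0, 1], atol=1e-5))  # True
```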

Let's run through an example. After initializing the variables of a DQN and copying them to the target network, we compute Q(s, a) and target_Q(s, a) with the primary and target networks, respectively.

    Q
    [[ 5.34987879 -2.93507671 -1.3103404   1.01032877]]

    target_Q
    [[ 4.39664125 -2.93507671 -1.3103404   1.01032877]]

    loss
    0.908662

    sum_grad
    [[ 1.  1.  1.  1.]]

    square_grad
    [[-1.90647507  0.          0.          0.        ]]

    sub_grad
    [[ 1.90647507  0.          0.          0.        ]]

    Q_grad
    [(array([ 1.90647507,  0.        ,  0.        ,  0.        ]
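
These printed values can be reproduced by hand. Below is a minimal NumPy sketch that recomputes each intermediate from the printed Q and target_Q. The variable names mirror the printed labels, but splitting the loss into subtract, square, and sum steps is an assumption about how the original graph was built. Since NumPy uses float64 here, the last printed digit may differ slightly from the notebook output.

```python
import numpy as np

Q = np.array([[5.34987879, -2.93507671, -1.3103404, 1.01032877]])
target_Q = np.array([[4.39664125, -2.93507671, -1.3103404, 1.01032877]])

# Forward pass: subtract, square, sum
diff = target_Q - Q
loss = np.sum(diff ** 2)

# Backward pass, one op at a time
sum_grad = np.ones_like(diff)        # d(sum)/d(element) = 1 for every element
square_grad = sum_grad * 2 * diff    # d(diff**2)/d(diff) = 2 * diff
sub_grad = -square_grad              # diff = target_Q - Q, so d(diff)/dQ = -1
Q_grad = sub_grad

print(round(loss, 6))          # 0.908662
print(round(Q_grad[0, 0], 6))  # 1.906475
```

Only index 0 has a nonzero gradient, because it is the only entry where Q and target_Q differ.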

Because the primary and target networks start from identical parameters, target_Q equals Q for every action except the one that was taken (here, index 0), so only that entry contributes to the loss and its gradient. target_Q is built as follows:

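    # Best Q-value of the next state s2, from the target network
    q2 = np.max(target_network.get_q_values(s2))
    # Start from the target network's Q-values for the current state s1...
    target_q = target_network.get_q_values(s1)
    # ...and overwrite only the entry for the action taken with the TD target
    target_q[a] = r + gamma * (1 - isterminal) * q2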
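
The target construction can be exercised without a network at all. Below is a minimal sketch, assuming `get_q_values` returns a 1-D array of Q-values for a state. `fake_q_values` is a hypothetical stand-in for `target_network.get_q_values`, `build_target` is a hypothetical wrapper, and the state names, Q-values, `r`, `gamma`, and `isterminal` values are all made up for illustration.

```python
import numpy as np

def fake_q_values(state):
    # Hypothetical stand-in for target_network.get_q_values (returns a 1-D array)
    table = {
        "s1": np.array([5.0, -2.9, -1.3, 1.0]),
        "s2": np.array([0.5, 2.0, -1.0, 0.0]),
    }
    return table[state].copy()  # copy so the in-place write below cannot mutate the table

def build_target(q_fn, s1, a, r, s2, gamma, isterminal):
    q2 = np.max(q_fn(s2))                            # best next-state value
    target_q = q_fn(s1)                              # current-state values
    target_q[a] = r + gamma * (1 - isterminal) * q2  # overwrite only the taken action
    return target_q

target = build_target(fake_q_values, "s1", a=0, r=1.0, s2="s2", gamma=0.99, isterminal=0)
print(target)  # index 0 becomes 1.0 + 0.99 * 2.0 = 2.98; the other entries are unchanged
```

The `.copy()` is a deliberate choice: if `get_q_values` ever returned a reference to cached values, the in-place write to `target_q[a]` would silently corrupt them.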

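Summing the squared differences between Q and target_Q over all elements gives the loss. Since only index 0 differs, the loss is (4.39664125 - 5.34987879)² = (-0.95323754)² ≈ 0.908662, and the gradient at that index is -2 × (-0.95323754) ≈ 1.906475, matching `Q_grad`.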