# **Chapter 4 - THE PRELIMINARIES: A CRASH COURSE**

## **4.3 Automatic Differentiation**

In [2]:
from mxnet import autograd, nd

#### **4.3.1 A Simple Example**

- As a toy example, say that we are interested in differentiating the mapping **y = 2x⊤x** with respect to the column vector x.<br> 
  To start, let’s create the variable x and assign it an initial value.

In [3]:
x = nd.arange(4)
x


[0. 1. 2. 3.]
<NDArray 4 @cpu(0)>

In [4]:
x.attach_grad()

In [5]:
with autograd.record():
    y = 2 * nd.dot(x, x)
y    


[28.]
<NDArray 1 @cpu(0)>

In [6]:
y.backward()

- The gradient of the function **y = 2x⊤x** with respect to x should be **_4x_**.<br> 
  Now let’s verify that the gradient produced is correct.

In [7]:
print(x)
print(x.grad)
print(x.grad - 4 * x)


[0. 1. 2. 3.]
<NDArray 4 @cpu(0)>

[ 0.  4.  8. 12.]
<NDArray 4 @cpu(0)>

[0. 0. 0. 0.]
<NDArray 4 @cpu(0)>
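
The analytic gradient 4x can also be cross-checked without MXNet. The following is a minimal sketch in plain Python that approximates each partial derivative with a central finite difference. The helper names `f` and `numerical_grad` and the step size `eps` are illustrative assumptions, not part of the MXNet API.

```python
def f(x):
    """y = 2 * x^T x for a list of floats."""
    return 2.0 * sum(xi * xi for xi in x)


def numerical_grad(func, x, eps=1e-4):
    """Approximate the gradient of func at x by central differences."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        grad.append((func(x_plus) - func(x_minus)) / (2 * eps))
    return grad


x = [0.0, 1.0, 2.0, 3.0]
approx = numerical_grad(f, x)
exact = [4.0 * xi for xi in x]  # analytic gradient 4x
max_error = max(abs(a - e) for a, e in zip(approx, exact))
print(approx, max_error)
```

Because the function is quadratic, the central difference agrees with 4x up to floating-point rounding, so `max_error` should be very small.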


In [8]:
with autograd.record():
    y = x.norm()
y.backward()
x.grad


[0.         0.26726124 0.5345225  0.80178374]
<NDArray 4 @cpu(0)>
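
The values printed above match the standard result that, for x ≠ 0, the gradient of the L2 norm ‖x‖ is x / ‖x‖. As a small sketch in plain Python (the variable names are illustrative), the same numbers can be reproduced by hand:

```python
import math

x = [0.0, 1.0, 2.0, 3.0]
norm = math.sqrt(sum(xi * xi for xi in x))  # ||x||_2 = sqrt(14)
grad = [xi / norm for xi in x]  # analytic gradient x / ||x||_2
print(grad)
```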

#### **4.3.2 Backward for Non-scalar Variable**

In [9]:
print('x vector : ', x)

with autograd.record(): # y is a vector
    y = x * x
print('y vector : ', y)    
y.backward()
print('x.grad : ', x.grad)

u = x.copy()
u.attach_grad()

with autograd.record(): # v is scalar
    v = (u * u).sum()
print('v scalar : ', v)    
v.backward()
print('u.grad : ', u.grad)

x.grad - u.grad

x vector :  
[0. 1. 2. 3.]
<NDArray 4 @cpu(0)>
y vector :  
[0. 1. 4. 9.]
<NDArray 4 @cpu(0)>
x.grad :  
[0. 2. 4. 6.]
<NDArray 4 @cpu(0)>
v scalar :  
[14.]
<NDArray 1 @cpu(0)>
u.grad :  
[0. 2. 4. 6.]
<NDArray 4 @cpu(0)>



[0. 0. 0. 0.]
<NDArray 4 @cpu(0)>
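
The comparison above shows that calling `backward` on the vector `y = x * x` yields the same gradient as differentiating the scalar `(u * u).sum()`. One way to see why is to view the backward pass as a vector-Jacobian product with a head gradient of ones. The sketch below spells this out in plain Python; the explicit `jacobian` and `head` lists are illustrative stand-ins, not objects that MXNet exposes.

```python
x = [0.0, 1.0, 2.0, 3.0]
n = len(x)

# Jacobian of y = x * x (elementwise): J[i][j] = dy_i/dx_j,
# a diagonal matrix with entries 2 * x_i.
jacobian = [[2.0 * x[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

# A head gradient of ones corresponds to differentiating sum(y).
head = [1.0] * n

# Vector-Jacobian product: grad_j = sum_i head_i * J[i][j].
grad = [sum(head[i] * jacobian[i][j] for i in range(n)) for j in range(n)]
print(grad)  # [0.0, 2.0, 4.0, 6.0]
```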

#### **4.3.3 Detach Computations**

In [10]:
with autograd.record():
    y = x * x
    u = y.detach()
    z = u * x
print('x : ', x)
print('u : ', u)
print('z : ', z)

z.backward()

print('x.grad : ', x.grad)
print('u : ', u)

x.grad - u

x :  
[0. 1. 2. 3.]
<NDArray 4 @cpu(0)>
u :  
[0. 1. 4. 9.]
<NDArray 4 @cpu(0)>
z :  
[ 0.  1.  8. 27.]
<NDArray 4 @cpu(0)>
x.grad :  
[0. 1. 4. 9.]
<NDArray 4 @cpu(0)>
u :  
[0. 1. 4. 9.]
<NDArray 4 @cpu(0)>



[0. 0. 0. 0.]
<NDArray 4 @cpu(0)>

- The backward pass above computes **∂(u ⊙ x)/∂x = u**, where u = x ⊙ x is treated as a constant, instead of **∂x³/∂x = 3x²**.

- Since the computation of y is still recorded, we can call y.backward() to get **∂y/∂x = 2x**.

In [11]:
y.backward()
print('y : ', y)
print('x.grad : ', x.grad)
print('x : ', x)

x.grad - 2*x

y :  
[0. 1. 4. 9.]
<NDArray 4 @cpu(0)>
x.grad :  
[0. 2. 4. 6.]
<NDArray 4 @cpu(0)>
x :  
[0. 1. 2. 3.]
<NDArray 4 @cpu(0)>



[0. 0. 0. 0.]
<NDArray 4 @cpu(0)>
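
To make the effect of `detach` concrete, here is a toy scalar reverse-mode differentiator in plain Python. It is only a sketch under simplifying assumptions: the `Var` class is hypothetical, it handles multiplication only, and it accumulates gradients by walking every path in the graph, which is not how MXNet is implemented.

```python
class Var:
    """Scalar value that records how it was computed (hypothetical toy class)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (parent Var, local derivative)
        self.grad = 0.0

    def __mul__(self, other):
        # d(a*b)/da = b and d(a*b)/db = a
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def detach(self):
        # Same value, but no recorded history: treated as a constant.
        return Var(self.value)

    def backward(self, upstream=1.0):
        # Accumulate the chain-rule contribution along every recorded path.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)


# With detach: z = u * x, where u = x * x is cut off from the graph.
x = Var(3.0)
y = x * x
u = y.detach()
z = u * x
z.backward()
detached_grad = x.grad  # equals u = 9.0

# Without detach: z = x * x * x, so dz/dx = 3 * x^2.
x2 = Var(3.0)
z2 = (x2 * x2) * x2
z2.backward()
full_grad = x2.grad  # equals 27.0

print(detached_grad, full_grad)
```

With the detached copy, only the direct dependence of `z` on `x` contributes, so the gradient is u rather than 3x².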

#### **4.3.4 Attach Gradients to Internal Variables**

In [12]:
y = nd.ones(4) * 2 
y.attach_grad() 

with autograd.record():
    u = x * y
    u.attach_grad() # implicitly run u = u.detach()
    z = u + x 
    
z.backward() 
x.grad, u.grad, y.grad

(
 [1. 1. 1. 1.]
 <NDArray 4 @cpu(0)>, 
 [1. 1. 1. 1.]
 <NDArray 4 @cpu(0)>, 
 [0. 0. 0. 0.]
 <NDArray 4 @cpu(0)>)

#### **4.3.5 Head Gradients**

In [13]:
y = nd.ones(4) * 2 
y.attach_grad() 

with autograd.record():
    u = x * y
    v = u.detach() # u still keeps the computation graph
    v.attach_grad()
    z = v + x 
    
z.backward() 
x.grad, y.grad

(
 [1. 1. 1. 1.]
 <NDArray 4 @cpu(0)>, 
 [0. 0. 0. 0.]
 <NDArray 4 @cpu(0)>)

In [14]:
u.backward(v.grad) 
x.grad, y.grad

(
 [2. 2. 2. 2.]
 <NDArray 4 @cpu(0)>, 
 [0. 1. 2. 3.]
 <NDArray 4 @cpu(0)>)
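
Passing `v.grad` to `u.backward` applies the chain rule across the detached boundary: the local derivatives of `u = x * y` are multiplied elementwise by the head gradient. The sketch below reproduces the numbers above in plain Python. The lists are stand-ins for the NDArrays in the cell, and `v_grad`, `x_grad`, and `y_grad` are illustrative names.

```python
x = [0.0, 1.0, 2.0, 3.0]
y = [2.0, 2.0, 2.0, 2.0]

# z = v + x, so dz/dv = 1 for every element; this is what v.grad holds.
v_grad = [1.0] * len(x)

# u = x * y elementwise, so du/dx = y and du/dy = x.
# u.backward(v_grad) multiplies these local derivatives by the head gradient.
x_grad = [h * yi for h, yi in zip(v_grad, y)]
y_grad = [h * xi for h, xi in zip(v_grad, x)]

print(x_grad)  # [2.0, 2.0, 2.0, 2.0]
print(y_grad)  # [0.0, 1.0, 2.0, 3.0]
```

Note that the output above shows `x.grad` as 2 rather than 3, which is consistent with the second backward pass overwriting the gradient of ones left by `z.backward()` instead of adding to it.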

#### **4.3.6 Computing the Gradient of Python Control Flow**

In [15]:
def f(a):
    b = a * 2
    
    while b.norm().asscalar() < 1000:
        b = b * 2
        
    if b.sum().asscalar() > 0:
        c = b
    else:
        c = 100 * b
        
    return c 

In [16]:
a = nd.random.normal(shape=1) 
a.attach_grad() 

with autograd.record():
    d = f(a) 
    
d.backward() 

In [18]:
print('a : ', a)
print('d : ', d)
print('a.grad : ', a.grad)
print(a.grad == (d / a))

a :  
[1.1630787]
<NDArray 1 @cpu(0)>
d :  
[1190.9926]
<NDArray 1 @cpu(0)>
a.grad :  
[1024.]
<NDArray 1 @cpu(0)>

[1.]
<NDArray 1 @cpu(0)>
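
The check `a.grad == d / a` works because, for a fixed input, `f` only multiplies `a` by constants, so it is linear in `a` along whichever branch is taken. The following plain-Python sketch re-implements `f` for a single float and compares a finite-difference slope with `f(a) / a`. The scalar rewrite and the step size `eps` are assumptions made for illustration.

```python
def f(a):
    """Scalar version of the function above, using plain floats."""
    b = a * 2
    while abs(b) < 1000:  # for a scalar, the norm is the absolute value
        b = b * 2
    return b if b > 0 else 100 * b


a = 1.1630787  # the sampled value shown in the output above
eps = 1e-6
numeric = (f(a + eps) - f(a - eps)) / (2 * eps)  # central difference
ratio = f(a) / a
print(numeric, ratio)
```

The input must be nonzero: for `a = 0` the `while` loop would never terminate, and the ratio `d / a` would be undefined.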


#### **4.3.7 Training Mode and Prediction Mode**

In [19]:
print(autograd.is_training()) 

with autograd.record():
    print(autograd.is_training())

False
True


#### **4.3.8 Summary**


- MXNet provides an autograd package to automate differentiation. <br>
  To do so, we first attach gradients to variables, record the computation, and then run the backward function.<br><br>
- We can detach computations and pass head gradients to the backward function to control which part of the computation is used in the backward pass.<br><br>
- The running modes of MXNet include the training mode and the prediction mode. <br>
  We can determine the running mode by autograd.is_training().

#### **4.3.9 Exercises**


____