# Lab Assignment 2
## With this assignment you will get to know more about gradient descent optimization and writing your own functions with forward and backward (i.e., gradient) passes
## You need to complete all the tasks in this notebook in the lab and show your work to the TA. Edit only the portions of the cells where you are asked to do so!

In [0]:
import torch
from torch.autograd import Variable
from torch.autograd import Function
import torch.nn.functional as F
import numpy as np

## Huber loss function
https://en.wikipedia.org/wiki/Huber_loss

In [0]:
# A loss function measures distance between a predicted and a target tensor
# An implementation of Huber loss function is given below
# We will make use of this loss function in gradient descent optimization
def Huber_Loss(input,delta):
  m = (torch.abs(input)<=delta).detach().float()
  output = torch.sum(0.5*m*input**2 + delta*(1.0-m)*(torch.abs(input)-0.5*delta))
  return output

# Test Huber loss with a couple of different examples

In [19]:
a = torch.tensor([[0.3, 2.0, -3.1],[0.5, 9.2, 0.1]])
print(a.numpy())
ha = Huber_Loss(a,1.0)
print(ha.numpy())

b = torch.tensor([0.3, 2.0])
print(b.numpy())
hb = Huber_Loss(b,1.0)
print(hb.numpy())

[[ 0.3  2.  -3.1]
 [ 0.5  9.2  0.1]]
12.974999
[0.3 2. ]
1.545
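The mask `m` in `Huber_Loss` stands in for an if-statement applied to each element. As a rough cross-check, the same total can be computed with an explicit branch in plain Python. The helper name `huber_elementwise` is hypothetical and not part of the assignment.

```python
def huber_elementwise(r, delta):
    """Huber loss of a single residual r, written with an explicit branch."""
    if abs(r) <= delta:
        return 0.5 * r * r                   # quadratic region, |r| <= delta
    return delta * (abs(r) - 0.5 * delta)    # linear region, |r| > delta

# Same values as tensor `a` above, flattened into a list
values = [0.3, 2.0, -3.1, 0.5, 9.2, 0.1]
total = sum(huber_elementwise(v, 1.0) for v in values)
print(round(total, 6))  # 12.975
```

Up to float32 rounding, this matches the value printed for `ha` above.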


# Gradient descent code
## Study the following generic gradient descent optimization code.
## Huber loss f measures the distance between a probability vector z and a one-hot target vector target.
## When f.backward() is called, PyTorch first computes $\nabla_z f$ (the gradient of f with respect to z), then by the chain rule it computes $\nabla_{var} f = J^{z}_{var} \nabla_z f$, where $J^{z}_{var}$ is the Jacobian of z with respect to var.
## Next, the optimizer.step() call adjusts the variable var in the opposite direction of $\nabla_{var} f.$

In [0]:
def gradient_descent(var,optimizer,softmax,loss,target,nIter,nPrint):
  for i in range(nIter):
    z = softmax(var)
    f = loss(z-target,1.0)
    optimizer.zero_grad()
    f.backward()
    optimizer.step()
    if i%nPrint==0:
      with np.printoptions(precision=3, suppress=True):
        print("Iteration:",i,"Variable:", z.detach().numpy(),"Loss: %0.6f" % f.item())


# Gradient descent with Huber Loss
## The following cell shows how gradient_descent function can be used.
## The cell first creates a target 1-hot vector y, where only the 3rd place is on.
## It also creates a variable x with random initialization and an optimizer.
## Learning rate and momentum have been set to 0.1 and 0.9, respectively.
## Then it calls gradient_descent function.
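For intuition about what the optimizer does with these two settings, here is a minimal sketch of the momentum update on a one-dimensional quadratic. It follows the update rule documented for `torch.optim.SGD` with default dampening and no Nesterov momentum. The function name `sgd_momentum`, the step count, and the toy objective are illustrative assumptions, not part of the assignment.

```python
def sgd_momentum(p, grad_fn, lr=0.1, momentum=0.9, steps=400):
    """Minimize a scalar function using SGD with momentum."""
    velocity = 0.0
    for _ in range(steps):
        g = grad_fn(p)
        velocity = momentum * velocity + g   # accumulate past gradients
        p = p - lr * velocity                # step against the accumulated direction
    return p

# Toy objective f(p) = 0.5 * p**2, whose gradient is simply p
p_final = sgd_momentum(5.0, lambda p: p)
print(abs(p_final) < 1e-3)
```

With momentum, each step follows a running combination of past gradients rather than the current gradient alone, which is why the iterates can overshoot slightly before settling.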

In [21]:
y = torch.zeros(10)
y[2] = 1.0
print("Target 1-hot vector:",y.numpy())
x = Variable(torch.randn(y.shape),requires_grad=True)

optimizer = torch.optim.SGD([x], lr=1e-1, momentum=0.9) # create an optimizer that will do gradient descent optimization

gradient_descent(x,optimizer,F.softmax,Huber_Loss,y,1000,100)


Target 1-hot vector: [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
Iteration: 0 Variable: [0.051 0.317 0.022 0.018 0.062 0.008 0.118 0.204 0.045 0.155] Loss: 0.572318
Iteration: 100 Variable: [0.007 0.009 0.937 0.003 0.008 0.001 0.009 0.009 0.007 0.009] Loss: 0.002269
Iteration: 200 Variable: [0.005 0.006 0.956 0.002 0.006 0.001 0.006 0.006 0.005 0.006] Loss: 0.001073
Iteration: 300 Variable: [0.004 0.005 0.964 0.002 0.005 0.001 0.005 0.005 0.004 0.005] Loss: 0.000733
Iteration: 400 Variable: [0.004 0.004 0.969 0.002 0.004 0.001 0.005 0.005 0.003 0.005] Loss: 0.000556
Iteration: 500 Variable: [0.003 0.004 0.972 0.001 0.004 0.001 0.004 0.004 0.003 0.004] Loss: 0.000448
Iteration: 600 Variable: [0.003 0.004 0.974 0.001 0.003 0.001 0.004 0.004 0.003 0.004] Loss: 0.000375


Iteration: 700 Variable: [0.003 0.003 0.976 0.001 0.003 0.001 0.003 0.003 0.003 0.003] Loss: 0.000322
Iteration: 800 Variable: [0.003 0.003 0.978 0.001 0.003 0.001 0.003 0.003 0.002 0.003] Loss: 0.000283
Iteration: 900 Variable: [0.002 0.003 0.979 0.001 0.003 0.001 0.003 0.003 0.002 0.003] Loss: 0.000252


# <font color='red'>20% Weight:</font> In this markdown using math mode write gradient of Huber loss function: $output = \sum_i 0.5 m_i (input)^{2}_{i} + \delta (1-m_i)(|input_i|-0.5 \delta)$ with respect to $input.$ Treat $m_i$ as independent of $input_i,$ because $m_i$ replaces the if control statement.
## Your solution <font color='red'>20% (complete formula)</font>: $\frac{\partial (output)}{\partial (input)_i} =  m_i (input)_{i} + \delta (1-m_i) \, \text{sign}((input)_i)$
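One way to sanity-check a gradient formula like this is to compare it with central finite differences. The sketch below does so in NumPy; the helper names `huber` and `huber_grad` and the test point are assumptions made for illustration, and the test point is chosen away from the kink at $|input_i| = \delta$, where the mask changes value.

```python
import numpy as np

def huber(x, delta):
    """Summed Huber loss, using the same mask trick as Huber_Loss."""
    m = (np.abs(x) <= delta).astype(float)
    return np.sum(0.5 * m * x**2 + delta * (1.0 - m) * (np.abs(x) - 0.5 * delta))

def huber_grad(x, delta):
    """Analytic gradient: x in the quadratic region, delta * sign(x) outside it."""
    m = (np.abs(x) <= delta).astype(float)
    return m * x + delta * (1.0 - m) * np.sign(x)

x = np.array([0.3, 2.0, -3.1, 0.5, 9.2, 0.1])
delta, eps = 1.0, 1e-6

# Central finite differences, perturbing one coordinate at a time
numeric = np.array([
    (huber(x + eps * e, delta) - huber(x - eps * e, delta)) / (2 * eps)
    for e in np.eye(x.size)
])
print(np.allclose(numeric, huber_grad(x, delta), atol=1e-5))
```

The loop here is only for the numerical check; the analytic gradient itself is fully vectorized.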

# <font color='red'>20% Weight:</font> Define your own (correct!) rule of differentiation for Huber loss function
## Edit indicated line in the cell below. Use the following formula. Do not use for/while/any loop in your solution.
## For this function chain rule (Jacobian-vector product) takes the following form: $\frac{\partial (cost)}{\partial (input)_i} = \frac{\partial (output)}{\partial (input)_i} \frac{\partial (cost)}{\partial (output)}.$
## In the backward method below, $\frac{\partial (cost)}{\partial (output)}$ is denoted by output_grad and the ith component of input_grad is symbolized by $\frac{\partial (cost)}{\partial (input)_i}.$

In [0]:
# Inherit from Function
class My_Huber_Loss(Function):

    # Note that both forward and backward are @staticmethods
    @staticmethod
    def forward(ctx, input, delta):
        m = (torch.abs(input)<=delta).float()
        ctx.save_for_backward(input, m, torch.tensor(delta))  # m is already a tensor; only delta needs wrapping
        output = torch.sum(0.5*m*input**2 + delta*(1.0-m)*(torch.abs(input)-0.5*delta))
        return output

    @staticmethod
    def backward(ctx, output_grad):
        # retrieve saved tensors and use them in derivative calculation
        input, m, delta = ctx.saved_tensors

        # Return Jacobian-vector product (chain rule)
        # For Huber loss function the Jacobian happens to be a diagonal matrix
        # Also, note that output_grad is a scalar, because forward function returns a scalar value
        input_grad = (m * input + delta * (1-m) * torch.sign(input)) * output_grad # complete this line, do not use for loop
        # must return two gradients because forward takes two arguments; delta needs no gradient, hence None
        return input_grad, None

# Gradient Descent on Your Own Huber Loss
## You should get almost identical results as before if your rule of differentiation is correct!

In [23]:
y = torch.zeros(10)
y[2] = 1.0
print("Target:",y.numpy())
x = Variable(torch.randn(y.shape),requires_grad=True)

optimizer = torch.optim.SGD([x], lr=1e-1, momentum=0.9) # create an optimizer that will do gradient descent optimization

gradient_descent(x,optimizer,F.softmax,My_Huber_Loss.apply,y,1000,100)


Target: [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
Iteration: 0 Variable: [0.115 0.132 0.032 0.219 0.049 0.045 0.176 0.128 0.022 0.082] Loss: 0.537294
Iteration: 100 Variable: [0.008 0.008 0.94  0.008 0.005 0.005 0.008 0.008 0.003 0.007] Loss: 0.002031
Iteration: 200 Variable: [0.006 0.006 0.957 0.006 0.004 0.004 0.006 0.006 0.002 0.005] Loss: 0.001055
Iteration: 300 Variable: [0.005 0.005 0.964 0.005 0.003 0.003 0.005 0.005 0.002 0.004] Loss: 0.000727
Iteration: 400 Variable: [0.004 0.004 0.969 0.004 0.003 0.003 0.004 0.004 0.002 0.004] Loss: 0.000554
Iteration: 500 Variable: [0.004 0.004 0.972 0.004 0.003 0.002 0.004 0.004 0.001 0.003] Loss: 0.000447


Iteration: 600 Variable: [0.003 0.003 0.974 0.003 0.002 0.002 0.003 0.003 0.001 0.003] Loss: 0.000375
Iteration: 700 Variable: [0.003 0.003 0.976 0.003 0.002 0.002 0.003 0.003 0.001 0.003] Loss: 0.000323
Iteration: 800 Variable: [0.003 0.003 0.977 0.003 0.002 0.002 0.003 0.003 0.001 0.003] Loss: 0.000283
Iteration: 900 Variable: [0.003 0.003 0.979 0.003 0.002 0.002 0.003 0.003 0.001 0.003] Loss: 0.000252


# <font color='red'>30% Weight:</font> In this markdown using math mode write Jacobian of softmax function: $(output)_i = \frac{exp((input)_i)}{ \sum_j exp((input)_j)}.$
## Your solution (<font color='red'>show your derivation to TA</font>): 
\begin{equation*}
    \frac{\partial (output)_j}{\partial (input)_i} = \begin{cases}
               (output)_i -(output)_j (output)_i,               & i = j,\\
               -(output)_j (output)_i, & \text{otherwise.}
           \end{cases}
\end{equation*}

Solution (matrix form, with $output$ treated as a column vector): $\left[\frac{\partial (output)_j}{\partial (input)_i}\right]_{j,i} = diag(output) - output \, output^\top$
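As a hedged numerical check of the softmax Jacobian, the sketch below builds $diag(s) - s s^\top$ in NumPy (with $s$ the softmax output) and compares it with central finite differences. The function names `softmax` and `softmax_jacobian` and the test vector are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax of a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(x):
    """J[j, i] = d output_j / d input_i = diag(s) - s s^T."""
    s = softmax(x)
    return np.diag(s) - np.outer(s, s)

x = np.array([0.5, -1.0, 2.0, 0.1])
eps = 1e-6

# Column i holds the change in every output when input i is perturbed
numeric = np.stack(
    [(softmax(x + eps * e) - softmax(x - eps * e)) / (2 * eps) for e in np.eye(x.size)],
    axis=1,
)
J = softmax_jacobian(x)
print(np.allclose(numeric, J, atol=1e-6))
print(np.allclose(J, J.T), np.allclose(J.sum(axis=0), 0.0))
```

The last line checks two properties that follow from the formula: the Jacobian is symmetric, and each column sums to zero because the softmax outputs always sum to one.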



# <font color='red'>30% Weight:</font> Your own softmax with forward and backward functions
## Edit indicated line in the cell below. Use the following formula. Do not use for/while/any loop in your solution.
## The Jacobian-vector product (chain rule) takes the following form using summation sign: $\frac{\partial (cost)}{\partial (input)_i} = \sum_j \frac{\partial (output)_j}{\partial (input)_i} \frac{\partial (cost)}{\partial (output)_j}$
## Once again, note that in the backward method below, the ith component of grad_input and the jth component of grad_output are denoted by $\frac{\partial (cost)}{\partial (input)_i}$ and $\frac{\partial (cost)}{\partial (output)_j}$, respectively.

In [0]:
# Inherit from Function
class My_softmax(Function):

    # Note that both forward and backward are @staticmethods
    @staticmethod
    def forward(ctx, input):
        output = F.softmax(input,dim=0)
        ctx.save_for_backward(output) # this is the only tensor you will need to save for backward function
        return output

    # This function has only a single output, so it gets only one gradient
    @staticmethod
    def backward(ctx, grad_output):
        output = ctx.saved_tensors[0]
        # retrieve saved tensors and use them in derivative calculation
        # return Jacobian-vector product
        grad_input = torch.sum((torch.diag(output) - torch.ger(output, output)) * grad_output, dim = 1) # Complete this line
        return grad_input
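Because the softmax Jacobian has the form $diag(s) - s s^\top$, its product with a vector $g$ simplifies to $s \odot (g - s^\top g)$, which avoids building the full matrix. The sketch below compares the full-Jacobian product, this shortcut, and PyTorch's own autograd result for `F.softmax`. The helper name `softmax_vjp` is hypothetical, and the code assumes a PyTorch version that provides `torch.outer`.

```python
import torch
import torch.nn.functional as F

def softmax_vjp(output, grad_output):
    """(diag(s) - s s^T) @ g computed as s * (g - <s, g>), without forming the matrix."""
    return output * (grad_output - torch.dot(output, grad_output))

torch.manual_seed(0)
x = torch.randn(10, dtype=torch.double, requires_grad=True)
g = torch.randn(10, dtype=torch.double)   # stand-in for the incoming gradient

s = F.softmax(x, dim=0)
(reference,) = torch.autograd.grad(s, x, grad_outputs=g)   # autograd's own result

s_val = s.detach()
explicit = (torch.diag(s_val) - torch.outer(s_val, s_val)) @ g   # full-Jacobian version
fast = softmax_vjp(s_val, g)                                     # matrix-free version

print(torch.allclose(explicit, reference), torch.allclose(fast, reference))
```

For a vector of length 10 the difference is negligible, but the matrix-free form needs memory linear in the input size instead of quadratic.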

# Gradient Descent on your own Huber Loss and your own softmax

In [25]:
y = torch.zeros(10)
y[2] = 1.0
print(y)
x = Variable(torch.randn(y.shape),requires_grad=True)
print(x)

optimizer = torch.optim.SGD([x], lr=1e-1, momentum=0.9) # create an optimizer that will do gradient descent optimization

gradient_descent(x,optimizer,My_softmax.apply,My_Huber_Loss.apply,y,1000,100)


tensor([0., 0., 1., 0., 0., 0., 0., 0., 0., 0.])
tensor([ 0.8756, -1.4361, -0.6156,  0.2791, -0.2809, -0.3099,  1.8846,  0.9310,
        -1.4725,  0.6733], requires_grad=True)
Iteration: 0 Variable: [0.139 0.014 0.031 0.076 0.044 0.042 0.381 0.147 0.013 0.113] Loss: 0.573427
Iteration: 100 Variable: [0.009 0.002 0.943 0.008 0.006 0.006 0.008 0.009 0.002 0.009] Loss: 0.001859
Iteration: 200 Variable: [0.006 0.002 0.958 0.006 0.004 0.004 0.006 0.006 0.002 0.006] Loss: 0.001008
Iteration: 300 Variable: [0.005 0.001 0.965 0.005 0.004 0.004 0.005 0.005 0.001 0.005] Loss: 0.000702
Iteration: 400 Variable: [0.005 0.001 0.969 0.004 0.003 0.003 0.004 0.005 0.001 0.005] Loss: 0.000538
Iteration: 500 Variable: [0.004 0.001 0.972 0.004 0.003 0.003 0.004 0.004 0.001 0.004] Loss: 0.000436


Iteration: 600 Variable: [0.004 0.001 0.975 0.003 0.003 0.003 0.004 0.004 0.001 0.004] Loss: 0.000367
Iteration: 700 Variable: [0.004 0.001 0.976 0.003 0.002 0.002 0.003 0.004 0.001 0.003] Loss: 0.000316
Iteration: 800 Variable: [0.003 0.001 0.978 0.003 0.002 0.002 0.003 0.003 0.001 0.003] Loss: 0.000278
Iteration: 900 Variable: [0.003 0.001 0.979 0.003 0.002 0.002 0.003 0.003 0.001 0.003] Loss: 0.000248
