# Difference Target Propagation
---
In this notebook, we will use an interesting neural network that learns via **target propagation** to classify images from the CIFAR-10 database. Note that we provide a number of instructions (including code and descriptions) for the ease of your learning. Be free to change the provided code if you want but in this case you should explain your motivation in the submitted report.


Our aim is to reproduce the results of the paper entitled "Difference Target Propagation" （ https://arxiv.org/abs/1412.7525 ） on the CIFAR-10 database. Before completing this work, you are required to carefully read the paper and understand its basic idea about the proposed learning strategy. 


![](model.png)

Figure 1. The Schematic overview of the target propagation


### Test for CUDA

Since these are larger (32x32x3) images, it may prove useful to speed up your training time by using a GPU. CUDA is a parallel computing platform and CUDA Tensors are the same as typical Tensors, only they utilize GPU's for computation.

In [1]:
import torch
import numpy as np

# check if CUDA is available
train_on_gpu = torch.cuda.is_available()
device=torch.device('cuda' if torch.cuda.is_available() else 'cpu')
if not train_on_gpu:
    print('CUDA is not available.  Training on CPU ...')
else:
    print('CUDA is available!  Training on GPU ...')

CUDA is available!  Training on GPU ...


---
## Load the Data

If you're not familiar with the Cifar-10, you may find it useful to look at: http://www.cs.toronto.edu/~kriz/cifar.html . 

A copy of the data is also placed on the class website.  

#### TODO: Load the data

In [2]:
#############################################################################
# TODO: load the data   #
#############################################################################
import torchvision
import torchvision.transforms as transforms

batch_size=1024

transform=transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

full_dataset = torchvision.datasets.CIFAR10(
    root='./data',
    train=True,
    download=True,
    transform=transform)
train_length = int(full_dataset.__len__()*0.8)
val_length = full_dataset.__len__() - train_length
train_dataset, val_dataset = torch.utils.data.random_split(full_dataset, [train_length, val_length])
test_dataset = torchvision.datasets.CIFAR10(
    root='./data',
    train=False,
    download=True,
    transform=transform)

trainloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=0)
valloader = torch.utils.data.DataLoader(
    val_dataset,
    batch_size=batch_size,
    shuffle=True,
    num_workers=0)
testloader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=1,
    shuffle=False,
    num_workers=0)
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
#############################################################################
#                          END OF YOUR CODE               #
#############################################################################

Files already downloaded and verified
Files already downloaded and verified


---
## Define the Network Architecture

Here, you'll define a neural network named DTPNet, whose architecture resembles multiple layer perceptron (MLP) but adopts target propagation for parameter updating. You may use the following Pytorch functions to build it.

* [Linear transformation layer](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)
* [Non-linear activations](https://pytorch.org/docs/stable/generated/torch.tanh.html)
* [Gradient computing operation](https://pytorch.org/docs/stable/autograd.html?highlight=torch%20autograd%20grad#torch.autograd.grad)


### TODO: 
#### 1)  Define DTPNet: Completing the __init__ function
**Sizes of the hidden layers**: As the MLP takes a vector as input (ignoring the dimension of the batch), we should transform the image to a vector, whose dimension should be 3072 ($ 32\times 32\times 3 $)。 We suggest that the network architecture was 3072-1000-1000-1000-10. In each layer, the network uses the hyperbolic tangent as the activation function. You can also design these hyperparameters on your own.

 HINT: because we will compute the loss and gradient for each layer instead of chain rule, you might want to build a separate computational graph for each layer. Think of how to do this.
#### 2)  Define the forward path of DTPNet: Completing the "forward" function
The forward path involves computing unit values for all layers, that is,

\begin{align}
\text{for } i&=  \text{1 to } M \\
 & h_i \leftarrow f_i(h_{i-1})
\end{align}
where $f_i$ stands for the transformation layer of the DTPNet. Please refer to the original paper for more information.

#### 3)  Define the backward path of DTPNet: Completing the "backward" function
The computational details of the backward path are described in Algorithm 1 of the paper. It involves computing the targets, calculating the loss and calculate the gradients for each of the layers in the neural network in a top-down manner. 

To make the code more readable, you need first define the "backward" function will call ``compute_target`` and ``reconstruction`` function.

**Important:** The basic idea of the strategy to update the parameter for each layer are presented in the following section. You may need it for completing this function.

##### 3.1)  Calculate the targets for each layer
For the target of the highest layer, it is computed by
$\hat{\mathbf{h}}_{M-1} \leftarrow \mathbf{h}_{M-1} - \hat{\eta} \frac {\partial L} {\partial \mathbf{h}_{M-1}}$, \; ($L$ is the global loss)

For the targets of the lower layers, they are computed by
\begin{align}
\text{for } i&=  M-1 \text{ to } 2 \\
& \hat{\mathbf{h}}_{i-1} \leftarrow \mathbf{h}_{i-1} - g_i(\mathbf{h}_{i}) +  g_i(\hat{\mathbf{h}}_{i})
\end{align}


##### 3.2) Implement the reconstruction function 
The reconstruction function involves $g_i(f_i(\mathbf{h}_{i-1}))$.

In [3]:
import torch
import torch.nn as nn
import torch.optim as optim

#Define a MLP
class DTPNet(nn.Module):
    def __init__(self,hidden_sizes=None):
        super(DTPNet, self).__init__()
        # construct the network
        #############################################################################
        self.num_layers=len(hidden_sizes)-1
        self.criterion = torch.nn.MSELoss()
        
        self.forward_model=dict()
        self.forward_optim=dict()
        for i in range(self.num_layers):
            self.forward_model['F{}'.format(i+1)]=nn.Linear(in_features=hidden_sizes[i],out_features=hidden_sizes[i+1]).to(device)
            self.forward_optim['F{}'.format(i+1)]=optim.RMSprop(params=self.forward_model['F{}'.format(i+1)].parameters(), lr=0.01)
        
        self.backward_model=dict()
        self.backward_optim=dict()
        for i in range(1,self.num_layers-1):
            self.backward_model['G{}'.format(i+1)]=nn.Linear(in_features=hidden_sizes[i+1],out_features=hidden_sizes[i]).to(device)
            self.backward_optim['G{}'.format(i+1)]=optim.RMSprop(params=self.backward_model['G{}'.format(i+1)].parameters(), lr=0.01)
        #############################################################################

    def forward(self, x):
        #Input: a batch of images, the size is [batchsize, 3072]
        #Output:the values of each layer in the network
        values = dict()
        #############################################################################
        for i,k in enumerate(self.forward_model):
            if i==self.num_layers-1:
                x=torch.softmax(self.forward_model[k](x),dim=-1)
            else:
                x=torch.tanh(self.forward_model[k](x))
            values['H{}'.format(i+1)]=x        
        #############################################################################
        return values
    
    #以下函数定义有修改，详见实验报告
    def backward(self, values, global_loss):
        targets=self.compute_target(values, global_loss)

        loss_inv=self.compute_loss_inv(values)
        loss=self.compute_loss(values,targets,global_loss)
        
        for key in loss_inv.keys():
            grad=torch.autograd.grad(loss_inv[key], self.backward_model[key].parameters(), retain_graph = True)
            self.backward_model[key].weight.grad=grad[0]
            self.backward_model[key].bias.grad=grad[1]
        for key in loss.keys():
            grad=torch.autograd.grad(loss[key], self.forward_model[key].parameters(), retain_graph = True)
            self.forward_model[key].weight.grad=grad[0]
            self.forward_model[key].bias.grad=grad[1]

        for key in loss_inv.keys():
            self.backward_optim[key].step()
        for key in loss.keys():
            self.forward_optim[key].step()

    def compute_target(self, values, global_loss):
       #Input: values=[value_1,value_2,...,value_N]: the values of each layer (totally N layers) in the network
        #Output: loss=[target_1,target_2,...,target_N]: the targets of each layer (totally N layers) in the network
        targets  = dict() 
        lr0=0.327736332653       
        targets['H{}_'.format(self.num_layers-1)]=values['H{}'.format(self.num_layers-1)]-lr0*torch.autograd.grad(global_loss, values['H{}'.format(self.num_layers-1)], retain_graph = True)[0]
        for i in range(self.num_layers-1,1,-1):
            targets['H{}_'.format(i-1)]=values['H{}'.format(i-1)]-self.backward_model['G{}'.format(i)](values['H{}'.format(i)])+self.backward_model['G{}'.format(i)](targets['H{}_'.format(i)])      
        
        return targets
    
    def compute_loss(self, values, targets, global_loss):
        loss=dict()

        for i in range(self.num_layers-1):
            loss['F{}'.format(i+1)]=self.criterion(values['H{}'.format(i+1)],targets['H{}_'.format(i+1)])
        loss['F{}'.format(self.num_layers)]=global_loss
        return loss
    
    def compute_loss_inv(self, values):
        #Input: values=[value_1,value_2,...,value_N]: the values of each layer (totally N layers) in the network
        #output: loss=[value_2,...,value_N-1]: the reconstructed values of each layer (totally N layers) in the network
        loss_inv = dict()
        c=0.359829566008
        for i in range(self.num_layers-1,1,-1):
            temp=torch.randn(values['H{}'.format(i-1)].shape).to(device)*(c**1/2)+values['H{}'.format(i-1)]
            loss_inv['G{}'.format(i)]=self.criterion(self.backward_model['G{}'.format(i)](self.forward_model['F{}'.format(i)](temp)),temp)

        return loss_inv
    
    def set_train(self):
        for key in self.forward_model.keys():
            self.forward_model[key].train()
        for key in self.backward_model.keys():
            self.backward_model[key].train()     
    
    def set_eval(self):
        for key in self.forward_model.keys():
            self.forward_model[key].eval()
        for key in self.backward_model.keys():
            self.backward_model[key].eval() 

model = DTPNet([3072,1000,1000,1000,10])
print('forward_model:',model.forward_model)#用字典定义模型层，打印模型层
print('backward_model:',model.backward_model)#用字典定义模型层，打印模型层

forward_model: {'F1': Linear(in_features=3072, out_features=1000, bias=True), 'F2': Linear(in_features=1000, out_features=1000, bias=True), 'F3': Linear(in_features=1000, out_features=1000, bias=True), 'F4': Linear(in_features=1000, out_features=10, bias=True)}
backward_model: {'G2': Linear(in_features=1000, out_features=1000, bias=True), 'G3': Linear(in_features=1000, out_features=1000, bias=True)}


### Specify [Loss Function](http://pytorch.org/docs/stable/nn.html#loss-functions) 

Before the training process, you need to first specify your the loss function. For example, we set the negative log-likelihood as the the global loss of the image classification task. 

#### TODO: Define the loss

In [4]:
#############################################################################
# TODO: define the loss function           #
#############################################################################
global_loss_fn=torch.nn.NLLLoss()
#############################################################################
#                          END OF YOUR CODE               #
#############################################################################

---
## Train the Network
The details of the training phrase of DTPNet can be found in Algorithm 1 of the paper, which is also posted here for your convenience.  Please note that the training process of DTPNet is dramatically different from the traditional neural network. The main difference is that back-propagation uses chain rule to update parameters and the DTP calculates the targets and updates parameters layer by layer. Thus, so you can use the "detach" function in Pytorch to truncate the gradients to avoid auto-differetiation.

Hint: Instead using the backpropagation algorithm, we calculate the loss and then calculate the gradients for each layer, then we can use the gradients for each layer to update the parameters for each layer.
## Examples to show how to update parameters of a single layer by Pytorch
There are two approaches to derive the gradients for the parameters of a neural network, namely, the optimizer module and autograd.grad() function. You can choose one of them or both of them to accomplish this work. Below, we will elaborate them in detail.
### 1)  Optimizer
The ``optimizer`` module controls the parameter updating for the neural network in Pytorch. We take a single layer of the multiple layer perceptron (MLP) as an example, which is defined by
```
single_layer = torch.nn.Linear(2,3)
```
We can build an optimizer via Pytorch.
```
optimizer = torch.optim.RMSprop([{'params':single_layer.parameters(), 'lr': 1}])
```
Then we can update the parameters of single_layer by calling the ``step`` function of the optimizer, for example,
```
x = torch.randn(10,2)
y = torch.randn(10,3)
predict = single_layer(x)
loss = ((predict-y)**2).sum() 
loss.backward()
optimizer.step()
```



### 2) autograd.grad()
This function provides a way for the users to manually obtain the gradients for each of the parameters.
```
x = torch.randn(10,2)
y = torch.randn(10,3)
predict = MLP(x)
loss = ((predict-y)**2).sum()
grad_weight = torch.autograd.grad(loss, MLP.weight, retain_graph = True)[0]

```



<img src="algo.png" alt="drawing" width="800"/>



Remember to look at how the training and validation loss decreases over time and print them.

In [5]:
#############################################################################
# TODO: train and validation              #
#############################################################################
for epoch in range(30):
    train_loss = 0.0
    train_acc = 0.0
    train_correct_count = 0
    model.set_train()
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        inputs=inputs.to(device)
        inputs=inputs.view([inputs.shape[0],3072])
        labels=labels.to(device)

        values = model(inputs)
        outputs = values['H{}'.format(model.num_layers)]
        train_correct_count += (torch.argmax(outputs,dim=1)==labels).sum().cpu().item()
        global_loss = global_loss_fn(outputs, labels)
        model.backward(values,global_loss)
        train_loss += global_loss.cpu().item()

    train_loss=train_loss/(i+1)
    train_acc=train_correct_count/(i+1)/batch_size

    val_loss = 0.0
    val_acc = 0.0
    val_correct_count = 0
    model.set_eval()
    for i, data in enumerate(valloader, 0):
        inputs, labels = data
        inputs=inputs.to(device).view(inputs.shape[0],3072)
        labels=labels.to(device)

        outputs = model(inputs)['H{}'.format(model.num_layers)]
        val_correct_count += (torch.argmax(outputs,dim=1)==labels).sum().cpu().item()
        global_loss = global_loss_fn(outputs, labels)
        val_loss += global_loss.cpu().item()
    val_loss=val_loss/(i+1)
    val_acc=val_correct_count/(i+1)/batch_size

    print('Epoch %d|Train_loss:%.3f Eval_loss:%.3f Train_acc:%.3f Eval_acc:%.3f'%(epoch+1,train_loss,val_loss,train_acc,val_acc))
#############################################################################
#                          END OF YOUR CODE               #
#############################################################################

Epoch 1|Train_loss:-0.331 Eval_loss:-0.339 Train_acc:0.335 Eval_acc:0.341
Epoch 2|Train_loss:-0.373 Eval_loss:-0.350 Train_acc:0.378 Eval_acc:0.352
Epoch 3|Train_loss:-0.384 Eval_loss:-0.355 Train_acc:0.388 Eval_acc:0.356
Epoch 4|Train_loss:-0.391 Eval_loss:-0.360 Train_acc:0.396 Eval_acc:0.362
Epoch 5|Train_loss:-0.396 Eval_loss:-0.375 Train_acc:0.402 Eval_acc:0.378
Epoch 6|Train_loss:-0.400 Eval_loss:-0.375 Train_acc:0.406 Eval_acc:0.375
Epoch 7|Train_loss:-0.405 Eval_loss:-0.363 Train_acc:0.411 Eval_acc:0.364
Epoch 8|Train_loss:-0.407 Eval_loss:-0.371 Train_acc:0.413 Eval_acc:0.372
Epoch 9|Train_loss:-0.409 Eval_loss:-0.381 Train_acc:0.416 Eval_acc:0.383
Epoch 10|Train_loss:-0.413 Eval_loss:-0.380 Train_acc:0.419 Eval_acc:0.378
Epoch 11|Train_loss:-0.414 Eval_loss:-0.383 Train_acc:0.421 Eval_acc:0.381
Epoch 12|Train_loss:-0.417 Eval_loss:-0.376 Train_acc:0.422 Eval_acc:0.374
Epoch 13|Train_loss:-0.418 Eval_loss:-0.380 Train_acc:0.424 Eval_acc:0.381
Epoch 14|Train_loss:-0.419 Eval_lo

---
## Test the Trained Network

Test your trained model on previously unseen data and print the test accuracy of each class and the whole! Try your best to get a better accuracy.

In [6]:
#############################################################################
# TODO: test the trained network             #
#############################################################################
class_count=np.zeros(10,dtype=int)
correct_count=np.zeros(10,dtype=int)
model.set_eval()
for i, data in enumerate(testloader, 0):
    inputs, labels = data
    inputs=inputs.to(device)
    inputs=inputs.view([inputs.shape[0],3072])
    outputs = torch.argmax(model(inputs)['H{}'.format(model.num_layers)],dim=1).cpu().item()
    class_count[labels]+=1
    if outputs==labels:
        correct_count[labels]+=1
for i in range(10):
    print('Test|Class{}('.format(i+1)+classes[i]+')-acc={}'.format(correct_count[i]/class_count[i]))
print('\nTest|Overall-acc={}'.format(np.sum(correct_count/10000)))
#############################################################################
#                          END OF YOUR CODE               #
#############################################################################

Test|Class1(plane)-acc=0.48
Test|Class2(car)-acc=0.453
Test|Class3(bird)-acc=0.246
Test|Class4(cat)-acc=0.171
Test|Class5(deer)-acc=0.311
Test|Class6(dog)-acc=0.339
Test|Class7(frog)-acc=0.464
Test|Class8(horse)-acc=0.403
Test|Class9(ship)-acc=0.561
Test|Class10(truck)-acc=0.496

Test|Overall-acc=0.39239999999999997


### Question: What are your model's weaknesses during your experiment and how might they be improved?
+ 本实验只使用了较为简单的MLP网络，由于网络结构过于简单，最终分类结果较差，仅用作DTP算法是示意。实际上，MLP网络输入时先将图像展开为列向量，这种操作不易获得图像上邻近位置的信息，因而分类效果较差。使用更加复杂的卷积神经网络应当可以达到更好的效果，将卷积神经网络中的一个卷积层或一个卷积块看做一个广义的前向计算函数${f_i}$即可使用DTP算法进行训练。由于任务一中已经验证了ResNet的分类表现，此处不再重复实现。
+ DTP算法中涉及较多的超参数选择，如各层局部参数更新的学习率、随机高斯噪声的功率谱密度以及全局损失函数、优化器的选择。在这些超参数上进行进一步调优可能在相同的网络结构上实现更好的测试表现。
