# <span style="color:#0b486b">  FIT5215: Deep Learning (2022)</span>
***
*CE/Lecturer:*  **Dr Trung Le** | trunglm@monash.edu <br/> <br/>
*Tutor:*  **Mr Tuan Nguyen**  \[tuan.ng@monash.edu \] |**Mr Anh Bui** \[tuananh.bui@monash.edu\] | **Mr Xiaohao Yang** \[xiaohao.yang@monash.edu \] | **Mr Md Mohaimenuzzaman** \[md.mohaimen@monash.edu \] |**Mr Thanh Nguyen** \[Thanh.Nguyen4@monash.edu \] |
<br/> <br/>
Faculty of Information Technology, Monash University, Australia
******

# <span style="color:#0b486b">Tutorial 3a: Feed-forward Neural Nets with TensorFlow 1.x</span> 
**The purpose of this tutorial is to demonstrate how to work with TensorFlow, an open-source software library for developing deep neural network applications. In this tutorial, we will focus on**:  
- ***Inspect the common pipeline of deep learning*.**
- ***How to implement a feedforward neural net for a multi-class classification problem using TF 1.x in Tutorial 3a (this tutorial)*.**
- ***How to implement a feedforward neural net for a multi-class classification problem using TF 2.x in Tutorial 3b*.**

***

### <span style="color:#0b486b"> II.1 Feedforward Neural Network </span> <span style="color:red">***** (highly important)</span>
#### <span style="color:#0b486b"> Tutorial objective </span>

In this tutorial we will consider a fairly realistic deep NN with *three* hidden layers plus the *output* layer. Its architecture is specified as: $16 \rightarrow 10 (ReLU) \rightarrow 20 (ReLU) \rightarrow 15 (ReLU) \rightarrow 26$. This means:
- The input size is 16
- The first layer has 10 hidden units with ReLU activation functions
- The second layer has 20 hidden units with ReLU activation functions
- The third layer has 15 hidden units with ReLU activation functions
- The output layer is a logit layer with 26 units

This network, for example, can take the `letter` dataset input with $16$ features and with $26$ classes (A-Z). **Our objective in this tutorial is to implement this specific network in `TensorFlow 1.x`.**

#### <span style="color:#0b486b">Specifying the Neural Network Architecture </span>

We can visualize this network as in the figure below. Please note that for readability, the number of hidden units in the figure might not correspond exactly to the actual size of the hidden units used.

<img src="./images/DNN_Pipeline.PNG" width="1000">

Furthermore, the above figure shows the pipeline of the entire process for feeding a mini-batch of batch size $32$ into the network. Using ***mini-batch*** is a common way to train deep NNs in practice.

Let us denote the mini-batch by $X_b= \{(x_1, y_1),\dots, (x_{32}, y_{32})\}$. The features of the mini-batch can be stored in a $2D$ tensor with the shape $(32, 16)$. Assume that in this network, we use the activation function $ReLU$ where $ReLU(t)= \max\{0, t\}$. The computation in the forward propagation step is as follows:
- Input $X_b$ with mini-batch size of 32
- $h_1= ReLU(X_b \times W^1 + b^1)\in \mathbb{R}^{32 \times 10}$. 
- $h_2= ReLU(h_1 \times W^2 + b^2)\in \mathbb{R}^{32 \times 20}$. 
- $h_3= ReLU(h_2 \times W^3 + b^3)\in \mathbb{R}^{32 \times 15}$. 
- $logits= h_3 \times W^4 + b^4 \in \mathbb{R}^{32 \times 26}$
- $p = softmax(logits) \in \mathbb{R}^{32 \times 26}$ <br>

where the activation function is applied element-wise and the softmax function is applied to each row of $logits$, transforming a vector of scalars into a discrete distribution as: 

$$softmax(z)=\big[\frac{\exp(z_i)}{\sum_{j=1}^{26}{\exp(z_j)}}\big]_{i=1}^{26}$$

    
The $k$-th row $p_k$ of the matrix $p$ represents the predicted probability distribution over the classes $1,2,\dots,26$ for the data point $x_k$. In particular, we have:

$$p_{km}= p(y_k=m \mid x_k)  \text{ for }  m=1,2,\dots,26$$
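The forward propagation steps above can be traced in plain NumPy. The following is a minimal sketch, not the tutorial's TensorFlow implementation: the weights and the mini-batch are random stand-ins, and the helper names `relu` and `softmax` are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, for illustration only

def relu(t):
    # element-wise max{0, t}
    return np.maximum(0.0, t)

def softmax(z):
    # row-wise softmax; subtracting the row maximum avoids overflow
    # and does not change the result
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

sizes = [16, 10, 20, 15, 26]  # 16 -> 10 -> 20 -> 15 -> 26
# random stand-ins for W^1..W^4 and b^1..b^4
Ws = [rng.normal(0, 0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [rng.normal(0, 0.1, size=(n,)) for n in sizes[1:]]

X_b = rng.uniform(-1, 1, size=(32, 16))  # random stand-in for a mini-batch

h1 = relu(X_b @ Ws[0] + bs[0])
h2 = relu(h1 @ Ws[1] + bs[1])
h3 = relu(h2 @ Ws[2] + bs[2])
logits = h3 @ Ws[3] + bs[3]
p = softmax(logits)

for name, t in [("h1", h1), ("h2", h2), ("h3", h3), ("logits", logits), ("p", p)]:
    print(name, t.shape)
```

Each row of `p` sums to one, which is what allows it to be read as a distribution over the 26 classes.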

**<span style="color:red"> Exercise 1</span>** : Explain why the dimension of $h_1$ is $32\times 10$. Similarly, work out the dimensions of $h_2, h_3, logits$ and $p$.

#### <span style="color:#0b486b">Specifying the Loss Function </span>
Essential to training a deep NN is the concept of the **loss function**. This function tells us how well the network is predicting, and hence we can use it to find the network weights that minimize the loss.

For classification tasks, a common approach is to use the **cross-entropy** loss function. Consider a data-label instance $(x_k,y_k)$ where the feature vector $x_k\in \mathbb{R}^{16}$ and the label $y_k \in \{1,2,...,26\}$ is a numeric label (for example, if $x_k$ is in class 2, then $y_k = 2$ and its one-hot vector is $1_{y_k}=[0,1,0,...,0]$). The cross-entropy between the classification distribution $p_k$ returned from the NN and the true label distribution $1_{y_k}$ is defined as:
$$cross\_entropy(1_{y_k}, p_k)=-\sum_{j=1}^{26}y_{kj}\log{p_{kj}}=-\log p_{k,y_k}$$
where $y_{kj}$ denotes the $j$-th entry of the one-hot vector $1_{y_k}$. Minimizing $cross\_entropy(1_{y_k}, p_k)$ pushes the model to assign a high probability to the true label.

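The loss above is defined for a single instance. For the entire current mini-batch, the training objective becomes: 
$$\min \sum_{k=1}^{32}cross\_entropy(1_{y_k}, p_k)$$
The implementation later in this tutorial averages the per-instance losses instead of summing them, which only rescales the objective by a constant factor for a fixed batch size.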
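As a small numeric illustration of the cross-entropy definition, the sketch below uses a hypothetical 3-class problem and a mini-batch of two instances (the probabilities and labels are made up for the example). It evaluates both the full sum over classes and the shortcut $-\log p_{k,y_k}$.

```python
import numpy as np

def cross_entropy(one_hot, p):
    # -sum_j one_hot[j] * log p[j] for a single instance
    return -np.sum(one_hot * np.log(p))

# made-up predicted distributions for two instances (each row sums to 1)
p_batch = np.array([[0.2, 0.7, 0.1],
                    [0.5, 0.25, 0.25]])
y_batch = np.array([2, 1])  # 1-based labels, as in the text

# one-hot rows; labels are shifted to 0-based indices for NumPy
one_hot = np.eye(3)[y_batch - 1]

per_instance = np.array([cross_entropy(o, p) for o, p in zip(one_hot, p_batch)])
shortcut = -np.log(p_batch[np.arange(len(y_batch)), y_batch - 1])
batch_loss = per_instance.sum()

print(per_instance)  # full sum over the classes
print(shortcut)      # -log of the probability assigned to the true class
print(batch_loss)    # summed over the mini-batch
```

Both computations give the same per-instance values, here $-\log 0.7$ and $-\log 0.5$.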

**<span style="color:red"> Exercise 2</span>** : **<span style="color:#0b486b">In the cross-entropy equation above, $y_k$ is the class of $x_k$. Explain why the end result is $-\log p_{k,y_k}$.</span>**

**<span style="color:red"> Exercise 3</span>** : **<span style="color:#0b486b">Let $p=[0.1, 0.3, 0.6]$ and $q=[0.0, 0.5, 0.5]$ be two discrete distributions. What is $cross\_entropy(q,p)$?</span>**

### <span style="color:#0b486b"> II.2 Implementation with TensorFlow 1.x</span> <span style="color:red">**** (important)</span>
We shall now implement the aforementioned network with the architecture $16 \rightarrow 10 (ReLU) \rightarrow 20 (ReLU) \rightarrow 15 (ReLU) \rightarrow 26$ in TensorFlow using the dataset `letter`. 

This letter dataset can be found at [the LIBSVM website](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#letter). Here is the dataset information:
-  *The objective is to identify each of a large number of black-and-white rectangular pixel displays as one of the 26 capital letters in the English alphabet. The character images were based on 20 different fonts and each letter within these 20 fonts was randomly distorted to produce a file of 20,000 unique stimuli. Each stimulus was converted into 16 primitive numerical attributes (statistical moments and edge counts) which were then scaled to fit into a range of integer values from 0 through 15*

A typical pipeline process of implementing a deep learning model is as follows:

1. **Data processing**: 
   - Load the dataset and split into train, valid, and test sets.  
     
2. **Construction phase**: 
   - Define the NN model and construct the corresponding computational graph.
   - Define the loss function and the relevant measures of performance of interest (accuracy, F1, and AUC).
    
3. **Execution and evaluation phase**: 
   - Train the model using mini-batches from the train set by minimizing the loss function with an optimizer.
   - Predict on the test set and assess its performance.

#### <span style="color:#0b486b">1. Data Processing </span>

We use `sklearn` to load the dataset.

In [1]:
import os
import numpy as np
from sklearn.datasets import load_svmlight_file

In [2]:
data_file_name= "letter_scale.libsvm"
data_file = os.path.abspath("./Data/" + data_file_name)
X_data, y_data = load_svmlight_file(data_file)
X_data= X_data.toarray()
y_data= y_data.reshape(y_data.shape[0],-1)
print("X data shape: {}".format(X_data.shape))
print("y data shape: {}".format(y_data.shape))
print("# classes: {}".format(len(np.unique(y_data))))
print(np.unique(y_data))

X data shape: (15000, 16)
y data shape: (15000, 1)
# classes: 26
[ 1.  2.  3.  4.  5.  6.  7.  8.  9. 10. 11. 12. 13. 14. 15. 16. 17. 18.
 19. 20. 21. 22. 23. 24. 25. 26.]


We use `sklearn` to split the dataset into the train, validation, and test sets. 


In [3]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

def train_valid_test_split(data, target, train_size, test_size):
    valid_size = 1 - (train_size + test_size)
    X1, X_test, y1, y_test = train_test_split(data, target, test_size = test_size, random_state= 33)
    X_train, X_valid, y_train, y_valid = train_test_split(X1, y1, test_size = float(valid_size)/(valid_size+ train_size))
    return X_train, X_valid, X_test, y_train, y_valid, y_test



Next, we would like to encode the labels as a numeric vector. For example, we want to turn $y\_data=["cat", "dog", "cat", "lion", "dog"]$ into $y\_data=[0,1,0,2,1]$.

To do this, in the following code segment we create the object `le`, an instance of the class `preprocessing.LabelEncoder`, which transforms the categorical labels in `y_data` into a numerical vector.

In [4]:
le = preprocessing.LabelEncoder()
le.fit(y_data.ravel())    # flatten (n, 1) to (n,): LabelEncoder expects 1D labels
y_data= le.transform(y_data.ravel())
print(y_data[:20])

[25 15 18  7  7  5 13 17 12  3 21  0 10  3 18  7  4 25 14 16]
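The cat/dog/lion example mentioned above can be reproduced with NumPy alone. This is an illustrative sketch of the mapping (distinct labels sorted, then replaced by their index), not `LabelEncoder`'s actual implementation; the variable names are chosen here for the example.

```python
import numpy as np

y_raw = np.array(["cat", "dog", "cat", "lion", "dog"])

# np.unique returns the sorted distinct labels and, with return_inverse=True,
# the index of each original entry within that sorted array
classes, y_encoded = np.unique(y_raw, return_inverse=True)

print(classes)    # the distinct labels in sorted order
print(y_encoded)  # the integer code of each entry

# indexing the class array with the codes recovers the original labels
y_decoded = classes[y_encoded]
print(y_decoded)
```

For the `letter` dataset the same idea maps the original labels $1,\dots,26$ to the codes $0,\dots,25$, which is the 0-based range TensorFlow's sparse cross-entropy expects.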




We now use the function defined above to prepare our data for training, validating and testing.

In [5]:
X_train, X_valid, X_test, y_train, y_valid, y_test = train_valid_test_split(X_data, y_data, train_size=0.8, test_size=0.1)
y_train= y_train.reshape(-1)
y_test= y_test.reshape(-1)
y_valid= y_valid.reshape(-1)
print(X_train.shape, X_valid.shape, X_test.shape)
print(y_train.shape, y_valid.shape, y_test.shape)
print("labels: {}".format(np.unique(y_train)))

(12000, 16) (1500, 16) (1500, 16)
(12000,) (1500,) (1500,)
labels: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25]
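The sizes printed above follow from the two-stage split in `train_valid_test_split`: the test set is carved out first, and the validation set is then taken from the remainder with the fraction `valid_size / (valid_size + train_size)`. The sketch below redoes only that arithmetic; it uses `round` as a simplification and is not how `train_test_split` computes sizes internally.

```python
n_total = 15000
train_frac, test_frac = 0.8, 0.1
valid_frac = 1 - (train_frac + test_frac)  # about 0.1, up to floating-point error

# stage 1: hold out the test set from the full data
n_test = round(n_total * test_frac)
n_rest = n_total - n_test

# stage 2: split the remainder; the validation fraction is relative to n_rest
second_fraction = valid_frac / (valid_frac + train_frac)  # about 1/9
n_valid = round(n_rest * second_fraction)
n_train = n_rest - n_valid

print(n_train, n_valid, n_test)
```

Note that only the first split fixes `random_state`, so the train/validation partition can differ between runs while the test set stays the same.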


We record some information about the training set which will be reused later.

In [6]:
train_size= int(X_train.shape[0])
n_features= int(X_train.shape[1])
n_classes= len(np.unique(y_train))

Once again, in real-world implementations of deep learning models, we use Stochastic Gradient Descent (SGD). The input to this algorithm is a sequence of **mini-batches** of data drawn from the training dataset.
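A sequence of mini-batches can be produced by slicing the training arrays in steps of the batch size. The generator below is a minimal sketch on toy data; the name `iterate_minibatches` and the toy arrays are assumptions made for this example. It does not shuffle, so each epoch would see the batches in the same order.

```python
import math
import numpy as np

def iterate_minibatches(X, y, batch_size):
    # yield consecutive slices; the last batch may be smaller than batch_size
    for start in range(0, X.shape[0], batch_size):
        end = min(X.shape[0], start + batch_size)
        yield X[start:end], y[start:end]

# toy data: 100 instances with 16 features and labels in 0..25
X_toy = np.arange(100 * 16, dtype=np.float32).reshape(100, 16)
y_toy = np.arange(100) % 26

batch_sizes = [xb.shape[0] for xb, _ in iterate_minibatches(X_toy, y_toy, 32)]
print(batch_sizes)  # [32, 32, 32, 4]

# number of iterations per epoch for 12000 training instances and batch size 32
iters_per_epoch = math.ceil(12000 / 32)
print(iters_per_epoch)  # 375
```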

#### <span style="color:#0b486b">2. Construction Phase </span>

We build up a feedforward neural network with the architecture: $16 \rightarrow 10 (ReLU) \rightarrow 20 (ReLU) \rightarrow 15 (ReLU) \rightarrow 26$ in TensorFlow.

In [7]:
n_in= n_features    # dimension of input
n1= 10              # number of hidden units at the first layer
n2= 20              # number of hidden units at the second layer
n3= 15              # number of hidden units at the third layer
n_out= n_classes    # number of classification classes

The function `dense_layer` represents a fully connected layer in a deep learning network. It takes the layer input and the number of output units `output_size`, creates the weight matrix $W$ and the bias $b$ internally, and returns $\sigma(input \times W + b)$ where the activation function $\sigma$ is specified by the parameter `act`. If `act` is `None`, the linear output $input \times W + b$ is returned.
- In `TensorFlow`, we can refer to the `activation functions` as `tf.nn.relu`, `tf.nn.sigmoid`, `tf.nn.tanh`, etc.
- You can also define your own activation function.

In [8]:
def dense_layer(inputs, output_size, act=None, name="hidden-layer"):
    with tf.name_scope(name):
        input_size= int(inputs.get_shape()[1])
        W_init = tf.random.normal([input_size, output_size], mean=0, stddev= 0.1, dtype= tf.float32)
        b_init= tf.random.normal([output_size], mean=0, stddev= 0.1, dtype= tf.float32)
        W= tf.Variable(W_init, name= "W")
        b= tf.Variable(b_init, name="b")
        Wxb= tf.matmul(inputs, W) + b
        if act is None:
            return Wxb
        else:
            return act(Wxb)
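Because `act` is just a callable applied to the pre-activation, a user-defined activation can be passed in the same way as a built-in one. The sketch below mirrors the idea in NumPy with a hypothetical `leaky_relu`. Both `dense_layer_np` and `leaky_relu` are names invented for this illustration; inside a TensorFlow graph the activation would have to be written with tensor operations (for example `tf.maximum(alpha * t, t)`) rather than NumPy.

```python
import numpy as np

def leaky_relu(t, alpha=0.01):
    # custom activation: identity for positive inputs, small slope otherwise
    return np.where(t > 0, t, alpha * t)

def dense_layer_np(inputs, W, b, act=None):
    # NumPy mirror of a fully connected layer: act(inputs x W + b)
    Wxb = inputs @ W + b
    return Wxb if act is None else act(Wxb)

rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 16))       # 4 instances, 16 features
W = rng.normal(0, 0.1, size=(16, 10))   # random stand-in weights
b = rng.normal(0, 0.1, size=(10,))      # random stand-in bias

out_linear = dense_layer_np(inputs, W, b)                 # no activation
out_leaky = dense_layer_np(inputs, W, b, act=leaky_relu)  # custom activation
print(out_linear.shape, out_leaky.shape)
```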

We now construct the computational graph. But before that we need to reset the default graph.

In [9]:
tf.reset_default_graph()
with tf.name_scope("network"):
    X= tf.placeholder(shape=[None, n_in], dtype= tf.float32)
    y= tf.placeholder(shape=[None], dtype= tf.int32)
    h1= dense_layer(X, n1, act= tf.nn.relu, name= "layer1")
    h2= dense_layer(h1, n2, act= tf.nn.relu, name= "layer2")
    h3= dense_layer(h2, n3, act= tf.nn.relu, name= "layer3")
    logits= dense_layer(h3, n_out, name="logits")

We compute the cross-entropy loss. Note that in TensorFlow you can use one of the following two functions for evaluating the cross-entropy loss:
- `tf.nn.sparse_softmax_cross_entropy_with_logits`: if the labels `y_train` are in the categorical format (e.g., `y_train=[0,1,0,1,1,2]`).
- `tf.nn.softmax_cross_entropy_with_logits`: if the labels `y_train` are in the one-hot format (e.g., `y_train=[[1,0,0], [0,1,0], [1,0,0], [0,0,1]]`).
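The two label formats lead to the same loss values; only the way the true class is selected differs. The NumPy sketch below illustrates this on random logits. The helpers `log_softmax`, `sparse_xent`, and `dense_xent` are illustrative names, not TensorFlow functions, and the sketch makes no claim about how TensorFlow implements these ops.

```python
import numpy as np

def log_softmax(logits):
    # numerically stable row-wise log-softmax
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def sparse_xent(labels, logits):
    # labels are integer class indices, shape (batch,)
    logp = log_softmax(logits)
    return -logp[np.arange(len(labels)), labels]

def dense_xent(one_hot, logits):
    # labels are one-hot rows, shape (batch, n_classes)
    return -(one_hot * log_softmax(logits)).sum(axis=1)

rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 3))          # random stand-in logits
y_sparse = np.array([0, 1, 0, 1, 1, 2])   # categorical format
y_one_hot = np.eye(3)[y_sparse]           # the same labels in one-hot format

loss_sparse = sparse_xent(y_sparse, logits)
loss_dense = dense_xent(y_one_hot, logits)
print(np.allclose(loss_sparse, loss_dense))  # True
print(loss_sparse.mean())                    # mean loss over the mini-batch
```

The sparse format avoids building the one-hot matrix, which is why it is the natural choice when the labels are already integer codes.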

We also need to specify an optimizer to minimize the loss. Here, we are using the Adam optimizer for this optimization. 

In [10]:
with tf.name_scope('train'):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, 
                                                              logits=logits, 
                                                              name='xentropy')
    loss= tf.reduce_mean(xentropy, name="loss")
    tf.summary.scalar("loss", loss)    #summarize the loss
    optimizer= tf.train.AdamOptimizer(learning_rate=0.001)
    train_op= optimizer.minimize(loss)

In the above code, the line ***tf.summary.scalar("loss", loss)*** adds the loss to the summary.

We also wish to estimate the accuracy of our model. For this you can use the [*`in_top_k()`* function](https://www.tensorflow.org/api_docs/python/tf/nn/in_top_k) with *k=1*. This returns a 1D tensor of boolean values, so we need to cast these booleans to floats and then compute the average. This gives us the network’s overall accuracy. The line ***tf.summary.scalar("accuracy", accuracy)*** adds the accuracy to the summary.

In [11]:
with tf.name_scope('evaluation'):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
    tf.summary.scalar("accuracy", accuracy)  #summarize the accuracy
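With `k=1`, the check amounts to asking whether the largest logit in each row sits at the true class (ignoring ties between logits). A NumPy sketch of the same accuracy computation, on made-up logits and labels:

```python
import numpy as np

# made-up logits for 4 instances and 3 classes
logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 0.2, 3.0],
                   [1.5, 1.0, 0.0],
                   [0.0, 2.0, 1.0]])
y = np.array([0, 2, 1, 1])  # true class indices

# True where the predicted class (largest logit) matches the label
correct = logits.argmax(axis=1) == y
# cast booleans to floats and average
accuracy = correct.astype(np.float32).mean()

print(correct)   # [ True  True False  True]
print(accuracy)  # 0.75
```

Applying softmax first would not change the result, since softmax preserves the ordering of the logits within a row.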

We now define two FileWriters to write the summaries to two log folders. This way, we can plot the train and valid losses (or accuracies) on the same graph. Note that you can use this trick whenever you want to display several plots on the same graph.

In [12]:
if(not os.path.exists("./logs/train")):
    os.makedirs("./logs/train")

if(not os.path.exists("./logs/val")):
    os.makedirs("./logs/val")

merged= tf.summary.merge_all()
train_writer= tf.summary.FileWriter("./logs/train")
valid_writer= tf.summary.FileWriter("./logs/val")

#### <span style="color:#0b486b">3. Execution and Evaluation Phase </span>

In the `execution phase`, we need to create a `TensorFlow session`, then initialize `all variables` in the graph, execute `train_op`, and query the values of necessary nodes (e.g., `loss` and `accuracy`).

- Initialize all variables
  - `init= tf.global_variables_initializer()` and `sess.run(init)`.

- Execute `train_op` when feeding mini-batches to the network
  - `sess.run([train_op], feed_dict={X:x_batch, y:y_batch})`

- Query the values of necessary nodes
  - `val_loss, val_accuracy= sess.run([loss, accuracy], feed_dict={X:X_valid, y:y_valid})`

Note that as a rule of machine learning, during the training phase we **must not** touch the `test set`; we only use this set when we need to report the predictive performance of a trained model.
-  Output the predictive performance on the test set
   - `test_accuracy= sess.run(accuracy, feed_dict={X:X_test, y:y_test})`

In [13]:
import math
batch_size= 32
history= []  #used to store train, valid accuracies and losses for showing later
num_epoch = 100
iter_per_epoch= math.ceil(float(train_size)/batch_size)  #number of iterations per epoch

with tf.Session() as sess:
    init= tf.global_variables_initializer()
    sess.run(init)
    for epoch in range(num_epoch):
        for idx_start in range(0, X_train.shape[0], batch_size):
            idx_end = min(X_train.shape[0], idx_start + batch_size)
            X_batch, y_batch = X_train[idx_start:idx_end], y_train[idx_start:idx_end]
            sess.run([train_op], feed_dict={X:X_batch, y:y_batch})
        #compute accuracies and losses at the end of each epoch
        train_summary, train_loss, train_accuracy= sess.run([merged,loss, accuracy], feed_dict={X:X_train, y:y_train})
        train_writer.add_summary(train_summary, epoch +1)
        train_writer.flush()
        
        valid_summary,val_loss, val_accuracy= sess.run([merged,loss, accuracy], feed_dict={X:X_valid, y:y_valid})
        valid_writer.add_summary(valid_summary, epoch +1)
        valid_writer.flush()
        print("Epoch {}: valid loss={:.4f}, valid acc={:.4f}".format(epoch+1, val_loss, val_accuracy))
        print("########: train loss={:.4f}, train acc={:.4f}".format(train_loss, train_accuracy))
        hist_item={"train_loss": train_loss, "train_acc": train_accuracy, 
                   "val_loss":val_loss, "val_acc": val_accuracy}
        history.append(hist_item)
    print("---------------------------------------------\n")
    test_accuracy= sess.run(accuracy, feed_dict={X:X_test, y:y_test})
    print("Test accuracy: {:.4f}".format(test_accuracy))

Epoch 1: valid loss=2.5695, valid acc=0.1833
########: train loss=2.5716, train acc=0.1742
Epoch 2: valid loss=2.0712, valid acc=0.3467
########: train loss=2.0833, train acc=0.3499
Epoch 3: valid loss=1.8119, valid acc=0.4293
########: train loss=1.8167, train acc=0.4295
Epoch 4: valid loss=1.6897, valid acc=0.4673
########: train loss=1.6942, train acc=0.4687
Epoch 5: valid loss=1.6261, valid acc=0.4920
########: train loss=1.6322, train acc=0.4907
Epoch 6: valid loss=1.5822, valid acc=0.5200
########: train loss=1.5872, train acc=0.5069
Epoch 7: valid loss=1.5457, valid acc=0.5360
########: train loss=1.5490, train acc=0.5232
Epoch 8: valid loss=1.5129, valid acc=0.5427
########: train loss=1.5152, train acc=0.5355
Epoch 9: valid loss=1.4831, valid acc=0.5573
########: train loss=1.4844, train acc=0.5474
Epoch 10: valid loss=1.4544, valid acc=0.5633
########: train loss=1.4532, train acc=0.5593
Epoch 11: valid loss=1.4234, valid acc=0.5740
########: train loss=1.4202, train acc=0.57

### <span style="color:#0b486b"> Additional Exercises </span> 

1. Write your own code to save a trained model to the hard disk and restore this model, then use the restored model to output the prediction result on the test set.

2. Write code to add the plots of `test accuracy` and `loss` to the above line charts with your color of interest.

3. Insert new code to the above code to enable outputting to TensorBoard the values of `training loss`, `training accuracy`, `valid loss`, and `valid accuracy` at the end of epochs. You can refer to the code [here](https://www.tensorflow.org/guide/summaries_and_tensorboard).

4. Write code to do regression on the dataset `cadata` which can be downloaded [here](https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html). Note that for a regression problem, you need to use the `L2` loss instead of the `cross-entropy` loss as in a classification problem. 

---
### <span style="color:#0b486b"> <div  style="text-align:center">**THE END**</div> </span>