# Neural Networks

## Take-home points -- Lecture
- An NN is constructed from MLPs (multilayer perceptrons)
- Linear OPs define individual layers, even where this is not obvious (OP stands for operation/operator)
    - The matrix-vector product is the workhorse
- Non-linear OPs support the network body structure
- Training means choosing a set of parameters (the weights in each layer) by moving slowly away from a starting point.
- An algorithm determines the direction of change

Keywords ("smart words" I can use with people without being afraid of them asking "what do you mean by ...?" after this class!):
`deep convolutional neural network`, `back-propagation`, `regularisation`, `learning rate`, `stochastic gradient descent`


## Take-home points -- Lab
- a "tensor" object (personally, I think this is a misnomer)
- end-to-end neural network learning (when the data allows)
    1. build net
    2. select loss
    3. optimiser
    
Collectables: `GPU-deep learning`, `pytorch` (`tensorflow` if you choose to learn)

### 1 MLP

Recall our old friend linear model:

- ONE linear model accesses ALL _observable_ attributes of the data, and produces ONE answer.
    - Form the final analytics using this answer -- linear model completed
    
__Let's push for an MLP__
- Employ MULTIPLE, say, $n$, linear models, and 
- Treat the $n$ answers as new attributes
- Build another layer of linear models on top of the $n$ answers

![NN as MLP][fig:mlp]

[fig:mlp]: ref/illu-1.png

__Q1__
1. If a data sample consists of 10 attributes, how many parameters are there in a linear model? What is the hypothesis space $\mathcal{H}$? (In this class, we always ignore the bias, which we can treat as a weight on a constant-1 attribute.)
2. How many parameters do we need to specify the model for 3 data samples?
3. How many parameters do we need to specify 5 such models?
4. Suppose I build a second layer consisting of one linear model, which takes the outputs of the first layer as input and produces the final answer. How many parameters are there in the entire model?
5. (opt) Consider a practical model, where the inputs are images of $64 \times 64$ RGB pixels, the first layer has 4,096 units (linear models), the second, third and fourth layers have 1,024 units each, and finally it outputs 10 predictions, say the plausibility that the input image belongs to each of 10 different classes. Specifically $[X \in \mathbb{R}^{64\times 64}] \mapsto  [H^1 \in \mathbb{R}^{4096}] \mapsto  [H^2 \in \mathbb{R}^{1024}] \mapsto  [H^3 \in \mathbb{R}^{1024}] \mapsto  [H^4 \in \mathbb{R}^{1024}] \mapsto  [Y \in \mathbb{R}^{10}]$. (The formulation uses one value per pixel, i.e. $64\times 64$ inputs; with the three RGB channels kept separate the input would have $64\times 64\times 3$ values.) Figure out the hypothesis space of the network.


__A1__

- 10 parameters; $\mathcal{H}$ is $\mathbb{R}^{10}$
- the same, 10
- 50
- 50 + 5=55, the extra 5 weights are for the second layer model
- (opt) see lab solution
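
The counts in A1 follow from multiplying the sizes of consecutive layers (bias ignored, as stated in Q1). A minimal sketch of that arithmetic in plain Python; the helper name `count_weights` is ours and is not part of the lab material.

```python
def count_weights(layer_sizes):
    """Number of weights in a stack of bias-free linear layers.

    layer_sizes lists the input width followed by each layer's number of
    units, e.g. [10, 5, 1] for Q1.4.
    """
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(count_weights([10, 1]))       # Q1.1: one linear model
print(count_weights([10, 5]))       # Q1.3: five linear models
print(count_weights([10, 5, 1]))    # Q1.4: two layers
# Q1.5, using 64*64 inputs as written in the formulation of the question
print(count_weights([64 * 64, 4096, 1024, 1024, 1024, 10]))
```

The last number should agree with the total printed by the `pytorch` lab solution.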

__Lab__

1. Build a linear model of 10 attributes using `pytorch`, and count the parameters; then let the linear model output 5 results, and count the parameters

    - Hint-1: In torch module `torch.nn`, there is a [Linear] class.
    - Hint-2: When telling the class the number of output "features", bear in mind that the linear model gives THE final answer.
    - Hint-3: Each "Neural network module" class in `torch.nn` has an [access-to-parameters] method that provides references to the internal parameters. Note that linear models _optionally_ contain a bias.
    - Hint-4: check the example below, note the use of `np.prod`.
    
2. Program the model in the last question of __Q1__

[Linear]: https://pytorch.org/docs/stable/nn.html#linear-layers
[access-to-parameters]: https://pytorch.org/docs/stable/nn.html#torch.nn.Module.parameters

In [None]:
import torch.nn as nn
import numpy as np # for a convenient product over the dimensions
linear_model = nn.Linear(
    in_features=10, out_features=1, bias=False)
for i_, param in enumerate(linear_model.parameters()):
    s = param.size()
    print("Para {}: type {}, size {}, #.elements {}".format(
        i_, type(param.data), s, np.prod(s)
    ))

In [None]:
## ANSWER TO Q1_Lab.2
# X->H1
layer1 = nn.Linear(in_features=64*64, out_features=4096, bias=False)
# H1->H2
layer2 = nn.Linear(in_features=4096, out_features=1024, bias=False)
# H2->H3
layer3 = nn.Linear(in_features=1024, out_features=1024, bias=False)
# H3->H4
layer4 = nn.Linear(in_features=1024, out_features=1024, bias=False)
# H4->Y
layer5 = nn.Linear(in_features=1024, out_features=10, bias=False)

import numpy as np 
i = 0
total_parasize = 0
for model in (layer1, layer2, layer3, layer4, layer5):
    for param in model.parameters():
        s = param.size()
        print("{}:{}".format(i, s))
        total_parasize += np.prod(s)
        i += 1
print("Total parameter number is {}".format(total_parasize))

### 2 Unified formulation of the computation

Consider a data sample of 3 attributes, with a linear model with 3 weights: $(x_1, x_2, x_3)$ and $w_1, w_2, w_3$. The computation is $$
x_1 w_1 + x_2 w_2 + x_3 w_3
$$

Let us write this product-sum in a format which allows extensive generalisation:

$[\begin{array}{ccc}
x_{1} & x_{2} & x_{3}\end{array}]\times\left[\begin{array}{c}
w_{1}\\
w_{2}\\
w_{3}
\end{array}\right]$

**Further**, what if we have two samples instead of one? (Recall Q1.2) We can simply expand the $X$-part of the above computation, where _rows represent samples_ (with an extra subscript).

$\left[\begin{array}{ccc}
x_{1,1} & x_{1,2} & x_{1,3}\\
x_{2,1} & x_{2,2} & x_{2,3}
\end{array}\right]\times\left[\begin{array}{c}
w_{1}\\
w_{2}\\
w_{3}
\end{array}\right] \mapsto \left[\begin{array}{c}
y_{1}\\
y_{2}
\end{array}\right]$

**Furthermore**, what if we have two more linear models, i.e. three outputs together for each data sample? We can simply expand the $W$-part of the above computation, where _columns represent individual models_.

$\left[\begin{array}{ccc}
x_{1,1} & x_{1,2} & x_{1,3}\\
x_{2,1} & x_{2,2} & x_{2,3}
\end{array}\right]\times\left[\begin{array}{ccc}
w_{1,1} & w_{1,2} & w_{1,3}\\
w_{2,1} & w_{2,2} & w_{2,3}\\
w_{3,1} & w_{3,2} & w_{3,3}
\end{array}\right] \mapsto\left[\begin{array}{ccc}
y_{1,1} & y_{1,2} & y_{1,3}\\
y_{2,1} & y_{2,2} & y_{2,3}
\end{array}\right]$

__Q2__
1. Write out the computation for $y_{1,2}$ in the last formulation.
2. Suppose we want to compute a further layer, using the 3 $Y$-variables as input, to produce 2 outputs, say $z_{i,1}, z_{i,2}$ for a sample $i$. Write out the matrix formulation.

**ANSWER TO Q2.2**

Following the computation of $Y$,
$\left[\begin{array}{ccc}
y_{1,1} & y_{1,2} & y_{1,3}\\
y_{2,1} & y_{2,2} & y_{2,3}
\end{array}\right]\times\left[\begin{array}{cc}
u_{1,1} & u_{1,2}\\
u_{2,1} & u_{2,2}\\
u_{3,1} & u_{3,2}
\end{array}\right]\mapsto\left[\begin{array}{cc}
z_{1,1} & z_{1,2}\\
z_{2,1} & z_{2,2}
\end{array}\right]$
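
The matrix formulation can be checked numerically. The sketch below uses arbitrary random values for $X$, $W$ and $U$ (placeholders, not data from the lab), writes out one entry of $Y$ by hand as asked in Q2.1, and confirms the shapes of $Y$ and $Z$.

```python
import torch

torch.manual_seed(0)
X = torch.rand(2, 3)   # 2 samples (rows), 3 attributes
W = torch.rand(3, 3)   # 3 linear models (columns), 3 weights each
U = torch.rand(3, 2)   # second layer: 2 models over the 3 Y-variables

Y = X @ W              # shape (2, 3)
Z = Y @ U              # shape (2, 2)

# y_{1,2}: sample 1 combined with model 2 (zero-based indices 0 and 1)
y_12 = X[0, 0] * W[0, 1] + X[0, 1] * W[1, 1] + X[0, 2] * W[2, 1]

print(Y.shape, Z.shape)
print(torch.isclose(y_12, Y[0, 1]))
```

Note that the code indexes from zero, while the subscripts in the formulas start from one.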

### 3 Non-linear construction

The "multilayer" models above are an illusion! All multi-stage linear models above are equivalent to single-layer models.

__Q3__

Can you show for the above example $X\mapsto Y \mapsto Z$, how the two stage models are equivalent to one?

**ANSWER TO Q3**

$$
\begin{align*}
Z & =Y\times U\\
Y & =X\times W\\
Z & =X\times W\times U\\
Z & =X\times V,\\
V & :=W\times U
\end{align*}
$$

Note $V$ is a definition. Using a linear model parameterised by $V$, we achieve the effect of two models in one shot.

**LAB**

1. Verify the above construction -- following the example code.

In [None]:
import torch
import torch.nn as nn
example_X = torch.rand(2, 3)
linear_model_W = nn.Linear(in_features=3, out_features=3, bias=False)
linear_model_U = nn.Linear(in_features=3, out_features=2, bias=False)
linear_model_V = nn.Linear(in_features=3, out_features=2, bias=False)

1.1 Verify the computation of linear model following the matrix multiplication

- Is the output as expected? If not, why, and how can it be fixed?
    - Hint-1: does `pytorch` represent one linear model in a _column_ in the weight matrix? Review the outputs in Q1_Lab.2.
    - Hint-2: check `transpose` method in tensors.
    - Please note similar introspection is useful for all frameworks.

In [None]:
print("Applying linear model \n", linear_model_W(example_X))
print("Matrix multiplication between X and Weight Matrix \n", 
      torch.mm(example_X, linear_model_W.weight)
     )

In [None]:
# ANSWER
print("Matrix multiplication between X and Weight Matrix \n", 
      torch.mm(example_X, linear_model_W.weight.transpose(1,0))
     )

1.2 Let us replace the parameters of V with the product of W and U
    - Hint: assign `linear_model_V.weight.data` appropriate values

In [None]:
# ANSWER
linear_model_V.weight.data = \
    torch.mm(linear_model_U.weight, linear_model_W.weight)

In [None]:
print("Transformed by W then U\n", linear_model_U(linear_model_W(example_X)))
print("Transformed by V\n", linear_model_V(example_X))

#### Non-linear activation

- Elementwise transform, e.g. 
$\frac{1}{1+\exp(-x)}$ or $0$ if $x<0$, $x$ otherwise

**Q4**

1. Draw plots of the above two functions for $x \in [-3, 3]$
    - Hint-1: the first activation is called "Sigmoid" and the second one "ReLU" (rectified linear). In the `torch.nn` module, there are classes "Sigmoid" and "ReLU", from which you can instantiate the activators. In the `torch.nn.functional` module, there are corresponding functions that can be called directly. The choice is often up to the developer's style.
2. How many learnable parameters are there in the activators?
3. [**LAB**] Verify that after applying an activation, the two step transform in the example above is no longer collapsing into one. 
4. Construct a model accepting 4 attributes, transform to 2 features, do non-linear activation and then into 3 outputs.
    - Hint-2: see the template definition below
    - Hint-3: Let's use a `softmax` activation for the final layer.
    

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

x = torch.arange(-3, 3, 0.01)
x_np = x.numpy()
# you may want to check 
# x_np2 = np.arange(-3, 3, 0.01) 
# is the same as x_np, numpy and torch arrays are easily convertible

In [None]:
## ANSWER to (one of many) 1.
f1 = nn.Sigmoid()
y1 = f1(x)
y2 = nn.functional.relu(x)
plt.plot(x_np, y1.numpy())
plt.plot(x_np, y2.numpy())

In [None]:
# ANSWER to 4
# Template definition of an NN model
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.linear_layer1 = nn.Linear(
            in_features=4,
            out_features=2
        )
        self.linear_layer2 = nn.Linear(
            in_features=2,
            out_features=3
        )
    
    def forward(self, x):
        """
        Forward pass: linear -> ReLU -> linear -> softmax.
        :param x: the input data, one sample per row
        :type x: torch.FloatTensor
        """
        h = self.linear_layer1(x)
        h = nn.functional.relu(h)
        h = self.linear_layer2(h)
        y = nn.functional.softmax(h, dim=1)
        return y

In [None]:
# Given data x, the usage will be:
x = torch.randn(10, 4)
model = MyModel()
results = model(x)
print(results)
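
For Q4.3, one way to see that the two-step transform no longer collapses into one: any linear map $f$ satisfies $f(a+b)=f(a)+f(b)$, so it is enough to find inputs where the activated network violates this. A minimal sketch with hand-picked weights (chosen by us for illustration, not taken from the lab):

```python
import torch
import torch.nn as nn

layer1 = nn.Linear(in_features=1, out_features=2, bias=False)
layer2 = nn.Linear(in_features=2, out_features=1, bias=False)
with torch.no_grad():
    layer1.weight.copy_(torch.tensor([[1.0], [-1.0]]))  # h = (x, -x)
    layer2.weight.copy_(torch.tensor([[1.0, 1.0]]))     # z = h_1 + h_2

def linear_only(x):
    return layer2(layer1(x))

def with_relu(x):
    return layer2(torch.relu(layer1(x)))

a = torch.tensor([[1.0]])
b = torch.tensor([[-1.0]])

# Without activation the stack is linear: f(a) + f(b) equals f(a + b)
print(linear_only(a) + linear_only(b), linear_only(a + b))
# With ReLU the stack computes |x|, which is not additive
print(with_relu(a) + with_relu(b), with_relu(a + b))
```

Since no single weight matrix can reproduce a non-additive function, the activated two-layer model cannot be replaced by one linear model.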

----
# A Complex NN Example

----
The model is adapted from CycleGAN, see the project [page](https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix).

In [None]:
AVAILABLE_TARGET_STYLES = [
    "apple2orange", "orange2apple", 
    "summer2winter_yosemite", "winter2summer_yosemite", 
    "horse2zebra", "zebra2horse", "monet2photo", 
    "style_monet", "style_cezanne", "style_ukiyoe", 
    "style_vangogh", "sat2map", "map2sat", 
    "cityscapes_photo2label", "cityscapes_label2photo", 
    "facades_photo2label", "facades_label2photo", "iphone2dslr_flower"
]

TARGET_STYLE = AVAILABLE_TARGET_STYLES[10]
print("TARGET_STYLE: ", TARGET_STYLE)

In [None]:
# download trained style-conversion models
import os
import urllib.request
model_path = "ref/saved_style_models/" + TARGET_STYLE + ".pth"
# make sure the target folder exists before downloading into it
os.makedirs(os.path.dirname(model_path), exist_ok=True)
if not os.path.exists(model_path):
    urllib.request.urlretrieve(
        "http://efrosgans.eecs.berkeley.edu/cyclegan/pretrained_models/" + \
        TARGET_STYLE + ".pth",
        model_path)

In [None]:
import cganimstyler as cim
# build the style model
netG = cim.load_generator_from(model_path)

The cell below performs the conversion. Each pixel in the target image is the answer of a series of linear models. The model is defined in `cganimstyler/resnet.py`. The pre-trained parameters are saved in `ref/saved_style_models`.

In [None]:
im = cim.load_image('ref/testimages/Jun.jpeg') # Put your own image here!
res = netG(im)
npim = cim.tensor2im(im)
res_npim = cim.tensor2im(res)

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
plt.subplot(1,2,1)
plt.imshow(npim)
plt.axis('off')
plt.subplot(1,2,2)
plt.imshow(res_npim)
plt.axis('off')
plt.show()

----
# TRAINING

**Back-propagation** interpreted.

We will practise a simple demo of this algorithm in class.
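
As a warm-up, here is a minimal sketch of back-propagation on a two-parameter chain, $h = w_1 x$, $y = w_2 h$, with squared error $L=(y-t)^2$. The numbers are arbitrary and the example is ours; the point is only that the chain rule applied by hand agrees with what `backward()` stores in `.grad`.

```python
import torch

x, t = torch.tensor(2.0), torch.tensor(1.0)
w1 = torch.tensor(0.5, requires_grad=True)
w2 = torch.tensor(-1.5, requires_grad=True)

# Forward pass
h = w1 * x
y = w2 * h
loss = (y - t) ** 2

# Backward pass by autograd
loss.backward()

# Backward pass by hand, propagating from the loss towards the input
dL_dy = 2 * (y - t)
dL_dw2 = dL_dy * h            # because y = w2 * h
dL_dh = dL_dy * w2
dL_dw1 = dL_dh * x            # because h = w1 * x

print(w1.grad, dL_dw1.detach())
print(w2.grad, dL_dw2.detach())
```

Each hand-computed quantity reuses the one computed just before it, which is why the gradients are said to propagate "backwards" through the layers.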

----


**Q5**: Train the model defined in Q4.4 to classify Iris Data (provided by scikit-learn, see below)

1. Prepare data
    - Hint: template is provided below
    - Why train-validation split?
    - What is the random seed used for?
    - **LAB**: try to understand the data preparation steps. Specifically, understand the definition of the following objects in terms of "duck typing", i.e. their implementation and utility. Consult your tutors if anything is unclear.
        - Dataset
        - Dataset split
            - Hint: you will have two / three subsets
        - Shuffling
        - Random seeding
    - **LAB**: (optional) consider normalising the variables
2. Define the objective. (TBC below)
3. Calculate the direction of change
4. Apply the change. (TBC below)
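
For the optional normalisation task, a common choice is to standardise each attribute with the mean and standard deviation of the training subset only, then apply the same statistics to the validation subset. A sketch on synthetic stand-in data (the tensor `features` and the 120/30 split are placeholders, not the Iris loader used in this lab):

```python
import torch

torch.manual_seed(0)
# Stand-in for a (150, 4) attribute table with a different scale per column
features = torch.randn(150, 4) * torch.tensor([1.0, 5.0, 0.1, 20.0]) + 3.0
train_x, valid_x = features[:120], features[120:]

# Statistics come from the training subset only
mean = train_x.mean(dim=0, keepdim=True)
std = train_x.std(dim=0, keepdim=True)

train_x_norm = (train_x - mean) / std
valid_x_norm = (valid_x - mean) / std   # reuse the training statistics

print(train_x_norm.mean(dim=0))  # close to 0 for every attribute
print(train_x_norm.std(dim=0))   # close to 1 for every attribute
```

Reusing the training statistics keeps information about the validation samples out of the preprocessing step.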

In [None]:
## Template data loading procedure: Q5-1.1
import random
from sklearn.datasets import load_iris
from torch.utils.data import Dataset, DataLoader
from torch.utils.data.sampler import SubsetRandomSampler # train/valid-subset sampler
from torchvision.transforms import ToTensor

In [None]:
class FlowerDataset(Dataset):
    """
    Such an object can be handled by a "Loader" object. 
    """
    def __init__(self):
        super(FlowerDataset, self).__init__()
        self._data = load_iris()
        
    def __len__(self):
        return len(self._data.data)
    
    def __getitem__(self, i):
        """
        So you can use dataset[i]
        """
        sample = (torch.FloatTensor(self._data.data[i]), 
                  int(self._data.target[i]))
        return sample

dataset = FlowerDataset()
VALID_RATIO = 0.2
valid_num = int(len(dataset)*VALID_RATIO)

print("Use {} samples for training, {} for validation".format(
    len(dataset)-valid_num, valid_num))

In [None]:
random.seed(42)
indices = list(range(len(dataset)))
random.shuffle(indices)
train_indices = indices[valid_num:] # check Python indexing
valid_indices = indices[:valid_num]
print(train_indices, valid_indices)

In [None]:
train_sampler = SubsetRandomSampler(train_indices)
valid_sampler = SubsetRandomSampler(valid_indices)
train_loader = DataLoader(dataset=dataset, 
                          sampler=train_sampler, 
                          batch_size=32)
valid_loader = DataLoader(dataset=dataset,
                          sampler=valid_sampler,
                          batch_size=valid_num)

In [None]:
# this is the way a DataLoader is used, we break at the first
# round to take only one batch of samples.
for x, y in train_loader:
    break

**Q5.2** (cont.)

We treat the output of the model as the __predicted__ probability that each sample belongs to each class. The discrepancy between the prediction and the ground truth is _the target value_ to minimise. Before proceeding, let's review the model and make a slight modification.

- Why do we use LOG-softmax + NLL-Loss, instead of using softmax (without log)? [hint](https://pytorch.org/docs/stable/nn.html?highlight=nll%20loss#torch.nn.NLLLoss)
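
Part of the answer is numerical. The sketch below (our own illustration, with deliberately extreme scores) shows that taking `log` after `softmax` can produce `-inf`, while `log_softmax` stays finite; it also shows that log-softmax followed by NLL gives the same value as `cross_entropy` applied to the raw scores.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1000.0, 0.0, -1000.0]])  # deliberately extreme scores
target = torch.tensor([1])

naive = torch.log(F.softmax(logits, dim=1))   # softmax underflows to 0, log gives -inf
stable = F.log_softmax(logits, dim=1)         # computed in the log domain, stays finite

print(naive)
print(stable)
print(F.nll_loss(stable, target))

# On moderate scores, log-softmax + NLL matches cross-entropy on the raw scores
scores = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
labels = torch.tensor([0, 2])
loss_two_step = F.nll_loss(F.log_softmax(scores, dim=1), labels)
loss_one_step = F.cross_entropy(scores, labels)
print(loss_two_step, loss_one_step)
```

An infinite loss would make the gradients useless, which is why the log is folded into the last layer rather than applied afterwards.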

In [None]:
class MyModel2(nn.Module):
    def __init__(self):
        super(MyModel2, self).__init__()
        self.linear_layer1 = nn.Linear(
            in_features=4,
            out_features=5
        )
        self.linear_layer2 = nn.Linear(
            in_features=5,
            out_features=3
        )
    
    def forward(self, x):
        """
        Forward pass: linear -> ReLU -> linear -> log-softmax.
        :param x: the input data, one sample per row
        :type x: torch.FloatTensor
        """
        h = self.linear_layer1(x)
        h = nn.functional.relu(h)
        h = self.linear_layer2(h)
        y = nn.functional.log_softmax(h, dim=1)
        return y

In [None]:
model = MyModel2()

In [None]:
pred = model(x)
loss = nn.functional.nll_loss(pred, y)

**Q5.3** (cont.)

To compute the direction along which to adjust the parameters of the model, now we can simply let `loss` backprop:

**LAB**: adjust one or several parameters of the model and check the effect on the prediction and loss

In [None]:
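# Run this only once per forward pass: a second call does not overwrite the
# previously computed gradients (by default it raises an error, because the
# computation graph has already been freed)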
loss.backward()
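
One way to approach the LAB task: after `backward()`, each parameter carries a `.grad` tensor of the same shape, and nudging the parameters a small step against it should lower the loss on the same batch. The sketch below uses a single bias-free `nn.Linear` on random data as a stand-in for the model and the Iris batch (both are assumptions made to keep the example self-contained).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(8, 4)                 # stand-in batch: 8 samples, 4 attributes
y = torch.randint(0, 3, (8,))         # stand-in class labels in {0, 1, 2}
layer = nn.Linear(in_features=4, out_features=3, bias=False)

loss_before = F.nll_loss(F.log_softmax(layer(x), dim=1), y)
loss_before.backward()
print(layer.weight.grad.shape)        # same shape as layer.weight

# Manually move the weights a small step against the gradient
with torch.no_grad():
    layer.weight -= 0.01 * layer.weight.grad

loss_after = F.nll_loss(F.log_softmax(layer(x), dim=1), y)
print(loss_before.item(), loss_after.item())
```

Trying a step in the opposite direction (adding instead of subtracting) is a quick way to confirm that the gradient points uphill.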

**Q5.4** (cont.)

We use an optimiser object to handle the update of the parameters.

In [None]:
from torch.optim import Adam
optimiser = Adam(model.parameters(), lr=1e-3)
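
To make concrete what an optimiser's `step()` does, the sketch below compares a hand-written update with `torch.optim.SGD`. Plain SGD is used here instead of Adam only because its rule, $w \leftarrow w - \text{lr}\cdot\nabla_w L$, fits in one line; Adam additionally rescales the step using running averages of past gradients. The data and layers are placeholders.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(8, 4)
y = torch.randint(0, 3, (8,))
layer_a = nn.Linear(4, 3)
layer_b = copy.deepcopy(layer_a)      # identical starting point
lr = 0.1

# (a) let the optimiser apply the change
optimiser = torch.optim.SGD(layer_a.parameters(), lr=lr)
optimiser.zero_grad()
F.cross_entropy(layer_a(x), y).backward()
optimiser.step()

# (b) apply the same change by hand
F.cross_entropy(layer_b(x), y).backward()
with torch.no_grad():
    for p in layer_b.parameters():
        p -= lr * p.grad

print(torch.allclose(layer_a.weight, layer_b.weight))
print(torch.allclose(layer_a.bias, layer_b.bias))
```

The optimiser object mainly saves bookkeeping: it remembers which parameters to update and, for methods like Adam, the extra state each of them needs.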

In [None]:
optimiser.zero_grad() # reset all computed gradients
pred = model(x)
loss = nn.functional.nll_loss(pred, y)
loss.backward()
optimiser.step() # Apply the change
print("Loss Before {:.6f}".format(loss))

optimiser.zero_grad()
pred = model(x)
loss = nn.functional.nll_loss(pred, y)
loss.backward()
optimiser.step() # Apply the change
print("Loss After {:.6f}".format(loss))

- What happens if the cell above is executed several times?
- Explain your finding.
- A key element is missing if one wants to call repeated execution of the above cell "training". What is the missing piece?
- **LAB**: Write the _training_ algorithm
- **LAB**: Evaluate the model performance on the validation data
- Why do we need the validation set (or, why not just call it the test set)?
    - choices such as the number of internal neurons can be selected against the model performance on this set
    - make the number of internal nodes adjustable
        - Hint: check the definition of `MyModel2` above
    - perform multiple experiments on the train/test split 
        - Hint: recall the setting of the random seed above

- the performance is close to this [cross-ref](https://www.kaggle.com/azzion/iris-data-set-classification-using-neural-network)
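
For the "adjustable internal nodes" item, one possible shape is to pass the hidden width to the constructor. This is a sketch: the class name `MyModelAdjustable` and the argument `hidden_features` are hypothetical, and the rest mirrors `MyModel2`.

```python
import torch
import torch.nn as nn

class MyModelAdjustable(nn.Module):
    """Variant of MyModel2 whose hidden width is a constructor argument."""
    def __init__(self, hidden_features=5):
        super().__init__()
        self.linear_layer1 = nn.Linear(in_features=4, out_features=hidden_features)
        self.linear_layer2 = nn.Linear(in_features=hidden_features, out_features=3)

    def forward(self, x):
        h = torch.relu(self.linear_layer1(x))
        return torch.log_softmax(self.linear_layer2(h), dim=1)

# Candidate widths to compare on the validation set
for hidden in (2, 5, 16):
    candidate = MyModelAdjustable(hidden_features=hidden)
    n_params = sum(p.numel() for p in candidate.parameters())
    print(hidden, n_params)
```

Each candidate would then be trained as usual and the width with the best validation accuracy kept, leaving the test set untouched for the final report.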

In [None]:
## ANSWER AND SOLUTION TO LAB TASKS
# let's reset
model = MyModel2() 
optimiser = Adam(model.parameters(), lr=1e-3)

TRAIN_ITERS = 1000
EVALUATE_EVERY_N_STEPS = 100
total_steps = 0
for epoch in range(TRAIN_ITERS):
    for x, y in train_loader:
        optimiser.zero_grad() # reset all computed gradients
        pred = model(x)
        loss = nn.functional.nll_loss(pred, y)
        loss.backward()
        optimiser.step()
        total_steps += 1
        if total_steps % EVALUATE_EVERY_N_STEPS == 0:
            # compute ACCURACY on VALIDATION SET
            total_valid = 0
            correct_valid = 0
            for x_, y_ in valid_loader:
                pred_ = model(x_)
                correct_valid += (torch.argmax(pred_, dim=1)==y_).sum()
                # ==: element by element comparison
                # if == holds, max-in-pred EQUALS TO target class, correct, count 1
                # if not, count 0
                # at last compare how many we have counted with total validation samples
                total_valid += len(y_)
            print("Training epoch {} (total iter {}), " 
                  "loss {:.6f}, accuracy {:.2f}".format(
                      epoch, total_steps, loss, 
                      float(correct_valid)/total_valid
                  ))
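
The validation pass in the loop above can be factored into a helper. A sketch, assuming a model that returns per-class scores and a loader that yields `(x, y)` batches; the name `accuracy` and the stand-in model and data are ours. Wrapping the pass in `torch.no_grad()` avoids building the autograd graph when no gradients are needed.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def accuracy(model, loader):
    """Fraction of samples whose highest-scoring class equals the target."""
    correct, total = 0, 0
    with torch.no_grad():                      # no gradients needed for evaluation
        for x_, y_ in loader:
            pred_ = model(x_)
            correct += (torch.argmax(pred_, dim=1) == y_).sum().item()
            total += len(y_)
    return correct / total

# Stand-in data and model, only to show the call
torch.manual_seed(0)
data = TensorDataset(torch.randn(30, 4), torch.randint(0, 3, (30,)))
loader = DataLoader(data, batch_size=10)
print(accuracy(nn.Linear(4, 3), loader))
```

In the training loop, the inner validation block would reduce to a single call such as `accuracy(model, valid_loader)`.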