# Lab: Image classification with deep learning

#### Course: Statistical machine learning

Authors: Jakob Nyström, Inga Wohlert (group 19)

## 3. Preparation exercises

### 3.1 Softmax and cross-entropy

#### Question 3.1

In [32]:
import numpy as np

In [3]:
def sigmoid(x):
    """Logistic function h(x) = e^x / (1 + e^x)."""
    return np.exp(x) / (1 + np.exp(x))

In [34]:
z = 1
h_z = sigmoid(z)
p_1 = h_z
p_neg_1 = 1 - h_z

print(f"p=1: {round(p_1, 3)}")
print(f"p=-1: {round(p_neg_1, 3)}")

p=1: 0.731
p=-1: 0.269


**Answer:** $p(y = 1 \, | \, x) \approx 0.731$ and $p(y = -1 \, | \, x) \approx 0.269$.
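As a check by hand, both probabilities follow directly from the logistic function evaluated at $z = 1$ (the symbol $h$ mirrors `h_z` in the code above):

$$
p(y = 1 \, | \, x) = h(1) = \frac{e^{1}}{1 + e^{1}} \approx \frac{2.718}{3.718} \approx 0.731, \qquad p(y = -1 \, | \, x) = 1 - h(1) \approx 0.269
$$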

#### Question 3.2

In [16]:
def softmax(z, z_list):
    """Softmax probability of the logit z, normalized over all logits in z_list."""
    return np.exp(z) / np.sum(np.exp(z_list))

In [35]:
z_list = [0, -1 , 1]
probas = []

for z in z_list:
    res = softmax(z, z_list)
    probas.append(res)
    
for count, prob in enumerate(probas, start=1):
    print(f"p={count}: {round(prob,3)}")

p=1: 0.245
p=2: 0.09
p=3: 0.665


**Answer:** Class y = 3 has the highest probability, $0.665$.
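The softmax above exponentiates the logits directly, which is fine for small values but can overflow for large ones. A minimal sketch of a shifted variant follows, assuming only the standard library; the name `stable_softmax` is hypothetical and not part of the lab code. Subtracting the largest logit leaves the result unchanged because softmax is invariant to adding a constant to all logits.

```python
import math

def stable_softmax(logits):
    """Softmax over a list of logits, shifted by the maximum for numerical stability."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probas_small = stable_softmax([0, -1, 1])
probas_large = stable_softmax([1000, 999, 1001])  # same logits, all shifted by 1000

print([round(p, 3) for p in probas_small])
print([round(p, 3) for p in probas_large])
```

Both calls give the same probabilities, whereas evaluating `math.exp(1000)` without the shift would raise an `OverflowError`.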

#### Question 3.3

In [28]:
def cross_entropy(y_vec, class_proba):
    """Cross-entropy between a one-hot label vector and predicted class probabilities."""
    y_vec = np.array(y_vec)
    log_proba = np.log(np.array(class_proba))
    return -np.dot(y_vec, log_proba)

In [36]:
y_1 = [1, 0, 0]
y_2 = [0, 1, 0]
y_3 = [0, 0, 1]
y_list = [y_1, y_2, y_3]

class_proba = probas
cross_entropies = []

for y in y_list:
    res = cross_entropy(y, class_proba)
    cross_entropies.append(res)
    
for count, ce in enumerate(cross_entropies, start=1):
    print(f"p={count}: {round(ce, 3)}")

p=1: 1.408
p=2: 2.408
p=3: 0.408


**Answer:** The label $y = 3$ gives the lowest cross-entropy, which makes sense since the model assigns that class the highest probability.
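Because each label vector is one-hot, the sum in the cross-entropy collapses to a single term. Writing $p_m$ for the softmax probability of class $m$ and $z_m$ for its logit (notation assumed here for illustration), the three values can be checked by hand:

$$
L_m = -\sum_{j=1}^{3} y_j \ln p_j = -\ln p_m = -z_m + \ln \sum_{j=1}^{3} e^{z_j} \approx -z_m + 1.408
$$

With $z = (0, -1, 1)$ this gives $1.408$, $2.408$ and $0.408$, in agreement with the printed values.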

### 3.2 Dense neural network

#### Question 3.4

Sizes of the weight matrices and offset vectors:

- $W^{(1)}$: 30 x 144
- $b^{(1)}$: 30 x 1
- $W^{(2)}$: 4 x 30 
- $b^{(2)}$: 4 x 1

The number of parameters in the network is 4,474 (see the calculation below).

In [30]:
params = (30 * 144 + 30) + (4 * 30 + 4)
print(params)

4474
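The same count can be written as a small helper that works for any stack of dense layers. This is a sketch only; `dense_param_count` is a hypothetical name, and the layer sizes (144 inputs, 30 hidden units, 4 outputs) are read off the matrix dimensions listed above.

```python
def dense_param_count(layer_sizes):
    """Total number of weights and offsets in a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_out * n_in + n_out  # weight matrix plus offset vector
    return total

print(dense_param_count([144, 30, 4]))
```

Passing `[784, 10]` gives 7,850, the count for a single-layer network with 784 inputs and 10 outputs.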


### 3.3 Convolutional neural network

#### Question 3.5

$W$ has dimensions filter rows x filter columns x input channels x output channels, and $b$ has one entry per filter (output channel). $Q$ has dimensions $(k \, / \, s_k)$ x $(l \, / \,s_l)$ x output channels, where $k$ is the width of the image, $l$ is the height of the image and $s_k, s_l$ are the stride parameters. This assumes zero-padding, so that the output size depends only on the input size and the stride.

- $W^{(1)}$: 5 x 5 x 1 x 4 
- $b^{(1)}$: 1 x 4  
- $Q^{(1)}$: 12 x 12 x 4   


#### Question 3.6

- $W^{(2)}$: 3 x 3 x 4 x 8 
- $b^{(2)}$: 1 x 8  
- $Q^{(2)}$: 6 x 6 x 8

#### Question 3.7

- $W^{(3)}$: 60 x 288
- $b^{(3)}$: 60 x 1 
- $W^{(4)}$: 4 x 60 
- $b^{(4)}$: 4 x 1
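The shapes in Questions 3.5 to 3.7 can be traced through the network with a short sketch. It assumes a 12 x 12 x 1 input (inferred from the 144 inputs of the dense network), zero-padding, and strides of 1 and 2 for the two convolutional layers (inferred from the sizes of $Q^{(1)}$ and $Q^{(2)}$). The function names are hypothetical.

```python
import math

def conv_output_shape(height, width, stride, out_channels):
    """Output shape of a convolutional layer, assuming 'same' zero-padding."""
    return (math.ceil(height / stride), math.ceil(width / stride), out_channels)

def conv_param_count(filter_rows, filter_cols, in_channels, out_channels):
    """Filter weights plus one offset per output channel."""
    return filter_rows * filter_cols * in_channels * out_channels + out_channels

# Assumed from the answers above: 12 x 12 x 1 input, stride 1 and then stride 2.
q1 = conv_output_shape(12, 12, stride=1, out_channels=4)
q2 = conv_output_shape(q1[0], q1[1], stride=2, out_channels=8)
flattened = q2[0] * q2[1] * q2[2]  # input size of the first dense layer

print(q1, q2, flattened)
print(conv_param_count(5, 5, 1, 4), conv_param_count(3, 3, 4, 8))
```

The flattened size of $Q^{(2)}$ is what fixes the second dimension of $W^{(3)}$.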

## 4. Lab exercises

### Classification of handwritten digits

#### Question 4.1: Single layer network (essentially multi-class logistic regression)

Test accuracy is 92.29%.

#### Question 4.2: Number of params in the single-layer network

$\mathbf{W}$ contains $784 \times 10 = 7,840$ parameters and $\mathbf{b}$ contains $10$ (the same offset vector is used for all $n$ observations), so in total there are $7,850$ parameters.

#### Question 4.3: Number of batches and epochs in training 

There are $n = 60,000$ training observations, so for a mini-batch size of $100$ that means we have $60,000 / 100 = 600$ iterations in each epoch. There are $2,000$ epochs during training. 
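The relation between iterations and epochs can be sketched as below. The helper names are hypothetical, and the second call assumes the 10,000 iterations from Task 4.7 are run with the same mini-batch size of 100.

```python
def iterations_per_epoch(n_train, batch_size):
    """Mini-batch updates needed for one pass over the training set.

    Assumes n_train is divisible by batch_size, as it is here.
    """
    return n_train // batch_size

def epochs_covered(n_iterations, n_train, batch_size):
    """Number of passes over the data that n_iterations updates correspond to."""
    return n_iterations * batch_size / n_train

print(iterations_per_epoch(60_000, 100))
print(round(epochs_covered(10_000, 60_000, 100), 2))
```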

#### Question 4.4: Performance when adding one hidden layer

With a hidden layer of 200 units we get the following test performance: 94.58%

Varying the number of units in the hidden layer from 10 to 750 gives these results:

- 10 units: 91.6%
- 25 units: 93.76%
- 50 units: 94.5%
- 100 units: 94.76%
- 200 units: 94.58%
- 300 units: 94.75%
- 500 units: 94.69%
- 750 units: 94.89%

A certain number of units (more than 10 or 25) is required to capture the complexity in the data, but beyond roughly 100 units, adding more does not significantly improve the performance of the model.

Using $U = 200$ but initializing the weights to $0$ instead of randomly gives a significantly lower test accuracy of 72.47%, as the model more easily gets stuck in a suboptimal local minimum or another stationary point (such as a saddle point).

#### Question 4.5: Performance when adding several more hidden layers

Classification accuracy on the test set is now only 11.35%. Since all weight vectors are initialized to zero, we are likely to get a lot of zero gradients, which means we will get stuck in some stationary point very quickly.
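The effect of zero initialization on the gradients can be illustrated on a toy network. This is a sketch under assumptions, not the lab's architecture: one hidden layer with two sigmoid units, a single sigmoid output and log loss, with hypothetical names throughout. With all weights at zero, the error signal cannot pass backward through the zero output weights, so the hidden-layer gradient vanishes, and all hidden units receive identical updates.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gradients(x, y, W1, b1, w2, b2):
    """Gradients of the log loss for a one-hidden-layer sigmoid network."""
    # Forward pass
    a1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    h = [sigmoid(a) for a in a1]
    out = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
    # Backward pass
    delta_out = out - y  # derivative of the loss w.r.t. the output pre-activation
    grad_w2 = [delta_out * hi for hi in h]
    delta_h = [delta_out * w * hi * (1 - hi) for w, hi in zip(w2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_h]
    return grad_W1, grad_w2

# Everything initialized to zero
grad_W1, grad_w2 = gradients(
    x=[1.0, 2.0], y=1.0,
    W1=[[0.0, 0.0], [0.0, 0.0]], b1=[0.0, 0.0],
    w2=[0.0, 0.0], b2=0.0,
)
print(grad_W1)  # all entries are zero
print(grad_w2)  # non-zero, but identical for both hidden units
```

In a deeper network the same argument applies to every layer below the last one, which is consistent with training stalling almost immediately.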

#### Task 4.6: Tuning the multi-layer model

Test accuracy improves significantly, up to 97.22%, when randomly initializing weights, changing the initialization of the offset vectors and swapping the solver.

#### Task 4.7: Increasing the number of iterations

There is still some oscillation in the test accuracy over the last 500 or so iterations, which may indicate that there is still some room for improvement from increasing the number of epochs.

Changing to 10,000 iterations yields 97.93% accuracy and seemingly no issues with numerical instability when using SGD. However, when switching to the Adam solver, the algorithm breaks down after 5,300 iterations and test accuracy drops from 97.82% to 9.8%.

#### Task 4.8 / Question 4.6: Dealing with numerical instability

When changing the cross-entropy function, we no longer see instability and manage to get 98.09% test accuracy.
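The report does not spell out which replacement was used. One common remedy for this kind of breakdown, shown here as a sketch only, is to compute the cross-entropy directly from the logits with the log-sum-exp trick instead of taking the logarithm of a softmax probability that may have underflowed to zero. The function name is hypothetical.

```python
import math

def cross_entropy_from_logits(logits, true_class):
    """Cross-entropy -ln softmax(logits)[true_class], computed via log-sum-exp."""
    shift = max(logits)
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return log_sum_exp - logits[true_class]

# Logits from Question 3.2, true class 3 (index 2)
print(round(cross_entropy_from_logits([0, -1, 1], true_class=2), 3))

# Extreme logits: the softmax probability of class 0 underflows to zero,
# so taking its logarithm would fail, but the loss itself is finite.
print(cross_entropy_from_logits([0.0, 1000.0], true_class=0))
```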

### Use convolutional neural networks

#### Question 4.7

There are a total of $7 \times 7 \times 12 = 588$ hidden units in the third convolutional layer, with 49 in each channel.

#### Question 4.8

For the convolutional neural net we get 98.55% accuracy on the test set.



### Real world image classification

#### Task 4.14

The model does very well on the animal and object pictures. However, it incorrectly classified the hedgehog as a porcupine. When dealing with more "abstract" entities like the Pelle Svanslös statue, it goes wrong. The same happens when testing pictures of humans: apparently it is not trained on pictures of humans (rather, it identifies clothing in the image).