| <p style="text-align: left;">Name</p>               | Matr.Nr. | <p style="text-align: right;">Date</p> |
| --------------------------------------------------- | -------- | ------------------------------------- |
| <p style="text-align: left">Lion DUNGL</p> | 01553060 | 15.12.2019                            |

<h1 style="color:rgb(0,120,170)">Hands-on AI I</h1>
<h2 style="color:rgb(0,120,170)">Unit 4 (Assignment) -- Your first neural networks </h2>

Authors: Brandstetter, Schäfl, Patil <br>
Date: 20-11-2019

This file is part of the "Hands-on AI I" lecture material. The following copyright statement applies 
to all code within this file.

Copyright statement: <br>
This  material,  no  matter  whether  in  printed  or  electronic  form,  may  be  used  for personal  and non-commercial educational use only.  Any reproduction of this manuscript, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors.

# Exercise 0
Before tackling all those exciting tasks of this notebook, the necessary Python modules need to be loaded. Have a look at the notebook discussed during the lecture, and import the following modules/symbols:

- <code>u4_utils</code>
- <code>matplotlib.pyplot</code>
- <code>numpy</code>
- <code>torch</code>
- <code>torch.nn</code>

In [1]:
%matplotlib notebook

In [2]:
import u4_utils as u4
import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import warnings
warnings.filterwarnings(r'ignore')

Afterwards, check if the <code>torch</code> module was correctly imported, by computing the <i>sum</i> of <code>[7, 2, 3]</code> and printing the result.

In [3]:
torch.sum(torch.as_tensor([7, 2, 3], dtype=torch.float32))

tensor(12.)

# Exercise 1
Normally, machine learning specific tasks start with digging into some <i>data set</i>. This time, we want to emphasize a different approach by focusing on miscellaneous kinds of <i>functions</i> at the beginning. <b>Exercise 1</b> is all about

- <i>convex</i> functions

and how their <i>derivative</i> can be used for optimizing the same. So, your <b>first task</b> of this exercise requires you to perform the following steps:

- Define&emsp;$y = x^{d}_{0} + x^{d}_{1} + \ldots{} + x^{d}_{n}$&emsp;as a <i>Python</i> function.
- Define the corresponding <i>derivative</i> as a <i>Python</i> function.

Note, that both <i>Python</i> functions should accept <i>exactly one</i> mandatory parameter, namely some one dimensional <i>numpy array</i> consisting of real values. Regardless of this requirement, optional parameters are allowed, though (e.g. to specify the corresponding <i>degree</i> of the current function of interest).

In [4]:
def f(x, d:int=2):
    y = 0
    for i in x:
        y += np.power(i, d)
    return np.array(y)

In [5]:
# gradient = d * x^(d-1)

def gradient_f(x, d:int=2):
    y = []
    for i in x:
        y.append(d*np.power(i, (d-1)))
    return np.array(y)

After you have <i>implemented</i> said function as well as the corresponding derivative, we want to visualize both to get more familiar with them as well as to get some <i>feeling</i> for their behaviour. Most often, some kind of visualization vastly supports problem finding processes (often termed as <i>debugging</i>), so keep this always in mind.

- Create two <i>numpy arrays</i> with values in the range of $[-2, 2]$, with a step size of $0.1$ (<i>hint:</i> look at <code>arange</code> supplied by <i>numpy</i>).
- Visualize the <i>convex</i> function as well as its <i>derivative</i> in $(1.2\ \ 1.5)$.

In [6]:
x0 = np.array([1.2, 1.5])
x_1 = np.arange(-2, 2, 0.1)
x_2 = np.arange(-2, 2, 0.1)

u4.plot_function(x0, x_1, x_2, f, gradient_f)

As the <b>second</b> and <b>last task</b> of this exercise, we want to know the <i>exact</i> value of the <i>derivative</i> of some <i>result</i> of the convex function with respect to its <i>input</i>. For this to happen, the following steps are required:

- Transform the list $[1.2, 1.5]$ to a <i>numpy array</i> of type <i>float32</i>.
- Compute the <i>result</i> of the <i>convex</i> function applied to said newly created <i>input</i>
- Compute the <i>derivative</i> of the <i>result</i> with respect to the input.

Print the <i>result</i> as well as all <i>intermediate</i> values to the standard output.

In [7]:
inp = np.array([1.2, 1.5], dtype=np.float32)
inp_f = f(inp)
inp_grad = gradient_f(inp)

print(f"result: {inp_f}\ngradient of x_1: {inp_grad[0]}, gradient of x_2: {inp_grad[1]}")

result: 3.6900001144409202
gradient of x_1: 2.4000000953674316, gradient of x_2: 3.0
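
As a quick sanity check of the analytic derivative, the values printed above can be compared against a central finite-difference approximation. The sketch below is dependency-free (plain Python instead of numpy); the helper names `poly_sum` and `numeric_gradient` and the step size `h` are illustrative assumptions, not part of the lecture material.

```python
def poly_sum(x, d=2):
    """Return x_0^d + x_1^d + ... + x_n^d for a sequence of floats."""
    return sum(xi ** d for xi in x)


def numeric_gradient(func, x, h=1e-6):
    """Approximate each partial derivative of func at x by central differences."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += h
        down[i] -= h
        grad.append((func(up) - func(down)) / (2 * h))
    return grad


point = [1.2, 1.5]
analytic = [2 * xi for xi in point]          # d * x^(d-1) with d = 2
numeric = numeric_gradient(poly_sum, point)

print(poly_sum(point))    # approximately 3.69
print(analytic)           # [2.4, 3.0]
print(numeric)            # close to the analytic values
```

If the two gradients disagree noticeably, the analytic derivative (or its implementation) is the first place to look for a mistake.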


# Exercise 2


This exercise is quite similar to the <i>previous</i> one, with a difference in the type of functions to be analyzed. <b>Exercise 2</b> is all about

- <i>non-convex</i> functions

and how their <i>derivative</i> can be used for optimizing the same. So, your <b>first task</b> of this exercise requires you to perform the following steps:

- Define&emsp;$y = \tanh\left(x^{d}_{0} + x^{d}_{1} + \ldots{} + x^{d}_{n}\right)$&emsp;as a <i>Python</i> function.
- Define the corresponding <i>derivative</i> as a <i>Python</i> function.

Note, that both <i>Python</i> functions should accept <i>exactly one</i> mandatory parameter, namely some one dimensional <i>numpy array</i> consisting of real values. Regardless of this requirement, optional parameters are allowed, though (e.g. to specify the corresponding <i>degree</i> of the current function of interest).

In [8]:
def f_2(x, d:int=2):
    y = 0
    for i in x:
        y += np.power(i, d)
    return np.array(np.tanh(y))

In [9]:
def sech(x):
    return (1/np.cosh(x))

In [10]:
# gradient_i = d * x_i^(d-1) * sech²(x_0^d + x_1^d + ... + x_n^d)

def gradient_f_2(x, d:int=2):
    y = []
    for i in x:
        y.append(d*np.power(i, (d-1))*np.power(sech(f(x, d)), 2))
    return np.array(y)
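
The formula in the comment follows from the chain rule. Writing $s$ for the inner sum (a symbol introduced here only for brevity), the derivation can be sketched as:

```latex
\begin{aligned}
y &= \tanh(s), \qquad s = \sum_{i=0}^{n} x_i^{d} \\
\frac{\partial y}{\partial x_i}
  &= \frac{\mathrm{d}\tanh(s)}{\mathrm{d}s} \cdot \frac{\partial s}{\partial x_i}
   = \operatorname{sech}^2(s) \cdot d\, x_i^{d-1}
   = \bigl(1 - \tanh^2(s)\bigr) \cdot d\, x_i^{d-1}
\end{aligned}
```

The last form uses the identity $\operatorname{sech}^2(s) = 1 - \tanh^2(s)$, so the gradient could also be computed without a separate `sech` helper.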

After you have <i>implemented</i> said function as well as the corresponding derivative, we want to visualize both to get more familiar with them as well as to get some <i>feeling</i> for their behaviour.

- Create two <i>numpy arrays</i> with values in the range of $[-2, 2]$, with a step size of $0.1$ (<i>hint:</i> look at <code>arange</code> supplied by <i>numpy</i>).
- Visualize the <i>non-convex</i> function as well as its <i>derivative</i> in $(0.9\ \ 0.9)$.

The input of the <i>non-convex</i> function is in the same range as the input of the <i>convex</i> one. Nonetheless, their results (and thus their visualizations) might differ. Do you notice any major <i>differences</i>? If you do, briefly describe them, otherwise leave a short notice.

In [11]:
x0 = np.array([0.9, 0.9])
x_1 = np.arange(-2, 2, 0.1)
x_2 = np.arange(-2, 2, 0.1)

u4.plot_function(x0, x_1, x_2, f_2, gradient_f_2)

# Answer
- The tanh function changes from convex to concave, while the polynomial function stays convex everywhere.
- Furthermore, the tanh function looks narrower, whereas the polynomial function looks wider.
- The values of the tanh function range from 0 to 1 here (*), because its argument, a sum of squares, is never negative. This bounded, saturating shape is similar to the logistic sigmoid used in logistic regression. The range of the convex function, on the other hand, reaches from 0 to infinity.
- Both functions have, in this particular case, only one global minimum.

(*) Because of this, the gradient should be 0 at all points where the function value has reached z = 1, but for some reason those areas are colored red in the plot. When the gradient is computed manually for points in this area, the function correctly returns values of approximately 0 (see below).

In [12]:
x_test = [2, 3]
print(gradient_f_2(x_test))

[8.17454244e-11 1.22618137e-10]
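
The near-zero values can be reproduced by hand, which supports the observation that the function saturates. The sketch below uses only the standard library; `tanh_poly_gradient` is a hypothetical helper name that mirrors `gradient_f_2` for illustration.

```python
import math


def tanh_poly_gradient(x, d=2):
    """Gradient of tanh(sum(x_i^d)), using sech^2(s) = 1 / cosh^2(s)."""
    s = sum(xi ** d for xi in x)
    sech_squared = 1.0 / math.cosh(s) ** 2
    return [d * xi ** (d - 1) * sech_squared for xi in x]


# Along the diagonal the inner sum grows quickly and the gradient collapses.
for value in (0.5, 0.9, 1.5, 2.0):
    grad = tanh_poly_gradient([value, value])
    print(f"x = ({value}, {value}) -> gradient = {grad}")

saturated = tanh_poly_gradient([2.0, 3.0])   # inner sum is 2^2 + 3^2 = 13
print(saturated)                             # both entries are far below 1e-9
```

Since the inner sum is already 13 at the point (2, 3), the factor $\operatorname{sech}^2(13)$ is on the order of $10^{-11}$, which explains the tiny gradient entries.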


Similar to the <i>last task</i> of the <i>previous</i> exercise, the <b>second</b> and <b>last task</b> of this one requires you to compute the <i>exact</i> value of the <i>derivative</i> of some <i>result</i> of the non-convex function with respect to its <i>input</i>. For this to happen, the following steps are necessary:

- Transform the list $[0.9, 0.9]$ to a <i>numpy array</i> of type <i>float32</i>.
- Compute the <i>result</i> of the <i>non-convex</i> function applied to said newly created <i>input</i>
- Compute the <i>derivative</i> of the <i>result</i> with respect to the input.

Print the <i>result</i> as well as all <i>intermediate</i> values to the standard output.

In [13]:
inp = np.array([0.9, 0.9], dtype=np.float32)
inp_f = f_2(inp)
inp_grad = gradient_f_2(inp)

print(f"result: {inp_f}\ngradient of x_1: {inp_grad[0]}, gradient of x_2: {inp_grad[1]}")

result: 0.9246242065313247
gradient of x_1: 0.2611261311358922, gradient of x_2: 0.2611261311358922


# Exercise 3

As you are now an expert in <i>convex</i> and <i>non-convex</i> functions, you would for sure happily get your hands dirty by applying your knowledge to some data set. In this exercise you will be working with one composed of various <i>images</i> of fashion items. For curious minds, more information regarding this data set can be found at:

<cite>Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. Han Xiao, Kashif Rasul, Roland Vollgraf. arXiv:1708.07747</cite>

For the <b>first task</b> of this exercise you are required to perform the following steps:

- Set the <i>random seed</i> to $s = 42$ using the <i>PyTorch</i> interface.
- Load the <i>Fashion-MNIST</i> data set (returns the <i>training</i> as well as the <i>test</i> set).
- Display the first <i>eight</i> images of the <i>Fashion-MNIST</i> data set.

Can you identify possible <i>labels</i> of the eight images?

In [14]:
torch.manual_seed(42)
train_set, test_set = u4.load_fashion_mnist()

In [15]:
u4.display_FashionMNIST(train_set)

In order to define a <i>logistic regression</i> model as well as a <i>dense feedforward neural network</i> for identifying images as visualized above, some minimal knowledge about the <i>structure</i> of the images is required:

- Find out the <i>input dimensionality</i> of the data set.
- Set the output dimensionality to be $d_{out} = 10$

In [16]:
train_image_zero, train_target_zero = train_set[0]

input_dim = train_image_zero.shape[0] * train_image_zero.shape[1] * train_image_zero.shape[2]
output_dim = 10

print(f"Input dimensionality: {input_dim}")

Input dimensionality: 784
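
The input dimensionality is simply the number of pixels per image once it is flattened into a vector. A minimal sketch, assuming each image tensor has the shape `(1, 28, 28)` (one grayscale channel of 28 by 28 pixels, which is consistent with the printed value of 784):

```python
import math

# Assumed shape of a single Fashion-MNIST image tensor: (channels, height, width).
image_shape = (1, 28, 28)

# The product over all axes gives the length of the flattened input vector.
flattened_dim = math.prod(image_shape)
print(flattened_dim)  # 784
```

With an actual tensor, `train_image_zero.numel()` should return the same number without indexing the shape by hand.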


Last time (for <i>assignment 3</i>) you were supplied with an implementation of <i>logistic regression</i> by us. As this would be too simple (and obviously no <i>fun</i> at all) for you, the <b>second task</b> of this exercise comprises:

- Implement a <i>Python class</i> <code>LogisticRegression</code> as discussed during the lecture.
- Keep in mind, which <i>activation</i> function a <i>multi-class</i> setting requires.
- Optionally, <i>initialize</i> the parameters of the model in a different way.

In [17]:
class LogisticRegression(nn.Module):
    def __init__(self):
        super(LogisticRegression, self).__init__()
        self.n_classes = 10
        self.linear = nn.Linear(784, self.n_classes)

    def forward(self, x):
        out = self.linear(x)
        return u4.F.log_softmax(out, dim=1)
    
# Optional initialization with variable input- and output-dimensions passed to the class as parameters
class LogisticRegression_opt(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(LogisticRegression_opt, self).__init__()
        self.n_classes = output_dim
        self.linear = nn.Linear(input_dim, self.n_classes)

    def forward(self, x):
        out = self.linear(x)
        return torch.softmax(out, dim=1)
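
For a multi-class setting, the linear outputs are turned into class probabilities by the softmax function; `log_softmax` returns the logarithm of these probabilities, which is the form a negative log-likelihood loss typically expects. The following dependency-free sketch illustrates the relationship on a single score vector; the function bodies and the example scores are illustrative assumptions, not the PyTorch implementation.

```python
import math


def softmax(scores):
    """Map raw scores to probabilities that sum to one."""
    shift = max(scores)                      # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(scores):
    """Log-probabilities computed with the log-sum-exp trick."""
    shift = max(scores)
    log_total = shift + math.log(sum(math.exp(s - shift) for s in scores))
    return [s - log_total for s in scores]


scores = [2.0, 1.0, 0.1]                     # made-up scores for three classes
probs = softmax(scores)
log_probs = log_softmax(scores)

print(probs)                                 # non-negative values that sum to 1
print([math.exp(lp) for lp in log_probs])    # matches probs up to rounding
```

Both variants rank the classes identically, so the predicted label is the same; they differ only in which loss function they should be paired with.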

Moreover, define an <i>instance</i> of the type <code>SimpleNamespace</code>, and set the hyperparameters accordingly:

- <code>batch_size = 64</code>
- <code>test_batch_size = 1000</code>
- <code>epochs = 10</code>
- <code>lr = 0.001</code>
- <code>momentum = 0.9</code>

The field <code>log_interval</code> can be chosen freely.

- Set the <i>random seed</i> to $s = 42$ using the <i>PyTorch</i> interface.
- Create additional instances of <code>DataLoader</code> for the <i>training</i> as well as the <i>test set</i> and enable <i>shuffling</i>.
- Create a <i>logistic regression</i> model using your <i>own</i> implementation, using the proper <i>input</i> and <i>output</i> dimensionalities.
- Create an optimizer of the type <code>SGD</code> and initialize it accordingly.

In [18]:
args00 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=10,
                       lr=0.001, momentum=0.9, seed=42, log_interval=100)

torch.manual_seed(args00.seed)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=args00.batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=args00.test_batch_size, shuffle=True)

use_cuda = torch.cuda.is_available()
device = torch.device('cuda' if use_cuda else 'cpu')
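
The hyperparameter container used above behaves like `types.SimpleNamespace` from the standard library (that `u4.SimpleNamespace` is the same class is an assumption here). The sketch below shows attribute access and derives the number of mini-batches per epoch, assuming the usual Fashion-MNIST split of 60000 training and 10000 test images and a loader that keeps the last, smaller batch.

```python
import math
from types import SimpleNamespace

args = SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=10,
                       lr=0.001, momentum=0.9, seed=42, log_interval=100)

n_train, n_test = 60000, 10000               # assumed data set sizes

# Number of parameter updates per epoch and of evaluation batches.
train_batches = math.ceil(n_train / args.batch_size)
test_batches = math.ceil(n_test / args.test_batch_size)

print(train_batches)                         # 938
print(test_batches)                          # 10
print(train_batches * args.epochs)           # 9380 updates in total
```

Under these assumptions, a `log_interval` of 100 would produce roughly nine progress messages per training epoch.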

In [19]:
# Move the model to the same device the training routine uses for the data.
model_1 = LogisticRegression().to(device)

optimizer = u4.optim.SGD(model_1.parameters(), lr=args00.lr, momentum=args00.momentum)

Train the previously defined <i>logistic regression</i> model by applying the corresponding <i>data loader</i> (keep in mind for which set we want the model to be <i>trained</i>) as well as the <i>optimizer</i>. Report the performance on the <i>test set</i> afterwards. Experiment with different hyperparameter settings, for instance set different values for $\ldots$

- $\ldots$ the learning rate <code>lr</code>.
- $\ldots$ the momentum term <code>momentum</code>.
- $\ldots$ the number of epochs <code>epochs</code>.

Do you notice any serious differences? If yes, which <i>settings</i> lead to them? If not, try to argue about a <i>possible</i> reason.

In [20]:
for epoch in range(1, args00.epochs + 1):
    u4.train(args00, model_1, device, train_loader, optimizer, epoch, input_dim)
    u4.test(args00, model_1, device, test_loader, input_dim)


Test set: Average loss: 0.0007, Accuracy: 7558/10000 (75.58%)


Test set: Average loss: 0.0006, Accuracy: 7876/10000 (78.76%)


Test set: Average loss: 0.0006, Accuracy: 8022/10000 (80.22%)


Test set: Average loss: 0.0006, Accuracy: 8074/10000 (80.74%)


Test set: Average loss: 0.0006, Accuracy: 8132/10000 (81.32%)


Test set: Average loss: 0.0005, Accuracy: 8129/10000 (81.29%)


Test set: Average loss: 0.0005, Accuracy: 8175/10000 (81.75%)


Test set: Average loss: 0.0005, Accuracy: 8212/10000 (82.12%)


Test set: Average loss: 0.0005, Accuracy: 8240/10000 (82.40%)


Test set: Average loss: 0.0005, Accuracy: 8256/10000 (82.56%)



In [21]:
def training(args):
    model = LogisticRegression().to(device)
    optimizer = u4.optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)
    
    for epoch in range(1, args.epochs + 1):
        u4.train(args, model, device, train_loader, optimizer, epoch, input_dim)
        u4.test(args, model, device, test_loader, input_dim)
    
    return model

In [22]:
# lr = 0.0000000001

args01 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=10, lr=0.0000000001, momentum=0.9, log_interval=100)
model1_01 = training(args01)


Test set: Average loss: 0.0023, Accuracy: 819/10000 (8.19%)


Test set: Average loss: 0.0023, Accuracy: 819/10000 (8.19%)


Test set: Average loss: 0.0023, Accuracy: 819/10000 (8.19%)


Test set: Average loss: 0.0023, Accuracy: 819/10000 (8.19%)


Test set: Average loss: 0.0023, Accuracy: 819/10000 (8.19%)


Test set: Average loss: 0.0023, Accuracy: 819/10000 (8.19%)


Test set: Average loss: 0.0023, Accuracy: 819/10000 (8.19%)


Test set: Average loss: 0.0023, Accuracy: 819/10000 (8.19%)


Test set: Average loss: 0.0023, Accuracy: 819/10000 (8.19%)


Test set: Average loss: 0.0023, Accuracy: 819/10000 (8.19%)



In [23]:
# lr = 1

args02 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=10, lr=1, momentum=0.9, log_interval=100)
model1_02 = training(args02)


Test set: Average loss: 0.0070, Accuracy: 7427/10000 (74.27%)


Test set: Average loss: 0.0088, Accuracy: 7419/10000 (74.19%)


Test set: Average loss: 0.0049, Accuracy: 8079/10000 (80.79%)


Test set: Average loss: 0.0050, Accuracy: 8160/10000 (81.60%)


Test set: Average loss: 0.0056, Accuracy: 7892/10000 (78.92%)


Test set: Average loss: 0.0047, Accuracy: 8215/10000 (82.15%)


Test set: Average loss: 0.0099, Accuracy: 7407/10000 (74.07%)


Test set: Average loss: 0.0082, Accuracy: 7527/10000 (75.27%)


Test set: Average loss: 0.0044, Accuracy: 8079/10000 (80.79%)


Test set: Average loss: 0.0044, Accuracy: 7976/10000 (79.76%)



In [24]:
# lr = 10000

args03 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=10, lr=10000, momentum=0.9, log_interval=100)
model1_03 = training(args03)


Test set: Average loss: 49.2839, Accuracy: 7884/10000 (78.84%)


Test set: Average loss: 78.0519, Accuracy: 7251/10000 (72.51%)


Test set: Average loss: 63.6688, Accuracy: 7805/10000 (78.05%)


Test set: Average loss: 41.7629, Accuracy: 8112/10000 (81.12%)


Test set: Average loss: 50.6734, Accuracy: 8095/10000 (80.95%)


Test set: Average loss: 57.3543, Accuracy: 7679/10000 (76.79%)


Test set: Average loss: 45.6202, Accuracy: 8112/10000 (81.12%)


Test set: Average loss: 69.4011, Accuracy: 7890/10000 (78.90%)


Test set: Average loss: 62.9317, Accuracy: 8029/10000 (80.29%)


Test set: Average loss: 57.8733, Accuracy: 7979/10000 (79.79%)



In [25]:
# epochs = 1

args04 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=1, lr=0.001, momentum=0.9, log_interval=100)
model1_04 = training(args04)


Test set: Average loss: 0.0007, Accuracy: 7617/10000 (76.17%)



In [26]:
# epochs = 5

args05 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=5, lr=0.001, momentum=0.9, log_interval=100)
model1_05 = training(args05)


Test set: Average loss: 0.0007, Accuracy: 7534/10000 (75.34%)


Test set: Average loss: 0.0007, Accuracy: 7825/10000 (78.25%)


Test set: Average loss: 0.0006, Accuracy: 7990/10000 (79.90%)


Test set: Average loss: 0.0006, Accuracy: 8095/10000 (80.95%)


Test set: Average loss: 0.0006, Accuracy: 8119/10000 (81.19%)



In [27]:
# epochs = 20

args06 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=20, lr=0.001, momentum=0.9, log_interval=100)
model1_06 = training(args06)


Test set: Average loss: 0.0007, Accuracy: 7635/10000 (76.35%)


Test set: Average loss: 0.0006, Accuracy: 7874/10000 (78.74%)


Test set: Average loss: 0.0006, Accuracy: 8005/10000 (80.05%)


Test set: Average loss: 0.0006, Accuracy: 8052/10000 (80.52%)


Test set: Average loss: 0.0006, Accuracy: 8115/10000 (81.15%)


Test set: Average loss: 0.0005, Accuracy: 8141/10000 (81.41%)


Test set: Average loss: 0.0005, Accuracy: 8200/10000 (82.00%)


Test set: Average loss: 0.0005, Accuracy: 8216/10000 (82.16%)


Test set: Average loss: 0.0005, Accuracy: 8244/10000 (82.44%)


Test set: Average loss: 0.0005, Accuracy: 8263/10000 (82.63%)


Test set: Average loss: 0.0005, Accuracy: 8252/10000 (82.52%)


Test set: Average loss: 0.0005, Accuracy: 8280/10000 (82.80%)


Test set: Average loss: 0.0005, Accuracy: 8288/10000 (82.88%)


Test set: Average loss: 0.0005, Accuracy: 8278/10000 (82.78%)




Test set: Average loss: 0.0005, Accuracy: 8298/10000 (82.98%)


Test set: Average loss: 0.0005, Accuracy: 8294/10000 (82.94%)


Test set: Average loss: 0.0005, Accuracy: 8297/10000 (82.97%)


Test set: Average loss: 0.0005, Accuracy: 8316/10000 (83.16%)


Test set: Average loss: 0.0005, Accuracy: 8341/10000 (83.41%)


Test set: Average loss: 0.0005, Accuracy: 8322/10000 (83.22%)



In [28]:
# momentum = 0.00000001

args07 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=10, lr=0.001, momentum=0.00000001, log_interval=100)
model1_07 = training(args07)


Test set: Average loss: 0.0014, Accuracy: 6474/10000 (64.74%)


Test set: Average loss: 0.0011, Accuracy: 6688/10000 (66.88%)


Test set: Average loss: 0.0010, Accuracy: 6879/10000 (68.79%)


Test set: Average loss: 0.0009, Accuracy: 7064/10000 (70.64%)


Test set: Average loss: 0.0009, Accuracy: 7229/10000 (72.29%)


Test set: Average loss: 0.0008, Accuracy: 7316/10000 (73.16%)


Test set: Average loss: 0.0008, Accuracy: 7420/10000 (74.20%)


Test set: Average loss: 0.0008, Accuracy: 7485/10000 (74.85%)


Test set: Average loss: 0.0008, Accuracy: 7527/10000 (75.27%)


Test set: Average loss: 0.0007, Accuracy: 7596/10000 (75.96%)



In [29]:
# momentum = 0.5

args08 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=10, lr=0.001, momentum=0.5, log_interval=100)
model1_08 = training(args08)


Test set: Average loss: 0.0011, Accuracy: 6763/10000 (67.63%)


Test set: Average loss: 0.0009, Accuracy: 7095/10000 (70.95%)


Test set: Average loss: 0.0008, Accuracy: 7347/10000 (73.47%)


Test set: Average loss: 0.0008, Accuracy: 7522/10000 (75.22%)


Test set: Average loss: 0.0007, Accuracy: 7613/10000 (76.13%)


Test set: Average loss: 0.0007, Accuracy: 7700/10000 (77.00%)


Test set: Average loss: 0.0007, Accuracy: 7750/10000 (77.50%)


Test set: Average loss: 0.0007, Accuracy: 7805/10000 (78.05%)


Test set: Average loss: 0.0007, Accuracy: 7838/10000 (78.38%)


Test set: Average loss: 0.0006, Accuracy: 7878/10000 (78.78%)



In [30]:
# momentum = 0.999

args09 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=10, lr=0.001, momentum=0.999, log_interval=100)
model1_09 = training(args09)


Test set: Average loss: 0.0010, Accuracy: 7989/10000 (79.89%)


Test set: Average loss: 0.0009, Accuracy: 7962/10000 (79.62%)


Test set: Average loss: 0.0008, Accuracy: 8247/10000 (82.47%)


Test set: Average loss: 0.0010, Accuracy: 8072/10000 (80.72%)


Test set: Average loss: 0.0008, Accuracy: 8261/10000 (82.61%)


Test set: Average loss: 0.0007, Accuracy: 8233/10000 (82.33%)


Test set: Average loss: 0.0006, Accuracy: 8142/10000 (81.42%)


Test set: Average loss: 0.0009, Accuracy: 7830/10000 (78.30%)


Test set: Average loss: 0.0009, Accuracy: 7973/10000 (79.73%)


Test set: Average loss: 0.0006, Accuracy: 8246/10000 (82.46%)



# Answer
- A very low learning rate leads to an average loss and an accuracy that barely change from epoch to epoch (here the accuracy stays at 8.19%, roughly the chance level for ten classes). With a learning rate near 0, each update changes the parameters only by e.g. "gradient * 0.0000001", so in practice they do not move at all. Therefore, the average loss stays the same, too.
- A very high learning rate leads, on the one hand, to a high average loss in general and, on the other hand, to an average loss that changes strongly from epoch to epoch. This is the exact opposite of the case above: each update changes the parameters by e.g. "gradient * 1000", so they move a lot. The parameters may therefore be near the optimum in one iteration and far away from it in the next.
- When training for only a few epochs, the parameters to be optimised (weights, etc.) are less likely to reach their optimum, because the accuracy of a model generally keeps increasing with every epoch.
- With a very low momentum, the accuracy after the first epoch is relatively low compared with runs that use a higher momentum (64.74% for a momentum of 0.00000001 versus 75.58% for a momentum of 0.9). This is because the momentum "accelerates" the steps towards the optimal weights as long as the direction is right. If the momentum is too high, however, the optimum may be overshot. A high momentum has effects comparable to those of a high learning rate.
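
The effect of the learning rate described above can be reproduced on a toy problem. The sketch below minimizes $w^2$ with gradient descent plus momentum, using the update `v = momentum * v + grad; w = w - lr * v` (the form PyTorch's `SGD` uses without dampening or Nesterov momentum). The function name, starting point and step counts are illustrative assumptions and unrelated to the Fashion-MNIST runs.

```python
def sgd_momentum(lr, momentum, steps, w=1.0):
    """Minimize f(w) = w^2 and return the final parameter value."""
    velocity = 0.0
    for _ in range(steps):
        grad = 2.0 * w                        # derivative of w^2
        velocity = momentum * velocity + grad
        w = w - lr * velocity
    return w


tiny = sgd_momentum(lr=1e-10, momentum=0.9, steps=100)
moderate = sgd_momentum(lr=0.1, momentum=0.9, steps=300)
huge = sgd_momentum(lr=10.0, momentum=0.9, steps=50)

print(tiny)       # still very close to the starting value 1.0
print(moderate)   # close to the minimum at 0
print(abs(huge))  # very large: the iterates diverge
```

On this quadratic, the tiny step size leaves the parameter essentially unchanged, while the overly large one makes every step overshoot the minimum by more than the previous one.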

On the basis of your <i>logistic regression</i> implementation, construct a <i>dense feedforward neural network</i> with the following attributes (to get you started, later on you will modify these settings in order to get a better performance on the corresponding test set):

- <i>One</i> input layer, accepting the same data as the <i>logistic regression</i> model.
- <i>Two</i> hidden layers with a dimensionality of $256$ each.
- <i>One</i> output layer, of the same output dimensionality as the <i>logistic regression</i> model. 

To summarize this task:

- Implement a <i>Python class</i> <code>DenseNeuralNet</code> as discussed during the lecture.
- Keep in mind, which <i>activation</i> function a <i>multi-class</i> setting requires.
- Optionally, <i>initialize</i> the parameters of the model in a different way.

In [31]:
class DenseNeuralNet(nn.Module):
    def __init__(self):
        super(DenseNeuralNet, self).__init__()
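        # layer1 to layer3 are three hidden layers of width 256 (one more than the
        # two the task asks for); layer4 is the output layer with one unit per class.
        self.layer1 = nn.Linear(784, 256)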
        self.layer2 = nn.Linear(256, 256)
        self.layer3 = nn.Linear(256, 256)
        self.layer4 = nn.Linear(256, 10)
    
    def forward(self, x):
        x = torch.selu(self.layer1(x))
        x = torch.selu(self.layer2(x))
        x = torch.selu(self.layer3(x))
        return self.layer4(x)

# Optional initialization with variable input- and output-dimensions passed to the class as parameters    
class DenseNeuralNet_opt(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DenseNeuralNet_opt, self).__init__()
        self.n_classes = output_dim
        self.layer1 = nn.Linear(input_dim, 256)
        self.layer2 = nn.Linear(256, 256)
        self.layer3 = nn.Linear(256, 256)
        self.layer4 = nn.Linear(256, self.n_classes)
    
    def forward(self, x):
        x = torch.selu(self.layer1(x))
        x = torch.selu(self.layer2(x))
        x = torch.selu(self.layer3(x))
        return self.layer4(x)
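
To get a feeling for how much larger the dense network is than the logistic regression model, the number of trainable parameters can be counted by hand: each `nn.Linear(n_in, n_out)` layer holds `n_in * n_out` weights plus `n_out` biases. The helper `count_linear_parameters` is a hypothetical name used for this sketch only.

```python
def count_linear_parameters(layer_sizes):
    """Sum weights and biases over consecutive fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out         # weight matrix plus bias vector
    return total


logistic_regression = count_linear_parameters([784, 10])
dense_network = count_linear_parameters([784, 256, 256, 256, 10])

print(logistic_regression)  # 7850
print(dense_network)        # 335114
```

For an instantiated model, `sum(p.numel() for p in model.parameters())` should give the same number.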

Moreover, define an <i>instance</i> of the type <code>SimpleNamespace</code>, and set the hyperparameters accordingly (similar to the settings of the <i>logistic regression</i>):

- <code>batch_size = 64</code>
- <code>test_batch_size = 1000</code>
- <code>epochs = 10</code>
- <code>lr = 0.001</code>
- <code>momentum = 0.9</code>

The field <code>log_interval</code> can be chosen freely.

- Set the <i>random seed</i> to $s = 42$ using the <i>PyTorch</i> interface.
- Create additional instances of <code>DataLoader</code> for the <i>training</i> as well as the <i>test set</i> and enable <i>shuffling</i>.
- Create a <i>dense feedforward neural network</i> model using your <i>own</i> implementation, using the proper <i>input</i> and <i>output</i> dimensionalities.
- Create an optimizer of the type <code>SGD</code> and initialize it accordingly.

In [32]:
torch.manual_seed(42)

args2_00 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=10, lr=0.001, momentum=0.9, log_interval=100)

train_loader_2 = torch.utils.data.DataLoader(train_set, batch_size=args2_00.batch_size, shuffle=True)
test_loader_2 = torch.utils.data.DataLoader(test_set, batch_size=args2_00.test_batch_size, shuffle=True)

In [33]:
# Move the model to the same device the training routine uses for the data.
model_2 = DenseNeuralNet().to(device)
optimizer_2 = u4.optim.SGD(model_2.parameters(), lr=args2_00.lr, momentum=args2_00.momentum)

Train the previously defined <i>dense feedforward neural network</i> model by applying the corresponding <i>data loader</i> (keep in mind for which set we want the model to be <i>trained</i>) as well as the <i>optimizer</i>. Report the performance on the <i>test set</i> afterwards. As this kind of network behaves differently than a <i>logistic regression</i> model, experiment with different hyperparameter settings, for instance set different values for $\ldots$

- $\ldots$ the learning rate <code>lr</code>.
- $\ldots$ the momentum term <code>momentum</code>.
- $\ldots$ the number of epochs <code>epochs</code>.

Do you notice any serious differences? If yes, which <i>settings</i> lead to them? If not, try to argue about a <i>possible</i> reason.

In [34]:
for epoch in range(1, args2_00.epochs + 1):
    u4.train(args2_00, model_2, device, train_loader_2, optimizer_2, epoch, input_dim)
    u4.test(args2_00, model_2, device, test_loader_2, input_dim)


Test set: Average loss: 0.0006, Accuracy: 7863/10000 (78.63%)


Test set: Average loss: 0.0005, Accuracy: 8152/10000 (81.52%)


Test set: Average loss: 0.0005, Accuracy: 8275/10000 (82.75%)


Test set: Average loss: 0.0005, Accuracy: 8348/10000 (83.48%)


Test set: Average loss: 0.0004, Accuracy: 8374/10000 (83.74%)


Test set: Average loss: 0.0004, Accuracy: 8432/10000 (84.32%)


Test set: Average loss: 0.0004, Accuracy: 8426/10000 (84.26%)


Test set: Average loss: 0.0004, Accuracy: 8504/10000 (85.04%)


Test set: Average loss: 0.0004, Accuracy: 8516/10000 (85.16%)


Test set: Average loss: 0.0004, Accuracy: 8498/10000 (84.98%)



In [35]:
def training_2(args):
    model = DenseNeuralNet().to(device)
    optimizer = u4.optim.SGD(model.parameters(), lr=args.lr, momentum=args.momentum)
    
    for epoch in range(1, args.epochs + 1):
        u4.train(args, model, device, train_loader_2, optimizer, epoch, input_dim)
        u4.test(args, model, device, test_loader_2, input_dim)
    
    return model

In [36]:
# lr = 0.0000000001

args2_01 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=10, lr=0.0000000001, momentum=0.9, log_interval=100)
model2_01 = training_2(args2_01)


Test set: Average loss: 0.0023, Accuracy: 1104/10000 (11.04%)


Test set: Average loss: 0.0023, Accuracy: 1104/10000 (11.04%)


Test set: Average loss: 0.0023, Accuracy: 1104/10000 (11.04%)


Test set: Average loss: 0.0023, Accuracy: 1104/10000 (11.04%)


Test set: Average loss: 0.0023, Accuracy: 1104/10000 (11.04%)


Test set: Average loss: 0.0023, Accuracy: 1104/10000 (11.04%)


Test set: Average loss: 0.0023, Accuracy: 1104/10000 (11.04%)


Test set: Average loss: 0.0023, Accuracy: 1104/10000 (11.04%)


Test set: Average loss: 0.0023, Accuracy: 1104/10000 (11.04%)


Test set: Average loss: 0.0023, Accuracy: 1104/10000 (11.04%)



In [37]:
# lr = 0.08

args2_02 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=10, lr=0.08, momentum=0.9, log_interval=100)
model2_02 = training_2(args2_02)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)



In [38]:
# lr = 0.09

args2_03 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=10, lr=0.09, momentum=0.9, log_interval=100)
model2_03 = training_2(args2_03)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: nan, Accuracy: 1000/10000 (10.00%)



In [39]:
# epochs = 1

args2_04 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=1, lr=0.001, momentum=0.9, log_interval=100)
model2_04 = training_2(args2_04)


Test set: Average loss: 0.0006, Accuracy: 7853/10000 (78.53%)



In [40]:
# epochs = 5

args2_05 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=5, lr=0.001, momentum=0.9, log_interval=100)
model2_05 = training_2(args2_05)


Test set: Average loss: 0.0006, Accuracy: 7835/10000 (78.35%)


Test set: Average loss: 0.0005, Accuracy: 8204/10000 (82.04%)


Test set: Average loss: 0.0005, Accuracy: 8295/10000 (82.95%)


Test set: Average loss: 0.0005, Accuracy: 8336/10000 (83.36%)


Test set: Average loss: 0.0004, Accuracy: 8371/10000 (83.71%)



In [41]:
# epochs = 20

args2_06 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=20, lr=0.001, momentum=0.9, log_interval=100)
model2_06 = training_2(args2_06)


Test set: Average loss: 0.0006, Accuracy: 7872/10000 (78.72%)


Test set: Average loss: 0.0005, Accuracy: 8171/10000 (81.71%)


Test set: Average loss: 0.0005, Accuracy: 8305/10000 (83.05%)


Test set: Average loss: 0.0005, Accuracy: 8290/10000 (82.90%)


Test set: Average loss: 0.0004, Accuracy: 8413/10000 (84.13%)


Test set: Average loss: 0.0005, Accuracy: 8313/10000 (83.13%)


Test set: Average loss: 0.0004, Accuracy: 8474/10000 (84.74%)


Test set: Average loss: 0.0004, Accuracy: 8435/10000 (84.35%)


Test set: Average loss: 0.0004, Accuracy: 8515/10000 (85.15%)


Test set: Average loss: 0.0004, Accuracy: 8511/10000 (85.11%)


Test set: Average loss: 0.0004, Accuracy: 8398/10000 (83.98%)


Test set: Average loss: 0.0004, Accuracy: 8591/10000 (85.91%)


Test set: Average loss: 0.0004, Accuracy: 8514/10000 (85.14%)


Test set: Average loss: 0.0004, Accuracy: 8587/10000 (85.87%)




Test set: Average loss: 0.0004, Accuracy: 8614/10000 (86.14%)


Test set: Average loss: 0.0004, Accuracy: 8528/10000 (85.28%)


Test set: Average loss: 0.0004, Accuracy: 8600/10000 (86.00%)


Test set: Average loss: 0.0004, Accuracy: 8610/10000 (86.10%)


Test set: Average loss: 0.0004, Accuracy: 8626/10000 (86.26%)


Test set: Average loss: 0.0004, Accuracy: 8586/10000 (85.86%)



In [42]:
# momentum = 0.00000001

args2_07 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=10, lr=0.001, momentum=0.00000001, log_interval=100)
model2_07 = training_2(args2_07)


Test set: Average loss: 0.0015, Accuracy: 6073/10000 (60.73%)


Test set: Average loss: 0.0011, Accuracy: 6483/10000 (64.83%)


Test set: Average loss: 0.0009, Accuracy: 6829/10000 (68.29%)


Test set: Average loss: 0.0008, Accuracy: 7099/10000 (70.99%)


Test set: Average loss: 0.0007, Accuracy: 7337/10000 (73.37%)


Test set: Average loss: 0.0007, Accuracy: 7494/10000 (74.94%)


Test set: Average loss: 0.0007, Accuracy: 7621/10000 (76.21%)


Test set: Average loss: 0.0006, Accuracy: 7729/10000 (77.29%)


Test set: Average loss: 0.0006, Accuracy: 7842/10000 (78.42%)


Test set: Average loss: 0.0006, Accuracy: 7935/10000 (79.35%)



In [43]:
# momentum = 0.5

args2_08 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=10, lr=0.001, momentum=0.5, log_interval=100)
model2_08 = training_2(args2_08)


Test set: Average loss: 0.0011, Accuracy: 6434/10000 (64.34%)


Test set: Average loss: 0.0008, Accuracy: 7047/10000 (70.47%)


Test set: Average loss: 0.0007, Accuracy: 7474/10000 (74.74%)


Test set: Average loss: 0.0006, Accuracy: 7727/10000 (77.27%)


Test set: Average loss: 0.0006, Accuracy: 7872/10000 (78.72%)


Test set: Average loss: 0.0006, Accuracy: 7987/10000 (79.87%)


Test set: Average loss: 0.0005, Accuracy: 8057/10000 (80.57%)


Test set: Average loss: 0.0005, Accuracy: 8111/10000 (81.11%)


Test set: Average loss: 0.0005, Accuracy: 8167/10000 (81.67%)


Test set: Average loss: 0.0005, Accuracy: 8183/10000 (81.83%)



In [44]:
# momentum = 0.999

args2_09 = u4.SimpleNamespace(batch_size=64, test_batch_size=1000, epochs=10, lr=0.001, momentum=0.999, log_interval=100)
model2_09 = training_2(args2_09)


Test set: Average loss: 0.0010, Accuracy: 7491/10000 (74.91%)


Test set: Average loss: 0.0012, Accuracy: 7287/10000 (72.87%)


Test set: Average loss: 0.0016, Accuracy: 6023/10000 (60.23%)


Test set: Average loss: 0.0063, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: 0.0072, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: 0.0036, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: 0.0055, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: 0.0073, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: 0.0079, Accuracy: 1000/10000 (10.00%)


Test set: Average loss: 0.0061, Accuracy: 1000/10000 (10.00%)



# Answer
- In this case, a low learning rate leads to the same results as before. Interestingly, and I couldn't find a good reason for this, with a learning rate of 0.08 the loss suddenly grows to values that seem to be so large that they are reported as "nan".
- With respect to the epochs, the observations are the same as above.
- Observations regarding the momentum are also the same as above. The only difference is that with a very high momentum the accuracy doesn't "jump" but steadily decreases from epoch to epoch. Moreover, from a certain point of training on, the accuracy stays at 10% in every epoch, which is chance level for the ten classes.
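
One possible explanation for the "nan" losses, offered as an assumption and not verified against `u4_utils` or the actual network, is plain numerical divergence: if the step size is too large for the curvature of the loss, each update overshoots, the parameters grow geometrically, overflow to infinity, and the next arithmetic step produces "nan". The sketch below illustrates this on a toy quadratic loss using a PyTorch-style SGD-with-momentum update. The function name `final_loss` and the curvature value are hypothetical and chosen only for illustration; the two learning rates are the ones used in the experiments above.

```python
import math


def final_loss(lr, momentum=0.9, curvature=100.0, steps=2000, w0=1.0):
    """Minimise f(w) = 0.5 * curvature * w**2 with SGD plus momentum.

    Toy stand-in for a real training loop; the curvature is an
    illustrative assumption, not a property of the Fashion-MNIST model.
    """
    w, velocity = w0, 0.0
    for _ in range(steps):
        grad = curvature * w                     # derivative of the quadratic
        velocity = momentum * velocity + grad    # accumulate past gradients
        w = w - lr * velocity                    # parameter update
    return 0.5 * curvature * w * w


stable_loss = final_loss(lr=0.001)    # small steps: the iterates shrink towards 0
diverged_loss = final_loss(lr=0.08)   # large steps: the iterates blow up

print("lr=0.001 ->", stable_loss)
print("lr=0.08  ->", diverged_loss, "(finite:", math.isfinite(diverged_loss), ")")
```

On this toy problem the small learning rate converges, while the large one leaves the range of floating-point numbers. Whether the same mechanism causes the "nan" values in the dense network would have to be checked, for example by printing the loss of the first few batches.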

As already discussed during the lecture and experimented with during the <i>last</i> assignment, simply <i>inverting</i> the original images on which the model was trained may already be enough to break it. To show this behavior, perform the following steps:

- Set the <i>random seed</i> to $s = 42$ using the <i>PyTorch</i> interface.
- Load the <i>Fashion-MNIST</i> data set with a <code>flip_probability</code> of $p = 1$.
- Display the first <i>eight</i> images of the <i>Fashion-MNIST</i> data set.

Can you identify possible <i>labels</i> of the eight images? How do they differ from the previous visualization?

- Evaluate the previously trained <i>logistic regression</i> model on the flipped data set.
- Evaluate the previously trained <i>dense feedforward neural network</i> on the flipped data set.

If you experiment with different <i>hyperparameter settings</i> with respect to the original data set, do the performances differ when tested on the <i>flipped</i> data set?
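
To make the idea of inverting concrete, here is a minimal sketch of what a flip with a given probability could look like. It assumes that pixel values lie in $[0, 1]$ and that flipping means replacing each pixel $x$ by $1 - x$; how `u4.load_fashion_mnist` implements `flip_probability` internally is not shown in this notebook, so the helper `maybe_invert` is purely hypothetical.

```python
import numpy as np


def maybe_invert(images, flip_probability, rng):
    """Invert each image (pixel -> 1 - pixel) independently with the given probability.

    Hypothetical helper; assumes pixel values in [0, 1].
    """
    images = np.asarray(images, dtype=np.float32)
    # One random draw per image decides whether that image is inverted.
    flip = rng.random(images.shape[0]) < flip_probability
    out = images.copy()
    out[flip] = 1.0 - out[flip]
    return out


rng = np.random.default_rng(42)
batch = rng.random((8, 28, 28)).astype(np.float32)   # eight random stand-in "images"

inverted = maybe_invert(batch, flip_probability=1.0, rng=rng)   # every image inverted
unchanged = maybe_invert(batch, flip_probability=0.0, rng=rng)  # no image inverted
```

With a probability of $p = 1$ every image is inverted, so dark backgrounds become bright and vice versa, while the shapes of the objects stay the same.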

In [49]:
torch.manual_seed(42)
train_set_flipped, test_set_flipped = u4.load_fashion_mnist(flip_probability=1)

In [51]:
u4.display_FashionMNIST(train_set_flipped)


# Answer
- Picture 1: Shoe
- Picture 2: T-Shirt
- Picture 3: Dress
- Picture 4: Dress
- Picture 5: Shoe
- Picture 6: Backpack
- Picture 7: Shoe
- Picture 8: Longsleeve

Some pictures, such as no. 4 or the backpack in no. 6, could be interpreted differently than their "unflipped" versions.

In [47]:
test_loader_flipped = torch.utils.data.DataLoader(test_set_flipped, batch_size=1000, shuffle=True)

In [52]:
print(f"LogisticRegression(), Model: model_1\n, Arguments: {args00}")
u4.test(args00, model_1, device, test_loader_flipped, input_dim)

print(f"LogisticRegression(), Model: model1_01, Arguments: {args01}\n")
u4.test(args01, model1_01, device, test_loader_flipped, input_dim)

print(f"LogisticRegression(), Model: model1_02, Arguments: {args02}\n")
u4.test(args02, model1_02, device, test_loader_flipped, input_dim)

print(f"LogisticRegression(), Model: model1_03, Arguments: {args03}\n")
u4.test(args03, model1_03, device, test_loader_flipped, input_dim)

print(f"LogisticRegression(), Model: model1_04, Arguments: {args04}\n")
u4.test(args04, model1_04, device, test_loader_flipped, input_dim)

print(f"LogisticRegression(), Model: model1_05, Arguments: {args05}\n")
u4.test(args05, model1_05, device, test_loader_flipped, input_dim)

print(f"LogisticRegression(), Model: model1_06, Arguments: {args06}\n")
u4.test(args06, model1_06, device, test_loader_flipped, input_dim)

print(f"LogisticRegression(), Model: model1_07, Arguments: {args07}\n")
u4.test(args07, model1_07, device, test_loader_flipped, input_dim)

print(f"LogisticRegression(), Model: model1_08, Arguments: {args08}\n")
u4.test(args08, model1_08, device, test_loader_flipped, input_dim)

print(f"LogisticRegression(), Model: model1_09, Arguments: {args09}\n")
u4.test(args09, model1_09, device, test_loader_flipped, input_dim)


print(f"DenseNeuralNet(), Model: model_2, Arguments: {args2_00}")
u4.test(args2_00, model_2, device, test_loader_flipped, input_dim)

print(f"DenseNeuralNet(), Model: model2_01, Arguments: {args2_01}\n")
u4.test(args2_01, model2_01, device, test_loader_flipped, input_dim)

print(f"DenseNeuralNet(), Model: model2_02, Arguments: {args2_02}\n")
u4.test(args2_02, model2_02, device, test_loader_flipped, input_dim)

print(f"DenseNeuralNet(), Model: model2_03, Arguments: {args2_03}\n")
u4.test(args2_03, model2_03, device, test_loader_flipped, input_dim)

print(f"DenseNeuralNet(), Model: model2_04, Arguments: {args2_04}\n")
u4.test(args2_04, model2_04, device, test_loader_flipped, input_dim)

print(f"DenseNeuralNet(), Model: model2_05, Arguments: {args2_05}\n")
u4.test(args2_05, model2_05, device, test_loader_flipped, input_dim)

print(f"DenseNeuralNet(), Model: model2_06, Arguments: {args2_06}\n")
u4.test(args2_06, model2_06, device, test_loader_flipped, input_dim)

print(f"DenseNeuralNet(), Model: model2_07, Arguments: {args2_07}\n")
u4.test(args2_07, model2_07, device, test_loader_flipped, input_dim)

print(f"DenseNeuralNet(), Model: model2_08, Arguments: {args2_08}\n")
u4.test(args2_08, model2_08, device, test_loader_flipped, input_dim)

print(f"DenseNeuralNet(), Model: model2_09, Arguments: {args2_09}\n")
u4.test(args2_09, model2_09, device, test_loader_flipped, input_dim)

LogisticRegression(), Model: model_1
, Arguments: namespace(batch_size=64, epochs=10, log_interval=100, lr=0.001, momentum=0.9, seed=42, test_batch_size=1000)

Test set: Average loss: 0.0024, Accuracy: 3370/10000 (33.70%)

LogisticRegression(), Model: model1_01, Arguments: namespace(batch_size=64, epochs=10, log_interval=100, lr=1e-10, momentum=0.9, test_batch_size=1000)


Test set: Average loss: 0.0023, Accuracy: 1044/10000 (10.44%)

LogisticRegression(), Model: model1_02, Arguments: namespace(batch_size=64, epochs=10, log_interval=100, lr=1, momentum=0.9, test_batch_size=1000)


Test set: Average loss: 0.0487, Accuracy: 1697/10000 (16.97%)

LogisticRegression(), Model: model1_03, Arguments: namespace(batch_size=64, epochs=10, log_interval=100, lr=10000, momentum=0.9, test_batch_size=1000)


Test set: Average loss: 564.3575, Accuracy: 2073/10000 (20.73%)

LogisticRegression(), Model: model1_04, Arguments: namespace(batch_size=64, epochs=1, log_interval=100, lr=0.001, momentum=0.9, tes
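
The twenty nearly identical evaluation calls above could also be written as a loop. The following is a minimal sketch under stated assumptions: `evaluate_all` is a hypothetical helper, and the demo uses placeholder objects so that it runs without `u4_utils`.

```python
from types import SimpleNamespace


def evaluate_all(runs, test_fn):
    """Print a header for every (args, model) pair and call test_fn on it.

    runs:    dict mapping a display name to an (args, model) tuple
    test_fn: callable taking (args, model); its return value is collected
    """
    results = {}
    for name, (args, model) in runs.items():
        print(f"Model: {name}, Arguments: {args}\n")
        results[name] = test_fn(args, model)
    return results


# Placeholder runs so the sketch is self-contained (not real models).
demo_runs = {
    "model_a": (SimpleNamespace(lr=0.001, epochs=10), "model-a-placeholder"),
    "model_b": (SimpleNamespace(lr=1.0, epochs=10), "model-b-placeholder"),
}
demo_results = evaluate_all(demo_runs, lambda args, model: (args.lr, model))
```

In this notebook, `runs` would hold entries such as `"model1_01": (args01, model1_01)`, and `test_fn` would wrap the call used above, for example `lambda args, model: u4.test(args, model, device, test_loader_flipped, input_dim)`. Whether `u4.test` returns a value is not visible here, so the collected results may simply be `None`.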

# Answer
Yes!

- At first glance, the LogisticRegression model achieves slightly better results than the DenseNeuralNet. This could be because the DenseNeuralNet has more layers and is therefore more complex and less flexible. The LogisticRegression model, on the other hand, is simpler and can be used more flexibly (same principle as over- and underfitting).
- Moreover, fewer epochs also lead to better results on the flipped test set. In my opinion, this is because with fewer epochs the model is fitted less closely to the training set and is therefore more flexible.
- Regarding the LogisticRegression model, a higher learning rate leads to a better result on the flipped test set (with the exception of the learning rate 0.001, where the results are the best). This is not the case for the DenseNeuralNet because, as seen above, beyond a certain learning rate the model performs very badly in general and therefore also on the flipped test set.
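
As a complementary view on why inverted images break a linear model, the sketch below checks a small identity numerically. It assumes pixel values in $[0, 1]$, inversion as $x \mapsto 1 - x$, and a single linear layer with logits $Wx + b$; the weights here are random placeholders, not the trained `model_1`. Under these assumptions, $W(1 - x) + b = (W\mathbf{1} + 2b) - (Wx + b)$, so the input-dependent part of every logit changes sign.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 784))   # placeholder weights: 10 classes, 28*28 pixels
b = rng.normal(size=10)          # placeholder biases
x = rng.random(784)              # one stand-in image with pixels in [0, 1)

logits_original = W @ x + b
logits_inverted = W @ (1.0 - x) + b

# Constant offset that does not depend on the image content.
offset = W.sum(axis=1) + 2.0 * b

# The inverted logits equal the offset minus the original logits.
identity_holds = np.allclose(logits_inverted, offset - logits_original)
print("identity holds:", identity_holds)
```

Evidence that spoke for a class on the original image therefore speaks against it on the inverted one, up to a constant per class. If the data set is normalized differently, the offset changes, but the sign flip of the input-dependent term remains.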