<div style='background-image: url("../share/header_no_text.svg") ; padding: 0px ; background-size: cover ; border-radius: 5px ; height: 250px'>
    <div style="float: right ; margin: 50px ; padding: 20px ; background: rgba(255 , 255 , 255 , 0.7) ; width: 50% ; height: 150px">
        <div style="position: relative ; top: 50% ; transform: translatey(-50%)">
            <div style="font-size: xx-large ; font-weight: 900 ; color: rgba(0 , 0 , 0 , 0.8) ; line-height: 100%">Machine Learning</div>
            <div style="font-size: large ; padding-top: 20px ; color: rgba(0 , 0 , 0 , 0.5)">A Fully Connected Neural Network From Scratch</div>
        </div>
    </div>
</div>

##### Authors:
* Lion Krischer ([@krischer](https://github.com/krischer))

---

In [None]:
%matplotlib inline

# We only need matplotlib and numpy here.
import matplotlib.pyplot as plt
import numpy as np

---------

# Step 1: Activation Function

Let us first implement the activation function. For reasons of simplicity we will use the sigmoid activation function here. Recall that the formula is

$$
\mathrm{sigmoid}(x) = \sigma(x) = \frac{1}{1 + e^{-x}}
$$

### Exercise A: Sigmoid Function

Implement this function and plot it in the interval $[-10, 10]$. Please use `numpy`'s `np.exp()` function.

In [None]:
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10, 10, 1000)
plt.plot(x, sigmoid(x));
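
The straightforward implementation is fine for the interval plotted here. For inputs with a very large negative value, `np.exp(-x)` overflows and NumPy emits a warning. The following is a minimal sketch of one common workaround, assuming array inputs. The name `stable_sigmoid` is hypothetical and is not used elsewhere in this notebook.

```python
import numpy as np


def stable_sigmoid(x):
    """Sigmoid that only ever calls np.exp() with non-positive arguments."""
    # Hypothetical helper: scalars are promoted to 1-D arrays.
    x = np.atleast_1d(np.asarray(x, dtype=float))
    out = np.empty_like(x)
    pos = x >= 0
    # For x >= 0, exp(-x) <= 1, so the textbook formula is safe.
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    # For x < 0, use the equivalent form exp(x) / (1 + exp(x)).
    exp_x = np.exp(x[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out


print(stable_sigmoid([-1000.0, -1.0, 0.0, 1.0, 1000.0]))
```

Both branches are algebraically identical to the formula above. Only the argument passed to `np.exp()` differs.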

----

# Step 2: Forward Pass

The goal is to implement this simple neural network from the lecture. We will later on use this to approximate a simple function of one variable. The first step is to code up the forward pass.

![Simple network](images/simple_network.png)

### Exercise B: Single Layer

For now let us focus on the middle part:

![Simple network middle](images/simple_network_middle.png)

Given the inputs $x_1$, $x_2$, and $x_3$ and the weights $\omega_i$, $i = 1, \dots, 6$, compute $y_1$ and $y_2$. Remember to also apply the sigmoid function to the outputs.

In [None]:
x = [-0.5, 0.1, 0.3]
w = [0.0, 1.0, -2.0, 3.0, -4.0, 5.0]

y1 = sigmoid(x[0] * w[0] + x[1] * w[2] + x[2] * w[4])
y2 = sigmoid(x[0] * w[1] + x[1] * w[3] + x[2] * w[5])

print(y1, y2)

You probably realized that this is suspiciously similar to a matrix multiplication. If you have not already done so, rewrite the whole operation as a matrix multiplication. Remember that you need to convert the lists to `numpy` arrays. To perform a matrix multiplication use the `@` operator, e.g. `A @ B` computes the matrix product of `A` and `B`.

In [None]:
W = np.array(w).reshape((3, 2)).T
sigmoid(W @ x)
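
To see why the reshape and transpose are needed, it can help to print the resulting matrix and compare both variants. The following sketch repeats the two computations above and checks that they agree. Nothing here is new apart from the comparison itself.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


x = np.array([-0.5, 0.1, 0.3])
w = [0.0, 1.0, -2.0, 3.0, -4.0, 5.0]

# Explicit version, one output at a time.
y1 = sigmoid(x[0] * w[0] + x[1] * w[2] + x[2] * w[4])
y2 = sigmoid(x[0] * w[1] + x[1] * w[3] + x[2] * w[5])

# Matrix version. Row 0 holds the weights feeding y1 (w[0], w[2], w[4]),
# row 1 holds the weights feeding y2 (w[1], w[3], w[5]).
W = np.array(w).reshape((3, 2)).T
y = sigmoid(W @ x)

print(W)
print(np.allclose(y, [y1, y2]))
```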

### Exercise C: Full Forward Pass

Now implement the full forward pass - do it without bias units for now.

To get comparable results, please use an input of 0.5 and initialize all weights to 2.0 (use `np.ones(shape) * 2.0`).

In [None]:
inputs = [0.5]

# Initialize the weights. Writing it down
# should convince you that the first dimension
# must be equal to the number of outputs and
# the second equal to the number of inputs.
weights_1 = np.ones((3, 1)) * 2.0
weights_2 = np.ones((2, 3)) * 2.0
weights_3 = np.ones((1, 2)) * 2.0

# Now just apply all the weights and
# the activation function.
at_hidden_1 = sigmoid(weights_1 @ inputs)
at_hidden_2 = sigmoid(weights_2 @ at_hidden_1)
# No activation function for the last layer!
output = weights_3 @ at_hidden_2

print(output)
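
The same forward pass can also be written as a loop over a list of weight matrices, which is the idea the next step builds on. This is only a sketch: the helper name `forward_pass` is hypothetical, and it assumes, as above, that every layer except the last one uses the sigmoid.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def forward_pass(inputs, weights):
    """Hypothetical helper: apply each weight matrix in turn."""
    out = np.asarray(inputs, dtype=float)
    # Sigmoid after every layer but the last one.
    for W in weights[:-1]:
        out = sigmoid(W @ out)
    # No activation function for the last layer.
    return weights[-1] @ out


weights = [np.ones((3, 1)) * 2.0,
           np.ones((2, 3)) * 2.0,
           np.ones((1, 2)) * 2.0]

output = forward_pass([0.5], weights)
print(output)
```

Because all weights are equal, the result can be checked by hand: every neuron in the first hidden layer outputs `sigmoid(1.0)`, every neuron in the second outputs `sigmoid(6.0 * sigmoid(1.0))`, and the output is four times that value.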

# Step 3: A Simple Neural Network Framework

Two lessons to draw from this so far:

* Fully connected layers can be computed by dense matrix multiplications. This is also one of the reasons why they run so fast on GPUs.
* When actually performing the computations, a "layer" is not the set of neurons itself but the operations that happen between one set of neurons and the next.

The following is a pre-made implementation of the forward pass of the desired neural network. It is very similar to what we just did, with some additional bookkeeping and a slightly nicer structure.

In [None]:
class FullyConnectedLayer:
    def __init__(self,
                 # The type hints after the colon are a newish Python feature
                 # and fully optional. But they document the expected type
                 # and thus help clarify the code.
                 input_size: int,
                 output_size: int,
                 bias_units: bool,
                 activation_function=None):
        """
        This function is called upon object creation.
        """
        # Initialize the weights. The matrix must always be
        # an M x (N + 1) matrix. M rows, one for each output, and
        # N + 1 columns, one for each input + one for the bias.
        #
        # Here we just initialize with a normal distribution with
        # zero mean and a standard deviation of 0.1. Small initial
        # weights result in faster initial training!
        self.bias_units = bias_units
        if bias_units:
            i_size = input_size + 1
        else:
            i_size = input_size
        self.weights = np.random.randn(output_size, i_size) * 0.1
        
        # Also set the activation function.
        self.activation_function = activation_function
        

    def forward(self, inputs: np.ndarray) -> np.ndarray:
        """
        This function performs the forward pass for that layer given
        a set of inputs.
        """
        # Add the bias at the end of the inputs.
        if self.bias_units:
            inputs = np.pad(inputs, pad_width=((0, 0), (0, 1)),
                            mode="constant", constant_values=(0.0, 1.0))
            
        # Keep the inputs around. We need them for the backwards pass.
        self._forward_inputs = inputs
        
        # Perform the matrix multiplication and apply the
        # activation function.
        #
        # Note that we swap the order of the operands here to
        # be able to treat the first dimension as the "sample"
        # dimension.
        out = inputs @ self.weights.T
        
        # Last but not least apply the activation function.
        if self.activation_function:
            out = self.activation_function(out)
            
        # We also need the outputs for the backwards run.
        self._forward_outputs = out
        
        return out
    
    
class NeuralNetwork:
    def __init__(self):
        self.layers = []
    
    def add_layer(self, layer: FullyConnectedLayer):
        self.layers.append(layer)
        
    def forward(self, inputs: np.ndarray) -> np.ndarray:
        """
        Just apply the forward operator of each layer.
        """
        inputs = np.asarray(inputs)
        out = self.layers[0].forward(inputs)
        for l in self.layers[1:]:
            out = l.forward(out)
        return out
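
The `np.pad` call in `forward` is the least obvious line of the class. The following sketch shows what it does on a small example and why the last column of the weight matrix then acts as the bias. The sample values and the variable names `padded`, `with_bias`, and `explicit` are made up for illustration.

```python
import numpy as np

# Two samples with two input values each.
inputs = np.array([[0.5, -1.0],
                   [2.0, 3.0]])

# Pad nothing along the sample axis and one column at the end of the
# feature axis. constant_values=(0.0, 1.0) means "0.0 before, 1.0 after",
# so the appended column is filled with ones.
padded = np.pad(inputs, pad_width=((0, 0), (0, 1)),
                mode="constant", constant_values=(0.0, 1.0))
print(padded)

# One output neuron: two ordinary weights and, in the last column, the bias.
weights = np.array([[2.0, 3.0, 10.0]])

with_bias = padded @ weights.T
explicit = inputs @ weights[:, :-1].T + weights[:, -1]
print(np.allclose(with_bias, explicit))
```

Treating the bias as one more weight attached to a constant input of one keeps the forward pass a single matrix multiplication.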

In [None]:
# Now remember that each "layer" denotes the connections
# between two sets of neurons.
fc1 = FullyConnectedLayer(input_size=1, output_size=3,
                          bias_units=False,
                          activation_function=sigmoid)
fc2 = FullyConnectedLayer(input_size=3, output_size=2,
                          bias_units=False,
                          activation_function=sigmoid)
fc3 = FullyConnectedLayer(input_size=2, output_size=1, 
                          bias_units=False)


# Add each layer to the neural network.
nn = NeuralNetwork()
nn.add_layer(fc1)
nn.add_layer(fc2)
nn.add_layer(fc3)

# Now we do a bit of a hack to get the same result
# as previously.
fc1.weights[:] = 2.0
fc2.weights[:] = 2.0
fc3.weights[:] = 2.0

# Our input data has to have the shape
# (NUMBER_OF_SAMPLES, POINTS PER SAMPLE)
inputs = np.array([0.5]).reshape(1, 1)
print(nn.forward(inputs))

# This enables us to perform the operation
# for many data samples at once.
inputs = np.array([[0.5],
                   [-10.0],
                   [0.5],
                   [0.0]])
print(nn.forward(inputs))
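
The reason many samples can be processed at once is that `inputs @ weights.T` computes the same numbers as applying `weights @ x` to each sample separately. A small sketch with random data illustrates this. The shapes and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

W = rng.normal(size=(3, 2))  # 3 outputs, 2 inputs
X = rng.normal(size=(4, 2))  # 4 samples, 2 values per sample

# All samples at once, first dimension is the sample dimension.
batched = X @ W.T

# One sample at a time, as in Step 2.
per_sample = np.array([W @ x for x in X])

print(batched.shape)
print(np.allclose(batched, per_sample))
```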

# Step 4: Back-propagation

The first thing we need is the derivative of whatever activation function we chose.

In [None]:
def sigmoid(x, derivative=False):
    # Not the true derivative but the derivative,
    # assuming x is already sigmoid(value).
    if derivative:
        return x * (1 - x)
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-10, 10, 1000)
plt.plot(x, sigmoid(x));
plt.show()
plt.plot(x, sigmoid(sigmoid(x), derivative=True));
plt.show()
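
A quick way to check that `x * (1 - x)`, applied to the sigmoid output, really is the derivative of the sigmoid is to compare it with a central finite difference. This is a sketch; the step size `h` is an arbitrary choice.

```python
import numpy as np


def sigmoid(x, derivative=False):
    # With derivative=True, x is assumed to already be sigmoid(value).
    if derivative:
        return x * (1 - x)
    return 1.0 / (1.0 + np.exp(-x))


x = np.linspace(-10, 10, 1000)
h = 1e-5

# Central finite difference approximation of d(sigmoid)/dx.
numerical = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
# Analytical derivative, evaluated on the sigmoid output.
analytical = sigmoid(sigmoid(x), derivative=True)

max_error = np.max(np.abs(numerical - analytical))
print(max_error)
```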

Then we implement the backwards pass in the fully connected layer class explained previously.

In [None]:
class FullyConnectedLayer:
    def __init__(self,
                 # The type hints after the colon are a newish Python feature
                 # and fully optional. But they document the expected type
                 # and thus help clarify the code.
                 input_size: int,
                 output_size: int,
                 bias_units: bool,
                 activation_function=None):
        """
        This function is called upon object creation.
        """
        # Initialize the weights. The matrix must always be
        # an M x (N + 1) matrix. M rows, one for each output, and
        # N + 1 columns, one for each input + one for the bias.
        #
        # Here we just initialize with a normal distribution with
        # zero mean and a standard deviation of 0.1. Small initial
        # weights result in faster initial training!
        self.bias_units = bias_units
        if bias_units:
            i_size = input_size + 1
        else:
            i_size = input_size
        self.weights = np.random.randn(output_size, i_size) * 0.1
        
        # Also set the activation function.
        self.activation_function = activation_function
        

    def forward(self, inputs: np.ndarray) -> np.ndarray:
        """
        This function performs the forward pass for that layer given
        a set of inputs.
        """
        # Add the bias at the end of the inputs.
        if self.bias_units:
            inputs = np.pad(inputs, pad_width=((0, 0), (0, 1)),
                            mode="constant", constant_values=(0.0, 1.0))
            
        # Keep the inputs around. We need them for the backwards pass.
        self._forward_inputs = inputs
        
        # Perform the matrix multiplication and apply the
        # activation function.
        #
        # Note that we swap the order of the operands here to
        # be able to treat the first dimension as the "sample"
        # dimension.
        out = inputs @ self.weights.T
        
        # Last but not least apply the activation function.
        if self.activation_function:
            out = self.activation_function(out)
            
        # We also need the outputs for the backwards run.
        self._forward_outputs = out
        
        return out
        
    def backward(self, g: np.ndarray) -> np.ndarray:
        """
        Backpropagate any passed gradients and store the
        gradients with respect to the weights.
        """
        # Backprop through the activation function if any.
        if self.activation_function:
            g = self.activation_function(
                self._forward_outputs, derivative=True) * g
            
        # Store the gradients as they are needed for the
        # optimization pass.
        self._gradient_weights = g.T @ self._forward_inputs
        
        # Apply the weights for the back-propagation to the
        # previous layer.
        bp = g @ self.weights
        
        if self.bias_units:
            # No need to back-propagate the bias weights.
            return bp[:, :-1]
        else:
            return bp
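
The `backward` method is the chain rule written in matrix form. As a sketch, with symbols introduced here only for illustration (they do not appear in the code): let $X$ be the stored inputs with one sample per row, $W$ the weight matrix, $A = \sigma(X W^T)$ the stored outputs, and $G = \partial L / \partial A$ the gradient handed in from the following layer. Then

$$
\delta = G \odot A \odot (1 - A), \qquad
\frac{\partial L}{\partial W} = \delta^T X, \qquad
\frac{\partial L}{\partial X} = \delta W
$$

where $\odot$ denotes the element-wise product. The three expressions correspond to the updated `g`, to `self._gradient_weights`, and to `bp`, respectively. For a layer without an activation function, $\delta = G$. With bias units, the last column of $\partial L / \partial X$ belongs to the constant input of one, which is why it is dropped before returning.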

And our neural network class needs to gain the ability to perform a backwards pass.

In [None]:
class NeuralNetwork:
    def __init__(self):
        self.layers = []
    
    def add_layer(self, layer: FullyConnectedLayer):
        self.layers.append(layer)
        
    def forward(self, inputs: np.ndarray) -> np.ndarray:
        """
        Just apply the forward operator of each layer.
        """
        inputs = np.asarray(inputs)
        out = self.layers[0].forward(inputs)
        for l in self.layers[1:]:
            out = l.forward(out)
        return out
    
    def backward(self, outputs: np.ndarray) -> np.ndarray:
        """
        Apply the backwards pass of each layer.
        """
        g = np.asarray(outputs)
        g = self.layers[-1].backward(g)
        for l in reversed(self.layers[:-1]):
            g = l.backward(g)
        return g

Now demonstrate the usage.

In [None]:
# Now remember that each "layer" denotes the connections
# between two sets of neurons.
fc1 = FullyConnectedLayer(input_size=1, output_size=3,
                          bias_units=True,
                          activation_function=sigmoid)
fc2 = FullyConnectedLayer(input_size=3, output_size=2,
                          bias_units=True,
                          activation_function=sigmoid)
fc3 = FullyConnectedLayer(input_size=2, output_size=1, 
                          bias_units=True)


# Add each layer to the neural network.
nn = NeuralNetwork()
nn.add_layer(fc1)
nn.add_layer(fc2)
nn.add_layer(fc3)


# Our input data has to have the shape
# (NUMBER_OF_SAMPLES, POINTS PER SAMPLE)
inputs = np.array([0.5]).reshape(1, 1)
print(nn.backward(nn.forward(inputs)))

# This enables us to perform the operation
# for many data samples at once.
inputs = np.array([[0.5],
                   [-10.0],
                   [0.5],
                   [0.0]])
print(nn.backward(nn.forward(inputs)))
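
A common way to gain confidence in a backward pass is to compare the stored weight gradients with finite differences. The following self-contained sketch does this for a single sigmoid layer without bias units, repeating the same steps as `backward`. The squared error loss, the random data, and all variable names are assumptions made for this check.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(42)
X = rng.normal(size=(5, 3))        # 5 samples, 3 inputs
W = rng.normal(size=(2, 3)) * 0.1  # 2 outputs, 3 inputs
T = rng.normal(size=(5, 2))        # arbitrary targets


def loss(weights):
    A = sigmoid(X @ weights.T)
    return 0.5 * ((A - T) ** 2).sum()


# Analytical gradient, following the steps of backward().
A = sigmoid(X @ W.T)
g = A - T              # derivative of the loss w.r.t. the outputs
g = A * (1 - A) * g    # back through the sigmoid
analytical = g.T @ X   # gradient w.r.t. the weights

# Numerical gradient, one weight at a time.
h = 1e-6
numerical = np.zeros_like(W)
for idx in np.ndindex(*W.shape):
    W_plus = W.copy()
    W_minus = W.copy()
    W_plus[idx] += h
    W_minus[idx] -= h
    numerical[idx] = (loss(W_plus) - loss(W_minus)) / (2 * h)

max_error = np.max(np.abs(analytical - numerical))
print(max_error)
```

If the two gradients disagree by much more than the finite-difference error, the backward pass most likely contains a mistake.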

# Step 5: Gradient Descent

### Exercise D: Perform gradient descent on $f(x) = x^2$

or $f(x, y) = x^2 y^2$, $f(x) = \sin(x)$, or some other function of choice.

In [None]:
# Starting point.
x = -4
step_length = 0.1
niter = 100

for _ in range(niter):
    grad_x = 2 * x
    x += -grad_x * step_length
    
print(x, x**2)
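
For $f(x) = x^2$ each update multiplies $x$ by $(1 - 2 \cdot \text{step length})$, so the choice of step length decides whether the iteration converges. The following sketch tries a few values. The helper name `gradient_descent` and the chosen step lengths are just for illustration.

```python
def gradient_descent(x0, step_length, niter):
    """Hypothetical helper: gradient descent on f(x) = x**2."""
    x = x0
    for _ in range(niter):
        grad_x = 2 * x
        x -= step_length * grad_x
    return x


for step_length in (0.1, 0.5, 1.0, 1.1):
    x_final = gradient_descent(-4.0, step_length, niter=20)
    print(f"step length {step_length}: x = {x_final}")
```

A step length of 0.1 shrinks $x$ by a factor of 0.8 per iteration, 0.5 jumps straight to the minimum, 1.0 flips the sign forever without making progress, and 1.1 diverges.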

In [None]:
# Starting point.
x = -4
y = 3.5

step_length = 0.01
niter = 1000

for _ in range(niter):
    grad_x = 2 * x * y ** 2
    grad_y = 2 * y * x ** 2
    x += -grad_x * step_length
    y += -grad_y * step_length

print(x, y, x ** 2 * y ** 2)

# Plot the function.
X, Y = np.meshgrid(np.linspace(10, -10, 200),
                   np.linspace(10, -10, 200))
plt.pcolormesh(X, Y, X ** 2 * Y ** 2)
plt.show()
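
The exercise also suggests $f(x) = \sin(x)$. A minimal sketch for that case follows, with the starting point and step length taken from the first example and the number of iterations chosen arbitrarily. Unlike $x^2$, this function has many minima, and gradient descent finds the one whose basin contains the starting point.

```python
import numpy as np

# Starting point.
x = -4.0
step_length = 0.1
niter = 1000

for _ in range(niter):
    # The derivative of sin(x) is cos(x).
    grad_x = np.cos(x)
    x += -grad_x * step_length

print(x, np.sin(x))
```

Starting from $-4$, the iteration settles at $-\pi/2$, where $\sin(x) = -1$. A starting point in a different basin would lead to a different, equally good, minimum.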

# Step 6: Synthesis

Tie everything together.

We will define a function here, draw some samples from it, and attempt to train a neural network to reproduce that function from the data.

In [None]:
def function(x):
    return np.sin(np.pi * x) / 2.0 + 0.5


# Training data. N samples uniformly distributed
# in the entire range.
N = 100
X = np.random.random((N, 1)) * 2.0
Y = function(X)


x = np.linspace(0, 2, 1000)
plt.plot(x, function(x), label="True")
plt.scatter(X, Y, color="red", label="Training Data")
plt.legend()

The following will be a copy of our neural network classes from before with a few subtle changes:

* The `NeuralNetwork` class now has the ability to compute a squared error loss function.
* The `FullyConnectedLayer` can update its weights using the gradients from the backwards pass.
* The actual optimization is performed outside of these classes for reasons of clarity.

In [None]:
class FullyConnectedLayer:
    def __init__(self,
                 # The type hints after the colon are a newish Python feature
                 # and fully optional. But they document the expected type
                 # and thus help clarify the code.
                 input_size: int,
                 output_size: int,
                 bias_units: bool,
                 activation_function=None):
        """
        This function is called upon object creation.
        """
        # Initialize the weights. The matrix must always be
        # an M x (N + 1) matrix. M rows, one for each output, and
        # N + 1 columns, one for each input + one for the bias.
        #
        # Here we just initialize with a normal distribution with
        # zero mean and a standard deviation of 0.1. Small initial
        # weights result in faster initial training!
        self.bias_units = bias_units
        if bias_units:
            i_size = input_size + 1
        else:
            i_size = input_size
        self.weights = np.random.randn(output_size, i_size) * 0.1
        
        # Also set the activation function.
        self.activation_function = activation_function
        

    def forward(self, inputs: np.ndarray) -> np.ndarray:
        """
        This function performs the forward pass for that layer given
        a set of inputs.
        """
        # Add the bias at the end of the inputs.
        if self.bias_units:
            inputs = np.pad(inputs, pad_width=((0, 0), (0, 1)),
                            mode="constant", constant_values=(0.0, 1.0))
            
        # Keep the inputs around. We need them for the backwards pass.
        self._forward_inputs = inputs
        
        # Perform the matrix multiplication and apply the
        # activation function.
        #
        # Note that we swap the order of the operands here to
        # be able to treat the first dimension as the "sample"
        # dimension.
        out = inputs @ self.weights.T
        
        # Last but not least apply the activation function.
        if self.activation_function:
            out = self.activation_function(out)
            
        # We also need the outputs for the backwards run.
        self._forward_outputs = out
        
        return out
        
    def backward(self, g: np.ndarray) -> np.ndarray:
        """
        Backpropagate any passed gradients and store the
        gradients with respect to the weights.
        """
        # Backprop through the activation function if any.
        if self.activation_function:
            g = self.activation_function(
                self._forward_outputs, derivative=True) * g
            
        # Store the gradients as they are needed for the
        # optimization pass.
        self._gradient_weights = g.T @ self._forward_inputs
        
        # Apply the weights for the back-propagation to the
        # previous layer.
        bp = g @ self.weights
        
        if self.bias_units:
            # No need to back-propagate the bias weights.
            return bp[:, :-1]
        else:
            return bp
        
    def apply_negative_gradients(self, step_length: float):
        """
        Applies the negative gradient to the weights with a given
        step length.
        """
        self.weights -= step_length * self._gradient_weights
        
        
class NeuralNetwork:
    def __init__(self):
        self.layers = []
    
    def add_layer(self, layer: FullyConnectedLayer):
        self.layers.append(layer)
        
    def forward(self, inputs: np.ndarray) -> np.ndarray:
        """
        Just apply the forward operator of each layer.
        """
        inputs = np.asarray(inputs)
        out = self.layers[0].forward(inputs)
        for l in self.layers[1:]:
            out = l.forward(out)
            
        return out
    
    def backward(self, outputs: np.ndarray) -> np.ndarray:
        """
        Apply the backwards pass of each layer.
        """
        g = np.asarray(outputs)
        g = self.layers[-1].backward(g)
        for l in reversed(self.layers[:-1]):
            g = l.backward(g)
        return g
    
    def loss(self,
             actual: np.ndarray,
             predicted: np.ndarray,
             derivative: bool = False):
        """
        Compute the squared error loss or the derivative
        thereof.
        """
        if derivative is False:
            return 0.5 * ((actual - predicted) ** 2).sum()
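        # For the derivative we need it per sample and
        # output. Note that the training loop below passes the
        # network output as `actual` and the targets as `predicted`,
        # so this evaluates to (output - target), which is the
        # derivative of the loss with respect to the network output.
        return (actual - predicted)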
    
    def apply_negative_gradients(self, step_length: float):
        """
        Apply the negative gradients of each layer.
        """
        for l in self.layers:
            l.apply_negative_gradients(step_length)
    

# Now remember that each "layer" denotes the connections
# between two sets of neurons.
fc1 = FullyConnectedLayer(input_size=1, output_size=3,
                          bias_units=True,
                          activation_function=sigmoid)
fc2 = FullyConnectedLayer(input_size=3, output_size=2,
                          bias_units=True,
                          activation_function=sigmoid)
fc3 = FullyConnectedLayer(input_size=2, output_size=1, 
                          bias_units=True)


# Add each layer to the neural network.
nn = NeuralNetwork()
nn.add_layer(fc1)
nn.add_layer(fc2)
nn.add_layer(fc3)


# Perform the numerical optimization.
niter = 10000
step_length = 0.01


# Loop over iterations.
for i in range(niter):
    # Perform the forward pass.
    predicted = nn.forward(X)
    
    if not i % 1000:
        print(f"Iteration {i}: Loss: {nn.loss(predicted, Y)}")
    
    # Play a bit with an adaptive step length.
    if i == 6000:
        step_length /= 2.0
    
    # Compute the derivative of the loss.
    d_l = nn.loss(predicted, Y, derivative=True)
    # Backpropagate it to store the gradients with respect
    # to the weights in each layer.
    nn.backward(d_l)
    # Update each layer and rinse and repeat.
    nn.apply_negative_gradients(step_length=step_length)
    

print(f"Iteration {i}: Loss: {nn.loss(predicted, Y)}") 
    
# Plot some predictions.
x = np.linspace(0, 2, 1000)
x = x.reshape(-1, 1)
plt.plot(x, function(x), label="true")
plt.plot(x, nn.forward(x), label="learned")
plt.legend()
plt.show()
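
The sign of the loss derivative matters: the training loop passes the network output as the first argument, so `actual - predicted` inside `loss` evaluates to output minus target. The following sketch checks with finite differences that this is the derivative of the squared error loss with respect to the network output. The function names and the random data are hypothetical.

```python
import numpy as np


def squared_error(output, target):
    return 0.5 * ((output - target) ** 2).sum()


def squared_error_derivative(output, target):
    # Derivative with respect to the network output, per sample.
    return output - target


rng = np.random.default_rng(1)
output = rng.normal(size=(4, 1))
target = rng.normal(size=(4, 1))

# Finite-difference derivative, one output value at a time.
h = 1e-6
numerical = np.zeros_like(output)
for idx in np.ndindex(*output.shape):
    plus = output.copy()
    minus = output.copy()
    plus[idx] += h
    minus[idx] -= h
    numerical[idx] = (squared_error(plus, target)
                      - squared_error(minus, target)) / (2 * h)

analytical = squared_error_derivative(output, target)
max_error = np.max(np.abs(analytical - numerical))
print(max_error)
```

With the opposite sign, the update in `apply_negative_gradients` would increase the loss instead of decreasing it.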

# Step 7: Don't be silly: Use a Library!

You might have noticed that there are many steps where one can go slightly wrong. Thankfully, good libraries for machine learning and neural networks exist. Do yourself a favor and use them! The following block contains a recreation of this whole notebook using two calls to `scikit-learn`. On Thursday and Friday you will meet other, even more powerful, neural network frameworks.

In [None]:
from sklearn.neural_network import MLPRegressor

# Define the network.
nn = MLPRegressor(
    hidden_layer_sizes=(3, 2),
    solver="sgd",
    # No batches.
    batch_size=N,
    # Logistic == sigmoid
    activation="logistic",
    # Fairly fragile to these settings here.
    learning_rate_init=0.2,
    learning_rate="adaptive",
    tol=1E-7,
    n_iter_no_change=100,
    max_iter=10000,
    # No regularization.
    alpha=0.0,
    # No validation set.
    validation_fraction=0.0)

# Learn it.
f = nn.fit(X, Y.ravel())
print(f"Final Loss: {f.loss_}")
print(f"Number of Iterations: {f.n_iter_}")
# Plot some predictions.
x = np.linspace(0, 2, 1000)
x = x.reshape(-1, 1)
plt.plot(x, function(x), label="true")
plt.plot(x, nn.predict(x), label="learned")
plt.legend()
plt.show()