# Exercises Week 5

## Exercise 1: Neural Network Design
In this exercise your task is to design neural networks by hand that compute simple functions.
For the nonlinear transforms you may only use the identity, sign, relu and sigmoid functions in the neurons, but you can mix them any way you like.
You can make the networks as wide and deep as you like, but small networks are sufficient.
* Make a network that computes $c \cdot x$ for any constant $c$
* Make a network that computes xor of inputs $x_1$ and $x_2$.
* Make a network that computes $\max(x_1, x_2)$
* Make a network that computes $x^2$ for $x \in \{2,3,4,5\}$, i.e. $x$ is an integer


- **Hint 1: It is usually easier to find an easy mathematical expression that solves the problem and then to make a network that implements that**
- **Hint 2: The only nonlinear transform the teacher uses is relu (and identity)**.

Solution for c*x: A single neuron with weight c and no bias. The nonlinear transform is the identity.

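Solution for xor (inputs in {0,1}): The first layer has two ReLU neurons, one with weights (1,1) and bias -1, and one with weights (-1,-1) and bias +1. The output layer is one identity neuron with weights (-1,-1) and bias 1. Observe: The first hidden neuron is 1 only on input (1,1) and the second is 1 only on input (0,0), so the output is 0 on those two inputs and 1 on the other two.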

Solution for max: The first layer has three neurons: one with weights (1,-1) taking ReLU, one with weights (-1,1) taking ReLU, and one computing the sum of the two inputs using weights (1,1) and identity. The output layer has weights (1/2,1/2,1/2) and identity. Observe: The sum of the first two neurons in the first layer is max(x_1,x_2)-min(x_1,x_2). Adding x_1+x_2 cancels out the smallest and leaves twice the max. Dividing it all by 2 gives the desired result.


Solution for x^2: We create a small network testing whether x = y. This is done with two ReLU neurons, one having weight -1 and bias +y and another having weight 1 and bias -y. If x=y then both are 0. Otherwise, since x and y are integers, one is 0 and the other is >= 1. Now create a new ReLU neuron taking the two previous as inputs with weight -1000 on both and bias y^2. This becomes y^2 if x=y, and otherwise it becomes 0 because y^2 <= 25 < 1000. Do this for every y in 2,3,4,5 and create an output neuron with weights (1,1,1,1) summing these and using identity.
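The four constructions above can be checked numerically. The sketch below writes each network out as explicit ReLU and identity neurons in NumPy; the function names (`scale_net`, `xor_net`, `max_net`, `square_net`) are only illustrative and not part of the exercise material.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def scale_net(x, c):
    # One identity neuron with weight c and no bias.
    return c * x

def xor_net(x1, x2):
    # Hidden layer: two ReLU neurons.
    h1 = relu(1 * x1 + 1 * x2 - 1)    # 1 only on input (1, 1)
    h2 = relu(-1 * x1 - 1 * x2 + 1)   # 1 only on input (0, 0)
    # Output: identity neuron with weights (-1, -1) and bias 1.
    return -h1 - h2 + 1

def max_net(x1, x2):
    h1 = relu(x1 - x2)   # ReLU neuron with weights (1, -1)
    h2 = relu(x2 - x1)   # ReLU neuron with weights (-1, 1)
    h3 = x1 + x2         # identity neuron with weights (1, 1)
    # Output: identity neuron with weights (1/2, 1/2, 1/2).
    return 0.5 * h1 + 0.5 * h2 + 0.5 * h3

def square_net(x, ys=(2, 3, 4, 5)):
    out = 0.0
    for y in ys:
        a = relu(-x + y)   # weight -1, bias +y
        b = relu(x - y)    # weight  1, bias -y
        # Equals y^2 if x == y and 0 otherwise (for integer x).
        out += relu(-1000 * a - 1000 * b + y ** 2)
    return out   # identity output neuron with weights (1, 1, 1, 1)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))
print([square_net(x) for x in (2, 3, 4, 5)])
```

Note that `square_net` is only meant for the integer inputs 2 to 5; for other inputs it does not compute the square.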

## Ex 2: Neural Net Forward Pass - Vectorized
Implement the score (least squares error) and predict functions of a neural net class for regression.
The neural net has one hidden layer with $\textrm{relu}(x) = \max(0, x)$ nonlinearity and one output neuron.

For the prediction method you must write an algorithm that takes as input a batch of data and computes the output of the neural net on each input point given in the batch.

The data batch is given as an $n \times d$ matrix $X$, where each row is a data point.


A neural net as considered here requires two sets of weights and biases:
* The weights that map the input data to the input of the hidden units. Call these $W_1$. The corresponding bias weights we name $b_1$.

* The weights that map the output of the hidden units to the output. Call these $W_2$. The corresponding bias weights we name $b_2$.

We organize the weights in matrices $(W_1, W_2)$ and vectors $(b_1,b_2)$ as follows:
* The $i$'th column of $W_1$ holds the weights we multiply with the input data to get the input to hidden node $i$. The shape of the $W_1$ matrix is $d \times h$.
* The bias $b_1$ is a vector of size $h$; the $i$'th entry is the bias of hidden neuron $i$.
* The $i$'th column of $W_2$ holds the weights we multiply with the hidden layer activations to get the input to the $i$'th output node. $W_2$ is an $h \times \textrm{output_size}$ matrix (an $h \times 1$ matrix in our case).
* The bias $b_2$ is a vector of size output_size.
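To make the shape conventions concrete, here is a minimal sketch on a tiny random batch. It assumes the biases are stored as row vectors of shape `(1, h)` and `(1, output_size)` so that NumPy broadcasting adds them to every row, and it compares the matrix form against an explicit per-neuron loop. The sizes and variable names are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h, output_size = 4, 3, 5, 1   # illustrative sizes

X = rng.random((n, d))               # each row is a data point
W1 = rng.random((d, h))              # column i feeds hidden node i
b1 = rng.random((1, h))
W2 = rng.random((h, output_size))    # column j feeds output node j
b2 = rng.random((1, output_size))

# Vectorized forward pass: the biases broadcast over the n rows.
hidden = np.maximum(X @ W1 + b1, 0)  # shape (n, h)
out = hidden @ W2 + b2               # shape (n, output_size)

# Same computation, one data point and one neuron at a time.
out_loop = np.zeros((n, output_size))
for k in range(n):
    for j in range(output_size):
        s = b2[0, j]
        for i in range(h):
            activation = max(X[k] @ W1[:, i] + b1[0, i], 0.0)
            s += activation * W2[i, j]
        out_loop[k, j] = s

print(out.shape, np.allclose(out, out_loop))
```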

**Task:** In the cell below (partially) complete the neural net class
- Implement the predict function of the neural net
- Implement the score function (least squares $\frac{1}{n} \sum_i (\textrm{nn}(x_i) - y_i)^2$)

Tests:
- The first test case uses random weights, so the error it reports is itself random.
- The second test case uses the weights of a pretrained network for house pricing. Here you should get a score around 0.32 (remove comment to run it)


%matplotlib inline
import numpy as np

class NN():
    
    def __init__(self, input_dim, hidden_size, output_size=1):
        self.W1 = np.random.rand(input_dim, hidden_size)
        self.b1 = np.random.rand(1, hidden_size)
        self.W2 = np.random.rand(hidden_size, output_size)
        self.b2 = np.random.rand(1, output_size)
        print('Neural net initialized with random values')
        
    def predict(self, X):    
        """ Evaluate the network on given data batch 
        
        np.maximum may come in handy
        
        Args:
        X: np.array shape (n, d)  Each row is a data point
        
        Output:
        pred: np.array shape (n, 1) output of network on each input point
        """
        # compute the following values
        pred = None # the output of neural net n x 1
    
        ### YOUR CODE HERE
        hidden_in = X @ self.W1 + self.b1  # (n, h); b1 has shape (1, h) and broadcasts over rows
        hidden_out = np.maximum(hidden_in, 0)  # (n, h) relu activations
        pred = hidden_out @ self.W2 + self.b2  # (n, output_size)
        ### END CODE
        return pred
    
    
    def score(self, X, y):
        """ Compute least squares loss (1/n sum (nn(x_i) - y_i)^2)
        
          X: np.array shape (n, d) - Data
          y: np.array shape (n, 1) - Targets

        """
        score = None
        ### YOUR CODE HERE
        pred = self.predict(X)
        score = np.mean((pred-y)**2)
        ### END CODE
        return score
        
# random data test
def simple_test():
    input_dim = 3
    hidden_size = 8
    X = np.random.rand(10, input_dim)
    y = np.random.rand(10, 1)
    my_net = NN(input_dim=input_dim, hidden_size=hidden_size)

    nn_out = my_net.predict(X)
    print('shape of nn_out', nn_out.shape) # should be n x 1
    print('least squares error: ', my_net.score(X, y))
    
# actual data test
def housing_test():
    from sklearn.preprocessing import StandardScaler
    from  sklearn.datasets import fetch_california_housing
    rdata = fetch_california_housing()
    s = StandardScaler()
    Xr = rdata.data
    yr = rdata.target   
    print('data size:', len(yr), 'num features:', Xr.shape[1])
    s.fit(Xr)
    X_scaled = s.transform(Xr)
    house_net = NN(input_dim=Xr.shape[1], hidden_size=8)
    weights = np.load('good_weights.npz')
    house_net.W1 = weights['W1']
    house_net.W2 = weights['W2']
    house_net.b1 = weights['b1'].reshape(1, -1)
    house_net.b2 = weights['b2'].reshape(1, -1)
    print('hidden layer size:', house_net.W1.shape[1])
    lsq = house_net.score(X_scaled, yr.reshape(-1, 1))
    pred = house_net.predict(X_scaled)
    print('mean house price least squares error:', lsq)
    print('5 house prediction:\nestimated price , true price')
    print(np.c_[house_net.predict(X_scaled[0:5, :]), yr[0:5]])

simple_test()
housing_test()

Neural net initialized with random values
shape of nn_out (10, 1)
least squares error:  57.77642304830964
data size: 20640 num features: 8
Neural net initialized with random values
hidden layer size: 7
mean house price least squares error: 0.32253261478095707
5 house prediction:
estimated price , true price
[[3.87439488 4.526     ]
 [3.98254363 3.585     ]
 [3.81248215 3.521     ]
 [3.1932362  3.413     ]
 [2.89667929 3.422     ]]


## Exercise 3: Implementing AdaBoost
In this exercise your task is to implement AdaBoost as described in the lecture and the Boosting note.
We have provided starter code in adaboost.py. See the boosting note for a description of the algorithm.

You must implement the following methods, in this order:
- ensemble_output
- predict
- score
- fit

To test your implementation, run adaboost.py

You should get a final accuracy of around 0.886
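For orientation, a generic AdaBoost loop with threshold stumps might look like the sketch below. It is not the interface of adaboost.py: the function names and signatures (`fit_stump`, `stump_predict`, `adaboost_fit`, and the free-standing `ensemble_output` and `predict`) are hypothetical, the weak learner is a brute-force stump rather than whatever the starter code uses, and the weight update follows the common textbook form, which may be parameterized differently in the Boosting note.

```python
import numpy as np

def fit_stump(X, y, w):
    """Brute-force search for the stump (feature, threshold, sign)
    with the smallest weighted classification error."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] <= t, s, -s)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, j, t, s)
    return best[1:]

def stump_predict(stump, X):
    j, t, s = stump
    return np.where(X[:, j] <= t, s, -s)

def adaboost_fit(X, y, rounds):
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                 # start with uniform weights
    stumps, alphas = [], []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        pred = stump_predict(stump, X)
        # Weighted error, clipped to avoid division by zero.
        eps = np.clip(np.sum(w[pred != y]), 1e-12, 1 - 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)
        # Increase weight on mistakes, decrease on correct points.
        w = w * np.exp(-alpha * y * pred)
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def ensemble_output(stumps, alphas, X):
    # Weighted sum of the weak learners' +/-1 votes.
    return sum(a * stump_predict(s, X) for s, a in zip(stumps, alphas))

def predict(stumps, alphas, X):
    return np.sign(ensemble_output(stumps, alphas, X))

# Toy 1D data that no single stump can classify perfectly.
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([1, 1, -1, -1, 1])
stumps, alphas = adaboost_fit(X_toy, y_toy, rounds=3)
print('train accuracy:', np.mean(predict(stumps, alphas, X_toy) == y_toy))
```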



## Exercise 4: Implementing Forward Stagewise Additive Modelling
In this exercise your task is to implement a standard Forward Stagewise Additive Modelling algorithm for regression using Least Squares Loss.
We have provided starter code in **sfam.py**.

See The Elements of Statistical Learning for a description of the algorithm.
Note that each iteration computes the best new hypothesis $h$ and the best new scalar $\beta$ such that $\beta h(x)$ is added to the ensemble.
We implement this simply by first using the weak learner to find an $h$ that fits the residuals, and then computing the optimal constant $\beta$ for the $h$ found by our weak learner.
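For least squares loss the optimal constant has a closed form: minimizing $\sum_i (r_i - \beta h(x_i))^2$ over $\beta$ gives $\beta = \sum_i r_i h(x_i) / \sum_i h(x_i)^2$, where $r_i$ are the current residuals. The sketch below illustrates this on made-up numbers. The helper name `optimal_beta`, the starting ensemble (the mean of `y`) and the stand-in weak learner outputs are assumptions for illustration, not part of sfam.py.

```python
import numpy as np

def optimal_beta(residuals, h_out):
    """Least squares optimal scalar for adding beta * h to the ensemble."""
    denom = np.dot(h_out, h_out)
    if denom == 0:
        return 0.0   # h is identically zero, so adding it changes nothing
    return float(np.dot(residuals, h_out) / denom)

y = np.array([3.0, 1.0, 4.0, 1.0, 5.0])
ensemble_out = np.full_like(y, y.mean())        # hypothetical current ensemble
residuals = y - ensemble_out                    # what the weak learner should fit
h_out = np.array([0.5, -1.0, 0.5, -1.0, 1.0])   # stand-in for h(x_i)

beta = optimal_beta(residuals, h_out)

def loss(b):
    return float(np.mean((residuals - b * h_out) ** 2))

print('beta:', beta)
print('loss before:', loss(0.0), 'loss after:', loss(beta))
```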


You must implement the following methods, in this order:
- predict
- score
- fit

Notice that fit gets two sets of data and labels: X, y and X_val, y_val.
The latter, X_val, y_val, is a separate validation set on which you must evaluate your current ensemble in each iteration, so we can plot the development of the error on data not used for training (on the training data itself we know the error never increases).

To test your implementation, run sfam.py -max_depth 1

You can provide a different max_depth for the base learner, which is a Regression Tree (1 is default).

With a default base learner with max depth 1 the mean least squares error on both training and test data should be around 0.35. 
If you change random state then the results may be different.

If you increase the max_depth the results will change.  Try for instance max_depth 3 and 5 as well. What do you see?





## Exercise 5: Gradient Boosting by Hand
In this exercise you must complete one step of gradient boosting with exponential loss on a small data set (X, y) as shown below.

$X = [1,2,5,3,4]$

$y = [1,1,1,-1, -1]$

Assume that we initialize our ensemble model with the constant function $h(x) = 1$


**Your task requires the following three steps:**
1. Compute the residuals the regression tree should fit (with least squares)
2. Construct the regression stump found by fitting the negative gradient
3. Optimize the leaf values such that the newly added tree minimizes the exponential loss (return values in [-1, +1])

The loss function is exp(-y h(x)). Computing the gradient wrt. h(x) gives -y * exp(-y h(x)). Evaluating this for y=1 and y=-1, using that h(x)=1 for all x, gives the following gradients (in the order of the data points): (-1/e, -1/e, -1/e, e, e). The residuals are the negative gradients and are thus (1/e, 1/e, 1/e, -e, -e).
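The same numbers can be reproduced in a few lines of NumPy. This is just a check of the derivation above; the variable names are illustrative.

```python
import numpy as np

X = np.array([1, 2, 5, 3, 4])
y = np.array([1, 1, 1, -1, -1])
h = np.ones(len(y))                 # initial ensemble: h(x) = 1 for all x

gradient = -y * np.exp(-y * h)      # derivative of exp(-y h) wrt. h
residuals = -gradient               # targets for the regression stump

print('gradient: ', np.round(gradient, 3))
print('residuals:', np.round(residuals, 3))
```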

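We consider all non-trivial splits for the root. The costs below are written in terms of the gradient values; the residuals differ only in sign, so the squared-error costs are identical.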
x<=1: One node with the single element -1/e, one node with -1/e, -1/e, e, e. Cost of first leaf is 0. Second leaf has mean (-1/e + e)/2. Cost is thus 2*(-1/e - (-1/e+e)/2)^2 + 2*(e-(-1/e+e)/2)^2 = 4*((e+1/e)/2)^2 = 9.52

x<=2: One node with -1/e, -1/e, one node with -1/e, e, e. Cost of first is 0. Second leaf has mean (-1/e + 2e)/3. Cost is thus (-1/e - (-1/e + 2e)/3)^2 + 2*(e - (-1/e + 2e)/3)^2 = 6.35

x<=3: One node with -1/e, -1/e, e and one with -1/e, e. First has mean (-2/e + e)/3. Second has mean (-1/e+e)/2. Cost of first leaf is 2*(-1/e - (-2/e+e)/3)^2 + (e-(-2/e+e)/3)^2 = 6.35. Cost of second leaf is 2*((e+1/e)/2)^2 = 4.76. Total cost is 11.11, which is bigger than for x<=2.

x<=4: One node with -1/e, -1/e, e, e and one with -1/e. Cost is equal to split x<=1.

Best split is x<=2.
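The split search can also be done by brute force. The following sketch computes the sum of squared errors of both leaves for each candidate threshold, using the residuals for the initial model $h(x) = 1$. The helper name `sse` and the dictionary `costs` are illustrative.

```python
import numpy as np

X = np.array([1, 2, 5, 3, 4])
y = np.array([1, 1, 1, -1, -1])
residuals = y * np.exp(-y * 1.0)    # negative gradient at h(x) = 1

def sse(values):
    """Sum of squared deviations from the leaf mean."""
    return float(np.sum((values - values.mean()) ** 2))

costs = {}
for t in [1, 2, 3, 4]:              # all non-trivial splits x <= t
    left = residuals[X <= t]
    right = residuals[X > t]
    costs[t] = sse(left) + sse(right)

for t, c in costs.items():
    print(f'x<={t}: cost {c:.2f}')
best = min(costs, key=costs.get)
print('best split: x<=', best)
```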

The first leaf has two elements with label 1 for which the ensemble already predicts 1, hence the best return value in the range is 1 (minimizing 2*exp(-1*(1+v))). For the second leaf, we have elements with labels 1, -1, -1 and current prediction 1. If we predict v in this leaf, then the exponential loss on these becomes exp(-(1 * (1 + v))) + 2*exp(-(-1 * (1 + v))) = e^(-v) * 1/e + e^v * 2e. The derivative is -1/e * e^(-v) + 2e * e^v. Setting it to 0: 2e * e^v = 1/e * e^(-v) <=> v + ln(2e) = -v + ln(1/e) <=> v = (1/2)ln(1/(2e^2)) = -1.35.

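The loss is convex in v and its unconstrained minimizer -1.35 lies below the allowed range [-1, +1], so to optimize the exponential loss we predict -1 in the newly created leaf.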
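As a sanity check, the leaf values can be found with a simple grid search over [-1, +1]. The sketch assumes the split x<=2 and the current prediction 1 from above; `leaf_loss` and the grid resolution are illustrative choices.

```python
import numpy as np

def leaf_loss(v, labels, current=1.0):
    """Exponential loss on one leaf if the new tree returns v there."""
    labels = np.asarray(labels, dtype=float)
    return float(np.sum(np.exp(-labels * (current + v))))

grid = np.linspace(-1.0, 1.0, 2001)   # allowed return values

left_labels = [1, 1]                  # x in {1, 2}
right_labels = [1, -1, -1]            # x in {5, 3, 4}

v_left = grid[np.argmin([leaf_loss(v, left_labels) for v in grid])]
v_right = grid[np.argmin([leaf_loss(v, right_labels) for v in grid])]

# Closed-form unconstrained minimizer for the second leaf.
v_unconstrained = 0.5 * np.log(1.0 / (2.0 * np.e ** 2))

print('left leaf value: ', v_left)
print('right leaf value:', v_right)
print('unconstrained minimizer for right leaf:', round(v_unconstrained, 3))
```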

