# Week 4 (Theoretical) Exercises
Remember to take a look at all the exercises.


# Exercise 1: Gradient Descent
Let $f_a(x_1, x_2) = \frac{1}{2}(x_1^2 + a\cdot x_2^2)$,

where $a$ is a parameter that we will vary.

**Your task is to write a gradient descent algorithm that finds a minimizer of $f_a$, starting from the point $(256, 1)$. You should be able to work out both the gradient and the local (global) minimum by hand.**
- Test your algorithm by running the cell.
- Run your gradient descent algorithm for at least 40 steps to see if it converges.
  You must save the sequence of points (2D) visited by your gradient descent algorithm for visualization.
  We have added code to visualize this sequence.
- Try $a = 1, 4, 16, 64, 128, 256$ and adjust the step size to see if you can make it converge.
  **Hint: after trying different values for the step size, also try approximately $1/a$ (for $a > 1$).**
- What do you see?
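
One way to explore the step-size hint is sketched below. This is a minimal pure-Python sketch, not the intended solution to the `gd` template: the names `grad_f` and `gd_sketch` are made up for illustration, and it assumes the gradient $\nabla f_a(x) = (x_1, a x_2)$, so each step scales $x_1$ by $(1 - \eta)$ and $x_2$ by $(1 - \eta a)$ for step size $\eta$.

```python
def grad_f(a, x):
    """Gradient of f_a(x1, x2) = 0.5 * (x1**2 + a * x2**2)."""
    return (x[0], a * x[1])

def gd_sketch(a, step_size, steps=40, start=(256.0, 1.0)):
    """Constant-step gradient descent; returns every visited point."""
    x = start
    path = [x]  # include the starting point so it can be plotted
    for _ in range(steps):
        g = grad_f(a, x)
        x = (x[0] - step_size * g[0], x[1] - step_size * g[1])
        path.append(x)
    return path

# Compare the final point for each a when the step size is 1/a.
for a in (1, 4, 16, 64, 128, 256):
    end = gd_sketch(a, step_size=1.0 / a)[-1]
    print(a, end)
```

Under these assumptions the $x_2$ coordinate shrinks only when $|1 - \eta a| < 1$, which is why a step size of 0.1 blows up for large $a$, while a step of about $1/a$ makes progress along $x_1$ slow.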





In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

def f(a, x):
    return 0.5 * (x[0]**2 + a * x[1]**2)

def visualize(a, path, ax=None):
    """
    Make contour plot of f_a and plot the path on top of it
    """
    y_range = 10
    x = np.arange(-257, 257, 0.1)
    y = np.arange(-y_range, y_range, 0.1)
    xx, yy = np.meshgrid(x, y)
    z = 0.5 * (xx**2 + a * yy**2)
    if ax is None:
        fig, ax = plt.subplots(figsize=(16, 13))
    h = ax.contourf(xx, yy, z, cmap=plt.get_cmap('jet'))
    ax.plot([x[0] for x in path], [x[1] for x in path], 'w.--', markersize=4)
    ax.plot([0], [0], 'rs', markersize=8) # optimal solution
    ax.set_xlim([-257, 257])
    ax.set_ylim([-y_range, y_range])

def gd(a, step_size=0.1, steps=40):
    """ Run Gradient descent
        params:
        a - the parameter that defines the function f
        step_size - constant step size to use for gradient descent
        steps - number of steps to run
        
        Returns: out, list with the sequence of points considered during the descent.         
    """
    out = []
    x = np.array([256.0, 1.0]) # starting point

    ### YOUR CODE HERE    
    ### END CODE
    return out

fig, axes = plt.subplots(2, 3, figsize=(20, 16))
ateam = [[1, 4, 16], [64, 128, 256]]
for i in range(2):
    for j in range(3):
        ax = axes[i][j]
        a = ateam[i][j]
        path = gd(a, step_size=0.1, steps=40) # use good step size here instead of standard value
        visualize(a, path, ax)
        ax.set_title('Gradient Descent a={0}'.format(a), fontsize=16)


# Exercise 2: In Sample Error
Assume we are given a fixed hypothesis $h$, and we are considering the 0-1 loss ($1_{h(x)\neq y}$).
Now we receive a sample data set $D = \{(x_1,y_1),\dots,(x_n,y_n)\}$, where each data point is generated by sampling $x_i$ independently at random from an unknown distribution $P(X)$ and then feeding it to the unknown target function $f$ to get $(x_i, y_i)$, where $y_i = f(x_i)$.


Show that the expected value (over the data set) of $\textrm{E}_\textrm{in}(h) = \frac{1}{n} \sum_{i=1}^n 1_{h(x_i)\neq y_i}$ (number of mispredictions divided by number of points) is $\textrm{E}_\textrm{out}(h)$.

Formally show that
$$
\mathbb{E}_D [\textrm{E}_\textrm{in}(h)] = \textrm{E}_\textrm{out}(h)
$$
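
A quick simulation can serve as a sanity check on the claim before proving it. The sketch below is purely illustrative: the target `f`, the hypothesis `h`, and the uniform input distribution are all hypothetical choices, picked so that the out of sample error is easy to read off (the two functions disagree exactly on the interval $(0.3, 0.5]$).

```python
import random

def f(x):
    """Hypothetical target function: label 1 when x > 0.5."""
    return 1 if x > 0.5 else 0

def h(x):
    """Hypothetical fixed hypothesis: label 1 when x > 0.3."""
    return 1 if x > 0.3 else 0

def in_sample_error(n, rng):
    """Draw n points x ~ Uniform(0, 1) and return the fraction h gets wrong."""
    errors = 0
    for _ in range(n):
        x = rng.random()
        errors += int(h(x) != f(x))
    return errors / n

rng = random.Random(0)
n, datasets = 20, 5000
# Average the in-sample error over many independently drawn data sets.
avg_e_in = sum(in_sample_error(n, rng) for _ in range(datasets)) / datasets
e_out = 0.2  # P(0.3 < x <= 0.5), the region where h and f disagree
print(avg_e_in, e_out)
```

Individual data sets give in-sample errors that scatter around this value; only their average should be close to it, which is what the expectation over $D$ expresses.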


 
# Exercise 3: Out of Sample Error
Suppose the target function is actually the noisy linear model,
$$ y = w^\intercal x + \varepsilon $$
where $\varepsilon$ is a random variable with mean zero, $\mathbb{E}[\varepsilon] = 0$, and standard deviation $\sigma$.


Given that we use the least squares error function, $e(\hat{y}, y) = (\hat{y}-y)^2$, where $\hat{y}$ is the prediction,
what is the best **out of sample error** possible?

Hint: What is the optimal predictor? What is its out of sample error?
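
Once you have derived an answer, a simulation along the following lines may help to check it. Everything specific here is an assumption made for illustration: the weights `w_true`, the noise level `sigma`, Gaussian inputs, and Gaussian noise (the exercise only fixes the mean and standard deviation of the noise).

```python
import random

rng = random.Random(0)
w_true = (2.0, -1.0)  # hypothetical weight vector
sigma = 0.5           # hypothetical noise standard deviation

def predict(w, x):
    """Linear prediction w^T x for a 2-dimensional input."""
    return w[0] * x[0] + w[1] * x[1]

def sample(rng):
    """Draw one (x, y) pair from the noisy linear model."""
    x = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))
    y = predict(w_true, x) + rng.gauss(0.0, sigma)
    return x, y

def mean_squared_error(w, data):
    """Average squared error of the linear predictor w on the data."""
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

# A large fresh sample approximates the out of sample error.
data = [sample(rng) for _ in range(100000)]
mse_true = mean_squared_error(w_true, data)
mse_other = mean_squared_error((2.1, -1.0), data)  # slightly wrong weights
print(mse_true, mse_other, sigma ** 2)
```

Comparing the error of the true weights with the noise variance, and with the error of a slightly perturbed weight vector, gives a feel for which part of the error no predictor can remove.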




# Exercise 4: Softmax Gradient
As described in the softmax note, we define the softmax function as follows:
$$
\textrm{softmax}:\mathbb{R}^K \rightarrow \mathbb{R}^K, \quad
\textrm{softmax}(x)_j =
\frac{e^{x_j}}
{\sum_{i=1}^K e^{x_i}}\quad
\textrm{ for }\quad j = 1, \dots, K.
$$
where $\textrm{softmax}(x)_j$ denotes the $j$'th output of the function.


Show that the matrix of derivatives of the softmax function is as follows.
$$
\left[\frac{\partial \textrm{softmax}}{\partial x}\right]_{i,j} =
\frac
{\partial \;\textrm{softmax}(x)_i}
{\partial x_j} =
(\delta_{i,j} - \textrm{softmax}(x)_j)
\textrm{softmax}(x)_i\quad\quad\text{where}\quad\quad
\delta_{i,j}=\begin{cases}1 &\text{if }i=j\\
0 & \text{else}
\end{cases}
$$
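
Before proving the identity, it can be reassuring to check it numerically. The sketch below compares the stated formula with a central finite-difference approximation at one arbitrarily chosen point; the helper names `jacobian_formula` and `jacobian_numeric` are made up for this illustration, and subtracting the maximum inside `softmax` is only a numerical-stability trick that leaves the result unchanged.

```python
import math

def softmax(x):
    """Softmax of a list of floats (shifted by max(x) for stability)."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def jacobian_formula(x):
    """Entry (i, j) is (delta_ij - softmax(x)_j) * softmax(x)_i."""
    s = softmax(x)
    K = len(x)
    return [[((1.0 if i == j else 0.0) - s[j]) * s[i] for j in range(K)]
            for i in range(K)]

def jacobian_numeric(x, eps=1e-6):
    """Central finite differences: entry (i, j) approximates d softmax_i / d x_j."""
    K = len(x)
    J = [[0.0] * K for _ in range(K)]
    for j in range(K):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        s_plus, s_minus = softmax(x_plus), softmax(x_minus)
        for i in range(K):
            J[i][j] = (s_plus[i] - s_minus[i]) / (2 * eps)
    return J

x = [0.5, -1.0, 2.0]  # arbitrary test point
J_formula = jacobian_formula(x)
J_numeric = jacobian_numeric(x)
max_diff = max(abs(J_formula[i][j] - J_numeric[i][j])
               for i in range(3) for j in range(3))
print(max_diff)
```

A small `max_diff` is evidence that the formula is right at this point, not a proof. Each column of the matrix should also sum to zero, since the softmax outputs always sum to one.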


