# Intro
This section of the project implements gradient-based optimization algorithms and associated helper functions. This notebook walks through their derivations. We will begin by deriving some common loss functions and their gradients.
## Squared Error Loss
Let $\mathbf{y}$ be an $N \times 1$ vector of response variable datapoints, and let $\mathbf{X}$ be an $N \times M$ matrix of predictor variables. Finally, let $\mathcal{F}(\mathbf{X})$ be a function which produces an $N \times 1$ vector $\hat{\mathbf{y}}$ of predictions. Then we define the squared error loss as follows:
$$C(\mathbf{\hat{y}},\mathbf{y}) = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$
Now suppose our prediction function is linear. Then $\mathcal{F}(\mathbf{X}) = \mathbf{X}\vec{\theta}$ where $\vec{\theta}$ is an $M \times 1$ vector of parameters. Then $\hat{y}_i = \sum_{k=1}^M X_{i,k}\theta_k$ and our loss function can be written:
$$C(\vec{\theta}) = \frac{1}{N}\sum_{i=1}^{N} (\sum_{k=1}^M X_{i,k}\theta_k - y_i)^2$$
Now we want to take the gradient of this function with respect to $\vec{\theta}$. First consider a single component of the gradient:
$$\frac{\partial }{\partial \theta_j}C(\vec{\theta}) = \frac{\partial }{\partial \theta_j} \frac{1}{N}\sum_{i=1}^{N} (\sum_{k=1}^M X_{i,k}\theta_k - y_i)^2 = \frac{1}{N}\sum_{i=1}^{N} 2(\sum_{k=1}^M X_{i,k}\theta_k - y_i)X_{i,j} =  \frac{2}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)X_{i,j} = \frac{2}{N}\sum_{i=1}^{N} X^{T}_{j,i}(\hat{y}_i - y_i) = \frac{2}{N}(\mathbf{X}^{T}(\hat{\mathbf{y}} - \mathbf{y}))_{j,1}$$
Therefore we can compute the entire gradient vector as follows:
$$\nabla C(\vec{\theta}) = \frac{2}{N}\mathbf{X}^{T}(\hat{\mathbf{y}} - \mathbf{y})$$
The next cell has code to compute the loss and its gradient.

In [None]:

import numpy as np


def squared_error_cost(X, y, theta):
    y_hat = X @ theta
    return np.sum(np.power(y_hat - y, 2)) / X.shape[0]


def squared_error_cost_gradient(X, y, theta):
    y_hat = X @ theta
    # Factor of 2 comes from differentiating the square: (2/N) X^T (y_hat - y)
    return 2 * X.T @ (y_hat - y) / X.shape[0]
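
One way to sanity-check the gradient formula derived above, $\frac{2}{N}\mathbf{X}^{T}(\hat{\mathbf{y}} - \mathbf{y})$, is to compare it against central finite differences on random data. The sketch below is self-contained and redefines the two functions so it can run on its own; `numerical_gradient`, the random data, and the step size `eps` are illustrative assumptions, not part of the project.

```python
import numpy as np


def squared_error_cost(X, y, theta):
    # Mean squared error of the linear model X @ theta
    return np.sum((X @ theta - y) ** 2) / X.shape[0]


def squared_error_cost_gradient(X, y, theta):
    # Analytic gradient from the derivation: (2/N) X^T (y_hat - y)
    return 2 * X.T @ (X @ theta - y) / X.shape[0]


def numerical_gradient(f, theta, eps=1e-6):
    # Hypothetical helper: central differences, one coordinate at a time
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad


# Random test problem (assumed sizes: N = 20 examples, M = 3 features)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
theta = rng.normal(size=3)

analytic = squared_error_cost_gradient(X, y, theta)
numeric = numerical_gradient(lambda t: squared_error_cost(X, y, t), theta)
max_abs_diff = np.max(np.abs(analytic - numeric))
```

If the analytic formula is right, `max_abs_diff` should be tiny compared with the gradient entries. Dropping the factor of 2 makes the two disagree by roughly a factor of two, so the check catches that mistake.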

## Sigmoid, Softmax, Cross-Entropy
The previous cell dealt with a loss function suited to linear regression problems. Next we'll cover loss functions for classification problems.

### Sigmoid and Softmax Functions
The sigmoid or logistic function maps a real number to the open interval $(0,1)$ using the function:
$$f(x) = \frac{1}{1 + e^{-x}}$$
This is useful if we have examples with some number of features, and we want to classify the examples into one of two categories. If $\mathbf{y}$ are the true categories, we can form our predictions as follows:
$$\hat{y_i} = \frac{1}{1 + e^{-X_i\vec{\theta}}}$$
where $X_i$ is the $i$th training example and $\vec{\theta}$ is a parameter vector as before. Since $\hat{y_i} \in (0,1)$ we can think of it as the probability that the example falls into a particular one of the two categories. This is standard logistic regression.
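
As a small illustration of the prediction step, the sketch below applies the logistic function to $X\vec{\theta}$ for a handful of made-up examples. The data, the parameter values, and the 0.5 decision threshold are assumptions chosen for the example, not values from the project.

```python
import numpy as np


def sigmoid(z):
    # Logistic function applied elementwise
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical data: 4 examples, 2 features
X = np.array([[0.5, 1.0],
              [-1.0, 2.0],
              [3.0, -0.5],
              [0.0, 0.0]])
theta = np.array([1.0, -1.0])  # hypothetical parameter vector

probs = sigmoid(X @ theta)            # predicted probability of category 1
labels = (probs >= 0.5).astype(int)   # assumed threshold of 0.5 to pick a category
```

Every entry of `probs` lies strictly between 0 and 1, and an example with $X_i\vec{\theta} = 0$ (the last row here) gets probability exactly 0.5.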

This works well for just two categories, but classification problems often have many possible categories. To generalize the logistic function, we need a function that takes a vector of $K$ real numbers and maps it to a point on the *probability simplex*, that is, a vector whose components all lie in the interval $[0,1]$ and sum to 1. We want this transformation to preserve relative magnitudes, such that a larger number in the original vector remains larger in the transformed vector. The Softmax function does this:
$$\mathbf{\sigma}(\vec{x})_i = \frac{e^{x_i}}{\sum_{k=1}^{K}e^{x_k}}$$
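
The properties just described can be checked directly on a small vector. This is a minimal sketch; `softmax_vec` is a hypothetical single-vector helper written for this example, and the input values are arbitrary.

```python
import numpy as np


def softmax_vec(x):
    # Softmax of a 1-D vector; subtracting the max avoids overflow in exp
    e = np.exp(x - np.max(x))
    return e / np.sum(e)


x = np.array([2.0, 1.0, -1.0])
p = softmax_vec(x)

total = np.sum(p)                                           # components sum to 1
same_order = np.array_equal(np.argsort(p), np.argsort(x))   # ordering is preserved
shifted = softmax_vec(x + 100.0)                            # adding a constant changes nothing
```

The last line illustrates why subtracting the maximum is safe: the common factor $e^{c}$ cancels between numerator and denominator, so shifting every component by the same constant leaves the output unchanged.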

### Cross-Entropy
We can now generalize logistic regression to multiple categories. We now have a $M \times K$ parameter matrix $\Theta$. To predict the category of an example, we compute:
$$\hat{y}_i = \mathbf{\sigma}(\mathbf{X}_i\Theta)$$
Note that $\hat{y}_i$ is a $K \times 1$ vector representing the probability that the example falls into each one of the $K$ categories. Similarly, we represent the true category $y_i$ as a $K \times 1$ "one-hot" vector, which is 1 at the index of the true category and 0 everywhere else. We now need a function $C(\hat{y}_i, y_i)$ to compute the error between our prediction vector and the one-hot vector representing the true category. To do this we use the cross-entropy function:
$$C(\hat{y}_i, y_i) = -\sum_{k=1}^{K} y_{i,k}\log(\hat{y}_{i,k})$$
Notice that, if $\hat{y}_i$ is very small at the true category, this loss will be very high. To calculate the loss across the whole training set we compute:
$$C(\Theta) = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\log(\hat{y}_{i,k})$$
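
A short worked example may make the behaviour of this loss concrete. The prediction and label matrices below are made up for illustration ($N = 2$, $K = 3$). Because each $y_i$ is one-hot, the inner sum just picks out $-\log$ of the predicted probability at the true category.

```python
import numpy as np

# Hypothetical predictions for N = 2 examples and K = 3 categories
Y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.1, 0.8]])
# One-hot true categories: example 0 is category 0, example 1 is category 1
Y = np.array([[1, 0, 0],
              [0, 1, 0]])

# Per-example loss: -log of the probability assigned to the true category
per_example = -np.sum(Y * np.log(Y_hat), axis=1)
# Loss over the whole (tiny) training set
mean_loss = np.mean(per_example)
```

The first example assigns probability 0.7 to the true category and incurs a loss of $-\log 0.7 \approx 0.357$. The second assigns only 0.1 and incurs $-\log 0.1 \approx 2.303$, which shows how a confident wrong prediction is penalized much more heavily.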

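We now take the gradient with respect to the entries of $\Theta$. In what follows, $\Theta_k$ denotes the $k$th column of $\Theta$, and $\Theta_{\ell,r}$ denotes the entry of column $\ell$ (the category) in row $r$ (the feature), so that $X_i\Theta_\ell = \sum_{r=1}^{M} X_{i,r}\Theta_{\ell,r}$: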
$$-\frac{\partial}{\partial \Theta_{\ell,r}}\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\log(\hat{y}_{i,k}) = 
-\frac{\partial}{\partial \Theta_{\ell,r}}\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\log(\sigma(X_i\Theta)_k) = 
-\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\frac{\frac{\partial}{\partial \Theta_{\ell,r}}\sigma(X_i\Theta)_k}{\sigma(X_i\Theta)_k}$$
$$= -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{i,k}\left(\frac{\partial}{\partial \Theta_{\ell,r}}\frac{e^{X_i\Theta_k}}{\sum_{j=1}^{K}e^{X_i\Theta_j}}\right)\left(\frac{1}{\sigma(X_i\Theta)_k} \right)\\
= -\frac{1}{N}\sum_{i=1}^{N}\left(\sum_{k\neq \ell} y_{i,k}\left(\frac{\partial}{\partial \Theta_{\ell,r}}\frac{e^{X_i\Theta_k}}{\sum_{j=1}^{K}e^{X_i\Theta_j}}\right)\left(\frac{1}{\sigma(X_i\Theta)_k} \right) + y_{i,\ell}\left(\frac{\partial}{\partial \Theta_{\ell,r}}\frac{e^{X_i\Theta_\ell}}{\sum_{j=1}^{K}e^{X_i\Theta_j}}\right)\left(\frac{1}{\sigma(X_i\Theta)_\ell} \right)\right)$$

$$= -\frac{1}{N}\sum_{i=1}^{N}\left(\sum_{k\neq \ell} y_{i,k}\left(\frac{-e^{X_i\Theta_k}e^{X_i\Theta_{\ell}}X_{i,r}}{\left(\sum_{j=1}^{K}e^{X_i\Theta_j}\right)^2}\right)\left(\frac{1}{\sigma(X_i\Theta)_k} \right) + y_{i,\ell}\left(\frac{e^{X_i\Theta_\ell}X_{i,r}\left(\sum_{j=1}^{K}e^{X_i\Theta_j}\right) - (e^{X_i\Theta_\ell})^2X_{i,r}}{\left(\sum_{j=1}^{K}e^{X_i\Theta_j}\right)^2}\right)\left(\frac{1}{\sigma(X_i\Theta)_\ell} \right)\right)$$

$$= -\frac{1}{N}\sum_{i=1}^{N}\left(\sum_{k\neq \ell} y_{i,k}\left(\frac{-e^{X_i\Theta_k}e^{X_i\Theta_{\ell}}X_{i,r}}{\left(\sum_{j=1}^{K}e^{X_i\Theta_j}\right)^2}\right)\left(\frac{\sum_{j=1}^{K}e^{X_i\Theta_j}}{e^{X_i\Theta_k}} \right) + y_{i,\ell}\left(\frac{e^{X_i\Theta_\ell}X_{i,r}\left(\sum_{j=1}^{K}e^{X_i\Theta_j}\right) - (e^{X_i\Theta_\ell})^2X_{i,r}}{\left(\sum_{j=1}^{K}e^{X_i\Theta_j}\right)^2}\right)\left(\frac{\sum_{j=1}^{K}e^{X_i\Theta_j}}{e^{X_i\Theta_\ell}} \right)\right)$$

$$= -\frac{1}{N}\sum_{i=1}^{N}\left(\sum_{k\neq \ell} y_{i,k}\left(\frac{-e^{X_i\Theta_{\ell}}X_{i,r}}{\sum_{j=1}^{K}e^{X_i\Theta_j}}\right) + y_{i,\ell}\left(\frac{X_{i,r}\left(\sum_{j=1}^{K}e^{X_i\Theta_j}\right) - e^{X_i\Theta_\ell}X_{i,r}}{\sum_{j=1}^{K}e^{X_i\Theta_j}}\right)\right)$$

$$= -\frac{1}{N}\sum_{i=1}^{N}\left(-\sum_{k\neq \ell} y_{i,k}X_{i,r}\sigma(X_i\Theta)_{\ell} + y_{i,\ell}X_{i,r} -y_{i,\ell}X_{i,r}\sigma(X_i\Theta)_{\ell}\right)$$

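Because $y_i$ is one-hot, $\sum_{k=1}^{K} y_{i,k} = 1$, so inside the parentheses the two terms containing $\sigma(X_i\Theta)_{\ell}$ combine into $-X_{i,r}\sigma(X_i\Theta)_{\ell}$:
$$= \frac{1}{N}\sum_{i=1}^{N} \left(X_{i,r}\sigma(X_i\Theta)_{\ell} - y_{i,\ell}X_{i,r}\right) = 
\frac{1}{N}\sum_{i=1}^{N} X_{i,r}(\hat{y}_{i,\ell} - y_{i,\ell}) = \frac{1}{N} \left(X^{T}(\hat{Y} - Y)\right)_{r,\ell}$$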

So we can write the entire gradient matrix as:
$$\frac{1}{N}\mathbf{X}^{T}(\hat{\mathbf{Y}} - \mathbf{Y})$$
where $\hat{\mathbf{Y}}$ and $\mathbf{Y}$ are $N \times K$ matrices. The $i$th row of $\hat{\mathbf{Y}}$ is the category prediction vector $\hat{y}_i$. Similarly, the rows of $\mathbf{Y}$ are the one-hot vectors for the training examples. It's interesting that this gruesome derivation leads to a vectorized formula which, apart from the factor of 2, is identical in form to the one for the squared-error loss. I don't see a good mathematical reason why this should be true, and it definitely isn't true for every loss function. Also note that the cross-entropy loss works fine for logistic regression, and is a generalization of the simplified logistic cost function that appears elsewhere. The next cell has code for computing the losses and gradients.
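
Before moving to that code, the claim that cross-entropy generalizes the logistic cost can be checked numerically for $K = 2$. The sketch below assumes the "simplified logistic cost" is the usual binary form $-\frac{1}{N}\sum_i \left[y_i\log p_i + (1-y_i)\log(1-p_i)\right]$, and it builds a two-column $\Theta$ whose first column is zero. Both the data and that construction are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))          # hypothetical data: 6 examples, 3 features
theta = rng.normal(size=3)           # hypothetical logistic-regression parameters
y = np.array([0, 1, 1, 0, 1, 0])     # binary labels

# Binary logistic cost: -(1/N) sum [y log p + (1 - y) log(1 - p)]
p = 1.0 / (1.0 + np.exp(-(X @ theta)))
binary_cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Same model as a 2-category softmax: column 0 fixed at zero, column 1 = theta
Theta = np.column_stack([np.zeros(3), theta])
H = X @ Theta
E = np.exp(H - H.max(axis=1, keepdims=True))   # row-wise shift for stability
Y_hat = E / E.sum(axis=1, keepdims=True)       # row-wise softmax
Y = np.eye(2)[y]                               # one-hot rows
softmax_cost = -np.sum(Y * np.log(Y_hat)) / X.shape[0]
```

With the first column of $\Theta$ set to zero, the softmax probability of the second category is $\frac{e^{X_i\vec{\theta}}}{1 + e^{X_i\vec{\theta}}}$, which is exactly the sigmoid, so the two costs agree up to floating-point error.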






In [None]:
def sigmoid(X, theta):
    z = -X @ theta
    return 1 / (1 + np.exp(z))

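def softmax(X, Theta):
    h = X @ Theta
    # Subtract each row's max before exponentiating; this doesn't change the
    # answer and makes intermediate values smaller
    m = np.max(h, axis=1, keepdims=True)
    e = np.exp(h - m)
    # Normalize row by row so each example's probabilities sum to 1
    return e / np.sum(e, axis=1, keepdims=True)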

def cross_entropy_cost(X, Y, Theta):
    h = softmax(X, Theta)
    return -np.sum(Y * np.log(h)) / X.shape[0]


def cross_entropy_cost_gradient(X, Y, Theta):
    # (1/N) X^T (Y_hat - Y); note the argument is the one-hot matrix Y
    return X.T @ (softmax(X, Theta) - Y) / X.shape[0]
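
Since the derivation of $\frac{1}{N}\mathbf{X}^{T}(\hat{\mathbf{Y}} - \mathbf{Y})$ is long, a finite-difference comparison is a cheap way to gain confidence in it. The following sketch is self-contained: it uses a row-wise softmax (`softmax_rows`, a hypothetical name) applied to the $N \times K$ matrix $\mathbf{X}\Theta$, with random data and a step size `eps` chosen only for illustration.

```python
import numpy as np


def softmax_rows(H):
    # Row-wise softmax; each row is shifted by its own max for stability
    E = np.exp(H - np.max(H, axis=1, keepdims=True))
    return E / np.sum(E, axis=1, keepdims=True)


def cross_entropy_cost(X, Y, Theta):
    # Mean cross-entropy between one-hot rows of Y and softmax predictions
    return -np.sum(Y * np.log(softmax_rows(X @ Theta))) / X.shape[0]


def cross_entropy_cost_gradient(X, Y, Theta):
    # Analytic gradient from the derivation: (1/N) X^T (Y_hat - Y)
    return X.T @ (softmax_rows(X @ Theta) - Y) / X.shape[0]


# Random test problem (assumed sizes)
rng = np.random.default_rng(0)
N, M, K = 12, 4, 3
X = rng.normal(size=(N, M))
Y = np.eye(K)[rng.integers(0, K, size=N)]   # random one-hot labels
Theta = rng.normal(size=(M, K))

# Central finite differences over every entry of Theta
eps = 1e-6
numeric = np.zeros_like(Theta)
for r in range(M):          # feature index (row of Theta)
    for ell in range(K):    # category index (column of Theta)
        step = np.zeros_like(Theta)
        step[r, ell] = eps
        numeric[r, ell] = (cross_entropy_cost(X, Y, Theta + step)
                           - cross_entropy_cost(X, Y, Theta - step)) / (2 * eps)

analytic = cross_entropy_cost_gradient(X, Y, Theta)
max_abs_diff = np.max(np.abs(analytic - numeric))
```

The analytic gradient has the same $M \times K$ shape as $\Theta$, and `max_abs_diff` should be close to zero if the formula and the row-wise normalization are both correct.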