# Neural Networks

## Error Functions

### Discrete vs Continuous

Gradient descent requires a continuous error function, so the model's predictions must move from discrete to continuous.

To change from discrete to continuous predictions, we change the activation function. From the discrete step function:

$$
y =
\begin{cases}
    1 & \text{if } x \geq 0\\
    0 & \text{if } x < 0
\end{cases}
$$

To the Sigmoid Function:

$$
\sigma(x) = \dfrac{1}{1 + \mathrm{e}^{-x}}
$$
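To make the contrast concrete, here is a minimal sketch of both activation functions with NumPy. The function names `step` and `sigmoid` and the sample scores are illustrative choices, not part of the original notes.

```python
import numpy as np

def step(x):
    # Discrete activation: 1 where x >= 0, otherwise 0.
    return np.where(np.asarray(x) >= 0, 1, 0)

def sigmoid(x):
    # Continuous activation: squashes any real score into (0, 1).
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

scores = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])  # arbitrary example scores
print(step(scores))     # hard 0/1 decisions that jump at x = 0
print(sigmoid(scores))  # probabilities that vary smoothly with the score
```

A small change in the score never changes the step output except at 0, where it jumps. The sigmoid output always moves a little, which is what gives gradient descent something to follow.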

### Softmax Function

The softmax function is the equivalent of the sigmoid activation function for problems with three or more classes.

Given the linear function scores $z_1, \ldots, z_n$:

$$P(\textrm{class } i) = \dfrac{e^{z_i}}{e^{z_1} + \ldots + e^{z_n}}$$

For $n = 2$, the softmax function reduces to the sigmoid function applied to the difference of the two scores: $P(\textrm{class } 1) = \sigma(z_1 - z_2)$.

```python
import numpy as np

def softmax(L):
    # Exponentiate each score, then divide by the total so the results sum to 1.
    expL = np.exp(L)
    return (expL / np.sum(expL)).tolist()
```

```python
softmax([5, 6, 7])
```

```
[0.09003057317038046, 0.24472847105479764, 0.6652409557748219]
```
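The two-class case can be checked numerically. The sketch below also subtracts the largest score before exponentiating, a common way to avoid overflow that does not change the result. The name `stable_softmax` and the example scores are assumptions made for illustration.

```python
import numpy as np

def stable_softmax(scores):
    # Shifting every score by the maximum cancels out in the ratio,
    # but keeps np.exp from overflowing on large inputs.
    scores = np.asarray(scores, dtype=float)
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z1, z2 = 2.0, 0.5  # arbitrary example scores
print(stable_softmax([z1, z2])[0], sigmoid(z1 - z2))  # agree up to rounding
print(stable_softmax([1000.0, 1001.0, 1002.0]))       # finite despite large scores
```

With two classes, the softmax probability of the first class depends only on the difference $z_1 - z_2$, which is why it matches the sigmoid of that difference.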

### Maximum Likelihood

#### Cross-Entropy

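Cross-entropy is the negative of the logarithm of the product of the probabilities that the model assigns to the observed events, which equals the sum of the negative logarithms of those probabilities. A higher cross-entropy means the observed events were less likely under the model.

$$\textrm{Cross-Entropy} = - \sum_{i = 1}^{m} \left[ y_i\ln{(p_i)} + (1 - y_i)\ln{(1 - p_i)} \right]$$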

$$
\begin{aligned}
\textrm{CE}[(1, 1, 0), (0.8, 0.7, 0.1)] &= 0.69 \\
\textrm{CE}[(0, 0, 1), (0.8, 0.7, 0.1)] &= 5.12
\end{aligned}
$$

```python
import numpy as np

def cross_entropy(Y, P):
    result = 0
    # Add the negative log-probability of each observed label.
    for y, p in zip(Y, P):
        result -= y * np.log(p) + (1 - y) * np.log(1 - p)
    return result
```

```python
cross_entropy([1, 1, 0], (0.8, 0.7, 0.1))
```

```
0.6851790109107685
```

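Or simplified:

```python
import numpy as np

def cross_entropy(Y, P):
    # Convert to float arrays so the arithmetic is element-wise.
    # (np.float_ was removed in NumPy 2.0; np.asarray works across versions.)
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))
```

```python
cross_entropy([1, 1, 0], (0.8, 0.7, 0.1))
```

```
0.6851790109107685
```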
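The link to maximum likelihood can be verified with the same example: the cross-entropy is the negative logarithm of the product of the probabilities assigned to the observed labels. This is a small sketch, and the variable names are illustrative.

```python
import numpy as np

Y = np.array([1, 1, 0])        # labels from the example above
P = np.array([0.8, 0.7, 0.1])  # predicted probability that each label is 1

# Probability the model assigns to each label that was actually observed.
prob_of_observed = np.where(Y == 1, P, 1 - P)  # 0.8, 0.7, 0.9

likelihood = np.prod(prob_of_observed)               # product of probabilities
neg_log_likelihood = -np.log(likelihood)             # negative log of the product
sum_of_neg_logs = -np.sum(np.log(prob_of_observed))  # sum of negative logs

print(likelihood)
print(neg_log_likelihood, sum_of_neg_logs)  # both match the cross-entropy
```

Working with the sum of logarithms rather than the product avoids multiplying many small numbers together, which would quickly underflow for large datasets.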

#### Multi-class Cross-Entropy

$$\textrm{Cross-Entropy} = - \sum_{i = 1}^{n}\sum_{j = 1}^{m} y_{ij}\ln{(p_{ij})}$$

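Here $n$ is the number of points and $m$ is the number of classes; $y_{ij}$ is 1 if point $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that point $i$ belongs to class $j$.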
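A minimal sketch of the multi-class formula, assuming one-hot encoded labels and strictly positive predicted probabilities. The function name `multiclass_cross_entropy` and the sample data are made up for illustration.

```python
import numpy as np

def multiclass_cross_entropy(Y, P):
    # Y: one-hot labels, shape (n_points, n_classes).
    # P: predicted class probabilities, same shape, each row summing to 1.
    # Assumes every entry of P is > 0 so that the logarithm is finite.
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P))

Y = [[1, 0, 0],   # first point belongs to class 1
     [0, 0, 1]]   # second point belongs to class 3
P = [[0.7, 0.2, 0.1],
     [0.3, 0.3, 0.4]]
print(multiclass_cross_entropy(Y, P))  # equals -(ln 0.7 + ln 0.4)
```

Because each row of `Y` has a single 1, only the probability assigned to the true class of each point contributes to the sum.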

## Logistic Regression

### Error Function

$$\textrm{Error Function} = - \dfrac{1}{m} \sum_{i=1}^{m} \left[ (1 - y_i)\ln{(1 - \hat{y}_i)} + y_i\ln{(\hat{y}_i)} \right]$$

Since $\hat{y}_i$ is the sigmoid of the linear function $Wx^{(i)} + b$, the full formula is:

$$E(W,b) = - \dfrac{1}{m} \sum_{i=1}^{m} \left[ (1 - y_i)\ln{(1 - \sigma(Wx^{(i)} + b))} + y_i\ln{(\sigma(Wx^{(i)} + b))} \right]$$

To minimize this error, we use gradient descent.
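The error function can be sketched in a few lines of NumPy. The function name `error`, the array shapes, and the toy dataset below are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def error(W, b, X, y):
    # X: (m, n) points; W: (n,) weights; b: scalar bias; y: (m,) labels in {0, 1}.
    y_hat = sigmoid(X @ W + b)
    return -np.mean((1 - y) * np.log(1 - y_hat) + y * np.log(y_hat))

# Toy data: two points of each class (hypothetical values).
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 0.0]])
y = np.array([1, 1, 0, 0])

print(error(np.zeros(2), 0.0, X, y))           # ln 2, since every prediction is 0.5
print(error(np.array([1.0, 1.0]), 0.0, X, y))  # lower error for weights that separate the classes
```

Weights that give every point a prediction of 0.5 carry no information, and the error is $\ln 2$. Weights that push predictions toward the correct labels reduce it.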

### Gradient Descent

Gradient descent uses the derivatives of the error function to minimize it.

The derivative of the sigmoid function:

$$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$
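This identity can be checked numerically by comparing it with a central finite-difference approximation. The step size `h` and the sample points are arbitrary choices for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # Closed form from the identity above.
    s = sigmoid(x)
    return s * (1 - s)

x = np.linspace(-4, 4, 9)  # sample points
h = 1e-5                   # finite-difference step
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

print(np.max(np.abs(numeric - sigmoid_prime(x))))  # very small discrepancy
```

The derivative is largest at $x = 0$, where it equals $0.25$, and shrinks toward zero as $|x|$ grows.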

And the derivatives of the error $E$ at a point $x$, with respect to the weight $w_j$ and the bias $b$:

$$\dfrac{\partial}{\partial w_j}E = -(y - \hat{y})x_j$$

$$\dfrac{\partial}{\partial b}E = -(y - \hat{y})$$
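The result follows from the chain rule and the sigmoid derivative. A sketch of the derivation for a single point, writing $E = -y\ln{(\hat{y})} - (1 - y)\ln{(1 - \hat{y})}$ with $\hat{y} = \sigma(Wx + b)$:

$$
\begin{aligned}
\dfrac{\partial \hat{y}}{\partial w_j} &= \hat{y}(1 - \hat{y})x_j \\
\dfrac{\partial E}{\partial w_j} &= -\dfrac{y}{\hat{y}}\dfrac{\partial \hat{y}}{\partial w_j} + \dfrac{1 - y}{1 - \hat{y}}\dfrac{\partial \hat{y}}{\partial w_j} \\
&= -y(1 - \hat{y})x_j + (1 - y)\hat{y}x_j \\
&= -(y - \hat{y})x_j
\end{aligned}
$$

The same steps with $\dfrac{\partial \hat{y}}{\partial b} = \hat{y}(1 - \hat{y})$ give $\dfrac{\partial E}{\partial b} = -(y - \hat{y})$.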

A small gradient means we'll change our coordinates by a little bit, and a large gradient means we'll change our coordinates by a lot.

Since a gradient descent step consists of subtracting a multiple of the gradient of the error function, with the learning rate $\alpha$ as the multiplier, each point updates the weights in the following way:

$$w_i' \gets w_i - \alpha[-(y - \hat{y})x_i]$$

which is equivalent to:

$$w_i' \gets w_i + \alpha(y - \hat{y})x_i$$

Similarly, it updates the bias in the following way:

$$b' \gets b + \alpha(y - \hat{y})$$

#### Pseudocode

1. Start with random weights: $w_1, \ldots, w_n, b$
2. For every point ($x_1, \ldots, x_n$):
    1. For $i = 1 \ldots n$:
        1. Update $w_i' \gets w_i - \alpha(\hat{y} - y)x_i$
    2. Update $b' \gets b - \alpha(\hat{y} - y)$
3. Repeat until the error is small
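The steps above can be sketched in NumPy as follows. The function name `train`, the learning rate, the number of passes, the random seed, and the toy dataset are all assumptions made for illustration. A fixed number of passes stands in for "repeat until the error is small".

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(X, y, alpha=0.1, epochs=100, seed=0):
    # X: (m, n) points; y: (m,) labels in {0, 1}.
    # alpha, epochs and seed are illustrative values, not taken from the text.
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=X.shape[1])  # step 1: random weights
    b = rng.normal(scale=0.1)                   # step 1: random bias
    for _ in range(epochs):                     # step 3: repeat
        for x_i, y_i in zip(X, y):              # step 2: visit every point
            y_hat = sigmoid(x_i @ W + b)
            W -= alpha * (y_hat - y_i) * x_i    # updates every weight w_i at once
            b -= alpha * (y_hat - y_i)          # one bias update per point
    return W, b

# Toy data: two points of each class (hypothetical values).
X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, 0.0]])
y = np.array([1, 1, 0, 0])

W, b = train(X, y)
print(np.round(sigmoid(X @ W + b)))  # predicted labels after training
```

The vectorized line `W -= alpha * (y_hat - y_i) * x_i` replaces the inner loop over $i$, since every weight uses the same factor $\alpha(\hat{y} - y)$ multiplied by its own coordinate $x_i$.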