# Neural Networks

## Error Functions

### Discrete vs Continuous

To use gradient descent, the error function must be continuous and differentiable. To get one, we need to move from discrete predictions to continuous ones.

To change from discrete to continuous predictions, we need to change the activation function. From the discrete step function:

$$
y =
\begin{cases}
    1 & \text{if } x \geq 0\\
    0 & \text{if } x < 0
\end{cases}
$$

To the Sigmoid Function:

$$
\sigma(x) = \dfrac{1}{1 + \mathrm{e}^{-x}}
$$
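As a small illustration of the difference, the sketch below evaluates both activations on a few scores. The helper names `step` and `sigmoid` and the sample values are chosen here for illustration only.

```python
import numpy as np

def step(x):
    """Discrete step activation: 1 where x >= 0, else 0."""
    return np.where(x >= 0, 1, 0)

def sigmoid(x):
    """Continuous sigmoid activation, with values strictly between 0 and 1."""
    return 1 / (1 + np.exp(-x))

# Illustrative scores around the decision boundary
x = np.array([-2.0, -0.1, 0.0, 0.1, 2.0])

print(step(x))     # jumps from 0 to 1 at x = 0
print(sigmoid(x))  # changes smoothly and equals 0.5 at x = 0
```

The step output is flat almost everywhere, so it gives gradient descent nothing to follow, while the sigmoid output changes a little whenever the score changes a little.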

### Softmax Function

The softmax function is the multi-class counterpart of the sigmoid activation function, used when the problem has three or more classes.

Linear function scores: $z_1, \ldots, z_n$

$$P(\textrm{class i}) = \dfrac{e^{z_i}}{e^{z_1} + \ldots + e^{z_n}}$$

For $n = 2$, the softmax function reduces to the sigmoid function applied to the difference of the two scores: $P(\textrm{class 1}) = \sigma(z_1 - z_2)$.

In [3]:
import numpy as np

def softmax(L):
    expL = np.exp(L)
    sumExpL = sum(expL)
    result = []
    for i in expL:
        result.append(i/sumExpL)
    return result

In [4]:
softmax([5,6,7])

[0.09003057317038046, 0.24472847105479764, 0.6652409557748219]
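The two-class case mentioned earlier can be checked numerically. This is a minimal sketch with illustrative scores `z1` and `z2`. It uses a vectorized `softmax` helper defined here, not the loop version above.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(scores):
    """Vectorized softmax over a 1-D array of scores."""
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)

# Illustrative scores for a two-class problem
z1, z2 = 2.0, 0.5
p = softmax(np.array([z1, z2]))

# Probability of class 1 from softmax vs. sigmoid of the score difference
print(p[0], sigmoid(z1 - z2))
```

Both printed values agree, because dividing the numerator and denominator of $e^{z_1} / (e^{z_1} + e^{z_2})$ by $e^{z_1}$ gives $1 / (1 + e^{-(z_1 - z_2)})$.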

### Maximum Likelihood

#### Cross-Entropy

It's the negative of the logarithm of the product of the probabilities, which equals the sum of the negative logarithms of the individual probabilities. A higher cross-entropy implies a lower probability for the observed events.

$$\textrm{Cross-Entropy} = - \sum_{i = 1}^{m} \left[ y_i\ln{(p_i)} + (1 - y_i)\ln{(1 - p_i)} \right]$$

$$
\begin{aligned}
\textrm{CE}[(1, 1, 0), (0.8, 0.7, 0.1)] &= 0.69 \\
\textrm{CE}[(0, 0, 1), (0.8, 0.7, 0.1)] &= 5.12
\end{aligned}
$$

In [1]:
import numpy as np

def cross_entropy(Y, P):
    result = 0
    
    for i in range(0, len(Y)):
        result -= Y[i] * np.log(P[i]) + (1 - Y[i]) * np.log(1 - P[i])
    return result

In [3]:
cross_entropy([1, 1, 0], (0.8, 0.7, 0.1))

0.6851790109107685

Or simplified:

In [2]:
import numpy as np

def cross_entropy(Y, P):
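    # np.float_ was removed in NumPy 2.0; convert to float arrays explicitly
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)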
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))

In [4]:
cross_entropy([1, 1, 0], (0.8, 0.7, 0.1))

0.6851790109107685

#### Multi-class Cross-Entropy

$$\textrm{Cross-Entropy} = - \sum_{i = 1}^{n}\sum_{j = 1}^{m} y_{ij}\ln{(p_{ij})}$$

where $n$ is the number of data points and $m$ is the number of classes.
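A minimal sketch of this formula, assuming one-hot encoded labels so that only the probability assigned to the true class contributes. The function name `multiclass_cross_entropy` and the sample data are illustrative.

```python
import numpy as np

def multiclass_cross_entropy(Y, P):
    """Y: one-hot labels, P: predicted probabilities, both of shape (n, m)."""
    Y = np.asarray(Y, dtype=float)
    P = np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P))

# Two data points, three classes (illustrative values)
Y = [[1, 0, 0],
     [0, 0, 1]]
P = [[0.7, 0.2, 0.1],
     [0.1, 0.3, 0.6]]

# Only the true-class probabilities count: -(ln 0.7 + ln 0.6)
print(multiclass_cross_entropy(Y, P))
```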

## Logistic Regression

### Error Function

$$\textrm{Error Function} = - \dfrac{1}{m} \sum_{i=1}^{m} \left[ (1 - y_i)\ln{(1 - \hat{y_i})} + y_i\ln{(\hat{y_i})} \right]$$

Since $\hat{y_i}$ is given by the sigmoid of the linear function $Wx^{(i)} + b$, the full formula is:

$$E(W,b) = - \dfrac{1}{m} \sum_{i=1}^{m} \left[ (1 - y_i)\ln{(1 - \sigma(Wx^{(i)} + b))} + y_i\ln{(\sigma(Wx^{(i)} + b))} \right]$$
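The error function $E(W, b)$ can be sketched in NumPy as follows. The function name `error_function`, the array shapes, and the sample data are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def error_function(W, b, X, y):
    """Mean cross-entropy of a logistic model.

    X has shape (m, n), W has shape (n,), y has shape (m,).
    """
    y_hat = sigmoid(X @ W + b)
    return -np.mean((1 - y) * np.log(1 - y_hat) + y * np.log(y_hat))

# Illustrative data: three points with two features each
X = np.array([[1.0, 2.0], [-1.0, -1.5], [0.5, -0.5]])
y = np.array([1.0, 0.0, 1.0])

# With zero weights every prediction is 0.5, so the error is ln(2)
W = np.zeros(2)
b = 0.0
print(error_function(W, b, X, y))
```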

To minimize this error, we use gradient descent.

### Gradient Descent

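Gradient descent uses the derivatives of the error function (its gradient) to find the direction in which to change the weights so that the error decreases.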

The derivative of the sigmoid function:

$$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$
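This identity can be sanity-checked against a central finite difference. The step size `eps` and the sample points are arbitrary choices for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    """Analytic derivative: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1 - s)

x = np.linspace(-4, 4, 9)
eps = 1e-6

# Central finite-difference approximation of the derivative
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

# Largest gap between the numeric and analytic derivatives
print(np.max(np.abs(numeric - sigmoid_prime(x))))
```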

And the derivatives of the error $E$ at a point $x$, with respect to the weight $w_i$ and the bias $b$:

$$\dfrac{\partial}{\partial w_i}E = -(y - \hat{y})x_i$$

$$\dfrac{\partial}{\partial b}E = -(y - \hat{y})$$
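For readers who want the intermediate steps, here is a short derivation for a single point. It assumes the cross-entropy error $E = -y\ln(\hat{y}) - (1 - y)\ln(1 - \hat{y})$ with $\hat{y} = \sigma(Wx + b)$, and uses $\sigma'(x) = \sigma(x)(1 - \sigma(x))$:

$$
\begin{aligned}
\dfrac{\partial \hat{y}}{\partial w_i} &= \hat{y}(1 - \hat{y})x_i \\
\dfrac{\partial E}{\partial w_i} &= -\dfrac{y}{\hat{y}}\dfrac{\partial \hat{y}}{\partial w_i} + \dfrac{1 - y}{1 - \hat{y}}\dfrac{\partial \hat{y}}{\partial w_i} \\
&= -y(1 - \hat{y})x_i + (1 - y)\hat{y}x_i \\
&= -(y - \hat{y})x_i
\end{aligned}
$$

The bias case is identical with $x_i$ replaced by $1$, which gives $\partial E / \partial b = -(y - \hat{y})$.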

A small gradient means we'll change our coordinates by a little bit, and a large gradient means we'll change our coordinates by a lot.

A gradient descent step consists of subtracting a multiple of the gradient of the error function at each point. That multiple, $\alpha$, is the learning rate. The weights are therefore updated as follows:

$$w_i' \gets w_i - \alpha[-(y - \hat{y})x_i]$$

which is equivalent to:

$$w_i' \gets w_i + \alpha(y - \hat{y})x_i$$

Similarly, the bias is updated as follows:

$$b' \gets b + \alpha(y - \hat{y})$$

#### Pseudocode

1. Start with random weights: $w_1, \ldots, w_n, b$
2. For every point ($x_1, \ldots, x_n$):
    1. For $i = 1 \ldots n$:
        1. Update $w_i' \gets w_i - \alpha(\hat{y} - y)x_i$
    2. Update $b' \gets b - \alpha(\hat{y} - y)$
3. Repeat until the error is small
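The steps above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation. The function name `train`, the AND-gate toy data, and the fixed number of epochs (standing in for "repeat until the error is small") are all assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train(X, y, alpha=0.5, epochs=2000, seed=0):
    """Per-point gradient descent for logistic regression."""
    rng = np.random.default_rng(seed)
    # 1. Start with random weights and bias
    w = rng.normal(scale=0.1, size=X.shape[1])
    b = rng.normal(scale=0.1)
    # 3. A fixed number of passes stands in for "until the error is small"
    for _ in range(epochs):
        # 2. For every point, update all weights and then the bias
        for x_point, y_point in zip(X, y):
            y_hat = sigmoid(np.dot(w, x_point) + b)
            w -= alpha * (y_hat - y_point) * x_point
            b -= alpha * (y_hat - y_point)
    return w, b

# Toy data: the logical AND of two inputs (linearly separable)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 0, 1])

w, b = train(X, y)
predictions = (sigmoid(X @ w + b) > 0.5).astype(int)
print(predictions)
```

The weight update is vectorized over $i$, so the inner loop over $i = 1 \ldots n$ becomes a single array operation.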

## Feedforward

Feedforward is the process neural networks use to turn the input into an output.

$$\hat{y} = \sigma \circ W^{(2)} \circ \sigma \circ W^{(1)}(x)$$
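A minimal sketch of this composition for a network with one hidden layer. The layer sizes, the random weights, and the omission of bias terms (to match the formula) are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def feedforward(x, W1, W2):
    """Apply W1, sigmoid, W2, sigmoid in sequence."""
    hidden = sigmoid(W1 @ x)
    return sigmoid(W2 @ hidden)

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])   # 3 input values
W1 = rng.normal(size=(4, 3))     # 3 inputs -> 4 hidden units
W2 = rng.normal(size=(1, 4))     # 4 hidden units -> 1 output

y_hat = feedforward(x, W1, W2)
print(y_hat)  # a single value between 0 and 1
```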

## Backpropagation

Training with backpropagation consists of the following steps:

1. Do a feedforward operation.
2. Compare the output of the model with the desired output.
3. Calculate the error.
4. Run the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
5. Use this to update the weights and get a better model.
6. Repeat until the model is good enough.
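The steps above can be sketched for a network with one hidden layer. This sketch assumes sigmoid activations, a squared-error loss $\frac{1}{2}\sum (y - \hat{y})^2$, no bias terms, and a single training point. The names `backprop_step` and `loss` and all sizes are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def loss(x, y, W1, W2):
    y_hat = sigmoid(W2 @ sigmoid(W1 @ x))
    return 0.5 * np.sum((y - y_hat) ** 2)

def backprop_step(x, y, W1, W2, learnrate=0.5):
    # Step 1: feedforward
    hidden = sigmoid(W1 @ x)
    y_hat = sigmoid(W2 @ hidden)
    # Steps 2-3: compare with the target and compute the error
    error = y - y_hat
    # Step 4: propagate the error backwards through each layer
    output_term = error * y_hat * (1 - y_hat)
    hidden_term = (W2.T @ output_term) * hidden * (1 - hidden)
    # Step 5: update the weights
    W2 = W2 + learnrate * np.outer(output_term, hidden)
    W1 = W1 + learnrate * np.outer(hidden_term, x)
    return W1, W2

rng = np.random.default_rng(0)
x = np.array([0.5, -1.0, 2.0])
y = np.array([1.0])
W1 = rng.normal(scale=0.5, size=(4, 3))
W2 = rng.normal(scale=0.5, size=(1, 4))

loss_before = loss(x, y, W1, W2)
# Step 6: repeat (a fixed 100 iterations here)
for _ in range(100):
    W1, W2 = backprop_step(x, y, W1, W2)
loss_after = loss(x, y, W1, W2)
print(loss_before, loss_after)
```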

## Gradient Descent with Squared Errors

$$E = \dfrac{1}{2}\sum_{\mu}\sum_j[y_j^{\mu} - \hat{y}_j^{\mu}]^2$$

Start with the inner sum over $j$, where $j$ indexes the output units of the network. For each output unit, take the difference between the true value $y$ and the network's prediction $\hat{y}$, square it, and then sum all those squares.

Then the other sum over $\mu$ is a sum over all the data points. So, for each data point you calculate the inner sum of the squared differences for each output unit. Then you sum up those squared differences for each data point. That gives you the overall error for all the output predictions for all the data points.

The SSE (Sum of Squared Errors) is a good choice for a few reasons. The square ensures the error is always positive and larger errors are penalized more than smaller errors. Also, it makes the math nice, always a plus.
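A short sketch of the SSE formula, where rows are data points ($\mu$) and columns are output units ($j$). The function name `sse` and the sample values are illustrative.

```python
import numpy as np

def sse(y, y_hat):
    """Half the sum of squared errors over all data points and output units."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return 0.5 * np.sum((y - y_hat) ** 2)

# Three data points, two output units each (illustrative values)
y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y_hat = [[0.8, 0.1], [0.3, 0.6], [0.9, 0.7]]

# Squared differences sum to 0.40, so the SSE is 0.20
print(sse(y, y_hat))
```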

### Caveats

Since the weights will just go wherever the gradient takes them, they can end up where the error is low, but not the lowest. These spots are called local minima. If the weights are initialized with the wrong values, gradient descent could lead the weights into a local minimum.
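To make this concrete, the sketch below runs gradient descent on a hypothetical one-dimensional error curve, $f(w) = w^4 - 3w^2 + w$, which has two minima. The function, the learning rate, and the starting points are assumptions chosen only to illustrate the effect.

```python
def f(w):
    """Hypothetical 1-D error curve with two minima."""
    return w**4 - 3 * w**2 + w

def f_prime(w):
    """Derivative of f."""
    return 4 * w**3 - 6 * w + 1

def descend(w, learnrate=0.01, steps=1000):
    """Plain gradient descent starting from w."""
    for _ in range(steps):
        w -= learnrate * f_prime(w)
    return w

w_left = descend(-2.0)   # ends near w = -1.30, the lowest point
w_right = descend(2.0)   # ends near w = 1.13, a local minimum

print(w_left, f(w_left))
print(w_right, f(w_right))
```

Both runs stop where the gradient is zero, but only the run started at $-2$ reaches the lower of the two minima.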

## Gradient Descent Implementation

One weight update can be calculated as:

$$\Delta w_i = \eta \delta x_i$$

with the error term $\delta$ as:

$$\delta = (y - \hat{y})f'(h) = (y - \hat{y})f'(\sum w_ix_i)$$

In the above equation $(y - \hat{y})$ is the output error, and $f'(h)$ refers to the derivative of the activation function, $f(h)$. We'll call that derivative the output gradient.

In [1]:
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1/(1+np.exp(-x))

def sigmoid_prime(x):
    """
    Derivative of the sigmoid function
    """
    return sigmoid(x) * (1 - sigmoid(x))

learnrate = 0.5
x = np.array([1, 2, 3, 4])
y = np.array(0.5)

# Initial weights
w = np.array([0.5, -0.5, 0.3, 0.1])

# Calculate one gradient descent step for each weight

# Calculate the node's linear combination of inputs and weights
h = np.dot(x, w)
# x[0]*w[0] + x[1]*w[1] + ... + x[n]*w[n]

# Calculate output of neural network
nn_output = sigmoid(h)

# Calculate error of neural network
error = y - nn_output

# Calculate the error term
error_term = error * sigmoid_prime(h)

# Calculate the change in weights
del_w = learnrate * error_term * x

print('Neural Network output:')
print(nn_output)
print('Amount of Error:')
print(error)
print('Change in Weights:')
print(del_w)

Neural Network output:
0.6899744811276125
Amount of Error:
-0.1899744811276125
Change in Weights:
[-0.02031869 -0.04063738 -0.06095608 -0.08127477]


In [5]:
import numpy as np
import pandas as pd

admissions = pd.read_csv('data/binary.csv')

# Make dummy variables for rank (dtype=float keeps the feature matrix numeric)
data = pd.concat([admissions, pd.get_dummies(admissions['rank'], prefix='rank', dtype=float)], axis=1)
data = data.drop('rank', axis=1)

# Standardize features
for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    data[field] = (data[field] - mean) / std
    
# Split off random 10% of the data for testing
np.random.seed(42)
sample = np.random.choice(data.index, size=int(len(data)*0.9), replace=False)
data, test_data = data.loc[data.index[sample]], data.drop(sample)

# Split into features and targets
features, targets = data.drop('admit', axis=1), data['admit']
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']

In [7]:
import numpy as np

def sigmoid(x):
    """
    Calculate sigmoid
    """
    return 1 / (1 + np.exp(-x))

# Compute the sigmoid once and reuse it, which avoids evaluating
# the exponential twice
def sigmoid_prime(x):
    output = sigmoid(x)
    
    return output * (1 - output)
    
# Use the same seed to make debugging easier
np.random.seed(42)

n_records, n_features = features.shape
last_loss = None

# Initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)

# Neural Network hyperparameters
epochs = 1000
learnrate = 0.5

for e in range(epochs):
    del_w = np.zeros(weights.shape)
    for x, y in zip(features.values, targets):
        # Loop through all records, x is the input, y is the target

        # Calculate the node's linear combination of inputs and weights
        h = np.dot(x, weights)

        # Calculate the output
        output = sigmoid(h)

        # Calculate the error
        error = y - output

        # Calculate the error term
        error_term = error * sigmoid_prime(h)

        # Calculate the change in weights for this sample
        # and add it to the total weight change
        del_w += learnrate * error_term * x

    # Update weights using the learning rate and the average change in weights.
    # Note: learnrate also scales del_w above, so the effective learning
    # rate of this update is learnrate**2.
    weights += (learnrate * del_w) / n_records

    # Printing out the mean square error on the training set
    if e % (epochs / 10) == 0:
        out = sigmoid(np.dot(features, weights))
        loss = np.mean((out - targets) ** 2)
        if last_loss and last_loss < loss:
            print("Train loss: ", loss, "  WARNING - Loss Increasing")
        else:
            print("Train loss: ", loss)
        last_loss = loss


# Calculate accuracy on test data
test_out = sigmoid(np.dot(features_test, weights))
predictions = test_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))

Train loss:  0.2634711464098939
Train loss:  0.22351730609465992
Train loss:  0.20940833581363916
Train loss:  0.20359343681746
Train loss:  0.20087567293629532
Train loss:  0.19945113185302385
Train loss:  0.1986335569379286
Train loss:  0.19813048800765415
Train loss:  0.19780393857042008
Train loss:  0.19758300602502077
Prediction accuracy: 0.725
