<br/>
$$ \huge{\underline{\textbf{ 1-Layer Neural Network }}} $$

$$ \large{\textbf{ (Binary Logistic Regression) }} $$
<br/>

<font color='red'>
TODO:
    
* redo Contents

Contents:
* [Introduction](#Introduction)
* [Load and Explore Data](#Load-and-Explore-Data)
* [Preprocess](#Preprocess)
* [Neural Network](#Neural-Network)
* [Train Classifier](#Train-Classifier)



# Introduction

This notebook presents the simplest possible **1-layer neural network** trained with backpropagation. I say "neural network", but most people would call it binary logistic regression.

**Model**

* one layer: fully connected with sigmoid activation
* loss: binary cross-entropy
* optimizer: vanilla SGD

**Dependencies**
* numpy, matplotlib - neural net and backprop
* optional:
  * pandas - load college_admission dataset

# Neural Network

In [1]:
import numpy as np
import matplotlib.pyplot as plt

In [2]:
def train_classifier(x, y, nb_epochs, W, b):
    """Params:
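        x - inputs  - shape: (nb_examples, nb_inputs)
        y - targets - shape: (nb_examples, nb_outputs)
        nb_epochs - number of passes over the full dataset
        W - weights, modified in place - shape: (nb_inputs, nb_outputs)
        b - biases, modified in place  - shape: (1, nb_outputs)
       Note: the learning rate lr is read from the global scope.
       Returns the list of per-epoch train losses.
    """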
    losses = []     # for plotting

    for e in range(nb_epochs):
        
        # Forward
        z = x @ W + b                                      # (eq 1a)    z.shape: (batch_size, nb_neurons)
        y_hat = sigmoid(z)                                 # (eq 1b)    y_hat.shape: (batch_size, nb_neurons)
        
        # Backward
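        ro = y_hat - y                                     # dJ/dz for sigmoid + binary CE (without the 1/m factor)
        dW = (x.T @ ro) / len(x)                           # dJ/dW, averaged over the batch
        db = np.sum(ro, axis=0, keepdims=True) / len(x)    # dJ/db, averaged over the batch
        
        # Gradient check (slows things hugely). Disabled here: numerical_gradient
        # is only defined further down, where loss() takes (y, y_hat) and is MSE.
        # ngW, ngb = numerical_gradient(x, y, W, b)
        # assert np.allclose(ngW, dW) and np.allclose(ngb, db)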

        W += -lr * dW
        b += -lr * db

        # Train loss
        loss_train = loss(y_hat, y)                            # binary cross-entropy
        losses.append(loss_train)                              # save for plotting

        if e % (nb_epochs / 10) == 0:
            print('loss {0}'.format(loss_train.round(4)))
            
    return losses

Helper Functions

In [3]:
def forward(x, W, b):                      # x.shape (batch_size, nb_inputs)
    return sigmoid( x @ W + b )

In [4]:
def loss(y_hat, y):                                   #                          y_hat, y shapes: (batch_size, nb_outputs)
    result = -y*np.log(y_hat) -(1-y)*np.log(1-y_hat)  # binary cross-entropy     result.shape: (batch_size, 1)
    return np.mean( result )                          # average over batch

In [5]:
def sigmoid(x, deriv=False):
    if deriv:
        return sigmoid(x)*(1-sigmoid(x))              # (eq 2b)
    return 1/(1+np.exp(-x))                           # (eq 2a)

# Equations

The forward pass is simple:
$$ z = xW + b \quad\quad \hat{y} = S(z) \tag{1a, 1b} $$

* $x$ is the matrix of input features; rows are training examples in the mini-batch and columns are features
* $W$ is the weight matrix; one column holds the weights of one neuron
* $b$ is the bias, a row vector with one entry per neuron, broadcast over the rows of $xW$
* $z$ is the matrix of preactivations; rows correspond to rows of $x$, and the single column is our one neuron
* $\hat{y}$ is the matrix of model estimates; rows correspond to rows of $x$, and the single column is the output probability in $[0, 1]$
* $S$ is the sigmoid function (defined below)
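
To make the shapes concrete, here is a minimal sketch of the forward pass on a hand-made mini-batch. The weight values are arbitrary and chosen only for illustration; the bias `b` is included because the code in this notebook computes `x @ W + b`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical mini-batch: 4 examples, 2 features, 1 output neuron
x = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])
W = np.array([[0.5],
              [-0.5]])          # shape (2, 1): one column per neuron
b = np.zeros((1, 1))            # shape (1, 1): broadcast over the rows of x @ W

z = x @ W + b                   # preactivations, shape (4, 1)
y_hat = sigmoid(z)              # probabilities, shape (4, 1)

print(z.shape, y_hat.shape)
print(y_hat.round(3).ravel())
```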

Sigmoid transfer function and its derivative ([proof](https://en.wikipedia.org/wiki/Logistic_function#Derivative))
$$ S(z) = \frac{1}{1+e^{-z}} \quad\quad \frac{\partial S}{\partial z} = S(z)(1-S(z)) \tag{2a, 2b}$$
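
If you would rather check the derivative than read the proof, a quick sketch is to compare the closed form with a central finite difference. The grid of `z` values and the step `eps` below are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)     # a few test points
eps = 1e-5                          # finite-difference step

analytic = sigmoid(z) * (1.0 - sigmoid(z))                       # closed-form derivative
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2.0 * eps)    # central difference

max_abs_diff = np.max(np.abs(analytic - numeric))
print(max_abs_diff)                 # should be tiny
```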

Binary Cross-Entropy loss function
$$ J(y,\hat{y}) = \frac{1}{m} \sum_{i=1}^{m} \big[ -y_i \log(\hat{y}_i) - (1-y_i)\log(1-\hat{y}_i) \big] \tag{3}$$
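
As a small worked example of this loss, take two made-up examples: a positive one predicted at 0.9 and a negative one predicted at 0.2. Only the term matching the true label contributes for each row.

```python
import numpy as np

# Made-up targets and predictions, shape (batch_size, 1)
y = np.array([[1.0],
              [0.0]])
y_hat = np.array([[0.9],
                  [0.2]])

# Row 0 contributes -log(0.9), row 1 contributes -log(1 - 0.2)
per_example = -y * np.log(y_hat) - (1.0 - y) * np.log(1.0 - y_hat)
J = np.mean(per_example)            # average over the batch

print(per_example.ravel(), J)
```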

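Its derivative with respect to the preactivations (the sigmoid derivative from 2b cancels out)
$$ \frac{\partial J}{\partial z} = \frac{1}{m} (\hat{y} - y) \tag{4}$$

Backward pass

$$ \frac{\partial J}{\partial W} = \frac{1}{m} x^T (\hat{y} - y) \quad\quad \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i) \tag{5a, 5b} $$

With the MSE loss used in the second half of this notebook, the sigmoid derivative does not cancel:

$$ \frac{\partial L_\text{MSE}}{\partial W} = \frac{1}{m}x^T \big[ -(y-\hat{y}) \odot S'(z) \big] \quad\quad\quad \text{ where $\odot$ is element-wise product} $$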

If you are wondering how the above came about, good resources are [here](http://cs231n.stanford.edu/handouts/linear-backprop.pdf) and [here](http://cs231n.stanford.edu/handouts/derivatives.pdf), both taken from the well-known CS231n course.
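
For the binary cross-entropy case, a short chain-rule sketch for a single example shows why the code uses `y_hat - y` directly. Here $J_i$ denotes the loss of one example, a symbol introduced only for this derivation:

$$
\begin{aligned}
J_i &= -y \log(\hat{y}) - (1-y)\log(1-\hat{y}) \\
\frac{\partial J_i}{\partial \hat{y}} &= -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}} = \frac{\hat{y}-y}{\hat{y}(1-\hat{y})} \\
\frac{\partial \hat{y}}{\partial z} &= S(z)(1-S(z)) = \hat{y}(1-\hat{y}) \\
\frac{\partial J_i}{\partial z} &= \frac{\partial J_i}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \hat{y} - y
\end{aligned}
$$

Averaging over the $m$ examples in the mini-batch gives the $\frac{1}{m}$ factor that the code applies when it divides by `len(x)`.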

# Solve AND-Gate

In [None]:
# training examples   A    B
x_train = np.array([[0.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 0.0],
                    [1.0, 1.0]])

# desired outputs     Z
y_train = np.array([[0.0],
                    [0.0],
                    [0.0],
                    [1.0]])

Initialize neural net

In [None]:
# Hyperparams
nb_epochs = 2000
lr = 1

# Initialize
np.random.seed(0)  # for reproducibility
n_inputs, n_outputs = x_train.shape[1], y_train.shape[1]  # get dataset shape
W = np.random.normal(scale=n_inputs**-.5, size=[n_inputs, n_outputs])  # Xavier init
b = np.zeros(shape=[1, n_outputs])

Before training

In [None]:
forward(x_train, W, b).round(3)

In [None]:
losses = train_classifier(x_train, y_train, nb_epochs, W, b)

After training

In [None]:
forward(x_train, W, b).round(3)

In [None]:
plt.plot(losses)

# Neural Network

Imports

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
def forward(x, W, b):
    assert x.ndim == 2; assert W.ndim == 2

    z = x @ W + b                               # linear combination,     z.shape: (batch_size, nb_neurons)
    y_hat = sigmoid(z)                          # transfer function,      y_hat.shape: (batch_size, nb_neurons)
    
    assert z.ndim == 2; assert y_hat.ndim == 2
    return y_hat

<font color='red'>

TODO: change to BCE

Loss functions


$$ L_\text{MSE}(y,\hat{y}) = \frac{1}{2m} \sum_{i=1}^{m} (y_i-\hat{y}_i)^2 $$

$m$ is the number of examples in the mini-batch

In [None]:
def loss(y, y_hat):
    assert y_hat.ndim == 2                 # y_hat.shape: (batch_size, 1)
    assert y.ndim == 2                     # y.shape: (batch_size, 1)
    assert y_hat.shape[1] == 1
    
    # Option #1: binary cross entropy loss (better)
    #result = -1 * ( y*np.log(y_hat) + (1-y)*np.log(1-y_hat) )      # result.shape: [batch_size, 1]
    #result = np.mean(result)  # average over batch                 # result: scalar
    
    # Option #2: MSE loss (simpler)
    result = .5 * np.mean((y-y_hat)**2)                          # result: scalar
    
    assert y_hat.shape[1] == 1
    return result

In [None]:
def backward(x, y, W, b):
    assert x.ndim == 2; assert y.ndim == 2; assert W.ndim == 2
    
    # Forward pass
    z = x @ W + b
    y_hat = sigmoid(z)
    
    # Backward pass
    # ro = -(y-y_hat)                               # Option #1: binary CE
    ro = -(y-y_hat) * sigmoid(z, deriv=True)      # Option #2: MSE (sigmoid_deriv was never defined)
    del_W = (x.T @ ro) / len(x)
    del_b = np.sum(ro, axis=0, keepdims=True) / len(x)
    
    assert del_W.ndim == 2
    assert del_b.ndim == 2
    return del_W, del_b

Numerical gradient check

In [None]:
def numerical_gradient(x, y, W, b):
    """Check gradient numerically"""
    assert W.ndim == 2
    assert b.ndim == 2
    assert b.shape[0] == 1
    
    eps = 1e-4
    
    # Weights
    del_W = np.zeros_like(W)    
    for r in range(W.shape[0]):
        for c in range(W.shape[1]):
            W_min = W.copy()
            W_pls = W.copy()
            
            W_min[r, c] -= eps
            W_pls[r, c] += eps
            
            y_hat_pls = forward(x, W_pls, b)
            y_hat_min = forward(x, W_min, b)
            
            l_pls = loss(y, y_hat_pls)
            l_min = loss(y, y_hat_min)

            del_W[r, c] = (l_pls - l_min) / (eps * 2)
            
    # Biases
    del_b = np.zeros_like(b)
    for c in range(b.shape[1]):
        b_min = b.copy()
        b_pls = b.copy()
            
        b_min[0, c] -= eps
        b_pls[0, c] += eps
            
        y_hat_pls = forward(x, W, b_pls)
        y_hat_min = forward(x, W, b_min)
            
        l_pls = loss(y, y_hat_pls)
        l_min = loss(y, y_hat_min)

        del_b[0, c] = (l_pls - l_min) / (eps * 2)
    
    return del_W, del_b

In [None]:
N_in = 10
N_out = 1
batch_size = 100

for i in range(10):
    x = np.random.rand(batch_size, N_in)
    W = np.random.randn(N_in, N_out)
    b = np.random.randn(1, N_out)
    y = np.random.randint(low=0, high=2, size=[batch_size, N_out])

    dW, db = backward(x, y, W, b)
    ngW, ngb = numerical_gradient(x, y, W, b)

    # print(np.max(np.abs(ngW-dW)))
    assert np.allclose(ngW, dW)
    assert np.allclose(ngb, db)

print('Gradient tests: OK')
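
The check above uses a central difference, `(l_pls - l_min) / (eps * 2)`, and not a one-sided one. A minimal sketch on a toy scalar function (chosen only because its derivative is known exactly) shows why: the central version is far more accurate for the same `eps`.

```python
def f(w):
    return w ** 3          # toy function; its exact derivative is 3 * w**2

w0 = 2.0
eps = 1e-4                 # same step size as in numerical_gradient

exact = 3.0 * w0 ** 2
forward_diff = (f(w0 + eps) - f(w0)) / eps
central_diff = (f(w0 + eps) - f(w0 - eps)) / (2.0 * eps)

err_forward = abs(forward_diff - exact)    # error shrinks like eps
err_central = abs(central_diff - exact)    # error shrinks like eps**2
print(err_forward, err_central)
```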

# Solve AND-Gate

This is the simplest possible problem: let's learn the AND function. The symbol below denotes an AND gate in electronics.

<img src="assets/and_gate.png">

Function we are trying to learn:

| A | B | Z (output) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |

* A - first input
* B - second input
* Z - desired output
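
AND is linearly separable, so a single sigmoid neuron can represent it. As a sanity check before training, here is a sketch with hand-picked parameters (`W_hand` and `b_hand` are illustrative values, not the ones training will find) that already reproduce the truth table.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])

# Hand-picked: the preactivation is positive only when both inputs are 1
W_hand = np.array([[10.0],
                   [10.0]])
b_hand = np.array([[-15.0]])

y_hat = sigmoid(x @ W_hand + b_hand)              # preactivations: -15, -5, -5, 5
predictions = (y_hat > 0.5).astype(int).ravel()   # threshold at 0.5
print(predictions)
```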

Dataset:

In [None]:
# training examples   A    B
x_train = np.array([[0.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 0.0],
                    [1.0, 1.0]])

# desired output      Z
y_train = np.array([[0.0],
                    [0.0],
                    [0.0],
                    [1.0]])

Initialize neural net

In [None]:
# Hyperparams
nb_epochs = 2000
lr = 1

# Initialize
np.random.seed(0)  # for reproducibility
n_inputs = x_train.shape[1]
n_outputs = y_train.shape[1]
W = np.random.normal(scale=n_inputs**-.5, size=[n_inputs, n_outputs])  # Xavier init
b = np.zeros(shape=[1, n_outputs])

Before training

In [None]:
outputs = forward(x_train, W, b)
print(outputs.round(2))

Main train loop

In [None]:
# Accumulate statistics during training (for plotting)
trace_loss = []                                   # for plotting

for e in range(nb_epochs):
    
    # Forward
    y_hat = forward(x_train, W, b)
    
    # Backprop
    dW, db = backward(x_train, y_train, W, b)    # with 4x training examples don't bother with mini-batches
    W += -lr * dW
    b += -lr * db
    
    # Train loss
    loss_train = loss(y_train, y_hat)                   # calculate loss
    trace_loss.append(loss_train)                 # save for plotting
    
    if e % (nb_epochs / 10) == 0:
        print('loss {0}'.format(loss_train.round(4)))

In [None]:
outputs = forward(x_train, W, b)
print(outputs.round(2))

Plot loss

In [None]:
plt.plot(trace_loss)
plt.title('Loss')
plt.show()


# Solve College Admissions

**Dataset**

We will use graduate school admissions data ([https://stats.idre.ucla.edu/stat/data/binary.csv](https://stats.idre.ucla.edu/stat/data/binary.csv)). Each row is one student. The columns are:
* admit - whether the student was admitted; this is the target we try to predict
* gre - student's GRE score
* gpa - student's GPA
* rank - prestige of the undergraduate school; 1 is highest, 4 is lowest

Extra Imports

In [None]:
import pandas as pd

Load data with pandas

In [None]:
df = pd.read_csv('college_admissions.csv')

Show the first few rows. The first column is the index, added automatically by pandas.

In [None]:
df.head()

Show some more information about dataset.

In [None]:
df.info()

Plot data, each rank separately

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=[8,6])
axes = axes.flatten()
for i, rank in enumerate([1,2,3,4]):
    # pick not-admitted students with given rank
    tmp = df.loc[(df['rank']==rank) & (df['admit']==0)]
    axes[i].scatter(tmp['gpa'], tmp['gre'], color='red', marker='.', label='rejected')
    # pick admitted students with given rank
    tmp = df.loc[(df['rank']==rank) & (df['admit']==1)]
    axes[i].scatter(tmp['gpa'], tmp['gre'], color='green', marker='.', label='admitted')
    axes[i].set_title('Rank '+str(rank))
    axes[i].legend()
fig.tight_layout()

And plot scatter matrix, just for fun

In [None]:
cmap = {1: 'red', 2:'green', 3:'blue', 4:'black'}
colors = df['rank'].apply(lambda cc:cmap[cc])
pd.plotting.scatter_matrix(df[['gre', 'gpa']], c=colors, figsize=[8,6]);

#### Preprocess

The code below does the following:
* converts the _rank_ column into one-hot encoded features
* normalizes the _gre_ and _gpa_ columns to zero mean and unit standard deviation
* splits off 10% of the data as a test set
* splits the data into input features (gre, gpa, one-hot rank) and targets (admit)
* converts everything to numpy arrays
* asserts that the shapes are as expected

In [None]:
# Create dummies (dtype=float keeps the feature matrix numeric; newer pandas defaults to bool)
temp = pd.get_dummies(df['rank'], prefix='rank', dtype=float)
data = pd.concat([df, temp], axis=1)
data.drop(columns='rank', inplace=True)

# Normalize
for col in ['gre', 'gpa']:
    mean, std = data[col].mean(), data[col].std()
    # data.loc[:, col] = (data[col]-mean) / std
    data[col] = (data[col]-mean) / std

# Split off random 10% of the data for testing (90% is kept for training)
np.random.seed(0)
sample = np.random.choice(data.index, size=int(len(data)*0.9), replace=False)
data, test_data = data.iloc[sample], data.drop(sample)

# Split into features and targets
features_train = data.drop('admit', axis=1)
targets_train =  data['admit']
features_test = test_data.drop('admit', axis=1)
targets_test = test_data['admit']

# Convert to numpy
x_train = features_train.values            # features train set (numpy)
y_train = targets_train.values[:,None]     # targets train set (numpy)
x_test = features_test.values              # features test set (numpy)
y_test = targets_test.values[:,None]       # targets test set (numpy)

# Assert shapes came right way around
assert x_train.shape == (360, 6)
assert y_train.shape == (360, 1)
assert x_test.shape == (40, 6)
assert y_test.shape == (40, 1)
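
To see what the one-hot and normalization steps do to the data, here is a sketch of the same two steps in plain NumPy on four made-up rows (the values are not taken from the real dataset). The helper name `standardize` is introduced only for this example.

```python
import numpy as np

# Made-up toy values, not taken from the real dataset
gre  = np.array([400.0, 600.0, 800.0, 500.0])
gpa  = np.array([3.0, 3.5, 4.0, 2.5])
rank = np.array([1, 2, 3, 2])

def standardize(col):
    # ddof=1 matches the default of pandas' Series.std()
    return (col - col.mean()) / col.std(ddof=1)

# One-hot: one column per distinct rank, 1.0 where the row has that rank
ranks_present = np.unique(rank)                                    # sorted: [1, 2, 3]
one_hot = (rank[:, None] == ranks_present[None, :]).astype(float)  # shape (4, 3)

features = np.column_stack([standardize(gre), standardize(gpa), one_hot])
print(features.shape)
print(one_hot)
```

Each row of `one_hot` contains exactly one 1.0, and the two standardized columns have zero mean and unit standard deviation.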

#### Train Classifier

Initialize neural net

In [None]:
np.random.seed(0)  # for reproducibility

n_inputs = x_train.shape[1]
n_outputs = y_train.shape[1]

W = np.random.normal(scale=n_inputs**-.5, size=[n_inputs, n_outputs])  # Xavier init
b = np.zeros(shape=[1, n_outputs])

Hyperparameters

In [None]:
nb_epochs = 2000
lr = 0.01

Main train loop

In [None]:
# Accumulate statistics during training (for plotting)
trace_loss_train = []
trace_loss_test = []
trace_acc_test = []

for e in range(nb_epochs):
    
    # Forward
    y_hat = forward(x_train, W, b)
    
    # Backprop (this re-computes the forward pass unnecessarily, we do it properly later)
    dW, db = backward(x_train, y_train, W, b)
    W += -lr * dW
    b += -lr * db
    
    # Train loss
    loss_train = loss(y_train, y_hat)
    trace_loss_train.append(loss_train)        
    
    # Test loss
    y_hat_test = forward(x_test, W, b)
    loss_test = loss(y_test, y_hat_test)
    trace_loss_test.append(loss_test)
    
    # Predictions and Accuracy
    predictions = y_hat_test > 0.5               # threshold probabilities at 0.5
    acc_test = np.mean(predictions == y_test)
    trace_acc_test.append(acc_test)

    if e % (nb_epochs / 10) == 0:
        print('loss {0}, tacc {1:.3f}'.format(loss_train, acc_test))

Plot learning curve

In [None]:
fig, [ax1, ax2] = plt.subplots(nrows=1, ncols=2, figsize=[12,6])
ax1.plot(trace_loss_train, label='train loss')
ax1.plot(trace_loss_test, label='test loss')
ax1.legend(loc='right')
ax1.grid()
ax2.plot(trace_acc_test, color='darkred', label='test accuracy')
ax2.legend()
plt.show()

__Quick Regression Test__

In [None]:
correct_result = np.array([0.15224777536275844,
                           0.13015955315177377,
                           0.11435294270610373,
                           0.10585677810621827,
                           0.10191394554520483,
                           0.1000000143239566,
                           0.09898677097344712,
                           0.0984065319217976,
                           0.0980521765593448,
                           0.09782432184510809])
assert np.allclose(trace_loss_train[::200], correct_result)   # np.alltrue was removed in NumPy 2.0