<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objective" data-toc-modified-id="Objective-1">Objective</a></span></li><li><span><a href="#The-Chain-Rule" data-toc-modified-id="The-Chain-Rule-2">The Chain Rule</a></span></li><li><span><a href="#(Binary)-Cross-Entropy-Loss" data-toc-modified-id="(Binary)-Cross-Entropy-Loss-3">(Binary) Cross Entropy Loss</a></span></li><li><span><a href="#Activations" data-toc-modified-id="Activations-4">Activations</a></span></li><li><span><a href="#Linear-Layer" data-toc-modified-id="Linear-Layer-5">Linear Layer</a></span></li><li><span><a href="#Putting-It-All-Together" data-toc-modified-id="Putting-It-All-Together-6">Putting It All Together</a></span></li><li><span><a href="#Our-Evaluation-Metric" data-toc-modified-id="Our-Evaluation-Metric-7">Our Evaluation Metric</a></span></li><li><span><a href="#Trainer" data-toc-modified-id="Trainer-8">Trainer</a></span></li><li><span><a href="#Pre-process-Data" data-toc-modified-id="Pre-process-Data-9">Pre-process Data</a></span></li><li><span><a href="#Datasets" data-toc-modified-id="Datasets-10">Datasets</a></span></li><li><span><a href="#Train" data-toc-modified-id="Train-11">Train</a></span></li></ul></div>

In [11]:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

from torch.utils.data import Dataset
from torch.utils.data import DataLoader

In [2]:
# Set seed for reproducibility
seed = 9
np.random.seed(seed)

# Objective 

The goal of this notebook is to build a logistic regression model as a neural network. The network will consist of a single linear layer followed be a sigmoid activation and binary cross entropy as the loss function. When we're done we'll see that we're able to identify malignant tumors on sklearn's breast cancer dataset with 90% accuracy in just 10 epochs. 

# The Chain Rule

Before we build the network, we need to understand how to update the network's weights. We can view our network as a composition of three functions

$$x \to \text{BCE} \circ \text{Sigmoid} \circ \text{Linear}(x)$$

While the loss function is not usually viewed as a layer of the network, treating it as the final layer makes computing the gradients simpler. Let's denote the output of the $i$-th layer by $x_i$ so that

\begin{align}
    x_1 &= \text{Linear}(x)     \\
    x_2 &= \text{Sigmoid}(x_1)  \\
    x_3 &= \text{BCE}(x_2)
\end{align}

Since the last output we have is $x_3 = \text{BCE}(x_2)$, the first gradient to compute is the gradient of $\text{BCE}$ with respect to $x_2$

$$\frac{\partial \text{BCE}}{\partial x_2} = \frac{\partial \text{BCE}}{\partial x_2}(x_2) $$

Next we have $x_2 = \text{Sigmoid}(x_1)$, so the chain rule gives us

$$\frac{\partial \text{BCE}}{\partial x_1} = \frac{\partial \text{BCE}}{\partial x_2} \times \frac{\partial \text{Sigmoid}}{\partial x_1}(x_1)$$

Last we have $x_1 = \text{Linear}(x)$, so the final gradient is

$$\frac{\partial \text{BCE}}{\partial x} = \frac{\partial \text{BCE}}{\partial x_1} \times \frac{\partial \text{Linear}}{\partial x_1}(x)$$

Notice something? The first gradient we compute--the gradient with respect to the Sigmoid output--is used to compute the next gradient--the gradient with respect to the Linear output. To compute all of the gradients, we need to start at the last layer and successively pass back the gradient to the previous layer. That's why it's called backpropagation and not "just the chain rule". It really is helpful to envision passing the gradients backwards through the network like a baton.

As a final note, so far we've been treating the input as a single variable, but most of the time the input will have more than one dimension. Don't worry, computing the gradients in the multi-variate case is more or less the same (it involves something called the Jacobian--but we'll pretend we didn't hear that).

--------

# (Binary) Cross Entropy Loss

The only difference between binary cross entropy and cross entropy is that cross entropy requires an output probability for every single class, whereas binary cross entropy requires just a single output probability--the probability of the positive class. This is slightly annoying, because it means their derivatives have different functional forms.

Binary cross entropy penalizes predictions by the logarithm of their probability $\hat{y}$ (which we called $x_2$ above):

$$\text{BCE}(y, \hat{y}) = -y \ln(\hat{y}) + (1 - y)\ln(1 - \hat{y})$$

After simplifying, you'll find its derivative is

$$\frac{\partial \text{BCE}}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - y)}$$

Notice that we'll need to cache the labels and predictions to compute the backward pass. Since this is the first gradient we compute, there's no previous gradient to take in. However, the next gradient (belonging to the Sigmoid layer) will receive the above BCE gradient.  

In [3]:
class BinaryCrossEntropy:
    """Container for the forward and backward pass of BCE."""
    
    def forward(self, y, y_hat):
        """Return binary cross entropy given targets and predictions."""
        self.y, self.y_hat, = y, y_hat
        return np.where(y==1, -np.log(y_hat), -np.log(1 - y_hat)).mean()
    
    def backward(self):
        """Backpropagate the gradient with respect to soft predictions."""
        return (self.y_hat - y) / (self.y_hat * (1 - self.y_hat))

-----

# Activations

The easiest components to handle are the activation functions. Our activation is Sigmoid, which you'll often see defined as one of 

$$\text{Sigmoid}(x) = \frac{1}{1 + \text{exp}(-x)} \quad \text{or} \quad \frac{\text{exp}(x)}{1 + \text{exp}(x)}$$

It turns out that we need both versions to implement a numerically stable version of Sigmoid. Notice how when $x$ is very negative, $\text{exp}(-x)$ is incredibly large, and when $x$ is very positive $\text{exp}(x)$ is incredibly large--in both cases, too large to store in memory. The easy fix is to use the first version when $x > 0$ and the second when $x < 0$.

After simplifying, you'll find the derivative of $\text{Sigmoid}$ is

$$\frac{\partial \text{Sigmoid}}{\partial x} = \text{Sigmoid}(x)(1 - \text{Sigmoid}(x))$$

You may have noticed something peculiar when comparing this derivative to that of BCE. Since we denoted $\hat{y}$ to be the output of the Sigmoid layer, the derivative of sigmoid is exactly the same as the denominator in the derivative of BCE, so the two terms will cancel when multiplied (which is exactly what the chain rule tells us will happen). This is the reason you see functions like PyTorch's `binary_cross_entropy_with_logits`, which skips the Sigmoid activation and computes BCE directly from the inputs to the Sigmoid layer (which people call logits). It's more efficient computationally to do both in one go. However, since we're just trying to get our hands dirty to understand how networks work, we won't worry about optimizing things.

In [4]:
#TODO: numerically stable sigmoid
class Sigmoid:
    """Container for the forward and backward pass of sigmoid."""
    
    def forward(self, x):
        """Pass a mini-batch through a sigmoid layer."""
        return np.where(x > 0, 1 / (1 + np.exp(-x)), np.exp(x) / (1 + np.exp(x)))
        
    def backward(self, x, grad):
        """Backpropagate the gradient given the previous gradient."""
        return grad * self.forward(x) * (1 - self.forward(x))

----------

# Linear Layer

The last component we need to implement is the linear layer, which contains weights and biases. Denoting the output of this layer by $z_i$, we have

$$z_i = \text{Linear}(x) = xw + b$$

If $x$ is a mini-batch of shape $(bs, n_{inp})$ then $w$ has shape $(n_{inp}, 1)$ and $b$ has shape $(1,)$, with addition being done via broadcasting. When computing the gradient, we'll image that we just have a batch size of one 

$$x = [x_1, \dots, x_{n_{inp}}]$$ 

There are two gradients to compute this time around, one with respect to the weights and the other with respect to the bias. To make life easier, let's write things out in terms of coordinates:

$$z_i = \sum_{k=1}^{n_{inp}} x_k w_{ki} + b$$

Then we get

$$\frac{\partial \text{BCE}}{\partial w_{ki}} = \frac{\partial \text{BCE}}{\partial z_i} \times \frac{\partial z_i}
{\partial w_{ki}} = x_k \frac{\partial \text{BCE}}{\partial z_i}$$

and

$$\frac{\partial \text{BCE}}{\partial b} = \frac{\partial \text{BCE}}{\partial z_i} \times \frac{\partial z_i}{\partial b} = \frac{\partial \text{BCE}}{\partial z_i}.$$

Remember, we'll already have the gradient of the loss with respect to the output $z_i$ of the linear layer stored in a variable $\text{grad}$ when it's time to compute the gradients with respect to the weights and biases. The nice thing is that $\text{grad}$ is exactly the gradient with respect to the $b$, so we only need to figure out how to write the gradient with respect to $w$ as a matrix product.

Whenever I have a hard time doing something like this, I just focus on getting the shapes right:

* $x$ has shape $(bs, n_{inp})$ 
* $\text{grad}$ has shape $(bs, 1)$
* $\text{grad}_w$ has shape $(n_{inp}, 1)$

The only way we can multiply $x$ and $\text{grad}$ and get something of shape $(n_{inp}, 1)$ is to resize $x$ to have shape $(bs, n_{inp}, 1)$ and $\text{grad}$ to have shape $(bs, 1, 1)$ so that ordinary matrix multiplication over the last two dimensions gives the shape $(n_{inp}, 1)$. The last thing to remember is that we average the gradients over the batch dimension to produce the final gradient updates.

Note that were there another linear layer we would also need to compute the gradient of the loss with respect to the inputs $x$ so that we could keep backpropagating the gradient, however, this isn't anymore complicated than what we've done so far.

In [18]:
class Linear:
    """Container for the forward and backward pass of a linear layer."""
    
    def __init__(self, n_inp, n_out):
        self.weights = np.random.uniform(-1 / np.sqrt(n_inp), 1 / np.sqrt(n_inp), (n_inp, n_out))
        self.bias = np.zeros(1)
        
    def forward(self, x):
        """Pass a mini-batch through a linear layer."""
        self.x = x
        return x @ self.weights + self.bias
    
    def backward(self, grad):
        """Compute the gradients of the weights and biases given previous gradient."""
        self.grad_w = (self.x[:,:,None] @ grad[:,None,:]).mean(axis=0)
        self.grad_b = grad.mean(axis=0)

---------

# Putting It All Together

Wohoo! It's finally time to string together all the work we've done so far into a single network.

In [24]:
class OneLayerBinaryClassifier:
    """Container for a single layer binary classifier."""
    
    def __init__(self, n_inp):
        """Initialise layers and loss criterion."""
        self.linear = Linear(n_inp, 1)
        self.out = Sigmoid()
        self.criterion = BinaryCrossEntropy()
            
    def forward(self, x):
        """Pass a mini-batch through the network."""
        return self.out.forward(self.linear.forward(x))
    
    def backward(self):
        """Backpropagate gradients through the network."""
        grad = self.criterion.backward()
        self.linear.backward(grad)

In [27]:
class SGD:
    """Container for optimizing a model via SGD."""
    
    def __init__(self, model, lr=1e-3):
        """Initialise model and learning rate."""
        self.model = model
        self.lr = lr
        
    def step(self):
        """Update weights and biases of linear layers."""
        #self.model.linear.weights -= self.lr * self.model.linear.grad_w
        self.model.linear.bias -= self.lr * self.model.linear.grad_b

----------

# Our Evaluation Metric

For simplicity, we'll just consider accuracy for the time being

In [8]:
def accuracy(y, y_hat):
    """Compute accuracy given soft binary predictions."""
    y_pred = y_hat > 0.5
    return (y_pred == y).mean()

---------

# Trainer

To make life easier, let's wrap all of the functionality we need to train a network into a single class.

In [21]:
class Trainer:
    """Container for training a feedforward neural net."""
    
    def __init__(self, model, optimizer, train_dl, val_dl):
        self.model = model
        self.optimizer = optimizer
        self.train_dl = train_dl
        self.val_dl = val_dl
        
    def _train(self):
        """Train for a single epoch and return the loss."""
        loss, n = 0, 0
        for x, y in self.train_dl:
            y_hat = self.model.forward(x)
            batch_loss = self.model.criterion.forward(y, y_hat)
            self.model.backward()
            self.optimizer.step()
            loss += len(y) * batch_loss
            n += len(y)
        return loss / n
            
    def train(self, n_epochs):
        """Train for multiple epochs."""
        for epoch in range(n_epochs):
            loss = self._train()
            self.model.step(self.lr)
            print(f"{epoch= :.2d} | {loss= :.3f}")
#              val_loss, val_metric = self.evaluate(self.val_dl)
#             print(f"{epoch= :2d} | {loss= :.3f} | {val_loss= :.3f} | {val_metric= :.3f}")
    
#     def evaluate(self, dl):
#         """Return loss and metric on validation or test set."""
#         loss, n, metric = 0, 0, 0
#         for x, y in dl:
#             y_hat = self.model.forward(dl)
#             batch_loss = self.model.criterion.forward(y, y_hat)
#             batch_metric = self.model.metric(y, y_hat)
#             loss += len(y) * batch_loss
#             metric_ += len(y) * batch_metric
#             n += len(y)
#         return loss / n, metric / n

---------

# Pre-process Data

We'll use sklearn's [breast cancer dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer) for our binary classification task.

In [12]:
# Load data
X, y = load_breast_cancer(return_X_y=True)
X.shape, y.shape

((569, 30), (569,))

In [13]:
# Train-test-split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=seed)
X_train.shape, X_val.shape

((455, 30), (114, 30))

In [14]:
# Normalize
mu, sigma = X_train.mean(), X_train.std()
X_train = (X_train - mu) / sigma
X_val = (X_val - mu) / sigma

---------

# Datasets

In [15]:
class TabularData(Dataset):
    """Container for tabular data."""
    
    def __init__(self, X, y):
        self.X = X
        self.y = y
        
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
    
    def __len__(self):
        return len(self.y)

In [16]:
# Load training and validation data
train_ds = TabularData(X_train, y_train)
val_ds = TabularData(X_val, y_val)

batch_size=50
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_dl = DataLoader(val_ds, batch_size=batch_size, shuffle=False)

--------

# Train

Now we're ready to put our model to the test.

In [28]:
# Input and final output dims
n_inp = X_train.shape[1]
n_classes = len(np.unique(y))

# Initialise model
model = OneLayerBinaryClassifier(n_inp)

# Initialise optimizer and trainer
metric = accuracy
optimizer = SGD(model, lr=0.1)
trainer = Trainer(model, optimizer, train_dl, val_dl)

In [29]:
trainer.train(10)

ValueError: non-broadcastable output operand with shape (1,) doesn't match the broadcast shape (569,)