# Chapter 16: Binary Logistic Regression

In [1]:
# Preface: Import the necessary packages:
import numpy as np
import matplotlib.pyplot as plt
import math
import nnfs
from nnfs.datasets import spiral_data
from timeit import timeit
from resources.classes import DenseLayer, ReLU, SoftMax, Loss, CategoricalCrossEntropy, SoftMaxCategoricalCrossEntropy, SGD, AdaGrad, RMSProp, Adam, DropoutLayer, Sigmoid, BinaryCrossEntropyLoss
import random

Now, let's talk about another kind of output layer: binary logistic regression. So far, we've only covered categorical classification, which always assumes exactly one of the classes is the correct answer, since all of the probabilities must sum to 1.

On the other hand, binary logistic regression is a type of output where each output neuron separately represents two classes: 0 for one class and 1 for the other. That means you can have a neuron that represents "cat vs dog" or "cat vs not cat", or anything of the sort. You can also have many of these output neurons. As a quick example, you could have one neuron that decides "car vs not car" and another that decides "indoors vs outdoors".

Binary logistic regression is a regression-type algorithm. It differs from what we've built so far in two ways: the output layer uses a sigmoid activation function instead of our softmax, and the loss is binary cross-entropy rather than categorical cross-entropy.
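
To make the contrast concrete, here is a minimal sketch that runs the same raw outputs through both activations. The `logits` values are made up for illustration, and the softmax and sigmoid are written inline rather than taken from `resources.classes`.

```python
import numpy as np

# Hypothetical raw outputs from three output neurons for one sample
logits = np.array([2.0, 1.0, -1.0])

# Softmax: the outputs compete with each other and always sum to 1
exp_values = np.exp(logits - np.max(logits))
softmax_out = exp_values / np.sum(exp_values)

# Sigmoid: each output is squashed on its own, so the sum is unconstrained
sigmoid_out = 1 / (1 + np.exp(-logits))

print("softmax:", softmax_out, "sum =", np.sum(softmax_out))
print("sigmoid:", sigmoid_out, "sum =", np.sum(sigmoid_out))
```

Each sigmoid output can be read as its own independent yes/no answer, which is why several of them can be "on" at once.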

## Section 1: The Sigmoid Activation Function

The sigmoid activation function is used with these regressors because it squashes any input (anywhere from -inf to +inf) into the range between 0 and 1. These bounds represent the two possible classes:
$$
\sigma(x) = \frac{1}{1+e^{-x}}
$$
To match our usual indexing, where $z_{i,j}$ is the input to output neuron j for sample i, we often rewrite this as:
$$
\sigma(z_{i,j}) = \frac{1}{1+e^{-z_{i,j}}}
$$ 

This function outputs exactly 0.5 at an input of 0, and it flattens out toward 0 as the input approaches -inf and toward 1 as the input approaches +inf, getting there exponentially fast.
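
As a quick sanity check of that shape, the sketch below evaluates the function at a few hand-picked inputs. The helper name `sigmoid` and the sample points are illustrative choices, not part of the chapter's classes.

```python
import numpy as np

def sigmoid(x):
    # Logistic function: maps any real input into the open interval (0, 1)
    return 1 / (1 + np.exp(-x))

# A few hand-picked inputs, from strongly negative to strongly positive
xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
outputs = sigmoid(xs)

for x, s in zip(xs, outputs):
    print(f"sigmoid({x:6.1f}) = {s:.5f}")
```

The middle value comes out as 0.5, and the values at either end sit very close to 0 and 1 without ever reaching them.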

I'll skip over all the calculus and just say directly that the derivative is:
$$
\frac{d\sigma_{i,j}}{dz_{i,j}} = \sigma_{i,j} \cdot (1 - \sigma_{i,j})
$$
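Since we skipped the calculus, one way to gain some confidence in this result is to compare it against a numerical derivative. This is only a sketch: the test points and the step size `h` are arbitrary choices, and `sigmoid` is a standalone helper defined just for this check.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Arbitrary test points
z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])

# Numerical derivative using a central difference
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

# Analytic derivative: sigma * (1 - sigma)
analytic = sigmoid(z) * (1 - sigmoid(z))

max_abs_diff = np.max(np.abs(numeric - analytic))
print("largest difference:", max_abs_diff)
```

The two agree up to floating-point noise, and the derivative peaks at 0.25 when the input is 0.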

We'll create a python class for this:

In [None]:
class ExampleSigmoid:
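    # Forward pass (named forward, not __init__, so it matches how the
    # training loop calls activation2.forward(...))
    def forward(self, inputs):
        # Save input and calculate output
        self.inputs = inputs
        self.output = 1 / (1 + np.exp(-inputs))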
        
    def backward(self, dvalues):
        # Calculate the derivative
        self.dinputs = dvalues * (1 - self.output) * self.output

Great, but now we need a loss function!

## Section 2: Binary Cross-Entropy Loss

Binary cross entropy loss is not all that different from categorical cross entropy loss. Instead of only calculating the -log on the target class, we'll sum the log-likelihoods of the correct and incorrect classes.
$$
L_{i,j} = -y_{i,j} \cdot log(\hat{y}_{i,j}) - (1 - y_{i,j}) \cdot log(1-\hat{y}_{i,j})
$$ 
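To get a feel for the numbers, the sketch below plugs a few made-up predictions into this formula. The function name `binary_cross_entropy` and the sample values are assumptions for illustration only, and no clipping is applied because none of the predictions is exactly 0 or 1.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    # Per-output loss: only one of the two terms is non-zero for a 0/1 target
    return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Targets and made-up predictions: confident-right, confident-wrong, and so on
y_true = np.array([1.0, 1.0, 0.0, 0.0])
y_pred = np.array([0.9, 0.1, 0.1, 0.9])

losses = binary_cross_entropy(y_true, y_pred)
for t, p, l in zip(y_true, y_pred, losses):
    print(f"target={t:.0f}, prediction={p:.1f}, loss={l:.4f}")
```

Confident correct predictions give a small loss (about 0.105 here), while confident wrong predictions are punished much harder (about 2.303).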

However, we must remember that a model can contain multiple binary outputs, so the loss calculated for a single sample is going to be a vector of losses containing one value for each output. Therefore, we need to calculate a mean from these losses:
$$
L_{i} = \frac{1}{J} \sum_{j} L_{i,j}
$$
Where i is the current sample, j is the current output of sample i, and J is the total number of outputs. We can perform this whole operation using numpy, where sample_losses starts out holding the per-output losses $L_{i,j}$ (one row per sample) and ends up holding one $L_{i}$ per sample:
```
sample_losses = np.mean(sample_losses, axis=-1)
```
Here, the "axis=-1" parameter means to calculate the mean along the last dimension, which is the outputs of each sample.
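
Here is a small sketch of what that call does to the shape of the data. The array `output_losses` is made up: two samples with three binary outputs each.

```python
import numpy as np

# Hypothetical per-output losses: 2 samples (rows) x 3 outputs (columns)
output_losses = np.array([[0.1, 0.2, 0.3],
                          [1.0, 2.0, 3.0]])

# Average across the outputs of each sample (the last axis)
sample_losses = np.mean(output_losses, axis=-1)

print(output_losses.shape)  # (2, 3)
print(sample_losses.shape)  # (2,)
print(sample_losses)
```

Each row collapses into a single number, so we end up with exactly one loss per sample.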

## Section 3: Binary Cross-Entropy Loss Derivative

Here is the fun math-y part I'm sure we've all been waiting for! Let's break this down in two steps, with the equation for $L_{i,j}$ first.

We'll first calculate the partial derivative of the per-output loss with respect to the predicted value:
$$
\frac{\partial L_{i,j}}{\partial \hat{y}_{i,j}} = \frac{\partial}{\partial \hat{y}_{i,j}}
[-y_{i,j} \cdot log(\hat{y}_{i,j}) - (1 - y_{i,j}) \cdot log(1-\hat{y}_{i,j})]
$$ 
After a bunch of calculus, that simplifies down to:
$$
-\frac{y_{i,j}}{\hat{y}_{i,j}} + \frac{1-y_{i,j}}{1-\hat{y}_{i,j}}
$$
And then, after factoring out the negative sign, down to:
$$
-(\frac{y_{i,j}}{\hat{y}_{i,j}} - \frac{1-y_{i,j}}{1-\hat{y}_{i,j}})
$$

Now, we need to find the partial derivative of the sample loss with respect to each output loss!
$$
\frac{\partial L_{i}}{\partial L_{i,j}} = \frac{\partial}{\partial L_{i,j}} [\frac{1}{J} \sum_{j} L_{i,j}]
$$    
Which ultimately simplifies down to:
$$
\frac{1}{J}
$$

Therefore, by the chain rule, our equation for the partial derivative of a sample loss with respect to a single predicted output becomes:
$$
\frac{\partial L_{i}}{\partial \hat{y}_{i,j}} = \frac{\partial L_{i}}{\partial L_{i,j}} \cdot \frac{\partial L_{i,j}}{\partial \hat{y}_{i,j}} = -\frac{1}{J} \cdot (\frac{y_{i,j}}{\hat{y}_{i,j}} - \frac{1-y_{i,j}}{1-\hat{y}_{i,j}})
$$
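Before implementing this, the result can be checked numerically for a single sample. The sketch below compares the formula against a central-difference gradient; the targets, predictions, step size `h`, and the helper `sample_loss` are all made up for this check.

```python
import numpy as np

def sample_loss(y_true, y_pred):
    # Mean of the per-output binary cross-entropy losses for one sample
    per_output = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return np.mean(per_output)

# One hypothetical sample with J = 3 binary outputs
y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.8, 0.3, 0.4])
J = len(y_pred)

# Analytic gradient from the equation above
analytic = -(y_true / y_pred - (1 - y_true) / (1 - y_pred)) / J

# Numerical gradient: nudge one prediction at a time
h = 1e-6
numeric = np.zeros(J)
for j in range(J):
    step = np.zeros(J)
    step[j] = h
    numeric[j] = (sample_loss(y_true, y_pred + step)
                  - sample_loss(y_true, y_pred - step)) / (2 * h)

max_abs_diff = np.max(np.abs(numeric - analytic))
print("analytic:", analytic)
print("numeric: ", numeric)
```

The two gradients match to several decimal places, which suggests the derivation holds up.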

The only thing left to do is implement this in code. As usual, we'll create an example class here, with the full one implemented in our classes.py.

In [None]:
class ExampleBinaryCrossEntropyLoss(Loss):
    # Forward pass
    def forward(self, y_pred, y_true):
        # Clip data to prevent log(0); clip both sides so the mean isn't skewed toward either class
        y_pred_clipped = np.clip(y_pred, 1e-7, 1-1e-7)
        
        # Calculate sample-wise losses
        sample_losses = -(y_true * np.log(y_pred_clipped) + (1 - y_true) * np.log(1 - y_pred_clipped))
        sample_losses = np.mean(sample_losses, axis=-1)
        
        return sample_losses
    
    # Backward pass
    def backward(self, dvalues, y_true):
        # Sample size
        samples = len(dvalues)
        # Number of outputs per sample
        outputs = len(dvalues[0])
        
        # Clip data to prevent division by 0
        clipped_dvalues = np.clip(dvalues, 1e-7, 1-1e-7)
        
        # Calculate gradient
        self.dinputs = -(y_true / clipped_dvalues - (1-y_true) / (1-clipped_dvalues)) / outputs
        # Normalize gradient
        self.dinputs /= samples
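
To see why the clipping step matters, here is a standalone sketch (it does not use the `Loss` base class) with two made-up predictions that are confidently and completely wrong. The warning suppression via `np.errstate` is only there to keep the output tidy.

```python
import numpy as np

# Two samples, one output each; both predictions are as wrong as possible
y_true = np.array([[1.0], [0.0]])
y_pred = np.array([[0.0], [1.0]])

# Without clipping, log(0) produces an infinite loss
with np.errstate(divide="ignore", invalid="ignore"):
    unclipped = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# With clipping, the loss stays large but finite
y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
clipped = -(y_true * np.log(y_pred_clipped)
            + (1 - y_true) * np.log(1 - y_pred_clipped))

print("unclipped:", unclipped.ravel())
print("clipped:  ", clipped.ravel())
```

An infinite loss would poison the mean and every gradient after it, so capping the predictions just inside 0 and 1 keeps training numerically stable.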

With that done, we can now spin up a model with this new binary cross entropy loss class! 

In [8]:
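# Create some training data using the spiral_data function
X, y = spiral_data(samples=100, classes=2)

# Reshape the labels into a column of shape (samples, 1): each sample now has
# one binary target (0 or 1) per output neuron instead of a sparse class index
y = y.reshape(-1, 1)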

# Create dense layer with 2 input features and 64 output features,
# with L2 regularization on the weights and biases
dense1 = DenseLayer(2, 64, weightl2=5e-4, biasl2=5e-4)

# Use a relu activation
activation1 = ReLU()

# Create a dense layer for our output with 64 as an input and 1 as an output
# (a single neuron, since we only have one binary output)
dense2 = DenseLayer(64, 1)

# Use a sigmoid activation for our output
activation2 = Sigmoid()

# Set up the loss function
loss_function = BinaryCrossEntropyLoss()

# Initialize optimizer as Adam with a decay
optimizer = Adam(decay=5e-7)

# Create the loop that trains our model in epochs
for epoch in range(10000):
    # Perform the forward pass, as shown previously
    dense1.forward(X)
    activation1.forward(dense1.output)
    dense2.forward(activation1.output)
    activation2.forward(dense2.output)
    # calculate dataLoss, regLoss, and then add for total loss
    dataLoss = loss_function.calculate(activation2.output, y)
    regLoss = loss_function.regularizationLoss(dense1) + loss_function.regularizationLoss(dense2)
    loss = dataLoss + regLoss
    
    # Calculate the accuracy
    predictions = (activation2.output > 0.5) * 1
    accuracy = np.mean(predictions==y)
    
    if not epoch % 100:
        print(f"epoch: {epoch}, accuracy: {accuracy: .3f}, loss: {loss: .3f}, dLoss: {dataLoss}, rLoss: {regLoss}, lr: {optimizer.lr_curr}")
        
    # Perform the backward pass
    loss_function.backward(activation2.output, y)
    activation2.backward(loss_function.dinputs)
    dense2.backward(activation2.dinputs)
    activation1.backward(dense2.dinputs)
    dense1.backward(activation1.dinputs)
    
    # Use the optimizer and update the weights and biases
    optimizer.preUpdateParams()
    optimizer.updateParams(dense1)
    optimizer.updateParams(dense2)
    optimizer.postUpdateParams()

epoch: 0, accuracy:  0.455, loss:  0.693, dLoss: 0.6931611483003809, rLoss: 5.688446779362844e-06, lr: 0.001
epoch: 100, accuracy:  0.620, loss:  0.673, dLoss: 0.6722421747805468, rLoss: 0.0008714248605187014, lr: 0.0009999505024501287
epoch: 200, accuracy:  0.625, loss:  0.670, dLoss: 0.6685268864345999, rLoss: 0.001215202928005334, lr: 0.0009999005098992651
epoch: 300, accuracy:  0.625, loss:  0.667, dLoss: 0.6650754018615368, rLoss: 0.0015082201361740796, lr: 0.000999850522346909
epoch: 400, accuracy:  0.625, loss:  0.663, dLoss: 0.6609270207839064, rLoss: 0.0019611363776167105, lr: 0.0009998005397923115
epoch: 500, accuracy:  0.615, loss:  0.659, dLoss: 0.6561376435166179, rLoss: 0.0025638725897037967, lr: 0.0009997505622347225
epoch: 600, accuracy:  0.615, loss:  0.651, dLoss: 0.6476996340254764, rLoss: 0.0037124576740371067, lr: 0.0009997005896733929
epoch: 700, accuracy:  0.620, loss:  0.640, dLoss: 0.6341509154382148, rLoss: 0.0059798251136191916, lr: 0.0009996506221075735
...
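
The accuracy calculation in the training loop thresholds each sigmoid output at 0.5 and compares it to the label. The sketch below walks through that on four made-up outputs, and also shows why the labels were reshaped into a column first. All values here are hypothetical.

```python
import numpy as np

# Hypothetical sigmoid outputs for 4 samples, shape (4, 1)
outputs = np.array([[0.92], [0.35], [0.61], [0.08]])

# Labels as spiral_data would return them: a flat array of class indices
y_flat = np.array([1, 0, 0, 0])
y = y_flat.reshape(-1, 1)  # shape (4, 1), matching the model's output

# Threshold at 0.5: True/False times 1 gives 1/0
predictions = (outputs > 0.5) * 1
accuracy = np.mean(predictions == y)
print("accuracy:", accuracy)

# Without the reshape, broadcasting compares every prediction to every label
broadcast_shape = (predictions == y_flat).shape
print("shape without reshape:", broadcast_shape)
```

Three of the four predictions match, so the accuracy is 0.75. Comparing against the flat labels would instead produce a 4x4 grid of comparisons and a meaningless accuracy.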

That's our model! Naturally, we can tweak and tune some things here, but our model has overall performed relatively well! In the next chapter, we'll build on this and move on to regression!

### Anyways, that's it for this chapter! Thanks for following along with my annotations of *Neural Networks from Scratch* by Kinsley and Kukieła!