# Logistic Regression

This notebook focuses on answering the following questions:

- Why does the L2 norm not work for logistic regression? Can you show what it looks like during training?
- What types of loss functions are used for logistic regression?
- How does multiclass logistic regression work?

## Algorithm

The algorithm for logistic regression is simple and closely resembles that of linear regression. The model is given by:
$$\hat{\underline{y}} = \sigma\left(\underline{\underline{X}}\underline{w}\right)$$

Where $\underline{\underline{X}}$ and $\underline{w}$ are created such that the bias term is captured, and $\sigma$ is the sigmoid function, defined as:
$$\sigma(z) = \frac{1}{1+e^{-z}}$$

Applying the sigmoid function to a vector means applying it pointwise, i.e. to each element separately.
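
As a minimal sketch of the model above, the prediction can be written in plain Python. The function name `predict_proba`, the toy data, and the weights are illustrative assumptions, and the bias is assumed to be captured by a leading column of ones in `X`.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)) for a single number."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(X, w):
    """Return sigma(X w), applying the sigmoid pointwise to each row's dot product."""
    return [sigmoid(sum(x_j * w_j for x_j, w_j in zip(row, w))) for row in X]

# Hypothetical toy data: three samples, one feature, plus a bias column of ones.
X = [[1.0, -2.0],
     [1.0,  0.0],
     [1.0,  2.0]]
w = [0.0, 1.0]  # illustrative weights: bias 0, slope 1

y_hat = predict_proba(X, w)
print(y_hat)
```

Each entry of `y_hat` lies strictly between 0 and 1, and a sample with $\underline{x}^T\underline{w} = 0$ maps to exactly 0.5.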

Note that logistic regression is still a linear model: the weights enter the prediction only through the linear combination $\underline{\underline{X}}\underline{w}$. This means that the optimal weights $\underline{w}^*$ multiply the data linearly. The sigmoid function is applied afterwards and does not change how the weights interact with each other or with the data.

This is in contrast to a neural network, for example.

In a simple two-layer neural network with sigmoid activation functions, you have the following expressions:
$$\underline{\underline{a}}^{(1)} = \underline{\underline{X}}\cdot\underline{\underline{W}}^{(1)}$$
$$\underline{\underline{z}}^{(1)} = \sigma\left(\underline{\underline{a}}^{(1)}\right)$$
$$\underline{{a}}^{(2)} = \underline{\underline{z}}^{(1)}\cdot\underline{W}^{(2)}$$
$$\hat{\underline{y}} = \underline{{z}}^{(2)} = \sigma\left(\underline{{a}}^{(2)}\right)$$

This means that our final predictor is:

$$\hat{\underline{y}} = \sigma\left(
\sigma\left(
\underline{\underline{X}}\cdot\underline{\underline{W}}^{(1)}
\right)\cdot\underline{W}^{(2)}
\right)$$

We can see very clearly that the interaction between the weights is no longer linear. If we now ignore the activation function on the first layer, we arrive at logistic regression again:
$$\hat{\underline{y}} = \sigma\left(
\left(
\underline{\underline{X}}\cdot\underline{\underline{W}}^{(1)}
\right)\cdot\underline{W}^{(2)}
\right) = \sigma\left(\underline{\underline{X}}\cdot \underline{\hat{w}}\right)$$

where $\underline{\hat{w}} = \underline{\underline{W}}^{(1)}\underline{{W}}^{(2)}$.
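
The collapse of two linear layers into a single weight vector can be checked numerically. The sketch below uses made-up matrices and a small hypothetical `matmul` helper, so it illustrates the identity rather than any particular network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def apply_sigmoid(A):
    """Apply the sigmoid pointwise to every entry of a matrix."""
    return [[sigmoid(a) for a in row] for row in A]

# Hypothetical values: 2 samples, 2 features, 3 hidden units, 1 output.
X = [[1.0, 2.0], [0.5, -1.0]]
W1 = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.5]]
W2 = [[1.0], [2.0], [-1.0]]

# Two linear layers in sequence, with a sigmoid only at the output.
no_hidden_activation = apply_sigmoid(matmul(matmul(X, W1), W2))

# A single weight vector w_hat = W1 W2 applied directly to X.
w_hat = matmul(W1, W2)
collapsed = apply_sigmoid(matmul(X, w_hat))

# The full network keeps the sigmoid on the hidden layer as well.
with_hidden_activation = apply_sigmoid(matmul(apply_sigmoid(matmul(X, W1)), W2))

print(no_hidden_activation, collapsed, with_hidden_activation)
```

The first two results agree up to floating-point rounding, while the third differs because the hidden sigmoid prevents $\underline{\underline{W}}^{(1)}$ and $\underline{W}^{(2)}$ from being merged into one matrix product.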

## Implementation

In [4]:
import math

class LogisticRegression:
    def __init__(self, loss):
        self.loss = loss
    
    def fit(self, X, y):
        # add code to enable multiclass behaviour!
        # add optimization!
        pass
    
    def predict(self, X):
        pass
    
    
def l2_loss(pred, act):
    loss = 0
    for pred_, act_ in zip(pred, act):
        loss += (pred_ - act_)**2
    return loss / len(pred)

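def logistic_loss(pred, act):
    # requires actual data to be either 0 or 1!
    # predictions must lie strictly between 0 and 1, otherwise log() fails
    # this is also known as binary cross entropy
    loss = 0
    for pred_, act_ in zip(pred, act):
        # use the per-sample values, not the whole lists
        loss += act_*math.log(pred_) + (1-act_)*math.log(1-pred_)
    # negate so that the loss is non-negative, then average over samples
    return -loss / len(pred)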
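
To see how the two losses behave on concrete numbers, here is a small self-contained sketch. The function names, the toy predictions, and the clipping constant `eps` are assumptions added for illustration and are not part of the notebook's implementation.

```python
import math

def mean_squared_error(pred, act):
    """Mean of squared differences (the L2 loss)."""
    return sum((p - a) ** 2 for p, a in zip(pred, act)) / len(pred)

def binary_cross_entropy(pred, act, eps=1e-12):
    """Mean binary cross entropy; eps is an assumed clipping constant."""
    total = 0.0
    for pred_, act_ in zip(pred, act):
        p = min(max(pred_, eps), 1.0 - eps)  # keep log() away from 0
        total += act_ * math.log(p) + (1 - act_) * math.log(1.0 - p)
    return -total / len(pred)

act = [1, 0]                    # true labels
confident_right = [0.9, 0.1]    # hypothetical predictions close to the labels
confident_wrong = [0.1, 0.9]    # hypothetical predictions far from the labels

for name, pred in [("right", confident_right), ("wrong", confident_wrong)]:
    print(name, mean_squared_error(pred, act), binary_cross_entropy(pred, act))
```

For labels in {0, 1} and predictions in (0, 1), the per-sample squared error can never exceed 1, whereas the cross entropy grows without bound as a prediction approaches the wrong label.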

## Example

## Why does the L2 norm not work?

## Implementations with different loss functions

## Multiclass logistic regression