# Logistic Regression from scratch

In this review I give a complete discussion of logistic regression for binary classification. First I work through all the mathematical derivations, including backpropagation and optimisation. Then I discuss a complete Python implementation of the method that does not rely on any existing machine learning library. At the end I apply the method using the scikit-learn library and show how to do the same modelling quickly and easily.

## Introduction

Let's first formulate our problem. Suppose we want to classify some observations into two classes. For example, we may want to know whether there is a cat in an image or not, or whether a person will change insurance in the next six months or not.
We have some features that describe what we are observing (for example the RGB values of the pixels of the image, or the age, address, and weight of a person), and we want to predict whether the observation belongs to class one or class two (cat or no cat, change insurance or not). This kind of classification is called binary classification.
Logistic regression, in its simplest form, is used to do exactly this.

Our goal is to have a model that we can train (let it learn from examples) and that can predict to which of the two classes a new observation (its input) belongs.

Let's first clarify some notation that we will use in this review.

Each observation belongs to one of two classes, which we will indicate with $0$ and $1$. We call the class label $y$, so in a more mathematical form we have
$$
\text{class label}\ \ \rightarrow \ \ y
\in \{0,1\}
$$

What our method will give as output, or as a prediction, is not the class itself but the probability of $y$ being $1$, given the input case $x$. We will indicate this prediction with $a$. In a more mathematical form:

$$
a = P(y = 1 \ |\ x)
$$

Usually we will then define an input observation to be of class $1$ if $a > 0.5$ and of class $0$ if $a \leq 0.5$.
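
The decision rule above can be sketched in a few lines of Python. The function name `predict_class` and its `threshold` parameter are illustrative assumptions, not something defined elsewhere in this review.

```python
import numpy as np

def predict_class(a, threshold=0.5):
    """Map a probability (or an array of probabilities) to class 0 or 1."""
    # Class 1 only when a is strictly greater than the threshold,
    # so a value of exactly 0.5 is assigned to class 0.
    return (np.asarray(a) > threshold).astype(int)

print(predict_class(0.73))             # 1
print(predict_class([0.2, 0.5, 0.9]))  # [0 0 1]
```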

Let's assume we have $n_x$ numerical input features: $x_1, x_2, ..., x_{n_x}$. These can be written as a vector, which we will indicate with $x$:

$$
x = (x_1, x_2, ..., x_{n_x})
$$

In our discussion we will also use the vector $w$, which contains $n_x$ weights (also numerical), and a constant number $b$ (usually called the bias):

$$
w = (w_1, w_2, ..., w_{n_x})
$$


As you may know, what we need to do is to find the ideal weights $w$ and bias $b$ to classify our observations. 

Let's suppose for a moment that we have already found the ideal $w$ and $b$.
To apply the method we then have to perform the following steps.

### Step 1

We first build a linear combination $z$ of our inputs using $w$ and $b$

$$
z = w_1 x_1+w_2 x_2+...+w_{n_x} x_{n_x}+b
\tag{1}
$$

Now it will be very useful to consider our vectors $x$ and $w$ as tensors (or matrices) of dimensions $(n_x, 1)$, that is, as column vectors. This will make the generalisation to many training cases a lot easier, since we will be able to use the formulas we find as they are (we will need to vectorize them, more on that later). So both $x$ and $w$ have dimensions $(n_x,1)$. Equation (1) can then be rewritten as a matrix multiplication, which here is simply the inner product of two vectors:

$$
z = w^T x + b = w_1 x_1+w_2 x_2+...+w_{n_x} x_{n_x}+b
$$
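
As a quick worked example with made-up numbers (chosen only for illustration), take $n_x = 2$, $w = (2, 2)^T$, $x = (1, 1)^T$ and $b = 0.5$. Then

$$
z = w^T x + b = 2 \cdot 1 + 2 \cdot 1 + 0.5 = 4.5
$$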

Note that in Python, as long as we are dealing with tensors of rank 1 (one-dimensional NumPy arrays), transposing does not make any difference.

Consider for example

In [27]:
import numpy as np
x = np.array([1,1])
y = np.array([2,2])
print(x.shape)
print(y.shape)

(2,)
(2,)


In this case it is easy to see that transposing does not make any difference:

In [24]:
print(x.T)
print(x)

[1 1]
[1 1]


and so we have

In [30]:
print(np.dot(x.T,x))
print(np.dot(x,x))
print(np.dot(x,x.T))

2
2
2


But if we define `x` as a two-dimensional array with nested brackets ```[[]]```

In [46]:
x = np.array([[1],[1]])
y = np.array([[2],[2]])
print(x)
print(x.shape)

[[1]
 [1]]
(2, 1)


we now have a $2 \times 1$ tensor, instead of a one-dimensional array with shape $(2,)$.
Now, to obtain a single number from the ```np.dot()``` inner product, we need to transpose the first matrix:

In [58]:
res = np.dot(x.T,y)
print(res)
print(res.shape)

[[4]]
(1, 1)


In [57]:
print(np.squeeze(res))
print(np.squeeze(res).shape)

4
()


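You can see the difference between the shapes: ```np.squeeze()``` removes the dimensions of length one and turns the $1 \times 1$ matrix into a zero-dimensional array. Remember that to calculate $z$ you will need a single number (a scalar), not a $1 \times 1$ matrix.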

Usually the easiest way to obtain column vectors is to simply reshape your arrays:

In [50]:
x = np.array([1,1]).reshape(2,1)
print(x)
print(x.shape)

[[1]
 [1]]
(2, 1)


This works well because you will usually have your input features as a one-dimensional array, and a single call to ```reshape()``` turns it into a column vector.
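
Putting these pieces together, a minimal sketch of Step 1 might look as follows. The helper name `linear_combination` and the values of `w`, `x` and `b` are assumptions made up for illustration. The sketch uses `ndarray.item()` to extract the single element of the $(1, 1)$ result as a plain Python float.

```python
import numpy as np

def linear_combination(w, x, b):
    """Return z = w^T x + b as a Python float (hypothetical helper)."""
    w = np.asarray(w, dtype=float).reshape(-1, 1)  # column vector (n_x, 1)
    x = np.asarray(x, dtype=float).reshape(-1, 1)  # column vector (n_x, 1)
    z = np.dot(w.T, x) + b  # (1, n_x) times (n_x, 1) gives shape (1, 1)
    return z.item()         # single element as a Python float

w = np.array([2, 2])  # illustrative weights
x = np.array([1, 1])  # illustrative input features
z = linear_combination(w, x, b=0.5)
print(z)        # 4.5
print(type(z))  # <class 'float'>
```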

### Step 2

Since as output we want, as briefly discussed above, the probability of $y$ being $1$, we apply the sigmoid function, which maps any real number to a value between $0$ and $1$, to $z$

$$
a = \sigma (z) = \sigma (w^T x + b)
$$

where with $w^T$ we have indicated the transpose of the vector $w$. Since $w$ is a column vector with dimensions $(n_x,1)$, $w^T$ is a "horizontal" (row) vector, or better a matrix with dimensions $(1,n_x)$. In this way we can treat $w^T x$ as a normal matrix multiplication of a $(1,n_x)$ matrix with an $(n_x,1)$ matrix. The result is a scalar.
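
Step 2 can be sketched in the same way. The text above does not spell out the sigmoid, so this sketch assumes the standard definition $\sigma(z) = 1/(1+e^{-z})$. The values of `w`, `x` and `b` are again made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Standard logistic function 1 / (1 + exp(-z)); output lies in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values (assumptions, not taken from any dataset)
w = np.array([0.5, -0.25]).reshape(2, 1)
x = np.array([1.0, 2.0]).reshape(2, 1)
b = 0.0

z = (np.dot(w.T, x) + b).item()  # 0.5*1 + (-0.25)*2 + 0 = 0.0
a = sigmoid(z)                   # estimated probability that y = 1
print(z, a)  # 0.0 0.5
```

A value of $z = 0$ gives $a = 0.5$, which sits exactly on the decision boundary.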

The computational graph for the logistic regression algorithm (which will allow us to calculate the derivatives of the loss function and the cost function) is the following:

![title](logistic_regression_computational_graph.png)

In [14]:
import numpy as np
x = np.array([1,1])
y = np.array([2,2])

In [15]:
print(x)
print(y)

[1 1]
[2 2]


In [19]:
np.dot(x.T,y)

4

In [20]:
x.shape

(2,)