In [1]:
import numpy as np

# Logistic Regression

### Some matrix algebra

- Scalars are numbers
- Vectors are columns
- Transposed vectors are rows
- A matrix has rows and columns

- Multiplying two matrices is only possible if the number of columns of array 1 (its dimension 2) equals the number of rows of array 2 (its dimension 1)
- The resulting array has shape (array 1 dim 1, array 2 dim 2)

In [45]:
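v0 = np.random.randint(1, 10, [2, 10])  # matrix of shape (2, 10)
v1 = np.random.randint(1, 10, 10)       # vector of shape (10,)
print(v0)
print(v1)
# v0.T has shape (10, 2): its second dimension (2) does not match the
# length of v1 (10), so NumPy raises a ValueError here.
v0.T.dot(v1)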

[[2 7 2 1 2 6 7 3 2 1]
 [4 5 2 8 2 1 4 2 6 7]]
[7 5 8 1 6 6 1 7 3 3]


ValueError: shapes (10,2) and (10,) not aligned: 2 (dim 1) != 10 (dim 0)
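
For contrast, here is a minimal sketch of products whose shapes do align. The names `A`, `v` and `B` are illustrative and do not come from the cell above.

```python
import numpy as np

A = np.arange(20).reshape(2, 10)   # shape (2, 10)
v = np.ones(10, dtype=int)         # shape (10,)
B = np.ones((10, 3), dtype=int)    # shape (10, 3)

# (2, 10) . (10,) -> (2,): the inner dimensions match (10 == 10)
Av = A.dot(v)

# (2, 10) . (10, 3) -> (2, 3): result is (rows of A, columns of B)
AB = A.dot(B)

print(Av.shape, AB.shape)
```

In the failing cell, dropping the transpose (`v0.dot(v1)`) gives the same aligned `(2, 10) . (10,)` case.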

### A few machine learning shorthands:

- x: inputs
- y: actual outputs
- $\hat{y}$: Predicted outputs
- $z$: the linear term ${w^T}x + b$ that is passed to the sigmoid

### Sigmoid activation function:

$\hat{y} = \sigma ({w^T}x + b) $

where:

## $\sigma(z)=\frac{1}{1+e^{-z}}$

Intuitively, a higher $z$ value (depicted on the x axis of a sigmoid chart) results in a $\sigma (z)$ closer to 1, a lower $z$ value results in a $\sigma (z)$ closer to 0, and $\sigma (0) = 0.5$.
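
A minimal NumPy sketch of the sigmoid and the prediction formula above. The function names `sigmoid` and `predict` are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    """y_hat = sigma(w^T x + b) for a single input vector x."""
    return sigmoid(np.dot(w, x) + b)

# Large negative z -> close to 0, z = 0 -> 0.5, large positive z -> close to 1
z = np.array([-5.0, 0.0, 5.0])
print(sigmoid(z))
```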



### Loss function for logistic regression:
You want predicted $\hat y$ to approximate actual $y$

### Loss$(\hat y, y) = -(y\log\hat y + (1 - y)\log(1 - \hat y))$

Intuitively, if $y = 0$, then the function shortens to:

$-\log(1- \hat y)$,

...which implies that a small $\hat y$ makes $1 - \hat y$ close to 1, so $\log(1 - \hat y)$ is close to 0 (its largest possible value), which leads to a lower "loss" (because of the -ve sign)


... and if $y = 1$ then the formula shortens to:

$-\log \hat y$

... so a larger $\hat y$ (closer to 1) brings $\log\hat y$ closer to 0, which leads to a small "loss" because of the -ve sign
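
The two cases can be checked numerically. This is a small sketch of the loss formula; the function name `loss` and the sample values are illustrative.

```python
import numpy as np

def loss(y_hat, y):
    """Loss for one example; y is 0 or 1, y_hat is strictly between 0 and 1."""
    # Note: y_hat equal to exactly 0 or 1 would make a log term infinite.
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# y = 1: a prediction near 1 is cheap, a prediction near 0 is expensive
print(loss(0.9, 1), loss(0.1, 1))

# y = 0: the situation is mirrored
print(loss(0.1, 0), loss(0.9, 0))
```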

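### and the Cost function is just the average of the losses over all $m$ training examples:
#### the cost is referred to as $J$

### $J(w,b) = -\frac{1}{m}\displaystyle\sum_{i=1}^{m}\left[y^{(i)}\log\hat y^{(i)} + (1 - y^{(i)})\log(1 - \hat y^{(i)})\right]$

where $y^{(i)}$ and $\hat y^{(i)}$ are the actual and predicted outputs for training example $i$.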
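
As a sketch, the cost can be computed over arrays of predictions and labels. The function name `cost` and the four sample values are made up for illustration.

```python
import numpy as np

def cost(y_hat, y):
    """Average loss over m examples; y_hat and y both have shape (m,)."""
    m = y.shape[0]
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return losses.sum() / m

y = np.array([1, 0, 1, 0])               # actual outputs
y_hat = np.array([0.9, 0.1, 0.8, 0.3])   # predicted outputs
print(cost(y_hat, y))
```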

### Gradient Descent:
We want to find $w, b$ that minimize $J(w,b)$

$ w:= w - \alpha\frac{dJ(w)}{dw} $,

where:

$w:=$ means repeatedly update $w$

$\alpha$ = learning rate

$\frac{dJ(w)}{dw}$: the derivative (slope) of the cost function with respect to $w$, i.e. how much $J$ changes for a small change in $w$. Note that this is sometimes referred to as just `dw` in Python variables.

Note also that because our cost function is dependent on two variables ($w$ and $b$), we will actually be solving two "partial" derivatives, and some textbooks change the $d$ notation to $\partial$, i.e.:

$ w:= w - \alpha\frac{\partial J(w, b)}{\partial w} $, and

$ b:= b - \alpha\frac{\partial J(w, b)}{\partial b} $
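
The two update rules can be sketched as a short loop. This is a minimal illustration under some assumptions not stated in these notes: the inputs are stacked as rows of a matrix `X` with shape (m, n), the gradients used are the standard result for this cost, $\frac{\partial J}{\partial w} = \frac{1}{m}X^T(\hat y - y)$ and $\frac{\partial J}{\partial b} = \frac{1}{m}\sum_{i}(\hat y^{(i)} - y^{(i)})$, and the function name, default values and toy data are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, n_iters=1000):
    """X has shape (m, n), one row per example; y has shape (m,)."""
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    for _ in range(n_iters):
        y_hat = sigmoid(X.dot(w) + b)   # predictions, shape (m,)
        dw = X.T.dot(y_hat - y) / m     # dJ/dw, shape (n,) (assumed standard gradient)
        db = np.sum(y_hat - y) / m      # dJ/db, scalar (assumed standard gradient)
        w = w - alpha * dw              # w := w - alpha * dJ/dw
        b = b - alpha * db              # b := b - alpha * dJ/db
    return w, b

# Toy data (made up): the label is 1 when the single feature is positive
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
w, b = gradient_descent(X, y)
print(w, b)
```

The variable names `dw` and `db` follow the shorthand mentioned above for the two partial derivatives.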