<a href="https://colab.research.google.com/github/plastermodelsean/Deep-Learning-with-Python/blob/master/Notebook_Neural_Networks_and_Deep_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 2.1 Logistic Regression as a Neural Network #


## 2.12 Logistic Regression ##

 
Given $x$, we want: $$\hat{y} = P(y=1|x),\quad x\in\mathbb{R}^{n_x},\quad 0\leq\hat{y}\leq 1 \tag{1}$$
Parameters: $$\mathcal{w}\in\mathbb{R}^{n_x},\quad b\in\mathbb{R}\tag{2}$$
Output: $$\hat{y} = \sigma(\mathcal{w}^{T}x+b)\tag{3}$$

where $$\sigma(z) = \cfrac{1}{1+e^{-z}}\tag{4}$$
If $z$ is a large positive number, then: $$\sigma(z) \approx 1.$$
Conversely, if $z$ is a large negative number, then: $$\sigma(z) \approx 0.$$
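As a minimal sketch of equations (3) and (4), the snippet below computes $\hat{y}$ for one input. The parameter and input values are made up for illustration; only the formulas come from the notes above.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and input with n_x = 3 (illustrative values only)
w = np.array([[0.5], [-1.0], [2.0]])   # shape (n_x, 1)
b = 0.1
x = np.array([[1.0], [2.0], [0.5]])    # shape (n_x, 1)

z = (w.T @ x).item() + b               # w^T x + b, a scalar
y_hat = sigmoid(z)                     # estimate of P(y = 1 | x)

print(z, y_hat)
print(sigmoid(10.0), sigmoid(-10.0))   # close to 1 and close to 0
```

The last line illustrates the two limiting cases: a large positive $z$ pushes the output toward 1, a large negative $z$ toward 0.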

## 2.13 Logistic Regression Cost Function ##

**Objective**

Given $\lbrace(x^{(1)},y^{(1)}),\ldots,(x^{(m)},y^{(m)})\rbrace$, we want $\hat{y}^{(i)} \approx y^{(i)}$.

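**Loss (error) Function**

We do not use squared error here because, combined with the sigmoid, it makes the optimization problem non-convex, so gradient descent may get stuck in a local optimum instead of reaching the global one. Instead we use: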

$$\mathcal{L}(\hat{y},y) = - (y\log{\hat{y}} +(1-y)\log{(1-\hat{y})})\tag{1}$$

If $y = 1$: $$\mathcal{L}(\hat{y},y) = -\log{\hat{y}}\tag{2}$$

To make the loss small we want $\log{\hat{y}}$ large, so we want $\hat{y}\to1$.

If $y = 0$: $$\mathcal{L}(\hat{y},y) = -\log{(1-\hat{y})}\tag{3}$$

To make the loss small we want $\log{(1-\hat{y})}$ large, so we want $\hat{y}\to0$.

**Cost Function**
$$\mathcal{J}(\mathcal{w},b) = \cfrac{1}{m}\sum_{i = 1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)})\tag{4}$$
$$= -\cfrac{1}{m}\sum_{i = 1}^{m}[y^{(i)}\log{\hat{y}^{(i)}} + (1-y^{(i)})\log{(1-\hat{y}^{(i)})}]\tag{5}$$

The loss function measures the error on a single training example; the cost function is the average of the losses over the entire training set.
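The distinction between loss and cost can be sketched in a few lines of NumPy. The helper names `loss` and `cost` and the sample predictions are assumptions made for this illustration.

```python
import numpy as np

def loss(y_hat, y):
    """Cross-entropy loss for one example (element-wise for arrays)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(y_hat, y):
    """Average of the per-example losses over all m examples."""
    return np.mean(loss(y_hat, y))

# Made-up predictions and labels for m = 3 examples
y_hat = np.array([0.9, 0.2, 0.6])
y = np.array([1, 0, 1])

print(loss(0.9, 1))     # small loss: confident and correct
print(loss(0.1, 1))     # large loss: confident and wrong
print(cost(y_hat, y))   # mean of the three per-example losses
```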

## 2.14 Gradient Descent ##

**Objective**

We want to find $\mathcal{w},b$ that minimize $\mathcal{J}(\mathcal{w},b)$.

**Gradient Descent**

Repeat {
$$\mathcal{w} := \mathcal{w} - \alpha\cfrac{\partial \mathcal{J}(\mathcal{w},b)}{\partial\mathcal{w}}\tag{1}$$

$$b := b - \alpha\cfrac{\partial \mathcal{J}(\mathcal{w},b)}{\partial b}\tag{2}$$
}
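To see the update rule in action, here is a sketch on a toy convex objective. The objective $(w-3)^2 + (b+1)^2$ is an assumption chosen only because its gradient is easy to write down; it is not the logistic regression cost.

```python
# Toy objective (an assumption for illustration): J(w, b) = (w - 3)^2 + (b + 1)^2
def grad(w, b):
    """Partial derivatives of the toy objective with respect to w and b."""
    return 2 * (w - 3), 2 * (b + 1)

w, b = 0.0, 0.0
alpha = 0.1                      # learning rate
for _ in range(200):             # "Repeat { ... }"
    dw, db = grad(w, b)
    w = w - alpha * dw
    b = b - alpha * db

print(w, b)                      # approaches the minimizer (3, -1)
```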

## 2.19 Logistic Regression Gradient Descent ##

**Objective**

Let $z$ be the input to the sigmoid function and $a$ the output of logistic regression:

$forward:$

$$z = \mathcal{w}^T x + b\tag{1}$$

$$a = \sigma(z)\tag{2}$$

$$\mathcal{L}(a,y) = - [y\log{a} +(1 - y)\log{(1-a)}]\tag{3}$$

$backprop:$


$$\mathrm{d}a = \cfrac{\mathrm{d}\mathcal{L}(a,y)}{\mathrm{d}a}$$
$$ = -\cfrac{y}{a} + \cfrac{1 - y}{1 - a}\tag{4}$$


$$\cfrac{\mathrm{d}a}{\mathrm{d}z} = a(1-a)\tag{5}$$

$$\mathrm{d}z = \cfrac{\mathrm{d}\mathcal{L}(a,y)}{\mathrm{d}z}$$


$$= \cfrac{\mathrm{d}\mathcal{L}(a,y)}{\mathrm{d}a}\cdot\cfrac{\mathrm{d}a}{\mathrm{d}z}$$

$$= a - y\tag{6}$$
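The step from the chain rule to $a - y$ is brief; substituting (4) and (5) and expanding gives the intermediate algebra:

$$\mathrm{d}z = \left(-\cfrac{y}{a} + \cfrac{1 - y}{1 - a}\right)\cdot a(1-a)\\= -y(1-a) + a(1-y)\\= a - y$$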

In practice, we write $\mathrm{d}z$ for $\cfrac{\mathrm{d}\mathcal{L}}{\mathrm{d}z}$, $\mathrm{d}a$ for $\cfrac{\mathrm{d}\mathcal{L}}{\mathrm{d}a}$, and so on.

Since $z = \mathcal{w}^Tx + b$, the chain rule gives $\mathrm{d}\mathcal{w} = x\,\mathrm{d}z$ and $\mathrm{d}b = \mathrm{d}z$. Thus, the parameters are updated by:

$$\mathcal{w} := \mathcal{w} - \alpha\cdot\mathrm{d}\mathcal{w}\tag{7}$$

$$b := b - \alpha\cdot\mathrm{d}b\tag{8}$$
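One way to convince ourselves that $\mathrm{d}z = a - y$ is a finite-difference check. This is a sketch with arbitrary values for $z$ and $y$; the helper `loss_from_z` is a name introduced here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_from_z(z, y):
    """Loss L(a, y) written as a function of z, with a = sigmoid(z)."""
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y = 0.7, 1.0              # arbitrary illustrative values
eps = 1e-6

analytic = sigmoid(z) - y    # dz = a - y from the derivation
numeric = (loss_from_z(z + eps, y) - loss_from_z(z - eps, y)) / (2 * eps)

print(analytic, numeric)     # the two values should agree closely
```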

## 2.20 Gradient Descent on m Examples ##

Concretely, we initialize the cost and gradient accumulators to zero:

$$\mathcal{J} = 0,\ \overbrace{\mathrm{d}\mathcal{w}_1 = 0,\mathrm{d}\mathcal{w}_2 = 0,\cdots,\mathrm{d}\mathcal{w}_{n_x} = 0}^{n_x},\ \mathrm{d}b = 0.$$

$for\, i=1 :m$

$$z^{(i)} = \mathcal{w}^T x^{(i)} + b\tag{1}$$

$$a^{(i)} = \sigma(z^{(i)} )\tag{2}$$

$$\mathcal{J} \mathrel{+}= - [y^{(i)}\log{a^{(i)}} +(1 - y^{(i)})\log{(1-a^{(i)})}]\tag{3}$$

$$\mathrm{d}z^{(i)} = a^{(i)} - y^{(i)}\tag{4}$$

$$\mathrm{d}\mathcal{w}_1 \mathrel{+}= x_1^{(i)}\mathrm{d}z^{(i)} $$

$$\mathrm{d}\mathcal{w}_2 \mathrel{+}= x_2^{(i)}\mathrm{d}z^{(i)} $$

$$\begin{matrix}\vdots \end{matrix}$$

$$\mathrm{d}\mathcal{w}_{n_x} \mathrel{+}= x_{n_x}^{(i)}\mathrm{d}z^{(i)} \tag{5}$$

$$\mathrm{d}b \mathrel{+}= \mathrm{d}z^{(i)}\tag{6}$$

$end$

$$\mathcal{J}\mathrel{\div}= m,$$

$$\mathrm{d}\mathcal{w}_1 \mathrel{\div}= m, $$

$$\mathrm{d}\mathcal{w}_2 \mathrel{\div}= m, $$

$$\begin{matrix}\vdots \end{matrix}$$

$$\mathrm{d}\mathcal{w}_{n_x}\mathrel{\div}= m,$$
$$\mathrm{d}b\mathrel{\div}= m\tag{7}$$

Thus, for $j = 1,\ldots,n_x$:

$$\cfrac{\partial{\mathcal{J}}}{\partial\mathcal{w}_j}=\mathrm{d}\mathcal{w}_j$$

$$\mathcal{w}_j := \mathcal{w}_j - \alpha\cdot\mathrm{d}\mathcal{w}_j$$

$$b := b - \alpha\cdot\mathrm{d}b\tag{8}$$

N.B.

In the for-loop above, each $\mathrm{d}\mathcal{w}_j$ is a single variable with no $(i)$ superscript, because it accumulates a sum over all training examples.
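The accumulation procedure can be sketched in plain Python with two explicit loops. The function name `gradients_loop`, the list-of-examples layout for `X`, and the tiny dataset are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradients_loop(w, b, X, Y):
    """One pass over m examples with explicit loops.

    w: list of n_x weights, b: float,
    X: list of m examples (each a list of n_x features), Y: list of m labels.
    """
    n_x, m = len(w), len(X)
    J, dw, db = 0.0, [0.0] * n_x, 0.0
    for i in range(m):
        z = sum(w[j] * X[i][j] for j in range(n_x)) + b
        a = sigmoid(z)
        J += -(Y[i] * math.log(a) + (1 - Y[i]) * math.log(1 - a))
        dz = a - Y[i]
        for j in range(n_x):          # inner loop over features
            dw[j] += X[i][j] * dz
        db += dz
    J /= m
    dw = [g / m for g in dw]
    db /= m
    return J, dw, db

# Tiny made-up dataset: m = 3 examples, n_x = 2 features
X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
Y = [1, 0, 0]
J, dw, db = gradients_loop([0.0, 0.0], 0.0, X, Y)
print(J, dw, db)
```

With all parameters at zero every prediction is 0.5, so the cost equals $\log 2$, which makes the result easy to check by hand.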

# 2.2 Python and Vectorization #

## 2.21 Vectorization ##

Here is a vectorization demo written in Python:

In [1]:
import numpy as np

a = np.array([1,2,3,4])
print(a)

[1 2 3 4]


In [7]:
import time

a = np.random.rand(1000000)
b = np.random.rand(1000000)

tic = time.time()
c = np.dot(a,b)
toc = time.time()
print(c)
print("Vectorized version:" + str(1000*(toc-tic)) + "ms")

c = 0
tic = time.time()
for i in range(1000000):
  c += a[i]*b[i]
toc = time.time()

print(c)
print("For loop version:" + str(1000*(toc-tic)) + "ms")

250264.67451881
Vectorized version:1.2106895446777344ms
250264.6745188151
For loop version:438.8151168823242ms


## 2.22 More Vectorization Examples ##

**Neural network programming guideline**

Whenever possible, avoid explicit for-loops.

**Numpy built-in functions**

numpy.dot()

A**

$\cdots$

**Logistic regression derivatives**

With the vectorization approach we can get rid of one for-loop:

$\mathcal{J} = 0,\mathrm{d}\mathcal{w}= numpy.zeros((n_x,1)),\mathrm{d}b = 0.$

$for\, i=1 :m$

$$z^{(i)} = \mathcal{w}^T x^{(i)} + b\tag{1}$$

$$a^{(i)} = \sigma(z^{(i)} )\tag{2}$$

$$\mathcal{J} \mathrel{+}= - [y^{(i)}\log{a^{(i)}} +(1 - y^{(i)})\log{(1-a^{(i)})}]\tag{3}$$

$$\mathrm{d}z^{(i)} = a^{(i)} - y^{(i)}\tag{4}$$

$$\mathrm{d}\mathcal{w}\mathrel{+}= x^{(i)}\mathrm{d}z^{(i)} \tag{5}$$

$$\mathrm{d}b \mathrel{+}= \mathrm{d}z^{(i)}\tag{6}$$

$end$

$$\mathcal{J}\mathrel{\div}= m,$$

$$\mathrm{d}\mathcal{w}\mathrel{\div}= m,$$
$$\mathrm{d}b\mathrel{\div}= m\tag{7}$$

Thus:

$$\cfrac{\partial{\mathcal{J}}}{\partial\mathcal{w}}=\mathrm{d}\mathcal{w}$$

$$\mathcal{w} := \mathcal{w} - \alpha\cdot\mathrm{d}\mathcal{w}$$

$$b := b - \alpha\cdot\mathrm{d}b\tag{8}$$
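A sketch of this half-vectorized version in NumPy follows: the loop over features is replaced by a vector update, while the loop over examples remains. The function name `gradients_one_loop` and the small dataset are assumptions for illustration; `X` stores one example per column.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_one_loop(w, b, X, Y):
    """w: (n_x, 1), X: (n_x, m), Y: (1, m). Loops over examples only."""
    n_x, m = X.shape
    J, dw, db = 0.0, np.zeros((n_x, 1)), 0.0
    for i in range(m):
        x_i = X[:, i:i + 1]                  # column vector, shape (n_x, 1)
        y_i = Y[0, i]
        a_i = sigmoid((w.T @ x_i).item() + b)
        J += -(y_i * np.log(a_i) + (1 - y_i) * np.log(1 - a_i))
        dz_i = a_i - y_i
        dw += x_i * dz_i                     # replaces the loop over features
        db += dz_i
    return J / m, dw / m, db / m

# Made-up data: n_x = 2 features, m = 3 examples stored as columns
X = np.array([[1.0, 0.5, -1.5],
              [2.0, -1.0, 0.3]])
Y = np.array([[1, 0, 0]])
J, dw, db = gradients_one_loop(np.zeros((2, 1)), 0.0, X, Y)
print(J, dw.ravel(), db)
```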


## 2.23 Vectorizing Logistic Regression ##

**Forward Propagation**

Previously, we had to compute each example one at a time:

$$z^{(1)} = \mathcal{w}^Tx^{(1)} + b, a^{(1)} = \sigma(z^{(1)});\\z^{(2)} = \mathcal{w}^Tx^{(2)} + b, a^{(2)} = \sigma(z^{(2)});\\\vdots\\z^{(m)} = \mathcal{w}^Tx^{(m)} + b, a^{(m)} = \sigma(z^{(m)}).\tag{1}$$

Now we want a vectorized approach that computes all of these values at once. As before, our $m$ input examples are stacked as columns:

$$X=\begin{bmatrix}|&|&&|\\x^{(1)}&x^{(2)}&\cdots&x^{(m)}\\|&|&&|\end{bmatrix},\quad\text{where } X \in \mathbb{R}^{n_x\times m}\tag{2}$$

We want to construct a row vector $Z$ such that
$$Z = \begin{bmatrix}z^{(1)}&z^{(2)}&\cdots& z^{(m)}\end{bmatrix}$$

$$ =\mathcal{w}^T X + \underbrace{\begin{bmatrix}b&b&\cdots&b\end{bmatrix}}_{1\times m}\\=\begin{bmatrix}\mathcal{w}^Tx^{(1)} + b&\mathcal{w}^Tx^{(2)} + b&\cdots&\mathcal{w}^Tx^{(m)} + b\end{bmatrix}\\=\mathcal{w}^TX + b \tag{3}$$

In Python, we compute $Z$ with `np.dot(w.T, X) + b`. NumPy *broadcasting* expands the scalar $b$ to shape $(1, m)$, the shape of $\mathcal{w}^TX$, before adding.

Then we obtain $A = \begin{bmatrix}a^{(1)}&a^{(2)}&\cdots& a^{(m)}\end{bmatrix}$ by computing $\sigma(Z)$:

$$A = \sigma(Z)\tag{4}$$

Therefore, with $Y$ the $1\times m$ row vector of labels, the cost is computed as follows (here '$\cdot$' denotes ordinary matrix multiplication):

$$J = -\cfrac{1}{m}[Y\cdot\log{A^T}+(1-Y)\cdot\log{((1-A)^T)}]\tag{5}$$
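A minimal NumPy sketch of the vectorized forward pass and cost in equations (3) to (5) is shown below. The function name `forward` and the data values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(w, b, X, Y):
    """w: (n_x, 1), X: (n_x, m), Y: (1, m). Returns A of shape (1, m) and the scalar cost."""
    m = X.shape[1]
    Z = np.dot(w.T, X) + b                   # shape (1, m); scalar b is broadcast
    A = sigmoid(Z)
    # (1, m) times (m, 1) products give (1, 1) arrays, as in equation (5)
    J = -(np.dot(Y, np.log(A).T) + np.dot(1 - Y, np.log(1 - A).T)) / m
    return A, J.item()

# Made-up parameters and data: n_x = 2, m = 3
w = np.array([[0.1], [-0.2]])
b = 0.3
X = np.array([[1.0, 0.5, -1.5],
              [2.0, -1.0, 0.3]])
Y = np.array([[1, 0, 0]])

A, J = forward(w, b, X, Y)
print(A.shape, J)
```

The matrix products here produce the same number as averaging the element-wise losses, which is a convenient sanity check.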

## 2.24 Vectorizing Logistic Regression's Gradient Output ##

**Back Propagation**

Recall that we build the vector $\mathrm{d}Z = \begin{bmatrix}\mathrm{d}z^{(1)}&\mathrm{d}z^{(2)}&\cdots&\mathrm{d}z^{(m)}\end{bmatrix}$ by computing $\mathrm{d}z^{(1)} = a^{(1)} - y^{(1)}, \mathrm{d}z^{(2)} = a^{(2)} - y^{(2)}$, and so on.

Since we have $A = \begin{bmatrix}a^{(1)}&a^{(2)}&\cdots&a^{(m)}\end{bmatrix}$ and $Y = \begin{bmatrix}y^{(1)}&y^{(2)}&\cdots&y^{(m)}\end{bmatrix}$,

$$\mathrm{d}Z =A - Y\\= \begin{bmatrix}a^{(1)} - y^{(1)}&a^{(2)} - y^{(2)}&\cdots&a^{(m)} - y^{(m)}\end{bmatrix}\tag{1}$$

Now we eliminate the remaining for-loop over the $m$ examples with the same technique:

$$\mathrm{d}b = \cfrac{1}{m}\sum_{i=1}^{m}\mathrm{d}z^{(i)}\\= \cfrac{1}{m}\,numpy.sum(\mathrm{d}Z)\\=\cfrac{1}{m}\,numpy.sum(A-Y)\tag{2}$$

$$\mathrm{d}\mathcal{w} = \cfrac{1}{m}X\mathrm{d}Z^T\\= \cfrac{1}{m}\begin{bmatrix}|&|&&|\\x^{(1)}&x^{(2)}&\cdots&x^{(m)}\\|&|&&|\end{bmatrix}\begin{bmatrix}\mathrm{d}z^{(1)}\\\vdots\\\mathrm{d}z^{(m)}\end{bmatrix}\\= \cfrac{1}{m}\begin{bmatrix}x^{(1)}\mathrm{d}z^{(1)}+\cdots+ x^{(m)}\mathrm{d}z^{(m)}\end{bmatrix}\\=\cfrac{1}{m}X(A-Y)^T\tag{3}$$

where $\mathrm{d}b\in\mathbb{R}$ and $\mathrm{d}\mathcal{w}\in\mathbb{R}^{n_x}$.

The parameters can then be updated by:

$$\mathcal{w} := \mathcal{w} - \alpha\cdot\mathrm{d}\mathcal{w}$$

$$b  := b - \alpha\cdot\mathrm{d}b\tag{4}$$
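Putting the vectorized forward and backward passes together gives a complete training loop with no loop over examples. This is a sketch under stated assumptions: the function name `train`, the learning rate, the iteration count, and the toy dataset are all made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.1, num_iterations=1000):
    """Batch gradient descent. X: (n_x, m), Y: (1, m)."""
    n_x, m = X.shape
    w, b = np.zeros((n_x, 1)), 0.0
    for _ in range(num_iterations):          # loop over iterations only
        A = sigmoid(np.dot(w.T, X) + b)      # forward pass, shape (1, m)
        dZ = A - Y                           # shape (1, m)
        dw = np.dot(X, dZ.T) / m             # shape (n_x, 1)
        db = np.sum(dZ) / m                  # scalar
        w = w - alpha * dw
        b = b - alpha * db
    return w, b

# Made-up, linearly separable data: the label is 1 when the single feature is positive
X = np.array([[-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]])
Y = np.array([[0, 0, 0, 1, 1, 1]])

w, b = train(X, Y)
predictions = (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(int)
print(predictions)
```

The outer loop over gradient descent iterations still remains; only the loops over examples and features have been vectorized away.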

## 2.25 Broadcasting in Python ##

**Broadcasting example**



In [8]:
import numpy as np

A = np.array([[56.0,0.0,4.0,68.0],
              [1.2,104.0,52.0,8.0],
              [1.8,135.0,99.0,0.9]])

print(A)

[[ 56.    0.    4.   68. ]
 [  1.2 104.   52.    8. ]
 [  1.8 135.   99.    0.9]]


In [10]:
# sum vertically (down each column) by setting axis=0
cal = A.sum(axis=0)
print(cal)

# sum horizontally (across each row) by setting axis=1
acal = A.sum(axis=1)
print(acal)

[ 59.  239.  155.   76.9]
[128.  165.2 236.7]


In [13]:
percentage  = 100*A/cal.reshape(1,4)
print(percentage)

[[94.91525424  0.          2.58064516 88.42652796]
 [ 2.03389831 43.51464435 33.5483871  10.40312094]
 [ 3.05084746 56.48535565 63.87096774  1.17035111]]


**General Principle**

Let's work through a concrete example to see the general rule of broadcasting.

Say we want to add this matrix and this row vector:

$$\begin{bmatrix}1&2&3\\4&5&6\end{bmatrix} + \begin{bmatrix}100&200&300\end{bmatrix}\\= \begin{bmatrix}101&202&303\\104&205&306\end{bmatrix}$$

What has broadcasting done here?

Technically, this is the problem of adding an $(m,n)$ matrix and a $(1,n)$ row vector. NumPy first (conceptually) copies the row vector $m$ times, turning it into an $(m,n)$ matrix with the same shape as the first operand. The two can then be added element-wise.

Another example here:

$$\begin{bmatrix}1&2&3\\4&5&6\end{bmatrix} + \begin{bmatrix}100\\200\end{bmatrix}\\=\begin{bmatrix}101& 102&103\\204&205&206\end{bmatrix}$$

Here we add an $(m,n)$ matrix and an $(m,1)$ column vector by copying the column vector $n$ times and then adding the two matrices element-wise.

In Matlab/Octave, the bsxfun() function does essentially the same thing.
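The two examples above can be reproduced directly in NumPy. As a sketch, the snippet also spells out the conceptual "copying" with `np.tile` so the results can be compared; the variable names are chosen for illustration.

```python
import numpy as np

M = np.array([[1, 2, 3],
              [4, 5, 6]])            # shape (2, 3)
row = np.array([[100, 200, 300]])    # shape (1, 3)
col = np.array([[100], [200]])       # shape (2, 1)

# Broadcasting stretches `row` down the rows and `col` across the columns
print(M + row)
print(M + col)

# The same results with the copies made explicit via np.tile
explicit_row = M + np.tile(row, (2, 1))   # row repeated m = 2 times
explicit_col = M + np.tile(col, (1, 3))   # column repeated n = 3 times
print(np.array_equal(M + row, explicit_row), np.array_equal(M + col, explicit_col))
```

In practice NumPy does not materialize the copies; thinking of them as copies is only a mental model for the resulting shapes.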

## 2.26 A Note on Python/NumPy Vectors ##

In [16]:
import numpy as np

a = np.random.randn(5)

print(a)
print(a.shape)
print(a.T)
print(np.dot(a,a.T))

[ 0.28909035  3.20207242  1.57667516  0.40958471 -0.11709792]
(5,)
[ 0.28909035  3.20207242  1.57667516  0.40958471 -0.11709792]
13.004217149732693


In [17]:
a = np.random.randn(5,1)
print(a)
print(a.shape)
print(a.T)
# np.dot of a column vector with a row vector gives the outer product
print(np.dot(a,a.T))

[[ 0.083628  ]
 [ 0.13744904]
 [ 2.29834784]
 [ 0.89662913]
 [-2.13837141]]
(5, 1)
[[ 0.083628    0.13744904  2.29834784  0.89662913 -2.13837141]]
[[ 0.00699364  0.01149459  0.19220623  0.0749833  -0.17882772]
 [ 0.01149459  0.01889224  0.31590571  0.12324082 -0.29391711]
 [ 0.19220623  0.31590571  5.28240281  2.06076563 -4.91472131]
 [ 0.0749833   0.12324082  2.06076563  0.8039438  -1.9173261 ]
 [-0.17882772 -0.29391711 -4.91472131 -1.9173261   4.57263227]]


The code below produces an array with shape $(5,)$, which is called a "rank 1 array":
```
a = np.random.randn(5)
```
A "rank 1 array" is neither a row vector nor a column vector, so we should avoid using rank 1 arrays in practice.

Instead, create explicit column or row vectors:

```
a = np.random.randn(5, 1)  # a is a column vector
a = np.random.randn(1, 5)  # a is a row vector
```

We should not be shy about using inexpensive `assert` statements to check that shapes are what we expect:

```
assert a.shape == (5, 1)
```
Calling `reshape()` is also a good way to convert a rank 1 array into a proper row or column vector:

```
a = a.reshape((5, 1))
```

## 2.28 Explanation of Logistic Regression Cost Function ##

**Cost on single example**

$$\text{If}\quad y = 1:\quad p(y|x) = \hat{y}\\\text{If}\quad y = 0:\quad p(y|x) = 1 - \hat{y}\tag{1}$$

We can merge these two equations into a single one:

$$p(y|x) = \hat{y}^y\cdot(1-\hat{y})^{(1 - y)}\tag{2}$$

$$\log{p(y|x)} = \log\left[\hat{y}^y(1 - \hat{y})^{(1-y)}\right]\\=y\log{\hat{y}} + (1-y)\log{(1-\hat{y})}\\= -\mathcal{L}(\hat{y},y)$$

**Cost on m examples**

$$\max P(\text{labels in training set}) \\\Leftrightarrow\max \log{P(\text{labels in training set})}\\= \max\log{\prod_{i=1}^{m}p(y^{(i)}|x^{(i)})}\\= \max\sum_{i=1}^{m}\log{p(y^{(i)}|x^{(i)})}\\= \max\, -\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)})\\\Leftrightarrow \min\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)})\tag{3}$$

Hence, our cost function (the objective to minimize) is:

$$\min J(\mathcal{w},b) = \cfrac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)})\tag{4}$$

To summarize, by minimizing this cost function $J(\mathcal{w},b)$ we are carrying out maximum likelihood estimation with the logistic regression model, under the assumption that our training examples are *IID* (independent and identically distributed). The factor $\cfrac{1}{m}$ only rescales the objective and does not change the minimizer.
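The link between the likelihood and the cost can be checked numerically: the log of the product of the per-example probabilities equals $-m\,J$. The predictions and labels below are made-up values for illustration.

```python
import numpy as np

y_hat = np.array([0.9, 0.2, 0.6])   # made-up predictions
y = np.array([1, 0, 1])             # made-up labels
m = len(y)

# p(y|x) = y_hat^y * (1 - y_hat)^(1 - y), multiplied over the IID examples
likelihood = np.prod(y_hat**y * (1 - y_hat)**(1 - y))

# Cost: average cross-entropy loss
J = np.mean(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

print(np.log(likelihood), -m * J)   # the two values should match
```

Because the logarithm is monotonic, maximizing the likelihood and minimizing $J$ select the same parameters.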

## 2.30 Recap and Quiz ##

**Multiplications in Python**

Here are some examples:

In [18]:
# numpy.multiply() and '*' perform element-wise multiplication; shapes (4, 3) and (3, 2) cannot be broadcast, so this raises a ValueError
a = np.random.randn(4, 3) # a.shape = (4, 3)
b = np.random.randn(3, 2) # b.shape = (3, 2)
c = a*b

ValueError: ignored

In [22]:
# numpy.dot() computes the ordinary matrix product
a = np.random.randn(12288, 150) # a.shape = (12288, 150)
b = np.random.randn(150, 45) # b.shape = (150, 45)
c = np.dot(a,b)

print(c.shape)

(12288, 45)


In [29]:
# numpy.multiply() and '*' are element-wise; b of shape (3, 1) is broadcast across the columns of a
a = np.random.randn(3, 3)
b = np.random.randn(3, 1)
c = a*b
d = np.multiply(a,b)
e = np.dot(a,b)
print(c)
print(d)
print(e)

[[ 0.19520437 -0.30376047  0.090505  ]
 [-0.0919571   0.03491028  0.30758659]
 [-0.40744506 -2.54631377 -1.97409886]]
[[ 0.19520437 -0.30376047  0.090505  ]
 [-0.0919571   0.03491028  0.30758659]
 [-0.40744506 -2.54631377 -1.97409886]]
[[-0.19931045]
 [-1.96073588]
 [-1.39225455]]


In [33]:
# numpy.dot() computes the ordinary matrix product: (5,1) by (1,5) gives an outer product, (1,5) by (5,1) an inner product
a = np.random.randn(5,1)
b = np.random.randn(5,1)
c = np.dot(a,b.T)
d = np.dot(a.T,b)

print(c)
print(d)

[[ 0.76186001 -1.19280924 -0.62954887 -0.22274173  1.14507213]
 [ 1.31865715 -2.06456097 -1.08964784 -0.38553011  1.9819357 ]
 [ 1.29139663 -2.02188042 -1.06712162 -0.37756007  1.94096326]
 [ 0.6724741  -1.05286183 -0.55568648 -0.19660836  1.01072552]
 [-0.0682342   0.10683116  0.05638407  0.01994934 -0.1025557 ]]
[[-2.66898663]]


In [48]:
# '**' and numpy.square() compute element-wise squares
a = np.random.randn(2,5)
b = a**2
c= np.square(a)

print(a)
print(b)
print(c)

[[ 1.35095658  0.08055802 -2.26520695 -0.66165286  1.81847212]
 [-0.58203692  1.09517407  2.32967058  0.87431618 -0.47255583]]
[[1.82508368 0.00648959 5.13116251 0.43778451 3.30684087]
 [0.33876698 1.19940623 5.42736501 0.76442879 0.22330901]]
[[1.82508368 0.00648959 5.13116251 0.43778451 3.30684087]
 [0.33876698 1.19940623 5.42736501 0.76442879 0.22330901]]


**numpy.where()**

[see blog](https://www.cnblogs.com/massquantity/p/8908859.html)