# Logistic Regression Cost Function
## Logistic Regression Model
- $\hat y = \sigma (w^{T}X + b)$, where $\sigma (z) = \frac{1}{1 + e^{-z}}$
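
The model above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the helper name `sigmoid` and the values of `w`, `b`, and `x` are assumptions chosen for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + e^{-z}), applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and a single input with 2 features.
w = np.array([[0.5], [-0.25]])   # weights, shape (2, 1)
b = 0.1                          # bias (scalar)
x = np.array([[2.0], [4.0]])     # one input example, shape (2, 1)

z = np.dot(w.T, x) + b           # shape (1, 1); here z = 0.1
y_hat = sigmoid(z)               # predicted probability that y = 1
print(y_hat)
```

Because $\sigma(0) = 0.5$, a positive $z$ gives $\hat y > 0.5$ and a negative $z$ gives $\hat y < 0.5$.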

## Loss Function
- The loss function computes the error for a single training example.
- $L(\hat y, y) = -(y\log(\hat y) + (1-y)\log(1-\hat y))$

## Cost Function
- The cost function is the average of the losses over the entire training set.
- $J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(\hat y^{(i)}, y^{(i)}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)}\log(\hat y^{(i)}) + (1-y^{(i)})\log(1-\hat y^{(i)}) \right]$
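
As a rough sketch, the loss and cost can be computed with NumPy as shown here. The function names `loss` and `cost` and the sample labels and predictions are made up for illustration.

```python
import numpy as np

def loss(y_hat, y):
    """Per-example loss L(y_hat, y); works element-wise on arrays."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(y_hat, y):
    """Cost J: the average of the per-example losses."""
    return np.mean(loss(y_hat, y))

# Hypothetical labels and predicted probabilities for m = 4 examples.
y = np.array([1, 0, 1, 0])
y_hat = np.array([0.9, 0.1, 0.8, 0.3])

print(loss(y_hat, y))   # one loss value per example
print(cost(y_hat, y))   # a single scalar
```

Confident correct predictions (e.g. $\hat y = 0.9$ when $y = 1$) contribute a small loss, while less confident ones (e.g. $\hat y = 0.3$ when $y = 0$) contribute more.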

# Gradient Descent
- We want to find $(w, b)$ that minimize $J(w, b)$.
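- Note that the cost function needs to be convex so that gradient descent can reach the global minimum; the logistic loss above makes $J(w, b)$ convex.
- Repeat until convergence, where $\alpha$ is the learning rate: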
    - $w := w - \alpha \frac{\partial J(w, b)}{\partial w}$
        - By coding convention, `dw` represents the term $\frac{\partial J(w, b)}{\partial w}$.
    - $b := b - \alpha \frac{\partial J(w, b)}{\partial b}$
        - By coding convention, `db` represents the term $\frac{\partial J(w, b)}{\partial b}$.
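
To make the update rule concrete, here is a minimal sketch on a hypothetical convex cost $J(w, b) = (w - 3)^2 + (b + 1)^2$. This is not the logistic regression cost; it is chosen only because its minimizer $(3, -1)$ is known. The learning rate and iteration count are assumed values.

```python
def cost(w, b):
    # Hypothetical convex cost with its minimum at (w, b) = (3, -1).
    return (w - 3.0) ** 2 + (b + 1.0) ** 2

def gradients(w, b):
    dw = 2.0 * (w - 3.0)   # dJ/dw
    db = 2.0 * (b + 1.0)   # dJ/db
    return dw, db

alpha = 0.1        # learning rate (assumed value)
w, b = 0.0, 0.0    # arbitrary starting point

for _ in range(200):
    dw, db = gradients(w, b)
    w = w - alpha * dw
    b = b - alpha * db

print(w, b, cost(w, b))   # w approaches 3, b approaches -1
```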

# Derivatives with a Computation Graph
- Use the chain rule to derive partial derivatives of the final output variable with respect to various intermediate quantities.
- By coding convention, use `dvar` to represent $\frac{\partial Var_{final}}{\partial Var_{intermediate}}$, i.e. the derivative of the final output variable with respect to the intermediate variable `var`.
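
A small worked example may help. The graph used here, $u = bc$, $v = a + u$, $J = 3v$, and its input values are hypothetical and serve only to show the forward pass followed by the chain rule applied backward.

```python
# Hypothetical computation graph: u = b * c, v = a + u, J = 3 * v
a, b, c = 5.0, 3.0, 2.0

# Forward pass: compute the output from left to right.
u = b * c          # 6.0
v = a + u          # 11.0
J = 3.0 * v        # 33.0

# Backward pass: apply the chain rule from right to left.
dv = 3.0           # dJ/dv
du = dv * 1.0      # dJ/du = dJ/dv * dv/du
da = dv * 1.0      # dJ/da = dJ/dv * dv/da
db = du * c        # dJ/db = dJ/du * du/db
dc = du * b        # dJ/dc = dJ/du * du/dc

print(J, da, db, dc)   # prints: 33.0 3.0 6.0 9.0
```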

# Logistic Regression Gradient Descent (on *one* Example)
- Logistic Regression Recap
    - $z = w^{T} X + b$
        - E.g. $z = w_{1}x_{1} + w_{2}x_{2} + b$
    - $\hat y = a = \sigma (z)$
    - $L(a, y) = -(y\log(a) + (1-y)\log(1-a))$
- Logistic Regression Derivatives
    - $da = \frac{\partial L(a, y)}{\partial a} = -\frac{y}{a} + \frac{1-y}{1-a}$
    - $dz = \frac{\partial L(a, y)}{\partial z} = \frac{\partial L(a, y)}{\partial a} \times \frac{\partial a}{\partial z} = a-y$
    - $dw_{i} = \frac{\partial L(a, y)}{\partial w_{i}} = x_{i}dz$
    - $db = \frac{\partial L(a, y)}{\partial b} = dz$
- Then, for Gradient Descent:
    - $w_{i} := w_{i} - \alpha dw_{i}$
    - $b := b - \alpha db$
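
The derivatives above can be sanity-checked numerically. The sketch below computes `dz`, `dw`, and `db` from the formulas and compares them with central finite differences of the loss. The parameter values, the step size `eps`, and the helper names are assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Loss L(a, y) for one example, with a = sigmoid(w^T x + b)."""
    a = sigmoid((np.dot(w.T, x) + b).item())
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

# Hypothetical parameters and one training example with 2 features.
w = np.array([[0.2], [-0.4]])
b = 0.1
x = np.array([[1.5], [0.5]])
y = 1.0

# Analytic derivatives from the formulas above.
a = sigmoid((np.dot(w.T, x) + b).item())
dz = a - y
dw = x * dz        # dw_i = x_i * dz, shape (2, 1)
db = dz

# Numerical derivatives by central finite differences.
eps = 1e-6
dw_numeric = np.zeros_like(w)
for i in range(w.shape[0]):
    step = np.zeros_like(w)
    step[i, 0] = eps
    dw_numeric[i, 0] = (loss(w + step, b, x, y) - loss(w - step, b, x, y)) / (2 * eps)
db_numeric = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)

print(np.allclose(dw, dw_numeric, atol=1e-6))
print(np.isclose(db, db_numeric, atol=1e-6))
```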

# Logistic Regression Gradient Descent (on *m* Examples)
- Logistic Regression Recap
    - $z = w^{T} X + b$
    - $\hat y = a = \sigma (z)$
    - $L(a, y) = -(y\log(a) + (1-y)\log(1-a))$
    - $J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L(a^{(i)}, y^{(i)})$
- Gradient Descent on *m* examples (one iteration):
    - Initialize $J = 0$, $dw_{j} = 0$ for every feature $j$, and $db = 0$.
    - For $i = 1 : m$:
        - $z^{(i)} = w^{T}X^{(i)} + b$
        - $a^{(i)} = \sigma (z^{(i)})$
        - $J = J + L(a^{(i)}, y^{(i)})$
        - $dz^{(i)} = a^{(i)} - y^{(i)}$
        - $dw_{j} = dw_{j} + x_{j}^{(i)}dz^{(i)}$
        - $db = db + dz^{(i)}$
    - $J = \frac{J}{m}$
    - $dw_{j} = \frac{dw_{j}}{m}$
    - $db = \frac{db}{m}$
    - $w_{j} := w_{j} - \alpha dw_{j}$
    - $b := b - \alpha db$
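
The steps above can be sketched directly with explicit loops. The function name `gradient_step_loop`, the toy data set, the learning rate, and the number of iterations are all assumptions for illustration. Examples are stored as columns of `X`, and the accumulators are reset to zero at the start of every step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step_loop(w, b, X, Y, alpha):
    """One gradient descent step using explicit loops over examples and features."""
    n_x, m = X.shape
    J, db = 0.0, 0.0
    dw = np.zeros((n_x, 1))
    for i in range(m):
        x_i = X[:, i:i + 1]                      # i-th example, shape (n_x, 1)
        y_i = Y[0, i]
        z_i = (np.dot(w.T, x_i) + b).item()
        a_i = sigmoid(z_i)
        J += -(y_i * np.log(a_i) + (1 - y_i) * np.log(1 - a_i))
        dz_i = a_i - y_i
        for j in range(n_x):                     # loop over features
            dw[j, 0] += X[j, i] * dz_i
        db += dz_i
    J, dw, db = J / m, dw / m, db / m
    return w - alpha * dw, b - alpha * db, J

# Hypothetical toy data: 2 features, 4 examples.
X = np.array([[0.5, 1.5, -1.0, -2.0],
              [1.0, 2.0, -0.5, -1.5]])
Y = np.array([[1, 1, 0, 0]])

w, b = np.zeros((2, 1)), 0.0
costs = []
for _ in range(100):
    w, b, J = gradient_step_loop(w, b, X, Y, alpha=0.1)
    costs.append(J)

print(costs[0], costs[-1])   # the cost should decrease over the iterations
```

With all parameters at zero, every prediction is 0.5, so the first recorded cost equals $\log 2$.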

# Vectorization
- Vectorization replaces explicit for-loops with array operations, which can exploit parallel (SIMD) instructions on both CPUs and GPUs.

In [1]:
import numpy as np

a = np.array([1,2,3,4])
print(a)

[1 2 3 4]


In [3]:
import time

a = np.random.rand(1000000)
b = np.random.rand(1000000)

tic = time.time()
c = np.dot(a, b)
toc = time.time()

print(c)

print("Vectorized Version: " + str(1000*(toc - tic)) + 'ms')

c = 0
tic = time.time()
for i in range(1000000):
    c += a[i]*b[i]
toc = time.time()

print(c)

print("For Loop: " + str(1000*(toc - tic)) + 'ms')

249962.74545114438
Vectorized Version: 0.957489013671875ms
249962.74545113952
For Loop: 715.6250476837158ms


## Review Logistic Regression Gradient Descent on *m* examples
- For-Loop Style:
    - For $i = 1 : m$:
        - $z^{(i)} = w^{T}X^{(i)} + b$
        - $a^{(i)} = \sigma (z^{(i)})$
        - $J = J + L(a^{(i)}, y^{(i)})$
        - $dz^{(i)} = a^{(i)} - y^{(i)}$
        - $dw_{j} = dw_{j} + x_{j}^{(i)}dz^{(i)}$
        - $db = db + dz^{(i)}$
    - $J = \frac{J}{m}$
    - $dw_{j} = \frac{dw_{j}}{m}$
    - $db = \frac{db}{m}$
    - $w_{j} := w_{j} - \alpha dw_{j}$
    - $b := b - \alpha db$
- With vectorization, the inner loop over features $j$ is replaced by a single vector update:
    - Initialization:
        - $dw = np.zeros((n_{x}, 1))$, where $n_{x}$ is the number of features
    - Within the loop:
        - $dw = dw + X^{(i)}dz^{(i)}$
    - After the loop:
        - $dw = dw / m$
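
As a sketch of this first vectorization step, the code below accumulates `dw` as a whole column vector and checks it against the explicit loop over features. The random data and the stand-in values for $dz^{(i)}$ are assumptions; in logistic regression they would be $a^{(i)} - y^{(i)}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, m = 3, 5
X = rng.standard_normal((n_x, m))   # examples stored as columns
dz = rng.standard_normal(m)         # stand-in for a^{(i)} - y^{(i)}

# Vector update: no loop over features.
dw = np.zeros((n_x, 1))
for i in range(m):
    dw += X[:, i:i + 1] * dz[i]
dw /= m

# Reference: explicit loop over features j.
dw_ref = np.zeros((n_x, 1))
for i in range(m):
    for j in range(n_x):
        dw_ref[j, 0] += X[j, i] * dz[i]
dw_ref /= m

print(np.allclose(dw, dw_ref))
```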

# Vectorizing Logistic Regression
- $Z = [z^{(1)}, z^{(2)}, ..., z^{(m)}] = w^{T}X + [b, b, ..., b] = [w^{T}X^{(1)} + b, w^{T}X^{(2)} + b, ..., w^{T}X^{(m)} + b]$
    - $Z = np.dot(w.T, X) + b$
- $A = [a^{(1)}, a^{(2)}, ..., a^{(m)}] = \sigma (Z)$
    - $A = 1/(1 + np.exp(-Z))$
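
A short check, under assumed random data, that the vectorized forward pass matches a per-example loop. The scalar `b` is added to every entry of the row vector by broadcasting.

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, m = 3, 4
X = rng.standard_normal((n_x, m))   # shape (n_x, m), one example per column
w = rng.standard_normal((n_x, 1))   # shape (n_x, 1)
b = 0.5

# Vectorized forward pass over all m examples at once.
Z = np.dot(w.T, X) + b              # shape (1, m)
A = 1 / (1 + np.exp(-Z))            # shape (1, m)

# Reference: compute each z^{(i)} separately.
Z_loop = np.array([[(np.dot(w.T, X[:, i:i + 1]) + b).item() for i in range(m)]])

print(Z.shape, np.allclose(Z, Z_loop))
```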

# Vectorizing Logistic Regression's Gradient Output
- $dZ = [dz^{(1)}, dz^{(2)}, ..., dz^{(m)}]$
    - $dZ = A - Y$
- $db = \frac{1}{m} \sum_{i=1}^{m}dz^{(i)}$
    - $db = np.sum(dZ) / m$
- $dw = \frac{1}{m} X dZ^{T}$
    - $dw = np.dot(X, dZ.T)/m$
    
## One Iteration of Vectorized Gradient Descent
- $Z = np.dot(w.T, X) + b$
- $A = \sigma(Z) = 1/(1 + np.exp(-Z))$
- $dZ = A - Y$
- $dw = np.dot(X, dZ.T)/m$
- $db = np.sum(dZ) / m$
- $w := w - \alpha dw$
- $b := b - \alpha db$
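
Putting the steps together, one iteration might look like the sketch below. The function name `gradient_step`, the toy data, the learning rate, and the iteration count are assumptions. Note that a loop over iterations is still needed; vectorization only removes the loops over examples and features.

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def gradient_step(w, b, X, Y, alpha):
    """One vectorized gradient descent iteration for logistic regression."""
    m = X.shape[1]
    Z = np.dot(w.T, X) + b        # (1, m)
    A = sigmoid(Z)                # (1, m)
    dZ = A - Y                    # (1, m)
    dw = np.dot(X, dZ.T) / m      # (n_x, 1)
    db = np.sum(dZ) / m           # scalar
    return w - alpha * dw, b - alpha * db

# Hypothetical toy data: 2 features, 4 examples.
X = np.array([[0.5, 1.5, -1.0, -2.0],
              [1.0, 2.0, -0.5, -1.5]])
Y = np.array([[1, 1, 0, 0]])

w, b = np.zeros((2, 1)), 0.0
for _ in range(1000):
    w, b = gradient_step(w, b, X, Y, alpha=0.1)

# Threshold the predicted probabilities at 0.5 to get class labels.
predictions = (sigmoid(np.dot(w.T, X) + b) > 0.5).astype(int)
print(predictions)
```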

# Broadcasting in Python
- General Principle with NumPy
    - $(m, n) \langle operation \rangle c \Rightarrow (m, n)$: the scalar $c$ is applied element-wise.
    - $(m, n) \langle operation \rangle (1, n) \Rightarrow (m, n)$: the row vector is applied to every row.
    - $(m, n) \langle operation \rangle (m, 1) \Rightarrow (m, n)$: the column vector is applied to every column.
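
The three cases can be illustrated with a small example. The array values are made up, and the last step (column-wise percentages) is one assumed use case.

```python
import numpy as np

M = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])           # shape (2, 3)
row = np.array([[10.0, 20.0, 30.0]])      # shape (1, 3)
col = np.array([[100.0], [200.0]])        # shape (2, 1)

print(M + 1)      # scalar added to every element
print(M + row)    # row vector added to every row
print(M + col)    # column vector added to every column

# Example use: express each entry as a percentage of its column sum.
col_sums = M.sum(axis=0, keepdims=True)   # shape (1, 3)
percent = 100 * M / col_sums              # (2, 3) / (1, 3) -> (2, 3)
print(percent)
```

Passing `keepdims=True` keeps `col_sums` as a $(1, 3)$ row vector instead of a rank 1 array, which makes the broadcast shape explicit.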

# A Note on Python/Numpy Vectors
- When initializing NumPy arrays, it is possible to execute the code below.
- The issue with vectors like this is that they are **rank 1** arrays, which do not behave consistently as either row or column vectors.
- E.g. `np.dot(a, a.T)` returns the inner product (a scalar) rather than the outer (matrix) product, because transposing a rank 1 array has no effect.

In [1]:
import numpy as np

In [5]:
a = np.random.randn(5)
print(a)
print(a.shape)

[ 0.72406734  0.28408294  1.49804334 -0.16042441  0.10745449]
(5,)


In [3]:
np.dot(a, a.T)

4.094294974584723

- To return a $5 \times 5$ matrix product, use a column vector of shape $(5, 1)$ instead.

In [6]:
a = np.random.randn(5, 1)
print(a)
print(a.shape)

[[ 0.99816932]
 [-1.21375599]
 [ 0.77200461]
 [-0.25038699]
 [-0.02744575]]
(5, 1)


In [7]:
np.dot(a, a.T)

array([[ 9.96341997e-01, -1.21153400e+00,  7.70591320e-01,
        -2.49928609e-01, -2.73955094e-02],
       [-1.21153400e+00,  1.47320361e+00, -9.37025225e-01,
         3.03908706e-01,  3.33124481e-02],
       [ 7.70591320e-01, -9.37025225e-01,  5.95991120e-01,
        -1.93299908e-01, -2.11882484e-02],
       [-2.49928609e-01,  3.03908706e-01, -1.93299908e-01,
         6.26936430e-02,  6.87205956e-03],
       [-2.73955094e-02,  3.33124481e-02, -2.11882484e-02,
         6.87205956e-03,  7.53269397e-04]])

In [8]:
np.matmul(a, a.T)

array([[ 9.96341997e-01, -1.21153400e+00,  7.70591320e-01,
        -2.49928609e-01, -2.73955094e-02],
       [-1.21153400e+00,  1.47320361e+00, -9.37025225e-01,
         3.03908706e-01,  3.33124481e-02],
       [ 7.70591320e-01, -9.37025225e-01,  5.95991120e-01,
        -1.93299908e-01, -2.11882484e-02],
       [-2.49928609e-01,  3.03908706e-01, -1.93299908e-01,
         6.26936430e-02,  6.87205956e-03],
       [-2.73955094e-02,  3.33124481e-02, -2.11882484e-02,
         6.87205956e-03,  7.53269397e-04]])

- Conclusion
    - For neural networks, do not use rank 1 arrays of shape $(n,)$.
    - Instead, always use explicit row vectors of shape $(1, n)$ or column vectors of shape $(n, 1)$.
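
One way to follow this advice is sketched below: reshape a rank 1 array into an explicit column or row vector and assert the expected shape. The variable names are illustrative.

```python
import numpy as np

a = np.random.randn(5)         # rank 1 array, shape (5,)

a_col = a.reshape(5, 1)        # explicit column vector
a_row = a.reshape(1, 5)        # explicit row vector

# A cheap assertion documents and enforces the intended shape.
assert a_col.shape == (5, 1)

outer = np.dot(a_col, a_col.T)   # (5, 1) x (1, 5) -> (5, 5) matrix
inner = np.dot(a_col.T, a_col)   # (1, 5) x (5, 1) -> (1, 1) matrix

print(outer.shape, inner.shape)
```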

# Explanation of Logistic Regression Cost Function
- For logistic regression, we defined $\hat y = P(y=1|X) = \sigma (w^{T} X + b)$.
- In other words, we have:
    - If $y = 1$: $p(y|X) = \hat y$
    - If $y = 0$: $p(y|X) = 1 - \hat y$
- Both cases can be combined into a single expression:
    - $p(y|X) = \hat y^{y} (1 - \hat y)^{1 - y}$
- For a training set of *m* examples (assuming they are i.i.d.), the conditional log-likelihood (LCL) function is:
    - $LCL((w, b); y|X) = \log(\prod_{i=1}^{m} p_{(w, b)}(y^{(i)}|X^{(i)})) = \sum_{i=1}^{m} \log(p_{(w, b)}(y^{(i)}|X^{(i)}))$
- Since $\log(p_{(w, b)}(y^{(i)}|X^{(i)})) = y^{(i)}\log(\hat y^{(i)}) + (1 - y^{(i)})\log(1 - \hat y^{(i)}) = -L_{(w, b)}(\hat y^{(i)}, y^{(i)})$, we then have:
    - $LCL((w, b); y|X) = - \sum_{i=1}^{m} L_{(w, b)}(\hat y^{(i)}, y^{(i)})$
- Thus, finding the maximum likelihood estimate (MLE) of $(w, b)$, i.e. maximizing $LCL((w, b); y|X)$, is equivalent to minimizing $\sum_{i=1}^{m} L_{(w, b)}(\hat y^{(i)}, y^{(i)})$.
- Scaling by $\frac{1}{m}$, which does not change the minimizer, gives the cost function to minimize:
    - $J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L_{(w, b)}(\hat y^{(i)}, y^{(i)})$
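
The relationship between the log-likelihood and the cost can be checked numerically. In this sketch the labels and predicted probabilities are made-up values; it verifies that the conditional log-likelihood equals $-m$ times the cost.

```python
import numpy as np

# Hypothetical labels and predicted probabilities for m = 5 examples.
y = np.array([[1, 0, 1, 1, 0]])
y_hat = np.array([[0.9, 0.2, 0.7, 0.6, 0.1]])
m = y.shape[1]

# p(y|X) = y_hat^y * (1 - y_hat)^(1 - y) for each example.
p = y_hat ** y * (1 - y_hat) ** (1 - y)

# Conditional log-likelihood: sum of the log-probabilities.
log_likelihood = np.sum(np.log(p))

# Cost: average of the per-example losses.
J = np.mean(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

print(np.isclose(log_likelihood, -m * J))
```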