# Week 2: Neural Networks Basics

## Binary classification

Here we want to predict which of two classes an input belongs to. For example, we want to predict whether an image shows a cat or not. Here is the notation for the problem:

![binary classification notation](images/bin_class_notation.png)
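The figure is not reproduced as text, so the following is only a sketch of one common convention, which we assume here: each image is flattened into a feature vector of length `n_x`, the `m` examples are stacked as columns of `X`, and the labels form a row vector `Y`. The image size, the random pixel data, and the label values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 5                                                 # number of training examples (illustrative)
images = rng.integers(0, 256, size=(m, 64, 64, 3))    # fake 64x64 RGB images

# Flatten each image into one feature vector, then put one example per column.
X = images.reshape(m, -1).T          # shape (n_x, m), with n_x = 64 * 64 * 3
Y = np.array([[1, 0, 0, 1, 0]])      # shape (1, m); 1 = cat, 0 = not a cat (made-up labels)

n_x = X.shape[0]
```

With this layout, column `X[:, i]` holds the features of example `i` and `Y[0, i]` holds its label.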

## Logistic regression

![logistic regression](images/log_reg.png)
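As a minimal sketch, assuming the usual formulation $\hat y = \sigma(w^T x + b)$ with $\sigma(z) = \frac{1}{1 + e^{-z}}$, a prediction for one example could be computed as follows. The function names `sigmoid` and `predict_proba` and the numeric values are illustrative choices, not part of the original notes.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Return y_hat = sigmoid(w^T x + b) for a single feature vector x."""
    z = np.dot(w, x) + b
    return sigmoid(z)

w = np.array([0.5, -0.25])   # illustrative weights
b = 0.0                      # illustrative bias
x = np.array([2.0, 4.0])     # one example with two features

# z = 0.5 * 2 - 0.25 * 4 + 0 = 0, so y_hat = sigmoid(0) = 0.5
y_hat = predict_proba(w, b, x)
```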


### Loss (error) function

In logistic regression, we could use squared error ($\frac{1}{2}(\hat y - y)^2$) as our loss function. However, combined with the sigmoid output, the resulting optimization problem is not convex, so gradient descent may get stuck in a local minimum instead of reaching the global minimum. To obtain a convex problem, we use the following loss instead:

$$
\begin{equation}
L(\hat y, y) = -\left(y\log{(\hat y)} + (1-y)\log{(1-\hat y)}\right)
\label{eq: log loss}
\end{equation}
$$
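To get a feel for this loss, here is a small sketch that evaluates it for a confident correct prediction and a confident wrong one. The helper name `log_loss` and the probabilities are illustrative; the natural logarithm is assumed.

```python
import numpy as np

def log_loss(y_hat, y):
    """Loss for one example: -(y * log(y_hat) + (1 - y) * log(1 - y_hat))."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# True label is 1 in both cases.
loss_good = log_loss(0.9, 1)   # prediction close to the label -> small loss (about 0.105)
loss_bad = log_loss(0.1, 1)    # prediction far from the label -> large loss (about 2.303)
```

The loss grows without bound as $\hat y$ approaches the wrong label, which is why confident mistakes are penalized heavily.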

Here is how the loss function is derived:

![logistic reg loss derivation](images/lreg_loss_derivation.png)
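One common way to arrive at this loss, which may differ in notation from the figure, is to interpret $\hat y$ as $P(y=1 \mid x)$ and write both label cases in a single expression:

$$
\begin{aligned}
p(y \mid x) &= \hat y^{\,y} (1-\hat y)^{1-y} \\
\log p(y \mid x) &= y\log{(\hat y)} + (1-y)\log{(1-\hat y)} = -L(\hat y, y)
\end{aligned}
$$

Maximizing the log-probability of the correct label is therefore the same as minimizing $L(\hat y, y)$.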


**Key Difference: Loss vs. Cost Function**

While the terms are often used interchangeably, there's a subtle but important distinction.

- **Loss Function:** Measures the error for a single training example.
- **Cost Function:** Measures the average error across the entire training dataset. It's the value that you ultimately seek to minimize to train your model.

$$
\begin{equation}
J(w,b) = \frac{1}{m} \sum_{i=1}^m L(\hat y^{(i)}, y^{(i)})
\label{eq: cost function}
\end{equation}
$$
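A minimal sketch of the cost as the average of the per-example losses, assuming predictions and labels are stored as row vectors of shape `(1, m)`. The function name `cost` and the sample values are illustrative.

```python
import numpy as np

def cost(Y_hat, Y):
    """Average log loss over m examples; Y_hat and Y both have shape (1, m)."""
    m = Y.shape[1]
    losses = -(Y * np.log(Y_hat) + (1 - Y) * np.log(1 - Y_hat))   # one loss per example
    return np.sum(losses) / m

Y = np.array([[1, 0, 1]])              # made-up labels
Y_hat = np.array([[0.9, 0.2, 0.6]])    # made-up predictions
J = cost(Y_hat, Y)                     # roughly 0.28
```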

And here is how the cost function is derived, assuming the training examples are i.i.d. (independent and identically distributed):

![logistic reg cost derivation](images/lreg_cost_derivation.png)
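As a sketch of the usual maximum-likelihood argument (the figure may present it differently): under the i.i.d. assumption the probability of all labels is a product, and taking the logarithm turns it into a sum of per-example terms, where $\log p(y^{(i)} \mid x^{(i)}) = -L(\hat y^{(i)}, y^{(i)})$:

$$
\begin{aligned}
\log \prod_{i=1}^m p(y^{(i)} \mid x^{(i)}) &= \sum_{i=1}^m \log p(y^{(i)} \mid x^{(i)}) \\
&= -\sum_{i=1}^m L(\hat y^{(i)}, y^{(i)})
\end{aligned}
$$

Maximizing this log-likelihood is equivalent to minimizing the sum of losses. Scaling by $\frac{1}{m}$ does not change the minimizer and gives $J(w,b)$.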


### Computation graph

In code, the convention `dvar` denotes the derivative of the final output variable with respect to the intermediate quantity `var`. For example, $\frac{\partial J}{\partial a}$ is written as `da`.
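To illustrate the `dvar` convention, here is a small sketch on a hypothetical computation graph $J = 3(a + bc)$, chosen only for illustration. The forward pass computes intermediate values left to right, and the backward pass applies the chain rule right to left.

```python
# Forward pass through the graph J = 3 * (a + b * c)
a, b, c = 5.0, 3.0, 2.0
u = b * c        # u = 6
v = a + u        # v = 11
J = 3 * v        # J = 33

# Backward pass: each dvar holds dJ/dvar, built from the derivative of the node after it.
dv = 3.0         # dJ/dv, since J = 3 * v
da = dv * 1.0    # dJ/da = dJ/dv * dv/da = 3
du = dv * 1.0    # dJ/du = dJ/dv * dv/du = 3
db = du * c      # dJ/db = dJ/du * du/db = 3 * 2 = 6
dc = du * b      # dJ/dc = dJ/du * du/dc = 3 * 3 = 9
```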

### Logistic Regression Gradient Descent

Here is the algorithm's formulation for one training example:

![logistic reg gradient descent](images/lreg_gd.png)

$$
\begin{equation}
\begin{aligned}
w_1 &:= w_1 - \alpha (a-y) x_1 \\
w_2 &:= w_2 - \alpha (a-y) x_2 \\
b   &:= b -   \alpha (a-y)
\end{aligned}
\label{eq:lreg_gd_1m}
\end{equation}
$$
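The update above can be sketched in code for one example with two features. The starting parameters, the example values, and the learning rate `alpha` are illustrative assumptions; `a` is taken to be the sigmoid output.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2, y = 1.0, 2.0, 1.0       # one training example (made-up values)
w1, w2, b = 0.0, 0.0, 0.0       # initial parameters
alpha = 0.1                     # learning rate

# Forward pass
z = w1 * x1 + w2 * x2 + b
a = sigmoid(z)                  # 0.5 at the zero initialization

# Backward pass: dz = dL/dz = a - y
dz = a - y                      # -0.5

# One gradient descent step, matching the update rule above
w1 = w1 - alpha * dz * x1       # 0.05
w2 = w2 - alpha * dz * x2       # 0.1
b = b - alpha * dz              # 0.05
```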

With $m$ training examples, we can rewrite the gradient descent algorithm in terms of the cost function $J(w,b)$:

![logistic reg gradient descent general](images/lreg_gd_general.png)

As shown above, a direct implementation needs two for loops: one over the $m$ training examples and another over the $n$ features. In deep learning, both the number of features and the number of training examples are typically large, and explicit for loops make the algorithm very slow. That's why we need to use **vectorization**.
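For reference, a loop-based version might look like the sketch below. It assumes `X` has shape `(n, m)` with one example per column; the function name `gradients_with_loops` and the sample data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients_with_loops(w, b, X, Y):
    """Cost and gradients using explicit loops. X: (n, m), Y: (m,), w: (n,)."""
    n, m = X.shape
    J, dw, db = 0.0, np.zeros(n), 0.0
    for i in range(m):                     # loop over training examples
        z = b
        for j in range(n):                 # loop over features
            z += w[j] * X[j, i]
        a = sigmoid(z)
        J += -(Y[i] * np.log(a) + (1 - Y[i]) * np.log(1 - a))
        dz = a - Y[i]
        for j in range(n):                 # loop over features again
            dw[j] += X[j, i] * dz
        db += dz
    return J / m, dw / m, db / m

X = np.array([[1.0, 2.0, -1.0],
              [3.0, 0.5, -2.0]])           # 2 features, 3 examples (made-up data)
Y = np.array([1.0, 0.0, 1.0])
J, dw, db = gradients_with_loops(np.zeros(2), 0.0, X, Y)
```

With zero-initialized parameters every prediction is 0.5, so `J` equals $\log 2$ here.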

Here is the vectorized form of the above implementation:

![logistic reg gradient descent general vectorized](images/lreg_gd_general_vector.png)
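A vectorized step can be sketched in NumPy as follows, assuming `X` has shape `(n, m)`, `Y` has shape `(1, m)`, and `w` is a column vector of shape `(n, 1)`. The function name `propagate`, the sample data, and the learning rate are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def propagate(w, b, X, Y):
    """Cost and gradients without explicit loops. w: (n, 1), X: (n, m), Y: (1, m)."""
    m = X.shape[1]
    A = sigmoid(w.T @ X + b)                                  # predictions, shape (1, m)
    J = -np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m  # average log loss
    dZ = A - Y                                                # shape (1, m)
    dw = X @ dZ.T / m                                         # shape (n, 1)
    db = np.sum(dZ) / m                                       # scalar
    return J, dw, db

X = np.array([[1.0, 2.0, -1.0],
              [3.0, 0.5, -2.0]])       # 2 features, 3 examples (made-up data)
Y = np.array([[1.0, 0.0, 1.0]])
w, b, alpha = np.zeros((2, 1)), 0.0, 0.1

J, dw, db = propagate(w, b, X, Y)

# One gradient descent update on all parameters at once
w = w - alpha * dw
b = b - alpha * db
```

The matrix products replace both the loop over examples and the loop over features.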

## Bonus: A few notes on Python's NumPy

- When you rely on vector operations, it's recommended to avoid rank-1 arrays (shape `(n,)`) and use explicit row or column vectors (shape `(1, n)` or `(n, 1)`) instead. You can convert a rank-1 array with NumPy's `reshape()` function.
- Another common technique we use in machine learning and deep learning is to normalize our data. It often leads to better performance because gradient descent converges faster after normalization. Here, by normalization we mean changing $x$ to $\frac{x}{\|x\|}$ (dividing each row vector of $x$ by its 2-norm). We can do this using `np.linalg.norm`:

    ```Python
    import numpy as np

    x = np.array([[0, 3, 4], [2, 6, 4]])

    # 2-norm of each row; keepdims=True keeps shape (2, 1) so the division broadcasts per row
    x_norm = np.linalg.norm(x, axis=1, keepdims=True, ord=2)

    x / x_norm
    # array([[0.        , 0.6       , 0.8       ],
    #        [0.26726124, 0.80178373, 0.53452248]])
    ```
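The note on rank-1 arrays can be illustrated with a short sketch. The variable names are illustrative; the point is that a rank-1 array is neither a row nor a column vector, so transposing it does nothing.

```python
import numpy as np

a = np.arange(5.0)            # rank-1 array, shape (5,)
same_shape = a.T.shape        # still (5,): transposing a rank-1 array has no effect
inner = np.dot(a, a)          # scalar: 0 + 1 + 4 + 9 + 16 = 30.0

col = a.reshape(-1, 1)        # explicit column vector, shape (5, 1)
row = a.reshape(1, -1)        # explicit row vector, shape (1, 5)
outer = col @ row             # outer product, shape (5, 5)
```

With explicit shapes, `col @ row` and `row @ col` behave as the outer and inner products you would expect from linear algebra.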

## Bonus: Softmax

You can think of softmax as a normalizing function used when your algorithm needs to distinguish between two or more classes. You will learn more about softmax in the second course of this specialization.

**Definition**:
- for $x \in \mathbb{R}^{1\times n}$,

\begin{align*}
 softmax(x) &= softmax\left(\begin{bmatrix}
    x_1  &&
    x_2 &&
    \dots  &&
    x_n
\end{bmatrix}\right) \\&= \begin{bmatrix}
    \frac{e^{x_1}}{\sum_{j}e^{x_j}}  &&
    \frac{e^{x_2}}{\sum_{j}e^{x_j}}  &&
    \dots  &&
    \frac{e^{x_n}}{\sum_{j}e^{x_j}}
\end{bmatrix}
\end{align*}

- for a matrix $x \in \mathbb{R}^{m \times n}$, $x_{ij}$ denotes the element in the $i^{th}$ row and $j^{th}$ column of $x$, thus we have:

\begin{align*}
softmax(x) &= softmax\begin{bmatrix}
            x_{11} & x_{12} & x_{13} & \dots  & x_{1n} \\
            x_{21} & x_{22} & x_{23} & \dots  & x_{2n} \\
            \vdots & \vdots & \vdots & \ddots & \vdots \\
            x_{m1} & x_{m2} & x_{m3} & \dots  & x_{mn}
            \end{bmatrix} \\ \\&= 
 \begin{bmatrix}
    \frac{e^{x_{11}}}{\sum_{j}e^{x_{1j}}} & \frac{e^{x_{12}}}{\sum_{j}e^{x_{1j}}} & \frac{e^{x_{13}}}{\sum_{j}e^{x_{1j}}} & \dots  & \frac{e^{x_{1n}}}{\sum_{j}e^{x_{1j}}} \\
    \frac{e^{x_{21}}}{\sum_{j}e^{x_{2j}}} & \frac{e^{x_{22}}}{\sum_{j}e^{x_{2j}}} & \frac{e^{x_{23}}}{\sum_{j}e^{x_{2j}}} & \dots  & \frac{e^{x_{2n}}}{\sum_{j}e^{x_{2j}}} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    \frac{e^{x_{m1}}}{\sum_{j}e^{x_{mj}}} & \frac{e^{x_{m2}}}{\sum_{j}e^{x_{mj}}} & \frac{e^{x_{m3}}}{\sum_{j}e^{x_{mj}}} & \dots  & \frac{e^{x_{mn}}}{\sum_{j}e^{x_{mj}}}
\end{bmatrix} \\ \\ &= \begin{pmatrix}
    softmax\text{(first row of x)}  \\
    softmax\text{(second row of x)} \\
    \vdots  \\
    softmax\text{(last row of x)} \\
\end{pmatrix} 
\end{align*}
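A row-wise softmax matching the definition above can be sketched as follows. Subtracting each row's maximum before exponentiating is an addition not in the notes: it leaves the result unchanged mathematically and avoids overflow for large inputs. The function name `softmax` follows the notation above; the sample matrix is made up.

```python
import numpy as np

def softmax(x):
    """Apply softmax independently to each row of x (a 1-D input is treated as one row)."""
    x = np.atleast_2d(x)
    shifted = x - np.max(x, axis=1, keepdims=True)    # numerical stability; result is unchanged
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)   # broadcasting divides each row by its sum

s = softmax(np.array([[1.0, 2.0, 3.0],
                      [0.0, 0.0, 0.0]]))
row_sums = s.sum(axis=1)    # each row sums to 1
```

A row of equal inputs maps to equal probabilities, so the second row above becomes one third in every position.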

## Bonus: `copy.deepcopy()`

The line of Python code `w = copy.deepcopy(w)` creates a **completely independent copy** of the object `w` and rebinds the name `w` to that copy.

Let's break down why this is important and what it does:

* **What is `copy.deepcopy()`?**
    * Python's `copy` module provides functions for creating copies of objects.
    * `copy.deepcopy()` performs a **deep copy**. This means it not only copies the object itself but also recursively copies all objects contained within it.

* **Why use `deepcopy` instead of a simple assignment (`w2 = w`) or a shallow copy (`w2 = copy.copy(w)`)?**

    * **Simple Assignment (`w2 = w`):** This doesn't create a copy at all. It simply makes another variable name (`w2`) point to the *exact same object* in memory. Any change made through either name is visible through the other because both reference the identical object.

    * **Shallow Copy (`w2 = copy.copy(w)`):** A shallow copy creates a new object, but it inserts references into the new object to the *original objects* found in the original. If the original object contains other mutable objects (like lists or dictionaries), the shallow copy will still share those nested objects with the original. Changing a nested mutable object in the copy will also change it in the original, and vice versa.

    * **Deep Copy (`w = copy.deepcopy(w)`):** This is where `deepcopy` shines. It creates a new compound object and then, recursively, inserts *copies* of the objects found in the original into the new one. This ensures that the new object and all its nested objects are entirely separate from the original. Changes made to the deep copy will **never** affect the original object, and vice-versa.

**In essence, `w = copy.deepcopy(w)` is used when you need to:**

1.  **Modify a complex data structure** (like a list of lists, a dictionary of dictionaries, or custom objects) without altering the original.
2.  **Preserve the original state** of an object while experimenting with modifications on a separate version.

Think of it like this:

* **Assignment:** Giving someone a key to your house. They can change anything inside.
* **Shallow Copy:** Making a photocopy of a binder. The pages are new, but if a page contains a smaller, nested binder, that smaller binder is still the original one.
* **Deep Copy:** Making a photocopy of a binder, and then photocopying *every single page* and *every single item* within any nested binders, creating entirely new, separate copies of everything.

This is a very common and useful technique in programming to prevent unintended side effects when working with mutable data structures.
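The three behaviors can be compared in a short sketch. The dictionary `original` and its contents are made up for illustration.

```python
import copy

original = {"w": [0.1, 0.2], "b": 0.0}

alias = original                   # assignment: same object, no copy
shallow = copy.copy(original)      # new dict, but the nested list is shared
deep = copy.deepcopy(original)     # new dict and a new nested list

alias["b"] = 1.0                   # visible through original: same object
shallow["w"].append(0.3)           # visible through original: nested list is shared
shallow["b"] = 5.0                 # not visible through original: top level was copied
deep["w"].append(0.4)              # not visible through original: fully independent

# original is now {"w": [0.1, 0.2, 0.3], "b": 1.0}
# deep is {"w": [0.1, 0.2, 0.4], "b": 0.0}
```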

- A lower training cost doesn't necessarily mean a better model. You have to check whether the model is overfitting.
- Overfitting shows up when the training accuracy is a lot higher than the test accuracy.
- In deep learning, we usually recommend that you:
    - Choose the learning rate that best minimizes the cost function.
    - If your model overfits, use other techniques to reduce overfitting.