# Basic Functions with Numpy

This exercise gives a brief introduction to Python and [NumPy](https://numpy.org). Even if you've used Python before, this will help familiarize you with functions we'll need.

**Goals:**
- Be able to use iPython Notebooks
- Be able to use numpy functions and numpy matrix/vector operations
- Understand the concept of "broadcasting"
- Be able to vectorize code

iPython Notebooks are interactive coding environments embedded in a webpage.
You will use iPython notebooks on Google Colab in this class.

## 1 - Basic functions ##

Numpy is the main package for scientific computing in Python. It is maintained by a large community (www.numpy.org). In this exercise you will learn several key numpy functions such as `np.exp()`, `np.log()`, and `np.reshape()`.

### 1.1 - Sigmoid function ###

Before using `np.exp()`, you will use `math.exp()` to implement the sigmoid function. You will then see why `np.exp()` is preferable to `math.exp()`.

**Exercise**: Build a function that returns the sigmoid of a real number x. Use `math.exp(x)` for the exponential function.

**Reminder**:
$sigmoid(x) = \frac{1}{1+e^{-x}}$ is sometimes also known as the logistic function. It is a non-linear function used not only in Machine Learning (Logistic Regression), but also in Deep Learning.

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Sigmoid-function-2.svg/2880px-Sigmoid-function-2.svg.png">


In [None]:
import math

def basic_sigmoid(x):
    """
    Compute sigmoid of x.
    Arguments: x -- A scalar
    Return: s -- sigmoid(x)
    """

    ### START CODE HERE ###
    s = None
    ### END CODE HERE ###

    return s

In [None]:
basic_sigmoid(3)

In practice, we rarely use the `math` library in deep learning because its functions accept only real numbers (scalars), while deep learning mostly works with vectors and matrices. This is why numpy is more useful.

In [None]:
x = [1, 2, 3]
basic_sigmoid(x) # this raises an error when you run it, because x is a vector (a Python list), not a scalar.

In fact, if $ x = (x_1, x_2, ..., x_n)$ is a row vector then $np.exp(x)$ will apply the exponential function to every element of x. The output will thus be: $np.exp(x) = (e^{x_1}, e^{x_2}, ..., e^{x_n})$

In [None]:
import numpy as np

x = np.array([1, 2, 3])
print(np.exp(x)) # result is (exp(1), exp(2), exp(3))

Furthermore, if x is a vector, then a Python operation such as $s = x + 3$ or $s = \frac{1}{x}$ will output s as a vector of the same size as x.
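
As a small illustration of both operations, the sketch below uses an array chosen only for this example (the variable names are not part of the exercise). Note that `1 / x` uses true division, so the result is a float array even when `x` holds integers.

```python
import numpy as np

x = np.array([1, 2, 4])

# Arithmetic between an array and a scalar is applied element by element.
shifted = x + 3       # adds 3 to every element
reciprocal = 1 / x    # true division, so the result is a float array

print(shifted)
print(reciprocal)
print(reciprocal.shape == x.shape)  # the output has the same shape as x
```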

In [None]:
# example of vector operation
x = np.array([1, 2, 3])
print (x + 3)

Any time you need more info on a numpy function, we encourage you to look at [the official documentation](https://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.exp.html).

**Exercise**: Implement the sigmoid function using numpy.

x could now be either a real number, a vector, or a matrix. The data structures we use in numpy to represent these shapes (vectors, matrices...) are called numpy arrays.

$$ \text{For } x \in \mathbb{R}^n \text{,     } sigmoid(x) = sigmoid\begin{pmatrix}
    x_1  \\
    x_2  \\
    ...  \\
    x_n  \\
\end{pmatrix} = \begin{pmatrix}
    \frac{1}{1+e^{-x_1}}  \\
    \frac{1}{1+e^{-x_2}}  \\
    ...  \\
    \frac{1}{1+e^{-x_n}}  \\
\end{pmatrix}\tag{1} $$

In [None]:
import numpy as np

def sigmoid(x):
    """
    Compute the sigmoid of x
    Arguments: x -- A scalar or numpy array of any size
    Return: s -- sigmoid(x)
    """

    ### START CODE HERE ###
    s = None
    ### END CODE HERE ###

    return s

In [None]:
x = np.array([1, 2, 3])
sigmoid(x)

### 1.2 - Sigmoid gradient

In deep learning, you need to compute gradients to optimize loss functions using backpropagation.

**Exercise**: Implement the function `sigmoid_grad()` to compute the gradient of the sigmoid function with respect to its input `x`. The formula is:

$$sigmoid\_grad(x) = \sigma'(x) = \sigma(x) (1 - \sigma(x))\tag{2}$$

You might find your `sigmoid(x)` function useful.
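
As a reminder of where formula (2) comes from, here is a short derivation using the chain rule on $\sigma(x) = (1+e^{-x})^{-1}$:

$$\sigma'(x) = \frac{e^{-x}}{(1+e^{-x})^2} = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}} = \sigma(x)\,(1 - \sigma(x)), \quad \text{since } 1 - \sigma(x) = \frac{e^{-x}}{1+e^{-x}}$$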

In [None]:
def sigmoid_grad(x):
    """
    Compute the gradient (also called the slope or derivative) of the sigmoid function with respect to its input x.
    You can store the output of the sigmoid function into variables and then use it to calculate the gradient.

    Arguments: x -- A scalar or numpy array
    Return: ds -- Your computed gradient.
    """

    ### START CODE HERE ###
    s = None
    ds = None
    ### END CODE HERE ###

    return ds

In [None]:
x = np.array([1, 2, 3])
print ("sigmoid_grad(x) = " + str(sigmoid_grad(x)))

### 1.3 - Reshaping arrays ###

Two common numpy functions used in deep learning are [np.shape](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.shape.html) and [np.reshape()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html).
- X.shape is used to get the shape (dimension) of a matrix/vector X.
- X.reshape(...) is used to reshape X into some other dimension.
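
A minimal sketch of both operations on a small array (the array and the names `a`, `b`, `c` are made up for illustration). Passing `-1` for one dimension asks numpy to infer it from the total number of elements.

```python
import numpy as np

a = np.arange(6)          # 1D array: [0 1 2 3 4 5]
print(a.shape)            # (6,)

b = a.reshape(2, 3)       # same 6 elements arranged as 2 rows and 3 columns
print(b.shape)            # (2, 3)

# -1 lets numpy infer that dimension from the total number of elements.
c = b.reshape(-1, 1)      # column vector
print(c.shape)            # (6, 1)
```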

For example, in computer science, an image is represented by a 3D array of shape $(length, height, depth = 3)$. However, when you read an image as the input of an algorithm you convert it to a vector of shape $(length*height*3, 1)$. In other words, you "unroll", or reshape, the 3D array into a 1D vector.

<img src="https://miro.medium.com/max/2000/1*CSzLzsw3fC6_cE1VOuTFCA.png" style="width:500px;height:300;">

**Exercise**: Implement `image2vector()` that takes an input of shape (length, height, 3) and returns a vector of shape (length\*height\*3, 1).

In [None]:
def image2vector(image):
    """
    Argument: image -- a numpy array of shape (length, height, depth)
    Returns: v -- a vector of shape (length*height*depth, 1)
    """

    ### START CODE HERE ###
    v = None
    ### END CODE HERE ###

    return v

In [None]:
# This is a 3 by 3 by 2 array. Real images typically have shape (num_px_x, num_px_y, 3), where 3 represents the RGB values
image = np.array([[[ 0.67826139,  0.29380381],
        [ 0.90714982,  0.52835647],
        [ 0.4215251 ,  0.45017551]],

       [[ 0.92814219,  0.96677647],
        [ 0.85304703,  0.52351845],
        [ 0.19981397,  0.27417313]],

       [[ 0.60659855,  0.00533165],
        [ 0.10820313,  0.49978937],
        [ 0.34144279,  0.94630077]]])

print ("image2vector(image) = " + str(image2vector(image)))

### 1.4 - Normalizing rows

Another common technique we use in Machine Learning and Deep Learning is to normalize our data. It often leads to better performance because gradient descent converges faster after normalization. Here, by normalization we mean changing x to $ \frac{x}{\| x\|} $ (dividing each row vector of x by its norm).

For example, if $$x =
\begin{bmatrix}
    0 & 3 & 4 \\
    2 & 6 & 4 \\
\end{bmatrix}\tag{3}$$ then $$\| x\| = np.linalg.norm(x, axis = 1, keepdims = True) = \begin{bmatrix}
    5 \\
    \sqrt{56} \\
\end{bmatrix}\tag{4} $$and        $$ x\_normalized = \frac{x}{\| x\|} = \begin{bmatrix}
    0 & \frac{3}{5} & \frac{4}{5} \\
    \frac{2}{\sqrt{56}} & \frac{6}{\sqrt{56}} & \frac{4}{\sqrt{56}} \\
\end{bmatrix}\tag{5}$$ Note that you can divide matrices of different sizes and it works fine: this is called *broadcasting*.


**Exercise**: Implement `normalizeRows()` to normalize the rows of a matrix. After applying this function to an input matrix `x`, each row of `x` should be a vector of unit length (meaning length 1).

In [None]:
def normalizeRows(x):
    """
    Implement a function that normalizes each row of the matrix x (to have unit length).

    Argument: x -- A numpy matrix of shape (n, m)
    Returns: x -- The normalized (by row) numpy matrix. You are allowed to modify x.
    """

    ### START CODE HERE ###
    # Compute x_norm as the norm 2 of x. Use np.linalg.norm(..., ord = 2, axis = ..., keepdims = True)
    print("x shape= " + str(x.shape))
    x_norm = None
    print("x_norm shape= " + str(x_norm.shape))
    # Divide x by its norm.
    x = None
    ### END CODE HERE ###

    return x

In [None]:
x = np.array([
    [0, 3, 4],
    [1, 6, 4]])
print("normalizeRows(x) = " + str(normalizeRows(x)))

### 1.5 - Broadcasting and the softmax function ###
A very important concept to understand in numpy is "broadcasting". It is very useful for performing mathematical operations between arrays of different shapes. For the full details on broadcasting, you can read the official [broadcasting documentation](http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html).
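
To make the idea concrete, here is a small sketch with made-up arrays (the names `m`, `row`, and `col` are only for this illustration). When shapes differ, numpy stretches the dimensions of size 1, or the missing leading dimensions, so that the shapes match before applying the operation element-wise.

```python
import numpy as np

m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # shape (2, 3)
row = np.array([10.0, 20.0, 30.0])     # shape (3,)
col = np.array([[2.0], [4.0]])         # shape (2, 1)

# (2, 3) + (3,): row is added to each row of m.
added = m + row

# (2, 3) / (2, 1): each row of m is divided by the matching entry of col.
scaled = m / col

print(added)
print(scaled)
```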

**Exercise**: Implement a softmax function using numpy. You can think of softmax as a normalizing function used when your algorithm needs to classify two or more classes.

- $ \text{for } x \in \mathbb{R}^{1\times n} \text{,     } softmax(x) = softmax(\begin{bmatrix}
    x_1  &&
    x_2 &&
    ...  &&
    x_n  
\end{bmatrix}) = \begin{bmatrix}
     \frac{e^{x_1}}{\sum_{j}e^{x_j}}  &&
    \frac{e^{x_2}}{\sum_{j}e^{x_j}}  &&
    ...  &&
    \frac{e^{x_n}}{\sum_{j}e^{x_j}}
\end{bmatrix} $

- $\text{for a matrix } x \in \mathbb{R}^{m \times n} \text{,  $x_{ij}$ maps to the element in the $i^{th}$ row and $j^{th}$ column of $x$, thus we have: }$  $$softmax(x) = softmax\begin{bmatrix}
    x_{11} & x_{12} & x_{13} & \dots  & x_{1n} \\
    x_{21} & x_{22} & x_{23} & \dots  & x_{2n} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    x_{m1} & x_{m2} & x_{m3} & \dots  & x_{mn}
\end{bmatrix} = \begin{bmatrix}
    \frac{e^{x_{11}}}{\sum_{j}e^{x_{1j}}} & \frac{e^{x_{12}}}{\sum_{j}e^{x_{1j}}} & \frac{e^{x_{13}}}{\sum_{j}e^{x_{1j}}} & \dots  & \frac{e^{x_{1n}}}{\sum_{j}e^{x_{1j}}} \\
    \frac{e^{x_{21}}}{\sum_{j}e^{x_{2j}}} & \frac{e^{x_{22}}}{\sum_{j}e^{x_{2j}}} & \frac{e^{x_{23}}}{\sum_{j}e^{x_{2j}}} & \dots  & \frac{e^{x_{2n}}}{\sum_{j}e^{x_{2j}}} \\
    \vdots & \vdots & \vdots & \ddots & \vdots \\
    \frac{e^{x_{m1}}}{\sum_{j}e^{x_{mj}}} & \frac{e^{x_{m2}}}{\sum_{j}e^{x_{mj}}} & \frac{e^{x_{m3}}}{\sum_{j}e^{x_{mj}}} & \dots  & \frac{e^{x_{mn}}}{\sum_{j}e^{x_{mj}}}
\end{bmatrix} = \begin{pmatrix}
    softmax\text{(first row of x)}  \\
    softmax\text{(second row of x)} \\
    ...  \\
    softmax\text{(last row of x)} \\
\end{pmatrix} $$

In [None]:
def softmax(x):
    """Calculates the softmax for each row of the input x.

    Your code should work for a row vector and also for matrices of shape (n, m).

    Argument: x -- A numpy matrix of shape (n,m)
    Returns: s -- A numpy matrix equal to the softmax of x, of shape (n,m)
    """

    ### START CODE HERE ###
    # Apply exp() element-wise to x. Use np.exp(...).

    print("x shape= "+ str(x.shape))

    x_exp = None
    print("x_exp shape= "+ str(x_exp.shape))

    # Create a vector x_sum that sums each row of x_exp. Use np.sum(..., axis = 1, keepdims = True).
    x_sum = None
    print("x_sum shape= "+ str(x_sum.shape))

    # Compute softmax(x) by dividing x_exp by x_sum. It should automatically use numpy broadcasting.
    s = None
    print("s shape= "+ str(s.shape))

    ### END CODE HERE ###

    return s

In [None]:
x = np.array([
    [9, 2, 5, 0, 0],
    [7, 5, 0, 0 ,0]])
print("softmax(x) = " + str(softmax(x)))

## 2 - Vectorization


In deep learning, you deal with very large datasets. Hence, a computationally inefficient function can become a huge bottleneck in your algorithm and result in a model that takes ages to run. To make sure that your code is computationally efficient, you will use vectorization. For example, compare the following implementations of the dot, outer, and elementwise products.

For vectors $\mathbf{a}=(a_1, a_2, \dots, a_n)$ and $\mathbf{b}=(b_1, b_2, \dots, b_n)$ of the same length, the *dot product* (also known as the *inner product*) is defined as:

$$ \mathbf{a} \cdot \mathbf{b} = a_1b_1 + a_2b_2 + \dots + a_nb_n$$

The *outer product* of a vector $\mathbf{a}$ of length $n$ and a vector $\mathbf{b}$ of length $m$ is an $n \times m$ matrix:

$$ {\mathbf{a} \otimes \mathbf{b}}=\left[\begin{array}{cccc}
a_{1} b_{1} & a_{1} b_{2} & \cdots & a_{1} b_{m} \\
a_{2} b_{1} & a_{2} b_{2} & \cdots & a_{2} b_{m} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n} b_{1} & a_{n} b_{2} & \cdots & a_{n} b_{m}
\end{array}\right]
$$


In [None]:
import time

x1 = [9, 2, 5, 0, 0, 7, 5, 0, 0, 0, 9, 2, 5, 0, 0]
x2 = [9, 2, 2, 9, 0, 9, 2, 5, 0, 0, 9, 2, 5, 0, 0]

### CLASSIC dot product of vectors ###
tic = time.process_time()
for X in range(10000):
  dot = 0
  for i in range(len(x1)):
      dot+= x1[i]*x2[i]
toc = time.process_time()
print ("dot = " + str(dot) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")

### CLASSIC outer product ###
tic = time.process_time()
for X in range(10000):
  outer = np.zeros((len(x1),len(x2))) # we create a len(x1)*len(x2) matrix with only zeros
  for i in range(len(x1)):
      for j in range(len(x2)):
          outer[i,j] = x1[i]*x2[j]
toc = time.process_time()
print ("outer = " + str(outer) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")

### CLASSIC elementwise multiplication ###
tic = time.process_time()
for X in range(10000):
  mul = np.zeros(len(x1))
  for i in range(len(x1)):
      mul[i] = x1[i]*x2[i]
toc = time.process_time()
print ("elementwise multiplication = " + str(mul) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")


In [None]:
x1 = [9, 2, 5, 0, 0, 7, 5, 0, 0, 0, 9, 2, 5, 0, 0]
x2 = [9, 2, 2, 9, 0, 9, 2, 5, 0, 0, 9, 2, 5, 0, 0]

### VECTORIZED dot product of vectors ###
tic = time.process_time()
for X in range(10000):
  dot = np.dot(x1,x2)
toc = time.process_time()
print ("dot = " + str(dot) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")

### VECTORIZED outer product ###
tic = time.process_time()
for X in range(10000):
  outer = np.outer(x1,x2)
toc = time.process_time()
print ("outer = " + str(outer) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")

### VECTORIZED elementwise multiplication ###
tic = time.process_time()
for X in range(10000):
  mul = np.multiply(x1,x2)
toc = time.process_time()
print ("elementwise multiplication = " + str(mul) + "\n ----- Computation time = " + str(1000*(toc - tic)) + "ms")


As you may have noticed, the vectorized implementation is much cleaner and more efficient. For bigger vectors/matrices, the differences in running time become even bigger.

A lot of scalable deep learning implementations run on a GPU, but many of the demos we run in the Jupyter notebook actually run on a CPU. Both GPUs and CPUs have parallelization instructions, sometimes called SIMD instructions, which stands for *single instruction, multiple data*. If you use built-in functions that don't require you to explicitly implement a `for` loop, Python can take much better advantage of parallelism and do computations much faster.
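
To see how the gap grows with input size, the sketch below repeats the dot-product comparison on longer vectors. The vector length, the random data, and the use of `time.perf_counter()` (wall-clock time rather than the `time.process_time()` used above) are choices made for this illustration. Exact timings depend on your machine, so none are shown here.

```python
import time
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, only so the run is repeatable
n = 200_000                      # illustrative length, much larger than x1 and x2
a = rng.random(n)
b = rng.random(n)

# Explicit Python loop: one multiply-add per iteration.
tic = time.perf_counter()
dot_loop = 0.0
for i in range(n):
    dot_loop += a[i] * b[i]
loop_ms = 1000 * (time.perf_counter() - tic)

# Vectorized: a single call into numpy's compiled routine.
tic = time.perf_counter()
dot_vec = np.dot(a, b)
vec_ms = 1000 * (time.perf_counter() - tic)

print("loop:       " + str(loop_ms) + " ms")
print("vectorized: " + str(vec_ms) + " ms")
print("results agree: " + str(np.isclose(dot_loop, dot_vec)))
```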

### 2.1 Implement the L1 and L2 loss functions

**Exercise**: Implement the numpy vectorized version of the L1 loss. You may find the function `abs(x)` (absolute value of `x`) useful.

The loss is used to evaluate the performance of your model. The bigger your loss is, the more different your predictions ($ \hat{y} $) are from the true values ($y$). In deep learning, you use optimization algorithms like Gradient Descent to train your model and to minimize the cost.

L1 loss is defined as:
$$\begin{align*} & L_1(\hat{y}, y) = \sum_{i=1}^m|y^{(i)} - \hat{y}^{(i)}| \end{align*}\tag{6}$$

In [None]:
def L1(yhat, y):
    """
    Arguments: yhat -- vector of size m (predicted labels)
               y -- vector of size m (true labels)
    Returns: loss -- the value of the L1 loss function defined above
    """

    ### START CODE HERE ###
    loss = None
    ### END CODE HERE ###

    return loss

In [None]:
yhat = np.array([.9, 0.2, 0.1, .4, .9])
y = np.array([1, 0, 0, 1, 1])
print("L1 = " + str(L1(yhat,y)))

**Exercise**: Implement the numpy vectorized version of the L2 loss. There are several ways of implementing the L2 loss, but you may find the function `np.dot()` useful. As a reminder, if $x = [x_1, x_2, ..., x_n]$, then `np.dot(x,x)` = $\sum_{j=1}^n x_j^{2}$.
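
The reminder about `np.dot(x, x)` can be checked on a small vector. This is only a sketch with made-up values; it is not the loss itself.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

sum_of_squares = np.sum(x ** 2)   # 1 + 4 + 9
via_dot = np.dot(x, x)            # the same quantity as a single dot product

print(sum_of_squares, via_dot)
```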

L2 loss is defined as $$\begin{align*} & L_2(\hat{y},y) = \sum_{i=1}^m(y^{(i)} - \hat{y}^{(i)})^2 \end{align*}\tag{7}$$

In [None]:
def L2(yhat, y):
    """
    Arguments: yhat -- vector of size m (predicted labels)
               y -- vector of size m (true labels)
    Returns: loss -- the value of the L2 loss function defined above
    """

    ### START CODE HERE ### (≈ 1 line of code)
    loss = None
    ### END CODE HERE ###

    return loss

In [None]:
yhat = np.array([.9, 0.2, 0.1, .4, .9])
y = np.array([1, 0, 0, 1, 1])
print("L2 = " + str(L2(yhat,y)))