# Regularization to Reduce Overfitting

You will:
- Understand what overfitting is and how to reduce it.

- Understand the concept of regularization and how it reduces overfitting.

- Extend the previous linear and logistic cost functions with a regularization term.

- Extend the previous linear and logistic gradients with a regularization term.

##### Overfit

**To understand overfitting, consider the following:**

Suppose we have a training set with $m$ examples and $n$ features, and we want to predict the output with either a linear or a logistic model.

1. If the model's predictions ***do not*** fit the training set ***well***:

    * We say the model is **underfitting** the data, or has **high bias**.

2. If the model's predictions fit the training set ***just right***:

    * We say the model **generalizes** well.

3. If the model's predictions fit the training set ***extremely well*** but fail to generalize to new examples:

    * We say the model is **overfitting** the data, or has **high variance**.
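
The three cases above can be illustrated with a small experiment. This is only a sketch on made-up data (a noisy quadratic), and the polynomial degrees 1, 2 and 9 are arbitrary choices standing in for a model that is too simple, about right, and too flexible.

```python
import numpy as np

# Hypothetical toy data: a noisy quadratic, used only for illustration.
rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 12)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + rng.normal(0.0, 0.2, x_train.size)

# Noise-free points from the same curve, used to judge generalization.
x_test = np.linspace(-0.95, 0.95, 50)
y_test = 1.0 + 2.0 * x_test - 3.0 * x_test**2

def mse(y_hat, y):
    """Mean squared error between predictions and targets."""
    return float(np.mean((y_hat - y) ** 2))

results = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_err = mse(np.polyval(coeffs, x_train), y_train)
    test_err = mse(np.polyval(coeffs, x_test), y_test)
    results[degree] = (train_err, test_err)
    print(f"degree {degree}: train MSE = {train_err:.4f}, test MSE = {test_err:.4f}")
```

The training error can only go down as the degree grows, which is why a low training error alone does not tell us that a model generalizes. The degree 1 fit shows the high-bias case: it is poor on both the training and the test points.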

**How to address overfitting**

1. Collect more training data.

    * If you can get more data, add it to the training set.

2. Reduce the number of features, $n$.

    * Perform **feature selection**.
    
        * If you have many features $n$ but few examples $m$, reducing the number of features is a good way to reduce **overfitting**.

3. Perform **regularization**.
    * This is a very useful technique for training models.

##### Regularization

**The Idea Behind Regularization**

* Let's say we have $n$ features, with $n = 100$.
    * $w_{0}, w_{1}, w_{2},\; \cdots,\; w_{99},\; b$ are the parameters.

    If $w_{0}, \cdots, w_{99}$ are **smaller**, then the model behaves like a **simpler model** and is therefore **less** likely to **overfit**.

* So, to get **smaller** $w_{0}, w_{1}, w_{2}, \cdots, w_{99}$:

    We penalize all $w_{j}$ by adding $\frac{\lambda}{2m}\sum_{j=0}^{n-1}{w_{j}^2}$ to the cost,

    where $\lambda$ is the regularization parameter, $\lambda > 0$.
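
To make the penalty concrete, here is a minimal sketch that evaluates $\frac{\lambda}{2m}\sum_{j} w_{j}^{2}$ for two made-up weight vectors. The helper name `l2_penalty` and the sample values are assumptions chosen for illustration.

```python
import numpy as np

def l2_penalty(w, lambda_, m):
    """Return (lambda_ / (2 * m)) * sum_j w[j]**2. The bias b is not part of the penalty."""
    return (lambda_ / (2 * m)) * float(np.sum(np.square(w)))

w_large = np.array([3.0, -4.0])   # sum of squares = 25
w_small = np.array([0.3, -0.4])   # sum of squares = 0.25

penalty_large = l2_penalty(w_large, lambda_=1.0, m=5)   # 25 / 10
penalty_small = l2_penalty(w_small, lambda_=1.0, m=5)   # about 0.25 / 10
print(penalty_large, penalty_small)
```

Scaling the weights down by a factor of 10 shrinks the penalty by a factor of 100, so minimizing the cost plus this penalty favors small weights.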

##### Adding regularization

**Note:**
* Cost.
    * Cost functions differ significantly between Linear and Logistic Regression, but adding Regularization to the equations is the same.
* Gradient.
    * The gradient functions for Linear and Logistic Regression are very similar. They differ only in the implementation of $f_{w,b}$

##### Linear regularization

**Linear Regression**

$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b$

$J(\mathbf{w},b) = \frac{1}{2m} \sum_{i=0}^{m-1}{(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^{2}}$

$
    \text{repeat until convergence:} \; \; \lbrace \\
    \; \; \; w_{j} = w_{j} - \alpha \frac{\partial J(w,b)}{\partial w_{j}} \; \; \text{for } j = 0, \dots, n-1 \\ \\
    \; \; \; b  =  b - \alpha \frac{\partial J(w,b)}{\partial b} \\ \\
    \\ \rbrace
$

*Where:*

$\frac{\partial J(\mathbf{w},b)}{\partial w_{j}} = \frac{1}{m}\sum_{i=0}^{m-1}(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)}$

$\frac{\partial J(\mathbf{w},b)}{\partial b} = \frac{1}{m}\sum_{i=0}^{m-1}(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})$

**Regularized Linear Regression**

$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b$


$J(w,b) = \frac{1}{2m} \sum_{i=0}^{m-1}{(f_{w,b}(\mathbf{x}^{(i)}) - y^{(i)})^{2}} + \frac{\lambda}{2m}\sum_{j=0}^{n-1}w_{j}^{2}$


$
\text{repeat until convergence:} \; \lbrace \\
 \; \; \;w_j = w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j}\; \; \text{for } j = 0, \dots, n-1 \\
 \; \; \; \;b = b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\
\rbrace
$

*Where:*

$\frac{\partial J(\mathbf{w},b)}{\partial w_{j}} = \frac{1}{m}\sum_{i=0}^{m-1}(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} + \frac{\lambda}{m}w_{j}$

$\frac{\partial J(\mathbf{w},b)}{\partial b} = \frac{1}{m}\sum_{i=0}^{m-1}(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})$
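
The update rule above can be sketched as a short vectorized loop. This is an illustrative sketch, not the course's implementation: the function name `gradient_descent_linear_reg`, the toy data, and the values of `alpha`, `lambda_` and `num_iters` are all assumptions. It uses $\frac{\lambda}{m}w_{j}$ as the penalty's contribution to each weight gradient and leaves $b$ unregularized.

```python
import numpy as np

def gradient_descent_linear_reg(X, y, w_init, b_init, alpha, lambda_, num_iters):
    """Hypothetical helper: batch gradient descent with an L2 penalty on w only."""
    m = X.shape[0]
    w = np.array(w_init, dtype=float)   # copy so the caller's array is not modified
    b = float(b_init)
    for _ in range(num_iters):
        err = X @ w + b - y                           # f_wb(x_i) - y_i for every example
        dj_dw = (X.T @ err) / m + (lambda_ / m) * w   # data term plus penalty term
        dj_db = err.mean()                            # b is not regularized
        w = w - alpha * dj_dw                         # simultaneous update of w and b
        b = b - alpha * dj_db
    return w, b

# Toy data lying exactly on y = 2x + 1 (made up for illustration)
X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])
y_toy = np.array([3.0, 5.0, 7.0, 9.0])

w_fit, b_fit = gradient_descent_linear_reg(X_toy, y_toy, np.zeros(1), 0.0,
                                           alpha=0.05, lambda_=1.0, num_iters=5000)
print("w:", w_fit, "b:", b_fit)
```

With `lambda_=1.0` the fitted slope settles below the unregularized slope of 2, because the penalty trades a little training error for a smaller weight.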

Note:

* By adding the regularization term, we penalize large values of $w_j$, pushing them to be smaller.

* If $\lambda = 0$, no regularization is applied, so the model may **overfit**.

* If $\lambda$ is very large, the weights are pushed toward zero and the model will **underfit**.

* If $\lambda$ is neither too **small** nor too **large** but just **right**, the model will **generalize** well.

* The parameter $b$ is not regularized. This is standard practice.
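
The effect of $\lambda$ on the weights can be seen in a small experiment. The sketch below makes two simplifying assumptions that are not part of the text above: it drops the bias $b$, and it solves for the minimizer directly from $(X^{T}X + \lambda I)\,\mathbf{w} = X^{T}\mathbf{y}$ (obtained by setting the regularized gradient to zero) instead of running gradient descent. The data and the name `ridge_weights` are made up for illustration.

```python
import numpy as np

# Hypothetical data: 20 examples, 5 features, targets from a known linear rule plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=20)

def ridge_weights(X, y, lambda_):
    """Minimizer of (1/2m)*||Xw - y||^2 + (lambda_/2m)*||w||^2, with no bias term."""
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lambda_ * np.eye(n), X.T @ y)

norms = []
for lambda_ in (0.0, 1.0, 100.0, 1e6):
    w = ridge_weights(X, y, lambda_)
    norms.append(float(np.linalg.norm(w)))
    print(f"lambda = {lambda_:>9}: ||w|| = {norms[-1]:.6f}")
```

As $\lambda$ grows the weight vector shrinks toward zero. In the extreme the model predicts almost the same value for every input, which is the underfitting case.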

Below is an implementation of the regularized cost equation. Note that it uses a *standard pattern for this course*, a ***for loop*** over all ***m*** examples.

**Regularized Linear Cost Function**

**Python Implementation**



$J(w,b) = \frac{1}{2m} \sum_{i=0}^{m-1}{(f_{w,b}(\mathbf{x}^{(i)}) - y^{(i)})^{2}} + \frac{\lambda}{2m}\sum_{j=0}^{n-1}w_{j}^{2}$


In [1]:
import numpy as np

def compute_cost_linear_reg(X, y, w, b, lambda_):
    """
    Computes the regularized cost over all examples

    Args:
        X (ndarray (m,n)): Data, m examples with n features
        y (ndarray (m,)) : target values
        w (ndarray (n,)) : model parameters
        b (scalar)       : model parameter
        lambda_ (scalar) : controls amount of regularization
    Returns:
        total_cost (scalar): cost
    """

    m = X.shape[0]
    n = X.shape[1]

    # Squared-error term, averaged over the m examples
    cost = 0.
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b
        cost += (f_wb_i - y[i])**2
    cost /= (2*m)

    # Regularization term: penalizes every w[j], but not b
    reg_cost = 0.
    for j in range(n):
        reg_cost += w[j]**2
    reg_cost = (lambda_/(2*m))*reg_cost

    total_cost = cost + reg_cost
    return total_cost

**Python main function**


In [2]:
import numpy as np

def main():
    np.random.seed(1)
    X_tmp=np.array([[4.17022005e-01, 7.20324493e-01, 1.14374817e-04, 3.02332573e-01, 1.46755891e-01, 9.23385948e-02],
                  [1.86260211e-01, 3.45560727e-01, 3.96767474e-01, 5.38816734e-01, 4.19194514e-01, 6.85219500e-01],
                  [2.04452250e-01, 8.78117436e-01, 2.73875932e-02, 6.70467510e-01, 4.17304802e-01, 5.58689828e-01],
                  [1.40386939e-01, 1.98101489e-01, 8.00744569e-01, 9.68261576e-01, 3.13424178e-01, 6.92322616e-01],
                  [8.76389152e-01, 8.94606664e-01, 8.50442114e-02, 3.90547832e-02, 1.69830420e-01, 8.78142503e-01]])   
    y_tmp = np.array([0,1,0,1,0])
    w_tmp = np.array([-0.40165317, -0.07889237,  0.45788953,  0.03316528,  0.19187711, -0.18448437])
    b_tmp = 0.5
    lambda_tmp = 0.7
    cost = compute_cost_linear_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)
    print("Regularized cost:", cost)

if __name__=="__main__":
    main()

Regularized cost: 0.0791723937669179
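
For comparison, the same cost can be computed without explicit loops. The following is a sketch of a vectorized variant; the name `compute_cost_linear_reg_vec` and the tiny data set are assumptions made for illustration, chosen so the result can be checked by hand.

```python
import numpy as np

def compute_cost_linear_reg_vec(X, y, w, b, lambda_):
    """Vectorized sketch of the regularized squared-error cost (b is not penalized)."""
    m = X.shape[0]
    err = X @ w + b - y                        # (m,) vector of prediction errors
    cost = (err @ err) / (2 * m)               # (1/2m) * sum of squared errors
    reg_cost = (lambda_ / (2 * m)) * (w @ w)   # (lambda/2m) * sum_j w_j^2
    return cost + reg_cost

# Hand-checkable example: both predictions are -0.4, so the errors are -1.4 and -2.4.
X_small = np.array([[1.0, 2.0], [3.0, 4.0]])
y_small = np.array([1.0, 2.0])
w_small = np.array([0.5, -0.5])

cost_small = compute_cost_linear_reg_vec(X_small, y_small, w_small, b=0.1, lambda_=1.0)
print(cost_small)   # (1.96 + 5.76)/4 + (1/4)*0.5 = 1.93 + 0.125
```

The loop version mirrors the equation term by term, while the vectorized form is usually shorter and faster on large arrays.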


**Regularized Linear gradients**

$\frac{\partial J(\mathbf{w},b)}{\partial w_{j}} = \frac{1}{m}\sum_{i=0}^{m-1}(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} + \frac{\lambda}{m}w_{j}$

$\frac{\partial J(\mathbf{w},b)}{\partial b} = \frac{1}{m}\sum_{i=0}^{m-1}(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})$
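
The penalty's contribution to the gradient can be checked with a short derivation. It uses the same symbols as the regularized cost, with $k$ introduced here as the summation index so that it stays distinct from the fixed index $j$. Only the $k = j$ term of the sum depends on $w_{j}$:

$\frac{\partial}{\partial w_{j}}\left[\frac{\lambda}{2m}\sum_{k=0}^{n-1}w_{k}^{2}\right] = \frac{\lambda}{2m}\cdot 2w_{j} = \frac{\lambda}{m}w_{j}$

Substituting this into the update rule and grouping the $w_{j}$ terms gives:

$w_{j} = \left(1 - \alpha\frac{\lambda}{m}\right)w_{j} - \alpha\frac{1}{m}\sum_{i=0}^{m-1}(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)}$

So each step first shrinks $w_{j}$ by a constant factor and then applies the usual unregularized update. For example, with the illustrative values $\alpha = 0.01$, $\lambda = 1$ and $m = 50$, the factor is $1 - 0.01 \cdot \frac{1}{50} = 0.9998$.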

In [3]:
def compute_gradient_linear_reg(X, y, w, b, lambda_):
    """
    Computes the gradient for regularized linear regression
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
      lambda_ (scalar) : Controls amount of regularization
      
    Returns:
      dj_db (scalar)      : The gradient of the cost w.r.t. the parameter b. 
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w. 
    """

    m,n = X.shape
    dj_dw = np.zeros((n,))
    dj_db = 0.

    # Accumulate the unregularized gradient over all m examples
    for i in range(m):
        f_wb = np.dot(X[i],w) + b
        err = f_wb - y[i]
        for j in range(n):
            dj_dw[j] = dj_dw[j] + (err * X[i, j])
        dj_db = dj_db + err
    dj_dw = dj_dw / m
    dj_db = dj_db / m

    # Add the regularization term (lambda_/m) * w[j]; b is not regularized
    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_/m)*w[j]

    return dj_db, dj_dw

In [4]:
import numpy as np

def main():
    X_tmp=np.array([[4.17022005e-01, 7.20324493e-01, 1.14374817e-04, 3.02332573e-01, 1.46755891e-01, 9.23385948e-02],
                  [1.86260211e-01, 3.45560727e-01, 3.96767474e-01, 5.38816734e-01, 4.19194514e-01, 6.85219500e-01],
                  [2.04452250e-01, 8.78117436e-01, 2.73875932e-02, 6.70467510e-01, 4.17304802e-01, 5.58689828e-01],
                  [1.40386939e-01, 1.98101489e-01, 8.00744569e-01, 9.68261576e-01, 3.13424178e-01, 6.92322616e-01],
                  [8.76389152e-01, 8.94606664e-01, 8.50442114e-02, 3.90547832e-02, 1.69830420e-01, 8.78142503e-01]])   
    y_tmp = np.array([0,1,0,1,0])
    w_tmp = np.array([-0.40165317, -0.07889237,  0.45788953,  0.03316528,  0.19187711, -0.18448437])
    b_tmp = 0.5
    lambda_tmp = 0.7

    m,n = X_tmp.shape

    dj_db_tmp, dj_dw_tmp =  compute_gradient_linear_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

    print("Regularized gradients:\n")
    for i in range(n):
        print(f"dj_dw[{i}]:\t{dj_dw_tmp[i]}")
    
    print(f"dj_db:\t{dj_db_tmp}")

if __name__=="__main__":
    main()

Regularized gradients:

dj_dw[0]:	-0.04226595935184001
dj_dw[1]:	0.05237246827859569
dj_dw[2]:	-0.00827465637378165
dj_dw[3]:	-0.024143317470786088
dj_dw[4]:	0.012555864679079941
dj_dw[5]:	-0.07695476271064926
dj_db:	-0.008768830987243648
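
The double loop over examples and features can also be written with matrix operations. This is a sketch of a vectorized variant; the name `compute_gradient_linear_reg_vec` and the tiny data set are assumptions made for illustration. It returns the values in the same order as the loop version, `dj_db` first.

```python
import numpy as np

def compute_gradient_linear_reg_vec(X, y, w, b, lambda_):
    """Vectorized sketch of the regularized linear-regression gradient."""
    m = X.shape[0]
    err = X @ w + b - y                           # (m,) prediction errors
    dj_dw = (X.T @ err) / m + (lambda_ / m) * w   # data term plus (lambda/m) * w
    dj_db = err.mean()                            # b is not regularized
    return dj_db, dj_dw

# Hand-checkable example: both predictions are -0.4, so the errors are -1.4 and -2.4.
X_small = np.array([[1.0, 2.0], [3.0, 4.0]])
y_small = np.array([1.0, 2.0])
w_small = np.array([0.5, -0.5])

dj_db_small, dj_dw_small = compute_gradient_linear_reg_vec(X_small, y_small, w_small,
                                                           b=0.1, lambda_=1.0)
print(dj_db_small, dj_dw_small)
```

By hand, `X.T @ err / m` is `[-4.3, -6.2]` and the penalty adds `0.5 * w = [0.25, -0.25]`, which is a quick way to confirm that the regularization term is linear in each weight.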


##### Logistic Regularization

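**Logistic Regression**

$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = g(\mathbf{w} \cdot \mathbf{x}^{(i)} + b)$

$J(\mathbf{w},b) = -\frac{1}{m} \sum_{i=0}^{m-1}\left[ y^{(i)}\log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) + (1 - y^{(i)})\log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) \right]$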

$
    \text{repeat until convergence:} \; \; \lbrace \\
    \; \; \; w_{j} = w_{j} - \alpha \frac{\partial J(w,b)}{\partial w_{j}} \; \; \text{for } j = 0, \dots, n-1 \\ \\
    \; \; \; b  =  b - \alpha \frac{\partial J(w,b)}{\partial b} \\ \\
    \\ \rbrace
$

*Where:*

$z = \mathbf{w}\cdot\mathbf{x}^{(i)} + b$

$g(z) = \frac{1}{1 + e^{(-z)}}$

$f_{w,b}(\mathbf{x}^{(i)}) = g(z)$

$\frac{\partial J(\mathbf{w},b)}{\partial w_{j}} = \frac{1}{m}\sum_{i=0}^{m-1}(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)}$

$\frac{\partial J(\mathbf{w},b)}{\partial b} = \frac{1}{m}\sum_{i=0}^{m-1}(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})$

**Regularized Logistic Regression**

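$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = g(\mathbf{w} \cdot \mathbf{x}^{(i)} + b)$


$J(\mathbf{w},b) = -\frac{1}{m} \sum_{i=0}^{m-1}\left[ y^{(i)}\log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) + (1 - y^{(i)})\log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) \right] + \frac{\lambda}{2m}\sum_{j=0}^{n-1}w_{j}^{2}$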


$
    \text{repeat until convergence:} \; \; \lbrace \\
    \; \; \; w_{j} = w_{j} - \alpha \frac{\partial J(w,b)}{\partial w_{j}} \; \; \text{for } j = 0, \dots, n-1 \\ \\
    \; \; \; b  =  b - \alpha \frac{\partial J(w,b)}{\partial b} \\ \\
    \\ \rbrace
$

*Where:*

$z = \mathbf{w}\cdot\mathbf{x}^{(i)} + b$

$g(z) = \frac{1}{1 + e^{(-z)}}$

$f_{w,b}(\mathbf{x}^{(i)}) = g(z)$

$\frac{\partial J(\mathbf{w},b)}{\partial w_{j}} = \frac{1}{m}\sum_{i=0}^{m-1}(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} + \frac{\lambda}{m}w_{j}$

$\frac{\partial J(\mathbf{w},b)}{\partial b} = \frac{1}{m}\sum_{i=0}^{m-1}(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})$

**Sigmoid function**

$g(z) = \frac{1}{1 + e^{(-z)}}$

In [5]:
def sigmoid(z):
    """
    Compute the sigmoid of z
    Args:
        z (scalar or ndarray): input value(s), e.g. np.dot(x, w) + b

    Returns:
        g (scalar or ndarray): sigmoid(z), a value between 0 and 1
    """
    return 1/(1 + np.exp(-z))
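
For inputs with a large negative `z`, `np.exp(-z)` overflows and NumPy emits a `RuntimeWarning`, although the final result still rounds to 0. One possible alternative, shown here only as a sketch under the hypothetical name `sigmoid_stable`, rewrites the same function with `np.logaddexp`, which evaluates $\log(e^{a} + e^{b})$ without forming the large intermediate value.

```python
import numpy as np

def sigmoid_stable(z):
    """
    Same function as 1 / (1 + exp(-z)), rewritten as
    exp(-log(1 + exp(-z))) = exp(-logaddexp(0, -z)) to avoid overflow in exp.
    Works for scalars and ndarrays.
    """
    return np.exp(-np.logaddexp(0.0, -z))

z_values = np.array([-1000.0, -2.0, 0.0, 2.0, 1000.0])
g_values = sigmoid_stable(z_values)
print(g_values)
```

For moderate inputs both forms agree to floating-point precision, so the simpler version above is fine for the small examples in this notebook.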

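**Regularized Logistic Cost Function**

$J(\mathbf{w},b) = -\frac{1}{m} \sum_{i=0}^{m-1}\left[ y^{(i)}\log\left(f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) + (1 - y^{(i)})\log\left(1 - f_{\mathbf{w},b}(\mathbf{x}^{(i)})\right) \right] + \frac{\lambda}{2m}\sum_{j=0}^{n-1}w_{j}^{2}$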

$f_{w,b}(\mathbf{x}^{(i)}) = g(z)$

$g(z) = \frac{1}{1 + e^{(-z)}}$

In [6]:
def compute_cost_logistic_reg(X, y, w, b, lambda_):
    """
    Computes the regularized cost over all examples
    Args:
        X (ndarray (m,n)): Data, m examples with n features
        y (ndarray (m,)) : target values
        w (ndarray (n,)) : model parameters  
        b (scalar)       : model parameter
        lambda_ (scalar) : Controls amount of regularization
    Returns:
        total_cost (scalar):  cost 
    """

    m,n = X.shape

    # Logistic (cross-entropy) loss, averaged over the m examples
    cost = 0.
    for i in range(m):
        z_i = np.dot(X[i],w) + b
        f_wb_i = sigmoid(z_i)
        cost += -y[i]*np.log(f_wb_i) - (1-y[i])*np.log(1-f_wb_i)
    cost /= m

    # Regularization term: penalizes every w[j], but not b
    reg_cost = 0.
    for j in range(n):
        reg_cost += w[j]**2
    reg_cost = (lambda_/(2*m)) * reg_cost

    total_cost = cost + reg_cost
    return total_cost

In [10]:
def main():
    
    X=np.array([[4.17022005e-01, 7.20324493e-01, 1.14374817e-04],
                  [1.86260211e-01, 3.45560727e-01, 3.96767474e-01],
                  [2.04452250e-01, 8.78117436e-01, 2.73875932e-02],
                  [1.40386939e-01, 1.98101489e-01, 8.00744569e-01],
                  [8.76389152e-01, 8.94606664e-01, 8.50442114e-02]]) 
    y = np.array([0,1,0,1,0])
    w = np.array([-0.40165317, -0.07889237,  0.45788953])
    b = 0.5
    lambda_ = 0.7

    cost_ = compute_cost_logistic_reg(X, y, w, b, lambda_)

    print("Regularized cost:", cost_)

if __name__=="__main__":
    main()

Regularized cost: 0.6865981362012243
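
The logistic cost can also be written without loops. Below is a sketch of a vectorized variant; the name `compute_cost_logistic_reg_vec` and the small data set are assumptions for illustration. A convenient sanity check is that with all weights and the bias at zero every prediction is 0.5, so the cost equals $\log 2$ regardless of the labels or of $\lambda$.

```python
import numpy as np

def compute_cost_logistic_reg_vec(X, y, w, b, lambda_):
    """Vectorized sketch of the regularized logistic (cross-entropy) cost."""
    m = X.shape[0]
    f_wb = 1.0 / (1.0 + np.exp(-(X @ w + b)))                # sigmoid of every z_i
    loss = -y * np.log(f_wb) - (1 - y) * np.log(1 - f_wb)    # per-example loss
    return loss.mean() + (lambda_ / (2 * m)) * (w @ w)       # b is not penalized

# Made-up data: 3 examples, 2 features
X_small = np.array([[0.5, 1.5], [1.0, -1.0], [2.0, 0.0]])
y_small = np.array([0, 1, 1])

cost_at_zero = compute_cost_logistic_reg_vec(X_small, y_small, np.zeros(2), 0.0, lambda_=0.7)
print(cost_at_zero, np.log(2))
```

Changing `lambda_` only changes the second term, so the difference between two costs computed with the same `w` and `b` isolates the penalty.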


**Regularized Logistic Gradient Function**

$\frac{\partial J(\mathbf{w},b)}{\partial w_{j}} = \frac{1}{m}\sum_{i=0}^{m-1}(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} + \frac{\lambda}{m}w_{j}$

$\frac{\partial J(\mathbf{w},b)}{\partial b} = \frac{1}{m}\sum_{i=0}^{m-1}(f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})$

$f_{w,b}(\mathbf{x}^{(i)}) = g(z)$

$g(z) = \frac{1}{1 + e^{(-z)}}$

In [11]:
def compute_gradient_logistic_reg(X, y, w, b, lambda_):
    """
    Computes the gradient for regularized logistic regression 
 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters  
      b (scalar)       : model parameter
      lambda_ (scalar) : Controls amount of regularization
    Returns:
      dj_db (scalar)            : The gradient of the cost w.r.t. b. 
      dj_dw (ndarray Shape (n,)): The gradient of the cost w.r.t. w. 
    """

    m,n = X.shape
    dj_dw = np.zeros((n,))
    dj_db = 0.0

    # Accumulate the unregularized gradient over all m examples
    for i in range(m):
        f_wb_i = sigmoid(np.dot(X[i],w) + b)
        err_i = f_wb_i - y[i]
        for j in range(n):
            dj_dw[j] = dj_dw[j] + (err_i*X[i][j])
        dj_db = dj_db + err_i
    dj_dw = dj_dw/m
    dj_db = dj_db/m

    # Add the regularization term (lambda_/m) * w[j]; b is not regularized
    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_/m)*w[j]

    return dj_db, dj_dw

In [12]:
import numpy as np

def main():
    X=np.array([[4.17022005e-01, 7.20324493e-01, 1.14374817e-04],
                  [1.86260211e-01, 3.45560727e-01, 3.96767474e-01],
                  [2.04452250e-01, 8.78117436e-01, 2.73875932e-02],
                  [1.40386939e-01, 1.98101489e-01, 8.00744569e-01],
                  [8.76389152e-01, 8.94606664e-01, 8.50442114e-02]]) 
    y = np.array([0,1,0,1,0])
    w = np.array([-0.40165317, -0.07889237,  0.45788953])
    b = 0.5
    lambda_ = 0.7
    dj_db, dj_dw =  compute_gradient_logistic_reg(X, y, w, b, lambda_)

    m,n = X.shape

    print("Regularized gradients: \n")
    for i in range(n):
        print(f"dj_dw[{i}]:\t{dj_dw[i]}")

    print(f"dj_db: {dj_db}")

if __name__=="__main__":
    main()

Regularized gradients: 

dj_dw[0]:	0.08590186822598606
dj_dw[1]:	0.2318715168852964
dj_dw[2]:	-0.0019798089387517287
dj_db: 0.20333487598147518
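
A common way to gain confidence in a gradient implementation is to compare it against central finite differences of the cost. The sketch below is self-contained and uses compact vectorized helpers (`cost`, `gradient`, `numerical_gradient`), which are hypothetical names introduced for this check and not functions from the notebook. The step size `eps=1e-6` is an assumed value.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, y, w, b, lambda_):
    """Regularized logistic cost (cross-entropy plus L2 penalty on w)."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)
    loss = -y * np.log(f) - (1 - y) * np.log(1 - f)
    return loss.mean() + (lambda_ / (2 * m)) * np.sum(w ** 2)

def gradient(X, y, w, b, lambda_):
    """Analytic gradient: returns (dj_db, dj_dw), with (lambda_/m) * w as the penalty term."""
    m = X.shape[0]
    err = sigmoid(X @ w + b) - y
    return err.mean(), (X.T @ err) / m + (lambda_ / m) * w

def numerical_gradient(X, y, w, b, lambda_, eps=1e-6):
    """Central-difference approximation of the same gradient."""
    dj_dw = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps
        dj_dw[j] = (cost(X, y, w + step, b, lambda_) - cost(X, y, w - step, b, lambda_)) / (2 * eps)
    dj_db = (cost(X, y, w, b + eps, lambda_) - cost(X, y, w, b - eps, lambda_)) / (2 * eps)
    return dj_db, dj_dw

# Same data as the example above
X = np.array([[4.17022005e-01, 7.20324493e-01, 1.14374817e-04],
              [1.86260211e-01, 3.45560727e-01, 3.96767474e-01],
              [2.04452250e-01, 8.78117436e-01, 2.73875932e-02],
              [1.40386939e-01, 1.98101489e-01, 8.00744569e-01],
              [8.76389152e-01, 8.94606664e-01, 8.50442114e-02]])
y = np.array([0, 1, 0, 1, 0])
w = np.array([-0.40165317, -0.07889237, 0.45788953])

dj_db_a, dj_dw_a = gradient(X, y, w, 0.5, 0.7)
dj_db_n, dj_dw_n = numerical_gradient(X, y, w, 0.5, 0.7)
max_diff = max(abs(dj_db_a - dj_db_n), float(np.max(np.abs(dj_dw_a - dj_dw_n))))
print("largest difference between analytic and numerical gradient:", max_diff)
```

If the penalty term in the gradient were anything other than $\frac{\lambda}{m}w_{j}$, the two results would disagree by far more than the finite-difference error.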
