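# Regularized Cost and Gradient

## Goals

In this lab, you will:

- Extend the previous linear and logistic regression cost functions with a regularization term.

- Rerun the previous overfitting example with regularization added and observe the difference.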


In [None]:
# region Data loading
import sys, os
import numpy as np
np.set_printoptions(precision=8)
from pathlib import Path

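try:
    %matplotlib widget
except Exception:
    # Fall back to the inline backend when the widget backend is unavailable
    %matplotlib inline
    print("The matplotlib widget backend is not available (e.g. on Colab); using inline instead")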
import matplotlib.pyplot as plt
from ipywidgets import Output


# Look for README.md first; if it exists, this is a full checkout and nothing needs to be downloaded
def find_repo_root(marker="README.md"):
    cur = Path.cwd()
    while cur != cur.parent:  # Stop at the filesystem root to avoid an infinite loop
        if (cur / marker).exists():
            return cur
        cur = cur.parent
    return None


def import_data_from_github():
    import urllib.request, shutil
    
    def isRunningInColab() -> bool:
        return "google.colab" in sys.modules

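    def isRunningInJupyterLab() -> bool:
        # Heuristic: only checks that the jupyterlab package is importable,
        # not that JupyterLab is the active front end
        try:
            import jupyterlab
            return True
        except ImportError:
            return False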
        
    def detect_env():
        from IPython import get_ipython
        if isRunningInColab():
            return "Colab"
        elif isRunningInJupyterLab():
            return "JupyterLab"
        elif "notebook" in str(type(get_ipython())).lower():
            return "Jupyter Notebook"
        else:
            return "Unknown"
        
    def get_utils_dir(env): 
        if env == "Colab": 
            if "/content" not in sys.path:
                sys.path.insert(0, "/content")
            return "/content/utils"
        else:
            return Path.cwd() / "utils"

    env = detect_env()
    UTILS_DIR = get_utils_dir(env)
    REPO_DIR = "Machine-Learning-Lab"

    #shutil.rmtree(UTILS_DIR, ignore_errors=True)
    os.makedirs(UTILS_DIR, exist_ok=True)

    BASE = f"https://raw.githubusercontent.com/mz038197/{REPO_DIR}/main"
    urllib.request.urlretrieve(f"{BASE}/utils/plt_overfit.py", f"{UTILS_DIR}/plt_overfit.py")
    urllib.request.urlretrieve(f"{BASE}/utils/lab_utils_common_classification.py", f"{UTILS_DIR}/lab_utils_common_classification.py")
    urllib.request.urlretrieve(f"{BASE}/utils/deeplearning.mplstyle", f"{UTILS_DIR}/deeplearning.mplstyle")


repo_root = find_repo_root()

if repo_root is None:
    import_data_from_github()
    repo_root = Path.cwd()
    

os.chdir(repo_root)
print(f"✅ Changed working directory to {Path.cwd()}")
if str(repo_root) not in sys.path:
    sys.path.append(str(repo_root))
print("✅ Added to sys.path")

from utils.plt_overfit import overfit_example, output
from utils.lab_utils_common_classification import sigmoid


plt.style.use('utils/deeplearning.mplstyle')
print("✅ Imported modules and set the plot style")
#endregion Data loading

<br>

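# Adding Regularization

The slides above show the **cost function** and **gradient** for linear and logistic regression with regularization added. The key points:
- Cost
    - The cost functions of linear and logistic regression differ significantly, but regularization is added to both in the same way.
- Gradient
    - The gradients of linear and logistic regression have very similar forms; the main difference is how the model output $f_{\mathbf{w},b}$ is computed.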

<br>

## Cost Functions with Regularization
### Cost Function for Regularized Linear Regression

The cost function for regularized linear regression is:

$$
J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2  + \frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2 \tag{1}
$$

where:

$$ 
f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b  \tag{2} 
$$

Compare this with the cost function without regularization, which you implemented in a previous lab:

$$
J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2 
$$

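The difference is the added regularization term: <span style="color:blue"> $$\frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2$$ </span>

Including this term encourages gradient descent to also keep the parameters small, which avoids overly large weights. Note that in this example **$b$ is not regularized**, which is common, standard practice.

Below is an implementation of equations (1) and (2). It follows the approach used throughout this course: a `for` loop over all `m` examples.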

In [None]:
def compute_cost_linear_reg(X, y, w, b, lambda_ = 1):
    """
    Computes the cost over all examples
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
    Returns:
      total_cost (scalar):  cost 
    """
    # YOUR CODE HERE
    m  = X.shape[0]
    n  = len(w)
    cost = 0.
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b                                   #(n,)(n,)=scalar, see np.dot
        cost = cost + (f_wb_i - y[i])**2                               #scalar             
    cost = cost / (2 * m)                                              #scalar  
 
    reg_cost = 0
    for j in range(n):
        reg_cost += (w[j]**2)                                          #scalar
    reg_cost = (lambda_/(2*m)) * reg_cost                              #scalar
    
    total_cost = cost + reg_cost                                       #scalar
    # YOUR CODE END HERE 

    return total_cost                                                  #scalar

Run the cell below to see it in action.

In [None]:
np.random.seed(1)
X_tmp = np.random.rand(5,6)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1]).reshape(-1,)-0.5
b_tmp = 0.5
lambda_tmp = 0.7
cost_tmp = compute_cost_linear_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print("Regularized cost:", cost_tmp)

**Expected output**:
<table>
  <tr>
    <td> <b>Regularized cost:</b> 0.07917239320214275 </td>
  </tr>
</table>
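
The loop-based implementation above can also be written with NumPy array operations. The following is a minimal sketch, not part of the original lab: the name `compute_cost_linear_reg_vec` is hypothetical, and it assumes `X` has shape `(m, n)`, `y` has shape `(m,)`, and `w` has shape `(n,)`.

```python
import numpy as np

def compute_cost_linear_reg_vec(X, y, w, b, lambda_=1.0):
    """Vectorized sketch of equation (1); hypothetical helper, not from the lab."""
    m = X.shape[0]
    err = X @ w + b - y                               # (m,) prediction errors
    cost = np.sum(err ** 2) / (2 * m)                 # squared-error term
    reg_cost = (lambda_ / (2 * m)) * np.sum(w ** 2)   # b is not regularized
    return cost + reg_cost

# Small hand-checkable case: predictions are [1, 2], so the errors are [0, 1]
X_demo = np.array([[1.0], [2.0]])
y_demo = np.array([1.0, 1.0])
w_demo = np.array([1.0])
cost_demo = compute_cost_linear_reg_vec(X_demo, y_demo, w_demo, 0.0, lambda_=2.0)
print(cost_demo)  # 1/(2*2) * 1 + 2/(2*2) * 1 = 0.75
```

The result should match the loop version on the same inputs up to floating-point rounding; the loop form is kept in the lab because it mirrors the summations in the equation term by term.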

<br>

### Cost Function for Regularized Logistic Regression

For regularized **logistic regression**, the cost function has the form:

$$
J(\mathbf{w},b) = \frac{1}{m}  \sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) \right] + \frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2 \tag{3}
$$

where:

$$ 
f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \text{sigmoid}(\mathbf{w} \cdot \mathbf{x}^{(i)} + b)  \tag{4} 
$$ 

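Compare this with the cost function without regularization, which you implemented in a previous lab:

$$ 
J(\mathbf{w},b) = \frac{1}{m}\sum_{i=0}^{m-1} \left[ -y^{(i)} \log\left(f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right) - \left( 1 - y^{(i)}\right) \log \left( 1 - f_{\mathbf{w},b}\left( \mathbf{x}^{(i)} \right) \right)\right] 
$$

As with linear regression above, the difference is the added regularization term: <span style="color:blue"> $$\frac{\lambda}{2m}  \sum_{j=0}^{n-1} w_j^2$$ </span>

Including this term encourages gradient descent to shrink the parameters. Note that in this example **$b$ is not regularized**, which is standard practice.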

In [None]:
def compute_cost_logistic_reg(X, y, w, b, lambda_ = 1):
    """
    Computes the cost over all examples
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
    Returns:
      total_cost (scalar):  cost 
    """
    
    # YOUR CODE HERE
    m,n  = X.shape
    cost = 0.
    for i in range(m):
        z_i = np.dot(X[i], w) + b                                      #(n,)(n,)=scalar, see np.dot
        f_wb_i = sigmoid(z_i)                                          #scalar
        cost +=  -y[i]*np.log(f_wb_i) - (1-y[i])*np.log(1-f_wb_i)      #scalar
             
    cost = cost/m                                                      #scalar

    reg_cost = 0
    for j in range(n):
        reg_cost += (w[j]**2)                                          #scalar
    reg_cost = (lambda_/(2*m)) * reg_cost                              #scalar
    
    total_cost = cost + reg_cost                                       #scalar
    # YOUR CODE END HERE 
    
    return total_cost                                                  #scalar

Run the cell below to see it in action.

In [None]:
np.random.seed(1)
X_tmp = np.random.rand(5,6)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1]).reshape(-1,)-0.5
b_tmp = 0.5
lambda_tmp = 0.7
cost_tmp = compute_cost_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print("Regularized cost:", cost_tmp)

**Expected output**:
<table>
  <tr>
    <td> <b>Regularized cost:</b> 0.6850849138741673 </td>
  </tr>
</table>
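
To see the regularization term in isolation, the sketch below evaluates the logistic cost with and without regularization and checks that the two differ by exactly $\frac{\lambda}{2m} \sum_j w_j^2$. It is an illustration under assumptions rather than the lab's code: `sigmoid_demo` stands in for the lab's `sigmoid`, `compute_cost_logistic_reg_vec` is a hypothetical vectorized helper, and the data are random.

```python
import numpy as np

def sigmoid_demo(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)); stand-in for the lab's `sigmoid`."""
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost_logistic_reg_vec(X, y, w, b, lambda_=1.0):
    """Vectorized sketch of equation (3); hypothetical helper, not from the lab."""
    m = X.shape[0]
    f_wb = sigmoid_demo(X @ w + b)                          # (m,) predicted probabilities
    loss = -y * np.log(f_wb) - (1 - y) * np.log(1 - f_wb)   # (m,) per-example loss
    return np.mean(loss) + (lambda_ / (2 * m)) * np.sum(w ** 2)

rng = np.random.default_rng(0)
X_demo = rng.random((5, 3))
y_demo = np.array([0, 1, 0, 1, 0])
w_demo = rng.random(3) - 0.5
b_demo = 0.5

cost_no_reg = compute_cost_logistic_reg_vec(X_demo, y_demo, w_demo, b_demo, lambda_=0.0)
cost_reg = compute_cost_logistic_reg_vec(X_demo, y_demo, w_demo, b_demo, lambda_=0.7)
penalty = (0.7 / (2 * X_demo.shape[0])) * np.sum(w_demo ** 2)

# The gap between the two costs equals the penalty (up to rounding)
print(cost_reg - cost_no_reg, penalty)
```

Because the penalty depends only on `w`, changing `b_demo` changes both costs by the same amount and leaves the gap untouched.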

<br>

## Gradient Descent with Regularization
The basic gradient descent procedure **does not change** when regularization is added. It is still:
$$\begin{align*}
&\text{repeat until convergence:} \; \lbrace \\
&  \; \; \;w_j = w_j -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \tag{1}  \; & \text{for j := 0..n-1} \\ 
&  \; \; \;  \; \;b = b -  \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\
&\rbrace
\end{align*}$$
where each iteration performs a **simultaneous update** of $w_j$ for all $j$.

What regularization does change is **how the gradients are computed**.
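
The update rule above can be sketched as a plain loop. This is a minimal illustration, assuming the regularized linear regression gradient that this lab defines in the following section; the names `gradient_linear_reg_vec` and `gradient_descent_reg`, the toy data, and the values of `alpha` and `num_iters` are all hypothetical choices.

```python
import numpy as np

def gradient_linear_reg_vec(X, y, w, b, lambda_):
    """Vectorized regularized gradient for linear regression (hypothetical helper)."""
    m = X.shape[0]
    err = X @ w + b - y                           # (m,) prediction errors
    dj_dw = X.T @ err / m + (lambda_ / m) * w     # regularization applies to w only
    dj_db = np.mean(err)
    return dj_db, dj_dw

def gradient_descent_reg(X, y, w_init, b_init, alpha, num_iters, lambda_):
    """Run a fixed number of gradient descent steps and return the final (w, b)."""
    w = w_init.copy()
    b = b_init
    for _ in range(num_iters):
        # Both gradients are computed from the current (w, b) before either is updated
        dj_db, dj_dw = gradient_linear_reg_vec(X, y, w, b, lambda_)
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
    return w, b

# Toy data on the line y = 2x + 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])

w_fit, b_fit = gradient_descent_reg(X_demo, y_demo, np.zeros(1), 0.0,
                                    alpha=0.1, num_iters=5000, lambda_=0.0)
w_reg, b_reg = gradient_descent_reg(X_demo, y_demo, np.zeros(1), 0.0,
                                    alpha=0.1, num_iters=5000, lambda_=1.0)
print(w_fit, b_fit)   # close to w = 2, b = 1 without regularization
print(w_reg, b_reg)   # the weight is pulled toward zero when lambda_ > 0
```

The loop body is identical with and without regularization; only the gradient function knows about `lambda_`.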

<br>

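### Computing the Gradient with Regularization (Linear and Logistic)

Linear and logistic regression compute the gradient almost identically; the main difference is how $f_{\mathbf{w},b}$ is computed.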

$$
\begin{align*}
\frac{\partial J(\mathbf{w},b)}{\partial w_j}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)}  +  \frac{\lambda}{m} w_j \tag{2} \\
\frac{\partial J(\mathbf{w},b)}{\partial b}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag{3} 
\end{align*}
$$

- $m$: the number of training examples in the data set

- $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$: the model's prediction; $y^{(i)}$: the target value (label)

- For <span style="color:blue">**linear regression**</span>:
    $f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$

- For <span style="color:blue">**logistic regression**</span>:
    $z = \mathbf{w} \cdot \mathbf{x} + b$
    $f_{\mathbf{w},b}(\mathbf{x}) = g(z)$
    where $g(z)$ is the sigmoid function:
    $g(z) = \frac{1}{1+e^{-z}}$

The term that regularization adds to the gradient with respect to $w_j$ is <span style="color:blue"> $$\frac{\lambda}{m} w_j$$ </span>
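
The extra term comes from differentiating the regularization term of the cost with respect to a single weight $w_j$. Only the summand with $k = j$ depends on $w_j$, so:

$$
\frac{\partial}{\partial w_j} \left( \frac{\lambda}{2m} \sum_{k=0}^{n-1} w_k^2 \right) = \frac{\lambda}{2m} \cdot 2 w_j = \frac{\lambda}{m} w_j
$$

Substituting the regularized gradient into the update for $w_j$ and grouping the $w_j$ terms gives an equivalent form:

$$
w_j = w_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)}
$$

So on every iteration, regularization first multiplies $w_j$ by the factor $1 - \alpha \frac{\lambda}{m}$ and then applies the usual unregularized step. Because $b$ does not appear in the regularization term, its gradient is unchanged.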

<br>

### Gradient Function for Regularized Linear Regression

In [None]:
def compute_gradient_linear_reg(X, y, w, b, lambda_): 
    """
    Computes the gradient for linear regression 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
      
    Returns:
      dj_db (scalar):       The gradient of the cost w.r.t. the parameter b. 
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w. 
    """

    # YOUR CODE HERE
    m,n = X.shape           #(number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.

    for i in range(m):                             
        err = (np.dot(X[i], w) + b) - y[i]                 
        for j in range(n):                         
            dj_dw[j] = dj_dw[j] + err * X[i, j]               
        dj_db = dj_db + err                        
    dj_dw = dj_dw / m                                
    dj_db = dj_db / m   
    
    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_/m) * w[j]
    # YOUR CODE END HERE 
    
    return dj_db, dj_dw

Run the cell below to see it in action.

In [None]:
np.random.seed(1)
X_tmp = np.random.rand(5,3)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1])
b_tmp = 0.5
lambda_tmp = 0.7
dj_db_tmp, dj_dw_tmp =  compute_gradient_linear_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print(f"dj_db: {dj_db_tmp}")
print(f"Regularized dj_dw:\n {dj_dw_tmp.tolist()}")

**Expected output**
```
dj_db: 0.6648774569425726
Regularized dj_dw:
 [0.29653214748822276, 0.4911679625918033, 0.21645877535865857]
```
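
One way to gain confidence in a gradient implementation is to compare it against a central finite-difference approximation of the cost. The sketch below does this for regularized linear regression. It is self-contained and uses hypothetical vectorized helpers (`cost_linear_reg_demo`, `gradient_linear_reg_demo`, `numerical_gradient`) rather than the lab's functions; the step size `eps` is an assumed value.

```python
import numpy as np

def cost_linear_reg_demo(X, y, w, b, lambda_):
    """Regularized squared-error cost (hypothetical vectorized helper)."""
    m = X.shape[0]
    err = X @ w + b - y
    return np.sum(err ** 2) / (2 * m) + (lambda_ / (2 * m)) * np.sum(w ** 2)

def gradient_linear_reg_demo(X, y, w, b, lambda_):
    """Analytic gradient, returned in the order (dj_db, dj_dw)."""
    m = X.shape[0]
    err = X @ w + b - y
    return np.mean(err), X.T @ err / m + (lambda_ / m) * w

def numerical_gradient(X, y, w, b, lambda_, eps=1e-6):
    """Central-difference approximation of the same gradient."""
    dj_dw = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = eps                      # perturb one weight at a time
        dj_dw[j] = (cost_linear_reg_demo(X, y, w + step, b, lambda_)
                    - cost_linear_reg_demo(X, y, w - step, b, lambda_)) / (2 * eps)
    dj_db = (cost_linear_reg_demo(X, y, w, b + eps, lambda_)
             - cost_linear_reg_demo(X, y, w, b - eps, lambda_)) / (2 * eps)
    return dj_db, dj_dw

rng = np.random.default_rng(1)
X_demo = rng.random((5, 3))
y_demo = np.array([0.0, 1.0, 0.0, 1.0, 0.0])
w_demo = rng.random(3)

dj_db_a, dj_dw_a = gradient_linear_reg_demo(X_demo, y_demo, w_demo, 0.5, 0.7)
dj_db_n, dj_dw_n = numerical_gradient(X_demo, y_demo, w_demo, 0.5, 0.7)
print(np.max(np.abs(dj_dw_a - dj_dw_n)), abs(dj_db_a - dj_db_n))  # both should be tiny
```

A large gap between the analytic and numerical values usually points to a missing or misplaced term, such as regularizing `b` or dividing the penalty by `2 * m` instead of `m`.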

<br>

### Gradient Function for Regularized Logistic Regression

In [None]:
def compute_gradient_logistic_reg(X, y, w, b, lambda_): 
    """
    Computes the gradient for regularized logistic regression 
 
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)): target values
      w (ndarray (n,)): model parameters  
      b (scalar)      : model parameter
      lambda_ (scalar): Controls amount of regularization
    Returns:
      dj_db (scalar)            : The gradient of the cost w.r.t. the parameter b. 
      dj_dw (ndarray Shape (n,)): The gradient of the cost w.r.t. the parameters w. 
    """

    # YOUR CODE HERE
    m,n = X.shape
    dj_dw = np.zeros((n,))                            #(n,)
    dj_db = 0.0                                       #scalar

    for i in range(m):
        f_wb_i = sigmoid(np.dot(X[i],w) + b)          #(n,)(n,)=scalar
        err_i  = f_wb_i  - y[i]                       #scalar
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err_i * X[i,j]      #scalar
        dj_db = dj_db + err_i
    dj_dw = dj_dw/m                                   #(n,)
    dj_db = dj_db/m                                   #scalar

    for j in range(n):
        dj_dw[j] = dj_dw[j] + (lambda_/m) * w[j]
    # YOUR CODE END HERE 
    
    return dj_db, dj_dw  


Run the cell below to see it in action.

In [None]:
np.random.seed(1)
X_tmp = np.random.rand(5,3)
y_tmp = np.array([0,1,0,1,0])
w_tmp = np.random.rand(X_tmp.shape[1])
b_tmp = 0.5
lambda_tmp = 0.7
dj_db_tmp, dj_dw_tmp =  compute_gradient_logistic_reg(X_tmp, y_tmp, w_tmp, b_tmp, lambda_tmp)

print(f"dj_db: {dj_db_tmp}")
print(f"Regularized dj_dw:\n {dj_dw_tmp.tolist()}")

**Expected output**
```
dj_db: 0.341798994972791
Regularized dj_dw:
 [0.17380012933994293, 0.32007507881566943, 0.10776313396851499]
```
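
The two gradient functions in this lab differ only in how the prediction $f_{\mathbf{w},b}$ is computed. The sketch below makes that explicit with a single vectorized function and a switch. The name `compute_gradient_reg_vec`, the `logistic` flag, and `sigmoid_demo` are hypothetical and not part of the lab's utilities.

```python
import numpy as np

def sigmoid_demo(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)); stand-in for the lab's `sigmoid`."""
    return 1.0 / (1.0 + np.exp(-z))

def compute_gradient_reg_vec(X, y, w, b, lambda_, logistic=False):
    """Regularized gradient for linear or logistic regression, as (dj_db, dj_dw)."""
    m = X.shape[0]
    z = X @ w + b
    f_wb = sigmoid_demo(z) if logistic else z     # the only model-specific line
    err = f_wb - y                                # (m,)
    dj_dw = X.T @ err / m + (lambda_ / m) * w     # (n,), b is not regularized
    dj_db = np.mean(err)
    return dj_db, dj_dw

# Hand-checkable case: with w = 0 and b = 0, the linear model predicts 0
# and the logistic model predicts 0.5 for every example
X_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
y_demo = np.array([0.0, 1.0])
w_demo = np.zeros(2)

dj_db_lin, dj_dw_lin = compute_gradient_reg_vec(X_demo, y_demo, w_demo, 0.0, lambda_=1.0)
dj_db_log, dj_dw_log = compute_gradient_reg_vec(X_demo, y_demo, w_demo, 0.0, lambda_=1.0,
                                                logistic=True)
print(dj_db_lin, dj_dw_lin)   # -0.5 [-1.5 -2. ]
print(dj_db_log, dj_dw_log)   # 0.0 [-0.5 -0.5]
```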

<br>

## Rerunning the Overfitting Example

In [None]:
plt.close("all")
display(output)
ofit = overfit_example(True)

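In the plot above, try adding regularization to the earlier examples. Suggested steps:
- Classification (logistic regression)
    - Set degree to 6 and lambda to 0 (no regularization), then fit the data
    - Next, set lambda to 1 (stronger regularization), fit again, and observe the difference
- Regression (linear regression)
    - Follow the same steps and compare the results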
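
The regression experiment can also be approximated outside the widget. The sketch below is a standalone illustration and does not use `overfit_example`. It fits a degree-6 polynomial to synthetic noisy data by minimizing the regularized linear regression cost in closed form (a swapped-in technique, used here instead of gradient descent) and compares the size of the weights for lambda 0 and lambda 1. The helper `fit_poly_ridge` and the data are assumptions made for this example.

```python
import numpy as np

def fit_poly_ridge(x, y, degree, lambda_):
    """Closed-form minimizer of the regularized squared-error cost
    for polynomial features x^1 .. x^degree (hypothetical helper)."""
    X = np.column_stack([x ** p for p in range(1, degree + 1)])   # (m, degree)
    A = np.column_stack([X, np.ones_like(x)])                     # last column models b
    R = np.eye(degree + 1)
    R[-1, -1] = 0.0                                               # do not regularize b
    # Setting the gradient to zero gives (A^T A + lambda * R) theta = A^T y
    theta = np.linalg.solve(A.T @ A + lambda_ * R, A.T @ y)
    return theta[:-1], theta[-1]                                  # (w, b)

rng = np.random.default_rng(0)
x_demo = np.linspace(-1.0, 1.0, 20)
y_demo = x_demo ** 2 + 0.1 * rng.standard_normal(x_demo.size)    # noisy quadratic

w_no_reg, b_no_reg = fit_poly_ridge(x_demo, y_demo, degree=6, lambda_=0.0)
w_reg, b_reg = fit_poly_ridge(x_demo, y_demo, degree=6, lambda_=1.0)

# Regularization yields a smaller weight vector, which tends to give a smoother curve
print(np.linalg.norm(w_no_reg), np.linalg.norm(w_reg))
```

With lambda 0 the high-degree weights are free to grow in order to chase the noise; with lambda 1 the penalty keeps them small, which is the behavior to look for in the interactive plot.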

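## Congratulations!
You have:
- Worked through examples of the cost and gradient computations for regularized linear and logistic regression
- Built some intuition for how regularization reduces overfitting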