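# Paper 15: Identity Mappings in Deep Residual Networks
## Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun (2016)

### Pre-Activation ResNet

A refined residual block with better gradient flow. The key insight: move the activation (BN and ReLU) to **before** the convolution, so the skip connection stays a pure identity.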

In [None]:
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

## Original ResNet Block

```
x → Conv → BN → ReLU → Conv → BN → (+) → ReLU → output
    ↓                                  ↑
    └──────────── identity ────────────┘
```

In [None]:
def relu(x):
    return np.maximum(0, x)

def batch_norm_1d(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Simplified 1-D batch normalization.

    With no batch dimension, this normalizes across the features of a single
    vector, so it is a stand-in for BN rather than true batch statistics.
    """
    mean = np.mean(x)
    var = np.var(x)
    x_normalized = (x - mean) / np.sqrt(var + eps)
    return gamma * x_normalized + beta

class OriginalResidualBlock:
    """Original ResNet block (post-activation)."""
    def __init__(self, dim):
        self.dim = dim
        # Two weight layers; dense matrices stand in for the convolutions
        self.W1 = np.random.randn(dim, dim) * 0.01
        self.W2 = np.random.randn(dim, dim) * 0.01

    def forward(self, x):
        """
        Original: x → Conv → BN → ReLU → Conv → BN → (+x) → ReLU
        """
        # First conv-bn-relu
        out = np.dot(self.W1, x)
        out = batch_norm_1d(out)
        out = relu(out)

        # Second conv-bn
        out = np.dot(self.W2, out)
        out = batch_norm_1d(out)

        # Add the identity (residual connection)
        out = out + x

        # Final ReLU (post-activation)
        out = relu(out)

        return out

# Test
original_block = OriginalResidualBlock(dim=8)
x = np.random.randn(8)
output_original = original_block.forward(x)

print(f"Input: {x[:4]}...")
print(f"Original ResNet output: {output_original[:4]}...")

## Pre-Activation ResNet Block

```
x → BN → ReLU → Conv → BN → ReLU → Conv → (+) → output
    ↓                                       ↑
    └──────────── identity ─────────────────┘
```

**Key difference**: the activation comes **before** the convolution, which leaves a clean identity path.

In [None]:
class PreActivationResidualBlock:
    """Pre-activation ResNet block (improved version)."""
    def __init__(self, dim):
        self.dim = dim
        # Dense matrices stand in for the convolutions
        self.W1 = np.random.randn(dim, dim) * 0.01
        self.W2 = np.random.randn(dim, dim) * 0.01

    def forward(self, x):
        """
        Pre-activation: x → BN → ReLU → Conv → BN → ReLU → Conv → (+x)
        """
        # First bn-relu-conv
        out = batch_norm_1d(x)
        out = relu(out)
        out = np.dot(self.W1, out)

        # Second bn-relu-conv
        out = batch_norm_1d(out)
        out = relu(out)
        out = np.dot(self.W2, out)

        # Add the identity (no activation afterwards!)
        out = out + x

        return out

# Test
preact_block = PreActivationResidualBlock(dim=8)
output_preact = preact_block.forward(x)

print(f"\nPre-activation ResNet output: {output_preact[:4]}...")
print("\nKey difference: clean identity path (no ReLU after the addition)")
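
Because nothing follows the addition, stacking pre-activation blocks gives a purely additive recursion: the output of the last block equals the input plus the sum of every residual branch output. The sketch below checks this numerically. It is self-contained and re-declares a small residual branch; `normalize` and `residual_branch` are illustrative names, not part of the notebook's classes, and dense matrices again stand in for convolutions.

```python
import numpy as np

def normalize(v, eps=1e-5):
    # Stand-in for BN on a single vector (same simplification as batch_norm_1d)
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def residual_branch(x, W1, W2):
    # BN → ReLU → Conv → BN → ReLU → Conv
    h = W1 @ np.maximum(0, normalize(x))
    return W2 @ np.maximum(0, normalize(h))

rng = np.random.default_rng(0)
dim, depth = 8, 10
weights = [(rng.normal(0, 0.01, (dim, dim)), rng.normal(0, 0.01, (dim, dim)))
           for _ in range(depth)]

x0 = rng.normal(size=dim)
x = x0.copy()
residual_sum = np.zeros(dim)
for W1, W2 in weights:
    f = residual_branch(x, W1, W2)
    residual_sum += f
    x = x + f  # no activation after the addition

# x_L == x_0 + sum of all residual outputs (up to floating-point rounding)
unrolled_ok = np.allclose(x, x0 + residual_sum)
print("additive unrolling holds:", unrolled_ok)
```

The same check would fail for the original block, since the ReLU after each addition breaks the plain sum.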

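## Gradient Flow Analysis

Why pre-activation helps. The cell below is a toy illustration: the backward pass is mimicked with random scaling factors rather than computed from the blocks, so read the plot qualitatively.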

In [None]:
def compute_gradient_flow(block_type, num_layers=10, input_dim=8):
    """
    Toy illustration of gradient flow through a stack of residual blocks.

    The forward pass is real; the backward pass is only mimicked with random
    scaling factors, so the resulting curves are qualitative, not measured.
    """
    x = np.random.randn(input_dim)

    # Build the blocks
    if block_type == 'original':
        blocks = [OriginalResidualBlock(input_dim) for _ in range(num_layers)]
    else:
        blocks = [PreActivationResidualBlock(input_dim) for _ in range(num_layers)]

    # Forward pass
    activations = [x]
    current = x
    for block in blocks:
        current = block.forward(current)
        activations.append(current.copy())

    # Mimicked backward pass (heuristic, not true backpropagation)
    grad = np.ones(input_dim)  # gradient arriving from the loss
    gradients = [grad]

    for i in range(num_layers):
        # In a residual block the gradient splits into identity + residual paths
        if block_type == 'original':
            # Post-activation: the ReLU derivative attenuates part of the gradient.
            # Modeled with an assumed factor drawn from [0.5, 1.0].
            grad_through_residual = grad * np.random.uniform(0.5, 1.0, input_dim)
            grad = grad + grad_through_residual  # identity + residual
        else:
            # Pre-activation: clean identity path.
            # Modeled with an assumed, milder factor drawn from [0.7, 1.0].
            grad_through_residual = grad * np.random.uniform(0.7, 1.0, input_dim)
            grad = grad + grad_through_residual  # identity + residual

        gradients.append(grad.copy())

    return activations, gradients

# Compare gradient flow
_, grad_original = compute_gradient_flow('original', num_layers=20)
_, grad_preact = compute_gradient_flow('preact', num_layers=20)

# Compute gradient magnitudes
grad_mag_original = [np.linalg.norm(g) for g in grad_original]
grad_mag_preact = [np.linalg.norm(g) for g in grad_preact]

# Plot
plt.figure(figsize=(12, 5))
plt.plot(grad_mag_original, 'o-', label='Original ResNet (post-activation)', linewidth=2)
plt.plot(grad_mag_preact, 's-', label='Pre-activation ResNet', linewidth=2)
plt.xlabel('Layer (from output to input)', fontsize=12)
plt.ylabel('Gradient magnitude', fontsize=12)
plt.title('Gradient flow comparison', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Original ResNet gradient at input: {grad_mag_original[-1]:.2f}")
print(f"Pre-activation gradient at input: {grad_mag_preact[-1]:.2f}")
print("\nIn this toy model, pre-activation retains the larger gradient.")
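
The simulation above uses random scaling factors as a stand-in for the backward pass. As a complementary check, the following sketch estimates the actual input-output Jacobian of one block of each type by central finite differences and compares it with the expected structure: `I + J_F` for pre-activation, and `diag(ReLU'(F(x) + x)) · (I + J_F)` for the original block. All names here (`num_jacobian`, `branch_post`, `branch_pre`, `norm`) are hypothetical helpers introduced for illustration, with dense matrices standing in for convolutions.

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

def norm(v, eps=1e-5):
    # Single-vector stand-in for BN
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def branch_post(x, W1, W2):
    # Residual branch of the original block: Conv → BN → ReLU → Conv → BN
    return norm(W2 @ relu(norm(W1 @ x)))

def branch_pre(x, W1, W2):
    # Residual branch of the pre-activation block: BN → ReLU → Conv → BN → ReLU → Conv
    return W2 @ relu(norm(W1 @ relu(norm(x))))

def num_jacobian(f, x, h=1e-5):
    # Central finite differences; column j holds d f / d x_j
    J = np.zeros((f(x).size, x.size))
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

rng = np.random.default_rng(0)
dim = 8
W1 = rng.normal(0, 0.01, (dim, dim))
W2 = rng.normal(0, 0.01, (dim, dim))
x = rng.normal(size=dim)
I = np.eye(dim)

# Pre-activation block: y = x + F(x), so J = I + J_F
J_pre = num_jacobian(lambda v: v + branch_pre(v, W1, W2), x)
J_F_pre = num_jacobian(lambda v: branch_pre(v, W1, W2), x)
pre_ok = np.allclose(J_pre, I + J_F_pre, atol=1e-6)

# Original block: y = ReLU(x + F(x)), so J = diag(mask) · (I + J_F)
J_post = num_jacobian(lambda v: relu(v + branch_post(v, W1, W2)), x)
J_F_post = num_jacobian(lambda v: branch_post(v, W1, W2), x)
mask = (x + branch_post(x, W1, W2) > 0).astype(float)
post_ok = np.allclose(J_post, mask[:, None] * (I + J_F_post), atol=1e-6)

print("pre-activation: J = I + J_F             ->", pre_ok)
print("original:       J = diag(mask)(I + J_F) ->", post_ok)
print("rows zeroed by the final ReLU:", int((mask == 0).sum()), "of", dim)
```

Every row where `mask` is zero loses its gradient entirely in the original block, identity term included; the pre-activation Jacobian always keeps the full `I`.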

## Activation Placement Variants

The paper analyzes several placement options:

In [None]:
# Summarize the different architectures
# (the star scores are this notebook's informal ranking, not figures from the paper)
architectures = [
    {
        'name': 'Original',
        'structure': 'x → Conv → BN → ReLU → Conv → BN → (+x) → ReLU',
        'identity': 'Blocked by ReLU',
        'score': '★★★☆☆'
    },
    {
        'name': 'BN after addition',
        'structure': 'x → Conv → BN → ReLU → Conv → (+x) → BN → ReLU',
        'identity': 'Altered by BN & ReLU',
        'score': '★★☆☆☆'
    },
    {
        'name': 'ReLU before addition',
        'structure': 'x → Conv → BN → ReLU → Conv → BN → ReLU → (+x)',
        'identity': 'Clean, but the residual branch is non-negative',
        'score': '★★☆☆☆'
    },
    {
        'name': 'Full pre-activation',
        'structure': 'x → BN → ReLU → Conv → BN → ReLU → Conv → (+x)',
        'identity': 'Clean! ✓',
        'score': '★★★★★'
    },
]

print("\n" + "="*80)
print("Residual block architecture comparison")
print("="*80 + "\n")

for i, arch in enumerate(architectures, 1):
    print(f"{i}. {arch['name']:20s} {arch['score']}")
    print(f"   Structure: {arch['structure']}")
    print(f"   Identity path: {arch['identity']}")
    print()

print("="*80)
print("Winner: full pre-activation (BN → ReLU → Conv)")
print("="*80)
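
One variant deserves a closer look. When a ReLU is placed at the end of the residual branch, just before the addition, the shortcut itself is untouched, but the branch can only add non-negative values, so every feature can only stay the same or grow from block to block. The sketch below illustrates that monotonic behaviour with a hypothetical `relu_before_add_block` helper; dense matrices stand in for convolutions and `norm` is a single-vector substitute for BN.

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

def norm(v, eps=1e-5):
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def relu_before_add_block(x, W1, W2):
    # x → Conv → BN → ReLU → Conv → BN → ReLU → (+x)
    branch = relu(norm(W2 @ relu(norm(W1 @ x))))
    return x + branch  # branch >= 0 elementwise

rng = np.random.default_rng(0)
dim, depth = 8, 20
x = rng.normal(size=dim)
states = [x]
for _ in range(depth):
    W1 = rng.normal(0, 0.01, (dim, dim))
    W2 = rng.normal(0, 0.01, (dim, dim))
    x = relu_before_add_block(x, W1, W2)
    states.append(x)

steps = np.diff(np.array(states), axis=0)  # change per block, shape (depth, dim)
never_decreases = bool(np.all(steps >= 0))
print("features never decrease:", never_decreases)
print("feature mean, input vs. output:", states[0].mean(), states[-1].mean())
```

A residual function that can take both signs does not have this one-way drift, which is one reason to keep the ReLU inside the branch rather than at its end.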

## Deep Network Comparison

In [None]:
class DeepResNet:
    """A stack of residual blocks."""
    def __init__(self, dim, num_blocks, block_type='preact'):
        self.blocks = []
        for _ in range(num_blocks):
            if block_type == 'preact':
                self.blocks.append(PreActivationResidualBlock(dim))
            else:
                self.blocks.append(OriginalResidualBlock(dim))

    def forward(self, x):
        activations = [x]
        for block in self.blocks:
            x = block.forward(x)
            activations.append(x.copy())
        return x, activations

# Compare deep networks
depth = 50
dim = 16
x_input = np.random.randn(dim)

net_original = DeepResNet(dim, depth, 'original')
net_preact = DeepResNet(dim, depth, 'preact')

out_original, acts_original = net_original.forward(x_input)
out_preact, acts_preact = net_preact.forward(x_input)

# Compute activation statistics
norms_original = [np.linalg.norm(a) for a in acts_original]
norms_preact = [np.linalg.norm(a) for a in acts_preact]

# Plot activation norms
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 5))

# Activation magnitude
ax1.plot(norms_original, label='Original ResNet', linewidth=2)
ax1.plot(norms_preact, label='Pre-activation ResNet', linewidth=2)
ax1.set_xlabel('Layer', fontsize=12)
ax1.set_ylabel('Activation magnitude', fontsize=12)
ax1.set_title(f'Activation flow (depth={depth})', fontsize=14)
ax1.legend()
ax1.grid(True, alpha=0.3)

# Heatmap of the activation difference (rows: features, columns: layers)
acts_matrix_original = np.array(acts_original).T
acts_matrix_preact = np.array(acts_preact).T

im = ax2.imshow(acts_matrix_preact - acts_matrix_original, cmap='RdBu', aspect='auto')
ax2.set_xlabel('Layer', fontsize=12)
ax2.set_ylabel('Feature dimension', fontsize=12)
ax2.set_title('Difference (pre-activation - original)', fontsize=14)
plt.colorbar(im, ax=ax2)

plt.tight_layout()
plt.show()

print(f"\nOriginal ResNet final norm: {norms_original[-1]:.4f}")
print(f"Pre-activation final norm: {norms_preact[-1]:.4f}")

## Identity Mapping Analysis

In [None]:
def test_identity_mapping(block, num_tests=100):
    """
    Test whether the block can represent an identity mapping
    (when the residual path learns zero, the output should equal the input).

    Note: this overwrites the block's weights in place.
    """
    # Zero the weights (the residual path contributes nothing)
    block.W1 = np.zeros_like(block.W1)
    block.W2 = np.zeros_like(block.W2)

    errors = []
    for _ in range(num_tests):
        x = np.random.randn(block.dim)
        y = block.forward(x)
        error = np.linalg.norm(y - x)
        errors.append(error)

    return np.mean(errors), np.std(errors)

# Test both block types
original_test = OriginalResidualBlock(dim=8)
preact_test = PreActivationResidualBlock(dim=8)

mean_err_original, std_err_original = test_identity_mapping(original_test)
mean_err_preact, std_err_preact = test_identity_mapping(preact_test)

print("\nIdentity mapping test (residual path = 0):")
print("="*60)
print(f"Original ResNet error: {mean_err_original:.6f} ± {std_err_original:.6f}")
print(f"Pre-activation error:  {mean_err_preact:.6f} ± {std_err_preact:.6f}")
print("="*60)
print(f"\nPre-activation has the {'better' if mean_err_preact < mean_err_original else 'worse'} identity mapping!")
print("(lower error = cleaner identity path)")
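
To see where the original block's error comes from: with both weight matrices at zero the residual branch outputs zero, so the block reduces to `ReLU(x)` and every negative input entry is clipped. A minimal self-contained check of that claim follows, using a hypothetical `original_block_zero_weights` helper that mirrors the op order of the original block's forward pass.

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

def norm(v, eps=1e-5):
    # Single-vector stand-in for BN; maps an all-zero vector to all zeros
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def original_block_zero_weights(x):
    # Conv → BN → ReLU → Conv → BN → (+x) → ReLU, with W1 = W2 = 0
    W = np.zeros((x.size, x.size))
    out = relu(norm(W @ x))
    out = norm(W @ out)
    return relu(out + x)

rng = np.random.default_rng(0)
x = rng.normal(size=8)
y = original_block_zero_weights(x)

# The block collapses to ReLU(x); the error is the norm of the negative part of x
error = np.linalg.norm(y - x)
negative_part = np.linalg.norm(np.minimum(0, x))
print("block output equals ReLU(x):", np.allclose(y, relu(x)))
print("identity error:", error, " norm of negative entries:", negative_part)
```

The pre-activation block has no such clipping: with zero weights it returns `x` exactly, so its error is zero.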

## Visual Architecture Comparison

In [None]:
# Build the visual comparison
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

def draw_block(ax, title, is_preact=False):
    ax.set_xlim(0, 10)
    ax.set_ylim(0, 12)
    ax.axis('off')
    ax.set_title(title, fontsize=14, fontweight='bold', pad=20)

    if is_preact:
        # Pre-activation branch: BN → ReLU → Conv → BN → ReLU → Conv
        operations = ['BN', 'ReLU', 'Conv', 'BN', 'ReLU', 'Conv']
        colors = ['lightgreen', 'lightyellow', 'lightblue', 'lightgreen', 'lightyellow', 'lightblue']
    else:
        # Original branch: Conv → BN → ReLU → Conv → BN
        # (the final ReLU* comes after the addition and is drawn below)
        operations = ['Conv', 'BN', 'ReLU', 'Conv', 'BN']
        colors = ['lightblue', 'lightgreen', 'lightyellow', 'lightblue', 'lightgreen']

    y_pos = 11
    width, height = 2, 1
    add_y = y_pos - len(operations) * 1.5

    # Identity path (left), from the input down to the addition
    ax.plot([1, 1], [add_y, 11], 'b-', linewidth=4, label='Identity path')
    ax.arrow(1, 10.5, 0, -0.3, head_width=0.3, head_length=0.2, fc='blue', ec='blue')

    # Residual path (right)
    for i, (op, color) in enumerate(zip(operations, colors)):
        y = y_pos - i * 1.5

        # Draw the box (face and edge colors set separately so the black edge is kept)
        ax.add_patch(plt.Rectangle((6-width/2, y-height/2), width, height,
                                   facecolor=color, edgecolor='black', linewidth=2))
        ax.text(6, y, op, ha='center', va='center', fontsize=11, fontweight='bold')

        # Arrow to the next operation
        if i < len(operations) - 1:
            ax.arrow(6, y-height/2-0.1, 0, -0.3, head_width=0.2, head_length=0.1,
                    fc='black', ec='black', linewidth=1.5)

    # Connect the last operation to the addition
    ax.plot([6, 6], [y - height/2, add_y], 'k-', linewidth=2)

    # Addition
    ax.plot([1, 6], [add_y, add_y], 'k-', linewidth=2)
    ax.scatter([3.5], [add_y], s=500, c='white', edgecolors='black', linewidths=3, zorder=5)
    ax.text(3.5, add_y, '+', ha='center', va='center', fontsize=20, fontweight='bold', zorder=6)

    # Original block only: the final ReLU* sits after the addition
    out_y = add_y
    if not is_preact:
        relu_y = add_y - 1.3
        ax.plot([3.5, 3.5], [add_y, relu_y + height/2], 'k-', linewidth=2)
        ax.add_patch(plt.Rectangle((3.5-width/2, relu_y-height/2), width, height,
                                   facecolor='lightcoral', edgecolor='black', linewidth=2))
        ax.text(3.5, relu_y, 'ReLU*', ha='center', va='center', fontsize=11, fontweight='bold')
        out_y = relu_y - height/2

    # Output arrow
    ax.arrow(3.5, out_y-0.3, 0, -0.5, head_width=0.3, head_length=0.2,
            fc='green', ec='green', linewidth=3)
    ax.text(3.5, out_y-1.2, 'Output', ha='center', fontsize=12, fontweight='bold')

    # Input labels
    ax.text(1, 11.5, 'Input', ha='center', fontsize=12, fontweight='bold')
    ax.text(6, 11.5, 'Input', ha='center', fontsize=12, fontweight='bold')

    # Annotation
    if not is_preact:
        ax.text(8.5, add_y, 'ReLU* blocks\nidentity!', fontsize=10, color='red',
               bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    else:
        ax.text(8.5, add_y, 'Clean\nidentity!', fontsize=10, color='green',
               bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.5))

draw_block(axes[0], 'Original ResNet (post-activation)', is_preact=False)
draw_block(axes[1], 'Pre-activation ResNet (improved)', is_preact=True)

plt.tight_layout()
plt.show()

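## Key Takeaways

### The identity mapping problem:

In the original ResNet:
```
y = ReLU(F(x) + x)
```
The **ReLU after the addition** sits on the skip path, so the shortcut is no longer a pure identity: negative values are zeroed at every block.

### The pre-activation solution:

```
y = F'(x) + x
```
where F'(x) = Conv(ReLU(BN(Conv(ReLU(BN(x))))))

**Clean identity path** → better gradient flow!

### Key changes:

1. **Move BN and ReLU before Conv**: `x → BN → ReLU → Conv`
2. **Remove the final ReLU**: no activation after the addition
3. **Result**: the identity path is a true identity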

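### Gradient flow:

**Original**:
```
∂L/∂x = ∂L/∂y · ReLU'(F(x) + x) · (∂F/∂x + I)
```
The ReLU derivative multiplies the whole gradient, identity term included, so it can zero out the gradient on the shortcut.

**Pre-activation**:
```
∂L/∂x = ∂L/∂y · (∂F'/∂x + I)
```
The identity term `I` passes the gradient through unchanged.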
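
Applying the pre-activation relation recursively makes the argument explicit. Writing $x_l$ for the input of block $l$, $F'_l$ for its residual branch, and $x_K$ for the input of a deeper block $K > l$ (the indices are notation introduced here for illustration), the forward and backward passes unroll as:

```latex
\begin{aligned}
x_{l+1} &= x_l + F'_l(x_l) \\
x_K &= x_l + \sum_{i=l}^{K-1} F'_i(x_i) \\
\frac{\partial L}{\partial x_l}
  &= \frac{\partial L}{\partial x_K}
     \left( I + \frac{\partial}{\partial x_l} \sum_{i=l}^{K-1} F'_i(x_i) \right)
\end{aligned}
```

The $I$ term carries $\partial L / \partial x_K$ directly to any shallower block, however many blocks lie in between. With a ReLU after each addition the sum form no longer holds, because each step becomes $x_{l+1} = \mathrm{ReLU}(x_l + F_l(x_l))$.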

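### Benefits:

- ✅ **Better gradient flow**: nothing blocks the identity path
- ✅ **Easier optimization**: much deeper networks become trainable (1000+ layers)
- ✅ **Better accuracy**: small but consistent improvements
- ✅ **Regularization**: BN before Conv acts as a regularizer

### Comparison:

| Architecture | Identity path | Gradient flow | Performance |
|------|---------|--------|------|
| Original ResNet | Blocked by ReLU | Good | ★★★★☆ |
| Pre-activation | **Clean** | **Better** | ★★★★★ |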

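### Implementation tips:

1. Use pre-activation for very deep networks (>50 layers)
2. Keep the original ResNet for shallower networks (backward compatibility)
3. The first layer can stay post-activation (there is no identity path yet)
4. After the last block, add a final BN → ReLU, since the output of the last addition has not been activated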
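
As a rough sketch of how such a network can be assembled, assuming a stack of pre-activation blocks followed by one extra normalization and ReLU before a linear classifier. The helper names (`preact_block`, `tiny_preact_net`), the sizes, and the random weights are all hypothetical; dense matrices stand in for convolutions and `norm` is a single-vector substitute for BN.

```python
import numpy as np

def relu(v):
    return np.maximum(0, v)

def norm(v, eps=1e-5):
    # Single-vector stand-in for BN
    return (v - v.mean()) / np.sqrt(v.var() + eps)

def preact_block(x, W1, W2):
    # BN → ReLU → Conv → BN → ReLU → Conv → (+x), nothing after the addition
    return x + W2 @ relu(norm(W1 @ relu(norm(x))))

def tiny_preact_net(x, block_weights, W_out):
    for W1, W2 in block_weights:
        x = preact_block(x, W1, W2)
    # The last addition is not followed by an activation inside the block,
    # so normalize and activate once here before the classifier.
    x = relu(norm(x))
    return W_out @ x

rng = np.random.default_rng(0)
dim, depth, num_classes = 16, 50, 10
block_weights = [(rng.normal(0, 0.01, (dim, dim)), rng.normal(0, 0.01, (dim, dim)))
                 for _ in range(depth)]
W_out = rng.normal(0, 0.1, (num_classes, dim))

logits = tiny_preact_net(rng.normal(size=dim), block_weights, W_out)
print("logits shape:", logits.shape)
```

Placing the single extra activation outside the blocks keeps every skip connection in the stack a pure identity.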

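### Results:

- CIFAR-10: a 1001-layer network was trained successfully, showing that 1000+ layer networks are trainable
- ImageNet: consistent improvement over the original ResNet

### Why it matters:

This paper shows that **architectural details matter**. A small change (moving BN/ReLU) can have a major effect on trainability and performance. It is a key example of iterative refinement in deep learning research.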