# [17. ConvMixer: Model Principles and a Line-by-Line PyTorch Implementation](https://www.bilibili.com/video/BV1K34y1o74P?spm_id_from=333.788.player.switch&vd_source=cdd897fffb54b70b076681c3c4e4d45d)

- [Patches Are All You Need?](https://arxiv.org/pdf/2201.09792)

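Convolutional neural networks have long been the dominant architecture in computer vision, but more recently (as of 2022) Transformer-based models such as ViT have also performed very well. ViT uses patch embeddings: the image is split into small patches, and each patch is treated as one unit. So does the performance of ViT-style models come from the **Transformer** or from the **image patches**? The ConvMixer paper argues that ViT's strength stems more from its patch-based representation than from the Transformer architecture itself.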

![image-20250619224855366](http://assets.hypervoid.top/img/2025/06/19/image-20250619224855366-2d03.png)
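
The patch embedding described above can be written as a single strided convolution. The sketch below is only an illustration, with arbitrarily chosen sizes (`patch_size = 7`, `h = 16`, 28x28 inputs) that are assumptions rather than values from the paper. It checks numerically that a `Conv2d` whose kernel size equals its stride gives the same result as cutting the image into non-overlapping patches and applying one shared linear projection to each flattened patch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

patch_size, h = 7, 16            # illustrative values, not taken from the paper
x = torch.randn(2, 3, 28, 28)    # a batch of 2 RGB images, 28x28 pixels

# Patch embedding as a strided convolution: kernel_size == stride == patch_size.
embed = nn.Conv2d(3, h, kernel_size=patch_size, stride=patch_size)

with torch.no_grad():
    out_conv = embed(x)          # shape: (2, h, 4, 4)

    # Equivalent view: extract non-overlapping patches, flatten each one,
    # and apply the same linear projection to every patch.
    patches = F.unfold(x, kernel_size=patch_size, stride=patch_size)  # (2, 3*7*7, 16)
    weight = embed.weight.view(h, -1)                                 # (h, 3*7*7)
    out_lin = weight @ patches + embed.bias.view(1, h, 1)             # (2, h, 16)
    out_lin = out_lin.view(2, h, 4, 4)

print(out_conv.shape)
print(torch.allclose(out_conv, out_lin, atol=1e-4))
```

Seen this way, the first layer of ConvMixer plays the same role as the ViT patch embedding, except that the patches stay on a 2D grid instead of being flattened into a token sequence.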

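```python
import torch.nn as nn


def ConvMixer(h, depth, kernel_size=9, patch_size=7, n_classes=1000):
    # ActBn wraps a layer as: layer -> GELU -> BatchNorm2d(h).
    Seq, ActBn = nn.Sequential, lambda x: Seq(x, nn.GELU(), nn.BatchNorm2d(h))
    # Residual is a Sequential subclass whose forward adds the input back.
    Residual = type("Residual", (Seq,), {"forward": lambda self, x: self[0](x) + x})
    return Seq(
        # Patch embedding: a convolution with kernel_size == stride == patch_size.
        ActBn(nn.Conv2d(3, h, patch_size, stride=patch_size)),
        *[
            Seq(
                # Depthwise convolution (groups=h) mixes spatial locations.
                Residual(ActBn(nn.Conv2d(h, h, kernel_size, groups=h, padding="same"))),
                # Pointwise (1x1) convolution mixes channels.
                ActBn(nn.Conv2d(h, h, 1)),
            )
            for _ in range(depth)
        ],
        # Global average pooling followed by a linear classifier.
        nn.AdaptiveAvgPool2d((1, 1)),
        nn.Flatten(),
        nn.Linear(h, n_classes),
    )
```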
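
The compact definition above relies on `type(...)` and lambdas, which can be hard to read. As a rough sketch, the same stack can be spelled out with an explicit residual module and then run on a dummy batch to check the output shape. The names `Residual` (as a class) and `conv_mixer_explicit`, and the small sizes used in the usage lines, are hypothetical choices for illustration only. The layers are also kept in one flat `nn.Sequential` rather than one nested `Sequential` per block, which should compute the same function.

```python
import torch
import torch.nn as nn


class Residual(nn.Module):
    """Adds the input back to the output of the wrapped module."""

    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x


def conv_mixer_explicit(h, depth, kernel_size=9, patch_size=7, n_classes=1000):
    def act_bn(conv):
        # layer -> GELU -> BatchNorm, the same ordering as in the compact version
        return nn.Sequential(conv, nn.GELU(), nn.BatchNorm2d(h))

    # Patch embedding: strided convolution with kernel_size == stride.
    layers = [act_bn(nn.Conv2d(3, h, patch_size, stride=patch_size))]
    for _ in range(depth):
        # Depthwise convolution with a residual connection (spatial mixing).
        depthwise = nn.Conv2d(h, h, kernel_size, groups=h, padding="same")
        layers.append(Residual(act_bn(depthwise)))
        # Pointwise 1x1 convolution (channel mixing).
        layers.append(act_bn(nn.Conv2d(h, h, kernel_size=1)))
    # Global average pooling followed by a linear classifier.
    layers += [nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(), nn.Linear(h, n_classes)]
    return nn.Sequential(*layers)


# Usage with deliberately small, arbitrary sizes so it runs quickly on CPU.
model = conv_mixer_explicit(h=64, depth=2, n_classes=10).eval()
with torch.no_grad():
    logits = model(torch.randn(2, 3, 56, 56))  # 56 / 7 = 8, so an 8x8 feature map

print(logits.shape)
print(sum(p.numel() for p in model.parameters()))
```

Because `padding="same"` keeps the feature map at a fixed resolution through every block, the only place the spatial size changes is the patch embedding, and the final pooling layer collapses whatever grid remains into one vector per image.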