Source code for the network models officially implemented by PyTorch: https://github.com/pytorch/vision/tree/main/torchvision/models

#### 1. Load the model
- Load the model directly from torchvision.models
- torchvision.models includes network models such as resnet, mobilenet and vit, which can be called directly

In [11]:
from torchvision.models.vision_transformer import vit_b_16
import torch.nn as nn

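#### 2. Load pretrained weights
1. Download automatically from the network
- Before using ViT_B_16_Weights, import it: `from torchvision.models.vision_transformer import ViT_B_16_Weights`
- "IMAGENET1K_V1" names a set of weights trained on the ImageNet-1K dataset (version 1), not a dataset itself; the other available weights are listed as members of ViT_B_16_Weights
- `vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)`
- `vit_b_16(weights="IMAGENET1K_V1")`

2. Load from a weight file that has already been downloaded locally
3. Do not use pretrained weights (the same arguments apply to other torchvision models, for example resnet50)
- `resnet50(weights=None)`
- `resnet50()`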

In [6]:
# 1. Download automatically
net = vit_b_16(weights="IMAGENET1K_V1")

In [None]:
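# 2. Load from a local file
import torch  # needed for torch.load

model_weight_path = "vit_b_16-c867db91.pth"
net = vit_b_16(weights=None)  # build the architecture without downloading weights
net.load_state_dict(torch.load(model_weight_path, map_location='cpu'))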

In [7]:
net

VisionTransformer(
  (conv_proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
  (encoder): Encoder(
    (dropout): Dropout(p=0.0, inplace=False)
    (layers): Sequential(
      (encoder_layer_0): EncoderBlock(
        (ln_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
        )
        (dropout): Dropout(p=0.0, inplace=False)
        (ln_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): MLPBlock(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU(approximate=none)
          (2): Dropout(p=0.0, inplace=False)
          (3): Linear(in_features=3072, out_features=768, bias=True)
          (4): Dropout(p=0.0, inplace=False)
        )
      )
      (encoder_layer_1): EncoderBlock(
        (ln_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (self_att

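#### 3. Change the final output dimension to match the number of classes
- Access the layer by its actual name in the network; different networks use different names for the final layer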

In [8]:
net.heads.head

Linear(in_features=768, out_features=1000, bias=True)

In [9]:
# First get the number of input features of the last layer
in_channel = net.heads.head.in_features
in_channel

768

In [12]:
# Then redefine the last layer; its attribute name must match the actual name in the network
net.heads.head = nn.Linear(in_channel, 5)

In [13]:
net

VisionTransformer(
  (conv_proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
  (encoder): Encoder(
    (dropout): Dropout(p=0.0, inplace=False)
    (layers): Sequential(
      (encoder_layer_0): EncoderBlock(
        (ln_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
        )
        (dropout): Dropout(p=0.0, inplace=False)
        (ln_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): MLPBlock(
          (0): Linear(in_features=768, out_features=3072, bias=True)
          (1): GELU(approximate=none)
          (2): Dropout(p=0.0, inplace=False)
          (3): Linear(in_features=3072, out_features=768, bias=True)
          (4): Dropout(p=0.0, inplace=False)
        )
      )
      (encoder_layer_1): EncoderBlock(
        (ln_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (self_att

The output dimension of the last layer has now been changed from 1000 to 5, and the modified model can be used for training.