# Emerging Properties in Self-Supervised Vision Transformers (ICCV 2021)

```mermaid
graph TD
    A[Input Images] --> B[Multi-crop Augmentation]
    B --> C[DINO Framework]
    C --> D[Student Network]
    C --> E[Teacher Network]
    D --> F[Student Output]
    E --> G[Teacher Output]
    F --> H[DINO Loss]
    G --> H
    D --> I[EMA Update]
    I --> E
```

DINO (self-**di**stillation with **no** labels) trains a ViT without any labeled data. At its core it is a teacher-student framework.

# Student-Teacher Framework


![alt text](../../Image/DINO_1.png)

![alt text](../../Image/Student-Teacher.png)

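Student network $g_{\theta_s}$: learns to predict the teacher network's output and is updated by gradient descent.

Teacher network $g_{\theta_t}$: provides the target representation and is **updated as an EMA of the student's parameters**, not by gradients.

- The two networks output probability distributions $P_s$ and $P_t$, obtained by applying a temperature-scaled softmax to each network's $K$-dimensional output:

  $P_s(x)^{(i)}=\frac{\exp(g_{\theta_s}(x)^{(i)}/\tau_s)}{\sum_{k=1}^K\exp(g_{\theta_s}(x)^{(k)}/\tau_s)}$

  Here $\tau_s$ is the student's temperature, which controls the sharpness of the output distribution. $P_t$ is defined the same way with the teacher's temperature $\tau_t$.
- A cross-entropy loss updates $\theta_s$: $\min_{\theta_s}H(P_t(x),P_s(x))$, where $H(a,b)=-a\log b$ is summed over the $K$ output dimensions.
- The teacher is updated by EMA: $\theta_t\leftarrow\lambda \theta_t+(1-\lambda)\theta_s$, where $\lambda$ is close to 1 and increases from 0.996 to 1 over training following a cosine schedule.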
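
The softmax, cross-entropy, and momentum schedule described above can be sketched in plain Python. This is a minimal illustration, not the reference implementation: the function names are hypothetical, the logits are made-up numbers, the temperatures are assumed values chosen so that the teacher is sharper than the student, and everything else in the training recipe is left out.

```python
import math


def softmax_with_temperature(logits, tau):
    """Softmax of logits / tau; a smaller tau gives a sharper distribution."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p_teacher, p_student):
    """H(a, b) = -sum_i a_i * log(b_i), with the teacher as the target."""
    return -sum(a * math.log(b) for a, b in zip(p_teacher, p_student))


def cosine_momentum(step, total_steps, base=0.996, final=1.0):
    """EMA coefficient lambda, rising from `base` to `final` on a cosine curve."""
    progress = step / max(1, total_steps - 1)
    return final - (final - base) * (math.cos(math.pi * progress) + 1) / 2


# Illustrative values only (assumptions, not results from the paper).
student_logits = [2.0, 1.0, 0.1]
teacher_logits = [2.5, 0.5, 0.2]
tau_s, tau_t = 0.1, 0.04

p_s = softmax_with_temperature(student_logits, tau_s)
p_t = softmax_with_temperature(teacher_logits, tau_t)
loss = cross_entropy(p_t, p_s)  # the quantity minimized with respect to theta_s
```

Only the student's distribution carries gradients in a real training loop; the teacher's distribution is treated as a fixed target.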

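**Network architecture**: A DINO network consists of a backbone (e.g. ViT or ResNet) and a projection head.

The projection head is a 3-layer multi-layer perceptron (MLP) followed by a weight-normalized fully connected layer with output dimension $K$. We do not use batch normalization (BN) during training, because the ViT architecture does not use BN by default. With a ViT backbone DINO is therefore entirely BN-free, which also helps training efficiency.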
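
A projection head of this shape might look as follows in PyTorch. This is a simplified sketch, not the repository's `DINOHead`: the class name `ProjectionHeadSketch`, the GELU activations, the layer widths, and the bottleneck size are assumptions made for illustration, and the L2 normalization before the last layer follows the paper's description rather than the note above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHeadSketch(nn.Module):
    """Hypothetical stand-in for a DINO-style projection head (no BN)."""

    def __init__(self, in_dim, out_dim, hidden_dim=2048, bottleneck_dim=256):
        super().__init__()
        # 3-layer MLP; no batch normalization anywhere
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, bottleneck_dim),
        )
        # weight-normalized fully connected layer producing K = out_dim outputs
        self.last_layer = nn.utils.weight_norm(
            nn.Linear(bottleneck_dim, out_dim, bias=False)
        )

    def forward(self, x):
        x = self.mlp(x)
        x = F.normalize(x, p=2, dim=-1)  # L2-normalize the bottleneck features
        return self.last_layer(x)


# Small sizes so the example runs quickly; 384 is assumed as the backbone width.
head = ProjectionHeadSketch(in_dim=384, out_dim=1024, hidden_dim=512, bottleneck_dim=64)
features = torch.randn(4, 384)
logits = head(features)  # shape: (batch, K)
```

The logits produced here are what the temperature-scaled softmax is applied to.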

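    import torch
    import torch.nn as nn
    from torchvision import models as torchvision_models
    # utils and vision_transformer are modules from the DINO repository;
    # `args` is the argparse namespace built by the training script
    import utils
    import vision_transformer as vits
    from vision_transformer import DINOHead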
    # ============ building student and teacher networks ... ============
    # we changed the name DeiT-S for ViT-S to avoid confusions
    args.arch = args.arch.replace("deit", "vit")

    # Load three things: the student model, the teacher model, and embed_dim
    # if the network is a Vision Transformer (i.e. vit_tiny, vit_small, vit_base)
    if args.arch in vits.__dict__.keys():
        student = vits.__dict__[args.arch](
            patch_size=args.patch_size,
            drop_path_rate=args.drop_path_rate,  # stochastic depth
        )
        teacher = vits.__dict__[args.arch](patch_size=args.patch_size)
        embed_dim = student.embed_dim

    # if the network is a XCiT
    elif args.arch in torch.hub.list("facebookresearch/xcit:main"):
        student = torch.hub.load('facebookresearch/xcit:main', args.arch,
                                 pretrained=False, drop_path_rate=args.drop_path_rate)
        teacher = torch.hub.load('facebookresearch/xcit:main', args.arch, pretrained=False)
        embed_dim = student.embed_dim

    # otherwise, we check if the architecture is in torchvision models
    elif args.arch in torchvision_models.__dict__.keys():
        student = torchvision_models.__dict__[args.arch]()
        teacher = torchvision_models.__dict__[args.arch]()
        embed_dim = student.fc.weight.shape[1]
    else:
        # fail early: the code below needs student, teacher and embed_dim
        raise ValueError(f"Unknown architecture: {args.arch}")

    # multi-crop wrapper handles forward with inputs of different resolutions
    student = utils.MultiCropWrapper(student, DINOHead(
        embed_dim,
        args.out_dim,
        use_bn=args.use_bn_in_head,
        norm_last_layer=args.norm_last_layer,
    ))
    teacher = utils.MultiCropWrapper(
        teacher,
        DINOHead(embed_dim, args.out_dim, args.use_bn_in_head),
    )

    # move networks to gpu
    student, teacher = student.cuda(), teacher.cuda()

    # synchronize batch norms (if any)
    if utils.has_batchnorms(student):
        student = nn.SyncBatchNorm.convert_sync_batchnorm(student)
        teacher = nn.SyncBatchNorm.convert_sync_batchnorm(teacher)

        # we need DDP wrapper to have synchro batch norms working...
        teacher = nn.parallel.DistributedDataParallel(teacher, device_ids=[args.gpu])
        teacher_without_ddp = teacher.module
    else:
        # teacher_without_ddp and teacher are the same thing
        teacher_without_ddp = teacher

    # ---------------------------------------------------------------------------
    student = nn.parallel.DistributedDataParallel(student, device_ids=[args.gpu])
    # teacher and student start with the same weights
    teacher_without_ddp.load_state_dict(student.module.state_dict())
    # there is no backpropagation through the teacher, so no need for gradients
    for p in teacher.parameters():
        p.requires_grad = False
    # ---------------------------------------------------------------------------
    print(f"Student and Teacher are built: they are both {args.arch} network.")
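
The code above only initializes the teacher from the student and freezes it. The EMA step that moves the teacher afterwards could be sketched as below. This is a minimal example under stated assumptions: two small `nn.Linear` layers stand in for the real networks, `ema_update` is a hypothetical helper name, and the momentum is fixed at 0.996 rather than scheduled.

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the student and teacher networks (assumption).
student = nn.Linear(4, 2)
teacher = nn.Linear(4, 2)

# Same setup as above: identical starting weights, no gradients for the teacher.
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad = False

# Pretend one optimizer step changed the student (here: add 1 to every parameter).
with torch.no_grad():
    for p in student.parameters():
        p.add_(1.0)


@torch.no_grad()
def ema_update(student, teacher, momentum):
    """theta_t <- momentum * theta_t + (1 - momentum) * theta_s, in place."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)


before = teacher.weight.clone()
ema_update(student, teacher, momentum=0.996)
```

With a momentum this close to 1 the teacher moves only a small fraction of the way towards the student on each step, which is why it changes more smoothly than the student.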

## Multi-Crop Data Augmentation

- Global views: