# MoCo: 无监督视觉表征学习的动量对比

MoCo是一种为比较学习构建动态词典的机制，可用于各种借口任务。在本文中，我们遵循一个简单的示例辨别任务：如果它们是同一图像的编码视图（例如，不同的裁剪），则查询匹配一个键。利用这个借口任务，MoCo展示了在Cifar10数据集中线性分类的通用协议下的竞争结果。

## 模型简介

MoCo网络从另外一个角度来理解对比学习，即从一个字典查询的角度来理解对比学习。MoCo中建立一个动态的字典，这个字典由两个部分组成：一个队列和一个移动平均的编码器。队列中的样本不需要做梯度反传，因此可以在队列中存储很多负样本，从而使这个字典可以变得很大。使用移动平均编码器的目的是使队列中的样本特征尽可能保持一致（即不同的样本通过尽量相似的编码器获得特征的编码表示）。在MoCo网络的构建过程中，使用Resnrt作为网络的骨干网络，除此之外，还需要利用动量更新机制去对网络进行动量更新，在模型结构中由一个momentum updata函数进行更新，在模型输出阶段，有infonce_loss函数来求模型损失值。MoCo网络结构示意图如下图所示。

<p align="center"><img src="image/moco.png" width="300"></p>

## 对比损失机制

比较了现今具有影响力的三种对比机制。为了关注对比损失机制的影响，我们以相同的借口任务实施了所有这些机制。并且还使用与对比损失函数相同的InfoNCE形式。因此，仅在三种机制上进行比较。结果在下图中。总体而言，这三种机制都受益于较大的K值。在内存机制下的中也观察到了类似的趋势，而在这里我们表明这种趋势更为普遍，并且可以在所有机制中看到。这些结果支持了我们建立大型字典去进行MOCO实验的动机。

<p align="center"><img src="image/construction.png" width="400"></p>

## MoCo训练

### 1 数据处理

开始实验之前，请确保本地已经安装了Python环境并安装了MindSpore Vision套件。

#### 1.1 数据准备

本案例使用CIFAR-10数据集作为训练集与验证集来进行实验。CIFAR-10数据集包括10个类别的60000个32x32彩色图像，每个类别有6000个图像。有50000张训练图像和10000张测试图像。该数据集分为五个训练批次和一个测试批次，每个批次有10000张图像。测试批次包含从每个类中随机选择的正好1000个图像。训练批包含随机顺序的剩余图像，但是一些训练批可能包含来自一个类比另一个类的更多图像。其中，训练批次包含来自每个类的5000张图片。

<p align="center"><img src="image/cifar10.png" width="200"></p>

#### 1.2 数据预处理

在训练之前需要对数据集进行处理，本模型在训练过程中需要放入两张图片进行对比训练，所以在提取数据集的时候需要一下提取两张图片，在这个时候需要对数据进行数据集切片处理，而mindspore并没有现成的数据集切片API，那就自定义一个cifar10数据集类。



In [1]:
import os
import pickle
import numpy as np
from PIL import Image

class CiFar10():
    """ training set or test set."""
    train_list = [
        'data_batch_1',
        'data_batch_2',
        'data_batch_3',
        'data_batch_4',
        'data_batch_5']
    test_list = ['test_batch']

    def __init__(self, root, train, transform=None, target_transform=None):
        self.root = root
        self.train = train
        if self.train:
            downloaded_list = self.train_list
        else:
            downloaded_list = self.test_list
        self.data = []
        self.targets = []
        self.transform = transform
        self.target_transform = target_transform

        # now load the picked numpy arrays
        for file_name in downloaded_list:
            file_path = os.path.join(self.root, file_name)
            with open(file_path, 'rb') as f:
                entry = pickle.load(f, encoding='latin1')
                self.data.append(entry['data'])
                if 'labels' in entry:
                    self.targets.extend(entry['labels'])
                else:
                    self.targets.extend(entry['fine_labels'])

        self.data = np.vstack(self.data).reshape(-1, 3, 32, 32)
        self.data = self.data.transpose((0, 2, 3, 1))  # convert to HWC

    def __getitem__(self, index):
        """
        Args:
            index (int): Index

        Returns:
            tuple: (image1, image2) where target is index of the target class.
        """
        img1, img2 = self.data[index], self.data[index]

        # doing this so that it is consistent with all other datasets
        # to return a PIL Image
        img1 = Image.fromarray(img1)
        img2 = Image.fromarray(img2)

        if self.transform is not None:
            img1 = self.transform(img1)
            img2 = self.transform(img2)

        return img1, img2

    def __len__(self):
        return len(self.data)


这样就定义好了一个数据集自定义类，并且能够输出两张图片，接下来要用这来对具体的cifar10数据集进行数据增强。

#### 1.3 数据增强

数据增强分为训练图像增强和测试图像增强，因为只有训练时需要同时输入两张图片，测试时不需要两张图片，所以测试时应用MindSpore已有的Cifar10接口。


In [2]:
from mindvision.dataset import Cifar10
import mindspore.dataset.vision.py_transforms as vp
import mindspore.dataset.vision.c_transforms as vc
import mindspore.dataset.transforms.c_transforms as c


def create_dataset(data_path):
    train_transform = c.Compose([
        vc.RandomResizedCrop(32),
        vc.RandomHorizontalFlip(0.5),
        vp.ToTensor(),
        vp.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010])])
    test_transform = c.Compose([
        vp.ToTensor(),
        vp.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010])])

    dataset1 = Cifar10(path=data_path, split="train", batch_size=256, resize=32, shuffle=True,
                       download=False, transform=train_transform)
    dataset2 = Cifar10(path=data_path, split="train", batch_size=256, resize=32, shuffle=False,
                       download=False, transform=train_transform)
    dataset3 = Cifar10(path=data_path, split="test", batch_size=256, resize=32, shuffle=False,
                       download=False, transform=test_transform)

    train_data = dataset1.run()

    memory_data = dataset2.run()

    test_data = dataset3.run()

    return train_data, memory_data, test_data


数据增强接口就定义完成了，其中设置batch_size为256，对图像进行shuffle操作。接下来进行数据加载，看看效果。

#### 1.4 数据加载

In [12]:
import mindspore

train_data, memory_data, test_data = create_dataset("src/Cifar10/cifar-10-batches-py")

data = next(train_data.create_dict_iterator())
im1 = mindspore.Tensor(data["image"].asnumpy())
im2 = mindspore.Tensor(data["image"].asnumpy())
print(f'Image1 shape: {im1.shape}')
print(f'Image2 shape: {im2.shape}')

Image1 shape: (256, 3, 32, 32)
Image2 shape: (256, 3, 32, 32)


可以看到训练数据输出shape为(256,3,32,32),图片大小为32x32，符合对输出的预期，数据集准备好了，接下来对模型进行构建，使用自定义Resnet模型作为backbone，构建MoCo网络。

### 2 网络构建

MoCo网络从另外一个角度来理解对比学习，即从一个字典查询的角度来理解对比学习。MoCo中建立一个动态的字典，这个字典由两个部分组成：一个队列和一个移动平均的编码器。队列中的样本不需要做梯度反传，因此可以在队列中存储很多负样本，从而使这个字典可以变得很大。使用移动平均编码器的目的是使队列中的样本特征尽可能保持一致（即不同的样本通过尽量相似的编码器获得特征的编码表示），其伪代码可见下图。

<p align="center"><img src="image/code.png" width="600"></p>

#### 2.1 Resnet结构

Resnet作为MoCo网络的backbone，具有非常特殊的结构，那就残差网络结构。Residual net(残差网络)：靠前若干层的某一层数据输出直接跳过多层引入到后面数据层的输入部分。残差神经单元：假定某段神经网络的输入是x,期望输出是H(x),如果我们直接将输入x传到输出作为初始结果，那么我们需要学习的目标就是F(x)=H(x)-x，这就是一个残差神经单元，相当于将学习目标改变了，不再是学习一个完整的输出H(x)，只是输出和输入的差别H(x)-x，即残差。Resnet18结构图如下所示。

<p align="center"><img src="image/resnet.jpg" width="400"></p>

#### 2.2 Split_Batchnorm结构

首先构建MindSpore框架下的Resnet网络结构，需要自定义Split_Batchnorm类，目的是用单GPU模型模拟多GPU模型结构，具体代码如下。



In [3]:
import mindspore.nn as nn
import mindspore.numpy as np
import mindspore.ops as ops

class SplitBatchNorm(nn.BatchNorm2d):
    """SplitBatchNorm:Simulate the behavior of BatchNorm's multiple gpus."""
    def __init__(self, num_features, num_splits=8, **kw):
        super().__init__(num_features, **kw)
        self.num_splits = num_splits

    def construct(self, inputs):
        """ build SplitBatchNorm network."""
        n, c, h, w = inputs.shape

        if self.training or not self.use_batch_statistics:
            moving_mean_split = np.tile(self.moving_mean, self.num_splits)
            moving_var_split = np.tile(self.moving_variance, self.num_splits)
            outcome = ops.BatchNorm(is_training=True, epsilon=1e-5, momentum=0.9,
                                    data_format="NCHW")(inputs.view(-1, c * self.num_splits, h, w),
                                                        np.tile(self.gamma, self.num_splits),
                                                        np.tile(self.beta, self.num_splits), moving_mean_split,
                                                        moving_var_split)[0]
            outcome = outcome.view(n, c, h, w)
            self.moving_mean.set_data(moving_mean_split.view(self.num_splits, c).mean(axis=0))
            self.moving_variance.set_data(moving_var_split.view(self.num_splits, c).mean(axis=0))
            return outcome
        return ops.BatchNorm(is_training=False, epsilon=1e-5, momentum=0.9)(
            inputs, self.moving_mean, self.moving_variance, self.gamma, self.beta)


#### 2.3 骨干网络

构建BasicBlock类和Resnet类

In [3]:
import math
import mindspore.nn as nn
import mindspore.ops as ops
from mindspore.ops import constexpr

class BasicBlock(nn.Cell):
    """
    BasicBlock of ResNet18

    Args:
        in_planes: Input channel
        planes:  Output channel
        kernel_size: Convolution kernel size
        stride: Convolution step
    """
    def __init__(self, in_planes, planes, kernel_size=3, stride=1):

        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_planes, planes, kernel_size=kernel_size, stride=stride, padding=1, has_bias=False,
                               pad_mode='pad', bias_init="zeros", data_format="NCHW")
        self.bn1 = nn.BatchNorm2d(planes, eps=1e-5, momentum=0.99, affine=True, gamma_init='ones', beta_init='zeros',
                                  moving_mean_init='zeros', moving_var_init='ones', use_batch_statistics=None, data_format='NCHW')
        self.relu = nn.ReLU()
        self.conv2 = nn.Conv2d(planes, planes, kernel_size=kernel_size, stride=1, padding=1, has_bias=False,
                               pad_mode='pad', bias_init="zeros", data_format="NCHW")
        self.bn2 = nn.BatchNorm2d(planes, eps=1e-5, momentum=0.99, affine=True, gamma_init='ones', beta_init='zeros',
                                  moving_mean_init='zeros', moving_var_init='ones', use_batch_statistics=None, data_format='NCHW')

        if stride != 1 or in_planes != planes:
            self.downsample = nn.SequentialCell(
                nn.Conv2d(in_planes, planes, kernel_size=1, stride=stride, has_bias=True, pad_mode='valid',
                          bias_init="zeros", data_format="NCHW"), nn.BatchNorm2d(planes, eps=1e-5, momentum=0.99, affine=True, gamma_init='ones', beta_init='zeros',
                                                                                 moving_mean_init='zeros', moving_var_init='ones', use_batch_statistics=None, data_format='NCHW'))
        else:
            self.downsample = nn.SequentialCell()

    def construct(self, inx):
        """calculate basic module output."""
        x = self.relu(self.bn1(self.conv1(inx)))
        x = self.bn2(self.conv2(x))
        out = x + self.downsample(inx)
        out = self.relu(out)
        return out


@constexpr
def compute_kernel_size(inp_shape, output_size):
    """AdaptiveAvgPool2d script"""
    kernel_width, kernel_height = inp_shape[2], inp_shape[3]
    if isinstance(output_size, int):
        kernel_width = math.ceil(kernel_width / output_size)
        kernel_height = math.ceil(kernel_height / output_size)
    elif isinstance(output_size, (list, tuple)):
        kernel_width = math.ceil(kernel_width / output_size[0])
        kernel_height = math.ceil(kernel_height / output_size[1])

    return (kernel_width, kernel_height)


class AdaptiveAvgPool2d(nn.Cell):
    """build AdaptiveAvgPool2d for Ascend."""
    def __init__(self, output_size):
        super().__init__()
        self.output_size = output_size

    def construct(self, x):
        inp_shape = x.shape
        kernel_size = compute_kernel_size(inp_shape, self.output_size)
        return ops.AvgPool(kernel_size, kernel_size)(x)


class ResNet18(nn.Cell):
    """backbone of MoCo"""
    def __init__(self, basicblocks, blocknums, nb_classes):
        super(ResNet18, self).__init__()
        self.in_planes = 64
        self.conv1 = nn.Conv2d(3, self.in_planes, kernel_size=3, stride=1, padding=1, has_bias=False,
                               pad_mode='pad', bias_init="zeros", data_format="NCHW")
        self.bn1 = nn.BatchNorm2d(self.in_planes, eps=1e-5, momentum=0.99, affine=True, gamma_init='ones', beta_init='zeros',
                                  moving_mean_init='zeros', moving_var_init='ones', use_batch_statistics=None, data_format='NCHW')
        self.relu = nn.ReLU()

        self.layer1 = self._make_layers(basicblocks, blocknums[0], 64, 1)
        self.layer2 = self._make_layers(basicblocks, blocknums[1], 128, 2)
        self.layer3 = self._make_layers(basicblocks, blocknums[2], 256, 2)
        self.layer4 = self._make_layers(basicblocks, blocknums[3], 512, 2)
        self.avgpool = AdaptiveAvgPool2d((1, 1))
        self.flatten = nn.Flatten()
        self.fc = nn.Dense(in_channels=512, out_channels=nb_classes,
                           weight_init='normal', bias_init='zeros', has_bias=True)

    def _make_layers(self, basicblock, blocknum, plane, stride):
        """
        make_layers for ResNet18

        Args:
            basicblock: Basic residual block class
            blocknum: The number of basic residual blocks in the current layer is 2 for each layer of resnet18
            plane: Number of output channels
            stride: Convolution step
        """
        layers = []
        for i in range(blocknum):
            if i == 0:
                layer = basicblock(self.in_planes, plane, 3, stride=stride)
            else:
                layer = basicblock(plane, plane, 3, stride=1)
            layers.append(layer)
        self.in_planes = plane
        return nn.SequentialCell(*layers)

    def construct(self, inx):
        """calculate ResNet18 output."""
        x = self.relu(self.bn1(self.conv1(inx)))
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x)
        x = self.flatten(x)
        out = self.fc(x)

        return out

#### 2.4 infoNCE 损失函数

对于MoCo来说，也是需要构建损失函数来进行目标学习的，他学习的目标就是希望query与字典中那个与之相匹配的key越近似越好，与其他不匹配的越不相似越好。MoCo使用的是一种对比损失函数，叫InfoNCE损失函数。
NCE(noise contrastive estimation)：当使用交叉熵的时候，如果类比太多(比如在这里，类别数相当于就是数据集图片数)，那么交叉熵是没法运行的。于是不如就是把多分类看成了数据样本和噪声样本两种类别，变成一系列的二分类问题，希望通过采样部分的负样本来近似使用全部负样本的效果。那么采样的负样本越多，自然就越接近全部样本的效果，但是同样也越难体现出交叉熵函数的作用。
InfoNCE认为不能单纯的看成二分类问题，因为很多负样本本身还是很不相似的，所以还是需要看成多分类的：

$$\mathcal{L}_{q}=-\log \frac{\exp \left(q \cdot k_{+} / \tau\right)}{\sum_{i=0}^{K} \exp \left(q \cdot k_{i} / \tau\right)}$$

In [16]:
def infoNCE_loss(self, im_q, im_k):
    """infoNCE损失函数"""
    # compute query features
    q = self.encoder_q(im_q)  # queries: NxC
    q = ops.L2Normalize(axis=1)(q)  # already normalized

    # compute key features

    # no gradient to keys
    # shuffle for making use of BN
    im_k_, idx_unshuffle = self.batch_shuffle_single_gpu(im_k)

    k = self.encoder_k(im_k_)  # keys: NxC
    k = ops.L2Normalize(axis=1)(k)  # already normalized

    # undo shuffle
    k = self.batch_unshuffle_single_gpu(k, idx_unshuffle)
    k = ops.stop_gradient(k)

    # compute logits
    # Einstein sum is more intuitive
    # positive logits: Nx1
    einsum0 = ops.ReduceSum()(q * k, -1)
    l_pos = ops.ExpandDims()(einsum0, -1)
    # negative logits: NxK
    l_neg = ops.MatMul()(q, self.queue)

    # logits: Nx(1+K)
    logits = ops.Concat(axis=1)((l_pos, l_neg))  # 按照维度连接两个张量
    logits_n = ops.Cast()(logits, mindspore.float32)

    # apply temperature
    logits_x = logits_n / self.t

    # labels: positive key indicators
    labels_n = ops.Zeros()((logits.shape[0]), mindspore.int32)
    labels = ops.Cast()(labels_n, mindspore.int32)

    loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')(logits_x, labels)
    k = ops.stop_gradient(k)
    loss = ops.stop_gradient(loss)

    return loss, q, k


说明：infonce_loss是在moco网络内部定义的，这样提取出来，只作为展示，供参考

#### 2.5 MoCo整体网络构建

在完成了resent18网络和infonce loss的构建之后，需要将backbone嵌入到moco网络中，作为它的一部分，如伪代码所示。

In [4]:
import mindspore
from mindspore import Tensor
import mindspore.nn as nn
import mindspore.ops as ops

class ModelMoCo(nn.Cell):
    """MoCo model based on ResNet18."""
    def __init__(self, i=4096, m=0.01, t=0.1, symmetric=False):
        super(ModelMoCo, self).__init__()

        self.i = i
        self.m = m
        self.t = t
        self.symmetric = symmetric

        # create the encoders
        self.encoder_q = ResNet18(BasicBlock, [2, 2, 2, 2], 128)
        self.encoder_k = ResNet18(BasicBlock, [2, 2, 2, 2], 128)

        for param_q, param_k in zip(self.encoder_q.trainable_params(), self.encoder_k.trainable_params()):
            param_k = param_q.clone()
            param_k.requires_grad = False

        self.queue = mindspore.Parameter(ops.Zeros()((128, 4096), mindspore.float32), name="queue", requires_grad=False)
        self.queue = ops.L2Normalize(axis=0)(self.queue)

        self.queue_ptr = mindspore.Parameter(ops.Zeros()(1, mindspore.float32), name="queue_ptr", requires_grad=False)

    def _momentum_update_key_encoder(self):
        """Momentum update of the key encoder."""
        for param_q, param_k in zip(self.encoder_q.trainable_params(),
                                    self.encoder_k.trainable_params()):
            param_k.set_data(param_k.data * (1 - self.m) + param_q.data * self.m)

    def _dequeue_and_enqueue(self, keys):
        """encoding and decoding function."""
        batch_size = keys.shape[0]

        ptr = int(self.queue_ptr)

        self.queue[:, ptr:ptr + batch_size] = keys.T  # transpose
        ptr = (ptr + batch_size) % self.i             # move pointer

        self.queue_ptr[0] = ptr

    @staticmethod
    def _batch_shuffle_single_gpu(x):
        """batch shuffle is used for multi gpu simulation."""

        # random shuffle index
        n_x = Tensor([x.shape[0]], dtype=mindspore.int32)
        randperm = ops.Randperm(max_length=x.shape[0], pad=-1)
        idx_shuffle = randperm(n_x)
        n_2 = ops.Cast()(idx_shuffle, mindspore.float32)

        # index for restoring
        idx_unshuffle_2 = ops.Sort()(n_2)
        idx_unshuffle = idx_unshuffle_2[1]

        return x[idx_shuffle], idx_unshuffle

    @staticmethod
    def _batch_unshuffle_single_gpu(x, idx_unshuffle):
        """Undo batch shuffle is used for multi gpu simulation."""

        return x[idx_unshuffle]

    def infonce_loss(self, im_q, im_k):
        """InfoNCE loss function."""
        # compute query features
        q = self.encoder_q(im_q)  # queries: NxC
        q = ops.L2Normalize(axis=1)(q)  # already normalized

        # compute key features
        im_k_, idx_unshuffle = ModelMoCo._batch_shuffle_single_gpu(im_k)

        k = self.encoder_k(im_k_)  # keys: NxC
        k = ops.L2Normalize(axis=1)(k)  # already normalized

        # undo shuffle
        k = ModelMoCo._batch_unshuffle_single_gpu(k, idx_unshuffle)
        k = ops.stop_gradient(k)

        einsum0 = ops.ReduceSum()(q * k, -1)
        l_pos = ops.ExpandDims()(einsum0, -1)
        # negative logits: NxK
        l_neg = ops.MatMul()(q, self.queue)

        # logits: Nx(1+K)
        logits = ops.Concat(axis=1)((l_pos, l_neg))
        logits_n = ops.Cast()(logits, mindspore.float32)

        # apply temperature
        logits_x = logits_n / self.t

        # labels: positive key indicators
        labels_n = ops.Zeros()((logits.shape[0]), mindspore.int32)
        labels = ops.Cast()(labels_n, mindspore.int32)

        # Calculate the infonce loss
        loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')(logits_x, labels)
        k = ops.stop_gradient(k)
        loss = ops.stop_gradient(loss)

        return loss, k

    def construct(self, im1, im2):
        """
        Input:
            im_q: a batch of query images
            im_k: a batch of key images
        Output:
            loss
        """
        self._momentum_update_key_encoder()

        # compute loss
        if self.symmetric:
            # asymmetric loss
            loss_12, k2 = self.infonce_loss(im1, im2)
            loss_21, k1 = self.infonce_loss(im2, im1)
            loss = loss_12 + loss_21
            k = ops.Concat(axis=0)(k1, k2)
        else:
            # asymmetric loss
            loss, k = self.infonce_loss(im1, im2)
        self._dequeue_and_enqueue(k)

        return loss


至此MoCo网络整体结构就搭建完成了，咱们可以打印一下其网络结构，查看网络内部结构组成。

In [5]:
models = ModelMoCo(i=4096, m=0.01, t=0.1, symmetric=False)
print(models.encoder_q)

ResNet18<
  (conv1): Conv2d<input_channels=3, output_channels=64, kernel_size=(3, 3), stride=(1, 1), pad_mode=pad, padding=1, dilation=(1, 1), group=1, has_bias=False, weight_init=normal, bias_init=zeros, format=NCHW>
  (bn1): BatchNorm2d<num_features=64, eps=1e-05, momentum=0.010000000000000009, gamma=Parameter (name=encoder_q.bn1.gamma, shape=(64,), dtype=Float32, requires_grad=True), beta=Parameter (name=encoder_q.bn1.beta, shape=(64,), dtype=Float32, requires_grad=True), moving_mean=Parameter (name=encoder_q.bn1.moving_mean, shape=(64,), dtype=Float32, requires_grad=False), moving_variance=Parameter (name=encoder_q.bn1.moving_variance, shape=(64,), dtype=Float32, requires_grad=False)>
  (relu): ReLU<>
  (layer1): SequentialCell<
    (0): BasicBlock<
      (conv1): Conv2d<input_channels=64, output_channels=64, kernel_size=(3, 3), stride=(1, 1), pad_mode=pad, padding=1, dilation=(1, 1), group=1, has_bias=False, weight_init=normal, bias_init=zeros, format=NCHW>
      (bn1): BatchNorm2

### 3 模型训练

在网络构建好之后，loss定义完成后，需要初始化优化器，然后加载数据集，设置好epoch以及学习率lr，就可以对网络进行训练了。

In [6]:
import time
from mindspore import context

def train(net, train_data, epoch, lr):
    """
    MoCo train

    Args:
        net : MoCo model
        train_data : train_data
        epoch: epoch
        args: train args
        lr: learning rate
    """
    start = time.time()
    net.set_train()
    total_loss, total_num, step = 0.0, 0, 0
    steps = train_data.get_dataset_size()

    for d in train_data.create_dict_iterator():
        im1 = mindspore.Tensor(d["image"].asnumpy())
        #im2 = mindspore.Tensor(d["image2"].asnumpy())
        im2 = im1.copy()
        loss = net(im1, im2)
        total_num += 256
        total_loss += loss * 256
        if step % 25 == 0:
            print(f"Epoch: [{epoch} / {30}], "
                  f"step: [{step} / {steps}], "
                  f"loss: {total_loss / total_num},"
                  f"lr: {lr}")
        step += 1
    stop = time.time() - start
    print(f"time of one step: {stop/steps} s/step, ")
    return total_loss / total_num

train_data, memory_data, test_data = create_dataset("Cifar10/cifar-10-batches-py")
context.set_context(mode=context.PYNATIVE_MODE, device_target='Ascend')

for epoch in range(1, 5):
    exponential_decay_lr = nn.ExponentialDecayLR(0.06, 0.8, 10)
    lr = exponential_decay_lr(epoch)
    optimizer = nn.SGD(params=models.trainable_params(), learning_rate=lr, weight_decay=5e-4, momentum=0.9)
    train_net = nn.TrainOneStepCell(models, optimizer)
    start = time.time()
    train_loss = train(train_net, train_data, epoch, lr)

Epoch: [1 / 30], step: [0 / 195], loss: 8.578479,lr: 0.058675967
Epoch: [1 / 30], step: [25 / 195], loss: 7.855386,lr: 0.058675967
Epoch: [1 / 30], step: [50 / 195], loss: 7.7259436,lr: 0.058675967
Epoch: [1 / 30], step: [75 / 195], loss: 7.703345,lr: 0.058675967
Epoch: [1 / 30], step: [100 / 195], loss: 7.7242727,lr: 0.058675967
Epoch: [1 / 30], step: [125 / 195], loss: 7.7489104,lr: 0.058675967
Epoch: [1 / 30], step: [150 / 195], loss: 7.7651553,lr: 0.058675967
Epoch: [1 / 30], step: [175 / 195], loss: 7.767565,lr: 0.058675967
time of one step: 0.3461952600723658 s/step, 
Epoch: [2 / 30], step: [0 / 195], loss: 7.6540165,lr: 0.05738115
Epoch: [2 / 30], step: [25 / 195], loss: 7.599515,lr: 0.05738115
Epoch: [2 / 30], step: [50 / 195], loss: 7.5501733,lr: 0.05738115
Epoch: [2 / 30], step: [75 / 195], loss: 7.5048146,lr: 0.05738115
Epoch: [2 / 30], step: [100 / 195], loss: 7.465658,lr: 0.05738115
Epoch: [2 / 30], step: [125 / 195], loss: 7.4326143,lr: 0.05738115
Epoch: [2 / 30], step: [

训练完毕，那么接下来需要验证训练效果，从数据集提取验证数据集用于模型评估。

### 4 模型评估

需要自定义模型评估脚本，里面用到KNN近邻算法来预测数据。

#### 4.1 KNN近邻算法

简单说就是采用测量不同特征值之间的距离方法进行分类（k-Nearest Neighbor，KNN）,拥有精度高、对异常值不敏感、无数据输入假等特点。  
1、当样本不平衡时，比如一个类的样本容量很大，其他类的样本容量很小，输入一个样本的时候，K个临近值中大多数都是大样本容量的那个类，这时可能就会导致分类错误。改进方法是对K临近点进行加权，也就是距离近的点的权值大，距离远的点权值小。  
2、计算量较大，每个待分类的样本都要计算它到全部点的距离，根据距离排序才能求得K个临近点，改进方法是：先对已知样本点进行剪辑，事先去除对分类作用不大的样本。  
适用数据范围：  
1、标称型(离散型)：标称型目标变量的结果只在有限目标集中取值，如真与假(标称型目标变量主要用于分类)。  
2、数值型：数值型目标变量则可以从无限的数值集合中取值，如0.100。

<p align="center"><img src="image/knn.png" width="400"></p>


In [9]:
import numpy as np
import mindspore
from mindspore import Tensor
import mindspore.ops as ops
import mindspore.dataset

def knn_predict(feature, feature_bank, feature_labels, classes, knn_k, knn_t):
    """
    knn_predict:compute cos similarity between each feature vector and feature bank ---> [B, N]
    knn monitor as in InstDisc https://arxiv.org/abs/1805.01978
    implementation follows https://github.com/leftthomas/SimCLR
    """
    sim_matrix = ops.MatMul()(feature, feature_bank)

    topk = ops.TopK()
    sim_weight, sim_indices = topk(sim_matrix, knn_k)

    sim_labels = ops.GatherD()(ops.BroadcastTo((feature.shape[0], -1))(feature_labels), -1, sim_indices)
    sim_weight = ops.Exp()(sim_weight / knn_t)

    on_value, off_value = Tensor(1.0, mindspore.float32), Tensor(0.0, mindspore.float32)
    one_hot_label = ops.OneHot(axis=-1)(sim_labels.view(-1), 10, on_value, off_value)

    pred_scores = ops.ReduceSum()(one_hot_label.view(feature.shape[0], -1,
                                                     classes) * ops.ExpandDims()(sim_weight, -1), 1)
    sort = ops.Sort(axis=-1, descending=True)
    pred_labels = sort(pred_scores)[1]

    return pred_labels


#### 4.2 模型评估脚本



In [10]:
def test(net, memory_data_loader, test_data_loader, epoch):
    """
    test for MoCo

    Args:
        net : MoCo.encoder_q model
        memory_data_loader : memory_data
        test_data_loader: test_data
        epoch: epoch
        args: test args
    """
    net.set_train(False)
    classes = 10
    total_top1, total_num, step, feature_bank = 0.0, 0, 0, []
    x1 = np.random.normal(1, 1, (0))
    steps = test_data_loader.get_dataset_size()

    # generate feature bank
    for data1 in memory_data_loader.create_dict_iterator():
        feature = net(data1["image"])
        feature = ops.L2Normalize(axis=1)(feature)
        feature_bank.append(feature)
        x2 = data1["label"].asnumpy()
        x1 = np.concatenate([x1, x2], axis=0)

    feature_bank1 = ops.Concat(axis=0)(feature_bank)
    feature_bank2 = feature_bank1.T
    feature_labels = mindspore.Tensor(x1, mindspore.int32)

    # loop test data to predict the label by weighted knn search
    for data2 in test_data_loader.create_dict_iterator():
        feature = net(data2["image"])
        feature = ops.L2Normalize(axis=1)(feature)
        pred_labels = knn_predict(feature, feature_bank2, feature_labels, classes, 200, 0.1)
        cast = ops.Cast()
        total_num += data2["image"].shape[0]
        number = cast((pred_labels[:, 0] == data2["label"]), mindspore.float32)
        total_top1 += number.sum()
        if step % 5 == 0:

            print(f"Epoch: [{epoch} / {30}], "
                  f"step: [{step} / {steps}], "
                  f"Acc@1:{total_top1 / total_num * 100}%")
        step += 1

    return total_top1 / total_num * 100


#### 4.3 模型评估整体

在完成了KNN近邻函数和评估脚本的写后，然后可以直接调用，对模型进行评估。

In [11]:
for epoch in range(1, 5):
    test_acc_1 = test(models.encoder_q, memory_data, test_data, epoch)

Epoch: [1 / 30], step: [0 / 39], Acc@1:10.15625%
Epoch: [1 / 30], step: [5 / 39], Acc@1:10.481771%
Epoch: [1 / 30], step: [10 / 39], Acc@1:10.973012%
Epoch: [1 / 30], step: [15 / 39], Acc@1:10.791016%
Epoch: [1 / 30], step: [20 / 39], Acc@1:11.160714%
Epoch: [1 / 30], step: [25 / 39], Acc@1:11.253004%
Epoch: [1 / 30], step: [30 / 39], Acc@1:11.378529%
Epoch: [1 / 30], step: [35 / 39], Acc@1:11.263021%
Epoch: [2 / 30], step: [0 / 39], Acc@1:10.15625%
Epoch: [2 / 30], step: [5 / 39], Acc@1:10.611979%
Epoch: [2 / 30], step: [10 / 39], Acc@1:10.795455%
Epoch: [2 / 30], step: [15 / 39], Acc@1:10.498047%
Epoch: [2 / 30], step: [20 / 39], Acc@1:10.900298%
Epoch: [2 / 30], step: [25 / 39], Acc@1:10.952524%
Epoch: [2 / 30], step: [30 / 39], Acc@1:11.101311%
Epoch: [2 / 30], step: [35 / 39], Acc@1:10.904947%
Epoch: [3 / 30], step: [0 / 39], Acc@1:9.765625%
Epoch: [3 / 30], step: [5 / 39], Acc@1:10.677084%
Epoch: [3 / 30], step: [10 / 39], Acc@1:10.795455%
Epoch: [3 / 30], step: [15 / 39], Acc@1:

### 5 模型推理（验证）

模型训练评估好之后需要验证模型正确性，查看模型输出，验证可视化输出。
<p align="center">
  <img src="image/visual_1.png" width="600">
</p>

<p align="center">
  <img src="image/visual_2.png" width="600">
</p>

<p align="center">
  <img src="image/visual_3.png" width="600">
</p>

### 6 总结

本案例对Momentum Contrast for Unsupervised Visual Representation Learning这篇论文中提出的模型进行了详细的解释，向读者完整地展现了该算法的流程。如需查看详细代码，可参考course仓库。
具体网址：https://gitee.com/mindspore/course/tree/master/application_example/MoCo

### 引用

[1] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, Ross Girshick:Momentum Contrast for Unsupervised Visual Representation Learning

论文地址：https://arxiv.org/abs/1911.05722v3

[2] Pytorch 开源代码：单GPU实现MoCo

代码地址：https://github.com/leftthomas/MoCo
