<img src="img/mmselfsup_logo.png">

# Self-Supervised Model Pre-training with SimCLR

<a href="https://colab.research.google.com/github/open-mmlab/OpenMMLabCourse/blob/main/codes/MMSelfSup_tutorials/【1】模型自监督预训练%20之%20SimCLR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**MMSelfSup Repo**: [https://github.com/open-mmlab/mmselfsup](https://github.com/open-mmlab/mmselfsup)

**MMSelfSup official documentation**: [https://mmselfsup.readthedocs.io/en/latest](https://mmselfsup.readthedocs.io/en/latest)

**MMSelfSup video tutorials**: [https://space.bilibili.com/1293512903/channel/collectiondetail?sid=657287](https://space.bilibili.com/1293512903/channel/collectiondetail?sid=657287)

**How to get the MMSelfSup codebase introduction slides**: Follow the OpenMMLab WeChat official account and reply `mmselfsup` in the chat to receive the course slides.

**How to join the WeChat community**: Follow the official account and choose "Join Us" -> "WeChat Community" to get the group QR code. We look forward to seeing you there!

**Author**: OpenMMLab

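## 0. Introduction to the self-supervised pre-training method: SimCLR

**Paper**: https://arxiv.org/pdf/2002.05709.pdf

**Core idea of SimCLR**: Apply two different data augmentations to each image. The two augmented views are positives for each other, and the augmented views of all other images in the same batch serve as their negatives. SimCLR trains the encoder to maximize the similarity between the representation of each view and that of its positive, and to minimize the similarity with the representations of its negatives.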

<img src="img/SimCLR.png">
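
To make the objective concrete, here is a minimal pure-Python sketch of the contrastive (NT-Xent) loss described in the SimCLR paper. It is not MMSelfSup's implementation. The function names `cosine` and `nt_xent`, the pairing convention (views `2k` and `2k+1` come from the same image), and the temperature of 0.5 are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(z, temperature=0.5):
    """Average NT-Xent loss over 2N embeddings.

    Assumption: z[2k] and z[2k + 1] are the two augmented views of image k.
    """
    n = len(z)
    total = 0.0
    for i in range(n):
        positive = i ^ 1  # index of the other view of the same image
        # Similarities to every other embedding (the positive and all negatives).
        logits = [cosine(z[i], z[k]) / temperature for k in range(n) if k != i]
        log_denominator = math.log(sum(math.exp(value) for value in logits))
        total += log_denominator - cosine(z[i], z[positive]) / temperature
    return total / n


# Views of the same image agree in `aligned` and disagree in `mixed`.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(nt_xent(aligned) < nt_xent(mixed))  # True
```

The loss is lower when the two views of an image have similar representations and the views of different images do not, which is the behaviour the paragraph above describes.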

## 1. Environment setup

### 1.1 Check the CUDA, GCC, and PyTorch versions

In [1]:
# Check nvcc version
!nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Jun__8_16:59:34_Pacific_Daylight_Time_2022
Cuda compilation tools, release 11.7, V11.7.99
Build cuda_11.7.r11.7/compiler.31442593_0


In [2]:
# Check GCC version
!gcc --version

gcc (4.3.3-tdm-1 mingw32) 4.3.3
Copyright (C) 2008 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.



In [3]:
# Check PyTorch installation
import torch, torchvision
print(torch.__version__)
print(torch.cuda.is_available())

2.0.0
True
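
The cell above imports `torchvision` but prints only the PyTorch version. As a small supplement, the sketch below reports the Python version and the installed versions of `torch` and `torchvision` using only the standard library, without importing the packages themselves. The helper name `installed_version` is hypothetical.

```python
import platform
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("Python:", platform.python_version())
for name in ("torch", "torchvision"):
    print(f"{name}:", installed_version(name) or "not installed")
```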


### 1.2 Install MMCV, a dependency of MMSelfSup

In [4]:
!pip install openmim



In [5]:
!mim install mmcv

Looking in links: https://download.openmmlab.com/mmcv/dist/cu117/torch2.0.0/index.html


### 1.3 Install MMSelfSup

In [7]:
%cd /content

[WinError 2] The system cannot find the file specified: '/content'
d:\Develop\msc-ai-dev\individual-project


In [10]:
!git clone https://github.com/open-mmlab/mmselfsup.git

[WinError 2] The system cannot find the file specified: '/mmselfsup'
d:\Develop\msc-ai-dev\individual-project


Cloning into 'mmselfsup'...


In [13]:
# Install MMSelfSup from source
!pip install -e ./mmselfsup 

Obtaining file:///D:/Develop/msc-ai-dev/individual-project/mmselfsup
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting attrs
  Downloading attrs-23.1.0-py3-none-any.whl (61 kB)
Collecting einops
  Using cached einops-0.6.0-py3-none-any.whl (41 kB)
Collecting future
  Using cached future-0.18.3.tar.gz (840 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting mmcls<1.1.0,>=1.0.0rc6
  Downloading mmcls-1.0.0rc6-py2.py3-none-any.whl (906 kB)

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
wandb 0.12.17 requires GitPython>=1.0.0, which is not installed.
wandb 0.12.17 requires pathtools, which is not installed.
wandb 0.12.17 requires setproctitle, which is not installed.
wandb 0.12.17 requires shortuuid>=0.5.0, which is not installed.
pytest 7.1.2 requires iniconfig, which is not installed.
pytest 7.1.2 requires tomli>=1.0.0, which is not installed.
allennlp 2.9.3 requires lmdb, which is not installed.
allennlp 2.9.3 requires sentencepiece, which is not installed.
wandb 0.12.17 requires protobuf<4.0dev,>=3.12.0, but you have protobuf 4.22.3 which is incompatible.
googleapis-common-protos 1.56.2 requires protobuf<4.0.0dev,>=3.15.0, but you have protobuf 4.22.3 which is incompatible.
google-api-core 2.8.1 requires protobuf<4.0.0dev,>=3.15.0, but you have protobuf 4.22.3 which is incompatible.
cached-pa

### 1.4 Verify the installation

In [15]:
import mmselfsup
print(mmselfsup.__version__)

AttributeError: module 'mmselfsup' has no attribute '__version__'
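
The `AttributeError` above means that the imported `mmselfsup` object does not define `__version__`. One possible explanation, which is an assumption and not something the output confirms, is that Python picked up the cloned `mmselfsup` repository folder in the current working directory as a namespace package instead of the installed package. A small diagnostic sketch (the helper name `describe_module` is hypothetical):

```python
import importlib


def describe_module(name):
    """Report where a module was loaded from and whether it exposes a version."""
    module = importlib.import_module(name)
    module_file = getattr(module, "__file__", None)
    return {
        "file": module_file,
        "version": getattr(module, "__version__", None),
        # Namespace packages (plain folders without __init__.py) have no __file__.
        "is_namespace_package": module_file is None,
    }


# Demonstrated on a standard-library module; in the notebook, call
# describe_module("mmselfsup") instead.
print(describe_module("json"))
```

If `is_namespace_package` comes back as `True` for `mmselfsup`, changing into a directory that does not contain the cloned folder, or into the repository root itself, before importing may resolve the error.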


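## 2. Prepare the dataset

### 2.0 About the dataset

This tutorial trains the self-supervised model SimCLR on the `Tiny ImageNet` dataset.

Tiny ImageNet is a subset of ImageNet.

It contains 200 classes, each with 500 training images, 50 validation images, and 50 test images, for a total of 120,000 images. Every image is a 64×64 color image.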
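
The split sizes follow directly from the per-class counts. A quick arithmetic check, using only the numbers stated above:

```python
num_classes = 200
images_per_class = {"train": 500, "val": 50, "test": 50}

# Number of images in each split and in the whole dataset.
split_sizes = {split: num_classes * count for split, count in images_per_class.items()}
total_images = sum(split_sizes.values())

print(split_sizes)   # {'train': 100000, 'val': 10000, 'test': 10000}
print(total_images)  # 120000
```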

Official download link: http://cs231n.stanford.edu/tiny-imagenet-200.zip

### 2.1 Download the dataset

Use GNU [Wget](https://www.gnu.org/software/wget/) to download the archive from the official Stanford site: http://cs231n.stanford.edu/tiny-imagenet-200.zip

In [None]:
%cd /content/mmselfsup

In [None]:
!mkdir data
%cd data
!wget http://cs231n.stanford.edu/tiny-imagenet-200.zip

### 2.2 Extract the dataset

In [None]:
!unzip -q tiny-imagenet-200.zip

In [None]:
!rm -rf tiny-imagenet-200.zip
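
The `wget`, `unzip`, and `rm` commands above assume a Linux shell such as the one in Colab. On a system where they are unavailable, the same three steps can be sketched with the Python standard library. The helper name `download_and_extract` and its `keep_archive` parameter are hypothetical and not part of MMSelfSup.

```python
import urllib.request
import zipfile
from pathlib import Path


def download_and_extract(url, data_dir, keep_archive=False):
    """Download a zip archive into `data_dir`, extract it there, then delete it."""
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    archive = data_dir / url.rsplit("/", 1)[-1]  # file name taken from the URL
    urllib.request.urlretrieve(url, archive)
    with zipfile.ZipFile(archive) as zip_file:
        zip_file.extractall(data_dir)
    if not keep_archive:
        archive.unlink()
    return data_dir


# Usage (downloads the full archive, so it is left commented out):
# download_and_extract("http://cs231n.stanford.edu/tiny-imagenet-200.zip", "data")
```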

### 2.3 Inspect the dataset directory

In [None]:
# Check data directory
!apt-get install tree
!tree -d /content/mmselfsup/data
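
Where `apt-get` and `tree` are not available, a directory-only listing similar to `tree -d` can be sketched in Python. The helper name `list_dirs` and the `max_depth` parameter are hypothetical.

```python
from pathlib import Path


def list_dirs(root, max_depth=2):
    """Return an indented listing of the sub-directories below `root`."""
    root = Path(root)
    lines = [root.name]

    def walk(directory, depth):
        if depth > max_depth:
            return
        for child in sorted(p for p in directory.iterdir() if p.is_dir()):
            lines.append("    " * depth + child.name)
            walk(child, depth + 1)

    walk(root, 1)
    return lines


# Usage, assuming the dataset was extracted into ./data:
# print("\n".join(list_dirs("data")))
```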

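### 2.4 Prepare the annotation files

To save you from rewriting the dataset-loading code, we have prepared the annotation files. Copy them into the dataset root directory `mmselfsup/data/tiny-imagenet-200`.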

In [None]:
%cd /content/mmselfsup/data

In [None]:
!wget https://raw.githubusercontent.com/open-mmlab/OpenMMLabCourse/main/codes/MMSelfSup_tutorials/anno_files/train.txt -P tiny-imagenet-200
!wget https://raw.githubusercontent.com/open-mmlab/OpenMMLabCourse/main/codes/MMSelfSup_tutorials/anno_files/val.txt -P tiny-imagenet-200
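
After the download it can be useful to confirm that the annotation files are non-empty and look as expected. The sketch below makes no assumption about the line format; it only counts the non-blank lines and returns the first few. The helper name `preview_annotations` is hypothetical.

```python
from pathlib import Path


def preview_annotations(path, num_lines=3):
    """Return (number of non-blank lines, first `num_lines` of them)."""
    text = Path(path).read_text(encoding="utf-8")
    lines = [line for line in text.splitlines() if line.strip()]
    return len(lines), lines[:num_lines]


# Usage, assuming the current directory is mmselfsup/data:
# count, head = preview_annotations("tiny-imagenet-200/train.txt")
# print(count, head)
```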

## 3. Write the config file for self-supervised pre-training

1. Create a new config file named `simclr_resnet50_1xb32-coslr-1e_tinyin200.py`. (For the config naming rules and their meaning, see [here](https://mmsegmentation.readthedocs.io/zh_CN/latest/tutorials/config.html#id3).)

2. Contents of the training config file `simclr_resnet50_1xb32-coslr-1e_tinyin200.py`:
    1. Inherit from the [simclr_resnet50_8xb32-coslr-200e_in1k.py](https://github.com/open-mmlab/mmselfsup/blob/master/configs/selfsup/simclr/simclr_resnet50_8xb32-coslr-200e_in1k.py) config file.
    2. Set `samples_per_gpu` (the batch size per GPU) and `workers_per_gpu` (the number of data-loading workers per GPU) as needed.
    3. Update the dataset path and the annotation file path.
    4. Adjust the learning rate to match the batch size (for the scaling rule, see [here](https://mmselfsup.readthedocs.io/zh_CN/latest/get_started.html#id2)).
    5. Set the total number of training epochs.
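
Step 4 can be sketched as a linear scaling of the learning rate with the total batch size, which matches the expression `0.3 * ((32 * 1) / (32 * 8))` used in the config in this section. The helper name `scale_lr` is hypothetical.

```python
def scale_lr(base_lr, base_total_batch, new_total_batch):
    """Scale the learning rate linearly with the total batch size."""
    return base_lr * new_total_batch / base_total_batch


# Base config: 8 GPUs x 32 samples with lr=0.3; this tutorial: 1 GPU x 32 samples.
new_lr = scale_lr(base_lr=0.3, base_total_batch=32 * 8, new_total_batch=32 * 1)
print(new_lr)
```

With one eighth of the original total batch size, the learning rate becomes one eighth of 0.3.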

In [None]:
%cd /content/mmselfsup

In [None]:
%%writefile /content/mmselfsup/configs/selfsup/simclr/simclr_resnet50_1xb32-coslr-1e_tinyin200.py

_base_ = 'simclr_resnet50_8xb32-coslr-200e_in1k.py'

# dataset
data = dict(
    samples_per_gpu=32, 
    workers_per_gpu=2,
    train=dict(
        data_source=dict(
            data_prefix='data/tiny-imagenet-200/train',
            ann_file='data/tiny-imagenet-200/train.txt',
        )
    )
)

# optimizer
optimizer = dict(
    lr=0.3 * ((32 * 1) / (32 * 8)),
)

runner = dict(max_epochs=1)
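
The config above lists only the keys that differ from the base file; everything else is inherited through `_base_`. The sketch below illustrates that idea with a plain recursive dictionary merge. It is not MMCV's actual implementation, the helper name `merge_config` is hypothetical, and the values in `base` are placeholders, not the contents of the real base config.

```python
import copy


def merge_config(base, override):
    """Recursively merge `override` into a copy of `base`."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Placeholder base values, for illustration only.
base = dict(
    data=dict(samples_per_gpu=32, workers_per_gpu=4),
    optimizer=dict(lr=0.3, momentum=0.9),
)
override = dict(
    data=dict(workers_per_gpu=2),
    optimizer=dict(lr=0.0375),
)

merged = merge_config(base, override)
print(merged["data"])       # {'samples_per_gpu': 32, 'workers_per_gpu': 2}
print(merged["optimizer"])  # {'lr': 0.0375, 'momentum': 0.9}
```

Keys that the override does not mention, such as `momentum` here, keep their base values.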

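## 4. Self-supervised pre-training

We recommend launching training with the distributed training tool [tools/dist_train.sh](https://github.com/open-mmlab/mmselfsup/blob/master/tools/dist_train.sh), even if you train on a single GPU.
Some self-supervised pre-training algorithms need multiple GPUs, so MMSelfSup includes modules that multi-GPU training may rely on, such as `SyncBN`. If an algorithm uses these modules but training is not launched in distributed mode, an error is raised.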

```shell
bash tools/dist_train.sh ${CONFIG_FILE} ${GPUS} --work-dir ${YOUR_WORK_DIR} [optional arguments]
```

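Arguments:
+ CONFIG_FILE: path to the config file for self-supervised training

+ GPUS: number of GPUs used for training

+ work-dir: directory where the checkpoints, logs, and other files produced during training are saved

For the other `optional arguments`, see [here](https://mmselfsup.readthedocs.io/zh_CN/latest/get_started.html#id3).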
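
When the launch command is assembled from Python, for example to loop over several configs, the arguments above map onto a list as sketched below. The helper name `build_train_command` is hypothetical, and the `--work-dir` spelling is copied from the usage template above; check `tools/train.py` in your checkout for the form it accepts.

```python
import shlex


def build_train_command(config_file, gpus, work_dir, extra_args=()):
    """Assemble the dist_train.sh invocation as an argument list."""
    return [
        "bash", "tools/dist_train.sh",
        config_file,
        str(gpus),
        "--work-dir", work_dir,
        *extra_args,
    ]


cmd = build_train_command(
    "configs/selfsup/simclr/simclr_resnet50_1xb32-coslr-1e_tinyin200.py",
    1,
    "work_dirs/selfsup/simclr/simclr_resnet50_1xb32-coslr-1e_tinyin200/",
)
print(shlex.join(cmd))
# To actually launch it from the repository root:
# import subprocess; subprocess.run(cmd, check=True)
```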

In [None]:
%cd /content/mmselfsup

In [None]:
!bash tools/dist_train.sh \
configs/selfsup/simclr/simclr_resnet50_1xb32-coslr-1e_tinyin200.py \
1 \
--work_dir work_dirs/selfsup/simclr/simclr_resnet50_1xb32-coslr-1e_tinyin200/ 