
[Feature] Support DSDL Dataset #1503

Open
wants to merge 1 commit into base: dev
10 changes: 10 additions & 0 deletions .circleci/test.yml
@@ -62,6 +62,11 @@ jobs:
command: |
python -V
pip install torch==<< parameters.torch >>+cpu torchvision==<< parameters.torchvision >>+cpu -f https://download.pytorch.org/whl/torch_stable.html
- when:
condition:
equal: ["3.10.0", << parameters.python >>]
steps:
- run: pip install dsdl
- run:
name: Install mmpretrain dependencies
command: |
@@ -113,6 +118,11 @@ jobs:
command: |
python -V
pip install torch==<< parameters.torch >>+cpu torchvision==<< parameters.torchvision >>+cpu -f https://download.pytorch.org/whl/torch_stable.html
- when:
condition:
equal: ["3.10.0", << parameters.python >>]
steps:
- run: pip install dsdl
- run:
name: Install mmpretrain dependencies
command: |
112 changes: 112 additions & 0 deletions configs/dsdl/README.md
@@ -0,0 +1,112 @@
# DSDL: Standard Description Language for DataSet

## 1. Abstract

Data is the cornerstone of artificial intelligence. The efficiency of data acquisition, exchange, and application directly impacts the advances in technologies and applications. Over the long history of AI, a vast quantity of data sets have been developed and distributed. However, these datasets are defined in very different forms, which incurs significant overhead when it comes to exchange, integration, and utilization -- it is often the case that one needs to develop a new customized tool or script in order to incorporate a new dataset into a workflow.

To overcome such difficulties, we developed **Data Set Description Language (DSDL)**. For more details, please visit our [official documentation](https://opendatalab.github.io/dsdl-docs/getting_started/overview/). DSDL datasets can be downloaded from our platform [OpenDataLab](https://opendatalab.com/).

## 2. Steps

- install dsdl and opendatalab:

```shell
pip install dsdl
pip install opendatalab
```

- install mmpretrain and pytorch:
please refer to the [installation documentation](https://mmpretrain.readthedocs.io/en/latest/get_started.html).

- prepare dsdl dataset (take cifar10 as an example)

- download dsdl dataset (you will need an opendatalab account to do so. [register one now](https://opendatalab.com/))

```shell
cd data

odl login
odl get CIFAR-10
```

usually, datasets are compressed on the OpenDataLab platform, so the downloaded cifar10 dataset should look like this:

```
data/
├── CIFAR-10
│   ├── dsdl
│   │   └── dsdl_Cls_full.zip
│   ├── raw
│   │   ├── cifar-10-binary.tar.gz
│   │   ├── cifar-10-matlab.tar.gz
│   │   └── cifar-10-python.tar.gz
│   └── README.md
└── ...
```

- decompress dataset

decompress dsdl files:

```shell
cd dsdl
unzip dsdl_Cls_full.zip
```

decompress the raw data and save it as image files; we prepared a python script to do so:

```shell
cd ..
python dsdl/dsdl_Cls_full/tools/prepare.py raw/

cd ../../
```

after running this script, there will be a new folder named `prepared` (this is not needed for every dataset; cifar10 ships as binary files which must be extracted into images):

```
data/
├── CIFAR-10
│   ├── dsdl
│   │   └── ...
│   ├── raw
│   │   └── ...
│   ├── prepared
│   │   └── images
│   └── README.md
└── ...
```

- change the training config

open the [cifar10 config file](cifar10.py) and set the file paths as below:

```python
data_root = 'data/CIFAR-10'
img_prefix = 'prepared'
train_ann = 'dsdl/dsdl_Cls_full/set-train/train.yaml'
val_ann = 'dsdl/dsdl_Cls_full/set-test/test.yaml'
```

since all dsdl datasets for one task share the same dataloader, we can simply change these file paths to train a model on a different dataset.
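For instance, switching the same config to another DSDL classification dataset only requires repointing these variables (the dataset folder below is hypothetical, assuming it was downloaded and prepared the same way as CIFAR-10):

```python
# Illustrative paths for another DSDL classification dataset;
# only these four variables need to change in the config.
data_root = 'data/Some-Other-Dataset'  # hypothetical dataset folder
img_prefix = 'prepared'
train_ann = 'dsdl/dsdl_Cls_full/set-train/train.yaml'
val_ann = 'dsdl/dsdl_Cls_full/set-test/test.yaml'
```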

- train:

- using a single GPU:

```shell
python tools/train.py {config_file}
```

- using slurm:

```shell
./tools/slurm_train.sh {partition} {job_name} {config_file} {work_dir} {gpu_nums}
```

## 3. Test Results

| Datasets | Model | Top-1 Acc (%) | Config |
| :---------: | :-------------------------------------------------------------------------------------------------------------: | :-----------: | :-----------------------: |
| cifar10 | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth) | 94.83 | [config](./cifar10.py) |
| ImageNet-1k | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_8xb32_in1k_20210831-fbbb1da6.pth) | 69.84 | [config](./imagenet1k.py) |
Contributor

@zzc98 zzc98 May 19, 2023


I am afraid that the ImageNet category mapping used by DSDL is not compatible with mmpretrain, so the result cannot be reproduced. It can be reproduced by remapping the labels according to ILSVRC2012_mapping.txt with the following code.

```python
def load_data_list(self):
    # ...
    # For ImageNet
    id2name = {}
    folders = []
    with open('ILSVRC2012_mapping.txt', 'r') as f:
        for line in f.readlines()[:1000]:
            line = line[:-1]
            cid, name = line.split()
            id2name[int(cid)] = name
            folders.append(name)
    folders.sort()

        # ...
        label_index = data['Label'][0].index_in_domain() - 1
        name = id2name[label_index + 1]
        label_index = folders.index(name)
```

Author

@wufan-tb wufan-tb May 23, 2023


Here is the situation: when DSDL converts the dataset, the category order indeed differs from the original ImageNet order, so testing with a previously pretrained model gives mismatched results. The numbers in the table, however, were produced after I aligned the order in the loading code, so they do match. Since that alignment was specific to ImageNet, I removed it when merging.
In fact, if you retrain a model from scratch with DSDLDataset, no order alignment is needed and the accuracy still matches, so the alignment code is not required in the codebase.

Author

@wufan-tb wufan-tb May 23, 2023


To reduce ambiguity, how about we just delete these two table rows?

60 changes: 60 additions & 0 deletions configs/dsdl/cifar10.py
@@ -0,0 +1,60 @@
_base_ = [
'../_base_/models/resnet18_cifar.py',
'../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py'
]

# dataset settings
dataset_type = 'DSDLClsDataset'
data_root = 'data/CIFAR-10'
img_prefix = 'prepared'
train_ann = 'dsdl/dsdl_Cls_full/set-train/train.yaml'
val_ann = 'dsdl/dsdl_Cls_full/set-test/test.yaml'

data_preprocessor = dict(
num_classes=10,
# RGB format normalization parameters
mean=[125.307, 122.961, 113.8575],
std=[51.5865, 50.847, 51.255],
to_rgb=True)

train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='RandomCrop', crop_size=32, padding=4),
dict(type='RandomFlip', prob=0.5, direction='horizontal'),
dict(type='PackInputs'),
]

test_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='PackInputs'),
]

train_dataloader = dict(
batch_size=16,
num_workers=2,
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file=train_ann,
data_prefix=dict(img_path=img_prefix),
test_mode=False,
pipeline=train_pipeline),
sampler=dict(type='DefaultSampler', shuffle=True),
)

val_dataloader = dict(
batch_size=16,
num_workers=2,
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file=val_ann,
data_prefix=dict(img_path=img_prefix),
test_mode=True,
pipeline=test_pipeline),
sampler=dict(type='DefaultSampler', shuffle=False),
)
val_evaluator = dict(type='Accuracy', topk=(1, ))

test_dataloader = val_dataloader
test_evaluator = val_evaluator
63 changes: 63 additions & 0 deletions configs/dsdl/imagenet1k.py
@@ -0,0 +1,63 @@
_base_ = [
'../_base_/models/resnet18.py', '../_base_/schedules/imagenet_bs256.py',
'../_base_/default_runtime.py'
]

# dataset settings
dataset_type = 'DSDLClsDataset'
data_root = 'data/ImageNet-1K'
img_prefix = 'raw/ImageNet-1K'
train_ann = 'dsdl/dsdl_Cls_full/set-train/train.yaml'
val_ann = 'dsdl/dsdl_Cls_full/set-val/val.yaml'

data_preprocessor = dict(
num_classes=1000,
# RGB format normalization parameters
mean=[123.675, 116.28, 103.53],
std=[58.395, 57.12, 57.375],
to_rgb=True,
)

train_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='RandomResizedCrop', scale=224),
dict(type='RandomFlip', prob=0.5, direction='horizontal'),
dict(type='PackInputs'),
]

test_pipeline = [
dict(type='LoadImageFromFile'),
dict(type='ResizeEdge', scale=256, edge='short'),
dict(type='CenterCrop', crop_size=224),
dict(type='PackInputs'),
]

train_dataloader = dict(
batch_size=32,
num_workers=2,
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file=train_ann,
data_prefix=dict(img_path=img_prefix),
test_mode=False,
pipeline=train_pipeline),
sampler=dict(type='DefaultSampler', shuffle=True),
)

val_dataloader = dict(
batch_size=32,
num_workers=2,
dataset=dict(
type=dataset_type,
data_root=data_root,
ann_file=val_ann,
data_prefix=dict(img_path=img_prefix),
test_mode=True,
pipeline=test_pipeline),
sampler=dict(type='DefaultSampler', shuffle=False),
)
val_evaluator = dict(type='Accuracy', topk=(1, 5))

test_dataloader = val_dataloader
test_evaluator = val_evaluator
4 changes: 3 additions & 1 deletion mmpretrain/datasets/__init__.py
@@ -6,6 +6,7 @@
from .cub import CUB
from .custom import CustomDataset
from .dataset_wrappers import KFoldDataset
from .dsdl import DSDLClsDataset
from .dtd import DTD
from .fgvcaircraft import FGVCAircraft
from .flowers102 import Flowers102
@@ -28,5 +29,6 @@
'VOC', 'build_dataset', 'ImageNet21k', 'KFoldDataset', 'CUB',
'CustomDataset', 'MultiLabelDataset', 'MultiTaskDataset', 'InShop',
'Places205', 'Flowers102', 'OxfordIIITPet', 'DTD', 'FGVCAircraft',
'StanfordCars', 'SUN397', 'Caltech101', 'Food101'
'StanfordCars', 'SUN397', 'Caltech101', 'Food101', 'DSDLClsDataset'
]
75 changes: 75 additions & 0 deletions mmpretrain/datasets/dsdl.py
@@ -0,0 +1,75 @@
# Copyright (c) OpenMMLab. All rights reserved.
import os
from typing import List

from mmpretrain.registry import DATASETS
from .base_dataset import BaseDataset

try:
from dsdl.dataset import DSDLDataset
except ImportError:
DSDLDataset = None


@DATASETS.register_module()
class DSDLClsDataset(BaseDataset):
"""Dataset for dsdl classification.

Args:
specific_key_path (dict): Path of a specific key which cannot
be loaded by its field name.
pre_transform (dict): Pre-transform functions applied before loading.
"""

METAINFO = {}

def __init__(self,
specific_key_path: dict = {},
pre_transform: dict = {},
**kwargs) -> None:

if DSDLDataset is None:
raise RuntimeError(
'Package dsdl is not installed. Please run "pip install dsdl".'
)

loc_config = dict(type='LocalFileReader', working_dir='')
if kwargs.get('data_root'):
kwargs['ann_file'] = os.path.join(kwargs['data_root'],
kwargs['ann_file'])
self.required_fields = ['Image', 'Label']

self.dsdldataset = DSDLDataset(
dsdl_yaml=kwargs['ann_file'],
location_config=loc_config,
required_fields=self.required_fields,
specific_key_path=specific_key_path,
transform=pre_transform,
)

BaseDataset.__init__(self, **kwargs)

def load_data_list(self) -> List[dict]:
"""Load data info from the dsdl yaml file ``self.ann_file``.

Returns:
List[dict]: A list of data info dicts.
"""
self._metainfo['classes'] = tuple(self.dsdldataset.class_names)

data_list = []

for i, data in enumerate(self.dsdldataset):
if len(data['Label']) == 1:
label_index = data['Label'][0].index_in_domain() - 1
else:
# multi labels
label_index = [
category.index_in_domain() - 1
for category in data['Label']
]
datainfo = dict(
img_path=os.path.join(self.data_prefix['img_path'],
data['Image'][0].location),
gt_label=label_index)
data_list.append(datainfo)
return data_list
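The label handling in `load_data_list` converts dsdl's 1-based `index_in_domain()` into the 0-based labels mmpretrain expects, keeping a list for multi-label samples. A minimal sketch of that conversion, using a hypothetical `FakeCategory` stub in place of a real dsdl category object:

```python
class FakeCategory:
    """Hypothetical stand-in for a dsdl category object."""

    def __init__(self, idx: int):
        self._idx = idx

    def index_in_domain(self) -> int:
        # dsdl category indices are 1-based
        return self._idx


def to_gt_label(labels):
    """Convert dsdl 1-based category indices to 0-based gt labels."""
    if len(labels) == 1:
        # single-label sample: a plain int
        return labels[0].index_in_domain() - 1
    # multi-label sample: a list of 0-based indices
    return [category.index_in_domain() - 1 for category in labels]


print(to_gt_label([FakeCategory(3)]))                   # 2
print(to_gt_label([FakeCategory(1), FakeCategory(5)]))  # [0, 4]
```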
14 changes: 14 additions & 0 deletions tests/data/dsdl_cls/config.py
@@ -0,0 +1,14 @@
# Copyright (c) OpenMMLab. All rights reserved.

local = dict(
type="LocalFileReader",
working_dir="the root path of the prepared dataset",)

ali_oss = dict(
type="AliOSSFileReader",
access_key_secret="your secret key of aliyun oss",
endpoint="your endpoint of aliyun oss",
access_key_id="your access key of aliyun oss",
bucket_name="your bucket name of aliyun oss",
working_dir="the path of the prepared dataset without the bucket's name")

16 changes: 16 additions & 0 deletions tests/data/dsdl_cls/defs/class-dom.yaml
@@ -0,0 +1,16 @@
$dsdl-version: "0.5.3"

Cifar10ImageClassificationClassDom:
$def: class_domain
classes:
- airplane
- automobile
- bird
- cat
- deer
- dog
- frog
- horse
- ship
- truck

9 changes: 9 additions & 0 deletions tests/data/dsdl_cls/defs/template.yaml
@@ -0,0 +1,9 @@
$dsdl-version: "0.5.3"

Cifar10Sample:
$def: struct
$params: ['cdom']
$fields:
image: Image
label: Label[dom=$cdom]
$optional: ['label']