[Feature] Support DSDL Dataset #1503
Open: wufan-tb wants to merge 1 commit into open-mmlab:dev from wufan-tb:dev-dsdl
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,112 @@ | ||
# DSDL: Standard Description Language for DataSet | ||
|
||
## 1. Abstract | ||
|
||
Data is the cornerstone of artificial intelligence. The efficiency of data acquisition, exchange, and application directly impacts the advances in technologies and applications. Over the long history of AI, a vast quantity of data sets have been developed and distributed. However, these datasets are defined in very different forms, which incurs significant overhead when it comes to exchange, integration, and utilization -- it is often the case that one needs to develop a new customized tool or script in order to incorporate a new dataset into a workflow. | ||
|
||
To overcome such difficulties, we develop **Data Set Description Language (DSDL)**. For more details, please visit our [official documents](https://opendatalab.github.io/dsdl-docs/getting_started/overview/). DSDL datasets can be downloaded from our platform, [OpenDataLab](https://opendatalab.com/). | ||
|
||
## 2. Steps | ||
|
||
- install dsdl and opendatalab: | ||
|
||
``` | ||
pip install dsdl | ||
pip install opendatalab | ||
``` | ||
|
||
- install mmpretrain and pytorch: | ||
please refer to the [installation documentation](https://mmpretrain.readthedocs.io/en/latest/get_started.html). | ||
|
||
- prepare dsdl dataset (take cifar10 as an example) | ||
|
||
- download dsdl dataset (you will need an opendatalab account to do so. [register one now](https://opendatalab.com/)) | ||
|
||
``` | ||
cd data | ||
|
||
odl login | ||
odl get CIFAR-10 | ||
``` | ||
|
||
datasets on the OpenDataLab platform are usually compressed, so the downloaded cifar10 dataset should look like this: | ||
|
||
``` | ||
data/ | ||
├── CIFAR-10 | ||
│ ├── dsdl | ||
│ │ └── dsdl_Cls_full.zip | ||
│ ├── raw | ||
│ │ ├── cifar-10-binary.tar.gz | ||
│ │ ├── cifar-10-matlab.tar.gz | ||
│ │ └── cifar-10-python.tar.gz | ||
│ └── README.md | ||
└── ... | ||
``` | ||
|
||
- decompress dataset | ||
|
||
decompress dsdl files: | ||
|
||
``` | ||
cd CIFAR-10/dsdl | ||
unzip dsdl_Cls_full.zip | ||
``` | ||
|
||
decompress the raw data and save it as image files; we provide a python script to do so: | ||
|
||
``` | ||
cd .. | ||
python dsdl/dsdl_Cls_full/tools/prepare.py raw/ | ||
|
||
cd ../../ | ||
``` | ||
|
||
after running this script, there will be a new folder named `prepared` (this step is not needed for every dataset; cifar10 is distributed as binary files that must be extracted into image files): | ||
|
||
``` | ||
data/ | ||
├── CIFAR-10 | ||
│ ├── dsdl | ||
│ │ └── ... | ||
│ ├── raw | ||
│ │ └── ... | ||
│ ├── prepared | ||
│ │ └── images | ||
│ └── README.md | ||
└── ... | ||
``` | ||
|
||
- change training config | ||
|
||
open the [cifar10 config file](cifar10.py) and set some file paths as below: | ||
|
||
``` | ||
data_root = 'data/CIFAR-10' | ||
img_prefix = 'prepared' | ||
train_ann = 'dsdl/dsdl_Cls_full/set-train/train.yaml' | ||
val_ann = 'dsdl/dsdl_Cls_full/set-test/test.yaml' | ||
``` | ||
|
||
because all dsdl datasets for the same task share one dataloader, we can simply change these file paths to train a model on a different dataset. | ||
|
||
- train: | ||
|
||
- using single gpu: | ||
|
||
``` | ||
python tools/train.py {config_file} | ||
``` | ||
|
||
- using slurm: | ||
|
||
``` | ||
./tools/slurm_train.sh {partition} {job_name} {config_file} {work_dir} {gpu_nums} | ||
``` | ||
|
||
## 3. Test Results | ||
|
||
| Datasets | Model | Top-1 Acc (%) | Config | | ||
| :---------: | :-------------------------------------------------------------------------------------------------------------: | :-----------: | :-----------------------: | | ||
| cifar10 | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_b16x8_cifar10_20210528-bd6371c8.pth) | 94.83 | [config](./cifar10.py) | | ||
| ImageNet-1k | [model](https://download.openmmlab.com/mmclassification/v0/resnet/resnet18_8xb32_in1k_20210831-fbbb1da6.pth) | 69.84 | [config](./imagenet1k.py) | | ||
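The README's point that switching datasets only means changing a few file paths can be sketched with a small helper. This is a minimal illustration, not part of the PR: the function names `dsdl_cls_dataloader` and `missing_paths` are hypothetical, and the dict layout simply mirrors the dataloader settings shown in the PR's `cifar10.py`.

```python
import os


def dsdl_cls_dataloader(data_root, img_prefix, ann_file, pipeline,
                        batch_size=16, num_workers=2, training=True):
    """Build a dataloader config dict shaped like the ones in cifar10.py.

    Hypothetical helper: only the paths change between datasets.
    """
    return dict(
        batch_size=batch_size,
        num_workers=num_workers,
        dataset=dict(
            type='DSDLClsDataset',
            data_root=data_root,
            ann_file=ann_file,
            data_prefix=dict(img_path=img_prefix),
            test_mode=not training,
            pipeline=pipeline),
        sampler=dict(type='DefaultSampler', shuffle=training),
    )


def missing_paths(data_root, img_prefix, *ann_files):
    """Return the expected dataset paths that do not exist on disk."""
    expected = [os.path.join(data_root, img_prefix)]
    expected += [os.path.join(data_root, ann) for ann in ann_files]
    return [path for path in expected if not os.path.exists(path)]


# Paths taken from the README's CIFAR-10 example.
cifar10_train = dsdl_cls_dataloader(
    data_root='data/CIFAR-10',
    img_prefix='prepared',
    ann_file='dsdl/dsdl_Cls_full/set-train/train.yaml',
    pipeline=[dict(type='LoadImageFromFile'), dict(type='PackInputs')])
```

Calling `missing_paths('data/CIFAR-10', 'prepared', 'dsdl/dsdl_Cls_full/set-train/train.yaml')` before training is one way to catch a skipped download or decompression step early.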
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,60 @@ | ||
_base_ = [ | ||
'../_base_/models/resnet18_cifar.py', | ||
'../_base_/schedules/cifar10_bs128.py', '../_base_/default_runtime.py' | ||
] | ||
|
||
# dataset settings | ||
dataset_type = 'DSDLClsDataset' | ||
data_root = 'data/CIFAR-10' | ||
img_prefix = 'prepared' | ||
train_ann = 'dsdl/dsdl_Cls_full/set-train/train.yaml' | ||
val_ann = 'dsdl/dsdl_Cls_full/set-test/test.yaml' | ||
|
||
data_preprocessor = dict( | ||
num_classes=10, | ||
# RGB format normalization parameters | ||
mean=[125.307, 122.961, 113.8575], | ||
std=[51.5865, 50.847, 51.255], | ||
to_rgb=True) | ||
|
||
train_pipeline = [ | ||
dict(type='LoadImageFromFile'), | ||
dict(type='RandomCrop', crop_size=32, padding=4), | ||
dict(type='RandomFlip', prob=0.5, direction='horizontal'), | ||
dict(type='PackInputs'), | ||
] | ||
|
||
test_pipeline = [ | ||
dict(type='LoadImageFromFile'), | ||
dict(type='PackInputs'), | ||
] | ||
|
||
train_dataloader = dict( | ||
batch_size=16, | ||
num_workers=2, | ||
dataset=dict( | ||
type=dataset_type, | ||
data_root=data_root, | ||
ann_file=train_ann, | ||
data_prefix=dict(img_path=img_prefix), | ||
test_mode=False, | ||
pipeline=train_pipeline), | ||
sampler=dict(type='DefaultSampler', shuffle=True), | ||
) | ||
|
||
val_dataloader = dict( | ||
batch_size=16, | ||
num_workers=2, | ||
dataset=dict( | ||
type=dataset_type, | ||
data_root=data_root, | ||
ann_file=val_ann, | ||
data_prefix=dict(img_path=img_prefix), | ||
test_mode=True, | ||
pipeline=test_pipeline), | ||
sampler=dict(type='DefaultSampler', shuffle=False), | ||
) | ||
val_evaluator = dict(type='Accuracy', topk=(1, )) | ||
|
||
test_dataloader = val_dataloader | ||
test_evaluator = val_evaluator |
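The `mean` and `std` values in `data_preprocessor` above are on the 0-255 pixel scale. They match the commonly quoted CIFAR-10 per-channel statistics on the [0, 1] scale multiplied by 255. The [0, 1] figures below are an assumption brought in for illustration; the PR itself only lists the 0-255 values.

```python
# Commonly quoted CIFAR-10 per-channel statistics on the [0, 1] scale
# (assumed here; the PR only gives the 0-255 values).
mean_01 = [0.4914, 0.4822, 0.4465]
std_01 = [0.2023, 0.1994, 0.2010]

# Rescale to the 0-255 pixel range used by the config.
mean_255 = [round(m * 255, 4) for m in mean_01]
std_255 = [round(s * 255, 4) for s in std_01]

print(mean_255)
print(std_255)
```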
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
_base_ = [ | ||
'../_base_/models/resnet18.py', '../_base_/schedules/imagenet_bs256.py', | ||
'../_base_/default_runtime.py' | ||
] | ||
|
||
# dataset settings | ||
dataset_type = 'DSDLClsDataset' | ||
data_root = 'data/ImageNet-1K' | ||
img_prefix = 'raw/ImageNet-1K' | ||
train_ann = 'dsdl/dsdl_Cls_full/set-train/train.yaml' | ||
val_ann = 'dsdl/dsdl_Cls_full/set-val/val.yaml' | ||
|
||
data_preprocessor = dict( | ||
num_classes=1000, | ||
# RGB format normalization parameters | ||
mean=[123.675, 116.28, 103.53], | ||
std=[58.395, 57.12, 57.375], | ||
to_rgb=True, | ||
) | ||
|
||
train_pipeline = [ | ||
dict(type='LoadImageFromFile'), | ||
dict(type='RandomResizedCrop', scale=224), | ||
dict(type='RandomFlip', prob=0.5, direction='horizontal'), | ||
dict(type='PackInputs'), | ||
] | ||
|
||
test_pipeline = [ | ||
dict(type='LoadImageFromFile'), | ||
dict(type='ResizeEdge', scale=256, edge='short'), | ||
dict(type='CenterCrop', crop_size=224), | ||
dict(type='PackInputs'), | ||
] | ||
|
||
train_dataloader = dict( | ||
batch_size=32, | ||
num_workers=2, | ||
dataset=dict( | ||
type=dataset_type, | ||
data_root=data_root, | ||
ann_file=train_ann, | ||
data_prefix=dict(img_path=img_prefix), | ||
test_mode=False, | ||
pipeline=train_pipeline), | ||
sampler=dict(type='DefaultSampler', shuffle=True), | ||
) | ||
|
||
val_dataloader = dict( | ||
batch_size=32, | ||
num_workers=2, | ||
dataset=dict( | ||
type=dataset_type, | ||
data_root=data_root, | ||
ann_file=val_ann, | ||
data_prefix=dict(img_path=img_prefix), | ||
test_mode=True, | ||
pipeline=test_pipeline), | ||
sampler=dict(type='DefaultSampler', shuffle=False), | ||
) | ||
val_evaluator = dict(type='Accuracy', topk=(1, 5)) | ||
|
||
test_dataloader = val_dataloader | ||
test_evaluator = val_evaluator |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,75 @@ | ||
# Copyright (c) OpenMMLab. All rights reserved. | ||
import os | ||
from typing import List | ||
|
||
from mmpretrain.registry import DATASETS | ||
from .base_dataset import BaseDataset | ||
|
||
try: | ||
    from dsdl.dataset import DSDLDataset | ||
except ImportError: | ||
    DSDLDataset = None | ||
|
||
|
||
@DATASETS.register_module() | ||
class DSDLClsDataset(BaseDataset): | ||
    """Dataset for dsdl classification. | ||
|
||
    Args: | ||
        specific_key_path(dict): Path of a specific key that cannot | ||
            be loaded by its field name. | ||
        pre_transform(dict): Pre-transform functions applied before loading. | ||
    """ | ||
|
||
    METAINFO = {} | ||
|
||
    def __init__(self, | ||
                 specific_key_path: dict = {}, | ||
                 pre_transform: dict = {}, | ||
                 **kwargs) -> None: | ||
|
||
        if DSDLDataset is None: | ||
            raise RuntimeError( | ||
                'Package dsdl is not installed. Please run "pip install dsdl".' | ||
            ) | ||
|
||
        loc_config = dict(type='LocalFileReader', working_dir='') | ||
        if kwargs.get('data_root'): | ||
            kwargs['ann_file'] = os.path.join(kwargs['data_root'], | ||
                                              kwargs['ann_file']) | ||
        self.required_fields = ['Image', 'Label'] | ||
|
||
        self.dsdldataset = DSDLDataset( | ||
            dsdl_yaml=kwargs['ann_file'], | ||
            location_config=loc_config, | ||
            required_fields=self.required_fields, | ||
            specific_key_path=specific_key_path, | ||
            transform=pre_transform, | ||
        ) | ||
|
||
        BaseDataset.__init__(self, **kwargs) | ||
|
||
    def load_data_list(self) -> List[dict]: | ||
        """Load data info from the dsdl yaml file ``self.ann_file``. | ||
        Returns: | ||
            List[dict]: A list of data info dicts. | ||
        """ | ||
        self._metainfo['classes'] = tuple(self.dsdldataset.class_names) | ||
|
||
        data_list = [] | ||
|
||
        for i, data in enumerate(self.dsdldataset): | ||
            if len(data['Label']) == 1: | ||
                label_index = data['Label'][0].index_in_domain() - 1 | ||
            else: | ||
                # multi labels | ||
                label_index = [ | ||
                    category.index_in_domain() - 1 | ||
                    for category in data['Label'] | ||
                ] | ||
            datainfo = dict( | ||
                img_path=os.path.join(self.data_prefix['img_path'], | ||
                                      data['Image'][0].location), | ||
                gt_label=label_index) | ||
            data_list.append(datainfo) | ||
        return data_list |
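The label handling in `load_data_list` above can be tried in isolation with a stand-in object. This is a sketch under assumptions: `FakeLabel` and `to_gt_label` are hypothetical names, and the reading that `index_in_domain()` is 1-based is inferred from the `- 1` in the diff rather than confirmed against the dsdl documentation.

```python
class FakeLabel:
    """Hypothetical stand-in for a dsdl label object."""

    def __init__(self, index):
        self._index = index

    def index_in_domain(self):
        # Assumed to be 1-based, as the "- 1" in the PR code suggests.
        return self._index


def to_gt_label(labels):
    """Convert dsdl labels to 0-based indices, as load_data_list does."""
    if len(labels) == 1:
        # Single-label sample: return a plain int.
        return labels[0].index_in_domain() - 1
    # Multi-label sample: return a list of ints.
    return [label.index_in_domain() - 1 for label in labels]


single = to_gt_label([FakeLabel(1)])
multi = to_gt_label([FakeLabel(2), FakeLabel(5)])
```

Note that a single-label sample yields an `int` while a multi-label sample yields a `list`, so downstream code has to accept both shapes.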
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
# Copyright (c) OpenMMLab. All rights reserved. | ||
|
||
local = dict( | ||
type="LocalFileReader", | ||
working_dir="the root path of the prepared dataset",) | ||
|
||
ali_oss = dict( | ||
type="AliOSSFileReader", | ||
access_key_secret="your secret key of aliyun oss", | ||
endpoint="your endpoint of aliyun oss", | ||
access_key_id="your access key of aliyun oss", | ||
bucket_name="your bucket name of aliyun oss", | ||
working_dir="the path of the prepared dataset without the bucket's name") | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
$dsdl-version: "0.5.3" | ||
|
||
Cifar10ImageClassificationClassDom: | ||
$def: class_domain | ||
classes: | ||
- airplane | ||
- automobile | ||
- bird | ||
- cat | ||
- deer | ||
- dog | ||
- frog | ||
- horse | ||
- ship | ||
- truck | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
$dsdl-version: "0.5.3" | ||
|
||
Cifar10Sample: | ||
$def: struct | ||
$params: ['cdom'] | ||
$fields: | ||
image: Image | ||
label: Label[dom=$cdom] | ||
$optional: ['label'] |
I am afraid that the ImageNet category mapping used by DSDL is not compatible with mmpretrain, and the result cannot be reproduced. The result can be reproduced by using the following code for mapping according to ILSVRC2012_mapping.txt.
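Here is the situation: when DSDL converts the dataset, the class order does differ from the original ImageNet, so testing with a previously pretrained model gives inconsistent results. The numbers in the table, however, were obtained after I aligned the class order in the loading code, so they can be reproduced. That alignment step applies only to ImageNet, so I removed it when preparing the merge.
In fact, if a model is retrained with DSDLDataset, no order alignment is needed and the accuracy still matches, so the code does not need this alignment step.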
To reduce ambiguity, how about deleting these two rows from the table?
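The class-order alignment discussed in the review comments above could look roughly like the sketch below. It is not the code the author removed from the PR: `build_index_remap` is a hypothetical helper and the three-class orderings are made up. For ImageNet, the two orderings would come from the DSDL class domain and from a mapping file such as the `ILSVRC2012_mapping.txt` the reviewer mentions, whose parsing is not shown here.

```python
def build_index_remap(source_classes, target_classes):
    """Map each index in source order to the same class's index in target order."""
    target_index = {name: i for i, name in enumerate(target_classes)}
    return [target_index[name] for name in source_classes]


# Illustrative orderings only (hypothetical, not the real ImageNet lists).
dsdl_order = ['cat', 'dog', 'bird']    # order produced by the conversion
model_order = ['bird', 'cat', 'dog']   # order the pretrained model expects

remap = build_index_remap(dsdl_order, model_order)

# Labels loaded in dsdl order, translated into the model's order.
gt_labels = [0, 2, 1]
aligned = [remap[label] for label in gt_labels]
```

As the author notes, this step only matters when evaluating a checkpoint trained with a different class order; a model trained directly on the DSDL dataset would not need it.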