- How many times have you struggled to find a useful dataset?
- How much time have you wasted preprocessing datasets?
- How burdensome is it to compare against other methods? Will you re-run their code? And what if there is no code at all?
Datasets are critical to machine learning, but you should focus on YOUR work! So we want to save your time by:
JUST GIVING THEM TO YOU, so you can use them DIRECTLY!
If you are tired of re-running the experiments of other methods, you can use the benchmark results directly.
Most datasets are image datasets:
Dataset | Area | #Sample | #Feature | #Class | Subdomain | Reference |
---|---|---|---|---|---|---|
Office+Caltech | Object recognition | 2533 | SURF:800 DeCAF:4096 | 10 | C, A, W, D | [1] |
Office-31 | Object recognition | 4110 | SURF:800 DeCAF:4096 | 31 | A, W, D | [1] |
Modern Office-31 | Image Classification | 6712 | — | 31 | A, S, W | [20] |
MNIST+USPS | Digit recognition | 3800 | 256 | 10 | USPS, MNIST | [4] |
COIL20 | Object recognition | 1440 | 1024 | 20 | COIL1, COIL2 | [4] |
PIE | Face recognition | 11554 | 1024 | 68 | PIE1~PIE5 | [6] |
VOC2007 | Object recognition | 3376 | DeCAF:4096 | 5 | V | [8] |
LabelMe | Object recognition | 2656 | DeCAF:4096 | 5 | L | [2] |
SUN09 | Object recognition | 3282 | DeCAF:4096 | 5 | S | [9] |
Caltech101 | Object recognition | 1415 | DeCAF:4096 | 5 | C | [3] |
IMAGENET | Object recognition | 7341 | DeCAF:4096 | 5 | I | [7] |
AWA | Animal recognition | 30475 | DeCAF:4096 SIFT/SURF:2000 | 50 | I | [5] |
Office-Home | Object recognition | 15588 | Original Images | 65 | Ar, Cl, Pr, Rw | [10] |
Cross-dataset Testbed | Image Classification | * | DeCAF7 | 40 | 3 domains | [15] |
ImageCLEF | Image Classification | * | raw | 12 | 3 domains | [17] |
VisDA | Image Classification / Segmentation | 280157 | raw | 12 | 3 domains / 3 domains | [18] |
LSDAC (DomainNet) | Image Classification | 569010 | raw | 345 | 6 domains | [19] |
Adaptiope | Image Classification | 36900 | — | 123 | P, S, R | [20] |
NEW An even larger dataset called DomainNet has been released by Boston University, with about half a million images, 6 domains, and 345 classes!
NEW A new dataset released by Stanford and UC Berkeley: Syn2Real: A New Benchmark for Synthetic-to-Real Visual Domain Adaptation.
There is also the Amazon review dataset for sentiment classification.
You can download the datasets here with the extraction code a82t.
Area: Visual object recognition
This is perhaps the most popular dataset for domain adaptation. Four domains are included: C (Caltech), A (Amazon), W (Webcam), and D (DSLR). In fact, this dataset is constructed from two others: Office-31 (which contributes the 31 classes of A, W, and D) and Caltech-256 (which contributes the 256 classes of C). The two share just 10 common classes, which form the Office+Caltech dataset.
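Since every ordered pair of distinct domains defines one transfer task, the four domains yield 12 tasks (C->A, C->W, ..., D->W). A minimal Python sketch that enumerates them:

```python
from itertools import permutations

# The four Office+Caltech domains listed above.
domains = ["C", "A", "W", "D"]

# Every ordered (source, target) pair of distinct domains is one task: 4 * 3 = 12.
for source, target in permutations(domains, 2):
    print(f"{source} -> {target}")
```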
Even for the same category, the data distributions of different domains are quite different. The following picture from [1] illustrates this fact with monitor images from the four domains.
There are usually two kinds of features: SURF and DeCAF6. Both have the same number of samples per domain, resulting in 2533 samples in total:
- C: 1123
- A: 958
- W: 295
- D: 157
And the feature dimensions are (see the loading sketch after this list):
- For SURF: 800
- For DeCAF6: 4096
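A minimal sketch for loading one of these feature files with SciPy. The file name is a placeholder, and the key names `fts` and `labels` are assumptions; inspect `mat.keys()` to confirm what your copy actually contains:

```python
import scipy.io

# Placeholder file name; point this at the .mat file you downloaded.
mat = scipy.io.loadmat("amazon_SURF.mat")

# Assumed key names -- check mat.keys() if your copy uses different ones.
X = mat["fts"]             # feature matrix, e.g. (958, 800) for Amazon SURF
y = mat["labels"].ravel()  # class labels for the 10 common classes

print(X.shape, y.shape)
```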
This dataset was first introduced by Gong et al. [1]. I got the SURF features from the website of [1], and the DeCAF features from [10].
See benchmark results of many popular methods here (SURF) and here (DeCAF).
Download Office+Caltech original images [Baiduyun]
Download Office+Caltech SURF dataset [MEGA|Baiduyun]
Download Office+Caltech DeCAF dataset [MEGA|Baiduyun]
This is the full Office dataset, which contains 31 categories across three domains: Amazon, Webcam, and DSLR.
See benchmarks on Office-31 datasets here.
Download Office-31 raw images:
- Jianguoyun (Password: FcaDrw)
- MEGA
- Azure (supports wget)
Download Office-31 DeCAF6 and DeCAF7 features:
- Jianguoyun (Password: qqLA7D)
Download Office-31 ResNet-50 features:
- Jianguoyun (Password: eS5fMT)
- MEGA
- Azure (supports wget)
Area: Handwritten digit recognition
This benchmark is also popular. It contains randomly selected samples from MNIST and USPS; each dataset serves in turn as the source and the target domain, yielding the two tasks MNIST -> USPS and USPS -> MNIST (see the sketch below).
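The 256 features in the table presumably correspond to images rescaled to 16x16 gray-scale pixels and flattened, following the protocol of [4]. A minimal sketch of that preprocessing, using Pillow and a placeholder image path:

```python
import numpy as np
from PIL import Image

def to_feature_vector(path):
    """Rescale an image to 16x16 gray-scale and flatten it to 256 dims."""
    img = Image.open(path).convert("L").resize((16, 16))
    return np.asarray(img, dtype=np.float32).ravel()  # shape: (256,)

# Example with a placeholder path:
# x = to_feature_vector("mnist_digit.png")
```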
Download the MNIST+USPS dataset [MEGA|Baiduyun]
Area: Object recognition
COIL20 contains 20 object classes. Two subsets, COIL1 and COIL2, are extracted from it to serve as the two domains.
Download the COIL20 dataset [MEGA|Baiduyun]
Area: Face recognition
PIE is a relatively large face dataset with 11554 samples across 68 classes; five subsets (PIE1~PIE5) serve as the domains.
Download the PIE dataset [MEGA|Baiduyun]
Area: Image classification
It contains four domains: V (VOC2007), L (LabelMe), S (SUN09), and C (Caltech101). The five shared classes are 'bird', 'car', 'chair', 'dog', and 'person'.
Download the VLSC DeCAF dataset [MEGA|Baiduyun]
It is selected from the ImageNet challenge.
Download the IMAGENET DeCAF dataset [MEGA|Baiduyun]
Download the AWA SIFT/SURF/DeCAF features [MEGA|Baiduyun]
This is a new dataset released at CVPR 2017 [14]. It contains 65 categories of objects crawled from the web, and its main purpose is to serve as a benchmark for domain adaptation algorithms.
The project home page is: http://hemanthdv.org/OfficeHome-Dataset/.
Download original images:
- Jianguoyun (Password: 726GYD)
- Azure (supports wget)
Download ResNet-50 pre-trained features:
This is a DeCAF7-based cross-dataset image classification dataset. It contains 40 categories of images from 3 domains: 3,847 images from Caltech256 (C), 4,000 images from ImageNet (I), and 2,626 images from SUN (S).
Download the Cross-dataset testbed
This is a dataset from the ImageCLEF 2014 challenge.
Download original images:
- Jianguoyun (Password: e5v8GG)
- MEGA
- Azure (supports wget)
Download ResNet-50 pre-trained features:
This is a dataset from the VisDA 2017 challenge [18]. It contains two sub-datasets: one for image classification tasks and the other for image segmentation tasks.
Download the VisDA-classification dataset
Download the VisDA-segmentation dataset
Download VisDA classification dataset features extracted by ResNet-50 | Download from MEGA
This is probably the largest domain adaptation dataset to date! Collected by Boston University, it contains 6 domains and 345 categories, amounting to roughly 600K images. The download link will be added once the authors release the dataset. You can refer to [19] for more information.
Download the Amazon review dataset:
- MEGA
- Jianguoyun (Password: AXMDi5)
- Azure (supports wget)
Adaptiope is probably one of the most versatile domain adaptation datasets with synthetic images (3D renderings). Overall, Adaptiope contains images of 123 categories in the 3 domains (product, real life, and synthetic), for a total of 36,900 images. Please refer to the project website for more information.
Modern Office-31 is a modernized version of the popular Office-31 dataset. This version fixes many of the annotation errors in the original dataset and also adds a challenging synthetic domain. Overall, Modern Office-31 contains 6,712 images in the 3 domains (Amazon, synthetic, and webcam). Please refer to the project website for more information.
[1] Gong B, Shi Y, Sha F, et al. Geodesic flow kernel for unsupervised domain adaptation[C]//Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012: 2066-2073.
[2] Russell B C, Torralba A, Murphy K P, et al. LabelMe: a database and web-based tool for image annotation[J]. International journal of computer vision, 2008, 77(1): 157-173.
[3] Griffin G, Holub A, Perona P. Caltech-256 object category dataset[J]. 2007.
[4] Long M, Wang J, Ding G, et al. Transfer feature learning with joint distribution adaptation[C]//Proceedings of the IEEE International Conference on Computer Vision. 2013: 2200-2207.
[5] http://attributes.kyb.tuebingen.mpg.de/
[6] http://www.cs.cmu.edu/afs/cs/project/PIE/MultiPie/Multi-Pie/Home.html
[7] http://www.cs.dartmouth.edu/~chenfang/proj_page/FXR_iccv13/
[8] Everingham M, Van Gool L, Williams C K, et al. The PASCAL visual object classes (VOC) challenge[J]. International Journal of Computer Vision, 2010, 88(2): 303-338.
[9] Choi M J, Lim J J, Torralba A, et al. Exploiting hierarchical context on a large database of object categories[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2010: 129-136.
[10] http://www.uow.edu.au/~jz960/
[11] Zhang J, Li W, Ogunbona P. Joint Geometrical and Statistical Alignment for Visual Domain Adaptation[C]. CVPR 2017.
[12] Tahmoresnezhad J, Hashemi S. Visual domain adaptation via transfer feature learning[J]. Knowledge and Information Systems, 2017, 50(2): 585-605.
[13] Long M, Wang J, Sun J, et al. Domain invariant transfer kernel learning[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(6): 1519-1532.
[14] Venkateswara H, Eusebio J, Chakraborty S, et al. Deep hashing network for unsupervised domain adaptation[C]. CVPR 2017.
[15] Daumé III H. Frustratingly easy domain adaptation[J]. arXiv preprint arXiv:0907.1815, 2009.
[16] Luo L, Chen L, Hu S. Discriminative Label Consistent Domain Adaptation[J]. arXiv preprint arXiv:1802.08077, 2018.
[17] http://imageclef.org/2014/adaptation
[18] Peng X, et al. VisDA: The visual domain adaptation challenge[J]. arXiv preprint arXiv:1710.06924, 2017.
[19] Peng X, et al. Moment matching for multi-source domain adaptation[J]. arXiv preprint arXiv:1812.01754, 2018.
[20] Ringwald T, et al. Adaptiope: A modern benchmark for unsupervised domain adaptation[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 2021.