- How many times have you struggled to find a useful dataset?
- How much time have you wasted preprocessing datasets?
- How burdensome is it to compare against other methods? Will you re-run their code? And what if there is no code at all?
Datasets are critical to machine learning, but you should focus on YOUR work! So we want to save your time by:
JUST GIVING THEM TO YOU, so you can use them DIRECTLY!
If you are tired of re-running the experiments of other methods, you can use the benchmark results directly.
Most datasets are image datasets:
Dataset | Area | #Sample | #Feature | #Class | Subdomain | Reference |
---|---|---|---|---|---|---|
Office+Caltech | Object recognition | 2533 | SURF:800 DeCAF:4096 | 10 | C, A, W, D | [1] |
Office-31 | Object recognition | 4110 | SURF:800 DeCAF:4096 | 31 | A, W, D | [1] |
Modern Office-31 | Image Classification | 6712 | — | 31 | A, S, W | [20] |
MNIST+USPS | Digit recognition | 3800 | 256 | 10 | USPS, MNIST | [4] |
COIL20 | Object recognition | 1440 | 1024 | 20 | COIL1, COIL2 | [4] |
PIE | Face recognition | 11554 | 1024 | 68 | PIE1~PIE5 | [6] |
VOC2007 | Object recognition | 3376 | DeCAF:4096 | 5 | V | [8] |
LabelMe | Object recognition | 2656 | DeCAF:4096 | 5 | L | [2] |
SUN09 | Object recognition | 3282 | DeCAF:4096 | 5 | S | [9] |
Caltech101 | Object recognition | 1415 | DeCAF:4096 | 5 | C | [3] |
IMAGENET | Object recognition | 7341 | DeCAF:4096 | 5 | I | [7] |
AWA | Animal recognition | 30475 | DeCAF:4096 SIFT/SURF:2000 | 50 | I | [5] |
Office-Home | Object recognition | 15588 | Original Images | 65 | Ar, Cl, Pr, Rw | [10] |
Cross-dataset Testbed | Image Classification | * | DeCAF7 | 40 | 3 domains | [15] |
ImageCLEF | Image Classification | * | raw | 12 | 3 domains | [17] |
VisDA | Image Classification / Segmentation | 280157 | raw | 12 | 3 domains / 3 domains | [18] |
LSDAC (DomainNet) | Image Classification | 569010 | raw | 345 | 6 domains | [19] |
Adaptiope | Image Classification | 36900 | — | 123 | P, S, R | [20] |
NEW An even larger dataset called DomainNet has been released by Boston University, with about half a million images, 6 domains, and 345 classes!
NEW A new dataset released by Stanford and UC Berkeley: Syn2Real: A New Benchmark for Synthetic-to-Real Visual Domain Adaptation.
There is also the Amazon review dataset for sentiment classification.
You can download the datasets here with the extraction code a82t.
Area: Visual object recognition
This is perhaps the most popular dataset for domain adaptation. Four domains are included: C (Caltech), A (Amazon), W (Webcam), and D (DSLR). In fact, this dataset is constructed from two others: Office-31 (which contributes the 31 classes of A, W, and D) and Caltech-256 (which contributes the 256 classes of C). The two share just 10 common classes, which form the Office+Caltech dataset.
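Since every ordered pair of distinct domains defines one transfer task, the four domains yield 12 tasks (C->A, C->W, ..., D->W). A minimal Python sketch that enumerates them:

```python
from itertools import permutations

# The four Office+Caltech domains listed above.
domains = ["C", "A", "W", "D"]

# Every ordered (source, target) pair of distinct domains is one task: 4 * 3 = 12.
for source, target in permutations(domains, 2):
    print(f"{source} -> {target}")
```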
Even for the same category, the data distributions of different domains are quite different. The following picture from [1] illustrates this fact with monitor images from the four domains.
There are usually two kinds of features: SURF and DeCAF6. Both have the same number of samples per domain, resulting in 2533 samples in total:
- C: 1123
- A: 958
- W: 295
- D: 157
And the feature dimensions are (see the loading sketch after this list):
- For SURF: 800
- For DeCAF6: 4096
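A minimal sketch for loading one of these feature files with SciPy. The file name is a placeholder, and the key names `fts` and `labels` are assumptions; inspect `mat.keys()` to confirm what your copy actually contains:

```python
import scipy.io

# Placeholder file name; point this at the .mat file you downloaded.
mat = scipy.io.loadmat("amazon_SURF.mat")

# Assumed key names -- check mat.keys() if your copy uses different ones.
X = mat["fts"]             # feature matrix, e.g. (958, 800) for Amazon SURF
y = mat["labels"].ravel()  # class labels for the 10 common classes

print(X.shape, y.shape)
```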
This dataset was first introduced by Gong et al. [1]. I got the SURF features from the website of [1], and the DeCAF features from [10].
See benchmark results of many popular methods here (SURF) and here (DeCAF).
Download Office+Caltech original images [Baiduyun]
Download Office+Caltech SURF dataset [MEGA|Baiduyun]
Download Office+Caltech DeCAF dataset [MEGA|Baiduyun]
This is the full Office dataset, which contains 31 categories across three domains: Amazon, Webcam, and DSLR.
See benchmarks on Office-31 datasets here.
Download Office-31 raw images:
- Jianguoyun (Password: FcaDrw)
- MEGA
- Azure (supports wget)
Download Office-31 DeCAF6 and DeCAF7 features:
- Jianguoyun (Password: qqLA7D)
Download Office-31 ResNet-50 features:
- Jianguoyun (Password: eS5fMT)
- MEGA
- Azure (supports wget)
Area: Handwritten digit recognition
This benchmark is also popular. It contains randomly selected samples from MNIST and USPS; each dataset serves in turn as the source and the target domain, yielding the two tasks MNIST -> USPS and USPS -> MNIST (see the sketch below).
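The 256 features in the table presumably correspond to images rescaled to 16x16 gray-scale pixels and flattened, following the protocol of [4]. A minimal sketch of that preprocessing, using Pillow and a placeholder image path:

```python
import numpy as np
from PIL import Image

def to_feature_vector(path):
    """Rescale an image to 16x16 gray-scale and flatten it to 256 dims."""
    img = Image.open(path).convert("L").resize((16, 16))
    return np.asarray(img, dtype=np.float32).ravel()  # shape: (256,)

# Example with a placeholder path:
# x = to_feature_vector("mnist_digit.png")
```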
Download the MNIST+USPS dataset [MEGA|Baiduyun]
Area: Object recognition
COIL20 contains 20 object classes. Two subsets, COIL1 and COIL2, are extracted from it to serve as the two domains.
Download the COIL20 dataset [MEGA|Baiduyun]
Area: Face recognition
PIE is a relatively large face dataset with 11554 samples across 68 classes; five subsets (PIE1~PIE5) serve as the domains.
Download the PIE dataset [MEGA|Baiduyun]
Area: Image classification
It contains four domains: V (VOC2007), L (LabelMe), S (SUN09), and C (Caltech101). The five shared classes are 'bird', 'car', 'chair', 'dog', and 'person'.
Download the VLSC DeCAF dataset [MEGA|Baiduyun]
It is selected from the ImageNet challenge.
Download the IMAGENET DeCAF dataset [MEGA|Baiduyun]
Download the AWA SIFT/SURF/DeCAF features [MEGA|Baiduyun]
This is a new dataset released at CVPR 2017 [14]. It contains 65 categories of objects crawled from the web, and its main purpose is to serve as a benchmark for domain adaptation algorithms.
The project home page is: http://hemanthdv.org/OfficeHome-Dataset/.
Download original images:
- Jianguoyun (Password: 726GYD)
- Azure (supports wget)
Download ResNet-50 pre-trained features:
This is a DeCAF7-based cross-dataset image classification dataset. It contains 40 categories of images from 3 domains: 3,847 images from Caltech256 (C), 4,000 images from ImageNet (I), and 2,626 images from SUN (S).
Download the Cross-dataset testbed
This is a dataset from the ImageCLEF 2014 challenge.
Download original images:
- Jianguoyun (Password: e5v8GG)
- MEGA
- Azure (supports wget)
Download ResNet-50 pre-trained features:
This is a dataset from the VisDA 2017 challenge [18]. It contains two sub-datasets: one for image classification tasks and the other for image segmentation tasks.
Download the VisDA-classification dataset
Download the VisDA-segmentation dataset
Download VisDA classification dataset features extracted by ResNet-50 | Download from MEGA
This is probably the largest domain adaptation dataset to date! Collected by Boston University, it contains 6 domains and 345 categories, amounting to roughly 600K images. The download link will be added once the authors release the dataset. You can refer to [19] for more information.
Download the Amazon review dataset:
- MEGA
- Jianguoyun (Password: AXMDi5)
- Azure (supports wget)
Adaptiope is probably one of the most versatile domain adaptation datasets with synthetic images (3D renderings). Overall, Adaptiope contains images of 123 categories in the 3 domains (product, real life, and synthetic), for a total of 36,900 images. Please refer to the project website for more information.
Modern Office-31 is a modernized version of the popular Office-31 dataset. This version fixes many of the annotation errors in the original dataset and also adds a challenging synthetic domain. Overall, Modern Office-31 contains 6,712 images in the 3 domains (Amazon, synthetic, and webcam). Please refer to the project website for more information.
[1] Gong B, Shi Y, Sha F, et al. Geodesic flow kernel for unsupervised domain adaptation[C]//Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012: 2066-2073.
[2] Russell B C, Torralba A, Murphy K P, et al. LabelMe: a database and web-based tool for image annotation[J]. International journal of computer vision, 2008, 77(1): 157-173.
[3] Griffin G, Holub A, Perona P. Caltech-256 object category dataset[J]. 2007.
[4] Long M, Wang J, Ding G, et al. Transfer feature learning with joint distribution adaptation[C]//Proceedings of the IEEE International Conference on Computer Vision. 2013: 2200-2207.
[5] http://attributes.kyb.tuebingen.mpg.de/
[6] http://www.cs.cmu.edu/afs/cs/project/PIE/MultiPie/Multi-Pie/Home.html
[7] http://www.cs.dartmouth.edu/~chenfang/proj_page/FXR_iccv13/
[8] Everingham M, Van Gool L, Williams C K, et al. The PASCAL visual object classes (VOC) challenge[J]. International Journal of Computer Vision, 2010, 88(2): 303-338.
[9] Choi M J, Lim J J, Torralba A, et al. Exploiting hierarchical context on a large database of object categories[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2010: 129-136.
[10] http://www.uow.edu.au/~jz960/
[11] Zhang J, Li W, Ogunbona P. Joint Geometrical and Statistical Alignment for Visual Domain Adaptation[C]. CVPR 2017.
[12] Tahmoresnezhad J, Hashemi S. Visual domain adaptation via transfer feature learning[J]. Knowledge and Information Systems, 2017, 50(2): 585-605.
[13] Long M, Wang J, Sun J, et al. Domain invariant transfer kernel learning[J]. IEEE Transactions on Knowledge and Data Engineering, 2015, 27(6): 1519-1532.
[14] Venkateswara H, Eusebio J, Chakraborty S, et al. Deep hashing network for unsupervised domain adaptation[C]. CVPR 2017.
[15] Daumé III H. Frustratingly easy domain adaptation[J]. arXiv preprint arXiv:0907.1815, 2009.
[16] Luo L, Chen L, Hu S. Discriminative Label Consistent Domain Adaptation[J]. arXiv preprint arXiv:1802.08077, 2018.
[17] http://imageclef.org/2014/adaptation
[18] Peng X, et al. VisDA: The visual domain adaptation challenge[J]. arXiv preprint arXiv:1710.06924, 2017.
[19] Peng X, et al. Moment matching for multi-source domain adaptation[J]. arXiv preprint arXiv:1812.01754, 2018.
[20] Ringwald T, et al. Adaptiope: A modern benchmark for unsupervised domain adaptation[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 2021.