Skip to content

Latest commit

 

History

History
111 lines (96 loc) · 6.28 KB

Data.md

File metadata and controls

111 lines (96 loc) · 6.28 KB

Source Datasets of REBU-Syn

  • We collected labeled data from 16 publicly available real datasets to construct REBU-Syn. The details of these datasets are listed in the following table.
Data file name Size Link License
OpenVINO 1.5M https://storage.googleapis.com/openimages/web/index.html Apache License 2.0
TextOCR 0.8M https://textvqa.org/textocr/dataset/ CC BY 4.0
ICDAR2013 843 https://rrc.cvc.uab.es/?ch=2 Unknown
ICDAR2015 4,467 https://rrc.cvc.uab.es/?ch=4 CC BY 4.0
IIIT5K 2,000 https://cvit.iiit.ac.in/research/projects/cvit-projects/the-iiit-5k-word-dataset MIT License
SVT 257 http://www.iapr-tc11.org/mediawiki/index.php/The_Street_View_Text_Dataset Unknown
Total-Text 12,251 https://github.com/cs-chan/Total-Text-Dataset BSD-3 license
CTW1500 3,170 https://github.com/Yuliang-Liu/Curve-Text-Detector Unknown
Uber 127,850 https://s3-us-west-2.amazonaws.com/uber-common-public/ubertext/index.html Unknown
RCTW17 10,245 https://rctw.vlrlab.net/dataset Unknown
COCOv2.0 72,950 https://vision.cornell.edu/se3/coco-text-2/ CC BY 4.0
LSVT 8,164 https://rrc.cvc.uab.es/?ch=16 Unknown
MLT19 55,112 https://rrc.cvc.uab.es/?ch=15 CC BY-NC 4.0
ReCTS 26,040 https://rrc.cvc.uab.es/?ch=12 Unknown
ArT 31,966 https://rrc.cvc.uab.es/?ch=14 Unknown
Union14M_L_lmdb_format 3M https://github.com/Mountchicken/Union14M/tree/main?tab=readme-ov-file#34-download MIT License
  • We collected labeled data from 4 publicly available synthetic datasets to construct REBU-Syn. The details of these datasets are listed in the following table.
Data file name Size Link License
MJ 6M https://www.robots.ox.ac.uk/~vgg/data/text/ Unknown
ST 9M https://www.robots.ox.ac.uk/~vgg/data/scenetext/ Unknown
Curved SynthText 1.7M https://github.com/Jyouhou/ICDAR2019-ArT-Recognition-Alchemy Apache License 2.0
SynthAdd 1.2M https://github.com/wangpengnorman/SAR-Strong-Baseline-for-Text-Recognition Unknown

Datasets

Download the training dataset from the following links:

  1. LMDB archives for MJ, ST, IIIT5k, SVT, SVTP, IC13, IC15, CUTE80, ArT, RCTW17, ReCTS, LSVT, MLT19, COCO-Text, and Uber-Text.
  2. LMDB archives for TextOCR and OpenVINO.
  3. LMDB archives for Union14M_L_lmdb_format.
  4. CTW1500
  5. Total-Text
  6. SynthAdd
  7. Curved SynthText

Then, organize the data as follows:

├── REBU-Syn
├── train
│   └── synth_and_real
│       ├── Curved_SynthText
│       │   ├── syntext1
│       │   └── syntext2
│       ├── SynthAdd
│       │   ├── data.mdb
│       │   └── lock.mdb
│       ├── Union14M_L_lmdb_format
│       │   ├── difficult
│       │   ├── hard
│       │   ├── hell
│       │   ├── medium
│       │   └── simple
│       ├── benchmark
│       │   ├── ICDAR2013
│       │   ├── ICDAR2015
│       │   ├── IIIT5K
│       │   └── SVT
│       ├── extra
│       │   ├── CTW1500
│       │   └── total_text
│       └── real_data
│       │   ├── ArT
│       │   ├── COCOv2.0
│       │   ├── LSVT
│       │   ├── MLT19
│       │   ├── OpenVINO
│       │   ├── RCTW17
│       │   ├── ReCTS
│       │   ├── TextOCR
│       │   └── Uber
│       └── mj_st
│           ├── data.mdb
│           └── lock.mdb
└── val
│   ├── CUTE80
│   ├── IC13_1015
│   ├── IC15_1811
│   ├── IIIT5k
│   ├── SVT
│   └── SVTP
├── test
│   ├── CUTE80
│   ├── IC13_1015
│   ├── IC13_857
│   ├── IC15_1811
│   ├── IIIT5k
│   ├── SVT
│   └── SVTP

Data Generation

We generated MJST+(60M) using TextRecognitionDataGenerator and SynthText. For specific generation methods, please refer to GenData.md

Acknowledgement

We sincerely thank all the constructors of the 20 datasets used in REBU-Syn.

  • PARSeq: the dataset we built upon. Thanks for their wonderful work!
  • Union14M: organizes a challenging STR training data. Don't forget to check this great open-source work if you don't know it before!