PyTorch implementation of the BMVC 2022 paper *Masked Vision-Language Transformers for Scene Text Recognition*.
- The code was tested with PyTorch==1.12.0 and timm==0.3.2.
- Other requirements: lmdb, pillow, torchvision, tensorboard
- Training datasets
  - MJ and ST (synthetic), plus real labeled and unlabeled datasets
- Evaluation datasets
  - IIIT, SVT, IC13, IC15, SVTP, and CUTE
We use the LMDB versions of MJ, ST, and the evaluation datasets, downloaded from ABINet. The real datasets can be downloaded from STR-Fewer-Labels; only their training splits are needed.
Create the data directory:
```
$ cd MVLT
$ mkdir data
```
The structure of the `data` directory is:
```
data
├── training
│   ├── MJ
│   │   ├── MJ_test
│   │   ├── MJ_train
│   │   └── MJ_valid
│   ├── ST
│   ├── RealLabel
│   │   ├── 1.SVT
│   │   ├── 2.IIIT
│   │   ├── 3.IC13
│   │   ├── 4.IC15
│   │   ├── 5.COCO
│   │   ├── 6.RCTW17
│   │   ├── 7.Uber
│   │   ├── 8.ArT
│   │   ├── 9.LSVT
│   │   ├── 10.MLT19
│   │   └── 11.ReCTS
│   └── RealUnlabel
│       ├── U1.Book32
│       ├── U2.TextVQA
│       └── U3.STVQA
└── evaluation
    ├── CUTE80
    ├── IC13_857
    ├── IC15_1811
    ├── IIIT5k_3000
    ├── SVT
    └── SVTP
```
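One way to create this skeleton is with a small helper (a sketch using only the standard library; the function name is ours, and the directory names are taken from the tree above):

```python
from pathlib import Path


def make_data_dirs(root="data"):
    """Create the data directory skeleton expected by the training scripts."""
    training = (
        ["MJ/MJ_test", "MJ/MJ_train", "MJ/MJ_valid", "ST"]
        + [f"RealLabel/{d}" for d in [
            "1.SVT", "2.IIIT", "3.IC13", "4.IC15", "5.COCO", "6.RCTW17",
            "7.Uber", "8.ArT", "9.LSVT", "10.MLT19", "11.ReCTS"]]
        + [f"RealUnlabel/{d}" for d in ["U1.Book32", "U2.TextVQA", "U3.STVQA"]]
    )
    evaluation = ["CUTE80", "IC13_857", "IC15_1811", "IIIT5k_3000", "SVT", "SVTP"]
    for d in training:
        Path(root, "training", d).mkdir(parents=True, exist_ok=True)
    for d in evaluation:
        Path(root, "evaluation", d).mkdir(parents=True, exist_ok=True)
```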
You can get all the models from BaiduNetdisk (password: 409r).
- Pre-train MVLT (using only synthetic data):
```
bash scripts/run_mvlt_pretrain.sh
```
- Pre-train MVLT* (using additional unlabeled real data):
```
bash scripts/run_mvlt_pretrain_ur.sh
```
- Fine-tune:
```
bash scripts/run_mvlt_finetune.sh OUTPUT_DIR_PATH/checkpoint-xxx.pth
```
- Evaluate:
```
bash scripts/run_mvlt_test.sh OUTPUT_DIR_PATH/checkpoint-best.pth
```
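Word accuracy on these benchmarks is commonly computed case-insensitively over alphanumeric characters only. A minimal sketch of that metric (the helper name is ours, not part of this repo):

```python
import re


def word_accuracy(preds, labels):
    """Case-insensitive word accuracy, ignoring non-alphanumeric characters
    (the common scene-text-recognition evaluation protocol)."""
    def norm(s):
        return re.sub(r"[^0-9a-z]", "", s.lower())

    correct = sum(norm(p) == norm(g) for p, g in zip(preds, labels))
    return correct / len(labels)
```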
Our implementation is based on MAE, ABINet, and deep-text-recognition-benchmark.
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.