A curated list of resources dedicated to document layout analysis
- *CODE means official code; CODE means unofficial code.
Conf. | Date | Title | Highlight | code/cite |
---|---|---|---|---|
ICDAR2021 | 2021/5/13 | VSR: A Unified Framework for Document Layout Analysis combining Vision, Semantics and Relations | V,S,R | *CODE |
KDD2020 | 2020/6/16 | LayoutLM: Pre-training of Text and Layout for Document Image Understanding | multimodal/pretrain | *CODE |
ACL2021 | 2022/1/10 | LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding | multimodal/pretrain | *CODE |

Conf. | Date | Title | Highlight | code/cite |
---|---|---|---|---|
KDD2018 | 2018/5/24 | Corpus Conversion Service: A Machine Learning Platform to Ingest Documents at Scale | pipeline | none |

Conf. | Date | Title | Highlight | code/cite |
---|---|---|---|---|
ICDAR2017 | 2017/11/9 | DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images | table det | 193 |

Conf. | Date | Title | Highlight | code/cite |
---|---|---|---|---|
ITC-irst Technical Report | 1998 | Geometric layout analysis techniques for document image understanding: a review | traditional | 216 |
IWDAS | 2002/8/19 | Two geometric algorithms for layout analysis | traditional | 231 |
PSDIUT | 2003 | High performance document layout analysis | traditional | 139 |

Dataset | Description | dataset link |
---|---|---|
PubLayNet | PubLayNet is a large dataset of document images whose layout is annotated with both bounding boxes and polygonal segmentations. The annotations are generated automatically by matching the PDF format and the XML format. The dataset is comparable in size to established computer vision datasets: it contains over 360 thousand document images in which typical document layout elements are annotated. | PubLayNet |
DocBank | DocBank is a new large-scale dataset constructed using a weak supervision approach. It enables models to integrate both textual and layout information for downstream tasks. The current DocBank dataset includes 500K document pages in total: 400K for training, 50K for validation, and 50K for testing. | DocBank |
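
As a rough illustration of annotations that carry both bounding boxes and polygonal segmentations, here is a minimal sketch. It assumes a COCO-style JSON layout (`categories`, `annotations`, `bbox` as `[x, y, width, height]`, `segmentation` as flat `[x1, y1, x2, y2, ...]` polygons). The `sample` data and the helper names `count_by_category` and `polygon_to_bbox` are hypothetical and are not taken from either dataset.

```python
from collections import Counter

# Hypothetical hand-written sample (assumption: COCO-style layout, not real dataset content).
sample = {
    "categories": [
        {"id": 1, "name": "text"},
        {"id": 2, "name": "title"},
        {"id": 3, "name": "table"},
    ],
    "annotations": [
        {"image_id": 0, "category_id": 1, "bbox": [50, 100, 400, 60],
         "segmentation": [[50, 100, 450, 100, 450, 160, 50, 160]]},
        {"image_id": 0, "category_id": 3, "bbox": [50, 200, 400, 120],
         "segmentation": [[50, 200, 450, 200, 450, 320, 50, 320]]},
        {"image_id": 1, "category_id": 1, "bbox": [40, 80, 300, 50],
         "segmentation": [[40, 80, 340, 80, 340, 130, 40, 130]]},
    ],
}


def count_by_category(coco):
    """Count annotated layout elements per category name."""
    id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(id_to_name[a["category_id"]] for a in coco["annotations"])


def polygon_to_bbox(polygon):
    """Convert a flat [x1, y1, x2, y2, ...] polygon to [x, y, width, height]."""
    xs = polygon[0::2]  # even positions hold x coordinates
    ys = polygon[1::2]  # odd positions hold y coordinates
    x_min, y_min = min(xs), min(ys)
    return [x_min, y_min, max(xs) - x_min, max(ys) - y_min]


if __name__ == "__main__":
    print(count_by_category(sample))
    first = sample["annotations"][0]
    print(polygon_to_bbox(first["segmentation"][0]) == first["bbox"])
```

For a real annotation file, `sample` would be replaced by the result of `json.load` on that file, provided it follows the same layout.
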
TO DO