
mxsurui/doc_gcn

 
 


Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis

This repository contains code for the paper Doc-GCN: Heterogeneous Graph Convolutional Networks for Document Layout Analysis.

Siwen Luo*, Yihao Ding*, Siqu Long, Soyeon Caren Han, Josiah Poon

Dataset Preparation

This paper uses three widely used benchmark datasets: FUNSD (paper), PubLayNet (paper), and DocBank (paper). All three datasets are publicly available and can be downloaded via their official download links.

Before the features are fed into the graphs to obtain enhanced representations, some preprocessing is required to generate the multi-aspect feature representations. Please refer here for the detailed procedure. For the PubLayNet dataset, we use the Google Cloud Vision OCR tool to extract the text content, which we then feed into the pre-trained BERT to obtain textual representations.

Acquire Multi-aspect Features

We provide a tutorial showing how to generate an appropriately formatted JSON file for the FUNSD dataset, either for training or for obtaining GCN-enhanced representations from our pre-trained GCN models. The other two datasets can follow the same procedure to produce JSON files in the required format. Please refer to our paper for detailed descriptions of the node and edge representations, and check the Google Colab notebook to see how to generate them.

Graph Construction

We use GCNs to enhance the four proposed aspects of feature representation: Appearance, Density, Semantic and Syntactic. These GCNs share the same architecture but use different node and edge representations. The graphs fall into two types, distinguished by their edge representations:

Appearance and Density Graphs (Gap-distance Weighted)

The first type is gap-distance based and covers the appearance and density graphs, whose edge features are the inverse of the gap distance to the top-k nearest segments. The node features of these graphs are the visual and density features of each segment, respectively. Please refer to this ipynb notebook to see how it works on the FUNSD dataset.
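To make the edge weighting concrete, here is a minimal sketch of a gap-distance weighted adjacency matrix. It is an illustration only, not the repository's code: the function names, the `(x0, y0, x1, y1)` box format, the definition of gap distance as the shortest distance between two boxes, and the `eps` term that avoids division by zero are all assumptions. The notebook and paper give the exact definition.

```python
import math


def gap_distance(a, b):
    """Shortest distance between two axis-aligned boxes (x0, y0, x1, y1).

    Returns 0.0 when the boxes touch or overlap. (Assumed definition.)
    """
    dx = max(a[0] - b[2], b[0] - a[2], 0.0)
    dy = max(a[1] - b[3], b[1] - a[3], 0.0)
    return math.hypot(dx, dy)


def gap_weighted_adjacency(boxes, k=2, eps=1e-6):
    """Connect each segment to its k nearest segments.

    The edge weight is the inverse of the gap distance; eps is a
    hypothetical safeguard against dividing by zero for touching boxes.
    """
    n = len(boxes)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Sort every other segment by its gap distance to segment i.
        dists = sorted(
            (gap_distance(boxes[i], boxes[j]), j) for j in range(n) if j != i
        )
        for d, j in dists[:k]:
            adj[i][j] = 1.0 / (d + eps)
    return adj


# Three toy segments: two side by side, one further below.
boxes = [(0, 0, 10, 10), (12, 0, 22, 10), (0, 30, 10, 40)]
adj = gap_weighted_adjacency(boxes, k=1)
```

With `k=1`, each row keeps a single non-zero entry, so the matrix is not necessarily symmetric. Whether the graph should be symmetrised afterwards is a design choice this sketch leaves open.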

Semantic and Syntactic Graphs (Parent-Child based)

The second type is parent-child relation based (see the example on the FUNSD dataset). If two segments have a parent-child relation, the edge value is set to 1, otherwise 0. The graph construction workflow is shown in the figures below. More detailed information can be found in our paper and its Appendix.
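The parent-child rule can be sketched as a binary adjacency matrix. This is a minimal illustration under stated assumptions: the function name and the `(parent, child)` index-pair input are hypothetical, and treating the relation as undirected is an assumption rather than something the text specifies.

```python
def parent_child_adjacency(num_segments, parent_child_pairs):
    """Build a 0/1 adjacency matrix from (parent, child) index pairs.

    Edges are set in both directions (assumed undirected graph).
    """
    adj = [[0] * num_segments for _ in range(num_segments)]
    for parent, child in parent_child_pairs:
        adj[parent][child] = 1
        adj[child][parent] = 1
    return adj


# Segment 0 is the parent of segments 1 and 2; segment 3 is unrelated.
pc_adj = parent_child_adjacency(4, [(0, 1), (0, 2)])
```

Segments 1 and 2 share a parent but are not linked to each other here, since only direct parent-child pairs receive an edge value of 1.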

Classifier

After obtaining the enhanced feature representations, we feed them into our model for training and testing. We also provide an ipynb notebook to show how it works on the FUNSD dataset. Please refer to the DocGCN paper for a more detailed description of our classifier.

Evaluation Results

In extensive experiments, DocGCN achieves state-of-the-art (SoTA) performance compared with the baselines. Here we show only the overall performance on the three benchmark datasets; more result analysis and ablation studies can be found in Section 5 of our paper.

The overall performance of DocGCN compared with the baselines on the test sets, in Precision (%), Recall (%) and F1 score (%). The second-best result is underlined. DocGCN achieves the highest performance across all benchmark datasets and evaluation metrics.
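For readers unfamiliar with the three metrics, the sketch below computes precision, recall and F1 (in percent) for a single class from lists of true and predicted labels. The function name and the toy labels are illustrative assumptions. It does not reproduce the paper's evaluation script or its averaging scheme.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Return (precision, recall, F1) in percent for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return 100 * precision, 100 * recall, 100 * f1


# Toy example: one "list" segment is misclassified as "text".
y_true = ["text", "list", "text", "title"]
y_pred = ["text", "text", "text", "title"]
p, r, f1 = precision_recall_f1(y_true, y_pred, "text")
```

In this toy case the "text" class has two true positives and one false positive, so precision is about 66.7%, recall is 100% and F1 is 80%.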

Case Study

We visualized the results predicted by DocGCN and compared them with the top-3 baseline models for each dataset. Here is an example from the PubLayNet dataset. The figure below shows that RoBERTa and Faster-RCNN wrongly recognised a Text as a List, whereas our Doc-GCN accurately recognised all components. Semantic or visual information alone makes it hard to distinguish List from Text, which indicates the importance of capturing multi-aspect features and the structural relationships between layout components. More case studies can be found in Appendix B of the DocGCN paper.
