
Roll-Face/table_extraction


Architecture

  1. Table detection: Detectron2
  2. Table line detection: U-Net architecture plus rule-based processing
  3. OCR: EasyOCR
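
The rule-based part of item 2 is not spelled out in this README. As a minimal sketch only, assuming the U-Net produces a binary line mask (1 = line pixel), one simple rule is to keep the rows and columns that are almost entirely filled. The names `find_line_bands`, `extract_lines`, and `min_fill` are hypothetical and are not taken from this repository.

```python
from typing import List, Tuple

Mask = List[List[int]]  # binary mask, row-major, 1 = line pixel (assumed format)


def find_line_bands(profile: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Group consecutive indices whose value is >= threshold into (start, end) bands."""
    bands = []
    start = None
    for i, value in enumerate(profile):
        if value >= threshold:
            if start is None:
                start = i
        elif start is not None:
            bands.append((start, i - 1))
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands


def extract_lines(mask: Mask, min_fill: float = 0.8) -> Tuple[List[int], List[int]]:
    """Return the center positions of horizontal and vertical lines in the mask."""
    height, width = len(mask), len(mask[0])
    # Fraction of line pixels in every row and every column.
    row_fill = [sum(row) / width for row in mask]
    col_fill = [sum(mask[y][x] for y in range(height)) / height for x in range(width)]
    # A thick line spans several adjacent rows/columns; keep one center per band.
    rows = [(a + b) // 2 for a, b in find_line_bands(row_fill, min_fill)]
    cols = [(a + b) // 2 for a, b in find_line_bands(col_fill, min_fill)]
    return rows, cols
```

This projection rule only suits axis-aligned tables whose lines span the full mask; skewed scans or partial lines would need a different rule.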

Train

Prepare dataset:

The dataset is private. You can find public tabular datasets online, or label your own data with labelme (wkentaro/labelme on github.com), a Python tool for image polygonal annotation (polygon, rectangle, circle, line, point, and image-level flag annotation).
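
labelme saves one JSON file per image, with a `shapes` list whose entries carry a `label` and a list of `points`. As a rough sketch of how such a file could be read into bounding boxes, the hypothetical helper below groups boxes by label. The label name `table` in the usage comment is an assumption, and nothing here reflects how this repository actually loads its data.

```python
import json
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def load_labelme_boxes(json_text: str) -> Dict[str, List[Box]]:
    """Group the bounding box of every annotated shape by its label.

    Taking min/max over the points is valid for polygon, rectangle, and line
    shapes; circle shapes store (center, edge point) and are not handled.
    """
    annotation = json.loads(json_text)
    boxes: Dict[str, List[Box]] = {}
    for shape in annotation.get("shapes", []):
        xs = [point[0] for point in shape["points"]]
        ys = [point[1] for point in shape["points"]]
        box = (min(xs), min(ys), max(xs), max(ys))
        boxes.setdefault(shape["label"], []).append(box)
    return boxes


# Usage (hypothetical file name and label):
# with open("page_001.json", encoding="utf-8") as f:
#     table_boxes = load_labelme_boxes(f.read()).get("table", [])
```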

Reference datasets:

Training

Configure the parameters in base_config.yaml.

sh scripts/train.sh

Demo Table Line

sh scripts/infer_table_line.sh

Step 1: Table detection

Step 2: Table line

Input: [image]

Output: [image]

Table OCR

Step 1: Table detection

Step 2: Table line

Step 3: Crop the image into cells according to the detected lines

Step 4: OCR

Step 5: Save the result as a CSV/Excel file
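
Steps 3 and 5 can be illustrated with a small sketch. It assumes step 2 yields the pixel positions of the horizontal and vertical lines, and it takes the OCR step as a caller-supplied function so that no particular OCR API is implied. The names `grid_cells`, `table_to_csv`, and `read_cell` are hypothetical and do not come from this repository.

```python
import csv
import io
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def grid_cells(row_lines: List[int], col_lines: List[int]) -> List[List[Box]]:
    """Turn line positions into a row-major grid of cell boxes."""
    rows = sorted(row_lines)
    cols = sorted(col_lines)
    # Each pair of neighboring lines bounds one row or column of cells.
    return [
        [(left, top, right, bottom) for left, right in zip(cols, cols[1:])]
        for top, bottom in zip(rows, rows[1:])
    ]


def table_to_csv(cells: List[List[Box]], read_cell: Callable[[Box], str]) -> str:
    """Call `read_cell` on every box and serialize the table as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in cells:
        writer.writerow([read_cell(box) for box in row])
    return buffer.getvalue()
```

In a real run, `read_cell` would crop the box from the page image and pass the crop to the OCR model; the returned CSV text can then be written to a file such as ./results/demo.csv.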

sh scripts/infer_table_ocr.sh

Input: ./datasets/demo_examples/demo2.png

Output: ./results/demo.csv

Docker

docker run --name table_extraction nam157/table_extraction:v1.0.0

References

  1. nanonets-blog
  2. table-detection-structure-recognition
  3. table-transformer
  4. TableNet: Deep Learning Model for End-to-end Table Detection and Tabular Data Extraction from Scanned Document Images