Added example for table extraction, and enabled multi-page table handling pipeline (#1467)

* Added example for table extraction, and enabled multi-page table handling pipeline.

Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>
XinyuYe-Intel committed Apr 10, 2024
1 parent 5f0430f commit db9e6fb
Showing 18 changed files with 4,371 additions and 0 deletions.
@@ -0,0 +1,37 @@
# Extract Tables From PDF File

This example uses [table-transformer](https://github.com/microsoft/table-transformer) for table extraction. To handle tables that span multiple pages, it adapts the multi-page table solution from the [Amazon Textract response parser](https://github.com/aws-samples/amazon-textract-response-parser) library to table-transformer.

## Prepare Environment

```
pip install -r requirements.txt
```
Note that Tesseract needs an additional language pack to handle languages other than English. For example, Simplified Chinese requires the following package:
```
apt-get install tesseract-ocr-chi-sim
```
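
To confirm that the language pack is visible to Tesseract before running the extraction, a small check along these lines may help. It is only a sketch: it relies on the `tesseract --list-langs` command, and the helper names `parse_langs` and `installed_langs` are made up for this illustration and are not part of the example's scripts.

```python
import shutil
import subprocess


def parse_langs(list_langs_output: str) -> set[str]:
    """Parse the text printed by `tesseract --list-langs` into language codes."""
    lines = [line.strip() for line in list_langs_output.splitlines()]
    # Skip blank lines and the header line, which ends with a colon.
    return {line for line in lines if line and not line.endswith(":")}


def installed_langs() -> set[str]:
    """Return installed language codes, or an empty set if tesseract is absent."""
    if shutil.which("tesseract") is None:
        return set()
    result = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Read stdout, falling back to stderr in case the list was printed there.
    return parse_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    # The tesseract-ocr-chi-sim package provides the `chi_sim` language code.
    print("chi_sim installed:", "chi_sim" in installed_langs())
```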

## Prepare Models

```
git clone https://huggingface.co/bsmock/tatr-pubtables1m-v1.0
git clone https://huggingface.co/bsmock/TATR-v1.1-All
```
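
If the weights are stored with Git LFS, as is common for Hugging Face model repositories, cloning without `git-lfs` installed leaves small text pointer files in place of the real checkpoints. The sketch below checks for that case. The two paths are taken from the usage commands in this README; the function name `checkpoint_status` is hypothetical.

```python
from pathlib import Path

# First bytes of a Git LFS pointer file, left behind when git-lfs is not installed.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"

# Checkpoint paths as used in the usage commands of this README.
CHECKPOINTS = [
    "TATR-v1.1-All/TATR-v1.1-All-msft.pth",
    "tatr-pubtables1m-v1.0/pubtables1m_detection_detr_r18.pth",
]


def checkpoint_status(path):
    """Return 'missing', 'lfs-pointer', or 'ok' for a checkpoint path."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    with p.open("rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    return "lfs-pointer" if head == LFS_POINTER_PREFIX else "ok"


if __name__ == "__main__":
    for ckpt in CHECKPOINTS:
        print(f"{ckpt}: {checkpoint_status(ckpt)}")
```

A result of `lfs-pointer` suggests installing Git LFS and running `git lfs pull` inside the cloned repository.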

## Usage

### Run the table extraction script
For a local PDF file, run the following command:
```
python extract_tables.py --pdf_file /path/to/pdf_file --structure_model_path TATR-v1.1-All/TATR-v1.1-All-msft.pth --detection_model_path tatr-pubtables1m-v1.0/pubtables1m_detection_detr_r18.pth -c
```

For a PDF file at a URL, run the following command:
```
python extract_tables.py --pdf_file url_of_pdf --structure_model_path TATR-v1.1-All/TATR-v1.1-All-msft.pth --detection_model_path tatr-pubtables1m-v1.0/pubtables1m_detection_detr_r18.pth -c
```
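
Both commands pass their input through the same `--pdf_file` argument, so the script has to tell a URL from a local path. How `extract_tables.py` does this is not shown here; the sketch below is one plausible way to do it with the standard library. The names `is_url` and `resolve_pdf` are assumptions made for illustration.

```python
import os
import urllib.request
from urllib.parse import urlparse


def is_url(pdf_file):
    """True if the argument looks like an http(s) URL rather than a local path."""
    parsed = urlparse(pdf_file)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def resolve_pdf(pdf_file, download_dir="."):
    """Return a local path for the PDF, downloading it first if it is a URL."""
    if not is_url(pdf_file):
        return pdf_file
    # Derive a file name from the URL path; fall back to a fixed name.
    name = os.path.basename(urlparse(pdf_file).path) or "downloaded.pdf"
    target = os.path.join(download_dir, name)
    urllib.request.urlretrieve(pdf_file, target)
    return target
```

Checking for an explicit `http` or `https` scheme keeps Windows paths such as `C:\docs\file.pdf` from being mistaken for URLs.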

## Acknowledgements

This example is mostly adapted from [table-transformer](https://github.com/microsoft/table-transformer). We thank its authors for their great work!
@@ -0,0 +1,46 @@
{
"lr":5e-5,
"lr_backbone":1e-5,
"batch_size":2,
"weight_decay":1e-4,
"epochs":20,
"lr_drop":1,
"lr_gamma":0.9,
"clip_max_norm":0.1,

"backbone":"resnet18",
"num_classes":2,
"dilation":false,
"position_embedding":"sine",
"emphasized_weights":{},

"enc_layers":6,
"dec_layers":6,
"dim_feedforward":2048,
"hidden_dim":256,
"dropout":0.1,
"nheads":8,
"num_queries":15,
"pre_norm":true,

"masks":false,

"aux_loss":false,

"mask_loss_coef":1,
"dice_loss_coef":1,
"ce_loss_coef":1,
"bbox_loss_coef":5,
"giou_loss_coef":2,
"eos_coef":0.4,

"set_cost_class":1,
"set_cost_bbox":5,
"set_cost_giou":2,

"device":"cuda",
"seed":42,
"start_epoch":0,
"num_workers":1
}

@@ -0,0 +1,45 @@
{
"lr":5e-5,
"lr_backbone":1e-5,
"batch_size":2,
"weight_decay":1e-4,
"epochs":20,
"lr_drop":1,
"lr_gamma":0.9,
"clip_max_norm":0.1,

"backbone":"resnet18",
"num_classes":6,
"dilation":false,
"position_embedding":"sine",
"emphasized_weights":{},

"enc_layers":6,
"dec_layers":6,
"dim_feedforward":2048,
"hidden_dim":256,
"dropout":0.1,
"nheads":8,
"num_queries":125,
"pre_norm":true,

"masks":false,

"aux_loss":false,

"mask_loss_coef":1,
"dice_loss_coef":1,
"ce_loss_coef":1,
"bbox_loss_coef":5,
"giou_loss_coef":2,
"eos_coef":0.4,

"set_cost_class":1,
"set_cost_bbox":5,
"set_cost_giou":2,

"device":"cuda",
"seed":42,
"start_epoch":0,
"num_workers":1
}
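
The two JSON configs in this commit differ only in `num_classes` (2 vs. 6) and `num_queries` (15 vs. 125), and both set `"device":"cuda"`. As a rough sketch of how such a file could be loaded and adjusted, for instance to run on a machine without a GPU, the helper below reads a config and applies keyword overrides. The name `load_config` and the file name in the usage note are hypothetical; this is not necessarily how the example's scripts consume these files.

```python
import json
from types import SimpleNamespace


def load_config(path, **overrides):
    """Load a JSON config and return it as a namespace, applying overrides."""
    with open(path, encoding="utf-8") as f:
        cfg = json.load(f)
    # Reject keys that are not in the file, to catch typos such as `devcie`.
    unknown = set(overrides) - set(cfg)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    cfg.update(overrides)
    return SimpleNamespace(**cfg)
```

For example, `load_config("structure_config.json", device="cpu")` (file name assumed) would return an object whose `device` attribute is `"cpu"` and whose other attributes, such as `num_queries`, keep the values from the file.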
