Commit
Added example for table extraction, and enabled multi-page table handling pipeline (#1467)

* Added example for table extraction, and enabled multi-page table handling pipeline.

Signed-off-by: Ye, Xinyu <xinyu.ye@intel.com>
1 parent 5f0430f, commit db9e6fb
Showing 18 changed files with 4,371 additions and 0 deletions.
37 changes: 37 additions & 0 deletions
...ension_for_transformers/neural_chat/examples/plugins/table_extraction/README.md
@@ -0,0 +1,37 @@
# Extract Tables From PDF File

We use [table-transformer](https://github.com/microsoft/table-transformer) for table extraction, and adapt the multi-page table solution from the [Amazon Textract response parser](https://github.com/aws-samples/amazon-textract-response-parser) library to table-transformer so that tables spanning multiple pages can be handled.

## Prepare Environment

```
pip install -r requirements.txt
```
Note that handling languages other than English requires an additional Tesseract language pack. For example, Simplified Chinese needs the package below.
```
apt-get install tesseract-ocr-chi-sim
```
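
As a rough way to see which language packs are still missing, the sketch below parses the text printed by `tesseract --list-langs`. The helper names `missing_languages` and `apt_package` and the sample output string are illustrative assumptions, not part of this example's scripts, and the package-name mapping is only expected to hold for common language codes such as `chi_sim`.

```python
def missing_languages(required, list_langs_output):
    """Return the required language codes absent from `tesseract --list-langs` output."""
    lines = [line.strip() for line in list_langs_output.splitlines()]
    # The first line is assumed to be a header such as 'List of available languages (2):'.
    installed = {line for line in lines[1:] if line}
    return [lang for lang in required if lang not in installed]


def apt_package(lang):
    """Map a Tesseract language code to its Debian/Ubuntu package name (assumed convention)."""
    return "tesseract-ocr-" + lang.replace("_", "-")


# Illustrative output only; run `tesseract --list-langs` to get the real list.
sample_output = "List of available languages (2):\neng\nosd\n"
for lang in missing_languages(["eng", "chi_sim"], sample_output):
    print("apt-get install", apt_package(lang))
```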

## Prepare Models

```
git clone https://huggingface.co/bsmock/tatr-pubtables1m-v1.0
git clone https://huggingface.co/bsmock/TATR-v1.1-All
```

## Usage

### Run the table extraction script
For a local PDF file, run the command below:
```
python extract_tables.py --pdf_file /path/to/pdf_file --structure_model_path TATR-v1.1-All/TATR-v1.1-All-msft.pth --detection_model_path tatr-pubtables1m-v1.0/pubtables1m_detection_detr_r18.pth -c
```

For a PDF file at a URL, run the command below:
```
python extract_tables.py --pdf_file url_of_pdf --structure_model_path TATR-v1.1-All/TATR-v1.1-All-msft.pth --detection_model_path tatr-pubtables1m-v1.0/pubtables1m_detection_detr_r18.pth -c
```
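
The two commands differ only in the value passed to `--pdf_file`. How `extract_tables.py` tells the two cases apart is not shown here; one plausible approach, sketched below with the hypothetical helpers `is_url` and `describe_source`, is to check the scheme of the argument.

```python
from urllib.parse import urlparse


def is_url(pdf_file):
    """Return True when the argument looks like an http(s) URL rather than a local path."""
    return urlparse(pdf_file).scheme in ("http", "https")


def describe_source(pdf_file):
    # Hypothetical dispatch: a URL would be downloaded first, a path opened directly.
    return "download" if is_url(pdf_file) else "open locally"


print(describe_source("/path/to/pdf_file"))
print(describe_source("https://example.com/report.pdf"))
```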

## Acknowledgements

This example is mostly adapted from [table-transformer](https://github.com/microsoft/table-transformer). We thank the authors for their great work!
46 changes: 46 additions & 0 deletions
..._transformers/neural_chat/examples/plugins/table_extraction/configs/detection_config.json
@@ -0,0 +1,46 @@
{
    "lr":5e-5,
    "lr_backbone":1e-5,
    "batch_size":2,
    "weight_decay":1e-4,
    "epochs":20,
    "lr_drop":1,
    "lr_gamma":0.9,
    "clip_max_norm":0.1,

    "backbone":"resnet18",
    "num_classes":2,
    "dilation":false,
    "position_embedding":"sine",
    "emphasized_weights":{},

    "enc_layers":6,
    "dec_layers":6,
    "dim_feedforward":2048,
    "hidden_dim":256,
    "dropout":0.1,
    "nheads":8,
    "num_queries":15,
    "pre_norm":true,

    "masks":false,

    "aux_loss":false,

    "mask_loss_coef":1,
    "dice_loss_coef":1,
    "ce_loss_coef":1,
    "bbox_loss_coef":5,
    "giou_loss_coef":2,
    "eos_coef":0.4,

    "set_cost_class":1,
    "set_cost_bbox":5,
    "set_cost_giou":2,

    "device":"cuda",
    "seed":42,
    "start_epoch":0,
    "num_workers":1
}
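
The sketch below shows one way such a JSON file can be read into an attribute-style object, and works out the learning rate implied by `lr`, `lr_drop`, and `lr_gamma` under the assumption that they drive a step decay (multiply by `lr_gamma` every `lr_drop` epochs). Whether the training code uses exactly this schedule is an assumption; the JSON string is a subset of the file above, and `load_config` and `lr_at_epoch` are hypothetical names.

```python
import json
from types import SimpleNamespace

# Subset of detection_config.json, embedded so the sketch is self-contained.
CONFIG_TEXT = """
{
    "lr": 5e-5,
    "lr_backbone": 1e-5,
    "epochs": 20,
    "lr_drop": 1,
    "lr_gamma": 0.9,
    "num_classes": 2,
    "num_queries": 15
}
"""


def load_config(text):
    """Parse the JSON text and expose its keys as attributes."""
    return SimpleNamespace(**json.loads(text))


def lr_at_epoch(cfg, epoch):
    """Learning rate at a given epoch, assuming step decay every `lr_drop` epochs."""
    return cfg.lr * cfg.lr_gamma ** (epoch // cfg.lr_drop)


cfg = load_config(CONFIG_TEXT)
for epoch in (0, 1, cfg.epochs - 1):
    print(epoch, lr_at_epoch(cfg, epoch))
```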
45 changes: 45 additions & 0 deletions
..._transformers/neural_chat/examples/plugins/table_extraction/configs/structure_config.json
@@ -0,0 +1,45 @@
{
    "lr":5e-5,
    "lr_backbone":1e-5,
    "batch_size":2,
    "weight_decay":1e-4,
    "epochs":20,
    "lr_drop":1,
    "lr_gamma":0.9,
    "clip_max_norm":0.1,

    "backbone":"resnet18",
    "num_classes":6,
    "dilation":false,
    "position_embedding":"sine",
    "emphasized_weights":{},

    "enc_layers":6,
    "dec_layers":6,
    "dim_feedforward":2048,
    "hidden_dim":256,
    "dropout":0.1,
    "nheads":8,
    "num_queries":125,
    "pre_norm":true,

    "masks":false,

    "aux_loss":false,

    "mask_loss_coef":1,
    "dice_loss_coef":1,
    "ce_loss_coef":1,
    "bbox_loss_coef":5,
    "giou_loss_coef":2,
    "eos_coef":0.4,

    "set_cost_class":1,
    "set_cost_bbox":5,
    "set_cost_giou":2,

    "device":"cuda",
    "seed":42,
    "start_epoch":0,
    "num_workers":1
}
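
Comparing this file with `detection_config.json` key by key shows that the two differ only in `num_classes` and `num_queries`. The sketch below illustrates that comparison on a subset of keys copied from the two files; `diff_configs` is a hypothetical helper, not part of the commit.

```python
def diff_configs(a, b):
    """Return {key: (a_value, b_value)} for keys whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Subset of keys copied from the two config files.
detection = {"backbone": "resnet18", "num_classes": 2, "num_queries": 15, "hidden_dim": 256}
structure = {"backbone": "resnet18", "num_classes": 6, "num_queries": 125, "hidden_dim": 256}

print(diff_configs(detection, structure))
```

In practice the two dictionaries would come from `json.load` on the full files rather than from hand-copied subsets.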