
Adding TableNet model to extract tabular data #524

Open
felixdittrich92 opened this issue Oct 4, 2021 · 8 comments
Assignees: charlesmindee
Labels: help wanted (Extra attention is needed), module: models (Related to doctr.models), type: enhancement (Improvement)

Comments

@felixdittrich92
Contributor

Add a TableNet model to extract tabular data as a dataframe from images.
(I have a ready-to-use model (.pt) trained on the Marmot dataset and need a bit of guidance on where to add it; preferred as ONNX. For self-training I can also add a script under references, and the same for the dataset, but only in PyTorch (Lightning).)

After the restructuring / hOCR / PDF/A export.
@fg-mindee @charlesmindee

@charlesmindee charlesmindee self-assigned this Oct 5, 2021
@charlesmindee charlesmindee added type: enhancement Improvement module: models Related to doctr.models labels Oct 5, 2021
@charlesmindee
Collaborator

Hi @felixdittrich92,

Thanks for bringing this to the table, it is a very interesting and useful feature.
It would be interesting to integrate such a model in doctr; however, we need to think about the global architecture:
should it be a separate model (no shared features) from our detection + recognition pipeline (which would certainly slow down the end-to-end prediction), or should it be integrated into the detection predictor to maximize feature sharing?

To answer this question we can look at the speed of your model; can you benchmark it on your side?

If it is fast enough, we can start by implementing it separately in a new module, and it will run independently from the main pipeline. We can first implement the model in PyTorch as you suggested and provide a pretrained version (.pt) in the config, and tackle the dataset/training-script integration later on!

Have a nice day ! 😄

@charlesmindee charlesmindee added the help wanted Extra attention is needed label Oct 6, 2021
@felixdittrich92
Contributor Author

@charlesmindee
Yes, I will do it, probably later today :)
I wish you the same.
I have attached the TensorBoard logs if you want to take a look:
version_0.zip

@felixdittrich92
Contributor Author

felixdittrich92 commented Oct 6, 2021

@charlesmindee
On an 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz,
the ONNX model takes ~3-3.5 s without (Tesseract) OCR (tomorrow I can also test the pure .pt model if you want!?).
(I think optimizations are still possible, such as smaller input sizes or model pruning.)
Sample output:

                                                0     1      2      3     4     5     6
0   Protein-ligand Complex #rotable bonds stoDock  Dock  FlexX    ICM  GOLD   T10   120
1                                     3pib 3 0.80  0.59     Mu    0.4   109  0.56   054
2                                      ing 2 0.62  0.86    108    O71   189  0.70  0.69
3                                      Lin) 3 121   156    173   2.17   190   142  1.50
4                                      ink 4 1.69   187   1.70   2.53   308  1.16    14
5                                      ini 5 2.61  5.26   2.73   3.40   493  2.22  2.22
6                                     Lipp 7 1.80  3.25    195     un   233  2.43   253
7                                    Ipph "1 5.14  3.91   3.27    144    43  4.00  0.53
8                                     Ipht 1 2.09  2.39   4.68    123    42   120  1.20
9                                     Iphg 5 3.52   537    487   0.46   420   107   108
10                                    2epp 3 3.40  2.48    04d   2.53   349  3.26  3.27
11                                    Inse 2 1.40  4.86   6.00    180   102   147  1.40
12                                    Insd n 1.20   451    156   1.04   096   18s   18s
13                                   Innb nl 0.92   451   0.92  1.08,   034  1.67  3.97
14                                    lebx 5 1.33  3.13    132   0.82   187  0.62  0.62
15                                    Bepa 8 2.22  6.48    151     on    87  2.22  2.22
16                                    Gepa 16 830   830   9.83   1.60   496  4.00  4.00
17                                    labe 4 0.16   187    OSS    036   ois  0.56  0.56
18                                    labf 5 0.48  3.25   0.76   0.61   030  0.68  0.70
19                                    Sabp 6 0.48  3.89   4.68    oss   030  0.48   O51
20                                    letr 15 461  6.66   7.26   0.87   $90  1.09  1.09
21                                    lets B 5.06  3.93      2   6.22   230   197   197
22                                     lett n 812   133   6.24  0.98,   130  0.82  0.82
23                                    3tmn 10 4si  7.09    530    136   396  3.65  3.65
24                                     Stln 4 534   139    633    142   160   421   421
25                                    ima 20 8.72   778    451   2.60   ssa   221   224
26                                    apt 30 1.89  8.06   5.95   0.88   882  5.72  4.79
27                                   lapu 29 9.10   758    843   2.02  1070   132   132
28                                   2itb 1s 3.09   143   8.94   1.04    26  2.09  5.19
29                                     teil 6 581  2.78   3.52   2.00    04  1.86  1.86
30                                      lok 5 854  5.65    422   3.03   385  2.84  2.84
31                                    Lenx B 10.9   735    683   2.09   632  6.20  6.20
32        

What do you think ?
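For context on how TableNet-style output turns into tabular structure like the sample above: the model predicts binary table and column masks, and column boundaries can be recovered from a vertical projection of the column mask. A minimal sketch (the function name and the `min_width` threshold are assumptions, not from this thread):

```python
import numpy as np

def column_spans(col_mask, min_width=2):
    """Given a binary column mask of shape (H, W), return the (start, end)
    x-spans of detected column regions via a vertical projection profile.
    Spans narrower than min_width pixels are treated as noise."""
    profile = col_mask.any(axis=0)  # True where any column pixel exists at that x
    spans, start = [], None
    for x, on in enumerate(profile):
        if on and start is None:
            start = x                      # entering a column region
        elif not on and start is not None:
            if x - start >= min_width:
                spans.append((start, x))   # leaving a column region
            start = None
    if start is not None and col_mask.shape[1] - start >= min_width:
        spans.append((start, col_mask.shape[1]))  # region touches right edge
    return spans
```

Each span can then be used to crop column images, which are passed to the OCR step and assembled into a dataframe row by row.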

@charlesmindee
Collaborator

charlesmindee commented Oct 12, 2021

Hi @felixdittrich92,

Thanks for the benchmark. Does the ONNX model that takes ~3 s to run include the OCR task as well (I understand that it doesn't include Tesseract, but is there any other module apart from the raw TableNet)?
If so, we should benchmark the table detection part alone. If it is only the table detection module, it seems quite slow (we are aiming at ~1 s inference per page for our end-to-end pipeline on CPU, maybe more if the document is large), and we should see how we can optimize that.

Have a nice day! 😄
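To isolate the table-detection part in a benchmark like the one requested above, a simple CPU timing harness can help. Here `run_model` is a hypothetical stand-in for whatever inference call is being measured (e.g. an onnxruntime session's `run`); the dummy input shape is an assumption:

```python
import time
import numpy as np

def benchmark(run_model, inputs, warmup=2, iters=5):
    """Average wall-clock seconds per call, after a few warmup runs
    so one-off initialization cost is excluded from the measurement."""
    for _ in range(warmup):
        run_model(inputs)
    t0 = time.perf_counter()
    for _ in range(iters):
        run_model(inputs)
    return (time.perf_counter() - t0) / iters

# Example with a dummy "model" (a stand-in, not the actual TableNet):
dummy_input = np.random.rand(1, 3, 256, 256).astype(np.float32)
avg_s = benchmark(lambda x: x.mean(), dummy_input)
```

Measuring the raw model call this way makes it easy to compare the segmentation step in isolation against the full pipeline.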

@felixdittrich92
Contributor Author

@charlesmindee
Yes, currently the pure table segmentation needs ~3 s; for this reason, as I wrote, model pruning, a smaller input size, a teacher/student experiment, or something else could help to optimize it.
I currently have an internal problem to take care of, so I probably won't get to it in the near future (just like with the restructuring issue #512). However, if you want, I can send you the dataset and the training scripts!?

I wish you the same

@charlesmindee
Collaborator

Hi @felixdittrich92,

It is absolutely not a problem if we don't take care of this in the near future. It would indeed be great for us if you could share the dataset/training scripts, but don't get too wrapped up in it!

Best!

@felixdittrich92
Contributor Author

felixdittrich92 commented Oct 13, 2021

@charlesmindee
You can download it (also my pretrained model) at Dataset_Model_Trained; tell me if you got it :)
One thing: if you train this on a multi-GPU system, you have to make sure the model is saved from global rank zero only, or save it after training from a checkpoint :)
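The multi-GPU caveat above boils down to writing the checkpoint from a single process. A minimal rank-zero guard might look like the sketch below (reading the `RANK` environment variable, as set by launchers like torchrun, is an assumption; the save callable is hypothetical):

```python
import os

def is_global_zero():
    """True for the process that should write checkpoints.
    Launchers such as torchrun export the global rank via RANK;
    we assume an unset variable means single-process (rank 0)."""
    return int(os.environ.get("RANK", "0")) == 0

def save_checkpoint(save_fn, path):
    """Invoke save_fn(path) only on rank zero, so DDP replicas do not
    race to write the same file. save_fn would be something like
    lambda p: torch.save(model.state_dict(), p) (hypothetical)."""
    if is_global_zero():
        save_fn(path)
        return True
    return False
```

Saving from a checkpoint after training, as suggested above, sidesteps the issue entirely since only one process reloads and re-saves the weights.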

@felixdittrich92
Contributor Author

Topic for contrib module
