Adding TableNet model to extract tabular data #524

felixdittrich92 · 2021-10-04T16:56:29Z

Add a TableNet model to extract tabular data as a DataFrame from images.
(I have a ready-to-use model (.pt) trained on the Marmot dataset and need a bit of guidance on where to add it, preferably as ONNX. For self-training I can also add it in references, and the same for the dataset, but only in PyTorch (Lightning).)

After the restructuring / hocr pdfa export
@fg-mindee @charlesmindee

charlesmindee · 2021-10-06T07:52:24Z

Hi @felixdittrich92,

Thanks for bringing this to the table, it is a very interesting and useful feature.
It would be interesting to integrate such a model in doctr, however we need to think about the global architecture:
Should it be a separate model (no shared features) from our detection + recognition pipeline (which would for sure slow down the end-to-end prediction), or should it be integrated into the detection predictor to maximize feature sharing?

To answer this question, we can look at the speed of your model. Can you benchmark it on your side?

If it is fast enough, we can start by implementing it separately in a new module that runs independently from the main pipeline. We can first implement the model in PyTorch as you suggested, provide a pretrained version (.pt) in the config, and tackle the dataset/training script integration later on!
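A minimal sketch of how such a benchmark could be run is shown below. It only uses the standard library; `fake_table_segmentation` is a hypothetical stand-in for one forward pass of the table model, and the `warmup` and `runs` values are arbitrary assumptions.

```python
import statistics
import time
from typing import Any, Callable, Dict


def benchmark(fn: Callable[[], Any], warmup: int = 2, runs: int = 10) -> Dict[str, float]:
    """Time repeated calls of `fn` and return summary statistics in seconds."""
    # Warm-up calls are discarded: the first calls often pay one-off costs
    # (lazy initialisation, memory allocation, caches).
    for _ in range(warmup):
        fn()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)

    return {
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
    }


def fake_table_segmentation() -> int:
    """Hypothetical stand-in for one forward pass of the table model."""
    return sum(i * i for i in range(50_000))


if __name__ == "__main__":
    stats = benchmark(fake_table_segmentation)
    print({name: round(value, 4) for name, value in stats.items()})
```

In practice `fn` would wrap the real inference call (for example an ONNX Runtime session's `run`), with OCR excluded so that only the table segmentation is timed.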

Have a nice day ! 😄

felixdittrich92 · 2021-10-06T09:51:39Z

@charlesmindee
Yes, I will do that, probably later today :)
I wish you the same.
I have attached the TensorBoard logs if you want to take a look:
version_0.zip

felixdittrich92 · 2021-10-06T13:30:50Z

@charlesmindee
On an 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz,
the ONNX model takes ~3-3.5 s without (Tesseract) OCR (tomorrow I can also test the pure .pt model if you want!?)
(I think optimizations are still possible, such as smaller input sizes or model pruning.)
Sample output:

                                                0     1      2      3     4     5     6
0   Protein-ligand Complex #rotable bonds stoDock  Dock  FlexX    ICM  GOLD   T10   120
1                                     3pib 3 0.80  0.59     Mu    0.4   109  0.56   054
2                                      ing 2 0.62  0.86    108    O71   189  0.70  0.69
3                                      Lin) 3 121   156    173   2.17   190   142  1.50
4                                      ink 4 1.69   187   1.70   2.53   308  1.16    14
5                                      ini 5 2.61  5.26   2.73   3.40   493  2.22  2.22
6                                     Lipp 7 1.80  3.25    195     un   233  2.43   253
7                                    Ipph "1 5.14  3.91   3.27    144    43  4.00  0.53
8                                     Ipht 1 2.09  2.39   4.68    123    42   120  1.20
9                                     Iphg 5 3.52   537    487   0.46   420   107   108
10                                    2epp 3 3.40  2.48    04d   2.53   349  3.26  3.27
11                                    Inse 2 1.40  4.86   6.00    180   102   147  1.40
12                                    Insd n 1.20   451    156   1.04   096   18s   18s
13                                   Innb nl 0.92   451   0.92  1.08,   034  1.67  3.97
14                                    lebx 5 1.33  3.13    132   0.82   187  0.62  0.62
15                                    Bepa 8 2.22  6.48    151     on    87  2.22  2.22
16                                    Gepa 16 830   830   9.83   1.60   496  4.00  4.00
17                                    labe 4 0.16   187    OSS    036   ois  0.56  0.56
18                                    labf 5 0.48  3.25   0.76   0.61   030  0.68  0.70
19                                    Sabp 6 0.48  3.89   4.68    oss   030  0.48   O51
20                                    letr 15 461  6.66   7.26   0.87   $90  1.09  1.09
21                                    lets B 5.06  3.93      2   6.22   230   197   197
22                                     lett n 812   133   6.24  0.98,   130  0.82  0.82
23                                    3tmn 10 4si  7.09    530    136   396  3.65  3.65
24                                     Stln 4 534   139    633    142   160   421   421
25                                    ima 20 8.72   778    451   2.60   ssa   221   224
26                                    apt 30 1.89  8.06   5.95   0.88   882  5.72  4.79
27                                   lapu 29 9.10   758    843   2.02  1070   132   132
28                                   2itb 1s 3.09   143   8.94   1.04    26  2.09  5.19
29                                     teil 6 581  2.78   3.52   2.00    04  1.86  1.86
30                                      lok 5 854  5.65    422   3.03   385  2.84  2.84
31                                    Lenx B 10.9   735    683   2.09   632  6.20  6.20
32

What do you think ?
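For readers wondering how OCR'd words can end up in a grid like the sample output above, here is a rough sketch. It is not the code used in this issue: the word format `(text, x_center, y_center)`, the `column_spans` (imagined as x-intervals derived from a column mask), and the `y_tol` tolerance are all hypothetical.

```python
from typing import List, Sequence, Tuple

Word = Tuple[str, float, float]  # (text, x_center, y_center), hypothetical format
Span = Tuple[float, float]       # (left, right) x-interval of one column


def group_rows(words: Sequence[Word], y_tol: float) -> List[List[Word]]:
    """Cluster words into rows: a word joins the current row when its
    y-center is within `y_tol` of the first word of that row."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w[2]):
        if rows and abs(word[2] - rows[-1][0][2]) <= y_tol:
            rows[-1].append(word)
        else:
            rows.append([word])
    return rows


def to_grid(words: Sequence[Word], column_spans: Sequence[Span], y_tol: float = 5.0) -> List[List[str]]:
    """Assign every word to a (row, column) cell and join the words per cell."""
    grid: List[List[str]] = []
    for row in group_rows(words, y_tol):
        cells: List[List[str]] = [[] for _ in column_spans]
        # Left-to-right order keeps multi-word cells readable.
        for text, x, _ in sorted(row, key=lambda w: w[1]):
            for idx, (left, right) in enumerate(column_spans):
                if left <= x < right:
                    cells[idx].append(text)
                    break
            # Words outside every column span are silently dropped.
        grid.append([" ".join(cell) for cell in cells])
    return grid


if __name__ == "__main__":
    words = [
        ("name", 30, 11), ("score", 120, 10),
        ("a", 30, 31), ("0.59", 120, 30),
        ("b", 30, 50), ("0.86", 120, 52),
    ]
    print(to_grid(words, column_spans=[(0, 80), (80, 160)]))
```

The resulting list of rows can be passed to `pandas.DataFrame(...)`, which assigns integer column labels when no header is given.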

charlesmindee · 2021-10-12T09:49:07Z

Hi @felixdittrich92,

Thanks for the benchmark. Does the ONNX model that takes 3 s to run include the OCR task as well? (I understand that it doesn't include Tesseract, but is there any other module apart from the raw TableNet?)
If so, we should benchmark the table detection part alone. If it is only the table detection module, it seems quite slow (we are aiming at ~1 s inference per page for our end-to-end pipeline on CPU, maybe more if the document is large), and we should see how we can optimize that.
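As a back-of-the-envelope sketch for the smaller-input-size idea mentioned earlier in the thread: if one assumes that the cost of a fully convolutional network scales with the number of input pixels (an assumption, not a measurement of this model), the measured 3-3.5 s would need each image side scaled by roughly 0.53 to 0.58 to approach the ~1 s target. Real latency rarely scales this cleanly, so this is only a first estimate.

```python
import math


def estimated_latency(base_latency_s: float, side_scale: float) -> float:
    """Estimate latency after scaling both image sides by `side_scale`,
    assuming cost is proportional to the number of pixels."""
    return base_latency_s * side_scale ** 2


def side_scale_for_target(base_latency_s: float, target_latency_s: float) -> float:
    """Side scale needed to reach `target_latency_s` under the same assumption."""
    return math.sqrt(target_latency_s / base_latency_s)


if __name__ == "__main__":
    for measured in (3.0, 3.5):  # latency range reported in this thread
        scale = side_scale_for_target(measured, target_latency_s=1.0)
        print(f"{measured:.1f} s -> scale each side by ~{scale:.2f}")
```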

Have a nice day! 😄

felixdittrich92 · 2021-10-12T13:53:08Z

@charlesmindee
Yes, currently the pure table segmentation needs ~3 s. That is why I mentioned model pruning and a smaller input size; a teacher/student experiment or something similar could also help to optimize it.
I currently have an internal problem to take care of, so I probably won't get to it in the near future (same as with the reorganization problem #512). However, if you want, I can send you the dataset and the training scripts!?

I wish you the same

charlesmindee · 2021-10-13T10:34:24Z

Hi @felixdittrich92,

It is absolutely not a problem if we don't take care of this in the near future. It would indeed be great for us if you could share the dataset/training scripts, but don't get too wrapped up in it!

Best!

felixdittrich92 · 2021-10-13T12:05:03Z

@charlesmindee
You can download it (also my pretrained model) at Dataset_Model_Trained. Tell me if you got it :)
One thing: if you train this on a multi-GPU system, you have to set the world rank to zero before saving the model, or save it from a checkpoint after training :)
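One reading of the multi-GPU note above is that only the rank-zero process should write the model file, so that several processes do not write the same file at once. A minimal sketch of such a guard follows; it assumes the launcher exports a `RANK` environment variable (as `torchrun` does), and `save_on_rank_zero` and its `save_fn` argument are hypothetical names, not part of the shared training scripts.

```python
import os
from typing import Callable, Mapping, Optional


def is_rank_zero(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True for the process whose global rank is 0.

    Assumes the launcher exports RANK; a missing variable is treated as
    single-process training.
    """
    env = os.environ if env is None else env
    return int(env.get("RANK", "0")) == 0


def save_on_rank_zero(save_fn: Callable[[], None],
                      env: Optional[Mapping[str, str]] = None) -> bool:
    """Call `save_fn` only on rank 0 and report whether it was called."""
    if is_rank_zero(env):
        save_fn()
        return True
    return False


if __name__ == "__main__":
    # `save_fn` would normally write the trained weights to disk.
    saved = save_on_rank_zero(lambda: print("saving model..."))
    print("saved on this process:", saved)
```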

felixdittrich92 · 2024-05-22T13:58:51Z

Topic for contrib module

charlesmindee self-assigned this Oct 5, 2021

charlesmindee added the type: enhancement and module: models labels Oct 5, 2021

charlesmindee added the help wanted label Oct 6, 2021

charlesmindee mentioned this issue Oct 14, 2021

mismatch in sequence of words in result.export() #528

Closed

fg-mindee added this to the 1.0.0 milestone Dec 10, 2021

felixdittrich92 modified the milestones: 1.0.0, 2.0.0 Feb 9, 2024
