Add .extract_table(...) logic to avoid assigning characters to multiple cells #1013
(See #1006 (reply in thread) for context.)
Currently, Page.extract_table(...) uses .crop(...) internally for each cell, capturing all characters that overlap at all with the cell's bounding box. So if a character straddles multiple cells in a table, its text would appear in multiple cells of the .extract_table(...) output. This seems undesirable. Perhaps the method should either:
(a) "keep track" of characters it has already assigned to a cell (and then not use them again in another cell), or
(b) only assign characters to cells if they are more than 50% inside? (And then what to do if a character is perfectly divided 50/50 between two cells? Or 30/30/40 across three cells?)
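To make the trade-off concrete, here is a rough sketch of one way (a) and (b) could be combined: give each character to the single cell it overlaps most, and break exact ties by cell order. This is not pdfplumber's implementation. The function names (overlap_area, naive_cell_texts, exclusive_cell_texts) are hypothetical, cells are assumed to be (x0, top, x1, bottom) tuples, and characters are assumed to be plain dicts carrying the text, x0, top, x1 and bottom keys.

```python
def overlap_area(char, cell):
    """Area of the intersection between a char's box and a cell's box."""
    x0, top, x1, bottom = cell
    width = min(char["x1"], x1) - max(char["x0"], x0)
    height = min(char["bottom"], bottom) - max(char["top"], top)
    return max(width, 0) * max(height, 0)


def naive_cell_texts(chars, cells):
    """Overlap-based behaviour as described above: any overlap counts."""
    return [
        "".join(c["text"] for c in chars if overlap_area(c, cell) > 0)
        for cell in cells
    ]


def exclusive_cell_texts(chars, cells):
    """Assign each char to at most one cell: the one it overlaps most.

    Exact ties (e.g. a 50/50 split) go to the earliest cell in `cells`,
    which is an arbitrary but deterministic choice made for this sketch.
    """
    buckets = [[] for _ in cells]
    for char in chars:
        areas = [overlap_area(char, cell) for cell in cells]
        if not areas or max(areas) <= 0:
            continue  # char lies outside every cell
        best = areas.index(max(areas))  # index() returns the first maximum
        buckets[best].append(char["text"])
    return ["".join(bucket) for bucket in buckets]


# Two side-by-side cells sharing the vertical edge at x = 10.
cells = [(0, 0, 10, 10), (10, 0, 20, 10)]
chars = [
    {"text": "A", "x0": 2, "top": 2, "x1": 8, "bottom": 8},    # inside cell 0
    {"text": "B", "x0": 7, "top": 2, "x1": 13, "bottom": 8},   # split 50/50
    {"text": "C", "x0": 9, "top": 2, "x1": 15, "bottom": 8},   # mostly cell 1
    {"text": "D", "x0": 12, "top": 2, "x1": 18, "bottom": 8},  # inside cell 1
]

print(naive_cell_texts(chars, cells))      # B and C show up in both cells
print(exclusive_cell_texts(chars, cells))  # every char appears exactly once
```

Picking the largest overlap sidesteps the 30/30/40 question, since no fixed threshold is needed, but the 50/50 case still needs some tie-break rule. Cell order is only one option; comparing the character's midpoint to the cell edges would be another.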
Thoughts / suggestions / other approaches?