Add .extract_table(...) logic to avoid assigning characters to multiple cells #1013

Closed
jsvine opened this issue Oct 13, 2023 · 1 comment
jsvine (Owner) commented Oct 13, 2023

(See #1006 (reply in thread) for context.)

Currently, Page.extract_table(...) uses .crop(...) internally for each cell, capturing all characters that overlap at all with the cell's bounding box. So if a character straddles multiple cells in a table, its text would appear in multiple cells of the .extract_table(...) output. This seems undesirable. Perhaps the method should either:

(a) "keep track" of characters it has already assigned to a cell (and then not use them again in another cell), or

(b) only assign characters to cells if they are more than 50% inside? (And then what to do if a character is perfectly divided 50/50 between two cells? Or 30/30/40 across three cells?)
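
To make the tie cases in option (b) concrete, here is a minimal sketch that computes what fraction of a character's bounding box falls inside each cell. The helper name `overlap_fraction` and all coordinates are hypothetical, chosen only for illustration; bounding boxes are assumed to follow the `(x0, top, x1, bottom)` order.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def overlap_fraction(char_bbox: BBox, cell_bbox: BBox) -> float:
    """Fraction of the character's area that lies inside the cell (hypothetical helper)."""
    cx0, ctop, cx1, cbottom = char_bbox
    x0, top, x1, bottom = cell_bbox
    # Width and height of the intersection rectangle, clamped at zero when disjoint.
    width = max(0.0, min(cx1, x1) - max(cx0, x0))
    height = max(0.0, min(cbottom, bottom) - max(ctop, top))
    char_area = (cx1 - cx0) * (cbottom - ctop)
    return (width * height) / char_area if char_area else 0.0


# Two adjacent cells sharing the edge x = 10, and a character centered on that edge.
left_cell = (0.0, 0.0, 10.0, 10.0)
right_cell = (10.0, 0.0, 20.0, 10.0)
char = (8.0, 2.0, 12.0, 8.0)

fractions = [overlap_fraction(char, cell) for cell in (left_cell, right_cell)]
# A strict "more than 50%" threshold assigns this character to neither cell.
assigned = [fraction > 0.5 for fraction in fractions]
```

With a strict threshold, a character split evenly across a border is dropped entirely, so an approach along these lines would need an explicit tie-breaking rule.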

Thoughts / suggestions / other approaches?

jsvine self-assigned this Oct 13, 2023
jsvine (Owner, Author) commented Oct 14, 2023

Whoops, my assessment was incorrect here, and based on an assumption I failed to double-check in the code. The good news is we already handle this in a reasonable-seeming way:

    def extract(self, **kwargs: Any) -> List[List[Optional[str]]]:
        chars = self.page.chars
        table_arr = []

        def char_in_bbox(char: T_obj, bbox: T_bbox) -> bool:
            v_mid = (char["top"] + char["bottom"]) / 2
            h_mid = (char["x0"] + char["x1"]) / 2
            x0, top, x1, bottom = bbox
            return bool(
                (h_mid >= x0) and (h_mid < x1) and (v_mid >= top) and (v_mid < bottom)
            )
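
As a rough illustration of why this midpoint test avoids duplicates, the sketch below lifts the check out into a standalone function and applies it to characters near the border between two adjacent cells. The cell and character coordinates are made up, and plain dicts stand in for pdfplumber's char objects. Because the comparisons are half-open (`>=` on the low edge, `<` on the high edge), a midpoint that lands exactly on a shared border matches only one of the neighboring cells.

```python
from typing import Any, Dict, Tuple

BBox = Tuple[float, float, float, float]  # (x0, top, x1, bottom)


def char_in_bbox(char: Dict[str, Any], bbox: BBox) -> bool:
    # Same midpoint test as in the excerpt above, written as a standalone function.
    v_mid = (char["top"] + char["bottom"]) / 2
    h_mid = (char["x0"] + char["x1"]) / 2
    x0, top, x1, bottom = bbox
    return (x0 <= h_mid < x1) and (top <= v_mid < bottom)


# Two hypothetical cells sharing the border x = 10.
left_cell = (0.0, 0.0, 10.0, 10.0)
right_cell = (10.0, 0.0, 20.0, 10.0)

# Straddles the shared border; its horizontal midpoint (x = 10) sits exactly on it.
straddling = {"text": "A", "x0": 8.0, "x1": 12.0, "top": 2.0, "bottom": 8.0}
# Overlaps both cells, but its horizontal midpoint (x = 9.5) is in the left cell.
mostly_left = {"text": "B", "x0": 7.0, "x1": 12.0, "top": 2.0, "bottom": 8.0}

straddling_hits = [char_in_bbox(straddling, c) for c in (left_cell, right_cell)]
mostly_left_hits = [char_in_bbox(mostly_left, c) for c in (left_cell, right_cell)]
```

In both cases the character matches exactly one cell, even though its bounding box overlaps both.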

Thanks to @cmdlineluser in #1006 (reply in thread)

jsvine closed this as completed Oct 14, 2023