Bounding boxes on characters with rotation are incorrect #454
Comments
Hi @mhrmsn, thanks for raising the issue. Also, very nicely illustrated. It would be very nice if we could add support for rotated characters! It could be a unique selling point. Not sure how to do it yet though ;) For starters, we could make sure that the bounding boxes surround the whole character. But it would be even nicer if we could add native support for rotated bounding boxes. That could also make the distinction between horizontal and vertical text superfluous. A parameter could control the amount of rotation allowed between characters of the same word.
Hi @pietermarsman, yes, support for this would be really nice. As for native rotated bounding boxes and everything that comes with them in the layout analysis, I have no good overview of which parts of pdfminer are affected or how exactly they work. One could probably keep the current two-point bounding box together with the character's rotation and then do a similar layout analysis, but in a character-based coordinate system or something like that? One could also add a separate minimum bounding box type to have both options available, or define four-point bounding boxes, etc.
Proposal: let's start simple in this issue and use the surrounding bounding box approach. That should improve the layout analysis of rotated characters. After that, and if desired, we can add more thorough support for rotated bounding boxes.
This is a cool idea. However, supporting rotated bounding boxes properly does seem very hard. Suppose all the text is rotated by 45 degrees. Now all of your character margin and line margin values need to be computed at the given angle before comparing them with the layout params to decide whether, e.g., two text lines are merged into a text box. For an arbitrary angle, this seems like reasonably complex geometry. Even if you solve that, what about the more advanced layout analysis with boxes flow? E.g. what does the dist function look like if Even so, it would be a cool feature. A lot of the code is currently implemented separately for horizontal and vertical text, but with this, both would be handled by the same code. I agree that starting simple by expanding the bounding boxes to completely contain the character is a good idea, although I would note that such bounding boxes will overlap.
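One way to picture the geometry discussed above is to rotate page coordinates into a frame aligned with the text direction, so that "along the line" and "across the line" distances can be compared with margins in the usual way. The sketch below is only an illustration under that assumption; `to_text_frame` and `offsets_in_text_frame` are hypothetical helper names, not part of pdfminer.six, and a single known text angle is assumed.

```python
import math


def to_text_frame(point, angle_deg):
    """Rotate a page-space point by -angle_deg so the text direction maps to +x."""
    t = math.radians(angle_deg)
    x, y = point
    return (x * math.cos(t) + y * math.sin(t),
            -x * math.sin(t) + y * math.cos(t))


def offsets_in_text_frame(p, q, angle_deg):
    """Return (along, across) offsets from p to q, measured in the text-aligned frame."""
    px, py = to_text_frame(p, angle_deg)
    qx, qy = to_text_frame(q, angle_deg)
    return (qx - px, qy - py)


# Two points on a 45-degree baseline: all of the distance is "along" the text,
# none of it is "across", so a character-margin style check could use `along`.
along, across = offsets_in_text_frame((0.0, 0.0), (10.0, 10.0), 45.0)
```

This only covers the case where every character shares one angle; text with mixed angles would need a per-line or per-word angle estimate first.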
Is there any way I can tell whether an LTChar is rotated now?
I'm not quite sure (I haven't checked) but I think that is encoded in
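As a rough illustration of how a rotation could be read from a PDF-style transformation matrix: for a 6-tuple `(a, b, c, d, e, f)`, a pure rotation by θ has `a = cos θ` and `b = sin θ`, so the angle can be recovered with `atan2(b, a)`. The sketch below assumes the character's matrix is available as such a tuple (in pdfminer.six this appears to be the `matrix` attribute of `LTChar`, which this thread does not confirm); `rotation_degrees` and `is_rotated` are hypothetical helper names.

```python
import math


def rotation_degrees(matrix):
    """Angle in degrees of the x-axis after applying a PDF matrix (a, b, c, d, e, f)."""
    a, b, _c, _d, _e, _f = matrix
    return math.degrees(math.atan2(b, a))


def is_rotated(matrix, tolerance_deg=0.5):
    """True if the matrix rotates text by more than tolerance_deg away from 0 degrees."""
    angle = rotation_degrees(matrix) % 360
    return min(angle, 360 - angle) > tolerance_deg


# An upright glyph scaled to 12pt, and one rotated by 45 degrees.
upright = (12, 0, 0, 12, 100, 200)
s = math.sqrt(0.5)
tilted = (s, s, -s, s, 0, 0)
```

Note that this only reports the rotation; it says nothing about whether the reported bounding box is consistent with it.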
Unfortunately, the matrix will tell you it is rotated, but since the boxes coming from pdfminer are inconsistent with the rotation, it is not worthwhile to make use of the rotation matrix.
Here is a test example, where I have rotated each successive character by a further 45 degrees and visualized the bounding boxes:
Here is the same example, but with the rotation matrix applied and a new bounding box built around the rotated polygon:
A proper solution can only come from the library itself. So far, the only hack I have come up with is to always rotate every box by exactly 90 degrees and then build a new bounding box that takes the maximum extent in every direction from BOTH the original bounding box and the box rotated by 90 degrees, with these results:
This is now mostly satisfying.
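The workaround described above could look roughly like the sketch below. It assumes boxes are `(x0, y0, x1, y1)` tuples and that the 90-degree rotation is taken about the centre of the box, which the comment does not state; `rotate_bbox_90`, `union_bbox`, and `padded_bbox` are hypothetical names, not pdfminer.six functions.

```python
def rotate_bbox_90(bbox):
    """Rotate an axis-aligned box by 90 degrees about its centre (width and height swap)."""
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w, half_h = (x1 - x0) / 2, (y1 - y0) / 2
    return (cx - half_h, cy - half_w, cx + half_h, cy + half_w)


def union_bbox(a, b):
    """Smallest axis-aligned box that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))


def padded_bbox(bbox):
    """Maximum extent of the original box and its 90-degree rotated copy."""
    return union_bbox(bbox, rotate_bbox_90(bbox))


# A tall, narrow box becomes a square whose side is the larger dimension.
result = padded_bbox((0.0, 0.0, 2.0, 10.0))
```

Under these assumptions the result is always a square centred on the original box, which also explains why neighbouring boxes end up overlapping.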
Using pdfminer.six 20200124.
Bounding boxes on characters that are not strictly horizontal or vertical are incorrect. I assume this is because bounding boxes are defined by only two points, (x0, y0) and (x1, y1), which are rotated with the rotation matrix (around the center of the character's diagonal?) without further processing. The two rotated points can even end up directly above each other, defining a bounding box with an area close to zero.
I've attached a test pdf file, created with Inkscape:
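The suspected failure mode can be reproduced with a few lines of geometry. The sketch below is not pdfminer.six code; it assumes, as guessed above, that rotation happens about the centre of the box, and compares rotating only the two diagonal points against rotating all four corners and taking the enclosing axis-aligned box. All function names are hypothetical.

```python
import math


def rotate(point, center, angle_deg):
    """Rotate point counter-clockwise about center by angle_deg."""
    t = math.radians(angle_deg)
    x, y = point[0] - center[0], point[1] - center[1]
    return (center[0] + x * math.cos(t) - y * math.sin(t),
            center[1] + x * math.sin(t) + y * math.cos(t))


def two_point_bbox(bbox, angle_deg):
    """Rotate only (x0, y0) and (x1, y1): the approach suspected of causing the bug."""
    x0, y0, x1, y1 = bbox
    c = ((x0 + x1) / 2, (y0 + y1) / 2)
    ax, ay = rotate((x0, y0), c, angle_deg)
    bx, by = rotate((x1, y1), c, angle_deg)
    return (min(ax, bx), min(ay, by), max(ax, bx), max(ay, by))


def enclosing_bbox(bbox, angle_deg):
    """Rotate all four corners and return the axis-aligned box that contains them."""
    x0, y0, x1, y1 = bbox
    c = ((x0 + x1) / 2, (y0 + y1) / 2)
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    rotated = [rotate(p, c, angle_deg) for p in corners]
    xs = [p[0] for p in rotated]
    ys = [p[1] for p in rotated]
    return (min(xs), min(ys), max(xs), max(ys))


# For a square glyph box rotated by 45 degrees, the two diagonal points land
# on the same vertical line, so the two-point box has (almost) zero width.
bad = two_point_bbox((0.0, 0.0, 10.0, 10.0), 45.0)
good = enclosing_bbox((0.0, 0.0, 10.0, 10.0), 45.0)
```

The enclosing box is the "surrounding bounding box" variant: it always contains the whole rotated glyph box, at the cost of being larger than the glyph and overlapping its neighbours.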
text-test.pdf
Here are the resulting LTChar bounding boxes rendered into a png:
A further problem is that this also leads to erroneous LTTextBoxes. Getting these correct for diagonal text would probably require minimal bounding boxes that are aligned with the character's orientation rather than with the x and y axes. Here's the example with default layout params:
Function used to extract LTChars and LTTextBoxHorizontals: