Bounding boxes on characters with rotation are incorrect #454

Open
mhrmsn opened this issue Jul 3, 2020 · 7 comments

Comments

@mhrmsn

mhrmsn commented Jul 3, 2020

Using pdfminer.six 20200124.
Bounding boxes of characters that are not strictly horizontal or vertical are incorrect. I assume this is because a bounding box is defined by only two points, (x0, y0) and (x1, y1), which are rotated with the rotation matrix (around the center of the character's diagonal?) without further processing. The two rotated points can even end up directly above each other, defining a bounding box with an area close to zero.

I've attached a test PDF file, created with Inkscape:
text-test.pdf

Here are the resulting LTChar bounding boxes rendered into a png:
text-test-ltchars

A further problem is that this also leads to erroneous LTTextBoxes. Getting these right for diagonal text would probably require minimal bounding boxes that are aligned with the character's orientation rather than with the x and y axes. Here's the example with default layout params:
text-test-lttextboxhorizontals

Function used to extract LTChars and LTTextBoxHorizontals:

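from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTChar, LTCurve, LTTextBox
from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser


def raw_parse_pdf(
    file_path,
    line_overlap=0.5,
    char_margin=2.0,
    line_margin=0.5,
    word_margin=0.1,
    boxes_flow=0.5,
    detect_vertical=False,
    all_texts=False
):

    # Keep the file open until the first page has been fully processed
    with open(file_path, 'rb') as pdf:
        parser = PDFParser(pdf)
        document = PDFDocument(parser)
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed
        rsrcmgr = PDFResourceManager()
        laparams = LAParams(
            line_overlap=line_overlap,
            char_margin=char_margin,
            line_margin=line_margin,
            word_margin=word_margin,
            boxes_flow=boxes_flow,
            detect_vertical=detect_vertical,
            all_texts=all_texts
        )
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # Only the first page is analyzed
        page = next(PDFPage.create_pages(document))
        interpreter.process_page(page)
        layout = device.get_result()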

    ltcurves = []
    lttextboxhorizontals = []
    lttextlines = []
    ltchars = []

    for child in layout:
        # Collect curves
        if isinstance(child, LTCurve):
            ltcurves.append(child)

        # Collect text boxes, then their text lines and characters
        elif isinstance(child, LTTextBox):
            lttextboxhorizontals.append(child)

            for lttextline in child:
                lttextlines.append(lttextline)

                for ltchar in lttextline:
                    if isinstance(ltchar, LTChar):
                        ltchars.append(ltchar)

    return page, ltcurves, lttextboxhorizontals, lttextlines, ltchars
pietermarsman added the type: bug and component: converter labels Jul 5, 2020
@pietermarsman
Member

Hi @mhrmsn, thanks for raising the issue. Also, very nicely illustrated. It would be very nice if we can add support for rotated characters! Could be a unique selling point.

Not sure how to do it yet though ;) For starters we could make sure that the bounding boxes surround the whole character. But it would be even nicer if we can add native support for rotated bounding boxes. That could also make the distinction between horizontal and vertical text superfluous. A parameter could control the amount of allowed rotation between characters of the same word.

@mhrmsn
Author

mhrmsn commented Jul 7, 2020

Hi @pietermarsman, yes, support for this would be really nice.
I think surrounding the whole character correctly can be achieved easily by applying the rotation to the other two bounding box points as well and taking the min/max x/y values of all four points as the new bounding box.
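
The min/max idea described above could look roughly like the sketch below. It is not pdfminer code: the function name `enclosing_bbox` is made up for illustration, and it assumes the box is rotated counter-clockwise around its own center, which may not match how pdfminer actually applies the matrix.

```python
import math


def enclosing_bbox(x0, y0, x1, y1, angle_deg):
    """Axis-aligned box around a rectangle rotated about its center.

    Hypothetical helper, not part of pdfminer. Assumes a counter-clockwise
    rotation by angle_deg around the center of the unrotated box.
    """
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))

    # Rotate all four corners, not just (x0, y0) and (x1, y1)
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    rotated = [
        (
            cx + (x - cx) * cos_a - (y - cy) * sin_a,
            cy + (x - cx) * sin_a + (y - cy) * cos_a,
        )
        for x, y in corners
    ]

    # The new box spans the extremes of the rotated corners
    xs = [p[0] for p in rotated]
    ys = [p[1] for p in rotated]
    return min(xs), min(ys), max(xs), max(ys)


if __name__ == "__main__":
    # A 2 x 1 box rotated by 45 degrees
    print(enclosing_bbox(0, 0, 2, 1, 45))
```

Because all four corners are used, the result can never collapse to a near-zero area the way the two-point version can.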

For native rotated bounding boxes and everything that comes with them regarding layout analysis, I don't have a good overview of which parts of pdfminer are affected or how exactly it works. One could probably keep the current two-point bounding box plus the character's rotation and then do a similar layout analysis, but in a character-based coordinate system or something like that?

One could also add a separate minimum bounding box type to have both options available, or define four-point bounding boxes, etc.

@pietermarsman
Member

Proposal: let's start simple in this issue and use the surrounding bounding box approach. That should improve the layout analysis of rotated characters. After that, and if desired, we can add more thorough support for rotated bounding boxes.

@jstockwin
Member

This is a cool idea. However, supporting rotated bounding boxes properly does seem very hard?

Suppose all the text is rotated at 45 degrees. Now all of your character margin and line margin values need to be computed at the given angle before comparing them with the layout params to decide if, e.g., two text lines are merged into a text box. For an arbitrary angle, this seems like reasonably complex geometry?
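
For the simple case where both objects share one known angle, measuring margins "at the given angle" might amount to projecting the offset between two points onto the text direction and its perpendicular. The sketch below only illustrates that projection; the name `offset_in_text_frame` is hypothetical, and nothing like it exists in pdfminer as far as this thread shows.

```python
import math


def offset_in_text_frame(p1, p2, angle_deg):
    """Split the vector p1 -> p2 into along-text and across-text parts.

    Hypothetical helper, not part of pdfminer. Assumes both points belong
    to text running at angle_deg (counter-clockwise from the x axis).
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))

    along = dx * cos_a + dy * sin_a    # compare with a char-margin-like value
    across = -dx * sin_a + dy * cos_a  # compare with a line-margin-like value
    return along, across


if __name__ == "__main__":
    # Two points on the same 45-degree baseline: the across part is ~0
    print(offset_in_text_frame((0, 0), (1, 1), 45))
```

This does not address the harder case raised next, where the two objects have different rotations.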

Even if you solve that, what about the more advanced layout analysis with boxes flow? E.g. what does the dist function look like if obj1 and obj2 have different rotations?

Even so, it would be a cool feature. A lot of the code is currently implemented for both horizontal and vertical, but with this that would be handled by the same code.

I agree that starting simple by expanding the bounding boxes to completely contain the character is a good idea, although I would note that such bounding boxes will overlap.

@pietermarsman pietermarsman added this to new in pdfminer.six via automation Jul 8, 2020
@pietermarsman pietermarsman moved this from new to accepted in pdfminer.six Jul 8, 2020
@i4never

i4never commented Apr 24, 2022

Is there any way I can tell whether an LTChar is rotated now?

@pietermarsman
Member

I'm not quite sure (I haven't checked) but I think that is encoded in LTChar.matrix. See Section 4.2 (Coordinate Spaces) of the PDF reference.
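
If the rotation is indeed carried by LTChar.matrix, one rough way to read it out is sketched below. This assumes the matrix is the usual six-number PDF form (a, b, c, d, e, f) and contains no skew; the helper names `rotation_degrees` and `is_rotated` are made up for illustration and are not pdfminer functions.

```python
import math


def rotation_degrees(matrix):
    """Rotation angle in degrees encoded in a PDF matrix (a, b, c, d, e, f).

    Hypothetical helper. Assumes rotation plus scaling without skew,
    so that (a, b) is the transformed x axis.
    """
    a, b, _c, _d, _e, _f = matrix
    return math.degrees(math.atan2(b, a))


def is_rotated(matrix, tolerance_deg=0.5):
    """True if the matrix rotates by more than tolerance_deg from upright."""
    return abs(rotation_degrees(matrix)) > tolerance_deg


if __name__ == "__main__":
    # 12-unit text rotated by 30 degrees, written as a plain tuple
    c, s = math.cos(math.radians(30)), math.sin(math.radians(30))
    print(rotation_degrees((12 * c, 12 * s, -12 * s, 12 * c, 100, 200)))
    # With pdfminer this would presumably be called as is_rotated(ltchar.matrix)
```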

@stefan-bordag

stefan-bordag commented Oct 9, 2023

Unfortunately, the matrix will tell you that a character is rotated, but since the boxes coming from pdfminer are inconsistent with respect to the rotation, the rotation matrix is of little practical use. Here is a test example in which each successive character is rotated by another 45 degrees, with the bounding boxes visualized:

image

Here is the same example, but with the rotation matrix applied and then a new bounding box built around the rotated polygon:

image

A proper solution can only come from the library itself. So far, the only workaround I have come up with is a hack: always rotate every box by exactly 90 degrees, then build a new bounding box that takes the maximum extent in every direction from BOTH the original bounding box and the 90-degree-rotated one. These are the results:

image

This is now mostly satisfying.
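
A minimal sketch of how this 90-degree hack might be written is shown below. It is one reading of the description, not the code actually used: it assumes the box is rotated around its own center, in which case the union of the original and the rotated box is a square whose side is the larger of the width and height. The name `padded_bbox` is hypothetical.

```python
def padded_bbox(x0, y0, x1, y1):
    """Union of a box and the same box rotated 90 degrees about its center.

    Hypothetical helper, not part of pdfminer. Rotating by 90 degrees swaps
    the half-width and half-height, so the union extends by the larger of
    the two in every direction.
    """
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    half_w = abs(x1 - x0) / 2.0
    half_h = abs(y1 - y0) / 2.0
    half = max(half_w, half_h)
    return cx - half, cy - half, cx + half, cy + half


if __name__ == "__main__":
    # A 2 x 1 box grows into a 2 x 2 square around the same center
    print(padded_bbox(0, 0, 2, 1))
```

The padded square still does not fully contain a character rotated to 45 degrees (that would need the full diagonal), which may be why the result is described as only mostly satisfying.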
