adding extract_text_dir_sensitive #1040
Thanks, @afriedman412. Could you provide a bit more context? Specifically, how does this differ from the |
OK so the idea here was to make it easier/more intuitive to manually control the direction in which text is read on both axes. If you have text that is rotated in some multiple of 90 degrees, you can just say "the words go right to left and the lines go top to bottom" and it will parse it correctly. As I understand it, the The bigger picture for me is that text direction on both axes is controlled the same way, by choosing which Does that make more sense? |
Ah, got it! I do like the idea of being able to specify char/line direction with more granularity. In fact, I think this would be nice as a core part of the main extraction methods. Doing that will require a bit of code-surgery, so I'm going to take that on myself, but will credit you clearly. Largely as a note to self, it sounds like there are a few different types of scenarios in which the reading direction of text on a page is not left-to-right, top-to-bottom:
I can do it when I have time (if you haven't already). But I made the standalone function as a way to soft launch the syntax, with an eye towards full implementation whenever.
Yeah I mean this issue is technically about issues parsing rotated text. And granted, a lot of this could be easily sorted upstream or downstream of text extraction, but it makes sense to put all text direction control in one place given how easy that is. |
Good news: I've made some progress on incorporating this more deeply into One wrinkle I realized: There are basically two variations of RTL text: (a) text that runs right-to-left for page rotation reasons (such as those in your examples) and (b) text in scripts/languages that naturally run right-to-left. In (a) the assumption, also reflected in the tests in this PR, seems to be that the user would want that text "fixed" in the output — i.e., for the output to read LTR. But for (b) I think it's safe to say that users would want the text to remain RTL. I don't think there's an automated way to tell the difference between those two scenarios with high fidelity, so I'm planning to add two additional parameters — One other note: |
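The split described above, between the direction in which characters are read off the page and the direction in which they are written to the output, could be pictured roughly like this. It is only a sketch: `extract_line`, `read_dir`, and `render_dir` are hypothetical names made up for illustration, not the parameters that were actually added to the library, and the character dicts are reduced to `x0` and `text`.

```python
def extract_line(chars, read_dir="ltr", render_dir="ltr"):
    """Join one line of characters into a string.

    chars: list of dicts with "x0" (horizontal position) and "text".
    read_dir: hypothetical name; the direction the characters run on the page.
    render_dir: hypothetical name; the direction the output string should run.
    """
    for name, value in (("read_dir", read_dir), ("render_dir", render_dir)):
        if value not in ("ltr", "rtl"):
            raise ValueError(f"{name} must be 'ltr' or 'rtl', got {value!r}")

    # Step 1: put the characters in reading order as they sit on the page.
    in_reading_order = sorted(
        chars, key=lambda c: c["x0"], reverse=(read_dir == "rtl")
    )
    text = "".join(c["text"] for c in in_reading_order)

    # Step 2: choose the output order independently of step 1.
    # "ltr" emits the reading order as-is (case (a): rotated text gets "fixed").
    # "rtl" emits it reversed (one possible reading of case (b): text stays RTL).
    return text if render_dir == "ltr" else text[::-1]


# "abc" rotated 180 degrees: "a" ends up rightmost on the page.
rotated = [
    {"x0": 20, "text": "a"},
    {"x0": 10, "text": "b"},
    {"x0": 0, "text": "c"},
]
fixed = extract_line(rotated, read_dir="rtl", render_dir="ltr")
kept = extract_line(rotated, read_dir="rtl", render_dir="rtl")
```

Keeping the two choices separate means the user, rather than a heuristic, decides whether right-to-left text is a rotation artifact or a property of the script.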
Thanks for the input! Fully agree about using I haven't gone back and looked at the code, but I think the idea was to infer the "reading" direction from the line direction. Anyways, new params are fine with me, although I would suggest something like |
Thanks for the response! Re. |
yo in the interest of expediency I'm fine with whatever you think is best! |
Now added in 850fd45 and available in |
Fixes #848 (partially)
Adds `extract_text_dir_sensitive` function to `.utils.text` which lets the user specify which direction the lines and characters should be read. Because the syntax is new, I didn't want to just alter `extract_text_simple` but I can integrate the two if that would be preferable!
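For readers who want a concrete picture of "specify which direction the lines and characters should be read", here is a minimal standalone sketch. It is not the code from this PR: `read_text`, the direction codes, and the `tolerance` argument are assumptions chosen for illustration, and it only relies on character dicts carrying `x0`, `top`, and `text` keys.

```python
# Hypothetical direction codes: (coordinate to sort on, sort descending?)
AXIS = {
    "ltr": ("x0", False),
    "rtl": ("x0", True),
    "ttb": ("top", False),
    "btt": ("top", True),
}


def read_text(chars, line_dir="ttb", char_dir="ltr", tolerance=3):
    """Order characters by an explicit line direction and character direction.

    chars: list of dicts with "x0", "top", and "text".
    tolerance: max gap (in the line coordinate) between neighbours on one line.
    """
    line_key, line_reversed = AXIS[line_dir]
    char_key, char_reversed = AXIS[char_dir]
    if line_key == char_key:
        raise ValueError("line_dir and char_dir must run along different axes")

    # Cluster characters into lines along the line axis.
    lines = []
    for char in sorted(chars, key=lambda c: c[line_key]):
        if lines and abs(char[line_key] - lines[-1][-1][line_key]) <= tolerance:
            lines[-1].append(char)
        else:
            lines.append([char])
    if line_reversed:
        lines.reverse()

    # Within each line, order characters along the character axis.
    return "\n".join(
        "".join(
            c["text"]
            for c in sorted(line, key=lambda c: c[char_key], reverse=char_reversed)
        )
        for line in lines
    )


# "AB" over "CD", rotated 180 degrees on the page.
page = [
    {"x0": 10, "top": 20, "text": "A"},
    {"x0": 0, "top": 20, "text": "B"},
    {"x0": 10, "top": 0, "text": "C"},
    {"x0": 0, "top": 0, "text": "D"},
]
upright = read_text(page, line_dir="btt", char_dir="rtl")
naive = read_text(page)
```

With the defaults the rotated page comes out scrambled, while stating "lines go bottom to top, characters go right to left" recovers the original text. The same two arguments also cover 90-degree rotations, where the lines run along the horizontal axis and the characters along the vertical one.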