Question about line segmenting #59

Open
cinjon opened this issue Oct 11, 2015 · 0 comments

(Examples are taken from this PDF: https://www.dropbox.com/s/6sy77shnro7sqdf/6.pdf?dl=0)

I have a bunch of files from which I've extracted the text in two forms: line by line, and as a single coherent blob. I'm trying to understand the best practice for feeding that text to ocropy-linegen.

For example, take lines 5-8 of the document linked above (reproduced below):

流動資産は、たな卸資産が減少したものの、受取手形及び売掛金などが増加したことなどにより、 前連結会計年度末に比べ5億65百万円増加し、630億33百万円となりました。固定資産は、有形固 定資産、無形固定資産ともに減価償却により減少したものの、投資有価証券の評価差額が増加した ことにより、前連結会計年度末に比べ5億8百万円増加し、212億86百万円となりました。

Here, I could feed that whole blob to ocropy-linegen or I could feed it line by line:

流動資産は、たな卸資産が減少したものの、受取手形及び売掛金などが増加したことなどにより、
...
ことにより、前連結会計年度末に比べ5億8百万円増加し、212億86百万円となりました

I get the sense that the latter is what it expects. Is that right?
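
To make the two options concrete, here is a minimal Python sketch that writes the same text both ways, once with one physical line per row and once as a single joined blob. The function name `write_variants` and the output file names are hypothetical, chosen only for illustration; the sketch does not assume anything about which form ocropy-linegen actually prefers.

```python
from pathlib import Path


def write_variants(lines, out_dir):
    """Write the same text as one-line-per-row and as a single joined blob.

    `lines` is the list of physical lines as they appear on the page.
    The file names below are hypothetical, not an ocropy convention.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Drop empty lines and surrounding whitespace.
    cleaned = [ln.strip() for ln in lines if ln.strip()]

    per_line = out / "per_line.txt"
    blob = out / "blob.txt"

    # Option 1: one physical line of the page per line of the text file.
    per_line.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    # Option 2: everything joined into one long line (no separator,
    # since Japanese text has no spaces between words).
    blob.write_text("".join(cleaned) + "\n", encoding="utf-8")
    return per_line, blob
```

Either file could then be passed to the tool as its text input, which makes it easy to compare the two approaches on the same source.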

For another example, see the table further down on that page. The second row is:

自己資本比率    28.1    ...    23.4

Does ocropy-linegen want the full row as a single line with the spacing collapsed, the full row with its original spacing preserved, or each cell individually?
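
As a rough illustration of those three candidate forms, the sketch below derives each one from a raw table row. The helper name `row_variants` is hypothetical, and the rule that two or more whitespace characters separate cells is an assumption about how the row was extracted.

```python
import re


def row_variants(row):
    """Return three candidate representations of one table row.

    Assumes cells are separated by runs of two or more whitespace
    characters; that is a guess about the extracted text, not a rule.
    """
    cells = re.split(r"\s{2,}", row.strip())
    return {
        "with_spacing": row,          # row exactly as extracted
        "collapsed": " ".join(cells),  # row with single spaces between cells
        "cells": cells,                # each cell as its own string
    }


variants = row_variants("自己資本比率    28.1    ...    23.4")
for name, value in variants.items():
    print(name, repr(value))
```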

Thanks.
