Question about line segmenting #59

cinjon · 2015-10-11T02:40:02Z

(Examples taken from this pdf - https://www.dropbox.com/s/6sy77shnro7sqdf/6.pdf?dl=0)

I have a bunch of files from which I've extracted the text in two forms: line by line, and as a single coherent blob. I'm trying to understand the best practice for feeding that text to ocropy-linegen.

For example, take lines 5-8 of that document (reproduced below):

流動資産は、たな卸資産が減少したものの、受取手形及び売掛金などが増加したことなどにより、 前連結会計年度末に比べ5億65百万円増加し、630億33百万円となりました。固定資産は、有形固 定資産、無形固定資産ともに減価償却により減少したものの、投資有価証券の評価差額が増加した ことにより、前連結会計年度末に比べ5億8百万円増加し、212億86百万円となりました。

Here, I could feed that whole blob to ocropy-linegen or I could feed it line by line:

流動資産は、たな卸資産が減少したものの、受取手形及び売掛金などが増加したことなどにより、
...
ことにより、前連結会計年度末に比べ5億8百万円増加し、212億86百万円となりました

I get the sense that the latter is what it expects. Is that right?
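To make the two options concrete, here is a minimal Python sketch of turning the blob form into the line-by-line form. It assumes the extractor joined the original lines with whitespace and that no line contains whitespace of its own, which happens to hold for the paragraph above but not in general. The helper names `blob_to_lines`, `lines_to_text` and `write_lines` are hypothetical, and nothing here confirms which form ocropy-linegen actually expects.

```python
import io


def blob_to_lines(blob):
    """Split a blob back into lines.

    Assumption: the original line breaks survive as whitespace in the blob
    and the lines themselves contain no whitespace.
    """
    return blob.split()


def lines_to_text(lines):
    """Join lines into text with exactly one line per row, dropping empty ones."""
    cleaned = [line.strip() for line in lines if line.strip()]
    return "\n".join(cleaned) + "\n"


def write_lines(lines, path):
    """Write the one-line-per-row text as UTF-8 (hypothetical helper)."""
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(lines_to_text(lines))
```

If the extraction already provides the line format directly, only `write_lines` would be needed; the splitting step is a fallback for the blob form.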

For another example, see the table further down on that page. The second row is:

自己資本比率    28.1    ...    23.4

Does ocropy-linegen want the full row as a single line, the full row with its original spacing preserved, or each cell individually?
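The three candidates for the table row can be written out as follows. This is only an illustration of the question, not an answer to it: the function name `row_variants` is hypothetical, the `...` is the elision from the row as quoted above, and the split assumes that cells are separated by whitespace and contain none themselves.

```python
def row_variants(row):
    """Return the three candidate forms of a table row.

    Assumption: cells are separated by runs of whitespace and contain
    no whitespace of their own.
    """
    cells = row.split()
    return {
        "with_spacing": row,              # the row exactly as extracted
        "single_spaced": " ".join(cells), # the row with spacing collapsed
        "cells": cells,                   # each cell as its own line
    }


# The row as quoted in the post, elision included.
variants = row_variants("自己資本比率    28.1    ...    23.4")
```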

Thanks.


zuphilip added the ❔ question label Oct 30, 2016
