Text output badly formatted #3

Open
Anterotesis opened this issue Nov 14, 2016 · 1 comment

Comments

@Anterotesis

I'm trying out the command-line version of OpenConvert on the recently released British Library ALTO XML, aiming to turn it into plain text.
Although I have OpenConvert working on a sample (test4.xml, attached to this report in zip format), the text is very badly formatted. Each word is on a separate line, with 6 tabs before it, a blank line between each word, and 454 (!) blank lines between the header information and the first page text.
The online version of OpenConvert formats the text very well.

(Using Mac OS X 10.11.6, & Java 1.8.112 in case that matters.)

test4.xml.zip: https://github.com/INL/OpenConvert/files/590598/test4.xml.zip
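
As a stopgap, the stray whitespace described above can be collapsed after conversion. The following is only a minimal sketch and not part of OpenConvert; the function name `collapse_whitespace` is hypothetical. It discards all line structure, including any line breaks that were genuinely meaningful, so it suits quick reading or full-text search rather than faithful layout.

```python
def collapse_whitespace(text):
    """Join all whitespace-separated tokens with single spaces.

    Tabs, newlines and blank lines are all treated as separators,
    so the result is one long line of text.
    """
    return " ".join(text.split())
```

Reading the converted file, passing its contents through this function and writing the result back out is enough to get running text, at the cost of losing page and paragraph boundaries.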

@JessedeDoes
Member

Dear John,

You are right, the plain text output from ALTO is execrable.
The reason is that conversion takes place indirectly, ALTO --> tokenized TEI with zoning --> plain text.
The final TEI to plain text step introduces the white space.
In the online version, the route is ALTO --> TEI --> Formatted HTML.

Currently I do not have the time to fix this. I will see what I can do next week.

Best,
Jesse
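
Since the whitespace comes from the TEI to plain text step, one possible workaround is to skip TEI and read the words straight from the ALTO file. The sketch below is not how OpenConvert does it; it uses only Python's standard library and assumes the usual ALTO layout, in which `TextBlock` elements contain `TextLine` elements whose `String` children carry each word in a `CONTENT` attribute. The function names are hypothetical, namespaces are ignored so that different ALTO versions are treated alike, and hyphenation (`HYP`) is not handled.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace, so '{ns}String' becomes 'String'."""
    return tag.rsplit("}", 1)[-1]


def alto_to_text(root):
    """Return plain text for a parsed ALTO document.

    Words on a line are joined with spaces, lines with newlines,
    and text blocks with one blank line.
    """
    blocks = []
    for block in root.iter():
        if local_name(block.tag) != "TextBlock":
            continue
        lines = []
        for line in block.iter():
            if local_name(line.tag) != "TextLine":
                continue
            # Each String element holds one word in its CONTENT attribute.
            words = [
                el.get("CONTENT", "")
                for el in line.iter()
                if local_name(el.tag) == "String"
            ]
            lines.append(" ".join(w for w in words if w))
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)
```

For a file on disk, something like `print(alto_to_text(ET.parse("test4.xml").getroot()))` should print the text; header metadata is left out because only `TextBlock` content is visited.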
