Request: ODT reader #1768

ciampix · 2014-11-19T07:16:16Z

Since the LibreOffice/OpenOffice DocBook export function is seriously broken, there is no easy way to convert ODT into md/asciidoc/name_a_format, apart from reST, since there is an odt2sphinx converter (which is far from perfect, BTW). Currently I am converting ODT to AsciiDoc this way:

odt->rest (odt2sphinx)
rest->asciidoc (pandoc)

with many things lost in the process that I have to re-add manually... :-(

TIA

mszep · 2014-11-20T22:32:26Z

I don't know what your use case is exactly, but in the past I've brought odt documents to pandoc by converting them to html with

lowriter --headless --convert-to html myfile.odt

and then reading the html in using pandoc. I didn't have any problems with elements getting lost in the process, but that's just my personal experience of course. I'm curious if this route works better for you and if not, which elements exactly are lost?

ciampix · 2014-11-21T07:27:41Z

The problem with your method is that the LibreOffice/OpenOffice HTML exporter embeds all images, with no option to extract them (and, worst of all, it loses the image file names...). odt2sphinx is able to extract all the embedded images into a separate "images" directory, but it has some nasty little bugs (and it can't recover image names, so you get all image names in the form 100000000000000000.png, 100000000000000002.png, and so on...). Anyway, if pandoc had an option to extract all embedded images, like the --extract-media=DIR option for the epub and docx readers, that would close the issue. I have noticed that pandoc has the opposite option, to force the embedding of images in the output (--self-contained?). For the sake of symmetry, it could be useful to add that option for all readers or writers as a general option.
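
To illustrate the image extraction being asked for: an ODT file is a ZIP archive, and LibreOffice typically stores embedded images under a Pictures/ folder inside it. Below is a minimal Python sketch assuming that layout. The function name extract_odt_images is made up for this example, and this is not a description of what odt2sphinx or pandoc do internally.

```python
import io
import posixpath
import zipfile


def extract_odt_images(odt_bytes):
    """Map image file names to their raw bytes for entries under Pictures/.

    Assumption: the ODT package keeps its images in a top-level
    "Pictures/" folder, which is the usual LibreOffice layout.
    """
    images = {}
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as archive:
        for name in archive.namelist():
            # Skip everything outside Pictures/ and any directory entries.
            if name.startswith("Pictures/") and not name.endswith("/"):
                images[posixpath.basename(name)] = archive.read(name)
    return images
```

Calling extract_odt_images(open("myfile.odt", "rb").read()) returns a dictionary that can then be written out to a directory of your choice. The file names are whatever the archive stores, so this sketch cannot recover the original image names either.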

mszep · 2014-11-21T12:28:32Z

Ah yes, embedded images in the HTML were annoying for me too, but I fortunately didn't need them, and was able to purge them from the document with a simple filter.
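
The filter itself is not shown in this thread. As a rough idea of what such a filter could look like, here is a Python sketch that works on pandoc's JSON AST, where elements are objects with a "t" (type) key. The names strip_images and main are illustrative, and the sketch assumes that dropping every element of type Image is acceptable for the document at hand.

```python
import json
import sys


def strip_images(node):
    """Return a copy of a pandoc JSON AST node with all Image elements removed."""
    if isinstance(node, list):
        # Drop Image elements from this list, then recurse into what is left.
        return [
            strip_images(child)
            for child in node
            if not (isinstance(child, dict) and child.get("t") == "Image")
        ]
    if isinstance(node, dict):
        return {key: strip_images(value) for key, value in node.items()}
    return node


def main():
    # Read the JSON AST from stdin and write the filtered AST to stdout.
    json.dump(strip_images(json.load(sys.stdin)), sys.stdout)
```

To use it as a filter, call main() at the bottom of the script and place it between two pandoc runs, one writing JSON (-t json) and one reading it (-f json).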

However, it does seem as if it would make sense for the HTML reader to put embedded images in a MediaBag, like it does for images in epub and docx documents. Looking at the source file Pandoc.hs, it seems it would require some tinkering, since html is a text format and therefore uses a StringReader, which can't return MediaBags IIUC, rather than a ByteStringReader.
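
For illustration only, the idea of pulling embedded images out of HTML can be sketched in Python rather than in pandoc's Haskell. The sketch assumes images are embedded as base64 data: URIs in double-quoted src attributes. The function name extract_embedded_images and the generated names image0.png, image1.png, ... are made up for this example, and the returned dictionary only stands in loosely for a MediaBag.

```python
import base64
import re

# Matches src="data:image/<subtype>;base64,<payload>" (double-quoted only).
DATA_URI = re.compile(r'src="data:image/([a-zA-Z0-9.+-]+);base64,([^"]+)"')


def extract_embedded_images(html):
    """Return (rewritten_html, media), where media maps file names to bytes."""
    media = {}

    def replace(match):
        # Use the MIME subtype as the extension, e.g. "svg+xml" -> "svg".
        extension = match.group(1).split("+")[0]
        name = f"image{len(media)}.{extension}"
        media[name] = base64.b64decode(match.group(2))
        return f'src="{name}"'

    return DATA_URI.sub(replace, html), media
```

The rewritten HTML refers to the generated file names, so writing each entry of media to disk next to the HTML file keeps the links valid.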

mpickering · 2014-11-22T02:32:33Z

I have some code on a branch somewhere which parses images embedded in HTML so it wouldn't be too much effort to extend the HTML reader with this functionality.

akahl-owl · 2015-09-30T16:14:30Z

Hi, I'd like to know why #2302 was closed without actually being merged. Is ODT ingestion support still a thing? I will probably need it soon, and my Haskell is still very basic.

jgm · 2015-10-03T23:45:55Z

#2302 was merged. So this issue can be closed.

jgm added the enhancement label Nov 20, 2014

jgm changed the title ~~missing odt input support~~ Request: ODT reader Nov 20, 2014

MarLinn mentioned this issue Jul 15, 2015

Candidate odt reader for review #2302

Closed

jgm closed this as completed Oct 3, 2015
