Request: ODT reader #1768

Closed
ciampix opened this issue Nov 19, 2014 · 6 comments


ciampix commented Nov 19, 2014

Since the Libre/Open Office DocBook export function is seriously broken, there is no easy way to convert ODT into md/asciidoc/name_a_format apart from rest, for which there is an odt2sphinx converter (that is far from perfect, BTW). Currently I am converting ODT to Asciidoc this way:

  • odt->rest (odt2sphinx)
  • rest->asciidoc (pandoc)

with many lost things in the process that I have to re-add manually... :-(

TIA

jgm changed the title from "missing odt input support" to "Request: ODT reader" on Nov 20, 2014

mszep (Contributor) commented Nov 20, 2014

I don't know what your use case is exactly, but in the past I've brought odt documents to pandoc by converting them to html with

lowriter --headless --convert-to html myfile.odt

and then reading the html in using pandoc. I didn't have any problems with elements getting lost in the process, but that's just my personal experience of course. I'm curious if this route works better for you and if not, which elements exactly are lost?
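The two-step route described above can be scripted. What follows is only a minimal sketch: the function names `build_commands` and `convert` are hypothetical, the AsciiDoc target and `.adoc` extension are assumptions taken from the original request, and it assumes `lowriter` and `pandoc` are both on the PATH.

```python
import subprocess
from pathlib import Path


def build_commands(odt_path, out_format="asciidoc", out_ext=".adoc"):
    """Return the two command lines for the odt -> html -> pandoc route."""
    odt = Path(odt_path)
    html = odt.with_suffix(".html")      # file LibreOffice is expected to write
    target = odt.with_suffix(out_ext)    # final pandoc output
    to_html = ["lowriter", "--headless", "--convert-to", "html",
               "--outdir", str(odt.parent), str(odt)]
    to_target = ["pandoc", "-f", "html", "-t", out_format,
                 "-o", str(target), str(html)]
    return to_html, to_target


def convert(odt_path, **kwargs):
    """Run both steps in order; raises CalledProcessError if either tool fails."""
    for cmd in build_commands(odt_path, **kwargs):
        subprocess.run(cmd, check=True)
```

Calling `convert("myfile.odt")` would run the same `lowriter` command as above and then hand the resulting HTML to pandoc. Images stay embedded in the HTML on this route.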

ciampix (Author) commented Nov 21, 2014

The problem with your method is that the libreoffice/openoffice html exporter embeds all images, with no option to extract them (and, worse still, it loses the image file names...). odt2sphinx is able to extract all the embedded images into a separate "images" directory, but it has some nasty little bugs (and it can't recover image names, so all images get names of the form 100000000000000000.png, 100000000000000002.png, and so on...). Anyway, if pandoc had an option to extract all embedded images, like the --extract-media=DIR option of the epub and docx readers, that would close the issue. I have noticed that pandoc has the opposite option, to force the embedding of images in the output (--self-contained?). For the sake of symmetry, it could be useful to add that option for all readers or writers as a general option.
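Until such an option exists for HTML input, the embedded images can be pulled out of the exported HTML before it goes to pandoc. The sketch below is a workaround under stated assumptions, not a pandoc feature: it assumes the exporter writes images as base64 `data:` URIs in double-quoted `src` attributes, and the function name `extract_images` and the generated file names (`image1.png`, ...) are hypothetical, since the original names are not present in the HTML.

```python
import base64
import re
from pathlib import Path

# Matches src="data:image/<type>;base64,<payload>" in double-quoted attributes.
DATA_URI = re.compile(r'src="data:image/([a-zA-Z0-9.+-]+);base64,([^"]+)"')


def extract_images(html, media_dir):
    """Write each base64 image to media_dir and point src at the new file."""
    media = Path(media_dir)
    media.mkdir(parents=True, exist_ok=True)
    count = 0

    def save(match):
        nonlocal count
        count += 1
        ext = match.group(1).split("+")[0]  # e.g. "svg+xml" -> "svg"
        target = media / f"image{count}.{ext}"
        target.write_bytes(base64.b64decode(match.group(2)))
        return f'src="{target.as_posix()}"'

    # Returns the HTML with every matched data URI replaced by a file path.
    return DATA_URI.sub(save, html)
```

The rewritten HTML can then be read by pandoc as usual, with the image files sitting in `media_dir`.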

mszep (Contributor) commented Nov 21, 2014

Ah yes, embedded images in the HTML were annoying for me too, but I fortunately didn't need them, and was able to purge them from the document with a simple filter.
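The filter itself is not shown in this thread. As an illustration only, a filter of that kind might look like the sketch below, which walks pandoc's JSON AST and drops every `Image` element. The names `drop_images` and `main` are hypothetical, and it uses only the standard library rather than a filter helper package.

```python
import json
import sys


def drop_images(node):
    """Recursively copy a pandoc JSON AST, omitting every Image element."""
    if isinstance(node, list):
        return [drop_images(child) for child in node
                if not (isinstance(child, dict) and child.get("t") == "Image")]
    if isinstance(node, dict):
        return {key: drop_images(value) for key, value in node.items()}
    return node  # strings, numbers, booleans, None


def main():
    # pandoc pipes the document as JSON on stdin and reads the result on stdout.
    json.dump(drop_images(json.load(sys.stdin)), sys.stdout)
```

A script that ends by calling `main()` could be passed to pandoc with `--filter`. Note that a paragraph containing only an image is left as an empty paragraph in this sketch.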

However, it does seem as if it would make sense for the HTML reader to put embedded images in a MediaBag, like it does for images in epub and docx documents. Looking at the source file Pandoc.hs, it seems it would require some tinkering, since html is a text format and therefore uses a StringReader, which can't return MediaBags IIUC, rather than a ByteStringReader.

mpickering (Collaborator) commented

I have some code on a branch somewhere which parses images embedded in HTML so it wouldn't be too much effort to extend the HTML reader with this functionality.

akahl-owl commented

Hi, I'd like to know why #2302 was closed without actually being merged. Is ODT ingestion support still a thing? I will probably need it soon, and my Haskell is still very basic.

jgm (Owner) commented Oct 3, 2015

#2302 was merged. So this issue can be closed.

jgm closed this as completed on Oct 3, 2015