pptx/docx not working in sandbox mode #8128

Open
rosseljost opened this issue Jun 15, 2022 · 5 comments

Comments

@rosseljost

The Problem:
When converting to .pptx or .docx with the --sandbox flag enabled, pandoc exits with code 97 and the message:

Could not find data file data/data/pptx/[Content_Types].xml

How to Reproduce:

  • Create a simple Markdown file (I assume any Pandoc-supported format will do). For the sake of an example:
    Filename: test.md
    Content: # Heading\n\ntext\n
  • Run pandoc --sandbox -o test.pptx test.md
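For anyone who wants to script the reproduction, here is a minimal sketch in Python. It assumes a `pandoc` executable on the `PATH`; the helper names `build_command` and `reproduce` are made up for this example. On an affected build, `reproduce()` should return 97, as reported above.

```python
from __future__ import annotations

import subprocess
import tempfile
from pathlib import Path


def build_command(source: Path, target: Path) -> list[str]:
    """Return the pandoc invocation used in the report."""
    return ["pandoc", "--sandbox", "-o", str(target), str(source)]


def reproduce() -> int:
    """Write test.md into a temporary directory, run pandoc, return its exit code."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "test.md"
        source.write_text("# Heading\n\ntext\n", encoding="utf-8")
        result = subprocess.run(
            build_command(source, Path(tmp) / "test.pptx"),
            capture_output=True,
            text=True,
        )
        # pandoc reports errors on stderr; show them for inspection.
        print(result.stderr.strip())
        return result.returncode
```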

System Information:

pandoc 2.17.1.1
Compiled with pandoc-types 1.22.2, texmath 0.12.5, skylighting 0.12.3,
citeproc 0.6.0.1, ipynb 0.2

native on Arch Linux

What I Assume About the Bug:
[Content_Types].xml is a file required by the Open Packaging Conventions (the specification underlying Microsoft Office documents); it defines the content types for the file extensions in the ZIP archive.
It can generally be static, as only a limited number of file types are used in an Office document (I assume it is hardcoded in the respective writer module).
From my perspective, there is no reason this should not work in sandbox mode, as (if my assumption is correct) [Content_Types].xml is not influenced by user input.
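To illustrate what this file does, the sketch below builds a tiny ZIP archive in memory containing only a trimmed-down `[Content_Types].xml` and reads the extension-to-type defaults back out. The manifest is a hand-written sample, not the one pandoc ships, and the helper names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# XML namespace used by OPC content-type manifests.
CT_NS = "http://schemas.openxmlformats.org/package/2006/content-types"

# Hand-written, heavily trimmed sample; a real pptx manifest lists many more parts.
SAMPLE_MANIFEST = f"""<?xml version="1.0" encoding="UTF-8"?>
<Types xmlns="{CT_NS}">
  <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/>
  <Default Extension="xml" ContentType="application/xml"/>
</Types>
"""


def build_sample_package() -> bytes:
    """Create an in-memory ZIP holding only the sample manifest."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", SAMPLE_MANIFEST)
    return buffer.getvalue()


def read_default_types(package_bytes: bytes) -> dict:
    """Map each file extension to its declared default content type."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read("[Content_Types].xml"))
    return {
        element.get("Extension"): element.get("ContentType")
        for element in root.findall(f"{{{CT_NS}}}Default")
    }
```

The same `read_default_types` function can be pointed at the bytes of any existing .pptx or .docx file to inspect its manifest.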

What I'd Expect to Happen:
Option 1: The Office file is created.
Option 2: (if there are valid reasons why this can't work in sandbox mode) A proper error message.

@rosseljost rosseljost added the bug label Jun 15, 2022
@jgm
Owner

jgm commented Jun 15, 2022

You're probably using a version of pandoc compiled without the embed_data_files flag. When this flag is enabled, the data files needed to create a docx are embedded in the binary, and thus accessible without any file I/O. When it is not enabled, the data files live somewhere on your file system (e.g. /usr/share/pandoc) and pandoc cannot get them without doing I/O -- hence they are inaccessible with --sandbox.

@jgm
Owner

jgm commented Jun 15, 2022

Running in sandbox mode, pandoc can still use an ersatz file system stored in memory (which is populated at the outset with the files specified on the command line). But this doesn't include all of pandoc's data files.
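A toy model of this behavior, in Python rather than pandoc's Haskell: the "file system" is just a dictionary seeded with the command-line inputs, so any lookup for a data file that was never loaded fails. The class and exception names are invented for this sketch and do not correspond to pandoc's internals.

```python
class DataFileNotFound(Exception):
    """Raised when a path is absent from the in-memory file system."""


class InMemoryFiles:
    """A dictionary-backed stand-in for a file system; performs no real I/O."""

    def __init__(self, preloaded: dict):
        # Copy so later changes to the caller's dict do not leak in.
        self._files = dict(preloaded)

    def read(self, path: str) -> bytes:
        try:
            return self._files[path]
        except KeyError:
            raise DataFileNotFound(f"Could not find data file {path}") from None


# Only the file named on the command line is available inside the sandbox.
sandbox = InMemoryFiles({"test.md": b"# Heading\n\ntext\n"})
```

Reading `test.md` succeeds, while `sandbox.read("data/pptx/[Content_Types].xml")` raises `DataFileNotFound`, which mirrors the failure in the report.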

A couple of possibilities:

  • You can work around your own difficulty by compiling pandoc with embed_data_files, or just using the linux binary we provide.
  • We could load all of pandoc's data files (936K) into the ersatz file system when --sandbox is used. Then this error would not occur. But this might have a noticeable performance impact that would be undesirable for many uses.
  • We could try to load data files into the ersatz file system intelligently, e.g. loading the files necessary for docx when docx output is selected. This adds a lot of complexity.
  • We could change the error message issued by readFileLazy and readFileStrict in T.P.Class.PandocPure, so that in addition to saying "not found," it says something like "resources on the file system are not available in --sandbox mode." The problem with this is that the --sandbox option is specific to the pandoc CLI; those who use pandoc as a library might use these functions and then the error message would be inappropriate.
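The third option (loading only what the selected output format needs) could look roughly like the following Python sketch. Everything here is hypothetical: the fake "disk" dictionary, the path layout, and the assumption that a format's data files share a directory prefix. The real selection logic in pandoc would be more involved, which is the complexity mentioned above.

```python
# Stand-in for data files stored on disk; paths and contents are made up.
FAKE_DISK = {
    "pptx/[Content_Types].xml": b"<Types/>",
    "docx/[Content_Types].xml": b"<Types/>",
    "templates/default.html": b"<html/>",
}


def preload_for_format(disk: dict, output_format: str) -> dict:
    """Select only the data files whose path starts with the format's directory."""
    prefix = f"{output_format}/"
    return {path: data for path, data in disk.items() if path.startswith(prefix)}


# With pptx output selected, only the pptx files enter the in-memory file system.
loaded = preload_for_format(FAKE_DISK, "pptx")
```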

@rosseljost
Author

Thanks for the detailed and fast answer. From my point of view, this issue can be closed, as there exists a viable workaround.
And good idea to add this to the manual.
If you want to keep it open to track this behavior (and maybe change it in the future), that is up to you.

I see one additional possibility to address this:

  • Provide an additional CLI flag (something like --full-sandbox maybe?) that enables the --sandbox behavior and the ersatz file system.

@jgm
Owner

jgm commented Jun 16, 2022

I suppose another possibility would be to put the behavior under CPP, so that the data files are included in the ersatz file system only if pandoc is compiled without the embed_data_files flag. That would limit the performance impact to systems that store the data files on disk. (But of course that includes standard linux distributions, so this may not be a good plan if the performance impact is significant.)

Anyway, first step would be to measure the performance impact. Maybe it could be limited if we use lazy IO. I'll keep this open.
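As a rough first proxy for that measurement, one could time an eager read of the whole data directory into memory. The Python sketch below does only that; it says nothing about pandoc's actual startup cost in Haskell or about lazy IO, and the function names are invented for illustration.

```python
import time
from pathlib import Path


def load_tree(root: Path) -> dict:
    """Read every regular file under root into memory, keyed by relative path."""
    return {
        str(path.relative_to(root)): path.read_bytes()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def time_load(root: Path) -> tuple:
    """Return (file count, total bytes, elapsed seconds) for one eager load."""
    start = time.perf_counter()
    files = load_tree(root)
    elapsed = time.perf_counter() - start
    total_bytes = sum(len(data) for data in files.values())
    return len(files), total_bytes, elapsed
```

For example, `time_load(Path("/usr/share/pandoc"))` would time the directory mentioned earlier in the thread, though the actual location varies between distributions.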

GZGavinZhao added a commit to GZGavinZhao/packages that referenced this issue Nov 17, 2023
Summary:
Add the `-f embed_data_files` when building, because when running in
`--sandbox` mode, `pandoc` cannot access the external file system.

A similar case is presented in
jgm/pandoc#8128.

Test plan:
Successfully ran and tested `python-pypandoc` and `apostrophe`.

Signed-off-by: Gavin Zhao <git@gzgz.dev>
@alerque
Contributor

alerque commented Dec 4, 2023

Please do keep this open, as I don't think Arch Linux plans on enabling the embedded data files option. As a distro we have several reasons for preferring not to do that, and it would be better if another resolution were found.
