New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal : lossless markdown cells in ipynb #5408
Comments
Well, that's a good question! I agree that having the markdown cell reformatted was a bit surprising the first time I saw it. But as you explained, this preserves the text syntax and therefore also the HTML rendering. And that can also be a feature, as the long lines of markdown content can be automatically reformatted into shorter lines, and headers get a consistent style. Now, I think the Regarding the implementation, I do not think I would recommend copying the original text in the metadata, for two reasons:
|
Marc Wouts <notifications@github.com> writes:
Now, I think the `--wrap=preserve` option already reduces by much the reformatting. Maybe that's a good start already? Do you think you need to go further than that?
I guess that's my question for you. I can imagine that
someone who goes back and forth a lot might like more
consistency in the markdown cells, to avoid spurious
changes. But maybe it's not an issue in practice.
Regarding the implementation, I do not think I would recommend copying the original text in the metadata, for two reasons:
- Duplicating the original text would also make file history more noisy in version control
- And it would prevent the user from being able to edit the content of the cell (since on the next conversion it would be overwritten with the original text)
To be clear, my idea was to store this in the metadata
in pandoc's AST, but not to render this metadata in
markdown or ipynb.
|
I see! Sure, if you implement an option that preserves the content of the cells when converting between Markdown and Jupyter Notebook, I will make that the default for pandoc from Jupytext, as it will make users more confident with the round trip. |
Just want to say I love where this conversation is heading :-) I'm happy to play around with things and give feedback whenever there are prototypes (on either side) |
It seems to me setting it as RawBlock of markdown is a cleaner approach (as a command line option.) People specifically looking for round trip identity would have been using the correct output format (and as a side benefit a free filter to remove any markdown cells.) Reading elsewhere it seems like only a particular combinations of command line arguments would provide round trip identity, in this case may be similar to those markdown variants (eg multimarkdown as a short cut to combine multiple behaviors), a particular ipynb variants can be defined to provide round trip identity. |
@jgm To be able to have reproducible research, IMHO round-trip identity is required. I think having an explicit command line option to ensure it would be ideal. Having another option to influence which cells get written would be also really useful. Any timelines on this? |
If I understand this issue correctly, then one way to solve this could be to implement @jgm's idea for a "format algebra".
Then the cell could be represented by something like
|
I think the original idea of storing the source in an attribute on the cell is still the best. I'm going to try to implement that now. |
The reader now parses the contents of the markdown cell to a Pandoc structure, but *also* stores the raw markdown in a `source` attribute on the cell Div. When we convert back to markdown, this attribute is stripped off and the original source is used. When we convert to other formats, the attribute is usually ignored (though it will come through in HTML as a `data-source` attribute, not unhelpfully). I'll note some potential drawbacks of this approach: - It makes it impossible to use pandoc to clean up or change the contents of markdown cells, e.g. going from `+smart` to `-smart`. - There may be formats where the addition of the `source` attribute is problematic. I can't think of any, though. Closes #5408.
I've committed a change that should make round-tripping possible. Please try it out and consider the potential drawbacks noted in the commit message. We may go back on this change if it proves problematic for reasons I haven't thought of. |
The reader now parses the contents of the markdown cell to a Pandoc structure, but *also* stores the raw markdown in a `source` attribute on the cell Div. When we convert back to markdown, this attribute is stripped off and the original source is used. When we convert to other formats, the attribute is usually ignored (though it will come through in HTML as a `data-source` attribute, not unhelpfully). I'll note some potential drawbacks of this approach: - It makes it impossible to use pandoc to clean up or change the contents of markdown cells, e.g. going from `+smart` to `-smart`. - There may be formats where the addition of the `source` attribute is problematic. I can't think of any, though. Closes #5408.
I don't have the know-how to test it. Any pointers? If I understand well this would add a new "section" Does it mean that the images will not be added as attachments (related to #5299)? E.g. ![A Markdown logo](markdown.png){#fig:fig1} would stay exactly the same during the round trip. |
If you don't want to compile from source, you can grab a nightly tomorrow and test with it. Images would still be added as attachments, yes. |
I have grabbed version 2.10 but don't see any difference (?) |
It's not in 2.10. It's in the nightly we build from the development version: |
I have downloaded the nightly and tested:
|
I don't understand this comment, can you elaborate?
Can you give an example to illustrate what you're talking about? This is a bit too terse for me.
Example?
That's not good, I'll look into that. |
Reopening for possible tweaks |
OK, I see what you may be getting at; if I convert ipynb -> md now, I get a mess:
At the very least, we'd need to somehow escape the newlines in the source attribute so we don't get something invalid here. But maybe using the raw block is a better option after all. |
For now I've reverted these commits. Back to drawing board. |
What I propose is essentially to only use raw markdown when converting. In your example, it seems natural to expect ![A Markdown logo](test/lalune.jpg){#fig:fig1} and not ![A Markdown logo](test/lalune.jpg){\#fig:fig1} I think it is very confusing when the markdown cells are converted. In this instance, there is a specific reason to not escape. This is the syntax used by pandoc-fignos . The same goes when converting back, the snippet ![A Markdown logo](test/lalune.jpg){#fig:fig1} should stay the same and not be replaced with an attachment. The user would have embedded as an attachment if that was his requirement. In conclusion, I think the markdown cells should not be converted at all in both directions, or at least there should be a way (possibly a flag) to enforce it. In my use case (reproducible reports), I am planning to:
|
You are using pandoc's link attribute syntax |
Two possibilities:
I incline towards option 1 I think. |
Jupyter supports GitHub flavored markdown. It does not support the attribute syntax. However, this is not really important. As said, Jupyter is only helping to interactively build the final markdown. We can live with the fact that it cannot fully support Pandoc's Markdown. Regarding the options, I also think option 1 is more natural. |
The reader now parses the contents of the markdown cell to a Pandoc structure, but *also* stores the raw markdown in a `source` attribute on the cell Div. When we convert back to markdown, this attribute is stripped off and the original source is used. When we convert to other formats, the attribute is usually ignored (though it will come through in HTML as a `data-source` attribute, not unhelpfully). I'll note some potential drawbacks of this approach: - It makes it impossible to use pandoc to clean up or change the contents of markdown cells, e.g. going from `+smart` to `-smart`. - There may be formats where the addition of the `source` attribute is problematic. I can't think of any, though. Closes #5408.
Pandoc currently stores markdown cells in its native format, so when they're rendered back to ipynb, the formatting can change (in semantically insignificant ways). This might be undesirable for some purposes.
We could have an option that causes the ipynb reader to store the original markdown text in a special attribute of the markdown cell div. The ipynb writer and markdown writer could check this attribute, and if it's present, render the stored text as raw markdown, exactly as in the source, rather than rendering the contents on the cell in the normal way.
I guess the attribute would be fairly harmless in other output formats, but we could automatically run a filter to strip it out for formats other than markdown or ipynb, I suppose.
Alternatively, the option could cause the ipynb reader to parse markdown cells into
RawBlock (Format "markdown")
. With this option, no output would appear for markdown cells when writing to formats other than ipynb or markdown; but that may not be an issue if it's triggered by a particular command-line option.This would not work with pandoc-citeproc or filters in general, since these modify the AST, and the modifications would be ignored if we just substituted the original markdown.
Thoughts on this proposal?
@mwouts
@choldgraf
The text was updated successfully, but these errors were encountered: