Proposal : lossless markdown cells in ipynb #5408

jgm · 2019-03-29T17:31:14Z

Pandoc currently stores markdown cells in its native format, so when they're rendered back to ipynb, the formatting can change (in semantically insignificant ways). This might be undesirable for some purposes.

We could have an option that causes the ipynb reader to store the original markdown text in a special attribute of the markdown cell div. The ipynb writer and markdown writer could check this attribute, and if it's present, render the stored text as raw markdown, exactly as in the source, rather than rendering the contents on the cell in the normal way.

I guess the attribute would be fairly harmless in other output formats, but we could automatically run a filter to strip it out for formats other than markdown or ipynb, I suppose.

Alternatively, the option could cause the ipynb reader to parse markdown cells into RawBlock (Format "markdown"). With this option, no output would appear for markdown cells when writing to formats other than ipynb or markdown; but that may not be an issue if it's triggered by a particular command-line option.

This would not work with pandoc-citeproc or filters in general, since these modify the AST, and the modifications would be ignored if we just substituted the original markdown.

Thoughts on this proposal?
@mwouts
@choldgraf

The text was updated successfully, but these errors were encountered:

mwouts · 2019-03-29T23:19:19Z

Well, that's a good question! I agree that having the markdown cell reformatted was a bit surprising the first time I saw it. But as you explained, this preserves the text syntax and therefore also the HTML rendering. And that can also be a feature, as the long lines of markdown content can be automatically reformatted into shorter lines, and headers get a consistent style.

Now, I think the --wrap=preserve option already reduces by much the reformatting. Maybe that's a good start already? Do you think you need to go further than that?

Regarding the implementation, I do not think I would recommend copying the original text in the metadata, for two reasons:

Duplicating the original text would also make file history more noisy in version control
And it would prevent the user from being able to edit the content of the cell (since on the next conversion it would be overwritten with the original text)

jgm · 2019-03-30T01:19:27Z

Marc Wouts <notifications@github.com> writes:

Now, I think the `--wrap=preserve` option already reduces by much the reformatting. Maybe that's a good start already? Do you think you need to go further than that?

I guess that's my question for you. I can imagine that someone who goes back and forth a lot might like more consistency in the markdown cells, to avoid spurious changes. But maybe it's not an issue in practice.

Regarding the implementation, I do not think I would recommend copying the original text in the metadata, for two reasons: - Duplicating the original text would also make file history more noisy in version control - And it would prevent the user from being able to edit the content of the cell (since on the next conversion it would be overwritten with the original text)

To be clear, my idea was to store this in the metadata in pandoc's AST, but not to render this metadata in markdown or ipynb.

mwouts · 2019-03-30T10:55:24Z

I see! Sure, if you implement an option that preserves the content of the cells when converting between Markdown and Jupyter Notebook, I will make that the default for pandoc from Jupytext, as it will make users more confident with the round trip.

choldgraf · 2019-03-30T16:38:42Z

Just want to say I love where this conversation is heading :-)

I'm happy to play around with things and give feedback whenever there are prototypes (on either side)

ickc · 2019-05-21T11:28:32Z

It seems to me setting it as RawBlock of markdown is a cleaner approach (as a command line option.) People specifically looking for round trip identity would have been using the correct output format (and as a side benefit a free filter to remove any markdown cells.)

Reading elsewhere it seems like only a particular combinations of command line arguments would provide round trip identity, in this case may be similar to those markdown variants (eg multimarkdown as a short cut to combine multiple behaviors), a particular ipynb variants can be defined to provide round trip identity.

teucer · 2020-06-30T08:06:26Z

@jgm To be able to have reproducible research, IMHO round-trip identity is required. I think having an explicit command line option to ensure it would be ideal.

Having another option to influence which cells get written would be also really useful.

Any timelines on this?

tarleb · 2020-06-30T18:17:27Z

If I understand this issue correctly, then one way to solve this could be to implement @jgm's idea for a "format algebra".

One thought I've sometimes had is that it would be cool if you could specify disjunctions of formats:
Format (LaTeX `or` ConTeXt)
or maybe even complements:
Format (not (LaTeX `or` ConTexT))

Then the cell could be represented by something like

[ Format Ipynb [RawBlock ...], Format (not Ipynb) [<parsed markdown>] ]

jgm · 2020-06-30T19:07:21Z

I think the original idea of storing the source in an attribute on the cell is still the best. I'm going to try to implement that now.

The reader now parses the contents of the markdown cell to a Pandoc structure, but *also* stores the raw markdown in a `source` attribute on the cell Div. When we convert back to markdown, this attribute is stripped off and the original source is used. When we convert to other formats, the attribute is usually ignored (though it will come through in HTML as a `data-source` attribute, not unhelpfully). I'll note some potential drawbacks of this approach: - It makes it impossible to use pandoc to clean up or change the contents of markdown cells, e.g. going from `+smart` to `-smart`. - There may be formats where the addition of the `source` attribute is problematic. I can't think of any, though. Closes #5408.

jgm · 2020-06-30T19:20:25Z

I've committed a change that should make round-tripping possible. Please try it out and consider the potential drawbacks noted in the commit message. We may go back on this change if it proves problematic for reasons I haven't thought of.

The reader now parses the contents of the markdown cell to a Pandoc structure, but *also* stores the raw markdown in a `source` attribute on the cell Div. When we convert back to markdown, this attribute is stripped off and the original source is used. When we convert to other formats, the attribute is usually ignored (though it will come through in HTML as a `data-source` attribute, not unhelpfully). I'll note some potential drawbacks of this approach: - It makes it impossible to use pandoc to clean up or change the contents of markdown cells, e.g. going from `+smart` to `-smart`. - There may be formats where the addition of the `source` attribute is problematic. I can't think of any, though. Closes #5408.

teucer · 2020-06-30T19:58:09Z

@jgm

I don't have the know-how to test it. Any pointers?

If I understand well this would add a new "section" source to the output and use it to convert back.

Does it mean that the images will not be added as attachments (related to #5299)? E.g.

![A Markdown logo](markdown.png){#fig:fig1}

would stay exactly the same during the round trip.

jgm · 2020-06-30T20:09:49Z

If you don't want to compile from source, you can grab a nightly tomorrow and test with it.

Images would still be added as attachments, yes.

teucer · 2020-07-01T07:06:38Z

I have grabbed version 2.10 but don't see any difference (?)

jgm · 2020-07-01T23:29:37Z

It's not in 2.10. It's in the nightly we build from the development version:
https://github.com/jgm/pandoc/actions/runs/153727389

teucer · 2020-07-02T09:22:49Z

@jgm

I have downloaded the nightly and tested:

I think it would be useful to have a flag (e.g. --source) just to keep only the source both ways ipynb <-> md
If there is no source attribute in the markdown file, we could consider the content as source
When converting from markdown, we should not keep the source attribute
Related to above, when one does multiple rounds it keeps on adding nested source attributes in some instances

jgm · 2020-07-02T15:55:26Z

I think it would be useful to have a flag (e.g. --source) just to keep only the source both ways ipynb <-> md

I don't understand this comment, can you elaborate?

If there is no source attribute in the markdown file, we could consider the content as source?

Can you give an example to illustrate what you're talking about? This is a bit too terse for me.

When converting from markdown, we should not keep the source attribute

Example?

Related to above, when one does multiple rounds it keeps on adding nested source attributes in some instances

That's not good, I'll look into that.

jgm · 2020-07-02T15:56:58Z

Reopening for possible tweaks

See #5408.

jgm · 2020-07-02T16:02:33Z

OK, I see what you may be getting at; if I convert ipynb -> md now, I get a mess:

::: {.cell .markdown source="Inline     math, such as $a, b, c \\geq 2.5$, doesn't render properly in RST.

Multi-line _maths_ does:

$$ E = m c^2 $$
![A Markdown logo](test/lalune.jpg){#fig:fig1}"}
Inline math, such as $a, b, c \geq 2.5$, doesn\'t render properly in
RST.

Multi-line *maths* does:

$$ E = m c^2 $$ ![A Markdown logo](test/lalune.jpg){\#fig:fig1}
:::

At the very least, we'd need to somehow escape the newlines in the source attribute so we don't get something invalid here. But maybe using the raw block is a better option after all.

jgm · 2020-07-02T16:04:15Z

For now I've reverted these commits. Back to drawing board.

teucer · 2020-07-02T16:41:02Z

What I propose is essentially to only use raw markdown when converting. In your example, it seems natural to expect

![A Markdown logo](test/lalune.jpg){#fig:fig1}

and not

![A Markdown logo](test/lalune.jpg){\#fig:fig1}

I think it is very confusing when the markdown cells are converted. In this instance, there is a specific reason to not escape. This is the syntax used by pandoc-fignos .

The same goes when converting back, the snippet

![A Markdown logo](test/lalune.jpg){#fig:fig1}

should stay the same and not be replaced with an attachment. The user would have embedded as an attachment if that was his requirement.

In conclusion, I think the markdown cells should not be converted at all in both directions, or at least there should be a way (possibly a flag) to enforce it.

In my use case (reproducible reports), I am planning to:

use jupyter as an interactive editor for markdown+code
convert it to pure markdown for version controlling (possibly stripping out output cells): the collaborators would convert it back to ipynb and go back to step 1.
convert to latex (via markdown, possibly stripping out code cells and keeping the output cells only) when finished

jgm · 2020-07-02T18:01:25Z

You are using pandoc's link attribute syntax {#fig}. Does Jupyter notebooks actually support that?

jgm · 2020-07-02T18:08:28Z

Two possibilities:

Add an extension raw_markdown that affects the ipynb reader. If this is set, markdown cell contents are put into a RawBlock (Format "markdown") instead of being parsed as markdown. You'd trigger this with -f ipynb+raw_markdown.
Add a command-line option to do this, something like --raw-markdown-cells or --lossless-markdown-cells.

I incline towards option 1 I think.

teucer · 2020-07-02T19:53:52Z

Jupyter supports GitHub flavored markdown. It does not support the attribute syntax. However, this is not really important. As said, Jupyter is only helping to interactively build the final markdown. We can live with the fact that it cannot fully support Pandoc's Markdown.

Regarding the options, I also think option 1 is more natural.

The reader now parses the contents of the markdown cell to a Pandoc structure, but *also* stores the raw markdown in a `source` attribute on the cell Div. When we convert back to markdown, this attribute is stripped off and the original source is used. When we convert to other formats, the attribute is usually ignored (though it will come through in HTML as a `data-source` attribute, not unhelpfully). I'll note some potential drawbacks of this approach: - It makes it impossible to use pandoc to clean up or change the contents of markdown cells, e.g. going from `+smart` to `-smart`. - There may be formats where the addition of the `source` attribute is problematic. I can't think of any, though. Closes #5408.

See #5408.

jgm added format:ipynb format:Markdown labels Mar 29, 2019

mwouts mentioned this issue Apr 1, 2019

Pandoc format for Jupyter notebook mwouts/jupytext#208

Closed

jgm added this to the 2.7.4 milestone Jul 20, 2019

jgm removed this from the 2.8 milestone Sep 11, 2019

jgm mentioned this issue Jun 29, 2020

ipynb to markdown: disable escaping #6489

Closed

jgm mentioned this issue Jun 30, 2020

Are round trips supported between ipynb and org-mode? #6367

Closed

jgm closed this as completed Jun 30, 2020

jgm reopened this Jul 2, 2020

jgm added a commit that referenced this issue Jul 2, 2020

Ipynb reader: fix duplication of 'source' attribute.

2d00936

See #5408.

jgm added this to the next release milestone Jul 2, 2020

jgm added a commit that referenced this issue Jul 18, 2020

Ipynb reader: fix duplication of 'source' attribute.

2b54a5d

See #5408.

jgm closed this as completed in 48fb6d9 Jul 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal : lossless markdown cells in ipynb #5408

Proposal : lossless markdown cells in ipynb #5408

jgm commented Mar 29, 2019 •

edited

mwouts commented Mar 29, 2019

jgm commented Mar 30, 2019 via email

mwouts commented Mar 30, 2019

choldgraf commented Mar 30, 2019

ickc commented May 21, 2019

teucer commented Jun 30, 2020 •

edited

tarleb commented Jun 30, 2020

jgm commented Jun 30, 2020

jgm commented Jun 30, 2020

teucer commented Jun 30, 2020

jgm commented Jun 30, 2020

teucer commented Jul 1, 2020

jgm commented Jul 1, 2020

teucer commented Jul 2, 2020

jgm commented Jul 2, 2020

jgm commented Jul 2, 2020

jgm commented Jul 2, 2020

jgm commented Jul 2, 2020

teucer commented Jul 2, 2020

jgm commented Jul 2, 2020

jgm commented Jul 2, 2020

teucer commented Jul 2, 2020 •

edited

Proposal : lossless markdown cells in ipynb #5408

Proposal : lossless markdown cells in ipynb #5408

Comments

jgm commented Mar 29, 2019 • edited

mwouts commented Mar 29, 2019

jgm commented Mar 30, 2019 via email

mwouts commented Mar 30, 2019

choldgraf commented Mar 30, 2019

ickc commented May 21, 2019

teucer commented Jun 30, 2020 • edited

tarleb commented Jun 30, 2020

jgm commented Jun 30, 2020

jgm commented Jun 30, 2020

teucer commented Jun 30, 2020

jgm commented Jun 30, 2020

teucer commented Jul 1, 2020

jgm commented Jul 1, 2020

teucer commented Jul 2, 2020

jgm commented Jul 2, 2020

jgm commented Jul 2, 2020

jgm commented Jul 2, 2020

jgm commented Jul 2, 2020

teucer commented Jul 2, 2020

jgm commented Jul 2, 2020

jgm commented Jul 2, 2020

teucer commented Jul 2, 2020 • edited

jgm commented Mar 29, 2019 •

edited

teucer commented Jun 30, 2020 •

edited

teucer commented Jul 2, 2020 •

edited