Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal : lossless markdown cells in ipynb #5408

Closed
jgm opened this issue Mar 29, 2019 · 22 comments
Closed

Proposal : lossless markdown cells in ipynb #5408

jgm opened this issue Mar 29, 2019 · 22 comments

Comments

@jgm
Copy link
Owner

jgm commented Mar 29, 2019

Pandoc currently stores markdown cells in its native format, so when they're rendered back to ipynb, the formatting can change (in semantically insignificant ways). This might be undesirable for some purposes.

We could have an option that causes the ipynb reader to store the original markdown text in a special attribute of the markdown cell div. The ipynb writer and markdown writer could check this attribute, and if it's present, render the stored text as raw markdown, exactly as in the source, rather than rendering the contents on the cell in the normal way.

I guess the attribute would be fairly harmless in other output formats, but we could automatically run a filter to strip it out for formats other than markdown or ipynb, I suppose.

Alternatively, the option could cause the ipynb reader to parse markdown cells into RawBlock (Format "markdown"). With this option, no output would appear for markdown cells when writing to formats other than ipynb or markdown; but that may not be an issue if it's triggered by a particular command-line option.

This would not work with pandoc-citeproc or filters in general, since these modify the AST, and the modifications would be ignored if we just substituted the original markdown.

Thoughts on this proposal?
@mwouts
@choldgraf

@mwouts
Copy link

mwouts commented Mar 29, 2019

Well, that's a good question! I agree that having the markdown cell reformatted was a bit surprising the first time I saw it. But as you explained, this preserves the text syntax and therefore also the HTML rendering. And that can also be a feature, as the long lines of markdown content can be automatically reformatted into shorter lines, and headers get a consistent style.

Now, I think the --wrap=preserve option already reduces by much the reformatting. Maybe that's a good start already? Do you think you need to go further than that?

Regarding the implementation, I do not think I would recommend copying the original text in the metadata, for two reasons:

  • Duplicating the original text would also make file history more noisy in version control
  • And it would prevent the user from being able to edit the content of the cell (since on the next conversion it would be overwritten with the original text)

@jgm
Copy link
Owner Author

jgm commented Mar 30, 2019 via email

@mwouts
Copy link

mwouts commented Mar 30, 2019

I see! Sure, if you implement an option that preserves the content of the cells when converting between Markdown and Jupyter Notebook, I will make that the default for pandoc from Jupytext, as it will make users more confident with the round trip.

@choldgraf
Copy link

Just want to say I love where this conversation is heading :-)

I'm happy to play around with things and give feedback whenever there are prototypes (on either side)

@ickc
Copy link
Contributor

ickc commented May 21, 2019

It seems to me setting it as RawBlock of markdown is a cleaner approach (as a command line option.) People specifically looking for round trip identity would have been using the correct output format (and as a side benefit a free filter to remove any markdown cells.)

Reading elsewhere it seems like only a particular combinations of command line arguments would provide round trip identity, in this case may be similar to those markdown variants (eg multimarkdown as a short cut to combine multiple behaviors), a particular ipynb variants can be defined to provide round trip identity.

@jgm jgm added this to the 2.7.4 milestone Jul 20, 2019
@jgm jgm removed this from the 2.8 milestone Sep 11, 2019
@teucer
Copy link

teucer commented Jun 30, 2020

@jgm To be able to have reproducible research, IMHO round-trip identity is required. I think having an explicit command line option to ensure it would be ideal.

Having another option to influence which cells get written would be also really useful.

Any timelines on this?

@tarleb
Copy link
Collaborator

tarleb commented Jun 30, 2020

If I understand this issue correctly, then one way to solve this could be to implement @jgm's idea for a "format algebra".

One thought I've sometimes had is that it would be cool if you could specify disjunctions of formats:

Format (LaTeX `or` ConTeXt)

or maybe even complements:

Format (not (LaTeX `or` ConTexT))

Then the cell could be represented by something like

[ Format Ipynb [RawBlock ...], Format (not Ipynb) [<parsed markdown>] ]

@jgm
Copy link
Owner Author

jgm commented Jun 30, 2020

I think the original idea of storing the source in an attribute on the cell is still the best. I'm going to try to implement that now.

jgm added a commit that referenced this issue Jun 30, 2020
The reader now parses the contents of the markdown cell to a Pandoc
structure, but *also* stores the raw markdown in a `source`
attribute on the cell Div.  When we convert back to markdown,
this attribute is stripped off and the original source is used.
When we convert to other formats, the attribute is usually
ignored (though it will come through in HTML as a `data-source`
attribute, not unhelpfully).

I'll note some potential drawbacks of this approach:

- It makes it impossible to use pandoc to clean up or
  change the contents of markdown cells, e.g.
  going from `+smart` to `-smart`.

- There may be formats where the addition of the `source`
  attribute is problematic.  I can't think of any, though.

Closes #5408.
@jgm
Copy link
Owner Author

jgm commented Jun 30, 2020

I've committed a change that should make round-tripping possible. Please try it out and consider the potential drawbacks noted in the commit message. We may go back on this change if it proves problematic for reasons I haven't thought of.

@jgm jgm closed this as completed Jun 30, 2020
jgm added a commit that referenced this issue Jun 30, 2020
The reader now parses the contents of the markdown cell to a Pandoc
structure, but *also* stores the raw markdown in a `source`
attribute on the cell Div.  When we convert back to markdown,
this attribute is stripped off and the original source is used.
When we convert to other formats, the attribute is usually
ignored (though it will come through in HTML as a `data-source`
attribute, not unhelpfully).

I'll note some potential drawbacks of this approach:

- It makes it impossible to use pandoc to clean up or
  change the contents of markdown cells, e.g.
  going from `+smart` to `-smart`.

- There may be formats where the addition of the `source`
  attribute is problematic.  I can't think of any, though.

Closes #5408.
@teucer
Copy link

teucer commented Jun 30, 2020

@jgm

I don't have the know-how to test it. Any pointers?

If I understand well this would add a new "section" source to the output and use it to convert back.

Does it mean that the images will not be added as attachments (related to #5299)? E.g.

![A Markdown logo](markdown.png){#fig:fig1}

would stay exactly the same during the round trip.

@jgm
Copy link
Owner Author

jgm commented Jun 30, 2020

If you don't want to compile from source, you can grab a nightly tomorrow and test with it.

Images would still be added as attachments, yes.

@teucer
Copy link

teucer commented Jul 1, 2020

I have grabbed version 2.10 but don't see any difference (?)

@jgm
Copy link
Owner Author

jgm commented Jul 1, 2020

It's not in 2.10. It's in the nightly we build from the development version:
https://github.com/jgm/pandoc/actions/runs/153727389

@teucer
Copy link

teucer commented Jul 2, 2020

@jgm

I have downloaded the nightly and tested:

  • I think it would be useful to have a flag (e.g. --source) just to keep only the source both ways ipynb <-> md
  • If there is no source attribute in the markdown file, we could consider the content as source
  • When converting from markdown, we should not keep the source attribute
  • Related to above, when one does multiple rounds it keeps on adding nested source attributes in some instances

@jgm
Copy link
Owner Author

jgm commented Jul 2, 2020

I think it would be useful to have a flag (e.g. --source) just to keep only the source both ways ipynb <-> md

I don't understand this comment, can you elaborate?

If there is no source attribute in the markdown file, we could consider the content as source?

Can you give an example to illustrate what you're talking about? This is a bit too terse for me.

When converting from markdown, we should not keep the source attribute

Example?

Related to above, when one does multiple rounds it keeps on adding nested source attributes in some instances

That's not good, I'll look into that.

@jgm jgm reopened this Jul 2, 2020
@jgm
Copy link
Owner Author

jgm commented Jul 2, 2020

Reopening for possible tweaks

jgm added a commit that referenced this issue Jul 2, 2020
@jgm
Copy link
Owner Author

jgm commented Jul 2, 2020

OK, I see what you may be getting at; if I convert ipynb -> md now, I get a mess:

::: {.cell .markdown source="Inline     math, such as $a, b, c \\geq 2.5$, doesn't render properly in RST.

Multi-line _maths_ does:

$$ E = m c^2 $$
![A Markdown logo](test/lalune.jpg){#fig:fig1}"}
Inline math, such as $a, b, c \geq 2.5$, doesn\'t render properly in
RST.

Multi-line *maths* does:

$$ E = m c^2 $$ ![A Markdown logo](test/lalune.jpg){\#fig:fig1}
:::

At the very least, we'd need to somehow escape the newlines in the source attribute so we don't get something invalid here. But maybe using the raw block is a better option after all.

@jgm
Copy link
Owner Author

jgm commented Jul 2, 2020

For now I've reverted these commits. Back to drawing board.

@jgm jgm added this to the next release milestone Jul 2, 2020
@teucer
Copy link

teucer commented Jul 2, 2020

What I propose is essentially to only use raw markdown when converting. In your example, it seems natural to expect

![A Markdown logo](test/lalune.jpg){#fig:fig1}

and not

![A Markdown logo](test/lalune.jpg){\#fig:fig1}

I think it is very confusing when the markdown cells are converted. In this instance, there is a specific reason to not escape. This is the syntax used by pandoc-fignos .

The same goes when converting back, the snippet

![A Markdown logo](test/lalune.jpg){#fig:fig1}

should stay the same and not be replaced with an attachment. The user would have embedded as an attachment if that was his requirement.

In conclusion, I think the markdown cells should not be converted at all in both directions, or at least there should be a way (possibly a flag) to enforce it.

In my use case (reproducible reports), I am planning to:

  1. use jupyter as an interactive editor for markdown+code
  2. convert it to pure markdown for version controlling (possibly stripping out output cells): the collaborators would convert it back to ipynb and go back to step 1.
  3. convert to latex (via markdown, possibly stripping out code cells and keeping the output cells only) when finished

@jgm
Copy link
Owner Author

jgm commented Jul 2, 2020

You are using pandoc's link attribute syntax {#fig}. Does Jupyter notebooks actually support that?

@jgm
Copy link
Owner Author

jgm commented Jul 2, 2020

Two possibilities:

  1. Add an extension raw_markdown that affects the ipynb reader. If this is set, markdown cell contents are put into a RawBlock (Format "markdown") instead of being parsed as markdown. You'd trigger this with -f ipynb+raw_markdown.

  2. Add a command-line option to do this, something like --raw-markdown-cells or --lossless-markdown-cells.

I incline towards option 1 I think.

@teucer
Copy link

teucer commented Jul 2, 2020

Jupyter supports GitHub flavored markdown. It does not support the attribute syntax. However, this is not really important. As said, Jupyter is only helping to interactively build the final markdown. We can live with the fact that it cannot fully support Pandoc's Markdown.

Regarding the options, I also think option 1 is more natural.

jgm added a commit that referenced this issue Jul 18, 2020
The reader now parses the contents of the markdown cell to a Pandoc
structure, but *also* stores the raw markdown in a `source`
attribute on the cell Div.  When we convert back to markdown,
this attribute is stripped off and the original source is used.
When we convert to other formats, the attribute is usually
ignored (though it will come through in HTML as a `data-source`
attribute, not unhelpfully).

I'll note some potential drawbacks of this approach:

- It makes it impossible to use pandoc to clean up or
  change the contents of markdown cells, e.g.
  going from `+smart` to `-smart`.

- There may be formats where the addition of the `source`
  attribute is problematic.  I can't think of any, though.

Closes #5408.
@jgm jgm closed this as completed in 48fb6d9 Jul 24, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants