A new notebook document format for improved workflow integration #4

Open
wants to merge 11 commits into
from

6 participants
khinsen commented Sep 14, 2015

For the background, see this blog post.

Member

Carreau commented Sep 14, 2015

Hey Konrad.

One thing I would like to distinguish more clearly in the Jupyter notebook is the in-memory format versus the on-disk format. There are certainly things you can keep in memory, such as information about not-yet-run cells, or whether the kernel has restarted and cells are out of sync with it, that (obviously) do not belong on disk.

I'll read your proposal with more attention later.

Thanks!
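
A minimal sketch of the in-memory versus on-disk distinction, in Python. The class and field names (`LiveCell`, `pending_msg_id`, `kernel_session`) are hypothetical and not part of nbformat or the notebook client; the only point is that transient execution state lives on the in-memory object and is dropped when the cell is written out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LiveCell:
    """In-memory view of a code cell (hypothetical, not an nbformat type)."""
    source: str
    outputs: list = field(default_factory=list)
    # Transient state that only makes sense while a kernel is attached.
    pending_msg_id: Optional[str] = None   # request sent, reply not yet received
    kernel_session: Optional[str] = None   # session that produced the outputs

    def in_sync(self, current_session: str) -> bool:
        """True if nothing is pending and the outputs came from the live session."""
        return self.pending_msg_id is None and self.kernel_session == current_session

    def to_disk(self) -> dict:
        """Serialize only what belongs in the file; transient fields are dropped."""
        return {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": list(self.outputs),
            "source": self.source,
        }


cell = LiveCell(source="x = 1", kernel_session="session-a")
print(cell.in_sync("session-a"))   # same session, nothing pending
print(cell.in_sync("session-b"))   # the kernel was restarted
print(sorted(cell.to_disk()))      # keys that reach the file
```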

+
+## Problem
+
+Jupyter notebooks do not integrate well with other tools supporting complex workflows in computational science. Version control systems require a clear separation of human-edited content and computed content. The current notebook file format mixes them. Workflow managers and provenance trackers require that all computations be replicable. For interactive computations, replicability requires storing a full log of user actions. The current notebook file format does not preserve this information, although it is available at execution time.

@ellisonbg

ellisonbg Oct 18, 2015

I am not sure I agree with this statement about the separation of human-edited and computed content in version control. Also, I think your working definition of replicability is subtle enough that many folks in the community will disagree with your statement that it requires a full log of user actions. More background on your definitions would be helpful. To be clear, we regularly speak of the notebook as offering reproducibility for computations.

@khinsen

khinsen Oct 19, 2015

For my definitions of replicability and reproducibility see my blog post. This specific use of the terms is quite common by now, but not yet universal. In short, replication refers to repeating a calculation identically for verification, whereas reproduction is about re-doing a computational experiment using different tools. Replication is a purely technical step that requires no understanding of the scientific content, whereas reproduction implies understanding a method and implementing it differently.

@Carreau

Carreau Oct 19, 2015

Member

I agree with @khinsen on replicable. I try to use replicable with notebooks, even if the habit of saying reproducible is hard to get rid of. Nothing prevents us from linking to content that describes replicable versus reproducible more precisely. Also, people who read this document are most likely already aware of the difference.

+The core of the problem is that Jupyter's notebook file format is closely tied to Jupyter's functionality and design. It is essentially an on-disk representation of the internal state of the Jupyter notebook client, storing only the information required to open the notebook later or elsewhere.

@ellisonbg

ellisonbg Oct 18, 2015

Mostly true, but we do store some additional metadata that is useful in other contexts.

+
+## Proposed Enhancement
+
+The main goal of this proposal is a change of focus: notebooks should become digital documents with well-defined semantics, and the Jupyter notebook client should become just one out of many possible tools that process such notebook documents.

@ellisonbg

ellisonbg Oct 18, 2015

This sentence implies that the current notebook format doesn't have well-defined semantics and that there is only one tool for working with notebooks. I view both of those as being false, and I think many in the community would as well.

@khinsen

khinsen Oct 19, 2015

The documentation I am aware of is one-way: it allows me to read a notebook file and extract information from it. Many relations between different parts of a notebook are undocumented, but must be respected in order to write a meaningful notebook file. One example is the relation between source and output in code cells.

As for tools that use the Jupyter notebook format, is there a list of them anywhere?

ellisonbg Oct 18, 2015

Some general comments...

I think some of the ideas you have here are very interesting. The main point for me is that it would be useful to have a full record of code blocks that a kernel runs and a clear link between those code blocks+output and the ones that appear in a notebook. That idea is worth thinking about and is mostly independent of the broader version control issues.

At the same time, given the large number of users we currently have (and their millions of notebooks), there is no way we can completely break the existing notebook format. I am not at all convinced that breaking the existing notebook format is required to address the main point above. It would not be difficult to write a kernel session monitor that keeps a full record of the cells and their output in a way that is linkable to the same cells in a current-format notebook. With a small number of changes to the notebook format (hashes of code cells and/or cell UUIDs), the relationship between the kernel record and the notebook document could be strengthened even further.

If you can come up with concrete proposals that address the questions here without requiring any changes to the notebook format, there is a chance that the community could become interested. Most importantly, in order to justify even small breakages to the notebook format, we would need to see prototypes of the ideas here that leverage the existing notebook format and actually solve users' problems in significant ways.
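
To make the session-monitor idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: `SessionLog`, its record layout, and the per-cell identifier are hypothetical and do not correspond to any existing Jupyter API. It only shows how an append-only log keyed by cell ID and source hash could be linked back to cells in a notebook.

```python
import hashlib
import uuid


def source_hash(source: str) -> str:
    """SHA-1 of a cell's source text, used to link log records to cells."""
    return hashlib.sha1(source.encode("utf-8")).hexdigest()


class SessionLog:
    """Append-only record of everything sent to the kernel (hypothetical)."""

    def __init__(self):
        self.records = []

    def record(self, cell_id: str, source: str, outputs: list) -> None:
        """Log one execution: which cell, what code, what came back."""
        self.records.append({
            "seq": len(self.records),          # order of execution
            "cell_id": cell_id,
            "source_sha1": source_hash(source),
            "source": source,
            "outputs": outputs,
        })

    def history_for(self, cell_id: str) -> list:
        """All executions of one cell, in the order they happened."""
        return [r for r in self.records if r["cell_id"] == cell_id]

    def matches_last_run(self, cell_id: str, source: str) -> bool:
        """True if the cell's current source is what was last executed."""
        runs = self.history_for(cell_id)
        return bool(runs) and runs[-1]["source_sha1"] == source_hash(source)


log = SessionLog()
cell_id = str(uuid.uuid4())        # assumed per-cell identifier
log.record(cell_id, "x = 1", [])
log.record(cell_id, "x = 2", [])   # the user edited and re-ran the cell

print(len(log.history_for(cell_id)))
print(log.matches_last_run(cell_id, "x = 2"))
print(log.matches_last_run(cell_id, "x = 3"))
```

The log keeps every execution, including ones whose cells were later edited or deleted, which is the part the notebook file itself does not preserve.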

khinsen commented Oct 19, 2015

Sorry, no, I cannot do that. I am not sufficiently familiar with the internals of Jupyter to make such a proposal. The notebook format definition is not sufficient, as it doesn't specify what is and isn't a correct notebook file. For example, if I add a field to the "code cell" structure, is that a change to the notebook format or not?

As for solving users' problems, I am mainly interested in solving non-users' problems, i.e. the problems that prevent people like me from using Jupyter. It is unlikely that there is much demand for those in the existing community. My proposal is about extending the community.

ellisonbg Oct 19, 2015

@khinsen some of the statements you are making just aren't true. For example, we have a JSON schema for the notebook format and we validate notebooks against that schema. Here is that schema:

https://github.com/jupyter/nbformat/blob/master/nbformat/v4/nbformat.v4.schema.json

If a notebook doesn't validate against that schema, then it is not a valid notebook. If it does, it is.

khinsen commented Oct 20, 2015

@ellisonbg Thanks for the pointer to the schema! There is no reference to this in the notebook format documentation, so it's a bit hard to find and it's not clear whether it is part of the format definition or just a useful tool.

But the main information that's missing from my point of view is a definition of notebook semantics. I have added an example to the repository which is syntactically valid but semantically invalid: the output doesn't match the source code.

My tiny example is obviously wrong, so it's not a real problem. But for more complex computations it is not obvious which relations between source code and output are supposed to hold inside a notebook file. This is a core issue for replicability. It is also an issue for version control, because merge operations can easily lead to syntactically correct but semantically invalid files.

There is no way to validate semantics with reasonable effort, so notebook files that have been tampered with (such as my example) are not easy to detect. But a good notebook format should allow detection of accidentally introduced semantic inconsistencies. This is why my proposal includes SHA-1 hashes.

Could such hashes be added to the current notebook format? Syntactically, this looks difficult: if I understand the schema correctly, there is no room for adding fields. Perhaps one could figure out a way to squeeze this information into existing fields somehow. But the first question is: does the notebook format make any promises about consistency at all?

Member

Carreau commented Oct 20, 2015

> There is no reference to this in the notebook format documentation, so it's a bit hard to find and it's not clear whether it is part of the format definition or just a useful tool.

Good point, we can try to fix that.

About CRCs and other checksums that ensure consistency: I (personally) think it will be a tough sell to make them mandatory, and tools would have to implement them correctly to guarantee consistency. A tool can perfectly well save a 3+1 = 7 notebook with valid hashes.

We had a discussion on marking "dirty" cells in the UI, which turned out to be more complicated than we thought. One of the problems with the way the notebook currently works is that the kernel can get disconnected, so some decisions about what to persist, and where, are a bit weird. In particular, there is an in-memory versus an on-disk format. You could have an in-memory format that is not yet consistent (waiting for a kernel reply, containing the ID of a future reply), while the on-disk format has to be consistent. This is not something we do currently.

> Could such hashes be added to the current notebook format?

Yes.

> if I understand the schema correctly, there is no room for adding fields

No, the current schema does support adding keys. In general, the metadata: {} fields are arbitrary, and their interpretation is up to implementations.

An extra field in other places leaves the notebook technically valid, but the cell may become unrecognized, and implementations are allowed to ignore such fields.

This would allow us to make a minor revision, by adding fields, that would not be backward incompatible. Though, before committing to, for example, a sha1 key at the top level, nothing prevents us or anyone else from playing with metadata.sha1 = <sha1>; this would just be ignored by other implementations.

Jhamrick had a prototype of that for grading notebooks with nbgrader, in order to check that the test-case cells were not tampered with by students (in the end the hash was moved to SQLite for other reasons), but the metadata does contain other info which is nbgrader-specific.

> But the first question is: does the notebook format make any promises about consistency at all?

In the format itself, no. There used to be an optional signature to make sure the notebook was actually generated by the current machine (for security). This was moved to an SQLite database (Library/Jupyter/nbsignatures.db on OS X), so hashing and having (some) guarantees of consistency is possible, but likely a hard problem. In particular, I am concerned that if the requirements for creating a valid notebook are too high, people will just not use them.

Does that make sense and answer some of your questions?

I can try to come up with an nbconvert plugin that hashes all cells, stores the hashes, and allows you to check them. Would that help?
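
As a rough illustration of the metadata.sha1 idea (not the nbconvert plugin itself), the following standard-library Python sketch stamps each code cell with the SHA-1 of its source and later reports cells whose source no longer matches. The helper names `stamp` and `stale_cells` are made up for this example, and the notebook is a plain dict following the v4 cell layout.

```python
import hashlib


def cell_sha1(cell: dict) -> str:
    """SHA-1 of a cell's source; the source may be a string or a list of lines."""
    source = cell["source"]
    if isinstance(source, list):
        source = "".join(source)
    return hashlib.sha1(source.encode("utf-8")).hexdigest()


def stamp(nb: dict) -> None:
    """Store the hash of each code cell's source under metadata['sha1']."""
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            cell.setdefault("metadata", {})["sha1"] = cell_sha1(cell)


def stale_cells(nb: dict) -> list:
    """Indices of code cells whose hash is missing or no longer matches."""
    return [
        i for i, cell in enumerate(nb["cells"])
        if cell["cell_type"] == "code"
        and cell.get("metadata", {}).get("sha1") != cell_sha1(cell)
    ]


nb = {
    "nbformat": 4, "nbformat_minor": 0, "metadata": {},
    "cells": [
        {"cell_type": "code", "execution_count": 1, "metadata": {},
         "source": "3 + 1",
         "outputs": [{"output_type": "execute_result", "execution_count": 1,
                      "metadata": {}, "data": {"text/plain": "4"}}]},
    ],
}

stamp(nb)
print(stale_cells(nb))               # nothing changed since stamping
nb["cells"][0]["source"] = "3 + 4"   # e.g. edited by a merge; output still says 4
print(stale_cells(nb))
```

This only detects that the source changed after the hash was written. It cannot tell whether the stored output was ever correct, so a tool that writes a wrong output together with a fresh hash would still pass the check.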

khinsen commented Oct 21, 2015

Making hashes optional sounds fine, as long as it is straightforward for users to produce notebooks that do contain them. Any tool attempting validation would flag a hash-less notebook as "dubious".

The point of hashes is not to prevent buggy software from producing wrong notebooks; there is no way to prevent that in general. The point is to allow merging of independent changes to a notebook and recognize output data that has become invalidated in the course of the merge. However, I am not convinced that the addition of hashes is of much interest in itself. To make notebooks good citizens of version controlled repositories, I think it is also necessary to separate human input from computational output as I explain in my proposal. The reason is that merging differences in the computational output will most likely lead to a complete mess, including syntactically wrong MIME data and other unpleasant things.

I looked at the discussion about "dirty" cells and it seems to me that the difficulties with that idea are ultimately the same as the problems I am trying to solve with this proposal: the current notebook data model has no clear notion of dependencies between its data items. My "stale output" cell type addresses the same issue as those "dirty" cells but does so on the basis of real computational dependency information.

I don't quite understand the issue of making the requirements for creating a valid notebook too difficult. None of what I propose requires any user intervention. It's the Jupyter notebook tool that should do all the work behind the scenes.

Member

rgbkrk commented Oct 21, 2015

You can't verify the accuracy of all computations with hashes alone. You can't even fully verify them with certifying algorithms. Trivial ones certainly, but you're still at the mercy of the operating environment (versions of software, hardware, etc.). That's not to say that it shouldn't be done or isn't a plausible goal, just that it is a much larger scope than can be dictated in this proposal.

Member

minrk commented Oct 21, 2015

If the primary goal is separating input from output for version control, this can be done relatively simply, and there are a variety of ways to go about it (ipymd does it, nbexplode does it, etc.). Hashes are one possible implementation detail for locating output with its matching input, and since those hashes would reside exclusively in the not-always-tracked output file / directory / database / whatever, they wouldn't be polluting anything. We've talked about the 'output sidecar' file before, and could consider adopting one such implementation as an optional, official way to split the notebook storage.
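
A minimal sketch of what such an output sidecar could look like, assuming outputs are keyed by the SHA-1 of the cell source. The functions `split` and `join` are hypothetical and are not how ipymd or nbexplode work; the notebook is a plain dict following the v4 cell layout.

```python
import copy
import hashlib


def _key(cell: dict) -> str:
    """Hash of the cell source, used to locate its outputs in the sidecar."""
    source = cell["source"]
    if isinstance(source, list):
        source = "".join(source)
    return hashlib.sha1(source.encode("utf-8")).hexdigest()


def split(nb: dict):
    """Return (input-only notebook, sidecar mapping source hash -> outputs)."""
    stripped = copy.deepcopy(nb)
    sidecar = {}
    for cell in stripped["cells"]:
        if cell["cell_type"] == "code":
            sidecar[_key(cell)] = {
                "outputs": cell["outputs"],
                "execution_count": cell["execution_count"],
            }
            cell["outputs"] = []
            cell["execution_count"] = None
    return stripped, sidecar


def join(stripped: dict, sidecar: dict) -> dict:
    """Reattach outputs; cells whose source changed simply get none."""
    nb = copy.deepcopy(stripped)
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            saved = sidecar.get(_key(cell))
            if saved is not None:
                cell["outputs"] = copy.deepcopy(saved["outputs"])
                cell["execution_count"] = saved["execution_count"]
    return nb


nb = {
    "nbformat": 4, "nbformat_minor": 0, "metadata": {},
    "cells": [
        {"cell_type": "code", "execution_count": 1, "metadata": {},
         "source": "3 + 1",
         "outputs": [{"output_type": "execute_result", "execution_count": 1,
                      "metadata": {}, "data": {"text/plain": "4"}}]},
    ],
}

stripped, sidecar = split(nb)
print(join(stripped, sidecar) == nb)          # round trip restores the notebook
stripped["cells"][0]["source"] = "3 + 2"      # input edited under version control
print(join(stripped, sidecar)["cells"][0]["outputs"])
```

Only the stripped notebook would need to be tracked. One limitation of keying on the source alone is that two cells with identical source share a sidecar entry, so a real implementation would need an additional key such as a cell ID or position.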

Member

Carreau commented Oct 21, 2015

> I don't quite understand the issue of making the requirements for creating a valid notebook too difficult. None of what I propose requires any user intervention. It's the Jupyter notebook tool that should do all the work behind the scenes.

Making a field optional while its semantics are hard to get right is a recipe for something that goes unused or is used badly. We can do it right in the notebook, but people rely on the notebook format being simple enough to generate their own.

I don't want to end up with something like Windows Vista UAC, where everybody clicks without reading.

khinsen commented Oct 21, 2015

@Carreau Which programs other than Jupyter actually create notebook files from scratch? I have tried to find some but so far without success.

Member

Carreau commented Oct 21, 2015

PyCharm, off the top of my head.

Member

Carreau commented Oct 21, 2015

Sphinx Gallery, from Gael Varoquaux, wants to auto-generate notebooks from Sphinx docs, so that you can write docs as rst and offer a "download as notebook" option to users. It is in progress and may not be finished yet.

ipymd has to generate at least an in-memory one; runipy likely does too, as it has templated variables.

I don't know how much they rely on nbformat to do so, though.

khinsen commented Oct 26, 2015

I saw a presentation this morning at the Saclay Open Software Day on Sphinx Gallery, and also another project that generates notebooks as documentation of a computation. I think they actually illustrate the problem I am trying to solve, because they use notebooks not as a storage and exchange format, but for output only: it's strictly one-way. A bit like generating PDF, with some obvious added value. The goal of my proposal is that such tools could both read and write notebooks.

Member

Carreau commented Oct 26, 2015

Do you know if these presentations have been recorded? I saw Gael give a 5-minute lightning talk on Sphinx Gallery, but would like to know more.

I'm not sure why Sphinx Gallery couldn't read notebooks. IIRC Gael was complaining about manual editing, not the format.

Also, @fperez is likely to be around Saclay these days. You might be able to have a back-and-forth with him in person, which might be much more productive than discussing by mail.

@khinsen

This comment has been minimized.

Show comment
Hide comment
@khinsen

khinsen Oct 26, 2015

There's a camera next to me, so I suppose the sessions were recorded. I'll post a link when I know more. And yes, @fperez is here as well, he gave the opening keynote.

Carreau Oct 26, 2015

Member

OK, great! Say hi! (And I'm looking forward to the video.)

khinsen Oct 29, 2015

@Carreau The videos are up! Unfortunately I didn't find an occasion to talk to @fperez about anything technical such as this issue.

Carreau Oct 29, 2015

Member

> @Carreau The videos are up! Unfortunately I didn't find an occasion to talk to @fperez about anything technical such as this issue.

Thanks for the heads-up. I'll try to find some time to watch; it might help me understand!


@khinsen khinsen referenced this pull request in everpub/openscienceprize Feb 24, 2016

Open

Scope and deliverables #24

timoc Mar 18, 2016

@khinsen have you seen org-markup?
It is a plain-text markup language like Markdown, but better. It would seem to be the natural format for a Jupyter notebook, as embedding data and executable code is a native feature. It can be used for literate programming and to create reproducible research - see [http://orgmode.org/worg/org-papers.html] and [http://orgmode.org/worg/org-contrib/babel/uses.html] or [http://orgmode.org/worg/org-contrib/babel/examples/data-collection-analysis.html].

Bonus: It's a native GitHub format too [https://github.com/fniessen/refcard-org-mode/blob/master/README.org].

@ellisonbg org-markup (specifically org-babel) already has mechanisms to separate the source from the result of any given embedded calculation. Even better, it has tagging support. Tagging means you can tag parts of the document to assign a completeness status (e.g. TODO) or to mark what is executed at publishing time (e.g. noexport). I use this feature myself as part of my test-driven document development process and my literate programming development process.

In addition:

  • There are org-markup parser libraries available in many languages [http://orgmode.org/worg/org-tools/index.html], including Python [https://github.com/bjonnh/PyOrgMode], though some may not support all of the babel features.
  • Org-markup is plain text and so can be better managed with source code control. It is a 'source document' that can also store arbitrary data in tables and calculations, SQL queries, etc. Diffing and merging are easier due to the plain-text nature, and there is also a git merge tool somewhere.
  • Org-markup can be used to create publishable documents in many formats. For example, I am using org source files and pandoc to create MS Word and PDF documents.
  • Org-markup also supports blogging and project management, among other things.
  • Update:
    I should also mention that it can embed UML diagrams (plantuml), graphs (gnuplot), images, and many other media sources.
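
As an illustration of the source/result separation mentioned above, here is a minimal sketch that splits an org document into its source lines and its org-babel result lines. It assumes the common layout in which results follow a `#+RESULTS:` keyword line and run until the next blank line; the function name `split_org` is hypothetical, and result blocks that contain blank lines are not handled.

```python
def split_org(text):
    """Split org text into (source_lines, result_lines).

    Assumption: each result follows a '#+RESULTS:' keyword line and
    ends at the first blank line after it.
    """
    source, results = [], []
    in_results = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("#+RESULTS"):
            # Start of a computed section; drop the keyword line itself.
            in_results = True
            continue
        if in_results:
            if stripped == "":
                # A blank line closes the result; keep it as source spacing.
                in_results = False
                source.append(line)
            else:
                results.append(line)
            continue
        source.append(line)
    return source, results
```

Only the source lines would then need to go under version control; the result lines can be regenerated by re-evaluating the blocks.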

khinsen Mar 18, 2016

@timoc Yes, I know org-markup, I use it all the time for lots of things. And yes, it is one step up from Jupyter's format in terms of managing the ingredients of a notebook. But it doesn't keep a trace of the computation either, so in my view it is not sufficient.

timoc Mar 18, 2016

@khinsen, maybe I'm missing the point of this feature request. I am completely new to Jupyter, and I come from an Emacs background using org mode. I posted to this feature request explicitly because I saw the overlap.

If I understand this feature request at all, it is more from the comments than from the premise, but if I understand the premise of your original proposal, it is to separate these concerns. I agree.

The concerns are those of the (org/Jupyter) document as a source artefact, of the 'computation' as one or more compilation artefact(s), and of the result, which is the final set of result artefact(s) based on the 'compilation' artefacts. Even in a distributed computation environment, this would seem to be the case. This seems to be the same process you find in any sufficiently mature continuous build, test, and delivery infrastructure, if you separate the concerns as you outline.

I think org-markup is the choice for the source document format, because with tags you can encode the code to test and validate the outcome in the org document. I would suggest a pre, post, and final tagset, so that computational code fragments can be used to validate (possibly with a hash?) the computation and result artefacts as part of a traditional build approach.

I have yet to look at any of the videos, so maybe I am being naive about the challenges you face that org does not address.

In case my assumptions are not correct, can you suggest a presentation in English that gives better context on this problem?
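
The hash-based validation suggested above could look roughly like the following sketch. It assumes the source and result artefacts are available as text; the names `record` and `validate` and the manifest keys are hypothetical and do not belong to org-mode or Jupyter.

```python
import hashlib


def digest(text):
    """SHA-256 hex digest of a text artefact."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def record(source, result):
    """Build a small manifest tying a source artefact to its result."""
    return {"source_sha256": digest(source), "result_sha256": digest(result)}


def validate(manifest, source, result):
    """Check that a re-run produced the same result from the same source."""
    return manifest == record(source, result)
```

A build step could store the manifest next to the document and fail when `validate` returns `False`. Note that this only detects a change; it does not capture a trace of how the result was computed.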
