Proposal: 4.5 Format Cell ID #61

Closed
MSeal opened this issue Aug 11, 2020 · 47 comments · Fixed by #62

Comments

@MSeal
Contributor

MSeal commented Aug 11, 2020

This is a Pre-proposal for adding a cell id field to the Jupyter Notebook Format to be included in the next minor version bump.

Why

There's a range of applications that need a mechanism for recalling particular cells across mutations of the notebooks inside and outside of a particular notebook session. Some examples include:

  • generating url links to specific cells
  • external document association like code reviews or comments
  • associating a particular cell's outputs across runs to compare

Traditionally users have used custom tags on cells to track particular use-cases for cell activity. This works well for things like identifying the class of content within a cell (e.g. papermill parameters cell tag) but not for activities where an application may want to dynamically associate a cell to an action or resource. Additionally not having a cell id field has led to applications generating ids in different ways (e.g. metadata["cell_id"] = "some-string" vs metadata[application_name]["id"] = cell_guuid).

Most resource applications include ids as a standard part of the resource / sub-resources. This proposal does not touch on an overall notebook id field, but cells, as sub-resources, are oftentimes treated relationally, and adding an id field to them would help improve the quality of abstractions built on top of notebooks.

Outline

This change would be wholly encompassed by adding an id field to each cell type in the 4.4 json_schema. Specifically, the raw_cell, markdown, and code_cell required sections would add the id field with the following schema:

"id": {
    "description": "A UUID field representing the identifier of this particular cell.",
    "type": "uuid"
}

The uuid type was recently added to json-schema referencing RFC.4122. If needed for older library implementations one can also use a str format with a regex pattern match.
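As a rough illustration of the string-plus-regex fallback mentioned above, here is a minimal Python sketch. The pattern and the helper name `is_uuid_string` are assumptions made for illustration; they are not part of the proposal or of nbformat.

```python
import re
import uuid

# Assumed pattern: the canonical 8-4-4-4-12 hexadecimal text form of a UUID.
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def is_uuid_string(value):
    """Return True if value is a string in canonical UUID text form."""
    # fullmatch requires the whole string to match, so no anchors are needed.
    return isinstance(value, str) and UUID_PATTERN.fullmatch(value) is not None


# Example: a freshly generated uuid4 passes, an arbitrary string does not.
print(is_uuid_string(str(uuid.uuid4())))
print(is_uuid_string("some-string"))
```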

This field would always be required for any future nbformat versions (4.5+). The field would not be optional, to avoid applications having to conditionally check whether an id is present. This is an important aspect of the change, as adding an optional field would lead to partial implementation in applications and difficulty in having consistent experiences built on top of the id change. Older formats can be loaded by nbformat and trivially updated to the 4.5 format by running uuid.uuid4() to populate the new field. The change would go into effect once the nbformat PR is submitted, merged, and released with a new schema.
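The upgrade step described above could look roughly like the sketch below. It works on a notebook already parsed into a plain dict; the function name `add_cell_ids` and the way the minor version is bumped are assumptions for illustration, not nbformat's actual upgrade code.

```python
import uuid


def add_cell_ids(notebook):
    """Populate a missing "id" on every cell of a notebook dict, in place."""
    for cell in notebook.get("cells", []):
        if "id" not in cell:
            # uuid4 is the generator named in the proposal text.
            cell["id"] = str(uuid.uuid4())
    # Assumption: the upgraded notebook is marked as minor version 5 or later.
    notebook["nbformat_minor"] = max(notebook.get("nbformat_minor", 0), 5)
    return notebook


old_notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "cells": [
        {"cell_type": "code", "source": "x = 1"},
        {"cell_type": "markdown", "source": "# Title"},
    ],
}
upgraded = add_cell_ids(old_notebook)
```

Cells that already carry an id are left untouched, so running the function twice does not rewrite existing ids.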

Why a JEP

These two aspects defined as requiring a JEP are both met by this proposal:

Does the proposal/implementation PR impact multiple orgs, or have widespread community impact?
Example: Updating nbformat
Does the proposal/implementation change an invariant in one or more orgs?
Example: Defining a unique cell identifier

One of the examples is literally this proposal, so that seems fortuitous for formalizing a JEP 😄

Who'd be Interested

The 10 assignees + @captainsafia, @ivanov, @yuvipanda and probably several others I missed. GitHub only allows 10 assignees and this topic has come up for the past couple of years in conversation with most of the community, so I am including most of the active people I can think of.

@choldgraf
Contributor

I'm +1 on the idea in general. A couple quick thoughts:

  • I think this would be a helpful addition for folks that want to reference cells in general (e.g. maybe @bollwyvl would be interested in this as I know he's interested in linked data)
  • Specifically I wonder if this could be helpful for the scholarship / archiving / citation world
  • I believe the web annotation world such as hypothes.is would like this as well (so they could know which cells a comment points to)

some questions

  • Would cell ID be changed if the cell content changes, or just created one time when the cell is created?
    • As an extreme example: What if the content of the cell is cut out entirely and pasted into a new cell? My assumption is the ID would remain the same, right?
  • So if nbformat >= 4.5 loads in a pre 4.5 notebook, then a cell ID would be generated and added to each cell?
  • If a cell is cut out of a notebook and pasted into another, should the cell ID be retained?

@Carreau
Member

Carreau commented Aug 11, 2020

I think this was suggested some time ago. I believe it likely makes some implementations really complicated, typically how to handle:

  1. splitting cells
  2. what if I copy and paste (surely you do not want duplicate ids...)
  3. what if you cut-paste (surely you want to keep the id).
  4. what if you cut-paste, and paste a second time .. hum.

I think this has a strong potential to make notebook format implementations wrong, at least if each cell has an id and the id must be unique. Depending on how the ID is generated, it also means that notebooks will have randomly variable fields, and the order of operations in which you create a notebook changes its final (on-disk) state (bad for reproducibility).

To ensure uniqueness it would be better to change the notebook format into a (list of ids) and a (mapping of id to cells). Though that's profoundly different, and cells need to know their id, which is not that good.

{and then you can change the "list" to any other DAG structure if you wish but let's forget those two paragraphs for now}.
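The "list of ids plus mapping of id to cells" layout mentioned above could be sketched as follows. The function names and the in-memory shape are hypothetical and only illustrate the idea; nothing here reflects an actual or planned nbformat structure.

```python
def to_id_map(cells):
    """Split a list of cells (each with an "id") into (order, cells_by_id)."""
    order = []
    cells_by_id = {}
    for cell in cells:
        cell_id = cell["id"]
        if cell_id in cells_by_id:
            # A mapping cannot hold two entries under one key, so duplicates
            # have to be rejected (or repaired) at conversion time.
            raise ValueError(f"duplicate cell id: {cell_id!r}")
        order.append(cell_id)
        cells_by_id[cell_id] = cell
    return order, cells_by_id


def to_cell_list(order, cells_by_id):
    """Rebuild the ordered list of cells from the two structures."""
    return [cells_by_id[cell_id] for cell_id in order]


cells = [{"id": "a", "source": "x = 1"}, {"id": "b", "source": "x + 1"}]
order, cells_by_id = to_id_map(cells)
```

Reordering cells then only touches `order`, while uniqueness of keys in `cells_by_id` comes for free from the mapping itself.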

I will not oppose such a change, but I think guaranteeing uniqueness and auto-generation will be quite tough, and has the potential to not be followed.

@blois

blois commented Aug 11, 2020

The metadata object has additionalProperties: true and Colaboratory has been generating notebooks with an id field with a string value that is not a proper uuid. Introducing a stricter requirement may lead to user pain.

I do not believe that a formal UUID is necessary because:

  • De-duplication is necessary since tools and users will invariably create notebooks with duplicated values (such as a naive merge flow). Tools should gracefully handle this by giving a new unique ID when a conflict is detected.
  • In Colab the ID appears in the URL and a full UUID is more text than strictly necessary. I imagine cleaner URLs for cells would apply to other projects as well.

Because of the de-duplication I would even say that a counter would be more succinct and sufficient.

Implementation:

  1. splitting cells

One of the cells gets a new ID.

  2. what if I copy and paste

On paste, give the pasted cell a different ID if a cell with the same ID already exists.

  3. what if you cut-paste

See above.

  4. what if you cut-paste, and paste a second time .. hum.

See above.

Additionally on notebook load, if an ID is duplicated then give subsequent cells new IDs and consider this a user edit operation.
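A load-time repair along these lines might look like the sketch below. It is not Colab's implementation: the function name `dedupe_cell_ids`, the use of `secrets.token_hex` for replacement ids, and the assumption that every cell already has an "id" key are all illustrative choices.

```python
import secrets


def dedupe_cell_ids(cells):
    """Give a fresh id to each cell whose id repeats an earlier cell's id.

    Modifies the cells in place and returns the indices that were changed,
    so a caller can treat the repair as a user edit.
    """
    # Reserve every id present on load so a replacement never collides
    # with an id that a later cell is entitled to keep.
    reserved = {cell["id"] for cell in cells}
    seen = set()
    changed = []
    for index, cell in enumerate(cells):
        cell_id = cell["id"]
        if cell_id in seen:
            cell_id = secrets.token_hex(4)
            while cell_id in reserved:
                cell_id = secrets.token_hex(4)
            reserved.add(cell_id)
            cell["id"] = cell_id
            changed.append(index)
        seen.add(cell_id)
    return changed


cells = [{"id": "a"}, {"id": "b"}, {"id": "a"}, {"id": "a"}]
changed = dedupe_cell_ids(cells)
```

The first cell holding a given id keeps it; only the subsequent duplicates are renamed.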

@MSeal
Contributor Author

MSeal commented Aug 11, 2020

I can get into deeper discussion on the actual JEP as well, but these are good questions. Let me take a stab at answering what I think should be done:

Would cell ID be changed if the cell content changes, or just created one time when the cell is created?
As an extreme example: What if the content of the cell is cut out entirely and pasted into a new cell? My assumption is the ID would remain the same, right?

Correct. It stays the same once created.

So if nbformat >= 4.5 loads in a pre 4.5 notebook, then a cell ID would be generated and added to each cell?

Yes.

If a cell is cut out of a notebook and pasted into another, should the cell ID be retained?

No. Much like copying content out of one document into another -- you have a new cell with equivalent contents and a new id.

splitting cells

One cell (preferably the one with the top half of the code) keeps the id, the other gets a new id. This could be adjusted if folks want a different behavior without being a huge problem so long as we're consistent.

what if I copy and paste (surely you do not want duplicate ids...)

Correct, the copied cell should have a new id -- I should have denoted that cell ids should be unique within a document and not reused.

what if you cut-paste (surely you want to keep the id).

I'd agree it should try to preserve the ID -- you're moving the cell in its entirety.

what if you cut-paste, and paste a second time .. hum.

New id -- you have a duplicate of the original. If we go with an "all cell ids must be unique within a notebook" rule, this would trump other behavior when in conflict.

@captainsafia
Member

Thanks for opening this issue, @MSeal! I'm +1 on this proposal overall as it definitely helps with a lot of scenarios that require us to reason about cells as if they were independent entities.

The UX considerations that @Carreau points out are good to identify. I think most of the work here will be in establishing the conventions (duplicated cells have unique IDs, etc.) rather than in the technical implementation. Hopefully, we can resolve these in the course of the JEP.

Colaboratory has been generating notebooks with an id field with a string value that is not a proper uuid.

@blois Can you share what string value you use? Do you build universal uniqueness into it, or do you expect the IDs to be local to each notebook?

@MSeal
Contributor Author

MSeal commented Aug 11, 2020

@blois But additionalProperties is false for each cell type: https://github.com/jupyter/nbformat/blob/master/nbformat/v4/nbformat.v4.4.schema.json#L183. This is not adding to metadata but setting the ID in the cell itself.

I do not believe that a formal UUID is necessary because:

De-duplication is necessary since tools and users will invariably create notebooks with duplicated values (such as a naive merge flow). Tools should gracefully handle this by giving a new unique ID when a conflict is detected.
In Colab the ID appears in the URL and a full UUID is more text than strictly necessary. I imagine cleaner URLs for cells would apply to other projects as well.

Open to suggestions on this. UUIDs have been the de facto standard for document id fields for a while now in most settings. They simplify the contract for specifying how one sets a missing id, as well as the format of the message, in a universal manner. If not UUID, we'd need some regex schema to follow, I imagine. I do agree the URLs could get large with a UUID pattern.

@choldgraf
Contributor

choldgraf commented Aug 11, 2020

Another thought - do we consider the notebook as a part of uniquely identifying a cell? E.g. is each cell's identity a combination of a notebook + a cell ID, or just a cell ID? I don't think there's anything like a unique ID for notebook either...not sure if that is a topic that has been discussed before. I'm not sure how relevant it is but seems like it could be useful if we're considering ways to have shorter cell IDs so they're nicer from a UX perspective (e.g. if each notebook has a unique notebook ID and cells are referred to by notebookID + cellID, couldn't the cell IDs within a notebook just be a strictly-increasing integer value or something simple like that?).

@MSeal
Contributor Author

MSeal commented Aug 11, 2020

Another thought - do we consider the notebook as a part of uniquely identifying a cell? E.g. is each cell's identity a combination of a notebook + a cell ID, or just a cell ID? I don't think there's anything like a unique ID for notebook either...not sure if that is a topic that has been discussed before. I'm not sure how relevant it is but seems like it could be useful if we're considering ways to have shorter cell IDs so they're nicer from a UX perspective.

I wanted to start with just the cell id first as notebook id has more complications. I believe each cell's identity would be notebook + a cell ID since you might copy a notebook and edit it -- meaning you share ids across notebooks now. There's no universal way to guarantee this doesn't happen since you might load the notebooks in different tools, or at different times, or edit the JSON directly. But within a managed system, you could in theory make each notebook cell id unique to that entire system if you are using UUIDs. Say for example a teaching course, where each course notebook has unique cell ids, would be within reach of an application on top of this abstraction addition.

@blois

blois commented Aug 11, 2020

@MSeal thanks for the correction. Colab has made heavy use of cell IDs for many many years, it does seem generally useful.

I want to stress that Colab currently generates cell IDs that are 12 characters long (but will accept any string value), and 12 chars is honestly too long given that non-unique IDs within a single notebook have to be dealt with gracefully. Colab's use of cell IDs within the URL seems like a common scenario.

@blois

blois commented Aug 12, 2020

@captainsafia IDs will be unique within the notebook but Colab will automatically fix conflicts on open. There are many other tools which will generate conflicts.

The cell ID is probably best considered a fragment of a URL where the notebook would constitute the rest of the path. A globally unique cell ID would be the combination of a unique notebook ID (URL) and cell ID. In the case of a github notebook this would include repo, path and revision.

An example Colab notebook with some auto-generated and manually modified cell IDs is:
https://colab.research.google.com/gist/blois/947c4016fbe4726d1976ea0b63867e4f/cell_id_example.ipynb#scrollTo=this_is_an_example_of_a_really_long_cell_ID.

@blois

blois commented Aug 12, 2020

This field would always be required for any future nbformat versions (4.5+).

Is there a good way to avoid some of the issues such as jupyter/nbformat#167? Currently in JupyterLab 2.1.2 one can open a notebook with format 4.5, add cells, then save resulting in a 4.5 notebook without cell_ids.

@bollwyvl

Gah, I think adding top-level attributes to things that were previously additionalProperties: false that don't give users extremely cool and obvious benefits really fast in lots of clients are going to be a hard sell.

Other thing is: nbformat already has name reserved in all the cell metadatas which already says the same thing (should be unique, etc.), aside from being a well-known format...

The uuid type was recently added to json-schema referencing RFC.4122. If needed for older library implementations one can also use a str format with a regex pattern match.

If it must be guid, 👍 to keep-it-stupid-strings regexen. Having them be well-formed is a great property, but relying on bleeding edge features for a spec seems a hard road. Also the timescale of cell generation is relatively slow vs kernel messages, where you really want to know how wide your messages on the wire are... human readable values would indeed be best, but if auto-generated: short, starting with a letter, and unlikely to generate profanity (really, this is important for a file format you expect people to be able to email) is probably better than a full guid. It would be worth digging up a cross-platform approach that has some of those properties.

More broadly: I would not start relying on features from draft8 (or whatever they are calling it) until they are broadly supported by important jupyter upstreams: presently python-jsonschema and ajv. Last I checked, ajv might never support draft8 due to maintainability and complexity concerns. Also, jumping from draft4 to draft8 sounds like a lot more than a minor release. draft7, on the other hand, is worth investigating, and is very broadly supported.

I believe the web annotation world such as hypothes.is would like this as well (so they could know which cells a comment points to)

well.... to my knowledge there is still not an official WADM selector for JSON. So this isn't going to move the dial on making the format annotatable in-place... directly. However, hoisting the concern that most clients would actually start populating the id (whatever it's called) is a good start, as we could describe the projection of the id into a concrete representation that can be annotated: in most Jupyter/hypothes.is cases, that would be the DOM: e.g. #u-u-i-d... another strike against uuid, though: some DOM APIs can't actually deal with uuids that start with numbers, so a canonical prefix would be required.

As to being able to validate uniqueness: yerp, nope, can't do it with JSON Schema, aside from properties and uniqueItems. To @Carreau's point: you can enforce uniqueItems on an array, so "real" ordered ids + keys could happen there, and a jsonpointer could refer to it unambiguously... but again, that's a major, super-breaking change.

If nbformat (and other official jupyter ipynb implementations) were going to start enforcing id (or metadata.name), these rules should probably also be encoded in a language-neutral way. JSON-e or JMESPath could maybe validate for uniqueness... but neither is a standard (though both have multiple implementations).

the notebook as a part of uniquely identifying a cell

To further pursue the annotation question: at this point about the only thing that uniquely identifies a cell is a verifiable source of truth, like a git commit, ipfs id, trusted URL endpoint, etc, then the notebook path (probably), and then the cell. Because the first part of that gives you veracity... you can just annotate using the cell number, nothing special there. But useful annotation probably needs to talk about a place in a cell, e.g. "the range of characters 5-10 on line 10 of the output of cell 2". Ugh.

True uniqueness is not something a file format can or should be able to enforce.

@blois

blois commented Aug 12, 2020

@bollwyvl that's a great technical explanation of why this is a difficult problem. I'd like to underscore that it is still extremely useful to have:

  1. Reasonably unique, reasonably stable identifiers for cells which are included by common notebook editing tools.
  2. General understanding among tools that these identifiers may be auto-generated and will have a best-effort approach at persisting IDs across edits.
  3. Allowing the Jupyter ecosystem to leverage these identifiers to deliver improved user experiences.

Colab has been doing this for quite some time and I think it would benefit the broader ecosystem if it were more broadly available to tools. Colab uses it for:

  1. Providing relatively stable URLs to specific cells within a notebook.
  2. Linking Google Drive comments back to cells within a notebook.
  3. ... And quite a bit more.

I'd also like to emphasize that Colab has been getting along fine with metadata.id- it's not perfect but I think it would benefit the ecosystem if it could be relied on more broadly.

@minrk
Member

minrk commented Aug 12, 2020

For the spec, as we've done with message ids, requiring only that it be a unique string within the scope of the notebook is my preference since it allows for a variety of strategies. I would be specific about:

  • if undefined, e.g. when loading from older formats, it should be filled out with a unique value
  • cell id shall be a string, unique within a given notebook
  • uniqueness across notebooks is not a goal
  • UUIDs are one valid, simple way of ensuring uniqueness, but not necessary
  • If useful, we can specify a max length of e.g. 64 characters (this was relevant to static-allocation of msg ids in C++ implementations, but probably not as useful a restriction in the file format).

I wouldn't actually recommend using UUIDs in our own default implementations. Lots of large random strings in notebooks can be frustrating, and are something we've tried hard to avoid. 128-bit UUIDs are also vast overkill for the level of uniqueness we need within a notebook with <1000 candidates for collisions. They make for opaque URLs, noise in the files, etc. The shorter and more intelligible the better, especially for something that is to be used in user-visible places like links.

It should be a valid strategy, when populating cell ids from a notebook on import from another id-less source or older format version, to use e.g. strings from an integer counter. In fact, if an editor app keeps track of current cell ids, the following strategy ensures uniqueness:

cell_id_counter = 0
existing_cell_ids = set()

def get_cell_id(cell_id=None):
    """Return a new unique cell id

    if cell_id is given, use it if available (e.g. preserving cell id on paste, while ensuring no collisions)
    """
    global cell_id_counter

    if cell_id and cell_id not in existing_cell_ids:
        # requested cell id is available
        existing_cell_ids.add(cell_id)
        return cell_id

    # generate new unique id
    cell_id = f"id{cell_id_counter}"
    while cell_id in existing_cell_ids:
        cell_id_counter += 1
        cell_id = f"id{cell_id_counter}"
    existing_cell_ids.add(cell_id)
    cell_id_counter += 1
    return cell_id

def free_cell_id(cell_id):
    """record that a cell id is no longer in use"""
    existing_cell_ids.remove(cell_id)

If bookkeeping of current cell ids is not desirable, a 64-bit random id (11 chars without padding in b64) has a 10^-14 chance of collisions on 1000 cells, while an 8-char b64 string (48b) is still 10^-9.
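The collision figures above follow from the usual birthday-bound approximation, which can be checked with a short sketch. The function names `collision_probability` and `random_cell_id` are illustrative assumptions, and URL-safe base64 is just one possible encoding.

```python
import base64
import math
import secrets


def collision_probability(n_cells, bits):
    """Birthday-bound estimate that any two of n_cells random ids collide."""
    pairs = n_cells * (n_cells - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-pairs / 2**bits)


def random_cell_id(n_bytes=8):
    """Return a URL-safe base64 id built from n_bytes of randomness."""
    raw = secrets.token_bytes(n_bytes)
    # Strip the "=" padding: 8 bytes give 11 characters, 6 bytes give 8.
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


p64 = collision_probability(1000, 64)
p48 = collision_probability(1000, 48)
```

With 1000 cells, `p64` comes out on the order of 10^-14 and `p48` on the order of 10^-9, in line with the numbers quoted above.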

@minrk
Member

minrk commented Aug 12, 2020

I think adding top-level attributes to things that were previously additionalProperties: false that don't give users extremely cool and obvious benefits really fast in lots of clients are going to be a hard sell.

I don't quite follow this line. What's the downside? Defining new properties where they weren't previously is the main point of new minor revisions. Defining them where additionalProperties: true means there is an opportunity for collision in prior formats, while adding them where additionalProperties: false means there can't be. "Downgrade" from an unknown future minor revision generally means that these additional fields will be lost, which is okay.

@betatim
Member

betatim commented Aug 12, 2020

@choldgraf re: ID for notebooks, there is this thread (and links in it) jupyter/nbformat#148

@bollwyvl

main point of new minor revision

yerp, always get confused... minor vs point, and of course it's in the title of issue. Adding an additional prop required, with no grace period, still seems rough. Perhaps a more measured rollout:

  • 4.5: adds id, transparently "fixes" id omissions, collisions, etc. and hoists/derives cell/metadata/name if present,
    • but starts throwing warnings
  • 4.6: required, stop applying fixes, and requires clients/processors to provide notebook-unique IDs

following strategy ensures uniqueness

I think a lot of this becomes more tenuous in relation to multi-client support, to which the nbformat contributes a non-trivial amount of headache (see, ordered list of objects). An incrementer doesn't scale for people coming in-and-out of a "swarm" of editors, and still expecting things to "work" without a lot of heuristic approaches.

With some substantial implementation complexity: consider an optional, notebook-level id_mode: notebooks created locally, with no intent of being shared, wouldn't even add/look at this field, or cell ids, for all of the reasons described by @minrk.

id_mode: count would try to do a simple counting, based on that algorithm.

As to a more multi-client robust, cross-language approach: yes, an algorithm can be encoded in a few hours, but I don't know of a standard that meets all our needs 😢. But again from the widely-implemented-but-not-a-standard stable, there is nanoid, which currently has 14 language implementations (notably not julia or R). The ids aren't all that pretty or short, though, e.g. V1StGXR8_Z5jdHi6B-myT, but they are URL-safe and do support things like profanity filters. But anyhow, this could be an id_mode: nanoid mode.

ID for notebooks,

yeah, if trying to do multi-level, de-referenceable identity, it would be somewhat hurtful to not look at the JSON standards officially supported by some of the tools mentioned above (e.g. WADM).

id_mode: jsonschema: JSON Schema does support $id, which is a URI reference. You kinda get in-document dereferencing with it, but a reference is kind of a weird id. Also, as mentioned, there is not a standard that supports selecting from a list by attribute value (CSS and XPath, do, for XML-style stuff).

id_mode: jsonld: JSON-LD uses @id, which must be a full URI, but also supports a predictable algorithm for "blank node" ids (based on position in the document). My favorite way to get clean URIs like this is with the urn syntax, which is great, e.g. urn:<cell count when created>:<small random value>, urn:15:abc123. Oh yeah, and JSON-LD handles localization like a champ.

Going further, the "shape" of the document can be validated with constraints, but that is not as lightweight as JMESPath/JSON-e.

This mode would probably also require a @context at the root of the notebook. And while we're at it, why not allow (then enforce) more things to have @id: off the top of my head, from having worked on collaborative things that would benefit from identity: notebook root, cell, output_n.

But...

Let's burn the id_mode straw man to the ground: my recommendation would be to just adopt an existing identity standard rather than making up something new. Then there's at least a chance folk will be able to use an existing, conformance-tested implementation to parse, validate and dereference it.

@MSeal
Contributor Author

MSeal commented Aug 12, 2020

So general consensus sounds like let's make the id a unique string, which could accommodate a uuid if needed but defaults to something shorter and simpler with a fixed range of characters and a min/max length. I think I can easily adopt that into the actual JEP proposal.

On that front, following https://github.com/jupyter/enhancement-proposals/blob/master/jupyter-enhancement-proposal-guidelines/jupyter-enhancement-proposal-guidelines.md#phase-1-pre-proposal who should be the designated Shepherd for this pre-proposal?

I'll keep trying to address concerns or adjust design constraints in this thread in the meantime until we have a go/no-go about me promoting to a full JEP.

@MSeal
Contributor Author

MSeal commented Aug 12, 2020

Is there a good way to avoid some of the issues such as jupyter/nbformat#167? Currently in JupyterLab 2.1.2 one can open a notebook with format 4.5, add cells, then save resulting in a 4.5 notebook without cell_ids.

I think we can make addressing this a requirement of the change. I don't see this being insurmountable going forward.

True uniqueness is not something a file format can or should be able to enforce.

Uniqueness within a single file seems easily achievable. It can't be done purely with json schema without contorting the format, as @Carreau noted. We can enforce this at the library / application level without a lot of cause for concern. Specifically, nbformat can be used to validate uniqueness and even provide opt-in repair of uniqueness issues if a client makes a mistake. Without uniqueness you can't reliably build on top of the abstraction.
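A library-level uniqueness check of the kind described here could be as small as the following sketch. The function name `find_duplicate_cell_ids` is hypothetical and is not an existing nbformat API; it only shows that the check is straightforward once a notebook is loaded as a dict.

```python
from collections import Counter


def find_duplicate_cell_ids(notebook):
    """Return the sorted cell ids that appear more than once in a notebook dict."""
    counts = Counter(
        cell["id"] for cell in notebook.get("cells", []) if "id" in cell
    )
    return sorted(cell_id for cell_id, count in counts.items() if count > 1)


notebook = {
    "cells": [
        {"id": "intro", "cell_type": "markdown", "source": ""},
        {"id": "setup", "cell_type": "code", "source": ""},
        {"id": "intro", "cell_type": "code", "source": ""},
    ]
}
duplicates = find_duplicate_cell_ids(notebook)
```

A validator could raise when the returned list is non-empty, while an opt-in repair mode could rename the later occurrences instead.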

Other thing is: nbformat already has name reserved in all the cell metadatas which already says the same thing (should be unique, etc.), aside from being a well-known format...

This came up years ago in the id conversations (I can look for the public threads later). It's not unique, not required, and not constrained to certain characters, which made building on top of name unreliable and inconsistent. It also puts your display of information and programmatic references at odds with each other. Time since hasn't improved this story, and it parallels other systems that needed a split between display info and identity info.

To further pursue the annotation question: at this point about the only thing that uniquely identifies a cell is a verifiable source of truth, like a git commit, ipfs id, trusted URL endpoint, etc, then the notebook path (probably), and then the cell. Because the first part of that gives you veracity... you can just annotate using the cell number, nothing special there. But useful annotation probably needs to talk about a place in a cell, e.g. "the range of characters 5-10 on line 10 of the output of cell 2". Ugh.

Annotating with cell number is not sufficient. You get disassociation when the cell gets moved within a notebook without an identifier to map against. You can do this with a metadata field (like Colab and Deepnote do) but it's inconsistent across services and causes fragmentation for people trying to build on top of ids in a general manner. Noteable is going to be in the same boat, where we'll have to implement our own id if it's not part of the standard.

yerp, always get confused... minor vs point, and of course it's in the title of issue. Adding an additional prop required, with no grace period, still seems rough. Perhaps a more measured rollout:

I'd much prefer to add a new required field in one go, and have the most common base libraries support roll-forward / roll-back. We have schema version for exactly this purpose and the proposed change is backwards / forward compatible here.

id_mode: count would try to do a simple counting, based on that algorithm.

This can have a risk of collision or confusion if a user expects the ids to be sequential integers in the notebook permanently. Otherwise I don't see a strong reason to block a number string being used. I think having a random string of length X (6, or 8? characters) would give a better default expectation for how this field is intended to be used.

As to a more multi-client robust, cross-language approach

UUID strings (or binaries that can be stringified) are well supported cross-language. If we allow a format that could include those, this is a universally usable pattern -- or we could suggest using uuidv4 and taking the last k characters. Pseudo-random strings are also pretty easy cross-language if we specify the characters allowed.
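Both options mentioned here are easy to sketch in Python. The helper names, the default length of 8, and the lowercase-plus-digits alphabet are assumptions chosen for illustration; the thread had not settled on a character set or length at this point.

```python
import secrets
import string
import uuid

# Assumed alphabet; the allowed characters were still under discussion.
ALPHABET = string.ascii_lowercase + string.digits


def short_id_from_uuid(k=8):
    """Return the last k hex characters of a fresh uuid4 (1 <= k <= 32)."""
    if not 1 <= k <= 32:
        raise ValueError("k must be between 1 and 32")
    return uuid.uuid4().hex[-k:]


def random_id(length=8, alphabet=ALPHABET):
    """Return a random string of the given length drawn from alphabet."""
    return "".join(secrets.choice(alphabet) for _ in range(length))


print(short_id_from_uuid())
print(random_id())
```

Either helper would still need a uniqueness check against the ids already present in the notebook, since short random strings can collide.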

Let's burn the id_mode straw man to the ground: my recommendation would be to just adopt an existing identity standard rather than making up something new. At least there's at least a chance folk will be able to use an existing, conformance-tested implementation to parse, validate and dereference it.

Sounds good -- thanks for the inputs.

@ellisonbg
Contributor

I am in favor of seeing this move forward. In practice, it turns out to be difficult to implement a jupyter frontend without some notion of cell ids. JupyterLab even passes its cell id as metadata to the kernel, and some kernels leverage this to do cell dependency tracking. The only question I have is if it makes sense to go further and replace the "list of cells" structure by "list of cell ids" + "map from cell ids to cells".

@echarles
Member

I am +1 on an id for cells. I have had to implement JupyterLab extensions needing ids for the cells and was sorry not to have an explicit field for that.

Although uuid sounds like a good fit, a simple string would bring more flexibility.

My understanding is that the spec defines the format (mandatory/optional, type), but it is up to the frontend to decide how to use that.

@MSeal
Contributor Author

MSeal commented Aug 13, 2020

They may not appear superfluous to your use cases. We focus a lot on the interactive nature of notebook documents but the reality is that notebooks spend most of their lives static and unchanged. In this state, cells are an ordered list and they already have IDs that jsonpointer can resolve; any extra id is superfluous to the value of the static document. In ten years, I expect that cells, metadata, source, cell_type will endure, but the meaning of id will change over time. IDs in the notebook format seem to disrupt the archival quality of the existing nbformat.

Can you describe more how archival would be impacted? If they're set once and preserved as artifacts move around I don't quite follow how it'd impact persistence / recall since the id wouldn't change over time unless the application chooses to rewrite it. One of the intentions here is to help application recall as a notebook may change ordering of cells (in or out of said application) but wish to preserve association of a particular cell independent of position in a standard fashion.
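A small Python sketch of the difference being discussed: a positional reference silently points at a different cell once the order changes, while a lookup by id keeps finding the same one. The cell contents, the `id` values, and the helper `find_by_id` are all made up for illustration.

```python
from typing import Any, Dict, List, Optional

Cell = Dict[str, Any]


def find_by_id(cells: List[Cell], cell_id: str) -> Optional[Cell]:
    """Return the first cell whose "id" matches, or None if there is none."""
    for cell in cells:
        if cell.get("id") == cell_id:
            return cell
    return None


# Illustrative cells, not taken from a real notebook.
cells = [
    {"id": "intro", "cell_type": "markdown", "source": "# Intro"},
    {"id": "load", "cell_type": "code", "source": "df = load()"},
    {"id": "plot", "cell_type": "code", "source": "df.plot()"},
]

# Two ways an external tool (e.g. a code review comment) might remember the plot cell.
position_ref = 2
id_ref = "plot"

# Another application moves the plot cell to the top.
reordered = [cells[2], cells[0], cells[1]]

by_position = reordered[position_ref]   # now the "load" cell
by_id = find_by_id(reordered, id_ref)   # still the "plot" cell
```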

@tonyfast
Contributor

Can you describe more how archival would be impacted? If they're set once and preserved as artifacts move around I don't quite follow how it'd impact persistence / recall since the id wouldn't change over time unless the application chooses to rewrite.

Over a really long time scale we can't rely on assumptions that things will be a certain way. A notebook artifact is going to be identified as a whole object based on a SHA. Relative to the SHA, there are cells in order. The SHA and the cell ordering are two of the only things we can rely on for a really long period of time, along with RDF/JSON-LD contexts.

One of the intentions here is to help application recall as a notebook may change ordering of cells (in or out of said application) but wish to preserve association of a particular cell independent of position in a standard fashion.

The notebook currently stores application information in the metadata; if this is application-level data, then it is independent of the cells.

Out-of-order notebooks are dangerous; they are common practice, but they are not sustainable. Hopefully, the community can establish best practices to curb this.

In the nbformat definitions, cells are ordered, so they already have ids. Using a list in a schema implies ordering, and out-of-order notebooks don't fit that convention. I am not confident that out-of-order notebooks will still work in ten years, while I have more confidence in ordered notebooks.

If cells are out of order, then maybe there is a way to use references and definitions to separate the ids from the linearity. This solution could allow for a mix of the old and the new.

{
    "cells": [
        { "$ref": "#/cell_definitions/uuid1" },
        { "$ref": "#/cell_definitions/uuid2" },
        { "metadata": ..., "source": ..., "cell_type": ... }
    ],
    "cell_definitions": {
        "uuid1": { "metadata": ..., "source": ..., "cell_type": ... }, # this is a cell type
        "uuid2": { "metadata": ..., "source": ..., "cell_type": ... }
    }
}
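One way a reader of such a document might flatten it back into an ordered list of cells is sketched below in Python. This is only an illustration of the idea: `cell_definitions` is the key from the example above, while `resolve_cells`, the prefix handling, and the sample values are assumptions, and only local `#/cell_definitions/...` references are supported.

```python
from typing import Any, Dict, List

PREFIX = "#/cell_definitions/"


def resolve_cells(notebook: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the cells in order, replacing each {"$ref": ...} entry with its definition."""
    definitions = notebook.get("cell_definitions", {})
    resolved = []
    for entry in notebook["cells"]:
        ref = entry.get("$ref")
        if ref is None:
            # Old-style inline cell: keep it as it is.
            resolved.append(entry)
        elif ref.startswith(PREFIX):
            # New-style reference: look the cell up by its key.
            resolved.append(definitions[ref[len(PREFIX):]])
        else:
            raise ValueError(f"unsupported reference: {ref}")
    return resolved


# Sample document with made-up values standing in for the elided fields.
notebook = {
    "cells": [
        {"$ref": "#/cell_definitions/uuid1"},
        {"$ref": "#/cell_definitions/uuid2"},
        {"metadata": {}, "source": "inline", "cell_type": "markdown"},
    ],
    "cell_definitions": {
        "uuid1": {"metadata": {}, "source": "first", "cell_type": "code"},
        "uuid2": {"metadata": {}, "source": "second", "cell_type": "code"},
    },
}
flat = resolve_cells(notebook)
```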

@choldgraf
Contributor

Just a meta-point here. This conversation is fantastic with a lot of viewpoints. Is somebody willing to help serve as a shepherd to guide the conversation forward, make sure that voices are heard, summarize and synthesize, etc? I think that will make sure that we have the right amount of information to move forward.*

Another question: which group of folks owns the decision on this one? I suppose it would be core maintainers on the nbformat repository since that's the reference implementation of the notebook spec?

*I'd do it but I am expecting a baby in -1 days :-)

@ellisonbg
Contributor

ellisonbg commented Aug 13, 2020 via email

@willingc
Member

I think that there's enough interest to begin iterating on a JEP and collaborating on the best technical approach. I suspect that the governance being in transition is secondary to the content of the JEP and whichever group is deciding at the time the JEP is done (Steering Council or TBD) can respond to the JEP.

I'm willing to meet with interested folks on alternate weeks from the RTC meeting run by Saul. The preliminary JEP meeting for cell id/information could be Monday August 17th 9:30am - 10am Pacific. Here's a HackMD to get folks started: https://hackmd.io/@Y6xjRiXFRUmwV7lDeM-5nQ/rJ-VFemfv/edit

@willingc
Member

I'm suggesting a meeting as a better way to share perspectives and collaborate than doing this all via text on an issue.

@ellisonbg
Contributor

ellisonbg commented Aug 13, 2020 via email

@willingc
Member

Thanks @ellisonbg. I completely agree re: scope.

@MSeal
Contributor Author

MSeal commented Aug 14, 2020

I've attached a Zoom call to the meeting Carol started. We're planning for 30 minutes on Monday, and it should be open to anyone to join (the join password is in the doc).

@choldgraf
Contributor

I will not be able to join any of these meetings due to pending 👶 but I am supportive of the idea and those who are interested in pushing this forward, and confident that we can come to a decision in an inclusive and productive manner ✨

@MSeal
Contributor Author

MSeal commented Aug 14, 2020

Best of luck @choldgraf ! ... DM me with what my future will be in 2 months! I'll take silence as a sign of much restful sleep 😉

@MSeal
Contributor Author

MSeal commented Aug 17, 2020

Thanks folks that attended! We're planning to repeat the meeting in 2 weeks at the same time and get a draft of the actual JEP with all the feedback so far included as prep for that session. Notes are captured in https://hackmd.io/AkuHK5lPQ5-0BBTF8-SPzQ (I'll need to change up the call setup for next time as there were technical difficulties with the link I gave).

@ellisonbg
Contributor

ellisonbg commented Aug 17, 2020 via email

@rgbkrk
Member

rgbkrk commented Aug 17, 2020

I missed the meeting! Thank you all for pushing this forward so we can all start jumping out of our backchannel ways of doing this.

@MSeal
Contributor Author

MSeal commented Aug 27, 2020

Full proposal is up at #62 and we have a call next Monday (8/31) at 9:30 AM Pacific (your timezone) for people to be able to discuss the proposal in (virtual) person.

@MSeal
Contributor Author

MSeal commented Aug 27, 2020

And this time the zoom link should work for folks

@willingc
Member

Biweekly meeting was attended by @MSeal and me. Here are the agenda/minutes:

  • Comments or questions about PR#62 Cell ID Proposal
    • Matt and Carol reviewed comments on the JEP and have responded in the PR.
  • Current plan to send the JEP to Steering Council on Friday for approval which is 7 days after the PR was opened.

@tonyfast
Contributor

bummed i missed this. how do we stay up to date with events like this?

@willingc
Member

Sorry you missed this @tonyfast. Matt had mentioned it above. I don't know the best way :(

@echarles
Member

echarles commented Sep 1, 2020

Current plan to send the JEP to Steering Council on Friday for approval which is 7 days after the PR was opened.

@willingc (Sorry to hijack this issue with a question) Some PRs in the enhancement-proposals repo have been open for months: what are the guidelines for those that are still waiting on approval?

@willingc
Member

willingc commented Sep 1, 2020

@echarles My understanding is that those proposed JEPs would need to be sent to the Steering Council for pronouncement (approval, rework, reject).

@echarles
Member

echarles commented Sep 1, 2020

@echarles My understanding is that those proposed JEPs would need to be sent to the Steering Council for pronouncement (approval, rework, reject).

Thx for the answer @willingc. I have more questions, like how to submit a JEP to the Steering Council, or whether there is a minimal quorum required per JEP before submitting... Maybe there is a public doc for all these questions/procedures? If not, I am happy to open another issue on a repo (this one?) or just continue the discussion here...

@willingc
Member

willingc commented Sep 1, 2020

@echarles In case you haven't seen, https://github.com/jupyter/enhancement-proposals/blob/master/jupyter-enhancement-proposal-guidelines/jupyter-enhancement-proposal-guidelines.md

This is likely the best procedure doc that I have seen. Feel free to see if there is another issue open or create a new issue for further discussion.
