Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store Y updates #12600

Closed
wants to merge 9 commits into from
Closed

Conversation

davidbrochart
Copy link
Contributor

@davidbrochart davidbrochart commented May 19, 2022

References

Fixes #12596.

Code changes

The first time a document is accessed from the front-end, it is read from the source file and a Y document is created in the back-end (thus no history of changes).
The next time a document is accessed from a front-end, it is synced with the Y document, either from memory if the room was not deleted, or from file otherwise. When the room was deleted, the Y updates were saved to a file. Thus, the history of changes is always available, at least if using a YStore that persists between JupyterLab sessions. By default, an SQLiteYStore is used and does persist.

User-facing changes

None at the moment, but later we can implement a timeline of the document changes in the front-end, and potentially restore the document at a point in time.

Backwards-incompatible changes

None.

@jupyterlab-probot
Copy link

Thanks for making a pull request to jupyterlab!
To try out this branch on binder, follow this link: Binder

@davidbrochart
Copy link
Contributor Author

This is ready for review.

@davidbrochart davidbrochart changed the title Save Y updates to sidecar file Store Y updates May 27, 2022
@jtpio jtpio added this to the 4.0 milestone May 30, 2022
@jtpio
Copy link
Member

jtpio commented May 30, 2022

Thanks @davidbrochart this looks good, and would be indeed be useful for debugging.

@jtpio
Copy link
Member

jtpio commented May 30, 2022

Looks like this could be merged in its current state so it could be tested, without blocking on #12614?

@davidbrochart
Copy link
Contributor Author

Yes I also think that #12614 is quite independent. I'm in favor of merging it.

@jtpio
Copy link
Member

jtpio commented May 31, 2022

least during the life cycle of the Jupyter server

Curious of where this logic lives. Is it in SQLiteYStore? Looking at the diff the database does not seem to be modified or deleted on server shutdown?

@davidbrochart
Copy link
Contributor Author

That was when we used a TempFileYStore, but now we have a default SQLiteYStore which persists between JupyterLab sessions.

@jtpio
Copy link
Member

jtpio commented Jun 3, 2022

OK I see you updated the description in the top-level comment, thanks 👍

Copy link
Member

@jtpio jtpio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks this looks good.

Since this will automatically create a new .jupyter_ystore.db file and users might wonder what this is about, maybe we could mention that in the RTC documentation?

https://jupyterlab.readthedocs.io/en/latest/user/rtc.html

@davidbrochart
Copy link
Contributor Author

Thanks @jtpio, I added a note about this file and also updated the documentation with auto-save.

@ellisonbg
Copy link
Contributor

@davidbrochart what would it look like if we merge this PR now and then later get it to work with the document ID service described in #12614. I would love to see this move forward in the meantime, but want to understand the migration path.



class JupyterSQLiteYStore(SQLiteYStore):
db_path = ".jupyter_ystore.db"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this file located? Is there one per directory? One per server? What are the tradeoffs of the two options?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file is located in the directory where JupyterLab (i.e. the server) is run. There is one file for the whole duration of the server, and if the server is run again in the same directory, the file is reused, so there is persistence between "JupyterLab sessions".
Having one file per directory could be interesting because we wouldn't identify a document with its path, but only its name. That means if JupyterLab is launched from another directory, we are still able to easily find the Y store for any opened document. With the current approach, the (relative) path to a document is encoded in the database, so changing the JupyterLab directory would create a new Y store and updates from the previous session would be lost.
Thanks for raising that up @ellisonbg, I think we should have one SQLiteYStore per directory, what do you think?

@davidbrochart
Copy link
Contributor Author

@davidbrochart what would it look like if we merge this PR now and then later get it to work with the document ID service described in #12614. I would love to see this move forward in the meantime, but want to understand the migration path.

Maybe we could have only one database for storing the Y updates and the document IDs? Actually, the database for storing the document IDs would suffer the same issues of "server relocation", so having one database per directory as discussed would be relevant for the document ID database too. So why not one database per directory for storing every document state, be it Y updates or document ID or comments?
This would make getting the document for a given ID more complicated though, because we would need to walk through all the sub-directories to find the ID.

@echarles
Copy link
Member

echarles commented Jun 7, 2022

Whatever implementation is defined by the platform administrator (File, SQLite...), I am thinking that the storage would cover the complete arborescence under the defined root_dir. So if the server is started with root_dir==/foo/barwith SQLite , there should be a single database serving all the update of all the documents under/foo/bar`.

Having that same persistence system to serve the ID mapping could be useful. What if the platform administrator is willing to configure with file for Y.js updates and SQLite for IDs?

@davidbrochart
Copy link
Contributor Author

Having that same persistence system to serve the ID mapping could be useful. What if the platform administrator is willing to configure with file for Y.js updates and SQLite for IDs?

We currently have the flexibility to store Y updates into files, but it could be kept in ypy-websocket (which is independent of Jupyter), while Jupyter would require them to be stored in its own storage type.

RENAME_SESSION = 127


class JupyterTempFileYStore(TempFileYStore):
prefix_dir = "jupyter_ystore_"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you use class variables? You do not control their life cycle. It will be better to pass a dictionary of configurable kwargs to the constructor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that a new Y store is instantiated for each document, identifying the document with its path. But you want to store the document updates to some "common root": the database for an SQLiteYStore, or a directory prefix for a TempFileYStore. So you cannot pass this common root in the constructor, hence the class variable.
There is not issue of class variable life cycle if you create a new class for your store, which is what the docstrings suggest here and there.

@fcollonval
Copy link
Member

Whatever implementation is defined by the platform administrator (File, SQLite...), I am thinking that the storage would cover the complete arborescence under the defined root_dir. So if the server is started with root_dir==/foo/barwith SQLite , there should be a single database serving all the update of all the documents under/foo/bar`.

It will also reduce the number of files created that will crowd the user filesystem.

Or if we store it for each directory, it could be stored it under the checkpoint directory (as this is kind of an enhanced version of checkpoints). So that users already know that this is related to Jupyter, it is probably already ignored by their Version Control System and it may be removed if they need to clean thing up.

@davidbrochart
Copy link
Contributor Author

I feel that the question of where to store the updates could be handled in a separate issue/PR, and that we should merge this PR.
Let's not forget that it solves a bug where content is duplicated when a user disconnects for more than one minute (by default).

@davidbrochart
Copy link
Contributor Author

Also, we have been using YStores to debug RTC issues, by "replaying" changes to a document, and it appears to be a very valuable tool.
A JupyterLab extension to navigate through a document history using its YStore could be nice to have.

@davidbrochart
Copy link
Contributor Author

This PR was moved to jupyterlab/jupyter-collaboration#2.

@davidbrochart davidbrochart mentioned this pull request Jul 25, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 13, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Save document changes to disk in RTC
5 participants