
What is the status of per-chunk caching? Is it supported/planned? #48

Open
grwlf opened this issue Jun 4, 2021 · 8 comments
Labels
enhancement? Possible enhancement

Comments

@grwlf

grwlf commented Jun 4, 2021

Hi. I quickly reviewed the documentation but found no mention of per-chunk caching. I suppose it is not supported, is it?

@gpoore
Owner

gpoore commented Jun 4, 2021

No, there is not per-chunk caching, since that's not really practical across languages.

In the future, it may be possible to have per-chunk caching for some languages, as long as there is pre-existing software that can manage caching. I believe knitr has per-chunk caching for R, and possibly Julia and Python... there might be a way to leverage some existing solutions.

@grwlf
Author

grwlf commented Aug 25, 2021

I am working on a project called Pylightnix which in theory should handle the required caching. If you don't mind, I could try to add this feature to Codebraid. I plan to find the place where Python blocks are executed and attempt to wrap it in Pylightnix "stages".
Update: I realized that one would need to save the internal state of the interpreter in order to solve the problem. That could be really hard, so now I am not so sure that I can handle it.

@gpoore
Owner

gpoore commented Aug 25, 2021

I looked at Pylightnix, and was also wondering about dealing with global state. You can basically think about the code chunks as a list of strings, with each string being the code from a code chunk. If you can come up with a function that takes such a list of strings and executes them with caching, then that function can be incorporated into Codebraid. If you want to try to implement caching, I'd suggest working on a function that operates on a list like this first, before trying to build something within Codebraid itself. (Also, I'm working on adding new features to Codebraid that involve a lot of modifications, so the existing code is about to change significantly.)
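
For concreteness, a minimal sketch of such a function might look like this (hypothetical names, not Codebraid code). It keys chunk i on the hash of chunks 0 through i, and uses a long-lived dict to stand in for persistent interpreter state:

```python
import contextlib
import hashlib
import io

# Hypothetical sketch, not Codebraid code: execute a list of code-chunk
# strings with per-chunk caching. Chunk i is keyed on the hash of chunks
# 0..i, so editing a chunk invalidates it and every later chunk. The
# persistent `env` stands in for a long-running interpreter; skipping a
# cached chunk is only sound because its effects on `env` survive from
# the previous run.
env: dict = {"__name__": "__main__"}
cache: dict[str, str] = {}  # hash of chunks 0..i -> captured stdout

def run_chunks(chunks: list[str]) -> list[str]:
    outputs, prefix_hash, missed = [], hashlib.sha256(), False
    for code in chunks:
        prefix_hash.update(code.encode())
        key = prefix_hash.hexdigest()
        if not missed and key in cache:
            outputs.append(cache[key])  # unchanged prefix: reuse output
            continue
        missed = True  # first change: re-execute this and all later chunks
        buf = io.StringIO()
        with contextlib.redirect_stdout(buf):
            exec(compile(code, "<chunk>", "exec"), env)
        cache[key] = buf.getvalue()
        outputs.append(cache[key])
    return outputs
```

On the first call everything runs; on a later call where only the last chunk changed, only that chunk is re-executed. The gap is exactly the global-state problem: if the process restarts, `env` is lost and the cached prefix can no longer be skipped safely.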

There are a few ways that we might get some caching without full per-chunk caching. Let me know if any of these are of interest for what you are doing.

  • It would be possible for a session to depend on one or more other sessions. For example, you could put expensive calculations in a session that saves its output at the end, and then put visualization in a separate session that loads the saved output and plots it. The visualization session would specify that it depends on the calculation session. Any time the calculation session changes, the visualization session is re-executed as well, but the visualization session can be modified without affecting the calculations (see the sketch after this list).
  • It would be possible to use a Jupyter kernel that is not restarted between document builds. Only modified code would be executed by the kernel. This would have the same out-of-order execution downsides as a Jupyter notebook, but it would also allow very fast iteration.
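
For the first option, here is a sketch of the calculation/visualization split, using Codebraid's existing `session` attribute and passing data through a file. The dependency declaration is the part that doesn't exist yet; as things stand, the visualization session would not notice when the saved file changes.

```{.python .cb.run session=calc}
# Expensive part: re-executed only when this session's code changes.
import json
result = sum(i * i for i in range(10**7))
with open("calc-output.json", "w") as f:
    json.dump({"result": result}, f)
```

```{.python .cb.run session=viz}
# Cheap part: loads the saved output and reports it.
import json
with open("calc-output.json") as f:
    result = json.load(f)["result"]
print(f"Result: {result}")
```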

@grwlf
Author

grwlf commented Aug 26, 2021

Thanks for your advice. I agree that it would indeed be better for me to do a simplified proof of concept first. I've thought a bit more about the problem: I don't like Jupyter because I think it is too heavy to be manageable. Instead, it may be just fine to open a pipe to a Python shell running in the background and save this pipe as a file. Then I could require users to pass the name of this file as an argument and call it a poor man's serialization of the interpreter state. :) The rest of the demo should not be hard; I think we could assume that:

  • the lines of code in each chunk are the "prerequisites" of that chunk;
  • the output received from the pipe during the last execution of a chunk is the "artifact" of that chunk that needs to be cached (I'll ignore stderr for simplicity);
  • the job is then to build the dependencies between chunks, e.g. by saying that each chunk depends on all previous chunks in the file.

That could be a bit fragile, but I think it could work.
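
For what it's worth, a minimal sketch of that plumbing (hypothetical paths; this is not MDRUN's code):

```python
import os
import subprocess

# Hypothetical FIFO paths for the "poor man's serialization" idea.
IN_FIFO, OUT_FIFO = "/tmp/py-in", "/tmp/py-out"

for path in (IN_FIFO, OUT_FIFO):
    if not os.path.exists(path):
        os.mkfifo(path)

# O_RDWR keeps os.open() from blocking until the other end of each FIFO
# is opened. "-i -u" gives an unbuffered interactive interpreter that
# keeps its state alive between document builds.
subprocess.Popen(
    ["python", "-i", "-u"],
    stdin=os.open(IN_FIFO, os.O_RDWR),
    stdout=os.open(OUT_FIFO, os.O_RDWR),
    stderr=subprocess.STDOUT,
)
```

A later document build would then write a chunk's code to /tmp/py-in and read that chunk's "artifact" back from /tmp/py-out.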

@gpoore
Owner

gpoore commented Aug 26, 2021

A pipe might work. Saving the pipe and then passing it as an argument for the next document build might not be necessary. I'm interested in adding a new mode where Codebraid runs continuously in the background and automatically rebuilds the document under various conditions. For example, when the document is saved it could be rebuilt with all code replaced by the text "waiting for results", and then every 10 seconds it could be rebuilt with all code results that are available by that time. This will ultimately allow for a (nearly) live preview mode.
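
Just the rebuild-on-save part of that could be as simple as a polling loop around the existing CLI (a toy sketch with hypothetical file names; the partial-results behavior described above needs much more than this):

```python
import subprocess
import time
from pathlib import Path

# Toy watch loop: rebuild the document whenever its mtime changes.
# File names are hypothetical; --overwrite lets repeated builds replace
# the previous output.
doc = Path("doc.md")
last_mtime = 0.0
while True:
    mtime = doc.stat().st_mtime
    if mtime != last_mtime:
        last_mtime = mtime
        subprocess.run(["codebraid", "pandoc", "--overwrite",
                        "--from", "markdown", "--to", "html",
                        "-o", "doc.html", str(doc)])
    time.sleep(10)
```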

@grwlf
Author

grwlf commented Sep 5, 2021

Got it. I'm aware that there are compilers that work this way; some of my colleagues used one for compiling Haskell code in the background. However, I have the impression that a Python environment will never be stable enough to withstand a moderately long editing session: for example, I have to restart my IPython console from time to time to let it re-load files and fix internal problems caused by multiple versions of classes. Apart from these doubts, I agree that it could be a nice feature.

Meanwhile I've uploaded a small proof-of-concept application called MDRUN. It processes Markdown documents by sending code sections through the Python interpreter. It runs everything in one pass and uses non-trivial POSIX plumbing to keep the interpreter alive between sessions. It also uses Pylightnix for per-chunk cache management, as planned. On every run the program evaluates only the changed sections and their successors.

An example input document is here and the result is there.

I'm going to keep the master branch of Pylightnix in a working state for some time, including this sample. Feel free to let me know when/if you think I could help with adding a similar feature to Codebraid.

@gpoore gpoore added the enhancement? Possible enhancement label Sep 7, 2021
@gpoore
Owner

gpoore commented Sep 14, 2021

The current built-in code execution system is based on templates. The code from the Markdown document is extracted from the Pandoc AST, then inserted into templates to create a source file that is executed. For this approach, adding new code execution features means creating new templates. This isn't ideal for what you need.
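
Roughly (this is a guess at the flavor, not Codebraid's actual templates), the chunks get concatenated into one source file with delimiters printed between them, so the combined stdout can be split back into per-chunk outputs:

```python
import subprocess
import sys
import tempfile

# Rough flavor of a template-based runner, not Codebraid's actual code.
# A chunk whose own output contains the delimiter would break this.
DELIM = "<<codebraid-chunk-delim>>"

def run_via_template(chunks: list[str]) -> list[str]:
    source = "\n".join(f'print("{DELIM}")\n{code}' for code in chunks)
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
    stdout = subprocess.run([sys.executable, f.name],
                            capture_output=True, text=True).stdout
    return stdout.split(DELIM + "\n")[1:]  # one entry per chunk
```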

For some time I've been working on adding support for running code with interactive subprocesses, like a Python interactive shell. I'm currently in the midst of modifying the built-in code execution system to add better support for this, as well as some async-related features. Once this is finished, adding new code execution features will be possible by specifying an executable that reads code from stdin (or potentially a file) and writes (properly formatted) code output to stdout (or potentially a file). This should make it straightforward to use a slightly modified version of your MDRUN.py with the built-in code execution system.
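
The simplest shape of such an executable might be something like this (my assumption; the delimiter/formatting protocol is not pinned down here):

```python
#!/usr/bin/env python3
# Sketch of a code-executing filter in the shape described above: read
# code from stdin, execute it, write the captured output to stdout. A
# real implementation would presumably add delimiters so Codebraid can
# match output back to individual chunks.
import contextlib
import io
import sys

code = sys.stdin.read()
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    exec(compile(code, "<stdin>", "exec"), {"__name__": "__main__"})
sys.stdout.write(buf.getvalue())
```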

It will probably be at least a few weeks until the new features are finished... it's part of a larger set of features I've been working on for months. I will try to remember to add a note in this issue when that's available for experimentation. If I don't add a note in the next month or so, you might check back about progress.

@grwlf
Author

grwlf commented Aug 28, 2022

FYI: the request to cache results was mainly about enjoying partial document evaluation. I've now implemented the latter feature as a separate project; see LitREPL. The editor (currently Vim) sends the whole document to the backend, which extracts code/result sections using a lightweight parser, pipes the code through the background interpreter (Python and IPython are supported), produces the result, and finally sends the document back to the editor. The communication is performed via Unix pipes, so there is a POSIX-compatible OS requirement for now. I found that the Lark library greatly simplifies the parsing business; with its help, the tool supports both Markdown and LaTeX document formats. Feel free to borrow the code if needed; I've used the same BSD 3-clause license as you do in Codebraid.
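
To illustrate just the extraction step (LitREPL's actual parser is a Lark grammar, which this toy regex version does not reproduce):

```python
import re

# Toy illustration of the extraction step only: find fenced Python code
# sections in a Markdown document. LitREPL's real parser is a Lark
# grammar and also tracks result sections, which this ignores.
FENCE = re.compile(r"```python\n(.*?)\n```", re.DOTALL)

def code_sections(doc: str) -> list[str]:
    return [m.group(1) for m in FENCE.finditer(doc)]
```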
