Feature Request: Cache Values #16

Closed
AnotherCodeArtist opened this issue Apr 28, 2023 · 11 comments

@AnotherCodeArtist

Hi Jan!

Thanks for the great work. It already became my favorite Go kernel and I'm using it on a JupyterHub cluster.
One thing, however, would be great: When declaring functions, types or variables they can be re-used over multiple cells. But, if a variable is holding the results of a function call

var lotsOfData = LoadOverTheInternet("https://.....")

and this variable is used later in another cell

%%
processData(lotsOfData)

then the result loaded in the previous cell is not reused; instead, the function call is executed again.
Is there any chance to cache the data instead of executing the function over and over again?

BTW: If you need a Dockerfile, I already have one (although it is a bit specific, since it is running in our cluster with some custom modifications).

@janpfeifer
Owner

Thanks @AnotherCodeArtist, happy to hear that 😃!

Yes, the idea of having something that carries values over to another cell is high on my want list -- it's the item "Library to easily store/retrieve calculated content." in the TODO list.

The thing is, GoNB works by recompiling and re-executing the program at every cell execution, so the value cannot simply stay alive between cells. One workaround that comes to mind is to have a library that quickly serializes/deserializes on demand. So your example would look like:

var lotsOfData = CacheValue(func () Data { return LoadOverTheInternet("https://.....") }, "lotsOfData")

Where:

func CacheValue[T any](fn func() T, key string) T {
...
}

It would first try to read the value T from a cache, keyed by key. If it doesn't find it, it would call fn and save the result in the cache.

So it actually runs LoadOverTheInternet() only at the first execution, and all other cells will simply reload the result. And assuming one is actively using the notebook, the cached file will sit in the OS disk cache, so it is effectively always in memory for fast access.
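To make the idea concrete, here is a minimal sketch of such a function. It is not the implementation that later landed in gonb/cache: the use of encoding/gob, the cacheDir location, the file naming, and the Forget helper are all assumptions made for illustration. It also assumes keys are safe to use as file names and that T is something gob can encode (for structs, exported fields only).

```go
package main

import (
	"encoding/gob"
	"fmt"
	"os"
	"path/filepath"
)

// cacheDir is a hypothetical location for cached values; any writable directory works.
var cacheDir = filepath.Join(os.TempDir(), "cachevalue_sketch")

// entryPath maps a key to its cache file (assumes key is a valid file name).
func entryPath(key string) string {
	return filepath.Join(cacheDir, key+".gob")
}

// load tries to decode a previously saved value; ok is false on any failure.
func load[T any](path string) (value T, ok bool) {
	f, err := os.Open(path)
	if err != nil {
		return value, false
	}
	defer f.Close()
	if err := gob.NewDecoder(f).Decode(&value); err != nil {
		var zero T
		return zero, false
	}
	return value, true
}

// save writes value to path, creating the cache directory if needed.
func save[T any](path string, value T) error {
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return err
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return gob.NewEncoder(f).Encode(&value)
}

// CacheValue returns the cached value for key if one can be read;
// otherwise it calls fn, saves the result (best effort), and returns it.
func CacheValue[T any](fn func() T, key string) T {
	path := entryPath(key)
	if value, ok := load[T](path); ok {
		return value
	}
	value := fn()
	_ = save(path, value) // a failed save only costs a recomputation later
	return value
}

// Forget removes the cached entry for key, forcing fn to run again next time.
func Forget(key string) {
	_ = os.Remove(entryPath(key))
}

func main() {
	calls := 0
	expensive := func() []float64 {
		calls++
		return []float64{1.5, 2.5, 3.5}
	}
	first := CacheValue(expensive, "lotsOfData")
	second := CacheValue(expensive, "lotsOfData")
	// calls is 1 on the very first run and 0 once the entry exists on disk.
	fmt.Println(first, second, calls)
}
```

Because the entry lives on disk, it survives the recompile-and-rerun cycle of each cell; calling Forget("lotsOfData") is one way to force a fresh load.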

For really large blobs of data this may still not be good enough, but then maybe one could memory-map a file with the data in binary format (e.g., a large array of floats). It would require special memory management, but it's easy to make this manageable.
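A rough sketch of the memory-mapping idea, under several assumptions: it uses syscall.Mmap, so it only builds on Unix-like systems; the file layout (consecutive little-endian float64 values) and the names WriteFloats, OpenFloats, MappedFloats and RoundTrip are invented for this example. Values are decoded on access rather than reinterpreted in place, which avoids the unsafe package at some cost in speed.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
	"os"
	"path/filepath"
	"syscall"
)

// WriteFloats stores values as consecutive little-endian float64s.
func WriteFloats(path string, values []float64) error {
	buf := make([]byte, 8*len(values))
	for i, v := range values {
		binary.LittleEndian.PutUint64(buf[8*i:], math.Float64bits(v))
	}
	return os.WriteFile(path, buf, 0o644)
}

// MappedFloats is a read-only view over a memory-mapped file of float64s.
type MappedFloats struct {
	data []byte
}

// OpenFloats maps the whole file read-only; pages are loaded lazily by the OS.
func OpenFloats(path string) (*MappedFloats, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close() // the mapping stays valid after the descriptor is closed
	info, err := f.Stat()
	if err != nil {
		return nil, err
	}
	if info.Size() == 0 {
		return &MappedFloats{}, nil // zero-length mappings are not allowed
	}
	data, err := syscall.Mmap(int(f.Fd()), 0, int(info.Size()),
		syscall.PROT_READ, syscall.MAP_SHARED)
	if err != nil {
		return nil, err
	}
	return &MappedFloats{data: data}, nil
}

// Len returns the number of float64 values in the mapping.
func (m *MappedFloats) Len() int { return len(m.data) / 8 }

// At decodes the i-th value directly from the mapped bytes.
func (m *MappedFloats) At(i int) float64 {
	return math.Float64frombits(binary.LittleEndian.Uint64(m.data[8*i:]))
}

// Close unmaps the file; the view must not be used afterwards.
func (m *MappedFloats) Close() error {
	if m.data == nil {
		return nil
	}
	err := syscall.Munmap(m.data)
	m.data = nil
	return err
}

// RoundTrip writes values to a temporary file, maps it, and reads them back.
func RoundTrip(values []float64) ([]float64, error) {
	path := filepath.Join(os.TempDir(), "floats_sketch.bin")
	if err := WriteFloats(path, values); err != nil {
		return nil, err
	}
	defer os.Remove(path)
	m, err := OpenFloats(path)
	if err != nil {
		return nil, err
	}
	defer m.Close()
	out := make([]float64, m.Len())
	for i := range out {
		out[i] = m.At(i)
	}
	return out, nil
}

func main() {
	got, err := RoundTrip([]float64{1.5, 2.5, 3.5})
	fmt.Println(got, err)
}
```

The "special memory management" here is the explicit Close: the mapped bytes are outside the garbage collector's control, so the caller has to unmap them.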

Any thoughts? I suppose that's what you meant by caching the data?

Btw, on the Dockerfile: indeed it needs one. @sirliu suggested he/she (?) would do it in #13, but I haven't heard back in a bit. How about we follow up on the Dockerfile in that thread?

cheers

@AnotherCodeArtist
Author

Hi Jan!

Sounds like a good first shot. Would be even cooler if there were a chance to provide some cell magic like %% for the main function so that it all becomes transparent to the user. But conceptually that seems to be the way to go (pun not intended!).

Anyhow, here's a working Dockerfile:

ARG BASE_IMAGE=jupyter/base-notebook
ARG BASE_TAG=python-3.10

FROM ${BASE_IMAGE}:${BASE_TAG}


USER $NB_USER

ENV GOVERSION=1.20

USER root

WORKDIR /root
RUN wget https://dl.google.com/go/go$GOVERSION.linux-amd64.tar.gz && \
	tar -C /usr/local -xzf go$GOVERSION.linux-amd64.tar.gz

RUN apt-get update && apt-get install -y git libtool pkg-config build-essential autoconf automake uuid-dev libzmq3-dev



USER $NB_USER
WORKDIR /home/jovyan

ENV GOROOT=/usr/local/go
ENV GOPATH=/home/jovyan/go
ENV PATH=$PATH:$GOROOT/bin:$GOPATH/bin

# Install GoNB (https://github.com/janpfeifer/gonb)
RUN go install github.com/janpfeifer/gonb@latest && \
    go install golang.org/x/tools/cmd/goimports@latest && \
    go install golang.org/x/tools/gopls@latest && \
    gonb --install

WORKDIR /home/jovyan/work

USER root

Build it with

docker build -t gonb:latest .

Run it with

docker run -p 8888:8888 --rm gonb:latest

@janpfeifer
Owner

On the cache: I'm hesitant to create the cell magic -- I'm a big fan of making things explicit, even if it requires a bit more typing. The cache system is also useful outside GoNB, so if it can use the normal Go language to achieve the same thing, I think that is a plus (one less thing to be learned by the end-user).

Thanks for the Dockerfile! I'll add it this weekend, and publish an image on Docker Hub so folks can simply pull from it.

@janpfeifer
Owner

Thx again @AnotherCodeArtist . I added a few more things to your initial Dockerfile and pushed it out. Check it out.

Let me see if I can cook a Cache Values library next.

@janpfeifer
Owner

janpfeifer commented Apr 30, 2023

I took an initial stab at it, check it out in c9a1f3198096180f63042cd667675ddee8c7f2bc.

I haven't created a new release yet. I'll give it a few days; if you see any issues, or it doesn't work for you, let me know.

If everything works I'll create a section in the tutorial about it and the 0.6 release.

@AnotherCodeArtist
Author

Hi Jan!

I just tried to use the new cache in a notebook. I get the following error message when running the cell:

 go: downloading github.com/janpfeifer/gonb v0.5.1
 gonb_110feb6a imports
 	github.com/janpfeifer/gonb/cache: cannot find module providing package github.com/janpfeifer/gonb/cache
 
 exit status 1

My dependencies in the docker image are

RUN go install github.com/janpfeifer/gonb@c9a1f3198096180f63042cd667675ddee8c7f2bc && \
    go install golang.org/x/tools/cmd/goimports@latest && \
    go install golang.org/x/tools/gopls@latest && \
    gonb --install

Nevertheless, it seems that 0.5.1 gets downloaded when running

var phonebook = cache.Cache("my_data", func() *Data { return LoadPhoneBookFile() })

Do I require any additional dependencies/imports?

@janpfeifer
Owner

janpfeifer commented May 2, 2023

Sorry, I probably should have explained: the gonb/cache package you use in your notebook is not the same as the one running in the kernel. So you didn't even need to rebuild the Docker image; what matters is the version your notebook will use.

Try this, in 3 different cells:

  1. Tell Go (for the notebook) to download the cache package at the given version. Notice that the !* prefix executes the bash command in the temporary directory where the cell's Go program is being executed (there is an example in the tutorial):
!*go get -u github.com/janpfeifer/gonb/cache@c9a1f3198096180f63042cd667675ddee8c7f2bc
  2. Let's define two variables, one cached and one not, each taking some random value:
import (
    "math/rand"
    "github.com/janpfeifer/gonb/cache"
)

var (
    a = rand.Intn(100)
    b = cache.Cache("b", func() int { return rand.Intn(100) })
)
  3. And then try running this a few times:
%%
fmt.Printf("a=%d, b=%d\n", a, b)

@AnotherCodeArtist
Author

Hi Jan!

Works great for my scenario! Just a thought: Would it be possible to use some in-memory database (which needs to be started and controlled by the kernel) for caching or would it have no influence on execution time anyway?

@janpfeifer
Owner

Nice, I'm happy it worked.

So about having an in-memory database of sorts to store the cache: the OS is pretty good at caching disk contents, so in most cases (unless we are talking GBs of data) everything will be in memory anyway while you work interactively on the notebook. One possible inefficiency of the OS filesystem may be the number of files, if you start having thousands or millions of cached values. In those cases one reasonable option would be packing collections of values into a container, and caching the container instead?

Also, notice that the cache.New() method allows you to create a cache.Storage on any arbitrary disk. Depending on your setup, you could also create an in-memory filesystem (with tmpfs, see this article), and store it there?
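As a small illustration of the tmpfs suggestion, the sketch below picks a RAM-backed directory when one is available. PickCacheDir is a hypothetical helper, and the candidate paths are assumptions: /dev/shm is commonly a tmpfs mount on Linux, but that is not guaranteed, and an existing directory is not necessarily writable. The chosen directory is what one would then hand to cache.New(); that call is left out because its exact signature is not shown in this thread.

```go
package main

import (
	"fmt"
	"os"
)

// PickCacheDir returns the first candidate that exists and is a directory,
// falling back to the system temporary directory when none does.
func PickCacheDir(candidates ...string) string {
	for _, dir := range candidates {
		if info, err := os.Stat(dir); err == nil && info.IsDir() {
			return dir
		}
	}
	return os.TempDir()
}

func main() {
	// Typical tmpfs mount points on Linux (an assumption, check your system).
	fmt.Println(PickCacheDir("/dev/shm", "/run/shm"))
}
```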

If none of those work for you, let me know what is your scenario.

The API is flexible enough to support different types of backends -- one could create a NewInDatabase call that uses a database as storage, or something like that.
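One way such a pluggable backend could look is sketched below. Everything here is hypothetical: the Backend interface, MemoryBackend and CacheIn are invented names, not part of gonb/cache, and gob is again an assumed serialization format. Note that the in-memory map only stands in for a real store; since GoNB re-executes the program for every cell, a useful backend would have to live outside the process (a file, a database, a server).

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// Backend is a hypothetical storage interface: anything that can keep
// serialized bytes under a key could back the cache.
type Backend interface {
	Get(key string) ([]byte, bool)
	Put(key string, data []byte) error
}

// MemoryBackend is a stand-in implementation used only for this demo.
type MemoryBackend struct {
	entries map[string][]byte
}

func NewMemoryBackend() *MemoryBackend {
	return &MemoryBackend{entries: map[string][]byte{}}
}

func (m *MemoryBackend) Get(key string) ([]byte, bool) {
	data, ok := m.entries[key]
	return data, ok
}

func (m *MemoryBackend) Put(key string, data []byte) error {
	m.entries[key] = data
	return nil
}

// CacheIn returns the value stored under key in b if it decodes cleanly;
// otherwise it calls fn, stores the encoded result, and returns it.
func CacheIn[T any](b Backend, key string, fn func() T) T {
	if data, ok := b.Get(key); ok {
		var cached T
		if err := gob.NewDecoder(bytes.NewReader(data)).Decode(&cached); err == nil {
			return cached
		}
	}
	value := fn()
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(&value); err == nil {
		_ = b.Put(key, buf.Bytes()) // best effort, as with a file cache
	}
	return value
}

func main() {
	backend := NewMemoryBackend()
	calls := 0
	compute := func() int {
		calls++
		return 7
	}
	a := CacheIn(backend, "answer", compute)
	b := CacheIn(backend, "answer", compute)
	fmt.Println(a, b, calls) // prints: 7 7 1
}
```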

@AnotherCodeArtist
Author

Thanks for the hint with the ramdisk. As I've already pointed out, if there's no significant performance gain it's not worth going through the hassle!

@janpfeifer
Owner

Nice. Closing the issue then. Next weekend I'll create a new release, and update the tutorial.
