Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify NB Dockerfile and start scripts to support mount of /home/jovyan #786

Merged
merged 1 commit into from May 23, 2018

Conversation

pdmack
Copy link
Member

@pdmack pdmack commented May 10, 2018

Put all of /home/jovyan on a PVC when using PVC with Jupyter notebooks.

  • One consequence of mounting a PVC at /home/jovyan is that any files placed in /home/jovyan will be
    hidden by the PVC once its mounted
  • We take two approaches to solving this
    • We move most things we care about (e.g python pip/conda packages in /home/jovyan ) to other locations e.g. /opt that won't be overwritten
    • Files that need to be in /home/jovyan like jupyter_notebook_config.py are copied on container
      start into /home/jovyan so that they will be copied onto the PVC

Fixes #561

/cc @jlewi
/cc @inc0

Pretty much what @jlewi described. Dockerfile installs conda, pip, etc. in situ at /home/jovyan then we tar it up to /tmp and remove the content at /home/jovyan. Then start-singleuser.sh invokes a new pvc-check.sh which determines whether or not we init a new /home/jovyan, regardless of whether it is mounted or not (it doesn't know). If it detects content there from a PVC it just re-installs from that pip cache. If nothing is there it untars to that location which may or may not be a mount (doesn't matter).

~~The downside of the tar technique is that it unfortunately adds a 2.5 GB (mostly .conda) layer to the NB image, but I'm at a loss for a cleaner way to do this. I don't think it's worth doing compression in the Docker build, nor decompression at pod launch. ~~

I tested with a hostPath PV like this. Obviously that dir needs to be chmod 777 for the hostPath to work.

Could use some tips/help/suggestions on e2e testing this. README updates can follow but IMHO we better agree on this strategy first.


This change is Reviewable

@k8s-ci-robot k8s-ci-robot requested review from inc0 and jlewi May 10, 2018 17:36
@pdmack pdmack changed the title Modify NB Dockerfile and start scripts to detect mount for /home/jovyan Modify NB Dockerfile and start scripts to allow mount of /home/jovyan May 10, 2018
@pdmack pdmack changed the title Modify NB Dockerfile and start scripts to allow mount of /home/jovyan Modify NB Dockerfile and start scripts to support mount of /home/jovyan May 10, 2018
@jlewi
Copy link
Contributor

jlewi commented May 10, 2018

Can you run autoformat_jsonnet to fix lint issues? The minikube failure should have been transient.

/test all

@jlewi
Copy link
Contributor

jlewi commented May 10, 2018

What is being installed in /home/jovyan ? Could that be installed someplace else?

@jlewi
Copy link
Contributor

jlewi commented May 10, 2018

45M    ./.cache
92K     ./.ipython
52K     ./.config
72K     ./.local
3.0G    ./.conda
16K     ./.jupyter
152K    ./.npm
8.0K    ./.keras
6.0G    ./work

So it looks like conda is installing itself in the home directory. Can we just put that in /opt or /usr?

@ojarjur @willingc any advice?

@pdmack
Copy link
Member Author

pdmack commented May 10, 2018

# pwd && ls -A && du -hs .conda/ && which jupyter
/home/jovyan
.cache	.config   .local  requirements.txt  work
.conda	.jupyter  .npm	  .wget-hsts
2.5G	.conda/
/home/jovyan/.conda/bin/jupyter

We can install anywhere writable in the base image (i.e., /tmp). The question is: what are we trying to enable for the NB user? My assumption is that there will be incremental changes to the pip cache as the user installs their desired pkgs. Ergo, those must survive pod restarts.

@jlewi
Copy link
Contributor

jlewi commented May 10, 2018

I'm perfectly fine with user installed packages not persisting across pod restarts. This conforms to my normal expectation that if a pod restarts you get back the original OS as baked into the image.

From a user perspective, I think the fact that a pod restart terminates your kernel and you lose all state is more problematic. So in that regard Jupyter notebooks need to have high availability.

If users want to install additional packages and be resilient to restarts I think the way to do that is to build a custom docker image which has the additional packages.

So my preference would be to

  • Avoid bloating the image
  • Simplify the image build and startup process

So I'd be in favor of installing into /opt or /usr even if it means any additional user installed packages would need to be reinstalled on restart.

@pdmack
Copy link
Member Author

pdmack commented May 11, 2018

OK, apparently there’s misgivings about shared, multi-user conda installs since it is by design geared for separate user environments. However, in our case it is indeed a one-user env. I’ll have a look tomorrow.

@pdmack pdmack changed the title Modify NB Dockerfile and start scripts to support mount of /home/jovyan [WIP] Modify NB Dockerfile and start scripts to support mount of /home/jovyan May 11, 2018
@pdmack pdmack changed the title [WIP] Modify NB Dockerfile and start scripts to support mount of /home/jovyan Modify NB Dockerfile and start scripts to support mount of /home/jovyan May 14, 2018
jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install @jupyter-widgets/jupyterlab-manager && \
# Do chown in this layer for significant size savings
chown -R ${NB_USER}:users $HOME && \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!


# Setup work directory for backward-compatibility
RUN mkdir /home/$NB_USER/work

# Install Tini - used as entrypoint for container
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth adding a comment that we want a single layer to minimize size so that people don't split it up later into multiple commands not realizing the benefit of using a single layer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -16,7 +16,7 @@ ENV DEBIAN_FRONTEND noninteractive
ENV NB_USER jovyan
ENV NB_UID 1000
ENV HOME /home/$NB_USER
ENV CONDA_DIR=$HOME/.conda
ENV CONDA_DIR=/opt/conda
Copy link
Contributor

@jlewi jlewi May 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment explaining why we use /opt/conda and not /home

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

RUN useradd -m -s /bin/bash -N -u $NB_UID $NB_USER && \
chown -R ${NB_USER}:users /usr/local/bin
# but allow for non-initial launches of the notebook to have
# $HOME provided by the contents of a PV
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are we doing to allow non-initial launches of the notebook to have HOME provided by contents of a PV?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See below.


# Install common packages from requirements.txt for both python2 and python3
# NB: the COPY chown can't expand a bash variable for NB_USER
COPY --chown=jovyan:users requirements.txt $HOME/requirements.txt
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use $HOME here as opposed to some directory in /opt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See below.

RUN pip --no-cache-dir install -r $HOME/requirements.txt && \
source activate py2 && \
pip --no-cache-dir install -r $HOME/requirements.txt

# Tar and delete staged content
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to do this? I thought the whole point of installing in /opt was to avoid having to create a tarball and then installing that tarball?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See below.

# limitations under the License.

# stored in the NB Dockerfile
SRC_TAR=/tmp/$NB_USER.tar
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to do this? Can't we just pip install the packages into a directory other than home? So that we don't have to reinstall them on startup?

Copy link
Member Author

@pdmack pdmack May 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jlewi I'm kind of struggling here. It seems like there is a minimal set of content from the Dockerfile build that should be put aside and then loaded into a "fresh" /home/jovyan including:

  • gcloud config
  • jupyter_notebook_config.py
  • requirements.txt (at least as a user-visible record of what was installed...sure they can do pip list, I suppose...)

If the PV is empty then they will have none of that since it will mask that content.

/cc @ankushagarwal

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deciding what we want and what we don't want to put on a PV is really tricky. I can argue that we might want to put python packages into PV as well since it's very unlikely that users will be using only the packages pre-installed in the image - users would want those packages to survive container restarts as well.

Putting things in PV is also really tricky because to put things on PV, we can't do it at build time and have to do it at runtime with solutions like copying the tarball. I don't have a good solution on how to do this cleanly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing precludes a user from installing whatever they want into their home directory / PV after they've started the pod.

If at all possible I'd like to avoid having the Dockerfile put anything in /home/jovyan so we can avoid the complexity of having to copy files into /home after mounting the PVC.

gcloud config - I don't think we need this. My guess is it will be created automatically if a user runs gcloud.

requirements.txt I don't think a user visible record is that important. So we can just move it somewhere else or just forget about it for now.

.jupyter/jupyter_notebook_config.py - Looks like this is autocreated when you start a notebook.

The only file I think we need is .bashrc and I think it would be fine to copy that on startup to the PVC (although even better would be if it uses the system default by default).

I ran a simple experiment and removed all the "." directories in my home directory and launched a notebook; it ran fine.

So I think if you've installed all the pip packages in a directory other than /home/jovyan I think we should be good to go.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jlewi, we seem to be fairly opinionated in the current Dockerfile....

gcloud:

gcloud config set core/disable_usage_reporting true && \
gcloud config set component_manager/disable_update_check true && \
gcloud config set metrics/environment github_docker_image

.jupyter/jupyter_notebook_config.py:
We definitely don't generate a default for this in the Dockerfile with jupyter notebook --generate-config. Instead we have a custom one copied in that seems specifically crafted.

requirements.txt:
Agreed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@willingc @yuvipanda @foxish Any reason why we would need a custom jupyternotebook_config?

For the gcloud config its fine to overwrite those defaults.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.jupyter/jupyter_notebook_config.py - Looks like this is autocreated when you start a notebook.

That's correct @jlewi. jupyter_notebook_config.py contains the default configuration for notebook when jupyter notebook --generate-config is run or the jupyter notebook is run for the first time. Ref

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @willingc

@pdmack It seems like it would be useful to support using a custom jupyter_notebook_config.py. So I think you are right we might want to copy that file to /home/jovyan if it doesn't exist. But I'd say that's probably the only such file I can think of.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, if there is no other content to preserve then I will update the PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I suggest we try it out and if we found we blew away something we shouldn't have we can add it back.

@pdmack
Copy link
Member Author

pdmack commented May 22, 2018

/retest

@pdmack
Copy link
Member Author

pdmack commented May 22, 2018

/retest

@pdmack
Copy link
Member Author

pdmack commented May 22, 2018

@jlewi PTAL

  • $HOME wiped in Dockerfile (instead of tarred)
  • gcloud config removed (since we don't care if it's there on init)
  • pip from /tmp/requirements.txt
  • now only mkdir work and cp jupyter_notebook_config.py in $HOME from /tmp

@jlewi
Copy link
Contributor

jlewi commented May 23, 2018

This is great. Thank you so much for doing this.

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jlewi
Copy link
Contributor

jlewi commented May 23, 2018

BTW I took the liberty of updating the PR description.

I filed #854 to verify it is working.

@k8s-ci-robot k8s-ci-robot merged commit 2a68f29 into kubeflow:master May 23, 2018
@pdmack pdmack deleted the 561-jovyan-home-mount branch May 23, 2018 18:18
saffaalvi pushed a commit to StatCan/kubeflow that referenced this pull request Feb 11, 2021
yanniszark pushed a commit to arrikto/kubeflow that referenced this pull request Feb 15, 2021
* fix: Add log

Signed-off-by: Ce Gao <gaoce@caicloud.io>

* WIP

Signed-off-by: Ce Gao <gaoce@caicloud.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants