[discussion] How can we play well with Kaggle? #258

Closed
jlewi opened this issue Feb 16, 2018 · 14 comments

@jlewi
Contributor

jlewi commented Feb 16, 2018

Kaggle is an amazing resource.

Can we make it easy for people to use datasets and code they find on Kaggle?

The docker container for Kaggle kernels is publicly available
https://hub.docker.com/r/kaggle/python
https://github.com/kaggle/docker-python

and provides a vast array of libraries.

What would it take to turn that into an image we could launch via JupyterHub?

/cc @yuvipanda @aronchick

@yuvipanda
Contributor

yuvipanda commented Feb 17, 2018 via email

jlewi added the area/jupyter and priority/p2 labels Mar 3, 2018

@benhamner

If it's helpful we could add JupyterHub to our Docker builds

@jlewi
Contributor Author

jlewi commented Jun 21, 2018

Looks like the DockerHub image is very outdated
https://hub.docker.com/r/kaggle/python/

@jlewi
Contributor Author

jlewi commented Jul 6, 2018

@pdmack created a Docker image in #258

I retagged into gcr.io/kubeflow-images-public/kaggle-notebook:v20180629

I retagged it using Google Container Builder; trying to use gcloud container images add-tag choked.
Here's the GCB config
https://github.com/jlewi/kubeflow-dev/tree/master/kaggle-image

I was able to launch a Jupyter image successfully using JupyterHub on Kubeflow.

I randomly tried this notebook
https://www.kaggle.com/pavanraj159/fifa-world-cup-1930-to-2014-data-analysis

It failed when importing matplotlib: the module was not found.

I wonder if this is an issue with the home directory / conda install in the Kaggle image.

It looks like Kaggle does a pip install on matplotlib here
https://github.com/Kaggle/docker-python/blob/c8985b357b50a0c84fd8f04134ef13f9824bf26a/Dockerfile#L446

I guess next step would be to check whether matplotlib is present in the kaggle image we are using as a base image.
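One quick way to run that check, as a minimal sketch: the snippet below is meant to be run with the Python interpreter inside the image (for example from a notebook cell). The helper name and the module list are hypothetical, not taken from the Kaggle Dockerfile.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that this interpreter cannot locate.

    find_spec() only looks the module up on the import path; it does not
    import it, so a broken install that is present on disk still counts
    as "found". Pass top-level names only (e.g. "matplotlib").
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical sample of libraries a Kaggle notebook might import.
    wanted = ["matplotlib", "numpy", "pandas"]
    print("missing:", missing_modules(wanted))
```

If matplotlib shows up as missing here but the Dockerfile installs it, that would point at the interpreter or environment being used by the notebook rather than at the base image itself.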

/cc @pdmack

@pdmack
Member

pdmack commented Jul 7, 2018

Looks like you were yellow carded
I'll have a look in a bit

@jlewi
Contributor Author

jlewi commented Jul 15, 2018

I published @pdmack 's latest image to

gcr.io/kubeflow-images-public/kaggle-notebook:v20180713

I did a smoke test loading this notebook
https://www.kaggle.com/pavanraj159/fifa-world-cup-1930-to-2014-data-analysis

The imports worked, but I didn't have the data, so I wasn't able to run the notebook. I think this is ready for people to try.

Note: for this to work, people need to change the default for the PVC mount, e.g.

ks param set kubeflow-core jupyterNotebookPVCMount /home/jovyan/work

because I think the Kaggle image is installing some things in the home directory that would be overwritten. But we might want to verify that.

@pdmack
Member

pdmack commented Jul 15, 2018

Hmmm, not sure I'm following but maybe I'll inquire on the Tuesday call

@pdmack
Member

pdmack commented Jul 15, 2018

Ah, never mind.
We could update the Kaggle Dockerfile to grab everything from /home/jovyan and stash it to /tmp, then have the PVC check restore it. This goes back to the earlier debates about what to save from /home/jovyan.
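The restore half of that idea could look roughly like the sketch below. It assumes the image build has already copied the contents of /home/jovyan into a stash directory under /tmp; the function name and the stash path in the usage note are hypothetical, and the real PVC check in the start script may work differently.

```python
import shutil
from pathlib import Path
from typing import List


def restore_home(stash: Path, home: Path) -> List[str]:
    """Copy top-level entries from `stash` into `home` without overwriting.

    Anything that already exists in `home` (for example on a PVC that was
    initialized earlier) is left untouched. Returns the names restored.
    """
    restored = []
    home.mkdir(parents=True, exist_ok=True)
    for entry in sorted(stash.iterdir()):
        target = home / entry.name
        if target.exists():
            continue  # never clobber what the user already has on the volume
        if entry.is_dir():
            shutil.copytree(entry, target, symlinks=True)
        else:
            shutil.copy2(entry, target)
        restored.append(entry.name)
    return restored
```

A start script could then call something like restore_home(Path("/tmp/jovyan-stash"), Path("/home/jovyan")) before launching the notebook server, where /tmp/jovyan-stash is a made-up location.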

@jlewi
Contributor Author

jlewi commented Jul 15, 2018

I thought you'd considered that and said it was infeasible because it would lead to extra layers in Docker that double the image size?

@pdmack
Member

pdmack commented Jul 15, 2018

No, it's changing ownership that creates the monster layer. But saving a few files from /home/jovyan is not that impactful.
Perhaps we should take a closer look at what's there from the Kaggle base.

@pdmack
Member

pdmack commented Jul 20, 2018

@jlewi just poking this again. How did you discover that the PVC needed to be at /home/jovyan/work?

@jlewi
Contributor Author

jlewi commented Jul 21, 2018

I may have drawn an incorrect conclusion from a previous experiment.

When I used

gcr.io/kubeflow-images-public/kaggle-notebook:v20180711

with the PV mounted at /home/jovyan, the container didn't even start.

When I tried it with the PV mounted at /home/jovyan/work, I got the error

checking if /home/jovyan volume needs init...
.../home/jovyan already has content...
...done
Execute the command
/usr/local/bin/start.sh: line 62: exec: jupyterhub-singleuser: not found

It's worth trying again with the latest image that we know works and using PV mounted at /home/jovyan. I would be very happy to be proven wrong and will apologize in advance for the confusion.
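When retrying, one way to tell an image problem from a mount problem is to check whether the command from the error above is on PATH inside the container at all. This is a minimal sketch; the helper name is hypothetical and it only uses the standard library.

```python
import shutil


def find_missing_commands(commands, path=None):
    """Return the commands that cannot be resolved to an executable.

    `path` is an optional PATH-style string; when it is None, shutil.which
    falls back to the PATH of the current process.
    """
    return [cmd for cmd in commands if shutil.which(cmd, path=path) is None]


if __name__ == "__main__":
    # jupyterhub-singleuser is the command start.sh failed to exec in the log above.
    print("missing:", find_missing_commands(["jupyterhub-singleuser"]))
```

If the command is missing even with no volume mounted, the image itself lacks it; if it only disappears once the PV is mounted, the mount is probably hiding files installed under the home directory.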

@pdmack
Member

pdmack commented Jul 21, 2018

Yeah, I think that was one of my misfires in those Dockerfile iterations.
I'll take another look

@stale

stale bot commented May 16, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot closed this as completed May 23, 2019