Add Dockerfile to simplify installation #93
|
I haven't used Docker, so bear with me...
Thanks for working on this! |
slang800
commented
Sep 6, 2016
|
No prob - you can think of a Docker container as a lightweight VM (like VirtualBox, but with better tooling and less overhead). The Dockerfile automates building/configuring the container and the ... This creates a fully isolated, reproducible installation of grab-site in a 200-300MB image. This image can be run on any host OS that runs Docker, including CoreOS, where Python isn't even installed. Using Alpine as a base we could get this image down to 20-50MB, but that requires some modifications to py-lmdb. As for testing, we can have https://hub.docker.com automatically rebuild the image whenever new code is pushed (see: https://docs.docker.com/docker-hub/builds/) and run Docker-based tests in Travis if you want: https://docs.travis-ci.com/user/docker/ |
|
Is there an issue filed somewhere for py-lmdb's failure to compile on Alpine Linux's gcc? |
slang800
commented
Sep 6, 2016
|
|
|
Oh, that explains it :-) |
slang800
commented
Sep 6, 2016
|
There isn't an issue filed on https://github.com/dw/py-lmdb/issues yet. |
igorbrigadir
commented
Sep 6, 2016
I haven't tried running grab-site, but it seems like installing py-lmdb works on
|
ivan
reviewed
Sep 6, 2016
| RUN pip3 install ./ | ||
| RUN apt-get purge -y build-essential | ||
| RUN apt-get autoremove -y | ||
| RUN apt-get clean |
ivan
Sep 6, 2016
Member
Someone tells me that each RUN creates a new layer, so the purge/autoremove/clean would not reduce the size of the final Docker image. What do you think about combining the RUNs on lines 6-12 into one RUN command?
mithrandi
Sep 6, 2016
"Someone" is me, in case additional clarification of this comment is needed :)
mithrandi
Sep 6, 2016
Another possibility is to use FROM python:3.4 instead of FROM python:3.4-slim; the non-slim variant is based off of buildpack-deps which has a lot of compilers / tools / libraries installed. The resulting total image size would be bigger, but the advantage is that the buildpack-deps portion would be shared with every other image based off of that, so in the usual case where you have several images, the total space usage would be lower.
slang800
Sep 6, 2016
Good point! I combined the commands & got the image size down to 235.4MB. I'm kinda surprised that images don't get flattened, but moby/moby#332 offers a lengthy discussion on it.
As for basing it on python:3.4, that would reduce the build time & total size of images on the system, but only if a significant number of the other images on the system are based on it too, which I don't think we can assume. It's probably better to just optimize for the smallest resulting image size.
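A minimal sketch of the combined-RUN idea from this thread, written as a shell snippet that saves a hypothetical Dockerfile to a scratch file. Only the `pip3 install`, `purge`, `autoremove`, and `clean` steps and the `python:3.4-slim` base come from this PR; the `COPY`/`WORKDIR` lines, the `build-essential` install step, and the file name `Dockerfile.sketch` are assumptions for illustration, not the PR's actual Dockerfile.

```bash
# Write a hypothetical single-RUN Dockerfile to a scratch file.
# Assumed for illustration: COPY/WORKDIR and the build-essential install.
cat > Dockerfile.sketch <<'EOF'
FROM python:3.4-slim
COPY . /app
WORKDIR /app
# One RUN means one layer: the compiler is installed, used, and removed
# before the layer is committed, so it does not add to the image size.
RUN apt-get update \
 && apt-get install -y build-essential \
 && pip3 install ./ \
 && apt-get purge -y build-essential \
 && apt-get autoremove -y \
 && apt-get clean
EOF

# Count the RUN instructions in the sketch.
run_count=$(grep -c '^RUN' Dockerfile.sketch)
echo "RUN instructions: $run_count"
```

With separate RUN lines, the layer created by the install step would still carry the compiler even after a later layer purges it, which is why the cleanup has to happen in the same instruction.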
slang800 force-pushed the slang800:patch-1 branch from 3debff9 to 73d2726 Sep 6, 2016
slang800 force-pushed the slang800:patch-1 branch 2 times, most recently from fb56712 to ce4d178 Sep 13, 2016
ivan
reviewed
Sep 13, 2016
| @@ -34,6 +34,7 @@ Note: grab-site currently **does not work with Python 3.5**; please use Python 3 | |||
| <!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE --> | |||
ivan
Sep 13, 2016
Member
This comment is a lie, sorry. I've been updating this TOC manually and probably don't want the Tips for specific websites expanded.
slang800
commented
Sep 13, 2016
|
You're right about it working on Alpine - I was just missing |
slang800 changed the title from "[wip] add Dockerfile to simplify installation" to "Add Dockerfile to simplify installation" Sep 13, 2016
slang800 force-pushed the slang800:patch-1 branch 2 times, most recently from ee5a78e to 3393d82 Sep 13, 2016
ivan
reviewed
Sep 13, 2016
| Start the grab-site server. You can set the port, volume, and name to whatever you want: | ||
|
|
||
| ```bash | ||
| docker run --detach -p 29000:29000 -v /home/ludios/download/grab-site-data:/data --name warcfactory slang800/grab-site |
slang800
Sep 13, 2016
docker requires an absolute path for mounts... I suppose I could do $(pwd)/grabs, if that's obvious to most users.
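A rough sketch of the `$(pwd)/grabs` idea, assuming the data directory sits under the current working directory. The port, container name, and image name are copied from the README example under review; the variable names `data_dir` and `run_cmd` are made up for illustration. The command is printed rather than executed so it can be checked first.

```bash
# Resolve the host directory to an absolute path before passing it to -v.
# "grabs" is the directory name suggested in the comment above.
data_dir="$(pwd)/grabs"

# Build the docker run command as a string and print it for inspection.
run_cmd="docker run --detach -p 29000:29000 -v $data_dir:/data --name warcfactory slang800/grab-site"
echo "$run_cmd"
```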
ivan
reviewed
Sep 13, 2016
| docker exec warcfactory grab-site --no-offsite-links --1 http://xkcd.com/ | ||
| ``` | ||
|
|
||
| The downloaded data, temp files, ignores list, and other configuration will be in a sub-directory of the mounted volume. In this case, `/home/ludios/download/grab-site-data/xkcd.com-2016-09-05-caf0a39c`. |
ivan
reviewed
Sep 13, 2016
| @@ -55,6 +56,27 @@ Note: grab-site currently **does not work with Python 3.5**; please use Python 3 | |||
|
|
|||
| <!-- END doctoc generated TOC please keep comment here to allow auto update --> | |||
|
|
|||
| Install with Docker | |||
| --- | |||
| Get the pre-built docker container: | |||
ivan
Sep 13, 2016
Member
I would really like the instructions to be so complete that people who don't have Docker installed yet (me included) know where to get it.
slang800
Sep 13, 2016
Eh, that depends on your package manager... for me it's pacman -S docker (on Arch), but it's apt-get install if you're on a Debian-based system, yum on RHEL/CentOS/Fedora, and I imagine there's something similar for OSX. Or it's pre-installed if you're on CoreOS.
ivan
Sep 13, 2016
Member
True, maybe just point to https://docs.docker.com/engine/installation/ if you consider that accurate.
|
Thanks for the fixes. I am currently somewhat busy and under-dockered, can a grab-site user please give the Docker instructions a try and see if they work? (And let me know if you had to perform any other steps to make this a useful setup?) |
igorbrigadir
commented
Sep 13, 2016
|
I had Docker already, but it's worth linking to the https://docs.docker.com/engine/installation/ instructions. I tried it with: Ran (sudo for docker commands because I skipped this step https://docs.docker.com/engine/installation/linux/ubuntulinux/#/create-a-docker-group):
Crawl finished successfully! |
slang800 force-pushed the slang800:patch-1 branch from 3393d82 to 23f9099 Sep 13, 2016
slang800 force-pushed the slang800:patch-1 branch from 23f9099 to 8b57b72 Sep 13, 2016
|
I tried this out, but couldn't find a way to attach a terminal to a Would adding tmux to the container and using tmux work? (Note, tmux 2.1 is broken; 1.8 is a known-good version.) I just hope that |
|
Also, running |
slang800
commented
Oct 20, 2016
Splitting up the server and client would make sense, especially since you could then run them on different machines, but I should probably do that as a separate PR, since I'll need to look into how they communicate.
Would using dumb-init as PID 1 allow the orphaned grab-site processes to keep running in the case where
You could run |
semente
commented
Nov 7, 2018
|
hey people! what is the status of this PR? I could give a hand. |
|
For now, I would like someone else to be the Dockerized grab-site upstream. I don't use Docker and I don't have the resources to 1) figure out if a PR is taking the right approach with Dockerization (which base? which init? one container per grab-site? how to integrate tmux, if needed?) 2) double my manual testing matrix. So, please, have at it and promote your fork/Dockerfile here. If you (or someone else) stays interested in maintaining and testing it, I might take a PR in the future. |
slang800 commented Sep 5, 2016
I'm deploying a couple instances of grab-site to a CoreOS cluster, so I made a Dockerfile... Hopefully this is a bit easier to use than pip/virtualenv. The reason why this uses the larger python:3.4-slim image (rather than python:3.4-alpine) is because Alpine had some issues compiling https://github.com/dw/py-lmdb with its version of gcc.
This PR still needs docs, so it's a work-in-progress right now.
After starting the container you can use the regular grab-site command via docker exec <container-name> grab-site <args and site url>
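
As a small illustration of the `docker exec` pattern described above, the following shell function forwards its arguments to grab-site inside a named container. The function name `grab_site_in` is hypothetical and not part of this PR; the example call assumes a running container named warcfactory, as in the README excerpt under review.

```bash
# Hypothetical convenience wrapper: run grab-site inside a named container.
# Usage: grab_site_in <container-name> <grab-site args and site url>
grab_site_in() {
    container="$1"
    shift
    # Everything after the container name is passed through unchanged.
    docker exec "$container" grab-site "$@"
}

# Example (assumes a running container named "warcfactory"):
#   grab_site_in warcfactory --no-offsite-links --1 http://xkcd.com/
```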